We can apply k-means clustering to try to identify which states are more developed or more deprived than others. Clustering is an unsupervised technique that groups observations with similar patterns (here, states with similar GDP per capita, life expectancy, and so on) into clusters; k-means does this by partitioning the states into k groups.
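To make the idea concrete, here is a minimal pure-Python sketch of the basic k-means loop (Lloyd's algorithm): assign each point to its nearest centroid, then move each centroid to the mean of its members. The function `kmeans_sketch` and the toy points are hypothetical illustrations, not part of the analysis; scikit-learn's `KMeans` follows the same idea but uses a more careful initialisation and repeats the run `n_init` times.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_sketch(points, k, n_iter=20, seed=0):
    """Simplified Lloyd's algorithm: alternate assignment and centroid update."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct observations
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid
        labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[j] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids

# Hypothetical, already-standardised toy observations forming two obvious groups
toy_points = [(-1.0, -1.1), (-0.9, -1.0), (-1.1, -0.9),
              (1.0, 1.1), (0.9, 1.0), (1.1, 0.9)]
toy_labels, toy_centroids = kmeans_sketch(toy_points, k=2)
```

The label numbers themselves are arbitrary: which group is called 0 and which is called 1 depends on the random starting centroids.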
Since our variables are measured on very different scales, we first standardise them.
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
us_rescaled_final_scaled = scaler.fit_transform(us_rescaled_final[selected_variables])
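As a rough illustration of what standardisation does to each column, the sketch below rescales two toy variables to zero mean and unit variance using the population standard deviation, which is the same convention `StandardScaler` uses. The helper `standardise` and the toy values are hypothetical and are not taken from the data set.

```python
from statistics import fmean, pstdev

def standardise(column):
    """Rescale a list of numbers to zero mean and unit (population) variance."""
    mu = fmean(column)
    sigma = pstdev(column)
    return [(x - mu) / sigma for x in column]

# Hypothetical toy values on very different scales
gdp_per_capita = [48000.0, 63000.0, 74000.0, 83000.0]
poverty_rate = [0.17, 0.11, 0.09, 0.12]

gdp_z = standardise(gdp_per_capita)
poverty_z = standardise(poverty_rate)
```

After rescaling, both variables contribute on a comparable scale to the Euclidean distances that k-means relies on, so a variable such as GDP per capita no longer dominates simply because its raw numbers are larger.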
We then use the kneed package to locate the knee (elbow) of the inertia curve, which gives a suitable number of clusters k.
from sklearn.cluster import KMeans
from kneed import KneeLocator

# Number of clusters to try out
n_clusters = list(range(2, 10))

# Run kmeans for each value of k
inertias = []
for k in n_clusters:
    # Initialize and run
    kmeans = KMeans(n_clusters=k, n_init=10)
    kmeans.fit(us_rescaled_final_scaled)
    # Save the "inertia"
    inertias.append(kmeans.inertia_)

# Initialize the knee algorithm
kn = KneeLocator(n_clusters, inertias, curve='convex', direction='decreasing')

# Print out the knee
print(kn.knee)
6
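The "inertia" saved in the loop above is the within-cluster sum of squared distances from each observation to its cluster centre. The sketch below computes it by hand on hypothetical one-dimensional toy data to show why it drops sharply once k matches the natural grouping; the function `inertia` is an illustrative helper, not the scikit-learn implementation.

```python
def inertia(points, labels, centroids):
    """Within-cluster sum of squared distances to each point's assigned centroid."""
    total = 0.0
    for p, lab in zip(points, labels):
        total += sum((a - b) ** 2 for a, b in zip(p, centroids[lab]))
    return total

# Hypothetical toy data: two tight groups on a line
toy_points = [(0.0,), (1.0,), (10.0,), (11.0,)]

# k = 1: a single centroid at the overall mean (5.5)
inertia_k1 = inertia(toy_points, [0, 0, 0, 0], [(5.5,)])
# k = 2: one centroid per group (0.5 and 10.5)
inertia_k2 = inertia(toy_points, [0, 0, 1, 1], [(0.5,), (10.5,)])
```

Inertia always decreases as k grows, so the smallest value is not informative on its own. The knee is the point after which adding further clusters yields only small reductions.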
We then run k-means with the optimal value, k = 6. The results are shown in the tables and map below.
kmeans = KMeans(n_clusters=6, n_init=10)
# Perform the fit
kmeans.fit(us_rescaled_final_scaled)
# Extract the labels
us_rescaled_final['label'] = kmeans.labels_

# Number of states per cluster
us_rescaled_final.groupby('label', as_index=False).size()
| | label | size |
|---|---|---|
| 0 | 0 | 16 |
| 1 | 1 | 9 |
| 2 | 2 | 1 |
| 3 | 3 | 14 |
| 4 | 4 | 10 |
| 5 | 5 | 1 |
# Average feature per cluster
us_rescaled_final.groupby("label", as_index=False)[
    selected_variables
].mean().sort_values(by="label")
| | label | REALGDPpercapita | life_expectancy | MedHHInc | PctBach | UnemploymentRate | LabForParticipationRate | Labor_Productivity_2023 | TotalPop | PovertyRate | netexport |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 63641.287918 | 77.162500 | 79378.875000 | 0.217191 | 0.041852 | 0.649131 | 104.720938 | 1.963924e+06 | 0.108652 | -1.437363e+09 |
| 1 | 1 | 48835.305886 | 72.233333 | 60957.555556 | 0.164726 | 0.053548 | 0.584285 | 105.058889 | 3.697484e+06 | 0.166678 | -9.561401e+08 |
| 2 | 2 | 217272.523022 | 75.300000 | 106287.000000 | 0.260746 | 0.064270 | 0.719794 | 114.806000 | 6.720790e+05 | 0.145306 | 1.553795e+08 |
| 3 | 3 | 63293.022605 | 75.364286 | 73875.857143 | 0.201638 | 0.051788 | 0.628321 | 107.645143 | 1.219532e+07 | 0.128928 | -4.489356e+10 |
| 4 | 4 | 73713.193650 | 78.180000 | 93235.800000 | 0.244063 | 0.043336 | 0.672959 | 114.374100 | 5.703807e+06 | 0.093520 | -1.743753e+10 |
| 5 | 5 | 82783.538426 | 78.300000 | 96334.000000 | 0.224029 | 0.063654 | 0.638570 | 118.074000 | 3.924278e+07 | 0.119664 | -2.715042e+11 |
import hvplot.pandas

# Map plot clusters
us_rescaled_final.hvplot(
    c="label",
    dynamic=False,
    width=1000,
    height=1000,
    geo=True,
    cmap="viridis",
)
States in Cluster 1 appear to be the worst off economically across most of the 10 variables compared to other clusters. This cluster has the lowest mean real GDP per capita, life expectancy, median household income, share of bachelor's degree holders, and labour force participation rate, and the highest mean poverty rate.
Clusters 2 and 5 each contain a single state: the District of Columbia and California, respectively. Both clusters have high median household income, real GDP per capita, and labor productivity.
One limitation of clustering is its unsupervised nature: the algorithm groups states by similarity alone, so the resulting clusters need not correspond to development or deprivation. Nonetheless, the cluster averages in our table suggest that the groups do reflect the level of economic development/deprivation.