We can apply k-means clustering to try to identify which states are more developed or more deprived than others. Clustering is an unsupervised classification of patterns (e.g., GDP per capita and life expectancy across states) into groups (clusters) of similar observations, and k-means does this by partitioning the states into k groups.
Since the variables in our data set are measured on very different scales, we first standardise them.
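As a rough illustration of what standardisation does, the sketch below rescales two made-up variables to z-scores (subtract the mean, divide by the population standard deviation, which is the convention `StandardScaler` follows). The numbers and the helper name `standardise` are assumptions for illustration only and are not taken from the data set.

```python
from statistics import mean, pstdev


def standardise(values):
    """Return z-scores: (x - mean) / population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population std (ddof=0), as StandardScaler uses
    return [(x - mu) / sigma for x in values]


# Made-up illustrative values, NOT taken from the data set
income = [50_000.0, 60_000.0, 70_000.0, 80_000.0]
unemployment = [0.04, 0.05, 0.06, 0.07]

income_z = standardise(income)
unemployment_z = standardise(unemployment)

# After rescaling, both variables have mean 0 and standard deviation 1,
# so neither dominates the distance calculations used by k-means.
print(income_z)
print(unemployment_z)
```

Without this step, a variable measured in tens of thousands would swamp one measured as a fraction when distances between states are computed.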
scaler = StandardScaler()
us_rescaled_final_scaled = scaler.fit_transform(us_rescaled_final[selected_variables])
We then use the kneed package to find an optimal number of clusters, k.
# Number of clusters to try out
n_clusters = list(range(2, 10))
# Run kmeans for each value of k
inertias = []
for k in n_clusters:
    # Initialize and run
    kmeans = KMeans(n_clusters=k, n_init=10)
    kmeans.fit(us_rescaled_final_scaled)
    # Save the "inertia"
    inertias.append(kmeans.inertia_)
# Initialize the knee algorithm
kn = KneeLocator(n_clusters, inertias, curve='convex', direction='decreasing')
# Print out the knee
print(kn.knee)
6
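For intuition about the "inertia" saved in the loop above: it is the sum of squared distances from each observation to its nearest cluster centre. It always falls as k increases, and the knee is the point where adding more clusters stops reducing it by much. The sketch below computes inertia by hand for a made-up set of points and hand-picked centres; the function name `inertia` and all values are hypothetical and chosen only to show the calculation.

```python
def inertia(points, centres):
    """Sum of squared Euclidean distances from each point to its nearest centre."""
    total = 0.0
    for p in points:
        # Squared distance to the closest centre
        total += min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centres)
    return total


# Made-up points forming two obvious groups (illustration only)
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

one_centre = [(5.0, 1.0)]                # k = 1: a single centre in the middle
two_centres = [(0.0, 1.0), (10.0, 1.0)]  # k = 2: one centre per group

print(inertia(points, one_centre))
print(inertia(points, two_centres))
```

Moving from one centre to two cuts the inertia sharply here because the points really do form two groups; once k matches the structure in the data, further increases give much smaller gains.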
We then run k-means clustering with the optimal value of k. The results are shown in the tables and map below.
kmeans = KMeans(n_clusters=6, n_init=10)
# Perform the fit
kmeans.fit(us_rescaled_final_scaled)
# Extract the labels
us_rescaled_final['label'] = kmeans.labels_
# Number of states per cluster
us_rescaled_final.groupby('label', as_index=False).size()
| | label | size |
|---|---|---|
| 0 | 0 | 16 |
| 1 | 1 | 9 |
| 2 | 2 | 1 |
| 3 | 3 | 14 |
| 4 | 4 | 10 |
| 5 | 5 | 1 |
# Average feature per cluster
us_rescaled_final.groupby("label", as_index=False)[
    selected_variables
].mean().sort_values(by="label")
| | label | REALGDPpercapita | life_expectancy | MedHHInc | PctBach | UnemploymentRate | LabForParticipationRate | Labor_Productivity_2023 | TotalPop | PovertyRate | netexport |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 63641.287918 | 77.162500 | 79378.875000 | 0.217191 | 0.041852 | 0.649131 | 104.720938 | 1.963924e+06 | 0.108652 | -1.437363e+09 |
| 1 | 1 | 48835.305886 | 72.233333 | 60957.555556 | 0.164726 | 0.053548 | 0.584285 | 105.058889 | 3.697484e+06 | 0.166678 | -9.561401e+08 |
| 2 | 2 | 217272.523022 | 75.300000 | 106287.000000 | 0.260746 | 0.064270 | 0.719794 | 114.806000 | 6.720790e+05 | 0.145306 | 1.553795e+08 |
| 3 | 3 | 63293.022605 | 75.364286 | 73875.857143 | 0.201638 | 0.051788 | 0.628321 | 107.645143 | 1.219532e+07 | 0.128928 | -4.489356e+10 |
| 4 | 4 | 73713.193650 | 78.180000 | 93235.800000 | 0.244063 | 0.043336 | 0.672959 | 114.374100 | 5.703807e+06 | 0.093520 | -1.743753e+10 |
| 5 | 5 | 82783.538426 | 78.300000 | 96334.000000 | 0.224029 | 0.063654 | 0.638570 | 118.074000 | 3.924278e+07 | 0.119664 | -2.715042e+11 |
import hvplot.pandas
# Map plot clusters
us_rescaled_final.hvplot(
    c="label",
    dynamic=False,
    width=1000,
    height=1000,
    geo=True,
    cmap="viridis",
)
States in Cluster 1 appear to be the worst off economically across most of the 10 variables. Cluster 1 has the lowest mean real GDP per capita, life expectancy, median household income, share of residents with a bachelor's degree, and labour force participation rate, and the highest mean poverty rate.
Clusters 2 and 5 each contain a single state: the District of Columbia and California, respectively. Both clusters have high median household income, real GDP per capita, and labour productivity.
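Comparisons like these can be checked directly against the table of cluster means. The sketch below copies a few columns from that table (rounded) into a plain dictionary and looks up which cluster has the lowest or highest mean for a given variable. The helper name `extreme_cluster` is hypothetical, and the cluster labels apply only to the run shown in the table, since k-means labels can change between runs.

```python
# Cluster means copied (rounded) from the table of average features per cluster
cluster_means = {
    0: {"REALGDPpercapita": 63641.29, "life_expectancy": 77.16, "MedHHInc": 79378.88, "PovertyRate": 0.1087},
    1: {"REALGDPpercapita": 48835.31, "life_expectancy": 72.23, "MedHHInc": 60957.56, "PovertyRate": 0.1667},
    2: {"REALGDPpercapita": 217272.52, "life_expectancy": 75.30, "MedHHInc": 106287.00, "PovertyRate": 0.1453},
    3: {"REALGDPpercapita": 63293.02, "life_expectancy": 75.36, "MedHHInc": 73875.86, "PovertyRate": 0.1289},
    4: {"REALGDPpercapita": 73713.19, "life_expectancy": 78.18, "MedHHInc": 93235.80, "PovertyRate": 0.0935},
    5: {"REALGDPpercapita": 82783.54, "life_expectancy": 78.30, "MedHHInc": 96334.00, "PovertyRate": 0.1197},
}


def extreme_cluster(variable, highest=False):
    """Return the label of the cluster with the lowest (or highest) mean for a variable."""
    pick = max if highest else min
    return pick(cluster_means, key=lambda label: cluster_means[label][variable])


for variable in ["REALGDPpercapita", "life_expectancy", "MedHHInc"]:
    print(variable, "lowest in cluster", extreme_cluster(variable))
print("PovertyRate highest in cluster", extreme_cluster("PovertyRate", highest=True))
```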
One limitation of clustering is its unsupervised nature: the algorithm groups states by overall similarity, so the resulting clusters are not guaranteed to reflect development or deprivation. Nonetheless, the cluster means in our table suggest that the groupings are related to the level of economic development or deprivation.