3. Clustering

4. Cluster Analysis

We can apply k-means clustering to identify which states are more developed or more deprived than others. Clustering is an unsupervised method that groups observations (here, states described by variables such as GDP per capita and life expectancy) into clusters of similar observations. K-means does this by assigning each state to one of k groups, so that each state is as close as possible to the centre of its own group.
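
To make the assignment idea concrete, here is a minimal sketch of the standard k-means iteration (Lloyd's algorithm) in pure Python. It is only an illustration, not the scikit-learn implementation used in this analysis: the points are made up, and the naive initialisation (taking the first k points as starting centres) is an assumption chosen to keep the example deterministic.

```python
# Minimal k-means (Lloyd's algorithm) in pure Python, for illustration only.
# The points and the naive initialisation are hypothetical.
from math import dist


def kmeans(points, k, n_iter=20):
    """Cluster `points` (a list of tuples) into k groups; return (labels, centroids)."""
    centroids = [points[i] for i in range(k)]  # naive init: first k points
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid
        labels = [
            min(range(k), key=lambda j: dist(p, centroids[j])) for p in points
        ]
        # Update step: each centroid moves to the mean of its members
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids


# Two visibly separate groups of made-up (x, y) observations
points = [(0.0, 0.1), (5.0, 5.1), (0.2, 0.0), (5.2, 4.9), (0.1, 0.3), (4.9, 5.0)]
labels, centroids = kmeans(points, k=2)
print(labels)
```

Points near the origin end up sharing one label and points near (5, 5) the other. The label numbers themselves carry no meaning; only the grouping does.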

Since the variables in our data set are measured on very different scales, we first standardise them so that no single variable dominates the distance calculations.

Code
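from sklearn.preprocessing import StandardScaler

# Standardise each variable to zero mean and unit variance
scaler = StandardScaler()
Code
us_rescaled_final_scaled = scaler.fit_transform(us_rescaled_final[selected_variables])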
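
As a rough illustration of what standardisation does, the sketch below applies the z-score formula, (x - mean) / standard deviation, by hand using only the standard library. The two columns are made-up values, not the project data. The population standard deviation is used because that matches how `StandardScaler` scales.

```python
# z-score standardisation by hand; both columns are made-up values
from statistics import fmean, pstdev


def standardise(column):
    """Return (x - mean) / std for each x, using the population std (ddof=0)."""
    mu = fmean(column)
    sigma = pstdev(column)
    return [(x - mu) / sigma for x in column]


gdp_per_capita = [48000.0, 63000.0, 74000.0, 83000.0]  # hypothetical, in dollars
poverty_rate = [0.17, 0.13, 0.11, 0.09]                # hypothetical, as fractions

z_gdp = standardise(gdp_per_capita)
z_pov = standardise(poverty_rate)

# After scaling, both columns are on the same footing: mean 0, std 1
print(fmean(z_gdp), pstdev(z_gdp))
print(fmean(z_pov), pstdev(z_pov))
```

Without this step, a variable measured in tens of thousands of dollars would swamp one measured as a fraction when distances between states are computed.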

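We then use the KneeLocator class from the kneed package to find a suitable number of clusters k. We fit k-means for each k from 2 to 9, record the inertia (the within-cluster sum of squared distances), and locate the knee of the resulting curve.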

Code
from sklearn.cluster import KMeans
from kneed import KneeLocator

# Candidate numbers of clusters to try out (k = 2, ..., 9)
n_clusters = list(range(2, 10))

# Run kmeans for each value of k
inertias = []
for k in n_clusters:
    
    # Initialize and run
    kmeans = KMeans(n_clusters=k, n_init=10)
    kmeans.fit(us_rescaled_final_scaled)
    
    # Save the "inertia"
    inertias.append(kmeans.inertia_)

# Initialize the knee algorithm
kn = KneeLocator(n_clusters, inertias, curve='convex', direction='decreasing')

# Print out the knee 
print(kn.knee)
6
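
To give some intuition for what a knee is, here is a simplified sketch: after rescaling both axes to [0, 1], it picks the point that lies farthest below the straight line joining the first and last points of the curve. This is not kneed's implementation, which differs in its details, and the inertia values below are made up for illustration rather than taken from our run.

```python
# Simplified elbow detection: the point farthest below the straight line that
# joins the first and last points of the normalised inertia curve.
# This is NOT kneed's implementation, and the inertia values are made up.


def find_elbow(ks, inertias):
    """Return the k at which a decreasing, convex curve bends most sharply."""
    x_min, x_max = ks[0], ks[-1]
    y_min, y_max = min(inertias), max(inertias)
    best_k, best_gap = None, float("-inf")
    for k, inertia in zip(ks, inertias):
        x = (k - x_min) / (x_max - x_min)        # rescale k to [0, 1]
        y = (inertia - y_min) / (y_max - y_min)  # rescale inertia to [0, 1]
        gap = (1 - x) - y                        # vertical gap below the chord
        if gap > best_gap:
            best_k, best_gap = k, gap
    return best_k


ks = list(range(2, 10))
hypothetical_inertias = [400, 300, 220, 160, 110, 100, 92, 85]
print(find_elbow(ks, hypothetical_inertias))
```

The idea is that beyond the knee, adding another cluster reduces inertia only marginally, so the extra complexity is hard to justify.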

We then run k-means with the selected value, k = 6. The results are summarised in the tables and map below.

Code
kmeans = KMeans(n_clusters=6, n_init=10)

# Perform the fit
kmeans.fit(us_rescaled_final_scaled)

# Extract the labels
us_rescaled_final['label'] = kmeans.labels_
Code
# Number of states per cluster
us_rescaled_final.groupby('label', as_index=False).size()
   label  size
0      0    16
1      1     9
2      2     1
3      3    14
4      4    10
5      5     1
Code
# Average feature per cluster
us_rescaled_final.groupby("label", as_index=False)[
    selected_variables
].mean().sort_values(by="label")
   label  REALGDPpercapita  life_expectancy       MedHHInc   PctBach  UnemploymentRate  LabForParticipationRate  Labor_Productivity_2023      TotalPop  PovertyRate      netexport
0      0      63641.287918        77.162500   79378.875000  0.217191          0.041852                 0.649131               104.720938  1.963924e+06     0.108652  -1.437363e+09
1      1      48835.305886        72.233333   60957.555556  0.164726          0.053548                 0.584285               105.058889  3.697484e+06     0.166678  -9.561401e+08
2      2     217272.523022        75.300000  106287.000000  0.260746          0.064270                 0.719794               114.806000  6.720790e+05     0.145306   1.553795e+08
3      3      63293.022605        75.364286   73875.857143  0.201638          0.051788                 0.628321               107.645143  1.219532e+07     0.128928  -4.489356e+10
4      4      73713.193650        78.180000   93235.800000  0.244063          0.043336                 0.672959               114.374100  5.703807e+06     0.093520  -1.743753e+10
5      5      82783.538426        78.300000   96334.000000  0.224029          0.063654                 0.638570               118.074000  3.924278e+07     0.119664  -2.715042e+11
Code
import hvplot.pandas
Code
# Map plot clusters
us_rescaled_final.hvplot(
    c="label",
    dynamic=False,
    width=1000,
    height=1000,
    geo=True,
    cmap="viridis",
    )

States in Cluster 1 appear to be the worst off economically across most of the 10 variables. Cluster 1 has the lowest mean real GDP per capita, life expectancy, median household income, share of residents with a bachelor’s degree, and labour force participation rate, and the highest mean poverty rate.
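
This kind of comparison can be read straight off the table of cluster means. The sketch below does so for a subset of the variables, using values copied from that table and rounded. The helper name `extreme_cluster` is hypothetical and introduced only for this illustration.

```python
# Cluster means copied (and rounded) from the table of averages per cluster.
# Only a subset of the variables is included.
variables = [
    "REALGDPpercapita", "life_expectancy", "MedHHInc",
    "PctBach", "LabForParticipationRate", "PovertyRate",
]
means = {
    0: (63641.29, 77.16, 79378.88, 0.2172, 0.6491, 0.1087),
    1: (48835.31, 72.23, 60957.56, 0.1647, 0.5843, 0.1667),
    2: (217272.52, 75.30, 106287.00, 0.2607, 0.7198, 0.1453),
    3: (63293.02, 75.36, 73875.86, 0.2016, 0.6283, 0.1289),
    4: (73713.19, 78.18, 93235.80, 0.2441, 0.6730, 0.0935),
    5: (82783.54, 78.30, 96334.00, 0.2240, 0.6386, 0.1197),
}


def extreme_cluster(variable, lowest=True):
    """Return the label of the cluster with the lowest (or highest) mean."""
    idx = variables.index(variable)
    pick = min if lowest else max
    return pick(means, key=lambda label: means[label][idx])


# Lowest cluster for every variable except the poverty rate
lowest = {v: extreme_cluster(v) for v in variables[:-1]}
highest_poverty = extreme_cluster("PovertyRate", lowest=False)

print(lowest)
print(highest_poverty)
```

With the values in the table, every one of these lookups returns cluster 1.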

Clusters 2 and 5 each contain a single observation: the District of Columbia and California, respectively. Both have high median household income, real GDP per capita, and labour productivity.

One limitation of clustering is its unsupervised nature: the clusters are formed without reference to any development outcome, so they may turn out to be unrelated to development or deprivation. Nonetheless, the cluster means in our table suggest that the classification is related to the level of economic development or deprivation.