3. Clustering

4. Cluster Analysis

We can apply k-means clustering to identify which states are more developed or more deprived than others. Clustering is an unsupervised method that groups observations (here, states described by variables such as GDP per capita and life expectancy) into clusters of similar members. K-means does this by assigning each state to one of k groups, so that states within a group lie as close as possible to that group's mean.
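To make the assignment idea concrete, here is a minimal sketch of the basic k-means loop (Lloyd's algorithm) in plain Python. It is not scikit-learn's implementation, which adds smarter initialisation and repeated restarts; the toy points and the starting centroids are hypothetical values chosen for illustration only.

```python
import math


def kmeans(points, centroids, n_iter=10):
    """Plain Lloyd's algorithm: alternate assignment and centroid update."""
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid
        labels = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                new_centroids.append(old)  # leave an empty cluster's centroid in place
        if new_centroids == centroids:
            break  # centroids stopped moving, so the assignment is stable
        centroids = new_centroids
    return labels, centroids


# Hypothetical standardised (GDP per capita, life expectancy) pairs for six states
toy_points = [(-1.0, -1.2), (-0.8, -0.9), (-1.1, -1.0), (1.0, 0.9), (1.2, 1.1), (0.9, 1.0)]
initial = [toy_points[0], toy_points[3]]  # start from two of the points

labels, centroids = kmeans(toy_points, initial)
print(labels)
print(centroids)
```

The two well-separated groups of toy points end up in different clusters. With real data the result depends on the starting centroids, which is why several restarts are commonly used.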

Since our variables are measured on very different scales, we first standardise them so that no single variable dominates the distance calculations.

Code

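from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

Code

# Fit on the selected variables and rescale each to zero mean and unit variance
us_rescaled_final_scaled = scaler.fit_transform(us_rescaled_final[selected_variables])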
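As a rough illustration of what standardisation does to each column, the sketch below computes z-scores by hand using the population standard deviation, which is the convention StandardScaler follows. The input values are hypothetical and only meant to show two variables of very different magnitude.

```python
from statistics import fmean, pstdev


def standardise(values):
    """Return z-scores: (x - mean) / population standard deviation."""
    mu = fmean(values)
    sigma = pstdev(values)
    return [(x - mu) / sigma for x in values]


# Hypothetical values, chosen only to show the difference in magnitude
gdp_per_capita = [48000.0, 63000.0, 74000.0, 83000.0]
poverty_rate = [0.17, 0.11, 0.09, 0.12]

z_gdp = standardise(gdp_per_capita)
z_pov = standardise(poverty_rate)

# After scaling, both columns are centred on 0 with unit spread
print([round(z, 2) for z in z_gdp])
print([round(z, 2) for z in z_pov])
```

Without this step, a variable measured in tens of thousands (or in billions, such as net exports) would dominate the Euclidean distances that k-means relies on.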

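We then use the KneeLocator class from the kneed package to find a suitable number of clusters k. We fit k-means for a range of k values, record the inertia (the within-cluster sum of squared distances) of each fit, and locate the knee of the resulting curve.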

Code

from kneed import KneeLocator
from sklearn.cluster import KMeans

# Number of clusters to try out (k = 2, ..., 9)
n_clusters = list(range(2, 10))

# Run kmeans for each value of k
inertias = []
for k in n_clusters:
    
    # Initialize and run
    kmeans = KMeans(n_clusters=k, n_init=10)
    kmeans.fit(us_rescaled_final_scaled)
    
    # Save the "inertia"
    inertias.append(kmeans.inertia_)

# Initialize the knee algorithm
kn = KneeLocator(n_clusters, inertias, curve='convex', direction='decreasing')

# Print out the knee 
print(kn.knee)
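The idea behind a knee (or elbow) can be sketched with a much simpler heuristic than the one kneed implements: after rescaling both axes to [0, 1], pick the k whose point lies furthest below the straight line joining the first and last points of the curve. The function name `find_elbow` and the inertia values are hypothetical, and the sketch assumes the inertias decrease from the first k to the last.

```python
def find_elbow(ks, inertias):
    """Return the k whose point lies furthest below the chord of the curve."""
    # Normalise both axes to [0, 1] so the result does not depend on units
    x = [(k - ks[0]) / (ks[-1] - ks[0]) for k in ks]
    y_min, y_max = min(inertias), max(inertias)
    y = [(v - y_min) / (y_max - y_min) for v in inertias]
    # For a decreasing curve the chord runs from (0, 1) to (1, 0), i.e. y = 1 - x
    gaps = [(1 - xi) - yi for xi, yi in zip(x, y)]
    return ks[gaps.index(max(gaps))]


# Hypothetical inertias that fall steeply up to k = 4 and flatten afterwards
ks = [2, 3, 4, 5, 6, 7]
inertias = [400.0, 250.0, 120.0, 100.0, 85.0, 75.0]
print(find_elbow(ks, inertias))
```

Because k-means starts from random centroids, the inertias, and therefore the detected knee, can vary slightly between runs unless a fixed `random_state` is passed to `KMeans`.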

We then run k-means with the chosen number of clusters (k = 6 in the code below). The results are summarised in the tables and map below.

Code

kmeans = KMeans(n_clusters=6, n_init=10)

# Perform the fit
kmeans.fit(us_rescaled_final_scaled)

# Extract the labels
us_rescaled_final['label'] = kmeans.labels_

Code

# Number of states per cluster
us_rescaled_final.groupby('label', as_index=False).size()

	label	size
0	0	16
1	1	9
2	2	1
3	3	14
4	4	10
5	5	1

Code

# Average feature per cluster
us_rescaled_final.groupby("label", as_index=False)[
    selected_variables
].mean().sort_values(by="label")

	label	REALGDPpercapita	life_expectancy	MedHHInc	PctBach	UnemploymentRate	LabForParticipationRate	Labor_Productivity_2023	TotalPop	PovertyRate	netexport
0	0	63641.287918	77.162500	79378.875000	0.217191	0.041852	0.649131	104.720938	1.963924e+06	0.108652	-1.437363e+09
1	1	48835.305886	72.233333	60957.555556	0.164726	0.053548	0.584285	105.058889	3.697484e+06	0.166678	-9.561401e+08
2	2	217272.523022	75.300000	106287.000000	0.260746	0.064270	0.719794	114.806000	6.720790e+05	0.145306	1.553795e+08
3	3	63293.022605	75.364286	73875.857143	0.201638	0.051788	0.628321	107.645143	1.219532e+07	0.128928	-4.489356e+10
4	4	73713.193650	78.180000	93235.800000	0.244063	0.043336	0.672959	114.374100	5.703807e+06	0.093520	-1.743753e+10
5	5	82783.538426	78.300000	96334.000000	0.224029	0.063654	0.638570	118.074000	3.924278e+07	0.119664	-2.715042e+11

Code

# Importing hvplot.pandas registers the .hvplot accessor on DataFrames
import hvplot.pandas

Code

# Map plot clusters
us_rescaled_final.hvplot(
    c="label",
    dynamic=False,
    width=1000,
    height=1000,
    geo=True,
    cmap="viridis",
    )

States in Cluster 1 appear to be the worst off economically across most of the 10 variables. Cluster 1 has the lowest mean real GDP per capita, life expectancy, median household income, share of residents with a bachelor's degree, and labour force participation rate, as well as the highest mean poverty rate.
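One way to check which cluster sits at the bottom or top of each variable is to compute the extremes directly from the cluster means. The sketch below uses values copied (and rounded) from the table above for six of the variables; the dictionary layout is an illustrative choice, not part of the original analysis.

```python
# Cluster means copied (rounded) from the table above
columns = [
    "REALGDPpercapita", "life_expectancy", "MedHHInc",
    "PctBach", "LabForParticipationRate", "PovertyRate",
]
rows = {
    0: [63641, 77.16, 79379, 0.217, 0.649, 0.109],
    1: [48835, 72.23, 60958, 0.165, 0.584, 0.167],
    2: [217273, 75.30, 106287, 0.261, 0.720, 0.145],
    3: [63293, 75.36, 73876, 0.202, 0.628, 0.129],
    4: [73713, 78.18, 93236, 0.244, 0.673, 0.094],
    5: [82784, 78.30, 96334, 0.224, 0.639, 0.120],
}
cluster_means = {label: dict(zip(columns, values)) for label, values in rows.items()}

# For each variable, find the cluster with the lowest and the highest mean
lowest = {v: min(cluster_means, key=lambda c: cluster_means[c][v]) for v in columns}
highest = {v: max(cluster_means, key=lambda c: cluster_means[c][v]) for v in columns}

for v in columns:
    print(f"{v}: lowest in cluster {lowest[v]}, highest in cluster {highest[v]}")
```

In the notebook itself, the same check could be run on the grouped DataFrame rather than on hand-copied numbers.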

Clusters 2 and 5 each contain only one member: the District of Columbia and California, respectively. Both have high median household income, real GDP per capita, and labour productivity.

One limitation of clustering is its unsupervised nature: the algorithm groups states by overall similarity across all variables, so the resulting clusters need not correspond to development or deprivation. Nonetheless, the cluster means in our table suggest that the groupings are related to the level of economic development or deprivation.