5. Predictive Modelling

With these results, policymakers might be interested in which variables are key predictors of GDP per capita, life expectancy, median household income, and education levels. We apply a random forest model, an ensemble of decision trees, to identify these key predictors. First, we select the variables we want to model.

Code
# Data handling
import numpy as np
import pandas as pd

# Models
from sklearn.ensemble import RandomForestRegressor

# Model selection
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV

# Pipelines
from sklearn.pipeline import make_pipeline

# Preprocessing
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
Code
selected_variables
['REALGDPpercapita',
 'life_expectancy',
 'MedHHInc',
 'PctBach',
 'UnemploymentRate',
 'LabForParticipationRate',
 'Labor_Productivity_2023',
 'TotalPop',
 'PovertyRate',
 'netexport']
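
If this list has not already been created in an earlier section, it can be defined explicitly from the values echoed above:

Code
selected_variables = [
    'REALGDPpercapita',
    'life_expectancy',
    'MedHHInc',
    'PctBach',
    'UnemploymentRate',
    'LabForParticipationRate',
    'Labor_Productivity_2023',
    'TotalPop',
    'PovertyRate',
    'netexport',
]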
Code
usa_modelling = us_rescaled_final[selected_variables + ["geometry"]].dropna()

We then split the data into training and test sets.

Code
# Split the data 70/30
train_set, test_set = train_test_split(usa_modelling, test_size=0.3, random_state=42)

We then run the predictive model for each dependent variable (Sections 5.1 to 5.4). For each model, a bar plot ranks the independent variables by their importance in predicting the dependent variable.
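
Since the same steps repeat for each target, they could be wrapped in a helper function; the sketch below is a hypothetical refactoring, not part of the original notebook, and assumes the imports and the train/test split from above.

Code
# Hypothetical helper wrapping the repeated steps in 5.1 to 5.4:
# build a pipeline, cross-validate, fit, score, and collect importances.
def fit_forest_for_target(target, feature_cols, train_set, test_set):
    pipeline = make_pipeline(
        StandardScaler(), RandomForestRegressor(n_estimators=100, random_state=42)
    )

    # Log-transform the target, as in the sections below
    y_train = np.log(train_set[target])
    y_test = np.log(test_set[target])
    X_train = train_set[feature_cols].values
    X_test = test_set[feature_cols].values

    # 10-fold cross-validation on the training data
    scores = cross_val_score(pipeline, X_train, y_train, cv=10)

    # Fit on the training data and score on the held-out test set
    pipeline.fit(X_train, y_train)
    test_score = pipeline.score(X_test, y_test)

    # Feature importances from the fitted forest
    forest = pipeline["randomforestregressor"]
    importances = pd.DataFrame(
        {"Feature": feature_cols, "Importance": forest.feature_importances_}
    ).sort_values("Importance")
    return scores, test_score, importances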

5.1 GDP per capita against variables

Code
# the target labels: log of REALGDPpercapita
y_GDPtrain = np.log(train_set["REALGDPpercapita"])
y_GDPtest = np.log(test_set["REALGDPpercapita"])
Code
# The features
GDPfeature_cols = [
    'life_expectancy',
    'MedHHInc',
    'PctBach',
    'UnemploymentRate',
    'LabForParticipationRate',
    'Labor_Productivity_2023',
    'TotalPop',
    'PovertyRate',
    'netexport',
]
X_GDPtrain = train_set[GDPfeature_cols].values
X_GDPtest = test_set[GDPfeature_cols].values
Code
# Make a random forest pipeline
forest_pipeline = make_pipeline(
    StandardScaler(), RandomForestRegressor(n_estimators=100, random_state=42)
)

# Run the 10-fold cross validation
GDPscores = cross_val_score(
    forest_pipeline,
    X_GDPtrain,
    y_GDPtrain,
    cv=10,
)

# Report
print("R^2 scores = ", GDPscores)
print("Scores mean = ", GDPscores.mean())
print("Score std dev = ", GDPscores.std())
R^2 scores =  [ 0.42875078  0.7136654  -4.31496545  0.76525778  0.57269267 -3.96447004
  0.10651717 -0.45416047 -3.25476358  0.60702265]
Scores mean =  -0.8794453102356599
Score std dev =  1.9846415566046676
Code
# Fit on the training data
forest_pipeline.fit(X_GDPtrain, y_GDPtrain)

# What's the test score?
forest_pipeline.score(X_GDPtest, y_GDPtest)
0.5397527625306848
Code
# Extract the regressor from the pipeline
forest_model = forest_pipeline["randomforestregressor"]
Code
import hvplot.pandas
Code
# Create the data frame of importances
importanceGDP = pd.DataFrame(
    {"Feature": GDPfeature_cols, "Importance": forest_model.feature_importances_}
).sort_values("Importance")


importanceGDP.hvplot.barh(x="Feature", y="Importance")

We find that the variables that matter most for real GDP per capita are median household income and labor force participation rate. The test score is 0.54, meaning the model explains 54% of the variance in real GDP per capita on held-out data. Note, however, that the cross-validation scores above vary widely and average below zero, so the model's performance is unstable across folds.
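
Impurity-based importances from a random forest can be biased toward correlated features, so as a cross-check (not part of the original analysis) one could also compute permutation importance on the test set:

Code
# Optional cross-check: permutation importance on the held-out test set
from sklearn.inspection import permutation_importance

perm = permutation_importance(
    forest_pipeline, X_GDPtest, y_GDPtest, n_repeats=10, random_state=42
)
pd.DataFrame(
    {"Feature": GDPfeature_cols, "Importance": perm.importances_mean}
).sort_values("Importance", ascending=False)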

5.2 Life Expectancy against variables

Code
# the target labels: log of life_expectancy
y_LEtrain = np.log(train_set["life_expectancy"])
y_LEtest = np.log(test_set["life_expectancy"])
Code
# The features
LEfeature_cols = [
    'REALGDPpercapita',
    'MedHHInc',
    'PctBach',
    'UnemploymentRate',
    'LabForParticipationRate',
    'Labor_Productivity_2023',
    'TotalPop',
    'PovertyRate',
    'netexport',
]
X_LEtrain = train_set[LEfeature_cols].values
X_LEtest = test_set[LEfeature_cols].values
Code
# Make a random forest pipeline
LEforest_pipeline = make_pipeline(
    StandardScaler(), RandomForestRegressor(n_estimators=100, random_state=42)
)

# Run the 10-fold cross validation
LEscores = cross_val_score(
    LEforest_pipeline,
    X_LEtrain,
    y_LEtrain,
    cv=10,
)

# Report
print("R^2 scores = ", LEscores)
print("Scores mean = ", LEscores.mean())
print("Score std dev = ", LEscores.std())
R^2 scores =  [ 6.95094617e-01  8.85361627e-01  5.46261824e-01  9.60021456e-01
  9.51224933e-01  8.97905382e-01  4.45589520e-03  3.75819263e-01
 -1.72440959e+01  7.90774467e-01]
Scores mean =  -1.1137176482882758
Score std dev =  5.384420926046987
Code
# Fit on the training data
LEforest_pipeline.fit(X_LEtrain, y_LEtrain)

# What's the test score?
LEforest_pipeline.score(X_LEtest, y_LEtest)
0.7472553097289523
Code
# Extract the regressor from the pipeline
LEforest_model = LEforest_pipeline["randomforestregressor"]
Code
# Create the data frame of importances
importanceLE = pd.DataFrame(
    {"Feature": LEfeature_cols, "Importance": LEforest_model.feature_importances_}
).sort_values("Importance")


importanceLE.hvplot.barh(x="Feature", y="Importance")

We find that the variables that matter most for life expectancy are the percentage of bachelor’s degree graduates, median household income, and poverty rate. The test score is 0.75, meaning the model explains 75% of the variance in life expectancy.
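
GridSearchCV was imported above but never used; one natural extension would be to tune the forest’s hyperparameters. The grid below is purely illustrative:

Code
# Illustrative hyperparameter search over the random forest step
param_grid = {
    "randomforestregressor__n_estimators": [100, 300],
    "randomforestregressor__max_depth": [None, 5, 10],
}
grid = GridSearchCV(LEforest_pipeline, param_grid, cv=10)
grid.fit(X_LEtrain, y_LEtrain)
print("Best parameters = ", grid.best_params_)
print("Best CV score = ", grid.best_score_)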

5.3 Median Household Income against variables

Code
# the target labels: log of MedHHInc
y_HHINCtrain = np.log(train_set["MedHHInc"])
y_HHINCtest = np.log(test_set["MedHHInc"])
Code
# The features
HHINCfeature_cols = [
    'REALGDPpercapita',
    'life_expectancy',
    'PctBach',
    'UnemploymentRate',
    'LabForParticipationRate',
    'Labor_Productivity_2023',
    'TotalPop',
    'PovertyRate',
    'netexport',
]
X_HHINCtrain = train_set[HHINCfeature_cols].values
X_HHINCtest = test_set[HHINCfeature_cols].values
Code
# Make a random forest pipeline
HHINCforest_pipeline = make_pipeline(
    StandardScaler(), RandomForestRegressor(n_estimators=100, random_state=42)
)

# Run the 10-fold cross validation
HHINCscores = cross_val_score(
    HHINCforest_pipeline,
    X_HHINCtrain,
    y_HHINCtrain,
    cv=10,
)

# Report
print("R^2 scores = ", HHINCscores)
print("Scores mean = ", HHINCscores.mean())
print("Score std dev = ", HHINCscores.std())
R^2 scores =  [ 0.87021441  0.79626634 -1.81343269  0.94035599  0.91059592  0.85661744
  0.78405563 -0.84718815  0.59170603  0.44849992]
Scores mean =  0.35376908235570953
Score std dev =  0.8808696066587679
Code
# Fit on the training data
HHINCforest_pipeline.fit(X_HHINCtrain, y_HHINCtrain)

# What's the test score?
HHINCforest_pipeline.score(X_HHINCtest, y_HHINCtest)
0.8226485614034753
Code
# Extract the regressor from the pipeline
HHINCforest_model = HHINCforest_pipeline["randomforestregressor"]
Code
# Create the data frame of importances
importanceHHINC = pd.DataFrame(
    {"Feature": HHINCfeature_cols, "Importance": HHINCforest_model.feature_importances_}
).sort_values("Importance")


importanceHHINC.hvplot.barh(x="Feature", y="Importance")

We find that the variables that matter most for median household income are the percentage of bachelor’s degree graduates, life expectancy, real GDP per capita, and poverty rate. The test score is 0.82, meaning the model explains 82% of the variance in median household income.
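
The R² test score can be complemented with an error metric; for instance (a supplementary check, not in the original), the root-mean-squared error of the predicted log incomes:

Code
# Supplementary error metric on the test set (log scale)
from sklearn.metrics import mean_squared_error

y_HHINCpred = HHINCforest_pipeline.predict(X_HHINCtest)
rmse = np.sqrt(mean_squared_error(y_HHINCtest, y_HHINCpred))
print("RMSE (log MedHHInc) = ", rmse)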

5.4 Education levels

Code
# the target labels: log of PctBach
y_PctBachtrain = np.log(train_set["PctBach"])
y_PctBachtest = np.log(test_set["PctBach"])
Code
# The features
PctBachfeature_cols = [
    'REALGDPpercapita',
    'life_expectancy',
    'MedHHInc',
    'UnemploymentRate',
    'LabForParticipationRate',
    'Labor_Productivity_2023',
    'TotalPop',
    'PovertyRate',
    'netexport',
]
X_PctBachtrain = train_set[PctBachfeature_cols].values
X_PctBachtest = test_set[PctBachfeature_cols].values
Code
# Make a random forest pipeline
PctBachforest_pipeline = make_pipeline(
    StandardScaler(), RandomForestRegressor(n_estimators=100, random_state=42)
)

# Run the 10-fold cross validation
PctBachscores = cross_val_score(
    PctBachforest_pipeline,
    X_PctBachtrain,
    y_PctBachtrain,
    cv=10,
)

# Report
print("R^2 scores = ", PctBachscores)
print("Scores mean = ", PctBachscores.mean())
print("Score std dev = ", PctBachscores.std())
R^2 scores =  [ 0.75033831  0.78667313  0.43620382  0.86629284  0.79880008  0.81455566
  0.08470399 -4.21484583 -0.12530893  0.81489821]
Scores mean =  0.1012311282401515
Score std dev =  1.4755868640441185
Code
# Fit on the training data
PctBachforest_pipeline.fit(X_PctBachtrain, y_PctBachtrain)

# What's the test score?
PctBachforest_pipeline.score(X_PctBachtest, y_PctBachtest)
0.7216614717862473
Code
# Extract the regressor from the pipeline
PctBachforest_model = PctBachforest_pipeline["randomforestregressor"]
Code
# Create the data frame of importances
importancePctBach = pd.DataFrame(
    {"Feature": PctBachfeature_cols, "Importance": PctBachforest_model.feature_importances_}
).sort_values("Importance")


importancePctBach.hvplot.barh(x="Feature", y="Importance")

We find that the variable that matters most for education levels is life expectancy. The test score is 0.72, meaning the model explains 72% of the variance in the percentage of bachelor’s degree holders.
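
To compare the four models side by side, the importance tables could be merged into a single summary; this sketch assumes the four DataFrames created above:

Code
# Combine the four importance tables, indexed by feature; NaN marks a
# variable that served as the target (and so was not a feature) for a model
summary = (
    importanceGDP.rename(columns={"Importance": "GDP"})
    .merge(importanceLE.rename(columns={"Importance": "LifeExp"}), on="Feature", how="outer")
    .merge(importanceHHINC.rename(columns={"Importance": "MedHHInc"}), on="Feature", how="outer")
    .merge(importancePctBach.rename(columns={"Importance": "PctBach"}), on="Feature", how="outer")
)
summary.sort_values("GDP", ascending=False)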