With these results, policymakers may want to know which variables are the key predictors of GDP per capita, life expectancy, median household income, and education levels. We fit a predictive model for each outcome using random forests, an ensemble of decision trees, to identify the key predictors. First we select the variables we want to model.
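The upstream cell that builds usa_modelling is not shown in this section; a minimal sketch, assuming the merged dataset is named usa_counties (a hypothetical name), would keep only the modelling columns and drop incomplete rows:
Code
# Hypothetical reconstruction: `usa_counties` stands in for the merged
# dataset assembled earlier in the notebook.
model_cols = [
    'REALGDPpercapita', 'life_expectancy', 'MedHHInc', 'PctBach',
    'UnemploymentRate', 'LabForParticipationRate', 'Labor_Productivity_2023',
    'TotalPop', 'PovertyRate', 'netexport',
]
usa_modelling = usa_counties[model_cols].dropna()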
We then split the data into training and test sets.
Code
# Imports used throughout this section
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Split the data 70/30
train_set, test_set = train_test_split(usa_modelling, test_size=0.3, random_state=42)
We then run the predictive model for each dependent variable (Sections 5.1 to 5.4). Each barplot ranks the independent variables by their importance in predicting the dependent variable.
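Because the four subsections repeat the same workflow (log-transform the target, cross-validate, fit, score on the held-out test set, and tabulate importances), the pattern could also be captured in one helper. The sketch below is ours, not part of the original notebook; the name run_forest is hypothetical, and it reuses the imports loaded above.
Code
def run_forest(train_set, test_set, target, feature_cols):
    # Log-transform the target, as in Sections 5.1 to 5.4
    y_train = np.log(train_set[target])
    y_test = np.log(test_set[target])
    X_train = train_set[feature_cols].values
    X_test = test_set[feature_cols].values

    # Scale, then fit a 100-tree random forest
    pipeline = make_pipeline(
        StandardScaler(), RandomForestRegressor(n_estimators=100, random_state=42)
    )

    # 10-fold cross validation on the training set
    scores = cross_val_score(pipeline, X_train, y_train, cv=10)
    print("CV R^2 mean =", scores.mean(), "std =", scores.std())

    # Fit and report the held-out test score
    pipeline.fit(X_train, y_train)
    print("Test R^2 =", pipeline.score(X_test, y_test))

    # Sorted feature-importance table from the fitted forest
    forest = pipeline["randomforestregressor"]
    importance = pd.DataFrame(
        {"Feature": feature_cols, "Importance": forest.feature_importances_}
    ).sort_values("Importance")
    return pipeline, importance
Section 5.1, for example, would then reduce to run_forest(train_set, test_set, "REALGDPpercapita", GDPfeature_cols) followed by the hvplot call. The step-by-step cells are kept below as written.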
5.1 GDP per capita against variables
Code
# The target labels: log of REALGDPpercapita
y_GDPtrain = np.log(train_set["REALGDPpercapita"])
y_GDPtest = np.log(test_set["REALGDPpercapita"])
Code
# The features
GDPfeature_cols = [
    'life_expectancy',
    'MedHHInc',
    'PctBach',
    'UnemploymentRate',
    'LabForParticipationRate',
    'Labor_Productivity_2023',
    'TotalPop',
    'PovertyRate',
    'netexport',
]
X_GDPtrain = train_set[GDPfeature_cols].values
X_GDPtest = test_set[GDPfeature_cols].values
Code
# Make a random forest pipeline
forest_pipeline = make_pipeline(
    StandardScaler(), RandomForestRegressor(n_estimators=100, random_state=42)
)

# Run the 10-fold cross validation
GDPscores = cross_val_score(
    forest_pipeline,
    X_GDPtrain,
    y_GDPtrain,
    cv=10,
)

# Report
print("R^2 scores = ", GDPscores)
print("Scores mean = ", GDPscores.mean())
print("Score std dev = ", GDPscores.std())
# Fit on the training data
forest_pipeline.fit(X_GDPtrain, y_GDPtrain)

# What's the test score?
forest_pipeline.score(X_GDPtest, y_GDPtest)
0.5397527625306848
Code
# Extract the regressor from the pipeline
forest_model = forest_pipeline["randomforestregressor"]
Code
import hvplot.pandas
Code
# Create the data frame of importances
importanceGDP = pd.DataFrame(
    {"Feature": GDPfeature_cols, "Importance": forest_model.feature_importances_}
).sort_values("Importance")

importanceGDP.hvplot.barh(x="Feature", y="Importance")
We find that the variables that matter most for real GDP per capita are median household income and the labor force participation rate. The test score is 0.54, meaning the model explains about 54% of the variance in (log) real GDP per capita on the held-out test set.
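One caveat worth flagging: impurity-based importances from a fitted forest can favor correlated features. A quick cross-check is permutation importance on the held-out test set, which measures how much R^2 drops when each feature is shuffled. This sketch uses scikit-learn's permutation_importance and is not part of the original analysis:
Code
from sklearn.inspection import permutation_importance

# Shuffle each feature on the test set and record the average drop in R^2;
# larger drops mean the model relies more heavily on that feature.
perm = permutation_importance(
    forest_pipeline, X_GDPtest, y_GDPtest, n_repeats=30, random_state=42
)
perm_importance = pd.DataFrame(
    {"Feature": GDPfeature_cols, "Importance": perm.importances_mean}
).sort_values("Importance")
perm_importance.hvplot.barh(x="Feature", y="Importance")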
5.2 Life Expectancy against variables
Code
# The target labels: log of life_expectancy
y_LEtrain = np.log(train_set["life_expectancy"])
y_LEtest = np.log(test_set["life_expectancy"])
Code
# The features
LEfeature_cols = [
    'REALGDPpercapita',
    'MedHHInc',
    'PctBach',
    'UnemploymentRate',
    'LabForParticipationRate',
    'Labor_Productivity_2023',
    'TotalPop',
    'PovertyRate',
    'netexport',
]
X_LEtrain = train_set[LEfeature_cols].values
X_LEtest = test_set[LEfeature_cols].values
Code
# Make a random forest pipeline
LEforest_pipeline = make_pipeline(
    StandardScaler(), RandomForestRegressor(n_estimators=100, random_state=42)
)

# Run the 10-fold cross validation
LEscores = cross_val_score(
    LEforest_pipeline,
    X_LEtrain,
    y_LEtrain,
    cv=10,
)

# Report
print("R^2 scores = ", LEscores)
print("Scores mean = ", LEscores.mean())
print("Score std dev = ", LEscores.std())
# Fit on the training data
LEforest_pipeline.fit(X_LEtrain, y_LEtrain)

# What's the test score?
LEforest_pipeline.score(X_LEtest, y_LEtest)
0.7472553097289523
Code
# Extract the regressor from the pipeline
LEforest_model = LEforest_pipeline["randomforestregressor"]
Code
# Create the data frame of importances
importanceLE = pd.DataFrame(
    {"Feature": LEfeature_cols, "Importance": LEforest_model.feature_importances_}
).sort_values("Importance")

importanceLE.hvplot.barh(x="Feature", y="Importance")
We find that the variables that matter most for life expectancy are the percentage of bachelor's degree graduates, median household income, and the poverty rate. The test score is 0.75, meaning the model explains about 75% of the variance in (log) life expectancy on the held-out test set.
5.3 Median Household Income against variables
Code
# The target labels: log of MedHHInc
y_HHINCtrain = np.log(train_set["MedHHInc"])
y_HHINCtest = np.log(test_set["MedHHInc"])
Code
# The features
HHINCfeature_cols = [
    'REALGDPpercapita',
    'life_expectancy',
    'PctBach',
    'UnemploymentRate',
    'LabForParticipationRate',
    'Labor_Productivity_2023',
    'TotalPop',
    'PovertyRate',
    'netexport',
]
X_HHINCtrain = train_set[HHINCfeature_cols].values
X_HHINCtest = test_set[HHINCfeature_cols].values
Code
# Make a random forest pipeline
HHINCforest_pipeline = make_pipeline(
    StandardScaler(), RandomForestRegressor(n_estimators=100, random_state=42)
)

# Run the 10-fold cross validation
HHINCscores = cross_val_score(
    HHINCforest_pipeline,
    X_HHINCtrain,
    y_HHINCtrain,
    cv=10,
)

# Report
print("R^2 scores = ", HHINCscores)
print("Scores mean = ", HHINCscores.mean())
print("Score std dev = ", HHINCscores.std())
# Fit on the training data
HHINCforest_pipeline.fit(X_HHINCtrain, y_HHINCtrain)

# What's the test score?
HHINCforest_pipeline.score(X_HHINCtest, y_HHINCtest)
0.8226485614034753
Code
# Extract the regressor from the pipeline
HHINCforest_model = HHINCforest_pipeline["randomforestregressor"]
Code
# Create the data frame of importances
importanceHHINC = pd.DataFrame(
    {"Feature": HHINCfeature_cols, "Importance": HHINCforest_model.feature_importances_}
).sort_values("Importance")

importanceHHINC.hvplot.barh(x="Feature", y="Importance")
We find that the variables that matter most for median household income are the percentage of bachelor's degree graduates, life expectancy, real GDP per capita, and the poverty rate. The test score is 0.82, meaning the model explains about 82% of the variance in (log) median household income on the held-out test set.
5.4 Education levels against variables
Code
# The target labels: log of PctBach
y_PctBachtrain = np.log(train_set["PctBach"])
y_PctBachtest = np.log(test_set["PctBach"])
Code
# The features
PctBachfeature_cols = [
    'REALGDPpercapita',
    'life_expectancy',
    'MedHHInc',
    'UnemploymentRate',
    'LabForParticipationRate',
    'Labor_Productivity_2023',
    'TotalPop',
    'PovertyRate',
    'netexport',
]
X_PctBachtrain = train_set[PctBachfeature_cols].values
X_PctBachtest = test_set[PctBachfeature_cols].values
Code
# Make a random forest pipeline
PctBachforest_pipeline = make_pipeline(
    StandardScaler(), RandomForestRegressor(n_estimators=100, random_state=42)
)

# Run the 10-fold cross validation
PctBachscores = cross_val_score(
    PctBachforest_pipeline,
    X_PctBachtrain,
    y_PctBachtrain,
    cv=10,
)

# Report
print("R^2 scores = ", PctBachscores)
print("Scores mean = ", PctBachscores.mean())
print("Score std dev = ", PctBachscores.std())
# Fit on the training data
PctBachforest_pipeline.fit(X_PctBachtrain, y_PctBachtrain)

# What's the test score?
PctBachforest_pipeline.score(X_PctBachtest, y_PctBachtest)
0.7216614717862473
Code
# Extract the regressor from the pipeline
PctBachforest_model = PctBachforest_pipeline["randomforestregressor"]
Code
# Create the data frame of importances
importancePctBach = pd.DataFrame(
    {"Feature": PctBachfeature_cols, "Importance": PctBachforest_model.feature_importances_}
).sort_values("Importance")

importancePctBach.hvplot.barh(x="Feature", y="Importance")
We find that the variable that matters most for education levels is life expectancy. The test score is 0.72, meaning the model explains about 72% of the variance in the (log) percentage of bachelor's degree holders on the held-out test set.
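To compare the four models at a glance, the held-out scores can be collected into one table. This small summary sketch is ours, reusing the pipelines fitted above:
Code
# Gather the test R^2 of each model reported in Sections 5.1 to 5.4
summary = pd.DataFrame(
    {
        "Target": ["REALGDPpercapita", "life_expectancy", "MedHHInc", "PctBach"],
        "Test R^2": [
            forest_pipeline.score(X_GDPtest, y_GDPtest),
            LEforest_pipeline.score(X_LEtest, y_LEtest),
            HHINCforest_pipeline.score(X_HHINCtest, y_HHINCtest),
            PctBachforest_pipeline.score(X_PctBachtest, y_PctBachtest),
        ],
    }
)
print(summary)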