Building on the descriptive analysis, this notebook explores the predictive relationships between global health indicators through linear regression. The primary question: How strongly does national income predict health expenditure?
From the descriptive analysis, GNI and health expenditure showed the strongest correlation. Let’s examine this relationship more closely.
plot(ind$gni, ind$healthexp,
main = "GNI vs. Health Expenditure per Capita",
xlab = "Gross National Income per Capita (USD)",
ylab = "Health Expenditure per Capita (USD)",
pch = 19,
col = rgb(0.2, 0.4, 0.6, 0.6),
cex = 1.2)
grid()A linear model with healthexp as the dependent variable
and gni as the independent variable:
##
## Call:
## lm(formula = healthexp ~ gni, data = ind)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3855.4 -138.5 53.5 167.3 4991.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -210.326641 69.862209 -3.011 0.00298 **
## gni 0.107054 0.003063 34.953 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 753.5 on 180 degrees of freedom
## (40 observations deleted due to missingness)
## Multiple R-squared: 0.8716, Adjusted R-squared: 0.8709
## F-statistic: 1222 on 1 and 180 DF, p-value: < 0.00000000000000022
Interpretation:
Does unemployment improve the model?
##
## Call:
## lm(formula = healthexp ~ gni + unemp, data = ind)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3803.6 -144.4 54.2 197.1 4942.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -316.971555 118.577630 -2.673 0.00826 **
## gni 0.107965 0.003186 33.891 < 0.0000000000000002 ***
## unemp 12.067714 10.799834 1.117 0.26543
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 765.7 on 167 degrees of freedom
## (52 observations deleted due to missingness)
## Multiple R-squared: 0.8749, Adjusted R-squared: 0.8734
## F-statistic: 584 on 2 and 167 DF, p-value: < 0.00000000000000022
Result: Adding unemployment does not improve the model — the adjusted R² remains at 0.87 and unemployment’s p-value (0.265) is not significant. GNI alone is sufficient.
par(mfrow = c(1, 2))
hist(schools$poorpct, main = "Distribution of Poverty Rate",
xlab = "% Students in Poverty", col = "tomato", border = "white")
hist(schools$p_passed, main = "Distribution of Pass Rate",
xlab = "% Students Passed", col = "mediumpurple", border = "white")plot(schools$poorpct, schools$p_passed,
main = "Schools with Higher Poverty Rates Have Lower Pass Rates",
xlab = "Percent of Students in Poverty",
ylab = "Percent Passed",
pch = 19,
col = rgb(0.2, 0.4, 0.6, 0.6),
cex = 1.2)
grid()##
## Call:
## lm(formula = p_passed ~ poorpct, data = schools)
##
## Residuals:
## Min 1Q Median 3Q Max
## -47.908 -7.152 0.898 7.443 31.933
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 88.96757 0.97996 90.79 <0.0000000000000002 ***
## poorpct -0.40982 0.01747 -23.46 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.86 on 501 degrees of freedom
## Multiple R-squared: 0.5235, Adjusted R-squared: 0.5226
## F-statistic: 550.5 on 1 and 501 DF, p-value: < 0.00000000000000022
plot(
schools$poorpct,
schools$p_passed,
main = "Schools with Higher Poverty Rates Tend to Have Lower Pass Rates",
xlab = "Percent of Students in Poverty",
ylab = "Percent Passed",
pch = 19,
col = rgb(0.2, 0.4, 0.6, 0.6),
cex = 1.2
)
abline(schools_mod1, col = "darkred", lwd = 2)
grid()##
## Call:
## lm(formula = p_passed ~ poorpct + p_minority, data = schools)
##
## Residuals:
## Min 1Q Median 3Q Max
## -41.921 -7.355 0.894 7.450 32.994
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 89.59312 0.98805 90.676 < 0.0000000000000002 ***
## poorpct -0.31957 0.03207 -9.966 < 0.0000000000000002 ***
## p_minority -0.10132 0.03031 -3.342 0.000893 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.74 on 500 degrees of freedom
## Multiple R-squared: 0.5339, Adjusted R-squared: 0.5321
## F-statistic: 286.4 on 2 and 500 DF, p-value: < 0.00000000000000022
Adding predicted values and residuals to identify outlier schools that perform significantly above or below expectations given their poverty levels:
schools_pr <- schools %>%
add_predictions(schools_mod1) %>%
add_residuals(schools_mod1)
# Schools performing significantly ABOVE expectations (positive residuals)
outliers_positive <- schools_pr %>%
filter(resid > 2 * sd(resid, na.rm = TRUE)) %>%
arrange(desc(resid))
# Schools performing significantly BELOW expectations (negative residuals)
outliers_negative <- schools_pr %>%
filter(resid < -2 * sd(resid, na.rm = TRUE)) %>%
arrange(resid)Schools outperforming expectations (pass rate much higher than predicted by poverty level):
kable(outliers_positive %>% select(Name, poorpct, p_passed, pred, resid),
col.names = c("School", "Poverty %", "Pass Rate", "Predicted", "Residual"),
digits = 1,
caption = "Schools with large positive residuals")| School | Poverty % | Pass Rate | Predicted | Residual |
|---|---|---|---|---|
| PHYLLIS WHEATLEY EL | 93.0 | 82.8 | 50.9 | 31.9 |
| JULIA C FRAZIER EL | 98.5 | 78.5 | 48.6 | 29.9 |
| FANNIE C HARRIS EL | 96.1 | 77.3 | 49.6 | 27.7 |
| ROACH EL | 44.5 | 98.2 | 70.7 | 27.5 |
| ALTA MESA EL | 85.3 | 81.4 | 54.0 | 27.4 |
| MILAM EL | 53.0 | 94.6 | 67.2 | 27.4 |
| AUSTIN EL | 68.3 | 87.8 | 61.0 | 26.9 |
| CALDWELL EL | 65.4 | 89.0 | 62.2 | 26.8 |
| STEPHEN C FOSTER EL | 90.7 | 78.2 | 51.8 | 26.5 |
| SAM HOUSTON EL | 79.0 | 82.6 | 56.6 | 26.0 |
| MONTGOMERY EL | 62.5 | 88.3 | 63.4 | 24.9 |
| CENTERVILLE EL | 59.8 | 89.2 | 64.5 | 24.8 |
| GOLDEN RULE EL | 77.7 | 81.7 | 57.1 | 24.5 |
Schools underperforming expectations (pass rate much lower than predicted):
kable(outliers_negative %>% select(Name, poorpct, p_passed, pred, resid),
col.names = c("School", "Poverty %", "Pass Rate", "Predicted", "Residual"),
digits = 1,
caption = "Schools with large negative residuals")| School | Poverty % | Pass Rate | Predicted | Residual |
|---|---|---|---|---|
| PLEASANT RUN EL | 0.0 | 41.1 | 89.0 | -47.9 |
| WILMER EL | 75.0 | 19.8 | 58.2 | -38.4 |
| MILLBROOK EL | 0.0 | 51.6 | 89.0 | -37.4 |
| KEMP INT | 43.1 | 35.7 | 71.3 | -35.6 |
| ADELLE TURNER EL | 40.7 | 39.1 | 72.3 | -33.2 |
| THE ACT ACAD AT J L GREER | 11.9 | 51.1 | 84.1 | -33.0 |
| T E BAXTER EL | 35.9 | 44.9 | 74.3 | -29.4 |
| ROLLING HILLS EL | 0.5 | 59.7 | 88.8 | -29.1 |
| JEFFERSON EL | 69.7 | 33.0 | 60.4 | -27.4 |
| JAMES S HOGG EL | 92.6 | 24.1 | 51.0 | -27.0 |
| ORAN M ROBERTS EL | 98.3 | 21.9 | 48.7 | -26.8 |
| LANCASTER INT | 13.7 | 57.1 | 83.3 | -26.2 |
| FINCH EL | 58.5 | 39.1 | 65.0 | -25.8 |
ggplot(schools_pr, aes(x = poorpct, y = resid)) +
geom_point(alpha = 0.5, color = "steelblue") +
geom_hline(yintercept = 0, linetype = "dashed", color = "darkred") +
labs(title = "Residual Plot: Poverty vs. Model Residuals",
x = "Percent of Students in Poverty",
y = "Residual (Actual - Predicted)") +
theme_minimal()Data sources: WHO Global Health Observatory, California Schools Dataset