Overview

Building on the descriptive analysis, this notebook explores the predictive relationships between global health indicators through linear regression. The primary question: How strongly does national income predict health expenditure?

Setup

library(tidyverse)
library(dplyr)
library(ggplot2)
library(knitr)
library(psych)
library(janitor)
library(modelr)

Load Data

ind <- read_csv('data/globalind.csv', col_names = TRUE)
schools <- read_csv('data/schools.csv', col_names = TRUE)

Part I: Global Indicators

Reviewing the Strongest Correlation

From the descriptive analysis, GNI and health expenditure showed the strongest correlation. Let’s examine this relationship more closely.

plot(ind$gni, ind$healthexp,
     main = "GNI vs. Health Expenditure per Capita",
     xlab = "Gross National Income per Capita (USD)",
     ylab = "Health Expenditure per Capita (USD)",
     pch = 19,
     col = rgb(0.2, 0.4, 0.6, 0.6),
     cex = 1.2)
grid()

Simple Linear Regression: GNI → Health Expenditure

A linear model with healthexp as the dependent variable and gni as the independent variable:

ind_mod1 <- lm(formula = healthexp ~ gni, data = ind)
summary(ind_mod1)
## 
## Call:
## lm(formula = healthexp ~ gni, data = ind)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3855.4  -138.5    53.5   167.3  4991.7 
## 
## Coefficients:
##                Estimate  Std. Error t value             Pr(>|t|)    
## (Intercept) -210.326641   69.862209  -3.011              0.00298 ** 
## gni            0.107054    0.003063  34.953 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 753.5 on 180 degrees of freedom
##   (40 observations deleted due to missingness)
## Multiple R-squared:  0.8716, Adjusted R-squared:  0.8709 
## F-statistic:  1222 on 1 and 180 DF,  p-value: < 0.00000000000000022

Interpretation:

  • R² = 0.87 — 87% of the variation in health expenditure is explained by GNI alone.
  • p-value < 0.05 — The relationship is statistically significant.
  • GNI coefficient ≈ 0.107 — For every $1 increase in GNI per capita, health expenditure increases by approximately $0.11. In practical terms: for every $10 increase in GNI, health spending rises by about $1.

Multiple Regression: Adding Unemployment

Does unemployment improve the model?

ind_mod2 <- lm(formula = healthexp ~ gni + unemp, data = ind)
summary(ind_mod2)
## 
## Call:
## lm(formula = healthexp ~ gni + unemp, data = ind)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3803.6  -144.4    54.2   197.1  4942.3 
## 
## Coefficients:
##                Estimate  Std. Error t value             Pr(>|t|)    
## (Intercept) -316.971555  118.577630  -2.673              0.00826 ** 
## gni            0.107965    0.003186  33.891 < 0.0000000000000002 ***
## unemp         12.067714   10.799834   1.117              0.26543    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 765.7 on 167 degrees of freedom
##   (52 observations deleted due to missingness)
## Multiple R-squared:  0.8749, Adjusted R-squared:  0.8734 
## F-statistic:   584 on 2 and 167 DF,  p-value: < 0.00000000000000022

Result: Adding unemployment does not improve the model — the adjusted R² remains at 0.87 and unemployment’s p-value (0.265) is not significant. GNI alone is sufficient.

Part II: California Schools

Exploring Poverty and Academic Performance

par(mfrow = c(1, 2))
hist(schools$poorpct, main = "Distribution of Poverty Rate",
     xlab = "% Students in Poverty", col = "tomato", border = "white")
hist(schools$p_passed, main = "Distribution of Pass Rate",
     xlab = "% Students Passed", col = "mediumpurple", border = "white")

plot(schools$poorpct, schools$p_passed,
     main = "Schools with Higher Poverty Rates Have Lower Pass Rates",
     xlab = "Percent of Students in Poverty",
     ylab = "Percent Passed",
     pch = 19,
     col = rgb(0.2, 0.4, 0.6, 0.6),
     cex = 1.2)
grid()

Regression: Poverty → Pass Rate

schools_mod1 <- lm(formula = p_passed ~ poorpct, data = schools)
summary(schools_mod1)
## 
## Call:
## lm(formula = p_passed ~ poorpct, data = schools)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -47.908  -7.152   0.898   7.443  31.933 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept) 88.96757    0.97996   90.79 <0.0000000000000002 ***
## poorpct     -0.40982    0.01747  -23.46 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.86 on 501 degrees of freedom
## Multiple R-squared:  0.5235, Adjusted R-squared:  0.5226 
## F-statistic: 550.5 on 1 and 501 DF,  p-value: < 0.00000000000000022
plot(
  schools$poorpct,
  schools$p_passed,
  main = "Schools with Higher Poverty Rates Tend to Have Lower Pass Rates",
  xlab = "Percent of Students in Poverty",
  ylab = "Percent Passed",
  pch = 19,
  col = rgb(0.2, 0.4, 0.6, 0.6),
  cex = 1.2
)
abline(schools_mod1, col = "darkred", lwd = 2)
grid()

Multiple Regression: Poverty + Minority Percentage

schools_mod2 <- lm(formula = p_passed ~ poorpct + p_minority, data = schools)
summary(schools_mod2)
## 
## Call:
## lm(formula = p_passed ~ poorpct + p_minority, data = schools)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -41.921  -7.355   0.894   7.450  32.994 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept) 89.59312    0.98805  90.676 < 0.0000000000000002 ***
## poorpct     -0.31957    0.03207  -9.966 < 0.0000000000000002 ***
## p_minority  -0.10132    0.03031  -3.342             0.000893 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.74 on 500 degrees of freedom
## Multiple R-squared:  0.5339, Adjusted R-squared:  0.5321 
## F-statistic: 286.4 on 2 and 500 DF,  p-value: < 0.00000000000000022

Residual Analysis

Adding predicted values and residuals to identify outlier schools that perform significantly above or below expectations given their poverty levels:

schools_pr <- schools %>%
  add_predictions(schools_mod1) %>%
  add_residuals(schools_mod1)

# Schools performing significantly ABOVE expectations (positive residuals)
outliers_positive <- schools_pr %>%
  filter(resid > 2 * sd(resid, na.rm = TRUE)) %>%
  arrange(desc(resid))

# Schools performing significantly BELOW expectations (negative residuals)
outliers_negative <- schools_pr %>%
  filter(resid < -2 * sd(resid, na.rm = TRUE)) %>%
  arrange(resid)

Schools outperforming expectations (pass rate much higher than predicted by poverty level):

kable(outliers_positive %>% select(Name, poorpct, p_passed, pred, resid),
      col.names = c("School", "Poverty %", "Pass Rate", "Predicted", "Residual"),
      digits = 1,
      caption = "Schools with large positive residuals")
Schools with large positive residuals
School Poverty % Pass Rate Predicted Residual
PHYLLIS WHEATLEY EL 93.0 82.8 50.9 31.9
JULIA C FRAZIER EL 98.5 78.5 48.6 29.9
FANNIE C HARRIS EL 96.1 77.3 49.6 27.7
ROACH EL 44.5 98.2 70.7 27.5
ALTA MESA EL 85.3 81.4 54.0 27.4
MILAM EL 53.0 94.6 67.2 27.4
AUSTIN EL 68.3 87.8 61.0 26.9
CALDWELL EL 65.4 89.0 62.2 26.8
STEPHEN C FOSTER EL 90.7 78.2 51.8 26.5
SAM HOUSTON EL 79.0 82.6 56.6 26.0
MONTGOMERY EL 62.5 88.3 63.4 24.9
CENTERVILLE EL 59.8 89.2 64.5 24.8
GOLDEN RULE EL 77.7 81.7 57.1 24.5

Schools underperforming expectations (pass rate much lower than predicted):

kable(outliers_negative %>% select(Name, poorpct, p_passed, pred, resid),
      col.names = c("School", "Poverty %", "Pass Rate", "Predicted", "Residual"),
      digits = 1,
      caption = "Schools with large negative residuals")
Schools with large negative residuals
School Poverty % Pass Rate Predicted Residual
PLEASANT RUN EL 0.0 41.1 89.0 -47.9
WILMER EL 75.0 19.8 58.2 -38.4
MILLBROOK EL 0.0 51.6 89.0 -37.4
KEMP INT 43.1 35.7 71.3 -35.6
ADELLE TURNER EL 40.7 39.1 72.3 -33.2
THE ACT ACAD AT J L GREER 11.9 51.1 84.1 -33.0
T E BAXTER EL 35.9 44.9 74.3 -29.4
ROLLING HILLS EL 0.5 59.7 88.8 -29.1
JEFFERSON EL 69.7 33.0 60.4 -27.4
JAMES S HOGG EL 92.6 24.1 51.0 -27.0
ORAN M ROBERTS EL 98.3 21.9 48.7 -26.8
LANCASTER INT 13.7 57.1 83.3 -26.2
FINCH EL 58.5 39.1 65.0 -25.8
ggplot(schools_pr, aes(x = poorpct, y = resid)) +
  geom_point(alpha = 0.5, color = "steelblue") +
  geom_hline(yintercept = 0, linetype = "dashed", color = "darkred") +
  labs(title = "Residual Plot: Poverty vs. Model Residuals",
       x = "Percent of Students in Poverty",
       y = "Residual (Actual - Predicted)") +
  theme_minimal()


Data sources: WHO Global Health Observatory, California Schools Dataset