Global Health Inequality: Regression Analysis

Overview

Building on the descriptive analysis, this notebook explores the predictive relationships between global health indicators through linear regression. The primary question: How strongly does national income predict health expenditure?

Setup

library(tidyverse)
library(dplyr)
library(ggplot2)
library(knitr)
library(psych)
library(janitor)
library(modelr)

Load Data

ind <- read_csv('data/globalind.csv', col_names = TRUE)
schools <- read_csv('data/schools.csv', col_names = TRUE)

Part I: Global Indicators

Reviewing the Strongest Correlation

From the descriptive analysis, GNI and health expenditure showed the strongest correlation. Let’s examine this relationship more closely.

plot(ind$gni, ind$healthexp,
     main = "GNI vs. Health Expenditure per Capita",
     xlab = "Gross National Income per Capita (USD)",
     ylab = "Health Expenditure per Capita (USD)",
     pch = 19,
     col = rgb(0.2, 0.4, 0.6, 0.6),
     cex = 1.2)
grid()

Simple Linear Regression: GNI → Health Expenditure

A linear model with healthexp as the dependent variable and gni as the independent variable:

ind_mod1 <- lm(formula = healthexp ~ gni, data = ind)
summary(ind_mod1)

## 
## Call:
## lm(formula = healthexp ~ gni, data = ind)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3855.4  -138.5    53.5   167.3  4991.7 
## 
## Coefficients:
##                Estimate  Std. Error t value             Pr(>|t|)    
## (Intercept) -210.326641   69.862209  -3.011              0.00298 ** 
## gni            0.107054    0.003063  34.953 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 753.5 on 180 degrees of freedom
##   (40 observations deleted due to missingness)
## Multiple R-squared:  0.8716, Adjusted R-squared:  0.8709 
## F-statistic:  1222 on 1 and 180 DF,  p-value: < 0.00000000000000022

Interpretation:

R² = 0.87 — 87% of the variation in health expenditure is explained by GNI alone.
p-value < 0.05 — The relationship is statistically significant.
GNI coefficient ≈ 0.107 — For every $1 increase in GNI per capita, health expenditure increases by approximately $0.11. In practical terms: for every $10 increase in GNI, health spending rises by about $1.

Multiple Regression: Adding Unemployment

Does unemployment improve the model?

ind_mod2 <- lm(formula = healthexp ~ gni + unemp, data = ind)
summary(ind_mod2)

## 
## Call:
## lm(formula = healthexp ~ gni + unemp, data = ind)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3803.6  -144.4    54.2   197.1  4942.3 
## 
## Coefficients:
##                Estimate  Std. Error t value             Pr(>|t|)    
## (Intercept) -316.971555  118.577630  -2.673              0.00826 ** 
## gni            0.107965    0.003186  33.891 < 0.0000000000000002 ***
## unemp         12.067714   10.799834   1.117              0.26543    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 765.7 on 167 degrees of freedom
##   (52 observations deleted due to missingness)
## Multiple R-squared:  0.8749, Adjusted R-squared:  0.8734 
## F-statistic:   584 on 2 and 167 DF,  p-value: < 0.00000000000000022

Result: Adding unemployment does not improve the model — the adjusted R² remains at 0.87 and unemployment’s p-value (0.265) is not significant. GNI alone is sufficient.

Part II: California Schools

Exploring Poverty and Academic Performance

par(mfrow = c(1, 2))
hist(schools$poorpct, main = "Distribution of Poverty Rate",
     xlab = "% Students in Poverty", col = "tomato", border = "white")
hist(schools$p_passed, main = "Distribution of Pass Rate",
     xlab = "% Students Passed", col = "mediumpurple", border = "white")

plot(schools$poorpct, schools$p_passed,
     main = "Schools with Higher Poverty Rates Have Lower Pass Rates",
     xlab = "Percent of Students in Poverty",
     ylab = "Percent Passed",
     pch = 19,
     col = rgb(0.2, 0.4, 0.6, 0.6),
     cex = 1.2)
grid()

Regression: Poverty → Pass Rate

schools_mod1 <- lm(formula = p_passed ~ poorpct, data = schools)
summary(schools_mod1)

## 
## Call:
## lm(formula = p_passed ~ poorpct, data = schools)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -47.908  -7.152   0.898   7.443  31.933 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept) 88.96757    0.97996   90.79 <0.0000000000000002 ***
## poorpct     -0.40982    0.01747  -23.46 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.86 on 501 degrees of freedom
## Multiple R-squared:  0.5235, Adjusted R-squared:  0.5226 
## F-statistic: 550.5 on 1 and 501 DF,  p-value: < 0.00000000000000022

plot(
  schools$poorpct,
  schools$p_passed,
  main = "Schools with Higher Poverty Rates Tend to Have Lower Pass Rates",
  xlab = "Percent of Students in Poverty",
  ylab = "Percent Passed",
  pch = 19,
  col = rgb(0.2, 0.4, 0.6, 0.6),
  cex = 1.2
)
abline(schools_mod1, col = "darkred", lwd = 2)
grid()

Multiple Regression: Poverty + Minority Percentage

schools_mod2 <- lm(formula = p_passed ~ poorpct + p_minority, data = schools)
summary(schools_mod2)

## 
## Call:
## lm(formula = p_passed ~ poorpct + p_minority, data = schools)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -41.921  -7.355   0.894   7.450  32.994 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept) 89.59312    0.98805  90.676 < 0.0000000000000002 ***
## poorpct     -0.31957    0.03207  -9.966 < 0.0000000000000002 ***
## p_minority  -0.10132    0.03031  -3.342             0.000893 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.74 on 500 degrees of freedom
## Multiple R-squared:  0.5339, Adjusted R-squared:  0.5321 
## F-statistic: 286.4 on 2 and 500 DF,  p-value: < 0.00000000000000022

Residual Analysis

Adding predicted values and residuals to identify outlier schools that perform significantly above or below expectations given their poverty levels:

schools_pr <- schools %>%
  add_predictions(schools_mod1) %>%
  add_residuals(schools_mod1)

# Schools performing significantly ABOVE expectations (positive residuals)
outliers_positive <- schools_pr %>%
  filter(resid > 2 * sd(resid, na.rm = TRUE)) %>%
  arrange(desc(resid))

# Schools performing significantly BELOW expectations (negative residuals)
outliers_negative <- schools_pr %>%
  filter(resid < -2 * sd(resid, na.rm = TRUE)) %>%
  arrange(resid)

Schools outperforming expectations (pass rate much higher than predicted by poverty level):

kable(outliers_positive %>% select(Name, poorpct, p_passed, pred, resid),
      col.names = c("School", "Poverty %", "Pass Rate", "Predicted", "Residual"),
      digits = 1,
      caption = "Schools with large positive residuals")

Schools with large positive residuals
School	Poverty %	Pass Rate	Predicted	Residual
PHYLLIS WHEATLEY EL	93.0	82.8	50.9	31.9
JULIA C FRAZIER EL	98.5	78.5	48.6	29.9
FANNIE C HARRIS EL	96.1	77.3	49.6	27.7
ROACH EL	44.5	98.2	70.7	27.5
ALTA MESA EL	85.3	81.4	54.0	27.4
MILAM EL	53.0	94.6	67.2	27.4
AUSTIN EL	68.3	87.8	61.0	26.9
CALDWELL EL	65.4	89.0	62.2	26.8
STEPHEN C FOSTER EL	90.7	78.2	51.8	26.5
SAM HOUSTON EL	79.0	82.6	56.6	26.0
MONTGOMERY EL	62.5	88.3	63.4	24.9
CENTERVILLE EL	59.8	89.2	64.5	24.8
GOLDEN RULE EL	77.7	81.7	57.1	24.5

Schools underperforming expectations (pass rate much lower than predicted):

kable(outliers_negative %>% select(Name, poorpct, p_passed, pred, resid),
      col.names = c("School", "Poverty %", "Pass Rate", "Predicted", "Residual"),
      digits = 1,
      caption = "Schools with large negative residuals")

Schools with large negative residuals
School	Poverty %	Pass Rate	Predicted	Residual
PLEASANT RUN EL	0.0	41.1	89.0	-47.9
WILMER EL	75.0	19.8	58.2	-38.4
MILLBROOK EL	0.0	51.6	89.0	-37.4
KEMP INT	43.1	35.7	71.3	-35.6
ADELLE TURNER EL	40.7	39.1	72.3	-33.2
THE ACT ACAD AT J L GREER	11.9	51.1	84.1	-33.0
T E BAXTER EL	35.9	44.9	74.3	-29.4
ROLLING HILLS EL	0.5	59.7	88.8	-29.1
JEFFERSON EL	69.7	33.0	60.4	-27.4
JAMES S HOGG EL	92.6	24.1	51.0	-27.0
ORAN M ROBERTS EL	98.3	21.9	48.7	-26.8
LANCASTER INT	13.7	57.1	83.3	-26.2
FINCH EL	58.5	39.1	65.0	-25.8

ggplot(schools_pr, aes(x = poorpct, y = resid)) +
  geom_point(alpha = 0.5, color = "steelblue") +
  geom_hline(yintercept = 0, linetype = "dashed", color = "darkred") +
  labs(title = "Residual Plot: Poverty vs. Model Residuals",
       x = "Percent of Students in Poverty",
       y = "Residual (Actual - Predicted)") +
  theme_minimal()

Data sources: WHO Global Health Observatory, California Schools Dataset