The Two-Level Reality of Regression
Understanding regression requires distinguishing between what exists in reality versus what we can observe and estimate from our sample data.
Population Level (True but Unknown)
Simple linear regression assumes there's a true relationship in the population:
$$ y = \beta_0 + \beta_1x + \epsilon $$
Where:
- $\beta_0$ = true population intercept (unknown parameter)
- $\beta_1$ = true population slope (unknown parameter)
- $\epsilon$ = random error term with mean 0
These Greek letter parameters represent reality—the actual relationship that would exist if we could observe the entire population. They can be written in terms of population moments:
$$\beta_1 = \rho_{XY} \frac{\sigma_Y}{\sigma_X}$$
$$\beta_0 = \mu_Y - \beta_1 \mu_X$$
Where:
- $\rho_{XY} = \frac{\text{Cov}(X,Y)}{\sigma_X \sigma_Y}$ is the population correlation coefficient
- $\sigma_X = \sqrt{\text{Var}(X)}$ and $\sigma_Y = \sqrt{\text{Var}(Y)}$ are population standard deviations
- $\mu_X = E[X]$ and $\mu_Y = E[Y]$ are the population means
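To make these moment formulas concrete, here is a minimal numpy sketch that approximates a "population" with a very large simulated sample and checks that $\rho_{XY}\,\sigma_Y/\sigma_X$ recovers the slope that generated the data. The distributions and the generating coefficients (2.0 and 0.5) are illustrative assumptions, not part of the text above.

```python
import numpy as np

# Approximate the population with a very large simulated sample and compare
# rho * sigma_Y / sigma_X to the slope that generated the data.
# The generating coefficients (2.0, 0.5) and distributions are assumptions.
rng = np.random.default_rng(0)
N = 1_000_000
x = rng.normal(loc=10, scale=3, size=N)
y = 2.0 + 0.5 * x + rng.normal(loc=0, scale=2, size=N)

rho = np.corrcoef(x, y)[0, 1]
beta1 = rho * y.std() / x.std()        # beta_1 = rho_XY * sigma_Y / sigma_X
beta0 = y.mean() - beta1 * x.mean()    # beta_0 = mu_Y - beta_1 * mu_X

print(beta1, beta0)  # close to 0.5 and 2.0 with a sample this large
```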
Sample Level (What We Estimate)
From our sample data, we estimate the population parameters using least squares:
$$ \hat{y} = \hat{\beta}_0 + \hat{\beta}_1x $$
Where:
- $\hat{\beta}_0$ = sample estimate of the true intercept
- $\hat{\beta}_1$ = sample estimate of the true slope
- $\hat{y}$ = predicted value of $y$ for a given $x$
The "hat" notation ($\hat{}$) always means "estimate of" or "predicted."
$$ \hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2} $$
$$ \hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x} $$
Important distinction:
- $\epsilon$: The true, unobservable error in the population model
- $e_i = y_i - \hat{y}_i$: The observable residuals from our fitted model
Residuals are our best approximation of the true errors, but they're not the same thing.
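A minimal sketch of the least-squares formulas and the residual definition above, using a small made-up sample (the numbers are arbitrary):

```python
import numpy as np

# Least-squares estimates from the centered sums, plus the residuals.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

x_bar, y_bar = x.mean(), y.mean()
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0_hat = y_bar - beta1_hat * x_bar

y_hat = beta0_hat + beta1_hat * x  # predicted values
residuals = y - y_hat              # e_i = y_i - y_hat_i: observable stand-ins for the true errors

print(beta1_hat, beta0_hat)
print(residuals)
```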
Why We Care About the Distinction
This two-level framework matters because:
- Inference: We want to make statements about $\beta_1$ (the true slope), not just $\hat{\beta}_1$ (our estimate)
- Uncertainty: Our estimate $\hat{\beta}_1$ has sampling variability—different samples give different estimates
- Hypothesis Testing: We test claims about the true parameter $\beta_1$, using our estimate $\hat{\beta}_1$
Standard Error of the Slope Estimate
$$ SE(\hat{\beta}_1) = \sqrt{\frac{\frac{1}{n-2} \sum_{i=1}^n (y_i - \hat{y}_i)^2}{\sum_{i=1}^n (x_i - \bar{x})^2}} $$
t-Statistic for Testing $H_0: \beta_1 = 0$
$$ t = \frac{\hat{\beta}_1}{SE(\hat{\beta}_1)} = \frac{\hat{\beta}_1}{\sqrt{\frac{\hat{\sigma}^2}{\sum_{i=1}^n (x_i - \bar{x})^2}}} $$
where $\hat{\sigma}^2 = \frac{1}{n-2} \sum_{i=1}^n (y_i - \hat{y}_i)^2$ is the estimated error variance. This t-statistic follows a t-distribution with $(n-2)$ degrees of freedom under the null hypothesis.
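The following sketch puts the standard error, the t-statistic, and the $t_{n-2}$ reference distribution together on the same kind of made-up sample (scipy's `t.sf` supplies the tail probability; the data are arbitrary):

```python
import numpy as np
from scipy import stats

# Standard error of the slope, t-statistic, and two-sided p-value
# from the t distribution with n-2 degrees of freedom.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])
n = len(x)

x_bar, y_bar = x.mean(), y.mean()
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0_hat = y_bar - beta1_hat * x_bar
residuals = y - (beta0_hat + beta1_hat * x)

sigma2_hat = np.sum(residuals ** 2) / (n - 2)              # estimated error variance
se_beta1 = np.sqrt(sigma2_hat / np.sum((x - x_bar) ** 2))  # SE(beta1_hat)
t_stat = beta1_hat / se_beta1
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)            # two-sided p-value

print(se_beta1, t_stat, p_value)
```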
Doubling the Data Points
On $R^2$:
- $R^2$ measures the proportion of variance explained: $R^2 = 1 - \frac{SSE}{SST}$
- Adding data doesn't automatically change $R^2$—it depends on whether new points follow the same pattern
- If new points are consistent with the existing relationship, $R^2$ may stay similar or slightly improve
- If new points are more scattered, $R^2$ could decrease
On p-values:
- More data typically reduces $SE(\hat{\beta}_1)$: the denominator $\sum_{i=1}^n (x_i - \bar{x})^2$ grows as more points are added, while $\hat{\sigma}^2$ stays roughly the same
- A smaller standard error means a larger $|t|$-statistic for the same slope estimate
- A larger $|t|$-statistic means a smaller p-value, i.e. stronger evidence against $H_0: \beta_1 = 0$ (see the simulation sketch at the end of this section)
On the analytical solutions for $\beta_0$ and $\beta_1$:
- Doubling the number of data points has no effect on the true population parameters themselves
- $\beta_1 = \frac{\text{Cov}(X,Y)}{\text{Var}(X)}$ remains constant—it's a fixed property of the population relationship
- $\beta_0 = \mu_Y - \beta_1 \mu_X$ also remains constant—it depends only on population moments
- The correlation $\rho_{XY}$ and population standard deviations $\sigma_X, \sigma_Y$ are unchanged
On the least squares estimates:
- The actual values of $\hat{\beta}_0$ and $\hat{\beta}_1$ will change (new sample, new estimates)
Key Takeaway: More data gives us better estimates of the same underlying truth, not a different truth.
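A rough simulation sketch of the claims above about $SE(\hat{\beta}_1)$ and the p-value. The assumed true model $y = 1 + 0.3x + \text{noise}$ and the sample sizes 30 and 60 are illustrative choices, not anything implied by the text:

```python
import numpy as np
from scipy import stats

# Fit simple OLS on samples of size n and 2n drawn from the same assumed population
# and compare the slope estimate, its standard error, and the p-value.
rng = np.random.default_rng(42)

def fit_simple_ols(n):
    x = rng.uniform(0, 10, size=n)
    y = 1.0 + 0.3 * x + rng.normal(0, 2.0, size=n)
    x_bar, y_bar = x.mean(), y.mean()
    b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    b0 = y_bar - b1 * x_bar
    resid = y - (b0 + b1 * x)
    se = np.sqrt(np.sum(resid ** 2) / (n - 2) / np.sum((x - x_bar) ** 2))
    t = b1 / se
    p = 2 * stats.t.sf(abs(t), df=n - 2)
    return b1, se, p

for n in (30, 60):  # doubling the sample size
    b1, se, p = fit_simple_ols(n)
    print(f"n={n}: slope={b1:.3f}, SE={se:.3f}, p={p:.4f}")
# Both slopes estimate the same true value (0.3); the larger sample
# typically shows a smaller SE and a smaller p-value.
```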
SST, SSE, and R² — Tiny Worked Example
We have three data points: $(x_i, y_i) = (1,1), (2,2), (3,2)$. We fit the simple OLS model $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$ and compute SST, SSE, and R² step by step.
1) Mean of y
$\bar{y} = (1 + 2 + 2)/3 = 5/3 \approx 1.6667$
2) Total variation in y (SST)
$SST = \sum (y_i - \bar{y})^2$
$= (1 - 1.6667)^2 + (2 - 1.6667)^2 + (2 - 1.6667)^2$
$= 0.4444 + 0.1111 + 0.1111 = 0.6667$
3) Fit OLS line
Centered sums:
- $\bar{x} = (1 + 2 + 3)/3 = 2$
- $\sum (x_i-\bar{x})(y_i-\bar{y}) = 1.0$
- $\sum (x_i-\bar{x})^2 = 2$
Coefficients:
$\hat{\beta}_1 = 1.0 / 2 = 0.5$,
$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = 1.6667 - 0.5 \cdot 2 = 0.6667$
| $x$ | $y$ | $\hat{y} = 0.6667 + 0.5 \cdot x$ | $y - \hat{y}$ | $(y - \hat{y})^2$ |
|---|---|---|---|---|
| 1 | 1.0000 | 1.1667 | -0.1667 | 0.0278 |
| 2 | 2.0000 | 1.6667 | 0.3333 | 0.1111 |
| 3 | 2.0000 | 2.1667 | -0.1667 | 0.0278 |
| | | | $SSE = \sum (y - \hat{y})^2$ | 0.1667 |
4) Unexplained variation (SSE)
$SSE = \sum (y_i - \hat{y}_i)^2 = 0.1667$
5) R² (fraction of variance explained)
$R^2 = 1 - \frac{SSE}{SST} = 1 - \frac{0.1667}{0.6667} = 0.75$
Interpretation: the regression explains 75% of the variance in y.
Notes: If SSE = 0 ⇒ R² = 1 (perfect fit). If SSE ≈ SST ⇒ R² ≈ 0 (no explanatory power).
For OLS with an intercept evaluated on its own training data, SSE ≤ SST, so R² cannot be negative; R² < 0 (worse than predicting the mean) can only arise for models fit without an intercept or evaluated on new data.
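The same numbers can be checked in a few lines of numpy (this just reproduces the arithmetic above):

```python
import numpy as np

# Worked example: (1,1), (2,2), (3,2).
x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 2.0])

x_bar, y_bar = x.mean(), y.mean()
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)  # 0.5
beta0_hat = y_bar - beta1_hat * x_bar                                     # 0.6667
y_hat = beta0_hat + beta1_hat * x

sst = np.sum((y - y_bar) ** 2)   # 0.6667
sse = np.sum((y - y_hat) ** 2)   # 0.1667
r2 = 1 - sse / sst               # 0.75

print(sst, sse, r2)
```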
6) F-statistic for testing overall model significance
The F-statistic tests whether the regression model explains a significant amount of variance compared to a model with no predictors (i.e., just the intercept).
$F = \frac{MSR}{MSE} = \frac{SSR/k}{SSE/(n-k-1)}$
Where:
- $SSR$ = regression sum of squares (explained variation)
- $SSE$ = error sum of squares (unexplained variation)
- $k$ = number of predictors (excluding the intercept)
- $n$ = number of observations
- $MSR$ = mean square regression
- $MSE$ = mean square error
Since $SSR = SST - SSE$, we can substitute:
$F = \frac{(SST - SSE)/k}{SSE/(n-k-1)}$
For our example:
- $SST = 0.6667$
- $SSE = 0.1667$
- $SSR = SST - SSE = 0.6667 - 0.1667 = 0.5000$
- $k = 1$ (one predictor: $x$)
- $n = 3$ (three observations)
$F = \frac{0.5000/1}{0.1667/(3-1-1)} = \frac{0.5000}{0.1667/1} = \frac{0.5000}{0.1667} = 3.0$
Interpretation: This F-statistic of 3.0 (with 1 and 1 degrees of freedom) tests $H_0$: the regression model is no better than just predicting the mean $\bar{y}$ for all observations.
Note: In simple linear regression, $F = t^2$ where $t$ is the t-statistic for testing $H_0: \beta_1 = 0$.
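A short numerical check of the F-statistic, its p-value, and the $F = t^2$ identity for this example (scipy's `f.sf` gives the upper-tail probability; with 1 and 1 degrees of freedom the p-value is about 0.33, so the model is not significant here):

```python
import numpy as np
from scipy import stats

# F-statistic for the worked example, its p-value, and the F = t^2 check.
x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 2.0])
n, k = len(x), 1

x_bar, y_bar = x.mean(), y.mean()
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0_hat = y_bar - beta1_hat * x_bar
y_hat = beta0_hat + beta1_hat * x

sst = np.sum((y - y_bar) ** 2)
sse = np.sum((y - y_hat) ** 2)
ssr = sst - sse

F = (ssr / k) / (sse / (n - k - 1))            # 3.0
p_value = stats.f.sf(F, dfn=k, dfd=n - k - 1)  # ~0.33, far from significant

# t-statistic for H0: beta_1 = 0; F should equal t^2
se_beta1 = np.sqrt((sse / (n - 2)) / np.sum((x - x_bar) ** 2))
t_stat = beta1_hat / se_beta1

print(F, p_value, t_stat ** 2)  # F == t^2 up to floating-point rounding
```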
Interpreting F-statistic Values
The F-statistic is fundamentally a ratio of signal to noise:
$F = \frac{\text{Signal (explained variance per predictor)}}{\text{Noise (unexplained variance per residual df)}}$
Understanding what different F-values mean:
F ≈ 1 (close to 1)
- The explained variance per predictor (MSR) is about the same size as the unexplained variance per residual degree of freedom (MSE)
- Adding predictors doesn't improve the model much beyond just predicting the mean
- Typically → not significant
F < 1 (very small)
- The model explains even less variance per predictor than the leftover noise, which can easily happen by chance when there is no real relationship
- This is no evidence that the predictors help
- You'd fail to reject the null hypothesis (no relationship)
F moderately large (say, > 4 or 5 depending on n, k)
- Suggests predictors improve the model compared to noise
- You check the corresponding p-value:
- If p < 0.05 → statistically significant (reject null)
- If p is higher → still not significant, even if F > 1
F very large (≫ 10, 20, 50...)
- Strong evidence that predictors explain a lot of variance relative to noise
- P-value will usually be extremely small
- The regression model has strong explanatory power
Quick Rule of Thumb
- Small F (well below 1): explained variance per predictor is less than the noise → no evidence of a relationship
- Around 1: no improvement over mean-only model
- Big F (≫1): model explains meaningful variance (significance depends on df & p-value)
Why this makes sense: Under the null hypothesis (no real relationship), we'd expect F ≈ 1 on average because MSR would just reflect random variation in the predictors, while MSE reflects the true error variance. So the ratio should be around 1 if there's no real signal.
Remember: The degrees of freedom matter! An F of 4 might be significant with a large sample (many residual degrees of freedom) but not with a very small one, which is why we always check the p-value for formal hypothesis testing.
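To see how the degrees of freedom shift the bar, here is a small scipy sketch that prints the 5% critical value of F and the p-value of F = 4 for a single predictor across several residual degrees of freedom (the chosen df values are arbitrary):

```python
from scipy import stats

# The 5% critical value of the F distribution shrinks as residual degrees of
# freedom grow, so the same F = 4 can be significant with many observations
# but not with very few.
for dfd in (1, 5, 10, 30, 100):
    crit = stats.f.ppf(0.95, dfn=1, dfd=dfd)
    p = stats.f.sf(4.0, dfn=1, dfd=dfd)
    print(f"df = (1, {dfd:>3}): critical F at 5% = {crit:6.2f}, p-value for F=4 is {p:.3f}")
```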
What the F-statistic Actually Tests
The F-statistic in regression tests a very specific null hypothesis:
$H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0$
(all slopes are zero at the same time)
What this means:
- Under $H_0$, none of the independent variables help explain the dependent variable
- The model with predictors is no better than a simple mean-only model
- The regression equation reduces to just $\hat{y} = \hat{\beta}_0 = \bar{y}$
The alternative hypothesis is:
$H_A: \text{At least one } \beta_j \neq 0$
Decision Rule
- If F is large (and p-value small) → reject $H_0$. At least one predictor has explanatory power
- If F is small → fail to reject $H_0$. The predictors (slopes) jointly have no explanatory power
Important Nuances
- Joint test: The F-test examines whether ALL slopes are zero simultaneously, not individual slopes
- Which predictors matter: Even if F is significant, it doesn't tell you WHICH predictors are important (you need individual t-tests for that)
- Simple vs. multiple regression:
- In simple linear regression (one predictor): $F = t^2$, so F-test and t-test are equivalent
- In multiple regression: you could have significant F-test but some individual slopes not significant (and vice versa, though rarer)
The Big Picture: The F-test compares your full regression model against the simplest possible model (just predicting the mean). It answers: "Is this collection of predictors, taken together, better than knowing nothing at all?"
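As a sketch of the joint-vs-individual distinction, the following assumed simulation (two nearly collinear predictors with made-up coefficients) typically yields a highly significant F-test while neither individual t-test is significant; it uses statsmodels purely for convenience:

```python
import numpy as np
import statsmodels.api as sm

# Two nearly collinear predictors: jointly they explain y well (significant F),
# but collinearity inflates the individual standard errors, so each t-test
# often fails to reach significance on its own.
rng = np.random.default_rng(7)
n = 60
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)   # nearly identical to x1
y = 1.0 + 1.0 * x1 + rng.normal(scale=1.0, size=n)

X = sm.add_constant(np.column_stack([x1, x2]))
fit = sm.OLS(y, X).fit()

print("F-test p-value:", fit.f_pvalue)       # typically very small
print("individual t p-values:", fit.pvalues) # slope p-values often both large
```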