The Two-Level Reality of Regression

Understanding regression requires distinguishing between what exists in reality versus what we can observe and estimate from our sample data.

Population Level (True but Unknown)

Simple linear regression assumes there's a true relationship in the population:

$$ y = \beta_0 + \beta_1x + \epsilon $$

Where:

- $y$ is the response (dependent) variable and $x$ is the predictor (independent) variable
- $\beta_0$ is the true intercept and $\beta_1$ is the true slope
- $\epsilon$ is the random error term, capturing everything about $y$ that $x$ doesn't explain

These Greek letter parameters represent reality—the actual relationship that would exist if we could observe the entire population.

$$\beta_1 = \rho_{XY} \frac{\sigma_Y}{\sigma_X}$$
$$\beta_0 = \mu_Y - \beta_1 \mu_X$$

Where:

- $\rho_{XY}$ is the population correlation between $X$ and $Y$
- $\sigma_X$ and $\sigma_Y$ are the population standard deviations of $X$ and $Y$
- $\mu_X$ and $\mu_Y$ are the population means of $X$ and $Y$
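
As a quick numerical check (a sketch, not part of the original derivation), we can simulate a large pseudo-population with an assumed true line and confirm that $\rho_{XY}\,\sigma_Y/\sigma_X$ recovers the slope. All parameter values below are made up for illustration.

```python
# Sketch: simulate a large "population" with a known true line and check that
# rho * sigma_y / sigma_x recovers beta_1, and mu_y - beta_1 * mu_x recovers beta_0.
import numpy as np

rng = np.random.default_rng(0)
beta_0_true, beta_1_true = 2.0, 0.7        # assumed true parameters (illustrative)
N = 1_000_000                              # large sample standing in for "the population"

x = rng.normal(loc=10, scale=3, size=N)
y = beta_0_true + beta_1_true * x + rng.normal(scale=2, size=N)

rho = np.corrcoef(x, y)[0, 1]
beta_1_from_moments = rho * y.std() / x.std()
beta_0_from_moments = y.mean() - beta_1_from_moments * x.mean()

print(beta_1_from_moments)  # ~0.7
print(beta_0_from_moments)  # ~2.0
```
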
Sample Level (What We Estimate)

From our sample data, we estimate the population parameters using least squares:

$$ \hat{y} = \hat{\beta}_0 + \hat{\beta}_1x $$

Where:

- $\hat{y}$ is the predicted (fitted) value of $y$
- $\hat{\beta}_0$ and $\hat{\beta}_1$ are the least-squares estimates of $\beta_0$ and $\beta_1$

The "hat" notation ($\hat{}$) always means "estimate of" or "predicted."

$$ \hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2} $$
$$ \hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x} $$
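
A minimal sketch of these two formulas in NumPy, using made-up sample data and cross-checking against `np.polyfit`:

```python
# Sketch: the least-squares formulas above, written directly in NumPy.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # illustrative data
y = np.array([2.1, 2.9, 3.6, 4.8, 5.1])

x_bar, y_bar = x.mean(), y.mean()
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0_hat = y_bar - beta1_hat * x_bar
y_hat = beta0_hat + beta1_hat * x          # fitted values

# Cross-check against NumPy's built-in least-squares polynomial fit.
check_slope, check_intercept = np.polyfit(x, y, deg=1)
print(beta1_hat, beta0_hat)                # should match polyfit (up to rounding)
print(check_slope, check_intercept)
```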

Important distinction:

- Errors $\epsilon_i = y_i - (\beta_0 + \beta_1 x_i)$ are defined relative to the true, unknown population line.
- Residuals $e_i = y_i - \hat{y}_i$ are computed relative to the fitted line from our sample.

Residuals are our best approximation of the true errors, but they're not the same thing.
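
One way to see this is with a simulation, where the true errors are known only because we generate them ourselves (all values below are illustrative assumptions):

```python
# Sketch: compare residuals from a fitted line with the true errors we simulated.
import numpy as np

rng = np.random.default_rng(1)
beta0, beta1, n = 1.0, 0.5, 30             # assumed true parameters

x = rng.uniform(0, 10, size=n)
eps = rng.normal(scale=1.0, size=n)        # the true errors (unobservable in practice)
y = beta0 + beta1 * x + eps

b1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0_hat = y.mean() - b1_hat * x.mean()
residuals = y - (b0_hat + b1_hat * x)      # what we can actually compute

# Residuals track the true errors closely, but are not identical to them.
print(np.corrcoef(eps, residuals)[0, 1])   # high, but below 1
print(np.max(np.abs(eps - residuals)))     # small, but not 0
```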


Why We Care About the Distinction

This two-level framework matters because:

- Our estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ change from sample to sample, so we need to quantify how far they might sit from the true $\beta_0$ and $\beta_1$.
- Standard errors, t-statistics, and p-values only make sense as statements about sample estimates of unknown population parameters.

Standard Error of the Slope Estimate

$$ SE(\hat{\beta}_1) = \sqrt{\frac{\frac{1}{n-2} \sum_{i=1}^n (y_i - \hat{y}_i)^2}{\sum_{i=1}^n (x_i - \bar{x})^2}} $$

t-Statistic for Testing $H_0: \beta_1 = 0$

$$ t = \frac{\hat{\beta}_1}{SE(\hat{\beta}_1)} = \frac{\hat{\beta}_1}{\sqrt{\frac{\hat{\sigma}^2}{\sum_{i=1}^n (x_i - \bar{x})^2}}} $$

Here $\hat{\sigma}^2 = \frac{1}{n-2} \sum_{i=1}^n (y_i - \hat{y}_i)^2$ is the estimated error variance, so this is the same standard error as above. This t-statistic follows a t-distribution with $(n-2)$ degrees of freedom under the null hypothesis.
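
A short sketch, on made-up data, computing $SE(\hat{\beta}_1)$ and the t-statistic by hand and comparing with `scipy.stats.linregress`:

```python
# Sketch: standard error and t-statistic for the slope, computed from the formulas above.
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])   # illustrative data
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1, 5.8])
n = len(x)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)

sigma2_hat = np.sum(resid ** 2) / (n - 2)              # estimated error variance
se_b1 = np.sqrt(sigma2_hat / np.sum((x - x.mean()) ** 2))
t_stat = b1 / se_b1
p_value = 2 * stats.t.sf(np.abs(t_stat), df=n - 2)     # two-sided p-value

print(t_stat, p_value)
print(stats.linregress(x, y))                          # slope, stderr, pvalue should agree
```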


Doubling the Data Points

On $R^2$:

- Essentially unchanged in expectation. $R^2$ reflects the strength of the underlying relationship, not how much data we collect (see the simulation sketch after the Key Takeaway).

On p-values:

- They typically get smaller. Standard errors shrink roughly like $1/\sqrt{n}$, so the same effect size produces larger t-statistics and stronger evidence against $H_0$.

On the analytical solutions for $\beta_0$ and $\beta_1$:

- The population formulas $\beta_1 = \rho_{XY}\,\sigma_Y/\sigma_X$ and $\beta_0 = \mu_Y - \beta_1\mu_X$ don't change at all; they describe the population, not the sample.

On the least squares estimates:

- $\hat{\beta}_0$ and $\hat{\beta}_1$ stay centered on the same true values but become more precise, with less sample-to-sample variability.

Key Takeaway: More data gives us better estimates of the same underlying truth, not a different truth.
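
A minimal simulation sketch of these claims, under an assumed true model (all parameter values below are made up for illustration):

```python
# Sketch: repeatedly fit on samples of size n and 2n and compare slope estimates,
# standard errors, p-values, and R^2 on average.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
beta0, beta1, sigma = 1.0, 0.3, 2.0        # assumed truth, for illustration only

def fit_once(n):
    x = rng.uniform(0, 10, size=n)
    y = beta0 + beta1 * x + rng.normal(scale=sigma, size=n)
    res = stats.linregress(x, y)
    return res.slope, res.stderr, res.pvalue, res.rvalue ** 2

small = np.array([fit_once(30) for _ in range(2000)])
large = np.array([fit_once(60) for _ in range(2000)])

# Slope estimates stay centered on the same truth, but their spread shrinks;
# p-values shrink; R^2 barely moves.
for name, col in [("mean slope", 0), ("mean SE", 1), ("median p", 2), ("mean R^2", 3)]:
    small_val = np.median(small[:, col]) if name == "median p" else small[:, col].mean()
    large_val = np.median(large[:, col]) if name == "median p" else large[:, col].mean()
    print(f"{name}: n=30 -> {small_val:.4f}, n=60 -> {large_val:.4f}")
```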


SST, SSE, and R² — Tiny Worked Example

We have three data points: $(x_i, y_i) = (1,1), (2,2), (3,2)$. We fit the simple OLS model $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$ and compute SST, SSE, and $R^2$ step by step.

1) Mean of y

$\bar{y} = (1 + 2 + 2)/3 = 5/3 \approx 1.6667$

2) Total variation in y (SST)

$SST = \sum (y_i - \bar{y})^2$
$= (1 - 1.6667)^2 + (2 - 1.6667)^2 + (2 - 1.6667)^2$ $= 0.4444 + 0.1111 + 0.1111 = 0.6667$

3) Fit OLS line

Centered sums (with $\bar{x} = 2$):

$S_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y}) = (-1)(-0.6667) + (0)(0.3333) + (1)(0.3333) = 1.0$
$S_{xx} = \sum (x_i - \bar{x})^2 = (-1)^2 + 0^2 + 1^2 = 2.0$

Coefficients:

$\hat{\beta}_1 = S_{xy}/S_{xx} = 1.0 / 2 = 0.5$,     $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = 1.6667 - 0.5 \cdot 2 = 0.6667$

| $x$ | $y$ | $\hat{y} = 0.6667 + 0.5x$ | $y - \hat{y}$ | $(y - \hat{y})^2$ |
|-----|--------|--------|---------|--------|
| 1   | 1.0000 | 1.1667 | -0.1667 | 0.0278 |
| 2   | 2.0000 | 1.6667 |  0.3333 | 0.1111 |
| 3   | 2.0000 | 2.1667 | -0.1667 | 0.0278 |

$SSE = \sum (y - \hat{y})^2 = 0.1667$

4) Unexplained variation (SSE)

$SSE = \sum (y_i - \hat{y}_i)^2 = 0.1667$

5) R² (fraction of variance explained)

$R^2 = 1 - \frac{SSE}{SST} = 1 - \frac{0.1667}{0.6667} = 0.75$

Interpretation: the regression explains 75% of the variance in y.

Notes: If SSE = 0 ⇒ R² = 1 (perfect fit). If SSE ≈ SST ⇒ R² ≈ 0 (no explanatory power). If SSE > SST ⇒ R² < 0 (worse than predicting the mean; this can't happen for in-sample OLS with an intercept, but it can for other models or for out-of-sample predictions).
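
A small sketch reproducing the worked numbers in NumPy:

```python
# Sketch: verify SST, SSE, and R^2 for the three-point example.
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 2.0])

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

sst = np.sum((y - y.mean()) ** 2)          # 0.6667
sse = np.sum((y - y_hat) ** 2)             # 0.1667
r2 = 1 - sse / sst                         # 0.75

print(b1, b0)        # 0.5, 0.6667
print(sst, sse, r2)
```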

6) F-statistic for testing overall model significance

The F-statistic tests whether the regression model explains a significant amount of variance compared to a model with no predictors (i.e., just the intercept).

$F = \frac{MSR}{MSE} = \frac{SSR/k}{SSE/(n-k-1)}$

Where:

- $SSR = \sum (\hat{y}_i - \bar{y})^2$ is the regression (explained) sum of squares
- $MSR = SSR/k$ is the mean square for regression and $MSE = SSE/(n-k-1)$ is the mean squared error
- $k$ is the number of predictors (here $k = 1$) and $n$ is the number of observations (here $n = 3$)

Since $SSR = SST - SSE$, we can substitute:

$F = \frac{(SST - SSE)/k}{SSE/(n-k-1)}$

For our example, $SSR = SST - SSE = 0.6667 - 0.1667 = 0.5000$, so:

$F = \frac{0.5000/1}{0.1667/(3-1-1)} = \frac{0.5000}{0.1667/1} = \frac{0.5000}{0.1667} = 3.0$

Interpretation: This F-statistic of 3.0 (with 1 and 1 degrees of freedom) tests $H_0$: the regression model is no better than just predicting the mean $\bar{y}$ for all observations.

Note: In simple linear regression, $F = t^2$ where $t$ is the t-statistic for testing $H_0: \beta_1 = 0$.
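
A sketch that computes the F-statistic for the same three points and confirms the $F = t^2$ identity numerically:

```python
# Sketch: F-statistic for the three-point example, plus the F = t^2 check.
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 2.0])
n, k = len(x), 1

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

sse = np.sum((y - y_hat) ** 2)
sst = np.sum((y - y.mean()) ** 2)
ssr = sst - sse

F = (ssr / k) / (sse / (n - k - 1))                        # 3.0

sigma2_hat = sse / (n - 2)
se_b1 = np.sqrt(sigma2_hat / np.sum((x - x.mean()) ** 2))
t = b1 / se_b1                                             # sqrt(3) ~ 1.732

print(F, t ** 2)     # both 3.0 (up to floating point)
```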


Interpreting F-statistic Values

The F-statistic is fundamentally a ratio of signal to noise:

$F = \frac{\text{Signal (explained variance per predictor)}}{\text{Noise (unexplained variance per residual df)}}$

Understanding what different F-values mean:

F ≈ 1 (close to 1)

- The explained variance per predictor is about the same size as the residual noise; there is little or no evidence of a real relationship.

F < 1 (very small)

- The predictors explain even less variance than we'd expect from chance alone; essentially no signal.

F moderately large (say, > 4 or 5 depending on n, k)

- The explained variance per predictor clearly exceeds the noise level; this is often statistically significant, but the exact threshold depends on the degrees of freedom.

F very large (≫ 10, 20, 50...)

- Strong evidence that at least one predictor is genuinely related to the response; the p-value will typically be very small.

Quick Rule of Thumb

- If F is near (or below) 1, the model probably isn't doing much beyond predicting the mean.
- If F is well above 1 and the p-value is small, at least one predictor is carrying real signal.

Why this makes sense: Under the null hypothesis (no real relationship), we'd expect F ≈ 1 on average, because MSR would reflect only the chance alignment between the predictors and the noise, while MSE reflects the true error variance. So the ratio should hover around 1 when there is no real signal.

Remember: The degrees of freedom matter! The same F value can be significant with large residual degrees of freedom but not with small ones (with 1 and 10 df the 5% cutoff is about 4.96, while with 1 and 100 df it is about 3.94), which is why we always check the p-value for formal hypothesis testing.
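
A quick illustration of that dependence (assuming SciPy is available): the same F value maps to different p-values under different degrees of freedom.

```python
# Sketch: p-value of a fixed F under several (dfn, dfd) combinations.
from scipy import stats

F = 4.0
for dfn, dfd in [(1, 1), (1, 10), (1, 100), (3, 10), (3, 100)]:
    p = stats.f.sf(F, dfn, dfd)            # P(F_{dfn, dfd} > 4.0)
    print(f"F = {F}, df = ({dfn}, {dfd}): p = {p:.4f}")
```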


What the F-statistic Actually Tests

The F-statistic in regression tests a very specific null hypothesis:

$H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0$

(all slopes are zero at the same time)

What this means:

- Under $H_0$, none of the predictors has any linear relationship with $y$; the full model is no better than simply predicting $\bar{y}$ for every observation.
- The test is a joint one: it asks whether the predictors, as a group, explain anything at all.

The alternative hypothesis is:

$H_A: \text{At least one } \beta_j \neq 0$

Decision Rule

- If the p-value associated with the F-statistic is below your significance level (e.g., 0.05), reject $H_0$: at least one slope is nonzero.
- Otherwise, fail to reject $H_0$: taken together, the predictors don't explain significantly more variance than the mean-only model.

Important Nuances

- A significant F-test says the predictors are jointly useful; it does not say which individual predictor matters. That is what the individual t-tests are for.
- The F-test and the individual t-tests can disagree (for example, under multicollinearity the overall F can be significant while every individual t-test is not).
- In simple linear regression ($k = 1$), the F-test and the t-test for $\beta_1$ are equivalent, with $F = t^2$.

The Big Picture: The F-test compares your full regression model against the simplest possible model (just predicting the mean). It answers: "Is this collection of predictors, taken together, better than knowing nothing at all?"
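
A final sketch, on simulated data, that frames the F-test exactly this way: compare the SSE of the full model with the SSE of the mean-only model (which is just SST). All data-generating values are assumptions for illustration.

```python
# Sketch: the F-test as a full-model vs. mean-only-model comparison.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, k = 40, 1
x = rng.uniform(0, 10, size=n)
y = 1.0 + 0.4 * x + rng.normal(scale=1.5, size=n)   # assumed data-generating process

# Full model: intercept + slope.
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
sse_full = np.sum((y - (b0 + b1 * x)) ** 2)

# Restricted model: intercept only (predict the mean for everyone).
sse_restricted = np.sum((y - y.mean()) ** 2)        # this is SST

F = ((sse_restricted - sse_full) / k) / (sse_full / (n - k - 1))
p = stats.f.sf(F, k, n - k - 1)

print(F, p)   # large F, tiny p: the predictor beats the mean-only model
```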