The Two-Level Reality of Regression
Understanding regression requires distinguishing between what exists in reality versus what we can observe and estimate from our sample data.
Population Level (True but Unknown)
Simple linear regression assumes there's a true relationship in the population:
$$ y = \beta_0 + \beta_1x + \epsilon $$
Where:
- $\beta_0$ = true population intercept (unknown parameter)
- $\beta_1$ = true population slope (unknown parameter)
- $\epsilon$ = random error term with mean 0
These Greek letter parameters represent reality—the actual relationship that would exist if we could observe the entire population. They can be written in terms of population moments:
$$\beta_1 = \rho_{XY} \frac{\sigma_Y}{\sigma_X}$$
$$\beta_0 = \mu_Y - \beta_1 \mu_X$$
Where:
- $\rho_{XY} = \frac{\text{Cov}(X,Y)}{\sigma_X \sigma_Y}$ is the population correlation coefficient
- $\sigma_X = \sqrt{\text{Var}(X)}$ and $\sigma_Y = \sqrt{\text{Var}(Y)}$ are population standard deviations
- $\mu_X = E[X]$ and $\mu_Y = E[Y]$ are the population means
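To make these moment formulas concrete, here is a minimal numpy sketch that approximates a "population" with a very large simulated sample and checks that $\rho_{XY}\,\sigma_Y/\sigma_X$ recovers the slope that generated the data. The distributions and the generating coefficients (2.0 and 0.5) are illustrative assumptions, not part of the text above.

```python
import numpy as np

# Approximate the population with a very large simulated sample and compare
# rho * sigma_Y / sigma_X to the slope that generated the data.
# The generating coefficients (2.0, 0.5) and distributions are assumptions.
rng = np.random.default_rng(0)
N = 1_000_000
x = rng.normal(loc=10, scale=3, size=N)
y = 2.0 + 0.5 * x + rng.normal(loc=0, scale=2, size=N)

rho = np.corrcoef(x, y)[0, 1]
beta1 = rho * y.std() / x.std()        # beta_1 = rho_XY * sigma_Y / sigma_X
beta0 = y.mean() - beta1 * x.mean()    # beta_0 = mu_Y - beta_1 * mu_X

print(beta1, beta0)  # close to 0.5 and 2.0 with a sample this large
```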
Sample Level (What We Estimate)
From our sample data, we estimate the population parameters using least squares:
$$ \hat{y} = \hat{\beta}_0 + \hat{\beta}_1x $$
Where:
- $\hat{\beta}_0$ = sample estimate of the true intercept
- $\hat{\beta}_1$ = sample estimate of the true slope
- $\hat{y}$ = predicted value of $y$ for a given $x$
The "hat" notation ($\hat{}$) always means "estimate of" or "predicted."
$$ \hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2} $$
$$ \hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x} $$
Important distinction:
- $\epsilon$: The true, unobservable error in the population model
- $e_i = y_i - \hat{y}_i$: The observable residuals from our fitted model
Residuals are our best approximation of the true errors, but they're not the same thing.
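A minimal sketch of the least-squares formulas and the residual definition above, using a small made-up sample (the numbers are arbitrary):

```python
import numpy as np

# Least-squares estimates from the centered sums, plus the residuals.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

x_bar, y_bar = x.mean(), y.mean()
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0_hat = y_bar - beta1_hat * x_bar

y_hat = beta0_hat + beta1_hat * x  # predicted values
residuals = y - y_hat              # e_i = y_i - y_hat_i: observable stand-ins for the true errors

print(beta1_hat, beta0_hat)
print(residuals)
```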
Why We Care About the Distinction
This two-level framework matters because:
- Inference: We want to make statements about $\beta_1$ (the true slope), not just $\hat{\beta}_1$ (our estimate)
- Uncertainty: Our estimate $\hat{\beta}_1$ has sampling variability—different samples give different estimates
- Hypothesis Testing: We test claims about the true parameter $\beta_1$, using our estimate $\hat{\beta}_1$
Standard Error of the Slope Estimate
$$ SE(\hat{\beta}_1) = \sqrt{\frac{\frac{1}{n-2} \sum_{i=1}^n (y_i - \hat{y}_i)^2}{\sum_{i=1}^n (x_i - \bar{x})^2}} $$
t-Statistic for Testing $H_0: \beta_1 = 0$
$$ t = \frac{\hat{\beta}_1}{SE(\hat{\beta}_1)} = \frac{\hat{\beta}_1}{\sqrt{\frac{\hat{\sigma}^2}{\sum_{i=1}^n (x_i - \bar{x})^2}}} $$
where $\hat{\sigma}^2 = \frac{1}{n-2} \sum_{i=1}^n (y_i - \hat{y}_i)^2$ is the estimated error variance. This t-statistic follows a t-distribution with $(n-2)$ degrees of freedom under the null hypothesis.
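The following sketch puts the standard error, the t-statistic, and the $t_{n-2}$ reference distribution together on the same kind of made-up sample (scipy's `t.sf` supplies the tail probability; the data are arbitrary):

```python
import numpy as np
from scipy import stats

# Standard error of the slope, t-statistic, and two-sided p-value
# from the t distribution with n-2 degrees of freedom.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])
n = len(x)

x_bar, y_bar = x.mean(), y.mean()
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0_hat = y_bar - beta1_hat * x_bar
residuals = y - (beta0_hat + beta1_hat * x)

sigma2_hat = np.sum(residuals ** 2) / (n - 2)              # estimated error variance
se_beta1 = np.sqrt(sigma2_hat / np.sum((x - x_bar) ** 2))  # SE(beta1_hat)
t_stat = beta1_hat / se_beta1
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)            # two-sided p-value

print(se_beta1, t_stat, p_value)
```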
Doubling the Data Points
On $R^2$:
- $R^2$ measures the proportion of variance explained: $R^2 = 1 - \frac{SSE}{SST}$
- Adding data doesn't automatically change $R^2$—it depends on whether new points follow the same pattern
- If new points are consistent with the existing relationship, $R^2$ may stay similar or slightly improve
- If new points are more scattered, $R^2$ could decrease
On p-values:
- More data typically reduces $SE(\hat{\beta}_1)$: the denominator $\sum_{i=1}^n (x_i - \bar{x})^2$ grows as more points are added, while $\hat{\sigma}^2$ stays roughly the same
- A smaller standard error means a larger $|t|$-statistic for the same slope estimate
- A larger $|t|$-statistic means a smaller p-value, i.e. stronger evidence against $H_0: \beta_1 = 0$ (see the simulation sketch at the end of this section)
On the analytical solutions for $\beta_0$ and $\beta_1$:
- Doubling the number of data points has no effect on the true population parameters themselves
- $\beta_1 = \frac{\text{Cov}(X,Y)}{\text{Var}(X)}$ remains constant—it's a fixed property of the population relationship
- $\beta_0 = \mu_Y - \beta_1 \mu_X$ also remains constant—it depends only on population moments
- The correlation $\rho_{XY}$ and population standard deviations $\sigma_X, \sigma_Y$ are unchanged
On the least squares estimates:
- The actual values of $\hat{\beta}_0$ and $\hat{\beta}_1$ will change (new sample, new estimates)
Key Takeaway: More data gives us better estimates of the same underlying truth, not a different truth.
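A rough simulation sketch of the claims above about $SE(\hat{\beta}_1)$ and the p-value. The assumed true model $y = 1 + 0.3x + \text{noise}$ and the sample sizes 30 and 60 are illustrative choices, not anything implied by the text:

```python
import numpy as np
from scipy import stats

# Fit simple OLS on samples of size n and 2n drawn from the same assumed population
# and compare the slope estimate, its standard error, and the p-value.
rng = np.random.default_rng(42)

def fit_simple_ols(n):
    x = rng.uniform(0, 10, size=n)
    y = 1.0 + 0.3 * x + rng.normal(0, 2.0, size=n)
    x_bar, y_bar = x.mean(), y.mean()
    b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    b0 = y_bar - b1 * x_bar
    resid = y - (b0 + b1 * x)
    se = np.sqrt(np.sum(resid ** 2) / (n - 2) / np.sum((x - x_bar) ** 2))
    t = b1 / se
    p = 2 * stats.t.sf(abs(t), df=n - 2)
    return b1, se, p

for n in (30, 60):  # doubling the sample size
    b1, se, p = fit_simple_ols(n)
    print(f"n={n}: slope={b1:.3f}, SE={se:.3f}, p={p:.4f}")
# Both slopes estimate the same true value (0.3); the larger sample
# typically shows a smaller SE and a smaller p-value.
```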
SST, SSE, and R² — Tiny Worked Example
We have three data points: $(x_i, y_i) = (1,1), (2,2), (3,2)$. We fit the simple OLS model $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$ and compute SST, SSE, and R² step by step.
1) Mean of y
$\bar{y} = (1 + 2 + 2)/3 = 5/3 \approx 1.6667$
2) Total variation in y (SST)
$SST = \sum (y_i - \bar{y})^2$
$= (1 - 1.6667)^2 + (2 - 1.6667)^2 + (2 - 1.6667)^2$
$= 0.4444 + 0.1111 + 0.1111 = 0.6667$
3) Fit OLS line
Centered sums:
- $\bar{x} = (1 + 2 + 3)/3 = 2$
- $\sum (x_i-\bar{x})(y_i-\bar{y}) = 1.0$
- $\sum (x_i-\bar{x})^2 = 2$
Coefficients:
$\hat{\beta}_1 = 1.0 / 2 = 0.5$,
$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = 1.6667 - 0.5 \cdot 2 = 0.6667$
| $x$ | $y$ | $\hat{y} = 0.6667 + 0.5 \cdot x$ | $y - \hat{y}$ | $(y - \hat{y})^2$ |
|---|---|---|---|---|
| 1 | 1.0000 | 1.1667 | -0.1667 | 0.0278 |
| 2 | 2.0000 | 1.6667 | 0.3333 | 0.1111 |
| 3 | 2.0000 | 2.1667 | -0.1667 | 0.0278 |
| | | | $SSE = \sum (y - \hat{y})^2$ | 0.1667 |
4) Unexplained variation (SSE)
$SSE = \sum (y_i - \hat{y}_i)^2 = 0.1667$
5) R² (fraction of variance explained)
$R^2 = 1 - \frac{SSE}{SST} = 1 - \frac{0.1667}{0.6667} = 0.75$
Interpretation: the regression explains 75% of the variance in y.
Notes: If SSE = 0 ⇒ R² = 1 (perfect fit). If SSE ≈ SST ⇒ R² ≈ 0 (no explanatory power).
For OLS with an intercept evaluated on its own training data, SSE ≤ SST, so R² cannot be negative; R² < 0 (worse than predicting the mean) can only arise for models fit without an intercept or evaluated on new data.
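The same numbers can be checked in a few lines of numpy (this just reproduces the arithmetic above):

```python
import numpy as np

# Worked example: (1,1), (2,2), (3,2).
x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 2.0])

x_bar, y_bar = x.mean(), y.mean()
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)  # 0.5
beta0_hat = y_bar - beta1_hat * x_bar                                     # 0.6667
y_hat = beta0_hat + beta1_hat * x

sst = np.sum((y - y_bar) ** 2)   # 0.6667
sse = np.sum((y - y_hat) ** 2)   # 0.1667
r2 = 1 - sse / sst               # 0.75

print(sst, sse, r2)
```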
6) F-statistic for testing overall model significance
The F-statistic tests whether the regression model explains a significant amount of variance compared to a model with no predictors (i.e., just the intercept).
$F = \frac{MSR}{MSE} = \frac{SSR/k}{SSE/(n-k-1)}$
Where:
- $SSR$ = regression sum of squares (explained variation)
- $SSE$ = error sum of squares (unexplained variation)
- $k$ = number of predictors (excluding the intercept)
- $n$ = number of observations
- $MSR$ = mean square regression
- $MSE$ = mean square error
Since $SSR = SST - SSE$, we can substitute:
$F = \frac{(SST - SSE)/k}{SSE/(n-k-1)}$
For our example:
- $SST = 0.6667$
- $SSE = 0.1667$
- $SSR = SST - SSE = 0.6667 - 0.1667 = 0.5000$
- $k = 1$ (one predictor: $x$)
- $n = 3$ (three observations)
$F = \frac{0.5000/1}{0.1667/(3-1-1)} = \frac{0.5000}{0.1667/1} = \frac{0.5000}{0.1667} = 3.0$
Interpretation: This F-statistic of 3.0 (with 1 and 1 degrees of freedom) tests $H_0$: the regression model is no better than just predicting the mean $\bar{y}$ for all observations.
Note: In simple linear regression, $F = t^2$ where $t$ is the t-statistic for testing $H_0: \beta_1 = 0$.
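A short numerical check of the F-statistic, its p-value, and the $F = t^2$ identity for this example (scipy's `f.sf` gives the upper-tail probability; with 1 and 1 degrees of freedom the p-value is about 0.33, so the model is not significant here):

```python
import numpy as np
from scipy import stats

# F-statistic for the worked example, its p-value, and the F = t^2 check.
x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 2.0])
n, k = len(x), 1

x_bar, y_bar = x.mean(), y.mean()
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0_hat = y_bar - beta1_hat * x_bar
y_hat = beta0_hat + beta1_hat * x

sst = np.sum((y - y_bar) ** 2)
sse = np.sum((y - y_hat) ** 2)
ssr = sst - sse

F = (ssr / k) / (sse / (n - k - 1))            # 3.0
p_value = stats.f.sf(F, dfn=k, dfd=n - k - 1)  # ~0.33, far from significant

# t-statistic for H0: beta_1 = 0; F should equal t^2
se_beta1 = np.sqrt((sse / (n - 2)) / np.sum((x - x_bar) ** 2))
t_stat = beta1_hat / se_beta1

print(F, p_value, t_stat ** 2)  # F == t^2 up to floating-point rounding
```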
Interpreting F-statistic Values
The F-statistic is fundamentally a ratio of signal to noise:
$F = \frac{\text{Signal (explained variance per predictor)}}{\text{Noise (unexplained variance per residual df)}}$
Understanding what different F-values mean:
F ≈ 1 (close to 1)
- The explained variance per predictor (MSR) is about the same size as the unexplained variance per residual degree of freedom (MSE)
- Adding predictors doesn't improve the model much beyond just predicting the mean
- Typically → not significant
F < 1 (very small)
- The model explains even less variance per predictor than the leftover noise, which can easily happen by chance when there is no real relationship
- This is no evidence that the predictors help
- You'd fail to reject the null hypothesis (no relationship)
F moderately large (say, > 4 or 5 depending on n, k)
- Suggests predictors improve the model compared to noise
- You check the corresponding p-value:
- If p < 0.05 → statistically significant (reject null)
- If p is higher → still not significant, even if F > 1
F very large (≫ 10, 20, 50...)
- Strong evidence that predictors explain a lot of variance relative to noise
- P-value will usually be extremely small
- The regression model has strong explanatory power
Quick Rule of Thumb
- Small F (well below 1): explained variance per predictor is less than the noise → no evidence of a relationship
- Around 1: no improvement over mean-only model
- Big F (≫1): model explains meaningful variance (significance depends on df & p-value)
Why this makes sense: Under the null hypothesis (no real relationship), we'd expect F ≈ 1 on average because MSR would just reflect random variation in the predictors, while MSE reflects the true error variance. So the ratio should be around 1 if there's no real signal.
Remember: The degrees of freedom matter! An F of 4 might be significant with a large sample (many residual degrees of freedom) but not with a very small one, which is why we always check the p-value for formal hypothesis testing.
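To see how the degrees of freedom shift the bar, here is a small scipy sketch that prints the 5% critical value of F and the p-value of F = 4 for a single predictor across several residual degrees of freedom (the chosen df values are arbitrary):

```python
from scipy import stats

# The 5% critical value of the F distribution shrinks as residual degrees of
# freedom grow, so the same F = 4 can be significant with many observations
# but not with very few.
for dfd in (1, 5, 10, 30, 100):
    crit = stats.f.ppf(0.95, dfn=1, dfd=dfd)
    p = stats.f.sf(4.0, dfn=1, dfd=dfd)
    print(f"df = (1, {dfd:>3}): critical F at 5% = {crit:6.2f}, p-value for F=4 is {p:.3f}")
```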
What the F-statistic Actually Tests
The F-statistic in regression tests a very specific null hypothesis:
$H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0$
(all slopes are zero at the same time)
What this means:
- Under $H_0$, none of the independent variables help explain the dependent variable
- The model with predictors is no better than a simple mean-only model
- The regression equation reduces to just $\hat{y} = \hat{\beta}_0 = \bar{y}$
The alternative hypothesis is:
$H_A: \text{At least one } \beta_j \neq 0$
Decision Rule
- If F is large (and p-value small) → reject $H_0$. At least one predictor has explanatory power
- If F is small → fail to reject $H_0$. The predictors (slopes) jointly have no explanatory power
Important Nuances
- Joint test: The F-test examines whether ALL slopes are zero simultaneously, not individual slopes
- Which predictors matter: Even if F is significant, it doesn't tell you WHICH predictors are important (you need individual t-tests for that)
- Simple vs. multiple regression:
- In simple linear regression (one predictor): $F = t^2$, so F-test and t-test are equivalent
- In multiple regression: you could have significant F-test but some individual slopes not significant (and vice versa, though rarer)
The Big Picture: The F-test compares your full regression model against the simplest possible model (just predicting the mean). It answers: "Is this collection of predictors, taken together, better than knowing nothing at all?"
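As a sketch of the joint-vs-individual distinction, the following assumed simulation (two nearly collinear predictors with made-up coefficients) typically yields a highly significant F-test while neither individual t-test is significant; it uses statsmodels purely for convenience:

```python
import numpy as np
import statsmodels.api as sm

# Two nearly collinear predictors: jointly they explain y well (significant F),
# but collinearity inflates the individual standard errors, so each t-test
# often fails to reach significance on its own.
rng = np.random.default_rng(7)
n = 60
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)   # nearly identical to x1
y = 1.0 + 1.0 * x1 + rng.normal(scale=1.0, size=n)

X = sm.add_constant(np.column_stack([x1, x2]))
fit = sm.OLS(y, X).fit()

print("F-test p-value:", fit.f_pvalue)       # typically very small
print("individual t p-values:", fit.pvalues) # slope p-values often both large
```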