Aside: OLS via Algebra and Matrices

Why Are We Doing This?

In the GLS chapter, we talk about how GLS and OLS are solving related but different problems. OLS minimizes the sum of squared errors. GLS minimizes the same kind of distance, but measured relative to the covariance matrix of the errors, which is how it accounts for autocorrelation. If you want to understand what GLS is actually doing differently, it helps to first get clear on what OLS is doing at all.

So that’s what this aside is for. We’ll derive the OLS solution first using algebra, then using matrix notation, and then implement both in R using mtcars. The matrix version matters because GLS is basically OLS with an extra matrix in the mix. Once you see that, the connection between the two methods becomes a lot more concrete.

You don’t need to memorize any of this. But if you’ve ever wondered what lm() is actually solving, here it is.

The Algebraic Approach

In simple linear regression, we model \(y\) as a linear function of \(X\):

\[y_i = \beta_0 + \beta_1 X_i + \epsilon_i\]

where \(\beta_0\) is the intercept, \(\beta_1\) is the slope, and \(\epsilon_i\) is the error for observation \(i\). The goal is to find the values of \(\beta_0\) and \(\beta_1\) that make our predictions as good as possible. “As good as possible” means minimizing the sum of squared errors (SSE):

\[SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2\]

where \(\hat{y}_i = \beta_0 + \beta_1 X_i\) is what our model predicts. We take partial derivatives of SSE with respect to \(\beta_0\) and \(\beta_1\), set them to zero, and solve:

\[\frac{\partial SSE}{\partial \beta_0} = -2 \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 X_i) = 0\] \[\frac{\partial SSE}{\partial \beta_1} = -2 \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 X_i) X_i = 0\]

Solving these gives the OLS estimates:

\[\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(y_i - \bar{y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}\] \[\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{X}\]

That’s it. That’s what lm() is doing under the hood for a simple regression. Let’s verify it in R using mpg as a function of wt from mtcars.

Code
x <- mtcars$wt
y <- mtcars$mpg

beta1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
beta0 <- mean(y) - beta1 * mean(x)
c(intercept = beta0, slope = beta1)
intercept     slope 
37.285126 -5.344472 
Code
# Same thing with lm()
coef(lm(mpg ~ wt, data = mtcars))
(Intercept)          wt 
  37.285126   -5.344472 

Good. They match.

The Matrix Approach

The algebraic approach works fine for one predictor. But the moment you add a second predictor, the algebra gets tedious fast. The matrix approach handles any number of predictors cleanly.

In matrix notation, the linear model is:

\[\mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\epsilon}\]

where \(\mathbf{y}\) is an \(n \times 1\) vector of the response, \(\mathbf{X}\) is an \(n \times p\) matrix of predictors (with a column of ones prepended for the intercept), \(\boldsymbol{\beta}\) is a \(p \times 1\) vector of coefficients, and \(\boldsymbol{\epsilon}\) is an \(n \times 1\) vector of errors.

Minimizing the SSE in matrix form:

\[SSE = (\mathbf{y} - \mathbf{X} \boldsymbol{\beta})^T (\mathbf{y} - \mathbf{X} \boldsymbol{\beta})\]

Taking the derivative with respect to \(\boldsymbol{\beta}\) and setting it to zero gives the normal equations, \(\mathbf{X}^T \mathbf{X} \boldsymbol{\beta} = \mathbf{X}^T \mathbf{y}\). Solving for \(\boldsymbol{\beta}\):

\[\hat{\boldsymbol{\beta}} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}\]

That formula is the OLS solution. Everything that comes after it in the R code below is just implementing that one equation.

Implementation in R

Let’s work through it with the same mtcars example. First, set up \(\mathbf{X}\) and \(\mathbf{y}\):

Code
X <- as.matrix(mtcars$wt)
X <- cbind(1, X)   # prepend a column of ones for the intercept
y <- as.matrix(mtcars$mpg)
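
As an aside, you don't have to build \(\mathbf{X}\) by hand. A quick check with model.matrix(), which is what lm() uses internally to construct the design matrix from a formula, gives the same thing:

```r
# model.matrix() builds the design matrix from a formula,
# including the column of ones for the intercept
X_mm <- model.matrix(~ wt, data = mtcars)

# Same matrix as cbind(1, mtcars$wt), up to dimnames
all.equal(unname(X_mm), unname(cbind(1, mtcars$wt)))
```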

Now apply the formula \(\hat{\boldsymbol{\beta}} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}\) directly. In R, t() is the transpose, %*% is matrix multiplication, and solve() inverts a matrix.

Code
beta <- solve(t(X) %*% X) %*% t(X) %*% y
beta
          [,1]
[1,] 37.285126
[2,] -5.344472
Code
# And again, verify with lm()
coef(lm(mpg ~ wt, data = mtcars))
(Intercept)          wt 
  37.285126   -5.344472 

Same numbers. The matrix formula and lm() are solving the same problem.
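
One practical note: explicitly inverting \(\mathbf{X}^T \mathbf{X}\) with solve() is fine for teaching, but it can be numerically fragile when predictors are nearly collinear. In practice, lm() uses a QR decomposition instead. A sketch of two more stable ways to get the same answer:

```r
X <- cbind(1, mtcars$wt)
y <- mtcars$mpg

# Solve the normal equations without forming an explicit inverse
beta_ne <- solve(crossprod(X), crossprod(X, y))

# QR-based least squares, the approach lm() uses internally
beta_qr <- qr.solve(X, y)

cbind(normal_eq = beta_ne, qr = beta_qr)
```

Both return the same intercept and slope as before; they just get there by a better-conditioned route.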

Why This Matters for GLS

Here’s the connection. In OLS we minimize:

\[SSE = (\mathbf{y} - \mathbf{X} \boldsymbol{\beta})^T (\mathbf{y} - \mathbf{X} \boldsymbol{\beta})\]

In GLS, when the errors are correlated, we minimize instead:

\[(\mathbf{y} - \mathbf{X} \boldsymbol{\beta})^T \boldsymbol{\Sigma}^{-1} (\mathbf{y} - \mathbf{X} \boldsymbol{\beta})\]

where \(\boldsymbol{\Sigma}\) is the covariance matrix of the errors. If the errors are independent with constant variance, \(\boldsymbol{\Sigma}\) is just a scaled identity matrix and GLS collapses to OLS. When the errors are autocorrelated – as they often are with spatial data – \(\boldsymbol{\Sigma}\) carries that correlation structure, and the GLS solution accounts for it.
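
To make the collapse concrete, here is a small sketch of the GLS estimator \(\hat{\boldsymbol{\beta}} = (\mathbf{X}^T \boldsymbol{\Sigma}^{-1} \mathbf{X})^{-1} \mathbf{X}^T \boldsymbol{\Sigma}^{-1} \mathbf{y}\), plugging in an identity matrix for \(\boldsymbol{\Sigma}\). With real autocorrelated data you would estimate \(\boldsymbol{\Sigma}\) rather than assume it (e.g. with nlme::gls()); this is just the independent-errors special case:

```r
X <- cbind(1, mtcars$wt)
y <- mtcars$mpg

# With independent, equal-variance errors, Sigma is the identity
# (any scale factor on Sigma cancels out of the estimator)
Sigma <- diag(nrow(X))
Sigma_inv <- solve(Sigma)

# The GLS formula, which reduces to plain OLS here
beta_gls <- solve(t(X) %*% Sigma_inv %*% X) %*% t(X) %*% Sigma_inv %*% y
beta_gls
```

Run it and you get the same intercept and slope as every other method in this aside, because with an identity \(\boldsymbol{\Sigma}\) the GLS formula is the OLS formula.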

That’s the whole story. OLS is a special case of GLS. And if you understood the matrix formula above, you now understand what GLS is doing differently.