
Linear Regression, From the Ground Up

March 18, 2026 · ML · optimization · linear-algebra

The Problem

Here's a model that fits its training data perfectly:

import numpy as np

np.random.seed(42)
x = np.linspace(0, 1, 10)
y = 0.5 * x + 0.1 + np.random.normal(0, 0.05, size=10)

# Fit a degree-9 polynomial to 10 points
coeffs = np.polyfit(x, y, deg=9)
[Figure: degree-9 polynomial fit through the 10 noisy training points; train MSE ≈ 0]

The polynomial passes through (or near) every single training point. Training MSE is essentially zero. And if you ask it to predict on new points between those it was trained on — it's useless. It has memorized noise instead of learning signal.
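To see the failure concretely, we can evaluate the fitted polynomial at held-out points between the training inputs (a minimal sketch reusing the setup above):

```python
import numpy as np

np.random.seed(42)
x = np.linspace(0, 1, 10)
y = 0.5 * x + 0.1 + np.random.normal(0, 0.05, size=10)
coeffs = np.polyfit(x, y, deg=9)

# Evaluate at midpoints between the training inputs (never seen in training)
x_new = np.linspace(0.05, 0.95, 9)
y_true = 0.5 * x_new + 0.1                      # noiseless ground truth
mse_train = np.mean((y - np.polyval(coeffs, x)) ** 2)
mse_new = np.mean((y_true - np.polyval(coeffs, x_new)) ** 2)
print(mse_train, mse_new)  # held-out error dwarfs the near-zero train error
```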

This is the problem we're solving. A model that minimizes training loss is not necessarily a good model. Understanding why — and what to do about it — is what this post is about.

Formalizing the Loss

To fix the problem, we need to be precise about what "good fit" means. Define the mean squared error:

$$L(w) = \frac{1}{n} \sum_{i=1}^{n} (y_i - w^T x_i)^2 = \frac{1}{n} \|y - Xw\|_2^2$$

where $X \in \mathbb{R}^{n \times d}$ is the design matrix, $y \in \mathbb{R}^n$ is the target vector, and $w \in \mathbb{R}^d$ is the weight vector we're optimizing.

Why squared error? Two reasons. First, it's differentiable everywhere — unlike absolute error, which has a kink at zero that complicates optimization. Second, squaring penalizes large errors disproportionately, which is usually what we want: a prediction that's off by 10 should cost more than ten predictions off by 1.
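A two-line check of that arithmetic:

```python
import numpy as np

one_big = np.array([10.0])     # one prediction off by 10
ten_small = np.ones(10)        # ten predictions off by 1

print(np.sum(one_big ** 2), np.sum(ten_small ** 2))        # 100.0 10.0
print(np.sum(np.abs(one_big)), np.sum(np.abs(ten_small)))  # 10.0 10.0
```

Squared error charges the single big miss ten times as much; absolute error can't tell the two cases apart.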

The key geometric fact: $L(w)$ is a quadratic bowl in weight space. It's strictly convex (assuming $X$ has full column rank), which means there's exactly one global minimum and gradient descent will find it. This bowl is the stage for everything that follows.

Two Paths to the Minimum

The Normal Equations

Set $\nabla_w L = 0$ and solve directly:

$$\hat{w} = (X^TX)^{-1}X^Ty$$

This is the closed-form solution. It's exact, elegant, and costs $O(nd^2 + d^3)$: $O(nd^2)$ to form $X^TX$, plus $O(d^3)$ to invert the resulting $d \times d$ matrix. For moderate $n$ and $d$, it's fine. It breaks in two situations:

  1. Scale: When $n$ or $d$ is large, forming and inverting $X^TX$ is expensive and numerically unstable.
  2. Multicollinearity: When features are correlated, $X^TX$ becomes near-singular and the inversion blows up. We'll fix this in the Ridge section.
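In code, the closed form is a single solve. Calling `np.linalg.solve` on the normal equations is preferable to forming the inverse explicitly (a sketch with synthetic data):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))          # n = 100 samples, d = 3 features
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true + rng.normal(scale=0.1, size=100)

# Solve (X^T X) w = X^T y directly; more stable than computing the inverse
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(w_hat)  # close to w_true
```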

Gradient Descent

Instead of solving $\nabla_w L = 0$ directly, follow the gradient downhill. Compute:

$$\nabla_w L = -\frac{2}{n} X^T(y - Xw)$$

Then update:

$$w \leftarrow w - \eta \, \nabla_w L$$

where $\eta$ is the learning rate. In NumPy:

def gd_step(X, y, w, lr=0.01):
    n = len(y)
    grad = -2/n * X.T @ (y - X @ w)
    return w - lr * grad

Note the sign: the gradient points uphill, so we subtract it to go downhill. The learning rate $\eta$ controls step size — too large and you overshoot the minimum, too small and convergence is glacially slow. In practice, $\eta \in [10^{-4}, 10^{-1}]$ is a reasonable starting range.

Gradient descent doesn't require inverting anything, which is why it scales to problems where normal equations can't.
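To confirm the two paths meet, run the step function in a loop and compare against the closed-form solution (a sketch; the learning rate and iteration count are illustrative):

```python
import numpy as np

def gd_step(X, y, w, lr=0.01):
    n = len(y)
    grad = -2/n * X.T @ (y - X @ w)
    return w - lr * grad

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X @ np.array([2.0, -1.0]) + rng.normal(scale=0.1, size=200)

# Iterate from zero; the quadratic bowl guarantees convergence to the minimum
w = np.zeros(2)
for _ in range(2000):
    w = gd_step(X, y, w, lr=0.1)

w_closed = np.linalg.solve(X.T @ X, X.T @ y)
print(w, w_closed)  # the two solutions agree to many decimal places
```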

Overfitting, Visualized

Return to the polynomial from Section 1. As we increase the degree, training loss keeps falling. Test loss tells a different story:

| Degree | Train MSE | Test MSE |
| ------ | --------- | -------- |
| 1      | 0.0031    | 0.0028   |
| 3      | 0.0018    | 0.0021   |
| 5      | 0.0009    | 0.0041   |
| 7      | 0.0003    | 0.089    |
| 9      | ~0.0000   | 4.7      |

The U-shape in test loss is the bias-variance tradeoff in action. A low-degree model underfits: it has high bias (wrong assumptions about the function shape) but low variance (stable across datasets). A high-degree model overfits: near-zero bias, but enormous variance — it's fitting the noise in this particular sample, not the underlying signal.

The fix isn't to use a lower-degree polynomial. It's to constrain the weights. A high-degree polynomial with small weights can't oscillate wildly — it's forced to be close to a simpler function. Controlling model complexity means controlling the scale of the weights, and that's exactly what regularization does.
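The table's qualitative pattern is easy to reproduce (a sketch; exact numbers depend on the random seed and will differ from the table above):

```python
import numpy as np

rng = np.random.default_rng(42)
x_train = np.linspace(0, 1, 10)
y_train = 0.5 * x_train + 0.1 + rng.normal(0, 0.05, size=10)
x_test = np.linspace(0.05, 0.95, 50)          # unseen inputs
y_test = 0.5 * x_test + 0.1 + rng.normal(0, 0.05, size=50)

results = {}
for deg in (1, 3, 5, 7, 9):
    c = np.polyfit(x_train, y_train, deg=deg)
    train_mse = np.mean((y_train - np.polyval(c, x_train)) ** 2)
    test_mse = np.mean((y_test - np.polyval(c, x_test)) ** 2)
    results[deg] = (train_mse, test_mse)
    print(deg, train_mse, test_mse)
# Train MSE falls as degree grows; test MSE eventually blows up.
```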

Ridge Regression (L2)

Add an L2 penalty on the weights:

$$L_{\text{ridge}}(w) = \frac{1}{n} \|y - Xw\|_2^2 + \lambda \|w\|_2^2$$

Setting $\nabla_w L_{\text{ridge}} = 0$ gives updated normal equations:

$$\hat{w}_{\text{ridge}} = (X^TX + \lambda I)^{-1}X^Ty$$

Two things to notice:

  1. The inversion problem is fixed. When $\lambda > 0$, $X^TX + \lambda I$ is strictly positive definite, which means it's always invertible — regardless of multicollinearity. This is not a side effect; it's a feature.

  2. Weights are shrunk, not zeroed. The $\lambda I$ term pulls $\hat{w}_{\text{ridge}}$ toward zero relative to $\hat{w}$. Larger $\lambda$ = more shrinkage = simpler model.
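A sketch of the ridge closed form, with a deliberately duplicated column to show point 1 in action (plain normal equations would fail here):

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    d = X.shape[1]
    # X^T X + lam*I is positive definite for lam > 0, so solve() never fails
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
X[:, 3] = X[:, 2]                     # duplicate column: X^T X is singular
y = X @ np.array([1.0, -1.0, 0.5, 0.5]) + rng.normal(scale=0.1, size=50)

w = ridge_closed_form(X, y, lam=1.0)
print(w)  # finite weights; the duplicated columns share their coefficient
```

By symmetry, ridge gives the two identical columns equal weights instead of blowing up.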

The geometric picture makes this concrete. The unconstrained OLS solution $\hat{w}$ sits at the center of the loss ellipses. Ridge adds a constraint: the solution must lie within an L2 ball whose radius shrinks as $\lambda$ grows. The constrained optimum is where the smallest ellipse touches the ball:

[Figure: loss ellipses around ŵ (OLS) and the L2 ball; ŵ_ridge sits at the smooth contact point]

The contact point is smooth — there's nothing special about any axis direction, so Ridge shrinks all weights but rarely zeros any of them.

The gradient descent update for Ridge adds one term:

def ridge_gd_step(X, y, w, lr=0.01, lam=0.1):
    n = len(y)
    grad = -2/n * X.T @ (y - X @ w) + 2 * lam * w
    return w - lr * grad

Lasso Regression (L1)

Swap the L2 penalty for an L1 penalty:

$$L_{\text{lasso}}(w) = \frac{1}{n} \|y - Xw\|_2^2 + \lambda \|w\|_1$$

There's no closed-form solution. The L1 norm $\|w\|_1 = \sum_j |w_j|$ is not differentiable at zero — the gradient doesn't exist wherever any $w_j = 0$. Instead, use subgradient methods or coordinate descent, which cycles through each weight and updates it while holding the others fixed (efficient because the L1 subproblem has a known closed-form solution per coordinate: the soft-thresholding operator).
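A minimal coordinate-descent sketch for the loss above (the helper names `soft_threshold` and `lasso_cd` are mine, not a library API):

```python
import numpy as np

def soft_threshold(a, b):
    """S(a, b) = sign(a) * max(|a| - b, 0): the per-coordinate L1 solution."""
    return np.sign(a) * np.maximum(np.abs(a) - b, 0.0)

def lasso_cd(X, y, lam, n_iters=200):
    """Coordinate descent for (1/n)||y - Xw||^2 + lam * ||w||_1."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        for j in range(d):
            # Residual with feature j's current contribution removed
            r = y - X @ w + X[:, j] * w[j]
            a = 2.0 / n * (X[:, j] @ r)
            c = 2.0 / n * (X[:, j] @ X[:, j])
            w[j] = soft_threshold(a, lam) / c
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
# Only two of the five features actually matter
y = X @ np.array([2.0, 0.0, -1.5, 0.0, 0.0]) + rng.normal(scale=0.1, size=100)

w = lasso_cd(X, y, lam=0.2)
print(np.round(w, 3))  # irrelevant features driven exactly to zero
```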

The geometry explains why Lasso produces sparse solutions. In 2D, the L1 constraint region is a diamond. The loss contours are the same ellipses as before, centered at $\hat{w}$. The constrained optimum is where the smallest ellipse first touches the diamond:

[Figure: loss ellipses and the L1 diamond; ŵ_lasso lands on the vertex where w₁ = 0]

The ellipses hit the corner of the diamond — not a smooth edge, but the vertex where $w_1 = 0$. That's not a coincidence: the corners of the L1 ball are exactly the coordinate-sparse points. In 2D this is visually obvious; in higher dimensions it's the same principle — the extreme points (vertices) of the $\ell_1$ ball always have most coordinates at zero, so the optimum tends to land there.

This is the key difference from Ridge: Lasso doesn't just shrink weights, it zeros them out. It's built-in feature selection.

|          | Ridge (L2)              | Lasso (L1)                      |
| -------- | ----------------------- | ------------------------------- |
| Penalty  | $\lVert w \rVert_2^2$   | $\lVert w \rVert_1$             |
| Solution | Shrinks all weights     | Zeros out some weights          |
| Geometry | Ball (smooth contact)   | Diamond (corner contact)        |
| Solver   | Closed form             | Coordinate descent / subgradient |
| Use when | Multicollinearity       | Feature selection needed        |

What Linear Regression Teaches You

Linear regression is the simplest model in the ML toolkit. But if you derive it carefully — from the loss surface to gradient descent to regularization — you've already touched every idea that matters in modern ML.

Loss surfaces, convexity, gradient descent, learning rate sensitivity, overfitting, bias-variance, L1 vs L2 penalties: all of it is here, in the setting where the math is clean enough to see clearly. The jump from linear regression to a deep network is mostly scale, not concept.

If you want both sparsity and shrinkage, L1 + L2 combined gives you Elastic Net — a practical choice when neither pure Lasso nor pure Ridge fits your problem.


Next: the probabilistic view — MLE recovers least squares, and the Bayesian prior on weights recovers Ridge.