Linear Regression, From the Ground Up
The Problem
Here's a model that fits its training data perfectly:
```python
import numpy as np

np.random.seed(42)
x = np.linspace(0, 1, 10)
y = 0.5 * x + 0.1 + np.random.normal(0, 0.05, size=10)

# Fit a degree-9 polynomial to 10 points
coeffs = np.polyfit(x, y, deg=9)
```

The polynomial passes through (or near) every single training point. Training MSE is essentially zero. And if you ask it to predict on new points between those it was trained on — it's useless. It has memorized noise instead of learning signal.
This is the problem we're solving. A model that minimizes training loss is not necessarily a good model. Understanding why — and what to do about it — is what this post is about.
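To make "useless" concrete, here's a sketch (same setup as above; the held-out grid and its noiseless labels are my own choices) comparing training error to error on points between the training inputs:

```python
import numpy as np

np.random.seed(42)
x = np.linspace(0, 1, 10)
y = 0.5 * x + 0.1 + np.random.normal(0, 0.05, size=10)
coeffs = np.polyfit(x, y, deg=9)

# Held-out inputs between the training points, labeled by the true signal
x_test = np.linspace(0.05, 0.95, 50)
y_test = 0.5 * x_test + 0.1

train_mse = np.mean((np.polyval(coeffs, x) - y) ** 2)
test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
print(train_mse, test_mse)  # held-out error dwarfs training error
```

The exact numbers depend on the seed, but the gap between the two errors is the whole story.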
Formalizing the Loss
To fix the problem, we need to be precise about what "good fit" means. Define the mean squared error:

$$L(w) = \frac{1}{n} \lVert y - Xw \rVert_2^2$$

where $X \in \mathbb{R}^{n \times d}$ is the design matrix, $y \in \mathbb{R}^n$ is the target vector, and $w \in \mathbb{R}^d$ is the weight vector we're optimizing.
Why squared error? Two reasons. First, it's differentiable everywhere — unlike absolute error, which has a kink at zero that complicates optimization. Second, squaring penalizes large errors disproportionately, which is usually what we want: a prediction that's off by 10 should cost more than ten predictions off by 1.
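That asymmetry is easy to check with the numbers from the sentence above (a sketch, nothing but arithmetic):

```python
import numpy as np

big = 10.0                 # one prediction off by 10
smalls = np.full(10, 1.0)  # ten predictions off by 1

# Absolute error treats the two cases identically
abs_big, abs_smalls = abs(big), np.sum(np.abs(smalls))

# Squared error penalizes the single large miss ten times harder
sq_big, sq_smalls = big ** 2, np.sum(smalls ** 2)
print(abs_big, abs_smalls, sq_big, sq_smalls)  # 10.0 10.0 100.0 10.0
```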
The key geometric fact: $L(w)$ is a quadratic bowl in weight space. It's strictly convex (assuming $X$ has full column rank), which means there's exactly one global minimum and gradient descent will find it. This bowl is the stage for everything that follows.
Two Paths to the Minimum
The Normal Equations
Set $\nabla L(w) = 0$ and solve directly:

$$\hat{w} = (X^\top X)^{-1} X^\top y$$
This is the closed-form solution. It's exact, elegant, and $O(d^3)$ due to the matrix inversion. For moderate $n$ and $d$, it's fine. It breaks in two situations:
- Scale: When $n$ or $d$ is large, inverting $X^\top X$ is expensive and numerically unstable.
- Multicollinearity: When features are correlated, $X^\top X$ becomes near-singular and the inversion blows up. We'll fix this in the Ridge section.
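As a sketch of the closed form in NumPy: using `np.linalg.solve` on the normal equations, rather than forming an explicit inverse, is the numerically safer route (the synthetic data below is my own):

```python
import numpy as np

def fit_normal_equations(X, y):
    # Solve (X^T X) w = X^T y without explicitly inverting X^T X
    return np.linalg.solve(X.T @ X, X.T @ y)

# Sanity check on noiseless synthetic data: the true weights come back
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])
w_hat = fit_normal_equations(X, X @ w_true)
```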
Gradient Descent
Instead of solving directly, follow the gradient downhill. Compute:

$$\nabla L(w) = -\frac{2}{n} X^\top (y - Xw)$$

Then update:

$$w \leftarrow w - \eta \, \nabla L(w)$$

where $\eta$ is the learning rate. In NumPy:
```python
def gd_step(X, y, w, lr=0.01):
    n = len(y)
    grad = -2/n * X.T @ (y - X @ w)
    return w - lr * grad
```

Note the sign: the gradient points uphill, so we subtract it to go downhill. The learning rate controls step size — too large and you overshoot the minimum, too small and convergence is glacially slow. In practice, values around the 0.01 used above are a reasonable starting point.
Gradient descent doesn't require inverting anything, which is why it scales to problems where normal equations can't.
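Putting the step in a loop, a minimal sketch on synthetic data (the data, learning rate, and iteration count are assumptions of mine) lands on the same answer the normal equations give:

```python
import numpy as np

def gd_step(X, y, w, lr=0.01):
    n = len(y)
    grad = -2/n * X.T @ (y - X @ w)
    return w - lr * grad

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
w_true = np.array([0.5, -1.0, 2.0])
y = X @ w_true

w = np.zeros(3)
for _ in range(2000):
    w = gd_step(X, y, w, lr=0.1)

# Compare against the closed-form solution
w_closed = np.linalg.solve(X.T @ X, X.T @ y)
```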
Overfitting, Visualized
Return to the polynomial from Section 1. As we increase the degree, training loss keeps falling. Test loss tells a different story:
| Degree | Train MSE | Test MSE |
|---|---|---|
| 1 | 0.0031 | 0.0028 |
| 3 | 0.0018 | 0.0021 |
| 5 | 0.0009 | 0.0041 |
| 7 | 0.0003 | 0.089 |
| 9 | ~0.0000 | 4.7 |
The U-shape in test loss is the bias-variance tradeoff in action. A low-degree model underfits: it has high bias (wrong assumptions about the function shape) but low variance (stable across datasets). A high-degree model overfits: near-zero bias, but enormous variance — it's fitting the noise in this particular sample, not the underlying signal.
The fix isn't to use a lower-degree polynomial. It's to constrain the weights. A high-degree polynomial with small weights can't oscillate wildly — it's forced to be close to a simpler function. Controlling model complexity means controlling the scale of the weights, and that's exactly what regularization does.
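The degree sweep behind a table like the one above takes only a few lines. A sketch: the exact values depend on the seed and on my choice of held-out grid, but the pattern (training error falling, test error exploding) is robust:

```python
import numpy as np

np.random.seed(42)
x = np.linspace(0, 1, 10)
y = 0.5 * x + 0.1 + np.random.normal(0, 0.05, size=10)
x_test = np.linspace(0.05, 0.95, 50)
y_test = 0.5 * x_test + 0.1  # true signal at held-out points

results = {}
for deg in [1, 3, 5, 7, 9]:
    coeffs = np.polyfit(x, y, deg=deg)
    train_mse = np.mean((np.polyval(coeffs, x) - y) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    results[deg] = (train_mse, test_mse)
    print(deg, train_mse, test_mse)
```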
Ridge Regression (L2)
Add an L2 penalty on the weights:

$$L_{\text{ridge}}(w) = \frac{1}{n} \lVert y - Xw \rVert_2^2 + \lambda \lVert w \rVert_2^2$$

Setting $\nabla L_{\text{ridge}}(w) = 0$ gives updated normal equations:

$$\hat{w}_{\text{ridge}} = (X^\top X + n\lambda I)^{-1} X^\top y$$
Two things to notice:
- The inversion problem is fixed. When $\lambda > 0$, $X^\top X + n\lambda I$ is strictly positive definite, which means it's always invertible — regardless of multicollinearity. This is not a side effect; it's a feature.
- Weights are shrunk, not zeroed. The $\lambda \lVert w \rVert_2^2$ term pulls $w$ toward zero relative to the unregularized solution $\hat{w}_{\text{OLS}}$. Larger $\lambda$ = more shrinkage = simpler model.
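A sketch of the modified closed form. Minimizing $\frac{1}{n}\lVert y - Xw\rVert_2^2 + \lambda\lVert w\rVert_2^2$ gives $(X^\top X + n\lambda I)\,w = X^\top y$, where the $n$ factor comes from the $1/n$ in the MSE; the collinear test matrix below is my own construction:

```python
import numpy as np

def fit_ridge(X, y, lam=0.1):
    n, d = X.shape
    # Stationarity of (1/n)||y - Xw||^2 + lam*||w||^2:
    # (X^T X + n*lam*I) w = X^T y; invertible for any lam > 0
    return np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ y)

# Works even with perfectly collinear features, where the plain
# normal equations fail: the second column duplicates the first
rng = np.random.default_rng(1)
a = rng.normal(size=100)
b = rng.normal(size=100)
X = np.column_stack([a, a, b])
y = 3 * a + b
w = fit_ridge(X, y, lam=0.1)
```

By symmetry, Ridge splits the weight evenly between the two duplicated columns instead of blowing up.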
The geometric picture makes this concrete. The unconstrained OLS solution $\hat{w}_{\text{OLS}}$ sits at the center of the loss ellipses. Ridge adds a constraint: the solution must lie within an L2 ball whose radius shrinks as $\lambda$ grows. The constrained optimum is where the smallest ellipse touches the ball:
The contact point is smooth — there's nothing special about any axis direction, so Ridge shrinks all weights but rarely zeros any of them.
The gradient descent update for Ridge adds one term:
```python
def ridge_gd_step(X, y, w, lr=0.01, lam=0.1):
    n = len(y)
    grad = -2/n * X.T @ (y - X @ w) + 2 * lam * w
    return w - lr * grad
```

Lasso Regression (L1)
Swap the L2 penalty for an L1 penalty:

$$L_{\text{lasso}}(w) = \frac{1}{n} \lVert y - Xw \rVert_2^2 + \lambda \lVert w \rVert_1$$
There's no closed-form solution. The L1 norm is not differentiable at zero — the gradient doesn't exist where any $w_j = 0$. Instead, use subgradient methods or coordinate descent, which cycles through each weight and updates it while holding the others fixed (efficient because the L1 subproblem has a known closed-form solution per coordinate: the soft-thresholding operator).
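A sketch of coordinate descent with soft-thresholding, for the objective $\frac{1}{n}\lVert y - Xw\rVert_2^2 + \lambda\lVert w\rVert_1$; with that $1/n$ scaling, the per-coordinate threshold works out to $n\lambda/2$ (the function names and iteration count are my own):

```python
import numpy as np

def soft_threshold(a, t):
    # S(a, t) = sign(a) * max(|a| - t, 0)
    return np.sign(a) * np.maximum(np.abs(a) - t, 0.0)

def lasso_cd(X, y, lam=0.1, n_iter=100):
    """Coordinate descent for (1/n)||y - Xw||^2 + lam * ||w||_1."""
    n, d = X.shape
    w = np.zeros(d)
    z = (X ** 2).sum(axis=0)  # per-coordinate curvature ||x_j||^2
    for _ in range(n_iter):
        for j in range(d):
            # residual with feature j's own contribution added back
            r_j = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ r_j
            w[j] = soft_threshold(rho, n * lam / 2) / z[j]
    return w
```

On orthonormal features the update decouples, so one pass already gives the soft-thresholded least-squares answer.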
The geometry explains why Lasso produces sparse solutions. In 2D, the L1 constraint region is a diamond. The loss contours are the same ellipses as before, centered at $\hat{w}_{\text{OLS}}$. The constrained optimum is where the smallest ellipse first touches the diamond:
The ellipses hit a corner of the diamond — not a smooth edge, but a vertex where one coordinate is exactly zero. That's not a coincidence: the corners of the L1 ball are exactly the coordinate-sparse points. In 2D this is visually obvious; in higher dimensions it's the same principle — the extreme points (vertices) of the L1 ball always have most coordinates at zero, so the optimum tends to land there.
This is the key difference from Ridge: Lasso doesn't just shrink weights, it zeros them out. It's built-in feature selection.
| | Ridge (L2) | Lasso (L1) |
|---|---|---|
| Penalty | $\lambda \lVert w \rVert_2^2$ | $\lambda \lVert w \rVert_1$ |
| Solution | Shrinks all weights | Zeros out some weights |
| Geometry | Ball (smooth contact) | Diamond (corner contact) |
| Solver | Closed form | Coordinate descent / subgradient |
| Use when | Multicollinearity | Feature selection needed |
What Linear Regression Teaches You
Linear regression is the simplest model in the ML toolkit. But if you derive it carefully — from the loss surface to gradient descent to regularization — you've already touched every idea that matters in modern ML.
Loss surfaces, convexity, gradient descent, learning rate sensitivity, overfitting, bias-variance, L1 vs L2 penalties: all of it is here, in the setting where the math is clean enough to see clearly. The jump from linear regression to a deep network is mostly scale, not concept.
If you want both sparsity and shrinkage, L1 + L2 combined gives you Elastic Net — a practical choice when neither pure Lasso nor pure Ridge fits your problem.
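As a sketch of that combination, a subgradient step (the `np.sign` term is the L1 subgradient; all names, defaults, and the synthetic check are my own):

```python
import numpy as np

def enet_gd_step(X, y, w, lr=0.01, l1=0.05, l2=0.05):
    n = len(y)
    # MSE gradient + L2 gradient + an L1 subgradient (np.sign(0) = 0)
    grad = -2/n * X.T @ (y - X @ w) + 2 * l2 * w + l1 * np.sign(w)
    return w - lr * grad

# Quick check on synthetic data: the fit error should fall
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5])
w = np.ones(3)
start_mse = np.mean((y - X @ w) ** 2)
for _ in range(200):
    w = enet_gd_step(X, y, w, lr=0.01)
end_mse = np.mean((y - X @ w) ** 2)
```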
Next: the probabilistic view — MLE recovers least squares, and the Bayesian prior on weights recovers Ridge.