
Linear Regression, From the Ground Up

March 18, 2026 · ML · optimization · linear-algebra

The Problem

Here's a model that fits its training data perfectly:

import numpy as np

np.random.seed(42)
x = np.linspace(0, 1, 10)
y = 0.5 * x + 0.1 + np.random.normal(0, 0.05, size=10)

# Fit a degree-9 polynomial to 10 points
coeffs = np.polyfit(x, y, deg=9)
[Figure: degree-9 polynomial fit through the 10 noisy training points; train MSE ≈ 0]

The polynomial passes through (or near) every single training point. Training MSE is essentially zero. And if you ask it to predict on new points between those it was trained on — it's useless. It has memorized noise instead of learning signal.
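To see the failure concretely, we can evaluate the fitted polynomial at held-out points between the training inputs (a minimal sketch reusing the setup above):

```python
import numpy as np

np.random.seed(42)
x = np.linspace(0, 1, 10)
y = 0.5 * x + 0.1 + np.random.normal(0, 0.05, size=10)
coeffs = np.polyfit(x, y, deg=9)

# Evaluate at midpoints between the training inputs (never seen in training)
x_new = np.linspace(0.05, 0.95, 9)
y_true = 0.5 * x_new + 0.1                      # noiseless ground truth
mse_train = np.mean((y - np.polyval(coeffs, x)) ** 2)
mse_new = np.mean((y_true - np.polyval(coeffs, x_new)) ** 2)
print(mse_train, mse_new)  # held-out error dwarfs the near-zero train error
```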

This is the problem we're solving. A model that minimizes training loss is not necessarily a good model. Understanding why — and what to do about it — is what this post is about.

Formalizing the Loss

To fix the problem, we need to be precise about what "good fit" means. Define the mean squared error:

$$L(w) = \frac{1}{n} \sum_{i=1}^{n} (y_i - w^T x_i)^2 = \frac{1}{n} \|y - Xw\|_2^2$$

where $X \in \mathbb{R}^{n \times d}$ is the design matrix, $y \in \mathbb{R}^n$ is the target vector, and $w \in \mathbb{R}^d$ is the weight vector we're optimizing.

Why squared error? Two reasons. First, it's differentiable everywhere — unlike absolute error, which has a kink at zero that complicates optimization. Second, squaring penalizes large errors disproportionately, which is usually what we want: a prediction that's off by 10 should cost more than ten predictions off by 1.
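A two-line check of that arithmetic:

```python
import numpy as np

one_big = np.array([10.0])     # one prediction off by 10
ten_small = np.ones(10)        # ten predictions off by 1

print(np.sum(one_big ** 2), np.sum(ten_small ** 2))        # 100.0 10.0
print(np.sum(np.abs(one_big)), np.sum(np.abs(ten_small)))  # 10.0 10.0
```

Squared error charges the single big miss ten times as much; absolute error can't tell the two cases apart.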

The key geometric fact: $L(w)$ is a quadratic bowl in weight space. It's strictly convex (assuming $X$ has full column rank), which means there's exactly one global minimum and gradient descent will find it. This bowl is the stage for everything that follows.

Two Paths to the Minimum

The Normal Equations

Set $\nabla_w L = 0$ and solve directly:

$$\hat{w} = (X^TX)^{-1}X^Ty$$

This is the closed-form solution. It's exact, elegant, and costs $O(nd^2 + d^3)$: $O(nd^2)$ to form $X^TX$, plus $O(d^3)$ to invert the resulting $d \times d$ matrix. For moderate $n$ and $d$, it's fine. It breaks in two situations:

  1. Scale: When $n$ or $d$ is large, forming and inverting $X^TX$ is expensive and numerically unstable.
  2. Multicollinearity: When features are correlated, $X^TX$ becomes near-singular and the inversion blows up. We'll fix this in the Ridge section.
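In code, the closed form is a single solve. Calling `np.linalg.solve` on the normal equations is preferable to forming the inverse explicitly (a sketch with synthetic data):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))          # n = 100 samples, d = 3 features
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true + rng.normal(scale=0.1, size=100)

# Solve (X^T X) w = X^T y directly; more stable than computing the inverse
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(w_hat)  # close to w_true
```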

Gradient Descent

Instead of solving $\nabla_w L = 0$ directly, follow the gradient downhill. Compute:

$$\nabla_w L = -\frac{2}{n} X^T(y - Xw)$$

Then update:

$$w \leftarrow w - \eta \, \nabla_w L$$

where $\eta$ is the learning rate. In NumPy:

def gd_step(X, y, w, lr=0.01):
    n = len(y)
    grad = -2/n * X.T @ (y - X @ w)
    return w - lr * grad

Note the sign: the gradient points uphill, so we subtract it to go downhill. The learning rate $\eta$ controls step size — too large and you overshoot the minimum, too small and convergence is glacially slow. In practice, $\eta \in [10^{-4}, 10^{-1}]$ is a reasonable starting range.

Gradient descent doesn't require inverting anything, which is why it scales to problems where normal equations can't.
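To confirm the two paths meet, run the step function in a loop and compare against the closed-form solution (a sketch; the learning rate and iteration count are illustrative):

```python
import numpy as np

def gd_step(X, y, w, lr=0.01):
    n = len(y)
    grad = -2/n * X.T @ (y - X @ w)
    return w - lr * grad

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X @ np.array([2.0, -1.0]) + rng.normal(scale=0.1, size=200)

# Iterate from zero; the quadratic bowl guarantees convergence to the minimum
w = np.zeros(2)
for _ in range(2000):
    w = gd_step(X, y, w, lr=0.1)

w_closed = np.linalg.solve(X.T @ X, X.T @ y)
print(w, w_closed)  # the two solutions agree to many decimal places
```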

Overfitting, Visualized

Return to the polynomial from Section 1. As we increase the degree, training loss keeps falling. Test loss tells a different story:

| Degree | Train MSE | Test MSE |
| ------ | --------- | -------- |
| 1      | 0.0031    | 0.0028   |
| 3      | 0.0018    | 0.0021   |
| 5      | 0.0009    | 0.0041   |
| 7      | 0.0003    | 0.089    |
| 9      | ~0.0000   | 4.7      |

The U-shape in test loss is the bias-variance tradeoff in action. A low-degree model underfits: it has high bias (wrong assumptions about the function shape) but low variance (stable across datasets). A high-degree model overfits: near-zero bias, but enormous variance — it's fitting the noise in this particular sample, not the underlying signal.

The fix isn't to use a lower-degree polynomial. It's to constrain the weights. A high-degree polynomial with small weights can't oscillate wildly — it's forced to be close to a simpler function. Controlling model complexity means controlling the scale of the weights, and that's exactly what regularization does.
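The table's qualitative pattern is easy to reproduce (a sketch; exact numbers depend on the random seed and will differ from the table above):

```python
import numpy as np

rng = np.random.default_rng(42)
x_train = np.linspace(0, 1, 10)
y_train = 0.5 * x_train + 0.1 + rng.normal(0, 0.05, size=10)
x_test = np.linspace(0.05, 0.95, 50)          # unseen inputs
y_test = 0.5 * x_test + 0.1 + rng.normal(0, 0.05, size=50)

results = {}
for deg in (1, 3, 5, 7, 9):
    c = np.polyfit(x_train, y_train, deg=deg)
    train_mse = np.mean((y_train - np.polyval(c, x_train)) ** 2)
    test_mse = np.mean((y_test - np.polyval(c, x_test)) ** 2)
    results[deg] = (train_mse, test_mse)
    print(deg, train_mse, test_mse)
# Train MSE falls as degree grows; test MSE eventually blows up.
```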

Ridge Regression (L2)

Add an L2 penalty on the weights:

$$L_{\text{ridge}}(w) = \frac{1}{n} \|y - Xw\|_2^2 + \lambda \|w\|_2^2$$

Setting $\nabla_w L_{\text{ridge}} = 0$ gives updated normal equations:

$$\hat{w}_{\text{ridge}} = (X^TX + \lambda I)^{-1}X^Ty$$

Two things to notice:

  1. The inversion problem is fixed. When $\lambda > 0$, $X^TX + \lambda I$ is strictly positive definite, which means it's always invertible — regardless of multicollinearity. This is not a side effect; it's a feature.

  2. Weights are shrunk, not zeroed. The $\lambda I$ term pulls $\hat{w}_{\text{ridge}}$ toward zero relative to $\hat{w}$. Larger $\lambda$ = more shrinkage = simpler model.
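A sketch of the ridge closed form, with a deliberately duplicated column to show point 1 in action (plain normal equations would fail here):

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    d = X.shape[1]
    # X^T X + lam*I is positive definite for lam > 0, so solve() never fails
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
X[:, 3] = X[:, 2]                     # duplicate column: X^T X is singular
y = X @ np.array([1.0, -1.0, 0.5, 0.5]) + rng.normal(scale=0.1, size=50)

w = ridge_closed_form(X, y, lam=1.0)
print(w)  # finite weights; the duplicated columns share their coefficient
```

By symmetry, ridge gives the two identical columns equal weights instead of blowing up.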

The geometric picture makes this concrete. The unconstrained OLS solution $\hat{w}$ sits at the center of the loss ellipses. Ridge adds a constraint: the solution must lie within an L2 ball whose radius shrinks as $\lambda$ grows. The constrained optimum is where the smallest ellipse touches the ball:

[Figure: loss ellipses around ŵ (OLS) and the L2 ball; ŵ_ridge sits at the smooth contact point]

The contact point is smooth — there's nothing special about any axis direction, so Ridge shrinks all weights but rarely zeros any of them.

The gradient descent update for Ridge adds one term:

def ridge_gd_step(X, y, w, lr=0.01, lam=0.1):
    n = len(y)
    grad = -2/n * X.T @ (y - X @ w) + 2 * lam * w
    return w - lr * grad

Lasso Regression (L1)

Swap the L2 penalty for an L1 penalty:

$$L_{\text{lasso}}(w) = \frac{1}{n} \|y - Xw\|_2^2 + \lambda \|w\|_1$$

There's no closed-form solution. The L1 norm $\|w\|_1 = \sum_j |w_j|$ is not differentiable at zero — the gradient doesn't exist wherever any $w_j = 0$. Instead, use subgradient methods or coordinate descent, which cycles through each weight and updates it while holding the others fixed (efficient because the L1 subproblem has a known closed-form solution per coordinate: the soft-thresholding operator).
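A minimal coordinate-descent sketch for the loss above (the helper names `soft_threshold` and `lasso_cd` are mine, not a library API):

```python
import numpy as np

def soft_threshold(a, b):
    """S(a, b) = sign(a) * max(|a| - b, 0): the per-coordinate L1 solution."""
    return np.sign(a) * np.maximum(np.abs(a) - b, 0.0)

def lasso_cd(X, y, lam, n_iters=200):
    """Coordinate descent for (1/n)||y - Xw||^2 + lam * ||w||_1."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        for j in range(d):
            # Residual with feature j's current contribution removed
            r = y - X @ w + X[:, j] * w[j]
            a = 2.0 / n * (X[:, j] @ r)
            c = 2.0 / n * (X[:, j] @ X[:, j])
            w[j] = soft_threshold(a, lam) / c
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
# Only two of the five features actually matter
y = X @ np.array([2.0, 0.0, -1.5, 0.0, 0.0]) + rng.normal(scale=0.1, size=100)

w = lasso_cd(X, y, lam=0.2)
print(np.round(w, 3))  # irrelevant features driven exactly to zero
```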

The geometry explains why Lasso produces sparse solutions. In 2D, the L1 constraint region is a diamond. The loss contours are the same ellipses as before, centered at $\hat{w}$. The constrained optimum is where the smallest ellipse first touches the diamond:

[Figure: loss ellipses and the L1 diamond; ŵ_lasso lands on the vertex where w₁ = 0]

The ellipses hit the corner of the diamond — not a smooth edge, but the vertex where $w_1 = 0$. That's not a coincidence: the corners of the L1 ball are exactly the coordinate-sparse points. In 2D this is visually obvious; in higher dimensions it's the same principle — the extreme points (vertices) of the $\ell_1$ ball always have most coordinates at zero, so the optimum tends to land there.

This is the key difference from Ridge: Lasso doesn't just shrink weights, it zeros them out. It's built-in feature selection.

|          | Ridge (L2)              | Lasso (L1)                      |
| -------- | ----------------------- | ------------------------------- |
| Penalty  | $\lVert w \rVert_2^2$   | $\lVert w \rVert_1$             |
| Solution | Shrinks all weights     | Zeros out some weights          |
| Geometry | Ball (smooth contact)   | Diamond (corner contact)        |
| Solver   | Closed form             | Coordinate descent / subgradient |
| Use when | Multicollinearity       | Feature selection needed        |

What Linear Regression Teaches You

Linear regression is the simplest model in the ML toolkit. But if you derive it carefully — from the loss surface to gradient descent to regularization — you've already touched every idea that matters in modern ML.

Loss surfaces, convexity, gradient descent, learning rate sensitivity, overfitting, bias-variance, L1 vs L2 penalties: all of it is here, in the setting where the math is clean enough to see clearly. The jump from linear regression to a deep network is mostly scale, not concept.

If you want both sparsity and shrinkage, L1 + L2 combined gives you Elastic Net — a practical choice when neither pure Lasso nor pure Ridge fits your problem.


Next: the probabilistic view — MLE recovers least squares, and the Bayesian prior on weights recovers Ridge.