Ridge Regression for Factor Decomposition

The Risk Decomposition sub-pill fits a ridge regression of portfolio returns onto factor returns. This article explains ridge regression (Hoerl & Kennard 1970), what its single tuning parameter does, and why ridge is the right choice for backtest-scale factor decomposition.

The basic regression

Given T rebalance periods, portfolio returns r_p,t, and factor returns r_i,t for i in {value, quality, momentum, growth}:

r_p,t = α + ∑_i β_i · r_i,t + ε_t

OLS estimates the β coefficients that minimise the sum of squared errors:

minimise: ∑_t ε_t²
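As a minimal sketch of the OLS fit above, the following uses NumPy's least-squares solver on synthetic data. The function name fit_ols, the simulated returns, and the "true" betas are illustrative assumptions, not part of FM103.

```python
import numpy as np

def fit_ols(factor_returns, portfolio_returns):
    """Return (alpha, betas) minimising the sum of squared errors."""
    T = factor_returns.shape[0]
    # Prepend a column of ones so the intercept alpha is estimated too.
    design = np.column_stack([np.ones(T), factor_returns])
    coef, *_ = np.linalg.lstsq(design, portfolio_returns, rcond=None)
    return coef[0], coef[1:]

# Synthetic data (illustrative only): T = 40 periods, 4 factors.
rng = np.random.default_rng(0)
factors = rng.normal(0.0, 0.02, size=(40, 4))
true_betas = np.array([0.5, 0.3, -0.2, 0.1])
portfolio = 0.001 + factors @ true_betas + rng.normal(0.0, 0.005, size=40)

alpha_hat, beta_hat = fit_ols(factors, portfolio)
residuals = portfolio - (alpha_hat + factors @ beta_hat)
print("alpha:", alpha_hat)
print("betas:", beta_hat)
```

At the OLS solution the residuals sum to zero and are uncorrelated with every factor, which is a quick way to check a fit.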

The ridge modification

Ridge adds an L2 penalty on the coefficient magnitudes:

minimise: ∑_t ε_t² + λ · ∑_i β_i²

where λ (the ridge penalty) controls the strength of the shrinkage. scikit-learn calls this parameter alpha, which is unrelated to the intercept α above; the intercept itself is not penalised. λ = 0 reproduces OLS; large λ pushes all coefficients toward zero.

Why shrinkage helps in small samples

OLS is unbiased: on average it recovers the true β. Its estimates have high variance, however, when the regressors are correlated or the sample is small. Ridge introduces bias (the β estimates are shrunk toward zero in expectation) in exchange for lower variance. Hoerl & Kennard showed that some λ > 0 always gives a lower mean squared error for the β estimates than OLS; this is the bias-variance trade-off. For typical backtest sample sizes (T = 20–60) and correlated factor returns, the variance reduction from a suitably chosen λ usually outweighs the added bias.
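The trade-off can be illustrated with a small Monte Carlo experiment. This is a sketch under stated assumptions: standardised synthetic factors with pairwise correlation 0.9, T = 30, unit noise, and λ = 1.0 chosen for illustration (not the FM103 default). The helper ridge_betas is hypothetical.

```python
import numpy as np

def ridge_betas(X, y, lam):
    """Closed-form ridge on centred data; the intercept is not penalised."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    K = X.shape[1]
    return np.linalg.solve(Xc.T @ Xc + lam * np.eye(K), Xc.T @ yc)

rng = np.random.default_rng(42)
T, K, rho, noise_sd, lam = 30, 4, 0.9, 1.0, 1.0
true_betas = np.array([0.5, 0.3, -0.2, 0.1])

# Equicorrelated factors: unit variance, every pair has correlation rho.
cov = np.full((K, K), rho) + (1.0 - rho) * np.eye(K)
chol = np.linalg.cholesky(cov)

n_trials = 2000
sq_err_ols = np.empty(n_trials)
sq_err_ridge = np.empty(n_trials)
for i in range(n_trials):
    X = rng.standard_normal((T, K)) @ chol.T
    y = X @ true_betas + noise_sd * rng.standard_normal(T)
    # lam = 0 is OLS; both estimates are compared with the known true betas.
    sq_err_ols[i] = np.sum((ridge_betas(X, y, 0.0) - true_betas) ** 2)
    sq_err_ridge[i] = np.sum((ridge_betas(X, y, lam) - true_betas) ** 2)

mse_ols = sq_err_ols.mean()
mse_ridge = sq_err_ridge.mean()
print(f"OLS MSE of beta:   {mse_ols:.3f}")
print(f"Ridge MSE of beta: {mse_ridge:.3f}")
```

In this setup the ridge estimates sit closer to the true betas on average. The size of the gain depends on the noise level, the factor correlation, and λ, so the numbers should not be read as typical of any real backtest.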

Closed-form solution

Ridge has a closed-form solution. With X as the regressor matrix (T × K) and y as the response vector, both centred so that the unpenalised intercept drops out:

β_ridge = (X^TX + λI)⁻¹ X^Ty

The λI term is the only change from OLS's (X^TX)⁻¹. The added λI stabilises the inverse when X^TX is near-singular (which happens with correlated factors).
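A minimal NumPy sketch of the closed form follows. It assumes the data are centred before solving and the intercept is recovered afterwards; the function name ridge_closed_form and the synthetic factors are illustrative, not FM103's code.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Return (alpha, betas) using (X^T X + lam I)^-1 X^T y on centred data."""
    x_mean = X.mean(axis=0)
    y_mean = y.mean()
    Xc = X - x_mean
    yc = y - y_mean
    K = X.shape[1]
    # solve() avoids forming the inverse explicitly.
    betas = np.linalg.solve(Xc.T @ Xc + lam * np.eye(K), Xc.T @ yc)
    alpha = y_mean - x_mean @ betas
    return alpha, betas

rng = np.random.default_rng(1)
T = 40
common = rng.standard_normal((T, 1))
# Four factors sharing a common driver, so the columns are strongly correlated.
X = common + 0.1 * rng.standard_normal((T, 4))
y = 0.2 + X @ np.array([0.5, 0.3, -0.2, 0.1]) + 0.5 * rng.standard_normal(T)

_, beta_ols = ridge_closed_form(X, y, 0.0)
_, beta_ridge = ridge_closed_form(X, y, 0.1)

# The condition number measures how close the matrix is to singular.
Xc = X - X.mean(axis=0)
gram = Xc.T @ Xc
cond_ols = np.linalg.cond(gram)
cond_ridge = np.linalg.cond(gram + 0.1 * np.eye(4))
print("condition number without / with lambda:", cond_ols, cond_ridge)
print("beta norm OLS / ridge:", np.linalg.norm(beta_ols), np.linalg.norm(beta_ridge))
```

Adding λ to every eigenvalue of X^TX always lowers the condition number, which is the stabilising effect described above.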

Choosing λ

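The default in FM103 is λ = 0.1, which is mild. How strong a given λ is depends on the scale of X^TX, so the bands below are indicative rather than exact thresholds. The trade-off: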

λ near 0: behaves like OLS; coefficients are most variable; in-sample R² is highest.
λ ~ 0.1: gentle shrinkage; coefficients are stabilised slightly; in-sample R² slightly lower.
λ > 1: aggressive shrinkage; all β pulled near zero; in-sample R² falls materially.

For backtest factor decomposition the goal is interpretable coefficients, not predictive R² maximisation. Mild λ is the right default.
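The effect of λ on coefficient size and fit can be seen by sweeping a few values. This sketch uses synthetic correlated factors and a hypothetical helper ridge_fit; the λ grid is chosen for illustration only.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Return (betas, in-sample R^2) for ridge on centred data."""
    Xc, yc = X - X.mean(axis=0), y - y.mean()
    betas = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ yc)
    r2 = 1.0 - np.sum((yc - Xc @ betas) ** 2) / np.sum(yc ** 2)
    return betas, r2

rng = np.random.default_rng(7)
T = 40
# A shared driver plus idiosyncratic noise gives correlated factor columns.
X = rng.standard_normal((T, 1)) + 0.5 * rng.standard_normal((T, 4))
y = X @ np.array([0.5, 0.3, -0.2, 0.1]) + 0.5 * rng.standard_normal(T)

lambdas = [0.0, 0.1, 1.0, 10.0, 100.0]
results = [(lam, *ridge_fit(X, y, lam)) for lam in lambdas]
for lam, betas, r2 in results:
    print(f"lambda={lam:>6}: |beta|={np.linalg.norm(betas):.3f}, R2={r2:.3f}")
```

Both the coefficient norm and the in-sample R² can only fall as λ grows; how quickly they fall depends on the scale and correlation of the factors.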

Cross-validation alternative

A more rigorous λ choice uses k-fold cross-validation on the regression sample. FM103 keeps λ fixed because the sample is too small (T = 20–60) for cross-validation to be reliable and because the sub-pill aims for stable, comparable results across users.
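For reference, a k-fold selection of λ might look like the sketch below. It is not what FM103 does. The helpers ridge_predict and kfold_mse, the candidate grid, and the synthetic data are assumptions, and the shuffled folds ignore time ordering, which may not suit return series.

```python
import numpy as np

def ridge_predict(X_train, y_train, X_test, lam):
    """Fit ridge on the training rows (centred) and predict the test rows."""
    x_mean, y_mean = X_train.mean(axis=0), y_train.mean()
    Xc, yc = X_train - x_mean, y_train - y_mean
    betas = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X_train.shape[1]), Xc.T @ yc)
    return y_mean + (X_test - x_mean) @ betas

def kfold_mse(X, y, lam, k=5, seed=0):
    """Average out-of-fold squared error for one candidate lambda."""
    order = np.random.default_rng(seed).permutation(len(y))
    fold_errors = []
    for test_idx in np.array_split(order, k):
        train_idx = np.setdiff1d(order, test_idx)
        pred = ridge_predict(X[train_idx], y[train_idx], X[test_idx], lam)
        fold_errors.append(np.mean((y[test_idx] - pred) ** 2))
    return float(np.mean(fold_errors))

rng = np.random.default_rng(3)
T = 40
X = rng.standard_normal((T, 1)) + 0.5 * rng.standard_normal((T, 4))
y = X @ np.array([0.5, 0.3, -0.2, 0.1]) + 0.5 * rng.standard_normal(T)

candidate_lambdas = [0.01, 0.1, 1.0, 10.0]
cv_scores = {lam: kfold_mse(X, y, lam) for lam in candidate_lambdas}
best_lambda = min(cv_scores, key=cv_scores.get)

# Re-running with different fold assignments shows how stable the choice is.
picks = [min(candidate_lambdas, key=lambda lam: kfold_mse(X, y, lam, seed=s))
         for s in range(10)]
print("best lambda (seed 0):", best_lambda)
print("picks across 10 fold assignments:", picks)
```

With only a few dozen observations each fold holds a handful of periods, so the selected λ can change with the fold assignment; the picks list makes that easy to inspect.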

Ridge vs. Lasso

Lasso (Tibshirani 1996) uses an L1 penalty instead of L2:

minimise: ∑_t ε_t² + λ · ∑_i |β_i|

Lasso's key property: it can set coefficients exactly to zero, performing variable selection. Ridge cannot. For factor decomposition we want all four factors represented in the output (even if one has small β); ridge is the right choice. For high-dimensional factor models with many candidate factors, Lasso's automatic selection becomes useful.
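The difference in behaviour can be sketched with a small coordinate-descent Lasso next to closed-form ridge. The solver below is an illustrative implementation for the objective written above, not a library routine, and the data and λ values are assumptions.

```python
import numpy as np

def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold, returning 0 inside the band."""
    return np.sign(value) * max(abs(value) - threshold, 0.0)

def lasso_coordinate_descent(Xc, yc, lam, n_sweeps=500):
    """Minimise sum(resid^2) + lam * sum(|beta|) on centred data."""
    K = Xc.shape[1]
    betas = np.zeros(K)
    col_sq = np.sum(Xc ** 2, axis=0)
    for _ in range(n_sweeps):
        for j in range(K):
            # Residual with factor j's current contribution removed.
            partial = yc - Xc @ betas + Xc[:, j] * betas[j]
            rho = Xc[:, j] @ partial
            betas[j] = soft_threshold(rho, lam / 2.0) / col_sq[j]
    return betas

rng = np.random.default_rng(5)
T = 40
X = rng.standard_normal((T, 1)) + 0.5 * rng.standard_normal((T, 4))
y = X @ np.array([0.5, 0.3, 0.0, 0.0]) + 0.5 * rng.standard_normal(T)
Xc, yc = X - X.mean(axis=0), y - y.mean()

# Above lam_max the Lasso solution is exactly the zero vector.
lam_max = 2.0 * np.max(np.abs(Xc.T @ yc))
fits = {}
for fraction in (0.01, 0.3, 1.1):
    lam = fraction * lam_max
    b_lasso = lasso_coordinate_descent(Xc, yc, lam)
    b_ridge = np.linalg.solve(Xc.T @ Xc + lam * np.eye(4), Xc.T @ yc)
    fits[fraction] = (b_lasso, b_ridge)
    print(f"lambda = {fraction} * lam_max")
    print("  lasso:", b_lasso)
    print("  ridge:", b_ridge)
```

At the largest penalty every Lasso coefficient is exactly zero, while the ridge coefficients are small but still non-zero, so all four factors stay in the ridge output.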

Elastic Net

Elastic Net combines L1 and L2: λ₁ · ∑ |β_i| + λ₂ · ∑ β_i². Useful when you want both shrinkage and variable selection. For the four-factor decomposition in FM103, the added complexity isn't worth it; pure ridge is sufficient.

Interpretation caveats

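Ridge coefficients are shrunk relative to OLS. The coefficient vector as a whole is smaller in norm, although with correlated factors an individual β can move in either direction. Don't compare ridge β to OLS β from another study without noting the shrinkage.
In-sample R² is slightly lower than OLS. The bias trade-off costs a bit of fit.
Variance decomposition uses the ridge β. The per-factor contribution numbers in the sub-pill are based on the shrunk coefficients, so the total variance attributed to the factors is somewhat lower than an OLS fit would report.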
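The article does not spell out how the sub-pill computes per-factor contributions. One common convention, assumed here purely for illustration, attributes to factor i the term β_i · Cov(r_i, fitted return); these terms sum to the variance of the fitted return. The helper names and data are hypothetical.

```python
import numpy as np

def ridge_betas(X, y, lam):
    """Closed-form ridge on centred data; lam = 0 gives OLS."""
    Xc, yc = X - X.mean(axis=0), y - y.mean()
    return np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ yc)

def variance_contributions(X, betas):
    """Per-factor terms beta_i * Cov(r_i, fitted); they sum to Var(fitted)."""
    fitted = X @ betas
    return np.array([betas[i] * np.cov(X[:, i], fitted)[0, 1]
                     for i in range(X.shape[1])])

rng = np.random.default_rng(11)
T = 40
X = rng.standard_normal((T, 1)) + 0.5 * rng.standard_normal((T, 4))
y = X @ np.array([0.5, 0.3, -0.2, 0.1]) + 0.5 * rng.standard_normal(T)

beta_ols = ridge_betas(X, y, 0.0)
beta_ridge = ridge_betas(X, y, 0.1)
contrib_ols = variance_contributions(X, beta_ols)
contrib_ridge = variance_contributions(X, beta_ridge)
print("OLS contributions:  ", contrib_ols, "total", contrib_ols.sum())
print("Ridge contributions:", contrib_ridge, "total", contrib_ridge.sum())
```

Under this convention the ridge total is never larger than the OLS total, which is one way to read the caveat above; individual contributions can be negative when a factor offsets the others.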
