Regularization
Quick View
Video Tutorial:
- StatQuest with Josh Starmer: Regularization Part 1: Ridge (L2) Regression
- StatQuest with Josh Starmer: Regularization Part 2: Lasso (L1) Regression
- StatQuest with Josh Starmer: Regularization Part 3: Elastic Net Regression
What is Regularization?
Regularization is a technique used in machine learning and regression to prevent overfitting by adding a penalty to the loss function. The penalty discourages overly complex models and large coefficients, helping the model generalize better to unseen data.
Why Do We Need Regularization?
- Overfitting:
  - When a model becomes too complex, it memorizes the training data, leading to poor performance on test data.
  - Example: In polynomial regression, high-degree polynomials might perfectly fit the training data but fail to generalize.
- Ill-Conditioned Data:
  - When predictors are highly correlated or there are many predictors relative to observations, the regression model can become unstable.
- Bias-Variance Tradeoff:
  - Regularization introduces some bias but reduces variance, improving the model’s robustness.
Types of Regularization: Why Ridge, Lasso, and Elastic Net?
These are three popular regularization methods used for linear regression:
1. Ridge Regression (L2 Regularization):
- Penalty: Adds the squared magnitude of the coefficients to the loss function.
$$
\text{Loss Function: } \sum_{i=1}^n (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^p \beta_j^2
$$
  - $ \lambda $: Regularization parameter (controls penalty strength).
  - $ \beta_j $: Coefficients of predictors.
- Effect:
  - Shrinks coefficients towards zero, but never makes them exactly zero.
  - Reduces the impact of less important predictors without removing them entirely.
- Use Case:
  - Works well when many predictors are correlated.
2. Lasso Regression (L1 Regularization):
- Penalty: Adds the absolute value of the coefficients to the loss function.
$$
\text{Loss Function: } \sum_{i=1}^n (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^p |\beta_j|
$$
- Effect:
  - Can shrink some coefficients to exactly zero, effectively performing feature selection.
  - Helps in creating sparse models by keeping only the most relevant predictors.
- Use Case:
  - Useful when you expect only a subset of predictors to be important.
3. Elastic Net Regression:
- Penalty: Combines both the L1 (lasso) and L2 (ridge) penalties.
$$
\text{Loss Function: } \sum_{i=1}^n (y_i - \hat{y}_i)^2 + \lambda_1 \sum_{j=1}^p |\beta_j| + \lambda_2 \sum_{j=1}^p \beta_j^2
$$
- Effect:
  - Balances the strengths of Ridge and Lasso regression.
  - Retains the ability to perform feature selection (like Lasso) while handling multicollinearity (like Ridge).
- Use Case:
  - Best when there are many predictors and some are correlated, but feature selection is also desired.
Comparison of Regularization Methods:
| Method | Penalty | Effect on Coefficients | Use Case |
|---|---|---|---|
| Ridge | $ \beta_j^2 $ | Shrinks coefficients, but none become exactly zero. | Multicollinearity or many predictors. |
| Lasso | $ |\beta_j| $ | Can shrink some coefficients to exactly zero. | Feature selection with fewer predictors. |
| Elastic Net | $ |\beta_j| + \beta_j^2 $ | Combination of Ridge and Lasso. | Multicollinearity with feature selection. |
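As a minimal sketch of how these three penalties look in practice (assuming scikit-learn and a synthetic dataset; `alpha` plays the role of $ \lambda $, and `l1_ratio` sets the L1/L2 mix for Elastic Net):
```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso, ElasticNet

# Synthetic data: 100 samples, 10 features, only 3 of them truly informative
X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=10.0, random_state=0)

# alpha corresponds to the regularization strength lambda in the formulas above
ridge = Ridge(alpha=1.0).fit(X, y)                     # L2 penalty
lasso = Lasso(alpha=1.0).fit(X, y)                     # L1 penalty
enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)   # mix of L1 and L2

print("Ridge coefficients:      ", np.round(ridge.coef_, 2))  # all nonzero, shrunk
print("Lasso coefficients:      ", np.round(lasso.coef_, 2))  # some exactly zero
print("Elastic Net coefficients:", np.round(enet.coef_, 2))
```
Note that scikit-learn parameterizes Elastic Net with a single `alpha` plus an `l1_ratio`, rather than separate $ \lambda_1 $ and $ \lambda_2 $.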
Why Are They Discussed Together?
- All three are extensions of linear regression.
- They regularize the model to prevent overfitting, but they differ in the type of penalty they impose on the coefficients.
Ridge Regression
Ridge Regression Loss Function
Ridge regression modifies the Ordinary Least Squares (OLS) cost function by adding a penalty (regularization term) proportional to the sum of the squared coefficients:
$$
\text{Loss} = \sum_{i=1}^n \left( y_i - \hat{y}_i \right)^2 + \lambda \sum_{j=1}^p \beta_j^2
$$
Where:
- $ y_i $: Observed target value.
- $ \hat{y}_i $: Predicted value ($ \hat{y}_i = X_i \cdot \beta $).
- $ \beta_j $: Coefficients of the regression model.
- $ \lambda $: Regularization parameter (also called penalty parameter).
Ridge Coefficient Solution
The Ridge regression coefficients are obtained by solving the following optimization problem:
$$
\min_{\beta} \{ \|y - X\beta\|^2 + \lambda \|\beta\|^2 \}
$$
- Matrix Form:
  - Rewrite the problem in matrix notation:
$$
\min_{\beta} \{ (y - X\beta)^T (y - X\beta) + \lambda \beta^T \beta \}
$$
- Solution for $ \beta $:
  - Differentiating the loss function with respect to $ \beta $ and setting the derivative to zero, we get:
$$
\beta = \left( X^T X + \lambda I \right)^{-1} X^T y
$$
  - Here:
    - $ X^T X $: Gram matrix of the predictors.
    - $ \lambda I $: Regularization term, where $ I $ is the identity matrix.
    - $ \lambda $: Controls the trade-off between minimizing the squared error and penalizing large coefficients.
Why Add $ \lambda I $?
- The inverse of $ X^T X $ may not exist (or may be numerically unstable) when predictors are highly correlated or when there are fewer observations than predictors (multicollinearity).
- Adding $ \lambda I $ with $ \lambda > 0 $ ensures that $ X^T X + \lambda I $ is always invertible.
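A minimal NumPy sketch of the closed-form solution above (the data here is synthetic and purely illustrative):
```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                  # 50 observations, 3 predictors
beta_true = np.array([2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.5, size=50)

lam = 1.0                                     # regularization parameter lambda
p = X.shape[1]

# Ridge closed form: beta = (X^T X + lambda I)^{-1} X^T y
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# lambda = 0 recovers the OLS solution (when X^T X is invertible)
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

print("OLS coefficients:  ", beta_ols)
print("Ridge coefficients:", beta_ridge)      # shrunk towards zero
```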
Finding the Optimal $ \lambda $
- Grid Search with Cross-Validation:
  - Evaluate the model’s performance (e.g., Mean Squared Error) for different values of $ \lambda $.
  - Use k-fold cross-validation to select the $ \lambda $ that minimizes validation error.
- Mathematical Insight:
  - When $ \lambda = 0 $: Ridge reduces to Ordinary Least Squares (OLS).
  - As $ \lambda \to \infty $: Coefficients $ \beta \to 0 $ (the model becomes very simple).
- Validation-Based Optimization:
  - Define a range of $ \lambda $ values (e.g., $ \lambda = [0.001, 0.01, 0.1, 1, 10, 100] $).
  - For each $ \lambda $, perform cross-validation and select the value with the lowest error.
Example: Finding $ \lambda $ with Cross-Validation
Here’s Python code to find the optimal $ \lambda $ using grid search:
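The original code cell is not reproduced here; the following is a minimal sketch using scikit-learn's `GridSearchCV` (the dataset and the exact $ \lambda $ grid are assumptions, so the value it prints will generally differ from the original run's result shown below):
```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for the original (unshown) dataset
X, y = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=42)

# Candidate lambda values (scikit-learn calls this parameter alpha)
param_grid = {"alpha": np.logspace(-3, 2, 100)}

# 5-fold cross-validation, scored by (negative) mean squared error
search = GridSearchCV(Ridge(), param_grid,
                      scoring="neg_mean_squared_error", cv=5)
search.fit(X, y)

print(search.best_params_["alpha"])  # optimal lambda for this data and grid
```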
0.21209508879201902
Key Insights
- Ridge Regression Purpose:
  - Penalizes large coefficients to reduce model complexity and improve generalization.
- Finding $ \lambda $:
  - Perform grid search with cross-validation to select the $ \lambda $ that minimizes validation error.
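The code that produced the comparison below is not included; here is a hedged sketch of how OLS and Ridge might be compared on a two-feature dataset (the data, split, and `alpha` value are assumptions, so the numbers will not exactly match the original output):
```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic two-feature data standing in for the original dataset
X, y = make_regression(n_samples=200, n_features=2, noise=2.0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

ols = LinearRegression().fit(X_train, y_train)
ridge = Ridge(alpha=1.0).fit(X_train, y_train)

results = {
    "Mean Squared Error (OLS)": mean_squared_error(y_test, ols.predict(X_test)),
    "Mean Squared Error (Ridge)": mean_squared_error(y_test, ridge.predict(X_test)),
    "OLS Coefficients": ols.coef_,
    "Ridge Coefficients": ridge.coef_,
}
print(results)
```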
{'Mean Squared Error (OLS)': 3.747535481239866, 'Mean Squared Error (Ridge)': 3.7344119726941427, 'OLS Coefficients': array([4.0641917 , 3.31246222]), 'Ridge Coefficients': array([2.93270253, 2.91932805])}
In this example, Ridge regression slightly reduced the mean squared error, mainly by shrinking the coefficient of feature 1. After shrinkage, the contributions of feature 1 and feature 2 are almost the same (blue bars in the bar plot). The linear regression plot was updated by removing the effect of feature 2.
Compare 3 Methods
Code continues from the example above.
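The original cell is not shown; a sketch of how the comparison might be extended to Lasso and Elastic Net, continuing from the variables in the previous sketch (`X_train`, `X_test`, `y_train`, `y_test`, and `results`; the `alpha` and `l1_ratio` values are assumptions):
```python
from sklearn.linear_model import Lasso, ElasticNet
from sklearn.metrics import mean_squared_error

# Fit the two additional regularized models on the same training data
lasso = Lasso(alpha=0.1).fit(X_train, y_train)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X_train, y_train)

results.update({
    "Mean Squared Error (Lasso)": mean_squared_error(y_test, lasso.predict(X_test)),
    "Mean Squared Error (Elastic Net)": mean_squared_error(y_test, enet.predict(X_test)),
    "Lasso Coefficients": lasso.coef_,
    "Elastic Net Coefficients": enet.coef_,
})
print(results)
```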
{'Mean Squared Error (OLS)': 3.747535481239866, 'Mean Squared Error (Ridge)': 3.7344119726941427, 'Mean Squared Error (Lasso)': 3.681549054209485, 'Mean Squared Error (Elastic Net)': 3.810867636824328, 'OLS Coefficients': array([4.0641917 , 3.31246222]), 'Ridge Coefficients': array([2.93270253, 2.91932805]), 'Lasso Coefficients': array([3.3579283 , 2.97769008]), 'Elastic Net Coefficients': array([2.72149455, 2.71825096])}