Linear Regression
Vectors and Matrices
In NumPy, the dot product can be written as np.dot(w, x) or w @ x.
Vectors will always be column vectors. Thus:
$$
x =
\begin{bmatrix}
x_{1} \\
\vdots \\
x_{n}
\end{bmatrix}
, \quad w^T = [w_{1}, \ldots, w_{n}]
$$
$$
w^Tx = [w_{1}, \ldots, w_{n}]
\begin{bmatrix}
x_{1} \\
\vdots \\
x_{n}
\end{bmatrix}
= \sum_{i=1}^{n} w_{i}x_{i}
$$
$$
x =
\begin{bmatrix}
x_{1} \\
\vdots \\
x_{n}
\end{bmatrix}
, \quad
W =
\begin{bmatrix}
w_{1,1} & \ldots & w_{1,n} \\
\vdots & \ddots & \vdots \\
w_{m,1} & \ldots & w_{m,n}
\end{bmatrix}
$$
$$Wx =
\begin{bmatrix}
w_{1,1} & \ldots & w_{1,n} \\
\vdots & \ddots & \vdots \\
w_{m,1} & \ldots & w_{m,n}
\end{bmatrix}
\begin{bmatrix}
x_{1} \\
\vdots \\
x_{n}
\end{bmatrix} =
\begin{bmatrix}
\sum_{i=1}^{n} w_{1,i}x_{i} \\
\vdots \\
\sum_{i=1}^{n} w_{m,i}x_{i}
\end{bmatrix}
$$
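As a quick check of this notation, here is a minimal NumPy sketch (the array values are arbitrary illustrations) computing $w^T x$ and $Wx$:

```python
import numpy as np

# Column vector x with n = 3 entries and weight vector w (arbitrary values).
x = np.array([1.0, 2.0, 3.0])
w = np.array([0.5, -1.0, 2.0])

# Dot product w^T x: both spellings give the same scalar.
print(np.dot(w, x))   # 4.5
print(w @ x)          # 4.5

# Matrix W with m = 2 rows and n = 3 columns; Wx is a length-m vector.
W = np.array([[0.5, -1.0, 2.0],
              [1.0,  0.0, 1.0]])
print(W @ x)          # [4.5 4.0] -> each entry is sum_i w_{k,i} x_i
```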
Vector and Matrix Gradients
The gradient of a scalar function with respect to a vector or matrix is:
The symbol $\frac{\partial f}{\partial x_1}$ means "partial derivative of $f$ with respect to $x_1$".
$$
\frac{\partial f}{\partial x} =
\begin{bmatrix}
\frac{\partial f}{\partial x_1} \\
\vdots \\
\frac{\partial f}{\partial x_n}
\end{bmatrix}
,
\quad
\frac{\partial f}{\partial W} =
\begin{bmatrix}
\frac{\partial f}{\partial w_{1,1}} & \cdots & \frac{\partial f}{\partial w_{1,n}} \\
\vdots & \ddots & \vdots \\
\frac{\partial f}{\partial w_{m,1}} & \cdots & \frac{\partial f}{\partial w_{m,n}}
\end{bmatrix}
$$
(Figure: linear regression example, © Vladimir Nasteski)
$$ f(x) = w^T x + b = \sum_{j=0}^{D-1} w_j x_j + b $$
- Ideally, $f(x) = y$ (the prediction matches the target).
- Generally, we want to choose the weights and bias, w and b, in order to minimize the errors.
- The errors are the vertical green bars in the figure at right, ε = f(x) − y.
- Some of them are positive, some are negative. What does it mean to “minimize” them?
- $ f(x) = w^T x + b = \sum_{j=0}^{D-1} w_j x_j + b $
- Training token errors: using that notation, we can define a signed error term for every training token: $\varepsilon_i = f(x_i) - y_i$.
- The error term is positive for some tokens, negative for other tokens. What does it mean to minimize it?
Mean-squared error
Squaring the error emphasizes large errors and largely ignores small ones.
One useful criterion (not the only useful criterion, but perhaps the most common) of “minimizing the error” is to minimize the mean squared error:
$$ \mathcal{L} = \frac{1}{2n} \sum_{i=1}^{n} \varepsilon_i^2 = \frac{1}{2n} \sum_{i=1}^{n} (f(x_i) - y_i)^2 $$
The factor $\frac{1}{2}$ is included so that, when you differentiate $\mathcal{L}$, the 2 and the $\frac{1}{2}$ cancel each other.
MSE = Parabola
Notice that MSE is a non-negative quadratic function of $f(x_i) = w^T x_i + b$, therefore it's a non-negative quadratic function of $w$. Since it's a non-negative quadratic function of $w$, it has a unique minimum that you can compute in closed form! We won't do that today. $\mathcal{L} = \frac{1}{2n} \sum_{i=1}^{n} (f(x_i) - y_i)^2$
The iterative solution to linear regression (gradient descent):
- Instead of minimizing MSE in closed form, we’re going to use an iterative algorithm called gradient descent. It works like this:
- Start: random initial w and b (at t=0)
- Adjust w and b to reduce MSE (t=1)
- Repeat until you reach the optimum (t = ∞).
$ w \leftarrow w - \eta \frac{\partial \mathcal{L}}{\partial w} $
$ b \leftarrow b - \eta \frac{\partial \mathcal{L}}{\partial b} $
Finding the gradient
The loss function $ \mathcal{L} $ is defined as:
$$ \mathcal{L} = \frac{1}{2n} \sum_{i=1}^{n} L_i, \quad L_i = \varepsilon_i^2, \quad \varepsilon_i = w^T x_i + b - y_i $$
To find the gradient, we use the chain rule of calculus:
$$ \frac{\partial \mathcal{L}}{\partial w} = \frac{1}{2n} \sum_{i=1}^{n} \frac{\partial L_i}{\partial w}, \quad \frac{\partial L_i}{\partial w} = 2\varepsilon_i \frac{\partial \varepsilon_i}{\partial w}, \quad \frac{\partial \varepsilon_i}{\partial w} = x_i $$
Putting it all together,
$$ \frac{\partial \mathcal{L}}{\partial w} = \frac{1}{n} \sum_{i=1}^{n} \varepsilon_i x_i $$
- Start from random initial values of $w$ and $b$ (at $t = 0$).
- Adjust $w$ and $b$ according to:
$$ w \leftarrow w - \frac{\eta}{n} \sum_{i=1}^{n} \varepsilon_i x_i $$
$$ b \leftarrow b - \frac{\eta}{n} \sum_{i=1}^{n} \varepsilon_i $$
Intuition:
- Notice the sign:
- $ w \leftarrow w - \frac{\eta}{n} \sum_{i=1}^{n} \varepsilon_i x_i $
- If $ \varepsilon_i $ is positive ($ f(x_i) > y_i $), then we want to reduce $ f(x_i) $, so we make $ w $ less like $ x_i $
- If $ \varepsilon_i $ is negative ($ f(x_i) < y_i $), then we want to increase $ f(x_i) $, so we make $ w $ more like $ x_i $
Gradient Descent
- If $n$ is large, computing or differentiating MSE can be expensive.
- The stochastic gradient descent algorithm picks one training token $(x_i, y_i)$ at random (“stochastically”), and adjusts $w$ in order to reduce the error a little bit for that one token:
$$ w \leftarrow w - \eta \frac{\partial \mathcal{L}_i}{\partial w} $$
…where
$$ \mathcal{L}_i = \frac{1}{2}\varepsilon_i^2 = \frac{1}{2}(f(x_i) - y_i)^2 $$
Stochastic gradient descent
$$
\mathcal{L}_i = \frac{1}{2}\varepsilon_i^2 = \frac{1}{2}(w^T x_i + b - y_i)^2
$$
If we differentiate that, we discover that:
$$
\frac{\partial \mathcal{L}_i}{\partial w} = \varepsilon_i x_i,
\quad
\frac{\partial \mathcal{L}_i}{\partial b} = \varepsilon_i
$$
So the stochastic gradient descent algorithm is:
$$
w \leftarrow w - \eta \varepsilon_i x_i,
\quad
b \leftarrow b - \eta \varepsilon_i
$$
Code Example
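The original script isn't reproduced here, so below is a minimal sketch of the single-point version described in the PS note that follows (the synthetic 1-D data and names such as eta and n_steps are my own choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 1-D data: y = 2x + 1 plus noise (illustrative only).
n = 100
x = rng.uniform(-1, 1, size=n)
y = 2.0 * x + 1.0 + 0.1 * rng.normal(size=n)

w, b = 0.0, 0.0      # initial values (t = 0)
eta = 0.1            # learning rate
n_steps = 2000

for t in range(n_steps):
    i = rng.integers(n)            # pick one training token at random
    eps = w * x[i] + b - y[i]      # signed error for that one token
    w -= eta * eps * x[i]          # w <- w - eta * eps_i * x_i
    b -= eta * eps                 # b <- b - eta * eps_i

print(w, b)  # noisy estimates near 2 and 1
```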
PS: In the old script, I didn't fully understand how linear regression works, so I wrote it based on my personal understanding. You can see that I used one point from the set at a time to compute the loss and update $w$ and $b$. In a real application, this makes the updates very noisy, because a single point can be very unreliable. According to the equations above, we could instead use these updates:
- $ w \leftarrow w - \frac{\eta}{n} \sum_{i=1}^{n} \varepsilon_i x_i $
- $ b \leftarrow b - \frac{\eta}{n} \sum_{i=1}^{n} \varepsilon_i $
To use these update rules, there are two things you may want to change in the example:
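As a sketch of those two changes (reusing the names from the snippet above): compute the error over the whole training set and average the gradient before each update.

```python
w, b = 0.0, 0.0                    # restart from the initial values
for t in range(n_steps):
    eps = w * x + b - y            # errors for the whole training set at once
    w -= eta * np.mean(eps * x)    # w <- w - (eta/n) * sum_i eps_i * x_i
    b -= eta * np.mean(eps)        # b <- b - (eta/n) * sum_i eps_i
```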
How do they differ from each other?

| | Single-point loss | Summed loss |
|---|---|---|
| Last 5 rounds after many iterations | 1.1539, 2.8111 | 0.9947, 2.8171 |
| Explained | Because the result from a single point carries significant noise, the jitter is never eliminated, making it difficult to settle at the optimum. | Because the loss is based on the entire dataset, the updates converge easily to the optimum. |
This is why selecting an appropriate batch size is crucial in machine-learning training. In the first example, the batch size is 1, while in the second it equals the size of the training set. A very large batch size, such as the entire dataset, can significantly increase the computational load, especially if the model is complex, and, depending on the quality of the dataset, it may also cause the model to get stuck in local optima.
Perceptron
Linear classifier: Definition
A linear classifier is defined by
$$
f(x) = \text{argmax}\,(Wx + b)
$$
where:
$$
Wx + b =
\begin{bmatrix}
w_{1,1} & \ldots & w_{1,d} \\
\vdots & \ddots & \vdots \\
w_{v,1} & \ldots & w_{v,d}
\end{bmatrix}
\begin{bmatrix}
x_{1} \\
\vdots \\
x_{d}
\end{bmatrix}
+
\begin{bmatrix}
b_{1} \\
\vdots \\
b_{v}
\end{bmatrix}=
\begin{bmatrix}
w_{1}^T x + b_{1} \\
\vdots \\
w_{v}^T x + b_{v}
\end{bmatrix}
$$
$w_k, b_k$ are the weight vector and bias corresponding to class $k$, and the argmax function finds the index of the element of the vector $Wx + b$ with the largest value.
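A minimal NumPy sketch of this definition, with made-up numbers for $v = 3$ classes and $d = 2$ features:

```python
import numpy as np

# v = 3 classes, d = 2 features (arbitrary illustrative values).
W = np.array([[ 1.0, -0.5],
              [ 0.2,  0.8],
              [-1.0,  0.3]])
b = np.array([0.1, 0.0, -0.2])
x = np.array([0.5, 2.0])

scores = W @ x + b          # vector of w_k^T x + b_k, one entry per class
f_x = np.argmax(scores)     # predicted class index
print(scores, f_x)
```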
Gradient descent
Suppose we have training tokens $(x_i, y_i)$, and we have some initial class vectors $w_1$ and $w_2$. We want to update them as
$$
w_1 \leftarrow w_1 - \eta \frac{\partial \mathcal{L}}{\partial w_1}
$$
$$
w_2 \leftarrow w_2 - \eta \frac{\partial \mathcal{L}}{\partial w_2}
$$
…where $\mathcal{L}$ is some loss function. What loss function makes sense?
Transformation
Transformations can be a powerful tool for addressing various issues with model assumptions or improving the model fit. Applying transformations can help meet the necessary assumptions of linear regression, which are linearity, constant variance, normality of residuals, and independence of errors.
Common Transformations: Log, Square Root, Inverse, and Box-Cox
How Box-Cox Transformation Helps
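The original simulation code isn't shown here; the sketch below reproduces the kind of experiment described (quadratic, skewed data), though the exact numbers such as $R^2$ and $\lambda$ will differ from those reported below.

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)

# Simulated non-linear, skewed data: quadratic trend with multiplicative noise.
x = np.linspace(1, 10, 200).reshape(-1, 1)
y = (x.ravel() ** 2) * np.exp(0.2 * rng.normal(size=200))

# Linear regression on the raw data.
raw_model = LinearRegression().fit(x, y)
print("raw R^2:", r2_score(y, raw_model.predict(x)))

# Box-Cox transform the (strictly positive) response, then refit.
y_bc, lam = stats.boxcox(y)
bc_model = LinearRegression().fit(x, y_bc)
print("transformed R^2:", r2_score(y_bc, bc_model.predict(x)))
print("optimal lambda:", lam)
```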
The simulated data demonstrates how the Box-Cox transformation can significantly improve the linear regression results for non-linear and skewed data:
- Raw Data Results:
  - $R^2$ (Coefficient of Determination): 0.94
  - The quadratic trend with noise leads to a poor linear regression fit.
- Transformed Data Results:
  - $R^2$: 0.99
  - The Box-Cox transformation stabilizes variance and linearizes the relationship, improving the regression fit.
- Optimal Lambda:
  - The optimal Box-Cox parameter $\lambda = 0.35$ ensures the best transformation.
This shows how the Box-Cox transformation is beneficial for datasets where relationships are non-linear or the dependent variable is highly skewed.
- Box-Cox Transformation:
The Box-Cox transformation $ y_i^{(bc)} $ is defined as:
$$
y_i^{(bc)} = \begin{cases}
\frac{y_i^\lambda - 1}{\lambda} & \text{if } \lambda \neq 0 \\
\ln(y_i) & \text{if } \lambda = 0
\end{cases}
$$
- This transformation is defined for $ y_i > 0 $.
- It requires choosing an appropriate $ \lambda $, which can be done using statistical libraries that maximize the likelihood or minimize skewness.
- Inverse Box-Cox Transformation:
The inverse transformation, which is used to revert the transformed data back to its original scale, is given by:
$$
y_i = \begin{cases}
\left( | \lambda y_i^{(bc)} + 1 |^{\frac{1}{\lambda}} \right) & \text{if } \lambda \neq 0 \\
e^{y_i ^{(bc)}} & \text{if } \lambda = 0
\end{cases}
$$
- This transformation allows the original data values to be reconstructed from the transformed values, which is useful for interpretation and comparison with the original data scale after performing analyses or modeling on transformed data.
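A small sketch of the forward and inverse transform using SciPy's boxcox and inv_boxcox (the sample values are arbitrary and must be strictly positive):

```python
import numpy as np
from scipy.stats import boxcox
from scipy.special import inv_boxcox

y = np.array([0.5, 1.0, 2.0, 4.0, 8.0, 16.0])   # strictly positive data

y_bc, lam = boxcox(y)              # forward transform, lambda chosen by MLE
y_back = inv_boxcox(y_bc, lam)     # invert back to the original scale

print(lam)
print(np.allclose(y, y_back))      # True: the round trip recovers the data
```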
Choosing $ \lambda $
The value of $ \lambda $ can significantly affect the skewness and homoscedasticity of the residuals in regression modeling. It is usually selected to maximize the log-likelihood function of obtaining the transformed data under a normal distribution assumption or through cross-validation procedures.
Tools like R and Python offer built-in functions to automatically find the optimal $ \lambda $ that best normalizes the data. For example, in R, the boxcox function from the MASS package can be used to estimate $ \lambda $.
How to find λ
- Cross-validation:
  - Consider choices of $\lambda$ at different scales, e.g., $\lambda \in \{10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 1, 10\}$.
  - For each $\lambda_i$:
    - iteratively build a new random fold from the training set,
    - fit the cross-validation train set using $\lambda_i$,
    - compute the MSE for the current fold on the validation set,
    - record the average error and its spread $\sigma$ over all folds.
  - Pick the $\lambda$ with the smallest error, or the largest $\lambda$ within one $\sigma$ of it.
Polynomial
Polynomial transformation is another common method used in regression and modeling when the relationship between the independent and dependent variables is non-linear. Here’s how it works and how it compares to other methods like Box-Cox.
Key Points from the Slide:
- Linear Model of Explanatory Variables:
  - The slides start with a simple linear regression model involving two explanatory variables $ x_1 $ and $ x_2 $:
  $$
  y_i = \beta_2 x_2^{(i)} + \beta_1 x_1^{(i)} + \beta_0 + \xi^{(i)}
  $$
  - $ \beta_0 $: Intercept
  - $ \beta_1, \beta_2 $: Coefficients
  - $ \xi^{(i)} $: Error term
  - This is a standard multiple linear regression model.
- Polynomial Regression:
  - For non-linear relationships, a polynomial transformation of the explanatory variable is applied. For example:
  $$
  y_i = \beta_2 (x^{(i)})^2 + \beta_1 (x^{(i)})^1 + \beta_0 + \xi^{(i)}
  $$
  - The design matrix $ X_i^T $ becomes:
  $$
  X_i^T = [(x^{(i)})^2, (x^{(i)})^1, 1]
  $$
- Regression Application:
  - After transforming the data, we can use standard regression methods to fit a model as before.
- Challenge:
  - Choosing the Polynomial Order: It's difficult to determine the optimal polynomial degree ($ n $) for the model. A higher degree can lead to overfitting, while a lower degree might underfit.
Polynomial Transform vs. Box-Cox Transform:
Aspect | Polynomial Transform | Box-Cox Transform |
---|---|---|
Purpose | Captures non-linear relationships by adding polynomial terms. | Stabilizes variance and makes data more normally distributed. |
Flexibility | Highly flexible with different polynomial degrees. | Limited to monotonic transformations. |
Ease of Use | Requires selecting an appropriate degree ($ n $). | Only requires finding the optimal $ \lambda $ parameter. |
Risk | Risk of overfitting with high-degree polynomials. | Minimal risk of overfitting, but doesn’t capture non-linearity. |
Output | Adds multiple polynomial terms (e.g., $ x^2, x^3, \dots $). | Applies a monotonic transformation to the dependent variable. |
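A minimal sketch of a degree-2 polynomial fit with scikit-learn (the synthetic data and the chosen degree are illustrative assumptions):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)

# Synthetic quadratic relationship with noise.
x = np.linspace(-3, 3, 100).reshape(-1, 1)
y = 2.0 * x.ravel() ** 2 - x.ravel() + 1.0 + rng.normal(scale=1.0, size=100)

# The design matrix becomes [x^2, x, 1]; a linear model is then fit as usual.
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(x, y)
print(model.score(x, y))   # R^2 of the polynomial fit
```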
Regularization
Regularization involves adding a penalty to the loss function that a model minimizes. This penalty typically discourages complex models by imposing a cost on having larger parameter values, thereby promoting simpler, more robust models.
Types of Regularization
Type | Description | Effect | Equation |
---|---|---|---|
L1 Regularization (Lasso) | Adds the absolute value of the magnitude of coefficients as a penalty term. | Leads to sparse models where some weights are zero, effectively performing feature selection. | $ L = \text{Loss} + \lambda \sum |\beta_i| $ |
L2 Regularization (Ridge) | Adds the squared magnitude of coefficients as a penalty term. | Distributes weight more evenly across features, less robust to outliers than L1. | $ L = \text{Loss} + \lambda \sum \beta_i^2 $ |
Elastic Net | Combination of L1 and L2 regularization. | Balances feature selection and robustness, useful for correlated features. | $ L = \text{Loss} + \lambda_1 \sum |\beta_i| + \lambda_2 \sum\beta_i^2 $ |
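A short scikit-learn sketch of the three penalties in the table above (the regularization strengths are arbitrary choices):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge, ElasticNet

rng = np.random.default_rng(0)

# Synthetic data: only the first 3 of 10 features actually matter.
X = rng.normal(size=(200, 10))
true_beta = np.array([3.0, -2.0, 1.5] + [0.0] * 7)
y = X @ true_beta + 0.5 * rng.normal(size=200)

for model in (Lasso(alpha=0.1), Ridge(alpha=1.0), ElasticNet(alpha=0.1, l1_ratio=0.5)):
    model.fit(X, y)
    print(type(model).__name__, np.round(model.coef_, 2))
# Lasso drives the irrelevant coefficients exactly to zero (feature selection);
# Ridge shrinks them toward zero without zeroing them out.
```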
Role of Regularization in Modeling
- Preventing Overfitting: By penalizing large coefficients, regularization reduces the model’s tendency to memorize training data, thus helping it to generalize better to new, unseen data.
- Stability: Regularization can make the learning algorithm more stable by smoothing the optimization landscape.
- Handling Multicollinearity: In regression, multicollinearity can make the model estimation unstable and sensitive to noise in the data. Regularization helps by constraining the coefficient estimates.
Regularization and $\lambda$
In regularization, $ \lambda $ (often called the regularization parameter) controls the strength of the penalty applied to the model. This is conceptually different from the $ \lambda $ used in transformations like Box-Cox, though both involve tuning a parameter to achieve better model behavior:
- In Box-Cox transformations, $ \lambda $ is selected to normalize the distribution of a variable or make relationships more linear.
- In regularization, $ \lambda $ adjusts the trade-off between fitting the training data well (low bias) and keeping the model parameters small (low variance).
Performance
Residuals and Standardized Residuals
Residuals: $e = y - \hat{y} $
Mean Square Error: $m = \frac{e^T e}{N}$
The following equation defines the standardized residual $ S_i $ in the context of regression analysis. Here's a breakdown of the equation and its components:
Standardized Residuals
$$
S_i = \frac{e_i}{\sigma} = \frac{e_i}{\sqrt{\frac{e^T e}{N} (1 - h_{i,i})}}
$$
- $ e_i $ is the residual for the $ i $-th observation, which is the difference between the observed value and the value predicted by the regression model.
- $ \sigma $ is the estimated standard deviation of the residuals.
- $ e^T e $ is the sum of squared residuals.
- $ N $ is the number of observations.
- $ h_{i,i} $ is the leverage of the $ i $-th observation, a measure from the hat matrix that indicates the influence of the $ i $-th observation on its own predicted value.
Explanation:
- Standardized Residuals: These are the residuals $ e_i $ normalized by an estimate of their standard deviation. Standardized residuals are useful for identifying outliers in regression analysis because they are scaled to have a variance of 1, except for their adjustment by the leverage factor.
- Leverage $ h_{i,i} $: The leverage is a value that indicates how far an independent variable deviates from its mean. High leverage points can have an unduly large effect on the estimate of regression coefficients.
- Square Root Denominator: The denominator of the standard deviation estimate involves the sum of squared residuals, which measures the overall error of the model, divided by the number of observations, adjusted for the leverage of each observation. This adjustment accounts for the fact that observations with higher leverage pull the fit toward themselves and so tend to have smaller raw residuals; dividing by $\sqrt{1 - h_{i,i}}$ scales their residuals back up for a fair comparison.
Coefficient of Determination (R^2)
The coefficient of determination, $ R^2 $, measures how well a regression model explains the variance in the dependent variable. It is commonly used to evaluate the goodness-of-fit of a model.
Formula for $ R^2 $
$$ R^2 = 1 - \frac{\text{var}(y - \hat{y})}{\text{var}(y)} = 1 - \frac{\text{SS}_{\text{res}}}{\text{SS}_{\text{tot}}} $$
Where:
- $ \text{SS}_{\text{res}} $: Residual sum of squares
  $$ \text{SS}_{\text{res}} = \sum_{i=1}^n \left( y_i - \hat{y}_i \right)^2 $$
- $ \text{SS}_{\text{tot}} $: Total sum of squares
  $$ \text{SS}_{\text{tot}} = \sum_{i=1}^n \left( y_i - \bar{y} \right)^2 $$
Here:
- $ y_i $: Observed value of the dependent variable
- $ \hat{y}_ i $: Predicted value from the model
- $ \bar{y} $: Mean of the observed values
- $ n $: Number of observations
Interpretation of $ R^2 $
- $ R^2 = 1 $: Perfect fit (the model explains all the variance in the data).
- $ R^2 = 0 $: The model does not explain any of the variance (as bad as predicting the mean).
- $ R^2 < 0 $: The model performs worse than predicting the mean.
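A quick sketch of computing $R^2$ directly from the definition (toy numbers for illustration):

```python
import numpy as np

y = np.array([3.0, 5.0, 7.0, 9.0])        # observed values
y_hat = np.array([2.8, 5.3, 6.9, 9.2])    # model predictions

ss_res = np.sum((y - y_hat) ** 2)          # residual sum of squares
ss_tot = np.sum((y - np.mean(y)) ** 2)     # total sum of squares
r2 = 1.0 - ss_res / ss_tot
print(r2)                                  # close to 1 for a good fit
```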
Outliers
To improve regression performance, detecting and removing outliers is very important.
Hat Matrix (Leverage)
The hat matrix, denoted as $ H $, is a matrix used in regression analysis that maps the observed response values to the predicted response values. It is derived from the least squares solution of a linear regression model.
1. Estimated Coefficients ($ \hat{\beta} $)
- The coefficients ($ \hat{\beta} $) in a multiple linear regression are estimated using the following formula:
$$
\hat{\beta} = (X^T X)^{-1} (X^T y)
$$
- $ X $: Design matrix containing the predictors (including a column of ones for the intercept).
- $ y $: Response vector (actual observations).
- $ X^T X $: Matrix capturing the relationships between predictors.
- $ (X^T X)^{-1} $: Inverse of $ X^T X $, ensuring the system of equations is solvable.
2. Predictions ($ \hat{y} $)
- Predicted values ($ \hat{y} $) are computed by applying the estimated coefficients ($ \hat{\beta} $) to the predictors:
$$
\hat{y} = X \hat{\beta}
$$
- Substituting $ \hat{\beta} $ into this equation:
$$
\hat{y} = X (X^T X)^{-1} (X^T y)
$$
- Rearranging gives:
$$
\hat{y} = (X (X^T X)^{-1} X^T) y
$$
3. The hat matrix $ H $ is given by:
$$
H = X (X^T X)^{-1} X^T
$$
Where:
- $ X $: The design matrix (contains the predictors with an added intercept column)
- $ X^T $: The transpose of $ X $
- $ (X^T X)^{-1} $: The inverse of $ X^T X $
Properties of the Hat Matrix
- It is symmetric ($ H = H^T $).
- It is idempotent ($ H^2 = H $).
- The diagonal elements of $ H $, $ h_{ii} $, represent the leverage of each observation.
Identifying Outliers Using the Hat Matrix
Leverage ($ h_{ii} $)
- The diagonal elements $ h_{ii} $ of the hat matrix measure the leverage of each observation.
- Leverage quantifies how much an observation influences its fitted value. Higher leverage indicates that the observation has more influence on the regression line.
Outliers and High Leverage Points
- High Leverage: Observations with large $ h_{ii} $ are far from the mean of the predictors and could disproportionately affect the regression model.
- Threshold: $ h_{ii} > \frac{2p}{n} $, where $ p $ is the number of predictors (including the intercept) and $ n $ is the number of observations.
- Outlier: An observation with both high leverage and a large residual (difference between observed and predicted values) is likely to be an outlier.
Practical Steps to Detect Outliers
- Compute the Hat Matrix:
- $ H = X (X^T X)^{-1} X^T $
- Extract Leverage:
- $ h_{ii} = \text{diag}(H) $
- Combine with Residuals:
- Use Cook’s Distance or Studentized Residuals to identify observations that are both high-leverage and high-residual.
- Thresholds:
- High leverage: $ h_{ii} > \frac{2p}{n} $
- Cook’s Distance: $ D_i > 4/n $
Python Example
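The original example isn't reproduced here; the following is a minimal sketch of the steps above on synthetic data (the variable names and data are my own):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data with one deliberately extreme point.
n = 50
x = rng.normal(size=n)
x[0] = 6.0                                  # high-leverage point
y = 1.5 * x + 2.0 + rng.normal(scale=0.5, size=n)
y[0] += 5.0                                 # also give it a large residual

# Design matrix with an intercept column; p = number of predictors (incl. intercept).
X = np.column_stack([np.ones(n), x])
p = X.shape[1]

# Hat matrix, leverage, residuals.
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y
e = y - X @ beta_hat
m = e @ e / n                               # mean squared error, m = e^T e / N

# Standardized residuals s_i = e_i / sqrt(m * (1 - h_ii)).
s = e / np.sqrt(m * (1 - h))

# Flag points by the thresholds above.
high_leverage = np.where(h > 2 * p / n)[0]
large_residual = np.where(np.abs(s) > 2)[0]
print("high leverage:", high_leverage)
print("large standardized residual:", large_residual)
```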
Benefits of Using the Hat Matrix for Outlier Detection
- Objective Identification:
- Provides a mathematically robust way to detect influential points.
- Diagnostic Tool:
- Helps identify data points that disproportionately affect model predictions.
Combined with Standardized Residuals
High leverage alone may not be enough to confirm that a point is an outlier. By combining the standardized residual with leverage, we can draw a better conclusion.
- High Leverage + Large Residual:
  - A point with high $ h_{i,i} $ (leverage) and high $ |s_i| $ (standardized residual) is more likely to be an influential outlier.
- Identifying Outliers:
  - Use $ s_i $ to standardize residuals, allowing comparison regardless of varying leverage.
- Model Diagnostics:
  - This methodology helps evaluate model robustness and identify points that might unduly influence the regression fit.
Residual Variance ($ \sigma_i^2 $)
- $ \sigma_i^2 = m(1 - h_{i,i}) = \frac{e^T e}{N}(1 - h_{i,i}) $:
- Residual variance depends on leverage $ h_{i,i} $ and the overall residual sum of squares.
- Points with high leverage ($ h_{i,i} $) shrink the $(1 - h_{i,i})$ factor, reducing the denominator and so magnifying their standardized residuals.
Standardized Residual ($ s_i $)
- $ s_i = \frac{e_i}{\sigma} = \frac{e_i}{\sqrt{\frac{e^T e}{N}(1 - h_{i,i})}} $:
- $ e_i $: Residual for data point $ i $ ($ e_i = y_i - \hat{y}_i $).
- $ s_i $: Standardized residual accounts for leverage, enabling fair comparison of residuals across all data points.
- It identifies outliers with unusually large residuals relative to their variance.
Threshold for Outliers
- The slide provides thresholds for detecting outliers based on $ s_i $:
- $ |s_i| > 1 $: Unusual at the 68% level.
- $ |s_i| > 2 $: Unusual at the 95% level.
- $ |s_i| > 3 $: Unusual at the 99% level.
- Points with $ |s_i| $ exceeding these thresholds are flagged as potential outliers.
Cook’s distance
- Coefficients and prediction with full training set: $ y^{p} = X\hat{\beta} $
- Coefficients and prediction excluding item i from training set: $ y_i^{p} = X\hat{\beta}_i $
- Cook’s distance for point i:
$$ \text{Cook's distance} = \frac{(y^{p} - y_i^{p})^T (y^{p} - y_i^{p})}{d\,m} $$
Where:
- $d$ is the number of predictors (features) in the model, i.e., the degrees of freedom.
- $m = \frac{e^T e}{N}$ is the mean squared error.
- $N$ is the number of observations in the dataset.
Cook's distance is the (scaled) squared difference between the predictions made with and without point $i$.
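A small sketch of Cook's distance computed by refitting without each point (synthetic data; a leave-one-out loop is used for clarity rather than a closed-form shortcut):

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's distance for each observation via leave-one-out refits."""
    n, d = X.shape
    beta_full = np.linalg.lstsq(X, y, rcond=None)[0]
    y_pred = X @ beta_full
    m = (y - y_pred) @ (y - y_pred) / n            # m = e^T e / N
    D = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i
        beta_i = np.linalg.lstsq(X[mask], y[mask], rcond=None)[0]
        diff = y_pred - X @ beta_i                 # y^p - y_i^p over all points
        D[i] = diff @ diff / (d * m)
    return D

# Tiny illustration: flag points with D_i > 4/n.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(30), rng.normal(size=30)])
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.3, size=30)
y[5] += 4.0                                        # inject an outlier
D = cooks_distance(X, y)
print(np.where(D > 4 / len(y))[0])                 # likely includes index 5
```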