Linear Model Optimization

Information Criteria: AIC and BIC

The Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC) are measures used to evaluate the quality of statistical models, particularly when choosing a model's size or complexity.

  1. Why Use AIC and BIC?

    • When building statistical or machine learning models, we often face the challenge of balancing model fit (how well the model explains the data) with model simplicity (avoiding overfitting).
    • AIC and BIC are metrics that help in selecting the best model by incorporating penalties for the number of parameters used in the model.
  2. Akaike Information Criterion (AIC):

    • AIC estimates the relative quality of a model for a given dataset.
    • Formula:
      $$
      \text{AIC} = 2k - 2 \ln(L)
      $$
      • $ k $: Number of parameters in the model.
      • $ L $: Likelihood of the model (how well it fits the data).
    • Objective: Choose the model with the lowest AIC value, which balances fit and complexity.
  3. Bayesian Information Criterion (BIC):

    • BIC is similar to AIC but applies a stronger penalty for model complexity, guarding further against overfitting.
    • Formula:
      $$
      \text{BIC} = k \ln(n) - 2 \ln(L)
      $$
      • $ n $: Number of observations in the dataset.
      • $ k $: Number of parameters in the model.
    • Objective: Choose the model with the lowest BIC value for a balance between fit and simplicity, especially when sample size $ n $ is large.
  4. Key Difference Between AIC and BIC:

    • AIC's penalty ($2k$) does not grow with the sample size, so it is less strict about model size.
    • BIC's penalty ($k \ln(n)$) grows with the sample size, making it more conservative and more likely to select simpler models (see the numeric sketch below).
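
To make the decision rule concrete, here is a small sketch comparing two candidate models with made-up parameter counts and log-likelihoods (the numbers are purely illustrative):

import numpy as np

n = 100  # number of observations
models = {
    "small (k=3)": {"k": 3, "loglik": -210.0},  # fits slightly worse, fewer parameters
    "large (k=8)": {"k": 8, "loglik": -205.0},  # fits better, more parameters
}

for name, m in models.items():
    aic = 2 * m["k"] - 2 * m["loglik"]
    bic = m["k"] * np.log(n) - 2 * m["loglik"]
    print(f"{name}: AIC = {aic:.1f}, BIC = {bic:.1f}")

# AIC: small = 426.0, large = 426.0 (a tie); BIC: small = 433.8, large = 446.8,
# so BIC's stronger ln(n) penalty clearly favors the smaller model.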

Applications:

  • Model selection in regression, time-series analysis, and machine learning.
  • Comparing models with different numbers of features or parameters.
  • Evaluating trade-offs between underfitting and overfitting.


At a glance:

| Criterion | Formula | Focus | Penalty for Complexity | Use Case | Objective |
| --- | --- | --- | --- | --- | --- |
| Akaike Information Criterion (AIC) | $2k - 2\ln(L)$ | Model fit vs. simplicity | Proportional to $k$ | Choosing models that balance goodness-of-fit and simplicity | Minimize AIC |
| Bayesian Information Criterion (BIC) | $k\ln(n) - 2\ln(L)$ | Model fit vs. parsimony | Stronger penalty with $\ln(n)$ | Suitable for large datasets and emphasizing simpler models | Minimize BIC |

| Aspect | AIC | BIC |
| --- | --- | --- |
| Penalty strength | Moderate | High; depends on sample size $n$, so larger datasets lead to stricter penalties |
| Common application | Time-series, regression, machine learning; model selection across varying complexity | Multi-model comparison; best when balancing underfitting and overfitting |
  1. AIC:

    • Prefers models with a better balance between complexity and fit.
    • Less conservative than BIC, suitable for small datasets or exploratory analysis.
  2. BIC:

    • Stronger emphasis on simplicity.
    • More appropriate for larger datasets or when avoiding overfitting is crucial.
  3. Choosing Between AIC and BIC:

    • Use AIC if you prioritize model quality over strict simplicity.
    • Use BIC if simplicity and generalization are more important.

Likelihood

When calculating AIC or BIC, the likelihood refers to how well the model trained on the training data explains the same training data. The likelihood is not calculated on the test data, as AIC and BIC are measures of model quality on the training dataset itself.

Likelihood in AIC/BIC Context:

  1. Training Data:
    • We use the model parameters (e.g., coefficients in regression) estimated from the training data to calculate the likelihood of the training data.
  2. Likelihood Calculation:
    • For a model trained on the training data, the likelihood is the probability (or density) of the observed training data under the model:
      $$
      L(\theta | \text{Training Data}) = \prod_{i=1}^n f(y_i | \theta)
      $$
      Where:
      • $ y_i $: Observed target value.
      • $ \theta $: Model parameters estimated during training.
      • $ f(y_i | \theta) $: Probability density of $ y_i $ under the model.
  3. Log-Likelihood for AIC/BIC:
    • Instead of working with $ L $, we calculate the log-likelihood to simplify computations:
      $$
      \ln L(\theta | \text{Training Data}) = \sum_{i=1}^n \ln f(y_i | \theta)
      $$

Steps to Calculate Likelihood for AIC/BIC:

  1. Train the Model:
    • Use the training data to estimate the model parameters ($ \theta $).
  2. Calculate Predictions ($ \hat{y}_i $):
    • Predict the mean or central tendency of the model for each training data point.
  3. Calculate Residuals and Likelihood:
    • Assume a distribution for the residuals (e.g., normal distribution).
    • For a normal distribution:
      $$
      f(y_i | \hat{y}_i, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - \hat{y}_i)^2}{2\sigma^2}\right)
      $$
    • The log-likelihood becomes:
      $$
      \ln L = \sum_{i=1}^n \left[ -\frac{1}{2} \ln(2\pi\sigma^2) - \frac{(y_i - \hat{y}_i)^2}{2\sigma^2} \right]
      $$

Here $\sigma$ represents the standard deviation of the residuals, which in practice is estimated from the training residuals.
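
As a quick sanity check, the closed-form normal log-likelihood above can be compared against scipy.stats.norm.logpdf; a minimal sketch with made-up numbers:

import numpy as np
from scipy.stats import norm

# Made-up observations, predictions, and residual standard deviation
y = np.array([1.2, 2.3, 2.8, 4.1, 5.3])
y_hat = np.array([1.1, 2.2, 3.1, 4.0, 5.2])
sigma = np.std(y - y_hat, ddof=1)

# Closed-form normal log-likelihood (the formula above)
manual = np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - (y - y_hat)**2 / (2 * sigma**2))

# The same quantity via scipy's normal log-density
via_scipy = np.sum(norm.logpdf(y, loc=y_hat, scale=sigma))

print(manual, via_scipy)  # the two values agree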

Example: Using Training Data to Calculate Likelihood

from sklearn.linear_model import LinearRegression
import numpy as np

# Simulated training data
X_train = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)
y_train = np.array([1.2, 2.3, 2.8, 4.1, 5.3])

# Train a linear regression model
model = LinearRegression()
model.fit(X_train, y_train)
y_pred_train = model.predict(X_train)

# Calculate residuals and variance
residuals = y_train - y_pred_train
sigma_squared = np.var(residuals, ddof=1) # Sample variance of residuals (the pure MLE would use ddof=0)

# Calculate log-likelihood
n = len(y_train)
log_likelihood = -0.5 * n * np.log(2 * np.pi * sigma_squared) - np.sum((residuals**2) / (2 * sigma_squared))

# AIC and BIC
k = 2 # Number of estimated parameters (intercept + slope); some conventions also count the error variance, giving k = 3
aic = 2 * k - 2 * log_likelihood
bic = k * np.log(n) - 2 * log_likelihood

{"Log-Likelihood": log_likelihood, "AIC": aic, "BIC": bic}

Example

Source: Model selection with AIC and AICc

Forward and Backward Stagewise Regression

Backward stagewise regression and Forward stagewise regression are methods for variable selection and model fitting, primarily used in regression contexts. They are stepwise procedures for adding or removing predictors in a systematic way to improve model performance or interpretability.

1. Backward Stagewise Regression

Overview:

  • Starts with a full model (all predictors included).
  • Gradually removes predictors one by one, based on a criterion (e.g., p-value, AIC, or adjusted $ R^2 $).
  • The goal is to find a smaller, simpler model without significantly compromising the fit.

Procedure:

  1. Begin with a model containing all predictors.
  2. Evaluate the significance of each predictor (e.g., using p-values).
  3. Remove the least significant predictor (highest p-value) that exceeds a predefined threshold (e.g., $p > 0.05$).
  4. Refit the model and repeat the process until all remaining predictors are statistically significant or meet the stopping criteria (a minimal p-value-based sketch appears at the end of this subsection).

Advantages:

  • Simple and interpretable.
  • Useful for removing irrelevant predictors in high-dimensional datasets.

Disadvantages:

  • Can miss optimal combinations of predictors.
  • Sensitive to multicollinearity among predictors.
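
A minimal sketch of p-value-driven backward elimination using statsmodels; the simulated data, column names, and 0.05 threshold are illustrative assumptions:

import numpy as np
import pandas as pd
import statsmodels.api as sm

# Simulated data: three informative predictors and two pure-noise predictors
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 5)), columns=["x1", "x2", "x3", "x4", "x5"])
y = 2 * X["x1"] - 3 * X["x2"] + 0.5 * X["x3"] + rng.normal(size=200)

threshold = 0.05
features = list(X.columns)

while features:
    model = sm.OLS(y, sm.add_constant(X[features])).fit()
    pvalues = model.pvalues.drop("const")  # p-value of each remaining predictor
    worst = pvalues.idxmax()               # least significant predictor
    if pvalues[worst] <= threshold:        # stop once everything is significant
        break
    features.remove(worst)                 # drop it and refit

print("Selected predictors:", features)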

2. Forward Stagewise Regression

Overview:

  • Starts with an empty model (no predictors included).
  • Gradually adds predictors one at a time, based on a criterion (e.g., reducing residual sum of squares or improving AIC/BIC).
  • The goal is to build a model step-by-step, adding only significant predictors.

Procedure:

  1. Begin with an empty model.
  2. Evaluate all predictors not yet in the model, adding the one that most improves the model fit (e.g., the one with the smallest p-value or largest improvement in $ R^2 $).
  3. Refit the model and repeat the process until no additional predictors meet the inclusion criteria.

Advantages:

  • Can handle datasets with a large number of predictors.
  • Less likely to overfit compared to starting with a full model.

Disadvantages:

  • Ignores potential joint effects of predictors (e.g., interactions).
  • May miss the best subset of predictors.

Key Differences Between Backward and Forward Stagewise Regression

| Feature | Backward Stagewise | Forward Stagewise |
| --- | --- | --- |
| Starting point | Full model (all predictors) | Empty model (no predictors) |
| Procedure | Removes predictors iteratively | Adds predictors iteratively |
| Use case | Small datasets with fewer predictors | Large datasets with many predictors |
| Limitations | May retain redundant predictors | May miss joint effects of predictors |

When to Use Each Method?

  • Backward Stagewise:

    • When you suspect many predictors are irrelevant.
    • When computational resources are not a concern (since fitting starts with a large model).
  • Forward Stagewise:

    • When you have a large number of predictors and computational efficiency is critical.
    • When you want a simpler starting point and add complexity gradually.
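
scikit-learn ships a SequentialFeatureSelector that automates both directions; a minimal sketch follows (the feature count and cross-validation settings are illustrative assumptions), while the next example implements the loops by hand to show the mechanics:

from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=100, n_features=10, noise=10, random_state=42)

# Forward selection: start empty and add features until 5 are chosen
forward = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=5, direction="forward", cv=5
).fit(X, y)

# Backward elimination: start full and remove features down to 5
backward = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=5, direction="backward", cv=5
).fit(X, y)

print("Forward keeps:", forward.get_support(indices=True))
print("Backward keeps:", backward.get_support(indices=True))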

A Quick Example

# Import required libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

# Generate a dataset with 100 samples and 10 features
np.random.seed(42)
X, y = make_regression(n_samples=100, n_features=10, noise=10, random_state=42)

# Initialize model and variables for Forward Stagewise Regression
selected_features = []
remaining_features = list(range(X.shape[1]))
forward_scores = []

# Forward Stagewise Regression
for _ in range(len(remaining_features)):
    scores = []
    for feature in remaining_features:
        # Fit a model with the current feature added
        features_to_test = selected_features + [feature]
        model = LinearRegression().fit(X[:, features_to_test], y)
        score = model.score(X[:, features_to_test], y)  # R^2 score
        scores.append((score, feature))

    # Select the feature with the highest R^2 score
    scores.sort(reverse=True)
    best_score, best_feature = scores[0]
    forward_scores.append(best_score)
    selected_features.append(best_feature)
    remaining_features.remove(best_feature)

# Results of Forward Stagewise Regression
selected_features_forward = selected_features # Save selected features for clarity

# Backward Stagewise Regression
selected_features_backward = list(range(X.shape[1]))
backward_scores = []

for _ in range(len(selected_features_backward) - 1):
    scores = []
    for feature in selected_features_backward:
        # Fit a model with the current feature removed
        features_to_test = [f for f in selected_features_backward if f != feature]
        model = LinearRegression().fit(X[:, features_to_test], y)
        score = model.score(X[:, features_to_test], y)  # R^2 score without this feature
        scores.append((score, feature))

    # Remove the feature whose removal hurts the R^2 score the least
    scores.sort(reverse=True)
    best_score, feature_to_remove = scores[0]
    backward_scores.append(best_score)
    selected_features_backward.remove(feature_to_remove)

# Plot R^2 scores for Forward and Backward Stagewise Regression
plt.figure(figsize=(12, 6))

# Forward Stagewise Regression
plt.plot(range(1, len(forward_scores) + 1), forward_scores, marker='o', label='Forward Stagewise', color='blue')

# Backward Stagewise Regression
plt.plot(range(len(backward_scores), 0, -1), backward_scores, marker='o', label='Backward Stagewise', color='red')

# Formatting the plot
plt.title("R² Scores During Forward and Backward Stagewise Regression")
plt.xlabel("Number of Features")
plt.ylabel("R² Score")
plt.xticks(range(1, len(forward_scores) + 1))
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()

Limitations

Forward and Backward Stagewise Regression can become computationally expensive and impractical when dealing with a large number of features (e.g., 1000+ features) because:

  1. High Computational Cost:
    • Both methods fit a separate model for every candidate feature at every step; with thousands of candidate features, the number of fits becomes impractical.
  2. Potential Overfitting:
    • With a large number of features, stepwise methods might select features that fit noise in the data rather than actual patterns.
  3. Ignoring Interactions:
    • These methods do not account for interactions between features, which can lead to suboptimal feature selection.

Alternative Methods for Large Feature Spaces

| Method | Description | Advantages | Disadvantages | Best Use Case |
| --- | --- | --- | --- | --- |
| Lasso Regression (L1) | Shrinks coefficients and sets some to exactly zero for feature selection. | Efficient for high-dimensional data; automatically selects features; prevents overfitting. | May ignore correlated features; requires hyperparameter tuning ($\lambda$). | When many features are irrelevant and sparse solutions are desired. |
| Elastic Net | Combines L1 (Lasso) and L2 (Ridge) regularization. | Balances feature selection and handling multicollinearity; suitable for correlated features. | More complex than Lasso; requires tuning of both $\lambda$ and $\alpha$. | When predictors are highly correlated and feature selection is needed. |
| Recursive Feature Elimination (RFE) | Iteratively removes the least important features based on a chosen model. | Works with any estimator (e.g., linear, tree-based); provides a ranking of feature importance. | Computationally expensive; sensitive to model choice and training data. | When model-specific feature ranking is required. |
| Principal Component Analysis (PCA) | Reduces dimensionality by transforming features into uncorrelated components that capture most variance. | Handles high-dimensional data well; removes multicollinearity; no need for a target variable. | Components are linear combinations of features, losing interpretability; not ideal for feature selection. | When reducing dimensionality is more important than interpretability. |
| Tree-Based Feature Importance | Uses models like Random Forest or Gradient Boosting to rank feature importance. | Naturally handles non-linearity; accounts for feature interactions; fast for large datasets. | Can be biased toward high-cardinality features; does not directly reduce feature count. | When using tree-based models or ranking feature importance is a priority. |
| Mutual Information | Measures the statistical dependency between features and the target variable. | Non-parametric; detects non-linear relationships. | Computationally expensive for many features; does not handle feature interactions. | When quantifying feature relevance to the target variable without distributional assumptions is needed. |
| Feature Clustering | Groups similar features into clusters and uses cluster representatives for modeling. | Reduces redundancy among correlated features; scales well with high-dimensional data. | May lose specific feature contributions; requires a meaningful distance metric. | When dealing with highly correlated features or datasets with groups of similar features. |
| Embedding-Based Methods | Uses deep learning or models like word2vec to transform features into a lower-dimensional space. | Captures complex relationships between features; flexible for large feature spaces. | Requires advanced techniques and computational resources; may lose interpretability. | When handling very high-dimensional data (e.g., text, genomic data) with complex dependencies. |

Recommendations:

  • Lasso Regression: If feature selection is the goal and the data has many irrelevant features.
  • Elastic Net: If features are highly correlated and Lasso alone may struggle.
  • PCA: When interpretability is less important, and you want to reduce dimensionality.
  • Tree-Based Importance: For datasets where feature importance ranking is needed, especially with tree-based models.
  • Feature Clustering: For correlated features where redundancy needs to be reduced.
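
For example, Lasso-based selection scales to large feature spaces; a minimal sketch using scikit-learn's LassoCV (the dataset and settings are illustrative assumptions):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

# 1000 candidate features, only 10 of which are informative
X, y = make_regression(n_samples=500, n_features=1000, n_informative=10,
                       noise=10, random_state=42)

# Cross-validated Lasso chooses the regularization strength automatically
lasso = LassoCV(cv=5, max_iter=5000, random_state=42).fit(X, y)

selected = np.flatnonzero(lasso.coef_)  # features with non-zero coefficients
print(f"Lasso kept {selected.size} of {X.shape[1]} features")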

M-Estimators

M-Estimators (Maximum Likelihood-type Estimators) are a general class of estimators in statistics used for robust parameter estimation. They extend the principle of Maximum Likelihood Estimation (MLE) to allow for more flexibility and robustness, especially in the presence of outliers or non-normal errors.

What Are M-Estimators?

  1. Definition:

    • M-Estimators generalize Maximum Likelihood Estimators by minimizing a loss function (also called the objective function) over the parameters of interest.
  2. Loss Function:

    • The core idea is to minimize a function of residuals:
      $$
      \hat{\theta} = \arg\min_{\theta} \sum_{i=1}^n \rho\left(\frac{r_i}{\sigma}\right)
      $$
      Where:
      • $ r_i = y_i - f(x_i, \theta) $: Residual (difference between observed and predicted values).
      • $ \rho(\cdot) $: A loss function that determines the contribution of residuals.
      • $ \sigma $: Scale parameter (controls the spread).
  3. Goal:

    • Instead of focusing purely on minimizing squared residuals (like in Ordinary Least Squares), M-Estimators allow for more flexible functions to make the estimator less sensitive to outliers.

Examples of M-Estimators

| Type | Loss Function ($\rho$) | Characteristics |
| --- | --- | --- |
| Ordinary Least Squares (OLS) | $\rho(r) = r^2$ | Highly sensitive to outliers; minimizes the sum of squared errors. |
| Huber Loss | $\rho(r) = \begin{cases} r^2 & \text{if } \lvert r\rvert \leq c \\ 2c\lvert r\rvert - c^2 & \text{if } \lvert r\rvert > c \end{cases}$ | Combines squared loss (for small residuals) and absolute loss (for large residuals). |
| Tukey's Biweight | $\rho(r) = \begin{cases} c^2\left(1 - \left[1 - (r/c)^2\right]^3\right) & \text{if } \lvert r\rvert \leq c \\ c^2 & \text{if } \lvert r\rvert > c \end{cases}$ | Completely ignores residuals larger than the threshold $c$. |
| Absolute Loss (LAD) | $\rho(r) = \lvert r\rvert$ | Linear penalty; robust but less statistically efficient. |
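
A small sketch of the loss functions from the table, written with numpy and using the scaled forms shown above; the threshold values ($c = 1.35$ for Huber, $c = 4.685$ for Tukey) are commonly quoted defaults and here purely illustrative:

import numpy as np

def rho_ols(r):
    """Squared loss: grows quadratically, so outliers dominate."""
    return r**2

def rho_huber(r, c=1.35):
    """Quadratic for small residuals, linear beyond the threshold c."""
    return np.where(np.abs(r) <= c, r**2, 2 * c * np.abs(r) - c**2)

def rho_tukey(r, c=4.685):
    """Bounded loss: residuals beyond c all contribute the same constant c**2."""
    return np.where(np.abs(r) <= c, c**2 * (1 - (1 - (r / c)**2)**3), c**2)

r = np.linspace(-10, 10, 5)
print(rho_ols(r))
print(rho_huber(r))
print(rho_tukey(r))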

Advantages of M-Estimators

  1. Robustness to Outliers: large residuals contribute less to the objective than under squared loss.
  2. Flexibility: the loss function $\rho$ can be chosen to match the error structure of the data.
  3. Generalization of MLE:
    • MLE is a special case of M-Estimators, making them widely applicable in parametric settings.

When to Use M-Estimators?

  1. Presence of Outliers: when a few extreme observations would otherwise dominate a least-squares fit.
  2. Non-Normal Errors: when the residuals clearly depart from the normal distribution assumed by OLS.
  3. Heavy-Tailed Distributions: when large deviations occur more often than a normal model predicts.

Practical Example: Using Huber Loss

Below is an example of applying Huber Loss to regression in Python:

from sklearn.linear_model import HuberRegressor, LinearRegression
import numpy as np
import matplotlib.pyplot as plt

# Simulate data with outliers
np.random.seed(42)
X = np.linspace(0, 10, 100).reshape(-1, 1)
y = 3 * X.flatten() + np.random.normal(0, 1, size=X.shape[0])
y[::10] += 20 # Add outliers every 10th point

# Fit Ordinary Least Squares (OLS) Regression
ols = LinearRegression().fit(X, y)

# Fit Huber Regression (epsilon sets the threshold between the squared-loss and linear-loss regions)
huber = HuberRegressor(epsilon=1.35).fit(X, y)

# Plot the results
plt.scatter(X, y, color="blue", label="Data with Outliers")
plt.plot(X, ols.predict(X), color="red", label="OLS Regression Line")
plt.plot(X, huber.predict(X), color="green", label="Huber Regression Line")
plt.title("Comparison of OLS and Huber Regression")
plt.legend()
plt.xlabel("X")
plt.ylabel("y")
plt.show()
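
statsmodels also exposes M-estimation directly through RLM, where the norm argument selects the loss function; a minimal sketch on data simulated the same way as above (TukeyBiweight() could be swapped in for HuberT()):

import numpy as np
import statsmodels.api as sm

# Simulated data with outliers, mirroring the example above
np.random.seed(42)
x = np.linspace(0, 10, 100)
y = 3 * x + np.random.normal(0, 1, size=x.size)
y[::10] += 20

X_design = sm.add_constant(x)

# Robust linear model fit with the Huber norm
rlm_huber = sm.RLM(y, X_design, M=sm.robust.norms.HuberT()).fit()
print(rlm_huber.params)  # intercept and slope, with outliers down-weighted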

Author: Karobben · Posted on 2024-12-30 · Updated on 2024-12-30
