AI: Logistic Regression
Logistic Regression
Logistic regression is a supervised machine learning algorithm used for binary classification tasks. Unlike linear regression, which predicts continuous values, logistic regression predicts the probability that a given input belongs to a certain class.
Key Concepts in Logistic Regression
- Logistic Function (Sigmoid Function):
  - Logistic regression uses the sigmoid function to map predicted values to probabilities:
$$
\sigma(z) = \frac{1}{1 + e^{-z}}
$$
  - $ z = X\beta $: Linear combination of features.
  - The output of $ \sigma(z) $ is always between 0 and 1, representing the probability (see the sketch after this list).
- Logit Link Function:
  - The logit function is the natural logarithm of the odds (log-odds) of the binary outcome:
$$
g(\theta) = \log\left(\frac{P(y=1|X)}{P(y=0|X)}\right)
$$
  - It transforms probabilities into log-odds:
$$
g(\theta) = X^T\beta
$$
- Inverse Link Function:
  - To map the log-odds ($ X^T\beta $) back to probabilities, we use the inverse of the logit function:
$$
P(y=1|X, \beta) = \frac{e^{X^T\beta}}{1 + e^{X^T\beta}}
$$
  - This is the sigmoid function, which outputs probabilities between 0 and 1.
- Decision Boundary:
  - For binary classification:
    - If $ \sigma(z) \geq 0.5 $, classify the input as Class 1.
    - If $ \sigma(z) < 0.5 $, classify the input as Class 0.
- Log-Likelihood:
  - Logistic regression maximizes the log-likelihood instead of minimizing residuals (as in linear regression):
$$
\ell(\beta) = \sum_{i=1}^n \left[ y_i \ln(\hat{y}_i) + (1 - y_i) \ln(1 - \hat{y}_i) \right]
$$
  Where:
  - $ \hat{y}_i = \sigma(z_i) $: Predicted probability.
  - $ y_i $: Actual class (0 or 1).
- Negative Log-Likelihood:
  - The optimization process in machine learning (and statistics) often involves minimizing a cost function. Since the log-likelihood is a measure of fit (higher is better), we take its negative to convert the maximization problem into a minimization problem:
$$
-\ln L(\beta) = -\sum_{i=1}^n \left[ y_i X_i^T \beta - \ln(1 + e^{X_i^T \beta}) \right]
$$
- Optimization:
  - The goal is to find the coefficients $ \beta $ that maximize the log-likelihood using algorithms like Gradient Descent or Newton’s Method.
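To make the pieces above concrete, here is a minimal NumPy sketch (not from the original notes) implementing the sigmoid, the negative log-likelihood, a plain gradient-descent fit, and the 0.5 decision boundary. The synthetic data, learning rate, and variable names are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + e^{-z}): maps the linear predictor to (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def negative_log_likelihood(beta, X, y):
    # -l(beta) = -sum[ y*ln(p) + (1-y)*ln(1-p) ] with p = sigmoid(X @ beta)
    p = sigmoid(X @ beta)
    eps = 1e-12  # guard against log(0)
    return -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

def fit_gradient_descent(X, y, lr=0.5, n_iter=2000):
    # Gradient of the negative log-likelihood w.r.t. beta is X^T (p - y)
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ beta)
        beta -= lr * (X.T @ (p - y)) / len(y)
    return beta

# Tiny synthetic dataset: an intercept column plus one feature
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = (x + rng.normal(scale=0.5, size=100) > 0).astype(int)
X = np.column_stack([np.ones_like(x), x])

beta_hat = fit_gradient_descent(X, y)
probs = sigmoid(X @ beta_hat)
preds = (probs >= 0.5).astype(int)          # decision boundary at 0.5
print("NLL:", negative_log_likelihood(beta_hat, X, y))
print("Training accuracy:", (preds == y).mean())
```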
What is a Link Function?
A link function connects the linear predictor ($ X\beta $) to the mean of the response variable in a generalized linear model (GLM). It provides a transformation that ensures the predicted values from the model stay within the valid range for the response variable.
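As a quick illustration of the link / inverse-link relationship in logistic regression, the snippet below uses SciPy's `logit` and `expit` helpers on a few arbitrary example probabilities:

```python
import numpy as np
from scipy.special import logit, expit  # logit(p) = log(p / (1 - p)); expit is the sigmoid

p = np.array([0.1, 0.5, 0.9])   # example probabilities
z = logit(p)                    # link: probabilities -> log-odds (unbounded, linear scale)
print(z)                        # approximately [-2.197, 0.0, 2.197]
print(expit(z))                 # inverse link maps log-odds back to [0.1, 0.5, 0.9]
```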
Why negative Log-likelihood function?
- The negative log-likelihood is used to simplify optimization by turning a maximization problem into a minimization one.
- The two forms shown above (the cross-entropy sum over $ y_i \ln(\hat{y}_i) $ terms and the $ y_i X_i^T \beta - \ln(1 + e^{X_i^T \beta}) $ form) are equivalent, just written in slightly different ways.
Applications of Logistic Regression
- Binary classification problems such as:
  - Email spam detection (Spam/Not Spam).
  - Disease diagnosis (Positive/Negative).
  - Customer churn prediction (Churn/No Churn).
Practical Example: Binary Classification with Logistic Regression
Below is a Python example using scikit-learn:
Problem: Predict whether a person has heart disease based on two features.
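The code block for this example did not survive in the notes; below is a minimal sketch of what it could look like, using synthetic data from `make_classification` as a stand-in for the two heart-disease features (the data are generated, not clinical):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Synthetic stand-in for a heart-disease dataset with two features
# (e.g. age and cholesterol); the data here are generated, not real.
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=42)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

model = LogisticRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)                # class labels (0/1)
y_prob = model.predict_proba(X_test)[:, 1]    # P(y = 1 | x)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("First few predicted probabilities:", y_prob[:5])
```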
Logistic Regression for Multiclass Classification
Logistic regression can be extended to handle multiclass classification problems where the target variable has more than two classes. The two common approaches are One-vs-Rest (OvR) and Softmax (Multinomial) logistic regression.
One-vs-Rest (OvR)
Overview:
- In OvR, a separate binary classifier is trained for each class.
- For a class $ k $, the classifier treats:
  - $ y = k $ as positive (1).
  - $ y \neq k $ as negative (0).
- Each classifier predicts the probability of the input belonging to its class.
Prediction:
- For a new data point, the class with the highest probability is chosen:
$$
\hat{y} = \arg\max_{k} P(y = k | x)
$$
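A brief OvR sketch in scikit-learn, wrapping one binary `LogisticRegression` per class; the three-class dataset below is synthetic and purely illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Synthetic 3-class problem; the data are illustrative only.
X, y = make_classification(n_samples=300, n_features=4, n_informative=3,
                           n_redundant=0, n_classes=3, random_state=0)

# One binary logistic regression is fit per class (class k vs. the rest).
ovr = OneVsRestClassifier(LogisticRegression())
ovr.fit(X, y)

print(len(ovr.estimators_))   # 3 binary classifiers, one per class
print(ovr.predict(X[:5]))     # class with the highest per-class score wins
```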
Softmax (Multinomial) Logistic Regression
Softmax logistic regression generalizes binary logistic regression to multiple classes. Instead of fitting separate binary classifiers, it predicts the probability for all classes simultaneously using the softmax function.
Softmax Function:
$$
P(y = k | x) = \frac{e^{X \beta_k}}{\sum_{j=1}^K e^{X \beta_j}}
$$
Where:
- $ K $: Total number of classes.
- $ \beta_k $: Coefficients for class $ k $.
- $ P(y = k | x) $: Probability of class $ k $ given the input $ x $.
Prediction:
- For a new data point, the class with the highest softmax probability is chosen:
$$
\hat{y} = \arg\max_{k} P(y = k | x)
$$
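A minimal NumPy sketch of the softmax prediction rule, assuming an arbitrary coefficient matrix `B` with one column of coefficients $ \beta_k $ per class:

```python
import numpy as np

def softmax(scores):
    # Subtract the max for numerical stability; outputs sum to 1.
    exp = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)

# One data point with 3 features, K = 4 classes.
x = np.array([1.0, -0.5, 2.0])
B = np.random.default_rng(1).normal(size=(3, 4))   # beta_k in column k (illustrative)

scores = x @ B                 # X beta_k for each class k
probs = softmax(scores)
print(probs, probs.sum())      # probabilities over the 4 classes, summing to 1
print(probs.argmax())          # predicted class: argmax_k P(y = k | x)
```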
Summary of Methods:
| Method | When to Use | Advantages | Disadvantages |
|---|---|---|---|
| One-vs-Rest | Small datasets with a limited number of classes. | Easy to implement, interpretable. | Can struggle with overlapping classes. |
| Softmax | When normalized probabilities across classes are needed. | Probabilities are calibrated. | Computationally expensive. |
Softmax approach (also called multinomial logistic regression)
- C-Class Classification:
  - The goal is to classify the target variable $ y $ into one of $ C $ classes:
$$
y \in \{0, 1, \dots, C-1\}
$$
- Discrete Probability Distribution:
  - The probabilities $ \theta_0, \theta_1, \dots, \theta_{C-1} $ represent the likelihood of a data point belonging to each class.
  - These probabilities satisfy:
$$
\theta_i \in [0, 1] \quad \text{and} \quad \sum_{i=0}^{C-1} \theta_i = 1
$$
- Link Function:
  - The relationship between the linear model ($ X\beta $) and the class probabilities is established through a generalized logit link (whose inverse is the softmax function), using class $ C-1 $ as the reference, for $ i = 0, \dots, C-2 $:
$$
g(\theta_i) = \log \left( \frac{\theta_i}{1 - \sum_{u=0}^{C-2} \theta_u} \right) = X^T \beta_i
$$
- Class Probabilities:
  - For each class $ i = 0, \dots, C-2 $, the probability is computed as:
$$
P(y = i | X, \beta) = \frac{e^{X^T \beta_i}}{1 + \sum_{j=0}^{C-2} e^{X^T \beta_j}}
$$
  - For the last (reference) class $ C-1 $, the probability is:
$$
P(y = C-1 | X, \beta) = \frac{1}{1 + \sum_{j=0}^{C-2} e^{X^T \beta_j}}
$$
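A short NumPy sketch of this reference-class parameterization, assuming $ C = 3 $ classes, two features, and arbitrary illustrative coefficients for classes $ 0 $ and $ 1 $ (class $ C-1 = 2 $ serves as the reference and has no coefficient vector of its own):

```python
import numpy as np

C = 3                                  # number of classes
x = np.array([0.5, -1.0])              # one data point with two features
B = np.array([[ 0.8, -0.3],            # beta_0 (class 0), illustrative values
              [-0.2,  0.6]])           # beta_1 (class 1); class 2 is the reference

scores = B @ x                         # X^T beta_i for i = 0, ..., C-2
denom = 1.0 + np.exp(scores).sum()     # 1 + sum_j e^{X^T beta_j}

probs = np.append(np.exp(scores), 1.0) / denom   # last entry is class C-1
print(probs, probs.sum())              # valid distribution: entries sum to 1
```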