Softmax
Key idea: $f_c(x) =$ posterior probability of class $c$
- A perceptron has a one-hot output vector, in which $f_c(x) = 1$ if the neural net thinks $c$ is the most likely value of $y$, and 0 otherwise.
- A softmax computes $f_c(x) \approx \Pr(Y = c|x)$. The conditions for this to be true are:
  - It needs to satisfy the axioms of probability:
    $$ 0 \leq f_c(x) \leq 1, \quad \sum_{c=1}^{v} f_c(x) = 1$$
  - The weight matrix, $W$, is trained using a loss function that encourages $f(x)$ to approximate the posterior probability of the labels on some training dataset:
    $$f_c(x) \approx \Pr(Y = c|x)$$
Softmax satisfies the axioms of probability
- Axiom #1, probabilities are non-negative $(f_k(x) \geq 0)$. There are many ways to do this, but one way that works is to choose:
  $$
  f_c(x) \propto \exp(w_c^T x + b_c)
  $$
- Axiom #2, probabilities should sum to one $(\sum_{k=1}^{v} f_k(x) = 1)$. This can be done by normalizing:
  $$
  f(x) = [f_1(x), \ldots, f_v(x)]^T
  $$
  $$
  f_c(x) = \frac{\exp(w_c^T x + b_c)}{\sum_{k=1}^{v} \exp(w_k^T x + b_k)}
  $$
where $w_k^T$ is the $k^{th}$ row of the matrix $W$.
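As a concrete illustration (not part of the original notes), here is a minimal NumPy sketch of this softmax. The function name and the toy numbers are placeholders; subtracting the maximum score before exponentiating is a standard numerical-stability trick that leaves the result unchanged.

```python
import numpy as np

def softmax(W, b, x):
    """Compute f(x) = softmax(Wx + b) for a single input vector x.

    W: (v, d) weight matrix whose k-th row is w_k
    b: (v,)   bias vector
    x: (d,)   input vector
    Returns a length-v vector of class probabilities.
    """
    z = W @ x + b
    z = z - np.max(z)          # shift by the max for numerical stability; result is unchanged
    e = np.exp(z)
    return e / np.sum(e)       # normalize so the outputs sum to one

# Toy example: v = 3 classes, d = 2 features (made-up numbers)
W = np.array([[1.0, -0.5], [0.2, 0.3], [-1.0, 0.8]])
b = np.zeros(3)
x = np.array([0.5, 1.5])
f = softmax(W, b, x)
print(f, f.sum())              # class probabilities, summing to 1.0
```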
The logistic sigmoid function
For a two-class classifier, we don’t really need the vector label. If we define $w = w_1 - w_2$ and $b = b_1 - b_2$, then the softmax simplifies to:
$$
f(Wx + b) =
\begin{bmatrix}
\text{Pr}(Y = 1|x) \\
\text{Pr}(Y = 2|x)
\end{bmatrix} =
\begin{bmatrix}
\frac{1}{1+e^{-(w^T x + b)}} \\
\frac{e^{-(w^T x + b)}}{1+e^{-(w^T x + b)}}
\end{bmatrix} =
\begin{bmatrix}
\sigma(w^Tx + b) \\
1 - \sigma(w^Tx + b)
\end{bmatrix}
$$
… so instead of the softmax, we use a scalar function called the logistic sigmoid function:
$$
\sigma(z) = \frac{1}{1+e^{-z}}
$$
This function is called sigmoid because it is S-shaped.
For $z \to -\infty$, $\sigma(z) \to 0$
For $z \to +\infty$, $\sigma(z) \to 1$
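As a quick sanity check (a sketch with made-up parameters, not from the original notes), the two-class softmax can be compared numerically against $[\sigma(w^T x + b),\; 1 - \sigma(w^T x + b)]$ with $w = w_1 - w_2$ and $b = b_1 - b_2$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy two-class parameters (made-up numbers)
w1, b1 = np.array([0.4, -1.2]), 0.1
w2, b2 = np.array([-0.3, 0.7]), -0.5
x = np.array([1.0, 2.0])

z = np.array([w1 @ x + b1, w2 @ x + b2])
p = np.exp(z - z.max()) / np.exp(z - z.max()).sum()   # two-class softmax

w, b = w1 - w2, b1 - b2                               # difference vector from the notes
print(p)                                              # [Pr(Y=1|x), Pr(Y=2|x)]
print(sigmoid(w @ x + b), 1 - sigmoid(w @ x + b))     # should match the softmax outputs
```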
Gradient descent
Suppose we have training tokens $(x_i, y_i)$, and we have some initial class vectors $w_1$ and $w_2$. We want to update them as
$$
w_1 \leftarrow w_1 - \eta \frac{\partial \mathcal{L}}{\partial w_1}
$$
$$
w_2 \leftarrow w_2 - \eta \frac{\partial \mathcal{L}}{\partial w_2}
$$
…where $\mathcal{L}$ is some loss function. What loss function makes sense?
Zero-one loss function
The most obvious loss function for a classifier is its classification error rate,
$$
\mathcal{L} = \frac{1}{n} \sum_{i=1}^{n} \ell(f(x_i), y_i)
$$
where $\ell(f(x), y)$ is the zero-one loss function,
$$
\ell(f(x), y) =
\begin{cases}
0 & \text{if } f(x) = y \\
1 & \text{if } f(x) \neq y
\end{cases}
$$
The problem with zero-one loss is that it’s not differentiable.
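A short sketch of zero-one loss as an empirical error rate (hypothetical helper name, toy data, 0-based class indices). Because the prediction comes from an argmax, the loss is piecewise constant in $W$, which is exactly the non-differentiability problem noted above.

```python
import numpy as np

def error_rate(f_outputs, labels):
    """Zero-one loss averaged over a dataset.

    f_outputs: (n, v) classifier outputs (e.g., softmax probabilities)
    labels:    (n,)   integer class labels, here indexed 0..v-1
    """
    predictions = np.argmax(f_outputs, axis=1)   # most likely class for each token
    return np.mean(predictions != labels)        # fraction of misclassified tokens

# Toy example: 3 tokens, the third one misclassified
f_outputs = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]])
labels = np.array([0, 1, 1])
print(error_rate(f_outputs, labels))             # 0.333...
```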
A loss function that learns probabilities
Suppose we have a softmax output, so we want $f_c(x) \approx \Pr(Y = c|x)$. We can train this by learning $W$ and $b$ to maximize the probability of the training corpus. If we assume all training tokens are independent, we get:
$$
W, b = \underset{W,b}{\text{argmax}} \prod_{i=1}^{n} \Pr(Y = y_i|x_i) = \underset{W,b}{\text{argmax}} \sum_{i=1}^{n} \ln \Pr(Y = y_i|x_i)
$$
But remember that $f_c(x) \approx \Pr(Y = c|x)$! Therefore, maximizing the log probability of training data is the same as minimizing the cross entropy between the neural net and the ground truth:
$$
W, b = \underset{W,b}{\text{argmin}} \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}_i, \quad \mathcal{L}_i = - \ln f_{y_i}(x_i)
$$
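A hedged NumPy sketch of this training criterion, averaging $-\ln f_{y_i}(x_i)$ over the corpus with the softmax defined earlier (the function name, array shapes, and 0-based class indexing are assumptions, not part of the notes):

```python
import numpy as np

def cross_entropy_loss(W, b, X, y):
    """Average cross-entropy (1/n) * sum_i -ln f_{y_i}(x_i) for a softmax classifier.

    W: (v, d), b: (v,)
    X: (n, d) training vectors (one row per token)
    y: (n,)   integer labels, here indexed 0..v-1
    """
    Z = X @ W.T + b                        # scores z_c = w_c^T x_i + b_c, shape (n, v)
    Z = Z - Z.max(axis=1, keepdims=True)   # stability shift; the softmax is unchanged
    P = np.exp(Z)
    P = P / P.sum(axis=1, keepdims=True)   # softmax probabilities f_c(x_i)
    n = X.shape[0]
    return -np.mean(np.log(P[np.arange(n), y]))   # -ln of each true-class probability
```

Minimizing this quantity over $W$ and $b$ is the same as maximizing the log probability of the training corpus.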
Cross-entropy
This loss function:
$$
\mathcal{L} = - \ln f_{y}(x)
$$
is called cross-entropy. It measures the difference in randomness between:
- Truth: $Y = y$ with probability 1.0, whose log probability is $\ln(1.0) = 0$, minus the
- Neural net estimate: $Y = y$ with probability $f_{y}(x)$, whose log probability is $\ln f_{y}(x)$.
Thus
$$
\mathcal{L} = 0 - \ln f_{y}(x)
$$
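For instance, if the network assigns the true class probability $f_{y}(x) = 0.25$, then $\mathcal{L} = -\ln 0.25 \approx 1.39$; a perfect prediction $f_{y}(x) = 1$ gives $\mathcal{L} = 0$, and the loss grows without bound as $f_{y}(x) \to 0$.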
Gradient of the cross-entropy of the softmax
Since we have these definitions:
$$
\mathcal{L} = - \ln f_{y}(x), \quad f_{y}(x) = \frac{\exp(z_{y})}{\sum_{k=1}^{v} \exp(z_{k})}, \quad z_{c} = w_c^T x + b_c
$$
Then:
$$
\frac{\partial \mathcal{L}}{\partial w_c} = \left( \frac{\partial \mathcal{L}}{\partial z_c} \right) \left( \frac{\partial z_c}{\partial w_c} \right) = \left( \frac{\partial \mathcal{L}}{\partial z_c} \right) x
$$
…where:
$$
\frac{\partial \mathcal{L}}{\partial z_c} =
\begin{cases}
f_{c}(x) - 1 & c = y \\
f_{c}(x) & c \neq y
\end{cases}
$$
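To make this gradient concrete, here is a small sketch (toy random parameters, assumed 0-based class indexing, not part of the original derivation) that forms $\partial \mathcal{L}/\partial w_c = \epsilon_c x$ analytically and checks one entry against a finite-difference estimate:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def loss(W, b, x, y):
    return -np.log(softmax(W @ x + b)[y])         # cross-entropy for one token

# Toy problem: v = 3 classes, d = 2 features (random made-up parameters)
rng = np.random.default_rng(0)
W, b = rng.normal(size=(3, 2)), rng.normal(size=3)
x, y = np.array([0.3, -1.1]), 2

# Analytic gradient: dL/dw_c = (f_c(x) - [c == y]) * x
f = softmax(W @ x + b)
eps = f.copy()
eps[y] -= 1.0
grad_W = np.outer(eps, x)                         # row c is dL/dw_c

# Finite-difference check on one entry of W
delta = 1e-6
W_pert = W.copy()
W_pert[1, 0] += delta
numeric = (loss(W_pert, b, x, y) - loss(W, b, x, y)) / delta
print(grad_W[1, 0], numeric)                      # should agree to several decimal places
```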
Similarity to linear regression
For linear regression, we had:
$$
\frac{\partial \mathcal{L}}{\partial w} = \epsilon x, \quad \epsilon = f(x) - y
$$
For the softmax classifier with cross-entropy loss, we have
$$
\frac{\partial \mathcal{L}}{\partial w_c} = \epsilon_c x
$$
$$
\epsilon_c =
\begin{cases}
f_c(x) - 1 & c = y \text{ (output should be 1)} \\
f_c(x) & \text{otherwise (output should be 0)}
\end{cases}
$$
Similarity to perceptron
Suppose we have a training token $(x, y)$, and we have some initial class vectors $w_c$. Using softmax and cross-entropy loss, we can update the weight vectors as
$$
w_c \leftarrow w_c - \eta \epsilon_c x
$$
…where
$$
\epsilon_c =
\begin{cases}
f_c(x) - 1 & c = y \\
f_c(x) & \text{otherwise}
\end{cases}
$$
In other words, like a perceptron,
$$
\begin{cases}
\epsilon_c < 0 & c = y \\
\epsilon_c > 0 & \text{otherwise}
\end{cases}
$$
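Putting the pieces together, a minimal sketch of one such update in NumPy (hypothetical function name and learning rate), computing $\epsilon = f(x) - \text{one-hot}(y)$ and applying $w_c \leftarrow w_c - \eta\,\epsilon_c x$ for all classes at once:

```python
import numpy as np

def sgd_step(W, b, x, y, eta=0.1):
    """One stochastic-gradient update for a single training token (x, y)."""
    z = W @ x + b
    f = np.exp(z - z.max())
    f = f / f.sum()                  # softmax outputs f_c(x)
    eps = f.copy()
    eps[y] -= 1.0                    # eps_c = f_c(x) - 1 if c == y, else f_c(x)
    W = W - eta * np.outer(eps, x)   # w_c <- w_c - eta * eps_c * x, all classes at once
    b = b - eta * eps                # the bias gets the same update with x replaced by 1
    return W, b
```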
Outline
- Softmax:
  $$ f_c(x) = \frac{\exp(w_c^T x + b_c)}{\sum_{k=1}^{v} \exp(w_k^T x + b_k)} \approx \Pr(Y = c|x) $$
- Cross-entropy:
  $$ \mathcal{L} = - \ln f_{y}(x) $$
- Derivative of the cross-entropy of a softmax:
  $$ \frac{\partial \mathcal{L}}{\partial w_c} = \epsilon_c x, \quad \epsilon_c =
  \begin{cases}
  f_c(x) - 1 & c = y \text{ (output should be 1)} \\
  f_c(x) & \text{otherwise (output should be 0)}
  \end{cases} $$
- Gradient descent:
  $$ w_c \leftarrow w_c - \eta \epsilon_c x $$