Cross-Entropy Loss is used for the majority of classification tasks in Machine Learning and has a relatively simple equation. Let \(y = (y_{1}, \dots, y_{K})\) be the true distribution (often one-hot encoded) and \(\hat y = (\hat y_{1}, \dots, \hat y_{K})\) the predicted probabilities. We can then write the Cross-Entropy loss as follows.
\[ H(y, \hat y) \;=\; -\sum_{k=1}^{K} y_{k}\,\log\hat y_{k} \]

But where does this formula come from?
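Before we derive it, here is a minimal NumPy sketch of the formula for a single example (the toy vectors below are made up for illustration):

```python
import numpy as np

def cross_entropy(y, y_hat, eps=1e-12):
    """H(y, y_hat) = -sum_k y_k * log(y_hat_k)."""
    y_hat = np.clip(y_hat, eps, 1.0)   # guard against log(0)
    return -np.sum(y * np.log(y_hat))

y = np.array([0.0, 1.0, 0.0])       # one-hot true distribution (assumed)
y_hat = np.array([0.1, 0.7, 0.2])   # predicted probabilities (assumed)
print(cross_entropy(y, y_hat))      # = -log(0.7) ≈ 0.357
```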
Let’s imagine we have an image classification model, defined by its parameters \(\theta\). Given an input image \(x_{i}\), the model produces an output distribution \(\hat y\).
Naturally, what we would like to compute is how similar the prediction of our model is to the ground truth, or more formally, the similarity between the model’s predicted distribution \(\hat y\) and the true distribution \(y\).
So let’s take a step back and see how we can measure this similarity for a less abstract probability distribution. Consider two coins: a fair coin with probability distribution \(P\), where heads and tails each have a probability of \(0.5\), and a mystery coin with unknown distribution \(Q\), where heads has probability \(q_1\) and tails \(q_2\).
\[ P = \begin{cases} p_1=0.5, & \text{heads}\\ p_2=0.5, & \text{tails} \end{cases} \qquad Q = \begin{cases} q_1, & \text{heads}\\ q_2, & \text{tails} \end{cases} \]

How can we reason about the similarity of these two coins? One way would be to perform an experiment. The core idea is that if \(Q\) is similar to \(P\), it should assign a high probability to the outcomes that \(P\) generates. So, the experiment goes like this: we throw the fair coin \(N\) times and observe heads \(N_{H}\) times and tails \(N_{T}\) times. We can then compare how likely the observed sequence is under both distributions by simply calculating the probability that the fair coin and the mystery coin each assign to this outcome.
\[ P(\text{Observations} \mid \text{fair coin}) = p_1^{N_{H}}\,p_2^{N_{T}} \]

\[ Q(\text{Observations} \mid \text{mystery coin}) = q_1^{N_{H}}\,q_2^{N_{T}} \]

To find the similarity, we calculate the ratio of these two likelihoods. We normalize by the sample size \(N\) and take the logarithm to make the expression easier to analyze.
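To make this concrete, here is a small sketch with assumed numbers: \(N = 100\) flips of the fair coin, of which \(52\) came up heads, and one candidate mystery coin with \(q_1 = 0.7\) and \(q_2 = 0.3\):

```python
import numpy as np

# Assumed toy outcome of the experiment: 100 fair-coin flips, 52 heads, 48 tails.
N_H, N_T = 52, 48
N = N_H + N_T
p1, p2 = 0.5, 0.5                    # fair coin P
q1, q2 = 0.7, 0.3                    # assumed mystery coin Q

lik_P = p1**N_H * p2**N_T            # P(Observations | fair coin)
lik_Q = q1**N_H * q2**N_T            # Q(Observations | mystery coin)

# Ratio of the two likelihoods, normalized by N and taken in log space.
log_ratio = (np.log(lik_P) - np.log(lik_Q)) / N
print(lik_P, lik_Q, log_ratio)
```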
$$ \begin{aligned} \frac{1}{N}\,\log \frac{P(\text{Observations} \mid \text{fair coin})}{Q(\text{Observations} \mid \text{mystery coin})} &= \log \left( \left( \frac{p_{1}^{N_H}\,p_{2}^{N_T}}{q_{1}^{N_H}\,q_{2}^{N_T}} \right)^{\!1/N} \right) && \text{| log rules} \\ \\ &= \frac{N_H}{N} \log p_{1} + \frac{N_T}{N} \log p_{2} - \frac{N_H}{N} \log q_{1} - \frac{N_T}{N} \log q_{2} && \text{| as } N \to \infty \text{, } \frac{N_H}{N} \to p_1 \text{, } \frac{N_T}{N} \to p_2 \\ \\ &\;\to\; p_1 \log p_1 + p_2 \log p_2 - p_1 \log q_1 - p_2 \log q_2 && \text{| group terms} \\ \\ &= p_1 \log \frac{p_1}{q_1} + p_2 \log \frac{p_2}{q_2} \end{aligned} $$

The expression we derived is exactly the Kullback-Leibler (KL) divergence for the discrete case, and the same argument holds not only for two outcomes, but for \(K\) classes!
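As a quick sanity check of the limiting step, the sketch below (random seed and mystery coin values are assumed) simulates fair-coin flips and watches the normalized log-ratio approach its analytic limit as \(N\) grows:

```python
import numpy as np

rng = np.random.default_rng(0)       # seed assumed for reproducibility
p1, p2 = 0.5, 0.5                    # fair coin P
q1, q2 = 0.7, 0.3                    # assumed mystery coin Q

# Analytic limit: p1*log(p1/q1) + p2*log(p2/q2)
analytic = p1 * np.log(p1 / q1) + p2 * np.log(p2 / q2)

for N in (100, 10_000, 1_000_000):
    N_H = rng.binomial(N, p1)        # heads observed in N fair-coin flips
    N_T = N - N_H
    empirical = (N_H / N) * np.log(p1 / q1) + (N_T / N) * np.log(p2 / q2)
    print(N, empirical, analytic)    # empirical -> analytic as N grows
```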
$$ D_{\text{KL}}(P \parallel Q) = \sum_{k=1}^{K} p_k \log \frac{p_k}{q_k} $$

The KL divergence is a famous measure in Machine Learning of how much one distribution differs from another (note that it is not a true distance: it is not symmetric in \(P\) and \(Q\)). And we just derived where it comes from.
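A minimal NumPy version of the \(K\)-class formula could look like this (the two three-class distributions are made up; the printed values also show that the measure is not symmetric):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """D_KL(P || Q) = sum_k p_k * log(p_k / q_k) for discrete distributions."""
    p = np.asarray(p, dtype=float)
    q = np.clip(np.asarray(q, dtype=float), eps, 1.0)
    mask = p > 0                      # terms with p_k = 0 contribute nothing
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = [0.5, 0.3, 0.2]                   # assumed example distribution P
q = [0.4, 0.4, 0.2]                   # assumed example distribution Q
print(kl_divergence(p, q), kl_divergence(q, p))  # note: not symmetric
```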
Now, back to our image classification example: an intuitive training objective is to minimize the dissimilarity between the true distribution \(y\) and our predicted distribution \(\hat y\). Since the KL divergence measures exactly this kind of dissimilarity between distributions, it is a natural starting point.
$$ \begin{aligned} \mathop{\arg\min}_{\theta} D_{\text{KL}}(y \parallel \hat y) &= \mathop{\arg\min}_{\theta} \sum_{k=1}^{K} y_k \log \frac{y_k}{\hat y_k} \\ \\ &= \mathop{\arg\min}_{\theta} \sum_{k=1}^{K} y_k (\log y_k - \log \hat y_k) \\ \\ &= \mathop{\arg\min}_{\theta} \left( \sum_{k=1}^{K} y_k \log y_k - \sum_{k=1}^{K} y_k \log \hat y_k \right) \end{aligned} $$

Because the first summation term does not depend on the model parameters \(\theta\), minimizing the KL divergence w.r.t. \(\theta\) is the same as minimizing \(- \sum_{k=1}^{K} y_k \log \hat y_k\). This term is exactly the Cross-Entropy Loss we started from.
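As a small numeric check (toy vectors assumed), the KL divergence and the Cross-Entropy differ only by the entropy of \(y\), which does not depend on \(\theta\); for a one-hot \(y\) that entropy is zero and the two values coincide:

```python
import numpy as np

def cross_entropy(y, y_hat):
    return -np.sum(y * np.log(y_hat))

def kl_div(y, y_hat):
    mask = y > 0                       # terms with y_k = 0 contribute nothing
    return np.sum(y[mask] * np.log(y[mask] / y_hat[mask]))

y = np.array([0.0, 1.0, 0.0])          # one-hot ground truth (assumed)
y_hat = np.array([0.1, 0.7, 0.2])      # model prediction (assumed)

entropy_y = -np.sum(y[y > 0] * np.log(y[y > 0]))   # = 0 for a one-hot label
print(kl_div(y, y_hat))                            # ≈ 0.357
print(cross_entropy(y, y_hat) - entropy_y)         # same value: KL = CE - H(y)
```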