Cross-Entropy Loss is used for the majority of classification tasks in Machine Learning and has a relatively simple equation. Let \(y = (y_{1}, \dots, y_{K})\) be the true distribution (often one-hot encoded) and \(\hat y = (\hat y_{1}, \dots, \hat y_{K})\) the predicted probabilities. We can then write the Cross-Entropy loss as follows.
\[ H(y, \hat y) \;=\; -\sum_{k=1}^{K} y_{k}\,\log\hat y_{k} \]

But where does this formula come from?
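Before we derive it, here is a minimal NumPy sketch of the formula for a single example (the toy vectors below are made up for illustration):

```python
import numpy as np

def cross_entropy(y, y_hat, eps=1e-12):
    """H(y, y_hat) = -sum_k y_k * log(y_hat_k)."""
    y_hat = np.clip(y_hat, eps, 1.0)   # guard against log(0)
    return -np.sum(y * np.log(y_hat))

y = np.array([0.0, 1.0, 0.0])       # one-hot true distribution (assumed)
y_hat = np.array([0.1, 0.7, 0.2])   # predicted probabilities (assumed)
print(cross_entropy(y, y_hat))      # = -log(0.7) ≈ 0.357
```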
Let’s imagine we have an image classification model, defined by its parameters \(\theta\). Given an input image \(x_{i}\), the model produces an output distribution \(\hat y\).
Naturally, what we would like to compute is how similar the prediction of our model is to the ground truth, or more formally, the similarity between the model’s predicted distribution \(\hat y\) and the true distribution \(y\).
So let’s take a step back and see how we can measure this similarity for a less abstract probability distribution. Consider two coins: a fair coin with probability distribution \(P\), where heads and tails each have a probability of \(0.5\), and a mystery coin with unknown distribution \(Q\), where heads has probability \(q_1\) and tails \(q_2\).
\[ P = \begin{cases} p_1=0.5, & \text{heads}\\ p_2=0.5, & \text{tails} \end{cases} \qquad Q = \begin{cases} q_1, & \text{heads}\\ q_2, & \text{tails} \end{cases} \]

How can we reason about the similarity of these two coins? One way would be to perform an experiment. The core idea is that if \(Q\) is similar to \(P\), it should assign a high probability to the outcomes that \(P\) generates. So, the experiment goes like this: we throw the fair coin \(N\) times and observe heads \(N_{H}\) times and tails \(N_{T}\) times. We can then compare how likely the observed sequence is under both distributions by simply calculating the probability that the fair coin and the mystery coin each assign to this outcome.
\[ P(\text{Observations} \mid \text{fair coin}) = p_1^{N_{H}}\,p_2^{N_{T}} \]

\[ Q(\text{Observations} \mid \text{mystery coin}) = q_1^{N_{H}}\,q_2^{N_{T}} \]

To find the similarity, we calculate the ratio of these two likelihoods. We normalize by the sample size \(N\) and take the logarithm to make the expression easier to analyze.
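To make this concrete, here is a small sketch with assumed numbers: \(N = 100\) flips of the fair coin, of which \(52\) came up heads, and one candidate mystery coin with \(q_1 = 0.7\) and \(q_2 = 0.3\):

```python
import numpy as np

# Assumed toy outcome of the experiment: 100 fair-coin flips, 52 heads, 48 tails.
N_H, N_T = 52, 48
N = N_H + N_T
p1, p2 = 0.5, 0.5                    # fair coin P
q1, q2 = 0.7, 0.3                    # assumed mystery coin Q

lik_P = p1**N_H * p2**N_T            # P(Observations | fair coin)
lik_Q = q1**N_H * q2**N_T            # Q(Observations | mystery coin)

# Ratio of the two likelihoods, normalized by N and taken in log space.
log_ratio = (np.log(lik_P) - np.log(lik_Q)) / N
print(lik_P, lik_Q, log_ratio)
```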
$$ \begin{aligned} \frac{1}{N}\,\log \frac{P(\text{Observations} \mid \text{fair coin})}{Q(\text{Observations} \mid \text{mystery coin})} &= \log \left( \left( \frac{p_{1}^{N_H}\,p_{2}^{N_T}}{q_{1}^{N_H}\,q_{2}^{N_T}} \right)^{\!1/N} \right) && \text{| log rules} \\ \\ &= \frac{N_H}{N} \log p_{1} + \frac{N_T}{N} \log p_{2} - \frac{N_H}{N} \log q_{1} - \frac{N_T}{N} \log q_{2} && \text{| as } N \to \infty \text{, } \frac{N_H}{N} \to p_1 \text{, } \frac{N_T}{N} \to p_2 \\ \\ &\;\to\; p_1 \log p_1 + p_2 \log p_2 - p_1 \log q_1 - p_2 \log q_2 && \text{| group terms} \\ \\ &= p_1 \log \frac{p_1}{q_1} + p_2 \log \frac{p_2}{q_2} \end{aligned} $$

The expression we derived is exactly the Kullback-Leibler (KL) divergence for the discrete case, and the same argument holds not only for two outcomes, but for \(K\) classes!
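As a quick sanity check of the limiting step, the sketch below (random seed and mystery coin values are assumed) simulates fair-coin flips and watches the normalized log-ratio approach its analytic limit as \(N\) grows:

```python
import numpy as np

rng = np.random.default_rng(0)       # seed assumed for reproducibility
p1, p2 = 0.5, 0.5                    # fair coin P
q1, q2 = 0.7, 0.3                    # assumed mystery coin Q

# Analytic limit: p1*log(p1/q1) + p2*log(p2/q2)
analytic = p1 * np.log(p1 / q1) + p2 * np.log(p2 / q2)

for N in (100, 10_000, 1_000_000):
    N_H = rng.binomial(N, p1)        # heads observed in N fair-coin flips
    N_T = N - N_H
    empirical = (N_H / N) * np.log(p1 / q1) + (N_T / N) * np.log(p2 / q2)
    print(N, empirical, analytic)    # empirical -> analytic as N grows
```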
$$ D_{\text{KL}}(P \parallel Q) = \sum_{k=1}^{K} p_k \log \frac{p_k}{q_k} $$

The KL divergence is a famous measure in Machine Learning of how much one distribution differs from another (note that it is not a true distance: it is not symmetric in \(P\) and \(Q\)). And we just derived where it comes from.
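A minimal NumPy version of the \(K\)-class formula could look like this (the two three-class distributions are made up; the printed values also show that the measure is not symmetric):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """D_KL(P || Q) = sum_k p_k * log(p_k / q_k) for discrete distributions."""
    p = np.asarray(p, dtype=float)
    q = np.clip(np.asarray(q, dtype=float), eps, 1.0)
    mask = p > 0                      # terms with p_k = 0 contribute nothing
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = [0.5, 0.3, 0.2]                   # assumed example distribution P
q = [0.4, 0.4, 0.2]                   # assumed example distribution Q
print(kl_divergence(p, q), kl_divergence(q, p))  # note: not symmetric
```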
Now, back to our image classification example: an intuitive training objective is to minimize the dissimilarity between the true distribution \(y\) and our predicted distribution \(\hat y\). Since the KL divergence measures exactly this kind of dissimilarity between distributions, it is a natural starting point.
$$ \begin{aligned} \mathop{\arg\min}_{\theta} D_{\text{KL}}(y \parallel \hat y) &= \mathop{\arg\min}_{\theta} \sum_{k=1}^{K} y_k \log \frac{y_k}{\hat y_k} \\ \\ &= \mathop{\arg\min}_{\theta} \sum_{k=1}^{K} y_k (\log y_k - \log \hat y_k) \\ \\ &= \mathop{\arg\min}_{\theta} \left( \sum_{k=1}^{K} y_k \log y_k - \sum_{k=1}^{K} y_k \log \hat y_k \right) \end{aligned} $$

Because the first summation term does not depend on the model parameters \(\theta\), minimizing the KL divergence w.r.t. \(\theta\) is the same as minimizing \(- \sum_{k=1}^{K} y_k \log \hat y_k\). This term is exactly the Cross-Entropy Loss we started from.
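As a small numeric check (toy vectors assumed), the KL divergence and the Cross-Entropy differ only by the entropy of \(y\), which does not depend on \(\theta\); for a one-hot \(y\) that entropy is zero and the two values coincide:

```python
import numpy as np

def cross_entropy(y, y_hat):
    return -np.sum(y * np.log(y_hat))

def kl_div(y, y_hat):
    mask = y > 0                       # terms with y_k = 0 contribute nothing
    return np.sum(y[mask] * np.log(y[mask] / y_hat[mask]))

y = np.array([0.0, 1.0, 0.0])          # one-hot ground truth (assumed)
y_hat = np.array([0.1, 0.7, 0.2])      # model prediction (assumed)

entropy_y = -np.sum(y[y > 0] * np.log(y[y > 0]))   # = 0 for a one-hot label
print(kl_div(y, y_hat))                            # ≈ 0.357
print(cross_entropy(y, y_hat) - entropy_y)         # same value: KL = CE - H(y)
```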