Probability and Information Theory
Chapter Overview
Deep learning is fundamentally a probabilistic framework. Neural networks learn probability distributions over data, make predictions with uncertainty, and are trained using probabilistic objectives. This chapter develops the probability theory and information theory necessary to understand these probabilistic aspects of deep learning.
We cover probability distributions, conditional probability, expectation, and variance—the building blocks for understanding neural network outputs as probabilistic models. We then introduce information theory concepts like entropy, cross-entropy, and KL divergence, which form the basis for loss functions used in training.
Learning Objectives
After completing this chapter, you will be able to:
- Work with probability distributions and compute expectations
- Apply Bayes' theorem to understand conditional probabilities
- Understand entropy as a measure of uncertainty
- Derive and apply cross-entropy loss for classification
- Use KL divergence to measure distribution differences
- Interpret neural network outputs as probability distributions
Probability Fundamentals
Random Variables and Distributions
- Discrete random variables: Take countable values (e.g., class labels)
- Continuous random variables: Take values in continuous ranges
For logits $\vz = [2.1, 0.5, -1.2, 3.4, 0.8, -0.5, 1.1, -2.0, 0.3, 1.8]$, the model assigns the highest probability to the class with logit $3.4$ (index 3), with $\operatorname{softmax}(\vz)_3 \approx 56.1\%$.
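The stated probability can be checked directly by applying softmax to the logits:

```python
import torch

# Logits from the example above
z = torch.tensor([2.1, 0.5, -1.2, 3.4, 0.8, -0.5, 1.1, -2.0, 0.3, 1.8])
probs = torch.softmax(z, dim=0)
print(probs.argmax().item())        # 3: index of the most likely class
print(f"{probs.max().item():.3f}")  # its probability, ≈ 0.561
```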
Conditional Probability and Bayes' Theorem
Information Theory
Entropy
Entropy measures the average uncertainty of a random variable: $H(X) = -\sum_x P(x) \log P(x)$. Higher entropy means more uncertainty; with $\log_2$, a fair coin attains the two-outcome maximum of $1$ bit.
Biased coin: $P(\text{heads}) = 0.9$, $P(\text{tails}) = 0.1$ gives $H = -(0.9 \log_2 0.9 + 0.1 \log_2 0.1) \approx 0.469$ bits, much less uncertain than a fair coin.
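A small helper makes the comparison concrete:

```python
import math

def entropy_bits(p):
    """Shannon entropy in bits; terms with p == 0 contribute 0."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

print(f"{entropy_bits([0.5, 0.5]):.3f}")  # fair coin: 1.000 bit
print(f"{entropy_bits([0.9, 0.1]):.3f}")  # biased coin: 0.469 bits
```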
Cross-Entropy
Cross-entropy measures how well a predicted distribution $q$ matches the true distribution $p$: $H(p, q) = -\sum_x p(x) \log q(x)$. For a one-hot label on class $y$ it reduces to $L = -\log \hat{p}_y$. With the true class in the second position:
- Predicted: $\hat{\mathbf{p}} = [0.2, 0.6, 0.2]$ $\Rightarrow$ $L = -\log(0.6) \approx 0.511$
- More confident: $\hat{\mathbf{p}} = [0.1, 0.8, 0.1]$ $\Rightarrow$ $L = -\log(0.8) \approx 0.223$ (better)
- Wrong prediction: $\hat{\mathbf{p}} = [0.7, 0.2, 0.1]$ $\Rightarrow$ $L = -\log(0.2) \approx 1.609$ (bad)
import torch
import torch.nn as nn
# Logits: shape (batch_size, num_classes)
logits = torch.tensor([[2.0, 1.0, 0.1],
                       [0.5, 2.5, 1.0]])
labels = torch.tensor([0, 1])
# CrossEntropyLoss applies softmax internally
criterion = nn.CrossEntropyLoss()
loss = criterion(logits, labels)
print(f"Loss: {loss.item():.4f}")
Kullback-Leibler Divergence
The KL divergence measures how a distribution $q$ diverges from a reference distribution $p$: $D_{\text{KL}}(p \| q) = \sum_x p(x) \log \frac{p(x)}{q(x)}$. Properties: (1) $D_{\text{KL}}(p \| q) \geq 0$ with equality iff $p = q$ (Gibbs' inequality), (2) Not symmetric: $D_{\text{KL}}(p \| q) \neq D_{\text{KL}}(q \| p)$ in general, so it is not a true distance metric.
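Both properties are easy to observe on a small example (the distributions here are arbitrary illustrations):

```python
import torch

def kl_div(p, q):
    """D_KL(p || q) for discrete distributions given as tensors."""
    return torch.sum(p * (torch.log(p) - torch.log(q)))

p = torch.tensor([0.5, 0.4, 0.1])
q = torch.tensor([0.3, 0.3, 0.4])
print(f"{kl_div(p, q).item():.4f}")  # D_KL(p || q)
print(f"{kl_div(q, p).item():.4f}")  # D_KL(q || p): a different value
print(f"{kl_div(p, p).item():.4f}")  # zero when the distributions match
```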
Practical Considerations for Cross-Entropy and Softmax
Key optimizations for large vocabularies include sampled softmax (approximating the full softmax using a subset of $K$ negative samples), adaptive softmax (exploiting the Zipfian distribution of natural language with hierarchical prediction), and subword tokenization (reducing vocabulary size through BPE or WordPiece). Modern models use vocabularies of 30,000--50,000 subword tokens, balancing per-token cost against sequence length. For a detailed computational analysis, see Chapter~12.
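As a rough illustration of the sampled-softmax idea (not production code), the loss can be computed over the target class plus $K$ sampled negatives instead of all $V$ classes. The shapes, the uniform proposal, and the absence of an importance-weight correction are all simplifications:

```python
import torch
import torch.nn.functional as F

V, d, K = 100_000, 64, 512            # vocab, hidden dim, negative samples
W = torch.randn(V, d) * 0.01          # output embedding matrix
h = torch.randn(8, d)                 # batch of 8 hidden states
targets = torch.randint(0, V, (8,))

# Score only the targets plus K sampled negatives, not all V classes.
# A uniform proposal is used here for simplicity; real implementations
# sample from e.g. a log-uniform distribution and correct for it.
neg = torch.randint(0, V, (K,))
cols = torch.cat([targets, neg])                 # (8 + K,) candidate classes
logits = h @ W[cols].T                           # (8, 8 + K) instead of (8, V)
loss = F.cross_entropy(logits, torch.arange(8))  # target i sits at column i
print(loss)
```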
KL Divergence in Practice
KL divergence appears throughout modern deep learning as a measure of distribution similarity. Understanding its computational properties and applications is essential for implementing techniques like variational autoencoders, knowledge distillation, and reinforcement learning from human feedback.
Applications in Modern Deep Learning
Variational Autoencoders (VAEs) use KL divergence as a regularization term to ensure that the learned latent distribution $q(z|x)$ remains close to a prior distribution $p(z)$, typically a standard Gaussian. The VAE loss function combines reconstruction loss with a KL divergence term: $\mathcal{L}_{\text{VAE}} = -\mathbb{E}_{q(z|x)}[\log p(x|z)] + D_{\text{KL}}(q(z|x) \,\|\, p(z))$.
For a Gaussian encoder with mean $\mu$ and variance $\sigma^2$, the KL divergence to a standard Gaussian has a closed form: $D_{\text{KL}}(q(z|x) \,\|\, \mathcal{N}(0, I)) = \frac{1}{2} \sum_{i=1}^{d} \big( \mu_i^2 + \sigma_i^2 - \log \sigma_i^2 - 1 \big)$.
This closed form makes VAEs computationally efficient, as the KL term requires only $O(d)$ operations for a $d$-dimensional latent space, typically much smaller than the reconstruction loss computation.
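The closed form is straightforward to implement; a minimal sketch, where the function name and the log-variance parameterization are illustrative rather than from any specific library:

```python
import torch

def gaussian_kl_to_standard_normal(mu, log_var):
    """Closed-form D_KL( N(mu, sigma^2) || N(0, I) ), summed over latent dims.

    log_var = log(sigma^2), the usual VAE encoder parameterization.
    """
    return 0.5 * torch.sum(mu.pow(2) + log_var.exp() - log_var - 1, dim=-1)

mu = torch.zeros(4, 8)       # batch of 4, latent dimension 8
log_var = torch.zeros(4, 8)  # sigma^2 = 1 everywhere
print(gaussian_kl_to_standard_normal(mu, log_var))  # zeros: q already equals p
```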
Knowledge Distillation transfers knowledge from a large teacher model to a smaller student model by minimizing the KL divergence between their output distributions. The student is trained to match not just the hard labels but the full probability distribution produced by the teacher: $\mathcal{L} = (1 - \alpha)\, \mathcal{L}_{\text{CE}}(y, p_{\text{student}}) + \alpha\, T^2\, D_{\text{KL}}\big(p^{(T)}_{\text{teacher}} \,\|\, p^{(T)}_{\text{student}}\big)$,
where $T$ is a temperature parameter that softens the distributions. The KL divergence term encourages the student to learn the relative confidences between classes that the teacher has learned, not just the most likely class. This is particularly valuable when the teacher assigns non-negligible probability to multiple classes, indicating genuine ambiguity or similarity between categories.
The computational cost of knowledge distillation is dominated by running both teacher and student models, with the KL divergence computation itself being relatively cheap at $O(BnV)$ for batch size $B$, sequence length $n$, and vocabulary size $V$.
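A compact sketch of this loss in PyTorch; the function name and the default values of `T` and `alpha` are illustrative:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Hard-label cross-entropy blended with temperature-softened KL."""
    hard = F.cross_entropy(student_logits, labels)
    # F.kl_div(input, target) computes D_KL(target || input), so this is
    # D_KL(teacher || student) over the temperature-softened distributions
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    )
    return (1 - alpha) * hard + alpha * (T ** 2) * soft

student = torch.randn(8, 10)
teacher = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
print(distillation_loss(student, teacher, labels))
```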
Reinforcement Learning from Human Feedback (RLHF) uses KL divergence to constrain the policy learned through reinforcement learning to remain close to the original supervised fine-tuned model. This prevents the model from exploiting the reward model by generating adversarial outputs that score highly but are nonsensical. The RLHF objective includes a KL penalty term: $J(\theta) = \mathbb{E}_{y \sim \pi_\theta}\big[ r(x, y) \big] - \beta\, D_{\text{KL}}\big( \pi_\theta \,\|\, \pi_{\text{ref}} \big)$,
where $\pi_\theta$ is the policy being optimized, $\pi_{\text{ref}}$ is the reference model, and $\beta$ controls the strength of the KL constraint. Computing this KL divergence requires running both the policy and reference models on the same inputs and computing the divergence over the vocabulary at each token position.
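The per-position divergence can be sketched as follows; shapes are illustrative, and in practice an on-sample estimate ($\log \pi_\theta - \log \pi_{\text{ref}}$ at the sampled tokens) is often used instead of the full-vocabulary sum:

```python
import torch
import torch.nn.functional as F

# Hypothetical shapes: (batch, seq_len, vocab) logits from both models
policy_logits = torch.randn(2, 5, 50)
ref_logits = torch.randn(2, 5, 50)

policy_logp = F.log_softmax(policy_logits, dim=-1)
ref_logp = F.log_softmax(ref_logits, dim=-1)

# Full D_KL(pi_theta || pi_ref) over the vocabulary at each token position
per_token_kl = torch.sum(policy_logp.exp() * (policy_logp - ref_logp), dim=-1)
print(per_token_kl.shape)  # (2, 5): one KL value per token position
```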
Numerical Stability Considerations
Computing KL divergence naively can lead to numerical instability due to the logarithm of very small probabilities. When $q(x)$ is close to zero, $\log q(x)$ approaches negative infinity, and the product $p(x) \log q(x)$ can produce NaN values or catastrophic cancellation. Similarly, when computing $\log(p(x)/q(x))$, direct division can lose precision for very small probabilities.
The numerically stable approach computes KL divergence in log-space using the log-sum-exp trick. Instead of computing probabilities via softmax and then taking logarithms, we work directly with log-probabilities: $\log q_i = z_i - \log \sum_j e^{z_j}$, evaluated stably as $z_i - m - \log \sum_j e^{z_j - m}$ with $m = \max_j z_j$, so that $D_{\text{KL}}(p \| q) = \sum_x p(x) \big( \log p(x) - \log q(x) \big)$ never forms the tiny probabilities explicitly.
This formulation avoids computing very small probabilities explicitly. Modern deep learning frameworks like PyTorch provide F.kl\_div, which takes its input as log-probabilities (and the target either as probabilities or, with log\_target=True, also as log-probabilities), ensuring numerical stability even when probabilities span many orders of magnitude.
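A sketch of the stable computation, with illustrative shapes:

```python
import torch
import torch.nn.functional as F

logits_p = torch.randn(4, 10)  # e.g. target / teacher logits
logits_q = torch.randn(4, 10)  # e.g. model logits

# log_softmax applies the log-sum-exp trick internally:
# no tiny probabilities are ever materialized
log_p = F.log_softmax(logits_p, dim=-1)
log_q = F.log_softmax(logits_q, dim=-1)

# F.kl_div(input, target) computes D_KL(target || input);
# log_target=True lets both arguments stay in log-space
kl = F.kl_div(log_q, log_p, reduction="batchmean", log_target=True)
print(kl)
```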
Another source of instability arises when $p(x) > 0$ but $q(x) = 0$, which makes the KL divergence infinite. In practice, this occurs when the model assigns zero probability to an event that actually occurs in the data. To prevent this, implementations typically add a small epsilon ($\epsilon \approx 10^{-8}$) to probabilities before computing logarithms, or use label smoothing to ensure that the target distribution $p$ never assigns exactly zero probability to any class. Label smoothing replaces hard targets with a mixture of the true label and a uniform distribution: $p'_i = (1 - \epsilon)\, \mathbb{1}[i = y] + \frac{\epsilon}{K}$ for $K$ classes and true class $y$,
where $\epsilon \approx 0.1$ is typical. This not only improves numerical stability but also acts as a regularizer that prevents overconfident predictions and often improves generalization.
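PyTorch exposes this directly through the `label_smoothing` argument of `nn.CrossEntropyLoss`; a small comparison with illustrative logits:

```python
import torch
import torch.nn as nn

logits = torch.tensor([[2.0, 1.0, 0.1],
                       [0.5, 2.5, 1.0]])
labels = torch.tensor([0, 1])

# Built-in smoothing: targets become (1 - eps) one-hot + eps / K
smooth = nn.CrossEntropyLoss(label_smoothing=0.1)
hard = nn.CrossEntropyLoss()
print(hard(logits, labels), smooth(logits, labels))
```

For confident, correct predictions like these, the smoothed loss is slightly larger, reflecting the penalty on overconfidence.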
Exercises
Solutions
Full solutions for all exercises are available at \url{https://deeplearning.hofkensvermeulen.be}.
(1) Entropy:
(2) Cross-entropy loss for true label class 2:
(3) Optimal output distribution: The optimal distribution assigns probability 1 to the correct class $y$: $p^*_i = \mathbb{1}[i = y]$, a one-hot vector.
This gives entropy $H(\mathbf{p}^*) = 0$ (no uncertainty) and cross-entropy loss $L = -\log 1 = 0$ (perfect prediction).
Starting with the definition of cross-entropy: $H(p, q) = -\sum_x p(x) \log q(x) = -\sum_x p(x) \log p(x) + \sum_x p(x) \log \frac{p(x)}{q(x)} = H(p) + D_{\text{KL}}(p \| q)$
Since $H(p)$ is constant with respect to $q$, minimizing $H(p, q)$ is equivalent to minimizing $D_{\text{KL}}(p \| q)$. This shows that training with cross-entropy loss is equivalent to minimizing the KL divergence between the true distribution and the predicted distribution.
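The identity $H(p, q) = H(p) + D_{\text{KL}}(p \| q)$ can be checked numerically on a small example (the distributions below are arbitrary):

```python
import math

p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]

H_p = -sum(pi * math.log(pi) for pi in p)                     # entropy H(p)
H_pq = -sum(pi * math.log(qi) for pi, qi in zip(p, q))        # cross-entropy H(p, q)
KL = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))      # D_KL(p || q)

print(abs(H_pq - (H_p + KL)) < 1e-12)  # True
```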
(1) Binary cross-entropy loss: $L = -\big[ y \log \hat{p} + (1 - y) \log(1 - \hat{p}) \big]$
(2) Gradient: with respect to the prediction, $\frac{\partial L}{\partial \hat{p}} = -\frac{y}{\hat{p}} + \frac{1 - y}{1 - \hat{p}}$; composing with a sigmoid output $\hat{p} = \sigma(z)$ gives the simple form $\frac{\partial L}{\partial z} = \hat{p} - y$.
(3) Loss comparison:
- $\hat{p} = 0.99$: $L = -\log 0.99 \approx 0.010$ (very confident, correct)
- $\hat{p} = 0.2$: $L = -\log 0.2 \approx 1.609$ (low confidence, incorrect)
The loss heavily penalizes confident wrong predictions, encouraging the model to be calibrated.
Memory for logits:
Memory with FP16:
Savings: $3.06 - 1.53 = 1.53$ GB (50\% reduction)
Maximum batch size with 25\% memory budget: Available memory: $0.25 \times 40{,}000 \text{ MB} = 10{,}000$ MB
For FP16 logits (sequence length 512, vocabulary 100{,}000, so $512 \times 100{,}000 \times 2 = 102.4$ MB per sequence): $B \leq \frac{10{,}000}{102.4} \approx 97.7$, giving a maximum batch size of 97 sequences.
With FP32, maximum batch size would be only 48 sequences.
(1) Speedup factor:
- Full softmax: $O(V) = 100{,}000$ operations per token
- Sampled softmax: $O(K+1) = 5{,}001$ operations per token
- Speedup: $\frac{100{,}000}{5{,}001} \approx 20\times$
(2) Memory reduction: For batch of 32 sequences with 512 tokens:
- Full softmax logits: $32 \times 512 \times 100{,}000 \times 4 = 6{,}553{,}600{,}000$ bytes $\approx 6.1$ GB
- Sampled softmax logits: $32 \times 512 \times 5{,}001 \times 4 = 327{,}745{,}536$ bytes $\approx 312$ MB
- Reduction: $\frac{6.1 \text{ GB}}{312 \text{ MB}} \approx 20\times$
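These figures can be checked directly:

```python
# Memory for full vs. sampled softmax logits (batch 32, seq 512, FP32)
bytes_full = 32 * 512 * 100_000 * 4
bytes_sampled = 32 * 512 * 5_001 * 4
print(bytes_full)                         # 6,553,600,000 bytes
print(round(bytes_full / 2**30, 1))       # 6.1 GiB
print(bytes_sampled)                      # 327,745,536 bytes
print(round(bytes_sampled / 2**20, 1))    # 312.6 MiB
print(round(bytes_full / bytes_sampled))  # 20x reduction
```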
(3) Why sampled softmax introduces bias:
- The gradient estimate is unbiased only if we sample from the true distribution
- In practice, we sample from a proposal distribution (e.g., unigram frequency)
- This creates importance sampling bias in the gradient
- The normalization constant is approximated, not exact
- Bias decreases as $K$ increases, but never reaches zero
- For large $K$ (e.g., 10,000), bias is negligible for most applications
(1) Memory transfer time:
(2) Compute time: For 30,000 exponentials and 30,000 divisions:
(3) Bottleneck analysis:
- Memory time: 0.267 microseconds
- Compute time: 0.000192 microseconds
- The operation is memory-bound by a factor of $\frac{0.267}{0.000192} \approx 1{,}390\times$
This extreme memory-bandwidth bottleneck explains why vocabulary size has such a direct impact on training speed, and why reducing precision from FP32 to FP16 provides nearly 2× speedup for softmax operations.
(1) Temperature effect on gradient magnitude: The softmax with temperature is: $p_i = \frac{e^{z_i / T}}{\sum_j e^{z_j / T}}$
As $T$ increases, the distribution becomes more uniform (softer). The gradient magnitude scales as $O(1/T)$ because the chain rule applied to $z_i / T$ introduces an explicit $1/T$ factor: $\frac{\partial p_i}{\partial z_j} = \frac{1}{T}\, p_i \big( \delta_{ij} - p_j \big)$
(2) Gradient derivation: For KL divergence $D_{\text{KL}}(p_{\text{teacher}} \| p_{\text{student}})$ where both use temperature $T$, differentiating with respect to the student logit $z_i$ gives: $\frac{\partial D_{\text{KL}}}{\partial z_i} = \frac{1}{T} \big( p^{\text{student}}_i - p^{\text{teacher}}_i \big)$
(3) Why $T^2$ scaling is necessary: Without $T^2$ scaling, the gradient is $O(1/T)$, which vanishes as $T \to \infty$: both softened distributions approach uniform, so $\frac{1}{T} \big( p^{\text{student}}_i - p^{\text{teacher}}_i \big) \to 0$.
With $T^2$ scaling, the effective gradient becomes: $T^2 \cdot \frac{1}{T} \big( p^{\text{student}}_i - p^{\text{teacher}}_i \big) = T \big( p^{\text{student}}_i - p^{\text{teacher}}_i \big) \approx \frac{1}{V} \big( z^{\text{student}}_i - z^{\text{teacher}}_i \big)$ for large $T$ and zero-mean logits.
This compensates for the $1/T$ factor from the softmax derivative, maintaining meaningful gradients even for large $T$. The $T^2$ factor ensures that the distillation loss has the same scale as the hard label loss, allowing proper balancing between the two objectives.
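As a check on the scaling argument, autograd recovers the gradient of the softened KL with respect to the student logits, which equals $\frac{1}{T} \big( p^{\text{student}}_i - p^{\text{teacher}}_i \big)$; a quick sketch:

```python
import torch
import torch.nn.functional as F

T = 4.0
student_logits = torch.randn(5, requires_grad=True)
teacher_logits = torch.randn(5)

p_t = F.softmax(teacher_logits / T, dim=0)
log_p_s = F.log_softmax(student_logits / T, dim=0)

# D_KL(teacher || student), both softened with temperature T
kl = torch.sum(p_t * (torch.log(p_t) - log_p_s))
kl.backward()

expected = (log_p_s.exp() - p_t) / T  # the (1/T)(p_student - p_teacher) formula
print(torch.allclose(student_logits.grad, expected.detach(), atol=1e-6))  # True
```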