Probability and Information Theory

Chapter Overview

Deep learning is fundamentally a probabilistic framework. Neural networks learn probability distributions over data, make predictions with uncertainty, and are trained using probabilistic objectives. This chapter develops the probability theory and information theory necessary to understand these probabilistic aspects of deep learning.

We cover probability distributions, conditional probability, expectation, and variance—the building blocks for understanding neural network outputs as probabilistic models. We then introduce information theory concepts like entropy, cross-entropy, and KL divergence, which form the basis for loss functions used in training.

Learning Objectives

After completing this chapter, you will be able to:

  1. Work with probability distributions and compute expectations
  2. Apply Bayes' theorem to understand conditional probabilities
  3. Understand entropy as a measure of uncertainty
  4. Derive and apply cross-entropy loss for classification
  5. Use KL divergence to measure distribution differences
  6. Interpret neural network outputs as probability distributions

Probability Fundamentals

Random Variables and Distributions

Definition: A random variable $X$ is a function that maps outcomes from a sample space to real numbers. We distinguish between discrete random variables (countably many outcomes, e.g. class labels) and continuous random variables (real-valued outcomes, e.g. pixel intensities).
Definition: For discrete random variable $X$, the probability mass function is:
$$ P(X = x) = p(x) $$
satisfying: (1) $0 \leq p(x) \leq 1$ for all $x$, and (2) $\sum_x p(x) = 1$
Example: In image classification with 10 classes (digits 0-9), a neural network outputs a probability distribution using softmax:
$$ P(Y = k | \mathbf{x}) = \frac{\exp(z_k)}{\sum_{j=1}^{10} \exp(z_j)} $$

For logits $\mathbf{z} = [2.1, 0.5, -1.2, 3.4, 0.8, -0.5, 1.1, -2.0, 0.3, 1.8]$, the model predicts class 3 (the largest logit, $z_3 = 3.4$) with probability $\approx 56.1\%$.
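The softmax computation can be checked numerically; a minimal sketch with NumPy, using the standard max-subtraction trick for stability:

```python
import numpy as np

logits = np.array([2.1, 0.5, -1.2, 3.4, 0.8, -0.5, 1.1, -2.0, 0.3, 1.8])

# Subtract the max logit before exponentiating to avoid overflow;
# this leaves the softmax output unchanged.
probs = np.exp(logits - logits.max())
probs /= probs.sum()

print(probs.argmax(), probs.max())  # class 3, probability ~0.56
```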

Conditional Probability and Bayes' Theorem

Definition: The probability of event $A$ given event $B$:
$$ P(A|B) = \frac{P(A \cap B)}{P(B)} \quad \text{if } P(B) > 0 $$
Theorem: For events $A$ and $B$ with $P(B) > 0$:
$$ P(A|B) = \frac{P(B|A) P(A)}{P(B)} $$
where $P(A|B)$ is the posterior, $P(B|A)$ is the likelihood, $P(A)$ is the prior, and $P(B)$ is the evidence.
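As a concrete illustration with made-up numbers: suppose 2% of items are defective (prior), a detector flags 95% of defects (likelihood), and it false-alarms on 10% of good items. Bayes' theorem gives the posterior probability that a flagged item is actually defective:

```python
p_defect = 0.02               # prior P(A) -- illustrative value
p_flag_given_defect = 0.95    # likelihood P(B|A)
p_flag_given_good = 0.10      # false-positive rate P(B|not A)

# Evidence P(B) via the law of total probability
p_flag = p_flag_given_defect * p_defect + p_flag_given_good * (1 - p_defect)

# Posterior P(A|B) via Bayes' theorem
posterior = p_flag_given_defect * p_defect / p_flag
print(f"{posterior:.3f}")  # 0.162
```

Despite the accurate detector, most flagged items are still good, because the prior is small: the evidence term is dominated by false positives.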

Information Theory

Entropy

Definition: For discrete random variable $X$ with PMF $p(x)$:
$$ H(X) = -\sum_x p(x) \log p(x) = \mathbb{E}[-\log P(X)] $$

Entropy measures average uncertainty: higher entropy means a less predictable outcome. The base of the logarithm sets the unit: natural log gives nats, while $\log_2$ (used in the coin examples below) gives bits.

Example: Fair coin: $P(\text{heads}) = P(\text{tails}) = 0.5$
$$ H = -[0.5 \log_2(0.5) + 0.5 \log_2(0.5)] = 1 \text{ bit (maximum)} $$

Biased coin: $P(\text{heads}) = 0.9$, $P(\text{tails}) = 0.1$

$$ H \approx 0.469 \text{ bits (lower, more predictable)} $$
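Both coin entropies follow from a few lines of Python (a minimal sketch):

```python
import math

def entropy_bits(p):
    """Shannon entropy in bits; terms with p(x) = 0 contribute nothing."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

print(entropy_bits([0.5, 0.5]))  # 1.0 (fair coin, the maximum for 2 outcomes)
print(entropy_bits([0.9, 0.1]))  # ~0.469 (biased coin, more predictable)
```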

Cross-Entropy

Definition: For true distribution $p$ and predicted distribution $q$:
$$ H(p, q) = -\sum_x p(x) \log q(x) = \mathbb{E}_{x \sim p}[-\log q(x)] $$
Theorem: When the true distribution is one-hot at label $y$, the cross-entropy reduces to the negative log-likelihood of the true class. For predicted probabilities $\hat{\mathbf{p}}$:
$$ L = -\log \hat{p}_y $$
Example: PyTorch's cross-entropy loss for a batch of two 3-class examples with true labels 0 and 1:
import torch
import torch.nn as nn

# Logits: shape (batch_size, num_classes)
logits = torch.tensor([[2.0, 1.0, 0.1],
                       [0.5, 2.5, 1.0]])
labels = torch.tensor([0, 1])

# CrossEntropyLoss applies softmax internally
criterion = nn.CrossEntropyLoss()
loss = criterion(logits, labels)
print(f"Loss: {loss.item():.4f}")
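The same value can be reproduced by hand from the definition, confirming that the loss is just the negative log of the softmax probability assigned to the true class:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 1.0, 0.1],
                       [0.5, 2.5, 1.0]])
labels = torch.tensor([0, 1])

# -log softmax probability of the true class, averaged over the batch
log_probs = F.log_softmax(logits, dim=1)
manual_loss = -log_probs[torch.arange(2), labels].mean()
print(f"Manual loss: {manual_loss.item():.4f}")  # 0.3617, matches CrossEntropyLoss
```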

Kullback-Leibler Divergence

Definition: The KL divergence of distribution $q$ from distribution $p$:
$$ D_{\text{KL}}(p \| q) = \sum_x p(x) \log \frac{p(x)}{q(x)} = H(p, q) - H(p) $$

Properties: (1) $D_{\text{KL}}(p \| q) \geq 0$ with equality iff $p = q$, (2) Not symmetric: $D_{\text{KL}}(p \| q) \neq D_{\text{KL}}(q \| p)$

Minimizing KL divergence is equivalent to minimizing cross-entropy when $p$ is fixed. Training neural networks with cross-entropy loss is maximum likelihood estimation.
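The identity $D_{\text{KL}}(p \| q) = H(p, q) - H(p)$ is easy to verify numerically; a small sketch with two hand-picked distributions:

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])  # "true" distribution (illustrative)
q = np.array([0.5, 0.3, 0.2])  # model distribution (illustrative)

cross_entropy = -np.sum(p * np.log(q))
entropy = -np.sum(p * np.log(p))
kl = np.sum(p * np.log(p / q))

print(np.isclose(kl, cross_entropy - entropy))  # True
```

Since the entropy term depends only on $p$, changing the model $q$ moves cross-entropy and KL divergence by exactly the same amount.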

Practical Considerations for Cross-Entropy and Softmax

The cross-entropy loss computation becomes a significant bottleneck for large vocabularies. The logits tensor has shape $B \times n \times V$ (batch size $\times$ sequence length $\times$ vocabulary size), requiring $4BnV$ bytes in FP32. For BERT-base with $B=32$, $n=512$, $V=30{,}000$, logits alone consume approximately 1.8 GB. The softmax operation over large vocabularies is memory-bandwidth-bound rather than compute-bound on modern GPUs, making vocabulary size one of the most direct levers for controlling training speed.
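The byte count is straightforward to reproduce (assuming 4 bytes per FP32 element):

```python
B, n, V = 32, 512, 30_000   # BERT-base example from the text
bytes_fp32 = 4 * B * n * V  # 4 bytes per FP32 logit

print(bytes_fp32)           # 1966080000 bytes
print(bytes_fp32 / 2**30)   # ~1.83 GiB
```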

Key optimizations for large vocabularies include sampled softmax (approximating the full softmax using a subset of $K$ negative samples), adaptive softmax (exploiting the Zipfian distribution of natural language with hierarchical prediction), and subword tokenization (reducing vocabulary size through BPE or WordPiece). Modern models use vocabularies of 30,000–50,000 subword tokens, balancing per-token cost against sequence length. For a detailed computational analysis, see Chapter 12.

KL Divergence in Practice

KL divergence appears throughout modern deep learning as a measure of distribution similarity. Understanding its computational properties and applications is essential for implementing techniques like variational autoencoders, knowledge distillation, and reinforcement learning from human feedback.

Applications in Modern Deep Learning

Variational Autoencoders (VAEs) use KL divergence as a regularization term to ensure that the learned latent distribution $q(z|x)$ remains close to a prior distribution $p(z)$, typically a standard Gaussian. Training maximizes the evidence lower bound (ELBO), which combines a reconstruction term with a KL divergence penalty; the VAE loss is the negative of this bound:

$$ \mathcal{L}_{\text{ELBO}} = \mathbb{E}_{q(z|x)}[\log p(x|z)] - D_{\text{KL}}(q(z|x) \| p(z)) $$

For a Gaussian encoder with mean $\mu$ and variance $\sigma^2$, the KL divergence to a standard Gaussian has a closed form:

$$ D_{\text{KL}}(q \| p) = \frac{1}{2} \sum_{i=1}^{d} (\mu_i^2 + \sigma_i^2 - \log \sigma_i^2 - 1) $$

This closed form makes VAEs computationally efficient, as the KL term requires only $O(d)$ operations for a $d$-dimensional latent space, typically much smaller than the reconstruction loss computation.
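The closed form is a one-liner in code. A sketch assuming the usual diagonal-Gaussian encoder parameterized by mean and log-variance (the function name is ours, not a library API):

```python
import torch

def kl_to_standard_normal(mu, logvar):
    """D_KL(N(mu, diag(exp(logvar))) || N(0, I)), summed over latent dims."""
    return 0.5 * torch.sum(mu**2 + logvar.exp() - logvar - 1, dim=-1)

mu = torch.zeros(4)
logvar = torch.zeros(4)  # sigma^2 = 1, i.e. the encoder equals the prior
print(kl_to_standard_normal(mu, logvar))  # tensor(0.)
```

Parameterizing by $\log \sigma^2$ rather than $\sigma^2$ keeps the variance positive without constraints and is the standard choice in VAE implementations.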

Knowledge Distillation transfers knowledge from a large teacher model to a smaller student model by minimizing the KL divergence between their output distributions. The student is trained to match not just the hard labels but the full probability distribution produced by the teacher:

$$ \mathcal{L}_{\text{distill}} = \alpha \mathcal{L}_{\text{CE}}(y, \hat{y}_{\text{student}}) + (1-\alpha) T^2 D_{\text{KL}}(\hat{y}_{\text{teacher}} \| \hat{y}_{\text{student}}) $$

where $T$ is a temperature parameter that softens the distributions. The KL divergence term encourages the student to learn the relative confidences between classes that the teacher has learned, not just the most likely class. This is particularly valuable when the teacher assigns non-negligible probability to multiple classes, indicating genuine ambiguity or similarity between categories.

The computational cost of knowledge distillation is dominated by running both teacher and student models, with the KL divergence computation itself being relatively cheap at $O(BnV)$ for batch size $B$, sequence length $n$, and vocabulary size $V$.
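The combined distillation loss can be sketched in PyTorch as follows; the temperature, mixing weight, and random tensors are illustrative values, not a recommended configuration:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Hard-label cross-entropy on the student's raw logits
    ce = F.cross_entropy(student_logits, labels)
    # KL between temperature-softened teacher and student distributions,
    # scaled by T^2 to keep gradient magnitudes comparable to the CE term
    kl = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                  F.softmax(teacher_logits / T, dim=-1),
                  reduction="batchmean")
    return alpha * ce + (1 - alpha) * T**2 * kl

student = torch.randn(8, 10)
teacher = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
loss = distillation_loss(student, teacher, labels)
print(loss)
```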

Reinforcement Learning from Human Feedback (RLHF) uses KL divergence to constrain the policy learned through reinforcement learning to remain close to the original supervised fine-tuned model. This prevents the model from exploiting the reward model by generating adversarial outputs that score highly but are nonsensical. The RLHF objective, which is maximized, includes a KL penalty term:

$$ J_{\text{RLHF}} = \mathbb{E}_{x \sim \mathcal{D}, y \sim \pi_\theta}[r(x, y)] - \beta D_{\text{KL}}(\pi_\theta \| \pi_{\text{ref}}) $$

where $\pi_\theta$ is the policy being optimized, $\pi_{\text{ref}}$ is the reference model, and $\beta$ controls the strength of the KL constraint. Computing this KL divergence requires running both the policy and reference models on the same inputs and computing the divergence over the vocabulary at each token position.

Numerical Stability Considerations

Computing KL divergence naively can lead to numerical instability due to the logarithm of very small probabilities. When $q(x)$ is close to zero, $\log q(x)$ approaches negative infinity, and the product $p(x) \log q(x)$ can produce NaN values or catastrophic cancellation. Similarly, when computing $\log(p(x)/q(x))$, direct division can lose precision for very small probabilities.

The numerically stable approach computes KL divergence in log-space using the log-sum-exp trick. Instead of computing probabilities via softmax and then taking logarithms, we work directly with log-probabilities:

$$ D_{\text{KL}}(p \| q) = \sum_x p(x) \left(\log p(x) - \log q(x)\right) $$

Computing $\log p(x)$ and $\log q(x)$ directly via log-softmax (which applies the log-sum-exp trick internally) avoids ever materializing very small probabilities. Modern deep learning frameworks like PyTorch provide F.kl_div, which operates on log-probabilities directly, ensuring numerical stability even when probabilities span many orders of magnitude.
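In PyTorch this looks as follows. Note that F.kl_div takes the model's log-probabilities as its first argument and, with log_target=True, the target as log-probabilities too, so nothing ever leaves log-space:

```python
import torch
import torch.nn.functional as F

logits_p = torch.tensor([[5.0, -20.0, 1.0]])   # target distribution's logits
logits_q = torch.tensor([[2.0, -30.0, 0.5]])   # model's logits

# Stay in log-space via log_softmax; never materialize tiny probabilities
log_p = F.log_softmax(logits_p, dim=-1)
log_q = F.log_softmax(logits_q, dim=-1)
kl = F.kl_div(log_q, log_p, reduction="batchmean", log_target=True)
print(kl)  # finite, even though one probability is ~1e-11
```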

Another source of instability arises when $p(x) > 0$ but $q(x) = 0$, which makes the KL divergence infinite. In practice, this occurs when the model assigns zero probability to an event that actually occurs in the data. To prevent this, implementations typically add a small epsilon ($\epsilon \approx 10^{-8}$) to probabilities before computing logarithms, or use label smoothing to ensure that the target distribution $p$ never assigns exactly zero probability to any class. Label smoothing replaces hard targets with a mixture of the true label and a uniform distribution:

$$ p_{\text{smooth}}(x) = (1 - \epsilon) p_{\text{true}}(x) + \epsilon / V $$

where $\epsilon \approx 0.1$ is typical. This not only improves numerical stability but also acts as a regularizer that prevents overconfident predictions and often improves generalization.
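A sketch of the smoothing formula applied to one-hot targets (for standard classification, PyTorch's CrossEntropyLoss also accepts a label_smoothing argument that does this internally):

```python
import torch

def smooth_labels(labels, num_classes, eps=0.1):
    """Mix one-hot targets with a uniform distribution over V classes."""
    one_hot = torch.nn.functional.one_hot(labels, num_classes).float()
    return (1 - eps) * one_hot + eps / num_classes

p = smooth_labels(torch.tensor([2]), num_classes=4)
print(p)  # tensor([[0.0250, 0.0250, 0.9250, 0.0250]])
```

Every class now receives probability at least $\epsilon / V$, so the logarithms in the loss stay finite.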

Exercises

Exercise 1: A neural network outputs $\hat{\mathbf{p}} = [0.15, 0.60, 0.20, 0.05]$ for 4 classes. Compute: (1) entropy $H(\hat{\mathbf{p}})$, (2) cross-entropy loss if true label is class 2, (3) optimal output distribution.
Exercise 2: Show that $H(p, q) = H(p) + D_{\text{KL}}(p \| q)$, proving cross-entropy minimization equals KL divergence minimization when $p$ is fixed.
Exercise 3: For binary classifier with $\hat{p} = 0.8$ and true label class 1: (1) Compute binary cross-entropy loss, (2) Find $\frac{\partial L}{\partial \hat{p}}$, (3) Compare loss for $\hat{p} \in \{0.99, 0.2\}$.
Exercise 4: Calculate the memory requirements for storing logits in a GPT-2 model with vocabulary size $V = 50{,}257$, batch size $B = 16$, and sequence length $n = 1024$. How much memory is saved by using FP16 instead of FP32? If you have an NVIDIA A100 with 40 GB of memory, what is the maximum batch size you can use if logits consume at most 25\% of available memory?
Exercise 5: For sampled softmax with $K = 5{,}000$ negative samples and vocabulary size $V = 100{,}000$: (1) Calculate the speedup factor for the forward pass compared to full softmax, (2) Compute the memory reduction for a batch of 32 sequences with 512 tokens each, (3) Discuss why sampled softmax introduces bias in gradient estimates.
Exercise 6: An NVIDIA A100 GPU has memory bandwidth of 1.5 TB/s and FP16 compute throughput of 312 TFLOPS. For softmax over a vocabulary of $V = 30{,}000$ tokens: (1) Calculate the time to read and write the logits and probabilities (400 KB total), (2) Calculate the time to compute 30,000 exponentials and divisions at peak throughput, (3) Determine whether the operation is compute-bound or bandwidth-bound and by what factor.
Exercise 7: In knowledge distillation, the KL divergence loss is scaled by $T^2$ where $T$ is the temperature parameter. Explain why this scaling is necessary by: (1) Showing how temperature affects the magnitude of gradients, (2) Deriving the gradient of $D_{\text{KL}}(\text{softmax}(\mathbf{z}/T) \| \text{softmax}(\mathbf{z}'/T))$ with respect to $\mathbf{z}'$, (3) Demonstrating that without $T^2$ scaling, the distillation loss would vanish as $T \to \infty$.

Solutions

Full solutions for all exercises are available at \url{https://deeplearning.hofkensvermeulen.be}.

Solution: For neural network output $\hat{\mathbf{p}} = [0.15, 0.60, 0.20, 0.05]$:

(1) Entropy:

$$\begin{align} H(\hat{\mathbf{p}}) &= -\sum_{i=1}^4 \hat{p}_i \log_2 \hat{p}_i \\ &= -(0.15 \log_2 0.15 + 0.60 \log_2 0.60 + 0.20 \log_2 0.20 + 0.05 \log_2 0.05) \\ &= -(0.15(-2.737) + 0.60(-0.737) + 0.20(-2.322) + 0.05(-4.322)) \\ &= -(-0.411 - 0.442 - 0.464 - 0.216) \\ &= 1.533 \text{ bits} \end{align}$$

(2) Cross-entropy loss for true label class 2:

$$ L = -\log \hat{p}_2 = -\log 0.60 \approx 0.511 \text{ nats} \quad \text{or} \quad -\log_2 0.60 \approx 0.737 \text{ bits} $$

(3) Optimal output distribution: The optimal distribution assigns probability 1 to the correct class:

$$ \mathbf{p}^* = [0, 1, 0, 0] $$

This gives entropy $H(\mathbf{p}^*) = 0$ (no uncertainty) and cross-entropy loss $L = -\log 1 = 0$ (perfect prediction).

Solution: Proof that $H(p, q) = H(p) + D_{\text{KL}}(p \| q)$:

Starting with the definition of cross-entropy:

$$\begin{align} H(p, q) &= -\sum_x p(x) \log q(x) \\ &= -\sum_x p(x) \log q(x) + \sum_x p(x) \log p(x) - \sum_x p(x) \log p(x) \\ &= -\sum_x p(x) \log p(x) + \sum_x p(x) \log \frac{p(x)}{q(x)} \\ &= H(p) + D_{\text{KL}}(p \| q) \end{align}$$

Since $H(p)$ is constant with respect to $q$, minimizing $H(p, q)$ is equivalent to minimizing $D_{\text{KL}}(p \| q)$. This shows that training with cross-entropy loss is equivalent to minimizing the KL divergence between the true distribution and the predicted distribution.

Solution: For binary classifier with $\hat{p} = 0.8$ and true label class 1:

(1) Binary cross-entropy loss:

$$ L = -[y \log \hat{p} + (1-y) \log(1-\hat{p})] = -[1 \cdot \log 0.8 + 0 \cdot \log 0.2] = -\log 0.8 \approx 0.223 $$

(2) Gradient:

$$\begin{align} \frac{\partial L}{\partial \hat{p}} &= \frac{\partial}{\partial \hat{p}}[-y \log \hat{p} - (1-y) \log(1-\hat{p})] \\ &= -\frac{y}{\hat{p}} + \frac{1-y}{1-\hat{p}} \\ &= -\frac{1}{0.8} + \frac{0}{0.2} = -1.25 \end{align}$$

(3) Loss comparison:

For $\hat{p} = 0.99$: $L = -\log 0.99 \approx 0.010$. For $\hat{p} = 0.2$: $L = -\log 0.2 \approx 1.609$, roughly 160 times larger. The loss heavily penalizes confident wrong predictions, encouraging the model to be calibrated.

Solution: For GPT-2 with $V = 50{,}257$, $B = 16$, $n = 1024$:

Memory for logits:

$$ B \times n \times V \times 4 \text{ bytes} = 16 \times 1024 \times 50{,}257 \times 4 = 3{,}293{,}642{,}752 \text{ bytes} \approx 3.07 \text{ GB} $$

Memory with FP16:

$$ 16 \times 1024 \times 50{,}257 \times 2 = 1{,}646{,}821{,}376 \text{ bytes} \approx 1.53 \text{ GB} $$

Savings: $3.07 - 1.53 \approx 1.53$ GB (50\% reduction)

Maximum batch size with 25\% memory budget: Available memory: $0.25 \times 40{,}000 \text{ MB} = 10{,}000$ MB

For FP16 logits:

$$ B = \frac{10{,}000 \text{ MB}}{n \times V \times 2 \text{ bytes}} = \frac{10{,}000 \times 10^6}{1024 \times 50{,}257 \times 2} \approx 97 \text{ sequences} $$

With FP32, maximum batch size would be only 48 sequences.
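The arithmetic can be checked directly (a quick sketch, using decimal bytes as in the solution):

```python
n, V = 1024, 50_257
budget = 0.25 * 40e9  # 25% of 40 GB, in decimal bytes

max_b_fp16 = int(budget // (n * V * 2))  # 2 bytes per FP16 logit
max_b_fp32 = int(budget // (n * V * 4))  # 4 bytes per FP32 logit
print(max_b_fp16, max_b_fp32)  # 97 48
```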

Solution: For sampled softmax with $K = 5{,}000$ and $V = 100{,}000$:

(1) Speedup factor: The output-layer projection and softmax scale linearly with the number of candidate classes, so the forward-pass speedup is approximately $V / K = 100{,}000 / 5{,}000 = 20\times$.

(2) Memory reduction: For a batch of 32 sequences with 512 tokens, full-softmax logits in FP32 require $32 \times 512 \times 100{,}000 \times 4 \approx 6.55$ GB, while sampled logits require $32 \times 512 \times 5{,}000 \times 4 \approx 0.33$ GB, again a $20\times$ reduction.

(3) Why sampled softmax introduces bias: The normalizing constant $\sum_j \exp(z_j)$ is estimated from only $K$ sampled classes. Because the logarithm is concave, the log of this estimate is a biased estimate of the true log-partition function (Jensen's inequality), so the resulting gradients are biased. Importance-weighting the samples reduces but does not eliminate this bias for finite $K$.

Solution: For A100 GPU with 1.5 TB/s bandwidth and 312 TFLOPS FP16 throughput, softmax over $V = 30{,}000$:

(1) Memory transfer time:

$$ \text{Time} = \frac{400 \text{ KB}}{1{,}500{,}000{,}000 \text{ KB/s}} = \frac{400}{1{,}500{,}000{,}000} \approx 0.267 \text{ microseconds} $$

(2) Compute time: For 30,000 exponentials and 30,000 divisions:

$$ \text{Time} = \frac{60{,}000 \text{ ops}}{312 \times 10^{12} \text{ ops/s}} \approx 0.000192 \text{ microseconds} $$

(3) Bottleneck analysis: Memory transfer takes $0.267 / 0.000192 \approx 1{,}400$ times longer than computation, so the operation is severely memory-bandwidth-bound.

This extreme memory-bandwidth bottleneck explains why vocabulary size has such a direct impact on training speed, and why reducing precision from FP32 to FP16 provides nearly 2× speedup for softmax operations: it halves the number of bytes moved.

Solution: For knowledge distillation with temperature $T$:

(1) Temperature effect on gradient magnitude: The softmax with temperature is:

$$ p_i = \frac{\exp(z_i/T)}{\sum_j \exp(z_j/T)} $$

As $T$ increases, the distribution becomes more uniform (softer). The gradient magnitude scales as $O(1/T)$ because:

$$ \frac{\partial p_i}{\partial z_j} = \frac{1}{T} p_i(\delta_{ij} - p_j) $$

(2) Gradient derivation: For KL divergence $D_{\text{KL}}(p_{\text{teacher}} \| p_{\text{student}})$ where both use temperature $T$:

$$\begin{align} \frac{\partial D_{\text{KL}}}{\partial z'_i} &= \frac{\partial}{\partial z'_i} \sum_j p_j^T \log \frac{p_j^T}{q_j^T} \\ &= -\sum_j p_j^T \frac{\partial \log q_j^T}{\partial z'_i} \\ &= -\sum_j p_j^T \frac{1}{q_j^T} \frac{\partial q_j^T}{\partial z'_i} \\ &= -\sum_j p_j^T \frac{1}{q_j^T} \cdot \frac{1}{T} q_j^T(\delta_{ij} - q_i^T) \\ &= -\frac{1}{T} \sum_j p_j^T(\delta_{ij} - q_i^T) \\ &= \frac{1}{T}(q_i^T - p_i^T) \end{align}$$

(3) Why $T^2$ scaling is necessary: Without $T^2$ scaling, the gradient is $O(1/T)$, which vanishes as $T \to \infty$:

$$ \lim_{T \to \infty} \frac{1}{T}(q_i^T - p_i^T) = 0 $$

With $T^2$ scaling, the effective gradient becomes:

$$ T^2 \cdot \frac{1}{T}(q_i^T - p_i^T) = T(q_i^T - p_i^T) $$

This compensates for the $1/T$ factor from the softmax derivative, maintaining meaningful gradients even for large $T$. The $T^2$ factor ensures that the distillation loss has the same scale as the hard label loss, allowing proper balancing between the two objectives.
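The derived gradient $\frac{1}{T}(q_i^T - p_i^T)$ can be verified numerically against autograd (a minimal sketch with illustrative logits):

```python
import torch
import torch.nn.functional as F

T = 4.0
z_teacher = torch.tensor([2.0, 1.0, 0.1])
z_student = torch.tensor([0.5, 2.5, 1.0], requires_grad=True)

p = F.softmax(z_teacher / T, dim=-1)          # teacher distribution p^T
log_q = F.log_softmax(z_student / T, dim=-1)  # student log-probs
kl = F.kl_div(log_q, p, reduction="sum")      # D_KL(p || q)
kl.backward()

q = log_q.exp().detach()
analytic = (q - p) / T  # (1/T)(q^T - p^T) from the derivation above
print(torch.allclose(z_student.grad, analytic, atol=1e-6))  # True
```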
