Calculus and Optimization
Chapter Overview
Training deep learning models requires optimizing complex, high-dimensional functions. This chapter develops the calculus and optimization theory necessary to understand how neural networks learn from data. We cover multivariable calculus, gradient computation, and the optimization algorithms that power modern deep learning.
The centerpiece of this chapter is backpropagation, the algorithm that efficiently computes gradients in neural networks. We derive backpropagation from first principles, showing how the chain rule enables gradient computation through arbitrarily deep computational graphs. We then explore gradient descent and its variants, which use these gradients to iteratively improve model parameters.
Learning Objectives
After completing this chapter, you will be able to:
- Compute gradients and Jacobians for multivariable functions
- Apply the chain rule to composite functions
- Understand and implement the backpropagation algorithm
- Implement gradient descent and its variants (SGD, momentum, Adam)
- Analyze convergence properties of optimization algorithms
- Apply learning rate schedules and regularization techniques
Multivariable Calculus
Partial Derivatives
At point $(x_1, x_2) = (1, 2)$:
Gradients
The gradient points in the direction of steepest ascent of the function.
For the mean squared error loss $L = \frac{1}{N}\sum_{i=1}^{N}(\vw\transpose \vx_i - y_i)^2$, the gradient with respect to $\vw$ is $\nabla_{\vw} L = \frac{2}{N}\sum_{i=1}^{N}(\vw\transpose \vx_i - y_i)\,\vx_i$.
For $N=1$, $\vw = [w_1, w_2]\transpose$, $\vx = [1, 2]\transpose$, $y = 5$, and current prediction $\hat{y} = \vw\transpose \vx = 3$, this gives $\nabla_{\vw} L = 2(\hat{y} - y)\vx = 2(3 - 5)[1, 2]\transpose = [-4, -8]\transpose$.
The negative gradient $-\nabla_{\vw} L = [4, 8]\transpose$ points toward better parameters.
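This arithmetic can be checked with autograd. A minimal sketch, assuming the squared-error loss $L = (\hat{y} - y)^2$ and weights $\vw = [1, 1]\transpose$ (one choice, not stated in the text, that gives $\hat{y} = 3$):

```python
import torch

# Assumed weights: w = [1, 1] so that y_hat = w @ x = 3, matching the example
w = torch.tensor([1.0, 1.0], requires_grad=True)
x = torch.tensor([1.0, 2.0])
y = 5.0

y_hat = w @ x            # prediction: 3.0
loss = (y_hat - y) ** 2  # squared error: 4.0
loss.backward()          # computes dL/dw = 2 * (y_hat - y) * x

print(w.grad)            # tensor([-4., -8.])
```

Negating `w.grad` recovers the descent direction $[4, 8]\transpose$ from the text.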
The Chain Rule
In vector form:
Let $\vz = \mW\vx + \vb$ (pre-activation) and $\mathbf{a} = \sigma(\vz)$ (activation). Then the chain rule gives $\frac{\partial L}{\partial \vz} = \frac{\partial L}{\partial \mathbf{a}} \odot \sigma'(\vz)$,
where $\odot$ denotes element-wise multiplication.
Concrete example: For ReLU activation $\sigma(z) = \max(0, z)$:
If $\vz = [2.0, -1.0, 0.5]\transpose$, then $\sigma'(\vz) = [1, 0, 1]\transpose$.
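This derivative can be verified with autograd: summing the activations makes the upstream gradient a vector of ones, so `z.grad` equals $\sigma'(\vz)$ directly:

```python
import torch

z = torch.tensor([2.0, -1.0, 0.5], requires_grad=True)
torch.relu(z).sum().backward()  # upstream gradient of ones isolates sigma'(z)
print(z.grad)                   # tensor([1., 0., 1.])
```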
Jacobian and Hessian Matrices
The Hessian describes the local curvature of the function. For smooth functions, $\mH$ is symmetric.
Gradient Descent
The Gradient Descent Algorithm
Gradient descent iteratively moves parameters in the direction opposite to the gradient: $\vw_{t+1} = \vw_t - \eta \nabla_{\vw} L(\vw_t)$, where $\eta > 0$ is the learning rate.
The parameters converge to $w^* = 0$ (the minimum).
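A minimal sketch of this convergence, assuming the objective $f(w) = w^2$ (the worked example's function is not reproduced here, so this choice is illustrative):

```python
# Gradient descent on the assumed objective f(w) = w^2, with f'(w) = 2w.
# Each step multiplies w by (1 - 2*eta), so with eta = 0.1 the iterate
# shrinks geometrically toward the minimum w* = 0.
w, eta = 1.0, 0.1
for _ in range(50):
    w -= eta * 2 * w
print(w)  # ~1.4e-5, i.e. 0.8**50
```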
Stochastic Gradient Descent (SGD)
For large datasets, computing the full gradient is expensive. SGD approximates the gradient using mini-batches.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Synthetic regression data (stand-in for a real dataset)
dataset = TensorDataset(torch.randn(256, 10), torch.randn(256, 1))
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)

# Model and loss
model = nn.Linear(10, 1)
criterion = nn.MSELoss()

# SGD optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Training loop
for epoch in range(100):
    for x_batch, y_batch in dataloader:
        # Forward pass
        y_pred = model(x_batch)
        loss = criterion(y_pred, y_batch)

        # Backward pass
        optimizer.zero_grad()  # Clear previous gradients
        loss.backward()        # Compute gradients
        optimizer.step()       # Update parameters
Momentum
Momentum accelerates SGD by accumulating a velocity vector, drawing inspiration from physics where a ball rolling down a hill builds up speed.
The Intuition Behind Momentum
Standard SGD can be slow and oscillatory, especially in regions where the loss surface has different curvatures in different directions (ravines). Consider a loss landscape that is steep in one direction but shallow in another:
- Without momentum: The gradient points steeply downward, causing large oscillations perpendicular to the optimal path. Progress along the shallow direction is slow because each step only uses the current gradient.
- With momentum: The velocity accumulates gradients over time. Oscillations in steep directions cancel out (positive and negative gradients average), while consistent gradients in shallow directions accumulate, accelerating progress.
Mathematical Formulation
The velocity and parameter updates are $\mathbf{v}_t = \beta \mathbf{v}_{t-1} + \nabla_{\vw} L(\vw_t)$ and $\vw_{t+1} = \vw_t - \eta \mathbf{v}_t$, where $\beta \in [0, 1)$ is the momentum coefficient.
The velocity update can be expanded to show its exponential weighting: $\mathbf{v}_t = \sum_{k=0}^{t-1} \beta^k \, \nabla_{\vw} L(\vw_{t-k})$.
With $\beta = 0.9$, the current gradient has weight 1, the previous gradient has weight 0.9, two steps ago has weight 0.81, and so on. This creates a "memory" of approximately $\frac{1}{1-\beta} = 10$ steps.
Why Momentum Works Better
- Dampens oscillations: In directions with high curvature (steep gradients that change sign), the velocity accumulates opposing gradients, reducing oscillation amplitude.
- Accelerates in consistent directions: When gradients consistently point in the same direction, the velocity builds up, effectively increasing the learning rate in that direction.
- Escapes shallow local minima: The accumulated velocity can carry the optimization through small barriers, potentially escaping poor local minima.
- Reduces sensitivity to learning rate: The smoothing effect makes the optimization more robust to learning rate choices.
SGD without momentum:
Progress in $w_1$ direction: $1 \to 0.8 \to 0.64$ (slow, 20\% per step)
SGD with momentum ($\beta = 0.9$):
Progress in $w_1$ direction: $1 \to 0.8 \to 0.46$ (faster, 42.5\% in second step due to accumulated velocity)
Momentum accelerates convergence by 2-3× in this ravine scenario.
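The numbers above can be reproduced in a few lines. The steep coordinate is ignored and the shallow $w_1$ direction is assumed to behave like $f(w_1) = w_1^2$ with $\eta = 0.1$ (these specific choices are inferred from the reported iterates, not stated explicitly):

```python
def grad(w):          # assumed shallow-direction objective f(w) = w^2
    return 2 * w

eta, beta = 0.1, 0.9

# Plain SGD
w = 1.0
sgd_path = [w]
for _ in range(2):
    w -= eta * grad(w)
    sgd_path.append(round(w, 2))    # 1 -> 0.8 -> 0.64

# SGD with momentum
w, v = 1.0, 0.0
mom_path = [w]
for _ in range(2):
    v = beta * v + grad(w)
    w -= eta * v
    mom_path.append(round(w, 2))    # 1 -> 0.8 -> 0.46

print(sgd_path, mom_path)  # [1.0, 0.8, 0.64] [1.0, 0.8, 0.46]
```

The second momentum step is larger because the previous velocity ($0.9 \times 2$) adds to the current gradient ($1.6$).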
Adam Optimizer
Adam (Adaptive Moment Estimation) combines momentum with adaptive learning rates, creating a powerful optimizer that adapts to the geometry of the loss landscape for each parameter individually.
The Intuition Behind Adam
Adam addresses two key limitations of standard SGD with momentum:
- Different parameters need different learning rates: In deep networks, some parameters (e.g., early layer weights) may have small gradients while others (e.g., output layer weights) have large gradients. A single global learning rate is suboptimal.
- Gradient magnitudes vary across training: Early in training, gradients may be large and noisy. Later, they become smaller and more stable. Adam adapts to these changes automatically.
Mathematical Formulation
Understanding Each Component
First moment ($\mathbf{m}_t$): Exponentially weighted average of gradients (momentum): $\mathbf{m}_t = \beta_1 \mathbf{m}_{t-1} + (1 - \beta_1) \nabla_{\vw} L(\vw_t)$
Second moment ($\mathbf{v}_t$): Exponentially weighted average of squared gradients: $\mathbf{v}_t = \beta_2 \mathbf{v}_{t-1} + (1 - \beta_2) \left(\nabla_{\vw} L(\vw_t)\right)^2$, with the square taken element-wise
Bias correction: Since $\mathbf{m}_0 = \mathbf{v}_0 = \mathbf{0}$, early estimates are biased toward zero. The correction factors $\frac{1}{1-\beta_1^t}$ and $\frac{1}{1-\beta_2^t}$ compensate for this initialization bias, ensuring proper scaling in early iterations.
Adaptive update: The final update is $\vw_{t+1} = \vw_t - \alpha \, \frac{\hat{\mathbf{m}}_t}{\sqrt{\hat{\mathbf{v}}_t} + \epsilon}$, where $\hat{\mathbf{m}}_t = \mathbf{m}_t / (1 - \beta_1^t)$ and $\hat{\mathbf{v}}_t = \mathbf{v}_t / (1 - \beta_2^t)$.
Each parameter $w_i$ gets an effective learning rate of $\frac{\alpha}{\sqrt{\hat{v}_{i,t}} + \epsilon}$:
- Parameters with large, consistent gradients: $\hat{v}_{i,t}$ is large $\Rightarrow$ effective learning rate is small (prevents overshooting)
- Parameters with small, noisy gradients: $\hat{v}_{i,t}$ is small $\Rightarrow$ effective learning rate is large (accelerates learning)
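A scalar-parameter sketch of the update with the standard Adam defaults (the vector case applies these operations element-wise):

```python
def adam_step(w, g, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter w with gradient g at step t >= 1."""
    m = beta1 * m + (1 - beta1) * g        # first moment (momentum)
    v = beta2 * v + (1 - beta2) * g * g    # second moment (squared gradients)
    m_hat = m / (1 - beta1 ** t)           # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - alpha * m_hat / (v_hat ** 0.5 + eps)
    return w, m, v

# On the first step m_hat = g and v_hat = g^2, so the update is ~alpha * sign(g),
# independent of the gradient's magnitude
w, m, v = adam_step(0.0, 10.0, 0.0, 0.0, t=1)
print(w)  # ~ -0.001
```

This magnitude-invariance of the first step is a direct consequence of the bias correction.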
| Optimizer | Steps to 95\% Accuracy | Final Accuracy | Training Time |
|---|---|---|---|
| SGD ($\eta = 0.01$) | 8,500 | 96.2\% | 12.3 min |
| SGD + Momentum ($\eta = 0.01$, $\beta = 0.9$) | 4,200 | 97.1\% | 6.1 min |
| Adam ($\alpha = 0.001$) | 2,100 | 97.8\% | 3.0 min |
Adam converges 4× faster than SGD and 2× faster than momentum, while achieving higher final accuracy.
Parameter 1 (consistent large gradients):
Parameter 2 (small noisy gradients):
Parameter 2 gets 25× larger effective learning rate, compensating for its smaller gradients. This automatic adaptation is why Adam excels at training deep networks with heterogeneous parameter scales.
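The 25× figure follows from the steady-state approximation $\hat{v}_i \approx g_i^2$, so the effective learning rate is roughly $\alpha / |g_i|$. A sketch with assumed (hypothetical) steady gradient magnitudes of 0.5 and 0.02:

```python
alpha, eps = 0.001, 1e-8

g1, g2 = 0.5, 0.02               # hypothetical steady gradient magnitudes
lr1 = alpha / (abs(g1) + eps)    # effective lr, large-gradient parameter
lr2 = alpha / (abs(g2) + eps)    # effective lr, small-gradient parameter

print(lr1, lr2, lr2 / lr1)       # ~0.002, ~0.05, ratio ~25
```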
Gradient Computation Complexity
Understanding the computational and memory costs of gradient computation is essential for training large models efficiently.
FLOPs for Gradient Computation
Forward pass (12-layer transformer, sequence length $n = 512$, hidden dimension $d = 768$):
- Self-attention: $12 \times 4n^2d = 12 \times 4(512)^2(768) \approx 48$ GFLOPs
- Feed-forward: $12 \times 2nd(4d) = 12 \times 2(512)(768)(3072) \approx 36$ GFLOPs
- Other operations: $\approx 12$ GFLOPs
- Total forward: $\approx 96$ GFLOPs
Backward pass:
- Gradient computation through each layer: $\approx 96$ GFLOPs
- Gradient accumulation for weight updates: $\approx 97$ GFLOPs
- Total backward: $\approx 193$ GFLOPs
Total per training step: $\approx 289$ GFLOPs
For batch size $B = 32$: $289 \times 32 \approx 9.2$ TFLOPs per batch.
Memory Requirements for Activations
During backpropagation, intermediate activations must be stored for gradient computation.
Automatic Differentiation: Forward vs Reverse Mode
Forward mode:
- Requires $10^8$ forward passes
- Each pass: $\approx 100$ GFLOPs
- Total: $10^{10}$ GFLOPs $= 10^{19}$ FLOPs
- Time on A100 GPU (312 TFLOPS): $\approx 32,000$ seconds $\approx 9$ hours
Reverse mode (backpropagation):
- Requires 1 forward + 1 backward pass
- Total: $\approx 300$ GFLOPs
- Time on A100 GPU: $\approx 0.001$ seconds
- Speedup: $\approx 32$ million×
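The comparison is straightforward arithmetic, assuming the stated $10^8$ parameters, $\approx 100$ GFLOPs per pass, and an A100 running at its 312 TFLOPS peak:

```python
PASS_FLOPS = 100e9    # ~100 GFLOPs per forward pass
A100_FLOPS = 312e12   # peak tensor-core throughput

# Forward mode: one pass per parameter
forward_mode_s = 1e8 * PASS_FLOPS / A100_FLOPS
# Reverse mode: 1 forward + 1 backward (~3 forward passes' worth of FLOPs)
reverse_mode_s = 3 * PASS_FLOPS / A100_FLOPS

print(forward_mode_s / 3600)            # ~8.9 hours
print(reverse_mode_s)                   # ~0.001 s
print(forward_mode_s / reverse_mode_s)  # ~3.3e7, i.e. tens of millions x
```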
Gradient Checkpointing
Gradient checkpointing trades computation for memory by recomputing activations during the backward pass.
Without checkpointing:
- Memory: $8.4$ GB (all activations)
- Computation: $289$ GFLOPs (1 forward + 1 backward)
With checkpointing (every 3 layers):
- Memory: $8.4 / 3 \approx 2.8$ GB (only checkpoints)
- Computation: $96 + 193 + 72 = 361$ GFLOPs (1 forward + 1 backward + 0.75 forward recompute)
- Memory reduction: 3×, Computation increase: 1.25×
For GPT-3 (175B parameters), checkpointing is essential to fit in GPU memory.
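In PyTorch, this trade-off is available through `torch.utils.checkpoint`. A minimal sketch on a toy stack (the layer sizes and segment count here are illustrative, not from the text):

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

# A toy 12-layer stack; only activations at the 4 segment boundaries are kept,
# the rest are recomputed during the backward pass
model = nn.Sequential(*[nn.Sequential(nn.Linear(256, 256), nn.ReLU())
                        for _ in range(12)])
x = torch.randn(8, 256, requires_grad=True)

out = checkpoint_sequential(model, 4, x, use_reentrant=False)
out.sum().backward()
print(x.grad.shape)  # torch.Size([8, 256])
```

The backward pass still produces exact gradients; only the memory/compute balance changes.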
Backpropagation
Backpropagation efficiently computes gradients in neural networks using the chain rule.
Computational Graphs
A computational graph represents the sequence of operations in a neural network. Each node is an operation, and edges carry values/gradients.
Forward pass:
Backward pass:
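Since the worked example's graph is not reproduced here, the following stand-in illustrates both passes on a small assumed graph, $L = (wx + b - y)^2$:

```python
# Assumed computational graph: z = w*x + b, e = z - y, L = e^2
x, y = 2.0, 5.0
w, b = 1.0, 0.5

# Forward pass: record each intermediate value
z = w * x + b          # 2.5
e = z - y              # -2.5
L = e ** 2             # 6.25

# Backward pass: apply the chain rule from L back to the parameters
dL_de = 2 * e          # -5.0
dL_dz = dL_de * 1.0    # -5.0   (e = z - y)
dL_dw = dL_dz * x      # -10.0  (z = w*x + b)
dL_db = dL_dz * 1.0    # -5.0

print(dL_dw, dL_db)    # -10.0 -5.0
```

Note that $\partial L / \partial z$ is computed once and reused for both $w$ and $b$, which is the sharing backpropagation exploits.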
Backpropagation Algorithm
Why Backpropagation is $O(n)$ Not $O(n^2)$
We compute $\frac{\partial L}{\partial z_j}$ once and reuse it for all parameters that affect $z_j$. This sharing prevents the $O(n^2)$ cost of computing each gradient independently.
Naive finite differences: Perturbing each of the $n$ parameters in turn requires $n + 1$ forward passes, each costing $O(n)$ operations, for $O(n^2)$ total.
Backpropagation:
- Forward pass: $O(m) = O(n)$ operations
- Backward pass: $O(m) = O(n)$ operations
- Total: $O(n)$ operations
Speedup: a factor of $O(n)$; for a model with $n = 10^8$ parameters, this is $\approx 10^8\times$.
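The evaluation-count gap can be demonstrated on a toy quadratic loss, whose analytic gradient (what one backward pass produces) is known in closed form:

```python
def loss(w):
    return sum(wi ** 2 for wi in w)   # toy loss with known gradient 2*w

n = 100
w = [0.01 * i for i in range(n)]
eps = 1e-6

# Finite differences: n + 1 loss evaluations for n partial derivatives
base = loss(w)
fd_grad = [(loss(w[:i] + [w[i] + eps] + w[i + 1:]) - base) / eps
           for i in range(n)]

# Backpropagation obtains the same gradient in 1 forward + 1 backward pass
analytic_grad = [2 * wi for wi in w]

max_err = max(abs(a - b) for a, b in zip(fd_grad, analytic_grad))
print(max_err)  # ~1e-6: same gradient, n+1 evaluations vs 2 passes
```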
Optimizer Memory Requirements
Different optimizers have vastly different memory requirements, which becomes critical for large models.
Memory Comparison by Optimizer
| Optimizer | Memory per Parameter | Total Memory Factor |
|---|---|---|
| SGD (no momentum) | 8 bytes (param + gradient) | 1× |
| SGD with momentum | 12 bytes (param + gradient + velocity) | 1.5× |
| Adam | 16 bytes (param + gradient + 2 moments) | 2× |
| Adam (mixed precision) | 16 bytes (FP16 param + FP16 gradient + FP32 master copy + 2 FP32 moments) | 2× |
Model parameters (BERT-base, $110 \times 10^6$ parameters):
- FP32: $110 \times 10^6 \times 4 \text{ bytes} = 440$ MB
- FP16: $110 \times 10^6 \times 2 \text{ bytes} = 220$ MB
SGD with momentum:
- Parameters: 440 MB
- Gradients: 440 MB
- Momentum buffer: 440 MB
- Total: 1,320 MB
Adam optimizer:
- Parameters: 440 MB
- First moment ($\mathbf{m}$): 440 MB
- Second moment ($\mathbf{v}$): 440 MB
- Gradients: 440 MB
- Total: 1,760 MB $\approx 1.7$ GB
Adam with mixed precision:
- FP16 parameters: 220 MB
- FP32 master copy: 440 MB
- FP32 first moment: 440 MB
- FP32 second moment: 440 MB
- FP16 gradients: 220 MB
- Total: 1,760 MB $\approx 1.7$ GB
Note: Mixed precision does not reduce total optimizer memory (the FP32 master copy and moments dominate), but it halves activation and gradient memory and speeds up computation, enabling larger batch sizes.
Model + optimizer states:
- Parameters (FP16): $175 \times 10^9 \times 2 = 350$ GB
- Master copy (FP32): $175 \times 10^9 \times 4 = 700$ GB
- First moment (FP32): 700 GB
- Second moment (FP32): 700 GB
- Gradients (FP16): 350 GB
- Total: 2,800 GB $\approx 2.8$ TB
This requires distributed training across multiple GPUs. With 8× A100 GPUs (80 GB each = 640 GB total), we need model parallelism and optimizer state sharding (e.g., ZeRO optimizer).
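These totals are bytes-per-parameter bookkeeping; a small helper reproduces both the BERT-base and GPT-3 figures (counting parameters, gradients, and Adam moments as in the examples above):

```python
def adam_memory_gb(n_params, mixed_precision=False):
    """Approximate training memory for Adam, in GB (10^9 bytes)."""
    if mixed_precision:
        # FP16 params + FP16 grads + FP32 master copy + FP32 m + FP32 v
        bytes_per_param = 2 + 2 + 4 + 4 + 4
    else:
        # FP32 params + FP32 grads + FP32 m + FP32 v
        bytes_per_param = 4 + 4 + 4 + 4
    return n_params * bytes_per_param / 1e9

print(adam_memory_gb(110e6))                        # ~1.76  (BERT-base)
print(adam_memory_gb(175e9, mixed_precision=True))  # ~2800  (GPT-3, = 2.8 TB)
```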
Impact on GPU Memory Budget
Memory allocation:
- Model parameters: 0.44 GB
- Optimizer states (Adam): 1.32 GB
- Activations (batch size 32): 8.4 GB
- Gradients: 0.44 GB
- Framework overhead: $\approx 2$ GB
- Total: $\approx 12.6$ GB
Remaining: 67.4 GB available for larger batch sizes or longer sequences.
With batch size 256: Activations $\approx 67$ GB, total $\approx 71$ GB (fits comfortably).
Learning Rate Schedules
Learning rate schedules adjust $\eta$ during training to improve convergence.
Learning Rate Impact on Convergence and GPU Utilization
| Learning Rate | Steps to Converge | Wall Time | Final Loss |
|---|---|---|---|
| $1 \times 10^{-5}$ | 100,000 | 12 hours | 1.85 |
| $1 \times 10^{-4}$ | 30,000 | 3.6 hours | 1.82 |
| $5 \times 10^{-4}$ | 15,000 | 1.8 hours | 1.81 |
| $1 \times 10^{-3}$ | 20,000 | 2.4 hours | 1.83 |
| $5 \times 10^{-3}$ | Diverges | - | - |
Optimal learning rate ($5 \times 10^{-4}$) achieves 6.7× faster convergence than conservative rate.
Learning Rate Scaling with Batch Size
This holds approximately for $k \leq 8$. For larger $k$, use gradual warmup.
| Batch Size | Learning Rate | GPU Util. | Steps/sec | Samples/sec |
|---|---|---|---|---|
| 32 | $5 \times 10^{-4}$ | 45\% | 2.1 | 67 |
| 64 | $1 \times 10^{-3}$ | 68\% | 1.8 | 115 |
| 128 | $2 \times 10^{-3}$ | 85\% | 1.4 | 179 |
| 256 | $4 \times 10^{-3}$ | 92\% | 1.0 | 256 |
| 512 | $8 \times 10^{-3}$ | 95\% | 0.6 | 307 |
Larger batches improve GPU utilization but require proportionally larger learning rates. Throughput increases 4.6× from batch 32 to 512.
Practical Learning Rates for Transformers
| Model | Batch Size | Peak Learning Rate |
|---|---|---|
| BERT-base | 256 | $1 \times 10^{-4}$ |
| BERT-large | 256 | $5 \times 10^{-5}$ |
| GPT-2 (117M) | 512 | $2.5 \times 10^{-4}$ |
| GPT-2 (1.5B) | 512 | $1.5 \times 10^{-4}$ |
| GPT-3 (175B) | 3.2M | $6 \times 10^{-5}$ |
| T5-base | 128 | $1 \times 10^{-4}$ |
| T5-11B | 2048 | $1 \times 10^{-4}$ |
Common Schedules
Step Decay: $\eta_t = \eta_0 \cdot \gamma^{\lfloor t/s \rfloor}$, which multiplies the learning rate by a factor $\gamma$ (e.g., 0.1) every $s$ steps.
Exponential Decay: $\eta_t = \eta_0 \, e^{-kt}$ for a decay rate $k > 0$.
Cosine Annealing: $\eta_t = \eta_{\min} + \frac{1}{2}(\eta_0 - \eta_{\min})\left(1 + \cos\frac{\pi t}{T}\right)$ over a horizon of $T$ steps.
Warmup + Decay (Transformers): $\eta_t = d_{\text{model}}^{-0.5} \cdot \min\!\left(t^{-0.5},\; t \cdot T_{\text{warmup}}^{-1.5}\right)$
The warmup phase prevents instability in early training of transformers.
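In PyTorch, the warmup-plus-decay schedule can be expressed with `LambdaLR`. This sketch assumes $d_{\text{model}} = 512$ and 4000 warmup steps (the original transformer paper's setting), with the base learning rate set to 1.0 so the lambda value is the actual learning rate:

```python
import torch

model = torch.nn.Linear(512, 512)
opt = torch.optim.Adam(model.parameters(), lr=1.0, betas=(0.9, 0.98))

d_model, warmup = 512, 4000

def transformer_lr(step):
    step = max(step, 1)  # the scheduler starts counting at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

sched = torch.optim.lr_scheduler.LambdaLR(opt, transformer_lr)
# Call sched.step() once after each optimizer.step() during training

print(transformer_lr(4000))  # peak lr, ~7.0e-4
```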
Exercises
- Why is bias correction necessary?
- What are the effective learning rates after steps $t = 1, 10, 100, 1000$?
- How does Adam handle sparse gradients compared to SGD?
- Plot the learning rate schedule for 100,000 steps
- What is the learning rate at step 1, 4000, and 10,000?
- Why is warmup beneficial for transformer training?
- Model parameters in FP16
- Optimizer states (FP32 master copy + 2 moments)
- Gradients in FP16
- Total memory for model + optimizer
- How many A100 GPUs (80 GB each) are needed?
- Calculate total FLOPs for one training step (forward + backward)
- Estimate time per step on A100 GPU (312 TFLOPS)
- How does mixed precision (FP16) affect throughput?
- What is the maximum batch size that fits in 80 GB memory?
- How many forward passes does finite differences require?
- How many passes does backpropagation require?
- If one forward pass takes 10 ms, compare total time
- Why is reverse mode AD preferred over forward mode?
- Without checkpointing, how much activation memory is needed?
- With checkpointing every 6 layers, what is the memory reduction?
- What is the computational overhead (extra forward passes)?
- At what model size does checkpointing become necessary?
- If gradient all-reduce takes 15 ms and computation takes 100 ms, what is the scaling efficiency?
- How does batch size affect communication overhead?
- Compare ring all-reduce vs tree all-reduce for 64 GPUs
- When does gradient compression become beneficial?
Solutions
Full solutions for all exercises are available at \url{https://deeplearning.hofkensvermeulen.be}.
Using the gradient rules:
- $\nabla_{\vw}(\vw\transpose \mA \vw) = 2\mA\vw$ (since $\mA$ is symmetric)
- $\nabla_{\vw}(\vb\transpose \vw) = \vb$
- $\nabla_{\vw}(c) = \mathbf{0}$
Therefore: $\nabla_{\vw} f(\vw) = 2\mA\vw + \vb$.
Let's use specific weights:
Forward pass:
Backward pass:
(1) Why bias correction is necessary: The first and second moment estimates are initialized to zero, creating a bias toward zero in early iterations. Without correction, the effective learning rate would be too small initially. Bias correction factors $\frac{1}{1-\beta_1^t}$ and $\frac{1}{1-\beta_2^t}$ compensate for this initialization bias.
(2) Effective learning rates: The effective learning rate is $\alpha_{\text{eff}} = \alpha \frac{\sqrt{1-\beta_2^t}}{1-\beta_1^t}$:
- $t=1$: $\alpha_{\text{eff}} = 0.001 \times \frac{\sqrt{1-0.999}}{1-0.9} = 0.001 \times \frac{0.0316}{0.1} \approx 0.000316$
- $t=10$: $\alpha_{\text{eff}} = 0.001 \times \frac{\sqrt{1-0.999^{10}}}{1-0.9^{10}} \approx 0.001 \times \frac{0.0998}{0.651} \approx 0.000153$
- $t=100$: $\alpha_{\text{eff}} = 0.001 \times \frac{\sqrt{1-0.999^{100}}}{1-0.9^{100}} \approx 0.001 \times \frac{0.309}{1.000} \approx 0.000309$
- $t=1000$: $\alpha_{\text{eff}} = 0.001 \times \sqrt{1-0.999^{1000}} \approx 0.000795$, approaching $\alpha = 0.001$ as $t \to \infty$
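These values can be recomputed directly from the bias-correction factors (minor rounding differences from the hand calculation are possible):

```python
def eff_lr(t, alpha=0.001, beta1=0.9, beta2=0.999):
    """Effective Adam learning rate scale from bias correction at step t."""
    return alpha * (1 - beta2 ** t) ** 0.5 / (1 - beta1 ** t)

for t in (1, 10, 100, 1000):
    print(t, eff_lr(t))
# 1 -> ~0.000316, 10 -> ~0.000153, 100 -> ~0.000309, 1000 -> ~0.000795
```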
(3) Handling sparse gradients: Adam maintains separate adaptive learning rates for each parameter through the second moment estimate $\mathbf{v}$. For sparse gradients, parameters with infrequent updates have smaller $v_i$ values, resulting in larger effective learning rates. This allows Adam to make larger updates to rarely-updated parameters, unlike SGD which treats all parameters equally. This is particularly beneficial for embedding layers and natural language processing tasks.
The learning rate schedule is the standard transformer schedule $\eta(t) = d_{\text{model}}^{-0.5} \cdot \min\!\left(t^{-0.5},\; t \cdot T_{\text{warmup}}^{-1.5}\right)$ with $d_{\text{model}} = 512$ and $T_{\text{warmup}} = 4000$.
(1) Plot description: The schedule has two phases:
- Warmup ($t \leq 4000$): Linear increase $\eta(t) = 512^{-0.5} \cdot t \cdot 4000^{-1.5} \approx 1.75 \times 10^{-7} \cdot t$
- Decay ($t > 4000$): Inverse square root $\eta(t) = 512^{-0.5} \cdot t^{-0.5} \approx \frac{0.0442}{\sqrt{t}}$
(2) Learning rates at specific steps:
- $t=1$: $\eta = 512^{-0.5} \cdot 1 \cdot 4000^{-1.5} \approx 1.75 \times 10^{-7}$
- $t=4000$: $\eta = 512^{-0.5} \cdot 4000^{-0.5} \approx 6.99 \times 10^{-4}$ (peak)
- $t=10000$: $\eta = 512^{-0.5} \cdot 10000^{-0.5} \approx 4.42 \times 10^{-4}$
(3) Why warmup is beneficial:
- Prevents instability from large gradients in early training when parameters are randomly initialized
- Allows the optimizer's momentum statistics to stabilize
- Particularly important for Adam, where the second moment estimate needs time to accumulate
- Without warmup, large initial learning rates can cause divergence or poor local minima
(1) Model parameters in FP16: $1.5 \times 10^9 \times 2 \text{ bytes} = 3$ GB
(2) Optimizer states:
- FP32 master copy: $1.5 \times 10^9 \times 4 = 6$ GB
- First moment $\mathbf{m}$ (FP32): $1.5 \times 10^9 \times 4 = 6$ GB
- Second moment $\mathbf{v}$ (FP32): $1.5 \times 10^9 \times 4 = 6$ GB
- Total optimizer states: $18$ GB
(3) Gradients in FP16: $1.5 \times 10^9 \times 2 \text{ bytes} = 3$ GB
(4) Total memory: $3 + 18 + 3 = 24$ GB
(5) Number of A100 GPUs needed: $\lceil 24 / 80 \rceil = 1$ for model and optimizer states alone
One A100 GPU is sufficient for the model and optimizer states alone. However, activations during training require additional memory, so 1-2 GPUs would be needed in practice depending on batch size.
(1) Total FLOPs per training step: From Example~[ref]:
- Forward pass: $\approx 96$ GFLOPs per sample
- Backward pass: $\approx 193$ GFLOPs per sample
- Total per sample: $289$ GFLOPs
- For batch of 64: $289 \times 64 = 18{,}496$ GFLOPs $\approx 18.5$ TFLOPs
(2) Time per step on A100: $\frac{18.5 \text{ TFLOPs}}{312 \text{ TFLOPS}} \approx 59$ ms at theoretical peak.
In practice, memory bandwidth and kernel launch overhead increase this to $\approx 80$-$100$ ms.
(3) Mixed precision impact:
- FP16 Tensor Cores provide 2× speedup: $\approx 30$ ms theoretical
- Reduced memory traffic (2× less bandwidth): enables larger batches
- Practical speedup: 1.8-2.2× including overhead
- Throughput increase: $\approx 3.5$-$4$× due to larger batch sizes
(4) Maximum batch size in 80 GB: Memory breakdown:
- Model + optimizer: $\approx 1.7$ GB
- Activations per sample: $\approx 130$ MB
- Gradients: $\approx 0.44$ GB
- Framework overhead: $\approx 2$ GB
Available for activations: $80 - 1.7 - 0.44 - 2 = 75.86$ GB
Maximum batch size: $\frac{75{,}860 \text{ MB}}{130 \text{ MB/sample}} \approx 583$ samples
(1) Finite differences forward passes: Requires one forward pass per parameter: $10^7$ forward passes
(2) Backpropagation passes: Requires 1 forward pass + 1 backward pass = 2 passes total
(3) Time comparison:
- Finite differences: $10^7 \times 10 \text{ ms} = 10^8 \text{ ms} = 100{,}000 \text{ seconds} \approx 27.8 \text{ hours}$
- Backpropagation: $2 \times 10 \text{ ms} = 20 \text{ ms}$
- Speedup: $\frac{10^8}{20} = 5 \times 10^6 = 5$ million×
(4) Why reverse mode AD is preferred:
- For $n$ parameters and scalar loss, forward mode requires $O(n)$ passes while reverse mode requires $O(1)$ passes
- Reverse mode exploits the structure of neural networks: many parameters, one loss
- Memory cost is higher (must store activations) but computational savings are enormous
- Forward mode would be preferred only if we had many outputs and few inputs (rare in deep learning)
(1) Activation memory without checkpointing: Assuming $\approx 700$ MB per layer (from Example~[ref]) and 24 layers: $24 \times 700 \text{ MB} \approx 16.8$ GB
(2) Memory reduction with checkpointing every 6 layers: We store only 4 checkpoints (layers 6, 12, 18, 24): $4 \times 700 \text{ MB} = 2.8$ GB
Reduction factor: $\frac{16.8}{2.8} = 6\times$
(3) Computational overhead: For each checkpoint interval, we recompute the forward pass once during the backward pass:
- Original: 1 forward + 1 backward, $\approx 3$ forward-equivalents (the backward pass costs roughly $2\times$ the forward pass)
- With checkpointing: add $\approx 0.75$ of a forward pass of recomputation (18 of 24 layers)
- Overhead: $\frac{3.75}{3} = 1.25\times$ total compute, a 25\% increase
(4) When checkpointing becomes necessary: Checkpointing is essential when:
- Activation memory exceeds available GPU memory
- For GPT-3 scale (175B parameters), activations can exceed 100 GB
- Rule of thumb: Use checkpointing when activations $>$ 50\% of GPU memory
- Trade-off: roughly 1.25-1.5× slower training for 4-8× memory reduction
(1) Scaling efficiency:
- Time per step (single GPU): 100 ms
- Time per step (8 GPUs): $\frac{100}{8} + 15 = 12.5 + 15 = 27.5$ ms
- Ideal time (perfect scaling): $\frac{100}{8} = 12.5$ ms
- Scaling efficiency: $\frac{12.5}{27.5} \approx 45.5\%$
(2) Batch size effect on communication:
- Communication time is independent of batch size (same gradient size)
- Larger batches increase computation time, reducing communication overhead percentage
- For batch size $B$: efficiency $\approx \frac{100B/8}{100B/8 + 15}$
- Doubling batch size: $\frac{25}{40} = 62.5\%$ efficiency
- 4× batch size: $\frac{50}{65} = 76.9\%$ efficiency
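These efficiency numbers follow from a simple compute-plus-communication model (all-reduce overlap with computation is ignored):

```python
def efficiency(compute_ms, n_gpus, comm_ms):
    """Scaling efficiency: ideal per-GPU step time over actual step time."""
    ideal = compute_ms / n_gpus
    actual = ideal + comm_ms
    return ideal / actual

print(efficiency(100, 8, 15))  # ~0.455 (45.5%)
print(efficiency(200, 8, 15))  # 0.625  (2x batch -> 2x compute time)
print(efficiency(400, 8, 15))  # ~0.769 (4x batch)
```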
(3) Ring vs tree all-reduce for 64 GPUs:
- Ring all-reduce: $O(N)$ communication steps, bandwidth-optimal
- Tree all-reduce: $O(\log N)$ communication steps, latency-optimal
- For 64 GPUs: ring all-reduce needs $2(N-1) = 126$ communication steps; tree all-reduce needs $\log_2(64) = 6$ levels
- Ring is better for large messages (bandwidth-bound)
- Tree is better for small messages (latency-bound)
- Typical gradient sizes favor ring all-reduce
(4) When gradient compression is beneficial:
- When communication time $>$ compression time
- For slow networks (inter-node communication)
- Typical compression: 8-bit quantization or top-k sparsification
- Compression ratio: 4× (FP32 to 8-bit)
- Beneficial when: $\frac{\text{gradient size}}{\text{bandwidth}} > \frac{\text{gradient size}}{\text{compression throughput}} + \frac{\text{compressed size}}{\text{bandwidth}}$
- Usually beneficial for $>$8 GPUs across multiple nodes