Calculus and Optimization

Chapter Overview

Training deep learning models requires optimizing complex, high-dimensional functions. This chapter develops the calculus and optimization theory necessary to understand how neural networks learn from data. We cover multivariable calculus, gradient computation, and the optimization algorithms that power modern deep learning.

The centerpiece of this chapter is backpropagation, the algorithm that efficiently computes gradients in neural networks. We derive backpropagation from first principles, showing how the chain rule enables gradient computation through arbitrarily deep computational graphs. We then explore gradient descent and its variants, which use these gradients to iteratively improve model parameters.

Learning Objectives

After completing this chapter, you will be able to:

  1. Compute gradients and Jacobians for multivariable functions
  2. Apply the chain rule to composite functions
  3. Understand and implement the backpropagation algorithm
  4. Implement gradient descent and its variants (SGD, momentum, Adam)
  5. Analyze convergence properties of optimization algorithms
  6. Apply learning rate schedules and regularization techniques

Multivariable Calculus

Partial Derivatives

Definition: For function $f: \R^n \to \R$, the partial derivative with respect to $x_i$ is:
$$ \frac{\partial f}{\partial x_i} = \lim_{h \to 0} \frac{f(x_1, \ldots, x_i + h, \ldots, x_n) - f(x_1, \ldots, x_i, \ldots, x_n)}{h} $$
Example: For $f(x_1, x_2) = x_1^2 + 3x_1 x_2 + x_2^2$:
$$\begin{align} \frac{\partial f}{\partial x_1} &= 2x_1 + 3x_2 \\ \frac{\partial f}{\partial x_2} &= 3x_1 + 2x_2 \end{align}$$

At point $(x_1, x_2) = (1, 2)$:

$$\begin{align} \frac{\partial f}{\partial x_1}\bigg|_{(1,2)} &= 2(1) + 3(2) = 8 \\ \frac{\partial f}{\partial x_2}\bigg|_{(1,2)} &= 3(1) + 2(2) = 7 \end{align}$$
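The partial derivatives above can be verified numerically with central finite differences; this is a minimal sketch, not tied to any particular library:

```python
# Numerical check of the partial derivatives of f(x1, x2) = x1^2 + 3*x1*x2 + x2^2
# at the point (1, 2), using central finite differences.

def f(x1, x2):
    return x1**2 + 3 * x1 * x2 + x2**2

def partial(f, x, i, h=1e-6):
    """Central finite-difference approximation of df/dx_i at point x."""
    xp, xm = list(x), list(x)
    xp[i] += h
    xm[i] -= h
    return (f(*xp) - f(*xm)) / (2 * h)

x = (1.0, 2.0)
df_dx1 = partial(f, x, 0)   # analytic: 2(1) + 3(2) = 8
df_dx2 = partial(f, x, 1)   # analytic: 3(1) + 2(2) = 7
print(round(df_dx1, 4), round(df_dx2, 4))  # → 8.0 7.0
```

Central differences are exact for quadratics up to floating-point rounding, so the numbers match the analytic values.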

Gradients

Definition: For function $f: \R^n \to \R$, the gradient is the vector of partial derivatives:
$$ \nabla f(\vx) = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix} \in \R^n $$

The gradient points in the direction of steepest ascent of the function.

Example: For mean squared error loss:
$$ L(\vw) = \frac{1}{N} \sum_{i=1}^N (y_i - \vw\transpose \vx^{(i)})^2 $$

The gradient with respect to $\vw$ is:

$$ \nabla_{\vw} L = -\frac{2}{N} \sum_{i=1}^N (y_i - \vw\transpose \vx^{(i)}) \vx^{(i)} $$

For $N=1$, $\vw = [w_1, w_2]\transpose$, $\vx = [1, 2]\transpose$, $y = 5$, and current prediction $\hat{y} = \vw\transpose \vx = 3$:

$$ \nabla_{\vw} L = -2(5 - 3) \begin{bmatrix} 1 \\ 2 \end{bmatrix} = \begin{bmatrix} -4 \\ -8 \end{bmatrix} $$

The negative gradient $-\nabla_{\vw} L = [4, 8]\transpose$ points toward better parameters.
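The arithmetic above is easy to reproduce in code; the weight vector $\vw = [1, 1]\transpose$ below is a hypothetical choice consistent with the stated prediction $\hat{y} = 3$:

```python
# MSE gradient for N = 1: grad = -2 * (y - w.T x) * x
# w is chosen (hypothetically) so that w.T x = 3, matching the example.
w = [1.0, 1.0]          # 1*1 + 1*2 = 3
x = [1.0, 2.0]
y = 5.0

y_hat = sum(wi * xi for wi, xi in zip(w, x))     # prediction: 3.0
grad = [-2 * (y - y_hat) * xi for xi in x]
print(grad)  # → [-4.0, -8.0]
```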

The Chain Rule

Theorem: For composite function $h(\vx) = f(g(\vx))$ where $g: \R^n \to \R^m$ and $f: \R^m \to \R$:
$$ \frac{\partial h}{\partial x_i} = \sum_{j=1}^m \frac{\partial f}{\partial g_j} \frac{\partial g_j}{\partial x_i} $$

In vector form:

$$ \nabla_{\vx} h = \mJ_g\transpose \nabla_{\vz} f $$
where $\vz = g(\vx)$ and $\mJ_g \in \R^{m \times n}$ is the Jacobian of $g$.
Example: For neural network layer: $\vy = \sigma(\mW\vx + \vb)$ where $\sigma$ is applied element-wise.

Let $\vz = \mW\vx + \vb$ (pre-activation). Then:

$$ \frac{\partial L}{\partial \vx} = \mW\transpose \left( \frac{\partial L}{\partial \vy} \odot \sigma'(\vz) \right) $$

where $\odot$ denotes element-wise multiplication.

Concrete example: For ReLU activation $\sigma(z) = \max(0, z)$:

$$ \sigma'(z) = \begin{cases} 1 & \text{if } z > 0 \\ 0 & \text{if } z \leq 0 \end{cases} $$

If $\vz = [2.0, -1.0, 0.5]\transpose$, then $\sigma'(\vz) = [1, 0, 1]\transpose$.
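A minimal NumPy sketch of the layer backward rule $\frac{\partial L}{\partial \vx} = \mW\transpose(\frac{\partial L}{\partial \vy} \odot \sigma'(\vz))$ for ReLU. The weights and bias are chosen (hypothetically) so that the pre-activation equals the $\vz = [2.0, -1.0, 0.5]\transpose$ from the example, and a simple sum loss is used; the analytic gradient is checked against finite differences:

```python
import numpy as np

W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
b = np.array([0.0, 0.0, -0.5])
x = np.array([2.0, -1.0])            # gives z = [2.0, -1.0, 0.5] as in the text

def loss(x):
    z = W @ x + b
    return np.maximum(z, 0.0).sum()  # L = sum(ReLU(Wx + b))

z = W @ x + b
dL_dy = np.ones(3)                   # gradient of sum() w.r.t. each output
dL_dx = W.T @ (dL_dy * (z > 0))      # chain rule: W.T (dL/dy ⊙ σ'(z))

# Finite-difference check of the analytic gradient
eps = 1e-6
num = np.array([(loss(x + eps * e) - loss(x - eps * e)) / (2 * eps)
                for e in np.eye(2)])
print(dL_dx, np.allclose(dL_dx, num))  # → [2. 1.] True
```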

Jacobian and Hessian Matrices

Definition: For function $\mathbf{f}: \R^n \to \R^m$, the Jacobian matrix is:
$$ \mJ_{\mathbf{f}}(\vx) = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2} & \cdots & \frac{\partial f_2}{\partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \frac{\partial f_m}{\partial x_2} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix} \in \R^{m \times n} $$
Definition: For function $f: \R^n \to \R$, the Hessian matrix contains second derivatives:
$$ \mH_f(\vx) = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots \\ \vdots & \vdots & \ddots \end{bmatrix} \in \R^{n \times n} $$

The Hessian describes the local curvature of the function. For smooth functions, $\mH$ is symmetric.
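Both definitions can be checked on the running example $f(x_1, x_2) = x_1^2 + 3x_1x_2 + x_2^2$, whose Hessian is the constant symmetric matrix $\begin{bmatrix} 2 & 3 \\ 3 & 2 \end{bmatrix}$. A sketch that builds the Hessian as the finite-difference Jacobian of the analytic gradient:

```python
import numpy as np

def grad(x):
    """Analytic gradient of f(x1, x2) = x1^2 + 3*x1*x2 + x2^2."""
    x1, x2 = x
    return np.array([2 * x1 + 3 * x2, 3 * x1 + 2 * x2])

def hessian(x, h=1e-5):
    """Finite-difference Jacobian of the gradient: column j holds
    the change in the gradient when x_j is perturbed."""
    n = len(x)
    H = np.zeros((n, n))
    for j in range(n):
        e = np.zeros(n)
        e[j] = h
        H[:, j] = (grad(x + e) - grad(x - e)) / (2 * h)
    return H

H = hessian(np.array([1.0, 2.0]))
print(np.round(H, 4))           # constant Hessian [[2, 3], [3, 2]]
print(np.allclose(H, H.T))      # symmetric, → True
```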

Gradient Descent

The Gradient Descent Algorithm

Gradient descent iteratively moves parameters in the direction opposite to the gradient:

Algorithm: Gradient Descent
Input: Objective function $f(\vw)$, initial parameters $\vw^{(0)}$, learning rate $\eta$, iterations $T$
Output: Optimized parameters $\vw^{(T)}$
for $t = 0$ to $T-1$ do
Compute gradient: $\mathbf{g}^{(t)} = \nabla f(\vw^{(t)})$
Update parameters: $\vw^{(t+1)} = \vw^{(t)} - \eta \mathbf{g}^{(t)}$
return $\vw^{(T)}$
The learning rate $\eta$ controls the step size. Too large: divergence. Too small: slow convergence.
Example: Minimize $f(w) = w^2$ starting from $w^{(0)} = 3$ with $\eta = 0.1$:
$$\begin{align} t=0:& \quad w^{(0)} = 3, \quad g^{(0)} = 2w^{(0)} = 6, \quad w^{(1)} = 3 - 0.1(6) = 2.4 \\ t=1:& \quad w^{(1)} = 2.4, \quad g^{(1)} = 4.8, \quad w^{(2)} = 2.4 - 0.1(4.8) = 1.92 \\ t=2:& \quad w^{(2)} = 1.92, \quad g^{(2)} = 3.84, \quad w^{(3)} = 1.92 - 0.1(3.84) = 1.536 \end{align}$$

The parameters converge to $w^* = 0$ (the minimum).
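The worked example maps directly to code; each step multiplies $w$ by $(1 - 2\eta) = 0.8$:

```python
# Gradient descent on f(w) = w^2 from w0 = 3 with learning rate 0.1,
# reproducing the iterates in the example above.

def gradient_descent(w, eta, steps):
    trajectory = [w]
    for _ in range(steps):
        g = 2 * w            # f'(w) = 2w
        w = w - eta * g
        trajectory.append(w)
    return trajectory

traj = gradient_descent(w=3.0, eta=0.1, steps=3)
print([round(w, 4) for w in traj])  # → [3.0, 2.4, 1.92, 1.536]
```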

Stochastic Gradient Descent (SGD)

For large datasets, computing the full gradient is expensive. SGD approximates the gradient using mini-batches.

Algorithm: Stochastic Gradient Descent (SGD)
Input: Dataset $\mathcal{D} = \{(\vx^{(i)}, y^{(i)})\}_{i=1}^N$, batch size $B$, learning rate $\eta$, epochs $E$
Output: Optimized parameters $\vw$
Initialize $\vw$ randomly
for epoch $e = 1$ to $E$ do
Shuffle dataset $\mathcal{D}$
for each mini-batch $\mathcal{B} \subset \mathcal{D}$ of size $B$ do
Compute mini-batch gradient: $\mathbf{g} = \frac{1}{B} \sum_{(\vx, y) \in \mathcal{B}} \nabla_{\vw} L(\vw; \vx, y)$
Update: $\vw \leftarrow \vw - \eta \mathbf{g}$
return $\vw$
PyTorch SGD implementation (the DataLoader here wraps a small synthetic regression dataset so the snippet runs standalone):
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Synthetic regression data for illustration
X = torch.randn(256, 10)
y = torch.randn(256, 1)
dataloader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

# Model and loss
model = nn.Linear(10, 1)
criterion = nn.MSELoss()

# SGD optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Training loop
for epoch in range(100):
    for x_batch, y_batch in dataloader:
        # Forward pass
        y_pred = model(x_batch)
        loss = criterion(y_pred, y_batch)

        # Backward pass
        optimizer.zero_grad()  # Clear previous gradients
        loss.backward()        # Compute gradients
        optimizer.step()       # Update parameters

Momentum

Momentum accelerates SGD by accumulating a velocity vector, drawing inspiration from physics where a ball rolling down a hill builds up speed.

The Intuition Behind Momentum

Standard SGD can be slow and oscillatory, especially in regions where the loss surface has different curvatures in different directions (ravines). In a landscape that is steep in one direction but shallow in another, plain SGD oscillates back and forth across the steep walls while creeping along the shallow floor.

Momentum acts as an exponentially weighted moving average of gradients, smoothing out noisy updates and accelerating convergence in consistent directions.

Mathematical Formulation

Algorithm: SGD with Momentum
Input: Learning rate $\eta$, momentum coefficient $\beta$ (typically 0.9)
Initialize velocity $\mathbf{v} = \mathbf{0}$
for each iteration do
Compute gradient $\mathbf{g} = \nabla_{\vw} L(\vw)$
Update velocity: $\mathbf{v} \leftarrow \beta \mathbf{v} + \mathbf{g}$
Update parameters: $\vw \leftarrow \vw - \eta \mathbf{v}$

The velocity update can be expanded to show its exponential weighting:

$$ \mathbf{v}_t = \beta \mathbf{v}_{t-1} + \mathbf{g}_t = \mathbf{g}_t + \beta \mathbf{g}_{t-1} + \beta^2 \mathbf{g}_{t-2} + \beta^3 \mathbf{g}_{t-3} + \cdots $$

With $\beta = 0.9$, the current gradient has weight 1, the previous gradient has weight 0.9, two steps ago has weight 0.81, and so on. This creates a "memory" of approximately $\frac{1}{1-\beta} = 10$ steps.

Why Momentum Works Better

  1. Dampens oscillations: In directions with high curvature (steep gradients that change sign), the velocity accumulates opposing gradients, reducing oscillation amplitude.
  2. Accelerates in consistent directions: When gradients consistently point in the same direction, the velocity builds up, effectively increasing the learning rate in that direction.
  3. Escapes shallow local minima: The accumulated velocity can carry the optimization through small barriers, potentially escaping poor local minima.
  4. Reduces sensitivity to learning rate: The smoothing effect makes the optimization more robust to learning rate choices.
Example: Consider minimizing $f(w_1, w_2) = 10w_1^2 + 0.1w_2^2$ starting from $(w_1, w_2) = (1, 1)$ with $\eta = 0.01$:

SGD without momentum:

$$\begin{align} t=0:& \quad \mathbf{g} = [20, 0.2]\transpose, \quad \vw^{(1)} = [1, 1]\transpose - 0.01[20, 0.2]\transpose = [0.8, 0.998]\transpose \\ t=1:& \quad \mathbf{g} = [16, 0.1996]\transpose, \quad \vw^{(2)} = [0.8, 0.998]\transpose - 0.01[16, 0.1996]\transpose = [0.64, 0.996]\transpose \end{align}$$

Progress in $w_1$ direction: $1 \to 0.8 \to 0.64$ (slow, 20% per step)

SGD with momentum ($\beta = 0.9$):

$$\begin{align} t=0:& \quad \mathbf{g} = [20, 0.2]\transpose, \quad \mathbf{v}^{(1)} = [20, 0.2]\transpose, \quad \vw^{(1)} = [0.8, 0.998]\transpose \\ t=1:& \quad \mathbf{g} = [16, 0.1996]\transpose, \quad \mathbf{v}^{(2)} = 0.9[20, 0.2]\transpose + [16, 0.1996]\transpose = [34, 0.38]\transpose \\ & \quad \vw^{(2)} = [0.8, 0.998]\transpose - 0.01[34, 0.38]\transpose = [0.46, 0.994]\transpose \end{align}$$

Progress in $w_1$ direction: $1 \to 0.8 \to 0.46$ (faster, 42.5% in the second step due to accumulated velocity)

Momentum accelerates convergence by 2-3× in this ravine scenario.
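The ravine example can be reproduced in a few lines, comparing the $w_1$ coordinate after two steps of plain SGD and SGD with momentum:

```python
# Minimizing f(w1, w2) = 10*w1^2 + 0.1*w2^2 from (1, 1) with eta = 0.01,
# reproducing the worked example above.

def grad(w):
    return [20 * w[0], 0.2 * w[1]]

def sgd(w, eta, steps):
    for _ in range(steps):
        g = grad(w)
        w = [wi - eta * gi for wi, gi in zip(w, g)]
    return w

def sgd_momentum(w, eta, beta, steps):
    v = [0.0, 0.0]
    for _ in range(steps):
        g = grad(w)
        v = [beta * vi + gi for vi, gi in zip(v, g)]   # v ← βv + g
        w = [wi - eta * vi for wi, vi in zip(w, v)]
    return w

plain = sgd([1.0, 1.0], eta=0.01, steps=2)
mom = sgd_momentum([1.0, 1.0], eta=0.01, beta=0.9, steps=2)
print(round(plain[0], 3), round(mom[0], 3))  # → 0.64 0.46
```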

Adam Optimizer

Adam (Adaptive Moment Estimation) combines momentum with adaptive learning rates, creating a powerful optimizer that adapts to the geometry of the loss landscape for each parameter individually.

The Intuition Behind Adam

Adam addresses two key limitations of standard SGD with momentum:

  1. Different parameters need different learning rates: In deep networks, some parameters (e.g., early layer weights) may have small gradients while others (e.g., output layer weights) have large gradients. A single global learning rate is suboptimal.
  2. Gradient magnitudes vary across training: Early in training, gradients may be large and noisy. Later, they become smaller and more stable. Adam adapts to these changes automatically.
Adam maintains two moving averages for each parameter: the first moment (mean of gradients, like momentum) and the second moment (uncentered variance of gradients). The second moment enables adaptive per-parameter learning rates.

Mathematical Formulation

Algorithm: Adam Optimizer
Input: Learning rate $\alpha$ (default 0.001), $\beta_1$ = 0.9, $\beta_2$ = 0.999, $\epsilon$ = $10^{-8}$
Initialize $\mathbf{m}_0 = \mathbf{0}$ (first moment), $\mathbf{v}_0 = \mathbf{0}$ (second moment), $t = 0$
while not converged do
$t \leftarrow t + 1$
Compute gradient: $\mathbf{g}_t = \nabla_{\vw} L(\vw_{t-1})$
Update biased first moment: $\mathbf{m}_t = \beta_1 \mathbf{m}_{t-1} + (1 - \beta_1) \mathbf{g}_t$
Update biased second moment: $\mathbf{v}_t = \beta_2 \mathbf{v}_{t-1} + (1 - \beta_2) \mathbf{g}_t^2$
Bias-corrected first moment: $\hat{\mathbf{m}}_t = \mathbf{m}_t / (1 - \beta_1^t)$
Bias-corrected second moment: $\hat{\mathbf{v}}_t = \mathbf{v}_t / (1 - \beta_2^t)$
Update parameters: $\vw_t = \vw_{t-1} - \alpha \hat{\mathbf{m}}_t / (\sqrt{\hat{\mathbf{v}}_t} + \epsilon)$

Understanding Each Component

First moment ($\mathbf{m}_t$): Exponentially weighted average of gradients (momentum)

$$ \mathbf{m}_t = \beta_1 \mathbf{m}_{t-1} + (1 - \beta_1) \mathbf{g}_t $$
This provides the direction of the update, smoothing out noisy gradients. With $\beta_1 = 0.9$, it averages over approximately $\frac{1}{1-0.9} = 10$ steps.

Second moment ($\mathbf{v}_t$): Exponentially weighted average of squared gradients

$$ \mathbf{v}_t = \beta_2 \mathbf{v}_{t-1} + (1 - \beta_2) \mathbf{g}_t^2 $$
This estimates the variance of gradients. With $\beta_2 = 0.999$, it averages over approximately $\frac{1}{1-0.999} = 1000$ steps, providing a stable estimate of gradient scale.

Bias correction: Since $\mathbf{m}_0 = \mathbf{v}_0 = \mathbf{0}$, early estimates are biased toward zero. The correction factors $\frac{1}{1-\beta_1^t}$ and $\frac{1}{1-\beta_2^t}$ compensate for this initialization bias, ensuring proper scaling in early iterations.

Adaptive update: The final update is:

$$ \vw_t = \vw_{t-1} - \alpha \frac{\hat{\mathbf{m}}_t}{\sqrt{\hat{\mathbf{v}}_t} + \epsilon} $$

Each parameter $w_i$ gets an effective learning rate of $\frac{\alpha}{\sqrt{\hat{v}_{i,t}} + \epsilon}$: parameters with consistently large gradients take smaller steps, while parameters with small or infrequent gradients take larger ones.

Example: Training a 2-layer network on MNIST with 10,000 examples:
| Optimizer | Steps to 95% Accuracy | Final Accuracy | Training Time |
|---|---|---|---|
| SGD ($\eta = 0.01$) | 8,500 | 96.2% | 12.3 min |
| SGD + Momentum ($\eta = 0.01$, $\beta = 0.9$) | 4,200 | 97.1% | 6.1 min |
| Adam ($\alpha = 0.001$) | 2,100 | 97.8% | 3.0 min |

Adam converges 4× faster than SGD and 2× faster than momentum, while achieving higher final accuracy.

Example: Consider two parameters with different gradient patterns over 5 steps:

Parameter 1 (consistent large gradients):

$$\begin{align} \text{Gradients:} & \quad [10.0, 9.8, 10.2, 9.9, 10.1] \\ \hat{m}_5 & \approx 10.0 \quad \text{(first moment)} \\ \hat{v}_5 & \approx 100.0 \quad \text{(second moment)} \\ \text{Effective LR:} & \quad \frac{0.001}{\sqrt{100.0}} = 0.0001 \end{align}$$

Parameter 2 (small noisy gradients):

$$\begin{align} \text{Gradients:} & \quad [0.5, -0.3, 0.4, -0.2, 0.6] \\ \hat{m}_5 & \approx 0.2 \quad \text{(first moment, noise cancels)} \\ \hat{v}_5 & \approx 0.16 \quad \text{(second moment)} \\ \text{Effective LR:} & \quad \frac{0.001}{\sqrt{0.16}} = 0.0025 \end{align}$$

Parameter 2 gets 25× larger effective learning rate, compensating for its smaller gradients. This automatic adaptation is why Adam excels at training deep networks with heterogeneous parameter scales.
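A minimal scalar Adam step following the algorithm above illustrates the adaptation: on step 1 the bias-corrected update is $\alpha \cdot g/(|g| + \epsilon) \approx \alpha$, so two parameters whose gradients differ by $100\times$ move by nearly the same amount:

```python
# One Adam step for two parameters with very different gradient magnitudes.
import math

def adam_step(w, g, m, v, t, alpha=0.001, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g          # biased first moment
    v = b2 * v + (1 - b2) * g * g      # biased second moment
    m_hat = m / (1 - b1 ** t)          # bias correction
    v_hat = v / (1 - b2 ** t)
    w = w - alpha * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

w1, m1, v1 = adam_step(w=0.0, g=10.0, m=0.0, v=0.0, t=1)
w2, m2, v2 = adam_step(w=0.0, g=0.1, m=0.0, v=0.0, t=1)
print(round(w1, 6), round(w2, 6))  # → -0.001 -0.001
```

Without the second moment in the denominator, the first parameter would have moved $100\times$ farther than the second.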

Adam is the most commonly used optimizer for training transformers and large language models due to its robustness, fast convergence, and ability to handle the diverse gradient patterns across different layers and attention heads.

Gradient Computation Complexity

Understanding the computational and memory costs of gradient computation is essential for training large models efficiently.

FLOPs for Gradient Computation

Computing gradients via backpropagation costs approximately 2× the FLOPs of the forward pass, since gradients must be propagated with respect to both activations and weights. A full training step (forward + backward) therefore costs roughly 3× a forward pass.
Example: For BERT-base (110M parameters, 12 layers, $d_{\text{model}} = 768$) processing sequence length $n = 512$:

Forward pass: $\approx 96$ GFLOPs per sequence.

Backward pass: $\approx 193$ GFLOPs per sequence ($\approx 2\times$ the forward pass).

Total per training step: $\approx 289$ GFLOPs

For batch size $B = 32$: $289 \times 32 \approx 9.2$ TFLOPs per batch.

Memory Requirements for Activations

During backpropagation, intermediate activations must be stored for gradient computation.

Definition: For a network with $L$ layers processing batch size $B$, activation memory is:
$$ M_{\text{act}} = B \sum_{\ell=1}^L d_\ell $$
where $d_\ell$ is the flattened dimension of layer $\ell$'s output (for a transformer layer, sequence length times hidden size). Multiply by bytes per element (4 for FP32) to convert this count into memory.
Example: For BERT-base ($B = 32$, $n = 512$, $d = 768$), per-layer activations total $\approx 704$ MB in FP32, dominated by the attention scores ($B \times h \times n^2 \times 4$ bytes $= 402$ MB, or 57% of per-layer memory). Across all 12 layers, activations alone consume $\approx 8.4$ GB, excluding gradients and optimizer states. See Section [ref] for the complete breakdown.

Automatic Differentiation: Forward vs Reverse Mode

Definition: Forward mode computes derivatives by propagating tangent vectors forward through the computational graph. For function $f: \R^n \to \R^m$, computing $\nabla f$ requires $n$ forward passes.
Definition: Reverse mode (backpropagation) computes derivatives by propagating adjoints backward. Computing $\nabla f$ requires 1 forward pass + 1 backward pass, regardless of $n$.
For neural networks where $n \gg m$ (millions of parameters, one loss), reverse mode is vastly more efficient: $O(1)$ passes vs $O(n)$ passes.
Example: For a network with $n = 10^8$ parameters (100M) and scalar loss ($m = 1$):

Forward mode: $n = 10^8$ forward passes, one per parameter.

Reverse mode (backpropagation): 1 forward pass + 1 backward pass, independent of $n$.

Gradient Checkpointing

Gradient checkpointing trades computation for memory by recomputing activations during the backward pass.

Algorithm: Gradient Checkpointing
Input: Network with $L$ layers, checkpoint every $k$ layers
// Forward Pass
for $\ell = 1$ to $L$ do
Compute $\vh^{(\ell)} = f^{(\ell)}(\vh^{(\ell-1)})$
if $\ell \bmod k = 0$ then
Save $\vh^{(\ell)}$ to memory (checkpoint)
// Backward Pass
for $\ell = L$ to $1$ do
Recompute forward from last checkpoint to layer $\ell$
Compute gradient $\nabla_{\vh^{(\ell-1)}} L$ using $\vh^{(\ell)}$
Example: For BERT-base (12 layers) with checkpointing every 3 layers:

Without checkpointing: activations for all 12 layers are stored, $12 \times 700 \text{ MB} \approx 8.4$ GB.

With checkpointing (every 3 layers): only 4 checkpoints are stored, $4 \times 700 \text{ MB} \approx 2.8$ GB (a $3\times$ reduction), at the cost of roughly one extra forward pass during backpropagation.

For GPT-3 (175B parameters), checkpointing is essential to fit in GPU memory.
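The algorithm above can be sketched in plain Python with toy scalar layers; the layer functions and interval $k = 3$ are purely illustrative (in PyTorch one would use torch.utils.checkpoint instead):

```python
# Store activations only every k layers on the forward pass; recompute the
# missing ones from the nearest checkpoint when the backward pass needs them.

layers = [lambda h, i=i: h * 1.1 + i for i in range(12)]  # 12 toy scalar "layers"
k = 3                                                     # checkpoint interval

def forward_with_checkpoints(x):
    """Forward pass that keeps only every k-th activation."""
    checkpoints = {0: x}
    h = x
    for idx in range(1, len(layers) + 1):
        h = layers[idx - 1](h)
        if idx % k == 0:
            checkpoints[idx] = h
    return h, checkpoints

def recompute(checkpoints, idx):
    """Recompute the activation after `idx` layers from the nearest checkpoint."""
    base = (idx // k) * k
    h = checkpoints[base]
    for f in layers[base:idx]:
        h = f(h)
    return h

# Reference: a full forward pass that keeps everything
acts = [1.0]
for f in layers:
    acts.append(f(acts[-1]))

out, ckpts = forward_with_checkpoints(1.0)
print(len(ckpts), len(acts))           # 5 stored values instead of 13
print(recompute(ckpts, 7) == acts[7])  # → True
```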

Backpropagation

Backpropagation efficiently computes gradients in neural networks using the chain rule.

Computational Graphs

A computational graph represents the sequence of operations in a neural network. Each node is an operation, and edges carry values/gradients.

Example: For $L = (y - \hat{y})^2$ where $\hat{y} = w_2 \sigma(w_1 x + b_1) + b_2$, with $x = 1.0$, $w_1 = 2.0$, $b_1 = 0.5$, $w_2 = 1.5$, $b_2 = 0.3$, $y = 3.0$, and sigmoid activation $\sigma$:

Forward pass:

$$\begin{align} z_1 &= w_1 x + b_1 = 2.0(1.0) + 0.5 = 2.5 \\ a_1 &= \sigma(z_1) = \sigma(2.5) = 0.924 \quad \text{(sigmoid)} \\ z_2 &= w_2 a_1 + b_2 = 1.5(0.924) + 0.3 = 1.686 \\ L &= (y - z_2)^2 = (3.0 - 1.686)^2 = 1.726 \end{align}$$

Backward pass:

$$\begin{align} \frac{\partial L}{\partial z_2} &= 2(z_2 - y) = 2(1.686 - 3.0) = -2.628 \\ \frac{\partial L}{\partial w_2} &= \frac{\partial L}{\partial z_2} \cdot a_1 = -2.628(0.924) = -2.428 \\ \frac{\partial L}{\partial a_1} &= \frac{\partial L}{\partial z_2} \cdot w_2 = -2.628(1.5) = -3.942 \\ \frac{\partial L}{\partial z_1} &= \frac{\partial L}{\partial a_1} \cdot \sigma'(z_1) = -3.942(0.070) = -0.276 \\ \frac{\partial L}{\partial w_1} &= \frac{\partial L}{\partial z_1} \cdot x = -0.276(1.0) = -0.276 \end{align}$$
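The forward and backward passes above can be reproduced numerically, using the same values ($x = 1.0$, $w_1 = 2.0$, $b_1 = 0.5$, $w_2 = 1.5$, $b_2 = 0.3$, $y = 3.0$):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x, w1, b1, w2, b2, y = 1.0, 2.0, 0.5, 1.5, 0.3, 3.0

# Forward pass
z1 = w1 * x + b1            # 2.5
a1 = sigmoid(z1)            # ≈ 0.924
z2 = w2 * a1 + b2           # ≈ 1.686
L = (y - z2) ** 2           # ≈ 1.726

# Backward pass (chain rule, reusing intermediate gradients)
dL_dz2 = 2 * (z2 - y)               # ≈ -2.628
dL_dw2 = dL_dz2 * a1                # ≈ -2.428
dL_da1 = dL_dz2 * w2                # ≈ -3.942
dL_dz1 = dL_da1 * a1 * (1 - a1)     # sigmoid'(z) = σ(z)(1 - σ(z)), ≈ -0.276
dL_dw1 = dL_dz1 * x                 # ≈ -0.276

print(round(L, 3), round(dL_dw2, 3), round(dL_dw1, 3))  # → 1.726 -2.428 -0.276
```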

Backpropagation Algorithm

Algorithm: Backpropagation
Input: Training example $(\vx, y)$, network with $L$ layers
Output: Gradients $\{\nabla_{\mW^{(\ell)}} L, \nabla_{\vb^{(\ell)}} L\}_{\ell=1}^L$
// Forward Pass
$\vh^{(0)} = \vx$
for $\ell = 1$ to $L$ do
$\vz^{(\ell)} = \mW^{(\ell)} \vh^{(\ell-1)} + \vb^{(\ell)}$
$\vh^{(\ell)} = \sigma^{(\ell)}(\vz^{(\ell)})$
$\hat{y} = \vh^{(L)}$
Compute loss: $L = \text{Loss}(y, \hat{y})$

// Backward Pass
$\boldsymbol{\delta}^{(L)} = \nabla_{\vh^{(L)}} L \odot \sigma'^{(L)}(\vz^{(L)})$
for $\ell = L$ to $1$ do
$\nabla_{\mW^{(\ell)}} L = \boldsymbol{\delta}^{(\ell)} (\vh^{(\ell-1)})\transpose$
$\nabla_{\vb^{(\ell)}} L = \boldsymbol{\delta}^{(\ell)}$
if $\ell > 1$ then
$\boldsymbol{\delta}^{(\ell-1)} = (\mW^{(\ell)})\transpose \boldsymbol{\delta}^{(\ell)} \odot \sigma'^{(\ell-1)}(\vz^{(\ell-1)})$
return All gradients
Backpropagation computes gradients in $O(n)$ time where $n$ is the number of parameters, compared to $O(n^2)$ for naive methods. This efficiency enables training of billion-parameter models.

Why Backpropagation is $O(n)$ Not $O(n^2)$

Theorem: For a neural network with $n$ parameters and $m$ operations, backpropagation computes all gradients in $O(m)$ time, where typically $m = O(n)$.
Proof (Intuition): Each operation in the forward pass corresponds to one gradient computation in the backward pass. The chain rule allows us to reuse intermediate gradients:
$$ \frac{\partial L}{\partial w_i} = \frac{\partial L}{\partial z_j} \cdot \frac{\partial z_j}{\partial w_i} $$

We compute $\frac{\partial L}{\partial z_j}$ once and reuse it for all parameters that affect $z_j$. This sharing prevents the $O(n^2)$ cost of computing each gradient independently.

Example: For a network with $n = 10^8$ parameters:

Naive finite differences:

$$ \frac{\partial L}{\partial w_i} \approx \frac{L(w_i + \epsilon) - L(w_i)}{\epsilon} $$
Requires $n$ forward passes: $O(n \cdot m) = O(n^2)$ operations.

Backpropagation: 1 forward pass + 1 backward pass, $O(m) = O(n)$ operations in total.

Speedup: $O(n) = 10^8$×

Optimizer Memory Requirements

Different optimizers have vastly different memory requirements, which becomes critical for large models.

Memory Comparison by Optimizer

| Optimizer | Memory per Parameter | Total Memory Factor |
|---|---|---|
| SGD (no momentum) | 4 bytes (FP32 param) | 1× |
| SGD with momentum | 8 bytes (param + velocity) | 2× |
| Adam | 16 bytes (param + 2 moments) | 4× |
| Adam (mixed precision) | 10 bytes (FP16 param + FP32 master + 2 moments) | 2.5× |
Example: For BERT-base with 110M parameters:

Model parameters: $110\text{M} \times 4$ bytes $= 440$ MB (FP32)

SGD with momentum: $110\text{M} \times 8$ bytes $= 880$ MB

Adam optimizer: $110\text{M} \times 16$ bytes $\approx 1.76$ GB

Adam with mixed precision: $110\text{M} \times 10$ bytes $= 1.1$ GB

Note: Mixed precision shrinks the stored parameter copy but keeps an FP32 master copy, so optimizer memory falls only modestly; its main benefits are faster compute and smaller activations, which enable larger batch sizes.

Example: For GPT-3 (175B parameters) with Adam optimizer:

Model + optimizer states: $175\text{B} \times 16$ bytes $= 2.8$ TB

This requires distributed training across multiple GPUs. With 8× A100 GPUs (80 GB each = 640 GB total), we need model parallelism and optimizer state sharding (e.g., ZeRO optimizer).

Impact on GPU Memory Budget

For large models, optimizer states often consume more memory than the model itself. Adam uses 4× parameter memory, leaving less room for batch size and activations.
Example: Training BERT-base on A100 GPU (80 GB memory):

Memory allocation (batch size 32): model parameters $\approx 0.44$ GB, Adam optimizer states $\approx 1.7$ GB, gradients and workspace $\approx 2$ GB, activations $\approx 8.4$ GB, for $\approx 12.6$ GB in total.

Remaining: 67.4 GB available for larger batch sizes or longer sequences.

With batch size 256: Activations $\approx 67$ GB, total $\approx 71$ GB (fits comfortably).

Learning Rate Schedules

Learning rate schedules adjust $\eta$ during training to improve convergence.

Learning Rate Impact on Convergence and GPU Utilization

Learning rate affects both convergence speed and hardware efficiency. Larger learning rates enable larger batch sizes, improving GPU utilization.
Example: Training BERT-base on 1M examples:
| Learning Rate | Steps to Converge | Wall Time | Final Loss |
|---|---|---|---|
| $1 \times 10^{-5}$ | 100,000 | 12 hours | 1.85 |
| $1 \times 10^{-4}$ | 30,000 | 3.6 hours | 1.82 |
| $5 \times 10^{-4}$ | 15,000 | 1.8 hours | 1.81 |
| $1 \times 10^{-3}$ | 20,000 | 2.4 hours | 1.83 |
| $5 \times 10^{-3}$ | Diverges | - | - |

Optimal learning rate ($5 \times 10^{-4}$) achieves 6.7× faster convergence than conservative rate.

Learning Rate Scaling with Batch Size

Theorem: When increasing batch size by factor $k$, scale learning rate by $k$ to maintain convergence behavior:
$$ \eta_{\text{new}} = k \cdot \eta_{\text{base}} $$

This holds approximately for $k \leq 8$. For larger $k$, use gradual warmup.

Example: Training BERT-base with different batch sizes:
| Batch Size | Learning Rate | GPU Util. | Steps/sec | Samples/sec |
|---|---|---|---|---|
| 32 | $5 \times 10^{-4}$ | 45% | 2.1 | 67 |
| 64 | $1 \times 10^{-3}$ | 68% | 1.8 | 115 |
| 128 | $2 \times 10^{-3}$ | 85% | 1.4 | 179 |
| 256 | $4 \times 10^{-3}$ | 92% | 1.0 | 256 |
| 512 | $8 \times 10^{-3}$ | 95% | 0.6 | 307 |

Larger batches improve GPU utilization but require proportionally larger learning rates. Throughput increases 4.6× from batch 32 to 512.

Practical Learning Rates for Transformers

| Model | Batch Size | Peak Learning Rate |
|---|---|---|
| BERT-base | 256 | $1 \times 10^{-4}$ |
| BERT-large | 256 | $5 \times 10^{-5}$ |
| GPT-2 (117M) | 512 | $2.5 \times 10^{-4}$ |
| GPT-2 (1.5B) | 512 | $1.5 \times 10^{-4}$ |
| GPT-3 (175B) | 3.2M tokens | $6 \times 10^{-5}$ |
| T5-base | 128 | $1 \times 10^{-4}$ |
| T5-11B | 2048 | $1 \times 10^{-4}$ |
Larger models generally require smaller learning rates for stability. GPT-3 uses $6 \times 10^{-5}$ despite massive batch size of 3.2M tokens.

Common Schedules

Step Decay:

$$ \eta_t = \eta_0 \gamma^{\lfloor t/s \rfloor} $$
where $\gamma < 1$ (e.g., 0.1) and $s$ is step size (e.g., every 10 epochs).

Exponential Decay:

$$ \eta_t = \eta_0 e^{-\lambda t} $$

Cosine Annealing:

$$ \eta_t = \eta_{\min} + \frac{1}{2}(\eta_{\max} - \eta_{\min})\left(1 + \cos\left(\frac{t\pi}{T}\right)\right) $$

Warmup + Decay (Transformers):

$$ \eta_t = d_{\text{model}}^{-0.5} \cdot \min\left(t^{-0.5},\; t \cdot \text{warmup\_steps}^{-1.5}\right) $$

The warmup phase prevents instability in early training of transformers.
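The warmup + decay schedule can be sketched in a few lines; the $d_{\text{model}} = 512$, 4000-step-warmup setting below is the one used in Exercise 4:

```python
# Transformer learning rate schedule: linear warmup for t <= warmup steps,
# then inverse-square-root decay. The two branches meet at t = warmup.

def transformer_lr(t, d_model=512, warmup=4000):
    return d_model ** -0.5 * min(t ** -0.5, t * warmup ** -1.5)

print(f"{transformer_lr(1):.2e} "
      f"{transformer_lr(4000):.2e} "
      f"{transformer_lr(10000):.2e}")  # → 1.75e-07 6.99e-04 4.42e-04
```

The peak value at step 4000 ($\approx 7 \times 10^{-4}$) is where the warmup and decay branches of the $\min$ intersect.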

Gradient computation benefits enormously from GPU parallelism, achieving 100-150× speedups over CPUs for transformer-sized models. Mixed precision training (FP16 compute with FP32 accumulation) provides an additional 2-4× throughput improvement. Gradient accumulation enables large effective batch sizes on memory-constrained hardware, and distributed training with all-reduce synchronization scales near-linearly across GPUs. These techniques are covered in detail in Chapters 11 and 22.

Exercises

Exercise 1: Compute the gradient of $f(\vw) = \vw\transpose \mA \vw + \vb\transpose \vw + c$ where $\mA \in \R^{n \times n}$ is symmetric, $\vw, \vb \in \R^n$, and $c \in \R$.
Exercise 2: Implement backpropagation for a 2-layer network with ReLU activation. Given input $\vx = [1.0, 0.5]\transpose$, weights $\mW^{(1)} \in \R^{3 \times 2}$, $\mW^{(2)} \in \R^{1 \times 3}$, and target $y = 2.0$, compute all gradients.
Exercise 3: For Adam optimizer with $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\alpha = 0.001$:
  1. Why is bias correction necessary?
  2. What are the effective learning rates after steps $t = 1, 10, 100, 1000$?
  3. How does Adam handle sparse gradients compared to SGD?
Exercise 4: A transformer is trained with learning rate warmup over 4000 steps, then inverse square root decay. If $d_{\text{model}} = 512$:
  1. Plot the learning rate schedule for 100,000 steps
  2. What is the learning rate at step 1, 4000, and 10,000?
  3. Why is warmup beneficial for transformer training?
Exercise 5: Calculate the memory requirements for training GPT-2 (1.5B parameters) with Adam optimizer:
  1. Model parameters in FP16
  2. Optimizer states (FP32 master copy + 2 moments)
  3. Gradients in FP16
  4. Total memory for model + optimizer
  5. How many A100 GPUs (80 GB each) are needed?
Exercise 6: For BERT-base processing sequence length 512 with batch size 64:
  1. Calculate total FLOPs for one training step (forward + backward)
  2. Estimate time per step on A100 GPU (312 TFLOPS)
  3. How does mixed precision (FP16) affect throughput?
  4. What is the maximum batch size that fits in 80 GB memory?
Exercise 7: Compare gradient computation methods for a network with $10^7$ parameters:
  1. How many forward passes does finite differences require?
  2. How many passes does backpropagation require?
  3. If one forward pass takes 10 ms, compare total time
  4. Why is reverse mode AD preferred over forward mode?
Exercise 8: Implement gradient checkpointing for a 24-layer transformer:
  1. Without checkpointing, how much activation memory is needed?
  2. With checkpointing every 6 layers, what is the memory reduction?
  3. What is the computational overhead (extra forward passes)?
  4. At what model size does checkpointing become necessary?
Exercise 9: Analyze distributed training efficiency for 8 GPUs:
  1. If gradient all-reduce takes 15 ms and computation takes 100 ms, what is the scaling efficiency?
  2. How does batch size affect communication overhead?
  3. Compare ring all-reduce vs tree all-reduce for 64 GPUs
  4. When does gradient compression become beneficial?

Solutions

Full solutions for all exercises are available at https://deeplearning.hofkensvermeulen.be.

Solution: For $f(\vw) = \vw\transpose \mA \vw + \vb\transpose \vw + c$ where $\mA$ is symmetric:

Using the gradient rules $\nabla_{\vw}(\vw\transpose \mA \vw) = (\mA + \mA\transpose)\vw = 2\mA\vw$ (by symmetry of $\mA$), $\nabla_{\vw}(\vb\transpose \vw) = \vb$, and $\nabla_{\vw} c = \mathbf{0}$.

Therefore:

$$ \nabla_{\vw} f = 2\mA\vw + \vb $$
Solution: Given: $\vx = [1.0, 0.5]\transpose$, target $y = 2.0$, ReLU activation.

Let's use specific weights:

$$ \mW^{(1)} = \begin{bmatrix} 0.5 & -0.3 \\ 0.2 & 0.6 \\ -0.4 & 0.8 \end{bmatrix}, \quad \mW^{(2)} = \begin{bmatrix} 1.0 & -0.5 & 0.7 \end{bmatrix} $$

Forward pass:

$$\begin{align} \vz^{(1)} &= \mW^{(1)}\vx = \begin{bmatrix} 0.5(1.0) - 0.3(0.5) \\ 0.2(1.0) + 0.6(0.5) \\ -0.4(1.0) + 0.8(0.5) \end{bmatrix} = \begin{bmatrix} 0.35 \\ 0.50 \\ 0.00 \end{bmatrix} \\ \vh^{(1)} &= \text{ReLU}(\vz^{(1)}) = \begin{bmatrix} 0.35 \\ 0.50 \\ 0.00 \end{bmatrix} \\ z^{(2)} &= \mW^{(2)}\vh^{(1)} = 1.0(0.35) - 0.5(0.50) + 0.7(0.00) = 0.10 \\ L &= \frac{1}{2}(y - z^{(2)})^2 = \frac{1}{2}(2.0 - 0.10)^2 = 1.805 \end{align}$$

Backward pass:

$$\begin{align} \frac{\partial L}{\partial z^{(2)}} &= -(y - z^{(2)}) = -(2.0 - 0.10) = -1.90 \\ \frac{\partial L}{\partial \mW^{(2)}} &= \frac{\partial L}{\partial z^{(2)}} \vh^{(1)\transpose} = -1.90 \begin{bmatrix} 0.35 & 0.50 & 0.00 \end{bmatrix} = \begin{bmatrix} -0.665 & -0.950 & 0.000 \end{bmatrix} \\ \frac{\partial L}{\partial \vh^{(1)}} &= \mW^{(2)\transpose} \frac{\partial L}{\partial z^{(2)}} = \begin{bmatrix} 1.0 \\ -0.5 \\ 0.7 \end{bmatrix}(-1.90) = \begin{bmatrix} -1.90 \\ 0.95 \\ -1.33 \end{bmatrix} \\ \frac{\partial L}{\partial \vz^{(1)}} &= \frac{\partial L}{\partial \vh^{(1)}} \odot \text{ReLU}'(\vz^{(1)}) = \begin{bmatrix} -1.90 \\ 0.95 \\ -1.33 \end{bmatrix} \odot \begin{bmatrix} 1 \\ 1 \\ 0 \end{bmatrix} = \begin{bmatrix} -1.90 \\ 0.95 \\ 0.00 \end{bmatrix} \\ \frac{\partial L}{\partial \mW^{(1)}} &= \frac{\partial L}{\partial \vz^{(1)}} \vx\transpose = \begin{bmatrix} -1.90 \\ 0.95 \\ 0.00 \end{bmatrix} \begin{bmatrix} 1.0 & 0.5 \end{bmatrix} = \begin{bmatrix} -1.90 & -0.95 \\ 0.95 & 0.475 \\ 0.00 & 0.00 \end{bmatrix} \end{align}$$
Solution: For Adam optimizer with $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\alpha = 0.001$:

(1) Why bias correction is necessary: The first and second moment estimates are initialized to zero, creating a bias toward zero in early iterations. Without correction, the effective learning rate would be too small initially. Bias correction factors $\frac{1}{1-\beta_1^t}$ and $\frac{1}{1-\beta_2^t}$ compensate for this initialization bias.

(2) Effective learning rates: The effective learning rate is $\alpha_{\text{eff}} = \alpha \frac{\sqrt{1-\beta_2^t}}{1-\beta_1^t}$:

  1. $t = 1$: $\alpha_{\text{eff}} = 0.001 \cdot \frac{\sqrt{0.001}}{0.1} \approx 3.16 \times 10^{-4}$
  2. $t = 10$: $\alpha_{\text{eff}} = 0.001 \cdot \frac{\sqrt{0.00995}}{0.651} \approx 1.53 \times 10^{-4}$
  3. $t = 100$: $\alpha_{\text{eff}} \approx 0.001 \cdot \sqrt{0.0952} \approx 3.09 \times 10^{-4}$
  4. $t = 1000$: $\alpha_{\text{eff}} \approx 0.001 \cdot \sqrt{0.632} \approx 7.95 \times 10^{-4}$

(3) Handling sparse gradients: Adam maintains separate adaptive learning rates for each parameter through the second moment estimate $\mathbf{v}$. For sparse gradients, parameters with infrequent updates have smaller $v_i$ values, resulting in larger effective learning rates. This allows Adam to make larger updates to rarely-updated parameters, unlike SGD which treats all parameters equally. This is particularly beneficial for embedding layers and natural language processing tasks.

Solution: For transformer with $d_{\text{model}} = 512$ and warmup over 4000 steps:

The learning rate schedule is:

$$ \eta(t) = d_{\text{model}}^{-0.5} \cdot \min(t^{-0.5}, t \cdot \text{warmup}^{-1.5}) $$

(1) Plot description: The schedule has two phases: a linear warmup from near zero to the peak at step 4000, followed by an inverse-square-root decay $\propto t^{-0.5}$ over the remaining steps.

(2) Learning rates at specific steps: With $512^{-0.5} \approx 0.0442$:

  1. Step 1: $\eta \approx 0.0442 \times 4000^{-1.5} \approx 1.75 \times 10^{-7}$
  2. Step 4000 (peak): $\eta \approx 0.0442 \times 4000^{-0.5} \approx 6.99 \times 10^{-4}$
  3. Step 10,000: $\eta \approx 0.0442 \times 10000^{-0.5} \approx 4.42 \times 10^{-4}$

(3) Why warmup is beneficial: Early in training, gradients are large and noisy and the optimizer's moment estimates are unreliable. Small initial steps prevent destabilizing updates to the randomly initialized network; the learning rate reaches its full value only once the optimizer statistics have stabilized.

Solution: For GPT-2 with 1.5B parameters and Adam optimizer:

(1) Model parameters in FP16:

$$ 1.5 \times 10^9 \times 2 \text{ bytes} = 3 \times 10^9 \text{ bytes} = 3 \text{ GB} $$

(2) Optimizer states: FP32 master copy ($1.5 \times 10^9 \times 4$ bytes $= 6$ GB) plus first and second moments ($6$ GB each), for $18$ GB total.

(3) Gradients in FP16:

$$ 1.5 \times 10^9 \times 2 \text{ bytes} = 3 \text{ GB} $$

(4) Total memory:

$$ \text{Model (FP16)} + \text{Optimizer states} + \text{Gradients} = 3 + 18 + 3 = 24 \text{ GB} $$

(5) Number of A100 GPUs needed:

$$ \frac{24 \text{ GB}}{80 \text{ GB per GPU}} = 0.3 \text{ GPUs} $$

One A100 GPU is sufficient for the model and optimizer states alone. However, activations during training require additional memory, so 1-2 GPUs would be needed in practice depending on batch size.

Solution: For BERT-base with sequence length 512 and batch size 64:

(1) Total FLOPs per training step: From Example [ref], one sequence costs $\approx 289$ GFLOPs, so a batch of 64 costs $289 \times 64 \approx 18.5$ TFLOPs.

(2) Time per step on A100:

$$ \frac{18.5 \text{ TFLOPs}}{312 \text{ TFLOPs}} \approx 59 \text{ ms} $$

In practice, memory bandwidth and kernel launch overhead increase this to $\approx 80$-$100$ ms.

(3) Mixed precision impact: FP16 compute with FP32 accumulation typically yields a 2-4× throughput improvement on tensor-core GPUs and roughly halves activation memory, allowing larger batch sizes.

(4) Maximum batch size in 80 GB: Memory breakdown: Adam optimizer states $\approx 1.7$ GB, model parameters $\approx 0.44$ GB, gradients plus workspace $\approx 2$ GB.

Available for activations: $80 - 1.7 - 0.44 - 2 = 75.86$ GB

Maximum batch size: $\frac{75{,}860 \text{ MB}}{130 \text{ MB/sample}} \approx 583$ samples

Solution: For network with $10^7$ parameters:

(1) Finite differences forward passes: Requires one forward pass per parameter: $10^7$ forward passes

(2) Backpropagation passes: Requires 1 forward pass + 1 backward pass = 2 passes total

(3) Time comparison: Finite differences: $10^7$ passes $\times 10$ ms $= 10^5$ s $\approx 28$ hours. Backpropagation: 1 forward pass (10 ms) + 1 backward pass ($\approx 20$ ms) $\approx 30$ ms, a speedup of roughly $3 \times 10^6$.

(4) Why reverse mode AD is preferred: For $f: \R^n \to \R$ with $n \gg 1$, a single reverse pass yields all $n$ partial derivatives, whereas forward mode needs one pass per input dimension. With a scalar loss and millions of parameters, reverse mode is the only practical choice.

Solution: For 24-layer transformer with gradient checkpointing:

(1) Activation memory without checkpointing: Assuming $\approx 700$ MB per layer (from Example [ref]):

$$ 24 \times 700 \text{ MB} = 16{,}800 \text{ MB} \approx 16.4 \text{ GB} $$

(2) Memory reduction with checkpointing every 6 layers: We save only 4 checkpoints (layers 6, 12, 18, 24):

$$ \text{Memory} = 4 \times 700 \text{ MB} = 2{,}800 \text{ MB} \approx 2.7 \text{ GB} $$

Reduction factor: $\frac{16.4}{2.7} \approx 6\times$

(3) Computational overhead: Each checkpoint interval's forward computation is repeated once during the backward pass, adding roughly one extra forward pass. Since the forward pass is about one third of a training step's FLOPs, the overhead is $\approx 33\%$.

(4) When checkpointing becomes necessary: Checkpointing is essential when activation memory exceeds what the GPU can hold alongside parameters, gradients, and optimizer states (very deep models, long sequences, or large batches). For models at the scale of GPT-3, it is unavoidable.

Solution: For distributed training with 8 GPUs:

(1) Scaling efficiency: Each step takes $100 + 15 = 115$ ms instead of 100 ms, so efficiency is $100/115 \approx 87\%$ (assuming communication is not overlapped with computation).

(2) Batch size effect on communication: The all-reduce transfers the same gradient volume regardless of batch size, while computation time grows with the batch. Larger per-GPU batches therefore amortize the fixed communication cost and raise scaling efficiency.

(3) Ring vs tree all-reduce for 64 GPUs: Ring all-reduce is bandwidth-optimal (each GPU transmits roughly $2\times$ the gradient size in total), but its latency grows linearly with the number of GPUs; tree all-reduce has $O(\log p)$ latency but uses bandwidth less efficiently. For 64 GPUs, ring is preferred for large gradient tensors, and tree (or hierarchical schemes) for small, latency-bound messages.

(4) When gradient compression is beneficial: When training is communication-bound (slow interconnects, many workers, very large gradients), compression via quantization or sparsification reduces traffic at a small accuracy cost. When computation dominates, compression adds overhead for little benefit.
