Calculus and Optimization

Chapter Overview

Training deep learning models requires optimizing complex, high-dimensional functions. This chapter develops the calculus and optimization theory necessary to understand how neural networks learn from data. We cover multivariable calculus, gradient computation, and the optimization algorithms that power modern deep learning.

The centerpiece of this chapter is backpropagation, the algorithm that efficiently computes gradients in neural networks. We derive backpropagation from first principles, showing how the chain rule enables gradient computation through arbitrarily deep computational graphs. We then explore gradient descent and its variants, which use these gradients to iteratively improve model parameters.

Learning Objectives

After completing this chapter, you will be able to:

  1. Compute gradients and Jacobians for multivariable functions
  2. Apply the chain rule to composite functions
  3. Understand and implement the backpropagation algorithm
  4. Implement gradient descent and its variants (SGD, momentum, Adam)
  5. Analyze convergence properties of optimization algorithms
  6. Apply learning rate schedules and regularization techniques

Multivariable Calculus

Partial Derivatives

Definition: For function $f: \R^n \to \R$, the partial derivative with respect to $x_i$ is:
$$ \frac{\partial f}{\partial x_i} = \lim_{h \to 0} \frac{f(x_1, \ldots, x_i + h, \ldots, x_n) - f(x_1, \ldots, x_i, \ldots, x_n)}{h} $$
Example: For $f(x_1, x_2) = x_1^2 + 3x_1 x_2 + x_2^2$:
$$\begin{align} \frac{\partial f}{\partial x_1} &= 2x_1 + 3x_2 \\ \frac{\partial f}{\partial x_2} &= 3x_1 + 2x_2 \end{align}$$

At point $(x_1, x_2) = (1, 2)$:

$$\begin{align} \frac{\partial f}{\partial x_1}\bigg|_{(1,2)} &= 2(1) + 3(2) = 8 \\ \frac{\partial f}{\partial x_2}\bigg|_{(1,2)} &= 3(1) + 2(2) = 7 \end{align}$$
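The partial derivatives above can be verified numerically with central finite differences; this is a minimal sketch, not tied to any particular library:

```python
# Numerical check of the partial derivatives of f(x1, x2) = x1^2 + 3*x1*x2 + x2^2
# at the point (1, 2), using central finite differences.

def f(x1, x2):
    return x1**2 + 3 * x1 * x2 + x2**2

def partial(f, x, i, h=1e-6):
    """Central finite-difference approximation of df/dx_i at point x."""
    xp, xm = list(x), list(x)
    xp[i] += h
    xm[i] -= h
    return (f(*xp) - f(*xm)) / (2 * h)

x = (1.0, 2.0)
df_dx1 = partial(f, x, 0)   # analytic: 2(1) + 3(2) = 8
df_dx2 = partial(f, x, 1)   # analytic: 3(1) + 2(2) = 7
print(round(df_dx1, 4), round(df_dx2, 4))  # → 8.0 7.0
```

Central differences are exact for quadratics up to floating-point rounding, so the numbers match the analytic values.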

Gradients

Definition: For function $f: \R^n \to \R$, the gradient is the vector of partial derivatives:
$$ \nabla f(\vx) = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix} \in \R^n $$

The gradient points in the direction of steepest ascent of the function.

Example: For mean squared error loss:
$$ L(\vw) = \frac{1}{N} \sum_{i=1}^N (y_i - \vw\transpose \vx^{(i)})^2 $$

The gradient with respect to $\vw$ is:

$$ \nabla_{\vw} L = -\frac{2}{N} \sum_{i=1}^N (y_i - \vw\transpose \vx^{(i)}) \vx^{(i)} $$

For $N=1$, $\vw = [w_1, w_2]\transpose$, $\vx = [1, 2]\transpose$, $y = 5$, and current prediction $\hat{y} = \vw\transpose \vx = 3$:

$$ \nabla_{\vw} L = -2(5 - 3) \begin{bmatrix} 1 \\ 2 \end{bmatrix} = \begin{bmatrix} -4 \\ -8 \end{bmatrix} $$

The negative gradient $-\nabla_{\vw} L = [4, 8]\transpose$ points toward better parameters.
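The arithmetic above is easy to reproduce in code; the weight vector $\vw = [1, 1]\transpose$ below is a hypothetical choice consistent with the stated prediction $\hat{y} = 3$:

```python
# MSE gradient for N = 1: grad = -2 * (y - w.T x) * x
# w is chosen (hypothetically) so that w.T x = 3, matching the example.
w = [1.0, 1.0]          # 1*1 + 1*2 = 3
x = [1.0, 2.0]
y = 5.0

y_hat = sum(wi * xi for wi, xi in zip(w, x))     # prediction: 3.0
grad = [-2 * (y - y_hat) * xi for xi in x]
print(grad)  # → [-4.0, -8.0]
```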

The Chain Rule

Theorem: For composite function $h(\vx) = f(g(\vx))$ where $g: \R^n \to \R^m$ and $f: \R^m \to \R$:
$$ \frac{\partial h}{\partial x_i} = \sum_{j=1}^m \frac{\partial f}{\partial g_j} \frac{\partial g_j}{\partial x_i} $$

In vector form:

$$ \nabla_{\vx} h = \mJ_g\transpose \nabla_{\vz} f $$
where $\vz = g(\vx)$ and $\mJ_g \in \R^{m \times n}$ is the Jacobian of $g$.
Example: For neural network layer: $\vy = \sigma(\mW\vx + \vb)$ where $\sigma$ is applied element-wise.

Let $\vz = \mW\vx + \vb$ (pre-activation). Then:

$$ \frac{\partial L}{\partial \vx} = \mW\transpose \left( \frac{\partial L}{\partial \vy} \odot \sigma'(\vz) \right) $$

where $\odot$ denotes element-wise multiplication.

Concrete example: For ReLU activation $\sigma(z) = \max(0, z)$:

$$ \sigma'(z) = \begin{cases} 1 & \text{if } z > 0 \\ 0 & \text{if } z \leq 0 \end{cases} $$

If $\vz = [2.0, -1.0, 0.5]\transpose$, then $\sigma'(\vz) = [1, 0, 1]\transpose$.
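A minimal NumPy sketch of the layer backward rule $\frac{\partial L}{\partial \vx} = \mW\transpose(\frac{\partial L}{\partial \vy} \odot \sigma'(\vz))$ for ReLU. The weights and bias are chosen (hypothetically) so that the pre-activation equals the $\vz = [2.0, -1.0, 0.5]\transpose$ from the example, and a simple sum loss is used; the analytic gradient is checked against finite differences:

```python
import numpy as np

W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
b = np.array([0.0, 0.0, -0.5])
x = np.array([2.0, -1.0])            # gives z = [2.0, -1.0, 0.5] as in the text

def loss(x):
    z = W @ x + b
    return np.maximum(z, 0.0).sum()  # L = sum(ReLU(Wx + b))

z = W @ x + b
dL_dy = np.ones(3)                   # gradient of sum() w.r.t. each output
dL_dx = W.T @ (dL_dy * (z > 0))      # chain rule: W.T (dL/dy ⊙ σ'(z))

# Finite-difference check of the analytic gradient
eps = 1e-6
num = np.array([(loss(x + eps * e) - loss(x - eps * e)) / (2 * eps)
                for e in np.eye(2)])
print(dL_dx, np.allclose(dL_dx, num))  # → [2. 1.] True
```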

Jacobian and Hessian Matrices

Definition: For function $\mathbf{f}: \R^n \to \R^m$, the Jacobian matrix is:
$$ \mJ_{\mathbf{f}}(\vx) = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2} & \cdots & \frac{\partial f_2}{\partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \frac{\partial f_m}{\partial x_2} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix} \in \R^{m \times n} $$
Definition: For function $f: \R^n \to \R$, the Hessian matrix contains second derivatives:
$$ \mH_f(\vx) = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots \\ \vdots & \vdots & \ddots \end{bmatrix} \in \R^{n \times n} $$

The Hessian describes the local curvature of the function. For smooth functions, $\mH$ is symmetric.
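Both definitions can be checked on the running example $f(x_1, x_2) = x_1^2 + 3x_1x_2 + x_2^2$, whose Hessian is the constant symmetric matrix $\begin{bmatrix} 2 & 3 \\ 3 & 2 \end{bmatrix}$. A sketch that builds the Hessian as the finite-difference Jacobian of the analytic gradient:

```python
import numpy as np

def grad(x):
    """Analytic gradient of f(x1, x2) = x1^2 + 3*x1*x2 + x2^2."""
    x1, x2 = x
    return np.array([2 * x1 + 3 * x2, 3 * x1 + 2 * x2])

def hessian(x, h=1e-5):
    """Finite-difference Jacobian of the gradient: column j holds
    the change in the gradient when x_j is perturbed."""
    n = len(x)
    H = np.zeros((n, n))
    for j in range(n):
        e = np.zeros(n)
        e[j] = h
        H[:, j] = (grad(x + e) - grad(x - e)) / (2 * h)
    return H

H = hessian(np.array([1.0, 2.0]))
print(np.round(H, 4))           # constant Hessian [[2, 3], [3, 2]]
print(np.allclose(H, H.T))      # symmetric, → True
```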

Gradient Descent

The Gradient Descent Algorithm

Gradient descent iteratively moves parameters in the direction opposite to the gradient:

Algorithm: Gradient Descent
Input: Objective function $f(\vw)$, initial parameters $\vw^{(0)}$, learning rate $\eta$, iterations $T$
Output: Optimized parameters $\vw^{(T)}$
for $t = 0$ to $T-1$ do
Compute gradient: $\mathbf{g}^{(t)} = \nabla f(\vw^{(t)})$
Update parameters: $\vw^{(t+1)} = \vw^{(t)} - \eta \mathbf{g}^{(t)}$
return $\vw^{(T)}$
The learning rate $\eta$ controls the step size. Too large: divergence. Too small: slow convergence.
Example: Minimize $f(w) = w^2$ starting from $w^{(0)} = 3$ with $\eta = 0.1$:
$$\begin{align} t=0:& \quad w^{(0)} = 3, \quad g^{(0)} = 2w^{(0)} = 6, \quad w^{(1)} = 3 - 0.1(6) = 2.4 \\ t=1:& \quad w^{(1)} = 2.4, \quad g^{(1)} = 4.8, \quad w^{(2)} = 2.4 - 0.1(4.8) = 1.92 \\ t=2:& \quad w^{(2)} = 1.92, \quad g^{(2)} = 3.84, \quad w^{(3)} = 1.92 - 0.1(3.84) = 1.536 \end{align}$$

The parameters converge to $w^* = 0$ (the minimum).
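The worked example maps directly to code; each step multiplies $w$ by $(1 - 2\eta) = 0.8$:

```python
# Gradient descent on f(w) = w^2 from w0 = 3 with learning rate 0.1,
# reproducing the iterates in the example above.

def gradient_descent(w, eta, steps):
    trajectory = [w]
    for _ in range(steps):
        g = 2 * w            # f'(w) = 2w
        w = w - eta * g
        trajectory.append(w)
    return trajectory

traj = gradient_descent(w=3.0, eta=0.1, steps=3)
print([round(w, 4) for w in traj])  # → [3.0, 2.4, 1.92, 1.536]
```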

Stochastic Gradient Descent (SGD)

For large datasets, computing the full gradient is expensive. SGD approximates the gradient using mini-batches.

Algorithm: Stochastic Gradient Descent (SGD)
Input: Dataset $\mathcal{D} = \{(\vx^{(i)}, y^{(i)})\}_{i=1}^N$, batch size $B$, learning rate $\eta$, epochs $E$
Output: Optimized parameters $\vw$
Initialize $\vw$ randomly
for epoch $e = 1$ to $E$ do
Shuffle dataset $\mathcal{D}$
for each mini-batch $\mathcal{B} \subset \mathcal{D}$ of size $B$ do
Compute mini-batch gradient: $\mathbf{g} = \frac{1}{B} \sum_{(\vx, y) \in \mathcal{B}} \nabla_{\vw} L(\vw; \vx, y)$
Update: $\vw \leftarrow \vw - \eta \mathbf{g}$
return $\vw$
PyTorch SGD implementation (the DataLoader here wraps a small synthetic regression dataset so the snippet runs standalone):
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Synthetic regression data for illustration
X = torch.randn(256, 10)
y = torch.randn(256, 1)
dataloader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

# Model and loss
model = nn.Linear(10, 1)
criterion = nn.MSELoss()

# SGD optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Training loop
for epoch in range(100):
    for x_batch, y_batch in dataloader:
        # Forward pass
        y_pred = model(x_batch)
        loss = criterion(y_pred, y_batch)

        # Backward pass
        optimizer.zero_grad()  # Clear previous gradients
        loss.backward()        # Compute gradients
        optimizer.step()       # Update parameters

Momentum

Momentum accelerates SGD by accumulating a velocity vector, drawing inspiration from physics where a ball rolling down a hill builds up speed.

The Intuition Behind Momentum

Standard SGD can be slow and oscillatory, especially in regions where the loss surface has different curvatures in different directions (ravines). In a landscape that is steep in one direction but shallow in another, plain SGD oscillates back and forth across the steep walls while creeping along the shallow floor.

Momentum acts as an exponentially weighted moving average of gradients, smoothing out noisy updates and accelerating convergence in consistent directions.

Mathematical Formulation

Algorithm: SGD with Momentum
Input: Learning rate $\eta$, momentum coefficient $\beta$ (typically 0.9)
Initialize velocity $\mathbf{v} = \mathbf{0}$
for each iteration do
Compute gradient $\mathbf{g} = \nabla_{\vw} L(\vw)$
Update velocity: $\mathbf{v} \leftarrow \beta \mathbf{v} + \mathbf{g}$
Update parameters: $\vw \leftarrow \vw - \eta \mathbf{v}$

The velocity update can be expanded to show its exponential weighting:

$$ \mathbf{v}_t = \beta \mathbf{v}_{t-1} + \mathbf{g}_t = \mathbf{g}_t + \beta \mathbf{g}_{t-1} + \beta^2 \mathbf{g}_{t-2} + \beta^3 \mathbf{g}_{t-3} + \cdots $$

With $\beta = 0.9$, the current gradient has weight 1, the previous gradient has weight 0.9, two steps ago has weight 0.81, and so on. This creates a "memory" of approximately $\frac{1}{1-\beta} = 10$ steps.

Why Momentum Works Better

  1. Dampens oscillations: In directions with high curvature (steep gradients that change sign), the velocity accumulates opposing gradients, reducing oscillation amplitude.
  2. Accelerates in consistent directions: When gradients consistently point in the same direction, the velocity builds up, effectively increasing the learning rate in that direction.
  3. Escapes shallow local minima: The accumulated velocity can carry the optimization through small barriers, potentially escaping poor local minima.
  4. Reduces sensitivity to learning rate: The smoothing effect makes the optimization more robust to learning rate choices.
Example: Consider minimizing $f(w_1, w_2) = 10w_1^2 + 0.1w_2^2$ starting from $(w_1, w_2) = (1, 1)$ with $\eta = 0.01$:

SGD without momentum:

$$\begin{align} t=0:& \quad \mathbf{g} = [20, 0.2]\transpose, \quad \vw^{(1)} = [1, 1]\transpose - 0.01[20, 0.2]\transpose = [0.8, 0.998]\transpose \\ t=1:& \quad \mathbf{g} = [16, 0.1996]\transpose, \quad \vw^{(2)} = [0.8, 0.998]\transpose - 0.01[16, 0.1996]\transpose = [0.64, 0.996]\transpose \end{align}$$

Progress in $w_1$ direction: $1 \to 0.8 \to 0.64$ (slow, 20% per step)

SGD with momentum ($\beta = 0.9$):

$$\begin{align} t=0:& \quad \mathbf{g} = [20, 0.2]\transpose, \quad \mathbf{v}^{(1)} = [20, 0.2]\transpose, \quad \vw^{(1)} = [0.8, 0.998]\transpose \\ t=1:& \quad \mathbf{g} = [16, 0.1996]\transpose, \quad \mathbf{v}^{(2)} = 0.9[20, 0.2]\transpose + [16, 0.1996]\transpose = [34, 0.38]\transpose \\ & \quad \vw^{(2)} = [0.8, 0.998]\transpose - 0.01[34, 0.38]\transpose = [0.46, 0.994]\transpose \end{align}$$

Progress in $w_1$ direction: $1 \to 0.8 \to 0.46$ (faster, 42.5% in the second step due to accumulated velocity)

Momentum accelerates convergence by 2-3× in this ravine scenario.
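The ravine example can be reproduced in a few lines, comparing the $w_1$ coordinate after two steps of plain SGD and SGD with momentum:

```python
# Minimizing f(w1, w2) = 10*w1^2 + 0.1*w2^2 from (1, 1) with eta = 0.01,
# reproducing the worked example above.

def grad(w):
    return [20 * w[0], 0.2 * w[1]]

def sgd(w, eta, steps):
    for _ in range(steps):
        g = grad(w)
        w = [wi - eta * gi for wi, gi in zip(w, g)]
    return w

def sgd_momentum(w, eta, beta, steps):
    v = [0.0, 0.0]
    for _ in range(steps):
        g = grad(w)
        v = [beta * vi + gi for vi, gi in zip(v, g)]   # v ← βv + g
        w = [wi - eta * vi for wi, vi in zip(w, v)]
    return w

plain = sgd([1.0, 1.0], eta=0.01, steps=2)
mom = sgd_momentum([1.0, 1.0], eta=0.01, beta=0.9, steps=2)
print(round(plain[0], 3), round(mom[0], 3))  # → 0.64 0.46
```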

Adam Optimizer

Adam (Adaptive Moment Estimation) combines momentum with adaptive learning rates, creating a powerful optimizer that adapts to the geometry of the loss landscape for each parameter individually.

The Intuition Behind Adam

Adam addresses two key limitations of standard SGD with momentum:

  1. Different parameters need different learning rates: In deep networks, some parameters (e.g., early layer weights) may have small gradients while others (e.g., output layer weights) have large gradients. A single global learning rate is suboptimal.
  2. Gradient magnitudes vary across training: Early in training, gradients may be large and noisy. Later, they become smaller and more stable. Adam adapts to these changes automatically.
Adam maintains two moving averages for each parameter: the first moment (mean of gradients, like momentum) and the second moment (uncentered variance of gradients). The second moment enables adaptive per-parameter learning rates.

Mathematical Formulation

Algorithm: Adam Optimizer
Input: Learning rate $\alpha$ (default 0.001), $\beta_1$ = 0.9, $\beta_2$ = 0.999, $\epsilon$ = $10^{-8}$
Initialize $\mathbf{m}_0 = \mathbf{0}$ (first moment), $\mathbf{v}_0 = \mathbf{0}$ (second moment), $t = 0$
while not converged do
$t \leftarrow t + 1$
Compute gradient: $\mathbf{g}_t = \nabla_{\vw} L(\vw_{t-1})$
Update biased first moment: $\mathbf{m}_t = \beta_1 \mathbf{m}_{t-1} + (1 - \beta_1) \mathbf{g}_t$
Update biased second moment: $\mathbf{v}_t = \beta_2 \mathbf{v}_{t-1} + (1 - \beta_2) \mathbf{g}_t^2$
Bias-corrected first moment: $\hat{\mathbf{m}}_t = \mathbf{m}_t / (1 - \beta_1^t)$
Bias-corrected second moment: $\hat{\mathbf{v}}_t = \mathbf{v}_t / (1 - \beta_2^t)$
Update parameters: $\vw_t = \vw_{t-1} - \alpha \hat{\mathbf{m}}_t / (\sqrt{\hat{\mathbf{v}}_t} + \epsilon)$

Understanding Each Component

First moment ($\mathbf{m}_t$): Exponentially weighted average of gradients (momentum)

$$ \mathbf{m}_t = \beta_1 \mathbf{m}_{t-1} + (1 - \beta_1) \mathbf{g}_t $$
This provides the direction of the update, smoothing out noisy gradients. With $\beta_1 = 0.9$, it averages over approximately $\frac{1}{1-0.9} = 10$ steps.

Second moment ($\mathbf{v}_t$): Exponentially weighted average of squared gradients

$$ \mathbf{v}_t = \beta_2 \mathbf{v}_{t-1} + (1 - \beta_2) \mathbf{g}_t^2 $$
This estimates the variance of gradients. With $\beta_2 = 0.999$, it averages over approximately $\frac{1}{1-0.999} = 1000$ steps, providing a stable estimate of gradient scale.

Bias correction: Since $\mathbf{m}_0 = \mathbf{v}_0 = \mathbf{0}$, early estimates are biased toward zero. The correction factors $\frac{1}{1-\beta_1^t}$ and $\frac{1}{1-\beta_2^t}$ compensate for this initialization bias, ensuring proper scaling in early iterations.

Adaptive update: The final update is:

$$ \vw_t = \vw_{t-1} - \alpha \frac{\hat{\mathbf{m}}_t}{\sqrt{\hat{\mathbf{v}}_t} + \epsilon} $$

Each parameter $w_i$ gets an effective learning rate of $\frac{\alpha}{\sqrt{\hat{v}_{i,t}} + \epsilon}$: parameters with consistently large gradients take smaller steps, while parameters with small or infrequent gradients take larger ones.

Example: Training a 2-layer network on MNIST with 10,000 examples:
| Optimizer | Steps to 95% Accuracy | Final Accuracy | Training Time |
|---|---|---|---|
| SGD ($\eta = 0.01$) | 8,500 | 96.2% | 12.3 min |
| SGD + Momentum ($\eta = 0.01$, $\beta = 0.9$) | 4,200 | 97.1% | 6.1 min |
| Adam ($\alpha = 0.001$) | 2,100 | 97.8% | 3.0 min |

Adam converges 4× faster than SGD and 2× faster than momentum, while achieving higher final accuracy.

Example: Consider two parameters with different gradient patterns over 5 steps:

Parameter 1 (consistent large gradients):

$$\begin{align} \text{Gradients:} & \quad [10.0, 9.8, 10.2, 9.9, 10.1] \\ \hat{m}_5 & \approx 10.0 \quad \text{(first moment)} \\ \hat{v}_5 & \approx 100.0 \quad \text{(second moment)} \\ \text{Effective LR:} & \quad \frac{0.001}{\sqrt{100.0}} = 0.0001 \end{align}$$

Parameter 2 (small noisy gradients):

$$\begin{align} \text{Gradients:} & \quad [0.5, -0.3, 0.4, -0.2, 0.6] \\ \hat{m}_5 & \approx 0.2 \quad \text{(first moment, noise cancels)} \\ \hat{v}_5 & \approx 0.16 \quad \text{(second moment)} \\ \text{Effective LR:} & \quad \frac{0.001}{\sqrt{0.16}} = 0.0025 \end{align}$$

Parameter 2 gets 25× larger effective learning rate, compensating for its smaller gradients. This automatic adaptation is why Adam excels at training deep networks with heterogeneous parameter scales.
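A minimal scalar Adam step following the algorithm above illustrates the adaptation: on step 1 the bias-corrected update is $\alpha \cdot g/(|g| + \epsilon) \approx \alpha$, so two parameters whose gradients differ by $100\times$ move by nearly the same amount:

```python
# One Adam step for two parameters with very different gradient magnitudes.
import math

def adam_step(w, g, m, v, t, alpha=0.001, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g          # biased first moment
    v = b2 * v + (1 - b2) * g * g      # biased second moment
    m_hat = m / (1 - b1 ** t)          # bias correction
    v_hat = v / (1 - b2 ** t)
    w = w - alpha * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

w1, m1, v1 = adam_step(w=0.0, g=10.0, m=0.0, v=0.0, t=1)
w2, m2, v2 = adam_step(w=0.0, g=0.1, m=0.0, v=0.0, t=1)
print(round(w1, 6), round(w2, 6))  # → -0.001 -0.001
```

Without the second moment in the denominator, the first parameter would have moved $100\times$ farther than the second.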

Adam is the most commonly used optimizer for training transformers and large language models due to its robustness, fast convergence, and ability to handle the diverse gradient patterns across different layers and attention heads.

Gradient Computation Complexity

Understanding the computational and memory costs of gradient computation is essential for training large models efficiently.

FLOPs for Gradient Computation

Computing gradients via backpropagation costs approximately 2× the FLOPs of the forward pass, since gradients must be propagated with respect to both activations and weights. A full training step (forward + backward) therefore costs roughly 3× a forward pass.
Example: For BERT-base (110M parameters, 12 layers, $d_{\text{model}} = 768$) processing sequence length $n = 512$:

Forward pass: $\approx 96$ GFLOPs per sequence.

Backward pass: $\approx 193$ GFLOPs per sequence ($\approx 2\times$ the forward pass).

Total per training step: $\approx 289$ GFLOPs

For batch size $B = 32$: $289 \times 32 \approx 9.2$ TFLOPs per batch.

Memory Requirements for Activations

During backpropagation, intermediate activations must be stored for gradient computation.

Definition: For a network with $L$ layers processing batch size $B$, activation memory is:
$$ M_{\text{act}} = B \sum_{\ell=1}^L d_\ell $$
where $d_\ell$ is the flattened dimension of layer $\ell$'s output (for a transformer layer, sequence length times hidden size). Multiply by bytes per element (4 for FP32) to convert this count into memory.
Example: For BERT-base ($B = 32$, $n = 512$, $d = 768$), per-layer activations total $\approx 704$ MB in FP32, dominated by the attention scores ($B \times h \times n^2 \times 4$ bytes $= 402$ MB, or 57% of per-layer memory). Across all 12 layers, activations alone consume $\approx 8.4$ GB, excluding gradients and optimizer states. See Section [ref] for the complete breakdown.

Automatic Differentiation: Forward vs Reverse Mode

Definition: Forward mode computes derivatives by propagating tangent vectors forward through the computational graph. For function $f: \R^n \to \R^m$, computing $\nabla f$ requires $n$ forward passes.
Definition: Reverse mode (backpropagation) computes derivatives by propagating adjoints backward. Computing $\nabla f$ requires 1 forward pass + 1 backward pass, regardless of $n$.
For neural networks where $n \gg m$ (millions of parameters, one loss), reverse mode is vastly more efficient: $O(1)$ passes vs $O(n)$ passes.
Example: For a network with $n = 10^8$ parameters (100M) and scalar loss ($m = 1$):

Forward mode: $n = 10^8$ forward passes, one per parameter.

Reverse mode (backpropagation): 1 forward pass + 1 backward pass, independent of $n$.

Gradient Checkpointing

Gradient checkpointing trades computation for memory by recomputing activations during the backward pass.

Algorithm: Gradient Checkpointing
Input: Network with $L$ layers, checkpoint every $k$ layers
// Forward Pass
for $\ell = 1$ to $L$ do
Compute $\vh^{(\ell)} = f^{(\ell)}(\vh^{(\ell-1)})$
if $\ell \bmod k = 0$ then
Save $\vh^{(\ell)}$ to memory (checkpoint)
// Backward Pass
for $\ell = L$ to $1$ do
Recompute forward from last checkpoint to layer $\ell$
Compute gradient $\nabla_{\vh^{(\ell-1)}} L$ using $\vh^{(\ell)}$
Example: For BERT-base (12 layers) with checkpointing every 3 layers:

Without checkpointing: activations for all 12 layers are stored, $12 \times 700 \text{ MB} \approx 8.4$ GB.

With checkpointing (every 3 layers): only 4 checkpoints are stored, $4 \times 700 \text{ MB} \approx 2.8$ GB (a $3\times$ reduction), at the cost of roughly one extra forward pass during backpropagation.

For GPT-3 (175B parameters), checkpointing is essential to fit in GPU memory.
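The algorithm above can be sketched in plain Python with toy scalar layers; the layer functions and interval $k = 3$ are purely illustrative (in PyTorch one would use torch.utils.checkpoint instead):

```python
# Store activations only every k layers on the forward pass; recompute the
# missing ones from the nearest checkpoint when the backward pass needs them.

layers = [lambda h, i=i: h * 1.1 + i for i in range(12)]  # 12 toy scalar "layers"
k = 3                                                     # checkpoint interval

def forward_with_checkpoints(x):
    """Forward pass that keeps only every k-th activation."""
    checkpoints = {0: x}
    h = x
    for idx in range(1, len(layers) + 1):
        h = layers[idx - 1](h)
        if idx % k == 0:
            checkpoints[idx] = h
    return h, checkpoints

def recompute(checkpoints, idx):
    """Recompute the activation after `idx` layers from the nearest checkpoint."""
    base = (idx // k) * k
    h = checkpoints[base]
    for f in layers[base:idx]:
        h = f(h)
    return h

# Reference: a full forward pass that keeps everything
acts = [1.0]
for f in layers:
    acts.append(f(acts[-1]))

out, ckpts = forward_with_checkpoints(1.0)
print(len(ckpts), len(acts))           # 5 stored values instead of 13
print(recompute(ckpts, 7) == acts[7])  # → True
```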

Backpropagation

Backpropagation efficiently computes gradients in neural networks using the chain rule.

Computational Graphs

A computational graph represents the sequence of operations in a neural network. Each node is an operation, and edges carry values/gradients.

Example: For $L = (y - \hat{y})^2$ where $\hat{y} = w_2 \sigma(w_1 x + b_1) + b_2$, with $x = 1.0$, $w_1 = 2.0$, $b_1 = 0.5$, $w_2 = 1.5$, $b_2 = 0.3$, $y = 3.0$, and sigmoid activation $\sigma$:

Forward pass:

$$\begin{align} z_1 &= w_1 x + b_1 = 2.0(1.0) + 0.5 = 2.5 \\ a_1 &= \sigma(z_1) = \sigma(2.5) = 0.924 \quad \text{(sigmoid)} \\ z_2 &= w_2 a_1 + b_2 = 1.5(0.924) + 0.3 = 1.686 \\ L &= (y - z_2)^2 = (3.0 - 1.686)^2 = 1.726 \end{align}$$

Backward pass:

$$\begin{align} \frac{\partial L}{\partial z_2} &= 2(z_2 - y) = 2(1.686 - 3.0) = -2.628 \\ \frac{\partial L}{\partial w_2} &= \frac{\partial L}{\partial z_2} \cdot a_1 = -2.628(0.924) = -2.428 \\ \frac{\partial L}{\partial a_1} &= \frac{\partial L}{\partial z_2} \cdot w_2 = -2.628(1.5) = -3.942 \\ \frac{\partial L}{\partial z_1} &= \frac{\partial L}{\partial a_1} \cdot \sigma'(z_1) = -3.942(0.070) = -0.276 \\ \frac{\partial L}{\partial w_1} &= \frac{\partial L}{\partial z_1} \cdot x = -0.276(1.0) = -0.276 \end{align}$$
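The forward and backward passes above can be reproduced numerically, using the same values ($x = 1.0$, $w_1 = 2.0$, $b_1 = 0.5$, $w_2 = 1.5$, $b_2 = 0.3$, $y = 3.0$):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x, w1, b1, w2, b2, y = 1.0, 2.0, 0.5, 1.5, 0.3, 3.0

# Forward pass
z1 = w1 * x + b1            # 2.5
a1 = sigmoid(z1)            # ≈ 0.924
z2 = w2 * a1 + b2           # ≈ 1.686
L = (y - z2) ** 2           # ≈ 1.726

# Backward pass (chain rule, reusing intermediate gradients)
dL_dz2 = 2 * (z2 - y)               # ≈ -2.628
dL_dw2 = dL_dz2 * a1                # ≈ -2.428
dL_da1 = dL_dz2 * w2                # ≈ -3.942
dL_dz1 = dL_da1 * a1 * (1 - a1)     # sigmoid'(z) = σ(z)(1 - σ(z)), ≈ -0.276
dL_dw1 = dL_dz1 * x                 # ≈ -0.276

print(round(L, 3), round(dL_dw2, 3), round(dL_dw1, 3))  # → 1.726 -2.428 -0.276
```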

Backpropagation Algorithm

Algorithm: Backpropagation
Input: Training example $(\vx, y)$, network with $L$ layers
Output: Gradients $\{\nabla_{\mW^{(\ell)}} L, \nabla_{\vb^{(\ell)}} L\}_{\ell=1}^L$
// Forward Pass
$\vh^{(0)} = \vx$
for $\ell = 1$ to $L$ do
$\vz^{(\ell)} = \mW^{(\ell)} \vh^{(\ell-1)} + \vb^{(\ell)}$
$\vh^{(\ell)} = \sigma^{(\ell)}(\vz^{(\ell)})$
$\hat{y} = \vh^{(L)}$
Compute loss: $L = \text{Loss}(y, \hat{y})$

// Backward Pass
$\boldsymbol{\delta}^{(L)} = \nabla_{\vh^{(L)}} L \odot \sigma'^{(L)}(\vz^{(L)})$
for $\ell = L$ to $1$ do
$\nabla_{\mW^{(\ell)}} L = \boldsymbol{\delta}^{(\ell)} (\vh^{(\ell-1)})\transpose$
$\nabla_{\vb^{(\ell)}} L = \boldsymbol{\delta}^{(\ell)}$
if $\ell > 1$ then
$\boldsymbol{\delta}^{(\ell-1)} = (\mW^{(\ell)})\transpose \boldsymbol{\delta}^{(\ell)} \odot \sigma'^{(\ell-1)}(\vz^{(\ell-1)})$
return All gradients
Backpropagation computes gradients in $O(n)$ time where $n$ is the number of parameters, compared to $O(n^2)$ for naive methods. This efficiency enables training of billion-parameter models.

Why Backpropagation is $O(n)$ Not $O(n^2)$

Theorem: For a neural network with $n$ parameters and $m$ operations, backpropagation computes all gradients in $O(m)$ time, where typically $m = O(n)$.
Proof (Intuition): Each operation in the forward pass corresponds to one gradient computation in the backward pass. The chain rule allows us to reuse intermediate gradients:
$$ \frac{\partial L}{\partial w_i} = \frac{\partial L}{\partial z_j} \cdot \frac{\partial z_j}{\partial w_i} $$

We compute $\frac{\partial L}{\partial z_j}$ once and reuse it for all parameters that affect $z_j$. This sharing prevents the $O(n^2)$ cost of computing each gradient independently.

Example: For a network with $n = 10^8$ parameters:

Naive finite differences:

$$ \frac{\partial L}{\partial w_i} \approx \frac{L(w_i + \epsilon) - L(w_i)}{\epsilon} $$
Requires $n$ forward passes: $O(n \cdot m) = O(n^2)$ operations.

Backpropagation: 1 forward pass + 1 backward pass, $O(m) = O(n)$ operations in total.

Speedup: $O(n) = 10^8$×

Optimizer Memory Requirements

Different optimizers have vastly different memory requirements, which becomes critical for large models.

Memory Comparison by Optimizer

| Optimizer | Memory per Parameter | Total Memory Factor |
|---|---|---|
| SGD (no momentum) | 4 bytes (FP32 param) | 1× |
| SGD with momentum | 8 bytes (param + velocity) | 2× |
| Adam | 16 bytes (param + 2 moments) | 4× |
| Adam (mixed precision) | 10 bytes (FP16 param + FP32 master + 2 moments) | 2.5× |
Example: For BERT-base with 110M parameters:

Model parameters: $110\text{M} \times 4$ bytes $= 440$ MB (FP32)

SGD with momentum: $110\text{M} \times 8$ bytes $= 880$ MB

Adam optimizer: $110\text{M} \times 16$ bytes $\approx 1.76$ GB

Adam with mixed precision: $110\text{M} \times 10$ bytes $= 1.1$ GB

Note: Mixed precision shrinks the stored parameter copy but keeps an FP32 master copy, so optimizer memory falls only modestly; its main benefits are faster compute and smaller activations, which enable larger batch sizes.

Example: For GPT-3 (175B parameters) with Adam optimizer:

Model + optimizer states: $175\text{B} \times 16$ bytes $= 2.8$ TB

This requires distributed training across multiple GPUs. With 8× A100 GPUs (80 GB each = 640 GB total), we need model parallelism and optimizer state sharding (e.g., ZeRO optimizer).

Impact on GPU Memory Budget

For large models, optimizer states often consume more memory than the model itself. Adam uses 4× parameter memory, leaving less room for batch size and activations.
Example: Training BERT-base on A100 GPU (80 GB memory):

Memory allocation (batch size 32): model parameters $\approx 0.44$ GB, Adam optimizer states $\approx 1.7$ GB, gradients and workspace $\approx 2$ GB, activations $\approx 8.4$ GB, for $\approx 12.6$ GB in total.

Remaining: 67.4 GB available for larger batch sizes or longer sequences.

With batch size 256: Activations $\approx 67$ GB, total $\approx 71$ GB (fits comfortably).

Learning Rate Schedules

Learning rate schedules adjust $\eta$ during training to improve convergence.

Learning Rate Impact on Convergence and GPU Utilization

Learning rate affects both convergence speed and hardware efficiency. Larger learning rates enable larger batch sizes, improving GPU utilization.
Example: Training BERT-base on 1M examples:
| Learning Rate | Steps to Converge | Wall Time | Final Loss |
|---|---|---|---|
| $1 \times 10^{-5}$ | 100,000 | 12 hours | 1.85 |
| $1 \times 10^{-4}$ | 30,000 | 3.6 hours | 1.82 |
| $5 \times 10^{-4}$ | 15,000 | 1.8 hours | 1.81 |
| $1 \times 10^{-3}$ | 20,000 | 2.4 hours | 1.83 |
| $5 \times 10^{-3}$ | Diverges | - | - |

Optimal learning rate ($5 \times 10^{-4}$) achieves 6.7× faster convergence than conservative rate.

Learning Rate Scaling with Batch Size

Theorem: When increasing batch size by factor $k$, scale learning rate by $k$ to maintain convergence behavior:
$$ \eta_{\text{new}} = k \cdot \eta_{\text{base}} $$

This holds approximately for $k \leq 8$. For larger $k$, use gradual warmup.

Example: Training BERT-base with different batch sizes:
| Batch Size | Learning Rate | GPU Util. | Steps/sec | Samples/sec |
|---|---|---|---|---|
| 32 | $5 \times 10^{-4}$ | 45% | 2.1 | 67 |
| 64 | $1 \times 10^{-3}$ | 68% | 1.8 | 115 |
| 128 | $2 \times 10^{-3}$ | 85% | 1.4 | 179 |
| 256 | $4 \times 10^{-3}$ | 92% | 1.0 | 256 |
| 512 | $8 \times 10^{-3}$ | 95% | 0.6 | 307 |

Larger batches improve GPU utilization but require proportionally larger learning rates. Throughput increases 4.6× from batch 32 to 512.

Practical Learning Rates for Transformers

| Model | Batch Size | Peak Learning Rate |
|---|---|---|
| BERT-base | 256 | $1 \times 10^{-4}$ |
| BERT-large | 256 | $5 \times 10^{-5}$ |
| GPT-2 (117M) | 512 | $2.5 \times 10^{-4}$ |
| GPT-2 (1.5B) | 512 | $1.5 \times 10^{-4}$ |
| GPT-3 (175B) | 3.2M tokens | $6 \times 10^{-5}$ |
| T5-base | 128 | $1 \times 10^{-4}$ |
| T5-11B | 2048 | $1 \times 10^{-4}$ |
Larger models generally require smaller learning rates for stability. GPT-3 uses $6 \times 10^{-5}$ despite massive batch size of 3.2M tokens.

Common Schedules

Step Decay:

$$ \eta_t = \eta_0 \gamma^{\lfloor t/s \rfloor} $$
where $\gamma < 1$ (e.g., 0.1) and $s$ is step size (e.g., every 10 epochs).

Exponential Decay:

$$ \eta_t = \eta_0 e^{-\lambda t} $$

Cosine Annealing:

$$ \eta_t = \eta_{\min} + \frac{1}{2}(\eta_{\max} - \eta_{\min})\left(1 + \cos\left(\frac{t\pi}{T}\right)\right) $$

Warmup + Decay (Transformers):

$$ \eta_t = d_{\text{model}}^{-0.5} \cdot \min\left(t^{-0.5},\; t \cdot \text{warmup\_steps}^{-1.5}\right) $$

The warmup phase prevents instability in early training of transformers.
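The warmup + decay schedule can be sketched in a few lines; the $d_{\text{model}} = 512$, 4000-step-warmup setting below is the one used in Exercise 4:

```python
# Transformer learning rate schedule: linear warmup for t <= warmup steps,
# then inverse-square-root decay. The two branches meet at t = warmup.

def transformer_lr(t, d_model=512, warmup=4000):
    return d_model ** -0.5 * min(t ** -0.5, t * warmup ** -1.5)

print(f"{transformer_lr(1):.2e} "
      f"{transformer_lr(4000):.2e} "
      f"{transformer_lr(10000):.2e}")  # → 1.75e-07 6.99e-04 4.42e-04
```

The peak value at step 4000 ($\approx 7 \times 10^{-4}$) is where the warmup and decay branches of the $\min$ intersect.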

Gradient computation benefits enormously from GPU parallelism, achieving 100-150× speedups over CPUs for transformer-sized models. Mixed precision training (FP16 compute with FP32 accumulation) provides an additional 2-4× throughput improvement. Gradient accumulation enables large effective batch sizes on memory-constrained hardware, and distributed training with all-reduce synchronization scales near-linearly across GPUs. These techniques are covered in detail in Chapters 11 and 22.

Exercises

Exercise 1: Compute the gradient of $f(\vw) = \vw\transpose \mA \vw + \vb\transpose \vw + c$ where $\mA \in \R^{n \times n}$ is symmetric, $\vw, \vb \in \R^n$, and $c \in \R$.
Exercise 2: Implement backpropagation for a 2-layer network with ReLU activation. Given input $\vx = [1.0, 0.5]\transpose$, weights $\mW^{(1)} \in \R^{3 \times 2}$, $\mW^{(2)} \in \R^{1 \times 3}$, and target $y = 2.0$, compute all gradients.
Exercise 3: For Adam optimizer with $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\alpha = 0.001$:
  1. Why is bias correction necessary?
  2. What are the effective learning rates after steps $t = 1, 10, 100, 1000$?
  3. How does Adam handle sparse gradients compared to SGD?
Exercise 4: A transformer is trained with learning rate warmup over 4000 steps, then inverse square root decay. If $d_{\text{model}} = 512$:
  1. Plot the learning rate schedule for 100,000 steps
  2. What is the learning rate at step 1, 4000, and 10,000?
  3. Why is warmup beneficial for transformer training?
Exercise 5: Calculate the memory requirements for training GPT-2 (1.5B parameters) with Adam optimizer:
  1. Model parameters in FP16
  2. Optimizer states (FP32 master copy + 2 moments)
  3. Gradients in FP16
  4. Total memory for model + optimizer
  5. How many A100 GPUs (80 GB each) are needed?
Exercise 6: For BERT-base processing sequence length 512 with batch size 64:
  1. Calculate total FLOPs for one training step (forward + backward)
  2. Estimate time per step on A100 GPU (312 TFLOPS)
  3. How does mixed precision (FP16) affect throughput?
  4. What is the maximum batch size that fits in 80 GB memory?
Exercise 7: Compare gradient computation methods for a network with $10^7$ parameters:
  1. How many forward passes does finite differences require?
  2. How many passes does backpropagation require?
  3. If one forward pass takes 10 ms, compare total time
  4. Why is reverse mode AD preferred over forward mode?
Exercise 8: Implement gradient checkpointing for a 24-layer transformer:
  1. Without checkpointing, how much activation memory is needed?
  2. With checkpointing every 6 layers, what is the memory reduction?
  3. What is the computational overhead (extra forward passes)?
  4. At what model size does checkpointing become necessary?
Exercise 9: Analyze distributed training efficiency for 8 GPUs:
  1. If gradient all-reduce takes 15 ms and computation takes 100 ms, what is the scaling efficiency?
  2. How does batch size affect communication overhead?
  3. Compare ring all-reduce vs tree all-reduce for 64 GPUs
  4. When does gradient compression become beneficial?

Solutions

Full solutions for all exercises are available at https://deeplearning.hofkensvermeulen.be.

Solution: For $f(\vw) = \vw\transpose \mA \vw + \vb\transpose \vw + c$ where $\mA$ is symmetric:

Using the gradient rules $\nabla_{\vw}(\vw\transpose \mA \vw) = (\mA + \mA\transpose)\vw = 2\mA\vw$ (by symmetry of $\mA$), $\nabla_{\vw}(\vb\transpose \vw) = \vb$, and $\nabla_{\vw} c = \mathbf{0}$.

Therefore:

$$ \nabla_{\vw} f = 2\mA\vw + \vb $$
Solution: Given: $\vx = [1.0, 0.5]\transpose$, target $y = 2.0$, ReLU activation.

Let's use specific weights:

$$ \mW^{(1)} = \begin{bmatrix} 0.5 & -0.3 \\ 0.2 & 0.6 \\ -0.4 & 0.8 \end{bmatrix}, \quad \mW^{(2)} = \begin{bmatrix} 1.0 & -0.5 & 0.7 \end{bmatrix} $$

Forward pass:

$$\begin{align} \vz^{(1)} &= \mW^{(1)}\vx = \begin{bmatrix} 0.5(1.0) - 0.3(0.5) \\ 0.2(1.0) + 0.6(0.5) \\ -0.4(1.0) + 0.8(0.5) \end{bmatrix} = \begin{bmatrix} 0.35 \\ 0.50 \\ 0.00 \end{bmatrix} \\ \vh^{(1)} &= \text{ReLU}(\vz^{(1)}) = \begin{bmatrix} 0.35 \\ 0.50 \\ 0.00 \end{bmatrix} \\ z^{(2)} &= \mW^{(2)}\vh^{(1)} = 1.0(0.35) - 0.5(0.50) + 0.7(0.00) = 0.10 \\ L &= \frac{1}{2}(y - z^{(2)})^2 = \frac{1}{2}(2.0 - 0.10)^2 = 1.805 \end{align}$$

Backward pass:

$$\begin{align} \frac{\partial L}{\partial z^{(2)}} &= -(y - z^{(2)}) = -(2.0 - 0.10) = -1.90 \\ \frac{\partial L}{\partial \mW^{(2)}} &= \frac{\partial L}{\partial z^{(2)}} \vh^{(1)\transpose} = -1.90 \begin{bmatrix} 0.35 & 0.50 & 0.00 \end{bmatrix} = \begin{bmatrix} -0.665 & -0.950 & 0.000 \end{bmatrix} \\ \frac{\partial L}{\partial \vh^{(1)}} &= \mW^{(2)\transpose} \frac{\partial L}{\partial z^{(2)}} = \begin{bmatrix} 1.0 \\ -0.5 \\ 0.7 \end{bmatrix}(-1.90) = \begin{bmatrix} -1.90 \\ 0.95 \\ -1.33 \end{bmatrix} \\ \frac{\partial L}{\partial \vz^{(1)}} &= \frac{\partial L}{\partial \vh^{(1)}} \odot \text{ReLU}'(\vz^{(1)}) = \begin{bmatrix} -1.90 \\ 0.95 \\ -1.33 \end{bmatrix} \odot \begin{bmatrix} 1 \\ 1 \\ 0 \end{bmatrix} = \begin{bmatrix} -1.90 \\ 0.95 \\ 0.00 \end{bmatrix} \\ \frac{\partial L}{\partial \mW^{(1)}} &= \frac{\partial L}{\partial \vz^{(1)}} \vx\transpose = \begin{bmatrix} -1.90 \\ 0.95 \\ 0.00 \end{bmatrix} \begin{bmatrix} 1.0 & 0.5 \end{bmatrix} = \begin{bmatrix} -1.90 & -0.95 \\ 0.95 & 0.475 \\ 0.00 & 0.00 \end{bmatrix} \end{align}$$
Solution: For Adam optimizer with $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\alpha = 0.001$:

(1) Why bias correction is necessary: The first and second moment estimates are initialized to zero, creating a bias toward zero in early iterations. Without correction, the effective learning rate would be too small initially. Bias correction factors $\frac{1}{1-\beta_1^t}$ and $\frac{1}{1-\beta_2^t}$ compensate for this initialization bias.

(2) Effective learning rates: The effective learning rate is $\alpha_{\text{eff}} = \alpha \frac{\sqrt{1-\beta_2^t}}{1-\beta_1^t}$:

  1. $t = 1$: $\alpha_{\text{eff}} = 0.001 \cdot \frac{\sqrt{0.001}}{0.1} \approx 3.16 \times 10^{-4}$
  2. $t = 10$: $\alpha_{\text{eff}} = 0.001 \cdot \frac{\sqrt{0.00995}}{0.651} \approx 1.53 \times 10^{-4}$
  3. $t = 100$: $\alpha_{\text{eff}} \approx 0.001 \cdot \sqrt{0.0952} \approx 3.09 \times 10^{-4}$
  4. $t = 1000$: $\alpha_{\text{eff}} \approx 0.001 \cdot \sqrt{0.632} \approx 7.95 \times 10^{-4}$

(3) Handling sparse gradients: Adam maintains separate adaptive learning rates for each parameter through the second moment estimate $\mathbf{v}$. For sparse gradients, parameters with infrequent updates have smaller $v_i$ values, resulting in larger effective learning rates. This allows Adam to make larger updates to rarely-updated parameters, unlike SGD which treats all parameters equally. This is particularly beneficial for embedding layers and natural language processing tasks.

Solution: For transformer with $d_{\text{model}} = 512$ and warmup over 4000 steps:

The learning rate schedule is:

$$ \eta(t) = d_{\text{model}}^{-0.5} \cdot \min(t^{-0.5}, t \cdot \text{warmup}^{-1.5}) $$

(1) Plot description: The schedule has two phases: a linear warmup from near zero to the peak at step 4000, followed by an inverse-square-root decay $\propto t^{-0.5}$ over the remaining steps.

(2) Learning rates at specific steps: With $512^{-0.5} \approx 0.0442$:

  1. Step 1: $\eta \approx 0.0442 \times 4000^{-1.5} \approx 1.75 \times 10^{-7}$
  2. Step 4000 (peak): $\eta \approx 0.0442 \times 4000^{-0.5} \approx 6.99 \times 10^{-4}$
  3. Step 10,000: $\eta \approx 0.0442 \times 10000^{-0.5} \approx 4.42 \times 10^{-4}$

(3) Why warmup is beneficial: Early in training, gradients are large and noisy and the optimizer's moment estimates are unreliable. Small initial steps prevent destabilizing updates to the randomly initialized network; the learning rate reaches its full value only once the optimizer statistics have stabilized.

Solution: For GPT-2 with 1.5B parameters and Adam optimizer:

(1) Model parameters in FP16:

$$ 1.5 \times 10^9 \times 2 \text{ bytes} = 3 \times 10^9 \text{ bytes} = 3 \text{ GB} $$

(2) Optimizer states: FP32 master copy ($1.5 \times 10^9 \times 4$ bytes $= 6$ GB) plus first and second moments ($6$ GB each), for $18$ GB total.

(3) Gradients in FP16:

$$ 1.5 \times 10^9 \times 2 \text{ bytes} = 3 \text{ GB} $$

(4) Total memory:

$$ \text{Model (FP16)} + \text{Optimizer states} + \text{Gradients} = 3 + 18 + 3 = 24 \text{ GB} $$

(5) Number of A100 GPUs needed:

$$ \frac{24 \text{ GB}}{80 \text{ GB per GPU}} = 0.3 \text{ GPUs} $$

One A100 GPU is sufficient for the model and optimizer states alone. However, activations during training require additional memory, so 1-2 GPUs would be needed in practice depending on batch size.

Solution: For BERT-base with sequence length 512 and batch size 64:

(1) Total FLOPs per training step: From Example [ref], one sequence costs $\approx 289$ GFLOPs, so a batch of 64 costs $289 \times 64 \approx 18.5$ TFLOPs.

(2) Time per step on A100:

$$ \frac{18.5 \text{ TFLOPs}}{312 \text{ TFLOPs}} \approx 59 \text{ ms} $$

In practice, memory bandwidth and kernel launch overhead increase this to $\approx 80$-$100$ ms.

(3) Mixed precision impact: FP16 compute with FP32 accumulation typically yields a 2-4× throughput improvement on tensor-core GPUs and roughly halves activation memory, allowing larger batch sizes.

(4) Maximum batch size in 80 GB: Memory breakdown: Adam optimizer states $\approx 1.7$ GB, model parameters $\approx 0.44$ GB, gradients plus workspace $\approx 2$ GB.

Available for activations: $80 - 1.7 - 0.44 - 2 = 75.86$ GB

Maximum batch size: $\frac{75{,}860 \text{ MB}}{130 \text{ MB/sample}} \approx 583$ samples

Solution: For network with $10^7$ parameters:

(1) Finite differences forward passes: Requires one forward pass per parameter: $10^7$ forward passes

(2) Backpropagation passes: Requires 1 forward pass + 1 backward pass = 2 passes total

(3) Time comparison: Finite differences: $10^7$ passes $\times 10$ ms $= 10^5$ s $\approx 28$ hours. Backpropagation: 1 forward pass (10 ms) + 1 backward pass ($\approx 20$ ms) $\approx 30$ ms, a speedup of roughly $3 \times 10^6$.

(4) Why reverse mode AD is preferred: For $f: \R^n \to \R$ with $n \gg 1$, a single reverse pass yields all $n$ partial derivatives, whereas forward mode needs one pass per input dimension. With a scalar loss and millions of parameters, reverse mode is the only practical choice.

Solution: For 24-layer transformer with gradient checkpointing:

(1) Activation memory without checkpointing: Assuming $\approx 700$ MB per layer (from Example [ref]):

$$ 24 \times 700 \text{ MB} = 16{,}800 \text{ MB} \approx 16.4 \text{ GB} $$

(2) Memory reduction with checkpointing every 6 layers: We save only 4 checkpoints (layers 6, 12, 18, 24):

$$ \text{Memory} = 4 \times 700 \text{ MB} = 2{,}800 \text{ MB} \approx 2.7 \text{ GB} $$

Reduction factor: $\frac{16.4}{2.7} \approx 6\times$

(3) Computational overhead: Each checkpoint interval's forward computation is repeated once during the backward pass, adding roughly one extra forward pass. Since the forward pass is about one third of a training step's FLOPs, the overhead is $\approx 33\%$.

(4) When checkpointing becomes necessary: Checkpointing is essential when activation memory exceeds what the GPU can hold alongside parameters, gradients, and optimizer states (very deep models, long sequences, or large batches). For models at the scale of GPT-3, it is unavoidable.

Solution: For distributed training with 8 GPUs:

(1) Scaling efficiency: Each step takes $100 + 15 = 115$ ms instead of 100 ms, so efficiency is $100/115 \approx 87\%$ (assuming communication is not overlapped with computation).

(2) Batch size effect on communication: The all-reduce transfers the same gradient volume regardless of batch size, while computation time grows with the batch. Larger per-GPU batches therefore amortize the fixed communication cost and raise scaling efficiency.

(3) Ring vs tree all-reduce for 64 GPUs: Ring all-reduce is bandwidth-optimal (each GPU transmits roughly $2\times$ the gradient size in total), but its latency grows linearly with the number of GPUs; tree all-reduce has $O(\log p)$ latency but uses bandwidth less efficiently. For 64 GPUs, ring is preferred for large gradient tensors, and tree (or hierarchical schemes) for small, latency-bound messages.

(4) When gradient compression is beneficial: When training is communication-bound (slow interconnects, many workers, very large gradients), compression via quantization or sparsification reduces traffic at a small accuracy cost. When computation dominates, compression adds overhead for little benefit.
