Linear Algebra for Deep Learning

Chapter Overview

Linear algebra forms the mathematical foundation of deep learning. Neural networks perform sequences of linear transformations interspersed with nonlinear operations, making matrices and vectors the fundamental objects of study. This chapter develops the linear algebra concepts essential for understanding how deep learning models transform data, how information flows through neural architectures, and how we can interpret the geometric operations these models perform.

Unlike a pure mathematics course, our treatment emphasizes the specific linear algebra operations that appear repeatedly in deep learning: matrix multiplication for transforming representations, dot products for measuring similarity, and matrix decompositions for understanding structure. We pay particular attention to dimensions and shapes, as tracking how tensor dimensions transform through operations is crucial for implementing and debugging deep learning systems.

Learning Objectives

After completing this chapter, you will be able to:

  1. Represent data as vectors and transformations as matrices with clear understanding of dimensions
  2. Perform matrix operations and understand their geometric interpretations
  3. Calculate and interpret dot products as similarity measures
  4. Understand eigendecompositions and singular value decompositions and their applications
  5. Apply matrix norms and use them in regularization
  6. Recognize how linear algebra operations map to neural network computations

Vector Spaces and Transformations

Vectors as Data Representations

In deep learning, we represent data as vectors in high-dimensional spaces. A vector $\vx \in \R^n$ is an ordered collection of $n$ real numbers, which we can interpret geometrically as a point in $n$-dimensional space or as an arrow from the origin to that point.

Definition: A vector $\vx \in \R^n$ is an $n$-tuple of real numbers:
$$ \vx = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} $$
where each $x_i \in \R$ is called a component or element of the vector.

The dimension $n$ is the number of components in the vector. We write vectors as column vectors by default.

Example: Consider a grayscale image of size $28 \times 28$ pixels, such as an image from the MNIST handwritten digit dataset. Each pixel has an intensity value between 0 (black) and 255 (white). We can represent this image as a vector $\vx \in \R^{784}$ by concatenating all pixel values:
$$ \vx = \begin{bmatrix} x_{1,1} \\ x_{1,2} \\ \vdots \\ x_{28,28} \end{bmatrix} \in \R^{784} $$

For color images with three channels (red, green, blue), a $224 \times 224$ RGB image becomes a vector in $\R^{150528}$ ($224 \times 224 \times 3 = 150{,}528$). The enormous dimensionality of image data motivates the need for powerful models that can find meaningful patterns in such high-dimensional spaces.

Example: In natural language processing, we represent words as vectors called word embeddings. A common choice is to represent each word as a vector in $\R^{300}$ or $\R^{768}$. For instance, the word ``king'' might be represented as:
$$ \vw_{\text{king}} = \begin{bmatrix} 0.23 \\ -0.45 \\ 0.87 \\ \vdots \\ 0.12 \end{bmatrix} \in \R^{300} $$
These embeddings are learned such that semantically similar words have similar vector representations. The famous example is that $\vw_{\text{king}} - \vw_{\text{man}} + \vw_{\text{woman}} \approx \vw_{\text{queen}}$, suggesting that vector arithmetic can capture semantic relationships.

Linear Transformations

Definition: A function $T: \R^n \to \R^m$ is a linear transformation if for all vectors $\vx, \vy \in \R^n$ and all scalars $a, b \in \R$:
$$ T(a\vx + b\vy) = aT(\vx) + bT(\vy) $$

Linear transformations preserve vector space structure: they map lines to lines and preserve the origin ($T(\mathbf{0}) = \mathbf{0}$).

Matrices as Linear Transformations

Every linear transformation from $\R^n$ to $\R^m$ can be represented by an $m \times n$ matrix.

Definition: An $m \times n$ matrix $\mA$ is a rectangular array of numbers with $m$ rows and $n$ columns:
$$ \mA = \begin{bmatrix} a_{1,1} & a_{1,2} & \cdots & a_{1,n} \\ a_{2,1} & a_{2,2} & \cdots & a_{2,n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m,1} & a_{m,2} & \cdots & a_{m,n} \end{bmatrix} \in \R^{m \times n} $$
The notation $\mA \in \R^{m \times n}$ specifies the dimensions explicitly: $m$ rows and $n$ columns.
Dimension Tracking: For matrix-vector multiplication $\mA\vx = \vy$:
$$ \underbrace{\mA}_{\R^{m \times n}} \underbrace{\vx}_{\R^{n}} = \underbrace{\vy}_{\R^{m}} $$
The inner dimensions must match ($n$), and the result has the outer dimensions ($m$).
Example: A single fully-connected neural network layer performs:
$$ \vh = \mW\vx + \vb $$
where $\vx \in \R^{n_{\text{in}}}$, $\mW \in \R^{n_{\text{out}} \times n_{\text{in}}}$, $\vb \in \R^{n_{\text{out}}}$, $\vh \in \R^{n_{\text{out}}}$.

For transforming a 784-dimensional input to 256-dimensional hidden representation:

$$ \underbrace{\vh}_{\R^{256}} = \underbrace{\mW}_{\R^{256 \times 784}} \underbrace{\vx}_{\R^{784}} + \underbrace{\vb}_{\R^{256}} $$

This layer has $256 \times 784 = 200{,}704$ weights plus 256 biases, totaling 200,960 trainable parameters.
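As a quick sanity check, the same layer can be built in PyTorch (a minimal sketch; note that `nn.Linear` stores the weight with shape $[n_{\text{out}}, n_{\text{in}}]$, matching $\mW \in \R^{n_{\text{out}} \times n_{\text{in}}}$ above):

```python
import torch.nn as nn

# A fully-connected layer mapping 784 inputs to 256 outputs, as above
layer = nn.Linear(784, 256)

# PyTorch stores the weight as [n_out, n_in] and the bias as [n_out]
print(layer.weight.shape)  # torch.Size([256, 784])
print(layer.bias.shape)    # torch.Size([256])

# 256*784 weights + 256 biases = 200,960 trainable parameters
n_params = sum(p.numel() for p in layer.parameters())
print(n_params)  # 200960
```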

Concrete Numerical Example: With $n_{\text{in}} = 3$, $n_{\text{out}} = 2$:

$$\begin{align} \mW &= \begin{bmatrix} 0.5 & -0.3 & 0.8 \\ 0.2 & 0.6 & -0.4 \end{bmatrix}, \quad \vb = \begin{bmatrix} 0.1 \\ -0.2 \end{bmatrix}, \quad \vx = \begin{bmatrix} 1.0 \\ 2.0 \\ -0.5 \end{bmatrix} \end{align}$$

Computing:

$$\begin{align} \mW\vx &= \begin{bmatrix} 0.5(1.0) - 0.3(2.0) + 0.8(-0.5) \\ 0.2(1.0) + 0.6(2.0) - 0.4(-0.5) \end{bmatrix} = \begin{bmatrix} -0.5 \\ 1.6 \end{bmatrix}\\ \vh &= \begin{bmatrix} -0.5 \\ 1.6 \end{bmatrix} + \begin{bmatrix} 0.1 \\ -0.2 \end{bmatrix} = \begin{bmatrix} -0.4 \\ 1.4 \end{bmatrix} \end{align}$$
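The same computation in PyTorch, reproducing the numbers above:

```python
import torch

# Weights, bias, and input from the worked example
W = torch.tensor([[0.5, -0.3, 0.8],
                  [0.2,  0.6, -0.4]])
b = torch.tensor([0.1, -0.2])
x = torch.tensor([1.0, 2.0, -0.5])

h = W @ x + b  # matrix-vector product plus bias
print(h)       # tensor([-0.4000,  1.4000])
```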

Matrix Operations

Matrix Multiplication

Definition: For $\mA \in \R^{m \times n}$ and $\mB \in \R^{n \times p}$, their product $\mC = \mA\mB \in \R^{m \times p}$ is:
$$ c_{i,k} = \sum_{j=1}^{n} a_{i,j} b_{j,k} $$
Example: Compute $\mC = \mA\mB$ where:
$$ \mA = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} \in \R^{2 \times 2}, \quad \mB = \begin{bmatrix} 5 & 6 \\ 7 & 8 \end{bmatrix} \in \R^{2 \times 2} $$

Computing each entry:

$$\begin{align} c_{1,1} &= 1(5) + 2(7) = 19 \\ c_{1,2} &= 1(6) + 2(8) = 22 \\ c_{2,1} &= 3(5) + 4(7) = 43 \\ c_{2,2} &= 3(6) + 4(8) = 50 \end{align}$$

Therefore: $\mC = \begin{bmatrix} 19 & 22 \\ 43 & 50 \end{bmatrix}$
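Verifying the product in PyTorch, where `@` denotes matrix multiplication:

```python
import torch

A = torch.tensor([[1., 2.],
                  [3., 4.]])
B = torch.tensor([[5., 6.],
                  [7., 8.]])

C = A @ B  # equivalent to torch.matmul(A, B)
print(C)   # tensor([[19., 22.], [43., 50.]])
```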

Computational Complexity of Matrix Operations

Understanding the computational cost of matrix operations is essential for designing efficient deep learning systems.

Theorem: Computing $\mC = \mA\mB$ where $\mA \in \R^{m \times n}$ and $\mB \in \R^{n \times p}$ requires:
$$ \text{FLOPs} = 2mnp $$
floating-point operations (multiply-accumulate operations count as 2 FLOPs each).
Example: In transformer self-attention, we compute $\mA = \mQ\mK\transpose$ where $\mQ, \mK \in \R^{n \times d_k}$ (sequence length $n$, key dimension $d_k$).

Dimensions: $\underbrace{\mQ}_{\R^{n \times d_k}} \underbrace{\mK\transpose}_{\R^{d_k \times n}} = \underbrace{\mA}_{\R^{n \times n}}$

Computational cost: $2n \cdot d_k \cdot n = 2n^2 d_k$ FLOPs

For GPT-3 with $n = 2048$ tokens and $d_k = 128$:

$$ \text{FLOPs} = 2 \times (2048)^2 \times 128 = 1{,}073{,}741{,}824 \approx 1.07 \text{ GFLOPs} $$

This quadratic scaling in sequence length ($O(n^2)$) is why long-context transformers are computationally expensive.

Example: A transformer feed-forward network applies two linear transformations:
$$\begin{align} \vh &= \mW_1 \vx + \vb_1 \quad \text{where } \mW_1 \in \R^{d_{ff} \times d_{\text{model}}} \\ \vy &= \mW_2 \vh + \vb_2 \quad \text{where } \mW_2 \in \R^{d_{\text{model}} \times d_{ff}} \end{align}$$

For a batch of $B$ sequences of length $n$, input is $\mX \in \R^{B \times n \times d_{\text{model}}}$.

First transformation: $2 \cdot (Bn) \cdot d_{\text{model}} \cdot d_{ff}$ FLOPs

Second transformation: $2 \cdot (Bn) \cdot d_{ff} \cdot d_{\text{model}}$ FLOPs

Total: $4Bn \cdot d_{\text{model}} \cdot d_{ff}$ FLOPs

For BERT-base ($d_{\text{model}} = 768$, $d_{ff} = 3072$, $n = 512$, $B = 32$):

$$ \text{FLOPs} = 4 \times 32 \times 512 \times 768 \times 3072 = 154{,}618{,}822{,}656 \approx 154.6 \text{ GFLOPs} $$

Batch Matrix Multiplication

Modern deep learning frameworks process multiple examples simultaneously using batched operations.

Definition: For tensors $\mA \in \R^{B \times m \times n}$ and $\mB \in \R^{B \times n \times p}$, batch matrix multiplication produces $\mC \in \R^{B \times m \times p}$ where:
$$ \mC[b] = \mA[b] \mB[b] \quad \text{for } b = 1, \ldots, B $$
Example: In multi-head attention with $h = 12$ heads, batch size $B = 32$, sequence length $n = 512$, and head dimension $d_k = 64$:

Query tensor: $\mQ \in \R^{B \times h \times n \times d_k} = \R^{32 \times 12 \times 512 \times 64}$

Key tensor: $\mK \in \R^{B \times h \times n \times d_k} = \R^{32 \times 12 \times 512 \times 64}$

Attention scores: $\mA = \mQ\mK\transpose \in \R^{32 \times 12 \times 512 \times 512}$

This requires $B \times h \times 2n^2d_k = 32 \times 12 \times 2 \times 512^2 \times 64 = 12{,}884{,}901{,}888 \approx 12.9$ GFLOPs.

Broadcasting in PyTorch/NumPy: When dimensions don't match, broadcasting rules automatically expand dimensions by aligning them from the right, stretching size-1 dimensions to match, and adding missing dimensions as size-1. For example, adding a bias vector $\R^{768}$ to a tensor $\R^{32 \times 512 \times 768}$ broadcasts the bias across batch and sequence dimensions, effectively treating it as $\R^{1 \times 1 \times 768}$ and expanding it to match the full shape.
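A short shape-level sketch of batched multiplication and broadcasting (illustrative dimensions):

```python
import torch

# Batch matrix multiplication: one independent matmul per batch element
A = torch.randn(32, 64, 128)   # [B, m, n]
B = torch.randn(32, 128, 256)  # [B, n, p]
C = torch.bmm(A, B)
print(C.shape)  # torch.Size([32, 64, 256])

# Broadcasting: a [768] bias is treated as [1, 1, 768] and expanded
x = torch.randn(32, 512, 768)
bias = torch.randn(768)
y = x + bias
print(y.shape)  # torch.Size([32, 512, 768])
```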

Transpose

Definition: The transpose of $\mA \in \R^{m \times n}$, denoted $\mA\transpose \in \R^{n \times m}$, swaps rows and columns:
$$ [\mA\transpose]_{i,j} = a_{j,i} $$

Important properties:

$$\begin{align} (\mA\transpose)\transpose &= \mA \\ (\mA\mB)\transpose &= \mB\transpose \mA\transpose \end{align}$$

Hardware Context for Matrix Operations

Understanding how matrix operations map to hardware is crucial for writing efficient deep learning code.

Memory Layout: Row-Major vs Column-Major

Matrices are stored in memory as one-dimensional arrays, and the layout significantly affects performance. In row-major order, used by C and PyTorch, rows are stored consecutively in memory. In column-major order, used by Fortran and MATLAB, columns are stored consecutively. For a matrix $\mA = \begin{bmatrix} a & b \\ c & d \end{bmatrix}$, row-major storage produces the sequence $[a, b, c, d]$ while column-major storage produces $[a, c, b, d]$.

Cache Efficiency: Accessing memory sequentially is 10-100$\times$ faster than random access due to CPU cache lines, which typically hold 64 bytes of consecutive memory. This means you should always iterate in storage order: strided access patterns that jump across rows cause cache misses. For row-major matrices, iterate rows in the outer loop:
# Good: Sequential memory access
result = 0.0
for i in range(m):
    for j in range(n):
        result += A[i, j]  # Accesses consecutive memory

# Bad: Strided memory access
result = 0.0
for j in range(n):
    for i in range(m):
        result += A[i, j]  # Jumps across rows

GPU Acceleration and BLAS Libraries

Modern deep learning relies on highly optimized linear algebra libraries that provide standardized interfaces for common operations. The Basic Linear Algebra Subprograms (BLAS) standard defines three levels of operations: Level 1 for vector operations like dot products and norms with $O(n)$ complexity, Level 2 for matrix-vector operations like $\mA\vx$ with $O(n^2)$ complexity, and Level 3 for matrix-matrix operations like $\mA\mB$ with $O(n^3)$ complexity. Common CPU implementations include Intel MKL, OpenBLAS, and Apple Accelerate, while GPU implementations include NVIDIA cuBLAS and AMD rocBLAS. These libraries achieve near-peak hardware performance through careful optimization of memory access patterns, instruction scheduling, and hardware-specific features.

Example: NVIDIA GPUs use specialized Tensor Cores for accelerated matrix multiplication, achieving dramatically higher throughput than standard CUDA cores. The A100 GPU delivers 312 TFLOPS peak performance for FP16 operations with Tensor Cores, has 1.6 TB/s memory bandwidth, and includes 40 MB of L2 cache to reduce memory access latency.

For matrix multiplication $\mC = \mA\mB$ with $\mA, \mB \in \R^{4096 \times 4096}$, we need $2 \times 4096^3 = 137{,}438{,}953{,}472 \approx 137.4$ GFLOPs of computation and must transfer $3 \times 4096^2 \times 4 = 201{,}326{,}592 \approx 201$ MB of data (three matrices at 4 bytes per float). On an A100, the computation takes approximately $\frac{137.4 \text{ GFLOPs}}{312{,}000 \text{ GFLOPS}} \approx 0.44$ ms, making the operation compute-bound since this exceeds the memory transfer time of $\frac{201 \text{ MB}}{1{,}600{,}000 \text{ MB/s}} \approx 0.13$ ms.
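These back-of-envelope numbers can be reproduced with a few lines of arithmetic (a sketch using the peak hardware figures quoted above; real kernels rarely sustain 100\% of peak):

```python
# Compute time vs transfer time for C = A @ B with A, B in R^{4096 x 4096}, FP32
n = 4096
flops = 2 * n**3             # 2n^3 FLOPs for the multiplication
bytes_moved = 3 * n**2 * 4   # read A and B, write C, 4 bytes per float

peak_flops = 312e12          # A100 FP16 Tensor Core peak, FLOPs/sec
bandwidth = 1.6e12           # A100 memory bandwidth, bytes/sec

compute_ms = flops / peak_flops * 1e3
transfer_ms = bytes_moved / bandwidth * 1e3
print(round(compute_ms, 2), round(transfer_ms, 2))  # 0.44 0.13 -> compute-bound
```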

Compute-Bound vs Memory-Bound Operations

Definition: Arithmetic intensity measures the ratio of computation to memory access:
$$ \text{Arithmetic Intensity} = \frac{\text{FLOPs}}{\text{Bytes Transferred}} $$

Operations with high arithmetic intensity are compute-bound, meaning they are limited by computational throughput, while operations with low arithmetic intensity are memory-bound, meaning they are limited by memory bandwidth.

Example: Element-wise operations like ReLU, which computes $\vy = \max(0, \vx)$, perform $n$ comparisons while transferring $2n$ elements at 4 bytes each for a total of $8n$ bytes, yielding an arithmetic intensity of only $\frac{n}{8n} = 0.125$ FLOP/byte. This makes element-wise operations memory-bound, as the GPU spends more time waiting for data than computing.

In contrast, matrix multiplication $\mC = \mA\mB$ with $\mA, \mB \in \R^{n \times n}$ performs $2n^3$ FLOPs while transferring $3n^2 \times 4 = 12n^2$ bytes, yielding an arithmetic intensity of $\frac{2n^3}{12n^2} = \frac{n}{6}$ FLOP/byte. For $n = 1024$, this gives 170.7 FLOP/byte, making the operation compute-bound and well-suited for GPU acceleration. For smaller matrices with $n = 64$, the arithmetic intensity drops to 10.7 FLOP/byte, placing it in a transitional regime where both compute and memory bandwidth matter.
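A small helper makes this crossover easy to explore (a sketch assuming square matrices and FP32 storage):

```python
def matmul_intensity(n, bytes_per_elem=4):
    """FLOPs per byte for C = A @ B with A, B of size n x n."""
    flops = 2 * n**3                         # one multiply + one add per MAC
    bytes_moved = 3 * n**2 * bytes_per_elem  # read A and B, write C
    return flops / bytes_moved               # simplifies to n / (1.5 * bytes_per_elem)

print(round(matmul_intensity(1024), 1))  # 170.7 FLOP/byte -> compute-bound
print(round(matmul_intensity(64), 1))    # 10.7 FLOP/byte -> transitional
```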

Matrix Blocking for Cache Efficiency: Large matrix multiplications are broken into smaller blocks that fit in cache, computing $\mC_{ij} = \sum_{k} \mA_{ik} \mB_{kj}$ where each block is typically $32 \times 32$ or $64 \times 64$ elements. This blocking strategy reduces cache misses from $O(n^3)$ to $O(n^3/\sqrt{M})$ where $M$ is the cache size, dramatically improving performance by ensuring that frequently accessed data remains in fast cache memory rather than requiring slow main memory accesses.

Dot Products and Similarity

Definition: For vectors $\vx, \vy \in \R^n$, the dot product is:
$$ \vx\transpose \vy = \sum_{i=1}^{n} x_i y_i $$
Theorem: For non-zero vectors $\vx, \vy \in \R^n$:
$$ \vx\transpose \vy = \norm{\vx}_2 \norm{\vy}_2 \cos(\theta) $$
where $\theta$ is the angle between vectors and $\norm{\vx}_2 = \sqrt{\vx\transpose \vx}$ is the Euclidean norm.
Corollary: The cosine similarity between two non-zero vectors is:
$$ \text{sim}(\vx, \vy) = \frac{\vx\transpose \vy}{\norm{\vx}_2 \norm{\vy}_2} = \cos(\theta) \in [-1, 1] $$
Example: In transformer attention, we compute similarity between query and key vectors using dot products:
$$ \vq = \begin{bmatrix} 0.5 \\ 0.8 \\ 0.3 \end{bmatrix}, \quad \vk_1 = \begin{bmatrix} 0.6 \\ 0.7 \\ 0.2 \end{bmatrix}, \quad \vk_2 = \begin{bmatrix} -0.3 \\ 0.1 \\ 0.9 \end{bmatrix} $$

Computing similarities:

$$\begin{align} \vq\transpose \vk_1 &= 0.5(0.6) + 0.8(0.7) + 0.3(0.2) = 0.92 \\ \vq\transpose \vk_2 &= 0.5(-0.3) + 0.8(0.1) + 0.3(0.9) = 0.20 \end{align}$$

The query $\vq$ is more similar to $\vk_1$ (score 0.92) than to $\vk_2$ (score 0.20). These scores determine attention weights.
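Reproducing these attention scores in PyTorch:

```python
import torch

q  = torch.tensor([0.5, 0.8, 0.3])
k1 = torch.tensor([0.6, 0.7, 0.2])
k2 = torch.tensor([-0.3, 0.1, 0.9])

# Dot-product similarity scores, as used in attention
print(torch.dot(q, k1))  # tensor(0.9200)
print(torch.dot(q, k2))  # tensor(0.2000)

# Cosine similarity additionally normalizes by the vector norms
cos1 = torch.dot(q, k1) / (q.norm() * k1.norm())
```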

Matrix Decompositions

Eigenvalues and Eigenvectors

Definition: For a square matrix $\mA \in \R^{n \times n}$, a non-zero vector $\vv \in \R^n$ is an eigenvector with corresponding eigenvalue $\lambda \in \R$ if:
$$ \mA \vv = \lambda \vv $$

Geometrically, an eigenvector is only scaled (not rotated) when $\mA$ is applied. The eigenvalue $\lambda$ is the scaling factor.

Example: Find eigenvalues of:
$$ \mA = \begin{bmatrix} 3 & 1 \\ 1 & 3 \end{bmatrix} $$

Solving $\det(\mA - \lambda \mI) = 0$:

$$\begin{align} \det\begin{bmatrix} 3-\lambda & 1 \\ 1 & 3-\lambda \end{bmatrix} &= (3-\lambda)^2 - 1 = \lambda^2 - 6\lambda + 8 \\ &= (\lambda - 4)(\lambda - 2) = 0 \end{align}$$

Eigenvalues: $\lambda_1 = 4$, $\lambda_2 = 2$

For $\lambda_1 = 4$, eigenvector: $\vv_1 = \frac{1}{\sqrt{2}}\begin{bmatrix} 1 \\ 1 \end{bmatrix}$

For $\lambda_2 = 2$, eigenvector: $\vv_2 = \frac{1}{\sqrt{2}}\begin{bmatrix} 1 \\ -1 \end{bmatrix}$
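These results can be checked numerically with `torch.linalg.eigh`, which is designed for symmetric matrices and returns eigenvalues in ascending order:

```python
import torch

A = torch.tensor([[3., 1.],
                  [1., 3.]])

# eigh is the appropriate routine for symmetric matrices
eigenvalues, eigenvectors = torch.linalg.eigh(A)
print(eigenvalues)  # eigenvalues 2 and 4, in ascending order

# Columns of `eigenvectors` are unit eigenvectors: A v = lambda v
v = eigenvectors[:, 1]  # eigenvector for lambda = 4 (sign may differ)
print(A @ v)            # equals 4 * v up to rounding
```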

Singular Value Decomposition

Theorem: Any matrix $\mA \in \R^{m \times n}$ can be decomposed as:
$$ \mA = \mU \boldsymbol{\Sigma} \mV\transpose $$
where $\mU \in \R^{m \times m}$ is an orthogonal matrix of left singular vectors, $\boldsymbol{\Sigma} \in \R^{m \times n}$ is a diagonal matrix with singular values $\sigma_1 \geq \sigma_2 \geq \cdots \geq 0$, and $\mV \in \R^{n \times n}$ is an orthogonal matrix of right singular vectors.
The SVD exists for every matrix, unlike the eigendecomposition, which requires a square (and diagonalizable) matrix.

Low-Rank Approximation and Model Compression

SVD enables efficient model compression by approximating matrices with lower-rank factorizations.

Theorem: The best rank-$k$ approximation to $\mA$ in Frobenius norm is:
$$ \mA_k = \sum_{i=1}^{k} \sigma_i \vu_i \vv_i\transpose $$
where $\sigma_i$ are the $k$ largest singular values with corresponding singular vectors $\vu_i, \vv_i$.

The approximation error is:

$$ \norm{\mA - \mA_k}_F = \sqrt{\sum_{i=k+1}^{r} \sigma_i^2} $$
where $r \leq \min(m, n)$ is the rank of $\mA$.
Example: Consider weight matrix $\mW \in \R^{512 \times 2048}$ containing $1{,}048{,}576$ parameters. The full SVD gives $\mW = \mU \boldsymbol{\Sigma} \mV\transpose$ where $\mU \in \R^{512 \times 512}$, $\boldsymbol{\Sigma} \in \R^{512 \times 2048}$, and $\mV \in \R^{2048 \times 2048}$.

For a rank-$k$ approximation, we keep only the top $k$ singular values to obtain $\mW \approx \mW_1 \mW_2 = \mU_k \boldsymbol{\Sigma}_k \mV_k\transpose$ where $\mU_k \in \R^{512 \times k}$, $\boldsymbol{\Sigma}_k \in \R^{k \times k}$, and $\mV_k \in \R^{k \times 2048}$. We can absorb the diagonal matrix $\boldsymbol{\Sigma}_k$ into either factor, giving $\mW_1 = \mU_k \boldsymbol{\Sigma}_k \in \R^{512 \times k}$ and $\mW_2 = \mV_k\transpose \in \R^{k \times 2048}$.

The original matrix has $512 \times 2048 = 1{,}048{,}576$ parameters, while the compressed form has $512k + 2048k = 2560k$ parameters, yielding a compression ratio of $\frac{2560k}{1{,}048{,}576} = \frac{k}{409.6}$. For $k = 64$, we have $2560 \times 64 = 163{,}840$ parameters, achieving 84.4\% compression. For $k = 128$, we have $327{,}680$ parameters, achieving 68.8\% compression.

In terms of memory savings with 32-bit floats, the original matrix requires $1{,}048{,}576 \times 4 = 4{,}194{,}304$ bytes or approximately 4.0 MB. The compressed version with $k=64$ requires only $163{,}840 \times 4 = 655{,}360$ bytes or approximately 0.625 MB, saving 3.375 MB per layer. For a model with 100 such layers, this yields a total savings of 337.5 MB, significantly reducing memory footprint and enabling deployment on resource-constrained devices.

Example: Consider a weight matrix with singular values that decay exponentially:
$$ \sigma_i = \sigma_1 \cdot e^{-\alpha i} $$

The relative approximation error for rank-$k$ approximation is:

$$ \frac{\norm{\mW - \mW_k}_F}{\norm{\mW}_F} = \sqrt{\frac{\sum_{i=k+1}^{r} \sigma_i^2}{\sum_{i=1}^{r} \sigma_i^2}} $$

Typical results for transformer feed-forward layers:

Rank $k$ | Compression | Relative Error | Accuracy Drop
256      | 50\%        | 0.05           | $<$0.1\%
128      | 75\%        | 0.12           | 0.3\%
64       | 87.5\%      | 0.25           | 1.2\%
32       | 93.75\%     | 0.45           | 3.5\%

Sweet spot: 50-75\% compression with minimal accuracy loss.

SVD in Modern Architectures

Low-rank matrix decompositions are also central to modern parameter-efficient fine-tuning methods such as LoRA, which adds low-rank updates $\Delta \mW = \mB\mA$ (with $\mB \in \R^{d \times r}$, $\mA \in \R^{r \times k}$, $r \ll \min(d,k)$) to frozen pre-trained weights, achieving $>$99\% parameter reduction. See Chapter~20 for details.

Computing SVD and low-rank approximation in PyTorch:
import torch

# Original weight matrix
W = torch.randn(512, 2048)

# Compute SVD
U, S, Vt = torch.linalg.svd(W, full_matrices=False)

# Rank-k approximation
k = 64
W_compressed = U[:, :k] @ torch.diag(S[:k]) @ Vt[:k, :]

# Factored form for efficient computation
W1 = U[:, :k] @ torch.diag(S[:k])  # 512 x 64
W2 = Vt[:k, :]                       # 64 x 2048

# Verify approximation
error = torch.norm(W - W_compressed, p='fro')
relative_error = error / torch.norm(W, p='fro')
print(f"Relative error: {relative_error:.4f}")

# Memory comparison
original_params = W.numel()
compressed_params = W1.numel() + W2.numel()
compression_ratio = compressed_params / original_params
print(f"Compression: {(1-compression_ratio)*100:.1f}%")

Norms and Distance Metrics

Definition: For vector $\vx \in \R^n$:
$$\begin{align} \text{L1 norm (Manhattan):} \quad &\norm{\vx}_1 = \sum_{i=1}^n |x_i| \\ \text{L2 norm (Euclidean):} \quad &\norm{\vx}_2 = \sqrt{\sum_{i=1}^n x_i^2} \\ \text{L}\infty \text{ norm (Max):} \quad &\norm{\vx}_\infty = \max_i |x_i| \end{align}$$
Definition: For matrix $\mA \in \R^{m \times n}$:
$$ \text{Frobenius norm:} \quad \norm{\mA}_F = \sqrt{\sum_{i=1}^m \sum_{j=1}^n a_{i,j}^2} = \sqrt{\text{tr}(\mA\transpose \mA)} $$

Norms are used in regularization to prevent overfitting by penalizing large weights.

In PyTorch:
import torch

# Vector norms
x = torch.tensor([3.0, 4.0])
l2_norm = torch.norm(x, p=2)  # 5.0
l1_norm = torch.norm(x, p=1)  # 7.0

# Matrix Frobenius norm
W = torch.randn(256, 784)
frob_norm = torch.norm(W, p='fro')

Practical Deep Learning Examples

Embedding Layers and Memory Requirements

Example: Large language models use embedding layers to map discrete tokens to continuous vector representations. For GPT-3, the vocabulary contains $V = 50{,}257$ tokens, each mapped to a vector of dimension $d_{\text{model}} = 12{,}288$, requiring an embedding matrix $\mE \in \R^{50257 \times 12288}$ with $50{,}257 \times 12{,}288 = 617{,}558{,}016$ parameters. Storing these embeddings in 32-bit floating-point format requires $617{,}558{,}016 \times 4 = 2{,}470{,}232{,}064$ bytes or approximately 2.3 GB of memory, while 16-bit format reduces this to approximately 1.15 GB.

For a batch of $B = 32$ sequences of length $n = 2048$, the input consists of integer token IDs in $\R^{32 \times 2048}$, which the embedding layer transforms into dense representations in $\R^{32 \times 2048 \times 12288}$, requiring $32 \times 2048 \times 12288 \times 4 = 3{,}221{,}225{,}472$ bytes or approximately 3.0 GB of memory. This demonstrates why large batch sizes and long sequences quickly exhaust GPU memory, necessitating techniques like gradient checkpointing and mixed-precision training.

Complete Transformer Layer Analysis

Understanding the full computational profile of a transformer layer---parameters, FLOPs, memory, and hardware utilization---is essential for practical deep learning engineering. Section~[ref] provides a comprehensive, self-contained worked analysis of BERT-base, the standard baseline configuration used as a running example throughout this textbook.

Common Dimension Errors and Debugging

Dimension Mismatch Errors: The most common bugs in deep learning involve incompatible tensor dimensions. When debugging dimension errors, start by printing tensor shapes using print(x.shape) to verify actual dimensions against expected values. Check whether the batch dimension is present (is it $\R^{B \times \ldots}$ or just $\R^{\ldots}$?), verify the sequence length dimension (is it $\R^{B \times n \times d}$ or $\R^{B \times d \times n}$?), confirm that matrix multiplication has compatible inner dimensions, and watch for unintentional broadcasting that may hide shape mismatches. These systematic checks quickly identify the source of dimension errors and guide appropriate fixes.
Common dimension fixes in PyTorch:
import torch

# Problem: Shape mismatch in matrix multiplication
Q = torch.randn(32, 512, 768)  # [batch, seq_len, d_model]
K = torch.randn(32, 512, 768)

# Wrong: Q @ K gives error (768 != 512)
# scores = Q @ K  # Error!

# Correct: Transpose last two dimensions of K
scores = Q @ K.transpose(-2, -1)  # [32, 512, 512]

# Problem: Missing batch dimension
x = torch.randn(512, 768)  # Missing batch dimension
W = torch.randn(768, 3072)

# Add batch dimension
x = x.unsqueeze(0)  # [1, 512, 768]
output = x @ W      # [1, 512, 3072]

# Problem: Broadcasting confusion
x = torch.randn(32, 512, 768)
bias = torch.randn(768)

# This works due to broadcasting
output = x + bias  # bias broadcasts to [32, 512, 768]

# Explicit broadcasting (clearer)
output = x + bias.view(1, 1, 768)

BERT-base: A Canonical Worked Analysis

BERT-base has become the standard baseline configuration for transformer analysis. Its dimensions are adopted throughout this textbook as a running example for parameter counts, memory estimates, and computational costs. This section provides a self-contained reference; subsequent chapters cite these results rather than re-derive them.

Architecture Specification

Hyperparameter           | Value
Layers                   | $N = 12$
Model dimension          | $d_{\text{model}} = 768$
Attention heads          | $h = 12$ ($d_k = d_v = 64$ per head)
Feed-forward dimension   | $d_{ff} = 3072$ ($4 \times d_{\text{model}}$)
Vocabulary size          | $V = 30{,}000$ (WordPiece)
Maximum sequence length  | $n_{\max} = 512$

Unless stated otherwise, worked examples use batch size $B = 32$ and full sequence length $n = 512$.

Parameter Count

Per encoder layer. Each layer has three components:

$$\begin{align} \text{Multi-head attention:} \quad &4 \times 768^2 = 2{,}359{,}296 \quad \text{(Q, K, V, O projections)} \\ \text{Feed-forward network:} \quad &2 \times 768 \times 3072 + 3072 + 768 = 4{,}722{,}432 \\ \text{Layer normalization (2$\times$):} \quad &2 \times 2 \times 768 = 3{,}072 \\[4pt] \text{Total per layer:} \quad &\mathbf{7{,}084{,}800} \text{ parameters} \end{align}$$

The feed-forward network contains approximately twice as many parameters as the attention mechanism ($4.7$M vs $2.4$M), despite attention being conceptually more complex. This is because the $4\times$ expansion to $d_{ff}$ creates two large weight matrices.

Complete model:

$$\begin{align} \text{Token embeddings:} \quad &30{,}000 \times 768 = 23{,}040{,}000 \\ \text{Position embeddings:} \quad &512 \times 768 = 393{,}216 \\ \text{Token type embeddings:} \quad &2 \times 768 = 1{,}536 \\ \text{Embedding layer norm:} \quad &2 \times 768 = 1{,}536 \\ \text{12 encoder layers:} \quad &12 \times 7{,}084{,}800 = 85{,}017{,}600 \\ \text{Pooler:} \quad &768 \times 768 + 768 = 590{,}592 \\[4pt] \text{Total:} \quad &\mathbf{109{,}044{,}480 \approx 110\text{M parameters}} \end{align}$$

Embeddings account for $\sim$21\% of total parameters ($23$M out of $110$M), while the transformer layers account for $\sim$78\%. This ratio shifts with vocabulary size---a 50K-token vocabulary would push embeddings to roughly 31\% of total parameters.
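The totals above can be reproduced with plain arithmetic (a sketch following the same accounting, which includes FFN and LayerNorm biases but omits the attention projection biases):

```python
# Reproducing the BERT-base parameter count above
d_model, d_ff, V, n_max, N = 768, 3072, 30_000, 512, 12

attention = 4 * d_model**2                 # Q, K, V, O projections
ffn = 2 * d_model * d_ff + d_ff + d_model  # two weight matrices + biases
layer_norms = 2 * 2 * d_model              # two LayerNorms, scale + shift
per_layer = attention + ffn + layer_norms
print(per_layer)  # 7084800

embeddings = V * d_model + n_max * d_model + 2 * d_model + 2 * d_model
pooler = d_model * d_model + d_model
total = embeddings + N * per_layer + pooler
print(total)  # 109044480
```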

Dimension Tracking Through One Layer

Input: $\mX \in \R^{32 \times 512 \times 768}$ (batch $\times$ sequence $\times$ model dimension)

Multi-Head Attention:

$$\begin{align} \text{Q, K, V projections:} \quad &\R^{32 \times 512 \times 768} \to \R^{32 \times 512 \times 768} \\ \text{Reshape for heads:} \quad &\R^{32 \times 512 \times 768} \to \R^{32 \times 12 \times 512 \times 64} \\ \text{Attention scores:} \quad &\R^{32 \times 12 \times 512 \times 512} \quad \text{(quadratic in } n\text{)} \\ \text{Attention output:} \quad &\R^{32 \times 12 \times 512 \times 64} \\ \text{Concatenate heads:} \quad &\R^{32 \times 512 \times 768} \\ \text{Output projection:} \quad &\R^{32 \times 512 \times 768} \end{align}$$

Feed-Forward Network:

$$\begin{align} \text{Expand:} \quad &\R^{32 \times 512 \times 768} \xrightarrow{\mW_1} \R^{32 \times 512 \times 3072} \\ \text{Activate + project:} \quad &\R^{32 \times 512 \times 3072} \xrightarrow{\mW_2} \R^{32 \times 512 \times 768} \end{align}$$

Activation Memory

Intermediate activations stored for backpropagation dominate training memory:

Component (per layer)                | Memory (FP32)
Q, K, V projections ($3 \times Bnd$) | 151 MB
Attention scores ($Bhn^2$)           | 403 MB
Attention output ($Bnd$)             | 50 MB
FFN intermediate ($Bn \cdot 4d$)     | 201 MB
Per-layer total                      | 805 MB
12 layers                            | $\approx$9.7 GB

The attention scores alone ($B \times h \times n^2 \times 4$ bytes $\approx 403$~MB per layer) account for half of per-layer activation memory, illustrating the quadratic cost of self-attention. Doubling the sequence length to $n = 1024$ would quadruple the attention memory to 1.6~GB per layer.

FLOPs Analysis

Counting each multiply-accumulate as 2 FLOPs:

Self-Attention (per layer):

$$\begin{align} \text{QKV projections:} \quad &3 \times 2 \times 512 \times 768^2 = 1.81\text{ GFLOPs} \\ \text{Attention scores ($\mQ\mK\transpose$):} \quad &12 \times 2 \times 512^2 \times 64 = 0.40\text{ GFLOPs} \\ \text{Attention output ($\mA\mV$):} \quad &12 \times 2 \times 512^2 \times 64 = 0.40\text{ GFLOPs} \\ \text{Output projection:} \quad &2 \times 512 \times 768^2 = 0.60\text{ GFLOPs} \\ \text{Subtotal:} \quad &3.21\text{ GFLOPs} \end{align}$$

Feed-Forward Network (per layer):

$$\begin{align} \text{Expand ($\mW_1$):} \quad &2 \times 512 \times 768 \times 3072 = 2.42\text{ GFLOPs} \\ \text{Project ($\mW_2$):} \quad &2 \times 512 \times 3072 \times 768 = 2.42\text{ GFLOPs} \\ \text{Subtotal:} \quad &4.84\text{ GFLOPs} \end{align}$$

Totals:

$$\begin{align} \text{Per layer:} \quad &3.21 + 4.84 = 8.05\text{ GFLOPs} \\ \text{12 layers (forward pass):} \quad &96.6\text{ GFLOPs} \\ \text{Training step ($\approx 3\times$ forward):} \quad &\approx 290\text{ GFLOPs} \end{align}$$

The FFN accounts for 60\% of per-layer FLOPs, while the attention score computations ($\mQ\mK\transpose$ and $\mA\mV$) contribute only 10\%. For short sequences, optimizing FFN yields the largest gains; for long sequences, attention's $O(n^2)$ term becomes dominant.
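The per-layer totals can be reproduced directly (single sequence, $n = 512$, counting each multiply-accumulate as 2 FLOPs):

```python
# Reproducing the per-layer FLOPs totals above
n, d_model, d_ff, h, d_k = 512, 768, 3072, 12, 64

qkv = 3 * 2 * n * d_model**2       # Q, K, V projections
scores = h * 2 * n**2 * d_k        # attention scores Q K^T
weighted = h * 2 * n**2 * d_k      # attention output A V
out_proj = 2 * n * d_model**2      # output projection
attention = qkv + scores + weighted + out_proj

ffn = 2 * n * d_model * d_ff + 2 * n * d_ff * d_model

per_layer = attention + ffn
print(per_layer / 1e9)  # ~8.05 GFLOPs per layer
print(ffn / per_layer)  # ~0.60 -> FFN is about 60% of per-layer FLOPs
```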

Training Memory Budget

Component                        | Memory
Parameters (FP32)                | 440 MB
Gradients (FP32)                 | 440 MB
Adam optimizer states ($m$, $v$) | 880 MB
Activations (12 layers)          | $\approx$9.7 GB
Embeddings + overhead            | $\approx$3.6 GB
Total                            | $\approx$15.1 GB

Activations dominate at $\sim$64\% of total memory, motivating gradient checkpointing (recompute activations during the backward pass, trading 20--30\% slower training for 50--70\% memory reduction). This is why training BERT-base requires GPUs with at least 16~GB of memory.

Hardware Timing (NVIDIA A100)

The A100 provides 312~TFLOPS FP16 with Tensor Cores and 1.6~TB/s memory bandwidth. At 70\% utilization:

$$\begin{align} \text{Forward pass (single sample):} \quad &\frac{96.6\text{ GFLOPs}}{312 \times 0.7\text{ TFLOPS}} \approx 0.44\text{ ms} \\ \text{Batch of 32:} \quad &32 \times 0.44 \approx 14\text{ ms} \\ \text{Training step ($3\times$ forward):} \quad &\approx 42\text{ ms} \\ \text{Throughput:} \quad &\frac{32 \times 512}{42\text{ ms}} \approx 390{,}000\text{ tokens/sec} \end{align}$$

Memory bandwidth check: Loading 110M parameters ($\times$ 4 bytes $= 440$~MB) takes $440/1600 \approx 0.28$~ms, comparable to compute time. For small batch sizes, BERT-base becomes memory-bandwidth bound rather than compute-bound; larger batches amortize parameter loading across more computation.

Exercises

Exercise 1: Given $\vx = [2, -1, 3]\transpose$ and $\vy = [1, 4, -2]\transpose$, compute:
  1. The dot product $\vx\transpose \vy$
  2. The L2 norms $\norm{\vx}_2$ and $\norm{\vy}_2$
  3. The cosine similarity between $\vx$ and $\vy$
Exercise 2: For a transformer layer with $d_{\text{model}} = 768$ and feed-forward dimension $d_{ff} = 3072$:
  1. Calculate the number of parameters in the two linear transformations
  2. If processing a batch of $B = 32$ sequences of length $n = 512$, what are the dimensions of the input tensor?
  3. How many floating-point operations (FLOPs) are required for one forward pass through this layer?
Exercise 3: Prove that for symmetric matrix $\mA = \mA\transpose$, eigenvectors corresponding to distinct eigenvalues are orthogonal.
Exercise 4: A weight matrix $\mW \in \R^{1024 \times 4096}$ is approximated using SVD with rank $r$.
  1. Express the number of parameters as a function of $r$
  2. What value of $r$ achieves 75\% compression?
  3. What is the memory savings in MB (assuming 32-bit floats)?
Exercise 5: Consider computing attention scores $\mA = \mQ\mK\transpose$ where $\mQ, \mK \in \R^{B \times n \times d_k}$ with $B = 16$, $n = 1024$, $d_k = 64$.
  1. What are the dimensions of the output $\mA$?
  2. Calculate the total FLOPs required
  3. Compute the arithmetic intensity (FLOPs per byte transferred, assuming 32-bit floats)
  4. Is this operation compute-bound or memory-bound on a GPU with 312 TFLOPS and 1.6 TB/s bandwidth?
Exercise 6: An embedding layer has vocabulary size $V = 32{,}000$ and embedding dimension $d = 512$.
  1. How many parameters does the embedding matrix contain?
  2. What is the memory requirement in MB for 32-bit floats?
  3. For a batch of $B = 64$ sequences of length $n = 256$, what is the memory required for the embedded representations?
  4. If we use LoRA with rank $r = 16$ to adapt the embeddings, how many trainable parameters are needed?
Exercise 7: Compare the computational cost of two equivalent operations:
  1. Computing $(\mA\mB)\vx$ where $\mA \in \R^{m \times n}$, $\mB \in \R^{n \times p}$, $\vx \in \R^p$
  2. Computing $\mA(\mB\vx)$
For $m = 512$, $n = 2048$, $p = 512$, which order is more efficient and by what factor?
Exercise 8: A matrix multiplication $\mC = \mA\mB$ with $\mA, \mB \in \R^{2048 \times 2048}$ is performed on a GPU.
  1. Calculate the total FLOPs
  2. Calculate the memory transferred (assuming matrices are read once and result written once)
  3. Compute the arithmetic intensity
  4. If the GPU has 100 TFLOPS compute and 900 GB/s memory bandwidth, what is the theoretical execution time assuming perfect utilization?
  5. Which resource (compute or memory) is the bottleneck?

Solutions

Full solutions for all exercises are available at \url{https://deeplearning.hofkensvermeulen.be}.

Solution to Exercise 1: Given $\vx = [2, -1, 3]\transpose$ and $\vy = [1, 4, -2]\transpose$:

(1) Dot product:

$$ \vx\transpose \vy = 2(1) + (-1)(4) + 3(-2) = 2 - 4 - 6 = -8 $$

(2) L2 norms:

$$\begin{align} \norm{\vx}_2 &= \sqrt{2^2 + (-1)^2 + 3^2} = \sqrt{4 + 1 + 9} = \sqrt{14} \approx 3.742 \\ \norm{\vy}_2 &= \sqrt{1^2 + 4^2 + (-2)^2} = \sqrt{1 + 16 + 4} = \sqrt{21} \approx 4.583 \end{align}$$

(3) Cosine similarity:

$$ \text{sim}(\vx, \vy) = \frac{\vx\transpose \vy}{\norm{\vx}_2 \norm{\vy}_2} = \frac{-8}{\sqrt{14} \cdot \sqrt{21}} = \frac{-8}{\sqrt{294}} \approx -0.466 $$

The negative cosine similarity indicates the vectors point in somewhat opposite directions.
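These hand computations are easy to verify with a quick NumPy check:

```python
import numpy as np

# Verify Exercise 1: dot product, L2 norms, and cosine similarity.
x = np.array([2.0, -1.0, 3.0])
y = np.array([1.0, 4.0, -2.0])

dot = x @ y                                    # 2 - 4 - 6 = -8
nx, ny = np.linalg.norm(x), np.linalg.norm(y)  # sqrt(14), sqrt(21)
cos = dot / (nx * ny)                          # ≈ -0.47
print(dot, round(nx, 3), round(ny, 3), round(cos, 3))
```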

Solution to Exercise 2: For a transformer layer with $d_{\text{model}} = 768$ and $d_{ff} = 3072$:

(1) Parameters in feed-forward network: The two weight matrices contribute

$$ d_{\text{model}} \times d_{ff} + d_{ff} \times d_{\text{model}} = 768 \times 3072 + 3072 \times 768 = 4{,}718{,}592 \text{ parameters} $$

plus $d_{ff} + d_{\text{model}} = 3840$ bias terms, for a total of $4{,}722{,}432$ parameters.

(2) Input tensor dimensions: For batch size $B = 32$ and sequence length $n = 512$, the input tensor has shape:

$$ \R^{B \times n \times d_{\text{model}}} = \R^{32 \times 512 \times 768} $$

(3) FLOPs for forward pass: Counting $2mnk$ FLOPs per matrix multiplication over the full batch:

$$\begin{align} \text{Expand ($\mW_1$):} \quad &2 \times 32 \times 512 \times 768 \times 3072 \approx 77.3\text{ GFLOPs} \\ \text{Project ($\mW_2$):} \quad &2 \times 32 \times 512 \times 3072 \times 768 \approx 77.3\text{ GFLOPs} \\ \text{Total:} \quad &\approx 154.6\text{ GFLOPs} \end{align}$$

Solution to Exercise 3: Proof: Let $\mA = \mA\transpose$ be symmetric with eigenvectors $\vv_1, \vv_2$ corresponding to distinct eigenvalues $\lambda_1 \neq \lambda_2$.

We have:

$$\begin{align} \mA\vv_1 &= \lambda_1 \vv_1 \\ \mA\vv_2 &= \lambda_2 \vv_2 \end{align}$$

Consider $\vv_1\transpose \mA \vv_2$:

$$\begin{align} \vv_1\transpose \mA \vv_2 &= \vv_1\transpose (\lambda_2 \vv_2) = \lambda_2 (\vv_1\transpose \vv_2) \\ \vv_1\transpose \mA \vv_2 &= (\mA\vv_1)\transpose \vv_2 = (\lambda_1 \vv_1)\transpose \vv_2 = \lambda_1 (\vv_1\transpose \vv_2) \end{align}$$

where we used $\mA\transpose = \mA$ in the second line. Therefore:

$$ \lambda_2 (\vv_1\transpose \vv_2) = \lambda_1 (\vv_1\transpose \vv_2) $$

Rearranging:

$$ (\lambda_1 - \lambda_2)(\vv_1\transpose \vv_2) = 0 $$

Since $\lambda_1 \neq \lambda_2$, we must have $\vv_1\transpose \vv_2 = 0$, proving orthogonality.
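The result can also be illustrated numerically. A minimal NumPy sketch, using a random symmetric matrix chosen purely for illustration:

```python
import numpy as np

# For a symmetric matrix, eigenvectors of distinct eigenvalues are orthogonal.
rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = (M + M.T) / 2                      # symmetrize: A = A^T

eigvals, eigvecs = np.linalg.eigh(A)   # eigh is specialized for symmetric matrices
gram = eigvecs.T @ eigvecs             # all pairwise dot products of eigenvectors
print(np.allclose(gram, np.eye(5)))    # identity matrix => orthonormal set
```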

Solution to Exercise 4: For weight matrix $\mW \in \R^{1024 \times 4096}$ with rank-$r$ SVD approximation:

(1) Parameters as function of $r$: The factored form stores $\mU \in \R^{1024 \times r}$ and $\mV\transpose \in \R^{r \times 4096}$ (with the singular values absorbed into one factor), requiring

$$ 1024r + 4096r = 5120r \text{ parameters} $$

(2) Value of $r$ for 75\% compression: Original parameters: $1024 \times 4096 = 4{,}194{,}304$

For 75\% compression, we want 25\% of original:

$$\begin{align} 5120r &= 0.25 \times 4{,}194{,}304 \\ 5120r &= 1{,}048{,}576 \\ r &= 204.8 \approx 205 \end{align}$$

(3) Memory savings: At $r = 205$, the factors hold $5120 \times 205 = 1{,}049{,}600$ parameters versus the original $4{,}194{,}304$:

$$ (4{,}194{,}304 - 1{,}049{,}600) \times 4 \text{ bytes} = 12{,}578{,}816 \text{ bytes} \approx 12.0 \text{ MB} $$

Solution to Exercise 5: For attention scores $\mA = \mQ\mK\transpose$ with $\mQ, \mK \in \R^{B \times n \times d_k}$, $B = 16$, $n = 1024$, $d_k = 64$:

(1) Output dimensions:

$$ \mA = \underbrace{\mQ}_{\R^{16 \times 1024 \times 64}} \underbrace{\mK\transpose}_{\R^{16 \times 64 \times 1024}} = \underbrace{\mA}_{\R^{16 \times 1024 \times 1024}} $$

(2) Total FLOPs: For each batch element, we compute $\mQ_b \mK_b\transpose$ where $\mQ_b, \mK_b \in \R^{1024 \times 64}$:

$$ \text{FLOPs} = B \times 2n^2d_k = 16 \times 2 \times 1024^2 \times 64 = 2{,}147{,}483{,}648 \approx 2.15 \text{ GFLOPs} $$

(3) Arithmetic intensity: Memory transferred (read $\mQ$ and $\mK$, write $\mA$, 4 bytes per element):

$$\begin{align} \mQ, \mK:\quad &2 \times 16 \times 1024 \times 64 \times 4 = 8{,}388{,}608 \text{ bytes} \\ \mA:\quad &16 \times 1024 \times 1024 \times 4 = 67{,}108{,}864 \text{ bytes} \\ \text{Total:}\quad &75{,}497{,}472 \text{ bytes} \end{align}$$

Arithmetic intensity:

$$ \frac{2{,}147{,}483{,}648 \text{ FLOPs}}{75{,}497{,}472 \text{ bytes}} \approx 28.4 \text{ FLOP/byte} $$

(4) Compute-bound or memory-bound: The GPU's balance point is $312\text{ TFLOPS} / 1.6\text{ TB/s} = 195$ FLOP/byte. Since $28.4 < 195$, the operation is memory-bound: data movement takes $75{,}497{,}472\text{ bytes} / 1.6\text{ TB/s} \approx 47\ \mu\text{s}$, while the arithmetic takes only $2.15\text{ GFLOPs} / 312\text{ TFLOPS} \approx 6.9\ \mu\text{s}$.

Solution to Exercise 6: For an embedding layer with $V = 32{,}000$ and $d = 512$:

(1) Number of parameters:

$$ V \times d = 32{,}000 \times 512 = 16{,}384{,}000 \text{ parameters} $$

(2) Memory requirement:

$$ 16{,}384{,}000 \times 4 \text{ bytes} = 65{,}536{,}000 \text{ bytes} \approx 62.5 \text{ MB} $$

(3) Memory for embedded representations: For batch $B = 64$, sequence length $n = 256$:

$$ B \times n \times d \times 4 = 64 \times 256 \times 512 \times 4 = 33{,}554{,}432 \text{ bytes} \approx 32 \text{ MB} $$

(4) LoRA trainable parameters: LoRA adds two matrices: $\mB \in \R^{d \times r}$ and $\mA \in \R^{r \times V}$:

$$ dr + rV = 512 \times 16 + 16 \times 32{,}000 = 8{,}192 + 512{,}000 = 520{,}192 \text{ parameters} $$

This is only $\frac{520{,}192}{16{,}384{,}000} \approx 3.2\%$ of the original parameters!
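As a cross-check, the counting in this solution fits in a few lines (a sketch; variable names are illustrative):

```python
# Parameter and memory counting for Exercise 6's embedding layer.
V, d, r = 32_000, 512, 16   # vocabulary, embedding dim, LoRA rank
B, n = 64, 256              # batch size, sequence length

embed_params = V * d                  # 16,384,000 parameters
embed_mb = embed_params * 4 / 2**20   # FP32 bytes -> MB, ≈ 62.5
act_mb = B * n * d * 4 / 2**20        # embedded batch, ≈ 32
lora_params = d * r + r * V           # B in R^{d x r}, A in R^{r x V}
print(embed_params, embed_mb, act_mb, lora_params)
```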

Solution to Exercise 7: For $\mA \in \R^{512 \times 2048}$, $\mB \in \R^{2048 \times 512}$, $\vx \in \R^{512}$:

Option 1: $(\mA\mB)\vx$ — compute $\mA\mB$ first ($2 \times 512 \times 2048 \times 512 = 1{,}073{,}741{,}824$ FLOPs), then multiply the result by $\vx$ ($2 \times 512 \times 512 = 524{,}288$ FLOPs), for $1{,}074{,}266{,}112$ FLOPs total.

Option 2: $\mA(\mB\vx)$ — compute $\mB\vx$ first ($2 \times 2048 \times 512 = 2{,}097{,}152$ FLOPs), then $\mA(\mB\vx)$ ($2 \times 512 \times 2048 = 2{,}097{,}152$ FLOPs), for $4{,}194{,}304$ FLOPs total.

Efficiency comparison:

$$ \text{Speedup} = \frac{1{,}074{,}266{,}112}{4{,}194{,}304} \approx 256\times $$

Option 2 is 256× more efficient! This demonstrates the importance of operation ordering in linear algebra.

Solution to Exercise 8: For $\mC = \mA\mB$ with $\mA, \mB \in \R^{2048 \times 2048}$:

(1) Total FLOPs:

$$ 2 \times 2048^3 = 2 \times 8{,}589{,}934{,}592 = 17{,}179{,}869{,}184 \approx 17.2 \text{ GFLOPs} $$

(2) Memory transferred: Reading $\mA$ and $\mB$ and writing $\mC$, each $2048 \times 2048$ in FP32:

$$ 3 \times 2048^2 \times 4 = 50{,}331{,}648 \text{ bytes} = 48 \text{ MB} $$

(3) Arithmetic intensity:

$$ \frac{17{,}179{,}869{,}184 \text{ FLOPs}}{50{,}331{,}648 \text{ bytes}} \approx 341.3 \text{ FLOP/byte} $$

(4) Theoretical execution time:

$$\begin{align} \text{Compute:}\quad &\frac{17{,}179{,}869{,}184 \text{ FLOPs}}{100 \times 10^{12} \text{ FLOPS}} \approx 0.172 \text{ ms} \\ \text{Memory:}\quad &\frac{50{,}331{,}648 \text{ bytes}}{900 \times 10^{9} \text{ B/s}} \approx 0.056 \text{ ms} \end{align}$$

With perfect overlap of compute and data movement, execution time is bounded by the larger term: $\approx 0.172$ ms.

(5) Bottleneck: The operation is compute-bound since compute time (0.172 ms) exceeds memory time (0.056 ms). The high arithmetic intensity (341 FLOP/byte) makes this operation well-suited for GPU acceleration.
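The roofline arithmetic for this exercise fits in a few lines (a sketch, using the exercise's hypothetical 100 TFLOPS / 900 GB/s GPU):

```python
# Roofline arithmetic for C = A B with N x N FP32 matrices.
N = 2048
flops = 2 * N**3                 # multiply-add count for the full product
traffic = 3 * N * N * 4          # read A and B, write C (4 bytes/element)
intensity = flops / traffic      # FLOPs per byte moved

compute_ms = flops / 100e12 * 1e3   # at 100 TFLOPS
memory_ms = traffic / 900e9 * 1e3   # at 900 GB/s
# compute_ms > memory_ms, so the multiplication is compute-bound
print(round(intensity, 1), round(compute_ms, 3), round(memory_ms, 3))
```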
