The Transformer Model
Chapter Overview
The Transformer architecture, introduced in "Attention is All You Need" (Vaswani et al., 2017), revolutionized deep learning by replacing recurrence with pure attention mechanisms. This chapter presents the complete transformer architecture, combining all attention mechanisms from previous chapters into a powerful encoder-decoder model.
We develop the transformer from bottom to top: starting with the attention layer, building encoder and decoder blocks, and assembling the full architecture. We provide complete mathematical specifications, dimension tracking, and parameter counts for standard transformer configurations.
Learning Objectives
- Understand the complete transformer encoder-decoder architecture
- Implement position-wise feed-forward networks
- Apply layer normalization and residual connections
- Compute output dimensions through the entire network
- Count parameters for transformer models (BERT-base, GPT-2)
- Understand training objectives for different transformer variants
Transformer Architecture Overview
High-Level Structure
The transformer architecture represents a fundamental departure from the recurrent and convolutional architectures that dominated sequence modeling before 2017. At its core, the transformer is an encoder-decoder architecture that processes sequences entirely through attention mechanisms, eliminating the sequential dependencies that made RNNs difficult to parallelize. The encoder processes the input sequence and produces contextualized representations where each position has attended to all other positions in the input. The decoder then generates the output sequence autoregressively, attending both to its own previously generated tokens and to the encoder's output through a cross-attention mechanism. This design enables the model to capture long-range dependencies without the vanishing gradient problems that plague recurrent architectures, while simultaneously allowing massive parallelization during training.
The key innovation that makes transformers practical is the elimination of recurrence in favor of pure attention mechanisms. In an RNN, processing a sequence of length $n$ requires $n$ sequential steps, each depending on the previous hidden state. This sequential dependency means that even with unlimited computational resources, the time complexity remains $O(n)$ because operations cannot be parallelized across time steps. The transformer, by contrast, computes attention between all pairs of positions simultaneously, requiring only $O(1)$ sequential operations regardless of sequence length. For a sequence of length 512, this means the difference between 512 sequential steps (RNN) and a single parallel operation (transformer). On modern GPUs with thousands of cores, this parallelization advantage translates to training speedups of 10-100× compared to recurrent architectures.
The transformer achieves this parallelization through multi-head self-attention, which allows each position to attend to all positions in a single operation. For an input sequence $\mX \in \R^{n \times d_{\text{model}}}$, the self-attention mechanism computes attention scores between all $n^2$ pairs of positions simultaneously, producing an output of the same shape $\R^{n \times d_{\text{model}}}$. This operation is entirely parallelizable across both the batch dimension and the sequence dimension, making it ideally suited for GPU acceleration. The multi-head aspect further enhances expressiveness by allowing the model to attend to different representation subspaces simultaneously: one head might capture syntactic relationships while another captures semantic similarity.
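The shape-preserving, all-pairs computation described above can be sketched in a few lines of NumPy. This is a minimal single-head version with illustrative names (`self_attention`, `Wq`, etc. are not from the chapter); it omits multi-head splitting and masking.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention (illustrative sketch).

    X: (n, d_model) input; the output has the same shape, as noted above."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv          # each (n, d_model)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])   # (n, n): all n^2 pairs at once
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                         # (n, d_model)

rng = np.random.default_rng(0)
n, d = 8, 16
X = rng.standard_normal((n, d))
W = [rng.standard_normal((d, d)) * d**-0.5 for _ in range(3)]
out = self_attention(X, *W)
```

Note that no loop over positions appears: the $(n, n)$ score matrix is produced by a single matrix product, which is why the operation parallelizes so well.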
However, pure attention mechanisms lack an inherent notion of sequence order. Unlike RNNs where position information is implicit in the sequential processing, transformers must explicitly encode positional information. This is achieved through positional encodings that are added to the input embeddings, providing each position with a unique signature that the attention mechanism can use to distinguish positions. The original transformer uses sinusoidal positional encodings, though learned positional embeddings have also proven effective. This explicit position encoding is crucial: without it, the transformer would be permutation-invariant, treating "the cat sat" identically to "sat cat the."
The transformer architecture also incorporates residual connections and layer normalization at every sub-layer, forming the pattern $\text{LayerNorm}(x + \text{Sublayer}(x))$ throughout the network. These residual connections serve multiple purposes: they provide direct gradient pathways that enable training of very deep networks (the original transformer uses 6 layers, but modern variants scale to 96 layers in GPT-3), they allow the model to learn incremental refinements rather than complete transformations at each layer, and they stabilize training by preventing the exploding or vanishing gradient problems that can occur in deep networks. Layer normalization, applied after each residual connection, normalizes activations across the feature dimension, ensuring stable activation distributions throughout the network regardless of batch size.
The position-wise feed-forward network, applied after each attention layer, provides additional representational capacity through a simple two-layer network with a ReLU or GELU activation. This network is applied independently to each position, meaning it doesn't mix information across positions (unlike attention). The feed-forward network typically expands the representation to a higher dimension (usually $4 \times d_{\text{model}}$) before projecting back down, creating a bottleneck architecture that encourages the model to learn compressed representations. For BERT-base with $d_{\text{model}} = 768$, the feed-forward network expands to $d_{ff} = 3072$ dimensions, and this expansion-projection accounts for approximately two-thirds of the parameters in each transformer layer.
Transformer Encoder
Single Encoder Layer
A transformer encoder layer consists of two main sub-layers: multi-head self-attention followed by a position-wise feed-forward network, with residual connections and layer normalization applied around each sub-layer. This architecture enables the encoder to build increasingly sophisticated representations of the input sequence as information flows through multiple layers. The self-attention mechanism allows each position to gather information from all other positions, creating contextualized representations where the meaning of each token depends on its surrounding context. The feed-forward network then processes each position independently, applying a non-linear transformation that enhances the model's representational capacity.
The residual connections are crucial for enabling gradient flow through deep networks. Without them, gradients would need to flow through multiple attention and feed-forward layers, potentially vanishing or exploding. With residual connections, gradients have a direct path from the output back to the input of each layer, ensuring stable training even for very deep transformers. The layer normalization, applied after adding the residual, normalizes the activations across the feature dimension, maintaining stable activation distributions throughout the network. This combination of residual connections and layer normalization is what enables transformers to scale to dozens or even hundreds of layers.
Step 1: Multi-Head Self-Attention
$\vh^{(1)} = \text{LayerNorm}\big(\mX + \text{MultiHead}(\mX, \mX, \mX)\big)$
Step 2: Position-wise Feed-Forward
The feed-forward network is defined as:
$\text{FFN}(\vx) = \max(0, \vx \mW_1 + \vb_1)\mW_2 + \vb_2$
and the second sub-layer output is $\vh^{(2)} = \text{LayerNorm}\big(\vh^{(1)} + \text{FFN}(\vh^{(1)})\big)$, as shown below.
\begin{tikzpicture}
\node[block] (input) at (0,0) {Input $\mX$};
\node[operation] (mha) at (0,-1.5) {Multi-Head Self-Attention};
\node[operation] (add1) at (0,-2.5) {Add \& LayerNorm};
\node[operation] (ffn) at (0,-4) {Feed-Forward Network};
\node[operation] (add2) at (0,-5) {Add \& LayerNorm};
\node[block] (output) at (0,-6.5) {Output};
\draw[arrow] (input) -- (mha); \draw[arrow] (mha) -- (add1); \draw[arrow] (add1) -- (ffn); \draw[arrow] (ffn) -- (add2); \draw[arrow] (add2) -- (output);
\draw[residual] (input.east) -- ++(1,0) |- (add1.east); \draw[residual] (add1.east) -- ++(1,0) |- (add2.east);
\end{tikzpicture}
The dimension tracking through an encoder layer reveals important properties about memory consumption and computational cost. The input $\mX \in \R^{B \times n \times d_{\text{model}}}$ is first projected to queries, keys, and values, each with shape $\R^{B \times n \times d_{\text{model}}}$. For multi-head attention with $h$ heads, these are reshaped to $\R^{B \times h \times n \times d_k}$ where $d_k = d_{\text{model}}/h$. The attention scores form a matrix $\R^{B \times h \times n \times n}$, and this quadratic term in sequence length is what dominates memory consumption for long sequences. After attention, the output is projected back to $\R^{B \times n \times d_{\text{model}}}$, added to the residual, and normalized.
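The reshaping described above can be traced explicitly. The following sketch uses deliberately small, hypothetical sizes (not BERT-base) so the shapes are easy to follow; the projections are treated as identity maps since only shapes are being tracked.

```python
import numpy as np

# Shape tracking through multi-head attention with small illustrative sizes.
B, n, d_model, h = 2, 16, 64, 8
d_k = d_model // h                                          # 8

X = np.zeros((B, n, d_model))
# Treat the Q, K, V projections as identity: only shapes matter here.
Q = K = V = X.reshape(B, n, h, d_k).transpose(0, 2, 1, 3)   # (B, h, n, d_k)
scores = Q @ K.transpose(0, 1, 3, 2)                        # (B, h, n, n): quadratic in n
context = scores @ V                                        # (B, h, n, d_k)
out = context.transpose(0, 2, 1, 3).reshape(B, n, d_model)  # back to (B, n, d_model)
```

The $(B, h, n, n)$ score tensor is the term that dominates memory for long sequences, exactly as computed in the text.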
The feed-forward network then expands each position's representation from $d_{\text{model}}$ to $d_{ff}$ dimensions before projecting back down. For BERT-base with $d_{\text{model}} = 768$ and $d_{ff} = 3072$, this means each position's representation temporarily expands to 4× its original size. This expansion creates a bottleneck that forces the model to learn compressed representations, similar to the hidden layer in an autoencoder. The intermediate activations $\R^{B \times n \times d_{ff}}$ consume significant memory during training: for batch size 32 and sequence length 512, this amounts to $32 \times 512 \times 3072 \times 4 = 201$ MB per layer in FP32, and with 12 layers in BERT-base, the feed-forward activations alone consume 2.4 GB of GPU memory.
The attention scores matrix $\R^{32 \times 12 \times 512 \times 512}$ requires 402~MB in FP32---the quadratic scaling means doubling sequence length to 1024 would require 1.6~GB per layer for attention scores alone.
Each encoder layer contains 7,084,800 parameters, with the feed-forward network ($4.7$M) contributing roughly twice the attention mechanism ($2.4$M). The per-layer activation memory totals $\approx$704~MB, dominated by attention scores. For all 12 layers, activations alone consume $\approx$8.4~GB. See Section~[ref] for the complete parameter, FLOPs, and memory analysis.
Complete Encoder Stack
The complete transformer encoder stacks $N$ identical encoder layers, with each layer's output serving as input to the next layer. This stacking enables the model to build increasingly abstract representations: early layers might capture local syntactic patterns, middle layers might identify semantic relationships, and later layers might encode task-specific features. The depth of the network is crucial for performance: BERT-base uses 12 layers, BERT-large uses 24 layers, and GPT-3 uses 96 layers. However, deeper networks require more careful optimization, including learning rate warmup, gradient clipping, and appropriate weight initialization.
Then apply $N$ encoder layers sequentially:
$\mX^{(\ell)} = \text{EncoderLayer}_{\ell}(\mX^{(\ell-1)}), \quad \ell = 1, \ldots, N$
where $\mX^{(0)}$ is the embedded input sequence (token embeddings plus positional encodings).
The final encoder output $\mX^{(N)} \in \R^{B \times n \times d_{\text{model}}}$ contains contextualized representations of the input sequence.
The sequential application of encoder layers means that information flows through $N$ attention operations, allowing each token to indirectly attend to all other tokens through multiple hops. In a 12-layer encoder, information can propagate across the entire sequence through 12 levels of attention, enabling the model to capture very long-range dependencies. However, this sequential stacking also means that encoder layers cannot be parallelized: layer $\ell$ must wait for layer $\ell-1$ to complete. The parallelization in transformers occurs within each layer (across batch and sequence dimensions), not across layers.
Position-wise Feed-Forward Networks
The position-wise feed-forward network represents the second major component of each transformer layer, complementing the attention mechanism with additional non-linear transformations. While attention allows positions to exchange information and build contextualized representations, the feed-forward network processes each position independently, applying the same learned transformation to every position in the sequence. This independence is what makes it "position-wise": the network applied to position $i$ is identical to the network applied to position $j$, with no parameter sharing or information flow between positions.
The feed-forward network consists of two linear transformations with a non-linear activation function in between, forming a simple two-layer neural network. The first layer expands the representation from $d_{\text{model}}$ dimensions to a larger dimension $d_{ff}$ (typically $4 \times d_{\text{model}}$), applies an activation function, and then the second layer projects back down to $d_{\text{model}}$ dimensions. This expansion-and-contraction creates a bottleneck architecture similar to an autoencoder, forcing the model to learn compressed representations that capture the most important features. The expansion factor of 4× is a design choice from the original transformer paper that has been widely adopted, though some recent models experiment with different ratios.
For a sequence $\mX \in \R^{B \times n \times d_{\text{model}}}$, apply to each position independently:
$\text{FFN}(\vx_i) = \max(0, \vx_i \mW_1 + \vb_1)\mW_2 + \vb_2, \quad i = 1, \ldots, n$
The output maintains the same shape as the input: $\R^{B \times n \times d_{\text{model}}}$.
The term "position-wise" emphasizes a crucial distinction from the attention mechanism. In attention, every position attends to every other position, creating $O(n^2)$ interactions. In the feed-forward network, each position is processed completely independently, creating only $O(n)$ operations. This means the feed-forward network is embarrassingly parallelâall $n$ positions can be processed simultaneously with no dependencies. In practice, this is implemented as a single matrix multiplication: the input $\mX \in \R^{B \times n \times d_{\text{model}}}$ is reshaped to $\R^{Bn \times d_{\text{model}}}$, multiplied by $\mW_1$, activated, multiplied by $\mW_2$, and reshaped back to $\R^{B \times n \times d_{\text{model}}}$.
The choice of activation function significantly impacts model performance and training dynamics. The original transformer used ReLU activation, which is simple and computationally efficient but can suffer from "dying ReLU" problems where neurons become permanently inactive. BERT and GPT introduced the GELU (Gaussian Error Linear Unit) activation, which provides a smoother, probabilistic alternative to ReLU. GELU is defined as $\text{GELU}(x) = x \cdot \Phi(x)$ where $\Phi(x)$ is the cumulative distribution function of the standard normal distribution. In practice, GELU is approximated as $\text{GELU}(x) \approx 0.5x(1 + \tanh[\sqrt{2/\pi}(x + 0.044715x^3)])$. Empirically, GELU tends to provide slightly better performance than ReLU for transformer models, though the difference is often small.
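The two GELU forms given above can be compared numerically. The sketch below evaluates the exact definition (via the error function, since $\Phi(x) = \tfrac{1}{2}(1 + \text{erf}(x/\sqrt{2}))$) against the tanh approximation; function names are illustrative.

```python
import math
import numpy as np

def gelu_exact(x):
    # GELU(x) = x * Phi(x), with Phi the standard normal CDF.
    return x * 0.5 * (1.0 + np.vectorize(math.erf)(x / math.sqrt(2.0)))

def gelu_tanh(x):
    # The tanh approximation quoted in the text.
    return 0.5 * x * (1.0 + np.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x**3)))

xs = np.linspace(-4.0, 4.0, 101)
max_err = np.abs(gelu_exact(xs) - gelu_tanh(xs)).max()
```

The approximation tracks the exact function closely over the typical activation range, which is why it is widely used in practice.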
The feed-forward network accounts for a substantial portion of the model's parameters and computational cost. For BERT-base with $d_{\text{model}} = 768$ and $d_{ff} = 3072$, each feed-forward network contains $768 \times 3072 + 3072 \times 768 = 4.7$M parameters, compared to $4 \times 768^2 = 2.4$M parameters in the attention mechanism. This means approximately two-thirds of each layer's parameters are in the feed-forward network. Similarly, for short sequences where $n < 2d_{\text{model}}$, the feed-forward network dominates computational cost. For BERT-base with sequence length 512, the feed-forward network requires $2 \times 512 \times 768 \times 3072 = 2.4$ GFLOPs per layer (counting multiplications), while attention requires $4 \times 512 \times 768^2 + 2 \times 512^2 \times 768 = 1.6$ GFLOPs. The crossover point occurs around $n = 2d_{\text{model}}$: for longer sequences, attention dominates; for shorter sequences, the feed-forward network dominates.
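The crossover arithmetic can be checked with a short script. This sketch counts multiplications only (one FLOP per multiply, batch size 1), so absolute numbers depend on that convention; the crossover at $n = 2d_{\text{model}}$ does not.

```python
# FLOP counts per layer, counting multiplications (batch size 1).
def ffn_flops(n, d_model, d_ff):
    return 2 * n * d_model * d_ff          # two matmuls: d_model -> d_ff -> d_model

def attn_flops(n, d_model):
    proj = 4 * n * d_model**2              # Q, K, V, and output projections
    scores = 2 * n**2 * d_model            # QK^T plus attention-times-V
    return proj + scores

n, d = 512, 768
ffn_g = ffn_flops(n, d, 4 * d) / 1e9       # d_ff = 4 * d_model
attn_g = attn_flops(n, d) / 1e9
# With d_ff = 4*d, setting 8*n*d^2 = 4*n*d^2 + 2*n^2*d gives n = 2*d_model.
```

At $n = 512 < 2 \times 768$ the feed-forward term is the larger of the two, matching the discussion above.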
Dimension tracking: $\R^{B \times n \times d_{\text{model}}} \xrightarrow{\mW_1} \R^{B \times n \times d_{ff}} \xrightarrow{\mW_2} \R^{B \times n \times d_{\text{model}}}$
Memory requirements:
The intermediate activations at dimension $d_{ff} = 3072$ consume 4× the memory of the input/output activations at dimension $d_{\text{model}} = 768$. For a 12-layer BERT model, the feed-forward intermediate activations across all layers consume $12 \times 201.3 = 2.4$ GB of memory during training. This is why gradient checkpointing, which recomputes these activations during the backward pass instead of storing them, can significantly reduce memory consumption.
Computational cost: the two matrix multiplications require $2 \times B \times n \times d_{\text{model}} \times d_{ff}$ multiplications. For batch size 32 and sequence length 512 with BERT-base dimensions, this is $2 \times 32 \times 512 \times 768 \times 3072 \approx 77.3$ GFLOPs per layer.
For comparison, the attention mechanism in the same layer requires approximately $51.5$ GFLOPs (including Q, K, V projections, attention computation, and output projection). This means the feed-forward network accounts for 60\% of the computational cost per layer for this configuration.
Alternative activation functions: While ReLU and GELU are most common, other activation functions have been explored for transformers. The Swish activation $\text{Swish}(x) = x \cdot \sigma(\beta x)$ where $\sigma$ is the sigmoid function, provides similar properties to GELU. The GLU (Gated Linear Unit) family, including $\text{GLU}(x) = (x \mW_1) \odot \sigma(x \mW_2)$, uses gating mechanisms similar to LSTMs. Recent work has also explored learned activation functions that adapt during training. However, GELU remains the most widely adopted choice for modern transformers due to its balance of performance and computational efficiency.
Transformer Decoder
Single Decoder Layer
The transformer decoder extends the encoder architecture with an additional cross-attention mechanism that allows the decoder to attend to the encoder's output. While the encoder uses only self-attention to build contextualized representations of the input, the decoder must perform three distinct operations: masked self-attention on the target sequence, cross-attention to the source sequence, and position-wise feed-forward transformation. This three-sublayer structure enables the decoder to generate output sequences that are conditioned on both the previously generated tokens and the encoded input sequence.
The masked self-attention in the decoder is crucial for maintaining the autoregressive property during training. Unlike the encoder's bidirectional self-attention where each position can attend to all positions, the decoder's self-attention must be causalâposition $i$ can only attend to positions $j \leq i$. This masking ensures that the model cannot "cheat" by looking at future tokens during training. Without this mask, the model could simply copy the target sequence during training without learning to generate it. The mask is implemented by setting attention scores for future positions to $-\infty$ before the softmax, ensuring they receive zero attention weight.
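The masking step described above (set future scores to $-\infty$ before the softmax) is easy to implement. This is a minimal sketch with an illustrative function name.

```python
import numpy as np

def causal_attention_weights(scores):
    """Apply the causal mask: position i may only attend to positions j <= i.

    scores: (m, m) raw attention logits for decoder self-attention."""
    m = scores.shape[-1]
    mask = np.triu(np.ones((m, m), dtype=bool), k=1)   # True strictly above diagonal
    masked = np.where(mask, -np.inf, scores)           # future positions -> -inf
    w = np.exp(masked - masked.max(axis=-1, keepdims=True))
    return w / w.sum(axis=-1, keepdims=True)           # rows still sum to 1

W = causal_attention_weights(np.random.default_rng(2).standard_normal((5, 5)))
```

After the softmax, every entry above the diagonal is exactly zero: future tokens receive no attention weight, so the autoregressive property holds even when all positions are computed in parallel.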
The cross-attention mechanism is where the decoder actually uses information from the encoder. In cross-attention, the queries come from the decoder's hidden states (representing "what information do I need?"), while the keys and values come from the encoder's output (representing "what information is available from the source?"). This asymmetry allows the decoder to selectively focus on relevant parts of the source sequence when generating each target token. For machine translation, this might mean attending to the source word being translated; for summarization, it might mean attending to the most salient sentences in the document.
Step 1: Masked Self-Attention
$\vh^{(1)} = \text{LayerNorm}\big(\mY + \text{MaskedMultiHead}(\mY, \mY, \mY)\big)$
Step 2: Cross-Attention to Encoder
$\vh^{(2)} = \text{LayerNorm}\big(\vh^{(1)} + \text{MultiHead}(\vh^{(1)}, \mX_{\text{enc}}, \mX_{\text{enc}})\big)$
Step 3: Feed-Forward
$\vh^{(3)} = \text{LayerNorm}\big(\vh^{(2)} + \text{FFN}(\vh^{(2)})\big)$
where $\mY \in \R^{B \times m \times d_{\text{model}}}$ is the embedded target sequence and the arguments of $\text{MultiHead}$ are (queries, keys, values).
The output $\vh^{(3)} \in \R^{B \times m \times d_{\text{model}}}$ maintains the target sequence length $m$.
The dimension compatibility in cross-attention deserves careful attention. The decoder hidden states $\vh^{(1)} \in \R^{B \times m \times d_{\text{model}}}$ are projected to queries $\mQ \in \R^{B \times m \times d_{\text{model}}}$, while the encoder output $\mX_{\text{enc}} \in \R^{B \times n \times d_{\text{model}}}$ is projected to keys $\mK \in \R^{B \times n \times d_{\text{model}}}$ and values $\mV \in \R^{B \times n \times d_{\text{model}}}$. The attention scores are computed as $\mQ \mK^T \in \R^{B \times m \times n}$, creating a rectangular attention matrix where each of the $m$ target positions attends to all $n$ source positions. This is different from self-attention where the attention matrix is square ($n \times n$ for encoder, $m \times m$ for decoder self-attention).
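The rectangular attention pattern described above can be traced with a small sketch (small hypothetical sizes, identity projections, since only shapes are of interest here).

```python
import numpy as np

# Cross-attention shapes: m target positions attend to n source positions.
B, m, n, d = 2, 5, 9, 16
dec_h = np.random.default_rng(3).standard_normal((B, m, d))    # decoder states -> queries
enc_out = np.random.default_rng(4).standard_normal((B, n, d))  # encoder output -> keys, values
Q, K, V = dec_h, enc_out, enc_out          # identity projections, for shape tracking only
scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d)   # (B, m, n): rectangular
w = np.exp(scores - scores.max(-1, keepdims=True))
w /= w.sum(-1, keepdims=True)                    # each target row sums to 1 over sources
ctx = w @ V                                      # (B, m, d): target length preserved
```

No causal mask appears here: every target position may look at the entire source sequence, which is already fully available.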
The causal mask in decoder self-attention is implemented as a lower-triangular matrix. For a sequence of length $m = 5$, the mask looks like:
$\mM = \begin{pmatrix} 1 & 0 & 0 & 0 & 0 \\ 1 & 1 & 0 & 0 & 0 \\ 1 & 1 & 1 & 0 & 0 \\ 1 & 1 & 1 & 1 & 0 \\ 1 & 1 & 1 & 1 & 1 \end{pmatrix}$
where entry $(i, j) = 1$ allows position $i$ to attend to position $j$, and positions with $0$ have their scores set to $-\infty$ before the softmax.
Inputs:
Masked Self-Attention:
The attention scores matrix $\R^{32 \times 12 \times 15 \times 15}$ requires $32 \times 12 \times 15 \times 15 \times 4$ bytes $\approx 0.35$ MB in FP32. This is much smaller than encoder self-attention because the target sequence is shorter than the source sequence in this example.
Cross-Attention:
The cross-attention scores $\R^{32 \times 12 \times 15 \times 20}$ require $32 \times 12 \times 15 \times 20 \times 4$ bytes $\approx 0.46$ MB in FP32. Notice this is rectangular: 15 target positions attending to 20 source positions.
Feed-Forward Network:
The intermediate activations require $32 \times 15 \times 3072 \times 4$ bytes $\approx 5.9$ MB in FP32.
Complete Decoder Stack
The complete decoder stacks $N$ decoder layers, with each layer attending to both the previous decoder layer's output and the encoder's final output. This stacking enables the decoder to build increasingly sophisticated representations of the target sequence, conditioned on the source sequence. The encoder output $\mX_{\text{enc}}$ is reused by every decoder layer: it's computed once by the encoder and then fed into all $N$ decoder layers. This means the encoder output must be stored in memory throughout the decoder's computation, contributing to memory requirements.
Then apply $N$ decoder layers sequentially, each attending to the encoder output:
$\mY^{(\ell)} = \text{DecoderLayer}_{\ell}(\mY^{(\ell-1)}, \mX_{\text{enc}}), \quad \ell = 1, \ldots, N$
The final decoder output $\mY^{(N)} \in \R^{B \times m \times d_{\text{model}}}$ is projected to vocabulary logits:
$\text{logits} = \mY^{(N)} \mW_{\text{vocab}} \in \R^{B \times m \times V}$
where $V$ is the vocabulary size.
During training, the entire target sequence is processed in parallel using teacher forcingâthe model receives the ground-truth previous tokens rather than its own predictions. The causal mask ensures that position $i$ cannot attend to future positions, maintaining the autoregressive property even though all positions are computed simultaneously. This parallel training is a major advantage over RNN decoders, which must process the target sequence sequentially even during training.
During inference, however, the decoder must generate tokens autoregressively, one at a time. At step $t$, the decoder has generated tokens $y_1, \ldots, y_{t-1}$ and must predict $y_t$. This requires running the decoder with input sequence length $t-1$, computing attention over all previously generated tokens. For a target sequence of length $m$, this requires $m$ forward passes through the decoder, making inference much slower than training. This is why techniques like KV caching (storing computed key and value projections) are crucial for efficient inference.
Masked self-attention:
Cross-attention:
Feed-forward network:
Layer normalization (3 instances):
Total per decoder layer: $2{,}359{,}296 + 2{,}359{,}296 + 4{,}722{,}432 + 4{,}608 = 9{,}445{,}632$ parameters
This is approximately 33\% more parameters than an encoder layer ($9.4$M vs $7.1$M) due to the additional cross-attention mechanism. For a 6-layer decoder, this amounts to $6 \times 9{,}445{,}632 = 56.7$M parameters, compared to $6 \times 7{,}084{,}800 = 42.5$M for a 6-layer encoder.
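The per-layer counts above can be reproduced in a few lines. The conventions follow the text: attention projection matrices are counted without biases, the feed-forward network includes its biases, and each LayerNorm contributes a scale and shift per feature.

```python
# Per-layer parameter counts for BERT-base-sized layers (d_model=768, d_ff=3072).
d_model, d_ff = 768, 3072

attn = 4 * d_model * d_model                            # Q, K, V, output projections
ffn = d_model * d_ff + d_ff + d_ff * d_model + d_model  # two layers plus biases
ln = 2 * d_model                                        # gamma and beta

encoder_layer = attn + ffn + 2 * ln        # self-attention + FFN + 2 LayerNorms
decoder_layer = 2 * attn + ffn + 3 * ln    # adds cross-attention + a 3rd LayerNorm
```

Running this reproduces 7,084,800 parameters per encoder layer and 9,445,632 per decoder layer, the 33% gap discussed above.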
Without KV caching:
- Process sequence of length 100
- Compute Q, K, V for all 100 positions
- Compute attention scores $\R^{100 \times 100}$
- Total: 100 forward passes through decoder, each processing increasing sequence lengths
With KV caching:
- Store K, V from previous steps: $\R^{99 \times 768}$ per layer
- At step 100, compute only Q for new position: $\R^{1 \times 768}$
- Concatenate with cached K, V: $\R^{100 \times 768}$
- Compute attention scores $\R^{1 \times 100}$ (only for new position)
- Total: 100 forward passes, but each processes only 1 new position
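The steps above can be sketched for a single attention head, verifying that incremental decoding with a KV cache produces exactly the same outputs as full causal attention over the whole sequence. Names and sizes are illustrative, not from the chapter.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(5)
d = 8
Wq, Wk, Wv = (rng.standard_normal((d, d)) * d**-0.5 for _ in range(3))
tokens = rng.standard_normal((6, d))   # six decoding steps

# Incremental decoding with a KV cache: at each step, project only the
# new token and append its key and value to the cache.
K_cache, V_cache, cached_out = np.zeros((0, d)), np.zeros((0, d)), []
for t in range(len(tokens)):
    x = tokens[t:t + 1]                      # (1, d): only the new position
    K_cache = np.vstack([K_cache, x @ Wk])   # (t+1, d)
    V_cache = np.vstack([V_cache, x @ Wv])
    att = softmax((x @ Wq) @ K_cache.T / np.sqrt(d))   # (1, t+1)
    cached_out.append(att @ V_cache)
cached_out = np.vstack(cached_out)

# Reference: full causal attention over the whole sequence at once.
Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv
S = Q @ K.T / np.sqrt(d)
S[np.triu_indices(len(tokens), k=1)] = -np.inf   # causal mask
full_out = softmax(S) @ V
```

The two computations agree exactly; the cache simply avoids re-projecting the old positions at every step.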
For BERT-base dimensions with 12 decoder layers, the KV cache requires: $12 \text{ layers} \times 2 \text{ tensors} \times 100 \times 768 \times 4 \text{ bytes} = 7.4$ MB for 100 generated tokens.
This modest memory cost (7.4 MB for 100 tokens) enables approximately 50× speedup in generation, reducing generation time from several seconds to tens of milliseconds for typical sequences.
Computational Analysis
The computational complexity of transformers involves attention ($O(n^2 d)$ FLOPs) and feed-forward layers ($O(nd^2)$ FLOPs), with attention dominating for long sequences and feed-forward layers dominating for large model dimensions. Memory requirements include model parameters, optimizer states, activations (scaling linearly with batch size), and attention matrices (scaling quadratically with sequence length). A detailed computational analysis including FLOPs counting, memory budgets, and inference optimization is provided in Chapter~12.
Complete Transformer Architecture
Full Encoder-Decoder Model
Original Transformer Configuration
"Attention is All You Need" base model:
- Encoder layers: $N_{\text{enc}} = 6$
- Decoder layers: $N_{\text{dec}} = 6$
- Model dimension: $d_{\text{model}} = 512$
- Attention heads: $h = 8$
- Feed-forward dimension: $d_{ff} = 2048$
- Dropout rate: $p = 0.1$
Parameter count: approximately 65M parameters in total; the exercise solutions at the end of this chapter work through the encoder's share of this count in detail.
Residual Connections and Layer Normalization
Residual Connections
Residual connections, also known as skip connections, are fundamental to enabling the training of deep transformer networks. Without residual connections, gradients would need to flow through dozens of attention and feed-forward layers during backpropagation, leading to vanishing or exploding gradients that make optimization extremely difficult. The residual connection provides a direct path from each layer's output back to its input, allowing gradients to flow unimpeded through the network. This gradient highway ensures that even the earliest layers receive meaningful gradient signals, enabling effective training of networks with 96 layers (GPT-3) or more.
The residual connection pattern in transformers follows the post-addition layer normalization structure: $\text{LayerNorm}(x + \text{Sublayer}(x))$. This means the sublayer's output is added to its input before normalization. The addition operation has a gradient of 1 with respect to both operands, so during backpropagation, gradients flow both through the sublayer (learning to refine representations) and directly through the residual connection (providing a gradient highway). This dual path enables the network to learn both identity mappings (when the sublayer output is near zero) and complex transformations (when the sublayer output is large).
The residual connection also enables the network to learn incrementally. Early in training, the sublayer outputs are typically small due to weight initialization, so the network effectively starts as a near-identity function. As training progresses, the sublayers learn to make increasingly sophisticated transformations, building on the representations from previous layers. This incremental learning is much more stable than trying to learn the complete transformation from scratch. For a 12-layer BERT model, each layer can focus on learning a small refinement rather than a complete transformation, making optimization tractable.
Layer Normalization
Layer normalization stabilizes training by normalizing activations across the feature dimension, ensuring that each layer receives inputs with consistent statistics regardless of how previous layers' parameters change during training. Unlike batch normalization, which normalizes across the batch dimension and is commonly used in convolutional networks, layer normalization normalizes across features for each example independently. This independence from batch size is crucial for transformers, which often use small batch sizes during inference or fine-tuning, and for handling variable-length sequences where batch normalization's statistics would be unreliable.
For a batch of sequences $\mX \in \R^{B \times n \times d}$, layer normalization is applied independently to each of the $B \times n$ vectors, normalizing across the $d$ features.
The learned parameters $\gamma$ and $\beta$ allow the network to undo the normalization if beneficial. If $\gamma_i = \sqrt{\sigma^2 + \epsilon}$ and $\beta_i = \mu$, the normalization is completely undone. In practice, the network learns appropriate values that balance normalization's stabilizing effect with the flexibility to learn arbitrary distributions.
Layer normalization differs fundamentally from batch normalization in its normalization dimension. Batch normalization computes statistics across the batch dimension (normalizing each feature across all examples in the batch), making it dependent on batch size and batch composition. Layer normalization computes statistics across the feature dimension (normalizing all features for each example independently), making it independent of batch size. For transformers processing variable-length sequences with potentially small batch sizes, this independence is essential. A batch size of 1 works perfectly with layer normalization but would be problematic for batch normalization.
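A minimal implementation makes the batch-size independence concrete: the sketch below (illustrative names) normalizes over the feature axis only, so it works identically for a batch of one.

```python
import numpy as np

def layer_norm(X, gamma, beta, eps=1e-5):
    """Normalize over the last (feature) axis, independently per position.

    Statistics never involve the batch axis, so any batch size works."""
    mu = X.mean(axis=-1, keepdims=True)
    var = X.var(axis=-1, keepdims=True)
    return gamma * (X - mu) / np.sqrt(var + eps) + beta

B, n, d = 1, 3, 768   # batch size 1 is perfectly fine for layer norm
X = np.random.default_rng(6).standard_normal((B, n, d)) * 5 + 2
Y = layer_norm(X, np.ones(d), np.zeros(d))
```

With $\gamma = 1$ and $\beta = 0$ (the usual initialization), each of the $B \times n$ vectors comes out with mean approximately 0 and variance approximately 1 across its 768 features.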
Pre-Norm vs Post-Norm
The placement of layer normalization relative to the residual connection significantly impacts training dynamics. The original transformer paper used post-norm: $\text{LayerNorm}(x + \text{Sublayer}(x))$, where normalization is applied after adding the residual. More recent models like GPT-2 and GPT-3 use pre-norm: $x + \text{Sublayer}(\text{LayerNorm}(x))$, where normalization is applied before the sublayer, and the residual connection bypasses normalization entirely.
Post-norm architecture normalizes the sum of the input and sublayer output, which can help prevent activation magnitudes from growing unboundedly as depth increases. However, post-norm requires careful learning rate warmup and can be unstable for very deep networks. The gradients must flow through the layer normalization operation, which can introduce additional numerical instabilities. BERT uses post-norm with 12-24 layers successfully, but scaling to 96+ layers becomes challenging.
Pre-norm architecture applies normalization before each sublayer, so the sublayer receives normalized inputs. The residual connection then adds the sublayer output directly to the (unnormalized) input, bypassing the normalization. This provides a cleaner gradient path through the residual connection and tends to be more stable for very deep networks. GPT-2 and GPT-3 use pre-norm, enabling training of 48-96 layer models without learning rate warmup. The trade-off is that pre-norm may achieve slightly lower final performance than post-norm for shallow networks, but this difference diminishes for deeper networks where pre-norm's stability advantages dominate.
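The two orderings differ by one line of code. In this sketch, `sublayer` is a hypothetical stand-in for attention or the FFN, and `ln` is an unparameterized layer norm; only the placement differs between the two variants.

```python
import numpy as np

def sublayer(x):
    return np.tanh(x)   # stand-in for attention or FFN (hypothetical)

def ln(x):
    mu, sd = x.mean(-1, keepdims=True), x.std(-1, keepdims=True)
    return (x - mu) / (sd + 1e-5)

x = np.random.default_rng(7).standard_normal((4, 8))
post_norm = ln(x + sublayer(x))    # original transformer: normalize after residual
pre_norm = x + sublayer(ln(x))     # GPT-2/GPT-3 style: residual bypasses the norm
```

In the pre-norm form the identity path from input to output never passes through a normalization, which is the cleaner gradient highway discussed above.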
Input: $\vx = [0.5, -0.3, 1.2, \ldots]$ (768 values)
Compute statistics: $\mu = \frac{1}{768} \sum_{i=1}^{768} x_i, \quad \sigma^2 = \frac{1}{768} \sum_{i=1}^{768} (x_i - \mu)^2$
Normalize: $\hat{x}_i = \dfrac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}$
The normalized values $\hat{\vx}$ have mean 0 and variance 1 across the 768 dimensions.
Apply learned affine transformation: $y_i = \gamma_i \hat{x}_i + \beta_i$
where $\gamma, \beta \in \R^{768}$ are learned during training. Initially, $\gamma$ is typically initialized to 1 and $\beta$ to 0, making layer normalization initially act as pure normalization.
Memory and computation:
- Parameters: $2 \times 768 = 1{,}536$ (scale and shift)
- FLOPs per position: $\approx 10 \times 768 = 7{,}680$ (mean, variance, normalize, scale, shift)
- For batch 32, sequence 512: $32 \times 512 \times 7{,}680 = 126$ MFLOPs
Layer normalization is computationally cheap compared to attention or feed-forward networks, but it's memory-bound rather than compute-bound, so kernel fusion with adjacent operations is important for efficiency.
Training Objectives
Sequence-to-Sequence Training
For machine translation, minimize the cross-entropy loss over target tokens:
$\mathcal{L} = -\sum_{t=1}^{m} \log P(y_t \mid y_{<t}, \mathbf{x})$
Teacher forcing: During training, use ground-truth previous tokens $y_{<t}$ as decoder input rather than the model's own predictions.
Autoregressive Generation
Encode source: $\mX_{\text{enc}} = \text{Encoder}(\mathbf{x})$
The original transformer uses both an encoder and decoder, but subsequent research established three main architectural patterns, each suited to different task families.
Transformer Variants: Architectural Patterns
Recent trends favor decoder-only architectures for their versatility and scaling properties, though encoder-only models remain more parameter-efficient for understanding tasks and encoder-decoder models remain strongest for sequence-to-sequence tasks. Detailed coverage of each variant's architecture, pre-training, and fine-tuning follows in Chapters~13--15.
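The sequence-to-sequence training objective described above can be illustrated with toy numbers (the probabilities below are made up for the example; a real model would produce them with a softmax over the vocabulary):

```python
import math

# Hypothetical per-step probabilities p(y_t | y_<t, x) that a model
# assigns to the three ground-truth target tokens of one sentence.
p_correct = [0.7, 0.4, 0.9]

# Cross-entropy loss: negative log-likelihood of the target sequence.
loss = -sum(math.log(p) for p in p_correct)
print(f"sequence loss = {loss:.4f}")

# With teacher forcing, all three terms are computed in parallel from
# ground-truth prefixes; autoregressive generation would instead need
# three sequential forward passes.
```

Note how the least confident prediction (0.4) dominates the loss: cross-entropy heavily penalizes low probability on the correct token.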
Exercises
Exercise 1: For the original transformer configuration ($N=6$, $d_{\text{model}}=512$, $h=8$, $d_{ff}=2048$, $V=32{,}000$):
- Calculate total parameters in the encoder
- Calculate total parameters in the decoder
- What percentage are in embeddings vs transformer layers?
- How does this change if the vocabulary increases to 50,000?
Exercise 2: Implement a transformer encoder layer in PyTorch and verify output shapes and gradient flow through the residual connections.
Exercise 3: Compare the memory and computation requirements of:
- The encoder processing a sequence of length 1024
- The decoder generating 1024 tokens autoregressively
Solutions
Full solutions for all exercises are available at \url{https://deeplearning.hofkensvermeulen.be}.
Given: $N=6$, $d_{\text{model}}=512$, $h=8$, $d_{ff}=2048$, $V=32000$
Part (a): Encoder Parameters
For each encoder layer:
- Multi-head attention:
- Query, Key, Value projections: $3 \times d_{\text{model}} \times d_{\text{model}} = 3 \times 512 \times 512 = 786{,}432$
- Output projection: $d_{\text{model}} \times d_{\text{model}} = 512 \times 512 = 262{,}144$
- Total attention: $786{,}432 + 262{,}144 = 1{,}048{,}576$
- Feed-forward network:
- First layer: $d_{\text{model}} \times d_{ff} = 512 \times 2048 = 1{,}048{,}576$
- Second layer: $d_{ff} \times d_{\text{model}} = 2048 \times 512 = 1{,}048{,}576$
- Biases: $d_{ff} + d_{\text{model}} = 2048 + 512 = 2{,}560$
- Total FFN: $2{,}099{,}712$
- Layer normalization (2 instances):
- Parameters per LayerNorm: $2 \times d_{\text{model}} = 2 \times 512 = 1{,}024$
- Total: $2 \times 1{,}024 = 2{,}048$
Parameters per encoder layer: $1{,}048{,}576 + 2{,}099{,}712 + 2{,}048 = 3{,}150{,}336$
Total encoder layers: $N \times 3{,}150{,}336 = 6 \times 3{,}150{,}336 = 18{,}902{,}016$
Input embedding: $V \times d_{\text{model}} = 32{,}000 \times 512 = 16{,}384{,}000$
Positional encoding (learned): $L_{\max} \times d_{\text{model}}$ (typically $5{,}000 \times 512 = 2{,}560{,}000$)
Total encoder parameters: $18{,}902{,}016 + 16{,}384{,}000 + 2{,}560{,}000 = 37{,}846{,}016$
Part (b): Decoder Parameters
Each decoder layer has:
- Masked self-attention: $1{,}048{,}576$ (same as encoder)
- Cross-attention: $1{,}048{,}576$ (Q from decoder, K,V from encoder)
- Feed-forward: $2{,}099{,}712$
- Layer normalization (3 instances): $3 \times 1{,}024 = 3{,}072$
Parameters per decoder layer: $1{,}048{,}576 + 1{,}048{,}576 + 2{,}099{,}712 + 3{,}072 = 4{,}199{,}936$
Total decoder layers: $6 \times 4{,}199{,}936 = 25{,}199{,}616$
Output embedding: shared with the input embedding via weight tying, adding $0$ parameters
Output projection (counted here as a separate, untied matrix): $d_{\text{model}} \times V = 512 \times 32{,}000 = 16{,}384{,}000$
Total decoder parameters: $25{,}199{,}616 + 16{,}384{,}000 = 41{,}583{,}616$
Part (c): Embedding vs Transformer Percentage
Total parameters: $37{,}846{,}016 + 41{,}583{,}616 = 79{,}429{,}632$
Embedding parameters: $16{,}384{,}000 + 2{,}560{,}000 + 16{,}384{,}000 = 35{,}328{,}000$
Transformer layer parameters: $18{,}902{,}016 + 25{,}199{,}616 = 44{,}101{,}632$
Percentage in embeddings: $\frac{35{,}328{,}000}{79{,}429{,}632} \times 100\% = 44.5\%$
Percentage in transformer layers: $\frac{44{,}101{,}632}{79{,}429{,}632} \times 100\% = 55.5\%$
Part (d): Vocabulary Increase to 50,000
New embedding parameters: $50{,}000 \times 512 \times 2 = 51{,}200{,}000$ (input + output)
New total: $44{,}101{,}632 + 51{,}200{,}000 + 2{,}560{,}000 = 97{,}861{,}632$
Percentage in embeddings: $\frac{53{,}760{,}000}{97{,}861{,}632} \times 100\% = 54.9\%$
The embedding percentage increases from 44.5\% to 54.9\%, showing that vocabulary size has significant impact on model size.
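The counts in Parts (a)-(d) can be reproduced with a short script (arithmetic only, mirroring the formulas above, with the same conventions: bias-free attention projections, learned positional encodings of length 5,000, and an untied output projection):

```python
def transformer_params(N=6, d_model=512, d_ff=2048, V=32_000, L_max=5_000):
    attn = 4 * d_model * d_model                  # Q, K, V, output projections
    ffn = 2 * d_model * d_ff + d_ff + d_model     # two linear layers + biases
    ln = 2 * d_model                              # gamma and beta per LayerNorm
    enc_layer = attn + ffn + 2 * ln               # self-attn + FFN + 2 LayerNorms
    dec_layer = 2 * attn + ffn + 3 * ln           # + cross-attn, 3 LayerNorms
    embeddings = V * d_model + L_max * d_model    # input embedding + positions
    out_proj = d_model * V                        # output projection (untied)
    encoder = N * enc_layer + embeddings
    decoder = N * dec_layer + out_proj
    return encoder, decoder

enc, dec = transformer_params()
print(enc, dec, enc + dec)  # → 37846016 41583616 79429632
```

Rerunning with `V=50_000` reproduces the Part (d) total of 97,861,632 parameters.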
import torch
import torch.nn as nn

class TransformerEncoderLayer(nn.Module):
    def __init__(self, d_model=256, n_heads=8, d_ff=1024, dropout=0.1):
        super().__init__()
        # Multi-head self-attention
        self.self_attn = nn.MultiheadAttention(
            d_model, n_heads, dropout=dropout, batch_first=True
        )
        # Feed-forward network
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model)
        )
        # Layer normalization
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        # Dropout
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # Self-attention with residual connection (post-norm)
        attn_output, _ = self.self_attn(x, x, x, attn_mask=mask)
        x = x + self.dropout1(attn_output)
        x = self.norm1(x)
        # Feed-forward with residual connection
        ffn_output = self.ffn(x)
        x = x + self.dropout2(ffn_output)
        x = self.norm2(x)
        return x

# Test the implementation
batch_size = 16
seq_length = 64
d_model = 256

# Create model and input
model = TransformerEncoderLayer(d_model=d_model)
x = torch.randn(batch_size, seq_length, d_model, requires_grad=True)

# Forward pass
output = model(x)

# Verify output shape
print(f"Input shape: {x.shape}")
print(f"Output shape: {output.shape}")
assert output.shape == (batch_size, seq_length, d_model), "Shape mismatch!"

# Verify gradient flow through residual connections
loss = output.sum()
loss.backward()
print(f"Input gradient norm: {x.grad.norm().item():.4f}")
print(f"Gradient exists: {x.grad is not None}")

# Check that gradients flow to all parameters
for name, param in model.named_parameters():
    if param.grad is not None:
        print(f"{name}: gradient norm = {param.grad.norm().item():.4f}")
    else:
        print(f"{name}: NO GRADIENT!")
Expected Output (shapes are exact; the gradient norms vary with the random seed and are illustrative):
Input shape: torch.Size([16, 64, 256])
Output shape: torch.Size([16, 64, 256])
Input gradient norm: 1.2345
Gradient exists: True
self_attn.in_proj_weight: gradient norm = 0.0234
self_attn.out_proj.weight: gradient norm = 0.0156
ffn.0.weight: gradient norm = 0.0189
ffn.3.weight: gradient norm = 0.0167
norm1.weight: gradient norm = 0.0045
norm2.weight: gradient norm = 0.0038
Key Observations:
- Output shape matches input shape (preserves sequence structure)
- Gradients flow to all parameters (no vanishing gradient issues)
- Residual connections ensure gradient flow even through deep networks
- Layer normalization stabilizes training
Part (a): Encoder Processing (Sequence Length 1024)
For a single forward pass through the encoder:
Memory Requirements:
- Input embeddings: $B \times L \times d_{\text{model}} = B \times 1024 \times 512$ floats
- Attention scores: $B \times h \times L \times L = B \times 8 \times 1024 \times 1024 = 8{,}388{,}608B$ floats
- Intermediate activations per layer: $\sim B \times L \times d_{ff} = B \times 1024 \times 2048$ floats
- Total per layer: $\sim 10{,}485{,}760B$ floats
- For 6 layers: $\sim 62{,}914{,}560B$ floats $\approx 240$ MiB per sample (at FP32, 4 bytes per float)
Computation:
- Attention: $O(L^2 d_{\text{model}}) = O(1024^2 \times 512) \approx 537M$ operations per layer
- Feed-forward: $O(L d_{\text{model}} d_{ff}) = O(1024 \times 512 \times 2048) \approx 1.07B$ operations per layer
- Total per layer: $\sim 1.6B$ operations
- For 6 layers: $\sim 9.6B$ operations
Number of forward passes: 1 (parallel processing of entire sequence)
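The per-sample estimates above can be checked with a few lines of arithmetic (batch size $B$ factored out; these are order-of-magnitude operation counts, not exact FLOP totals):

```python
L, d_model, d_ff, h, N = 1024, 512, 2048, 8, 6

# Activation memory per layer, in floats (per sample)
attn_scores = h * L * L        # attention score matrices
ffn_acts = L * d_ff            # FFN hidden activations
per_layer = attn_scores + ffn_acts
total_floats = N * per_layer
print(f"per layer: {per_layer:,} floats")
print(f"6 layers:  {total_floats:,} floats "
      f"= {total_floats * 4 / 2**20:.0f} MiB at FP32")

# Compute per layer
attn_ops = L * L * d_model     # attention: O(L^2 * d_model)
ffn_ops = L * d_model * d_ff   # feed-forward: O(L * d_model * d_ff)
print(f"attention ~ {attn_ops / 1e6:.0f}M ops, "
      f"FFN ~ {ffn_ops / 1e9:.2f}B ops per layer")
```

Running this reproduces the 10,485,760 floats per layer, 240 MiB total, and the ~537M / ~1.07B operation counts quoted above.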
Part (b): Decoder Generating 1024 Tokens
For autoregressive generation:
Memory Requirements (per step $t$):
- Decoder input: $B \times t \times d_{\text{model}}$ (grows with each step)
- Masked attention scores: $B \times h \times t \times t$ (grows quadratically)
- Cross-attention: $B \times h \times t \times 1024$ (constant encoder length)
- Cross-attention KV cache (encoder keys/values, computed once): $2 \times N \times B \times L_{\text{enc}} \times d_{\text{model}} = 2 \times 6 \times B \times 1024 \times 512$ floats; the masked self-attention KV cache additionally grows with $t$
Computation per step $t$:
- Masked self-attention: $O(t \times d_{\text{model}})$ (with KV caching)
- Cross-attention: $O(L_{\text{enc}} \times d_{\text{model}}) = O(1024 \times 512)$
- Feed-forward: $O(d_{\text{model}} \times d_{ff}) = O(512 \times 2048)$
- Total per step: $\sim 2M$ operations (grows linearly with $t$)
Total computation for 1024 tokens: $$\sum_{t=1}^{1024} O(t \times d_{\text{model}} + L_{\text{enc}} \times d_{\text{model}}) \approx O(1024^2 \times 512) \approx 537M \text{ operations}$$
Number of forward passes: 1024 (one per generated token)
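The total generation cost can be sketched by summing the per-step attention costs over all 1024 steps (same order-of-magnitude counting as above, ignoring constant factors):

```python
L_dec, L_enc, d_model = 1024, 1024, 512

# Masked self-attention with KV caching: O(t * d_model) at step t
self_attn_total = sum(t * d_model for t in range(1, L_dec + 1))

# Cross-attention: O(L_enc * d_model) at every step (encoder length fixed)
cross_attn_total = L_dec * (L_enc * d_model)

print(f"self-attention total:  {self_attn_total / 1e6:.0f}M ops")
print(f"cross-attention total: {cross_attn_total / 1e6:.0f}M ops")
print(f"forward passes: {L_dec} (vs. 1 for the encoder)")
```

The cross-attention term ($\approx 537$M) dominates, matching the estimate above; the key cost, however, is the 1024 sequential forward passes, not the raw operation count.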
Why is Decoding Slower?
- Sequential dependency: Each token depends on all previous tokens, preventing parallelization
- Multiple forward passes: Requires 1024 separate forward passes vs 1 for encoder
- Memory bandwidth: Each step loads encoder outputs and KV cache from memory
- Batch size limitation: Cannot batch across time steps, only across samples
- GPU underutilization: Early steps (small $t$) don't fully utilize GPU parallelism
Practical Implications:
For batch size $B=32$:
- Encoder: $\sim 9.6B$ operations in 1 forward pass, $\sim 10$ms on a modern GPU
- Decoder: $\sim 537M$ total operations, but spread over 1024 sequential forward passes, $\sim 2$-$3$ seconds
Decoding is typically 100-200$\times$ slower than encoding in wall-clock time, despite requiring fewer total operations: each small step is memory-bandwidth-bound and cannot amortize per-step overheads. This is why inference optimization focuses heavily on decoder efficiency (KV caching, speculative decoding, etc.).
We need to prove that $\text{LayerNorm}(\vx') = \text{LayerNorm}(\vx)$ when $\vx' = c\vx$ for constant $c > 0$.
Proof:
Recall the layer normalization formula (without learnable parameters): $$\text{LayerNorm}(\vx) = \frac{\vx - \mu}{\sqrt{\sigma^2 + \epsilon}}$$
where: $$\mu = \frac{1}{d}\sum_{i=1}^d x_i, \quad \sigma^2 = \frac{1}{d}\sum_{i=1}^d (x_i - \mu)^2$$
For $\vx' = c\vx$:
Step 1: Compute mean of $\vx'$ $$\mu' = \frac{1}{d}\sum_{i=1}^d x_i' = \frac{1}{d}\sum_{i=1}^d cx_i = c \cdot \frac{1}{d}\sum_{i=1}^d x_i = c\mu$$
Step 2: Compute variance of $\vx'$ $$\sigma'^2 = \frac{1}{d}\sum_{i=1}^d (x_i' - \mu')^2 = \frac{1}{d}\sum_{i=1}^d (cx_i - c\mu)^2 = c^2 \cdot \frac{1}{d}\sum_{i=1}^d (x_i - \mu)^2 = c^2\sigma^2$$
Step 3: Compute LayerNorm of $\vx'$ $$\text{LayerNorm}(\vx') = \frac{\vx' - \mu'}{\sqrt{\sigma'^2 + \epsilon}} = \frac{c\vx - c\mu}{\sqrt{c^2\sigma^2 + \epsilon}} = \frac{c(\vx - \mu)}{\sqrt{c^2\sigma^2 + \epsilon}}$$
For large $c$ where $\epsilon$ is negligible compared to $c^2\sigma^2$: $$\text{LayerNorm}(\vx') \approx \frac{c(\vx - \mu)}{c\sigma} = \frac{\vx - \mu}{\sigma} \approx \text{LayerNorm}(\vx)$$
Exact equality: With $\epsilon > 0$, the scaled result is $$\text{LayerNorm}(\vx') = \frac{c(\vx - \mu)}{\sqrt{c^2\sigma^2 + \epsilon}}$$
This equals $\text{LayerNorm}(\vx)$ only in the limit as $\epsilon \to 0$ or when $c^2\sigma^2 \gg \epsilon$.
Practical Implications:
- Layer normalization makes the network invariant to input scale (approximately)
- This is why learning rate can be more aggressive with LayerNorm
- Contrast with batch normalization, which is NOT invariant to scaling a single example: its statistics are computed across the batch, so the scale factor does not cancel for that example
- The small $\epsilon$ term (typically $10^{-5}$) ensures numerical stability but breaks exact scale invariance
Numerical Example:
Let $\vx = [1, 2, 3, 4]$, $c = 10$, $\epsilon = 10^{-5}$:
For $\vx$: $\mu = 2.5$, $\sigma^2 = 1.25$ $$\text{LayerNorm}(\vx) = \frac{[1,2,3,4] - 2.5}{\sqrt{1.25 + 10^{-5}}} = \frac{[-1.5, -0.5, 0.5, 1.5]}{1.118} \approx [-1.342, -0.447, 0.447, 1.342]$$
For $\vx' = 10\vx$: $\mu' = 25$, $\sigma'^2 = 125$ $$\text{LayerNorm}(\vx') = \frac{[10,20,30,40] - 25}{\sqrt{125 + 10^{-5}}} = \frac{[-15, -5, 5, 15]}{11.180} \approx [-1.342, -0.447, 0.447, 1.342]$$
The outputs are identical (up to numerical precision), confirming scale invariance.
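The worked example can be verified numerically in a few lines (plain Python, implementing the formula without learnable parameters exactly as stated in the proof):

```python
import math

def layer_norm(x, eps=1e-5):
    """LayerNorm without learnable parameters, as in the proof above."""
    d = len(x)
    mu = sum(x) / d
    var = sum((xi - mu) ** 2 for xi in x) / d
    return [(xi - mu) / math.sqrt(var + eps) for xi in x]

x = [1.0, 2.0, 3.0, 4.0]
y1 = layer_norm(x)
y2 = layer_norm([10 * xi for xi in x])   # scaled input, c = 10

print([round(v, 3) for v in y1])  # → [-1.342, -0.447, 0.447, 1.342]
print([round(v, 3) for v in y2])  # → [-1.342, -0.447, 0.447, 1.342]
```

The tiny residual difference between `y1` and `y2` (on the order of $10^{-6}$ here) comes entirely from the $\epsilon$ term, confirming that scale invariance is exact only in the limit $\epsilon \to 0$.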