Attention Mechanisms: Fundamentals
Chapter Overview
Attention mechanisms revolutionized sequence modeling by allowing models to focus on relevant parts of the input when producing each output. This chapter introduces attention from first principles, developing the query-key-value paradigm that underpins modern transformers.
Attention solves a fundamental limitation of RNN encoder-decoder models: compressing the entire input sequence into a single fixed-size vector. Instead, attention computes dynamic, context-dependent representations as weighted combinations of all input positions.
Learning Objectives
- Understand the motivation for attention in sequence-to-sequence models
- Master the query-key-value attention paradigm
- Implement additive (Bahdanau) and multiplicative (Luong) attention
- Understand scaled dot-product attention
- Compute attention weights and apply to values
- Visualize and interpret attention distributions
Motivation: The Seq2Seq Bottleneck
RNN Encoder-Decoder Architecture
The sequence-to-sequence (seq2seq) problem requires mapping an input sequence $\vx_1, \ldots, \vx_n$ to an output sequence $\vy_1, \ldots, \vy_m$ of potentially different length. This formulation encompasses machine translation, text summarization, question answering, and many other natural language processing tasks. Before attention mechanisms, the standard approach used recurrent neural networks in an encoder-decoder architecture that suffered from a fundamental information bottleneck.
The encoder RNN processes the input sequence sequentially, updating its hidden state at each time step, $\vh_t = f(\vh_{t-1}, \vx_t)$, and summarizes the entire input in a single context vector, typically the final hidden state $\mathbf{c} = \vh_n$.
The decoder RNN then generates the output sequence conditioned on this context vector, updating its own state as $\mathbf{s}_t = g(\mathbf{s}_{t-1}, \vy_{t-1}, \mathbf{c})$ and predicting each output token from $\mathbf{s}_t$.
The Information Bottleneck: Compressing an entire input sequence into a single fixed-size vector creates severe information loss, particularly for long sequences. Consider translating a 50-word English sentence to French. The encoder must compress 50 words of semantic content, syntactic structure, and contextual relationships into a 512-dimensional vector. This is fundamentally insufficient: the context vector becomes an information bottleneck that limits the model's capacity to handle complex or lengthy inputs.
Empirical evidence from early neural machine translation systems demonstrated this limitation quantitatively. For English-French translation using LSTM encoder-decoders with 1000-dimensional hidden states, translation quality (measured by BLEU score) remained stable for source sentences up to 20-25 words but degraded significantly beyond this length. Sentences of 30-40 words showed BLEU score drops of 5-10 points compared to shorter sentences, and sentences exceeding 50 words often produced nearly incomprehensible translations. The fixed-size context vector simply could not retain sufficient information about long, complex source sentences.
Memory and Computational Characteristics: The RNN encoder-decoder architecture requires $O(n+m)$ memory for storing hidden states during the forward pass, where $n$ is the source length and $m$ is the target length. For a typical translation task with $n=50$, $m=50$, and hidden dimension $d_h=1024$, this amounts to $(50+50) \times 1024 \times 4 = 400$ KB per sequence in FP32. However, the sequential nature of RNN processing prevents parallelization across time steps. Each hidden state $\vh_t$ depends on $\vh_{t-1}$, forcing strictly sequential computation. On a GPU capable of processing thousands of operations in parallel, this sequential constraint severely limits throughput.
For a batch of 32 sequences of length 50, the encoder must execute 50 strictly sequential steps, and within each step only the 32 batch elements run in parallel; the $32 \times 50 = 1600$ hidden-state updates could in principle all run concurrently if they were independent. This sequential bottleneck means RNN encoder-decoders achieve only 5-10\% of peak GPU utilization during training, wasting the majority of available compute capacity.
Attention Solution
Attention mechanisms solve the information bottleneck by allowing the decoder to access all encoder hidden states directly, rather than relying on a single compressed representation. The key insight is that when generating each output word $\vy_t$, different input words have different relevance. When translating "The cat sat on the mat" to French, generating "chat" (cat) should focus primarily on the input word "cat," while generating "assis" (sat) should focus on "sat." The decoder's information needs change dynamically throughout generation.
Rather than computing a single context vector $\mathbf{c}$ for the entire sequence, attention computes a different context vector $\mathbf{c}_t$ for each output position $t$. This context vector is a weighted sum of all encoder hidden states, $\mathbf{c}_t = \sum_{i=1}^{n} \alpha_{t,i} \vh_i$, where the weights $\alpha_{t,i}$ sum to one over the input positions.
The attention weights are computed dynamically based on the current decoder state $\mathbf{s}_t$ and each encoder hidden state $\vh_i$. This allows the model to learn which input positions are relevant for each output position, adapting the context vector to the decoder's current needs. When generating the first word of a translation, the attention might focus on the beginning of the source sentence. When generating the last word, attention shifts to the end of the source.
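To make the weighted-sum view concrete, here is a minimal PyTorch sketch with hypothetical sizes (source length 6, hidden dimension 8) that scores encoder states against the decoder state with a plain dot product for brevity; the learned scoring functions used in practice are introduced in the following sections.

import torch
import torch.nn.functional as F

# Hypothetical sizes: source length n=6, hidden dimension d_h=8
encoder_states = torch.randn(6, 8)   # h_1 ... h_6
decoder_state = torch.randn(8)       # s_t for the current output position

# Score each encoder state against the decoder state (dot-product scoring
# used here purely for illustration)
scores = encoder_states @ decoder_state    # shape (6,)
alpha = F.softmax(scores, dim=0)           # attention weights, sum to 1
context = alpha @ encoder_states           # c_t = sum_i alpha_{t,i} * h_i -> shape (8,)
print(alpha, context.shape)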
Memory Trade-off: Attention increases memory requirements from $O(n+m)$ to $O(nm)$ because we must store attention weights $\alpha_{t,i}$ for all pairs of input and output positions. For translation with $n=50$ and $m=50$, this requires storing a $50 \times 50 = 2500$ element attention matrix. At 4 bytes per element (FP32), this is 10 KB per sequence, modest compared to the benefits. However, this quadratic scaling becomes significant for very long sequences. For document-level translation with $n=1000$ and $m=1000$, the attention matrix requires $1000^2 \times 4 = 4$ MB per sequence, or 128 MB for batch size 32.
Parallelization Benefit: The crucial advantage is that attention enables parallelization. Unlike RNN hidden states that must be computed sequentially, attention weights for all output positions can be computed simultaneously during training when the target sequence is known. This transforms the sequential $O(m)$ decoder steps into a single parallel operation, dramatically improving GPU utilization from 5-10\% to 60-80\% in practice.
With attention, when generating "chat" (cat), the attention mechanism computes weights that heavily favor the input position containing "cat." The attention distribution might be $[0.05, 0.82, 0.03, 0.02, 0.03, 0.05]$, placing 82\% of the weight on position 2 (the word "cat"). The context vector $\mathbf{c}_2$ is then dominated by the encoder hidden state for "cat," providing the decoder with direct access to the relevant input information.
When generating "assis" (sat), the attention distribution shifts to $[0.03, 0.08, 0.75, 0.04, 0.05, 0.05]$, now focusing 75\% on position 3 (the word "sat"). The context vector $\mathbf{c}_4$ adapts to provide information about "sat" rather than "cat." This dynamic reweighting allows the decoder to access different parts of the input as needed, eliminating the information bottleneck of the fixed context vector.
Empirically, attention-based translation systems improved BLEU scores by 3-5 points on standard benchmarks and maintained consistent quality even for sentences exceeding 50 words, a regime where RNN encoder-decoders failed catastrophically.
Additive Attention (Bahdanau)
Bahdanau attention, introduced in 2015 for neural machine translation, was the first widely successful attention mechanism. It computes attention weights using an additive scoring function that combines the decoder state and encoder hidden states through learned transformations. While later superseded by more efficient mechanisms, understanding Bahdanau attention provides crucial insights into attention design and the evolution toward modern transformers.
Step 1: Compute alignment scores between the current decoder state and every encoder hidden state: $e_{t,i} = \mathbf{v}\transpose \tanh(\mW_1 \mathbf{s}_t + \mW_2 \vh_i)$.
Step 2: Compute attention weights by normalizing the scores with a softmax over input positions: $\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{j=1}^{n} \exp(e_{t,j})}$.
Step 3: Compute the context vector as the attention-weighted sum of encoder states: $\mathbf{c}_t = \sum_{i=1}^{n} \alpha_{t,i} \vh_i$.
Step 4: Use the context vector in the decoder: $\mathbf{c}_t$ is combined with the decoder state and the previous output token to produce the next state and predict $\vy_t$.
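A minimal PyTorch sketch of these four steps, assuming the hypothetical dimensions used later in this section ($d_s = d_h = 512$, $d_a = 256$); it implements the additive scoring function but omits the surrounding encoder and decoder.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AdditiveAttention(nn.Module):
    """Bahdanau-style additive attention (illustrative sketch)."""
    def __init__(self, d_s=512, d_h=512, d_a=256):
        super().__init__()
        self.W1 = nn.Linear(d_s, d_a, bias=False)   # projects decoder state s_t
        self.W2 = nn.Linear(d_h, d_a, bias=False)   # projects encoder states h_i
        self.v = nn.Linear(d_a, 1, bias=False)      # maps tanh(...) to a scalar score

    def forward(self, s_t, H):
        # s_t: (batch, d_s)   H: (batch, n, d_h)
        # Step 1: alignment scores e_{t,i}
        e = self.v(torch.tanh(self.W1(s_t).unsqueeze(1) + self.W2(H))).squeeze(-1)  # (batch, n)
        # Step 2: attention weights via softmax over source positions
        alpha = F.softmax(e, dim=-1)                                                 # (batch, n)
        # Step 3: context vector as weighted sum of encoder states
        c_t = torch.bmm(alpha.unsqueeze(1), H).squeeze(1)                            # (batch, d_h)
        # Step 4: the decoder would combine c_t with s_t to predict the next token
        return c_t, alpha

attn = AdditiveAttention()
c, a = attn(torch.randn(2, 512), torch.randn(2, 10, 512))
print(c.shape, a.shape)   # (2, 512), (2, 10)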
Computational Cost Analysis: The additive scoring function in Step 1 requires substantial computation for each query-key pair. For a single alignment score $e_{t,i}$, we must:
- Compute $\mW_1 \mathbf{s}_t$: $2d_a d_s$ FLOPs (matrix-vector multiplication)
- Compute $\mW_2 \vh_i$: $2d_a d_h$ FLOPs
- Add the results: $d_a$ FLOPs
- Apply $\tanh$: $\approx 3d_a$ FLOPs (exponentials and divisions)
- Compute $\mathbf{v}\transpose (\cdot)$: $2d_a$ FLOPs
Total per alignment score: approximately $2d_a(d_s + d_h + 3)$ FLOPs. For a translation task with source length $n$ and target length $m$, we compute $nm$ alignment scores, requiring approximately $nm \cdot 2d_a(d_s + d_h + 3)$ FLOPs in total.
For typical dimensions $n=50$, $m=50$, $d_a=256$, $d_s=d_h=512$: $2500 \times 2 \times 256 \times 1027 \approx 1.3 \times 10^9$ FLOPs, roughly 1.3 GFLOPs just for the alignment scores.
This is substantial, but the more critical issue is that these operations do not map efficiently to GPU hardware. The additive scoring function involves element-wise operations ($\tanh$), vector additions, and small matrix-vector products that achieve poor utilization on GPUs optimized for large matrix multiplications. In practice, Bahdanau attention achieves only 15-25\% of peak GPU throughput.
Memory Requirements: The attention mechanism requires storing:
- Encoder hidden states: $n \times d_h$ elements
- Alignment scores: $m \times n$ elements (for all decoder positions during training)
- Attention weights: $m \times n$ elements
- Intermediate activations: $m \times n \times d_a$ elements for the $\tanh$ layer
For $n=50$, $m=50$, $d_h=512$, $d_a=256$ in FP32: the encoder states require $50 \times 512 \times 4 = 100$ KB, the alignment scores and attention weights $50 \times 50 \times 4 = 10$ KB each, and the $\tanh$-layer activations $50 \times 50 \times 256 \times 4 \approx 2.5$ MB per sequence.
The intermediate activations dominate memory usage, requiring 2.5 MB per sequence or 80 MB for batch size 32. This is manageable for short sequences but scales poorly to longer contexts.
Parameter Count: Bahdanau attention introduces $O(d_a(d_s + d_h))$ parameters: $\mW_1 \in \R^{d_a \times d_s}$, $\mW_2 \in \R^{d_a \times d_h}$, and $\mathbf{v} \in \R^{d_a}$, for a total of $d_a d_s + d_a d_h + d_a$.
For $d_a=256$, $d_s=d_h=512$: $(256 \times 512) + (256 \times 512) + 256 = 262{,}400$ parameters. While not enormous, these parameters must be learned specifically for the attention mechanism, adding to the model's overall capacity requirements.
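The figures above follow directly from the counting rules stated in this section; the short script below (plain Python, no framework required) reproduces them as a sanity check rather than a measured benchmark.

# Cost model for additive (Bahdanau) attention, using the approximations above
n, m = 50, 50          # source and target lengths
d_a, d_s, d_h = 256, 512, 512

flops_per_score = 2 * d_a * (d_s + d_h + 3)     # W1 s_t, W2 h_i, add, tanh, v^T
total_flops = n * m * flops_per_score
params = d_a * d_s + d_a * d_h + d_a            # W1, W2, and v

print(f"FLOPs per score: {flops_per_score:,}")   # 525,824
print(f"Total score FLOPs: {total_flops:,}")     # ~1.3 GFLOPs
print(f"Attention parameters: {params:,}")       # 262,400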
Worked Example: Consider decoder position $t = 2$ with three encoder hidden states. Step 1: Compute alignment scores for each encoder position. Suppose after applying $\mW_1 \mathbf{s}_2 + \mW_2 \vh_i$ and passing through $\tanh$ and $\mathbf{v}\transpose$, we obtain three raw scores $e_{2,1}$, $e_{2,2}$, and $e_{2,3}$.
These raw scores indicate that encoder position 2 has the highest compatibility with the current decoder state, but the scores are not yet normalized.
Step 2: Apply softmax to convert the scores to a probability distribution: $\alpha_{2,i} = \frac{\exp(e_{2,i})}{\sum_{j=1}^{3} \exp(e_{2,j})}$.
The decoder places 67.8\% of its attention on encoder position 2, with the remaining attention distributed between positions 1 and 3. This sharp distribution indicates high confidence about which input position is relevant.
Step 3: Compute the context vector as a weighted sum: $\mathbf{c}_2 = \sum_{i=1}^{3} \alpha_{2,i} \vh_i$.
If $\vh_1 = [1.0, 0.5, -0.3, 0.8]\transpose$, $\vh_2 = [0.3, 0.9, 0.6, -0.2]\transpose$, and $\vh_3 = [-0.4, 0.2, 0.7, 0.5]\transpose$, the weighted sum mixes these three vectors in proportion to their attention weights.
The context vector is dominated by $\vh_2$ due to the high attention weight $\alpha_{2,2} = 0.678$, but includes contributions from the other encoder states proportional to their attention weights.
Scaled Dot-Product Attention
Scaled dot-product attention, introduced in the "Attention is All You Need" paper, represents a fundamental simplification and improvement over additive attention. By replacing the learned additive scoring function with a simple scaled dot product, this mechanism achieves superior computational efficiency while maintaining or improving model performance. This design choice enabled the transformer architecture to scale to billions of parameters and become the foundation of modern large language models.
Step 1: Compute attention scores as dot products between every query and every key: $\mQ\mK\transpose \in \R^{m \times n}$.
Step 2: Scale by $\sqrt{d_k}$: divide every score by $\sqrt{d_k}$.
Step 3: Apply a softmax over keys (row-wise) to obtain the attention weights: $\mA = \text{softmax}(\mQ\mK\transpose / \sqrt{d_k})$.
Step 4: Apply the attention weights to the values: the output is $\mA\mV$.
The complete formula in one line: $\text{Attention}(\mQ, \mK, \mV) = \text{softmax}\!\left(\frac{\mQ\mK\transpose}{\sqrt{d_k}}\right)\mV$.
[Figure: data flow of scaled dot-product attention. $\mQ$ and $\mK$ are multiplied to form $\mQ\mK\transpose$, the result is scaled and passed through a softmax to produce $\mA$, which is multiplied with $\mV$ to give the output.]
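The following sketch walks through the four steps with explicit intermediate tensors for small hypothetical shapes; it mirrors the one-line formula rather than any optimized implementation.

import torch
import torch.nn.functional as F

n, d_k, d_v = 4, 8, 8                     # hypothetical sizes
Q = torch.randn(n, d_k)
K = torch.randn(n, d_k)
V = torch.randn(n, d_v)

scores = Q @ K.T                          # Step 1: raw dot-product scores, (n, n)
scaled = scores / (d_k ** 0.5)            # Step 2: scale by sqrt(d_k)
A = F.softmax(scaled, dim=-1)             # Step 3: row-wise softmax over keys
out = A @ V                               # Step 4: weighted sum of values, (n, d_v)

print(A.sum(dim=-1))                      # each row of A sums to 1
print(out.shape)                          # torch.Size([4, 8])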
Why Scaling Matters: Variance Analysis
The scaling factor $1/\sqrt{d_k}$ is not merely a normalization convenience; it is essential for maintaining stable gradients during training. To understand why, we analyze the variance of dot products between queries and keys.
Assume query and key vectors have independent elements with zero mean and unit variance: $\mathbb{E}[\vq_i] = \mathbb{E}[\vk_i] = 0$ and $\text{Var}(\vq_i) = \text{Var}(\vk_i) = 1$. The dot product between a query and key is $\vq\transpose\vk = \sum_{i=1}^{d_k} \vq_i \vk_i$.
Since the elements are independent, the variance of the sum equals the sum of variances: $\text{Var}(\vq\transpose\vk) = \sum_{i=1}^{d_k} \text{Var}(\vq_i \vk_i) = d_k$.
Without scaling, the dot product has variance $d_k$, which grows linearly with the key dimension. For $d_k = 64$, typical dot products have standard deviation $\sqrt{64} = 8$. For $d_k = 512$, the standard deviation grows to $\sqrt{512} \approx 22.6$. These large magnitudes cause severe problems for the softmax function.
Softmax Saturation Problem: The softmax function is defined as $\text{softmax}(\vz)_i = \frac{\exp(z_i)}{\sum_{j} \exp(z_j)}$.
When input magnitudes are large, softmax saturates: one element dominates and receives nearly all the probability mass, while others receive exponentially small probabilities. Consider a simple example with two elements, $[z, 0]$: $\text{softmax}([z, 0]) = \left[\frac{e^z}{e^z + 1}, \frac{1}{e^z + 1}\right]$.
For $z = 10$: $\text{softmax}([10, 0]) \approx [0.9999, 0.0001]$. For $z = 20$: $\text{softmax}([20, 0]) \approx [1.0, 2 \times 10^{-9}]$. The distribution becomes a hard selection rather than a soft weighting.
Gradient Flow Analysis: The gradient of softmax with respect to its input is $\frac{\partial\, \text{softmax}(\vz)_i}{\partial z_j} = \text{softmax}(\vz)_i \,(\delta_{ij} - \text{softmax}(\vz)_j)$, where $\delta_{ij}$ equals 1 if $i = j$ and 0 otherwise.
When softmax saturates with one element near 1 and others near 0, these gradients become tiny. For the dominant element $i$ where $\text{softmax}(\vz)_i \approx 1$, the diagonal term is $\text{softmax}(\vz)_i(1 - \text{softmax}(\vz)_i) \approx 1 \cdot 0 = 0$.
For non-dominant elements where $\text{softmax}(\vz)_j \approx 0$, every term contains the factor $\text{softmax}(\vz)_j \approx 0$, so those gradients are also nearly zero.
All gradients vanish, preventing the model from learning. This is analogous to the vanishing gradient problem in deep networks, but occurring within a single attention layer.
Scaling Solution: Dividing by $\sqrt{d_k}$ normalizes the variance: $\text{Var}\!\left(\frac{\vq\transpose\vk}{\sqrt{d_k}}\right) = \frac{d_k}{d_k} = 1$.
With unit variance, dot products typically range from $-3$ to $+3$ (within three standard deviations), keeping softmax in its sensitive region where gradients are substantial. This maintains effective gradient flow throughout training.
Numerical Example: Consider $d_k = 64$ versus $d_k = 512$ with random unit-variance queries and keys. Without scaling, for $d_k = 64$, a typical attention score might be $\vq\transpose \vk = 12.3$. After softmax over 10 keys with similar magnitudes, the distribution might be $[0.45, 0.18, 0.12, 0.08, 0.06, 0.04, 0.03, 0.02, 0.01, 0.01]$, which is reasonably distributed. The gradient of the top element is approximately $0.45 \times (1 - 0.45) = 0.248$, which is healthy.
For $d_k = 512$ without scaling, the same query-key pair might produce $\vq\transpose \vk = 35.2$. After softmax, the distribution becomes $[0.9997, 0.0001, 0.0001, 0.0001, \ldots]$, completely saturated. The gradient is approximately $0.9997 \times (1 - 0.9997) = 0.0003$, which is 800 times smaller. Over many layers, these tiny gradients compound, making training extremely difficult or impossible.
With scaling by $\sqrt{512} \approx 22.6$, the score becomes $35.2 / 22.6 \approx 1.56$, producing a softmax distribution like $[0.38, 0.15, 0.12, 0.10, \ldots]$ with gradient $0.38 \times (1 - 0.38) = 0.236$âsimilar to the $d_k = 64$ case. The scaling makes attention behavior independent of the key dimension, enabling stable training across different model sizes.
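This saturation effect is easy to reproduce. The snippet below draws random unit-variance queries and keys at $d_k = 64$ and $d_k = 512$ and reports the largest softmax weight and its diagonal gradient term with and without scaling; exact numbers vary with the random seed, but the qualitative pattern matches the analysis above.

import torch
import torch.nn.functional as F

torch.manual_seed(0)

def peak_weight_and_grad(d_k, scale):
    # One query attending over 10 random keys, all with unit-variance elements
    q = torch.randn(d_k)
    K = torch.randn(10, d_k)
    scores = K @ q
    if scale:
        scores = scores / d_k ** 0.5
    p = F.softmax(scores, dim=0)
    top = p.max().item()
    return top, top * (1 - top)        # softmax weight and its diagonal gradient term

for d_k in (64, 512):
    for scale in (False, True):
        w, g = peak_weight_and_grad(d_k, scale)
        print(f"d_k={d_k:4d} scaled={scale!s:5}  max weight={w:.4f}  gradient term={g:.4f}")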
Computational Efficiency
Scaled dot-product attention achieves dramatically better computational efficiency than additive attention, both in raw FLOP count and in hardware utilization. This efficiency difference is why transformers can scale to billions of parameters while additive attention models remained limited to hundreds of millions.
FLOP Count Comparison: For $m$ queries and $n$ keys with dimension $d_k$:
Scaled dot-product attention: computing $\mQ\mK\transpose$ requires $2mnd_k$ FLOPs, the softmax roughly $5mn$ FLOPs, and the output $\mA\mV$ another $2mnd_v$ FLOPs, for a total of approximately $2mn(d_k + d_v)$ FLOPs.
For $d_k = d_v = 64$, $m = n = 512$: $2 \times 512 \times 512 \times (64 + 64) \approx 6.7 \times 10^7$ FLOPs, about 67 MFLOPs.
Bahdanau attention: As computed earlier, for $d_a = 256$, $d_s = d_h = 512$, $m = n = 512$: $512^2 \times 2 \times 256 \times 1027 \approx 1.4 \times 10^{11}$ FLOPs, about 138 GFLOPs.
Scaled dot-product attention requires more than $1000\times$ fewer FLOPs than Bahdanau attention for this configuration. The absolute gap widens with sequence length, and the ratio is driven by Bahdanau's dependence on $d_a(d_s + d_h)$ versus scaled dot-product's dependence on $d_k$ alone.
Hardware Efficiency: Beyond raw FLOP count, scaled dot-product attention maps naturally to highly optimized GPU operations. The core computation $\mQ \mK\transpose$ is a dense matrix multiplication (GEMM), which is the most optimized operation on modern GPUs. NVIDIA's cuBLAS library and Tensor Cores are specifically designed for GEMM, achieving 80-90\% of theoretical peak performance.
In contrast, Bahdanau attention requires element-wise operations ($\tanh$), vector additions, and many small matrix-vector products. These operations achieve only 15-25\% of peak GPU performance due to memory bandwidth limitations and poor parallelization. The $\tanh$ activation requires computing exponentials for each element, which is slow compared to the fused multiply-add operations in GEMM.
Memory Bandwidth Considerations: Modern GPUs are often memory-bandwidth limited rather than compute-limited. The NVIDIA A100 has 312 TFLOPS of FP16 compute but only 1.5 TB/s memory bandwidth. For operations to be compute-bound, they must perform many FLOPs per byte loaded from memory.
Matrix multiplication $\mQ \mK\transpose$ for $\mQ, \mK \in \R^{512 \times 64}$ loads $2 \times 512 \times 64 \times 2 = 131$ KB (FP16) and performs $2 \times 512 \times 512 \times 64 \approx 33.6$ MFLOPs, achieving $33{,}554{,}432 / 131{,}072 = 256$ FLOPs per byte. This high arithmetic intensity keeps the GPU compute units busy.
Bahdanau's element-wise operations load data, perform a few operations, and store results, achieving only 1-5 FLOPs per byte. The GPU spends most of its time waiting for memory rather than computing, wasting the available compute capacity.
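The arithmetic-intensity estimate for $\mQ\mK\transpose$ can be written out in a few lines; this is a back-of-the-envelope model (FP16 operands, output writes ignored), not a profiler measurement.

# Arithmetic intensity of Q K^T for Q, K in R^{512 x 64}, FP16 inputs
n, d_k, bytes_per_elem = 512, 64, 2

flops = 2 * n * n * d_k                      # multiply-adds counted as 2 FLOPs
bytes_loaded = 2 * n * d_k * bytes_per_elem  # load Q and K once

print(f"{flops:,} FLOPs over {bytes_loaded:,} bytes "
      f"-> {flops / bytes_loaded:.0f} FLOPs/byte")   # ~256 FLOPs/byte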
Practical Performance: On an NVIDIA A100 GPU, computing attention for a batch of 32 sequences with $n = 512$ and $d_k = 64$:
- Scaled dot-product attention: $\approx 0.8$ ms (achieving 250 TFLOPS, 80\% of peak)
- Bahdanau attention: $\approx 15$ ms (achieving 15 TFLOPS, 5\% of peak)
The $19\times$ speedup from scaled dot-product attention is what enables training GPT-3 scale models (175B parameters) in reasonable time. With Bahdanau attention, training would take $19\times$ longer, making such models economically infeasible.
Worked Example: Consider a single query compared against three keys with $d_k = 4$. Step 1: Compute dot products between the query and each key, giving raw scores $[1.33, 0.37, -0.24]$.
Step 2: Scale by $\sqrt{d_k} = \sqrt{4} = 2$, giving scaled scores $[0.665, 0.185, -0.12]$.
Without scaling, the scores would be $[1.33, 0.37, -0.24]$. For this small example with $d_k = 4$, the difference is modest. But for $d_k = 64$, unscaled scores would be $\sqrt{64/4} = 4$ times larger, and for $d_k = 512$, they would be $\sqrt{512/4} \approx 11.3$ times larger, causing severe softmax saturation.
Step 3: Apply softmax to obtain attention weights.
Computing each weight: $\alpha_1 = \frac{e^{0.665}}{e^{0.665} + e^{0.185} + e^{-0.12}} \approx \frac{1.944}{4.035} \approx 0.482$, and similarly $\alpha_2 \approx 0.298$ and $\alpha_3 \approx 0.220$.
The attention is distributed across all three keys, with the highest weight on key 1 (48.2\%) but substantial attention to keys 2 and 3 as well. This soft distribution allows the model to incorporate information from multiple positions.
Step 4: Apply attention weights to values. Suppose:
The output is:
The output vector is a weighted combination of the value vectors, with weights determined by the query-key similarities. This output can then be used by subsequent layers in the transformer.
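The attention weights in this worked example can be verified numerically; the snippet below starts from the raw scores above and reproduces the 48.2\% weight on key 1 (the value vectors are omitted).

import torch
import torch.nn.functional as F

raw_scores = torch.tensor([1.33, 0.37, -0.24])   # dot products from Step 1
scaled = raw_scores / 4 ** 0.5                    # divide by sqrt(d_k) = 2
weights = F.softmax(scaled, dim=0)
print(weights)   # approximately [0.482, 0.298, 0.220]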
Attention Score Computation Methods
The following table summarizes the four major attention scoring functions introduced in this chapter:
| Method | Computation | Parameters | GPU Util. | Used In |
|---|---|---|---|---|
| Additive (Bahdanau) | $\mathbf{v}\transpose\tanh(\mW_1\vq + \mW_2\vk)$ | $O(d_a(d_q + d_k))$ | 15--25\% | Early seq2seq |
| Dot-product | $\vq\transpose\vk$ | 0 | 80--90\% | Not used (unstable) |
| Scaled dot-product | $\vq\transpose\vk/\sqrt{d_k}$ | 0 | 80--90\% | All transformers |
| General (Luong) | $\vq\transpose\mW\vk$ | $O(d_q d_k)$ | 50--70\% | Some seq2seq |
Scaled dot-product attention dominates modern transformers due to its parameter-free nature, high GPU utilization from regular matrix multiplication structure, and strong empirical performance. The simplicity enables hardware-specific optimizations like FlashAttention (Chapter~16).
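To make the table concrete, the sketch below computes one score with each method for a single query-key pair, using small hypothetical dimensions and randomly initialized parameters.

import torch

d_q = d_k = 64
d_a = 32
q, k = torch.randn(d_q), torch.randn(d_k)

# Additive (Bahdanau): learned W1, W2, v
W1, W2, v = torch.randn(d_a, d_q), torch.randn(d_a, d_k), torch.randn(d_a)
score_additive = v @ torch.tanh(W1 @ q + W2 @ k)

# Dot-product: no parameters, variance grows with d_k
score_dot = q @ k

# Scaled dot-product: no parameters, unit variance for unit-variance inputs
score_scaled = (q @ k) / d_k ** 0.5

# General (Luong): single learned bilinear matrix W
W = torch.randn(d_q, d_k)
score_general = q @ W @ k

print(score_additive.item(), score_dot.item(), score_scaled.item(), score_general.item())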
Query-Key-Value Paradigm
Intuition
The query-key-value (QKV) framework provides an elegant abstraction for understanding attention mechanisms through the lens of information retrieval. This paradigm, borrowed from database systems and search engines, offers intuitive explanations for attention's behavior while precisely defining its mathematical operations.
Consider a database system where you want to retrieve relevant information. You provide a query describing what you're looking for, the system compares your query against keys (indexed descriptions of stored content), and returns the values (actual content) associated with the most relevant keys. Attention mechanisms operate identically: queries represent "what I'm looking for," keys represent "what information is available," and values represent "the actual information to retrieve."
In the context of neural networks, these three components serve distinct roles. The query $\vq$ encodes the current position's information needs: what aspects of the input are relevant for processing this position. The keys $\vk_i$ encode what information each input position offers: what content is available at that position. The values $\vv_i$ encode the actual information to be retrieved: the representations that will be combined to form the output.
This separation of concerns is crucial. By decoupling "what to look for" (queries) from "what is available" (keys) and "what to retrieve" (values), the attention mechanism gains flexibility. The same input can be queried in different ways by different positions, and the retrieved information can differ from the indexing representation. This three-way separation enables the model to learn rich, task-specific attention patterns.
Concrete Example: In machine translation, when generating the French word "chat" (cat) from the English sentence "The cat sat on the mat," the decoder's query encodes "I need information about the subject noun." The keys encode what each English word represents: "the" offers determiner information, "cat" offers subject noun information, "sat" offers verb information, etc. The attention mechanism computes high similarity between the query and the key for "cat," then retrieves the value associated with "cat": a rich representation encoding its meaning, grammatical role, and context.
Importantly, the key and value for "cat" can differ. The key might emphasize grammatical features (noun, singular, animate) that help match queries, while the value emphasizes semantic features (animal, feline, pet) that are useful for generation. This separation allows the attention mechanism to index on one set of features while retrieving another.
Projecting to QKV
In transformers, queries, keys, and values are not provided directly but are computed from the input through learned linear projections. This design choice allows the model to learn task-specific representations for each role rather than using the raw input embeddings.
Given input $\mX \in \R^{n \times d_{\text{model}}}$ where $n$ is the sequence length and $d_{\text{model}}$ is the model dimension, we compute $\mQ = \mX \mW_Q$, $\mK = \mX \mW_K$, and $\mV = \mX \mW_V$, with $\mW_Q, \mW_K \in \R^{d_{\text{model}} \times d_k}$ and $\mW_V \in \R^{d_{\text{model}} \times d_v}$.
Each projection matrix is a learned parameter that transforms the input into the appropriate representation space. The query and key projections map to the same dimension $d_k$ (typically $d_{\text{model}}/h$ where $h$ is the number of attention heads) because they must be compatible for dot products. The value projection maps to dimension $d_v$, which is often equal to $d_k$ but can differ.
Why Learn Separate Projections? One might ask: why not use the input $\mX$ directly as queries, keys, and values? The answer lies in representation learning. The raw input embeddings encode general semantic and syntactic information, but attention requires specialized representations. The query projection learns to emphasize features relevant for determining what to attend to. The key projection learns to emphasize features relevant for being attended to. The value projection learns to emphasize features relevant for the output representation.
These three projections can learn different aspects of the input. For example, in a language model, the query projection might emphasize the current word's part of speech and semantic category to determine what context is needed. The key projection might emphasize each word's grammatical role and position to help queries find relevant context. The value projection might emphasize semantic content and relationships to provide useful information for prediction.
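A minimal sketch of the three learned projections for a single head, using hypothetical dimensions ($d_{\text{model}} = 768$, $d_k = d_v = 64$); production implementations typically fuse all heads into one matrix multiply, so the parameter count printed here covers one head only.

import torch
import torch.nn as nn

d_model, d_k, d_v, n = 768, 64, 64, 512

W_Q = nn.Linear(d_model, d_k, bias=False)   # learned query projection
W_K = nn.Linear(d_model, d_k, bias=False)   # learned key projection
W_V = nn.Linear(d_model, d_v, bias=False)   # learned value projection

X = torch.randn(n, d_model)                 # input sequence representations
Q, K, V = W_Q(X), W_K(X), W_V(X)            # (n, d_k), (n, d_k), (n, d_v)

params = sum(p.numel() for p in (W_Q.weight, W_K.weight, W_V.weight))
print(Q.shape, K.shape, V.shape, f"single-head projection parameters: {params:,}")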
Computational Cost: Each projection is a matrix multiplication requiring $2nd_{\text{model}}d_k$ FLOPs (for queries and keys) or $2nd_{\text{model}}d_v$ FLOPs (for values). With three projections and $d_k = d_v$, the total is $6nd_{\text{model}}d_k$ FLOPs.
For typical transformer configurations where $d_k = d_{\text{model}}/h$ and we consider all $h$ heads together (so $hd_k = d_{\text{model}}$), the projections cost $6nd_{\text{model}}^2$ FLOPs.
For BERT-base with $n = 512$ and $d_{\text{model}} = 768$: $6 \times 512 \times 768^2 \approx 1.8 \times 10^9$ FLOPs, or about 1.8 GFLOPs of projection work per attention layer.
This is substantial but represents only about 20\% of the total attention computation for typical sequence lengths. The attention score computation ($\mQ \mK\transpose$) and output computation ($\mA \mV$) dominate for longer sequences.
Parameter Count: The three projection matrices introduce $d_{\text{model}}(2d_k + d_v)$ parameters per attention head. For $h$ heads with $d_k = d_v = d_{\text{model}}/h$, the total across all heads is $3hd_{\text{model}}d_k = 3d_{\text{model}}^2$ parameters per layer.
For BERT-base with $d_{\text{model}} = 768$: $3 \times 768^2 = 1{,}769{,}472$ parameters per attention layer. With 12 layers, the QKV projections account for $12 \times 1.77 = 21.2$ million parameters out of BERT's total 110 million, about 19\% of the model.
Worked Example: Consider an input $\mX \in \R^{5 \times 512}$ (sequence length $n = 5$, $d_{\text{model}} = 512$). We project to $d_k = d_v = 64$ (as in a single attention head of a model with $h = 8$ heads): $\mQ = \mX \mW_Q$, $\mK = \mX \mW_K$, and $\mV = \mX \mW_V$, each in $\R^{5 \times 64}$.
Each projection matrix has $512 \times 64 = 32{,}768$ parameters. Computing each projection requires $2 \times 5 \times 512 \times 64 = 327{,}680$ FLOPs, for a total of $3 \times 327{,}680 = 983{,}040$ FLOPs across all three projections.
Attention computation: After projection, we compute the attention matrix $\mA = \text{softmax}(\mQ \mK\transpose / \sqrt{64}) \in \R^{5 \times 5}$, whose entry $a_{ij}$ represents how much position $i$ attends to position $j$.
In an illustrative pattern, position 1 attends most strongly to itself (45\%) and to position 2 (25\%), while position 5 attends most strongly to itself (55\%) and to position 4 (25\%). This pattern might emerge in a language model where each position attends to nearby context, with stronger attention to the current position and recent tokens.
The output $\mA \mV \in \R^{5 \times 64}$ provides an attended representation for each position, combining information from all positions according to the attention weights. This output can then be processed by subsequent layers.
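The shape bookkeeping in this example can be checked in a few lines; the weights and inputs below are random, so only the shapes and the row-normalization of the attention matrix correspond to the discussion above.

import torch
import torch.nn.functional as F

n, d_model, d_k = 5, 512, 64
X = torch.randn(n, d_model)
W_Q, W_K, W_V = (torch.randn(d_model, d_k) * 0.02 for _ in range(3))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V           # each (5, 64)
A = F.softmax(Q @ K.T / d_k ** 0.5, dim=-1)   # attention matrix, (5, 5)
out = A @ V                                   # attended representations, (5, 64)

print(A.shape, out.shape)       # torch.Size([5, 5]) torch.Size([5, 64])
print(A.sum(dim=-1))            # each row sums to 1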
Hardware Implications of Attention
Attention mechanisms align well with modern GPU architectures for two key reasons. First, attention eliminates the sequential bottleneck of RNNs: all positions are processed simultaneously via batched matrix multiplications, achieving 75\% GPU utilization compared to $\sim$5\% for LSTMs (see Chapter~6 for the detailed comparison). Second, the core $\mQ\mK\transpose$ operation achieves high arithmetic intensity ($\sim$256 FLOPs/byte for typical dimensions), keeping compute units busy rather than waiting for memory transfers.
Attention Variants
Self-Attention vs Cross-Attention
Self-Attention: $\mQ$, $\mK$, and $\mV$ are all computed from the same sequence $\mX$: $\mQ = \mX\mW_Q$, $\mK = \mX\mW_K$, $\mV = \mX\mW_V$. Every position can attend to every other position within that sequence.
Cross-Attention: Queries come from one sequence while keys and values come from another, as in the decoder attending to the encoder: $\mQ = \mX_{\text{dec}}\mW_Q$, $\mK = \mX_{\text{enc}}\mW_K$, $\mV = \mX_{\text{enc}}\mW_V$.
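The two cases differ only in where the queries, keys, and values come from, as the following sketch with hypothetical encoder and decoder tensors shows; both reuse the same scaled dot-product routine.

import torch
import torch.nn.functional as F

def attention(Q, K, V):
    # Shared scaled dot-product routine
    A = F.softmax(Q @ K.transpose(-2, -1) / Q.size(-1) ** 0.5, dim=-1)
    return A @ V

d = 64
encoder_out = torch.randn(10, d)   # encoder representations (source length 10)
decoder_in = torch.randn(7, d)     # decoder representations (target length 7)

# Self-attention: queries, keys, and values all come from the same sequence
self_out = attention(decoder_in, decoder_in, decoder_in)        # (7, 64)

# Cross-attention: queries from the decoder, keys and values from the encoder
cross_out = attention(decoder_in, encoder_out, encoder_out)     # (7, 64)

print(self_out.shape, cross_out.shape)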
Masked Attention
For autoregressive models (GPT), we prevent each position from attending to future positions by adding a mask $\mathbf{M}$ to the scores before the softmax, where $\mathbf{M}_{ij} = 0$ if $j \le i$ and $\mathbf{M}_{ij} = -\infty$ if $j > i$: $\text{MaskedAttention}(\mQ, \mK, \mV) = \text{softmax}\!\left(\frac{\mQ\mK\transpose}{\sqrt{d_k}} + \mathbf{M}\right)\mV$.
After softmax, $\exp(-\infty) = 0$, so no attention to future!
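A short sketch of the masking step: scores for positions $j > i$ are set to $-\infty$ before the softmax, so their attention weights are exactly zero (shapes are hypothetical).

import torch
import torch.nn.functional as F

n, d_k = 5, 16
Q, K, V = torch.randn(n, d_k), torch.randn(n, d_k), torch.randn(n, d_k)

scores = Q @ K.T / d_k ** 0.5
causal_mask = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)  # True above the diagonal
scores = scores.masked_fill(causal_mask, float('-inf'))   # block attention to future positions

A = F.softmax(scores, dim=-1)
print(A)                      # upper triangle is exactly zero
print(A.sum(dim=-1))          # rows still sum to 1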
Exercises
Solutions
Full solutions for all exercises are available at \url{https://deeplearning.hofkensvermeulen.be}.
Given:
Encoder states: $\vh_1 = [1, 0, 1]\transpose$, $\vh_2 = [0, 1, 1]\transpose$, $\vh_3 = [1, 1, 0]\transpose$, $\vh_4 = [0, 0, 1]\transpose$
Decoder state: $\vs = [0.5, 0.5, 0.5]\transpose$
Step 1: Compute alignment scores
For $i=1$:
Similarly: $e_2 \approx 1.089$, $e_3 \approx 1.118$, $e_4 \approx 0.856$
Step 2: Apply softmax
These are the attention weights showing how much the decoder attends to each encoder state.
(1) Output dimension:
(2) Attention matrix shape:
(3) FLOPs for $\mQ\mK\transpose$:
For $d_k = 64$ with unit variance elements, the dot product has variance $d_k = 64$.
Standard deviation: $\sigma = \sqrt{64} = 8$
Effect on softmax gradients:
Without scaling, dot products range roughly $[-24, 24]$ (3 standard deviations). After softmax:
- Large positive scores $\to$ probability $\approx 1$
- Large negative scores $\to$ probability $\approx 0$
- Softmax saturates, gradients vanish
Numerical demonstration: for a raw score of $+16$ (two standard deviations) competing with a score of $0$, $\text{softmax}([16, 0]) \approx [0.9999999, 10^{-7}]$, essentially a hard selection. After scaling by $\sqrt{64} = 8$, the scores become $[2, 0]$ and $\text{softmax}([2, 0]) \approx [0.88, 0.12]$, where gradients remain usable.
Scaling by $\sqrt{d_k}$ keeps dot products in a range where softmax gradients are well-behaved.
import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt
def scaled_dot_product_attention(Q, K, V):
    # Q, K: (batch, seq_len, d_k); V: (batch, seq_len, d_v)
    d_k = Q.size(-1)
    # Scaled compatibility scores between every query and every key
    scores = torch.matmul(Q, K.transpose(-2, -1)) / torch.sqrt(torch.tensor(d_k, dtype=torch.float32))
    # Row-wise softmax over key positions
    attention_weights = F.softmax(scores, dim=-1)
    # Weighted sum of value vectors
    output = torch.matmul(attention_weights, V)
    return output, attention_weights
# Test with sequence length 5
Q = torch.randn(1, 5, 32) # (batch, seq_len, d_k)
K = torch.randn(1, 5, 32)
V = torch.randn(1, 5, 48) # d_v = 48
output, weights = scaled_dot_product_attention(Q, K, V)
print(f"Output shape: {output.shape}") # (1, 5, 48)
print(f"Attention weights shape: {weights.shape}") # (1, 5, 5)
# Visualize attention weights
plt.imshow(weights[0].detach().numpy(), cmap='viridis')
plt.colorbar()
plt.xlabel('Key position')
plt.ylabel('Query position')
plt.title('Attention Weights Heatmap')
plt.show()
The heatmap shows which positions each query attends to, with brighter colors indicating higher attention weights.