Linear Algebra for Deep Learning
Chapter Overview
Linear algebra forms the mathematical foundation of deep learning. Neural networks perform sequences of linear transformations interspersed with nonlinear operations, making matrices and vectors the fundamental objects of study. This chapter develops the linear algebra concepts essential for understanding how deep learning models transform data, how information flows through neural architectures, and how we can interpret the geometric operations these models perform.
Unlike a pure mathematics course, our treatment emphasizes the specific linear algebra operations that appear repeatedly in deep learning: matrix multiplication for transforming representations, dot products for measuring similarity, and matrix decompositions for understanding structure. We pay particular attention to dimensions and shapes, as tracking how tensor dimensions transform through operations is crucial for implementing and debugging deep learning systems.
Learning Objectives
After completing this chapter, you will be able to:
- Represent data as vectors and transformations as matrices with clear understanding of dimensions
- Perform matrix operations and understand their geometric interpretations
- Calculate and interpret dot products as similarity measures
- Understand eigendecompositions and singular value decompositions and their applications
- Apply matrix norms and use them in regularization
- Recognize how linear algebra operations map to neural network computations
Vector Spaces and Transformations
Vectors as Data Representations
In deep learning, we represent data as vectors in high-dimensional spaces. A vector $\vx \in \R^n$ is an ordered collection of $n$ real numbers, which we can interpret geometrically as a point in $n$-dimensional space or as an arrow from the origin to that point.
The dimension $n$ is the number of components in the vector. We write vectors as column vectors by default.
For color images with three channels (red, green, blue), a $224 \times 224$ RGB image becomes a vector in $\R^{150528}$ ($224 \times 224 \times 3 = 150{,}528$). The enormous dimensionality of image data motivates the need for powerful models that can find meaningful patterns in such high-dimensional spaces.
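The flattening described above is a one-line operation in code. A minimal sketch (using NumPy for illustration; a zero image stands in for real pixel data):

```python
import numpy as np

# A 224x224 RGB image stored as a (height, width, channels) array
image = np.zeros((224, 224, 3), dtype=np.float32)

# Flatten into a single vector in R^150528
x = image.reshape(-1)
assert x.shape == (150528,)  # 224 * 224 * 3 = 150,528
```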
Linear Transformations
Linear transformations preserve vector space structure: they map lines to lines and preserve the origin ($T(\mathbf{0}) = \mathbf{0}$).
Matrices as Linear Transformations
Every linear transformation from $\R^n$ to $\R^m$ can be represented by an $m \times n$ matrix.
For transforming a 784-dimensional input into a 256-dimensional hidden representation, a fully connected layer computes $\vh = \mW\vx + \vb$ with $\mW \in \R^{256 \times 784}$ and $\vb \in \R^{256}$.
This layer has $256 \times 784 = 200{,}704$ weights plus 256 biases, totaling 200,960 trainable parameters.
Concrete Numerical Example: with $n_{\text{in}} = 3$ and $n_{\text{out}} = 2$, the layer multiplies a $2 \times 3$ weight matrix $\mW$ with a 3-dimensional input $\vx$ and adds a 2-dimensional bias $\vb$, producing the 2-dimensional output $\vh = \mW\vx + \vb$.
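A minimal numerical sketch of such a layer (the specific weights and input below are illustrative, not taken from the text):

```python
import numpy as np

# Illustrative parameters for n_in = 3, n_out = 2
W = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 1.0]])   # shape (2, 3)
b = np.array([0.5, -0.5])        # shape (2,)
x = np.array([1.0, 2.0, 3.0])    # shape (3,)

# Affine transformation R^3 -> R^2
h = W @ x + b
# h = [1*1 + 0*2 + 2*3 + 0.5, 0*1 + 1*2 + 1*3 - 0.5] = [7.5, 4.5]
```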
Matrix Operations
Matrix Multiplication
For $\mA = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}$ and $\mB = \begin{bmatrix} 5 & 6 \\ 7 & 8 \end{bmatrix}$, computing each entry: $c_{11} = 1 \cdot 5 + 2 \cdot 7 = 19$, $c_{12} = 1 \cdot 6 + 2 \cdot 8 = 22$, $c_{21} = 3 \cdot 5 + 4 \cdot 7 = 43$, $c_{22} = 3 \cdot 6 + 4 \cdot 8 = 50$.
Therefore: $\mC = \begin{bmatrix} 19 & 22 \\ 43 & 50 \end{bmatrix}$
Computational Complexity of Matrix Operations
Understanding the computational cost of matrix operations is essential for designing efficient deep learning systems.
Dimensions: $\underbrace{\mQ}_{\R^{n \times d_k}} \underbrace{\mK\transpose}_{\R^{d_k \times n}} = \underbrace{\mA}_{\R^{n \times n}}$
Computational cost: $2n \cdot d_k \cdot n = 2n^2 d_k$ FLOPs
For GPT-3 with $n = 2048$ tokens and $d_k = 128$: $2 \times 2048^2 \times 128 = 1{,}073{,}741{,}824 \approx 1.07$ GFLOPs per attention head.
This quadratic scaling in sequence length ($O(n^2)$) is why long-context transformers are computationally expensive.
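A small helper makes the quadratic scaling concrete (treating $\mQ$, $\mK$ as single-head matrices, as above):

```python
def attention_score_flops(n, d_k):
    """FLOPs for Q K^T with Q, K in R^{n x d_k}: 2 * n^2 * d_k."""
    return 2 * n * n * d_k

# The GPT-3-style setting from the text: n = 2048, d_k = 128
print(attention_score_flops(2048, 128))   # 1073741824, i.e. ~1.07 GFLOPs

# Doubling the sequence length quadruples the cost
print(attention_score_flops(4096, 128) / attention_score_flops(2048, 128))  # 4.0
```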
For a batch of $B$ sequences of length $n$, input is $\mX \in \R^{B \times n \times d_{\text{model}}}$.
First transformation: $2 \cdot (Bn) \cdot d_{\text{model}} \cdot d_{ff}$ FLOPs
Second transformation: $2 \cdot (Bn) \cdot d_{ff} \cdot d_{\text{model}}$ FLOPs
Total: $4Bn \cdot d_{\text{model}} \cdot d_{ff}$ FLOPs
For BERT-base ($d_{\text{model}} = 768$, $d_{ff} = 3072$, $n = 512$, $B = 32$): $4 \times (32 \times 512) \times 768 \times 3072 = 154{,}618{,}822{,}656 \approx 154.6$ GFLOPs.
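The same arithmetic as a reusable function (a sketch of the count above, ignoring the bias additions, which are negligible):

```python
def ffn_flops(B, n, d_model, d_ff):
    """FLOPs for the two feed-forward matmuls: 2*(Bn)*d_model*d_ff each."""
    return 4 * B * n * d_model * d_ff

# BERT-base numbers from the text
total = ffn_flops(32, 512, 768, 3072)
print(total)  # 154618822656, i.e. ~154.6 GFLOPs
```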
Batch Matrix Multiplication
Modern deep learning frameworks process multiple examples simultaneously using batched operations.
Query tensor: $\mQ \in \R^{B \times h \times n \times d_k} = \R^{32 \times 12 \times 512 \times 64}$
Key tensor: $\mK \in \R^{B \times h \times n \times d_k} = \R^{32 \times 12 \times 512 \times 64}$
Attention scores: $\mA = \mQ\mK\transpose \in \R^{32 \times 12 \times 512 \times 512}$
This requires $B \times h \times 2n^2d_k = 32 \times 12 \times 2 \times 512^2 \times 64 = 12{,}884{,}901{,}888 \approx 12.9$ GFLOPs.
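Batched matrix multiplication broadcasts over all leading dimensions automatically. A sketch with deliberately small shapes ($B = 2$, $h = 2$, $n = 4$, $d_k = 8$) so the demo stays light:

```python
import numpy as np

B, h, n, d_k = 2, 2, 4, 8
Q = np.random.randn(B, h, n, d_k)
K = np.random.randn(B, h, n, d_k)

# matmul treats the first two axes as batch dimensions and
# multiplies the trailing (n, d_k) x (d_k, n) matrices
A = Q @ K.transpose(0, 1, 3, 2)   # swap the last two axes of K
assert A.shape == (B, h, n, n)
```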
Transpose
Important properties: $(\mA\transpose)\transpose = \mA$, $(\mA + \mB)\transpose = \mA\transpose + \mB\transpose$, and $(\mA\mB)\transpose = \mB\transpose \mA\transpose$ (note the reversed order).
Hardware Context for Matrix Operations
Understanding how matrix operations map to hardware is crucial for writing efficient deep learning code.
Memory Layout: Row-Major vs Column-Major
Matrices are stored in memory as one-dimensional arrays, and the layout significantly affects performance. In row-major order, used by C and PyTorch, rows are stored consecutively in memory. In column-major order, used by Fortran and MATLAB, columns are stored consecutively. For a matrix $\mA = \begin{bmatrix} a & b \\ c & d \end{bmatrix}$, row-major storage produces the sequence $[a, b, c, d]$ while column-major storage produces $[a, c, b, d]$.
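The two layouts can be inspected directly; NumPy's `flatten` exposes the storage order described above:

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])

# Row-major (C order): rows are contiguous
print(A.flatten(order='C'))  # [1 2 3 4]

# Column-major (Fortran order): columns are contiguous
print(A.flatten(order='F'))  # [1 3 2 4]
```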
# Good: sequential memory access (row-major traversal)
result = 0.0
for i in range(m):
    for j in range(n):
        result += A[i, j]  # accesses consecutive memory

# Bad: strided memory access
for j in range(n):
    for i in range(m):
        result += A[i, j]  # jumps across rows
GPU Acceleration and BLAS Libraries
Modern deep learning relies on highly optimized linear algebra libraries that provide standardized interfaces for common operations. The Basic Linear Algebra Subprograms (BLAS) standard defines three levels of operations: Level 1 for vector operations like dot products and norms with $O(n)$ complexity, Level 2 for matrix-vector operations like $\mA\vx$ with $O(n^2)$ complexity, and Level 3 for matrix-matrix operations like $\mA\mB$ with $O(n^3)$ complexity. Common CPU implementations include Intel MKL, OpenBLAS, and Apple Accelerate, while GPU implementations include NVIDIA cuBLAS and AMD rocBLAS. These libraries achieve near-peak hardware performance through careful optimization of memory access patterns, instruction scheduling, and hardware-specific features.
For matrix multiplication $\mC = \mA\mB$ with $\mA, \mB \in \R^{4096 \times 4096}$, we need $2 \times 4096^3 = 137{,}438{,}953{,}472 \approx 137.4$ GFLOPs of computation and must transfer $3 \times 4096^2 \times 4 = 201{,}326{,}592$ bytes $\approx 192$ MB of data (three matrices at 4 bytes per float). On an A100, the computation takes approximately $\frac{137.4 \text{ GFLOPs}}{312 \text{ TFLOP/s}} \approx 0.44$ ms, which exceeds the memory transfer time of $\frac{192 \text{ MB}}{1.6 \text{ TB/s}} \approx 0.12$ ms, making the operation compute-bound.
Compute-Bound vs Memory-Bound Operations
Operations with high arithmetic intensity (the ratio of FLOPs performed to bytes of memory traffic) are compute-bound, meaning they are limited by computational throughput, while operations with low arithmetic intensity are memory-bound, meaning they are limited by memory bandwidth.
In contrast, matrix multiplication $\mC = \mA\mB$ with $\mA, \mB \in \R^{n \times n}$ performs $2n^3$ FLOPs while transferring $3n^2 \times 4 = 12n^2$ bytes, yielding an arithmetic intensity of $\frac{2n^3}{12n^2} = \frac{n}{6}$ FLOP/byte. For $n = 1024$, this gives 170.7 FLOP/byte, making the operation compute-bound and well-suited for GPU acceleration. For smaller matrices with $n = 64$, the arithmetic intensity drops to 10.7 FLOP/byte, placing it in a transitional regime where both compute and memory bandwidth matter.
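The $n/6$ formula above can be checked with a few lines (FP32, counting one read of each input and one write of the result):

```python
def matmul_arithmetic_intensity(n, bytes_per_float=4):
    flops = 2 * n**3                           # multiply-accumulates, 2 FLOPs each
    bytes_moved = 3 * n * n * bytes_per_float  # read A, read B, write C
    return flops / bytes_moved                 # simplifies to n / 6 for FP32

print(round(matmul_arithmetic_intensity(1024), 1))  # 170.7 -> compute-bound
print(round(matmul_arithmetic_intensity(64), 1))    # 10.7  -> transitional regime
```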
Dot Products and Similarity
Computing the dot products between a query $\vq$ and two keys yields similarity scores of 0.92 for $\vk_1$ and 0.20 for $\vk_2$: the query is more similar to $\vk_1$ than to $\vk_2$, and these scores determine the attention weights.
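A minimal sketch of this dot-product-as-similarity pattern (the query and key vectors below are illustrative, not the ones from the text):

```python
import numpy as np

# Illustrative query and keys
q  = np.array([1.0, 0.5])
k1 = np.array([0.9, 0.4])   # points in a similar direction to q
k2 = np.array([-0.2, 0.8])  # nearly orthogonal to q

# Dot-product similarity scores
scores = np.array([q @ k1, q @ k2])

# Softmax turns scores into attention weights that sum to 1
weights = np.exp(scores) / np.exp(scores).sum()
```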
Matrix Decompositions
Eigenvalues and Eigenvectors
Geometrically, an eigenvector is only scaled (not rotated) when $\mA$ is applied. The eigenvalue $\lambda$ is the scaling factor.
For $\mA = \begin{bmatrix} 3 & 1 \\ 1 & 3 \end{bmatrix}$, solving $\det(\mA - \lambda \mI) = (3 - \lambda)^2 - 1 = (\lambda - 4)(\lambda - 2) = 0$:
Eigenvalues: $\lambda_1 = 4$, $\lambda_2 = 2$
For $\lambda_1 = 4$, eigenvector: $\vv_1 = \frac{1}{\sqrt{2}}\begin{bmatrix} 1 \\ 1 \end{bmatrix}$
For $\lambda_2 = 2$, eigenvector: $\vv_2 = \frac{1}{\sqrt{2}}\begin{bmatrix} 1 \\ -1 \end{bmatrix}$
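The symmetric matrix consistent with these eigenpairs is $\mA = 4\vv_1\vv_1\transpose + 2\vv_2\vv_2\transpose = \begin{bmatrix} 3 & 1 \\ 1 & 3 \end{bmatrix}$, which we can verify numerically:

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 3.0]])

# eigh is the right routine for symmetric matrices: real eigenvalues,
# orthonormal eigenvectors, eigenvalues returned in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(A)
print(eigenvalues)            # [2. 4.]

# Each eigenvector is only scaled (not rotated) by A
v = eigenvectors[:, 1]
assert np.allclose(A @ v, eigenvalues[1] * v)
```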
Singular Value Decomposition
Any matrix $\mA \in \R^{m \times n}$ admits a singular value decomposition $\mA = \mU \boldsymbol{\Sigma} \mV\transpose$, where $\mU$ and $\mV$ have orthonormal columns and $\boldsymbol{\Sigma}$ is diagonal with singular values $\sigma_1 \geq \sigma_2 \geq \cdots \geq 0$.
Low-Rank Approximation and Model Compression
SVD enables efficient model compression by approximating weight matrices with lower-rank factorizations. Consider a weight matrix $\mW \in \R^{512 \times 2048}$ with SVD $\mW = \mU \boldsymbol{\Sigma} \mV\transpose$.
By the Eckart--Young theorem, the best rank-$k$ approximation $\mW_k$ in Frobenius norm has error $\norm{\mW - \mW_k}_F = \sqrt{\sum_{i > k} \sigma_i^2}$.
For a rank-$k$ approximation, we keep only the top $k$ singular values to obtain $\mW \approx \mW_1 \mW_2 = \mU_k \boldsymbol{\Sigma}_k \mV_k\transpose$ where $\mU_k \in \R^{512 \times k}$, $\boldsymbol{\Sigma}_k \in \R^{k \times k}$, and $\mV_k\transpose \in \R^{k \times 2048}$. We can absorb the diagonal matrix $\boldsymbol{\Sigma}_k$ into either factor, giving $\mW_1 = \mU_k \boldsymbol{\Sigma}_k \in \R^{512 \times k}$ and $\mW_2 = \mV_k\transpose \in \R^{k \times 2048}$.
The original matrix has $512 \times 2048 = 1{,}048{,}576$ parameters, while the compressed form has $512k + 2048k = 2560k$ parameters, yielding a compression ratio of $\frac{2560k}{1{,}048{,}576} = \frac{k}{409.6}$. For $k = 64$, we have $2560 \times 64 = 163{,}840$ parameters, achieving 84.4\% compression. For $k = 128$, we have $327{,}680$ parameters, achieving 68.8\% compression.
In terms of memory savings with 32-bit floats, the original matrix requires $1{,}048{,}576 \times 4 = 4{,}194{,}304$ bytes or approximately 4.0 MB. The compressed version with $k=64$ requires only $163{,}840 \times 4 = 655{,}360$ bytes or approximately 0.625 MB, saving 3.375 MB per layer. For a model with 100 such layers, this yields a total savings of 337.5 MB, significantly reducing memory footprint and enabling deployment on resource-constrained devices.
The relative approximation error for the rank-$k$ approximation is $\norm{\mW - \mW_k}_F / \norm{\mW}_F = \sqrt{\sum_{i > k} \sigma_i^2} \,/\, \sqrt{\sum_i \sigma_i^2}$.
Typical results for transformer feed-forward layers:
| Rank $k$ | Compression | Relative Error | Accuracy Drop |
|---|---|---|---|
| 256 | 50\% | 0.05 | $<$0.1\% |
| 128 | 75\% | 0.12 | 0.3\% |
| 64 | 87.5\% | 0.25 | 1.2\% |
| 32 | 93.75\% | 0.45 | 3.5\% |
Sweet spot: 50-75\% compression with minimal accuracy loss.
SVD in Modern Architectures
Low-rank matrix decompositions are also central to modern parameter-efficient fine-tuning methods such as LoRA, which adds low-rank updates $\Delta \mW = \mB\mA$ (with $\mB \in \R^{d \times r}$, $\mA \in \R^{r \times k}$, $r \ll \min(d,k)$) to frozen pre-trained weights, achieving $>$99\% parameter reduction. See Chapter~20 for details.
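A minimal sketch of the LoRA idea (the dimensions and rank below are illustrative; with this particular $r$ the reduction is about 98\% rather than the $>$99\% achievable at larger scales):

```python
import numpy as np

d, k, r = 512, 2048, 8             # illustrative dimensions; r << min(d, k)

W = np.random.randn(d, k)          # frozen pre-trained weights
B = np.zeros((d, r))               # LoRA factors: only these are trained
A = np.random.randn(r, k)          # (B initialized to zero so dW starts at 0)

def lora_forward(x):
    # x: (batch, d) -> (batch, k); B @ A is the low-rank update dW
    return x @ (W + B @ A)

full_params = d * k                # 1,048,576
lora_params = d * r + r * k        # 20,480
print(lora_params / full_params)   # ~0.0195 -> ~98% fewer trainable params
```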
import torch
# Original weight matrix
W = torch.randn(512, 2048)
# Compute SVD
U, S, Vt = torch.linalg.svd(W, full_matrices=False)
# Rank-k approximation
k = 64
W_compressed = U[:, :k] @ torch.diag(S[:k]) @ Vt[:k, :]
# Factored form for efficient computation
W1 = U[:, :k] @ torch.diag(S[:k]) # 512 x 64
W2 = Vt[:k, :] # 64 x 2048
# Verify approximation
error = torch.norm(W - W_compressed, p='fro')
relative_error = error / torch.norm(W, p='fro')
print(f"Relative error: {relative_error:.4f}")
# Memory comparison
original_params = W.numel()
compressed_params = W1.numel() + W2.numel()
compression_ratio = compressed_params / original_params
print(f"Compression: {(1-compression_ratio)*100:.1f}%")
Norms and Distance Metrics
Norms are used in regularization to prevent overfitting by penalizing large weights.
import torch
# Vector norms
x = torch.tensor([3.0, 4.0])
l2_norm = torch.norm(x, p=2) # 5.0
l1_norm = torch.norm(x, p=1) # 7.0
# Matrix Frobenius norm
W = torch.randn(256, 784)
frob_norm = torch.norm(W, p='fro')
Practical Deep Learning Examples
Embedding Layers and Memory Requirements
For a batch of $B = 32$ sequences of length $n = 2048$, the input consists of integer token IDs in $\R^{32 \times 2048}$, which the embedding layer transforms into dense representations in $\R^{32 \times 2048 \times 12288}$, requiring $32 \times 2048 \times 12288 \times 4 = 3{,}221{,}225{,}472$ bytes or approximately 3.0 GB of memory. This demonstrates why large batch sizes and long sequences quickly exhaust GPU memory, necessitating techniques like gradient checkpointing and mixed-precision training.
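The memory arithmetic above is worth packaging as a helper (a sketch assuming FP32 activations):

```python
def embedding_activation_bytes(B, n, d_model, bytes_per_float=4):
    """Memory for the embedded batch of shape (B, n, d_model)."""
    return B * n * d_model * bytes_per_float

# The GPT-3-scale setting from the text: B=32, n=2048, d_model=12288
bytes_needed = embedding_activation_bytes(32, 2048, 12288)
print(bytes_needed)                 # 3221225472 bytes, i.e. ~3.0 GB
```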
Complete Transformer Layer Analysis
Understanding the full computational profile of a transformer layer---parameters, FLOPs, memory, and hardware utilization---is essential for practical deep learning engineering. Section~[ref] provides a comprehensive, self-contained worked analysis of BERT-base, the standard baseline configuration used as a running example throughout this textbook.
Common Dimension Errors and Debugging
When a shape error occurs, use `print(x.shape)` to verify actual dimensions against expected values. Check whether the batch dimension is present (is it $\R^{B \times \ldots}$ or just $\R^{\ldots}$?), verify the sequence length dimension (is it $\R^{B \times n \times d}$ or $\R^{B \times d \times n}$?), confirm that matrix multiplication has compatible inner dimensions, and watch for unintentional broadcasting that may hide shape mismatches. These systematic checks quickly identify the source of dimension errors and guide appropriate fixes.
import torch
# Problem: Shape mismatch in matrix multiplication
Q = torch.randn(32, 512, 768) # [batch, seq_len, d_model]
K = torch.randn(32, 512, 768)
# Wrong: Q @ K gives error (768 != 512)
# scores = Q @ K # Error!
# Correct: Transpose last two dimensions of K
scores = Q @ K.transpose(-2, -1) # [32, 512, 512]
# Problem: Missing batch dimension
x = torch.randn(512, 768) # Missing batch dimension
W = torch.randn(768, 3072)
# Add batch dimension
x = x.unsqueeze(0) # [1, 512, 768]
output = x @ W # [1, 512, 3072]
# Problem: Broadcasting confusion
x = torch.randn(32, 512, 768)
bias = torch.randn(768)
# This works due to broadcasting
output = x + bias # bias broadcasts to [32, 512, 768]
# Explicit broadcasting (clearer)
output = x + bias.view(1, 1, 768)
BERT-base: A Canonical Worked Analysis
BERT-base has become the standard baseline configuration for transformer analysis. Its dimensions are adopted throughout this textbook as a running example for parameter counts, memory estimates, and computational costs. This section provides a self-contained reference; subsequent chapters cite these results rather than re-derive them.
Architecture Specification
| Hyperparameter | Value |
|---|---|
| Layers | $N = 12$ |
| Model dimension | $d_{\text{model}} = 768$ |
| Attention heads | $h = 12$ \quad ($d_k = d_v = 64$ per head) |
| Feed-forward dimension | $d_{ff} = 3072$ \quad ($4 \times d_{\text{model}}$) |
| Vocabulary size | $V = 30{,}000$ (WordPiece) |
| Maximum sequence length | $n_{\max} = 512$ |
Unless stated otherwise, worked examples use batch size $B = 32$ and full sequence length $n = 512$.
Parameter Count
Per encoder layer. Each layer has three components: multi-head attention, with $4 \times (768^2 + 768) = 2{,}362{,}368$ parameters for the query, key, value, and output projections; the feed-forward network, with $2 \times 768 \times 3072 + 3072 + 768 = 4{,}722{,}432$ parameters; and two LayerNorms, with $2 \times 2 \times 768 = 3{,}072$ parameters. The per-layer total is $7{,}087{,}872 \approx 7.1$M parameters.
The feed-forward network contains approximately twice as many parameters as the attention mechanism ($4.7$M vs $2.4$M), despite attention being conceptually more complex. This is because the $4\times$ expansion to $d_{ff}$ creates two large weight matrices.
Complete model: the token, position, and segment embeddings contribute $(30{,}000 + 512 + 2) \times 768 \approx 23.4$M parameters, and the 12 encoder layers contribute $12 \times 7.1\text{M} \approx 85$M; with the embedding LayerNorm and pooler included, the total is approximately 110M parameters.
Embeddings account for $\sim$21\% of total parameters ($23$M out of $110$M), while the transformer layers account for $\sim$78\%. This ratio shifts with vocabulary size---a 50K-token vocabulary would push embeddings to 35\% of total parameters.
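The parameter counting can be scripted; this is a sketch that omits the embedding LayerNorm, pooler, and output head, which together account for the remaining $\sim$1.5M of the cited 110M:

```python
def bert_base_params(d=768, d_ff=3072, V=30000, n_max=512, layers=12):
    attn = 4 * (d * d + d)              # Q, K, V, and output projections (+biases)
    ffn = 2 * d * d_ff + d_ff + d       # two linear maps with biases
    norms = 2 * 2 * d                   # two LayerNorms (scale + shift each)
    per_layer = attn + ffn + norms
    embeddings = (V + n_max + 2) * d    # token + position + segment tables
    return per_layer, layers * per_layer + embeddings

per_layer, total = bert_base_params()
print(per_layer)   # 7087872  -> ~7.1M per layer
print(total)       # 108489216 -> ~108.5M before pooler/head terms
```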
Dimension Tracking Through One Layer
Input: $\mX \in \R^{32 \times 512 \times 768}$ (batch $\times$ sequence $\times$ model dimension)
Multi-Head Attention: $\mX$ is projected to queries, keys, and values and split into $h = 12$ heads of dimension $d_k = 64$, giving $\mQ, \mK, \mV \in \R^{32 \times 12 \times 512 \times 64}$. The attention scores $\mQ\mK\transpose$ have shape $\R^{32 \times 12 \times 512 \times 512}$; after the weighted sum over values and concatenation of heads, the output projection restores $\R^{32 \times 512 \times 768}$.
Feed-Forward Network: the first linear map expands to $\R^{32 \times 512 \times 3072}$, and the second projects back to $\R^{32 \times 512 \times 768}$, so the layer output matches the input shape.
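The dimension flow above can be traced in code. This sketch uses small stand-in dimensions ($B=2$, $n=8$, $d=16$, $h=4$) instead of the full BERT-base sizes, and unscaled softmax attention for brevity:

```python
import numpy as np

B, n, d, h = 2, 8, 16, 4
d_k = d // h
X = np.random.randn(B, n, d)

# Project to Q, K, V and split into heads: (B, n, d) -> (B, h, n, d_k)
Wq, Wk, Wv, Wo = (np.random.randn(d, d) for _ in range(4))
def split_heads(Z):
    return Z.reshape(B, n, h, d_k).transpose(0, 2, 1, 3)
Q, K, V = split_heads(X @ Wq), split_heads(X @ Wk), split_heads(X @ Wv)

scores = Q @ K.transpose(0, 1, 3, 2)                   # (B, h, n, n)
weights = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)
out = (weights @ V).transpose(0, 2, 1, 3).reshape(B, n, d) @ Wo
assert out.shape == (B, n, d)

# Feed-forward: expand to 4d, then project back
W1, W2 = np.random.randn(d, 4 * d), np.random.randn(4 * d, d)
ffn = np.maximum(out @ W1, 0) @ W2                     # ReLU in between
assert ffn.shape == (B, n, d)
```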
Activation Memory
Intermediate activations stored for backpropagation dominate training memory:
| Component (per layer) | Memory (FP32) |
|---|---|
| Q, K, V projections ($3 \times Bnd$) | 151 MB |
| Attention scores ($Bhn^2$) | 403 MB |
| Attention output ($Bnd$) | 50 MB |
| FFN intermediate ($Bn \cdot 4d$) | 201 MB |
| Per-layer total | 805 MB |
| 12 layers | 9.7 GB |
The attention scores alone ($B \times h \times n^2 \times 4$ bytes $= 403$~MB per layer) account for half of per-layer activation memory, illustrating the quadratic cost of self-attention. Doubling the sequence length to $n = 1024$ would quadruple the attention memory to 1.6~GB per layer.
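The quadratic term is easy to check programmatically (a sketch assuming FP32 activations):

```python
def attention_scores_bytes(B, h, n, bytes_per_float=4):
    """Memory for the (B, h, n, n) attention score tensor."""
    return B * h * n * n * bytes_per_float

print(attention_scores_bytes(32, 12, 512))    # 402653184 bytes, ~403 MB
print(attention_scores_bytes(32, 12, 1024))   # quadruples to ~1.6 GB
```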
FLOPs Analysis
Counting each multiply-accumulate as 2 FLOPs:
Self-Attention (per layer): the four projections cost $8Bn \cdot d_{\text{model}}^2 = 8 \times 16{,}384 \times 768^2 \approx 77.3$ GFLOPs, and the score and weighted-sum matmuls ($\mQ\mK\transpose$ and $\mA\mV$) cost $4Bn^2 d_{\text{model}} = 4 \times 32 \times 512^2 \times 768 \approx 25.8$ GFLOPs, for roughly $103$ GFLOPs in total.
Feed-Forward Network (per layer): $4Bn \cdot d_{\text{model}} \cdot d_{ff} \approx 154.6$ GFLOPs.
Totals: $\approx 258$ GFLOPs per layer, and $\approx 3.1$ TFLOPs across all 12 layers (for $B = 32$, $n = 512$).
The FFN accounts for 60\% of per-layer FLOPs, while the attention score computations ($\mQ\mK\transpose$ and $\mA\mV$) contribute only 10\%. For short sequences, optimizing FFN yields the largest gains; for long sequences, attention's $O(n^2)$ term becomes dominant.
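The 60\%/10\% split can be reproduced in a few lines, using the per-layer FLOP formulas from this section:

```python
def layer_flops(B, n, d, d_ff):
    proj = 8 * B * n * d * d          # Q, K, V, and output projections
    scores = 4 * B * n * n * d        # Q K^T and A V matmuls
    ffn = 4 * B * n * d * d_ff        # two feed-forward matmuls
    return proj, scores, ffn

# BERT-base: B=32, n=512, d_model=768, d_ff=3072
proj, scores, ffn = layer_flops(32, 512, 768, 3072)
total = proj + scores + ffn
print(ffn / total, scores / total)    # 0.6 0.1
```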
Training Memory Budget
| Component | Memory |
|---|---|
| Parameters (FP32) | 440 MB |
| Gradients (FP32) | 440 MB |
| Adam optimizer states ($m$, $v$) | 880 MB |
| Activations (12 layers) | $\approx$9.7 GB |
| Embeddings + overhead | $\approx$3.6 GB |
| Total | $\approx$15.1 GB |
Activations dominate, at roughly two-thirds of total memory, motivating gradient checkpointing (recompute activations during the backward pass, trading 20--30\% slower training for 50--70\% memory reduction). This is why training BERT-base requires GPUs with at least 16~GB of memory.
Hardware Timing (NVIDIA A100)
The A100 provides 312~TFLOPS FP16 with Tensor Cores and 1.6~TB/s memory bandwidth. At 70\% utilization ($\approx 218$~TFLOPS effective), the $\approx 3.1$~TFLOPs forward pass for a batch of 32 sequences takes roughly $3.1/218 \approx 14$~ms.
Memory bandwidth check: Loading 110M parameters ($\times$ 4 bytes $= 440$~MB) takes $440/1600 \approx 0.28$~ms, comparable to compute time. For small batch sizes, BERT-base becomes memory-bandwidth bound rather than compute-bound; larger batches amortize parameter loading across more computation.
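This compute-versus-bandwidth reasoning is a simple roofline model, which we can sketch as a function (peak figures default to the A100 values above):

```python
def roofline_time(flops, bytes_moved, peak_flops=312e12, bandwidth=1.6e12):
    """Lower-bound execution time: the slower of compute and memory."""
    compute_time = flops / peak_flops
    memory_time = bytes_moved / bandwidth
    bound = "compute" if compute_time > memory_time else "memory"
    return max(compute_time, memory_time), bound

# The 4096x4096 FP32 matmul from earlier in the chapter
t, bound = roofline_time(2 * 4096**3, 3 * 4096**2 * 4)
print(f"{t * 1e3:.2f} ms, {bound}-bound")   # 0.44 ms, compute-bound
```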
Exercises
- The dot product $\vx\transpose \vy$
- The L2 norms $\norm{\vx}_2$ and $\norm{\vy}_2$
- The cosine similarity between $\vx$ and $\vy$
- Calculate the number of parameters in the two linear transformations
- If processing a batch of $B = 32$ sequences of length $n = 512$, what are the dimensions of the input tensor?
- How many floating-point operations (FLOPs) are required for one forward pass through this layer?
- Express the number of parameters as a function of $r$
- What value of $r$ achieves 75\% compression?
- What is the memory savings in MB (assuming 32-bit floats)?
- What are the dimensions of the output $\mA$?
- Calculate the total FLOPs required
- Compute the arithmetic intensity (FLOPs per byte transferred, assuming 32-bit floats)
- Is this operation compute-bound or memory-bound on a GPU with 312 TFLOPS and 1.6 TB/s bandwidth?
- How many parameters does the embedding matrix contain?
- What is the memory requirement in MB for 32-bit floats?
- For a batch of $B = 64$ sequences of length $n = 256$, what is the memory required for the embedded representations?
- If we use LoRA with rank $r = 16$ to adapt the embeddings, how many trainable parameters are needed?
- Computing $(\mA\mB)\vx$ where $\mA \in \R^{m \times n}$, $\mB \in \R^{n \times p}$, $\vx \in \R^p$
- Computing $\mA(\mB\vx)$
- Calculate the total FLOPs
- Calculate the memory transferred (assuming matrices are read once and result written once)
- Compute the arithmetic intensity
- If the GPU has 100 TFLOPS compute and 900 GB/s memory bandwidth, what is the theoretical execution time assuming perfect utilization?
- Which resource (compute or memory) is the bottleneck?
Solutions
Full solutions for all exercises are available at \url{https://deeplearning.hofkensvermeulen.be}.
(1) Dot product:
(2) L2 norms:
(3) Cosine similarity:
The negative cosine similarity indicates the vectors point in somewhat opposite directions.
(1) Parameters in feed-forward network:
- First linear: $\mW_1 \in \R^{3072 \times 768}$ has $3072 \times 768 = 2{,}359{,}296$ weights, plus bias $\vb_1 \in \R^{3072}$ has $3{,}072$ parameters
- Second linear: $\mW_2 \in \R^{768 \times 3072}$ has $768 \times 3072 = 2{,}359{,}296$ weights, plus bias $\vb_2 \in \R^{768}$ has $768$ parameters
- Total: $2{,}359{,}296 + 3{,}072 + 2{,}359{,}296 + 768 = 4{,}722{,}432$ parameters
(2) Input tensor dimensions: For batch size $B = 32$ and sequence length $n = 512$, the input tensor has shape $\R^{32 \times 512 \times 768}$, i.e., $(B, n, d_{\text{model}})$.
(3) FLOPs for forward pass:
- First transformation: $2 \times (Bn) \times d_{\text{model}} \times d_{ff} = 2 \times (32 \times 512) \times 768 \times 3072 = 77{,}309{,}411{,}328$ FLOPs
- Second transformation: $2 \times (Bn) \times d_{ff} \times d_{\text{model}} = 77{,}309{,}411{,}328$ FLOPs
- Total: $154{,}618{,}822{,}656 \approx 154.6$ GFLOPs
We have $\mA\vv_1 = \lambda_1 \vv_1$ and $\mA\vv_2 = \lambda_2 \vv_2$ with $\lambda_1 \neq \lambda_2$.
Consider $\vv_1\transpose \mA \vv_2$: evaluating it two ways gives $\vv_1\transpose \mA \vv_2 = \lambda_2 \vv_1\transpose \vv_2$ and $\vv_1\transpose \mA \vv_2 = \vv_1\transpose \mA\transpose \vv_2 = (\mA\vv_1)\transpose \vv_2 = \lambda_1 \vv_1\transpose \vv_2$,
where we used $\mA\transpose = \mA$ in the second line. Therefore: $\lambda_1 \vv_1\transpose \vv_2 = \lambda_2 \vv_1\transpose \vv_2$.
Rearranging: $(\lambda_1 - \lambda_2)\, \vv_1\transpose \vv_2 = 0$.
Since $\lambda_1 \neq \lambda_2$, we must have $\vv_1\transpose \vv_2 = 0$, proving orthogonality.
(1) Parameters as function of $r$: The factored form requires:
- $\mW_1 = \mU_r \boldsymbol{\Sigma}_r \in \R^{1024 \times r}$: $1024r$ parameters
- $\mW_2 = \mV_r\transpose \in \R^{r \times 4096}$: $4096r$ parameters
- Total: $1024r + 4096r = 5120r$ parameters
(2) Value of $r$ for 75\% compression: Original parameters: $1024 \times 4096 = 4{,}194{,}304$
For 75\% compression, we want 25\% of the original: $5120r = 0.25 \times 4{,}194{,}304 = 1{,}048{,}576$, giving $r = 204.8$; in practice we would round to $r = 205$ (the figures below use the exact 25\% budget).
(3) Memory savings:
- Original: $4{,}194{,}304 \times 4 \text{ bytes} = 16{,}777{,}216 \text{ bytes} \approx 16.0$ MB
- Compressed: $1{,}048{,}576 \times 4 \text{ bytes} = 4{,}194{,}304 \text{ bytes} \approx 4.0$ MB
- Savings: $16.0 - 4.0 = 12.0$ MB
(1) Output dimensions: $\mA = \mQ\mK\transpose \in \R^{16 \times 1024 \times 1024}$
(2) Total FLOPs: For each batch element, we compute $\mQ_b \mK_b\transpose$ where $\mQ_b, \mK_b \in \R^{1024 \times 64}$, costing $2 \times 1024 \times 64 \times 1024$ FLOPs; across 16 batch elements: $16 \times 2 \times 1024^2 \times 64 = 2{,}147{,}483{,}648 \approx 2.15$ GFLOPs
(3) Arithmetic intensity: Memory transferred:
- Read $\mQ$: $16 \times 1024 \times 64 \times 4 = 4{,}194{,}304$ bytes
- Read $\mK$: $16 \times 1024 \times 64 \times 4 = 4{,}194{,}304$ bytes
- Write $\mA$: $16 \times 1024 \times 1024 \times 4 = 67{,}108{,}864$ bytes
- Total: $75{,}497{,}472$ bytes $\approx 72$ MB
Arithmetic intensity: $\frac{2{,}147{,}483{,}648 \text{ FLOPs}}{75{,}497{,}472 \text{ bytes}} \approx 28.4$ FLOP/byte
(4) Compute-bound or memory-bound:
- Compute time: $\frac{2.15 \text{ GFLOPs}}{312 \text{ TFLOPs}} \approx 6.9$ microseconds
- Memory time: $\frac{72 \text{ MB}}{1{,}600{,}000 \text{ MB/s}} \approx 45$ microseconds
- The operation is memory-bound by a factor of $45/6.9 \approx 6.5\times$
(1) Number of parameters: the embedding matrix $\mE \in \R^{32{,}000 \times 512}$ contains $32{,}000 \times 512 = 16{,}384{,}000$ parameters
(2) Memory requirement: $16{,}384{,}000 \times 4 = 65{,}536{,}000$ bytes $\approx 65.5$ MB
(3) Memory for embedded representations: For batch $B = 64$, sequence length $n = 256$: $64 \times 256 \times 512 \times 4 = 33{,}554{,}432$ bytes $\approx 33.6$ MB
(4) LoRA trainable parameters: LoRA adds two matrices, $\mB \in \R^{d \times r} = \R^{512 \times 16}$ and $\mA \in \R^{r \times V} = \R^{16 \times 32{,}000}$: $512 \times 16 + 16 \times 32{,}000 = 520{,}192$ trainable parameters
This is only $\frac{520{,}192}{16{,}384{,}000} \approx 3.2\%$ of the original parameters!
Option 1: $(\mA\mB)\vx$
- Compute $\mC = \mA\mB \in \R^{512 \times 512}$: $2 \times 512 \times 2048 \times 512 = 1{,}073{,}741{,}824$ FLOPs
- Compute $\mC\vx \in \R^{512}$: $2 \times 512 \times 512 = 524{,}288$ FLOPs
- Total: $1{,}074{,}266{,}112$ FLOPs
Option 2: $\mA(\mB\vx)$
- Compute $\vy = \mB\vx \in \R^{2048}$: $2 \times 2048 \times 512 = 2{,}097{,}152$ FLOPs
- Compute $\mA\vy \in \R^{512}$: $2 \times 512 \times 2048 = 2{,}097{,}152$ FLOPs
- Total: $4{,}194{,}304$ FLOPs
Efficiency comparison:
Option 2 is 256× more efficient! This demonstrates the importance of operation ordering in linear algebra.
(1) Total FLOPs: $2 \times 2048^3 = 17{,}179{,}869{,}184 \approx 17.2$ GFLOPs
(2) Memory transferred:
- Read $\mA$: $2048^2 \times 4 = 16{,}777{,}216$ bytes
- Read $\mB$: $2048^2 \times 4 = 16{,}777{,}216$ bytes
- Write $\mC$: $2048^2 \times 4 = 16{,}777{,}216$ bytes
- Total: $50{,}331{,}648$ bytes $\approx 48$ MB
(3) Arithmetic intensity: $\frac{17{,}179{,}869{,}184 \text{ FLOPs}}{50{,}331{,}648 \text{ bytes}} \approx 341$ FLOP/byte
(4) Theoretical execution time:
- Compute time: $\frac{17.2 \text{ GFLOPs}}{100 \text{ TFLOPs}} = 0.172$ ms
- Memory time: $\frac{48 \text{ MB}}{900{,}000 \text{ MB/s}} = 0.053$ ms
- Execution time: $\max(0.172, 0.053) = 0.172$ ms
(5) Bottleneck: The operation is compute-bound since compute time (0.172 ms) exceeds memory time (0.053 ms). The high arithmetic intensity (341 FLOP/byte) makes this operation well-suited for GPU acceleration.