# Notation and Conventions

This book uses consistent notation throughout to aid readability and comprehension.
## General Mathematical Notation
| Symbol | Meaning |
|---|---|
| $a, b, c$ | Scalars (lowercase italic) |
| $n, m, d$ | Integer scalars (dimensions, indices) |
| $\vx, \vy, \vz$ | Vectors (lowercase bold) |
| $\mA, \mB, \mC$ | Matrices (uppercase bold) |
| $\mathcal{X}, \mathcal{D}$ | Sets (uppercase calligraphic) |
| $f, g, h$ | Functions (lowercase italic) |
| $\R, \N, \Z, \C$ | Number sets (blackboard bold) |
## Linear Algebra
| Symbol | Meaning |
|---|---|
| $\vx \in \R^n$ | Vector $\vx$ with $n$ components |
| $\mA \in \R^{m \times n}$ | Matrix $\mA$ with $m$ rows and $n$ columns |
| $a_{i,j}$ or $[\mA]_{i,j}$ | Element in row $i$, column $j$ of matrix $\mA$ |
| $\mA\transpose$ | Transpose of matrix $\mA$ |
| $\mA^{-1}$ | Inverse of matrix $\mA$ |
| $\mA \mB$ | Matrix multiplication |
| $\mA \odot \mB$ | Element-wise (Hadamard) product |
| $\vx\transpose \vy$ | Dot product of vectors $\vx$ and $\vy$ |
| $\norm{\vx}_2$ | Euclidean (L2) norm |
| $\norm{\vx}_1$ | L1 norm |
| $\norm{\mA}_F$ | Frobenius norm of matrix $\mA$ |
| $\text{tr}(\mA)$ | Trace of matrix $\mA$ |
| $\det(\mA)$ | Determinant of matrix $\mA$ |
| $\mI$ or $\mI_n$ | Identity matrix ($n \times n$ when subscripted) |
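The linear-algebra operations above map directly onto array-library primitives. As a quick illustration (NumPy is our choice here; the notation itself is framework-agnostic):

```python
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[0.0, 1.0], [1.0, 0.0]])
x = np.array([1.0, 2.0])
y = np.array([3.0, 4.0])

At = A.T                        # A^T: transpose
AB = A @ B                      # matrix multiplication A B
hadamard = A * B                # A ⊙ B: element-wise (Hadamard) product
dot = x @ y                     # x^T y: dot product
l2 = np.linalg.norm(x, 2)       # ||x||_2: Euclidean norm
l1 = np.linalg.norm(x, 1)       # ||x||_1
fro = np.linalg.norm(A, "fro")  # ||A||_F: Frobenius norm
trace = np.trace(A)             # tr(A): sum of diagonal entries
I = np.eye(2)                   # identity matrix I_2

assert np.allclose(A @ I, A)    # multiplying by I leaves A unchanged
assert np.isclose(dot, 11.0)    # 1*3 + 2*4 = 11
```

Note that `A * B` is the element-wise product, not matrix multiplication; the `@` operator performs the latter.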
## Deep Learning Specific
| Symbol | Meaning |
|---|---|
| $\vx^{(i)}$ | $i$-th training example |
| $\vx_t$ | Input at time step $t$ |
| $\vh^{(\ell)}$ | Hidden state at layer $\ell$ |
| $\mW^{(\ell)}$ | Weight matrix at layer $\ell$ |
| $\vb^{(\ell)}$ | Bias vector at layer $\ell$ |
| $\sigma(\cdot)$ | Activation function (generic) |
| $\text{ReLU}(x)$ | Rectified Linear Unit: $\max(0, x)$ |
| $\text{softmax}(\vx)$ | Softmax function |
| $N$ or $B$ | Batch size |
| $d_{\text{model}}$ | Model dimension |
| $d_k, d_v$ | Dimension of keys and values |
| $h$ | Number of attention heads |
| $L$ | Number of layers |
| $V$ | Vocabulary size |
| $n$ or $T$ | Sequence length |
| $\eta$ | Learning rate |
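The two named activation functions can be sketched in a few lines of NumPy. This is an illustrative sketch, not the book's reference implementation; the max-shift in `softmax` is the standard numerical-stability trick:

```python
import numpy as np

def relu(x):
    """ReLU(x) = max(0, x), applied element-wise."""
    return np.maximum(0.0, x)

def softmax(x):
    """softmax(x)_i = exp(x_i) / sum_j exp(x_j).

    Shifting by max(x) leaves the result unchanged (the factor
    exp(-max) cancels in numerator and denominator) but prevents
    overflow for large inputs.
    """
    z = x - np.max(x)
    e = np.exp(z)
    return e / e.sum()

v = np.array([1.0, -2.0, 3.0])
p = softmax(v)
assert np.isclose(p.sum(), 1.0)   # softmax outputs a probability vector
assert np.all(relu(v) >= 0.0)     # ReLU clips negatives to zero
```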
## Dimension Conventions
Throughout this book, we explicitly annotate dimensions:
- For $\mW \in \R^{m \times n}$: $m$ rows, $n$ columns
- Batch dimensions listed first: $\mX \in \R^{B \times n \times d}$
- Superscripts for layer indices: $\vh^{(\ell)}$
- Subscripts for time/position indices: $\vx_t$
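The batch-first shape convention can be checked concretely. In this sketch the sizes `B`, `n`, `d`, `m` are arbitrary illustrative values, not from the book:

```python
import numpy as np

# Batch dimension first, then sequence length, then feature dimension.
B, n, d, m = 4, 10, 16, 8

X = np.zeros((B, n, d))    # X in R^{B x n x d}: a batch of B sequences
W = np.random.randn(d, m)  # W in R^{d x m}: d rows, m columns

# Matrix multiplication acts on the trailing axis, so every position
# in every sequence is mapped from d features to m features.
H = X @ W
assert H.shape == (B, n, m)
```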