Feed-Forward Neural Networks
Chapter Overview
Feed-forward neural networks are the foundation of deep learning. These networks transform inputs through sequences of linear and nonlinear operations to produce outputs. This chapter develops the architecture, training, and theory of feed-forward networks, establishing concepts that extend to all modern deep learning models including transformers.
Learning Objectives
After completing this chapter, you will be able to:
- Understand the architecture of feed-forward neural networks
- Implement forward and backward passes through MLPs
- Apply appropriate activation functions and understand their properties
- Initialize network weights properly to enable training
- Apply regularization techniques to prevent overfitting
- Understand the universal approximation theorem
From Linear Models to Neural Networks
The Perceptron
Multi-Class Classification: Softmax Regression
Multi-Layer Perceptrons
\begin{tikzpicture}
\foreach \i in {1,2,3} { \node[neuron] (x\i) at (0, -\i*1.2) {$x_{\i}$}; }
\node[layer] at (0, 0.3) {Input}; \node[font=\footnotesize] at (0, -4.2) {$\R^3$};
\foreach \i in {1,2,3,4} { \node[neuron, fill=blue!10] (h\i) at (3, -\i*1) {$h_{\i}$}; } \node[layer] at (3, 0.3) {Hidden}; \node[font=\footnotesize] at (3, -4.5) {$\R^4$};
\foreach \i in {1,2} { \node[neuron, fill=green!10] (y\i) at (6, -\i*1.5-0.5) {$y_{\i}$}; } \node[layer] at (6, 0.3) {Output}; \node[font=\footnotesize] at (6, -4.2) {$\R^2$};
\foreach \i in {1,2,3} { \foreach \j in {1,2,3,4} { \draw[arrow, gray!50] (x\i) -- (h\j); } }
\foreach \i in {1,2,3,4} { \foreach \j in {1,2} { \draw[arrow, gray!50] (h\i) -- (y\j); } }
\draw[arrow, red, very thick] (x1) -- (h2); \draw[arrow, red, very thick] (h2) -- (y1);
\node[font=\footnotesize] at (1.5, -1.5) {$\mW^{(1)}$}; \node[font=\footnotesize] at (4.5, -1.5) {$\mW^{(2)}$};
\end{tikzpicture}
- Input: $\vx \in \R^{784}$ (flattened $28 \times 28$ image)
- Hidden 1: $\vh^{(1)} \in \R^{256}$ with ReLU
- Hidden 2: $\vh^{(2)} \in \R^{128}$ with ReLU
- Output: $\vz^{(3)} \in \R^{10}$ with softmax
Parameter count: $200{,}960 + 32{,}896 + 1{,}290 = 235{,}146$ parameters.
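The parameter count above can be verified directly from the layer dimensions; a minimal Python sketch:

```python
# Parameter count for the 784 -> 256 -> 128 -> 10 MLP above.
# Each fully-connected layer has (n_in * n_out) weights plus n_out biases.
def layer_params(n_in, n_out):
    return n_in * n_out + n_out

sizes = [784, 256, 128, 10]
counts = [layer_params(a, b) for a, b in zip(sizes, sizes[1:])]
print(counts)       # [200960, 32896, 1290]
print(sum(counts))  # 235146
```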
Why Depth Matters
Without nonlinear activations, multiple layers collapse to a single linear transformation. With nonlinearities, deep networks can learn complex functions efficiently.
Memory and Computation Analysis
For a single fully-connected layer computing $\vy = \mW\vx + \vb$ where $\mW \in \R^{m \times n}$, the forward pass requires approximately $2mn$ FLOPs and the backward pass requires approximately $4mn$ FLOPs (computing gradients with respect to inputs, weights, and biases). This gives a useful rule of thumb: one training step requires approximately $6\times$ as many FLOPs as the model has parameters---$2\times$ for the forward pass and $4\times$ for the backward pass. This ratio holds for fully-connected layers but varies with architecture; convolutional layers have much higher FLOPs per parameter due to weight sharing, while embedding layers have zero FLOPs (table lookups only).
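As a sanity check on this rule of thumb, the following sketch (ignoring the comparatively small bias terms) recovers the $6\times$ ratio from the per-layer counts:

```python
# Forward/backward FLOP estimate for a fully-connected layer y = Wx + b,
# using the ~2mn forward / ~4mn backward rule of thumb from the text.
def layer_flops(m, n):
    params = m * n        # weight count, ignoring the m bias terms
    forward = 2 * m * n   # one multiply + one add per weight
    backward = 4 * m * n  # gradients w.r.t. inputs, weights, and biases
    return params, forward + backward

params, train_flops = layer_flops(256, 784)
print(train_flops / params)  # 6.0 -> "~6x params" FLOPs per training step
```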
During training, intermediate activations must be stored for the backward pass, and these often consume more memory than the model parameters. Activation memory scales linearly with batch size while parameter memory remains constant, so large batch sizes eventually become memory-limited. For transformer models, attention score matrices of size $B \times h \times n \times n$ dominate activation memory due to their $O(n^2)$ scaling with sequence length, dwarfing the $O(n)$ scaling of feed-forward activations.
Transformer feed-forward networks use a standard two-layer architecture with $4\times$ expansion: projecting from model dimension $d$ to $4d$ with a GELU activation, then back to $d$. For BERT-base ($d = 768$, $d_{ff} = 3072$), the FFN contributes 4.7M parameters per layer---roughly twice the attention mechanism's 2.4M---and accounts for $\sim$60\% of per-layer FLOPs (see Section~[ref] for the complete breakdown).
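The per-layer breakdown can be reproduced from the dimensions alone; a sketch (bias terms included, layer-norm parameters omitted):

```python
# Per-layer parameter counts for BERT-base (d = 768, d_ff = 3072).
d, d_ff = 768, 3072

# FFN: d -> 4d expansion with GELU, then 4d -> d projection.
ffn = (d * d_ff + d_ff) + (d_ff * d + d)
# Attention: Q, K, V, and output projections, each d x d plus bias.
attn = 4 * (d * d + d)

print(ffn)   # 4722432 (~4.7M)
print(attn)  # 2362368 (~2.4M), roughly half the FFN's count
```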
Activation Functions
Computational Cost of Activation Functions
GELU is 4$\times$ more expensive than ReLU in arithmetic operations, but in practice adds only 1--2\% to total training time because both operations are memory-bandwidth-bound on modern GPUs. Modern frameworks provide fused linear-activation kernels that eliminate intermediate memory traffic, providing 1.3--1.5$\times$ speedup for combined operations.
Why GELU is Preferred in Transformers
Despite its higher computational cost, GELU has become the standard activation function for transformer models, used in BERT, GPT-2, GPT-3, T5, and most modern language models. This preference is driven by empirical performance rather than computational efficiency: models trained with GELU consistently achieve better final accuracy than those trained with ReLU, particularly on language understanding tasks.
The theoretical motivation for GELU is that it provides a smoother approximation to the ReLU function, with non-zero gradients for negative inputs. While ReLU has gradient zero for all $z < 0$, GELU has small but non-zero gradients in this region, allowing the network to recover from neurons that have been pushed into the negative regime. This property is particularly valuable in deep networks where gradient flow through many layers can be fragile. For a 24-layer BERT-large model, the probability that a gradient signal survives through all layers is significantly higher with GELU than with ReLU, as GELU never completely blocks gradient flow.
Empirically, BERT-base trained with GELU achieves 84.6\% accuracy on the MNLI natural language inference task, compared to 83.9\% with ReLU---a 0.7 percentage point improvement that is statistically significant and practically meaningful. For GPT-2, the perplexity on the WebText validation set is 18.3 with GELU compared to 19.1 with ReLU, indicating better language modeling performance. These improvements justify the 1--2\% computational overhead of GELU, as the improved model quality translates to better downstream task performance and potentially reduced training time to reach a target accuracy.
The success of GELU has inspired variants like Swish and Mish that share the property of smooth, non-zero gradients everywhere. Swish, defined as $\text{Swish}(z) = z \cdot \sigma(z)$, performs similarly to GELU on most tasks and is used in efficient architectures such as EfficientNet. Mish, defined as $\text{Mish}(z) = z \cdot \tanh(\text{softplus}(z))$, provides slightly better performance than GELU on some vision tasks but has higher computational cost. The landscape of activation functions continues to evolve, but GELU remains the standard for language models due to its strong empirical performance and reasonable computational cost.
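For concreteness, here are reference implementations of these activations in plain Python; the GELU uses the tanh approximation popularized by BERT and GPT-2, and the function names are illustrative:

```python
import math

def relu(z):
    return max(z, 0.0)

def gelu(z):
    # tanh approximation of GELU, as used in BERT and GPT-2
    return 0.5 * z * (1.0 + math.tanh(math.sqrt(2.0 / math.pi)
                                      * (z + 0.044715 * z ** 3)))

def swish(z):
    return z / (1.0 + math.exp(-z))  # z * sigmoid(z)

def mish(z):
    return z * math.tanh(math.log1p(math.exp(z)))  # z * tanh(softplus(z))

# Unlike ReLU, the smooth activations remain non-zero (with non-zero
# gradients) for moderately negative inputs:
for f in (relu, gelu, swish, mish):
    print(f.__name__, round(f(-1.0), 4))
```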
Universal Approximation Theorem
Caveat: The theorem says nothing about how many hidden units are needed, how to find the weights, or whether the learned function generalizes. Deep networks are often far more parameter-efficient than wide, shallow networks.
Weight Initialization
Variance Preservation Through Layers
Proper weight initialization ensures that activations and gradients maintain reasonable magnitudes as they propagate through deep networks. Consider a linear layer $\vy = \mW\vx$ where $\vx \in \R^{n_{\text{in}}}$ has zero mean and unit variance, and weights $w_{ij}$ are independent with zero mean and variance $\sigma_w^2$. The variance of each output element is:
\[
\text{Var}(y_i) = \text{Var}\left(\sum_{j=1}^{n_{\text{in}}} w_{ij} x_j\right) = \sum_{j=1}^{n_{\text{in}}} \text{Var}(w_{ij})\,\text{Var}(x_j) = n_{\text{in}}\,\sigma_w^2
\]
To preserve variance ($\text{Var}(y_i) = 1$), we need $\sigma_w^2 = 1/n_{\text{in}}$. Xavier initialization uses $\sigma_w^2 = 2/(n_{\text{in}} + n_{\text{out}})$ to balance forward and backward pass variance preservation. For ReLU activations, which zero out half the activations on average, He initialization compensates by using $\sigma_w^2 = 2/n_{\text{in}}$, doubling the variance to maintain signal strength after the nonlinearity.
Without proper initialization, deep networks fail to train: with variance too large ($\sigma_w^2 = 1$), activations explode exponentially through layers; with variance too small ($\sigma_w^2 = 0.01$), they vanish to zero. For transformer models with GELU activations, He initialization or slight variants work well and are used universally in BERT, GPT, T5, and other modern architectures.
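The effect can be observed in simulation. The sketch below (hypothetical helper `simulate`, pure Python for clarity) pushes a random vector through a deep ReLU stack under He initialization versus a too-small $\sigma_w = 0.01$:

```python
import math
import random

# Track activation magnitude through a deep ReLU stack, comparing
# He initialization (Var = 2/n_in) against a too-small sigma_w = 0.01.
def simulate(depth=10, n=128, std=None, seed=0):
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]
    s = std if std is not None else math.sqrt(2.0 / n)  # He initialization
    for _ in range(depth):
        # Each output unit: random linear combination, then ReLU.
        x = [max(sum(rng.gauss(0, s) * xi for xi in x), 0.0)
             for _ in range(n)]
    return sum(v * v for v in x) / n  # mean squared final activation

print(simulate())          # He init: stays O(1)
print(simulate(std=0.01))  # too small: collapses toward zero
```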
Regularization
L2 Regularization
Add a penalty term to the loss:
\[
\mathcal{L}_{\text{reg}} = \mathcal{L} + \frac{\lambda}{2} \sum_{\ell} \norm{\mW^{(\ell)}}_F^2
\]
L2 regularization, also known as weight decay, penalizes large parameter values to prevent overfitting. The regularization term adds the squared Frobenius norm of all weight matrices to the loss function, encouraging the optimizer to keep weights small. The hyperparameter $\lambda$ controls the strength of regularization: larger $\lambda$ produces smaller weights and stronger regularization.
The computational cost of L2 regularization is modest. Computing the squared norm $\norm{\mW}_F^2 = \sum_{ij} w_{ij}^2$ requires one multiplication and one addition per parameter, totaling $2P$ operations for a model with $P$ parameters. For BERT-base with 110 million parameters, this requires 220 million operations, or 0.22 GFLOPs. Compared to the 96 GFLOPs required for a forward pass, the regularization computation adds only 0.23\% overhead. On an NVIDIA A100 GPU, computing the regularization term takes approximately 0.7 microseconds, which is negligible compared to the 50 milliseconds for a full forward-backward pass.
The gradient of the L2 regularization term is even simpler: $\nabla_{\mW} \left(\frac{\lambda}{2} \norm{\mW}_F^2\right) = \lambda \mW$. This adds a term proportional to the current weights to the gradient, which can be implemented as a simple scaling operation during the optimizer step. Most optimizers, including PyTorch's Adam and SGD, support weight decay as a built-in parameter that applies this scaling automatically without requiring explicit computation of the regularization term. This makes L2 regularization essentially free from a computational perspective.
The memory overhead of L2 regularization is zero, as it requires no additional storage beyond the parameters themselves. The regularization term is computed on-the-fly during the backward pass and does not need to be stored. This makes L2 regularization an attractive regularization technique for large models where memory is at a premium.
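The identity $\nabla_{\mW}\left(\frac{\lambda}{2}\norm{\mW}_F^2\right) = \lambda\mW$ can be checked numerically with central differences; a small sketch:

```python
# Numerical gradient check for the L2 penalty (lambda/2) * ||W||_F^2,
# whose analytic gradient is lambda * W.
lam = 0.01
W = [[0.5, -1.2], [2.0, 0.3]]

def penalty(W):
    return 0.5 * lam * sum(w * w for row in W for w in row)

eps = 1e-6
for i in range(2):
    for j in range(2):
        W[i][j] += eps
        up = penalty(W)
        W[i][j] -= 2 * eps
        down = penalty(W)
        W[i][j] += eps  # restore original value
        numeric = (up - down) / (2 * eps)   # central difference
        analytic = lam * W[i][j]
        assert abs(numeric - analytic) < 1e-6
print("gradient of (lam/2)||W||_F^2 matches lam*W")
```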
Dropout
Dropout is a powerful regularization technique that randomly drops (sets to zero) a fraction of activations during training. This prevents the network from relying too heavily on any single neuron and encourages learning robust features. The dropout probability $p$ is typically 0.1 to 0.5, with higher values providing stronger regularization at the cost of slower convergence.
Dropout in Practice
Dropout adds approximately 4--5\% overhead to training time for BERT-base, primarily from random number generation and memory traffic for the binary mask. Transformer models apply dropout at multiple points: after attention weights (attention dropout), after sublayer outputs (residual dropout), and on input embeddings. BERT uses $p=0.1$ at all locations; GPT-3 uses only residual dropout.
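A minimal sketch of inverted dropout, the variant used by modern frameworks: survivors are scaled by $1/(1-p)$ at training time so that no rescaling is needed at inference (the `seed` parameter is an illustrative addition for reproducibility):

```python
import random

def dropout(x, p=0.1, training=True, seed=None):
    """Inverted dropout: zero each activation with probability p and
    scale survivors by 1/(1-p), preserving the expected value."""
    if not training or p == 0.0:
        return list(x)  # identity at inference time
    rng = random.Random(seed)
    keep = 1.0 - p
    return [xi / keep if rng.random() < keep else 0.0 for xi in x]

activations = [1.0] * 10000
dropped = dropout(activations, p=0.1, seed=0)
print(sum(dropped) / len(dropped))  # close to 1.0: expectation preserved
```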
Exercises
Solutions
Full solutions for all exercises are available at \url{https://deeplearning.hofkensvermeulen.be}.
Architecture:
- Input layer: 100 dimensions
- Hidden layer 1: 100 $\to$ 64 with ReLU activation
- Hidden layer 2: 64 $\to$ 32 with ReLU activation
- Output layer: 32 $\to$ 1 with sigmoid activation
Parameter count:
- Layer 1: $\mW^{(1)} \in \R^{64 \times 100}$ has $6{,}400$ weights, $\vb^{(1)} \in \R^{64}$ has $64$ biases
- Layer 2: $\mW^{(2)} \in \R^{32 \times 64}$ has $2{,}048$ weights, $\vb^{(2)} \in \R^{32}$ has $32$ biases
- Layer 3: $\mW^{(3)} \in \R^{1 \times 32}$ has $32$ weights, $b^{(3)} \in \R$ has $1$ bias
- Total: $6{,}400 + 64 + 2{,}048 + 32 + 32 + 1 = 8{,}577$ parameters
Forward pass equations:
\begin{align*}
\vh^{(1)} &= \text{ReLU}\left(\mW^{(1)}\vx + \vb^{(1)}\right) \\
\vh^{(2)} &= \text{ReLU}\left(\mW^{(2)}\vh^{(1)} + \vb^{(2)}\right) \\
\hat{y} &= \sigma\left(\mW^{(3)}\vh^{(2)} + b^{(3)}\right)
\end{align*}
where $\sigma(z) = \frac{1}{1 + e^{-z}}$ is the sigmoid function.
Given weights:
Input: $\vx = \begin{bmatrix} 2.0 \\ 1.0 \end{bmatrix}$
Layer 1:
Layer 2:
Final output: $\hat{y} = 0.7$
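Since the numeric weight matrices for this example are not reproduced here, the sketch below uses hypothetical placeholder weights to illustrate the computation pattern (linear, ReLU, linear, sigmoid); it does not reproduce the $\hat{y} = 0.7$ result:

```python
import math

def matvec(W, x, b):
    # y_i = sum_j W[i][j] * x[j] + b[i]
    return [sum(wij * xj for wij, xj in zip(row, x)) + bi
            for row, bi in zip(W, b)]

def relu(v):
    return [max(vi, 0.0) for vi in v]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x = [2.0, 1.0]  # input from the exercise

# Hypothetical placeholder weights (NOT the exercise's actual values):
W1, b1 = [[0.5, -0.3], [0.2, 0.4]], [0.1, -0.1]
W2, b2 = [[0.6, -0.5]], [0.05]

h1 = relu(matvec(W1, x, b1))          # layer 1: linear + ReLU
y_hat = sigmoid(matvec(W2, h1, b2)[0])  # layer 2: linear + sigmoid
print(y_hat)
```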
(1) He initialization variance (for a layer with $n_{\text{in}} = 512$ inputs):
\[
\text{Var}(w) = \frac{2}{n_{\text{in}}} = \frac{2}{512} \approx 0.00391
\]
Standard deviation: $\sigma = \sqrt{0.00391} \approx 0.0625$
(2) Why different from Xavier:
- Xavier initialization: $\text{Var}(w) = \frac{1}{n_{\text{in}}}$ (for tanh/sigmoid)
- He initialization: $\text{Var}(w) = \frac{2}{n_{\text{in}}}$ (for ReLU)
- ReLU zeros out half the activations, reducing variance by factor of 2
- He initialization compensates by doubling the initial variance
- This maintains signal variance through deep networks with ReLU
(3) Zero initialization problem: If all weights are initialized to zero:
- All neurons in a layer compute identical outputs
- All gradients are identical (symmetry)
- Neurons never differentiate during training
- Network effectively has only one neuron per layer
- Learning fails completely
Random initialization breaks symmetry, allowing neurons to learn different features.
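The symmetry argument can be made concrete with a one-step gradient computation on a tiny $2 \to 2 \to 1$ network (squared-error loss assumed for illustration):

```python
# With all-zero weights, every hidden neuron computes the same output
# and receives the identical gradient, so neurons never differentiate.
x, target = [1.0, 2.0], 1.0
W1 = [[0.0, 0.0], [0.0, 0.0]]  # hidden weights (zero-initialized)
w2 = [0.0, 0.0]                # output weights (zero-initialized)

h = [sum(w * xi for w, xi in zip(row, x)) for row in W1]  # identical
y = sum(w * hi for w, hi in zip(w2, h))

dy = 2 * (y - target)            # dL/dy for squared-error loss
grad_w2 = [dy * hi for hi in h]  # identical: hidden outputs are equal
grad_W1 = [[dy * w2[i] * xi for xi in x] for i in range(2)]

print(grad_W1[0] == grad_W1[1])  # True: both rows get the same update
```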
Consider an $L$-layer network without nonlinear activations:
\[
\vh^{(\ell)} = \mW^{(\ell)} \vh^{(\ell-1)} + \vb^{(\ell)}, \qquad \vh^{(0)} = \vx
\]
Substituting recursively:
\[
\vh^{(2)} = \mW^{(2)}\left(\mW^{(1)}\vx + \vb^{(1)}\right) + \vb^{(2)} = \mW^{(2)}\mW^{(1)}\vx + \mW^{(2)}\vb^{(1)} + \vb^{(2)}
\]
Continuing to layer $L$:
\[
\vh^{(L)} = \mW_{\text{eff}}\,\vx + \vb_{\text{eff}}
\]
where:
\[
\mW_{\text{eff}} = \mW^{(L)}\mW^{(L-1)}\cdots\mW^{(1)}, \qquad \vb_{\text{eff}} = \sum_{\ell=1}^{L}\left(\prod_{k=\ell+1}^{L}\mW^{(k)}\right)\vb^{(\ell)}
\]
This is equivalent to a single linear layer with weights $\mW_{\text{eff}}$ and bias $\vb_{\text{eff}}$. Therefore, without nonlinear activations, depth provides no additional representational power---the network can only learn linear functions regardless of depth.
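This collapse can be verified numerically (biases omitted for brevity):

```python
import random

# Check that three stacked linear layers equal one linear layer whose
# weight matrix is the product of the three.
rng = random.Random(0)

def rand_mat(m, n):
    return [[rng.gauss(0, 1) for _ in range(n)] for _ in range(m)]

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

W1, W2, W3 = rand_mat(4, 3), rand_mat(4, 4), rand_mat(2, 4)
x = [1.0, -2.0, 0.5]

layered = matvec(W3, matvec(W2, matvec(W1, x)))  # apply layer by layer
W_eff = matmul(W3, matmul(W2, W1))               # collapsed weights
collapsed = matvec(W_eff, x)

print(all(abs(a - b) < 1e-9 for a, b in zip(layered, collapsed)))  # True
```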