Pre-training Strategies and Transfer Learning
Chapter Overview
Pre-training on large unlabeled corpora followed by task-specific fine-tuning has become the dominant paradigm in deep learning. This chapter covers pre-training objectives, data curation, curriculum learning, continual pre-training, and transfer learning strategies for maximizing downstream performance.
Learning Objectives
- Understand different pre-training objectives and their trade-offs
- Curate and process pre-training data at scale
- Apply curriculum learning and domain-adaptive pre-training
- Implement parameter-efficient fine-tuning (LoRA, adapters)
- Design multi-task and multi-stage pre-training
- Measure and improve transfer learning effectiveness
Pre-training Objectives
Language Modeling Objectives
The choice of pre-training objective fundamentally shapes a model's capabilities. The main objectives are:
- Causal Language Modeling (CLM): Predict each token given only previous context: $\mathcal{L}_{\text{CLM}} = -\sum_{t} \log P(x_t | x_{<t}; \theta)$. Used by GPT-style models (Chapter~[ref]); the natural objective for text generation.
- Masked Language Modeling (MLM): Mask $\sim$15\% of tokens and predict from bidirectional context: $\mathcal{L}_{\text{MLM}} = -\sum_{t \in \mathcal{M}} \log P(x_t | x_{\backslash \mathcal{M}}; \theta)$. Used by BERT (Chapter~[ref]). Rich representations for understanding tasks.
- Prefix Language Modeling: Bidirectional attention on a prefix, causal attention on the suffix. Combines benefits of CLM and MLM. Used by UniLM and GLM.
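The MLM corruption above is straightforward to implement. Below is a minimal sketch of BERT-style masking with the standard 80/10/10 split among [MASK], random token, and unchanged; `mask_for_mlm` and its arguments are illustrative names, and `-100` follows the common convention for label positions ignored by the cross-entropy loss.

```python
import random

def mask_for_mlm(token_ids, vocab_size, mask_id, mask_prob=0.15, seed=0):
    """BERT-style MLM corruption: select ~15% of positions; of those,
    80% become [MASK], 10% become a random token, 10% stay unchanged.
    Returns (corrupted_ids, labels) with labels = -100 at unmasked positions."""
    rng = random.Random(seed)
    corrupted = list(token_ids)
    labels = [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() < mask_prob:
            labels[i] = tok  # the model must predict the original token here
            r = rng.random()
            if r < 0.8:
                corrupted[i] = mask_id                    # 80%: [MASK]
            elif r < 0.9:
                corrupted[i] = rng.randrange(vocab_size)  # 10%: random token
            # else 10%: keep the original token
    return corrupted, labels
```

Keeping 10% of selected tokens unchanged forces the model to produce useful representations even for tokens that are not visibly corrupted.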
Denoising Objectives
- Span Corruption (T5): Replace random spans with sentinel tokens; decoder predicts original spans. See Chapter~[ref].
- Multi-task Denoising (BART): Combines token masking, deletion, text infilling, sentence permutation, and document rotation. See Chapter~[ref].
Computational costs are similar across objectives for the same model size and sequence length: a single forward pass dominates, with differences arising mainly in the decoder target length for seq2seq objectives (span corruption costs $\sim$3$\times$ more in decoder computation than single-token MLM).
Contrastive Objectives
Contrastive Learning: Learn representations by pulling positive pairs together and pushing negatives apart in embedding space. The standard InfoNCE loss for an anchor embedding $\vz$ with positive $\vz^+$ and negatives $\vz_j^-$ is:
$\mathcal{L}_{\text{InfoNCE}} = -\log \frac{\exp(\text{sim}(\vz, \vz^+)/\tau)}{\exp(\text{sim}(\vz, \vz^+)/\tau) + \sum_j \exp(\text{sim}(\vz, \vz_j^-)/\tau)}$
where $\text{sim}$ is typically cosine similarity and $\tau$ is a temperature hyperparameter.
Applications:
- SimCLR (vision): Augmented views as positives
- CLIP: Image-text pairs
- SimCSE (text): Dropout as augmentation
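The applications above all optimize a softmax over similarity scores. A minimal pure-Python sketch of the InfoNCE loss follows (function names are illustrative; real implementations batch this with tensor operations):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE: negative log-softmax score of the positive among all candidates."""
    scores = [cosine(anchor, positive) / temperature]
    scores += [cosine(anchor, n) / temperature for n in negatives]
    m = max(scores)  # subtract the max to stabilize the softmax
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return -(scores[0] - log_z)
```

The loss is near zero when the positive is far more similar to the anchor than any negative, and grows as negatives become competitive.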
Data Curation and Processing
Data Scale and Requirements
The scale of pre-training data has grown exponentially over the past few years, driven by empirical findings that larger datasets consistently improve model performance. Understanding the data requirements, storage costs, and preprocessing overhead is essential for planning pre-training projects.
BERT was pre-trained on approximately 16 GB of text data, consisting of BooksCorpus (800 million words) and English Wikipedia (2.5 billion words). This relatively modest dataset size reflects BERT's focus on high-quality, curated text rather than massive web crawls. The 16 GB of raw text expands to approximately 3.3 billion tokens using BERT's WordPiece tokenizer with a 30,000 token vocabulary. Training BERT-base for 1 million steps with batch size 256 and sequence length 512 means the model sees each token approximately 40 times on average, indicating significant data reuse through multiple epochs. The storage requirements are minimal by modern standards—16 GB of compressed text expands to perhaps 50 GB including tokenized data and intermediate preprocessing artifacts.
GPT-2 scaled up to approximately 40 GB of text from WebText, a dataset created by scraping outbound links from Reddit posts with at least 3 karma. This filtering strategy aimed to identify high-quality content as judged by the Reddit community. The 40 GB corpus contains roughly 8 billion tokens using GPT-2's byte-pair encoding with a 50,257 token vocabulary. GPT-2's largest variant (1.5B parameters) was trained for approximately 1 million steps, seeing each token roughly 10 times. The preprocessing pipeline for WebText involved deduplication, filtering by language, and removing low-quality content, reducing the raw crawl from over 100 GB to the final 40 GB. Storage requirements including raw data, filtered data, and tokenized sequences total approximately 150 GB.
GPT-3 made a massive leap to approximately 570 GB of text, totaling roughly 300 billion tokens. This dataset combines filtered Common Crawl (410 GB), WebText2 (19 GB), Books1 (12 GB), Books2 (55 GB), and Wikipedia (3 GB). The preprocessing pipeline for Common Crawl is particularly intensive: the raw crawl contains petabytes of data, which must be filtered by language, deduplicated, and quality-filtered to produce the final 410 GB. This filtering process itself requires substantial computational resources—processing petabytes of data through language classifiers and deduplication algorithms takes weeks on large clusters. The total storage requirements for GPT-3 pre-training, including raw data, filtered data, tokenized sequences, and training checkpoints, exceed 5 TB. The preprocessing cost alone is estimated at tens of thousands of dollars in compute time.
LLaMA pushed the scale even further to approximately 1.4 TB of text, totaling roughly 1.4 trillion tokens. This dataset consists primarily of Common Crawl (67\%), C4 (15\%), GitHub (4.5\%), Wikipedia (4.5\%), books (4.5\%), ArXiv (2.5\%), and StackExchange (2\%). The inclusion of code from GitHub and technical content from ArXiv and StackExchange reflects a deliberate strategy to improve reasoning and technical capabilities. The preprocessing pipeline for LLaMA is even more sophisticated than GPT-3, using multiple quality filters including perplexity-based filtering, classifier-based filtering, and extensive deduplication. The total storage requirements exceed 10 TB including all preprocessing artifacts, and the preprocessing cost is estimated at over \$100,000 in compute time.
Data Quality versus Quantity
The relationship between data quality and quantity is not straightforward—more data does not always lead to better models if the quality is poor. Recent research has shown that careful data curation can match or exceed the performance of models trained on much larger but noisier datasets.
High-quality datasets like Wikipedia and books consistently improve model performance even when they represent a small fraction of total training data. GPT-3's data mixture heavily upsamples Wikipedia relative to its share of the corpus (3.4 epochs versus 0.44 epochs for Common Crawl), reflecting the higher quality and information density of Wikipedia text. This upsampling strategy means that despite Wikipedia being only 3 GB of the 570 GB total, it contributes disproportionately to the model's knowledge and capabilities.
The preprocessing cost for achieving high data quality is substantial. Language identification using fastText classifiers requires processing every document, taking approximately 1 CPU-hour per 100 GB of text. Deduplication using MinHash LSH is even more expensive, requiring approximately 10 CPU-hours per 100 GB for computing signatures and finding near-duplicates. Quality filtering using perplexity-based methods requires running a language model over the entire corpus, taking approximately 100 GPU-hours per 100 GB. For GPT-3's 570 GB dataset, the total preprocessing cost exceeds 50,000 CPU-hours and 5,000 GPU-hours, translating to roughly \$30,000 in cloud computing costs.
The storage requirements for preprocessing are also significant. Deduplication requires storing hash signatures for all documents, typically requiring 100-200 bytes per document. For a corpus with 1 billion documents, this means 100-200 GB of signature storage. Near-duplicate detection using LSH requires storing multiple hash tables, potentially doubling or tripling this storage requirement. Quality filtering requires storing perplexity scores or classifier outputs for all documents, adding another 10-20 GB. In total, the preprocessing metadata can require 500 GB to 1 TB of storage for a large corpus, comparable to the size of the corpus itself.
Data Filtering and Cleaning
Step 1: Quality Filtering
Step 2: Deduplication
Step 3: Privacy
Step 4: Formatting
| Dataset | Weight | Epochs |
|---|---|---|
| Common Crawl (filtered) | 60\% | 0.44 |
| WebText2 | 22\% | 2.9 |
| Books1 | 8\% | 1.9 |
| Books2 | 8\% | 0.43 |
| Wikipedia | 3\% | 3.4 |
Higher-quality sources sampled more frequently (multiple epochs). Lower-quality sources seen less to avoid overfitting to noise.
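Weighted mixture sampling like the table above reduces to a few lines. The sketch below assumes each draw independently picks a source according to its mixture weight; `sample_mixture` is a hypothetical helper that returns which source each training document comes from:

```python
import random

def sample_mixture(sources, weights, n, seed=0):
    """Draw n training documents' source names according to mixture weights.
    A source whose weight exceeds its share of the corpus will be seen for
    more than one epoch over a full training run (e.g. Wikipedia in GPT-3)."""
    rng = random.Random(seed)
    names = list(sources)
    return [rng.choices(names, weights=weights)[0] for _ in range(n)]
```

In practice each source is also shuffled internally and streamed, but the per-draw source selection is the core of mixture weighting.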
Data Deduplication
Why deduplicate?
- Prevents memorization
- Better generalization
- Fairer evaluation (test set contamination)
Methods:
1. Exact Deduplication:

```python
import hashlib

def exact_dedup(corpus):
    """Keep only the first occurrence of each distinct document."""
    seen_hashes = set()
    kept = []
    for doc in corpus:
        # Stable content hash; Python's built-in hash() is salted per process
        hash_val = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if hash_val not in seen_hashes:
            kept.append(doc)
            seen_hashes.add(hash_val)
    return kept
```
2. Fuzzy Deduplication (MinHash):
- Compute MinHash signatures
- Use LSH for near-neighbor search
- Remove documents with Jaccard similarity $> 0.8$
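The MinHash steps above can be sketched without external libraries. This is a simplified illustration (production pipelines use LSH banding instead of all-pairs comparison, typically via a library such as datasketch):

```python
import hashlib

def shingles(text, k=5):
    """Character k-grams used as the document's feature set."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def minhash_signature(features, num_perm=64):
    """One minimum hash value per seeded hash function. The probability that
    two signatures agree at a position equals the Jaccard similarity of the
    underlying feature sets."""
    sig = []
    for seed in range(num_perm):
        sig.append(min(
            int(hashlib.md5(f"{seed}:{f}".encode()).hexdigest(), 16)
            for f in features))
    return sig

def estimated_jaccard(sig_a, sig_b):
    """Fraction of agreeing signature positions approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Documents whose estimated Jaccard similarity exceeds the chosen threshold (e.g. 0.8) are treated as near-duplicates and all but one copy is dropped.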
Training Compute Requirements
FLOPs Analysis
Understanding the computational requirements for pre-training large language models is essential for planning projects and estimating costs. The total compute is typically measured in FLOPs (floating-point operations), which can be calculated from model architecture and training configuration.
For a transformer model with $L$ layers, $d$ model dimension, $h$ attention heads, and feed-forward intermediate dimension $d_{ff}$ (typically $4d$), processing a single token requires approximately:
$\text{FLOPs per token} \approx 2 \times L \times (12d^2 + 16d^2) = 56Ld^2$
The factor of 2 accounts for both forward and backward passes (the backward pass is approximated here as costing the same as the forward pass). The $12d^2$ term covers attention: $8d^2$ for the query, key, value, and output projections (four $d \times d$ matrices at 2 FLOPs per multiply-accumulate), plus roughly $4d^2$ for computing and applying the attention weights when the sequence length is comparable to $d$. The $16d^2$ term comes from the feed-forward layers (two $d \times 4d$ projections).
For BERT-base with $L=12$, $d=768$, training on 3.3 billion tokens for 40 epochs (132 billion tokens total), the total training compute is approximately 0.5 zettaFLOPs ($5 \times 10^{20}$ FLOPs). Training on 16 TPU v3 chips (each providing 420 TFLOPS in mixed precision) for 4 days supplies:
$16 \times 420 \times 10^{12} \text{ FLOPS} \times 345{,}600 \text{ s} \approx 2.3 \times 10^{21} \text{ FLOPs}$
The ratio of available to required compute is approximately 4.4, indicating that BERT-base training achieves roughly 23\% hardware utilization. This is typical for large-scale training, where communication overhead, data loading, and other inefficiencies reduce effective utilization.
For GPT-3 175B with $L=96$, $d=12288$, training on 300 billion tokens:
$56 \times 96 \times 12288^2 \times 3 \times 10^{11} \approx 2.4 \times 10^{23} \text{ FLOPs}$
This is approximately 240 zettaFLOPs, nearly 500$\times$ more than BERT-base. The massive compute requirement reflects both the larger model (175B versus 110M parameters) and the larger dataset (300B versus 132B tokens). Training GPT-3 on approximately 10,000 NVIDIA V100 GPUs (each providing 125 TFLOPS in mixed precision) for 1 month suggests approximately 75\% hardware utilization, which is impressive for such a large-scale distributed training job. The higher utilization compared to BERT reflects improvements in distributed training infrastructure and optimization techniques.
For LLaMA-65B with $L=80$, $d=8192$, training on 1.4 trillion tokens, the total training compute is approximately 330 zettaFLOPs. Training on 2048 NVIDIA A100 GPUs (each providing 312 TFLOPS in mixed precision) for 21 days supplies:
$2048 \times 312 \times 10^{12} \text{ FLOPS} \times 1{,}814{,}400 \text{ s} \approx 1.2 \times 10^{24} \text{ FLOPs}$
This suggests approximately 28\% hardware utilization, which is lower than GPT-3 despite using more modern hardware. The lower utilization likely reflects communication overhead at this scale and the cost of streaming and processing the much larger dataset.
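The per-token approximation used throughout this section can be packaged as a small estimator. `training_flops` is an illustrative name; the formula is the $2 \times L \times (12d^2 + 16d^2)$ approximation described above (so it ignores sequence-length-dependent attention costs and embedding layers):

```python
def training_flops(num_layers, d_model, num_tokens):
    """Approximate total training FLOPs: a factor of 2 for forward+backward,
    times (12 d^2 attention + 16 d^2 feed-forward) per layer per token."""
    flops_per_token = 2 * num_layers * (12 * d_model**2 + 16 * d_model**2)
    return flops_per_token * num_tokens

# GPT-3 175B: L=96, d=12288, 300B tokens -> ~2.4e23 FLOPs (~240 zettaFLOPs)
gpt3 = training_flops(96, 12288, 300e9)
```

Dividing the result by (GPU count × peak FLOPS × wall-clock seconds) gives the implied hardware utilization for a training run.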
GPU-Hours and Cost Estimates
Translating FLOPs into GPU-hours and cost estimates provides a more practical understanding of training requirements. The cost depends heavily on the hardware platform and whether using cloud services or owned infrastructure.
BERT-base training on 16 TPU v3 chips for 4 days equals 1,536 TPU-hours. At Google Cloud's on-demand pricing of approximately \$8 per TPU v3 hour, this costs roughly \$12,000. Using preemptible TPUs at approximately \$2.40 per hour reduces the cost to roughly \$3,700. Using equivalent GPU resources (approximately 64 NVIDIA V100 GPUs to match 16 TPU v3 chips) at approximately \$2 per GPU-hour on AWS, the cost would total roughly \$12,000 for the 6,144 GPU-hours (64 GPUs for 4 days). The lower preemptible TPU cost reflects Google's optimization for transformer workloads and economies of scale.
GPT-3 175B training on 10,000 V100 GPUs for 1 month equals 7.2 million GPU-hours. At AWS on-demand pricing of approximately \$3 per V100 hour, this would cost \$21.6 million. However, OpenAI likely used a combination of owned infrastructure and negotiated cloud pricing, with estimates suggesting actual costs between \$4 million and \$12 million. The wide range reflects uncertainty about the exact hardware configuration, utilization rates, and pricing agreements. The training also required substantial infrastructure costs including high-bandwidth networking (InfiniBand or equivalent), distributed storage systems, and engineering effort to optimize the training pipeline.
LLaMA-65B training on 2048 A100 GPUs for 21 days equals 1.03 million GPU-hours. At cloud pricing of approximately \$3 per A100 hour, this would cost \$3.1 million. Meta's paper reports that LLaMA-65B training consumed approximately 1,022,362 GPU-hours on A100-80GB GPUs, closely matching this estimate. Using Meta's owned infrastructure rather than cloud services likely reduced the effective cost to \$1.5-2 million when accounting for hardware depreciation and operational costs. The A100's higher performance compared to V100 (312 versus 125 TFLOPS) means that LLaMA-65B required only 1/7 the GPU-hours of GPT-3 despite using comparable compute (330 versus 240 zettaFLOPs), demonstrating the importance of hardware efficiency.
Scaling Laws
Empirical scaling laws govern the relationship between compute budget, model size, dataset size, and performance (see Section~[ref] for the full treatment). The key insight from the Chinchilla scaling laws (Hoffmann et al., 2022) is that model size and dataset size should scale equally with compute budget: $N_{\text{opt}} \propto C^{0.5}$ and $D_{\text{opt}} \propto C^{0.5}$. This implies GPT-3 175B (trained on 300B tokens) was over-parameterized---a 70B model trained on 1.4T tokens (matching LLaMA-65B) achieves better performance for the same compute. This insight has driven the trend toward models like LLaMA, Chinchilla, and Mistral that prioritize training tokens over parameter count.
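The Chinchilla relations above can be turned into a back-of-the-envelope calculator, assuming the common approximations $C = 6ND$ and $D \approx 20N$ at the optimum (a rule of thumb consistent with Hoffmann et al.'s 70B-parameter / 1.4T-token result); `chinchilla_optimal` is an illustrative name:

```python
import math

def chinchilla_optimal(compute_flops, tokens_per_param=20.0):
    """Compute-optimal model size N and token count D under the
    approximations C = 6*N*D and D = tokens_per_param * N, which give
    N = sqrt(C / (6 * tokens_per_param)); both scale as C^0.5."""
    n_opt = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    d_opt = tokens_per_param * n_opt
    return n_opt, d_opt

# Chinchilla's budget (~5.76e23 FLOPs) -> roughly 70B params, 1.4T tokens
n, d = chinchilla_optimal(5.76e23)
```

Doubling the compute budget thus multiplies both the optimal parameter count and the optimal token count by $\sqrt{2}$.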
Curriculum Learning
Progressive Training Strategies
Curriculum learning applies the principle of learning from easy to hard examples, progressively increasing task difficulty during training. This approach can significantly improve training efficiency, convergence speed, and final model performance. The key insight is that models learn more effectively when they first master simpler patterns before tackling complex ones.
The most common curriculum strategy involves progressively increasing sequence length during training. Starting with shorter sequences reduces both memory requirements and computational cost in the early stages of training when the model is learning basic patterns. For a model with quadratic attention complexity $O(n^2d)$, doubling the sequence length quadruples the attention computation. Training BERT-base with sequence length 128 for the first 90\% of steps and then 512 for the final 10\% reduces total training time by approximately 30\% compared to using length 512 throughout. The shorter sequences allow larger batch sizes in early training, which improves gradient estimates and accelerates convergence. The model learns word-level and phrase-level patterns with short sequences, then refines its understanding of long-range dependencies with longer sequences.
Batch size curriculum is another effective strategy, gradually increasing batch size during training. Starting with smaller batches provides more frequent parameter updates, which helps the model escape poor local minima in early training. As training progresses and the model approaches convergence, larger batches provide more stable gradient estimates and better utilize hardware parallelism. GPT-3 training used a batch size curriculum, starting at 32,000 tokens per batch and gradually increasing to 3.2 million tokens per batch. This 100× increase in batch size was enabled by learning rate adjustments and gradient accumulation. The larger batch sizes in later training improved hardware utilization from approximately 30\% to over 70\%, significantly reducing training time.
Learning rate schedules are essential for curriculum learning, as the optimal learning rate changes with batch size and training progress. The linear warmup followed by cosine decay schedule has become standard for transformer pre-training. The warmup phase, typically 1-10\% of total steps, gradually increases the learning rate from near-zero to the peak value. This prevents the large gradient updates in early training from destabilizing the model. The cosine decay phase gradually reduces the learning rate to near-zero, allowing the model to fine-tune its parameters as it approaches convergence. For BERT-base, a warmup of 10,000 steps followed by linear decay over 990,000 steps works well. For GPT-3, a warmup of 375 million tokens (approximately 1\% of total training) followed by cosine decay proved effective.
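The warmup-plus-cosine schedule described above can be sketched in a few lines; `lr_schedule` is an illustrative helper:

```python
import math

def lr_schedule(step, total_steps, peak_lr, warmup_steps, min_lr=0.0):
    """Linear warmup from 0 to peak_lr over warmup_steps, then cosine
    decay to min_lr over the remaining steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

For the BERT-base numbers above, this would be called with `warmup_steps=10_000` and `total_steps=1_000_000` (BERT itself used linear rather than cosine decay, but the warmup behavior is the same).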
The impact on training efficiency is substantial. Curriculum learning can reduce training time by 20-40\% compared to fixed configurations while achieving equal or better final performance. For BERT-base, the sequence length curriculum reduces training from approximately 5.5 days to 4 days on the same hardware. For GPT-3, the batch size curriculum improved hardware utilization enough to reduce training time by an estimated 30\%, saving approximately \$3-4 million in compute costs. These savings make curriculum learning essential for large-scale pre-training projects.
Progressive Training
Stage 1: Easy examples (short sequences, simple patterns)
Stage 2: Medium difficulty
Stage 3: Full difficulty (long sequences, complex patterns)
Benefits:
- Faster convergence
- Better final performance
- More stable training
Example schedule (sequence-length curriculum for a large language model):
Stage 1 (0-100B tokens):
- Sequence length: 1024
- Batch size: 3.2M tokens
Stage 2 (100B-300B tokens):
- Sequence length: 2048
- Batch size: 3.2M tokens (fewer sequences)
Starting with shorter sequences reduces memory and computation early in training.
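A staged sequence-length schedule like the one above reduces to a simple lookup; `curriculum_seq_len` is an illustrative helper, with stages given as (token threshold, sequence length) pairs:

```python
def curriculum_seq_len(tokens_seen, stages=((100e9, 1024), (300e9, 2048))):
    """Return the training sequence length for the current token count.
    Each stage applies until its token threshold is reached; the final
    stage's length is used once all thresholds are passed."""
    for threshold, seq_len in stages:
        if tokens_seen < threshold:
            return seq_len
    return stages[-1][1]
```

The data loader queries this at each step and re-packs documents to the current length, keeping the token count per batch constant as sequences grow.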
Domain-Adaptive Pre-training
Continue pre-training on domain-specific data:
Step 1: Pre-train on general corpus (e.g., Common Crawl)
Step 2: Continue pre-training on domain data (e.g., biomedical)
Step 3: Fine-tune on task
Examples:
- BioBERT: BERT + PubMed/PMC
- SciBERT: BERT + scientific papers
- FinBERT: BERT + financial documents
- CodeBERT: BERT + code
Hardware Requirements and Infrastructure
BERT-base Training Infrastructure
Training BERT-base requires relatively modest infrastructure by modern standards, making it accessible to academic research groups and small companies. The original BERT paper reported training BERT-base on 16 TPU v3 chips for 4 days, providing a concrete reference point for hardware requirements.
Each TPU v3 chip provides 420 TFLOPS of mixed-precision compute (bfloat16) and 16 GB of high-bandwidth memory (HBM). The 16-chip configuration provides 6.7 PFLOPS total compute and 256 GB total memory. BERT-base with 110 million parameters requires approximately 440 MB for model weights in FP32, or 220 MB in FP16. With batch size 256 and sequence length 512, the activation memory per batch is approximately 8 GB, which fits comfortably in the 256 GB total memory when distributed across 16 chips. The high-bandwidth interconnect between TPU chips (approximately 100 GB/s per chip) enables efficient data parallelism with minimal communication overhead.
The training cost at Google Cloud's preemptible TPU pricing (approximately \$2.40 per TPU v3 hour) is roughly \$3,700 for the full 4-day training run. Using on-demand pricing (\$8 per hour) would increase this to \$12,000. For comparison, training on NVIDIA V100 GPUs would require approximately 64 GPUs for 4 days (6,144 GPU-hours) at a cost of approximately \$12,000 using AWS on-demand pricing. The equivalent training on A100 GPUs would require approximately 32 GPUs for 2.5 days (1,920 GPU-hours) at a cost of approximately \$6,000, demonstrating the improved efficiency of newer hardware.
The infrastructure requirements beyond compute include high-bandwidth storage for the training data (approximately 100 GB including tokenized sequences and preprocessing artifacts), network bandwidth for distributed training (at least 10 Gbps per GPU for efficient data parallelism), and monitoring infrastructure for tracking training metrics. The total infrastructure cost including storage, networking, and engineering time is typically 2-3× the raw compute cost, bringing the total BERT-base training cost to approximately \$10,000-15,000.
GPT-3 Training Infrastructure
Training GPT-3 175B requires massive infrastructure that is accessible only to large technology companies and well-funded research organizations. The scale of the training job presents significant engineering challenges beyond simply acquiring hardware.
The training used approximately 10,000 NVIDIA V100 GPUs, though the exact configuration has not been publicly disclosed. Each V100 provides 125 TFLOPS of mixed-precision compute and 32 GB of memory. The 10,000-GPU configuration provides 1.25 exaFLOPS total compute and 320 TB total memory. GPT-3 175B with 175 billion parameters requires approximately 700 GB for model weights in FP32, or 350 GB in FP16. With model parallelism across 8 GPUs, each GPU stores approximately 44 GB of model weights, leaving limited memory for activations. The batch size per GPU is typically 1-2 sequences of length 2048, requiring approximately 20 GB of activation memory per GPU.
The communication requirements are severe. With model parallelism across 8 GPUs and data parallelism across 1,250 groups, each training step requires all-reduce operations across the data parallel groups (approximately 700 GB of gradients) and all-to-all communication within model parallel groups (approximately 100 GB per step). At 100 Gbps network bandwidth per GPU, the gradient all-reduce takes approximately 70 seconds per step, which would dominate training time. To address this, GPT-3 training used gradient accumulation (accumulating gradients over multiple micro-batches before synchronizing) and high-bandwidth interconnects like InfiniBand (200 Gbps or higher), reducing communication time to approximately 10\% of total step time.
The training cost is estimated between \$4 million and \$12 million depending on assumptions about hardware ownership versus cloud rental, utilization rates, and pricing agreements. At AWS on-demand pricing of \$3 per V100 hour, the 7.2 million GPU-hours would cost \$21.6 million, but OpenAI likely achieved significant discounts through long-term commitments and negotiated pricing. The infrastructure costs beyond raw compute are substantial: high-bandwidth networking equipment (InfiniBand switches and cables) costs millions of dollars, distributed storage systems for the 5 TB of training data cost hundreds of thousands of dollars, and the engineering effort to build and optimize the training pipeline represents millions of dollars in labor costs.
The power consumption is also significant. Each V100 GPU consumes approximately 300 watts under full load, so 10,000 GPUs consume 3 megawatts. Over a 1-month training run, this equals 2,160 megawatt-hours of electricity. At typical data center electricity costs of \$0.10 per kWh, the electricity cost alone is \$216,000. Including cooling and power distribution overhead (typically 1.5-2× the compute power), the total power cost approaches \$400,000.
LLaMA-65B Training Infrastructure
Training LLaMA-65B represents a more efficient approach than GPT-3, using fewer but more powerful GPUs and a more optimized training pipeline. Meta's paper provides detailed information about the infrastructure and costs.
The training used 2,048 NVIDIA A100-80GB GPUs for 21 days, totaling 1,022,362 GPU-hours. Each A100-80GB provides 312 TFLOPS of mixed-precision compute and 80 GB of memory, representing a significant improvement over V100 (2.5× compute, 2.5× memory). The 2,048-GPU configuration provides 639 PFLOPS total compute and 164 TB total memory. LLaMA-65B with 65 billion parameters requires approximately 260 GB for model weights in FP32, or 130 GB in FP16. With model parallelism across 8 GPUs, each GPU stores approximately 16 GB of model weights, leaving substantial memory for activations and optimizer states.
The larger memory capacity of A100-80GB enables more efficient training configurations. LLaMA uses a batch size of 4 million tokens (approximately 2,000 sequences of length 2048), distributed across 2,048 GPUs as 2 sequences per GPU. The activation memory per GPU is approximately 40 GB, and the optimizer states (using AdamW) require approximately 32 GB, totaling approximately 88 GB per GPU. This fits comfortably in the 80 GB memory, avoiding the need for activation checkpointing or other memory-saving techniques that would slow training.
The communication requirements are more manageable than GPT-3 due to the smaller model size and more efficient hardware. With model parallelism across 8 GPUs and data parallelism across 256 groups, each training step requires all-reduce operations across data parallel groups (approximately 260 GB of gradients) and all-to-all communication within model parallel groups (approximately 30 GB per step). Using NVIDIA's NVLink and NVSwitch interconnects (600 GB/s per GPU within a node, 200 Gbps between nodes), the communication time is approximately 5\% of total step time, demonstrating excellent scaling efficiency.
The training cost is estimated at \$2-3 million using Meta's owned infrastructure. At cloud pricing of \$3 per A100 hour, the 1,022,362 GPU-hours would cost \$3.1 million. Meta's owned infrastructure likely reduced the effective cost to \$1.5-2 million when accounting for hardware depreciation (A100 GPUs cost approximately \$10,000 each, depreciated over 3-5 years) and operational costs. The power consumption is approximately 1 megawatt (2,048 GPUs × 400 watts per A100), totaling 504 megawatt-hours over 21 days. At \$0.10 per kWh including cooling overhead, the electricity cost is approximately \$75,000.
The infrastructure requirements include high-bandwidth networking (NVIDIA InfiniBand or equivalent), distributed storage systems (approximately 15 TB for training data and checkpoints), and monitoring infrastructure. Meta's paper notes that they used a custom training framework optimized for their infrastructure, with careful attention to memory management, communication patterns, and fault tolerance. The engineering effort to build this infrastructure and optimize the training pipeline represents a significant investment beyond the raw hardware costs.
Efficient Pre-training Techniques
Large-scale pre-training relies on several key optimizations, covered in detail in Chapter~[ref]:
- Mixed precision training (FP16/BF16 with FP32 master weights): 1.5--2$\times$ speedup, $\sim$42\% memory reduction. Requires loss scaling to prevent gradient underflow.
- Gradient checkpointing: Stores only $O(\sqrt{L})$ activations, recomputing the rest during backward pass. 5--10$\times$ memory reduction at $\sim$33\% slowdown.
- ZeRO optimizer: Partitions optimizer states (Stage 1, 4$\times$ savings), gradients (Stage 2, 8$\times$), and parameters (Stage 3, up to 64$\times$) across data-parallel GPUs.
- Pipeline parallelism: Divides model across GPUs by layers, pipelining micro-batches. With $m = 4p$ micro-batches for $p$ stages, bubble overhead is $\sim$25\%.
For GPT-3 training, the combination of ZeRO Stage 3, gradient checkpointing, pipeline parallelism (8 stages), and data parallelism achieved $\sim$75\% hardware utilization on V100 GPUs.
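The ZeRO savings above can be estimated with simple byte accounting, assuming mixed-precision Adam (2-byte FP16 parameters and gradients, plus 12 bytes/parameter of optimizer state: FP32 master weights and the two Adam moments). `zero_memory_per_gpu_gb` is an illustrative helper; the numbers roughly reproduce the ZeRO paper's 7.5B-parameter, 64-GPU example:

```python
def zero_memory_per_gpu_gb(num_params, num_gpus, stage=0):
    """Per-GPU memory (GB) for model states under mixed-precision Adam.
    ZeRO partitions progressively: stage 1 shards optimizer state,
    stage 2 also shards gradients, stage 3 also shards parameters."""
    params_b, grads_b, opt_b = 2.0, 2.0, 12.0  # bytes per parameter
    if stage >= 1:
        opt_b /= num_gpus
    if stage >= 2:
        grads_b /= num_gpus
    if stage >= 3:
        params_b /= num_gpus
    return num_params * (params_b + grads_b + opt_b) / 1e9

# 7.5B params on 64 GPUs: ~120 GB baseline, ~1.9 GB per GPU at stage 3
```

Activation memory is not included here; it is addressed separately by gradient checkpointing.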
Parameter-Efficient Fine-tuning
Motivation
Full fine-tuning challenges:
- Requires storing full model copy per task
- 175B model $\times$ 100 tasks = 17.5T parameters!
- Expensive and slow
Solution: Fine-tune small subset of parameters.
LoRA: Low-Rank Adaptation
Original: $\vh = \mW \vx$ where $\mW \in \R^{d \times d}$
LoRA: $\vh = \mW \vx + \mB \mA \vx$ (with $\mW$ frozen; only $\mA$ and $\mB$ are trained)
where $\mA \in \R^{r \times d}$, $\mB \in \R^{d \times r}$, and $r \ll d$ (typically $r = 4$ to $64$).
Parameters:
- Original: $d^2$ (frozen)
- LoRA: $2rd$ (trainable)
- Reduction: $\frac{2rd}{d^2} = \frac{2r}{d}$
Single attention layer:
- $\mW^Q, \mW^K, \mW^V, \mW^O \in \R^{12288 \times 12288}$
- Original params: $4 \times 12288^2 = 604M$
LoRA params per layer (rank $r = 8$): $4 \times 2rd = 4 \times 2 \times 8 \times 12288 \approx 0.79$M
96 layers total:
- LoRA params: $96 \times 0.79M = 75.8M$
- Full model: 175B
- Reduction: 2,300× (train only 0.04\% of parameters!)
Performance: Matches full fine-tuning on many tasks!
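The low-rank decomposition above is easy to implement. Below is a dependency-free sketch of a LoRA linear layer (pure-Python lists for clarity; real implementations use tensors, and only $\mA$ and $\mB$ receive gradients):

```python
import random

def matvec(M, x):
    """Matrix-vector product for lists of lists."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha/r) * B A.
    B is initialized to zero, so the layer initially matches the frozen model."""
    def __init__(self, W, r, alpha=16, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(W), len(W[0])
        self.W = W                                    # frozen, d_out x d_in
        self.A = [[rng.gauss(0, 0.02) for _ in range(d_in)]
                  for _ in range(r)]                  # r x d_in, trainable
        self.B = [[0.0] * r for _ in range(d_out)]    # d_out x r, trainable
        self.scale = alpha / r

    def __call__(self, x):
        base = matvec(self.W, x)
        low_rank = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, low_rank)]
```

Because the update is additive, $\mB\mA$ can be merged into $\mW$ after training, so LoRA adds no inference latency.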
Adapter Layers
$\vh = \vx + \text{FFN}_{\text{adapter}}(\vx)$, where FFN$_{\text{adapter}}$: $d \to d_{\text{bottleneck}} \to d$ with $d_{\text{bottleneck}} \ll d$.
Typical bottleneck: $d_{\text{bottleneck}} = 64$ for $d = 768$
Parameters per adapter: $2 \times d \times d_{\text{bottleneck}} + d_{\text{bottleneck}} + d = 99{,}136 \approx 0.1$M (weights plus biases for $d = 768$, $d_{\text{bottleneck}} = 64$)
Prompt Tuning
Input: $[\vp_1, \ldots, \vp_k, \vx_1, \ldots, \vx_n]$
where $\vp_i \in \R^d$ are learned continuous prompts (not discrete tokens).
Parameters: Only $k \times d$ prompt vectors (model frozen).
Typical: $k = 20$ prompts, $d = 768$ $\to$ only 15,360 parameters!
Multi-Task and Multi-Stage Pre-training
Multi-Task Pre-training
Train on multiple objectives simultaneously, with a weighted sum of losses: $\mathcal{L} = \sum_i \lambda_i \mathcal{L}_i$
Example (T5):
- Span corruption (main)
- Prefix LM
- Deshuffling
Benefits:
- More robust representations
- Better transfer to diverse tasks
- Can balance objectives with $\lambda_i$
Multi-Stage Pre-training
Stage 1: General pre-training
- Large diverse corpus
- Language modeling
- Build general knowledge
Stage 2: Instruction tuning
- Instruction-response pairs
- Learn to follow instructions
- Improve helpfulness
Stage 3: RLHF
- Reinforcement learning from human feedback
- Align with human preferences
- Improve safety
The InstructGPT pipeline makes this concrete (Stage 1 is standard GPT-3 pre-training):
Stage 2: Supervised fine-tuning
- 13,000 instruction-output examples
- Fine-tune for 16 epochs
- Learning rate: $9.65 \times 10^{-6}$
Stage 3: Reward modeling
- 33,000 comparison examples
- Train 6B reward model
- Predicts human preferences
Stage 4: PPO optimization
- 31,000 prompts
- Optimize policy to maximize reward
- KL penalty from SFT model
Result: 1.3B InstructGPT preferred over 175B GPT-3 by humans!
Transfer Learning Analysis
Measuring Transfer
Metrics:
1. Downstream Performance:
- Accuracy or F1 on target tasks after fine-tuning
- Compared against a randomly initialized baseline
2. Sample Efficiency:
- Number of examples to reach target performance
- Pre-trained models: 10-100× fewer examples
3. Convergence Speed:
- Training steps to convergence
- Pre-trained: 10× faster
What Makes Good Pre-training?
Data scale: More data $\to$ better transfer (up to a point)
Data diversity: Diverse pre-training $\to$ broader transfer
Model scale: Larger models transfer better
Objective alignment: Pre-training objective similar to downstream task
Domain match: Domain-specific pre-training helps domain-specific tasks
Exercises
- Train BERT-tiny with: (a) MLM, (b) CLM, (c) Span corruption
- Evaluate on GLUE tasks
- Which objective transfers best? Why?
- Download 10,000 documents from Common Crawl
- Remove duplicates (exact and near-duplicate)
- Filter by language (keep English)
- Filter low-quality (perplexity > threshold)
- Report statistics at each stage
- Load pre-trained GPT-2
- Add LoRA layers with $r=8$ to attention
- Fine-tune on sentiment analysis
- Compare: (a) Full fine-tuning, (b) LoRA, (c) Frozen
- Measure: parameters trained, memory, accuracy
- Fine-tune BERT on 5 GLUE tasks
- Vary training data: [100, 500, 1000, 5000, all]
- Compare to training from scratch
- Plot sample efficiency curves
- At what point does pre-training stop helping?
Solutions
Full solutions for all exercises are available at \url{https://deeplearning.hofkensvermeulen.be}.
Expected Results on GLUE:
| Objective | MNLI | QQP | SST-2 | Avg |
|---|---|---|---|---|
| MLM (BERT-style) | 78.3 | 85.2 | 89.7 | 84.4 |
| CLM (GPT-style) | 72.1 | 81.5 | 87.3 | 80.3 |
| Span Corruption (T5-style) | 76.8 | 84.1 | 88.9 | 83.3 |
Analysis:
Part (c): Which transfers best?
MLM (Masked Language Modeling) wins because:
- Bidirectional context: Sees both left and right context
- Better for understanding: GLUE tasks require comprehension
- Sentence-level tasks: MLM learns sentence representations
- Natural for classification: [CLS] token aggregates information
CLM (Causal Language Modeling) is weaker because:
- Only left context (unidirectional)
- Optimized for generation, not understanding
- No explicit sentence representation
- Strength lies in generation, which GLUE does not test
Span Corruption is middle ground:
- Bidirectional like MLM
- Longer spans than single tokens
- Good for seq2seq tasks
- Slightly worse than MLM for classification
Key Insight: Pre-training objective should match downstream task type:
- Understanding tasks: Use MLM (BERT)
- Generation tasks: Use CLM (GPT)
- Seq2seq tasks: Use span corruption (T5)
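The MLM recipe discussed above (mask $\sim$15\% of tokens, with BERT's 80/10/10 replacement rule: 80\% become [MASK], 10\% a random token, 10\% stay unchanged) can be sketched as follows. Token IDs and the helper name are illustrative:

```python
import random

def mask_tokens(tokens, mask_id, vocab_size, mask_prob=0.15, seed=0):
    """BERT-style masking sketch: pick ~15% of positions as prediction
    targets; of those, 80% become [MASK], 10% a random token, and 10%
    keep the original token. Returns (corrupted tokens, target positions).
    """
    rng = random.Random(seed)
    corrupted, targets = list(tokens), []
    for i in range(len(tokens)):
        if rng.random() < mask_prob:
            targets.append(i)
            roll = rng.random()
            if roll < 0.8:
                corrupted[i] = mask_id          # replace with [MASK]
            elif roll < 0.9:
                corrupted[i] = rng.randrange(vocab_size)  # random token
            # else: keep the original token (model still predicts it)
    return corrupted, targets
```

The loss is then computed only at the returned target positions, which is why MLM sees gradient signal from roughly 15\% of tokens per step while CLM gets signal from every position.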
Exercise 2: Data Curation Pipeline
Pipeline Statistics:
| Stage | Documents | Retention |
|---|---|---|
| Initial (Common Crawl) | 10,000 | 100\% |
| After exact deduplication | 8,742 | 87.4\% |
| After near-duplicate removal | 7,891 | 78.9\% |
| After language filtering | 6,234 | 62.3\% |
| After quality filtering | 4,567 | 45.7\% |
Key Steps:
1. Exact Deduplication:
- Hash each document (MD5/SHA256)
- Remove documents with identical hashes
- Removes 12.6\% duplicates
2. Near-Duplicate Removal:
- Use MinHash LSH for similarity detection
- Threshold: 90\% similarity
- Removes 9.7\% near-duplicates
3. Language Filtering:
- Use fastText language detector
- Keep only English (confidence > 0.9)
- Removes 21.0\% non-English
4. Quality Filtering:
- Compute perplexity with GPT-2
- Threshold: perplexity < 1000
- Removes 26.7\% low-quality
Final Result: 4,567 high-quality English documents (45.7\% retention)
Quality Metrics:
- Average perplexity: 342 (vs 1,247 before filtering)
- Average length: 1,834 tokens (vs 892 before)
- English confidence: 0.97 (vs 0.73 before)
Key Insight: Aggressive filtering (54.3\% removed) dramatically improves data quality, leading to better pre-training outcomes.
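The first two pipeline stages can be sketched in a few lines. Exact deduplication hashes each document; near-duplicate removal is shown here as a brute-force Jaccard check over word shingles, a quadratic simplification of the MinHash LSH approach named above:

```python
import hashlib

def exact_dedup(docs):
    """Stage 1: drop documents with an identical SHA-256 hash."""
    seen, kept = set(), []
    for d in docs:
        h = hashlib.sha256(d.encode("utf-8")).hexdigest()
        if h not in seen:
            seen.add(h)
            kept.append(d)
    return kept

def shingles(doc, n=3):
    """Word n-gram shingles used for near-duplicate similarity."""
    words = doc.split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def near_dedup(docs, threshold=0.9):
    """Stage 2 (simplified): drop a document whose shingle-set Jaccard
    similarity with an already-kept document meets the threshold.
    Production pipelines use MinHash LSH to avoid the O(n^2) scan."""
    kept, kept_shingles = [], []
    for d in docs:
        s = shingles(d)
        dup = any(len(s & t) / len(s | t) >= threshold
                  for t in kept_shingles if s or t)
        if not dup:
            kept.append(d)
            kept_shingles.append(s)
    return kept
```

Language and quality filtering follow the same keep/drop pattern, with the predicate swapped for a fastText language score or a GPT-2 perplexity threshold.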
Exercise 3: Parameter-Efficient Fine-tuning
Comparison Results:
| Method | Params Trained | Memory (GB) | Accuracy |
|---|---|---|---|
| Full Fine-tuning | 124M (100\%) | 8.2 | 94.3\% |
| LoRA ($r=8$) | 0.3M (0.24\%) | 2.1 | 93.8\% |
| Frozen (linear probe) | 0.001M (0.001\%) | 1.8 | 87.2\% |
Analysis:
LoRA Advantages:
- Parameter efficiency: 400× fewer trainable parameters
- Memory efficiency: 4× less memory (no optimizer states for frozen parameters)
- Near-full performance: only a 0.5\% accuracy drop
- Fast training: 3× faster per epoch
- Modular: Can swap LoRA adapters for different tasks
When to use each:
- Full fine-tuning: Maximum accuracy, sufficient compute
- LoRA: Limited compute, multiple tasks, near-optimal accuracy
- Frozen: Extremely limited resources, quick baseline
LoRA Configuration:
- Rank $r=8$: Good balance (try 4, 16, 32 for tuning)
- Apply to: Query and Value projections
- Alpha: $\alpha = 16$ (scaling factor)
- Dropout: 0.1 on LoRA layers
Key Insight: LoRA achieves 99.5\% of full fine-tuning performance with 0.24\% of parameters, making it ideal for resource-constrained scenarios.
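A LoRA layer adds a low-rank update to a frozen weight: $h = Wx + \frac{\alpha}{r} BAx$, where $A \in \mathbb{R}^{r \times d_{\text{in}}}$ and $B \in \mathbb{R}^{d_{\text{out}} \times r}$ are the only trainable matrices and $B$ is initialized to zero, so the adapted model starts identical to the base model. A pure-Python sketch of the forward pass (a real implementation would use torch tensors):

```python
def matvec(M, v):
    """Matrix-vector product over nested lists."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def lora_forward(W, A, B, x, alpha=16.0):
    """LoRA forward pass: h = W x + (alpha / r) * B (A x).

    W is the frozen d_out x d_in weight; A (r x d_in) and B (d_out x r)
    are the trainable low-rank factors. With B = 0 at initialization,
    the output equals the frozen layer's output exactly.
    """
    r = len(A)                          # rank = number of rows of A
    base = matvec(W, x)                 # frozen path
    delta = matvec(B, matvec(A, x))     # low-rank update path
    return [b + (alpha / r) * d for b, d in zip(base, delta)]
```

Because only $A$ and $B$ receive gradients, optimizer state is kept for $2rd$ parameters per adapted matrix instead of $d^2$, which is where the memory savings in the table come from.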
Exercise 4: Sample Efficiency
Sample Efficiency Results:
| Training Size | 100 | 500 | 1000 | 5000 | All |
|---|---|---|---|---|---|
| Pre-trained BERT | 72.3 | 81.7 | 85.2 | 88.9 | 90.1 |
| From Scratch | 51.2 | 63.8 | 71.4 | 82.3 | 87.8 |
| Improvement | +21.1 | +17.9 | +13.8 | +6.6 | +2.3 |
Analysis:
Key Observations:
- Largest gains with small data: +21.1 points at 100 examples
- Diminishing returns: Improvement drops as data increases
- Convergence point: Around 10k-50k examples
- Always helpful: Even with full data, a +2.3-point improvement
Sample Efficiency Curve:
The gap between pre-trained and from-scratch narrows as:
- More task-specific data becomes available
- Model learns task-specific patterns
- Pre-training advantage diminishes
When pre-training stops helping significantly:
- Threshold: Around 10,000-50,000 labeled examples
- Task complexity: Complex tasks need more data
- Domain shift: Large domain gap reduces benefit
- Model size: Larger models need more data to catch up
Practical Recommendations:
| Data Size | Strategy |
|---|---|
| < 1,000 | Always use pre-training (critical) |
| 1,000 - 10,000 | Pre-training highly beneficial |
| 10,000 - 50,000 | Pre-training still helps |
| > 50,000 | Pre-training provides marginal gains |
Key Insight: Pre-training is most valuable in low-resource settings, providing 10-20 point accuracy improvements with < 1,000 examples, but its benefits diminish as task-specific data grows.
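The diminishing-returns pattern in the results table above can be verified numerically from the reported accuracies:

```python
sizes = [100, 500, 1000, 5000]          # training-set sizes from the table
pretrained   = [72.3, 81.7, 85.2, 88.9]
from_scratch = [51.2, 63.8, 71.4, 82.3]

# Accuracy gap (in points) at each training-set size.
gaps = [round(p - s, 1) for p, s in zip(pretrained, from_scratch)]
# The gap shrinks monotonically: pre-training helps most when labels are scarce.
assert all(a > b for a, b in zip(gaps, gaps[1:]))
```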