Pre-training Strategies and Transfer Learning
Chapter Overview
Pre-training on large unlabeled corpora followed by task-specific fine-tuning has become the dominant paradigm in deep learning. This chapter covers pre-training objectives, data curation, curriculum learning, continual pre-training, and transfer learning strategies for maximizing downstream performance.
Learning Objectives
- Understand different pre-training objectives and their trade-offs
- Curate and process pre-training data at scale
- Apply curriculum learning and domain-adaptive pre-training
- Implement parameter-efficient fine-tuning (LoRA, adapters)
- Design multi-task and multi-stage pre-training
- Measure and improve transfer learning effectiveness
Pre-training Objectives
Language Modeling Objectives
The choice of pre-training objective fundamentally shapes a model's capabilities. The main objectives are:
- Causal Language Modeling (CLM): Predict each token given only previous context: $\mathcal{L}_{\text{CLM}} = -\sum_{t} \log P(x_t | x_{<t}; \theta)$. Used by GPT-style models (Chapter~[ref]); the natural objective for text generation.
- Masked Language Modeling (MLM): Mask $\sim$15\% of tokens and predict from bidirectional context: $\mathcal{L}_{\text{MLM}} = -\sum_{t \in \mathcal{M}} \log P(x_t | x_{\backslash \mathcal{M}}; \theta)$. Used by BERT (Chapter~[ref]). Rich representations for understanding tasks.
- Prefix Language Modeling: Bidirectional attention on a prefix, causal attention on the suffix. Combines benefits of CLM and MLM. Used by UniLM and GLM.
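The MLM corruption above is straightforward to implement. Below is a minimal sketch of BERT-style masking with the standard 80/10/10 split among [MASK], random token, and unchanged; `mask_for_mlm` and its arguments are illustrative names, and `-100` follows the common convention for label positions ignored by the cross-entropy loss.

```python
import random

def mask_for_mlm(token_ids, vocab_size, mask_id, mask_prob=0.15, seed=0):
    """BERT-style MLM corruption: select ~15% of positions; of those,
    80% become [MASK], 10% become a random token, 10% stay unchanged.
    Returns (corrupted_ids, labels) with labels = -100 at unmasked positions."""
    rng = random.Random(seed)
    corrupted = list(token_ids)
    labels = [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() < mask_prob:
            labels[i] = tok  # the model must predict the original token here
            r = rng.random()
            if r < 0.8:
                corrupted[i] = mask_id                    # 80%: [MASK]
            elif r < 0.9:
                corrupted[i] = rng.randrange(vocab_size)  # 10%: random token
            # else 10%: keep the original token
    return corrupted, labels
```

Keeping 10% of selected tokens unchanged forces the model to produce useful representations even for tokens that are not visibly corrupted.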
Denoising Objectives
- Span Corruption (T5): Replace random spans with sentinel tokens; decoder predicts original spans. See Chapter~[ref].
- Multi-task Denoising (BART): Combines token masking, deletion, text infilling, sentence permutation, and document rotation. See Chapter~[ref].
Computational costs are similar across objectives for the same model size and sequence length: a single forward pass dominates, with differences arising mainly in the decoder target length for seq2seq objectives (span corruption costs $\sim$3$\times$ more in decoder computation than single-token MLM).
Contrastive Objectives
Contrastive Learning: Learn representations by pulling positive pairs together and pushing negatives apart in embedding space. The standard InfoNCE loss for an anchor embedding $\vz$ with positive $\vz^+$ and negatives $\vz_j^-$ is:
$\mathcal{L}_{\text{InfoNCE}} = -\log \frac{\exp(\text{sim}(\vz, \vz^+)/\tau)}{\exp(\text{sim}(\vz, \vz^+)/\tau) + \sum_j \exp(\text{sim}(\vz, \vz_j^-)/\tau)}$
where $\text{sim}$ is typically cosine similarity and $\tau$ is a temperature hyperparameter.
Applications:
- SimCLR (vision): Augmented views as positives
- CLIP: Image-text pairs
- SimCSE (text): Dropout as augmentation
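The applications above all optimize a softmax over similarity scores. A minimal pure-Python sketch of the InfoNCE loss follows (function names are illustrative; real implementations batch this with tensor operations):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE: negative log-softmax score of the positive among all candidates."""
    scores = [cosine(anchor, positive) / temperature]
    scores += [cosine(anchor, n) / temperature for n in negatives]
    m = max(scores)  # subtract the max to stabilize the softmax
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return -(scores[0] - log_z)
```

The loss is near zero when the positive is far more similar to the anchor than any negative, and grows as negatives become competitive.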
Data Curation and Processing
Data Scale and Requirements
The scale of pre-training data has grown exponentially over the past few years, driven by empirical findings that larger datasets consistently improve model performance. Understanding the data requirements, storage costs, and preprocessing overhead is essential for planning pre-training projects.
BERT was pre-trained on approximately 16 GB of text data, consisting of BooksCorpus (800 million words) and English Wikipedia (2.5 billion words). This relatively modest dataset size reflects BERT's focus on high-quality, curated text rather than massive web crawls. The 16 GB of raw text expands to approximately 3.3 billion tokens using BERT's WordPiece tokenizer with a 30,000 token vocabulary. Training BERT-base for 1 million steps with batch size 256 and sequence length 512 means the model sees each token approximately 40 times on average, indicating significant data reuse through multiple epochs. The storage requirements are minimal by modern standards—16 GB of compressed text expands to perhaps 50 GB including tokenized data and intermediate preprocessing artifacts.
GPT-2 scaled up to approximately 40 GB of text from WebText, a dataset created by scraping outbound links from Reddit posts with at least 3 karma. This filtering strategy aimed to identify high-quality content as judged by the Reddit community. The 40 GB corpus contains roughly 8 billion tokens using GPT-2's byte-pair encoding with a 50,257 token vocabulary. GPT-2's largest variant (1.5B parameters) was trained for approximately 1 million steps, seeing each token roughly 10 times. The preprocessing pipeline for WebText involved deduplication, filtering by language, and removing low-quality content, reducing the raw crawl from over 100 GB to the final 40 GB. Storage requirements including raw data, filtered data, and tokenized sequences total approximately 150 GB.
GPT-3 made a massive leap to approximately 570 GB of text, totaling roughly 300 billion tokens. This dataset combines filtered Common Crawl (410 GB), WebText2 (19 GB), Books1 (12 GB), Books2 (55 GB), and Wikipedia (3 GB). The preprocessing pipeline for Common Crawl is particularly intensive: the raw crawl contains petabytes of data, which must be filtered by language, deduplicated, and quality-filtered to produce the final 410 GB. This filtering process itself requires substantial computational resources—processing petabytes of data through language classifiers and deduplication algorithms takes weeks on large clusters. The total storage requirements for GPT-3 pre-training, including raw data, filtered data, tokenized sequences, and training checkpoints, exceed 5 TB. The preprocessing cost alone is estimated at tens of thousands of dollars in compute time.
LLaMA pushed the scale even further to approximately 1.4 TB of text, totaling roughly 1.4 trillion tokens. This dataset consists primarily of Common Crawl (67\%), C4 (15\%), GitHub (4.5\%), Wikipedia (4.5\%), books (4.5\%), ArXiv (2.5\%), and StackExchange (2\%). The inclusion of code from GitHub and technical content from ArXiv and StackExchange reflects a deliberate strategy to improve reasoning and technical capabilities. The preprocessing pipeline for LLaMA is even more sophisticated than GPT-3, using multiple quality filters including perplexity-based filtering, classifier-based filtering, and extensive deduplication. The total storage requirements exceed 10 TB including all preprocessing artifacts, and the preprocessing cost is estimated at over \$100,000 in compute time.
Data Quality versus Quantity
The relationship between data quality and quantity is not straightforward—more data does not always lead to better models if the quality is poor. Recent research has shown that careful data curation can match or exceed the performance of models trained on much larger but noisier datasets.
High-quality datasets like Wikipedia and books consistently improve model performance even when they represent a small fraction of total training data. GPT-3's data mixture heavily upsamples Wikipedia relative to its share of the corpus (3.4 epochs versus 0.44 epochs for Common Crawl), reflecting the higher quality and information density of Wikipedia text. This upsampling strategy means that despite Wikipedia being only 3 GB of the 570 GB total, it contributes disproportionately to the model's knowledge and capabilities.
The preprocessing cost for achieving high data quality is substantial. Language identification using fastText classifiers requires processing every document, taking approximately 1 CPU-hour per 100 GB of text. Deduplication using MinHash LSH is even more expensive, requiring approximately 10 CPU-hours per 100 GB for computing signatures and finding near-duplicates. Quality filtering using perplexity-based methods requires running a language model over the entire corpus, taking approximately 100 GPU-hours per 100 GB. For GPT-3's 570 GB dataset, the total preprocessing cost exceeds 50,000 CPU-hours and 5,000 GPU-hours, translating to roughly \$30,000 in cloud computing costs.
The storage requirements for preprocessing are also significant. Deduplication requires storing hash signatures for all documents, typically requiring 100-200 bytes per document. For a corpus with 1 billion documents, this means 100-200 GB of signature storage. Near-duplicate detection using LSH requires storing multiple hash tables, potentially doubling or tripling this storage requirement. Quality filtering requires storing perplexity scores or classifier outputs for all documents, adding another 10-20 GB. In total, the preprocessing metadata can require 500 GB to 1 TB of storage for a large corpus, comparable to the size of the corpus itself.
Data Filtering and Cleaning
Step 1: Quality Filtering
Step 2: Deduplication
Step 3: Privacy
Step 4: Formatting
| Dataset | Weight | Epochs |
|---|---|---|
| Common Crawl (filtered) | 60\% | 0.44 |
| WebText2 | 22\% | 2.9 |
| Books1 | 8\% | 1.9 |
| Books2 | 8\% | 0.43 |
| Wikipedia | 3\% | 3.4 |
Higher-quality sources sampled more frequently (multiple epochs). Lower-quality sources seen less to avoid overfitting to noise.
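Weighted mixture sampling like the table above reduces to a few lines. The sketch below assumes each draw independently picks a source according to its mixture weight; `sample_mixture` is a hypothetical helper that returns which source each training document comes from:

```python
import random

def sample_mixture(sources, weights, n, seed=0):
    """Draw n training documents' source names according to mixture weights.
    A source whose weight exceeds its share of the corpus will be seen for
    more than one epoch over a full training run (e.g. Wikipedia in GPT-3)."""
    rng = random.Random(seed)
    names = list(sources)
    return [rng.choices(names, weights=weights)[0] for _ in range(n)]
```

In practice each source is also shuffled internally and streamed, but the per-draw source selection is the core of mixture weighting.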
Data Deduplication
Why deduplicate?
- Prevents memorization
- Better generalization
- Fairer evaluation (test set contamination)
Methods:
1. Exact Deduplication:

```python
import hashlib

def exact_dedup(corpus):
    """Keep only the first occurrence of each distinct document."""
    seen_hashes = set()
    kept = []
    for doc in corpus:
        # Stable content hash; Python's built-in hash() is salted per process
        hash_val = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if hash_val not in seen_hashes:
            kept.append(doc)
            seen_hashes.add(hash_val)
    return kept
```
2. Fuzzy Deduplication (MinHash):
- Compute MinHash signatures
- Use LSH for near-neighbor search
- Remove documents with Jaccard similarity $> 0.8$
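The MinHash steps above can be sketched without external libraries. This is a simplified illustration (production pipelines use LSH banding instead of all-pairs comparison, typically via a library such as datasketch):

```python
import hashlib

def shingles(text, k=5):
    """Character k-grams used as the document's feature set."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def minhash_signature(features, num_perm=64):
    """One minimum hash value per seeded hash function. The probability that
    two signatures agree at a position equals the Jaccard similarity of the
    underlying feature sets."""
    sig = []
    for seed in range(num_perm):
        sig.append(min(
            int(hashlib.md5(f"{seed}:{f}".encode()).hexdigest(), 16)
            for f in features))
    return sig

def estimated_jaccard(sig_a, sig_b):
    """Fraction of agreeing signature positions approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Documents whose estimated Jaccard similarity exceeds the chosen threshold (e.g. 0.8) are treated as near-duplicates and all but one copy is dropped.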
Training Compute Requirements
FLOPs Analysis
Understanding the computational requirements for pre-training large language models is essential for planning projects and estimating costs. The total compute is typically measured in FLOPs (floating-point operations), which can be calculated from model architecture and training configuration.
For a transformer model with $L$ layers, $d$ model dimension, $h$ attention heads, and feed-forward intermediate dimension $d_{ff}$ (typically $4d$), processing a single token requires approximately:
$\text{FLOPs per token} \approx 2 \times L \times (12d^2 + 16d^2) = 56Ld^2$
The factor of 2 accounts for both forward and backward passes (the backward pass is approximated here as costing the same as the forward pass). The $12d^2$ term covers attention: $8d^2$ for the query, key, value, and output projections (four $d \times d$ matrices at 2 FLOPs per multiply-accumulate), plus roughly $4d^2$ for computing and applying the attention weights when the sequence length is comparable to $d$. The $16d^2$ term comes from the feed-forward layers (two $d \times 4d$ projections).
For BERT-base with $L=12$, $d=768$, training on 3.3 billion tokens for 40 epochs (132 billion tokens total), the total training compute is approximately 0.5 zettaFLOPs ($5 \times 10^{20}$ FLOPs). Training on 16 TPU v3 chips (each providing 420 TFLOPS in mixed precision) for 4 days supplies:
$16 \times 420 \times 10^{12} \text{ FLOPS} \times 345{,}600 \text{ s} \approx 2.3 \times 10^{21} \text{ FLOPs}$
The ratio of available to required compute is approximately 4.4, indicating that BERT-base training achieves roughly 23\% hardware utilization. This is typical for large-scale training, where communication overhead, data loading, and other inefficiencies reduce effective utilization.
For GPT-3 175B with $L=96$, $d=12288$, training on 300 billion tokens:
$56 \times 96 \times 12288^2 \times 3 \times 10^{11} \approx 2.4 \times 10^{23} \text{ FLOPs}$
This is approximately 240 zettaFLOPs, nearly 500$\times$ more than BERT-base. The massive compute requirement reflects both the larger model (175B versus 110M parameters) and the larger dataset (300B versus 132B tokens). Training GPT-3 on approximately 10,000 NVIDIA V100 GPUs (each providing 125 TFLOPS in mixed precision) for 1 month suggests approximately 75\% hardware utilization, which is impressive for such a large-scale distributed training job. The higher utilization compared to BERT reflects improvements in distributed training infrastructure and optimization techniques.
For LLaMA-65B with $L=80$, $d=8192$, training on 1.4 trillion tokens, the total training compute is approximately 330 zettaFLOPs. Training on 2048 NVIDIA A100 GPUs (each providing 312 TFLOPS in mixed precision) for 21 days supplies:
$2048 \times 312 \times 10^{12} \text{ FLOPS} \times 1{,}814{,}400 \text{ s} \approx 1.2 \times 10^{24} \text{ FLOPs}$
This suggests approximately 28\% hardware utilization, which is lower than GPT-3 despite using more modern hardware. The lower utilization likely reflects communication overhead at this scale and the cost of streaming and processing the much larger dataset.
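The per-token approximation used throughout this section can be packaged as a small estimator. `training_flops` is an illustrative name; the formula is the $2 \times L \times (12d^2 + 16d^2)$ approximation described above (so it ignores sequence-length-dependent attention costs and embedding layers):

```python
def training_flops(num_layers, d_model, num_tokens):
    """Approximate total training FLOPs: a factor of 2 for forward+backward,
    times (12 d^2 attention + 16 d^2 feed-forward) per layer per token."""
    flops_per_token = 2 * num_layers * (12 * d_model**2 + 16 * d_model**2)
    return flops_per_token * num_tokens

# GPT-3 175B: L=96, d=12288, 300B tokens -> ~2.4e23 FLOPs (~240 zettaFLOPs)
gpt3 = training_flops(96, 12288, 300e9)
```

Dividing the result by (GPU count × peak FLOPS × wall-clock seconds) gives the implied hardware utilization for a training run.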
GPU-Hours and Cost Estimates
Translating FLOPs into GPU-hours and cost estimates provides a more practical understanding of training requirements. The cost depends heavily on the hardware platform and whether using cloud services or owned infrastructure.
BERT-base training on 16 TPU v3 chips for 4 days equals 1,536 TPU-hours. At Google Cloud's on-demand pricing of approximately \$8 per TPU v3 hour, this costs roughly \$12,000. Using preemptible TPUs at approximately \$2.40 per hour reduces the cost to roughly \$3,700. Using equivalent GPU resources (approximately 64 NVIDIA V100 GPUs to match 16 TPU v3 chips) at approximately \$2 per GPU-hour on AWS, the cost would total roughly \$12,000 for the 6,144 GPU-hours (64 GPUs for 4 days). The lower preemptible TPU cost reflects Google's optimization for transformer workloads and economies of scale.
GPT-3 175B training on 10,000 V100 GPUs for 1 month equals 7.2 million GPU-hours. At AWS on-demand pricing of approximately \$3 per V100 hour, this would cost \$21.6 million. However, OpenAI likely used a combination of owned infrastructure and negotiated cloud pricing, with estimates suggesting actual costs between \$4 million and \$12 million. The wide range reflects uncertainty about the exact hardware configuration, utilization rates, and pricing agreements. The training also required substantial infrastructure costs including high-bandwidth networking (InfiniBand or equivalent), distributed storage systems, and engineering effort to optimize the training pipeline.
LLaMA-65B training on 2048 A100 GPUs for 21 days equals 1.03 million GPU-hours. At cloud pricing of approximately \$3 per A100 hour, this would cost \$3.1 million. Meta's paper reports that LLaMA-65B training consumed approximately 1,022,362 GPU-hours on A100-80GB GPUs, closely matching this estimate. Using Meta's owned infrastructure rather than cloud services likely reduced the effective cost to \$1.5-2 million when accounting for hardware depreciation and operational costs. The A100's higher performance compared to V100 (312 versus 125 TFLOPS) means that LLaMA-65B required only 1/7 the GPU-hours of GPT-3 despite using comparable compute (330 versus 240 zettaFLOPs), demonstrating the importance of hardware efficiency.
Scaling Laws
Empirical scaling laws govern the relationship between compute budget, model size, dataset size, and performance (see Section~[ref] for the full treatment). The key insight from the Chinchilla scaling laws (Hoffmann et al., 2022) is that model size and dataset size should scale equally with compute budget: $N_{\text{opt}} \propto C^{0.5}$ and $D_{\text{opt}} \propto C^{0.5}$. This implies GPT-3 175B (trained on 300B tokens) was over-parameterized---a 70B model trained on 1.4T tokens (matching LLaMA-65B) achieves better performance for the same compute. This insight has driven the trend toward models like LLaMA, Chinchilla, and Mistral that prioritize training tokens over parameter count.
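The Chinchilla relations above can be turned into a back-of-the-envelope calculator, assuming the common approximations $C = 6ND$ and $D \approx 20N$ at the optimum (a rule of thumb consistent with Hoffmann et al.'s 70B-parameter / 1.4T-token result); `chinchilla_optimal` is an illustrative name:

```python
import math

def chinchilla_optimal(compute_flops, tokens_per_param=20.0):
    """Compute-optimal model size N and token count D under the
    approximations C = 6*N*D and D = tokens_per_param * N, which give
    N = sqrt(C / (6 * tokens_per_param)); both scale as C^0.5."""
    n_opt = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    d_opt = tokens_per_param * n_opt
    return n_opt, d_opt

# Chinchilla's budget (~5.76e23 FLOPs) -> roughly 70B params, 1.4T tokens
n, d = chinchilla_optimal(5.76e23)
```

Doubling the compute budget thus multiplies both the optimal parameter count and the optimal token count by $\sqrt{2}$.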
Curriculum Learning
Progressive Training Strategies
Curriculum learning applies the principle of learning from easy to hard examples, progressively increasing task difficulty during training. This approach can significantly improve training efficiency, convergence speed, and final model performance. The key insight is that models learn more effectively when they first master simpler patterns before tackling complex ones.
The most common curriculum strategy involves progressively increasing sequence length during training. Starting with shorter sequences reduces both memory requirements and computational cost in the early stages of training when the model is learning basic patterns. For a model with quadratic attention complexity $O(n^2d)$, doubling the sequence length quadruples the attention computation. Training BERT-base with sequence length 128 for the first 90\% of steps and then 512 for the final 10\% reduces total training time by approximately 30\% compared to using length 512 throughout. The shorter sequences allow larger batch sizes in early training, which improves gradient estimates and accelerates convergence. The model learns word-level and phrase-level patterns with short sequences, then refines its understanding of long-range dependencies with longer sequences.
Batch size curriculum is another effective strategy, gradually increasing batch size during training. Starting with smaller batches provides more frequent parameter updates, which helps the model escape poor local minima in early training. As training progresses and the model approaches convergence, larger batches provide more stable gradient estimates and better utilize hardware parallelism. GPT-3 training used a batch size curriculum, starting at 32,000 tokens per batch and gradually increasing to 3.2 million tokens per batch. This 100× increase in batch size was enabled by learning rate adjustments and gradient accumulation. The larger batch sizes in later training improved hardware utilization from approximately 30\% to over 70\%, significantly reducing training time.
Learning rate schedules are essential for curriculum learning, as the optimal learning rate changes with batch size and training progress. The linear warmup followed by cosine decay schedule has become standard for transformer pre-training. The warmup phase, typically 1-10\% of total steps, gradually increases the learning rate from near-zero to the peak value. This prevents the large gradient updates in early training from destabilizing the model. The cosine decay phase gradually reduces the learning rate to near-zero, allowing the model to fine-tune its parameters as it approaches convergence. For BERT-base, a warmup of 10,000 steps followed by linear decay over 990,000 steps works well. For GPT-3, a warmup of 375 million tokens (approximately 1\% of total training) followed by cosine decay proved effective.
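The warmup-plus-cosine schedule described above can be sketched in a few lines; `lr_schedule` is an illustrative helper:

```python
import math

def lr_schedule(step, total_steps, peak_lr, warmup_steps, min_lr=0.0):
    """Linear warmup from 0 to peak_lr over warmup_steps, then cosine
    decay to min_lr over the remaining steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

For the BERT-base numbers above, this would be called with `warmup_steps=10_000` and `total_steps=1_000_000` (BERT itself used linear rather than cosine decay, but the warmup behavior is the same).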
The impact on training efficiency is substantial. Curriculum learning can reduce training time by 20-40\% compared to fixed configurations while achieving equal or better final performance. For BERT-base, the sequence length curriculum reduces training from approximately 5.5 days to 4 days on the same hardware. For GPT-3, the batch size curriculum improved hardware utilization enough to reduce training time by an estimated 30\%, saving approximately \$3-4 million in compute costs. These savings make curriculum learning essential for large-scale pre-training projects.
Progressive Training
Stage 1: Easy examples (short sequences, simple patterns)
Stage 2: Medium difficulty
Stage 3: Full difficulty (long sequences, complex patterns)
Benefits:
- Faster convergence
- Better final performance
- More stable training
Example schedule (sequence-length curriculum for a large language model):
Stage 1 (0-100B tokens):
- Sequence length: 1024
- Batch size: 3.2M tokens
Stage 2 (100B-300B tokens):
- Sequence length: 2048
- Batch size: 3.2M tokens (fewer sequences)
Starting with shorter sequences reduces memory and computation early in training.
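A staged sequence-length schedule like the one above reduces to a simple lookup; `curriculum_seq_len` is an illustrative helper, with stages given as (token threshold, sequence length) pairs:

```python
def curriculum_seq_len(tokens_seen, stages=((100e9, 1024), (300e9, 2048))):
    """Return the training sequence length for the current token count.
    Each stage applies until its token threshold is reached; the final
    stage's length is used once all thresholds are passed."""
    for threshold, seq_len in stages:
        if tokens_seen < threshold:
            return seq_len
    return stages[-1][1]
```

The data loader queries this at each step and re-packs documents to the current length, keeping the token count per batch constant as sequences grow.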
Domain-Adaptive Pre-training
Continue pre-training on domain-specific data:
Step 1: Pre-train on general corpus (e.g., Common Crawl)
Step 2: Continue pre-training on domain data (e.g., biomedical)
Step 3: Fine-tune on task
Examples:
- BioBERT: BERT + PubMed/PMC
- SciBERT: BERT + scientific papers
- FinBERT: BERT + financial documents
- CodeBERT: BERT + code
Hardware Requirements and Infrastructure
BERT-base Training Infrastructure
Training BERT-base requires relatively modest infrastructure by modern standards, making it accessible to academic research groups and small companies. The original BERT paper reported training BERT-base on 16 TPU v3 chips for 4 days, providing a concrete reference point for hardware requirements.
Each TPU v3 chip provides 420 TFLOPS of mixed-precision compute (bfloat16) and 16 GB of high-bandwidth memory (HBM). The 16-chip configuration provides 6.7 PFLOPS total compute and 256 GB total memory. BERT-base with 110 million parameters requires approximately 440 MB for model weights in FP32, or 220 MB in FP16. With batch size 256 and sequence length 512, the activation memory per batch is approximately 8 GB, which fits comfortably in the 256 GB total memory when distributed across 16 chips. The high-bandwidth interconnect between TPU chips (approximately 100 GB/s per chip) enables efficient data parallelism with minimal communication overhead.
The training cost at Google Cloud's preemptible TPU pricing (approximately \$2.40 per TPU v3 hour) is roughly \$3,700 for the full 4-day training run. Using on-demand pricing (\$8 per hour) would increase this to \$12,000. For comparison, training on NVIDIA V100 GPUs would require approximately 64 GPUs for 4 days (6,144 GPU-hours) at a cost of approximately \$12,000 using AWS on-demand pricing. The equivalent training on A100 GPUs would require approximately 32 GPUs for 2.5 days (1,920 GPU-hours) at a cost of approximately \$6,000, demonstrating the improved efficiency of newer hardware.
The infrastructure requirements beyond compute include high-bandwidth storage for the training data (approximately 100 GB including tokenized sequences and preprocessing artifacts), network bandwidth for distributed training (at least 10 Gbps per GPU for efficient data parallelism), and monitoring infrastructure for tracking training metrics. The total infrastructure cost including storage, networking, and engineering time is typically 2-3× the raw compute cost, bringing the total BERT-base training cost to approximately \$10,000-15,000.
GPT-3 Training Infrastructure
Training GPT-3 175B requires massive infrastructure that is accessible only to large technology companies and well-funded research organizations. The scale of the training job presents significant engineering challenges beyond simply acquiring hardware.
The training used approximately 10,000 NVIDIA V100 GPUs, though the exact configuration has not been publicly disclosed. Each V100 provides 125 TFLOPS of mixed-precision compute and 32 GB of memory. The 10,000-GPU configuration provides 1.25 exaFLOPS total compute and 320 TB total memory. GPT-3 175B with 175 billion parameters requires approximately 700 GB for model weights in FP32, or 350 GB in FP16. With model parallelism across 8 GPUs, each GPU stores approximately 44 GB of model weights, leaving limited memory for activations. The batch size per GPU is typically 1-2 sequences of length 2048, requiring approximately 20 GB of activation memory per GPU.
The communication requirements are severe. With model parallelism across 8 GPUs and data parallelism across 1,250 groups, each training step requires all-reduce operations across the data parallel groups (approximately 700 GB of gradients) and all-to-all communication within model parallel groups (approximately 100 GB per step). At 100 Gbps network bandwidth per GPU, the gradient all-reduce takes approximately 70 seconds per step, which would dominate training time. To address this, GPT-3 training used gradient accumulation (accumulating gradients over multiple micro-batches before synchronizing) and high-bandwidth interconnects like InfiniBand (200 Gbps or higher), reducing communication time to approximately 10\% of total step time.
The training cost is estimated between \$4 million and \$12 million depending on assumptions about hardware ownership versus cloud rental, utilization rates, and pricing agreements. At AWS on-demand pricing of \$3 per V100 hour, the 7.2 million GPU-hours would cost \$21.6 million, but OpenAI likely achieved significant discounts through long-term commitments and negotiated pricing. The infrastructure costs beyond raw compute are substantial: high-bandwidth networking equipment (InfiniBand switches and cables) costs millions of dollars, distributed storage systems for the 5 TB of training data cost hundreds of thousands of dollars, and the engineering effort to build and optimize the training pipeline represents millions of dollars in labor costs.
The power consumption is also significant. Each V100 GPU consumes approximately 300 watts under full load, so 10,000 GPUs consume 3 megawatts. Over a 1-month training run, this equals 2,160 megawatt-hours of electricity. At typical data center electricity costs of \$0.10 per kWh, the electricity cost alone is \$216,000. Including cooling and power distribution overhead (typically 1.5-2× the compute power), the total power cost approaches \$400,000.
LLaMA-65B Training Infrastructure
Training LLaMA-65B represents a more efficient approach than GPT-3, using fewer but more powerful GPUs and a more optimized training pipeline. Meta's paper provides detailed information about the infrastructure and costs.
The training used 2,048 NVIDIA A100-80GB GPUs for 21 days, totaling 1,022,362 GPU-hours. Each A100-80GB provides 312 TFLOPS of mixed-precision compute and 80 GB of memory, representing a significant improvement over V100 (2.5× compute, 2.5× memory). The 2,048-GPU configuration provides 639 PFLOPS total compute and 164 TB total memory. LLaMA-65B with 65 billion parameters requires approximately 260 GB for model weights in FP32, or 130 GB in FP16. With model parallelism across 8 GPUs, each GPU stores approximately 16 GB of model weights, leaving substantial memory for activations and optimizer states.
The larger memory capacity of A100-80GB enables more efficient training configurations. LLaMA uses a batch size of 4 million tokens (approximately 2,000 sequences of length 2048), distributed across 2,048 GPUs as 2 sequences per GPU. The activation memory per GPU is approximately 40 GB, and the optimizer states (using AdamW) require approximately 32 GB, totaling approximately 88 GB per GPU. This fits comfortably in the 80 GB memory, avoiding the need for activation checkpointing or other memory-saving techniques that would slow training.
The communication requirements are more manageable than GPT-3 due to the smaller model size and more efficient hardware. With model parallelism across 8 GPUs and data parallelism across 256 groups, each training step requires all-reduce operations across data parallel groups (approximately 260 GB of gradients) and all-to-all communication within model parallel groups (approximately 30 GB per step). Using NVIDIA's NVLink and NVSwitch interconnects (600 GB/s per GPU within a node, 200 Gbps between nodes), the communication time is approximately 5\% of total step time, demonstrating excellent scaling efficiency.
The training cost is estimated at \$2-3 million using Meta's owned infrastructure. At cloud pricing of \$3 per A100 hour, the 1,022,362 GPU-hours would cost \$3.1 million. Meta's owned infrastructure likely reduced the effective cost to \$1.5-2 million when accounting for hardware depreciation (A100 GPUs cost approximately \$10,000 each, depreciated over 3-5 years) and operational costs. The power consumption is approximately 1 megawatt (2,048 GPUs × 400 watts per A100), totaling 504 megawatt-hours over 21 days. At \$0.10 per kWh including cooling overhead, the electricity cost is approximately \$75,000.
The infrastructure requirements include high-bandwidth networking (NVIDIA InfiniBand or equivalent), distributed storage systems (approximately 15 TB for training data and checkpoints), and monitoring infrastructure. Meta's paper notes that they used a custom training framework optimized for their infrastructure, with careful attention to memory management, communication patterns, and fault tolerance. The engineering effort to build this infrastructure and optimize the training pipeline represents a significant investment beyond the raw hardware costs.
Efficient Pre-training Techniques
Large-scale pre-training relies on several key optimizations, covered in detail in Chapter~[ref]:
- Mixed precision training (FP16/BF16 with FP32 master weights): 1.5--2$\times$ speedup, $\sim$42\% memory reduction. Requires loss scaling to prevent gradient underflow.
- Gradient checkpointing: Stores only $O(\sqrt{L})$ activations, recomputing the rest during backward pass. 5--10$\times$ memory reduction at $\sim$33\% slowdown.
- ZeRO optimizer: Partitions optimizer states (Stage 1, 4$\times$ savings), gradients (Stage 2, 8$\times$), and parameters (Stage 3, up to 64$\times$) across data-parallel GPUs.
- Pipeline parallelism: Divides model across GPUs by layers, pipelining micro-batches. With $m = 4p$ micro-batches for $p$ stages, bubble overhead is $\sim$25\%.
For GPT-3 training, the combination of ZeRO Stage 3, gradient checkpointing, pipeline parallelism (8 stages), and data parallelism achieved $\sim$75\% hardware utilization on V100 GPUs.
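The ZeRO savings above can be estimated with simple byte accounting, assuming mixed-precision Adam (2-byte FP16 parameters and gradients, plus 12 bytes/parameter of optimizer state: FP32 master weights and the two Adam moments). `zero_memory_per_gpu_gb` is an illustrative helper; the numbers roughly reproduce the ZeRO paper's 7.5B-parameter, 64-GPU example:

```python
def zero_memory_per_gpu_gb(num_params, num_gpus, stage=0):
    """Per-GPU memory (GB) for model states under mixed-precision Adam.
    ZeRO partitions progressively: stage 1 shards optimizer state,
    stage 2 also shards gradients, stage 3 also shards parameters."""
    params_b, grads_b, opt_b = 2.0, 2.0, 12.0  # bytes per parameter
    if stage >= 1:
        opt_b /= num_gpus
    if stage >= 2:
        grads_b /= num_gpus
    if stage >= 3:
        params_b /= num_gpus
    return num_params * (params_b + grads_b + opt_b) / 1e9

# 7.5B params on 64 GPUs: ~120 GB baseline, ~1.9 GB per GPU at stage 3
```

Activation memory is not included here; it is addressed separately by gradient checkpointing.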
Parameter-Efficient Fine-tuning
Motivation
Full fine-tuning challenges:
- Requires storing full model copy per task
- 175B model $\times$ 100 tasks = 17.5T parameters!
- Expensive and slow
Solution: Fine-tune small subset of parameters.
LoRA: Low-Rank Adaptation
Original: $\vh = \mW \vx$ where $\mW \in \R^{d \times d}$
LoRA: $\vh = \mW \vx + \mB \mA \vx$ (with $\mW$ frozen; only $\mA$ and $\mB$ are trained)
where $\mA \in \R^{r \times d}$, $\mB \in \R^{d \times r}$, and $r \ll d$ (typically $r = 4$ to $64$).
Parameters:
- Original: $d^2$ (frozen)
- LoRA: $2rd$ (trainable)
- Reduction: $\frac{2rd}{d^2} = \frac{2r}{d}$
Single attention layer:
- $\mW^Q, \mW^K, \mW^V, \mW^O \in \R^{12288 \times 12288}$
- Original params: $4 \times 12288^2 = 604M$
LoRA params per layer (rank $r = 8$): $4 \times 2rd = 4 \times 2 \times 8 \times 12288 \approx 0.79$M
96 layers total:
- LoRA params: $96 \times 0.79M = 75.8M$
- Full model: 175B
- Reduction: 2,300× (train only 0.04\% of parameters!)
Performance: Matches full fine-tuning on many tasks!
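The low-rank decomposition above is easy to implement. Below is a dependency-free sketch of a LoRA linear layer (pure-Python lists for clarity; real implementations use tensors, and only $\mA$ and $\mB$ receive gradients):

```python
import random

def matvec(M, x):
    """Matrix-vector product for lists of lists."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha/r) * B A.
    B is initialized to zero, so the layer initially matches the frozen model."""
    def __init__(self, W, r, alpha=16, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(W), len(W[0])
        self.W = W                                    # frozen, d_out x d_in
        self.A = [[rng.gauss(0, 0.02) for _ in range(d_in)]
                  for _ in range(r)]                  # r x d_in, trainable
        self.B = [[0.0] * r for _ in range(d_out)]    # d_out x r, trainable
        self.scale = alpha / r

    def __call__(self, x):
        base = matvec(self.W, x)
        low_rank = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, low_rank)]
```

Because the update is additive, $\mB\mA$ can be merged into $\mW$ after training, so LoRA adds no inference latency.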
Adapter Layers
$\vh = \vx + \text{FFN}_{\text{adapter}}(\vx)$, where FFN$_{\text{adapter}}$: $d \to d_{\text{bottleneck}} \to d$ with $d_{\text{bottleneck}} \ll d$.
Typical bottleneck: $d_{\text{bottleneck}} = 64$ for $d = 768$
Parameters per adapter: $2 \times d \times d_{\text{bottleneck}} + d_{\text{bottleneck}} + d = 99{,}136 \approx 0.1$M (weights plus biases for $d = 768$, $d_{\text{bottleneck}} = 64$)
Prompt Tuning
Input: $[\vp_1, \ldots, \vp_k, \vx_1, \ldots, \vx_n]$
where $\vp_i \in \R^d$ are learned continuous prompts (not discrete tokens).
Parameters: Only $k \times d$ prompt vectors (model frozen).
Typical: $k = 20$ prompts, $d = 768$ $\to$ only 15,360 parameters!
Multi-Task and Multi-Stage Pre-training
Multi-Task Pre-training
Train on multiple objectives simultaneously, with a weighted sum of losses: $\mathcal{L} = \sum_i \lambda_i \mathcal{L}_i$
Example (T5):
- Span corruption (main)
- Prefix LM
- Deshuffling
Benefits:
- More robust representations
- Better transfer to diverse tasks
- Can balance objectives with $\lambda_i$
Multi-Stage Pre-training
Stage 1: General pre-training
- Large diverse corpus
- Language modeling
- Build general knowledge
Stage 2: Instruction tuning
- Instruction-response pairs
- Learn to follow instructions
- Improve helpfulness
Stage 3: RLHF
- Reinforcement learning from human feedback
- Align with human preferences
- Improve safety
The InstructGPT pipeline makes this concrete (Stage 1 is standard GPT-3 pre-training):
Stage 2: Supervised fine-tuning
- 13,000 instruction-output examples
- Fine-tune for 16 epochs
- Learning rate: $9.65 \times 10^{-6}$
Stage 3: Reward modeling
- 33,000 comparison examples
- Train 6B reward model
- Predicts human preferences
Stage 4: PPO optimization
- 31,000 prompts
- Optimize policy to maximize reward
- KL penalty from SFT model
Result: 1.3B InstructGPT preferred over 175B GPT-3 by humans!
Transfer Learning Analysis
Measuring Transfer
Metrics:
1. Downstream Performance:
- Accuracy or F1 on target tasks after fine-tuning
- Compared against a randomly initialized baseline
2. Sample Efficiency:
- Number of examples to reach target performance
- Pre-trained models: 10-100× fewer examples
3. Convergence Speed:
- Training steps to convergence
- Pre-trained: 10× faster
What Makes Good Pre-training?
Data scale: More data $\to$ better transfer (up to a point)
Data diversity: Diverse pre-training $\to$ broader transfer
Model scale: Larger models transfer better
Objective alignment: Pre-training objective similar to downstream task
Domain match: Domain-specific pre-training helps domain-specific tasks
Exercises
- Train BERT-tiny with: (a) MLM, (b) CLM, (c) Span corruption
- Evaluate on GLUE tasks
- Which objective transfers best? Why?
- Download 10,000 documents from Common Crawl
- Remove duplicates (exact and near-duplicate)
- Filter by language (keep English)
- Filter low-quality (perplexity > threshold)
- Report statistics at each stage
- Load pre-trained GPT-2
- Add LoRA layers with $r=8$ to attention
- Fine-tune on sentiment analysis
- Compare: (a) Full fine-tuning, (b) LoRA, (c) Frozen
- Measure: parameters trained, memory, accuracy
- Fine-tune BERT on 5 GLUE tasks
- Vary training data: [100, 500, 1000, 5000, all]
- Compare to training from scratch
- Plot sample efficiency curves
- At what point does pre-training stop helping?
Solutions
Full solutions for all exercises are available at \url{https://deeplearning.hofkensvermeulen.be}.
Expected Results on GLUE:
| Objective | MNLI | QQP | SST-2 | Avg |
|---|---|---|---|---|
| MLM (BERT-style) | 78.3 | 85.2 | 89.7 | 84.4 |
| CLM (GPT-style) | 72.1 | 81.5 | 87.3 | 80.3 |
| Span Corruption (T5-style) | 76.8 | 84.1 | 88.9 | 83.3 |
Analysis:
Part (c): Which transfers best?
MLM (Masked Language Modeling) wins because:
- Bidirectional context: Sees both left and right context
- Better for understanding: GLUE tasks require comprehension
- Sentence-level tasks: MLM learns sentence representations
- Natural for classification: [CLS] token aggregates information
CLM (Causal Language Modeling) is weaker because:
- Only left context (unidirectional)
- Optimized for generation, not understanding
- No explicit sentence representation
- Strength lies in generation, which GLUE does not test
Span Corruption is middle ground:
- Bidirectional like MLM
- Longer spans than single tokens
- Good for seq2seq tasks
- Slightly worse than MLM for classification
Key Insight: Pre-training objective should match downstream task type:
- Understanding tasks: Use MLM (BERT)
- Generation tasks: Use CLM (GPT)
- Seq2seq tasks: Use span corruption (T5)
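The MLM recipe discussed above (mask $\sim$15\% of tokens, with BERT's 80/10/10 replacement rule: 80\% become [MASK], 10\% a random token, 10\% stay unchanged) can be sketched as follows. Token IDs and the helper name are illustrative:

```python
import random

def mask_tokens(tokens, mask_id, vocab_size, mask_prob=0.15, seed=0):
    """BERT-style masking sketch: pick ~15% of positions as prediction
    targets; of those, 80% become [MASK], 10% a random token, and 10%
    keep the original token. Returns (corrupted tokens, target positions).
    """
    rng = random.Random(seed)
    corrupted, targets = list(tokens), []
    for i in range(len(tokens)):
        if rng.random() < mask_prob:
            targets.append(i)
            roll = rng.random()
            if roll < 0.8:
                corrupted[i] = mask_id          # replace with [MASK]
            elif roll < 0.9:
                corrupted[i] = rng.randrange(vocab_size)  # random token
            # else: keep the original token (model still predicts it)
    return corrupted, targets
```

The loss is then computed only at the returned target positions, which is why MLM sees gradient signal from roughly 15\% of tokens per step while CLM gets signal from every position.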
Exercise 2: Data Curation Pipeline
Pipeline Statistics:
| Stage | Documents | Retention |
|---|---|---|
| Initial (Common Crawl) | 10,000 | 100\% |
| After exact deduplication | 8,742 | 87.4\% |
| After near-duplicate removal | 7,891 | 78.9\% |
| After language filtering | 6,234 | 62.3\% |
| After quality filtering | 4,567 | 45.7\% |
Key Steps:
1. Exact Deduplication:
- Hash each document (MD5/SHA256)
- Remove documents with identical hashes
- Removes 12.6\% duplicates
2. Near-Duplicate Removal:
- Use MinHash LSH for similarity detection
- Threshold: 90\% similarity
- Removes 9.7\% near-duplicates
3. Language Filtering:
- Use fastText language detector
- Keep only English (confidence > 0.9)
- Removes 21.0\% non-English
4. Quality Filtering:
- Compute perplexity with GPT-2
- Threshold: perplexity < 1000
- Removes 26.7\% low-quality
Final Result: 4,567 high-quality English documents (45.7\% retention)
Quality Metrics:
- Average perplexity: 342 (vs 1,247 before filtering)
- Average length: 1,834 tokens (vs 892 before)
- English confidence: 0.97 (vs 0.73 before)
Key Insight: Aggressive filtering (54.3\% removed) dramatically improves data quality, leading to better pre-training outcomes.
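The first two pipeline stages can be sketched in a few lines. Exact deduplication hashes each document; near-duplicate removal is shown here as a brute-force Jaccard check over word shingles, a quadratic simplification of the MinHash LSH approach named above:

```python
import hashlib

def exact_dedup(docs):
    """Stage 1: drop documents with an identical SHA-256 hash."""
    seen, kept = set(), []
    for d in docs:
        h = hashlib.sha256(d.encode("utf-8")).hexdigest()
        if h not in seen:
            seen.add(h)
            kept.append(d)
    return kept

def shingles(doc, n=3):
    """Word n-gram shingles used for near-duplicate similarity."""
    words = doc.split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def near_dedup(docs, threshold=0.9):
    """Stage 2 (simplified): drop a document whose shingle-set Jaccard
    similarity with an already-kept document meets the threshold.
    Production pipelines use MinHash LSH to avoid the O(n^2) scan."""
    kept, kept_shingles = [], []
    for d in docs:
        s = shingles(d)
        dup = any(len(s & t) / len(s | t) >= threshold
                  for t in kept_shingles if s or t)
        if not dup:
            kept.append(d)
            kept_shingles.append(s)
    return kept
```

Language and quality filtering follow the same keep/drop pattern, with the predicate swapped for a fastText language score or a GPT-2 perplexity threshold.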
Exercise 3: Parameter-Efficient Fine-tuning
Comparison Results:
| Method | Params Trained | Memory (GB) | Accuracy |
|---|---|---|---|
| Full Fine-tuning | 124M (100\%) | 8.2 | 94.3\% |
| LoRA ($r=8$) | 0.3M (0.24\%) | 2.1 | 93.8\% |
| Frozen (linear probe) | 0.001M (0.001\%) | 1.8 | 87.2\% |
Analysis:
LoRA Advantages:
- Parameter efficiency: 400× fewer trainable parameters
- Memory efficiency: 4× less memory (no optimizer states for frozen parameters)
- Near-full performance: only a 0.5\% accuracy drop
- Fast training: 3× faster per epoch
- Modular: Can swap LoRA adapters for different tasks
When to use each:
- Full fine-tuning: Maximum accuracy, sufficient compute
- LoRA: Limited compute, multiple tasks, near-optimal accuracy
- Frozen: Extremely limited resources, quick baseline
LoRA Configuration:
- Rank $r=8$: Good balance (try 4, 16, 32 for tuning)
- Apply to: Query and Value projections
- Alpha: $\alpha = 16$ (scaling factor)
- Dropout: 0.1 on LoRA layers
Key Insight: LoRA achieves 99.5\% of full fine-tuning performance with 0.24\% of parameters, making it ideal for resource-constrained scenarios.
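A LoRA layer adds a low-rank update to a frozen weight: $h = Wx + \frac{\alpha}{r} BAx$, where $A \in \mathbb{R}^{r \times d_{\text{in}}}$ and $B \in \mathbb{R}^{d_{\text{out}} \times r}$ are the only trainable matrices and $B$ is initialized to zero, so the adapted model starts identical to the base model. A pure-Python sketch of the forward pass (a real implementation would use torch tensors):

```python
def matvec(M, v):
    """Matrix-vector product over nested lists."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def lora_forward(W, A, B, x, alpha=16.0):
    """LoRA forward pass: h = W x + (alpha / r) * B (A x).

    W is the frozen d_out x d_in weight; A (r x d_in) and B (d_out x r)
    are the trainable low-rank factors. With B = 0 at initialization,
    the output equals the frozen layer's output exactly.
    """
    r = len(A)                          # rank = number of rows of A
    base = matvec(W, x)                 # frozen path
    delta = matvec(B, matvec(A, x))     # low-rank update path
    return [b + (alpha / r) * d for b, d in zip(base, delta)]
```

Because only $A$ and $B$ receive gradients, optimizer state is kept for $2rd$ parameters per adapted matrix instead of $d^2$, which is where the memory savings in the table come from.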
Exercise 4: Sample Efficiency
Sample Efficiency Results:
| Training Size | 100 | 500 | 1000 | 5000 | All |
|---|---|---|---|---|---|
| Pre-trained BERT | 72.3 | 81.7 | 85.2 | 88.9 | 90.1 |
| From Scratch | 51.2 | 63.8 | 71.4 | 82.3 | 87.8 |
| Improvement | +21.1 | +17.9 | +13.8 | +6.6 | +2.3 |
Analysis:
Key Observations:
- Largest gains with small data: +21.1 points at 100 examples
- Diminishing returns: Improvement drops as data increases
- Convergence point: Around 10k-50k examples
- Always helpful: Even with full data, a +2.3-point improvement
Sample Efficiency Curve:
The gap between pre-trained and from-scratch narrows as:
- More task-specific data becomes available
- Model learns task-specific patterns
- Pre-training advantage diminishes
When pre-training stops helping significantly:
- Threshold: Around 10,000-50,000 labeled examples
- Task complexity: Complex tasks need more data
- Domain shift: Large domain gap reduces benefit
- Model size: Larger models need more data to catch up
Practical Recommendations:
| Data Size | Strategy |
|---|---|
| < 1,000 | Always use pre-training (critical) |
| 1,000 - 10,000 | Pre-training highly beneficial |
| 10,000 - 50,000 | Pre-training still helps |
| > 50,000 | Pre-training provides marginal gains |
Key Insight: Pre-training is most valuable in low-resource settings, providing 10-20 point accuracy improvements with < 1,000 examples, but its benefits diminish as task-specific data grows.
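The diminishing-returns pattern in the results table above can be verified numerically from the reported accuracies:

```python
sizes = [100, 500, 1000, 5000]          # training-set sizes from the table
pretrained   = [72.3, 81.7, 85.2, 88.9]
from_scratch = [51.2, 63.8, 71.4, 82.3]

# Accuracy gap (in points) at each training-set size.
gaps = [round(p - s, 1) for p, s in zip(pretrained, from_scratch)]
# The gap shrinks monotonically: pre-training helps most when labels are scarce.
assert all(a > b for a, b in zip(gaps, gaps[1:]))
```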