Pre-training Strategies and Transfer Learning

Chapter Overview

Pre-training on large unlabeled corpora followed by task-specific fine-tuning has become the dominant paradigm in deep learning. This chapter covers pre-training objectives, data curation, curriculum learning, continual pre-training, and transfer learning strategies for maximizing downstream performance.

Learning Objectives

  1. Understand different pre-training objectives and their trade-offs
  2. Curate and process pre-training data at scale
  3. Apply curriculum learning and domain-adaptive pre-training
  4. Implement parameter-efficient fine-tuning (LoRA, adapters)
  5. Design multi-task and multi-stage pre-training
  6. Measure and improve transfer learning effectiveness

Pre-training Objectives

Language Modeling Objectives

The choice of pre-training objective fundamentally shapes a model's capabilities. The main objectives are:

Denoising Objectives

Computational costs are similar across objectives for the same model size and sequence length: a single forward pass dominates, with differences arising mainly from the decoder target length in seq2seq objectives (span corruption decodes several tokens per corrupted span, costing roughly 3$\times$ more decoder computation than predicting single masked tokens).

Contrastive Objectives

Contrastive Learning:

$$ \mathcal{L}_{\text{contrastive}} = -\log \frac{\exp(\text{sim}(z_i, z_i^+)/\tau)}{\sum_{j} \exp(\text{sim}(z_i, z_j)/\tau)} $$
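As a concrete sketch (hypothetical, numpy-based), the loss above can be computed for a batch of anchor/positive pairs, with the other in-batch positives serving as negatives:

```python
import numpy as np

def info_nce_loss(z, z_pos, tau=0.07):
    """Contrastive (InfoNCE) loss: diagonal pairs are positives,
    off-diagonal in-batch examples act as negatives."""
    # L2-normalize so the dot product is cosine similarity
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    z_pos = z_pos / np.linalg.norm(z_pos, axis=1, keepdims=True)
    sim = (z @ z_pos.T) / tau  # (n, n) scaled similarity matrix
    # Row-wise log-softmax; the positive pair sits on the diagonal
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))
```

When anchors and positives are aligned the loss approaches zero; mismatched pairs are penalized.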

Applications:

Data Curation and Processing

Data Scale and Requirements

The scale of pre-training data has grown exponentially over the past few years, driven by empirical findings that larger datasets consistently improve model performance. Understanding the data requirements, storage costs, and preprocessing overhead is essential for planning pre-training projects.

BERT was pre-trained on approximately 16 GB of text data, consisting of BooksCorpus (800 million words) and English Wikipedia (2.5 billion words). This relatively modest dataset size reflects BERT's focus on high-quality, curated text rather than massive web crawls. The 16 GB of raw text expands to approximately 3.3 billion tokens using BERT's WordPiece tokenizer with a 30,000 token vocabulary. Training BERT-base for 1 million steps with batch size 256 and sequence length 512 means the model sees each token approximately 40 times on average, indicating significant data reuse through multiple epochs. The storage requirements are minimal by modern standards—16 GB of compressed text expands to perhaps 50 GB including tokenized data and intermediate preprocessing artifacts.
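The token-reuse arithmetic above can be checked with a few lines (figures taken from the text; all approximate):

```python
# BERT-base training configuration (approximate figures from the text)
steps, batch_size, seq_len = 1_000_000, 256, 512
corpus_tokens = 3.3e9  # ~3.3B WordPiece tokens

tokens_processed = steps * batch_size * seq_len   # total tokens seen during training
epochs = tokens_processed / corpus_tokens         # average passes over the corpus
print(round(epochs))  # ~40
```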

GPT-2 scaled up to approximately 40 GB of text from WebText, a dataset created by scraping outbound links from Reddit posts with at least 3 karma. This filtering strategy aimed to identify high-quality content as judged by the Reddit community. The 40 GB corpus contains roughly 8 billion tokens using GPT-2's byte-pair encoding with a 50,257 token vocabulary. GPT-2's largest variant (1.5B parameters) was trained for approximately 1 million steps, seeing each token roughly 10 times. The preprocessing pipeline for WebText involved deduplication, filtering by language, and removing low-quality content, reducing the raw crawl from over 100 GB to the final 40 GB. Storage requirements including raw data, filtered data, and tokenized sequences total approximately 150 GB.

GPT-3 made a massive leap to a training set of roughly 300 billion tokens, drawn from approximately 570 GB of filtered text. The dataset combines filtered Common Crawl (410 billion tokens), WebText2 (19 billion), Books1 (12 billion), Books2 (55 billion), and Wikipedia (3 billion tokens). The preprocessing pipeline for Common Crawl is particularly intensive: the raw crawl contains petabytes of data, which must be filtered by language, deduplicated, and quality-filtered to produce the final corpus. This filtering process itself requires substantial computational resources: pushing petabytes of data through language classifiers and deduplication algorithms takes weeks on large clusters. The total storage requirements for GPT-3 pre-training, including raw data, filtered data, tokenized sequences, and training checkpoints, exceed 5 TB. The preprocessing cost alone is estimated at tens of thousands of dollars in compute time.

LLaMA pushed the scale even further to approximately 1.4 TB of text, totaling roughly 1.4 trillion tokens. This dataset consists primarily of Common Crawl (67\%), C4 (15\%), GitHub (4.5\%), Wikipedia (4.5\%), books (4.5\%), ArXiv (2.5\%), and StackExchange (2\%). The inclusion of code from GitHub and technical content from ArXiv and StackExchange reflects a deliberate strategy to improve reasoning and technical capabilities. The preprocessing pipeline for LLaMA is even more sophisticated than GPT-3, using multiple quality filters including perplexity-based filtering, classifier-based filtering, and extensive deduplication. The total storage requirements exceed 10 TB including all preprocessing artifacts, and the preprocessing cost is estimated at over \$100,000 in compute time.

Data Quality versus Quantity

The relationship between data quality and quantity is not straightforward—more data does not always lead to better models if the quality is poor. Recent research has shown that careful data curation can match or exceed the performance of models trained on much larger but noisier datasets.

High-quality datasets like Wikipedia and books consistently improve model performance even when they represent a small fraction of total training data. GPT-3's data mixture heavily upsamples Wikipedia relative to its natural share of the corpus (3.4 epochs versus 0.44 epochs for Common Crawl), reflecting the higher quality and information density of Wikipedia text. This upsampling strategy means that despite Wikipedia contributing only about 3 billion of the roughly 500 billion available tokens, it contributes disproportionately to the model's knowledge and capabilities.

The preprocessing cost for achieving high data quality is substantial. Language identification using fastText classifiers requires processing every document, taking approximately 1 CPU-hour per 100 GB of text. Deduplication using MinHash LSH is more expensive, requiring approximately 10 CPU-hours per 100 GB for computing signatures and finding near-duplicates. Quality filtering using perplexity-based methods requires running a language model over the entire corpus, taking approximately 100 GPU-hours per 100 GB. Because these filters run over the raw crawl, which is orders of magnitude larger than the final 570 GB dataset, the total preprocessing for GPT-3's corpus is estimated to exceed 50,000 CPU-hours and 5,000 GPU-hours, translating to roughly \$30,000 in cloud computing costs.

The storage requirements for preprocessing are also significant. Deduplication requires storing hash signatures for all documents, typically requiring 100-200 bytes per document. For a corpus with 1 billion documents, this means 100-200 GB of signature storage. Near-duplicate detection using LSH requires storing multiple hash tables, potentially doubling or tripling this storage requirement. Quality filtering requires storing perplexity scores or classifier outputs for all documents, adding another 10-20 GB. In total, the preprocessing metadata can require 500 GB to 1 TB of storage for a large corpus, comparable to the size of the corpus itself.

Data Filtering and Cleaning

Algorithm: Data Filtering Pipeline

Step 1: Quality Filtering

  • Filter by language (fastText classifier)
  • Remove toxic/harmful content
  • Filter low-quality text (perplexity-based or classifier-based)

Step 2: Deduplication

  • Exact match: hash-based
  • Near-duplicates: MinHash LSH
  • Applied at both document and paragraph level

Step 3: Privacy

  • Remove PII (emails, phone numbers, addresses)
  • Filter memorized content
  • Redact sensitive information

Step 4: Formatting

  • Unicode normalization
  • Remove excessive whitespace
  • Clean HTML/markup artifacts

Example: GPT-3 training data mixture (~570 GB of filtered text, 300B training tokens):

Dataset                    Weight   Epochs
Common Crawl (filtered)    60\%     0.44
WebText2                   22\%     2.9
Books1                     8\%      1.9
Books2                     8\%      0.43
Wikipedia                  3\%      3.4

Higher-quality sources are sampled more frequently (multiple epochs); lower-quality sources are seen for less than one epoch to avoid overfitting to noise.
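Such a weighted mixture is typically implemented by sampling the source of each training document according to the mixture weights. A minimal sketch (weights approximate the GPT-3 mixture and sum to ~1.01 due to rounding; `random.choices` normalizes them):

```python
import random

# GPT-3-style mixture weights (approximate, from the table above)
MIXTURE = {
    "common_crawl": 0.60,
    "webtext2": 0.22,
    "books1": 0.08,
    "books2": 0.08,
    "wikipedia": 0.03,
}

def sample_source(rng=random):
    """Pick which dataset the next training document is drawn from."""
    names = list(MIXTURE)
    weights = list(MIXTURE.values())
    return rng.choices(names, weights=weights, k=1)[0]
```

Over many draws, Common Crawl dominates while Wikipedia appears a few percent of the time, independent of each source's raw size on disk.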

          Data Deduplication

Why deduplicate? Duplicated documents waste compute, encourage verbatim memorization, and can leak benchmark data into the training set (train/test contamination).

Methods:

1. Exact Deduplication:

import hashlib

seen_hashes = set()
for doc in corpus:
    # Stable content hash (builtin hash() varies across runs)
    h = hashlib.sha256(doc.encode("utf-8")).hexdigest()
    if h not in seen_hashes:
        keep(doc)
        seen_hashes.add(h)

          2. Fuzzy Deduplication (MinHash):
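A minimal MinHash sketch (illustrative only; production pipelines add LSH banding to avoid all-pairs comparison and tune the shingle size and number of hash functions):

```python
import hashlib
import random

def shingles(text, k=5):
    """Set of k-word shingles for a document."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def minhash_signature(text, num_hashes=64, seed=0):
    """One minimum per salted hash function; similar docs share many minima."""
    salts = random.Random(seed).choices(range(2**32), k=num_hashes)
    sh = shingles(text)
    return [
        min(int(hashlib.md5(f"{salt}:{s}".encode()).hexdigest(), 16) for s in sh)
        for salt in salts
    ]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Documents whose estimated Jaccard similarity exceeds a threshold (commonly 0.8 or so) are treated as near-duplicates and all but one copy dropped.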

          Training Compute Requirements

          FLOPs Analysis

          Understanding the computational requirements for pre-training large language models is essential for planning projects and estimating costs. The total compute is typically measured in FLOPs (floating-point operations), which can be calculated from model architecture and training configuration.

          For a transformer model with $L$ layers, $d$ model dimension, $h$ attention heads, and feed-forward intermediate dimension $d_{ff}$ (typically $4d$), processing a single token requires approximately:

          $$ \text{FLOPs per token} = 2L(12d^2 + 4d \cdot d_{ff}) = 2L(12d^2 + 16d^2) = 56Ld^2 $$

The factor of 2 accounts for the forward and backward passes; this is a rough convention, since the backward pass typically costs closer to twice the forward pass, so some references use a factor of 3. The $12d^2$ term covers the attention block (query, key, value, and output projections, each $d \times d$, plus the attention computation itself), and the $16d^2$ term covers the feed-forward layers (two $d \times 4d$ projections at 2 FLOPs per multiply-accumulate).

For BERT-base with $L=12$, $d=768$, training on 3.3 billion tokens for 40 epochs (132 billion tokens total):

$$ \text{Total FLOPs} = 56 \times 12 \times 768^2 \times 132 \times 10^9 \approx 5.2 \times 10^{19} \text{ FLOPs} $$

This is approximately 52 exaFLOPs. Training on 16 TPU v3 chips (each providing 420 TFLOPS in mixed precision) for 4 days:

$$ \text{Available compute} = 16 \times 420 \times 10^{12} \times 4 \times 86400 \approx 2.3 \times 10^{21} \text{ FLOPs} $$

The ratio of available to required compute is approximately 44, implying hardware utilization in the low single digits under this model. Such low utilization is common for small models, where data loading and communication overhead are proportionally large; the formula also undercounts FLOPs by ignoring attention over the sequence and the vocabulary softmax, so effective utilization is somewhat higher than this estimate suggests.

For GPT-3 175B with $L=96$, $d=12288$, training on 300 billion tokens:

$$ \text{Total FLOPs} = 56 \times 96 \times 12288^2 \times 300 \times 10^9 \approx 2.4 \times 10^{23} \text{ FLOPs} $$

This is approximately 240 zettaFLOPs, more than 4,000$\times$ that of BERT-base. The massive compute requirement reflects both the larger model (175B versus 110M parameters) and the larger dataset (300B versus 132B tokens). Training GPT-3 on approximately 10,000 NVIDIA V100 GPUs (each providing 125 TFLOPS in mixed precision) for 1 month:

$$ \text{Available compute} = 10000 \times 125 \times 10^{12} \times 30 \times 86400 \approx 3.2 \times 10^{24} \text{ FLOPs} $$

The ratio is roughly 13, suggesting utilization below 10\% by this estimate. Treat both numbers as rough: the exact GPU count, training duration, and hardware configuration for GPT-3 were never publicly disclosed.

For LLaMA-65B with $L=80$, $d=8192$, training on 1.4 trillion tokens:

$$ \text{Total FLOPs} = 56 \times 80 \times 8192^2 \times 1.4 \times 10^{12} \approx 4.2 \times 10^{23} \text{ FLOPs} $$

This is approximately 420 zettaFLOPs. Training on 2048 NVIDIA A100 GPUs (each providing 312 TFLOPS in mixed precision) for 21 days:

$$ \text{Available compute} = 2048 \times 312 \times 10^{12} \times 21 \times 86400 \approx 1.2 \times 10^{24} \text{ FLOPs} $$

This suggests roughly 35\% hardware utilization, the highest of the three estimates, consistent with more modern hardware (A100 versus V100), NVLink/NVSwitch interconnects, and a heavily optimized training pipeline.
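These totals follow mechanically from the per-token formula; a small helper (a sketch evaluating the text's simplified $56Ld^2$ model, which ignores attention-score and softmax FLOPs) reproduces them:

```python
def train_flops(num_layers, d_model, tokens):
    """Total training FLOPs under the 56 * L * d^2 per-token model (fwd + bwd)."""
    return 56 * num_layers * d_model**2 * tokens

# GPT-3 175B: L=96, d=12288, 300B tokens -> ~2.4e23 FLOPs
gpt3 = train_flops(96, 12288, 300e9)
# LLaMA-65B: L=80, d=8192, 1.4T tokens
llama = train_flops(80, 8192, 1.4e12)
```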

          GPU-Hours and Cost Estimates

          Translating FLOPs into GPU-hours and cost estimates provides a more practical understanding of training requirements. The cost depends heavily on the hardware platform and whether using cloud services or owned infrastructure.

BERT-base training on 16 TPU v3 chips for 4 days equals 1,536 TPU-hours. At Google Cloud's on-demand pricing of approximately \$8 per TPU v3 hour, this costs roughly \$12,000; preemptible pricing of approximately \$2.40 per hour reduces the cost to roughly \$3,700. Using equivalent GPU resources (approximately 64 NVIDIA V100 GPUs to match 16 TPU v3 chips) at approximately \$2 per GPU-hour on AWS, the 6,144 GPU-hours would also total roughly \$12,000. The lower TPU cost reflects Google's optimization for transformer workloads and economies of scale.

          GPT-3 175B training on 10,000 V100 GPUs for 1 month equals 7.2 million GPU-hours. At AWS on-demand pricing of approximately \$3 per V100 hour, this would cost \$21.6 million. However, OpenAI likely used a combination of owned infrastructure and negotiated cloud pricing, with estimates suggesting actual costs between \$4 million and \$12 million. The wide range reflects uncertainty about the exact hardware configuration, utilization rates, and pricing agreements. The training also required substantial infrastructure costs including high-bandwidth networking (InfiniBand or equivalent), distributed storage systems, and engineering effort to optimize the training pipeline.

LLaMA-65B training on 2048 A100 GPUs for 21 days equals 1.03 million GPU-hours. At cloud pricing of approximately \$3 per A100 hour, this would cost \$3.1 million. Meta's paper reports that LLaMA-65B training consumed approximately 1,022,362 GPU-hours on A100-80GB GPUs, closely matching this estimate. Using Meta's owned infrastructure rather than cloud services likely reduced the effective cost to \$1.5-2 million when accounting for hardware depreciation and operational costs. The A100's higher performance compared to V100 (312 versus 125 TFLOPS) means that LLaMA-65B required roughly 1/7 the GPU-hours of GPT-3 despite consuming comparable total compute, demonstrating the importance of hardware efficiency.

          Scaling Laws

          Empirical scaling laws govern the relationship between compute budget, model size, dataset size, and performance (see Section~[ref] for the full treatment). The key insight from the Chinchilla scaling laws (Hoffmann et al., 2022) is that model size and dataset size should scale equally with compute budget: $N_{\text{opt}} \propto C^{0.5}$ and $D_{\text{opt}} \propto C^{0.5}$. This implies GPT-3 175B (trained on 300B tokens) was over-parameterized---a 70B model trained on 1.4T tokens (matching LLaMA-65B) achieves better performance for the same compute. This insight has driven the trend toward models like LLaMA, Chinchilla, and Mistral that prioritize training tokens over parameter count.
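Combining the rule of thumb $C \approx 6ND$ with the Chinchilla prescription of roughly 20 training tokens per parameter gives a quick compute-optimal sizing estimate (a sketch; both the 6ND approximation and the 20:1 ratio are rules of thumb, not exact laws):

```python
def chinchilla_optimal(compute_flops, tokens_per_param=20.0):
    """Compute-optimal parameter count N and token count D for budget C.

    Uses C ~= 6 * N * D with D ~= tokens_per_param * N, so C ~= 120 * N^2.
    """
    n_params = (compute_flops / (6 * tokens_per_param)) ** 0.5
    return n_params, tokens_per_param * n_params

# Chinchilla's own budget (~5.8e23 FLOPs) recovers ~70B params, ~1.4T tokens
n, d = chinchilla_optimal(5.8e23)
```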

          Curriculum Learning

          Progressive Training Strategies

          Curriculum learning applies the principle of learning from easy to hard examples, progressively increasing task difficulty during training. This approach can significantly improve training efficiency, convergence speed, and final model performance. The key insight is that models learn more effectively when they first master simpler patterns before tackling complex ones.

          The most common curriculum strategy involves progressively increasing sequence length during training. Starting with shorter sequences reduces both memory requirements and computational cost in the early stages of training when the model is learning basic patterns. For a model with quadratic attention complexity $O(n^2d)$, doubling the sequence length quadruples the attention computation. Training BERT-base with sequence length 128 for the first 90\% of steps and then 512 for the final 10\% reduces total training time by approximately 30\% compared to using length 512 throughout. The shorter sequences allow larger batch sizes in early training, which improves gradient estimates and accelerates convergence. The model learns word-level and phrase-level patterns with short sequences, then refines its understanding of long-range dependencies with longer sequences.

Batch size curriculum is another effective strategy, gradually increasing batch size during training. Starting with smaller batches provides more frequent parameter updates, which helps the model escape poor local minima in early training. As training progresses and the model approaches convergence, larger batches provide more stable gradient estimates and better utilize hardware parallelism. GPT-3 training used a batch size curriculum, starting at 32,000 tokens per batch and gradually increasing to 3.2 million tokens per batch. This 100$\times$ increase in batch size was enabled by learning rate adjustments and gradient accumulation, and the larger batches in later training improved hardware utilization and reduced overall training time.

          Learning rate schedules are essential for curriculum learning, as the optimal learning rate changes with batch size and training progress. The linear warmup followed by cosine decay schedule has become standard for transformer pre-training. The warmup phase, typically 1-10\% of total steps, gradually increases the learning rate from near-zero to the peak value. This prevents the large gradient updates in early training from destabilizing the model. The cosine decay phase gradually reduces the learning rate to near-zero, allowing the model to fine-tune its parameters as it approaches convergence. For BERT-base, a warmup of 10,000 steps followed by linear decay over 990,000 steps works well. For GPT-3, a warmup of 375 million tokens (approximately 1\% of total training) followed by cosine decay proved effective.
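The warmup-then-cosine schedule described above is only a few lines of code (a generic sketch; the peak learning rate and step counts are placeholders to be set per model):

```python
import math

def lr_at(step, peak_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear warmup from 0 to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps          # linear warmup
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

BERT's original recipe used linear rather than cosine decay after warmup; swapping the decay branch is a one-line change.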

          The impact on training efficiency is substantial. Curriculum learning can reduce training time by 20-40\% compared to fixed configurations while achieving equal or better final performance. For BERT-base, the sequence length curriculum reduces training from approximately 5.5 days to 4 days on the same hardware. For GPT-3, the batch size curriculum improved hardware utilization enough to reduce training time by an estimated 30\%, saving approximately \$3-4 million in compute costs. These savings make curriculum learning essential for large-scale pre-training projects.

          Progressive Training

          Definition: Train on progressively harder examples:

          Stage 1: Easy examples (short sequences, simple patterns)

          Stage 2: Medium difficulty

          Stage 3: Full difficulty (long sequences, complex patterns)

          Benefits:

          Example: GPT-3 training:

          Stage 1 (0-100B tokens):

          Stage 2 (100B-300B tokens):

          Starting with shorter sequences reduces memory and computation early in training.

          Domain-Adaptive Pre-training

          Continue pre-training on domain-specific data:

          Algorithm: Domain Adaptation

          Step 1: Pre-train on general corpus (e.g., Common Crawl)

          Step 2: Continue pre-training on domain data (e.g., biomedical)

          Step 3: Fine-tune on task

          Examples:

          Hardware Requirements and Infrastructure

          BERT-base Training Infrastructure

          Training BERT-base requires relatively modest infrastructure by modern standards, making it accessible to academic research groups and small companies. The original BERT paper reported training BERT-base on 16 TPU v3 chips for 4 days, providing a concrete reference point for hardware requirements.

          Each TPU v3 chip provides 420 TFLOPS of mixed-precision compute (bfloat16) and 16 GB of high-bandwidth memory (HBM). The 16-chip configuration provides 6.7 PFLOPS total compute and 256 GB total memory. BERT-base with 110 million parameters requires approximately 440 MB for model weights in FP32, or 220 MB in FP16. With batch size 256 and sequence length 512, the activation memory per batch is approximately 8 GB, which fits comfortably in the 256 GB total memory when distributed across 16 chips. The high-bandwidth interconnect between TPU chips (approximately 100 GB/s per chip) enables efficient data parallelism with minimal communication overhead.

          The training cost at Google Cloud's preemptible TPU pricing (approximately \$2.40 per TPU v3 hour) is roughly \$3,700 for the full 4-day training run. Using on-demand pricing (\$8 per hour) would increase this to \$12,000. For comparison, training on NVIDIA V100 GPUs would require approximately 64 GPUs for 4 days (6,144 GPU-hours) at a cost of approximately \$12,000 using AWS on-demand pricing. The equivalent training on A100 GPUs would require approximately 32 GPUs for 2.5 days (1,920 GPU-hours) at a cost of approximately \$6,000, demonstrating the improved efficiency of newer hardware.

          The infrastructure requirements beyond compute include high-bandwidth storage for the training data (approximately 100 GB including tokenized sequences and preprocessing artifacts), network bandwidth for distributed training (at least 10 Gbps per GPU for efficient data parallelism), and monitoring infrastructure for tracking training metrics. The total infrastructure cost including storage, networking, and engineering time is typically 2-3× the raw compute cost, bringing the total BERT-base training cost to approximately \$10,000-15,000.

          GPT-3 Training Infrastructure

          Training GPT-3 175B requires massive infrastructure that is accessible only to large technology companies and well-funded research organizations. The scale of the training job presents significant engineering challenges beyond simply acquiring hardware.

          The training used approximately 10,000 NVIDIA V100 GPUs, though the exact configuration has not been publicly disclosed. Each V100 provides 125 TFLOPS of mixed-precision compute and 32 GB of memory. The 10,000-GPU configuration provides 1.25 exaFLOPS total compute and 320 TB total memory. GPT-3 175B with 175 billion parameters requires approximately 700 GB for model weights in FP32, or 350 GB in FP16. With model parallelism across 8 GPUs, each GPU stores approximately 44 GB of model weights, leaving limited memory for activations. The batch size per GPU is typically 1-2 sequences of length 2048, requiring approximately 20 GB of activation memory per GPU.

          The communication requirements are severe. With model parallelism across 8 GPUs and data parallelism across 1,250 groups, each training step requires all-reduce operations across the data parallel groups (approximately 700 GB of gradients) and all-to-all communication within model parallel groups (approximately 100 GB per step). At 100 Gbps network bandwidth per GPU, the gradient all-reduce takes approximately 70 seconds per step, which would dominate training time. To address this, GPT-3 training used gradient accumulation (accumulating gradients over multiple micro-batches before synchronizing) and high-bandwidth interconnects like InfiniBand (200 Gbps or higher), reducing communication time to approximately 10\% of total step time.
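The back-of-envelope timing used above can be written down explicitly (a naive lower bound: it ignores all-reduce algorithmic overhead, latency, and overlap with compute):

```python
def transfer_seconds(data_bytes, link_gbps):
    """Time to move data_bytes over a link rated at link_gbps (gigabits/s)."""
    bytes_per_sec = link_gbps / 8 * 1e9   # convert gigabits/s to bytes/s
    return data_bytes / bytes_per_sec

# ~700 GB of FP32 gradients over a 100 Gbps link: tens of seconds per step,
# which is why gradient accumulation and faster interconnects are essential
naive = transfer_seconds(700e9, 100)
```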

          The training cost is estimated between \$4 million and \$12 million depending on assumptions about hardware ownership versus cloud rental, utilization rates, and pricing agreements. At AWS on-demand pricing of \$3 per V100 hour, the 7.2 million GPU-hours would cost \$21.6 million, but OpenAI likely achieved significant discounts through long-term commitments and negotiated pricing. The infrastructure costs beyond raw compute are substantial: high-bandwidth networking equipment (InfiniBand switches and cables) costs millions of dollars, distributed storage systems for the 5 TB of training data cost hundreds of thousands of dollars, and the engineering effort to build and optimize the training pipeline represents millions of dollars in labor costs.

          The power consumption is also significant. Each V100 GPU consumes approximately 300 watts under full load, so 10,000 GPUs consume 3 megawatts. Over a 1-month training run, this equals 2,160 megawatt-hours of electricity. At typical data center electricity costs of \$0.10 per kWh, the electricity cost alone is \$216,000. Including cooling and power distribution overhead (typically 1.5-2× the compute power), the total power cost approaches \$400,000.

          LLaMA-65B Training Infrastructure

          Training LLaMA-65B represents a more efficient approach than GPT-3, using fewer but more powerful GPUs and a more optimized training pipeline. Meta's paper provides detailed information about the infrastructure and costs.

          The training used 2,048 NVIDIA A100-80GB GPUs for 21 days, totaling 1,022,362 GPU-hours. Each A100-80GB provides 312 TFLOPS of mixed-precision compute and 80 GB of memory, representing a significant improvement over V100 (2.5× compute, 2.5× memory). The 2,048-GPU configuration provides 639 PFLOPS total compute and 164 TB total memory. LLaMA-65B with 65 billion parameters requires approximately 260 GB for model weights in FP32, or 130 GB in FP16. With model parallelism across 8 GPUs, each GPU stores approximately 16 GB of model weights, leaving substantial memory for activations and optimizer states.

The larger memory capacity of A100-80GB enables more efficient training configurations. LLaMA uses a batch size of 4 million tokens (approximately 2,000 sequences of length 2048), distributed across the 2,048 GPUs as roughly one sequence per GPU. The activation memory per GPU is approximately 40 GB and the optimizer states (using AdamW) require approximately 32 GB; together with the 16 GB weight shard, the naive total of roughly 88 GB slightly exceeds the 80 GB capacity. The training pipeline stays within budget by saving only the most expensive activations during the forward pass and recomputing the rest in the backward pass, trading a small amount of compute for memory headroom.

          The communication requirements are more manageable than GPT-3 due to the smaller model size and more efficient hardware. With model parallelism across 8 GPUs and data parallelism across 256 groups, each training step requires all-reduce operations across data parallel groups (approximately 260 GB of gradients) and all-to-all communication within model parallel groups (approximately 30 GB per step). Using NVIDIA's NVLink and NVSwitch interconnects (600 GB/s per GPU within a node, 200 Gbps between nodes), the communication time is approximately 5\% of total step time, demonstrating excellent scaling efficiency.

          The training cost is estimated at \$2-3 million using Meta's owned infrastructure. At cloud pricing of \$3 per A100 hour, the 1,022,362 GPU-hours would cost \$3.1 million. Meta's owned infrastructure likely reduced the effective cost to \$1.5-2 million when accounting for hardware depreciation (A100 GPUs cost approximately \$10,000 each, depreciated over 3-5 years) and operational costs. The power consumption is approximately 1 megawatt (2,048 GPUs × 400 watts per A100), totaling 504 megawatt-hours over 21 days. At \$0.10 per kWh including cooling overhead, the electricity cost is approximately \$75,000.

          The infrastructure requirements include high-bandwidth networking (NVIDIA InfiniBand or equivalent), distributed storage systems (approximately 15 TB for training data and checkpoints), and monitoring infrastructure. Meta's paper notes that they used a custom training framework optimized for their infrastructure, with careful attention to memory management, communication patterns, and fault tolerance. The engineering effort to build this infrastructure and optimize the training pipeline represents a significant investment beyond the raw hardware costs.

          Efficient Pre-training Techniques

          Large-scale pre-training relies on several key optimizations, covered in detail in Chapter~[ref]:

For GPT-3-scale training, combining ZeRO-style optimizer-state sharding, gradient checkpointing, pipeline parallelism, and data parallelism is what makes the job tractable on V100-class GPUs; the exact stack used for GPT-3 has not been publicly disclosed.

          Parameter-Efficient Fine-tuning

          Motivation

          Full fine-tuning challenges:

          Solution: Fine-tune small subset of parameters.

          LoRA: Low-Rank Adaptation

          Definition: Inject trainable low-rank matrices into frozen model:

          Original: $\vh = \mW \vx$ where $\mW \in \R^{d \times d}$

          LoRA:

          $$ \vh = \mW \vx + \Delta \mW \vx = \mW \vx + \mB \mA \vx $$

          where $\mA \in \R^{r \times d}$, $\mB \in \R^{d \times r}$, and $r \ll d$ (typically $r = 4$ to $64$).
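A minimal (hypothetical, numpy-based) LoRA layer makes the construction concrete: $\mB$ is zero-initialized so that at the start of fine-tuning the layer computes exactly the frozen pre-trained projection.

```python
import numpy as np

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update B @ A (sketch)."""

    def __init__(self, W, r=8, alpha=16, seed=0):
        d_out, d_in = W.shape
        rng = np.random.default_rng(seed)
        self.W = W                                       # frozen pre-trained weight
        self.A = 0.01 * rng.standard_normal((r, d_in))   # trainable, small init
        self.B = np.zeros((d_out, r))                    # trainable, zero init
        self.scale = alpha / r                           # common scaling convention

    def __call__(self, x):
        # h = W x + (alpha / r) * B A x ; only A and B receive gradients
        return self.W @ x + self.scale * (self.B @ (self.A @ x))
```

Because $\mB = 0$ at initialization, fine-tuning starts from the pre-trained model's behavior and the low-rank update grows from zero.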

          Parameters:

          Example: GPT-3 175B, apply LoRA with $r=8$ to attention projections.

Single attention layer: four projections ($\mW_Q$, $\mW_K$, $\mW_V$, $\mW_O$), each $d \times d$ with $d = 12288$.

LoRA params per layer (each projection gets an $\mA$ and a $\mB$, i.e. $2rd$ parameters):

$$ 4 \times 2 \times 8 \times 12288 = 786{,}432 \approx 0.79M $$

96 layers total: $96 \times 786{,}432 \approx 75.5$M trainable parameters, about 0.04\% of the 175B frozen parameters.

          Performance: Matches full fine-tuning on many tasks!

          Adapter Layers

          Definition: Insert small bottleneck layers between frozen layers:
          $$ \vh_{\text{adapter}} = \vh + \text{FFN}_{\text{adapter}}(\text{LayerNorm}(\vh)) $$

          where FFN$_{\text{adapter}}$: $d \to d_{\text{bottleneck}} \to d$ with $d_{\text{bottleneck}} \ll d$.

          Typical bottleneck: $d_{\text{bottleneck}} = 64$ for $d = 768$

          Parameters per adapter:

          $$ 2d \cdot d_{\text{bottleneck}} = 2 \times 768 \times 64 = 98{,}304 $$
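The adapter computation is equally compact (a sketch with a ReLU nonlinearity and a simplified LayerNorm; real adapters also carry bias terms and learned LayerNorm parameters):

```python
import numpy as np

def layer_norm(h, eps=1e-5):
    """Simplified LayerNorm without learned scale/shift."""
    mu = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    return (h - mu) / np.sqrt(var + eps)

def adapter(h, W_down, W_up):
    """Residual bottleneck: h + relu(LayerNorm(h) @ W_down) @ W_up."""
    z = layer_norm(h) @ W_down    # d -> d_bottleneck
    z = np.maximum(z, 0.0)        # ReLU nonlinearity
    return h + z @ W_up           # d_bottleneck -> d, residual add
```

With zero-initialized `W_up` the adapter is the identity at the start of fine-tuning, the same trick LoRA uses with its zero-initialized $\mB$.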

          Prompt Tuning

          Definition: Prepend learnable "soft prompt" vectors:

          Input: $[\vp_1, \ldots, \vp_k, \vx_1, \ldots, \vx_n]$

          where $\vp_i \in \R^d$ are learned continuous prompts (not discrete tokens).

          Parameters: Only $k \times d$ prompt vectors (model frozen).

          Typical: $k = 20$ prompts, $d = 768$ $\to$ only 15,360 parameters!

          Multi-Task and Multi-Stage Pre-training

          Multi-Task Pre-training

          Train on multiple objectives simultaneously:

          $$ \mathcal{L}_{\text{total}} = \sum_{i=1}^{K} \lambda_i \mathcal{L}_i $$

          Example (T5):

          Benefits:

          Multi-Stage Pre-training

          Stage 1: General pre-training

          Stage 2: Instruction tuning

          Stage 3: RLHF

          Example: Stage 1: GPT-3 pre-training (175B params, 300B tokens)

          Stage 2: Supervised fine-tuning

          Stage 3: Reward modeling

          Stage 4: PPO optimization

          Result: 1.3B InstructGPT preferred over 175B GPT-3 by humans!

          Transfer Learning Analysis

          Measuring Transfer

          Metrics:

          1. Downstream Performance:

          $$ \Delta = \text{Performance}_{\text{fine-tuned}} - \text{Performance}_{\text{from-scratch}} $$

          2. Sample Efficiency:

          3. Convergence Speed:

          What Makes Good Pre-training?

          Data scale: More data $\to$ better transfer (up to a point)

          Data diversity: Diverse pre-training $\to$ broader transfer

          Model scale: Larger models transfer better

          Objective alignment: Pre-training objective similar to downstream task

          Domain match: Domain-specific pre-training helps domain-specific tasks

          Exercises

          Exercise 1: Compare pre-training objectives:
          1. Train BERT-tiny with: (a) MLM, (b) CLM, (c) Span corruption
          2. Evaluate on GLUE tasks
          3. Which objective transfers best? Why?
          Exercise 2: Implement data filtering pipeline:
          1. Download 10,000 documents from Common Crawl
          2. Remove duplicates (exact and near-duplicate)
          3. Filter by language (keep English)
          4. Filter low-quality (perplexity > threshold)
          5. Report statistics at each stage
          Exercise 3: Implement LoRA:
          1. Load pre-trained GPT-2
          2. Add LoRA layers with $r=8$ to attention
          3. Fine-tune on sentiment analysis
          4. Compare: (a) Full fine-tuning, (b) LoRA, (c) Frozen
          5. Measure: parameters trained, memory, accuracy
          Exercise 4: Analyze transfer learning:
          1. Fine-tune BERT on 5 GLUE tasks
          2. Vary training data: [100, 500, 1000, 5000, all]
          3. Compare to training from scratch
          4. Plot sample efficiency curves
          5. At what point does pre-training stop helping?

          Solutions

          Full solutions for all exercises are available at \url{https://deeplearning.hofkensvermeulen.be}.

          Solution: Exercise 1: Compare Pre-training Objectives

          Expected Results on GLUE:

| Objective | MNLI | QQP | SST-2 | Avg |
|---|---|---|---|---|
| MLM (BERT-style) | 78.3 | 85.2 | 89.7 | 84.4 |
| CLM (GPT-style) | 72.1 | 81.5 | 87.3 | 80.3 |
| Span Corruption (T5-style) | 76.8 | 84.1 | 88.9 | 83.3 |

          Analysis:

          Part (c): Which transfers best?

          MLM (Masked Language Modeling) wins because:

          1. Bidirectional context: Sees both left and right context
          2. Better for understanding: GLUE tasks require comprehension
          3. Sentence-level tasks: MLM learns sentence representations
          4. Natural for classification: [CLS] token aggregates information

CLM (Causal Language Modeling) is weaker because it conditions only on left-to-right context, which limits the sentence-level representations that classification tasks rely on.

Span Corruption is a middle ground: the encoder still sees bidirectional context, but part of the training signal goes to generating multi-token spans rather than understanding.

Key Insight: The pre-training objective should match the downstream task type: MLM-style objectives suit understanding benchmarks like GLUE, while CLM suits generation.

          Solution: Exercise 2: Data Filtering Pipeline

          Pipeline Statistics:

| Stage | Documents | Retention |
|---|---|---|
| Initial (Common Crawl) | 10,000 | 100\% |
| After exact deduplication | 8,742 | 87.4\% |
| After near-duplicate removal | 7,891 | 78.9\% |
| After language filtering | 6,234 | 62.3\% |
| After quality filtering | 4,567 | 45.7\% |

          Key Steps:

          1. Exact Deduplication:

          2. Near-Duplicate Removal:

          3. Language Filtering:

          4. Quality Filtering:

          Final Result: 4,567 high-quality English documents (45.7\% retention)
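The first two stages can be sketched with the standard library alone. This is a toy illustration (document strings, threshold, and shingle size are hypothetical); production pipelines typically use MinHash/LSH for near-duplicate detection, since pairwise Jaccard comparison as below is $O(n^2)$:

```python
# Stdlib-only sketch of exact deduplication (content hashing) and
# near-duplicate removal (word-shingle Jaccard similarity).
import hashlib

def exact_dedup(docs):
    """Drop documents whose content hash was already seen."""
    seen, kept = set(), []
    for doc in docs:
        h = hashlib.md5(doc.encode("utf-8")).hexdigest()
        if h not in seen:
            seen.add(h)
            kept.append(doc)
    return kept

def shingles(text, n=3):
    """Set of overlapping n-word shingles."""
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}

def near_dedup(docs, threshold=0.8):
    """Keep a doc only if its Jaccard similarity to every kept doc is below threshold."""
    kept = []
    for doc in docs:
        s = shingles(doc)
        if all(len(s & shingles(k)) / len(s | shingles(k)) < threshold
               for k in kept):
            kept.append(doc)
    return kept

docs = [
    "the cat sat on the mat today",
    "the cat sat on the mat today",            # exact duplicate
    "the cat sat on the mat today again",      # near duplicate
    "completely different text about something else",
]
deduped = exact_dedup(docs)       # exact duplicate removed
final = near_dedup(deduped)       # near duplicate also removed
print(len(deduped), len(final))   # 3 2
```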

          Quality Metrics:

          Key Insight: Aggressive filtering (54.3\% removed) dramatically improves data quality, leading to better pre-training outcomes.

          Solution: Exercise 3: LoRA Implementation

          Comparison Results:

| Method | Params Trained | Memory (GB) | Accuracy |
|---|---|---|---|
| Full Fine-tuning | 124M (100\%) | 8.2 | 94.3\% |
| LoRA ($r=8$) | 0.3M (0.24\%) | 2.1 | 93.8\% |
| Frozen (linear probe) | 0.001M (0.001\%) | 1.8 | 87.2\% |

          Analysis:

          LoRA Advantages:

          1. Parameter efficiency: 400x fewer trainable parameters
          2. Memory efficiency: 4x less memory (no optimizer states for frozen params)
          3. Near full performance: Only 0.5\% accuracy drop
          4. Fast training: 3x faster per epoch
          5. Modular: Can swap LoRA adapters for different tasks

          When to use each:

          LoRA Configuration:
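A dependency-free sketch of the LoRA update itself, with toy dimensions; real implementations wrap frozen linear layers (e.g. the attention projections) and train only the factors $A$ and $B$:

```python
# LoRA sketch: the frozen weight W is augmented as W + (alpha / r) * B @ A,
# where only A (r x d_in) and B (d_out x r) are trained.

def matmul(X, Y):
    """Naive matrix product of two lists-of-lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def lora_weight(W, A, B, alpha, r):
    """Effective weight after merging the low-rank update into W."""
    delta = matmul(B, A)                 # (d_out x d_in), rank <= r
    scale = alpha / r
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

# Toy example: d_in = d_out = 2, rank r = 1
W = [[1.0, 0.0], [0.0, 1.0]]   # frozen pre-trained weight
A = [[1.0, 1.0]]               # trained, (r x d_in)
B = [[0.5], [0.5]]             # trained, (d_out x r)
print(lora_weight(W, A, B, alpha=1.0, r=1))   # [[1.5, 0.5], [0.5, 1.5]]

# Parameter accounting at realistic sizes: for d = 768 and r = 8,
# r * (d_in + d_out) = 8 * (768 + 768) = 12,288 trainable parameters
# per adapted matrix, vs 768 * 768 = 589,824 if trained in full (~48x fewer).
```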

          Key Insight: LoRA achieves 99.5\% of full fine-tuning performance with 0.24\% of parameters, making it ideal for resource-constrained scenarios.

          Solution: Exercise 4: Transfer Learning Analysis

          Sample Efficiency Results:

| Training Size | 100 | 500 | 1000 | 5000 | All |
|---|---|---|---|---|---|
| Pre-trained BERT | 72.3 | 81.7 | 85.2 | 88.9 | 90.1 |
| From Scratch | 51.2 | 63.8 | 71.4 | 82.3 | 87.8 |
| Improvement | +21.1 | +17.9 | +13.8 | +6.6 | +2.3 |
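As a quick sanity check, the improvement row is just the element-wise difference of the two accuracy rows:

```python
# Recompute the improvement row from the table's two accuracy rows.
pretrained = [72.3, 81.7, 85.2, 88.9, 90.1]   # fine-tuned BERT
scratch = [51.2, 63.8, 71.4, 82.3, 87.8]      # trained from scratch
improvement = [round(p - s, 1) for p, s in zip(pretrained, scratch)]
print(improvement)   # [21.1, 17.9, 13.8, 6.6, 2.3]
```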

          Analysis:

          Key Observations:

          1. Largest gains with small data: +21.1\% at 100 examples
          2. Diminishing returns: Improvement drops as data increases
          3. Convergence point: Around 10k-50k examples
          4. Always helpful: Even with full data, +2.3\% improvement

          Sample Efficiency Curve:

The gap between pre-trained and from-scratch models narrows as the amount of task-specific training data grows.

Pre-training stops helping significantly once task data reaches roughly 50,000 examples, where the improvement shrinks to a few accuracy points.

          Practical Recommendations:

| Data Size | Strategy |
|---|---|
| < 1,000 | Always use pre-training (critical) |
| 1,000 - 10,000 | Pre-training highly beneficial |
| 10,000 - 50,000 | Pre-training still helps |
| > 50,000 | Pre-training provides marginal gains |

          Key Insight: Pre-training is most valuable in low-resource settings, providing 10-20\% accuracy improvements with < 1000 examples, but benefits diminish as task-specific data grows.
