Best Practices and Production Case Studies
Chapter Overview
This final chapter synthesizes practical wisdom from deploying transformers at scale. We cover debugging strategies, hyperparameter tuning, common pitfalls, and real-world case studies from industry deployments of BERT, GPT, and other transformer models.
Learning Objectives
- Apply systematic debugging for transformer training
- Tune hyperparameters effectively
- Avoid common pitfalls in architecture and training
- Learn from real-world deployment case studies
- Design robust production systems
- Plan future-proof transformer architectures
Model Selection
Choosing the right transformer architecture is a critical decision that impacts both performance and resource requirements. This section provides a systematic framework for selecting among the major transformer variants based on task requirements, data availability, and computational constraints.
Architecture Selection Framework
The choice between BERT, GPT, T5, and other architectures depends fundamentally on the nature of your task. BERT and its variants excel at understanding tasks where bidirectional context is crucial, such as classification, named entity recognition, and question answering. The bidirectional attention mechanism allows BERT to build rich representations by attending to both past and future tokens simultaneously, making it particularly effective when the entire input is available at once.
GPT models, in contrast, are designed for generation tasks where autoregressive decoding is required. The unidirectional attention pattern makes GPT a natural fit for text generation, code completion, and any task where outputs must be produced sequentially. While GPT can be adapted for understanding tasks through careful prompting, this is generally less efficient than using a bidirectional model designed for the purpose.
T5 represents a unified approach that frames all tasks as sequence-to-sequence problems. This architecture provides flexibility across both understanding and generation tasks, making it an excellent choice when you need a single model to handle diverse task types. The encoder-decoder structure allows T5 to leverage bidirectional attention in the encoder while maintaining autoregressive generation in the decoder.
Model Size Selection
Selecting the appropriate model size requires balancing performance requirements against computational constraints. The relationship between model size and performance generally follows a power law, with diminishing returns as models grow larger. For most practical applications, the base-sized models provide an excellent balance between capability and efficiency.
BERT-base with 110 million parameters serves as the standard choice for most understanding tasks. It provides strong performance across a wide range of benchmarks while remaining tractable for fine-tuning on a single GPU. BERT-large with 340 million parameters offers modest improvements, typically 1-3 percentage points on downstream tasks, but requires significantly more memory and computation. The large variant is justified primarily when you need to extract maximum performance and have sufficient computational resources.
For GPT models, the size selection depends heavily on the complexity of the generation task. GPT-2 small (117M parameters) suffices for simple completion tasks and domain-specific generation after fine-tuning. GPT-2 medium (345M parameters) provides better coherence for longer generations and more complex tasks. The larger variants (GPT-2 large at 774M and GPT-2 XL at 1.5B parameters) are necessary primarily when working with limited task-specific data, as their stronger pre-trained representations enable better few-shot performance.
Pre-trained versus Training from Scratch
The decision to use pre-trained models versus training from scratch depends on data availability, domain specificity, and computational budget. In nearly all cases, starting from pre-trained weights is the correct choice. Pre-training on large corpora provides general language understanding that transfers effectively to downstream tasks, and the computational cost of pre-training from scratch is prohibitive for most organizations.
Training from scratch becomes viable only in specific circumstances. When working with highly specialized domains where general language models perform poorly, such as medical text with extensive jargon or programming languages not well-represented in pre-training data, domain-specific pre-training may be justified. However, even in these cases, continued pre-training from existing checkpoints is typically more efficient than starting from random initialization.
The computational cost difference is substantial. Pre-training BERT-base from scratch requires approximately 64 TPU days or equivalent GPU time, representing ten thousand dollars or more in compute costs. Fine-tuning the same model on a downstream task typically requires only hours on a single GPU, costing on the order of ten dollars. This roughly thousand-fold difference in cost makes pre-trained models the default choice for nearly all applications.
Cost-Benefit Analysis
A systematic cost-benefit analysis should consider both direct computational costs and opportunity costs. For a typical classification task with 10,000 labeled examples, fine-tuning BERT-base requires approximately 2-4 hours on a single V100 GPU, costing roughly \$6-12 at about \$3 per GPU-hour of on-demand cloud compute. This investment typically yields performance improvements of 5-15 percentage points over traditional methods like logistic regression on TF-IDF features.
Training a smaller model from scratch on the same data might require 8-16 hours and cost \$24-48 at the same rate, while likely achieving inferior performance due to the lack of pre-trained representations. The pre-trained approach thus provides both better performance and lower cost, a rare combination that explains the dominance of transfer learning in modern NLP.
For generation tasks, the cost analysis shifts somewhat. Fine-tuning GPT-2 medium on a specific generation task requires 4-8 hours on a V100, costing \$12-24. However, inference costs become more significant for generation, as producing each token requires a full forward pass through the model. For applications requiring high-throughput generation, the ongoing inference costs may exceed training costs within weeks or months of deployment, making inference optimization critical.
Training Best Practices
Effective training of transformer models requires careful attention to hyperparameter selection, monitoring, and debugging. This section provides comprehensive guidance on the key decisions that impact training success.
Learning Rate Selection
| Scenario | Range | Notes |
|---|---|---|
| BERT fine-tuning | $1$--$5 \times 10^{-5}$ | Lower end for small datasets |
| GPT fine-tuning | $2 \times 10^{-5}$--$10^{-4}$ | Autoregressive is more stable |
| Pre-training from scratch | $10^{-4}$--$6 \times 10^{-4}$ | Requires longer warmup |
Rules of thumb: Scale LR $\sim$linearly with batch size (use LAMB for very large batches). Warmup: 5--10\% of steps for fine-tuning, 10K--50K steps for pre-training.
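The two rules of thumb above can be sketched in a few lines of plain Python. This is a minimal illustration, not a prescribed schedule: the function names `scaled_lr` and `lr_at_step` are hypothetical, and the choice of linear decay after warmup is an assumption (the text only specifies the warmup fraction).

```python
def scaled_lr(base_lr, base_batch, batch):
    """Linear scaling rule: learning rate grows in proportion to batch size."""
    return base_lr * batch / base_batch


def lr_at_step(step, peak_lr, total_steps, warmup_frac=0.1):
    """Linear warmup to peak_lr over warmup_frac of training, then linear decay to 0."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, (total_steps - step) / decay_steps)


# Example: a recipe tuned at batch 32 with LR 2e-5, moved to batch 128.
peak = scaled_lr(2e-5, base_batch=32, batch=128)
schedule = [lr_at_step(s, peak, total_steps=1000) for s in (0, 99, 500, 999)]
print(peak, schedule)
```

Linear scaling is only a starting point; at very large batch sizes it tends to break down, which is where layer-wise optimizers such as LAMB come in.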
Batch Size Selection
| Scenario | Typical Range | Notes |
|---|---|---|
| Fine-tuning (single GPU) | 16--32 | Gradient accumulation for larger effective batch |
| Fine-tuning (multi-GPU) | 32--256 | Scale LR accordingly |
| Pre-training | 256--4096 | Requires LR warmup + LAMB |
If memory-limited, use gradient accumulation: $B_{\text{eff}} = B_{\text{micro}} \times N_{\text{accum}}$ (see Chapter~[ref] for implementation).
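To see why gradient accumulation preserves the effective batch size, the sketch below uses a toy one-parameter model in plain Python rather than a real framework. The helper names `mse_grad` and `accumulated_grad` are hypothetical, and the sketch assumes equal-sized micro-batches and a loss that is a mean over examples.

```python
def mse_grad(w, xs, ys):
    """Gradient of mean squared error with respect to w for the model y = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w, xs, ys, micro_batch):
    """Sum micro-batch gradients, each scaled by 1 / n_accum, before one update."""
    assert len(xs) % micro_batch == 0, "sketch assumes equal-sized micro-batches"
    n_accum = len(xs) // micro_batch
    total = 0.0
    for i in range(n_accum):
        lo, hi = i * micro_batch, (i + 1) * micro_batch
        total += mse_grad(w, xs[lo:hi], ys[lo:hi]) / n_accum
    return total


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0 * x for x in xs]
full = mse_grad(0.5, xs, ys)              # one batch of 8
accum = accumulated_grad(0.5, xs, ys, 2)  # B_micro = 2, N_accum = 4
print(full, accum)
```

The two gradients agree up to floating-point rounding, which is why the scaling by $1 / N_{\text{accum}}$ matters: without it, the accumulated gradient would be $N_{\text{accum}}$ times too large.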
Checkpointing and Monitoring Strategy
Effective monitoring is essential for detecting problems early and understanding training dynamics. At minimum, you should track training loss, validation loss, and task-specific metrics at regular intervals. Logging every 100-500 steps provides sufficient granularity to detect issues without generating excessive data.
Checkpointing strategy depends on training duration and stability. For short fine-tuning runs of a few hours, saving checkpoints every epoch is sufficient. For longer training runs, save checkpoints every few thousand steps to protect against hardware failures and enable recovery from divergence. Always keep at least the three most recent checkpoints, as the most recent checkpoint may be corrupted or represent a point after training has diverged.
Beyond basic loss monitoring, tracking gradient norms provides early warning of training instability. Gradient norms should remain relatively stable throughout training, typically in the range of 0.1 to 10.0. Sudden spikes in gradient norm often precede loss divergence and indicate that gradient clipping or learning rate reduction may be necessary. Similarly, monitoring the ratio of update magnitude to parameter magnitude helps ensure that learning rates are appropriate.
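The global gradient norm and the clipping step mentioned above can be written out in plain Python as follows. This is a sketch of the arithmetic only: gradients are represented as nested lists of floats, and the function names are hypothetical. In PyTorch, `torch.nn.utils.clip_grad_norm_` performs the same kind of rescaling on real parameter tensors.

```python
import math


def global_grad_norm(grads):
    """L2 norm over all gradient values, given one list of floats per parameter tensor."""
    return math.sqrt(sum(g * g for tensor in grads for g in tensor))


def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale all gradients so their global L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping,
    which is the value worth logging to spot spikes.
    """
    norm = global_grad_norm(grads)
    scale = min(1.0, max_norm / (norm + 1e-6))
    return [[g * scale for g in tensor] for tensor in grads], norm


clipped, norm_before = clip_by_global_norm([[3.0], [4.0]], max_norm=1.0)
print(norm_before, global_grad_norm(clipped))
```

Logging the pre-clipping norm rather than the clipped one is a deliberate choice here: the clipped value is capped at `max_norm`, so it hides exactly the spikes you want to see.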
Memory Management
Memory is often the primary constraint in transformer training. Rather than repeating the detailed memory analysis from Chapter~21, this section provides a quick-reference decision guide.
Out-of-Memory Decision Checklist
When encountering memory errors, apply these steps in order:
- Reduce sequence length (if task permits). Attention memory scales quadratically with sequence length; truncating from 512 to 256 tokens cuts attention memory by $\sim$4×. Many classification tasks work well at 128 tokens.
- Enable mixed precision (torch.cuda.amp). Halves activation and gradient memory with 2--3× speedup on tensor-core GPUs. Minimal code changes required. See Chapter~21 for implementation.
- Enable gradient checkpointing. Trades 20--30\% additional compute time for 40--50\% activation memory reduction. Apply via torch.utils.checkpoint.
- Reduce batch size and use gradient accumulation. Maintain effective batch size $B_{\text{eff}} = B_{\text{micro}} \times N_{\text{accum}}$ while fitting in memory. Linear memory savings.
- Consider model parallelism. When the model itself exceeds single-GPU memory, use pipeline parallelism (split by layers) or tensor parallelism (split within layers). Frameworks: DeepSpeed, Megatron-LM. See Chapter~22 for multi-GPU strategies.
Memory Estimation Rule of Thumb
Total training memory (bytes) $\approx$ (Parameters $\times$ 16 bytes) + (Batch $\times$ SeqLen $\times$ Hidden $\times$ Layers $\times$ 40 bytes); divide by $10^9$ to obtain GB. The first term covers parameters, gradients, and optimizer states; the second covers activations. For BERT-base (110M params, 12 layers, hidden size 768, batch 32, seq 512): $1.8 + 6.0 \approx 8$ GB.
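The rule of thumb translates directly into a small helper. The function name and keyword defaults below are illustrative assumptions; the constants 16 and 40 bytes are the ones given in the rule, and the result is a rough estimate rather than a measurement.

```python
def estimate_training_memory_gb(params, batch, seq_len, hidden, layers,
                                bytes_per_param=16, bytes_per_activation=40):
    """Rough training-memory estimate in GB from the rule of thumb above."""
    # Parameters, gradients, and optimizer states.
    static_bytes = params * bytes_per_param
    # Activations kept for the backward pass.
    activation_bytes = batch * seq_len * hidden * layers * bytes_per_activation
    return (static_bytes + activation_bytes) / 1e9


# BERT-base: 110M parameters, 12 layers, hidden size 768.
bert_base_gb = estimate_training_memory_gb(110e6, batch=32, seq_len=512,
                                           hidden=768, layers=12)
print(f"{bert_base_gb:.1f} GB")
```

Halving the batch size or the sequence length halves only the activation term, which is why those two levers appear first in the checklist.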
Debugging Transformers
Systematic Debugging Workflow
Level 1: Data sanity checks
- Visualize input samples
- Verify labels are correct
- Check for data leakage
- Validate preprocessing
Level 2: Model sanity checks
- Overfit single batch (should reach near-zero loss)
- Check gradient flow (no dead neurons)
- Verify shapes at each layer
- Test with minimal model first
Level 3: Training dynamics
- Monitor loss curves (training + validation)
- Track gradient norms
- Visualize attention weights
- Check learning rate schedule
Symptom: Loss not decreasing
Diagnose:
- Learning rate too low? Try 10× higher
- Frozen layers? Check requires\_grad
- Optimizer issue? Try SGD as baseline
- Bad initialization? Re-initialize
- Data issue? Manually inspect batches
Symptom: NaN loss
Diagnose:
- Gradient explosion? Add clipping
- Numerical instability? Check mask values ($-\infty$ vs $-1e9$)
- Learning rate too high? Reduce 10×
- Mixed precision issue? Check loss scaling
Gradient Analysis
Monitor per-layer gradient norms throughout training (see Chapter~[ref] for PyTorch profiling tools). Healthy gradients have norms between $10^{-4}$ and $10^{1}$, are similar across layers, and are non-zero for all layers. Sudden spikes precede divergence; vanishing gradients indicate dead layers.
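A per-layer health check along these lines might look like the following sketch. It assumes the per-layer norms have already been collected into a dictionary (how you collect them depends on your framework); the function name, the layer names, and the issue labels are hypothetical, while the thresholds are the ones stated above.

```python
def check_layer_grad_norms(norms, low=1e-4, high=10.0):
    """Return {layer_name: issue} for layers whose gradient norm looks unhealthy."""
    issues = {}
    for name, norm in norms.items():
        if norm == 0.0:
            issues[name] = "zero gradient (frozen or dead layer?)"
        elif norm < low:
            issues[name] = "vanishing"
        elif norm > high:
            issues[name] = "exploding"
    return issues


example_norms = {
    "embeddings": 0.8,
    "layer_0.attention": 0.5,
    "layer_5.ffn": 2e-6,
    "layer_11.ffn": 42.0,
    "classifier": 0.0,
}
print(check_layer_grad_norms(example_norms))
```

Running a check like this every few hundred steps is cheap, and the layer name in the report usually narrows the search considerably.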
Common Training Issues: Quick Reference
| Symptom | Likely Cause | Fix |
|---|---|---|
| Out of memory | Batch/sequence too large | See memory checklist (Section~[ref]) |
| Loss not decreasing | Learning rate too low | Increase LR 3--10$\times$; verify overfit on 1 batch |
| Loss diverges / NaN | LR too high or no clipping | Reduce LR; clip gradients to norm 1.0 |
| Slow training | Low GPU utilization | Increase batch size; add DataLoader workers |
| Train/val gap growing | Overfitting | More dropout; data augmentation; smaller model |
Inference Optimization
Inference costs often exceed training costs over a model's lifetime. This section provides decision tables for choosing optimization strategies; see Chapters~21 and~22 for detailed implementations.
Optimizing for Latency
| Technique | Speedup | Accuracy Cost | Effort |
|---|---|---|---|
| FP16 inference | 1.5--2$\times$ | $<$0.1\% | Minimal |
| INT8 quantization (PTQ) | 2--4$\times$ | 0.5--2\% | Low (calibration) |
| INT8 quantization (QAT) | 2--4$\times$ | $<$0.5\% | Medium (retraining) |
| KV caching (autoregressive) | 5--10$\times$ | None | Low |
| TorchScript / torch.compile | 1.2--1.5$\times$ | None | Low |
| TensorRT compilation | 2--5$\times$ | $<$0.5\% | Medium |
Optimizing for Throughput
- Dynamic batching: Accumulate requests for 10--50\,ms, process together. Improves GPU utilization from 20--30\% to 70--90\%.
- ONNX Runtime / TensorRT: Graph-level optimizations provide 1.5--5$\times$ throughput gains via operator fusion and kernel selection.
- Model distillation: Train a smaller student (e.g., 6-layer DistilBERT retains 97\% of BERT-base accuracy at 1.6$\times$ speed). Combine with quantization for 5--10$\times$ cumulative speedup.
- Continuous batching (vLLM): For autoregressive generation, allow new requests to join in-flight batches as others complete. See Chapter~22.
Hardware Selection Summary
| Scenario | Recommended Hardware | Rationale |
|---|---|---|
| $<$10 req/s, small model | CPU | Lower cost, sufficient throughput |
| 10--100 req/s | T4 GPU + TensorRT | Good latency/cost balance |
| $>$100 req/s | A10/A100 GPU | Maximum throughput |
| Edge / mobile | INT8/INT4 on device | Memory and power constrained |
See Chapter~22 for detailed hardware analysis and cost breakdowns.
Cost Optimization
Understanding and optimizing costs is essential for sustainable deployment of transformer models. This section provides detailed analysis of training and inference costs with concrete examples.
Training Cost Analysis
Training costs depend on model size, dataset size, and hardware selection. For BERT-base pre-training on 16 GB of text, the original paper reports using 16 TPU chips for 4 days, equivalent to approximately 64 TPU days. At current Google Cloud pricing of roughly \$8 per TPU hour, this amounts to approximately \$12,000 for pre-training. Using equivalent GPU resources (64 V100 GPUs for 4 days) would cost approximately \$18,000 at on-demand rates of about \$3 per GPU-hour.
Fine-tuning costs are much more modest. Training BERT-base on a typical classification task with 10,000 examples requires 2-4 hours on a single V100 GPU. At AWS on-demand pricing of approximately \$3 per hour for a p3.2xlarge instance, this amounts to \$6-12 per fine-tuning run. Even with extensive hyperparameter search involving 20-30 runs, total costs stay in the low hundreds of dollars (roughly \$120-360).
Cloud versus on-premise costs depend heavily on utilization. For continuous training workloads, purchasing GPUs becomes cost-effective after 12-18 months of use. A DGX A100 system costs approximately \$200,000 but provides compute equivalent to \$15,000 per month at cloud on-demand rates. For intermittent workloads or experimentation, cloud computing is more economical due to the flexibility to scale up and down.
Spot instances provide substantial savings for training workloads that can tolerate interruptions. AWS spot instances for p3.2xlarge typically cost 50-70\% less than on-demand rates, reducing fine-tuning costs to roughly \$2-6 per run. Implementing checkpointing and automatic restart logic allows training to resume after spot instance interruptions, making this an attractive option for cost-conscious training.
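The cost figures in this section are simple products of device count, hours, and hourly rate, so they are easy to recompute when prices change. The sketch below reproduces a few of them; the function names are hypothetical, and the break-even calculation deliberately ignores power, hosting, staffing, and depreciation, which a real comparison would need to include.

```python
def compute_cost(devices, hours, rate_per_device_hour):
    """Total cost in dollars for a fixed-size fleet billed per device-hour."""
    return devices * hours * rate_per_device_hour


def spot_cost(on_demand_cost, discount):
    """Cost after a fractional spot discount (0.7 means 70% cheaper)."""
    return on_demand_cost * (1.0 - discount)


def breakeven_months(purchase_price, cloud_cost_per_month):
    """Months of continuous use after which buying beats renting."""
    return purchase_price / cloud_cost_per_month


# BERT-base pre-training: 16 TPU chips for 4 days at about $8 per TPU hour.
pretrain = compute_cost(devices=16, hours=4 * 24, rate_per_device_hour=8.0)
# Fine-tuning: 2 to 4 hours on one V100 at about $3 per hour.
finetune_low = compute_cost(1, 2, 3.0)
finetune_high = compute_cost(1, 4, 3.0)
print(pretrain, finetune_low, finetune_high)
print(spot_cost(finetune_high, 0.7), breakeven_months(200_000, 15_000))
```

Treat the outputs as order-of-magnitude planning numbers; cloud prices change often and vary by region.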
Training Time Estimation
Estimating training time helps with planning and cost prediction. For fine-tuning, a useful rule of thumb is that BERT-base processes approximately 100-150 examples per second on a V100 GPU with batch size 32 and sequence length 128. For a dataset of 100,000 examples trained for 3 epochs, this translates to 2,000-3,000 seconds, or roughly 35-50 minutes of training time.
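The fine-tuning estimate is a single division, sketched below with a hypothetical helper name. It assumes throughput is constant over the run and ignores evaluation, checkpointing, and data-loading stalls, so real wall-clock time will be somewhat longer.

```python
def finetune_seconds(num_examples, epochs, examples_per_second):
    """Estimated pure-compute training time in seconds."""
    return num_examples * epochs / examples_per_second


slow = finetune_seconds(100_000, epochs=3, examples_per_second=100)
fast = finetune_seconds(100_000, epochs=3, examples_per_second=150)
print(f"{fast:.0f} to {slow:.0f} seconds ({fast / 60:.0f} to {slow / 60:.0f} minutes)")
```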
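Pre-training time scales with dataset size and model size. The original BERT-base recipe on 16 GB of text ran approximately 1 million training steps with batch size 256 and sequences of up to 512 tokens, which is on the order of 130 billion tokens processed (about 40 passes over a corpus of roughly 3.3 billion words). As a simpler worked example, a single pass over roughly 4 billion tokens at 1,000 tokens per second per V100 GPU requires 4 million GPU-seconds or approximately 1,100 GPU-hours. With 16 GPUs, this translates to roughly 70 hours or 3 days of training; additional passes scale this time proportionally.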
Training time scales approximately linearly with parameter count. GPT-2 medium with 345 million parameters takes roughly 3× longer to train than GPT-2 small with 117 million parameters, assuming the same dataset and batch size. However, larger models often benefit from larger batch sizes, which can partially offset the increased time per step.
Inference Cost Analysis
Inference costs depend on request volume, latency requirements, and model size. For a BERT-base classification service processing 1 million requests per day with average latency requirements of 100ms, a single V100 GPU can handle approximately 100 requests per second with dynamic batching, or 8.6 million requests per day. This suggests that a single GPU is sufficient. The monthly cost depends heavily on the instance type: approximately \$200-300 per month for a low-cost inference GPU or discounted capacity, versus roughly \$2,200 per month for a V100 at \$3 per hour on-demand.
For generation tasks, costs are higher due to the sequential nature of autoregressive decoding. GPT-2 medium generating 100 tokens per request can process approximately 10-20 requests per second per GPU, depending on batch size and sequence length. For 1 million requests per day (about 12 requests per second), this requires 1-2 GPUs, costing \$400-600 per month. At 100 tokens per request this is roughly 3 billion generated tokens per month, so the cost per million tokens is approximately \$0.13-0.20 for self-hosted inference at this utilization.
Comparing self-hosted to API costs reveals significant differences at scale. OpenAI's GPT-3.5 API costs approximately \$2 per million tokens for input and output combined. For applications processing 100 million tokens per month, this amounts to \$200 per month. Self-hosting a comparable model would require 4-8 GPUs costing \$1,600-3,200 per month, making the API more economical at this scale. However, at 1 billion tokens per month, self-hosting becomes competitive, and at 10 billion tokens per month, self-hosting is clearly more economical.
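The crossover point between API and self-hosted pricing can be estimated with the sketch below. It uses the figures quoted above (\$2 per million tokens, \$1,600-3,200 per month for self-hosting) and makes a simplifying assumption that is stated in the code: the self-hosted fleet has a fixed monthly cost and enough capacity for the volume in question. The function names are hypothetical.

```python
def api_cost(tokens_per_month, price_per_million=2.0):
    """Monthly API bill in dollars at a flat per-token price."""
    return tokens_per_month / 1e6 * price_per_million


def breakeven_tokens(selfhost_cost_per_month, price_per_million=2.0):
    """Monthly token volume at which the API bill equals a fixed self-hosting cost.

    Assumes the self-hosted fleet can serve that volume without adding GPUs.
    """
    return selfhost_cost_per_month / price_per_million * 1e6


for volume in (100e6, 1e9, 10e9):
    print(f"{volume:.0e} tokens/month -> API ${api_cost(volume):,.0f}")
print(breakeven_tokens(1_600), breakeven_tokens(3_200))
```

Under these assumptions the break-even falls between roughly 0.8 and 1.6 billion tokens per month, which matches the observation that self-hosting becomes competitive around 1 billion tokens.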
Cost Optimization Strategies
Several strategies can substantially reduce both training and inference costs. For training, using mixed precision reduces training time by 2-3×, directly reducing costs by the same factor. Gradient accumulation allows using smaller, cheaper GPU instances by simulating larger batch sizes. Spot instances reduce costs by 50-70\% for workloads that can tolerate interruptions.
For inference, quantization and distillation reduce both latency and cost. A distilled and quantized model may achieve 5-10× higher throughput than the original model, allowing a single GPU to handle the load that previously required 5-10 GPUs. This directly translates to 5-10× cost reduction. Dynamic batching improves GPU utilization from 20-30\% to 70-90\%, effectively tripling throughput without additional hardware.
Caching can dramatically reduce inference costs for applications with repeated queries. If 30\% of requests are exact duplicates of earlier requests, caching responses eliminates roughly 30\% of inference costs. Semantic caching using embedding similarity can extend this to near-duplicate queries, potentially caching 50-70\% of requests in some applications.
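An exact-match response cache can be sketched with the standard library alone. The class below is a hypothetical illustration, not a production component: it normalizes case and whitespace before lookup (an assumption about what counts as a duplicate), evicts the least recently used entry when full, and takes the model call as a plain function argument. Semantic caching would replace the string key with an embedding lookup and is not shown.

```python
from collections import OrderedDict


class ResponseCache:
    """Small LRU cache keyed on normalized request text."""

    def __init__(self, max_items=10_000):
        self.max_items = max_items
        self._store = OrderedDict()
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(text):
        # Treat requests that differ only in case or spacing as duplicates.
        return " ".join(text.lower().split())

    def get_or_compute(self, text, model_fn):
        key = self._key(text)
        if key in self._store:
            self.hits += 1
            self._store.move_to_end(key)  # mark as recently used
            return self._store[key]
        self.misses += 1
        result = model_fn(text)
        self._store[key] = result
        if len(self._store) > self.max_items:
            self._store.popitem(last=False)  # evict least recently used
        return result


cache = ResponseCache(max_items=2)
fake_model = lambda text: len(text.split())  # stand-in for real inference
for query in ["best pizza near me", "Best  pizza near me", "weather today"]:
    cache.get_or_compute(query, fake_model)
print(cache.hits, cache.misses)
```

The hit rate `hits / (hits + misses)` is the number to monitor: it is a direct estimate of the share of inference calls the cache is saving.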
Autoscaling based on demand prevents paying for idle resources during low-traffic periods. For applications with diurnal traffic patterns, autoscaling can reduce costs by 40-60\% compared to provisioning for peak load. Kubernetes and cloud-native deployment platforms make autoscaling straightforward to implement.
Production Deployment
Production deployment involves serving infrastructure, monitoring, and safe rollout practices. Chapter~22 covers serving frameworks (TorchServe, Triton, vLLM), deployment architectures (Ray Serve, Kubernetes), and KV cache management in detail. Here we summarize the key decision points.
Deployment Checklist
Before deployment:
- Optimize model: quantize (INT8/FP16), export to ONNX or TensorRT, enable KV caching for generation.
- Benchmark under realistic conditions (expected batch sizes, sequence lengths, peak load).
- Set up monitoring: latency percentiles (p50, p95, p99), throughput, error rate, GPU utilization.
- Plan scaling: autoscaling rules, load balancing, maximum replica count.
Safe rollout: Use canary deployment (1--5\% traffic) or shadow mode (run new model alongside production without serving results) before full rollout. Monitor key metrics for several hours before increasing traffic. Maintain instant rollback capability.
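One common way to implement the canary split is deterministic hashing of a request or user identifier, so that the same caller always sees the same model version. The sketch below assumes such an identifier is available; the function name `route` and the bucket granularity are illustrative choices, not part of any serving framework.

```python
import hashlib


def route(request_id, canary_percent=5.0):
    """Deterministically send about canary_percent of traffic to the canary model."""
    digest = hashlib.sha256(request_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) % 10_000  # buckets 0..9999, i.e. 0.01% steps
    return "canary" if bucket < canary_percent * 100 else "production"


ids = [f"user-{i}" for i in range(20_000)]
share = sum(route(i) == "canary" for i in ids) / len(ids)
print(f"canary share: {share:.3f}")
```

Rolling back is then a configuration change: setting `canary_percent` to 0 sends all traffic to production without redeploying anything.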
Practical Checklists
These checklists provide systematic guidance for common transformer workflows, helping ensure that critical steps are not overlooked.
Before Training Checklist
Before beginning training, verify that you have made appropriate decisions about resources and configuration. Estimate memory requirements using the formulas provided earlier, ensuring that your chosen batch size and sequence length will fit in available GPU memory with some margin for safety. Select hardware appropriate for your model size and training duration, considering the trade-offs between cost and training time.
Choose batch size and sequence length based on your task requirements and memory constraints. Remember that sequence length has a quadratic effect on attention memory, so reducing it provides substantial savings if your task permits. Set up monitoring and logging infrastructure before training begins, as debugging issues after the fact is much more difficult than catching them in real time.
Estimate training time and cost using the guidelines provided earlier. This helps with planning and ensures that you have allocated sufficient budget and time for the training run. For long training runs, verify that checkpointing is configured correctly and test recovery from checkpoints before committing to the full training run.
During Training Checklist
While training is in progress, monitor loss and metrics regularly to detect issues early. Training loss should decrease steadily, though not necessarily monotonically. Validation loss should track training loss initially, with some divergence expected as training progresses. If validation loss increases while training loss decreases, you may be overfitting.
Check GPU utilization to ensure that you are using resources efficiently. Utilization should be consistently above 80\% during training. Lower utilization suggests that batch size is too small, data loading is a bottleneck, or there are inefficiencies in the training loop. Monitor memory usage to ensure you are not close to OOM errors, which can cause training to fail unexpectedly.
Save checkpoints regularly according to your checkpointing strategy. Verify that checkpoints are being saved successfully and that you can load them for recovery. Validate periodically on a held-out set to track generalization performance. The frequency of validation depends on training duration, but every few hundred steps or every epoch is typical.
Before Deployment Checklist
Before deploying a model to production, optimize it for inference using the techniques described earlier. Apply quantization if accuracy permits, as the performance benefits are substantial. Consider distillation if you need further speedup and have time for the additional training. Export the model to an optimized format like ONNX or TensorRT if using those serving frameworks.
Benchmark latency and throughput under realistic conditions, including the batch sizes and sequence lengths you expect in production. Test with both average-case and worst-case inputs to understand performance variability. Estimate serving costs based on expected request volume and the hardware required to meet latency requirements.
Set up monitoring and alerting for the production deployment. Ensure that you can track request rate, latency, error rate, and resource utilization. Configure alerts for anomalies in these metrics. Plan your scaling strategy, including autoscaling rules if using dynamic scaling.
Test the deployment pipeline end-to-end, including model loading, preprocessing, inference, and postprocessing. Verify that error handling works correctly and that failures are logged appropriately. Conduct load testing to ensure the system can handle expected traffic with appropriate margins for spikes.
Hyperparameter Tuning
Critical Hyperparameters (Ordered by Impact)
1. Learning Rate (highest impact)
- Typical range: $[10^{-5}, 10^{-3}]$
- BERT fine-tuning: $1$--$5 \times 10^{-5}$
- GPT pre-training: $2$--$6 \times 10^{-4}$
- Rule: Larger models need smaller LR
2. Batch Size
- Trade-off: Speed vs generalization
- Typical: 16-256 for fine-tuning, 256-4096 for pre-training
- Scale LR linearly with batch size
3. Warmup Steps
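- Typical: 5-10\% of total training steps for fine-tuning; pre-training runs use a much smaller fraction
- BERT pre-training: 10,000 steps (1\% of 1M steps)
- GPT-3: 375M tokens (out of 300B, about 0.1\%)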
4. Weight Decay
- Typical: $0.01$ to $0.1$
- AdamW: Decay is decoupled from the gradient-based update
5. Dropout
- Standard: $0.1$
- Larger models: Lower dropout (0.05 or none)
- Apply uniformly (attention, FFN, embeddings)
Tuning Strategy
Phase 1: Coarse search
- Grid/random search over wide ranges
- Short runs (10\% of full training)
- Focus on learning rate first
Phase 2: Fine search
- Narrow ranges around best from Phase 1
- Longer runs (50\% of full training)
- Tune other hyperparameters
Phase 3: Validation
- Full training with best settings
- Multiple seeds for robustness
- Final evaluation on test set
Example (learning rate search for a fine-tuning run):
Coarse search:
- Try: $[10^{-5}, 3 \times 10^{-5}, 10^{-4}, 3 \times 10^{-4}]$
- Train 1 epoch each
- Best: $3 \times 10^{-5}$ (85.2\% dev accuracy)
Fine search:
- Try: $[2 \times 10^{-5}, 3 \times 10^{-5}, 4 \times 10^{-5}]$
- Train 3 epochs each
- Best: $3 \times 10^{-5}$ (86.1\% dev accuracy)
Final:
- Train with $\text{LR} = 3 \times 10^{-5}$, 5 epochs
- Test accuracy: 85.8\%
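The coarse-then-fine procedure can be expressed as a short driver function. In the sketch below, `evaluate` stands for a full train-and-validate run and is supplied by the caller; `toy_evaluate` is a made-up stand-in whose score peaks near $3 \times 10^{-5}$ so the example runs instantly. The refinement factors are an assumption chosen to reproduce a fine grid like the one in the example above.

```python
import math


def coarse_to_fine(evaluate, coarse_grid, refine_factors=(2 / 3, 1.0, 4 / 3)):
    """Two-phase learning-rate search: short coarse runs, then longer fine runs."""
    # Phase 1: short runs over a wide, roughly logarithmic grid.
    coarse_scores = {lr: evaluate(lr, epochs=1) for lr in coarse_grid}
    best_coarse = max(coarse_scores, key=coarse_scores.get)
    # Phase 2: longer runs on a narrow grid around the coarse winner.
    fine_grid = [best_coarse * f for f in refine_factors]
    fine_scores = {lr: evaluate(lr, epochs=3) for lr in fine_grid}
    best = max(fine_scores, key=fine_scores.get)
    return best, coarse_scores, fine_scores


def toy_evaluate(lr, epochs):
    """Stand-in for a real training run: dev score peaks near lr = 3e-5."""
    return 0.86 - 0.05 * abs(math.log10(lr) - math.log10(3e-5)) + 0.001 * epochs


best_lr, coarse, fine = coarse_to_fine(toy_evaluate, [1e-5, 3e-5, 1e-4, 3e-4])
print(best_lr)
```

The final phase (full training with several seeds and a single test-set evaluation) is intentionally left out of the driver, since the test set should be touched only once, after the search is finished.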
Common Pitfalls and Solutions
Architecture Pitfalls
Pitfall 1: Forgetting positional information
- Symptom: Model treats sequence as bag-of-words
- Solution: Verify position encoding is added
Pitfall 2: Incorrect masking
- Symptom: Information leakage or blocked attention
- Solution: Visualize attention matrices, verify mask shape
Pitfall 3: Not sharing embeddings
- Symptom: Embedding parameters counted twice (separate input and output matrices)
- Solution: Weight tying between input/output embeddings
Training Pitfalls
Pitfall 4: Insufficient warmup
- Symptom: Training unstable early, doesn't recover
- Solution: Increase warmup to 10\% of training
Pitfall 5: Wrong learning rate scale
- Symptom: Loss not decreasing or diverging
- Solution: Learning rate finder, try 10× up/down
Pitfall 6: Overfitting small datasets
- Symptom: Large train/val gap
- Solution: More dropout, data augmentation, smaller model
Deployment Pitfalls
Pitfall 7: Batch size 1 in production
- Symptom: Poor GPU utilization
- Solution: Dynamic batching, accumulate requests
Pitfall 8: Not using mixed precision
- Symptom: Slow inference, high memory
- Solution: FP16 inference, quantization
Pitfall 9: No KV caching for generation
- Symptom: Slow text generation (quadratic in length)
- Solution: Cache key/value tensors
Case Study: BERT for Search Ranking
Problem Setup
Task: Rank search results by relevance
Input: Query + Document pairs
Output: Relevance score [0, 1]
Architecture Decisions
Model: BERT-base with regression head
Input format:
[CLS] query tokens [SEP] document tokens [SEP]
Output: $\text{score} = \sigma(\mathbf{W} \mathbf{h}_{\text{[CLS]}} + b)$
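The regression head is a single affine map followed by a sigmoid. The sketch below writes it out in plain Python for a toy 4-dimensional [CLS] vector (BERT-base would use 768 dimensions); the function name and the example weights are made up for illustration, and the sigmoid is written in a numerically stable form as a precaution.

```python
import math


def relevance_score(h_cls, w, b):
    """Sigmoid of a linear projection of the [CLS] representation."""
    z = sum(wi * hi for wi, hi in zip(w, h_cls)) + b
    # Numerically stable sigmoid: avoid exp() of a large positive number.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


h = [0.2, -1.0, 0.5, 0.0]   # toy [CLS] vector
w = [1.0, 0.5, -0.3, 2.0]   # toy head weights
print(relevance_score(h, w, b=0.1))
```

Because the output is squashed into the interval from 0 to 1, it can be regressed directly against click-through rate labels.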
Training Strategy
Data:
- 10M query-document pairs
- Labels: Click-through rate (0-1)
- Hard negatives: Top results without clicks
Loss: Mean squared error on CTR prediction
Optimization:
- Learning rate: $2 \times 10^{-5}$
- Batch size: 256
- Warmup: 10,000 steps
- Total: 100,000 steps
Production Deployment
Optimizations:
- Quantize to INT8 (3× speedup)
- Distill to 6-layer model (2× speedup)
- Deploy with ONNX Runtime
- Dynamic batching (avg batch size 32)
Results:
- Latency: 15ms p99 (vs 200ms baseline)
- Throughput: 2000 QPS per GPU
- Relevance: +8\% improvement over TF-IDF
Case Study: GPT for Code Generation
Problem Setup
Task: Generate Python code from natural language
Example:
Input: "Function to reverse a string"
Output:
def reverse_string(s):
    return s[::-1]
Model and Data
Model: GPT-2 medium (345M params)
Data:
- GitHub public repositories (Python)
- Filtered: Only files with docstrings
- Format: Docstring $\to$ Implementation
- Total: 50GB, 10B tokens
Training
Pre-training: Start from GPT-2 checkpoint
Fine-tuning:
- 100,000 steps
- Learning rate: $5 \times 10^{-5}$
- Context: 1024 tokens
- Batch: 128 sequences
Evaluation
Metrics:
- Pass@k: \% of problems solved by at least one of k sampled programs
- BLEU: Token overlap with reference
- Human evaluation: Correctness + readability
Results:
- Pass@1: 42\%
- Pass@10: 71\%
- Human preferred over baseline: 78\%
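Pass@k is usually estimated from more than k samples per problem to reduce variance. The sketch below implements a commonly used unbiased estimator, $1 - \binom{n-c}{k} / \binom{n}{k}$, where $n$ samples are drawn and $c$ of them pass the unit tests. Whether the case study above used this exact estimator is not stated, so treat it as one reasonable choice rather than the method behind the reported numbers.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased estimate of pass@k from n samples of which c pass the tests."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results, k):
    """Average pass@k over problems; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Three hypothetical problems, 20 samples each.
results = [(20, 0), (20, 3), (20, 20)]
print(mean_pass_at_k(results, k=1), mean_pass_at_k(results, k=10))
```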
Future Directions
Architectural Innovations
1. Efficient attention
- Linear complexity methods
- State space models (S4, Mamba)
- Hybrid CNN-attention architectures
2. Multimodal integration
- Unified text-image-audio models
- Better cross-modal alignment
- Efficient fusion strategies
3. Long context
- Million-token contexts
- Hierarchical memory
- Retrieval-augmented transformers
Training Innovations
1. Sample efficiency
- Better pre-training objectives
- Curriculum learning
- Few-shot and zero-shot learning
2. Scaling
- Mixture of experts
- Conditional computation
- Efficient parallelism strategies
3. Alignment
- Better RLHF techniques
- Constitutional AI
- Value alignment
Conclusion
Key Takeaways
Architecture:
- Attention is powerful and flexible
- Position encodings crucial for sequences
- Residuals + normalization enable depth
Training:
- Pre-training + fine-tuning is dominant paradigm
- Warmup is critical for stability
- Scale requires careful optimization
Deployment:
- Quantization and distillation for efficiency
- Batching crucial for throughput
- Monitor performance in production
Final Advice
For practitioners:
- Start simple: Use pre-trained models
- Debug systematically: Data, model, training
- Optimize iteratively: Accuracy first, then speed
- Monitor continuously: Metrics, errors, drift
For researchers:
- Understand fundamentals deeply
- Question assumptions: Why does this work?
- Experiment rigorously: Ablations, multiple seeds
- Share knowledge: Open source, papers, blogs
This concludes our comprehensive journey through deep learning and transformers. You now have the mathematical foundations, practical implementations, and real-world insights to build state-of-the-art transformer models!
Exercises
Exercise 1:
- Train 6-layer student on BERT-base teacher
- Use distillation + MLM + cosine losses
- Evaluate on GLUE
- Measure compression ratio and speedup
Exercise 2:
- Given: a model that trains but performs poorly
- Find 3 subtle bugs (architecture, training, data)
- Fix and verify improvements
Exercise 3:
- Fine-tune on classification task
- Quantize to INT8
- Export to ONNX
- Create REST API with FastAPI
- Load test and optimize
Solutions
Full solutions for all exercises are available at \url{https://deeplearning.hofkensvermeulen.be}.
Training Configuration:
- Student: 6 layers, 768 hidden, 12 heads (66M params)
- Teacher: BERT-base (110M params)
- Loss: $\mathcal{L} = \alpha \mathcal{L}_{\text{distill}} + \beta \mathcal{L}_{\text{MLM}} + \gamma \mathcal{L}_{\text{cosine}}$
- Weights: $\alpha=0.5$, $\beta=0.25$, $\gamma=0.25$
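The weighted combination above is straightforward to write down. The sketch below also includes one plausible form of the distillation term: a temperature-softened KL divergence between teacher and student distributions. That specific form, the temperature of 2.0, and the $T^2$ scaling are assumptions for illustration, since the solution does not spell out $\mathcal{L}_{\text{distill}}$; the function names are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax with max subtraction for stability."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distill_kl(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2 (assumed form)."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


def triple_loss(l_distill, l_mlm, l_cosine, alpha=0.5, beta=0.25, gamma=0.25):
    """Weighted sum of the three loss terms with the weights given above."""
    return alpha * l_distill + beta * l_mlm + gamma * l_cosine


l_distill = distill_kl([1.0, 2.0, 3.0], [0.5, 2.5, 3.0])
print(triple_loss(l_distill, l_mlm=2.1, l_cosine=0.3))
```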
Results on GLUE:
| Model | Params | GLUE Score | Speed |
|---|---|---|---|
| BERT-base | 110M | 84.5 | 1.0x |
| DistilBERT | 66M | 82.8 | 1.6x |
Compression Analysis:
- Parameters: 40\% reduction (110M $\to$ 66M)
- Inference speed: 60\% faster
- Accuracy: 98\% of teacher performance (2\% drop)
- Memory: 40\% less
Key Insights:
- Distillation preserves 98\% of teacher's knowledge
- Triple loss (distill + MLM + cosine) crucial for quality
- 6 layers sufficient for most understanding tasks
- Excellent trade-off for production deployment
When to use DistilBERT:
- Latency-sensitive applications (<50ms)
- Resource-constrained environments
- Mobile/edge deployment
- High-throughput serving
Common Bugs Found:
Bug 1 (Architecture): Missing dropout in attention
Symptom: Model overfits quickly, poor generalization
Fix:
# Before (broken)
attn_weights = F.softmax(scores, dim=-1)
output = torch.matmul(attn_weights, V)
# After (fixed)
attn_weights = F.softmax(scores, dim=-1)
attn_weights = F.dropout(attn_weights, p=0.1, training=self.training)
output = torch.matmul(attn_weights, V)
Impact: Validation accuracy improves from 72\% to 84\%
Bug 2 (Training): Learning rate too high
Symptom: Loss oscillates, doesn't converge
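Fix:
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup
# Before (broken)
optimizer = AdamW(model.parameters(), lr=1e-3)  # Too high!
# After (fixed)
optimizer = AdamW(model.parameters(), lr=5e-5)  # Appropriate for BERT fine-tuning
# num_training_steps is required; total_steps is assumed to be len(train_dataloader) * num_epochs
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=1000, num_training_steps=total_steps)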
Impact: Loss converges smoothly, final accuracy 84\% $\to$ 87\%
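To make the effect of warmup concrete, the learning-rate multiplier of a linear warmup followed by linear decay can be written out by hand. The sketch below is plain Python and only illustrates the shape of the schedule; the function name `linear_warmup_decay` and the total of 10,000 training steps are assumptions chosen for the example.

```python
def linear_warmup_decay(step, warmup_steps, total_steps):
    """LR multiplier: ramps 0 -> 1 during warmup, then decays linearly to 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    remaining = total_steps - step
    return max(0.0, remaining / max(1, total_steps - warmup_steps))

base_lr = 5e-5
warmup_steps, total_steps = 1000, 10000  # total_steps is a hypothetical value

for step in (0, 500, 1000, 5500, 10000):
    lr = base_lr * linear_warmup_decay(step, warmup_steps, total_steps)
    print(f"step {step:>5}: lr = {lr:.2e}")
```

The learning rate starts at zero, reaches its peak at the end of warmup, and then shrinks steadily, which avoids large, destabilizing updates in the first steps while the optimizer statistics are still noisy.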
Bug 3 (Data): Incorrect padding token handling
Symptom: Model attends to padding, poor performance on variable-length sequences
Fix:
# Before (broken)
attn_scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
attn_weights = F.softmax(attn_scores, dim=-1)
# After (fixed)
attn_scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
# Mask padding tokens. Assumes attention_mask has shape (batch, seq) with 0 at
# padding; it is broadcast over heads and query positions.
key_mask = attention_mask[:, None, None, :]
attn_scores = attn_scores.masked_fill(key_mask == 0, float('-inf'))
attn_weights = F.softmax(attn_scores, dim=-1)
Impact: Accuracy on variable-length sequences improves from 79\% to 87\%
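One way to confirm that a padding fix works is to check that padded key positions receive exactly zero attention weight. The sketch below assumes Q, K, V of shape (batch, heads, seq, d_k) and a (batch, seq) mask with 1 for real tokens and 0 for padding; the function name `masked_attention` is hypothetical.

```python
import math
import torch
import torch.nn.functional as F

def masked_attention(Q, K, V, attention_mask):
    """Scaled dot-product attention that ignores padded key positions."""
    d_k = Q.size(-1)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    # (batch, seq) -> (batch, 1, 1, seq) so the mask broadcasts over
    # heads and query positions.
    key_mask = attention_mask[:, None, None, :]
    scores = scores.masked_fill(key_mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)
    return torch.matmul(weights, V), weights

torch.manual_seed(0)
Q = K = V = torch.randn(2, 4, 5, 8)
# First sequence has 3 real tokens and 2 padding tokens; second has no padding.
mask = torch.tensor([[1, 1, 1, 0, 0],
                     [1, 1, 1, 1, 1]])
output, weights = masked_attention(Q, K, V, mask)

print(weights[0, 0, 0])  # last two entries are zero for the padded sequence
```

If a sequence consisted entirely of padding, every score in a row would be $-\infty$ and the softmax would produce NaN, so fully padded rows need separate handling.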
Final Results:
| Version | Accuracy |
|---|---|
| Original (broken) | 72\% |
| After Bug 1 fix | 84\% |
| After Bug 2 fix | 87\% |
| After Bug 3 fix | 87\% (robust) |
Debugging Lessons:
- Include dropout on attention weights during training
- Use appropriate learning rates (around 5e-5 for BERT-scale fine-tuning)
- Properly mask padding tokens
- Test on variable-length sequences
- Monitor both training and validation metrics
Deployment Pipeline:
Step 1: Fine-tune on Classification
- Task: Sentiment analysis (binary classification)
- Training: 10k examples, 3 epochs
- Validation accuracy: 92.3\%
Step 2: Quantize to INT8
- Method: Dynamic quantization
- Model size: 438 MB $\to$ 110 MB (75\% reduction)
- Accuracy: 92.3\% $\to$ 91.8\% (0.5 percentage point drop)
- Inference speed: 2.4x faster
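Dynamic quantization can be sketched with PyTorch's `torch.quantization.quantize_dynamic`, which stores the weights of the selected layer types as INT8 and quantizes activations on the fly. The small two-layer network below is a hypothetical stand-in for the fine-tuned classifier, so its sizes will not match the figures reported above.

```python
import io
import torch
import torch.nn as nn

def state_dict_size_mb(model):
    """Serialized size of the model's weights in megabytes."""
    buffer = io.BytesIO()
    torch.save(model.state_dict(), buffer)
    return buffer.getbuffer().nbytes / 1e6

# Hypothetical stand-in for the fine-tuned model (not the actual BERT classifier)
model = nn.Sequential(
    nn.Linear(768, 3072),
    nn.ReLU(),
    nn.Linear(3072, 2),
).eval()

# Quantize only nn.Linear layers; other modules stay in FP32.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

fp32_mb = state_dict_size_mb(model)
int8_mb = state_dict_size_mb(quantized)
print(f"FP32: {fp32_mb:.2f} MB, INT8: {int8_mb:.2f} MB")
```

Because only the linear layers are converted, the overall reduction on a real model depends on how much of its size sits in embeddings and other unquantized parameters. Accuracy should be re-measured on the validation set after quantization.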
Step 3: Export to ONNX
import torch.onnx
# Export model
model.eval()  # disable dropout so the exported graph is deterministic
dummy_input = torch.randint(0, 1000, (1, 128))
torch.onnx.export(
    model,
    dummy_input,
    "bert_sentiment.onnx",
    input_names=['input_ids'],
    output_names=['logits'],
    dynamic_axes={
        'input_ids': {0: 'batch', 1: 'sequence'},
        'logits': {0: 'batch'},  # let the output batch dimension vary too
    },
)
# Verify with ONNX Runtime
import onnxruntime as ort
session = ort.InferenceSession("bert_sentiment.onnx")
# Result: 1.3x additional speedup
Step 4: Create REST API
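import numpy as np
import onnxruntime as ort
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoTokenizer

app = FastAPI()
session = ort.InferenceSession("bert_sentiment.onnx")
# Assumption: the tokenizer must match the fine-tuned checkpoint;
# "bert-base-uncased" is a placeholder name.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

class TextInput(BaseModel):
    text: str

@app.post("/predict")
async def predict(payload: TextInput):
    # Tokenize
    tokens = tokenizer.encode(payload.text, max_length=128, truncation=True)
    # Inference: ONNX Runtime expects an int64 array of shape (batch, sequence)
    input_ids = np.asarray([tokens], dtype=np.int64)
    logits = session.run(None, {"input_ids": input_ids})[0][0]
    # Convert logits to probabilities with a numerically stable softmax
    exp = np.exp(logits - logits.max())
    probs = exp / exp.sum()
    # Predict
    prediction = "positive" if probs[1] > probs[0] else "negative"
    confidence = float(probs.max())
    return {"prediction": prediction, "confidence": confidence}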
Step 5: Load Test and Optimize
Initial Performance:
- Latency: 45ms (p50), 78ms (p99)
- Throughput: 22 requests/second
Optimizations Applied:
- Dynamic batching (batch size 8): 3.2x throughput
- Connection pooling: 1.2x throughput
- Async processing: 1.5x throughput
Final Performance:
- Latency: 38ms (p50), 62ms (p99)
- Throughput: 127 requests/second
- 5.8x improvement over baseline
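Dynamic batching accounts for the largest share of the throughput gain, so it is worth seeing its core loop. The sketch below is a simplified illustration using only `asyncio`: requests are queued, and a worker groups up to 8 of them (or whatever arrives within a short wait window) into one inference call. The class name `DynamicBatcher`, the 5 ms wait, and the stand-in `run_batch` function are all assumptions.

```python
import asyncio

class DynamicBatcher:
    """Groups queued requests into batches of up to max_batch_size."""

    def __init__(self, run_batch, max_batch_size=8, max_wait_s=0.005):
        self.run_batch = run_batch  # callable: list of inputs -> list of outputs
        self.max_batch_size = max_batch_size
        self.max_wait_s = max_wait_s
        self.queue = asyncio.Queue()

    async def submit(self, item):
        """Enqueue one request and wait for its individual result."""
        future = asyncio.get_running_loop().create_future()
        await self.queue.put((item, future))
        return await future

    async def worker(self):
        loop = asyncio.get_running_loop()
        while True:
            batch = [await self.queue.get()]  # block until the first request
            deadline = loop.time() + self.max_wait_s
            while len(batch) < self.max_batch_size:
                timeout = deadline - loop.time()
                if timeout <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), timeout))
                except asyncio.TimeoutError:
                    break
            outputs = self.run_batch([item for item, _ in batch])
            for (_, future), output in zip(batch, outputs):
                future.set_result(output)

async def demo():
    batch_sizes = []

    def run_batch(items):
        batch_sizes.append(len(items))
        return [len(text) for text in items]  # stand-in for model inference

    batcher = DynamicBatcher(run_batch)
    task = asyncio.create_task(batcher.worker())
    results = await asyncio.gather(*(batcher.submit("x" * n) for n in range(1, 11)))
    task.cancel()
    return results, batch_sizes

results, batch_sizes = asyncio.run(demo())
print(results, batch_sizes)
```

The wait window trades latency for throughput: a longer window fills batches more completely but adds delay to the first request in each batch. In a real server, `run_batch` would pad the tokenized inputs to a common length and call the ONNX session once per batch.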
Production Checklist:
- $\checkmark$ Model quantized and optimized
- $\checkmark$ ONNX export for cross-platform compatibility
- $\checkmark$ REST API with proper error handling
- $\checkmark$ Load tested and optimized
- $\checkmark$ Monitoring and logging configured
- $\checkmark$ Auto-scaling based on load
- $\checkmark$ Health checks and graceful shutdown
Deployment Architecture:
Load Balancer
  |
  +-- API Server 1 (ONNX Runtime)
  +-- API Server 2 (ONNX Runtime)
  +-- API Server 3 (ONNX Runtime)
  |
Monitoring (Prometheus + Grafana)
Key Metrics to Monitor:
- Request latency (p50, p95, p99)
- Throughput (requests/second)
- Error rate
- CPU/GPU utilization
- Memory usage
- Model accuracy (via A/B testing)
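Latency percentiles are reported rather than averages because a few slow requests can hide behind a healthy-looking mean. The sketch below computes nearest-rank percentiles in plain Python; the sample latencies are synthetic and the helper name `percentile` is hypothetical.

```python
import math

def percentile(samples, q):
    """Nearest-rank percentile: smallest value with at least q% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]

# Synthetic data: 98 requests between 30 and 39 ms, plus two slow outliers
latencies_ms = [30 + (i % 10) for i in range(98)] + [120, 250]

report = {f"p{q}": percentile(latencies_ms, q) for q in (50, 95, 99)}
mean_ms = sum(latencies_ms) / len(latencies_ms)
print(report, f"mean = {mean_ms:.1f} ms")
```

In this data the median and p95 stay under 40 ms while p99 jumps to the outlier range, which is the kind of tail behavior that an average alone would not reveal.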
Cost Analysis:
- Hardware: 3x T4 GPUs (\$0.35/hour each)
- Total: \$1.05/hour = \$756/month
- Capacity: 127 req/s × 3 = 381 req/s
- Cost per 1M requests: about \$0.77 at full utilization (\$1.05 per hour / 1.37M requests per hour)
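As a check on the arithmetic, the cost per million requests can be recomputed from the hourly price and sustained throughput. The sketch below assumes a 30-day month and treats utilization as a hypothetical parameter. With the figures above, full utilization works out to roughly \$0.77 per million requests, and traffic below capacity raises that figure proportionally.

```python
def cost_per_million_requests(gpu_count, price_per_gpu_hour,
                              req_per_sec_per_gpu, utilization=1.0):
    """Serving cost in dollars per one million requests."""
    hourly_cost = gpu_count * price_per_gpu_hour
    requests_per_hour = gpu_count * req_per_sec_per_gpu * 3600 * utilization
    return hourly_cost / requests_per_hour * 1e6

hourly = 3 * 0.35                  # three T4 GPUs
monthly = hourly * 24 * 30         # assumes a 30-day month
capacity = 127 * 3                 # requests per second across all servers

full = cost_per_million_requests(3, 0.35, 127)
half = cost_per_million_requests(3, 0.35, 127, utilization=0.5)

print(f"hourly ${hourly:.2f}, monthly ${monthly:.0f}, capacity {capacity} req/s")
print(f"per 1M requests: ${full:.2f} at 100% utilization, ${half:.2f} at 50%")
```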
Production Best Practices:
- Quantize for inference when the accuracy cost is acceptable (typically 2-4x speedup)
- Use ONNX for deployment (cross-platform, optimized)
- Implement dynamic batching (3-5x throughput)
- Monitor latency percentiles (not just average)
- Set up auto-scaling for variable load
- Use health checks and graceful shutdown
- Implement request timeouts and retries
- Log predictions for model monitoring
Success Criteria Met:
- $\checkmark$ <50ms p50 latency (38ms measured; p99 is 62ms)
- $\checkmark$ >100 requests/second throughput (127 measured)
- $\checkmark$ <1\% accuracy degradation
- $\checkmark$ 75\% model size reduction
- $\checkmark$ Production-ready API
- $\checkmark$ Comprehensive monitoring