Best Practices and Production Case Studies

Chapter Overview

This final chapter synthesizes practical wisdom from deploying transformers at scale. We cover debugging strategies, hyperparameter tuning, common pitfalls, and real-world case studies from industry deployments of BERT, GPT, and other transformer models.

Learning Objectives

  1. Apply systematic debugging for transformer training
  2. Tune hyperparameters effectively
  3. Avoid common pitfalls in architecture and training
  4. Learn from real-world deployment case studies
  5. Design robust production systems
  6. Plan future-proof transformer architectures

Model Selection

Choosing the right transformer architecture is a critical decision that impacts both performance and resource requirements. This section provides a systematic framework for selecting among the major transformer variants based on task requirements, data availability, and computational constraints.

Architecture Selection Framework

The choice between BERT, GPT, T5, and other architectures depends fundamentally on the nature of your task. BERT and its variants excel at understanding tasks where bidirectional context is crucial, such as classification, named entity recognition, and question answering. The bidirectional attention mechanism allows BERT to build rich representations by attending to both past and future tokens simultaneously, making it particularly effective when the entire input is available at once.

GPT models, in contrast, are designed for generation tasks where autoregressive decoding is required. The unidirectional attention pattern makes GPT natural for text generation, code completion, and any task where outputs must be produced sequentially. While GPT can be adapted for understanding tasks through careful prompting, this is generally less efficient than using a bidirectional model designed for the purpose.

T5 represents a unified approach that frames all tasks as sequence-to-sequence problems. This architecture provides flexibility across both understanding and generation tasks, making it an excellent choice when you need a single model to handle diverse task types. The encoder-decoder structure allows T5 to leverage bidirectional attention in the encoder while maintaining autoregressive generation in the decoder.

Model Size Selection

Selecting the appropriate model size requires balancing performance requirements against computational constraints. The relationship between model size and performance generally follows a power law, with diminishing returns as models grow larger. For most practical applications, the base-sized models provide an excellent balance between capability and efficiency.

BERT-base with 110 million parameters serves as the standard choice for most understanding tasks. It provides strong performance across a wide range of benchmarks while remaining tractable for fine-tuning on a single GPU. BERT-large with 340 million parameters offers modest improvements, typically 1-3 percentage points on downstream tasks, but requires significantly more memory and computation. The large variant is justified primarily when you need to extract maximum performance and have sufficient computational resources.

For GPT models, the size selection depends heavily on the complexity of the generation task. GPT-2 small (117M parameters) suffices for simple completion tasks and domain-specific generation after fine-tuning. GPT-2 medium (345M parameters) provides better coherence for longer generations and more complex tasks. The larger variants (GPT-2 large at 774M and GPT-2 XL at 1.5B parameters) are necessary primarily when working with limited task-specific data, as their stronger pre-trained representations enable better few-shot performance.

Pre-trained versus Training from Scratch

The decision to use pre-trained models versus training from scratch depends on data availability, domain specificity, and computational budget. In nearly all cases, starting from pre-trained weights is the correct choice. Pre-training on large corpora provides general language understanding that transfers effectively to downstream tasks, and the computational cost of pre-training from scratch is prohibitive for most organizations.

Training from scratch becomes viable only in specific circumstances. When working with highly specialized domains where general language models perform poorly, such as medical text with extensive jargon or programming languages not well-represented in pre-training data, domain-specific pre-training may be justified. However, even in these cases, continued pre-training from existing checkpoints is typically more efficient than starting from random initialization.

The computational cost difference is substantial. Pre-training BERT-base from scratch requires approximately 64 TPU days or equivalent GPU time, representing tens of thousands of dollars in compute costs. Fine-tuning the same model on a downstream task typically requires only hours on a single GPU, costing tens of dollars. This thousand-fold difference in cost makes pre-trained models the default choice for nearly all applications.

Cost-Benefit Analysis

A systematic cost-benefit analysis should consider both direct computational costs and opportunity costs. For a typical classification task with 10,000 labeled examples, fine-tuning BERT-base requires approximately 2-4 hours on a single V100 GPU, costing roughly \$6-12 in cloud compute at about \$3 per GPU-hour. This investment typically yields performance improvements of 5-15 percentage points over traditional methods like logistic regression on TF-IDF features.

Training a smaller model from scratch on the same data might require 8-16 hours and cost \$24-48 at the same hourly rate, while likely achieving inferior performance due to the lack of pre-trained representations. The pre-trained approach thus provides both better performance and lower cost, a rare combination that explains the dominance of transfer learning in modern NLP.

For generation tasks, the cost analysis shifts somewhat. Fine-tuning GPT-2 medium on a specific generation task requires 4-8 hours on a V100, costing roughly \$12-24 at about \$3 per GPU-hour. However, inference costs become more significant for generation, as producing each token requires a full forward pass through the model. For applications requiring high-throughput generation, the ongoing inference costs may exceed training costs within weeks or months of deployment, making inference optimization critical.

Training Best Practices

Effective training of transformer models requires careful attention to hyperparameter selection, monitoring, and debugging. This section provides comprehensive guidance on the key decisions that impact training success.

Learning Rate Selection

Scenario | Range | Notes
BERT fine-tuning | $1$--$5 \times 10^{-5}$ | Lower end for small datasets
GPT fine-tuning | $2 \times 10^{-5}$--$10^{-4}$ | Autoregressive is more stable
Pre-training from scratch | $10^{-4}$--$6 \times 10^{-4}$ | Requires longer warmup

Rules of thumb: Scale the learning rate roughly linearly with batch size (use LAMB for very large batches). Warm up for 5--10\% of steps when fine-tuning and for 10K--50K steps when pre-training.
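The two rules of thumb above can be sketched in a few lines. This is a minimal illustration rather than a prescribed schedule: the names `scaled_lr` and `warmup_linear_decay` are hypothetical, and the linear decay after warmup is an assumption, since the text specifies only the warmup phase.

```python
def scaled_lr(base_lr: float, base_batch: int, batch: int) -> float:
    """Linear scaling rule: the learning rate grows in proportion to batch size."""
    return base_lr * batch / base_batch


def warmup_linear_decay(step: int, total_steps: int, warmup_frac: float = 0.1) -> float:
    """LR multiplier in [0, 1]: linear warmup, then linear decay to zero.

    The decay shape is an assumption; only the warmup comes from the text.
    """
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return step / warmup_steps
    remaining = total_steps - step
    return max(0.0, remaining / max(1, total_steps - warmup_steps))


# A fine-tuning LR of 2e-5 tuned at batch 32, moved to batch 128
peak_lr = scaled_lr(2e-5, base_batch=32, batch=128)
schedule = [peak_lr * warmup_linear_decay(s, total_steps=1000) for s in range(1000)]
```

With a 10\% warmup over 1,000 steps, the multiplier reaches 1.0 at step 100 and then falls linearly. Linear scaling is only a starting point; the scaled value should still be checked against the ranges in the table.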

Batch Size Selection

Scenario | Typical Range | Notes
Fine-tuning (single GPU) | 16--32 | Gradient accumulation for larger effective batch
Fine-tuning (multi-GPU) | 32--256 | Scale LR accordingly
Pre-training | 256--4096 | Requires LR warmup + LAMB

If memory-limited, use gradient accumulation: $B_{\text{eff}} = B_{\text{micro}} \times N_{\text{accum}}$ (see Chapter~[ref] for implementation).
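As a rough sketch of the formula above, the loop below accumulates gradients over several micro-batches before each optimizer step. The tiny linear model and random data are stand-ins chosen only to keep the example self-contained; they are not from the text.

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(8, 2)                      # stand-in for a transformer
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

micro_batch, accum_steps = 4, 8              # effective batch = 4 * 8 = 32
data = [(torch.randn(micro_batch, 8), torch.randint(0, 2, (micro_batch,)))
        for _ in range(16)]                  # 16 micro-batches of toy data

optimizer_steps = 0
optimizer.zero_grad()
for i, (x, y) in enumerate(data, start=1):
    # Divide by accum_steps so the summed gradients average over the
    # effective batch instead of growing with the number of micro-batches.
    loss = loss_fn(model(x), y) / accum_steps
    loss.backward()                          # gradients accumulate in .grad
    if i % accum_steps == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        optimizer.zero_grad()
        optimizer_steps += 1
```

Sixteen micro-batches with eight accumulation steps give two optimizer updates, so learning-rate schedules should count optimizer steps rather than micro-batches.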

Checkpointing and Monitoring Strategy

Effective monitoring is essential for detecting problems early and understanding training dynamics. At minimum, you should track training loss, validation loss, and task-specific metrics at regular intervals. Logging every 100-500 steps provides sufficient granularity to detect issues without generating excessive data.

Checkpointing strategy depends on training duration and stability. For short fine-tuning runs of a few hours, saving checkpoints every epoch is sufficient. For longer training runs, save checkpoints every few thousand steps to protect against hardware failures and enable recovery from divergence. Always keep at least the three most recent checkpoints, as the most recent checkpoint may be corrupted or represent a point after training has diverged.
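One way to keep only the most recent checkpoints is a small pruning helper, sketched below. The file naming scheme (`step_00001000.pt`) and the function name `prune_checkpoints` are hypothetical; adapt them to whatever your training script writes.

```python
from __future__ import annotations

import tempfile
from pathlib import Path


def prune_checkpoints(ckpt_dir: Path, keep: int = 3) -> list[Path]:
    """Delete all but the `keep` highest-numbered checkpoints; return the survivors."""
    if keep < 1:
        raise ValueError("keep must be at least 1")
    # Assumes hypothetical names like 'step_00001000.pt'; sort by step number.
    ckpts = sorted(ckpt_dir.glob("step_*.pt"),
                   key=lambda p: int(p.stem.split("_")[1]))
    for old in ckpts[:-keep]:
        old.unlink()
    return ckpts[-keep:]


# Demonstration on empty placeholder files in a temporary directory
with tempfile.TemporaryDirectory() as d:
    root = Path(d)
    for step in (1000, 2000, 3000, 4000, 5000):
        (root / f"step_{step:08d}.pt").touch()
    kept = [p.name for p in prune_checkpoints(root)]
    remaining = sorted(p.name for p in root.iterdir())
```

Calling the helper right after each successful save keeps disk usage bounded while still leaving older checkpoints to fall back on if the newest one is corrupted.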

Beyond basic loss monitoring, tracking gradient norms provides early warning of training instability. Gradient norms should remain relatively stable throughout training, typically in the range of 0.1 to 10.0. Sudden spikes in gradient norm often precede loss divergence and indicate that gradient clipping or learning rate reduction may be necessary. Similarly, monitoring the ratio of update magnitude to parameter magnitude helps ensure that learning rates are appropriate.
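A simple spike detector along these lines compares each gradient norm with the recent median. This is a sketch under stated assumptions: the class name `GradNormMonitor`, the window of 50 steps, and the factor of 5 are illustrative choices, not values from the text.

```python
from collections import deque
from statistics import median


class GradNormMonitor:
    """Flags a step whose gradient norm far exceeds the recent median."""

    def __init__(self, window: int = 50, spike_factor: float = 5.0):
        self.history = deque(maxlen=window)
        self.spike_factor = spike_factor      # assumed threshold; tune per run

    def update(self, grad_norm: float) -> bool:
        """Record a norm and return True if it looks like a spike."""
        is_spike = (len(self.history) >= 10
                    and grad_norm > self.spike_factor * median(self.history))
        if not is_spike:
            self.history.append(grad_norm)    # keep spikes out of the baseline
        return is_spike


monitor = GradNormMonitor()
norms = [1.0 + 0.01 * i for i in range(20)] + [12.0, 1.2]
flags = [monitor.update(n) for n in norms]    # only the 12.0 step is flagged
```

In a PyTorch training loop, the value passed to `update` could be the total norm returned by `torch.nn.utils.clip_grad_norm_`, which reports the norm measured before clipping.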

Memory Management

Memory is often the primary constraint in transformer training. Rather than repeating the detailed memory analysis from Chapter~21, this section provides a quick-reference decision guide.

Out-of-Memory Decision Checklist

When encountering memory errors, apply these steps in order:

  1. Reduce sequence length (if task permits). Attention memory scales quadratically with sequence length, so truncating from 512 to 256 tokens cuts attention memory by roughly 4×. Many classification tasks work well at 128 tokens.
  2. Enable mixed precision (torch.cuda.amp). Halves activation and gradient memory with 2--3× speedup on tensor-core GPUs. Minimal code changes required. See Chapter~21 for implementation.
  3. Enable gradient checkpointing. Trades 20--30\% additional compute time for 40--50\% activation memory reduction. Apply via torch.utils.checkpoint.
  4. Reduce batch size and use gradient accumulation. Maintain effective batch size $B_{\text{eff}} = B_{\text{micro}} \times N_{\text{accum}}$ while fitting in memory. Linear memory savings.
  5. Consider model parallelism. When the model itself exceeds single-GPU memory, use pipeline parallelism (split by layers) or tensor parallelism (split within layers). Frameworks: DeepSpeed, Megatron-LM. See Chapter~22 for multi-GPU strategies.

Memory Estimation Rule of Thumb

Total training memory (bytes) $\approx$ (Parameters $\times$ 16) + (Batch $\times$ SeqLen $\times$ Hidden $\times$ Layers $\times$ 40); divide by $10^9$ for GB. The first term covers parameters, gradients, and optimizer states; the second covers activations. For BERT-base (110M params, hidden size 768, 12 layers, batch 32, seq 512): about 1.8 GB + 6.0 GB $\approx$ 8 GB.
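The rule of thumb above translates directly into a small helper. The function name `estimate_training_memory_gb` is hypothetical; the hidden size of 768 and the 12 layers are the standard BERT-base dimensions. Treat the result as a planning estimate, since real usage depends on the framework and attention implementation.

```python
def estimate_training_memory_gb(params: float, batch: int, seq_len: int,
                                hidden: int, layers: int) -> float:
    """Rule-of-thumb training memory in GB (10^9 bytes)."""
    state_bytes = params * 16                                  # weights, grads, optimizer states
    activation_bytes = batch * seq_len * hidden * layers * 40  # stored activations
    return (state_bytes + activation_bytes) / 1e9


# BERT-base: 110M parameters, hidden size 768, 12 layers
bert_base_gb = estimate_training_memory_gb(110e6, batch=32, seq_len=512,
                                           hidden=768, layers=12)
half_seq_gb = estimate_training_memory_gb(110e6, batch=32, seq_len=256,
                                          hidden=768, layers=12)
print(f"{bert_base_gb:.1f} GB")  # prints "7.8 GB"
```

Note that this formula is linear in sequence length, so it does not capture the quadratic growth of attention score memory at long sequences.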

Debugging Transformers

Systematic Debugging Workflow

Level 1: Data sanity checks

  1. Visualize input samples
  2. Verify labels are correct
  3. Check for data leakage
  4. Validate preprocessing

Level 2: Model sanity checks

  1. Overfit single batch (should reach near-zero loss)
  2. Check gradient flow (no dead neurons)
  3. Verify shapes at each layer
  4. Test with minimal model first

Level 3: Training dynamics

  1. Monitor loss curves (training + validation)
  2. Track gradient norms
  3. Visualize attention weights
  4. Check learning rate schedule

Example:

Symptom: Loss not decreasing

Diagnose:

Symptom: NaN loss

Diagnose:

Gradient Analysis

Monitor per-layer gradient norms throughout training (see Chapter~[ref] for PyTorch profiling tools). Healthy gradients have norms between $10^{-4}$ and $10^{1}$, are similar across layers, and are non-zero for all layers. Sudden spikes precede divergence; vanishing gradients indicate dead layers.
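A minimal way to collect per-layer gradient norms in PyTorch is sketched below. The helper names `layer_grad_norms` and `unhealthy_layers` are hypothetical, the small `nn.Sequential` model is a stand-in, and the default bounds are the healthy range quoted above.

```python
from __future__ import annotations

import torch
from torch import nn


def layer_grad_norms(model: nn.Module) -> dict[str, float]:
    """L2 norm of the gradient of each parameter tensor, keyed by parameter name."""
    return {name: p.grad.norm().item()
            for name, p in model.named_parameters() if p.grad is not None}


def unhealthy_layers(norms: dict[str, float],
                     low: float = 1e-4, high: float = 10.0) -> list[str]:
    """Names whose gradient norm falls outside [low, high] (NaN is also flagged)."""
    return [name for name, n in norms.items() if not (low <= n <= high)]


torch.manual_seed(0)
model = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 2))
loss = model(torch.randn(8, 16)).pow(2).mean()
loss.backward()                       # call the helpers after backward()
norms = layer_grad_norms(model)
suspects = unhealthy_layers(norms)
```

Logging `norms` every few hundred steps, rather than every step, keeps the overhead small because each `.item()` call forces a device synchronization.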

Common Training Issues: Quick Reference

Symptom | Likely Cause | Fix
Out of memory | Batch/sequence too large | See memory checklist (Section~[ref])
Loss not decreasing | Learning rate too low | Increase LR 3--10$\times$; verify overfit on 1 batch
Loss diverges / NaN | LR too high or no clipping | Reduce LR; clip gradients to norm 1.0
Slow training | Low GPU utilization | Increase batch size; add DataLoader workers
Train/val gap growing | Overfitting | More dropout; data augmentation; smaller model

Inference Optimization

Inference costs often exceed training costs over a model's lifetime. This section provides decision tables for choosing optimization strategies; see Chapters~21 and~22 for detailed implementations.

Optimizing for Latency

Technique | Speedup | Accuracy Cost | Effort
FP16 inference | 1.5--2$\times$ | $<$0.1\% | Minimal
INT8 quantization (PTQ) | 2--4$\times$ | 0.5--2\% | Low (calibration)
INT8 quantization (QAT) | 2--4$\times$ | $<$0.5\% | Medium (retraining)
KV caching (autoregressive) | 5--10$\times$ | None | Low
TorchScript / torch.compile | 1.2--1.5$\times$ | None | Low
TensorRT compilation | 2--5$\times$ | $<$0.5\% | Medium

Optimizing for Throughput

Hardware Selection Summary

Scenario | Recommended Hardware | Rationale
$<$10 req/s, small model | CPU | Lower cost, sufficient throughput
10--100 req/s | T4 GPU + TensorRT | Good latency/cost balance
$>$100 req/s | A10/A100 GPU | Maximum throughput
Edge / mobile | INT8/INT4 on device | Memory and power constrained

See Chapter~22 for detailed hardware analysis and cost breakdowns.

Cost Optimization

Understanding and optimizing costs is essential for sustainable deployment of transformer models. This section provides detailed analysis of training and inference costs with concrete examples.

Training Cost Analysis

Training costs depend on model size, dataset size, and hardware selection. For BERT-base pre-training on 16 GB of text, the original paper reports using 16 TPU chips for 4 days, or approximately 64 TPU-days (about 1,500 TPU-hours). At roughly \$8 per TPU hour, this amounts to approximately \$12,000 for pre-training. Using equivalent GPU resources (64 V100 GPUs for 4 days, about 6,100 GPU-hours) would cost approximately \$18,000 at on-demand rates of about \$3 per GPU-hour.

Fine-tuning costs are much more modest. Training BERT-base on a typical classification task with 10,000 examples requires 2-4 hours on a single V100 GPU. At AWS on-demand pricing of approximately \$3 per hour for a p3.2xlarge instance, this amounts to \$6-12 per fine-tuning run. Even with extensive hyperparameter search involving 20-30 runs, total costs remain under \$300.

Cloud versus on-premise costs depend heavily on utilization. For continuous training workloads, purchasing GPUs becomes cost-effective after 12-18 months of use. A DGX A100 system costs approximately \$200,000 but provides compute equivalent to \$15,000 per month at cloud on-demand rates. For intermittent workloads or experimentation, cloud computing is more economical due to the flexibility to scale up and down.
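The break-even point can be checked with simple arithmetic. In the sketch below, the function name `breakeven_months` is hypothetical, and `monthly_opex` (power, cooling, staff) is an assumption that the text does not quantify; the \$3,000 figure is purely illustrative.

```python
def breakeven_months(purchase_cost: float, monthly_cloud_cost: float,
                     monthly_opex: float = 0.0) -> float:
    """Months of continuous use after which buying hardware beats renting it."""
    savings = monthly_cloud_cost - monthly_opex
    if savings <= 0:
        return float("inf")           # owning never pays off
    return purchase_cost / savings


# Figures from the text: $200,000 system vs. $15,000 per month of cloud compute
no_opex = breakeven_months(200_000, 15_000)                        # about 13.3 months
with_opex = breakeven_months(200_000, 15_000, monthly_opex=3_000)  # about 16.7 months
```

Both results fall inside the 12-18 month range, and the calculation assumes full utilization; at 50\% utilization the effective cloud spend halves and the break-even time roughly doubles.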

Spot instances provide substantial savings for training workloads that can tolerate interruptions. AWS spot instances for p3.2xlarge typically cost 50-70\% less than on-demand rates, reducing fine-tuning costs to \$2-4 per run. Implementing checkpointing and automatic restart logic allows training to resume after spot instance interruptions, making this an attractive option for cost-conscious training.

Training Time Estimation

Estimating training time helps with planning and cost prediction. For fine-tuning, a useful rule of thumb is that BERT-base processes approximately 100-150 examples per second on a V100 GPU with batch size 32 and sequence length 128. For a dataset of 100,000 examples trained for 3 epochs (300,000 examples processed), this translates to 2,000-3,000 seconds, or roughly 35-50 minutes of training time.
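The same estimate can be written as a short calculation. The function name `finetune_estimate` is hypothetical, and the default rate of \$3 per GPU-hour is the on-demand V100 figure quoted earlier in this chapter.

```python
def finetune_estimate(num_examples: int, epochs: int, examples_per_sec: float,
                      usd_per_hour: float = 3.0) -> tuple:
    """Return (seconds, dollars) for a fine-tuning run at a measured throughput."""
    seconds = num_examples * epochs / examples_per_sec
    cost = seconds / 3600 * usd_per_hour
    return seconds, cost


# 100,000 examples, 3 epochs, at the slow and fast ends of the throughput range
slow_seconds, slow_cost = finetune_estimate(100_000, 3, examples_per_sec=100)
fast_seconds, fast_cost = finetune_estimate(100_000, 3, examples_per_sec=150)
```

The slow case comes to 3,000 seconds and about \$2.50; the estimate excludes evaluation passes, checkpoint writes, and data loading stalls, so a safety margin is sensible.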

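Pre-training time scales with dataset size and model size. BERT-base pre-training on 16 GB of text runs for approximately 1 million training steps with batch size 256. The tokens processed far exceed the corpus size: the corpus holds roughly 4 billion tokens, but 1 million steps of 256 sequences with up to 512 tokens each is on the order of 100 billion tokens, that is, dozens of passes over the data. Training time is the total number of tokens processed divided by aggregate throughput. As an illustration, a single pass over 4 billion tokens at 1,000 tokens per second per V100 GPU requires 4 million GPU-seconds, or approximately 1,100 GPU-hours; with 16 GPUs, this is roughly 70 hours, or 3 days, per pass. Measure throughput on your own hardware before extrapolating, since it varies widely with sequence length, precision, and kernel efficiency.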

Larger models scale approximately linearly with parameter count for training time. GPT-2 medium with 345 million parameters takes roughly 3× longer to train than GPT-2 small with 117 million parameters, assuming the same dataset and batch size. However, larger models often benefit from larger batch sizes, which can partially offset the increased time per step.

Inference Cost Analysis

Inference costs depend on request volume, latency requirements, and model size. For a BERT-base classification service processing 1 million requests per day with average latency requirements of 100ms, a single V100 GPU can handle approximately 100 requests per second with dynamic batching, or 8.6 million requests per day. This suggests that a single GPU is sufficient, costing approximately \$200-300 per month for a cloud GPU instance.
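This capacity arithmetic can be sketched as a helper. The function name `gpus_needed` and the `peak_to_average` parameter are hypothetical additions; the text reasons about average load only, while real traffic usually has peaks.

```python
import math


def gpus_needed(requests_per_day: float, per_gpu_rps: float,
                peak_to_average: float = 1.0) -> int:
    """GPUs required to serve the daily volume at the assumed peak factor."""
    avg_rps = requests_per_day / 86_400          # seconds per day
    return max(1, math.ceil(avg_rps * peak_to_average / per_gpu_rps))


# BERT-base classification: 1M requests/day at ~100 requests/s per GPU
steady = gpus_needed(1_000_000, per_gpu_rps=100)
bursty = gpus_needed(1_000_000, per_gpu_rps=100, peak_to_average=10)

# Generation at 10-20 requests/s per GPU, using the same helper
gen_slow = gpus_needed(1_000_000, per_gpu_rps=10)
gen_fast = gpus_needed(1_000_000, per_gpu_rps=20)
```

One million requests per day averages about 11.6 requests per second, so one GPU suffices at steady load, while a tenfold peak would call for two.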

For generation tasks, costs are higher due to the sequential nature of autoregressive decoding. GPT-2 medium generating 100 tokens per request can process approximately 10-20 requests per second per GPU, depending on batch size and sequence length. For 1 million requests per day (about 12 requests per second), this requires 1-2 GPUs, costing \$400-600 per month. At that volume, roughly 3 billion generated tokens per month, self-hosted inference costs approximately \$0.13-0.20 per million tokens at full utilization; the figure rises when GPUs sit idle.

Comparing self-hosted to API costs reveals significant differences at scale. OpenAI's GPT-3.5 API costs approximately \$2 per million tokens for input and output combined. For applications processing 100 million tokens per month, this amounts to \$200 per month. Self-hosting a comparable model would require 4-8 GPUs costing \$1,600-3,200 per month, making the API more economical at this scale. However, at 1 billion tokens per month, self-hosting becomes competitive, and at 10 billion tokens per month, self-hosting is clearly more economical.
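The comparison above reduces to a break-even volume. The sketch below uses the figures from the text (\$2 per million tokens, \$1,600-3,200 per month for self-hosting) and makes the simplifying assumption that the self-hosted fleet size stays fixed as volume grows. The function names are hypothetical.

```python
def api_cost(tokens_per_month: float, usd_per_million: float = 2.0) -> float:
    """Monthly API bill for a given token volume."""
    return tokens_per_month / 1e6 * usd_per_million


def breakeven_tokens(self_host_monthly_usd: float, usd_per_million: float = 2.0) -> float:
    """Monthly token volume at which a fixed self-hosting bill equals the API bill."""
    return self_host_monthly_usd / usd_per_million * 1e6


small_volume_bill = api_cost(100e6)          # $200 per month
low_breakeven = breakeven_tokens(1_600)      # 0.8 billion tokens per month
high_breakeven = breakeven_tokens(3_200)     # 1.6 billion tokens per month
```

The break-even volume of roughly 0.8-1.6 billion tokens per month matches the statement that self-hosting becomes competitive at around 1 billion tokens per month.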

Cost Optimization Strategies

Several strategies can substantially reduce both training and inference costs. For training, using mixed precision reduces training time by 2-3×, directly reducing costs by the same factor. Gradient accumulation allows using smaller, cheaper GPU instances by simulating larger batch sizes. Spot instances reduce costs by 50-70\% for workloads that can tolerate interruptions.

For inference, quantization and distillation reduce both latency and cost. A distilled and quantized model may achieve 5-10× higher throughput than the original model, allowing a single GPU to handle the load that previously required 5-10 GPUs. This directly translates to 5-10× cost reduction. Dynamic batching improves GPU utilization from 20-30\% to 70-90\%, effectively tripling throughput without additional hardware.

Caching can dramatically reduce inference costs for applications with repeated queries. If 30\% of requests are exact duplicates, caching responses eliminates roughly 30\% of inference costs. Semantic caching using embedding similarity can extend this to near-duplicate queries, potentially serving 50-70\% of requests from cache in some applications.
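An exact-match response cache can be sketched with an ordered dictionary used as an LRU. The class name `ResponseCache`, the normalization rule (lowercase, collapsed whitespace), and the `fake_model` stand-in are all illustrative assumptions; semantic caching would replace the string key with an embedding lookup.

```python
from collections import OrderedDict


class ResponseCache:
    """Exact-match LRU cache keyed on a normalized query string."""

    def __init__(self, max_entries: int = 10_000):
        self._store = OrderedDict()
        self.max_entries = max_entries
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(query: str) -> str:
        return " ".join(query.lower().split())   # assumed normalization rule

    def get_or_compute(self, query: str, model_fn):
        key = self._key(query)
        if key in self._store:
            self.hits += 1
            self._store.move_to_end(key)          # mark as most recently used
            return self._store[key]
        self.misses += 1
        result = model_fn(query)                  # only cache misses reach the model
        self._store[key] = result
        if len(self._store) > self.max_entries:
            self._store.popitem(last=False)       # evict least recently used
        return result


calls = []

def fake_model(query: str) -> str:                # stand-in for real inference
    calls.append(query)
    return "positive"

cache = ResponseCache()
results = [cache.get_or_compute(q, fake_model)
           for q in ["Great movie", "great   movie", "Terrible"]]
```

Here the second query normalizes to the same key as the first, so the model runs twice for three requests. Caching is safe only when the model is deterministic for a given input and the cache is cleared on every model update.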

Autoscaling based on demand prevents paying for idle resources during low-traffic periods. For applications with diurnal traffic patterns, autoscaling can reduce costs by 40-60\% compared to provisioning for peak load. Kubernetes and cloud-native deployment platforms make autoscaling straightforward to implement.

Production Deployment

Production deployment involves serving infrastructure, monitoring, and safe rollout practices. Chapter~22 covers serving frameworks (TorchServe, Triton, vLLM), deployment architectures (Ray Serve, Kubernetes), and KV cache management in detail. Here we summarize the key decision points.

Deployment Checklist

Before deployment:

  1. Optimize model: quantize (INT8/FP16), export to ONNX or TensorRT, enable KV caching for generation.
  2. Benchmark under realistic conditions (expected batch sizes, sequence lengths, peak load).
  3. Set up monitoring: latency percentiles (p50, p95, p99), throughput, error rate, GPU utilization.
  4. Plan scaling: autoscaling rules, load balancing, maximum replica count.

Safe rollout: Use canary deployment (1--5\% traffic) or shadow mode (run new model alongside production without serving results) before full rollout. Monitor key metrics for several hours before increasing traffic. Maintain instant rollback capability.
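A canary split is often implemented by hashing a stable request or user identifier, so the same caller consistently sees the same model version. The sketch below is one such approach under stated assumptions; the function name `route_to_canary` and the `user-N` identifiers are hypothetical.

```python
import hashlib


def route_to_canary(request_id: str, canary_percent: float) -> bool:
    """Deterministically send a fixed share of identifiers to the canary model."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000   # buckets 0..9999
    return bucket < canary_percent * 100                  # 5.0% -> buckets 0..499


# Observed canary share over 20,000 synthetic identifiers at a 5% setting
share = sum(route_to_canary(f"user-{i}", 5.0) for i in range(20_000)) / 20_000
```

Because the routing is a pure function of the identifier, rollback only requires setting the percentage to zero, and increasing the percentage keeps existing canary users in the canary group.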

Practical Checklists

These checklists provide systematic guidance for common transformer workflows, helping ensure that critical steps are not overlooked.

Before Training Checklist

Before beginning training, verify that you have made appropriate decisions about resources and configuration. Estimate memory requirements using the formulas provided earlier, ensuring that your chosen batch size and sequence length will fit in available GPU memory with some margin for safety. Select hardware appropriate for your model size and training duration, considering the trade-offs between cost and training time.

Choose batch size and sequence length based on your task requirements and memory constraints. Remember that sequence length has a quadratic effect on memory, so reducing it provides substantial savings if your task permits. Set up monitoring and logging infrastructure before training begins, as debugging issues after the fact is much more difficult than catching them in real-time.

Estimate training time and cost using the guidelines provided earlier. This helps with planning and ensures that you have allocated sufficient budget and time for the training run. For long training runs, verify that checkpointing is configured correctly and test recovery from checkpoints before committing to the full training run.

During Training Checklist

While training is in progress, monitor loss and metrics regularly to detect issues early. Training loss should decrease steadily, though not necessarily monotonically. Validation loss should track training loss initially, with some divergence expected as training progresses. If validation loss increases while training loss decreases, you may be overfitting.

Check GPU utilization to ensure that you are using resources efficiently. Utilization should be consistently above 80\% during training. Lower utilization suggests that batch size is too small, data loading is a bottleneck, or there are inefficiencies in the training loop. Monitor memory usage to ensure you are not close to OOM errors, which can cause training to fail unexpectedly.

Save checkpoints regularly according to your checkpointing strategy. Verify that checkpoints are being saved successfully and that you can load them for recovery. Validate periodically on a held-out set to track generalization performance. The frequency of validation depends on training duration, but every few hundred steps or every epoch is typical.

Before Deployment Checklist

Before deploying a model to production, optimize it for inference using the techniques described earlier. Apply quantization if accuracy permits, as the performance benefits are substantial. Consider distillation if you need further speedup and have time for the additional training. Export the model to an optimized format like ONNX or TensorRT if using those serving frameworks.

Benchmark latency and throughput under realistic conditions, including the batch sizes and sequence lengths you expect in production. Test with both average-case and worst-case inputs to understand performance variability. Estimate serving costs based on expected request volume and the hardware required to meet latency requirements.

Set up monitoring and alerting for the production deployment. Ensure that you can track request rate, latency, error rate, and resource utilization. Configure alerts for anomalies in these metrics. Plan your scaling strategy, including autoscaling rules if using dynamic scaling.

Test the deployment pipeline end-to-end, including model loading, preprocessing, inference, and postprocessing. Verify that error handling works correctly and that failures are logged appropriately. Conduct load testing to ensure the system can handle expected traffic with appropriate margins for spikes.

Hyperparameter Tuning

Critical Hyperparameters (Ordered by Impact)

1. Learning Rate (highest impact)

2. Batch Size

3. Warmup Steps

4. Weight Decay

5. Dropout

Tuning Strategy

Phase 1: Coarse search

Phase 2: Fine search

Phase 3: Validation

Example: Task: Fine-tune BERT on classification

Coarse search:

Fine search:

Final:

Common Pitfalls and Solutions

Architecture Pitfalls

Pitfall 1: Forgetting positional information

Pitfall 2: Incorrect masking

Pitfall 3: Not sharing embeddings

Training Pitfalls

Pitfall 4: Insufficient warmup

Pitfall 5: Wrong learning rate scale

Pitfall 6: Overfitting small datasets

Deployment Pitfalls

Pitfall 7: Batch size 1 in production

Pitfall 8: Not using mixed precision

Pitfall 9: No KV caching for generation

Case Study: BERT for Search Ranking

Problem Setup

Task: Rank search results by relevance

Input: Query + Document pairs

Output: Relevance score [0, 1]

Architecture Decisions

Model: BERT-base with regression head

Input format:

[CLS] query tokens [SEP] document tokens [SEP]

Output: $\text{score} = \sigma(\mathbf{W} \mathbf{h}_{\text{[CLS]}} + b)$

Training Strategy

Data:

Loss: Mean squared error on CTR prediction

Optimization:

Production Deployment

Optimizations:

  1. Quantize to INT8 (3× speedup)
  2. Distill to 6-layer model (2× speedup)
  3. Deploy with ONNX Runtime
  4. Dynamic batching (avg batch size 32)

Results:

Case Study: GPT for Code Generation

Problem Setup

Task: Generate Python code from natural language

Example:

Input: "Function to reverse a string"
Output: 
def reverse_string(s):
    return s[::-1]

Model and Data

Model: GPT-2 medium (345M params)

Data:

Training

Pre-training: Start from GPT-2 checkpoint

Fine-tuning:

Evaluation

Metrics:

Results:

Future Directions

Architectural Innovations

1. Efficient attention

2. Multimodal integration

3. Long context

Training Innovations

1. Sample efficiency

2. Scaling

3. Alignment

Conclusion

Key Takeaways

Architecture:

Training:

Deployment:

Final Advice

For practitioners:

  1. Start simple: Use pre-trained models
  2. Debug systematically: Data, model, training
  3. Optimize iteratively: Accuracy first, then speed
  4. Monitor continuously: Metrics, errors, drift

For researchers:

  1. Understand fundamentals deeply
  2. Question assumptions: Why does this work?
  3. Experiment rigorously: Ablations, multiple seeds
  4. Share knowledge: Open source, papers, blogs

This concludes our comprehensive journey through deep learning and transformers. You now have the mathematical foundations, practical implementations, and real-world insights to build state-of-the-art transformer models!

Exercises

Exercise 1: Reproduce DistilBERT:
  1. Train 6-layer student on BERT-base teacher
  2. Use distillation + MLM + cosine losses
  3. Evaluate on GLUE
  4. Measure compression ratio and speedup

Exercise 2: Debug broken transformer (provided):
  1. Model trains but poor performance
  2. Find 3 subtle bugs (architecture, training, data)
  3. Fix and verify improvements

Exercise 3: Deploy BERT for production:
  1. Fine-tune on classification task
  2. Quantize to INT8
  3. Export to ONNX
  4. Create REST API with FastAPI
  5. Load test and optimize

Solutions

Full solutions for all exercises are available at \url{https://deeplearning.hofkensvermeulen.be}.

Solution: Exercise 1: Reproduce DistilBERT

Training Configuration:

Results on GLUE:

Model | Params | GLUE Score | Speed
BERT-base | 110M | 84.5 | 1.0x
DistilBERT | 66M | 82.8 | 1.6x

Compression Analysis:

Key Insights:

  1. Distillation retains about 98\% of the teacher's GLUE score (82.8 vs. 84.5)
  2. Triple loss (distill + MLM + cosine) crucial for quality
  3. 6 layers sufficient for most understanding tasks
  4. Excellent trade-off for production deployment

When to use DistilBERT:

Solution: Exercise 2: Debug Broken Transformer

Common Bugs Found:

Bug 1 (Architecture): Missing dropout in attention

Symptom: Model overfits quickly, poor generalization

Fix:
# Before (broken)
attn_weights = F.softmax(scores, dim=-1)
output = torch.matmul(attn_weights, V)

# After (fixed)
attn_weights = F.softmax(scores, dim=-1)
attn_weights = F.dropout(attn_weights, p=0.1, training=self.training)
output = torch.matmul(attn_weights, V)

Impact: Validation accuracy improves from 72\% to 84\%

Bug 2 (Training): Learning rate too high

Symptom: Loss oscillates, doesn't converge

Fix:
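# Before (broken)
optimizer = AdamW(model.parameters(), lr=1e-3)  # Too high!

# After (fixed)
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

optimizer = AdamW(model.parameters(), lr=5e-5)  # Appropriate for BERT
# num_training_steps is required; total_steps (assumed name) = batches per epoch * epochs
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=1000, num_training_steps=total_steps
)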

Impact: Loss converges smoothly, final accuracy 84\% $\to$ 87\%

Bug 3 (Data): Incorrect padding token handling

Symptom: Model attends to padding, poor performance on variable-length sequences

Fix:
# Before (broken)
attn_scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
attn_weights = F.softmax(attn_scores, dim=-1)

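# After (fixed)
attn_scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
# Mask padding tokens. Assumes attn_scores is (batch, heads, seq, seq) and
# attention_mask is (batch, seq) with 0 at padding; broadcast over heads and queries
key_padding = attention_mask[:, None, None, :] == 0
attn_scores = attn_scores.masked_fill(key_padding, float('-inf'))
attn_weights = F.softmax(attn_scores, dim=-1)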

Impact: Accuracy on variable-length sequences improves from 79\% to 87\%

Final Results:

Version | Accuracy
Original (broken) | 72\%
After Bug 1 fix | 84\%
After Bug 2 fix | 87\%
After Bug 3 fix | 87\% (robust)

Debugging Lessons:

  1. Always include dropout in attention
  2. Use appropriate learning rates (5e-5 for BERT-scale)
  3. Properly mask padding tokens
  4. Test on variable-length sequences
  5. Monitor both training and validation metrics

Solution: Exercise 3: Deploy BERT for Production

Deployment Pipeline:

Step 1: Fine-tune on Classification

Step 2: Quantize to INT8

Step 3: Export to ONNX
import torch
import onnxruntime as ort

# Export model (eval mode so dropout is disabled in the exported graph)
model.eval()
dummy_input = torch.randint(0, 1000, (1, 128))
torch.onnx.export(
    model,
    dummy_input,
    "bert_sentiment.onnx",
    input_names=['input_ids'],
    output_names=['logits'],
    # Mark batch and sequence dimensions as variable-length
    dynamic_axes={'input_ids': {0: 'batch', 1: 'sequence'},
                  'logits': {0: 'batch'}}
)

# Verify with ONNX Runtime
session = ort.InferenceSession("bert_sentiment.onnx")
# 1.3x additional speedup

Step 4: Create REST API
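import numpy as np
import onnxruntime as ort
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoTokenizer

app = FastAPI()
session = ort.InferenceSession("bert_sentiment.onnx")
# Assumption: load the same tokenizer the model was fine-tuned with
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

class TextInput(BaseModel):
    text: str

@app.post("/predict")
def predict(payload: TextInput):  # sync handler: FastAPI runs it in a threadpool
    # Tokenize
    tokens = tokenizer.encode(payload.text, max_length=128, truncation=True)

    # Inference (ONNX Runtime expects an int64 NumPy array of shape [batch, seq])
    input_ids = np.array([tokens], dtype=np.int64)
    logits = session.run(None, {"input_ids": input_ids})[0][0]

    # Predict: softmax turns raw logits into probabilities
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    prediction = "positive" if probs[1] > probs[0] else "negative"
    confidence = float(probs.max())

    return {"prediction": prediction, "confidence": confidence}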

Step 5: Load Test and Optimize

Initial Performance:

Optimizations Applied:

  1. Dynamic batching (batch size 8): 3.2x throughput
  2. Connection pooling: 1.2x throughput
  3. Async processing: 1.5x throughput

Final Performance:

Production Checklist:

Deployment Architecture:

Load Balancer
    |
    +-- API Server 1 (ONNX Runtime)
    +-- API Server 2 (ONNX Runtime)
    +-- API Server 3 (ONNX Runtime)
    |
Monitoring (Prometheus + Grafana)

Key Metrics to Monitor:

Cost Analysis:

Production Best Practices:

  1. Always quantize for inference (2-4x speedup)
  2. Use ONNX for deployment (cross-platform, optimized)
  3. Implement dynamic batching (3-5x throughput)
  4. Monitor latency percentiles (not just average)
  5. Set up auto-scaling for variable load
  6. Use health checks and graceful shutdown
  7. Implement request timeouts and retries
  8. Log predictions for model monitoring

Success Criteria Met:
