Natural Language and Enterprise NLP: Search, Classification, Assistants

Chapter Overview

This chapter addresses the practical deployment of language models in enterprise environments, focusing on real-world applications that solve concrete business problems. We examine semantic search and retrieval systems, text classification for routing and triage, and conversational assistants that support customer interactions. Unlike the general-purpose language models of previous chapters, enterprise NLP applications balance accuracy, latency, cost, and regulatory constraints. This chapter emphasizes how the foundations of transformers, embeddings, and fine-tuning translate into production systems that generate measurable business value.

Learning Objectives

Understand semantic search architectures and vector database integration
Design and deploy text classification pipelines with BERT-style models
Build conversational assistants with guardrails and human escalation
Implement monitoring and drift detection for NLP systems
Balance accuracy, latency, and cost in production deployments
Handle domain-specific data challenges: labeling, imbalance, and long-tail classes

Enterprise Search and Semantic Retrieval

Enterprise search is one of the most impactful yet underappreciated applications of NLP. Consider the daily frustration of employees unable to find critical information buried in company wikis, documentation systems, and shared drives. A study by IDC found that knowledge workers spend 2.5 hours per day searching for information, costing large organizations millions in lost productivity annually. Traditional keyword-based search systems, built on TF-IDF or BM25, struggle with semantic similarity. A user searching for ``vehicle transmission problems'' may not retrieve documents discussing ``automotive gearbox failures,'' despite semantic equivalence.

The business case for semantic search is straightforward: if you can reduce search time by 50\%, you recover 1.25 hours per employee per day. For a 1,000-person organization with an average salary of \$100,000, that's approximately \$15 million in recovered productivity annually. The investment in semantic search infrastructure---typically \$50,000--\$200,000 for development and deployment---pays for itself within weeks.

Semantic search addresses the limitations of keyword matching by encoding both queries and documents into high-dimensional vector spaces where similar meanings are close together, enabling retrieval based on semantic understanding rather than lexical overlap. This means ``car transmission issues'' and ``automobile gearbox problems'' are recognized as semantically similar even though they share no common words.

Architecture and Workflow

A semantic search system has several components working in concert, each serving a specific purpose in the retrieval pipeline. Understanding this architecture helps you make informed decisions about where to invest engineering effort and computational resources.

First, a dense embedding model encodes documents and queries into vectors. This is the heart of semantic search---the model that learned to represent meaning as geometry. Pre-trained models like BERT-base, all-MiniLM-L6-v2, or specialized dense retrievers (e.g., DPR, ColBERT) map text to fixed-size embeddings. The choice of model involves trade-offs: larger models like BERT-base (768 dimensions) capture more nuance but require more compute; smaller models like MiniLM (384 dimensions) are faster but may miss subtle semantic distinctions.

Second, a vector database (Pinecone, Weaviate, Milvus, Qdrant) indexes these embeddings for fast nearest-neighbor retrieval. This is where engineering meets mathematics: the database must efficiently search through millions or billions of vectors to find the most similar ones. The data structure choices here---HNSW graphs, IVF indexes, or LSH hashing---determine whether your search takes milliseconds or seconds.

Third, a ranking model optionally re-scores the top-k candidates for finer-grained relevance. The initial retrieval casts a wide net; the ranker applies more sophisticated (and expensive) scoring to the candidates. This two-stage approach balances speed and accuracy: fast retrieval narrows the search space, then careful ranking orders the results.

The pipeline often combines dense retrieval with sparse BM25 ranking in a hybrid approach to capture both semantic and keyword-based signals. This hedges against the weaknesses of each method: dense retrieval might miss exact keyword matches (like product codes or technical terms), while BM25 might miss semantic similarity. Combining them gives you the best of both worlds.

Definition:

The semantic search pipeline transforms a user's natural language query into relevant documents through a series of steps, each optimized for a specific aspect of the retrieval problem:

Embed query: Transform the user's text into a vector representation that captures its semantic meaning:
$$\begin{align} \mathbf{q} = \text{encoder}(q_{\text{text}}) \end{align}$$
This encoding happens in real-time when the user submits their query. The encoder must be fast (ideally <50ms) to maintain responsive search.
Embed documents (offline): Pre-compute embeddings for all documents in your corpus:
$$\begin{align} \mathbf{d}_i = \text{encoder}(d_{\text{text}, i}) \quad \text{for } i = 1, \ldots, N \end{align}$$
This happens offline, typically during nightly batch jobs. For a corpus of 1 million documents, this might take hours, but it only needs to be done once per document (or when documents are updated).
Index vectors: Store $\{\mathbf{d}_i\}$ in vector database with fast retrieval structures. The indexing process builds data structures (like HNSW graphs) that enable sub-linear search time. Without indexing, finding nearest neighbors would require comparing the query to every document---infeasible at scale.
Retrieve top-k: Find the k documents whose embeddings are closest to the query embedding:
$$\begin{align} \text{topk} = \text{retrieve}(\mathbf{q}, \mathbf{D}, k) \end{align}$$
Typically k=100--1000 for the initial retrieval. This step is optimized for speed, using approximate nearest neighbor algorithms that trade perfect accuracy for 10--100x speedup.
Rank (optional): Apply a more sophisticated scoring function to re-order the top-k candidates:
$$\begin{align} s_i = \text{ranker}(q, d_i) \quad \text{and re-sort} \end{align}$$
The ranker might use cross-attention between query and document, consider user context, or incorporate business rules (e.g., prioritize recent documents).
Return: Present the top-k re-ranked results to user, typically with snippets highlighting relevant passages.

Practical Considerations: Embedding Model Selection

Choosing an embedding model is one of the most consequential decisions in building a semantic search system. The model determines your accuracy ceiling, computational costs, and operational complexity. Let's examine the trade-offs systematically.

BERT-base produces 768-dimensional embeddings with strong performance but requires GPU inference. For a query, BERT-base takes approximately 50--100ms on CPU or 5--10ms on GPU. At 1,000 queries per second, CPU-only deployment would require 50--100 CPU cores---expensive and operationally complex. GPU deployment is more cost-effective but requires GPU infrastructure and expertise.

Lighter models like all-MiniLM-L6-v2 (384-dim) offer 10--20$\times$ faster inference with minimal accuracy loss on semantic similarity tasks. This model runs comfortably on CPU, taking 5--10ms per query. For most enterprise applications, the accuracy difference (typically 2--5 percentage points on retrieval metrics) is worth the operational simplicity and cost savings.

For domain-specific applications (e.g., legal, medical), domain-adaptive pre-training or fine-tuning the embedding model on labeled pairs improves performance significantly. A legal search system using a general-purpose embedding model might achieve 70\% recall@10; fine-tuning on legal document pairs can push this to 85--90\%. The investment in fine-tuning (collecting labeled pairs, training for a few days) pays off in dramatically better user experience.

Embedding quality directly impacts retrieval accuracy, and the quality depends heavily on training data. Hard negative mining---training on documents that are contextually similar but semantically different---significantly improves ranking. For example, when searching for instructions on ``how to reset a router,'' a document about network protocols is harder to distinguish from the correct answer than a document about cooking recipes. The model must learn fine-grained distinctions, not just broad topic similarity.

Collecting and training on such hard negatives through user feedback or synthetic generation improves model robustness. In practice, this means logging queries where users didn't click the top results (indicating poor retrieval), then using those as training data. This creates a feedback loop: the system learns from its mistakes, continuously improving.

Vector Database Operations

Vector databases are specialized systems optimized for the specific workload of semantic search: storing millions of high-dimensional vectors and finding nearest neighbors in milliseconds. Understanding their internals helps you make informed deployment decisions.

Vector databases enable billion-scale retrieval in milliseconds through approximate nearest neighbor (ANN) algorithms like HNSW, IVF, and LSH. These algorithms trade exact nearest neighbors for speed---they might miss the true nearest neighbor but find one that's very close, which is usually good enough for search applications.

The memory requirements are substantial but manageable. For a 768-dimensional BERT-base embedding, storing one million documents requires approximately 3 GB of memory (1M $\times$ 768 $\times$ 4 bytes for float32). This fits comfortably in RAM on modern servers. For 100 million documents, you need 300 GB---requiring a large server or distributed deployment, but still feasible.

Querying is typically sub-millisecond for index lookups once the query embedding is computed. The total latency includes embedding generation (50--100 ms for BERT on CPU, 5--10 ms on GPU) and ranking post-processing (10--50 ms depending on complexity). End-to-end latency of 100--200ms is typical and acceptable for most applications.

The choice of ANN algorithm involves trade-offs. HNSW (Hierarchical Navigable Small World) graphs offer excellent recall and speed but require more memory. IVF (Inverted File Index) is more memory-efficient but slightly slower. LSH (Locality-Sensitive Hashing) is fastest but has lower recall. For most enterprise applications, HNSW is the best default choice.

Case Study: Internal Documentation Search

Let's examine a concrete example that illustrates the business impact and technical decisions involved in deploying semantic search.

A company with 100,000 internal documents (policies, runbooks, FAQs, wiki pages) seeks to enable employees to find answers through natural language search. The existing keyword search system is frustrating: employees report spending 30+ minutes searching for information they know exists but can't find. Traditional keyword search returns thousands of results for common queries; employees give up or ask colleagues, creating interruptions and inefficiency.

Implementation:

Embed all documents with all-MiniLM-L6-v2 (384-dim) offline. This choice prioritizes operational simplicity (CPU-only deployment) over marginal accuracy gains from larger models. The embedding process takes 8 hours on a standard server---acceptable for a one-time operation.
Store in Weaviate with HNSW indexing. Weaviate was chosen for its ease of deployment and good performance at this scale. The HNSW index provides 95\% recall@100 with sub-millisecond query time.
Query embedding on CPU takes 30 ms; ANN retrieval returns top-100 in 5 ms. The 30ms embedding time is the bottleneck, but it's acceptable for interactive search. GPU deployment could reduce this to 5ms but wasn't justified for the query volume (approximately 100 queries/second peak).
Hybrid search combines BM25 (keyword) and dense retrieval; fusion re-ranks. The hybrid approach catches edge cases where semantic search alone fails: exact product codes, acronyms, or technical terms that require keyword matching.
User-facing API returns top-5 results in $<200$ ms p95. This latency is acceptable for search; users perceive sub-200ms as instantaneous.

Metrics:

Accuracy: 88\% of users found their answer in top-5 (vs. 45\% with keyword search). This near-doubling of success rate translates directly to time saved and frustration reduced.
Latency: 180 ms p95 end-to-end (acceptable for internal tools). The p95 metric matters more than average because tail latency determines user experience.
Cost: Single standard VM with 4 CPU cores, no GPUs needed. Monthly infrastructure cost: approximately \$200. This is negligible compared to the productivity gains.
Maintenance: Re-index new documents nightly; model weights updated quarterly. The nightly re-indexing ensures new documents are searchable within 24 hours. Quarterly model updates incorporate user feedback and improve accuracy over time.

The system has been in production for 18 months. User satisfaction surveys show 85\% of employees rate the new search as ``much better'' than the old system. Search query volume increased 3x---not because employees search more, but because they're willing to use search now that it works. The system has become a critical piece of infrastructure, with employees reporting they ``couldn't work without it.''

Text Classification and Tagging in Production

Text classification is among the most successful enterprise NLP applications, powering intent recognition, routing, spam detection, and triage. The business value is immediate and measurable: automated classification reduces manual labor, speeds up processes, and improves consistency. A support team that manually routes 50,000 tickets monthly spends approximately 2,500 hours on triage alone---time that could be spent actually solving customer problems.

The technical approach is well-established: fine-tune a pre-trained encoder (BERT) with a lightweight classification head on labeled data, then deploy for inference. The challenge lies not in the algorithm but in the operational details: handling class imbalance, dealing with label scarcity, adapting to domain shift, and maintaining performance over time. These practical concerns determine whether a classification system succeeds or fails in production.

Standard Classification Workflow

Fine-tuning BERT-base on a classification task is straightforward in principle but requires attention to detail in practice. The process leverages transfer learning: BERT has already learned general language understanding from massive pre-training; we adapt it to our specific classification task with relatively little labeled data.

The input sequence is fed to BERT, producing a contextualized representation for each token and a pooled representation (typically the [CLS] token). This pooled representation captures the overall meaning of the input text. A classification head---a single linear layer mapping the 768-dimensional representation to class logits---is trained on labeled examples. The simplicity of the classification head is intentional: BERT does the heavy lifting of understanding language; the head just maps that understanding to class labels.

Standard hyperparameters are learning rate $2$--$5 \times 10^{-5}$, batch size 16--32, and 2--5 epochs of training on a single GPU. These hyperparameters are well-established through empirical research and work well across most classification tasks. The low learning rate prevents catastrophic forgetting---we want to adapt BERT, not overwrite its pre-trained knowledge.

Definition:

The fine-tuning process adapts a pre-trained BERT model to a specific classification task. Given a labeled dataset $\{(x_i, y_i)\}$ where $x_i$ is text and $y_i \in \{1, \ldots, C\}$ is a class label, we train the model to predict the correct class for each input.

The forward pass computes:

$$\begin{align} \mathbf{h} &= \text{BERT}(x_i)[\text{CLS}] \quad \text{(pooled representation)} \end{align}$$

This pooled representation $\mathbf{h}$ is a 768-dimensional vector that encodes the semantic meaning of the input text. BERT's self-attention mechanism has processed the entire input, allowing each token to attend to every other token, creating a rich contextual representation.

We then apply a linear classification layer:

$$\begin{align} \text{logits} &= W \mathbf{h} + b \quad (W \in \mathbb{R}^{C \times 768}) \end{align}$$

The weight matrix $W$ and bias $b$ are learned during fine-tuning. Each row of $W$ represents a class, and the dot product with $\mathbf{h}$ measures how well the input matches that class.

Finally, we compute the loss:

$$\begin{align} \text{loss} &= \text{CrossEntropy}(\text{softmax}(\text{logits}), y_i) \end{align}$$

Cross-entropy loss penalizes confident wrong predictions more than uncertain ones, encouraging the model to be both accurate and calibrated. During training, we backpropagate this loss through both the classification head and BERT, updating all parameters to minimize classification error.

In practice, this means the model learns to adjust BERT's representations to be more discriminative for the specific classes in your task, while preserving the general language understanding from pre-training.

Handling Common Challenges

Class Imbalance: Real-world datasets often exhibit severe class imbalance. For customer support ticket routing, perhaps 80\% of tickets are ``billing'' and 5\% are ``technical.'' Training naively yields high accuracy but poor performance on rare classes. Solutions include:

Weighted loss: Assign higher weights to underrepresented classes
Oversampling: Duplicate rare class examples during training
Focal loss: Reduce weight for easy (high-confidence) examples
Threshold tuning: Adjust class-specific decision thresholds at inference

Label Scarcity: Labeling 10,000 examples costs thousands of dollars. Active learning selects the most informative unlabeled examples for human annotation, reducing labeling costs by 50--70\%. Data augmentation (paraphrasing, synonym replacement, back-translation) creates synthetic training data, though quality must be validated.

Long-Tail Classes: Even with 100,000 labeled examples, certain classes appear only dozens of times. Few-shot learning or transfer learning from related tasks (e.g., sentiment classification to domain classification) helps. Pre-training on a large corpus in the same domain before task-specific fine-tuning improves performance on tail classes.

Domain Shift: Models trained on one domain (e.g., email support tickets) may perform poorly on another (chat support). Continuous retraining on recent data, periodic evaluation on holdout test sets, and monitoring prediction confidence detect drift. When drift is detected, the model is retrained on a mixture of historical and recent data.

Production Metrics and Monitoring

In production, classification models must maintain consistent performance while handling distribution shift. Key metrics include:

Precision, recall, F1 by class: Critical for decision-making systems
Confidence calibration: Does the model's confidence match actual accuracy?
Prediction volume by class: Sudden changes indicate drift
Latency: Per-example inference time; alert if > threshold
Null rate: Fraction of examples where the model abstains (low confidence)

Continuous monitoring via dashboards and automated alerts ensures performance stays within SLOs. When performance degrades, the team is alerted to investigate whether the issue is model drift, data quality degradation, or a change in the task itself.

Case Study: Support Ticket Triage

A support team receives 50,000 tickets monthly across 12 categories (Billing, Technical, Account Access, etc.). Manual triage takes 2--3 minutes per ticket, costing $\approx\$30,000/month in labor.

Solution:

Collect 5,000 labeled tickets from recent month; split 80/20 for train/val
Fine-tune BERT-base for 3 epochs with learning rate $3 \times 10^{-5}$, batch size 32
Accuracy on validation set: 88\% overall; 92\% for common classes, 71\% for rare classes
Deploy behind REST API; latency 20 ms per ticket on single T4 GPU
Auto-route tickets with confidence > 0.85 (covers 75\% of tickets)
Low-confidence and out-of-distribution cases routed to human agents

Impact:

75\% of tickets auto-routed correctly (78\% accuracy on auto-routed subset)
Human agents handle 25\% of tickets; manual triage for these improves team focus
Cost savings: \$22,500/month (75\% of triage labor eliminated)
Ticket resolution time: Shorter for correct auto-routes, minimal impact on misdirected tickets (human agents catch errors)

Conversational Assistants and Copilots

Conversational assistants powered by language models serve as frontline support, internal knowledge bases, or specialized domain advisors. Unlike single-turn classification, conversational systems must maintain context across multiple turns, handle follow-up questions, and know when to escalate to humans.

Architecture: Retrieval-Augmented Generation

A practical approach combines retrieval with generation. When a user asks a question, the system:

Encodes the user query into an embedding
Retrieves relevant documents from a knowledge base (e.g., FAQs, policies)
Constructs a prompt: ``Answer the question based on the following context:''
Feeds the prompt to a generative model (GPT-3.5, LLaMA, or smaller fine-tuned model)
Returns the generated response to the user

This retrieval-augmented generation (RAG) approach grounds responses in actual company knowledge, reducing hallucinations. It is far cheaper than querying a large proprietary LLM API repeatedly, as retrieval is fast and cheap, and the generative step uses a smaller, cheaper model.

Prompt Engineering and Guardrails

The quality of conversational responses depends heavily on prompt construction. A well-written system prompt establishes persona, instructions, and constraints:

Example System Prompt:


You are a customer support assistant for TechCorp. 
Answer customer questions based on the provided knowledge base.
If you cannot answer the question from the knowledge base, say:
"I don't have information on that topic. 
Let me connect you with a support agent."
Keep responses to 2-3 sentences.
Do not make up product features or pricing.

Guardrails prevent the model from generating harmful, offensive, or confidential content. Rule-based filters detect and block responses containing:

Sensitive information: Credit card numbers, SSNs (use regex or NER)
Offensive language: Trained classifier or list-based filter
Out-of-scope requests: Queries unrelated to the support domain

Semantic similarity checks can also detect if the model is attempting to answer outside its knowledge base. If the retrieved context has low relevance to the query, escalate to a human agent.

Multi-Turn Context and State Management

Conversational state management is non-trivial. The assistant must remember prior context while avoiding token count explosion. Common approaches:

Fixed-window context: Keep the last N messages in context. Older messages are discarded to maintain latency.

Hierarchical summarization: Periodically summarize earlier turns into a brief summary, freeing token budget for new context.

Entity and intent tracking: Extract key entities (product name, issue type) and maintain explicit state rather than relying purely on message history.

Human Escalation and Feedback Loops

Escalation is critical. If the user's issue cannot be resolved or satisfaction is low, the assistant should escalate to a human agent. Signals for escalation:

Low retrieval relevance (no good documents found)
Model confidence below threshold
Explicit user request: ``I want to talk to a human''
Repeated unresolved queries in the conversation

After escalation, the conversation history is logged for training data. Human feedback (``this response was not helpful'') provides signals for model improvement. A feedback loop where user interactions improve model quality is essential for long-term deployment.

Case Study: Customer Support Bot

A SaaS company receives 10,000 support inquiries monthly. Top inquiries (50\%) are about billing, account management, and common troubleshooting. A support bot can handle straightforward cases, reducing load on a 5-person support team.

Design:

Knowledge base: 500 FAQ entries and help articles (created by support team)
Retrieval: all-MiniLM-L6-v2 embeddings stored in Pinecone
Generation: Fine-tuned GPT-2 medium on 1,000 curated QA pairs from past support tickets
Guardrails: Regex filters for PII; confidence threshold for auto-responses
Escalation: Escalate if confidence < 0.6 or user requests human; max 3 turns before escalation

Metrics:

Accuracy (manual evaluation on 500 test conversations): 82\% of responses helpful
Containment rate: 58\% of inquiries resolved without human escalation
First-contact resolution: 65\% (vs. typical 45\% for human agents on common issues)
User satisfaction (CSAT on resolved cases): 3.8/5.0
Cost: Hosted on single standard VM; inference cost < \$50/month
Savings: 58\% $\times$ 10,000 inquiries/month $\times$ 5 min/inquiry $\times$ \$0.25/min labor = \$3,625/month labor reduction

Enterprise Constraints and Governance

Enterprise deployments face constraints absent from research settings: regulatory compliance, data residency, security, and auditability.

Data Residency and Privacy

Regulations like GDPR, CCPA, and industry-specific rules (HIPAA for healthcare, SOC 2 for SaaS) restrict where data can be stored and processed. Enterprises must often deploy models on-premises or in private cloud instances, forgoing the convenience of third-party APIs. This increases operational complexity but is non-negotiable for regulated industries.

Handling Sensitive Data

NLP systems process customer data, internal documents, and potentially sensitive information. Best practices include:

PII masking: Detect and mask sensitive information before processing
Differential privacy: Add noise to training data to prevent memorization
Audit logging: Track all model predictions and their basis (which document was retrieved)
Access control: Restrict who can view model outputs or fine-tune models

Model Explainability and Red-Teaming

For high-stakes applications (financial, healthcare, legal), explaining model predictions is essential. Techniques include:

Attention visualization: Show which input tokens influenced the prediction
Retrieval transparency: Display the documents from which answers were retrieved
Influence functions: Identify training examples most similar to a query

Red-teaming---testing the system for failure modes and adversarial inputs---should precede production deployment. Can the model be tricked into producing harmful content? Does it hallucinate credentials or confidential information? How does it handle non-English text, typos, or adversarial prompts?

Continuous Learning and Model Drift in Enterprise NLP

Enterprise NLP systems face unique drift challenges because language itself evolves, business contexts change, and user behavior shifts. A classification model trained on 2023 support tickets may struggle with 2024 tickets as products change, new issues emerge, and customer language evolves. Understanding and addressing drift is critical for maintaining long-term system value.

Language Evolution and Concept Drift

Language is not static. New terms emerge (``ChatGPT,'' ``NFT,'' ``metaverse''), meanings shift (``cloud'' once meant weather, now means computing), and communication styles evolve (emoji usage, abbreviations, formality levels). Enterprise NLP systems must adapt to these changes or risk becoming obsolete.

Vocabulary drift: New products, features, and services introduce new terminology. A support ticket classifier trained before a product launch won't recognize the new product name. This manifests as increased ``other'' or ``unknown'' classifications and decreased confidence scores.

Semantic drift: The meaning of existing terms changes. ``Premium support'' might have meant one thing in 2023 and something different after a service tier restructuring in 2024. The model's learned associations become outdated.

Style drift: Communication norms evolve. Customers might shift from formal emails to casual chat messages, or from text to voice transcriptions. The model trained on one style may struggle with another.

The generic framework for drift detection (performance monitoring, distribution monitoring, confidence tracking), continuous learning strategies (periodic retraining, triggered retraining, online learning, ensemble approaches, human-in-the-loop), and practical implementation infrastructure (data versioning, model versioning, automated pipelines, monitoring dashboards) is covered in Chapter~[ref], Section~[ref]. Here we highlight NLP-specific considerations.

Key NLP-specific adaptation strategies include:

Vocabulary updates: New products and terminology require updating the tokenizer and model vocabulary. Add tokens for new domain terms that would otherwise fragment into meaningless subwords.
Active learning for efficient labeling: Use model uncertainty to prioritize which recent examples to label, focusing annotation budget on drifting areas (e.g.\ new product categories).
Vocabulary monitoring: Track unknown-token frequency and subword fragmentation rates---spikes indicate terminology drift before performance degrades.

Case Study: Support Ticket Classification Drift

A SaaS company's support ticket classifier degraded from 88\% to 78\% accuracy over six months when three new products were launched. Tickets about these products were misclassified because the model had never seen the new product names. The immediate fix---collecting 500 labeled examples per product and fine-tuning on an 80/20 historical/new data mix---recovered accuracy to 85\% within one week. The long-term solution implemented monthly retraining on a rolling 6-month window with 1,000 actively sampled examples, stabilizing accuracy at 86--88\% at a cost of \$2,500/month against \$15,000/month in prevented misrouting.

Exercises

Exercise 1: Design a semantic search system for a legal firm with 50,000 case documents. What embedding model would you choose? How would you handle long documents that exceed token limits? How would you measure retrieval quality?

Exercise 2: Implement a text classification pipeline for email spam detection. How would you handle class imbalance if 95\% of emails are not spam? How would you detect and respond to concept drift over time?

Exercise 3: Build a conversational FAQ assistant for your organization. What knowledge base would you create? How would you measure success (containment, CSAT)? What guardrails would you implement?

Solutions

Full solutions for all exercises are available at \url{https://deeplearning.hofkensvermeulen.be}.

Solution: Exercise 1: Legal Document Semantic Search

Model Selection: Use a domain-adapted model like SPECTER (trained on academic citations) or fine-tune a legal BERT variant on case similarity pairs. All-MiniLM is too lightweight for legal nuance; BERT-base or a domain-specific model balances performance and latency.

Long Document Handling: Legal cases often exceed 512-token limits. Use hierarchical chunking: break documents into 400-token sections with 50-token overlap, embed each section, then aggregate scores across sections for final retrieval ranking.

Quality Measurement: - Precision@k and Recall@k for a test set of lawyer-annotated queries - A/B testing: deploy new model for 10\% of queries, measure lawyer satisfaction - Mean reciprocal rank (MRR): where is the most relevant document ranked?

Solution: Exercise 2: Email Spam Classification

Class Imbalance Handling: With 95\% negatives, standard cross-entropy biases toward the negative class. Solutions: - Weighted loss: $\text{Loss} = 0.95 \times \text{CE}_{\text{not-spam}} + 19 \times \text{CE}_{\text{spam}}$ - Focal loss: reduce weight for easy (high-confidence) examples - Threshold tuning: lower decision boundary to increase spam recall at cost of lower precision

Drift Detection: - Monitor spam rate over time; alert if it deviates > 2\% from baseline - Evaluate model monthly on recent data; retrain if F1 drops > 2\% - Flag misclassified emails for review; manually check false positives/negatives - Log all predictions + ground truth for offline evaluation

Solution: Exercise 3: FAQ Conversational Assistant

Knowledge Base: Curate 100--200 high-quality QA pairs addressing the 80\% of common questions. Structure as: question, answer, related questions, prerequisites (e.g., account must be created).

Success Metrics: - Containment: \% of conversations resolved without escalation (target: $\geq 50\%$) - CSAT: user satisfaction on 1--5 scale (target: $\geq 3.5$) - First-contact resolution: issue fully resolved without follow-up (target: $\geq 60\%$)

Guardrails: - PII filter: block responses containing names, emails, IDs matching regex - Confidence threshold: only auto-respond if score > 0.70 - Out-of-scope detector: if retrieved docs have low relevance, escalate - Rate limiting: prevent abuse (e.g., > 10 messages/minute)

Continuous Improvement: Collect feedback from users (``was this helpful?'') and escalated conversations. Retrain model quarterly on new QA pairs derived from unresolved queries.

← Chapter 24: Domain-Specific Models 📚 Table of Contents Chapter 26: Code and Language Models →