Data, Logs, and Observability: Models for Infrastructure and Operations
Chapter Overview
Modern digital infrastructure generates an overwhelming torrent of data. A typical enterprise with 10,000 servers produces 10 billion log messages daily, 100 million metric data points hourly, and millions of distributed traces. This machine-generated data (logs, metrics, traces, configurations) forms a rich language describing system behavior, health, and failures. However, the volume far exceeds human capacity to analyze. A single on-call engineer cannot manually review billions of events to diagnose why a service failed at 3am.
The business stakes are enormous. System downtime costs enterprises \$5,600 per minute on average, with major outages costing millions per hour. Amazon loses \$220,000 per minute of downtime. Facebook loses \$90,000 per minute. Beyond direct revenue loss, downtime damages customer trust, violates SLAs (triggering financial penalties), and consumes engineering time in firefighting rather than feature development. For SaaS companies, reliability is a competitive differentiator: 99.9\% uptime (8.7 hours downtime annually) versus 99.99\% uptime (52 minutes annually) can determine market leadership.
This chapter examines how transformers and deep learning are revolutionizing observability, the ability to understand system behavior from its external outputs. Traditional monitoring relies on static thresholds and manual analysis, generating alert fatigue (hundreds of false alarms daily) while missing subtle failures. AI-powered observability enables anomaly detection that adapts to changing baselines, root-cause analysis that diagnoses failures in minutes rather than hours, and automated remediation that fixes common problems without human intervention.
The business impact is measurable and substantial. Companies implementing AI-driven observability report 50-70\% reduction in mean time to resolution (MTTR), 60-80\% reduction in false positive alerts, and 30-50\% reduction in on-call engineer workload. For a large enterprise, reducing MTTR from 30 minutes to 10 minutes saves millions annually in prevented downtime. Reducing false positives from 100 to 20 daily alerts prevents alert fatigue and improves engineer quality of life. Automating 40\% of incident responses frees engineers to focus on strategic work rather than repetitive firefighting.
However, observability AI faces unique challenges. Systems must operate in real time with sub-minute latency: slow anomaly detection means prolonged outages. False positives are costly: waking engineers at 3am for non-issues causes burnout and erodes trust. False negatives are catastrophic: missing critical failures causes extended outages. The data is massive, noisy, and constantly changing as systems evolve. And critically, observability systems must be more reliable than the systems they monitor; if the monitoring system fails during an outage, engineers are blind.
This chapter provides the technical foundation and business context to build observability AI systems that detect, diagnose, and remediate infrastructure failures. We examine successful deployments, operational requirements, and the economic models that make observability AI essential for modern infrastructure. The focus is on practical systems that work in production at scale, handling billions of events daily while maintaining engineer trust.
Learning Objectives
- Understand machine data: logs, metrics, traces, configurations
- Parse semi-structured logs with variable formats
- Build anomaly detection models for multi-dimensional time series (metrics)
- Implement root-cause analysis using sequence models
- Automate incident response and remediation
- Design closed-loop systems: detect → diagnose → remediate → learn
- Optimize for operational metrics: false positive rate, MTTR, accuracy of diagnostics
Machine Data as a Language
Machine data consists of events generated by software and hardware:
Types of Machine Data
Logs are timestamped, semi-structured text messages recording discrete events, for example: [ERROR] Connection timeout after 5000ms to database.example.com:5432. Metrics are numeric time series tracking system health indicators like CPU usage, memory consumption, latency, and request counts, typically sampled at 1-minute or 1-hour granularity. Traces provide detailed request flows through distributed services, recording timestamps, service names, durations, and parent-child relationships between service calls. Events capture discrete occurrences such as deployments, configuration changes, and scaling operations, often accompanied by structured metadata. Configurations describe system state including service versions, feature flags, and environment variables, typically stored as text or structured formats like JSON or YAML.
Machine Language Grammar
Machine data exhibits structural patterns despite lacking formal grammar. Log templates follow consistent patterns where multiple logs share the same structure, typically formatted as [LEVEL] Message with parameters. Metric names use hierarchical naming conventions like system.cpu.usage or app.request.latency, enabling organized aggregation and querying. Trace structure forms directed acyclic graphs (DAGs) of service calls with associated timings, revealing request paths through distributed systems. Event sequences follow causal chains where actions trigger consequences: for example, a deployment leads to configuration updates, which trigger service restarts, culminating in system recovery.
Data Collection and Storage
Production systems generate massive volumes of data that challenge traditional analysis approaches. A single server typically generates 100-1,000 log messages per second during normal operation. Scaling to datacenter level, a facility with 10,000 servers produces on the order of 10 billion events per day, creating petabytes of data annually. Specialized databases handle this ingestion: Elasticsearch for logs, Prometheus for metrics, and Jaeger for traces. Retention policies typically maintain 3-30 days of detailed data for operational analysis, with historical data archived to cheaper storage for compliance requirements and long-term research.
Anomaly Detection
Most events are normal. A model trained on normal data learns baseline behavior; deviations are anomalies.
Metric Anomaly Detection
A metric time series (e.g., CPU usage over time) has structure:
Baseline: Normal operating level (e.g., CPU averages 40\%)
Seasonality: Predictable patterns (e.g., higher traffic 9am--5pm)
Trend: Long-term changes (e.g., growing traffic week-over-week)
Anomaly: Deviation from expected pattern (e.g., CPU spikes to 95\% unexpectedly)
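As a toy illustration of this decomposition (synthetic data, not a production detector), the sketch below subtracts a crude hour-of-day seasonal baseline and flags points whose residual exceeds three standard deviations:

```python
from statistics import mean, stdev

def seasonal_residuals(values, period=24):
    """Subtract the per-hour-of-day average (a crude seasonal baseline)."""
    by_phase = [[] for _ in range(period)]
    for i, v in enumerate(values):
        by_phase[i % period].append(v)
    baseline = [mean(b) for b in by_phase]
    return [v - baseline[i % period] for i, v in enumerate(values)]

def flag_anomalies(values, period=24, k=3.0):
    """Flag points whose residual exceeds k standard deviations."""
    resid = seasonal_residuals(values, period)
    sigma = stdev(resid)
    return [i for i, r in enumerate(resid) if abs(r) > k * sigma]

# Synthetic CPU series: 40% baseline, daily 9am-5pm peak, one injected spike.
cpu = [40 + (20 if 9 <= (h % 24) <= 17 else 0) for h in range(24 * 7)]
cpu[100] = 95  # injected anomaly
print(flag_anomalies(cpu))  # → [100]
```

A real system would estimate the baseline online rather than from a fixed window, but the principle is the same: model the expected pattern, then alert on residuals.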
Traditional Approaches
Traditional anomaly detection approaches vary in sophistication and effectiveness. Static thresholds provide the simplest approach: alert whenever a metric crosses a fixed value, such as CPU exceeding 80\%. Thresholds are easy to implement but ignore baseline, seasonality, and trend, so they fire during predictable peaks and miss anomalies that stay below the line. Statistical methods such as moving averages and standard-deviation bands adapt somewhat, but still struggle with complex temporal patterns.
Deep Learning Approaches
Transformers excel at multi-step prediction:
- Input: Metric values for past H hours (e.g., 24 hours)
- Prediction: Predict next hour's value given history
- Anomaly: If actual value differs significantly from prediction, it's anomalous
- Advantage: Model learns complex temporal patterns including seasonality and trend
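The prediction-error scheme above can be sketched with a seasonal-naive forecaster standing in for the trained transformer (the function names and the 25\% tolerance are illustrative assumptions):

```python
def seasonal_naive_forecast(history, period=24):
    """Stand-in forecaster: predict the value seen one period ago.
    A trained transformer would replace this with a learned prediction."""
    return history[-period]

def prediction_error_anomaly(history, actual, period=24, tolerance=0.25):
    """Flag `actual` as anomalous if it deviates from the forecast
    by more than `tolerance` relative error."""
    predicted = seasonal_naive_forecast(history, period)
    error = abs(actual - predicted) / max(abs(predicted), 1e-9)
    return error > tolerance, predicted

# Two days of synthetic hourly CPU with a 9am-5pm peak.
history = [40 + (20 if 9 <= (h % 24) <= 17 else 0) for h in range(48)]
print(prediction_error_anomaly(history, actual=41))  # → (False, 40)
print(prediction_error_anomaly(history, actual=95))  # → (True, 40)
```

Whatever the forecaster, the detector logic is identical: large prediction error means the system is not behaving as the model expects.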
Multivariate Anomaly Detection
Multivariate anomaly detection recognizes that most meaningful alerts involve correlations across multiple metrics. While a CPU spike alone might represent normal burst activity, the combination of CPU spike, disk I/O surge, and elevated context switches together indicates a genuine problem. Univariate approaches analyze each metric independently, missing these critical correlations. Multivariate approaches learn the correlation matrix and joint distributions across all metrics, enabling detection of anomalous patterns that only emerge when considering multiple signals together.
A transformer encoder processes all metrics jointly, learning correlations across them.
Multivariate detection is more accurate but requires more data (training on normal behavior across all combinations).
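One classical way to score joint deviations is Mahalanobis distance under the learned covariance; the hand-rolled two-metric sketch below (synthetic CPU/disk data) shows why a correlated spike scores low while a decorrelated one scores high. A transformer-based detector generalizes this idea to nonlinear, many-metric correlations:

```python
from statistics import mean

def mahalanobis2(x, y, xs, ys):
    """Squared Mahalanobis distance of point (x, y) from the two-metric
    baseline (xs, ys), capturing the correlation between the metrics."""
    mx, my = mean(xs), mean(ys)
    n = len(xs)
    vx = sum((a - mx) ** 2 for a in xs) / n
    vy = sum((b - my) ** 2 for b in ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / n
    det = vx * vy - cov * cov
    dx, dy = x - mx, y - my
    # Inverse covariance applied by hand for the 2x2 case.
    return (dx * dx * vy - 2 * dx * dy * cov + dy * dy * vx) / det

# Baseline: CPU and disk I/O normally rise and fall together.
cpu  = [40, 45, 50, 55, 60, 55, 50, 45]
disk = [20, 26, 30, 34, 40, 36, 30, 24]
print(round(mahalanobis2(60, 40, cpu, disk), 1))  # → 2.7  (correlated spike)
print(round(mahalanobis2(60, 20, cpu, disk), 1))  # → 802.7 (CPU up, disk flat)
```

The second point has the same CPU value as the first, yet scores far higher because it violates the learned correlation, exactly the pattern univariate detectors miss.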
Practical Challenges
Several practical challenges complicate anomaly detection in production. False positives from legitimate operational events (deployments, scheduled backups, batch jobs) trigger alerts that waste engineering time and erode trust in the monitoring system. Tuning thresholds to balance sensitivity (catching real problems) and specificity (avoiding false alarms) proves difficult and requires continuous adjustment. Data quality issues including missing values from sensor failures, measurement errors, and incomplete distributed traces confuse models and degrade accuracy. Concept drift occurs as system behavior evolves over time with growing user bases, architectural changes, and new deployment patterns, causing models trained on historical data to become stale and inaccurate. Alert fatigue results when excessive alerts desensitize on-call engineers, leading them to ignore or dismiss notifications, a dangerous situation where critical alerts may be missed among the noise.
Root-Cause Analysis and Diagnosis
Detecting an anomaly is step one. Diagnosing the cause is step two.
Root-cause analysis models perform several key functions to diagnose detected anomalies. They retrieve similar historical incidents from the incident database, leveraging past resolutions to inform current diagnosis. They identify temporal relationships between metric changes, distinguishing symptoms from root causes by determining which metrics changed first in the causal chain. They correlate anomalies with operational events like deployments and configuration changes, suggesting likely causes based on temporal proximity. Finally, they generate causal hypotheses ranked by likelihood, providing engineers with actionable starting points for investigation.
Architecture for RCA
- Anomaly detection: Identify unusual pattern
- Signal correlation: Which metrics changed together? In what order?
- Timeline: Build timeline of events (metrics, logs, config changes)
- Similar incidents: Retrieve similar past incidents from database
- Hypothesis generation: Propose likely causes
- Explanation: Generate human-readable explanation
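The timeline step above can be sketched as a simple temporal-precedence ranking; the signal names and timestamps below are hypothetical:

```python
def rank_candidate_causes(first_anomalous_at):
    """Rank signals by when they first became anomalous: earlier
    deviations are more likely to be causes than symptoms."""
    return sorted(first_anomalous_at, key=first_anomalous_at.get)

# Hypothetical incident timeline (minutes since midnight).
timeline = {
    "api.latency": 122,        # symptom, appeared last
    "db.cpu": 120,             # deviated first → likely root cause
    "db.lock_wait_time": 121,
}
print(rank_candidate_causes(timeline))
# → ['db.cpu', 'db.lock_wait_time', 'api.latency']
```

Temporal precedence alone is not proof of causation, but it is a cheap, effective first filter before the heavier causal-inference methods discussed later.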
Example: API Latency Spike
Observed: API latency jumped from 50ms to 500ms at 2am.
RCA Process:
- Anomaly detected: Latency > 3x baseline
- Correlations: Database CPU also spiked; query count unchanged
- Timeline: Database CPU spike occurred 2 minutes before latency spike
- Hypothesis 1: Slow database queries (query count normal, but duration increased)
- Investigation: Check slow query log → find expensive query running
- Root cause: A data migration job ran at 2am, locking tables
- Remediation: Kill migration job; reschedule for lower-traffic time
A system that automates this diagnosis reduces MTTR from 30 minutes (human detective work) to 2 minutes (system analysis + human confirmation).
Incident Automation and Remediation
Beyond diagnosis, systems can automatically remediate common incidents. Automated actions include restarting unhealthy services when health checks fail three consecutive times, scaling up by adding servers when CPU exceeds 80\% for a sustained period, cleaning up disk space when usage approaches capacity, and rolling back deployments when error rates spike immediately after release.
Safety in Automated Remediation
Automated actions must incorporate multiple safety mechanisms. Conservative decision-making ensures actions are only taken when diagnosis confidence is high. Reversibility requires that all automated actions can be easily undone: restarting services is safe, but deleting data is not. Bounded execution limits the frequency of actions, preventing repeated restarts that could worsen problems. Human oversight provides notification before action execution, allowing cancellation within a grace period. Comprehensive audit logs record all automated actions for post-incident review and continuous improvement.
Log Parsing and Understanding
Logs are semi-structured: same template with variable values. Example:
[2024-01-30 10:23:45] [ERROR] Connection timeout to user_service:8080 after 5000ms
[2024-01-30 10:23:46] [ERROR] Connection timeout to user_service:8080 after 5000ms
[2024-01-30 10:23:47] [ERROR] Connection timeout to user_service:8080 after 5000ms
Log template: [TIME] [LEVEL] Connection timeout to HOST:PORT after \{DURATION\}ms
Log Parsing Models
Neural models can parse logs:
- Tokenization: Split log into tokens
- Classification: Each token is a constant or variable (e.g., ``timeout'' is constant; ``5000'' is variable)
- Template extraction: Infer template from multiple logs
- Clustering: Group logs by template; identify new template types
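A regex-based stand-in for the steps above (masking obvious variable tokens and grouping by the resulting template); real neural parsers learn the constant/variable split rather than relying on hand-written patterns:

```python
import re
from collections import Counter

def to_template(log_line):
    """Crude template extraction: mask timestamps, host:port values,
    and numbers as variables."""
    line = re.sub(r"\b\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\b", "<TIME>", log_line)
    line = re.sub(r"\b[\w.]+:\d+\b", "<HOST:PORT>", line)
    line = re.sub(r"\d+", "<NUM>", line)
    return line

logs = [
    "[2024-01-30 10:23:45] [ERROR] Connection timeout to user_service:8080 after 5000ms",
    "[2024-01-30 10:23:46] [ERROR] Connection timeout to user_service:8080 after 5000ms",
    "[2024-01-30 10:23:47] [ERROR] Connection timeout to user_service:8081 after 3000ms",
]
templates = Counter(to_template(l) for l in logs)
print(templates.most_common(1)[0])
# → ('[<TIME>] [ERROR] Connection timeout to <HOST:PORT> after <NUM>ms', 3)
```

All three logs collapse to one template despite differing ports and durations, which is exactly what makes template counts a useful signal for detecting new or surging log types.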
Anomalous Log Detection
A model can flag unusual logs:
- Encode log as sequence of tokens
- Compute probability under learned language model
- Flag logs with very low probability (likely anomalous)
Example: [ERROR] Connection timeout to user\_service:999 after -5000ms
This is anomalous (negative duration) and would have low probability.
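A minimal sketch of this idea using a unigram token model in place of a learned language model (the smoothing constant and log lines are illustrative):

```python
import math
from collections import Counter

def train_unigram(logs):
    """Count token frequencies across a corpus of normal logs."""
    counts = Counter(tok for line in logs for tok in line.split())
    return counts, sum(counts.values())

def avg_log_prob(line, counts, total, smoothing=0.5):
    """Per-token average log-probability under the unigram model;
    a production system would use a learned sequence model instead."""
    toks = line.split()
    vocab = len(counts) + 1
    lp = sum(math.log((counts[t] + smoothing) / (total + smoothing * vocab))
             for t in toks)
    return lp / len(toks)

normal = ["[ERROR] Connection timeout to user_service:8080 after 5000ms"] * 100
counts, total = train_unigram(normal)
seen = avg_log_prob("[ERROR] Connection timeout to user_service:8080 after 5000ms",
                    counts, total)
odd = avg_log_prob("[ERROR] Connection timeout to user_service:999 after -5000ms",
                   counts, total)
print(seen > odd)  # → True: the malformed log is less probable
```

A unigram model only catches unseen tokens; a sequence model additionally catches familiar tokens in unfamiliar orders, which is why learned language models are preferred in practice.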
Configuration and Policy Compliance
Infrastructure configuration models provide several capabilities. They parse and understand configurations, extracting what systems declare about desired state. They perform policy compliance checks, verifying that configurations adhere to organizational policies such as replica limits for cost control. They suggest improvements aligned with best practices, recommending additions like health checks, resource limits, and proper labeling.
Configuration Language Models
Configuration language models fine-tuned on infrastructure code provide intelligent assistance. They suggest missing configurations that improve reliability and observability, such as health checks, resource limits, and metadata labels. They detect risky patterns including overly permissive security groups, missing backup configurations, and single points of failure. They propose refactoring opportunities to deduplicate code, extract reusable modules, and improve maintainability.
AIOps: AI-Powered IT Operations (2024-2025)
AIOps (Artificial Intelligence for IT Operations) has emerged as a comprehensive framework for applying AI and machine learning to IT operations, moving beyond isolated anomaly detection to integrated, intelligent operations management. As of 2024-2025, AIOps platforms have matured significantly, incorporating advances in causal inference, automated remediation, and predictive maintenance.
AIOps Platform Architecture
Modern AIOps platforms integrate multiple AI capabilities into unified systems:
- Data ingestion and correlation: collects and correlates data from multiple sources (metrics, logs, traces, events, configurations, and tickets) in real time, building a unified timeline of system state and changes.
- Intelligent anomaly detection: multi-signal analysis correlates anomalies across different data types, reducing false positives by 60-80\% compared to single-signal detection through context-aware analysis.
- Causal inference for root cause analysis: uses causal graphs and do-calculus to distinguish correlation from causation, determining whether metric A causes metric B, B causes A, or both are caused by a hidden factor C.
- Predictive failure detection: trains machine learning models on historical failure patterns to recognize precursor signals like gradual memory leaks, increasing error rates, and degrading performance, identifying early warning signs hours or days before complete failure.
- Automated remediation and self-healing: automatically executes remediation actions for common failure patterns, including restarting unhealthy services, scaling resources, rerouting traffic, and rolling back deployments, while implementing safety constraints to prevent automated actions from causing additional problems.
- Incident management and collaboration: integrates with systems like PagerDuty and ServiceNow to automatically create tickets, assign to appropriate teams, suggest runbooks, track resolution, and provide collaboration tools for distributed teams responding to incidents.
Causal Inference for Root Cause Analysis
Traditional RCA relies on correlationâif metric A and metric B change together, assume relationship. However, correlation doesn't imply causation. Causal inference methods provide more accurate diagnosis:
Causal graph construction: Build directed acyclic graph (DAG) representing causal relationships between system components. Nodes are services, metrics, or resources. Edges represent causal dependencies (service A calls service B, CPU affects latency).
Causal discovery algorithms: Automatically learn causal graphs from observational data using algorithms like PC (Peter-Clark), FCI (Fast Causal Inference), or GES (Greedy Equivalence Search). These algorithms use conditional independence tests to infer causal structure.
Interventional analysis: When anomaly occurs, use causal graph to identify root causes through interventional reasoning. If intervening on metric A would fix metric B, then A is likely causing B. This is formalized through do-calculus and counterfactual reasoning.
Example: API latency spike occurs. Correlation analysis shows database CPU also spiked. Causal analysis determines: (1) Database CPU spike occurred 2 minutes before API latency spike (temporal precedence), (2) API latency is conditionally dependent on database CPU given other factors (statistical dependence), (3) Intervening on database CPU (reducing load) would fix API latency (interventional test). Conclusion: Database CPU is root cause. Action: Investigate database queries, find expensive query, optimize or kill it.
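A toy version of the causal-graph step: given an assumed DAG and the set of anomalous nodes, the most upstream anomalous nodes (those with no anomalous causal parent) are the root-cause candidates. The graph below hard-codes the example's topology:

```python
# Toy causal graph: edges point from cause to effect (assumed topology).
EDGES = {
    "migration_job": ["db_cpu"],
    "db_cpu": ["api_latency"],
    "api_latency": [],
}

def root_causes(anomalous, edges):
    """Among anomalous nodes, keep those with no anomalous causal
    parent: the most upstream deviations in the graph."""
    parents = {n: set() for n in edges}
    for cause, effects in edges.items():
        for e in effects:
            parents[e].add(cause)
    return [n for n in anomalous if not (parents[n] & set(anomalous))]

print(root_causes({"db_cpu", "api_latency"}, EDGES))  # → ['db_cpu']
```

Here api_latency is filtered out because its anomalous parent db_cpu explains it; in a real platform the graph itself would be learned by causal discovery rather than hard-coded.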
Predictive Failure Detection and Preventive Maintenance
Rather than reacting to failures, predict and prevent them:
Failure precursor detection trains models on historical failure data to identify patterns that precede failures. Common precursors include gradual memory leaks where memory usage increases 1\% daily over weeks, increasing error rates that grow from 0.1\% to 0.5\% over days, degrading performance with latency increasing 10\% weekly, and resource exhaustion trends as disk usage approaches 90\%.
Time-to-failure prediction: Predict not just that failure will occur, but when. This enables scheduling maintenance during low-traffic periods rather than emergency response during peak hours. Use survival analysis and time-series forecasting to estimate time-to-failure distributions.
Preventive actions: When failure is predicted with high confidence and sufficient lead time, take preventive actions:
- Schedule maintenance window for service restart
- Gradually drain traffic from at-risk instances
- Provision additional capacity before resource exhaustion
- Alert teams to investigate and fix underlying issues
Implementation Considerations: Predictive models require careful calibration. False positives (predicting failures that don't occur) waste resources on unnecessary maintenance. False negatives (missing actual failures) defeat the purpose. Target 80-90\% precision and 70-80\% recall, with lead times of 4-24 hours for actionable predictions.
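Time-to-failure estimation for steadily trending resources can be sketched with a least-squares extrapolation (a stand-in for full survival analysis; the disk-usage series is synthetic):

```python
def hours_until_threshold(samples, threshold=90.0):
    """Fit a least-squares line to recent disk-usage samples (one per
    hour) and extrapolate when usage crosses `threshold` percent.
    Returns None if usage is flat or falling."""
    n = len(samples)
    xs = range(n)
    mx, my = (n - 1) / 2, sum(samples) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, samples))
    slope = sxy / sxx
    if slope <= 0:
        return None
    return max(0.0, (threshold - samples[-1]) / slope)

# Hypothetical: disk usage climbing 0.5% per hour, currently at 80%.
usage = [75 + 0.5 * h for h in range(11)]
print(hours_until_threshold(usage))  # → 20.0 hours until 90%
```

A 20-hour lead time is comfortably inside the 4-24 hour actionable window, so maintenance can be scheduled rather than paged.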
Automated Remediation and Self-Healing Systems
AIOps platforms increasingly incorporate automated remediation, moving from detection to resolution:
Runbook automation: Codify common remediation procedures as executable runbooks. When specific failure patterns are detected, automatically execute appropriate runbooks. Examples:
- Service health check failure → Restart service
- High CPU usage → Scale out (add instances)
- Disk space exhaustion → Clean up old logs
- Database connection pool exhaustion → Increase pool size
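At its core, runbook dispatch is a pattern-to-action mapping plus a human fallback; the pattern and action names below are hypothetical placeholders for real orchestration calls:

```python
# Hypothetical action names; a real system would invoke orchestration APIs.
RUNBOOKS = {
    "health_check_failure": "restart_service",
    "high_cpu": "scale_out",
    "disk_space_exhausted": "clean_old_logs",
    "db_pool_exhausted": "increase_pool_size",
}

def dispatch(failure_pattern):
    """Map a detected failure pattern to its remediation runbook,
    falling back to paging a human for unknown patterns."""
    return RUNBOOKS.get(failure_pattern, "page_on_call_engineer")

print(dispatch("high_cpu"))            # → scale_out
print(dispatch("novel_failure_mode"))  # → page_on_call_engineer
```

The explicit fallback is the important design choice: automation handles the known patterns, and everything else escalates to a human rather than guessing.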
Reinforcement learning for remediation: Use reinforcement learning to learn optimal remediation strategies. The agent observes system state, takes actions (restart, scale, reroute), and receives rewards (system recovery, minimal disruption). Over time, the agent learns which actions work best for different failure scenarios.
Safety constraints and human oversight: Automated remediation must be safe:
- Whitelist of allowed actions (only safe, reversible actions)
- Rate limiting (max 1 restart per service per hour)
- Human approval for high-risk actions (database restarts, traffic rerouting)
- Automatic rollback if action makes situation worse
- Complete audit logs of all automated actions
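Two of these constraints, the action whitelist and per-service rate limiting, can be sketched as a guard object (the names and the one-hour window are illustrative assumptions):

```python
import time

# Whitelist of safe, reversible actions; anything else needs a human.
SAFE_ACTIONS = {"restart_service", "scale_out", "clean_old_logs"}

class RemediationGuard:
    """Enforce an action whitelist and at most one automated action
    per service per time window."""
    def __init__(self, window_seconds=3600):
        self.window = window_seconds
        self.last_action = {}  # service → timestamp of last action

    def allow(self, service, action, now=None):
        now = time.time() if now is None else now
        if action not in SAFE_ACTIONS:
            return False  # high-risk action: require human approval
        last = self.last_action.get(service)
        if last is not None and now - last < self.window:
            return False  # rate limit: max one action per window
        self.last_action[service] = now
        return True

guard = RemediationGuard()
print(guard.allow("payments", "restart_service", now=0))    # → True
print(guard.allow("payments", "restart_service", now=60))   # → False (rate limit)
print(guard.allow("payments", "drop_database", now=9999))   # → False (not whitelisted)
```

Audit logging and automatic rollback would wrap around this guard; the key property is that the default answer is "no" unless every constraint passes.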
Adoption Challenges: Engineers are often hesitant to trust automated remediation due to fear of automation causing additional problems. Gradual adoption is key: start with safest actions (service restarts), build trust through demonstrated reliability, gradually expand to more complex actions. Maintain human oversight and easy override mechanisms.
AIOps Platform Vendors and Ecosystem (2024-2025)
The AIOps market has matured significantly, with several established platforms:
Commercial platforms:
- Datadog AIOps: Integrated with Datadog's monitoring platform. Strong anomaly detection and correlation. \$500-5,000/month depending on scale.
- Splunk IT Service Intelligence (ITSI): Enterprise-focused. Advanced analytics and machine learning. \$10,000-100,000+/year.
- Dynatrace Davis AI: Automated root cause analysis and predictive analytics. Strong causal inference capabilities. \$5,000-50,000+/month.
- Moogsoft: Specializes in event correlation and noise reduction. Reduces alert volume by 90\%. \$2,000-20,000/month.
Open-source and custom solutions:
- Many large tech companies (Google, Facebook, Amazon) build custom AIOps platforms tailored to their infrastructure
- Open-source components: Prometheus (metrics), Elasticsearch (logs), Jaeger (traces), Grafana (visualization)
- ML frameworks: TensorFlow, PyTorch for custom anomaly detection and RCA models
Selection criteria:
- Scale: Can platform handle your data volume (billions of events daily)?
- Integration: Does it integrate with your existing monitoring tools?
- Customization: Can you train custom models on your data?
- Cost: Total cost of ownership (licensing + infrastructure + personnel)
- Vendor lock-in: Can you migrate to alternative platforms if needed?
Future Directions and Research Frontiers
AIOps continues to evolve rapidly. Emerging trends as of 2024-2025:
Large language models for operations: Using LLMs (GPT-4, Claude) to understand natural language incident descriptions, generate remediation suggestions, and explain system behavior to engineers. Early results show 30-40\% improvement in incident response time when engineers have LLM assistants.
Federated learning for cross-organization insights: Multiple organizations collaboratively train AIOps models without sharing sensitive data. This enables learning from broader failure patterns while preserving privacy. Particularly valuable for industry-specific platforms (healthcare, finance).
Quantum-inspired optimization for resource allocation: Using quantum-inspired algorithms to optimize resource allocation and capacity planning. Early research shows 10-20\% improvement in resource utilization compared to classical optimization.
Explainable AI for operations: Improving interpretability of AIOps decisions. Engineers need to understand why the system flagged an anomaly or suggested a remediation. Research focuses on attention visualization, counterfactual explanations, and natural language generation for explanations.
Case Study: Intelligent Alerting and Incident Response
A SaaS company operates a large distributed system: 500 microservices, 10,000 servers. Manual monitoring is infeasible.
System Design
The system design handles massive scale with 10 billion events per day including logs, metrics, and traces. Storage uses specialized databases: Elasticsearch for logs, Prometheus for metrics, and Jaeger for traces. The models include an anomaly detector using a multivariate transformer on key metrics, an RCA engine that correlates metrics and queries logs to match similar incidents, and a recommendation engine that suggests fixes based on diagnosis.
Workflow
- System detects anomaly in real-time
- RCA engine analyzes metrics and logs
- System generates incident summary: ``API latency spike in payment service. Database CPU also elevated. Similar to 3 prior incidents.''
- Suggested actions: ``Restart payment service or check database slow query log''
- On-call engineer reviews suggestion; approves auto-restart
- Incident resolved in 2 minutes (vs. 30 minutes manual detective work)
Results
Performance metrics demonstrate the system's effectiveness. Detection latency averages 2 minutes, faster than humans typically notice issues. MTTR is reduced to 5 minutes with auto-remediation compared to 30 minutes for manual resolution. The false positive rate is 5\%.
Business impact is substantial across multiple dimensions: uptime improved from its prior 99.98\% level, and faster resolution reduced both downtime costs and on-call workload.
Model Maintenance and Drift in Observability Systems
Observability AI systems face a paradoxical drift challenge: they must monitor systems that are constantly evolving while themselves remaining stable and reliable. Infrastructure changes continuously: new services are deployed, traffic patterns shift, hardware is upgraded, and architectures evolve. Each change alters the "normal" behavior that anomaly detection models learn, causing drift. Yet observability systems cannot afford frequent retraining downtime or accuracy degradation; they must remain operational 24/7, detecting anomalies even as the definition of "normal" changes.
The business consequences of observability drift are severe and immediate. When anomaly detection models drift, two problems occur simultaneously: (1) false positives increase, meaning normal behavior is flagged as anomalous, generating alert fatigue and eroding engineer trust; (2) false negatives increase, meaning actual failures go undetected, causing prolonged outages and revenue loss. A 10\% increase in false positives might generate 50 additional false alerts daily, waking on-call engineers unnecessarily and causing burnout. A 10\% increase in false negatives might miss 2-3 critical incidents monthly, each costing \$100K-1M in downtime.
The challenge is compounded by the meta-monitoring problem: observability systems monitor other systems, but who monitors the observability system? If the anomaly detector itself fails or drifts during an outage, engineers lose visibility precisely when they need it most. This creates a requirement for extreme reliabilityâobservability systems must be more reliable than the systems they monitor, typically targeting 99.99\%+ uptime and <1\% error rates.
Domain-Specific Drift Patterns in Observability
Observability drift manifests in several distinct ways, each requiring different detection and mitigation strategies:
Infrastructure evolution and service changes. Modern infrastructure evolves rapidly through continuous deployment. New services are added, old services are deprecated, service dependencies change, and architectures are refactored. Each change alters system behavior and the patterns that anomaly detection models have learned. A model trained when the system had 300 services may perform poorly when it has 500 services with different interaction patterns.
Example: A company migrates from monolithic architecture to microservices over 6 months. The monolith had predictable resource usage patterns. Microservices have different patternsâmore network traffic, different latency distributions, cascading failures. Anomaly detection models trained on monolith data generate thousands of false positives on microservices, causing alert fatigue. Models require complete retraining on microservices data.
Traffic pattern drift and load changes. User traffic patterns change over time due to business growth, seasonal variations, marketing campaigns, and user behavior evolution. A model trained when daily traffic was 1M requests may consider 2M requests anomalous, even though it's normal growth. Seasonal patterns shiftâholiday traffic, back-to-school, tax season. Marketing campaigns create sudden traffic spikes that look like attacks but are legitimate.
Example: E-commerce platform experiences 10x traffic spike during Black Friday. Anomaly detection models trained on normal traffic flag this as attack, triggering rate limiting that blocks legitimate customers. Models must learn that Black Friday traffic patterns are normal, not anomalous. This requires either seasonal models or adaptive baselines that adjust to traffic growth.
Deployment and configuration drift. Every deployment changes system behavior slightly. New code has different performance characteristics, resource usage, and failure modes. Configuration changes (feature flags, scaling parameters, database settings) alter behavior. Models must distinguish between expected changes from deployments and unexpected failures.
The challenge is that deployments sometimes introduce bugs that cause failures. Models must detect deployment-related failures while not flagging every deployment as anomalous. This requires understanding deployment contextâif latency increases after deployment, it might be a bug; if it increases randomly, it's likely a failure.
Hardware and infrastructure drift. Hardware ages and degrades. Disks slow down, network cards fail intermittently, CPUs throttle due to heat. Cloud providers change instance types, network configurations, and availability zones. These hardware changes alter performance characteristics that models have learned. A model trained on fast SSDs may consider slow HDD performance anomalous, even though HDDs are now the standard.
Monitoring system changes. The observability infrastructure itself evolves. Metrics are added, removed, or renamed. Log formats change. Sampling rates adjust. Monitoring agents are upgraded. Each change affects the data that models consume, potentially causing drift. A model trained on one log format may fail when log format changes.
Example: Company upgrades logging library, changing log format from plain text to JSON. Log parsing models trained on plain text fail completely on JSON logs. Models require immediate retraining or format-agnostic parsing.
Concept drift in failure patterns. Failure modes evolve as systems mature. Early in a system's life, failures are often configuration errors or resource exhaustion. As the system matures, failures become more subtle: race conditions, memory leaks, cascading failures. Models trained on early failures may not recognize mature failure patterns.
Alert fatigue and threshold drift. As engineers respond to alerts, they adjust thresholds to reduce false positives. This threshold drift changes what the system considers anomalous. Additionally, engineers become desensitized to frequent alerts (alert fatigue), effectively raising their personal threshold for action. Models must adapt to these changing expectations.
Key observability-specific strategies beyond the generic framework include:
- Online learning with conservative updates: Use exponential moving averages (e.g.\ decay factor 0.95) for baselines, allowing gradual adaptation to traffic growth while resisting sudden anomalous spikes.
- Deployment-aware anomaly detection: Integrate CI/CD deployment events to temporarily relax thresholds during expected change windows, reducing false positives from legitimate deployments.
- Multi-signal correlation: Correlate metrics, logs, traces, and deployment events rather than analyzing signals in isolation---multi-signal anomalies are far more likely to be real issues, reducing false positives by 50--70\%.
- Synthetic incident injection: Inject synthetic failures in test environments to verify detection systems catch them---analogous to fire drills for monitoring infrastructure.
- Adaptive percentile-based thresholds: Use percentile thresholds (e.g.\ alert if metric exceeds 99th percentile of recent values) rather than absolute thresholds, automatically adjusting to growth and seasonal patterns.
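The first and last strategies above can be sketched together: an exponential moving average baseline that tracks traffic growth, and a percentile threshold that rises with it (synthetic linear-growth traffic; the decay and percentile values follow the bullets above):

```python
def ema_baseline(values, decay=0.95):
    """Exponential moving average: adapts gradually to growth while
    damping sudden anomalous spikes."""
    baseline = values[0]
    out = []
    for v in values:
        baseline = decay * baseline + (1 - decay) * v
        out.append(baseline)
    return out

def percentile_threshold(recent, pct=0.99):
    """Alert threshold at the given percentile of recent values, so it
    automatically adjusts to growth and seasonal patterns."""
    s = sorted(recent)
    return s[min(len(s) - 1, round(pct * len(s)))]

# Traffic steadily doubling: both baseline and threshold follow the growth.
traffic = [1000 + 10 * t for t in range(200)]
print(round(ema_baseline(traffic)[-1]))  # → 2800 (tracks ~2990 with EMA lag)
print(percentile_threshold(traffic))     # → 2980
```

Neither mechanism needs retraining as traffic grows, which is exactly the property that makes them robust to the traffic-pattern drift described above.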
Exercises
Solutions
Full solutions for all exercises are available at \url{https://deeplearning.hofkensvermeulen.be}.
\itshape Data:
- Metric: CPU usage over 1 month (1 sample per minute; 44,640 samples)
- Anomalies: 10 known incidents (planned maintenance excluded)
- Train/test: 3 weeks train, 1 week test with anomalies
\itshape Model:
- Input: 60 samples (1 hour history)
- Transformer encoder with positional encoding
- Output: Predict next 60 minutes of CPU usage
- Loss: MSE on predictions
\itshape Results:
- Prediction RMSE: 3\% (accurate forecasting)
- Anomaly detection (threshold on prediction error):
- At threshold = 10\% error: Detect 90\% of anomalies, 5\% false positive rate
- At threshold = 5\% error: Detect 75\% of anomalies, 1\% false positive rate
- Comparison to baseline (statistical method): Similar FPR at same sensitivity
- Advantage: Transformer captures complex patterns (daily seasonality, trends)
\itshape Practical deployment: Use threshold = 10\% error (catching 90\% of anomalies). Alert a human; the 5\% false positive rate requires threshold tuning or a feedback loop to reduce over time.
\itshape Data:
- 100K logs from 5 services
- Each log has template (constant structure) + variables
- Test set: 1K logs with unseen templates
\itshape Model:
- Token classifier: BPE tokenization, BERT classification (constant vs. variable per token)
- Template extraction: Logs with identical constant tokens grouped
- Clustering: New templates identified via similarity to known templates
\itshape Results:
- Template recovery: 95\% of logs assigned to correct template
- Unseen templates: 80\% detection (identify new templates not in training)
- False positives: 2\% (misassign variable token as constant; rare)
\itshape Improvements:
- Use domain knowledge (common log patterns) to improve accuracy
- Online learning: Continuously update templates as new logs arrive
- Hybrid: Combine regex-based parsing (for expected patterns) + neural parsing (for novel patterns)
\itshape Dataset:
- 500 historical incidents with:
- Metrics at time of incident (CPU, latency, memory, etc.)
- Root cause (labeled by on-call engineer)
- Logs from incident timeframe
- Test set: 50 held-out incidents
\itshape Model:
- Metric correlation: Given anomaly metrics, find correlated metrics
- Similar incident retrieval: Embed current incident; find similar past incidents
- RCA generation: Based on similar incidents + metric correlations, predict root cause
- Confidence scoring: How confident is this diagnosis?
\itshape Evaluation:
- Exact match: Predicted root cause matches labeled root cause. 65\% accuracy.
- Top-3 accuracy: Predicted root cause in top 3 suggestions. 88\% accuracy.
- Confidence calibration: When the model says 80\% confident, is accuracy actually approximately 80\%?
\itshape Practical use:
- Display top-3 suggestions to engineer (not top-1, to avoid over-reliance)
- Engineer picks most relevant suggestion
- System learns from engineer feedback; retrains monthly