- Background
 - High-Level View
 - Empirical Studies
 - Silent Errors
 - Distributed Training
 - Diagnosis
 - Code Bug Testing
 - Monitoring
 - Model Behavior Testing
 - Fault Injection Tools
 - Industry Post Mortems
 
- MLSys: The New Frontier of Machine Learning Systems — Position paper outlining the co-design challenges/opportunities across ML, systems, and hardware.
 - AI Engineering Quick Start — Practical guide to end-to-end AI engineering workflows and best practices.
 
- Machine Learning Testing: Survey, Landscapes and Horizons, TSE 2020 — Comprehensive survey of ML testing techniques, tools, and open challenges.
 - Hidden Technical Debt in Machine Learning Systems, NeurIPS 2015 — Seminal discussion of non-obvious maintenance costs and systemic risks in ML systems.
 
- A First Look at Bugs in LLM Inference Engines, arXiv 2025 [Inference] — Early taxonomy and root causes of bugs in LLM inference engines across open-source stacks.
 - Towards Understanding Bugs in Distributed Training and Inference Frameworks for Large Language Models, arXiv 2025 [Training] [Inference] — Characterizes bug patterns in distributed LLM frameworks and offers mitigation guidance.
 - Characterization of Large Language Model Development in the Datacenter, NSDI 2024 [Training] — Cluster-scale study of LLM development workloads and bottlenecks at Shanghai AI Lab.
 - An Empirical Study on Low GPU Utilization of Deep Learning Jobs, ICSE 2024 [Training] — Identifies causes of low GPU utilization in DL jobs and suggests practical optimizations.
 - Toward Understanding Deep Learning Framework Bugs, TOSEM 2023 [Kernels] — Analyzes bug types, triggers, and impact across major DL frameworks.
 - Are Machine Learning Cloud APIs Used Correctly?, ICSE 2021 [Inference] — Studies real-world misuse patterns of ML cloud APIs and their consequences.
 - A Comprehensive Empirical Study on Bug Characteristics of Deep Learning Frameworks, IST 2021 [Kernels] — Large-scale analysis of DL framework bug reports to extract categories and trends.
 - An Empirical Study on Program Failures of Deep Learning Jobs, ICSE 2020 [Training] — Characterizes failure modes in production DL jobs at Microsoft, focusing on exception-throwing failures.
 - An Empirical Study of Common Challenges in Developing Deep Learning Applications, IEEE Software 2020 [Training] [Inference] — Surveys and categorizes practical challenges in building correct and accurate DL apps.
 - Taxonomy of Real Faults in Deep Learning Systems, ICSE 2019 [Training] [Inference] — Builds a taxonomy of real-world DL faults with testing implications.
 
- Understanding Silent Data Corruption in LLM Training, arXiv 2025 [Training] — Amazon's study of SDC causes, manifestations, and detection challenges in LLM training.
 - Silent Errors in Large-scale LLM Training: Challenges and Lessons Learned, 2025 [Training] — NVIDIA experience report on prevalence, sources, and mitigations for silent training errors (see the sketch after this list).
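
Silent data corruption is typically caught by some form of redundant recomputation and comparison. As a rough illustration (not code from either report), the PyTorch sketch below recomputes a matmul on a reference device in higher precision and flags disagreement beyond rounding noise; the function name and tolerances are invented for this example.

```python
import torch

def sdc_screen(x: torch.Tensor, w: torch.Tensor, reference_device: str = "cpu") -> bool:
    """Return True if the suspect result disagrees with a high-precision reference recomputation."""
    suspect = x @ w  # result from the device/kernel under suspicion
    reference = x.to(reference_device).double() @ w.to(reference_device).double()
    # A loose tolerance absorbs legitimate float32-vs-float64 rounding differences;
    # anything larger suggests corruption rather than ordinary numerics.
    return not torch.allclose(suspect.to(reference_device).double(), reference, rtol=1e-3, atol=1e-5)

if __name__ == "__main__":
    torch.manual_seed(0)
    x, w = torch.randn(64, 128), torch.randn(128, 256)
    print("possible SDC:", sdc_screen(x, w))  # expected: False on healthy hardware
```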
 
- Training with Confidence: Catching Silent Errors in Deep Learning Training with Automated Proactive Checks, OSDI 2025 [Training] — TrainCheck automatically infers training invariants and proactively flags silent correctness errors (a hand-written invariant check is sketched after this list).
 - XPUTIMER: Anomaly Diagnostics for Divergent LLM Training in GPU Clusters of Thousand-Plus Scale, arXiv 2025 [Training] — Real-time anomaly diagnostics tailored for large-scale distributed LLM training.
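
TrainCheck's core idea is checking invariants that should hold throughout training. The PyTorch sketch below hand-codes two such invariants as an illustration (the loss stays finite; parameters with nonzero gradients actually change after an optimizer step). TrainCheck itself infers invariants automatically and checks far richer properties, so treat this only as a flavor of the approach.

```python
import torch

def check_step_invariants(model: torch.nn.Module, before: dict, loss: torch.Tensor) -> None:
    """Illustrative invariants: loss stays finite; params with nonzero grads actually change."""
    assert torch.isfinite(loss), "silent error: non-finite loss"
    for name, param in model.named_parameters():
        if param.grad is not None and param.grad.abs().sum() > 0:
            assert not torch.equal(param, before[name]), (
                f"silent error: {name} had a nonzero gradient but was not updated"
            )

if __name__ == "__main__":
    torch.manual_seed(0)
    model = torch.nn.Linear(8, 1)
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    x, y = torch.randn(32, 8), torch.randn(32, 1)
    for _ in range(3):
        before = {n: p.detach().clone() for n, p in model.named_parameters()}
        loss = torch.nn.functional.mse_loss(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
        check_step_invariants(model, before, loss)  # raises if an invariant is violated
    print("all step invariants held")
```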
 
- TrainVerify: Equivalence-Based Verification for Distributed LLM Training, SOSP 2025 [Training] — Verifies distributed training by checking equivalence against a reference to catch silent errors (a toy equivalence check is sketched after this list).
 - TTrace: Lightweight Error Checking and Diagnosis for Distributed Training, arXiv 2025 [Training] — Low-overhead tracing to detect and localize errors in distributed training.
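
Equivalence-based checking compares a parallelized execution against a trusted single-device reference. The toy, single-process PyTorch sketch below imitates this with a manually column-sharded linear layer validated against the unsharded result; the systems above operate on full parallelization strategies at scale, so everything here is illustrative.

```python
import torch

def sharded_linear(x: torch.Tensor, w: torch.Tensor, shards: int = 2) -> torch.Tensor:
    """Column-parallel split: each simulated rank holds a slice of the output dimension."""
    outputs = [x @ w_shard for w_shard in w.chunk(shards, dim=1)]
    return torch.cat(outputs, dim=1)  # stands in for the all-gather of a real tensor-parallel layer

if __name__ == "__main__":
    torch.manual_seed(0)
    x, w = torch.randn(16, 64), torch.randn(64, 32)
    reference = x @ w                # single-device reference execution
    parallel = sharded_linear(x, w)  # "distributed" execution under test
    assert torch.allclose(parallel, reference, rtol=1e-5, atol=1e-6), "divergence from reference"
    print("sharded execution matches the reference")
```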
 
- Defeating Nondeterminism in LLM Inference, Blog 2025 [Inference] [Kernels] — Thinking Machines blog post on eliminating nondeterministic inference results by making kernels "batch invariant" (a minimal probe is sketched after this list).
 - PerfTracker: Online Performance Troubleshooting for Large-scale Model Training in Production, arXiv 2025 [Training] — Alibaba Cloud system for online localization of training performance bottlenecks and regressions.
 - Emerging Platforms Meet Emerging LLMs: A Year-Long Journey of Top-Down Development, arXiv 2024 [Training] — Case study on co-designing software and hardware platforms to support rapidly evolving LLMs.
 - Debugging Machine Learning Pipelines, DEEM 2019 [Training] — Uses decision trees over historical runs to localize ML pipeline performance anomalies.
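
The batch-invariance issue from the Thinking Machines post can be probed with a few lines of PyTorch: run an input alone and again inside a larger batch, then compare the two outputs exactly. The probe below is a generic sketch rather than code from the post, and whether it reports invariance depends on the backend and kernels in use.

```python
import torch

def is_batch_invariant(model: torch.nn.Module, single: torch.Tensor, filler: torch.Tensor) -> bool:
    """Check that one input's output is identical whether computed alone or inside a batch."""
    model.eval()
    with torch.no_grad():
        alone = model(single)                                    # batch of size 1
        batched = model(torch.cat([single, filler], dim=0))[:1]  # same row inside a larger batch
    return torch.equal(alone, batched)  # exact equality rather than allclose

if __name__ == "__main__":
    torch.manual_seed(0)
    model = torch.nn.Sequential(torch.nn.Linear(32, 64), torch.nn.ReLU(), torch.nn.Linear(64, 8))
    single, filler = torch.randn(1, 32), torch.randn(63, 32)
    print("batch invariant:", is_batch_invariant(model, single, filler))
```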
 
- CRADLE: Cross-Backend Validation to Detect and Localize Bugs in Deep Learning Libraries, ICSE 2019 [Kernels] — Differential testing across DL backends to expose and localize inconsistencies (a miniature differential test is sketched after this list).
 - A Static Analyzer for Detecting Tensor Shape Errors in Deep Neural Network Training Code, ICSE 2022 [Training] — Abstract-interpretation-based static analysis to detect tensor shape errors.
 - AutoTrainer: An Automatic DNN Training Problem Detection and Repair System, ICSE 2021 [Training] — Detects training issues and applies automated repairs to improve convergence.
 - Reliability Assurance for Deep Neural Network Architectures against Numerical Defects, ICSE 2023 [Kernels] — Identifies and mitigates numerical instability (e.g., NaNs/overflow) in DNN computation.
 - NeuRI: Diversifying DNN Generation via Inductive Rule Inference, FSE 2023 [Kernels] — Generates diverse DNNs via learned transformation rules to boost test coverage.
 - NNSmith: Generating Diverse and Valid Test Cases for Deep Learning Compilers, ASPLOS 2023 [Kernels] — Synthesizes semantically valid models to stress and validate DL compilers.
 - Fuzzing Automatic Differentiation in Deep-Learning Libraries, ICSE 2023 [Kernels] — Fuzzes autodiff implementations to reveal gradient calculation bugs.
 - Fuzzing Deep-Learning Libraries via Automated Relational API Inference, FSE 2022 [Kernels] — Infers API relations to generate relational checks that uncover defects.
 - Free Lunch for Testing: Fuzzing Deep-Learning Libraries from Open Source, ICSE 2022 [Kernels] — Leverages OSS artifacts to derive oracles and fuzz tests for DL libraries.
 - Automated Testing of Software that Uses Machine Learning APIs, ICSE 2022 [Inference] — Techniques for testing applications that integrate ML APIs and for handling API misuse.
 - Keeper: Automated Testing and Fixing of Machine Learning Software, TOSEM 2024 [Training] — End-to-end system to automatically generate tests and propose fixes for ML code.
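
Several of the papers above, CRADLE in particular, rely on differential testing: execute the same model along two paths and treat disagreement as a bug signal. The sketch below is a miniature stand-in that uses eager PyTorch versus TorchScript as the two execution paths; CRADLE itself compares Keras backends, and the tolerances here are arbitrary.

```python
import torch

def differential_test(model: torch.nn.Module, trials: int = 20, rtol: float = 1e-5, atol: float = 1e-6) -> int:
    """Run random inputs through two execution paths and count trials whose outputs disagree."""
    scripted = torch.jit.script(model)  # second execution path, standing in for another backend
    mismatches = 0
    for i in range(trials):
        x = torch.randn(4, 32)
        with torch.no_grad():
            if not torch.allclose(model(x), scripted(x), rtol=rtol, atol=atol):
                mismatches += 1
                print(f"trial {i}: eager and scripted outputs diverge")
    return mismatches

if __name__ == "__main__":
    torch.manual_seed(0)
    mlp = torch.nn.Sequential(torch.nn.Linear(32, 64), torch.nn.Tanh(), torch.nn.Linear(64, 10))
    print("mismatching trials:", differential_test(mlp))
```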
 
- Training with Confidence: Catching Silent Errors in Deep Learning Training with Automated Proactive Checks, OSDI 2025 [Training] — Proactive runtime monitoring via inferred invariants to detect silent training errors.
 - Self-Checking Deep Neural Networks in Deployment, ICSE 2021 [Inference] — Embeds runtime checks to detect anomalies and trigger self-tests during inference (a minimal runtime monitor is sketched after this list).
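
Runtime monitoring in this spirit can be as simple as forward hooks that validate activations as they are produced. The PyTorch sketch below is a generic illustration (not the mechanism from either paper) that raises as soon as any layer emits a non-finite value.

```python
import torch

def attach_finite_monitors(model: torch.nn.Module) -> list:
    """Attach forward hooks that flag non-finite activations at the offending layer."""
    handles = []

    def make_hook(name):
        def hook(module, inputs, output):
            if isinstance(output, torch.Tensor) and not torch.isfinite(output).all():
                raise RuntimeError(f"non-finite activation detected after layer '{name}'")
        return hook

    for name, module in model.named_modules():
        if len(list(module.children())) == 0:  # leaf modules only
            handles.append(module.register_forward_hook(make_hook(name)))
    return handles  # keep the handles; call h.remove() to detach a monitor

if __name__ == "__main__":
    torch.manual_seed(0)
    model = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.ReLU(), torch.nn.Linear(16, 4))
    attach_finite_monitors(model)
    model(torch.randn(8, 16))                     # clean input passes silently
    try:
        model(torch.full((8, 16), float("inf")))  # corrupted input trips the monitor
    except RuntimeError as e:
        print("monitor fired:", e)
```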
 
- DeepXplore: Automated Whitebox Testing of Deep Learning Systems, SOSP 2017 [Inference] — Introduces neuron coverage and differential testing to generate inputs and expose discrepancies (a coverage sketch follows this list).
 - Oracle Issues in Machine Learning and Where to Find Them, ICSEW 2020 [Data] — Detects issues in ML oracles/labels using entropy and semantic analysis.
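
DeepXplore's neuron-coverage metric counts a neuron as covered once its scaled activation exceeds a threshold on at least one test input. The PyTorch sketch below approximates the metric for Linear layers only; the original work targets other frameworks and defines scaling and thresholds somewhat differently, so this is a simplified illustration.

```python
import torch

def neuron_coverage(model: torch.nn.Module, inputs: torch.Tensor, threshold: float = 0.25) -> float:
    """Fraction of Linear-layer neurons whose scaled activation exceeds the threshold on some input."""
    activations, handles = {}, []

    def make_hook(name):
        def hook(module, inp, out):
            activations[name] = out.detach()
        return hook

    for name, module in model.named_modules():
        if isinstance(module, torch.nn.Linear):
            handles.append(module.register_forward_hook(make_hook(name)))
    with torch.no_grad():
        model(inputs)
    for h in handles:
        h.remove()

    covered = total = 0
    for out in activations.values():
        # Min-max scale per layer, then mark neurons firing above the threshold on any input in the batch.
        scaled = (out - out.min()) / (out.max() - out.min() + 1e-8)
        covered += int((scaled > threshold).any(dim=0).sum())
        total += out.shape[1]
    return covered / total

if __name__ == "__main__":
    torch.manual_seed(0)
    model = torch.nn.Sequential(torch.nn.Linear(20, 50), torch.nn.ReLU(), torch.nn.Linear(50, 10))
    print(f"neuron coverage: {neuron_coverage(model, torch.randn(100, 20)):.2%}")
```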
 
- NVBitFI: Dynamic Fault Injection for GPUs, DSN 2021 [Kernels] — Injects faults into GPU binaries to evaluate resilience and error propagation in DL workloads (a crude framework-level stand-in is sketched below).
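
NVBitFI injects faults at the GPU instruction level via binary instrumentation, and nothing below uses its interface. As a much cruder framework-level stand-in, the sketch flips a single bit of one float32 element in a tensor to emulate a transient fault, which lets you observe how the perturbation propagates through a model.

```python
import random
import struct
import torch

def flip_random_bit(t: torch.Tensor) -> torch.Tensor:
    """Return a copy of a float32 tensor with one randomly chosen bit flipped in one element."""
    flat = t.flatten().clone()
    idx = random.randrange(flat.numel())
    bits = struct.unpack("<I", struct.pack("<f", flat[idx].item()))[0]
    bits ^= 1 << random.randrange(32)  # flip one of the 32 bits of the IEEE-754 payload
    flat[idx] = struct.unpack("<f", struct.pack("<I", bits))[0]
    return flat.view_as(t)

if __name__ == "__main__":
    random.seed(0)
    torch.manual_seed(0)
    w = torch.randn(4, 4)
    w_faulty = flip_random_bit(w)
    print("max weight perturbation:", (w - w_faulty).abs().max().item())
```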
 
- Anthropic: A Postmortem of Three Recent Issues — Lessons and mitigations drawn from consecutive reliability incidents in Claude models during Aug 2025.