CSCI 5942: AI Engineering
Instructor: Christoffer Heckman
Email: christoffer.heckman@colorado.edu
This new, required course is the cornerstone of the three-course foundational core (CSCI 5622 ML -> CSCI 5922 DL -> CSCI 5942 AI Engineering) for the residential MS-AI program and the AI focus within the CS MS degree. It serves as the hands-on, systems-focused component of that core and is designed to be the gateway to the advanced 6000-level specialization tracks.
Course Description
Analyzes and engineers the trade-offs inherent in large-scale AI systems. This course connects model architecture design with the practical, distributed systems required to train and serve those models. Covers the full engineering lifecycle, including data curation, distributed training, inference optimization, and scaling large language models.
Prerequisites
Prerequisite: CSCI 5622 Machine Learning or its equivalent
Corequisite: CSCI 5922 Deep Learning
Recommended Prerequisites and Corequisites:
- Strong programming proficiency in Python.
- Experience with a modern ML framework (e.g., PyTorch).
- Proficiency with git and command-line environments.
Topics Covered
- AI Engineering Lifecycle: Co-design of models, data, and infrastructure; system vs. model metrics; "Hidden Technical Debt"; MLOps frameworks (e.g., Ray).
- Data Engineering & Curation: Data-centric AI; versioning and storage patterns; data pipelines and orchestration (e.g., cedar, Modyn); programmatic labeling.
- Model Design & Training Systems: Architectural trade-offs (e.g., Transformer vs. MoE); training systems (e.g., Docker, MLflow); optimizing compilers (e.g., TVM, MLIR).
- Distributed Training & Scaling: Parallelization strategies (e.g., PipeDream, Megatron); memory optimization (e.g., ZeRO-Infinity); checkpointing, resiliency, and datacenter scheduling.
- Model Optimization & Inference: Inference optimization (quantization, distillation, compilation); LLM serving (e.g., Orca); KV Caching and attention optimizations (e.g., PagedAttention, FlashAttention).
- Compound AI Systems & Alignment: Compound AI systems (e.g., Hermes RAG); large-scale fine-tuning and RLHF (e.g., DeepSeek-R1); security, privacy, and sustainability.
Course Readings
- Primary Text: Reddi, Vijay. Machine Learning Systems: Principles and Practices of Engineering Artificially Intelligent Systems. MIT Press, 2025.
- Supplemental Texts: Monarch, Robert. Human-in-the-Loop Machine Learning. Manning, 2021; Huyen, Chip. Designing Machine Learning Systems. O'Reilly Media, 2022.
- Additional Readings: A required list of seminal papers from industry (Google, Meta) and systems conferences (e.g., MLSys, OSDI).
Semester Grades
- Programming Assignments (4-5): 40%. A series of assignments focused on implementing components of the AI engineering lifecycle (e.g., "Build a Data-Versioned Pipeline," "Implement Parallelized Training," and "PEFT vs. FFT").
- Midterm Checkpoint: 20%. A take-home system design challenge requiring students to architect a solution to a real-world AI engineering problem.
- Final Project (Team-based): 40%. A capstone project where teams build, deploy, document, and present a full-stack, production-style AI application, evaluated on system robustness, scalability, and engineering best practices.
Course Outcomes
Upon successful completion of this course, students will be able to:
- Design and implement robust, versioned data pipelines for large-scale AI workflows.
- Engineer and scale the training of large models using distributed computing frameworks and experiment management tools.
- Deploy, serve, and optimize a complete AI model for low-latency, high-throughput inference.
- Critically analyze the system-level trade-offs between model architecture, hardware, and distributed algorithms for large-scale AI.