CSC 2541 Winter 2025: Large Models

Overview and Motivation

Large Models

Large language models have revolutionized artificial intelligence and machine learning. These models, trained on massive datasets, can generate human-like text and code and (apparently) engage in complex reasoning tasks. Driving these breakthroughs are two empirical findings: large models improve predictably with the amount of compute used to train them, and diverse capabilities emerge as the models see more data from the internet. These findings have motivated an immense industrial effort to build and deploy very large models.
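As a rough illustration of what "improve predictably with compute" means in practice, the Chinchilla analysis (Hoffmann et al., 2022, one of the presentation papers below) approximates training compute as C ≈ 6ND for a model with N parameters trained on D tokens and finds a compute-optimal token-to-parameter ratio of roughly 20. The short Python sketch below is only an illustration of that back-of-the-envelope sizing; the function name and the fixed 20:1 ratio are assumptions, not something prescribed by the course or the Llama 3 report.

# A minimal sketch of compute-optimal model sizing in the spirit of
# Hoffmann et al. (2022). It assumes the common approximation
# C ~= 6 * N * D for training FLOPs and a fixed tokens-per-parameter
# ratio of about 20; real scaling-law fits depend on data and architecture.

def compute_optimal_size(flops_budget, tokens_per_param=20.0):
    """Return (parameters N, training tokens D) for a given FLOP budget C."""
    # From C = 6 * N * D and D = tokens_per_param * N:
    #   C = 6 * tokens_per_param * N**2  =>  N = sqrt(C / (6 * tokens_per_param))
    n_params = (flops_budget / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

for c in (1e21, 1e23, 1e25):
    n, d = compute_optimal_size(c)
    print(f"C = {c:.0e} FLOPs -> ~{n:.2e} parameters, ~{d:.2e} tokens")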

The course will focus on understanding the practical aspects of large model training through an in-depth study of the Llama 3 technical report. We will cover the whole pipeline, from pre-training and post-training to evaluation and deployment.

Students will be expected to present a paper, prepare code notebooks, and complete a final project on a topic of their choice. While the readings are largely applied or methodological, theoretically-minded students are welcome to focus their project on a theoretical topic related to large models.

The course is heavily inspired by similar courses like CS336: Language Modeling from Scratch taught by Tatsunori Hashimoto, Percy Liang, Nelson Liu, and Gabriel Poesia at Stanford and CS 886: Recent Advances on Foundation Models taught by Wenhu Chen and Cong Wei at Waterloo.

Course Information

Teaching Staff

Instructor: Chris Maddison
TAs: Ayoub El Hanchi and Frieda Rong
Email (instructor and TAs): csc2541-large-models@cs.toronto.edu

What, When, and Where

Syllabus and Policies

The full syllabus and policies are available here.

Assignments and Grading

Assignments for the course include paper presentations and a final project. The marking scheme is as follows:

Prerequisites

This is a graduate course designed to guide students in an exploration of the current state of the art. While there are no formal prerequisites, the course does assume a certain level of familiarity with machine learning and deep learning concepts. A previous course in machine learning, such as CSC311, STA314, or ECE421, is required to take full advantage of the course, and ideally students will also have taken a course in deep learning such as CSC413. In addition, a strong background in linear algebra, multivariate calculus, probability, and computer programming is highly recommended.

Auditing

Non-enrolled persons may audit this course (sit in on the lectures) only if the auditor is a student at U of T and no University resources are committed to the auditor. This means that students of other universities, employees of outside organizations, and other non-students are not permitted to be auditors.

Schedule and Readings

The structure of the course closely follows that of Meta's technical report:

(Llama 3) Llama Team, AI @ Meta. "The Llama 3 Herd of Models". arXiv:2407.21783.

With the exception of the first two weeks, each week students will present a paper that elaborates on the section of the report assigned for that week.

This is a preliminary schedule, and it may change throughout the term. We won't check whether you've read the assigned readings, but you will get more out of the course if you do.

Day | Topic | Core Readings | Papers for Student Presentations
10/1 | The Bitter Lesson [slides] | | None
17/1 | A Tiny Large Model [github][notebook][slides] | | None
24/1 | Pre-training: Scaling [slides]
31/1 | Pre-training: Parallelism [slides]
7/2 | Prompting [slides] | Read any one of the papers from the Student Presentations column for this week.
14/2 | Post-training: Alignment [slides]
28/2 | Post-training: Capabilities [slides]
7/3 | Evaluation [slides]
14/3 | Safety [slides]
21/3 | Deployment [slides]
28/3 | Beyond Human Language [slides]
4/4 | Future Directions [slides]

The following reports cover the technical details of other important large models:

Student Presentations and Notebooks

Presenters | Paper | Materials
Ben Agro, Sourav Biswas | Hoffmann et al., 2022. Training Compute-Optimal Large Language Models | [slides][colab][github]
Samarendra Dash | Wei et al., 2022. Emergent Abilities of Large Language Models | [slides][colab]
Quentin Clark, Jack Sun | Gunasekar et al., 2023. Textbooks Are All You Need | [slides][colab]
Hunter Richards | Rajbhandari et al., 2020. ZeRO: Memory Optimizations Toward Training Trillion Parameter Models | [slides]
Yi Lu, Keyu Bai | Huang et al., 2019. GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism | [slides][colab]
Younes Hourri, Gyung Hyun Je | Narayanan et al., 2021. Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM | [slides][colab]
Jonah Strimas-Mackey, Chu Jie Jiessie Tie | Brown et al., 2020. Language Models are Few-Shot Learners | [slides][colab]
Anny Dai, Yixin Chen | Wei et al., 2022. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models | [slides][colab]
Andrew Liu, Kyung Jae Lee | Wang et al., 2023. Self-Consistency Improves Chain of Thought Reasoning in Language Models | [slides][colab]
Gul Sena Altintas | Wei et al., 2022. Finetuned Language Models Are Zero-Shot Learners | [slides][colab]
Yuchong Zhang, Lev McKinney | Ouyang et al., 2022. Training language models to follow instructions with human feedback | [slides][colab]
Di Mu, An Cao | Rafailov et al., 2024. Direct Preference Optimization | [slides][colab]
Rajat Bansal, Abhinav Muraleedharan | Singh et al., 2024. Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models | [slides][colab]
David Glukhov, Noé Artru | Madaan et al., 2023. Self-Refine: Iterative Refinement with Self-Feedback | [slides][colab]
Samanvay Vajpayee, Yongxin Zhao | Hendrycks et al., 2021. Measuring Massive Multitask Language Understanding | [slides][colab]
Frank Bai, Steven Yuan | Biderman et al., 2024. Lessons from the Trenches on Reproducible Evaluation of Language Models | [slides][colab]
Younwoo Choi, Kailun Jin | Vidgen et al., 2024. Introducing v0.5 of the AI Safety Benchmark from MLCommons | [slides][colab]
Laurel Aquino, Mohammad Abdul Basit | Marchal et al., 2024. Generative AI Misuse: A Taxonomy of Tactics and Insights from Real-World Data | [slides][colab]
Alfred Gu, Victor Huang | Greenblatt et al., 2024. Alignment faking in large language models | [slides][colab]
Ao Li, Bogdan Pikula | Leviathan et al., 2023. Fast Inference from Transformers via Speculative Decoding | [slides][colab]
Bailey Ng, Paul Tang | Kwon et al., 2023. Efficient Memory Management for Large Language Model Serving with PagedAttention | [slides][colab]
Raghav Sharma, Daniel Hocevar | Dao et al., 2022. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness | [slides][colab]
Kecen Yao, Zhichen Ren | Xiao et al., 2023. SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models | [slides][colab]
Qi Zhao, Sumin Lee | Alayrac et al., 2022. Flamingo: a Visual Language Model for Few-Shot Learning | [slides][colab 1][colab 2][colab 3]
Nathan de Lara, Jasper Gerigk | Tschannen et al., 2024. JetFormer: An Autoregressive Generative Model of Raw Images and Text | [slides][colab]
Xingyu Chen, Cong Yu Fang | Nguyen et al., 2024. Sequence modeling and design from molecular to genome scale with Evo | [slides][colab]
Nikita Ivanov, Elias Abou Farhat | DeepSeek-AI, 2025. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning | [slides][colab]
Nico Schiavone, Yifeng Chen | Gu and Dao, 2024. Mamba: Linear-Time Sequence Modeling with Selective State Spaces | [slides][colab]
Hao Li, Jing Neng Hsu | Pagnoni et al., 2024. Byte Latent Transformer: Patches Scale Better Than Tokens | [slides][colab]