CSC 2541 Winter 2025: Large Models

Overview and Motivation

Large Models

Large language models have revolutionized artificial intelligence and machine learning. These models, trained on massive datasets, can generate human-like text and code and (apparently) engage in complex reasoning tasks. Driving these breakthroughs are two empirical findings: large models improve predictably with the amount of compute used to train them, and diverse capabilities emerge as the models see more data from the internet. These findings have motivated an immense industrial effort to build and deploy very large models.
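As a rough illustration of what "improve predictably with compute" means in practice, the Chinchilla analysis (Hoffmann et al., 2022, one of the presentation papers below) approximates training compute as C ≈ 6ND for a model with N parameters trained on D tokens and finds a compute-optimal token-to-parameter ratio of roughly 20. The short Python sketch below is only an illustration of that back-of-the-envelope sizing; the function name and the fixed 20:1 ratio are assumptions, not something prescribed by the course or the Llama 3 report.

# A minimal sketch of compute-optimal model sizing in the spirit of
# Hoffmann et al. (2022). It assumes the common approximation
# C ~= 6 * N * D for training FLOPs and a fixed tokens-per-parameter
# ratio of about 20; real scaling-law fits depend on data and architecture.

def compute_optimal_size(flops_budget, tokens_per_param=20.0):
    """Return (parameters N, training tokens D) for a given FLOP budget C."""
    # From C = 6 * N * D and D = tokens_per_param * N:
    #   C = 6 * tokens_per_param * N**2  =>  N = sqrt(C / (6 * tokens_per_param))
    n_params = (flops_budget / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

for c in (1e21, 1e23, 1e25):
    n, d = compute_optimal_size(c)
    print(f"C = {c:.0e} FLOPs -> ~{n:.2e} parameters, ~{d:.2e} tokens")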

The course will focus on understanding the practical aspects of large model training through an in-depth study of the Llama 3 technical report. We will cover the whole pipeline, from pre-training and post-training to evaluation and deployment.

Students will be expected to present a paper, prepare code notebooks, and complete a final project on a topic of their choice. While the readings are largely applied or methodological, theoretically-minded students are welcome to focus their project on a theoretical topic related to large models.

The course is heavily inspired by similar courses like CS336: Language Modeling from Scratch taught by Tatsunori Hashimoto, Percy Liang, Nelson Liu, and Gabriel Poesia at Stanford and CS 886: Recent Advances on Foundation Models taught by Wenhu Chen and Cong Wei at Waterloo.

Course Information

Teaching Staff

Instructor: Chris Maddison
TAs: Ayoub El Hanchi and Frieda Rong
Email (instructor and TAs): csc2541-large-models@cs.toronto.edu

What, When, and Where

Syllabus and Policies

The full syllabus and policies are available here.

Assignments and Grading

Assignments for the course include paper presentations and a final project. The marking scheme is as follows:

Prerequisites

This is a graduate course designed to guide students in an exploration of the current state of the art. While there are no formal prerequisites, the course does assume a certain level of familiarity with machine learning and deep learning concepts. A previous course in machine learning, such as CSC311, STA314, or ECE421, is required to take full advantage of the course, and ideally students will also have taken a course in deep learning such as CSC413. In addition, a strong background in linear algebra, multivariate calculus, probability, and computer programming is highly recommended.

Auditing

Non-enrolled persons may audit this course (sit in on the lectures) only if the auditor is a student at U of T and no University resources are committed to the auditor. This means that students of other universities, employees of outside organizations, and other non-students are not permitted to be auditors.

Schedule and Readings

The structure of the course closely follows that of Meta's technical report:

(Llama 3) Llama Team, AI @ Meta. "The Llama 3 Herd of Models". arXiv:2407.21783.

With the exception of the first two weeks, each week students will present a paper that elaborates on the section of the report assigned for that week.

This is a preliminary schedule, and it may change throughout the term. We won't check whether you've read the assigned readings, but you will get more out of the course if you do.

Day | Topic | Core Readings | Papers for Student Presentations
10/1 | The Bitter Lesson [slides] | | None
17/1 | A Tiny Large Model [github][notebook][slides] | | None
24/1 | Pre-training: Scaling [slides]
31/1 | Pre-training: Parallelism [slides]
7/2 | Prompting [slides] | Read any one of the papers from the Student Presentations column for this week.
14/2 | Post-training: Alignment [slides]
28/2 | Post-training: Capabilities [slides]
7/3 | Evaluation [slides]
14/3 | Safety [slides]
21/3 | Deployment [slides]
28/3 | Beyond Human Language [slides]
4/4 | Future Directions [slides]

The following reports cover the technical details of other important large models:

Student Presentations and Notebooks

Presenters | Paper | Materials
Ben Agro, Sourav Biswas | Hoffmann et al., 2022. Training Compute-Optimal Large Language Models | [slides][colab][github]
Samarendra Dash | Wei et al., 2022. Emergent Abilities of Large Language Models | [slides][colab]
Quentin Clark, Jack Sun | Gunasekar et al., 2023. Textbooks Are All You Need | [slides][colab]
Hunter Richards | Rajbhandari et al., 2020. ZeRO: Memory Optimizations Toward Training Trillion Parameter Models | [slides]
Yi Lu, Keyu Bai | Huang et al., 2019. GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism | [slides][colab]
Younes Hourri, Gyung Hyun Je | Narayanan et al., 2021. Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM | [slides][colab]
Jonah Strimas-Mackey, Chu Jie Jiessie Tie | Brown et al., 2020. Language Models are Few-Shot Learners | [slides][colab]
Anny Dai, Yixin Chen | Wei et al., 2022. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models | [slides][colab]
Andrew Liu, Kyung Jae Lee | Wang et al., 2023. Self-Consistency Improves Chain of Thought Reasoning in Language Models | [slides][colab]
Gul Sena Altintas | Wei et al., 2022. Finetuned Language Models Are Zero-Shot Learners | [slides][colab]
Yuchong Zhang, Lev McKinney | Ouyang et al., 2022. Training language models to follow instructions with human feedback | [slides][colab]
Di Mu, An Cao | Rafailov et al., 2024. Direct Preference Optimization | [slides][colab]
Rajat Bansal, Abhinav Muraleedharan | Singh et al., 2024. Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models | [slides][colab]
David Glukhov, Noé Artru | Madaan et al., 2023. Self-Refine: Iterative Refinement with Self-Feedback | [slides][colab]
Samanvay Vajpayee, Yongxin Zhao | Hendrycks et al., 2021. Measuring Massive Multitask Language Understanding | [slides][colab]
Frank Bai, Steven Yuan | Biderman et al., 2024. Lessons from the Trenches on Reproducible Evaluation of Language Models | [slides][colab]
Younwoo Choi, Kailun Jin | Vidgen et al., 2024. Introducing v0.5 of the AI Safety Benchmark from MLCommons | [slides][colab]
Laurel Aquino, Mohammad Abdul Basit | Marchal et al., 2024. Generative AI Misuse: A Taxonomy of Tactics and Insights from Real-World Data | [slides][colab]
Alfred Gu, Victor Huang | Greenblatt et al., 2024. Alignment faking in large language models | [slides][colab]
Ao Li, Bogdan Pikula | Leviathan et al., 2023. Fast Inference from Transformers via Speculative Decoding | [slides][colab]
Bailey Ng, Paul Tang | Kwon et al., 2023. Efficient Memory Management for Large Language Model Serving with PagedAttention | [slides][colab]
Raghav Sharma, Daniel Hocevar | Dao et al., 2022. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness | [slides][colab]
Kecen Yao, Zhichen Ren | Xiao et al., 2023. SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models | [slides][colab]
Qi Zhao, Sumin Lee | Alayrac et al., 2022. Flamingo: a Visual Language Model for Few-Shot Learning | [slides][colab 1][colab 2][colab 3]
Nathan de Lara, Jasper Gerigk | Tschannen et al., 2024. JetFormer: An Autoregressive Generative Model of Raw Images and Text | [slides][colab]
Xingyu Chen, Cong Yu Fang | Nguyen et al., 2024. Sequence modeling and design from molecular to genome scale with Evo | [slides][colab]
Nikita Ivanov, Elias Abou Farhat | DeepSeek-AI, 2025. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning | [slides][colab]
Nico Schiavone, Yifeng Chen | Gu and Dao, 2024. Mamba: Linear-Time Sequence Modeling with Selective State Spaces | [slides][colab]
Hao Li, Jing Neng Hsu | Pagnoni et al., 2024. Byte Latent Transformer: Patches Scale Better Than Tokens | [slides][colab]