Large language models have revolutionized artificial intelligence and machine learning. These models, trained on massive datasets, can generate human-like text and code and, apparently, engage in complex reasoning. Two empirical findings drive these breakthroughs: large models improve predictably with the amount of compute used to train them, and diverse capabilities emerge as the models see more data from the internet. These findings have motivated an immense industrial effort to build and deploy very large models.
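As a rough illustration of the first finding, scaling laws are usually power-law fits of loss against training compute. The sketch below fits such a curve to made-up numbers; the data points, units, and constants are illustrative assumptions, not figures from any paper or from this course.

```python
# Hedged illustration only: a toy power-law fit L(C) = a * C**(-b),
# showing what "loss improves predictably with compute" means.
# All numbers below are invented for the example.
import numpy as np

compute = np.array([1e0, 1e1, 1e2, 1e3, 1e4])   # hypothetical training compute
loss = np.array([3.9, 3.3, 2.8, 2.4, 2.05])     # hypothetical final losses

# Fit log(loss) = intercept + slope * log(compute) by least squares.
slope, intercept = np.polyfit(np.log(compute), np.log(loss), 1)
a, b = np.exp(intercept), -slope

print(f"fitted law: L(C) = {a:.2f} * C^(-{b:.3f})")
print(f"extrapolated loss at C = 1e5: {a * 1e5 ** (-b):.2f}")
```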
The course will focus on understanding the practical aspects of large model training through an in-depth study of the Llama 3 technical report. We will cover the whole pipeline, from pre-training and post-training to evaluation and deployment.
Students will be expected to present a paper, prepare code notebooks, and complete a final project on a topic of their choice. While the readings are largely applied or methodological, theoretically-minded students are welcome to focus their project on a theoretical topic related to large models.
The course is heavily inspired by similar courses like CS336: Language Modeling from Scratch taught by Tatsunori Hashimoto, Percy Liang, Nelson Liu, and Gabriel Poesia at Stanford and CS 886: Recent Advances on Foundation Models taught by Wenhu Chen and Cong Wei at Waterloo.
Instructor: Chris Maddison
TAs: Ayoub El Hanchi and Frieda Rong
Email (instructor and TAs): csc2541-large-models@cs.toronto.edu
The full syllabus and policies are available here.
Assignments for the course include paper presentations and a final project. The marking scheme is as follows:
This is a graduate course designed to guide students in an exploration of the current state of the art. While there are no formal prerequisites, the course assumes familiarity with machine learning and deep learning concepts. A previous course in machine learning such as CSC311, STA314, or ECE421 is needed to take full advantage of the course, and ideally students will also have taken a course in deep learning such as CSC413. A solid background in linear algebra, multivariate calculus, probability, and computer programming is also strongly recommended.
Non-enrolled persons may audit this course (sit in on the lectures) only if the auditor is a student at U of T, and no University resources will be committed to auditors. This means that students of other universities, employees of outside organizations, and other non-students are not permitted to audit.
The structure of the course closely follows the structure of Meta's technical report:
(Llama 3) Llama Team, AI @ Meta. "The Llama 3 Herd of Models". arXiv:2407.21783.
Except in the first two weeks, students will present a paper each week that elaborates on the section of the report assigned for that week.
This is a preliminary schedule, and it may change throughout the term. We won't check whether you've read the assigned readings, but you will get more out of the course if you do.
Day | Topic | Core Readings | Papers for Student Presentations |
---|---|---|---|
10/1 | The Bitter Lesson [slides] | None | |
17/1 | A Tiny Large Model [github] [notebook] [slides] | None | |
24/1 | Pre-training: Scaling | | |
31/1 | Pre-training: Parallelism | | |
7/2 | Prompting | | |
14/2 | Post-training: Alignment | | |
28/2 | Post-training: Capabilities | | |
7/3 | Evaluation | | |
14/3 | Safety | TBD | |
21/3 | Deployment | | |
28/3 | Beyond Language | TBD | |
4/4 | Future Directions | TBD | |
The following reports cover the technical details of other important large models: