Contact information
- Instructors: Annie En-Shiun Lee, Raeid Saqur, and Zining Zhu
- Office: Zoom
- Office hours: Wednesdays 12:30-1:30 pm
- Email: csc401-2023-01@cs. (add the toronto.edu suffix)
- Forum: Piazza
- Quercus: https://q.utoronto.ca/courses/293764
- Email policy: For non-confidential inquiries, consult the Piazza forum first. For confidential assignment-related inquiries, contact the TA associated with that assignment. Emails sent from University of Toronto addresses with appropriate subject headings are least likely to be redirected to junk email folders.
Course outline
This course presents an introduction to natural language computing in applications such as information retrieval and extraction, intelligent web searching, speech recognition, and machine translation. These applications will involve various statistical and machine learning techniques. Assignments will be completed in Python. All code must run on the 'teaching servers'.
The theme for this year is speech and text analysis in a post-truth society.
Prerequisites: CSC207H1 / CSC209H1 ; STA247H1 / STA255H1 / STA257H1
Recommended Preparation: MAT221H1 / MAT223H1 / MAT240H1 / CSC311 are strongly recommended. For advice, contact the Undergraduate Office. The course information sheet is available here.
Meeting times
TL;DR All meetings are in-person at EM 001, except for the first section (10h-11h) on Wednesdays.
- Locations:
  - EM 001: Emmanuel College [Classfinder link]
  - AH 100: Muzzo Family Alumni Hall [Classfinder link]
- Lectures:
  - Monday: 10-11h and 11-12h in EM 001
  - Wednesday: 10-11h in AH 100; 11-12h in EM 001
- Tutorials:
  - Friday: 10-11h and 11-12h in EM 001
Syllabus
The following is an estimate of the topics to be covered in the course and is subject to change.
- Introduction to corpus-based linguistics
- N-gram models
- Entropy and decisions
- Neural language models and word embedding
- Machine translation (statistical and neural) (MT)
- Hidden Markov models (HMMs)
- Natural Language Understanding (NLU)
- Automatic speech recognition (ASR)
- Information retrieval (IR)
- Interpretability and Large Language Models
Calendar
- 9 January: First lecture
- 17 January: Last day to add CSC2511
- 23 January: Last day to add CSC401
- 10 February: Assignment 1 due
- 20 February: Last day to drop CSC2511
- 20-24 February: Reading week (no lectures or tutorials)
- 10 March: Assignment 2 due
- 14 March: Last day to drop CSC401
- 8 April: Last lecture
- 8 April: Assignment 3 due
- 8 April: Project report due
- TBD April: Final exam
Readings for this course
- Optional: Foundations of Statistical Natural Language Processing, C. Manning and H. Schütze
- Optional: Speech and Language Processing, D. Jurafsky and J. H. Martin (2nd ed.)
- Optional: Deep Learning, I. Goodfellow, Y. Bengio, and A. Courville
Supplementary reading
Please see the additional lecture-specific supplementary resources under the Lecture Materials section.
Evaluation policies
- General
- You will be graded on three homework assignments and a final exam. The relative proportions of these grades are as follows:
  - Assignment with lowest mark: 15%
  - Assignment with median mark: 20%
  - Assignment with highest mark: 25%
  - Final exam: 40%
- Graduate students enrolled in CSC2511 will have the option of undertaking a course project (instead of the assignments), in teams of at most two students, for 60% of the course grade (the final exam, worth 40%, is still required). Information on the course project can be found here.
- Lateness
- A 10% (absolute) deduction is applied to late homework one minute after the due time. Thereafter, an additional 10% deduction is applied every 24 hours, up to 72 hours late, at which point the homework receives a mark of zero. No exceptions will be made except in emergencies, including medical emergencies, at the instructor's discretion. (An illustrative, unofficial sketch of how these weights and deductions combine appears after this list.)
- Final
- A mark of at least D- on the final exam is required to pass the course. In other words, if you receive an F on the final exam then you automatically fail the course, regardless of your performance in the rest of the course.
- Collaboration and plagiarism
- No collaboration on the homeworks is permitted. The work you submit must be your own. 'Collaboration' in this context includes but is not limited to sharing of source code, correction of another's source code, copying of written answers, and sharing of answers prior to submission of the work (including the final exam). Failure to observe this policy is an academic offense, carrying a penalty ranging from a zero on the homework to suspension from the university. See Academic integrity at the University of Toronto.
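To make the grading arithmetic above concrete, here is a minimal, unofficial sketch in Python (the course's assignment language) of how the assignment weights and the lateness deduction combine. The function names, example marks, and the handling of exact boundary cases (e.g. precisely 24 or 72 hours late) are illustrative assumptions, not course policy.

```python
import math

def late_penalty(hours_late: float) -> float:
    """Absolute percentage-point deduction under the lateness policy:
    10 points once the deadline passes, plus another 10 for each further
    24-hour period, and a mark of zero once the work is 72 hours late.
    Boundary behaviour (exactly 24/48/72 hours) is an assumption here."""
    if hours_late <= 0:
        return 0.0
    if hours_late >= 72:
        return 100.0  # capped: the submission receives zero
    return 10.0 * math.ceil(hours_late / 24)

def course_grade(assignment_marks, exam_mark):
    """Weight the sorted assignment marks 15/20/25% and the final exam 40%
    (the non-project route); all marks are percentages in [0, 100]."""
    lowest, median, highest = sorted(assignment_marks)
    return 0.15 * lowest + 0.20 * median + 0.25 * highest + 0.40 * exam_mark

# Hypothetical example: assignments marked 70, 80, and 90, where the 80 was
# handed in 30 hours late (losing 20 points), and a final exam mark of 75.
marks = [70, max(0.0, 80 - late_penalty(30)), 90]
print(round(course_grade(marks, 75), 1))  # 0.15*60 + 0.20*70 + 0.25*90 + 0.40*75 = 75.5
```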
Lecture materials
Assigned readings give you more in-depth information on ideas covered in lectures. You will not be asked questions relating to readings for the assignments, but they will be useful in studying for the final exam.
Provided PDFs are ~ 10% of their original size for portability, at the expense of fidelity.
For pre-lecture readings and in-class note taking, please see the Quercus page Lecture Materials and Handouts. The final versions (ex-post errata and/or other modifications) will be posted here on the course website.
- Introduction.
- Date: 9 Jan.
- Reading: Manning & Schütze: Sections 1.3-1.4.2, Sections 6.0-6.2.1
- Corpora, language models, Zipf, and smoothing.
- Dates: 11, 16 Jan.
- Reading: Manning & Schütze: Section 1.4.3, Sections 6.1-6.2.2, Section 6.2.5, Section 6.3
- Reading: Jurafsky & Martin: 3.4-3.5
- Features and classification.
- Dates: 18, 23 Jan.
- Reading: Manning & Schütze: Section 16.1, 16.4
- Reading: Jurafsky & Martin (2nd ed): Sections 5.1-5.5
- Entropy and decisions.
- Dates: 25, 30 Jan.
- Reading: Manning & Schütze: Sections 2.2, 5.3-5.5
- Neural models of language.
- Dates: 1, 6 Feb.
- Reading: DL (Goodfellow et al.). Sections: 6.3, 6.6, 10.2, 10.5, 10.10
- (Optional) Supplementary resources and readings:
- Mikolov, Tomas, et al. "Efficient estimation of word representations in vector space." (2013). link
- Xin Rong. "word2vec Parameter Learning Explained." link
- Bolukbasi, Tolga, et al. "Man is to computer programmer as woman is to homemaker? Debiasing word embeddings." NeurIPS (2016). link
- Greff, Klaus, et al. "LSTM: A search space odyssey." IEEE (2016). link
- Jozefowicz, Sutskever et al. "An empirical exploration of recurrent network architectures." ICML (2015). link
- GRU: Cho, et al. "Learning phrase representations using RNN encoder-decoder for statistical machine translation." (2014). link
- ELMo: Peters, Matthew E., et al. "Deep contextualized word representations." (2018). link
- Blogs:
- The Unreasonable Effectiveness of Recurrent Neural Networks. link
- Colah's Blog. "Understanding LSTM Networks". link.
- Machine translation (MT).
- Dates: 8, 13, 15 Feb.
- Readings:
- Manning & Schütze: Sections 13.0, 13.1.2, 13.1.3, 13.2, 13.3, 14.2.2
- DL (Goodfellow et al.). Sections: 10.3, 10.4, 10.7
- Vaswani et al. "Attention is all you need." (2017). link
- (Optional) Supplementary resources and readings:
- Papineni, et al. "BLEU: a method for automatic evaluation of machine translation." ACL (2002). link
- Sutskever, Ilya, Oriol Vinyals et al. "Sequence to sequence learning with neural networks."(2014). link
- Bahdanau, Dzmitry, et al. "Neural machine translation by jointly learning to align and translate."(2014). link
- Luong, Manning, et al. "Effective approaches to attention-based neural machine translation." arXiv (2015). link
- Britz, Denny, et al. "Massive exploration of neural machine translation architectures."(2017). link
- BPE: Sennrich, et al. "Neural machine translation of rare words with subword units." arXiv (2015). link
- Wordpiece: Wu, Yonghui, et al. "Google's neural machine translation system: Bridging the gap between human and machine translation." arXiv (2016). link
- Blogs:
- Distill: Olah & Carter "Attention and Augmented RNNs"(2016). link
- Jay Allamar. "The Illustrated Transformer". link.
- More neural language models.
- Date: 27 Feb.
- Readings: No required readings for this lecture.
- (Optional) Supplementary resources and readings:
- Bommasani et al. "On the opportunities and risks of foundation models." (2022). link
- Devlin et al. "BERT: Pre-training of deep bidirectional transformers for language understanding." (2019). link
- Clark et al. "What does BERT look at? An analysis of BERT's attention." (2019). link
- Tenney et al. "BERT rediscovers the classical NLP pipeline." (2019). link
- Rogers, Anna, et al. "A primer in BERTology: What we know about how BERT works." TACL (2020). link
- Lewis et al. "BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension." (2019). link
- T5: Raffel et al. "Exploring the limits of transfer learning with a unified text-to-text transformer." J. Mach. Learn. Res. 21.140 (2020). link
- GPT3: Brown et al. "Language models are few-shot learners." (2020). link
- InstructGPT: Ouyang, Long, et al. "Training language models to follow instructions with human feedback." arXiv preprint (2022). link
- RLHF: Christiano et al. "Deep reinforcement learning from human preferences." (2017). link
- RLHF: Stiennon et al. "Learning to summarize from human feedback." (2020). link
- Kaplan et al. "Scaling laws for neural language models." (2020). link
- Kudo and Richardson. "SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing." (2018). link
- Token-free models:
- Clark et al. "CANINE: Pre-training an efficient tokenization-free encoder for language representation." (2021). link
- Xue et al. "ByT5: Towards a token-free future with pre-trained byte-to-byte models." (2022). link
- Hidden Markov Models (HMMs).
- Dates: 1, 6, 8 Mar
- Reading: Manning & Schütze: Section 9.2-9.4.1 (an alternative formulation)
- (Optional) Supplementary resources and readings:
- Rabiner, Lawrence R. "A tutorial on HMMs and selected applications in speech recognition." (1990). link
- Stamp, Mark. "A revealing introduction to hidden Markov models." (2004). link
- Bilmes, Jeff. "A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and HMMs." (1998). link
- Chen and Goodman. "An empirical study of smoothing techniques for language modeling." (1999). link
- Hidden Markov Model Toolkit (HTK). link
- Scikit's HMM. link
- Natural Language Understanding (NLU). Slides: NLU-1/2; NLU-2/2, Colab notebook.
- Automatic Speech Recognition (ASR).
- Information Retrieval (IR).
- Date: 27 Mar.
- Readings:
- Jurafsky & Martin SLP3 (3rd ed.): Chapter 14, only the first part (14.1). link
- Interpretability.
- Date: 29 Mar.
- Readings:
- Lundberg & Lee. "A unified approach to interpreting model predictions." (2017). link
- Large Language Models (LLMs).
- Date: 3 Apr.
- (Optional) Supplementary resources and readings:
- GPT3: Brown et al. "Language models are few-shot learners." (2020). link
- OpenAI. "ChatGPT: Optimizing Language Models for Dialogue." (2022). link
- Nature News. C. Walker. "ChatGPT listed as author on research papers: many scientists disapprove". (2023). link
- AI Readings. UofT Academic and Collaborative Technologies (ACT). link
- Summary and Review.
- Date: 5 Apr.
Tutorial materials
Enrolled students: Please see the Quercus page Tutorial Materials. The final versions (ex-post errata and/or other modifications) will be posted here on the course website (for anyone auditing).
- Assignment 1 tutorials:
- Jan. 13, 2023: Tutorial 1
- Jan. 20, 2023: Tutorial 2 - Entropy and decisions
- Feb. 10, 2023: A1 Q/A + O.H. w/ the TAs (no slides).
- Assignment 2 tutorials:
- Feb. 17, 2023: Tutorial-I: Intro. to PyTorch (ft. Gavin Guan)
- Mar. 3, 2023: Tutorial-II: Machine Translation (ft. Frank Niu)
- Mar. 10, 2023: A2 Q/A + O.H. w/ the TAs (no slides).
- Assignment 3 tutorials:
- Mar. 17, 2023: Tutorial-I
- Mar. 24, 2023: Tutorial-II
- See Quercus -> Pages -> Tutorial Materials
Assignments
Enrolled students: Please use the Quercus Assignments page for all materials. The final version (ex-post errata, updates) will be posted here (for anyone auditing the course). Here is the ID template that you must submit with your assignments. Here is the MarkUs link you use to submit them.
Extension requests: Please follow the extension request procedure detailed here. A copy of the Special Consideration Form is here.
Remark requests: Please follow the remarking policy detailed here.
- Assignment 1: Identifying political persuasion on Reddit
- Released: Jan 14, 2023. Due: Feb 10, 2023
- For all A1 related emails, please use: csc401-2023-01-a1@cs. (add the toronto.edu suffix)
- Download the starter code from MarkUs
- Marking rubric
- The associated requirements file
- Assignment 2: Neural Machine Translation
- Released: Feb 11, 2023. Due: Mar 10, 2023
- For all A2 related emails, please use: csc401-2023-01-a2@cs. (add the toronto.edu suffix)
- Download the starter code from MarkUs
- Marking rubric
- The associated requirements file
- LaTeX report template reports.zip
- Assignment 3: NLU, ASR, LLM
- Released: Mar 11, 2023. Due: Apr 8, 2023
- For all A3 related emails, please use: csc401-2023-01-a3@cs. (add the toronto.edu suffix)
- The starter code is available on the teach server at: "/u/cs401/A3_minBERT"
- Marking rubric
Project
The course project is an optional replacement for the assignments available to graduate students in CSC2511.
Past course materials
Fall 2022: course page
Old Exams
- The old exam repository from UofT Libraries is here (it may not contain this particular course's final exams).
- (Old) final exam from 2017 here. N.B.: while this exam is structurally similar to the current final exam, its material may not correspond to the current syllabus (e.g., all statistical MT questions are out of scope in W23), so some questions may appear esoteric.
News and announcements
- FIRST LECTURE: 9 January at 10h or 11h (check your section on ACORN enrolment).
- FIRST TUTORIAL: There will be a tutorial in the first week of lectures (i.e., Friday, 13 January).
- READING WEEK BREAK: The week of Feb. 20-24 - there will be no lectures or tutorials.
- FINAL EXAM: 22-APR-2023: 14.00-15.00. Exam schedule and (location) info now available here.
Frequently asked questions
Please see the Arts and Science departmental student F.A.Q page for additional answers.
- How much Machine Learning background do I need?
Please look at the posted lecture slides and see how comfortable you are with the middle and later material; please also be considerate of the waitlist and the eligible students waiting to get into the class. Machine learning is highly recommended but is not a prerequisite of the course. Much of the course content assumes some basic knowledge and appreciation of machine learning, so prior exposure will help. You do not need everything from CSC311 Introduction to Machine Learning, but some understanding of machine learning (for example, Andrew Ng's "Machine Learning for Everyone") is helpful. For the first third of the lectures you should at least know supervised learning, classification, features and feature vectors, feature extraction/engineering, training/test datasets, and naive Bayes; for the remainder, deep learning, softmax, objective functions, basic training mechanics, etc.
Also see this instructor answer on Piazza.
- Are the lectures recorded?
Students are expected to attend all lectures and tutorials in person. Consistent, high-quality video recordings are not guaranteed, so it is best to assume "No" and plan accordingly. Any recordings that are made may be posted under the Class Recordings page on Quercus.
- What if I am unable to attend the class?
If a student has to miss a substantial portion of the course, or consistently misses one of the weekly lecture times and would need to rely on recordings, then they should drop the course to allow someone else on the waiting list to take it.
- How to audit the course?
Students are allowed to audit the course as long as they do not take up any course resources (including instructor/TA/admin time). Auditors may attend lectures if there is space and access any publicly accessible course materials (e.g., a Quercus page set to Institution visibility), but they do not get access to anything else. For example, if a registered student in that section (or another section) cannot find a seat at a lecture, the auditor should give up their seat. Auditors should not expect questions to be answered on course resources, or assignments/projects to be graded.
- Course enrollment issues?
Unfortunately, the CSC401 course staff and instructors do not have the ability to manage lecture-section enrollment or to move a student from one section to another. You will need to contact your college registrar for anything related to ACORN/course enrollment. If you have further questions or concerns, please do not hesitate to reach out.
- Lecture conflicts?
Unfortunately, the CSC401 course staff and instructors do not have the ability to move a student from one section to another, and given the long waiting lists it is unlikely that you will get a spot in the section you request in ACORN at this late date. If you remain registered in the section with the conflict, we will allow you to attend lectures in any other in-person section as long as there is space in the classroom. If it isn't possible for you to attend one of the weekly lecture times in person, you will need to drop CSC401 for this term and hopefully take it in the future.
- (Grad. students only) What does doing the 'Project' option mean for course deliverables?
You can do the project in place of the three assignments. However, you can NOT skip the final exam (which will include material covered in the assignments). Please read the project document and see Quercus for details. The assignments are optional (but recommended for those new to NLP); the final exam is required.