Contact information

Instructors: Annie En-Shiun Lee, Raeid Saqur, and Zining Zhu.
Office: Zoom
Office hours: Wednesdays 12:30-1:30 pm
Email: csc401-2023-01@cs. (add the toronto.edu suffix)
Forum: Piazza
Quercus: https://q.utoronto.ca/courses/293764
Email policy: For non-confidential inquiries, consult the Piazza forum first. For confidential assignment-related inquiries, contact the TA associated with that assignment. Emails sent from University of Toronto email addresses with appropriate subject headings are the least likely to be redirected to junk folders.
Back to top

Course outline

This course presents an introduction to natural language computing in applications such as information retrieval and extraction, intelligent web searching, speech recognition, and machine translation. These applications will involve various statistical and machine learning techniques. Assignments will be completed in Python. All code must run on the 'teaching servers'.

The theme for this year is speech and text analysis in a post-truth society.

Prerequisites: CSC207H1 / CSC209H1 ; STA247H1 / STA255H1 / STA257H1
Recommended Preparation: MAT221H1 / MAT223H1 / MAT240H1 / CSC311 are strongly recommended. For advice, contact the Undergraduate Office.

The course information sheet is available here.

Back to top

Meeting times

TL;DR All meetings are in-person at EM 001, except for the first section (10h-11h) on Wednesdays.

Locations: EM 001, Emmanuel College [Classfinder link]; AH 100, Muzzo Family Alumni Hall [Classfinder link]
Lectures: Mondays 10-11h and 11-12h in EM 001; Wednesdays 10-11h in AH 100 and 11-12h in EM 001
Tutorials: Fridays 10-11h and 11-12h in EM 001
Back to top

Syllabus

The following is an estimate of the topics to be covered in the course and is subject to change.

  1. Introduction to corpus-based linguistics
  2. N-gram models
  3. Entropy and decisions
  4. Neural language models and word embedding
  5. Machine translation (statistical and neural) (MT)
  6. Hidden Markov models (HMMs)
  7. Natural Language Understanding (NLU)
  8. Automatic speech recognition (ASR)
  9. Information retrieval (IR)
  10. Interpretability and Large Language Models

Calendar

9 January: First lecture
17 January: Last day to add CSC2511
23 January: Last day to add CSC401
10 February: Assignment 1 due
20 February: Last day to drop CSC2511
20-24 February: Reading week (no lectures or tutorials)
10 March: Assignment 2 due
14 March: Last day to drop CSC401
8 April: Last lecture
8 April: Assignment 3 due
8 April: Project report due
TBD April: Final exam

See Dates for undergraduate students.

See Dates for graduate students.

Back to top

Readings for this course

Optional: Foundations of Statistical Natural Language Processing, C. Manning and H. Schütze
Optional: Speech and Language Processing (2nd ed.), D. Jurafsky and J.H. Martin
Optional: Deep Learning, I. Goodfellow, Y. Bengio, and A. Courville

Supplementary reading

Please see additional lecture-specific supplementary resources under the Lecture Materials section.

  • ML history: "What is science for? The Lighthill report on artificial intelligence reinterpreted", Jon Agar
  • Smoothing: "An Empirical Study of Smoothing Techniques for Language Modeling", Stanley F. Chen and Joshua Goodman
  • Hidden Markov models: "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition", Lawrence R. Rabiner
  • Sentence alignment: "A Program for Aligning Sentences in Bilingual Corpora", William A. Gale and Kenneth W. Church
  • Transformation-based learning: "Transformation-Based Error-Driven Learning and Natural Language Processing: A Case Study in Part-of-Speech Tagging", Eric Brill
  • Sentence boundaries: "Sentence boundaries", J. Read, R. Dridan, S. Oepen, and L.J. Solberg
  • Seq2Seq: "Sequence to Sequence Learning with Neural Networks", Ilya Sutskever, Oriol Vinyals, and Quoc V. Le
  • Transformer: "Attention Is All You Need", Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin
  • Attention-based NMT: "Effective Approaches to Attention-based Neural Machine Translation", Minh-Thang Luong, Hieu Pham, and Christopher D. Manning
  • NMT: "Neural Machine Translation by Jointly Learning to Align and Translate", Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio
  • NMT: "Massive Exploration of Neural Machine Translation Architectures", Denny Britz et al.
Back to top

Evaluation policies

General
You will be graded on three homework assignments and a final exam. The relative proportions of these grades are as follows:
Assignment with lowest mark: 15%
Assignment with median mark: 20%
Assignment with highest mark: 25%
Final exam: 40%
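For illustration, here is a minimal sketch (in Python, the language used for the assignments) of how these weights combine into a course grade. The marks below are hypothetical, and this is not an official grade calculator.

    # Hypothetical marks, for illustration only (all out of 100).
    assignment_marks = [78.0, 85.0, 92.0]   # the three homework assignments
    final_exam_mark = 74.0

    # Weights from the table above: lowest 15%, median 20%, highest 25%, exam 40%.
    a_low, a_mid, a_high = sorted(assignment_marks)
    course_grade = (0.15 * a_low + 0.20 * a_mid + 0.25 * a_high
                    + 0.40 * final_exam_mark)
    print(f"Course grade: {course_grade:.1f}%")   # 11.7 + 17.0 + 23.0 + 29.6 = 81.3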
Graduate students enrolled in CSC2511 will have the option of undertaking a course project (instead of the assignments), in teams of at most two students, for 60% of the course grade (the final exam, worth 40%, is still required). Information on the course project can be found here.
Lateness
A 10% (absolute) deduction is applied to late homework starting one minute after the due time. Thereafter, an additional 10% deduction is applied for every further 24 hours, up to 72 hours late, at which point the homework receives a mark of zero. No exceptions will be made except in emergencies, including medical emergencies, at the instructor's discretion.
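Read as a step function, the schedule above can be sketched as follows. This is an unofficial, minimal sketch; the function name, the exact cut-offs, and the treatment of the 72-hour boundary are assumptions based on one reading of the policy.

    def late_mark(raw_mark: float, hours_late: float) -> float:
        """Apply the lateness schedule above to a raw mark out of 100 (unofficial sketch)."""
        if hours_late <= 0:
            return raw_mark                   # on time: no deduction
        if hours_late >= 72:
            return 0.0                        # 72 or more hours late: mark of zero
        periods = int(hours_late // 24)       # completed additional 24-hour periods
        deduction = 10.0 * (1 + periods)      # 10 points immediately, plus 10 per period
        return max(0.0, raw_mark - deduction)

    print(late_mark(90.0, 5))    # 80.0 (less than 24 hours late)
    print(late_mark(90.0, 30))   # 70.0 (between 24 and 48 hours late)
    print(late_mark(90.0, 80))   # 0.0  (more than 72 hours late)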
Final
A mark of at least D- on the final exam is required to pass the course. In other words, if you receive an F on the final exam then you automatically fail the course, regardless of your performance in the rest of the course.
Collaboration and plagiarism
No collaboration on the homeworks is permitted. The work you submit must be your own. 'Collaboration' in this context includes but is not limited to sharing of source code, correction of another's source code, copying of written answers, and sharing of answers prior to submission of the work (including the final exam). Failure to observe this policy is an academic offense, carrying a penalty ranging from a zero on the homework to suspension from the university. See Academic integrity at the University of Toronto.
Back to top

Lecture materials

Assigned readings give you more in-depth information on ideas covered in lectures. You will not be asked questions relating to readings for the assignments, but they will be useful in studying for the final exam.

Provided PDFs are ~ 10% of their original size for portability, at the expense of fidelity.

For pre-lecture readings and in-class note-taking, please see the Quercus page Lecture Materials and Handouts. The final versions (with post-hoc errata and/or other modifications) will be posted here on the course website.

  1. Introduction.
    • Date: 9 Jan.
    • Reading: Manning & Schütze: Sections 1.3-1.4.2, Sections 6.0-6.2.1
  2. Corpora, language models, Zipf, and smoothing.
    • Dates: 11, 16 Jan.
    • Reading: Manning & Schütze: Section 1.4.3, Section 6.1-6.2.2, Section 6.2.5, Sections 6.3
    • Reading: Jurafsky & Martin: 3.4-3.5
  3. Features and classification.
    • Dates: 18, 23 Jan.
    • Reading: Manning & Schütze: Section 16.1, 16.4
    • Reading: Jurafsky & Martin (2nd ed): Sections 5.1-5.5
  4. Entropy and decisions.
    • Dates: 25, 30 Jan.
    • Reading: Manning & Schütze: Sections 2.2, 5.3-5.5
  5. Neural models of language.
    • Dates: 1, 6 Feb.
    • Reading: DL (Goodfellow et al.). Sections: 6.3, 6.6, 10.2, 10.5, 10.10
    • (Optional) Supplementary resources and readings:
      • Mikolov, Tomas, et al. "Efficient estimation of word representations in vector space. (2013)" link
      • Xin Rong. "word2Vec Parameter Learning Explained". link
      • Bolukbasi, Tolga, et al. "Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings." NeurIPS (2016). link
      • Greff, Klaus, et al. "LSTM: A search space odyssey." IEEE (2016). link
      • Jozefowicz, Sutskever et al. "An empirical exploration of recurrent network architectures." ICML (2015). link
      • GRU: Cho, et al. "Learning phrase representations using RNN encoder-decoder for statistical machine translation." (2014). link
      • ELMo: Peters, Matthew E., et al. "Deep contextualized word representations. (2018)." link
      • Blogs:
      • The Unreasonable Effectiveness of Recurrent Neural Networks. link
      • Colah's Blog. "Understanding LSTM Networks". link.
  6. Machine translation (MT).
    • Dates: 8, 13, 15 Feb.
    • Readings:
      • Manning & Schütze: Sections 13.0, 13.1.2, 13.1.3, 13.2, 13.3, 14.2.2
      • DL (Goodfellow et al.). Sections: 10.3, 10.4, 10.7
      • Vaswani et al. "Attention is all you need." (2017). link
    • (Optional) Supplementary resources and readings:
      • Papineni, et al. "BLEU: a method for automatic evaluation of machine translation." ACL (2002). link
      • Sutskever, Ilya, Oriol Vinyals et al. "Sequence to sequence learning with neural networks."(2014). link
      • Bahdanau, Dzmitry, et al. "Neural machine translation by jointly learning to align and translate."(2014). link
      • Luong, Manning, et al. "Effective approaches to attention-based neural machine translation." arXiv (2015). link
      • Britz, Denny, et al. "Massive exploration of neural machine translation architectures."(2017). link
      • BPE: Sennrich, et al. "Neural machine translation of rare words with subword units." arXiv (2015). link
      • Wordpiece: Wu, Yonghui, et al. "Google's neural machine translation system: Bridging the gap between human and machine translation." arXiv (2016). link
      • Blogs:
      • Distill: Olah & Carter "Attention and Augmented RNNs"(2016). link
      • Jay Allamar. "The Illustrated Transformer". link.
  7. More neural language models.
    • Date: 27 Feb.
    • Readings: No required readings for this lecture.
    • (Optional) Supplementary resources and readings:
      • Bommasani et al. "On the opportunities and risks of foundation models." (2022). link
      • Devlin et al. "BERT: Pre-training of deep bidirectional transformers for language understanding." (2019). link
      • Clark et al. "What does bert look at? an analysis of bert's attention." (2019). link
      • Tenney et al. "BERT rediscovers the classical NLP pipeline." (2019). link
      • Rogers, Anna et al. "A primer in BERTology: What we know about how bert works." TACL(2020). link
      • Lewis et al. "BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension." (2019). link
      • T5: Raffel et al. "Exploring the limits of transfer learning with a unified text-to-text transformer." J. Mach. Learn. Res. 21.140 (2020). link
      • GPT-3: Brown et al. "Language models are few-shot learners." (2020). link
      • InstructGPT: Ouyang, Long, et al. "Training language models to follow instructions with human feedback. arXiv preprint (2022)." link
      • RLHF: Christiano et al. "Deep reinforcement learning from human preferences." (2017). link
      • RLHF: Stiennon et al. "Learning to summarize with human feedback." (2020). link
      • Kaplan et al. "Scaling laws for neural language models." (2020). link
      • Kudo and Richardson. "Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing." (2018). link
      • Token-free models:
      • Clark et al. "CANINE: Pre-training an efficient tokenization-free encoder for language representation." (2021). link
      • Xue et al. "ByT5: Towards a token-free future with pre-trained byte-to-byte models." (2022). link
  8. Hidden Markov Models (HMMs).
    • Dates: 1, 6, 8 Mar
    • Reading: Manning & Schütze: Section 9.2-9.4.1 (an alternative formulation)
    • (Optional) Supplementary resources and readings:
      • Rabiner, Lawrence R. "A tutorial on HMMs and selected applications in speech recognition." (1990). link
      • Stamp, Mark. "A revealing introduction to hidden Markov models." (2004). link
      • Bilmes, Jeff. "A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and HMMs." (1998). link
      • Chen and Goodman. "An empirical study of smoothing techniques for language modeling." (1999). link
      • Hidden Markov Model Toolkit (HTK). link
      • Scikit's HMM. link
  9. Natural Language Understanding (NLU). Slides: NLU-1/2; NLU-2/2, Colab notebook.
    • Dates: 13, 15 Mar.
    • Readings:
      • Maas et al., "Learning Word Vectors for Sentiment Analysis" (2011). link
      • Levesque et al., "The Winograd Schema Challenge" (2012). link
  10. Automatic Speech Recognition (ASR).
    • Dates: 20, 22 Mar.
    • Readings:
      • Jurafsky & Martin SLP3 (3rd ed.): Chapter 16. link
      • Chan et al., "Listen, Attend and Spell." (2015). link
  11. Information Retrieval (IR).
    • Date: 27 Mar.
    • Readings:
      • Jurafsky & Martin SLP3 (3rd ed.): Chapter 14, only the first part (14.1). link
  12. Interpretability.
    • Date: 29 Mar.
    • Readings:
      • Lundberg & Lee. "A Unified Approach to Interpreting Model Predictions" (2017). link
  13. Large Language Models (LLMs).
    • Date: 3 Apr.
    • (Optional) Supplementary resources and readings:
      • GPT-3: Brown et al. "Language models are few-shot learners." (2020). link
      • OpenAI. "ChatGPT: Optimizing Language models for Dialogue." (2022). link
      • Nature News. C. Walker. "ChatGPT listed as author on research papers: many scientists disapprove". (2023). link
      • AI Readings. UofT Academic and Collaborative Technologies (ACT). link
  14. Summary and Review.
    • Date: 5 Apr.

Tutorial materials

Enrolled students: Please see the Quercus page Tutorial Materials. The final versions (with post-hoc errata and/or other modifications) will be posted here on the course website (for anyone auditing).

Assignments

Enrolled students: Please use the Quercus Assignments page for all materials. The final versions (with post-hoc errata and updates) will be posted here (for anyone auditing the course). Here is the ID template that you must submit with your assignments. Here is the MarkUs link you use to submit them.

Extension requests: Please follow the extension request procedure detailed here. A copy of the Special Consideration Form is available here.

Remark requests: Please follow the remarking policy detailed here.


Project

The course project is an optional replacement for the assignments available to graduate students in CSC2511.

Back to top

Past course materials

Fall 2022: course page

Old Exams

  • The old exam repository from UofT libraries is here (it may not contain this particular course's final exams).
  • An (old) final exam from 2017 is here. Please note that while this exam is structurally similar to the current final exam, its material may not correspond to the current syllabus (e.g., all statistical MT questions are out of scope in W23) and may therefore appear esoteric.
Back to top

News and announcements

  • FIRST LECTURE: 9 January at 10h or 11h (check your section on ACORN enrolment).
  • FIRST TUTORIAL: There will be a tutorial in the first week of lectures (i.e., Friday, 13 January).
  • READING WEEK BREAK: The week of Feb. 20-24 - there will be no lectures or tutorials.
  • FINAL EXAM: 22-APR-2023: 14.00-15.00. Exam schedule and (location) info now available here.

Back to top

Frequently asked questions

Please see the Arts and Science departmental student FAQ page for additional answers.

  1. How much Machine Learning background do I need?

    Please look at the posted lecture slides and see how comfortable you are with the middle and later material. Please also consider the waitlist and the eligible students waiting to get into the class. Machine learning is highly recommended but is not a prerequisite for the course. Much of the course content assumes some basic knowledge and appreciation of machine learning, so prior exposure will help. You do not need everything covered in CSC311 Introduction to Machine Learning, but some understanding of machine learning is helpful (for example, Andrew Ng's "Machine Learning for Everyone"). For the first third of the lectures you should at least know supervised learning, classification, features, vectors, feature extraction/engineering, training/test datasets, and naive Bayes; for the remainder, deep learning, softmax, objective functions, basic training mechanics, and so on.

    Also see this instructor answer on piazza.

  2. Are the lectures recorded?

    Students are expected to attend all lectures and tutorials in person, and consistent, high-quality video recordings are not guaranteed, so it is best to assume the answer is "no" and plan accordingly. Any recordings that do exist may be posted under the Class Recordings page on Quercus.

  3. What if I am unable to attend the class?

    If a student has to miss a substantial portion of the course, or consistently misses one of the weekly lecture times and would need to rely on recordings, then they should drop the course to allow someone else on the waiting list to take it.

  4. How to audit the course?

    Students are allowed to audit the course as long as they do not take up any course resources (including instructor/TA/admin time). Auditors can attend lectures when there is space and access any publicly accessible course materials (e.g. a Quercus page that is set to Institution visibility), but they do not get access to anything else. For example, if a registered student in that section (or another section) cannot find a seat at the lecture, the auditor should give up their seat. Auditors should not expect questions to be answered on course resources or assignments/projects to be graded.

  5. Course enrollment issues?

    Unfortunately, the CSC401 course staff and instructors do not have the ability to manage lecture-section enrolment or to move a student from one section to another. You will need to contact your college registrar for anything related to ACORN/course enrolment. If you have further questions or concerns, please do not hesitate to reach out.

  6. Lecture conflicts?

    Unfortunately, the CSC401 course staff and instructors do not have the ability to move a student from one section to another, and given the long waiting lists it is unlikely that you will get a spot in a requested section on ACORN at this late date. One thing that may help: if you remain registered in the section with the conflict, we will allow you to attend lectures in any other in-person section as long as there is space in the classroom. If you cannot attend any of the weekly lecture times in person, you will need to drop CSC401 for this term and hopefully take it in the future.

  7. (Grad. students only) What does doing the 'Project' option mean for course deliverables?

    You can do the project in place of the three assignments. However, you can NOT skip the final exam (which will include material covered by the assignments). Please read the project document and see Quercus for details. The assignments are optional (but recommended for those new to NLP); the final exam is required.

Back to top