CSC401/2511 :: Natural Language Computing

Contact information

Instructors		Frank Rudzicz, Raeid Saqur, and Zining Zhu.
Office		TBD
Office hours		T 10h-11h on Zoom initially
Email		csc401-2022-01@cs (add the toronto.edu suffix)

Forum (Piazza)		Piazza
Quercus		https://q.utoronto.ca/courses/250773

Email policy		For non-confidential inquiries, consult the Piazza forum first. For confidential assignment-related inquiries, consult the TA associated with the particular assignment. Emails sent from University of Toronto addresses with appropriate subject headings are the least likely to be redirected to junk email folders.

Meeting times

Lectures		MW	10h-11h or 11h-12h (check your section), on Zoom initially and in LM 162 thereafter.
Tutorials		F	10h-11h or 11h-12h (check your section), on Zoom initially and in LM 162 thereafter.

Course outline

This course presents an introduction to natural language computing in applications such as information retrieval and extraction, intelligent web searching, speech recognition, and machine translation. These applications will involve various statistical and machine learning techniques. Assignments will be completed in Python. All code must run on the 'teaching servers'.

The theme for this year is speech and text analysis in a post-truth society.

Prerequisites: CSC207H1 / CSC209H1 ; STA247H1 / STA255H1 / STA257H1
Recommended Preparation: MAT221H1 / MAT223H1 / MAT240H1 is strongly recommended. For advice, contact the Undergraduate Office.

The course information sheet is available here.

Readings for this course

Optional	Foundations of Statistical Natural Language Processing	C. Manning and H. Schütze	Errata, Free online edition (free if you're on a UofT computer or VPN)
Optional	Speech and Language Processing	D. Jurafsky and J.H. Martin	Errata
Optional	Deep Learning	I Goodfellow, Y Bengio, and A Courville

Readings on the theme of the course

Kao J (2017) More than a Million Pro-Repeal Net Neutrality Comments were Likely Faked.
Masip J (2017) Deception detection: State of the art and future prospects, Psicothema, 29(2):149-159.
Wang WY (2017) "Liar, Liar Pants on Fire": A New Benchmark Dataset for Fake News Detection.

Supplementary reading

Topic	Title	Author(s)	Misc
Smoothing	An Empirical Study of Smoothing Techniques for Language Modeling	Stanley F Chen and Joshua Goodman
Hidden Markov models	A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition	Lawrence R. Rabiner
Sentence alignment	A Program for Aligning Sentences in Bilingual Corpora	William A. Gale and Kenneth W. Church
Seq2Seq	Sequence to Sequence Learning with Neural Networks	Ilya Sutskever, Oriol Vinyals, Quoc V. Le
Attention	Attention Is All You Need	Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin
Attention-based NMT	Effective Approaches to Attention-based Neural Machine Translation	Minh-Thang Luong, Hieu Pham, Christopher D. Manning
Phonetic alphabets	ASCII Phonetic Symbols for the World's Languages: Worldbet	James L. Hieronymus
Gaussian mixture models	Robust Text-Independent Speaker Identification Using Gaussian Mixture Speaker Models	Douglas A. Reynolds and Richard C. Rose
Transformation-based learning	Transformation-Based Error-Driven Learning and Natural Language Processing: A Case Study in Part-of-Speech Tagging	Eric Brill
Feature selection	Correlation-based feature selection for machine learning	Mark A. Hall
Feature selection	minimum redundancy, maximum relevance	Hanchuan Peng
Sentence boundaries	Sentence boundaries	Read J, Dridan R, Oepen S, Solberg LJ
ML History	What is science for? The Lighthill report on artificial intelligence reinterpreted	Jon Agar
MT w/ Attention	Neural machine translation by jointly learning to align and translate	Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio
MT w/ Attention	Massive exploration of neural machine translation architectures	Britz, Denny, et al.
Attention	Attention is not Explanation	Sarthak Jain, Byron C. Wallace
Attention	Attention is not not Explanation	Sarah Wiegreffe, Yuval Pinter

Evaluation policies

General

You will be graded on three homework assignments and a final exam. The relative proportions of these grades are as follows:

Assignment with lowest mark		15%
Assignment with median mark		20%
Assignment with highest mark		25%
Final exam		40%
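
As a rough illustration of how the weights in the table above combine, here is a minimal sketch. The function name `course_grade` and the assumption that every mark is a percentage between 0 and 100 are illustrative only and are not part of the official policy. The sketch also ignores late deductions and the separate rule about passing the final exam.

```python
def course_grade(assignment_marks, exam_mark):
    """Combine three assignment marks and a final exam mark (all in percent).

    Hypothetical helper: the weights come from the table above, everything
    else (names, percent scale) is an assumption made for illustration.
    """
    if len(assignment_marks) != 3:
        raise ValueError("expected exactly three assignment marks")

    # Sorting assigns the 15% / 20% / 25% weights to the lowest, median,
    # and highest assignment marks respectively.
    lowest, median, highest = sorted(assignment_marks)

    # Integer weights avoid small floating-point rounding surprises.
    return (15 * lowest + 20 * median + 25 * highest + 40 * exam_mark) / 100


if __name__ == "__main__":
    # Assignments marked 70, 90, and 80, with 75 on the final exam.
    print(course_grade([70, 90, 80], 75))
```

Because the weights are attached to the rank of each mark rather than to a specific assignment, the order in which the three assignment marks are passed in does not matter.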

Graduate students enrolled in CSC2511 will have the option of undertaking a course project (instead of the assignments), in teams of at most two students, for 60% of the course grade (the final exam, worth 40%, is still required). Information on the course project can be found here.

Lateness

A 10% (absolute) deduction is applied to late homework starting one minute after the due time. Thereafter, an additional 10% deduction is applied every 24 hours until the homework is 72 hours late, at which point it receives a mark of zero. No exceptions will be made except in emergencies, including medical emergencies, at the instructor's discretion.
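
The lateness rule above can be sketched as follows. This is one reading of the policy, not an official calculator. It assumes that "absolute" means percentage points subtracted from the mark, that the second 10% applies at exactly 24 hours late, and that a mark cannot go below zero. The name `late_adjusted_mark` is hypothetical.

```python
def late_adjusted_mark(raw_mark, minutes_late):
    """Apply the late penalty to a mark given in percent.

    Assumptions (not confirmed by the policy text): deductions are in
    percentage points, each further 10 points applies at the 24-hour
    boundary itself, and the result is floored at zero.
    """
    if minutes_late < 1:
        return raw_mark  # on time: no deduction
    if minutes_late >= 72 * 60:
        return 0  # 72 hours late or more: mark of zero

    # 10 points as soon as the work is late, plus 10 more for every
    # complete 24-hour period that has elapsed since the due time.
    full_days = minutes_late // (24 * 60)
    deduction = 10 * (1 + full_days)
    return max(0, raw_mark - deduction)


if __name__ == "__main__":
    for minutes in (0, 1, 24 * 60, 48 * 60, 72 * 60):
        print(minutes, late_adjusted_mark(85, minutes))
```

Under these assumptions, a submission worth 85% loses 10 points when it is one minute late, 20 points at 24 hours, 30 points at 48 hours, and receives zero at 72 hours.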

Final

A mark of at least D- on the final exam is required to pass the course. In other words, if you receive an F on the final exam then you automatically fail the course, regardless of your performance in the rest of the course.

Collaboration and plagiarism

No collaboration on the homework is permitted. The work you submit must be your own. 'Collaboration' in this context includes but is not limited to sharing of source code, correction of another's source code, copying of written answers, and sharing of answers prior to submission of the work (including the final exam). Failure to observe this policy is an academic offense, carrying a penalty ranging from a zero on the homework to suspension from the university. See Academic integrity at the University of Toronto.

Syllabus

The following is an estimate of the topics to be covered in the course and is subject to change.

Introduction to corpus-based linguistics
N-gram models
Entropy and information theory
Hidden Markov models
Machine translation (statistical and neural)
Neural language models and word embedding
Articulatory and acoustic phonetics
Automatic speech recognition
Speech synthesis
Information retrieval
Dialogue and chatbots

Calendar

10 January		First lecture
17 January		Last day to add CSC 2511
23 January		Last day to add CSC 401
11 February		Assignment 1 due
20 February		Last day to drop CSC 2511
21--25 February		Reading week -- no lectures or tutorial
11 March		Assignment 2 due
14 March		Last day to drop CSC 401
8 April		Last lecture
8 April		Assignment 3 due
8 April		Project report due
12 April		Final exam

See Dates for undergraduate students.

See Dates for graduate students.

News and announcements

FIRST LECTURE: 10 January at 10h or 11h (check your section) on Zoom.
FIRST TUTORIAL: TBD.
FINAL EXAM: 12 April, 19h-22h in EX 200.

Lecture materials

Assigned readings give you more in-depth information on ideas covered in lectures. You will not be asked questions relating to readings for the assignments, but they will be useful in studying for the final exam.

Provided PDFs are compressed to about 10% of their original size for portability, at the expense of fidelity.

Introduction.
- Date: 10 Jan
- Reading: Manning & Schütze: Sections 1.3-1.4.2, Sections 6.0-6.2.1
Corpora, language models, Zipf, and smoothing.
- Date: 12 Jan
- Reading: Manning & Schütze: Section 1.4.3, Sections 6.1-6.2.2, Section 6.2.5, Section 6.3
- Reading: Jurafsky & Martin: 3.4-3.5
Features and classification.
- Date: 19 Jan
- Reading: Manning & Schütze: Sections 16.1, 16.4
- Reading: Jurafsky & Martin (2nd ed): Sections 5.1-5.5
Entropy and decisions.
- Date: 24 Jan
- Reading: Manning & Schütze: Sections 2.2, 5.3-5.5
Neural language models.
- Date: 31 Jan, 2 Feb
- Reading: DL (Goodfellow et al.). Sections: 6.3, 6.6, 10.2, 10.5, 10.10
Machine translation.
- Date: 7, 9, 14 Feb
- Reading: Manning & Schütze: Sections 13.0, 13.1.2, 13.1.3, 13.2, 13.3, 14.2.2
- Reading: DL (Goodfellow et al.). Sections: 10.3, 10.4, 10.7
Hidden Markov Models.
- Date: 16, 28 Feb; 2 Mar
- Reading: Manning & Schütze: Sections 9.2-9.4.1 (an alternative formulation)
Speech.
- Date: 7, 9 Mar
ASR.
- Date: 14, 16 Mar
Speech synthesis.
- Date: 21 Mar
Language understanding and information retrieval.
- Date: 23, 28 Mar
Dialogue.
- Date: 4 Apr
Review.
- Date: 6 Apr

Tutorial materials

Assignments

Here is the ID template that you must submit with your assignments. Here is the MarkUs link you use to submit them.

Assignment 1: Identifying political persuasion on Reddit, the associated requirements file, the rubric, and the data, features, and code so you can work from home if you don't have a teach.cs account.
Assignment 2: Neural machine translation and its rubric and its requirements.
Assignment 3: Speech and its rubric.

Project

The course project is an optional replacement for the assignments available to graduate students in CSC2511.

Project handout

Midterm

2021 lectures

Here are the lectures from last year's iteration of this course.

CSC401/2511 - Natural Language Computing

Spring 2022
