Contact information
-
- Instructors: Frank Rudzicz, Raeid Saqur, and Zining Zhu
- Office: TBD
- Office hours: T 10h-11h, on Zoom initially
- Email: csc401-2022-01@cs (add the toronto.edu suffix)
- Forum (Piazza): Piazza
- Quercus: https://q.utoronto.ca/courses/250773
- Email policy: For non-confidential inquiries, consult the Piazza forum first. For confidential assignment-related inquiries, consult the TA associated with the particular assignment. Emails with appropriate subject headings, sent from University of Toronto email addresses, are the least likely to be redirected to junk email folders.
Meeting times
-
- Lectures: MW 10h-11h or 11h-12h (check your section), on Zoom initially and in LM 162 thereafter.
- Tutorials: F 10h-11h or 11h-12h (check your section), on Zoom initially and in LM 162 thereafter.
Course outline
-
This course presents an introduction to natural language computing in applications such as information retrieval and extraction, intelligent web searching, speech recognition, and machine translation. These applications will involve various statistical and machine learning techniques. Assignments will be completed in Python. All code must run on the 'teaching servers'.
The theme for this year is speech and text analysis in a post-truth society.
Prerequisites: CSC207H1 / CSC209H1 ; STA247H1 / STA255H1 / STA257H1
Recommended Preparation: MAT221H1 / MAT223H1 / MAT240H1 is strongly recommended. For advice, contact the Undergraduate Office. The course information sheet is available here.
Readings for this course
-
- Optional: Foundations of Statistical Natural Language Processing, C. Manning and H. Schütze
- Optional: Speech and Language Processing, D. Jurafsky and J.H. Martin
- Optional: Deep Learning, I. Goodfellow, Y. Bengio, and A. Courville
Readings on the theme of the course
- Kao J (2017) More than a Million Pro-Repeal Net Neutrality Comments were Likely Faked.
- Masip J (2017) Deception detection: State of the art and future prospects, Psicothema, 29(2):149-159.
- Wang WY (2017) "Liar, Liar Pants on Fire": A New Benchmark Dataset for Fake News Detection.
Supplementary reading
Evaluation policies
- General
- You will be graded on three homework assignments and a final exam. The relative proportions of these grades are as follows:
- Assignment with lowest mark: 15%
- Assignment with median mark: 20%
- Assignment with highest mark: 25%
- Final exam: 40%
- Lateness
- A 10% (absolute) deduction is applied to late homework one minute after the due time. Thereafter, an additional 10% deduction is applied every 24 hours, up to 72 hours late, at which time the homework will receive a mark of zero. No exceptions will be made except in emergencies, including medical emergencies, at the instructor's discretion.
- Final
- A mark of at least D- on the final exam is required to pass the course. In other words, if you receive an F on the final exam then you automatically fail the course, regardless of your performance in the rest of the course.
- Collaboration and plagiarism
- No collaboration on the homework is permitted. The work you submit must be your own. 'Collaboration' in this context includes, but is not limited to, sharing of source code, correction of another's source code, copying of written answers, and sharing of answers prior to submission of the work (including the final exam). Failure to observe this policy is an academic offense, carrying a penalty ranging from a zero on the homework to suspension from the university. See Academic integrity at the University of Toronto.
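The grade weighting and lateness rules above can be sketched in a few lines of Python. This is only an illustration, not an official grade calculator: the function names are hypothetical, marks are assumed to be percentages out of 100, any positive lateness is assumed to trigger the first deduction, and each further 10-point deduction is assumed to apply once a full 24 hours has elapsed. The requirement to pass the final exam is not modelled.

```python
from datetime import timedelta

# Weights in percentage points, taken from the evaluation policy:
# lowest, median, and highest assignment marks, then the final exam.
ASSIGNMENT_WEIGHTS = (15, 20, 25)
FINAL_EXAM_WEIGHT = 40


def late_deduction(lateness: timedelta) -> int:
    """Absolute deduction in percentage points (100 means a mark of zero)."""
    if lateness <= timedelta(0):
        return 0  # submitted on time
    if lateness >= timedelta(hours=72):
        return 100  # 72 hours late or more: mark of zero
    # Assumption: 10 points once late, plus 10 per full 24 hours elapsed.
    full_days = lateness // timedelta(hours=24)
    return 10 * (1 + full_days)


def apply_late_penalty(mark: float, lateness: timedelta) -> float:
    """Subtract the absolute deduction from a mark, never going below zero."""
    return max(0.0, mark - late_deduction(lateness))


def course_mark(assignment_marks: list[float], exam_mark: float) -> float:
    """Weight the three assignment marks by rank, then add the final exam."""
    if len(assignment_marks) != 3:
        raise ValueError("expected exactly three assignment marks")
    ranked = sorted(assignment_marks)  # lowest, median, highest
    weighted = sum(w * m for w, m in zip(ASSIGNMENT_WEIGHTS, ranked))
    return (weighted + FINAL_EXAM_WEIGHT * exam_mark) / 100


if __name__ == "__main__":
    # Assignments of 80, 90, and 70 with a final exam mark of 75.
    print(course_mark([80, 90, 70], 75))  # 79.0
```

Because the weights follow the rank of each mark rather than the assignment number, the strongest assignment always counts for the most.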
Syllabus
-
The following is an estimate of the topics to be covered in the course and is subject to change.
- Introduction to corpus-based linguistics
- N-gram models
- Entropy and information theory
- Hidden Markov models
- Machine translation (statistical and neural)
- Neural language models and word embedding
- Articulatory and acoustic phonetics
- Automatic speech recognition
- Speech synthesis
- Information retrieval
- Dialogue and chatbots
Calendar
-
- 10 January: First lecture
- 17 January: Last day to add CSC 2511
- 23 January: Last day to add CSC 401
- 11 February: Assignment 1 due
- 20 February: Last day to drop CSC 2511
- 21-25 February: Reading week (no lectures or tutorial)
- 11 March: Assignment 2 due
- 14 March: Last day to drop CSC 401
- 8 April: Last lecture
- 8 April: Assignment 3 due
- 8 April: Project report due
- 12 April: Final exam
News and announcements
-
- FIRST LECTURE: 10 January at 10h or 11h (check your section) on Zoom.
- FIRST TUTORIAL: TBD.
- FINAL EXAM: 12 April, 19h-22h in EX 200.
Lecture materials
-
Assigned readings give you more in-depth information on ideas covered in lectures. You will not be asked questions relating to readings for the assignments, but they will be useful in studying for the final exam.
Provided PDFs are about 10% of their original size for portability, at the expense of fidelity.
- Introduction.
- Date: 10 Jan.
- Reading: Manning & Schütze: Sections 1.3-1.4.2, Sections 6.0-6.2.1
- Corpora, language models, Zipf, and smoothing.
- Date: 12 Jan.
- Reading: Manning & Schütze: Section 1.4.3, Sections 6.1-6.2.2, Section 6.2.5, Section 6.3
- Reading: Jurafsky & Martin: 3.4-3.5
- Features and classification.
- Date: 19 Jan
- Reading: Manning & Schütze: Section 16.1, 16.4
- Reading: Jurafsky & Martin (2nd ed): Sections 5.1-5.5
- Entropy and decisions.
- Date: 24 Jan
- Reading: Manning & Schütze: Sections 2.2, 5.3-5.5
- Neural language models.
- Date: 31 Jan, 2 Feb
- Reading: DL (Goodfellow et al.). Sections: 6.3, 6.6, 10.2, 10.5, 10.10
- Machine translation.
- Date: 7, 9, 14 Feb
- Reading: Manning & Schütze: Sections 13.0, 13.1.2, 13.1.3, 13.2, 13.3, 14.2.2
- Reading: DL (Goodfellow et al.). Sections: 10.3, 10.4, 10.7
- Hidden Markov Models.
- Date: 16, 28 Feb; 2 Mar
- Reading: Manning & Schütze: Section 9.2-9.4.1 (an alternative formulation)
- Speech.
- Date: 7, 9 Mar
- ASR.
- Date: 14, 16 Mar
- Speech synthesis.
- Date: 21 Mar
- Language understanding and information retrieval.
- Date: 23, 28 Mar
- Dialogue.
- Date: 4 Apr
- Review.
- Date: 6 Apr
Tutorial materials
Assignments
Here is the ID template that you must submit with your assignments. Here is the MarkUs link you use to submit them.
- Assignment 1: Identifying political persuasion on Reddit, the associated requirements file, the rubric, and the data, features, and code so you can work from home if you don't have a teach.cs account.
- Assignment 2: Neural machine translation and its rubric and its requirements.
- Assignment 3: Speech and its rubric.
Project
The course project is an optional replacement for the assignments, available only to graduate students in CSC2511.