Contact information

Instructors: Frank Rudzicz and Chloé Pou-Prom
Office: The Vector Institute
Office hours: M 11h00-12h00
Email: csc401-2019-01@cs (add the toronto.edu suffix)
Forum (Piazza): http://piazza.com/utoronto.ca/winter2019/csc4012511
Email policy: For non-confidential inquiries, consult the Piazza forum first. For confidential assignment-related inquiries, contact the TA associated with that assignment. Emails with appropriate subject headings that are sent from University of Toronto addresses are the least likely to be redirected to junk email folders.

TAs:
A1: Zhewei Sun and Maryam Fallah
A2: Mohamed Abdalla, Raeid Saqur and Charlie Zhang
A3: Amanjit Kainth, Jianan Chen and Charlie Zhang
Fix the suffixes in the linked email addresses.

Meeting times

Lectures: MF 10h00-11h00 in PB B250
Tutorials: W 10h00-11h00 in MB 128

Course outline

This course presents an introduction to natural language computing in applications such as information retrieval and extraction, intelligent web searching, speech recognition, and machine translation. These applications will involve various statistical and machine learning techniques. Assignments will be completed in Python. All code must run on the 'teaching servers'.

The theme for this year is speech and text analysis in a post-truth society.

Prerequisites: CSC207H1 / CSC209H1 ; STA247H1 / STA255H1 / STA257H1
Recommended Preparation: MAT221H1 / MAT223H1 / MAT240H1 is strongly recommended. For advice, contact the Undergraduate Office.

The course information sheet is available here.

Unofficial statistics summarizing the 2014 term are available here.

Unofficial statistics summarizing the 2013 term are available here.


Readings for this course

Optional: Foundations of Statistical Natural Language Processing, C. Manning and H. Schütze
Optional: Speech and Language Processing, D. Jurafsky and J.H. Martin

Readings on the theme of the course


Supplementary reading

Topic | Title | Author(s)
Smoothing | An Empirical Study of Smoothing Techniques for Language Modeling | Stanley F Chen and Joshua Goodman
Hidden Markov models | A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition | Lawrence R. Rabiner
Sentence alignment | A Program for Aligning Sentences in Bilingual Corpora | William A. Gale and Kenneth W. Church
Decoding for MT | Fast Decoding and Optimal Decoding for Machine Translation | Ulrich Germann, Michael Jahr, Kevin Knight, Daniel Marcu, Kenji Yamada
Improving IBM Model-1 | Improving IBM Word-Alignment Model 1 | Robert C. Moore
HMMs for word alignment | HMM-based word alignment in statistical translation | Stephan Vogel, Hermann Ney, Christoph Tillmann
Phonetic alphabets | ASCII Phonetic Symbols for the World's Languages: Worldbet | James L. Hieronymus
Gaussian mixture models | Robust Text-Independent Speaker Identification Using Gaussian Mixture Speaker Models | Douglas A. Reynolds and Richard C. Rose
Transformation-based learning | Transformation-Based Error-Driven Learning and Natural Language Processing: A Case Study in Part-of-Speech Tagging | Eric Brill
Feature selection | Correlation-based feature selection for machine learning | Mark A. Hall
Feature selection | minimum redundancy, maximum relevance | Hanchuan Peng
Sentence boundaries | Sentence boundaries | Read J, Dridan R, Oepen S, Solberg LJ

Evaluation policies

General
You will be graded on three homework assignments and a final exam. The relative proportions of these grades are as follows:
Assignment 1: 20%
Assignment 2: 20%
Assignment 3: 20%
Final exam: 40%
Graduate students enrolled in CSC2511 will have the option of undertaking a course project (instead of the assignments), in teams of at most two students, for 60% of the course grade (the final exam, worth 40%, is still required). Information on the course project can be found here.
Lateness
A 10% deduction is applied to late homework one minute after the due time. Thereafter, an additional 10% deduction is applied every 24 hours, up to 72 hours late, at which point the homework receives a mark of zero. No exceptions will be made except in emergencies, including medical emergencies, at the instructor's discretion.
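As an illustration only, the lateness rule can be sketched in Python. This is one reading of the policy, not an official calculator: it assumes the 24-hour steps are measured from the due time, that a step takes effect exactly on the 24-hour boundary, and that submissions less than one minute late are not penalised. The function name `late_deduction_percent` and the due date in the usage lines are hypothetical.

```python
from datetime import datetime, timedelta


def late_deduction_percent(due: datetime, submitted: datetime) -> int:
    """Return the percentage deducted under one reading of the lateness policy."""
    late_by = submitted - due
    # Assumption: the first deduction starts one full minute after the due time.
    if late_by < timedelta(minutes=1):
        return 0
    # At 72 hours late the homework receives a mark of zero.
    if late_by >= timedelta(hours=72):
        return 100
    # 10% initially, plus 10% for each completed 24-hour period.
    completed_days = late_by // timedelta(hours=24)
    return 10 * (1 + completed_days)


# Hypothetical due time, used only to exercise the function.
due = datetime(2019, 2, 11, 23, 59)
print(late_deduction_percent(due, due + timedelta(minutes=5)))
print(late_deduction_percent(due, due + timedelta(hours=30)))
print(late_deduction_percent(due, due + timedelta(hours=72)))
```

Working in whole percentages avoids floating-point rounding in the step calculation. Confirm boundary cases with the instructors before relying on this reading.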
Final
A mark of at least D- on the final exam is required to pass the course. In other words, if you receive an F on the final exam then you automatically fail the course, regardless of your performance in the rest of the course.
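A minimal sketch of how the stated weights and the final-exam rule combine is shown below. It covers the three-assignment option only, not the CSC2511 project option. It assumes marks are percentages and that D- corresponds to 50%, which is not stated on this page. The names `WEIGHTS`, `FINAL_EXAM_MIN` and `course_result` are hypothetical.

```python
# Weights in percent, taken from the grading breakdown above.
WEIGHTS = {"a1": 20, "a2": 20, "a3": 20, "final": 40}

# Assumption: a D- corresponds to 50% on the final exam.
FINAL_EXAM_MIN = 50.0


def course_result(marks: dict) -> tuple:
    """Return (weighted course mark, True if the final-exam rule forces a fail)."""
    weighted = sum(WEIGHTS[part] * marks[part] for part in WEIGHTS) / 100
    fails_on_exam = marks["final"] < FINAL_EXAM_MIN
    return weighted, fails_on_exam


# Strong assignments do not compensate for an F on the final exam.
mark, fails_on_exam = course_result({"a1": 90, "a2": 90, "a3": 90, "final": 40})
print(mark, fails_on_exam)
```

The second return value only flags the automatic-fail condition. A passing exam mark does not by itself guarantee a passing course mark.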
Collaboration and plagiarism
No collaboration on the homeworks is permitted. The work you submit must be your own. 'Collaboration' in this context includes but is not limited to sharing of source code, correction of another's source code, copying of written answers, and sharing of answers prior to submission of the work (including the final exam). Failure to observe this policy is an academic offense, carrying a penalty ranging from a zero on the homework to suspension from the university. See Academic integrity at the University of Toronto.

Syllabus

The following is an estimate of the topics to be covered in the course and is subject to change.

  • Introduction to corpus-based linguistics
  • N-gram models
  • Entropy and information theory
  • Hidden Markov models
  • Statistical machine translation
  • Neural language models and word embedding
  • Articulatory and acoustic phonetics
  • Automatic speech recognition
  • Speech synthesis
  • Information retrieval
  • Dialogue and chatbots

Calendar

7 January: First lecture
20 January: Last day to add CSC 401
21 January: Last day to add CSC 2511
11 February: Assignment 1 due
18-22 February: Reading week, no lectures or tutorial
25 February: Last day to drop CSC 2511
8 March: Assignment 2 due
17 March: Last day to drop CSC 401
5 April: Last lecture
5 April: Assignment 3 due
5 April: Project report due
23 April: Final exam

See Dates for undergraduate students.

See Dates for graduate students.


News and announcements


Lecture materials

Assigned readings give you more in-depth information on ideas covered in lectures. You will not be asked questions relating to readings for the assignments, but they will be useful in studying for the final exam.

Provided PDFs are compressed to roughly 10% of their original size for portability, at the expense of fidelity.

  1. Introduction.
    • Date: 7 Jan
    • Reading: Manning & Schütze: Sections 1.3-1.4.2, Sections 6.0-6.2.1
  2. Corpora, language models, Zipf, and smoothing.
    • Date: 11, 14 Jan
    • Reading: Manning & Schütze: Sections 1.4.3, 6.1-6.2.2, 6.2.5, 6.3
    • Reading: Jurafsky & Martin: 3.4-3.5
  3. Entropy and decisions.
    • Date: 18 Jan
    • Reading: Manning & Schütze: Sections 2.2, 5.3-5.5
  4. Features and classification.
    • Date: 25 Jan
    • Reading: Manning & Schütze: Sections 16.1, 16.4
    • Reading: Jurafsky & Martin (2nd ed): Sections 5.1-5.5
  5. Hidden Markov models pt 1 and pt 2.
    • Date: 28 Jan
    • Reading: Manning & Schütze: Sections 9.2-9.4.1 (an alternative formulation)
  6. Statistical machine translation pt 1, pt 2, and pt 3.
    • Date: 8 Feb
    • Reading: Manning & Schütze: Sections 13.0, 13.1.2, 13.1.3, 13.2, 13.3, 14.2.2
  7. Neural language models.
    • Date: 1 Mar
  8. Speech.
    • Date: 8 Mar
  9. ASR.
    • Date: 18 Mar
  10. Speech synthesis.
    • Date: 25 Mar
  11. Intelligent dialogue agents.
    • Date: 1 Apr
  12. Review.
    • Date: 5 Apr
    • Note that exam rooms have changed! These slides are out-of-date.
  13. Annotated slides.

Tutorial materials


Assignments

Here is the ID template that you must submit with your assignments. Here is the MarkUs link you use to submit them.

Project

The course project is an optional replacement for the assignments available to graduate students in CSC2511.


Midterm

Here is the not-for-marks midterm: Midterm 2019.

2018 lectures

Here are the lectures from last year's iteration of this course.
