Contact information

Instructor: Frank Rudzicz
Office: BA 4261
Office hours: MW 11h11-12h00, M 16h00-17h00
Office phone: 416 946 8573
Email: frank@cdf.utoronto.[CANADA] (fix the suffix)
Forum: https://csc.cdf.toronto.edu/bb/YaBB.pl?board=CSC401H1S-CSC2511H1S
Email policy: For non-confidential inquiries, consult the CDF forum first. For confidential assignment-related inquiries, consult the TA associated with the particular assignment. Emails with appropriate subject headings sent from University of Toronto addresses are the least likely to be redirected to junk email folders.

TAs: Julian Brooke (Assignment 1), Aida Nematzadeh Chekoudar (Assignment 2), and Siavash Kazemian (Assignment 3). Fix the suffixes in the linked email addresses.

Meeting times

Lectures: MW 10h00-11h00 in LM 158
Tutorials: F 10h00-11h00 in LM 158

Course outline

This course presents an introduction to natural language computing in applications such as information retrieval and extraction, intelligent web searching, speech recognition, and multi-lingual systems including machine translation. These applications will involve techniques such as n-grams, part-of-speech tagging, semantic distance metrics, indexing, entropy, hidden Markov models, and corpus analysis. Assignments will be completed in Python and MATLAB (with optional C/C++ modules at the student's discretion). All code must run on the CDF machines.

Prerequisites: CSC 207, 209, or 228; STA 247, 255, or 257; and either a CGPA of 3.0 or higher or a CSC subject POSt. MAT 223 or 240 is strongly recommended. Please see http://www.cdf.toronto.edu/~clarke/ugn/20111/prereq.html for information on prerequisites. For advice, contact the Undergraduate Office, in rooms 4252-4 of the Bahen Centre.

The course information sheet is available here.

Unofficial statistics summarizing the 2012 term are available here.

Readings for this course

Required: Foundations of Statistical Natural Language Processing, C. Manning and H. Schutze
Optional: Speech and Language Processing, D. Jurafsky and J.H. Martin

Supplementary reading

Topic | Title | Author(s)
Hidden Markov models | A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition | Lawrence R. Rabiner
Sentence alignment | A Program for Aligning Sentences in Bilingual Corpora | William A. Gale and Kenneth W. Church
Decoding for MT | Fast Decoding and Optimal Decoding for Machine Translation | Ulrich Germann, Michael Jahr, Kevin Knight, Daniel Marcu, Kenji Yamada
Improving IBM Model-1 | Improving IBM Word-Alignment Model 1 | Robert C. Moore
HMMs for word alignment | HMM-based word alignment in statistical translation | Stephan Vogel, Hermann Ney, Christoph Tillmann
Phonetic alphabets | ASCII Phonetic Symbols for the World's Languages: Worldbet | James L. Hieronymus
Gaussian mixture models | Robust Text-Independent Speaker Identification Using Gaussian Mixture Speaker Models | Douglas A. Reynolds and Richard C. Rose
Transformation-based learning | Transformation-Based Error-Driven Learning and Natural Language Processing: A Case Study in Part-of-Speech Tagging | Eric Brill

Evaluation policies

General
You will be graded on three homework assignments and a final exam. The relative proportions of these grades are as follows:
Assignment 1: 20%
Assignment 2: 20%
Assignment 3: 20%
Final exam: 40%
Graduate students enrolled in CSC2511 will have the option of undertaking a course project (instead of the assignments), in teams of at most two students, for 60% of the course grade (the final exam, worth 40%, is still required). Information on the course project can be found here.
Lateness
A 10% deduction is applied to late homework one minute after the due time. Thereafter, an additional 10% deduction is applied every 24 hours up to 72 hours late, at which time the homework will receive a mark of zero. No exceptions will be made except in emergencies, including medical emergencies, at the instructor's discretion.
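As a rough illustration of how this schedule works out, the following Python sketch computes the deduction for a given lateness. It is not an official marking tool. The function name `late_deduction_percent`, the choice to measure lateness in whole minutes, the reading that the further deductions land at exactly 24 and 48 hours after the due time, and the treatment of deductions as percentage points are all assumptions made for the example.

```python
MINUTES_PER_DAY = 24 * 60
ZERO_MARK_AFTER_MINUTES = 72 * 60  # 72 hours late: the homework receives zero


def late_deduction_percent(minutes_late: int) -> int:
    """Return the deduction, in percentage points, for a late submission.

    Assumption: lateness is measured in whole minutes after the due time,
    and each additional 10% applies at exactly 24 and 48 hours late.
    """
    if minutes_late < 1:
        return 0  # on time
    if minutes_late >= ZERO_MARK_AFTER_MINUTES:
        return 100  # mark of zero
    full_days_late = minutes_late // MINUTES_PER_DAY
    # 10% as soon as the homework is late, plus 10% per full 24 hours.
    return 10 * (1 + full_days_late)


if __name__ == "__main__":
    for minutes in (0, 1, 25 * 60, 50 * 60, 72 * 60):
        print(minutes, late_deduction_percent(minutes))
```

Under these assumptions, a submission that is 25 hours late loses 20 percentage points, and one that is 72 hours late or more receives zero.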
Final
A mark of at least D- on the final exam is required to pass the course. In other words, if you receive an F on the final exam then you automatically fail the course, regardless of your performance in the rest of the course.
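To see how the weights listed under General combine with this exam requirement, here is a small Python sketch. It is only an illustration: the names `WEIGHTS`, `course_mark`, and `meets_exam_requirement` are invented for the example, marks are assumed to be percentages, and the 50% cutoff for a D- is an assumption that should be checked against the university's grading scale.

```python
# Weights in percent, as listed under General.
WEIGHTS = {
    "assignment1": 20,
    "assignment2": 20,
    "assignment3": 20,
    "final_exam": 40,
}

# Assumption: a D- corresponds to a mark of at least 50%.
EXAM_PASS_PERCENT = 50


def course_mark(marks: dict) -> float:
    """Return the weighted course mark (0-100) from component percentages."""
    return sum(WEIGHTS[name] * marks[name] for name in WEIGHTS) / 100


def meets_exam_requirement(exam_percent: float) -> bool:
    """Return True if the final exam mark is at least the assumed D- cutoff."""
    return exam_percent >= EXAM_PASS_PERCENT


if __name__ == "__main__":
    marks = {"assignment1": 80, "assignment2": 90, "assignment3": 70, "final_exam": 60}
    print(course_mark(marks), meets_exam_requirement(marks["final_exam"]))
```

Note that the two checks are independent: a high weighted mark does not compensate for an exam mark below the cutoff.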
Collaboration and plagiarism
No collaboration on the homeworks is permitted. The work you submit must be your own. 'Collaboration' in this context includes but is not limited to sharing of source code, correction of another's source code, copying of written answers, and sharing of answers prior to submission of the work (including the final exam). Failure to observe this policy is an academic offense, carrying a penalty ranging from a zero on the homework to suspension from the university. See Academic integrity at the University of Toronto.

Syllabus

The following is an estimate of the topics to be covered in the course and is subject to change.

  • Introduction to corpus-based linguistics
  • Text categorization
  • n-gram models
  • Entropy
  • Part-of-speech tagging
  • Markov models
  • Statistical machine translation
  • Automatic speech recognition
  • Information retrieval
  • Text summarization

Calendar

9 January: First lecture
22 January: Last day to add CSC 401
23 January: Last day to add CSC 2511
10 February: Assignment 1 due
20--24 February: Reading week -- no lectures or tutorial
27 February: Last day to drop CSC 2511
9 March: Assignment 2 due
11 March: Last day to drop CSC 401
4 April: Last lecture
6 April: Assignment 3 due
12 April: Final exam

See Dates for undergraduate students.

See Dates for graduate students.

News and announcements

  • FIRST LECTURE: 9 January at 10h00 in LM158.
  • FIRST TUTORIAL: 20 January at 10h00 in LM158.
  • EXTRA OFFICE HOURS: Mondays from 16h00-17h00 in BA4261, in addition to the other two office hours.
  • EXTENSION: Assignment 1 due by 19h00 (7pm) on 13 February.
  • READING WEEK: No lectures, tutorials, or office hours during the week of 20 February. The instructor is available by appointment, however.
  • FINAL EXAM: 12 April, 9h00--12h00 in ES1050.
  • TUTORIAL REPLACED: The 9 March tutorial is cancelled and is being replaced by office hours on 7 March (17h00--18h00) in Pratt 271 and on 8 March (between 12h00 and 16h30 -- check the CDF message boards).

Lecture materials

Assigned readings give you more in-depth information on ideas covered in lectures. You will not be asked questions relating to readings for the assignments, but they will be very useful in studying for the final exam.

Week 1 (9, 11 Jan.)
  • Introduction
  • Language models and corpora
  Assigned reading: Manning and Schutze, sections 1.3--1.4.2 and sections 6.0--6.2.1

Week 2 (16, 18 Jan.)
  • N-grams, Zipf, and smoothing
  • Part-of-Speech (PoS) tagging
  Assigned reading: Manning and Schutze, sections 1.4.3, 6.2.2 and sections 6.3--6.3.3

Week 3 (23, 25 Jan.)
  • Entropy
  • Statistical significance and decision trees
  Assigned reading: Manning and Schutze, section 2.2 and sections 5.3--5.3.2

Week 4 (30 Jan., 1 Feb.)
  • Hidden Markov models
  Assigned reading: Manning and Schutze, sections 9.2--9.4.1 and rabiner.pdf

Week 5 (6, 8 Feb.)
  • Hidden Markov models
  • Statistical machine translation
  Assigned reading: Manning and Schutze, sections 13.0 and 13.2

Week 6 (13, 15 Feb.)
  • Statistical machine translation
  Assigned reading: Manning and Schutze, sections 13.1.2, 13.1.3, and 13.3

Week 7 (29 Feb.)
  • Acoustics and speech perception

Week 8 (5, 7 Mar.)
  • Acoustics and speech production
  • Automatic speech recognition
  Assigned reading: Manning and Schutze, section 14.2.2

Week 9 (12, 14 Mar.)
  • Automatic speech recognition
  • Speech synthesis

Week 10 (19, 21 Mar.)
  • Information retrieval
  Assigned reading: Manning and Schutze, chapter 15 (especially 15.2 and 15.4)

Week 11 (26, 28 Mar.)
  • Summarization
  • Misc. classification

Week 12 (2, 4 Apr.)
  • Review

Tutorial materials

Assignments

Here is the ID template that you must submit with your assignments.

Project

The course project is an optional replacement for the assignments available to graduate students in CSC2511.

Old exams

Here are some old exams for this course, without solutions.

Here is the midterm of 27 February (not for marks), with solutions: MidtermAnswers.pdf.

2011 lectures

Here are the lectures from last year's iteration of this course.

2010 website

Here is the website for the iteration of this course offered in 2010, with additional handouts: CSC401/2511 2010 webpage
