CSC 401/2511  Natural Language Computing
Winter 2010
Index of this document
Contact information
Instructor: Gerald Penn

Office: PT 396B (St. George campus)

Office hours: immediately following lectures (normally Mondays and
Fridays) 1-2, or by appointment

Tel: 978-7390

Email: gpenn@cdf.utoronto.ca
Back to the index
Meeting times

Lectures: MF 12-1, BA 1190

Tutorials: W 12-1, BA 1190

(Exceptions: there will be lectures on MWF, 4/6/8 January, and no tutorial
in the first week;
there will be a lecture on Wednesday, 10 February and a tutorial
on Friday, 12 February;
there will be a tutorial on Monday, 22 February and a lecture
on Wednesday, 24 February;
there will be a lecture on Wednesday, 17 March, and a tutorial on
Friday, 19 March;
there will be a tutorial on Monday, 22 March, and a lecture on Wednesday,
24 March;
there will be no lecture or tutorial on Friday, 2 April)
A bulletin board has also been created for the class, which will be
monitored by the TAs.
Back to the index
Texts for the Course
Required:
C. Manning & H. Schuetze, Foundations of Statistical Natural Language
Processing, MIT Press, 1999. (Errata; an online edition is available
from MIT CogNet.)

Optional:
D. Jurafsky & J. Martin, Speech and Language Processing, 2nd ed.,
Prentice Hall, 2008. (Errata)

Recommended:
A. Martelli, Python in a Nutshell, 2nd ed., O'Reilly, 2006. (Errata)

Optional:
M. Lutz, Learning Python, 3rd ed., O'Reilly, 2007. (Errata)

Free!
Various tutorials on the Python website.

Supplementary Reading for the Lectures
parsing, phrase structure models:
  Statistical Language Learning, E. Charniak, MIT Press, 1993.

machine learning:
  The Elements of Statistical Learning, T. Hastie, R. Tibshirani and
  J. Friedman, Springer, 2001.

information theory (including entropy):
  Elements of Information Theory, T. M. Cover and J. A. Thomas,
  Wiley & Sons, 1991.

maximum entropy modelling:
  A Maximum Entropy Approach to Natural Language Processing,
  A. L. Berger, S. A. Della Pietra and V. J. Della Pietra,
  Computational Linguistics, 22(1): 39-71, 1996.

hidden Markov models (state emission):
  Fundamentals of Speech Recognition, Chapter 6, L. Rabiner and
  B.-H. Juang, Prentice Hall, 1993.

Good-Turing estimation:
  A comparison of the enhanced Good-Turing and deleted estimation
  methods for estimating probabilities of English bigrams, K. Church
  and W. Gale, Computer Speech and Language, 5: 19-54, 1991.

information retrieval:
  Modern Information Retrieval, R. Baeza-Yates and B. Ribeiro-Neto,
  ACM Press, 1999.

text summarization:
  Automatic Summarization, I. Mani, Benjamins, 2001.

phonetics (articulatory and acoustic):
  Acoustic Phonetics, K. N. Stevens, MIT Press, 1998.
Back to the index
Tentative Course outline

Introduction to Corpus-based Linguistics

Text Categorisation

N-gram Models

Markov Models

Automatic Speech Recognition

Part-of-Speech Tagging

Information Retrieval

Text Summarisation

Statistical Machine Translation
Back to the index
Calendar of important course-related events
Mon, 4 January: First lecture
Sun, 10 January: Last day to add course (CSC 401)
Fri, 15 January: Last day to add course (CSC 2511)
Fri, 5 February: Assignment 1 due
15-19 February: Reading Week (no classes)
Fri, 26 February: Last day to drop course (CSC 2511)
Fri, 5 March: Assignment 2 due
Sun, 7 March: Last day to drop course (CSC 401)
Mon, 29 March: Last lecture
Thu, 1 April: Assignment 3 due
7-23 April: Final exam period
Back to the index
Evaluation and related policies
There will be three homeworks, and a final exam. The relative weights of
these components towards the final mark are shown in the table below:
Assignment 1: 20%
Assignment 2: 20%
Assignment 3: 20%
Final exam: 40%
Important note on final: A mark of at least a D on the final
exam is required to pass the course. In other words, if you receive
an F on the final exam you automatically fail the course, regardless of
your performance on homeworks.
Important note on homeworks: No late homeworks will be accepted
except in case of documented medical or other emergencies.
Policy on collaboration: No collaboration on homeworks is permitted.
The work you submit must be your own. No student is permitted to
discuss the final exam with any other student until the instructor or TAs
make the solutions publicly available. Failure to observe this policy
is an academic offense, carrying a penalty ranging from a zero on
the homework to suspension from the university.
Back to the index
Announcements
In this space, you will find announcements related to the course. Please
check this space at least weekly.

FINAL EXAM: As a reminder, you will not be permitted to take any
exam aids into the final exam. We'll be having a special tutorial
session on Wednesday, 7th April from 1:30 to 3:30 in GB 404. The
purpose of this session is to answer questions you've had in revising for
the final exam.

MATERIAL COVERED IN WEEK 12: information retrieval, singular value decomposition.
You should read M&S Chapter 15.

MATERIAL COVERED IN WEEK 11: text summarization, naive Bayes classification.

MATERIAL COVERED IN WEEK 10: Fourier transforms, spectrograms, vowel classification,
relative entropy, mutual information, the flip-flop algorithm. You should
read M&S sections 7.2 and 8.4.

MATERIAL COVERED IN WEEK 9: articulatory phonetics, sound waves, acoustic
phonetics. You should read J&M sections 4.1-4.2 and Chapter 7.

MATERIAL COVERED IN WEEK 8: Part-of-speech tagging, tagging with HMMs,
transformation-based tagging, the Brill tagger. You should read M&S
Chapter 10.

MATERIAL COVERED IN WEEK 7: the forward algorithm, the backward algorithm,
Viterbi decoding, Baum-Welch re-estimation.
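For intuition, here is a minimal sketch of Viterbi decoding over a toy HMM. The states, observations, and probabilities below are invented for illustration; they are not from the course materials.

```python
# A minimal, self-contained sketch of Viterbi decoding for a toy HMM.
# All states, observations, and probabilities are invented.

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable hidden state sequence for the observations."""
    # V[t][s] = (probability of the best path ending in s at time t, predecessor)
    V = [{s: (start_p[s] * emit_p[s][obs[0]], None) for s in states}]
    for t in range(1, len(obs)):
        V.append({})
        for s in states:
            prob, prev = max(
                (V[t - 1][r][0] * trans_p[r][s] * emit_p[s][obs[t]], r)
                for r in states)
            V[t][s] = (prob, prev)
    # Trace the best path backwards from the most probable final state.
    best = max(states, key=lambda s: V[-1][s][0])
    path = [best]
    for t in range(len(obs) - 1, 0, -1):
        path.append(V[t][path[-1]][1])
    return list(reversed(path))

# Toy weather HMM: hidden weather states, observed activities.
states = ['Rainy', 'Sunny']
start_p = {'Rainy': 0.6, 'Sunny': 0.4}
trans_p = {'Rainy': {'Rainy': 0.7, 'Sunny': 0.3},
           'Sunny': {'Rainy': 0.4, 'Sunny': 0.6}}
emit_p = {'Rainy': {'walk': 0.1, 'shop': 0.4, 'clean': 0.5},
          'Sunny': {'walk': 0.6, 'shop': 0.3, 'clean': 0.1}}
obs = ['walk', 'shop', 'clean']
print(viterbi(obs, states, start_p, trans_p, emit_p))
# -> ['Sunny', 'Rainy', 'Rainy']
```

In practice log probabilities are used instead of raw products to avoid underflow on long observation sequences.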

MATERIAL COVERED IN WEEK 6: smoothing, Markov models, HMMs. You should
read M&S Chapter 9.

MATERIAL COVERED IN WEEK 5: iterative scaling, language modelling, n-grams,
maximum likelihood estimation, Bayes's rule. You should read M&S
Chapter 6.
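As a quick illustration of maximum likelihood estimation for a bigram model (ignoring smoothing and sentence boundaries), here is a small sketch; the toy corpus is invented.

```python
# MLE bigram probabilities from a toy corpus (invented for illustration).
from collections import Counter

corpus = "the cat sat on the mat the cat ate".split()

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def p_mle(w2, w1):
    """MLE estimate of P(w2 | w1) = count(w1 w2) / count(w1)."""
    return bigram_counts[(w1, w2)] / unigram_counts[w1]

print(p_mle("cat", "the"))  # count('the cat') = 2, count('the') = 3 -> 0.666...
```

Unseen bigrams get probability zero under pure MLE, which is exactly the problem the smoothing methods from Week 6 address.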

MATERIAL COVERED IN WEEK 4: k-nearest neighbours, perceptron learning,
Lagrange's method, maximum entropy modelling. You should read M&S 2.1-2.2.4.

MATERIAL COVERED IN WEEK 3: cosine method, entropy, decision trees.
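For reference, a minimal sketch of the cosine method over raw term-frequency vectors (no tf-idf weighting); the two documents are invented examples.

```python
# Cosine similarity of two whitespace-tokenised documents,
# using raw term frequencies (invented example documents).
import math
from collections import Counter

def cosine(doc1, doc2):
    """Cosine of the angle between the two term-frequency vectors."""
    a, b = Counter(doc1.split()), Counter(doc2.split())
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm

print(cosine("natural language processing", "statistical language processing"))
# -> 0.666...
```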

MATERIAL COVERED IN WEEK 2: corpus annotation, genre classification, end-of-sentence
boundary detection. You should read M&S Chapter 16, Sections
15.2-15.2.1 and Section 8.1.

MATERIAL COVERED IN WEEK 1: Zipf's Law, parts of speech. You should
read M&S Chapter 1, Section 3.1 and Chapter 4.

18 December: PREREQUISITES. CSC 207 or 209 or 228; STA 247 or 255 or 257;
and either a CGPA of 3.0 or higher or a CSC subject POSt. MAT 223 or 240 is
strongly recommended. Note that the University's automatic registration
system does not check for prerequisites: even if you have registered for
the course, you will not receive credit for it unless you satisfied
the prerequisites before you registered.
Back to the index
Handouts
In this space you will find online PDF versions of course handouts,
including homeworks.
To view these handouts you will need access to a PDF viewer. If your
machine does not have the required software, you can
download
Adobe Acrobat Reader for free.
Back to the index
Old Exams
Some old midterm and final exams for this course are available (solutions are not provided).
Back to the index
Gerald Penn, 6 April 2010
This webpage was adapted from the webpage for another course,
created by Vassos Hadzilacos.