Meetings
Time: Alternating Wednesdays at 09h30-11h00
Location: PT266
If you would like to schedule a meeting, or for more information, please email the meeting organizers at cl-mo followed by @cs.toronto.edu.
| Date | Speaker | Title and abstract |
|---|---|---|
| Sep. 1 | Barrou Diallo |
Research in Chinese machine translation at the European Patent Office
Note special time: 9h00 (not 9h10) to 10h00. About the speaker: Barrou Diallo is the Head of Research at the European Patent Office and Advisor to the Information Retrieval Facility. His focus is on machine translation, data mining, and enterprise architecture. He holds a Ph.D. in computer sciences, an M.Sc. in biomathematics and an M.Sc. in law and cyberspace. He has published several papers on patent processing, computer graphics, 3D visualisation and database management. He was the project manager of the first real-time European machine translation system for patents at the EPO. Prior to these various positions at the EPO, Barrou Diallo had a chair as Professor in the Chamber of Commerce of Le Mans and Assistant Professor at the University of Compiègne. |
| Sep. 2 | Akira Ushioda |
MT research and development in Japan
Note special time and place: PT378 at 11h00. About the speaker: Dr. Akira Ushioda obtained his Ph.D. from Carnegie Mellon University in 2000, and worked as a senior researcher (2000-2002), and director of Intelligent Systems Laboratory (2003-2004) at Fujitsu Laboratories Ltd. He is currently a Research Fellow at Fujitsu Laboratories and Guest Associate Professor at the Nara Institute of Science and Technology, Japan. Dr. Ushioda's research interests cover a range of topics in the area of Natural Language Processing and statistical learning, including a lexical statistical parser, integration of SMT and RBMT, automatic clustering of words and phrases, and statistical word sense disambiguation. Abstract: The market size of private cramming schools and preparatory schools in Japan is 10 billion dollars and more than a third of the market is comprised of language schools, mostly English language schools. The Japanese people are thus enthusiastic about learning English, and yet the TOEIC report on test-takers worldwide shows that the average TOEIC score of Japanese test-takers is ranked 25th out of the 27 countries with most active test-takers. The awareness of poor performance makes them more desperate to learn English. Poor human performance, on the other hand, makes relative performance of, and expectation for, MT higher. Japan has thus been quite actively engaged in developing machine translation technology both at the government level and at the private-sector level. The EDR (Electronic Dictionary Research) project, a government-led electronic dictionary research project, for example, began in 1986, and continued for a decade with a total budget of 150 million dollars. The participants in the project from the private sector include major Japanese electronics companies, such as Hitachi, Toshiba, Panasonic, Sharp, NEC and Mitsubishi Electric Corp.
Fujitsu Laboratories, also a participant in the EDR project, began developing English-to-Japanese and Japanese-to-English MT systems in the early 1980s. Unlike other Japanese MT makers, Fujitsu employs an interlingua-oriented translation scheme which makes the difference in concept representation between Japanese and English easier to overcome. The deeper semantic representation, on the other hand, makes the grammar rule set somewhat harder to maintain and grow. Instead of further modifying the rule-based scheme, we are investigating ways to incorporate the SMT framework into the existing scheme. One of the issues at hand is how to bridge the gap between the RBMT "phrases" and SMT "phrases." This talk will provide a background and an overview of MT development in Japan, describe Fujitsu's MT research and development, and discuss the future direction of MT research and major challenges. |
| Sep. 16 | CL Group | Fall 2009 welcoming meeting |
| Sep. 23 | Varada Kolhatkar |
An extended analysis of a method of all words sense disambiguation
One of the central problems in processing a natural language is ambiguity. In every natural language there are many potentially ambiguous words. Humans are fairly adept at resolving ambiguity by drawing on context and their knowledge of the world. However, it is not so easy for machines to understand the intended meaning of a word in a given context. Word Sense Disambiguation (WSD) is the process of selecting the correct sense of a word in a specific context. It is often useful to generalize the problem of disambiguating a single word to that of disambiguating all content words in a given text. This generalized problem is referred to as all-words sense disambiguation. The long history of WSD research includes many different supervised, unsupervised and knowledge-based approaches. But the reality is that current state-of-the-art accuracy in WSD remains a long way from natural human abilities. We present our analysis of some of the components that might be contributing to the level of error currently plaguing all-words sense disambiguation. Our analysis makes use of WordNet::SenseRelate::AllWords, an unsupervised knowledge-based system for all-words sense disambiguation, which is freely available on the Web as a Perl module. The system assigns a WordNet sense to each word in a text using measures of semantic similarity and relatedness. We find that the degree of difficulty in disambiguating a word is proportional to the number of senses of that word (polysemy). The experimental evidence indicates that a significant percentage of word sense disambiguation error is caused by a relatively small number of highly frequent word types. We also demonstrate that part-of-speech tagged text will be disambiguated more accurately than raw text. We show that expanding the context window helps in terms of coverage but doesn’t improve disambiguation.
Finally we find that if the answer is not the most frequent sense, disambiguation turns out to be a hard problem even for an unsupervised system which doesn’t use any information about sense distribution. |
| Oct. 7 | Mohamed Attia |
Automatic full phonetic transcription of Arabic script
Abstract: Handling most of the non-trivial NLP tasks via rule-based (i.e. language factorizing) methods typically ends up with multiple possible solutions/analyses. After exhausting all the known/applicable rule-based methods, statistical methods are one of the most effective, feasible, and widely adopted approaches to automatically resolve that ambiguity. Many researchers, however, argue that if statistical disambiguation is eventually deployed to get the most likely analysis/sequence of analyses, why not go fully statistical (i.e. non-factorizing) from the very beginning and give up the burden of rule-based methods? In our attempt to get the best performance of automatic full phonetic transcription of open-domain Arabic script, which is a tough industrial problem vital for applications like Arabic TTS systems, building Arabic ASR training corpora, etc., one fundamental design task was to decide whether to go with the former design architecture (language factorization, then statistical disambiguation) or with the latter one (statistical disambiguation on un-factorized tokens). While our years-long research on "automatic Arabic phonetic transcription" ended up with the experimentally best-performing system reported so far in the scientific literature (as of mid-2009), the winning architecture has interestingly been neither of the two abovementioned options alone but a hybrid of both! While the non-factorizing architecture is more computationally economical and easier to implement, the language factorizing one overcomes the severe problem of coverage that emerges with the non-factorizing one. While both approaches asymptote to the same ceiling of accuracy, the former has a faster learning curve than the latter. So, the best hybrid architecture starts by trying the non-factorizing method on the input raw Arabic string. Only if a coverage failure occurs does it switch (back off) to the factorizing method.
While these conclusions have been obtained on the specific problem of "Automatic Full Phonetic Transcription of Arabic Script", we think that many other problems - where selecting between going factorizing or non-factorizing is an issue - may also benefit from this experience. |
| Oct. 21 | Julian Brooke |
A semantic approach to automated text sentiment analysis
The identification and characterization of evaluative stance in written language poses a unique set of cross-disciplinary challenges. Beginning with a review of relevant literature in linguistics and psychology, I trace recent interest in automated detection of author opinion in online product reviews, focusing on two main approaches: the semantic model, which is centered on deriving the semantic orientation (SO) of individual words and expressions, and machine learning classifiers, which rely on statistical information gathered from large corpora. To show the potential long-term advantages of the former, I describe the creation of an SO Calculator, highlighting relevant linguistic features such as intensification, negation, modality, and discourse structure, and devoting particular attention to the detection of genre in movie reviews, integrating machine classifier modules into my core semantic model. Finally, I discuss sentiment analysis in languages other than English, including Spanish and Chinese. |
| Nov. 4 | Paul Thompson |
Semantic Hacking
About the speaker: Paul Thompson is Chief Computational Linguist, Text Exploitation and Decision Support at General Dynamics Advanced Information Systems, Buffalo. Abstract: Forensic linguistics, or the use of linguistic analysis techniques to interpret evidence, e.g., authorship attribution, is an established discipline. In this talk I will describe research on the application of forensic linguistic techniques to computer security in the context of the Semantic Hacking project at Dartmouth College's Institute for Security Technology Studies. I will also discuss related research projects, including research on the detection of deception in text and in computer-mediated communication. |
| Nov. 18 | Gabriel Murray |
Summarizing Conversations in Various Modalities
Abstract: In recent years, summarization research has extended beyond the extractive summarization of well-structured documents such as newswire and journal articles to consider corpora such as meeting transcripts, web-logs, lectures and emails. In many of these domains, researchers have found evidence that domain-specific features can yield additional improvement beyond the performance provided by standard text summarization algorithms. For example, prosodic features can be extracted from the speech signal to aid meeting and lecture summarization, while emails contain useful header information such as the number of recipients and the presence of attachments. In our research we investigate whether these conversational domains can be treated similarly, using a unified conversation feature set for extractive summarization. We show that this novel conversation summarization approach can perform on par with domain-specific approaches for meeting and email data, while being flexible enough to apply to many other conversation domains. This talk will also include a description of subjectivity detection and its application to conversation summarization, as well as an overview of our current approach which moves beyond extractive summarization. |
| Dec. 2 | Daphna Heller | TBD |
| Winter 2009 | ||
| Jan. 16 | Shalom Lappin |
Expressiveness and Complexity in Underspecified Semantics
Today's speaker is a visiting professor from the Department of Philosophy at King's College London. Abstract: In this paper we address an important issue in the development of an adequate formal theory of underspecified semantics. The tension between expressive power and computational tractability poses an acute problem for any such theory. Generating the full set of resolved scope readings from an underspecified representation produces a combinatorial explosion that undermines the efficiency of these representations. Moreover, Ebert (2005) shows that most current theories of underspecified semantic representation suffer from expressive incompleteness. In previous work we present an account of underspecified scope representations within Property Theory with Curry Typing (PTCT), an intensional first-order theory for natural language semantics. We review this account, and we show that filters applied to the underspecified-scope terms of PTCT permit expressive completeness. While they do not solve the general complexity problem, they do significantly reduce the search space for computing the full set of resolved scope readings in non-worst cases. We explore the role of filters in achieving expressive completeness, and their relationship to the complexity involved in producing full interpretations from underspecified representations. |
| Feb. 27 | Cancelled | Graduate Visit Day |
| Mar. 13 | Shane Bergsma |
Web-Scale Models of Natural Language
Today's speaker is visiting from the University of Alberta. Abstract: The World Wide Web has had an enormous impact on Natural Language Processing (NLP) research, both as a source of data and as a stimulus for new language technology. In this talk, I describe several recent NLP systems that use web-scale statistics to achieve superior performance. These systems employ supervised machine learning as a simple but powerful mechanism for integrating web-scale data. I present the evolution of using the Internet for language research: from the initial enthusiasm for search-engine page counts to the more scientifically-sound usage of web-scale text databases. |
| Mar. 20 | Yang Liu |
Extractive summarization and keyword extraction using meeting transcripts
Abstract: Meeting corpora are much more challenging than written text (such as news articles) for various language processing tasks. In this talk, I will discuss some research we have done in the past two years on meeting understanding, specifically, extractive meeting summarization and keyword extraction. When using a supervised learning framework for summarization, to address the imbalanced data problem and human annotation disagreement, we propose different sampling methods and a regression model. I will present improved results using these methods for meeting summarization, as well as studies on the correlation of the automatic ROUGE measures and human evaluation for summarization. I will also show various results for keyword extraction, comparing supervised and unsupervised approaches, and how to leverage summaries for keyword extraction. |
| Mar. 27 | Abdel-Rahman Mohamed |
Hafss, A Computer Aided Pronunciation Learning system
Abstract: In this talk, I will describe a speech-enabled Computer Aided Pronunciation teaching (CAPT) system, HAFSS. This system was developed for teaching Holy Qur'an recitation rules and Arabic pronunciations. HAFSS uses a state-of-the-art speech recognizer to detect errors in user recitation. One important point that is critical in any practical language learning system that exploits ASR technology is the user enrollment time (the time needed to train the system to the user's voice). I will talk about the enrollment process in HAFSS and I will discuss methods that were found helpful in reducing the total enrollment time needed by the system. I will also introduce one experiment that measures the usefulness of the system to a novice user and another one that measures the correlation between the judgments of the HAFSS system and the judgments of four human experts. |
| Apr. 3 | Tong Wang |
Extracting Synonyms from Dictionary Definitions
Abstract: Much research effort has been spent on extracting words of different lexical semantic relations from various resources; the extraction of synonyms, however, has proved to be nontrivial due to the difficulty of coming up with features that are exclusive to synonymy. I will talk about two rule-based approaches for extracting synonyms from dictionary definitions: by building an inverted index and by bootstrapping and matching against regex patterns. In one of the two evaluation schemes I used, these seemingly simple approaches actually outperform the best reported lexicon-based method by a large margin. |
| Fall 2008 | ||
| Sep. 19 | Naishi Liu |
A Reduced Graph Model of Jokes
Today's speaker is a visiting scholar from Shanghai Jiao Tong University. Abstract: The talk is an introduction to a graph-theoretic model for the understanding of verbal humor (especially jokes). It follows the tradition of CL and builds on previous linguistic research, making use of graph elements such as vertices, edges, and subgraphs. The result is an interpretation model that accounts for how we understand humor, based on which algorithms may be designed to facilitate automatic humor processing. Warning: The presentation may contain some sexually oriented or sexist data. |
| Oct. 3 | Anatoliy Gruzd |
Name Networks: A Content-Based Method for Automated Discovery of Social Networks to Study Collaborative Learning
Today's speaker is a PhD student at the Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign. Abstract: As a way to gain greater insight into the operation of e-learning communities, the presented work applies automated text mining techniques to text-based communication to identify, describe and evaluate underlying social networks. The research demonstrates that the resulting social networks can be used by members of e-learning communities to improve the learning experience, while faculty and administration can use them to understand online learning processes and to develop more appropriate and effective programs for the next generation of learners. |
| Oct. 17 | Prof. Iryna Gurevych |
Putting the "Wisdom-of-Crowds" to Use in NLP: Collaboratively Constructed Semantic Resources on the Web
About the Speaker: Iryna Gurevych is Director of the Ubiquitous Knowledge Processing (UKP) Lab at the Technical University of Darmstadt. She is a recipient of the Young Excellence Emmy-Noether Award from the German Research Foundation (DFG) and a Lichtenberg-Professorship Award from the Volkswagen-Foundation. Iryna is currently Principal Investigator of the projects "Semantic Information Retrieval", funded by the German Research Foundation (DFG), "Mining Lexical-Semantic Knowledge from Dynamic and Linguistic Sources and Integration into Question Answering for Discourse-Based Knowledge Acquisition in eLearning", funded by the DFG, and THESEUS "TEXO - Future Business Value Networks: Business Web", funded by the German Ministry of Economics and Technology. She is a lecturer and scientific advisor in the research training program "Quality Enhancement in eLearning through Regenerative Processes", funded by the DFG. Her lab conducts research in the areas of lexical semantic processing with the focus on Web-based semantic resources, integrating lexical semantic knowledge in information retrieval and question answering, and text mining with the focus on sentiment analysis. Further information: http://www.ukp.tu-darmstadt.de/ Abstract: The rise of Web 2.0 and the so-called Socio-Semantic technologies in recent years has led to huge amounts of user-generated content produced by ordinary users on the Web. This content called for user-generated tagging to enable better information navigation and retrieval. Therefore, semantically tagged collaboratively constructed knowledge repositories emerged that represent a novel type of Web-originated resources - we call them collaboratively constructed semantic resources (CCSR). Example instances of CCSR are collaboratively constructed and semantically enriched multilingual online encyclopedias, such as Wikipedia, or collaboratively constructed online multilingual dictionaries, such as Wiktionary.
NLP researchers have started to employ CCSRs as substitutes for conventional lexical semantic resources and repositories of world knowledge, such as thesauri, machine readable dictionaries, or wordnets. In overcoming the limitations of existing resources, such as their coverage gaps, significant construction and maintenance costs, and restricted availability, there is now a hope to significantly enhance the performance of numerous algorithms by utilizing the so-called "wisdom-of-crowds" in broad coverage NLP systems. Combining CCSRs with statistical measures, which results in shallow, approximate semantic knowledge, has already demonstrated excellent results in some NLP tasks. The talk will present some of the recent work done at the Ubiquitous Knowledge Processing Lab that had a significant impact in the above outlined area. In the first part, a set of semantic relatedness measures operating on various datasets and utilizing either conventional wordnets or CCSRs will be examined. In the second part, the knowledge in Wikipedia and Wiktionary is employed in domain-specific information retrieval and yields significant improvements. The talk will be concluded with some remarks on the interoperability of conventional knowledge resources and CCSRs. |
| Oct. 31 | Fraser Shein |
WordQ and SpeakQ software: Writing made easier
About the speaker: Fraser Shein is a new faculty member in the CL group. He is a senior rehabilitation engineer at Bloorview Kids Rehab where his research interests include advanced computer accessibility technology, natural language processing as applied to writing software, speech recognition, and consumer-driven reporting of assistive technology experiences. He is also the President and CEO of Quillsoft Ltd., which produces software to help individuals write text using technologies such as natural sounding text-to-speech, contextual word prediction, and speech recognition. Abstract: This presentation will discuss and demonstrate how WordQ/SpeakQ software (both Windows and Mac OS X) helps you write more easily. Both were developed at Bloorview Kids Rehab (Toronto). As you type, WordQ continuously presents a list of relevant correctly spelled words using word prediction. When the desired word is shown, you can choose it with a single keystroke. High quality text-to-speech feedback enables you to more easily choose words and to identify mistakes. SpeakQ plugs into WordQ and adds simple speech recognition. You can then benefit from a combination of word prediction, speech output and speech input to generate text when stuck with spelling and word forms, identifying errors, proofreading and editing. Current research at Bloorview relating to syntactical and semantic knowledge in word prediction will also be discussed. |
| Nov. 7 | Libby Barak |
Keyword based Text Categorization
Abstract: The Text Categorization (TC) task is mostly approached via supervised or semi-supervised methods. These solutions require excessive manual labor in order to annotate text samples as training data, which is not always feasible. In this work we investigate Keyword-based Text Categorization using as input only a taxonomy of the category names. The TC method uses a novel combination of Textual Entailment based categorization and Latent Semantic Analysis (LSA) based categorization to create an initial set of unsupervised classified documents. The initial classified set is then used as input for a standard supervised categorization method. The proposed method shows promising initial results and reveals interesting phenomena as a basis for further research. |
| Nov. 28 | TBD | To be determined |
| Dec. 5 | Hani Safadi |
Crosslingual implementation of linguistic taggers using parallel corpora
Abstract: The talk addresses the problem of creating linguistic taggers for resource-poor languages using existing taggers in resource rich languages. Linguistic taggers are classifiers that map individual words or phrases from a sentence to a set of tags. Part of speech tagging and named entity extraction are two examples of linguistic tagging. Linguistic taggers are usually trained using supervised learning algorithms. This requires the existence of labeled training data, which is not available for many languages. We describe an approach for assigning linguistic tags to sentences in a target (resource-poor) language by exploiting a linguistic tagger that has been configured in a source (resource-rich) language. The approach does not require that the input sentence be translated into the source language. Instead, projection of linguistic tags is accomplished through the use of a parallel corpus, which is a collection of texts that are available in a source language and a target language. The correspondence between words of the source and target language allows us to project tags from source to target language words. The projected tags are further processed to compute the final tags of the target language words. A system for part of speech (POS) tagging of French language sentences using an English language POS tagger and an English/French parallel corpus has been implemented and evaluated using this approach. |
| Dec. 9 | Dan Jurafsky |
Distinguished Lecture Series Colloquium
Note special time and place: 11:00-13:00, Bahen 1180. About the speaker: Dan Jurafsky works at the nexus of language and computation, focusing on statistical models of human and machine language processing. Recent topics include the induction and use of computational models of meaning, the automatic recognition and synthesis of speech, and the comprehension and production of dialogue. He is the recipient of the MacArthur Fellowship and an NSF CAREER award. His most recent book is the second edition of his widely-used textbook with Jim Martin, Speech and Language Processing. |
| Winter 2008 | ||
| Jan. 18 | Frank Rudzicz |
Speech Recognition and Computational Linguistics: How to wreck a nice beach whenever a wand Aztecs
Speech and language research is big. Very big. You just won't believe how vastly, hugely, mind-bogglingly big it is. I mean, you may think it's a long way down the road to the chemist's, but that's just peanuts to speech and language research! Listen! And so on... |
| Jan. 30 | Rada Mihalcea |
Linking Documents to Encyclopedic Knowledge: Using Wikipedia as a Source of Linguistic Evidence
Note special time and place: 10:30-12:00, Pratt 266 Wikipedia is an online encyclopedia that has grown to become one of the largest online repositories of encyclopedic knowledge, with millions of articles available for a large number of languages. In fact, Wikipedia editions are available for more than 200 languages, with a number of entries varying from a few pages to more than one million articles per language. In this talk, I will describe the use of Wikipedia as a source of linguistic evidence for natural language processing tasks. In particular, I will show how this online encyclopedia can be used to achieve state-of-the-art results on two text processing tasks: automatic keyword extraction and word sense disambiguation. I will also show how the two methods can be combined into a system able to automatically enrich a text with links to encyclopedic knowledge. Given an input document, the system identifies the important concepts in the text and automatically links these concepts to the corresponding Wikipedia pages. Evaluations of the system showed that the automatic annotations are reliable and hardly distinguishable from manual annotations. Additionally, an evaluation of the system in an educational environment showed that the availability of encyclopedic knowledge within easy reach of a learner can improve both the quality of the knowledge acquired and the time needed to obtain such knowledge. This is joint work with Andras Csomai. |
| Feb. 15 | Graeme Hirst |
Real-word spelling correction with trigrams: A reconsideration of the Mays, Damerau, and Mercer model
The trigram-based noisy-channel model of real-word spelling-error correction that was presented by Mays, Damerau, and Mercer in 1991 has never been adequately evaluated or compared with other methods. We analyze the advantages and limitations of the method, and present a new evaluation that enables a meaningful comparison with the WordNet-based method of Hirst and Budanitsky. The trigram method is found to be superior, even on content words. We then show that optimizing over sentences gives better results than variants of the algorithm that optimize over fixed-length windows. This talk represents collaborative work between Amber Wilcox-O'Hearn, Graeme Hirst, and Alexander Budanitsky. |
| Feb. 29 | Cancelled | Graduate Visit Day |
| Mar. 14 | (Afra Alishahi || Afsaneh Fazly) | A Probabilistic Incremental Model of Word Learning in the Presence of Referential Uncertainty. Abstract: We present a probabilistic incremental model of word learning in children. The model acquires the meaning of words from exposure to word usages in sentences, paired with appropriate semantic representations, in the presence of referential uncertainty. A distinct property of our model is that it continually revises its learned knowledge of a word's meaning, but over time converges on the most likely meaning of the word. Another key feature is that the model bootstraps its own partial knowledge of word-meaning associations to help more quickly learn the meanings of novel words. Results of simulations on naturalistic child-directed data show that our model exhibits behaviours similar to those observed in the early lexical acquisition of children, such as vocabulary spurt and fast mapping. |
| Mar. 28 | Chris Parisien |
An Incremental Bayesian Model for Learning Syntactic Categories
Abstract: I present a method for the unsupervised learning of syntactic categories from text. The method uses an incremental Bayesian clustering algorithm to find groups of words that occur within similar syntactic contexts. The model draws information from the distributional cues of words within an utterance, while explicitly bootstrapping its development on its own partial knowledge of syntactic categories. Using a corpus of child-directed speech, we demonstrate the benefit of a syntactic bootstrap for an incremental categorization model. The model is robust to the noise in real language data, manages lexical ambiguity, and shows learning behaviours similar to what we observe in children. |
| Apr. 11 | Tim Fowler |
Navigating the parsing landscape
Abstract: We will introduce context free grammars (CFGs) and combinatory categorial grammars (CCGs) with a focus on how these formalisms deal with semantics. The known differences between the formalisms will be discussed and the Lambek calculus will be introduced as an ideal comparison point between the two. To do this, we will need to consider the formal language class of natural language. A recent polynomial time parsing result for the Lambek calculus will be introduced and we will discuss possible future research opened up by this result. |
| Fall 2007 | ||
| Sept. 14 | CL Group | Fall 2007 Welcoming Meeting |
| Sept. 28 | Gerald Penn |
The Quantitative Study of Writing Systems
Abstract: If you understood all of the world's languages, you would still not be able to read many of the texts that you find on the world wide web, because they are written in non-Roman scripts -- often ones that have been arbitrarily encoded for electronic transmission in the absence of an accepted standard. This very modern nuisance reflects a dilemma as ancient as writing itself: the association between a language as it is spoken and its written form has a sort of internal logic to it that we can comprehend, but the conventions are different in every individual case --- even among languages that use the same script, or between scripts used by the same language. This conventional association between language and script, called a writing system, is indeed reminiscent of the Saussurean conception of language itself, a conventional association of meaning and sound, upon which modern linguistic theory is based. Despite linguists' reliance upon writing to present and preserve linguistic data, however, writing systems were a largely forgotten corner of linguistics until the 1960s, when Gelb presented their first classification. This talk will describe recent work that aims to place the study of writing systems upon a sound computational and statistical foundation. While archaeological decipherment may eternally remain the holy grail of this area of research, it also has applications to speech synthesis, machine translation, and multilingual document retrieval. |
| Oct. 12 | Paul Cook |
Pulling their Weight: Exploiting Syntactic Forms for the Automatic Identification of Idiomatic Expressions in Context
Abstract: Much work on idioms has focused on type identification, i.e., determining whether a sequence of words can form an idiomatic expression. Since an idiom type often has a literal interpretation as well, token classification of potential idioms in context is critical for NLP. We explore the use of informative prior knowledge about the overall syntactic behaviour of a potentially-idiomatic expression (type-based knowledge) to determine whether an instance of the expression is used idiomatically or literally (token-based knowledge). We develop unsupervised methods for the task, and show that their performance is comparable to that of standard supervised techniques. |
| Oct. 26 | Cancelled | Cancelled |
| Nov. 9 | Graeme Hirst |
Views of Text-Meaning in Computational Linguistics
Abstract: Three views of text-meaning compete in the philosophy of language: objective, subjective, and authorial -- "in" the text, or "in" the reader, or "in" the writer. Computational linguistics has ignored the competition and implicitly embraced all three, and rightly so; but different views have predominated at different times and in different applications. Contemporary applications mostly take the crudest view: meaning is objectively "in" a text. The more-sophisticated applications now on the horizon, however, demand the other two views: as the computer takes on the user's purpose, it must also take on the user's subjective views; but sometimes, the user's purpose is to determine the author's intent. Accomplishing this requires, among other things, an ability to determine what could have been said but wasn't, and hence a sensitivity to linguistic nuance. It is therefore necessary to develop computational mechanisms for this sensitivity. |
| Nov. 23 | Diana Raffman |
Psychological Hysteresis and the Nontransitivity of Insignificant Differences
Abstract: Vague words in natural language cause semantic and logical problems in a variety of disciplines. An especially persistent problem has to do with the nontransitivity of insignificant differences. For example, if eating one candy won't make me fat, then eating two won't; but if eating two won't, then eating three won't; and so on. It seems to follow that eating a thousand pieces of candy won't make me fat. This paradoxical result shows that the word 'fat' is vague. Similarly, if Hillary Clinton is a person, then she was a person one second ago; and if she was a person one second ago, then she was a person two seconds ago; etc. It seems to follow that the conceptus from which Hillary Clinton developed was also a person. The word 'person' is vague. Clearly there is something wrong with this paradoxical form of reasoning, but a satisfactory diagnosis has not been found. In this talk I will propose a diagnosis that appeals to the hysteretical nature of our judgments involving vague words. To that end I will present preliminary results of a psychological study of our use of vague words. |
| Dec. 7 | TBD | TBD |