Time: 10:30-12:00, alternating Wednesdays
If you would like to schedule a meeting, or for more information, please email the meeting organizers at cl-mo followed by @cs.toronto.edu.
|Date||Speaker||Title (click on title to show/hide abstract)|
|Apr. 24||Fraser Shein||
iWordQ: Pragmatics of word prediction to assist struggling and emerging readers and writers
Dr. Shein will present and discuss the word prediction technology used within Quillsoft's iWordQ iPad App that was designed to support struggling and emerging readers and writers. A particular focus will be on the pragmatic aspects that must be considered in delivering a commercial product to meet real-life needs. The current prediction model is based on bigram/unigram statistics derived from a billion-word blog corpus, supported by up to 5-grams where prediction following function words is highly ambiguous. While relatively simple in concept, practical aspects such as memory management, look-up speed, and accuracy of spelling are very real determiners of usefulness. We also created Canadian, British, and American spelling dictionaries and removed noise and inappropriate words for final use by children. While seemingly simple, this has been the most significant effort. Improvements to the algorithm remain, related to handling poor spelling, punctuation, and contractions, among other issues. Suggestions for future research will be discussed.
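For illustration, here is a minimal sketch of the kind of bigram prediction with unigram backoff and prefix filtering that the abstract describes; the toy corpus and ranking below are invented for this example and are not Quillsoft's actual model.

```python
# Minimal sketch of bigram prediction with unigram backoff and prefix filtering.
# The toy corpus and ranking are illustrative assumptions, not iWordQ's model.
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat ate the fish".split()

unigrams = Counter(corpus)
bigrams = defaultdict(Counter)
for prev, word in zip(corpus, corpus[1:]):
    bigrams[prev][word] += 1

def predict(prev_word, prefix="", k=3):
    """Rank candidate completions after prev_word, filtered by the typed prefix."""
    candidates = bigrams.get(prev_word)
    if not candidates:                # back off to unigram counts
        candidates = unigrams
    ranked = [w for w, _ in candidates.most_common() if w.startswith(prefix)]
    return ranked[:k]

print(predict("the"))        # e.g. ['cat', 'mat', 'fish']
print(predict("the", "f"))   # ['fish']
```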
|Feb. 19||Hui Jiang||
Why Deep Neural Network (DNN) Works for Acoustic Modelling in Speech Recognition?
Note special time 10h00--11h30
Recently, deep neural network (DNN) has been combined with hidden Markov model (HMM) as basic acoustic models to replace the traditional Gaussian mixture model (GMM) in automatic speech recognition (ASR). When output nodes of DNN are expanded from a small number of phonemes into a large number of tied-states of triphone HMMs, it has been reported that the so-called context-dependent DNN/HMM hybrid model has achieved unprecedented performance gain in many challenging ASR tasks, including the well-known Switchboard task. At this point, it is interesting to investigate where the unprecedented gain comes from and how DNN has beaten GMMs in acoustic modelling for ASR. In this talk, I will start by reporting some experiments that may reveal some clues to answer these questions. Our experimental results suggest that DNN does not necessarily yield better modelling capability than the conventional GMMs for standard speech features, but DNN is indeed very powerful in terms of leveraging highly correlated features. Experimental results on several large vocabulary ASR tasks (including Switchboard) have shown that the unprecedented gain of the context-dependent DNN/HMM model can be almost entirely attributed to DNN's input feature vectors that are concatenated from several consecutive speech frames within a relatively long context window. Based on these observations, I will present our recent research work to explore the concatenated features under the traditional GMM/HMM framework, where DNN is only used as a front-end feature extractor to perform dimensionality reduction. Moreover, I will also introduce a new training algorithm, called incoherent training, which attempts to explicitly de-correlate feature vectors in learning of DNN parameters. The proposed incoherent training relies on the idea of directly minimizing coherence of weight matrices of DNN during the normal back-propagation training process. Experimental results on several large-scale ASR tasks have shown that the discriminatively trained GMM/HMMs using feature vectors derived from incoherent training have consistently surpassed the state-of-the-art context-dependent DNN/HMMs in all evaluated cases.
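As a rough illustration of the quantity that incoherent training penalizes, the sketch below computes the mutual coherence of a weight matrix, taken here as the largest normalized inner product between distinct columns; the precise coherence definition and how it enters the back-propagation objective in the talk may differ.

```python
# Hedged sketch: mutual coherence of a weight matrix, one plausible reading of the
# "coherence" that incoherent training minimizes; the actual penalty may differ.
import numpy as np

def mutual_coherence(W, eps=1e-12):
    """Return max over i != j of |<w_i, w_j>| / (||w_i|| ||w_j||) for columns of W."""
    norms = np.linalg.norm(W, axis=0) + eps
    G = (W / norms).T @ (W / norms)     # Gram matrix of normalized columns
    np.fill_diagonal(G, 0.0)            # ignore self-similarity
    return float(np.abs(G).max())

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 64))
print(mutual_coherence(W))  # a penalty on this value could be added to the training loss
```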
|Sept. 10||Coco Wang||
Interpreting Intentions of Speech
Note special time 10h00--11h30
Many scientists (Cohen et al., 1990) have tried to interpret intention through logical inference. However, lacking an effective means of semantic analysis, the results were not very satisfying. Vanderveken (1990) tried to construct a logic of illocutionary force. But he couldn't reveal the semantic implications, because he couldn't explain how different illocutionary forces compose as a whole. This paper's purpose is to reveal the mechanism of interpreting intentions of speech. Firstly, we present a grammar system for extracting the semantic structures of intentions. The grammar system includes a taxonomy of intentions of speech, which is based on Speech Act Theory (Searle, 1969; Grice, 1989) and Searle's philosophy about "social reality" (Searle, 1997), and a set of grammar rules. Then, we give a logic of semantic implication to explain how people understand and respond to the implicit meanings of complex intentions, such as an imperative hidden in a query, and a query embedded in indirect speech.
|Oct. 31||Alistair Kennedy||
Measuring Semantic Relatedness Across Languages
Measures of Semantic Relatedness are well established in Natural Language Processing. Their purpose is to determine the degree of relatedness between two words, without specifying the nature of their relationship. One method of accomplishing this is to use a word's distribution to determine its meaning. Distributional measures of semantic relatedness represent words as weighted vectors of the contexts in which that word appears. The relatedness between two words is determined by their vector distance. One limitation of distributional measures is that they are successful only between pairs of words in a single language, as contexts between two languages are not usually comparable. In this presentation I will describe a novel method of measuring semantic relatedness between pairs of words in two different languages, using distributional relatedness. This new cross-language measure uses pairs of known translations to create a mapping between distributional representations in two languages. I evaluated this new measure on two data sets. For the first, I constructed a data set of cross-language word pairs, with similarity scores, from French and English versions of Rubenstein & Goodenough's data set. My cross-language measure was evaluated based on how closely it correlated with human-assigned scores. The second evaluation was to use the cross-language measure to select the correct translation of a word from a set of two candidates. I found that the new cross-language measure outperformed a unilingual baseline on both experiments.
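As a hedged sketch of how such a cross-language measure could be assembled (an illustrative stand-in, not necessarily the method presented in this talk): fit a linear map between the two distributional spaces from known translation pairs by least squares, then score a cross-language pair by cosine similarity after mapping.

```python
# Illustrative sketch only: known translation pairs fit a linear map between two
# distributional spaces; cosine similarity after mapping scores a cross-language pair.
import numpy as np

def fit_mapping(src_vecs, tgt_vecs):
    """Rows of src_vecs/tgt_vecs are distributional vectors of known translation pairs."""
    M, *_ = np.linalg.lstsq(src_vecs, tgt_vecs, rcond=None)
    return M

def cosine(u, v, eps=1e-12):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + eps))

def cross_lang_relatedness(src_vec, tgt_vec, M):
    return cosine(src_vec @ M, tgt_vec)

# Toy data: random "distributional" vectors stand in for real context counts.
rng = np.random.default_rng(1)
en = rng.random((50, 20))
fr = en @ rng.random((20, 20)) + 0.01 * rng.random((50, 20))
M = fit_mapping(en, fr)
print(cross_lang_relatedness(en[0], fr[0], M))   # high for a true translation pair in this toy setup
```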
|Nov. 14||Barend Beekhuizen||
Learning relational meanings from situated caregiver-child interaction: A computational approach
The difficulty of learning the relational meanings of words like verbs and prepositions has long been acknowledged (Gentner 1978; Gleitman 1990). This acquisition problem has been explored using human subjects (Hirsh-Pasek & Golinkoff 2006 and papers therein) and computational experiments (Siskind 1996, Alishahi & Stevenson 2008), and substantive progress has been made in understanding the acquisition of relational meaning. However, the nature of the available relational meaning in both approaches is to some extent artificial: in lab settings, the noise and variation are controlled and limited, while computational models often do not take the actual situational context into account (exceptions being Fleischman & Roy 2005 and Frank et al. 2009). In this talk, we discuss the acquisition problem using situational contexts from natural, interactional data and computational modeling techniques. We investigate the sources of the learning difficulty and discuss information that is known to affect the process. We believe this combination of situational data and computational techniques presents an important methodological direction for the (cognitive) linguistic enterprise, as we can approximate the source of the meaning closely.
On the basis of natural data (video recordings of caregiver-child dyads playing a game), we first present the magnitude of the problem. In learning the mapping between a linguistic item L and a meaning M that is grounded in a part of a situation, the learner faces three (related) subproblems. First, it may be that in the situation co-occurring with L, the meaning M is absent. Second, it may be that in a lot of situations not co-occurring with L, the meaning M is present. Finally, we often find other aspects of the situation, relating to other, irrelevant, meanings to be systematically present in the situations co-occurring with L.
Next, we describe a computational model of cross-situational word learning (Fazly et al. 2010), which has been shown to perform well on natural language data with synthetic meanings. Using the natural, situated data, we find that when the model's only source of information is the set of situations holding at the moment of speech, it learns little about the meanings of both nouns and verbs. However, the child has more sources of information at its disposal, and we discuss the effects of these. Here we consider the child's insight into typical social interactions (in this case, game-related intentions; Tomasello 2001), the emergent distributional knowledge of word classes (Fazly & Alishahi 2010; Mintz 2003), and the selective attention to different aspects of the perceived situation (Alishahi et al. 2012, Nematzadeh et al. 2012). Combining these, we arrive at a usage-based computational learner that uses cues from different domains, in line with the approach suggested by Hollich et al. (2000). Taking a computational modeling approach and using natural linguistic and situational data, we can show the extent to which each cue plays a role in learning different sorts of meaning (referring to objects, their properties, static relations and behavioural actions), thus extending our understanding of the driving factors behind the acquisition of word meanings.
|Nov. 21||Aditya Bhargava||
Leveraging supplemental transcriptions and transliterations via re-ranking
Note special time 11h00--12h00
Grapheme-to-phoneme conversion (G2P) and machine transliteration are important tasks in natural language processing. Supplemental data can often help resolve difficult ambiguities: existing transliterations of the same word can help choose among a G2P system's candidate output transcriptions; similarly, transliterations from other languages can help choose among candidate transliterations in a given language. Transcriptions can be leveraged in this way as well. In this thesis, I investigate the problem of applying supplemental data to improve G2P and machine transliteration results. I present a unified method for leveraging related transliteration or transcription data to improve the performance of a base G2P or machine transliteration system. My approach constructs features with the supplemental data, which are then used in an SVM re-ranker. This re-ranking approach is shown to work across multiple base systems and achieves error reductions ranging from 8% to 43% over state-of-the-art base systems in cases where supplemental data are available.
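A toy sketch of the re-ranking idea described above; hand-set weights stand in for the learned SVM re-ranker, and a crude character-overlap score stands in for the thesis's supplemental-data features.

```python
# Toy sketch of n-best re-ranking with supplemental data: each candidate gets features
# (base-system score plus agreement with supplemental strings), and a linear scorer
# re-orders the list. Hand-set weights replace the learned SVM re-ranker here, and
# character overlap is a deliberately crude stand-in for the real features.
def overlap(a, b):
    """Jaccard overlap of character sets -- a crude agreement feature."""
    return len(set(a) & set(b)) / max(len(set(a) | set(b)), 1)

def rerank(nbest, supplemental, w_base=1.0, w_agree=2.0):
    """nbest: list of (candidate, base_score). Returns candidates re-sorted by linear score."""
    def score(item):
        cand, base = item
        agree = sum(overlap(cand, s) for s in supplemental) / max(len(supplemental), 1)
        return w_base * base + w_agree * agree
    return sorted(nbest, key=score, reverse=True)

nbest = [("christophe", 0.9), ("kristof", 0.8)]   # base transliteration candidates
supp = ["christoph", "kristoffer"]                # transliterations from other languages
print([c for c, _ in rerank(nbest, supp)])
```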
|Nov. 28||Rouzbeh Farahmand||
Flexible Structural Analysis of Near-Meet-Semilattices for Typed Unification-based Grammar Design
Note special time 11h00--12h00
We present a new method for directly working with typed unification grammars in which type unification is not well-defined. This is often the case, as large-scale HPSG grammars now usually have type systems for which many pairs do not have least upper bounds. Our method yields a unification algorithm that compiles quickly and yet is nearly as fast during parsing as one that requires least upper bounds. The method also provides a natural naming convention for unification results in cases where no user-defined type exists.
|Dec. 5||Frank Rudzicz||
Communicating with Machines: An Introduction to SPOClab
In this talk I introduce SPOClab (Signal Processing and Oral Communication), which bridges Computer Science at the University of Toronto with the Toronto Rehabilitation Institute. The goal of our lab is to produce software that helps to overcome challenges of communication including speech and language disorders. This will be organized into two co-dependent streams of research. First, we will embed control-theoretic models of speech production into augmented ASR systems using various machine-learning techniques. Second, these systems will be deployed in software that can be used in practice; this involves adjacent disciplines such as human-computer interaction and general natural language processing to design and study application interfaces for disabled users.
|Jan. 13||Julia Hirschberg||
Entrainment in Prosody, Turn-taking, and Social Behaviors
Note special time: 9h30--11h00
When people speak together, they often adapt aspects of their speaking style based upon the style of their conversational partner. This phenomenon goes by many names, including adaptation, alignment, and entrainment, inter alia. In this talk, I will describe experiments in prosodic entrainment in the Columbia Games Corpus, a large corpus of speech recorded from subjects playing a series of computer games. I will discuss how prosodic entrainment is related to turn-taking behaviors, to several measures of task and dialogue success, and to perceived social behaviors. This is joint work with Stefan Benus, Agustín Gravano, Ani Nenkova, Rivka Levitan, and Laura Willson.
|Jan. 18||Heike Zinsmeister||
Towards Gold Corpora for Abstract Anaphora Resolution
Abstract anaphora are anaphoric elements, such as that or this issue, that refer to abstract referents such as facts or events. The antecedents of abstract anaphors are often realised as verbal or clausal categories, as in example (1) adapted from Byron (2002), which poses problems for the automatic resolution of the anaphoric relation.
(1) Each Fall, penguins migrate to Fiji. [That]'s why I'm going there next month.
The resolution problem can be split into three subtasks: (i) deciding whether an anaphoric element refers to an abstract or a concrete referent, (ii) identifying the antecedent string, (iii) inducing the abstract referent.
When creating a gold standard in this domain, it is easy for human annotators to agree on the first task. It is much harder to get reliable data with respect to the other two tasks. I will present a survey on annotation projects and discuss how they approach this challenge.
Furthermore, I will outline ongoing work on cross-linguistic annotation of abstract anaphora in a parallel corpus of English and German, which also addresses the question of the reliability of translated texts as a source for feature induction.
|Feb. 3||Ciprian Chelba||
Language Modeling for Automatic Speech Recognition Meets the Web: Google Search by Voice
Note special time and location: 10h00--12h00 in BA5256
The talk presents key aspects faced when building language models (LM) for the google.com query stream, and their use for automatic speech recognition (ASR). Distributed LM tools enable us to handle a huge amount of data, and experiment with LMs that are two orders of magnitude larger than usual. An empirical exploration of the problem led us to re-discover a lesser-known interaction between Kneser-Ney smoothing and entropy pruning, possible non-stationarity of the query stream, as well as strong dependence on various English locales---USA, Britain and Australia. LM compression techniques allowed us to use LMs with one billion n-grams in the first pass of an ASR system built on FST technology, and to evaluate empirically whether a two-pass system architecture has any losses over one pass. Confidence filtering on log data provides us with enormous amounts of automatically transcribed speech data of near human transcriber quality, and enables us to train distributed LMs discriminatively.
About the speaker
Ciprian Chelba is a Research Scientist with Google. His research interests are in statistical modeling for natural language and speech. His recent projects include query language modeling for Google voice search, and indexing, ranking and snippeting for search in spoken content.
Prior to Google, he spent six years at Microsoft Research, working with the speech technology group. He graduated from The Johns Hopkins University in 2000. His thesis work was on "Structured Language Modeling"---exploiting syntax for improved language modeling.
|Feb. 10||Wael Khreich||
Adaptive Techniques for Statistical Machine Translation in Industrial Environment
Note special time and location: 10h00--12h00 in BA5256
This presentation highlights current results on domain adaptation techniques for statistical machine translation (SMT) systems, conducted during my postdoctoral research at NLP Technologies Inc. Since there are continuous requests for translation of new specific domains with limited amounts of parallel sentences, the adaptation of current SMT systems to these domains would reduce translation time and effort. The performance of different domain adaptation techniques such as log-linear models and mixture models has been evaluated in the translation environment of NLP Technologies using legal corpora. Evaluation involved human post-editing effort and time as well as automated scoring techniques (BLEU scores). Results have shown that the domain adaptation techniques can yield a significant increase in BLEU score (up to four points) and a reduction in post-editing time of about one second per word.
Future work involves the dynamic integration of post-editors' feedback into the SMT system.
About the speaker
Postdoctoral Industrial R&D Fellow at NLP Technologies Inc., Montreal, Canada.
PhD in Engineering from École de technologie supérieure, Montreal, Canada.
Conducting research on adaptive methods for statistical machine translation.
Other research interests include on-line and incremental learning, multiple classifier systems, decision fusion and novelty detection.
|May 8||Manfred Stede||
CANCELLED The structure of argument: Manual (and automatic) text annotation
While certain aspects of text coherence apply to almost any text, others are specific to the particular text type or discourse mode (descriptive, narrative, expository, instructive, argumentative). I describe an approach toward representing the "deep" structure of argumentative texts, which is inspired by Freeman (1991) but adds a number of modifications. After presenting some initial results on manual annotation, I will sketch our ongoing work aiming at automating this annotation process, i.e. the notion of "argument mining".
|May 31||Eduard Hovy||
NLP: Its Past and 3½ Possible Futures
Note special time: 10h30--12h00
Natural Language text and speech processing (Computational Linguistics) is just over 50 years old, and is still continuously evolving — not only in its technical subject matter, but in the basic questions being asked and the style and methodology being adopted to answer them. As unification followed finite-state technology in the 1980s, statistical processing followed that in the 1990s, and large-scale processing is increasingly being adopted (especially for commercial NLP) in this decade, a new and quite interesting trend is emerging: a split of the field into three somewhat complementary and rather different directions, each with its own goals, evaluation paradigms, and methodology. The resource creators focus on language and the representations required for language processing; the learning researchers focus on algorithms to effect the transformation of representation required in NLP; and the large-scale hackers produce engines that win the NLP competitions. But where the latter two trends have a fairly well-established methodology for research and papers, the first doesn't, and consequently suffers in recognition and funding. In the talk, I describe each trend, provide some examples of the first, and conclude with a few general questions, including: Where is the heart of NLP? What is the nature of the theories developed in each stream (if any)? What kind of work should one choose to do if one is a grad student today?
|August 15||Mehdi Hafezi Manshadi||
Dealing with quantifier scope ambiguity in deep semantic analysis
Note special time: 10h30--12h00
Quantifier scope ambiguity is one of the most challenging problems in deep language understanding systems. In this talk, I briefly discuss Scope Underspecification, the most common way to deal with quantifier scope ambiguity in deep semantic representation. I will then discuss our efforts to build the first corpus of scope-disambiguated English text in which there is no restriction on the number or the type of the scopal operators. I will explain some of the major difficulties in the hand-annotation of quantifier scoping and present our solutions to overcome them. Finally, I will explain a Maximum Entropy model adopted to do automatic scope disambiguation on the corpus, defining a baseline for future efforts.
|Sep. 20||Dan Roth||
Learning from Natural Instructions
Note special time: 9h30--11h00
About the speaker
Dan Roth is a Professor in the Department of Computer Science and the Beckman Institute at the University of Illinois at Urbana-Champaign and a University of Illinois Scholar. He is the director of a DHS Center for Multimodal Information Access & Synthesis (MIAS) and also has faculty positions in Statistics, Linguistics and at the School of Library and Information Sciences.
Roth is a Fellow of AAAI for his contributions to the foundations of machine learning and inference and for developing learning-centered solutions for natural language processing problems. He has published broadly in machine learning, natural language processing, knowledge representation and reasoning, and learning theory, and has developed advanced machine learning based tools for natural language applications that are being used widely by the research community.
Prof. Roth has given keynote talks in major conferences, including AAAI, EMNLP and ECML and presented several tutorials in universities and conferences including at ACL and EACL. Roth was the program chair of AAAI'11, CoNLL'02 and of ACL'03, and is or has been on the editorial board of several journals in his research areas and has won several teaching and paper awards. Prof. Roth received his B.A. summa cum laude in Mathematics from the Technion, Israel, and his Ph.D. in Computer Science from Harvard University in 1995.
Machine learning is traditionally formalized as the study of learning concepts and decision functions from labeled examples. This requires representations that encode information about the target function's domain. We are interested in providing a way for a human teacher to interact with an automated learner using natural instructions, communicating relevant domain expertise to the learner without necessarily knowing anything about the internal representation used in the learning process. The underlying problem becomes that of interpreting the natural language lesson in the context of the task of interest.
This talk focuses on the machine learning aspects of this problem. The key challenge is to learn intermediate structured representations - natural language interpretations - without being given direct supervision at that level.
We will present research on Constrained Conditional Models (CCMs), a framework that augments probabilistic models with declarative constraints in order to support learning such interpretations. In CCMs we formulate natural language interpretation problems as Integer Linear Programs, as a way to assign values to sets of interdependent variables and perform constraints-driven learning and global inference that accounts for the interdependencies.
In particular, we will focus on new algorithms for training these global models using indirect supervision signals. Learning models for structured tasks is difficult partly since generating supervision signals is costly. We show that it is often easy to obtain a related indirect supervision signal, and discuss several options for deriving this supervision signal, including inducing it from the world's response to the model's actions, thus supporting Learning from Natural Instructions.
We will explain and show the contribution of easy-to-get indirect supervision to other NLP tasks such as Information Extraction, Transliteration and Textual Entailment.
|Oct. 5||Kathleen Fraser||
Projected Barzilai-Borwein Method with Infeasible Iterates for Nonnegative Image Deconvolution
The Barzilai-Borwein (BB) method for unconstrained optimization has attracted attention for its "chaotic" behaviour and fast convergence on image deconvolution problems. However, images with large areas of darkness, such as those often found in astronomy or microscopy, have been shown to benefit from approaches which impose a nonnegativity constraint on the pixel values. I present a new adaptation of the BB method which enforces a nonnegativity constraint by projecting the solution onto the feasible set, but allows for infeasible iterates between projections. I show that this approach results in faster convergence than the basic Projected Barzilai-Borwein (PBB) method, while achieving better quality images than the unconstrained BB method.
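A minimal sketch of the flavour of such a method for a nonnegative least-squares problem, min ||Ax - b||^2 subject to x >= 0: Barzilai-Borwein step lengths with a projection applied only every few iterations, so intermediate iterates may be infeasible. The step-length choice, safeguards, and projection schedule here are illustrative assumptions, not the exact algorithm presented in the talk.

```python
# Hedged sketch of a projected Barzilai-Borwein iteration with infeasible iterates:
# projection onto x >= 0 is applied only every few steps, not after every update.
import numpy as np

def projected_bb(A, b, x0, iters=100, project_every=5):
    x_prev = x0.copy()
    g_prev = A.T @ (A @ x_prev - b)
    x = np.maximum(x_prev - 1e-3 * g_prev, 0.0)      # small first step, then project
    for k in range(1, iters):
        g = A.T @ (A @ x - b)
        s, y = x - x_prev, g - g_prev
        sy = s @ y
        alpha = (s @ s) / sy if sy > 0 else 1e-3      # BB1 step length with a crude safeguard
        x_prev, g_prev = x, g
        x = x - alpha * g                             # unconstrained BB step (may go negative)
        if k % project_every == 0:
            x = np.maximum(x, 0.0)                    # occasional projection onto the feasible set
    return np.maximum(x, 0.0)

rng = np.random.default_rng(0)
A = rng.random((40, 20))
x_true = np.maximum(rng.standard_normal(20), 0.0)    # sparse nonnegative ground truth
b = A @ x_true
x_hat = projected_bb(A, b, np.zeros(20))
print(np.linalg.norm(A @ x_hat - b))                  # residual of the recovered solution
```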
|Oct. 26||Sravana Reddy||
Unsupervised Learning of Pronunciations
Note special time: 10h30--12h00
How well can we guess the sound of a word from its textual representation? Translating written language to its spoken form is a key component of speech technology. The standard problem of learning a model of letter-to-phoneme transformations from an existing lexicon is especially hard in writing systems like English where there is a non-trivial mapping between letters and phonemes. The problem becomes even more complex when we involve accents and dialects, or when we have no parallel training data.
In this talk, I will present some of my research that involves learning pronunciations with various degrees of unsupervision. I will first describe two methods for learning the latent alignments between letters and phonemes from an existing pronunciation lexicon in order to build a letter-to-phoneme model. I will also present a method to augment letter-to-phoneme models with speech information -- specifically, speech recognition errors on out-of-vocabulary words. The talk will then discuss the problem of extracting rhyme and meter in an unsupervised way from written poetry, both of which provide major cues to historical and dialectical pronunciations. Finally, I will present ongoing work on learning pronunciations from speech when both the lexicon and the speech transcriptions are unknown, a novel problem that is potentially useful for low-resource languages and dialects.
This talk covers joint work with John Goldsmith, Kevin Knight, Evandro Gouvea, and Karen Livescu.
|Nov. 16||Alistair Kennedy||
A Supervised Method of Feature Weighting for Measuring Semantic Relatedness
Note special time: 10h30--12h00
Clustering of related words is crucial for a variety of Natural Language Processing applications. A popular technique is to use the context that a word appears in to build vectors that represent a word's meaning. Vector distance is then taken to determine whether two words have similar meanings. Usually these contexts are given weight based on some measure of association between the word and the context. These measures increase the weight of contexts where a word appears regularly but other words do not, and decrease the weight of contexts where many words may appear. Essentially, it is unsupervised feature weighting. I will present and discuss a method of supervised feature weighting. It identifies contexts shared by pairs of words known to be semantically related or unrelated, and then uses this information to weight these contexts on how well they indicate word relatedness. The system can be trained with data from resources such as WordNet or Roget's Thesaurus. This work is a step towards adding new terms to Roget's Thesaurus automatically, and doing so with high confidence.
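A toy sketch of the supervised weighting idea: score each context by how much more often it is shared by word pairs known to be related than by pairs known to be unrelated. The contexts, training pairs, and smoothed-ratio weight below are illustrative assumptions, not the measure used in the talk.

```python
# Illustrative sketch: weight each context feature by how much more often it is shared
# by known-related word pairs than by known-unrelated pairs (add-one smoothed ratio).
# Toy data stands in for training pairs drawn from WordNet or Roget's Thesaurus.
from collections import Counter

contexts = {                      # word -> set of contexts it occurs in (toy data)
    "cat":   {"pet", "fur", "meow"},
    "dog":   {"pet", "fur", "bark"},
    "car":   {"engine", "road", "wheel"},
    "truck": {"engine", "road", "cargo"},
}
related = [("cat", "dog"), ("car", "truck")]
unrelated = [("cat", "car"), ("dog", "truck")]

def shared_counts(pairs):
    c = Counter()
    for a, b in pairs:
        c.update(contexts[a] & contexts[b])   # contexts shared by the pair
    return c

pos, neg = shared_counts(related), shared_counts(unrelated)
all_feats = set().union(*contexts.values())
weights = {f: (pos[f] + 1) / (neg[f] + 1) for f in all_feats}
print(sorted(weights.items(), key=lambda kv: -kv[1])[:3])   # most relatedness-indicative contexts
```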
|Nov. 23||Ulrich Germann||
Resolving Word-Order Differences in German-to-English Machine Translation
Word-order differences between human languages pose one of the big challenges in automatic translation. The currently predominant translation paradigm, phrase-based statistical machine translation (PBSMT), does reasonably well at handling local word order changes, i.e., word order changes that occur within a small window of just a few words, but often fails to perform necessary large-scale re-arrangements.
Translation models based on syntactic trees provide a good account for patterns of such large-scale re-ordering, but rarely outperform PBSMT in practice, as measured by standard evaluation metrics.
In this talk, I'll explain why this is the case and present a hybrid method that combines information from source-side parse forests with the strengths of PBSMT.
|Dec. 7||Atefeh Farzindar||
Trusted Automatic Summarization and Translation of Legal Information
NLP Technologies and RALI (Applied Research in Computational Linguistics, Université de Montréal) have developed a technology for automated analysis of legal information in order to facilitate information retrieval in the banks of judgments published by legal information providers. During this seminar, Atefeh Farzindar will give a presentation on TRANSLI, a statistical machine translation system specifically designed for legal texts, and DecisionExpress, a supervised machine learning system for summarizing legal documents in three legal fields: immigration, tax and intellectual property.
About the speaker
Dr. Atefeh Farzindar is the founder of NLP Technologies Inc., a company specializing in Natural Language Processing, automatic summarization and statistical machine translation. Dr. Farzindar received her Ph.D. in Computer Science from the Université de Montréal and Paris-Sorbonne University. She is an adjunct professor at the Department of Computer Science at the Université de Montréal. Dr. Farzindar has made many contributions to research on automatic summarization and content management systems. As president of NLP Technologies, she has managed multiple collaborative R&D projects with various industry and university partners. She is the chair of the language technologies sector of the Language Industry Association (AILIA). Dr. Farzindar is a board member of the Language Technologies Research Centre, co-chair of the Canadian Conference on Artificial Intelligence 2010 and industry chair for Canadian AI'2011 and AI'2012.
About NLP Technologies Inc.
NLP Technologies is a specialized company and industry leader in the field of automatic summarization and statistical translation. Founded in February 2005, NLP Technologies has developed and marketed automatic summarization and statistical translation software stemming from its research, along with software tools and related services. The company was founded in response to a specific need of the Canadian government: to streamline the traditionally cumbersome and time-consuming processes of reading, analyzing, and researching legal information, and to address a shortage of skilled translators in the face of an increasing volume of texts in foreign languages.
|Feb. 4||Kinfe Tadesse Mengistu||
Adapting Acoustic and Lexical Models to Dysarthric Speech
Dysarthria is a condition in which speech is made unintelligible due to neurological damage to the part of the brain that controls the physical production of speech and is in part characterized by pronunciation errors that include deletions, substitutions, insertions, and distortions of phonemes. These errors follow consistent intra-speaker patterns that we exploit through acoustic and lexical model adaptation to improve automatic speech recognition (ASR) on dysarthric speech. We show that acoustic model adaptation yields an average relative word error rate (WER) reduction of 36.99% and that pronunciation lexicon adaptation (PLA) reduces the relative WER further by an average of 8.29% on a large vocabulary task of over 1500 words for 6 speakers with severe to moderate dysarthria. PLA also shows an average relative WER reduction of 7.11% on speaker-dependent models evaluated using 5-fold cross-validation.
|Feb. 18||Chris Parisien||
Finding Structure in the Muck: Bayesian Models of How Kids Learn to Use Verbs
Children are fantastic data miners. In the first few years of their lives, they discover a vast amount of knowledge about their native language. This means learning not just the abstract representations that make up a language, but also learning how to generalize that knowledge to new situations -- in other words, figuring out how language is productive. Given the noise and complexity in what kids hear, this is incredibly difficult, yet still, it seems effortless. In verb learning, a lot of this generalization appears to be driven by strong regularities between form and meaning. Seeing how a certain verb has been used, kids can make a decent guess about what it means. Knowing what a verb means can suggest how to use it.
In this talk, I present a series of hierarchical Bayesian models to explain how children can acquire and generalize abstract knowledge of verbs from the language they would naturally hear. Using a large, messy corpus of child-directed speech, these models can discover a broad range of abstractions governing verb argument structure, verb classes, and alternation patterns. By simulating experimental studies in child development, I show that these complex probabilistic abstractions are robust enough to capture key generalization behaviours of children and adults. Finally, I will discuss some promising ways that the insights gained from modeling child language can benefit the development of a valuable large-scale linguistic resource, namely VerbNet.
|Mar. 4||Antti Arppe||
How to THINK in Finnish? -- Making sense of multivariate statistical analysis of linguistic corpora
I will discuss an overall methodological framework presented in my dissertation for studying linguistic alternations, focusing specifically on lexical variation in denoting a single meaning, that is, synonymy. As the practical example, I employ the synonymous set of the four most common Finnish verbs denoting THINK, namely 'ajatella, miettiä, pohtia, harkita', roughly corresponding to 'think, reflect, ponder, consider'. As a continuation to previous work, I describe the extension of statistical methods from dichotomous linguistic settings (e.g., Gries 2003; Bresnan et al. 2007) to polytomous ones, that is, concerning more than two possible alternative outcomes.
As the key multivariate method, I demonstrate the use of polytomous logistic regression on the studied phenomenon. The results of the various statistical analyses confirm that a wide range of contextual features across different categories are associated with the use and selection of the selected Finnish verbs of thinking, with the differences among them being subtle but systematic. Interestingly, many of the individual contextual preferences of these currently abstract verbs can be traced back to their etymological origins denoting concrete agricultural, hunting and fishing activities in early Finnish culture and society. In terms of overall performance of the multivariate analysis and modeling, the prediction accuracy seemingly reaches a ceiling at a Recall rate of roughly two-thirds of the sentences in the research corpus. The analysis of these results suggests a limit to what can be explained and determined within the *immediate sentential context* and applying the conventional descriptive and analytical apparatus based on currently available linguistic theories and models. Nevertheless, Inkpen and Hirst (2006) have reported over 90% accuracy in a similar synonym choice modeling task, but this required explanatory variables indicating "nuances" such as denotational microdistinctions as well as expressive ones concerning the speaker's intention to convey some attitude, in addition to the sought-after style, which are not necessarily explicitly evident in the immediate sentential context nor easily amenable to accurate automated extraction.
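For concreteness, a small sketch of polytomous (multinomial) logistic regression predicting which of the four verbs is chosen from binary contextual features; the feature names and data below are invented for illustration and are not Arppe's variables.

```python
# Toy sketch of the polytomous (multinomial) logistic regression setup: predict which of
# four synonymous verbs is chosen from binary contextual features. Features and data are
# invented placeholders, not the dissertation's actual variables.
import numpy as np
from sklearn.linear_model import LogisticRegression

verbs = ["ajatella", "miettiä", "pohtia", "harkita"]
# columns: [subject_is_human, object_is_clause, verb_is_negated]  (hypothetical features)
X = np.array([[1, 0, 0], [1, 1, 0], [0, 1, 0], [1, 0, 1],
              [1, 1, 1], [0, 0, 0], [1, 1, 0], [0, 0, 1]])
y = np.array([0, 1, 2, 3, 1, 0, 2, 3])          # indices into `verbs`

model = LogisticRegression(max_iter=1000).fit(X, y)
probs = model.predict_proba([[1, 1, 0]])[0]      # outcome probabilities for one context
print({v: round(p, 2) for v, p in zip(verbs, probs)})
```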
The results also support Bresnan's (2007) and others' (e.g., Bod et al. 2003) probabilistic view of the relationship between linguistic usage and the underlying linguistic system, in which only a minority of linguistic choices are categorical, given the known context - represented as a feature cluster - that can be analytically grasped and identified. Thus, instead of viewing lexical choice in terms of context-specific rules with exceptions, we may rather interpret different contexts to exhibit degrees of variation as to their outcomes, observable as proportionate choices over longer stretches of usage in texts or speech.
|Mar. 21||Uzzi Ornan||
Semantic Search Engine
Note special place and time: PT266 from 10h30--12h00
One of the obstacles to a complete and accurate search engine is that in every natural language there are many words with more than one meaning. If the search is done according to the form only, a significant portion of the results will be superfluous, and thus not accurate. The usual advice, to add another word or words to the search, may hurt completeness.
In our engine we want to find only the intended meaning of every word. To achieve this, the whole clause or even the whole sentence must be consulted first. We follow the ideas of Fillmore's early framework, in which an expression is built around a verb as its center. We have built 'conceptual' lexicons for verbs and for nouns with semantic distinctive features. These describe the real world, and thus may fit all languages. Of course, the usual morphological and syntactic tools for each language must be activated first. The engine identifies the verb of the syntactic unit and first eliminates syntactically improper NPs (for some languages word order is essential). Then it looks for the semantic distinctive features demanded by the verb, and eliminates NPs that don't have the needed features. The result is that the program chooses the word with the requested meaning only. Building the engine is still in progress. We are manually enlarging the lexicons from 20,000 to 30,000 entries.
Initially, the engine has been built for Hebrew, which poses special problems due to its script. Most of the vowels in Hebrew or Arabic script are not written, many particles are attached to the word without a space, a double consonant is written with one letter, and some letters signify both vowels and consonants. Thus, almost every string of characters may designate several words (the average in Hebrew is almost three words). The program converts each string to all possible phonemic written words in Latin characters, thus yielding plenty of possible words in the expression to consider. We then choose the proper word in the whole sentence, as described above.
|Mar. 22||Mark Hasegawa-Johnson||
Semi-Supervised Learning for Spoken Language User Interface
Note special place and time: PT266 from 11h00--13h00
About the speaker
Mark is Associate Professor, Department of Electrical and Computer Engineering, at the University of Illinois, Urbana. His field of interest is speech production and recognition by humans and computers, including landmark-based speech recognition, integration of prosody in speech recognition and understanding, audiovisual speech recognition, computational auditory scene analysis, and biomedical imaging of the muscular and neurological correlates of speech production and perception.
Speech is rhythm and melody, with perceptually salient pops and hisses inserted as necessary to optimize the channel capacity. Although speech is usually transcribed using a sequence of letters, it is rarely spoken using the sounds those letters represent. In this talk I will argue that babies, polyglots, and machine learning algorithms are best able to learn speech if they treat its associated text transcription as, at best, an untrustworthy indication of things that might have been contained in the utterance. Speech is primarily prosody; the pragmatics of an utterance govern its phrasing, and the phrasing of the utterance governs the coordination and strength of the articulatory gestures implementing any particular syllable. Fortunately, the phrasing of an utterance is one of its most perceptually salient characteristics, therefore prosody can be learned with good accuracy using semi-supervised machine learning techniques, including regularized Gaussian mixture modeling methods. Phonetic landmarks can also be learned using semi-supervised methods. The sequence of articulatory gestures is hard to learn using regularized semi-supervised methods, but can be predicted pretty well from first principles, and is therefore amenable to modeling using a finite state transducer or dynamic Bayesian network. With appropriate combination of landmark-based acoustic analysis, gesture-based pronunciation analysis, and prosody-based content analysis, it becomes possible to create human-computer interfaces that perform better than the state of the art in some challenging domains, e.g., in the domain of second-language pronunciation training, and in the domain of assistive and augmentative communication for talkers with Cerebral Palsy.
|Apr. 1||Vivian Tsang||
Error Recovery in Learning
Garden path sentences are sentences that are grammatically correct but written in such a way that the most likely interpretation is incorrect. These sentences highlight two interesting aspects of human communication: 1) humans have a tendency to make assumptions (predictions?) about the content before it is completely revealed and 2) when an assumption is broken, it is not easy to recover unless the person is willing to backtrack and start afresh.
In communication, we posit that miscommunication often occurs due to incorrectly made assumptions about the content or the mental state of the other person. Miscommunication itself can be repaired as long as the participants involved are willing to clarify and repair. The repair may not be so straightforward when it is cast within the context of learning, where there is a power differential between the learner and the authority, and the learner has to rely on the authority as the "gold standard."
We take a more nuanced view about learning in that a learner's error may not be entirely erroneous. For example, children are known to overgeneralize past-tense inflection, such as applying the -ed inflection on irregular verbs. This is indeed an error but also demonstrates an awareness of the regular inflection. How is the learner (and importantly, the authority) to recognize the error and yet to be cognizant of the partially correct aspect of the erroneous behaviour? (And how does one know if a correct behaviour is correct for the right reason?) We will describe our preliminary experimental setup to examine error recovery in human learning.
|Sept. 24||Suresh Manandhar||
Graph based methods for inducing word senses
Note special place: BA5256
Unsupervised learning of lexical semantics is an emerging area within NLP that poses interesting and challenging problems. The primary advantage of unsupervised and minimally supervised methods is that annotated data is not required, or is required only in small quantities. In this talk, I will present our current work on word sense induction. Unsupervised sense induction is the task of discovering all the senses of a given word from raw unannotated data. Our collocational graph based method achieves high evaluation scores while overcoming some of the limitations of existing methods. We show that graph connectivity measures can be employed to avoid the need for supervised parameter tuning. Finally, hierarchical clustering and hierarchical random graphs can be employed for inducing concept hierarchies.
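A toy sketch of the collocational-graph idea: connect context words that co-occur with the target and read induced senses off the graph, here simply as connected components; the actual graph construction, connectivity measures, and clustering used in this work are considerably more refined.

```python
# Toy sketch of collocational-graph sense induction for an ambiguous target ("bank"):
# co-occurring context words are linked, and graph clusters (here: connected components)
# stand in for induced senses. The real method is far more refined than this.
import itertools
import networkx as nx

contexts = [                       # bags of words co-occurring with the target
    {"river", "water", "shore"},
    {"money", "loan", "account"},
    {"water", "shore", "fish"},
    {"loan", "interest", "account"},
]

G = nx.Graph()
for ctx in contexts:
    for u, v in itertools.combinations(sorted(ctx), 2):
        G.add_edge(u, v)           # connect words that co-occur in the same context

senses = list(nx.connected_components(G))
print(senses)                      # two clusters: a "river" sense and a "finance" sense
```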
|Oct. 25||Yuji Matsumoto||
Japanese National Corpus Project and Corpus Tools
Note the atypical day and time: 25 Oct at 9h30. The meeting will be held in PT266.
We have been participating in the Japanese National Corpus Project, aiming at the construction of a 100-million-word contemporary Japanese corpus. Our main tasks in this project are to develop corpus annotation tools such as POS taggers, chunkers, dependency parsers, and predicate-argument structure analyzers, and to implement corpus management tools for corpus retrieval and annotation error correction.
After a brief introduction to the project, I will mainly talk about the problems in syntactic annotation, especially the way we handle coordination structure analysis and its annotation scheme.
|Oct. 25||Eric Nichols||
Statement Map: Reducing Web Information Noise through Opinion Classification
Note the atypical day and time: 25 Oct at 9h30. The meeting will be held in PT266.
On the Internet, users often encounter noise in the form of spelling errors or unknown words; however, dishonest, unreliable, or biased information also acts as noise that makes it difficult to find credible sources of information. As people come to rely on the Internet for more and more information, reducing this credibility noise grows ever more urgent. The Statement Map project's goal is to help Internet users evaluate the credibility of information sources by mining the Web for a variety of viewpoints on their topics of interest and presenting them to users together with supporting evidence in a way that makes it clear how they are related.
In this presentation, we show how a Statement Map system can be constructed by combining Information Retrieval (IR) and Natural Language Processing (NLP) technologies, focusing on the task of organizing statements retrieved from the Web by viewpoints. We frame this as a semantic relation classification task, and identify 4 semantic relations: [AGREEMENT], [CONFLICT], [CONFINEMENT], and [EVIDENCE]. The former two relations are identified by measuring semantic similarity through sentence alignment, while the latter two are identified through sentence-internal discourse processing. As a prelude to end-to-end user evaluation of Statement Map, we present a large-scale evaluation of semantic relation classification between user queries and Internet texts in Japanese and conduct detailed error analysis to identify the remaining areas of improvement.
|Oct. 29||Bob Carpenter||
Hierarchical Models of Data Coding: Inferring Ground Truth along with Annotator Accuracy, Bias, and Variability
Supervised statistical models often rely on human-coded data. For instance, linguists might code Arabic text for syntactic categories or code newspaper titles for political bias. In epidemiology, doctors tag images or tissue samples with respect to patient disease status. Most commonly, their collective decisions are coerced by voting, adjudication, and/or censoring into a best-guess "gold standard" corpus, which is then used to evaluate model performance.
In this talk, I'll introduce a generative hierarchical model and full Bayesian posterior inference for the annotation process for categorical data. Given a collection of annotated data, we can infer the true labels of items, the prevalence of some phenomenon (e.g. a given intonation or syntactic alternation or the disease prevalence in a population), the accuracy and category bias of each annotator, and the codability of the theory as measured by the hierarchical model of accuracy and bias of annotators and their variability.
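To make the setup concrete, here is a compact EM-style point estimate of true labels and per-annotator confusion matrices (accuracy and category bias) from categorical annotations; this is plain Dawid-Skene-style estimation, a simplification rather than the full Bayesian hierarchical model of the talk.

```python
# Simplified Dawid-Skene-style EM estimation of true labels and per-annotator confusion
# matrices (accuracy and category bias). A point-estimate sketch, not the full Bayesian model.
import numpy as np

def estimate(labels, n_classes, iters=20):
    """labels: array (items, annotators) of category ids. Returns (item posteriors, confusions)."""
    I, A = labels.shape
    post = np.zeros((I, n_classes))                    # init from vote proportions
    for k in range(n_classes):
        post[:, k] = (labels == k).mean(axis=1)
    for _ in range(iters):
        # M-step: per-annotator confusion matrices theta[a, true, observed]
        theta = np.ones((A, n_classes, n_classes))     # add-one smoothing
        for a in range(A):
            for k in range(n_classes):
                theta[a, :, k] += post.T @ (labels[:, a] == k)
        theta /= theta.sum(axis=2, keepdims=True)
        # E-step: recompute posteriors over true labels under a prevalence prior
        ll = np.tile(np.log(post.mean(axis=0) + 1e-12), (I, 1))
        for a in range(A):
            ll += np.log(theta[a][:, labels[:, a]].T + 1e-12)
        post = np.exp(ll - ll.max(axis=1, keepdims=True))
        post /= post.sum(axis=1, keepdims=True)
    return post, theta

votes = np.array([[0, 0, 1], [1, 1, 1], [0, 1, 0], [0, 0, 0]])   # 4 items, 3 annotators
post, theta = estimate(votes, n_classes=2)
print(post.argmax(axis=1))    # inferred "true" labels
```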
I'll demonstrate the efficacy of the approach using expert and non-expert pools of annotators for simple linguistic labelling tasks such as textual inference, morphological tagging, and named-entity extraction, as well as for dentists labeling X-rays for cavities. The model not only automatically adjusts for spam annotators, it infers more accurate gold-standard data than simpler approaches such as voting and censoring.
I'll discuss applications such as monitoring an annotation effort, selecting items with active learning, and generating a probabilistic gold standard for model training and evaluation.
I'll also discuss the challenge of estimating item difficulty effects, which are evident to annotators and also apparent through observed covariance among annotation decisions.
|Nov. 12||Tim Fowler||Parsing with categorial grammars|
|Nov. 26||Tong Wang||
Associating Difficulty in Near-Synonymy Choice with Types of Nuance using Core Vocabulary
Stylistic variation among near-synonyms is an important dimension that has been frequently addressed in near-synonymy research. In this study, we hypothesize that the stylistic nature of nuances correlates with the degree of difficulty in choosing between near-synonyms. In contrast to some recent studies that focus on contextual preferences of synonyms (e.g., Arppe & Järvikivi 2007), we elect to investigate the internal features of near-synonym nuances. We adopt the notion of core vocabulary to associate stylistic variation in theory with the difficulty level of near-synonym choice in practice.
To test our hypothesis, a near-synonym lexical choice task (Edmonds 1997) is employed to measure difficulty levels. Our study shows that variance of performance on this task is correlated with differing degrees of coreness of the near-synonyms, and in turn, different types of near-synonym variations. Counter to intuition, the seemingly subtle stylistic nuances are usually easier for subjects to distinguish than non-stylistic differences.
|Dec. 10||Vanessa (Wei) Feng||
Classifying arguments by scheme
Argumentation schemes are structures or templates for various kinds of arguments. The argumentation scheme classification system that I am going to present introduces a new task in this field. To the best of our knowledge, this is the first attempt to classify arguments into argumentation schemes automatically.
Given the text of an argument with premises and conclusion identified, we classify it as an instance of one of five common schemes, using general features and other features specific to each scheme, including lexical, syntactic, and shallow semantic features. We achieve accuracies of 63-91% in one-against-others classification and 80-94% in pairwise classification (baseline = 50% in both cases).
We design a pipeline framework whose ultimate goal is to reconstruct the implicit premises in an argument, and our argumentation scheme classification system is aimed at addressing the third component in this framework. While the first two portions of this framework can be fulfilled by the work of other researchers, we propose a syntax-based approach to the last component of this framework. The completion of the entire system will benefit many professionals in applications such as automatic reasoning assistance.
|Mar. 3||Frank Rudzicz||
Adaptive kernel canonical correlation analysis for estimation of task dynamics from acoustics
I present a method for acoustic-articulatory inversion whose targets are the abstract tract variables from task dynamic theory. Towards this end I construct a non-linear Hammerstein system whose parameters are updated with adaptive kernel canonical correlation analysis. This approach is notably semi-analytical and applicable to large sets of data. Training behaviour is compared across four kernel functions and prediction of tract variables is shown to be significantly more accurate than state-of-the-art mixture density networks.
|Mar. 17||Chris Parisien||
CANCELLED: Learning verb alternations in a usage-based Bayesian model
One of the key debates in language acquisition involves the degree to which children's early linguistic knowledge employs abstract representations. While usage-based accounts that focus on input-driven learning have gained prominence, it remains an open question how such an approach can explain the evidence for children's apparent use of abstract syntactic generalizations. We develop a novel hierarchical Bayesian model that demonstrates how abstract knowledge can be generalized from usage-based input. We demonstrate the model on the learning of verb alternations, showing that such a usage-based model must allow for the inference of verb class structure, not simply the inference of individual constructions, in order to account for the acquisition of alternations.
|Apr. 14||Aida Nematzadeh||TBD|
|Apr. 15||Jackie C.K. Cheung||
Parsing German Topological Fields with Probabilistic Context-Free Grammars
Research in statistical parsing has produced a number of high-performance parsers using probabilistic context-free grammar (PCFG) models to parse English text (Collins, 2003; Charniak and Johnson, 2005 inter alia). Problems arise, however, when applying these methods to freer-word-order languages. Languages such as Russian, Warlpiri, and German feature syntactic constructions that produce discontinuous constituents, directly violating one of the crucial assumptions of context-free models of syntax.
While PCFG technologies may thus be inadequate for full syntactic analysis of all phrasal structure in these languages, clausal structure can still be fruitfully parsed with these methods. In this work, we apply a latent variable-based PCFG parser (Petrov et al., 2006) to extract the topological field structure of German. These topological fields provide a high-level description of the major sections of a clause in relation to the clausal main verb and the subordinating heads and appear in strict linear sequences amenable to PCFG parsing. They are useful for tasks such as deep syntactic analysis, part-of-speech tagging and coreference resolution.
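As a toy illustration of clausal parsing over topological fields with a PCFG, the sketch below uses NLTK's Viterbi parser and an invented four-field grammar (VF, LK, MF, VC); it is only a schematic stand-in for the latent-variable parser applied in this work.

```python
# Toy PCFG over German topological fields (VF = Vorfeld, LK = linke Klammer,
# MF = Mittelfeld, VC = verbal complex), parsed with NLTK's Viterbi parser.
# The grammar and probabilities are invented for illustration, not the parser used here.
from nltk import PCFG
from nltk.parse import ViterbiParser

grammar = PCFG.fromstring("""
    S  -> VF LK MF VC [1.0]
    VF -> 'der' 'Mann' [1.0]
    LK -> 'hat' [1.0]
    MF -> 'das' 'Buch' [1.0]
    VC -> 'gelesen' [1.0]
""")

parser = ViterbiParser(grammar)
for tree in parser.parse("der Mann hat das Buch gelesen".split()):
    tree.pretty_print()     # clause bracketed into its topological fields
```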
We perform a qualitative error analysis of the parser output, and identify constructions like ellipses and parentheticals as the chief sources of remaining error. This result is confirmed by a further experiment in which parsing performance improves after restricting the training and test set to those sentences without these constructions. We also explore techniques for further improving parsing results. For example, discriminative reranking of parses made by a generative parser could incorporate linguistic information such as those derived by our qualitative analysis. Another possibility is self-training, a semi-supervised technique which utilizes additional unannotated data for training.
|Sep. 1||Barrou Diallo||
Research in Chinese machine translation at the European Patent Office
Note special time: 9h00 (not 9h10) to 10h00
About the speaker
Barrou Diallo is the Head of Research at the European Patent Office and Advisor to the Information Retrieval Facility. His focus is on machine translation, data mining, and enterprise architecture. He holds a Ph.D. in computer sciences, an M.Sc. in biomathematics and an M.Sc. in law and cyberspace. He has published several papers on patent processing, computer graphics, 3D visualisation and database management. He was the project manager of the first real-time European machine translation system for patents at the EPO. Prior to these various positions at the EPO, Barrou Diallo had a chair as Professor in the Chamber of Commerce of Le Mans and Assistant Professor at the University of Compiègne.
|Sep. 2||Akira Ushioda||
MT research and development in Japan
Note special time and place: PT378 at 11h00
About the speaker
Dr. Akira Ushioda obtained his Ph.D. from Carnegie Mellon University in 2000, and worked as a senior researcher (2000-2002) and director of the Intelligent Systems Laboratory (2003-2004) at Fujitsu Laboratories Ltd. He is currently a Research Fellow at Fujitsu Laboratories and Guest Associate Professor at the Nara Institute of Science and Technology, Japan. Dr. Ushioda's research interests cover a range of topics in the area of Natural Language Processing and statistical learning, including a lexical statistical parser, integration of SMT and RBMT, automatic clustering of words and phrases, and statistical word sense disambiguation.
The market size of private cramming schools and preparatory schools in Japan is 10 billion dollars, and more than a third of the market is comprised of language schools, mostly English language schools. The Japanese people are thus enthusiastic about learning English, and yet the TOEIC report on test-takers worldwide shows that the average TOEIC score of Japanese test-takers is ranked 25th out of the 27 countries with the most active test-takers. The awareness of poor performance makes them more desperate to learn English. Poor human performance, on the other hand, makes the relative performance of, and expectation for, MT higher. Japan has thus been quite actively engaged in developing machine translation technology both at the government level and at the private-sector level. The EDR (Electronic Dictionary Research) project, a government-led electronic dictionary research project, for example, began in 1986 and continued for a decade with a total budget of 150 million dollars. The participants in the project from the private sector include major Japanese electronics companies, such as Hitachi, Toshiba, Panasonic, Sharp, NEC and Mitsubishi Electric Corp.
Fujitsu Laboratories, also a participant in the EDR project, began developing English-to-Japanese and Japanese-to-English MT systems in the early '80s. Unlike other Japanese MT makers, Fujitsu employs an interlingua-oriented translation scheme, which makes differences in concept representation between Japanese and English easier to overcome. The deeper semantic representation, on the other hand, makes the grammar rule set somewhat harder to maintain and grow. Instead of further modifying the rule-based scheme, we are investigating ways to incorporate an SMT framework into the existing scheme. One of the issues at hand is how to bridge the gap between RBMT ``phrases'' and SMT ``phrases.''
This talk will provide background and an overview of MT development in Japan, describe Fujitsu's MT research and development, and discuss future directions of MT research and its major challenges.
|Sep. 16||CL Group||Fall 2009 welcoming meeting|
|Sep. 23||Varada Kolhatkar||
An extended analysis of a method of all-words sense disambiguation
One of the central problems in processing natural language is ambiguity. Every natural language has many potentially ambiguous words. Humans are fairly adept at resolving ambiguity by drawing on context and their knowledge of the world; however, it is not so easy for machines to understand the intended meaning of a word in a given context. Word Sense Disambiguation (WSD) is the process of selecting the correct sense of a word in a specific context.
It is often useful to generalize the problem of disambiguating a single word to that of disambiguating all content words in a given text. This generalized problem is referred to as all-words sense disambiguation. The long history of WSD research includes many different supervised, unsupervised, and knowledge-based approaches, but current state-of-the-art accuracy in WSD remains far below natural human ability. We present our analysis of some of the components that might be contributing to the level of error currently plaguing all-words sense disambiguation. Our analysis makes use of WordNet::SenseRelate::AllWords, an unsupervised knowledge-based system for all-words sense disambiguation that is freely available on the Web as a Perl module. The system assigns a WordNet sense to each word in a text using measures of semantic similarity and relatedness.
We find that the degree of difficulty in disambiguating a word is proportional to its number of senses (its polysemy). The experimental evidence indicates that a significant percentage of word sense disambiguation error is caused by a relatively small number of highly frequent word types. We also demonstrate that part-of-speech tagged text is disambiguated more accurately than raw text. We show that expanding the context window helps coverage but does not improve disambiguation. Finally, we find that when the answer is not the most frequent sense, disambiguation turns out to be a hard problem even for an unsupervised system that does not use any information about sense distributions.
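As a minimal illustration of the window-based strategy just described, the following Python sketch assigns each word the WordNet sense most similar to the senses of its neighbouring words. It is not the WordNet::SenseRelate::AllWords Perl module itself, only an approximation of the same idea, and it assumes NLTK with its WordNet corpus installed.

```python
# Minimal sketch of window-based all-words sense disambiguation in the spirit
# of WordNet::SenseRelate::AllWords, using NLTK's WordNet interface.
from nltk.corpus import wordnet as wn

def disambiguate(words, window=2):
    """Assign each word the synset most similar to its neighbours' synsets."""
    choices = []
    for i, word in enumerate(words):
        senses = wn.synsets(word)
        if not senses:
            choices.append(None)
            continue
        context = words[max(0, i - window):i] + words[i + 1:i + 1 + window]
        best, best_score = senses[0], -1.0
        for sense in senses:
            score = 0.0
            for other in context:
                # take the best similarity to any sense of the context word
                sims = [sense.path_similarity(s) or 0.0 for s in wn.synsets(other)]
                score += max(sims, default=0.0)
            if score > best_score:
                best, best_score = sense, score
        choices.append(best)
    return choices

print(disambiguate(["the", "bank", "approved", "the", "loan"]))
```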
|Oct. 7||Mohamed Attia||
Automatic full phonetic transcription of Arabic script
Handling most non-trivial NLP tasks via rule-based (i.e., language-factorizing) methods typically ends up with multiple possible solutions or analyses. After all known and applicable rule-based methods have been exhausted, statistical methods are among the most effective, feasible, and widely adopted approaches to resolving that ambiguity automatically.
Many researchers, however, argue that if statistical disambiguation is eventually deployed to get the most likely analysis or sequence of analyses, why not go fully statistical (i.e., non-factorizing) from the very beginning and give up the burden of rule-based methods altogether?
In our attempt to get the best performance for automatic full phonetic transcription of open-domain Arabic script, a tough industrial problem vital for applications such as Arabic TTS systems, building Arabic ASR training corpora, etc., one fundamental design decision was whether to go with the former architecture (language factorization followed by statistical disambiguation) or the latter (statistical disambiguation over unfactorized tokens).
While our years-long research on ``automatic Arabic phonetic transcription'' ended up with the best-performing system reported so far in the scientific literature (as of mid-2009), the winning architecture, interestingly, is neither of the two abovementioned options alone but a hybrid of both!
While the non-factorizing architecture is more computationally economical and easier to implement, the language-factorizing one overcomes the severe coverage problem from which the non-factorizing approach suffers. While both approaches asymptote to the same accuracy ceiling, the former has a faster learning curve than the latter. The best hybrid architecture therefore starts by trying the non-factorizing method on the raw input Arabic string, and only if a coverage failure occurs does it back off to the factorizing method.
While these conclusions have been obtained on the specific problem of ``automatic full phonetic transcription of Arabic script'', we think that many other problems - wherever choosing between a factorizing and a non-factorizing approach is an issue - may also benefit from this experience.
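The back-off control flow of the hybrid architecture described above can be sketched roughly as follows, with toy lexicons and scores standing in for the real statistical and rule-based components; none of this is the authors' actual code or data.

```python
# Sketch of the hybrid back-off architecture: try the non-factorizing
# (purely statistical) transcriber first, and fall back to the rule-based
# (language-factorizing) pipeline only on a coverage failure.
SEEN = {"كتاب": "k i t aa b"}          # toy non-factorizing lookup: raw token -> phonemes

def statistical_transcription(token):
    """Fast path: direct statistical mapping learned over unfactorized tokens."""
    if token not in SEEN:
        raise KeyError(token)           # coverage failure on an unseen token
    return SEEN[token]

def rule_based_analyses(token):
    """Back-off path: rule-based factorization proposes candidate analyses."""
    return [("k a t a b", 0.7), ("k u t i b", 0.3)]   # (phonemes, probability)

def transcribe(token):
    try:
        return statistical_transcription(token)        # non-factorizing first
    except KeyError:
        candidates = rule_based_analyses(token)         # factorize, then
        return max(candidates, key=lambda c: c[1])[0]   # disambiguate statistically

print(transcribe("كتاب"))   # found directly
print(transcribe("كتب"))    # unseen token -> back-off path
```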
|Oct. 21||Julian Brooke||
A semantic approach to automated text sentiment analysis
The identification and characterization of evaluative stance in written language poses a unique set of cross-disciplinary challenges. Beginning with a review of relevant literature in linguistics and psychology, I trace recent interest in automated detection of author opinion in online product reviews, focusing on two main approaches: the semantic model, which is centered on deriving the semantic orientation (SO) of individual words and expressions, and machine learning classifiers, which rely on statistical information gathered from large corpora. To show the potential long-term advantages of the former, I describe the creation of an SO Calculator, highlighting relevant linguistic features such as intensification, negation, modality, and discourse structure, and devoting particular attention to the detection of genre in movie reviews, integrating machine classifier modules into my core semantic model. Finally, I discuss sentiment analysis in languages other than English, including Spanish and Chinese.
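To make the semantic-orientation approach concrete, here is a minimal sketch of a dictionary-based SO calculator that handles intensification and negation. The lexicon values and modifier weights are invented for illustration, and the negation handling is a crude simplification; this is not the speaker's actual SO Calculator.

```python
# Toy semantic-orientation (SO) calculator illustrating dictionary lookup,
# intensification, and negation.  Scores and weights are invented.
SO_LEXICON = {"good": 3, "great": 4, "bad": -3, "terrible": -5}
INTENSIFIERS = {"very": 1.5, "slightly": 0.5}
NEGATORS = {"not", "never", "no"}

def sentence_so(tokens):
    total, count = 0.0, 0
    for i, tok in enumerate(tokens):
        if tok not in SO_LEXICON:
            continue
        score = float(SO_LEXICON[tok])
        if i > 0 and tokens[i - 1] in INTENSIFIERS:        # e.g. "very good"
            score *= INTENSIFIERS[tokens[i - 1]]
        if any(w in NEGATORS for w in tokens[max(0, i - 3):i]):
            score = -score * 0.5                           # dampened flip for negation
        total += score
        count += 1
    return total / count if count else 0.0

print(sentence_so("this movie was not very good".split()))       # negative
print(sentence_so("a great and slightly terrible plot".split()))
```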
|Nov. 4||Paul Thompson||
About the speaker
Paul Thompson is Chief Computational Linguist, Text Exploitation and Decision Support at General Dynamics Advanced Information Systems, Buffalo.
Forensic linguistics, or the use of linguistic analysis techniques to interpret evidence, e.g., authorship attribution, is an established discipline. In this talk I will describe research on the application of forensic linguistic techniques to computer security in the context of the Semantic Hacking project at Dartmouth College's Institute for Security Technology Studies. I will also discuss related research projects, including research on the detection of deception in text and in computer-mediated communication.
|Nov. 18||Gabriel Murray||
Summarizing Conversations in Various Modalities
In recent years, summarization research has extended beyond the extractive summarization of well-structured documents such as newswire and journal articles to consider corpora such as meeting transcripts, web-logs, lectures and emails. In many of these domains, researchers have found evidence that domain-specific features can yield additional improvement beyond the performance provided by standard text summarization algorithms. For example, prosodic features can be extracted from the speech signal to aid meeting and lecture summarization, while emails contain useful header information such as the number of recipients and the presence of attachments. In our research we investigate whether these conversational domains can be treated similarly, using a unified conversation feature set for extractive summarization. We show that this novel conversation summarization approach can perform on par with domain-specific approaches for meeting and email data, while being flexible enough to apply to many other conversation domains. This talk will also include a description of subjectivity detection and its application to conversation summarization, as well as an overview of our current approach which moves beyond extractive summarization.
|Dec. 2||Daphna Heller||
The use of common ground information in real-time comprehension and production
It is well known that the appropriateness of utterances depends on contextual information, but since contextual information is extremely varied in nature and has to be gathered from multiple sources, it remains an open question whether interlocutors can, in fact, use contextual information in real-time comprehension and production. In this talk, I focus on perspective information: what information is assumed to be shared among interlocutors and what information is privileged to one interlocutor but not the other. I present two psycholinguistic experiments investigating the ability of interlocutors to use the distinction between shared and privileged information in the earliest moments of comprehension and production. Experiment 1 uses the 'visual world' eye-tracking paradigm to study the comprehension of definite descriptions containing scalar adjectives when the visual perspectives of the interlocutors differ. Experiment 2 examines the production of artificial names for novel shapes in cases where the speaker learned more names than the addressee. The results demonstrate that perspective information is used from the earliest moments of both comprehension and production, highlighting interlocutors' impressive ability to use contextual information in real time.
|Jan. 16||Shalom Lappin||
Expressiveness and Complexity in Underspecified Semantics
Today's speaker is a visiting professor from the Department of Philosophy at King's College London.
In this paper we address an important issue in the development of an adequate formal theory of underspecified semantics. The tension between expressive power and computational tractability poses an acute problem for any such theory. Generating the full set of resolved scope readings from an underspecified representation produces a combinatorial explosion that undermines the efficiency of these representations. Moreover, Ebert (2005) shows that most current theories of underspecified semantic representation suffer from expressive incompleteness. In previous work we present an account of underspecified scope representations within Property Theory with Curry Typing (PTCT), an intensional first-order theory for natural language semantics. We review this account, and we show that filters applied to the underspecified-scope terms of PTCT permit expressive completeness. While they do not solve the general complexity problem, they do significantly reduce the search space for computing the full set of resolved scope readings in non-worst cases. We explore the role of filters in achieving expressive completeness, and their relationship to the complexity involved in producing full interpretations from underspecified representations.
Graduate Visit Day
|Mar. 13||Shane Bergsma||
Web-Scale Models of Natural Language
Today's speaker is visiting from the University of Alberta.
The World Wide Web has had an enormous impact on Natural Language Processing (NLP) research, both as a source of data and as a stimulus for new language technology. In this talk, I describe several recent NLP systems that use web-scale statistics to achieve superior performance. These systems employ supervised machine learning as a simple but powerful mechanism for integrating web-scale data. I present the evolution of using the Internet for language research: from the initial enthusiasm for search-engine page counts to the more scientifically-sound usage of web-scale text databases.
|Mar. 20||Yang Liu||
Extractive summarization and keyword extraction using meeting transcripts
Meeting corpora are much more challenging than written text (such as news articles) for various language processing tasks. In this talk, I will discuss some research we have done in the past two years on meeting understanding, specifically extractive meeting summarization and keyword extraction. When using a supervised learning framework for summarization, we propose different sampling methods and a regression model to address the imbalanced data problem and human annotation disagreement. I will present improved results using these methods for meeting summarization, as well as studies on the correlation between the automatic ROUGE measures and human evaluation for summarization. I will also show various results for keyword extraction, comparing supervised and unsupervised approaches, and show how to leverage summaries for keyword extraction.
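The imbalanced-data issue arises because only a small fraction of sentences belong in a summary. One generic remedy, sketched below with synthetic data and scikit-learn, is to downsample the majority (non-summary) class before training the sentence classifier; this illustrates the general idea rather than the specific sampling methods and regression model proposed in the talk.

```python
# Downsampling the majority class before training an extractive-summarization
# sentence classifier.  Features and labels are synthetic placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                 # sentence features (length, tf-idf, ...)
y = (rng.random(1000) < 0.05).astype(int)      # only ~5% of sentences are summary-worthy

pos = np.flatnonzero(y == 1)
neg = rng.choice(np.flatnonzero(y == 0), size=len(pos), replace=False)
keep = np.concatenate([pos, neg])              # balanced training subset

clf = LogisticRegression().fit(X[keep], y[keep])
scores = clf.predict_proba(X)[:, 1]            # rank all sentences by summary-worthiness
summary = np.argsort(scores)[::-1][:10]        # keep the top 10 sentences
print(summary)
```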
|Mar. 27||Abdel-Rahman Mohamed||
HAFSS: A Computer-Aided Pronunciation Learning System
In this talk, I will describe HAFSS, a speech-enabled computer-aided pronunciation teaching (CAPT) system. The system was developed for teaching Holy Qur'an recitation rules and Arabic pronunciation. HAFSS uses a state-of-the-art speech recognizer to detect errors in the user's recitation. One point that is critical in any practical language-learning system that exploits ASR technology is the user enrollment time (the time needed to adapt the system to the user's voice). I will talk about the enrollment process in HAFSS and discuss methods that were found helpful in reducing the total enrollment time needed by the system. I will also present one experiment that measures the usefulness of the system to a novice user and another that measures the correlation between the judgments of the HAFSS system and those of four human experts.
|Apr. 3||Tong Wang||
Extracting Synonyms from Dictionary Definitions
Much research effort has been spent on extracting words in various lexical semantic relations from different resources; the extraction of synonyms, however, has proved nontrivial due to the difficulty of finding features that are exclusive to synonymy.
I will talk about two rule-based approaches for extracting synonyms from dictionary definitions: by building an inverted index and by bootstrapping and matching against regex patterns. In one of the two evaluation schemes I used, these seemingly simple approaches actually outperform the best reported lexicon-based method by a large margin.
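The inverted-index idea can be illustrated with a toy dictionary: index which headwords use each definition word, then keep pairs of headwords that appear in each other's definitions. The dictionary and the mutual-mention criterion below are simplifications for illustration, not the method's exact rules.

```python
# Toy inverted-index synonym extraction from dictionary definitions.
from collections import defaultdict

definitions = {                               # headword -> definition text
    "glad": "happy or pleased",
    "happy": "glad; pleased",
    "fast": "quick; rapid",
    "quick": "fast; prompt",
}

index = defaultdict(set)                      # definition word -> headwords whose gloss uses it
for head, gloss in definitions.items():
    for word in gloss.replace(";", " ").split():
        index[word].add(head)

synonym_pairs = set()
for head, gloss in definitions.items():
    for word in gloss.replace(";", " ").split():
        # keep the pair only if each word appears in the other's definition
        if word in definitions and word != head and word in index[head]:
            synonym_pairs.add(tuple(sorted((head, word))))

print(synonym_pairs)   # {('fast', 'quick'), ('glad', 'happy')}
```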
|Sep. 19||Naishi Liu||
A Reduced Graph Model of Jokes
Today's speaker is a visiting scholar from Shanghai Jiao Tong University.
The talk is an introduction to a graph-theoretic model for the understanding of verbal humor (especially jokes). It follows the tradition of computational linguistics and builds on previous linguistic research, making use of graph elements such as vertices, edges, and subgraphs. The result is an interpretation model that accounts for how we understand humor, on the basis of which algorithms may be designed to facilitate automatic humor processing. Warning: the presentation may contain some sexually oriented or sexist data.
|Oct. 3||Anatoliy Gruzd||
Name Networks: A Content-Based Method for Automated Discovery of Social Networks to Study Collaborative Learning
Today's speaker is a PhD student at the Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign.
As a way to gain greater insight into the operation of e-learning communities, the presented work applies automated text-mining techniques to text-based communication to identify, describe, and evaluate underlying social networks. The research demonstrates that the resulting social networks can be used by members of e-learning communities to improve the learning experience, while faculty and administration can use them to understand online learning processes and to develop more appropriate and effective programs for the next generation of learners.
|Oct. 17||Prof. Iryna Gurevych||
Putting the "Wisdom-of-Crowds" to Use in NLP: Collaboratively Constructed Semantic Resources on the Web
About the Speaker:
Iryna Gurevych is Director of the Ubiquitous Knowledge Processing (UKP) Lab at the Technical University of Darmstadt. She is a recipient of the Young Excellence Emmy-Noether Award from the German Research Foundation (DFG) and of a Lichtenberg Professorship Award from the Volkswagen Foundation. Iryna is currently Principal Investigator of the projects ``Semantic Information Retrieval'', funded by the DFG, ``Mining Lexical-Semantic Knowledge from Dynamic and Linguistic Sources and Integration into Question Answering for Discourse-Based Knowledge Acquisition in eLearning'', also funded by the DFG, and THESEUS ``TEXO - Future Business Value Networks: Business Web'', funded by the German Ministry of Economics and Technology. She is a lecturer and scientific advisor in the research training program ``Quality Enhancement in eLearning through Regenerative Processes'', funded by the DFG. Her lab conducts research in lexical semantic processing with a focus on Web-based semantic resources, integrating lexical semantic knowledge into information retrieval and question answering, and text mining with a focus on sentiment analysis.
Further information: http://www.ukp.tu-darmstadt.de/
The rise of Web 2.0 and so-called socio-semantic technologies in recent years has led to huge amounts of user-generated content produced by ordinary users on the Web. This content called for user-generated tagging to enable better information navigation and retrieval. As a result, semantically tagged, collaboratively constructed knowledge repositories emerged that represent a novel type of Web-originated resource - we call them collaboratively constructed semantic resources (CCSRs). Example instances of CCSRs are collaboratively constructed and semantically enriched multilingual online encyclopedias, such as Wikipedia, and collaboratively constructed online multilingual dictionaries, such as Wiktionary.
NLP researchers have started to employ CCSRs as substitutes for conventional lexical semantic resources and repositories of world knowledge, such as thesauri, machine-readable dictionaries, or wordnets. By overcoming the limitations of existing resources, such as coverage gaps, significant construction and maintenance costs, and restricted availability, there is now hope of significantly enhancing the performance of numerous algorithms by utilizing the so-called ``wisdom of crowds'' in broad-coverage NLP systems. Combining CCSRs with statistical measures to obtain shallow, approximate semantic knowledge has already demonstrated excellent results in some NLP tasks.
The talk will present some of the recent work done at the Ubiquitous Knowledge Processing Lab that has had a significant impact in the area outlined above. In the first part, a set of semantic relatedness measures operating on various datasets and utilizing either conventional wordnets or CCSRs will be examined. In the second part, the knowledge in Wikipedia and Wiktionary is employed in domain-specific information retrieval, yielding significant improvements. The talk will conclude with some remarks on the interoperability of conventional knowledge resources and CCSRs.
|Oct. 31||Fraser Shein||
WordQ and SpeakQ software: Writing made easier
About the speaker
Fraser Shein is a new faculty member in the CL group. He is a senior rehabilitation engineer at Bloorview Kids Rehab, where his research interests include advanced computer accessibility technology, natural language processing as applied to writing software, speech recognition, and consumer-driven reporting of assistive technology experiences. He is also the President and CEO of Quillsoft Ltd., which produces software to help individuals write text using technologies such as natural-sounding text-to-speech, contextual word prediction, and speech recognition.
This presentation will discuss and demonstrate how the WordQ/SpeakQ software (for both Windows and Mac OS X) helps you write more easily. Both were developed at Bloorview Kids Rehab (Toronto). As you type, WordQ continuously presents a list of relevant, correctly spelled words using word prediction. When the desired word is shown, you can choose it with a single keystroke. High-quality text-to-speech feedback makes it easier to choose words and to identify mistakes. SpeakQ plugs into WordQ and adds simple speech recognition. You can then benefit from a combination of word prediction, speech output, and speech input to generate text when stuck on spelling and word forms, and to identify errors, proofread, and edit. Current research at Bloorview on the use of syntactic and semantic knowledge in word prediction will also be discussed.
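As a generic illustration of how such word prediction works (this is not WordQ's actual model), the following sketch ranks likely next words from bigram counts and filters them by the letters typed so far; the corpus is a toy example.

```python
# Toy bigram word predictor: suggest completions as the user types.
from collections import Counter, defaultdict

corpus = "i want to go home . i want to eat now . we want to go out".split()

bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def predict(previous_word, typed_prefix, k=3):
    """Rank words that follow `previous_word` and start with `typed_prefix`."""
    candidates = bigrams[previous_word]
    ranked = [w for w, _ in candidates.most_common() if w.startswith(typed_prefix)]
    return ranked[:k]

print(predict("want", "t"))   # -> ['to']
print(predict("to", ""))      # -> ['go', 'eat']
```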
|Nov. 7||Libby Barak||
Keyword based Text Categorization
The Text Categorization (TC) task is mostly approached via supervised or semi-supervised methods. These solutions require excessive manual labor to annotate text samples as training data, which is not always feasible. In this work we investigate keyword-based text categorization using as input only a taxonomy of the category names.
The TC method uses a novel combination of Textual Entailment-based categorization and Latent Semantic Analysis (LSA)-based categorization to create an initial set of unsupervised classified documents. This initial classified set is then used as input to a standard supervised categorization method. The proposed method shows promising initial results and reveals interesting phenomena as a basis for further research.
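As a rough illustration of the LSA component, the sketch below builds a latent space from a small unlabelled background corpus with scikit-learn, embeds both the documents and the bare category names in that space, and assigns each document to the nearest category. The corpus and categories are toy data, and this is a generic illustration rather than the authors' exact system.

```python
# LSA-based categorization against bare category names.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.pipeline import make_pipeline

# Unlabelled background text in which category words co-occur with topic words;
# LSA learns these associations and lets us match documents to category names.
background = [
    "sports news the team scored a goal and won the match",
    "sports fans cheered the striker and the championship team",
    "economy report the bank raised interest rates",
    "the economy slowed as the bank cut rates",
]
categories = ["sports", "economy"]
docs = [
    "the striker scored a late goal",
    "interest rates were raised by the bank",
]

lsa = make_pipeline(TfidfVectorizer(), TruncatedSVD(n_components=2))
lsa.fit(background)

doc_vecs = lsa.transform(docs)
cat_vecs = lsa.transform(categories)
for doc, row in zip(docs, cosine_similarity(doc_vecs, cat_vecs)):
    print(doc, "->", categories[row.argmax()])
```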
|Nov. 28||TBD||To be determined|
|Dec. 5||Hani Safadi||
Crosslingual implementation of linguistic taggers using parallel corpora
The talk addresses the problem of creating linguistic taggers for resource-poor languages using existing taggers in resource rich languages. Linguistic taggers are classifiers that map individual words or phrases from a sentence to a set of tags. Part of speech tagging and named entity extraction are two examples of linguistic tagging. Linguistic taggers are usually trained using supervised learning algorithms. This requires the existence of labeled training data, which is not available for many languages.
We describe an approach for assigning linguistic tags to sentences in a target (resource-poor) language by exploiting a linguistic tagger that has been configured in a source (resource-rich) language. The approach does not require that the input sentence be translated into the source language.
Instead, projection of linguistic tags is accomplished through the use of a parallel corpus, which is a collection of texts that are available in a source language and a target language. The correspondence between words of the source and target language allows us to project tags from source to target language words. The projected tags are further processed to compute the final tags of the target language words.
Using this approach, a system for part-of-speech (POS) tagging of French sentences has been implemented and evaluated, using an English POS tagger and an English/French parallel corpus.
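The projection step itself is simple once a word alignment is available, as the following sketch with toy data shows; a real system would obtain the source-language tags from a tagger and the alignment from a statistical word aligner run over the parallel corpus.

```python
# Projecting POS tags from a tagged source (English) sentence to an untagged
# target (French) sentence through word alignments.  All inputs are toy data.
english = ["the", "green", "house"]
english_tags = ["DET", "ADJ", "NOUN"]
french = ["la", "maison", "verte"]

alignment = [(0, 0), (1, 2), (2, 1)]   # (source_index, target_index) pairs

def project_tags(src_tags, tgt_len, alignment):
    tags = ["UNK"] * tgt_len
    for s, t in alignment:
        tags[t] = src_tags[s]          # copy the tag across the alignment link
    return tags

print(list(zip(french, project_tags(english_tags, len(french), alignment))))
# [('la', 'DET'), ('maison', 'NOUN'), ('verte', 'ADJ')]
```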
|Dec. 9||Dan Jurafsky||
Distinguished Lecture Series Colloquium
Note special time and place: 11:00-13:00, Bahen 1180
About the speaker
Dan Jurafsky works at the nexus of language and computation, focusing on statistical models of human and machine language processing. Recent topics include the induction and use of computational models of meaning, the automatic recognition and synthesis of speech, and the comprehension and production of dialogue. He is a recipient of a MacArthur Fellowship and an NSF CAREER award. His most recent book is the second edition of his widely used textbook with Jim Martin, Speech and Language Processing.
|Jan. 18||Frank Rudzicz||
Speech Recognition and Computational Linguistics: How to wreck a nice beach whenever a wand Aztecs
Speech and language research is big. Very big. You just won't believe how vastly, hugely, mind-bogglingly big it is. I mean, you may think it's a long way down the road to the chemist's, but that's just peanuts to speech and language research! Listen!
And so on...
|Jan. 30||Rada Mihalcea||
Linking Documents to Encyclopedic Knowledge: Using Wikipedia as a Source of Linguistic Evidence
Note special time and place: 10:30-12:00, Pratt 266
Wikipedia is an online encyclopedia that has grown to become one of the largest online repositories of encyclopedic knowledge, with millions of articles available for a large number of languages. In fact, Wikipedia editions are available for more than 200 languages, with the number of entries varying from a few pages to more than one million articles per language.
In this talk, I will describe the use of Wikipedia as a source of linguistic evidence for natural language processing tasks. In particular, I will show how this online encyclopedia can be used to achieve state-of-the-art results on two text processing tasks: automatic keyword extraction and word sense disambiguation. I will also show how the two methods can be combined into a system able to automatically enrich a text with links to encyclopedic knowledge. Given an input document, the system identifies the important concepts in the text and automatically links these concepts to the corresponding Wikipedia pages. Evaluations of the system showed that the automatic annotations are reliable and hardly distinguishable from manual annotations. Additionally, an evaluation of the system in an educational environment showed that the availability of encyclopedic knowledge within easy reach of a learner can improve both the quality of the knowledge acquired and the time needed to obtain such knowledge.
This is joint work with Andras Csomai.
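One widely used Wikipedia-derived signal for this kind of keyword extraction is ``keyphraseness'': how often a phrase occurs as link anchor text relative to how often it occurs at all. The sketch below illustrates that signal with invented counts; a real system would gather the counts from a Wikipedia dump, and this is not necessarily the exact ranking used in the system described in the talk.

```python
# Keyphraseness-based keyword ranking with invented Wikipedia counts.
counts = {
    # phrase: (times seen as a link anchor, times seen overall in Wikipedia)
    "machine translation": (9_000, 12_000),
    "natural language":    (4_000, 40_000),
    "the":                 (10, 5_000_000),
}

def keyphraseness(phrase):
    linked, total = counts[phrase]
    return linked / total

text_phrases = ["machine translation", "the", "natural language"]
ranked = sorted(text_phrases, key=keyphraseness, reverse=True)
print(ranked)   # ['machine translation', 'natural language', 'the']
```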
|Feb. 15||Graeme Hirst||
Real-word spelling correction with trigrams: A reconsideration of the Mays, Damerau, and Mercer model
The trigram-based noisy-channel model of real-word spelling-error correction that was presented by Mays, Damerau, and Mercer in 1991 has never been adequately evaluated or compared with other methods. We analyze the advantages and limitations of the method, and present a new evaluation that enables a meaningful comparison with the WordNet-based method of Hirst and Budanitsky. The trigram method is found to be superior, even on content words. We then show that optimizing over sentences gives better results than variants of the algorithm that optimize over fixed-length windows.
This talk represents collaborative work among Amber Wilcox-Hearn, Graeme Hirst, and Alexander Budanitsky.
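As a rough illustration of the noisy-channel idea behind the model, the following sketch scores candidate sentences by a trigram language-model probability times a channel probability (alpha for leaving the typed word unchanged, with the remainder spread over its real-word variations) and keeps the best-scoring candidate. The trigram probabilities, confusion sets, and alpha value are toy stand-ins, and candidate generation is simplified to fixed confusion sets rather than edit-distance variations, so this is not the authors' implementation.

```python
# Toy trigram noisy-channel real-word spelling correction.
import math

ALPHA = 0.99                      # prior probability that the typed word was intended
CONFUSABLES = {"their": ["there"], "there": ["their"]}
GOOD_TRIGRAMS = {("over", "there", "."), ("over", "their", "own")}

def trigram_logprob(sentence):
    """Stand-in trigram LM; a real system would use corpus-estimated trigram counts."""
    return sum(math.log(0.2 if tri in GOOD_TRIGRAMS else 1e-4)
               for tri in zip(sentence, sentence[1:], sentence[2:]))

def correct(typed):
    best, best_score = typed, None
    for i, word in enumerate(typed):
        variations = CONFUSABLES.get(word, [])
        candidates = [(word, ALPHA)] + [(v, (1 - ALPHA) / len(variations)) for v in variations]
        for intended, channel in candidates:
            sentence = typed[:i] + [intended] + typed[i + 1:]
            score = trigram_logprob(sentence) + math.log(channel)
            if best_score is None or score > best_score:
                best, best_score = sentence, score
    return best

print(correct("over their .".split()))   # -> ['over', 'there', '.']
```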
|Feb. 29||Cancelled||Graduate Visit Day|
|Mar. 14||Afra Alishahi and Afsaneh Fazly||
A Probabilistic Incremental Model of Word Learning in the Presence of Referential Uncertainty
We present a probabilistic incremental model of word learning in children. The model acquires the meaning of words from exposure to word usages in sentences, paired with appropriate semantic representations, in the presence of referential uncertainty. A distinct property of our model is that it continually revises its learned knowledge of a word's meaning, but over time converges on the most likely meaning of the word. Another key feature is that the model bootstraps its own partial knowledge of word--meaning associations to help more quickly learn the meanings of novel words. Results of simulations on naturalistic child-directed data show that our model exhibits behaviours similar to those observed in the early lexical acquisition of children, such as vocabulary spurt and fast mapping.
|Mar. 28||Chris Parisien||
An Incremental Bayesian Model for Learning Syntactic Categories
I present a method for the unsupervised learning of syntactic categories from text. The method uses an incremental Bayesian clustering algorithm to find groups of words that occur within similar syntactic contexts. The model draws information from the distributional cues of words within an utterance, while explicitly bootstrapping its development on its own partial knowledge of syntactic categories. Using a corpus of child-directed speech, we demonstrate the benefit of a syntactic bootstrap for an incremental categorization model. The model is robust to the noise in real language data, manages lexical ambiguity, and shows learning behaviours similar to what we observe in children.
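The incremental clustering step can be illustrated roughly as follows: each token, represented by its context features, is scored against every existing category using a prior proportional to category size and a smoothed likelihood over the category's feature counts, with some probability mass reserved for opening a new category. The features, smoothing, and parameter values below are toy choices for illustration; this is not the model presented in the talk.

```python
# Simplified incremental Bayesian clustering of word tokens into categories.
from collections import Counter

NEW_WEIGHT = 1.0          # prior weight reserved for creating a new category
NEW_FEATURE_PROB = 0.4    # per-feature likelihood assumed for an empty category

class Category:
    def __init__(self):
        self.n = 0
        self.counts = Counter()
        self.members = []

    def likelihood(self, features):
        p = 1.0
        for f in features:
            p *= (self.counts[f] + 1) / (self.n + 2)   # add-one smoothed feature probability
        return p

def assign(categories, word, features, total):
    scores = [(c.n / (total + NEW_WEIGHT)) * c.likelihood(features) for c in categories]
    new_score = (NEW_WEIGHT / (total + NEW_WEIGHT)) * NEW_FEATURE_PROB ** len(features)
    if not scores or new_score > max(scores):
        categories.append(Category())            # open a new category
        best = categories[-1]
    else:
        best = categories[scores.index(max(scores))]
    best.n += 1
    best.counts.update(features)
    best.members.append(word)

# each token is (word, [previous-word feature, next-word feature])
tokens = [("dog", ["prev:the", "next:runs"]),
          ("cat", ["prev:the", "next:runs"]),
          ("the", ["prev:<s>", "next:dog"]),
          ("the", ["prev:<s>", "next:cat"]),
          ("cat", ["prev:the", "next:sleeps"])]

categories = []
for total, (word, features) in enumerate(tokens):
    assign(categories, word, features, total)
for i, c in enumerate(categories):
    print("category", i, ":", c.members)
# category 0 : ['dog', 'cat', 'cat']   category 1 : ['the', 'the']
```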
|Apr. 11||Tim Fowler||
Navigating the parsing landscape
We will introduce context free grammars (CFGs) and combinatory categorial grammars (CCGs) with a focus on how these formalisms deal with semantics. The known differences between the formalisms will be discussed and the Lambek calculus will be introduced as an ideal comparison point between the two. To do this, we will need to consider the formal language class of natural language. A recent polynomial time parsing result for the Lambek calculus will be introduced and we will discuss possible future research opened up by this result.
|Sept. 14||CL Group||Fall 2007 Welcoming Meeting|
|Sept. 28||Gerald Penn||
The Quantitative Study of Writing Systems
If you understood all of the world's languages, you would still not be able to read many of the texts that you find on the world wide web, because they are written in non-Roman scripts -- often ones that have been arbitrarily encoded for electronic transmission in the absence of an accepted standard. This very modern nuisance reflects a dilemma as ancient as writing itself: the association between a language as it is spoken and its written form has a sort of internal logic to it that we can comprehend, but the conventions are different in every individual case --- even among languages that use the same script, or between scripts used by the same language. This conventional association between language and script, called a writing system, is indeed reminiscent of the Saussurean conception of language itself, a conventional association of meaning and sound, upon which modern linguistic theory is based. Despite linguists' reliance upon writing to present and preserve linguistic data, however, writing systems were a largely forgotten corner of linguistics until the 1960s, when Gelb presented their first classification.
This talk will describe recent work that aims to place the study of writing systems upon a sound computational and statistical foundation. While archaeological decipherment may eternally remain the holy grail of this area of research, it also has applications to speech synthesis, machine translation, and multilingual document retrieval.
|Oct. 12||Paul Cook||
Pulling their Weight: Exploiting Syntactic Forms for the Automatic Identification of Idiomatic Expressions in Context
Much work on idioms has focused on type identification, i.e., determining whether a sequence of words can form an idiomatic expression. Since an idiom type often has a literal interpretation as well, token classification of potential idioms in context is critical for NLP. We explore the use of informative prior knowledge about the overall syntactic behaviour of a potentially-idiomatic expression (type-based knowledge) to determine whether an instance of the expression is used idiomatically or literally (token-based knowledge). We develop unsupervised methods for the task, and show that their performance is comparable to that of standard supervised techniques.
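The type-based intuition can be sketched simply: an idiomatic expression tends to occur in a small set of canonical syntactic forms, so a token that matches the expression's canonical form (estimated from corpus counts) is labelled idiomatic, while one that deviates is labelled literal. The counts below are invented, and the single-canonical-form rule is a simplification of the unsupervised methods described above.

```python
# Canonical-form sketch for idiom token classification.
from collections import Counter

# toy corpus counts of surface forms of "pull one's/the weight(s)"
form_counts = Counter({
    ("pull", "POSS", "weight", "sg"): 50,     # pulled their weight
    ("pull", "DET", "weight", "sg"): 6,       # pulled the weight
    ("pull", "DET", "weight", "pl"): 4,       # pulled the weights
})

canonical = form_counts.most_common(1)[0][0]   # type-based knowledge about the expression

def classify(token_form):
    return "idiomatic" if token_form == canonical else "literal"

print(classify(("pull", "POSS", "weight", "sg")))   # idiomatic ("pulled their weight")
print(classify(("pull", "DET", "weight", "pl")))    # literal ("pulled the weights")
```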
|Nov. 9||Graeme Hirst||
Views of Text-Meaning in Computational Linguistics
Three views of text-meaning compete in the philosophy of language: objective, subjective, and authorial -- "in" the text, or "in" the reader, or "in" the writer. Computational linguistics has ignored the competition and implicitly embraced all three, and rightly so; but different views have predominated at different times and in different applications. Contemporary applications mostly take the crudest view: meaning is objectively "in" a text. The more-sophisticated applications now on the horizon, however, demand the other two views: as the computer takes on the user's purpose, it must also take on the user's subjective views; but sometimes, the user's purpose is to determine the author's intent. Accomplishing this requires, among other things, an ability to determine what could have been said but wasn't, and hence a sensitivity to linguistic nuance. It is therefore necessary to develop computational mechanisms for this sensitivity.
|Nov. 23||Diana Raffman||
Psychological Hysteresis and the Nontransitivity of Insignificant Differences
Vague words in natural language cause semantic and logical problems in a variety of disciplines. An especially persistent problem has to do with the nontransitivity of insignificant differences. For example, if eating one candy won't make me fat, then eating two won't; but if eating two won't, then eating three won't; and so on. It seems to follow that eating a thousand pieces of candy won't make me fat. This paradoxical result shows that the word 'fat' is vague. Similarly, if Hillary Clinton is a person, then she was a person one second ago; and if she was a person one second ago, then she was a person two seconds ago; etc. It seems to follow that the conceptus from which Hillary Clinton developed was also a person. The word 'person' is vague.
Clearly there is something wrong with this paradoxical form of reasoning, but a satisfactory diagnosis has not been found. In this talk I will propose a diagnosis that appeals to the hysteretical nature of our judgments involving vague words. To that end I will present preliminary results of a psychological study of our use of vague words.