Adversarial Evaluation for Models of Natural Language
by Noah Smith
Presented at the Front Range NLP Meetup by L. Amber Wilcox-O'Hearn
November 6th, 2013
(These slides were made with hieroglyph.)
- Intrinsic
- Extrinsic
- Perplexity
- Adversarial
Spurious distinction‽
Practical difference?
In both cases, intrinsic and extrinsic, a central goal is to make and use better models.
So, suppose you have designed an NLP model. How do you evaluate it?
The paper discusses the methods listed above.
To illustrate these methods, let's suppose that we want to model POS tagging with an HMM.
In intrinsic evaluation, we compare the model's output directly against gold-standard annotations.
For POS tagging: measure the tagger's per-token accuracy against a human-annotated corpus.
In particular, this evaluation technique can never help us design a better set of tags, let alone a theory of word characterisation that improves on the theory of POS.
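For concreteness, a minimal sketch of that intrinsic evaluation: per-token accuracy of predicted tags against a gold standard. The tag sequences and the tag_accuracy helper are illustrative only, not from the paper::

    def tag_accuracy(predicted, gold):
        """Per-token accuracy of predicted tags against gold-standard tags."""
        matches = sum(1 for p, g in zip(predicted, gold) if p == g)
        return matches / len(gold)

    # Toy example: an HMM tagger's output compared to a human annotation.
    gold      = ['DT', 'NN', 'VBZ', 'DT', 'NN']
    predicted = ['DT', 'NN', 'VBZ', 'DT', 'JJ']
    print(tag_accuracy(predicted, gold))  # 0.8
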
In extrinsic evaluation, we plug the model into a larger NLP task and measure whether the task's results improve.
Sounds reasonable: Insofar as a model improves the result of our NLP task, it will be desirable for the task to incorporate it.
However, what can we conclude if it doesn't improve the task?
Suppose we have a POS tagger that is being used as a component in a speech recogniser. If a change to our tagger does not improve our results, we still can't be sure the tagger isn't better!
- Does it assume some structure?
- Does it take advantage of all available information the tagger could provide?
This kind of evaluation is tied to the particular way the model is used: the task constrains the scope and structure of the model, and improvements can be missed. That is, the model could be a better characterisation of the language and still not improve the results of the task.
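A toy sketch of that failure mode. Everything here is hypothetical: a coarse and a fine-grained tagger, and a downstream task that only cares whether a word is a noun, so the finer tags cannot help::

    def extrinsic_eval(tagger, downstream_task, data):
        """Score a tagger only by the downstream task's own metric."""
        return downstream_task(tagger, data)

    # Hypothetical downstream task: it only checks whether each word was
    # tagged as some kind of noun, so finer-grained tags cannot help it.
    def noun_spotting_task(tagger, sentences):
        correct = total = 0
        for words, gold_is_noun in sentences:
            for tag, is_noun in zip(tagger(words), gold_is_noun):
                correct += tag.startswith('NN') == is_noun
                total += 1
        return correct / total

    def coarse_tagger(words):
        return ['NN' if w[0].isupper() else 'VB' for w in words]

    def fine_tagger(words):
        return ['NNP' if w[0].isupper() else 'VBD' for w in words]

    data = [(['Alice', 'slept'], [True, False]), (['Dogs', 'bark'], [True, False])]
    print(extrinsic_eval(coarse_tagger, noun_spotting_task, data))  # 1.0
    print(extrinsic_eval(fine_tagger, noun_spotting_task, data))    # also 1.0
    # Both taggers score the same: the task cannot reward the finer distinctions.
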
Limitations we would like to avoid:
Of course, improved perplexity cannot guarantee improved results on an extrinsic task, and in practice it often doesn't.
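For reference, perplexity is the exponential of the average negative log-probability the model assigns to held-out tokens (lower is better). A minimal sketch, with an add-one-smoothed unigram model standing in for a real language model::

    import math
    from collections import Counter

    def perplexity(logprobs):
        """Exp of the average negative log-probability per held-out token."""
        return math.exp(-sum(logprobs) / len(logprobs))

    # Stand-in model: add-one-smoothed unigram estimates from a tiny corpus.
    train = 'the cat sat on the mat'.split()
    counts = Counter(train)
    vocab_size = len(counts) + 1   # +1 leaves probability mass for unseen words

    def logprob(word):
        return math.log((counts[word] + 1) / (len(train) + vocab_size))

    held_out = 'the dog sat'.split()
    print(perplexity([logprob(w) for w in held_out]))
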
Intrinsic evaluation doesn't work for unsupervised methods.
"it is not at all clear [unsupervised taggers] are actually learning taggers for part-of-speech" -- Garrette and Baldridge 2013
The contentious point is that unsupervised models aren't replicating linguistic expertise.
Similarly, Chang et al. 2009 is entirely motivated by the observation that the topic models with the best predictive power are not the ones most intuitive to humans.
Subsymbolic phenomenon, symbolic understanders.
The evaluation cycle has two roles that evolve with respect to each other: Zellig, an adversary who transforms real text into synthesised fakes, and Claude, a detector who tries to tell them apart.
A good Claude is one that can reliably distinguish real text from synthesised text.
What makes a good Zellig?
- Even the best Claude cannot tell two well-formed texts apart.
- As state-of-the-art Claudes improve, it takes more sophisticated transformations to find their weakness.
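A minimal sketch of one round of the game, with toy stand-ins for both roles: a word-shuffling Zellig and a Claude that counts bigrams it has seen in training data. These are illustrative only, not the constructions from the paper::

    import random

    # Toy Claude: scores text by how many of its bigrams were seen in training.
    train = 'the cat sat on the mat . the dog sat on the rug .'.split()
    seen_bigrams = set(zip(train, train[1:]))

    def claude_score(words):
        """Higher score means the text looks more like real text to Claude."""
        return sum(1 for bigram in zip(words, words[1:]) if bigram in seen_bigrams)

    # Toy Zellig: damages real text by shuffling its word order.
    def zellig(words, rng):
        damaged = list(words)
        rng.shuffle(damaged)
        return damaged

    def evaluate_claude(held_out, rng):
        """Fraction of pairs where Claude ranks the real text above the fake."""
        wins = sum(claude_score(s) > claude_score(zellig(s, rng)) for s in held_out)
        return wins / len(held_out)

    rng = random.Random(0)
    held_out = [s.split() for s in ['the cat sat on the rug .',
                                    'the dog sat on the mat .']]
    # Against such a weak Zellig, Claude should win essentially every round.
    print(evaluate_claude(held_out, rng))
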
Learning linguistic structure through this cycle is a way to incorporate negative evidence, a powerful tool for grammar induction.
Often an induced grammar is over-general; the more well-formed structures it can produce or recognise, the more ill-formed ones come along for the ride.
By evaluating a model on its ability to detect ill-formed text, we incorporate negative evidence into the iterations.
If linguists are not annotating data or evaluating intuitiveness, how can we use their expertise?
Malaprop involves transformations of natural text that result in some words being replaced by real-word near neighbours.
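A minimal sketch of that kind of transformation, assuming "near neighbours" means real words within edit distance one. This is illustrative only, not the actual Malaprop code; the vocabulary and corruption rate are made up::

    import random

    def edits1(word, alphabet='abcdefghijklmnopqrstuvwxyz'):
        """All strings at edit distance one from word (deletes, substitutions, inserts)."""
        splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes = [a + b[1:] for a, b in splits if b]
        subs = [a + c + b[1:] for a, b in splits if b for c in alphabet if c != b[0]]
        inserts = [a + c + b for a, b in splits for c in alphabet]
        return set(deletes + subs + inserts)

    def corrupt(words, vocabulary, rate, rng):
        """Replace some words with real-word near neighbours, where one exists."""
        out = []
        for w in words:
            neighbours = sorted(edits1(w) & vocabulary)
            if neighbours and rng.random() < rate:
                out.append(rng.choice(neighbours))
            else:
                out.append(w)
        return out

    rng = random.Random(0)
    vocab = {'the', 'cat', 'cot', 'sat', 'sit', 'on', 'an', 'mat', 'mad'}
    print(corrupt('the cat sat on the mat'.split(), vocab, rate=0.5, rng=rng))
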
Two tasks:
Baselines:
The baseline using this Zellig is too easy.
The evaluation code includes some basic error analysis; see the README.