Adversarial Evaluation for Models of Natural Language

by Noah Smith

Presented at the Front Range NLP Meetup by L. Amber Wilcox-O'Hearn

November 6th, 2013

(These slides were made with hieroglyph.)


  • Intrinsic
  • Extrinsic
  • Perplexity
  • Adversarial

Computational Linguistics vs. Natural Language Processing?

Spurious distinction‽

Practical difference?

In both cases, a central goal is to make and use better models.

Evaluation Methods

So, supposing you have designed an NLP model. How do you evaluate it?

In this paper, four evaluation methods are discussed: intrinsic, extrinsic, perplexity-based, and adversarial.

To illustrate these methods, let's suppose that we want to model POS tagging with an HMM.

Intrinsic Evaluation

In intrinsic evaluation, the model's output is compared directly against gold-standard human annotations.

For POS tagging: run the tagger over annotated test sentences and report per-token accuracy against the gold tags.
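As a sketch of this kind of evaluation (the tag sequences below are invented for illustration), tagging accuracy reduces to a per-token comparison against the gold standard:

```python
# Intrinsic evaluation sketch: per-token tagging accuracy against a
# gold standard. The tag sequences here are illustrative, not real data.

def tagging_accuracy(predicted, gold):
    """Fraction of tokens whose predicted tag matches the gold tag."""
    assert len(predicted) == len(gold)
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)

gold      = ["DT", "NN", "VBZ", "DT", "JJ", "NN"]
predicted = ["DT", "NN", "VBZ", "DT", "NN", "NN"]

print(tagging_accuracy(predicted, gold))  # 5 of 6 tokens correct
```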

Weakness of Intrinsic Evaluation

In particular, this evaluation technique can never help us design a better set of tags, let alone a theory of word characterisation that improves on the theory of POS.

Extrinsic Evaluation

In extrinsic evaluation, the model is judged by how much it improves a downstream task that uses it as a component.

Sounds reasonable: Insofar as a model improves the result of our NLP task, it will be desirable for the task to incorporate it.

However, what can we conclude if it doesn't improve the task?
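A toy sketch of the setup (the "recogniser" and its word error rates are invented stand-ins, not real measurements): the tagger is scored only through the downstream task, so an unchanged downstream score looks like no improvement at all:

```python
# Extrinsic evaluation sketch: judge a tagger only by the score of a
# downstream task that uses it. The recogniser and its word error
# rates (WER, lower is better) are invented stand-ins.

DOWNSTREAM_WER = {"baseline_tagger": 0.18, "new_tagger": 0.18}

def extrinsic_score(tagger):
    """Hypothetical end-to-end word error rate of the recogniser."""
    return DOWNSTREAM_WER[tagger]

delta = extrinsic_score("baseline_tagger") - extrinsic_score("new_tagger")
print(delta)  # 0.0 -- yet the new tagger might still be the better model
```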

Implicitly Evaluates Architecture

Suppose we have a POS tagger that is being used as a component in a speech recogniser. If a change to our tagger does not improve recognition results, we still can't be sure the tagger isn't better! The surrounding architecture matters:

  • Does it assume some structure?
  • Does it take advantage of all available information the tagger could provide?

This kind of evaluation cannot separate the model from the way it is used: the architecture limits the scope and structure of the model's contribution, and so can miss improvements. That is, the model could be a better characterisation of the language and still not improve the results of the task.


Perplexity

Perplexity sidesteps both limitations above: it measures how well a probabilistic model predicts held-out text, requiring neither gold annotations nor a downstream task. Lower perplexity means the model assigns higher probability to real language.

Of course, improved perplexity cannot guarantee improved results on an extrinsic task, and in practice it often doesn't.
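A minimal sketch of the perplexity computation, assuming a toy unigram model with add-one smoothing (the counts and the held-out sentence are invented for illustration):

```python
import math

# Perplexity sketch: evaluate a toy unigram model on held-out text.
# Training counts and the test sentence are illustrative assumptions.

train_counts = {"the": 4, "cat": 2, "sat": 2, "mat": 2}
total = sum(train_counts.values())

def unigram_prob(word):
    # Add-one smoothing over a closed toy vocabulary.
    return (train_counts.get(word, 0) + 1) / (total + len(train_counts))

def perplexity(words):
    log_prob = sum(math.log2(unigram_prob(w)) for w in words)
    return 2 ** (-log_prob / len(words))

held_out = ["the", "cat", "sat"]
print(perplexity(held_out))  # lower is better
```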

Should a Model be Interpretable by Linguists?

Intrinsic evaluation doesn't work for unsupervised methods.

"it is not at all clear [unsupervised taggers] are actually learning taggers for part-of-speech" -- Garrette and Baldridge 2013

The contentious point is that unsupervised models aren't replicating linguistic expertise.

Similarly, in Chang et al. 2009, the entire paper is based on the problem that the best topic models in terms of predictive power are not the most intuitive to humans.

Subsymbolic phenomenon, symbolic understanders.

The Adversarial Model

The evaluation cycle has two roles that evolve with respect to each other: a Zellig, who transforms real text into synthesised text, and a Claude, who tries to tell the two apart.

Strong Players

A good Claude is one that can reliably distinguish real text from synthesised text.

What makes a good Zellig?

  • Even the best Claude cannot tell two well-formed texts apart.
  • As state-of-the-art Claudes improve, it takes more sophisticated transformations to find their weakness.
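A deliberately trivial sketch of the two players (the corpus, the swap-based Zellig, and the bigram-counting Claude are all invented stand-ins for real models):

```python
# Adversarial evaluation sketch: a "Zellig" corrupts real text, and a
# "Claude" tries to pick the real text out of a (real, corrupted) pair.

REAL = "the cat sat on the mat"

# A toy bigram "language model": the bigrams of the real corpus.
BIGRAMS = {("the", "cat"), ("cat", "sat"), ("sat", "on"),
           ("on", "the"), ("the", "mat")}

def zellig(text, i):
    """Corrupt the text by swapping the words at positions i and i+1."""
    words = text.split()
    words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)

def fluency(text):
    """Number of known bigrams in the text."""
    words = text.split()
    return sum((a, b) in BIGRAMS for a, b in zip(words, words[1:]))

def claude(a, b):
    """Guess which candidate is real: the one the bigram model prefers."""
    return a if fluency(a) >= fluency(b) else b

# This Claude beats this naive Zellig on every possible swap; a stronger
# Zellig would need transformations the bigram model cannot detect.
for i in range(5):
    assert claude(REAL, zellig(REAL, i)) == REAL
```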

Negative Evidence

Learning linguistic structure through this cycle is a way to incorporate negative evidence, a powerful tool for grammar induction.

Often an induced grammar is over-general; the more well-formed structures it can produce or recognise, the more ill-formed ones come along for the ride.

By evaluating a model on its ability to detect ill-formed text, we incorporate negative evidence into the iterations.

Role of the Linguist

If linguists are not annotating data or evaluating intuitiveness, how can we use their expertise?


Malaprop

Malaprop involves transformations of natural text that result in some words being replaced by real-word near neighbours.
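A sketch of the flavour of transformation (the vocabulary and word are invented; the real Malaprop code is more sophisticated about choosing replacements): finding real-word neighbours one substitution away.

```python
# Malaprop-style neighbour sketch: real-word candidates one character
# substitution away from a word. The toy vocabulary is an assumption.

VOCAB = {"cat", "cab", "car", "sat", "sit", "mat", "map", "the", "on"}

def edit_distance_1(word):
    """All vocabulary words exactly one substitution away from `word`."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    neighbours = set()
    for i in range(len(word)):
        for c in letters:
            candidate = word[:i] + c + word[i + 1:]
            if candidate != word and candidate in VOCAB:
                neighbours.add(candidate)
    return neighbours

print(edit_distance_1("cat"))  # {'cab', 'car', 'mat', 'sat'}
```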

Malaprop task generation

Two tasks:



Adversarial Task

The baseline using this Zellig is too easy.

Evaluation code, described in the README, includes some basic error analysis.