Multimodal Neural Language Models

Ryan Kiros, Ruslan Salakhutdinov, Richard Zemel
University of Toronto


Supplementary material: [pdf]
A previous version appeared at the NIPS 2013 Deep Learning Workshop: [pdf]



IAPR TC-12 data: download
IAPR TC-12 models: download


All results below use a randomly chosen subset of test images/descriptions. The train/test splits for IAPR TC-12 can be obtained here.

Image to text retrieval

Returns the training description with the lowest perplexity when conditioned on the given test image. A shortlist of the 15 nearest training images (in feature space) is used.
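A minimal sketch of this procedure. Here `train_feats`, `train_descs`, and `perplexity` are hypothetical stand-ins: in the actual system the image features come from a pretrained convnet and perplexity is computed by the multimodal language model.

```python
def nearest_images(query_feat, train_feats, k=15):
    """Indices of the k training images closest to query_feat in feature space
    (squared Euclidean distance)."""
    dists = [(sum((q - t) ** 2 for q, t in zip(query_feat, feat)), i)
             for i, feat in enumerate(train_feats)]
    return [i for _, i in sorted(dists)[:k]]

def retrieve_description(query_feat, train_feats, train_descs, perplexity, k=15):
    """Among the descriptions of the k nearest training images, return the one
    the model assigns the lowest perplexity conditioned on the query image.
    `perplexity(desc, image_feat)` is a hypothetical model-scoring callable."""
    shortlist = nearest_images(query_feat, train_feats, k)
    return min((train_descs[i] for i in shortlist),
               key=lambda d: perplexity(d, query_feat))
```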

Text to image retrieval

Returns the top 4 training images that yield the lowest perplexity when conditioned on the given test description.
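This direction needs no shortlist: every training image is scored. A sketch, again with `perplexity` as a hypothetical stand-in for the multimodal language model:

```python
def retrieve_images(description, train_feats, perplexity, top_k=4):
    """Score each training image by the perplexity of `description`
    conditioned on that image; return the indices of the top_k lowest scores.
    `perplexity(desc, image_feat)` is a hypothetical model-scoring callable."""
    order = sorted(range(len(train_feats)),
                   key=lambda i: perplexity(description, train_feats[i]))
    return order[:top_k]
```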

Text generation

For IAPR TC-12, we initialize the model to the first 5 words of the best retrieved training description. On the Attributes Discovery dataset, the model is initialized to "this product contains a".
The model then generates words one at a time, conditioned on the image.
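The generation loop above can be sketched as greedy decoding. `next_word` is a hypothetical stand-in for the multimodal language model's most-likely-next-word prediction:

```python
def generate_caption(seed_words, image_feat, next_word, max_len=20, eos="<eos>"):
    """Greedy generation sketch: start from the seed words and repeatedly
    append the model's predicted next word conditioned on the image,
    stopping at an end-of-sentence token or the length limit.
    `next_word(words, image_feat)` is a hypothetical model callable."""
    words = list(seed_words)
    while len(words) < max_len:
        w = next_word(words, image_feat)
        if w == eos:
            break
        words.append(w)
    return " ".join(words)
```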

Tag visualization

A t-SNE visualization of the IAPR TC-12 tags using multimodal word embeddings: [picture]

Multimodal Word Embeddings

These embeddings were learned on roughly 400,000 image-text captions (about 5.5 million words) from the SBU Captioned Photo dataset. The vocabulary size is 57,070.
Also included are embeddings learned without images, along with a script to compare nearest neighbours of words between the two embedding sets. [download]
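A minimal cosine-similarity neighbour lookup of the kind such a comparison script performs. The `emb` dictionary (word → vector) is a hypothetical stand-in for either set of released embeddings:

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def nearest_neighbours(word, emb, k=5):
    """The k words whose embeddings are most similar to `word`'s,
    ranked by cosine similarity (the query word itself is excluded)."""
    others = [w for w in emb if w != word]
    return sorted(others, key=lambda w: -cosine(emb[word], emb[w]))[:k]
```

Running this lookup against both the with-image and without-image embeddings for the same query word shows how conditioning on images shifts a word's neighbourhood.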