Multimodal Neural Language Models
Ryan Kiros, Ruslan Salakhutdinov, Richard Zemel
Supplementary material: [pdf]
University of Toronto
A previous version appeared at the NIPS 2013 Deep Learning Workshop: [pdf]
IAPR TC-12 data: download
IAPR TC-12 models: download
All results below use a randomly chosen subset of test images/descriptions. The train/test splits for IAPR TC-12 can be obtained here.
Image to text retrieval
Returns the training description with the lowest perplexity when conditioned on the given test image. A shortlist of the 15 nearest training images (in feature space) is used.
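A minimal sketch of this retrieval procedure, assuming a hypothetical `perplexity(description, image_feat)` scoring function and precomputed image feature vectors (these names are illustrative, not the released code):

```python
import numpy as np

def retrieve_description(test_feat, train_feats, train_descs, perplexity, shortlist=15):
    """Return the training description with the lowest perplexity
    when the model is conditioned on the test image.

    test_feat:   feature vector of the test image.
    train_feats: (N, D) array of training image features.
    train_descs: list of N training descriptions.
    perplexity:  callable(description, image_feat) -> float (assumed).
    """
    # Shortlist the nearest training images in feature space (Euclidean distance).
    dists = np.linalg.norm(train_feats - test_feat, axis=1)
    shortlist_idx = np.argsort(dists)[:shortlist]
    # Score each shortlisted description by its conditional perplexity.
    scores = [perplexity(train_descs[i], test_feat) for i in shortlist_idx]
    best = shortlist_idx[int(np.argmin(scores))]
    return train_descs[best]
```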
Text to image retrieval
Returns the top 4 training images which result in the lowest perplexity when the model is conditioned on the given test description.
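The same perplexity-based ranking can be sketched for this direction, again assuming a hypothetical `perplexity(description, image_feat)` scorer (illustrative names only):

```python
import numpy as np

def retrieve_images(description, train_feats, perplexity, k=4):
    """Return indices of the k training images whose features give the
    description the lowest perplexity under the model (assumed scorer)."""
    scores = np.array([perplexity(description, f) for f in train_feats])
    return [int(i) for i in np.argsort(scores)[:k]]
```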
For IAPR TC-12, we initialize the model with the first 5 words of the best retrieved training description.
For the Attributes Discovery dataset, the model is initialized with "this product contains a".
The model then generates words one at a time, conditioned on the image.
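The generation loop above can be sketched as greedy word-by-word decoding, assuming a hypothetical `next_word_dist(context_words, image_feat)` that returns a word-to-probability mapping (a sketch of the procedure, not the paper's implementation):

```python
def generate_caption(image_feat, next_word_dist, seed, max_len=12, eos="<eos>"):
    """Generate words one at a time conditioned on the image, starting
    from a seed word list, until an end token or a length cap is hit."""
    words = list(seed)
    while len(words) < max_len:
        dist = next_word_dist(words, image_feat)  # assumed model interface
        word = max(dist, key=dist.get)            # greedy choice
        if word == eos:
            break
        words.append(word)
    return " ".join(words)
```

Sampling from `dist` instead of taking the argmax would give varied descriptions for the same image.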
A t-SNE visualization of the IAPR TC-12 tags using multimodal word embeddings: [picture]
Multimodal Word Embeddings
These embeddings were learned on roughly 400,000 image-caption pairs (about 5.5 million words) from the SBU Captioned Photo dataset. The vocabulary size is 57,070.
Also included are embeddings learned without images and a script to compare the nearest neighbours of words under each set of embeddings. [download]
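A nearest-neighbour comparison of this kind can be sketched with cosine similarity over an embedding matrix (hypothetical variable names; `emb` is a (V, D) array and `vocab` a list of V words, not the released script):

```python
import numpy as np

def nearest_neighbours(word, vocab, emb, k=5):
    """Return the k words whose embeddings have the highest cosine
    similarity to the query word's embedding."""
    idx = vocab.index(word)
    # Normalize rows so dot products equal cosine similarities.
    normed = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = normed @ normed[idx]
    order = np.argsort(-sims)
    return [vocab[i] for i in order if i != idx][:k]
```

Running this on both the multimodal and text-only embeddings for the same query word shows how conditioning on images shifts a word's neighbourhood.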