Teaching Machines to Describe Images with Natural Language Feedback

Abstract

Robots will eventually be part of every household. It is thus critical to enable algorithms to learn from and be guided by non-expert users. In this paper, we bring a human in the loop, and enable a human teacher to give feedback to a learning agent in the form of natural language. A descriptive sentence can provide a stronger learning signal than a numeric reward in that it can easily point to where the mistakes are and how to correct them. We focus on the problem of image captioning in which the quality of the output can easily be judged by non-experts. We propose a phrase-based captioning model trained with policy gradients, and design a feedback network that provides reward to the learner by conditioning on the human-provided feedback. We show that by exploiting descriptive feedback our model learns to perform better than when given independently written human captions.


Paper

  • Teaching Machines to Describe Images with Natural Language Feedback
  • Huan Ling, Sanja Fidler
  • arXiv preprint, May 2017


Phrase-based Image Captioning

Our captioning model, which forms the base of our approach, uses a hierarchical Recurrent Neural Network.

  • See Section 3.1 for details
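As a rough illustration of what "hierarchical" can mean for phrase-based captioning, the sketch below assumes a two-level decoder: a phrase-level RNN takes one step per phrase, and a word-level RNN, initialized from the phrase state, emits words until an end-of-phrase token. This is only a control-flow sketch. The names `decode`, `rnn_step`, `VOCAB` and `<eop>`, the toy vocabulary, the plain tanh cell, and the random untrained weights are all assumptions made for illustration; the actual architecture is the one described in Section 3.1 of the paper.

```python
import math
import random

VOCAB = ["a", "man", "is", "standing", "on", "sidewalk", ".", "<eop>"]  # toy vocabulary (assumed)
HIDDEN = 4  # toy hidden size (assumed)

def rand_matrix(rng, rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def rnn_step(W, U, x, h):
    # Plain tanh RNN cell: h' = tanh(W x + U h)
    return [math.tanh(a + b) for a, b in zip(matvec(W, x), matvec(U, h))]

def decode(image_feat, rng, max_phrases=5, max_words=6):
    # Random, untrained weights: the output is meaningless, only the loop structure matters.
    Wp, Up = rand_matrix(rng, HIDDEN, len(image_feat)), rand_matrix(rng, HIDDEN, HIDDEN)
    Ww, Uw = rand_matrix(rng, HIDDEN, len(VOCAB)), rand_matrix(rng, HIDDEN, HIDDEN)
    V = rand_matrix(rng, len(VOCAB), HIDDEN)
    h_phrase = [0.0] * HIDDEN
    caption = []
    for _ in range(max_phrases):
        h_phrase = rnn_step(Wp, Up, image_feat, h_phrase)  # one phrase-level step
        h_word, prev, phrase = h_phrase, [0.0] * len(VOCAB), []
        for _ in range(max_words):
            h_word = rnn_step(Ww, Uw, prev, h_word)        # one word-level step
            logits = matvec(V, h_word)
            idx = max(range(len(VOCAB)), key=logits.__getitem__)  # greedy word choice
            if VOCAB[idx] == "<eop>":                      # end of the current phrase
                break
            phrase.append(VOCAB[idx])
            prev = [1.0 if i == idx else 0.0 for i in range(len(VOCAB))]
        caption.append(phrase)
        if phrase and phrase[-1] == ".":                   # a final period ends the caption
            break
    return caption

caption = decode([0.5, -0.2, 0.1], random.Random(0))
# Print in the same ( phrase ) ( phrase ) layout used on this page.
print(" ".join("( " + " ".join(p) + " )" for p in caption))
```

Grouping the output by phrase is what makes it possible to attach feedback, and later reward, to individual phrases rather than to the whole sentence.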

Crowd-sourcing Human Feedback

We create a web interface for collecting feedback at scale via Amazon Mechanical Turk (AMT). We collect feedback in the form of natural language, together with the type of mistake, the mistaken words, and the corrected caption.


  • See examples of collected feedback below

Policy Gradient Optimization using Natural Language Feedback

We directly optimize the desired image captioning metrics together with human feedback using policy gradients.

  • See Sections 3.2 to 3.4 for details
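To make the idea concrete, here is a minimal REINFORCE sketch on a toy problem: the "policy" is a softmax over two fixed candidate captions, and the reward adds a caption-metric term to a feedback term. Everything in it is an assumption for illustration. In particular, `reward` is a hypothetical stand-in (phrase overlap with a corrected caption, minus a penalty for phrases an annotator flagged as wrong), not the feedback network or the metrics used in the paper, and `reinforce`, `CORRECTED` and `FLAGGED_WRONG` are made-up names.

```python
import math
import random

# Toy data taken from the horse example on this page.
CORRECTED = ("a horse", "is standing", "in a fence", "in a field .")
FLAGGED_WRONG = {"in a barn"}  # phrase the annotator's feedback marks as a mistake
CANDIDATES = [
    ("a horse", "is standing", "in a barn", "in a field ."),
    CORRECTED,
]

def reward(caption, fb_weight=1.0):
    # Hypothetical reward: metric-like overlap term plus a feedback penalty term.
    overlap = sum(p in CORRECTED for p in caption) / len(caption)
    penalty = sum(p in FLAGGED_WRONG for p in caption)
    return overlap - fb_weight * penalty

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce(candidates, reward_fn, steps=500, lr=0.5, seed=0):
    rng = random.Random(seed)
    logits = [0.0] * len(candidates)
    baseline = 0.0  # running average of rewards, used to reduce variance
    for _ in range(steps):
        probs = softmax(logits)
        i = rng.choices(range(len(candidates)), weights=probs)[0]  # sample a caption
        r = reward_fn(candidates[i])
        advantage = r - baseline
        # Gradient of log pi(i) with respect to the logits is onehot(i) - probs.
        for k in range(len(logits)):
            grad = (1.0 if k == i else 0.0) - probs[k]
            logits[k] += lr * advantage * grad
        baseline += 0.1 * (r - baseline)
    return softmax(logits)

probs = reinforce(CANDIDATES, reward)
print(probs)  # probability mass shifts toward the corrected caption
```

In a real captioning model the policy is the caption decoder itself and the gradient flows through its parameters; the update rule (sample, score, scale the log-probability gradient by the baseline-adjusted reward) has the same shape.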


Examples of Human-Provided Natural Language Feedback

REF: caption from our reference model (the MLE model);  FB: feedback provided by a human annotator;  CORR: corrected sentence provided by the human annotator.
See Section 3.2 for details

REF: ( a woman ) ( is sitting ) ( on a bench ) ( with a plate ) ( of food . )
FB: What the woman is sitting on is not visible.
CORR: ( a woman ) ( is sitting ) ( with a plate ) ( of food . )

REF: ( a horse ) ( is standing ) ( in a barn ) ( in a field . )
FB: There is no barn. There is a fence.
CORR: ( a horse ) ( is standing ) ( in a fence ) ( in a field . )

REF: ( a man ) ( riding a motorcycle ) ( on a city street . )
FB: There is a man and a woman.
CORR: ( a man and a woman ) ( riding a motorcycle ) ( on a city street . )
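The parenthesized phrase format above is easy to process mechanically. As a small sketch (not part of the authors' pipeline), the following parses a line into phrases and uses Python's standard `difflib.SequenceMatcher` to recover which phrases the annotator deleted or replaced. The helper names `parse_phrases` and `phrase_edits` are hypothetical.

```python
import re
from difflib import SequenceMatcher

def parse_phrases(line):
    """Split '( a woman ) ( is sitting )' into ['a woman', 'is sitting']."""
    return [p.strip() for p in re.findall(r"\(([^()]*)\)", line)]

def phrase_edits(ref, corr):
    """Return (op, ref_phrases, corr_phrases) for every span that differs."""
    a, b = parse_phrases(ref), parse_phrases(corr)
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    return [(op, a[i1:i2], b[j1:j2])
            for op, i1, i2, j1, j2 in matcher.get_opcodes() if op != "equal"]

ref = "( a horse ) ( is standing ) ( in a barn ) ( in a field . )"
corr = "( a horse ) ( is standing ) ( in a fence ) ( in a field . )"
print(phrase_edits(ref, corr))
# [('replace', ['in a barn'], ['in a fence'])]
```

Comparing at the phrase level rather than the word level keeps each edit aligned with the unit the captioning model generates.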



Examples of Generated Captions

MLE: maximum likelihood model;  RLB: reinforcement learning baseline model;  RLF: reinforcement learning with human-provided feedback.
See Section 4 for details

MLE : ( a man ) ( walking ) ( in front of a building ) ( with a cell phone . )
RLB : ( a man ) ( is standing ) ( on a sidewalk ) ( with a cell phone . )
RLF : ( a man ) ( wearing a black suit ) ( and tie ) ( on a sidewalk . )

MLE : ( a clock tower ) ( with a clock ) ( on top . )
RLB : ( a clock tower ) ( with a clock ) ( on top of it . )
RLF : ( a clock tower ) ( with a clock ) ( on the front . )

MLE : ( two giraffes ) ( are standing ) ( in a field ) ( in a field . )
RLB : ( a giraffe ) ( is standing ) ( in front of a large building . )
RLF : ( a giraffe ) ( is ) ( in a green field ) ( in a zoo . )

MLE : ( two birds ) ( are standing ) ( on the beach ) ( on a beach . )
RLB : ( a group ) ( of birds ) ( are ) ( on the beach . )
RLF : ( two birds ) ( are standing ) ( on a beach ) ( in front of water . )

MLE : ( a table ) ( with a variety ) ( of food . )
RLB : ( a table ) ( filled ) ( with different types ) ( of different foods . )
RLF : ( a tray ) ( of food on a table ) ( with a variety ) ( of different toppings . )

MLE : ( a computer ) ( sitting ) ( on top of a desk ) ( on a monitor . )
RLB : ( a laptop ) ( sitting ) ( on top of a desk ) ( next to a computer monitor . )
RLF : ( a computer ) ( sitting ) ( on top of a desk ) ( with a monitor . )

MLE : ( a suitcase ) ( is ) ( on a bed ) ( with a bag ) ( on top of it . )
RLB : ( a suitcase ) ( is ) ( on a table ) ( with a suitcase . )
RLF : ( a luggage bag ) ( sitting ) ( on a floor ) ( in a room . )

MLE : ( a red bus ) ( driving down a street ) ( with a person ) ( waiting ) ( on the street . )
RLB : ( a red bus ) ( driving down a street ) ( with people ) ( driving ) ( on the side . )
RLF : ( a red bus ) ( is driving down the street . )

MLE : ( a street ) ( with a traffic light ) ( and a bus ) ( in the background . )
RLB : ( a person ) ( walking ) ( on a city street ) ( with a yellow sign . )
RLF : ( a street ) ( with a car ) ( is driving down a street . )