Learning to Answer Questions From Image Using Convolutional Neural Network

Lin Ma, Zhengdong Lu, and Hang Li

Huawei Noah's Ark Lab, Hong Kong



In this paper, we propose to employ the convolutional neural network (CNN) for image question answering (QA). Our proposed CNN provides an end-to-end framework with convolutional architectures for learning not only the image and question representations, but also their inter-modal interactions to produce the answer. More specifically, our model consists of three CNNs: one image CNN to encode the image content, one sentence CNN to compose the words of the question, and one multimodal convolution layer to learn their joint representation for the classification in the space of candidate answer words. We demonstrate the efficacy of our proposed model on the DAQUAR and COCO-QA datasets, two benchmark datasets for image QA, with performance significantly outperforming the state of the art.

The contributions of this work:

Convolutional Neural Network for Image Question Answering (ConvIQA)

The image QA task, resembling the visual Turing test, differs from other multimodal learning tasks between image and sentence, such as automatic image captioning. The answer produced by image QA needs to be conditioned on both the image and the question. As such, image QA involves more interactions between image and language. The questions about the images are often very specific, which requires a detailed understanding of the image content.

ConvIQA aims to predict the answer a given the question q and the related image I.

In order to make a reliable prediction of the answer a, the question q and image I need to be adequately represented. Based on their representations, the relations between the two multimodal inputs are further learned to produce the answer.

ConvIQA consists of three individual CNNs: an image CNN that encodes the image content, a sentence CNN that composes the words of the question into a question representation, and a multimodal convolution layer that learns their joint representation for classification in the space of candidate answer words.
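The three-CNN pipeline can be sketched in a few lines of numpy. This is a minimal illustration only: the dimensions, window size, pooling choice, and random weights below are assumptions for the sketch, not the paper's actual architecture or trained parameters, and the image feature is stubbed in place of a real image CNN.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def sentence_cnn(embeddings, W, b, window=3):
    # Slide a window over the word embeddings, apply a shared projection,
    # then max-pool over positions -- one common way to compose a question
    # into a fixed-length vector (window size is an assumption).
    T, d = embeddings.shape
    outs = [relu(W @ embeddings[t:t + window].reshape(-1) + b)
            for t in range(T - window + 1)]
    return np.max(np.stack(outs), axis=0)

# Toy dimensions (assumptions, not taken from the paper).
d_word, d_sent, d_img, d_joint, n_answers, T = 8, 16, 32, 24, 10, 6

question = rng.standard_normal((T, d_word))  # word embeddings of the question
v_img = rng.standard_normal(d_img)           # stand-in for an image CNN feature

# Sentence CNN: compose the words of the question.
W_s = rng.standard_normal((d_sent, 3 * d_word)) * 0.1
v_qt = sentence_cnn(question, W_s, np.zeros(d_sent))

# Multimodal convolution layer: fuse question and image representations.
W_m = rng.standard_normal((d_joint, d_sent + d_img)) * 0.1
v_joint = relu(W_m @ np.concatenate([v_qt, v_img]) + np.zeros(d_joint))

# Classify in the space of candidate answer words (softmax).
W_a = rng.standard_normal((n_answers, d_joint)) * 0.1
logits = W_a @ v_joint
probs = np.exp(logits - logits.max())
probs /= probs.sum()
answer_idx = int(np.argmax(probs))  # index of the predicted answer word
```

In the actual model the stubbed image feature would come from a deep image CNN and all weights would be learned end-to-end; the sketch only shows how the three components fit together.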

Experimental Results

Image question answering performances on DAQUAR-All

Image question answering performances on DAQUAR-Reduced 

Image question answering performances on COCO-QA


References

M. Malinowski and M. Fritz, "A Multi-world Approach to Question Answering about Real-world Scenes Based on Uncertain Input", NIPS 2014.
M. Malinowski, M. Rohrbach, and M. Fritz, "Ask Your Neurons: A Neural-based Approach to Answering Questions about Images", ICCV 2015.
M. Ren, R. Kiros, and R. S. Zemel, "Exploring Models and Data for Image Question Answering", NIPS 2015.
Z. Wu and M. S. Palmer, "Verb Semantics and Lexical Selection", ACL 1994.

Contact Me

If you have any questions, please feel free to contact Dr. Lin Ma (forest.linma@gmail.com).


Last update: Nov. 24, 2015