Proceedings ArticleDOI

Guided Open Vocabulary Image Captioning with Constrained Beam Search

TL;DR: This work uses constrained beam search to force the inclusion of selected tag words in the output, and fixed, pretrained word embeddings to facilitate vocabulary expansion to previously unseen tag words, achieving state-of-the-art results for out-of-domain captioning on MSCOCO (and improved results for in-domain captioning).
Abstract: Existing image captioning models do not generalize well to out-of-domain images containing novel scenes or objects. This limitation severely hinders the use of these models in real world applications dealing with images in the wild. We address this problem using a flexible approach that enables existing deep captioning architectures to take advantage of image taggers at test time, without re-training. Our method uses constrained beam search to force the inclusion of selected tag words in the output, and fixed, pretrained word embeddings to facilitate vocabulary expansion to previously unseen tag words. Using this approach we achieve state of the art results for out-of-domain captioning on MSCOCO (and improved results for in-domain captioning). Perhaps surprisingly, our results significantly outperform approaches that incorporate the same tag predictions into the learning algorithm. We also show that we can significantly improve the quality of generated ImageNet captions by leveraging ground-truth labels.
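The decoding idea described above lends itself to a compact sketch. The Python snippet below is a simplified, hypothetical version of constrained beam search for single-word constraints: hypotheses are grouped by which constraint words they have already produced, and only hypotheses that contain every constraint may finish. The `score_next` scoring function and all names are assumptions for illustration, not the authors' implementation (which organizes beams around a finite-state machine that also handles multi-word and disjunctive constraints).

```python
import heapq

def constrained_beam_search(score_next, constraints, beam_size=3, max_len=12, eos="</s>"):
    """Toy constrained beam search for single-word constraints.

    score_next(prefix) -> dict of {next_word: log_prob}.
    Hypotheses are grouped by the subset of constraint words already
    emitted, and a hypothesis may only finish once every constraint
    word appears in it.
    """
    constraints = frozenset(constraints)
    beams = {frozenset(): [(0.0, [])]}   # satisfied subset -> [(log_prob, words)]
    finished = []

    for _ in range(max_len):
        candidates = {}
        for satisfied, hyps in beams.items():
            for log_prob, words in hyps:
                for word, lp in score_next(words).items():
                    if word == eos:
                        if satisfied == constraints:          # all tag words included
                            finished.append((log_prob + lp, words))
                        continue
                    new_satisfied = satisfied | ({word} & constraints)
                    candidates.setdefault(new_satisfied, []).append(
                        (log_prob + lp, words + [word]))
        # Keep the best `beam_size` hypotheses per constraint subset.
        beams = {s: heapq.nlargest(beam_size, c) for s, c in candidates.items()}

    return max(finished)[1] if finished else None
```

Because hypotheses that miss a tag word are never allowed to terminate, the search is steered toward captions that mention every selected image tag, which is the kind of test-time guidance the abstract describes for an otherwise unchanged captioning model.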


Citations
Posted Content
TL;DR: This paper proposes a new learning method Oscar (Object-Semantics Aligned Pre-training), which uses object tags detected in images as anchor points to significantly ease the learning of alignments.
Abstract: Large-scale pre-training methods of learning cross-modal representations on image-text pairs are becoming popular for vision-language tasks. While existing methods simply concatenate image region features and text features as input to the model to be pre-trained and use self-attention to learn image-text semantic alignments in a brute-force manner, in this paper, we propose a new learning method Oscar (Object-Semantics Aligned Pre-training), which uses object tags detected in images as anchor points to significantly ease the learning of alignments. Our method is motivated by the observation that the salient objects in an image can be accurately detected, and are often mentioned in the paired text. We pre-train an Oscar model on the public corpus of 6.5 million text-image pairs, and fine-tune it on downstream tasks, creating new state-of-the-art results on six well-established vision-language understanding and generation tasks.
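As a rough illustration of the input layout the Oscar abstract describes (caption tokens, detected object tags, and region features concatenated into one sequence), here is a hedged numpy sketch. The embedding lookup, projection, and segment-id scheme are assumptions for illustration and are not taken from the released Oscar code.

```python
import numpy as np

def build_oscar_style_input(caption_tokens, object_tags, region_features,
                            embed_token, hidden_size=768, seed=0):
    """Concatenate [caption tokens ; object tags ; region features] into a
    single sequence for a transformer encoder, with the object tags acting
    as textual anchor points for the image regions.

    embed_token(word) -> np.ndarray of shape (hidden_size,) stands in for a
    real wordpiece embedding lookup; region_features has shape (Lr, Dv).
    """
    word_part = np.stack([embed_token(w) for w in caption_tokens])   # (Lw, H)
    tag_part = np.stack([embed_token(t) for t in object_tags])       # (Lt, H)

    # Detector features live in a different space, so project them to the
    # hidden size (random, untrained projection purely for illustration).
    rng = np.random.default_rng(seed)
    proj = rng.normal(scale=0.02, size=(region_features.shape[1], hidden_size))
    image_part = region_features @ proj                               # (Lr, H)

    sequence = np.concatenate([word_part, tag_part, image_part], axis=0)
    # One plausible segment labelling: caption text vs. image side (tags + regions).
    segments = [0] * len(caption_tokens) + [1] * (len(object_tags) + len(region_features))
    return sequence, segments
```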

887 citations

Proceedings ArticleDOI
01 Nov 2019
TL;DR: This work proposes an ‘extractive’ approach to identify review segments which justify users’ intentions and designs two personalized generation models which can generate diverse justifications based on templates extracted from justification histories.
Abstract: Several recent works have considered the problem of generating reviews (or ‘tips’) as a form of explanation as to why a recommendation might match a customer’s interests. While promising, we demonstrate that existing approaches struggle (in terms of both quality and content) to generate justifications that are relevant to users’ decision-making process. We seek to introduce new datasets and methods to address the recommendation justification task. In terms of data, we first propose an ‘extractive’ approach to identify review segments which justify users’ intentions; this approach is then used to distantly label massive review corpora and construct large-scale personalized recommendation justification datasets. In terms of generation, we are able to design two personalized generation models with this data: (1) a reference-based Seq2Seq model with aspect-planning which can generate justifications covering different aspects, and (2) an aspect-conditional masked language model which can generate diverse justifications based on templates extracted from justification histories. We conduct experiments on two real-world datasets which show that our model is capable of generating convincing and diverse justifications.

686 citations


Cites background from "Guided Open Vocabulary Image Captio..."

  • ...To mitigate the trade-off between diversity and relevance, one approach is to add more constraints during generation such as constrained Beam Search (Anderson et al., 2017)....


  • ...RQ2: How does aspect planning affect generation? To mitigate the trade-off between diversity and relevance, one approach is to add more constraints during generation such as constrained Beam Search (Anderson et al., 2017)....


Proceedings ArticleDOI
14 Jun 2020
TL;DR: The architecture improves both the image encoding and the language generation steps: it learns a multi-level representation of the relationships between image regions integrating learned a priori knowledge, and uses a mesh-like connectivity at decoding stage to exploit low- and high-level features.
Abstract: Transformer-based architectures represent the state of the art in sequence modeling tasks like machine translation and language understanding. Their applicability to multi-modal contexts like image captioning, however, is still largely under-explored. With the aim of filling this gap, we present M² - a Meshed Transformer with Memory for Image Captioning. The architecture improves both the image encoding and the language generation steps: it learns a multi-level representation of the relationships between image regions integrating learned a priori knowledge, and uses a mesh-like connectivity at decoding stage to exploit low- and high-level features. Experimentally, we investigate the performance of the M² Transformer and different fully-attentive models in comparison with recurrent ones. When tested on COCO, our proposal achieves a new state of the art in single-model and ensemble configurations on the "Karpathy" test split and on the online test server. We also assess its performance when describing objects unseen in the training set. Trained models and code for reproducing the experiments are publicly available at: https://github.com/aimagelab/meshed-memory-transformer.
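The "Memory" part of this abstract, persistent learned slots that extend the attention keys and values so the encoder can attend to prior knowledge not present in the detected regions, can be sketched briefly. The numpy snippet below is an illustrative approximation of memory-augmented attention, not the released meshed-memory-transformer code; shapes and names are assumed.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def memory_augmented_attention(queries, keys, values, mem_k, mem_v):
    """Scaled dot-product attention whose keys/values are extended with
    learned memory slots (mem_k, mem_v), so the output can attend to
    prior knowledge that is not present in the current image regions.

    queries: (Lq, d), keys/values: (Lk, d), mem_k/mem_v: (M, d).
    """
    d = queries.shape[-1]
    k = np.concatenate([keys, mem_k], axis=0)      # (Lk + M, d)
    v = np.concatenate([values, mem_v], axis=0)    # (Lk + M, d)
    attn = softmax(queries @ k.T / np.sqrt(d))     # (Lq, Lk + M)
    return attn @ v                                # (Lq, d)
```

The mesh-like decoder connectivity mentioned in the abstract, where each decoder layer attends to the outputs of all encoder layers through learned gating, is omitted from this sketch.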

660 citations


Cites methods from "Guided Open Vocabulary Image Captio..."

  • ...As expected, the use of CBS significantly enhances the performances, in particular on out-of-domain captioning....


  • ...We compare with Up-Down [4] and Neural Baby Talk [25], when using GloVe word embeddings and Constrained Beam Search (CBS) [3] to address the generation of out-of-vocabulary words and constrain the presence of categories detected by an object detector....


Proceedings ArticleDOI
27 Mar 2018
TL;DR: A novel framework for image captioning is introduced that can produce natural language explicitly grounded in entities that object detectors find in the image, reaching state-of-the-art results on both the COCO and Flickr30k datasets.
Abstract: We introduce a novel framework for image captioning that can produce natural language explicitly grounded in entities that object detectors find in the image. Our approach reconciles classical slot filling approaches (that are generally better grounded in images) with modern neural captioning approaches (that are generally more natural sounding and accurate). Our approach first generates a sentence 'template' with slot locations explicitly tied to specific image regions. These slots are then filled in by visual concepts identified in the regions by object detectors. The entire architecture (sentence template generation and slot filling with object detectors) is end-to-end differentiable. We verify the effectiveness of our proposed model on different image captioning tasks. On standard image captioning and novel object captioning, our model reaches state-of-the-art on both COCO and Flickr30k datasets. We also demonstrate that our model has unique advantages when the train and test distributions of scene compositions - and hence language priors of associated captions - are different. Code has been made available at: https://github.com/jiasenlu/NeuralBabyTalk.
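To make the template-and-slot mechanism in this abstract concrete, here is a small, purely illustrative Python sketch: a sentence "template" contains slots tied to image regions, and each slot is filled with the category an object detector assigns to that region. The data structures are invented for illustration; the real model generates the template and grounds the slots with a neural decoder.

```python
from dataclasses import dataclass

@dataclass
class Slot:
    region_id: int          # which image region this slot is tied to

def fill_template(template, detections):
    """template: list of words and Slot placeholders.
    detections: dict mapping region_id -> detected category name."""
    words = []
    for item in template:
        if isinstance(item, Slot):
            words.append(detections[item.region_id])   # visual word from the detector
        else:
            words.append(item)                          # textual word from the language model
    return " ".join(words)

# e.g. "A <region 3> is sitting on a <region 7> ."
template = ["A", Slot(3), "is", "sitting", "on", "a", Slot(7), "."]
print(fill_template(template, {3: "cat", 7: "chair"}))
# -> "A cat is sitting on a chair ."  (naive join; tokens are not detokenized)
```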

436 citations


Cites methods from "Guided Open Vocabulary Image Captio..."

  • ...G means greedy decoding, and T1−2 means using constrained beam search [2] with 1−2 top detected concepts....


  • ...When using ResNet-101 and constrained beam search [2], our model significantly outperforms prior works under F1 scores, SPICE, METEOR, and CIDEr, across both out-of-domain and in-domain test data....


  • ...Following [2], the test set is split into in-domain and out-of-domain subsets....


  • ...We can force our model to produce a caption containing “orange” and “bird” using constrained beam search [2], further illustrated in Sec....


  • ...Since the visual words are grounded at the object-level, by using [2], our model was able to significantly boost the captioning performance on out-of-domain images....


Proceedings ArticleDOI
17 Feb 2021
TL;DR: This paper introduces the Conceptual 12M (CC12M) dataset, with 12 million image-text pairs specifically meant to be used for vision-and-language pre-training.
Abstract: The availability of large-scale image captioning and visual question answering datasets has contributed significantly to recent successes in vision-and-language pretraining. However, these datasets are often collected with over-restrictive requirements inherited from their original target tasks (e.g., image caption generation), which limit the resulting dataset scale and diversity. We take a step further in pushing the limits of vision-and-language pretraining data by relaxing the data collection pipeline used in Conceptual Captions 3M (CC3M) [54] and introduce the Conceptual 12M (CC12M), a dataset with 12 million image-text pairs specifically meant to be used for vision-and-language pre-training. We perform an analysis of this dataset and benchmark its effectiveness against CC3M on multiple downstream tasks with an emphasis on long-tail visual recognition. Our results clearly illustrate the benefit of scaling up pre-training data for vision-and-language tasks, as indicated by the new state-of-the-art results on both the nocaps and Conceptual Captions benchmarks.

376 citations

References
Proceedings ArticleDOI
27 Jun 2016
TL;DR: In this article, the authors propose a residual learning framework to ease the training of networks that are substantially deeper than those used previously; the approach won first place in the ILSVRC 2015 classification task.
Abstract: Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers—8× deeper than VGG nets [40] but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to the ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
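The central reformulation in this abstract, learning a residual function F(x) and adding it back to the input so that y = F(x) + x, is easy to write down. Below is a minimal numpy sketch of an identity-shortcut block with two toy fully connected layers; it is illustrative only and omits the convolutions, batch normalization, and projection shortcuts used in the actual networks.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, b1, w2, b2):
    """y = F(x) + x, where F is two small fully connected layers.
    The shortcut is the identity, so the block only has to learn the
    residual correction rather than the whole mapping."""
    f = relu(x @ w1 + b1)
    f = f @ w2 + b2
    return relu(f + x)       # identity shortcut added before the final ReLU

# Toy check: if the weights are zero, the block reduces to (a ReLU of) the
# identity, which is what makes very deep stacks of such blocks easy to optimize.
d = 4
x = np.random.randn(3, d)
y = residual_block(x, np.zeros((d, d)), np.zeros(d), np.zeros((d, d)), np.zeros(d))
assert np.allclose(y, relu(x))
```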

123,388 citations

Journal ArticleDOI
TL;DR: A novel, efficient, gradient based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.
Abstract: Learning to store information over extended time intervals by recurrent backpropagation takes a very long time, mostly because of insufficient, decaying error backflow. We briefly review Hochreiter's (1991) analysis of this problem, then address it by introducing a novel, efficient, gradient-based method called long short-term memory (LSTM). Truncating the gradient where this does not do harm, LSTM can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units. Multiplicative gate units learn to open and close access to the constant error flow. LSTM is local in space and time; its computational complexity per time step and weight is O(1). Our experiments with artificial data involve local, distributed, real-valued, and noisy pattern representations. In comparisons with real-time recurrent learning, backpropagation through time, recurrent cascade correlation, Elman nets, and neural sequence chunking, LSTM leads to many more successful runs, and learns much faster. LSTM also solves complex, artificial long-time-lag tasks that have never been solved by previous recurrent network algorithms.
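The mechanism this abstract describes, a memory cell whose error flow is kept roughly constant and multiplicative gates that open and close access to it, can be sketched in a few lines. The numpy snippet below follows the standard modern LSTM step rather than the exact 1997 formulation, and the parameter shapes are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step. W: (4H, D), U: (4H, H), b: (4H,).
    The cell state c carries error back through time almost unchanged
    (the 'constant error carousel'); the gates decide what to write,
    what to keep, and what to expose."""
    H = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i = sigmoid(z[0:H])           # input gate
    f = sigmoid(z[H:2 * H])       # forget gate
    o = sigmoid(z[2 * H:3 * H])   # output gate
    g = np.tanh(z[3 * H:4 * H])   # candidate cell update
    c = f * c_prev + i * g
    h = o * np.tanh(c)
    return h, c
```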

72,897 citations


"Guided Open Vocabulary Image Captio..." refers methods in this paper

  • ...The LRCN consists of a CNN visual feature extractor followed by two LSTM layers (Hochreiter and Schmidhuber, 1997), each with 1,000 hidden units....


  • ...In each case the decoding process remains the same—captions are generated by searching over output sequences greedily or with beam search (footnote: www.panderson.me/constrained-beam-search)....


Proceedings Article
01 Jan 2015
TL;DR: In this paper, the authors investigated the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting and showed that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 layers.
Abstract: In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3x3) convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively. We also show that our representations generalise well to other datasets, where they achieve state-of-the-art results. We have made our two best-performing ConvNet models publicly available to facilitate further research on the use of deep visual representations in computer vision.
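The design choice in this abstract, stacking many small 3x3 convolutions to reach 16 to 19 weight layers, can be made concrete with the commonly cited 16-layer configuration and a quick receptive-field check; the layout below is the widely reproduced "configuration D" and should be treated as illustrative.

```python
# Widely cited VGG-16 ("configuration D") layout: numbers are 3x3 conv
# output channels, "M" is 2x2 max pooling; three fully connected layers follow.
VGG16_CFG = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
             512, 512, 512, "M", 512, 512, 512, "M"]

def receptive_field(num_3x3_convs):
    """Receptive field of a stack of stride-1 3x3 convolutions:
    each extra layer adds 2 pixels on each axis."""
    return 1 + 2 * num_3x3_convs

# Two stacked 3x3 convs cover a 5x5 window, three cover 7x7 -- the same
# coverage as one large filter, but with fewer parameters and more non-linearities.
assert receptive_field(2) == 5 and receptive_field(3) == 7
```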

49,914 citations

Journal ArticleDOI
TL;DR: The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) is a benchmark in object category classification and detection on hundreds of object categories and millions of images; it has been run annually from 2010 to the present, attracting participation from more than fifty institutions.
Abstract: The ImageNet Large Scale Visual Recognition Challenge is a benchmark in object category classification and detection on hundreds of object categories and millions of images. The challenge has been run annually from 2010 to present, attracting participation from more than fifty institutions. This paper describes the creation of this benchmark dataset and the advances in object recognition that have been possible as a result. We discuss the challenges of collecting large-scale ground truth annotation, highlight key breakthroughs in categorical object recognition, provide a detailed analysis of the current state of the field of large-scale image classification and object detection, and compare the state-of-the-art computer vision accuracy with human accuracy. We conclude with lessons learned in the 5 years of the challenge, and propose future directions and improvements.

30,811 citations

Proceedings ArticleDOI
01 Oct 2014
TL;DR: A new global log-bilinear regression model is proposed that combines the advantages of the two major model families in the literature (global matrix factorization and local context window methods) and produces a vector space with meaningful substructure.
Abstract: Recent methods for learning vector space representations of words have succeeded in capturing fine-grained semantic and syntactic regularities using vector arithmetic, but the origin of these regularities has remained opaque. We analyze and make explicit the model properties needed for such regularities to emerge in word vectors. The result is a new global log-bilinear regression model that combines the advantages of the two major model families in the literature: global matrix factorization and local context window methods. Our model efficiently leverages statistical information by training only on the nonzero elements in a word-word co-occurrence matrix, rather than on the entire sparse matrix or on individual context windows in a large corpus. The model produces a vector space with meaningful substructure, as evidenced by its performance of 75% on a recent word analogy task. It also outperforms related models on similarity tasks and named entity recognition.
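The "global log-bilinear regression" in this abstract amounts to a weighted least-squares fit to the logarithm of word co-occurrence counts. The numpy sketch below evaluates that objective for given word and context vectors; the weighting function follows the form reported for GloVe (with x_max = 100 and alpha = 0.75), while the toy data and dimensions are invented.

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Co-occurrence weighting f(x) from the GloVe paper: down-weights
    rare pairs and caps the influence of very frequent ones."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_loss(W, W_ctx, b, b_ctx, cooc):
    """Weighted least-squares GloVe objective over non-zero counts:
    sum_ij f(X_ij) * (w_i . w~_j + b_i + b~_j - log X_ij)^2."""
    i, j = np.nonzero(cooc)
    x = cooc[i, j]
    pred = np.sum(W[i] * W_ctx[j], axis=1) + b[i] + b_ctx[j]
    return np.sum(glove_weight(x) * (pred - np.log(x)) ** 2)

# Toy usage with a 5-word vocabulary and random 10-d vectors.
rng = np.random.default_rng(0)
V, d = 5, 10
cooc = rng.integers(0, 4, size=(V, V)).astype(float)
loss = glove_loss(rng.normal(size=(V, d)), rng.normal(size=(V, d)),
                  np.zeros(V), np.zeros(V), cooc)
```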

30,558 citations


"Guided Open Vocabulary Image Captio..." refers background or methods in this paper

  • ...To introduce an additional vocabulary word, the GloVe embedding for the new word is simply concatenated with W_e as an additional column, increasing the dimension of both Π_t and p_t by one....


  • ...The model is trained with the conventional softmax cross-entropy loss function, and learns to predict v_t vectors that have a high dot-product similarity with the GloVe embedding of the correct output word....


  • ...Since GloVe embeddings capture semantic and syntactic similarities (Pennington et al., 2014), intuitively the captioning model will generalize from similar words in order to understand how the new word can be used....


  • ...Concretely, the i-th column of the W_e input embedding matrix is initialized with the GloVe vector associated with vocabulary word i....


  • ...The model output is then:

        v_t = tanh(W_v h^2_t + b_v)                          (11)
        p(y_t | y_{t-1}, ..., y_1, I) = softmax(W_e^T v_t)   (12)

    where v_t represents the top LSTM output projected to 300 dimensions, W_e^T contains GloVe embeddings as row vectors, and p(y_t | y_{t-1}, ..., y_1, I) represents the normalized probability distribution over the predicted output word y_t at timestep t, given the previous output words and the image....

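The excerpts above describe an output layer that scores a predicted 300-dimensional vector v_t against fixed GloVe embeddings, so that an unseen tag word can be added by appending its GloVe vector as one more column of W_e. The numpy sketch below illustrates that mechanism under those assumptions; it is not the authors' code.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def output_distribution(v_t, W_e):
    """p(y_t | ...) = softmax(W_e^T v_t): score the predicted 300-d vector
    v_t against every column of the GloVe embedding matrix W_e (300 x V)."""
    return softmax(W_e.T @ v_t)

def expand_vocabulary(W_e, new_word_glove):
    """Add a previously unseen tag word by appending its fixed GloVe vector
    as an extra column; the output distribution grows by one entry
    without any retraining."""
    return np.concatenate([W_e, new_word_glove[:, None]], axis=1)

# Toy usage: 300-d embeddings, 4 known words, then one new tag word.
rng = np.random.default_rng(1)
W_e = rng.normal(size=(300, 4))
v_t = rng.normal(size=300)
p_small = output_distribution(v_t, W_e)                                          # 4 probabilities
p_big = output_distribution(v_t, expand_vocabulary(W_e, rng.normal(size=300)))   # 5 probabilities
assert p_small.shape == (4,) and p_big.shape == (5,)
```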