Top 10 papers published by Margaret Mitchell from Google in 2016

Proceedings Article•DOI•

Generating Natural Questions About an Image

[...]

Nasrin Mostafazadeh¹, Ishan Misra², Jacob Devlin², Margaret Mitchell³, Xiaodong He³, Lucy Vanderwende³ - Show less +2 more•Institutions (3)

University of Rochester¹, Carnegie Mellon University², Microsoft³

19 Mar 2016

TL;DR: This paper introduces the novel task of Visual Question Generation, where the system is tasked with asking a natural and engaging question when shown an image, and provides three datasets which cover a variety of images from object-centric to event-centric.

...read moreread less

Abstract: There has been an explosion of work in the vision & language community during the past few years from image captioning to video transcription, and answering questions about images. These tasks have focused on literal descriptions of the image. To move beyond the literal, we choose to explore how questions about an image are often directed at commonsense inference and the abstract events evoked by objects in the image. In this paper, we introduce the novel task of Visual Question Generation (VQG), where the system is tasked with asking a natural and engaging question when shown an image. We provide three datasets which cover a variety of images from object-centric to event-centric, with considerably more abstract training data than provided to state-of-the-art captioning systems thus far. We train and test several generative and retrieval models to tackle the task of VQG. Evaluation results show that while such models ask reasonable questions for a variety of images, there is still a wide gap with human performance which motivates further work on connecting images with commonsense knowledge and pragmatics. Our proposed task offers a new challenge to the community which we hope furthers interest in exploring deeper connections between vision & language.

...read moreread less

300 citations

Posted Content•

Visual Storytelling

[...]

Ting-Hao, Huang, Francis Ferraro, Nasrin Mostafazadeh, Ishan Misra, Aishwarya Agrawal, Jacob Devlin, Ross Girshick, Xiaodong He, Pushmeet Kohli, Dhruv Batra, C. Lawrence Zitnick, Devi Parikh, Lucy Vanderwende, Michel Galley, Margaret Mitchell - Show less +12 more

13 Apr 2016-arXiv: Computation and Language

TL;DR: Modelling concrete description as well as figurative and social language, as provided in this dataset and the storytelling task, has the potential to move artificial intelligence from basic understandings of typical visual scenes towards more and more human-like understanding of grounded event structure and subjective expression.

...read moreread less

Abstract: We introduce the first dataset for sequential vision-to-language, and explore how this data may be used for the task of visual storytelling The first release of this dataset, SIND v1, includes 81,743 unique photos in 20,211 sequences, aligned to both descriptive (caption) and story language We establish several strong baselines for the storytelling task, and motivate an automatic metric to benchmark progress Modelling concrete description as well as figurative and social language, as provided in this dataset and the storytelling task, has the potential to move artificial intelligence from basic understandings of typical visual scenes towards more and more human-like understanding of grounded event structure and subjective expression

...read moreread less

231 citations

Proceedings Article•DOI•

Seeing through the Human Reporting Bias: Visual Classifiers from Noisy Human-Centric Labels

[...]

Ishan Misra¹, C. Lawrence Zitnick, Margaret Mitchell², Ross Girshick•Institutions (2)

Carnegie Mellon University¹, Microsoft²

01 Jun 2016

TL;DR: In this article, the authors use human reporting bias to learn visually correct image classifiers and demonstrate that the noise in these annotations exhibits structure and can be modeled, which is highly interpretable for reporting "what's in the image" versus "What's worth saying".

...read moreread less

Abstract: When human annotators are given a choice about what to label in an image, they apply their own subjective judgments on what to ignore and what to mention. We refer to these noisy "human-centric" annotations as exhibiting human reporting bias. Examples of such annotations include image tags and keywords found on photo sharing sites, or in datasets containing image captions. In this paper, we use these noisy annotations for learning visually correct image classifiers. Such annotations do not use consistent vocabulary, and miss a significant amount of the information present in an image, however, we demonstrate that the noise in these annotations exhibits structure and can be modeled. We propose an algorithm to decouple the human reporting bias from the correct visually grounded labels. Our results are highly interpretable for reporting "what's in the image" versus "what's worth saying." We demonstrate the algorithm's efficacy along a variety of metrics and datasets, including MS COCO and Yahoo Flickr 100M.We show significant improvements over traditional algorithms for both image classification and image captioning, doubling the performance of existing methods in some cases.

...read moreread less

210 citations

Proceedings Article•DOI•

Visual Storytelling

[...]

Ting-Hao Kenneth Huang¹, Francis Ferraro², Nasrin Mostafazadeh³, Ishan Misra¹, Aishwarya Agrawal⁴, Jacob Devlin¹, Ross Girshick⁵, Xiaodong He⁶, Pushmeet Kohli⁶, Dhruv Batra⁴, C. Lawrence Zitnick⁶, Devi Parikh⁴, Lucy Vanderwende⁶, Michel Galley⁶, Margaret Mitchell⁶ - Show less +11 more•Institutions (6)

Carnegie Mellon University¹, Johns Hopkins University², University of Rochester³, Virginia Tech⁴, Facebook⁵, Microsoft⁶

13 Jun 2016

TL;DR: Modelling concrete description as well as figurative and social language, as provided in this dataset and the storytelling task, has the potential to move artificial intelligence from basic understandings of typical visual scenes towards more and more human-like understanding of grounded event structure and subjective expression.

...read moreread less

Abstract: We introduce the first dataset for sequential vision-to-language, and explore how this data may be used for the task of visual storytelling. The first release of this dataset, SIND1 v.1, includes 81,743 unique photos in 20,211 sequences, aligned to both descriptive (caption) and story language. We establish several strong baselines for the storytelling task, and motivate an automatic metric to benchmark progress. Modelling concrete description as well as figurative and social language, as provided in this dataset and the storytelling task, has the potential to move artificial intelligence from basic understandings of typical visual scenes towards more and more human-like understanding of grounded event structure and subjective expression.

...read moreread less

184 citations

Journal Article•DOI•

Large Scale Retrieval and Generation of Image Descriptions

[...]

Vicente Ordonez¹, Xufeng Han¹, Polina Kuznetsova², Girish Kulkarni², Margaret Mitchell³, Kota Yamaguchi⁴, Karl Stratos⁵, Amit Goyal⁶, Jesse Dodge⁷, Alyssa Mensch⁸, Hal Daumé⁹, Alexander C. Berg¹, Yejin Choi¹⁰, Tamara L. Berg¹ - Show less +10 more•Institutions (10)

University of North Carolina at Chapel Hill¹, Stony Brook University², Microsoft³, Tohoku University⁴, Columbia University⁵, Yahoo!⁶, Carnegie Mellon University⁷, University of Pennsylvania⁸, University of Maryland, College Park⁹, University of Washington¹⁰

01 Aug 2016-International Journal of Computer Vision

TL;DR: The end result is two simple, but effective, methods for harnessing the power of big data to produce image captions that are altogether more general, relevant, and human-like than previous attempts.

...read moreread less

Abstract: What is the story of an image? What is the relationship between pictures, language, and information we can extract using state of the art computational recognition systems? In an attempt to address both of these questions, we explore methods for retrieving and generating natural language descriptions for images. Ideally, we would like our generated textual descriptions (captions) to both sound like a person wrote them, and also remain true to the image content. To do this we develop data-driven approaches for image description generation, using retrieval-based techniques to gather either: (a) whole captions associated with a visually similar image, or (b) relevant bits of text (phrases) from a large collection of image + description pairs. In the case of (b), we develop optimization algorithms to merge the retrieved phrases into valid natural language sentences. The end result is two simple, but effective, methods for harnessing the power of big data to produce image captions that are altogether more general, relevant, and human-like than previous attempts.

...read moreread less

72 citations

Journal Article•DOI•

Measuring Machine Intelligence Through Visual Question Answering

[...]

C. Lawrence Zitnick¹, Aishwarya Agrawal, Stanislaw Antol, Margaret Mitchell², Dhruv Batra, Devi Parikh - Show less +2 more•Institutions (2)

Facebook¹, Microsoft²

13 Apr 2016-Ai Magazine

TL;DR: In this article, a case study exploring the recently popular task of image captioning and its limitations as a task for measuring machine intelligence is presented, with a dataset unprecedented in size created for the task that contains over 760,000 human generated questions about images.

...read moreread less

Abstract: As machines have become more intelligent, there has been a renewed interest in methods for measuring their intelligence. A common approach is to propose tasks for which a human excels, but one which machines find difficult. However, an ideal task should also be easy to evaluate and not be easily gameable. We begin with a case study exploring the recently popular task of image captioning and its limitations as a task for measuring machine intelligence. An alternative and more promising task is Visual Question Answering that tests a machine’s ability to reason about language and vision. We describe a dataset unprecedented in size created for the task that contains over 760,000 human generated questions about images. Using around 10 million human generated answers, machines may be easily evaluated.

...read moreread less

38 citations

Posted Content•

Generating Natural Questions About an Image

[...]

Nasrin Mostafazadeh¹, Ishan Misra², Jacob Devlin², Margaret Mitchell³, Xiaodong He³, Lucy Vanderwende³ - Show less +2 more•Institutions (3)

University of Rochester¹, Carnegie Mellon University², Microsoft³

19 Mar 2016-arXiv: Computation and Language

TL;DR: In this article, the authors introduce the task of Visual Question Generation (VQG), where the system is tasked with asking a natural and engaging question when shown an image, and provide three datasets which cover a variety of images from object-centric to event-centric.

...read moreread less

Abstract: There has been an explosion of work in the vision & language community during the past few years from image captioning to video transcription, and answering questions about images. These tasks have focused on literal descriptions of the image. To move beyond the literal, we choose to explore how questions about an image are often directed at commonsense inference and the abstract events evoked by objects in the image. In this paper, we introduce the novel task of Visual Question Generation (VQG), where the system is tasked with asking a natural and engaging question when shown an image. We provide three datasets which cover a variety of images from object-centric to event-centric, with considerably more abstract training data than provided to state-of-the-art captioning systems thus far. We train and test several generative and retrieval models to tackle the task of VQG. Evaluation results show that while such models ask reasonable questions for a variety of images, there is still a wide gap with human performance which motivates further work on connecting images with commonsense knowledge and pragmatics. Our proposed task offers a new challenge to the community which we hope furthers interest in exploring deeper connections between vision & language.

...read moreread less

26 citations

Posted Content•

Memory-augmented Attention Modelling for Videos

[...]

Rasool Fakoor, Abdelrahman Mohamed, Margaret Mitchell, Sing Bing Kang, Pushmeet Kohli - Show less +1 more

04 Nov 2016-arXiv: Computer Vision and Pattern Recognition

TL;DR: This work proposes a novel memory-based attention model for video description that utilizes memories of past attention when reasoning about where to attend to in the current time step, similar to the central executive system proposed in human cognition.

...read moreread less

Abstract: We present a method to improve video description generation by modeling higher-order interactions between video frames and described concepts. By storing past visual attention in the video associated to previously generated words, the system is able to decide what to look at and describe in light of what it has already looked at and described. This enables not only more effective local attention, but tractable consideration of the video sequence while generating each word. Evaluation on the challenging and popular MSVD and Charades datasets demonstrates that the proposed architecture outperforms previous video description approaches without requiring external temporal video features.

...read moreread less

22 citations

Proceedings Article•

Microsummarization of online reviews: an experimental study

[...]

Rebecca Mason¹, Benjamin Gaska², Benjamin Van Durme³, Pallavi Choudhury⁴, Ted Hart⁴, Bill Dolan⁴, Kristina Toutanova⁴, Margaret Mitchell⁴ - Show less +4 more•Institutions (4)

Google¹, University of Arizona², Johns Hopkins University³, Microsoft⁴

12 Feb 2016

TL;DR: The task of microsummarization is introduced, which combines sentiment analysis, summarization, and entity recognition in order to surface key content to users and finds it can reliably extract relevant entities and the sentiment targeted towards them using crowd-sourced labels as supervision.

...read moreread less

Abstract: Mobile and location-based social media applications provide platforms for users to share brief opinions about products, venues, and services. These quickly typed opinions, or micro-reviews, are a valuable source of current sentiment on a wide variety of subjects. However, there is currently little research on how to mine this information to present it back to users in easily consumable way. In this paper, we introduce the task of microsummarization, which combines sentiment analysis, summarization, and entity recognition in order to surface key content to users. We explore unsupervised and supervised methods for this task, and find we can reliably extract relevant entities and the sentiment targeted towards them using crowd-sourced labels as supervision. In an end-to-end evaluation, we find our best-performing system is vastly preferred by judges over a traditional extractive summarization approach. This work motivates an entirely new approach to summarization, incorporating both sentiment analysis and item extraction for modernized, at-a-glance presentation of public opinion.

...read moreread less

12 citations

Posted Content•

Measuring Machine Intelligence Through Visual Question Answering

[...]

C. Lawrence Zitnick¹, Aishwarya Agrawal, Stanislaw Antol, Margaret Mitchell², Dhruv Batra, Devi Parikh - Show less +2 more•Institutions (2)

Facebook¹, Microsoft²

31 Aug 2016-arXiv: Artificial Intelligence

TL;DR: A case study exploring the recently popular task of image captioning and its limitations as a task for measuring machine intelligence, and an alternative and more promising task that tests a machine’s ability to reason about language and vision.

...read moreread less

Abstract: As machines have become more intelligent, there has been a renewed interest in methods for measuring their intelligence. A common approach is to propose tasks for which a human excels, but one which machines find difficult. However, an ideal task should also be easy to evaluate and not be easily gameable. We begin with a case study exploring the recently popular task of image captioning and its limitations as a task for measuring machine intelligence. An alternative and more promising task is Visual Question Answering that tests a machine's ability to reason about language and vision. We describe a dataset unprecedented in size created for the task that contains over 760,000 human generated questions about images. Using around 10 million human generated answers, machines may be easily evaluated.

...read moreread less

3 citations

Showing papers by "Margaret Mitchell published in 2016"