Showing papers by "Margaret Mitchell published in 2015"

PDF

Open Access

Proceedings Article•DOI•

[...]

Stanislaw Antol¹, Aishwarya Agrawal¹, Jiasen Lu¹, Margaret Mitchell², Dhruv Batra³, C. Lawrence Zitnick⁴, Devi Parikh³ - Show less +3 more•Institutions (4)

Virginia Tech¹, Microsoft², Georgia Institute of Technology³, Facebook⁴

07 Dec 2015

TL;DR: The task of free-form and open-ended Visual Question Answering (VQA) is proposed, given an image and a natural language question about the image, the task is to provide an accurate natural language answer.

...read moreread less

Abstract: We propose the task of free-form and open-ended Visual Question Answering (VQA). Given an image and a natural language question about the image, the task is to provide an accurate natural language answer. Mirroring real-world scenarios, such as helping the visually impaired, both the questions and answers are open-ended. Visual questions selectively target different areas of an image, including background details and underlying context. As a result, a system that succeeds at VQA typically needs a more detailed understanding of the image and complex reasoning than a system producing generic image captions. Moreover, VQA is amenable to automatic evaluation, since many open-ended answers contain only a few words or a closed set of answers that can be provided in a multiple-choice format. We provide a dataset containing ~0.25M images, ~0.76M questions, and ~10M answers (www.visualqa.org), and discuss the information it provides. Numerous baselines for VQA are provided and compared with human performance.

...read moreread less

3,513 citations

Posted Content•

VQA: Visual Question Answering

[...]

Aishwarya Agrawal, Jiasen Lu, Stanislaw Antol, Margaret Mitchell, C. Lawrence Zitnick, Dhruv Batra, Devi Parikh - Show less +3 more

03 May 2015-arXiv: Computation and Language

...read moreread less

Abstract: We propose the task of free-form and open-ended Visual Question Answering (VQA). Given an image and a natural language question about the image, the task is to provide an accurate natural language answer. Mirroring real-world scenarios, such as helping the visually impaired, both the questions and answers are open-ended. Visual questions selectively target different areas of an image, including background details and underlying context. As a result, a system that succeeds at VQA typically needs a more detailed understanding of the image and complex reasoning than a system producing generic image captions. Moreover, VQA is amenable to automatic evaluation, since many open-ended answers contain only a few words or a closed set of answers that can be provided in a multiple-choice format. We provide a dataset containing ~0.25M images, ~0.76M questions, and ~10M answers (this http URL), and discuss the information it provides. Numerous baselines and methods for VQA are provided and compared with human performance. Our VQA demo is available on CloudCV (this http URL).

...read moreread less

2,365 citations

Proceedings Article•DOI•

From captions to visual concepts and back

[...]

Hao Fang¹, Saurabh Gupta¹, Forrest Iandola¹, Rupesh Kumar Srivastava¹, Li Deng¹, Piotr Dollár¹, Jianfeng Gao¹, Xiaodong He¹, Margaret Mitchell¹, John Platt¹, C. Lawrence Zitnick¹, Geoffrey Zweig¹ - Show less +8 more•Institutions (1)

Microsoft¹

07 Jun 2015

TL;DR: This paper used multiple instance learning to train visual detectors for words that commonly occur in captions, including many different parts of speech such as nouns, verbs, and adjectives, which serve as conditional inputs to a maximum-entropy language model.

...read moreread less

Abstract: This paper presents a novel approach for automatically generating image descriptions: visual detectors, language models, and multimodal similarity models learnt directly from a dataset of image captions. We use multiple instance learning to train visual detectors for words that commonly occur in captions, including many different parts of speech such as nouns, verbs, and adjectives. The word detector outputs serve as conditional inputs to a maximum-entropy language model. The language model learns from a set of over 400,000 image descriptions to capture the statistics of word usage. We capture global semantics by re-ranking caption candidates using sentence-level features and a deep multimodal similarity model. Our system is state-of-the-art on the official Microsoft COCO benchmark, producing a BLEU-4 score of 29.1%. When human judges compare the system captions to ones written by other people on our held-out test set, the system captions have equal or better quality 34% of the time.

...read moreread less

1,357 citations

Proceedings Article•DOI•

A Neural Network Approach to Context-Sensitive Generation of Conversational Responses

[...]

Alessandro Sordoni¹, Michel Galley¹, Michael Auli², Chris Brockett¹, Yangfeng Ji³, Margaret Mitchell¹, Jian-Yun Nie¹, Jianfeng Gao¹, Bill Dolan¹ - Show less +5 more•Institutions (3)

Microsoft¹, Facebook², Georgia Institute of Technology³

22 Jun 2015

TL;DR: A neural network architecture is used to address sparsity issues that arise when integrating contextual information into classic statistical models, allowing the system to take into account previous dialog utterances.

...read moreread less

Abstract: We present a novel response generation system that can be trained end to end on large quantities of unstructured Twitter conversations. A neural network architecture is used to address sparsity issues that arise when integrating contextual information into classic statistical models, allowing the system to take into account previous dialog utterances. Our dynamic-context generative models show consistent gains over both context-sensitive and non-context-sensitive Machine Translation and Information Retrieval baselines.

...read moreread less

941 citations

Proceedings Article•DOI•

CLPsych 2015 Shared Task: Depression and PTSD on Twitter

[...]

Glen Coppersmith¹, Mark Dredze¹, Craig Harman¹, Kristy Hollingshead², Margaret Mitchell³ - Show less +1 more•Institutions (3)

Johns Hopkins University¹, Florida Institute for Human and Machine Cognition², Microsoft³

05 Jun 2015

TL;DR: This paper presents a summary of the Computational Linguistics and Clinical Psychology (CLPsych) 2015 shared and unshared tasks, aimed to provide apples-to-apples comparisons of various approaches to modeling language relevant to mental health from social media.

...read moreread less

Abstract: This paper presents a summary of the Computational Linguistics and Clinical Psychology (CLPsych) 2015 shared and unshared tasks. These tasks aimed to provide apples-to-apples comparisons of various approaches to modeling language relevant to mental health from social media. The data used for these tasks is from Twitter users who state a diagnosis of depression or post traumatic stress disorder (PTSD) and demographically-matched community controls. The unshared task was a hackathon held at Johns Hopkins University in November 2014 to explore the data, and the shared task was conducted remotely, with each participating team submitted scores for a held-back test set of users. The shared task consisted of three binary classification experiments: (1) depression versus control, (2) PTSD versus control, and (3) depression versus PTSD. Classifiers were compared primarily via their average precision, though a number of other metrics are used along with this to allow a more nuanced interpretation of the performance measures.

...read moreread less

261 citations

Proceedings Article•DOI•

Language Models for Image Captioning: The Quirks and What Works

[...]

Jacob Devlin¹, Hao Cheng², Hao Fang³, Saurabh Gupta⁴, Li Deng³, Xiaodong He³, Geoffrey Zweig³, Margaret Mitchell³ - Show less +4 more•Institutions (4)

BBN Technologies¹, University of Washington², Microsoft³, University of California, Berkeley⁴

07 May 2015

TL;DR: In this paper, the authors compare the merits of different language modeling approaches for the first time by using the same state-of-the-art CNN as input, and examine issues in different approaches, including linguistic irregularities, caption repetition, and data set overlap.

...read moreread less

Abstract: Two recent approaches have achieved state-of-the-art results in image captioning. The first uses a pipelined process where a set of candidate words is generated by a convolutional neural network (CNN) trained on images, and then a maximum entropy (ME) language model is used to arrange these words into a coherent sentence. The second uses the penultimate activation layer of the CNN as input to a recurrent neural network (RNN) that then generates the caption sequence. In this paper, we compare the merits of these different language modeling approaches for the first time by using the same state-ofthe-art CNN as input. We examine issues in the different approaches, including linguistic irregularities, caption repetition, and data set overlap. By combining key aspects of the ME and RNN methods, we achieve a new record performance over previously published results on the benchmark COCO dataset. However, the gains we see in BLEU do not translate to human judgments.

...read moreread less

242 citations

Posted Content•

Exploring Nearest Neighbor Approaches for Image Captioning

[...]

Jacob Devlin, Saurabh Gupta, Ross Girshick, Margaret Mitchell, C. Lawrence Zitnick - Show less +1 more

17 May 2015-arXiv: Computer Vision and Pattern Recognition

TL;DR: A variety of nearest neighbor baseline approaches for image captioning find a set of nearest neighbour images in the training set from which a caption may be borrowed for the query image by finding the caption that best represents the "consensus" of the set of candidate captions gathered from the nearest neighbor images.

...read moreread less

Abstract: We explore a variety of nearest neighbor baseline approaches for image captioning. These approaches find a set of nearest neighbor images in the training set from which a caption may be borrowed for the query image. We select a caption for the query image by finding the caption that best represents the "consensus" of the set of candidate captions gathered from the nearest neighbor images. When measured by automatic evaluation metrics on the MS COCO caption evaluation server, these approaches perform as well as many recent approaches that generate novel captions. However, human studies show that a method that generates novel captions is still preferred over the nearest neighbor approach.

...read moreread less

207 citations

Posted Content•

Language Models for Image Captioning: The Quirks and What Works

[...]

Jacob Devlin¹, Hao Cheng², Hao Fang³, Saurabh Gupta⁴, Li Deng³, Xiaodong He³, Geoffrey Zweig³, Margaret Mitchell³ - Show less +4 more•Institutions (4)

BBN Technologies¹, University of Washington², Microsoft³, University of California, Berkeley⁴

07 May 2015-arXiv: Computation and Language

TL;DR: By combining key aspects of the ME and RNN methods, this paper achieves a new record performance over previously published results on the benchmark COCO dataset, however, the gains the authors see in BLEU do not translate to human judgments.

...read moreread less

Abstract: Two recent approaches have achieved state-of-the-art results in image captioning. The first uses a pipelined process where a set of candidate words is generated by a convolutional neural network (CNN) trained on images, and then a maximum entropy (ME) language model is used to arrange these words into a coherent sentence. The second uses the penultimate activation layer of the CNN as input to a recurrent neural network (RNN) that then generates the caption sequence. In this paper, we compare the merits of these different language modeling approaches for the first time by using the same state-of-the-art CNN as input. We examine issues in the different approaches, including linguistic irregularities, caption repetition, and data set overlap. By combining key aspects of the ME and RNN methods, we achieve a new record performance over previously published results on the benchmark COCO dataset. However, the gains we see in BLEU do not translate to human judgments.

...read moreread less

143 citations

Proceedings Article•DOI•

Quantifying the Language of Schizophrenia in Social Media

[...]

Margaret Mitchell¹, Kristy Hollingshead², Glen Coppersmith³•Institutions (3)

Microsoft¹, Florida Institute for Human and Machine Cognition², Johns Hopkins University³

05 Jun 2015

TL;DR: Potential linguistic markers of schizophrenia using the tweets 1 of self-identified schizophrenia sufferers are explored, several natural language processing methods are described to analyze the language of schizophrenia, and preliminary evidence of additional linguistic signals are provided.

...read moreread less

Abstract: Analyzing symptoms of schizophrenia has traditionally been challenging given the low prevalence of the condition, affecting around 1% of the U.S. population. We explore potential linguistic markers of schizophrenia using the tweets 1 of self-identified schizophrenia sufferers, and describe several natural language processing (NLP) methods to analyze the language of schizophrenia. We examine how these signals compare with the widelyused LIWC categories for understanding mental health (Pennebaker et al., 2007), and provide preliminary evidence of additional linguistic signals that may aid in identifying and getting help to people suffering from schizophrenia.

...read moreread less

139 citations

Proceedings Article•DOI•

deltaBLEU: A Discriminative Metric for Generation Tasks with Intrinsically Diverse Targets

[...]

Michel Galley¹, Chris Brockett¹, Alessandro Sordoni¹, Yangfeng Ji², Michael Auli³, Chris Quirk¹, Margaret Mitchell¹, Jianfeng Gao¹, Bill Dolan¹ - Show less +5 more•Institutions (3)

Microsoft¹, Georgia Institute of Technology², Facebook³

23 Jun 2015

TL;DR: In tasks involving generation of conversational responses, ∆BLEU correlates reasonably with human judgments and outperforms sentence-level and IBM BLEU in terms of both Spearman's ρ and Kendall’s τ.

...read moreread less

Abstract: We introduce Discriminative BLEU (∆BLEU), a novel metric for intrinsic evaluation of generated text in tasks that admit a diverse range of possible outputs. Reference strings are scored for quality by human raters on a scale of [−1, +1] to weight multi-reference BLEU. In tasks involving generation of conversational responses, ∆BLEU correlates reasonably with human judgments and outperforms sentence-level and IBM BLEU in terms of both Spearman’s ρ and Kendall’s τ .

...read moreread less

135 citations

Posted Content•

A Neural Network Approach to Context-Sensitive Generation of Conversational Responses

[...]

Alessandro Sordoni¹, Michel Galley¹, Michael Auli², Chris Brockett¹, Yangfeng Ji³, Margaret Mitchell¹, Jian-Yun Nie¹, Jianfeng Gao¹, Bill Dolan¹ - Show less +5 more•Institutions (3)

Microsoft¹, Facebook², Georgia Institute of Technology³

22 Jun 2015-arXiv: Computation and Language

TL;DR: This paper used a neural network architecture to address sparsity issues that arise when integrating contextual information into classic statistical models, allowing the system to take into account previous dialog utterances and showed consistent gains over both context-sensitive and non-context-sensitive Machine Translation and Information Retrieval baselines.

...read moreread less

Posted Content•

deltaBLEU: A Discriminative Metric for Generation Tasks with Intrinsically Diverse Targets

[...]

Michel Galley¹, Chris Brockett¹, Alessandro Sordoni¹, Yangfeng Ji², Michael Auli³, Chris Quirk¹, Margaret Mitchell¹, Jianfeng Gao¹, Bill Dolan¹ - Show less +5 more•Institutions (3)

Microsoft¹, Georgia Institute of Technology², Facebook³

23 Jun 2015-arXiv: Computation and Language

TL;DR: This article introduced Discriminative BLEU (deltaBLEU), a novel metric for intrinsic evaluation of generated text in tasks that admit a diverse range of possible outputs, and showed that deltaBLEU correlates reasonably with human judgments and outperforms sentence-level and IBM bLEU in terms of Spearman's rho and Kendall's tau.

...read moreread less

Abstract: We introduce Discriminative BLEU (deltaBLEU), a novel metric for intrinsic evaluation of generated text in tasks that admit a diverse range of possible outputs. Reference strings are scored for quality by human raters on a scale of [-1, +1] to weight multi-reference BLEU. In tasks involving generation of conversational responses, deltaBLEU correlates reasonably with human judgments and outperforms sentence-level and IBM BLEU in terms of both Spearman's rho and Kendall's tau.

...read moreread less

Patent•

Discovery of semantic similarities between images and text

[...]

Jianfeng Gao¹, Xiaodong He¹, Saurabh Gupta¹, Geoffrey Zweig¹, Forrest Iandola¹, Li Deng¹, Hao Fang¹, Margaret Mitchell¹, John Platt¹, Rupesh Kumar Srivastava¹ - Show less +6 more•Institutions (1)

Microsoft¹

28 Aug 2015

TL;DR: In this paper, a deep multimodal similarity model was proposed to determine the relevance of the sentences based on similarity of text vectors generated for one or more sentences to an image vector generated for an image.

...read moreread less

Abstract: Disclosed herein are technologies directed to discovering semantic similarities between images and text, which can include performing image search using a textual query, performing text search using an image as a query, and/or generating captions for images using a caption generator. A semantic similarity framework can include a caption generator and can be based on a deep multimodal similar model. The deep multimodal similarity model can receive sentences and determine the relevancy of the sentences based on similarity of text vectors generated for one or more sentences to an image vector generated for an image. The text vectors and the image vector can be mapped in a semantic space, and their relevance can be determined based at least in part on the mapping. The sentence associated with the text vector determined to be the most relevant can be output as a caption for the image.

...read moreread less

Posted Content•

A Survey of Current Datasets for Vision and Language Research

[...]

Francis Ferraro, Nasrin Mostafazadeh, Ting-Hao, Huang, Lucy Vanderwende, Jacob Devlin, Michel Galley, Margaret Mitchell - Show less +4 more

23 Jun 2015-arXiv: Computation and Language

TL;DR: In this paper, the authors propose a set of quality metrics for evaluating and analyzing the vision and language datasets and categorize them accordingly. And they show that most recent datasets have been using more complex language and more abstract concepts, however, there are different strengths and weaknesses in each.

...read moreread less

Abstract: Integrating vision and language has long been a dream in work on artificial intelligence (AI). In the past two years, we have witnessed an explosion of work that brings together vision and language from images to videos and beyond. The available corpora have played a crucial role in advancing this area of research. In this paper, we propose a set of quality metrics for evaluating and analyzing the vision & language datasets and categorize them accordingly. Our analyses show that the most recent datasets have been using more complex language and more abstract concepts, however, there are different strengths and weaknesses in each.

...read moreread less

Patent•

Context-sensitive generation of conversational responses

[...]

Michel Galley¹, Alessandro Sordoni¹, Chris Brockett¹, Jianfeng Gao¹, William B. Dolan¹, Yangfeng Ji¹, Michael Auli¹, Margaret Mitchell¹, Jian-Yun Nie¹ - Show less +5 more•Institutions (1)

Microsoft¹

31 May 2015

TL;DR: In this article, context-message-response n-tuples are extracted from at least one source of conversational data to generate a set of training context message response n -tuples.

...read moreread less

Abstract: Examples are generally directed towards context-sensitive generation of conversational responses. Context-message-response n-tuples are extracted from at least one source of conversational data to generate a set of training context-message-response n-tuples. A response generation engine is trained on the set of training context-message-response n-tuples. The trained response generation engine automatically generates a context-sensitive response based on a user generated input message and conversational context data. A digital assistant utilizes the trained response generation engine to generate context-sensitive, natural language responses that are pertinent to user queries.

...read moreread less

Patent•

Crafting a response based on sentiment identification

[...]

Melissa N. Lim¹, Margaret Mitchell¹, Piali Choudhury¹•Institutions (1)

Microsoft¹

21 May 2015

TL;DR: In this article, a digital assistant receives unstructured data input and identifies a segment of the input that includes a facet item, and a sentiment associated with the facet item in the segment is identified and classified to identify a targeted sentiment directed towards the facet items.

...read moreread less

Abstract: Examples described herein provide a digital assistant crafting a response based on target sentiment identification from user input. The digital assistant receives unstructured data input and identifies a segment of the input that includes a facet item. A sentiment associated with the facet item in the segment is identified and classified to identify a targeted sentiment directed towards the facet item. A response is generated based on the targeted sentiment and the facet item.

...read moreread less

Patent•

Crafting feedback dialogue with a digital assistant

[...]

Melissa N. Lim¹, Margaret Mitchell¹, Chris Quirk¹•Institutions (1)

Microsoft¹

20 May 2015

TL;DR: In this paper, the authors dynamically personalize a digital assistant for a specific user, creating a personal connection between the digital assistant and the user by accessing user activity and generating queries based on user activity.

...read moreread less

Abstract: Examples described herein dynamically personalize a digital assistant for a specific user, creating a personal connection between the digital assistant and the user. The digital assistant accesses user activity and generates queries based on the user activity. The digital assistant facilitates natural language conversations as machine learning sessions between the digital assistant and the user using the one or more queries to learn the user's preferences and receives user input from the user during the learning session in response to the queries. The digital assistant dynamically updates a personalized profile for the user based on the user input during the natural language conversations.

...read moreread less

Patent•

Metric for automatic assessment of conversational responses

[...]

Michel Galley¹, Alessandro Sordoni¹, Chris Brockett¹, Jianfeng Gao¹, William B. Dolan¹, Yangfeng Ji¹, Michael Auli¹, Margaret Mitchell¹, Chris Quirk¹ - Show less +5 more•Institutions (1)

Microsoft¹

31 May 2015

TL;DR: In this article, a response assessment engine generates a metric score for a machine-generated response based on an assessment metric and the set of multi-reference responses, and the metric score indicates a quality of the machine generated conversational response relative to a user-generated message and a context of the user generated message.

...read moreread less

Abstract: Examples are generally directed towards automatic assessment of machine generated conversational responses. Context-message-response n-tuples are extracted from at least one source of conversational data to generate a set of multi-reference responses. A response in the set of multi-reference responses includes it context-message data pair and rating. The rating indicates a quality of the response relative to the context-message data pair. A response assessment engine generates a metric score for a machine-generated response based on an assessment metric and the set of multi-reference responses. The metric score indicates a quality of the machine-generated conversational response relative to a user-generated message and a context of the user-generated message. A response generation system of a computing device, such as a digital assistant, is optimized and adjusted based on the metric score to improve the accuracy, quality, and relevance of responses output to the user.

...read moreread less

Posted Content•

Learning Visual Classifiers using Human-centric Annotations.

[...]

Ishan Misra, C. Lawrence Zitnick, Margaret Mitchell, Ross Girshick

22 Dec 2015

TL;DR: This paper proposes an algorithm to decouple the human reporting bias from the correct visually grounded labels for learning image classifiers, and provides results that are highly interpretable for reporting “what’s in the image” versus “ what”s worth saying.

...read moreread less

Abstract: When human annotators are given a choice about what to label in an image, they apply their own subjective judgments on what to ignore and what to mention. We refer to these noisy "human-centric" annotations as exhibiting human reporting bias. Examples of such annotations include image tags and keywords found on photo sharing sites, or in datasets containing image captions. In this paper, we use these noisy annotations for learning visually correct image classifiers. Such annotations do not use consistent vocabulary, and miss a significant amount of the information present in an image; however, we demonstrate that the noise in these annotations exhibits structure and can be modeled. We propose an algorithm to decouple the human reporting bias from the correct visually grounded labels. Our results are highly interpretable for reporting "what's in the image" versus "what's worth saying." We demonstrate the algorithm's efficacy along a variety of metrics and datasets, including MS COCO and Yahoo Flickr 100M. We show significant improvements over traditional algorithms for both image classification and image captioning, doubling the performance of existing methods in some cases.

...read moreread less

Posted Content•

Seeing through the Human Reporting Bias: Visual Classifiers from Noisy Human-Centric Labels

[...]

Ishan Misra¹, C. Lawrence Zitnick, Margaret Mitchell², Ross Girshick•Institutions (2)

Carnegie Mellon University¹, Microsoft²

22 Dec 2015-arXiv: Computer Vision and Pattern Recognition

TL;DR: In this article, the authors use human reporting bias to learn visually correct image classifiers and demonstrate that the noise in these annotations exhibits structure and can be modeled, which is highly interpretable for reporting "what's in the image" versus "What's worth saying".

...read moreread less