Top 12 papers published by Alexander C. Berg from University of North Carolina at Chapel Hill in 2016

Book Chapter•DOI•

[...]

Wei Liu¹, Dragomir Anguelov, Dumitru Erhan², Christian Szegedy², Scott Reed³, Cheng-Yang Fu¹, Alexander C. Berg¹ - Show less +3 more•Institutions (3)

University of North Carolina at Chapel Hill¹, Google², University of Michigan³

08 Oct 2016

TL;DR: The approach, named SSD, discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location, which makes SSD easy to train and straightforward to integrate into systems that require a detection component.

...read moreread less

Abstract: We present a method for detecting objects in images using a single deep neural network. Our approach, named SSD, discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location. At prediction time, the network generates scores for the presence of each object category in each default box and produces adjustments to the box to better match the object shape. Additionally, the network combines predictions from multiple feature maps with different resolutions to naturally handle objects of various sizes. SSD is simple relative to methods that require object proposals because it completely eliminates proposal generation and subsequent pixel or feature resampling stages and encapsulates all computation in a single network. This makes SSD easy to train and straightforward to integrate into systems that require a detection component. Experimental results on the PASCAL VOC, COCO, and ILSVRC datasets confirm that SSD has competitive accuracy to methods that utilize an additional object proposal step and is much faster, while providing a unified framework for both training and inference. For \(300 \times 300\) input, SSD achieves 74.3 % mAP on VOC2007 test at 59 FPS on a Nvidia Titan X and for \(512 \times 512\) input, SSD achieves 76.9 % mAP, outperforming a comparable state of the art Faster R-CNN model. Compared to other single stage methods, SSD has much better accuracy even with a smaller input image size. Code is available at https://github.com/weiliu89/caffe/tree/ssd.

...read moreread less

19,543 citations

Posted Content•

Modeling Context in Referring Expressions

[...]

Licheng Yu¹, Patrick Poirson¹, Shan Yang¹, Alexander C. Berg¹, Tamara L. Berg¹ - Show less +1 more•Institutions (1)

University of North Carolina at Chapel Hill¹

31 Jul 2016-arXiv: Computer Vision and Pattern Recognition

TL;DR: This work focuses on incorporating better measures of visual context into referring expression models and finds that visual comparison to other objects within an image helps improve performance significantly.

...read moreread less

Abstract: Humans refer to objects in their environments all the time, especially in dialogue with other people. We explore generating and comprehending natural language referring expressions for objects in images. In particular, we focus on incorporating better measures of visual context into referring expression models and find that visual comparison to other objects within an image helps improve performance significantly. We also develop methods to tie the language generation process together, so that we generate expressions for all objects of a particular category jointly. Evaluation on three recent datasets - RefCOCO, RefCOCO+, and RefCOCOg, shows the advantages of our methods for both referring expression generation and comprehension.

...read moreread less

500 citations

Book Chapter•DOI•

Modeling Context in Referring Expressions

[...]

Licheng Yu¹, Patrick Poirson¹, Shan Yang¹, Alexander C. Berg¹, Tamara L. Berg¹ - Show less +1 more•Institutions (1)

University of North Carolina at Chapel Hill¹

08 Oct 2016

TL;DR: This article explored generating and comprehending natural language referring expressions for objects in images and found that visual comparison to other objects within an image helps improve performance significantly, and developed methods to tie the language generation process together, so that they generate expressions for all objects of a particular category jointly.

...read moreread less

Abstract: Humans refer to objects in their environments all the time, especially in dialogue with other people. We explore generating and comprehending natural language referring expressions for objects in images. In particular, we focus on incorporating better measures of visual context into referring expression models and find that visual comparison to other objects within an image helps improve performance significantly. We also develop methods to tie the language generation process together, so that we generate expressions for all objects of a particular category jointly. Evaluation on three recent datasets - RefCOCO, RefCOCO+, and RefCOCOg (Datasets and toolbox can be downloaded from https://github.com/lichengunc/refer), shows the advantages of our methods for both referring expression generation and comprehension.

...read moreread less

390 citations

Proceedings Article•DOI•

Combining multiple sources of knowledge in deep CNNs for action recognition

[...]

Eunbyung Park¹, Xufeng Han¹, Tamara L. Berg¹, Alexander C. Berg¹•Institutions (1)

University of North Carolina at Chapel Hill¹

07 Mar 2016

TL;DR: This paper presents a spatially varying multiplicative fusion method for combining multiple CNNs trained on different sources that results in robust prediction by amplifying or suppressing the feature activations based on their agreement.

...read moreread less

Abstract: Although deep convolutional neural networks (CNNs) have shown remarkable results for feature learning and prediction tasks, many recent studies have demonstrated improved performance by incorporating additional handcrafted features or by fusing predictions from multiple CNNs. Usually, these combinations are implemented via feature concatenation or by averaging output prediction scores from several CNNs. In this paper, we present new approaches for combining different sources of knowledge in deep learning. First, we propose feature amplification, where we use an auxiliary, hand-crafted, feature (e.g. optical flow) to perform spatially varying soft-gating on intermediate CNN feature maps. Second, we present a spatially varying multiplicative fusion method for combining multiple CNNs trained on different sources that results in robust prediction by amplifying or suppressing the feature activations based on their agreement. We test these methods in the context of action recognition where information from spatial and temporal cues is useful, obtaining results that are comparable with state-of-the-art methods and outperform methods using only CNNs and optical flow features.

...read moreread less

193 citations

Proceedings Article•DOI•

Fast Single Shot Detection and Pose Estimation

[...]

Patrick Poirson¹, Phil Ammirato¹, Cheng-Yang Fu¹, Wei Liu¹, Jana Kosecka², Alexander C. Berg¹ - Show less +2 more•Institutions (2)

University of North Carolina at Chapel Hill¹, George Mason University²

19 Sep 2016

TL;DR: In this paper, the authors combine detection and pose estimation at the same level using a deep learning approach, where scores for the presence of an object category, the offset for its location, and the approximate pose are all estimated on a regular grid of locations in the image.

...read moreread less

Abstract: For applications in navigation and robotics, estimating the 3D pose of objects is as important as detection. Many approaches to pose estimation rely on detecting or tracking parts or keypoints [11, 21]. In this paper we build on a recent state-of-the-art convolutional network for slidingwindow detection [10] to provide detection and rough pose estimation in a single shot, without intermediate stages of detecting parts or initial bounding boxes. While not the first system to treat pose estimation as a categorization problem, this is the first attempt to combine detection and pose estimation at the same level using a deep learning approach. The key to the architecture is a deep convolutional network where scores for the presence of an object category, the offset for its location, and the approximate pose are all estimated on a regular grid of locations in the image. The resulting system is as accurate as recent work on pose estimation (42.4% 8 View mAVP on Pascal 3D+ [21] ) and significantly faster (46 frames per second (FPS) on a TITAN X GPU). This approach to detection and rough pose estimation is fast and accurate enough to be widely applied as a pre-processing step for tasks including high-accuracy pose estimation, object tracking and localization, and vSLAM.

...read moreread less

105 citations

Journal Article•DOI•

Large Scale Retrieval and Generation of Image Descriptions

[...]

Vicente Ordonez¹, Xufeng Han¹, Polina Kuznetsova², Girish Kulkarni², Margaret Mitchell³, Kota Yamaguchi⁴, Karl Stratos⁵, Amit Goyal⁶, Jesse Dodge⁷, Alyssa Mensch⁸, Hal Daumé⁹, Alexander C. Berg¹, Yejin Choi¹⁰, Tamara L. Berg¹ - Show less +10 more•Institutions (10)

University of North Carolina at Chapel Hill¹, Stony Brook University², Microsoft³, Tohoku University⁴, Columbia University⁵, Yahoo!⁶, Carnegie Mellon University⁷, University of Pennsylvania⁸, University of Maryland, College Park⁹, University of Washington¹⁰

01 Aug 2016-International Journal of Computer Vision

TL;DR: The end result is two simple, but effective, methods for harnessing the power of big data to produce image captions that are altogether more general, relevant, and human-like than previous attempts.

...read moreread less

Abstract: What is the story of an image? What is the relationship between pictures, language, and information we can extract using state of the art computational recognition systems? In an attempt to address both of these questions, we explore methods for retrieving and generating natural language descriptions for images. Ideally, we would like our generated textual descriptions (captions) to both sound like a person wrote them, and also remain true to the image content. To do this we develop data-driven approaches for image description generation, using retrieval-based techniques to gather either: (a) whole captions associated with a visually similar image, or (b) relevant bits of text (phrases) from a large collection of image + description pairs. In the case of (b), we develop optimization algorithms to merge the retrieved phrases into valid natural language sentences. The end result is two simple, but effective, methods for harnessing the power of big data to produce image captions that are altogether more general, relevant, and human-like than previous attempts.

...read moreread less

72 citations

Posted Content•

Fast Single Shot Detection and Pose Estimation

[...]

Patrick Poirson¹, Phil Ammirato¹, Cheng-Yang Fu¹, Wei Liu¹, Jana Kosecka², Alexander C. Berg¹ - Show less +2 more•Institutions (2)

University of North Carolina at Chapel Hill¹, George Mason University²

19 Sep 2016-arXiv: Computer Vision and Pattern Recognition

TL;DR: This is the first attempt to combine detection and pose estimation at the same level using a deep learning approach and is fast and accurate enough to be widely applied as a pre-processing step for tasks including high-accuracy pose estimation, object tracking and localization, and vSLAM.

...read moreread less

Abstract: For applications in navigation and robotics, estimating the 3D pose of objects is as important as detection. Many approaches to pose estimation rely on detecting or tracking parts or keypoints [11, 21]. In this paper we build on a recent state-of-the-art convolutional network for slidingwindow detection [10] to provide detection and rough pose estimation in a single shot, without intermediate stages of detecting parts or initial bounding boxes. While not the first system to treat pose estimation as a categorization problem, this is the first attempt to combine detection and pose estimation at the same level using a deep learning approach. The key to the architecture is a deep convolutional network where scores for the presence of an object category, the offset for its location, and the approximate pose are all estimated on a regular grid of locations in the image. The resulting system is as accurate as recent work on pose estimation (42.4% 8 View mAVP on Pascal 3D+ [21] ) and significantly faster (46 frames per second (FPS) on a TITAN X GPU). This approach to detection and rough pose estimation is fast and accurate enough to be widely applied as a pre-processing step for tasks including high-accuracy pose estimation, object tracking and localization, and vSLAM.

...read moreread less

38 citations

Proceedings Article•DOI•

Solving Visual Madlibs with Multiple Cues

[...]

Tatiana Tommasi¹, Arun Mallya², Bryan A. Plummer², Svetlana Lazebnik², Alexander C. Berg¹, Tamara L. Berg¹ - Show less +2 more•Institutions (2)

University of North Carolina at Chapel Hill¹, University of Illinois at Urbana–Champaign²

01 Jan 2016

TL;DR: This paper focuses on answering fill-in-the-blank style multiple choice questions from the Visual Madlibs dataset and presents a method for selecting sub-regions of an image that are relevant for evaluating the appropriateness of a putative answer.

...read moreread less

Abstract: This paper focuses on answering fill-in-the-blank style multiple choice questions from the Visual Madlibs dataset. Previous approaches to Visual Question Answering (VQA) have mainly used generic image features from networks trained on the ImageNet dataset, despite the wide scope of questions. In contrast, our approach employs features derived from networks trained for specialized tasks of scene classification, person activity prediction, and person and object attribute prediction. We also present a method for selecting sub-regions of an image that are relevant for evaluating the appropriateness of a putative answer. Visual features are computed both from the whole image and from local regions, while sentences are mapped to a common space using a simple normalized canonical correlation analysis (CCA) model. Our results show a significant improvement over the previous state of the art, and indicate that answering different question types benefits from examining a variety of image cues and carefully choosing informative image sub-regions.

...read moreread less

25 citations

Posted Content•

Solving Visual Madlibs with Multiple Cues

[...]

Tatiana Tommasi¹, Arun Mallya², Bryan A. Plummer², Svetlana Lazebnik², Alexander C. Berg¹, Tamara L. Berg¹ - Show less +2 more•Institutions (2)

University of North Carolina at Chapel Hill¹, University of Illinois at Urbana–Champaign²

11 Aug 2016-arXiv: Computer Vision and Pattern Recognition

TL;DR: In this paper, the authors focus on answering fill-in-the-blank style multiple choice questions from the Visual Madlibs dataset and employ features derived from networks trained for specialized tasks of scene classification, person activity prediction, and person and object attribute prediction.

...read moreread less

Abstract: This paper focuses on answering fill-in-the-blank style multiple choice questions from the Visual Madlibs dataset. Previous approaches to Visual Question Answering (VQA) have mainly used generic image features from networks trained on the ImageNet dataset, despite the wide scope of questions. In contrast, our approach employs features derived from networks trained for specialized tasks of scene classification, person activity prediction, and person and object attribute prediction. We also present a method for selecting sub-regions of an image that are relevant for evaluating the appropriateness of a putative answer. Visual features are computed both from the whole image and from local regions, while sentences are mapped to a common space using a simple normalized canonical correlation analysis (CCA) model. Our results show a significant improvement over the previous state of the art, and indicate that answering different question types benefits from examining a variety of image cues and carefully choosing informative image sub-regions.

...read moreread less

10 citations

Posted Content•

Combining Multiple Cues for Visual Madlibs Question Answering

[...]

Tatiana Tommasi¹, Arun Mallya², Bryan A. Plummer², Svetlana Lazebnik², Alexander C. Berg³, Tamara L. Berg³ - Show less +2 more•Institutions (3)

Istituto Italiano di Tecnologia¹, University of Illinois at Urbana–Champaign², University of North Carolina at Chapel Hill³

01 Nov 2016-arXiv: Computer Vision and Pattern Recognition

TL;DR: This paper presents an approach for answering fill-in-the-blank multiple choice questions from the Visual Madlibs dataset and employs a combination of networks trained for specialized tasks such as scene recognition, person activity classification, and attribute prediction.

...read moreread less

Abstract: This paper presents an approach for answering fill-in-the-blank multiple choice questions from the Visual Madlibs dataset. Instead of generic and commonly used representations trained on the ImageNet classification task, our approach employs a combination of networks trained for specialized tasks such as scene recognition, person activity classification, and attribute prediction. We also present a method for localizing phrases from candidate answers in order to provide spatial support for feature extraction. We map each of these features, together with candidate answers, to a joint embedding space through normalized canonical correlation analysis (nCCA). Finally, we solve an optimization problem to learn to combine scores from nCCA models trained on multiple cues to select the best answer. Extensive experimental results show a significant improvement over the previous state of the art and confirm that answering questions from a wide range of types benefits from examining a variety of image cues and carefully choosing the spatial support for feature extraction.

...read moreread less

10 citations

Journal Article•DOI•

Learning to name objects

[...]

Vicente Ordonez¹, Wei Liu², Jia Deng³, Yejin Choi⁴, Alexander C. Berg², Tamara L. Berg² - Show less +2 more•Institutions (4)

Allen Institute for Artificial Intelligence¹, University of North Carolina at Chapel Hill², University of Michigan³, University of Washington⁴

25 Feb 2016-Communications of The ACM

TL;DR: This paper looks at the problem of predicting category labels that mimic how human observers would name objects, related to the concept of entry-level categories first introduced by psychologists in the 1970s and 1980s.

...read moreread less

Abstract: We have seen remarkable recent progress in computational visual recognition, producing systems that can classify objects into thousands of different categories with increasing accuracy. However, one question that has received relatively less attention is "what labels should recognition systems output?" This paper looks at the problem of predicting category labels that mimic how human observers would name objects. This goal is related to the concept of entry-level categories first introduced by psychologists in the 1970s and 1980s. We extend these seminal ideas to study human naming at large scale and to learn computational models for predicting entry-level categories. Practical applications of this work include improving human-focused computer vision applications such as automatically generating a natural language description for an image or text-based image search.

...read moreread less

Posted Content•

When was that made

[...]

Sirion Vittayakorn¹, Alexander C. Berg¹, Tamara L. Berg¹•Institutions (1)

University of North Carolina at Chapel Hill¹

12 Aug 2016-arXiv: Computer Vision and Pattern Recognition

TL;DR: The results demonstrate that the deep learning approach outperforms both a color-based baseline and visual data mining approach which is the previous state of the art method for the temporal estimation.

...read moreread less

Abstract: In this paper, we explore deep learning methods for estimating when objects were made. Automatic methods for this task could potentially be useful for historians, collectors, or any individual interested in estimating when their artifact was created. Direct applications include large-scale data organization or retrieval. Toward this goal, we utilize features from existing deep networks and also fine-tune new networks for temporal estimation. In addition, we create two new datasets of 67,771 dated clothing items from Flickr and museum collections. Our method outperforms both a color-based baseline and previous state of the art methods for temporal estimation. We also provide several analyses of what our networks have learned, and demonstrate applications to identifying temporal inspiration in fashion collections.

...read moreread less

Showing papers by "Alexander C. Berg published in 2016"