
Papers by Alexander C. Berg published in 2012


Proceedings Article
23 Apr 2012
TL;DR: A novel generation system that composes humanlike descriptions of images from computer vision detections; by leveraging syntactically informed word co-occurrence statistics, it automatically generates some of the most natural image descriptions to date.
Abstract: This paper introduces a novel generation system that composes humanlike descriptions of images from computer vision detections. By leveraging syntactically informed word co-occurrence statistics, the generator filters and constrains the noisy detections output from a vision system to generate syntactic trees that detail what the computer vision system sees. Results show that the generation system outperforms state-of-the-art systems, automatically generating some of the most natural image descriptions to date.

450 citations
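
To make the approach concrete, here is a minimal sketch of the filtering idea only, assuming a toy table of (modifier, noun) co-occurrence counts; the paper goes much further, generating full syntactic trees. Every name, count, and threshold below is a hypothetical stand-in for statistics mined from large text corpora:

```python
# Illustrative sketch, not the authors' code: prune noisy vision
# detections using word co-occurrence statistics.

# Hypothetical (modifier, noun) co-occurrence counts from a corpus.
COOC = {
    ("furry", "dog"): 1200, ("furry", "table"): 3,
    ("wooden", "table"): 2100, ("wooden", "dog"): 5,
}
NOUN_COUNTS = {"dog": 50000, "table": 40000}

def plausible(modifier, noun, min_ratio=1e-3):
    """Keep an attribute detection only if the modifier-noun pair is
    attested often enough relative to the noun's corpus frequency."""
    pair_count = COOC.get((modifier, noun), 0)
    return pair_count / NOUN_COUNTS.get(noun, 1) >= min_ratio

# Noisy detections: (object noun, attribute, detector confidence).
detections = [("dog", "furry", 0.9), ("table", "furry", 0.8),
              ("table", "wooden", 0.7)]

kept = [(n, m, c) for (n, m, c) in detections if plausible(m, n)]
print(kept)  # the implausible "furry table" is filtered out
```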


Proceedings Article
08 Jul 2012
TL;DR: A holistic data-driven approach to image description generation, exploiting the vast amount of (noisy) parallel image data and associated natural language descriptions available on the web to generate novel descriptions for query images.
Abstract: We present a holistic data-driven approach to image description generation, exploiting the vast amount of (noisy) parallel image data and associated natural language descriptions available on the web. More specifically, given a query image, we retrieve existing human-composed phrases used to describe visually similar images, then selectively combine those phrases to generate a novel description for the query image. We cast the generation process as constraint optimization problems, collectively incorporating multiple interconnected aspects of language composition for content planning, surface realization and discourse structure. Evaluation by human annotators indicates that our final system generates more semantically correct and linguistically appealing descriptions than two nontrivial baselines.

353 citations
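
As a rough illustration of the selective-combination step: the paper casts it as constraint optimization, jointly handling content planning, surface realization, and discourse structure; the sketch below substitutes a much simpler greedy rule that trades retrieval score against redundancy. The phrases and scores are made up:

```python
# Simplified stand-in for the paper's constraint optimization: greedily
# pick phrases that score well but overlap little with ones already chosen.

def overlap(a, b):
    """Jaccard word overlap as a crude redundancy measure."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / max(len(wa | wb), 1)

def compose(candidates, k=2, redundancy_penalty=2.0):
    """Greedily select k phrases maximizing score minus redundancy."""
    chosen, pool = [], dict(candidates)  # phrase -> similarity score
    while pool and len(chosen) < k:
        best = max(pool, key=lambda p: pool[p] -
                   redundancy_penalty * sum(overlap(p, c) for c in chosen))
        chosen.append(best)
        del pool[best]
    return ", ".join(chosen)

# Phrases retrieved from captions of visually similar images.
retrieved = [("a brown dog runs on the beach", 0.9),
             ("a dog running on sand", 0.85),
             ("waves crash on the shore", 0.6)]
print(compose(retrieved))  # picks complementary, not repetitive, phrases
```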


Proceedings Article
16 Jun 2012
TL;DR: This work proposes the Dual Accuracy Reward Trade-off Search (DARTS) algorithm and proves that, under practical conditions, it converges to an optimal solution.
Abstract: As visual recognition scales up to ever larger numbers of categories, maintaining high accuracy is increasingly difficult. In this work, we study the problem of optimizing accuracy-specificity trade-offs in large scale recognition, motivated by the observation that object categories form a semantic hierarchy consisting of many levels of abstraction. A classifier can select the appropriate level, trading off specificity for accuracy in case of uncertainty. By optimizing this trade-off, we obtain classifiers that try to be as specific as possible while guaranteeing an arbitrarily high accuracy. We formulate the problem as maximizing information gain while ensuring a fixed, arbitrarily small error rate with a semantic hierarchy. We propose the Dual Accuracy Reward Trade-off Search (DARTS) algorithm and prove that, under practical conditions, it converges to an optimal solution. Experiments demonstrate the effectiveness of our algorithm on datasets ranging from 65 to over 10,000 categories.

201 citations
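
A minimal sketch of the DARTS decision rule, assuming leaf-class posteriors are already available: predict the node that maximizes (information gain + lambda) times the node's probability mass, where a larger lambda makes the classifier hedge toward more abstract nodes. The hierarchy, posteriors, and lambda values below are hypothetical, and the paper's binary search for lambda (which guarantees the target accuracy) is omitted:

```python
import math

# Illustrative sketch of the DARTS prediction rule, not the authors' code.
HIERARCHY = {                      # node -> leaf classes underneath it
    "entity": ["dog", "cat", "car"],
    "animal": ["dog", "cat"],
    "dog": ["dog"], "cat": ["cat"], "car": ["car"],
}
N_LEAVES = 3

def info_gain(node):
    """Reward: bits of specificity gained over predicting the root."""
    return math.log2(N_LEAVES / len(HIERARCHY[node]))

def darts_predict(leaf_posteriors, lam):
    """Maximize expected reward (info_gain + lam) * P(node | image)."""
    def expected_reward(node):
        p_node = sum(leaf_posteriors[leaf] for leaf in HIERARCHY[node])
        return (info_gain(node) + lam) * p_node
    return max(HIERARCHY, key=expected_reward)

posteriors = {"dog": 0.45, "cat": 0.45, "car": 0.10}
print(darts_predict(posteriors, lam=0.0))  # "dog": maximally specific
print(darts_predict(posteriors, lam=2.0))  # "animal": hedges under uncertainty
```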


Proceedings Article
16 Jun 2012
TL;DR: This paper explores how a number of factors relate to human perception of importance using what people describe as a proxy for importance, and builds models to predict what will be described about an image given either known image content, or image content estimated automatically by recognition systems.
Abstract: What do people care about in an image? To drive computational visual recognition toward more human-centric outputs, we need a better understanding of how people perceive and judge the importance of content in images. In this paper, we explore how a number of factors relate to human perception of importance. Proposed factors fall into 3 broad types: 1) factors related to composition, e.g. size, location, 2) factors related to semantics, e.g. category of object or scene, and 3) contextual factors related to the likelihood of attribute-object, or object-scene pairs. We explore these factors using what people describe as a proxy for importance. Finally, we build models to predict what will be described about an image given either known image content, or image content estimated automatically by recognition systems.

179 citations
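
A hedged sketch of the final modeling step, using synthetic per-object features (relative size, distance from image center, animacy) and made-up labels; the paper's actual features, data, and models are far richer:

```python
# Toy illustration: predict whether an object will be mentioned in a
# human description from composition and semantic features.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Rows: [relative size, distance from center, is_animate];
# label 1 means a human description mentioned the object.
X = np.array([
    [0.40, 0.10, 1],   # large, central, animate -> described
    [0.35, 0.20, 1],
    [0.05, 0.45, 0],   # small, peripheral -> ignored
    [0.08, 0.40, 0],
    [0.30, 0.15, 0],
    [0.04, 0.50, 1],
])
y = np.array([1, 1, 0, 0, 1, 0])

model = LogisticRegression().fit(X, y)
new_object = np.array([[0.25, 0.12, 1]])  # sizeable, central, animate
print(model.predict_proba(new_object)[0, 1])  # P(object gets described)
```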


Proceedings Article
03 Jun 2012
TL;DR: This work concretely defines what it means to be visual, annotates visual text, and develops algorithms to automatically classify noun phrases as visual or non-visual; it finds that text alone achieves high accuracy at this task, and that incorporating features derived from computer vision algorithms improves performance further.
Abstract: When people describe a scene, they often include information that is not visually apparent; sometimes based on background knowledge, sometimes to tell a story. We aim to separate visual text (descriptions of what is being seen) from non-visual text in natural images and their descriptions. To do so, we first concretely define what it means to be visual, annotate visual text and then develop algorithms to automatically classify noun phrases as visual or non-visual. We find that using text alone, we are able to achieve high accuracies at this task, and that incorporating features derived from computer vision algorithms improves performance. Finally, we show that we can reliably mine visual nouns and adjectives from large corpora and that we can use these effectively in the classification task.

53 citations
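
A minimal text-only sketch of the visual/non-visual classification task, with made-up training phrases; the paper's features (including vision-derived ones), data, and evaluation are substantially richer:

```python
# Toy visualness classifier for noun phrases using text features only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

phrases = ["a red umbrella", "the wooden bench", "her summer vacation",
           "a striped cat", "their long friendship", "the green field",
           "his good mood", "a glass of water"]
labels = [1, 1, 0, 1, 0, 1, 0, 1]  # 1 = visual, 0 = non-visual

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                    LogisticRegression())
clf.fit(phrases, labels)

# On this toy data the model should lean visual / non-visual respectively.
print(clf.predict(["a red bench", "their good vacation"]))
```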


Journal Article
TL;DR: It is concluded that guidance and recognition in the context of search are not separate processes mediated by different features, and that what the literature knows as guidance is really recognition performed on blurred objects viewed in the visual periphery.
Abstract: Search is commonly described as a repeating cycle of guidance to target-like objects, followed by the recognition of these objects as targets or distractors. Are these indeed separate processes using different visual features? We addressed this question by comparing observer behavior to that of support vector machine (SVM) models trained on guidance and recognition tasks. Observers searched for a categorically defined teddy bear target in four-object arrays. Target-absent trials consisted of random category distractors rated in their visual similarity to teddy bears. Guidance, quantified as first-fixated objects during search, was strongest for targets, followed by target-similar, medium-similarity, and target-dissimilar distractors. False positive errors to first-fixated distractors also decreased with increasing dissimilarity to the target category. To model guidance, nine teddy bear detectors, using features ranging in biological plausibility, were trained on unblurred bears then tested on blurred versions of the same objects appearing in each search display. Guidance estimates were based on target probabilities obtained from these detectors. To model recognition, nine bear/nonbear classifiers, trained and tested on unblurred objects, were used to classify the object that would be fixated first (based on the detector estimates) as a teddy bear or a distractor. Patterns of categorical guidance and recognition accuracy were modeled almost perfectly by an HMAX model in combination with a color histogram feature. We conclude that guidance and recognition in the context of search are not separate processes mediated by different features, and that what the literature knows as guidance is really recognition performed on blurred objects viewed in the visual periphery.

37 citations
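
A sketch of the train-sharp, test-blurred protocol described above, using synthetic images and a simple color-histogram feature; nothing here reproduces the paper's HMAX-plus-color model or its data:

```python
# Toy version of the protocol: train a classifier on unblurred objects,
# then test on blurred versions to mimic peripheral viewing.
import numpy as np
from scipy.ndimage import gaussian_filter
from sklearn.svm import SVC

rng = np.random.default_rng(0)

def color_histogram(img, bins=8):
    """Concatenated per-channel histograms as a simple feature."""
    return np.concatenate(
        [np.histogram(img[..., c], bins=bins, range=(0, 1))[0]
         for c in range(3)]).astype(float)

# Synthetic stand-ins: "targets" skew red, "distractors" skew blue.
targets = rng.random((20, 32, 32, 3)) * [1.0, 0.4, 0.4]
distractors = rng.random((20, 32, 32, 3)) * [0.4, 0.4, 1.0]
images = np.concatenate([targets, distractors])
labels = np.array([1] * 20 + [0] * 20)

X_sharp = np.array([color_histogram(im) for im in images])
clf = SVC().fit(X_sharp, labels)        # "recognition" trained unblurred

blurred = gaussian_filter(images, sigma=(0, 2, 2, 0))  # periphery proxy
X_blur = np.array([color_histogram(im) for im in blurred])
print("accuracy on blurred objects:", clf.score(X_blur, labels))
```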


Proceedings Article
16 Jun 2012
TL;DR: This work builds on a framework borrowed from parallel convex optimization, the alternating direction method of multipliers (ADMM), to develop a new consensus-based algorithm that enables distributed parallel training of single-machine multiclass SVMs with small communication requirements.
Abstract: We present an algorithm and implementation for distributed parallel training of single-machine multiclass SVMs. While there is ongoing and healthy debate about the best strategy for multiclass classification, there are some features of the single-machine approach that are not available when training alternatives such as one-vs-all, and that are quite complex for tree based methods. One obstacle to exploring single-machine approaches on large datasets is that they are usually limited to running on a single machine! We build on a framework borrowed from parallel convex optimization — the alternating direction method of multipliers (ADMM) — to develop a new consensus based algorithm for distributed training of single-machine approaches. This is demonstrated with an implementation of our novel sequential dual algorithm (DCMSVM) which allows distributed parallel training with small communication requirements. Benchmark results show significant reduction in wall clock time compared to current state of the art multiclass SVM implementation (Liblinear) on a single node. Experiments are performed on large scale image classification including results with modern high-dimensional features.

7 citations
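
To illustrate the consensus-ADMM template the paper builds on, here is a hedged sketch for a binary linear SVM on synthetic data; the paper's DCMSVM handles the multiclass case with a sequential dual solver and a communication-efficient implementation that this toy loop does not reproduce:

```python
# Consensus ADMM: each "machine" fits a local model on its data shard,
# and a dual-driven averaging step pulls the local models to agreement.
import numpy as np

rng = np.random.default_rng(1)
d, n_workers, rho, reg = 5, 4, 1.0, 0.1

# Each worker holds its own shard of (X, y), with y in {-1, +1}.
w_true = rng.normal(size=d)
shards = []
for _ in range(n_workers):
    X = rng.normal(size=(50, d))
    y = np.sign(X @ w_true + 0.1 * rng.normal(size=50))
    shards.append((X, y))

w = [np.zeros(d) for _ in range(n_workers)]   # local models
u = [np.zeros(d) for _ in range(n_workers)]   # scaled dual variables
z = np.zeros(d)                               # consensus model

def local_solve(X, y, z, u_i, steps=50, lr=0.01):
    """Approximately minimize hinge loss + (rho/2)||w - z + u_i||^2."""
    w = z.copy()
    for _ in range(steps):
        margins = y * (X @ w)
        grad = -(X[margins < 1] * y[margins < 1, None]).sum(0)
        grad += rho * (w - z + u_i)
        w -= lr * grad
    return w

for _ in range(20):
    w = [local_solve(X, y, z, u_i) for (X, y), u_i in zip(shards, u)]
    z = sum(wi + ui for wi, ui in zip(w, u)) / n_workers
    z *= rho * n_workers / (rho * n_workers + reg)  # L2 shrinkage on z
    u = [ui + wi - z for ui, wi in zip(u, w)]       # dual update

acc = np.mean([np.mean(np.sign(X @ z) == y) for X, y in shards])
print("training accuracy of consensus model:", acc)
```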