Showing papers by "Andrew Rabinovich" published in 2015

Open Access

Proceedings Article•DOI•

Christian Szegedy¹, Wei Liu², Yangqing Jia¹, Pierre Sermanet¹, Scott Reed³, Dragomir Anguelov¹, Dumitru Erhan¹, Vincent Vanhoucke¹, Andrew Rabinovich•Institutions (3)

Google¹, University of North Carolina at Chapel Hill², University of Michigan³

07 Jun 2015

TL;DR: Inception is a deep convolutional neural network architecture that set a new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14).

Abstract: We propose a deep convolutional neural network architecture codenamed Inception that achieves the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14). The main hallmark of this architecture is the improved utilization of the computing resources inside the network. By a carefully crafted design, we increased the depth and width of the network while keeping the computational budget constant. To optimize quality, the architectural decisions were based on the Hebbian principle and the intuition of multi-scale processing. One particular incarnation used in our submission for ILSVRC14 is called GoogLeNet, a 22 layers deep network, the quality of which is assessed in the context of classification and detection.

40,257 citations

Posted Content•

ParseNet: Looking Wider to See Better

Wei Liu¹, Andrew Rabinovich, Alexander C. Berg¹•Institutions (1)

University of North Carolina at Chapel Hill¹

15 Jun 2015-arXiv: Computer Vision and Pattern Recognition

TL;DR: This work presents a technique for adding global context to deep convolutional networks for semantic segmentation, and achieves state-of-the-art performance on SiftFlow and PASCAL-Context with small additional computational cost over baselines.

Abstract: We present a technique for adding global context to deep convolutional networks for semantic segmentation. The approach is simple, using the average feature for a layer to augment the features at each location. In addition, we study several idiosyncrasies of training, significantly increasing the performance of baseline networks (e.g. from FCN). When we add our proposed global feature, and a technique for learning normalization parameters, accuracy increases consistently even over our improved versions of the baselines. Our proposed approach, ParseNet, achieves state-of-the-art performance on SiftFlow and PASCAL-Context with small additional computational cost over baselines, and near current state-of-the-art performance on PASCAL VOC 2012 semantic segmentation with a simple approach. Code is available at this https URL .
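The global-context idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the nested-list feature layout (channels × height × width), and the fixed `scale` standing in for the learned normalization parameters are all assumptions made for illustration.

```python
import math

def global_average(feature_map):
    """Average each channel over all spatial positions (one value per channel)."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
            for ch in feature_map]

def l2_normalize(vec, scale=1.0, eps=1e-12):
    """Rescale a vector to (approximately) unit L2 norm, then multiply by scale.

    In a trained network `scale` would be learned; here it is a fixed assumption.
    """
    norm = math.sqrt(sum(v * v for v in vec)) + eps
    return [scale * v / norm for v in vec]

def add_global_context(feature_map, scale=1.0):
    """Append the normalized global-average feature to every spatial location."""
    height, width = len(feature_map[0]), len(feature_map[0][0])
    context = l2_normalize(global_average(feature_map), scale)
    # Broadcast each global value to a full height x width channel.
    tiled = [[[c] * width for _ in range(height)] for c in context]
    return feature_map + tiled

# Two 2x2 channels in, four channels out (two local + two global).
features = [[[1.0, 3.0], [5.0, 7.0]], [[0.0, 0.0], [0.0, 0.0]]]
augmented = add_global_context(features)
```

Every location in the appended channels carries the same value, so the extra cost is one pooling pass plus a concatenation.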

1,166 citations

Proceedings Article•

Training deep neural networks on noisy labels with bootstrapping

Scott Reed¹, Honglak Lee¹, Dragomir Anguelov², Christian Szegedy², Dumitru Erhan³, Andrew Rabinovich²•Institutions (3)

University of Michigan¹, Google², Microsoft³

01 Jan 2015

TL;DR: This paper proposes a generic way to handle noisy and incomplete labeling by augmenting the prediction objective with a notion of consistency: a prediction is consistent if the same prediction is made given similar percepts, where similarity is measured between deep network features computed from the input data.

Abstract: Current state-of-the-art deep learning systems for visual object recognition and detection use purely supervised training with regularization such as dropout to avoid overfitting. The performance depends critically on the amount of labeled examples, and in current practice the labels are assumed to be unambiguous and accurate. However, this assumption often does not hold; e.g. in recognition, class labels may be missing; in detection, objects in the image may not be localized; and in general, the labeling may be subjective. In this work we propose a generic way to handle noisy and incomplete labeling by augmenting the prediction objective with a notion of consistency. We consider a prediction consistent if the same prediction is made given similar percepts, where the notion of similarity is between deep network features computed from the input data. In experiments we demonstrate that our approach yields substantial robustness to label noise on several datasets. On MNIST handwritten digits, we show that our model is robust to label corruption. On the Toronto Face Database, we show that our model handles well the case of subjective labels in emotion recognition, achieving state-of-the-art results, and can also benefit from unlabeled face images with no modification to our method. On the ILSVRC2014 detection challenge data, we show that our approach extends to very deep networks, high resolution images and structured outputs, and results in improved scalable detection.
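The abstract does not spell out the objective. One way a consistency term of this kind can be realized is to blend the (possibly noisy) label with the model's own prediction before computing cross-entropy. The sketch below assumes that form; the function name, the blending weight `beta`, and its default value are illustrative assumptions, not details taken from this page.

```python
import math

def soft_bootstrap_loss(probs, noisy_target, beta=0.95, eps=1e-12):
    """Cross-entropy against a blend of the given label and the model's prediction.

    probs        -- predicted class probabilities (sum to 1)
    noisy_target -- one-hot (or soft) label that may be wrong
    beta         -- trust in the given label; beta=1 is plain cross-entropy
    """
    blended = [beta * t + (1.0 - beta) * q
               for t, q in zip(noisy_target, probs)]
    return -sum(b * math.log(q + eps) for b, q in zip(blended, probs))

# The model is confident in class 0, but the label says class 1.
probs = [0.9, 0.1]
label = [0.0, 1.0]
plain = soft_bootstrap_loss(probs, label, beta=1.0)    # ordinary cross-entropy
blended = soft_bootstrap_loss(probs, label, beta=0.5)  # label is trusted less
```

When the label disagrees with a confident prediction, lowering `beta` reduces the penalty, which is the intended robustness to label corruption.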

377 citations

Proceedings Article•DOI•

What's Cookin'? Interpreting Cooking Videos using Text, Speech and Vision

Jonathan Malmaud¹, Jonathan Huang², Vivek Rathod³, Nick Johnston⁴, Andrew Rabinovich⁴, Kevin Murphy⁵•Institutions (5)

Massachusetts Institute of Technology¹, Stanford University², Columbia University³, Google⁴, Imperial College London⁵

01 Jan 2015

TL;DR: A novel method for aligning a sequence of instructions to a video of someone carrying out a task, using an HMM over the speech transcript refined by a visual food detector based on a deep convolutional neural network; it outperforms simpler techniques based on keyword spotting.

Abstract: We present a novel method for aligning a sequence of instructions to a video of someone carrying out a task. In particular, we focus on the cooking domain, where the instructions correspond to the recipe. Our technique relies on an HMM to align the recipe steps to the (automatically generated) speech transcript. We then refine this alignment using a state-of-the-art visual food detector, based on a deep convolutional neural network. We show that our technique outperforms simpler techniques based on keyword spotting. It also enables interesting applications, such as automatically illustrating recipes with keyframes, and searching within a video for events of interest.
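The transcript-alignment step described above can be illustrated with a much-simplified sketch: a monotonic Viterbi pass in which each transcript segment is assigned a recipe step and the step index may only stay the same or advance by one. The word-overlap score, the uniform transitions, and all names here are assumptions for illustration; the paper's HMM and its visual refinement are not reproduced.

```python
def align_steps(steps, segments):
    """Assign each transcript segment a recipe-step index that never decreases."""
    def score(step, segment):
        # Hypothetical emission score: number of shared lowercase words.
        return len(set(step.lower().split()) & set(segment.lower().split()))

    n, m = len(steps), len(segments)
    neg = float("-inf")
    best = [[neg] * n for _ in range(m)]   # best[t][s]: best score ending in step s
    back = [[0] * n for _ in range(m)]     # back[t][s]: previous step on that path
    best[0][0] = score(steps[0], segments[0])  # alignment starts at the first step

    for t in range(1, m):
        for s in range(n):
            stay = best[t - 1][s]
            advance = best[t - 1][s - 1] if s > 0 else neg
            prev, src = (stay, s) if stay >= advance else (advance, s - 1)
            if prev > neg:
                best[t][s] = prev + score(steps[s], segments[t])
                back[t][s] = src

    # Trace back from the best-scoring final step.
    s = max(range(n), key=lambda k: best[m - 1][k])
    path = [s]
    for t in range(m - 1, 0, -1):
        s = back[t][s]
        path.append(s)
    return path[::-1]

recipe = ["chop the onions", "fry the onions in oil", "add salt"]
transcript = ["first chop onions", "now heat oil and fry",
              "keep frying in oil", "finally add some salt"]
alignment = align_steps(recipe, transcript)  # [0, 1, 1, 2]
```

The monotonicity constraint is what distinguishes this from plain keyword spotting: a segment is matched in the context of the steps that precede it.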

139 citations

Posted Content•

What's Cookin'? Interpreting Cooking Videos using Text, Speech and Vision

Jonathan Malmaud¹, Jonathan Huang², Vivek Rathod³, Nick Johnston⁴, Andrew Rabinovich⁴, Kevin Murphy⁵•Institutions (5)

Massachusetts Institute of Technology¹, Stanford University², Columbia University³, Google⁴, Imperial College London⁵

05 Mar 2015-arXiv: Computation and Language

TL;DR: The authors align a sequence of instructions to a video of someone carrying out a task, using an HMM to align the recipe steps to the (automatically generated) speech transcript and then refining this alignment with a state-of-the-art visual food detector based on a deep convolutional neural network.

19 citations

Proceedings Article•

Self-informed neural network structure learning

David Warde-Farley¹, Andrew Rabinovich², Dragomir Anguelov²•Institutions (2)

Université de Montréal¹, Google²

01 Jan 2015

TL;DR: This paper proposes a method for augmenting a trained neural network classifier with auxiliary capacity, designed to significantly improve an already well-performing model while minimally impacting its computational footprint.

Abstract: We study the problem of large scale, multi-label visual recognition with a large number of possible classes. We propose a method for augmenting a trained neural network classifier with auxiliary capacity in a manner designed to significantly improve upon an already well-performing model, while minimally impacting its computational footprint. Using the predictions of the network itself as a descriptor for assessing visual similarity, we define a partitioning of the label space into groups of visually similar entities. We then augment the network with auxiliary hidden layer pathways with connectivity only to these groups of label units. We report a significant improvement in mean average precision on a large-scale object recognition task with the augmented model, while increasing the number of multiply-adds by less than 3%.
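The abstract does not say how the label space is partitioned. As a rough illustration only, the sketch below assumes each label is described by its column of prediction scores over a set of examples, and greedily groups labels whose columns have high cosine similarity. The greedy scheme, the `threshold` value, and the function names are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def group_labels(predictions, threshold=0.8):
    """Greedily partition labels by similarity of their prediction columns.

    predictions[i][k] is the network's score for label k on example i.
    Each label joins the first group whose seed label is similar enough,
    otherwise it starts a new group.
    """
    num_labels = len(predictions[0])
    columns = [[row[k] for row in predictions] for k in range(num_labels)]
    groups = []
    for k in range(num_labels):
        for group in groups:
            if cosine(columns[k], columns[group[0]]) >= threshold:
                group.append(k)
                break
        else:
            groups.append([k])
    return groups

# Labels 0 and 1 fire on the same examples; label 2 fires on a different one.
scores = [[0.9, 0.8, 0.0],
          [0.8, 0.9, 0.1],
          [0.0, 0.1, 0.9]]
label_groups = group_labels(scores)  # [[0, 1], [2]]
```

Each resulting group would then receive its own auxiliary pathway connected only to that group's label units, which is how the added capacity stays small.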

4 citations