scispace - formally typeset
Search or ask a question
Author

Diane Larlus

Bio: Diane Larlus is an academic researcher from Xerox. The author has contributed to research in topics: Object (computer science) & Computer science. The author has an hindex of 27, co-authored 69 publications receiving 4722 citations. Previous affiliations of Diane Larlus include Technische Universität Darmstadt & Naver Corporation.


Papers
More filters
Journal ArticleDOI
TL;DR: Experimental results show that colorNames learned from real-world images significantly outperform color names learned from labeled color chips for both image retrieval and image annotation.
Abstract: Color names are required in real-world applications such as image retrieval and image annotation. Traditionally, they are learned from a collection of labeled color chips. These color chips are labeled with color names within a well-defined experimental setup by human test subjects. However, naming colors in real-world images differs significantly from this experimental setting. In this paper, we investigate how color names learned from color chips compare to color names learned from real-world images. To avoid hand labeling real-world images with color names, we use Google image to collect a data set. Due to the limitations of Google image, this data set contains a substantial quantity of wrongly labeled data. We propose several variants of the PLSA model to learn color names from this noisy data. Experimental results show that color names learned from real-world images significantly outperform color names learned from labeled color chips for both image retrieval and image annotation.

752 citations

Book ChapterDOI
08 Oct 2016
TL;DR: This work proposes a novel approach for instance-level image retrieval that produces a global and compact fixed-length representation for each image by aggregating many region-wise descriptors by leveraging a ranking framework and projection weights to build the region features.
Abstract: We propose a novel approach for instance-level image retrieval. It produces a global and compact fixed-length representation for each image by aggregating many region-wise descriptors. In contrast to previous works employing pre-trained deep networks as a black box to produce features, our method leverages a deep architecture trained for the specific task of image retrieval. Our contribution is twofold: (i) we leverage a ranking framework to learn convolution and projection weights that are used to build the region features; and (ii) we employ a region proposal network to learn which regions should be pooled to form the final global descriptor. We show that using clean training data is key to the success of our approach. To that aim, we use a large scale but noisy landmark dataset and develop an automatic cleaning approach. The proposed architecture produces a global image representation in a single forward pass. Our approach significantly outperforms previous approaches based on global descriptors on standard datasets. It even surpasses most prior works based on costly local descriptor indexing and spatial verification. Additional material is available at www.xrce.xerox.com/Deep-Image-Retrieval.

727 citations

Journal ArticleDOI
TL;DR: In this article, the authors leverage a large-scale but noisy landmark dataset and develop an automatic cleaning method that produces a suitable training set for deep retrieval, and train this network with a siamese architecture that combines three streams with a triplet loss.
Abstract: While deep learning has become a key ingredient in the top performing methods for many computer vision tasks, it has failed so far to bring similar improvements to instance-level image retrieval In this article, we argue that reasons for the underwhelming results of deep methods on image retrieval are threefold: (1) noisy training data, (2) inappropriate deep architecture, and (3) suboptimal training procedure We address all three issues First, we leverage a large-scale but noisy landmark dataset and develop an automatic cleaning method that produces a suitable training set for deep retrieval Second, we build on the recent R-MAC descriptor, show that it can be interpreted as a deep and differentiable architecture, and present improvements to enhance it Last, we train this network with a siamese architecture that combines three streams with a triplet loss At the end of the training process, the proposed architecture produces a global image representation in a single forward pass that is well suited for image retrieval Extensive experiments show that our approach significantly outperforms previous retrieval approaches, including state-of-the-art methods based on costly local descriptor indexing and spatial verification On Oxford 5k, Paris 6k and Holidays, we respectively report 947, 966, and 948 mean average precision Our representations can also be heavily compressed using product quantization with little loss in accuracy

481 citations

Proceedings ArticleDOI
01 Jan 2013
TL;DR: This work argues that a per-image score instead of one computed over the entire dataset brings a lot more insight, and proposes new ways to evaluate semantic segmentation.
Abstract: In this work, we consider the evaluation of the semantic segmentation task. We discuss the strengths and limitations of the few existing measures, and propose new ways to evaluate semantic segmentation. First, we argue that a per-image score instead of one computed over the entire dataset brings a lot more insight. Second, we propose to take contours more carefully into account. Based on the conducted experiments, we suggest best practices for the evaluation. Finally, we present a user study we conducted to better understand how the quality of image segmentations is perceived by humans.

439 citations

Proceedings ArticleDOI
06 Nov 2011
TL;DR: It is experimentally shown that the descriptors used, which aggregate statistics computed from low-level local features, implicitly encode the aesthetic properties explicitly used by state-of-the-art methods and outperform them by a significant margin.
Abstract: In this paper, we automatically assess the aesthetic properties of images. In the past, this problem has been addressed by hand-crafting features which would correlate with best photographic practices (e.g. “Does this image respect the rule of thirds?”) or with photographic techniques (e.g. “Is this image a macro?”). We depart from this line of research and propose to use generic image descriptors to assess aesthetic quality. We experimentally show that the descriptors we use, which aggregate statistics computed from low-level local features, implicitly encode the aesthetic properties explicitly used by state-of-the-art methods and outperform them by a significant margin.

410 citations


Cited by
More filters
Journal ArticleDOI
TL;DR: The state-of-the-art in evaluated methods for both classification and detection are reviewed, whether the methods are statistically different, what they are learning from the images, and what the methods find easy or confuse.
Abstract: The Pascal Visual Object Classes (VOC) challenge is a benchmark in visual object category recognition and detection, providing the vision and machine learning communities with a standard dataset of images and annotation, and standard evaluation procedures. Organised annually from 2005 to present, the challenge and its associated dataset has become accepted as the benchmark for object detection. This paper describes the dataset and evaluation procedure. We review the state-of-the-art in evaluated methods for both classification and detection, analyse whether the methods are statistically different, what they are learning from the images (e.g. the object or its context), and what the methods find easy or confuse. The paper concludes with lessons learnt in the three year history of the challenge, and proposes directions for future improvement and extension.

15,935 citations

Journal ArticleDOI
TL;DR: Quantitative assessments show that SegNet provides good performance with competitive inference time and most efficient inference memory-wise as compared to other architectures, including FCN and DeconvNet.
Abstract: We present a novel and practical deep fully convolutional neural network architecture for semantic pixel-wise segmentation termed SegNet. This core trainable segmentation engine consists of an encoder network, a corresponding decoder network followed by a pixel-wise classification layer. The architecture of the encoder network is topologically identical to the 13 convolutional layers in the VGG16 network [1] . The role of the decoder network is to map the low resolution encoder feature maps to full input resolution feature maps for pixel-wise classification. The novelty of SegNet lies is in the manner in which the decoder upsamples its lower resolution input feature map(s). Specifically, the decoder uses pooling indices computed in the max-pooling step of the corresponding encoder to perform non-linear upsampling. This eliminates the need for learning to upsample. The upsampled maps are sparse and are then convolved with trainable filters to produce dense feature maps. We compare our proposed architecture with the widely adopted FCN [2] and also with the well known DeepLab-LargeFOV [3] , DeconvNet [4] architectures. This comparison reveals the memory versus accuracy trade-off involved in achieving good segmentation performance. SegNet was primarily motivated by scene understanding applications. Hence, it is designed to be efficient both in terms of memory and computational time during inference. It is also significantly smaller in the number of trainable parameters than other competing architectures and can be trained end-to-end using stochastic gradient descent. We also performed a controlled benchmark of SegNet and other architectures on both road scenes and SUN RGB-D indoor scene segmentation tasks. These quantitative assessments show that SegNet provides good performance with competitive inference time and most efficient inference memory-wise as compared to other architectures. We also provide a Caffe implementation of SegNet and a web demo at http://mi.eng.cam.ac.uk/projects/segnet/ .

13,468 citations

Book
30 Sep 2010
TL;DR: Computer Vision: Algorithms and Applications explores the variety of techniques commonly used to analyze and interpret images and takes a scientific approach to basic vision problems, formulating physical models of the imaging process before inverting them to produce descriptions of a scene.
Abstract: Humans perceive the three-dimensional structure of the world with apparent ease. However, despite all of the recent advances in computer vision research, the dream of having a computer interpret an image at the same level as a two-year old remains elusive. Why is computer vision such a challenging problem and what is the current state of the art? Computer Vision: Algorithms and Applications explores the variety of techniques commonly used to analyze and interpret images. It also describes challenging real-world applications where vision is being successfully used, both for specialized applications such as medical imaging, and for fun, consumer-level tasks such as image editing and stitching, which students can apply to their own personal photos and videos. More than just a source of recipes, this exceptionally authoritative and comprehensive textbook/reference also takes a scientific approach to basic vision problems, formulating physical models of the imaging process before inverting them to produce descriptions of a scene. These problems are also analyzed using statistical models and solved using rigorous engineering techniques Topics and features: structured to support active curricula and project-oriented courses, with tips in the Introduction for using the book in a variety of customized courses; presents exercises at the end of each chapter with a heavy emphasis on testing algorithms and containing numerous suggestions for small mid-term projects; provides additional material and more detailed mathematical topics in the Appendices, which cover linear algebra, numerical techniques, and Bayesian estimation theory; suggests additional reading at the end of each chapter, including the latest research in each sub-field, in addition to a full Bibliography at the end of the book; supplies supplementary course material for students at the associated website, http://szeliski.org/Book/. Suitable for an upper-level undergraduate or graduate-level course in computer science or engineering, this textbook focuses on basic techniques that work under real-world conditions and encourages students to push their creative boundaries. Its design and exposition also make it eminently suitable as a unique reference to the fundamental techniques and current research literature in computer vision.

4,146 citations

01 Jan 2004
TL;DR: Comprehensive and up-to-date, this book includes essential topics that either reflect practical significance or are of theoretical importance and describes numerous important application areas such as image based rendering and digital libraries.
Abstract: From the Publisher: The accessible presentation of this book gives both a general view of the entire computer vision enterprise and also offers sufficient detail to be able to build useful applications. Users learn techniques that have proven to be useful by first-hand experience and a wide range of mathematical methods. A CD-ROM with every copy of the text contains source code for programming practice, color images, and illustrative movies. Comprehensive and up-to-date, this book includes essential topics that either reflect practical significance or are of theoretical importance. Topics are discussed in substantial and increasing depth. Application surveys describe numerous important application areas such as image based rendering and digital libraries. Many important algorithms broken down and illustrated in pseudo code. Appropriate for use by engineers as a comprehensive reference to the computer vision enterprise.

3,627 citations

Proceedings ArticleDOI
07 Dec 2015
TL;DR: A minor contribution, inspired by recent advances in large-scale image search, an unsupervised Bag-of-Words descriptor is proposed that yields competitive accuracy on VIPeR, CUHK03, and Market-1501 datasets, and is scalable on the large- scale 500k dataset.
Abstract: This paper contributes a new high quality dataset for person re-identification, named "Market-1501". Generally, current datasets: 1) are limited in scale, 2) consist of hand-drawn bboxes, which are unavailable under realistic settings, 3) have only one ground truth and one query image for each identity (close environment). To tackle these problems, the proposed Market-1501 dataset is featured in three aspects. First, it contains over 32,000 annotated bboxes, plus a distractor set of over 500K images, making it the largest person re-id dataset to date. Second, images in Market-1501 dataset are produced using the Deformable Part Model (DPM) as pedestrian detector. Third, our dataset is collected in an open system, where each identity has multiple images under each camera. As a minor contribution, inspired by recent advances in large-scale image search, this paper proposes an unsupervised Bag-of-Words descriptor. We view person re-identification as a special task of image search. In experiment, we show that the proposed descriptor yields competitive accuracy on VIPeR, CUHK03, and Market-1501 datasets, and is scalable on the large-scale 500k dataset.

3,564 citations