Papers by Luc Van Gool published in 2015


Journal ArticleDOI
TL;DR: A review of the Pascal Visual Object Classes challenge from 2008-2012 and an appraisal of the aspects of the challenge that worked well, and those that could be improved in future challenges.
Abstract: The Pascal Visual Object Classes (VOC) challenge consists of two components: (i) a publicly available dataset of images together with ground truth annotation and standardised evaluation software; and (ii) an annual competition and workshop. There are five challenges: classification, detection, segmentation, action classification, and person layout. In this paper we provide a review of the challenge from 2008–2012. The paper is intended for two audiences: algorithm designers, researchers who want to see what the state of the art is, as measured by performance on the VOC datasets, along with the limitations and weak points of the current generation of algorithms; and, challenge designers, who want to see what we as organisers have learnt from the process and our recommendations for the organisation of future challenges. To analyse the performance of submitted algorithms on the VOC datasets we introduce a number of novel evaluation methods: a bootstrapping method for determining whether differences in the performance of two algorithms are significant or not; a normalised average precision so that performance can be compared across classes with different proportions of positive instances; a clustering method for visualising the performance across multiple algorithms so that the hard and easy images can be identified; and the use of a joint classifier over the submitted algorithms in order to measure their complementarity and combined performance. We also analyse the community's progress through time using the methods of Hoiem et al. (Proceedings of European Conference on Computer Vision, 2012) to identify the types of occurring errors. We conclude the paper with an appraisal of the aspects of the challenge that worked well, and those that could be improved in future challenges.
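
The bootstrap test mentioned above is easy to sketch. Below is a minimal paired bootstrap over per-image scores (a simplification: the actual VOC analysis resamples test images and recomputes average precision for both methods on each resample; all names are illustrative):

```python
import numpy as np

def paired_bootstrap(scores_a, scores_b, n_boot=10_000, seed=0):
    """Test whether method A beats method B by resampling test images.
    scores_a/scores_b: numpy arrays of per-image performance values."""
    rng = np.random.default_rng(seed)
    n = len(scores_a)
    diffs = np.empty(n_boot)
    for t in range(n_boot):
        idx = rng.integers(0, n, size=n)        # resample images with replacement
        diffs[t] = scores_a[idx].mean() - scores_b[idx].mean()
    lo, hi = np.percentile(diffs, [2.5, 97.5])  # 95% confidence interval
    return lo, hi, (lo > 0) or (hi < 0)         # significant if CI excludes 0
```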

6,061 citations


Proceedings ArticleDOI
07 Dec 2015
TL;DR: The proposed method, Deep EXpectation (DEX) of apparent age, first detects the face in the test image and then extracts the CNN predictions from an ensemble of 20 networks on the cropped face, significantly outperforming the human reference.
Abstract: In this paper we tackle the estimation of apparent age in still face images with deep learning. Our convolutional neural networks (CNNs) use the VGG-16 architecture [13] and are pretrained on ImageNet for image classification. In addition, due to the limited number of apparent age annotated images, we explore the benefit of finetuning over crawled Internet face images with available age. We crawled 0.5 million images of celebrities from IMDB and Wikipedia that we make public. This is the largest public dataset for age prediction to date. We pose the age regression problem as a deep classification problem followed by a softmax expected value refinement and show improvements over direct regression training of CNNs. Our proposed method, Deep EXpectation (DEX) of apparent age, first detects the face in the test image and then extracts the CNN predictions from an ensemble of 20 networks on the cropped face. The CNNs of DEX were finetuned on the crawled images and then on the provided images with apparent age annotations. DEX does not use explicit facial landmarks. Our DEX is the winner (1st place) of the ChaLearn LAP 2015 challenge on apparent age estimation with 115 registered teams, significantly outperforming the human reference.
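
The softmax expected value refinement is compact enough to show directly. A minimal sketch, assuming the network outputs one logit per discrete age bin (the paper's apparent-age setup covers ages 0-100):

```python
import numpy as np

def dex_expected_age(logits, ages=np.arange(0, 101)):
    """Deep EXpectation: treat age estimation as classification over discrete
    age bins, then refine by taking the softmax-weighted expected value."""
    p = np.exp(logits - logits.max())   # numerically stable softmax
    p /= p.sum()
    return float((p * ages).sum())      # E[age] under the predicted distribution
```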

603 citations


Proceedings ArticleDOI
07 Jun 2015
TL;DR: A new method is introduced that uses a supervised approach in order to learn the importance of global characteristics of a summary and jointly optimizes for multiple objectives, thus creating summaries that possess multiple properties of a good summary.
Abstract: We present a novel method for summarizing raw, casually captured videos. The objective is to create a short summary that still conveys the story. It should thus be both interesting and representative of the input video. Previous methods often used simplified assumptions and only optimized for one of these goals. Alternatively, they used hand-defined objectives that were optimized sequentially by making consecutive hard decisions. This limits their use to a particular setting. Instead, we introduce a new method that (i) uses a supervised approach in order to learn the importance of global characteristics of a summary and (ii) jointly optimizes for multiple objectives and thus creates summaries that possess multiple properties of a good summary. Experiments on two challenging and very diverse datasets demonstrate the effectiveness of our method, where we outperform or match the current state of the art.

452 citations


Posted Content
TL;DR: The Improved A+ (IA) method sets new state-of-the-art results, outperforming A+ by up to 0.9 dB in average PSNR whilst maintaining a low time complexity.
Abstract: In this paper we present seven techniques that everybody should know to improve example-based single image super resolution (SR): 1) augmentation of data, 2) use of large dictionaries with efficient search structures, 3) cascading, 4) image self-similarities, 5) back projection refinement, 6) enhanced prediction by consistency check, and 7) context reasoning. We validate our seven techniques on standard SR benchmarks (i.e. Set5, Set14, B100) and methods (i.e. A+, SRCNN, ANR, Zeyde, Yang) and achieve substantial improvements. The techniques are widely applicable and require no changes or only minor adjustments of the SR methods. Moreover, our Improved A+ (IA) method sets new state-of-the-art results, outperforming A+ by up to 0.9 dB in average PSNR whilst maintaining a low time complexity.
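
Technique 6, enhanced prediction by consistency check, averages the super-resolved outputs over the eight dihedral transforms of the input. A minimal sketch, where `upscale` stands for any base SR method (a hypothetical callable, not a specific implementation from the paper):

```python
import numpy as np

def enhanced_prediction(lr_img, upscale):
    """Average SR outputs over the 8 dihedral transforms (4 rotations x flip),
    undoing each transform on the output before averaging."""
    outputs = []
    for k in range(4):
        for flip in (False, True):
            t = np.rot90(lr_img, k)
            if flip:
                t = np.fliplr(t)
            sr = upscale(t)                      # apply the base SR method
            if flip:
                sr = np.fliplr(sr)               # undo the flip first
            outputs.append(np.rot90(sr, -k))     # then undo the rotation
    return np.mean(outputs, axis=0)
```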

290 citations


Journal ArticleDOI
TL;DR: A robust and fast-to-evaluate energy function is defined, based on enforcing color similarity between the boundaries and the superpixel color histogram, which is able to achieve performance comparable to the state-of-the-art, but in real time on a single Intel i7 CPU at 2.8 GHz.
Abstract: Superpixel algorithms aim to over-segment the image by grouping pixels that belong to the same object. Many state-of-the-art superpixel algorithms rely on minimizing objective functions to enforce color homogeneity. The optimization is accomplished by sophisticated methods that progressively build the superpixels, typically by adding cuts or growing superpixels. As a result, they are computationally too expensive for real-time applications. We introduce a new approach based on a simple hill-climbing optimization. Starting from an initial superpixel partitioning, it continuously refines the superpixels by modifying the boundaries. We define a robust and fast-to-evaluate energy function, based on enforcing color similarity between the boundaries and the superpixel color histogram. In a series of experiments, we show that we achieve an excellent compromise between accuracy and efficiency. We are able to achieve performance comparable to the state-of-the-art, but in real time on a single Intel i7 CPU at 2.8 GHz.
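
A much-simplified sketch of the boundary hill-climbing idea: move a boundary pixel to an adjacent superpixel when its quantized color is better represented in that superpixel's color histogram. The actual method also enforces superpixel connectivity and uses coarser block-level moves, both omitted here:

```python
import numpy as np

def hill_climb_superpixels(labels, bins, n_sp, n_bins, n_iter=4):
    """Refine an initial partitioning by histogram-driven boundary moves.
    labels: (h, w) initial superpixel ids in [0, n_sp), modified in place
    bins:   (h, w) per-pixel quantized color index in [0, n_bins)"""
    h, w = labels.shape
    hist = np.zeros((n_sp, n_bins))                       # per-superpixel color histograms
    np.add.at(hist, (labels.ravel(), bins.ravel()), 1)
    size = hist.sum(axis=1)
    for _ in range(n_iter):
        for y in range(h):
            for x in range(w):
                s, b = labels[y, x], bins[y, x]
                best, best_score = s, hist[s, b] / size[s]
                for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < h and 0 <= nx < w and labels[ny, nx] != s:
                        t = labels[ny, nx]
                        score = hist[t, b] / size[t]      # how well t's histogram explains this color
                        if score > best_score:
                            best, best_score = t, score
                if best != s and size[s] > 1:             # keep superpixels non-empty
                    hist[s, b] -= 1; size[s] -= 1
                    hist[best, b] += 1; size[best] += 1
                    labels[y, x] = best
    return labels
```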

168 citations


Proceedings ArticleDOI
07 Jun 2015
TL;DR: It is shown that a properly trained pure-3D approach produces high-quality labelings, with significant speed benefits allowing entire streets to be analyzed in a matter of minutes, and a novel facade separation based on semantic nuances between facades is proposed.
Abstract: We propose a new approach for semantic segmentation of 3D city models. Starting from an SfM reconstruction of a street-side scene, we perform classification and facade splitting purely in 3D, obviating the need for slow image-based semantic segmentation methods. We show that a properly trained pure-3D approach produces high quality labelings, with significant speed benefits (20x faster) allowing us to analyze entire streets in a matter of minutes. Additionally, if speed is not of the essence, the 3D labeling can be combined with the results of a state-of-the-art 2D classifier, further boosting the performance. Further, we propose a novel facade separation based on semantic nuances between facades. Finally, inspired by the use of architectural principles for 2D facade labeling, we propose new 3D-specific principles and an efficient optimization scheme based on an integer quadratic programming formulation.

148 citations


Proceedings ArticleDOI
07 Dec 2015
TL;DR: In this article, an inverse coarse-to-fine cascade is proposed to select the most promising object locations and refine their boxes in a coarse-to-fine manner, which combines the best of both worlds.
Abstract: In this paper we evaluate the quality of the activation layers of a convolutional neural network (CNN) for the generation of object proposals. We generate hypotheses in a sliding-window fashion over different activation layers and show that the final convolutional layers can find the object of interest with high recall but poor localization due to the coarseness of the feature maps. Instead, the first layers of the network can better localize the object of interest but with a reduced recall. Based on this observation we design a method for proposing object locations that is based on CNN features and that combines the best of both worlds. We build an inverse cascade that, going from the final to the initial convolutional layers of the CNN, selects the most promising object locations and refines their boxes in a coarse-to-fine manner. The method is efficient, because i) it uses the same features extracted for detection, ii) it aggregates features using integral images, and iii) it avoids a dense evaluation of the proposals due to the inverse coarse-to-fine cascade. The method is also accurate, it outperforms most of the previously proposed object proposals approaches and when plugged into a CNN-based detector produces state-of-the-art detection performance.
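
The integral-image aggregation in point ii) is the standard constant-time box-sum trick; a minimal single-channel sketch:

```python
import numpy as np

def integral_image(feat):
    """Cumulative sums with a zero row/column so box sums need 4 lookups."""
    ii = np.zeros((feat.shape[0] + 1, feat.shape[1] + 1))
    ii[1:, 1:] = feat.cumsum(axis=0).cumsum(axis=1)
    return ii

def box_sum(ii, y0, x0, y1, x1):
    """Sum of feat[y0:y1, x0:x1] in O(1), independent of box size."""
    return ii[y1, x1] - ii[y0, x1] - ii[y1, x0] + ii[y0, x0]
```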

111 citations


Posted Content
TL;DR: The first comprehensive study and analysis of the usefulness of image super-resolution for other vision applications is presented, covering edge detection, semantic image segmentation, digit recognition, and scene recognition.
Abstract: Despite the great advances made in the field of image super-resolution (ISR) in recent years, the performance has merely been evaluated perceptually. Thus, it is still unclear whether ISR is helpful for other vision tasks. In this paper, we present the first comprehensive study and analysis of the usefulness of ISR for other vision applications. In particular, six ISR methods are evaluated on four popular vision tasks, namely edge detection, semantic image segmentation, digit recognition, and scene recognition. We show that applying ISR to input images of other vision systems does improve their performance when the input images are of low resolution. We also study the correlation between four standard perceptual evaluation criteria (namely PSNR, SSIM, IFC, and NQM) and the usefulness of ISR to the vision tasks. Experiments show that they correlate well with each other in general, but perceptual criteria are still not accurate enough to be used as full proxies for usefulness. We hope this work will inspire the community to evaluate ISR methods also in real vision applications, and to adopt ISR as a pre-processing step of other vision tasks if the resolution of their input images is low.
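
Of the four perceptual criteria, PSNR is the simplest to restate; a minimal sketch for 8-bit images:

```python
import numpy as np

def psnr(ref, test, peak=255.0):
    """Peak signal-to-noise ratio in dB between a reference and a test image."""
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    return np.inf if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)
```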

74 citations


Proceedings ArticleDOI
05 Jan 2015
TL;DR: This work first extracts sparse pixel correspondences by means of a matching procedure and then applies a variational approach to obtain a refined optical flow, coined 'SparseFlow', which is competitive on standard optical flow benchmarks with large displacements, while showing excellent performance for small and medium displacements.
Abstract: Despite recent advances, the extraction of optical flow with large displacements is still challenging for state-of-the-art methods. The approaches that are the most successful at handling large displacements blend sparse correspondences from a matching algorithm with an optimization that refines the optical flow. We follow the scheme of DeepFlow [33]. We first extract sparse pixel correspondences by means of a matching procedure and then apply a variational approach to obtain a refined optical flow. In our approach, coined 'SparseFlow', the novelty lies in the matching. This uses an efficient sparse decomposition of a pixel's surrounding patch as a linear sum of those found around candidate corresponding pixels. The pixel dominating the decomposition is chosen as the match. The pixel pairs matching in both directions, i.e. in a forward-backward fashion, are used as guiding points in the variational approach. SparseFlow is competitive on standard optical flow benchmarks with large displacements, while showing excellent performance for small and medium displacements. Moreover, it is fast in comparison to methods with a similar performance.
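
A toy version of the matching step: decompose the query patch over the candidate patches and keep the candidate that dominates the decomposition. Ridge regression stands in for the paper's sparse decomposition here, so this is an illustrative simplification rather than the actual SparseFlow matcher:

```python
import numpy as np

def match_patch(query, candidates, lam=0.1):
    """Return the index of the candidate patch dominating the decomposition.
    query: (d,) flattened patch; candidates: (n, d) flattened patches."""
    A = candidates / (np.linalg.norm(candidates, axis=1, keepdims=True) + 1e-12)
    q = query / (np.linalg.norm(query) + 1e-12)
    # min ||q - A^T c||^2 + lam ||c||^2  ->  (A A^T + lam I) c = A q
    c = np.linalg.solve(A @ A.T + lam * np.eye(len(A)), A @ q)
    return int(np.argmax(np.abs(c)))     # dominating candidate wins
```

Only pairs that also match in the backward direction would then be kept as guiding points for the variational refinement.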

73 citations


Proceedings ArticleDOI
07 Jun 2015
TL;DR: This work proposes a novel surface reconstruction method based on image edges, superpixels and second-order smoothness constraints, producing meshes comparable to classic MVS surfaces in quality but orders of magnitude faster.
Abstract: Multi-View-Stereo (MVS) methods aim for the highest detail possible, however, such detail is often not required. In this work, we propose a novel surface reconstruction method based on image edges, superpixels and second-order smoothness constraints, producing meshes comparable to classic MVS surfaces in quality but orders of magnitude faster. Our method performs per-view dense depth optimization directly over sparse 3D Ground Control Points (GCPs), hence removing the need for view pairing, image rectification, and stereo depth estimation, and allowing for full per-image parallelization. We use Structure-from-Motion (SfM) points as GCPs, but the method is not specific to these, e.g. LiDAR or RGB-D data can also be used. The resulting meshes are compact and inherently edge-aligned with image gradients, enabling good-quality lightweight per-face flat renderings. Our experiments demonstrate superior speed and competitive surface quality on a variety of 3D datasets.
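
The core step, per-view dense depth from sparse GCPs under a second-order smoothness prior, can be sketched as a sparse least-squares problem. This is a toy stand-in for the paper's edge- and superpixel-aware formulation; all names are hypothetical:

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def densify_depth(h, w, gcp_rc, gcp_depth, lam=1.0):
    """Fit a dense (h, w) depth map to sparse GCP depths under a
    second-order (Laplacian) smoothness prior.
    gcp_rc: (m, 2) int array of GCP pixel rows/cols; gcp_depth: (m,)."""
    gcp_rc = np.asarray(gcp_rc)
    n = h * w
    idx = np.arange(n).reshape(h, w)
    # Data term: depth at GCP pixels should match the GCP depths
    gi = idx[gcp_rc[:, 0], gcp_rc[:, 1]]
    D = sp.coo_matrix((np.ones(len(gi)), (np.arange(len(gi)), gi)), shape=(len(gi), n))
    # Smoothness term: discrete Laplacian should be near zero at interior pixels
    rows, cols, vals, r = [], [], [], 0
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            rows += [r] * 5
            cols += [idx[y, x], idx[y - 1, x], idx[y + 1, x], idx[y, x - 1], idx[y, x + 1]]
            vals += [4.0, -1.0, -1.0, -1.0, -1.0]
            r += 1
    Lap = sp.coo_matrix((vals, (rows, cols)), shape=(r, n))
    A = sp.vstack([D, lam * Lap]).tocsr()
    b = np.r_[gcp_depth, np.zeros(r)]
    z = spla.lsqr(A, b)[0]               # least-squares dense depth
    return z.reshape(h, w)
```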

Posted Content
TL;DR: This work improves the state-of-the-art, but also predicts - based on someone's known preferences - how much that particular person is attracted to a novel face, and validates the collaborative filtering solution on the standard MovieLens rating dataset.
Abstract: For people, first impressions of someone are of determining importance. They are hard to alter through further information. This begs the question of whether a computer can reach the same judgement. Earlier research has already pointed out that age, gender, and average attractiveness can be estimated with reasonable precision. We improve the state-of-the-art, but also predict - based on someone's known preferences - how much that particular person is attracted to a novel face. Our computational pipeline comprises a face detector, convolutional neural networks for the extraction of deep features, standard support vector regression for gender, age and facial beauty, and - as the main novelties - visual regularized collaborative filtering to infer inter-person preferences as well as a novel regression technique for handling visual queries without rating history. We validate the method using a very large dataset from a dating site as well as images from celebrities. Our experiments yield convincing results, i.e. we predict 76% of the ratings correctly solely based on an image, and reveal some sociologically relevant conclusions. We also validate our collaborative filtering solution on the standard MovieLens rating dataset, augmented with movie posters, to predict an individual's movie rating. We demonstrate our algorithms on this http URL, which went viral around the Internet with more than 50 million pictures evaluated in the first month.

Proceedings ArticleDOI
07 Dec 2015
TL;DR: This paper studies the transition from the Pascal Visual Object Challenge dataset to the updated, bigger, and more challenging Microsoft Common Objects in Context, and proposes various lines of research to take advantage of the new benchmark and improve the techniques.
Abstract: Computer vision in general, and object proposals in particular, are nowadays strongly influenced by the databases on which researchers evaluate the performance of their algorithms. This paper studies the transition from the Pascal Visual Object Challenge dataset, which has been the benchmark of reference in recent years, to the updated, bigger, and more challenging Microsoft Common Objects in Context. We first review and deeply analyze the new challenges, and opportunities, that this database presents. We then survey the current state of the art in object proposals and evaluate it focusing on how it generalizes to the new dataset. In light of these results, we propose various lines of research to take advantage of the new benchmark and improve the techniques. We explore one of these lines, which leads to an improvement over the state of the art of +5.2%.

Proceedings ArticleDOI
07 Jun 2015
TL;DR: This work investigates how coarse category labels can be used to improve the classification of subcategories, adopting the framework of Random Forests and proposing a regularized objective function that takes into account relations between categories and subcategories.
Abstract: The number of digital images is growing extremely rapidly, and so is the need for their classification. But, as more images of pre-defined categories become available, they also become more diverse and cover finer semantic differences. Ultimately, the categories themselves need to be divided into subcategories to account for that semantic refinement. Image classification in general has improved significantly over the last few years, but it still requires a massive amount of manually annotated data. Subdividing categories into subcategories multiplies the number of labels, aggravating the annotation problem. Hence, we can expect the annotations to be refined only for a subset of the already labeled data, and exploit coarser labeled data to improve classification. In this work, we investigate how coarse category labels can be used to improve the classification of subcategories. To this end, we adopt the framework of Random Forests and propose a regularized objective function that takes into account relations between categories and subcategories. Compared to approaches that disregard the extra coarse labeled data, we achieve a relative improvement in subcategory classification accuracy of up to 22% in our large-scale image classification experiments.
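
A toy version of such a regularized split objective: score a candidate Random Forest split by the information gain on fine labels plus a weighted gain on coarse labels, so that coarsely labeled data still shapes the trees. The paper's exact objective may differ; this only illustrates the idea:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a discrete label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def regularized_gain(fine_l, fine_r, coarse_l, coarse_r, lam=0.5):
    """Split score = gain on fine (subcategory) labels
                   + lam * gain on coarse (category) labels."""
    def gain(left, right):
        n = len(left) + len(right)
        parent = np.concatenate([left, right])
        return entropy(parent) - (len(left) / n) * entropy(left) \
                               - (len(right) / n) * entropy(right)
    return gain(fine_l, fine_r) + lam * gain(coarse_l, coarse_r)
```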

Proceedings ArticleDOI
07 Jun 2015
TL;DR: Experiments show that MI is able to provide good metrics while avoiding expensive data labeling efforts and that it achieves state-of-the-art performance for image super-resolution.
Abstract: Metric learning has proved very successful. However, human annotations are necessary. In this paper, we propose an unsupervised method, dubbed Metric Imitation (MI), where metrics over cheap features (target features, TFs) are learned by imitating the standard metrics over more sophisticated, off-the-shelf features (source features, SFs) by transferring view-independent property manifold structures. In particular, MI consists of: 1) quantifying the properties of source metrics as manifold geometry, 2) transferring the manifold from the source domain to the target domain, and 3) learning a mapping of TFs so that the manifold is approximated as well as possible in the mapped feature domain. MI is useful in at least two scenarios: 1) where TFs are more efficient computationally and in terms of memory than SFs; and 2) where SFs contain privileged information, but are not available during testing. For the former, MI is evaluated on image clustering, category-based image retrieval, and instance-based object retrieval, with three SFs and three TFs. For the latter, MI is tested on the task of example-based image super-resolution, where high-resolution patches are taken as SFs and low-resolution patches as TFs. Experiments show that MI is able to provide good metrics while avoiding expensive data labeling efforts and that it achieves state-of-the-art performance for image super-resolution. In addition, manifold transfer is an interesting direction of transfer learning.
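
A simplified sketch of steps 1)-3) in the spirit of a locality-preserving projection: build a kNN affinity graph on the source features, then learn a linear map of the target features that preserves that manifold. This is an illustrative flavor, not the paper's exact formulation:

```python
import numpy as np
from scipy.linalg import eigh
from scipy.spatial.distance import cdist

def metric_imitation_map(sf, tf, k=5, dim=16):
    """Learn a linear map W for cheap target features (tf) that preserves the
    kNN manifold of expensive source features (sf).
    sf: (n, d_s), tf: (n, d_t); returns W of shape (d_t, dim)."""
    n = len(sf)
    d = cdist(sf, sf)                        # source-domain distances
    A = np.zeros((n, n))
    nn = np.argsort(d, axis=1)[:, 1:k + 1]   # k nearest neighbors (skip self)
    for i in range(n):
        A[i, nn[i]] = np.exp(-d[i, nn[i]] ** 2 / (d[i, nn[i]].mean() ** 2 + 1e-12))
    A = np.maximum(A, A.T)                   # symmetrize the affinity
    D = np.diag(A.sum(axis=1))
    L = D - A                                # graph Laplacian of the SF manifold
    # min sum_ij A_ij ||W^T x_i - W^T x_j||^2  s.t.  W^T X^T D X W = I
    M1, M2 = tf.T @ L @ tf, tf.T @ D @ tf
    vals, vecs = eigh(M1, M2 + 1e-8 * np.eye(tf.shape[1]))
    return vecs[:, :dim]                     # smallest eigenvectors span the map
```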

Proceedings ArticleDOI
07 Jun 2015
TL;DR: A novel vanishing point (VP) detection and tracking algorithm for calibrated monocular image sequences by combining VP extraction on a Gaussian sphere with recent advances in multi-target tracking on probabilistic occupancy fields is presented.
Abstract: We present a novel vanishing point (VP) detection and tracking algorithm for calibrated monocular image sequences. Previous VP detection and tracking methods usually assume known camera poses for all frames or detect and track separately. We advance the state-of-the-art by combining VP extraction on a Gaussian sphere with recent advances in multi-target tracking on probabilistic occupancy fields. The solution is obtained by solving a Linear Program (LP). This enables the joint detection and tracking of multiple VPs over sequences. Unlike existing works we do not need known camera poses, and at the same time avoid detecting and tracking in separate steps. We also propose an extension to enforce VP orthogonality. We augment an existing video dataset consisting of 48 monocular videos with multiple annotated VPs in 14448 frames for evaluation. Although the method is designed for unknown camera poses, it is also helpful in scenarios with known poses, since a multi-frame approach in VP detection helps to regularize in frames with weak VP line support.
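
The Gaussian-sphere representation used here has a compact geometric core: a calibrated line segment maps to the great circle whose normal is the cross product of its endpoint rays, and two segments hypothesize a VP at the intersection of their great circles. A minimal sketch, assuming the inverse intrinsics K_inv are known:

```python
import numpy as np

def to_ray(pt, K_inv):
    """Back-project an image point to a unit ray on the Gaussian sphere."""
    r = K_inv @ np.array([pt[0], pt[1], 1.0])
    return r / np.linalg.norm(r)

def vp_from_two_segments(seg1, seg2, K_inv):
    """Each segment (pair of pixel endpoints) maps to a great circle with
    normal n = ray(p) x ray(q); two great circles intersect at the VP."""
    n1 = np.cross(to_ray(seg1[0], K_inv), to_ray(seg1[1], K_inv))
    n2 = np.cross(to_ray(seg2[0], K_inv), to_ray(seg2[1], K_inv))
    v = np.cross(n1, n2)
    return v / np.linalg.norm(v)             # VP as a unit direction (up to sign)
```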

Journal ArticleDOI
TL;DR: A particle matching technique is used to reduce the dependency on prior knowledge in a semi-supervised motion segmentation algorithm by automatically matching particles across frames in which fast motion or occlusion occurs.

Journal ArticleDOI
Radu Timofte, Luc Van Gool
TL;DR: A novel sparse representation, the Iterative Nearest Neighbors (INN), that combines the power of SR and LLE with the computational simplicity of kNN is proposed, achieving performance on par with or better than MP and OMP on the sparse signal recovery task.

Proceedings ArticleDOI
07 Jun 2015
TL;DR: The algorithm that is proposed - coined 'Make My Day' or MMD for short - is akin to the previously published BM3D denoising algorithm and outperforms other state-of-the-art denoising methods in terms of PSNR, texture quality, and color fidelity.
Abstract: We address the task of restoring RGB images taken under low illumination (e.g. night time), when an aligned near-infrared (NIR or simply N) image taken under stronger NIR illumination is available. Such restoration holds the promise that algorithms designed to work under daylight conditions could be used around the clock. Increasingly, RGBN cameras are becoming available, as car cameras tend to include a near-infrared (N) band next to the R, G, and B bands, and NIR artificial lighting is applied. Under low lighting conditions, the NIR band is less noisy than the others, and this is all the more the case if stronger illumination is only available in the NIR band. We address the task of restoring the R, G, and B bands on the basis of the NIR band in such cases. Even if the NIR band is less strongly correlated with the R, G, and B bands than these bands are mutually, there is sufficient correlation to pick up important textural and gradient information in the NIR band and inject it into the others. The algorithm that we propose - coined 'Make My Day' or MMD for short - is akin to the previously published BM3D denoising algorithm. MMD denoises the three (visible − NIR) differential images and then adds back the original NIR image. It not only effectively reduces the noise but also injects the texture and edge information in the high spatial frequency range. MMD outperforms other state-of-the-art denoising methods in terms of PSNR, texture quality, and color fidelity. We publish our code and images.
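
The decompose-denoise-recompose structure of MMD fits in a few lines. In this sketch a Gaussian filter stands in for the BM3D-style filtering the paper actually uses, so treat it as an illustration of the pipeline, not of the denoiser:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def mmd_restore(rgb, nir, sigma=2.0):
    """Denoise each (visible - NIR) differential image, then add the clean
    NIR band back to re-inject high-frequency texture and edges.
    rgb: (h, w, 3) noisy low-light image; nir: (h, w) cleaner NIR band."""
    out = np.empty_like(rgb, dtype=np.float64)
    for c in range(3):
        diff = rgb[..., c].astype(np.float64) - nir       # differential image
        out[..., c] = gaussian_filter(diff, sigma) + nir  # denoise, add NIR back
    return out
```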

Journal ArticleDOI
TL;DR: Experiments show that the deformation field can better approximate real object deformations and therefore, for certain classes, produces even better detection accuracy than state-of-the-art DPM.
Abstract: Deformable Parts Models (DPM) are the current state-of-the-art for object detection. Nevertheless, they seem sub-optimal in the representation of deformations. Object deformations are often continuous and not confined to big parts. Therefore we propose to replace the DPM star model based on big parts by a deformation field. This consists of a grid of small parts connected with pairwise constraints which can better handle continuous deformations. The naive application of this model for object detection would consist of a bounded sliding window approach: for each possible location of the image the best part configuration within a limited bound around this location is found. This is computationally very expensive. Instead, we propose a different inference procedure, where an iterative image-level search finds the best object hypothesis. We show that this approach is faster than bounded sliding windows yet produces comparable accuracy. Experiments further show that the deformation field can better approximate real object deformations and therefore, for certain classes, produces even better detection accuracy than state-of-the-art DPM. Finally, the same approach is adapted to model-free tracking, showing improved accuracy also in this case.

Proceedings ArticleDOI
26 May 2015
TL;DR: An efficient method to detect lens flares within aerial images based on the position of the sun with respect to the observer is presented, and this approach is able to compensate for errors in the parameters influencing the calculation of the lens flare direction.
Abstract: The goal of integrating drones into the civil airspace requires a technical system which robustly detects, tracks and finally avoids aerial objects. Electro-optical cameras have proven to be an adequate sensor to detect traffic, especially for smaller aircraft, gliders or paragliders. However, the very challenging environmental conditions and image artifacts such as lens flares often result in a high number of false detections. Depending on the solar radiation, lens flares are very common in aerial images and hard to distinguish from aerial objects on a collision course due to their similar size, shape, brightness and trajectories. In this paper we present an efficient method to detect lens flares within aerial images based on the position of the sun with respect to the observer. Using the date, time, position and attitude of the observer, we predict the lens flare direction within the image. Once the direction is known, the position, size and shape of the lens flares are extracted. Experiments show that our approach is able to compensate for errors in the parameters influencing the calculation of the lens flare direction. We further integrate the lens flare detection into an aerial object tracking framework. A detailed evaluation of the framework with and without the lens flare filter shows that false tracks due to lens flares are successfully suppressed without degrading the overall tracking system performance.
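
The geometric core of the prediction: flare ghosts lie on the image line through the projected sun position and the principal point. A minimal sketch, assuming the sun direction (from an ephemeris given date, time and position) and the camera attitude are already available:

```python
import numpy as np

def flare_line(sun_dir_world, R_wc, K):
    """Predict the image line along which lens-flare ghosts appear.
    sun_dir_world: unit vector toward the sun (assumed given by an ephemeris);
    R_wc: world-to-camera rotation; K: 3x3 camera intrinsics."""
    s = R_wc @ sun_dir_world
    if s[2] <= 0:
        return None                      # sun behind the camera: no prediction
    p = K @ (s / s[2])                   # sun's image point (may be off-frame)
    c = np.array([K[0, 2], K[1, 2]])     # principal point
    d = p[:2] - c
    return c, d / np.linalg.norm(d)      # anchor point and flare direction
```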

Proceedings ArticleDOI
05 Jan 2015
TL;DR: This paper proposes a learned collaborative representation based classifier (LCRC), based on the fixed point theorem with a weights formulation similar to WCRC as the starting point, and shows that the learning procedure is stable and convergent, and that LCRC is able to improve in performance over CRC and WCRC, while keeping the same computational efficiency at test time.
Abstract: The collaborative representation-based classifier (CRC) has been proposed as an alternative to the sparse representation-based classifier (SRC) for image face recognition. CRC solves an l2-regularized least squares formulation, with an algebraic solution, while SRC optimizes over an l1-regularized least squares problem. As an extension of CRC, the weighted collaborative representation-based classifier (WCRC) has further been proposed. The weights in WCRC are picked intuitively; it remains unclear why such a choice of weights works and how those weights could be optimized. In this paper, we propose a learned collaborative representation based classifier (LCRC) and attempt to answer the above questions. Our learning technique is based on the fixed point theorem and we use a weights formulation similar to WCRC as the starting point. Through extensive experiments on face datasets we show that the learning procedure is stable and convergent, and that LCRC is able to improve in performance over CRC and WCRC, while keeping the same computational efficiency at test time.
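
CRC's closed-form core is short enough to show: code the query over all training samples with an l2 penalty, then classify by the class-wise reconstruction residual. A minimal sketch of plain CRC (the learned weights of LCRC are omitted):

```python
import numpy as np

def crc_classify(X, y, query, lam=0.01):
    """Collaborative representation classifier.
    X: (d, n) column-stacked training samples; y: (n,) labels; query: (d,)."""
    # l2-regularized coding over ALL training samples, in closed form
    alpha = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ query)
    best, best_res = None, np.inf
    for c in np.unique(y):
        mask = (y == c)
        res = np.linalg.norm(query - X[:, mask] @ alpha[mask])  # class residual
        if res < best_res:
            best, best_res = c, res
    return best
```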

Proceedings ArticleDOI
07 Dec 2015
TL;DR: This paper proposes a novel method to infer the higher dimensional properties of the material's BRDF, based on the statistical distribution of known material characteristics observed in real-life samples, and evaluates the method based on a large set of experiments generated from real-world BRDFs and newly measured materials.
Abstract: The problem of estimating a full BRDF from partial observations has already been studied using either parametric or non-parametric approaches. The goal in each case is to best match this sparse set of input measurements. In this paper we address the problem of inferring higher order reflectance information starting from the minimal input of a single BRDF slice. We begin from the prototypical case of a homogeneous sphere, lit by a head-on light source, which only holds information about less than 0.001% of the whole BRDF domain. We propose a novel method to infer the higher dimensional properties of the material's BRDF, based on the statistical distribution of known material characteristics observed in real-life samples. We evaluated our method based on a large set of experiments generated from real-world BRDFs and newly measured materials. Although inferring higher dimensional BRDFs from such modest training is not a trivial problem, our method performs better than state-of-the-art parametric, semi-parametric and non-parametric approaches. Finally, we discuss interesting applications on material re-lighting, and flash-based photography.

Proceedings ArticleDOI
07 Dec 2015
TL;DR: This paper uses convolutional neural networks with VGG-16 architecture, pretrained on ImageNet or the Places205 dataset for image classification, and fine-tuned on cultural events data to solve the classification of cultural events from a single image with a deep learning based method.
Abstract: In this paper we tackle the classification of cultural events from a single image with a deep learning based method. We use convolutional neural networks (CNNs) with the VGG-16 architecture [17], pretrained on ImageNet or the Places205 dataset for image classification, and fine-tuned on cultural events data. CNN features are robustly extracted at 4 different layers in each image. At each layer, Linear Discriminant Analysis (LDA) is employed for discriminative dimensionality reduction. An image is represented by the concatenated LDA-projected features from all layers or by the concatenation of CNN pooled features at each layer. The classification is then performed through the Iterative Nearest Neighbors-based Classifier (INNC) [20]. Classification scores are obtained for different image representation setups at training and test time. The average of the scores is the output of our deep linear discriminative retrieval (DLDR) system. With 0.80 mean average precision (mAP), DLDR is a top entry for the ChaLearn LAP 2015 cultural event recognition challenge.
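
The per-layer LDA projection plus concatenation can be sketched with scikit-learn, assuming the per-layer CNN features have already been extracted:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def dldr_features(layer_feats, labels):
    """Project each layer's CNN features with LDA (at most n_classes-1 dims),
    then concatenate the projections into one descriptor per image.
    layer_feats: list of (n_images, d_layer) arrays from different layers."""
    projected = []
    for F in layer_feats:
        lda = LinearDiscriminantAnalysis()
        projected.append(lda.fit_transform(F, labels))  # discriminative reduction
    return np.hstack(projected)
```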

Book ChapterDOI
13 Jan 2015
TL;DR: This paper unifies the proposed subRW and other popular random walk algorithms, and designs a subRW algorithm with label prior to solve the segmentation problem of objects with thin and elongated parts.
Abstract: In this paper, we propose a subMarkov random walk (subRW) with label prior, realized with added auxiliary nodes, for seeded image segmentation. We unify the proposed subRW and other popular random walk algorithms. This unifying view can transfer the intrinsic findings between different random walk algorithms, and offers new ideas for designing novel random walk algorithms by changing the auxiliary nodes. Building on the second benefit, we design a subRW algorithm with label prior to solve the segmentation problem of objects with thin and elongated parts. The experimental results on natural images with twigs demonstrate that our algorithm achieves better performance than previous random walk algorithms.
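
For orientation, the classic two-label seeded random walker that this family of algorithms builds on can be sketched as a sparse linear (Dirichlet) problem; subRW's auxiliary label-prior nodes are omitted here:

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def random_walker(img, fg, bg, beta=90.0):
    """Classic two-label seeded random walker on a 4-connected grid.
    img: (h, w) grayscale in [0, 1]; fg, bg: flat indices of seed pixels."""
    h, w = img.shape
    n = h * w
    idx = np.arange(n).reshape(h, w)
    g = img.ravel()
    # 4-connected edges with Gaussian weights on intensity differences
    a = np.r_[idx[:, :-1].ravel(), idx[:-1, :].ravel()]
    b = np.r_[idx[:, 1:].ravel(), idx[1:, :].ravel()]
    wgt = np.exp(-beta * (g[a] - g[b]) ** 2)
    W = sp.coo_matrix((np.r_[wgt, wgt], (np.r_[a, b], np.r_[b, a])), shape=(n, n))
    L = (sp.diags(np.asarray(W.sum(axis=1)).ravel()) - W).tocsr()
    seeds = np.r_[fg, bg]
    x_s = np.r_[np.ones(len(fg)), np.zeros(len(bg))]   # P(label = fg) at seeds
    free = np.setdiff1d(np.arange(n), seeds)
    # Dirichlet problem: L_uu x_u = -L_us x_s
    x_u = spla.spsolve(L[free][:, free].tocsc(), -L[free][:, seeds] @ x_s)
    prob = np.zeros(n)
    prob[seeds], prob[free] = x_s, x_u
    return prob.reshape(h, w)              # threshold at 0.5 to segment
```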

Proceedings Article
01 Jan 2015
TL;DR: A novel automatic recognition framework for hand-written mensural music, which takes a scanned manuscript as input and yields modern music scores as output, working as a complete pipeline that integrates both recognition and transcription.
Abstract: This paper presents a novel automatic recognition framework for hand-written mensural music. It takes a scanned manuscript as input and yields modern music scores as output. Compared to previous mensural Optical Music Recognition (OMR) systems, ours shows not only promising performance in music recognition, but also works as a complete pipeline which integrates both recognition and transcription. There are three main parts in this pipeline: i) region-of-interest detection, ii) music symbol detection and classification, and iii) transcription to modern music. In addition to the output in modern notation, our system can generate a MIDI file as well. It provides an easy platform for musicologists to analyze old manuscripts. Moreover, it renders these valuable cultural heritage resources available to non-specialists as well, as they can now access such ancient music in a more understandable form.

Proceedings ArticleDOI
10 Dec 2015
TL;DR: This paper proposes an efficient novel post-processing algorithm based on the adjusted anchored neighborhood regression (A+) method from the image super-resolution literature that greatly improves the results of demosaicing methods, achieving image quality competitive with SAPCA but orders of magnitude faster.
Abstract: Color demosaicing is the process of reconstructing missing pixel values in an incompletely sampled color image. By exploiting spatial-spectral correlations of the RGB channels, various interpolation methods with low computational complexity have been proposed. Meanwhile, optimization strategies such as the sparsity and adaptive PCA based algorithm (SAPCA) were developed. SAPCA outperforms many interpolation techniques by impressive margins, at the cost of dramatically increased computational time. In this paper we propose an efficient novel post-processing algorithm based on the adjusted anchored neighborhood regression (A+) method from the image super-resolution literature. We greatly improve the results of demosaicing methods, and achieve image quality competitive with SAPCA but orders of magnitude faster.
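
The A+-style inference the post-processing builds on is simple at test time: match a patch feature to its most correlated dictionary anchor and apply that anchor's offline-trained linear regressor. A minimal sketch with hypothetical precomputed anchors and projections:

```python
import numpy as np

def aplus_apply(feat, anchors, projections):
    """A+-style anchored regression at test time.
    feat: (d,) patch feature; anchors: (m, d) l2-normalized dictionary atoms;
    projections: (m, out_d, d) per-anchor regressors trained offline."""
    f = feat / (np.linalg.norm(feat) + 1e-12)
    k = int(np.argmax(anchors @ f))        # most correlated anchor
    return projections[k] @ feat           # linear mapping to the output patch
```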

Proceedings ArticleDOI
06 Jan 2015
TL;DR: This paper proposes a novel method for detecting and tracking groups of mutually orthogonal vanishing points (MOVP), also known as Manhattan frames, jointly from monocular videos, and shows that the method outperforms a greedy MOVP tracking method considerably.
Abstract: While vanishing point (VP) estimation has received extensive attention, most approaches focus on static images or perform detection and tracking separately. In this paper, we focus on man-made environments and propose a novel method for detecting and tracking groups of mutually orthogonal vanishing points (MOVP), also known as Manhattan frames, jointly from monocular videos. The method is unique in that it is designed to enforce orthogonality in groups of VPs, temporal consistency of each individual MOVP, and orientation consistency of all putative MOVP. To this end, the method consists of three steps: 1) proposal of MOVP candidates by directly incorporating mutual orthogonality; 2) extraction of consistent tracks of MOVPs by minimizing the flow cost over a network where nodes are putative MOVPs and edges are putative links across time; and 3) refinement of all MOVPs by enforcing consistency between lines and their identified vanishing directions, as well as consistency of the global camera orientation. The method is evaluated on six newly collected and annotated videos of urban scenes. Extensive experiments show that the method outperforms a greedy MOVP tracking method considerably. In addition, we also test the method for camera orientation estimation and show that it obtains very promising results on a challenging street-view dataset.

Proceedings ArticleDOI
18 May 2015
TL;DR: This work proposes to learn discriminative features based on a small set of annotated images to organize product databases by image classification, which allows for fast feature extraction and training, is easy to implement and does not require powerful dedicated hardware.
Abstract: Fashion is a major segment in e-commerce with growing importance and a steadily increasing number of products. Since manual annotation of apparel items is very tedious, product databases need to be organized automatically, e.g. by image classification. Common image classification approaches are based on features engineered for general purposes, which perform poorly on specific images of apparel. We therefore propose to learn discriminative features based on a small set of annotated images. We experimentally evaluate our method on a dataset with 30,000 images containing apparel items, and compare it to other engineered and learned sets of features. The classification accuracy of our features is significantly superior to that of designed HOG and SIFT features (43.7% and 16.1% relative improvement, respectively). Our method allows for fast feature extraction and training, is easy to implement and, unlike deep convolutional networks, does not require powerful dedicated hardware.

25 Sep 2015
TL;DR: A novel GPU-accelerated implementation that calculates the shape normals, as well as the albedo and ambient lighting, through the Photometric Stereo technique, providing users with real-time feedback on the recording process and thereby altering the way in which dome-shaped devices can be used.
Abstract: Dome-shaped devices consisting of a single digital camera and multiple light sources have been used in the past for the 3D scanning of objects. They leverage Photometric Stereo techniques in order to build detailed 3D models of these objects. Their advantage is that they can pick up even subtle details of the shape. Yet, these systems typically suffer from high recording and processing times. This paper introduces a novel GPU-accelerated implementation that calculates the shape normals, as well as the albedo and ambient lighting, through the Photometric Stereo technique, providing users with real-time feedback on the recording process. An originally serial algorithm was mapped to the architecture of an NVIDIA GPU and the CUDA programming platform. To maximize performance, various optimizations were applied, such as reducing the total number of memory accesses, coalescing the memory accesses into the minimal number of transactions, reducing register usage to avoid spilling, hiding latency, and maximizing thread occupancy. Our method reduces the processing time, accelerating the original implementation by a factor of 950, thereby altering the way in which such devices can be used.
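
The per-pixel computation that a GPU parallelizes in this setting is classic Lambertian photometric stereo: solve I = L g with g = albedo * n by least squares for every pixel. A minimal CPU sketch of that core (ambient-light estimation omitted):

```python
import numpy as np

def photometric_stereo(images, light_dirs):
    """Per-pixel Lambertian photometric stereo.
    images: (k, h, w) intensities under k known lights; light_dirs: (k, 3)."""
    k, h, w = images.shape
    I = images.reshape(k, -1)                            # one column per pixel
    # Least squares for the scaled normals g = albedo * n, all pixels at once
    G, *_ = np.linalg.lstsq(light_dirs, I, rcond=None)   # (3, h*w)
    albedo = np.linalg.norm(G, axis=0)
    normals = G / (albedo + 1e-12)                       # unit surface normals
    return normals.reshape(3, h, w), albedo.reshape(h, w)
```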