Showing papers by "Luc Van Gool" published in 2014


Book ChapterDOI
01 Nov 2014
TL;DR: This work proposes A+, an improved variant of Anchored Neighborhood Regression that combines the best qualities of ANR and SF: it builds on the features and anchored regressors of ANR but, instead of learning the regressors on the dictionary, learns them from the full training material, similar to SF.
Abstract: We address the problem of image upscaling in the form of single image super-resolution based on a dictionary of low- and high-resolution exemplars. Two recently proposed methods, Anchored Neighborhood Regression (ANR) and Simple Functions (SF), provide state-of-the-art quality performance. Moreover, ANR is among the fastest known super-resolution methods. ANR learns sparse dictionaries and regressors anchored to the dictionary atoms. SF relies on clusters and corresponding learned functions. We propose A+, an improved variant of ANR, which combines the best qualities of ANR and SF. A+ builds on the features and anchored regressors from ANR but instead of learning the regressors on the dictionary it uses the full training material, similar to SF. We validate our method on standard images and compare with state-of-the-art methods. We obtain improved quality (i.e. 0.2–0.7 dB PSNR better than ANR) and excellent time complexity, rendering A+ the most efficient dictionary-based super-resolution method to date.
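
A minimal sketch of the anchored-regression idea in Python, assuming precomputed low-resolution features, high-resolution patches, and dictionary anchors (all l2-normalized); the neighborhood size k and ridge weight lam are illustrative, not the paper's tuned values:

```python
import numpy as np

def train_anchored_regressors(lo_feats, hi_patches, anchors, k=2048, lam=0.1):
    """Offline: for each dictionary anchor, fit a ridge regressor from
    low-res features to high-res patches on the anchor's k most
    correlated training samples (the A+ idea: regress on the full
    training material, not on dictionary atoms as in plain ANR)."""
    regressors = []
    for a in anchors:
        idx = np.argsort(-(lo_feats @ a))[:k]   # anchor's neighborhood
        L, H = lo_feats[idx], hi_patches[idx]
        # closed-form ridge regression: W = H^T L (L^T L + lam I)^-1
        W = H.T @ L @ np.linalg.inv(L.T @ L + lam * np.eye(L.shape[1]))
        regressors.append(W)
    return regressors

def upscale_patch(feat, anchors, regressors):
    """Online: route the feature to its most correlated anchor and
    apply that anchor's precomputed linear map."""
    j = int(np.argmax(anchors @ feat))
    return regressors[j] @ feat
```

The online cost is one nearest-anchor search plus one matrix-vector product per patch, which is why this family of methods is fast.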

1,418 citations


Book ChapterDOI
06 Sep 2014
TL;DR: A novel method to mine discriminative parts using Random Forests (RF), which allows parts to be mined simultaneously for all classes and knowledge to be shared among them; the method also compares favourably to other state-of-the-art component-based classification methods.
Abstract: In this paper we address the problem of automatically recognizing pictured dishes. To this end, we introduce a novel method to mine discriminative parts using Random Forests (RF), which allows us to mine for parts simultaneously for all classes and to share knowledge among them. To improve efficiency of mining and classification, we only consider patches that are aligned with image superpixels, which we call components. To measure the performance of our RF component mining for food recognition, we introduce a novel and challenging dataset of 101 food categories, with 101,000 images. With an average accuracy of 50.76%, our model outperforms alternative classification methods except for CNN, including SVM classification on Improved Fisher Vectors and existing discriminative part-mining algorithms by 11.88% and 8.13%, respectively. On the challenging MIT-Indoor dataset, our method compares nicely to other state-of-the-art component-based classification methods.

1,216 citations


Book ChapterDOI
06 Sep 2014
TL;DR: This paper proposes a novel approach and a new benchmark for video summarization. It focuses on user videos, i.e. raw videos containing a set of interesting events, and generates high-quality results comparable to manual, human-created summaries.
Abstract: This paper proposes a novel approach and a new benchmark for video summarization. Thereby we focus on user videos, which are raw videos containing a set of interesting events. Our method starts by segmenting the video by using a novel “superframe” segmentation, tailored to raw videos. Then, we estimate visual interestingness per superframe using a set of low-, mid- and high-level features. Based on this scoring, we select an optimal subset of superframes to create an informative and interesting summary. The introduced benchmark comes with multiple human-created summaries, which were acquired in a controlled psychological experiment. This data paves the way to evaluate summarization methods objectively and to gain new insights into video summarization. When evaluating our method, we find that it generates high-quality results, comparable to manual, human-created summaries.
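
The final selection step, choosing superframes that maximize total interestingness under a summary-length budget, is essentially a 0/1 knapsack problem; a minimal dynamic-programming sketch under that assumption (the paper's exact objective may include further terms):

```python
def select_superframes(scores, lengths, budget):
    """0/1 knapsack: maximize total interestingness subject to a total
    summary length <= budget (lengths given as integer frame counts)."""
    n = len(scores)
    # best[b] = (total score, chosen indices) for length budget <= b
    best = [(0.0, [])] * (budget + 1)
    for i in range(n):
        for b in range(budget, lengths[i] - 1, -1):
            cand = best[b - lengths[i]][0] + scores[i]
            if cand > best[b][0]:
                best[b] = (cand, best[b - lengths[i]][1] + [i])
    return best[budget][1]

# e.g. select_superframes([0.9, 0.4, 0.7], [30, 20, 25], budget=50) -> [0, 1]
```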

592 citations


Book ChapterDOI
06 Sep 2014
TL;DR: It is shown that a properly trained vanilla DPM reaches top performance, improving over commercial and research systems, and that a detector based on rigid templates, similar in structure to the Viola&Jones detector, can reach similar top performance on this task.
Abstract: Face detection is a mature problem in computer vision. While diverse high-performing face detectors have been proposed in the past, we present two surprising new top-performance results. First, we show that a properly trained vanilla DPM reaches top performance, improving over commercial and research systems. Second, we show that a detector based on rigid templates, similar in structure to the Viola&Jones detector, can reach similar top performance on this task. Importantly, we discuss issues with the existing evaluation benchmark and propose an improved procedure.

588 citations


Journal ArticleDOI
01 Apr 2014
TL;DR: The paper proposes a pipeline for the efficient detection and recognition of traffic signs from such images, and combines 2D and 3D techniques to improve results beyond the state-of-the-art, which is still very much preoccupied with single view analysis.
Abstract: Several applications require information about street furniture. Part of the task is to survey all traffic signs. This has to be done for millions of km of road, and the exercise needs to be repeated every so often. We used a van with eight roof-mounted cameras to drive through the streets and took images every meter. The paper proposes a pipeline for the efficient detection and recognition of traffic signs from such images. The task is challenging, as illumination conditions change regularly, occlusions are frequent, sign positions and orientations vary substantially, and the actual signs are far less similar among equal types than one might expect. We combine 2D and 3D techniques to improve results beyond the state-of-the-art, which is still very much preoccupied with single view analysis. For the initial detection in single frames, we use a set of colour- and shape-based criteria. They yield a set of candidate sign patterns. The selection of such candidates allows for a significant speed-up over a sliding window approach while keeping similar performance. A speed-up is also achieved through a proposed efficient bounded evaluation of AdaBoost detectors. The 2D detections in multiple views are subsequently combined to generate 3D hypotheses. A Minimum Description Length formulation yields the set of 3D traffic signs that best explains the 2D detections. The paper comes with a publicly available database, with more than 13,000 traffic sign annotations.
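
The bounded evaluation of an AdaBoost detector can be illustrated as early termination: stop accumulating weak-learner responses once even the best possible remaining contribution cannot lift the score above the detection threshold. A hedged sketch, assuming weak learners with responses in {-1, +1}; the paper's actual bounding scheme is not reproduced here:

```python
def bounded_adaboost_score(x, weak_learners, alphas, threshold):
    """Evaluate sum_t alpha_t * h_t(x) with early rejection: if the
    running score plus the maximum attainable remainder falls below
    the threshold, this window cannot become a detection."""
    # remaining[t] = sum of alphas from t onward = best case for the rest
    remaining = [0.0] * (len(alphas) + 1)
    for t in range(len(alphas) - 1, -1, -1):
        remaining[t] = remaining[t + 1] + alphas[t]
    score = 0.0
    for t, (h, a) in enumerate(zip(weak_learners, alphas)):
        score += a * h(x)              # h(x) assumed in {-1, +1}
        if score + remaining[t + 1] < threshold:
            return None                # early reject: bound cannot be met
    return score
```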

309 citations


Book ChapterDOI
01 Nov 2014
TL;DR: This paper builds on the recent Affinity Propagation Clustering algorithm, which passes messages between data points to identify cluster exemplars, and shows that it provides a promising solution to the shortcomings of greedy NMS.
Abstract: Non-maximum suppression (NMS) is a key post-processing step in many computer vision applications. In the context of object detection, it is used to transform a smooth response map that triggers many imprecise object window hypotheses into, ideally, a single bounding box for each detected object. The most common approach for NMS for object detection is a greedy, locally optimal strategy with several hand-designed components (e.g., thresholds). Such a strategy inherently suffers from several shortcomings, such as the inability to detect nearby objects. In this paper, we try to alleviate these problems and explore a novel formulation of NMS as a well-defined clustering problem. Our method builds on the recent Affinity Propagation Clustering algorithm, which passes messages between data points to identify cluster exemplars. Contrary to the greedy approach, our method is solved globally and its parameters can be automatically learned from training data. In experiments, we show in two contexts – object class and generic object detection – that it provides a promising solution to the shortcomings of the greedy NMS.
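
A minimal sketch of the clustering view of NMS, using scikit-learn's AffinityPropagation on an IoU similarity with detection scores (assumed in [0, 1]) as exemplar preferences; unlike the paper, nothing is learned here:

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def cluster_nms(boxes, scores):
    """Group overlapping detections by affinity propagation and keep
    the highest-scoring box of each cluster as its output."""
    n = len(boxes)
    S = np.array([[iou(boxes[i], boxes[j]) for j in range(n)] for i in range(n)])
    # detection scores act as exemplar preferences (same [0, 1] scale as IoU)
    ap = AffinityPropagation(affinity='precomputed', preference=scores,
                             random_state=0)
    labels = ap.fit_predict(S)
    return [int(max(np.flatnonzero(labels == c), key=lambda i: scores[i]))
            for c in np.unique(labels)]
```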

186 citations


Proceedings ArticleDOI
23 Jun 2014
TL;DR: A latent representation model is introduced, in which the discrimination of the learned dictionary is exploited by minimizing the within-class scatter of coding coefficients and the latent-value weighted dictionary coherence; a latent sparse representation-based classifier is also presented.
Abstract: Dictionary learning (DL) for sparse coding has shown promising results in classification tasks, but how to adaptively build the relationship between dictionary atoms and class labels remains an important open question. The existing dictionary learning approaches simply fix a dictionary atom to be either class-specific or shared by all classes beforehand, ignoring that the relationship needs to be updated during DL. To address this issue, in this paper we propose a novel latent dictionary learning (LDL) method to learn a discriminative dictionary and build its relationship to class labels adaptively. Each dictionary atom is jointly learned with a latent vector, which associates this atom to the representation of different classes. More specifically, we introduce a latent representation model, in which discrimination of the learned dictionary is exploited via minimizing the within-class scatter of coding coefficients and the latent-value weighted dictionary coherence. The optimal solution is efficiently obtained by the proposed solving algorithm. Correspondingly, a latent sparse representation based classifier is also presented. Experimental results demonstrate that our algorithm outperforms many recently proposed sparse representation and dictionary learning approaches for action, gender and face recognition.
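
Schematically, and only as a hedged reading of the abstract, the objective couples reconstruction with the two discrimination terms; the symbols (codes x_i, class means m_c, atoms d_i, latent values v_i) and the trade-off weights are illustrative, not the paper's exact formulation:

```latex
\min_{D,\,V,\,X}\ \|Y - DX\|_F^2
  + \lambda_1 \sum_{c}\sum_{i \in c} \|x_i - m_c\|_2^2   % within-class scatter of codes
  + \lambda_2 \sum_{i \neq j} v_i v_j\, |d_i^\top d_j|   % latent-value weighted coherence
  \quad \text{subject to sparsity on } X
```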

128 citations


Proceedings ArticleDOI
29 Sep 2014
TL;DR: A novel method is described for creating 3D models of persons freely moving in front of a consumer depth sensor, and it is shown how these models can be used for long-term person re-identification.
Abstract: In this work, we describe a novel method for creating 3D models of persons freely moving in front of a consumer depth sensor and we show how they can be used for long-term person re-identification. For overcoming the problem of the different poses a person can assume, we exploit the information provided by skeletal tracking algorithms for warping every point cloud frame to a standard pose in real time. Then, the warped point clouds are merged together to compose the model. Re-identification is performed by matching body shapes in terms of whole point clouds warped to a standard pose with the described method. We compare this technique with a classification method based on a descriptor of skeleton features and with a mixed approach which exploits both skeleton and shape features. We report experiments on two datasets we acquired for RGB-D re-identification which use different skeletal tracking algorithms and which are made publicly available to foster research in this new research branch.

109 citations


Book ChapterDOI
06 Sep 2014
TL;DR: In this paper, the geometry of a 3D mesh model obtained from multi-view reconstruction is exploited to predict the best view before the actual labeling, which leads to a further reduction of computation time and a gain in accuracy.
Abstract: There is an increasing interest in semantically annotated 3D models, e.g. of cities. The typical approaches start with the semantic labelling of all the images used for the 3D model. Such labelling tends to be very time consuming though. The inherent redundancy among the overlapping images calls for more efficient solutions. This paper proposes an alternative approach that exploits the geometry of a 3D mesh model obtained from multi-view reconstruction. Instead of clustering similar views, we predict the best view before the actual labelling. For this we find the single image part that best supports the correct semantic labelling of each face of the underlying 3D mesh. Moreover, our single-image approach may surprise because it tends to increase the accuracy of the model labelling when compared to approaches that fuse the labels from multiple images. As a matter of fact, we even go a step further, and only explicitly label a subset of faces (e.g. 10%), to subsequently fill in the labels of the remaining faces. This leads to a further reduction of computation time, again combined with a gain in accuracy. Compared to a process that starts from the semantic labelling of the images, our method to semantically label 3D models yields accelerations of about 2 orders of magnitude. We tested our multi-view semantic labelling on a variety of street scenes.
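
Best-view prediction per mesh face can be sketched as scoring every unoccluded camera by how frontally and how closely it sees the face; this proxy criterion and all names are assumptions, not the paper's exact ranking:

```python
import numpy as np

def best_view_per_face(face_normals, face_centers, cam_centers, visible):
    """Pick, for every mesh face, the camera with the most frontal,
    closest, unoccluded view. visible[f, c] is a precomputed boolean
    occlusion test (e.g. by depth-buffer rendering)."""
    n_faces, n_cams = visible.shape
    best = np.full(n_faces, -1)
    for f in range(n_faces):
        score_best = -np.inf
        for c in range(n_cams):
            if not visible[f, c]:
                continue
            ray = cam_centers[c] - face_centers[f]
            dist = np.linalg.norm(ray)
            # frontal views score high, distant cameras are penalized
            score = np.dot(face_normals[f], ray / dist) / dist
            if score > score_best:
                score_best, best[f] = score, c
    return best  # best[f] = camera index to label face f from, or -1
```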

100 citations


Book ChapterDOI
01 Jan 2014
TL;DR: A comparison between two techniques for one-shot person re-identification from soft biometric cues: one is based upon a descriptor composed of features provided by a skeleton estimation algorithm; the other compares body shapes as whole point clouds, warped to a standard pose by a novel technique.
Abstract: In this chapter, we propose a comparison between two techniques for one-shot person re-identification from soft biometric cues. One is based upon a descriptor composed of features provided by a skeleton estimation algorithm; the other compares body shapes in terms of whole point clouds. This second approach relies on a novel technique we propose to warp the subject’s point cloud to a standard pose, which allows us to disregard the problem of the different poses a person can assume. This technique is also used for composing 3D models which are then used at testing time for matching unseen point clouds. We test the proposed approaches on an existing RGB-D re-identification dataset and on the newly built BIWI RGBD-ID dataset. This dataset provides sequences of RGB, depth, and skeleton data for 50 people in two different scenarios and it has been made publicly available to foster advancement in this new research branch.

98 citations


Proceedings ArticleDOI
23 Jun 2014
TL;DR: This work introduces Nearest Class Mean Forests (NCMF), a variant of Random Forests where the decision nodes are based on nearest class mean (NCM) classification, and demonstrates that NCMFs not only outperform conventional random forests, but are also well suited for integrating new classes.
Abstract: In recent years, large image data sets such as "ImageNet", "TinyImages" or ever-growing social networks like "Flickr" have emerged, posing new challenges to image classification that were not apparent in smaller image sets. In particular, the efficient handling of dynamically growing data sets, where not only the amount of training images, but also the number of classes increases over time, is a relatively unexplored problem. To remedy this, we introduce Nearest Class Mean Forests (NCMF), a variant of Random Forests where the decision nodes are based on nearest class mean (NCM) classification. NCMFs not only outperform conventional random forests, but are also well suited for integrating new classes. To this end, we propose and compare several approaches to incorporate data from new classes, so as to seamlessly extend the previously trained forest instead of re-training it from scratch. In our experiments, we show that NCMFs trained on small data sets with 10 classes can be extended to large data sets with 1000 classes without significant loss of accuracy compared to training from scratch on the full data.
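
A minimal sketch of the NCM idea inside a forest node: samples are routed toward the branch owning the nearest class centroid, so integrating a new class amounts to inserting one centroid. The structure and names are illustrative, assuming centroids are estimated elsewhere:

```python
import numpy as np

class NCMNode:
    """Decision node of a Nearest Class Mean Forest: the node stores
    class centroids and a left/right assignment of classes; a sample
    follows the branch of its nearest centroid's class."""
    def __init__(self, means, left_classes, left, right):
        self.means = means                  # dict: class label -> centroid
        self.left_classes = set(left_classes)
        self.left, self.right = left, right  # child nodes (or leaves)

    def route(self, x):
        nearest = min(self.means, key=lambda c: np.linalg.norm(x - self.means[c]))
        return self.left if nearest in self.left_classes else self.right

    def add_class(self, label, centroid, go_left):
        # integrating a new class = inserting one centroid, no retraining
        self.means[label] = centroid
        if go_left:
            self.left_classes.add(label)
```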

Journal ArticleDOI
TL;DR: This work introduces parts-dependent body joint regressors, random forests that operate over two layers, which outperform independent classifiers or regressors and perform better than or similarly to the state-of-the-art in terms of accuracy, while running at a couple of frames per second.
Abstract: In this work, we address the problem of estimating 2D human pose from still images. Articulated body pose estimation is challenging due to the large variation in body poses and appearances of the different body parts. Recent methods that rely on the pictorial structure framework have been shown to be very successful in solving this task. They model the body part appearances using discriminatively trained, independent part templates and the spatial relations of the body parts using a tree model. Within such a framework, we address the problem of obtaining better part templates which are able to handle a very high variation in appearance. To this end, we introduce parts-dependent body joint regressors which are random forests that operate over two layers. While the first layer acts as an independent body part classifier, the second layer takes the estimated class distributions of the first one into account and is thereby able to predict joint locations by modeling the interdependence and co-occurrence of the parts. This helps to overcome typical ambiguities of tree structures, such as self-similarities of legs and arms. In addition, we introduce a novel data set termed FashionPose that contains over 7,000 images with a challenging variation of body part appearances due to a large variation of dressing styles. In the experiments, we demonstrate that the proposed parts-dependent joint regressors outperform independent classifiers or regressors. The method also performs better or similar to the state-of-the-art in terms of accuracy, while running at a couple of frames per second.

Journal ArticleDOI
TL;DR: The Weighted Collaborative Representation Classifier (WCRC) improves the classification performance over that of the original formulation, while keeping the simplicity and the speed of the original CRC-RLS formulation.

Journal ArticleDOI
TL;DR: Current state-of-the-art visualization technologies are mainly fully virtual, while AR has the potential to enhance those visualizations by observing proposed designs directly within the real environment.
Abstract: Augmented Reality (AR) is a rapidly developing field with numerous potential applications. For example, building developers, public authorities, and other construction industry stakeholders need to visually assess potential new developments with regard to aesthetics, health and safety, and other criteria. Current state-of-the-art visualization technologies are mainly fully virtual, while AR has the potential to enhance those visualizations by observing proposed designs directly within the real environment. A novel AR system is presented that is most appropriate for urban applications. It is based on monocular vision, is markerless, and does not rely on beacon-based localization technologies (like GPS) or inertial sensors. Additionally, the system automatically calculates occlusions of the built environment on the augmenting virtual objects. Three datasets from real environments presenting different levels of complexity (geometrical complexity, textures, occlusions) are used to demonstrate the performance of the proposed system. Videos augmented with our system are shown to provide realistic and valuable visualizations of proposed changes of the urban environment.

Proceedings ArticleDOI
23 Jun 2014
TL;DR: This work presents a novel approach for producing dense reconstructions from multiple images and from the underlying sparse Structure-from-Motion (SfM) data in an efficient way; it assumes piecewise planarity of man-made scenes and exploits both sparse visibility and a fast over-segmentation of the images.
Abstract: State-of-the-art Multi-View Stereo (MVS) algorithms deliver dense depth maps or complex meshes with very high detail, and redundancy over regular surfaces. In turn, our interest lies in an approximate but light-weight method that is better suited to large-scale applications, such as urban scene reconstruction from ground-based images. We present a novel approach for producing dense reconstructions from multiple images and from the underlying sparse Structure-from-Motion (SfM) data in an efficient way. To overcome the problem of SfM sparsity and textureless areas, we assume piecewise planarity of man-made scenes and exploit both sparse visibility and a fast over-segmentation of the images. Reconstruction is formulated as an energy-driven, multi-view plane assignment problem, which we solve jointly over superpixels from all views while avoiding expensive photoconsistency computations. The resulting planar primitives, defined by detailed superpixel boundaries, are computed in about 10 seconds per image.

Proceedings ArticleDOI
24 Mar 2014
TL;DR: Adding scale-invariance to line descriptors increases the accuracy when confronted with big scale changes and increases the number of inliers in the general case, both resulting in smaller calibration errors by means of RANSAC-like techniques and epipolar estimations.
Abstract: In this paper we propose a method to add scale-invariance to line descriptors for wide baseline matching purposes. While finding point correspondences among different views is a well-studied problem, there still remain difficult cases where it performs poorly, such as textureless scenes, ambiguities and extreme transformations. For these cases using line segment correspondences is a valuable addition for finding sufficient matches. Our general method for adding scale-invariance to line segment descriptors consists of 5 basic rules. We apply these rules to enhance both the line descriptor described by Bay et al. [1] and the mean-standard deviation line descriptor (MSLD) proposed by Wang et al. [14]. Moreover, we examine the effect of the line descriptors when combined with the topological filtering method proposed by Bay et al. and the recently proposed graph matching strategy from K-VLD [6]. We validate the method using standard point correspondence benchmarks and more challenging new ones. Adding scale-invariance increases the accuracy when confronted with big scale changes and increases the number of inliers in the general case, both resulting in smaller calibration errors by means of RANSAC-like techniques and epipolar estimations.

Proceedings ArticleDOI
23 Jun 2014
TL;DR: This work is the first attempt to quantify this image property, and it is found that texture synthesizability can be learned and predicted and used to trim images to parts that are more synthesizable.
Abstract: Example-based texture synthesis (ETS) has been widely used to generate high quality textures of desired sizes from a small example. However, not all textures are equally well reproducible that way. We predict how synthesizable a particular texture is by ETS. We introduce a dataset (21,302 textures) of which all images have been annotated in terms of their synthesizability. We design a set of texture features, such as 'textureness', homogeneity, repetitiveness, and irregularity, and train a predictor using these features on the data collection. This work is the first attempt to quantify this image property, and we find that texture synthesizability can be learned and predicted. We use this insight to trim images to parts that are more synthesizable. Also we suggest which texture synthesis method is best suited to synthesise a given texture. Our approach can be seen as 'winner-uses-all': picking one method among several alternatives, ending up with an overall superior ETS method. Such a strategy could also be considered for other vision tasks: rather than building an even stronger method, choose from existing methods based on some simple preprocessing.
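
Since synthesizability is annotated per texture, learning it reduces to ordinary supervised regression from texture features to scores; a minimal scikit-learn sketch in which the feature extraction (describe()) is a hypothetical placeholder for the paper's 'textureness', homogeneity, repetitiveness and irregularity measures:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def train_synthesizability_predictor(features, scores):
    """features: (n_textures, n_features) texture descriptors,
    scores: human-annotated synthesizability values in [0, 1]."""
    model = RandomForestRegressor(n_estimators=200, random_state=0)
    model.fit(features, scores)
    return model

# usage: rank candidate crops of an image and keep the most synthesizable
# crop_feats = np.stack([describe(c) for c in crops])  # hypothetical describe()
# best_crop = crops[int(np.argmax(model.predict(crop_feats)))]
```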

Proceedings ArticleDOI
23 Jun 2014
TL;DR: This paper addresses the problem of personalization in the context of gesture recognition and proposes a novel and extremely efficient approach: a set of classifiers is learned during training, one of which is selected for each test subject based on the personalization data.
Abstract: Human gestures, similar to speech and handwriting, are often unique to the individual. Training a generic classifier applicable to everyone can be very difficult and as such, it has become a standard to use personalized classifiers in speech and handwriting recognition. In this paper, we address the problem of personalization in the context of gesture recognition, and propose a novel and extremely efficient way of doing personalization. Unlike conventional personalization methods which learn a single classifier that later gets adapted, our approach learns a set (portfolio) of classifiers during training, one of which is selected for each test subject based on the personalization data. We formulate classifier personalization as a selection problem and propose several algorithms to compute the set of candidate classifiers. Our experiments show that such an approach is much more efficient than adapting the classifier parameters but can still achieve comparable or better results.
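
The personalization step is then a pure selection problem; a one-line sketch assuming scikit-learn-style classifiers with a .score() accuracy method:

```python
def personalize(portfolio, X_personal, y_personal):
    """Pick the classifier from the pre-trained portfolio that scores
    highest on the subject's personalization data; no classifier
    parameters are adapted, which makes this step very cheap."""
    return max(portfolio, key=lambda clf: clf.score(X_personal, y_personal))
```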

Proceedings ArticleDOI
01 Jan 2014
Authors: Massimo Mauro (1, m.mauro001@unibs.it), Hayko Riemenschneider (2, http://www.vision.ee.ethz.ch/~rhayko/), Alberto Signoroni (1, http://www.ing.unibs.it/~signoron/), Riccardo Leonardi (1, http://www.ing.unibs.it/~leon/), Luc Van Gool (2, http://www.vision.ee.ethz.ch/~vangool/); (1) Department of Information Engineering, University of Brescia, Brescia, Italy; (2) Computer Vision Lab, Swiss Federal Institute of Technology, Zurich, Switzerland

Proceedings ArticleDOI
08 Dec 2014
TL;DR: This work introduces a way of capturing the semantic scene context of a key point into a compact description and proposes to learn the correct matchability of descriptors from these semantic contexts.
Abstract: Image-to-image feature matching is the single most restrictive time bottleneck in any matching pipeline. We propose two methods for improving the speed and quality by employing semantic scene segmentation. First, we introduce a way of capturing the semantic scene context of a key point into a compact description. Second, we propose to learn the correct matchability of descriptors from these semantic contexts. Finally, we further reduce the complexity of matching to only a pre-computed set of semantically close key points. All methods can be used independently, and in the evaluation we show combinations for maximum speed benefits. Overall, our proposed methods outperform all baselines and provide significant improvements in accuracy and an order of magnitude faster key point matching.

Journal ArticleDOI
TL;DR: A novel scale-invariant image feature detection algorithm (D-SIFER) using a newly proposed scale-space optimal 10th-order Gaussian derivative (GDO-10) filter, which reaches the jointly optimal Heisenberg's uncertainty of its impulse response in scale and space simultaneously.
Abstract: We present a novel scale-invariant image feature detection algorithm (D-SIFER) using a newly proposed scale-space optimal 10th-order Gaussian derivative (GDO-10) filter, which reaches the jointly optimal Heisenberg's uncertainty of its impulse response in scale and space simultaneously (i.e., we minimize the maximum of the two moments). The D-SIFER algorithm using this filter leads to an outstanding quality of image feature detection, with a factor of three quality improvement over state-of-the-art scale-invariant feature transform (SIFT) and speeded up robust features (SURF) methods that use the second-order Gaussian derivative filters. To reach low computational complexity, we also present a technique approximating the GDO-10 filters with a fixed-length implementation, which is independent of the scale. The final approximation error remains far below the noise margin, providing constant time, low cost, but nevertheless high-quality feature detection and registration capabilities. D-SIFER is validated on a real-life hyperspectral image registration application, precisely aligning up to hundreds of successive narrowband color images, despite their strong artifacts (blurring, low-light noise) typically occurring in such delicate optical system setups.

Proceedings ArticleDOI
Ralf Dragon, Luc Van Gool
23 Jun 2014
TL;DR: The problem of estimating the ground plane orientation and location in monocular video sequences from a moving observer is formulated as a state-continuous Hidden Markov Model (HMM) where the hidden state contains t and n and may be estimated by sampling and decomposing homographies.
Abstract: We focus on the problem of estimating the ground plane orientation and location in monocular video sequences from a moving observer. Our only assumptions are that the 3D ego motion t and the ground plane normal n are orthogonal, and that n and t are smooth over time. We formulate the problem as a state-continuous Hidden Markov Model (HMM) where the hidden state contains t and n and may be estimated by sampling and decomposing homographies. We show that using blocked Gibbs sampling, we can infer the hidden state with high robustness towards outliers, drifting trajectories, rolling shutter and an imprecise intrinsic calibration. Since our approach does not need any initial orientation prior, it works for arbitrary camera orientations in which the ground is visible.

Book ChapterDOI
06 Sep 2014
TL;DR: A novel tracking algorithm is proposed that can track highly non-rigid targets accurately and robustly and outperforms state-of-the-art methods, using a new bounding box representation called the Double Bounding Box (DBB).
Abstract: A novel tracking algorithm that can track a highly non-rigid target robustly is proposed using a new bounding box representation called the Double Bounding Box (DBB). In the DBB, a target is described by the combination of the Inner Bounding Box (IBB) and the Outer Bounding Box (OBB). Then our objective of visual tracking is changed to finding the IBB and OBB instead of a single bounding box, where the IBB and OBB can be easily obtained by the Dempster-Shafer (DS) theory. If the target is highly non-rigid, no single bounding box can include all foreground regions while excluding all background regions. Using the DBB, our method does not directly handle the ambiguous regions, which include both the foreground and background regions. Hence, it can solve the inherent ambiguity of the single bounding box representation and thus can track highly non-rigid targets robustly. Our method finally finds the best state of the target using a new Constrained Markov Chain Monte Carlo (CMCMC)-based sampling method with the constraint that the OBB should include the IBB. Experimental results show that our method tracks non-rigid targets accurately and robustly, and outperforms state-of-the-art methods.

Book ChapterDOI
06 Sep 2014
TL;DR: In this paper, the authors incorporate temporal constraints into the image-based registration setting and solve the problem by pose regularization with model fitting and smoothing methods, which leads to accurate, gap-free and smooth poses for all frames.
Abstract: Registering image data to Structure from Motion (SfM) point clouds is widely used to find precise camera location and orientation with respect to a world model. In case of videos one constraint has previously been unexploited: temporal smoothness. Without temporal smoothness the magnitude of the pose error in each frame of a video will often dominate the magnitude of frame-to-frame pose change. This hinders application of methods requiring stable pose estimates (e.g. tracking, augmented reality). We incorporate temporal constraints into the image-based registration setting and solve the problem by pose regularization with model fitting and smoothing methods. This leads to accurate, gap-free and smooth poses for all frames. We evaluate different methods on challenging synthetic and real street-view SfM data for varying scenarios of motion speed, outlier contamination, pose estimation failures and 2D-3D correspondence noise. For all test cases a 2- to 60-fold reduction in root mean squared (RMS) positional error is observed, depending on pose estimation difficulty. For varying scenarios, different methods perform best. We give guidance on which methods should be preferred depending on circumstances and requirements.
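
One of the simpler smoothing regularizers can be sketched as Savitzky-Golay filtering of the per-frame camera centers; the window and polynomial order below are illustrative, and the paper compares several model-fitting and smoothing variants beyond this:

```python
import numpy as np
from scipy.signal import savgol_filter

def smooth_camera_positions(positions, window=11, order=3):
    """positions: (n_frames, 3) camera centers from per-frame
    registration; returns a temporally smooth trajectory. The window
    must be odd and at most n_frames; frames with failed registration
    should be interpolated beforehand to keep the result gap-free."""
    return savgol_filter(positions, window_length=window,
                         polyorder=order, axis=0)
```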

BookDOI
TL;DR: In this paper, a tracking-by-detection algorithm for multiple targets from multiple dynamic, unlocalized and unconstrained cameras is proposed, which can effectively deal with independently moving cameras and camera registration noise.
Abstract: We propose a new tracking-by-detection algorithm for multiple targets from multiple dynamic, unlocalized and unconstrained cameras. In the past, tracking has been done either with multiple static cameras or with single and stereo dynamic cameras. We register several moving cameras using a given 3D model from Structure from Motion (SfM), and initialize the tracking given the registration. The camera uncertainty estimate can be efficiently incorporated into a flow-network formulation for tracking. As this is a novel task in the tracking domain, we evaluate our method on a new challenging dataset for tracking with multiple moving cameras and show that our tracking method can effectively deal with independently moving cameras and camera registration noise.

Book ChapterDOI
02 Sep 2014
TL;DR: This work proposes a solution based on a single video camera that is not only far less intrusive but also a lot cheaper, and that outperforms current motion segmentation and tracking approaches for Cerebral Palsy detection.
Abstract: Motions of organs or extremities are important features for clinical diagnosis. However, tracking and segmentation of complex, quickly changing motion patterns is challenging, certainly in the presence of occlusions. Neither state-of-the-art tracking nor motion segmentation approaches are able to deal with such cases. Thus far, motion capture systems or the like were needed, which are complicated to handle and which influence the movements. We propose a solution based on a single video camera, that is not only far less intrusive, but also a lot cheaper. The limitations of tracking and motion segmentation are overcome by a new approach that integrates prior knowledge in the form of weak labeling into motion segmentation. Using the example of Cerebral Palsy detection, we segment motion patterns of infants into the different body parts by analyzing body movements. Our experimental results show that our approach outperforms current motion segmentation and tracking approaches.

Proceedings ArticleDOI
23 Jun 2014
TL;DR: This work extends the mixtures from trees to more general loopy graphs and can localize facial points with an accuracy similar to fully supervised approaches without any facial point annotation at the level of individual training images.
Abstract: Face detection and facial points localization are interconnected tasks. Recently it has been shown that solving these two tasks jointly with a mixture of trees of parts (MTP) leads to state-of-the-art results. However, MTP, as most other methods for facial point localization proposed so far, requires a complete annotation of the training data at facial point level. This is used to predefine the structure of the trees and to place the parts correctly. In this work we extend the mixtures from trees to more general loopy graphs. In this way we can learn in a weakly supervised manner (using only the face location and orientation) a powerful deformable detector that implicitly aligns its parts to the detected face in the image. By attaching some reference points to the correct parts of our detector we can then localize the facial points. In terms of detection our method clearly outperforms the state-of-the-art, even when competing with methods that use facial point annotations during training. Additionally, without any facial point annotation at the level of individual training images, our method can localize facial points with an accuracy similar to fully supervised approaches.

Proceedings ArticleDOI
08 Dec 2014
TL;DR: A novel formulation for view selection is proposed, where cameras are modeled with binary variables, while the linear constraints enforce the completeness of the 3D reconstruction, and the solution of the ILP leads to an optimal subset of selected cameras.
Abstract: Multi-View Stereo (MVS) algorithms scale poorly on large image sets, and quickly become unfeasible to run on a single machine with limited memory. Typical solutions to lower the complexity include reducing the redundancy of the image set (view selection), and dividing the image set into groups to be processed independently (view clustering). A novel formulation for view selection is proposed here. We express the problem with an Integer Linear Programming (ILP) model, where cameras are modeled with binary variables, while the linear constraints enforce the completeness of the 3D reconstruction. The solution of the ILP leads to an optimal subset of selected cameras. As a second contribution, we integrate ILP camera selection with a view clustering approach which exploits Leveraged Affinity Propagation (LAP). LAP clustering can efficiently deal with large camera sets. We adapt the original algorithm so that it provides a set of overlapping clusters where the minimum and maximum sizes and the number of overlapping cameras can be specified. Evaluations on four different datasets show our solution provides significant complexity reductions and guarantees near-perfect coverage, making large reconstructions feasible even on a single machine.
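
The view-selection ILP can be sketched as a set-cover program, with one binary variable per camera and one coverage constraint per sparse 3D point; a minimal version with SciPy's milp (the paper's completeness constraints are richer than plain coverage counting):

```python
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

def select_cameras(visibility, min_cover=1):
    """visibility: (n_points, n_cams) boolean matrix, True where a
    camera observes a sparse 3D point. Minimizes the number of kept
    cameras while every point stays covered min_cover times."""
    n_cams = visibility.shape[1]
    res = milp(
        c=np.ones(n_cams),                       # minimize sum of x_j
        constraints=LinearConstraint(visibility.astype(float),
                                     lb=min_cover, ub=np.inf),
        integrality=np.ones(n_cams),             # x_j integer
        bounds=Bounds(0, 1),                     # x_j binary
    )
    # res.x is None when the program is infeasible
    return np.flatnonzero(res.x > 0.5)           # indices of kept cameras
```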

Proceedings Article
08 Dec 2014
TL;DR: A simple and flexible family of non-linear kernels which are arbitrary kernels in the index space of a data quantizer, i.e., piecewise constant similarities in the original feature space that grant access to Euclidean geometry for uncompressed features are introduced.
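
A minimal sketch of a quantized kernel: each feature dimension is quantized into a few bins and the kernel value is a sum of per-dimension lookup-table entries, i.e., an arbitrary kernel in index space that is piecewise constant in feature space. The table values would be learned; here they are a free input, and the shared bin edges are a simplification:

```python
import numpy as np

def quantize(x, edges):
    """Map each feature dimension to a bin index (same edges for all
    dimensions, for brevity); n_bins = len(edges) + 1."""
    return np.searchsorted(edges, x)

def quantized_kernel(x, y, edges, table):
    """K(x, y) = sum_d table[d, q(x_d), q(y_d)]: an arbitrary kernel in
    the quantizer's index space, hence piecewise constant in feature
    space. table has shape (n_dims, n_bins, n_bins), learned elsewhere."""
    qx, qy = quantize(x, edges), quantize(y, edges)
    d = np.arange(len(x))
    return table[d, qx, qy].sum()
```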
Abstract: Matching local visual features is a crucial problem in computer vision and its accuracy greatly depends on the choice of similarity measure. As it is generally very difficult to design by hand a similarity or a kernel perfectly adapted to the data of interest, learning it automatically with as few assumptions as possible is preferable. However, available techniques for kernel learning suffer from several limitations, such as restrictive parametrization or scalability. In this paper, we introduce a simple and flexible family of non-linear kernels which we refer to as Quantized Kernels (QK). QKs are arbitrary kernels in the index space of a data quantizer, i.e., piecewise constant similarities in the original feature space. Quantization allows to compress features and keep the learning tractable. As a result, we obtain state-of-the-art matching performance on a standard benchmark dataset with just a few bits to represent each feature dimension. QKs also have explicit non-linear, low-dimensional feature mappings that grant access to Euclidean geometry for uncompressed features.