
Showing papers presented at "German Conference on Pattern Recognition in 2014"


Book ChapterDOI
02 Sep 2014
TL;DR: A structured lighting system is presented for creating high-resolution stereo datasets of static indoor scenes with highly accurate ground-truth disparities, using novel techniques for efficient 2D subpixel correspondence search and self-calibration of cameras and projectors with modeling of lens distortion.
Abstract: We present a structured lighting system for creating high-resolution stereo datasets of static indoor scenes with highly accurate ground-truth disparities. The system includes novel techniques for efficient 2D subpixel correspondence search and self-calibration of cameras and projectors with modeling of lens distortion. Combining disparity estimates from multiple projector positions, we are able to achieve a disparity accuracy of 0.2 pixels on most observed surfaces, including in half-occluded regions. We contribute 33 new 6-megapixel datasets obtained with our system and demonstrate that they present new challenges for the next generation of stereo algorithms.

1,071 citations
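
A key ingredient above is merging disparity estimates from multiple projector positions into one high-accuracy map. As a rough illustration of that fusion step only (not the paper's pipeline; the function name, NaN convention, and support threshold are invented), a per-pixel robust combination might look like this:

```python
import numpy as np

def merge_disparities(disparity_maps, min_support=2):
    """Fuse per-projector disparity maps into one robust per-pixel estimate.

    disparity_maps: list of HxW float arrays with np.nan where a projector
    position gave no estimate (e.g., in its shadow). Threshold is invented.
    """
    stack = np.stack(disparity_maps)                 # P x H x W
    merged = np.nanmedian(stack, axis=0)             # robust per-pixel fusion
    support = np.sum(~np.isnan(stack), axis=0)       # how many maps saw it
    merged[support < min_support] = np.nan           # too few observations
    return merged

# Toy usage: three noisy 4x4 maps around a true disparity of 10.0.
maps = [10.0 + 0.1 * np.random.randn(4, 4) for _ in range(3)]
maps[0][0, 0] = np.nan                               # a half-occluded pixel
print(merge_disparities(maps).round(2))
```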


Book ChapterDOI
02 Sep 2014
TL;DR: This paper follows a two-step approach where it first learns to predict a semantic representation from video and then generates natural language descriptions from it, and models across-sentence consistency at the level of the SR by enforcing a consistent topic.
Abstract: Humans can easily describe what they see in a coherent way and at varying levels of detail. However, existing approaches for automatic video description focus on generating only single sentences and are not able to vary the descriptions’ level of detail. In this paper, we address both of these limitations: for a variable level of detail we produce coherent multi-sentence descriptions of complex videos. To understand the difference between detailed and short descriptions, we collect and analyze a video description corpus of three levels of detail. We follow a two-step approach where we first learn to predict a semantic representation (SR) from video and then generate natural language descriptions from it. For our multi-sentence descriptions we model across-sentence consistency at the level of the SR by enforcing a consistent topic. Human judges rate our descriptions as more readable, correct, and relevant than related work.

244 citations


Book ChapterDOI
02 Sep 2014
TL;DR: This work directly learns a mapping from image patches, corrupted by missing pixels, onto complete image patches, represented as a deep neural network that is automatically trained on a large image dataset to exploit the shape information of the missing regions.
Abstract: Most inpainting approaches require a good image model to infer the unknown pixels. In this work, we directly learn a mapping from image patches, corrupted by missing pixels, onto complete image patches. This mapping is represented as a deep neural network that is automatically trained on a large image data set. In particular, we are interested in the question whether it is helpful to exploit the shape information of the missing regions, i.e. the masks, which is something commonly ignored by other approaches. In comprehensive experiments on various images, we demonstrate that our learning-based approach is able to use this extra information and can achieve state-of-the-art inpainting results. Furthermore, we show that training with such extra information is useful for blind inpainting, where the exact shape of the missing region might be uncertain, for instance due to aliasing effects.

143 citations
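
The core idea of exploiting the mask is simply to hand the network the corruption pattern as an extra input channel alongside the corrupted patch. A minimal PyTorch sketch of that input design, with an illustrative toy architecture that is not the paper's:

```python
import torch
import torch.nn as nn

class MaskAwareInpainter(nn.Module):
    """Toy patch-to-patch inpainting net; the mask enters as a 2nd channel."""

    def __init__(self, patch=17):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(2 * patch * patch, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, patch * patch),        # predict the complete patch
        )

    def forward(self, corrupted, mask):
        x = torch.cat([corrupted, mask], dim=1)   # B x 2 x H x W
        return self.net(x)

# One training step: corrupt clean patches, let the net restore them.
model = MaskAwareInpainter()
clean = torch.rand(8, 1, 17, 17)
mask = (torch.rand(8, 1, 17, 17) > 0.3).float()   # 1 = pixel kept
loss = nn.functional.mse_loss(model(clean * mask, mask), clean.flatten(1))
loss.backward()
print(loss.item())
```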


Book ChapterDOI
02 Sep 2014
TL;DR: This work proposes a method to label roads in aerial images and extract a topologically correct road network; the method outperforms several baselines on two challenging data sets, both in terms of precision/recall and w.r.t. topological correctness.
Abstract: We propose a method to label roads in aerial images and extract a topologically correct road network. Three factors make road extraction difficult: (i) high intra-class variability due to clutter like cars, markings, shadows on the roads; (ii) low inter-class variability, because some non-road structures are made of similar materials; and (iii) most importantly, a complex structural prior: roads form a connected network of thin segments, with slowly changing width and curvature, often bordered by buildings, etc. We model this rich, but complicated contextual information at two levels. Locally, the context and layout of roads is learned implicitly, by including multi-scale appearance information from a large neighborhood in the per-pixel classifier. Globally, the network structure is enforced explicitly: we first detect promising stretches of road via shortest-path search on the per-pixel evidence, and then select pixels on an optimal subset of these paths by energy minimization in a CRF, where each putative path forms a higher-order clique. The model outperforms several baselines on two challenging data sets, both in terms of precision/recall and w.r.t. topological correctness.

67 citations
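
The "promising stretches of road via shortest-path search on the per-pixel evidence" step can be illustrated compactly: turn classifier probabilities into costs and run a cheapest-path search between candidate endpoints. A sketch using scikit-image (the stripe data and endpoints are invented):

```python
import numpy as np
from skimage.graph import route_through_array

# Per-pixel road evidence p(road) from a classifier (a toy stripe here).
prob = np.full((50, 50), 0.1)
prob[25, :] = 0.9
cost = -np.log(np.clip(prob, 1e-6, 1.0))     # low cost where road is likely

# Cheapest path between two candidate endpoints = one putative road stretch,
# which would then enter the CRF as a higher-order clique.
path, total_cost = route_through_array(cost, (25, 0), (25, 49),
                                       fully_connected=True)
print(len(path), round(total_cost, 2))
```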


Book ChapterDOI
02 Sep 2014
TL;DR: This paper builds on the recent dataset [2] leveraging the existing taxonomy of human activities and reveals that holistic and pose-based methods are highly complementary, and their performance varies significantly depending on the activity.
Abstract: Holistic methods based on dense trajectories [29, 30] are currently the de facto standard for recognition of human activities in video. Whether holistic representations will sustain or will be superseded by higher level video encoding in terms of body pose and motion is the subject of an ongoing debate [12]. In this paper we aim to clarify the underlying factors responsible for good performance of holistic and pose-based representations. To that end we build on our recent dataset [2] leveraging the existing taxonomy of human activities. This dataset includes 24,920 video snippets covering 410 human activities in total. Our analysis reveals that holistic and pose-based methods are highly complementary, and their performance varies significantly depending on the activity. We find that holistic methods are mostly affected by the number and speed of trajectories, whereas pose-based methods are mostly influenced by viewpoint of the person. We observe striking performance differences across activities: for certain activities results with pose-based features are more than twice as accurate compared to holistic features, and vice versa. The best performing approach in our comparison is based on the combination of holistic and pose-based approaches, which again underlines their complementarity.

58 citations


Book ChapterDOI
02 Sep 2014
TL;DR: The first principled explanation of the empirically successful semi-global matching algorithm is offered, clarifying its exact relation to belief propagation and tree-reweighted message passing.
Abstract: Semi-global matching, originally introduced in the context of dense stereo, is a very successful heuristic to minimize the energy of a pairwise multi-label Markov Random Field defined on a grid. We offer the first principled explanation of this empirically successful algorithm, and clarify its exact relation to belief propagation and tree-reweighted message passing. One outcome of this new connection is an uncertainty measure for the MAP label of a variable in a Markov Random Field.

57 citations
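
For reference, semi-global matching aggregates unary matching costs along several scan directions with a small penalty P1 for disparity changes of one and a larger penalty P2 for arbitrary jumps. A NumPy sketch of a single left-to-right pass (a textbook rendering of the heuristic, not code from the paper):

```python
import numpy as np

def aggregate_left_to_right(cost, P1=1.0, P2=8.0):
    """One semi-global matching pass: aggregate costs along each image row.

    cost: H x W x D unary matching costs. The recurrence adds the cheapest
    transition from the previous pixel's labels; the full heuristic sums
    such passes over several scan directions.
    """
    H, W, D = cost.shape
    L = np.empty_like(cost)
    L[:, 0] = cost[:, 0]
    for x in range(1, W):
        prev = L[:, x - 1]                              # H x D
        best_prev = prev.min(axis=1, keepdims=True)     # H x 1
        padded = np.pad(prev, ((0, 0), (1, 1)), constant_values=np.inf)
        shift = np.minimum(padded[:, :-2], padded[:, 2:]) + P1
        jump = best_prev + P2
        L[:, x] = (cost[:, x] + np.minimum(np.minimum(prev, shift), jump)
                   - best_prev)
    return L

costs = np.random.rand(4, 6, 8)   # toy 4x6 image with 8 disparity labels
print(aggregate_left_to_right(costs, P1=0.1, P2=0.5).shape)
```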


Book ChapterDOI
02 Sep 2014
TL;DR: This paper is the first to transfer and adapt submapping to RGB-D sensors and to provide a detailed analysis of the resulting gain, finding that the method outperforms several state-of-the-art approaches in both speed and accuracy.
Abstract: The key contribution of this paper is a novel submapping technique for RGB-D-based bundle adjustment. Our approach significantly speeds up 3D object reconstruction with respect to full bundle adjustment while generating visually compelling 3D models of high metric accuracy. While submapping has been explored previously for mono and stereo cameras, we are the first to transfer and adapt this concept to RGB-D sensors and to provide a detailed analysis of the resulting gain. In our approach, we partition the input data uniformly into submaps to optimize them individually by minimizing the 3D alignment error. Subsequently, we fix the interior variables and optimize only over the separator variables between the submaps. As we demonstrate in this paper, our method reduces the runtime of full bundle adjustment by 32 % on average while still being able to deal with real-world noise of cheap commodity sensors. We evaluated our method on a large number of benchmark datasets, and found that we outperform several state-of-the-art approaches both in terms of speed and accuracy. Furthermore, we present highly accurate 3D reconstructions of various objects to demonstrate the validity of our approach.

44 citations


Book ChapterDOI
02 Sep 2014
TL;DR: Formalizing a connection between Random Forests and ANNs allows exploiting the former to initialize the latter; further parameter optimization within the ANN framework yields models that are intermediate between RF and ANN, and achieve performance better than RF and ANN on the majority of the UCI datasets used for benchmarking.
Abstract: While Artificial Neural Networks (ANNs) are highly expressive models, they are hard to train from limited data. Formalizing a connection between Random Forests (RFs) and ANNs allows exploiting the former to initialize the latter. Further parameter optimization within the ANN framework yields models that are intermediate between RF and ANN, and achieve performance better than RF and ANN on the majority of the UCI datasets used for benchmarking.

43 citations
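
The first step of such a mapping is mechanical: every internal split node of a trained forest tests one feature against one threshold, which is exactly a first-layer neuron with a one-hot weight vector. A partial sketch with scikit-learn (the paper's full construction additionally encodes leaf membership in a second layer before fine-tuning; dataset and hyperparameters are only for illustration):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
rf = RandomForestClassifier(n_estimators=10, max_depth=4,
                            random_state=0).fit(X, y)

# Every internal split node tests one feature against one threshold.
splits = []
for est in rf.estimators_:
    t = est.tree_
    for node in range(t.node_count):
        if t.children_left[node] != -1:      # internal node, not a leaf
            splits.append((t.feature[node], t.threshold[node]))

# First-layer initialization: neuron i fires on "x[feature_i] > threshold_i".
W1 = np.zeros((len(splits), X.shape[1]))
b1 = np.zeros(len(splits))
for i, (f, thr) in enumerate(splits):
    W1[i, f] = 1.0
    b1[i] = -thr
print(W1.shape)  # a next layer would encode leaf membership, then fine-tune
```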


Book ChapterDOI
02 Sep 2014
TL;DR: A novel model is introduced that combines Deep Convolutional Neural Networks with a global inference model derived from a convex variational relaxation of the minimum s-t cut problem on graphs, which is frequently used for the task of image segmentation.
Abstract: In this paper we introduce a novel model that combines Deep Convolutional Neural Networks with a global inference model. Our model is derived from a convex variational relaxation of the minimum s-t cut problem on graphs, which is frequently used for the task of image segmentation. We treat the outputs of Convolutional Neural Networks as the unary and pairwise potentials of a graph and derive a smooth approximation to the minimum s-t cut problem. During training, this approximation facilitates the adaptation of the Convolutional Neural Network to the smoothing that is induced by the global model. The training algorithm can be understood as a modified backpropagation algorithm that explicitly takes the global inference layer into account.

39 citations


Book ChapterDOI
02 Sep 2014
TL;DR: A flow-based propagation of user scribbles from the first to subsequent video frames is proposed, which drastically reduces the user input; the approach is compared to state-of-the-art video completion methods.
Abstract: We propose a framework for temporally consistent video completion. To this end we generalize the exemplar-based inpainting method of Criminisi et al. [7] to video inpainting. Specifically we address two important issues: Firstly, we propose a color and optical flow inpainting to ensure temporal consistency of inpainting even for complex motion of foreground and background. Secondly, rather than requiring the user to hand-label the inpainting region in every single image, we propose a flow-based propagation of user scribbles from the first to subsequent video frames which drastically reduces the user input. Experimental comparisons to state-of-the-art video completion methods demonstrate the benefits of the proposed approach.

34 citations
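
The scribble-propagation idea can be sketched in a few lines: estimate optical flow between consecutive frames and warp the user's mask along it. A rough OpenCV illustration (Farneback flow stands in for whichever flow method the authors use; the toy frames are invented):

```python
import cv2
import numpy as np

def propagate_scribbles(prev_gray, next_gray, scribble_mask):
    """Carry a user scribble mask from one frame to the next via optical flow."""
    # Flow from the *next* frame back to the previous one, so we can sample
    # the previous mask at the position each next-frame pixel came from.
    flow = cv2.calcOpticalFlowFarneback(next_gray, prev_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    h, w = prev_gray.shape
    xs, ys = np.meshgrid(np.arange(w), np.arange(h))
    map_x = (xs + flow[..., 0]).astype(np.float32)
    map_y = (ys + flow[..., 1]).astype(np.float32)
    return cv2.remap(scribble_mask, map_x, map_y, cv2.INTER_NEAREST)

prev = np.zeros((64, 64), np.uint8); prev[20:40, 20:40] = 255
nxt = np.roll(prev, 3, axis=1)                   # scene shifts right by 3 px
mask = np.zeros((64, 64), np.uint8); mask[25:35, 25:35] = 1
print(propagate_scribbles(prev, nxt, mask).sum())  # mask follows the motion
```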


Book ChapterDOI
02 Sep 2014
TL;DR: It is shown that the integration of learned must-link constraints not only improves the segmentation result but also significantly reduces the required runtime, making the use of costly spectral methods possible for today’s high quality video.
Abstract: In recent years it has been shown that clustering and segmentation methods can greatly benefit from the integration of prior information in terms of must-link constraints. Very recently the use of such constraints has been integrated in a rigorous manner also in graph-based methods such as normalized cut. On the other hand spectral clustering as relaxation of the normalized cut has been shown to be among the best methods for video segmentation. In this paper we merge these two developments and propose to learn must-link constraints for video segmentation with spectral clustering. We show that the integration of learned must-link constraints not only improves the segmentation result but also significantly reduces the required runtime, making the use of costly spectral methods possible for today’s high quality video.
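
One simple way to see the effect of must-link constraints in a spectral method is to force high affinity between constrained pairs before clustering. This toy scikit-learn sketch is only an illustration; the paper's integration into the normalized cut relaxation is considerably more rigorous, and the data here are placeholders for video superpixels or trajectories:

```python
import numpy as np
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)),      # stand-ins for video
               rng.normal(2, 0.3, (20, 2))])     # superpixels/trajectories

d2 = ((X[:, None] - X[None]) ** 2).sum(-1)
A = np.exp(-d2 / 0.5)                            # Gaussian affinity

# Inject (learned) must-link constraints by forcing maximal affinity.
for i, j in [(0, 5), (1, 7), (20, 25)]:
    A[i, j] = A[j, i] = A.max()

labels = SpectralClustering(n_clusters=2, affinity="precomputed",
                            random_state=0).fit_predict(A)
print(labels)
```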

Book ChapterDOI
02 Sep 2014
TL;DR: This work proposes a framework for hand tracking that can capture the motion of two interacting hands using only a single, inexpensive RGB-D camera, and combines a generative model with collision detection and discriminatively learned salient points.
Abstract: Hand motion capture has been an active research topic, following the success of full-body pose tracking. Despite similarities, hand tracking proves to be more challenging, characterized by a higher dimensionality, severe occlusions and self-similarity between fingers. For this reason, most approaches rely on strong assumptions, like hands in isolation or expensive multi-camera systems, that limit practical use. In this work, we propose a framework for hand tracking that can capture the motion of two interacting hands using only a single, inexpensive RGB-D camera. Our approach combines a generative model with collision detection and discriminatively learned salient points. We quantitatively evaluate our approach on 14 new sequences with challenging interactions.

Book ChapterDOI
02 Sep 2014
TL;DR: This paper shows that state-of-the-art methods combining detection with segmentation can be significantly improved by introducing a new iterative classification, statistical modeling, and segmentation procedure based on a detect-and-merge algorithm.
Abstract: There have recently been advances in the area of fully automatic detection of clustered objects in color images. State of the art methods combine detection with segmentation. In this paper we show that these methods can be significantly improved by introducing a new iterative classification, statistical modeling, and segmentation procedure. The proposed method uses a detect-and-merge algorithm, which iteratively finds and validates new objects and subsequently updates the statistical model, while converging in very few iterations.

Book ChapterDOI
02 Sep 2014
TL;DR: This work presents a new global optimization approach for multiple people tracking based on a hierarchical tracklet framework that casts the optimization problem as a minimum cost arborescence problem in an acyclic directed graph, where a tracking solution can be obtained in linear time.
Abstract: We present a new global optimization approach for multiple people tracking based on a hierarchical tracklet framework. A new type of tracklets is introduced, which we call tree tracklets. They contain bifurcations to naturally deal with ambiguous tracking situations. Difficult decisions are postponed to a later iteration of the hierarchical framework, when more information is available. We cast the optimization problem as a minimum cost arborescence problem in an acyclic directed graph, where a tracking solution can be obtained in linear time. Experiments on six publicly available datasets show that the method performs well when compared to state-of-the-art tracking algorithms.
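
The optimization step is easy to illustrate: on an acyclic tracklet graph, the minimum cost arborescence reduces to picking each node's cheapest incoming edge, which is what makes the linear-time claim plausible. A toy sketch using networkx only for graph storage (graph and costs are invented):

```python
import networkx as nx

# Toy tracklet graph: a virtual root S feeds possible track starts; edge
# weights are linking costs between (tree) tracklets.
G = nx.DiGraph()
G.add_weighted_edges_from([
    ("S", "t1", 1.0), ("S", "t2", 1.5),
    ("t1", "t3", 0.4), ("t2", "t3", 0.9),    # ambiguous link: who feeds t3?
    ("t1", "t4", 0.7), ("t3", "t5", 0.2),
])

# Because the graph is acyclic, the minimum cost arborescence is just each
# node's cheapest incoming edge, found in a single linear-time sweep.
best_parent = {}
for u, v, w in G.edges(data="weight"):
    if v not in best_parent or w < best_parent[v][1]:
        best_parent[v] = (u, w)
print(best_parent)   # t3 is linked to t1, resolving the ambiguity
```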

Book ChapterDOI
02 Sep 2014
TL;DR: This paper presents a principled way to additionally integrate top-down prior information about object location and shape that arises from independent system modules, ranging from geometric cues up to highly confident object detections, in a consistent scene representation for traffic scenarios.
Abstract: This paper presents a stereo vision-based scene model for traffic scenarios. Our approach effectively couples bottom-up image segmentation with object-level knowledge in a sound probabilistic fashion. The relevant scene structure, i.e. obstacles and freespace, is encoded using individual Stixels as building blocks that are computed bottom-up from dense disparity images. We present a principled way to additionally integrate top-down prior information about object location and shape that arises from independent system modules, ranging from geometric cues up to highly confident object detections. This results in an efficient exploration of orthogonal image-based cues, such as disparity and gray-level intensity data, combined in a consistent scene representation. The overall segmentation problem is modeled as a Markov Random Field and solved efficiently through Dynamic Programming.

Book ChapterDOI
02 Sep 2014
TL;DR: This work proposes a lens-based depth estimation scheme based on a novel adaptive lens selection strategy and shows that this strategy achieves similar error rates as selection strategies with a fixed number of lenses, while being computationally less time consuming.
Abstract: Multi-focus portable plenoptic camera devices provide a reasonable tradeoff between spatial and angular resolution while enlarging the depth of field of a standard camera. Many applications using the data captured by these camera devices require or benefit from correspondences established between the single microlens images. In this work we propose a lens-based depth estimation scheme based on a novel adaptive lens selection strategy. Coarse depth estimates serve as indicators for suitable target lenses. The selection criterion accounts for lens overlap and the amount of defocus blur between the reference and possible target lenses. The depth maps are regularized using a semi-global strategy. For insufficiently textured scenes, we further incorporate a semi-global coarse regularization with respect to the lens-grid. In contrast to algorithms operating on the complete lightfield, our algorithm has a low memory footprint. The resulting per-lens dense depth maps are well suited for volumetric surface reconstruction techniques. We show that our selection strategy achieves similar error rates as selection strategies with a fixed number of lenses, while being computationally less time consuming. Results are presented for synthetic as well as real-world datasets.

Book ChapterDOI
02 Sep 2014
TL;DR: A simple and effective framework for multi-view image sequence interpolation in space and time is proposed and two novel filtering approaches for outlier elimination and a robust approach for match extrapolations at the image boundaries are introduced.
Abstract: We propose a simple and effective framework for multi-view image sequence interpolation in space and time. For spatial view point interpolation we present a robust feature-based matching algorithm that allows for wide-baseline camera configurations. To this end, we introduce two novel filtering approaches for outlier elimination and a robust approach for match extrapolations at the image boundaries. For small-baseline and temporal interpolations we rely on an established optical flow based approach. We perform a quantitative and qualitative evaluation of our framework and present applications and results. Our method has a low runtime and results can compete with state-of-the-art methods.

Book ChapterDOI
02 Sep 2014
TL;DR: In this paper, the trimmed reconstruction error is minimized directly over the Stiefel manifold, avoiding the deflation often used by projection pursuit methods; the resulting method has no free parameter and is computationally very efficient.
Abstract: It is well known that Principal Component Analysis (PCA) is strongly affected by outliers and a lot of effort has been put into robustification of PCA. In this paper we present a new algorithm for robust PCA minimizing the trimmed reconstruction error. By directly minimizing over the Stiefel manifold, we avoid deflation as often used by projection pursuit methods. In distinction to other methods for robust PCA, our method has no free parameter and is computationally very efficient. We illustrate the performance on various datasets including an application to background modeling and subtraction. Our method performs better or similar to current state-of-the-art methods while being faster.
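
The trimmed reconstruction error objective can be sketched with a simple alternating scheme: select the points with the smallest reconstruction error, then refit an orthonormal basis (a point on the Stiefel manifold) to them. Note the paper instead minimizes directly over the Stiefel manifold; this alternation, with invented names and a fixed trimming fraction, only illustrates the objective:

```python
import numpy as np

def trimmed_pca(X, k, keep=0.8, iters=50):
    """Sketch of robust PCA by minimizing the trimmed reconstruction error."""
    n, d = X.shape
    h = int(keep * n)
    mu = X.mean(0)
    U = np.linalg.qr(np.random.randn(d, k))[0]     # random Stiefel point
    for _ in range(iters):
        R = X - mu
        err = ((R - (R @ U) @ U.T) ** 2).sum(1)    # per-point residual
        inliers = np.argsort(err)[:h]              # trim the largest errors
        mu = X[inliers].mean(0)
        C = X[inliers] - mu
        U = np.linalg.svd(C, full_matrices=False)[2][:k].T
    return U, mu, inliers

X = np.random.randn(200, 5) @ np.diag([3, 2, 1, 0.1, 0.1])
X[:10] += 20                                       # gross outliers
U, mu, inliers = trimmed_pca(X, k=2)
print(U.shape, np.intersect1d(np.arange(10), inliers).size)  # outliers trimmed
```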

Book ChapterDOI
02 Sep 2014
TL;DR: This work proposes, for the first time, a general purpose segmentation algorithm to extract the most informative and interpretable features as convolution kernels while simultaneously building a multivariate decision tree.
Abstract: Most computer vision and especially segmentation tasks require to extract features that represent local appearance of patches. Relevant features can be further processed by learning algorithms to infer posterior probabilities that pixels belong to an object of interest. Deep Convolutional Neural Networks (CNN) define a particularly successful class of learning algorithms for semantic segmentation, although they proved to be very slow to train even when employing special purpose hardware. We propose, for the first time, a general purpose segmentation algorithm to extract the most informative and interpretable features as convolution kernels while simultaneously building a multivariate decision tree. The algorithm trains several orders of magnitude faster than regular CNNs and achieves state of the art results in processing quality on benchmark datasets.

Book ChapterDOI
02 Sep 2014
TL;DR: This work proposes a solution based on a single video camera that is not only far less intrusive but also much cheaper, and outperforms current motion segmentation and tracking approaches for Cerebral Palsy detection.
Abstract: Motions of organs or extremities are important features for clinical diagnosis. However, tracking and segmentation of complex, quickly changing motion patterns is challenging, certainly in the presence of occlusions. Neither state-of-the-art tracking nor motion segmentation approaches are able to deal with such cases. Thus far, motion capture systems or the like were needed, which are complicated to handle and which impact the movements. We propose a solution based on a single video camera that is not only far less intrusive, but also a lot cheaper. The limitations of tracking and motion segmentation are overcome by a new approach to integrate prior knowledge in the form of weak labeling into motion segmentation. Using the example of Cerebral Palsy detection, we segment motion patterns of infants into the different body parts by analyzing body movements. Our experimental results show that our approach outperforms current motion segmentation and tracking approaches.

Book ChapterDOI
02 Sep 2014
TL;DR: This paper focuses on image segmentation where some label classes, such as the background class acting as a pool of objects, exhibit strong internal boundaries, while other label classes should be modeled as a single region even if some internal boundaries are visible.
Abstract: For image segmentation, recent advances in optimization make it possible to combine noisy region appearance terms with pairwise terms which can not only discourage, but also encourage label transitions, depending on boundary evidence. These models have the potential to overcome problems such as the shrinking bias. However, with the ability to encourage label transitions comes a different problem: strong boundary evidence can overrule weak region appearance terms to create new regions out of nowhere. While some label classes, such as the background class, which is the pool of objects, exhibit strong internal boundaries, other label classes should be modeled as a single region, even if some internal boundaries are visible.

Book ChapterDOI
02 Sep 2014
TL;DR: It is shown that, to obtain a statistically sound result, intuitively appealing deterministic reduction strategies are problematic, and that a simple reduction strategy based on random deletion performs best in the evaluation.
Abstract: This paper deals with efficient means for camera pose estimation for difficult scenes. Particularly, we speed up the combination of image triplets to image sets by hierarchical merging and a reduction of the number of merged points. By image sets we denote a generalization of image sequences where images can be linked in multiple directions, i.e., they can form a graph. To obtain reliable results for triplets, we use large numbers of corresponding points. For a high-quality and yet efficient merging of the triplets we propose strategies for the reduction of the number of points. The strategies are evaluated based on statistical measures employing the full covariance information for the camera poses from bundle adjustment. We show that to obtain a statistically sound result, intuitively appealing deterministic reduction strategies are problematic, and that a simple reduction strategy based on random deletion performs best in our evaluation. We also discuss the benefits of the evaluation measures for finding conceptual and implementation weaknesses. The paper is illustrated with a number of experiments giving standard deviations for all values.
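
The winning reduction strategy is almost trivially simple, which is part of the paper's point: uniform random deletion avoids the selection bias that deterministic "keep the best points" heuristics introduce into the pose covariances. A sketch (function name and data are placeholders):

```python
import numpy as np

def reduce_points(points, target, seed=0):
    """Keep a uniform random subset of tie points before merging triplets."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(points), size=target, replace=False)
    return [points[i] for i in sorted(idx)]

tie_points = [f"pt{i}" for i in range(10000)]    # placeholder tie points
print(len(reduce_points(tie_points, 500)))
```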

Book ChapterDOI
02 Sep 2014
TL;DR: This work presents a novel approach to integrate spatial information into the BoVWs model in a rotation-invariant way by encoding the triangular relationship among the positions of identical visual words in the 2D image space, and validates the proposed method for rotation invariance on datasets of ancient coins and butterflies.
Abstract: Incorporating the spatial information of visual words enhances the performance of the well-known bag-of-visual words (BoVWs) model for problems like object category recognition. However, object images can undergo various in-plane rotations, due to which the spatial information must be added to the BoVWs model in a rotation-invariant manner. We present a novel approach to integrate the spatial information into the BoVWs model in a rotation-invariant way by encoding the triangular relationship among the positions of identical visual words in the 2D image space. Our proposed BoVWs model is based on densely sampled local features for which the dominant orientations are calculated. Thus we achieve rotation-invariance both globally and locally. We validate our proposed method for rotation-invariance on datasets of ancient coins and butterflies and achieve better performance than the conventional BoVWs model.
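
The rotation-invariance argument rests on the fact that distances between word positions are unchanged by in-plane rotation, so any feature built from triangle side lengths is invariant too. A toy sketch of one possible triangle signature (the exact encoding in the paper may differ):

```python
import numpy as np
from itertools import combinations

def triangle_signature(p, q, r):
    """Rotation-invariant description of a triangle of word positions:
    its sorted side lengths, scale-normalized by the longest side."""
    sides = sorted([np.linalg.norm(p - q), np.linalg.norm(q - r),
                    np.linalg.norm(r - p)])
    return np.array(sides) / sides[-1]

# Positions of one visual word in an image; rotating the image leaves the
# signatures unchanged, so histograms over them remain comparable.
pos = np.array([[10, 10], [40, 12], [25, 40], [60, 55]], float)
sigs = [triangle_signature(*pos[list(t)])
        for t in combinations(range(len(pos)), 3)]
print(np.round(sigs, 2))
```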

Book ChapterDOI
02 Sep 2014
TL;DR: This paper proposes a descriptor that comprises the direction and magnitude of curvature and naturally expands classical orientation histograms like SIFT and HOG, demonstrating the general benefit of the expansion exemplarily for image classification, object detection, and descriptor matching.
Abstract: Descriptors based on orientation histograms are widely used in computer vision. The spatial pooling involved in these representations provides important invariance properties, yet it is also responsible for the loss of important details. In this paper, we suggest a way to preserve the details described by the local curvature. We propose a descriptor that comprises the direction and magnitude of curvature and naturally expands classical orientation histograms like SIFT and HOG. We demonstrate the general benefit of the expansion exemplarily for image classification, object detection, and descriptor matching.
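
The curvature of image level lines has a closed form in first and second derivatives, kappa = (Iy^2 Ixx - 2 Ix Iy Ixy + Ix^2 Iyy) / (Ix^2 + Iy^2)^(3/2), and its magnitude and sign can then be pooled into histograms alongside gradient orientation. A NumPy sketch of the per-pixel quantity only (the pooling into a SIFT/HOG-style descriptor, as the paper proposes, is omitted):

```python
import numpy as np

def isophote_curvature(img, eps=1e-8):
    """Per-pixel curvature of image level lines from 1st/2nd derivatives."""
    Iy, Ix = np.gradient(img.astype(float))      # axis 0 = y, axis 1 = x
    Ixy, Ixx = np.gradient(Ix)
    Iyy, _ = np.gradient(Iy)
    num = Iy**2 * Ixx - 2 * Ix * Iy * Ixy + Ix**2 * Iyy
    den = (Ix**2 + Iy**2) ** 1.5 + eps
    return num / den

yy, xx = np.mgrid[:64, :64]
disk = ((xx - 32) ** 2 + (yy - 32) ** 2 < 15 ** 2).astype(float)
print(np.abs(isophote_curvature(disk)).max())    # high values on the circle
```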

Book ChapterDOI
02 Sep 2014
TL;DR: A novel variational model to jointly estimate geometry and motion from a sequence of light fields captured with a plenoptic camera is presented, which enforces multi-view geometry consistency, and piecewise smoothness assumptions on the scene flow variables.
Abstract: In this paper we present a novel variational model to jointly estimate geometry and motion from a sequence of light fields captured with a plenoptic camera. The proposed model uses the so-called sub-aperture representation of the light field. Sub-aperture images represent images with slightly different viewpoints, which can be extracted from the light field. The sub-aperture representation allows us to formulate a convex global energy functional, which enforces multi-view geometry consistency, and piecewise smoothness assumptions on the scene flow variables. We optimize the proposed scene flow model by using an efficient preconditioned primal-dual algorithm. Finally, we also present synthetic and real world experiments.

Book ChapterDOI
02 Sep 2014
TL;DR: A way to boost the performance of 2D pose estimation based on the output of the 3D pose reconstruction process is explored, thus closing the loop in the pose estimation pipeline.
Abstract: In this paper we consider the task of articulated 3D human pose estimation in challenging scenes with dynamic background and multiple people. Initial progress on this task has been achieved building on discriminatively trained part-based models that deliver a set of 2D body pose candidates that are then subsequently refined by reasoning in 3D [1, 4, 5]. The performance of such methods is limited by the performance of the underlying 2D pose estimation approaches. In this paper we explore a way to boost the performance of 2D pose estimation based on the output of the 3D pose reconstruction process, thus closing the loop in the pose estimation pipeline. We build our approach around a component that is able to identify true positive pose estimation hypotheses with high confidence. We then either retrain 2D pose estimation models using such highly confident hypotheses as additional training examples, or we use similarity to these hypotheses as a cue for 2D pose estimation. We consider a number of features that can be used for assessing the confidence of the pose estimation results. The strongest feature in our comparison corresponds to the ensemble agreement on the 3D pose output. We evaluate our approach on two publicly available datasets, improving over the state of the art in each case.

Book ChapterDOI
02 Sep 2014
TL;DR: This work provides the first pose-invariant approach to estimate gaze from unconstrained still images, with results for pose-invariant gaze estimation on the UUlm Head Pose and Gaze Database and attribute description on the Multi-PIE database.
Abstract: Our goal is to obtain an eye gaze estimation and a face description based on attributes (e.g. glasses, beard or thick lips) from still images. An attribute-based face description reflects human vocabulary and is therefore adequate as a face description. Head pose and eye gaze play an important role in human interaction and are a key element to extract interaction information from still images. Pose variation is a major challenge when analyzing them. Most current approaches for facial image analysis are not explicitly pose-invariant. To obtain a pose-invariant representation, we have to account for the three-dimensional nature of a face. A 3D Morphable Model (3DMM) of faces is used to obtain a dense 3D reconstruction of the face in the image. This Analysis-by-Synthesis approach provides model parameters which contain an explicit face description and a dense model-to-image correspondence. However, the fit is restricted to the model space and cannot explain all variations. Our model only contains straight gaze directions and lacks high-detail textural features. To overcome these limitations, we use the obtained correspondence in a discriminative approach. The dense correspondence is used to extract a pose-normalized version of the input image. The warped image contains all information from the original image and preserves gaze and detailed textural information. On the pose-normalized representation we train a regression function to obtain gaze estimation and attribute description. We provide results for pose-invariant gaze estimation on still images on the UUlm Head Pose and Gaze Database and attribute description on the Multi-PIE database. To the best of our knowledge, this is the first pose-invariant approach to estimate gaze from unconstrained still images.

Book ChapterDOI
02 Sep 2014
TL;DR: A method to improve the classification result by combining multiple deep convolutional neural networks in a committee is presented, achieving results that are better than the state of the art.
Abstract: Deep convolutional neural networks are known to give good results on image classification tasks. In this paper we present a method to improve the classification result by combining multiple such networks in a committee. We adopt the STL-10 dataset which has very few training examples and show that our method can achieve results that are better than the state of the art. The networks are trained layer-wise and no backpropagation is used. We also explore the effects of dataset augmentation by mirroring, rotation, and scaling.
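
At test time, a committee simply averages the class probabilities of its independently trained members. A PyTorch sketch with an invented small CNN on STL-10-sized inputs (training, which the paper does layer-wise without backpropagation, is omitted here):

```python
import torch
import torch.nn as nn

def small_cnn():
    """Illustrative member architecture, not the paper's."""
    return nn.Sequential(
        nn.Conv2d(3, 16, 5, padding=2), nn.ReLU(), nn.MaxPool2d(4),
        nn.Conv2d(16, 32, 5, padding=2), nn.ReLU(), nn.MaxPool2d(4),
        nn.Flatten(), nn.Linear(32 * 6 * 6, 10),
    )

# A committee: each member would be trained independently (e.g., on
# differently augmented data); their class probabilities are averaged.
committee = [small_cnn() for _ in range(5)]
x = torch.rand(4, 3, 96, 96)                 # STL-10-sized inputs
with torch.no_grad():
    probs = torch.stack([m(x).softmax(dim=1) for m in committee]).mean(0)
print(probs.argmax(dim=1))
```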

Book ChapterDOI
02 Sep 2014
TL;DR: A new tracking-by-detection algorithm is proposed for multiple targets from multiple dynamic, unlocalized and unconstrained cameras, and it is shown that the method can effectively deal with independently moving cameras and camera registration noise.
Abstract: We propose a new tracking-by-detection algorithm for multiple targets from multiple dynamic, unlocalized and unconstrained cameras. In the past, tracking has either been done with multiple static cameras, or single and stereo dynamic cameras. We register several moving cameras using a given 3D model from Structure from Motion (SfM), and initialize the tracking given the registration. The camera uncertainty estimate can be efficiently incorporated into a flow-network formulation for tracking. As this is a novel task in the tracking domain, we evaluate our method on a new challenging dataset for tracking with multiple moving cameras and show that our tracking method can effectively deal with independently moving cameras and camera registration noise.

Book ChapterDOI
02 Sep 2014
TL;DR: An orthogonal approach is presented that learns patch representations specifically tailored to every single test exemplar for fine-grained recognition or subordinate categorization, tasks where an algorithm needs to reliably differentiate between visually similar categories, e.g., different bird species.
Abstract: In this paper, we present a new approach for fine-grained recognition or subordinate categorization, tasks where an algorithm needs to reliably differentiate between visually similar categories, e.g., different bird species. While previous approaches aim at learning a single generic representation and models with increasing complexity, we propose an orthogonal approach that learns patch representations specifically tailored to every single test exemplar. Since we query a constant number of images similar to a given test image, we obtain very compact features and avoid large-scale training with all classes and examples. Our learned mid-level features are built on shape and color detectors estimated from discovered patches reflecting small highly discriminative structures in the queried images. We evaluate our approach for fine-grained recognition on the CUB-2011 birds dataset and show that high recognition rates can be obtained by model combination.