
Showing papers presented at "British Machine Vision Conference in 2012"


Proceedings ArticleDOI
01 Sep 2012
TL;DR: The neighbor embedding SR algorithm so designed is shown to give good visual results, comparable to other state-of-the-art methods, while presenting an appreciable reduction of the computational time.
Abstract: This paper describes a single-image super-resolution (SR) algorithm based on nonnegative neighbor embedding. It belongs to the family of single-image example-based SR algorithms, since it uses a dictionary of low resolution (LR) and high resolution (HR) trained patch pairs to infer the unknown HR details. Each LR feature vector in the input image is expressed as the weighted combination of its K nearest neighbors in the dictionary; the corresponding HR feature vector is reconstructed under the assumption that the local LR embedding is preserved. Three key aspects are introduced in order to build a low-complexity competitive algorithm: (i) a compact but efficient representation of the patches (feature representation) (ii) an accurate estimation of the patches by their nearest neighbors (weight computation) (iii) a compact and already built (therefore external) dictionary, which allows a one-step upscaling. The neighbor embedding SR algorithm so designed is shown to give good visual results, comparable to other state-of-the-art methods, while presenting an appreciable reduction of the computational time.
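
A minimal sketch of the neighbor embedding step described above, assuming the dictionary of paired LR/HR features is already built; the function and variable names are illustrative, and a nonnegative least-squares solver stands in for the paper's weight computation.

```python
# Illustrative sketch (not the authors' code): reconstruct one HR patch from an
# LR feature vector via nonnegative neighbor embedding.
import numpy as np
from scipy.optimize import nnls

def reconstruct_hr_patch(lr_feature, lr_dict, hr_dict, k=12):
    """lr_dict: (N, d_lr) LR training features; hr_dict: (N, d_hr) paired HR patches."""
    # 1. Find the K nearest LR neighbours in the external dictionary.
    idx = np.argsort(np.linalg.norm(lr_dict - lr_feature, axis=1))[:k]
    # 2. Express the LR feature as a nonnegative weighted combination of them.
    weights, _ = nnls(lr_dict[idx].T, lr_feature)
    # 3. Assume the local LR embedding is preserved: apply the same weights
    #    to the paired HR patches to infer the unknown HR details.
    return hr_dict[idx].T @ weights
```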

2,059 citations


Proceedings ArticleDOI
01 Jan 2012
TL;DR: This paper presents a single regression model based approach that is able to estimate people count in spatially localised regions and is more scalable without the need for training a large number of regressors proportional to the number of local regions.
Abstract: This paper presents a multi-output regression model for crowd counting in public scenes. Existing counting by regression methods either learn a single model for global counting, or train a large number of separate regressors for localised density estimation. In contrast, our single regression model based approach is able to estimate people count in spatially localised regions and is more scalable without the need for training a large number of regressors proportional to the number of local regions. In particular, the proposed model automatically learns the functional mapping between interdependent low-level features and multi-dimensional structured outputs. The model is able to discover the inherent importance of different features for people counting at different spatial locations. Extensive evaluations on an existing crowd analysis benchmark dataset and a new more challenging dataset demonstrate the effectiveness of our approach.
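
The paper's structured multi-output model is more involved, but the core idea of a single regressor predicting counts for all regions jointly can be sketched with an off-the-shelf multi-output ridge regression; the data here is synthetic and all names are illustrative.

```python
# Toy multi-output regression: one model maps frame-level features to a vector
# of per-region people counts (synthetic data, illustration only).
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 64))                             # low-level features per frame
W = rng.normal(size=(64, 8))
Y = np.clip(X @ W + rng.normal(size=(500, 8)), 0, None)    # counts in 8 local regions

model = Ridge(alpha=1.0).fit(X, Y)                         # a single model for all regions
per_region_counts = model.predict(X[:1])                   # localised estimates for one frame
```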

661 citations


Proceedings ArticleDOI
01 Jan 2012
TL;DR: This work proposes a novel method for re-identification that learns a selection and weighting of mid-level semantic attributes to describe people, an attribute-centric, parts-based feature representation that differs from and complements existing low-level features that rely purely on bottom-up statistics for feature selection.
Abstract: Visually identifying a target individual reliably in a crowded environment observed by a distributed camera network is critical to a variety of tasks in managing business information, border control, and crime prevention. Automatic re-identification of a human candidate from public space CCTV video is challenging due to spatiotemporal visual feature variations and strong visual similarity between different people, compounded by low-resolution and poor quality video data. In this work, we propose a novel method for re-identification that learns a selection and weighting of mid-level semantic attributes to describe people. Specifically, the model learns an attribute-centric, parts-based feature representation. This differs from and complements existing low-level features for re-identification that rely purely on bottom-up statistics for feature selection, which are limited in their ability to reliably discriminate and identify the visual appearance of target people across different camera views and under the partial occlusion caused by crowding. Our experiments demonstrate the effectiveness of our approach compared to existing feature representations when applied to benchmarking datasets.

346 citations


Proceedings ArticleDOI
01 Jan 2012
TL;DR: It is shown that retrieval methods using a selective voting scheme are able to outperform state-of-the-art direct matching methods and how both selective voting and correspondence computation can be accelerated by using a Hamming embedding of feature descriptors.
Abstract: To reliably determine the camera pose of an image relative to a 3D point cloud of a scene, correspondences between 2D features and 3D points are needed. Recent work has demonstrated that directly matching the features against the points outperforms methods that take an intermediate image retrieval step in terms of the number of images that can be localized successfully. Yet, direct matching is inherently less scalable than retrieval-based approaches. In this paper, we therefore analyze the algorithmic factors that cause the performance gap and identify false positive votes as the main source of the gap. Based on a detailed experimental evaluation, we show that retrieval methods using a selective voting scheme are able to outperform state-of-the-art direct matching methods. We explore how both selective voting and correspondence computation can be accelerated by using a Hamming embedding of feature descriptors. Furthermore, we introduce a new dataset with challenging query images for the evaluation of image-based localization.
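
A rough sketch of the two accelerations mentioned above, under the assumption that features carry packed binary signatures: Hamming distances are computed with XOR plus popcount, and a vote is cast for an image only when the distance passes a threshold (the selective part). Names and the threshold value are illustrative.

```python
import numpy as np

def hamming_distance(a, b):
    # a, b: binary signatures packed into uint8 arrays
    return int(np.unpackbits(np.bitwise_xor(a, b)).sum())

def selective_vote(query_sig, db_sig, image_id, votes, max_dist=24):
    # Only close signatures vote, which suppresses the false positive votes
    # identified above as the main source of the performance gap.
    if hamming_distance(query_sig, db_sig) <= max_dist:
        votes[image_id] = votes.get(image_id, 0) + 1
```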

302 citations


Proceedings ArticleDOI
01 Sep 2012
TL;DR: A novel image representation which can properly handle both background and illumination variations and is adapted to the person/face reidentification tasks, avoiding the use of any additional pre-processing steps such as foreground-background separation or face and body part segmentation is proposed.
Abstract: This paper proposes a novel image representation which can properly handle both background and illumination variations. It is therefore adapted to the person/face reidentification tasks, avoiding the use of any additional pre-processing steps such as foreground-background separation or face and body part segmentation. This novel representation relies on the combination of Biologically Inspired Features (BIF) and covariance descriptors used to compute the similarity of the BIF features at neighboring scales. Hence, we will refer to it as the BiCov representation. To show the effectiveness of BiCov, this paper conducts experiments on two person re-identification tasks (VIPeR and ETHZ) and one face verification task (LFW), on which it improves the current state-of-the-art performance.
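
The covariance-descriptor machinery underlying BiCov can be sketched as follows; this is not the authors' implementation, and the log-Euclidean distance is just one common way to compare covariance matrices (the paper compares BIF responses at neighbouring scales).

```python
# Sketch: covariance descriptor of per-pixel features and a standard distance
# between two such descriptors (illustrative, not the paper's exact pipeline).
import numpy as np
from scipy.linalg import logm

def covariance_descriptor(features):
    # features: (n_pixels, d) feature vectors extracted in one region / at one scale
    return np.cov(features, rowvar=False) + 1e-6 * np.eye(features.shape[1])

def log_euclidean_distance(c1, c2):
    # Compare two covariance descriptors, e.g. from neighbouring scales.
    return np.linalg.norm(logm(c1) - logm(c2), 'fro')
```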

291 citations


Proceedings ArticleDOI
01 Sep 2012
TL;DR: This paper proposes a novel face representation based on Local Quantized Patterns that gives state-of-the-art performance without requiring either a metric learning stage or a costly labelled training dataset.
Abstract: This paper proposes a novel face representation based on Local Quantized Patterns (LQP). LQP is a generalization of local pattern features that makes use of vector quantization and a lookup table to let local pattern features have many more pixels and/or quantization levels without sacrificing simplicity and computational efficiency. Our new LQP face representation not only outperforms any other representation on challenging face datasets but performs equally well in the intensity space and orientation space (obtained by applying gradient or Gabor Filters) and hence is intrinsically robust to illumination variations. Extensive experiments on several challenging face recognition datasets (such as FERET and LFW) show that this representation gives state-of-the-art performance (improving the earlier state-of-the-art by around 3%) without requiring either a metric learning stage or a costly labelled training dataset; the comparison of two faces is made by simply computing the cosine similarity between their LQP representations in a projected space.
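
A toy illustration of the vector-quantization-plus-lookup-table idea behind LQP, with a deliberately small pattern size; the cluster count, bit width, and names are illustrative assumptions.

```python
# Toy LQP-style quantization: long local pattern codes are vector-quantized
# once, and a lookup table makes runtime quantization a single array access.
import numpy as np
from sklearn.cluster import KMeans

BITS = 16                                              # toy size for illustration
rng = np.random.default_rng(0)
sampled = rng.integers(0, 2, size=(20000, BITS))       # sampled binary local patterns

codebook = KMeans(n_clusters=64, n_init=4, random_state=0).fit(sampled)

# Precompute codes for every possible pattern so quantization is a table lookup.
all_patterns = (np.arange(2**BITS)[:, None] >> np.arange(BITS)) & 1
lut = codebook.predict(all_patterns.astype(np.float64))

def quantize(pattern_code):
    # pattern_code: integer encoding of one local binary pattern
    return lut[pattern_code]
```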

200 citations


Proceedings ArticleDOI
01 Jan 2012
TL;DR: Improvements of the LO-RANSAC procedure are proposed: the use of a truncated quadratic cost function, the introduction of a limit on the number of inliers used for the least squares computation, and the resolution of several implementation issues.
Abstract: The paper revisits the problem of local optimization for RANSAC. Improvements of the LO-RANSAC procedure are proposed: the use of a truncated quadratic cost function, the introduction of a limit on the number of inliers used for the least squares computation, and the resolution of several implementation issues. The implementation is made publicly available. Extensive experiments demonstrate that the novel algorithm, called LO+-RANSAC, is (1) very stable (almost non-random in nature), (2) very precise in a broad range of conditions, (3) less sensitive to the choice of inlier-outlier threshold, and (4) offers a significantly better starting point for bundle adjustment than the Gold Standard method advocated in the Hartley-Zisserman book.
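
The truncated quadratic cost mentioned above is simple to state; a minimal sketch (names and threshold handling are illustrative):

```python
import numpy as np

def truncated_quadratic_cost(residuals, threshold):
    # Each point contributes its squared residual up to threshold**2 and a
    # constant penalty beyond it, instead of a hard 0/1 inlier count.
    return np.minimum(residuals**2, threshold**2).sum()
```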

197 citations


Proceedings ArticleDOI
01 Jan 2012
TL;DR: This work reports a moving volume KinectFusion with additional algorithms that allow the camera to roam freely and the reconstruction volume to move arbitrarily on-line.
Abstract: Newcombe and Izadi et al.'s KinectFusion [5] is an impressive new algorithm for real-time dense 3D mapping using the Kinect. It is geared towards games and augmented reality, but could also be of great use for robot perception. However, the algorithm is currently limited to a relatively small volume fixed in the world at start up (typically a ∼ 3m cube). This limits applications for perception. Here we report moving volume KinectFusion with additional algorithms that allow the camera to roam freely. We are interested in perception in rough terrain, but the system would also be useful in other applications including free-roaming games and awareness aids for hazardous environments or the visually impaired. Our approach allows the algorithm to handle a volume that moves arbitrarily on-line (Figure 1).

197 citations


Proceedings ArticleDOI
01 Jan 2012
TL;DR: Surprisingly, the performance of problem domain-agnostic mixture models appears to saturate quickly, and there is still room to improve performance with linear classifiers and the existing feature space by improved representations and learning algorithms.
Abstract: Datasets for training object recognition systems are steadily growing in size. This paper investigates the question of whether existing detectors will continue to improve as data grows, or if models are close to saturating due to limited model complexity and the Bayes risk associated with the feature spaces in which they operate. We focus on the popular paradigm of scanning-window templates defined on oriented gradient features, trained with discriminative classifiers. We investigate the performance of mixtures of templates as a function of the number of templates (complexity) and the amount of training data. We find that additional data does help, but only with correct regularization and treatment of noisy examples or “outliers” in the training data. Surprisingly, the performance of problem domain-agnostic mixture models appears to saturate quickly (∼10 templates and ∼100 positive training examples per template). However, compositional mixtures (implemented via composed parts) give much better performance because they share parameters among templates, and can synthesize new templates not encountered during training. This suggests there is still room to improve performance with linear classifiers and the existing feature space by improved representations and learning algorithms.

197 citations


Proceedings ArticleDOI
01 Jan 2012
TL;DR: A method of face verification that takes advantage of a reference set of faces, disjoint by identity from the test faces, labeled with identity and face part locations, to perform an “identity-preserving” alignment.
Abstract: We propose a method of face verification that takes advantage of a reference set of faces, disjoint by identity from the test faces, labeled with identity and face part locations. The reference set is used in two ways. First, we use it to perform an “identity-preserving” alignment, warping the faces in a way that reduces differences due to pose and expression but preserves differences that indicate identity. Second, using the aligned faces, we learn a large set of identity classifiers, each trained on images of just two people. We call these “Tom-vs-Pete” classifiers to stress their binary nature. We assemble a collection of these classifiers able to discriminate among a wide variety of subjects and use their outputs as features in a same-or-different classifier on face pairs. We evaluate our method on the Labeled Faces in the Wild benchmark, achieving an accuracy of 93.10%, significantly improving on the published state of the art.
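
A small sketch of the "Tom-vs-Pete" idea, assuming aligned face descriptors X with identity labels ids; the pair sampling, the choice of a linear SVM, and all names are illustrative assumptions.

```python
# Illustrative pairwise-identity features: train many two-person classifiers
# and use their decision values as a feature vector for a face.
import itertools
import numpy as np
from sklearn.svm import LinearSVC

def train_pairwise_classifiers(X, ids, n_pairs=50, seed=0):
    rng = np.random.default_rng(seed)
    pairs = list(itertools.combinations(np.unique(ids), 2))
    picks = rng.choice(len(pairs), size=min(n_pairs, len(pairs)), replace=False)
    classifiers = []
    for a, b in (pairs[i] for i in picks):
        mask = (ids == a) | (ids == b)
        classifiers.append(LinearSVC().fit(X[mask], (ids[mask] == a).astype(int)))
    return classifiers

def pairwise_features(face_descriptor, classifiers):
    # The outputs of all two-person classifiers form the face's feature vector,
    # which a same-or-different classifier can then consume for pairs of faces.
    return np.array([c.decision_function(face_descriptor[None])[0] for c in classifiers])
```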

162 citations


Proceedings ArticleDOI
01 Jan 2012
TL;DR: This work shows that issuing multiple queries significantly improves recall and enables the system to find quite challenging occurrences of the queried object, and is evaluated quantitatively on the standard Oxford Buildings benchmark dataset where it achieves very high retrieval performance.
Abstract: The aim of large scale specific-object image retrieval systems is to instantaneously find images that contain the query object in the image database. Current systems, for example Google Goggles, concentrate on querying using a single view of an object, e.g. a photo a user takes with his mobile phone, in order to answer the question “what is this?”. Here we consider the somewhat converse problem of finding all images of an object given that the user knows what he is looking for; so the input modality is text, not an image. This problem is useful in a number of settings, for example media production teams are interested in searching internal databases for images or video footage to accompany news reports and newspaper articles. Given a textual query (e.g. “coca cola bottle”), our approach is to first obtain multiple images of the queried object using textual Google image search. These images are then used to visually query the target database to discover images containing the object of interest. We compare a number of different methods for combining the multiple query images, including discriminative learning. We show that issuing multiple queries significantly improves recall and enables the system to find quite challenging occurrences of the queried object. The system is evaluated quantitatively on the standard Oxford Buildings benchmark dataset where it achieves very high retrieval performance, and also qualitatively on the TrecVid 2011 known-item search dataset.
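
One simple way to combine several query images, max-pooling their similarity scores per database image, can be sketched as below; descriptors are assumed to be L2-normalised, and this fusion rule is only one of the combination methods the paper compares.

```python
import numpy as np

def multi_query_scores(query_descs, db_descs):
    # query_descs: (q, d) descriptors of images returned by the textual search
    # db_descs:    (n, d) descriptors of the target database images
    sims = db_descs @ query_descs.T        # cosine similarities (L2-normalised inputs)
    return sims.max(axis=1)                # best match over the multiple queries
```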

Proceedings ArticleDOI
01 Jan 2012
TL;DR: In this article, the authors unify PatchMatch with max-product particle belief propagation (MP-PBP) into a new algorithm, PMBP, for continuous labelling problems, which is more accurate than PatchMatch and orders of magnitude faster than MP-PBP.
Abstract: PatchMatch (PM) is a simple, yet very powerful and successful method for optimizing continuous labelling problems. The algorithm has two main ingredients: the update of the solution space by sampling and the use of the spatial neighbourhood to propagate samples. We show how these ingredients are related to steps in a specific form of belief propagation (BP) in the continuous space, called max-product particle BP (MP-PBP). However, MP-PBP has thus far been too slow to allow complex state spaces. In the case where all nodes share a common state space and the smoothness prior favours equal values, we show that unifying the two approaches yields a new algorithm, PMBP, which is more accurate than PM and orders of magnitude faster than MP-PBP. To illustrate the benefits of our PMBP method we have built a new stereo matching algorithm with unary terms which are borrowed from the recent PM Stereo work and novel realistic pairwise terms that provide smoothness. We have experimentally verified that our method is an improvement over state-of-the-art techniques at sub-pixel accuracy level.
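
The two PatchMatch ingredients named above, spatial propagation and random resampling of the solution space, can be sketched for a continuous labelling problem as follows; the cost callback, label layout, and parameters are illustrative assumptions, not the PMBP implementation.

```python
import numpy as np

def patchmatch_iteration(labels, cost, rng, search_radius=1.0):
    # labels: (H, W, d) continuous label vectors; cost(y, x, label) -> float
    H, W, _ = labels.shape
    for y in range(H):
        for x in range(W):
            best = labels[y, x].copy()
            # Propagation: try the labels of already-updated spatial neighbours.
            for ny, nx in ((y - 1, x), (y, x - 1)):
                if ny >= 0 and nx >= 0 and cost(y, x, labels[ny, nx]) < cost(y, x, best):
                    best = labels[ny, nx].copy()
            # Random search: sample a perturbed candidate around the current best.
            candidate = best + rng.normal(scale=search_radius, size=best.shape)
            if cost(y, x, candidate) < cost(y, x, best):
                best = candidate
            labels[y, x] = best
    return labels
```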

Proceedings ArticleDOI
01 Jan 2012
TL;DR: In this article, the authors propose a people detector tailored to various occlusion levels and leverage the fact that person/person occlusions result in very characteristic appearance patterns that can help to improve detection results.
Abstract: We consider the problem of detection and tracking of multiple people in crowded street scenes. State-of-the-art methods perform well in scenes with relatively few people, but are severely challenged by scenes with many subjects that partially occlude each other. This limitation is due to the fact that current people detectors fail when persons are strongly occluded. We observe that typical occlusions are due to overlaps between people and propose a people detector tailored to various occlusion levels. Instead of treating partial occlusions as distractions, we leverage the fact that person/person occlusions result in very characteristic appearance patterns that can help to improve detection results. We demonstrate the performance of our occlusion-aware person detector on a new dataset of people with controlled but severe levels of occlusion and on two challenging publicly available benchmarks outperforming single person detectors in each case.

Proceedings ArticleDOI
12 Sep 2012
TL;DR: This work presents a method for view-invariant action recognition based on sparse representations using a transferable dictionary pair, and extends the approach to transferring an action model learned from multiple source views to one target view.
Abstract: Discriminative appearance features are effective for recognizing actions in a fixed view, but generalize poorly to changes in viewpoint. We present a method for view-invariant action recognition based on sparse representations using a transferable dictionary pair. A transferable dictionary pair consists of two dictionaries that correspond to the source and target views respectively. The two dictionaries are learned simultaneously from pairs of videos taken at different views and aim to encourage each video in the pair to have the same sparse representation. Thus, the transferable dictionary pair links features between the two views that are useful for action recognition. Both unsupervised and supervised algorithms are presented for learning transferable dictionary pairs. Using the sparse representation as features, a classifier built in the source view can be directly transferred to the target view. We extend our approach to transferring an action model learned from multiple source views to one target view. We demonstrate the effectiveness of our approach on the multi-view IXMAS data set. Our results compare favorably to the state of the art.

Proceedings ArticleDOI
01 Jan 2012
TL;DR: A framework is developed for auto-calibrating a camera, rendering 3D models from the viewpoint from which an image was taken, and computing a similarity measure between each 3D model and an input image.
Abstract: In this paper, we propose a data-driven approach to leverage repositories of 3D models for scene understanding. Our ability to relate what we see in an image to a large collection of 3D models allows us to transfer information from these models, creating a rich understanding of the scene. We develop a framework for auto-calibrating a camera, rendering 3D models from the viewpoint from which an image was taken, and computing a similarity measure between each 3D model and an input image. We demonstrate this data-driven approach in the context of geometry estimation and show the ability to find the identities and poses of objects in a scene. Additionally, we present a new dataset with annotated scene geometry. This data allows us to measure the performance of our algorithm in 3D, rather than in the image plane. Recently, large online repositories of 3D data such as Google 3D Warehouse have emerged. These resources, as well as the advent of low-cost depth cameras, have sparked interest in geometric data-driven algorithms. At the same time, researchers have (re-)started investigating the feasibility of recovering geometric information, e.g., the layout of a scene. The success of data-driven techniques for tasks based on appearance features, e.g., interpreting an input image by retrieving similar scenes, suggests that similar techniques based on geometric data could be equally effective for 3D scene interpretation tasks. In fact, the motivation for data-driven techniques is the same for 3D models as for images: real-world environments are not random; the sizes, shapes, orientations, locations and co-location of objects are constrained in complicated ways that can be represented given enough data. In principle, estimating 3D scene structure from data would help constrain bottom-up vision processes. For example, in Figure 1, one nightstand is fully visible; however, the second nightstand is almost fully occluded. Although a bottom-up detector would likely fail to identify the second nightstand since only a few pixels are visible, our method of finding the best matching 3D model is able to detect these types of occluded objects. This is not a trivial extension of the image-based techniques. Generalizing data-driven ideas raises new fundamental technical questions never addressed before in this context: What features should be used to compare input images and 3D models? Given these features, what mechanism should be used to rank the most similar 3D models to the input scene? Even assuming that this ranking is correct, how can we transfer information from the 3D models to the input image? To address these questions, we develop a set of features that can be used to compare an input image with a 3D model and design a mechanism for finding the best matching 3D scene using support vector ranking. We show the feasibility of these techniques for transferring the geometry of objects in indoor scenes from 3D models to an input image. Naturally, we cannot compare 3D models directly to a 2D image. Thus, we first estimate the intrinsic and extrinsic parameters of the camera and use this information to render each of the 3D models from the same view as the image was taken from. We then compute similarity features between the models and the input image. Lastly, each of the 3D models is ranked based on how similar its rendering is to the input image using a learned feature weighting. See Figure 2 for an overview of this process.
Please read our full paper for a detailed explanation of our data-driven geometry estimation algorithm and results.

Proceedings ArticleDOI
01 Jan 2012
TL;DR: This paper proposes two novel methods for fine-grained classification, both based on part information, as well as a new fine-grained category data set of car types, and demonstrates superior performance of these methods over state-of-the-art classifiers.
Abstract: Fine-grained categorization of object classes is receiving increased attention, since it promises to automate classification tasks that are difficult even for humans, such as the distinction between different animal species. In this paper, we consider fine-grained categorization for a different reason: following the intuition that fine-grained categories encode metric information, we aim to generate metric constraints from fine-grained category predictions, for the benefit of 3D scene-understanding. To that end, we propose two novel methods for fine-grained classification, both based on part information, as well as a new fine-grained category data set of car types. We demonstrate superior performance of our methods to state-of-the-art classifiers, and show first promising results for estimating the depth of objects from fine-grained category predictions from a monocular camera.

Proceedings ArticleDOI
03 Sep 2012
TL;DR: A novel learning-based approach for video sequence classification that automatically learns a sparse shift-invariant representation of the local 2D+t salient information, without any use of prior knowledge is presented.
Abstract: We present in this paper a novel learning-based approach for video sequence classification. Contrary to the dominant methodology, which relies on hand-crafted features that are manually engineered to be optimal for a specific task, our neural model automatically learns a sparse shift-invariant representation of the local 2D+t salient information, without any use of prior knowledge. To that aim, a spatio-temporal convolutional sparse auto-encoder is trained to project a given input in a feature space, and to reconstruct it from its projection coordinates. Learning is performed in an unsupervised manner by minimizing a global parametrized objective function. The sparsity is ensured by adding a sparsifying logistic between the encoder and the decoder, while the shift-invariance is handled by including an additional hidden variable to the objective function. The temporal evolution of the obtained sparse features is learned by a long short-term memory recurrent neural network trained to classify each sequence. We show that, since the feature learning process is problem-independent, the model achieves outstanding performances when applied to two different problems, namely human action and facial expression recognition. Obtained results are superior to the state of the art on the GEMEP-FERA dataset and among the very best on the KTH dataset.

Proceedings ArticleDOI
03 Sep 2012
TL;DR: This work introduces a divisive clustering algorithm that can efficiently extract a hierarchy over a large number of local trajectories and provides an efficient positive definite kernel that computes the structural and visual similarity of two tree decompositions by relying on models of their edges.
Abstract: We address the problem of recognizing complex activities, such as pole vaulting, which are characterized by the composition of a large and variable number of different spatio-temporal parts. We represent a video as a hierarchy of mid-level motion components. This hierarchy is a data-driven decomposition specific to each video. We introduce a divisive clustering algorithm that can efficiently extract a hierarchy over a large number of local trajectories. We use this structure to represent a video as an unordered binary tree. This tree is modeled by nested histograms of local motion features. We provide an efficient positive definite kernel that computes the structural and visual similarity of two tree decompositions by relying on models of their edges. Contrary to most approaches based on action decompositions, we propose to use the full hierarchical action structure instead of selecting a small fixed number of parts. We present experimental results on two recent challenging benchmarks that focus on complex activities and show that our kernel on per-video hierarchies allows to efficiently discriminate between complex activities sharing common action parts. Our approach improves over the state of the art, including unstructured activity models, baselines using other motion decomposition algorithms, graph matching, and latent models explicitly selecting a fixed number of parts.

Proceedings ArticleDOI
01 Jan 2012
TL;DR: This work deals with the recovery of dense depth information from thermal (far infrared spectrum) and optical (visible spectrum) image pairs where large differences in the characteristics of image pairs make this task significantly more challenging than the common stereo case.
Abstract: Here we address the problem of scene depth recovery within cross-spectral stereo imagery (each image sensed over a differing spectral range). We compare several robust matching techniques which are able to capture local similarities between the structure of cross-spectral images and a range of stereo optimisation techniques for the computation of valid dense depth estimates for this case. As the performance of standard optical camera systems can be severely affected by environmental conditions the use of combined sensing systems operating in differing parts of the electromagnetic spectrum is increasingly common [5]. As a result, an attractive solution is the combination of both optical and thermal images in many sensing and surveillance scenarios as the complementary nature of both modalities can be exploited and the individual drawbacks largely compensated. Despite the inherent stereo setup of this common two sensor deployment, in practical scenarios it is rarely exploited. Here, we specifically deal with the recovery of dense depth information from thermal (far infrared spectrum) and optical (visible spectrum) image pairs where large differences in the characteristics of image pairs make this task significantly more challenging than the common stereo case (Figure 1A).
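
Standard intensity-based matching costs fail across spectra; mutual information is a classic robust local similarity for cross-modal matching (whether it is among the techniques compared in the paper is an assumption), sketched below with illustrative binning.

```python
# Sketch of a mutual-information patch similarity for cross-spectral matching.
import numpy as np

def mutual_information(patch_a, patch_b, bins=16):
    joint, _, _ = np.histogram2d(patch_a.ravel(), patch_b.ravel(), bins=bins)
    pxy = joint / joint.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    nz = pxy > 0
    return float((pxy[nz] * np.log(pxy[nz] / (px[:, None] * py[None, :])[nz])).sum())
```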

Proceedings ArticleDOI
01 Jan 2012
TL;DR: The Exemplar SVM framework is used to produce a better representation of the query in an unsupervised way, and the document descriptors are precomputed and compressed with Product Quantization.
Abstract: In this paper we propose an unsupervised segmentation-free method for word spotting in document images. Documents are represented with a grid of HOG descriptors, and a sliding window approach is used to locate the document regions that are most similar to the query. We use the Exemplar SVM framework to produce a better representation of the query in an unsupervised way. Finally, the document descriptors are precomputed and compressed with Product Quantization. This offers two advantages: first, a large number of documents can be kept in RAM memory at the same time. Second, the sliding window becomes significantly faster since distances between quantized HOG descriptors can be precomputed. Our results significantly outperform other segmentation-free methods in the literature, both in accuracy and in speed and memory usage.
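
A compact sketch of the Product Quantization trick referenced above: once per-subvector distance tables are built for a query, the distance to any compressed descriptor is a sum of table lookups. The codebook layout and names are illustrative.

```python
import numpy as np

def pq_encode(desc, codebooks):
    # codebooks: list of (k, d_sub) centroid arrays, one per descriptor sub-block
    codes, start = [], 0
    for cb in codebooks:
        sub = desc[start:start + cb.shape[1]]
        codes.append(int(np.argmin(np.linalg.norm(cb - sub, axis=1))))
        start += cb.shape[1]
    return codes

def pq_distance(query, codes, codebooks):
    # Precompute squared distances from each query sub-block to every centroid...
    tables, start = [], 0
    for cb in codebooks:
        sub = query[start:start + cb.shape[1]]
        tables.append(np.sum((cb - sub) ** 2, axis=1))
        start += cb.shape[1]
    # ...so comparing against any encoded descriptor is just a sum of lookups.
    return sum(table[c] for table, c in zip(tables, codes))
```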

Proceedings ArticleDOI
01 Sep 2012
TL;DR: The method is geared for fast and scalable learning and detection by combining tractable extraction of edgelet constellations with library lookup based on rotation- and scale-invariant descriptors, and is generative, enabling more objects to be learnt without the need for re-training.
Abstract: We present a method for the learning and detection of multiple rigid texture-less 3D objects intended to operate at frame rate speeds for video input. The method is geared for fast and scalable learning and detection by combining tractable extraction of edgelet constellations with library lookup based on rotation- and scale-invariant descriptors. The approach learns object views in real-time, and is generative, enabling more objects to be learnt without the need for re-training. During testing, a random sample of edgelet constellations is tested for the presence of known objects. We perform testing of single and multi-object detection on a 30-object dataset, showing detections of any of them within milliseconds of the object becoming visible. The results show the scalability of the approach and its framerate performance.

Proceedings ArticleDOI
01 Jan 2012
TL;DR: An online SfM method is proposed that integrates wide-baseline still images into a consistent reconstruction in an online fashion and is suited both for large-scale reconstructions obtained by flying micro aerial vehicles and for small indoor environments.
Abstract: The quality and completeness of 3D models obtained by Structure-from-Motion (SfM) heavily depend on the image acquisition process. If the user gets feedback about the reconstruction quality already during the acquisition, he can optimize this process. The goal of this paper is to support a user during image acquisition by giving online feedback of the current reconstruction quality. We propose an online SfM method that integrates wide-baseline still images in an online fashion into a consistent reconstruction, and we derive a surface model given the SfM point cloud. To guide the user to scene parts that are not captured very well, we colour the mesh according to redundancy and resolution information. In the experiments, we show that our approach makes the final SfM result predictable already during image acquisition. The method is suited for large-scale reconstructions as obtained by flying micro aerial vehicles as well as for small indoor environments. We propose a method that supports a user in the acquisition process in two ways: (a) sparse online SfM with accuracy close to offline methods and (b) surface extraction and quality visualization. The workflow of our method is shown in Figure 1.

Proceedings ArticleDOI
01 Jan 2012
TL;DR: In this article, a transfer learning method based on learning to rank is proposed, which effectively transfers a model for automatic annotation of object location from an auxiliary dataset to a target dataset with completely unrelated object categories.
Abstract: Most existing approaches to training object detectors rely on fully supervised learning, which requires the tedious manual annotation of object location in a training set. Recently there has been an increasing interest in developing weakly supervised approaches to detector training, where the object location is not manually annotated but automatically determined based on binary (weak) labels indicating if a training image contains the object. This is a challenging problem because each image can contain many candidate object locations which partially overlap the object of interest. Existing approaches focus on how to best utilise the binary labels for object location annotation. In this paper we propose to solve this problem from a very different perspective by casting it as a transfer learning problem. Specifically, we formulate a novel transfer learning method based on learning to rank, which effectively transfers a model for automatic annotation of object location from an auxiliary dataset to a target dataset with completely unrelated object categories. We show that our approach outperforms existing state-of-the-art weakly supervised approaches to annotating objects in the challenging VOC dataset.

Proceedings ArticleDOI
01 Jan 2012
TL;DR: A fast single-image deblurring algorithm to remove non-uniform blur is proposed, together with an iterative method that refines the camera motion estimation and introduces perturbation at each iteration to obtain robust solutions.
Abstract: Recent non-uniform deblurring algorithms model the blurred image as a weighted sum of transformed copies Kθ L of the latent image, taken over a set S of sampled camera poses, where Kθ is the matrix that warps latent image L to the transformed copy at a sampled pose θ. While these algorithms show promising results, they entail high computational cost as the high-dimensional camera motion space and the latent image have to be computed during the iterative optimization procedures. In this paper, we propose a fast single-image deblurring algorithm to remove non-uniform blur. We first introduce an initialization method that facilitates convergence and avoids local minima of the formulated optimization problem. We then propose a new camera motion estimation method which optimizes over a small set of pose weights of a constrained camera pose subspace at a time rather than using the entire space. We develop an iterative method to refine the camera motion estimation and introduce perturbation at each iteration to obtain robust solutions. Fig. 1 summarizes the main steps of our method.
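
A toy rendering of the blur model the abstract refers to, with simple translations standing in for the pose-induced warps Kθ (the actual warps are full camera-pose transformations); the names and normalisation are illustrative.

```python
# Synthetic non-uniform blur as a weighted sum of warped copies of the latent
# image; translations replace the true pose warps purely for illustration.
import numpy as np
from scipy.ndimage import shift

def synthesize_blur(latent, poses, weights):
    # poses: list of (dy, dx) samples; weights: the time spent at each pose
    blurred = np.zeros_like(latent, dtype=float)
    for (dy, dx), w in zip(poses, weights):
        blurred += w * shift(latent.astype(float), (dy, dx), order=1)
    return blurred / np.sum(weights)
```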

Proceedings ArticleDOI
01 Jan 2012
TL;DR: A novel bandlet-based edge detector is introduced which is quite effective at obtaining text edges in images while dismissing noisy and foliage edges; the experimental results indicate a high performance for the proposed method and the effectiveness of the proposed edge detector for text localization.
Abstract: In this paper, we propose a text detection method based on a feature vector generated from connected components produced via the stroke width transform. Several properties, such as variant directionality of gradient of text edges, high contrast with background, and geometric properties of text components, jointly with the properties found by the stroke width transform, are considered in the formation of feature vectors. Then, k-means clustering is performed by employing the feature vectors in a bid to distinguish text and non-text components. Finally, the obtained text components are grouped and the remaining components are discarded. Since the stroke width transform relies on a precise edge detection scheme, we introduce a novel bandlet-based edge detector which is quite effective at obtaining text edges in images while dismissing noisy and foliage edges. Our experimental results indicate a high performance for the proposed method and the effectiveness of our proposed edge detector for text localization purposes.
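
The clustering step described above can be sketched with k-means over the connected-component feature vectors; deciding which cluster is "text" requires a heuristic, and the column used for that decision here is purely an assumption for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def split_text_components(features):
    # features: (n_components, d) vectors built from stroke-width statistics,
    # edge-gradient directionality, contrast and geometric properties
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(features)
    # Assume column 0 holds stroke-width variance and that text strokes vary less.
    v0 = features[km.labels_ == 0, 0].var()
    v1 = features[km.labels_ == 1, 0].var()
    return km.labels_ == (0 if v0 < v1 else 1)
```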

Proceedings ArticleDOI
01 Sep 2012
TL;DR: Presented at the 23rd British Machine Vision Conference (BMVC 2012), 3-7 September 2012, Guildford, Surrey, UK.
Abstract: Presented at the 23rd British Machine Vision Conference (BMVC 2012), 3-7 September 2012, Guildford, Surrey, UK.

Proceedings ArticleDOI
01 Jan 2012
TL;DR: A method for relocalisation of a freely moving RGBD camera in small workspaces that uses a general regression over a set of synthetic views distributed throughout an informed estimate of possible camera viewpoints to estimate the full 6D camera pose.
Abstract: With the advent of real-time dense scene reconstruction from handheld cameras, one key aspect to enable robust operation is the ability to relocalise in a previously mapped environment or after loss of measurement. Tasks such as operating on a workspace, where moving objects and occlusions are likely, require a recovery competence in order to be useful. For RGBD cameras, this must also include the ability to relocalise in areas with reduced visual texture. This paper describes a method for relocalisation of a freely moving RGBD camera in small workspaces. The approach combines both 2D image and 3D depth information to estimate the full 6D camera pose. The method uses a general regression over a set of synthetic views distributed throughout an informed estimate of possible camera viewpoints. The resulting relocalisation is accurate and works faster than framerate and the system’s performance is demonstrated through a comparison against visual and geometric feature matching relocalisation techniques on sequences with moving objects and minimal texture.

Proceedings ArticleDOI
01 Jan 2012
TL;DR: This paper introduces a representation based on image gradient directions near robust edges that correspond to characteristic facial features, achieving by far the best performance reported in the literature for this evaluation setup.
Abstract: Our aim in this paper is to robustly match frontal faces in the presence of extreme illumination changes, using only a single training image per person and a single probe image. In the illumination conditions we consider, which include those with the dominant light source placed behind and to the side of the user, directly above and pointing downwards or indeed below and pointing upwards, this is a most challenging problem. The presence of sharp cast shadows, large poorly illuminated regions of the face, quantum and quantization noise and other nuisance effects, makes it difficult to extract a sufficiently discriminative yet robust representation. We introduce a representation which is based on image gradient directions near robust edges which correspond to characteristic facial features. Robust edges are extracted using a cascade of processing steps, each of which seeks to harness further discriminative information or normalize for a particular source of extra-personal appearance variability. The proposed representation was evaluated on the extremely difficult YaleB data set. Unlike most of the previous work we include all available illuminations, perform training using a single image per person and match these also to a single probe image. In this challenging evaluation setup, the proposed gradient edge map achieved 0.8% error rate, demonstrating a nearly perfect receiver operating characteristic curve behaviour. This is by far the best performance achieved in this setup reported in the literature, the best performing methods previously proposed attaining error rates of approximately 6–7%.
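
A rough approximation of the representation's core ingredient, gradient directions kept only near strong edges; the paper's cascade of robust edge extraction steps is far more involved, and the percentile threshold here is an assumption.

```python
import numpy as np
from scipy import ndimage

def gradient_direction_edge_map(image, edge_percentile=90):
    gx = ndimage.sobel(image.astype(float), axis=1)
    gy = ndimage.sobel(image.astype(float), axis=0)
    magnitude = np.hypot(gx, gy)
    direction = np.arctan2(gy, gx)
    edges = magnitude >= np.percentile(magnitude, edge_percentile)
    # Keep gradient directions only where edges are deemed robust.
    return np.where(edges, direction, 0.0), edges
```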

Proceedings ArticleDOI
01 Jan 2012
TL;DR: A novel template-matching split function using DOT is proposed for Random Forest; it divides the feature space in a non-linear manner yet has very low complexity, requiring only binary bit-wise operations.
Abstract: In this paper, we present a new pedestrian detection method combining Random Forest and Dominant Orientation Templates (DOT) to achieve state-of-the-art accuracy and, more importantly, to accelerate run-time speed. DOT can be considered as a binary version of the Histogram of Oriented Gradients (HOG) and therefore provides time-efficient properties. However, since DOT discards magnitude information, it degrades the detection rate when directly incorporated. We propose a novel template-matching split function using DOT for Random Forest. It divides the feature space in a non-linear manner but has very low complexity, requiring only binary bit-wise operations. Experiments demonstrate that our method provides much superior speed with comparable accuracy to state-of-the-art pedestrian detectors. By combining a holistic and a patch-based detector in a cascade manner, we accelerate the detection speed of Hough Forest, a prior art that uses Random Forest and HOG, by about 20 times. The obtained speed is 5 frames per second for 640×480 images with 24 scales.
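
The bit-wise flavour of a DOT comparison can be sketched as below, assuming each cell stores a small bitmask of quantised dominant orientations; this is the general DOT matching idea rather than the paper's split function itself.

```python
import numpy as np

def dot_similarity(template_a, template_b):
    # template_*: uint8 arrays, one orientation bitmask per cell
    shared = np.bitwise_and(template_a, template_b)
    return int(np.unpackbits(shared).sum())   # number of shared orientation bits
```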

Proceedings ArticleDOI
01 Sep 2012
TL;DR: The idea of a training-free texture classification scheme is advocated and demonstrated not only on traditional texture benchmarks, but also for the identification of materials and of the writers of musical scores.
Abstract: We advocate the idea of a training-free texture classification scheme. This we demonstrate not only for traditional texture benchmarks, but also for the identification of materials and of the writers of musical scores. State-of-the-art methods operate using local descriptors, their intermediate representation over trained dictionaries, and classifiers. For the first two steps, we work with pooled local Gaussian derivative filters and a small dictionary not obtained through training, respectively. Moreover, we build a multi-level representation similar to a spatial pyramid which captures region-level information. An extra step robustifies the final representation by means of comparative reasoning. As to the classification step, we achieve robust results using nearest neighbor classification, and state-of-the-art results with a collaborative strategy. These classifiers also need no training. To the best of our knowledge, the proposed system yields top results on five standard benchmarks: 99.4% for CUReT, 97.3% for Brodatz, 99.5% for UMD, 99.4% for KTHTIPS, and 99% for UIUC. We significantly improve the state-of-the-art for three other benchmarks: KTHTIPS2b 66.3% (from 58.1%), CVC-MUSCIMA 99.8% (from 77.0%), and FMD 55.8% (from 54%).
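
The "collaborative strategy" mentioned above is in the spirit of collaborative-representation classification; a minimal sketch (the regularisation value and names are assumptions) is:

```python
# Collaborative-representation style classification: code the query over the
# whole gallery with ridge regression, assign the class with smallest residual.
import numpy as np

def collaborative_classify(query, gallery, labels, lam=1e-2):
    G = gallery.T                                            # (d, n)
    alpha = np.linalg.solve(G.T @ G + lam * np.eye(G.shape[1]), G.T @ query)
    residuals = {c: np.linalg.norm(query - G[:, labels == c] @ alpha[labels == c])
                 for c in np.unique(labels)}
    return min(residuals, key=residuals.get)
```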