Showing papers presented at "British Machine Vision Conference in 2013"


Proceedings ArticleDOI
01 Jan 2013
TL;DR: A novel and fast multiscale feature detection and description approach that exploits the benefits of nonlinear scale spaces and introduces a Modified-Local Difference Binary (M-LDB) descriptor that is highly efficient, exploits gradient information from the nonlinear scale space, is scale and rotation invariant, and has low storage requirements.
Abstract: We propose a novel and fast multiscale feature detection and description approach that exploits the benefits of nonlinear scale spaces. Previous attempts to detect and describe features in nonlinear scale spaces, such as KAZE [1] and BFSIFT [6], are highly time consuming due to the computational burden of creating the nonlinear scale space. In this paper we propose to use recent numerical schemes called Fast Explicit Diffusion (FED) [3, 4], embedded in a pyramidal framework, to dramatically speed up feature detection in nonlinear scale spaces. In addition, we introduce a Modified-Local Difference Binary (M-LDB) descriptor that is highly efficient, exploits gradient information from the nonlinear scale space, is scale and rotation invariant, and has low storage requirements. Our features are called Accelerated-KAZE (A-KAZE) due to the dramatic speed-up introduced by FED schemes embedded in a pyramidal framework.
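A-KAZE has since shipped in OpenCV, so a minimal detection-and-matching sketch can be written against that implementation (file names are placeholders; assumes OpenCV ≥ 3.0, whose `cv2.AKAZE_create` defaults to the M-LDB binary descriptor):

```python
import cv2

# Two grayscale images to match (placeholder paths).
img1 = cv2.imread("frame1.png", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("frame2.png", cv2.IMREAD_GRAYSCALE)

# A-KAZE detector/descriptor; defaults use the (M-)LDB binary descriptor.
akaze = cv2.AKAZE_create()
kp1, des1 = akaze.detectAndCompute(img1, None)
kp2, des2 = akaze.detectAndCompute(img2, None)

# Binary descriptors are compared with the Hamming distance.
matcher = cv2.BFMatcher(cv2.NORM_HAMMING)
matches = matcher.knnMatch(des1, des2, k=2)

# Lowe-style ratio test keeps only distinctive matches.
good = [m for m, n in matches if m.distance < 0.8 * n.distance]
print(len(good), "good matches")
```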

917 citations


Proceedings ArticleDOI
01 Jan 2013
TL;DR: This paper shows that Fisher vectors on densely sampled SIFT features are capable of achieving state-of-the-art face verification performance on the challenging “Labeled Faces in the Wild” benchmark, and shows that a compact descriptor can be learnt from them using discriminative metric learning.
Abstract: Several recent papers on automatic face verification have significantly raised the performance bar by developing novel, specialised representations that outperform standard features such as SIFT for this problem. This paper makes two contributions: first, and somewhat surprisingly, we show that Fisher vectors on densely sampled SIFT features, i.e. an off-the-shelf object recognition representation, are capable of achieving state-of-the-art face verification performance on the challenging “Labeled Faces in the Wild” benchmark; second, since Fisher vectors are very high dimensional, we show that a compact descriptor can be learnt from them using discriminative metric learning. This compact descriptor has a better recognition accuracy and is very well suited to large scale identification tasks.
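As a sketch of the representation itself (not the authors' exact pipeline, which adds dense multi-scale SIFT sampling and discriminative metric learning), a Fisher vector over local descriptors with a diagonal-covariance GMM follows the standard gradient formulas with power and L2 normalisation:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fisher_vector(X, gmm):
    """Fisher vector w.r.t. GMM means and variances (diagonal covariances)."""
    X = np.atleast_2d(X)                          # (N, D) local descriptors
    N = X.shape[0]
    q = gmm.predict_proba(X)                      # (N, K) soft assignments
    mu, var, w = gmm.means_, gmm.covariances_, gmm.weights_

    diff = (X[:, None, :] - mu[None]) / np.sqrt(var)[None]   # (N, K, D)
    g_mu = (q[..., None] * diff).sum(0) / (N * np.sqrt(w)[:, None])
    g_var = (q[..., None] * (diff ** 2 - 1)).sum(0) / (N * np.sqrt(2 * w)[:, None])

    fv = np.hstack([g_mu.ravel(), g_var.ravel()])
    fv = np.sign(fv) * np.sqrt(np.abs(fv))        # power normalisation
    return fv / (np.linalg.norm(fv) + 1e-12)      # L2 normalisation

# Toy usage with random vectors standing in for dense SIFT descriptors.
rng = np.random.default_rng(0)
gmm = GaussianMixture(n_components=16, covariance_type="diag").fit(
    rng.normal(size=(5000, 64)))
print(fisher_vector(rng.normal(size=(300, 64)), gmm).shape)  # (2 * 16 * 64,)
```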

488 citations


Proceedings ArticleDOI
01 Jan 2013
TL;DR: This work argues that a per-image score instead of one computed over the entire dataset brings a lot more insight, and proposes new ways to evaluate semantic segmentation.
Abstract: In this work, we consider the evaluation of the semantic segmentation task. We discuss the strengths and limitations of the few existing measures, and propose new ways to evaluate semantic segmentation. First, we argue that a per-image score instead of one computed over the entire dataset brings a lot more insight. Second, we propose to take contours more carefully into account. Based on the conducted experiments, we suggest best practices for the evaluation. Finally, we present a user study we conducted to better understand how the quality of image segmentations is perceived by humans.
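A minimal sketch of the per-image versus dataset-level distinction for a single class (toy label maps; the paper's exact measures and contour handling are not reproduced here):

```python
import numpy as np

def iou(pred, gt, cls):
    """Intersection-over-union of one class for one prediction/ground-truth pair."""
    inter = np.logical_and(pred == cls, gt == cls).sum()
    union = np.logical_or(pred == cls, gt == cls).sum()
    return inter / union if union else np.nan

# preds/gts: lists of per-image label maps (toy data here).
rng = np.random.default_rng(1)
preds = [rng.integers(0, 2, (8, 8)) for _ in range(4)]
gts = [rng.integers(0, 2, (8, 8)) for _ in range(4)]

# Dataset-level score: pool all pixels, then compute one IoU.
pooled_pred = np.concatenate([p.ravel() for p in preds])
pooled_gt = np.concatenate([g.ravel() for g in gts])
print("dataset IoU:", iou(pooled_pred, pooled_gt, cls=1))

# Per-image score: one IoU per image, then aggregate; exposes per-image failures.
per_image = [iou(p, g, cls=1) for p, g in zip(preds, gts)]
print("mean per-image IoU:", np.nanmean(per_image))
```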

439 citations


Proceedings ArticleDOI
01 Jan 2013
TL;DR: This paper proposes a multi-view pictorial structures model that builds on recent advances in 2D pose estimation and incorporates evidence across multiple viewpoints to allow for robust 3D poses estimation.
Abstract: Pictorial structure models are the de facto standard for 2D human pose estimation. Numerous refinements and improvements have been proposed, such as discriminatively trained body part detectors, flexible body models, and local and global mixtures. While these techniques achieve state-of-the-art performance for 2D pose estimation, they have not yet been extended to enable pose estimation in 3D. This paper thus proposes a multi-view pictorial structures model that builds on recent advances in 2D pose estimation and incorporates evidence across multiple viewpoints to allow for robust 3D pose estimation. We evaluate our multi-view pictorial structures approach on the HumanEva-I and MPII Cooking datasets. In comparison to related work on 3D pose estimation, our approach achieves similar or better results while operating on single frames only and not relying on activity-specific motion models or tracking. Notably, our approach outperforms the state of the art for activities with more complex motions.

169 citations


Proceedings ArticleDOI
01 Jan 2013
TL;DR: A 3D-saliency formulation that takes into account structural features of objects in an indoor setting to identify regions at salient depth levels is proposed that integrates depth and geometric features of object surfaces in indoor scenes.
Abstract: Depth information has been shown to affect identification of visually salient regions in images. In this paper, we investigate the role of depth in saliency detection in the presence of (i) competing saliencies due to appearance, (ii) depth-induced blur and (iii) centre-bias. Having established through experiments that depth continues to be a significant contributor to saliency in the presence of these cues, we propose a 3D-saliency formulation that takes into account structural features of objects in an indoor setting to identify regions at salient depth levels. Computed 3D-saliency is used in conjunction with 2D-saliency models through non-linear regression using SVM to improve saliency maps. Experiments on benchmark datasets containing depth information show that the proposed fusion of 3D-saliency with 2D-saliency models results in an average improvement in ROC scores of about 9% over state-of-the-art 2D saliency models. The main contributions of this paper are: (i) the development of a 3D-saliency model that integrates depth and geometric features of object surfaces in indoor scenes; (ii) fusion of appearance (RGB) saliency with depth saliency through non-linear regression using SVM; (iii) experiments to support the hypothesis that depth improves saliency detection in the presence of blur and centre-bias. The effectiveness of the 3D-saliency model and its fusion with RGB-saliency is illustrated through experiments on two benchmark datasets that contain depth information. Current state-of-the-art saliency detection algorithms perform poorly on these datasets, which depict indoor scenes, due to the presence of competing saliencies in the form of color contrast. For example, in Fig. 1, saliency maps of [1] are shown for different scenes, along with human eye fixations and our proposed saliency map after fusion. The first scene of Fig. 1 shows that illumination plays a spoiler role in the RGB-saliency map. In the second scene, the RGB-saliency is focused on the cap even though multiple salient objects are present in the scene. The last scene shows the limitation of RGB-saliency when the object is similar in appearance to the background.

Effect of depth on saliency: In [4], it is shown that depth is an important cue for saliency. In this paper we go further and verify whether depth alone influences saliency. Different scenes were captured for experimentation using a Kinect sensor. The observations resulting from these experiments are: (i) humans fixate on objects at closer depth in the presence of visually competing salient objects in the background; (ii) early attention goes to objects at closer depth; (iii) effective fixations are higher on a low-contrast foreground than on high-contrast objects in the background which are blurred; (iv) a low-contrast object placed at the centre of the field of view gets more attention than at other locations. Based on these observations, we develop a 3D-saliency that captures the depth information of the regions in the scene.

3D-Saliency: We adapt the region-based contrast method of Cheng et al. [1] to compute contrast strengths for the segmented 3D surfaces or regions. Each segmented region is assigned a contrast score using surface normals as the feature. The structure of a surface can be described by the distribution of normals in the region. We compute a histogram of the angular distances formed by every pair of normals in the region.
Every region Rk is associated with a histogram Hk. The contrast score Ck of a region Rk is computed as the sum of the dot products of its histogram with the histograms of the other regions in the scene. Since the depth of a region influences visual attention, the contrast score is scaled by a value Zk, the depth of region Rk from the sensor. To define the saliency, the sizes of the regions, i.e. the number of points in each region, also have to be considered. We find the ratio of the region dimension to half of the scene dimension. Considering nk as the number of 3D points in region Rk, the contrast score becomes the size-scaled form given below.

Figure 1: Four different scenes and their saliency maps; for each scene, from top left: (i) original image, (ii) RGB-saliency map using RC [1], (iii) human fixations from an eye-tracker and (iv) fused RGBD-saliency map.
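From the prose, the scaled contrast score plausibly takes the form below (a reconstruction, not the paper's typeset formula; here N denotes the total number of 3D points in the scene):

$$C_k \;=\; \frac{n_k}{N/2}\, Z_k \sum_{j \neq k} H_k \cdot H_j$$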

163 citations


Proceedings ArticleDOI
01 Jan 2013
TL;DR: A general continuous-time framework for visual-inertial simultaneous localization and mapping and calibration is described and how to use a spline parameterization that closely matches the torque-minimal motion of the sensor is shown.
Abstract: This paper describes a general continuous-time framework for visual-inertial simultaneous localization and mapping and calibration. We show how to use a spline parameterization that closely matches the torque-minimal motion of the sensor. Compared to traditional discrete-time solutions, the continuous-time formulation is particularly useful for solving problems with high-frame rate sensors and multiple unsynchronized devices. We demonstrate the applicability of the method for multi-sensor visual-inertial SLAM and calibration by accurately establishing the relative pose and internal parameters of multiple unsynchronized devices. We also show the advantages of the approach through evaluation and uniform treatment of both global and rolling shutter cameras within visual and visual-inertial SLAM systems.

120 citations


Proceedings ArticleDOI
01 Sep 2013
TL;DR: Presented at the 24th British Machine Vision Conference (BMVC 2013), 9-13 September 2013, Bristol, UK.
Abstract: Presented at the 24th British Machine Vision Conference (BMVC 2013), 9-13 September 2013, Bristol, UK.

118 citations


Proceedings ArticleDOI
09 Sep 2013
TL;DR: Extensive evaluations on the widely used MSRA-1000 dataset and also on the new PASCAL-1500 dataset demonstrate that the proposed saliency model outperforms the state-of-the-art models.
Abstract: The low-rank matrix recovery (LRMR) model, which aims at decomposing a matrix into a low-rank matrix and a sparse one, has shown the potential to address the problem of saliency detection, where the decomposed low-rank matrix naturally corresponds to the background and the sparse one captures salient objects. This holds under the assumption that the background is consistent and objects are clearly distinctive. Unfortunately, in real images the background may be cluttered and may have low contrast with objects, so directly applying the LRMR model to saliency detection has limited robustness. This paper proposes a novel approach that exploits bottom-up segmentation as a guidance cue for the matrix recovery. The method is fully unsupervised, yet obtains higher performance than the supervised LRMR model. A new challenging dataset, PASCAL-1500, is also introduced to validate saliency detection performance. Extensive evaluations on the widely used MSRA-1000 dataset and on the new PASCAL-1500 dataset demonstrate that the proposed saliency model outperforms state-of-the-art models.
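The unguided LRMR decomposition that this builds on can be sketched with the standard inexact-ALM robust PCA iteration (generic formulation; the paper's segmentation-guided variant is not reproduced here):

```python
import numpy as np

def rpca(M, lam=None, mu=None, tol=1e-7, max_iter=500):
    """Inexact-ALM principal component pursuit: M ~ L (low-rank) + S (sparse)."""
    m, n = M.shape
    if lam is None:
        lam = 1.0 / np.sqrt(max(m, n))            # standard sparsity weight
    if mu is None:
        mu = 0.25 * m * n / (np.abs(M).sum() + 1e-12)
    Y = np.zeros_like(M)                          # Lagrange multipliers
    S = np.zeros_like(M)
    for _ in range(max_iter):
        # Low-rank update: singular value soft-thresholding.
        U, s, Vt = np.linalg.svd(M - S + Y / mu, full_matrices=False)
        L = (U * np.maximum(s - 1.0 / mu, 0)) @ Vt
        # Sparse update: elementwise soft-thresholding.
        R = M - L + Y / mu
        S = np.sign(R) * np.maximum(np.abs(R) - lam / mu, 0)
        Y += mu * (M - L - S)
        if np.linalg.norm(M - L - S) <= tol * np.linalg.norm(M):
            break
    return L, S

# In a saliency setting, rows of M would be regional feature vectors; the
# sparse component S then indicates candidate salient regions.
```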

105 citations


PatentDOI
01 Feb 2013
TL;DR: A text image and a character string are each embedded into a vectorial space by extracting a set of features and generating a representation from them, after which the compatibility between the two representations is computed.
Abstract: A system and method for comparing a text image and a character string are provided. The method includes embedding a character string into a vectorial space by extracting a set of features from the character string and generating a character string representation based on the extracted features, such as a spatial pyramid bag of characters (SPBOC) representation. A text image is embedded into a vectorial space by extracting a set of features from the text image and generating a text image representation based on the text image extracted features. A compatibility between the text image representation and the character string representation is computed, which includes computing a function of the text image representation and character string representation.
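A minimal sketch of a spatial-pyramid bag-of-characters embedding in the spirit of the SPBOC described above (the alphabet, pyramid depth and normalisation are illustrative choices, not the patent's specification):

```python
import string
import numpy as np

ALPHABET = string.ascii_lowercase          # illustrative character set

def char_histogram(s):
    """Histogram of alphabet characters in a (sub)string."""
    h = np.zeros(len(ALPHABET))
    for c in s.lower():
        idx = ALPHABET.find(c)
        if idx >= 0:
            h[idx] += 1
    return h

def spboc(word, levels=3):
    """Concatenate character histograms over a 1-D spatial pyramid of the string."""
    feats = []
    for level in range(levels):
        parts = 2 ** level
        for i in range(parts):
            lo, hi = i * len(word) // parts, (i + 1) * len(word) // parts
            feats.append(char_histogram(word[lo:hi]))
    v = np.concatenate(feats)
    return v / (np.linalg.norm(v) + 1e-12)  # L2-normalised embedding

print(spboc("station").shape)              # (26 * (1 + 2 + 4),) = (182,)
```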

102 citations


Proceedings ArticleDOI
01 Jan 2013
TL;DR: This paper proposes a simple and robust local descriptor, called the robust local binary pattern (RLBP), which impressively outperforms the other widely used descriptors and other variants of LBP, and shows a promising performance on the Face Recognition Grand Challenge (FRGC) face dataset.
Abstract: In this paper, we propose a simple and robust local descriptor, called the robust local binary pattern (RLBP). The local binary pattern (LBP) works very successfully in many domains, such as texture classification, human detection and face recognition. However, one issue with LBP is that it is not robust to the noise present in images. We improve the robustness of LBP by changing the coding bit of LBP. Experimental results on the Brodatz and UIUC texture databases show that RLBP impressively outperforms other widely used descriptors (e.g., SIFT, Gabor, MR8 and LBP) and other variants of LBP (e.g., completed LBP), especially when noise is added to the images. In addition, experimental results on human face recognition show a promising performance, comparable to the best known results on the Face Recognition Grand Challenge (FRGC) face dataset.
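For context, the plain 8-neighbour LBP code that RLBP modifies can be sketched as follows; the specific coding-bit correction that makes it robust is defined in the paper and omitted here:

```python
import numpy as np

def lbp_image(img):
    """Basic 3x3 LBP: threshold the 8 neighbours against the centre pixel."""
    img = np.asarray(img, dtype=np.int32)
    c = img[1:-1, 1:-1]
    # Neighbour offsets in clockwise order, each contributing one bit.
    shifts = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
              (1, 1), (1, 0), (1, -1), (0, -1)]
    code = np.zeros_like(c)
    for bit, (dy, dx) in enumerate(shifts):
        nb = img[1 + dy:img.shape[0] - 1 + dy, 1 + dx:img.shape[1] - 1 + dx]
        code |= (nb >= c).astype(np.int32) << bit
    return code

# 256-bin histogram of LBP codes as a texture descriptor.
tex = np.random.default_rng(0).integers(0, 256, (64, 64))
hist = np.bincount(lbp_image(tex).ravel(), minlength=256)
```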

96 citations


Proceedings ArticleDOI
01 Jan 2013
TL;DR: This thesis presents practical approaches for tackling the correspondence estimation problem, with an emphasis on deformable objects; a hybrid generative/discriminative approach is used to perform accurate correspondence estimation in real-time.
Abstract: Many computer vision tasks such as object detection, pose estimation, and alignment are directly related to the estimation of correspondences over instances of an object class. Other tasks such as image classification and verification, if not completely solved, can largely benefit from correspondence estimation. This thesis presents practical approaches for tackling the correspondence estimation problem with an emphasis on deformable objects. The different methods presented in this thesis vary greatly in their details, but they all use a combination of generative and discriminative modeling to estimate the correspondences from input images in an efficient manner. While the methods described in this work are generic and can be applied to any object, two classes of objects of high importance, namely human bodies and faces, are the subjects of our experimentation.

When dealing with the human body, we are mostly interested in estimating a sparse set of landmarks; specifically, we are interested in locating the body joints. We use pictorial structures to model the articulation of the body parts generatively and learn efficient discriminative models to localize the parts in the image. This is a common approach explored by many previous works. We further extend this hybrid approach by introducing higher-order terms to deal with the double-counting problem and provide an algorithm for solving the resulting non-convex problem efficiently. In another work we explore the area of multi-view pose estimation, where we have multiple calibrated cameras and are interested in determining the pose of a person in 3D by aggregating 2D information. This is done efficiently by discretizing the 3D search space and using the 3D pictorial structures model to perform the inference.

In contrast to the human body, faces have a much more rigid structure, and it is relatively easy to detect the major parts of the face such as the eyes, nose and mouth, but performing dense correspondence estimation on faces under various poses and lighting conditions is still challenging. In a first work we deal with this variation by partitioning the face into multiple parts and learning separate regressors for each part. In another work we take a fully discriminative approach and learn a global regressor from image to landmarks, but to deal with the insufficiency of training data we augment it with a large number of synthetic images. While we have shown great performance on the standard face datasets for correspondence estimation, in many scenarios the RGB signal gets distorted as a result of poor lighting conditions and becomes almost unusable. This problem is addressed in another work where we explore the use of the depth signal for dense correspondence estimation. Here again a hybrid generative/discriminative approach is used to perform accurate correspondence estimation in real-time.

Proceedings ArticleDOI
01 Jan 2013
TL;DR: The three dimensional Discrete Cosine Transform (3D-DCT) is proposed for feature extraction and it is shown that compared to other transforms, such as the Fourier transform, the transformed coefficients are real and thus require less data to process.
Abstract: Hyperspectral imaging offers new opportunities for inter-person facial discrimination. However, due to the high dimensionality of hyperspectral data, discriminative feature extraction for face recognition is more challenging than for 2D images. For dimensionality reduction and feature extraction, most previous approaches simply subsampled the hyperspectral data [5, 6, 9] or used simple PCA [3]. In contrast, we propose the three-dimensional Discrete Cosine Transform (3D-DCT) for feature extraction (Fig. 1). Exploiting the fact that hyperspectral data is usually highly correlated in the spatial and spectral dimensions, a transform such as the DCT is expected to compact the signal information into a few coefficients by providing maximal decorrelation. The DCT, being an approximation of the Karhunen-Loève transform, optimally compacts the signal information in a given number of transform coefficients. Moreover, compared to other transforms such as the Fourier transform, the transformed coefficients are real and thus require less data to process. The Discrete Cosine Transform (DCT) [1] expresses a discrete signal, such as a 2D image or a hyperspectral cube, as a linear combination of mutually uncorrelated cosine basis functions [4]. The DCT generates a compact energy spectrum of the signal where the low-frequency coefficients encode most of the signal information. A compact signal representation can be obtained by selecting only the low-frequency coefficients as features. The 2D-DCT of a 2D image h(x,y) of size N1×N2, and the 3D-DCT of a hyperspectral cube H(x,y,λ) of size N1×N2×N3, are given below.
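In their standard DCT-II form (reconstructed here as usually written, with $\alpha(u) = \sqrt{1/N}$ for $u = 0$ and $\sqrt{2/N}$ otherwise), these transforms are

$$C(u,v) = \alpha(u)\,\alpha(v) \sum_{x=0}^{N_1-1}\sum_{y=0}^{N_2-1} h(x,y)\,\cos\!\Big[\frac{\pi(2x+1)u}{2N_1}\Big]\cos\!\Big[\frac{\pi(2y+1)v}{2N_2}\Big],$$

$$C(u,v,w) = \alpha(u)\,\alpha(v)\,\alpha(w) \sum_{x=0}^{N_1-1}\sum_{y=0}^{N_2-1}\sum_{\lambda=0}^{N_3-1} H(x,y,\lambda)\,\cos\!\Big[\frac{\pi(2x+1)u}{2N_1}\Big]\cos\!\Big[\frac{\pi(2y+1)v}{2N_2}\Big]\cos\!\Big[\frac{\pi(2\lambda+1)w}{2N_3}\Big].$$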

Proceedings ArticleDOI
01 Jan 2013
TL;DR: This work proposes a fully automatic recognition system utilizing facial expression, head pose information and their dynamics, and analyzes the relevance of head pose Information for pain recognition and compares person-specific and general classification models.
Abstract: Pain is what the patient says it is. But what about those who cannot utter? Automatic pain monitoring opens up prospects for better treatment, but accurate assessment of pain is challenging due to its subjective nature. To facilitate advances, we contribute a new dataset, the BioVid Heat Pain Database, which contains videos and physiological data of 90 persons subjected to well-defined pain stimuli of 4 intensities. We propose a fully automatic recognition system utilizing facial expression, head pose information and their dynamics. The approach is evaluated on the task of pain detection on the new dataset, also outlining open challenges for pain monitoring in general. Additionally, we analyze the relevance of head pose information for pain recognition and compare person-specific and general classification models.

Proceedings ArticleDOI
01 Jan 2013
TL;DR: This work presents a method for the representation and matching of sketches by exploiting not only local features but also global structures of sketches, through a star graph based ensemble matching strategy, and shows that by encapsulating holistic structure matching and learned bag-of-features models into a single framework, notable recognition performance improvement can be observed.
Abstract: Sketch recognition aims to automatically classify human hand sketches of objects into known categories. This has become an increasingly desirable capability due to recent advances in human computer interaction on portable devices. The problem is nontrivial because of the sparse and abstract nature of hand drawings compared to photographic images of objects, compounded by a highly variable degree of detail in human sketches. To this end, we present a method for the representation and matching of sketches by exploiting not only local features but also global structures of sketches, through a star graph based ensemble matching strategy. Different local feature representations were evaluated using the star graph model to demonstrate the effectiveness of the ensemble matching of structured features. We further show that by encapsulating holistic structure matching and learned bag-of-features models into a single framework, notable recognition performance improvement over the state-of-the-art can be observed. Extensive comparative experiments were carried out using the currently largest sketch dataset, released by Eitz et al. [15], with over 20,000 sketches of 250 object categories generated by AMT (Amazon Mechanical Turk) crowd-sourcing.

Proceedings ArticleDOI
01 Jan 2013
TL;DR: Performance comparison among different methods for kernelization using chi-squared kernel is shown.
Abstract:

Method           Dataset 1             Dataset 2             Dataset 3
MBRM [1]         0.24/0.25/0.245/122   0.18/0.19/0.185/209   0.24/0.23/0.235/233
JEC [3]          0.27/0.32/0.293/139   0.22/0.25/0.234/224   0.28/0.29/0.285/250
TagProp-ML [2]   0.31/0.37/0.337/146   0.49/0.20/0.284/213   0.48/0.25/0.329/227
TagProp-sML [2]  0.33/0.42/0.370/160   0.39/0.27/0.319/239   0.46/0.35/0.398/266
KSVM             0.29/0.43/0.346/174   0.30/0.28/0.290/256   0.43/0.27/0.332/266
KSVM-VT (Ours)   0.32/0.42/0.363/179   0.33/0.32/0.325/259   0.47/0.29/0.359/268

Table 1: Performance comparison among different methods (each cell: precision/recall/F-measure/N+ on one of three benchmark datasets). The prefix 'K' corresponds to kernelization using the chi-squared kernel.

Proceedings ArticleDOI
09 Sep 2013
TL;DR: A pooling strategy for local descriptors to produce a vector representation that is orientation-invariant yet implicitly incorporates the relative angles between features measured by their dominant orientation that is especially effective when combined with dense oriented features.
Abstract: This paper proposes a pooling strategy for local descriptors that produces a vector representation which is orientation-invariant yet implicitly incorporates the relative angles between features, as measured by their dominant orientations. This pooling is associated with a similarity metric that ensures that all the features have undergone a comparable rotation. The approach is especially effective when combined with dense oriented features, in contrast to existing methods that rely either on oriented features extracted at keypoints or on non-oriented dense features. The benefit of our approach in a retrieval scenario is demonstrated on popular benchmarks comprising up to 1 million database images.

Proceedings ArticleDOI
01 Jan 2013
TL;DR: A boosting approach that automatically selects a small set of useful spatio-temporal pyramid histograms among a randomized pool of candidate partitions and an “object-centric” cutting scheme that prefers sampling bin boundaries near those objects prominently involved in the egocentric activities are proposed.
Abstract: Activities in egocentric video are largely defined by the objects with which the camera wearer interacts, making representations that summarize the objects in view quite informative. Beyond simply recording how frequently each object occurs in a single histogram, spatio-temporal binning approaches can capture the objects’ relative layout and ordering. However, existing methods use hand-crafted binning schemes (e.g., a uniformly spaced pyramid of partitions), which may fail to capture the relationships that best distinguish certain activities. We propose to learn the spatio-temporal partitions that are discriminative for a set of egocentric activity classes. We devise a boosting approach that automatically selects a small set of useful spatio-temporal pyramid histograms among a randomized pool of candidate partitions. In order to efficiently focus the candidate partitions, we further propose an “object-centric” cutting scheme that prefers sampling bin boundaries near those objects prominently involved in the egocentric activities. In this way, we specialize the randomized pool of partitions to the egocentric setting and improve the training efficiency for boosting. Our approach yields state-of-the-art accuracy for recognition of challenging activities of daily living.

Proceedings Article
01 Jan 2013
TL;DR: It is shown that, somewhat counter-intuitively, mouth patterns are highly informative for isolating words in a language for the Deaf, and their co-occurrence with signing can be used to significantly reduce the correspondence search space.
Abstract: The goal of this work is to automatically learn a large number of signs from sign language-interpreted TV broadcasts. We achieve this by exploiting supervisory information available in the subtitles of the broadcasts. However, this information is both weak and noisy and this leads to a challenging correspondence problem when trying to identify the temporal window of the sign. We make the following contributions: (i) we show that, somewhat counter-intuitively, mouth patterns are highly informative for isolating words in a language for the Deaf, and their co-occurrence with signing can be used to significantly reduce the correspondence search space; and (ii) we develop a multiple instance learning method using an efficient discriminative search, which determines a candidate list for the sign with both high recall and precision. We demonstrate the method on videos from BBC TV broadcasts, and achieve higher accuracy and recall than previous methods, despite using much simpler features.

Proceedings ArticleDOI
01 Jan 2013
TL;DR: A new method to incrementally extract a surface from a consecutively growing Structure-from-Motion (SfM) point cloud in real-time based on a Delaunay triangulation on the 3D points, which achieves the same accuracy as state-of-the-art methods but reduces the computational effort significantly.
Abstract: In this paper we propose a new method to incrementally extract a surface from a consecutively growing Structure-from-Motion (SfM) point cloud in real-time. Our method is based on a Delaunay triangulation (DT) of the 3D points. The core idea is to robustly label all tetrahedra into free and occupied space using a random field formulation and to extract the surface as the interface between differently labeled tetrahedra. For this reason, we propose a new energy function that achieves the same accuracy as state-of-the-art methods but reduces the computational effort significantly. Furthermore, our new formulation allows us to extract the surface in an incremental manner, i.e. whenever the point cloud is updated we adapt our energy function. Instead of minimizing the updated energy with a standard graph cut, we employ the dynamic graph cut of Kohli et al. [1], which enables efficient minimization of a series of similar random fields by re-using the previous solution. In this way we are able to extract the surface from an increasingly growing point cloud nearly independently of the overall scene size.

Energy Function for Surface Extraction: Our method formulates surface extraction as a binary labeling problem, with the goal of assigning each tetrahedron either a free or an occupied label. For this reason, we model the probabilities that a tetrahedron is free or occupied space by analyzing the set of rays that connect all 3D points to image features. Following the idea of the truncated signed distance function (TSDF), known from voxel-based surface reconstructions, a tetrahedron in front of a 3D point X has a high probability of being free space, whereas a tetrahedron behind X is presumably occupied space. We further assume that it is very unlikely that neighboring tetrahedra obtain different labels, except for pairs of tetrahedra that have a ray passing through their shared face. Such a labeling problem can be elegantly formulated as a pairwise random field, and since our priors are submodular, we can efficiently find a globally optimal labeling, e.g. using graph cuts. In contrast to existing methods like [2], our energy depends only on the visibility information that is directly connected to the four 3D points that span the tetrahedron Vi. Hence a modification of the tetrahedral structure by inserting new points has only a limited effect on the energy function. This property enables us to easily adapt the energy function to a modified tetrahedral structure.

Incremental Surface Extraction: To enable efficient incremental surface reconstruction, our method has to consecutively integrate new scene information (3D points as well as visibility information) into the energy function and to minimize the modified energy efficiently. Integrating new visibility information, i.e. adding rays for newly available 3D points, affects only those terms of the energy function that relate
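The free/occupied labeling step can be sketched with a generic max-flow solver; the sketch below uses the PyMaxflow package with placeholder unary and pairwise costs, not the paper's actual energy terms or the dynamic graph cut:

```python
import numpy as np
import maxflow  # PyMaxflow

# Toy energy: n tetrahedra with unary free/occupied costs and a smoothness
# penalty on face-adjacent pairs (all values are placeholders).
rng = np.random.default_rng(0)
n = 6
cost_free = rng.random(n)        # cost of labeling tetrahedron "free"
cost_occupied = rng.random(n)    # cost of labeling it "occupied"
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]

g = maxflow.Graph[float]()
nodes = g.add_nodes(n)
for i in range(n):
    g.add_tedge(nodes[i], cost_free[i], cost_occupied[i])  # unary terms
for i, j in edges:
    g.add_edge(nodes[i], nodes[j], 0.5, 0.5)  # submodular pairwise term

g.maxflow()
labels = [g.get_segment(nodes[i]) for i in range(n)]  # 0/1 = free/occupied
print(labels)
```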

Proceedings ArticleDOI
01 Jan 2013
TL;DR: It is demonstrated that the transfer learning and person specific trackers significantly improve pose estimation performance, and a method for adapting existing training data to generate new training data by synthesis for signers with different appearances.
Abstract: The objective of this work is to estimate upper body pose for signers in TV broadcasts. Given suitable training data, the pose is estimated using a random forest body joint detector. However, obtaining such training data can be costly. The novelty of this paper is a method of transfer learning which is able to harness existing training data and use it for new domains. Our contributions are: (i) a method for adapting existing training data to generate new training data by synthesis for signers with different appearances, and (ii) a method for personalising training data. As a case study we show how the appearance of the arms for different clothing, specifically short and long sleeved clothes, can be modelled to obtain person-specific trackers. We demonstrate that the transfer learning and person specific trackers significantly improve pose estimation performance.


Proceedings ArticleDOI
01 Jan 2013
TL;DR: This paper presents a comprehensive evaluation of image classification and object detection in X-ray images using standard local features in a BoW framework with (structural) SVMs, and proposes a multi-view branch-and-bound algorithm for multi-View object detection.
Abstract: Object recognition in X-ray images is an interesting application of machine vision that can help reduce the workload of human operators of X-ray scanners at security checkpoints. In this paper, we first present a comprehensive evaluation of image classification and object detection in X-ray images using standard local features in a BoW framework with (structural) SVMs. Then, we extend the features to utilize the extra information available in dual energy X-ray images. Finally, we propose a multi-view branch-and-bound algorithm for multi-view object detection. Through extensive experiments on three object categories, we show that the classification and detection performance substantially improves with the extended features and multiple views.

Proceedings ArticleDOI
01 Jan 2013
TL;DR: A motion boundary based dense sampling strategy is introduced, which greatly reduces the number of valid trajectories while preserving the discriminative power, and a set of new descriptors is developed to describe the spatial-temporal context of motion trajectories.
Abstract: Feature representation is important for human action recognition. Recently, Wang et al. [25] proposed dense trajectory (DT) based features for action video representation and achieved state-of-the-art performance on several action datasets. In this paper, we improve the DT method in two ways. Firstly, we introduce a motion boundary based dense sampling strategy, which greatly reduces the number of valid trajectories while preserving the discriminative power. Secondly, we develop a set of new descriptors which describe the spatial-temporal context of motion trajectories. To evaluate the performance of the proposed methods, we conduct extensive experiments on three benchmarks: KTH, YouTube and HMDB51. The results show that our sampling strategy significantly reduces the computational cost of point tracking without degrading performance. Meanwhile, we achieve better performance than state-of-the-art methods by utilizing our spatial-temporal context descriptors.
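A minimal sketch of the motion-boundary idea: compute dense optical flow, take the gradient magnitude of each flow component, and restrict dense sampling to pixels where it is large (flow parameters and the threshold are illustrative):

```python
import cv2
import numpy as np

def motion_boundary_mask(prev_gray, curr_gray, thresh=2.0):
    """Mask of pixels near motion boundaries, used to restrict dense sampling."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mb = np.zeros(prev_gray.shape[:2], dtype=np.float64)
    for k in range(2):  # gradient magnitude of each flow component
        comp = np.ascontiguousarray(flow[..., k])
        gx = cv2.Sobel(comp, cv2.CV_64F, 1, 0)
        gy = cv2.Sobel(comp, cv2.CV_64F, 0, 1)
        mb = np.maximum(mb, np.sqrt(gx ** 2 + gy ** 2))
    return mb > thresh  # sample trajectory seeds only where True
```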

Proceedings ArticleDOI
01 Sep 2013
TL;DR: This paper illustrates that Fisher Vector, VLAD and BOF can be uniformly derived in two steps: (i) encoding – separately map each local descriptor into a code, and (ii) pooling – aggregate all codes from one image into a single vector.
Abstract: The bag-of-features (BOF) image representation [7] is popular in large-scale image retrieval. With BOF, the memory to store the inverted index file and the search complexity both increase approximately linearly with the number of images. To address the retrieval efficiency and memory constraint problems, besides improvements based on BOF, alternative approaches have emerged that aggregate the local descriptors of an image into a single vector using the Fisher Vector [6] or the Vector of Locally Aggregated Descriptors (VLAD) [1]. It has been shown in [1] that with as few as 16 bytes to represent an image, the retrieval performance is still comparable to that of the BOF representation. In this paper, we illustrate that Fisher Vector, VLAD and BOF can be uniformly derived in two steps: (i) Encoding – separately map each local descriptor into a code, and (ii) Pooling – aggregate all codes from one image into a single vector. Motivated by the success of these two-step approaches, we propose to use the sparse coding (SC) framework to aggregate local features for image retrieval. The SC framework was first introduced in [10] for the task of image classification. It is a classical two-step approach. Step 1: Encoding. Each local descriptor x from an image is encoded into an N-dimensional vector u = [u1, u2, ..., uN] by fitting a linear model with a sparsity (L1) constraint:
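In its usual form (as introduced with the ScSPM framework of [10]; D denotes the d×N dictionary and λ the sparsity weight), this encoding objective is

$$u^{\ast} = \arg\min_{u \in \mathbb{R}^{N}} \; \lVert x - Du \rVert_2^2 + \lambda\,\lVert u \rVert_1 .$$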

Proceedings Article
01 Jan 2013
TL;DR: In this paper, the authors propose to discover and learn the visual appearance of attributes automatically, using the recently introduced AVA database which contains more than 250,000 images together with their user ratings and textual comments.
Abstract: Current approaches to aesthetic image analysis provide either accurate or interpretable results. To get both accuracy and interpretability, we advocate the use of learned visual attributes as mid-level features. For this purpose, we propose to discover and learn the visual appearance of attributes automatically, using the recently introduced AVA database, which contains more than 250,000 images together with their user ratings and textual comments. These learned attributes have many applications, including aesthetic quality prediction, image classification and retrieval.

Proceedings ArticleDOI
01 Jan 2013
TL;DR: Extensive experiments on five benchmark datasets for face recognition and person re-identification demonstrate that CRNP is not only more effective but also significantly faster than other state-of-the-art methods, including RNP and CSA.
Abstract: Set based recognition has been attracting more and more attention in recent years, benefiting from two facts: the difficulty of collecting sets of images for recognition is fading quickly, and set based recognition models generally outperform those for single instance based recognition. In this paper, we propose a novel model called collaboratively regularized nearest points (CRNP) for solving this problem. The proposal inherits the merits of simplicity, robustness, and high efficiency from the very recently introduced regularized nearest points (RNP) method, which finds the set-to-set distance using l2-norm regularized affine hulls. Meanwhile, CRNP makes use of the powerful discriminative ability induced by collaborative representation, following the same idea as in sparse representation-based classification (SRC) for image-based recognition and collaborative sparse approximation (CSA) for set-based recognition. However, CRNP uses the l2-norm instead of the expensive l1-norm for coefficient regularization, which makes it much more efficient. Extensive experiments on five benchmark datasets for face recognition and person re-identification demonstrate that CRNP is not only more effective but also significantly faster than other state-of-the-art methods, including RNP and CSA.

Proceedings ArticleDOI
01 Jan 2013
TL;DR: A novel approach for estimating the relative motion between successive RGB-D frames that uses plane-primitives instead of point features that is as accurate as state-of-the-art point-based approaches when the camera displacement is small, and significantly outperforms them in case of wide-baseline and/or dynamic foreground.
Abstract: Odometry consists in using data from a moving sensor to estimate change in position over time. It is a crucial step for several applications in robotics and computer vision. This paper presents a novel approach for estimating the relative motion between successive RGB-D frames that uses plane-primitives instead of point features. The planes in the scene are extracted and the motion estimation is cast as a plane-to-plane registration problem with a closed-form solution. Point features are only extracted in the cases where the plane surface configuration is insufficient to determine motion with no ambiguity. The initial estimate is refined in a photo-geometric optimization step that takes full advantage of the plane detection and simultaneous availability of depth and visual appearance cues. Extensive experiments show that our plane-based approach is as accurate as state-of-the-art point-based approaches when the camera displacement is small, and significantly outperforms them in case of wide-baseline and/or dynamic foreground.
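One standard closed-form construction for the plane-to-plane registration step is a Kabsch-style rotation estimate from matched unit normals followed by a least-squares translation from the plane offsets; the sketch below follows that construction and may differ from the paper's exact formulation and degeneracy handling:

```python
import numpy as np

def register_planes(n_src, d_src, n_dst, d_dst):
    """Rigid motion (R, t) mapping source planes onto destination planes.
    Planes are n.x = d with unit normals; rows of n_src/n_dst correspond."""
    # Rotation: align matched normals (Kabsch / SVD of the correlation matrix).
    H = n_src.T @ n_dst
    U, _, Vt = np.linalg.svd(H)
    S = np.diag([1, 1, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ S @ U.T
    # Translation: transformed offsets satisfy d_dst = d_src + n_dst . t,
    # a small linear least-squares problem (needs 3 non-parallel planes).
    t, *_ = np.linalg.lstsq(n_dst, d_dst - d_src, rcond=None)
    return R, t

# Toy check: three orthogonal planes under a known motion.
n1 = np.eye(3)
R_true = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
t_true = np.array([0.1, -0.2, 0.3])
n2 = n1 @ R_true.T            # rotated normals
d2 = n2 @ t_true              # d' = d + n'.t with d = 0
R, t = register_planes(n1, np.zeros(3), n2, d2)
```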

Proceedings ArticleDOI
01 Jan 2013
TL;DR: A local ZM-based representation is introduced, in terms of a three-layered framework, which involves a non-linear encoding layer (quantisation) and pools encoded features over local histograms.
Abstract: Local representations have become popular for facial affect recognition as they efficiently capture the image discontinuities, which play an important role in interpreting facial actions. We propose to use Local Zernike Moments (ZMs) [4] due to their useful and compact description of the image discontinuities and texture. Their main advantage in comparison to well-established alternatives such as Local Binary Patterns (LBPs) [5] is their flexibility in terms of the size and level of detail of the local description. We introduce a local ZM-based representation which involves a non-linear encoding layer (quantisation). The functionality of this layer is to map similar facial configurations together and to increase compactness. We demonstrate the use of the local ZM-based representation for posed and naturalistic affect recognition on standard datasets, and show its superiority to alternative approaches for both tasks.

Contemporary representations are often designed as frameworks consisting of three layers [2]: (local) feature extraction, non-linear encoding and pooling. Non-linear encoding aims at enhancing the relevance of local features by increasing their robustness against image noise. Pooling describes small spatial neighbourhoods as single entities, ignoring the precise location of the encoded features and increasing the tolerance against small geometric inconsistencies. In what follows, we describe the proposed local ZM-based representation scheme in terms of this three-layered framework.

Feature Extraction – Local Zernike Moments: The computation of (complex) ZMs can be considered equivalent to representing an image in an alternative space. As shown in Figure 1-a, an image is decomposed onto a set of basis matrices (ZM bases), which are useful for describing the variation at different directions and scales. ZM bases are orthogonal, therefore there is no overlap in the information conveyed by each feature (ZM coefficient). ZMs are usually computed for the entire image; however, in that case ZMs cannot capture the local variation, as ZM bases lack localisation [3]. On the contrary, when computed over local neighbourhoods across the image, they become an efficient tool for describing the image discontinuities which are essential to interpreting facial activity.

Non-linear Encoding – Quantisation: We perform quantisation by converting local features into binary values. Such coarse quantisation increases compactness and allows us to code each local block with only a single integer. Figure 1-b illustrates the process of obtaining the Quantised Local ZM (QLZM) image. Firstly, local ZM coefficients are computed across the input image (LZM layer); each image in the LZM layer (LZM image) contains the features that are extracted through a particular ZM basis. Next, each LZM image is converted into a binary image by quantising each pixel via the signum(·) function. Finally, the QLZM image is obtained by combining all of the binary images. Specifically, each pixel in a particular location of the QLZM image is an integer (QLZM integer), computed by concatenating all of the binary values in the corresponding location of all binary images. The QLZM image is similar to an LBP-transformed image, in the sense that it contains integers of a limited range. Yet, the physical meaning of the information encoded by each integer is quite different. LBP integers describe a circular block by considering only the values along the border, neglecting the pixels that remain inside the block. Therefore, the efficient operation scale of LBPs is usually limited to 3-5 pixels [1, 5]. QLZM integers, on the other hand, describe blocks as a whole, and provide flexibility in terms of operation scale without major loss of information.

Pooling – Histograms: Our representation scheme pools encoded features over local histograms. Figure 1-c illustrates the overall pipeline of the proposed representation scheme. Firstly, the QLZM image is computed through the process illustrated in detail in Figure 1-b.
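The quantisation step lends itself to a short sketch: sign-quantise each (hypothetical) local ZM coefficient map and pack the binary maps into one integer image, whose local histograms form the representation (the ZM filtering itself is assumed to have been computed already):

```python
import numpy as np

def pack_binary_maps(coeff_maps):
    """Sign-quantise K real coefficient maps and pack them into K-bit integers."""
    qlzm = np.zeros(coeff_maps[0].shape, dtype=np.int64)
    for bit, cmap in enumerate(coeff_maps):
        qlzm |= (cmap > 0).astype(np.int64) << bit
    return qlzm

# Toy usage: 4 hypothetical LZM coefficient maps -> 16-bin local histograms.
rng = np.random.default_rng(0)
maps = [rng.normal(size=(32, 32)) for _ in range(4)]
qlzm = pack_binary_maps(maps)
hist = np.bincount(qlzm.ravel(), minlength=2 ** len(maps))
```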

Proceedings ArticleDOI
01 Jan 2013
Abstract: Iljung S. Kwak (Dept. of Computer Science and Engineering, University of California, San Diego, USA), Ana C. Murillo (Dpto. de Informática e Ingeniería de Sistemas, Instituto de Investigación en Ingeniería de Aragón, University of Zaragoza, Spain), Peter N. Belhumeur (Department of Computer Science, Columbia University, USA), David Kriegman (University of California, San Diego, USA), Serge Belongie (University of California, San Diego, USA).

Proceedings ArticleDOI
01 Jan 2013
TL;DR: This paper presents spacetime forests defined over complementary spatial and temporal features for recognition of naturally occurring dynamic scenes, and improves on the previous state-of-the-art in both classification and execution rates with increased robustness to camera motion.
Abstract: This paper presents spacetime forests defined over complementary spatial and temporal features for the recognition of naturally occurring dynamic scenes. The approach improves on the previous state-of-the-art in both classification and execution rates. A particular improvement is increased robustness to camera motion, where previous approaches have experienced difficulty. There are three key novelties in the approach. First, a novel spacetime descriptor is employed that exploits the complementary nature of spatial and temporal information, as inspired by previous research on the role of orientation features in scene classification. Second, a forest-based classifier is used to learn a multi-class representation of the feature distributions. Third, the video is processed in temporal slices with scale matched preferentially to scene dynamics over camera motion. Slicing allows temporal alignment to be handled as latent information in the classifier and enables efficient, incremental processing. The integrated approach is evaluated empirically on two publicly available datasets to document its outstanding performance.