
Showing papers on "Feature extraction published in 2009"


Journal ArticleDOI
TL;DR: This work considers the problem of automatically recognizing human faces from frontal views with varying expression and illumination, as well as occlusion and disguise, and proposes a general classification algorithm for (image-based) object recognition based on a sparse representation computed by ℓ1-minimization.
Abstract: We consider the problem of automatically recognizing human faces from frontal views with varying expression and illumination, as well as occlusion and disguise. We cast the recognition problem as one of classifying among multiple linear regression models and argue that new theory from sparse signal representation offers the key to addressing this problem. Based on a sparse representation computed by ℓ1-minimization, we propose a general classification algorithm for (image-based) object recognition. This new framework provides new insights into two crucial issues in face recognition: feature extraction and robustness to occlusion. For feature extraction, we show that if sparsity in the recognition problem is properly harnessed, the choice of features is no longer critical. What is critical, however, is whether the number of features is sufficiently large and whether the sparse representation is correctly computed. Unconventional features such as downsampled images and random projections perform just as well as conventional features such as eigenfaces and Laplacianfaces, as long as the dimension of the feature space surpasses a certain threshold predicted by the theory of sparse representation. This framework can handle errors due to occlusion and corruption uniformly by exploiting the fact that these errors are often sparse with respect to the standard (pixel) basis. The theory of sparse representation helps predict how much occlusion the recognition algorithm can handle and how to choose the training images to maximize robustness to occlusion. We conduct extensive experiments on publicly available databases to verify the efficacy of the proposed algorithm and corroborate the above claims.
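
A minimal sketch of the sparse-representation classifier (SRC) described above, assuming NumPy and scikit-learn; the Lasso stands in for exact ℓ1-minimization, and the per-class residual rule follows the abstract.

# SRC sketch: code the test sample over all training samples, then pick
# the class whose coefficients best reconstruct it.
import numpy as np
from sklearn.linear_model import Lasso

def src_classify(A, labels, y, alpha=0.01):
    """A: (d, n) training samples as columns; labels: (n,); y: (d,) test."""
    lasso = Lasso(alpha=alpha, fit_intercept=False, max_iter=10000)
    lasso.fit(A, y)                              # sparse code x with y ~ A @ x
    x = lasso.coef_
    best_class, best_residual = None, np.inf
    for c in np.unique(labels):
        xc = np.where(labels == c, x, 0.0)       # keep class-c coefficients
        residual = np.linalg.norm(y - A @ xc)
        if residual < best_residual:
            best_class, best_residual = c, residual
    return best_class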

9,658 citations


Proceedings ArticleDOI
12 May 2009
TL;DR: This paper modifies the mathematical expressions of Point Feature Histograms and performs a rigorous analysis of their robustness and complexity for the problem of 3D registration of overlapping point cloud views, and proposes an algorithm for the online computation of FPFH features for real-time applications.
Abstract: In our recent work [1], [2], we proposed Point Feature Histograms (PFH) as robust multi-dimensional features which describe the local geometry around a point p for 3D point cloud datasets. In this paper, we modify their mathematical expressions and perform a rigorous analysis of their robustness and complexity for the problem of 3D registration of overlapping point cloud views. More concretely, we present several optimizations that reduce their computation times drastically by either caching previously computed values or by revising their theoretical formulations. The latter results in a new type of local features, called Fast Point Feature Histograms (FPFH), which retain most of the discriminative power of the PFH. Moreover, we propose an algorithm for the online computation of FPFH features for real-time applications. To validate our results, we demonstrate their efficiency for 3D registration and propose a new sample consensus based method for bringing two datasets into the convergence basin of a local non-linear optimizer: SAC-IA (SAmple Consensus Initial Alignment).
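
The authors' FPFH implementation lives in the Point Cloud Library; as a rough sketch, the Open3D port computes the same 33-bin per-point descriptors (the file name and search radii below are illustrative assumptions, not the paper's settings).

# FPFH descriptors with Open3D: estimate normals first, then the histograms.
import open3d as o3d

pcd = o3d.io.read_point_cloud("scan.pcd")       # hypothetical input file
pcd.estimate_normals(
    o3d.geometry.KDTreeSearchParamHybrid(radius=0.05, max_nn=30))
fpfh = o3d.pipelines.registration.compute_fpfh_feature(
    pcd, o3d.geometry.KDTreeSearchParamHybrid(radius=0.25, max_nn=100))
print(fpfh.data.shape)                          # (33, n_points)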

3,138 citations


Proceedings ArticleDOI
01 Sep 2009
TL;DR: It is shown that using non-linearities that include rectification and local contrast normalization is the single most important ingredient for good accuracy on object recognition benchmarks and that two stages of feature extraction yield better accuracy than one.
Abstract: In many recent object recognition systems, feature extraction stages are generally composed of a filter bank, a non-linear transformation, and some sort of feature pooling layer. Most systems use only one stage of feature extraction in which the filters are hard-wired, or two stages where the filters in one or both stages are learned in supervised or unsupervised mode. This paper addresses three questions: (1) How do the non-linearities that follow the filter banks influence recognition accuracy? (2) Does learning the filter banks in an unsupervised or supervised manner improve performance over random filters or hard-wired filters? (3) Is there any advantage to using an architecture with two stages of feature extraction, rather than one? We show that using non-linearities that include rectification and local contrast normalization is the single most important ingredient for good accuracy on object recognition benchmarks. We show that two stages of feature extraction yield better accuracy than one. Most surprisingly, we show that a two-stage system with random filters can yield almost 63% recognition rate on Caltech-101, provided that the proper non-linearities and pooling layers are used. Finally, we show that with supervised refinement, the system achieves state-of-the-art performance on the NORB dataset (5.6%), that unsupervised pre-training followed by supervised refinement produces good accuracy on Caltech-101 (> 65%), and the lowest known error rate on the undistorted, unprocessed MNIST dataset (0.53%).
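
A sketch of the two non-linearities the paper singles out, absolute-value rectification followed by local contrast normalization, applied to one feature map; NumPy/SciPy assumed, with the window size and epsilon as illustrative choices.

# Rectification + local contrast normalization (subtractive, then divisive).
import numpy as np
from scipy.ndimage import uniform_filter

def rectify_and_lcn(fmap, size=9, eps=1e-6):
    r = np.abs(fmap)                                 # rectification
    centered = r - uniform_filter(r, size=size)      # subtract local mean
    var = uniform_filter(centered ** 2, size=size)   # local variance
    return centered / np.sqrt(var + eps)             # divisive normalization

out = rectify_and_lcn(np.random.randn(64, 64))       # stand-in filter output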

2,317 citations


Proceedings ArticleDOI
01 Sep 2009
TL;DR: By combining Histograms of Oriented Gradients (HOG) and Local Binary Pattern (LBP) as the feature set, this work proposes a novel human detection approach capable of handling partial occlusion and achieves the best human detection performance on the INRIA dataset.
Abstract: By combining Histograms of Oriented Gradients (HOG) and Local Binary Pattern (LBP) as the feature set, we propose a novel human detection approach capable of handling partial occlusion. Two kinds of detectors, i.e., a global detector for whole scanning windows and part detectors for local regions, are learned from the training data using linear SVM. For each ambiguous scanning window, we construct an occlusion likelihood map by using the response of each block of the HOG feature to the global detector. The occlusion likelihood map is then segmented by the Mean-shift approach. A segmented portion of the window with a majority of negative responses is inferred as an occluded region. If partial occlusion is indicated with high likelihood in a certain scanning window, part detectors are applied on the unoccluded regions to achieve the final classification on the current scanning window. With the help of the augmented HOG-LBP feature and the global-part occlusion handling method, we achieve a detection rate of 91.3% at FPPW = 10^-6, 94.7% at FPPW = 10^-5, and 97.9% at FPPW = 10^-4 on the INRIA dataset, which, to the best of our knowledge, is the best human detection performance on the INRIA dataset. The global-part occlusion handling method is further validated using synthesized occlusion data constructed from the INRIA and PASCAL datasets.
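
A sketch of the augmented HOG-LBP descriptor for a single scanning window, assuming scikit-image; the cell, block, and LBP parameters are common defaults, not necessarily the paper's configuration.

# Concatenate a HOG vector with a uniform-LBP histogram for one window.
import numpy as np
from skimage.feature import hog, local_binary_pattern

def hog_lbp_descriptor(window):                 # window: 2D grayscale array
    h = hog(window, orientations=9, pixels_per_cell=(8, 8),
            cells_per_block=(2, 2))
    lbp = local_binary_pattern(window, P=8, R=1, method="uniform")
    lbp_hist, _ = np.histogram(lbp, bins=10, range=(0, 10), density=True)
    return np.concatenate([h, lbp_hist])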

1,838 citations


Journal ArticleDOI
TL;DR: A new texture feature called center-symmetric local binary pattern (CS-LBP) is introduced that is a modified version of the well-known local binary pattern (LBP), and is computationally simpler than SIFT.

1,172 citations


Journal ArticleDOI
TL;DR: A filter method of feature selection based on mutual information, called normalized mutual information feature selection (NMIFS), is presented and is combined with a genetic algorithm to form a hybrid filter/wrapper method called GAMIFS.
Abstract: A filter method of feature selection based on mutual information, called normalized mutual information feature selection (NMIFS), is presented. NMIFS is an enhancement over Battiti's MIFS, MIFS-U, and mRMR methods. The average normalized mutual information is proposed as a measure of redundancy among features. NMIFS outperformed MIFS, MIFS-U, and mRMR on several artificial and benchmark data sets without requiring a user-defined parameter. In addition, NMIFS is combined with a genetic algorithm to form a hybrid filter/wrapper method called GAMIFS. This includes an initialization procedure and a mutation operator based on NMIFS to speed up the convergence of the genetic algorithm. GAMIFS overcomes the limitations of incremental search algorithms that are unable to find dependencies between groups of features.
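
A greedy NMIFS-style selection sketch for integer-coded discrete features, assuming scikit-learn and SciPy; the redundancy term normalizes each pairwise mutual information by min(H(f_i), H(f_s)), as in the paper's definition of normalized mutual information.

# Greedy forward selection: relevance I(f;C) minus mean normalized redundancy.
import numpy as np
from scipy.stats import entropy
from sklearn.metrics import mutual_info_score

def nmifs(X, y, k):
    """X: (n_samples, n_features) of non-negative ints; y: class labels."""
    n_feat = X.shape[1]
    H = [entropy(np.bincount(X[:, j])) for j in range(n_feat)]
    relevance = [mutual_info_score(X[:, j], y) for j in range(n_feat)]
    selected = [int(np.argmax(relevance))]
    while len(selected) < k:
        best, best_score = None, -np.inf
        for j in set(range(n_feat)) - set(selected):
            redundancy = np.mean([
                mutual_info_score(X[:, j], X[:, s]) / min(H[j], H[s])
                for s in selected])
            if relevance[j] - redundancy > best_score:
                best, best_score = j, relevance[j] - redundancy
        selected.append(best)
    return selected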

989 citations


Proceedings ArticleDOI
20 Jun 2009
TL;DR: This paper presents a systematic framework for recognizing realistic actions from videos “in the wild”, and uses motion statistics to acquire stable motion features and clean static features, and PageRank is used to mine the most informative static features.
Abstract: In this paper, we present a systematic framework for recognizing realistic actions from videos “in the wild”. Such unconstrained videos are abundant in personal collections as well as on the Web. Recognizing action from such videos has not been addressed extensively, primarily due to the tremendous variations that result from camera motion, background clutter, changes in object appearance and scale, etc. The main challenge is how to extract reliable and informative features from the unconstrained videos. We extract both motion and static features from the videos. Since the raw features of both types are dense yet noisy, we propose strategies to prune these features. We use motion statistics to acquire stable motion features and clean static features. Furthermore, PageRank is used to mine the most informative static features. In order to further construct compact yet discriminative visual vocabularies, a divisive information-theoretic algorithm is employed to group semantically related features. Finally, AdaBoost is chosen to integrate all the heterogeneous yet complementary features for recognition. We have tested the framework on the KTH dataset and our own dataset consisting of 11 categories of actions collected from YouTube and personal videos, and have obtained impressive results for action recognition and action localization.

917 citations


Journal ArticleDOI
TL;DR: A novel technique for unsupervised change detection in multitemporal satellite images using principal component analysis (PCA) and k-means clustering; experimental results confirm the effectiveness of the proposed approach.
Abstract: In this letter, we propose a novel technique for unsupervised change detection in multitemporal satellite images using principal component analysis (PCA) and k-means clustering. The difference image is partitioned into h × h nonoverlapping blocks. S (S ≤ h²) orthonormal eigenvectors are extracted through PCA of the h × h nonoverlapping block set to create an eigenvector space. Each pixel in the difference image is represented by an S-dimensional feature vector, which is the projection of h × h difference image data onto the generated eigenvector space. Change detection is achieved by partitioning the feature vector space into two clusters using k-means clustering with k = 2 and then assigning each pixel to one of the two clusters using the minimum Euclidean distance between the pixel's feature vector and the mean feature vector of each cluster. Experimental results confirm the effectiveness of the proposed approach.
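
A compact sketch of this pipeline, assuming NumPy and scikit-learn; h and S are illustrative choices, and the image sides are assumed divisible by h.

# PCA over non-overlapping blocks, per-pixel projection, then 2-means.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def change_map(diff, h=4, S=3):
    """diff: 2D absolute-difference image with sides divisible by h."""
    H, W = diff.shape
    blocks = (diff.reshape(H // h, h, W // h, h)
                  .swapaxes(1, 2).reshape(-1, h * h))
    pca = PCA(n_components=S).fit(blocks)        # eigenvector space
    padded = np.pad(diff, h // 2, mode="reflect")
    feats = np.stack([padded[i:i + h, j:j + h].ravel()
                      for i in range(H) for j in range(W)])
    proj = pca.transform(feats)                  # S-dim vector per pixel
    labels = KMeans(n_clusters=2, n_init=10).fit_predict(proj)
    return labels.reshape(H, W)                  # changed vs. unchanged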

817 citations


Journal ArticleDOI
TL;DR: The proposed features are robust to image rotation, less sensitive to histogram equalization and noise, and achieve the highest classification accuracy in various texture databases and image conditions.
Abstract: This paper proposes a novel approach to extract image features for texture classification. The proposed features are robust to image rotation and less sensitive to histogram equalization and noise. The approach comprises two sets of features: dominant local binary patterns (DLBP) in a texture image, and supplementary features extracted from the circularly symmetric Gabor filter responses. The dominant local binary pattern method makes use of the most frequently occurring patterns to capture descriptive textural information, while the Gabor-based features aim at supplying additional global textural information to the DLBP features. Through experiments, the proposed approach has been intensively evaluated by applying a large number of classification tests to histogram-equalized, randomly rotated, and noise-corrupted images in the Outex, Brodatz, Meastex, and CUReT texture image databases. Our method has also been compared with six published texture features in the experiments. It is experimentally demonstrated that the proposed method achieves the highest classification accuracy in various texture databases and image conditions.
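
A sketch of the dominant-pattern selection step, assuming scikit-image: sort the LBP histogram and keep the most frequent patterns until they cover a fixed share of all occurrences (the 80% threshold here is illustrative).

# Pick the set of LBP patterns that jointly cover `coverage` of all pixels.
import numpy as np
from skimage.feature import local_binary_pattern

def dominant_patterns(image, P=8, R=1, coverage=0.8):
    lbp = local_binary_pattern(image, P=P, R=R)
    hist = np.bincount(lbp.astype(int).ravel(), minlength=2 ** P)
    order = np.argsort(hist)[::-1]               # most frequent first
    cum = np.cumsum(hist[order]) / hist.sum()
    k = int(np.searchsorted(cum, coverage)) + 1
    return order[:k], hist[order[:k]] / hist.sum()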

786 citations


Journal ArticleDOI
TL;DR: Time-frequency-domain signal processing using energy concentration as a feature is a powerful tool that has been utilized in numerous applications; further research on and applications of these algorithms are expected to flourish in the near future.

646 citations


Journal ArticleDOI
TL;DR: An empirical feature analysis for audio environment characterization is performed, and a matching pursuit algorithm is proposed to obtain effective time-frequency features that yield higher recognition accuracy for environmental sounds.
Abstract: The paper considers the task of recognizing environmental sounds for the understanding of a scene or context surrounding an audio sensor. A variety of features have been proposed for audio recognition, including the popular Mel-frequency cepstral coefficients (MFCCs), which describe the audio spectral shape. Environmental sounds, such as chirpings of insects and sounds of rain, which are typically noise-like with a broad flat spectrum, may include strong temporal domain signatures. However, only a few temporal-domain features have previously been developed to characterize such diverse audio signals. Here, we perform an empirical feature analysis for audio environment characterization and propose to use the matching pursuit (MP) algorithm to obtain effective time-frequency features. The MP-based method utilizes a dictionary of atoms for feature selection, resulting in a flexible, intuitive and physically interpretable set of features. The MP-based feature is adopted to supplement the MFCC features to yield higher recognition accuracy for environmental sounds. Extensive experiments are conducted to demonstrate the effectiveness of these joint features for unstructured environmental sound classification, including listening tests to study human recognition capabilities. Our recognition system is shown to produce performance comparable to that of human listeners.
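
A bare-bones matching pursuit loop over a unit-norm dictionary, in the spirit of the MP-based features above; building the dictionary itself (e.g. from Gabor atoms) is left out.

# Greedy MP: repeatedly subtract the atom most correlated with the residual.
import numpy as np

def matching_pursuit(signal, dictionary, n_atoms=5):
    """dictionary: (n_samples, n_total_atoms) with unit-norm columns."""
    residual = signal.astype(float).copy()
    atoms, coeffs = [], []
    for _ in range(n_atoms):
        correlations = dictionary.T @ residual
        k = int(np.argmax(np.abs(correlations)))
        atoms.append(k)
        coeffs.append(correlations[k])
        residual -= correlations[k] * dictionary[:, k]
    return atoms, coeffs        # atom indices and weights act as features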

Proceedings ArticleDOI
01 Sep 2009
TL;DR: A novel matching scheme, the spatio-temporal relationship match, is designed to measure structural similarity between sets of features extracted from two videos, thereby enabling detection and localization of complex non-periodic activities.
Abstract: Human activity recognition is a challenging task, especially when its background is unknown or changing, and when scale or illumination differs in each video. Approaches utilizing spatio-temporal local features have proved that they are able to cope with such difficulties, but they mainly focused on classifying short videos of simple periodic actions. In this paper, we present a new activity recognition methodology that overcomes the limitations of the previous approaches using local features. We introduce a novel matching scheme, the spatio-temporal relationship match, designed to measure structural similarity between sets of features extracted from two videos. Our match hierarchically considers spatio-temporal relationships among feature points, thereby enabling detection and localization of complex non-periodic activities. In contrast to previous approaches that ‘classify’ videos, our approach is designed to ‘detect and localize’ all occurring activities from continuous videos where multiple actors and pedestrians are present. We implement and test our methodology on a newly-introduced dataset containing videos of multiple interacting persons and individual pedestrians. The results confirm that our system is able to recognize complex non-periodic activities (e.g. ‘push’ and ‘hug’) from sets of spatio-temporal features even when multiple activities are present in the scene.

Proceedings ArticleDOI
20 Jun 2009
TL;DR: This paper proposes to represent each frame of a video using a histogram of oriented optical flow (HOOF) and to recognize human actions by classifying HOOF time-series, and proposes a generalization of the Binet-Cauchy kernels to nonlinear dynamical systems (NLDS) whose output lives in a non-Euclidean space.
Abstract: System theoretic approaches to action recognition model the dynamics of a scene with linear dynamical systems (LDSs) and perform classification using metrics on the space of LDSs, e.g. Binet-Cauchy kernels. However, such approaches are only applicable to time series data living in a Euclidean space, e.g. joint trajectories extracted from motion capture data or feature point trajectories extracted from video. Much of the success of recent object recognition techniques relies on the use of more complex feature descriptors, such as SIFT descriptors or HOG descriptors, which are essentially histograms. Since histograms live in a non-Euclidean space, we can no longer model their temporal evolution with LDSs, nor can we classify them using a metric for LDSs. In this paper, we propose to represent each frame of a video using a histogram of oriented optical flow (HOOF) and to recognize human actions by classifying HOOF time-series. For this purpose, we propose a generalization of the Binet-Cauchy kernels to nonlinear dynamical systems (NLDS) whose output lives in a non-Euclidean space, e.g. the space of histograms. This can be achieved by using kernels defined on the original non-Euclidean space, leading to a well-defined metric for NLDSs. We use these kernels for the classification of actions in video sequences using HOOF as the output of the NLDS. We evaluate our approach to recognition of human actions in several scenarios and achieve encouraging results.
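
A simplified per-frame HOOF sketch, assuming OpenCV's Farneback optical flow; the paper's histogram bins are symmetric about the vertical axis, a detail this toy version glosses over.

# Magnitude-weighted histogram of flow orientations, one per frame pair.
import cv2
import numpy as np

def hoof(prev_gray, gray, bins=30):
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])  # ang in radians
    hist, _ = np.histogram(ang, bins=bins, range=(0, 2 * np.pi), weights=mag)
    return hist / (hist.sum() + 1e-9)            # normalize to a distribution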

Journal ArticleDOI
TL;DR: Preliminary experimental results show that the third criterion is a potential discriminative subspace selection method, which significantly reduces the class separation problem compared with the linear dimensionality reduction step in FLDA and several of its representative extensions.
Abstract: Subspace selection approaches are powerful tools in pattern classification and data visualization. One of the most important subspace approaches is the linear dimensionality reduction step in Fisher's linear discriminant analysis (FLDA), which has been successfully employed in many fields such as biometrics, bioinformatics, and multimedia information management. However, the linear dimensionality reduction step in FLDA has a critical drawback: for a classification task with c classes, if the dimension of the projected subspace is strictly lower than c - 1, the projection to a subspace tends to merge those classes which are close together in the original feature space. If separate classes are sampled from Gaussian distributions, all with identical covariance matrices, then the linear dimensionality reduction step in FLDA maximizes the mean value of the Kullback-Leibler (KL) divergences between different classes. Based on this viewpoint, the geometric mean for subspace selection is studied in this paper. Three criteria are analyzed: 1) maximization of the geometric mean of the KL divergences, 2) maximization of the geometric mean of the normalized KL divergences, and 3) the combination of 1) and 2). Preliminary experimental results based on synthetic data, the UCI Machine Learning Repository, and handwritten digits show that the third criterion is a potential discriminative subspace selection method, which significantly reduces the class separation problem compared with the linear dimensionality reduction step in FLDA and several of its representative extensions.

Proceedings ArticleDOI
01 Sep 2009
TL;DR: This paper describes a human detection method that augments widely used edge-based features with texture and color information, providing us with a much richer descriptor set, and is shown to outperform state-of-the-art techniques on three varied datasets.
Abstract: Significant research has been devoted to detecting people in images and videos. In this paper we describe a human detection method that augments widely used edge-based features with texture and color information, providing us with a much richer descriptor set. This augmentation results in an extremely high-dimensional feature space (more than 170,000 dimensions). In such high-dimensional spaces, classical machine learning algorithms such as SVMs are nearly intractable with respect to training. Furthermore, the number of training samples is much smaller than the dimensionality of the feature space, by at least an order of magnitude. Finally, the extraction of features from a densely sampled grid structure leads to a high degree of multicollinearity. To circumvent these data characteristics, we employ Partial Least Squares (PLS) analysis, an efficient dimensionality reduction technique, one which preserves significant discriminative information, to project the data onto a much lower dimensional subspace (20 dimensions, reduced from the original 170,000). Our human detection system, employing PLS analysis over the enriched descriptor set, is shown to outperform state-of-the-art techniques on three varied datasets including the popular INRIA pedestrian dataset, the low-resolution gray-scale DaimlerChrysler pedestrian dataset, and the ETHZ pedestrian dataset consisting of full-length videos of crowded scenes.
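
A sketch of the PLS projection step, assuming scikit-learn; the 20 components mirror the paper's choice, while the random data and the linear SVM are placeholders.

# Supervised projection to 20 dimensions, then a linear classifier.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.svm import LinearSVC

X = np.random.randn(500, 5000)         # stand-in high-dimensional descriptors
y = np.random.randint(0, 2, 500)       # human / non-human labels

pls = PLSRegression(n_components=20).fit(X, y)
X_low = pls.transform(X)               # (500, 20)
clf = LinearSVC().fit(X_low, y)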

Journal ArticleDOI
TL;DR: The findings show that, although the wavelet transform approach can be used to characterize nonstationary signals, it does not perform as accurately as frequency-based features when classifying dynamic activities performed by healthy subjects.
Abstract: Driven by the demands on healthcare resulting from the shift toward more sedentary lifestyles, considerable effort has been devoted to the monitoring and classification of human activity. In previous studies, various classification schemes and feature extraction methods have been used to identify different activities from a range of different datasets. In this paper, we present a comparison of 14 methods to extract classification features from accelerometer signals. These are based on the wavelet transform and other well-known time- and frequency-domain signal characteristics. To allow an objective comparison between the different features, we used two datasets of activities collected from 20 subjects. The first set comprised three commonly used activities, namely, level walking, stair ascent, and stair descent, and the second a total of eight activities. Furthermore, we compared the classification accuracy for each feature set across different combinations of three different accelerometer placements. The classification analysis has been performed with robust subject-based cross-validation methods using a nearest-neighbor classifier. The findings show that, although the wavelet transform approach can be used to characterize nonstationary signals, it does not perform as accurately as frequency-based features when classifying dynamic activities performed by healthy subjects. Overall, the best feature sets achieved over 95% intersubject classification accuracy.
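
A sketch of typical time- and frequency-domain features from one window of accelerometer samples, in the spirit of the feature sets compared above; the exact feature list is illustrative.

# A handful of time- and frequency-domain features for one signal window.
import numpy as np

def window_features(x, fs=50.0):
    """x: 1D array of accelerometer samples; fs: sampling rate in Hz."""
    spectrum = np.abs(np.fft.rfft(x - x.mean())) ** 2
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    return np.array([
        x.mean(), x.std(),                   # time domain
        np.mean(np.abs(np.diff(x))),         # mean absolute first difference
        freqs[np.argmax(spectrum)],          # dominant frequency
        spectrum.sum() / len(x),             # spectral energy
    ])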

Proceedings ArticleDOI
01 Sep 2009
TL;DR: This work presents an activity recognition feature inspired by human psychophysical performance, based on the velocity history of tracked keypoints, and presents a generative mixture model for video sequences using this feature, and shows that it performs comparably to local spatio-temporal features on the KTH activity recognition dataset.
Abstract: We present an activity recognition feature inspired by human psychophysical performance. This feature is based on the velocity history of tracked keypoints. We present a generative mixture model for video sequences using this feature, and show that it performs comparably to local spatio-temporal features on the KTH activity recognition dataset. In addition, we contribute a new activity recognition dataset, focusing on activities of daily living, with high-resolution video sequences of complex actions. We demonstrate the superiority of our velocity history feature on high-resolution video sequences of complicated activities. Further, we show how the velocity history feature can be extended, both with a more sophisticated latent velocity model, and by combining the velocity history feature with other useful information, like appearance, position, and high-level semantic information. Our approach performs comparably to established and state-of-the-art methods on the KTH dataset, and significantly outperforms all other methods on our challenging new dataset.

Proceedings ArticleDOI
19 Apr 2009
TL;DR: The proposed features can detect duplicated regions in images very accurately, even when the copied region has undergone severe image manipulations; the use of counting Bloom filters offers a considerable improvement in time efficiency at the expense of a slight reduction in robustness.
Abstract: Copy-move forgery is a specific type of image tampering, where a part of the image is copied and pasted on another part of the same image. In this paper, we propose a new approach for detecting copy-move forgery in digital images, which is considerably more robust to lossy compression, scaling and rotation type of manipulations. Also, to improve the computational complexity of detecting the duplicated image regions, we propose to use the notion of counting Bloom filters as an alternative to lexicographic sorting, which is a common component of most of the proposed copy-move forgery detection schemes. Our experimental results show that the proposed features can detect duplicated regions in images very accurately, even when the copied region has undergone severe image manipulations. In addition, it is observed that the use of counting Bloom filters offers a considerable improvement in time efficiency at the expense of a slight reduction in robustness.
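
A toy counting Bloom filter for flagging repeated quantized image blocks, the alternative to lexicographic sorting described above; the hash construction and table size are illustrative.

# Each quantized block (as bytes) is hashed k ways; a block whose counters
# are all already non-zero is a candidate duplicate.
import hashlib
import numpy as np

class CountingBloomFilter:
    def __init__(self, size=1 << 20, n_hashes=3):
        self.counts = np.zeros(size, dtype=np.uint8)
        self.size, self.n_hashes = size, n_hashes

    def _indexes(self, item):
        for i in range(self.n_hashes):
            digest = hashlib.sha1(item + bytes([i])).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        """Insert `item` (bytes); return True if it was probably seen before."""
        idx = list(self._indexes(item))
        seen = all(self.counts[j] > 0 for j in idx)
        for j in idx:
            self.counts[j] = min(int(self.counts[j]) + 1, 255)
        return seen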

Proceedings ArticleDOI
20 Jun 2009
TL;DR: A fast location recognition technique based on structure from motion point clouds is presented, in which vocabulary-tree-based indexing of features directly returns relevant fragments of 3D models instead of documents from an image database.
Abstract: Efficient view registration with respect to a given 3D reconstruction has many applications like inside-out tracking in indoor and outdoor environments, and geo-locating images from large photo collections. We present a fast location recognition technique based on structure from motion point clouds. Vocabulary-tree-based indexing of features directly returns relevant fragments of 3D models instead of documents from an image database. Additionally, we propose a compressed 3D scene representation which improves recognition rates while simultaneously reducing the computation time and the memory consumption. The design of our method is based on algorithms that efficiently utilize modern graphics processing units to deliver real-time performance for view registration. We demonstrate the approach by matching hand-held outdoor videos to known 3D urban models, and by registering images from online photo collections to the corresponding landmarks.

Proceedings ArticleDOI
01 Sep 2009
TL;DR: This paper quantitatively evaluates the design of a system for recognizing objects in 3D point clouds of urban environments, and the tradeoffs of different design alternatives, on a truthed part of a scan of Ottawa that contains approximately 100 million points and 1000 objects of interest.
Abstract: This paper investigates the design of a system for recognizing objects in 3D point clouds of urban environments. The system is decomposed into four steps: locating, segmenting, characterizing, and classifying clusters of 3D points. Specifically, we first cluster nearby points to form a set of potential object locations (with hierarchical clustering). Then, we segment points near those locations into foreground and background sets (with a graph-cut algorithm). Next, we build a feature vector for each point cluster (based on both its shape and its context). Finally, we label the feature vectors using a classifier trained on a set of manually labeled objects. The paper presents several alternative methods for each step. We quantitatively evaluate the system and tradeoffs of different alternatives in a truthed part of a scan of Ottawa that contains approximately 100 million points and 1000 objects of interest. Then, we use this truth data as a training set to recognize objects amidst approximately 1 billion points of the remainder of the Ottawa scan.

Journal ArticleDOI
TL;DR: This paper shows that various crucial factors in video annotation, including multiple modalities, multiple distance functions, and temporal consistency, all correspond to different relationships among video units, and hence they can be represented by different graphs, and proposes optimized multigraph-based semi-supervised learning (OMG-SSL), which aims to simultaneously tackle these difficulties in a unified scheme.
Abstract: Learning-based video annotation is a promising approach to facilitating video retrieval and it can avoid the intensive labor costs of pure manual annotation. But it frequently encounters several difficulties, such as insufficiency of training data and the curse of dimensionality. In this paper, we propose a method named optimized multigraph-based semi-supervised learning (OMG-SSL), which aims to simultaneously tackle these difficulties in a unified scheme. We show that various crucial factors in video annotation, including multiple modalities, multiple distance functions, and temporal consistency, all correspond to different relationships among video units, and hence they can be represented by different graphs. Therefore, these factors can be simultaneously dealt with by learning with multiple graphs, namely, the proposed OMG-SSL approach. Different from the existing graph-based semi-supervised learning methods that only utilize one graph, OMG-SSL integrates multiple graphs into a regularization framework in order to sufficiently explore their complementation. We show that this scheme is equivalent to first fusing multiple graphs and then conducting semi-supervised learning on the fused graph. Through an optimization approach, it is able to assign suitable weights to the graphs. Furthermore, we show that the proposed method can be implemented through a computationally efficient iterative process. Extensive experiments on the TREC video retrieval evaluation (TRECVID) benchmark have demonstrated the effectiveness and efficiency of our proposed approach.
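
The core idea can be sketched as fusing several affinity graphs with weights and running standard graph-based label propagation on the fused graph; the weights here are fixed, whereas OMG-SSL learns them through optimization (NumPy assumed).

# Weighted graph fusion followed by label propagation (Zhou et al. style).
import numpy as np

def fused_label_propagation(graphs, weights, Y, alpha=0.99, n_iter=50):
    """graphs: list of (n, n) affinity matrices; Y: (n, c) one-hot labels,
    with all-zero rows for unlabeled nodes."""
    W = sum(w * G for w, G in zip(weights, graphs))   # fused graph
    d = W.sum(axis=1)
    S = W / np.sqrt(np.outer(d, d))                   # D^-1/2 W D^-1/2
    F = Y.astype(float).copy()
    for _ in range(n_iter):
        F = alpha * S @ F + (1 - alpha) * Y           # propagate + clamp
    return F.argmax(axis=1)                           # label per node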

Proceedings ArticleDOI
Zhong Wu, Qifa Ke, Michael Isard, Jian Sun
20 Jun 2009
TL;DR: This paper presents a novel scheme where image features are bundled into local groups and each group of bundled features becomes much more discriminative than a single feature, and within each group simple and robust geometric constraints can be efficiently enforced.
Abstract: In state-of-the-art image retrieval systems, an image is represented by a bag of visual words obtained by quantizing high-dimensional local image descriptors, and scalable schemes inspired by text retrieval are then applied for large scale image indexing and retrieval. Bag-of-words representations, however: 1) reduce the discriminative power of image features due to feature quantization; and 2) ignore geometric relationships among visual words. Exploiting such geometric constraints, by estimating a 2D affine transformation between a query image and each candidate image, has been shown to greatly improve retrieval precision but at high computational cost. In this paper we present a novel scheme where image features are bundled into local groups. Each group of bundled features becomes much more discriminative than a single feature, and within each group simple and robust geometric constraints can be efficiently enforced. Experiments in Web image search, with a database of more than one million images, show that our scheme achieves a 49% improvement in average precision over the baseline bag-of-words approach. Retrieval performance is comparable to existing full geometric verification approaches while being much less computationally expensive. When combined with full geometric verification we achieve a 77% precision improvement over the baseline bag-of-words approach, and a 24% improvement over full geometric verification alone.

Proceedings ArticleDOI
20 Jun 2009
TL;DR: This paper presents a unified framework for object detection, segmentation, and classification using regions using a generalized Hough voting scheme to generate hypotheses of object locations, scales and support, followed by a verification classifier and a constrained segmenter on each hypothesis.
Abstract: This paper presents a unified framework for object detection, segmentation, and classification using regions. Region features are appealing in this context because: (1) they encode shape and scale information of objects naturally; (2) they are only mildly affected by background clutter. Regions have not been popular as features due to their sensitivity to segmentation errors. In this paper, we start by producing a robust bag of overlaid regions for each image using the method of Arbelaez et al. (CVPR 2009). Each region is represented by a rich set of image cues (shape, color and texture). We then learn region weights using a max-margin framework. In detection and segmentation, we apply a generalized Hough voting scheme to generate hypotheses of object locations, scales and support, followed by a verification classifier and a constrained segmenter on each hypothesis. The proposed approach significantly outperforms the state of the art on the ETHZ shape database (87.1% average detection rate compared to Ferrari et al.'s 67.2%), and achieves competitive performance on the Caltech 101 database.

Journal ArticleDOI
TL;DR: This paper proposes a method called MLNB, which adapts traditional naive Bayes classifiers to deal with multi-label instances and achieves performance comparable to other well-established multi-label learning algorithms.

Journal ArticleDOI
TL;DR: This paper presents a generic and patient-specific classification system designed for robust and accurate detection of ECG heartbeat patterns that can adapt to significant interpatient variations in ECG patterns by training the optimal network structure, and achieves higher accuracy over larger datasets.
Abstract: This paper presents a generic and patient-specific classification system designed for robust and accurate detection of ECG heartbeat patterns. The proposed feature extraction process utilizes morphological wavelet transform features, which are projected onto a lower dimensional feature space using principal component analysis, and temporal features from the ECG data. For the pattern recognition unit, feedforward and fully connected artificial neural networks, which are optimally designed for each patient by the proposed multidimensional particle swarm optimization technique, are employed. By using relatively small common and patient-specific training data, the proposed classification system can adapt to significant interpatient variations in ECG patterns by training the optimal network structure, and thus achieves higher accuracy over larger datasets. The classification experiments over a benchmark database demonstrate that the proposed system achieves average accuracies and sensitivities better than those of most current state-of-the-art algorithms for detection of ventricular ectopic beats (VEBs) and supra-VEBs (SVEBs). Over the entire database, the average accuracy-sensitivity performances of the proposed system for VEB and SVEB detection are 98.3%-84.6% and 97.4%-63.5%, respectively. Finally, due to its parameter-invariant nature, the proposed system is highly generic, and thus applicable to any ECG dataset.
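
A sketch of the morphological feature path described above: wavelet decomposition of each segmented heartbeat followed by PCA, assuming PyWavelets and scikit-learn; the wavelet family, decomposition level, and output dimension are illustrative.

# Wavelet coefficients per beat, projected to a lower dimension with PCA.
import numpy as np
import pywt
from sklearn.decomposition import PCA

def beat_features(beats, n_components=9):
    """beats: (n_beats, n_samples) array of segmented heartbeats."""
    coeffs = np.vstack([np.concatenate(pywt.wavedec(b, "db4", level=4))
                        for b in beats])
    return PCA(n_components=n_components).fit_transform(coeffs)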

Proceedings ArticleDOI
20 Jun 2009
TL;DR: A 3D feature detector and feature descriptor for uniformly triangulated meshes, invariant to changes in rotation, translation, and scale are proposed and defined generically for any scalar function, e.g., local curvature.
Abstract: In this paper we revisit local feature detectors/descriptors developed for 2D images and extend them to the more general framework of scalar fields defined on 2D manifolds. We provide methods and tools to detect and describe features on surfaces equipped with scalar functions, such as photometric information. This is motivated by the growing need for matching and tracking photometric surfaces over temporal sequences, due to recent advancements in multiple camera 3D reconstruction. We propose a 3D feature detector (MeshDOG) and a 3D feature descriptor (MeshHOG) for uniformly triangulated meshes, invariant to changes in rotation, translation, and scale. The descriptor is able to capture the local geometric and/or photometric properties in a succinct fashion. Moreover, the method is defined generically for any scalar function, e.g., local curvature. Results with matching rigid and non-rigid meshes demonstrate the interest of the proposed framework.

Proceedings ArticleDOI
08 Dec 2009
TL;DR: A novel open-source affect and emotion recognition engine, which integrates all necessary components in one highly efficient software package, and which can be used for batch processing of databases.
Abstract: Various open-source toolkits exist for speech recognition and speech processing. These toolkits have brought a great benefit to the research community, i.e. speeding up research. Yet, no such freely available toolkit exists for automatic affect recognition from speech. We herein introduce a novel open-source affect and emotion recognition engine, which integrates all necessary components in one highly efficient software package. The components include audio recording and audio file reading, state-of-the-art paralinguistic feature extraction and pluggable classification modules. In this paper we introduce the engine and extensive baseline results. Pre-trained models for four affect recognition tasks are included in the openEAR distribution. The engine is tailored for multi-threaded, incremental on-line processing of live input in real time; however, it can also be used for batch processing of databases.

Journal ArticleDOI
TL;DR: Experimental results on three challenging iris image databases demonstrate that the proposed algorithm outperforms state-of-the-art methods in both accuracy and speed.
Abstract: Iris segmentation is an essential module in iris recognition because it defines the effective image region used for subsequent processing such as feature extraction. Traditional iris segmentation methods often involve an exhaustive search of a large parameter space, which is time consuming and sensitive to noise. To address these problems, this paper presents a novel algorithm for accurate and fast iris segmentation. After efficient reflection removal, an Adaboost-cascade iris detector is first built to extract a rough position of the iris center. Edge points of iris boundaries are then detected, and an elastic model named pulling and pushing is established. Under this model, the center and radius of the circular iris boundaries are iteratively refined in a way driven by the restoring forces of Hooke's law. Furthermore, a smoothing spline-based edge fitting scheme is presented to deal with noncircular iris boundaries. After that, eyelids are localized via edge detection followed by curve fitting. The novelty here is the adoption of a rank filter for noise elimination and a histogram filter for tackling the shape irregularity of eyelids. Finally, eyelashes and shadows are detected via a learned prediction model. This model provides an adaptive threshold for eyelash and shadow detection by analyzing the intensity distributions of different iris regions. Experimental results on three challenging iris image databases demonstrate that the proposed algorithm outperforms state-of-the-art methods in both accuracy and speed.

Proceedings ArticleDOI
11 Oct 2009
TL;DR: The experimental results demonstrate that the use of an enriched feature set analyzed by PLS reduces the ambiguity among different appearances and provides higher recognition rates when compared to other machine learning techniques.
Abstract: Appearance information is essential for applications such as tracking and people recognition. One of the main problems of using appearance-based discriminative models is the ambiguities among classes when the number of persons being considered increases. To reduce the amount of ambiguity, we propose the use of a rich set of feature descriptors based on color, textures and edges. Another issue regarding appearance modeling is the limited number of training samples available for each appearance. The discriminative models are created using a powerful statistical tool called Partial Least Squares (PLS), responsible for weighting the features according to their discriminative power for each different appearance. The experimental results, based on appearance-based person recognition, demonstrate that the use of an enriched feature set analyzed by PLS reduces the ambiguity among different appearances and provides higher recognition rates when compared to other machine learning techniques.

Journal ArticleDOI
TL;DR: An automatic method for reconstruction of building facade models from terrestrial laser scanning data, using knowledge about the features’ sizes, positions, orientations, and topology to recognize these features in a segmented laser point cloud.
Abstract: This paper presents an automatic method for reconstruction of building facade models from terrestrial laser scanning data. Important facade elements such as walls and roofs are distinguished as features. Knowledge about the features’ sizes, positions, orientations, and topology is then introduced to recognize these features in a segmented laser point cloud. An outline polygon of each feature is generated by least squares fitting, convex hull fitting or concave polygon fitting, according to the size of the feature. Knowledge is used again to hypothesise the occluded parts from the directly extracted feature polygons. Finally, a polyhedron building model is combined from extracted feature polygons and hypothesised parts. The reconstruction method is tested with two data sets containing various building shapes.