scispace - formally typeset
Search or ask a question

Showing papers in "arXiv: Computer Vision and Pattern Recognition in 2012"


Posted Content
TL;DR: This work introduces UCF101 which is currently the largest dataset of human actions and provides baseline action recognition results on this new dataset using standard bag of words approach with overall performance of 44.5%.
Abstract: We introduce UCF101 which is currently the largest dataset of human actions. It consists of 101 action classes, over 13k clips and 27 hours of video data. The database consists of realistic user uploaded videos containing camera motion and cluttered background. Additionally, we provide baseline action recognition results on this new dataset using standard bag of words approach with overall performance of 44.5%. To the best of our knowledge, UCF101 is currently the most challenging dataset of actions due to its large number of classes, large number of clips and also unconstrained nature of such clips.

4,784 citations


Posted Content
TL;DR: This paper proposes and studies an algorithm, called sparse subspace clustering, to cluster data points that lie in a union of low-dimensional subspaces, and demonstrates the effectiveness of the proposed algorithm through experiments on synthetic data as well as the two real-world problems of motion segmentation and face clustering.
Abstract: In many real-world problems, we are dealing with collections of high-dimensional data, such as images, videos, text and web documents, DNA microarray data, and more. Often, high-dimensional data lie close to low-dimensional structures corresponding to several classes or categories the data belongs to. In this paper, we propose and study an algorithm, called Sparse Subspace Clustering (SSC), to cluster data points that lie in a union of low-dimensional subspaces. The key idea is that, among infinitely many possible representations of a data point in terms of other points, a sparse representation corresponds to selecting a few points from the same subspace. This motivates solving a sparse optimization program whose solution is used in a spectral clustering framework to infer the clustering of data into subspaces. Since solving the sparse optimization program is in general NP-hard, we consider a convex relaxation and show that, under appropriate conditions on the arrangement of subspaces and the distribution of data, the proposed minimization program succeeds in recovering the desired sparse representations. The proposed algorithm can be solved efficiently and can handle data points near the intersections of subspaces. Another key advantage of the proposed algorithm with respect to the state of the art is that it can deal with data nuisances, such as noise, sparse outlying entries, and missing entries, directly by incorporating the model of the data into the sparse optimization program. We demonstrate the effectiveness of the proposed algorithm through experiments on synthetic data as well as the two real-world problems of motion segmentation and face clustering.

1,521 citations


Posted Content
TL;DR: This work reports state-of-the-art and competitive results on all major pedestrian datasets with a convolutional network model that uses a few new twists, such as multi-stage features, connections that skip layers to integrate global shape information with local distinctive motif information.
Abstract: Pedestrian detection is a problem of considerable practical interest. Adding to the list of successful applications of deep learning methods to vision, we report state-of-the-art and competitive results on all major pedestrian datasets with a convolutional network model. The model uses a few new twists, such as multi-stage features, connections that skip layers to integrate global shape information with local distinctive motif information, and an unsupervised method based on convolutional sparse coding to pre-train the filters at each stage.

713 citations


Posted Content
TL;DR: A wavelet scattering network as discussed by the authors computes a translation invariant image representation, which is stable to deformations and preserves high frequency information for classification, cascading wavelet transform convolutions with nonlinear modulus and averaging operators.
Abstract: A wavelet scattering network computes a translation invariant image representation, which is stable to deformations and preserves high frequency information for classification. It cascades wavelet transform convolutions with non-linear modulus and averaging operators. The first network layer outputs SIFT-type descriptors whereas the next layers provide complementary invariant information which improves classification. The mathematical analysis of wavelet scattering networks explains important properties of deep convolution networks for classification. A scattering representation of stationary processes incorporates higher order moments and can thus discriminate textures having the same Fourier power spectrum. State of the art classification results are obtained for handwritten digits and texture discrimination, using a Gaussian kernel SVM and a generative PCA classifier.

630 citations


Posted Content
TL;DR: In this paper, a set of discriminative patches which can serve as a fully unsupervised mid-level visual representation is discovered. But these patches are not restricted to be any one of the parts, objects, visual phrases, etc.
Abstract: The goal of this paper is to discover a set of discriminative patches which can serve as a fully unsupervised mid-level visual representation. The desired patches need to satisfy two requirements: 1) to be representative, they need to occur frequently enough in the visual world; 2) to be discriminative, they need to be different enough from the rest of the visual world. The patches could correspond to parts, objects, "visual phrases", etc. but are not restricted to be any one of them. We pose this as an unsupervised discriminative clustering problem on a huge dataset of image patches. We use an iterative procedure which alternates between clustering and training discriminative classifiers, while applying careful cross-validation at each step to prevent overfitting. The paper experimentally demonstrates the effectiveness of discriminative patches as an unsupervised mid-level visual representation, suggesting that it could be used in place of visual words for many tasks. Furthermore, discriminative patches can also be used in a supervised regime, such as scene classification, where they demonstrate state-of-the-art performance on the MIT Indoor-67 dataset.

539 citations


Journal ArticleDOI
TL;DR: Wang et al. as mentioned in this paper examined the 15 most prominent feature sets and analyzed the detection performance on a per-image basis and on per-pixel basis, and found that the keypoint-based features SIFT and SURF, as well as the block-based DCT, DWT, KPCA, PCA and Zernike features perform very well.
Abstract: A copy-move forgery is created by copying and pasting content within the same image, and potentially post-processing it. In recent years, the detection of copy-move forgeries has become one of the most actively researched topics in blind image forensics. A considerable number of different algorithms have been proposed focusing on different types of postprocessed copies. In this paper, we aim to answer which copy-move forgery detection algorithms and processing steps (e.g., matching, filtering, outlier detection, affine transformation estimation) perform best in various postprocessing scenarios. The focus of our analysis is to evaluate the performance of previously proposed feature sets. We achieve this by casting existing algorithms in a common pipeline. In this paper, we examined the 15 most prominent feature sets. We analyzed the detection performance on a per-image basis and on a per-pixel basis. We created a challenging real-world copy-move dataset, and a software framework for systematic image manipulation. Experiments show, that the keypoint-based features SIFT and SURF, as well as the block-based DCT, DWT, KPCA, PCA and Zernike features perform very well. These feature sets exhibit the best robustness against various noise sources and downsampling, while reliably identifying the copied regions.

429 citations


Posted Content
TL;DR: A time-line view of the advances made in this field, the applications of automatic face expression recognizers, the characteristics of an ideal system, the databases that have been used and the advancesmade in terms of their standardization and a detailed summary of the state of the art are presented.
Abstract: The automatic recognition of facial expressions has been an active research topic since the early nineties There have been several advances in the past few years in terms of face detection and tracking, feature extraction mechanisms and the techniques used for expression classification This paper surveys some of the published work since 2001 till date The paper presents a time-line view of the advances made in this field, the applications of automatic face expression recognizers, the characteristics of an ideal system, the databases that have been used and the advances made in terms of their standardization and a detailed summary of the state of the art The paper also discusses facial parameterization using FACS Action Units (AUs) and MPEG-4 Facial Animation Parameters (FAPs) and the recent advances in face detection, tracking and feature extraction methods Notes have also been presented on emotions, expressions and facial features, discussion on the six prototypic expressions and the recent studies on expression classifiers The paper ends with a note on the challenges and the future work This paper has been written in a tutorial style with the intention of helping students and researchers who are new to this field

304 citations


Posted Content
TL;DR: In this article, the pairwise edge potentials are defined by a linear combination of Gaussian kernels, which improves segmentation and labeling accuracy significantly and has been shown to improve segmentation performance.
Abstract: Most state-of-the-art techniques for multi-class image segmentation and labeling use conditional random fields defined over pixels or image regions. While region-level models often feature dense pairwise connectivity, pixel-level models are considerably larger and have only permitted sparse graph structures. In this paper, we consider fully connected CRF models defined on the complete set of pixels in an image. The resulting graphs have billions of edges, making traditional inference algorithms impractical. Our main contribution is a highly efficient approximate inference algorithm for fully connected CRF models in which the pairwise edge potentials are defined by a linear combination of Gaussian kernels. Our experiments demonstrate that dense connectivity at the pixel level substantially improves segmentation and labeling accuracy.

239 citations


Posted Content
TL;DR: This paper discusses how SRC works, and shows that the collaborative representation mechanism used in SRC is much more crucial to its success of face classification.
Abstract: By coding a query sample as a sparse linear combination of all training samples and then classifying it by evaluating which class leads to the minimal coding residual, sparse representation based classification (SRC) leads to interesting results for robust face recognition. It is widely believed that the l1- norm sparsity constraint on coding coefficients plays a key role in the success of SRC, while its use of all training samples to collaboratively represent the query sample is rather ignored. In this paper we discuss how SRC works, and show that the collaborative representation mechanism used in SRC is much more crucial to its success of face classification. The SRC is a special case of collaborative representation based classification (CRC), which has various instantiations by applying different norms to the coding residual and coding coefficient. More specifically, the l1 or l2 norm characterization of coding residual is related to the robustness of CRC to outlier facial pixels, while the l1 or l2 norm characterization of coding coefficient is related to the degree of discrimination of facial features. Extensive experiments were conducted to verify the face recognition accuracy and efficiency of CRC with different instantiations.

222 citations


Posted Content
TL;DR: This paper describes a locally adaptive thresholding technique that removes background by using local mean and mean deviation and uses integral sum image as a prior processing to calculate local mean.
Abstract: Image binarization is the process of separation of pixel values into two groups, white as background and black as foreground Thresholding plays a major in binarization of images Thresholding can be categorized into global thresholding and local thresholding In images with uniform contrast distribution of background and foreground like document images, global thresholding is more appropriate In degraded document images, where considerable background noise or variation in contrast and illumination exists, there exists many pixels that cannot be easily classified as foreground or background In such cases, binarization with local thresholding is more appropriate This paper describes a locally adaptive thresholding technique that removes background by using local mean and mean deviation Normally the local mean computational time depends on the window size Our technique uses integral sum image as a prior processing to calculate local mean It does not involve calculations of standard deviations as in other local adaptive techniques This along with the fact that calculations of mean is independent of window size speed up the process as compared to other local thresholding techniques

202 citations


Posted Content
TL;DR: In this paper, a new state-of-the-art of 94.85% accuracy on the SVHN dataset (45.2% error improvement) was achieved by using Lp pooling.
Abstract: We classify digits of real-world house numbers using convolutional neural networks (ConvNets). ConvNets are hierarchical feature learning neural networks whose structure is biologically inspired. Unlike many popular vision approaches that are hand-designed, ConvNets can automatically learn a unique set of features optimized for a given task. We augmented the traditional ConvNet architecture by learning multi-stage features and by using Lp pooling and establish a new state-of-the-art of 94.85% accuracy on the SVHN dataset (45.2% error improvement). Furthermore, we analyze the benefits of different pooling methods and multi-stage features in ConvNets. The source code and a tutorial are available at this http URL.

Posted Content
TL;DR: In this paper, the authors propose the construction of common approximate eigenbases for multiple shapes using approximate joint diagonalization algorithms, which can be used for shape editing, pose transfer, correspondence, and similarity.
Abstract: The use of Laplacian eigenbases has been shown to be fruitful in many computer graphics applications. Today, state-of-the-art approaches to shape analysis, synthesis, and correspondence rely on these natural harmonic bases that allow using classical tools from harmonic analysis on manifolds. However, many applications involving multiple shapes are obstacled by the fact that Laplacian eigenbases computed independently on different shapes are often incompatible with each other. In this paper, we propose the construction of common approximate eigenbases for multiple shapes using approximate joint diagonalization algorithms. We illustrate the benefits of the proposed approach on tasks from shape editing, pose transfer, correspondence, and similarity.

Posted Content
TL;DR: In this article, the authors propose a model that explicitly accounts for the interdependencies between images sharing common properties, such as the groups to which each image belongs, the comment thread associated with the image, who uploaded it, their location, and their network of friends.
Abstract: Large-scale image retrieval benchmarks invariably consist of images from the Web. Many of these benchmarks are derived from online photo sharing networks, like Flickr, which in addition to hosting images also provide a highly interactive social community. Such communities generate rich metadata that can naturally be harnessed for image classification and retrieval. Here we study four popular benchmark datasets, extending them with social-network metadata, such as the groups to which each image belongs, the comment thread associated with the image, who uploaded it, their location, and their network of friends. Since these types of data are inherently relational, we propose a model that explicitly accounts for the interdependencies between images sharing common properties. We model the task as a binary labeling problem on a network, and use structured learning techniques to learn model parameters. We find that social-network metadata are useful in a variety of classification tasks, in many cases outperforming methods based on image content.

Posted Content
TL;DR: The SVD properties for images are experimentally presented to be utilized in developing new SVD-based image processing applications and a survey on the developed SVD based image applications is offered.
Abstract: Singular Value Decomposition (SVD) has recently emerged as a new paradigm for processing different types of images. SVD is an attractive algebraic transform for image processing applications. The paper proposes an experimental survey for the SVD as an efficient transform in image processing applications. Despite the well-known fact that SVD offers attractive properties in imaging, the exploring of using its properties in various image applications is currently at its infancy. Since the SVD has many attractive properties have not been utilized, this paper contributes in using these generous properties in newly image applications and gives a highly recommendation for more research challenges. In this paper, the SVD properties for images are experimentally presented to be utilized in developing new SVD-based image processing applications. The paper offers survey on the developed SVD based image applications. The paper also proposes some new contributions that were originated from SVD properties analysis in different image processing. The aim of this paper is to provide a better understanding of the SVD in image processing and identify important various applications and open research directions in this increasingly important area; SVD based image processing in the future research.

Posted Content
TL;DR: In this article, a more refined notion of coherence is proposed, called local coherence, which measures for each sensing vector separately how correlated it is to the sparsity basis.
Abstract: In many signal processing applications, one wishes to acquire images that are sparse in transform domains such as spatial finite differences or wavelets using frequency domain samples. For such applications, overwhelming empirical evidence suggests that superior image reconstruction can be obtained through variable density sampling strategies that concentrate on lower frequencies. The wavelet and Fourier transform domains are not incoherent because low-order wavelets and low-order frequencies are correlated, so compressive sensing theory does not immediately imply sampling strategies and reconstruction guarantees. In this paper we turn to a more refined notion of coherence -- the so-called local coherence -- measuring for each sensing vector separately how correlated it is to the sparsity basis. For Fourier measurements and Haar wavelet sparsity, the local coherence can be controlled and bounded explicitly, so for matrices comprised of frequencies sampled from a suitable inverse square power-law density, we can prove the restricted isometry property with near-optimal embedding dimensions. Consequently, the variable-density sampling strategy we provide allows for image reconstructions that are stable to sparsity defects and robust to measurement noise. Our results cover both reconstruction by $\ell_1$-minimization and by total variation minimization. The local coherence framework developed in this paper should be of independent interest in sparse recovery problems more generally, as it implies that for optimal sparse recovery results, it suffices to have bounded \emph{average} coherence from sensing basis to sparsity basis -- as opposed to bounded maximal coherence -- as long as the sampling strategy is adapted accordingly.

Posted Content
TL;DR: In this article, the basic ideas of principal component analysis (PCA) and kernel PCA are reviewed, and the reconstruction of pre-images for Kernel PCA is discussed.
Abstract: Principal component analysis (PCA) is a popular tool for linear dimensionality reduction and feature extraction. Kernel PCA is the nonlinear form of PCA, which better exploits the complicated spatial structure of high-dimensional features. In this paper, we first review the basic ideas of PCA and kernel PCA. Then we focus on the reconstruction of pre-images for kernel PCA. We also give an introduction on how PCA is used in active shape models (ASMs), and discuss how kernel PCA can be applied to improve traditional ASMs. Then we show some experimental results to compare the performance of kernel PCA and standard PCA for classification problems. We also implement the kernel PCA-based ASMs, and use it to construct human face models.

Posted Content
TL;DR: In this paper, the authors show that by increasing the number of components, and switching the initialization step from their aspect-ratio, left-right flipping heuristics to appearance-based clustering, considerable improvement in performance is obtained.
Abstract: The main stated contribution of the Deformable Parts Model (DPM) detector of Felzenszwalb et al. (over the Histogram-of-Oriented-Gradients approach of Dalal and Triggs) is the use of deformable parts. A secondary contribution is the latent discriminative learning. Tertiary is the use of multiple components. A common belief in the vision community (including ours, before this study) is that their ordering of contributions reflects the performance of detector in practice. However, what we have experimentally found is that the ordering of importance might actually be the reverse. First, we show that by increasing the number of components, and switching the initialization step from their aspect-ratio, left-right flipping heuristics to appearance-based clustering, considerable improvement in performance is obtained. But more intriguingly, we show that with these new components, the part deformations can now be completely switched off, yet obtaining results that are almost on par with the original DPM detector. Finally, we also show initial results for using multiple components on a different problem -- scene classification, suggesting that this idea might have wider applications in addition to object detection.

Posted Content
TL;DR: In this paper, a fast segmentation method based on a new variant of spectral graph theory named diffusion maps was proposed for OCT images depicting macular and optic nerve head appearance, which does not require edge-based image information and relies on regional image texture.
Abstract: Optical coherence tomography (OCT) is a powerful and noninvasive method for retinal imaging. In this paper, we introduce a fast segmentation method based on a new variant of spectral graph theory named diffusion maps. The research is performed on spectral domain (SD) OCT images depicting macular and optic nerve head appearance. The presented approach does not require edge-based image information and relies on regional image texture. Consequently, the proposed method demonstrates robustness in situations of low image contrast or poor layer-to-layer image gradients. Diffusion mapping is applied to 2D and 3D OCT datasets composed of two steps, one for partitioning the data into important and less important sections, and another one for localization of internal this http URL the first step, the pixels/voxels are grouped in rectangular/cubic sets to form a graph node.The weights of a graph are calculated based on geometric distances between pixels/voxels and differences of their mean intensity.The first diffusion map clusters the data into three parts, the second of which is the area of interest. The other two sections are eliminated from the remaining calculations. In the second step, the remaining area is subjected to another diffusion map assessment and the internal layers are localized based on their textural similarities.The proposed method was tested on 23 datasets from two patient groups (glaucoma and normals). The mean unsigned border positioning errors(mean - SD) was 8.52 - 3.13 and 7.56 - 2.95 micrometer for the 2D and 3D methods, respectively.

Journal ArticleDOI
TL;DR: Chaudhury et al. as discussed by the authors used trigonometric functions for the range kernel of the bilateral filter, and exploited their so-called shiftability property to accelerate the implementation in general.
Abstract: A direct implementation of the bilateral filter [1] requires O(\sigma_s^2) operations per pixel, where \sigma_s is the (effective) width of the spatial kernel. A fast implementation of the bilateral filter was recently proposed in [2] that required O(1) operations per pixel with respect to \sigma_s. This was done by using trigonometric functions for the range kernel of the bilateral filter, and by exploiting their so-called shiftability property. In particular, a fast implementation of the Gaussian bilateral filter was realized by approximating the Gaussian range kernel using raised cosines. Later, it was demonstrated in [3] that this idea could be extended to a larger class of filters, including the popular non-local means filter [4]. As already observed in [2], a flip side of this approach was that the run time depended on the width \sigma_r of the range kernel. For an image with (local) intensity variations in the range [0,T], the run time scaled as O(T^2/\sigma^2_r) with \sigma_r. This made it difficult to implement narrow range kernels, particularly for images with large dynamic range. We discuss this problem in this note, and propose some simple steps to accelerate the implementation in general, and for small \sigma_r in particular. [1] C. Tomasi and R. Manduchi, "Bilateral filtering for gray and color images", Proc. IEEE International Conference on Computer Vision, 1998. [2] K.N. Chaudhury, Daniel Sage, and M. Unser, "Fast O(1) bilateral filtering using trigonometric range kernels", IEEE Transactions on Image Processing, 2011. [3] K.N. Chaudhury, "Constant-time filtering using shiftable kernels", IEEE Signal Processing Letters, 2011. [4] A. Buades, B. Coll, and J.M. Morel, "A review of image denoising algorithms, with a new one", Multiscale Modeling and Simulation, 2005.

Posted Content
TL;DR: In this article, an unsupervised learning of complex articulated object models from 3D range data is proposed, which automatically recovers a decomposition of the object into approximately rigid parts, the location of the parts in the different object instances and the articulated object skeleton linking the parts.
Abstract: We address the problem of unsupervised learning of complex articulated object models from 3D range data. We describe an algorithm whose input is a set of meshes corresponding to different configurations of an articulated object. The algorithm automatically recovers a decomposition of the object into approximately rigid parts, the location of the parts in the different object instances, and the articulated object skeleton linking the parts. Our algorithm first registers allthe meshes using an unsupervised non-rigid technique described in a companion paper. It then segments the meshes using a graphical model that captures the spatial contiguity of parts. The segmentation is done using the EM algorithm, iterating between finding a decomposition of the object into rigid parts, and finding the location of the parts in the object instances. Although the graphical model is densely connected, the object decomposition step can be performed optimally and efficiently, allowing us to identify a large number of object parts while avoiding local maxima. We demonstrate the algorithm on real world datasets, recovering a 15-part articulated model of a human puppet from just 7 different puppet configurations, as well as a 4 part model of a fiexing arm where significant non-rigid deformation was present.

Posted Content
TL;DR: Based on the results, good performance have been achieved for datasets captured under controlled environments, but there is still much work that can be done to improve the robustness of gender recognition under real-life environments.
Abstract: Gender is an important demographic attribute of people. This paper provides a survey of human gender recognition in computer vision. A review of approaches exploiting information from face and whole body (either from a still image or gait sequence) is presented. We highlight the challenges faced and survey the representative methods of these approaches. Based on the results, good performance have been achieved for datasets captured under controlled environments, but there is still much work that can be done to improve the robustness of gender recognition under real-life environments.

Posted Content
TL;DR: In this article, a multiscale convolutional network is used to extract dense feature vectors from a graph of pixel dissimilarities, which encodes regions of multiple sizes centered on each pixel, and the feature vectors associated with the segments covered by each node in the tree are aggregated and fed to a classifier.
Abstract: Scene parsing, or semantic segmentation, consists in labeling each pixel in an image with the category of the object it belongs to. It is a challenging task that involves the simultaneous detection, segmentation and recognition of all the objects in the image. The scene parsing method proposed here starts by computing a tree of segments from a graph of pixel dissimilarities. Simultaneously, a set of dense feature vectors is computed which encodes regions of multiple sizes centered on each pixel. The feature extractor is a multiscale convolutional network trained from raw pixels. The feature vectors associated with the segments covered by each node in the tree are aggregated and fed to a classifier which produces an estimate of the distribution of object categories contained in the segment. A subset of tree nodes that cover the image are then selected so as to maximize the average "purity" of the class distributions, hence maximizing the overall likelihood that each segment will contain a single object. The convolutional network feature extractor is trained end-to-end from raw pixels, alleviating the need for engineered features. After training, the system is parameter free. The system yields record accuracies on the Stanford Background Dataset (8 classes), the Sift Flow Dataset (33 classes) and the Barcelona Dataset (170 classes) while being an order of magnitude faster than competing approaches, producing a 320 \times 240 image labeling in less than 1 second.

Posted Content
TL;DR: Wang et al. as mentioned in this paper proposed an incremental 3D-DCT algorithm, which decomposes the 3D DCT into successive operations of the 2D discrete cosine transform (2DDCT) and the 1D-dCT (1D-DCT) on the input video data.
Abstract: Visual tracking usually requires an object appearance model that is robust to changing illumination, pose and other factors encountered in video. In this paper, we construct an appearance model using the 3D discrete cosine transform (3D-DCT). The 3D-DCT is based on a set of cosine basis functions, which are determined by the dimensions of the 3D signal and thus independent of the input video data. In addition, the 3D-DCT can generate a compact energy spectrum whose high-frequency coefficients are sparse if the appearance samples are similar. By discarding these high-frequency coefficients, we simultaneously obtain a compact 3D-DCT based object representation and a signal reconstruction-based similarity measure (reflecting the information loss from signal reconstruction). To efficiently update the object representation, we propose an incremental 3D-DCT algorithm, which decomposes the 3D-DCT into successive operations of the 2D discrete cosine transform (2D-DCT) and 1D discrete cosine transform (1D-DCT) on the input video data.

Posted Content
TL;DR: It is found that topological features are useful in the classification of hepatic lesions and two-dimensional persistent homology outperforms one-dimensional persistence homology in this application.
Abstract: In this paper we present a methodology of classifying hepatic (liver) lesions using multidimensional persistent homology, the matching metric (also called the bottleneck distance), and a support vector machine. We present our classification results on a dataset of 132 lesions that have been outlined and annotated by radiologists. We find that topological features are useful in the classification of hepatic lesions. We also find that two-dimensional persistent homology outperforms one-dimensional persistent homology in this application.

Posted Content
TL;DR: An ensemble metric learning approach that consists of sparse block diagonal metric ensembling and join- t metric learning as two consecutive steps that outperform existing state-of-the-art methods in accuracy while retaining high efficiency in face verification and retrieval.
Abstract: Learning Mahanalobis distance metrics in a high- dimensional feature space is very difficult especially when structural sparsity and low rank are enforced to improve com- putational efficiency in testing phase. This paper addresses both aspects by an ensemble metric learning approach that consists of sparse block diagonal metric ensembling and join- t metric learning as two consecutive steps. The former step pursues a highly sparse block diagonal metric by selecting effective feature groups while the latter one further exploits correlations between selected feature groups to obtain an accurate and low rank metric. Our algorithm considers all pairwise or triplet constraints generated from training samples with explicit class labels, and possesses good scala- bility with respect to increasing feature dimensionality and growing data volumes. Its applications to face verification and retrieval outperform existing state-of-the-art methods in accuracy while retaining high efficiency.

Posted Content
TL;DR: It is proved that under some suitable conditions, even with insufficient observations, FRR can still reveal the true subspace memberships, and to achieve robustness to outliers and noise, a sparse regularizer is introduced into the FRR framework.
Abstract: Subspace clustering and feature extraction are two of the most commonly used unsupervised learning techniques in computer vision and pattern recognition. State-of-the-art techniques for subspace clustering make use of recent advances in sparsity and rank minimization. However, existing techniques are computationally expensive and may result in degenerate solutions that degrade clustering performance in the case of insufficient data sampling. To partially solve these problems, and inspired by existing work on matrix factorization, this paper proposes fixed-rank representation (FRR) as a unified framework for unsupervised visual learning. FRR is able to reveal the structure of multiple subspaces in closed-form when the data is noiseless. Furthermore, we prove that under some suitable conditions, even with insufficient observations, FRR can still reveal the true subspace memberships. To achieve robustness to outliers and noise, a sparse regularizer is introduced into the FRR framework. Beyond subspace clustering, FRR can be used for unsupervised feature extraction. As a non-trivial byproduct, a fast numerical solver is developed for FRR. Experimental results on both synthetic data and real applications validate our theoretical analysis and demonstrate the benefits of FRR for unsupervised visual learning.

Posted Content
Xi Li1, Chunhua Shen1, Qinfeng Shi1, Anthony Dick1, Anton van den Hengel1 
TL;DR: Zhang et al. as discussed by the authors proposed a visual tracker based on non-sparse linear representations, which admit an efficient closed-form solution without sacrificing accuracy. And they showed that online metric learning using proximity comparison significantly improves the robustness of the tracking, especially on those sequences exhibiting drastic appearance changes.
Abstract: Most sparse linear representation-based trackers need to solve a computationally expensive L1-regularized optimization problem. To address this problem, we propose a visual tracker based on non-sparse linear representations, which admit an efficient closed-form solution without sacrificing accuracy. Moreover, in order to capture the correlation information between different feature dimensions, we learn a Mahalanobis distance metric in an online fashion and incorporate the learned metric into the optimization problem for obtaining the linear representation. We show that online metric learning using proximity comparison significantly improves the robustness of the tracking, especially on those sequences exhibiting drastic appearance changes. Furthermore, in order to prevent the unbounded growth in the number of training samples for the metric learning, we design a time-weighted reservoir sampling method to maintain and update limited-sized foreground and background sample buffers for balancing sample diversity and adaptability. Experimental results on challenging videos demonstrate the effectiveness and robustness of the proposed tracker.

Posted Content
TL;DR: The work presented here involves the design of a Multi Layer Perceptron (MLP) based classifier for recognition of handwritten Bangla alphabet using a 76 element feature set that has useful application in the development of a complete OCR system for handwritten Bangle text.
Abstract: The work presented here involves the design of a Multi Layer Perceptron (MLP) based classifier for recognition of handwritten Bangla alphabet using a 76 element feature set Bangla is the second most popular script and language in the Indian subcontinent and the fifth most popular language in the world. The feature set developed for representing handwritten characters of Bangla alphabet includes 24 shadow features, 16 centroid features and 36 longest-run features. Recognition performances of the MLP designed to work with this feature set are experimentally observed as 86.46% and 75.05% on the samples of the training and the test sets respectively. The work has useful application in the development of a complete OCR system for handwritten Bangla text.

Journal ArticleDOI
TL;DR: In this article, a frame work was proposed to detect the background in images characterized by poor contrast and image enhancement has been carried out by the two methods based on the Weber's law notion, which employed information from image background analysis by blocks, while the second transformation method utilizes the opening operation, closing operation, which is employed to define the multi-background gray scale images.
Abstract: This paper deals with enhancement of images with poor contrast and detection of background. Proposes a frame work which is used to detect the background in images characterized by poor contrast. Image enhancement has been carried out by the two methods based on the Weber's law notion. The first method employs information from image background analysis by blocks, while the second transformation method utilizes the opening operation, closing operation, which is employed to define the multi-background gray scale images. The complete image processing is done using MATLAB simulation model. Finally, this paper is organized as follows as Morphological transformation and Weber's law. Image background approximation to the background by means of block analysis in conjunction with transformations that enhance images with poor lighting. The multibackground notion is introduced by means of the opening by reconstruction shows a comparison among several techniques to improve contrast in images. Finally, conclusions are presented.

Posted Content
TL;DR: On the very competitive MNIST handwriting benchmark, this method is the first to achieve near-human performance and improves the state-of-the-art on a plethora of common image classification benchmarks.
Abstract: Traditional methods of computer vision and machine learning cannot match human performance on tasks such as the recognition of handwritten digits or traffic signs. Our biologically plausible deep artificial neural network architectures can. Small (often minimal) receptive fields of convolutional winner-take-all neurons yield large network depth, resulting in roughly as many sparsely connected neural layers as found in mammals between retina and visual cortex. Only winner neurons are trained. Several deep neural columns become experts on inputs preprocessed in different ways; their predictions are averaged. Graphics cards allow for fast training. On the very competitive MNIST handwriting benchmark, our method is the first to achieve near-human performance. On a traffic sign recognition benchmark it outperforms humans by a factor of two. We also improve the state-of-the-art on a plethora of common image classification benchmarks.