
Showing papers in "IEEE Transactions on Image Processing in 2016"


Journal ArticleDOI
TL;DR: DehazeNet as discussed by the authors adopts a convolutional neural network-based deep architecture, whose layers are specially designed to embody the established assumptions/priors in image dehazing.
Abstract: Single image haze removal is a challenging ill-posed problem. Existing methods use various constraints/priors to obtain plausible dehazing solutions. The key to haze removal is estimating a medium transmission map for an input hazy image. In this paper, we propose a trainable end-to-end system called DehazeNet for medium transmission estimation. DehazeNet takes a hazy image as input and outputs its medium transmission map, which is subsequently used to recover a haze-free image via the atmospheric scattering model. DehazeNet adopts a convolutional neural network-based deep architecture, whose layers are specially designed to embody the established assumptions/priors in image dehazing. Specifically, layers of Maxout units are used for feature extraction, which can generate almost all haze-relevant features. We also propose a novel nonlinear activation function in DehazeNet, called the bilateral rectified linear unit, which improves the quality of the recovered haze-free image. We establish connections between the components of the proposed DehazeNet and those used in existing methods. Experiments on benchmark images show that DehazeNet achieves superior performance over existing methods while remaining efficient and easy to use.
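
The recovery step downstream of DehazeNet is the standard atmospheric scattering model, I(x) = J(x)t(x) + A(1 - t(x)). Below is a minimal NumPy sketch of that inversion, assuming the transmission map and the global atmospheric light have already been estimated; the function name, the lower bound t_min, and the clipping are illustrative choices, not details from the paper:

```python
import numpy as np

def recover_scene_radiance(hazy, transmission, airlight, t_min=0.1):
    """Invert the atmospheric scattering model I = J*t + A*(1 - t).

    hazy:         H x W x 3 float image in [0, 1]
    transmission: H x W medium transmission map (e.g., a DehazeNet output)
    airlight:     length-3 global atmospheric light estimate
    t_min:        lower bound on t, a common guard against noise amplification
    """
    t = np.clip(transmission, t_min, 1.0)[..., None]  # broadcast over channels
    scene = (hazy - airlight) / t + airlight
    return np.clip(scene, 0.0, 1.0)
```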

1,880 citations


Journal ArticleDOI
TL;DR: This paper proposes a multi-task deep saliency model based on a fully convolutional neural network with global input (whole raw images) and global output (whole saliency maps), and presents a graph Laplacian regularized nonlinear regression model for saliency refinement.
Abstract: A key problem in salient object detection is how to effectively model the semantic properties of salient objects in a data-driven manner. In this paper, we propose a multi-task deep saliency model based on a fully convolutional neural network with global input (whole raw images) and global output (whole saliency maps). In principle, the proposed saliency model takes a data-driven strategy for encoding the underlying saliency prior information, and then sets up a multi-task learning scheme for exploring the intrinsic correlations between saliency detection and semantic image segmentation. Through collaborative feature learning from these two correlated tasks, the shared fully convolutional layers produce effective features for object perception. Moreover, the model is capable of capturing the semantic information on salient objects across different levels using the fully convolutional layers, which investigate the feature-sharing properties of salient object detection with a great reduction of feature redundancy. Finally, we present a graph Laplacian regularized nonlinear regression model for saliency refinement. Experimental results demonstrate the effectiveness of our approach in comparison with the state-of-the-art approaches.
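
The refinement stage above is a graph Laplacian regularized regression. As a rough sketch of that family (a linear, simplified variant, not the authors' exact nonlinear formulation), coarse region-level saliency scores can be smoothed along a region affinity graph by solving one linear system:

```python
import numpy as np

def laplacian_refine(initial_saliency, affinity, lam=0.5):
    """Graph-Laplacian-regularized refinement (linear variant):
    minimize ||f - y||^2 + lam * f^T L f, i.e., solve (I + lam*L) f = y.

    initial_saliency: length-n coarse saliency scores y (one per region)
    affinity:         n x n symmetric affinity matrix W between regions
    """
    W = np.asarray(affinity, dtype=float)
    L = np.diag(W.sum(axis=1)) - W  # unnormalized graph Laplacian
    n = L.shape[0]
    return np.linalg.solve(np.eye(n) + lam * L, initial_saliency)
```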

497 citations


Journal ArticleDOI
Chongyi Li1, Jichang Guo1, Runmin Cong1, Yanwei Pang1, Bo Wang1 
TL;DR: Extensive experiments demonstrate that the proposed method achieves better visual quality, more valuable information, and more accurate color restoration than several state-of-the-art methods, even for underwater images taken in challenging scenes.
Abstract: Images captured under water are usually degraded due to the effects of absorption and scattering. Degraded underwater images show some limitations when they are used for display and analysis. For example, underwater images with low contrast and color cast decrease the accuracy rate of underwater object detection and marine biology recognition. To overcome those limitations, a systematic underwater image enhancement method, which includes an underwater image dehazing algorithm and a contrast enhancement algorithm, is proposed. Built on a minimum information loss principle, an effective underwater image dehazing algorithm is proposed to restore the visibility, color, and natural appearance of underwater images. A simple yet effective contrast enhancement algorithm is proposed based on a kind of histogram distribution prior, which increases the contrast and brightness of underwater images. The proposed method can yield two versions of enhanced output. One version, with relatively genuine color and natural appearance, is suitable for display. The other version, with high contrast and brightness, can be used for extracting more valuable information and unveiling more details. Simulation experiments, qualitative and quantitative comparisons, and color accuracy and application tests are conducted to evaluate the performance of the proposed method. Extensive experiments demonstrate that the proposed method achieves better visual quality, more valuable information, and more accurate color restoration than several state-of-the-art methods, even for underwater images taken in challenging scenes.

459 citations


Journal ArticleDOI
TL;DR: The LIVE In the Wild Image Quality Challenge Database is designed and created, which contains widely diverse authentic image distortions on a large number of images captured using a representative variety of modern mobile devices, and a new online crowdsourcing system is implemented, which is used to conduct a very large-scale, multi-month image quality assessment (IQA) subjective study.
Abstract: Most publicly available image quality databases have been created under highly controlled conditions by introducing graded simulated distortions onto high-quality photographs. However, images captured using typical real-world mobile camera devices are usually afflicted by complex mixtures of multiple distortions, which are not necessarily well-modeled by the synthetic distortions found in existing databases. The originators of existing legacy databases usually conducted human psychometric studies to obtain statistically meaningful sets of human opinion scores on images in a stringently controlled visual environment, resulting in small data collections relative to other kinds of image analysis databases. Toward overcoming these limitations, we designed and created a new database that we call the LIVE In the Wild Image Quality Challenge Database, which contains widely diverse authentic image distortions on a large number of images captured using a representative variety of modern mobile devices. We also designed and implemented a new online crowdsourcing system, which we have used to conduct a very large-scale, multi-month image quality assessment (IQA) subjective study. Our database consists of over 350 000 opinion scores on 1162 images evaluated by over 8100 unique human observers. Despite the lack of control over the experimental environments of the numerous study participants, we demonstrate excellent internal consistency of the subjective data set. We also evaluate several top-performing blind IQA algorithms on it and present insights on how the mixtures of distortions challenge both end users as well as automatic perceptual quality prediction models. The new database is available for public use at http://live.ece.utexas.edu/research/ChallengeDB/index.html .

458 citations


Journal ArticleDOI
TL;DR: A new hyperspectral image super-resolution method from a low-resolution (LR) image and an HR reference image of the same scene to improve the accuracy of non-negative sparse coding and to exploit the spatial correlation among the learned sparse codes.
Abstract: Hyperspectral imaging has many applications from agriculture and astronomy to surveillance and mineralogy. However, it is often challenging to obtain high-resolution (HR) hyperspectral images using existing hyperspectral imaging techniques due to various hardware limitations. In this paper, we propose a new hyperspectral image super-resolution method from a low-resolution (LR) image and an HR reference image of the same scene. The estimation of the HR hyperspectral image is formulated as a joint estimation of the hyperspectral dictionary and the sparse codes based on the prior knowledge of the spatial-spectral sparsity of the hyperspectral image. The hyperspectral dictionary representing prototype reflectance spectra vectors of the scene is first learned from the input LR image. Specifically, an efficient non-negative dictionary learning algorithm using the block-coordinate descent optimization technique is proposed. Then, the sparse codes of the desired HR hyperspectral image with respect to the learned hyperspectral dictionary are estimated from the pair of LR and HR reference images. To improve the accuracy of non-negative sparse coding, a clustering-based structured sparse coding method is proposed to exploit the spatial correlation among the learned sparse codes. The experimental results on both public datasets and real LR hyperspectral images suggest that the proposed method substantially outperforms several existing HR hyperspectral image recovery techniques in the literature in terms of both objective quality metrics and computational efficiency.
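
For orientation, a generic non-negative sparse coding objective of the kind this joint estimation builds on can be written as follows (conventional symbols, not necessarily the paper's notation: Y the observed pixels, D the dictionary of reflectance spectra, S the sparse codes):

```latex
\min_{D \ge 0,\; S \ge 0} \;\; \|Y - DS\|_F^2 + \lambda \|S\|_1
```

The clustering-based structured variant described above additionally couples the codes of spatially correlated pixels rather than penalizing each column of S independently.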

404 citations


Journal ArticleDOI
TL;DR: A novel general purpose BIQA method based on high order statistics aggregation (HOSA), requiring only a small codebook, which has been extensively evaluated on ten image databases with both simulated and realistic image distortions, and shows performance highly competitive with the state-of-the-art BIQA methods.
Abstract: Blind image quality assessment (BIQA) research aims to develop a perceptual model to evaluate the quality of distorted images automatically and accurately without access to the non-distorted reference images. The state-of-the-art general purpose BIQA methods can be classified into two categories according to the types of features used. The first includes handcrafted features which rely on the statistical regularities of natural images. These, however, are not suitable for images containing text and artificial graphics. The second includes learning-based features which invariably require a large codebook or supervised codebook updating procedures to obtain satisfactory performance. These are time-consuming and not applicable in practice. In this paper, we propose a novel general purpose BIQA method based on high order statistics aggregation (HOSA), requiring only a small codebook. HOSA consists of three steps. First, local normalized image patches are extracted as local features through a regular grid, and a codebook containing 100 codewords is constructed by K-means clustering. In addition to the mean of each cluster, the diagonal covariance and coskewness (i.e., dimension-wise variance and skewness) of clusters are also calculated. Second, each local feature is softly assigned to several nearest clusters, and the differences of high order statistics (mean, variance, and skewness) between local features and corresponding clusters are softly aggregated to build the global quality-aware image representation. Finally, support vector regression is adopted to learn the mapping between perceptual features and subjective opinion scores. The proposed method has been extensively evaluated on ten image databases with both simulated and realistic image distortions, and shows performance highly competitive with the state-of-the-art BIQA methods.
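
The three HOSA steps map naturally onto standard tools. The sketch below is a simplified first-order variant under stated assumptions: it aggregates only soft-weighted mean residuals (the full method also aggregates variance and skewness differences), and the soft-assignment weighting is an illustrative choice:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVR

def build_codebook(local_features, k=100, seed=0):
    """Step 1: cluster local patch features into a small codebook (k=100 in HOSA)."""
    return KMeans(n_clusters=k, random_state=seed, n_init=10).fit(local_features)

def hosa_like_encoding(features, kmeans, r=5, beta=1.0):
    """Step 2 (simplified): softly assign each local feature to its r nearest
    clusters and accumulate weighted residuals to the cluster means."""
    centers = kmeans.cluster_centers_
    enc = np.zeros_like(centers)
    for x in features:
        dist = np.linalg.norm(centers - x, axis=1)
        nearest = np.argsort(dist)[:r]
        w = np.exp(-beta * dist[nearest])
        w /= w.sum()
        for j, wj in zip(nearest, w):
            enc[j] += wj * (x - centers[j])
    return enc.ravel()

# Step 3: regress encodings onto subjective scores with support vector
# regression (train_encodings and train_mos are assumed precomputed):
# model = SVR(kernel="rbf", C=10.0).fit(train_encodings, train_mos)
```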

371 citations


Journal ArticleDOI
TL;DR: It is shown that, even without offline training with a large amount of auxiliary data, simple two-layer convolutional networks can be powerful enough to learn robust representations for visual tracking.
Abstract: Deep networks have been successfully applied to visual tracking by learning a generic representation offline from numerous training images. However, the offline training is time-consuming, and the learned generic representation may be less discriminative for tracking specific objects. In this paper, we show that, even without offline training with a large amount of auxiliary data, simple two-layer convolutional networks can be powerful enough to learn robust representations for visual tracking. In the first frame, we extract a set of normalized patches from the target region as fixed filters, which integrate a series of adaptive contextual filters surrounding the target to define a set of feature maps in the subsequent frames. These maps measure similarities between each filter and useful local intensity patterns across the target, thereby encoding its local structural information. Furthermore, all the maps together form a global representation, via which the inner geometric layout of the target is also preserved. A simple soft shrinkage method that suppresses noisy values below an adaptive threshold is employed to de-noise the global representation. Our convolutional networks have a lightweight structure and perform favorably against several state-of-the-art methods on the recent tracking benchmark data set with 50 challenging videos.
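
The de-noising step mentioned above is plain soft shrinkage. A minimal sketch, with the adaptive threshold chosen here as a magnitude percentile (an illustrative rule, not the paper's exact one):

```python
import numpy as np

def soft_shrink(representation, percentile=50):
    """Soft shrinkage: suppress entries whose magnitude falls below an
    adaptive threshold tau, shrinking the rest toward zero."""
    tau = np.percentile(np.abs(representation), percentile)
    return np.sign(representation) * np.maximum(np.abs(representation) - tau, 0.0)
```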

353 citations


Journal ArticleDOI
TL;DR: New, efficient algorithms that substantially improve on the performance of other recent methods of sparse representation are presented, contributing to the development of this type of representation as a practical tool for a wider range of problems.
Abstract: When applying sparse representation techniques to images, the standard approach is to independently compute the representations for a set of overlapping image patches. This method performs very well in a variety of applications, but results in a representation that is multi-valued and not optimized with respect to the entire image. An alternative representation structure is provided by a convolutional sparse representation, in which a sparse representation of an entire image is computed by replacing the linear combination of a set of dictionary vectors by the sum of a set of convolutions with dictionary filters. The resulting representation is both single-valued and jointly optimized over the entire image. While this form of a sparse representation has been applied to a variety of problems in signal and image processing and computer vision, the computational expense of the corresponding optimization problems has restricted application to relatively small signals and images. This paper presents new, efficient algorithms that substantially improve on the performance of other recent methods, contributing to the development of this type of representation as a practical tool for a wider range of problems.
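
For reference, the convolutional basis pursuit denoising objective that underlies this line of work is standard in the literature: dictionary filters d_m replace dictionary vectors, and coefficient maps x_m replace coefficient vectors, with s the whole image:

```latex
\min_{\{x_m\}} \;\; \frac{1}{2} \Big\| \sum_m d_m * x_m - s \Big\|_2^2
  \;+\; \lambda \sum_m \| x_m \|_1
```

The cost of solving this problem at image scale is precisely what the paper's algorithms target.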

331 citations


Journal ArticleDOI
TL;DR: This paper discovers that a high-quality visual saliency model can be learned from multiscale features extracted using deep convolutional neural networks (CNNs), which have had many successes in visual recognition tasks.
Abstract: Visual saliency is a fundamental problem in both cognitive and computational sciences, including computer vision. In this paper, we discover that a high-quality visual saliency model can be learned from multiscale features extracted using deep convolutional neural networks (CNNs), which have had many successes in visual recognition tasks. For learning such saliency models, we introduce a neural network architecture, which has fully connected layers on top of CNNs responsible for feature extraction at three different scales. The penultimate layer of our neural network has been confirmed to be a discriminative high-level feature vector for saliency detection, which we call deep contrast feature. To generate a more robust feature, we integrate handcrafted low-level features with our deep contrast feature. To promote further research and evaluation of visual saliency models, we also construct a new large database of 4447 challenging images and their pixelwise saliency annotations. Experimental results demonstrate that our proposed method is capable of achieving the state-of-the-art performance on all public benchmarks, improving the F-measure by 6.12% and 10%, respectively, on the DUT-OMRON data set and our new data set (HKU-IS), and lowering the mean absolute error by 9% and 35.3%, respectively, on these two data sets.

325 citations


Journal ArticleDOI
TL;DR: This approach is robust under various types of distortions, such as deformation, noise, outliers, rotation, and occlusion, and greatly outperforms the state-of-the-art methods, especially when the data is badly degraded.
Abstract: In previous work on point registration, the input point sets are often represented using Gaussian mixture models and the registration is then addressed through a probabilistic approach, which aims to exploit global relationships on the point sets. For non-rigid shapes, however, the local structures among neighboring points are also strong and stable and thus helpful in recovering the point correspondence. In this paper, we formulate point registration as the estimation of a mixture of densities, where local features, such as shape context, are used to assign the membership probabilities of the mixture model. This enables us to preserve both global and local structures during matching. The transformation between the two point sets is specified in a reproducing kernel Hilbert space and a sparse approximation is adopted to achieve a fast implementation. Extensive experiments on both synthesized and real data show the robustness of our approach under various types of distortions, such as deformation, noise, outliers, rotation, and occlusion. It greatly outperforms the state-of-the-art methods, especially when the data is badly degraded.

311 citations


Journal ArticleDOI
TL;DR: A new system for scene text detection by proposing a novel text-attentional convolutional neural network (Text-CNN) that particularly focuses on extracting text-related regions and features from the image components and a powerful low-level detector called contrast-enhancement maximally stable extremal regions (MSERs) is developed.
Abstract: Recent deep learning models have demonstrated strong capabilities for classifying text and non-text components in natural images. They extract a high-level feature globally computed from a whole image component (patch), where the cluttered background information may dominate true text features in the deep representation. This leads to less discriminative power and poorer robustness. In this paper, we present a new system for scene text detection by proposing a novel text-attentional convolutional neural network (Text-CNN) that particularly focuses on extracting text-related regions and features from the image components. We develop a new learning mechanism to train the Text-CNN with multi-level and rich supervised information, including text region mask, character label, and binary text/non-text information. The rich supervision information enables the Text-CNN with a strong capability for discriminating ambiguous texts, and also increases its robustness against complicated background components. The training process is formulated as a multi-task learning problem, where low-level supervised information greatly facilitates the main task of text/non-text classification. In addition, a powerful low-level detector called contrast-enhancement maximally stable extremal regions (MSERs) is developed, which extends the widely used MSERs by enhancing intensity contrast between text patterns and background. This allows it to detect highly challenging text patterns, resulting in a higher recall. Our approach achieved promising results on the ICDAR 2013 data set, with an F-measure of 0.82, substantially improving the state-of-the-art results.

Journal ArticleDOI
TL;DR: This work develops a new VQA model called the video intrinsic integrity and distortion evaluation oracle (VIIDEO), which is able to predict the quality of distorted videos without any external knowledge about the pristine source, anticipated distortions, or human judgments of video quality.
Abstract: Considerable progress has been made toward developing still picture perceptual quality analyzers that do not require any reference picture and that are not trained on human opinion scores of distorted images. However, there do not yet exist any such completely blind video quality assessment (VQA) models. Here, we attempt to bridge this gap by developing a new VQA model called the video intrinsic integrity and distortion evaluation oracle (VIIDEO). The new model does not require the use of any additional information other than the video being quality evaluated. VIIDEO embodies models of intrinsic statistical regularities that are observed in natural videos, which are used to quantify disturbances introduced due to distortions. An algorithm derived from the VIIDEO model is thereby able to predict the quality of distorted videos without any external knowledge about the pristine source, anticipated distortions, or human judgments of video quality. Even with such a paucity of information, we are able to show that the VIIDEO algorithm performs much better than the legacy full reference quality measure MSE on the LIVE VQA database and delivers performance comparable with a leading human judgment trained blind VQA model. We believe that the VIIDEO algorithm is a significant step toward making real-time monitoring of completely blind video quality possible. The software release of VIIDEO can be obtained online ( http://live.ece.utexas.edu/research/quality/VIIDEO_release.zip ).

Journal ArticleDOI
TL;DR: A comprehensive evaluation on benchmark data sets reveals MRELBP’s high performance—robust to gray scale variations, rotation changes and noise—but at a low computational cost.
Abstract: Local binary patterns (LBP) are considered among the most computationally efficient high-performance texture features. However, the LBP method is very sensitive to image noise and is unable to capture macrostructure information. To best address these disadvantages, in this paper, we introduce a novel descriptor for texture classification, the median robust extended LBP (MRELBP). Different from the traditional LBP and many LBP variants, MRELBP compares regional image medians rather than raw image intensities. A multiscale LBP type descriptor is computed by efficiently comparing image medians over a novel sampling scheme, which can capture both microstructure and macrostructure texture information. A comprehensive evaluation on benchmark data sets reveals MRELBP’s high performance—robust to gray scale variations, rotation changes and noise—but at a low computational cost. MRELBP produces the best classification scores of 99.82%, 99.38%, and 99.77% on three popular Outex test suites. More importantly, MRELBP is shown to be highly robust to image noise, including Gaussian noise, Gaussian blur, salt-and-pepper noise, and random pixel corruption.
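
The core change MRELBP makes to LBP is thresholding regional medians instead of raw intensities. The toy sketch below shows a single-scale, 8-neighbor median-based LBP code; the paper's multiscale sampling scheme is considerably richer, so treat the geometry here as illustrative:

```python
import numpy as np
from scipy.ndimage import median_filter

def median_lbp(image, radius=2, size=3):
    """Toy single-scale LBP computed on regional medians: each of 8
    neighbors at the given radius is compared against the center, and
    the comparison bits are packed into one code per pixel."""
    med = median_filter(image.astype(float), size=size)
    h, w = med.shape
    r = radius
    center = med[r:h - r, r:w - r]
    code = np.zeros_like(center, dtype=np.uint8)
    offsets = [(-r, -r), (-r, 0), (-r, r), (0, r),
               (r, r), (r, 0), (r, -r), (0, -r)]
    for bit, (dy, dx) in enumerate(offsets):
        neighbor = med[r + dy:h - r + dy, r + dx:w - r + dx]
        code |= (neighbor >= center).astype(np.uint8) << np.uint8(bit)
    return code
```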

Journal ArticleDOI
TL;DR: It is demonstrated that a sparse coding model particularly designed for SR can be incarnated as a neural network with the merit of end-to-end optimization over training data and significantly outperforms the existing state-of-the-art methods for various scaling factors both quantitatively and perceptually.
Abstract: Single image super-resolution (SR) is an ill-posed problem, which tries to recover a high-resolution image from its low-resolution observation. To regularize the solution of the problem, previous methods have focused on designing good priors for natural images, such as sparse representation, or directly learning the priors from a large data set with models, such as deep neural networks. In this paper, we argue that domain expertise from the conventional sparse coding model can be combined with the key ingredients of deep learning to achieve further improved results. We demonstrate that a sparse coding model particularly designed for SR can be incarnated as a neural network with the merit of end-to-end optimization over training data. The network has a cascaded structure, which boosts the SR performance for both fixed and incremental scaling factors. The proposed training and testing schemes can be extended for robust handling of images with additional degradation, such as noise and blurring. A subjective assessment is conducted and analyzed in order to thoroughly evaluate various SR techniques. Our proposed model is tested on a wide range of images, and it significantly outperforms the existing state-of-the-art methods for various scaling factors both quantitatively and perceptually.
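
The "incarnation" of sparse coding as a network follows the learned ISTA (LISTA) idea: each network layer is one unrolled iteration of soft-thresholded sparse inference. A minimal sketch with illustrative names and layer count (the paper's cascaded SR architecture adds more around this core):

```python
import numpy as np

def soft_threshold(v, theta):
    """Elementwise shrinkage, the nonlinearity of sparse inference."""
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def lista_forward(y, W, S, theta, n_layers=3):
    """LISTA-style forward pass: z <- h_theta(W y + S z), repeated.
    W (m x n), S (m x m), and theta are learnable in the network view,
    which is what enables end-to-end optimization over training data."""
    z = soft_threshold(W @ y, theta)
    for _ in range(n_layers - 1):
        z = soft_threshold(W @ y + S @ z, theta)
    return z
```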

Journal ArticleDOI
TL;DR: The experimental results demonstrate that the real-time superpixel algorithm by the DBSCAN clustering outperforms the state-of-the-art superpixel segmentation methods in terms of both accuracy and efficiency.
Abstract: In this paper, we propose a real-time image superpixel segmentation method running at 50 frames/s, using the density-based spatial clustering of applications with noise (DBSCAN) algorithm. In order to decrease the computational costs of superpixel algorithms, we adopt a fast two-step framework. In the first clustering stage, the DBSCAN algorithm with color-similarity and geometric restrictions is used to rapidly cluster the pixels; then, small clusters are merged into superpixels by their neighborhood through a distance measurement defined by color and spatial features in the second merging stage. A robust and simple distance function is defined for obtaining better superpixels in these two steps. The experimental results demonstrate that our real-time superpixel algorithm (50 frames/s) by DBSCAN clustering outperforms the state-of-the-art superpixel segmentation methods in terms of both accuracy and efficiency.
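
As a rough illustration of the first (clustering) stage, pixels can be clustered with an off-the-shelf DBSCAN on stacked color and scaled spatial coordinates. This sketch is illustrative and nowhere near real time; the paper's method achieves its speed with geometric restrictions and the second merging stage:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def dbscan_superpixels(lab_image, spatial_weight=0.5, eps=5.0, min_samples=4):
    """Cluster pixels by joint color + spatial proximity (illustrative only)."""
    h, w, _ = lab_image.shape
    ys, xs = np.mgrid[0:h, 0:w]
    feats = np.column_stack([
        lab_image.reshape(-1, 3).astype(float),
        spatial_weight * xs.ravel(),
        spatial_weight * ys.ravel(),
    ])
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(feats)
    return labels.reshape(h, w)
```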

Journal ArticleDOI
TL;DR: This paper addresses the problem of unsupervised domain transfer learning, in which no labels are available in the target domain, solving it by the inexact augmented Lagrange multiplier method; the approach avoids potentially negative transfer by using a sparse matrix to model the noise and, thus, is more robust to different types of noise.
Abstract: In this paper, we address the problem of unsupervised domain transfer learning in which no labels are available in the target domain. We use a transformation matrix to transfer both the source and target data to a common subspace, where each target sample can be represented by a combination of source samples such that the samples from different domains can be well interlaced. In this way, the discrepancy of the source and target domains is reduced. By imposing joint low-rank and sparse constraints on the reconstruction coefficient matrix, the global and local structures of data can be preserved. To enlarge the margins between different classes as much as possible and provide more freedom to diminish the discrepancy, a flexible linear classifier (projection) is obtained by learning a non-negative label relaxation matrix that allows the strict binary label matrix to relax into a slack variable matrix. Our method can avoid a potentially negative transfer by using a sparse matrix to model the noise and, thus, is more robust to different types of noise. We formulate our problem as a constrained low-rankness and sparsity minimization problem and solve it by the inexact augmented Lagrange multiplier method. Extensive experiments on various visual domain adaptation tasks show the superiority of the proposed method over the state-of-the-art methods. The MATLAB code of our method will be publicly available at http://www.yongxu.org/lunwen.html .
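
A generic form of the joint low-rank and sparse reconstruction described above (conventional symbols, not necessarily the paper's exact formulation: P the learned transformation, X_s and X_t the source and target data, Z the reconstruction coefficient matrix, E the sparse noise term):

```latex
\min_{Z,\,E} \;\; \|Z\|_* + \lambda \|Z\|_1 + \gamma \|E\|_1
\quad \text{s.t.} \quad P X_t = P X_s Z + E
```

The nuclear norm term preserves global structure, the $\ell_1$ term on Z preserves local structure, and E absorbs noise so that it does not propagate into the transfer.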

Journal ArticleDOI
TL;DR: In this paper, a hierarchical sub-band transform is proposed for intra-frame color encoding of point clouds for real-time 3D video. The results show that the proposed solution performs comparably with the current state of the art, often outperforming it, while being much more computationally efficient.
Abstract: In free-viewpoint video, there is a recent trend to represent scene objects as solids rather than using multiple depth maps. Point clouds have been used in computer graphics for a long time, and with the recent possibility of real-time capturing and rendering, point clouds have been favored over meshes in order to save computation. Each point in the cloud is associated with its 3D position and its color. We devise a method to compress the colors in point clouds, which is based on a hierarchical transform and arithmetic coding. The transform is a hierarchical sub-band transform that resembles an adaptive variation of a Haar wavelet. The arithmetic encoding of the coefficients assumes Laplace distributions, one per sub-band. The Laplace parameter for each distribution is transmitted to the decoder using a custom method. The geometry of the point cloud is encoded using the well-established octree scanning. Results show that the proposed solution performs comparably with the current state of the art, often outperforming it, while being much more computationally efficient. We believe this paper represents the state of the art in intra-frame compression of point clouds for real-time 3D video.
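
The transform can be understood one butterfly at a time: two sibling coefficients with occupancy weights are merged into a low-pass coefficient carrying the combined weight and a high-pass coefficient that is entropy-coded. A sketch in the spirit of that weight-adaptive Haar-like step (not the paper's exact implementation):

```python
import numpy as np

def weighted_haar_pair(c1, w1, c2, w2):
    """One level of a weight-adaptive Haar-like butterfly: the rotation
    depends on the weights (e.g., point counts under each octree node),
    and the transform stays orthonormal for any positive w1, w2."""
    a, b = np.sqrt(w1), np.sqrt(w2)
    n = np.sqrt(w1 + w2)
    low = (a * c1 + b * c2) / n      # passed up the hierarchy with weight w1+w2
    high = (-b * c1 + a * c2) / n    # quantized and arithmetic-coded
    return low, high, w1 + w2
```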

Journal ArticleDOI
TL;DR: A well-organized propagation process leveraging multiple teachers and one learner enables the multi-modal curriculum learning (MMCL) strategy to outperform five state-of-the-art methods on eight popular image data sets.
Abstract: Semi-supervised image classification aims to classify a large quantity of unlabeled images by typically harnessing scarce labeled images. Existing semi-supervised methods often suffer from inadequate classification accuracy when encountering difficult yet critical images, such as outliers, because they treat all unlabeled images equally and conduct classifications in an imperfectly ordered sequence. In this paper, we employ the curriculum learning methodology by investigating the difficulty of classifying every unlabeled image. The reliability and the discriminability of these unlabeled images are particularly investigated for evaluating their difficulty. As a result, an optimized image sequence is generated during the iterative propagations, and the unlabeled images are logically classified from simple to difficult. Furthermore, since images are usually characterized by multiple visual feature descriptors, we associate each kind of features with a teacher, and design a multi-modal curriculum learning (MMCL) strategy to integrate the information from different feature modalities. In each propagation, each teacher analyzes the difficulties of the currently unlabeled images from its own modality viewpoint. A consensus is subsequently reached among all the teachers, determining the currently simplest images (i.e., a curriculum), which are to be reliably classified by the multi-modal learner. This well-organized propagation process leveraging multiple teachers and one learner enables our MMCL to outperform five state-of-the-art methods on eight popular image data sets.

Journal ArticleDOI
TL;DR: This paper proposes to use a family of nonconvex surrogates of the $L_0$-norm on the singular values of a matrix to approximate the rank function, and proves that IRNN decreases the objective function value monotonically, and that any limit point is a stationary point.
Abstract: The nuclear norm is widely used as a convex surrogate of the rank function in compressive sensing for low rank matrix recovery with its applications in image recovery and signal processing. However, solving the nuclear norm-based relaxed convex problem usually leads to a suboptimal solution of the original rank minimization problem. In this paper, we propose to use a family of nonconvex surrogates of the $L_0$-norm on the singular values of a matrix to approximate the rank function. This leads to a nonconvex nonsmooth minimization problem. Then, we propose to solve the problem by an iteratively reweighted nuclear norm (IRNN) algorithm. IRNN iteratively solves a weighted singular value thresholding problem, which has a closed-form solution due to the special properties of the nonconvex surrogate functions. We also extend IRNN to solve the nonconvex problem with two or more blocks of variables. In theory, we prove that IRNN decreases the objective function value monotonically, and that any limit point is a stationary point. Extensive experiments on both synthesized data and real images demonstrate that IRNN enhances low rank matrix recovery compared with the state-of-the-art convex algorithms.
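
The inner step of IRNN is a weighted singular value thresholding, which has a closed-form solution; a minimal sketch (the surrounding loop that reweights using the surrogate's gradient is omitted):

```python
import numpy as np

def weighted_svt(M, weights):
    """Weighted singular value thresholding: shrink each singular value
    by its own weight. This is closed-form when the weights are
    non-descending with respect to the (descending) singular values."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    s_shrunk = np.maximum(s - weights, 0.0)
    return (U * s_shrunk) @ Vt
```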

Journal ArticleDOI
TL;DR: The proposed weighted Schatten p-norm minimization (WSNM) not only gives a better approximation to the original low-rank assumption but also considers the importance of different rank components, and can more effectively remove noise and model complex and dynamic scenes compared with state-of-the-art methods.
Abstract: Low rank matrix approximation (LRMA), which aims to recover the underlying low rank matrix from its degraded observation, has a wide range of applications in computer vision. The latest LRMA methods resort to using the nuclear norm minimization (NNM) as a convex relaxation of the nonconvex rank minimization. However, NNM tends to over-shrink the rank components and treats the different rank components equally, limiting its flexibility in practical applications. We propose a more flexible model, namely, the weighted Schatten $p$-norm minimization (WSNM), to generalize the NNM to the Schatten $p$-norm minimization with weights assigned to different singular values. The proposed WSNM not only gives a better approximation to the original low-rank assumption, but also considers the importance of different rank components. We analyze the solution of WSNM and prove that, under certain weights permutation, WSNM can be equivalently transformed into independent non-convex $l_p$-norm subproblems, whose global optimum can be efficiently solved by the generalized iterated shrinkage algorithm. We apply WSNM to typical low-level vision problems, e.g., image denoising and background subtraction. Extensive experimental results show, both qualitatively and quantitatively, that the proposed WSNM can more effectively remove noise and model complex and dynamic scenes compared with state-of-the-art methods.
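
In symbols, with σ_i(X) the singular values, the weighted Schatten p-norm and the basic WSNM problem read:

```latex
\|X\|_{w,S_p}^p = \sum_i w_i\,\sigma_i(X)^p,
\qquad
\hat{X} = \arg\min_X \; \|Y - X\|_F^2 + \|X\|_{w,S_p}^p
```

Setting p = 1 with uniform weights recovers standard nuclear norm minimization, which is why WSNM strictly generalizes NNM.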

Journal ArticleDOI
TL;DR: This scheme alleviates another shortcoming of patch-based restoration algorithms: the fact that a local (patch-based) prior serves as a model for a global stochastic phenomenon.
Abstract: Many image restoration algorithms in recent years are based on patch processing. The core idea is to decompose the target image into fully overlapping patches, restore each of them separately, and then merge the results by plain averaging. This concept has been demonstrated to be highly effective, often leading to state-of-the-art results in denoising, inpainting, deblurring, segmentation, and other applications. While the above is indeed effective, this approach has one major flaw: the prior is imposed on intermediate (patch) results rather than on the final outcome, and this is typically manifested by visual artifacts. The expected patch log likelihood (EPLL) method by Zoran and Weiss was conceived for addressing this very problem. Their algorithm imposes the prior on the patches of the final image, which in turn leads to an iterative restoration of diminishing effect. In this paper, we propose to further extend and improve the EPLL by considering a multi-scale prior. Our algorithm imposes the very same prior on patches at different scales extracted from the target image. While all the treated patches are of the same size, their footprint in the destination image varies due to subsampling. Our scheme alleviates another shortcoming of patch-based restoration algorithms: the fact that a local (patch-based) prior serves as a model for a global stochastic phenomenon. We motivate the use of the multi-scale EPLL by restricting ourselves to the simple Gaussian case, comparing the aforementioned algorithms and showing a clear advantage to the proposed method. We then demonstrate our algorithm in the context of image denoising, deblurring, and super-resolution, showing an improvement in performance both visually and quantitatively.
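
For context, the standard EPLL objective being extended here is (A the degradation operator, P_i the operator extracting the i-th patch, p a learned patch prior such as a GMM):

```latex
\min_x \;\; \frac{\lambda}{2}\,\|Ax - y\|_2^2 \;-\; \sum_i \log p\left(P_i x\right)
```

The multi-scale variant proposed in this paper adds the same prior term over patches extracted after subsampling, so that each treated patch has a larger footprint in the destination image.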

Journal ArticleDOI
TL;DR: This paper presents an efficient and very robust tracking algorithm using a single convolutional neural network for learning effective feature representations of the target object in a purely online manner and introduces a novel truncated structural loss function that maintains as many training samples as possible and reduces the risk of tracking error accumulation.
Abstract: Deep neural networks, despite their great success on feature learning in various computer vision tasks, are usually considered impractical for online visual tracking, because they require very long training time and a large number of training samples. In this paper, we present an efficient and very robust tracking algorithm using a single convolutional neural network (CNN) for learning effective feature representations of the target object in a purely online manner. Our contributions are multifold. First, we introduce a novel truncated structural loss function that maintains as many training samples as possible and reduces the risk of tracking error accumulation. Second, we enhance the ordinary stochastic gradient descent approach in CNN training with a robust sample selection mechanism. The sampling mechanism randomly generates positive and negative samples from different temporal distributions, which are generated by taking the temporal relations and label noise into account. Finally, a lazy yet effective updating scheme is designed for CNN training. Equipped with this novel updating algorithm, the CNN model is robust to some long-existing difficulties in visual tracking, such as occlusion or incorrect detections, without losing effective adaptation to significant appearance changes. In the experiment, our CNN tracker outperforms all compared state-of-the-art methods on two recently proposed benchmarks, which in total involve over 60 video sequences. The remarkable performance improvement over the existing trackers illustrates the superiority of the feature representations, which are learned purely online via the proposed deep learning framework.

Journal ArticleDOI
TL;DR: An adaptive fusion scheme is proposed based on collaborative sparse representation in Bayesian filtering framework and jointly optimize sparse codes and the reliable weights of different modalities in an online way to perform robust object tracking in challenging scenarios.
Abstract: Integrating multiple different yet complementary feature representations has been proved to be an effective way of boosting tracking performance. This paper investigates how to perform robust object tracking in challenging scenarios by adaptively incorporating information from grayscale and thermal videos, and proposes a novel collaborative algorithm for online tracking. In particular, an adaptive fusion scheme is proposed based on collaborative sparse representation in a Bayesian filtering framework. We jointly optimize sparse codes and the reliable weights of different modalities in an online way. In addition, this paper contributes a comprehensive video benchmark, which includes 50 grayscale-thermal sequences and their ground truth annotations for tracking purposes. The videos are highly diverse, and the annotations were completed by a single person to guarantee consistency. Extensive experiments against other state-of-the-art trackers with both grayscale and grayscale-thermal inputs demonstrate the effectiveness of the proposed tracking approach. Through analyzing quantitative results, we also provide basic insights and potential future research directions in grayscale-thermal tracking.

Journal ArticleDOI
TL;DR: A novel approach to person re-identification, a fundamental task in distributed multi-camera surveillance systems, is proposed that significantly outperforms all the state-of-the-art approaches, including both traditional and CNN-based methods on the challenging VIPeR, CUHK-01, and CAVIAR4REID datasets.
Abstract: This paper proposes a novel approach to person re-identification, a fundamental task in distributed multi-camera surveillance systems. Although a variety of powerful algorithms have been presented in the past few years, most of them usually focus on designing hand-crafted features and learning metrics either individually or sequentially. Different from previous works, we formulate a unified deep ranking framework that jointly tackles both of these key components to maximize their strengths. We start from the principle that the correct match of the probe image should be positioned in the top rank within the whole gallery set. An effective learning-to-rank algorithm is proposed to minimize the cost corresponding to the ranking disorders of the gallery. The ranking model is solved with a deep convolutional neural network (CNN) that builds the relation between input image pairs and their similarity scores through joint representation learning directly from raw image pixels. The proposed framework allows us to get rid of feature engineering and does not rely on any assumption. An extensive comparative evaluation is given, demonstrating that our approach significantly outperforms all the state-of-the-art approaches, including both traditional and CNN-based methods on the challenging VIPeR, CUHK-01, and CAVIAR4REID datasets. In addition, our approach has better ability to generalize across datasets without fine-tuning.

Journal ArticleDOI
TL;DR: The proposed LSDT methods outperform other state-of-the-art representation-based domain adaptation methods; the LSDT model is also generalized into a kernel-based linear/nonlinear basis transformation learning framework for tackling nonlinear subspace shifts in reproducing kernel Hilbert space.
Abstract: We propose a novel reconstruction-based transfer learning method called latent sparse domain transfer (LSDT) for domain adaptation and visual categorization of heterogeneous data. For handling cross-domain distribution mismatch, we advocate reconstructing the target domain data with the combined source and target domain data points based on $\ell_1$-norm sparse coding. Furthermore, we propose a joint learning model for simultaneous optimization of the sparse coding and the optimal subspace representation. In addition, we generalize the proposed LSDT model into a kernel-based linear/nonlinear basis transformation learning framework for tackling nonlinear subspace shifts in reproducing kernel Hilbert space. The proposed methods have three advantages: 1) the latent space and the reconstruction are jointly learned in pursuit of an optimal subspace transfer; 2) with the theory of sparse subspace clustering, a few valuable source and target data points are formulated to reconstruct the target data, with noise (outliers) from the source domain removed during domain adaptation, such that robustness is guaranteed; and 3) a nonlinear projection of some latent space with kernel is easily generalized for dealing with highly nonlinear domain shift (e.g., face poses). Extensive experiments on several benchmark vision data sets demonstrate that the proposed approaches outperform other state-of-the-art representation-based domain adaptation methods.

Journal ArticleDOI
TL;DR: This paper introduces a dimension reduction framework which to some extent represents data as parts, has fast learning speed, and learns the between-class scatter subspace; experimental results show the efficacy of linear and non-linear ELM-AE and SELM-AE in terms of discriminative capability, sparsity, training time, and normalized mean square error.
Abstract: Data may often contain noise or irrelevant information, which negatively affects the generalization capability of machine learning algorithms. The objective of dimension reduction algorithms, such as principal component analysis (PCA), non-negative matrix factorization (NMF), random projection (RP), and auto-encoder (AE), is to reduce the noise or irrelevant information of the data. The features of PCA (eigenvectors) and linear AE are not able to represent data as parts (e.g., the nose in a face image). On the other hand, NMF and non-linear AE are hampered by slow learning speed, and RP only represents a subspace of the original data. This paper introduces a dimension reduction framework which to some extent represents data as parts, has fast learning speed, and learns the between-class scatter subspace. To this end, this paper investigates a linear and non-linear dimension reduction framework referred to as extreme learning machine AE (ELM-AE) and sparse ELM-AE (SELM-AE). In contrast to tied-weight AE, the hidden neurons in ELM-AE and SELM-AE need not be tuned, and their parameters (e.g., input weights in additive neurons) are initialized using orthogonal and sparse random weights, respectively. Experimental results on the USPS handwritten digit recognition data set, the CIFAR-10 object recognition data set, and the NORB object recognition data set show the efficacy of linear and non-linear ELM-AE and SELM-AE in terms of discriminative capability, sparsity, training time, and normalized mean square error.
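
The speed claim rests on the ELM training recipe: random, untuned input weights plus a closed-form ridge solve for the output weights. A minimal sketch (assuming the hidden dimension does not exceed the input dimension; the names and the tanh activation are illustrative):

```python
import numpy as np

def elm_ae_reduce(X, n_hidden=100, C=1e3, seed=0):
    """Minimal ELM-AE sketch: orthogonal random input weights are never
    tuned; only the output weights beta are solved in closed form, which
    is the source of the fast learning speed. X is n_samples x d."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.standard_normal((d, n_hidden))
    W, _ = np.linalg.qr(W)                 # orthogonal random input weights
    b = rng.standard_normal(n_hidden)
    H = np.tanh(X @ W + b)                 # untuned hidden activations
    # Ridge regression: beta solves H @ beta ~= X with penalty 1/C.
    beta = np.linalg.solve(H.T @ H + np.eye(n_hidden) / C, H.T @ X)
    return X @ beta.T                      # n_hidden-dimensional representation
```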

Journal ArticleDOI
TL;DR: In this article, the authors revisited the popular performance measures and tracker performance visualizations and analyzed them theoretically and experimentally, and showed that several measures are equivalent from the point of information they provide for tracker comparison and, crucially, that some are more brittle than the others.
Abstract: The problem of visual tracking evaluation is beset by a large variety of performance measures and largely suffers from a lack of consensus about which measures should be used in experiments. This makes cross-paper tracker comparison difficult. Furthermore, as some measures may be less effective than others, the tracking results may be skewed or biased toward particular tracking aspects. In this paper, we revisit the popular performance measures and tracker performance visualizations and analyze them theoretically and experimentally. We show that several measures are equivalent in terms of the information they provide for tracker comparison and, crucially, that some are more brittle than others. Based on our analysis, we narrow down the set of potential measures to only two complementary ones, describing accuracy and robustness, thus pushing toward homogenization of the tracker evaluation methodology. These two measures can be intuitively interpreted and visualized and have been employed by the recent visual object tracking challenges as the foundation for the evaluation methodology.
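
The two retained measures are accuracy (average per-frame overlap between predicted and ground-truth regions) and robustness (the number of tracking failures). A minimal sketch under an assumed (x, y, w, h) box format:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x, y, w, h) boxes; averaging the
    per-frame overlap across a sequence gives the accuracy measure."""
    xa, ya, wa, ha = box_a
    xb, yb, wb, hb = box_b
    ix = max(0.0, min(xa + wa, xb + wb) - max(xa, xb))
    iy = max(0.0, min(ya + ha, yb + hb) - max(ya, yb))
    inter = ix * iy
    union = wa * ha + wb * hb - inter
    return inter / union if union > 0 else 0.0

def robustness(per_frame_iou, failure_threshold=0.0):
    """Robustness counts failures: frames where the overlap drops to the
    threshold (in VOT-style evaluation the tracker is then reinitialized)."""
    return sum(1 for o in per_frame_iou if o <= failure_threshold)
```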

Journal ArticleDOI
TL;DR: Experimental results demonstrate that the proposed subRW method outperforms previous RW algorithms for seeded image segmentation; a new subRW algorithm with label prior is designed to solve the segmentation problem of objects with thin and elongated parts.
Abstract: A novel sub-Markov random walk (subRW) algorithm with label prior is proposed for seeded image segmentation, which can be interpreted as a traditional random walker on a graph with added auxiliary nodes. Under this explanation, we unify the proposed subRW and other popular random walk (RW) algorithms. This unifying view makes it possible to transfer intrinsic findings between different RW algorithms, and offers new ideas for designing novel RW algorithms by adding or changing auxiliary nodes. To verify the second benefit, we design a new subRW algorithm with label prior to solve the segmentation problem of objects with thin and elongated parts. The experimental results on both synthetic and natural images with twigs demonstrate that the proposed subRW method outperforms previous RW algorithms for seeded image segmentation.

Journal ArticleDOI
TL;DR: This paper proposes a new deep transfer metric learning (DTML) method to learn a set of hierarchical nonlinear transformations for cross-domain visual recognition by transferring discriminative knowledge from the labeled source domain to the unlabeled target domain.
Abstract: Conventional metric learning methods usually assume that the training and test samples are captured in similar scenarios so that their distributions are assumed to be the same. This assumption does not hold in many real visual recognition applications, especially when samples are captured across different data sets. In this paper, we propose a new deep transfer metric learning (DTML) method to learn a set of hierarchical nonlinear transformations for cross-domain visual recognition by transferring discriminative knowledge from the labeled source domain to the unlabeled target domain. Specifically, our DTML learns a deep metric network by maximizing the inter-class variations and minimizing the intra-class variations, and minimizing the distribution divergence between the source domain and the target domain at the top layer of the network. To better exploit the discriminative information from the source domain, we further develop a deeply supervised transfer metric learning (DSTML) method by including an additional objective on DTML, where the output of both the hidden layers and the top layer are optimized jointly. To preserve the local manifold of input data points in the metric space, we present two new methods, DTML with autoencoder regularization and DSTML with autoencoder regularization. Experimental results on face verification, person re-identification, and handwritten digit recognition validate the effectiveness of the proposed methods.

Journal ArticleDOI
TL;DR: An enhanced prior for image deblurring is introduced by combining the low-rank prior of similar patches from both the blurry image and its gradient map; a weighted nuclear norm minimization method is employed to further enhance the effectiveness of the low-rank prior.
Abstract: Low-rank matrix approximation has been successfully applied to numerous vision problems in recent years. In this paper, we propose a novel low-rank prior for blind image deblurring. Our key observation is that directly applying a simple low-rank model to a blurry input image significantly reduces the blur even without using any kernel information, while preserving important edge information. The same model can be used to reduce blur in the gradient map of a blurry input. Based on these properties, we introduce an enhanced prior for image deblurring by combining the low-rank prior of similar patches from both the blurry image and its gradient map. We employ a weighted nuclear norm minimization method to further enhance the effectiveness of the low-rank prior for image deblurring, by retaining the dominant edges and eliminating fine texture and slight edges in intermediate images, allowing for better kernel estimation. In addition, we evaluate the proposed enhanced low-rank prior for both uniform and non-uniform deblurring. Quantitative and qualitative experimental evaluations demonstrate that the proposed algorithm performs favorably against the state-of-the-art deblurring methods.