Showing papers in "IEEE Transactions on Image Processing in 2017"


Journal ArticleDOI
TL;DR: Zhang et al. as mentioned in this paper proposed feed-forward denoising convolutional neural networks (DnCNNs) to handle Gaussian denoising with an unknown noise level.
Abstract: The discriminative model learning for image denoising has been recently attracting considerable attention due to its favorable denoising performance. In this paper, we take one step forward by investigating the construction of feed-forward denoising convolutional neural networks (DnCNNs) to embrace the progress in very deep architecture, learning algorithm, and regularization method into image denoising. Specifically, residual learning and batch normalization are utilized to speed up the training process as well as boost the denoising performance. Different from the existing discriminative denoising models which usually train a specific model for additive white Gaussian noise at a certain noise level, our DnCNN model is able to handle Gaussian denoising with an unknown noise level (i.e., blind Gaussian denoising). With the residual learning strategy, DnCNN implicitly removes the latent clean image in the hidden layers. This property motivates us to train a single DnCNN model to tackle several general image denoising tasks, such as Gaussian denoising, single image super-resolution, and JPEG image deblocking. Our extensive experiments demonstrate that our DnCNN model can not only exhibit high effectiveness in several general image denoising tasks, but also be efficiently implemented by benefiting from GPU computing.

5,902 citations
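A minimal PyTorch sketch of the residual-learning idea described above: the network is trained to predict the noise residual, and the clean estimate is the noisy input minus that prediction. The depth, channel width, and single-channel input here are illustrative assumptions, not the paper's released configuration.

```python
import torch
import torch.nn as nn

class TinyDnCNN(nn.Module):
    """Illustrative residual-learning denoiser (depth and width are arbitrary)."""
    def __init__(self, depth=8, channels=64):
        super().__init__()
        layers = [nn.Conv2d(1, channels, 3, padding=1), nn.ReLU(inplace=True)]
        for _ in range(depth - 2):
            layers += [nn.Conv2d(channels, channels, 3, padding=1),
                       nn.BatchNorm2d(channels),
                       nn.ReLU(inplace=True)]
        layers += [nn.Conv2d(channels, 1, 3, padding=1)]
        self.body = nn.Sequential(*layers)

    def forward(self, noisy):
        residual = self.body(noisy)   # network predicts the noise component
        return noisy - residual       # clean estimate = input - predicted noise

# Training on the residual is equivalent to an MSE loss between the returned
# estimate and the clean target image.
model = TinyDnCNN()
noisy = torch.randn(4, 1, 40, 40)
clean = torch.randn(4, 1, 40, 40)
loss = nn.functional.mse_loss(model(noisy), clean)
```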


Journal ArticleDOI
TL;DR: In this paper, the authors proposed a deep convolutional neural network (CNN)-based algorithm for solving ill-posed inverse problems: direct inversion followed by a CNN that combines multiresolution decomposition and residual learning to remove reconstruction artifacts while preserving image structure.
Abstract: In this paper, we propose a novel deep convolutional neural network (CNN)-based algorithm for solving ill-posed inverse problems. Regularized iterative algorithms have emerged as the standard approach to ill-posed inverse problems in the past few decades. These methods produce excellent results, but can be challenging to deploy in practice due to factors including the high computational cost of the forward and adjoint operators and the difficulty of hyperparameter selection. The starting point of this paper is the observation that unrolled iterative methods have the form of a CNN (filtering followed by pointwise non-linearity) when the normal operator ( $H^{*}H$ , where $H^{*}$ is the adjoint of the forward imaging operator, $H$ ) of the forward model is a convolution. Based on this observation, we propose using direct inversion followed by a CNN to solve normal-convolutional inverse problems. The direct inversion encapsulates the physical model of the system, but leads to artifacts when the problem is ill posed; the CNN combines multiresolution decomposition and residual learning in order to learn to remove these artifacts while preserving image structure. We demonstrate the performance of the proposed network in sparse-view reconstruction (down to 50 views) on parallel beam X-ray computed tomography in synthetic phantoms as well as in real experimental sinograms. The proposed network outperforms total variation-regularized iterative reconstruction for the more realistic phantoms and requires less than a second to reconstruct a $512\times 512$ image on the GPU.

1,757 citations
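A rough sketch of the "direct inversion followed by a CNN" pipeline described above, using scikit-image's filtered back-projection as the direct inverse and a residual network as the artifact-removal stage. The `artifact_cnn` argument is a hypothetical stand-in for the trained multiresolution/residual network in the paper.

```python
import torch
from skimage.transform import iradon

def reconstruct(sinogram, theta, artifact_cnn):
    """Direct inversion (FBP) followed by a learned residual correction.

    sinogram: (n_detectors, n_angles) array; theta: projection angles in degrees.
    artifact_cnn: any trained torch module mapping (1, 1, H, W) -> (1, 1, H, W);
    hypothetical here, standing in for the paper's network.
    """
    # 1) Direct inversion encapsulating the physical model (sparse-view FBP).
    fbp = iradon(sinogram, theta=theta)
    # 2) CNN removes streak artifacts while preserving structure (residual learning).
    x = torch.from_numpy(fbp).float()[None, None]
    with torch.no_grad():
        correction = artifact_cnn(x)
    return (x + correction).squeeze().numpy()
```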


Journal ArticleDOI
TL;DR: Experiments on a number of challenging low-light images are presented to reveal the efficacy of the proposed LIME and show its superiority over several state-of-the-art methods in terms of enhancement quality and efficiency.
Abstract: When one captures images in low-light conditions, the images often suffer from low visibility. Besides degrading the visual aesthetics of images, this poor quality may also significantly degrade the performance of many computer vision and multimedia algorithms that are primarily designed for high-quality inputs. In this paper, we propose a simple yet effective low-light image enhancement (LIME) method. More concretely, the illumination of each pixel is first estimated individually by finding the maximum value in the R, G, and B channels. Furthermore, we refine the initial illumination map by imposing a structure prior on it to obtain the final illumination map. Having the well-constructed illumination map, the enhancement can be achieved accordingly. Experiments on a number of challenging low-light images are presented to reveal the efficacy of our LIME and show its superiority over several state-of-the-art methods in terms of enhancement quality and efficiency.

1,364 citations
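A compact sketch of the estimate-then-enhance steps described above: the per-pixel illumination is the maximum over the R, G, B channels, the map is refined, and each channel is divided by it. The paper refines the map with a structure prior; a Gaussian blur is used below as a crude stand-in, and the gamma value is an assumption.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def lime_like_enhance(img, gamma=0.8, eps=1e-3):
    """img: float RGB array in [0, 1], shape (H, W, 3)."""
    # Initial illumination: per-pixel maximum over the color channels.
    t_init = img.max(axis=2)
    # Structure-aware refinement in the paper; plain smoothing used here as a placeholder.
    t_ref = gaussian_filter(t_init, sigma=2.0)
    t_ref = np.clip(t_ref, eps, 1.0) ** gamma
    # Retinex-style enhancement: divide each channel by the illumination map.
    return np.clip(img / t_ref[..., None], 0.0, 1.0)
```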


Journal ArticleDOI
TL;DR: Zhang et al. as mentioned in this paper introduced a deep network architecture called DerainNet for removing rain streaks from an image, which directly learned the mapping relationship between rainy and clean image detail layers from data.
Abstract: We introduce a deep network architecture called DerainNet for removing rain streaks from an image. Based on the deep convolutional neural network (CNN), we directly learn the mapping relationship between rainy and clean image detail layers from data. Because we do not possess the ground truth corresponding to real-world rainy images, we synthesize images with rain for training. In contrast to other common strategies that increase depth or breadth of the network, we use image processing domain knowledge to modify the objective function and improve deraining with a modestly sized CNN. Specifically, we train our DerainNet on the detail (high-pass) layer rather than in the image domain. Though DerainNet is trained on synthetic data, we find that the learned network translates very effectively to real-world images for testing. Moreover, we augment the CNN framework with image enhancement to improve the visual results. Compared with the state-of-the-art single image de-raining methods, our method achieves improved rain removal and a much faster computation time after network training.

701 citations
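A short sketch of the domain-knowledge step highlighted above: split the image into a low-pass base and a high-pass detail layer, run the CNN only on the detail layer, and add the base back. The paper uses a guided filter for the split; a Gaussian low-pass is used here as a simple stand-in, and `detail_cnn` is a hypothetical trained network.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def split_base_detail(img, sigma=3.0):
    """img: float (H, W, 3). Guided filter in the paper; Gaussian low-pass here."""
    base = gaussian_filter(img, sigma=(sigma, sigma, 0))  # smooth spatially, not across channels
    detail = img - base                                   # high-pass detail layer
    return base, detail

def derain(img, detail_cnn):
    base, detail = split_base_detail(img)
    cleaned_detail = detail_cnn(detail)   # hypothetical CNN trained on detail layers
    return base + cleaned_detail
```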


Journal ArticleDOI
TL;DR: This paper proposes a new soft attention-based model, i.e., the end-to-end comparative attention network (CAN), specifically tailored for the task of person re-identification, which significantly outperforms well-established baselines and offers new state-of-the-art performance.
Abstract: Person re-identification across disjoint camera views has been widely applied in video surveillance yet it is still a challenging problem. One of the major challenges lies in the lack of spatial and temporal cues, which makes it difficult to deal with large variations of lighting conditions, viewing angles, body poses, and occlusions. Recently, several deep-learning-based person re-identification approaches have been proposed and achieved remarkable performance. However, most of those approaches extract discriminative features from the whole frame at one glimpse without differentiating various parts of the persons to identify. It is essential to examine multiple highly discriminative local regions of the person images in detail through multiple glimpses for dealing with the large appearance variance. In this paper, we propose a new soft attention-based model, i.e., the end-to-end comparative attention network (CAN), specifically tailored for the task of person re-identification. The end-to-end CAN learns to selectively focus on parts of pairs of person images after taking a few glimpses of them and adaptively comparing their appearance. The CAN model is able to learn which parts of images are relevant for discerning persons and automatically integrates information from different parts to determine whether a pair of images belongs to the same person. In other words, our proposed CAN model simulates the human perception process to verify whether two images are from the same person. Extensive experiments on four benchmark person re-identification data sets, including CUHK01, CUHK03, Market-1501, and VIPeR, clearly demonstrate that our proposed end-to-end CAN for person re-identification significantly outperforms well-established baselines and offers new state-of-the-art performance.

610 citations


Journal ArticleDOI
TL;DR: A novel deep convolutional neural network that is deeper and wider than other existing deep networks for hyperspectral image classification, called contextual deep CNN, can optimally explore local contextual interactions by jointly exploiting local spatio-spectral relationships of neighboring individual pixel vectors.
Abstract: In this paper, we describe a novel deep convolutional neural network (CNN) that is deeper and wider than other existing deep networks for hyperspectral image classification. Unlike current state-of-the-art approaches in CNN-based hyperspectral image classification, the proposed network, called contextual deep CNN, can optimally explore local contextual interactions by jointly exploiting local spatio-spectral relationships of neighboring individual pixel vectors. The joint exploitation of the spatio-spectral information is achieved by a multi-scale convolutional filter bank used as an initial component of the proposed CNN pipeline. The initial spatial and spectral feature maps obtained from the multi-scale filter bank are then combined together to form a joint spatio-spectral feature map. The joint feature map representing rich spectral and spatial properties of the hyperspectral image is then fed through a fully convolutional network that eventually predicts the corresponding label of each pixel vector. The proposed approach is tested on three benchmark data sets: the Indian Pines data set, the Salinas data set, and the University of Pavia data set. Performance comparison shows enhanced classification performance of the proposed approach over the current state-of-the-art on the three data sets.

578 citations
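A minimal PyTorch sketch of the multi-scale convolutional filter bank used to form the joint spatio-spectral feature map: parallel convolutions at several spatial kernel sizes over all spectral bands of a pixel neighborhood, concatenated before the fully convolutional classifier. The layer sizes and band count below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MultiScaleFilterBank(nn.Module):
    """Parallel convolutions at several spatial scales over all spectral bands,
    concatenated into a joint spatio-spectral feature map (sizes are illustrative)."""
    def __init__(self, n_bands, out_per_branch=32):
        super().__init__()
        self.branch1 = nn.Conv2d(n_bands, out_per_branch, kernel_size=1)
        self.branch3 = nn.Conv2d(n_bands, out_per_branch, kernel_size=3, padding=1)
        self.branch5 = nn.Conv2d(n_bands, out_per_branch, kernel_size=5, padding=2)

    def forward(self, x):                      # x: (N, bands, H, W)
        feats = [self.branch1(x), self.branch3(x), self.branch5(x)]
        return torch.cat(feats, dim=1)         # fed to a fully convolutional network

bank = MultiScaleFilterBank(n_bands=200)       # e.g., an Indian Pines-sized band count
patch = torch.randn(1, 200, 27, 27)
joint = bank(patch)                            # (1, 96, 27, 27) joint feature map
```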


Journal ArticleDOI
TL;DR: This work establishes a large-scale database named the Waterloo Exploration Database, which in its current state contains 4744 pristine natural images and 94 880 distorted images created from them, and presents three alternative test criteria to evaluate the performance of IQA models, namely, the pristine/distorted image discriminability test, the listwise ranking consistency test, and the pairwise preference consistency test.
Abstract: The great content diversity of real-world digital images poses a grand challenge to image quality assessment (IQA) models, which are traditionally designed and validated on a handful of commonly used IQA databases with very limited content variation. To test the generalization capability and to facilitate the wide usage of IQA techniques in real-world applications, we establish a large-scale database named the Waterloo Exploration Database, which in its current state contains 4744 pristine natural images and 94 880 distorted images created from them. Instead of collecting the mean opinion score for each image via subjective testing, which is extremely difficult if not impossible, we present three alternative test criteria to evaluate the performance of IQA models, namely, the pristine/distorted image discriminability test, the listwise ranking consistency test, and the pairwise preference consistency test (P-test). We compare 20 well-known IQA models using the proposed criteria, which not only provide a stronger test in a more challenging testing environment for existing models, but also demonstrate the additional benefits of using the proposed database. For example, in the P-test, even for the best performing no-reference IQA model, more than 6 million failure cases against the model are “discovered” automatically out of over 1 billion test pairs. Furthermore, we discuss how the new database may be exploited using innovative approaches in the future, to reveal the weaknesses of existing IQA models, to provide insights on how to improve the models, and to shed light on how the next-generation IQA models may be developed. The database and codes are made publicly available at: https://ece.uwaterloo.ca/~k29ma/exploration/ .

495 citations
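A sketch of the pairwise preference consistency test (P-test) logic described above, under the assumption that each test pair has an unambiguous quality ordering and that a pair counts as consistent when the model's predicted scores agree with that ordering. The scoring function and pair construction are placeholders.

```python
def pairwise_preference_consistency(score_fn, pairs):
    """pairs: list of (better_img, worse_img) with unambiguous quality ordering.
    score_fn: IQA model returning a higher score for better quality (flip the
    comparison for models where lower means better)."""
    consistent = sum(1 for better, worse in pairs if score_fn(better) > score_fn(worse))
    return consistent / max(len(pairs), 1)   # fraction of consistent pairs
```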


Journal ArticleDOI
TL;DR: DeepFix as mentioned in this paper proposes a fully convolutional neural network (FCN) which models the bottom-up mechanism of visual attention via saliency prediction and predicts the saliency map in an end-to-end manner.
Abstract: Understanding and predicting the human visual attention mechanism is an active area of research in the fields of neuroscience and computer vision. In this paper, we propose DeepFix, a fully convolutional neural network, which models the bottom–up mechanism of visual attention via saliency prediction. Unlike classical works, which characterize the saliency map using various hand-crafted features, our model automatically learns features in a hierarchical fashion and predicts the saliency map in an end-to-end manner. DeepFix is designed to capture semantics at multiple scales while taking global context into account, by using network layers with very large receptive fields. Generally, fully convolutional nets are spatially invariant—this prevents them from modeling location-dependent patterns (e.g., centre-bias). Our network handles this by incorporating a novel location-biased convolutional layer. We evaluate our model on multiple challenging saliency data sets and show that it achieves the state-of-the-art results.

443 citations
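A small sketch of the location-biased convolution idea mentioned above: because a plain fully convolutional network is spatially invariant, fixed location maps are concatenated to the incoming features so the following convolution can learn location-dependent patterns such as centre-bias. Using Gaussian centre-bias channels, and their number and width, are assumptions of this sketch rather than the paper's exact layer.

```python
import torch
import torch.nn as nn

class LocationBiasedConv(nn.Module):
    """Concatenate fixed centre-bias maps to the input features before convolving,
    letting the layer model location-dependent patterns (bias maps assumed Gaussian)."""
    def __init__(self, in_ch, out_ch, height, width, n_bias=1, sigma=0.25):
        super().__init__()
        ys = torch.linspace(-1, 1, height).view(-1, 1).expand(height, width)
        xs = torch.linspace(-1, 1, width).view(1, -1).expand(height, width)
        centre_bias = torch.exp(-(xs ** 2 + ys ** 2) / (2 * sigma ** 2))
        self.register_buffer("bias_maps", centre_bias.repeat(n_bias, 1, 1))  # (n_bias, H, W)
        self.conv = nn.Conv2d(in_ch + n_bias, out_ch, kernel_size=3, padding=1)

    def forward(self, x):                                    # x: (N, in_ch, H, W)
        maps = self.bias_maps.expand(x.size(0), -1, -1, -1)  # broadcast over the batch
        return self.conv(torch.cat([x, maps], dim=1))
```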


Journal ArticleDOI
TL;DR: A depth estimation method for underwater scenes based on image blurriness and light absorption is proposed, which can be used in the image formation model (IFM) to restore and enhance underwater images.
Abstract: Underwater images often suffer from color distortion and low contrast, because light is scattered and absorbed when traveling through water. Such images with different color tones can be shot in various lighting conditions, making restoration and enhancement difficult. We propose a depth estimation method for underwater scenes based on image blurriness and light absorption, which can be used in the image formation model (IFM) to restore and enhance underwater images. Previous IFM-based image restoration methods estimate scene depth based on the dark channel prior or the maximum intensity prior. These are frequently invalidated by the lighting conditions in underwater images, leading to poor restoration results. The proposed method estimates underwater scene depth more accurately. Experimental results on restoring real and synthesized underwater images demonstrate that the proposed method outperforms other IFM-based underwater image restoration methods.

433 citations
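A sketch of how an estimated depth map plugs into the image formation model (IFM) to restore an underwater image: transmission decays exponentially with depth, and scene radiance is recovered by inverting I = J*t + B*(1 - t). The attenuation coefficient, transmission floor, and background-light estimation are illustrative assumptions, not values from the paper.

```python
import numpy as np

def restore_with_ifm(img, depth, background_light, beta=1.0, t_min=0.1):
    """Invert the underwater image formation model I = J*t + B*(1 - t).

    img: float RGB in [0, 1]; depth: per-pixel relative depth (e.g., from the
    blurriness/absorption estimate); background_light: length-3 vector B.
    beta and t_min are illustrative constants.
    """
    t = np.exp(-beta * depth)                  # transmission derived from depth
    t = np.clip(t, t_min, 1.0)[..., None]
    B = np.asarray(background_light).reshape(1, 1, 3)
    J = (img - B) / t + B                      # recovered scene radiance
    return np.clip(J, 0.0, 1.0)
```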


Journal ArticleDOI
TL;DR: A novel cross-modal hashing method, termed discrete cross-modal hashing (DCH), which directly learns discriminative binary codes while retaining the discrete constraints, and an effective discrete optimization algorithm is developed for DCH to jointly learn the modality-specific hash function and the unified binary codes.
Abstract: Hashing based methods have attracted considerable attention for efficient cross-modal retrieval on large-scale multimedia data. The core problem of cross-modal hashing is how to learn compact binary codes that construct the underlying correlations between heterogeneous features from different modalities. A majority of recent approaches aim at learning hash functions to preserve the pairwise similarities defined by given class labels. However, these methods fail to explicitly explore the discriminative property of class labels during hash function learning. In addition, they usually discard the discrete constraints imposed on the to-be-learned binary codes, and compromise to solve a relaxed problem with quantization to obtain the approximate binary solution. Therefore, the binary codes generated by these methods are suboptimal and less discriminative to different classes. To overcome these drawbacks, we propose a novel cross-modal hashing method, termed discrete cross-modal hashing (DCH), which directly learns discriminative binary codes while retaining the discrete constraints. Specifically, DCH learns modality-specific hash functions for generating unified binary codes, and these binary codes are viewed as representative features for discriminative classification with class labels. An effective discrete optimization algorithm is developed for DCH to jointly learn the modality-specific hash function and the unified binary codes. Extensive experiments on three benchmark data sets highlight the superiority of DCH under various cross-modal scenarios and show its state-of-the-art performance.

358 citations


Journal ArticleDOI
TL;DR: Zhang et al. as mentioned in this paper proposed the selective convolutional descriptor aggregation (SCDA) method, which first localizes the main object in fine-grained images and then aggregates the selected descriptors into a short feature vector using the best practices.
Abstract: Deep convolutional neural network models pre-trained for the ImageNet classification task have been successfully adopted to tasks in other domains, such as texture description and object proposal generation, but these tasks require annotations for images in the new domain. In this paper, we focus on a novel and challenging task in the pure unsupervised setting: fine-grained image retrieval. Even with image labels, fine-grained images are difficult to classify, let alone the unsupervised retrieval task. We propose the selective convolutional descriptor aggregation (SCDA) method. The SCDA first localizes the main object in fine-grained images, a step that discards the noisy background and keeps useful deep descriptors. The selected descriptors are then aggregated and the dimensionality is reduced to a short feature vector using the best practices we found. The SCDA is unsupervised, using no image label or bounding box annotation. Experiments on six fine-grained data sets confirm the effectiveness of the SCDA for fine-grained image retrieval. Besides, visualization of the SCDA features shows that they correspond to visual attributes (even subtle ones), which might explain SCDA's high mean average precision in fine-grained retrieval. Moreover, on general image retrieval data sets, the SCDA achieves retrieval results comparable with the state-of-the-art general image retrieval approaches.
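A numpy sketch of the selection-and-aggregation step: sum the convolutional activations over channels, keep only spatial positions above the mean of that aggregation map (roughly localizing the main object), then average- and max-pool the surviving descriptors and concatenate them. The dimensionality reduction and the exact layer choice from the paper are omitted.

```python
import numpy as np

def scda_like_feature(conv_feat):
    """conv_feat: (C, H, W) activations from a pre-trained CNN layer."""
    A = conv_feat.sum(axis=0)                  # aggregation map over channels
    mask = A > A.mean()                        # positions above the mean: likely the object
    selected = conv_feat[:, mask]              # (C, n_selected) descriptors, background dropped
    if selected.size == 0:                     # degenerate case: fall back to all positions
        selected = conv_feat.reshape(conv_feat.shape[0], -1)
    feat = np.concatenate([selected.mean(axis=1), selected.max(axis=1)])
    return feat / (np.linalg.norm(feat) + 1e-12)   # L2-normalized retrieval feature
```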

Journal ArticleDOI
TL;DR: This paper proposes a part-based hierarchical bidirectional recurrent neural network (PHRNN) to analyze the facial expression information of temporal sequences and reduces the error rates of the previous best ones on three widely used facial expression databases.
Abstract: One key challenging issue of facial expression recognition is to capture the dynamic variation of facial physical structure from videos. In this paper, we propose a part-based hierarchical bidirectional recurrent neural network (PHRNN) to analyze the facial expression information of temporal sequences. Our PHRNN models facial morphological variations and dynamical evolution of expressions, which is effective to extract “temporal features” based on facial landmarks (geometry information) from consecutive frames. Meanwhile, in order to complement the still appearance information, a multi-signal convolutional neural network (MSCNN) is proposed to extract “spatial features” from still frames. We use both recognition and verification signals as supervision to calculate different loss functions, which are helpful to increase the variations of different expressions and reduce the differences among identical expressions. This deep evolutional spatial-temporal network (composed of PHRNN and MSCNN) extracts the partial-whole, geometry-appearance, and dynamic-still information, effectively boosting the performance of facial expression recognition. Experimental results show that this method largely outperforms the state-of-the-art ones. On three widely used facial expression databases (CK+, Oulu-CASIA, and MMI), our method reduces the error rates of the previous best ones by 45.5%, 25.8%, and 24.4%, respectively.

Journal ArticleDOI
TL;DR: It is shown that the proposed novel technique, characterised by a cascade of two cascaded classifiers, performs comparably to current top-performing detection and localization methods on standard benchmarks, but generally outperforms them with respect to required computation time.
Abstract: This paper proposes a fast and reliable method for anomaly detection and localization in video data showing crowded scenes. Time-efficient anomaly localization is an ongoing challenge and subject of this paper. We propose a cubic-patch-based method, characterised by a cascade of classifiers, which makes use of an advanced feature-learning approach. Our cascade of classifiers has two main stages. First, a light but deep 3D auto-encoder is used for early identification of "many" normal cubic patches. This deep network operates on small cubic patches as the first stage, before carefully resizing the remaining candidates of interest and evaluating those at the second stage using a more complex and deeper 3D convolutional neural network (CNN). We divide the deep auto-encoder and the CNN into multiple sub-stages, which operate as cascaded classifiers. Shallow layers of the cascaded deep networks (designed as Gaussian classifiers, acting as weak single-class classifiers) detect "simple" normal patches, such as background patches, while more complex normal patches are detected at deeper layers. It is shown that the proposed novel technique (a cascade of two cascaded classifiers) performs comparably to current top-performing detection and localization methods on standard benchmarks, but generally outperforms them with respect to required computation time.

Journal ArticleDOI
TL;DR: The proposed deep label distribution learning (DLDL) method effectively utilizes the label ambiguity in both feature learning and classifier learning, which helps prevent the network from overfitting even when the training set is small.
Abstract: Convolutional neural networks (ConvNets) have achieved excellent recognition performance in various visual recognition tasks. A large labeled training set is one of the most important factors for its success. However, it is difficult to collect sufficient training images with precise labels in some domains, such as apparent age estimation, head pose estimation, multi-label classification, and semantic segmentation. Fortunately, there is ambiguous information among labels, which makes these tasks different from traditional classification. Based on this observation, we convert the label of each image into a discrete label distribution, and learn the label distribution by minimizing a Kullback–Leibler divergence between the predicted and ground-truth label distributions using deep ConvNets. The proposed deep label distribution learning (DLDL) method effectively utilizes the label ambiguity in both feature learning and classifier learning, which helps prevent the network from overfitting even when the training set is small. Experimental results show that the proposed approach produces significantly better results than the state-of-the-art methods for age estimation and head pose estimation. At the same time, it also improves recognition performance for multi-label classification and semantic segmentation tasks.
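A sketch of the label distribution construction and the KL objective: a scalar label (e.g., an apparent age) is converted into a discrete Gaussian distribution over the label range, and the network's softmax output is trained to match it under a KL divergence. The bin range and sigma below are illustrative.

```python
import torch
import torch.nn.functional as F

def label_to_distribution(y, bins, sigma=2.0, eps=1e-12):
    """Discrete, normalized Gaussian centred at the scalar label y over 'bins'."""
    d = torch.exp(-((bins - y) ** 2) / (2 * sigma ** 2)) + eps
    return d / d.sum()

bins = torch.arange(0, 101, dtype=torch.float32)        # e.g., ages 0..100
target = label_to_distribution(torch.tensor(31.0), bins)
logits = torch.randn(101)                               # network output for one sample
log_pred = F.log_softmax(logits, dim=0)
# KL(target || prediction): kl_div expects log-probabilities as input and
# probabilities as target.
loss = F.kl_div(log_pred, target, reduction="sum")
```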

Journal ArticleDOI
TL;DR: A discriminative deep multi-metric learning method to jointly learn multiple neural networks, under which the correlation of different features of each sample is maximized, and the distance of each positive pair is reduced and that of each negative pair is enlarged.
Abstract: This paper presents a new discriminative deep metric learning (DDML) method for face and kinship verification in wild conditions. While metric learning has achieved reasonably good performance in face and kinship verification, most existing metric learning methods aim to learn a single Mahalanobis distance metric to maximize the inter-class variations and minimize the intra-class variations, which cannot capture the nonlinear manifold on which face images usually lie. To address this, we propose a DDML method to train a deep neural network to learn a set of hierarchical nonlinear transformations to project face pairs into the same latent feature space, under which the distance of each positive pair is reduced and that of each negative pair is enlarged. To better use the commonality of multiple feature descriptors to make all the features more robust for face and kinship verification, we develop a discriminative deep multi-metric learning method to jointly learn multiple neural networks, under which the correlation of different features of each sample is maximized, and the distance of each positive pair is reduced and that of each negative pair is enlarged. Extensive experimental results show that our proposed methods achieve acceptable results in both face and kinship verification.
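A minimal sketch of the pairwise objective described above: face pairs are mapped through a shared nonlinear network, and a margin-based loss pulls positive pairs closer while pushing negative pairs apart in the latent space. The embedding network, the hinge form of the loss, and the margin value are assumptions of this sketch, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

# Shared nonlinear embedding (hierarchical transformations); sizes are illustrative.
embed = nn.Sequential(nn.Linear(512, 256), nn.Tanh(), nn.Linear(256, 128), nn.Tanh())

def pair_margin_loss(x1, x2, same, margin=2.0):
    """same: +1 for a positive pair, -1 for a negative pair."""
    d2 = ((embed(x1) - embed(x2)) ** 2).sum(dim=1)        # squared distance in latent space
    # Hinge: positives pushed to d^2 <= margin - 1, negatives to d^2 >= margin + 1.
    return torch.clamp(1.0 - same * (margin - d2), min=0.0).mean()

x1, x2 = torch.randn(8, 512), torch.randn(8, 512)
labels = torch.tensor([1., -1., 1., -1., 1., -1., 1., -1.])
loss = pair_margin_loss(x1, x2, labels)
```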

Journal ArticleDOI
TL;DR: The proposed algorithm not only outperforms previous MEF algorithms on static scenes but also consistently produces high quality fused images with few ghosting artifacts for dynamic scenes and maintains a lower computational cost compared with the state-of-the-art deghosting schemes.
Abstract: We propose a simple yet effective structural patch decomposition approach for multi-exposure image fusion (MEF) that is robust to ghosting effect. We decompose an image patch into three conceptually independent components: signal strength, signal structure, and mean intensity. Upon fusing these three components separately, we reconstruct a desired patch and place it back into the fused image. This novel patch decomposition approach benefits MEF in many aspects. First, as opposed to most pixel-wise MEF methods, the proposed algorithm does not require post-processing steps to improve visual quality or to reduce spatial artifacts. Second, it handles RGB color channels jointly, and thus produces fused images with more vivid color appearance. Third and most importantly, the direction of the signal structure component in the patch vector space provides ideal information for ghost removal. It allows us to reliably and efficiently reject inconsistent object motions with respect to a chosen reference image without performing computationally expensive motion estimation. We compare the proposed algorithm with 12 MEF methods on 21 static scenes and 12 deghosting schemes on 19 dynamic scenes (with camera and object motion). Extensive experimental results demonstrate that the proposed algorithm not only outperforms previous MEF algorithms on static scenes but also consistently produces high quality fused images with few ghosting artifacts for dynamic scenes. Moreover, it maintains a lower computational cost compared with the state-of-the-art deghosting schemes. The MATLAB code of the proposed algorithm will be made available online. Preliminary results of Section III-A [1] were presented at the IEEE International Conference on Image Processing, Canada, 2015.
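A sketch of the three-component patch decomposition at the heart of the method: each co-located patch across the exposure stack is split into mean intensity, signal strength (the norm of the mean-removed patch), and signal structure (its direction); strength is fused by taking the maximum, while structure and mean are fused by weighted averaging. The specific weighting rules of the paper are simplified here to strength-based weights.

```python
import numpy as np

def fuse_patches(patches):
    """patches: (K, P) array, one vectorized co-located patch per exposure."""
    mu = patches.mean(axis=1, keepdims=True)             # mean intensity per exposure
    centered = patches - mu
    strength = np.linalg.norm(centered, axis=1)          # signal strength c_k
    structure = centered / (strength[:, None] + 1e-12)   # unit-norm signal structure s_k

    c_fused = strength.max()                             # highest contrast wins
    w = strength + 1e-12                                 # simplified strength-based weights
    s_fused = (w[:, None] * structure).sum(axis=0)
    s_fused /= (np.linalg.norm(s_fused) + 1e-12)
    mu_fused = (w * mu[:, 0]).sum() / w.sum()            # simplified weighting of means

    return c_fused * s_fused + mu_fused                  # reconstructed fused patch
```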

Journal ArticleDOI
TL;DR: Wang et al. as discussed by the authors proposed a novel tensor completion approach based on the tensor train (TT) rank, which is able to capture hidden information from tensors thanks to its definition from a well-balanced matricization scheme.
Abstract: This paper proposes a novel approach to tensor completion, which recovers missing entries of data represented by tensors. The approach is based on the tensor train (TT) rank, which is able to capture hidden information from tensors thanks to its definition from a well-balanced matricization scheme. Accordingly, new optimization formulations for tensor completion are proposed as well as two new algorithms for their solution. The first one called simple low-rank tensor completion via TT (SiLRTC-TT) is intimately related to minimizing a nuclear norm based on TT rank. The second one is from a multilinear matrix factorization model to approximate the TT rank of a tensor, and is called tensor completion by parallel matrix factorization via TT (TMac-TT). A tensor augmentation scheme of transforming a low-order tensor to higher orders is also proposed to enhance the effectiveness of SiLRTC-TT and TMac-TT. Simulation results for color image and video recovery show the clear advantage of our method over all other methods.

Journal ArticleDOI
TL;DR: Zhang et al. as mentioned in this paper designed a new convolutional neural network (CNN) to automatically learn the interaction mechanism for RGBD salient object detection, which takes advantage of the knowledge obtained in traditional saliency detection by adopting various flexible and interpretable saliency feature vectors as inputs.
Abstract: Numerous efforts have been made to design various low-level saliency cues for RGBD saliency detection, such as color and depth contrast features as well as background and color compactness priors. However, how these low-level saliency cues interact with each other and how they can be effectively incorporated to generate a master saliency map remain challenging problems. In this paper, we design a new convolutional neural network (CNN) to automatically learn the interaction mechanism for RGBD salient object detection. In contrast to existing works, in which raw image pixels are fed directly to the CNN, the proposed method takes advantage of the knowledge obtained in traditional saliency detection by adopting various flexible and interpretable saliency feature vectors as inputs. This guides the CNN to learn a combination of existing features to predict saliency more effectively, which presents a less complex problem than operating on the pixels directly. We then integrate a superpixel-based Laplacian propagation framework with the trained CNN to extract a spatially consistent saliency map by exploiting the intrinsic structure of the input image. Extensive quantitative and qualitative experimental evaluations on three data sets demonstrate that the proposed method consistently outperforms the state-of-the-art methods.

Journal ArticleDOI
TL;DR: A novel depth-aware salient object detection and segmentation framework via multiscale discriminative saliency fusion (MDSF) and bootstrap learning for RGBD images (RGB color images with corresponding Depth maps) and stereoscopic images achieves better performance on both saliency detection and salient object segmentation.
Abstract: This paper proposes a novel depth-aware salient object detection and segmentation framework via multiscale discriminative saliency fusion (MDSF) and bootstrap learning for RGBD images (RGB color images with corresponding Depth maps) and stereoscopic images. By exploiting low-level feature contrasts, mid-level feature weighted factors, and high-level location priors, various saliency measures on four classes of features are calculated based on multiscale region segmentation. A random forest regressor is learned to perform the discriminative saliency fusion (DSF) and generate the DSF saliency map at each scale, and DSF saliency maps across multiple scales are combined to produce the MDSF saliency map. Furthermore, we propose an effective bootstrap learning-based salient object segmentation method, which is bootstrapped with samples based on the MDSF saliency map and learns multiple kernel support vector machines. Experimental results on two large datasets show how various categories of features contribute to the saliency detection performance and demonstrate that the proposed framework achieves better performance on both saliency detection and salient object segmentation.

Journal ArticleDOI
TL;DR: A novel blind/no-reference (NR) model for assessing the perceptual quality of screen content pictures with big data learning and delivers computational efficiency and promising performance.
Abstract: Recent years have witnessed a growing number of image and video centric applications on mobile, vehicular, and cloud platforms, involving a wide variety of digital screen content images. Unlike natural scene images captured with modern high fidelity cameras, screen content images are typically composed of fewer colors, simpler shapes, and a larger frequency of thin lines. In this paper, we develop a novel blind/no-reference (NR) model for assessing the perceptual quality of screen content pictures with big data learning. The new model extracts four types of features descriptive of the picture complexity, of screen content statistics, of global brightness quality, and of the sharpness of details. Comparative experiments verify the efficacy of the new model as compared with existing relevant blind picture quality assessment algorithms applied on screen content image databases. A regression module is trained on a considerable number of training samples labeled with objective visual quality predictions delivered by a high-performance full-reference method designed for screen content image quality assessment (IQA). This results in an opinion-unaware NR blind screen content IQA algorithm. Our proposed model delivers computational efficiency and promising performance. The source code of the new model will be available at: https://sites.google.com/site/guke198701/publications

Journal ArticleDOI
TL;DR: This paper revisits the co-saliency detection task and advances its development into a new phase, where the problem setting is generalized to allow the image group to contain objects in an arbitrary number of categories and the algorithms need to simultaneously detect multi-class co-salient objects from such complex data.
Abstract: With the goal of discovering the common and salient objects from the given image group, co-saliency detection has received tremendous research interest in recent years. However, as most of the existing co-saliency detection methods are performed based on the assumption that all the images in the given image group should contain co-salient objects in only one category, they can hardly be applied in practice, particularly for the large-scale image set obtained from the Internet. To address this problem, this paper revisits the co-saliency detection task and advances its development into a new phase, where the problem setting is generalized to allow the image group to contain objects in an arbitrary number of categories and the algorithms need to simultaneously detect multi-class co-salient objects from such complex data. To solve this new challenge, we decompose it into two sub-problems, i.e., how to identify subgroups of relevant images and how to discover relevant co-salient objects from each subgroup, and propose a novel co-saliency detection framework to correspondingly address the two sub-problems via two-stage multi-view spectral rotation co-clustering. Comprehensive experiments on two publicly available benchmarks demonstrate the effectiveness of the proposed approach. Notably, it can even outperform the state-of-the-art co-saliency detection methods, which are performed based on the image subgroups carefully separated by human labor.

Journal ArticleDOI
TL;DR: In this paper, a semi-supervised sparse representation-based classification method was proposed to deal with the non-linear nuisance variations between labeled and unlabeled samples, in which faces are represented in terms of a gallery dictionary consisting of one or more examples of each person and a variation dictionary representing linear nuisance variables (e.g., different lighting conditions and different glasses).
Abstract: This paper addresses the problem of face recognition when there are only a few, or even only a single, labeled examples of the face that we wish to recognize. Moreover, these examples are typically corrupted by nuisance variables, both linear (i.e., additive nuisance variables, such as bad lighting and wearing of glasses) and non-linear (i.e., non-additive pixel-wise nuisance variables, such as expression changes). The small number of labeled examples means that it is hard to remove these nuisance variables between the training and testing faces to obtain good recognition performance. To address the problem, we propose a method called semi-supervised sparse representation-based classification. This is based on recent work on sparsity, where faces are represented in terms of two dictionaries: a gallery dictionary consisting of one or more examples of each person, and a variation dictionary representing linear nuisance variables (e.g., different lighting conditions and different glasses). The main idea is that: 1) we use the variation dictionary to characterize the linear nuisance variables via the sparsity framework and 2) prototype face images are estimated as a gallery dictionary via a Gaussian mixture model, with mixed labeled and unlabeled samples in a semi-supervised manner, to deal with the non-linear nuisance variations between labeled and unlabeled samples. We have done experiments with insufficient labeled samples, even when there is only a single labeled sample per person. Our results on the AR, Multi-PIE, CAS-PEAL, and LFW databases demonstrate that the proposed method is able to deliver significantly improved performance over existing methods.
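A sketch of the sparse-representation step with the two dictionaries described above: a test face is coded jointly over a gallery dictionary and a variation dictionary, then assigned to the gallery identity with the smallest reconstruction residual. The l1 solver here is scikit-learn's Lasso as a generic stand-in, and the semi-supervised GMM estimation of the gallery prototypes is omitted.

```python
import numpy as np
from sklearn.linear_model import Lasso

def classify_face(y, gallery, variation, gallery_labels, alpha=0.01):
    """y: (d,) test face; gallery: (d, n_g); variation: (d, n_v);
    gallery_labels: (n_g,) identity label per gallery atom."""
    D = np.hstack([gallery, variation])                   # joint dictionary [G, V]
    coef = Lasso(alpha=alpha, fit_intercept=False, max_iter=5000).fit(D, y).coef_
    a, b = coef[:gallery.shape[1]], coef[gallery.shape[1]:]
    residuals = {}
    for label in np.unique(gallery_labels):
        a_c = np.where(gallery_labels == label, a, 0.0)   # keep only this identity's atoms
        residuals[label] = np.linalg.norm(y - gallery @ a_c - variation @ b)
    return min(residuals, key=residuals.get)              # identity with smallest residual
```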

Journal ArticleDOI
TL;DR: A novel weakly supervised deep matrix factorization algorithm is proposed, which uncovers the latent image representations and tag representations embedded in the latent subspace by collaboratively exploring the weakly supervised tagging information, the visual structure, and the semantic structure.
Abstract: The number of images associated with weakly supervised user-provided tags has increased dramatically in recent years. User-provided tags are incomplete, subjective and noisy. In this paper, we focus on the problem of social image understanding, i.e., tag refinement, tag assignment, and image retrieval. Different from previous work, we propose a novel weakly supervised deep matrix factorization algorithm, which uncovers the latent image representations and tag representations embedded in the latent subspace by collaboratively exploring the weakly supervised tagging information, the visual structure, and the semantic structure. Due to the well-known semantic gap, the hidden representations of images are learned by a hierarchical model, which are progressively transformed from the visual feature space. It can naturally embed new images into the subspace using the learned deep architecture. The semantic and visual structures are jointly incorporated to learn a semantic subspace without overfitting the noisy, incomplete, or subjective tags. Besides, to remove the noisy or redundant visual features, a sparse model is imposed on the transformation matrix of the first layer in the deep architecture. Finally, a unified optimization problem with a well-defined objective function is developed to formulate the proposed problem and solved by a gradient descent procedure with curvilinear search. Extensive experiments on real-world social image databases are conducted on the tasks of image understanding: image tag refinement, assignment, and retrieval. Encouraging results are achieved with comparison with the state-of-the-art algorithms, which demonstrates the effectiveness of the proposed method.

Journal ArticleDOI
TL;DR: Structured Sparse Subspace Clustering (S3C) as discussed by the authors is a joint optimization framework for learning both the affinity and the segmentation, which is based on expressing each data point as a structured sparse linear combination of all other data points, where the structure is induced by a norm that depends on unknown segmentation.
Abstract: Subspace clustering refers to the problem of segmenting data drawn from a union of subspaces. State-of-the-art approaches for solving this problem follow a two-stage approach. In the first step, an affinity matrix is learned from the data using sparse or low-rank minimization techniques. In the second step, the segmentation is found by applying spectral clustering to this affinity. While this approach has led to the state-of-the-art results in many applications, it is suboptimal, because it does not exploit the fact that the affinity and the segmentation depend on each other. In this paper, we propose a joint optimization framework — Structured Sparse Subspace Clustering (S3C) — for learning both the affinity and the segmentation. The proposed S3C framework is based on expressing each data point as a structured sparse linear combination of all other data points, where the structure is induced by a norm that depends on the unknown segmentation. Moreover, we extend the proposed S3C framework into Constrained S3C (CS3C) in which available partial side-information is incorporated into the stage of learning the affinity. We show that both the structured sparse representation and the segmentation can be found via a combination of an alternating direction method of multipliers with spectral clustering. Experiments on a synthetic data set, the Extended Yale B face data set, the Hopkins 155 motion segmentation database, and three cancer data sets demonstrate the effectiveness of our approach.

Journal ArticleDOI
TL;DR: This paper presents a robust and efficient method for license plate detection that accurately localizes vehicle license plates in complex scenes in real time and substantially outperforms state-of-the-art methods in terms of both detection accuracy and run-time efficiency.
Abstract: This paper presents a robust and efficient method for license plate detection with the purpose of accurately localizing vehicle license plates from complex scenes in real time. A simple yet effective image downscaling method is first proposed to substantially accelerate license plate localization without sacrificing detection performance compared with that achieved using the original image. Furthermore, a novel line density filter approach is proposed to extract candidate regions, thereby significantly reducing the area to be analyzed for license plate localization. Moreover, a cascaded license plate classifier based on linear support vector machines using color saliency features is introduced to identify the true license plate from among the candidate regions. For performance evaluation, a data set consisting of 3977 images captured from diverse scenes under different conditions is also presented. Extensive experiments on the widely used Caltech license plate data set and our newly introduced data set demonstrate that the proposed approach substantially outperforms state-of-the-art methods in terms of both detection accuracy and run-time efficiency, increasing the detection ratio from 91.09% to 96.62% while decreasing the run time from 672 to 42 ms for processing an image with a resolution of $1082\times 728$ . The executable code and our collected data set are publicly available.

Journal ArticleDOI
TL;DR: Simulations reveal that this simple motion-compensated coder can efficiently extend the compression range of dynamic voxelized point clouds to rates below what intra-frame coding alone can accommodate, trading rate for geometry accuracy.
Abstract: Dynamic point clouds are a potential new frontier in visual communication systems. A few articles have addressed the compression of point clouds, but very few references exist on exploring temporal redundancies. This paper presents a novel motion-compensated approach to encoding dynamic voxelized point clouds at low bit rates. A simple coder breaks the voxelized point cloud at each frame into blocks of voxels. Each block is either encoded in intra-frame mode or is replaced by a motion-compensated version of a block in the previous frame. The decision is optimized in a rate-distortion sense. In this way, both the geometry and the color are encoded with distortion, allowing for reduced bit-rates. In-loop filtering is employed to minimize compression artifacts caused by distortion in the geometry information. Simulations reveal that this simple motion-compensated coder can efficiently extend the compression range of dynamic voxelized point clouds to rates below what intra-frame coding alone can accommodate, trading rate for geometry accuracy.
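A sketch of the per-block rate-distortion mode decision described above: each block of voxels is either intra-coded or replaced by a motion-compensated block from the previous frame, whichever minimizes the Lagrangian cost J = D + lambda * R. The cost-estimation callables stand in for the actual intra coder and motion search.

```python
def choose_block_mode(block, prev_frame_block, lam, intra_cost, mc_cost):
    """intra_cost / mc_cost: callables returning (distortion, rate_bits) for the block;
    they are placeholders for the actual intra coder and motion-compensation search."""
    d_intra, r_intra = intra_cost(block)
    d_mc, r_mc = mc_cost(block, prev_frame_block)
    j_intra = d_intra + lam * r_intra          # Lagrangian cost of intra coding
    j_mc = d_mc + lam * r_mc                   # Lagrangian cost of motion compensation
    return "intra" if j_intra <= j_mc else "motion_compensated"
```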

Journal ArticleDOI
TL;DR: An efficient algorithm is proposed to remove rain or snow from a single color image; its superiority over several state-of-the-art works is verified through both subjective and objective evaluations.
Abstract: In this paper, we propose an efficient algorithm to remove rain or snow from a single color image. Our algorithm takes advantage of two popular techniques employed in image processing, namely, image decomposition and dictionary learning. At first, a combination of rain/snow detection and a guided filter is used to decompose the input image into a complementary pair: 1) the low-frequency part that is almost completely free of rain or snow and 2) the high-frequency part that contains not only the rain/snow component but also some or even many details of the image. Then, we focus on the extraction of the image's details from the high-frequency part. To this end, we design a 3-layer hierarchical scheme. In the first layer, an overcomplete dictionary is trained and three classifications are carried out to classify the high-frequency part into rain/snow and non-rain/snow components, in which some common characteristics of rain/snow have been utilized. In the second layer, another combination of rain/snow detection and guided filtering is performed on the rain/snow component obtained in the first layer. In the third layer, the sensitivity of variance across color channels is computed to enhance the visual quality of the rain/snow-removed image. The effectiveness of our algorithm is verified through both subjective (the visual quality) and objective (through rendering rain/snow on some ground-truth images) approaches, which shows its superiority over several state-of-the-art works.

Journal ArticleDOI
TL;DR: The results show that the proposed linear algorithm, which is augmented by feature interaction, has advantages for motion detection in a real-time Kinect entertainment environment.
Abstract: The Kinect sensing devices have been widely used in current Human-Computer Interaction entertainment. A fundamental issue involved is to detect users' motions accurately and quickly. In this paper, we tackle it by proposing a linear algorithm, which is augmented by feature interaction. The linear property guarantees its speed, whereas feature interaction captures the higher-order effect from the data to enhance its accuracy. The Schatten-p norm is leveraged to integrate the main linear effect and the higher-order nonlinear effect by mining the correlation between them. The resulting classification model is a desirable combination of speed and accuracy. We propose a novel solution to solve our objective function. Experiments are performed on three public Kinect-based entertainment data sets related to fitness and gaming. The results show that our method has advantages for motion detection in a real-time Kinect entertainment environment.

Journal ArticleDOI
TL;DR: In this article, a graph Laplacian regularizer is proposed for image denoising in the continuous domain, and the convergence of the regularizer to a continuous domain functional is analyzed.
Abstract: Inverse imaging problems are inherently underdetermined, and hence, it is important to employ appropriate image priors for regularization. One recent popular prior—the graph Laplacian regularizer—assumes that the target pixel patch is smooth with respect to an appropriately chosen graph. However, the mechanisms and implications of imposing the graph Laplacian regularizer on the original inverse problem are not well understood. To address this problem, in this paper, we interpret neighborhood graphs of pixel patches as discrete counterparts of Riemannian manifolds and perform analysis in the continuous domain, providing insights into several fundamental aspects of graph Laplacian regularization for image denoising. Specifically, we first show the convergence of the graph Laplacian regularizer to a continuous-domain functional, integrating a norm measured in a locally adaptive metric space. Focusing on image denoising, we derive an optimal metric space assuming non-local self-similarity of pixel patches, leading to an optimal graph Laplacian regularizer for denoising in the discrete domain. We then interpret graph Laplacian regularization as an anisotropic diffusion scheme to explain its behavior during iterations, e.g., its tendency to promote piecewise smooth signals under certain settings. To verify our analysis, an iterative image denoising algorithm is developed. Experimental results show that our algorithm performs competitively with state-of-the-art denoising methods, such as BM3D for natural images, and outperforms them significantly for piecewise smooth images.
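A compact sketch of graph Laplacian regularized denoising on a single patch: with a graph Laplacian L built from inter-pixel weights, the denoised patch solves min_x ||y - x||^2 + mu x^T L x, whose closed form is the linear system (I + mu L) x = y. The weight construction below is a plain Gaussian on intensity over a 4-connected grid, a simple stand-in for the locally adaptive metric derived in the paper.

```python
import numpy as np
from scipy.sparse import identity, csr_matrix, diags
from scipy.sparse.linalg import spsolve

def denoise_patch(y, mu=5.0, sigma=0.1):
    """y: (h, w) noisy patch in [0, 1]. 4-connected graph with Gaussian intensity weights."""
    h, w = y.shape
    n = h * w
    idx = lambda i, j: i * w + j
    rows, cols, vals = [], [], []
    for i in range(h):
        for j in range(w):
            for di, dj in ((0, 1), (1, 0)):               # right and down neighbours
                if i + di < h and j + dj < w:
                    wgt = np.exp(-((y[i, j] - y[i + di, j + dj]) ** 2) / (2 * sigma ** 2))
                    a, b = idx(i, j), idx(i + di, j + dj)
                    rows += [a, b]
                    cols += [b, a]
                    vals += [wgt, wgt]
    W = csr_matrix((vals, (rows, cols)), shape=(n, n))    # symmetric adjacency
    degrees = np.asarray(W.sum(axis=1)).ravel()
    L = diags(degrees) - W                                # combinatorial graph Laplacian
    A = (identity(n) + mu * L).tocsr()
    x = spsolve(A, y.ravel())                             # solve (I + mu L) x = y
    return x.reshape(h, w)
```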

Journal ArticleDOI
TL;DR: A multigraph determinantal point process (MDPP) model is proposed to capture the full structure between different bands and efficiently find the optimal band subset in extensive hyperspectral applications.
Abstract: Band selection, as a special case of the feature selection problem, tries to remove redundant bands and select a few important bands to represent the whole image cube. This has attracted much attention, since the selected bands provide discriminative information for further applications and reduce the computational burden. Though hyperspectral band selection has gained rapid development in recent years, it is still a challenging task because of the following requirements: 1) an effective model can capture the underlying relations between different high-dimensional spectral bands; 2) a fast and robust measure function can adapt to general hyperspectral tasks; and 3) an efficient search strategy can find the desired selected bands in reasonable computational time. To satisfy these requirements, a multigraph determinantal point process (MDPP) model is proposed to capture the full structure between different bands and efficiently find the optimal band subset in extensive hyperspectral applications. There are three main contributions: 1) graphical model is naturally transferred to address band selection problem by the proposed MDPP; 2) multiple graphs are designed to capture the intrinsic relationships between hyperspectral bands; and 3) mixture DPP is proposed to model the multiple dependencies in the proposed multiple graphs, and offers an efficient search strategy to select the optimal bands. To verify the superiority of the proposed method, experiments have been conducted on three hyperspectral applications, such as hyperspectral classification, anomaly detection, and target detection. The reliability of the proposed method in generic hyperspectral tasks is experimentally proved on four real-world hyperspectral data sets.
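A sketch of the kind of determinantal selection underlying the approach: given a positive-definite kernel encoding similarity between spectral bands, a greedy MAP-style procedure repeatedly adds the band yielding the largest log-determinant of the selected sub-kernel, trading off quality and diversity. The kernel construction and the multigraph mixture are omitted; this is generic greedy DPP selection, not the paper's exact algorithm.

```python
import numpy as np

def greedy_dpp_band_selection(K, n_select):
    """K: (B, B) positive-definite kernel over spectral bands.
    Returns indices of bands chosen by greedy log-determinant maximization."""
    selected, remaining = [], list(range(K.shape[0]))
    for _ in range(n_select):
        best, best_logdet = None, -np.inf
        for b in remaining:
            idx = selected + [b]
            sign, logdet = np.linalg.slogdet(K[np.ix_(idx, idx)])
            if sign > 0 and logdet > best_logdet:         # skip non-positive-definite picks
                best, best_logdet = b, logdet
        if best is None:
            break
        selected.append(best)
        remaining.remove(best)
    return selected
```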