
Showing papers in "IEEE Transactions on Image Processing in 2019"


Journal ArticleDOI
TL;DR: In this article, the authors present a comprehensive study and evaluation of existing single image dehazing algorithms, using a new large-scale benchmark consisting of both synthetic and real-world hazy images, called Realistic Single-Image DEhazing (RESIDE).
Abstract: We present a comprehensive study and evaluation of existing single-image dehazing algorithms, using a new large-scale benchmark consisting of both synthetic and real-world hazy images, called REalistic Single-Image DEhazing (RESIDE). RESIDE highlights diverse data sources and image contents, and is divided into five subsets, each serving different training or evaluation purposes. We further provide a rich variety of criteria for dehazing algorithm evaluation, ranging from full-reference metrics to no-reference metrics and to subjective evaluation, and the novel task-driven evaluation. Experiments on RESIDE shed light on the comparisons and limitations of the state-of-the-art dehazing algorithms, and suggest promising future directions.

922 citations


Journal ArticleDOI
Hui Li, Xiaojun Wu
TL;DR: A novel deep learning architecture for infrared and visible image fusion is presented, in which the encoding network combines convolutional layers, a fusion layer, and a dense block in which the output of each layer is connected to every other layer.
Abstract: In this paper, we present a novel deep learning architecture for infrared and visible image fusion problems. In contrast to conventional convolutional networks, our encoding network combines convolutional layers, a fusion layer, and a dense block in which the output of each layer is connected to every other layer. We attempt to use this architecture to extract more useful features from source images in the encoding process, and two fusion layers (fusion strategies) are designed to fuse these features. Finally, the fused image is reconstructed by a decoder. Compared with existing fusion methods, the proposed fusion method achieves state-of-the-art performance in objective and subjective assessment.

703 citations


Journal ArticleDOI
TL;DR: The proposed method is extended for attribute style manipulation in an unsupervised manner and outperforms the state-of-the-art on realistic attribute editing with other facial details well preserved.
Abstract: Facial attribute editing aims to manipulate single or multiple attributes on a given face image, i.e., to generate a new face image with desired attributes while preserving other details. Recently, the generative adversarial net (GAN) and encoder–decoder architectures have often been combined to handle this task, with promising results. Based on the encoder–decoder architecture, facial attribute editing is achieved by decoding the latent representation of a given face conditioned on the desired attributes. Some existing methods attempt to establish an attribute-independent latent representation for further attribute editing. However, such an attribute-independent constraint on the latent representation is excessive because it restricts the capacity of the latent representation and may result in information loss, leading to over-smooth or distorted generation. Instead of imposing constraints on the latent representation, in this work, we propose to apply an attribute classification constraint to the generated image to just guarantee the correct change of desired attributes, i.e., to change what you want. Meanwhile, reconstruction learning is introduced to preserve attribute-excluding details, in other words, to only change what you want. In addition, adversarial learning is employed for visually realistic editing. These three components cooperate with each other, forming an effective framework for high-quality facial attribute editing, referred to as AttGAN. Furthermore, the proposed method is extended to attribute style manipulation in an unsupervised manner. Experiments on two wild datasets, CelebA and LFW, show that the proposed method outperforms the state-of-the-art on realistic attribute editing with other facial details well preserved.
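The abstract combines three training signals: an attribute classification constraint on the edited image, reconstruction learning, and adversarial learning. Below is a minimal numpy sketch of how such a weighted combination might look; the function names and the weights `lam_cls`/`lam_rec` are illustrative assumptions, not the authors' exact formulation.

```python
import numpy as np

def bce(pred, target, eps=1e-12):
    """Binary cross-entropy, standing in here for the attribute-classification term."""
    pred = np.clip(pred, eps, 1.0 - eps)
    return -np.mean(target * np.log(pred) + (1 - target) * np.log(1 - pred))

def attgan_style_loss(attr_pred, attr_target, x_rec, x_real, d_fake,
                      lam_cls=10.0, lam_rec=100.0):
    """Weighted sum of the three components described in the abstract:
    attribute classification on the edited image ("change what you want"),
    L1 reconstruction of the input under its own attributes ("only change
    what you want"), and an adversarial term for visual realism."""
    loss_cls = bce(attr_pred, attr_target)
    loss_rec = np.mean(np.abs(x_rec - x_real))
    loss_adv = -np.mean(np.log(np.clip(d_fake, 1e-12, 1.0)))
    return lam_cls * loss_cls + lam_rec * loss_rec + loss_adv

rng = np.random.default_rng(0)
print(attgan_style_loss(rng.random(13), (rng.random(13) > 0.5).astype(float),
                        rng.random((8, 8)), rng.random((8, 8)), rng.random(1)))
```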

633 citations


Journal ArticleDOI
TL;DR: Visualization results demonstrate that, compared with a CNN without the Gate Unit, ACNNs are capable of shifting attention from occluded patches to other related but unobstructed ones, and ACNNs outperform other state-of-the-art methods on several widely used in-the-lab facial expression datasets under the cross-dataset evaluation protocol.
Abstract: Facial expression recognition in the wild is challenging due to various unconstrained conditions. Although existing facial expression classifiers have been almost perfect on analyzing constrained frontal faces, they fail to perform well on partially occluded faces that are common in the wild. In this paper, we propose a convolutional neural network (CNN) with attention mechanism (ACNN) that can perceive the occlusion regions of the face and focus on the most discriminative un-occluded regions. ACNN is an end-to-end learning framework. It combines the multiple representations from facial regions of interest (ROIs). Each representation is weighted via a proposed gate unit that computes an adaptive weight from the region itself according to its unobstructedness and importance. Considering different ROIs, we introduce two versions of ACNN: patch-based ACNN (pACNN) and global–local-based ACNN (gACNN). pACNN only pays attention to local facial patches. gACNN integrates local representations at patch level with a global representation at image level. The proposed ACNNs are evaluated on both real and synthetic occlusions, including a self-collected facial expression dataset with real-world occlusions, the two largest in-the-wild facial expression datasets (RAF-DB and AffectNet), and their modifications with synthesized facial occlusions. Experimental results show that ACNNs improve the recognition accuracy on both non-occluded and occluded faces. Visualization results demonstrate that, compared with the CNN without the Gate Unit, ACNNs are capable of shifting attention from occluded patches to other related but unobstructed ones. ACNNs also outperform other state-of-the-art methods on several widely used in-the-lab facial expression datasets under the cross-dataset evaluation protocol.
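As a concrete illustration of the gate unit described above, the sketch below scores each ROI feature with a sigmoid-activated linear projection and uses the score to weight that region's contribution; the parameters `w` and `b` and the concatenation-based aggregation are illustrative assumptions rather than the exact ACNN design.

```python
import numpy as np

def gate_unit(region_feature, w, b):
    """Compute an adaptive weight in (0, 1) from the region itself: a
    sigmoid-activated linear projection acting as an 'unobstructedness /
    importance' score for this ROI."""
    return 1.0 / (1.0 + np.exp(-(region_feature @ w + b)))

def aggregate_regions(region_features, w, b):
    """Weight each ROI representation by its gate score and concatenate, so
    occluded patches contribute little to the combined representation."""
    return np.concatenate([gate_unit(f, w, b) * f for f in region_features])

rng = np.random.default_rng(0)
rois = [rng.normal(size=64) for _ in range(24)]   # 24 ROI feature vectors
w, b = rng.normal(size=64), 0.0                   # illustrative parameters
combined = aggregate_regions(rois, w, b)          # shape: (24 * 64,)
```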

536 citations


Journal ArticleDOI
TL;DR: DeepCrack, an end-to-end trainable deep convolutional neural network for automatic crack detection that learns high-level features for crack representation and outperforms the current state-of-the-art methods.
Abstract: Cracks are typical line structures that are of interest in many computer-vision applications. In practice, many cracks, e.g., pavement cracks, show poor continuity and low contrast, which bring great challenges to image-based crack detection by using low-level features. In this paper, we propose DeepCrack, an end-to-end trainable deep convolutional neural network for automatic crack detection by learning high-level features for crack representation. In this method, multi-scale deep convolutional features learned at hierarchical convolutional stages are fused together to capture the line structures. More detailed representations are made in larger-scale feature maps and more holistic representations are made in smaller-scale feature maps. We build DeepCrack net on the encoder–decoder architecture of SegNet and pairwisely fuse the convolutional features generated in the encoder network and in the decoder network at the same scale. We train DeepCrack net on one crack dataset and evaluate it on three others. The experimental results demonstrate that DeepCrack achieves an average $F$-measure over 0.87 on the three challenging datasets and outperforms the current state-of-the-art methods.
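Since the evaluation above is reported as an $F$-measure over crack pixels, here is a small numpy sketch of a pixel-wise $F$-measure between a binarized prediction and the ground-truth mask; how predictions are thresholded and averaged over a dataset is an assumption of this sketch.

```python
import numpy as np

def f_measure(pred, gt, eps=1e-12):
    """Harmonic mean of pixel-wise precision and recall between a binary
    crack prediction and the ground-truth mask."""
    tp = np.logical_and(pred == 1, gt == 1).sum()
    precision = tp / (pred.sum() + eps)
    recall = tp / (gt.sum() + eps)
    return 2 * precision * recall / (precision + recall + eps)

gt = np.zeros((8, 8), dtype=int); gt[3, :] = 1      # a thin crack line
pred = np.zeros_like(gt); pred[3, 1:7] = 1          # partially detected
print(round(f_measure(pred, gt), 3))                # 0.857
```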

449 citations


Journal ArticleDOI
TL;DR: A new deep locality-preserving convolutional neural network (DLP-CNN) method that aims to enhance the discriminative power of deep features by preserving the locality closeness while maximizing the inter-class scatter is proposed.
Abstract: Facial expression is central to human experience, but most previous databases and studies are limited to posed facial behavior under controlled conditions. In this paper, we present a novel facial expression database, Real-world Affective Face Database (RAF-DB), which contains approximately 30 000 facial images with uncontrolled poses and illumination from thousands of individuals of diverse ages and races. During the crowdsourcing annotation, each image is independently labeled by approximately 40 annotators. An expectation–maximization algorithm is developed to reliably estimate the emotion labels, which reveals that real-world faces often express compound or even mixture emotions. A cross-database study between RAF-DB and the CK+ database further indicates that the action units of real-world emotions are much more diverse than, or even deviate from, those of laboratory-controlled emotions. To address the recognition of multi-modal expressions in the wild, we propose a new deep locality-preserving convolutional neural network (DLP-CNN) method that aims to enhance the discriminative power of deep features by preserving the locality closeness while maximizing the inter-class scatter. Benchmark experiments on 7-class basic expressions and 11-class compound expressions, as well as additional experiments on the CK+, MMI, and SFEW 2.0 databases, show that the proposed DLP-CNN outperforms state-of-the-art handcrafted features and deep learning-based methods for expression recognition in the wild. To promote further study, we have made the RAF database, benchmarks, and descriptor encodings publicly available to the research community.

429 citations


Journal ArticleDOI
TL;DR: This paper introduces pose-invariant embedding (PIE) as a pedestrian descriptor and shows that PoseBox alone yields decent re-ID accuracy and that when integrated in the PBF network, the learned PIE descriptor produces competitive performance compared with state-of-the-art approaches.
Abstract: Pedestrian misalignment, which mainly arises from detector errors and pose variations, is a critical problem for a robust person re-identification (re-ID) system. With poor alignment, the feature learning and matching process might be largely compromised. To address this problem, this paper introduces pose-invariant embedding (PIE) as a pedestrian descriptor. First, in order to align pedestrians to a standard pose, the PoseBox structure is introduced, which is generated through pose estimation followed by affine transformations. Second, to reduce the impact of pose estimation errors and information loss during the PoseBox construction, we design a PoseBox fusion (PBF) CNN architecture that takes the original image, the PoseBox, and the pose estimation confidence as input. The proposed PIE descriptor is thus defined as the fully connected layer of the PBF network for the retrieval task. Experiments are conducted on the Market-1501, CUHK03-NP, and DukeMTMC-reID datasets. We show that PoseBox alone yields decent re-ID accuracy and that when integrated in the PBF network, the learned PIE descriptor produces competitive performance compared with state-of-the-art approaches.

386 citations


Journal ArticleDOI
TL;DR: This work builds upon existing state-of-the-art object detection systems and proposes a simple but effective method to train rotation-invariant and Fisher discriminative CNN models to further boost object detection performance.
Abstract: The performance of object detection has recently been significantly improved due to the powerful features learnt through convolutional neural networks (CNNs). Despite the remarkable success, there are still several major challenges in object detection, including object rotation, within-class diversity, and between-class similarity, which generally degenerate object detection performance. To address these issues, we build upon the existing state-of-the-art object detection systems and propose a simple but effective method to train rotation-invariant and Fisher discriminative CNN models to further boost object detection performance. This is achieved by optimizing a new objective function that explicitly imposes a rotation-invariant regularizer and a Fisher discrimination regularizer on the CNN features. Specifically, the first regularizer enforces the CNN feature representations of the training samples before and after rotation to be mapped closely to each other in order to achieve rotation invariance. The second regularizer constrains the CNN features to have small within-class scatter but large between-class separation. We implement our proposed method under four popular object detection frameworks, including region-CNN (R-CNN), Fast R-CNN, Faster R-CNN, and R-FCN. In the experiments, we comprehensively evaluate the proposed method on the PASCAL VOC 2007 and 2012 data sets and a publicly available aerial image data set. Our proposed methods outperform the existing baseline methods and achieve state-of-the-art results.
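The two regularizers described above can be pictured as simple penalties on the CNN features. The numpy sketch below gives simplified forms (squared distances for rotation invariance, within-class minus between-class scatter for the Fisher term); these are assumptions about the structure, not the paper's exact objective.

```python
import numpy as np

def rotation_invariant_penalty(feat_orig, feats_rotated):
    """Mean squared distance between a sample's feature and the features of
    its rotated copies (simplified form of the first regularizer)."""
    return np.mean([np.sum((f - feat_orig) ** 2) for f in feats_rotated])

def fisher_penalty(features, labels):
    """Within-class scatter minus between-class scatter (simplified form of
    the second regularizer): small values mean compact, well-separated classes."""
    mu = features.mean(axis=0)
    within, between = 0.0, 0.0
    for c in np.unique(labels):
        fc = features[labels == c]
        within += np.sum((fc - fc.mean(axis=0)) ** 2)
        between += len(fc) * np.sum((fc.mean(axis=0) - mu) ** 2)
    return within - between

rng = np.random.default_rng(0)
X, y = rng.normal(size=(20, 8)), np.repeat([0, 1], 10)
print(rotation_invariant_penalty(X[0], [X[0] + 0.1 * rng.normal(size=8) for _ in range(3)]))
print(fisher_penalty(X, y))
```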

367 citations


Journal ArticleDOI
TL;DR: A multiview consensus clustering method is proposed to learn a consensus graph by minimizing disagreement between different views and constraining the rank of the Laplacian matrix.
Abstract: A graph is usually formed to reveal the relationship between data points, and the graph structure is encoded by the affinity matrix. Most graph-based multiview clustering methods use predefined affinity matrices, and the clustering performance highly depends on the quality of the graph. We learn a consensus graph by minimizing disagreement between different views and constraining the rank of the Laplacian matrix. Since diverse views admit the same underlying cluster structure across multiple views, we use a new disagreement cost function for regularizing graphs from different views toward a common consensus. Simultaneously, we impose a rank constraint on the Laplacian matrix to learn the consensus graph with exactly $k$ connected components, where $k$ is the number of clusters, which is different from using fixed affinity matrices in most existing graph-based methods. With the learned consensus graph, we can directly obtain the cluster labels without performing any post-processing, such as the $k$-means clustering algorithm in spectral clustering-based methods. A multiview consensus clustering method is proposed to learn such a graph. An efficient iterative updating algorithm is derived to optimize the proposed challenging optimization problem. Experiments on several benchmark datasets have demonstrated the effectiveness of the proposed method in terms of seven metrics.
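The rank constraint mentioned above works because the multiplicity of the zero eigenvalue of a graph Laplacian equals the number of connected components, so forcing rank(L) = n - k yields a graph whose k components are the clusters. A small numpy check of that property follows; the example affinity matrix is illustrative.

```python
import numpy as np

def num_connected_components(affinity, tol=1e-8):
    """Count eigenvalues of the unnormalized Laplacian L = D - W that are
    (numerically) zero; this equals the number of connected components, so
    constraining rank(L) = n - k gives exactly k components / clusters."""
    W = (affinity + affinity.T) / 2.0        # symmetrize the learned graph
    L = np.diag(W.sum(axis=1)) - W
    eigvals = np.linalg.eigvalsh(L)
    return int(np.sum(eigvals < tol))

# Two disconnected pairs -> 2 components, i.e., 2 clusters read off directly.
W = np.array([[0, 1, 0, 0],
              [1, 0, 0, 0],
              [0, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)
print(num_connected_components(W))  # 2
```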

304 citations


Journal ArticleDOI
TL;DR: A novel spatially variant recurrent neural network (RNN) is proposed as an edge stream to model edge details, with the guidance of another auto-encoder, to enhance the visibility of degraded images.
Abstract: Camera sensors often fail to capture clear images or videos in a poorly lit environment. In this paper, we propose a trainable hybrid network to enhance the visibility of such degraded images. The proposed network consists of two distinct streams to simultaneously learn the global content and the salient structures of the clear image in a unified network. More specifically, the content stream estimates the global content of the low-light input through an encoder–decoder network. However, the encoder in the content stream tends to lose some structure details. To remedy this, we propose a novel spatially variant recurrent neural network (RNN) as an edge stream to model edge details, with the guidance of another auto-encoder. The experimental results show that the proposed network favorably performs against the state-of-the-art low-light image enhancement algorithms.

293 citations


Journal ArticleDOI
TL;DR: In the proposed architecture, a cross-modal distillation stream, accompanying the RGB-specific and depth-specific streams, is introduced to extract new RGB-D features in each level in the bottom–up path, and the channel-wise attention mechanism is innovatively introduced to the cross-modal cross-level fusion problem to adaptively select complementary feature maps from each modality in each level.
Abstract: Previous RGB-D fusion systems based on convolutional neural networks typically employ a two-stream architecture, in which RGB and depth inputs are learned independently. The multi-modal fusion stage is typically performed by concatenating the deep features from each stream in the inference process. The traditional two-stream architecture might experience insufficient multi-modal fusion due to the following two limitations: 1) the cross-modal complementarity is rarely studied in the bottom–up path, wherein we believe the cross-modal complements can be combined to learn new discriminative features to enlarge the RGB-D representation community and 2) the cross-modal channels are typically combined by undifferentiated concatenation, which appears ambiguous for selecting cross-modal complementary features. In this paper, we address these two limitations by proposing a novel three-stream attention-aware multi-modal fusion network. In the proposed architecture, a cross-modal distillation stream, accompanying the RGB-specific and depth-specific streams, is introduced to extract new RGB-D features in each level in the bottom–up path. Furthermore, the channel-wise attention mechanism is innovatively introduced to the cross-modal cross-level fusion problem to adaptively select complementary feature maps from each modality in each level. Extensive experiments report the effectiveness of the proposed architecture and the significant improvement over the state-of-the-art RGB-D salient object detection methods.

Journal ArticleDOI
TL;DR: Experimental results on three person ReID datasets, i.e., Market1501, CUHK03, and VIPeR, show that the proposed deep representation learning procedure named part loss network outperforms existing deep representations.
Abstract: Learning discriminative representations for unseen person images is critical for person re-identification (ReID). Most of the current approaches learn deep representations in classification tasks, which essentially minimize the empirical classification risk on the training set. As shown in our experiments, such representations easily get over-fitted on a discriminative human body part on the training set. To gain the discriminative power on unseen person images, we propose a deep representation learning procedure named part loss network, to minimize both the empirical classification risk on training person images and the representation learning risk on unseen person images. The representation learning risk is evaluated by the proposed part loss, which automatically detects human body parts and computes the person classification loss on each part separately. Compared with traditional global classification loss, simultaneously considering part loss enforces the deep network to learn representations for different body parts and gain the discriminative power on unseen persons. Experimental results on three person ReID datasets, i.e., Market1501, CUHK03, and VIPeR, show that our representation outperforms existing deep representations.
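The part loss described above amounts to computing a classification loss on each automatically detected body part in addition to the global loss. A minimal numpy sketch of that combination follows; the per-part logits and the balancing weight `lam` are illustrative assumptions.

```python
import numpy as np

def cross_entropy(logits, label):
    """Softmax cross-entropy for a single example."""
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

def part_loss_objective(global_logits, part_logits_list, label, lam=1.0):
    """Empirical classification risk on the whole image plus the average
    per-part classification loss, as described in the abstract; lam is an
    illustrative balancing weight, not the authors' value."""
    global_term = cross_entropy(global_logits, label)
    part_term = np.mean([cross_entropy(p, label) for p in part_logits_list])
    return global_term + lam * part_term

rng = np.random.default_rng(0)
print(part_loss_objective(rng.normal(size=10),
                          [rng.normal(size=10) for _ in range(4)], label=3))
```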

Journal ArticleDOI
TL;DR: In this paper, a novel spectral mixture model, called the augmented linear mixing model (ALMM), is proposed to address spectral variability by applying a data-driven learning strategy in inverse problems of hyperspectral unmixing.
Abstract: Hyperspectral imagery collected from airborne or satellite sources inevitably suffers from spectral variability, making it difficult for spectral unmixing to accurately estimate abundance maps. The classical unmixing model, the linear mixing model (LMM), generally fails to handle this issue effectively. To this end, we propose a novel spectral mixture model, called the augmented LMM, to address spectral variability by applying a data-driven learning strategy in inverse problems of hyperspectral unmixing. The proposed approach models the main spectral variability (i.e., scaling factors) generated by variations in illumination or topography separately by means of the endmember dictionary. It then models other spectral variabilities caused by environmental conditions (e.g., local temperature and humidity and atmospheric effects), instrumental configurations (e.g., sensor noise), and material nonlinear mixing effects by introducing a spectral variability dictionary. To effectively run the data-driven learning strategy, we also propose reasonable prior knowledge for the spectral variability dictionary, whose atoms are assumed to be low-coherent with the spectral signatures of endmembers, which leads to a well-known low-coherence dictionary learning problem. Thus, a dictionary learning technique is embedded in the framework of spectral unmixing so that the algorithm can learn the spectral variability dictionary and estimate the abundance maps simultaneously. Extensive experiments on synthetic and real datasets are performed to demonstrate the superiority and effectiveness of the proposed method in comparison with the previous state-of-the-art methods.
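The mixture model described verbally above can be read as: a pixel equals a scaling factor times the linear endmember mixture, plus a contribution spanned by a spectral-variability dictionary, plus noise. The numpy snippet below synthesizes one pixel under that reading; the exact formulation and symbols in the paper may differ.

```python
import numpy as np

def almm_pixel(E, a, s, D, b, noise_std=0.0, rng=None):
    """Generate one pixel under the model sketched in the abstract: a scaling
    factor s (illumination/topography) applied to the endmember mixture E @ a,
    plus D @ b from a spectral-variability dictionary (environment, sensor,
    nonlinearity), plus additive noise."""
    rng = rng or np.random.default_rng(0)
    y = s * (E @ a) + D @ b
    if noise_std > 0:
        y = y + rng.normal(0.0, noise_std, size=y.shape)
    return y

L, P, K = 200, 4, 10                       # bands, endmembers, dictionary atoms
E = np.abs(np.random.default_rng(1).normal(size=(L, P)))   # endmember signatures
a = np.array([0.5, 0.2, 0.2, 0.1])         # abundances, sum to 1
D = np.random.default_rng(2).normal(size=(L, K)) * 0.01    # variability dictionary
b = np.random.default_rng(3).normal(size=K)
y = almm_pixel(E, a, 0.9, D, b, noise_std=0.001)
```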

Journal ArticleDOI
TL;DR: Experimental results show that CamStyle significantly improves the performance of the baseline in person re-identification (re-ID) and UDA, and achieves state-of-the-art accuracy based on a baseline deep re-ID model on Market-1501 and DukeMTMC-reID.
Abstract: Person re-identification (re-ID) is a cross-camera retrieval task that suffers from image style variations caused by different cameras. Prior art implicitly addresses this problem by learning a camera-invariant descriptor subspace. In this paper, we explicitly consider this challenge by introducing camera style (CamStyle). CamStyle can serve as a data augmentation approach that reduces the risk of deep network overfitting and smooths the camera style disparities. Specifically, with a style transfer model, labeled training images can be style-transferred to each camera and, along with the original training samples, form the augmented training set. This method, while increasing data diversity against overfitting, also incurs a considerable level of noise. To alleviate the impact of noise, label smooth regularization (LSR) is adopted. The vanilla version of our method (without LSR) performs reasonably well on few-camera systems in which overfitting often occurs. With LSR, we demonstrate consistent improvement in all systems regardless of the extent of overfitting. We also report competitive accuracy compared with the state of the art on Market-1501 and DukeMTMC-re-ID. Importantly, CamStyle can be applied to the challenging problems of one-view learning and unsupervised domain adaptation (UDA) in person re-identification (re-ID), both of which have critical research and application significance. The former only has labeled data in one camera view, and the latter only has labeled data in the source domain. Experimental results show that CamStyle significantly improves the performance of the baseline in the two problems. In particular, for UDA, CamStyle achieves state-of-the-art accuracy based on a baseline deep re-ID model on Market-1501 and DukeMTMC-reID. Our code is available at: https://github.com/zhunzhong07/CamStyle .
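The label smooth regularization (LSR) mentioned above softens the one-hot target so that noisy, style-transferred samples are penalized less harshly. A numpy sketch of the smoothed cross-entropy follows; epsilon = 0.1 is a common choice and an assumption here, not necessarily the paper's setting.

```python
import numpy as np

def lsr_cross_entropy(logits, label, epsilon=0.1):
    """Cross-entropy against a smoothed target: the true class gets most of
    the probability mass and the remainder is spread uniformly over all
    classes, which softens the penalty on noisy style-transferred samples."""
    n = logits.shape[0]
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    target = np.full(n, epsilon / n)
    target[label] += 1.0 - epsilon
    return -(target * log_probs).sum()

logits = np.array([2.0, 0.5, -1.0, 0.0])
print(lsr_cross_entropy(logits, label=0))
```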

Journal ArticleDOI
TL;DR: Wang et al. as discussed by the authors proposed a new discriminative correlation filter (DCF) based tracking method, which enables joint spatial-temporal filter learning in a lower-dimensional discriminative manifold, and applied structured spatial sparsity constraints to multi-channel filters.
Abstract: With efficient appearance learning models, discriminative correlation filter (DCF) has been proven to be very successful in recent video object tracking benchmarks and competitions. However, the existing DCF paradigm suffers from two major issues, i.e., spatial boundary effect and temporal filter degradation. To mitigate these challenges, we propose a new DCF-based tracking method. The key innovations of the proposed method include adaptive spatial feature selection and temporal consistent constraints, with which the new tracker enables joint spatial-temporal filter learning in a lower dimensional discriminative manifold. More specifically, we apply structured spatial sparsity constraints to multi-channel filters. Consequently, the process of learning spatial filters can be approximated by the lasso regularization. To encourage temporal consistency, the filter model is restricted to lie around its historical value and updated locally to preserve the global structure in the manifold. Last, a unified optimization framework is proposed to jointly select temporal consistency preserving spatial features and learn discriminative filters with the augmented Lagrangian method. Qualitative and quantitative evaluations have been conducted on a number of well-known benchmarking datasets such as OTB2013, OTB50, OTB100, Temple-Colour, UAV123, and VOT2018. The experimental results demonstrate the superiority of the proposed method over the state-of-the-art approaches.
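Learning spatially sparse filters with a lasso-type penalty, as described above, typically reduces to applying the soft-thresholding (proximal) operator inside each augmented-Lagrangian/ADMM update. A numpy sketch of that operator is shown below; the threshold value is illustrative.

```python
import numpy as np

def soft_threshold(x, tau):
    """Proximal operator of the l1 (lasso) penalty: shrinks filter
    coefficients toward zero, which is how spatial sparsity can be imposed
    on the correlation filters in each alternating update."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

filt = np.random.default_rng(0).normal(size=(31, 31))   # a toy spatial filter
sparse_filt = soft_threshold(filt, tau=0.5)             # tau is illustrative
print(np.mean(sparse_filt == 0))                        # fraction of suppressed weights
```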

Journal ArticleDOI
TL;DR: In this article, a novel deep learning-based approach for one-class transfer learning is presented, in which labeled data from an unrelated task is used for feature learning in one-class classification.
Abstract: We present a novel deep-learning-based approach for one-class transfer learning in which labeled data from an unrelated task is used for feature learning in one-class classification. The proposed method operates on top of a convolutional neural network (CNN) of choice and produces descriptive features while maintaining a low intra-class variance in the feature space for the given class. For this purpose, two loss functions, compactness loss and descriptiveness loss, are proposed along with a parallel CNN architecture. A template matching-based framework is introduced to facilitate the testing process. Extensive experiments on publicly available anomaly detection, novelty detection, and mobile active authentication datasets show that the proposed deep one-class (DOC) classification method achieves significant improvements over the state-of-the-art.
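Of the two losses named above, the compactness loss is the one that enforces low intra-class variance. Below is a simplified numpy stand-in; the paper's exact definition may differ, e.g., in how the mean is computed.

```python
import numpy as np

def compactness_loss(features):
    """Mean squared distance of each feature vector to the batch mean:
    drives the one-class features toward low intra-class variance."""
    mean = features.mean(axis=0, keepdims=True)
    return np.mean(np.sum((features - mean) ** 2, axis=1))

batch = np.random.default_rng(0).normal(size=(32, 128))  # 32 samples, 128-D features
print(compactness_loss(batch))
```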

Journal ArticleDOI
TL;DR: TextField as discussed by the authors learns a direction field pointing away from the nearest text boundary to each text point, which is represented by an image of 2D vectors and learned via a fully convolutional neural network.
Abstract: Scene text detection is an important step in the scene text reading system. The main challenges lie in significantly varied sizes and aspect ratios, arbitrary orientations, and shapes. Driven by the recent progress in deep learning, impressive performances have been achieved for multi-oriented text detection. Yet, the performance drops dramatically in detecting the curved texts due to the limited text representation (e.g., horizontal bounding boxes, rotated rectangles, or quadrilaterals). It is of great interest to detect the curved texts, which are actually very common in natural scenes. In this paper, we present a novel text detector named TextField for detecting irregular scene texts. Specifically, we learn a direction field pointing away from the nearest text boundary to each text point. This direction field is represented by an image of 2D vectors and learned via a fully convolutional neural network. It encodes both binary text mask and direction information used to separate adjacent text instances, which is challenging for the classical segmentation-based approaches. Based on the learned direction field, we apply a simple yet effective morphological-based post-processing to achieve the final detection. The experimental results show that the proposed TextField outperforms the state-of-the-art methods by a large margin (28% and 8%) on two curved text datasets: Total-Text and SCUT-CTW1500, respectively; TextField also achieves very competitive performance on multi-oriented datasets: ICDAR 2015 and MSRA-TD500. Furthermore, TextField is robust in generalizing to unseen datasets.
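To make the "direction field pointing away from the nearest text boundary" concrete, the sketch below derives such a field from a binary text mask using SciPy's Euclidean distance transform, with the nearest non-text pixel serving as a proxy for the nearest boundary; this illustrates the kind of target the network regresses and is not the authors' code.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def direction_field(text_mask):
    """For every text pixel, build a unit 2D vector pointing away from the
    nearest non-text pixel (proxy for the nearest text boundary)."""
    dist, (iy, ix) = distance_transform_edt(text_mask, return_indices=True)
    yy, xx = np.indices(text_mask.shape)
    vy, vx = (yy - iy).astype(float), (xx - ix).astype(float)
    norm = np.maximum(np.hypot(vy, vx), 1e-6)
    field = np.stack([vy / norm, vx / norm], axis=0)
    field[:, text_mask == 0] = 0.0           # undefined outside text regions
    return field

mask = np.zeros((9, 9), dtype=np.uint8)
mask[3:6, 2:7] = 1                            # a small text blob
print(direction_field(mask).shape)            # (2, 9, 9)
```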

Journal ArticleDOI
TL;DR: DeepISP as mentioned in this paper is a full end-to-end deep neural model of the camera image signal processing pipeline that learns a mapping from the raw low-light mosaiced image to the final visually compelling image.
Abstract: We present DeepISP, a full end-to-end deep neural model of the camera image signal processing pipeline. Our model learns a mapping from the raw low-light mosaiced image to the final visually compelling image and encompasses low-level tasks, such as demosaicing and denoising, as well as higher-level tasks, such as color correction and image adjustment. The training and evaluation of the pipeline were performed on a dedicated data set containing pairs of low-light and well-lit images captured by a Samsung S7 smartphone camera in both raw and processed JPEG formats. The proposed solution achieves the state-of-the-art performance in objective evaluation of peak signal-to-noise ratio on the subtask of joint denoising and demosaicing. For the full end-to-end pipeline, it achieves better visual quality compared to the manufacturer ISP, in both a subjective human assessment and when rated by a deep model trained for assessing image quality.

Journal ArticleDOI
TL;DR: A new approach for learning-based video quality assessment is proposed, based on the idea of computing features in two levels so that low complexity features are computed for the full sequence first, and then high complexity features are extracted from a subset of representative video frames, selected by using the low complexity features.
Abstract: Smartphones and other consumer devices capable of capturing video content and sharing it on social media in nearly real time are widely available at a reasonable cost. Thus, there is a growing need for no-reference video quality assessment (NR-VQA) of consumer produced video content, typically characterized by capture impairments that are qualitatively different from those observed in professionally produced video content. To date, most of the NR-VQA models in prior art have been developed for assessing coding and transmission distortions, rather than capture impairments. In addition, the most accurate NR-VQA methods known in prior art are often computationally complex, and therefore impractical for many real life applications. In this paper, we propose a new approach for learning-based video quality assessment, based on the idea of computing features in two levels so that low complexity features are computed for the full sequence first, and then high complexity features are extracted from a subset of representative video frames, selected by using the low complexity features. We have compared the proposed method against several relevant benchmark methods using three recently published annotated public video quality databases, and our results show that the proposed method can predict subjective video quality more accurately than the benchmark methods. The best performing prior method achieves nearly similar accuracy, but at substantially higher computational cost.

Journal ArticleDOI
TL;DR: This paper proposes a progressive framework that gradually exploits the unlabeled data for person re-ID with comparable performance to the supervised state-of-the-art method with 100% labeled data.
Abstract: In this paper, we focus on the one-example person re-identification (re-ID) task, where each identity has only one labeled example along with many unlabeled examples. We propose a progressive framework that gradually exploits the unlabeled data for person re-ID. In this framework, we iteratively: 1) update the convolutional neural network (CNN) model and 2) estimate pseudo labels for the unlabeled data. We split the training data into three parts, i.e., labeled data, pseudo-labeled data, and index-labeled data. Initially, the re-ID model is trained using the labeled data. For the subsequent model training, we update the CNN model by the joint training on the three data parts. The proposed joint training method can optimize the model by both the data with labels (or pseudo labels) and the data without any reliable labels. For the label estimation step, instead of using a static sampling strategy, we propose a progressive sampling strategy to increase the number of the selected pseudo-labeled candidates step by step. We select a few candidates with most reliable pseudo labels from unlabeled examples as the pseudo-labeled data, and keep the rest as index-labeled data by assigning them with the data indexes. During iterations, the index-labeled data are dynamically transferred to pseudo-labeled data. Notably, the rank-1 accuracy of our method outperforms the state-of-the-art method by 21.6 points (absolute, i.e., 62.8% versus 41.2%) on MARS, and 16.6 points on DukeMTMC-VideoReID. Extended to the few-example setting, our approach with only 20% labeled data surprisingly achieves comparable performance to the supervised state-of-the-art method with 100% labeled data.
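The label-estimation step described above can be pictured as a nearest-neighbor assignment in feature space, keeping only the most reliable fraction as pseudo-labeled data while that fraction grows across iterations. The numpy sketch below is a simplified stand-in with hypothetical feature inputs, not the paper's implementation.

```python
import numpy as np

def select_pseudo_labels(unlabeled_feats, labeled_feats, labeled_ids, frac):
    """One label-estimation step: assign each unlabeled feature the identity
    of its nearest labeled example and keep only the top `frac` most reliable
    (smallest-distance) candidates as pseudo-labeled data; the rest stay
    index-labeled until `frac` grows in later iterations."""
    d = np.linalg.norm(unlabeled_feats[:, None] - labeled_feats[None], axis=2)
    nearest = d.argmin(axis=1)
    confidence = -d.min(axis=1)                  # closer = more reliable
    k = int(len(unlabeled_feats) * frac)
    chosen = np.argsort(-confidence)[:k]
    return chosen, labeled_ids[nearest[chosen]]  # selected indices, pseudo labels

rng = np.random.default_rng(0)
lab, unl = rng.normal(size=(5, 16)), rng.normal(size=(100, 16))
idx, plabels = select_pseudo_labels(unl, lab, np.arange(5), frac=0.2)  # frac grows per iteration
```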

Journal ArticleDOI
TL;DR: This paper proposes a novel deep generative approach to cross-modal retrieval to learn hash functions in the absence of paired training samples through the cycle consistency loss, and employs adversarial training scheme to learn a couple of hash functions enabling translation between modalities while assuming the underlying semantic relationship.
Abstract: In this paper, we propose a novel deep generative approach to cross-modal retrieval to learn hash functions in the absence of paired training samples through the cycle consistency loss. Our proposed approach employs adversarial training scheme to learn a couple of hash functions enabling translation between modalities while assuming the underlying semantic relationship. To induce the hash codes with semantics to the input-output pair, cycle consistency loss is further delved into the adversarial training to strengthen the correlation between the inputs and corresponding outputs. Our approach is generative to learn hash functions, such that the learned hash codes can maximally correlate each input–output correspondence and also regenerate the inputs so as to minimize the information loss. The learning to hash embedding is thus performed to jointly optimize the parameters of the hash functions across modalities as well as the associated generative models. Extensive experiments on a variety of large-scale cross-modal data sets demonstrate that our proposed method outperforms the state of the arts.

Journal ArticleDOI
TL;DR: Li et al. as discussed by the authors proposed a weakly supervised adversarial domain adaptation method to improve segmentation performance from synthetic data to real scenes, which consists of three deep neural networks: a detection and segmentation (DS) model that detects objects and predicts the segmentation map; a pixel-level domain classifier (PDC) that distinguishes which domain image features come from; and an object-level domain classifier (ODC) that discriminates which domain objects come from and predicts object classes.
Abstract: Semantic segmentation, a pixel-level vision task, is rapidly developed by using convolutional neural networks (CNNs). Training CNNs requires a large amount of labeled data, but manually annotating data is difficult. To reduce this annotation effort, several synthetic datasets have been released in recent years. However, they still differ from real scenes, so a model trained on synthetic data (source domain) cannot achieve good performance on real urban scenes (target domain). In this paper, we propose a weakly supervised adversarial domain adaptation method to improve the segmentation performance from synthetic data to real scenes, which consists of three deep neural networks. A detection and segmentation (DS) model focuses on detecting objects and predicting the segmentation map; a pixel-level domain classifier (PDC) tries to distinguish which domain image features come from; and an object-level domain classifier (ODC) discriminates which domain objects come from and predicts object classes. PDC and ODC are treated as the discriminators, and DS is considered the generator. Through adversarial learning, DS is expected to learn domain-invariant features. In experiments, our proposed method sets a new record on the mIoU metric for this problem.

Journal ArticleDOI
TL;DR: Shape classification and retrieval results under three large-scale benchmarks verify that SeqViews2SeqLabels learns more discriminative global features by more effectively aggregating sequential views than state-of-the-art methods.
Abstract: Learning 3D global features by aggregating multiple views has been introduced as a successful strategy for 3D shape analysis. In recent deep learning models with end-to-end training, pooling is a widely adopted procedure for view aggregation. However, pooling merely retains the max or mean value over all views, which disregards the content information of almost all views and also the spatial information among the views. To resolve these issues, we propose Sequential Views To Sequential Labels (SeqViews2SeqLabels) as a novel deep learning model with an encoder–decoder structure based on recurrent neural networks (RNNs) with attention. SeqViews2SeqLabels consists of two connected parts, an encoder-RNN followed by a decoder-RNN, that aim to learn the global features by aggregating sequential views and then performing shape classification from the learned global features, respectively. Specifically, the encoder-RNN learns the global features by simultaneously encoding the spatial and content information of sequential views, which captures the semantics of the view sequence. With the proposed prediction of sequential labels, the decoder-RNN performs more accurate classification using the learned global features by predicting sequential labels step by step. Learning to predict sequential labels provides more and finer discriminative information among shape classes to learn, which alleviates the overfitting problem inherent in training using a limited number of 3D shapes. Moreover, we introduce an attention mechanism to further improve the discriminative ability of SeqViews2SeqLabels. This mechanism increases the weight of views that are distinctive to each shape class, and it dramatically reduces the effect of selecting the first view position. Shape classification and retrieval results under three large-scale benchmarks verify that SeqViews2SeqLabels learns more discriminative global features by more effectively aggregating sequential views than state-of-the-art methods.

Journal ArticleDOI
TL;DR: Zhang et al. as discussed by the authors proposed a high-confidence manipulation localization architecture that utilizes resampling features, long short-term memory (LSTM) cells, and an encoder-decoder network to segment out manipulated regions from non-manipulated ones.
Abstract: With advanced image journaling tools, one can easily alter the semantic meaning of an image by exploiting certain manipulation techniques such as copy clone, object splicing, and removal, which mislead the viewers. In contrast, the identification of these manipulations becomes a very challenging task as manipulated regions are not visually apparent. This paper proposes a high-confidence manipulation localization architecture that utilizes resampling features, long short-term memory (LSTM) cells, and an encoder–decoder network to segment out manipulated regions from non-manipulated ones. Resampling features are used to capture artifacts, such as JPEG quality loss, upsampling, downsampling, rotation, and shearing. The proposed network exploits larger receptive fields (spatial maps) and frequency-domain correlation to analyze the discriminative characteristics between the manipulated and non-manipulated regions by incorporating the encoder and LSTM network. Finally, the decoder network learns the mapping from low-resolution feature maps to pixel-wise predictions for image tamper localization. With the predicted mask provided by the final layer (softmax) of the proposed architecture, end-to-end training is performed to learn the network parameters through back-propagation using the ground-truth masks. Furthermore, a large image splicing dataset is introduced to guide the training process. The proposed method is capable of localizing image manipulations at the pixel level with high precision, which is demonstrated through rigorous experimentation on three diverse datasets.

Journal ArticleDOI
TL;DR: In this article, an alternative method for solving inverse problems using off-the-shelf denoisers, which requires less parameter tuning, is proposed, where the prior term is handled solely by a denoising operation.
Abstract: Inverse problems appear in many applications, such as image deblurring and inpainting. The common approach to address them is to design a specific algorithm for each problem. The Plug-and-Play (P&P) framework, which has been recently introduced, allows solving general inverse problems by leveraging the impressive capabilities of existing denoising algorithms. While this fresh strategy has found many applications, a burdensome parameter tuning is often required in order to obtain high-quality results. In this paper, we propose an alternative method for solving inverse problems using off-the-shelf denoisers, which requires less parameter tuning. First, we transform a typical cost function, composed of fidelity and prior terms, into a closely related, novel optimization problem. Then, we propose an efficient minimization scheme with a P&P property, i.e., the prior term is handled solely by a denoising operation. Finally, we present an automatic tuning mechanism to set the method’s parameters. We provide a theoretical analysis of the method and empirically demonstrate its competitiveness with task-specific techniques and the P&P approach for image inpainting and deblurring.
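To illustrate the Plug-and-Play idea that the prior term is handled solely by a denoising operation, the sketch below alternates a data-fidelity step with an off-the-shelf denoiser (a Gaussian filter stands in for a real denoiser) on a toy inpainting problem; this is a generic P&P-style loop, not the specific scheme or automatic tuning mechanism proposed in the paper.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def pnp_inpaint(y, mask, n_iters=50, sigma=1.0):
    """Alternate a fidelity step (keep the observed pixels) with a prior step
    handled solely by a denoiser, in the Plug-and-Play spirit."""
    x = y.copy()
    for _ in range(n_iters):
        x[mask] = y[mask]              # fidelity: enforce observed pixels
        x = gaussian_filter(x, sigma)  # prior: off-the-shelf denoising operation
    x[mask] = y[mask]
    return x

rng = np.random.default_rng(0)
img = rng.random((64, 64))
mask = rng.random((64, 64)) > 0.5      # observed-pixel mask
restored = pnp_inpaint(img * mask, mask)
```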

Journal ArticleDOI
TL;DR: This work proposes a novel approach to jointly exploit feature adaptation with distribution matching and sample adaptation with landmark selection, suitable for both homogeneous- and heterogeneous-domain adaptations by learning domain-specific projections.
Abstract: Domain adaptation aims to leverage knowledge from a well-labeled source domain to a poorly labeled target domain. A majority of existing works transfer the knowledge at either feature level or sample level. Recent studies reveal that both of the paradigms are essentially important, and optimizing one of them can reinforce the other. Inspired by this, we propose a novel approach to jointly exploit feature adaptation with distribution matching and sample adaptation with landmark selection. During the knowledge transfer, we also take the local consistency between the samples into consideration so that the manifold structures of samples can be preserved. At last, we deploy label propagation to predict the categories of new instances. Notably, our approach is suitable for both homogeneous- and heterogeneous-domain adaptations by learning domain-specific projections. Extensive experiments on five open benchmarks, which consist of both standard and large-scale datasets, verify that our approach can significantly outperform not only conventional approaches but also end-to-end deep models. The experiments also demonstrate that we can leverage handcrafted features to promote the accuracy on deep features by heterogeneous adaptation.

Journal ArticleDOI
TL;DR: The LIVE Video Quality Challenge Database (LIVE-VQC) as mentioned in this paper is a large-scale video quality assessment database containing 585 videos of unique content, captured by a large number of users, with wide ranges of levels of complex, authentic distortions.
Abstract: The great variations in videographic skills, camera designs, compression and processing protocols, communication and bandwidth environments, and displays lead to an enormous variety of video impairments. Current no-reference (NR) video quality models are unable to handle this diversity of distortions. This is true in part because available video quality assessment databases contain very limited content, fixed resolutions, were captured using a small number of camera devices by a few videographers, and have been subjected to a modest number of distortions. As such, these databases fail to adequately represent real-world videos, which contain very different kinds of content obtained under highly diverse imaging conditions and are subject to authentic, complex, and often commingled distortions that are difficult or impossible to simulate. As a result, NR video quality predictors tested on real-world video data often perform poorly. Toward advancing NR video quality prediction, we have constructed a large-scale video quality assessment database containing 585 videos of unique content, captured by a large number of users, with wide ranges of levels of complex, authentic distortions. We collected a large number of subjective video quality scores via crowdsourcing. A total of 4776 unique participants took part in the study, yielding over 205 000 opinion scores, resulting in an average of 240 recorded human opinions per video. We demonstrate the value of the new resource, which we call the LIVE Video Quality Challenge Database (LIVE-VQC), by conducting a comparison with leading NR video quality predictors on it. This is the largest video quality assessment study ever conducted along several key dimensions: number of unique contents, capture devices, distortion types and combinations of distortions, study participants, and recorded subjective scores. The database is available for download at: http://live.ece.utexas.edu/research/LIVEVQC/index.html .

Journal ArticleDOI
TL;DR: This paper proposes a novel approach to one-shot learning that learns to map a novel sample instance to a concept, relates that concept to the existing ones in the concept space and, using these relationships, generates new instances by interpolating among the concepts to help learning.
Abstract: The ability to quickly recognize and learn new visual concepts from limited samples enables humans to quickly adapt to new tasks and environments. This ability is enabled by the semantic association of novel concepts with those that have already been learned and stored in memory. Computers can begin to attain similar abilities by utilizing a semantic concept space. A concept space is a high-dimensional semantic space in which similar abstract concepts appear close and dissimilar ones far apart. In this paper, we propose a novel approach to one-shot learning that builds on this core idea. Our approach learns to map a novel sample instance to a concept, relates that concept to the existing ones in the concept space and, using these relationships, generates new instances, by interpolating among the concepts, to help learning. Instead of synthesizing new image instances, we propose to directly synthesize instance features by leveraging semantics using a novel auto-encoder network called dual TriNet. The encoder part of the TriNet learns to map multi-layer visual features from a CNN to a semantic vector. In the semantic space, we search for related concepts, which are then projected back into the image feature spaces by the decoder portion of the TriNet. Two strategies in the semantic space are explored. Notably, this seemingly simple strategy results in complex augmented feature distributions in the image feature space, leading to substantially better performance.

Journal ArticleDOI
TL;DR: This work proposes a Stacked Deconvolutional Network (SDN) for semantic segmentation and achieves new state-of-the-art results on four datasets, including PASCAL VOC 2012, CamVid, GATECH, and COCO Stuff.
Abstract: Recent progress in semantic segmentation has been driven by improving the spatial resolution under Fully Convolutional Networks (FCNs). To address this problem, we propose a Stacked Deconvolutional Network (SDN) for semantic segmentation. In SDN, multiple shallow deconvolutional networks, which are called SDN units, are stacked one by one to integrate contextual information and bring fine recovery of localization information. Meanwhile, inter-unit and intra-unit connections are designed to assist network training and enhance feature fusion, since the connections improve the flow of information and gradient propagation throughout the network. Besides, hierarchical supervision is applied during the upsampling process of each SDN unit, which enhances the discrimination of feature representations and benefits the network optimization. We carry out comprehensive experiments and achieve new state-of-the-art results on four datasets, including PASCAL VOC 2012, CamVid, GATECH, and COCO Stuff. In particular, our best model without CRF post-processing achieves an intersection-over-union score of 86.6% on the test set.

Journal ArticleDOI
TL;DR: This paper quantitatively analyzes the structure of the proposed CNN model from multiple dimensions to make the model interpretable and optimal for CNN-based loop filtering for high-efficiency video coding (HEVC).
Abstract: Recently, convolutional neural network (CNN) has attracted tremendous attention and has achieved great success in many image processing tasks. In this paper, we focus on CNN technology combined with image restoration to facilitate video coding performance and propose content-aware CNN-based in-loop filtering for high-efficiency video coding (HEVC). In particular, we quantitatively analyze the structure of the proposed CNN model from multiple dimensions to make the model interpretable and optimal for CNN-based loop filtering. More specifically, each coding tree unit (CTU) is treated as an independent region for processing, such that the proposed content-aware multimodel filtering mechanism is realized by the restoration of different regions with different CNN models under the guidance of the discriminative network. To adapt to the image content, the discriminative neural network is learned to analyze the content characteristics of each region for the adaptive selection of the deep learning model. The CTU-level control is also enabled in the sense of rate-distortion optimization. To learn the CNN model, an iterative training method is proposed by simultaneously labeling filter categories at the CTU level and fine-tuning the CNN model parameters. The CNN-based in-loop filter is implemented after sample adaptive offset in HEVC, and extensive experiments show that the proposed approach significantly improves the coding performance and achieves up to 10.0% bit-rate reduction. On average, 4.1%, 6.0%, 4.7%, and 6.0% bit-rate reduction can be obtained under all intra, low delay, low delay P, and random access configurations, respectively.