
Showing papers in "IEEE Transactions on Circuits and Systems for Video Technology in 2020"


Journal ArticleDOI
TL;DR: This work attempts to leverage powerful generative modeling capabilities of the recently introduced conditional generative adversarial networks (CGAN) by enforcing an additional constraint that the de-rained image must be indistinguishable from its corresponding ground truth clean image.
Abstract: Severe weather conditions, such as rain and snow, adversely affect the visual quality of images captured under such conditions, thus rendering them useless for further usage and sharing. In addition, such degraded images drastically affect the performance of vision systems. Hence, it is important to address the problem of single image de-raining. However, the inherent ill-posed nature of the problem presents several challenges. We attempt to leverage the powerful generative modeling capabilities of the recently introduced conditional generative adversarial networks (CGAN) by enforcing an additional constraint that the de-rained image must be indistinguishable from its corresponding ground truth clean image. The adversarial loss from the GAN provides additional regularization and helps to achieve superior results. In addition to presenting a new approach to de-rain images, we introduce a new refined loss function and architectural novelties in the generator–discriminator pair for achieving improved results. The loss function is aimed at reducing artifacts introduced by GANs and ensuring better visual quality. The generator sub-network is constructed using the recently introduced densely connected networks, whereas the discriminator is designed to leverage global and local information to decide if an image is real or fake. Based on this, we propose a novel single image de-raining method called image de-raining conditional generative adversarial network (ID-CGAN) that incorporates quantitative, visual, and discriminative performance into the objective function. Experiments evaluated on synthetic and real images show that the proposed method outperforms many recent state-of-the-art single image de-raining methods in terms of quantitative and visual performance. Furthermore, experimental results evaluated on object detection datasets using Faster-RCNN also demonstrate the effectiveness of the proposed method in improving the detection performance on images degraded by rain.
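
As a rough illustration of how such a conditional GAN objective can be set up, the sketch below combines an adversarial term with a per-pixel fidelity term for a de-raining generator. The module interfaces, the L1 fidelity term, and the loss weight are illustrative assumptions, not the paper's exact refined loss or ID-CGAN architecture.

```python
# Minimal sketch (PyTorch) of a CGAN-style de-raining objective: the generator output
# is pushed to fool a discriminator conditioned on the rainy input while staying close
# to the clean ground truth. All module interfaces and weights here are assumptions.
import torch
import torch.nn as nn

adv_criterion = nn.BCEWithLogitsLoss()   # adversarial term
pix_criterion = nn.L1Loss()              # per-pixel fidelity term
lambda_pix = 100.0                       # assumed weighting, not taken from the paper

def generator_loss(discriminator, derained, clean, rainy):
    """Combined loss for one batch; `discriminator` is assumed to take the rainy
    image concatenated with a candidate de-rained image and return logits."""
    fake_logits = discriminator(torch.cat([rainy, derained], dim=1))
    adv = adv_criterion(fake_logits, torch.ones_like(fake_logits))  # fool the discriminator
    pix = pix_criterion(derained, clean)                            # stay close to ground truth
    return adv + lambda_pix * pix
```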

747 citations


Journal ArticleDOI
TL;DR: A deep bilinear model for blind image quality assessment that works for both synthetically and authentically distorted images and achieves state-of-the-art performance on both synthetic and authentic IQA databases is proposed.
Abstract: We propose a deep bilinear model for blind image quality assessment that works for both synthetically and authentically distorted images. Our model constitutes two streams of deep convolutional neural networks (CNNs), specializing in the two distortion scenarios separately. For synthetic distortions, we first pre-train a CNN to classify the distortion type and level of an input image, whose ground truth label is readily available at a large scale. For authentic distortions, we make use of a CNN (VGG-16) pre-trained on the image classification task. The two feature sets are bilinearly pooled into one representation for a final quality prediction. We fine-tune the whole network on the target databases using a variant of stochastic gradient descent. The extensive experimental results show that the proposed model achieves state-of-the-art performance on both synthetic and authentic IQA databases. Furthermore, we verify the generalizability of our method on the large-scale Waterloo Exploration Database, and demonstrate its competitiveness using the group maximum differentiation competition methodology.
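
A toy sketch of the bilinear pooling step that fuses two feature streams into a single quality representation is shown below. The feature dimensions, the signed-square-root normalization, and the linear regressor are assumptions for illustration, not the paper's exact configuration.

```python
# Toy sketch (PyTorch) of bilinearly pooling two feature streams into one quality
# representation followed by a linear quality regressor. In the paper the two streams
# are a distortion-aware CNN and VGG-16; here they are stand-in feature vectors.
import torch
import torch.nn as nn

class BilinearQualityHead(nn.Module):
    def __init__(self, dim_a=128, dim_b=512):
        super().__init__()
        self.regressor = nn.Linear(dim_a * dim_b, 1)  # maps pooled features to a score

    def forward(self, feat_a, feat_b):
        # Outer product of the two streams, flattened per sample.
        pooled = torch.einsum('bi,bj->bij', feat_a, feat_b).flatten(1)
        # Signed square-root and L2 normalization are common after bilinear pooling.
        pooled = torch.sign(pooled) * torch.sqrt(pooled.abs() + 1e-8)
        pooled = nn.functional.normalize(pooled, dim=1)
        return self.regressor(pooled)

scores = BilinearQualityHead()(torch.randn(4, 128), torch.randn(4, 512))  # (4, 1)
```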

390 citations


Journal ArticleDOI
TL;DR: This paper proposes a novel video summarization framework named attentive encoder–decoder networks for video summarization (AVS), in which the encoder uses a bidirectional long short-term memory (BiLSTM) to encode the contextual information among the input video frames.
Abstract: This paper addresses the problem of supervised video summarization by formulating it as a sequence-to-sequence learning problem, where the input is a sequence of original video frames, and the output is a keyshot sequence. Our key idea is to learn a deep summarization network with an attention mechanism to mimic the way humans select keyshots. To this end, we propose a novel video summarization framework named attentive encoder–decoder networks for video summarization (AVS), in which the encoder uses a bidirectional long short-term memory (BiLSTM) to encode the contextual information among the input video frames. As for the decoder, two attention-based LSTM networks are explored by using additive and multiplicative objective functions, respectively. Extensive experiments are conducted on two video summarization benchmark datasets, i.e., SumMe and TVSum. The results demonstrate the superiority of the proposed AVS-based approaches over the state-of-the-art approaches, with remarkable improvements on both datasets.
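
The two decoder variants differ mainly in how attention scores over the BiLSTM encoder states are computed. The sketch below shows the standard additive and multiplicative scoring functions as one plausible reading of the abstract; the dimensions and the exact parameterization are assumptions, not the paper's.

```python
# Sketch (PyTorch) of additive (Bahdanau-style) and multiplicative (bilinear) attention
# scores over encoder states, the two scoring families the AVS decoders are said to use.
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    def __init__(self, enc_dim, dec_dim, attn_dim=128):
        super().__init__()
        self.w_enc = nn.Linear(enc_dim, attn_dim, bias=False)
        self.w_dec = nn.Linear(dec_dim, attn_dim, bias=False)
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, enc_states, dec_state):
        # enc_states: (T, enc_dim), dec_state: (dec_dim,)
        scores = self.v(torch.tanh(self.w_enc(enc_states) + self.w_dec(dec_state))).squeeze(-1)
        return torch.softmax(scores, dim=0)            # attention weights over frames

def multiplicative_attention(enc_states, dec_state, w):
    # Bilinear score s_t = h_t^T W d, then softmax over the T frames.
    scores = enc_states @ w @ dec_state
    return torch.softmax(scores, dim=0)

# Usage with illustrative sizes: 30 encoder states of width 512, decoder state of width 256.
weights_add = AdditiveAttention(512, 256)(torch.randn(30, 512), torch.randn(256))
weights_mul = multiplicative_attention(torch.randn(30, 512), torch.randn(256), torch.randn(512, 256))
```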

257 citations


Journal ArticleDOI
TL;DR: A new data augmentation technique called random image cropping and patching (RICAP) which randomly crops four images and patches them to create a new training image and achieves a new state-of-the-art test error of 2.19% on CIFAR-10.
Abstract: Deep convolutional neural networks (CNNs) have achieved remarkable results in image processing tasks. However, their high expressive ability risks overfitting. Consequently, data augmentation techniques have been proposed to prevent overfitting while enriching datasets. Recent CNN architectures with more parameters are rendering traditional data augmentation techniques insufficient. In this study, we propose a new data augmentation technique called random image cropping and patching (RICAP) which randomly crops four images and patches them to create a new training image. Moreover, RICAP mixes the class labels of the four images, which provides the advantage of soft labels. We evaluated RICAP with current state-of-the-art CNNs (e.g., the shake-shake regularization model) by comparison with competitive data augmentation techniques such as cutout and mixup. RICAP achieves a new state-of-the-art test error of 2.19% on CIFAR-10. We also confirmed that deep CNNs with RICAP achieve better results on classification tasks using CIFAR-100 and ImageNet, an image-caption retrieval task using Microsoft COCO, and other computer vision tasks.
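
A minimal NumPy sketch of the augmentation as the abstract describes it is given below: four images are randomly cropped, patched into one canvas, and their labels are mixed in proportion to the patched areas. The Beta prior on the boundary point and the function interface are assumptions based on common descriptions of RICAP, not the paper's exact procedure.

```python
# Minimal NumPy sketch of RICAP-style augmentation: crop four images, patch them into
# one training image, and mix their class labels by patched area.
import numpy as np

def ricap(images, labels, num_classes, beta=0.3, rng=None):
    """images: (4, H, W, C) array; labels: (4,) integer class ids."""
    if rng is None:
        rng = np.random.default_rng()
    _, h, w, c = images.shape
    # Boundary point splitting the canvas into four regions.
    wx = int(np.round(w * rng.beta(beta, beta)))
    hy = int(np.round(h * rng.beta(beta, beta)))
    sizes = [(hy, wx), (hy, w - wx), (h - hy, wx), (h - hy, w - wx)]
    offsets = [(0, 0), (0, wx), (hy, 0), (hy, wx)]
    canvas = np.zeros((h, w, c), dtype=images.dtype)
    soft_label = np.zeros(num_classes, dtype=np.float64)
    for k, ((ph, pw), (oy, ox)) in enumerate(zip(sizes, offsets)):
        if ph == 0 or pw == 0:
            continue
        # Random crop of size (ph, pw) from image k, pasted into its region.
        y0 = rng.integers(0, h - ph + 1)
        x0 = rng.integers(0, w - pw + 1)
        canvas[oy:oy + ph, ox:ox + pw] = images[k, y0:y0 + ph, x0:x0 + pw]
        soft_label[labels[k]] += (ph * pw) / (h * w)   # label weight proportional to area
    return canvas, soft_label
```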

256 citations


Journal ArticleDOI
TL;DR: The evolution and development of neural network-based compression methodologies are introduced for images and video, respectively, and the joint compression of semantic and visual information is tentatively explored to formulate a high-efficiency signal representation structure for both human vision and machine vision.
Abstract: In recent years, image and video coding technologies have advanced by leaps and bounds. However, due to the popularization of image and video acquisition devices, the growth rate of image and video data is far beyond the improvement of the compression ratio. In particular, it has been widely recognized that there are increasing challenges in pursuing further coding performance improvement within the traditional hybrid coding framework. Deep convolutional neural networks, which have driven the resurgence of neural networks in recent years and achieved great success in both the artificial intelligence and signal processing fields, also provide a novel and promising solution for image and video compression. In this paper, we provide a systematic, comprehensive and up-to-date review of neural network-based image and video compression techniques. The evolution and development of neural network-based compression methodologies are introduced for images and video, respectively. More specifically, the cutting-edge video coding techniques that leverage deep learning and the HEVC framework are presented and discussed, which promote the state-of-the-art video coding performance substantially. Moreover, the end-to-end image and video coding frameworks based on neural networks are also reviewed, revealing interesting explorations of next-generation image and video coding frameworks/standards. The most significant research works on image and video coding related topics using neural networks are highlighted, and future trends are also envisioned. In particular, the joint compression of semantic and visual information is tentatively explored to formulate a high-efficiency signal representation structure for both human vision and machine vision, which are the two dominant signal receptors in the age of artificial intelligence.

235 citations


Journal ArticleDOI
TL;DR: A channel-wise and spatial feature modulation (CSFM) network in which a series of feature modulation memory (FMM) modules are cascaded with a densely connected structure to transform shallow features into highly informative features and maintain long-term information for image super-resolution.
Abstract: The performance of single image super-resolution has achieved significant improvement by utilizing deep convolutional neural networks (CNNs). The features in a deep CNN contain different types of information which make different contributions to image reconstruction. However, most CNN-based models lack the discriminative ability for different types of information and deal with them equally, which limits the representational capacity of the models. On the other hand, as the depth of the neural network grows, the long-term information coming from preceding layers is easily weakened or lost at later layers, which is adverse to super-resolving the image. To capture more informative features and maintain long-term information for image super-resolution, we propose a channel-wise and spatial feature modulation (CSFM) network in which a series of feature modulation memory (FMM) modules are cascaded with a densely connected structure to transform shallow features into highly informative features. In each FMM module, we construct a set of channel-wise and spatial attention residual (CSAR) blocks and stack them in a chain structure to dynamically modulate the multi-level features in global and local manners. This feature modulation strategy enables the valuable information to be enhanced and the redundant information to be suppressed. Meanwhile, for long-term information persistence, a gated fusion (GF) node is attached at the end of the FMM module to adaptively fuse hierarchical features and distill more effective information via the dense skip connections and the gating mechanism. The extensive quantitative and qualitative evaluations on benchmark datasets illustrate the superiority of our proposed method over the state-of-the-art methods.
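
The sketch below shows one plausible form of a channel-wise and spatial attention residual block in the spirit of the CSAR blocks described above: channel attention from global pooling and spatial attention from a 1x1 convolution jointly modulate the residual features. The layer sizes, reduction ratio, and exact composition are assumptions, not the paper's configuration.

```python
# Sketch (PyTorch) of a channel-wise and spatial attention residual block: the residual
# features are rescaled per channel and per spatial position before the skip connection.
import torch
import torch.nn as nn

class CSARBlock(nn.Module):
    def __init__(self, channels=64, reduction=16):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1))
        # Channel-wise attention: squeeze spatially, excite per channel.
        self.channel_att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid())
        # Spatial attention: one modulation weight per spatial position.
        self.spatial_att = nn.Sequential(nn.Conv2d(channels, 1, 1), nn.Sigmoid())

    def forward(self, x):
        feat = self.body(x)
        feat = feat * self.channel_att(feat) * self.spatial_att(feat)
        return x + feat                                  # residual connection

y = CSARBlock()(torch.randn(1, 64, 32, 32))              # (1, 64, 32, 32)
```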

228 citations


Journal ArticleDOI
Xin Liao, Yingbo Yu, Bin Li, Zhongpeng Li, Zheng Qin
TL;DR: A novel channel-dependent payload partition strategy based on amplifying channel modification probabilities is proposed, so as to adaptively assign the embedding capacity among RGB channels, and the experimental results show that the new color image steganographic schemes, incorporated with the proposed strategy, can effectively concentrate the embedding changes mainly in textured regions and achieve better performance in resisting modern color image steganalysis.
Abstract: In traditional steganographic schemes, payloads are assigned equally to the three RGB channels of a true color image. In fact, the security of color image steganography relates not only to the data-embedding algorithm but also to the payload partition among channels. How to exploit inter-channel correlations to allocate payload for performance enhancement is still an open issue in color image steganography. In this paper, a novel channel-dependent payload partition strategy based on amplifying channel modification probabilities is proposed, so as to adaptively assign the embedding capacity among RGB channels. The modification probabilities of the three corresponding pixels in the RGB channels are simultaneously increased, so that the embedding impacts can be clustered, in order to improve the empirical steganographic security against channel co-occurrence detection. The experimental results show that the new color image steganographic schemes, incorporated with the proposed strategy, can effectively concentrate the embedding changes mainly in textured regions and achieve better performance in resisting modern color image steganalysis.

220 citations


Journal ArticleDOI
TL;DR: Inspired by the success of the Transformer model in machine translation, this work extends it to a Multimodal Transformer (MT) model for image captioning that significantly outperforms the previous state-of-the-art methods.
Abstract: Image captioning aims to automatically generate a natural language description of a given image, and most state-of-the-art models have adopted an encoder-decoder framework. The framework consists of a convolutional neural network (CNN)-based image encoder that extracts region-based visual features from the input image, and a recurrent neural network (RNN)-based caption decoder that generates the output caption words based on the visual features with the attention mechanism. Despite the success of existing studies, current methods only model the co-attention that characterizes the inter-modal interactions while neglecting the self-attention that characterizes the intra-modal interactions. Inspired by the success of the Transformer model in machine translation, here we extend it to a Multimodal Transformer (MT) model for image captioning. Compared to existing image captioning approaches, the MT model simultaneously captures intra- and inter-modal interactions in a unified attention block. Due to the in-depth modular composition of such attention blocks, the MT model can perform complex multimodal reasoning and output accurate captions. Moreover, to further improve the image captioning performance, multi-view visual features are seamlessly introduced into the MT model. We quantitatively and qualitatively evaluate our approach using the benchmark MSCOCO image captioning dataset and conduct extensive ablation studies to investigate the reasons behind its effectiveness. The experimental results show that our method significantly outperforms the previous state-of-the-art methods. With an ensemble of seven models, our solution ranks first on the real-time leaderboard of the MSCOCO image captioning challenge at the time of the writing of this paper.
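
One way to read a "unified attention block" that models both intra- and inter-modal interactions is a Transformer layer that applies self-attention over caption tokens followed by cross-attention to image-region features, as sketched below. The dimensions, the layer layout, and the use of torch.nn.MultiheadAttention are illustrative assumptions, not the MT model's exact block.

```python
# Sketch (PyTorch) of a block combining intra-modal self-attention over words with
# inter-modal cross-attention from words to image-region features.
import torch
import torch.nn as nn

class UnifiedAttentionBlock(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(dim) for _ in range(3))

    def forward(self, words, regions):
        # Intra-modal: caption tokens attend to each other.
        w = self.norm1(words + self.self_attn(words, words, words)[0])
        # Inter-modal: caption tokens attend to visual region features.
        w = self.norm2(w + self.cross_attn(w, regions, regions)[0])
        return self.norm3(w + self.ffn(w))

# 2 captions of 12 tokens attending over 36 region features per image.
out = UnifiedAttentionBlock()(torch.randn(2, 12, 512), torch.randn(2, 36, 512))
```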

206 citations


Journal ArticleDOI
TL;DR: This survey on open-world re-ID provides guidance for improving the usability of the re-ID technique in practical applications and summarizes the state-of-the-art methods and future directions from both narrow and generalized perspectives.
Abstract: Person re-identification (re-ID) has been a popular topic in the computer vision and pattern recognition communities for a decade. Several important milestones in recent years, such as metric-based and deeply-learned re-ID, have promoted this topic. However, most existing re-ID works are designed for closed-world scenarios rather than realistic open-world settings, which limits the practical application of the re-ID technique. On one hand, the performance of the latest re-ID methods has surpassed human-level performance on several commonly used benchmarks (e.g., Market1501 and CUHK03), which are collected from closed-world scenarios. On the other hand, open-world tasks that are less developed and more challenging have received increasing attention in the re-ID community. Therefore, this paper makes the first attempt to analyze the trends of open-world re-ID and summarizes them from both narrow and generalized perspectives. In the narrow perspective, open-world re-ID is regarded as person verification (i.e., open-set re-ID) instead of person identification; that is, the query person may not occur in the gallery set. In the generalized perspective, application-driven methods that are designed for specific applications are defined as generalized open-world re-ID. Their settings are usually close to realistic application requirements. Specifically, this survey mainly covers the following four points for open-world re-ID: 1) analyzing the discrepancies between closed- and open-world scenarios; 2) describing the developments of existing open-set re-ID works and their limitations; 3) introducing specific application-driven works from three aspects, namely, raw data, practical procedure, and efficiency; and 4) summarizing the state-of-the-art methods and future directions for open-world re-ID. This survey on open-world re-ID provides guidance for improving the usability of the re-ID technique in practical applications.

175 citations


Journal ArticleDOI
TL;DR: A fast intra-coding algorithm consisting of a low-complexity coding tree unit (CTU) structure decision and a fast intra mode decision; the complexity reduction of the proposed algorithm is up to 70% compared to the VVC reference software, and an average encoding time saving of 63% is achieved.
Abstract: The quadtree with nested multi-type tree (QTMT) partition structure is an efficient improvement in versatile video coding (VVC) over the quadtree (QT) structure in the advanced high-efficiency video coding (HEVC) standard. In addition to the recursive QT partition structure, a recursive multi-type tree partition is applied to each leaf node, which generates more flexible block sizes. Besides, intra prediction modes are extended from 35 to 67 so as to accommodate various texture patterns. These newly developed techniques achieve high coding efficiency but also result in very high computational complexity. To tackle this problem, we propose a fast intra-coding algorithm consisting of a low-complexity coding tree unit (CTU) structure decision and a fast intra mode decision in this paper. The contributions of the proposed algorithm lie in the following aspects: 1) the new block size and coding mode distribution features are first explored for a reasonable fast coding scheme; 2) a novel fast QTMT partition decision framework is developed, which can determine the partition decision on both QT and multi-type tree with a novel cascade decision structure; and 3) fast intra mode decision with gradient descent search is introduced, while the best initial search point and search step are also investigated in this paper. The simulation results show that the complexity reduction of the proposed algorithm is up to 70% compared to the VVC reference software (VTM), and an average encoding time saving of 63% is achieved with a 1.93% BDBR increase. Such results demonstrate that our method yields a superior performance in terms of computational complexity and compression quality compared to the state-of-the-art methods.

166 citations


Journal ArticleDOI
TL;DR: A perspective crowd counting network (PCC Net), which consists of three parts, achieves state-of-the-art performance on one of five mainstream datasets and competitive results on the other four.
Abstract: Crowd counting from a single image is a challenging task due to high appearance similarity, perspective changes, and severe congestion. Many methods only focus on local appearance features and cannot handle the aforementioned challenges. In order to tackle them, we propose a perspective crowd counting network (PCC Net), which consists of three parts: 1) density map estimation (DME) focuses on learning very local features for density map estimation; 2) random high-level density classification (R-HDC) extracts global features to predict the coarse density labels of random patches in images; and 3) fore-/background segmentation (FBS) encodes mid-level features to segment the foreground and background. Besides, the Down, Up, Left, and Right (DULR) module is embedded in PCC Net to encode the perspective changes in four directions (DULR). The proposed PCC Net is verified on five mainstream datasets, achieving state-of-the-art performance on one of them and competitive results on the other four. The source code is available at https://github.com/gjy3035/PCC-Net .
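
As a structural illustration of the three-branch design named above (DME, R-HDC, FBS) sharing one backbone, a minimal multi-head sketch follows. The backbone, channel sizes, and number of density classes are assumptions for illustration and do not reproduce PCC Net or its DULR module.

```python
# Sketch (PyTorch) of a shared backbone with a density-map head, a coarse density
# classification head, and a fore-/background segmentation head.
import torch
import torch.nn as nn

class ThreeHeadCounter(nn.Module):
    def __init__(self, feat_ch=64, num_density_classes=10):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, feat_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat_ch, feat_ch, 3, padding=1), nn.ReLU(inplace=True))
        self.density_head = nn.Conv2d(feat_ch, 1, 1)          # DME: per-pixel density
        self.seg_head = nn.Conv2d(feat_ch, 2, 1)              # FBS: fore-/background logits
        self.cls_head = nn.Sequential(                        # R-HDC: coarse density label
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(feat_ch, num_density_classes))

    def forward(self, x):
        feat = self.backbone(x)
        return self.density_head(feat), self.seg_head(feat), self.cls_head(feat)

density, seg, cls_logits = ThreeHeadCounter()(torch.randn(1, 3, 128, 128))
```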

Journal ArticleDOI
TL;DR: A saliency-aware CNN framework for ship detection, comprising comprehensive ship-discriminative features such as deep features, a saliency map, and a coastline prior, is proposed, which outperforms representative counterparts in terms of accuracy and speed.
Abstract: Real-time detection of inshore ships plays an essential role in the efficient monitoring and management of maritime traffic and transportation for port management. Current ship detection methods, which are mainly based on remote sensing images or radar images, hardly meet real-time requirements due to the timeliness of image acquisition. In this paper, we propose to use visual images captured by an on-land surveillance camera network to achieve real-time detection. However, due to the complex background of visual images and the diversity of ship categories, the existing convolutional neural network (CNN) based methods are either inaccurate or slow. To achieve high detection accuracy and real-time performance simultaneously, we propose a saliency-aware CNN framework for ship detection, comprising comprehensive ship-discriminative features, such as deep features, a saliency map, and a coastline prior. This model uses a CNN to predict the category and the position of ships and uses global contrast based salient region detection to correct the location. We also extract coastline information and respectively incorporate it into the CNN and saliency detection to obtain more accurate ship locations. We implement our model on Darknet under CUDA 8.0 and CUDNN V5 and use a real-world visual image dataset for training and evaluation. The experimental results show that our model outperforms representative counterparts (Faster R-CNN, SSD, and YOLOv2) in terms of accuracy and speed.

Journal ArticleDOI
TL;DR: The proposed PixelMotionCNN (PMCNN), which includes motion extension and hybrid prediction networks, can model spatiotemporal coherence to effectively perform predictive coding inside the learning network and provides a possible new direction to further improve the compression efficiency and functionalities of future video coding.
Abstract: One key challenge to learning-based video compression is that motion predictive coding, a very effective tool for video compression, can hardly be trained into a neural network. In this paper, we propose the concept of PixelMotionCNN (PMCNN), which includes motion extension and hybrid prediction networks. PMCNN can model spatiotemporal coherence to effectively perform predictive coding inside the learning network. On the basis of PMCNN, we further explore a learning-based framework for video compression with additional components of iterative analysis/synthesis and binarization. The experimental results demonstrate the effectiveness of the proposed scheme. Although entropy coding and complex configurations are not employed in this paper, we still demonstrate superior performance compared with MPEG-2 and achieve comparable results with the H.264 codec. The proposed learning-based scheme provides a possible new direction to further improve the compression efficiency and functionalities of future video coding.

Journal ArticleDOI
Hak Gu Kim, Heoun-taek Lim, Yong Man Ro
TL;DR: The proposed deep networks, consisting of a virtual reality quality score predictor and a human perception guider, outperform the existing two-dimensional image quality models and the state-of-the-art image quality models for omnidirectional images.
Abstract: In this paper, we propose a novel deep learning-based virtual reality image quality assessment method that automatically predicts the visual quality of an omnidirectional image. In order to assess the visual quality in viewing the omnidirectional image, we propose deep networks consisting of a virtual reality (VR) quality score predictor and a human perception guider. The proposed VR quality score predictor learns the positional and visual characteristics of the omnidirectional image by encoding the positional feature and visual feature of a patch on the omnidirectional image. With the encoded positional feature and visual feature, a patch weight and a patch quality score are estimated. Then, by aggregating all weights and scores of the patches, the image quality score is predicted. The proposed human perception guider evaluates the predicted quality score by referring to the human subjective score (i.e., the ground truth obtained from subjects) using adversarial learning. With adversarial learning, the VR quality score predictor is trained to accurately predict the quality score in order to deceive the guider, while the proposed human perception guider is trained to precisely distinguish between the predictor score and the ground-truth subjective score. To verify the performance of the proposed method, we conducted comprehensive subjective experiments and evaluated the performance of the proposed method. The experimental results show that the proposed method outperforms the existing two-dimensional image quality models and the state-of-the-art image quality models for omnidirectional images.
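
The aggregation step described above (patch scores combined with patch weights into one image score) admits a simple weighted-average reading, sketched below. Treating the aggregation as a normalized weighted average is an assumption about how the combination is realized.

```python
# Minimal sketch (PyTorch) of aggregating per-patch quality scores with per-patch weights.
import torch

def aggregate_quality(patch_scores, patch_weights, eps=1e-8):
    """patch_scores, patch_weights: tensors of shape (num_patches,)."""
    w = torch.relu(patch_weights)            # keep weights non-negative
    return (w * patch_scores).sum() / (w.sum() + eps)

image_score = aggregate_quality(torch.rand(64), torch.rand(64))
```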

Journal ArticleDOI
TL;DR: This work proposes a novel zero-shot method based on training an end-to-end model that fuses semantic attribute prediction with visual features to propose object bounding boxes for seen and unseen classes and observes significant improvements on the average precision of unseen classes.
Abstract: As we move toward large-scale object detection, it is unrealistic to expect annotated training data, in the form of bounding box annotations around objects, for all object classes at sufficient scale; therefore, methods capable of unseen object detection are required. We propose a novel zero-shot method based on training an end-to-end model that fuses semantic attribute prediction with visual features to propose object bounding boxes for seen and unseen classes. While we utilize semantic features during training, our method is agnostic to semantic information for unseen classes at test time. Our method retains the efficiency and effectiveness of YOLOv2 for objects seen during training, while improving its performance for novel and unseen objects. The ability of the state-of-the-art detection methods to learn discriminative object features to reject background proposals also limits their performance for unseen objects. We posit that, to detect unseen objects, we must incorporate semantic information into the visual domain so that the learned visual features reflect this information and lead to improved recall rates for unseen objects. We test our method on the PASCAL VOC and MS COCO datasets and observe significant improvements on the average precision of unseen classes.

Journal ArticleDOI
TL;DR: It is theoretically and experimentally demonstrated that PHFMs outperform the above moments in reconstructing images and recognizing rotationally invariant objects considering noise and various attacks.
Abstract: Due to their good rotational invariance and stability, image continuous orthogonal moments are intensively applied in rotationally invariant recognition and image processing. However, most moments produce numerical instability, which impacts the image reconstruction and recognition performance. In this paper, a new set of invariant continuous orthogonal moments, polar harmonic Fourier moments (PHFMs), free of numerical instability, is designed. The radial basis functions (RBFs) of the PHFMs are much simpler than those of the Chebyshev-Fourier moments (CHFMs), orthogonal Fourier-Mellin moments (OFMMs), Zernike moments (ZMs), and pseudo-Zernike moments (PZMs). For the same degree, the RBFs of the PHFMs have more zeros and are more evenly distributed than those of the ZMs and PZMs. Therefore, PHFMs do not suffer from the information suppression problem; hence, the image description ability of the PHFMs is superior to that of the ZMs and PZMs. Moreover, the RBFs of the PHFMs are always less than or equal to 1.0 near the unit disk center, whereas those of the OFMMs, PZMs, CHFMs, and radial harmonic Fourier moments (RHFMs) are infinite (implying numerical instability). This indicates that PHFMs can outperform these moments in image reconstruction tasks. We theoretically and experimentally demonstrate that PHFMs outperform the above moments in reconstructing images and recognizing rotationally invariant objects considering noise and various attacks. This paper also details the significance of the PHFM phase in image reconstruction, angle estimation using PHFMs, and the accurate moment selection of the PHFMs.

Journal ArticleDOI
Peng Yi, Zhongyuan Wang, Kui Jiang, Zhenfeng Shao, Jiayi Ma
TL;DR: A multi-temporal ultra-dense memory (MTUDM) network for video super-resolution is proposed that outperforms the state-of-the-art methods by a large margin and adopts a multi-temporal information fusion (MTIF) strategy to merge the extracted temporal feature maps in consecutive frames, improving the accuracy without requiring much extra computational cost.
Abstract: Video super-resolution (SR) aims to reconstruct the corresponding high-resolution (HR) frames from consecutive low-resolution (LR) frames. It is crucial for video SR to harness both inter-frame temporal correlations and intra-frame spatial correlations among frames. Previous video SR methods based on convolutional neural networks (CNNs) mostly adopt a single-channel structure and a single memory module, so they are unable to fully exploit the inter-frame temporal correlations specific to video. To this end, this paper proposes a multi-temporal ultra-dense memory (MTUDM) network for video super-resolution. Particularly, we embed convolutional long short-term memory (ConvLSTM) into an ultra-dense residual block (UDRB) to construct an ultra-dense memory block (UDMB) for extracting and retaining spatio-temporal correlations. This design also reduces the layer depth by expanding the width, thus avoiding training difficulties, such as exploding and vanishing gradients in a large model. We further adopt a multi-temporal information fusion (MTIF) strategy to merge the extracted temporal feature maps in consecutive frames, improving the accuracy without requiring much extra computational cost. The experimental results on extensive public datasets demonstrate that our method outperforms the state-of-the-art methods by a large margin.

Journal ArticleDOI
TL;DR: The experimental results demonstrate that the proposed FS-SSD can achieve a comparable detection speed but an accuracy superior to those of the six state-of-the-art methods.
Abstract: Objects in unmanned aerial vehicle (UAV) images are generally small due to the high photography altitude. Although many efforts have been made in object detection, how to accurately and quickly detect small objects is still one of the remaining open challenges. In this paper, we propose a feature fusion and scaling-based single shot detector (FS-SSD) for small object detection in UAV images. The FS-SSD is an enhancement based on FSSD, a variant of the original single shot multibox detector (SSD). We add an extra scaling branch of the deconvolution module with an average pooling operation to form a feature pyramid. The original feature fusion branch is adjusted to be better suited to the small object detection task. The two feature pyramids generated by the deconvolution module and the feature fusion module are utilized to make predictions together. In addition to the deep features learned by the FS-SSD, to further improve the detection accuracy, spatial context analysis is proposed to incorporate the object spatial relationships into object redetection. The interclass and intraclass distances between different object instances are computed as a spatial context, which proves effective for multiclass small object detection. Six experiments are conducted on the PASCAL VOC dataset and two UAV image datasets. The experimental results demonstrate that the proposed method can achieve a comparable detection speed but an accuracy superior to those of the six state-of-the-art methods.

Journal ArticleDOI
TL;DR: A novel framework for an efficient video content summarization as well as video motion summarization is proposed, using Capsules Net as a spatiotemporal information extractor and a self-attention model to select key-frames sequences inside the shots.
Abstract: Video summarization (VSUMM) has become a popular method in processing massive video data. The key point of VSUMM is to select the key frames to represent the effective contents of a video sequence. The existing methods can only extract the static images of videos as the content summarization, but they ignore the representation of motion information. To cope with these issues, a novel framework for efficient video content summarization as well as video motion summarization is proposed. Initially, Capsules Net is trained as a spatiotemporal information extractor, and an inter-frame motion curve is generated based on those spatiotemporal features. Subsequently, a transition effects detection method is proposed to automatically segment the video streams into shots. Finally, a self-attention model is introduced to select key-frame sequences inside the shots; thus, key static images are selected as the video content summarization, and optical flows can be calculated as the video motion summarization. The ultimate experimental results demonstrate that our method is competitive on the VSUMM, TvSum, SumMe, and RAI datasets in terms of shot segmentation and video content summarization, and can also produce a good motion summarization result.

Journal ArticleDOI
TL;DR: Experimental results validate that the proposed LRCISSK method can effectively explore the spatial-spectral information and deliver superior performance with at least 1.30% higher OA and 1.03% higher AA on average when compared to other state-of-the-art classifiers.
Abstract: Kernel methods, e.g., composite kernels (CKs) and spatial-spectral kernels (SSKs), have been demonstrated to be an effective way to exploit the spatial-spectral information nonlinearly for improving the classification performance of hyperspectral images (HSI). However, these methods are always conducted with square-shaped windows or superpixel techniques. Both techniques are likely to misclassify the pixels that lie at class boundaries, and thus a small target is always smoothed away. To alleviate these problems, in this paper, we propose a novel patch-based low rank component induced spatial-spectral kernel method, termed LRCISSK, for HSI classification. First, the latent low-rank features of spectra in each cubic patch of the HSI are reconstructed by a low rank matrix recovery (LRMR) technique, and then, to further explore more accurate spatial information, they are used to identify a homogeneous neighborhood for the target pixel (i.e., the centroid pixel) adaptively. Finally, the adaptively identified homogeneous neighborhood, which consists of the latent low-rank spectra, is embedded into the spatial-spectral kernel framework. It can easily map the spectra into nonlinearly complex manifolds and enable a classifier (e.g., a support vector machine, SVM) to distinguish them effectively. Experimental results on three real HSI datasets validate that the proposed LRCISSK method can effectively explore the spatial-spectral information and deliver superior performance with at least 1.30% higher OA and 1.03% higher AA on average when compared to other state-of-the-art classifiers.

Journal ArticleDOI
TL;DR: A novel general RDH framework using multiple histogram modification (MH_RDH) is proposed, which involves two key issues: the construction of multiple histograms based on optimized multi-features, and the rate allocation among multiple histograms, which is formulated as a rate-distortion optimization problem and solved with evolutionary algorithms.
Abstract: Reversible data hiding (RDH) has a unique advantage in copyright and integrity protection for multimedia contents. As a typical RDH scheme, the histogram shifting (HS) technique has found wide applications due to the high quality of its marked images. At present, most existing HS-based RDH schemes rely on a single histogram generated from the cover image to hide data. Since the single histogram-based approach (SH_RDH) commonly employs smooth regions of the cover image for data hiding, it might not well utilize the cover image or exploit the correlations among image contents of different texture characteristics. In this paper, a novel general RDH framework using multiple histogram modification (MH_RDH) is proposed, which involves two key issues as follows: 1) the construction of multiple histograms based on optimized multi-features and 2) the rate allocation among multiple histograms, which is formulated as a rate-distortion optimization problem and solved with evolutionary algorithms. The experimental results show that the proposed method could considerably increase the payload of current MH_RDH-based embedding (ranging from 0.2 to 0.7 bpp for most test images) and outperform the other state-of-the-art SH_RDH and MH_RDH schemes.
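
For context on the histogram shifting baseline that the framework generalizes, the toy sketch below implements the textbook single-histogram embedding: bins between the peak and an empty bin are shifted to free a bin, and payload bits are embedded at the peak. Overflow handling and the bookkeeping needed for exact reversibility are omitted; this is not the paper's MH_RDH scheme.

```python
# Toy NumPy sketch of classic single-histogram shifting (HS) embedding on a 1-D
# sequence of 8-bit cover pixels. Capacity equals the peak-bin count.
import numpy as np

def hs_embed(pixels, bits):
    """pixels: 1-D uint8 cover array; bits: iterable of 0/1. Returns marked pixels."""
    hist = np.bincount(pixels, minlength=256)
    peak = int(hist.argmax())                 # most frequent value: gives the capacity
    zero = int(hist[peak:].argmin()) + peak   # rarest (ideally empty) bin to the right
    marked = pixels.astype(np.int32)
    # Shift bins strictly between peak and zero one step right to free bin peak+1.
    shift = (marked > peak) & (marked < zero)
    marked[shift] += 1
    bit_iter = iter(bits)
    for i in np.flatnonzero(pixels == peak):
        b = next(bit_iter, None)
        if b is None:
            break
        marked[i] += b                        # peak stays for 0, peak+1 encodes 1
    return marked.astype(np.uint8)

marked = hs_embed(np.random.randint(0, 256, 10_000, dtype=np.uint8), [1, 0, 1, 1])
```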

Journal ArticleDOI
TL;DR: Experimental results illustrate the effectiveness and robustness of the proposed AL-ResNets or AL-RoR for age estimation in the wild, where it achieves better state-of-the-art performance than all other convolutional neural network methods on the Adience, MORPH Album 2, FG-NET and 15/16LAP datasets.
Abstract: Age estimation from a single face image has been an essential task in the field of human-computer interaction and computer vision, and it has a wide range of practical applications. The accuracy of age estimation on face images in the wild is relatively low for existing methods, because they only take into account the global features, while neglecting the fine-grained features of age-sensitive areas. We propose a novel method based on our attention long short-term memory (AL) network for fine-grained age estimation in the wild, inspired by fine-grained categorization and the visual attention mechanism. This method combines the residual networks (ResNets) or the residual network of residual network (RoR) models with LSTM units to construct AL-ResNets or AL-RoR networks to extract local features of age-sensitive regions, which effectively improves the age estimation accuracy. First, a ResNets or a RoR model pretrained on the ImageNet dataset is selected as the basic model, which is then fine-tuned on the IMDB-WIKI-101 dataset for age estimation. Then, we fine-tune the ResNets or the RoR on the target age datasets to extract the global features of face images. To extract the local features of age-sensitive regions, the LSTM unit is then presented to obtain the coordinates of the age-sensitive region automatically. Finally, the age group classification is conducted directly on the Adience dataset, and age-regression experiments are performed by the Deep EXpectation algorithm (DEX) on the MORPH Album 2, FG-NET and 15/16LAP datasets. By combining the global and the local features, we obtain our final prediction results. Experimental results illustrate the effectiveness and robustness of the proposed AL-ResNets or AL-RoR for age estimation in the wild, where it achieves better state-of-the-art performance than all other convolutional neural network (CNN) methods on the Adience, MORPH Album 2, FG-NET and 15/16LAP datasets.

Journal ArticleDOI
TL;DR: This paper proposes identity-diversity inpainting to facilitate occluded face recognition by integrating GAN with an optimized pre-trained CNN recognizer which serves as the third player to compete with the generator by distinguishing diversity within the same identity class.
Abstract: Face recognition has achieved advanced development by using convolutional neural network (CNN) based recognizers. Existing recognizers typically demonstrate powerful capacity in recognizing un-occluded faces, but often suffer from accuracy degradation when directly identifying occluded faces. This is mainly due to insufficient visual and identity cues caused by occlusions. On the other hand, a generative adversarial network (GAN) is particularly suitable when visually plausible occluded regions need to be reconstructed by face inpainting. Motivated by these observations, this paper proposes identity-diversity inpainting to facilitate occluded face recognition. The core idea is integrating a GAN with an optimized pre-trained CNN recognizer which serves as the third player to compete with the generator by distinguishing diversity within the same identity class. To this end, a collection of identity-centered features is applied in the recognizer as supervision to enable the inpainted faces to cluster toward their identity centers. In this way, our approach can benefit from the GAN for reconstruction and the CNN for representation, and simultaneously addresses two challenging tasks, face inpainting and face recognition. Experimental results compared with four state-of-the-art methods prove the efficacy of the proposed approach.

Journal ArticleDOI
TL;DR: A novel attentive semantic recurrent neural network (RNN), namely, stagNet, is presented for understanding group activities and individual actions in videos, by combining the spatio-temporal attention mechanism and semantic graph modeling.
Abstract: In real life, group activity recognition plays a significant and fundamental role in a variety of applications, e.g. sports video analysis, abnormal behavior detection, and intelligent surveillance. In a complex dynamic scene, a crucial yet challenging issue is how to better model the spatio-temporal contextual information and inter-person relationship. In this paper, we present a novel attentive semantic recurrent neural network (RNN), namely, stagNet, for understanding group activities and individual actions in videos, by combining the spatio-temporal attention mechanism and semantic graph modeling. Specifically, a structured semantic graph is explicitly modeled to express the spatial contextual content of the whole scene, which is further incorporated with the temporal factor through structural-RNN. By virtue of the “factor sharing” and “message passing” mechanisms, our stagNet is capable of extracting discriminative and informative spatio-temporal representations and capturing inter-person relationships. Moreover, we adopt a spatio-temporal attention model to focus on key persons/frames for improved recognition performance. Besides, a body-region attention and a global-part feature pooling strategy are devised for individual action recognition. In experiments, four widely-used public datasets are adopted for performance evaluation, and the extensive results demonstrate the superiority and effectiveness of our method.

Journal ArticleDOI
TL;DR: This paper proposes a new pruning scheme that reflects accelerator architectures, and demonstrates a comparable pruning ratio on compact networks such as MobileNet and on slimmed networks that were already pruned in a channel-wise manner.
Abstract: Convolutional neural networks have shown tremendous performance capabilities in computer vision tasks, but their excessive amounts of weight storage and arithmetic operations prevent them from being adopted in embedded environments. One of the solutions involves pruning, where certain unimportant weights are forced to have a value of zero. Many pruning schemes have been proposed, but these have mainly focused on the number of pruned weights, scarcely considering ASIC or FPGA accelerator architectures. When a pruned network is run on an accelerator, the lack of architecture consideration causes some inefficiency problems, including internal buffer misalignments and load imbalances. This paper proposes a new pruning scheme that reflects accelerator architectures. In the proposed scheme, pruning is performed so that the same number of weights remain for each weight group corresponding to activations fetched simultaneously. In this way, the pruning scheme resolves the inefficiency problems, doubling the accelerator performance. Even with this constraint, the proposed pruning scheme reached a pruning ratio similar to that of the previous unconstrained pruning schemes, not only on AlexNet and VGG16 but also on the state-of-the-art very deep networks, such as ResNet. Furthermore, the proposed scheme demonstrated a comparable pruning ratio on compact networks such as MobileNet and on slimmed networks that were already pruned in a channel-wise manner. In addition to improving the efficiency of previous sparse accelerators, it is also shown that the proposed pruning scheme can be used to reduce the logic complexity of sparse accelerators.
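
The group constraint described above (the same number of surviving weights in every group of weights fetched together) can be illustrated with a simple magnitude-based sketch, shown below. The group size and keep ratio are illustrative assumptions, and the real scheme operates during training on accelerator-specific weight groups rather than on a flat array.

```python
# Sketch (NumPy) of group-constrained pruning: split weights into fixed-size groups and
# zero the smallest-magnitude entries so every group keeps exactly the same count.
import numpy as np

def prune_groupwise(weights, group_size=8, keep_per_group=2):
    """weights: 1-D float array whose length is a multiple of group_size."""
    w = weights.reshape(-1, group_size).copy()
    # Indices of the smallest-magnitude weights in each group are zeroed, so every
    # group retains exactly `keep_per_group` non-zero weights (a balanced workload).
    drop = np.argsort(np.abs(w), axis=1)[:, :group_size - keep_per_group]
    np.put_along_axis(w, drop, 0.0, axis=1)
    return w.reshape(-1)

pruned = prune_groupwise(np.random.randn(64))   # 2 of every 8 weights survive
```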

Journal ArticleDOI
TL;DR: A unified single image dehazing network is proposed that jointly estimates the transmission map and performs dehazing with an end-to-end learning framework, where the inherent transmission map and the dehazed result are learned jointly from the loss function.
Abstract: Single image haze removal is an extremely challenging problem due to its inherent ill-posed nature. Several prior-based and learning-based methods have been proposed in the literature to solve this problem and they have achieved visually appealing results. However, most of the existing methods assume constant atmospheric light model and tend to follow a two-step procedure involving prior-based methods for estimating transmission map followed by calculation of dehazed image using the closed form solution. In this paper, we relax the constant atmospheric light assumption and propose a novel unified single image dehazing network that jointly estimates the transmission map and performs dehazing. In other words, our new approach provides an end-to-end learning framework, where the inherent transmission map and dehazed result are learned jointly from the loss function. The extensive experiments evaluated on synthetic and real datasets with challenging hazy images demonstrate that the proposed method achieves significant improvements over the state-of-the-art methods.

Journal ArticleDOI
TL;DR: This paper attempts to establish the connection between the direction feature extraction model and the discriminability of direction features, and proposes a novel exponential and Gaussian fusion model (EGM) to characterize the discriminative power of different directions.
Abstract: Direction-based methods are the most powerful and popular palmprint recognition methods. However, there is no existing work that completely analyzes the essential differences among different direction-based methods and explores the most discriminant direction representation of a palmprint. In this paper, we attempt to establish the connection between the direction feature extraction model and the discriminability of direction features, and we propose a novel exponential and Gaussian fusion model (EGM) to characterize the discriminative power of different directions. The EGM can provide us with new insight into the optimal direction feature selection of palmprints. Moreover, we propose a local discriminant direction binary pattern (LDDBP) to completely represent the direction features of a palmprint. Guided by the EGM, the most discriminant directions can be exploited to form the LDDBP-based descriptor for palmprint representation and recognition. Extensive experimental results conducted on four widely used palmprint databases demonstrate the superiority of the proposed LDDBP method over the state-of-the-art direction-based methods.

Journal ArticleDOI
Zhen-Duo Chen, Chuan-Xiang Li, Xin Luo, Liqiang Nie, Wei Zhang, Xin-Shun Xu
TL;DR: A novel supervised cross-modal hashing framework, namely Scalable disCRete mATrix faCtorization Hashing (SCRATCH), which utilizes collective matrix factorization on original features together with label semantic embedding, to learn the latent representations in a shared latent space.
Abstract: In this paper, we present a novel supervised cross-modal hashing framework, namely Scalable disCRete mATrix faCtorization Hashing (SCRATCH). First, it utilizes collective matrix factorization on original features together with label semantic embedding, to learn the latent representations in a shared latent space. Thereafter, it generates binary hash codes based on the latent representations. During optimization, it avoids using a large $n\times n$ similarity matrix and generates hash codes discretely. Besides, based on different objective functions, learning strategy, and features, we further present three models in this framework, i.e., SCRATCH-o, SCRATCH-t, and SCRATCH-d. The first one is a one-step method, learning the hash functions and the binary codes in the same optimization problem. The second is a two-step method, which first generates the binary codes and then learns the hash functions based on the learned hash codes. The third one is a deep version of SCRATCH-t, which utilizes deep neural networks as hash functions. The extensive experiments on two widely used benchmark datasets demonstrate that SCRATCH-o and SCRATCH-t outperform some state-of-the-art shallow hashing methods for cross-modal retrieval. The SCRATCH-d also outperforms some state-of-the-art deep hashing models.

Journal ArticleDOI
Risheng Liu, Xin Fan, Ming Zhu, Minjun Hou, Zhongxuan Luo
TL;DR: In this article, a large-scale Real-world Underwater Image Enhancement (RUIE) data set is constructed, which is divided into three subsets targeting image visibility quality, color casts, and higher-level detection/classification, respectively.
Abstract: Underwater image enhancement is such an important low-level vision task with many applications that numerous algorithms have been proposed in recent years. These algorithms, developed upon various assumptions, demonstrate successes from various aspects using different data sets and different metrics. In this work, we set up an undersea image capturing system and construct a large-scale Real-world Underwater Image Enhancement (RUIE) data set divided into three subsets. The three subsets target three challenging aspects for enhancement, i.e., image visibility quality, color casts, and higher-level detection/classification, respectively. We conduct extensive and systematic experiments on RUIE to evaluate the effectiveness and limitations of various algorithms to enhance visibility and correct color casts on images with hierarchical categories of degradation. Moreover, underwater image enhancement in practice usually serves as a preprocessing step for mid-level and high-level vision tasks. We thus exploit the object detection performance on enhanced images as a brand new task-specific evaluation criterion. The findings from these evaluations not only confirm what is commonly believed, but also suggest promising solutions and new directions for visibility enhancement, color correction, and object detection on real-world underwater images. The benchmark is available at: https://github.com/dlut-dimt/Realworld-Underwater-Image-Enhancement-RUIE-Benchmark .

Journal ArticleDOI
TL;DR: This work proposes a simple yet effective solution, termed attention-driven loss, to alleviate the foreground-background imbalance problem in anomaly detection, which is independent of backbone networks and can be easily incorporated into most existing anomaly detection models.
Abstract: Recent video anomaly detection methods focus on reconstructing or predicting frames. Under this umbrella, the long-standing inter-class data-imbalance problem manifests as the imbalance between foreground and stationary background objects in video anomaly detection, and this has been less investigated by existing solutions. Naively optimizing the reconstruction loss yields a biased optimization towards background reconstruction rather than the objects of interest in the foreground. To solve this, we propose a simple yet effective solution, termed attention-driven loss, to alleviate the foreground-background imbalance problem in anomaly detection. Specifically, we compute a single mask map that summarizes the frame evolution of moving foreground regions and suppresses the background in the training video clips. After that, we construct an attention map through the combination of the mask map and the background to give different weights to the foreground and background regions, respectively. The proposed attention-driven loss is independent of backbone networks and can be easily incorporated into most existing anomaly detection models. Augmented with the attention-driven loss, the model is able to achieve an AUC of 86.0% on the Avenue, 83.9% on the Ped1, and 96% on the Ped2 datasets. Extensive experimental results and ablation studies further validate the effectiveness of our model.
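
A minimal sketch of an attention-weighted reconstruction loss in the spirit of the description above is given below: a mask of moving foreground regions is turned into a weight map that emphasizes foreground errors over background errors. The mask construction, the weighting values, and the squared-error form are illustrative assumptions, not the paper's exact formulation.

```python
# Sketch (PyTorch) of an attention-weighted reconstruction loss driven by a
# foreground mask derived from frame-to-frame motion in a training clip.
import torch

def attention_driven_loss(pred, target, fg_mask, fg_weight=2.0, bg_weight=1.0):
    """pred, target: (B, C, H, W) frames; fg_mask: (B, 1, H, W) with values in {0, 1}."""
    attention = fg_weight * fg_mask + bg_weight * (1.0 - fg_mask)
    return (attention * (pred - target) ** 2).mean()

def foreground_mask(clip, threshold=0.05):
    """clip: (T, C, H, W). Marks pixels that change noticeably across the clip."""
    motion = clip.diff(dim=0).abs().amax(dim=0).amax(dim=0, keepdim=True)  # (1, H, W)
    return (motion > threshold).float().unsqueeze(0)                       # (1, 1, H, W)

clip = torch.rand(8, 3, 64, 64)
loss = attention_driven_loss(torch.rand(1, 3, 64, 64), clip[-1:], foreground_mask(clip))
```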