
Showing papers in "IEEE Transactions on Circuits and Systems for Video Technology in 2019"


Journal ArticleDOI
TL;DR: The authors propose a pedestrian alignment network (PAN) that allows discriminative embedding learning and pedestrian alignment without extra annotations, observing that the learned feature maps usually exhibit strong activations on the human body rather than the background.
Abstract: Person re-identification (re-ID) is mostly viewed as an image retrieval problem. This task aims to search a query person in a large image pool. In practice, person re-ID usually adopts automatic detectors to obtain cropped pedestrian images. However, this process suffers from two types of detector errors: excessive background and part missing. Both errors deteriorate the quality of pedestrian alignment and may compromise pedestrian matching due to the position and scale variances. To address the misalignment problem, we propose that alignment be learned from an identification procedure. We introduce the pedestrian alignment network (PAN) which allows discriminative embedding learning and pedestrian alignment without extra annotations. We observe that when the convolutional neural network learns to discriminate between different identities, the learned feature maps usually exhibit strong activations on the human body rather than the background. The proposed network thus takes advantage of this attention mechanism to adaptively locate and align pedestrians within a bounding box. Visual examples show that pedestrians are better aligned with PAN. Experiments on three large-scale re-ID datasets confirm that PAN improves the discriminative ability of the feature embeddings and yields competitive accuracy with the state-of-the-art methods.
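A minimal NumPy sketch of the attention idea the abstract describes: average the CNN feature maps into an activation map, threshold it, and re-crop the detector box around the strongly activated region. The feature tensor, threshold, and nearest-scale mapping below are illustrative stand-ins, not the paper's affine alignment network.

```python
import numpy as np

def attention_crop(feature_maps, image, thresh=0.5):
    """Crop `image` to the region where channel-averaged CNN activations are strong.

    feature_maps: (C, h, w) activations from some convolutional layer (toy input here).
    image:        (H, W, 3) pedestrian crop produced by a detector.
    """
    act = feature_maps.mean(axis=0)                      # channel-wise average
    act = (act - act.min()) / (act.ptp() + 1e-8)         # normalise to [0, 1]
    ys, xs = np.where(act >= thresh)                     # strongly activated cells
    if len(ys) == 0:                                     # fall back to the full box
        return image
    H, W = image.shape[:2]
    h, w = act.shape
    y0, y1 = int(ys.min() * H / h), int((ys.max() + 1) * H / h)
    x0, x1 = int(xs.min() * W / w), int((xs.max() + 1) * W / w)
    return image[y0:y1, x0:x1]

# toy usage: random "activations" and a random pedestrian crop
fm = np.random.rand(256, 8, 4)
img = np.random.rand(256, 128, 3)
print(attention_crop(fm, img).shape)
```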

466 citations


Journal ArticleDOI
TL;DR: This paper reviews different types of saliency detection algorithms, summarizes the important issues of existing methods, and discusses open problems and future work; experimental analysis and discussion provide a holistic overview of different saliency detection methods.
Abstract: The visual saliency detection model simulates the human visual system to perceive the scene and has been widely used in many vision tasks. With the development of acquisition technology, more comprehensive information, such as depth cue, inter-image correspondence, or temporal relationship, is available to extend image saliency detection to RGBD saliency detection, co-saliency detection, or video saliency detection. The RGBD saliency detection model focuses on extracting the salient regions from RGBD images by combining the depth information. The co-saliency detection model introduces the inter-image correspondence constraint to discover the common salient object in an image group. The goal of the video saliency detection model is to locate the motion-related salient object in video sequences, which considers the motion cue and spatiotemporal constraint jointly. In this paper, we review different types of saliency detection algorithms, summarize the important issues of the existing methods, and discuss the existent problems and future works. Moreover, the evaluation datasets and quantitative measurements are briefly introduced, and the experimental analysis and discussion are conducted to provide a holistic overview of different saliency detection methods.
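The abstract mentions that quantitative measurements are briefly introduced; the sketch below shows two measures commonly used to compare saliency detectors, MAE and the F-measure with beta^2 = 0.3 at an adaptive threshold. The arrays are toy data, and the specific metric choices are an assumption rather than the survey's exact protocol.

```python
import numpy as np

def mae(sal, gt):
    """Mean absolute error between a saliency map and a binary ground-truth mask."""
    return np.abs(sal - gt).mean()

def f_measure(sal, gt, beta2=0.3):
    """F-measure at an adaptive threshold (twice the mean saliency)."""
    t = min(2 * sal.mean(), 1.0)
    pred = sal >= t
    tp = np.logical_and(pred, gt > 0.5).sum()
    precision = tp / (pred.sum() + 1e-8)
    recall = tp / ((gt > 0.5).sum() + 1e-8)
    return (1 + beta2) * precision * recall / (beta2 * precision + recall + 1e-8)

sal = np.random.rand(64, 64)          # toy saliency map in [0, 1]
gt = np.random.rand(64, 64) > 0.7     # toy binary ground truth
print(mae(sal, gt), f_measure(sal, gt))
```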

328 citations


Journal ArticleDOI
TL;DR: A novel feature extraction method called robust sparse linear discriminant analysis (RSLDA) is proposed to address the limitations of classical LDA and achieves competitive performance compared with other state-of-the-art feature extraction methods.
Abstract: Linear discriminant analysis (LDA) is a very popular supervised feature extraction method and has been extended to different variants. However, classical LDA has the following problems: 1) The obtained discriminant projection does not have good interpretability for features; 2) LDA is sensitive to noise; and 3) LDA is sensitive to the selection of number of projection directions. In this paper, a novel feature extraction method called robust sparse linear discriminant analysis (RSLDA) is proposed to solve the above problems. Specifically, RSLDA adaptively selects the most discriminative features for discriminant analysis by introducing the $l_{2,1}$ norm. An orthogonal matrix and a sparse matrix are also simultaneously introduced to guarantee that the extracted features can hold the main energy of the original data and enhance the robustness to noise, and thus RSLDA has the potential to perform better than other discriminant methods. Extensive experiments on six databases demonstrate that the proposed method achieves competitive performance compared with other state-of-the-art feature extraction methods. Moreover, the proposed method is robust to noisy data.
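To make the $l_{2,1}$-norm feature-selection effect concrete, here is a toy NumPy sketch: the $l_{2,1}$ norm of a projection matrix is the sum of the l2 norms of its rows, and features whose rows shrink toward zero are the ones effectively discarded. The random projection matrix below merely mimics a trained, regularised one; it is not RSLDA's optimisation.

```python
import numpy as np

def l21_norm(Q):
    """l_{2,1} norm: sum of the l2 norms of the rows of Q."""
    return np.linalg.norm(Q, axis=1).sum()

def select_features(Q, k):
    """Rank features by the l2 norm of their projection rows and keep the top k.

    Rows with near-zero norm contribute little to the projection, so dropping them
    is the feature-selection effect the l_{2,1} regulariser encourages.
    """
    row_energy = np.linalg.norm(Q, axis=1)
    return np.argsort(row_energy)[::-1][:k]

Q = np.random.randn(100, 10)          # toy projection: 100 features -> 10 dimensions
Q[np.random.rand(100) < 0.6] *= 0.01  # pretend the regulariser shrank many rows
print(l21_norm(Q), select_features(Q, 5))
```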

261 citations


Journal ArticleDOI
Ren Yang, Mai Xu, Tie Liu, Zulin Wang, Zhenyu Guan
TL;DR: This paper proposes a quality enhancement convolutional neural network (QE-CNN) method that achieves quality enhancement for HEVC without any modification of the encoder, together with a time-constrained quality enhancement optimization (TQEO) scheme.
Abstract: The latest High Efficiency Video Coding (HEVC) standard has been increasingly applied to generate video streams over the Internet. However, HEVC compressed videos may incur severe quality degradation, particularly at low bit rates. Thus, it is necessary to enhance the visual quality of HEVC videos at the decoder side. To this end, this paper proposes a quality enhancement convolutional neural network (QE-CNN) method that does not require any modification of the encoder to achieve quality enhancement for HEVC. In particular, our QE-CNN method learns QE-CNN-I and QE-CNN-P models to reduce the distortion of HEVC I and P/B frames, respectively. The proposed method differs from the existing CNN-based quality enhancement approaches, which only handle intra-coding distortion and are thus not suitable for P/B frames. Our experimental results validate that our QE-CNN method is effective in enhancing quality for both I and P/B frames of HEVC videos. To apply our QE-CNN method in time-constrained scenarios, we further propose a time-constrained quality enhancement optimization (TQEO) scheme. Our TQEO scheme controls the computational time of QE-CNN to meet a target, meanwhile maximizing the quality enhancement. Next, the experimental results demonstrate the effectiveness of our TQEO scheme from the aspects of time control accuracy and quality enhancement under different time constraints. Finally, we design a prototype to implement our TQEO scheme in a real-time scenario.
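The TQEO idea of maximizing enhancement under a computation-time target can be illustrated with a greedy, knapsack-style selection of which frames to run the enhancement CNN on. This is only a sketch of the trade-off; the per-frame gains, runtimes, and greedy rule below are synthetic assumptions, not the paper's optimization.

```python
import numpy as np

def select_frames(gain, cost, budget):
    """Greedily pick frames to enhance so the total runtime stays within `budget`,
    prioritising frames with the best quality gain per unit time."""
    order = np.argsort(gain / cost)[::-1]
    chosen, spent = [], 0.0
    for i in order:
        if spent + cost[i] <= budget:
            chosen.append(int(i))
            spent += cost[i]
    return chosen, spent

gain = np.random.rand(20) * 0.5      # toy per-frame PSNR gains (dB)
cost = np.random.rand(20) * 30 + 5   # toy per-frame CNN runtimes (ms)
frames, used = select_frames(gain, cost, budget=200.0)
print(frames, round(used, 1))
```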

175 citations


Journal ArticleDOI
TL;DR: A general steganalysis feature selection method based on decision rough set α-positive region reduction is proposed; it can significantly reduce the feature dimensions while maintaining detection accuracy, which remarkably improves the efficiency of feature extraction and stego image detection.
Abstract: Steganography detection based on Rich Model features is a hot research direction in steganalysis. However, Rich Model features usually result in a large computation cost. To reduce the dimension of steganalysis features and improve the efficiency of steganalysis algorithms, and unlike previous works that normally propose new feature extraction algorithms, this paper proposes a general steganalysis feature selection method based on decision rough set $\alpha$ -positive region reduction. First, it is pointed out that decision rough set $\alpha$ -positive region reduction is suitable for steganalysis feature selection. Second, a quantization method of attribute separability is proposed to measure the separability of steganalysis feature components. Third, a steganalysis feature component selection algorithm based on decision rough set $\alpha$ -positive region reduction is given, so that stego images can be detected using the selected features. The proposed method can significantly reduce the feature dimensions and maintain detection accuracy. Based on the BOSSbase-1.01 image database of 10 000 images, a series of feature selection experiments are carried out on two kinds of typical Rich Model features (the 35263-D J+SRM feature and the 17000-D GFR feature). The results show that even though these two kinds of features are reduced to approximately 8000-D, the detection performance of steganalysis algorithms based on the selected features is maintained at the level of the original features, which remarkably improves the efficiency of feature extraction and stego image detection.
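As a rough illustration of "measuring the separability of feature components and keeping the separable ones", the sketch below ranks components by a Fisher-style score (between-class gap over within-class variance) and keeps the top ones. This score is a simple stand-in for the paper's attribute-separability quantization and rough-set reduction; all data below are synthetic.

```python
import numpy as np

def fisher_scores(X, y):
    """Per-component separability: squared gap between the two class means
    divided by the pooled within-class variance."""
    X0, X1 = X[y == 0], X[y == 1]
    num = (X0.mean(0) - X1.mean(0)) ** 2
    den = X0.var(0) + X1.var(0) + 1e-12
    return num / den

def reduce_features(X, y, keep):
    idx = np.argsort(fisher_scores(X, y))[::-1][:keep]
    return X[:, idx], idx

# toy steganalysis features: 1000 cover/stego samples, 1000-D
X = np.random.randn(1000, 1000)
y = np.repeat([0, 1], 500)
X[y == 1, :50] += 0.3                 # only the first 50 components carry a signal
Xr, idx = reduce_features(X, y, keep=100)
print(Xr.shape, int((idx < 50).sum()), "informative components kept")
```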

171 citations


Journal ArticleDOI
TL;DR: Input/output architectures for convolutional neural network (CNN)-based cross-view gait recognition are proposed, and experiments confirm that they outperform state-of-the-art benchmarks in the situations (verification/identification tasks and view differences) each architecture is suited to.
Abstract: In this paper, we discuss input/output architectures for convolutional neural network (CNN)-based cross-view gait recognition. For this purpose, we consider two aspects: verification versus identification and the tradeoff between spatial displacements caused by subject difference and view difference. More specifically, we use the Siamese network with a pair of inputs and contrastive loss for verification and a triplet network with a triplet of inputs and triplet ranking loss for identification. The aforementioned CNN architectures are insensitive to spatial displacement, because the difference between a matching pair is calculated at the last layer after passing through the convolution and max pooling layers; hence, they are expected to work relatively well under large view differences. By contrast, because it is better to use the spatial displacement to its best advantage because of the subject difference under small view differences, we also use CNN architectures where the difference between a matching pair is calculated at the input level to make them more sensitive to spatial displacement. We conducted experiments for cross-view gait recognition and confirmed that the proposed architectures outperformed the state-of-the-art benchmarks in accordance with their suitable situations of verification/identification tasks and view differences.
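The two losses named in the abstract have standard textbook forms; a small NumPy sketch of both, the contrastive loss used for verification and the triplet ranking loss used for identification, is given below on toy embedding vectors. The margin values are illustrative, not the paper's settings.

```python
import numpy as np

def contrastive_loss(a, b, same, margin=1.0):
    """Verification loss: pull matching pairs together, push non-matching pairs
    at least `margin` apart."""
    d = np.linalg.norm(a - b)
    return 0.5 * d**2 if same else 0.5 * max(0.0, margin - d) ** 2

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Identification loss: the anchor must be closer to the positive than to the
    negative by at least `margin`."""
    dp = np.linalg.norm(anchor - positive)
    dn = np.linalg.norm(anchor - negative)
    return max(0.0, dp - dn + margin)

a, p, n = np.random.randn(3, 128)       # toy 128-D gait embeddings
print(contrastive_loss(a, p, same=True), triplet_loss(a, p, n))
```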

157 citations


Journal ArticleDOI
Mai Xu, Chen Li, Zhenzhong Chen, Zulin Wang, Zhenyu Guan
TL;DR: A new database is presented, which includes the viewing direction data from several subjects watching omnidirectional video sequences, and a subjective VQA method for measuring the difference mean opinion score (DMOS) of the whole and regional omnidirectional video, in terms of overall DMOS and vectorized DMOS, respectively.
Abstract: In contrast with traditional videos, omnidirectional videos enable spherical viewing direction with support for head-mounted displays, providing an interactive and immersive experience. Unfortunately, to the best of our knowledge, there are only a few visual quality assessment (VQA) methods, either subjective or objective, for omnidirectional video coding. This paper proposes both subjective and objective methods for assessing the quality loss in encoding an omnidirectional video. Specifically, we first present a new database, which includes the viewing direction data from several subjects watching omnidirectional video sequences. Then, from our database, we find a high consistency in viewing directions across different subjects. The viewing directions are normally distributed in the center of the front regions, but they sometimes fall into other regions, related to the video content. Given this finding, we present a subjective VQA method for measuring the difference mean opinion score (DMOS) of the whole and regional omnidirectional video, in terms of overall DMOS and vectorized DMOS, respectively. Moreover, we propose two objective VQA methods for the encoded omnidirectional video, in light of the human perception characteristics of the omnidirectional video. One method weighs the distortion of pixels with regard to their distances to the center of front regions, which considers human preference in a panorama. The other method predicts viewing directions according to the video content, and then the predicted viewing directions are leveraged to allocate weights to the distortion of each pixel in our objective VQA method. Finally, our experimental results verify that both the subjective and objective methods proposed in this paper advance the state-of-the-art VQA for omnidirectional videos.
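The first objective method weighs pixel distortion by distance to the center of the front region; a toy weighted-MSE sketch of that idea is shown below. The Gaussian falloff, frame size, and center coordinates are assumptions for illustration, not the paper's derived weighting.

```python
import numpy as np

def front_weighted_mse(ref, dist, center, sigma):
    """Weighted MSE where weights decay with distance from the front-region centre.
    The Gaussian falloff is illustrative; the paper derives its own weights."""
    H, W = ref.shape
    yy, xx = np.mgrid[0:H, 0:W]
    d2 = (yy - center[0]) ** 2 + (xx - center[1]) ** 2
    w = np.exp(-d2 / (2 * sigma**2))
    return np.sum(w * (ref - dist) ** 2) / np.sum(w)

ref = np.random.rand(180, 360)                 # toy equirectangular luma frame
dst = ref + 0.05 * np.random.randn(180, 360)   # toy compressed version
print(front_weighted_mse(ref, dst, center=(90, 180), sigma=60))
```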

149 citations


Journal ArticleDOI
TL;DR: In this paper, the authors evaluate density maps generated by density estimation methods on a variety of crowd analysis tasks, including counting, detection, and tracking, and propose several metrics for measuring the quality of a density map, and relate them to experiment results.
Abstract: For crowded scenes, the accuracy of object-based computer vision methods declines when the images are low-resolution and objects have severe occlusions. Taking counting methods for example, almost all the recent state-of-the-art counting methods bypass explicit detection and adopt regression-based methods to directly count the objects of interest. Among regression-based methods, density map estimation, where the number of objects inside a subregion is the integral of the density map over that subregion, is especially promising because it preserves spatial information, which makes it useful for both counting and localization (detection and tracking). With the power of deep convolutional neural networks (CNNs) the counting performance has improved steadily. The goal of this paper is to evaluate density maps generated by density estimation methods on a variety of crowd analysis tasks, including counting, detection, and tracking. Most existing CNN methods produce density maps with resolution that is smaller than the original images, due to the downsample strides in the convolution/pooling operations. To produce an original-resolution density map, we also evaluate a classical CNN that uses a sliding window regressor to predict the density for every pixel in the image. We also consider a fully convolutional adaptation, with skip connections from lower convolutional layers to compensate for loss in spatial information during upsampling. In our experiments, we found that the lower-resolution density maps sometimes have better counting performance. In contrast, the original-resolution density maps improved localization tasks, such as detection and tracking, compared with bilinear upsampling the lower-resolution density maps. Finally, we also propose several metrics for measuring the quality of a density map, and relate them to experiment results on counting and localization.
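The abstract's core definition, that the count in a subregion is the integral of the density map over that subregion, is easy to show directly. Below is a small NumPy/SciPy sketch that builds a toy density map from point annotations with Gaussian kernels and counts inside a region; the kernel width and frame size are arbitrary choices for illustration.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def density_from_points(points, shape, sigma=4.0):
    """Place a unit impulse at every annotated head position and blur it, so the
    map integrates to the number of people."""
    dm = np.zeros(shape)
    for y, x in points:
        dm[int(y), int(x)] += 1.0
    return gaussian_filter(dm, sigma)     # reflective boundary keeps the total mass

def count_in_region(dm, y0, y1, x0, x1):
    """Count = integral (sum) of the density map over the subregion."""
    return dm[y0:y1, x0:x1].sum()

pts = np.random.rand(50, 2) * [240, 320]  # 50 toy head annotations
dm = density_from_points(pts, (240, 320))
print(round(dm.sum(), 2), round(count_in_region(dm, 0, 120, 0, 320), 2))
```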

145 citations


Journal ArticleDOI
TL;DR: This work redesigned the skeleton representations with a depth-first tree traversal order, which enhanced the semantic meaning of skeleton images and better preserved the associated structural information, and proposed a general two-branch attention architecture that automatically focused on spatio–temporal key stages and filtered out unreliable joint predictions.
Abstract: Action recognition with 3D skeleton sequences became popular due to its speed and robustness. The recently proposed convolutional neural networks (CNNs)-based methods show a good performance in learning spatio–temporal representations for skeleton sequences. Despite the good recognition accuracy achieved by previous CNN-based methods, there existed two problems that potentially limit the performance. First, previous skeleton representations were generated by chaining joints with a fixed order. The corresponding semantic meaning was unclear and the structural information among the joints was lost. Second, previous models did not have an ability to focus on informative joints. The attention mechanism was important for skeleton-based action recognition because different joints contributed unequally toward the correct recognition. To solve these two problems, we proposed a novel CNN-based method for skeleton-based action recognition. We first redesigned the skeleton representations with a depth-first tree traversal order, which enhanced the semantic meaning of skeleton images and better preserved the associated structural information. We then proposed the general two-branch attention architecture that automatically focused on spatio–temporal key stages and filtered out unreliable joint predictions. Based on the proposed general architecture, we designed a global long-sequence attention network with refined branch structures. Furthermore, in order to adjust the kernel’s spatio–temporal aspect ratios and better capture long-term dependencies, we proposed a sub-sequence attention network (SSAN) that took sub-image sequences as inputs. We showed that the two-branch attention architecture could be combined with the SSAN to further improve the performance. Our experiment results on the NTU RGB+D data set and the SBU kinetic interaction data set outperformed the state of the art. The model was further validated on noisy estimated poses from the subsets of the UCF101 data set and the kinetics data set.
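The depth-first tree traversal re-ordering can be illustrated on a tiny skeleton: visiting each joint again when backtracking guarantees that every pair of consecutive entries is a real bone, which is the structural property the abstract says the fixed chaining order loses. The 5-joint tree below is illustrative, not the NTU RGB+D joint layout.

```python
def dfs_order(tree, root):
    """Depth-first traversal that records a joint again when backtracking, so every
    pair of consecutive entries corresponds to a physical bone."""
    order = [root]
    for child in tree.get(root, []):
        order += dfs_order(tree, child)
        order.append(root)
    return order

# toy skeleton: spine as root, a head and two arms (not the real dataset layout)
skeleton = {"spine": ["head", "l_arm", "r_arm"], "l_arm": ["l_hand"], "r_arm": ["r_hand"]}
print(dfs_order(skeleton, "spine"))
# a skeleton "image" is then built by stacking, per frame, the coordinates of the
# joints in this order as one column of the image
```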

144 citations


Journal ArticleDOI
TL;DR: This paper proposes a semi-supervised loss to jointly minimize the empirical error on labeled data, as well as the embedding error on both labeled and unlabeled data, which can preserve the semantic similarity and capture the meaningful neighbors on the underlying data structures for effective hashing.
Abstract: Hashing methods have been widely used for efficient similarity retrieval on large scale image database. Traditional hashing methods learn hash functions to generate binary codes from hand-crafted features, which achieve limited accuracy since the hand-crafted features cannot optimally represent the image content and preserve the semantic similarity. Recently, several deep hashing methods have shown better performance because the deep architectures generate more discriminative feature representations. However, these deep hashing methods are mainly designed for supervised scenarios, which only exploit the semantic similarity information, but ignore the underlying data structures. In this paper, we propose the semi-supervised deep hashing approach, to perform more effective hash function learning by simultaneously preserving semantic similarity and underlying data structures. The main contributions are as follows: 1) We propose a semi-supervised loss to jointly minimize the empirical error on labeled data, as well as the embedding error on both labeled and unlabeled data, which can preserve the semantic similarity and capture the meaningful neighbors on the underlying data structures for effective hashing. 2) A semi-supervised deep hashing network is designed to extensively exploit both labeled and unlabeled data, in which we propose an online graph construction method to benefit from the evolving deep features during training to better capture semantic neighbors. To the best of our knowledge, the proposed deep network is the first deep hashing method that can perform hash code learning and feature learning simultaneously in a semi-supervised fashion. Experimental results on five widely-used data sets show that our proposed approach outperforms the state-of-the-art hashing methods.
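Two ingredients the abstract names, online graph construction on the current deep features and an embedding error that pulls neighbours' codes together, can be sketched as below. The cosine k-NN graph, the relaxed (tanh) codes, and the toy features are assumptions for illustration; the paper's network and training procedure are not reproduced here.

```python
import numpy as np

def knn_graph(feats, k=5):
    """Symmetric adjacency of the k nearest neighbours under cosine similarity."""
    f = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8)
    sim = f @ f.T
    np.fill_diagonal(sim, -np.inf)
    nbrs = np.argsort(sim, axis=1)[:, -k:]
    A = np.zeros_like(sim)
    rows = np.arange(len(feats))[:, None]
    A[rows, nbrs] = 1.0
    return np.maximum(A, A.T)

def embedding_error(codes, A):
    """Average squared distance between the codes of connected (neighbouring) samples."""
    i, j = np.nonzero(A)
    return np.mean(np.sum((codes[i] - codes[j]) ** 2, axis=1))

feats = np.random.randn(200, 64)                   # toy deep features (labeled + unlabeled)
codes = np.tanh(feats @ np.random.randn(64, 32))   # toy 32-bit relaxed hash codes
print(embedding_error(codes, knn_graph(feats)))
```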

140 citations


Journal ArticleDOI
TL;DR: This paper proposes a novel image steganography framework that is robust for communication channels offered by various social networks, and proposes a coefficient adjustment scheme to slightly modify the original image based on the stego-image.
Abstract: Posting images on social network platforms is happening everywhere and every single second. Thus, the communication channels offered by various social networks have a great potential for covert communication. However, images transmitted through such channels will usually be JPEG compressed, which fails most of the existing steganographic schemes. In this paper, we propose a novel image steganography framework that is robust for such channels. In particular, we first obtain the channel compressed version (i.e., the channel output) of the original image. Secret data is embedded into the channel compressed original image by using any of the existing JPEG steganographic schemes, which produces the stego-image after the channel transmission. To generate the corresponding image before the channel transmission (termed the intermediate image), we propose a coefficient adjustment scheme to slightly modify the original image based on the stego-image. The adjustment is done such that the channel compressed version of the intermediate image is exactly the same as the stego-image. Therefore, after the channel transmission, secret data can be extracted from the stego-image with 100% accuracy. Various experiments are conducted to show the effectiveness of the proposed framework for image steganography robust to JPEG compression.
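The coefficient-adjustment idea can be shown with a toy model in which the social-network channel is replaced by plain uniform quantization: nudge each "intermediate" value just enough that the channel maps it exactly onto the stego value, so extraction after transmission is exact. This is only a conceptual sketch; the real framework operates on JPEG coefficients and an actual recompression channel.

```python
import numpy as np

def channel(x, q=8):
    """Toy stand-in for the social-network channel: uniform quantisation with step q."""
    return np.round(x / q) * q

def adjust(original, stego, q=8):
    """Shift each original value just enough that the channel maps it onto the
    corresponding stego value (the 'intermediate image' idea)."""
    delta = stego - channel(original, q)
    return original + delta

orig = np.random.rand(1000) * 255                            # toy coefficient values
stego = channel(orig) + 8 * np.random.randint(-1, 2, 1000)   # toy embedded values
intermediate = adjust(orig, stego)
print(np.all(channel(intermediate) == stego))                # True: extraction is exact
```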

Journal ArticleDOI
TL;DR: A heterogeneous association graph is constructed that fuses high-level detections and low-level image evidence for target association and the novel idea of adaptive weights is proposed to analyze the contribution between motion and appearance.
Abstract: Tracking-by-detection is one of the most popular approaches to tracking multiple objects in which the detector plays an important role. Sometimes, detector failures caused by occlusions or various poses are unavoidable and lead to tracking failure. To cope with this problem, we construct a heterogeneous association graph that fuses high-level detections and low-level image evidence for target association. Compared with other methods using low-level information, our proposed heterogeneous association fusion (HAF) tracker is less sensitive to particular parameters and is easier to extend and implement. We use the fused association graph to build track trees for HAF and solve them by the multiple hypotheses tracking framework, which has been proven to be competitive by introducing efficient pruning strategies. In addition, the novel idea of adaptive weights is proposed to analyze the contribution between motion and appearance. We also evaluated our results on the MOT challenge benchmarks and achieved state-of-the-art results on the MOT Challenge 2017.

Journal ArticleDOI
TL;DR: This work proposes a novel enhancement framework that uses the response characteristics of cameras to lower distortions and obtains enhancement results with fewer color and lightness distortions than several state-of-the-art methods.
Abstract: Low-light image enhancement algorithms can improve the visual quality of low-light images and support the extraction of valuable information for some computer vision techniques. However, existing techniques inevitably introduce color and lightness distortions when enhancing the images. To lower the distortions, we propose a novel enhancement framework using the response characteristics of cameras. First, we discuss how to determine a reasonable camera response model and its parameters. Then, we use illumination estimation techniques to estimate the exposure ratio for each pixel. Finally, the selected camera response model is used to adjust each pixel to the desired exposure according to the estimated exposure ratio map. Experiments show that our method can obtain enhancement results with fewer color and lightness distortions compared with several state-of-the-art methods.
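A toy sketch of the pipeline the abstract outlines, using a simple gamma curve as the camera response model and a smoothed max-RGB map as the illumination/exposure-ratio estimate. Both choices are assumptions for illustration; the paper selects and calibrates its own response model and illumination estimator.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def enhance(img, target=0.5, gamma=2.2, eps=1e-3):
    """Brighten a low-light image by pushing each pixel toward a target exposure
    through a gamma-style response curve.

    img: float RGB image with values in [0, 1].
    """
    illum = uniform_filter(img.max(axis=2), size=15)    # rough illumination estimate
    ratio = np.clip(target / (illum + eps), 1.0, 8.0)   # per-pixel exposure ratio
    # gamma-style response: scaling exposure by `ratio` scales intensity by ratio**(1/gamma)
    return np.clip(img * ratio[..., None] ** (1.0 / gamma), 0.0, 1.0)

low = np.random.rand(120, 160, 3) * 0.2                 # toy under-exposed image
out = enhance(low)
print(low.mean(), out.mean())
```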

Journal ArticleDOI
TL;DR: The authors propose a two-stream collaborative learning with spatial-temporal attention (TCLSTA) approach, which consists of a spatial-temporal attention model, emphasizing salient regions in each frame and discriminative frames in a video, and a static-motion collaborative model, which adaptively fuses the static and motion streams.
Abstract: Video classification is highly important and has widespread applications, such as video search and intelligent surveillance. Video naturally contains both static and motion information, which can be represented by frames and optical flow, respectively. Recently, researchers have generally adopted deep networks to capture the static and motion information separately, which has two main limitations. First, the coexistence relationship between spatial and temporal attention is ignored, although they should be jointly modeled as the spatial and temporal evolutions of video to learn discriminative video features. Second, the strong complementarity between static and motion information is ignored, although they should be collaboratively learned to enhance each other. To address the above two limitations, this paper proposes the two-stream collaborative learning with spatial-temporal attention (TCLSTA) approach, which consists of two models. First, for the spatial-temporal attention model, the spatial-level attention emphasizes the salient regions in a frame, and the temporal-level attention exploits the discriminative frames in a video. They are mutually enhanced to jointly learn the discriminative static and motion features for better classification performance. Second, for the static-motion collaborative model, it not only achieves mutual guidance between static and motion information to enhance the feature learning but also adaptively learns the fusion weights of static and motion streams, thus exploiting the strong complementarity between static and motion information to improve video classification. Experiments on four widely used data sets show that our TCLSTA approach achieves the best performance compared with more than 10 state-of-the-art methods.
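The "adaptively learns the fusion weights of static and motion streams" part can be reduced to a very small sketch: per-class scores from the two streams are combined with softmax-normalised weights. The weights and scores below are toy stand-ins for the learned ones; the attention models themselves are not sketched here.

```python
import numpy as np

def fuse_streams(static_scores, motion_scores, w):
    """Fuse per-class scores of the static and motion streams with
    softmax-normalised weights (toy stand-in for the learned fusion weights)."""
    w = np.exp(w) / np.exp(w).sum()
    return w[0] * static_scores + w[1] * motion_scores

static = np.random.rand(101)    # toy per-class scores from the frame (static) stream
motion = np.random.rand(101)    # toy per-class scores from the optical-flow stream
fused = fuse_streams(static, motion, w=np.array([0.2, 0.8]))
print(int(fused.argmax()))
```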

Journal ArticleDOI
TL;DR: This work solves the sequence learning problem as an image classification task using convolutional neural networks and builds a classification network with stacked residual blocks and a special design called linear skip gated connection, which benefits information propagation across multiple residual blocks.
Abstract: For skeleton-based action recognition, most of the existing works used recurrent neural networks. Using convolutional neural networks (CNNs) is another attractive solution considering their advantages in parallelization, effectiveness in feature learning, and model base sufficiency. Besides these, skeleton data are low-dimensional features. It is natural to arrange a sequence of skeleton features chronologically into an image, which retains the original information. Therefore, we solve the sequence learning problem as an image classification task using CNNs. For better learning ability, we build a classification network with stacked residual blocks and having a special design called linear skip gated connection which can benefit information propagation across multiple residual blocks. When arranging the coordinates of body joints in one frame into a skeleton feature, we systematically investigate the performance of part-based, chain-based, and traversal-based orders. Furthermore, a fully convolutional permutation network is designed to learn an optimized order for data rearrangement. Without any bells and whistles, our proposed model achieves state-of-the-art performance on two challenging benchmark datasets, outperforming existing methods significantly.
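A small sketch of "arranging a sequence of skeleton features chronologically into an image": frames become columns and the x/y/z coordinates of the (re-ordered) joints become rows, giving a pseudo-image a CNN can classify. The joint count, random ordering, and normalisation below are illustrative assumptions, not the paper's exact arrangement.

```python
import numpy as np

def skeleton_to_image(seq, joint_order):
    """Arrange a skeleton sequence into a pseudo-image for a CNN.

    seq:         (T, J, 3) array of 3D joint coordinates over T frames.
    joint_order: permutation of the J joints (part-, chain-, or traversal-based).
    Returns an image of shape (3*J, T): frames are columns, coordinates are rows.
    """
    reordered = seq[:, joint_order, :]                 # (T, J, 3)
    img = reordered.reshape(seq.shape[0], -1).T        # (3*J, T)
    img = (img - img.min()) / (img.ptp() + 1e-8) * 255 # normalise like an 8-bit image
    return img.astype(np.uint8)

seq = np.random.randn(64, 25, 3)                       # 64 frames, 25 joints (toy data)
order = np.random.permutation(25)                      # stand-in for a chain/traversal order
print(skeleton_to_image(seq, order).shape)             # (75, 64)
```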

Journal ArticleDOI
TL;DR: This paper presents attention-based 3D-convolutional neural networks (3D-CNNs) for sign language recognition (SLR): the 3D-CNNs learn spatio-temporal features from raw video without prior knowledge, and the attention mechanism helps select the relevant cues.
Abstract: Sign language recognition (SLR) is an important and challenging research topic in the multimedia field. Conventional techniques for SLR rely on hand-crafted features, which achieve limited success. In this paper, we present attention-based 3D-convolutional neural networks (3D-CNNs) for SLR. The framework has two advantages: 3D-CNNs learn spatio-temporal features from raw video without prior knowledge and the attention mechanism helps to select the clue. When training 3D-CNN for capturing spatio-temporal features, spatial attention is incorporated into the network to focus on the areas of interest. After feature extraction, temporal attention is utilized to select the significant motions for classification. The proposed method is evaluated on two large scale sign language data sets. The first one, collected by ourselves, is a Chinese sign language data set that consists of 500 categories. The other is the ChaLearn14 benchmark. The experiment results demonstrate the effectiveness of our approach compared with state-of-the-art algorithms.

Journal ArticleDOI
TL;DR: This paper proposes a novel iterative maximum weighted independent set (MWIS) algorithm for multiple hypothesis tracking (MHT) in a tracking-by-detection framework and proposes a polynomial-time approximation algorithm for the MWIS problem in MHT.
Abstract: This paper proposes a novel iterative maximum weighted independent set (MWIS) algorithm for multiple hypothesis tracking (MHT) in a tracking-by-detection framework. MHT converts the tracking problem into a series of MWIS problems across the tracking time. Previous works solve these NP-hard MWIS problems independently without the use of any prior information from each frame, and they ignore the relevance between adjacent frames. In this paper, we iteratively solve the MWIS problems by using the MWIS solution from the previous frame rather than solving the problem from scratch each time. First, we define five hypothesis categories and a hypothesis transfer model, which explicitly describes the hypothesis relationship between adjacent frames. We also propose a polynomial-time approximation algorithm for the MWIS problem in MHT. In addition, we present a confident short tracklet generation method and incorporate tracklet-level association into MHT, which further improves the computational efficiency. Our experiments show that our tracker outperforms all previously published tracking algorithms on both the MOT16 and MOT17 benchmarks. Finally, we demonstrate that the polynomial-time approximate tracker reaches nearly the same tracking performance.
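To make the problem concrete, the sketch below solves one MWIS instance with a simple greedy heuristic: pick non-conflicting track hypotheses with the highest scores. This is not the paper's polynomial-time approximation or its iterative warm start; the weights and conflict pairs are toy data.

```python
import numpy as np

def greedy_mwis(weights, conflicts):
    """Greedy MWIS: repeatedly take the highest-weight hypothesis that does not
    conflict with anything already selected.

    weights:   list of hypothesis scores.
    conflicts: set of (i, j) pairs of hypotheses that share a detection.
    """
    adj = {i: set() for i in range(len(weights))}
    for i, j in conflicts:
        adj[i].add(j)
        adj[j].add(i)
    selected = []
    for i in np.argsort(weights)[::-1]:
        if not adj[int(i)] & set(selected):
            selected.append(int(i))
    return selected

w = [3.0, 2.5, 2.0, 1.5, 1.0]                 # toy hypothesis scores
c = {(0, 1), (1, 2), (3, 4)}                  # hypotheses sharing detections
print(greedy_mwis(w, c))                      # -> [0, 2, 3]
```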

Journal ArticleDOI
TL;DR: This paper proposes a novel, end-to-end approach to video generation using generative adversarial networks (GANs), which involves two concatenated GANs, one capturing motions and the other generating frame details.
Abstract: Given two video frames $X_{0}$ and $X_{n+1}$ , we aim to generate a series of intermediate frames $Y_{1}, Y_{2}, \ldots, Y_{n}$ , such that the resulting video consisting of frames $X_{0}, Y_{1}, \ldots, Y_{n}, X_{n+1}$ appears realistic to a human watcher. Such video generation has numerous important applications, including video compression, movie production, slow-motion filming, video surveillance, and forensic analysis. Yet, video generation is highly challenging due to the vast search space of possible frames. Previous methods, mostly based on video prediction and/or video interpolation, tend to generate poor-quality videos with severe motion blur. This paper proposes a novel, end-to-end approach to video generation using generative adversarial networks (GANs). In particular, our design involves two concatenated GANs, one capturing motions and the other generating frame details. The loss function is also carefully engineered to include adversarial loss, gradient difference (for motion learning), and normalized product correlation loss (for frame details). Experiments using three video datasets, namely, Google Robotic Push, KTH human actions, and UCF101, demonstrate that the proposed solution generates high-quality, realistic, and sharp videos, whereas all previous solutions output noisy and blurry results.
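Of the loss terms listed above, the gradient difference term has a simple standard form; a NumPy sketch on toy frames is shown below (the adversarial and normalized product correlation terms are omitted, and the exponent alpha is an illustrative choice).

```python
import numpy as np

def gradient_difference_loss(gen, real, alpha=1.0):
    """Penalise differences between the spatial gradients of generated and real
    frames, which discourages the motion blur that plain L2 losses tend to produce."""
    gy_g, gx_g = np.abs(np.diff(gen, axis=0)), np.abs(np.diff(gen, axis=1))
    gy_r, gx_r = np.abs(np.diff(real, axis=0)), np.abs(np.diff(real, axis=1))
    return np.mean(np.abs(gy_g - gy_r) ** alpha) + np.mean(np.abs(gx_g - gx_r) ** alpha)

gen = np.random.rand(64, 64)     # toy generated frame
real = np.random.rand(64, 64)    # toy ground-truth frame
print(gradient_difference_loss(gen, real))
```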

Journal ArticleDOI
TL;DR: A dual-stream RNN (DS-RNN) framework to jointly discover and integrate the hidden states of both visual and semantic streams for video caption generation is proposed and achieves competitive performance against the state-of-the-art.
Abstract: Recent progress in using recurrent neural networks (RNNs) for video description has attracted increasing interest, due to their capability to encode a sequence of frames for caption generation. While existing methods have studied various features (e.g., CNN, 3D CNN, and semantic attributes) for visual encoding, the representation and fusion of heterogeneous information from multi-modal spaces have not been fully explored. Considering that different modalities are often asynchronous, frame-level multi-modal fusion (e.g., concatenation and linear fusion) will negatively influence each modality. In this paper, we propose a dual-stream RNN (DS-RNN) framework to jointly discover and integrate the hidden states of both visual and semantic streams for video caption generation. First, an encoding RNN is used for each stream to flexibly exploit the hidden states of the respective modality. Specifically, we propose an attentive multi-grained encoder module to enhance local feature learning with global semantic features. Then, a dual-stream decoder is deployed to integrate the asynchronous yet complementary sequential hidden states from both streams for caption generation. Extensive experiments on three benchmark datasets, namely, MSVD, MSR-VTT, and MPII-MD, show that DS-RNN achieves competitive performance against the state-of-the-art. Additional ablation studies were conducted on various variants of the proposed DS-RNN.

Journal ArticleDOI
TL;DR: A novel end-to-end multi-focus image fusion with natural enhancement method based on a deep convolutional neural network (CNN) that delivers superior fusion and enhancement performance compared with state-of-the-art methods in the presence of multi-focus images with common non-focused areas, anisotropic blur, and misregistration.
Abstract: Common non-focused areas are often present in multi-focus images due to the limitation of the number of focused images. This factor severely degrades the fusion quality of multi-focus images. To address this problem, we propose a novel end-to-end multi-focus image fusion with a natural enhancement method based on a deep convolutional neural network (CNN). Several end-to-end CNN architectures that are specifically adapted to this task are first designed and researched. On the basis of the observation that low-level feature extraction can capture low-frequency content, whereas high-level feature extraction effectively captures high-frequency details, we further combine multi-level outputs such that the most visually distinctive features can be extracted, fused, and enhanced. In addition, the multi-level outputs are simultaneously supervised during training to boost the performance of image fusion and enhancement. Extensive experiments show that the proposed method can deliver superior fusion and enhancement performance compared with state-of-the-art methods in the presence of multi-focus images with common non-focused areas, anisotropic blur, and misregistration.

Journal ArticleDOI
TL;DR: The proposed framework outperforms previous works on reversible data hiding in encrypted images: since the tasks of data embedding/extraction and bitstream recovery are all accomplished by the server, the image owner and the authorized user need not perform any extra operations beyond JPEG encryption or decryption.
Abstract: This paper proposes a novel framework of reversible data hiding in encrypted JPEG bitstream. We first provide a JPEG encryption algorithm to encipher a JPEG image to a smaller size and keep the format compliant to JPEG decoders. After an image owner uploads the encrypted JPEG bitstreams to cloud storage, the server embeds additional messages into the ciphertext to construct a marked encrypted JPEG bitstream. During data hiding, we propose a combined embedding algorithm including two stages, the Huffman code mapping and the ordered histogram shifting. The embedding procedure is reversible. When an authorized user requires a downloading operation, the server extracts additional messages from the marked encrypted JPEG bitstream and recovers the original encrypted bitstream losslessly. After downloading, the user obtains the original JPEG bitstream by a direct decryption. The proposed framework outperforms previous works on reversible data hiding in encrypted images. First, since the tasks of data embedding/extraction and bitstream recovery are all accomplished by the server, the image owner and the authorized user are required to implement no extra operations except JPEG encryption or decryption. Second, the embedding payload is larger than state-of-the-art works.

Journal ArticleDOI
TL;DR: In rigorous experiments, the proposed algorithms demonstrate state-of-the-art performance on multiple video applications and are made available as part of the open-source package at https://github.com/Netflix/vmaf.
Abstract: The recently developed video multi-method assessment fusion (VMAF) framework integrates multiple quality-aware features to accurately predict the video quality. However, the VMAF does not yet exploit important principles of temporal perception that are relevant to the perceptual video distortion measurement. Here, we propose two improvements to the VMAF framework, called spatiotemporal VMAF and ensemble VMAF, which leverage perceptually-motivated space–time features that are efficiently calculated at multiple scales. We also conducted a large subjective video study, which we have found to be an excellent resource for training our feature-based approaches. In rigorous experiments, we found that the proposed algorithms demonstrate state-of-the-art performance on multiple video applications. The compared algorithms will be made available as part of the open-source package at https://github.com/Netflix/vmaf.

Journal ArticleDOI
TL;DR: A new mobile video quality database containing videos afflicted with distortions caused by 26 different stalling patterns is presented and made publicly available in order to help advance state-of-the-art research on user-centric mobile network planning and management.
Abstract: Over-the-top mobile adaptive video streaming is invariably influenced by volatile network conditions, which can cause playback interruptions (stalling or rebuffering events) and bitrate fluctuations, thereby impairing users’ quality of experience (QoE). Video quality assessment models that can accurately predict users’ QoE under such volatile network conditions are rapidly gaining attention, since these methods could enable more efficient design of quality control protocols for media-driven services such as YouTube, Amazon, Netflix, and many others. However, the development of improved QoE prediction models requires data sets of videos afflicted with diverse stalling events that have been labeled with ground-truth subjective opinion scores. Toward this end, we have created a new mobile video quality database that we call LIVE Mobile Stall Video Database-II. Our database contains a total of 174 videos afflicted with distortions caused by 26 different stalling patterns. We describe the way we simulated the diverse stalling events to create a corpus of distorted videos, and we detail the human study we conducted to obtain continuous-time subjective scores from 54 subjects. We also present the outcomes of our comprehensive analysis of the impact of several factors that influence subjective QoE, and report the performance of existing QoE-prediction models on our data set. We are making the database (videos, subjective data, and video metadata) publicly available in order to help advance state-of-the-art research on user-centric mobile network planning and management. The database may be accessed at http://live.ece.utexas.edu/research/LIVEStallStudy/liveMobile.html.

Journal ArticleDOI
TL;DR: A novel deep side semantic embedding (DSSE) model is presented to generate video summaries by leveraging freely available side information, and the superior performance of DSSE over several state-of-the-art video summarization approaches is demonstrated.
Abstract: With the rapid growth of video content, video summarization, which focuses on automatically selecting important and informative parts from videos, is becoming increasingly crucial. However, the problem is challenging due to its subjectiveness. Previous research, which predominantly relies on manually designed criteria or resourcefully expensive human annotations, often fails to achieve satisfying results. We observe that the side information associated with a video (e.g., surrounding text such as titles, queries, descriptions, comments, and so on) represents a kind of human-curated semantics of video content. This side information, although valuable for video summarization, is overlooked in existing approaches. In this paper, we present a novel deep side semantic embedding (DSSE) model to generate video summaries by leveraging the freely available side information. The DSSE constructs a latent subspace by correlating the hidden layers of the two uni-modal autoencoders, which embed the video frames and side information, respectively. Specifically, by interactively minimizing the semantic relevance loss and the feature reconstruction loss of the two uni-modal autoencoders, the comparable common information between video frames and side information can be more completely learned. Therefore, their semantic relevance can be more effectively measured. Finally, semantically meaningful segments are selected from videos by minimizing their distances to the side information in the constructed latent subspace. We conduct experiments on two datasets (Thumb1K and TVSum50) and demonstrate the superior performance of DSSE over several state-of-the-art approaches to video summarization.
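The final selection step, choosing segments whose latent embeddings lie closest to the side-information embedding, can be sketched in a few lines. The cosine distance and the random embeddings below are assumptions for illustration; the learned autoencoders that produce the real embeddings are not reproduced.

```python
import numpy as np

def select_segments(seg_emb, side_emb, k):
    """Pick the k video segments whose latent embeddings are closest (cosine
    distance) to the side-information embedding."""
    s = seg_emb / (np.linalg.norm(seg_emb, axis=1, keepdims=True) + 1e-8)
    t = side_emb / (np.linalg.norm(side_emb) + 1e-8)
    dist = 1.0 - s @ t
    return np.argsort(dist)[:k]

segments = np.random.randn(120, 64)   # toy latent embeddings of 120 shots
side = np.random.randn(64)            # toy embedding of title/query/description text
print(select_segments(segments, side, k=5))
```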

Journal ArticleDOI
TL;DR: This paper presents a novel embedding framework with reduced distortion called skewed histogram shifting using a pair of extreme predictions, where only the pixels from the peak and the short tail are used for embedding, which decreases distortion because fewer pixels are shifted.
Abstract: Reversible data hiding hides data in an image such that the original image is recoverable. This paper presents a novel embedding framework with reduced distortion called skewed histogram shifting using a pair of extreme predictions. Unlike traditional prediction error histogram shifting schemes, where only one good prediction is used to generate a prediction error histogram, the proposed scheme uses a pair of extreme predictions to generate two skewed histograms. By exploiting the structure of the skewed histogram, only the pixels from the peak and the short tail are used for embedding, which decreases the distortion from the lesser number of pixels being shifted. Detailed experiments and analysis are provided using several image databases.
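For readers unfamiliar with histogram shifting, here is a toy 1D sketch of ordinary prediction-error histogram shifting (single peak, one-sided shift), which is the mechanism the skewed variant above refines. The left-neighbour predictor, peak bin, and toy pixel row are assumptions; the paper's extreme-prediction scheme is not reproduced.

```python
import numpy as np

def embed(pixels, bits, peak=0):
    """Embed bits into prediction errors (left-neighbour predictor).
    Errors equal to `peak` carry one bit; errors greater than `peak` shift by 1."""
    out = pixels.astype(int)
    k = 0
    for i in range(1, len(out)):
        e = out[i] - out[i - 1]                # error w.r.t. the (already marked) left neighbour
        if e == peak and k < len(bits):
            out[i] += bits[k]; k += 1          # expand the peak bin to hide a bit
        elif e > peak:
            out[i] += 1                        # shift the right tail to make room
    return out

def extract(marked, n_bits, peak=0):
    """Recover the bits and the original pixels (inverse of `embed`)."""
    rec = marked.astype(int)
    bits = []
    for i in range(1, len(rec)):
        e = int(marked[i]) - int(marked[i - 1])
        if e in (peak, peak + 1) and len(bits) < n_bits:
            bits.append(e - peak); rec[i] -= e - peak
        elif e > peak + 1:
            rec[i] -= 1
    return bits, rec

px = np.array([100, 100, 101, 101, 101, 103, 100, 100])
msg = [1, 0, 1]
marked = embed(px, msg)
bits, restored = extract(marked, len(msg))
print(bits, np.array_equal(restored, px))      # [1, 0, 1] True: fully reversible
```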

Journal ArticleDOI
Yingying Chen, Jinqiao Wang, Bingke Zhu, Ming Tang, Hanqing Lu
TL;DR: An end-to-end deep sequence learning architecture for moving object detection is proposed, together with a novel attention long short-term memory (Attention ConvLSTM) to model pixel-wise changes over time.
Abstract: Moving object detection is an essential, well-studied but still open problem in computer vision and plays a fundamental role in many applications. Traditional approaches usually reconstruct background images with hand-crafted visual features, such as color, texture, and edge. Due to lack of prior knowledge or semantic information, it is difficult to deal with complicated and rapidly changing scenes. To exploit the temporal structure of the pixel-level semantic information, in this paper, we propose an end-to-end deep sequence learning architecture for moving object detection. First, the video sequences are input into a deep convolutional encoder–decoder network for extracting pixel-wise semantic features. Then, to exploit the temporal context, we propose a novel attention long short-term memory (Attention ConvLSTM) to model pixel-wise changes over time. A spatial transformer network and a conditional random field layer are finally appended to reduce the sensitivity to camera motion and smooth the foreground boundaries. A multi-task loss is proposed to jointly optimize frame-based classification and temporal prediction in an end-to-end network. Experimental results on CDnet 2014 and LASIESTA show 12.15% and 16.71% improvement over the state of the art, respectively.

Journal ArticleDOI
TL;DR: A deep continuous conditional random field (DCCRF) is proposed for solving the online MOT problem in a track-by-detection framework and is trained in an end-to-end manner for better adapting the influences of visual information as well as inter-object relations.
Abstract: Online multi-object tracking (MOT) is a challenging problem and has many important applications including intelligent surveillance, robot navigation, and autonomous driving. In existing MOT methods, individual objects' movements and inter-object relations are mostly modeled separately and relations between them are still manually tuned. In addition, inter-object relations are mostly modeled in a symmetric way, which we argue is not an optimal setting. To tackle those difficulties, in this paper, we propose a deep continuous conditional random field (DCCRF) for solving the online MOT problem in a track-by-detection framework. The DCCRF consists of unary and pairwise terms. The unary terms estimate tracked objects’ displacements across time based on visual appearance information. They are modeled as deep convolution neural networks, which are able to learn discriminative visual features for tracklet association. The asymmetric pairwise terms model inter-object relations in an asymmetric way, which encourages high-confidence tracklets to help correct errors of low-confidence tracklets and not to be affected by low-confidence ones much. The DCCRF is trained in an end-to-end manner for better adapting the influences of visual information as well as inter-object relations. Extensive experimental comparisons with state-of-the-art methods as well as detailed component analysis of our proposed DCCRF on two public benchmarks demonstrate the effectiveness of our proposed MOT framework.

Journal ArticleDOI
TL;DR: A 2D-LBP method is proposed that uses a sliding window to count the weighted occurrence number of rotation-invariant uniform LBP pattern pairs to obtain spatial contextual information; it obtains higher classification accuracy in different cases and simultaneously has lower time complexity.
Abstract: The local binary pattern (LBP) and its variants have shown their effectiveness in texture image classification, face recognition, and other applications. However, most of these LBP methods only focus on the histogram of LBP patterns and ignore the spatial contextual information between LBP patterns. In this paper, we propose a 2D-LBP method which uses a sliding window to count the weighted occurrence number of the rotation-invariant uniform LBP pattern pairs to obtain the spatial contextual information. Multi-resolution 2D-LBP features can also be obtained when the radius of 2D-LBP is changed. At last, a two-stage classifier, which acts as an ensemble learning step, is applied to achieve an accurate classification by combining the predictions on each single-resolution 2D-LBP. Theoretical proof shows that the proposed 2D-LBP is a general framework and can be integrated with other LBP variants to derive new feature extraction methods. Experimental results show that the proposed method achieves 99.71%, 97.09%, 98.48%, and 49.00% classification accuracy on the public “Brodatz,” “CUReT,” “UIUC,” and “FMD” texture image databases, respectively. Compared with the original LBP and its variants, the proposed method obtains higher classification accuracy in different cases and simultaneously has lower time complexity.
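A compact sketch of the spatial-context ingredient described above: compute a basic 8-neighbour LBP code per pixel (not the rotation-invariant uniform variant the paper uses) and count co-occurring code pairs at a fixed offset, which is the information a plain LBP histogram discards. The offset, window handling, and toy image are simplifying assumptions.

```python
import numpy as np

def lbp_codes(img):
    """Basic 8-neighbour LBP code per interior pixel (simplified, not riu2)."""
    c = img[1:-1, 1:-1]
    neighbours = [img[0:-2, 0:-2], img[0:-2, 1:-1], img[0:-2, 2:],
                  img[1:-1, 2:],   img[2:, 2:],     img[2:, 1:-1],
                  img[2:, 0:-2],   img[1:-1, 0:-2]]
    codes = np.zeros(c.shape, dtype=np.int64)
    for bit, n in enumerate(neighbours):
        codes += (n >= c).astype(np.int64) << bit
    return codes

def pair_histogram(codes, offset=(0, 1), bins=256):
    """Co-occurrence histogram of LBP code pairs at a fixed offset."""
    dy, dx = offset
    a = codes[:codes.shape[0] - dy, :codes.shape[1] - dx]
    b = codes[dy:, dx:]
    hist = np.zeros((bins, bins), dtype=np.int64)
    np.add.at(hist, (a.ravel(), b.ravel()), 1)
    return hist

img = (np.random.rand(64, 64) * 255).astype(np.float32)   # toy texture patch
h2d = pair_histogram(lbp_codes(img))
print(h2d.shape, h2d.sum())
```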

Journal ArticleDOI
Yu Zhang, Xinbo Gao, Lihuo He, Wen Lu, Ran He
TL;DR: A general-purpose no-reference VQA framework that is based on weakly supervised learning with a convolutional neural network (CNN) and a resampling strategy that is on a par with some state-of-the-art VQA metrics and has promising robustness.
Abstract: Due to the 3D spatiotemporal regularities of natural videos and small-scale video quality databases, effective objective video quality assessment (VQA) metrics are difficult to obtain but highly desirable. In this paper, we propose a general-purpose no-reference VQA framework that is based on weakly supervised learning with a convolutional neural network (CNN) and a resampling strategy. First, an eight-layer CNN is trained by weakly supervised learning to construct the relationship between the deformations of the 3D discrete cosine transform of video blocks and the corresponding weak labels judged by a full-reference (FR) VQA metric. Thus, the CNN obtains the quality assessment capacity converted from the FR-VQA metric, and the effective features of the distorted videos can be extracted through the trained network. Then, we map the frequency histogram calculated from the quality score vectors predicted by the trained network onto the perceptual quality. Especially, to improve the performance of the mapping function, we transfer the frequency histogram of the distorted images and videos to resample the training set. The experiments are carried out on several widely used VQA databases. The experimental results demonstrate that the proposed method is on a par with some state-of-the-art VQA metrics and has promising robustness.

Journal ArticleDOI
TL;DR: The adaptive pixel pairing (APP) and the adaptive mapping selection for the enhancement of pairwise PEE are proposed and shown to increase the similarity between pixels in a pair, by excluding the rough pixels from pairing and only putting the smooth pixels into pairs.
Abstract: Pairwise prediction-error expansion (pairwise PEE) is a recent technique for high-dimensional reversible data hiding. However, in the absence of adaptive embedding, its potential has not been fully exploited. In this paper, we propose adaptive pixel pairing (APP) and adaptive mapping selection for the enhancement of pairwise PEE. Our motivation is twofold: building a sharper 2D histogram and designing an effective 2D mapping for it. In APP, we consider increasing the similarity between pixels in a pair by excluding the rough pixels from pairing and only putting the smooth pixels into pairs. In this way, the pixels in a pair have a larger possibility of being equal, and thus the resulting 2D prediction-error histogram (PEH) has lower entropy. Next, the adaptive mapping selection mechanism is introduced to properly determine the optimal modification, based on “whether it fits the resulting PEH” rather than heuristic experience. The experimental results show that the proposed method offers a significant improvement over pairwise PEE.
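A toy sketch of the two-step idea above: estimate local roughness, pair only the smooth pixels, and build the 2D prediction-error histogram of those pairs (the 2D embedding mapping itself is omitted). The roughness measure, predictor, and synthetic image are simplified assumptions, not the paper's exact construction.

```python
import numpy as np

def pair_smooth_pixels(img, thresh=5.0):
    """Pair horizontally adjacent pixels only where the local gradient is small,
    so the two prediction errors in a pair are more likely to be equal."""
    rough = np.abs(np.diff(img.astype(float), axis=1))          # simple roughness proxy
    pairs = []
    for y in range(img.shape[0]):
        for x in range(0, img.shape[1] - 3, 2):
            if rough[y, x] < thresh and rough[y, x + 1] < thresh:
                pairs.append(((y, x), (y, x + 1)))
    return pairs

def peh_2d(img, pairs, bins=9):
    """2D prediction-error histogram of the pairs, predicting both pixels from the
    pixel just to the right of the pair."""
    half = bins // 2
    hist = np.zeros((bins, bins), dtype=int)
    for (y, x1), (_, x2) in pairs:
        e1 = int(img[y, x1]) - int(img[y, x2 + 1])
        e2 = int(img[y, x2]) - int(img[y, x2 + 1])
        if abs(e1) <= half and abs(e2) <= half:
            hist[e1 + half, e2 + half] += 1
    return hist

img = np.clip(128 + np.cumsum(np.random.randn(64, 64), axis=1), 0, 255).astype(np.uint8)
pairs = pair_smooth_pixels(img)
print(len(pairs), peh_2d(img, pairs).sum())
```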