
Showing papers in "IEEE Transactions on Circuits and Systems for Video Technology in 2018"


Journal ArticleDOI
TL;DR: A deep learning framework, called T-CNN, is proposed that incorporates temporal and contextual information from tubelets obtained in videos and dramatically improves the baseline performance of existing still-image detection frameworks when they are applied to videos.
Abstract: The state-of-the-art performance for object detection has been significantly improved over the past two years. Besides the introduction of powerful deep neural networks, such as GoogleNet and VGG, novel object detection frameworks, such as R-CNN and its successors, Fast R-CNN, and Faster R-CNN, play an essential role in improving the state of the art. Despite their effectiveness on still images, those frameworks are not specifically designed for object detection from videos. Temporal and contextual information of videos is not fully investigated and utilized. In this paper, we propose a deep learning framework that incorporates temporal and contextual information from tubelets obtained in videos, which dramatically improves the baseline performance of existing still-image detection frameworks when they are applied to videos. It is called T-CNN, i.e., tubelets with convolutional neural networks. The proposed framework won the newly introduced object-detection-from-video task with provided data in the ImageNet Large-Scale Visual Recognition Challenge 2015. Code is publicly available at https://github.com/myfavouritekk/T-CNN .

467 citations


Journal ArticleDOI
TL;DR: This paper proposes to formulate depth estimation as a pixelwise classification task by discretizing the continuous ground-truth depths into several bins and labeling the bins according to their depth ranges.
Abstract: Depth estimation from single monocular images is a key component in scene understanding. Most existing algorithms formulate depth estimation as a regression problem due to the continuous property of depths. However, the depth value of input data can hardly be regressed exactly to the ground-truth value. In this paper, we propose to formulate depth estimation as a pixelwise classification task. Specifically, we first discretize the continuous ground-truth depths into several bins and label the bins according to their depth ranges. Then, we solve the depth estimation problem as classification by training a fully convolutional deep residual network. Compared with estimating the exact depth of a single point, it is easier to estimate its depth range. More importantly, by performing depth classification instead of regression, we can easily obtain the confidence of a depth prediction in the form of probability distribution. With this confidence, we can apply an information gain loss to make use of the predictions that are close to ground-truth during training, as well as fully-connected conditional random fields for post-processing to further improve the performance. We test our proposed method on both indoor and outdoor benchmark RGB-Depth datasets and achieve state-of-the-art performance.
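To make the binning idea above concrete, here is a minimal NumPy sketch of turning continuous depths into class labels and recovering a depth estimate from a per-pixel class distribution. The bin count, depth range, and log-spaced edges are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def depth_to_class_targets(depth_map, num_bins=80, d_min=0.5, d_max=10.0):
    """Discretize continuous ground-truth depths into class labels.
    Log-spaced bins are a common (assumed) choice; edges/counts are illustrative."""
    log_edges = np.linspace(np.log(d_min), np.log(d_max), num_bins + 1)
    clipped = np.clip(depth_map, d_min, d_max)
    # interior edges -> labels in 0 .. num_bins-1
    return np.digitize(np.log(clipped), log_edges[1:-1])

def expected_depth_from_probs(probs, num_bins=80, d_min=0.5, d_max=10.0):
    """Map a per-pixel class distribution (H, W, num_bins) back to depth by
    taking the probability-weighted mean of the bin centers."""
    log_edges = np.linspace(np.log(d_min), np.log(d_max), num_bins + 1)
    centers = np.exp(0.5 * (log_edges[:-1] + log_edges[1:]))
    return np.tensordot(probs, centers, axes=([-1], [0]))

# toy usage
gt = np.random.uniform(0.5, 10.0, size=(4, 4))
labels = depth_to_class_targets(gt)
print(labels.min(), labels.max())
```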

301 citations


Journal ArticleDOI
TL;DR: This letter presents an effective method to encode the spatiotemporal information of a skeleton sequence into color texture images, referred to as skeleton optical spectra, and employs convolutional neural networks (ConvNets) to learn the discriminative features for action recognition.
Abstract: This letter presents an effective method to encode the spatiotemporal information of a skeleton sequence into color texture images, referred to as skeleton optical spectra, and employs convolutional neural networks (ConvNets) to learn the discriminative features for action recognition. Such spectrum representation makes it possible to use a standard ConvNet architecture to learn suitable “dynamic” features from skeleton sequences without training millions of parameters afresh and it is especially valuable when there is insufficient annotated training video data. Specifically, the encoding consists of four steps: mapping of joint distribution, spectrum coding of joint trajectories, spectrum coding of body parts, and joint velocity weighted saturation and brightness. Experimental results on three widely used datasets have demonstrated the efficacy of the proposed method.

298 citations


Journal ArticleDOI
TL;DR: A novel residual network architecture, residual networks of residual networks (RoR), is proposed to fully exploit the optimization ability of residual networks, where RoR substitutes optimizing the residual mapping of a residual mapping for optimizing the original residual mapping.
Abstract: A residual networks family with hundreds or even thousands of layers dominates major image recognition tasks, but building a network by simply stacking residual blocks inevitably limits its optimization ability. This paper proposes a novel residual network architecture, residual networks of residual networks (RoR), to fully exploit the optimization ability of residual networks. RoR substitutes optimizing the residual mapping of a residual mapping for optimizing the original residual mapping. In particular, RoR adds levelwise shortcut connections upon original residual networks to promote the learning capability of residual networks. More importantly, RoR can be applied to various kinds of residual networks (ResNets, Pre-ResNets, and WRN) and significantly boost their performance. Our experiments demonstrate the effectiveness and versatility of RoR, where it achieves the best performance in all residual-network-like structures. Our RoR-3-WRN58-4 + SD models achieve new state-of-the-art results on CIFAR-10, CIFAR-100, and SVHN, with test errors of 3.77%, 19.73%, and 1.59%, respectively. RoR-3 models also achieve state-of-the-art results compared with ResNets on the ImageNet data set.
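The sketch below illustrates the level-wise shortcut idea in PyTorch: an ordinary residual block carries the original (level-1) shortcut, and a stage of several blocks is wrapped by an additional (level-2) identity shortcut. Block counts, channel widths, and block style are illustrative assumptions, not the paper's RoR-3 configuration.

```python
import torch
import torch.nn as nn

class BasicResBlock(nn.Module):
    """A plain pre-activation-style residual block (innermost shortcut level)."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
        )

    def forward(self, x):
        return x + self.body(x)          # level-1 (original) shortcut

class RoRStage(nn.Module):
    """A group of residual blocks wrapped by an extra, level-wise shortcut."""
    def __init__(self, channels, num_blocks=4):
        super().__init__()
        self.blocks = nn.Sequential(*[BasicResBlock(channels) for _ in range(num_blocks)])

    def forward(self, x):
        return x + self.blocks(x)        # level-2 shortcut over the whole group

x = torch.randn(1, 16, 32, 32)
print(RoRStage(16)(x).shape)             # torch.Size([1, 16, 32, 32])
```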

258 citations


Journal ArticleDOI
TL;DR: This paper proposes to bridge the emotional gap by using a hybrid deep model, which first produces audio–visual segment features with convolutional neural networks (CNNs) and a 3D-CNN, then fuses the audio–visual segment features in a deep belief network (DBN).
Abstract: Emotion recognition is challenging due to the emotional gap between emotions and audio–visual features. Motivated by the powerful feature learning ability of deep neural networks, this paper proposes to bridge the emotional gap by using a hybrid deep model, which first produces audio–visual segment features with convolutional neural networks (CNNs) and a 3D-CNN, then fuses the audio–visual segment features in a deep belief network (DBN). The proposed method is trained in two stages. First, CNN and 3D-CNN models pre-trained on corresponding large-scale image and video classification tasks are fine-tuned on emotion recognition tasks to learn audio and visual segment features, respectively. Second, the outputs of the CNN and 3D-CNN models are combined into a fusion network built with a DBN model. The fusion network is trained to jointly learn a discriminative audio–visual segment feature representation. After average-pooling the segment features learned by the DBN to form a fixed-length global video feature, a linear Support Vector Machine is used for video emotion classification. Experimental results on three public audio–visual emotional databases, including the acted RML database, the acted eNTERFACE05 database, and the spontaneous BAUM-1s database, demonstrate the promising performance of the proposed method. To the best of our knowledge, this is an early work fusing audio and visual cues with CNN, 3D-CNN, and DBN for audio–visual emotion recognition.

249 citations


Journal ArticleDOI
TL;DR: An overview of cross-media retrieval is given, including the concepts, methodologies, major challenges, and open issues, as well as building up the benchmarks, including data sets and experimental results so that researchers can directly adopt the benchmarks to promptly evaluate their proposed methods.
Abstract: Multimedia retrieval plays an indispensable role in big data utilization. Past efforts mainly focused on single-media retrieval. However, the requirements of users are highly flexible, such as retrieving relevant audio clips with an image query. So challenges stemming from the “media gap,” which means that representations of different media types are inconsistent, have attracted increasing attention. Cross-media retrieval is designed for the scenarios where the queries and retrieval results are of different media types. As a relatively new research topic, its concepts, methodologies, and benchmarks are still not clear in the literature. To address these issues, we review more than 100 references, give an overview including the concepts, methodologies, major challenges, and open issues, as well as build up the benchmarks, including data sets and experimental results. Researchers can directly adopt the benchmarks to promptly evaluate their proposed methods. This will help them focus on algorithm design rather than on time-consuming reproduction of compared methods and results. It is noted that we have constructed a new data set XMedia, which is the first publicly available data set with up to five media types (text, image, video, audio, and 3-D model). We believe this overview will attract more researchers to focus on cross-media retrieval and be helpful to them.

222 citations


Journal ArticleDOI
TL;DR: This work proposes a unified metric learning-based framework to jointly learn discriminative feature representation and co-salient object detector by optimizing a new objective function that explicitly embeds a metric learning regularization term into support vector machine (SVM) training.
Abstract: Co-saliency detection, which focuses on extracting commonly salient objects in a group of relevant images, has been attracting research interest because of its broad applications. In practice, the relevant images in a group may have a wide range of variations, and the salient objects may also have large appearance changes. Such wide variations usually bring about large intra-co-salient objects (intra-COs) diversity and high similarity between COs and background, which makes the co-saliency detection task more difficult. To address these problems, we make the earliest effort to introduce metric learning to co-saliency detection. Specifically, we propose a unified metric learning-based framework to jointly learn discriminative feature representation and co-salient object detector. This is achieved by optimizing a new objective function that explicitly embeds a metric learning regularization term into support vector machine (SVM) training. Here, the metric learning regularization term is used to learn a powerful feature representation that has small intra-COs scatter, but big separation between background and COs, and the SVM classifier is used for subsequent co-saliency detection. In the experiments, we comprehensively evaluate the proposed method on two commonly used benchmark data sets. The state-of-the-art results are achieved in comparison with the existing co-saliency detection methods.
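As a rough illustration of the "metric term inside SVM training" structure, the NumPy sketch below evaluates a hinge loss in a linearly transformed feature space plus an intra-class scatter regularizer. The exact regularizer form, variable names, and weighting are assumptions for illustration only, not the paper's objective.

```python
import numpy as np

def metric_svm_objective(w, b, L, X, y, lam=1.0, mu=0.1):
    """Illustrative objective: SVM hinge loss on features projected by a learned
    linear metric L, plus a term shrinking intra-class (co-salient) scatter."""
    Z = X @ L.T                                    # project features with metric L
    margins = y * (Z @ w + b)
    hinge = np.maximum(0.0, 1.0 - margins).mean()  # SVM data term
    reg_w = 0.5 * lam * np.dot(w, w)               # usual SVM regularizer
    pos = Z[y == 1]                                # co-salient samples
    scatter = ((pos - pos.mean(axis=0)) ** 2).sum(axis=1).mean() if len(pos) else 0.0
    return hinge + reg_w + mu * scatter

# toy usage with random data
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 8)); y = np.sign(rng.normal(size=20))
w = rng.normal(size=5); b = 0.0; L = rng.normal(size=(5, 8))
print(metric_svm_objective(w, b, L, X, y))
```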

198 citations


Journal ArticleDOI
TL;DR: This paper presents a review of the digital video watermarking techniques in which their applications, challenges, and important properties are discussed, and categorizes them based on the domain in which they embed the watermark.
Abstract: The illegal distribution of a digital movie is a common and significant threat to the film industry. With the advent of high-speed broadband Internet access, a pirated copy of a digital video can now be easily distributed to a global audience. A possible means of limiting this type of digital theft is digital video watermarking whereby additional information, called a watermark, is embedded in the host video. This watermark can be extracted at the decoder and used to determine whether the video content is watermarked. This paper presents a review of the digital video watermarking techniques in which their applications, challenges, and important properties are discussed, and categorizes them based on the domain in which they embed the watermark. It then provides an overview of a few emerging innovative solutions using watermarks. Protecting a 3D video by watermarking is an emerging area of research. The relevant 3D video watermarking techniques in the literature are classified based on the image-based representations of a 3D video in stereoscopic, depth-image-based rendering, and multi-view video watermarking. We discuss each technique, and then present a survey of the literature. Finally, we provide a summary of this paper and propose some future research directions.

181 citations


Journal ArticleDOI
TL;DR: This paper focuses on the video content analysis techniques applied in sportscasts over the past decade from the perspectives of fundamentals and general review, a content hierarchical model, and trends and challenges.
Abstract: Sports data analysis is becoming increasingly large scale, diversified, and shared, but difficulty persists in rapidly accessing the most crucial information. Previous surveys have focused on the methodologies of sports video analysis from the spatiotemporal viewpoint instead of a content-based viewpoint, and few of these studies have considered semantics. This paper develops a deeper interpretation of content-aware sports video analysis by examining the insight offered by research into the structure of content under different scenarios. On the basis of this insight, we provide an overview of the themes particularly relevant to the research on content-aware systems for broadcast sports. Specifically, we focus on the video content analysis techniques applied in sportscasts over the past decade from the perspectives of fundamentals and general review, a content hierarchical model, and trends and challenges. Content-aware analysis methods are discussed with respect to object-, event-, and context-oriented groups. In each group, the gap between sensation and content excitement must be bridged using proper strategies. In this regard, a content-aware approach is required to determine user demands. Finally, this paper summarizes the future trends and challenges for sports video analysis. We believe that our findings can advance the field of research on content-aware video analysis for broadcast sports.

179 citations


Journal ArticleDOI
TL;DR: The experimental results indicated that the proposed watermarking mechanism can withstand various processing attacks and accurately locate the tampered area of an image.
Abstract: This paper presents a blind dual watermarking mechanism for digital color images in which invisible robust watermarks are embedded for copyright protection and fragile watermarks are embedded for image authentication. For the purpose of copyright protection, the first watermark is embedded using the discrete wavelet transform in YCbCr color space, and it can be extracted blindly without access to the host image. The fragile watermark, in contrast, is embedded using an improved least-significant-bit replacement approach in the RGB components for image authentication. The authenticity and integrity of a suspicious image can be verified blindly without the host image and the original watermark. The combination of robust and fragile watermarking makes the proposed mechanism suitable for protecting valuable original images. The experimental results indicated that the proposed watermarking mechanism can withstand various processing attacks and accurately locate the tampered area of an image.
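To illustrate the two embedding ideas, here is a minimal sketch operating on a single luminance-like channel: an additive robust mark in a DWT sub-band and a fragile LSB replacement. The sub-band choice, embedding strength, additive rule, and the omission of the YCbCr/RGB handling are simplifying assumptions, not the paper's exact scheme.

```python
import numpy as np
import pywt

def embed_robust_dwt(channel, wm_bits, alpha=8.0):
    """Additively embed +/-alpha into the HL sub-band of a 1-level Haar DWT."""
    cA, (cH, cV, cD) = pywt.dwt2(channel.astype(np.float64), 'haar')
    flat = cH.flatten()
    flat[:len(wm_bits)] += alpha * (2 * np.asarray(wm_bits, float) - 1.0)
    cH = flat.reshape(cH.shape)
    return pywt.idwt2((cA, (cH, cV, cD)), 'haar')

def embed_fragile_lsb(channel, auth_bits):
    """Replace least significant bits with authentication bits."""
    out = channel.astype(np.uint8).flatten()
    out[:len(auth_bits)] = (out[:len(auth_bits)] & 0xFE) | np.asarray(auth_bits, np.uint8)
    return out.reshape(channel.shape)

# toy usage on a random 8x8 block
img = np.random.randint(0, 256, (8, 8), dtype=np.uint8)
robust = np.clip(embed_robust_dwt(img, [1, 0, 1, 1]), 0, 255).astype(np.uint8)
marked = embed_fragile_lsb(robust, [0, 1, 1, 0])
```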

150 citations


Journal ArticleDOI
TL;DR: A weight-sample-based method for foreground detection that allows us to use a few samples with variable weights to achieve effective change detection and is superior to the state-of-the-art approaches on the challenging CDnet data set.
Abstract: Background subtraction techniques are often treated as fundamental and significant ways to analyze and understand video content. In this paper, we propose a weight-sample-based method for foreground detection. This method allows us to use a few samples with variable weights to achieve effective change detection. To rapidly adapt to changing scenarios, a minimum-weight update policy is first proposed to replace the most inefficient sample instead of the oldest sample or a random sample. In addition, a reward-and-penalty weighting strategy is put forward to reinforce active samples and punish others. In this way, the weights of relatively effective samples are increased and the false updating of effective samples with smaller weights is reduced. Moreover, some other strategies, such as spatial-diffusion policy and random time subsampling, are also incorporated to ensure the flexibility of the proposed method. Finally, in our experiments, an adaptive feedback technique is incorporated into our algorithm to adapt to more challenging videos, and the final results indicate that our method is superior to the state-of-the-art approaches on the challenging CDnet data set.
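The sketch below shows the weighted-sample bookkeeping described above: each pixel keeps a few samples with weights, matching samples are rewarded and the rest penalized, and background pixels overwrite their minimum-weight sample. Thresholds, increments, and sample counts are illustrative values, and the spatial-diffusion, subsampling, and feedback parts are omitted.

```python
import numpy as np

class WeightedSampleModel:
    """Per-pixel background model kept as a few (sample, weight) pairs."""
    def __init__(self, first_frame, n_samples=5):
        h, w = first_frame.shape
        self.samples = np.repeat(first_frame[None].astype(np.float32), n_samples, axis=0)
        self.weights = np.ones((n_samples, h, w), dtype=np.float32)

    def apply(self, frame, dist_thresh=20.0, reward=0.05, penalty=0.01):
        frame = frame.astype(np.float32)
        close = np.abs(self.samples - frame) < dist_thresh     # which samples match
        foreground = ~close.any(axis=0)                        # no sample matches
        # reward-and-penalty weighting: reinforce matching samples, punish others
        self.weights += np.where(close, reward, -penalty)
        self.weights = np.clip(self.weights, 0.01, 10.0)
        # minimum-weight update: background pixels overwrite their weakest sample
        weakest = self.weights.argmin(axis=0)
        ys, xs = np.nonzero(~foreground)
        self.samples[weakest[ys, xs], ys, xs] = frame[ys, xs]
        self.weights[weakest[ys, xs], ys, xs] = 1.0
        return foreground

# toy usage
model = WeightedSampleModel(np.zeros((4, 4)))
mask = model.apply(np.full((4, 4), 100.0))      # all pixels flagged as foreground
```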

Journal ArticleDOI
Feng Jiang, Wen Tao, Shaohui Liu, Jie Ren, Xun Guo, Debin Zhao
TL;DR: Experimental results validate that the proposed compression framework greatly outperforms several compression frameworks that use existing image coding standards with the state-of-the-art deblocking or denoising post-processing methods.
Abstract: Deep learning, e.g., convolutional neural networks (CNNs), has achieved great success in image processing and computer vision, especially in high-level vision applications such as recognition and understanding. However, it is rarely used to solve low-level vision problems such as image compression, which is studied in this paper. Here, we move forward a step and propose a novel compression framework based on CNNs. To achieve high-quality image compression at low bit rates, two CNNs are seamlessly integrated into an end-to-end compression framework. The first CNN, named compact convolutional neural network (ComCNN), learns an optimal compact representation from an input image, which preserves the structural information and is then encoded using an image codec (e.g., JPEG, JPEG2000, or BPG). The second CNN, named reconstruction convolutional neural network (RecCNN), is used to reconstruct the decoded image with high quality at the decoder side. To make the two CNNs collaborate effectively, we develop a unified end-to-end learning algorithm to simultaneously learn ComCNN and RecCNN, which facilitates the accurate reconstruction of the decoded image using RecCNN. Such a design also makes the proposed compression framework compatible with existing image coding standards. Experimental results validate that the proposed compression framework greatly outperforms several compression frameworks that use existing image coding standards with state-of-the-art deblocking or denoising post-processing methods.
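A minimal PyTorch sketch of the ComCNN/RecCNN pipeline shape is given below. Layer counts and channel widths are illustrative, and the standard codec in the middle is stood in by simple uniform quantization so the example stays self-contained; the paper's unified learning algorithm for handling the actual codec is not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ComCNN(nn.Module):
    """Produces a compact (here: half-resolution) representation of the input."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 1, 3, padding=1),
        )
    def forward(self, x):
        return self.net(x)

class RecCNN(nn.Module):
    """Upsamples the decoded compact image and predicts a residual correction."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 1, 3, padding=1),
        )
    def forward(self, compact, out_size):
        up = F.interpolate(compact, size=out_size, mode='bicubic', align_corners=False)
        return up + self.net(up)               # residue learning on the upsampled image

def fake_codec(x, step=0.05):
    """Stand-in for JPEG/JPEG2000/BPG: uniform quantization of the compact image."""
    return torch.round(x / step) * step

com, rec = ComCNN(), RecCNN()
img = torch.rand(1, 1, 64, 64)
decoded = rec(fake_codec(com(img)), img.shape[-2:])
loss = F.mse_loss(decoded, img)                # reconstruction loss used for training
```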

Journal ArticleDOI
TL;DR: This paper proposes an accumulative motion context (AMOC) network for video-based person re-identification, which jointly learns appearance representation and motion context from a collection of adjacent frames using a two-stream convolutional architecture.
Abstract: Video-based person re-identification plays a central role in realistic security and video surveillance. In this paper, we propose a novel accumulative motion context (AMOC) network for addressing this important problem, which effectively exploits the long-range motion context for robustly identifying the same person under challenging conditions. Given a video sequence of the same or different persons, the proposed AMOC network jointly learns appearance representation and motion context from a collection of adjacent frames using a two-stream convolutional architecture. Then, AMOC accumulates clues from motion context by recurrent aggregation, allowing effective information flow among adjacent frames and capturing the dynamic gist of the persons. The architecture of AMOC is end-to-end trainable, and thus, motion context can be adapted to complement appearance clues under unfavorable conditions (e.g., occlusions). Extensive experiments are conducted on three public benchmark data sets, i.e., the iLIDS-VID, PRID-2011, and MARS data sets, to investigate the performance of AMOC. The experimental results demonstrate that the proposed AMOC network significantly outperforms the state of the art for video-based re-identification and confirm the advantage of exploiting long-range motion context for video-based person re-identification, validating our motivation.

Journal ArticleDOI
TL;DR: A very compact universal feature set is proposed and a multiclass classification scheme for identifying many common image operations is designed, which significantly outperforms the existing forensic methods in terms of both effectiveness and universality.
Abstract: Image forensics has attracted wide attention during the past decade. However, most existing works aim at detecting a certain operation, which means that their proposed features usually depend on the investigated image operation and they consider only binary classification. This usually leads to misleading results if irrelevant features and/or classifiers are used. For instance, a JPEG decompressed image would be classified as an original or median filtered image if it was fed into a median filtering detector. Hence, it is important to develop forensic methods and universal features that can simultaneously identify multiple image operations. Based on extensive experiments and analysis, we find that any image operation, including existing anti-forensics operations, will inevitably modify a large number of pixel values in the original images. Thus, some common inherent statistics such as the correlations among adjacent pixels cannot be preserved well. To detect such modifications, we try to analyze the properties of local pixels within the image in the residual domain rather than the spatial domain considering the complexity of the image contents. Inspired by image steganalytic methods, we propose a very compact universal feature set and then design a multiclass classification scheme for identifying many common image operations. In our experiments, we tested the proposed features as well as several existing features on 11 typical image processing operations and four kinds of anti-forensic methods. The experimental results show that the proposed strategy significantly outperforms the existing forensic methods in terms of both effectiveness and universality.
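The steganalysis-inspired idea of working in the residual domain can be illustrated with a small NumPy function: compute a first-order residual, truncate it to a small alphabet, and histogram co-occurring pairs. The residual order, truncation value, and pair direction here are illustrative simplifications, not the paper's feature set.

```python
import numpy as np

def residual_cooccurrence_feature(img, T=2):
    """Simplified universal forensic feature:
    1) first-order horizontal residual, 2) truncation to [-T, T],
    3) normalized histogram of horizontal pairs of quantized residuals."""
    x = img.astype(np.int32)
    r = x[:, 1:] - x[:, :-1]                     # residual domain, not pixel domain
    r = np.clip(r, -T, T)                        # truncate to a small alphabet
    left, right = r[:, :-1].ravel() + T, r[:, 1:].ravel() + T
    k = 2 * T + 1
    hist = np.bincount(left * k + right, minlength=k * k).astype(np.float64)
    return hist / hist.sum()                     # (2T+1)^2-dimensional feature

feat = residual_cooccurrence_feature(np.random.randint(0, 256, (64, 64)))
print(feat.shape)                                # (25,)
```

A multiclass SVM or similar classifier can then be trained on such features to separate the different image operations.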

Journal ArticleDOI
TL;DR: A novel finger vein recognition framework is proposed, including an anatomy structure analysis-based vein extraction algorithm and an integration matching strategy, and extensive experiments on two public finger vein databases verify the effectiveness of the proposed framework.
Abstract: Finger vein recognition has received a lot of attention recently and is viewed as a promising biometric trait. In related methods, vein pattern-based methods explore intrinsic finger vein recognition, but their performance remains unsatisfactory owing to defective vein networks and weak matching. One important reason may be the neglect of deep analysis of the vein anatomy structure. By comprehensively exploring the anatomy structure and imaging characteristic of vein patterns, this paper proposes a novel finger vein recognition framework, including an anatomy structure analysis-based vein extraction algorithm and an integration matching strategy. Specifically, the vein pattern is extracted from the orientation map-guided curvature based on the valley- or half valley-shaped cross-sectional profile. In addition, the extracted vein pattern is further thinned and refined to obtain a reliable vein network. In addition to the vein network, the relatively clear vein branches in the image are mined from the vein pattern, referred to as the vein backbone. In matching, the vein backbone is used in vein network calibration to overcome finger displacements. The similarity of two calibrated vein networks is measured by the proposed elastic matching and further recomputed by integrating the overlap degree of corresponding vein backbones. Extensive experiments on two public finger vein databases verify the effectiveness of the proposed framework.

Journal ArticleDOI
TL;DR: A new CNN structure for up-sampling is explored, which features deconvolution of feature maps, multi-scale fusion, and residue learning, making the network both compact and efficient.
Abstract: Inspired by the recent advances of image super-resolution using convolutional neural network (CNN), we propose a CNN-based block up-sampling scheme for intra frame coding. A block can be down-sampled before being compressed by normal intra coding, and then up-sampled to its original resolution. Different from previous studies on down/up-sampling-based coding, the up-sampling methods in our scheme have been designed by training CNN instead of hand-crafted. We explore a new CNN structure for up-sampling, which features deconvolution of feature maps, multi-scale fusion, and residue learning, making the network both compact and efficient. We also design different networks for the up-sampling of luma and chroma components, respectively, where the chroma up-sampling CNN utilizes the luma information to boost its performance. In addition, we design a two-stage up-sampling process, the first stage being within the block-by-block coding loop, and the second stage being performed on the entire frame, so as to refine block boundaries. We also empirically study how to set the coding parameters of down-sampled blocks for pursuing the frame-level rate-distortion optimization. Our proposed scheme is implemented into the high-efficiency video coding (HEVC) reference software, and a comprehensive set of experiments have been performed to evaluate our methods. Experimental results show that our scheme achieves significant bits saving compared with the HEVC anchor, especially at low bit rates, leading to on average 5.5% BD-rate reduction on common test sequences and on average 9.0% BD-rate reduction on ultrahigh definition test sequences.
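Below is a compact PyTorch sketch of the up-sampling idea: a deconvolution branch predicts a residue that is added to a fixed bicubic upsample of the down-sampled block. Channel widths and depth are illustrative, only the luma path is shown, and the multi-scale fusion and luma-guided chroma network are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UpsampleCNN(nn.Module):
    """2x block up-sampling via deconvolution plus residue learning over a
    bicubic baseline (a simplified stand-in for the paper's network)."""
    def __init__(self, ch=32):
        super().__init__()
        self.feat = nn.Sequential(
            nn.Conv2d(1, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.deconv = nn.ConvTranspose2d(ch, ch, 4, stride=2, padding=1)
        self.out = nn.Conv2d(ch, 1, 3, padding=1)

    def forward(self, low_res):
        bicubic = F.interpolate(low_res, scale_factor=2, mode='bicubic',
                                align_corners=False)
        residue = self.out(F.relu(self.deconv(self.feat(low_res))))
        return bicubic + residue                 # network only learns the residue

block = torch.rand(1, 1, 16, 16)                 # a down-sampled coding block
print(UpsampleCNN()(block).shape)                # torch.Size([1, 1, 32, 32])
```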

Journal ArticleDOI
TL;DR: A new database comprising a total of 208 videos that model six common in-capture distortions of digital videos is presented; several top-performing no-reference IQA and VQA algorithms are evaluated on the new database to study how real-world in-capture distortions challenge both human viewers and automatic perceptual quality prediction models.
Abstract: Digital videos often contain visual distortions that are introduced by the camera’s hardware or processing software during the capture process. These distortions often detract from a viewer’s quality of experience. Understanding how human observers perceive the visual quality of digital videos is of great importance to camera designers. Thus, the development of automatic objective methods that accurately quantify the impact of visual distortions on perception has greatly accelerated. Video quality algorithm design and verification require realistic databases of distorted videos and human judgments of them. However, most current publicly available video quality databases have been created under highly controlled conditions using graded, simulated, and post-capture distortions (such as jitter and compression artifacts) on high-quality videos. The commercial plethora of hand-held mobile video capture devices produces videos often afflicted by a variety of complex distortions generated during the capturing process. These in-capture distortions are not well-modeled by the synthetic, post-capture distortions found in existing VQA databases. Toward overcoming this limitation, we designed and created a new database that we call the LIVE-Qualcomm mobile in-capture video quality database, comprising a total of 208 videos, which model six common in-capture distortions. We also conducted a subjective quality assessment study using this database, in which each video was assessed by 39 unique subjects. Furthermore, we evaluated several top-performing no-reference IQA and VQA algorithms on the new database and studied how real-world in-capture distortions challenge both human viewers as well as automatic perceptual quality prediction models. The new database is freely available at: http://live.ece.utexas.edu/research/incaptureDatabase/index.html .

Journal ArticleDOI
TL;DR: This paper proposes a novel reversible data hiding scheme for encrypted images by using homomorphic and probabilistic properties of Paillier cryptosystem that has lower computation complexity, higher security performance, and better embedding performance.
Abstract: This paper proposes a novel reversible data hiding scheme for encrypted images by using homomorphic and probabilistic properties of the Paillier cryptosystem. In the proposed method, groups of adjacent pixels are randomly selected, and reversibly embedded into the rest of the image to make room for data embedding. In each group, there are a reference pixel and a few host pixels. Least significant bits (LSBs) of the reference pixels are reset before encryption, and the encrypted host pixels are replaced with the encrypted reference pixel in the same group to form mirroring ciphertext groups (MCGs). In such a way, the modification on MCGs for data embedding will not cause any pixel oversaturation in the plaintext domain, and the embedded data can be directly extracted from the encrypted domain. In an MCG, the reference ciphertext pixel is kept unchanged as a reference, while the data hider embeds the encrypted additional data into the LSBs of the host ciphertext pixels by employing homomorphic multiplication. On the receiver side, the hidden ciphertext data can be retrieved by employing a modular multiplicative inverse operation between the marked host ciphertext pixels and their corresponding reference ciphertext pixels. After that, the hidden data are extracted promptly by looking up a one-to-one mapping table from ciphertext to plaintext. Data extraction and image restoration can be accomplished without any error after decryption. Compared with the existing works, the proposed scheme has lower computation complexity, higher security performance, and better embedding performance. The experiments on the standard image files also certify the effectiveness of the proposed scheme.
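The Paillier property the scheme relies on is that multiplying two ciphertexts adds the underlying plaintexts. The toy demonstration below uses insecure key sizes and shows only this homomorphic step, not the paper's MCG construction or extraction protocol.

```python
import math, random

# Toy Paillier keys (insecure sizes, for illustration only)
p, q = 10007, 10009
n, n2 = p * q, (p * q) ** 2
lam = math.lcm(p - 1, q - 1)
g = n + 1
L = lambda x: (x - 1) // n
mu = pow(L(pow(g, lam, n2)), -1, n)

def encrypt(m):
    r = random.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    return (L(pow(c, lam, n2)) * mu) % n

# Homomorphic step used for embedding: multiplying ciphertexts adds the
# plaintexts, so a data bit can be folded into an encrypted pixel value
# without decrypting it.
pixel, bit = 200, 1
c_marked = (encrypt(pixel) * encrypt(bit)) % n2
assert decrypt(c_marked) == pixel + bit
print(decrypt(c_marked))    # 201
```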

Journal ArticleDOI
TL;DR: Experimental results on the Outex, CUReT, KTH-TIPS, and UIUC texture data sets show that LETRIST consistently produces better or comparable classification results than the state-of-the-art approaches.
Abstract: Classifying texture images, especially those with significant rotation, illumination, scale, and viewpoint changes, is a fundamental and challenging problem in computer vision. This paper proposes a simple yet effective image descriptor, called Locally Encoded TRansform feature hISTogram (LETRIST), for texture classification. LETRIST is a histogram representation that explicitly encodes the joint information within an image across feature and scale spaces. The proposed representation is training-free, low-dimensional, yet discriminative and robust for texture description. It consists of the following major steps. First, a set of transform features is constructed to characterize local texture structures and their correlation by applying linear and non-linear operators on the extremum responses of directional Gaussian derivative filters in scale space. Established on the basis of steerable filters, the constructed transform features are exactly rotationally invariant as well as computationally efficient. Second, the scalar quantization via binary or multi-level thresholding is adopted to quantize these transform features into texture codes. Two quantization schemes are designed, both of which are robust to image rotation and illumination changes. Third, the cross-scale joint coding is explored to aggregate the discrete texture codes into a compact histogram representation, i.e., LETRIST. Experimental results on the Outex, CUReT, KTH-TIPS, and UIUC texture data sets show that LETRIST consistently produces better or comparable classification results than the state-of-the-art approaches. Impressively, recognition rates of 100.00% and 99.00% have been achieved on the Outex and KTH-TIPS data sets, respectively. In addition, the noise robustness is evaluated on the Outex and CUReT data sets. The source code is publicly available at https://github.com/stc-cqupt/letrist .
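A heavily simplified sketch of the LETRIST recipe (transform features in scale space, scalar quantization into texture codes, cross-scale joint histogram) is shown below. Gradient magnitudes of isotropic Gaussian derivatives stand in for the paper's directional extremum responses, and the quantile thresholds and two-scale setup are illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def toy_letrist(img, sigmas=(1.0, 2.0), levels=4):
    """Toy version: per-scale transform feature -> scalar quantization into
    texture codes -> cross-scale joint histogram."""
    img = img.astype(np.float64)
    codes = []
    for s in sigmas:
        gx = gaussian_filter(img, s, order=(0, 1))           # Gaussian derivative in x
        gy = gaussian_filter(img, s, order=(1, 0))           # Gaussian derivative in y
        mag = np.hypot(gx, gy)                               # rotation-invariant response
        edges = np.quantile(mag, np.linspace(0, 1, levels + 1)[1:-1])
        codes.append(np.digitize(mag, edges))                # codes in 0 .. levels-1
    joint = codes[0] * levels + codes[1]                     # cross-scale joint coding
    hist = np.bincount(joint.ravel(), minlength=levels ** 2).astype(float)
    return hist / hist.sum()

print(toy_letrist(np.random.rand(64, 64)).shape)             # (16,)
```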

Journal ArticleDOI
TL;DR: Qualitative and quantitative results show that the proposed 3D feature constrained reconstruction (3D-FCR) algorithm can lead to a promising improvement of LDCT image quality.
Abstract: Low-dose computed tomography (LDCT) images are often highly degraded by amplified mottle noise and streak artifacts. Maintaining image quality under low-dose scan protocols is a well-known challenge. Recently, sparse representation-based techniques have been shown to be efficient in improving such CT images. In this paper, we propose a 3D feature constrained reconstruction (3D-FCR) algorithm for LDCT image reconstruction. The feature information used in the 3D-FCR algorithm relies on a 3D feature dictionary constructed from available high-quality standard-dose CT samples. The CT voxels and the sparse coefficients are sequentially updated using an alternating minimization scheme. The performance of the 3D-FCR algorithm was assessed through experiments conducted on phantom simulation data and clinical data. A comparison with previously reported solutions was also performed. Qualitative and quantitative results show that the proposed method can lead to a promising improvement of LDCT image quality.

Journal ArticleDOI
TL;DR: Experimental results on benchmark data sets show that the proposed model can effectively learn spatio-temporal features relevant for re-identification and outperforms existing video-based person re-Identification methods.
Abstract: This paper presents an end-to-end learning architecture for video-based person re-identification by integrating convolutional neural networks (CNNs) and bidirectional recurrent neural networks (BRNNs). Given a video with consecutive frames, features of each frame are extracted with CNN and then are fed into the BRNN to get a final spatio-temporal representation about the video. Specifically, CNN acts as a Spatial Feature Extractor, while BRNN is expected to capture the temporal cues of sequential frames in both forward and backward directions, simultaneously. The whole network is trained end-to-end with a joint identification and verification manner. Experimental results on benchmark data sets show that the proposed model can effectively learn spatio-temporal features relevant for re-identification and outperforms existing video-based person re-identification methods.
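The CNN-plus-bidirectional-RNN structure described above can be sketched in a few lines of PyTorch. The per-frame CNN, hidden sizes, and temporal averaging used here are illustrative stand-ins, not the paper's exact architecture or training losses.

```python
import torch
import torch.nn as nn

class CnnBrnnReID(nn.Module):
    """Per-frame CNN as spatial feature extractor, bidirectional GRU as the
    temporal aggregator, temporal average as the final sequence embedding."""
    def __init__(self, feat_dim=128, hidden=64):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, feat_dim),
        )
        self.brnn = nn.GRU(feat_dim, hidden, batch_first=True, bidirectional=True)

    def forward(self, clip):                      # clip: (B, T, 3, H, W)
        b, t = clip.shape[:2]
        frame_feats = self.cnn(clip.flatten(0, 1)).view(b, t, -1)
        seq, _ = self.brnn(frame_feats)           # (B, T, 2*hidden), both directions
        return seq.mean(dim=1)                    # spatio-temporal video embedding

emb = CnnBrnnReID()(torch.rand(2, 8, 3, 64, 32))
print(emb.shape)                                  # torch.Size([2, 128])
```

In the paper, such embeddings are trained jointly with identification and verification objectives; only the forward structure is sketched here.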

Journal ArticleDOI
TL;DR: This paper proposes integrating semantic information into learning locality-aware feature (LAF) sets for accurate crowd counting and extends the traditional vector of locally aggregated descriptors (VLAD) encoding method to a more generalized form, weighted-VLAD (W-VLAD), in which diverse coefficient weights are taken into consideration.
Abstract: Crowd counting is an important task in computer vision, which has many applications in video surveillance. Although the regression-based framework has achieved great improvements for crowd counting, how to improve the discriminative power of image representation is still an open problem. Conventional holistic features used in crowd counting often fail to capture semantic attributes and spatial cues of the image. In this paper, we propose integrating semantic information into learning locality-aware feature (LAF) sets for accurate crowd counting. First, with the help of a convolutional neural network, the original pixel space is mapped onto a dense attribute feature map, where each dimension of the pixelwise feature indicates the probabilistic strength of a certain semantic class. Then, LAF built on the idea of spatial pyramids on neighboring patches is proposed to explore more spatial context and local information. Finally, the traditional vector of locally aggregated descriptor (VLAD) encoding method is extended to a more generalized form weighted-VLAD (W-VLAD) in which diverse coefficient weights are taken into consideration. Experimental results validate the effectiveness of our presented method.
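The weighted-VLAD idea of accumulating residuals with per-descriptor coefficients can be sketched as below. How the weights are derived (for example, from the semantic attribute map) is left out; the power and L2 normalizations are common VLAD conventions assumed for illustration.

```python
import numpy as np

def weighted_vlad(descriptors, centers, weights):
    """VLAD with per-descriptor weights: each residual to its assigned codeword
    is accumulated with a coefficient instead of uniformly."""
    k, d = centers.shape
    assign = np.argmin(((descriptors[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
    vlad = np.zeros((k, d))
    for i, (x, w) in enumerate(zip(descriptors, weights)):
        vlad[assign[i]] += w * (x - centers[assign[i]])
    vlad = np.sign(vlad) * np.sqrt(np.abs(vlad))          # power normalization
    norm = np.linalg.norm(vlad)
    return (vlad / norm).ravel() if norm > 0 else vlad.ravel()

# toy usage
rng = np.random.default_rng(0)
desc = rng.normal(size=(100, 16))
enc = weighted_vlad(desc, rng.normal(size=(8, 16)), rng.uniform(size=100))
print(enc.shape)                                          # (128,)
```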

Journal ArticleDOI
TL;DR: A spatiotemporal low-rank modeling method on dynamic video clips for estimating the robust background model and superior performance over state-of-the-art approaches is demonstrated.
Abstract: Background modeling constitutes the building block of many computer-vision tasks. Traditional schemes model the background as a low rank matrix with corrupted entries. These schemes operate in batch mode and do not scale well with the data size. Moreover, without enforcing spatiotemporal information in the low-rank component, and because of occlusions by foreground objects and redundancy in video data, the design of a background initialization method robust against outliers is very challenging. To overcome these limitations, this paper presents a spatiotemporal low-rank modeling method on dynamic video clips for estimating the robust background model. The proposed method encodes spatiotemporal constraints by regularizing spectral graphs. Initially, a motion-compensated binary matrix is generated using optical flow information to remove redundant data and to create a set of dynamic frames from the input video sequence. Then two graphs are constructed, one between frames for temporal consistency and the other between features for spatial consistency, to encode the local structure for continuously promoting the intrinsic behavior of the low-rank model against outliers. These two terms are then incorporated in the iterative Matrix Completion framework for improved segmentation of background. Rigorous evaluation on severely occluded and dynamic background sequences demonstrates the superior performance of the proposed method over state-of-the-art approaches.

Journal ArticleDOI
TL;DR: This paper proposes a novel two-flow convolutional neural network (YCNN) that tracks different objects in an object-independent manner by measuring the similarity between an object image patch and a larger search patch.
Abstract: The main challenges of visual object tracking arise from the arbitrary appearance of the objects that need to be tracked. Most existing algorithms try to solve this problem by training a new model to regenerate or classify each tracked object. As a result, the model needs to be initialized and retrained for each new object. In this paper, we propose to track different objects in an object-independent approach with a novel two-flow convolutional neural network (YCNN). The YCNN takes two inputs (one is an object image patch, the other is a larger searching image patch), then outputs a response map which predicts how likely and where the object would appear in the search patch. Unlike the object-specific approaches, the YCNN is actually trained to measure the similarity between the two image patches. Thus, this model will not be limited to any specific object. Furthermore, the network is end-to-end trained to extract both shallow and deep dedicated convolutional features for visual tracking. And once properly trained, the YCNN can be used to track all kinds of objects without further training and updating. As a result, our algorithm is able to run at a very high speed of 45 frames-per-second. The effectiveness of the proposed algorithm can also be proved by the experiments on two popular data sets: OTB-100 and VOT-2014.
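The two-flow similarity idea can be illustrated with a shared embedding network whose object-patch features are slid over the search-patch features to produce a response map. The toy backbone and the plain cross-correlation used here are stand-ins for the YCNN architecture and its learned prediction head.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoFlowTracker(nn.Module):
    """One shared CNN embeds both inputs; the object embedding is correlated
    with the search embedding to locate the object."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, object_patch, search_patch):
        z = self.embed(object_patch)              # (1, C, hz, wz)
        x = self.embed(search_patch)              # (1, C, hx, wx)
        return F.conv2d(x, z)                     # response map over the search area

resp = TwoFlowTracker()(torch.rand(1, 3, 32, 32), torch.rand(1, 3, 96, 96))
print(resp.shape)                                 # torch.Size([1, 1, 65, 65])
```

Because the network scores similarity rather than a specific object class, the same trained model can be applied to new objects without retraining, which is what enables the high frame rate reported above.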

Journal ArticleDOI
TL;DR: It is shown that, due to the use of this strategy, L1-LDA is accompanied by some serious problems that hinder the derivation of the optimal discrimination for data, and an effective iterative framework to solve a general L1-norm minimization–maximization (minmax) problem is proposed.
Abstract: Recent works have proposed two L1-norm distance measure-based linear discriminant analysis (LDA) methods, L1-LDA and LDA-L1, which aim to promote the robustness of the conventional LDA against outliers. In LDA-L1, a gradient ascending iterative algorithm is applied, which, however, suffers from the choice of the step size. In L1-LDA, an alternating optimization strategy is proposed to overcome this problem. In this paper, however, we show that due to the use of this strategy, L1-LDA is accompanied by some serious problems that hinder the derivation of the optimal discrimination for data. Then, we propose an effective iterative framework to solve a general L1-norm minimization–maximization (minmax) problem. Based on the framework, we further develop an effective L1-norm distance-based LDA (called L1-ELDA) method. Theoretical insights into the convergence and effectiveness of our algorithm are provided and further verified by extensive experimental results on image databases.
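For orientation, a commonly used form of the L1-norm LDA criterion is shown below (c classes, class means m_i with n_i samples each, overall mean m); it replaces the squared distances of conventional LDA with absolute values, which is what makes the problem a minmax of L1 terms. The exact objective treated in the paper may differ in details.

```latex
\max_{\mathbf{w}} \;
\frac{\sum_{i=1}^{c} n_i \,\bigl|\mathbf{w}^{\top}(\mathbf{m}_i - \mathbf{m})\bigr|}
     {\sum_{i=1}^{c} \sum_{\mathbf{x} \in \mathcal{X}_i}
        \bigl|\mathbf{w}^{\top}(\mathbf{x} - \mathbf{m}_i)\bigr|}
```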

Journal ArticleDOI
TL;DR: The proposed framework utilizes two state-of-the-art ConvNets, i.e., the very deep spatial net (VGGNet) and the temporal net from Two-Stream ConvNets, and proposes the new Line pooling strategy, which can speed up feature extraction while achieving performance comparable to Trajectory pooling.
Abstract: Deep ConvNets have shown good performance in image classification tasks. However, problems still remain in deep video representations for action recognition. On one hand, current video ConvNets are relatively shallow compared with image ConvNets, which limits their capability of capturing the complex video action information; on the other hand, temporal information of videos is not properly utilized to pool and encode the video sequences. Toward these issues, in this paper we utilize two state-of-the-art ConvNets, i.e., the very deep spatial net (VGGNet [1]) and the temporal net from Two-Stream ConvNets [2], for action representation. The convolutional layers and the proposed new layer, called the frame-diff layer, are extracted and pooled with two temporal pooling strategies: Trajectory pooling and Line pooling. The pooled local descriptors are then encoded with the vector of locally aggregated descriptors (VLAD) [3] to form the video representations. In order to verify the effectiveness of the proposed framework, we conduct experiments on the UCF101 and HMDB51 data sets. It achieves an accuracy of 92.08% on UCF101, which is state-of-the-art, and an accuracy of 65.62% on HMDB51, which is comparable to the state of the art. In addition, we propose the new Line pooling strategy, which can speed up feature extraction while achieving performance comparable to Trajectory pooling.

Journal ArticleDOI
TL;DR: A deep metric learning-based regression method is proposed to extract density related features, and learn better distance measurement simultaneously, which can be used for crowdedness regression tasks, including congestion level detection and crowd counting.
Abstract: Cross-scene regression tasks, such as congestion level detection and crowd counting, are useful but challenging. There are two main problems that limit the performance of existing algorithms. The first one is that no appropriate congestion-related feature can reflect the real density in scenes. Though deep learning has been proved capable of extracting high-level semantic representations, it is hard for such networks to converge on regression tasks, since the label is too weak to guide the learning of parameters in practice. Thus, many approaches utilize additional information, such as a density map, to guide the learning, which increases the effort of labeling. Another problem is that most existing methods are composed of several steps, for example, feature extraction and regression. Since the steps in the pipeline are separated, these methods face the problem of complex optimization. To remedy this, a deep metric learning-based regression method is proposed to extract density-related features and learn a better distance measurement simultaneously. The proposed networks, trained end-to-end for better optimization, can be used for crowdedness regression tasks, including congestion level detection and crowd counting. Extensive experiments confirm the effectiveness of the proposed method.

Journal ArticleDOI
TL;DR: An RR-IQA method from the perspective of SCI visual perception, where the quality of the distorted SCI is evaluated by comparing a set of extracted statistical features that consider both primary visual information and unpredictable uncertainty.
Abstract: The quality of screen content images (SCIs) influences the user experience and the interactive performance of remote computing systems. While numerous approaches have been proposed to evaluate the quality of natural images, much less work has been dedicated to reduced-reference image quality assessment (RR-IQA) of SCIs. Here, we propose an RR-IQA method from the perspective of SCI visual perception. In particular, the quality of the distorted SCI is evaluated by comparing a set of extracted statistical features that consider both primary visual information and unpredictable uncertainty. A unique property that differentiates the proposed method from previous RR-IQA methods for natural images is the consideration of behaviors when human subjects view the screen content, which motivates us to establish the perceptual model according to the distinct properties of SCIs. Validations based on the screen content IQA database show that the proposed algorithm provides accurate predictions across a wide range of SCI distortions with negligible transmission overhead.

Journal ArticleDOI
TL;DR: A novel hashing model is proposed to efficiently learn robust discrete binary codes, referred to as Robust and Flexible Discrete Hashing (RFDH), in which binary codes are directly learned based on discrete matrix decomposition so that the large quantization error caused by relaxation is avoided.
Abstract: Multimodal hashing approaches have gained great success in large-scale cross-modal similarity search applications, due to their appealing computation and storage efficiency. However, it is still challenging to design binary codes that represent the original features well in an unsupervised manner. We argue that there are some limitations that need to be further considered for unsupervised multimodal hashing: 1) most existing methods drop the discrete constraints to simplify the optimization, which will cause large quantization error; 2) many methods are sensitive to outliers and noise since they use the $\ell_{2}$-norm in their objective functions, which can amplify the errors; and 3) the weight of each modality, which greatly influences the retrieval performance, is manually or empirically determined and may not fully fit the specific training set. The above limitations may significantly degrade the retrieval accuracy of unsupervised multimodal hashing methods. To address these problems, in this paper, a novel hashing model is proposed to efficiently learn robust discrete binary codes, referred to as Robust and Flexible Discrete Hashing (RFDH). In the proposed RFDH model, binary codes are directly learned based on discrete matrix decomposition, so that the large quantization error caused by relaxation is avoided. Moreover, the $\ell_{2,1}$-norm is used in the objective function to improve the robustness, such that the learned model is not sensitive to data outliers and noise. In addition, the weight of each modality is adaptively adjusted according to the training data, so important modalities obtain larger weights during the hash learning procedure. Owing to the above merits, RFDH can generate more effective hash codes. Besides, we introduce two kinds of hash function learning methods to project unseen instances into hash codes. Extensive experiments on several well-known large databases demonstrate the superior performance of the proposed hash model over most state-of-the-art unsupervised multimodal hashing methods.
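The robustness argument rests on the $\ell_{2,1}$-norm, which is simply the sum of the row-wise $\ell_{2}$ norms of a residual matrix; a single outlying row then contributes linearly rather than quadratically to the objective. A two-line NumPy illustration:

```python
import numpy as np

def l21_norm(E):
    """l2,1-norm of a residual matrix: sum of the l2 norms of its rows."""
    return np.linalg.norm(E, axis=1).sum()

E = np.array([[3.0, 4.0],     # an "outlier" row contributes 5.0, not 25.0
              [0.1, 0.2]])
print(l21_norm(E))            # ~5.224
```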

Journal ArticleDOI
TL;DR: A study of subjective and objective quality assessment of compressed 4K ultra-high-definition (UHD) videos in an immersive viewing environment and investigates added values of UHD over conventional high definition (HD) in terms of perceptual quality.
Abstract: We present a study of subjective and objective quality assessment of compressed 4K ultra-high-definition (UHD) videos in an immersive viewing environment. First, we conduct a subjective quality evaluation experiment for 4K UHD videos compressed by three state-of-the-art video coding techniques, i.e., Advanced Video Coding, High Efficiency Video Coding, and VP9. In particular, we aim at investigating added values of UHD over conventional high definition (HD) in terms of perceptual quality. The results are systematically analyzed in various viewpoints, such as coding scheme, bitrate, and video content. Second, existing state-of-the-art objective quality assessment techniques are benchmarked using the subjective data in order to investigate their validity and limitation for 4K UHD videos. Finally, the video and subjective data are made publicly available for further research by the research community.