
Showing papers in "IEEE Transactions on Circuits and Systems for Video Technology in 2017"


Journal ArticleDOI
TL;DR: This paper proposes a novel active learning (AL) framework, which is capable of building a competitive classifier with optimal feature representation via a limited amount of labeled training instances in an incremental learning manner and incorporates deep convolutional neural networks into AL.
Abstract: Recent successes in learning-based image classification heavily rely on large numbers of annotated training samples, which may require considerable human effort. In this paper, we propose a novel active learning (AL) framework, which is capable of building a competitive classifier with optimal feature representation via a limited amount of labeled training instances in an incremental learning manner. Our approach advances the existing AL methods in two aspects. First, we incorporate deep convolutional neural networks into AL. Through the properly designed framework, the feature representation and the classifier can be simultaneously updated with progressively annotated informative samples. Second, we present a cost-effective sample selection strategy to improve the classification performance with fewer manual annotations. Unlike traditional methods focusing on only the uncertain samples of low prediction confidence, we especially discover the large number of high-confidence samples from the unlabeled set for feature learning. Specifically, these high-confidence samples are automatically selected and iteratively assigned pseudolabels. We thus call our framework cost-effective AL (CEAL), standing for these two advantages. Extensive experiments demonstrate that the proposed CEAL framework can achieve promising results on two challenging image classification data sets, i.e., face recognition on the cross-age celebrity face recognition data set and object categorization on Caltech-256.
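
The selection rule described above can be condensed into a short sketch. The snippet below is illustrative only (not the authors' code); it assumes a generic classifier exposing a scikit-learn-style predict_proba, and the function name and thresholds are hypothetical.

    import numpy as np

    def ceal_round(model, X_unlabeled, n_query=10, high_conf=0.95):
        # One CEAL-style round: query the least-confident unlabeled samples
        # for human annotation, and pseudolabel the most confident ones so
        # they can join feature learning without any manual labeling cost.
        proba = model.predict_proba(X_unlabeled)          # (n, n_classes)
        conf = proba.max(axis=1)
        query_idx = np.argsort(conf)[:n_query]            # uncertain -> oracle
        pseudo_idx = np.flatnonzero(conf >= high_conf)    # confident -> pseudo
        pseudo_y = proba[pseudo_idx].argmax(axis=1)
        return query_idx, pseudo_idx, pseudo_y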

581 citations


Journal ArticleDOI
TL;DR: A subjective study in a state-of-the-art mixed reality system shows that introduced prediction distortions are negligible compared with the original reconstructed point clouds and shows the benefit of reconstructed point cloud video as a representation in the 3D virtual world.
Abstract: We present a generic and real-time time-varying point cloud codec for 3D immersive video. This codec is suitable for mixed reality applications in which 3D point clouds are acquired at a fast rate. In this codec, intra frames are coded progressively in an octree subdivision. To further exploit inter-frame dependencies, we present an inter-prediction algorithm that partitions the octree voxel space into $N \times N \times N$ macroblocks ($N=8,16,32$). The algorithm codes points in these blocks in the predictive frame as a rigid transform applied to the points in the intra-coded frame. The rigid transform is computed using the iterative closest point algorithm and compactly represented in a quaternion quantization scheme. To encode the color attributes, we define a mapping of the per-vertex color attributes in the traversed octree to an image grid and use a legacy image coding method based on JPEG. As a result, a generic compression framework suitable for real-time 3D tele-immersion is developed. This framework has been optimized to run in real time on commodity hardware for both the encoder and decoder. Objective evaluation shows that a higher rate-distortion performance is achieved compared with available point cloud codecs. A subjective study in a state-of-the-art mixed reality system shows that the introduced prediction distortions are negligible compared with the original reconstructed point clouds. In addition, it shows the benefit of reconstructed point cloud video as a representation in the 3D virtual world. The codec is available as open source for integration in immersive and augmented communication applications and serves as a base reference software platform in JTC1/SC29/WG11 (MPEG) for the further development of standardized point-cloud compression solutions.
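
As a rough illustration of the macroblock partitioning step (names and the default N are assumptions; in the actual codec, block-wise ICP and quaternion quantization of the rigid transforms would follow):

    import numpy as np

    def partition_macroblocks(points, N=16):
        # Group integer voxel coordinates (n, 3) into N x N x N macroblocks,
        # keyed by each block's coordinates in the coarse grid.
        blocks = {}
        for p in np.asarray(points, dtype=int):
            blocks.setdefault(tuple(p // N), []).append(p)
        return {key: np.array(pts) for key, pts in blocks.items()}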

346 citations


Journal ArticleDOI
Lukas Cavigelli, Luca Benini
TL;DR: A new architecture, design, and implementation, as well as the first reported silicon measurements of such an accelerator, outperforming previous work in terms of power, area, and I/O efficiency are presented.
Abstract: An ever-increasing number of computer vision and image/video processing challenges are being approached using deep convolutional neural networks, obtaining state-of-the-art results in object recognition and detection, semantic segmentation, action recognition, optical flow, and super-resolution. Hardware acceleration of these algorithms is essential to adopt these improvements in embedded and mobile computer vision systems. We present a new architecture, design, and implementation, as well as the first reported silicon measurements of such an accelerator, outperforming previous work in terms of power, area, and I/O efficiency. The manufactured device provides up to 196 GOp/s on 3.09 $\text{mm}^{2}$ of silicon in UMC 65-nm technology and can achieve a power efficiency of 803 GOp/s/W. The massively reduced bandwidth requirements make it the first architecture scalable to TOp/s performance.

164 citations


Journal ArticleDOI
Zhi Liu, Junhao Li, Linwei Ye, Guangling Sun, Liquan Shen
TL;DR: The experimental results on two video data sets with various unconstrained videos demonstrate that the proposed model consistently outperforms the state-of-the-art spatiotemporal saliency models on saliency detection performance.
Abstract: This paper proposes an effective spatiotemporal saliency model for unconstrained videos with complicated motion and complex scenes. First, superpixel-level motion and color histograms, as well as a global motion histogram, are extracted as the features for saliency measurement. Then a superpixel-level graph with the addition of a virtual background node representing the global motion is constructed, and an iterative motion saliency (MS) measurement method that utilizes the shortest path algorithm on the graph is exploited to generate reasonable MS maps. Temporal propagation of saliency in both forward and backward directions is performed using efficient operations on inter-frame similarity matrices to obtain integrated temporal saliency maps with better coherence. Finally, spatial propagation of saliency both locally and globally is performed via intra-frame similarity matrices to obtain spatiotemporal saliency maps with even better quality. The experimental results on two video data sets with various unconstrained videos demonstrate that the proposed model consistently outperforms the state-of-the-art spatiotemporal saliency models in saliency detection performance.
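
A minimal sketch of the shortest-path step, assuming edge weights have already been derived from superpixel motion/color histogram differences (all names are hypothetical):

    import numpy as np
    from scipy.sparse import csr_matrix
    from scipy.sparse.csgraph import dijkstra

    def motion_saliency(edges, weights, n_superpixels):
        # Node index n_superpixels is the virtual background node; a
        # superpixel's motion saliency is its shortest-path distance to it.
        rows, cols = zip(*edges)
        n = n_superpixels + 1
        graph = csr_matrix((weights, (rows, cols)), shape=(n, n))
        dist = dijkstra(graph, directed=False, indices=n_superpixels)
        return dist[:n_superpixels]          # drop the virtual node itself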

157 citations


Journal ArticleDOI
TL;DR: This paper proposes an integrated pipeline that incorporates the output of object trajectory analysis and pixel-based analysis for abnormal behavior inference and shows that this approach is able to detect several types of abnormal group behaviors with fewer false alarms than existing approaches.
Abstract: In this paper, we present a unified approach for abnormal behavior detection and group behavior analysis in video scenes. Existing approaches for abnormal behavior detection use either trajectory-based or pixel-based methods. Unlike these approaches, we propose an integrated pipeline that incorporates the output of object trajectory analysis and pixel-based analysis for abnormal behavior inference. This enables the detection of abnormal behaviors related to the speed and direction of object trajectories, as well as complex behaviors related to the finer motion of each object. By applying our approach on three different data sets, we show that our approach is able to detect several types of abnormal group behaviors with fewer false alarms than existing approaches.

156 citations


Journal ArticleDOI
TL;DR: By integrating the complementary information of MLR, CFV, and the CNN features of the fully connected layer, the state-of-the-art performance can be achieved on scene recognition and DA problems.
Abstract: Convolutional neural network (CNN) has achieved the state-of-the-art performance in many different visual tasks. Learned from a large-scale training data set, CNN features are much more discriminative and accurate than handcrafted features. Moreover, CNN features are also transferable among different domains. On the other hand, traditional dictionary-based features (such as BoW and spatial pyramid matching) contain much more local discriminative and structural information, which is implicitly embedded in the images. To further improve the performance, in this paper, we propose to combine CNN with dictionary-based models for scene recognition and visual domain adaptation (DA). Specifically, based on the well-tuned CNN models (e.g., AlexNet and VGG Net), two dictionary-based representations are further constructed, namely, mid-level local representation (MLR) and convolutional Fisher vector (CFV) representation. In MLR, an efficient two-stage clustering method, i.e., weighted spatial and feature space spectral clustering on the parts of a single image followed by clustering all representative parts of all images, is used to generate a class-mixture or a class-specific part dictionary. After that, the part dictionary is used to operate with the multiscale image inputs for generating mid-level representation. In CFV, a multiscale and scale-proportional Gaussian mixture model training strategy is utilized to generate Fisher vectors based on the last convolutional layer of CNN. By integrating the complementary information of MLR, CFV, and the CNN features of the fully connected layer, the state-of-the-art performance can be achieved on scene recognition and DA problems. An interesting finding is that our proposed hybrid representation (from VGG Net trained on ImageNet) is also highly complementary to GoogLeNet and/or VGG-11 (trained on Place205).

153 citations


Journal ArticleDOI
TL;DR: A novel network structure, which allows an arbitrary number of frames as the network input, is proposed and can be learned on a small target data set because it can leverage the off-the-shelf image-level CNN for model parameter initialization.
Abstract: Encouraged by the success of convolutional neural networks (CNNs) in image classification, much effort has recently been spent on applying CNNs to video-based action recognition problems. One challenge is that a video contains a varying number of frames, which is incompatible with the standard input format of CNNs. Existing methods handle this issue either by directly sampling a fixed number of frames or by bypassing it with a 3D convolutional layer, which conducts convolution in the spatial-temporal domain. In this paper, we propose a novel network structure, which allows an arbitrary number of frames as the network input. The key to our solution is to introduce a module consisting of an encoding layer and a temporal pyramid pooling layer. The encoding layer maps the activation from the previous layers to a feature vector suitable for pooling, whereas the temporal pyramid pooling layer converts multiple frame-level activations into a fixed-length video-level representation. In addition, we adopt a feature concatenation layer that combines the appearance and motion information. Compared with the frame sampling strategy, our method avoids the risk of missing any important frames. Compared with the 3D convolutional method, which requires a huge video data set for network training, our model can be learned on a small target data set because we can leverage the off-the-shelf image-level CNN for model parameter initialization. Experiments on three challenging data sets, Hollywood2, HMDB51, and UCF101, demonstrate the effectiveness of the proposed network.
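
A minimal sketch of the temporal pyramid pooling idea, assuming max pooling and a three-level pyramid (the paper's encoding layer would run before this step):

    import numpy as np

    def temporal_pyramid_pooling(frame_feats, levels=(1, 2, 4)):
        # frame_feats: (n_frames, dim) frame-level activations; assumes
        # n_frames >= max(levels). Pool over 1, 2, and 4 temporal segments
        # and concatenate into a fixed-length video-level descriptor.
        feats = np.asarray(frame_feats)
        pooled = [seg.max(axis=0)
                  for level in levels
                  for seg in np.array_split(feats, level, axis=0)]
        return np.concatenate(pooled)        # length: dim * sum(levels)

Whatever the number of input frames, the output length is fixed (dim * 7 here), which is what makes the representation compatible with the subsequent fully connected layers.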

145 citations


Journal ArticleDOI
TL;DR: This paper presents an approach for detecting anomalous events in videos with crowds that uses general concepts, such as orientation, velocity, and entropy to capture anomalies and proposes a novel spatiotemporal feature descriptor, called histograms of optical flow orientation and magnitude and entropy, based on optical flow information.
Abstract: This paper presents an approach for detecting anomalous events in videos with crowds. The main goal is to recognize patterns that might lead to an anomalous event. An anomalous event might be characterized by deviation from the normal or usual, but not necessarily in an undesirable manner, e.g., an anomalous event might just be different from normal but not a suspicious event from the surveillance point of view. One of the main challenges of detecting such events is the difficulty of creating models for them, due to their unpredictability and their dependency on the context of the scene. Based on these challenges, we present a model that uses general concepts, such as orientation, velocity, and entropy, to capture anomalies. Using this type of information, we can define models for different cases and environments. Assuming images captured from a single static camera, we propose a novel spatiotemporal feature descriptor, called histograms of optical flow orientation and magnitude and entropy, based on optical flow information. To determine the normality or abnormality of an event, the proposed model is composed of training and test steps. During training, we learn the normal patterns. Then, during testing, events are described, and if they differ significantly from the learned normal patterns, they are considered anomalous. The experimental results demonstrate that our model can handle different situations and is able to recognize anomalous events with success. We use the well-known UCSD and Subway data sets and introduce a new data set, namely, Badminton.
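
The orientation part of such a descriptor can be sketched as a magnitude-weighted orientation histogram (the full descriptor also histograms magnitude and entropy; names and bin count are illustrative):

    import numpy as np

    def flow_orientation_histogram(flow, n_bins=8):
        # flow: (H, W, 2) optical-flow field. Bin flow orientations,
        # weighting each pixel's vote by its flow magnitude.
        fx, fy = flow[..., 0].ravel(), flow[..., 1].ravel()
        mag = np.hypot(fx, fy)
        ang = np.arctan2(fy, fx) % (2 * np.pi)
        hist, _ = np.histogram(ang, bins=n_bins,
                               range=(0.0, 2 * np.pi), weights=mag)
        return hist / (hist.sum() + 1e-12)   # L1-normalize the histogram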

142 citations


Journal ArticleDOI
TL;DR: Experimental results show that the proposed fast intra-coding algorithm achieves about a 54% encoding time reduction on average with only a 0.7% BD-rate increase for the HEVC reference software HM 14.0 under all-intra configuration.
Abstract: The latest video coding standard High Efficiency Video Coding (HEVC) achieves about a 50% bit-rate reduction compared with H.264/AVC under the same perceptual video quality. For intra coding, a coding unit (CU) is recursively divided into a quadtree-based structure from the largest CU $64 \times 64$ to the smallest CU $8 \times 8$. Also, up to 35 intra-prediction modes are allowed. These two techniques improve the intra-coding performance significantly. However, the encoding complexity increases several times compared with H.264/AVC intra coding. In this paper, fast intra-mode decision and CU size decision are proposed to reduce the complexity of HEVC intra coding while maintaining the rate-distortion (RD) performance. For fast intra-mode decision, a gradient-based method is proposed to reduce the candidate modes for rough mode decision and RD optimization. For fast CU size decision, homogeneous CUs are terminated early first. Then two linear support vector machines that employ the depth difference and HAD cost ratio (and RD cost ratio) as features are proposed to perform the decisions of early CU split and early CU termination for the rest of the CUs. Experimental results show that the proposed fast intra-coding algorithm achieves about a 54% encoding time reduction on average with only a 0.7% BD-rate increase for the HEVC reference software HM 14.0 under all-intra configuration.
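
Schematically, the two linear SVMs act as cheap gates in front of the full RD search; the sketch below assumes offline-trained weights and hypothetical names:

    import numpy as np

    def cu_size_decision(feats, w_split, b_split, w_term, b_term):
        # feats: e.g., depth difference and HAD/RD cost ratios of the CU.
        if np.dot(w_split, feats) + b_split > 0:
            return "early_split"        # split now; skip current-size RD test
        if np.dot(w_term, feats) + b_term > 0:
            return "early_terminate"    # stop here; skip testing smaller CUs
        return "full_rd_search"         # fall back to the exhaustive check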

132 citations


Journal ArticleDOI
TL;DR: In this article, an inter-scale regularizer is introduced into the WLS optimization objective to enforce the consistency of the cost volume among the neighboring scales, which leads to the proposed framework.
Abstract: This paper proposes a generic framework that enables a multiscale interaction in the cost aggregation step of stereo matching algorithms. Inspired by the formulation of image filters, we first reformulate cost aggregation from a weighted least-squares (WLS) optimization perspective and show that different cost aggregation methods essentially differ in the choices of similarity kernels. Our key motivation is that while the human stereo vision system processes information at both coarse and fine scales interactively for the correspondence search, state-of-the-art approaches aggregate costs at the finest scale of the input stereo images only, ignoring inter-consistency across multiple scales. This motivation leads us to introduce an inter-scale regularizer into the WLS optimization objective to enforce the consistency of the cost volume among the neighboring scales. The new optimization objective with the inter-scale regularization is convex, and thus, it is easily and analytically solved. Minimizing this new objective leads to the proposed framework. Since the regularization term is independent of the similarity kernel, various cost aggregation approaches, including discrete and continuous parameterization methods, can be easily integrated into the proposed framework. We show that the cross-scale framework is important as it effectively and efficiently expands state-of-the-art cost aggregation methods and leads to significant improvements, when evaluated on Middlebury, Middlebury Third, KITTI, and New Tsukuba data sets.
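
Schematically (in our notation, not necessarily the paper's), the per-pixel cross-scale objective takes the form

$$ \{\tilde{z}^{s}\} = \arg\min_{\{z^{s}\}} \sum_{s=0}^{S} \sum_{j \in N(i^{s})} K(i^{s}, j)\left(z^{s} - C^{s}(j, l)\right)^{2} + \lambda \sum_{s=1}^{S} \left(z^{s} - z^{s-1}\right)^{2}, $$

where $C^{s}$ is the cost volume at scale $s$, $K$ is the similarity kernel, and $\lambda$ controls the inter-scale consistency; being convex and quadratic, the problem admits an analytical solution.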

130 citations


Journal ArticleDOI
TL;DR: Experiments demonstrate that the proposed image-deblocking algorithm combining SSR and QC outperforms the current state-of-the-art methods in both peak signal-to-noise ratio and visual perception.
Abstract: The block discrete cosine transform (BDCT) has been widely used in current image and video coding standards, owing to its good energy compaction and decorrelation properties. However, because of independent quantization of DCT coefficients in each block, BDCT usually gives rise to visually annoying blocking compression artifacts, especially at low bit rates. In this paper, to reduce blocking artifacts and obtain high-quality images, image deblocking is cast as an optimization problem within maximum a posteriori framework, and a novel algorithm for image deblocking by using structural sparse representation (SSR) prior and quantization constraint (QC) prior is proposed. The SSR prior is utilized to simultaneously enforce the intrinsic local sparsity and the nonlocal self-similarity of natural images, while QC is explicitly incorporated to ensure a more reliable and robust estimation. A new split Bregman iteration-based method with an adaptively adjusted regularization parameter is developed to solve the proposed optimization problem, which makes the entire algorithm more practical. Experiments demonstrate that the proposed image-deblocking algorithm combining SSR and QC outperforms the current state-of-the-art methods in both peak signal-to-noise ratio and visual perception.

Journal ArticleDOI
TL;DR: An asymmetric distance model for learning camera-specific projections to transform the unmatched features of each view into a common space where discriminative features across view space are extracted, and a cross-view consistency regularization is introduced.
Abstract: Person reidentification, which matches person images of the same identity across nonoverlapping camera views, has become an important component of cross-camera-view activity analysis. Most (if not all) person reidentification algorithms are designed based on appearance features. However, appearance features are not stable across nonoverlapping camera views under dramatic lighting change, and those algorithms assume that two cross-view images of the same person can be well represented either by exploring robust and invariant features or by learning a matching distance. Such an assumption ignores the fact that images are captured under different camera views with different camera characteristics and environments, and thus there mostly exists a large discrepancy between the features extracted under different views. To solve this problem, we formulate an asymmetric distance model for learning camera-specific projections to transform the unmatched features of each view into a common space where discriminative features across view space are extracted. A cross-view consistency regularization is further introduced to model the correlation between view-specific feature transformations of different camera views, which reflects their natural relations and plays a significant role in avoiding overfitting. A kernel cross-view discriminant component analysis is also presented. Extensive experiments have been conducted to show that asymmetric distance modeling is important for person reidentification, addressing the concerns of cross-disjoint-view matching and reporting superior performance compared with related distance learning methods on six publicly available data sets.

Journal ArticleDOI
Chen Zhao, Siwei Ma, Jian Zhang, Ruiqin Xiong, Wen Gao
TL;DR: A novel algorithm for effectively reconstructing videos from CS measurements, with an effective scheme based on the split Bregman iteration algorithm to solve the formulated weighted minimization problem.
Abstract: The compressive sensing (CS) theory indicates that robust reconstruction of signals can be obtained from far fewer measurements than those required by the Nyquist–Shannon theorem. Thus, CS has great potential in video acquisition and processing, considering that it makes the subsequent complex data compression unnecessary. In this paper, we propose a novel algorithm for effectively reconstructing videos from CS measurements. The algorithm comprises two phases: the first exploits intra-frame correlation and provides a good initial recovery for each frame, and the second iteratively enhances reconstruction quality by alternating interframe multihypothesis (MH) prediction and sparsity modeling of residuals in a weighted manner. The weights of residual coefficients are updated in each iteration using a statistical method based on the MH predictions. These procedures are performed in units of overlapped patches such that potential blocking artifacts can be effectively suppressed through averaging. In addition, we devise an effective scheme based on the split Bregman iteration algorithm to solve the formulated weighted $\ell_{1}$ minimization problem. The experimental results demonstrate that the proposed algorithm outperforms the state-of-the-art methods in both objective and subjective reconstruction quality.
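
Inside each split Bregman iteration, the weighted sparsity step reduces to a weighted soft-thresholding operator; a minimal sketch (names assumed):

    import numpy as np

    def weighted_soft_threshold(x, w, tau):
        # Proximal operator of tau * sum_i w_i * |x_i|: each residual
        # coefficient is shrunk by its own weight-scaled threshold.
        return np.sign(x) * np.maximum(np.abs(x) - tau * w, 0.0)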

Journal ArticleDOI
TL;DR: This paper introduces a new higher order linear dynamical system (h-LDS) descriptor based on the higher order decomposition of the multidimensional image data and enables the analysis of dynamic textures by using information from various image elements.
Abstract: In this paper, we consider the problem of multidimensional dynamic texture analysis, and we introduce a new higher order linear dynamical system (h-LDS) descriptor. The proposed h-LDS descriptor is based on the higher order decomposition of the multidimensional image data and enables the analysis of dynamic textures by using information from various image elements. In addition, we propose a methodology for its application to video-based early warning systems that focus on smoke identification. More specifically, the proposed methodology enables the representation of video subsequences as histograms of h-LDS descriptors produced by the smoke candidate image patches in each subsequence. Finally, to further improve the classification accuracy, we propose the combination of multidimensional dynamic texture analysis with the spatiotemporal modeling of smoke by using a particle swarm optimization approach. The ability of the h-LDS to analyze dynamic texture information is evaluated through a multivariate comparison against the standard LDS descriptor. Experimental results on two video data sets show the great potential of the proposed smoke detection method.

Journal ArticleDOI
TL;DR: An optimal bit allocation (OBA) scheme for coding-tree-unit-level RC in HEVC is proposed, using a recursive Taylor expansion (RTE) method to iteratively solve the formulation so that an approximate closed-form solution can be obtained, thus achieving OBA and bit reallocation.
Abstract: For High Efficiency Video Coding (HEVC), the R–$\lambda$ scheme is the latest rate control (RC) scheme, which investigates the relationships among the allocated bits, the slope $\lambda$ of the rate-distortion (R-D) curve, and the quantization parameter. However, we argue that bit allocation in the existing R–$\lambda$ scheme is not optimal. In this paper, we therefore propose an optimal bit allocation (OBA) scheme for coding tree unit level RC in HEVC. Specifically, to achieve the OBA, we first develop an optimization formulation with a novel R-D estimation, instead of the existing R–$\lambda$ estimation. Unfortunately, it is intractable to obtain a closed-form solution to the optimization formulation. We thus propose a recursive Taylor expansion (RTE) method to iteratively solve the formulation. As a result, an approximate closed-form solution can be obtained, thus achieving OBA and bit reallocation. Both theoretical and numerical analyses show the fast convergence and low computational cost of the proposed RTE method. Therefore, our OBA scheme can be achieved at little encoding complexity cost. Finally, the experimental results validate the effectiveness of our scheme in three aspects: R-D performance, RC accuracy, and robustness over dynamic scene changes.

Journal ArticleDOI
TL;DR: The proposed RDH scheme for encrypted palette images adopts a color partitioning method to use the palette colors to construct a certain number of embeddable color triples, whose indexes are self-embedded into the encrypted image so that a data hider can collect the usable color triples to embed the secret data.
Abstract: Reversible data hiding (RDH) in encrypted images has attracted increasing attention from researchers, as the original content can be perfectly reconstructed after the embedded data are extracted while the content owner's privacy remains protected. The existing RDH techniques are designed for grayscale images and, therefore, cannot be directly applied to palette images. Since the pixel values in a palette image are not the actual color values, but rather the color indexes, RDH in encrypted palette images is more challenging than that designed for normal image formats. To the best of the authors' knowledge, no suitable RDH scheme designed for encrypted palette images has been reported, even though palette images have been widely utilized. This has motivated us to design a reliable RDH scheme for encrypted palette images. The proposed method adopts a color partitioning method to use the palette colors to construct a certain number of embeddable color triples, whose indexes are self-embedded into the encrypted image so that a data hider can collect the usable color triples to embed the secret data. For a receiver, the embedded color triples can be determined by verifying a self-embedded check code that enables the receiver to retrieve the embedded data with only the data hiding key. Using the encryption key, the receiver can roughly reconstruct the image content. Experiments have shown that our proposed method has the property that the presented data extraction and image recovery are separable and reversible. Compared with the state-of-the-art works, our proposed method can provide a relatively high data-embedding payload, maintain high peak signal-to-noise ratio values of the decrypted and marked images, and have a low computational complexity.

Journal ArticleDOI
TL;DR: This paper proposes a novel tracking by matching framework for robust tracking based on basis matching rather than point matching, which outperforms several state-of-the-art methods.
Abstract: Most existing tracking approaches are based on either the tracking by detection framework or the tracking by matching framework. The former needs to learn a discriminative classifier using positive and negative samples, which can cause tracking drift due to unreliable samples. The latter usually performs tracking by matching local interest points between a target candidate and the tracked target, which is not robust to target appearance changes over time. In this paper, we propose a novel tracking by matching framework for robust tracking based on basis matching rather than point matching. In particular, we learn the target model from target images using a set of Gabor basis functions, which have large responses at the corresponding spatial positions after max pooling. During tracking, a target candidate is evaluated by computing the responses of the Gabor basis functions at their corresponding spatial positions. The experimental results on a set of challenging sequences validate that the proposed tracking method outperforms several state-of-the-art methods.
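
A rough sketch of the basis-matching idea, using a real-valued Gabor kernel and summed responses at fixed positions (parameters and names are assumptions; OpenCV's cv2.getGaborKernel is an off-the-shelf alternative for the kernel):

    import numpy as np

    def gabor_kernel(ksize, theta, sigma=2.0, lam=4.0, gamma=0.5):
        # Real Gabor kernel at orientation theta (wavelength lam).
        half = ksize // 2
        y, x = np.mgrid[-half:half + 1, -half:half + 1]
        xr = x * np.cos(theta) + y * np.sin(theta)
        yr = -x * np.sin(theta) + y * np.cos(theta)
        return (np.exp(-(xr**2 + (gamma * yr)**2) / (2 * sigma**2))
                * np.cos(2 * np.pi * xr / lam))

    def candidate_score(patch, kernels, positions):
        # Sum absolute basis responses at their spatial positions
        # (positions assumed at least half a kernel from the border).
        score = 0.0
        for k, (r, c) in zip(kernels, positions):
            h = k.shape[0] // 2
            win = patch[r - h:r + h + 1, c - h:c + h + 1]
            score += abs(float(np.sum(win * k)))
        return score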

Journal ArticleDOI
TL;DR: This paper investigates how to fuse grayscale and thermal video data for detecting foreground objects in challenging scenarios and proposes an intuitive yet effective method called weighted low-rank decomposition (WELD), which adaptively pursues the cross-modality low- rank representation.
Abstract: This paper investigates how to fuse grayscale and thermal video data for detecting foreground objects in challenging scenarios. To this end, we propose an intuitive yet effective method called weighted low-rank decomposition (WELD), which adaptively pursues the cross-modality low-rank representation. Specifically, we form two data matrices by accumulating sequential frames from the grayscale and the thermal videos, respectively. Within these two observation matrices, WELD detects moving foreground pixels as sparse outliers against the low-rank structure background and incorporates the weight variables to make the models of the two modalities complementary to each other. Smoothness constraints on object motion are also introduced in WELD to further improve the robustness to noise. For optimization, we propose an iterative algorithm to efficiently solve the low-rank models with three subproblems. Moreover, we utilize an edge-preserving filtering-based method to substantially speed up WELD while preserving its accuracy. To provide a comprehensive evaluation benchmark for grayscale-thermal foreground detection, we create a new data set including 25 aligned grayscale-thermal video pairs with high diversity. Our extensive experiments on both the newly created data set and the public data set OSU3 suggest that WELD achieves superior performance and comparable efficiency compared with other state-of-the-art approaches.
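
At the core of such low-rank models is singular-value thresholding on each modality's accumulated frame matrix; a generic sketch (the full WELD model adds the modality weights and motion-smoothness terms):

    import numpy as np

    def singular_value_threshold(M, tau):
        # Nuclear-norm proximal step: shrink singular values so that the
        # low-rank part models the background, leaving the moving
        # foreground as sparse outliers in the residual M - L.
        U, s, Vt = np.linalg.svd(M, full_matrices=False)
        return (U * np.maximum(s - tau, 0.0)) @ Vt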

Journal ArticleDOI
TL;DR: This paper tries to improve the overall multicamera object tracking performance by a global graph model with an improved similarity metric and shows that the method can work better even in the condition of poor SCT.
Abstract: Nonoverlapping multicamera visual object tracking typically consists of two steps: single-camera object tracking (SCT) and inter-camera object tracking (ICT). Most tracking methods focus on SCT, which takes place within a single scene, whereas real surveillance scenes require ICT, for which single-camera tracking methods cannot work effectively. In this paper, we try to improve the overall multicamera object tracking performance with a global graph model and an improved similarity metric. Our method treats the similarities of single-camera tracking and inter-camera tracking differently and performs the optimization in a global graph model. The results show that our method can work better even under poor SCT.

Journal ArticleDOI
TL;DR: This paper proposes a novel superpixel segmentation approach based on a distance function that is designed to balance among boundary adherence, intensity homogeneity, and compactness (COM) characteristics of the resulting superpixels.
Abstract: As one of the most popular image oversegmentations, superpixel has been commonly used as supporting regions for primitives to reduce computations in various computer vision tasks. In this paper, we propose a novel superpixel segmentation approach based on a distance function that is designed to balance among boundary adherence, intensity homogeneity, and compactness (COM) characteristics of the resulting superpixels. Given an expected number of superpixels, our method begins with initializing the superpixel seed positions to obtain the initial labels of pixels. Then, we optimize the superpixels iteratively based on the defined distance measurement. We update the positions and intensities of superpixel seeds based on the three-sigma rule. The experimental results demonstrate that our algorithm is more effective and accurate than previous superpixel methods and achieves a comparable tradeoff between superpixel COM and adherence to object boundaries.
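
The flavor of such a distance can be sketched as a SLIC-style combination of an intensity term and a spatial-compactness term (m, S, and the omitted boundary-adherence term are simplifications):

    import numpy as np

    def superpixel_distance(pix_xy, pix_val, seed_xy, seed_val, S, m=10.0):
        # S: expected superpixel spacing; m trades compactness against
        # intensity homogeneity (larger m yields more compact superpixels).
        d_val = np.linalg.norm(np.atleast_1d(np.subtract(pix_val, seed_val)))
        d_xy = np.linalg.norm(np.subtract(pix_xy, seed_xy))
        return d_val + (m / S) * d_xy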

Journal ArticleDOI
TL;DR: A novel method for 4D light-field (LF) depth estimation exploiting the special linear structure of an epipolar plane image (EPI) and locally linear embedding (LLE) based on a local reliability measure to achieve higher performance than the typical and recent state-of-the-art LF stereo matching methods.
Abstract: In this paper, we propose a novel method for 4D light-field (LF) depth estimation exploiting the special linear structure of an epipolar plane image (EPI) and locally linear embedding (LLE). Without high computational complexity, depth maps are locally estimated by locating the optimal slope of each line segment on the EPIs, which are projected by the corresponding scene points. For each pixel to be processed, we build and then minimize a matching cost that aggregates the intensity pixel value, gradient pixel value, spatial consistency, as well as a reliability measure to select the optimal slope from a predefined set of directions. Next, a subangle estimation method is proposed to further refine the obtained optimal slope of each pixel. Furthermore, based on a local reliability measure, all the pixels are classified into reliable and unreliable pixels. For the unreliable pixels, LLE is employed to propagate depth from the reliable pixels, based on the manifold-preserving property of natural images. We demonstrate the effectiveness of our approach on a number of synthetic LF examples and real-world LF data sets, and show that our experimental results achieve higher performance than typical and recent state-of-the-art LF stereo matching methods.
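
The slope search can be sketched as follows, using a grayscale EPI and a simple variance cost (the paper's cost additionally aggregates gradients, spatial consistency, and a reliability measure; all names are illustrative):

    import numpy as np

    def best_epi_slope(epi, x, slopes):
        # epi: (U, S) grayscale epipolar-plane image; x: spatial column of
        # the pixel in the central view. Score each candidate slope by the
        # intensity variance along its sheared line; lowest variance wins.
        U = epi.shape[0]
        u = np.arange(U)
        costs = []
        for slope in slopes:
            cols = np.round(x + slope * (u - U // 2)).astype(int)
            valid = (cols >= 0) & (cols < epi.shape[1])
            costs.append(epi[u[valid], cols[valid]].var()
                         if valid.any() else np.inf)
        return slopes[int(np.argmin(costs))]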

Journal ArticleDOI
TL;DR: This paper analytically studies the characteristics of delay announcement by analyzing the components of the user response and designs a QoE-driven delay announcement scheme by establishing an objective user response function.
Abstract: As a useful tool for improving the user’s quality of experience (QoE), delay announcement has received substantial attention recently. However, how to make a simple and efficient delay announcement in the cloud mobile media environment is still an open and challenging problem. Unfortunately, traditional convex and stochastic optimization-based methods cannot address this issue due to the subjective user response with respect to the announced delay. To resolve this problem, this paper analytically studies the characteristics of delay announcement by analyzing the components of the user response and designs a QoE-driven delay announcement scheme by establishing an objective user response function. On the methodology end, the user response associated with the announced delay is approximated in the framework of fluid model, where the interaction between the system performance and delay announcement is well described by a series of mathematical functions. On the technology end, this paper develops a novel state-dependent announcement scheme that is more reliable than the other competing ones and can improve the user’s QoE dramatically. Extensive simulation results validate the efficiency of the proposed delay announcement scheme.

Journal ArticleDOI
TL;DR: A nonlocal (NL) extension of TV regularization is introduced, which models the sparsity of the image gradient with pixelwise content-adaptive distributions, reflecting the nonstationary nature of image statistics.
Abstract: Total variation (TV) regularization is widely used in image restoration to exploit the local smoothness of image content. Essentially, the TV model assumes a zero-mean Laplacian distribution for the gradient at all pixels. However, real-world images are nonstationary in general, and the zero-mean assumption of pixel gradient might be invalid, especially for regions with strong edges or rich textures. This paper introduces a nonlocal (NL) extension of TV regularization, which models the sparsity of the image gradient with pixelwise content-adaptive distributions, reflecting the nonstationary nature of image statistics. Taking advantage of the NL similarity of natural images, the proposed approach estimates the image gradient statistics at a particular pixel from a group of nonlocally searched patches, which are similar to the patch located at the current pixel. The gradient data in these NL similar patches are regarded as the samples of the gradient distribution to be learned. In this way, more accurate estimation of gradient is achieved. Experimental results demonstrate that the proposed method remarkably outperforms the conventional TV model and several other anchor methods, producing better objective and subjective image quality.
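
In schematic form (our notation), the classical TV prior penalizes $\sum_{i} |\nabla u_{i}|$, i.e., a zero-mean Laplacian on all gradients, whereas the NL extension fits a per-pixel Laplacian from nonlocally similar patches:

$$ E(u) = \frac{1}{2}\left \| u - y \right \|_{2}^{2} + \lambda \sum_{i} w_{i} \left | \nabla u_{i} - \mu_{i} \right |, $$

where $\mu_{i}$ and $w_{i}$ are the mean and scale estimated from the gradient samples gathered in the patches nonlocally similar to the patch at pixel $i$.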

Journal ArticleDOI
TL;DR: A coding method to decompose the joint distortion (abbreviated to DeJoin) into distortion on individual pixels; thus, the message can be efficiently embedded with syndrome-trellis codes and it is proved that DeJoin can approach the lower bound of joint distortion.
Abstract: Recent advances on adaptive steganography imply that the security of steganography can be improved by exploiting the mutual impact of modifications between adjacent cover elements, such as pixels of images, which is called a nonadditive distortion model. In this paper, we propose a framework for nonadditive distortion steganography by defining joint distortion on pixel blocks. To reduce the complexity for minimizing joint distortion, we design a coding method to decompose the joint distortion (abbreviated to DeJoin) into distortion on individual pixels; thus, the message can be efficiently embedded with syndrome-trellis codes. We prove that DeJoin can approach the lower bound of joint distortion. As an example, we define joint distortion according to the principle of synchronizing modification direction and then design steganographic algorithms with DeJoin. The experimental results show that the proposed method outperforms previous nonadditive distortion steganography when resisting the state-of-the-art steganalysis.

Journal ArticleDOI
TL;DR: An improved WLD (IWLD) is proposed to better depict low-level image appearance information, and a modified sparse-representation-based classification model is developed to both control the reconstruction error of coding coefficients and minimize the classification error.
Abstract: Automatic violence detection from video is a hot topic for many video surveillance applications. However, there has been little success in developing an algorithm that can detect violence in surveillance videos with high performance. In this paper, following our recently proposed idea of motion Weber local descriptor (WLD), we make two major improvements and propose a more effective and efficient algorithm for detecting violence from motion images. First, we propose an improved WLD (IWLD) to better depict low-level image appearance information, and then extend the spatial descriptor IWLD by adding a temporal component to capture local motion information and hence form the motion IWLD. Second, we propose a modified sparse-representation-based classification model to both control the reconstruction error of coding coefficients and minimize the classification error. Based on the proposed sparse model, a class-specific dictionary containing dictionary atoms corresponding to the class labels is learned using class labels of training samples. With this learned dictionary, not only the representation residual but also the representation coefficients become discriminative. A classification scheme integrating the modified sparse model is developed to exploit such discriminative information. The experimental results on three benchmark data sets have demonstrated the superior performance of the proposed approach over the state of the art.

Journal ArticleDOI
TL;DR: A pedestrian detection framework that is computationally less expensive as well as more accurate than HOG-linear SVM and hardware implementation on Altera Cyclone IV field-programmable gate array results in more than 40% savings in logic resources.
Abstract: Pedestrian detection is a key problem in computer vision and is currently addressed with increasingly complex solutions involving compute-intensive features and classification schemes. In this scope, the histogram of oriented gradients (HOG) in conjunction with a linear support vector machine (SVM) classifier is considered to be the single most discriminative feature, adopted as a stand-alone detector as well as a key instrument in advanced systems involving hybrid features and cascaded detectors. In this paper, we propose a pedestrian detection framework that is computationally less expensive as well as more accurate than HOG-linear SVM. The proposed scheme exploits the discriminating power of locally significant gradients in building orientation histograms without involving complex floating point operations while computing the feature. The integer-only feature allows the use of the powerful histogram intersection kernel SVM classifier in a fast lookup-table-based implementation. As a result, the proposed framework achieves at least 3% more accurate detection results than HOG on standard data sets while being 1.8 and 2.6 times faster on conventional desktop PC and embedded ARM platforms, respectively, for single-scale pedestrian detection on VGA resolution video. In addition, hardware implementation on an Altera Cyclone IV field-programmable gate array results in more than 40% savings in logic resources compared with its HOG-linear SVM competitor. Hence, the proposed feature and classification setup is shown to be a better candidate for the single most discriminative pedestrian detector than the currently accepted HOG-linear SVM.
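
The lookup-table trick for the histogram intersection kernel can be sketched as below: for integer features, the SVM decision function decomposes per dimension, so each dimension's partial sum can be precomputed over all possible feature values (names and the quantization level are assumptions):

    import numpy as np

    def build_hik_tables(support_vecs, alphas, n_levels):
        # tables[d, v] = sum_i alpha_i * min(v, sv[i, d]); at test time the
        # SVM decision is then one table lookup per feature dimension.
        n_sv, dim = support_vecs.shape
        tables = np.empty((dim, n_levels))
        for v in range(n_levels):
            tables[:, v] = (alphas[:, None]
                            * np.minimum(v, support_vecs)).sum(axis=0)
        return tables

    def hik_decision(tables, x, bias=0.0):
        # x: integer feature vector with values in [0, n_levels).
        return tables[np.arange(len(x)), x].sum() + bias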

Journal ArticleDOI
TL;DR: This paper proposes to learn deep representations with an adaptive margin listwise loss, which can assign larger margins to harder negative samples and can be interpreted as an implementation of the automatic hard negative mining strategy.
Abstract: Person reidentification (re-id) aims to match a specific person across nonoverlapping cameras, which is an important but challenging task in video surveillance. Conventional methods mainly focus either on feature construction or metric learning. Recently, some deep learning-based methods have been proposed to learn image features and similarity measures jointly. However, current deep models for person re-id are usually trained with either a pairwise loss, where negative pairs greatly outnumber positive pairs and may bias the trained model toward negative pairs, or a constant-margin hinge loss, which ignores the fact that hard negative samples should be paid more attention in the training stage. In this paper, we propose to learn deep representations with an adaptive margin listwise loss. First, ranking lists instead of image pairs are used as training samples; in this way, the problem of data imbalance is relaxed. Second, by introducing an adaptive margin parameter in the listwise loss function, it can assign larger margins to harder negative samples, which can be interpreted as an implementation of the automatic hard negative mining strategy. To gain robustness against changes in poses and part occlusions, our architecture combines four convolutional neural networks, each of which embeds images from different scales or different body parts. The final combined model performs much better than each single model. The experimental results show that our approach achieves very promising results on the challenging CUHK03, CUHK01, and VIPeR data sets.
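
The adaptive-margin idea can be sketched for a single ranking list; the margin schedule below is purely illustrative, and the paper's exact formulation may differ:

    import numpy as np

    def adaptive_margin_list_loss(d_pos, d_negs, base=0.5, scale=0.5):
        # d_pos: anchor-positive distance; d_negs: anchor-negative
        # distances. Harder negatives (smaller distance to the anchor)
        # receive larger margins, mimicking automatic hard negative mining.
        d_negs = np.asarray(d_negs, dtype=float)
        margins = base + scale / (1.0 + d_negs)
        return np.maximum(0.0, d_pos - d_negs + margins).mean()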

Journal ArticleDOI
TL;DR: The main elements of an integrated platform, which target tele-immersion and future 3D applications, are described in this paper, addressing the tasks of real-time capturing, robust 3D human shape/appearance reconstruction, and skeleton-based motion tracking.
Abstract: The latest developments in 3D capturing, processing, and rendering provide means to unlock novel 3D application pathways. The main elements of an integrated platform, which targets tele-immersion and future 3D applications, are described in this paper, addressing the tasks of real-time capturing, robust 3D human shape/appearance reconstruction, and skeleton-based motion tracking. More specifically, the details of a multiple RGB-depth (RGB-D) capturing system are given first, along with a novel sensor calibration method. A robust, fast reconstruction method from multiple RGB-D streams is then proposed, based on an enhanced variation of the volumetric Fourier transform-based method, parallelized on the graphics processing unit and accompanied by an appropriate texture-mapping algorithm. On top of that, given the lack of relevant objective evaluation methods, a novel framework is proposed for the quantitative evaluation of real-time 3D reconstruction systems. Finally, a generic, multiple depth stream-based method for accurate real-time human skeleton tracking is proposed. Detailed experimental results with multi-Kinect2 data sets verify the validity of our arguments and the effectiveness of the proposed system and methodologies.

Journal ArticleDOI
TL;DR: The proposed visual descriptors outperform the state-of-the-art methods by a significant margin on the most challenging data sets and are demonstrated within three applications: crowd video classification, anomaly detection, and violence detection in crowds.
Abstract: Crowd behavior analysis has recently emerged as an increasingly important and dedicated problem for crowd monitoring and management in the visual surveillance community. In particular, it is receiving a lot of attention for detecting potentially dangerous situations and preventing overcrowding. In this paper, we propose to quantify crowd properties by a rich set of visual descriptors. The calculation of these descriptors is realized through a novel spatio-temporal model of the crowd. It consists of modeling the time-varying dynamics of the crowd using local feature tracks. It also involves a Delaunay triangulation to approximate neighborhood interactions. In total, the crowd is represented as an evolving graph, where the nodes correspond to the tracklets. From this graph, various mid-level representations are extracted to determine the ongoing crowd behaviors. In particular, the effectiveness of the proposed visual descriptors is demonstrated within three applications: crowd video classification, anomaly detection, and violence detection in crowds. The obtained results on videos from different data sets prove the relevance of these visual descriptors to crowd behavior analysis. In addition, by means of comparisons to other existing methods, we demonstrate that the proposed descriptors outperform the state-of-the-art methods by a significant margin on the most challenging data sets.
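
The neighborhood-interaction step can be sketched with an off-the-shelf Delaunay triangulation over tracklet positions (function name assumed):

    import numpy as np
    from scipy.spatial import Delaunay

    def tracklet_graph_edges(points):
        # points: (n, 2) tracklet positions at one time step; returns the
        # undirected edge set of the Delaunay-based crowd graph.
        tri = Delaunay(np.asarray(points))
        edges = set()
        for a, b, c in tri.simplices:
            edges.update(tuple(sorted(e)) for e in ((a, b), (b, c), (a, c)))
        return edges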

Journal ArticleDOI
TL;DR: A novel green video transmission (GVT) algorithm that uses video clustering and channel assignment to assist in video transmission and demonstrates a superior video transmission performance compared with the existing methods.
Abstract: Video transmission is an indispensable component of most applications related to mobile cloud networks (MCNs). However, because of the complexity of the communication environment and the limitation of resources, attempts to develop an effective solution for video transmission in the MCN face certain difficulties. In this paper, we propose a novel green video transmission (GVT) algorithm that uses video clustering and channel assignment to assist in video transmission. A video clustering model is designed based on game theory to classify the different video parts stored in mobile devices. Using the results of video clustering, the GVT algorithm provides the function of channel assignment, and its assignment process depends on the content of the video to improve channel utilization in the MCN. Extensive simulations are carried out to evaluate the GVT with several performance criteria. Our analysis and simulations show that the proposed GVT demonstrates a superior video transmission performance compared with the existing methods.