
Showing papers in "IEEE Transactions on Circuits and Systems for Video Technology in 2008"


Journal ArticleDOI
TL;DR: A comprehensive survey of efforts in the past couple of decades to address the problems of representation, recognition, and learning of human activities from video and related applications is presented.
Abstract: The past decade has witnessed a rapid proliferation of video cameras in all walks of life and has resulted in a tremendous explosion of video content. Several applications such as content-based video annotation and retrieval, highlight extraction and video summarization require recognition of the activities occurring in the video. The analysis of human activities in videos is an area with increasingly important consequences from security and surveillance to entertainment and personal archiving. Several challenges at various levels of processing (robustness against errors in low-level processing, view- and rate-invariant representations at mid-level processing, and semantic representation of human activities at higher-level processing) make this problem hard to solve. In this review paper, we present a comprehensive survey of efforts in the past couple of decades to address the problems of representation, recognition, and learning of human activities from video and related applications. We discuss the problem at two major levels of complexity: 1) "actions" and 2) "activities." "Actions" are characterized by simple motion patterns typically executed by a single human. "Activities" are more complex and involve coordinated actions among a small number of humans. We discuss several approaches and classify them according to their ability to handle varying degrees of complexity as interpreted above. We begin with a discussion of approaches to model the simplest of action classes, known as atomic or primitive actions, that do not require sophisticated dynamical modeling. Then, methods to model actions with more complex dynamics are discussed. The discussion then leads naturally to methods for higher level representation of complex activities.

1,426 citations


Journal ArticleDOI
TL;DR: The methods reviewed are intended for real-time surveillance through the definition of a diverse set of events that trigger further analysis, including virtual fencing, speed profiling, behavior classification, anomaly detection, and object interaction.

Abstract: This paper presents a survey of trajectory-based activity analysis for visual surveillance. It describes techniques that use trajectory data to define a general set of activities that are applicable to a wide range of scenes and environments. Events of interest are detected by building a generic topographical scene description from the underlying motion structure observed over time. The scene topology is automatically learned and is distinguished by points of interest and by motion characterized as activity paths. The methods reviewed are intended for real-time surveillance through the definition of a diverse set of events that trigger further analysis, including virtual fencing, speed profiling, behavior classification, anomaly detection, and object interaction.

528 citations


Journal ArticleDOI
TL;DR: The proposed work addresses anomaly detection by means of trajectory analysis, an approach with several application fields, most notably video surveillance and traffic monitoring; it is based on single-class support vector machine (SVM) clustering, whose novelty-detection capabilities are used to identify anomalous trajectories.

Abstract: In recent years, the task of automatic event analysis in video sequences has gained increasing attention in the research community. The application domains are disparate, ranging from video surveillance to automatic video annotation for sports videos or TV shots. Whatever the application field, most works in event analysis are based on two main approaches: the former based on explicit event recognition, focused on finding high-level, semantic interpretations of video sequences, and the latter based on anomaly detection. This paper deals with the second approach, where the final goal is not the explicit labeling of recognized events, but the detection of anomalous events differing from typical patterns. In particular, the proposed work addresses anomaly detection by means of trajectory analysis, an approach with several application fields, most notably video surveillance and traffic monitoring. The proposed approach is based on single-class support vector machine (SVM) clustering, where the novelty-detection capabilities of the SVM are used to identify anomalous trajectories. Particular attention is given to trajectory classification in the absence of a priori information on the distribution of outliers. Experimental results demonstrate the validity of the proposed approach.
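As a rough illustration of the single-class SVM idea, the sketch below trains scikit-learn's OneClassSVM on resampled trajectory features and flags a deviating trajectory as an outlier; the fixed-length resampling, kernel, and nu value are illustrative assumptions rather than the paper's configuration.

```python
# Minimal sketch of one-class SVM novelty detection over trajectories;
# the resampling scheme and kernel settings are assumptions.
import numpy as np
from sklearn.svm import OneClassSVM

def to_feature(traj, n_points=16):
    """Resample an (N, 2) trajectory to a fixed-length feature vector."""
    t = np.linspace(0, 1, len(traj))
    tt = np.linspace(0, 1, n_points)
    x = np.interp(tt, t, traj[:, 0])
    y = np.interp(tt, t, traj[:, 1])
    return np.concatenate([x, y])

rng = np.random.default_rng(0)
# Synthetic "normal" trajectories: roughly straight left-to-right paths.
normal = [np.column_stack([np.linspace(0, 100, 30),
                           50 + rng.normal(0, 2, 30)]) for _ in range(200)]
X = np.array([to_feature(t) for t in normal])

model = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(X)

# An anomalous trajectory (a U-turn) should be flagged as an outlier.
u_turn = np.column_stack([np.concatenate([np.linspace(0, 50, 15),
                                          np.linspace(50, 0, 15)]),
                          np.linspace(40, 60, 30)])
print(model.predict([to_feature(u_turn)]))  # expected -1 (anomalous)
```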

507 citations


Journal ArticleDOI
TL;DR: A new method for human face recognition is presented that utilizes Gabor-based region covariance matrices as face descriptors, using both pixel locations and Gabor coefficients to form the covariance matrices.
Abstract: This paper presents a new method for human face recognition by utilizing Gabor-based region covariance matrices as face descriptors. Both pixel locations and Gabor coefficients are employed to form the covariance matrices. Experimental results demonstrate the advantages of this proposed method.
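A region covariance descriptor of this kind can be sketched in a few lines: per-pixel feature vectors are summarized by their covariance matrix. In the sketch below, simple gradient responses stand in for the paper's Gabor coefficients, which is an assumption made for brevity.

```python
# Sketch of a region covariance descriptor: per-pixel feature vectors
# (pixel coordinates plus gradient magnitudes standing in for a full
# Gabor bank) are summarized by their covariance matrix.
import numpy as np

def region_covariance(region):
    """region: 2-D grayscale array; returns a d x d covariance matrix."""
    h, w = region.shape
    ys, xs = np.mgrid[0:h, 0:w]
    gy, gx = np.gradient(region.astype(float))
    # Feature vector per pixel: [x, y, |Ix|, |Iy|]; the paper uses
    # Gabor coefficients in place of the gradient responses.
    F = np.stack([xs.ravel(), ys.ravel(),
                  np.abs(gx).ravel(), np.abs(gy).ravel()])
    return np.cov(F)

patch = np.random.default_rng(1).random((32, 32))
C = region_covariance(patch)
print(C.shape)  # (4, 4)
```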

296 citations


Journal ArticleDOI
TL;DR: Two efficient approaches to conceal regions of interest (ROIs) are introduced, based on transform-domain or codestream-domain scrambling in which selected data are pseudorandomly flipped or inverted during encoding.

Abstract: In this paper, we address the problem of privacy protection in video surveillance. We introduce two efficient approaches to conceal regions of interest (ROIs) based on transform-domain or codestream-domain scrambling. In the first technique, the sign of selected transform coefficients is pseudorandomly flipped during encoding. In the second method, some bits of the codestream are pseudorandomly inverted. We more specifically address the case of MPEG-4, as it is today the prevailing standard in video surveillance equipment. Simulations show that both techniques successfully hide private data in ROIs while the scene remains comprehensible. Additionally, the amount of noise introduced by the scrambling process can be adjusted. Finally, the impact on coding efficiency is small, and the required computational complexity is negligible.
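The transform-domain variant can be illustrated with a minimal sketch: a keyed pseudorandom generator decides which coefficients have their signs flipped, and the same key inverts the operation. The selection probability p and the use of NumPy's generator as the keystream are assumptions, not the paper's exact construction.

```python
# Sketch of transform-domain scrambling by pseudorandom sign flipping.
# A seeded generator acts as the shared secret; flipping the signs of
# selected coefficients conceals the ROI, and the same seed undoes it.
import numpy as np

def scramble_signs(coeffs, key, p=0.5):
    """coeffs: array of transform coefficients for an ROI block.
    Flipping is an involution: applying it twice restores the input."""
    rng = np.random.default_rng(key)
    flip = rng.random(coeffs.shape) < p
    out = coeffs.copy()
    out[flip] = -out[flip]
    return out

block = np.arange(-8, 8, dtype=float)
hidden = scramble_signs(block, key=1234)
restored = scramble_signs(hidden, key=1234)
assert np.allclose(restored, block)
```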

261 citations


Journal ArticleDOI
TL;DR: The higher the estimation granularity, the better the rate-distortion performance, since the decoding process adapts more closely to the statistical characteristics of the video; thus the pixel and coefficient levels perform best for the PDWZ and TDWZ solutions, respectively.

Abstract: In recent years, practical Wyner-Ziv (WZ) video coding solutions have been proposed with promising results. Most of the solutions available in the literature model the correlation noise (CN) between the original frame and its estimation made at the decoder, the so-called side information (SI), by a given distribution whose relevant parameters are estimated using an offline process, assuming that the SI is available at the encoder or the originals are available at the decoder. The major goal of this paper is to propose a more realistic WZ video coding approach by performing online estimation of the CN model parameters at the decoder, for pixel- and transform-domain WZ video codecs. In this context, several new techniques are proposed based on metrics that exploit the temporal correlation between frames with different levels of granularity. For pixel-domain WZ (PDWZ) video coding, three levels of granularity are proposed: frame, block, and pixel levels. For transform-domain WZ (TDWZ) video coding, DCT bands and coefficients are the two granularity levels proposed. The higher the estimation granularity, the better the rate-distortion performance, since the decoding process adapts more closely to the statistical characteristics of the video; this means that the pixel and coefficient levels are the best performing for the PDWZ and TDWZ solutions, respectively.
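A minimal sketch of the online estimation idea, assuming the common choice of a Laplacian correlation-noise model (the paper's specific frame-, block-, and pixel-level metrics are not reproduced): the Laplacian parameter is derived at each granularity from the variance of a residual proxy built from temporally adjacent frames.

```python
# Sketch of online correlation-noise estimation at the decoder.  WZ
# codecs commonly assume a Laplacian residual; alpha is derived from a
# residual proxy R built from temporally adjacent frames.
import numpy as np

def laplacian_alpha(residual):
    """For a zero-mean Laplacian, variance = 2 / alpha**2."""
    var = np.mean(residual.astype(float) ** 2)
    return np.sqrt(2.0 / max(var, 1e-8))

# Block-level granularity: one alpha per 8x8 block of the residual proxy.
def blockwise_alpha(residual, bs=8):
    h, w = residual.shape
    return np.array([[laplacian_alpha(residual[i:i+bs, j:j+bs])
                      for j in range(0, w, bs)]
                     for i in range(0, h, bs)])

res = np.random.default_rng(2).laplace(0, 4, (32, 32))
print(blockwise_alpha(res).shape)  # (4, 4): one parameter per block
```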

241 citations


Journal ArticleDOI
TL;DR: A competing framework for better motion vector coding and SKIP-mode prediction is proposed, yielding systematic bitrate savings of up to 45% on the Baseline and High profiles compared with an H.264/MPEG4-AVC standard codec.

Abstract: The H.264/MPEG4-AVC video coding standard has achieved a higher coding efficiency compared to its predecessors. The significant bitrate reduction is mainly obtained by efficient motion compensation tools, such as variable block sizes, multiple reference frames, 1/4-pel motion accuracy and powerful prediction modes (e.g., SKIP and DIRECT). These tools have contributed to an increased proportion of the motion information in the total bitstream. To achieve the performance required by the future ITU-T challenge, namely to provide a codec with 50% bitrate reduction compared to the current H.264, the reduction of this motion information cost is essential. This paper proposes a competing framework for better motion vector coding and SKIP-mode prediction. The predictors for the SKIP mode and the motion vector predictors are optimally selected by a rate-distortion criterion. These methods take advantage of the spatial and temporal redundancies in the motion vector fields, where the simple spatial median usually fails. An adaptation of the temporal predictors according to the temporal distances between motion vector fields is also described for the multiple reference frames and B-slices options. These two combined schemes lead to systematic bitrate savings of up to 45% on the Baseline and High profiles, compared to an H.264/MPEG4-AVC standard codec.
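The competition principle can be sketched as follows: each candidate predictor is scored with a Lagrangian cost J = D + λR, where the rate term counts the bits of the motion vector difference plus the predictor index. The candidate set and the crude exp-Golomb-style bit estimate below are illustrative assumptions.

```python
# Sketch of predictor competition: each candidate is scored by
# J = D + lambda * R, with R covering the motion vector difference
# bits and the predictor index.  Bit-cost model is simplified.
import numpy as np

def mvd_bits(mvd):
    """Crude exp-Golomb-like bit estimate for a motion vector difference."""
    return sum(2 * int(np.floor(np.log2(2 * abs(int(c)) + 1))) + 1
               for c in mvd)

def best_predictor(mv, candidates, distortions, lam=4.0, index_bits=1):
    costs = [d + lam * (mvd_bits(mv - p) + index_bits)
             for p, d in zip(candidates, distortions)]
    return int(np.argmin(costs))

mv = np.array([5, -2])
candidates = [np.array([4, -2]),   # e.g., spatial median predictor
              np.array([2, 3])]    # e.g., temporally scaled predictor
print(best_predictor(mv, candidates, distortions=[100.0, 100.0]))  # 0
```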

224 citations


Journal ArticleDOI
TL;DR: The experimental results show that the visual quality of the stego-images, the data embedding capacity, and the robustness of the proposed lossless data hiding scheme against compression are acceptable for many applications, including semi-fragile image authentication.

Abstract: Recently, among various data hiding techniques, a new subset, lossless data hiding, has received increasing interest. Most of the existing lossless data hiding algorithms are, however, fragile in the sense that the hidden data cannot be extracted correctly after compression or other incidental alteration has been applied to the stego-image. The only existing semi-fragile (referred to as robust in this paper) lossless data hiding technique, which is robust against high-quality JPEG compression, is based on modulo-256 addition to achieve losslessness. In this paper, we first point out that this technique suffers from annoying salt-and-pepper noise caused by using modulo-256 addition to prevent overflow/underflow. We then propose a novel robust lossless data hiding technique, which does not generate salt-and-pepper noise. By identifying a robust statistical quantity based on the patchwork theory and employing it to embed data, differentiating the bit-embedding process based on the distribution characteristics of each pixel group, and using error correction codes and a permutation scheme, this technique achieves both losslessness and robustness. It has been successfully applied to many images, thus demonstrating its generality. The experimental results show that the visual quality of the stego-images, the data embedding capacity, and the robustness of the proposed lossless data hiding scheme against compression are acceptable for many applications, including semi-fragile image authentication. Specifically, it has been successfully applied to authenticate losslessly compressed JPEG2000 images, followed by possible transcoding. It is expected that this new robust lossless data hiding algorithm can be readily applied in the medical field, law enforcement, remote sensing and other areas where the recovery of original images is desired.
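The patchwork-style quantity at the heart of the scheme can be sketched as the difference between the means of two keyed pseudorandom pixel groups; shifting that difference embeds a bit. The shift-based embedding below is an illustrative stand-in for the paper's full lossless, error-corrected embedding rule.

```python
# Sketch of a patchwork-style statistic: the mean difference of two
# pseudorandom pixel groups is near zero for natural blocks and is
# shifted to embed a bit.  Illustrative only; the paper's rule also
# guarantees lossless recovery of the original image.
import numpy as np

def patchwork_stat(block, key):
    rng = np.random.default_rng(key)
    mask = rng.random(block.shape) < 0.5
    return block[mask].astype(float).mean() - block[~mask].astype(float).mean()

def embed_bit(block, bit, key, beta=6.0):
    rng = np.random.default_rng(key)
    mask = rng.random(block.shape) < 0.5
    out = block.astype(float)
    shift = beta if bit else -beta
    out[mask] += shift / 2   # push the two group means apart
    out[~mask] -= shift / 2
    return out

block = np.random.default_rng(3).integers(0, 256, (16, 16))
marked = embed_bit(block, bit=1, key=99)
print(patchwork_stat(block, 99), patchwork_stat(marked, 99))  # ~0 vs ~beta
```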

214 citations


Journal ArticleDOI
TL;DR: The achieved system performance is at least one order of magnitude better than a PC-based solution, a result achieved by investigating the impact of several hardware-orientated optimizations on performance, area and accuracy.
Abstract: This paper proposes a parallel hardware architecture for image feature detection based on the scale invariant feature transform algorithm and applied to the simultaneous localization and mapping problem. The work also proposes specific hardware optimizations considered fundamental to embed such a robotic control system on-a-chip. The proposed architecture is completely stand-alone; it reads the input data directly from a CMOS image sensor and provides the results via a field-programmable gate array coupled to an embedded processor. The results may either be used directly in an on-chip application or accessed through an Ethernet connection. The system is able to detect features at up to 30 frames per second (320×240 pixels) and has accuracy similar to a PC-based implementation. The achieved system performance is at least one order of magnitude better than a PC-based solution, a result achieved by investigating the impact of several hardware-orientated optimizations on performance, area and accuracy.

198 citations


Journal ArticleDOI
TL;DR: A filtering framework for multitarget tracking is proposed, based on the probability hypothesis density filter and data association using graph matching; together with a novel particle resampling strategy, it improves the accuracy of the tracker, especially in cluttered scenes.
Abstract: We propose a filtering framework for multitarget tracking that is based on the probability hypothesis density (PHD) filter and data association using graph matching. This framework can be combined with any object detectors that generate positional and dimensional information of objects of interest. The PHD filter compensates for missing detections and removes noise and clutter. Moreover, this filter reduces the growth in complexity with the number of targets from exponential to linear by propagating the first-order moment of the multitarget posterior, instead of the full posterior. In order to account for the nature of the PHD propagation, we propose a novel particle resampling strategy and we adapt dynamic and observation models to cope with varying object scales. The proposed resampling strategy allows us to use the PHD filter when a priori knowledge of the scene is not available. Moreover, the dynamic and observation models are not limited to the PHD filter and can be applied to any Bayesian tracker that can handle state-dependent variances. Extensive experimental results on a large video surveillance dataset using a standard evaluation protocol show that the proposed filtering framework improves the accuracy of the tracker, especially in cluttered scenes.

197 citations


Journal ArticleDOI
TL;DR: This paper addresses the complexity of H.264/AVC by using region-of-interest (ROI) based bit allocation and computational power allocation schemes, which also improve the overall subjective visual quality.

Abstract: Due to the complexity of H.264/AVC, it is very challenging to apply this standard to the design of a conversational video communication system. This problem is addressed in this paper by using region-of-interest (ROI) based bit allocation and computational power allocation schemes. In our system, the ROI is first detected by using the direct frame difference and skin-tone information. Several coding parameters, including the quantization parameter, candidates for mode decision, the number of reference frames, the accuracy of motion vectors and the search range of motion estimation, are adaptively adjusted at the macroblock (MB) level according to the relative importance of each MB. Subsequently, the encoder can allocate more resources such as bits and computational power to the ROI, and the decoding complexity is also optimized at the encoder side by utilizing an ROI based rate-distortion-complexity (R-D-C) cost function. The encoder is thus simplified and decoding-friendly, and the overall subjective visual quality is also improved.
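A minimal sketch of the macroblock-level allocation idea: ROI macroblocks receive a lower quantization parameter and more motion-estimation effort than non-ROI ones. The specific offsets, ranges, and parameter names below are illustrative assumptions, not the paper's tuned values.

```python
# Sketch of ROI-driven parameter allocation at the macroblock level:
# more bits (lower QP) and more search effort go to ROI macroblocks.
def mb_coding_params(is_roi, base_qp=28):
    if is_roi:
        return {"qp": base_qp - 4,        # finer quantization
                "search_range": 32,        # wider motion search
                "ref_frames": 4,           # more reference frames
                "subpel": "quarter"}       # 1/4-pel motion accuracy
    return {"qp": base_qp + 4,
            "search_range": 8,
            "ref_frames": 1,
            "subpel": "half"}

print(mb_coding_params(is_roi=True))
print(mb_coding_params(is_roi=False))
```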

Journal ArticleDOI
TL;DR: This paper proposes a model that accurately estimates the expected distortion by explicitly accounting for the loss pattern, inter-frame error propagation, and the correlation between error frames, and works well for video-telephony-type of sequences with low to medium motion.
Abstract: Video communication is often afflicted by various forms of losses, such as packet loss over the Internet. This paper examines the question of whether the packet loss pattern, and in particular the burst length, is important for accurately estimating the expected mean-squared error distortion resulting from packet loss of compressed video. We focus on the challenging case of low-bit-rate video where each P-frame typically fits within a single packet. Specifically, we: 1) verify that the loss pattern does have a significant effect on the resulting distortion; 2) explain why a loss pattern, for example a burst loss, generally produces a larger distortion than an equal number of isolated losses; and 3) propose a model that accurately estimates the expected distortion by explicitly accounting for the loss pattern, inter-frame error propagation, and the correlation between error frames. The accuracy of the proposed model is validated with H.264/AVC coded video and previous-frame concealment, where for most sequences the total distortion is predicted to within ±0.3 dB for a burst loss of length two packets, as compared to prior models which underestimate the distortion by about 1.5 dB. Furthermore, as the burst length increases, our prediction is within ±0.7 dB, while prior models degrade and underestimate the distortion by over 3 dB. The proposed model works well for video-telephony-type sequences with low to medium motion. We also present a simple illustrative example of how knowledge of the effect of burst loss can be used to adapt the schedule of video streaming to provide improved performance for a burst loss channel, without requiring an increase in bit rate.
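A toy numerical illustration of the burst effect, assuming geometric decay of the propagated error and a fixed positive correlation between the error frames of adjacent losses; both parameters are illustrative stand-ins, not the paper's fitted model.

```python
# Toy illustration of why a burst hurts more than isolated losses: the
# error frames of adjacent losses are positively correlated, adding a
# cross term on top of the individually propagated distortions.
def propagated(d0, r, n_frames):
    """Distortion of one loss, decaying geometrically over n frames."""
    return sum(d0 * r**t for t in range(n_frames))

d0, r, rho, n = 100.0, 0.85, 0.5, 10   # illustrative parameters
isolated_pair = 2 * propagated(d0, r, n)
# Burst of two consecutive losses: same propagated terms plus a
# correlation cross term 2*rho*sqrt(d1*d2) per frame.
cross = 2 * rho * (d0 * d0) ** 0.5 * sum(r**t for t in range(n))
burst_pair = 2 * propagated(d0, r, n) + cross
print(f"two isolated losses: {isolated_pair:.0f}")
print(f"burst of length two: {burst_pair:.0f}")  # strictly larger
```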

Journal ArticleDOI
Sun-Il Lee1, Chang D. Yoo1
TL;DR: A novel video fingerprinting method based on the centroid of gradient orientations is proposed, and the experimental results show that the proposed fingerprint outperforms the considered features in the context of video fingerprinting.
Abstract: Video fingerprints are feature vectors that uniquely characterize one video clip from another. The goal of video fingerprinting is to identify a given video query in a database (DB) by measuring the distance between the query fingerprint and the fingerprints in the DB. The performance of a video fingerprinting system, which is usually measured in terms of pairwise independence and robustness, is directly related to the fingerprint that the system uses. In this paper, a novel video fingerprinting method based on the centroid of gradient orientations is proposed. The centroid of gradient orientations is chosen due to its pairwise independence and robustness against common video processing steps that include lossy compression, resizing, frame rate change, etc. A threshold used to reliably determine a fingerprint match is theoretically derived by modeling the proposed fingerprint as a stationary ergodic process, and the validity of the model is experimentally verified. The performance of the proposed fingerprint is experimentally evaluated and compared with that of other widely-used features. The experimental results show that the proposed fingerprint outperforms the considered features in the context of video fingerprinting.
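A sketch of the fingerprint extraction, under the assumption of a simple block grid and magnitude-weighted circular averaging of gradient orientations per block; the grid size and weighting are illustrative choices.

```python
# Sketch of a centroid-of-gradient-orientations fingerprint: each frame
# is split into blocks, and the magnitude-weighted mean gradient
# orientation of every block becomes one fingerprint component.
import numpy as np

def frame_fingerprint(frame, grid=4):
    gy, gx = np.gradient(frame.astype(float))
    mag = np.hypot(gx, gy) + 1e-8
    ori = np.arctan2(gy, gx)
    h, w = frame.shape
    bh, bw = h // grid, w // grid
    fp = np.empty(grid * grid)
    for i in range(grid):
        for j in range(grid):
            sl = (slice(i * bh, (i + 1) * bh), slice(j * bw, (j + 1) * bw))
            # Circular (vector) average avoids angle wrap-around issues.
            s = np.average(np.sin(ori[sl]), weights=mag[sl])
            c = np.average(np.cos(ori[sl]), weights=mag[sl])
            fp[i * grid + j] = np.arctan2(s, c)
    return fp

frame = np.random.default_rng(4).random((64, 64))
print(frame_fingerprint(frame).shape)  # (16,)
```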

Journal ArticleDOI
TL;DR: A new block-based DCT framework in which the first transform may choose to follow a direction other than the vertical or horizontal one, which is able to provide a better coding performance for image blocks that contain directional edges.
Abstract: Nearly all block-based transform schemes for image and video coding developed so far choose the 2-D discrete cosine transform (DCT) of a square block shape. With almost no exception, this conventional DCT is implemented separately through two 1-D transforms, one along the vertical direction and another along the horizontal direction. In this paper, we develop a new block-based DCT framework in which the first transform may choose to follow a direction other than the vertical or horizontal one. The coefficients produced by all directional transforms in the first step are arranged appropriately so that the second transform can be applied to the coefficients that are best aligned with each other. Compared with the conventional DCT, the resulting directional DCT framework is able to provide a better coding performance for image blocks that contain directional edges, a popular scenario in many image signals. By choosing the best from all directional DCTs (including the conventional DCT as a special case) for each image block, we will demonstrate that the rate-distortion coding performance can be improved remarkably. Finally, a brief theoretical analysis is presented to justify why certain coding gain (over the conventional DCT) results from this directional framework.
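The per-block mode selection can be sketched independently of the directional transforms themselves: each candidate transform is applied, its coefficients are quantized, and the mode with the lowest distortion-plus-rate-proxy cost wins. Only the conventional separable DCT is implemented below; the directional variants are assumed to be supplied as callables of the same form.

```python
# Sketch of per-block transform-mode selection by a Lagrangian cost;
# the nonzero-coefficient count stands in for the true bit rate.
import numpy as np
from scipy.fftpack import dct, idct

def dct2(b):
    return dct(dct(b, axis=0, norm="ortho"), axis=1, norm="ortho")

def idct2(c):
    return idct(idct(c, axis=0, norm="ortho"), axis=1, norm="ortho")

def mode_cost(block, fwd, inv, q=16.0, lam=8.0):
    coeffs = np.round(fwd(block) / q)          # quantize
    recon = inv(coeffs * q)                    # reconstruct
    distortion = float(np.sum((block - recon) ** 2))
    rate_proxy = float(np.count_nonzero(coeffs))
    return distortion + lam * rate_proxy

block = np.random.default_rng(5).random((8, 8)) * 255
# Directional modes would be added here as (forward, inverse) pairs.
modes = {"conventional": (dct2, idct2)}
best = min(modes, key=lambda m: mode_cost(block, *modes[m]))
print(best)
```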

Journal ArticleDOI
TL;DR: The proposed action graph not only performs effective and robust recognition of actions, but it can also be expanded efficiently with new actions, and an algorithm is proposed for adding a new action to a trained action graph without compromising the existing action graph.
Abstract: This paper presents a graphical model for learning and recognizing human actions. Specifically, we propose to encode actions in a weighted directed graph, referred to as an action graph, where nodes of the graph represent salient postures that are used to characterize the actions and are shared by all actions. The weight between two nodes measures the transitional probability between the two postures represented by the two nodes. An action is encoded as one or multiple paths in the action graph. The salient postures are modeled using Gaussian mixture models (GMMs). Both the salient postures and the action graph are automatically learned from training samples through unsupervised clustering and the expectation-maximization (EM) algorithm. The proposed action graph not only performs effective and robust recognition of actions, but it can also be expanded efficiently with new actions. An algorithm is also proposed for adding a new action to a trained action graph without compromising the existing action graph. Extensive experiments on widely used and challenging data sets have verified the performance of the proposed method, its tolerance to noise and viewpoint changes, its robustness across different subjects and data sets, as well as the effectiveness of the algorithm for learning new actions.
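A minimal sketch of the recognition step, assuming postures have already been decoded to cluster indices: each action contributes transition weights over the shared postures, and the action whose transitions best explain the observed posture sequence wins. The transition matrices below are illustrative, not learned.

```python
# Sketch of scoring a posture sequence against per-action transition
# weights in an action graph.  In the paper, postures come from GMMs
# learned by unsupervised clustering and EM.
import numpy as np

def action_log_likelihood(posture_seq, trans):
    """trans: (K, K) row-stochastic transition matrix for one action."""
    lp = 0.0
    for a, b in zip(posture_seq[:-1], posture_seq[1:]):
        lp += np.log(trans[a, b] + 1e-12)
    return lp

walk = np.array([[.1, .8, .1], [.1, .1, .8], [.8, .1, .1]])  # cyclic
wave = np.array([[.1, .8, .1], [.8, .1, .1], [.1, .1, .8]])  # back-and-forth
seq = [0, 1, 2, 0, 1, 2]
scores = {"walk": action_log_likelihood(seq, walk),
          "wave": action_log_likelihood(seq, wave)}
print(max(scores, key=scores.get))  # "walk"
```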

Journal ArticleDOI
TL;DR: The usage of histogram matching prior to multiview encoding leads to substantial gains in the coding efficiency, and this prefiltering step can be combined with block-based illumination compensation techniques that modify the coder and decoder themselves, especially with the approach implemented in the Multiview reference software of the joint video team (JVT).
Abstract: Significant advances have recently been made in the coding of video data recorded with multiple cameras. However, luminance and chrominance variations between the camera views may deteriorate the performance of multiview codecs and image-based rendering algorithms. A histogram matching algorithm can be applied to efficiently compensate for these differences in a prefiltering step. A mapping function is derived which adapts the cumulative histogram of a distorted sequence to the cumulative histogram of a reference sequence. If all camera views of a multiview sequence are adapted to a common reference using histogram matching, the spatial prediction across camera views is improved. The basic algorithm is extended in three ways: a time-constant calculation of the mapping function, RGB color conversion, and the use of global disparity compensation. The best coding results are achieved when time-constant histogram calculation and RGB color conversion are combined. In this case, the usage of histogram matching prior to multiview encoding leads to substantial gains in the coding efficiency of up to 0.7 dB for the luminance component and up to 1.9 dB for the chrominance components. This prefiltering step can be combined with block-based illumination compensation techniques that modify the coder and decoder themselves, especially with the approach implemented in the multiview reference software of the joint video team (JVT). Additional coding gains up to 0.4 dB can be observed when both methods are combined.
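The core mapping can be sketched directly from its definition: each pixel value of the distorted view is sent through its own cumulative histogram and then through the inverse of the reference view's cumulative histogram. This single-channel version omits the paper's RGB-conversion and time-constant extensions.

```python
# Sketch of the histogram-matching prefilter: build a mapping that makes
# the cumulative histogram of the distorted view follow that of the
# reference view, then apply it per pixel.
import numpy as np

def histogram_match(distorted, reference):
    d_vals, d_counts = np.unique(distorted.ravel(), return_counts=True)
    r_vals, r_counts = np.unique(reference.ravel(), return_counts=True)
    d_cdf = np.cumsum(d_counts) / distorted.size
    r_cdf = np.cumsum(r_counts) / reference.size
    # Map each pixel: value -> its CDF -> reference value at same CDF.
    pix_cdf = np.interp(distorted.ravel(), d_vals, d_cdf)
    matched = np.interp(pix_cdf, r_cdf, r_vals)
    return matched.reshape(distorted.shape)

rng = np.random.default_rng(6)
ref = rng.normal(128, 30, (64, 64)).clip(0, 255).astype(np.uint8)
dist = (ref * 0.8 + 20).astype(np.uint8)  # darker, lower-contrast view
print(histogram_match(dist, ref).mean(), ref.mean())  # means realign
```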

Journal ArticleDOI
TL;DR: This paper presents an image watermarking scheme by the use of two statistical features (the histogram shape and the mean) in the Gaussian filtered low-frequency component of images that is mathematically invariant to scaling the size of images and robust to interpolation errors during geometric transformations, and common image processing operations.
Abstract: Watermark resistance to geometric attacks is an important issue in the image watermarking community. Most countermeasures proposed in the literature usually focus on the problem of global affine transforms such as rotation, scaling and translation (RST), but few are resistant to challenging cropping and random bending attacks (RBAs). The main reason is that in the existing watermarking algorithms, those exploited robust features are more or less related to the pixel position. In this paper, we present an image watermarking scheme by the use of two statistical features (the histogram shape and the mean) in the Gaussian filtered low-frequency component of images. The two features are: 1) mathematically invariant to scaling the size of images; 2) independent of the pixel position in the image plane; 3) statistically resistant to cropping; and 4) robust to interpolation errors during geometric transformations, and common image processing operations. As a result, the watermarking system provides a satisfactory performance for those content-preserving geometric deformations and image processing operations, including JPEG compression, lowpass filtering, cropping and RBAs.

Journal ArticleDOI
TL;DR: A new, simpler pedestrian detector using the covariance features is proposed, and a faster strategy, multiple-layer boosting with heterogeneous features, is adopted to exploit the efficiency of the Haar feature and the discriminative power of the covariance feature.

Abstract: Efficiently and accurately detecting pedestrians plays a very important role in many computer vision applications such as video surveillance and smart cars. In order to find the right feature for this task, we first present a comprehensive experimental study on pedestrian detection using state-of-the-art locally extracted features (e.g., local receptive fields, histograms of oriented gradients, and region covariance). Building upon the findings of our experiments, we propose a new, simpler pedestrian detector using the covariance features. Unlike the work in [1], where the feature selection and weak classifier training are performed on the Riemannian manifold, we select features and train weak classifiers in the Euclidean space for faster computation. To this end, AdaBoost with weighted Fisher linear discriminant analysis-based weak classifiers is designed. A cascaded classifier structure is constructed for efficiency in the detection phase. Experiments on different datasets show that the new pedestrian detector is not only comparable to the state-of-the-art pedestrian detectors but also performs at a faster speed. To further accelerate the detection, we adopt a faster strategy, multiple-layer boosting with heterogeneous features, to exploit the efficiency of the Haar feature and the discriminative power of the covariance feature. Experiments show that, by combining the Haar and covariance features, we speed up the original covariance feature detector [1] by up to an order of magnitude in detection time with only a slight drop in detection performance.

Journal ArticleDOI
TL;DR: Experimental results show that the proposed approach outperforms state-of-the-art algorithms both in terms of accuracy and robustness in discovering common patterns in video as well as in recognizing outliers.
Abstract: We present a novel multifeature video object trajectory clustering algorithm that estimates common patterns of behaviors and isolates outliers. The proposed algorithm is based on four main steps, namely the extraction of a set of representative trajectory features, non-parametric clustering, cluster merging and information fusion for the identification of normal and rare object motion patterns. First we transform the trajectories into a set of feature spaces on which mean-shift identifies the modes and the corresponding clusters. Furthermore, a merging procedure is devised to refine these results by combining similar adjacent clusters. The final common patterns are estimated by fusing the clustering results across all feature spaces. Clusters corresponding to reoccurring trajectories are considered as normal, whereas sparse trajectories are associated to abnormal and rare events. The performance of the proposed algorithm is evaluated on standard data-sets and compared with state-of-the-art techniques. Experimental results show that the proposed approach outperforms state-of-the-art algorithms both in terms of accuracy and robustness in discovering common patterns in video as well as in recognizing outliers.
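One feature-space pass of the pipeline can be sketched with scikit-learn's MeanShift: modes of the trajectory-feature distribution become clusters, and cluster size separates common patterns from rare events. The velocity features, bandwidth, and sparsity threshold below are illustrative assumptions.

```python
# Sketch of one feature-space pass: mean-shift finds the modes of a
# trajectory-feature distribution; dense clusters are treated as normal
# patterns and sparse ones as rare events.
import numpy as np
from sklearn.cluster import MeanShift

rng = np.random.default_rng(7)
# Feature per trajectory: (mean velocity x, mean velocity y).
normal_a = rng.normal([+1.0, 0.0], 0.1, (80, 2))
normal_b = rng.normal([-1.0, 0.0], 0.1, (80, 2))
rare = rng.normal([0.0, 1.5], 0.1, (3, 2))
X = np.vstack([normal_a, normal_b, rare])

labels = MeanShift(bandwidth=0.5).fit_predict(X)
for lab in np.unique(labels):
    size = int(np.sum(labels == lab))
    kind = "normal pattern" if size >= 10 else "rare/abnormal"
    print(f"cluster {lab}: {size} trajectories -> {kind}")
```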

Journal ArticleDOI
TL;DR: An automatic sketch synthesis algorithm is proposed based on an embedded hidden Markov model (E-HMM) and a selective ensemble strategy, achieving satisfactory sketch synthesis results with a small set of face training samples.

Abstract: Sketch synthesis plays an important role in face sketch-photo recognition systems. In this paper, an automatic sketch synthesis algorithm is proposed based on an embedded hidden Markov model (E-HMM) and a selective ensemble strategy. First, the E-HMM is adopted to model the nonlinear relationship between a sketch and its corresponding photo. Then, based on several learned models, a series of pseudo-sketches is generated for a given photo. Finally, these pseudo-sketches are fused together with a selective ensemble strategy to synthesize a finer face pseudo-sketch. Experimental results illustrate that the proposed algorithm achieves satisfactory sketch synthesis results with a small set of face training samples.

Journal ArticleDOI
TL;DR: An automated activity analysis and summarization system for eldercare video monitoring is developed, including an adaptive learning method that estimates the physical location and moving speed of a person from a single camera view without calibration.

Abstract: In this work, we study how continuous video monitoring and intelligent video processing can be used in eldercare to assist the independent living of elders and to improve the efficiency of eldercare practice. More specifically, we develop an automated activity analysis and summarization system for eldercare video monitoring. At the object level, we construct an advanced silhouette extraction, human detection and tracking algorithm for indoor environments. At the feature level, we develop an adaptive learning method to estimate the physical location and moving speed of a person from a single camera view without calibration. At the action level, we explore hierarchical decision tree and dimension reduction methods for human action recognition. We extract important ADL (activities of daily living) statistics for automated functional assessment. To test and evaluate the proposed algorithms and methods, we deployed the camera system in a real living environment for about a month and collected more than 200 hours (in excess of 600 GB) of activity monitoring video. Our extensive tests over these massive video datasets demonstrate that the proposed automated activity analysis system is very efficient.

Journal ArticleDOI
TL;DR: A joint adaptation, resource allocation and scheduling (JARS) algorithm, which allocates the communication resource based on the video users' quality of service, adapts video sources based on smart summarization, and schedules the transmissions to meet the frame delivery deadlines.
Abstract: Multi-user video streaming over wireless channels is a challenging problem, where the demand for better video quality and small transmission delays needs to be reconciled with the limited and often time-varying communication resources. This paper presents a framework for joint network optimization, source adaptation, and deadline-driven scheduling for multi-user video streaming over wireless networks. We develop a joint adaptation, resource allocation and scheduling (JARS) algorithm, which allocates the communication resource based on the video users' quality of service, adapts video sources based on smart summarization, and schedules the transmissions to meet the frame delivery deadlines. The proposed algorithm leads to near full utilization of the network resources and satisfies the delivery deadlines for all video frames. Substantial performance improvements are achieved compared with heuristic schemes that do not take the interactions between multiple users into consideration.

Journal ArticleDOI
TL;DR: The experimental results demonstrate that the proposed algorithm achieves better R-D performance than two other RC algorithms, including JVT-G012, the currently recommended RC scheme implemented in the H.264 reference software JM9.5.

Abstract: A rate-distortion (R-D) optimized rate control (RC) algorithm with adaptive initialization is presented for H.264. First, a linear distortion-quantization (D-Q) model is introduced, from which a closed-form solution is developed to derive the optimal quantization parameter (Qp) for encoding each macroblock. We then determine the initial Qp efficiently and adaptively according to the content of the video sequence. The experimental results demonstrate that the proposed algorithm achieves better R-D performance than two other RC algorithms, including JVT-G012, the currently recommended RC scheme implemented in the H.264 reference software JM9.5.
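The linear D-Q idea can be sketched in a few lines: fit D(Qp) = a·Qp + b from previously coded data and invert it in closed form for a target distortion. The history values and target below are illustrative; the paper derives per-macroblock Qp through full R-D optimization.

```python
# Sketch of the linear distortion-quantization model: fit D(Qp) = a*Qp + b
# from past macroblocks, then solve for the Qp meeting a target distortion.
import numpy as np

history_qp = np.array([22, 26, 30, 34], dtype=float)
history_d = np.array([18.0, 30.0, 41.0, 55.0])  # measured MSE (illustrative)

a, b = np.polyfit(history_qp, history_d, 1)     # linear D-Q fit
target_d = 35.0
qp = (target_d - b) / a                          # closed-form solution
print(f"D(Qp) ~ {a:.2f}*Qp + {b:.2f}; choose Qp ~ {qp:.1f}")
```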

Journal ArticleDOI
TL;DR: A decoupled probabilistic algorithm, named Bayesian tensor analysis (BTA), is proposed that can automatically and suitably determine the dimensionality of the different modalities of tensor data; empirical studies on expression retargeting justify the advantages of BTA.

Abstract: Effectively modeling a collection of three-dimensional (3-D) faces is an important task in various applications, especially facial expression-driven ones, e.g., expression generation, retargeting, and synthesis. These 3-D faces naturally form a set of second-order tensors, one modality for identity and the other for expression. The number of these second-order tensors is three times that of the vertices for 3-D face modeling. As for algorithms, Bayesian data modeling, which is a natural data analysis tool, has been widely applied with great success; however, it works only for vector data. Therefore, there is a gap between tensor-based representation and vector-based data analysis tools. Aiming at bridging this gap and generalizing conventional statistical tools over tensors, this paper proposes a decoupled probabilistic algorithm, named Bayesian tensor analysis (BTA). Theoretically, BTA can automatically and suitably determine the dimensionality of the different modalities of tensor data. With BTA, a collection of 3-D faces can be well modeled. Empirical studies on expression retargeting also justify the advantages of BTA.

Journal ArticleDOI
TL;DR: This work develops an adaptive scheme to estimate P-R-D model parameters and perform online resource allocation and energy optimization for real-time video encoding and shows that, for typical videos with nonstationary scene statistics, the energy consumption can be significantly reduced, especially in delay-tolerant portable video communication applications.
Abstract: Portable video communication devices operate on batteries with limited energy supply. However, video compression is computationally intensive and energy-demanding. Therefore, one of the central challenging issues in portable video communication system design is to minimize the energy consumption of video encoding so as to prolong the operational lifetime of portable video devices. In this work, based on power-rate-distortion (P-R-D) optimization, we develop a new approach for energy minimization by exploring the energy tradeoff between video encoding and wireless communication and exploiting the nonstationary characteristics of input video data. Both analytically and experimentally, we demonstrate that incorporating the third dimension of power consumption into conventional R-D analysis gives us one extra dimension of flexibility in resource allocation and allows us to achieve significant energy saving. Within the P-R-D analysis framework, power is tightly coupled with rate, enabling us to trade bits for joules and perform energy minimization through optimum bit allocation. We analyze the energy saving gain of P-R-D optimization. We develop an adaptive scheme to estimate P-R-D model parameters and perform online resource allocation and energy optimization for real-time video encoding. Our experimental studies show that, for typical videos with nonstationary scene statistics, using the proposed P-R-D optimization technology, the energy consumption of video encoding can be significantly reduced (by up to 50%), especially in delay-tolerant portable video communication applications.

Journal ArticleDOI
TL;DR: Three new algorithms (running average, median, mixture of Gaussians) are presented for modeling background directly from compressed video, together with a two-stage segmentation approach based on the proposed background models; they achieve accuracy comparable to their counterparts in the spatial domain.

Abstract: Modeling background and segmenting moving objects are significant techniques for video surveillance and other video processing applications. Most existing methods of modeling background and segmenting moving objects mainly operate in the spatial domain at the pixel level. In this paper, we present three new algorithms (running average, median, mixture of Gaussians) that model background directly from compressed video, and a two-stage segmentation approach based on the proposed background models. The proposed methods utilize discrete cosine transform (DCT) coefficients (including AC coefficients) at the block level to represent background, and adapt the background by updating the DCT coefficients. The proposed segmentation approach can extract foreground objects with pixel accuracy through a two-stage process. First, a new background subtraction technique in the DCT domain is exploited to identify the block regions fully or partially occupied by foreground objects, and then pixels from these foreground blocks are further classified in the spatial domain. The experimental results show that the proposed background modeling algorithms achieve accuracy comparable to their counterparts in the spatial domain, while the associated segmentation scheme visually generates good segmentation results with efficient computation. For instance, the computational costs of the proposed median and MoG algorithms are only 40.4% and 20.6% of those of their counterparts in the spatial domain for background construction.
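A sketch of the running-average variant, operating directly on block-level DCT coefficients, together with the first (block-level) stage of the segmentation; the shapes, learning rate, and decision threshold are illustrative assumptions.

```python
# Sketch of a running-average background model kept on block-level DCT
# coefficients, plus the block-stage foreground test.
import numpy as np

def update_background(bg_dct, frame_dct, alpha=0.05):
    """bg_dct, frame_dct: (n_blocks, 8, 8) DCT coefficients per block."""
    return (1.0 - alpha) * bg_dct + alpha * frame_dct

def foreground_blocks(bg_dct, frame_dct, thresh=500.0):
    """Stage 1: flag blocks whose coefficients deviate from the model;
    stage 2 (pixel-level refinement) would run only on flagged blocks."""
    dist = np.sum((frame_dct - bg_dct) ** 2, axis=(1, 2))
    return dist > thresh

rng = np.random.default_rng(8)
bg = rng.normal(0, 1, (16, 8, 8))
frame = bg.copy()
frame[3] += 10.0  # a moving object occupies block 3
print(np.flatnonzero(foreground_blocks(bg, frame)))  # [3]
```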

Journal ArticleDOI
TL;DR: This paper proposes a multiview approach to achieve automatic detection of a falling person in video sequences where motion is modeled using a layered hidden Markov model (LHMM), and proves that two cameras are sufficient in practice.
Abstract: Automatic detection of a falling person in video sequences has interesting applications in video surveillance and is an important part of future pervasive home monitoring systems. In this paper, we propose a multiview approach to achieve this goal, where motion is modeled using a layered hidden Markov model (LHMM). The posture classification is performed by a fusion unit, merging the decisions provided by the independently processing cameras in a fuzzy logic context. In each view, the fall detection is optimized in a given plane by performing a metric image rectification, making it possible to extract simple and robust features, which is convenient for real-time purposes. A theoretical analysis of the chosen descriptor enables us to define the optimal camera placement for detecting people falling in unspecified situations, and we prove that two cameras are sufficient in practice. Regarding event detection, the LHMM offers a principled way of solving the inference problem. Moreover, the hierarchical architecture decouples the motion analysis into different temporal granularity levels, making the algorithm able to detect very sudden changes and robust to low-level step errors.

Journal ArticleDOI
TL;DR: It is indicated that there exists strong correlation between the optimal mean squared error threshold and the image quality factor Q, which is selected in the encoding end and can be computed from the quantization table embedded in the JPEG file.
Abstract: We propose a simple yet effective deblocking method for JPEG compressed images through postfiltering in shifted windows (PSW) of image blocks. The MSE is compared between the original image block and the image blocks in shifted windows, so as to decide whether these altered blocks are used in the smoothing procedure. Our research indicates that there exists a strong correlation between the optimal mean squared error threshold and the image quality factor Q, which is selected at the encoding end and can be computed from the quantization table embedded in the JPEG file. We also use the standard deviation of each original block to adjust the threshold locally, so as to avoid over-smoothing image details. Under various image and bit-rate conditions, the processed image exhibits both great visual improvement and significant peak signal-to-noise ratio gain at fairly low computational complexity. Extensive experiments and comparisons with other deblocking methods are conducted to justify the effectiveness of the proposed PSW method in both objective and subjective measures.
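The PSW idea can be sketched per block: candidate blocks from shifted positions are averaged in only when their MSE against the current block falls under a threshold. The fixed threshold below stands in for the paper's Q-dependent, locally adjusted rule.

```python
# Sketch of postfiltering in shifted windows: similar blocks from
# shifted positions are averaged with the current block to smooth
# blocking artifacts.
import numpy as np

def psw_filter_block(img, i, j, bs=8, shifts=(-2, 0, 2), thresh=40.0):
    ref = img[i:i+bs, j:j+bs].astype(float)
    acc, count = ref.copy(), 1
    for dy in shifts:
        for dx in shifts:
            if dy == 0 and dx == 0:
                continue
            y, x = i + dy, j + dx
            if y < 0 or x < 0 or y + bs > img.shape[0] or x + bs > img.shape[1]:
                continue
            cand = img[y:y+bs, x:x+bs].astype(float)
            if np.mean((cand - ref) ** 2) < thresh:  # similar enough
                acc += cand
                count += 1
    return acc / count

img = np.random.default_rng(9).integers(0, 256, (32, 32)).astype(float)
print(psw_filter_block(img, 8, 8).shape)  # (8, 8)
```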

Journal ArticleDOI
Sang-Yong Lee1, Yogendera Kumar1, Ji-Man Cho1, Sangwon Lee2, Soo-Won Kim1 
TL;DR: Experimental results obtained from various samples show that the present method is insensitive to Gaussian and impulse noise and able to improve image quality by focusing on the appropriate target object.

Abstract: A new passive autofocus algorithm consisting of a robust focus measure for object detection and fuzzy reasoning for target selection is presented. The proposed algorithm first detects objects distributed in the image using a mid-frequency discrete cosine transform focus measure and then selects the target object through fuzzy reasoning with three fuzzy membership functions. The proposed algorithm is designed as fully digital blocks and fabricated using 0.35-µm CMOS technology. Experimental results obtained from various samples show that the present method is insensitive to Gaussian and impulse noise and able to improve image quality by focusing on the appropriate target object.
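A sketch of a mid-frequency DCT focus measure of the kind described, with an illustrative band mask: in-focus content concentrates more energy in the middle DCT band than defocused content.

```python
# Sketch of a mid-frequency DCT focus measure: sharper (in-focus)
# blocks put more energy into the middle DCT band.
import numpy as np
from scipy.fftpack import dct

def focus_measure(block):
    c = dct(dct(block.astype(float), axis=0, norm="ortho"),
            axis=1, norm="ortho")
    u, v = np.meshgrid(np.arange(8), np.arange(8), indexing="ij")
    mid_band = (u + v >= 3) & (u + v <= 8)   # illustrative band mask
    return np.sum(np.abs(c[mid_band]))

rng = np.random.default_rng(10)
sharp = rng.random((8, 8)) * 255
blurred = np.full((8, 8), sharp.mean())      # completely defocused
print(focus_measure(sharp) > focus_measure(blurred))  # True
```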

Journal ArticleDOI
TL;DR: This paper addresses the creation of two balanced descriptions based on the concept of redundant slices, while keeping full compatibility with the H.264 standard syntax and decoding behavior in case of single description reception.
Abstract: In this paper, a novel H.264 multiple description technique is proposed. The coding approach is based on the redundant slice representation option defined in the H.264 standard. In the presence of losses, the redundant representation can be used to replace missing portions of the compressed bit stream, thus yielding a certain degree of error resilience. This paper addresses the creation of two balanced descriptions based on the concept of redundant slices, while keeping full compatibility with the H.264 standard syntax and decoding behavior in the case of single description reception. When two descriptions are available, a standard H.264 decoder can still be used, given a simple preprocessing of the received compressed bit streams. An analytical setup is employed in order to optimally select the amount of redundancy to be inserted in each frame, taking into account both the transmission conditions and the video decoder error propagation. Experimental results demonstrate that the proposed technique compares favorably with other H.264 multiple description approaches.