
Showing papers in "IEEE Transactions on Circuits and Systems for Video Technology in 2011"


Journal ArticleDOI
TL;DR: A two-phase test sample representation method for face recognition using the representation ability of each training sample to determine M “nearest neighbors” for the test sample and uses the representation result to perform classification.
Abstract: In this paper, we propose a two-phase test sample representation method for face recognition. The first phase seeks to represent the test sample as a linear combination of all the training samples and exploits the representation ability of each training sample to determine M “nearest neighbors” for the test sample. The second phase represents the test sample as a linear combination of the determined M nearest neighbors and uses the representation result to perform classification. We propose this method under the following assumption: the test sample and some of its neighbors are probably from the same class. Thus, we use the first phase to detect the training samples that are far from the test sample and assume that these samples have no effect on the ultimate classification decision. This helps to classify the test sample accurately. We also give a probabilistic interpretation of the proposed method. A number of face recognition experiments show that our method performs very well.
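The two-phase procedure described above can be sketched in a few lines. This is a minimal reconstruction from the abstract, not the authors' implementation; in particular, the per-sample distance used to rank training samples and the residual-based class decision are assumptions:

```python
import numpy as np

def two_phase_classify(X, y, test, M=5):
    """Hypothetical sketch of two-phase test sample representation.

    X: (d, n) matrix whose columns are training samples.
    y: length-n array of class labels.
    test: length-d test sample.
    M: number of "nearest neighbors" kept after phase one.
    """
    # Phase 1: represent the test sample as a linear combination of ALL
    # training samples (least-squares solution of X a ~= test).
    a, *_ = np.linalg.lstsq(X, test, rcond=None)
    # A sample's deviation ||a_i * x_i - test|| measures how far it is
    # from the test sample; keep the M best contributors.
    dists = [np.linalg.norm(a[i] * X[:, i] - test) for i in range(X.shape[1])]
    nearest = np.argsort(dists)[:M]
    # Phase 2: represent the test sample using only the M neighbors.
    Xm = X[:, nearest]
    b, *_ = np.linalg.lstsq(Xm, test, rcond=None)
    # Assign the class whose neighbors' combined contribution leaves
    # the smallest residual from the test sample.
    best_label, best_res = None, np.inf
    for c in set(y[nearest]):
        mask = (y[nearest] == c)
        res = np.linalg.norm(test - Xm[:, mask] @ b[mask])
        if res < best_res:
            best_label, best_res = c, res
    return best_label
```

For a toy example with orthogonal class directions, a test sample lying on a class-0 direction is assigned to class 0.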

498 citations


Journal ArticleDOI
TL;DR: A new method is proposed to detect falls by analyzing human shape deformation during a video sequence, which gives very good results (as low as 0% error with a multi-camera setup) compared with other common image processing methods.
Abstract: Faced with the growing population of seniors, developed countries need to establish new healthcare systems to ensure the safety of elderly people at home. Computer vision provides a promising solution for analyzing personal behavior and detecting certain unusual events such as falls. In this paper, a new method is proposed to detect falls by analyzing human shape deformation during a video sequence. A shape matching technique is used to track the person's silhouette along the video sequence. The shape deformation is then quantified from these silhouettes based on shape analysis methods. Finally, falls are distinguished from normal activities using a Gaussian mixture model. The method was evaluated on a realistic data set of daily activities and simulated falls, and gives very good results (as low as 0% error with a multi-camera setup) compared with other common image processing methods.

452 citations


Journal ArticleDOI
TL;DR: Whether and to what extent the addition of NSS is beneficial to objective quality prediction in general terms is evaluated, and some practical issues in the design of an attention-based metric are addressed.
Abstract: Since the human visual system (HVS) is the ultimate assessor of image quality, current research on the design of objective image quality metrics tends to include an important feature of the HVS, namely, visual attention. Different metrics for image quality prediction have been extended with a computational model of visual attention, but the resulting gain in reliability of the metrics has so far been variable. To better understand the basic added value of including visual attention in the design of objective metrics, we used measured data of visual attention. To this end, we performed two eye-tracking experiments: one with a free-looking task and one with a quality assessment task. In the first experiment, 20 observers looked freely at 29 unimpaired original images, yielding so-called natural scene saliency (NSS). In the second experiment, 20 different observers assessed the quality of distorted versions of the original images. The resulting saliency maps showed some differences from the NSS, and therefore we applied both types of saliency to four different objective metrics predicting the quality of JPEG-compressed images. Both types of saliency improved the performance of the metrics, but NSS to a larger extent. As a consequence, we further integrated NSS into several state-of-the-art quality metrics, including three full-reference metrics and two no-reference metrics, and evaluated their prediction performance for a larger set of distortions. By doing so, we evaluated whether and to what extent the addition of NSS is beneficial to objective quality prediction in general terms. In addition, we address some practical issues in the design of an attention-based metric. The eye-tracking data are made available to the research community.

254 citations


Journal ArticleDOI
TL;DR: The analyses show that the proposed (PRO) method has a substantially higher degree of efficacy, outperforming other methods in accuracy by up to 53.43%.
Abstract: Motion detection is the first essential process in the extraction of information regarding moving objects and serves as a foundation for functional areas such as tracking, classification, and recognition. In this paper, we propose a novel and accurate approach to motion detection for automatic video surveillance systems. Our method achieves complete detection of moving objects through three proposed modules: a background modeling (BM) module, an alarm trigger (AT) module, and an object extraction (OE) module. In the BM module, a unique two-phase background matching procedure is performed using rapid matching followed by accurate matching in order to produce optimum background pixels for the background model. Next, the AT module eliminates unnecessary examination of the entire background region, allowing the subsequent OE module to process only blocks containing moving objects. Finally, the OE module forms the binary object detection mask to achieve highly complete detection of moving objects. The detection results of the proposed (PRO) method were analyzed both qualitatively, through visual inspection, and quantitatively, for accuracy, along with comparisons to the results produced by other state-of-the-art methods. The analyses show that our PRO method has a substantially higher degree of efficacy, outperforming other methods in accuracy by up to 53.43%.

197 citations


Journal ArticleDOI
TL;DR: It is found that the temporal correction factor follows closely an inverted falling exponential function, whereas the quantization effect on the coded frames can be captured accurately by a sigmoid function of the peak signal-to-noise ratio.
Abstract: In this paper, we explore the impact of frame rate and quantization on perceptual quality of a video. We propose to use the product of a spatial quality factor that assesses the quality of decoded frames without considering the frame rate effect and a temporal correction factor, which reduces the quality assigned by the first factor according to the actual frame rate. We find that the temporal correction factor follows closely an inverted falling exponential function, whereas the quantization effect on the coded frames can be captured accurately by a sigmoid function of the peak signal-to-noise ratio. The proposed model is analytically simple, with each function requiring only a single content-dependent parameter. The proposed overall metric has been validated using both our subjective test scores as well as those reported by others. For all seven data sets examined, our model yields high Pearson correlation (higher than 0.9) with measured mean opinion score (MOS). We further investigate how to predict parameters of our proposed model using content features derived from the original videos. Using predicted parameters from content features, our model still fits with measured MOS with high correlation.
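The model described above is simple enough to sketch directly: overall quality is the product of a sigmoid spatial factor in PSNR and an inverted falling exponential temporal correction factor in frame rate. The functional forms follow the abstract; the parameter values below (alpha, slope, midpoint) are illustrative placeholders, since each is content-dependent in the paper:

```python
import math

def temporal_correction(t, t_max, alpha):
    """Inverted falling exponential: equals 1 at the full frame rate
    t_max and decays as the frame rate t drops. alpha is the single
    content-dependent parameter of this factor."""
    return (1 - math.exp(-alpha * t / t_max)) / (1 - math.exp(-alpha))

def spatial_quality(psnr, slope, midpoint):
    """Sigmoid of PSNR capturing the quantization effect on coded
    frames; slope and midpoint are content-dependent placeholders."""
    return 1.0 / (1.0 + math.exp(-slope * (psnr - midpoint)))

def predicted_quality(psnr, t, t_max=30.0, alpha=4.0, slope=0.3, midpoint=32.0):
    # Overall perceptual quality is the product of the two factors.
    return spatial_quality(psnr, slope, midpoint) * temporal_correction(t, t_max, alpha)
```

At the full frame rate the temporal factor is exactly 1, so the model reduces to the spatial quality factor alone.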

174 citations


Journal ArticleDOI
TL;DR: A novel framework for LDE is developed by incorporating the merits of the generalized statistical quantity histogram (GSQH) and histogram-based embedding; it is secure for copyright protection because of the safe storage and transmission of side information.
Abstract: Histogram-based lossless data embedding (LDE) has been recognized as an effective and efficient way to protect the copyright of multimedia. Recently, an LDE method using the statistical quantity histogram has achieved good performance; it utilizes the similarity of the arithmetic average of difference histograms (AADH) to reduce the diversity of images and ensure the stable performance of LDE. However, this method depends strongly on some assumptions, which limits its applications in practice. In addition, the capacities of images with a flat AADH, e.g., texture images, are relatively low. To address these issues, we develop a novel framework for LDE by incorporating the merits of the generalized statistical quantity histogram (GSQH) and histogram-based embedding. Algorithmically, we design the GSQH-driven LDE framework carefully so that it: (1) utilizes the similarity and sparsity of the GSQH to construct an efficient embedding carrier, leading to a general and stable framework; (2) is widely adaptable to different kinds of images, owing to its divide-and-conquer strategy; (3) is scalable to different capacity requirements and avoids the capacity problems caused by flat histogram distributions; (4) is conditionally robust against JPEG compression under a suitable scale factor; and (5) is secure for copyright protection because of the safe storage and transmission of side information. Thorough experiments over three kinds of images demonstrate the effectiveness of the proposed framework.

167 citations


Journal ArticleDOI
TL;DR: A novel method for the protection of bitstreams of state-of-the-art video codec H.264/AVC by keeping exactly the same bitrate, generating completely compliant bitstream and utilizing negligible computational power is presented.
Abstract: This paper presents a novel method for the protection of bitstreams of the state-of-the-art video codec H.264/AVC. The problem of selective encryption (SE) is addressed along with compression in the entropy coding modules. H.264/AVC supports two types of entropy coding: context-adaptive variable length coding (CAVLC) in the baseline profile and context-adaptive binary arithmetic coding (CABAC) in the main profile. SE is performed in both entropy coding modules of this video codec; the encryption step is carried out simultaneously with CAVLC or CABAC entropy coding. SE uses the advanced encryption standard (AES) algorithm in cipher feedback mode on a subset of codewords/binstrings. For CAVLC, SE is performed on equal-length codewords from a specific variable length coding table; for CABAC, it is performed on equal-length binstrings. In our scheme, the entropy coding module serves as the encryption cipher without affecting the coding efficiency of H.264/AVC: it keeps exactly the same bitrate, generates a fully compliant bitstream, and requires negligible computational power. Since the bitrate does not increase, our encryption algorithm is well suited for real-time multimedia streaming over heterogeneous networks, and the negligible increase in required processing power makes it well suited for playback on handheld devices. Nine benchmark video sequences containing different combinations of motion, texture, and objects are used for experimental evaluation of the proposed algorithm.

149 citations


Journal ArticleDOI
TL;DR: A novel contextual bag-of-words (CBOW) representation is proposed to model two kinds of typical contextual relations between local patches, i.e., a semantic conceptual relation and a spatial neighboring relation.
Abstract: Bag-of-words (BOW), which represents an image by a histogram of local patches on the basis of a visual vocabulary, has attracted intensive attention in visual categorization due to its good performance and flexibility. Conventional BOW neglects the contextual relations between local patches because of its Naive Bayesian assumption. However, it is well known that contextual relations play an important role in how human beings recognize visual categories from their local appearance. This paper proposes a novel contextual bag-of-words (CBOW) representation to model two kinds of typical contextual relations between local patches: a semantic conceptual relation and a spatial neighboring relation. To model the semantic conceptual relation, visual words are grouped on multiple semantic levels according to the similarity of the class distributions they induce, and local patches are encoded accordingly to represent images. To explore the spatial neighboring relation, an automatic term extraction technique is adopted to measure the confidence that neighboring visual words are relevant. Word groups with high relevance are used, and their statistics are incorporated into the BOW representation. Classification is performed using a support vector machine with an efficient kernel that incorporates the relational information. The proposed approach is extensively evaluated on two kinds of visual categorization tasks: video event categorization and scene categorization. Experimental results demonstrate the importance of the contextual relations of local patches, and the CBOW shows superior performance to conventional BOW.

148 citations


Journal ArticleDOI
TL;DR: Improved performance of the proposed approach in comparison to other unimodal and multimodal techniques of the relevant literature is demonstrated and the contribution of high-level audiovisual features toward improved video segmentation to scenes is highlighted.
Abstract: In this paper, a novel approach to video temporal decomposition into semantic units, termed scenes, is presented. In contrast to previous temporal segmentation approaches that employ mostly low-level visual or audiovisual features, we introduce a technique that jointly exploits low-level and high-level features automatically extracted from the visual and the auditory channel. This technique is built upon the well-known method of the scene transition graph (STG), first by introducing a new STG approximation that features reduced computational cost, and then by extending the unimodal STG-based temporal segmentation technique to a method for multimodal scene segmentation. The latter exploits, among others, the results of a large number of TRECVID-type trained visual concept detectors and audio event detectors, and is based on a probabilistic merging process that combines multiple individual STGs while at the same time diminishing the need for selecting and fine-tuning several STG construction parameters. The proposed approach is evaluated on three test datasets, comprising TRECVID documentary films, movies, and news-related videos, respectively. The experimental results demonstrate the improved performance of the proposed approach in comparison to other unimodal and multimodal techniques of the relevant literature and highlight the contribution of high-level audiovisual features toward improved video segmentation to scenes.

139 citations


Journal ArticleDOI
TL;DR: A voltage-scalable and process-variation resilient, hybrid memory architecture, suitable for use in MPEG-4 video processors such that power dissipation can be traded for graceful degradation in “quality.”
Abstract: We present a voltage-scalable and process-variation resilient, hybrid memory architecture, suitable for use in MPEG-4 video processors such that power dissipation can be traded for graceful degradation in “quality.” The key innovation in our proposed work is a hybrid memory array, which is a mixture of conventional 6T and 8T SRAM bit-cells. The fundamental premise of our approach lies in the fact that the human visual system is mostly sensitive to higher order bits of luminance pixels in video data. We implemented a preferential storage policy in which the higher order luma bits are stored in robust 8T bit-cells while the lower order bits are stored in conventional 6T bit-cells. This facilitates aggressive scaling of supply voltage in memory as the important luma bits, stored in 8T bit-cells, remain relatively unaffected by voltage scaling. The not-so-important lower order luma bits, stored in 6T bit-cells, if affected, contribute insignificantly to the overall degradation in output video quality. Simulation results show that under iso-area condition, we can obtain at least 32% power savings in the hybrid memory array compared to the conventional 6T SRAM array.
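The preferential storage policy above can be illustrated in software. The sketch below assumes a 4/4 split of an 8-bit luma value (the top four bits in robust 8T cells, the bottom four in 6T cells) and models voltage-scaling failures as random bit flips confined to the 6T-held bits; both the split point and the error model are illustrative assumptions, not the paper's circuit-level behavior:

```python
import random

MSB_BITS = 4  # assumed split: top 4 bits stored in robust 8T cells

def split_pixel(luma):
    """Split an 8-bit luma value: high-order bits go to 8T cells,
    low-order bits to conventional 6T cells."""
    msb = luma >> (8 - MSB_BITS)
    lsb = luma & ((1 << (8 - MSB_BITS)) - 1)
    return msb, lsb

def read_pixel(msb, lsb, lsb_bit_error=0.0, rng=random):
    """Reassemble the pixel; under aggressive voltage scaling the
    6T-held LSBs may flip with probability lsb_bit_error, while the
    8T-held MSBs remain intact."""
    noisy = 0
    for i in range(8 - MSB_BITS):
        bit = (lsb >> i) & 1
        if rng.random() < lsb_bit_error:
            bit ^= 1  # model a 6T read failure as a bit flip
        noisy |= bit << i
    return (msb << (8 - MSB_BITS)) | noisy
```

Because only the low-order bits can be corrupted, the worst-case pixel error is bounded by 2^4 - 1 = 15 gray levels, which is the "graceful degradation" the architecture trades for power.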

135 citations


Journal ArticleDOI
TL;DR: This work presents comprehensive analyses of the impacts of the compression distortion of texture videos and depth maps on the quality of the virtual views, derives a concise distortion model for the synthesized virtual views, and solves the resulting joint bit allocation problem using the Lagrangian multiplier method.
Abstract: In 3-D video coding, texture videos and depth maps need to be jointly coded. The distortion of texture videos and depth maps can be propagated to the synthesized virtual views. Besides coding efficiency of texture videos and depth maps, joint bit allocation between texture videos and depth maps is also an important research issue in 3-D video coding. First, we present comprehensive analyses on the impacts of the compression distortion of texture videos and depth maps on the quality of the virtual views, and then derive a concise distortion model for the synthesized virtual views. Based on this model, the joint bit allocation problem is formulated as a constrained optimization problem, and is solved by using the Lagrangian multiplier method. Experimental results demonstrate the high accuracy of the derived distortion model. Meanwhile, the rate-distortion (R-D) performance of the proposed algorithm is close to those of search-based algorithms which can give the best R-D performance, while the complexity of the proposed algorithm is lower than that of search-based algorithms. Moreover, compared with the bit allocation method using fixed texture and depth bits ratio (5:1), a maximum 1.2 dB gain can be achieved by the proposed algorithm.

Journal ArticleDOI
TL;DR: A new sharing scheme for progressive VC is proposed that produces pixel-unexpanded shares; no one can obtain any hidden information from a single share, which ensures security.
Abstract: The basic (k, n)-threshold visual cryptography (VC) scheme shares a secret image among n participants. The secret image can be recovered by stacking k or more of the obtained shares, but nothing is revealed if fewer than k shares are overlapped. In contrast, progressive VC can be utilized to recover the secret image gradually by superimposing more and more shares. With only a few shares, we obtain an outline of the secret image; by increasing the number of stacked shares, the details of the hidden information are revealed progressively. Previous research, such as Jin in 2005 and Fang and Lin in 2006, was based on pixel expansion, which not only wastes storage space and transmission time but also yields poor visual quality of the stacked image. Furthermore, Fang and Lin's scheme had a severe security problem that discloses secret information on each share. In this letter, we propose a new sharing scheme for progressive VC that produces pixel-unexpanded shares. In our scheme, the probability for either black or white pixels of the secret image to appear as black pixels on a share is the same, approximately 1/n; therefore, no one can obtain any hidden information from a single share, which ensures security. When superimposing k shares, the probability of white pixels being stacked into black pixels remains 1/n, while the probability rises to k/n for black pixels, which sharpens the contrast of the stacked image, so the hidden information becomes more and more obvious. After superimposing all of the shares, the contrast rises to (n-1)/n, which is clearly better than traditional schemes that achieve only 50% contrast, so a clearer recovered image is obtained.
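The stacking probabilities quoted in the abstract (1/n per share for both colors; 1/n for stacked white pixels versus k/n for stacked black pixels) admit a simple per-pixel construction, sketched below. This is a plausible reconstruction consistent with those probabilities, not the authors' actual share-generation tables:

```python
import random

def make_shares(secret, n, rng=random):
    """Generate n pixel-unexpanded shares of a binary secret image
    (1 = black, 0 = white). Black secret pixel: exactly one randomly
    chosen share carries black, so stacking k shares shows black with
    probability k/n. White secret pixel: with probability 1/n the pixel
    is black on EVERY share, so stacking never raises it above 1/n."""
    h, w = len(secret), len(secret[0])
    shares = [[[0] * w for _ in range(h)] for _ in range(n)]
    for i in range(h):
        for j in range(w):
            if secret[i][j] == 1:
                shares[rng.randrange(n)][i][j] = 1
            elif rng.random() < 1.0 / n:
                for s in shares:
                    s[i][j] = 1
    return shares

def stack(shares):
    """Superimpose shares: a stacked pixel is black (1) if it is black
    on any share, as when overlaying printed transparencies."""
    h, w = len(shares[0]), len(shares[0][0])
    return [[max(s[i][j] for s in shares) for j in range(w)]
            for i in range(h)]
```

Stacking all n shares turns every black secret pixel black while white secret pixels stay black only with probability 1/n, giving the (n-1)/n contrast stated in the abstract.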

Journal ArticleDOI
TL;DR: A visual fatigue prediction metric which can replace subjective evaluation for stereoscopic images that detects stereoscopic impairments caused by inappropriate shooting parameters or camera misalignment which induces excessive horizontal and vertical disparities is proposed.
Abstract: In this letter, we propose a visual fatigue prediction metric that can replace subjective evaluation for stereoscopic images. It detects stereoscopic impairments caused by inappropriate shooting parameters or camera misalignment that induce excessive horizontal and vertical disparities. Pearson's correlation between the proposed metrics and the subjective results was measured using k-fold cross-validation, yielding 78-87% with sparse features and 74-85% with dense features.

Journal ArticleDOI
Won Jun Kim1, Chanho Jung1, Changick Kim1
TL;DR: The proposed spatiotemporal scheme is computationally efficient, reliable, and simple to implement and thus it can be easily extended to various applications such as image retargeting and moving object extraction.
Abstract: This paper presents a novel method for detecting salient regions in both images and videos based on a discriminant center-surround hypothesis that the salient region stands out from its surroundings. To this end, our spatiotemporal approach combines the spatial saliency by computing distances between ordinal signatures of edge and color orientations obtained from the center and the surrounding regions and the temporal saliency by simply computing the sum of absolute difference between temporal gradients of the center and the surrounding regions. Our proposed method is computationally efficient, reliable, and simple to implement and thus it can be easily extended to various applications such as image retargeting and moving object extraction. The proposed method has been extensively tested and the results show that the proposed scheme is effective in detecting saliency compared to various state-of-the-art methods.

Journal ArticleDOI
TL;DR: This paper uses the structural similarity index as the quality metric for rate-distortion modeling and develops an optimum bit allocation and rate control scheme for video coding that achieves up to 25% bit-rate reduction over the JM reference software of H.264.
Abstract: The quality of video is ultimately judged by the human eye; however, mean squared error and similar measures that have been used as quality metrics correlate poorly with human perception. Although the characteristics of the human visual system have been incorporated into perceptual-based rate control, most existing schemes do not take rate-distortion optimization into consideration. In this paper, we use the structural similarity (SSIM) index as the quality metric for rate-distortion modeling and develop an optimum bit allocation and rate control scheme for video coding. This scheme achieves up to 25% bit-rate reduction over the JM reference software of H.264. Under the rate-distortion optimization framework, the proposed scheme can be easily integrated with a perceptual-based mode decision scheme. The overall bit-rate reduction may reach as high as 32% over the JM reference software.
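Since the scheme builds its rate-distortion model on the structural similarity index, it is worth recalling what SSIM computes. The sketch below evaluates SSIM over a single global window of two grayscale images given as flat lists (the full metric averages this over local windows); the constants k1 = 0.01 and k2 = 0.03 are the commonly used defaults:

```python
import math

def ssim(x, y, L=255, k1=0.01, k2=0.03):
    """Single-window structural similarity between two equal-size
    grayscale images given as flat intensity lists. L is the dynamic
    range (255 for 8-bit images)."""
    c1, c2 = (k1 * L) ** 2, (k2 * L) ** 2  # stabilizing constants
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((a - mx) ** 2 for a in x) / (n - 1)
    vy = sum((b - my) ** 2 for b in y) / (n - 1)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    # Product of luminance and contrast/structure comparison terms.
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))
```

SSIM equals 1 only for identical signals and drops as structure diverges, which is what makes it a better distortion term than mean squared error for perceptual rate control.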

Journal ArticleDOI
TL;DR: An approach to de-identify individuals from videos that involves tracking and segmenting individuals in a conservative voxel space over x, y, and time, applying a de-identification transformation per frame, and evaluating the result with both algorithmic identification and a user study on the transformed videos.
Abstract: Advances in cameras and web technology have made it easy to capture and share large amounts of video data with a large number of people. A large number of cameras oversee public and semi-public spaces today. This raises concerns about the unintentional and unwarranted invasion of the privacy of individuals caught in the videos. To address these concerns, automated methods to de-identify individuals in these videos are necessary. De-identification does not aim to destroy all information involving the individuals; its ideal goal is to obscure the identity of the actor without obscuring the action. This paper outlines the scenarios in which de-identification is required and the issues they raise. We also present an approach to de-identify individuals from videos. Our approach involves tracking and segmenting individuals in a conservative voxel space involving x, y, and time. A de-identification transformation is applied per frame using these voxels to obscure the identity. Face, silhouette, gait, and other characteristics should ideally be obscured. We show results of our scheme on a number of videos and for several variations of the transformations. We present the results of applying algorithmic identification to the transformed videos, as well as a user study evaluating how well humans can identify individuals from them.
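As a concrete illustration of a per-frame de-identification transformation, the sketch below pixelates a tracked region of a grayscale frame by replacing each block with its average intensity. This is only one of many possible transformations (the paper evaluates several variations); the block size and region coordinates here are hypothetical:

```python
def pixelate(frame, x0, y0, x1, y1, block=8):
    """Pixelate the region [x0,x1) x [y0,y1) of a grayscale frame
    (list of rows of ints): each block is replaced by its average
    intensity, obscuring identity while keeping coarse motion."""
    out = [row[:] for row in frame]  # leave the input frame untouched
    for by in range(y0, y1, block):
        for bx in range(x0, x1, block):
            ys = range(by, min(by + block, y1))
            xs = range(bx, min(bx + block, x1))
            vals = [frame[y][x] for y in ys for x in xs]
            avg = sum(vals) // len(vals)
            for y in ys:
                for x in xs:
                    out[y][x] = avg
    return out
```

Applied per frame to the tracked voxels of an individual, such a transform removes face and texture detail while the overall silhouette motion (the "action") remains visible at a coarse scale.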

Journal ArticleDOI
TL;DR: A novel tracking scheme that jointly employs particle filters and multi-mode anisotropic mean shift and the tracker estimates the dynamic shape and appearance of objects, and also performs online learning of reference object.
Abstract: This paper addresses issues in object tracking for videos containing complex scenarios. We propose a novel tracking scheme that jointly employs particle filters and multi-mode anisotropic mean shift. The tracker estimates the dynamic shape and appearance of objects and also performs online learning of the reference object. Several partition prototypes and fully tunable parameters are applied to the rectangular object bounding box to improve the estimates of shape and of multiple appearance modes in the object. The main contributions of the proposed scheme are: 1) a novel approach for online learning of reference object distributions; 2) a five-parameter set (2-D central location, width, height, and orientation) of the rectangular bounding box used as tunable variables in the joint tracking scheme; 3) derivation of the multi-mode anisotropic mean shift related to a partitioned rectangular bounding box and several partition prototypes; and 4) relating the bounding box parameter computation to the multi-mode mean shift estimates by combining eigen-decomposition, the geometry of subareas, and weighted averaging. This leads to more accurate and efficient tracking in which only a small number of particles (<20) is required. Experiments have been conducted on a range of videos captured by dynamic or stationary cameras, where the target object may experience long-term partial occlusions, intersections with other objects with similar color distributions, deformation accompanied by shape, pose, or abrupt motion-speed changes, and cluttered backgrounds. Comparisons with existing methods and performance evaluations are also performed. Test results show marked improvement of the proposed method in terms of robustness to occlusions, tracking drift, and the tightness and accuracy of the tracked bounding box. Limitations of the method are also discussed.

Journal ArticleDOI
TL;DR: A novel RGVSS scheme is proposed that distinguishes different light transmissions on shared images based on the pixel values of the logo image, with two primary advantages: no pixel expansion and user-friendliness.
Abstract: Recently, the visual secret sharing (VSS) technique based on a random-grid algorithm (RGVSS), proposed by Kafri and Keren in 1987, has drawn attention in academia again. However, Kafri and Keren's scheme is not participant-friendly: the generated shared images are meaningless, which makes this large amount of data hard for users to manage. The literature has introduced the concept of meaningful shared images, in which some shape or information appears to ease management, for VSS techniques based on visual cryptography (VCVSS). Those friendly VCVSS schemes are not directly applicable to RGVSS; instead, a new friendly RGVSS must be designed. Most friendly VCVSS schemes worsen the pixel expansion problem, in which the size of the shared images is larger than that of the original secret image, to achieve the goal of meaningful shares. In this paper, we therefore propose a novel RGVSS scheme that skillfully designs a procedure for distinguishing different light transmissions on shared images based on the pixel values of the logo image, with two primary advantages: no pixel expansion and user-friendliness. A formal analysis establishes the correctness of the scheme, and experimental results show that it works in practice.

Journal ArticleDOI
TL;DR: This paper describes an efficient approximation method based on an evolutionary algorithm for optimizing the coverage and resource allocation in VSN with pan-tilt-zoom camera nodes and combines this method with an expectation-maximization algorithm.
Abstract: A visual sensor network (VSN) consists of a large amount of camera nodes which are able to process the captured image data locally and to extract the relevant information. The tight resource limitations in these networks of embedded sensors and processors represent a major challenge for the application development. In this paper, we focus on finding optimal VSN configurations which are basically given by: 1) the selection of cameras to sufficiently monitor the area of interest; 2) the setting of the cameras' frame rate and resolution to fulfill the quality of service requirements; and 3) the assignment of processing tasks to cameras to achieve all required monitoring activities. We formally specify this configuration problem and describe an efficient approximation method based on an evolutionary algorithm. We analyze our approximation method on three different scenarios and compare the predicted results with measurements on real implementations on a VSN platform. We finally combine our approximation method with an expectation-maximization algorithm for optimizing the coverage and resource allocation in VSN with pan-tilt-zoom camera nodes.

Journal ArticleDOI
TL;DR: The proposed depth boundary reconstruction filter is designed considering occurrence frequency, similarity, and closeness of pixels and is useful for efficient depth coding as well as high-quality 3-D rendering.
Abstract: A depth image provides the 3-D information used for virtual view synthesis in a 3-D video system. In depth coding, object boundaries are hard to compress and, because they are sensitive to coding errors, severely affect the rendering quality. In this paper, we propose a depth boundary reconstruction filter and utilize it as an in-loop filter in coding the depth video. The proposed depth boundary reconstruction filter is designed considering the occurrence frequency, similarity, and closeness of pixels. Experimental results demonstrate that the proposed filter is useful for efficient depth coding as well as high-quality 3-D rendering.

Journal ArticleDOI
TL;DR: A new message passing scheme named tile-based BP that reduces the memory and bandwidth to a fraction of the ordinary BP algorithms without performance degradation by splitting the MRF into many tiles and only storing the messages across the neighboring tiles is proposed.
Abstract: Loopy belief propagation (BP) is an effective solution for assigning labels to the nodes of a graphical model such as the Markov random field (MRF), but it requires high memory, bandwidth, and computational costs. Furthermore, the iterative, pixel-wise, and sequential operations of BP make the computation difficult to parallelize. In this paper, we propose two techniques to address these issues. The first is a new message passing scheme named tile-based BP that reduces the memory and bandwidth to a fraction of that of ordinary BP algorithms without performance degradation, by splitting the MRF into many tiles and only storing the messages across neighboring tiles. The tile-wise processing also enables data reuse and pipelining, resulting in efficient hardware implementation. The second is an O(L) fast message construction algorithm that exploits the properties of robust functions for parallelization. We apply these two techniques to a very large-scale integration circuit for stereo matching that generates high-resolution disparity maps in near real-time. We also implement the proposed schemes on a graphics processing unit (GPU), where they run four times faster than standard BP on GPU.

Journal ArticleDOI
TL;DR: This paper presents a hierarchical scheme with block-based and pixel-based codebooks for foreground detection, with superior performance to previous related approaches.
Abstract: This paper presents a hierarchical scheme with block-based and pixel-based codebooks for foreground detection. The codebook is mainly used to compress information in order to achieve a highly efficient processing speed. In the block-based stage, 12 intensity values are employed to represent a block. The algorithm extends the concept of block truncation coding, further improving processing efficiency thanks to its low complexity. Specifically, the block-based stage removes most of the background without reducing the true positive rate, but it has low precision. To overcome this problem, the pixel-based stage is adopted to enhance the precision, which also reduces the false positive rate. Moreover, short-term information is employed to improve background updating in adaptive environments. As documented in the experimental results, the proposed algorithm provides superior performance to previous related approaches.
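As a hedged sketch of the block-based stage (the paper's exact 12-value descriptor is not reproduced in the abstract), the block truncation coding idea can be shown with a simplified single-channel descriptor: the means of the above-mean and below-mean pixels summarize a block, and a block is declared background if its descriptor matches any codeword within a tolerance. All names and parameters below are illustrative:

```python
import numpy as np

def btc_descriptor(block):
    """Simplified block-truncation-coding descriptor: means of the
    pixels above and below the overall block mean."""
    mean = block.mean()
    hi = block[block >= mean]          # never empty (contains the max)
    lo = block[block < mean]
    lo_mean = lo.mean() if lo.size else mean
    return np.array([hi.mean(), lo_mean])

def is_background_block(block, codebook, tol=10.0):
    """Match the block descriptor against a background codebook."""
    d = btc_descriptor(block)
    return any(np.abs(d - code).max() <= tol for code in codebook)
```

Blocks rejected by this cheap test would then be handed to the pixel-based stage, which refines the decision at full resolution.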

Journal ArticleDOI
TL;DR: A DBC video coding scheme is proposed, where a super-resolution technique is employed to restore the down-sampled frames to their original resolutions, and the performance improvement of the proposed DBC scheme is analyzed at low bit rates, and verified by experiments.
Abstract: It has been reported that oversampling a still image before compression does not guarantee good image quality. Similarly, down-sampling before compression in low bit rate video coding may alleviate the blocking effect and improve the peak signal-to-noise ratio of the decoded frames. When the number of discrete cosine transform coefficients is reduced in such down-sampling based coding (DBC), the bit budget of each coefficient increases, thus reducing the quantization error. A DBC video coding scheme is proposed in this paper, in which a super-resolution technique is employed to restore the down-sampled frames to their original resolutions. The performance improvement of the proposed DBC scheme is analyzed at low bit rates and verified by experiments.

Journal ArticleDOI
TL;DR: A fast and accurate block-based local motion estimator together with a robust alignment algorithm based on voting is proposed for video stabilization purposes and Experimental results confirm the effectiveness of both local and global motion estimators.
Abstract: Today, thanks to the widespread use of mobile devices (personal digital assistants, mobile phones, etc.), many people with little or no knowledge of video recording take videos. However, the unwanted movements of their hands typically blur the recorded sequences and introduce disturbing jerkiness. Many video stabilization techniques with different performances have hence been developed, but only fast strategies can be implemented on embedded devices. A fundamental issue is the overall robustness with respect to different scene contents (indoor, outdoor, etc.) and conditions (illumination changes, moving objects, etc.). In this paper, we propose a fast and robust image alignment algorithm for video stabilization purposes. Our contribution is twofold: a fast and accurate block-based local motion estimator, together with a robust alignment algorithm based on voting. Experimental results confirm the effectiveness of both the local and global motion estimators.

Journal ArticleDOI
TL;DR: This paper proposes a novel fire-flame detection method using fuzzy finite automata (FFA) with probability density functions based on visual features, thereby providing a systematic approach to handling irregularity in computational systems and, by combining the capabilities of automata with fuzzy logic, the ability to handle continuous spaces.
Abstract: Fire-flame detection using a video camera is difficult because a flame has irregular characteristics, i.e., vague shapes and color patterns. Therefore, in this paper, we propose a novel fire-flame detection method using fuzzy finite automata (FFA) with probability density functions based on visual features. Combining the capabilities of automata with fuzzy logic provides a systematic approach to handling irregularity in computational systems and the ability to handle continuous spaces. First, moving regions are detected via background subtraction, and candidate flame regions are then identified by applying flame color models. In general, flame regions exhibit a continuously irregular pattern; therefore, probability density functions are generated for the variation in intensity, wavelet energy, and motion orientation, and applied to the FFA. The proposed algorithm is successfully applied to various fire/non-fire videos, and its detection performance is better than that of other methods.
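The authors' exact FFA formulation is not reproduced in the abstract; the sketch below shows one common way a fuzzy finite automaton updates a fuzzy state-membership vector via max-min composition, with input symbols weighted by fuzzy membership degrees (such as degrees produced by the probability density functions over intensity variation or wavelet energy). All structures here are illustrative assumptions:

```python
import numpy as np

def ffa_step(state, input_memberships, transitions):
    """One fuzzy-finite-automaton transition step (illustrative
    formulation): `state` is a membership vector over automaton states,
    `input_memberships` maps each input symbol to a degree in [0, 1],
    and `transitions[symbol]` is a fuzzy transition matrix.  The new
    state is the max-min composition, gated by the input degree."""
    new_state = np.zeros_like(state, dtype=float)
    for symbol, degree in input_memberships.items():
        T = np.asarray(transitions[symbol], dtype=float)
        # max-min composition: how strongly each target state is reached
        reached = np.max(np.minimum(state[:, None], T), axis=0)
        new_state = np.maximum(new_state, np.minimum(degree, reached))
    return new_state
```

Iterating such steps over a video's feature stream lets the automaton accumulate graded evidence for a "flame" state rather than making a hard per-frame decision.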

Journal ArticleDOI
Hosik Sohn1, Wesley De Neve1, Yong Man Ro1
TL;DR: This paper discusses a privacy-protected video surveillance system that makes use of JPEG extended range (JPEG XR), and demonstrates that subband-adaptive scrambling is able to conceal privacy-sensitive face regions with a feasible level of protection.
Abstract: This paper discusses a privacy-protected video surveillance system that makes use of JPEG extended range (JPEG XR). JPEG XR offers a low-complexity solution for the scalable coding of high-resolution images. To address privacy concerns, face regions are detected and scrambled in the transform domain, taking into account the quality and spatial scalability features of JPEG XR. Experiments were conducted to investigate the performance of our surveillance system, considering visual distortion, bit stream overhead, and security aspects. Our results demonstrate that subband-adaptive scrambling is able to conceal privacy-sensitive face regions with a feasible level of protection. In addition, our results show that subband-adaptive scrambling of face regions outperforms subband-adaptive scrambling of frames in terms of coding efficiency, except when low video bit rates are in use.

Journal ArticleDOI
TL;DR: To speed up feature extraction and to retain additional global features at different scales for higher classification accuracy, a boosting light and pyramid sampling histogram of oriented gradients feature extraction method and a spatio-temporal appearance-related similarity measure are proposed.
Abstract: Visual surveillance from low-altitude airborne platforms plays a key role in urban traffic surveillance. Moving vehicle detection and motion analysis are very important for such a system. However, illumination variance, scene complexity, and platform motion make these tasks very challenging. In addition, the algorithms used have to be computationally efficient in order to run on a real-time platform. To deal with these problems, a new framework for vehicle detection and motion analysis from low-altitude airborne videos is proposed. Our paper makes two major contributions. First, to speed up feature extraction and to retain additional global features at different scales for higher classification accuracy, a boosting light and pyramid sampling histogram of oriented gradients feature extraction method is proposed. Second, to efficiently correlate vehicles across different frames for the computation of vehicle motion trajectories, a spatio-temporal appearance-related similarity measure is proposed. Our experimental results show that, compared to other representative existing methods, the proposed method achieves better performance, with a higher detection rate, a lower false positive rate, and faster detection speed.

Journal ArticleDOI
TL;DR: This paper presents a video super-resolution algorithm to interpolate an arbitrary frame in a low resolution video sequence from sparsely existing high resolution key-frames, and shows that the proposed algorithm provides significantly better subjective visual quality as well as higher peak signal-to-noise ratio than previous interpolation algorithms.
Abstract: This paper presents a video super-resolution algorithm to interpolate an arbitrary frame in a low resolution video sequence from sparsely existing high resolution key-frames. First, hierarchical block-based motion estimation is performed between an input frame and the low resolution key-frames. If the motion-compensated error is small, an input low resolution patch is temporally super-resolved via bi-directional overlapped block motion compensation. Otherwise, the input patch is spatially super-resolved using a dictionary learned in advance from the low resolution key-frame and its corresponding high resolution key-frame. Finally, possible blocking artifacts between temporally super-resolved patches and spatially super-resolved patches are concealed using a dedicated de-blocking filter. The experimental results show that the proposed algorithm provides significantly better subjective visual quality, as well as higher peak signal-to-noise ratio, than previous interpolation algorithms.
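The per-patch decision described above reduces to a simple routing rule; in the sketch below, `temporal_sr` and `spatial_sr` stand in for the paper's bi-directional overlapped block motion compensation and dictionary-based reconstruction, which are not reproduced here, and the threshold is a hypothetical parameter:

```python
def super_resolve_patches(patches, mc_errors, threshold, temporal_sr, spatial_sr):
    """Route each low-resolution patch: small motion-compensated error
    -> temporal super-resolution from the key-frames; otherwise fall
    back to dictionary-based spatial super-resolution."""
    results = []
    for patch, err in zip(patches, mc_errors):
        if err < threshold:
            results.append(temporal_sr(patch))
        else:
            results.append(spatial_sr(patch))
    return results
```

The de-blocking filter mentioned in the abstract then operates on the seams between patches that took different branches.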

Journal ArticleDOI
TL;DR: The proposed raster-scan-based data compressor for video capsule endoscopy application performs strongly with a compression ratio of 80% and a very high reconstruction peak signal-to-noise ratio (over 48 dB).
Abstract: The main challenge in a video capsule endoscopic system is to reduce the area and power consumption while maintaining acceptable video reconstruction. In this paper, a subsample-based data compressor for video endoscopy applications is presented. The algorithm is built around the special features of endoscopic images and consists of differential pulse-code modulation (DPCM) followed by Golomb-Rice coding. Based on the nature of endoscopic images, several subsampling schemes are applied to the chrominance components. The video compressor is designed to work with any commercial low-power image sensor that outputs image pixels in a raster scan fashion, eliminating the need for a memory buffer and temporary storage (as required by transform coding schemes). An image corner clipping algorithm is also introduced. The reconstructed images have been verified by five medical doctors for acceptability. The proposed low-complexity design is implemented in a 0.18 μm CMOS technology; it occupies 592 standard cells and 0.16 × 0.16 mm of silicon area, and consumes 42 μW of power. Compared to other algorithms targeted at video capsule endoscopy, the proposed raster-scan-based scheme performs strongly, with a compression ratio of 80% and a very high reconstruction peak signal-to-noise ratio (over 48 dB).
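The DPCM-plus-Golomb-Rice pipeline can be sketched for a raster-scan pixel stream. The zig-zag signed-to-unsigned mapping and the parameter k below are common illustrative choices, not the paper's exact configuration:

```python
def zigzag_map(residual):
    """Map a signed DPCM residual to a non-negative integer
    (0, -1, 1, -2, 2, ... -> 0, 1, 2, 3, 4, ...)."""
    return 2 * residual if residual >= 0 else -2 * residual - 1

def golomb_rice(value, k):
    """Golomb-Rice codeword: unary quotient, '0' separator, k-bit remainder."""
    q, r = value >> k, value & ((1 << k) - 1)
    return '1' * q + '0' + format(r, '0{}b'.format(k))

def dpcm_golomb_rice_encode(pixels, k=2):
    """DPCM against the previous raster-scan pixel, then Golomb-Rice
    code each residual; no frame buffer is needed, which is why such a
    scheme suits raster-scan sensor output."""
    prev, bits = 0, []
    for p in pixels:
        bits.append(golomb_rice(zigzag_map(p - prev), k))
        prev = p
    return ''.join(bits)
```

Because neighboring endoscopic pixels are highly correlated, most residuals are small, so the unary quotient stays short and the stream compresses well with purely sequential logic.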

Journal ArticleDOI
TL;DR: This work presents a frame fusion based copy detection approach, which converts video copy detection to frame similarity search and frame fusion under a temporal consistency assumption, and focuses mainly on the frame fusion stage due to its critical role in copy detection performance.
Abstract: Content-based video copy detection is very important for copyright protection in view of the growing popularity of video sharing websites; it deals not only with whether a copy occurs in a query video stream but also with where the copy is located and where it originated from. While a lot of work has addressed the problem with good performance, less effort has been made to consider copy detection in the case of a continuous query stream, for which precise temporal localization and complex video transformations like frame insertion and video editing need to be handled. We attack the problem with a frame fusion based copy detection approach, which converts video copy detection into frame similarity search and frame fusion under a temporal consistency assumption. Our work focuses mainly on the frame fusion stage due to its critical role in copy detection performance. The proposed frame fusion scheme is based on a Viterbi-like algorithm, comprising an online back-tracking strategy with three relaxed constraints. The experimental results show that the proposed approach achieves high localization accuracy in both the query stream and the reference database, even when a query video stream undergoes complex transformations, while performing comparably to state-of-the-art copy detection methods.
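The Viterbi-like fusion stage can be illustrated with a small dynamic program over frame-level similarity search results. The scoring, jump limit, and penalty below are illustrative assumptions rather than the paper's three relaxed constraints:

```python
def fuse_frames(candidates, jump_penalty=1.0, max_jump=2):
    """Viterbi-like frame fusion sketch.  candidates[t] is a list of
    (reference_frame_index, similarity) pairs for query frame t; the
    best path must advance the reference index by 1..max_jump frames
    (temporal consistency), paying a penalty for skipped frames."""
    T = len(candidates)
    score = [[s for _, s in candidates[0]]]
    back = [[None] * len(candidates[0])]
    for t in range(1, T):
        row_score, row_back = [], []
        for ref, sim in candidates[t]:
            best, arg = float('-inf'), None
            for j, (prev_ref, _) in enumerate(candidates[t - 1]):
                step = ref - prev_ref
                if 1 <= step <= max_jump and score[t - 1][j] > float('-inf'):
                    cand = score[t - 1][j] - jump_penalty * (step - 1)
                    if cand > best:
                        best, arg = cand, j
            row_score.append(best + sim if arg is not None else float('-inf'))
            row_back.append(arg)
        score.append(row_score)
        back.append(row_back)
    # back-track from the best final state to recover the aligned path
    i = max(range(len(score[-1])), key=score[-1].__getitem__)
    path = [candidates[T - 1][i][0]]
    for t in range(T - 1, 0, -1):
        i = back[t][i]
        path.append(candidates[t - 1][i][0])
    return path[::-1]
```

Even when an isolated wrong match scores highly, the consistency constraint forces the fused path onto the reference segment that agrees across frames, which is the intuition behind localizing copies in a continuous query stream.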