
Showing papers in "IEEE Transactions on Circuits and Systems for Video Technology in 2005"


Journal ArticleDOI
TL;DR: Experimental results show that the fast intraprediction mode decision scheme increases the speed of intracoding significantly with negligible loss of peak signal-to-noise ratio.
Abstract: The H.264/AVC video coding standard aims to achieve significantly improved compression performance compared to all existing video coding standards. To this end, a robust rate-distortion optimization (RDO) technique is employed to select the best coding mode and reference frame for each macroblock. As a result, the complexity and computational load increase drastically. This paper presents a fast mode decision algorithm for H.264/AVC intraprediction based on local edge information. Prior to intraprediction, an edge map is created and a local edge direction histogram is then established for each subblock. Based on the distribution of the edge direction histogram, only a small subset of the intraprediction modes is chosen for RDO calculation. Experimental results show that the fast intraprediction mode decision scheme increases the speed of intracoding significantly with negligible loss of peak signal-to-noise ratio.
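As an illustration of the edge-direction-histogram step, a minimal sketch (the Sobel operators, the four-bin direction quantization, and the candidate-selection rule are illustrative assumptions, not the paper's exact configuration):

```python
import math

def edge_direction_histogram(block, bins=4):
    """Build a local edge-direction histogram for one subblock:
    apply Sobel operators, fold each gradient direction into [0, pi),
    and accumulate edge magnitude into a few direction bins."""
    h, w = len(block), len(block[0])
    hist = [0.0] * bins
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = (block[y-1][x+1] + 2*block[y][x+1] + block[y+1][x+1]
                  - block[y-1][x-1] - 2*block[y][x-1] - block[y+1][x-1])
            gy = (block[y+1][x-1] + 2*block[y+1][x] + block[y+1][x+1]
                  - block[y-1][x-1] - 2*block[y-1][x] - block[y-1][x+1])
            mag = math.hypot(gx, gy)
            if mag == 0:
                continue                       # no edge at this pixel
            angle = math.atan2(gy, gx) % math.pi
            hist[int(angle / math.pi * bins) % bins] += mag
    return hist

def candidate_modes(hist, keep=2):
    """Keep only the direction bins that dominate the histogram;
    only the corresponding prediction modes enter RDO."""
    return sorted(range(len(hist)), key=lambda b: hist[b], reverse=True)[:keep]
```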

485 citations


Journal ArticleDOI
TL;DR: In this application, video summaries that emphasize both content balance and perceptual quality can be generated directly from a temporal graph that embeds both the structure and attention information.
Abstract: We propose a unified approach for video summarization based on the analysis of video structures and video highlights. Two major components in our approach are scene modeling and highlight detection. Scene modeling is achieved by normalized cut algorithm and temporal graph analysis, while highlight detection is accomplished by motion attention modeling. In our proposed approach, a video is represented as a complete undirected graph and the normalized cut algorithm is carried out to globally and optimally partition the graph into video clusters. The resulting clusters form a directed temporal graph and a shortest path algorithm is proposed to efficiently detect video scenes. The attention values are then computed and attached to the scenes, clusters, shots, and subshots in a temporal graph. As a result, the temporal graph can inherently describe the evolution and perceptual importance of a video. In our application, video summaries that emphasize both content balance and perceptual quality can be generated directly from a temporal graph that embeds both the structure and attention information.
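Once the clusters form a directed temporal graph, scene detection by shortest path reduces to a single forward pass, since temporal ordering makes the graph acyclic. A generic sketch (node numbering and edge weights are illustrative, not the paper's construction):

```python
def shortest_path_dag(n, edges):
    """Shortest path from node 0 to node n-1 in a directed temporal
    graph. Nodes are assumed numbered in temporal order, so every edge
    points forward and one dynamic-programming pass suffices.
    edges: dict mapping node u -> list of (v, weight) with v > u."""
    dist = [float("inf")] * n
    dist[0] = 0.0
    prev = [-1] * n
    for u in range(n):
        for v, w in edges.get(u, []):
            if dist[u] + w < dist[v]:
                dist[v], prev[v] = dist[u] + w, u
    path = [n - 1]                 # backtrack from the last node
    while path[-1] != 0:
        path.append(prev[path[-1]])
    return path[::-1]
```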

366 citations


Journal ArticleDOI
TL;DR: A comprehensive, efficient video text detection, localization, and extraction method, which emphasizes the multilingual capability over the whole processing, and is also robust to various background complexities and text appearances.
Abstract: Text in video is a very compact and accurate clue for video indexing and summarization. Most video text detection and extraction methods hold assumptions on text color, background contrast, and font style. Moreover, few methods can handle multilingual text well since different languages may have quite different appearances. This paper performs a detailed analysis of multilingual text characteristics, including English and Chinese. Based on the analysis, we propose a comprehensive, efficient video text detection, localization, and extraction method, which emphasizes the multilingual capability over the whole processing. The proposed method is also robust to various background complexities and text appearances. The text detection is carried out by edge detection, local thresholding, and hysteresis edge recovery. The coarse-to-fine localization scheme is then performed to identify text regions accurately. The text extraction consists of adaptive thresholding, dam point labeling, and inward filling. Experimental results on a large number of video images and comparisons with other methods are reported in detail.
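A minimal stand-in for the local-thresholding stage of the pipeline (block size and bias are illustrative; pixels that deviate strongly from the local mean are marked as candidate text):

```python
def local_threshold(gray, block=4, bias=10):
    """Binarize an image block-by-block against each block's local mean.
    Returns a 0/1 map where 1 marks candidate text pixels."""
    h, w = len(gray), len(gray[0])
    out = [[0] * w for _ in range(h)]
    for by in range(0, h, block):
        for bx in range(0, w, block):
            ys = range(by, min(by + block, h))
            xs = range(bx, min(bx + block, w))
            px = [gray[y][x] for y in ys for x in xs]
            mean = sum(px) / len(px)
            for y in ys:
                for x in xs:
                    if abs(gray[y][x] - mean) > bias:
                        out[y][x] = 1
    return out
```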

354 citations


Journal ArticleDOI
TL;DR: A new rate-distortion (R-D) model is proposed by utilizing the true quantization stepsize and an improved rate-control scheme for the H.264/AVC encoder based on this new R-D model is developed.
Abstract: In this paper, an efficient rate-control scheme for H.264/AVC video encoding is proposed. The redesign of the quantization scheme in H.264/AVC means that the relationship between the quantization parameter and the true quantization stepsize is no longer linear. Based on this observation, we propose a new rate-distortion (R-D) model that utilizes the true quantization stepsize, and we then develop an improved rate-control scheme for the H.264/AVC encoder based on this new R-D model. In general, the current R-D optimization (RDO) mode-selection scheme in the H.264/AVC test model makes rate control difficult: rate control usually requires a predetermined set of motion vectors and coding modes to select the quantization parameter, whereas RDO works in the opposite order and requires a predetermined quantization parameter to select motion vectors and coding modes. To tackle this problem, we develop a complexity-adjustable rate-control scheme based on the proposed R-D model. Briefly, the proposed scheme is a one-pass process at the frame level and a partial two-pass process at the macroblock level. Since the number of macroblocks with two-pass processing can be controlled by an encoder parameter, a fully one-pass implementation is a subset of the proposed algorithm. An additional topic discussed in this paper is video buffering. Since a hypothetical reference decoder (HRD) has been defined in H.264/AVC to guarantee that the buffers never overflow or underflow, more accurate rate-allocation schemes are proposed to satisfy these HRD requirements.
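The nonlinearity referred to above is a known property of H.264/AVC: the true quantization stepsize doubles with every increase of 6 in the quantization parameter. A small sketch of that mapping:

```python
# H.264/AVC stepsize table for QP 0..5; every +6 in QP doubles Qstep,
# so Qstep grows exponentially rather than linearly with QP.
QSTEP_BASE = [0.625, 0.6875, 0.8125, 0.875, 1.0, 1.125]

def qstep(qp):
    """True H.264/AVC quantization stepsize for 0 <= qp <= 51."""
    return QSTEP_BASE[qp % 6] * (1 << (qp // 6))
```

An R-D model built on `qstep(qp)` rather than on `qp` itself tracks the actual quantization behavior, which is the observation the paper's rate control exploits.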

341 citations


Journal ArticleDOI
TL;DR: This paper analyzes the encoding mechanism of typical video coding systems, and develops a parametric video encoding architecture which is fully scalable in computational complexity, using dynamic voltage scaling (DVS), an energy consumption management technology recently developed in CMOS circuits design.
Abstract: Mobile devices performing video coding and streaming over wireless and pervasive communication networks are limited in energy supply. To prolong the operational lifetime of these devices, an embedded video encoding system should be able to adjust its computational complexity and energy consumption as demanded by the situation and its environment. To analyze, control, and optimize the rate-distortion (R-D) behavior of the wireless video communication system under the energy constraint, we develop a power-rate-distortion (P-R-D) analysis framework, which extends the traditional R-D analysis by including another dimension, the power consumption. Specifically, in this paper, we analyze the encoding mechanism of typical video coding systems, and develop a parametric video encoding architecture which is fully scalable in computational complexity. Using dynamic voltage scaling (DVS), an energy consumption management technology recently developed in CMOS circuits design, the complexity scalability can be translated into the energy consumption scalability of the video encoder. We investigate the R-D behavior of the complexity control parameters and establish an analytic P-R-D model. Both theoretically and experimentally, we show that, using this P-R-D model, the video coding system is able to automatically adjust its complexity control parameters to match the available energy supply of the mobile device while maximizing the picture quality. The P-R-D model provides a theoretical guideline for system design and performance optimization in mobile video communication under energy constraints.
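For intuition on why complexity scalability translates into energy scalability under DVS, a textbook-level sketch (the cubic power law is a common approximation, P ∝ V²f with V roughly proportional to f; it is not the paper's P-R-D model):

```python
def dvs_power(clock_fraction, p_max=1.0):
    """Approximate dynamic power under dynamic voltage scaling:
    running at fraction c of full clock speed, with the supply voltage
    lowered accordingly, costs roughly P_max * c**3."""
    return p_max * clock_fraction ** 3
```

So an encoder that can cut its computational load in half can, with DVS, cut dynamic power to roughly one eighth, which is why a complexity-scalable encoder maps directly onto an energy-scalable one.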

338 citations


Journal ArticleDOI
TL;DR: This paper proposed two solutions for platform-based design of H.264/AVC intra frame coder with comprehensive analysis of instructions and exploration of parallelism, and proposed a system architecture with four-parallel intra prediction and mode decision to enhance the processing capability.
Abstract: Intra prediction with rate-distortion constrained mode decision is the most important technology in the H.264/AVC intra frame coder, which is competitive with the latest image coding standard, JPEG2000, in terms of both coding performance and computational complexity. The predictor generation engine for intra prediction and the transform engine for mode decision are critical because these operations require a lot of memory access and occupy 80% of the computation time of the entire intra compression process. A low-cost general-purpose processor cannot process these operations in real time. In this paper, we propose two solutions for platform-based design of an H.264/AVC intra frame coder. The first solution is a software implementation targeted at low-end applications. Context-based decimation of unlikely candidates, subsampling of matching operations, bit-width truncation to reduce the computations, and an interleaved full-search/partial-search strategy to stop error propagation and maintain image quality are proposed and combined as our fast algorithm. Experimental results show that our method can reduce 60% of the computation used for intra prediction and mode decision while keeping the peak signal-to-noise ratio degradation below 0.3 dB. The second solution is a hardware accelerator targeted at high-end applications. After comprehensive analysis of instructions and exploration of parallelism, we propose a system architecture with four-parallel intra prediction and mode decision to enhance the processing capability. Hadamard-based mode decision is modified into a discrete cosine transform-based version to reduce memory access by 40%. Two-stage macroblock pipelining is also proposed to double the processing speed and hardware utilization. The other features of our design are a reconfigurable predictor generator supporting all 13 intra prediction modes, a parallel multitransform and inverse transform engine, and a CAVLC bitstream engine.
A prototype chip is fabricated with TSMC 0.25-µm CMOS 1P5M technology. Simulation results show that our implementation can process 16 megapixels (4096×4096) within 1 s, i.e., 720×480 4:2:0 30 Hz video in real time, at an operating frequency of 54 MHz. The transistor count is 429 K, and the core size is only 1.855×1.885 mm².

331 citations


Journal ArticleDOI
TL;DR: A fast intermode decision algorithm to decide the best mode in intercoding makes use of the spatial homogeneity and the temporal stationarity characteristics of video objects and is able to reduce on the average 30% encoding time.
Abstract: The new video coding standard, H.264/MPEG-4 AVC, uses variable block sizes ranging from 4×4 to 16×16 in interframe coding. This new feature achieves significant coding gain compared to coding a macroblock (MB) with a fixed block size. However, it also results in extremely high computational complexity when a brute-force rate-distortion optimization (RDO) algorithm is used. This paper proposes a fast intermode decision algorithm to decide the best mode in intercoding. It makes use of the spatial homogeneity and temporal stationarity characteristics of video objects. Specifically, the spatial homogeneity of a MB is decided based on the MB's edge intensity, and temporal stationarity is decided by the difference between the current MB and its colocated counterpart in the reference frame. Based on the homogeneity and stationarity of the video objects, only a small number of intermodes are selected in the RDO process. The experimental results show that the fast intermode decision algorithm is able to reduce encoding time by 30% on average, with a negligible peak signal-to-noise ratio loss of 0.03 dB or, equivalently, a bit rate increment of 0.6%.
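A hedged sketch of the two tests and the reduced candidate set (the function name, thresholds, and exact mode groupings are illustrative, not the paper's):

```python
def select_inter_modes(curr_mb, ref_mb, edge_thresh=1000, sad_thresh=512):
    """Pick a reduced inter-mode candidate set for one macroblock:
    spatial homogeneity from the MB's edge intensity, temporal
    stationarity from the SAD against the colocated reference MB."""
    n = len(curr_mb)
    # Edge intensity: sum of horizontal and vertical gradient magnitudes.
    edge = sum(abs(curr_mb[y][x + 1] - curr_mb[y][x]) +
               abs(curr_mb[y + 1][x] - curr_mb[y][x])
               for y in range(n - 1) for x in range(n - 1))
    # SAD against the colocated macroblock in the reference frame.
    sad = sum(abs(curr_mb[y][x] - ref_mb[y][x])
              for y in range(n) for x in range(n))
    homogeneous, stationary = edge < edge_thresh, sad < sad_thresh
    if homogeneous and stationary:
        return ["SKIP", "16x16"]                       # large partitions only
    if homogeneous or stationary:
        return ["SKIP", "16x16", "16x8", "8x16"]
    return ["SKIP", "16x16", "16x8", "8x16", "8x8", "8x4", "4x8", "4x4"]
```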

314 citations


Journal ArticleDOI
TL;DR: Based on the observation that a Cauchy density is more accurate in estimating the distribution of the ac coefficients than the traditional Laplacian density, rate and distortion models with improved accuracy are developed and justified in a frame bit-allocation application for H.264.
Abstract: Based on the observation that a Cauchy density is more accurate in estimating the distribution of the ac coefficients than the traditional Laplacian density, rate and distortion models with improved accuracy are developed. The entropy and distortion models for quantized discrete cosine transform coefficients are justified in a frame bit-allocation application for H.264. Extensive analysis with carefully selected anchor video sequences demonstrates a 0.24-dB average peak signal-to-noise ratio (PSNR) improvement over the JM 8.4 rate control algorithm, and a 0.33-dB average PSNR improvement over the TM5-based bit-allocation algorithm that has recently been proposed for H.264 by Li et al. The analysis also demonstrates 20% and 60% reductions in PSNR variation among the encoded pictures when compared to the JM 8.4 rate control algorithm and the TM5-based bit-allocation algorithm, respectively.
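The Cauchy coefficient density leads to a power-law rate model of the form R(Q) = a·Q^(−α). A small sketch of fitting its two parameters from two rate samples (the paper's full model and fitting procedure differ; this only illustrates the functional form):

```python
import math

def fit_cauchy_rate(q1, r1, q2, r2):
    """Fit R(Q) = a * Q**(-alpha) through two (stepsize, rate) samples.
    This power law is the rate behavior a Cauchy coefficient density
    leads to, in contrast to the Laplacian-based models."""
    alpha = math.log(r1 / r2) / math.log(q2 / q1)
    a = r1 * q1 ** alpha
    return a, alpha
```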

265 citations


Journal ArticleDOI
TL;DR: A novel audio-visual feature-based framework for event detection in broadcast video of multiple different field sports and the results suggest that high event retrieval and content rejection statistics are achievable.
Abstract: In this paper, we propose a novel audio-visual feature-based framework for event detection in broadcast video of multiple different field sports. Features indicating significant events are selected and robust detectors built. These features are rooted in characteristics common to all genres of field sports. The evidence gathered by the feature detectors is combined by means of a support vector machine, which infers the occurrence of an event based on a model generated during a training phase. The system is tested generically across multiple genres of field sports including soccer, rugby, hockey, and Gaelic football and the results suggest that high event retrieval and content rejection statistics are achievable.

251 citations


Journal ArticleDOI
TL;DR: A novel, yet simple, image-adaptive watermarking scheme for image authentication by applying a simple quantization-index-modulation process on wavelet domain singular value decomposition, which is robust against JPEG compression but extremely sensitive to malicious manipulation such as filtering and random noising.
Abstract: In this letter, we propose a novel, yet simple, image-adaptive watermarking scheme for image authentication by applying a simple quantization-index-modulation process on wavelet-domain singular value decomposition. Unlike traditional wavelet-based watermarking schemes, where the watermark bits are embedded directly on the wavelet coefficients, the proposed scheme embeds bits on the singular value (luminance) of the blocks within wavelet subbands of the original image. To improve the fidelity and perceptual quality of the watermarked image and to enhance the security of watermarking, we model the adaptive quantization parameters based on the statistics of blocks within subbands. The scheme is robust against JPEG compression but extremely sensitive to malicious manipulation such as filtering and random noising. Watermark detection is efficient and blind in the sense that only the quantization parameters, not the original image, are required. The quantization parameters adaptive to blocks are vector quantized to reduce the watermarking overhead.
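A generic scalar quantization-index-modulation embed/extract pair, for intuition (the paper applies QIM to block singular values in the wavelet domain with adaptively modeled quantization parameters; the fixed `delta` here is illustrative):

```python
def qim_embed(x, bit, delta=8.0):
    """Embed one bit into a scalar by QIM: quantize x onto one of two
    interleaved lattices (offset 0 for bit 0, delta/2 for bit 1)."""
    d = delta / 2 if bit else 0.0
    return delta * round((x - d) / delta) + d

def qim_extract(x, delta=8.0):
    """Recover the bit as the index of the nearer lattice (blind:
    only delta is needed, not the original value)."""
    d0 = abs(x - delta * round(x / delta))
    d1 = abs(x - (delta * round((x - delta / 2) / delta) + delta / 2))
    return 1 if d1 < d0 else 0
```

Small perturbations (analogous to mild JPEG requantization) leave the value nearer its own lattice, so the bit survives; large manipulations push it past the decision boundary, which is the fragility the scheme relies on for authentication.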

250 citations


Journal ArticleDOI
TL;DR: A novel sequence matching technique to detect copies of a video clip that is robust to the many digitization and encoding processes that give rise to several distortions, including changes in brightness, color, frame format, as well as different blocky artifacts.
Abstract: This paper proposes a novel sequence matching technique to detect copies of a video clip. If a video copy detection technique is to be effective, it needs to be robust to the many digitization and encoding processes that give rise to several distortions, including changes in brightness, color, and frame format, as well as different blocky artifacts. Most of the video copy detection algorithms proposed so far focus on coping with signal distortions introduced by different encoding parameters; however, these algorithms do not cope well with display format conversions. We propose a copy-detection scheme that is robust to the above-mentioned distortions as well as to display format conversions. To this end, each image frame is partitioned into 2×2 regions by intensity averaging, and the partitioned values are stored for indexing and matching. Our spatiotemporal approach combines spatial matching of ordinal signatures obtained from the partitions of each frame and temporal matching of temporal signatures from the temporal trails of the partitions. The proposed method has been extensively tested and the results show that the proposed scheme is effective in detecting copies that have been subjected to a wide range of modifications.
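A minimal sketch of the per-frame 2×2 ordinal signature: ranking the partition means, rather than storing them, makes the signature invariant to monotonic brightness changes, which is the robustness property the scheme relies on.

```python
def ordinal_signature(frame):
    """Partition a frame into 2x2 regions, average the intensity of
    each, and return the rank of each region's mean (the ordinal
    signature used for spatial matching)."""
    h, w = len(frame), len(frame[0])
    means = []
    for qy in range(2):
        for qx in range(2):
            region = [frame[y][x]
                      for y in range(qy * h // 2, (qy + 1) * h // 2)
                      for x in range(qx * w // 2, (qx + 1) * w // 2)]
            means.append(sum(region) / len(region))
    order = sorted(range(4), key=lambda i: means[i])
    ranks = [0] * 4
    for r, i in enumerate(order):
        ranks[i] = r
    return ranks
```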

Journal ArticleDOI
TL;DR: A new JND estimator for color video is devised in image-domain with the nonlinear additivity model for masking and is incorporated into a motion-compensated residue signal preprocessor for variance reduction toward coding quality enhancement, and both perceptual quality and objective quality are enhanced in coded video at a given bit rate.
Abstract: We present a motion-compensated residue signal preprocessing scheme for video coding based on the just-noticeable-distortion (JND) profile. Human eyes cannot sense any changes below the JND threshold around a pixel due to their underlying spatial/temporal masking properties. An appropriate (even imperfect) JND model can significantly help to improve the performance of video coding algorithms. From the viewpoint of signal compression, a smaller signal variance results in less objective distortion of the reconstructed signal for a given bit rate. In this paper, a new JND estimator for color video is devised in the image domain with the nonlinear additivity model for masking (NAMM) and is incorporated into a motion-compensated residue signal preprocessor for variance reduction toward coding quality enhancement. As a result, both perceptual quality and objective quality are enhanced in coded video at a given bit rate. A solution for adaptively determining the parameter of the residue preprocessor is also proposed. The devised technique can be applied to any standardized video coding scheme based on motion-compensated prediction. It provides an extra design option for quality control, besides quantization, in contrast with most existing perceptually adaptive schemes, which have so far focused on determination of proper quantization steps. As an example for demonstration, the proposed scheme has been implemented in the MPEG-2 TM5 coder, and achieved an average peak signal-to-noise ratio (PSNR) increment of 0.505 dB over the twenty video sequences tested. The perceptual quality improvement has been confirmed by the subjective viewing tests conducted.
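For intuition on the variance-reduction step, a simplified sketch that shrinks each residue sample toward zero by its JND threshold (the paper scales by an adaptively determined parameter; full subtraction here is an illustrative simplification):

```python
def preprocess_residue(residue, jnd):
    """Shrink motion-compensated residues toward zero by their JND:
    changes below the JND threshold are imperceptible, so residues can
    be reduced by up to the JND, lowering variance without visible loss."""
    out = []
    for r_row, j_row in zip(residue, jnd):
        out.append([0 if abs(r) <= j else (r - j if r > 0 else r + j)
                    for r, j in zip(r_row, j_row)])
    return out
```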

Journal ArticleDOI
TL;DR: A new measure to determine homogeneous blocks and a new structure analyzer for rejecting blocks with structure based on high-pass operators and special masks for corners to stabilize the homogeneity estimation are proposed.
Abstract: Noise can significantly impact the effectiveness of video processing algorithms. This paper proposes a fast white-noise variance estimation that is reliable even in images with large textured areas. This method finds intensity-homogeneous blocks first and then estimates the noise variance in these blocks, taking image structure into account. This paper proposes a new measure to determine homogeneous blocks and a new structure analyzer for rejecting blocks with structure. This analyzer is based on high-pass operators and special masks for corners to stabilize the homogeneity estimation. For typical video quality (PSNR of 20-40 dB), the proposed method outperforms other methods significantly and the worst-case estimation error is 3 dB, which is suitable for real applications such as video broadcasts. The method performs well both in highly noisy and good-quality images. It also works well in images including few uniform blocks.
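A simplified sketch of the homogeneous-block idea (it omits the paper's structure analyzer with high-pass operators and corner masks; block size and the kept fraction are illustrative):

```python
def estimate_noise_variance(image, block=8, keep_fraction=0.1):
    """Estimate white-noise variance from the flattest blocks: tile the
    image, compute each block's variance, and average the lowest few,
    assuming those blocks contain mostly noise rather than texture."""
    variances = []
    h, w = len(image), len(image[0])
    for by in range(0, h - block + 1, block):
        for bx in range(0, w - block + 1, block):
            px = [image[y][x] for y in range(by, by + block)
                              for x in range(bx, bx + block)]
            m = sum(px) / len(px)
            variances.append(sum((p - m) ** 2 for p in px) / len(px))
    variances.sort()
    k = max(1, int(len(variances) * keep_fraction))
    return sum(variances[:k]) / k
```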

Journal ArticleDOI
TL;DR: The proposed method minimizes the fitting error between the input motion vectors and the motion vectors generated from the estimated motion model using the Newton-Raphson method with outlier rejections.
Abstract: Global motion estimation is a powerful tool widely used in video processing and compression as well as in computer vision areas. We propose a new approach for estimating global motions from coarsely sampled motion vector fields. The proposed method minimizes the fitting error between the input motion vectors and the motion vectors generated from the estimated motion model using the Newton-Raphson method with outlier rejections. Applications of the proposed method in video coding include fast global motion estimation for MPEG-4 Advanced Simple Profile coding, MPEG-2 to MPEG-4 ASP transcoding, and error concealments. Simulation results and analyses are provided for the proposed method and the applications, which show the effectiveness of the method in terms of accuracy, robustness, and speed.
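A hedged sketch of the fit-then-reject-outliers loop, using a closed-form translational model in place of the paper's Newton-Raphson fit of a full parametric model:

```python
def fit_global_translation(vectors, rounds=3, thresh=5.0):
    """Robustly fit a translational global-motion model to sampled
    block motion vectors: least-squares fit (here just the mean),
    reject vectors far from the model, and refit.
    vectors: list of (dx, dy) motion vectors."""
    inliers = list(vectors)
    gx = gy = 0.0
    for _ in range(rounds):
        gx = sum(v[0] for v in inliers) / len(inliers)
        gy = sum(v[1] for v in inliers) / len(inliers)
        kept = [v for v in inliers
                if (v[0] - gx) ** 2 + (v[1] - gy) ** 2 <= thresh ** 2]
        if not kept or len(kept) == len(inliers):
            break                          # converged (or over-rejected)
        inliers = kept
    return gx, gy
```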

Journal ArticleDOI
TL;DR: Alternative methods for the generation of the motion information for the DIRECT mode using spatial or combined spatiotemporal correlation are introduced and improvements on the existing Rate Distortion Optimization related to B slices within the H.264 codec are presented.
Abstract: The new H.264 (MPEG-4 AVC) video coding standard can achieve considerably higher coding efficiency compared to previous standards. This is accomplished mainly due to the consideration of variable block sizes for motion compensation, multiple reference frames, intra prediction, but also due to better exploitation of the spatiotemporal correlation that may exist between adjacent Macroblocks, with the SKIP mode in predictive (P) slices and the two DIRECT modes in bipredictive (B) slices. These modes, when signaled, could in effect represent the motion of a macroblock (MB) or block without having to transmit any additional motion information required by other inter-MB types. This property also allows these modes to be highly compressible especially due to the consideration of run length coding strategies. Although spatial correlation of motion vectors from adjacent MBs is used for SKIP mode to predict its motion parameters, until recently, DIRECT mode considered only temporal correlation of adjacent pictures. In this letter, we introduce alternative methods for the generation of the motion information for the DIRECT mode using spatial or combined spatiotemporal correlation. Considering that temporal correlation requires that the motion and timestamp information from previous pictures are available in both the encoder and decoder, it is shown that our spatial-only method can reduce or eliminate such requirements while, at the same time, achieving similar performance. The combined methods, on the other hand, by jointly exploiting spatial and temporal correlation either at the MB or slice/picture level, can achieve even higher coding efficiency. Finally, improvements on the existing Rate Distortion Optimization related to B slices within the H.264 codec are also presented, which can lead to improvements of up to 16% in bit rate reduction or, equivalently, more than 0.7 dB in PSNR.
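For reference, the spatial prediction that SKIP and the spatial DIRECT derivation build on is the standard H.264 component-wise median of the motion vectors of the left, top, and top-right neighbors:

```python
def spatial_mv_predictor(mv_a, mv_b, mv_c):
    """Component-wise median of three neighboring motion vectors
    (left, top, top-right), the standard H.264 spatial MV predictor.
    Each mv is an (x, y) tuple."""
    med = lambda a, b, c: sorted((a, b, c))[1]
    return (med(mv_a[0], mv_b[0], mv_c[0]),
            med(mv_a[1], mv_b[1], mv_c[1]))
```

Deriving DIRECT-mode motion this way needs no motion or timestamp information from previous pictures, which is the storage/complexity advantage of the spatial-only method described above.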

Journal ArticleDOI
TL;DR: The proposed 2BT-based motion estimation technique improves motion estimation accuracy in terms of peak signal-to-noise ratio of reconstructed frames and also results in visually more accurate frames subsequent to motion compensation compared to the 1BT- based motion estimation approach.
Abstract: One-bit transforms (1BTs) have been proposed for low-complexity block-based motion estimation, reducing the representation to a single bit per pixel and employing binary matching criteria. However, because a single bit is used to represent image frames, erroneous motion vectors are likely in 1BT-based motion estimation algorithms, particularly for small block sizes. This paper proposes a two-bit transform (2BT) for block-based motion estimation. Image frames are converted into two-bit representations by a simple block-by-block two-bit transform based on multithresholding with the mean and a linearly approximated standard deviation. To avoid blocking effects at block boundaries during the block-by-block transformation, while still letting the two-bit representation adapt to local detail, threshold values are computed within a larger window surrounding the transformed block. The 2BT retains the low bit-depth and binary matching criteria of 1BTs to achieve low-complexity block motion estimation, while improving motion estimation accuracy and substantially reducing the number of erroneous motion vectors, particularly for small block sizes. It is shown that the proposed 2BT-based motion estimation technique improves motion estimation accuracy in terms of the peak signal-to-noise ratio of reconstructed frames and also yields visually more accurate frames after motion compensation compared to the 1BT-based approach.
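An illustrative construction of a two-bit representation and a binary matching criterion (the paper's exact thresholds and matching definition may differ; thresholds at mu − sigma, mu, and mu + sigma are an assumption):

```python
def two_bit_transform(block):
    """Map a block to 2-bit codes (0..3) by comparing each pixel
    against three mean/deviation-based thresholds."""
    px = [p for row in block for p in row]
    mu = sum(px) / len(px)
    sigma = (sum((p - mu) ** 2 for p in px) / len(px)) ** 0.5
    return [[(p > mu - sigma) + (p > mu) + (p > mu + sigma) for p in row]
            for row in block]

def nbnmp(a, b):
    """Binary matching criterion: number of non-matching bits between
    two 2-bit-coded blocks, computed with cheap XOR operations."""
    return sum(bin(x ^ y).count("1")
               for ra, rb in zip(a, b) for x, y in zip(ra, rb))
```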

Journal ArticleDOI
TL;DR: This semantic analysis approach can be used in semantic annotation and transcoding systems, which take into consideration the users environment including preferences, devices used, available network bandwidth and content identity.
Abstract: An approach to knowledge-assisted semantic video object detection based on a multimedia ontology infrastructure is presented. Semantic concepts in the context of the examined domain are defined in an ontology, enriched with qualitative attributes (e.g., color homogeneity), low-level features (e.g., color model components distribution), object spatial relations, and multimedia processing methods (e.g., color clustering). Semantic Web technologies are used for knowledge representation in the RDF(S) metadata standard. Rules in F-logic are defined to describe how tools for multimedia analysis should be applied, depending on concept attributes and low-level features, for the detection of video objects corresponding to the semantic concepts defined in the ontology. This supports flexible and managed execution of various application- and domain-independent multimedia analysis tasks. Furthermore, this semantic analysis approach can be used in semantic annotation and transcoding systems, which take into consideration the user's environment, including preferences, devices used, available network bandwidth, and content identity. The proposed approach was tested for the detection of semantic objects on video data of three different domains.

Journal ArticleDOI
TL;DR: This work presents a framework for the classification of feature films into genres, based only on computable visual cues, and demonstrates that low-level visual features (without the use of audio or text cues) may be utilized for movie classification.
Abstract: This work presents a framework for the classification of feature films into genres, based only on computable visual cues. We view the work as a step toward high-level semantic film interpretation, currently using low-level video features and knowledge of ubiquitous cinematic practices. Our current domain of study is the movie preview, a commercial advertisement primarily created to attract audiences. A preview often emphasizes the theme of a film and hence provides suitable information for classification. In our approach, we classify movies into four broad categories: Comedies, Action, Dramas, or Horror films. Inspired by cinematic principles, four computable video features (average shot length, color variance, motion content, and lighting key) are combined in a framework to provide a mapping to these four high-level semantic classes. Mean shift classification is used to discover the structure between the computed features and each film genre. We have conducted extensive experiments on over a hundred film previews and demonstrate that low-level visual features (without the use of audio or text cues) may be utilized for movie classification. Our approach can also be broadened for many potential applications, including scene understanding, the building and updating of video databases with minimal human intervention, and the browsing and retrieval of videos on the Internet (video-on-demand) and in video libraries.
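One of the four features, average shot length, is straightforward to compute once shot boundaries have been detected; a small sketch (the frame rate and boundary convention are illustrative):

```python
def average_shot_length(shot_boundaries, num_frames, fps=24.0):
    """Average shot length in seconds, given the frame indices at which
    cuts occur and the total frame count. Short average shots are
    typical of action previews, long ones of dramas."""
    cuts = [0] + sorted(shot_boundaries) + [num_frames]
    lengths = [b - a for a, b in zip(cuts, cuts[1:])]
    return sum(lengths) / len(lengths) / fps
```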

Journal ArticleDOI
TL;DR: A high-performance and memory-efficient pipeline architecture which performs the one-level two-dimensional (2-D) discrete wavelet transform (DWT) in the 5/3 and 9/7 filters by cascading the three key components.
Abstract: In this paper, we propose a high-performance and memory-efficient pipeline architecture which performs the one-level two-dimensional (2-D) discrete wavelet transform (DWT) for the 5/3 and 9/7 filters. In general, the internal memory size of a 2-D architecture depends highly on the pipeline registers of the one-dimensional (1-D) DWT. Based on the lifting-based DWT algorithm, the primitive data path is modified and an efficient pipeline architecture is derived to shorten the data path. Accordingly, under the same arithmetic resources, the 1-D DWT pipeline architecture can operate at a higher processing speed (up to 200 MHz in 0.25-µm technology) than other pipelined architectures with direct implementation. The proposed 2-D DWT architecture is composed of two 1-D processors (column and row processors). Based on the modified algorithm, the row processor can partially execute each row-wise transform with only two column-processed data. Thus, the pipeline registers of the 1-D architecture do not fully turn into internal memory of the 2-D DWT. For an N×M image (where N and M denote the height and width of the image), only 3.5N of internal memory is required for the 5/3 filter, and 5.5N for the 9/7 filter, to perform the one-level 2-D DWT decomposition with a critical path of one multiplier delay. The pipeline data path is regular and practicable. Finally, the proposed architecture implements the 5/3 and 9/7 filters by cascading the three key components.
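For reference, the underlying 1-D lifting steps of the reversible 5/3 filter, in the standard JPEG2000 formulation with whole-sample symmetric extension (the paper's contribution is the pipeline architecture built around these steps, not the steps themselves):

```python
def dwt53_1d(x):
    """One decomposition level of the reversible 5/3 lifting DWT on an
    even-length 1-D integer signal:
      predict: d[i] = x[2i+1] - (x[2i] + x[2i+2]) // 2
      update:  s[i] = x[2i]   + (d[i-1] + d[i] + 2) // 4
    The 2-D transform applies this along columns, then rows."""
    n = len(x)
    assert n % 2 == 0 and n >= 2
    def px(i):                       # symmetric extension of the input
        i = -i if i < 0 else i
        return x[2 * (n - 1) - i if i >= n else i]
    half = n // 2
    d = [px(2 * i + 1) - (px(2 * i) + px(2 * i + 2)) // 2
         for i in range(half)]
    def pd(i):                       # matching extension of the details
        return d[-i - 1] if i < 0 else d[i]
    s = [px(2 * i) + (pd(i - 1) + d[i] + 2) // 4 for i in range(half)]
    return s, d
```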

Journal ArticleDOI
TL;DR: This paper proposes two architectures for multiple description video coding, both of them are based on the motion compensation prediction loop, and uses a polyphase down-sampling technique to create the MDs and to introduce cross-redundancy among the descriptions.
Abstract: In this paper, we address the problem of video transmission over unreliable networks, such as the Internet, where packet losses occur. The most recent literature indicates multiple description (MD) coding as a promising approach to handle this issue. Moreover, it has also been shown how important the use of motion compensation prediction is in an MD coding scheme. This paper proposes two architectures for multiple description video coding, both based on the motion compensation prediction loop. The common characteristic of the two architectures is the use of a polyphase down-sampling technique to create the MDs and to introduce cross-redundancy among the descriptions. The first scheme, which we call the drift-compensation multiple description video coder (DC-MDVC), is very robust when used in an error-prone environment, but it can provide only two descriptions. The second architecture, called the independent flow multiple description video coder (IF-MDVC), generates multiple sets of data before the motion compensation loop; in this case, there are no severe limitations on the number of descriptions used by the coder.
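The polyphase down-sampling that both architectures share can be sketched directly: each description keeps one of the four 2×2 sampling phases, so any surviving description yields a quarter-resolution frame and lost ones can be interpolated from spatial neighbors.

```python
def polyphase_split(frame):
    """Split a frame into four polyphase descriptions, each keeping
    every other pixel in each dimension (phases (0,0), (0,1), (1,0),
    (1,1))."""
    return [[row[px::2] for row in frame[py::2]]
            for py in range(2) for px in range(2)]
```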

Journal ArticleDOI
TL;DR: An encoding framework which exploits semantics for video content delivery and shows that the use of semantic video analysis prior to encoding noticeably reduces the bandwidth requirements compared to traditional encoders, not only for an object-based encoder but also for a frame-based encoder.
Abstract: We present an encoding framework which exploits semantics for video content delivery. The video content is organized based on the idea of a main content message. In the work reported in this paper, the main content message is extracted from the video data through semantic video analysis, an application-dependent process that separates relevant information from irrelevant information. We use semantic analysis and the corresponding content annotation under a new perspective: the results of the analysis are exploited for object-based encoders, such as MPEG-4, as well as for frame-based encoders, such as MPEG-1. Moreover, MPEG-7 content descriptors are used in conjunction with the video to improve content visualization for narrow channels and for devices with limited capabilities. Finally, we analyze and evaluate the impact of semantic video analysis on video encoding and show that the use of semantic video analysis prior to encoding noticeably reduces the bandwidth requirements compared to traditional encoders, not only for an object-based encoder but also for a frame-based encoder.

Journal ArticleDOI
TL;DR: An algorithm for tracking video objects which is based on a hybrid strategy that uses both object and region information to solve the correspondence problem and implicitly provides one with a description of the objects and their track, thus enabling indexing and manipulation of the video content.
Abstract: We present an algorithm for tracking video objects which is based on a hybrid strategy. This strategy uses both object and region information to solve the correspondence problem. Low-level descriptors are exploited to track an object's regions and to cope with track-management issues. Appearance and disappearance of objects, splitting, and partial occlusions are resolved through interactions between regions and objects. Experimental results demonstrate that this approach can deal with multiple deformable objects whose shape varies over time. Furthermore, it is very simple, because the tracking is based on descriptors that represent a very compact piece of information about regions and are easy to define and track automatically. Finally, this procedure implicitly provides a description of the objects and their tracks, thus enabling indexing and manipulation of the video content.
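The core of any such correspondence step can be sketched as descriptor matching with explicit appearance/disappearance handling. The greedy matcher below is a toy stand-in for the paper's region/object interaction rules; the descriptor (a plain centroid), the L1 distance, and the `max_dist` gate are all assumptions made for the sketch:

```python
def track_step(objects, regions, max_dist=20.0):
    """One tracking step as greedy descriptor matching. `objects` maps an
    object id to its descriptor (here simply a 2-D centroid); `regions` is
    the list of descriptors found in the current frame. Returns the
    object->region assignment, newly appeared regions, and disappeared
    objects."""
    # Enumerate all candidate pairs, closest first.
    pairs = sorted((abs(o[0] - r[0]) + abs(o[1] - r[1]), oid, rid)
                   for oid, o in objects.items()
                   for rid, r in enumerate(regions))
    assign, used_o, used_r = {}, set(), set()
    for dist, oid, rid in pairs:
        if dist <= max_dist and oid not in used_o and rid not in used_r:
            assign[oid] = rid
            used_o.add(oid)
            used_r.add(rid)
    appeared = [rid for rid in range(len(regions)) if rid not in used_r]
    disappeared = [oid for oid in objects if oid not in used_o]
    return assign, appeared, disappeared
```

Unmatched regions start new tracks and unmatched objects end theirs, which is the simplest form of the track management the abstract refers to.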

Journal ArticleDOI
TL;DR: The resulting theoretically derived spatial domain quantization noise model shows that in general the compression noise in the spatial domain is both correlated and spatially varying.
Abstract: In lossy image compression schemes utilizing the discrete cosine transform (DCT), quantization of the DCT coefficients introduces error in the image representation and a loss of signal information. At high compression ratios, this introduced error produces visually undesirable compression artifacts that can dramatically lower the perceived quality of a particular image. This paper provides a spatial domain model of the quantization error based on a statistical noise model of the error introduced when quantizing the DCT coefficients. The resulting theoretically derived spatial domain quantization noise model shows that in general the compression noise in the spatial domain is both correlated and spatially varying. This provides some justification for many of the ad hoc artifact removal filters that have been proposed. More importantly, the proposed noise model can be incorporated in a post-processing algorithm that correctly accounts for the spatial correlation of the quantizer error. Experimental results demonstrate the effectiveness of this approach.
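Under the standard uniform-quantization error model, each DCT coefficient carries independent noise of variance q²/12 (q being its quantization step), and the spatial covariance follows by propagating that diagonal covariance through the inverse transform. A 1-D Python sketch (the paper works with 2-D blocks, but the structure is the same):

```python
import math

def dct_matrix(n=8):
    """Orthonormal DCT-II basis matrix A; row k is the k-th basis vector."""
    a = []
    for k in range(n):
        c = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        a.append([c * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                  for i in range(n)])
    return a

def spatial_noise_covariance(q, n=8):
    """Propagate independent DCT-coefficient quantization noise, with
    variance q[k]**2 / 12 per coefficient, to the spatial domain:
    Cov = A^T diag(q^2 / 12) A. Nonzero off-diagonal entries show the
    compression noise is spatially correlated whenever the quantization
    steps differ across coefficients."""
    A = dct_matrix(n)
    var = [qk * qk / 12.0 for qk in q]
    return [[sum(A[k][i] * var[k] * A[k][j] for k in range(n))
             for j in range(n)] for i in range(n)]
```

With equal steps the covariance collapses to a scaled identity (white noise); a typical JPEG-style table with unequal steps makes the off-diagonal terms nonzero, which is the correlation the model exposes.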

Journal ArticleDOI
TL;DR: A novel hybrid digital video watermarking scheme based on scene change analysis and error correction coding is proposed; it is robust against frame dropping, averaging, and statistical-analysis attacks while preserving the quality of the watermarked video.
Abstract: We have seen an explosion of data exchange on the Internet and the extensive use of digital media. Consequently, digital data owners can quickly and massively transfer multimedia documents across the Internet. This has led to wide interest in multimedia security and multimedia copyright protection. We propose a novel hybrid digital video watermarking scheme based on scene change analysis and error correction coding. Our video watermarking algorithm is robust against frame dropping, averaging, and statistical-analysis attacks, which were not solved effectively in the past. We begin with a survey of current watermarking technologies, noting that none of the existing schemes is capable of resisting all attacks. Accordingly, we propose the idea of embedding different parts of a single watermark into different scenes of a video. We then analyze the strengths of different watermarking schemes and apply a hybrid approach to form a super watermarking scheme that can resist most attacks. To increase the robustness of the scheme, the watermark is refined by an error-correcting code, and the correcting code itself is embedded as a watermark in the audio channel; this also preserves the quality of the watermarked video. The effectiveness of the scheme is verified through a series of experiments, in which a number of standard image-processing attacks are conducted, and the robustness of our approach is demonstrated using the criteria of the latest StirMark test.
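The scene-partitioned embedding idea can be sketched in two steps: locate scene boundaries, then assign each scene its own chunk of the watermark. The histogram-difference cut detector and the chunking rule below are common illustrative choices, not the paper's actual scene-change analysis or embedding method:

```python
def detect_scene_changes(hists, threshold=0.5):
    """Mark a scene change where the normalised L1 difference between
    consecutive per-frame histograms exceeds a threshold. Returns the
    frame indices at which scenes start (frame 0 always starts one)."""
    cuts = [0]
    for t in range(1, len(hists)):
        total = float(sum(hists[t])) or 1.0
        diff = sum(abs(a - b) for a, b in zip(hists[t], hists[t - 1]))
        if diff / (2.0 * total) > threshold:
            cuts.append(t)
    return cuts

def assign_watermark_parts(num_frames, cuts, watermark_bits):
    """Give each scene a different chunk of the single watermark, so that
    dropping or averaging frames within one scene cannot erase the whole
    mark. Returns a frame -> chunk mapping."""
    chunk = max(1, len(watermark_bits) // len(cuts))
    parts = {}
    for i, start in enumerate(cuts):
        end = cuts[i + 1] if i + 1 < len(cuts) else num_frames
        offset = (i * chunk) % len(watermark_bits)
        for f in range(start, end):
            parts[f] = watermark_bits[offset:offset + chunk]
    return parts
```

Because every frame of a scene carries the same chunk, frame dropping and frame averaging inside a scene leave that chunk recoverable, which is the robustness property the abstract claims.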

Journal ArticleDOI
TL;DR: A fast heuristic model based on dynamic programming is proposed for the search of the FIS shape; it searches for the optimal focus measure in the whole image volume, instead of the small volume used in previous methods.
Abstract: The most popular shape from focus (SFF) methods in the literature are based on the concept of the focused image surface (FIS), the surface formed by the best-focus points. According to paraxial geometric optics, there is a one-to-one correspondence between the shape of an object and the shape of its FIS. Therefore, the problem of three-dimensional (3-D) shape recovery from image focus can be described as the problem of determining the shape of the FIS. The conventional SFF method is inaccurate because of its piecewise-constant approximation of the FIS. The SFF method based on the FIS has shown better results through exhaustive search of the FIS shape using a planar surface approximation, at the cost of considerably higher computation. In this paper, the search for the FIS shape is posed as an optimization problem, i.e., maximization of the focus measure in the 3-D image volume. The proposed method searches for the optimal focus measure in the whole image volume, instead of the small volume used in previous methods. Dynamic programming, instead of approximation techniques, is used to search for the optimal FIS shape. A direct application of dynamic programming to 3-D data is impractical because of its high computational complexity; therefore, a fast heuristic model based on dynamic programming is proposed for the search of the FIS shape. The shape recovery results of the new method are better than those of previous methods. The proposed algorithm is significantly faster than the FIS algorithm, but a little slower than the conventional algorithm.
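The optimization view can be made concrete on a single scanline: pick, for each pixel, the lens position maximizing total focus while keeping the depth profile smooth. The Viterbi-style recurrence below, with its absolute-difference smoothness penalty, is an illustrative stand-in for the paper's heuristic DP model:

```python
def dp_depth_scanline(fm, smooth=1.0):
    """Dynamic programming along one scanline. fm[x][d] is the focus
    measure of pixel x at lens position d; return the depth sequence
    maximizing the summed focus measure minus smooth * |d - d_prev|
    between neighbouring pixels."""
    n, D = len(fm), len(fm[0])
    score = [list(fm[0])]
    back = []
    for x in range(1, n):
        prev = score[-1]
        row, brow = [], []
        for d in range(D):
            # Best predecessor depth for landing on depth d at pixel x.
            best_p = max(range(D), key=lambda p: prev[p] - smooth * abs(d - p))
            row.append(fm[x][d] + prev[best_p] - smooth * abs(d - best_p))
            brow.append(best_p)
        score.append(row)
        back.append(brow)
    # Backtrack from the best final depth.
    d = max(range(D), key=lambda p: score[-1][p])
    path = [d]
    for brow in reversed(back):
        d = brow[d]
        path.append(d)
    return path[::-1]
```

This costs O(n·D²) per scanline, which is exactly why a direct 3-D application blows up and a heuristic reduction is needed; a per-pixel argmax (the conventional method) is the `smooth=0` special case.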

Journal ArticleDOI
TL;DR: An original approach to partitioning a video into shots, based on a foveated representation of the video, is proposed; it detects both abrupt and gradual transitions using a single technique rather than a set of dedicated methods.
Abstract: We view scenes in the real world by moving our eyes three to four times each second and integrating information across subsequent fixations (foveation points). By taking advantage of this fact, in this paper we propose an original approach to partitioning of a video into shots based on a foveated representation of the video. More precisely, the shot-change detection method is related to the computation, at each time instant, of a consistency measure of the fixation sequences generated by an ideal observer looking at the video. The proposed scheme aims at detecting both abrupt and gradual transitions between shots using a single technique, rather than a set of dedicated methods. Results on videos of various content types are reported and validate the proposed approach.

Journal ArticleDOI
TL;DR: A wavelet-based multiscale linear minimum mean square-error estimation (LMMSE) scheme for image denoising is proposed, and the determination of the optimal wavelet basis with respect to the proposed scheme is discussed.
Abstract: In this paper, a wavelet-based multiscale linear minimum mean square-error estimation (LMMSE) scheme for image denoising is proposed, and the determination of the optimal wavelet basis with respect to the proposed scheme is also discussed. The overcomplete wavelet expansion (OWE), which is more effective than the orthogonal wavelet transform (OWT) in noise reduction, is used. To explore the strong interscale dependencies of OWE, we combine the pixels at the same spatial location across scales as a vector and apply LMMSE to the vector. Compared with the LMMSE within each scale, the interscale model exploits the dependency information distributed at adjacent scales. The performance of the proposed scheme is dependent on the selection of the wavelet bases. Two criteria, the signal information extraction criterion and the distribution error criterion, are proposed to measure the denoising performance. The optimal wavelet that achieves the best tradeoff between the two criteria can be determined from a library of wavelet bases. To estimate the wavelet coefficient statistics precisely and adaptively, we classify the wavelet coefficients into different clusters by context modeling, which exploits the wavelet intrascale dependency and yields a local discrimination of images. Experiments show that the proposed scheme outperforms some existing denoising methods.
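The LMMSE principle behind the scheme reduces, in its simplest scalar form, to a Wiener-style shrinkage of each noisy wavelet coefficient. The sketch below is a one-scale, scalar simplification of the paper's interscale vector LMMSE, with the signal variance estimated locally from the coefficients themselves:

```python
def lmmse_shrink(coeffs, noise_var):
    """Scalar LMMSE shrinkage of a cluster of noisy wavelet coefficients:
    x_hat = (sx2 / (sx2 + sn2)) * y, where the signal variance sx2 is
    estimated as max(E[y^2] - sn2, 0) over the cluster. In the paper the
    clusters come from context modeling and the estimate is a vector
    across scales; here a flat list stands in for one cluster."""
    ey2 = sum(c * c for c in coeffs) / len(coeffs)
    sx2 = max(ey2 - noise_var, 0.0)
    gain = sx2 / (sx2 + noise_var) if (sx2 + noise_var) > 0 else 0.0
    return [gain * c for c in coeffs]
```

Coefficients whose energy is explained by noise are driven to zero, while strong (edge-dominated) coefficients pass nearly unchanged; the interscale vector version sharpens this discrimination by pooling evidence from adjacent scales.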

Journal ArticleDOI
TL;DR: Experimental results show that the proposed representations outperform the alpha-trimmed average histogram and other histogram-based techniques for video shot representation and retrieval.
Abstract: In this paper, we propose an optimal key frame representation scheme based on global statistics for video shot retrieval. Each pixel in this optimal key frame is constructed by considering the probability of occurrence of those pixels at the corresponding pixel position among the frames in a video shot. Therefore, this constructed key frame is called temporally maximum occurrence frame (TMOF), which is an optimal representation of all the frames in a video shot. The retrieval performance of this representation scheme is further improved by considering the k pixel values with the largest probabilities of occurrence and the highest peaks of the probability distribution of occurrence at each pixel position for a video shot. The corresponding schemes are called k-TMOF and k-pTMOF, respectively. These key frame representation schemes are compared to other histogram-based techniques for video shot representation and retrieval. In the experiments, three video sequences in the MPEG-7 content set were used to evaluate the performances of the different key frame representation schemes. Experimental results show that our proposed representations outperform the alpha-trimmed average histogram for video retrieval.
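The basic TMOF construction follows directly from the definition: each pixel of the key frame is the value occurring most often at that position across the shot. A minimal Python sketch for grey-level frames given as lists of rows (k-TMOF would keep the k most frequent values instead of only the top one):

```python
from collections import Counter

def tmof(frames):
    """Temporally maximum occurrence frame: each pixel takes the value
    with the highest probability of occurrence at that position over all
    frames of the shot."""
    h, w = len(frames[0]), len(frames[0][0])
    key = [[0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            counts = Counter(f[i][j] for f in frames)
            key[i][j] = counts.most_common(1)[0][0]
    return key
```

Unlike an average frame, the per-pixel mode ignores transient occlusions and motion blur, which is why it summarizes a shot better than frame averaging.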

Journal ArticleDOI
TL;DR: Both objective and subjective quality evaluations are given by evaluating the proposed perceptual rate control (PRC) scheme in the H.263 platform, and the evaluations show that the proposed PRC scheme achieves significant quality improvement in block-based coding for bandwidth-hungry applications.
Abstract: We present a method for extracting local visual perceptual cues and its application to rate control for videophone, in order to ensure that the scarce bits are assigned for maximum perceptual coding quality. The optimum quantization step is determined with a rate-distortion model that considers the local perceptual cues in the visual signal. For extraction of the perceptual cues, luminance adaptation and texture masking are used as the stimulus-driven factors, while skin color serves as the cognition-driven factor in the current implementation. Both objective and subjective quality evaluations are given by evaluating the proposed perceptual rate control (PRC) scheme on the H.263 platform, and the evaluations show that the proposed PRC scheme achieves significant quality improvement in block-based coding for bandwidth-hungry applications.
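The way such cues steer the quantizer can be illustrated with a weighting function: coarsen the step where texture masking hides distortion, protect skin regions where viewers look first. All weights and thresholds below are invented for the sketch and are not the paper's rate-distortion model:

```python
def perceptual_qstep(base_q, luminance, texture_energy, skin_fraction):
    """Illustrative perceptual weighting of a macroblock quantization
    step. luminance is the mean grey level, texture_energy a local
    activity measure, skin_fraction the fraction of skin-coloured pixels
    (all per macroblock)."""
    w = 1.0
    if luminance < 40 or luminance > 200:
        w *= 1.2   # luminance adaptation: extremes tolerate more error
    w *= 1.0 + min(texture_energy / 100.0, 1.0) * 0.5   # texture masking
    w *= 1.0 - 0.3 * min(max(skin_fraction, 0.0), 1.0)  # protect faces
    return max(1.0, base_q * w)
```

The effect is that, at a fixed bit budget, textured background blocks absorb the coarser quantization while face regions keep a finer step, which is the intuition behind perceptual rate control for videophone.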

Journal ArticleDOI
TL;DR: A novel Coordinate Rotation Digital Computer (CORDIC) rotator algorithm that converges to the final target angle by adaptively executing appropriate iteration steps while keeping the scale factor virtually constant and completely predictable is proposed.
Abstract: In this paper, we propose a novel COordinate Rotation DIgital Computer (CORDIC) rotator algorithm that converges to the final target angle by adaptively executing appropriate iteration steps while keeping the scale factor virtually constant and completely predictable. The new feature of our scheme is that, depending on the input angle, the scale factor can assume only two values, viz., 1 and 1/√2, and it is independent of the number of executed iterations, the nature of the iterations, and the word length. In this algorithm, compared to the conventional CORDIC, an average reduction of 50% in the number of iterations is achieved without compromising accuracy. The adaptive selection of the appropriate iteration step is predicted from the binary representation of the target angle, and no further arithmetic computation in the angle-approximation data path is required. The convergence range of the proposed CORDIC rotator spans the entire coordinate space. The new CORDIC rotator requires 22% fewer adders and 53% fewer registers than the conventional CORDIC. The synthesized cell area of the proposed CORDIC rotator core is 0.7 mm² and its power dissipation is 7 mW in IHP in-house 0.25-µm BiCMOS technology.
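For reference, the conventional CORDIC rotation that the paper improves on works as follows: each iteration rotates by ±atan(2⁻ⁱ) using only shifts and adds, and the accumulated gain is removed by one known scale factor. A floating-point Python sketch of the baseline (the paper's contribution is in skipping iterations adaptively, which is not shown here):

```python
import math

def cordic_rotate(x, y, angle, iters=32):
    """Conventional CORDIC rotation of (x, y) by `angle` radians
    (|angle| < pi/2 for guaranteed convergence). The micro-rotations use
    only additions and powers of two; the constant K corrects the
    iteration-count-dependent scale factor."""
    K = 1.0
    for i in range(iters):
        K *= 1.0 / math.sqrt(1.0 + 2.0 ** (-2 * i))
    z = angle
    for i in range(iters):
        d = 1.0 if z >= 0 else -1.0          # rotate toward residual angle
        x, y = x - d * y * 2.0 ** (-i), y + d * x * 2.0 ** (-i)
        z -= d * math.atan(2.0 ** (-i))
    return x * K, y * K
```

Note that here K depends on executing every iteration; the paper's point is precisely that its adaptive variant keeps the scale factor at one of two fixed values (1 or 1/√2) no matter which iterations are skipped.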