
Showing papers in "IEEE Transactions on Circuits and Systems for Video Technology in 2000"


Journal Article•DOI•
TL;DR: A low bit-rate embedded video coding scheme that utilizes a 3-D extension of the set partitioning in hierarchical trees (SPIHT) algorithm which has proved so successful in still image coding, which allows multiresolutional scalability in encoding and decoding in both time and space from one bit stream.
Abstract: We propose a low bit-rate embedded video coding scheme that utilizes a 3-D extension of the set partitioning in hierarchical trees (SPIHT) algorithm, which has proved so successful in still image coding. Three-dimensional spatio-temporal orientation trees, coupled with powerful SPIHT sorting and refinement, render the 3-D SPIHT video coder so efficient that it provides performance comparable to H.263, both objectively and subjectively, when operated at bit rates of 30 to 60 kbits/s with minimal system complexity. Extension to color-embedded video coding is accomplished without explicit bit allocation, and can be used for any color plane representation. In addition to being rate scalable, the proposed video coder allows multiresolutional scalability in encoding and decoding in both time and space from one bit stream. This added functionality, along with many desirable attributes, such as full embeddedness for progressive transmission, precise rate control for constant bit-rate traffic, and low complexity for possible software-only video applications, makes the proposed video coder an attractive candidate for multimedia applications.

560 citations


Journal Article•DOI•
TL;DR: The results of a performance evaluation and characterization of a number of shot-change detection methods that use color histograms, block motion matching, or MPEG compressed data are presented.
Abstract: A number of automated shot-change detection methods for indexing a video sequence to facilitate browsing and retrieval have been proposed. Many of these methods use color histograms or features computed from block motion or compression parameters to compute frame differences. It is important to evaluate and characterize their performance so as to deliver a single set of algorithms that may be used by other researchers for indexing video databases. We present the results of a performance evaluation and characterization of a number of shot-change detection methods that use color histograms, block motion matching, or MPEG compressed data.
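As a rough illustration of the color-histogram family of methods evaluated in this study, the sketch below flags a shot boundary whenever the L1 distance between consecutive per-channel color histograms exceeds a threshold. The bin count, distance metric, and threshold are illustrative choices, not parameters taken from the paper.

```python
import numpy as np

def detect_shot_changes(frames, bins=16, threshold=0.5):
    """Flag shot boundaries where the color-histogram difference
    between consecutive frames exceeds `threshold`.

    `frames` is a sequence of HxWx3 uint8 arrays; `bins` and
    `threshold` are illustrative parameters, not values from the paper.
    """
    boundaries = []
    prev_hist = None
    for i, frame in enumerate(frames):
        # One coarse histogram per color channel, normalized to sum to 1.
        hist = np.concatenate([
            np.histogram(frame[..., c], bins=bins, range=(0, 256))[0]
            for c in range(3)
        ]).astype(float)
        hist /= hist.sum()
        if prev_hist is not None:
            # L1 distance between consecutive normalized histograms.
            if np.abs(hist - prev_hist).sum() > threshold:
                boundaries.append(i)
        prev_hist = hist
    return boundaries
```

Real evaluations of this method, as in the paper, must also contend with gradual transitions (fades, dissolves) that a single-threshold frame-difference test tends to miss.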

494 citations


Journal Article•DOI•
TL;DR: A sliding-window method for data selection is used to mitigate the impact of a scene change and the data points for updating a model are adaptively selected such that the statistical behavior is improved.
Abstract: This paper presents a scalable rate control (SRC) scheme based on a more accurate second-order rate-distortion model. A sliding-window method for data selection is used to mitigate the impact of a scene change. The data points for updating a model are adaptively selected such that the statistical behavior is improved. For video object (VO) shape coding, we use an adaptive threshold method to remove shape-coding artifacts for MPEG-4 applications. A dynamic bit allocation among VOs is implemented according to the coding complexities for each VO. SRC achieves more accurate bit allocation with low latency and limited buffer size. In a single framework, SRC offers multiple layers of controls for objects, frames, and macroblocks (MBs). At MB level, SRC provides finer bit rate and buffer control. At multiple VO level, SRC offers superior VO presentation for multimedia applications. The proposed SRC scheme has been adopted as part of the International Standard of the emerging ISO MPEG-4 standard.

446 citations


Journal Article•DOI•
TL;DR: It is argued that more robustness can be achieved if watermarks are embedded in dc components, since dc components have much larger perceptual capacity than any ac components, and a new embedding strategy for watermarking is proposed based on a quantitative analysis of the magnitudes of DCT components of host images.
Abstract: Both watermark structure and embedding strategy affect the robustness of image watermarks. Where should watermarks be embedded in the discrete cosine transform (DCT) domain in order for invisible image watermarks to be robust? Though many papers in the literature agree that watermarks should be embedded in perceptually significant components, dc components are explicitly excluded from watermark embedding. In this letter, a new embedding strategy for watermarking is proposed based on a quantitative analysis of the magnitudes of DCT components of host images. We argue that more robustness can be achieved if watermarks are embedded in dc components, since dc components have a much larger perceptual capacity than any ac components. Based on this idea, an adaptive watermarking algorithm is presented. We incorporate the texture-masking and luminance-masking features of the human visual system into watermarking. Experimental results demonstrate that invisible watermarks embedded with the proposed algorithm are very robust.
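To make the DC-embedding idea concrete, here is a minimal sketch, not the paper's adaptive algorithm (which additionally applies texture and luminance masking): adding a constant offset to an 8x8 block changes only the DC coefficient of its DCT, leaving every AC coefficient untouched, so one watermark bit can be carried in each block's mean level. The embedding strength `alpha` is a hypothetical parameter.

```python
import numpy as np

def embed_dc_watermark(image, bits, alpha=4):
    """Embed one watermark bit per 8x8 block by shifting the block's
    DC (mean) level up or down by `alpha`. Adding a constant to a block
    alters only the DC coefficient of its DCT, so this is equivalent to
    embedding in the DC component. `alpha` is an illustrative strength,
    not a value from the paper."""
    out = image.astype(int).copy()
    h, w = image.shape
    idx = 0
    for by in range(0, h, 8):
        for bx in range(0, w, 8):
            delta = alpha if bits[idx] else -alpha
            out[by:by+8, bx:bx+8] += delta
            idx += 1
    return np.clip(out, 0, 255).astype(np.uint8)
```

Detection (not shown) would compare each block's mean against the original or a predicted value; the paper's point is that this DC "channel" tolerates a larger perturbation than any AC coefficient before becoming visible.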

281 citations


Journal Article•DOI•
TL;DR: It is shown that in certain structured settings, it is possible to obtain reliable estimates of camera motion by directly processing data easily obtained from the MPEG format.
Abstract: As digital video becomes more pervasive, efficient ways of searching and annotating video according to content will be increasingly important. Such tasks arise, for example, in the management of digital video libraries for content-based retrieval and browsing. We develop tools based on camera motion for analyzing and annotating a class of structured video using the low-level information available directly from MPEG-compressed video. In particular, we show that in certain structured settings, it is possible to obtain reliable estimates of camera motion by directly processing data easily obtained from the MPEG format. Working directly with the compressed video greatly reduces the processing time and enhances storage efficiency. As an illustration of this idea, we have developed a simple basketball annotation system which combines the low-level information extracted from an MPEG stream with the prior knowledge of basketball structure to provide high-level content analysis, annotation, and browsing for events such as wide-angle and close-up views, fast breaks, probable shots at the basket, etc. The methods used in this example should also be useful in the analysis of high-level content of structured video in other domains.

254 citations


Journal Article•DOI•
Shipeng Li1, Weiping Li•
TL;DR: Comparison of shape-adaptive wavelet coding with other coding schemes for arbitrarily shaped visual objects shows that shape-adaptive wavelet coding always achieves better coding efficiency than the other schemes.
Abstract: This paper presents a shape-adaptive wavelet coding technique for coding arbitrarily shaped still texture. This technique includes shape-adaptive discrete wavelet transforms (SA-DWTs) and extensions of zerotree entropy (ZTE) coding and embedded zerotree wavelet (EZW) coding. Shape-adaptive wavelet coding is needed for efficiently coding arbitrarily shaped visual objects, which is essential for object-oriented multimedia applications. The challenge is to achieve high coding efficiency while satisfying the functionality of representing arbitrarily shaped visual texture. One feature of the SA-DWT is that the number of coefficients after the transform is identical to the number of pixels in the original arbitrarily shaped visual object. Another is that the spatial correlation, the locality properties of wavelet transforms, and the self-similarity across subbands are well preserved. Also, for a rectangular region, the SA-DWT becomes identical to the conventional wavelet transform. Likewise, the extensions of ZTE and EZW to coding arbitrarily shaped visual objects carefully treat "don't care" nodes in the wavelet trees. Comparison of shape-adaptive wavelet coding with other coding schemes for arbitrarily shaped visual objects shows that shape-adaptive wavelet coding always achieves better coding efficiency than the other schemes. One implementation of the shape-adaptive wavelet coding technique has been included in the new multimedia coding standard MPEG-4 for coding arbitrarily shaped still texture. A software implementation is also available.

253 citations


Journal Article•DOI•
TL;DR: Two light-field compression schemes are presented. The first proposed coder is based on video-compression techniques that have been modified to code the four-dimensional light-field data structure efficiently; the second relies entirely on disparity-compensated image prediction, establishing a hierarchical structure among the light-field images.
Abstract: Two light-field compression schemes are presented. The codecs are compared with regard to compression efficiency and rendering performance. The first proposed coder is based on video-compression techniques that have been modified to code the four-dimensional light-field data structure efficiently. The second coder relies entirely on disparity-compensated image prediction, establishing a hierarchical structure among the light-field images. The coding performance of both schemes is evaluated using publicly available light fields of synthetic, as well as real-world, scenes. Compression ratios vary between 100:1 and 2000:1, depending on the reconstruction quality and light-field scene characteristics.

223 citations


Journal Article•DOI•
TL;DR: Simulation results show that the end-to-end transport architecture achieves good perceptual picture quality for MPEG-4 video under low bit-rate and varying network conditions and efficiently utilizes network resources.
Abstract: With the success of the Internet and the flexibility of MPEG-4, transporting MPEG-4 video over the Internet is expected to be an important component of many multimedia applications in the near future. Video applications typically have delay and loss requirements, which cannot be adequately supported by the current Internet. Thus, it is a challenging problem to design an efficient MPEG-4 video delivery system that can maximize the perceptual quality while achieving high resource utilization. This paper addresses this problem by presenting an end-to-end architecture for transporting MPEG-4 video over the Internet. We present a framework for transporting MPEG-4 video, which includes source rate adaptation, packetization, feedback control, and error control. The main contributions of this paper are: (1) a feedback-control algorithm based on the Real-time Transport Protocol (RTP) and the RTP Control Protocol (RTCP); (2) an adaptive source-encoding algorithm for MPEG-4 video which is able to adjust the output rate of MPEG-4 video to the desired rate; and (3) an efficient and robust packetization algorithm for MPEG video bit-streams at the sync layer for Internet transport. Simulation results show that our end-to-end transport architecture achieves good perceptual picture quality for MPEG-4 video under low bit-rate and varying network conditions and efficiently utilizes network resources.

219 citations


Journal Article•DOI•
TL;DR: It was found that spatial filtering of one channel of a stereo video-sequence may be an effective means of reducing the transmission bandwidth: the overall sensation of depth was unaffected by low-pass filtering, while ratings of quality and of sharpness were strongly weighted towards the eye with the greater spatial resolution.
Abstract: We explored the response of the human visual system to mixed-resolution stereo video-sequences, in which one eye view was spatially or temporally low-pass filtered. It was expected that the perceived quality, depth, and sharpness would be relatively unaffected by low-pass filtering, compared to the case where both eyes viewed a filtered image. Subjects viewed two 10-second stereo video-sequences, in which the right-eye frames were filtered vertically (V) and horizontally (H) at 1/2 H, 1/2 V, 1/4 H, 1/4 V, 1/2 H 1/2 V, 1/2 H 1/4 V, 1/4 H 1/2 V, and 1/4 H 1/4 V resolution. Temporal filtering was implemented for a subset of these conditions at 1/2 temporal resolution, or with drop-and-repeat frames. Subjects rated the overall quality, sharpness, and overall sensation of depth. It was found that spatial filtering produced acceptable results: the overall sensation of depth was unaffected by low-pass filtering, while ratings of quality and of sharpness were strongly weighted towards the eye with the greater spatial resolution. By comparison, temporal filtering produced unacceptable results: field averaging and drop-and-repeat frame conditions yielded images with poor quality and sharpness, even though perceived depth was relatively unaffected. We conclude that spatial filtering of one channel of a stereo video-sequence may be an effective means of reducing the transmission bandwidth.

217 citations


Journal Article•DOI•
TL;DR: It is demonstrated that the new approach can significantly increase received video quality, but at the cost of a considerable computational overhead, and the technique is extended to allow for higher computational efficiency.
Abstract: Audio-visual and other multimedia services are seen as important sources of traffic for future telecommunication networks, including wireless networks. A major drawback with some wireless networks is that they introduce a significant number of transmission errors into the digital bitstream. For video, such errors can have the effect of degrading the quality of service to the point where it is unusable. We introduce a technique that allows for the concealment of the impact of these errors. Our work is based on MPEG-2 encoded video transmitted over a wireless network whose data structures are similar to those of asynchronous transfer mode (ATM) networks. Our simulations include the impact of the MPEG-2 systems layer and cover cell-loss rates up to 5%. This is substantially higher than those that have been discussed in the literature up to this time. We demonstrate that our new approach can significantly increase received video quality, but at the cost of a considerable computational overhead. We then extend our technique to allow for higher computational efficiency and demonstrate that a significant quality improvement is still possible.

193 citations


Journal Article•DOI•
TL;DR: A novel fast block-matching algorithm named normalized partial distortion search is proposed, which reduces computations by using a halfway-stop technique in the calculation of the block distortion measure and normalizes the accumulated partial distortion and the current minimum distortion before comparison.
Abstract: Many fast block-matching algorithms reduce computations by limiting the number of checking points. They can achieve high computation reduction, but often result in relatively higher matching error compared with the full-search algorithm. A novel fast block-matching algorithm named normalized partial distortion search is proposed. The proposed algorithm reduces computations by using a halfway-stop technique in the calculation of the block distortion measure. In order to increase the probability of early rejection of non-possible candidate motion vectors, the algorithm normalizes the accumulated partial distortion and the current minimum distortion before comparison. Experimental results show that the proposed algorithm maintains mean-square-error performance very close to that of the full-search algorithm while achieving an average computation reduction of 12-13 times with respect to full search.
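A simplified sketch of this idea, assuming 8x8 blocks and a small search window (sizes are illustrative, not the paper's): the SAD is accumulated row by row, and a candidate motion vector is rejected as soon as its partial distortion, scaled up to the full block size, already exceeds the best full SAD found so far. The comparison is cross-multiplied to avoid division.

```python
import numpy as np

def npds_motion_search(cur, ref, bx, by, search=4, block=8):
    """Normalized partial distortion search for one block of `cur`
    against reference frame `ref`. Returns (motion_vector, sad)."""
    target = cur[by:by+block, bx:bx+block].astype(int)
    best_sad, best_mv = None, (0, 0)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = by + dy, bx + dx
            if y < 0 or x < 0 or y + block > ref.shape[0] or x + block > ref.shape[1]:
                continue
            cand = ref[y:y+block, x:x+block].astype(int)
            partial = 0
            rejected = False
            for row in range(block):
                partial += np.abs(target[row] - cand[row]).sum()
                # Normalized halfway-stop: compare partial distortion over
                # (row+1) rows against best SAD over all `block` rows,
                # cross-multiplied to stay in integer arithmetic.
                if best_sad is not None and partial * block > best_sad * (row + 1):
                    rejected = True
                    break
            if not rejected and (best_sad is None or partial < best_sad):
                best_sad, best_mv = partial, (dx, dy)
    return best_mv, best_sad
```

The normalization is the key difference from a plain partial distortion search: a partial sum over k rows is compared against a proportionally scaled minimum, which rejects poor candidates earlier on average.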

Journal Article•DOI•
TL;DR: An error-concealment scheme that is based on block-matching principles and spatio-temporal video redundancy is presented and proves to be satisfactory for packet error rates (PER) ranging from 1% to 10% and for video sequences with different content and motion and surpasses that of other EC methods under study.
Abstract: The MPEG-2 compression algorithm is very sensitive to channel disturbances due to the use of variable-length coding. A single bit error during transmission leads to noticeable degradation of the decoded sequence quality, since part of a slice, or an entire slice, is lost until the next resynchronization point is reached. Error-concealment (EC) methods, implemented at the decoder side, present one way of dealing with this problem. An error-concealment scheme that is based on block-matching principles and spatio-temporal video redundancy is presented in this paper. Spatial information (for the first frame of the sequence or the next scene) or temporal information (for the other frames) is used to reconstruct the corrupted regions. The concealment strategy is embedded in the MPEG-2 decoder model in such a way that error concealment is applied after entire-frame decoding. Its performance proves satisfactory for packet error rates (PER) ranging from 1% to 10% and for video sequences with different content and motion, and surpasses that of the other EC methods under study.
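The temporal side of such block-matching concealment can be sketched as follows: for a lost block, search the previous frame for the candidate whose one-pixel border best matches the intact pixels surrounding the loss, then copy that candidate in. This is a simplified stand-in for the paper's scheme; the border-matching criterion, block size, and search range are illustrative assumptions.

```python
import numpy as np

def conceal_block(cur, prev, bx, by, bs=8, search=4):
    """Temporal error concealment for one lost `bs`x`bs` block at
    (bx, by): pick the previous-frame candidate whose top and bottom
    one-pixel borders best match the intact pixels around the loss,
    and copy it into the current frame."""
    # Border rows around the lost block, assumed correctly decoded.
    top = cur[by-1, bx:bx+bs].astype(int)
    bot = cur[by+bs, bx:bx+bs].astype(int)
    best_err, best = None, None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = by + dy, bx + dx
            if y - 1 < 0 or x < 0 or y + bs >= prev.shape[0] or x + bs > prev.shape[1]:
                continue
            # Boundary-matching error against the candidate's borders.
            err = (np.abs(prev[y-1, x:x+bs].astype(int) - top).sum()
                   + np.abs(prev[y+bs, x:x+bs].astype(int) - bot).sum())
            if best_err is None or err < best_err:
                best_err, best = err, (y, x)
    y, x = best
    out = cur.copy()
    out[by:by+bs, bx:bx+bs] = prev[y:y+bs, x:x+bs]
    return out
```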

Journal Article•DOI•
TL;DR: A novel method of reducing power consumption of the ME by adaptively changing the pixel resolution during the computation of the motion vector is proposed, which results in more than 60% reduction in power consumption.
Abstract: Power consumption is very critical for portable video applications such as portable videophones and digital camcorders. Motion estimation (ME) in the video encoder requires a huge amount of computation, and hence consumes the largest portion of power. We propose a novel method of reducing the power consumption of the ME by adaptively changing the pixel resolution during the computation of the motion vector. The pixel resolution is changed by masking or truncating the least significant bits of the pixel data, which is governed by the bit-rate control mechanism. Experimental results show that, on average, more than 4 bits can be truncated without significantly affecting the picture quality. This results in more than a 60% reduction in power consumption.
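The bit-truncation idea can be sketched in a few lines: masking the low bits of both blocks before computing the SAD shrinks the effective word length of every absolute-difference operation. The sketch models the arithmetic only; the actual power saving comes from the correspondingly narrower hardware datapath.

```python
import numpy as np

def truncated_sad(block_a, block_b, drop_bits=4):
    """SAD between two 8-bit pixel blocks after masking off the
    `drop_bits` least significant bits of each pixel, modeling the
    reduced-resolution distortion measure described above."""
    mask = 0xFF & ~((1 << drop_bits) - 1)   # e.g. 0xF0 for drop_bits=4
    a = block_a.astype(int) & mask
    b = block_b.astype(int) & mask
    return int(np.abs(a - b).sum())
```

With `drop_bits=0` this reduces to the ordinary full-resolution SAD, which is why the resolution can be varied adaptively by the rate-control mechanism without changing the search logic.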

Journal Article•DOI•
TL;DR: The results of both experiments provide support for the argument that stereoscopic camera toe-in should be avoided if possible and an effect of display duration on observers' judgements of naturalness and quality of stereoscopic images is indicated.
Abstract: Two experiments are presented that were aimed to investigate the effects of stereoscopic filming parameters and display duration on observers' judgements of naturalness and quality of stereoscopic images. The paper first presents a literature review of temporal factors in stereoscopic vision, with reference to stereoscopic displays. Several studies have indicated an effect of display duration on performance-oriented (criterion-based) measures. The experiments reported were performed to extend the study of display duration from performance to appreciation-oriented measures. In addition, the present study aimed to investigate the effects of manipulating camera separation, convergence distance, and focal length on perceived quality and naturalness. In the first experiment, using display durations of both 5 and 10 s, 12 observers rated the naturalness of depth and the quality of depth for stereoscopic still images. The results showed no significant main effect of the display duration. A small yet significant shift between naturalness and quality was found for both duration conditions. This result replicated earlier findings, indicating that this is a reliable effect, albeit content-dependent. The second experiment was performed using display durations ranging from 1 to 15 s. The results of this experiment showed a small yet significant effect of display duration. Whereas longer display durations do not have a negative impact on the appreciative scores of optimally reproduced stereoscopic images, observers do give lower judgements to monoscopic images and stereoscopic images with unnatural disparity values as display duration increases. In addition, the results of both experiments provide support for the argument that stereoscopic camera toe-in should be avoided if possible.

Journal Article•DOI•
TL;DR: A three-level video-event detection methodology that can be applied to different events by adapting the classifier at the intermediate level and by specifying a new event model at the highest level is proposed.
Abstract: We propose a three-level video-event detection methodology and apply it to animal-hunt detection in wildlife documentaries. The first level extracts color, texture, and motion features, and detects shot boundaries and moving object blobs. The mid-level employs a neural network to determine the object class of the moving object blobs. This level also generates shot descriptors that combine features from the first level and inferences from the mid-level. The shot descriptors are then used by the domain-specific inference process at the third level to detect video segments that match the user-defined event model. The proposed approach has been applied to the detection of hunts in wildlife documentaries. Our method can be applied to different events by adapting the classifier at the intermediate level and by specifying a new event model at the highest level. Event-based video indexing, summarization, and browsing are among the applications of the proposed approach.

Journal Article•DOI•
Ru-Shang Wang, Yao Wang1•
TL;DR: A novel image alignment approach which can convert images captured using nonparallel cameras to coplanar-like images and a coder for multiview sequences, which exploits the proposed alignment and structure estimation algorithm.
Abstract: This paper considers the problem of structure and motion estimation in multiview teleconferencing-type sequences and its application for video-sequence compression and intermediate-view generation. First, we introduce a new approach for structure estimation from a stereo pair acquired by two parallel cameras. It is based on a 2-D mesh representation of both views of the imaged scene and a parametrization of the structure information by the disparity between corresponding nodes in the image pair. Next, we describe a novel image alignment approach which can convert images captured using nonparallel cameras to coplanar-like images. This approach greatly eases the computational burden incurred by the nonparallel camera geometry, where one must consider both horizontal and vertical disparities. Finally, we present a coder for multiview sequences, which exploits the proposed alignment and structure estimation algorithm. By extracting the foreground objects and estimating the disparity field between a selected view and a reference view, the coder can compress the image pair very efficiently. Meanwhile, by using the coded structure information, the decoder can generate virtual viewpoints between decoded views, which can be very helpful for telepresence applications.

Journal Article•DOI•
TL;DR: An adaptive model-driven bit-allocation algorithm for video sequence coding based on a parametric rate-distortion model, which exploits characteristics of human visual perception to efficiently allocate bits according to a region's visual importance.
Abstract: We present an adaptive model-driven bit-allocation algorithm for video sequence coding. The algorithm is based on a parametric rate-distortion model, and facilitates both picture- and macroblock-level bit allocation. A region classification scheme is incorporated into the algorithm, which exploits characteristics of human visual perception to efficiently allocate bits according to a region's visual importance. The application of this algorithm to MPEG video coding is discussed in detail. We show that the proposed algorithm is computationally efficient and has many advantages over the MPEG-2 TM5 bit-allocation algorithm.

Journal Article•DOI•
TL;DR: A phase-correction filter is introduced, which is applied to one type (even or odd) of fields before motion detection/compensation, which has improved the motion-compensated PSNR by more than 2 dB, on average.
Abstract: We present a new method for the motion detection/compensation between opposite parity fields in interlaced video sequences. We introduce a phase-correction filter, which is applied to one type (even or odd) of fields before motion detection/compensation. By means of this phase-correction filter, the motion-compensated PSNR has been improved by more than 2 dB, on average. We also present a new deinterlacing algorithm based on the newly developed motion detection/compensation. This algorithm requires storing one field only, and the phase-corrected field is used for both motion detection/compensation and intrafield deinterlacing, thus making the proposed algorithm computationally very efficient. Excellent deinterlacing results have been obtained.

Journal Article•DOI•
TL;DR: A hardware-independent technique that improves the display rate of animated characters by acting on geometric and rendering information alone, and shows how impostors can be used to render virtual humans.
Abstract: Rendering and animating in real-time a multitude of articulated characters presents a real challenge, and few hardware systems are up to the task. Up to now, little research has been conducted to tackle the issue of real-time rendering of numerous virtual humans. This paper presents a hardware-independent technique that improves the display rate of animated characters by acting on geometric and rendering information alone. We first review the acceleration techniques traditionally in use in computer graphics and highlight their suitability to articulated characters. We then show how impostors can be used to render virtual humans. We introduce concrete case studies that demonstrate the effectiveness of our approach. Finally, we tackle the visibility issue.

Journal Article•DOI•
TL;DR: A multi-metric model comprising a perceptual model and a blockiness detector is proposed, designed for MPEG video, and very high correlation between the objective scores from the model and the subjective assessment results has been achieved.
Abstract: Different coding schemes introduce different artifacts to the decoded pictures, making it difficult to design an objective quality model capable of measuring all of them. A feasible approach is to design a picture-quality model for each kind of known distortion, and combine the results from the models according to the perceptual impact of each type of impairment. In this letter, a multi-metric model comprising a perceptual model and a blockiness detector is proposed, designed for MPEG video. Very high correlation between the objective scores from the model and the subjective assessment results has been achieved.

Journal Article•DOI•
TL;DR: This work has developed a motion-compensated frame-rate conversion algorithm to reduce the 3:2 pulldown artifacts; by using frame-rate conversion with interpolation instead of field repetition, mean square error and blocking artifacts are reduced significantly.
Abstract: Currently, the most popular method of converting 24 frames per second (fps) film to 60 fields/s video is to repeat each odd-numbered frame for 3 fields and each even-numbered frame for 2 fields. This method is known as 3:2 pulldown and is an easy and inexpensive way to perform 24 fps to 60 fields/s frame-rate conversion. However, the 3:2 pulldown introduces artifacts, which are especially visible when viewing on progressive displays and during slow-motion playback. We have developed a motion-compensated frame-rate conversion algorithm to reduce the 3:2 pulldown artifacts. By using frame-rate conversion with interpolation instead of field repetition, mean square error and blocking artifacts are reduced significantly. The techniques developed here can also be applied to the general frame-rate conversion problem.
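The 3:2 field-repetition pattern described above is easy to sketch: with 1-based frame numbering, odd frames contribute three fields and even frames two, so every 4 film frames become 10 fields (24 fps x 10/4 = 60 fields/s).

```python
def pulldown_32(film_frames):
    """Expand 24 fps film frames to 60 fields/s via 3:2 pulldown:
    odd-numbered frames (1st, 3rd, ... in 1-based counting) are
    repeated for 3 fields and even-numbered frames for 2 fields."""
    fields = []
    for i, frame in enumerate(film_frames):
        repeats = 3 if i % 2 == 0 else 2   # i is 0-based here
        fields.extend([frame] * repeats)
    return fields
```

The repeated fields are exactly what the paper's motion-compensated interpolation replaces: instead of duplicating a frame, the missing temporal positions are synthesized along estimated motion trajectories.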

Journal Article•DOI•
J. Ribas-Corbera1, Shaw-Min Lei2•
TL;DR: The approach selects the frame targets using formulas that result from combining an analytical rate-distortion optimization and a heuristic technique that compensates for the distortion dependency among frames, geared toward low-complexity real-time video coding.
Abstract: In typical block-based video coding, the rate-control scheme allocates a target number of bits to each frame of a video sequence and selects the block quantization parameters to meet the frame targets. In this work, we present a new technique for assigning such targets. This method has been adopted in the test model TMN10 of H.263+, but it is applicable to any video coder and is particularly useful for those that use B frames. Our approach selects the frame targets using formulas that result from combining an analytical rate-distortion optimization and a heuristic technique that compensates for the distortion dependency among frames. The method does not require pre-analyses, and encodes each frame only once; hence, it is geared toward low-complexity real-time video coding. We compare this new frame-layer bit allocation in TMN10 to that in MPEG-2's TM5 for a variety of bit rates and video sequences.

Journal Article•DOI•
TL;DR: This design combines the techniques of a fast direct two-dimensional DCT algorithm, bit-level adder-based distributed arithmetic, and common subexpression sharing to reduce the hardware cost and enhance the computing speed.
Abstract: This paper presents a cost-effective processor core design that features very simple hardware and is suitable for discrete cosine transform/inverse discrete cosine transform (DCT/IDCT) operations in H.263 and digital cameras. The design combines the techniques of a fast direct two-dimensional DCT algorithm, bit-level adder-based distributed arithmetic, and common subexpression sharing to reduce the hardware cost and enhance the computing speed. The resulting architecture is very simple and regular, so it can easily be scaled for higher throughput-rate requirements. The DCT design has been implemented in 0.6-μm SPDM CMOS technology and costs only 1493 gates, or 0.78 mm². The proposed design meets the real-time DCT/IDCT requirements of the H.263 codec system for QCIF image frames at 10 frames/s with 4:2:0 color format. Moreover, the design still possesses additional computing power for other operations when operating at 33 MHz.

Journal Article•DOI•
TL;DR: Two efficient quadtree-based algorithms for variable-size block matching (VSBM) motion estimation are reported: one employs an efficient dynamic programming technique utilizing the special structure of a quadtree, and the other uses a heuristic to select variable-sized square blocks.
Abstract: This paper reports two efficient quadtree-based algorithms for variable-size block matching (VSBM) motion estimation. The schemes allow the dimensions of blocks to adapt to local activity within the image, and the total number of blocks in any frame can be varied while still accurately representing true motion. This permits adaptive bit allocation between the representation of displacement and residual data, and also the variation of the overall bit-rate on a frame-by-frame basis. The first algorithm computes the optimal selection of variable-sized blocks to provide the best-achievable prediction error under the fixed number of blocks for a quadtree-based VSBM technique. The algorithm employs an efficient dynamic programming technique utilizing the special structure of a quadtree. Although this algorithm is computationally intensive, it does provide a yardstick by which the performance of other more practical VSBM techniques can be measured. The second algorithm adopts a heuristic way to select variable-sized square blocks. It relies more on local motion information than on global error optimization. Experiments suggest that the effective use of local information contributes to minimizing the overall error. The result is a more computationally efficient VSBM technique than the optimal algorithm, but with a comparable prediction error.
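The quadtree partitioning underlying both algorithms can be sketched as a recursive split test. The paper's heuristic algorithm drives the split decision with local motion information; the sketch below substitutes pixel variance as a simple stand-in activity measure, and the block sizes and threshold are illustrative assumptions.

```python
import numpy as np

def quadtree_blocks(frame, min_size=4, max_size=16, thresh=500.0):
    """Heuristic quadtree partitioning: a block is split into four
    quadrants while its activity (here, pixel variance as a stand-in
    for the prediction error used in the paper) exceeds `thresh` and
    it is larger than `min_size`. Returns (x, y, size) leaf blocks
    covering the frame."""
    blocks = []

    def split(x, y, size):
        region = frame[y:y+size, x:x+size]
        if size > min_size and region.var() > thresh:
            half = size // 2
            for oy in (0, half):
                for ox in (0, half):
                    split(x + ox, y + oy, half)
        else:
            blocks.append((x, y, size))

    for y in range(0, frame.shape[0], max_size):
        for x in range(0, frame.shape[1], max_size):
            split(x, y, max_size)
    return blocks
```

Low-activity areas are then covered by a few large blocks (cheap to signal) while busy areas receive many small ones, which is exactly the adaptive bit allocation between displacement and residual data that the paper exploits.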

Journal Article•DOI•
TL;DR: An efficient technique for summarization of stereoscopic video sequences is presented, which extracts a small but meaningful set of video frames using a content-based sampling algorithm; experimental results indicate the reliable performance of the proposed scheme on real-life stereoscopic video sequences.
Abstract: An efficient technique for summarization of stereoscopic video sequences is presented, which extracts a small but meaningful set of video frames using a content-based sampling algorithm. The proposed video-content representation provides the capability of browsing digital stereoscopic video sequences and performing more efficient content-based queries and indexing. Each stereoscopic video sequence is first partitioned into shots by applying a shot-cut detection algorithm so that frames (or stereo pairs) of similar visual characteristics are gathered together. Each shot is then analyzed using stereo-imaging techniques, and the disparity field, occluded areas, and depth map are estimated. A multiresolution implementation of the recursive shortest spanning tree (RSST) algorithm is applied for color and depth segmentation, while fusion of color and depth segments is employed for reliable video object extraction. In particular, color segments are projected onto depth segments so that video objects on the same depth plane are retained, while at the same time accurate object boundaries are extracted. Feature vectors are then constructed using multidimensional fuzzy classification of segment features including size, location, color, and depth. Shot selection is accomplished by clustering similar shots based on the generalized Lloyd-Max algorithm, while for a given shot, key frames are extracted using an optimization method for locating frames of minimally correlated feature vectors. For efficient implementation of the latter method, a genetic algorithm is used. Experimental results are presented, which indicate the reliable performance of the proposed scheme on real-life stereoscopic video sequences.
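The shot-cut detection step that precedes segmentation is commonly based on frame-to-frame histogram differences; a minimal sketch of such a detector (a generic illustration, not the authors' exact algorithm):

```python
def detect_shot_cuts(histograms, threshold):
    """Flag a cut wherever the L1 distance between consecutive
    frame histograms exceeds a threshold."""
    cuts = []
    for i in range(1, len(histograms)):
        d = sum(abs(a - b) for a, b in zip(histograms[i - 1], histograms[i]))
        if d > threshold:
            cuts.append(i)  # frame i starts a new shot
    return cuts
```

Frames between consecutive cuts form one shot, whose stereo pairs can then be analyzed jointly for disparity and depth as the abstract describes.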

Journal Article•DOI•
TL;DR: This work shows that OBM techniques can be successfully applied to stereo image coding, and takes advantage of the smoothness properties typically found in disparity fields to further improve the performance of OBM in this particular application.
Abstract: We propose a modified overlapped block-matching (OBM) scheme for stereo image coding. OBM has been used in video coding but, to the best of our knowledge, it has not previously been applied to stereo image coding. In video coding, OBM has proven useful in reducing blocking artifacts (since multiple vectors can be used for each block), while also maintaining most of the advantages of fixed-size block matching. There are two main novelties in this work. First, we show that OBM techniques can be successfully applied to stereo image coding. Second, we take advantage of the smoothness properties typically found in disparity fields to further improve the performance of OBM in this particular application. Specifically, we note that practical OBM approaches use noniterative estimation techniques, which produce lower quality estimates than iterative methods. By introducing smoothness constraints into the noniterative disparity-vector (DV) computation, we improve the quality of the estimated disparity compared to standard noniterative OBM approaches. In addition, we propose a disparity estimation/compensation approach using adaptive windows with variable shapes, which reduces complexity. We provide experimental results showing that our proposed hybrid OBM scheme achieves a PSNR gain (about 1.5-2 dB) over a simple block-based scheme, with slight additional PSNR gains (about 0.2-0.5 dB) at reduced complexity compared to an approach based on standard OBM with half-pixel accuracy.
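The blocking-artifact reduction in OBM comes from overlapping weighting windows whose shifted copies sum to one, so each pixel blends predictions from neighboring blocks' vectors. A 1-D raised-cosine window illustrating that partition-of-unity property (an illustrative window choice, not necessarily the one used in the paper):

```python
import math

def obmc_weights(block, overlap):
    """1-D raised-cosine OBMC window of length block + overlap.
    Windows placed at a stride of `block` have ramps that sum to 1."""
    n = block + overlap
    w = []
    for i in range(n):
        if i < overlap:                       # ramp up into the block
            w.append(0.5 * (1 - math.cos(math.pi * (i + 0.5) / overlap)))
        elif i >= n - overlap:                # ramp down out of the block
            w.append(0.5 * (1 - math.cos(math.pi * (n - i - 0.5) / overlap)))
        else:                                 # flat interior
            w.append(1.0)
    return w
```

In the overlap region, the down-ramp of one window and the up-ramp of the next add to exactly 1, so the blended prediction never over- or under-weights a pixel.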

Journal Article•DOI•
TL;DR: Through a seamless integration, it is demonstrated that network adaptivity is enhanced enough to mitigate the packet loss and bandwidth fluctuation, resulting in smoother video playback at the receiver.
Abstract: A feedback-based Internet video transmission scheme based on ITU-T H.263+ is presented. The proposed system continually adapts its stream size and manages packet loss recovery in response to changing network conditions. It consists of multiple components: TCP-friendly end-to-end congestion control and available-bandwidth estimation, encoding frame-rate control and delay-based smoothing at the sender, media-aware packetization and packet loss recovery tied to congestion control, and quality-recovery tools such as motion-compensated frame interpolation at the receiver. These components are designed to meet a low computational-complexity requirement so that the whole system can operate in real time. Among these, the video-aware receiver-based congestion control mechanism, the variable frame-rate H.263+ encoding, and the fast motion-compensated frame interpolation are the key features. Through their seamless integration, it is demonstrated that network adaptivity is enhanced enough to mitigate packet loss and bandwidth fluctuation, resulting in smoother video playback at the receiver.
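TCP-friendly schemes of this kind typically bound the video sending rate by the throughput a conformant TCP flow would achieve under the same conditions. The widely used TCP response function (due to Padhye et al. and later standardized in TFRC; shown here as a sketch, not necessarily the exact equation this system uses):

```python
import math

def tcp_friendly_rate(s, rtt, p, t_rto=None):
    """Estimated TCP-fair sending rate (bytes/s) for packet size s,
    round-trip time rtt (s), and loss-event rate p."""
    if p <= 0:
        return float('inf')          # no loss observed: rate is unconstrained
    if t_rto is None:
        t_rto = 4 * rtt              # common simplification for the RTO term
    denom = (rtt * math.sqrt(2 * p / 3)
             + t_rto * min(1.0, 3 * math.sqrt(3 * p / 8)) * p * (1 + 32 * p * p))
    return s / denom
```

The sender (or, in a receiver-based design, the receiver reporting back) plugs in the measured loss rate and RTT, and the encoder's frame rate and quantization are then steered to keep the stream under this bound.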

Journal Article•DOI•
TL;DR: The title "just enough reality" hints at the contrast between the popularly perceived requirements for strict "virtual reality" and the expert's pragmatic acceptance of "sufficient reality" to satisfy the human interface requirements of real-world applications.
Abstract: We address human factors and technology issues for the design of stereoscopic display systems that are natural and comfortable to view. Our title "just enough reality" hints at the contrast between the popularly perceived requirements for strict "virtual reality" and the expert's pragmatic acceptance of "sufficient reality" to satisfy the human interface requirements of real-world applications. We first review how numerous perceptions and illusions of depth can be exploited to synergistically complement binocular stereopsis. Then we report the results of our experimental studies of stereoscopy with very small interocular separations and correspondingly small on-screen disparities, which we call "microstereopsis." We outline the implications of microstereopsis for the design of future stereoscopic camera and display systems, especially the possibility of achieving zone-less autostereoscopic displays. We describe a possible class of implementations based on a non-Lambertian filter element, and a particular implementation that would use an electronically switched louver filter to realize it.
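Microstereopsis rests on simple viewing geometry: the on-screen parallax needed to place a point at a given perceived depth scales linearly with the (possibly reduced) interocular separation. A similar-triangles sketch of that relationship, with all distances in the same units (an illustrative model, not the authors' experimental apparatus):

```python
def screen_parallax(eye_sep, view_dist, depth):
    """Horizontal on-screen parallax that places a point at `depth` from
    the viewer, for eye separation `eye_sep` and screen distance `view_dist`.
    Zero at screen depth; approaches eye_sep as depth goes to infinity."""
    return eye_sep * (depth - view_dist) / depth
```

Halving the effective interocular separation halves every on-screen disparity, which is why microstereoscopic imagery tolerates imperfect viewing geometry so gracefully.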

Journal Article•DOI•
TL;DR: An adaptive block based intra refresh algorithm for increasing error robustness in an interframe coding system is described, demonstrating a significant improvement in terms of error recovery time over nonadaptive intra update strategies.
Abstract: An adaptive block-based intra-refresh algorithm for increasing error robustness in an interframe coding system is described. The goal of this algorithm is to allow the intra update rates for different image regions to vary according to channel conditions and image characteristics. The update scheme is based on an "error-sensitivity metric," accumulated at the encoder, representing the vulnerability of each coded block to channel errors. As each new frame is encoded, the accumulated metric for each block is examined, and those blocks deemed to have an unacceptably high metric are sent using intra coding rather than inter coding. This approach requires no feedback channel and is fully compatible with H.263. It involves a negligible increase in encoder complexity and no change in decoder complexity. Simulations performed using an H.263 bitstream corrupted by channel errors demonstrate a significant improvement in error recovery time over nonadaptive intra update strategies.
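The update scheme can be pictured as an accumulate-and-reset loop: each block's error-sensitivity metric grows with its estimated vulnerability, and crossing a threshold forces intra coding and resets the metric. The weighting by loss rate and motion activity below is a plausible stand-in for the paper's exact metric, not a reproduction of it:

```python
def select_intra_blocks(metrics, motion_activity, loss_rate, thresh):
    """Per frame: grow each block's accumulated error-sensitivity metric,
    and intra-code (and reset) any block that exceeds the threshold."""
    intra = []
    for i in range(len(metrics)):
        # Vulnerability grows with channel loss and with how much the block moves.
        metrics[i] += loss_rate * motion_activity[i]
        if metrics[i] > thresh:
            intra.append(i)
            metrics[i] = 0.0   # intra coding stops error propagation for this block
    return intra
```

Because the decision uses only encoder-side state, no feedback channel is needed and the bitstream stays standard-compliant, as the abstract notes.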

Journal Article•DOI•
TL;DR: A novel combined shape and motion estimation using sliding texture considerably improves the calibration data of the individual views in comparison to fixed-shape model based camera-motion estimation.
Abstract: A system for the automatic reconstruction of real-world objects from multiple uncalibrated camera views is presented. The camera position and orientation for all views, the 3-D shape of the rigid object, as well as the associated color information, are recovered from the image sequence. The system proceeds in four steps. First, the internal camera parameters describing the imaging geometry are calibrated using a reference object. Second, an initial 3-D description of the object is computed from two views. This model information is then used in a third step to estimate the camera positions for all available views using a novel linear 3-D motion and shape estimation algorithm. The main feature of this third step is the simultaneous estimation of 3-D camera-motion parameters and object shape refinement with respect to the initial 3-D model. The initial 3-D shape model exhibits only a few degrees of freedom, and the object shape refinement is defined as a flexible deformation of the initial shape model. Our formulation of the shape deformation allows the object texture to slide on the surface, which differs from traditional flexible body modeling. This novel combined shape and motion estimation using sliding texture considerably improves the calibration data of the individual views in comparison to fixed-shape model-based camera-motion estimation. Since the shape model used for model-based camera-motion estimation is only approximate, a volumetric 3-D reconstruction process is initiated in the fourth step that combines the information from all views simultaneously. The recovered object consists of a set of voxels with associated color information that describes even fine structures and details of the object. New views of the object can be rendered from the recovered 3-D model, which has potential applications in virtual reality or multimedia systems and the emerging field of video coding using 3-D scene models.
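The fourth step, volumetric reconstruction, can be viewed as a carving test: a voxel survives only if it is consistent with every calibrated view. A schematic sketch, with the per-view consistency test (e.g. projecting into the silhouette, or color consistency) supplied by the caller as a hypothetical `consistent` function:

```python
def carve_voxels(voxels, views, consistent):
    """Keep only voxels judged consistent in every calibrated view.
    `consistent(voxel, view)` is a caller-supplied test, such as
    'the voxel projects inside the object silhouette in this view'."""
    return [v for v in voxels
            if all(consistent(v, view) for view in views)]
```

Because every view constrains every voxel simultaneously, the surviving set tightens as views are added, recovering fine structure that any single approximate shape model would miss.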