Showing papers in "IEEE Transactions on Circuits and Systems for Video Technology" in 2004
TL;DR: A brief overview of the field of biometrics is given and some of its advantages, disadvantages, strengths, limitations, and related privacy concerns are summarized.
Abstract: A wide variety of systems requires reliable personal recognition schemes to either confirm or determine the identity of an individual requesting their services. The purpose of such schemes is to ensure that the rendered services are accessed only by a legitimate user and no one else. Examples of such applications include secure access to buildings, computer systems, laptops, cellular phones, and ATMs. In the absence of robust personal recognition schemes, these systems are vulnerable to the wiles of an impostor. Biometric recognition, or, simply, biometrics, refers to the automatic recognition of individuals based on their physiological and/or behavioral characteristics. By using biometrics, it is possible to confirm or establish an individual's identity based on "who she is", rather than by "what she possesses" (e.g., an ID card) or "what she remembers" (e.g., a password). We give a brief overview of the field of biometrics and summarize some of its advantages, disadvantages, strengths, limitations, and related privacy concerns.
TL;DR: Algorithms developed by the author for recognizing persons by their iris patterns have now been tested in many field and laboratory trials, producing no false matches in several million comparison tests.
Abstract: Algorithms developed by the author for recognizing persons by their iris patterns have now been tested in many field and laboratory trials, producing no false matches in several million comparison tests. The recognition principle is the failure of a test of statistical independence on iris phase structure encoded by multi-scale quadrature wavelets. The combinatorial complexity of this phase information across different persons spans about 249 degrees of freedom and generates a discrimination entropy of about 3.2 bits/mm² over the iris, enabling real-time decisions about personal identity with extremely high confidence. The high confidence levels are important because they allow very large databases to be searched exhaustively (one-to-many "identification mode") without making false matches, despite so many chances. Biometrics that lack this property can only survive one-to-one ("verification") or a few comparisons. The paper explains the iris recognition algorithms and presents results of 9.1 million comparisons among eye images from trials in Britain, the USA, Japan, and Korea.
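The test of statistical independence amounts to computing a normalized Hamming distance between binary iris codes: codes from different eyes disagree on roughly half their bits, while codes from the same eye disagree on very few. The sketch below is a toy illustration (real iris codes are 2048-bit phase codes with occlusion masks, and the decision threshold is derived from the binomial statistics of impostor comparisons):

```python
import numpy as np

def hamming_distance(code_a, code_b, mask_a, mask_b):
    """Fraction of disagreeing bits over the jointly unmasked positions."""
    valid = mask_a & mask_b
    n = int(valid.sum())
    if n == 0:
        return 1.0  # nothing to compare
    return int(((code_a ^ code_b) & valid).sum()) / n

rng = np.random.default_rng(0)
a = rng.integers(0, 2, 2048, dtype=np.uint8)   # toy iris code of eye A
b = rng.integers(0, 2, 2048, dtype=np.uint8)   # toy iris code of eye B
mask = np.ones(2048, dtype=np.uint8)           # no occluded bits in this toy

hd_diff = hamming_distance(a, b, mask, mask)   # different eyes: near 0.5
hd_same = hamming_distance(a, a, mask, mask)   # same eye: 0.0
```

Because impostor distances concentrate tightly around 0.5, a threshold well below it (e.g., around 0.33) yields the extremely low false-match rates the paper reports.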
TL;DR: This work uses a recursive set-partitioning procedure to sort subsets of wavelet coefficients by maximum magnitude with respect to thresholds that are integer powers of two, and concludes that this algorithm retains all the desirable features of these algorithms and is highly competitive with them in compression efficiency.
Abstract: We propose an embedded, block-based, image wavelet transform coding algorithm of low complexity. It uses a recursive set-partitioning procedure to sort subsets of wavelet coefficients by maximum magnitude with respect to thresholds that are integer powers of two. It exploits two fundamental characteristics of an image transform: the well-defined hierarchical structure, and energy clustering in frequency and in space. The two partition strategies allow for versatile and efficient coding of several image transform structures, including dyadic, blocks inside subbands, wavelet packets, and discrete cosine transform (DCT). We describe the use of this coding algorithm in several implementations, including reversible (lossless) coding and its adaptation for color images, and show extensive comparisons with other state-of-the-art coders, such as set partitioning in hierarchical trees (SPIHT) and JPEG2000. We conclude that this algorithm, in addition to being very flexible, retains all the desirable features of these algorithms and is highly competitive with them in compression efficiency.
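Sorting by maximum magnitude against power-of-two thresholds is a bit-plane significance test: a set of coefficients is "significant" at bit-plane n when its maximum magnitude reaches 2^n. A minimal sketch (function names are illustrative, not the paper's):

```python
import numpy as np

def significant(coeffs, n):
    """A coefficient set is significant at bit-plane n if max |c| >= 2**n."""
    return np.abs(coeffs).max() >= (1 << n)

def max_bitplane(coeffs):
    """Highest n with 2**n <= max |c|: the coarsest significant bit-plane."""
    m = int(np.abs(coeffs).max())
    return m.bit_length() - 1 if m > 0 else -1

block = np.array([[3, -17], [6, 2]])
n0 = max_bitplane(block)       # max |c| = 17, so 16 <= 17 < 32 gives plane 4
sig4 = significant(block, 4)   # significant at threshold 16
sig5 = significant(block, 5)   # not significant at threshold 32
```

The coder descends from `max_bitplane` downward, recursively splitting sets that become significant, which is what yields the embedded (progressively refinable) bit stream.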
TL;DR: A human recognition algorithm by combining static and dynamic body biometrics, fused on the decision level using different combinations of rules to improve the performance of both identification and verification is described.
Abstract: Vision-based human identification at a distance has recently gained growing interest from computer vision researchers. This paper describes a human recognition algorithm by combining static and dynamic body biometrics. For each sequence involving a walker, temporal pose changes of the segmented moving silhouettes are represented as an associated sequence of complex vector configurations and are then analyzed using the Procrustes shape analysis method to obtain a compact appearance representation, called static information of body. In addition, a model-based approach is presented under a Condensation framework to track the walker and to further recover joint-angle trajectories of lower limbs, called dynamic information of gait. Both static and dynamic cues obtained from walking video may be independently used for recognition using the nearest exemplar classifier. They are fused on the decision level using different combinations of rules to improve the performance of both identification and verification. Experimental results on a dataset of 20 subjects demonstrate the feasibility of the proposed algorithm.
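Procrustes shape analysis treats a silhouette boundary as a centered complex vector, so shapes can be compared irrespective of position, scale, and rotation. The standard full Procrustes distance is sketched below as a generic illustration (it may differ in detail from the paper's exact measure):

```python
import numpy as np

def procrustes_distance(z1, z2):
    """Full Procrustes distance between two complex boundary configurations.
    Centering removes translation, normalization removes scale, and the
    modulus of the inner product optimizes out rotation."""
    z1 = z1 - z1.mean()
    z2 = z2 - z2.mean()
    z1 = z1 / np.linalg.norm(z1)
    z2 = z2 / np.linalg.norm(z2)
    return float(np.sqrt(max(0.0, 1.0 - abs(np.vdot(z1, z2)) ** 2)))

z = np.array([1 + 1j, 2 - 1j, -1 + 0.5j, 0 - 2j])       # toy boundary samples
d_rot = procrustes_distance(z, z * np.exp(1j * 0.7))     # rotated copy: ~0
d_other = procrustes_distance(z, np.array([0 + 0j, 1 + 0j, 1 + 1j, 0 + 1j]))
```

A mean shape under this distance gives the compact "static information of body" exemplar against which the nearest exemplar classifier compares.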
TL;DR: By transforming a photo image into a sketch, this work reduces the difference between photo and sketch significantly, thus allowing effective matching between the two, and demonstrates the efficacy of the algorithm.
Abstract: Automatic retrieval of face images from police mug-shot databases is critically important for law enforcement agencies. It can effectively help investigators to locate or narrow down potential suspects. However, in many cases, a photo image of a suspect is not available and the best substitute is often a sketch drawing based on the recollection of an eyewitness. We present a novel photo retrieval system using face sketches. By transforming a photo image into a sketch, we reduce the difference between photo and sketch significantly, thus allowing effective matching between the two. Experiments over a data set containing 188 people clearly demonstrate the efficacy of the algorithm.
TL;DR: The perceptual requirements for 3-D TV that can be extracted from the literature are summarized and issues that require further investigation are addressed in order for 3D TV to be a success.
Abstract: A high-quality three-dimensional (3-D) broadcast service (3-D TV) is becoming increasingly feasible based on various recent technological developments combined with an enhanced understanding of 3-D perception and human factors issues surrounding 3-D TV. In this paper, 3-D technology and perceptually relevant issues, in particular 3-D image quality and visual comfort, in relation to 3-D TV systems are reviewed. The focus is on near-term displays for broadcast-style single- and multiple-viewer systems. We discuss how an image quality model for conventional two-dimensional images needs to be modified to be suitable for image quality research for 3-D TV. In this respect, studies are reviewed that have focused on the relationship between subjective attributes of 3-D image quality and physical system parameters that induce them (e.g., parameter choices in image acquisition, compression, and display). In particular, artifacts that may arise in 3-D TV systems are addressed, such as keystone distortion, depth-plane curvature, puppet theater effect, cross talk, cardboard effect, shear distortion, picket-fence effect, and image flipping. In conclusion, we summarize the perceptual requirements for 3-D TV that can be extracted from the literature and address issues that require further investigation in order for 3-D TV to be a success.
TL;DR: The proposed approach to personal verification using the thermal images of palm-dorsa vein patterns is valid and effective for vein-pattern verification and introduces a logical and reasonable method to select a trained threshold for verification.
Abstract: A novel approach to personal verification using the thermal images of palm-dorsa vein patterns is presented in this paper. The characteristics of the proposed method are that no prior knowledge about the objects is necessary and the parameters can be set automatically. In our work, an infrared (IR) camera is adopted as the input device to capture the thermal images of the palm-dorsa. In the proposed approach, two of the finger webs are automatically selected as the datum points to define the region of interest (ROI) on the thermal images. Within each ROI, feature points of the vein patterns (FPVPs) are extracted by modifying the basic tool of watershed transformation based on the properties of thermal images. According to the heat conduction law (the Fourier law), multiple features can be extracted from each FPVP for verification. Multiresolution representations of images with FPVPs are obtained using multiple multiresolution filters (MRFs) that extract the dominant points by filtering miscellaneous features for each FPVP. A hierarchical integrating function is then applied to integrate multiple features and multiresolution representations. The former is integrated by an inter-to-intra personal variation ratio and the latter is integrated by a positive Boolean function. We also introduce a logical and reasonable method to select a trained threshold for verification. Experiments were conducted using the thermal images of palm-dorsas and the results are satisfactory, with acceptable error rates (FRR: 2.3% and FAR: 2.3%). The experimental results demonstrate that our proposed approach is valid and effective for vein-pattern verification.
TL;DR: The criteria that should be satisfied by a descriptor for nonrigid shapes with a single closed contour are discussed and a shape representation method that fulfills these criteria is proposed that is very efficient and invariant to several kinds of transformations.
Abstract: In this paper, we discuss the criteria that should be satisfied by a descriptor for nonrigid shapes with a single closed contour. We then propose a shape representation method that fulfills these criteria. In the proposed approach, contour convexities and concavities at different scale levels are represented using a two-dimensional (2-D) matrix. The representation can be visualized as a 2-D surface, where "hills" and "valleys" represent contour convexities and concavities, respectively. The optimal matching of two shape representations is achieved using dynamic programming and a dissimilarity measure is defined based on this matching. The proposed algorithm is very efficient and invariant to several kinds of transformations including some articulations and modest occlusions. The retrieval performance of the approach is illustrated using the MPEG-7 shape database, which is one of the most complete shape databases currently available. Our experiments indicate that the proposed representation is well suited for object indexing and retrieval in large databases. Furthermore, the representation can be used as a starting point to obtain more compact descriptors.
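The optimal matching of two shape representations by dynamic programming can be sketched with a DTW-style recurrence. The one-dimensional version below is a simplified stand-in (the paper matches columns of the 2-D convexity/concavity matrix, but the recurrence has the same shape):

```python
def dp_dissimilarity(a, b):
    """DTW-style dynamic-programming matching cost between two contour
    feature sequences: each cell extends the cheapest of the three
    neighboring partial alignments."""
    INF = float("inf")
    n, m = len(a), len(b)
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = abs(a[i - 1] - b[j - 1])           # local mismatch cost
            D[i][j] = c + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

same = dp_dissimilarity([0.1, 0.9, 0.3], [0.1, 0.9, 0.3])   # identical: 0.0
diff = dp_dissimilarity([0.1, 0.9, 0.3], [0.8, 0.1, 0.7])
```

Allowing the horizontal and vertical moves is what tolerates the articulations and modest occlusions mentioned in the abstract: a stretched or missing contour segment costs only its local mismatches rather than breaking the whole alignment.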
TL;DR: This paper shows how adaptive media playout (AMP), the variation of the playout speed of media frames depending on channel conditions, allows the client to buffer less data, thus introducing less delay, for a given buffer underflow probability.
Abstract: When media is streamed over best-effort networks, media data is buffered at the client to protect against playout interruptions due to packet losses and random delays. While the likelihood of an interruption decreases as more data is buffered, the latency that is introduced increases. In this paper we show how adaptive media playout (AMP), the variation of the playout speed of media frames depending on channel conditions, allows the client to buffer less data, thus introducing less delay, for a given buffer underflow probability. We proceed by defining models for the streaming media system and the random, lossy, packet delivery channel. Our streaming system model buffers media at the client, and combats packet losses with deadline-constrained automatic repeat request (ARQ). For the channel, we define a two-state Markov model that features state-dependent packet loss probability. Using the models, we develop a Markov chain analysis to examine the tradeoff between buffer underflow probability and latency for AMP-augmented video streaming. The results of the analysis, verified with simulation experiments, indicate that AMP can greatly improve the tradeoff, allowing reduced latencies for a given buffer underflow probability.
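The two-state Markov channel with state-dependent packet loss (a Gilbert-style model) can be simulated directly; the transition and loss probabilities below are arbitrary placeholders, not values from the paper:

```python
import random

def simulate_channel(n, p_gb, p_bg, loss_good, loss_bad, seed=1):
    """Two-state Markov channel: 'good' and 'bad' states, each with its own
    packet-loss probability; p_gb and p_bg are the transition probabilities
    good->bad and bad->good. Returns the empirical loss rate."""
    random.seed(seed)
    state = "good"
    losses = 0
    for _ in range(n):
        if random.random() < (loss_good if state == "good" else loss_bad):
            losses += 1
        # Possibly switch state for the next packet
        if state == "good" and random.random() < p_gb:
            state = "bad"
        elif state == "bad" and random.random() < p_bg:
            state = "good"
    return losses / n

rate = simulate_channel(100_000, p_gb=0.01, p_bg=0.1,
                        loss_good=0.001, loss_bad=0.3)
```

The stationary probability of the bad state is p_gb/(p_gb + p_bg), so the long-run loss rate here is about 0.091 x 0.3 + 0.909 x 0.001, roughly 2.8%; the Markov structure additionally makes losses bursty, which is exactly what makes client buffering (and AMP) matter.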
TL;DR: A new kernel function, called the cosine kernel, is proposed to increase the discriminating capability of the original polynomial kernel function and a geometry-based feature vector selection scheme is adopted to reduce the computational complexity of KFDA.
Abstract: This work is a continuation and extension of our previous research where kernel Fisher discriminant analysis (KFDA), a combination of the kernel trick with Fisher linear discriminant analysis (FLDA), was introduced to represent facial features for face recognition. This work makes three main contributions to further improving the performance of KFDA. First, a new kernel function, called the cosine kernel, is proposed to increase the discriminating capability of the original polynomial kernel function. Second, a geometry-based feature vector selection scheme is adopted to reduce the computational complexity of KFDA. Third, a variant of the nearest feature line classifier is employed to enhance the recognition performance further as it can produce virtual samples to make up for the shortage of training samples. Experiments have been carried out on a mixed database with 125 persons and 970 images and they demonstrate the effectiveness of the improvements.
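A common way to build a cosine kernel is to normalize the polynomial kernel so that it measures the cosine of the angle between feature-space vectors; the sketch below follows that standard construction, which may differ in detail from the paper's exact definition:

```python
import numpy as np

def poly_kernel(x, y, d=2, c=1.0):
    """Polynomial kernel (x.y + c)^d."""
    return (np.dot(x, y) + c) ** d

def cosine_kernel(x, y, d=2, c=1.0):
    """Normalized polynomial kernel: the cosine of the angle between the
    images of x and y in the induced feature space, so values lie in [-1, 1]."""
    return poly_kernel(x, y, d, c) / np.sqrt(
        poly_kernel(x, x, d, c) * poly_kernel(y, y, d, c))

x = np.array([1.0, 2.0])
k_self = cosine_kernel(x, x)                    # self-similarity is always 1
k_xy = cosine_kernel(x, np.array([2.0, -1.0]))  # bounded similarity score
```

Bounding the kernel values removes the scale sensitivity of the raw polynomial kernel, which is one plausible reason such a normalization improves discrimination in KFDA.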
TL;DR: This paper introduces bidirectional motion compensated temporal filtering with unconnected pixel detection and I blocks and incorporates a recently suggested lifting implementation of the subband/wavelet filter for improved MV accuracy in an MC-EZBC coder.
Abstract: In conventional motion-compensated three-dimensional subband/wavelet coding, where the motion compensation is unidirectional, incorrect classification of connected and unconnected pixels caused by incorrect motion vectors (MVs) has resulted in some coding inefficiency and visual artifacts in the embedded low-frame-rate video. In this paper, we introduce bidirectional motion compensated temporal filtering with unconnected pixel detection and I blocks. We also incorporate a recently suggested lifting implementation of the subband/wavelet filter for improved MV accuracy in an MC-EZBC coder. Simulation results compare PSNR performance of this new version of MC-EZBC versus H.26L under the constraint of equal group-of-pictures (GOP) size, and show a general parity with this state-of-the-art nonscalable coder on several test clips.
TL;DR: A noniterative, wavelet-based deblocking algorithm that can suppress both block discontinuities and ringing artifacts effectively while preserving true edges and textural information is proposed.
Abstract: It is well known that, at low bit rates, block discrete cosine transform (DCT) compressed images exhibit visually annoying blocking and ringing artifacts. In this paper, we propose a noniterative, wavelet-based deblocking algorithm to reduce both types of artifacts. The algorithm exploits the fact that block discontinuities are constrained by the dc quantization interval of the quantization table, as well as the behavior of wavelet modulus maxima evolution across wavelet scales to derive appropriate threshold maps at different wavelet scales. Since ringing artifacts occur near strong edges, which can be located either along block boundaries or within blocks, suppression of block discontinuities does not always reduce ringing artifacts. By exploiting the behavior of ringing artifacts in the wavelet domain, we propose a simple yet effective method for the suppression of such artifacts. The proposed algorithm can suppress both block discontinuities and ringing artifacts effectively while preserving true edges and textural information. Simulation results and extensive comparative study with both iterative and noniterative methods reported in the literature have shown the effectiveness of our algorithm.
TL;DR: An enhanced hexagonal search algorithm is proposed to further improve the performance in terms of reducing the number of search points and distortion, where a novel fast inner search is employed by exploiting the distortion information of the evaluated points.
Abstract: Fast block motion estimation normally consists of low-resolution coarse search and the following fine-resolution inner search. Most motion estimation algorithms developed attempt to speed up the coarse search without considering accelerating the focused inner search. On top of the hexagonal search method recently developed, an enhanced hexagonal search algorithm is proposed to further improve the performance in terms of reducing the number of search points and distortion, where a novel fast inner search is employed by exploiting the distortion information of the evaluated points. Our experimental results substantially justify the merits of the proposed algorithm.
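The underlying hexagon-based search repeats a 7-point large-hexagon step until the center wins, then does a small refinement around the final center; the enhanced algorithm additionally reuses the distortions of already-evaluated points to focus that inner step. A minimal sketch of the basic pattern, with a toy cost function standing in for block distortion:

```python
# Large hexagon (center + 6 vertices) and small refinement cross.
LARGE_HEX = [(0, 0), (-2, 0), (2, 0), (-1, 2), (1, 2), (-1, -2), (1, -2)]
SMALL = [(0, 0), (-1, 0), (1, 0), (0, -1), (0, 1)]

def hexagon_search(cost, start=(0, 0)):
    """Move the large hexagon toward its cheapest point until the center is
    best, then refine with the small cross. (A real coder would cache
    distortions instead of re-evaluating points, as the enhanced version does.)"""
    cx, cy = start
    while True:
        best = min(((cost(cx + dx, cy + dy), (cx + dx, cy + dy))
                    for dx, dy in LARGE_HEX), key=lambda t: t[0])
        if best[1] == (cx, cy):
            break
        cx, cy = best[1]
    best = min(((cost(cx + dx, cy + dy), (cx + dx, cy + dy))
                for dx, dy in SMALL), key=lambda t: t[0])
    return best[1]

# Toy cost: squared distance to the "true" motion vector (5, -3).
mv = hexagon_search(lambda x, y: (x - 5) ** 2 + (y + 3) ** 2)
```

On this toy cost surface the search walks the large hexagon to (5, -2) and the small cross then lands on (5, -3), evaluating far fewer points than a full search over the window.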
TL;DR: Experimental results of the application of the segmentation algorithm to known sequences demonstrate the efficiency of the proposed segmentation approach and reveal the potential of employing this segmentation algorithms as part of an object-based video indexing and retrieval scheme.
Abstract: In this paper, a novel algorithm is presented for the real-time, compressed-domain, unsupervised segmentation of image sequences and is applied to video indexing and retrieval. The segmentation algorithm uses motion and color information directly extracted from the MPEG-2 compressed stream. An iterative rejection scheme based on the bilinear motion model is used to effect foreground/background segmentation. Following that, meaningful foreground spatiotemporal objects are formed by initially examining the temporal consistency of the output of iterative rejection, clustering the resulting foreground macroblocks to connected regions and finally performing region tracking. Background segmentation to spatiotemporal objects is additionally performed. MPEG-7 compliant low-level descriptors describing the color, shape, position, and motion of the resulting spatiotemporal objects are extracted and are automatically mapped to appropriate intermediate-level descriptors forming a simple vocabulary termed object ontology. This, combined with a relevance feedback mechanism, allows the qualitative definition of the high-level concepts the user queries for (semantic objects, each represented by a keyword) and the retrieval of relevant video segments. Desired spatial and temporal relationships between the objects in multiple-keyword queries can also be expressed, using the shot ontology. Experimental results of the application of the segmentation algorithm to known sequences demonstrate the efficiency of the proposed segmentation approach. Sample queries reveal the potential of employing this segmentation algorithm as part of an object-based video indexing and retrieval scheme.
TL;DR: An algorithm of generating video texture on the reconstructed dynamic 3-D object surface by introducing image-based rendering techniques and Experimental results demonstrate the effectiveness of the improved method in generating high fidelity object images from arbitrary viewpoints.
Abstract: Three-dimensional (3-D) video is a real 3-D movie recording the object's full 3-D shape, motion, and precise surface texture. This paper first proposes a parallel pipeline processing method for reconstructing a dynamic 3-D object shape from multiview video images, by which a temporal series of full 3-D voxel representations of the object behavior can be obtained in real time. To realize the real-time processing, we first introduce a plane-based volume intersection algorithm: first represent an observable 3-D space by a group of parallel plane slices, then back-project observed multiview object silhouettes onto each slice, and finally apply two-dimensional silhouette intersection on each slice. Then, we propose a method to parallelize this algorithm using a PC cluster, where we employ five-stage pipeline processing in each PC as well as slice-by-slice parallel silhouette intersection. Several results of the quantitative performance evaluation are given to demonstrate the effectiveness of the proposed methods. In the latter half of the paper, we present an algorithm of generating video texture on the reconstructed dynamic 3-D object surface. We first describe a naive view-independent rendering method and show its problems. Then, we improve the method by introducing image-based rendering techniques. Experimental results demonstrate the effectiveness of the improved method in generating high fidelity object images from arbitrary viewpoints.
TL;DR: This paper proposes an accurate and robust quasi-automatic lip segmentation algorithm that enables an accurate tracking even after hundreds of frames and shows that the mean keypoints' tracking errors of the algorithm are comparable to manual points' selection errors.
Abstract: Lip segmentation is an essential stage in many multimedia systems such as videoconferencing, lip reading, or low-bit-rate coding communication systems. In this paper, we propose an accurate and robust quasi-automatic lip segmentation algorithm. First, the upper mouth boundary and several characteristic points are detected in the first frame by using a new kind of active contour: the "jumping snake." Unlike classic snakes, it can be initialized far from the final edge and the adjustment of its parameters is easy and intuitive. Then, to achieve the segmentation, we propose a parametric model composed of several cubic curves. Its high flexibility enables accurate lip contour extraction even in the challenging case of a very asymmetric mouth. Compared to existing models, it brings a significant improvement in accuracy and realism. The segmentation in the following frames is achieved by using an interframe tracking of the keypoints and the model parameters. However, we show that, with a usual tracking algorithm, the keypoints' positions become unreliable after a few frames. We therefore propose an adjustment process that enables an accurate tracking even after hundreds of frames. Finally, we show that the mean keypoints' tracking errors of our algorithm are comparable to manual points' selection errors.
TL;DR: A hand and face segmentation methodology using color and motion cues for the content-based representation of sign language video sequences and derives a segmentation threshold for the classifier.
Abstract: We present a hand and face segmentation methodology using color and motion cues for the content-based representation of sign language video sequences. The methodology consists of three stages: skin-color segmentation; change detection; face and hand segmentation mask generation. In skin-color segmentation, a universal color model is derived and image pixels are classified as skin or nonskin based on their Mahalanobis distance. We derive a segmentation threshold for the classifier. The aim of change detection is to localize moving objects in a video sequence. The change detection technique is based on the F test and block-based motion estimation. Finally, the results from skin-color segmentation and change detection are analyzed to segment the face and hands. The performance of the algorithm is illustrated by simulations carried out on standard test sequences.
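A Mahalanobis-distance skin classifier can be sketched as follows; the chrominance data are synthetic and the threshold is an arbitrary placeholder (the paper derives its threshold from the classifier, whereas a fixed value is used here):

```python
import numpy as np

def fit_skin_model(skin_pixels):
    """Mean and inverse covariance of training skin samples
    (e.g., CbCr chrominance pairs)."""
    mu = skin_pixels.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(skin_pixels, rowvar=False))
    return mu, cov_inv

def mahalanobis_sq(x, mu, cov_inv):
    """Squared Mahalanobis distance of pixel x to the skin-color model."""
    d = x - mu
    return float(d @ cov_inv @ d)

# Synthetic training chrominance samples clustered around (150, 120).
rng = np.random.default_rng(3)
train = rng.normal([150.0, 120.0], [5.0, 5.0], size=(500, 2))
mu, cov_inv = fit_skin_model(train)

thresh = 9.0  # hypothetical threshold, roughly "within 3 sigma"
is_skin = mahalanobis_sq(np.array([151.0, 119.0]), mu, cov_inv) < thresh
not_skin = mahalanobis_sq(np.array([60.0, 200.0]), mu, cov_inv) < thresh
```

Using the covariance (rather than plain Euclidean distance) lets the decision boundary follow the elongated shape of the skin-color cluster in chrominance space.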
TL;DR: A new fast algorithm for stereo analysis is proposed, which circumvents the window search by using a hybrid recursive matching strategy based on the effective selection of a small number of candidates.
Abstract: Real-time stereo analysis is an important research area in computer vision. In this context, we propose a stereo algorithm for an immersive video-conferencing system by which conferees at different geographical places can meet under similar conditions as in the real world. For this purpose, virtual views of the remote conferees are generated and adapted to the current viewpoint of the local participant. Dense vector fields of high accuracy are required in order to guarantee an adequate quality of the virtual views. Due to the usage of a wide baseline system with strongly convergent camera configurations, the dynamic disparity range is about 150 pixels. Considering computational costs, a full search or even a local search restricted to a small window of a few pixels, as it is implemented in many real-time algorithms, is not suitable for our application, because processing of full-resolution video according to the CCIR 601 TV standard at 25 frames per second is targeted, ideally as a pure software solution running on available processors without any support from dedicated hardware. Therefore, we propose in this paper a new fast algorithm for stereo analysis, which circumvents the window search by using a hybrid recursive matching strategy based on the effective selection of a small number of candidates. However, stereo analysis requires more than a straightforward application of stereo matching. The crucial problem is to produce accurate stereo correspondences in all parts of the image. In particular, errors in occluded regions and homogeneous or less structured regions lead to disturbing artifacts in the synthesized virtual views. To cope with this problem, mismatches have to be detected and substituted by a sophisticated interpolation and extrapolation scheme.
TL;DR: A method for automatically estimating the number of objects and extracting independently moving video objects using motion vectors is presented here and a strategy for edge refinement is proposed to extract the precise object boundaries.
Abstract: This paper addresses the problem of extracting video objects from MPEG compressed video. The only cues used for object segmentation are the motion vectors which are sparse in MPEG. A method for automatically estimating the number of objects and extracting independently moving video objects using motion vectors is presented here. First, the motion vectors are accumulated over a few frames to enhance the motion information, which are further spatially interpolated to get dense motion vectors. The final segmentation, using the dense motion vectors, is obtained by applying the expectation maximization (EM) algorithm. A block-based affine clustering method is proposed for determining the number of appropriate motion models to be used for the EM step and the segmented objects are temporally tracked to obtain the video objects. Finally, a strategy for edge refinement is proposed to extract the precise object boundaries. Illustrative examples are provided to demonstrate the efficacy of the approach. A prominent application of the proposed method is that of object-based coding, which is part of the MPEG-4 standard.
TL;DR: An overview is presented here of the MPEG activity exploring the need for standardization in this area to support these new applications, under the name of 3DAV (for 3-D audio-visual); as an example, a detailed solution for omnidirectional video is presented as one of the application scenarios in 3DAV.
Abstract: New kinds of media are emerging that extend the functionality of available technology. The growth of immersive recording technologies has led to video-based rendering systems for photographing and reproducing environments in motion. This lends itself to new forms of interactivity for the viewer, including the ability to explore a photographic scene and interact with its features. The three-dimensional (3-D) qualities of objects in the scene can be extracted by analysis techniques and displayed by the use of stereo vision. The data types and image bandwidth needed for this type of media experience may require especially efficient formats for representation, coding, and transmission. An overview is presented here of the MPEG activity exploring the need for standardization in this area to support these new applications, under the name of 3DAV (for 3-D audio-visual). As an example, a detailed solution for omnidirectional video is presented as one of the application scenarios in 3DAV.
TL;DR: In this article, a hierarchical multifeature coding scheme is proposed to facilitate coarse-to-fine matching for efficient and effective palmprint verification and identification in a large database, where four levels of features are defined: global geometry-based key point distance (Level-1 feature), global texture energy (Level-2 feature), fuzzy "interest" line (Level-3 feature), and local directional texture energy (Level-4 feature).
Abstract: Automatic personal identification is a significant component of security systems with many challenges and practical applications. The advances in biometric technology have led to the very rapid growth in identity authentication. This paper presents a new approach to personal identification using palmprints. To tackle the key issues such as feature extraction, representation, indexing, similarity measurement, and fast search for the best match, we propose a hierarchical multifeature coding scheme to facilitate coarse-to-fine matching for efficient and effective palmprint verification and identification in a large database. In our approach, four levels of features are defined: global geometry-based key point distance (Level-1 feature), global texture energy (Level-2 feature), fuzzy "interest" line (Level-3 feature), and local directional texture energy (Level-4 feature). In contrast to the existing systems that employ a fixed mechanism for feature extraction and similarity measurement, we extract multiple features and adopt different matching criteria at different levels to achieve high performance by a coarse-to-fine guided search. The proposed method has been tested on a database with 7752 palmprint images from 386 different palms. The use of Level-1, Level-2, and Level-3 features can remove 9.6%, 7.8%, and 60.6% of the candidates from the database, respectively. For a system embedded with an Intel Pentium III processor (500 MHz), the execution time of the simulation of our hierarchical coding scheme for a large database with 10^6 palmprint samples is 2.8 s, while the traditional sequential approach requires 6.7 s, with a 4.5% verification equal error rate. Our experimental results demonstrate the feasibility and effectiveness of the proposed method.
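The coarse-to-fine guided search can be sketched generically: each level applies a cheap matcher to prune the candidate list before the next, more expensive level runs. The records, features, and thresholds below are toy placeholders, not the paper's actual Level-1 to Level-4 features:

```python
def hierarchical_search(db, probe, levels):
    """Coarse-to-fine guided search: each level keeps only the candidates
    whose distance to the probe, under that level's feature, is within the
    level's threshold. Later (costlier) levels see fewer candidates."""
    candidates = list(db)
    for feature, threshold in levels:
        candidates = [c for c in candidates
                      if abs(feature(c) - feature(probe)) <= threshold]
    return candidates

# Toy database: (id, coarse_feature, fine_feature) records.
db = [("A", 10, 1.0), ("B", 11, 5.0), ("C", 50, 1.1), ("D", 10, 1.05)]
probe = ("?", 10, 1.0)
levels = [(lambda r: r[1], 2),      # cheap coarse feature, loose threshold
          (lambda r: r[2], 0.1)]    # expensive fine feature, tight threshold
hits = hierarchical_search(db, probe, levels)
```

Here the coarse level discards "C" before the fine level ever examines it, which is the source of the speedup the abstract reports over the sequential approach.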
TL;DR: A semantic indexing algorithm which uses both audio and visual information for salient event detection in soccer, using camera motion information as a visual cue and the "loudness" as an audio descriptor is proposed.
Abstract: Content characterization of sport videos is a subject of great interest to researchers working on the analysis of multimedia documents. In this paper, we propose a semantic indexing algorithm which uses both audio and visual information for salient event detection in soccer. The video signal is processed first by extracting low-level visual descriptors directly from an MPEG-2 bit stream. It is assumed that any instance of an event of interest typically affects two consecutive shots and is characterized by a different temporal evolution of the visual descriptors in the two shots. This motivates the introduction of a controlled Markov chain to describe such evolution during an event of interest, with the control input modeling the occurrence of a shot transition. After adequately training different controlled Markov chain models, a list of video segments can be extracted to represent a specific event of interest using the maximum likelihood criterion. To reduce the presence of false alarms, low-level audio descriptors are processed to order the candidate video segments in the list so that those associated to the event of interest are likely to be found in the very first positions. We focus in particular on goal detection, which represents a key event in a soccer game, using camera motion information as a visual cue and the "loudness" as an audio descriptor. The experimental results show the effectiveness of the proposed multimodal approach.
TL;DR: A postprocessing method for the correction of visual demosaicking artifacts is introduced, which impressively removes false colors while maintaining image sharpness and yields excellent improvements in terms of objective image quality measures.
Abstract: A postprocessing method for the correction of visual demosaicking artifacts is introduced. The restored, full-color images previously obtained by cost-effective color filter array interpolators are processed to improve their visual quality. Based on a localized color ratio model and the original underlying Bayer pattern structure, the proposed solution impressively removes false colors while maintaining image sharpness. At the same time, it yields excellent improvements in terms of objective image quality measures.
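The general idea of a localized color-ratio correction can be sketched as follows. This is an illustration of ratio-based chrominance re-estimation under assumed toy values, not the paper's exact model or its use of the Bayer pattern geometry.

```python
# Hedged sketch of a color-ratio correction step: at a pixel whose
# reference channel is known exactly from the Bayer sample, a chroma
# channel is re-estimated so that its ratio to the reference channel
# matches the local neighborhood's average ratio.
def correct_pixel(ref_center, ref_neighbors, chroma_neighbors, eps=1e-6):
    """Re-estimate the chroma value at a sampled (original Bayer)
    location; eps guards against division by zero."""
    ratios = [c / (r + eps) for c, r in zip(chroma_neighbors, ref_neighbors)]
    local_ratio = sum(ratios) / len(ratios)
    return ref_center * local_ratio

# Neighbors all show chroma at about twice the reference value, so the
# corrected center restores that ratio, suppressing a false color.
corrected = correct_pixel(100.0, [50.0, 60.0], [100.0, 120.0])
print(round(corrected, 3))
```

Enforcing locally smooth color ratios is what removes isolated false-color pixels while leaving genuine edges (where the reference channel itself changes) intact.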
TL;DR: This work argues that integrating these two approaches and allowing them to benefit from each other will yield better performance than using either of them alone.
Abstract: Relevance feedback and region-based representations are two effective ways to improve the accuracy of content-based image retrieval systems. Although these two techniques have been successfully investigated and developed in the last few years, little attention has been paid to combining them together. We argue that integrating these two approaches and allowing them to benefit from each other will yield better performance than using either of them alone. To do that, on the one hand, two relevance feedback algorithms are proposed based on region representations. One is inspired from the query point movement method. By assembling all of the segmented regions of positive examples together and reweighting the regions to emphasize the latest ones, a pseudo image is formed as the new query. An incremental clustering technique is also considered to improve the retrieval efficiency. The other is the introduction of existing support vector machine-based algorithms. A new kernel is proposed so as to enable the algorithms to be applicable to region-based representations. On the other hand, a rational region weighting scheme based on users' feedback information is proposed. The region weights that somewhat coincide with human perception not only can be used in a query session, but can also be memorized and accumulated for future queries. Experimental results on a database of 10 000 general-purpose images demonstrate the effectiveness of the proposed framework.
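The pseudo-image idea, with later feedback rounds weighted more heavily, can be sketched with a hypothetical exponential decay; the actual reweighting scheme and region features in the paper differ.

```python
# Illustrative sketch: regions from all positive examples are pooled
# into a "pseudo image", with regions from later feedback rounds
# weighted more heavily (decay and weights are hypothetical).
def build_pseudo_image(feedback_rounds, decay=0.5):
    """feedback_rounds: list of rounds, oldest first; each round is a
    list of (region_feature, area) pairs from positive examples."""
    pseudo = []
    n = len(feedback_rounds)
    for age, regions in enumerate(feedback_rounds):
        w = decay ** (n - 1 - age)          # latest round -> weight 1.0
        for feature, area in regions:
            pseudo.append((feature, w * area))
    total = sum(wt for _, wt in pseudo)
    return [(f, wt / total) for f, wt in pseudo]  # normalized weights

rounds = [[("sky", 1.0)], [("grass", 1.0), ("sky", 1.0)]]
pseudo = build_pseudo_image(rounds)
print(pseudo)
```

The normalized weights form the new query: regions confirmed in the latest round dominate the similarity computation, while older evidence still contributes and can be carried over to future query sessions.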
TL;DR: An optimization-based system that automates home video editing: it automatically selects suitable or desirable highlight segments from a set of raw home videos and aligns them with a given piece of incidental music to create an edited video of a desired length, based on the content of both the video and the music.
Abstract: In this paper, we present an optimization-based system that automates home video editing. This system automatically selects suitable or desirable highlight segments from a set of raw home videos and aligns them with a given piece of incidental music to create an edited video segment of a desired length based on the content of the video and incidental music. We developed an approach for extracting temporal structure and determining the importance of a video segment in order to facilitate the selection of highlight segments. Additionally, we extract the temporal structure, beats, and tempos from the incidental music. In order to create more professional-looking results, the selected highlight segments satisfy a set of editing rules and are matched to the content of the incidental music. This task is formulated as a nonlinear 0-1 programming problem, and the rules, which are adjustable and extensible, are embedded as constraints. The output video is rendered by connecting the selected highlight video segments with transition effects and the incidental music. Under this framework, we can choose the best-matched music for a given video and support different output styles.
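The 0-1 programming formulation can be illustrated with a toy instance: each segment gets a binary variable, the objective is total importance, and an editing rule (here, a target duration) enters as a constraint. The scores, durations, and the brute-force search are illustrative only.

```python
from itertools import product

# Toy 0-1 programming sketch: choose a subset of segments maximizing
# total importance subject to a maximum duration, one of many possible
# editing-rule constraints.
def select_segments(segments, max_duration):
    """segments: list of (importance, duration). Exhaustive search is
    fine for illustration; real systems need a proper 0-1 solver."""
    best_score, best_choice = -1.0, None
    for choice in product([0, 1], repeat=len(segments)):
        dur = sum(d for x, (_, d) in zip(choice, segments) if x)
        if dur > max_duration:
            continue
        score = sum(s for x, (s, _) in zip(choice, segments) if x)
        if score > best_score:
            best_score, best_choice = score, choice
    return best_choice, best_score

segs = [(5.0, 10.0), (3.0, 4.0), (4.0, 5.0)]
choice, score = select_segments(segs, max_duration=10.0)
print(choice, score)  # -> (0, 1, 1) 7.0
```

Additional rules (e.g., aligning cuts to music beats) would add further constraints over the same binary variables, which is why the framework can absorb adjustable and growing rule sets.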
TL;DR: A new distortion-minimized bit allocation scheme with hybrid unequal error protection (UEP) and delay-constrained automatic repeat request (ARQ) is proposed, which dynamically adapts to the estimated time-varying network conditions.
Abstract: The paper addresses the important issues of resource allocation for scalable video transmission over third generation (3G) wireless networks. By taking the time-varying wireless channel/network condition and scalable video codec characteristic into account, we allocate resources between source and channel coders based on the minimum-distortion or minimum-power consumption criterion. Specifically, we first present how to estimate the time-varying wireless channel/network condition through measurements of throughput and error rate in a 3G wireless network. Then, we propose a new distortion-minimized bit allocation scheme with hybrid unequal error protection (UEP) and delay-constrained automatic repeat request (ARQ), which dynamically adapts to the estimated time-varying network conditions. Furthermore, a novel power-minimized bit allocation scheme with channel-adaptive hybrid UEP and delay-constrained ARQ is proposed for mobile devices. In our proposed distortion/power-minimized bit-allocation scheme, bits are optimally distributed among source coding, forward error correction, and ARQ according to the varying channel/network condition. Simulation and analysis are performed using a progressive fine granularity scalability video codec. The simulation results show that our proposed schemes can significantly improve the reconstructed video quality under the same network conditions.
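The core trade-off, splitting a bit budget between source coding and channel protection under the current channel estimate, can be sketched with toy models. The rate-distortion curve and residual-loss function below are invented stand-ins for the estimates the paper derives from throughput and error-rate measurements.

```python
# Hedged sketch of distortion-minimized source/channel bit allocation.
def allocate_bits(total_bits, step, src_distortion, residual_loss, d_loss):
    """Search splits of total_bits between source coding and FEC,
    minimizing expected distortion under the channel estimate."""
    best = None
    for fec in range(0, total_bits + 1, step):
        src = total_bits - fec
        expected = (1 - residual_loss(fec)) * src_distortion(src) \
                   + residual_loss(fec) * d_loss
        if best is None or expected < best[0]:
            best = (expected, src, fec)
    return best  # (expected distortion, source bits, FEC bits)

best = allocate_bits(
    total_bits=1000, step=100,
    src_distortion=lambda b: 1000.0 / (b + 1),         # toy R-D curve
    residual_loss=lambda f: 0.3 * 0.5 ** (f / 100),    # more FEC -> fewer losses
    d_loss=500.0)                                      # distortion if lost
print(best)
```

As the channel worsens (residual loss rises), the optimum shifts bits from source coding toward protection, which is exactly the adaptation the distortion-minimized scheme performs online; a power-minimized variant swaps the objective while keeping the same search structure.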
TL;DR: The concepts of presence, immersion, and co-presence are introduced, along with their relation to virtual collaborative environments in the context of communications; calibration, multiple-view analysis, tracking, and view synthesis are identified as the fundamental image-processing modules for real-time, highly realistic video with adaptive viewpoint.
Abstract: This survey paper discusses the three-dimensional image processing challenges posed by present and future immersive telecommunications, especially immersive video conferencing and television. We introduce the concepts of presence, immersion, and co-presence and discuss their relation to virtual collaborative environments in the context of communications. Several examples are used to illustrate the current state of the art. We highlight the crucial need of real-time, highly realistic video with adaptive viewpoint for future immersive communications and identify calibration, multiple-view analysis, tracking, and view synthesis as the fundamental image-processing modules addressing such a need. For each topic, we sketch the basic problem and representative solutions from the image processing literature.
TL;DR: Zhang et al. propose a novel three-step face detection approach, adopting a simple-to-complex strategy, to address the problem of automatic human face detection from images in surveillance and biometric applications.
Abstract: Automatic human face detection from images in surveillance and biometric applications is a challenging task due to variations in image background, view, illumination, articulation, and facial expression. We propose a novel three-step face detection approach to address this problem. The approach adopts a simple-to-complex strategy. First, a linear-filtering algorithm is applied to enhance detection performance by rapidly removing most nonface-like candidates. Second, a boosting chain algorithm is adopted to combine the boosting classifiers into a hierarchical "chain" structure. By utilizing the inter-layer discriminative information, this algorithm achieves higher efficiency than traditional approaches. Last, a postfiltering algorithm, consisting of image preprocessing, a support vector machine filter, and a color filter, is applied to refine the final prediction. As only a few candidate windows remain in the final stage, this algorithm greatly improves detection accuracy at small computation cost. Compared with conventional approaches, this three-step approach is shown to be more effective and capable of handling more pose variations. Moreover, together with a two-level hierarchical in-plane pose estimator, a rapid multiview face detector is built. Experimental results demonstrate a significant performance improvement for the proposed approach over others.
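The simple-to-complex strategy amounts to a rejection cascade: cheap stages discard most candidate windows so that expensive stages run on very few. The stage functions and thresholds below are hypothetical placeholders, not the paper's actual classifiers.

```python
# Minimal sketch of a simple-to-complex detection cascade.
def detect(window, stages):
    """stages: list of (classifier_fn, threshold), ordered cheap->costly.
    A window must pass every stage to be declared a face."""
    for classifier, threshold in stages:
        if classifier(window) < threshold:
            return False        # rejected early; no further cost spent
    return True

stages = [
    (lambda w: w["edge_energy"], 0.2),   # fast linear filter
    (lambda w: w["boost_score"], 0.5),   # boosting chain
    (lambda w: w["svm_score"],   0.0),   # SVM post-filter
]
face = {"edge_energy": 0.9, "boost_score": 0.8, "svm_score": 1.2}
background = {"edge_energy": 0.05, "boost_score": 0.9, "svm_score": 2.0}
print(detect(face, stages), detect(background, stages))  # -> True False
```

Note that the background window is rejected by the first, cheapest stage, so its high scores at later stages are never even computed; this asymmetry is what makes the overall detector fast despite the expensive final filter.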
TL;DR: A content-based movie parsing and indexing approach is presented; it analyzes both audio and visual sources and accounts for their interrelations to extract high-level semantic cues, detecting meaningful movie events and assigning them semantic labels for the purpose of content indexing.
Abstract: A content-based movie parsing and indexing approach is presented; it analyzes both audio and visual sources and accounts for their interrelations to extract high-level semantic cues. Specifically, the goal of this work is to extract meaningful movie events and assign them semantic labels for the purpose of content indexing. Three types of key events, namely, 2-speaker dialogs, multiple-speaker dialogs, and hybrid events, are considered. Moreover, speakers present in the detected movie dialogs are further identified based on the audio source parsing. The obtained audio and visual cues are then integrated to index the movie content. Our experiments have shown that an effective integration of the audio and visual sources can lead to a higher level of video content understanding, abstraction, and indexing.
TL;DR: This paper considers two previously proposed fast, variable complexity, forward DCT algorithms, one based on frequency selection, the other based on accuracy selection, and proposes a hybrid algorithm that combines both forms of complexity reduction in order to achieve overall better performance over a broader range of operating rates.
Abstract: The discrete cosine transform (DCT) is one of the major components in most image and video compression systems. The variable complexity algorithm framework has been applied successfully to achieve complexity savings in the computation of the inverse DCT in decoders. These gains can be achieved due to the highly predictable sparseness of the quantized DCT coefficients in natural image/video data. With the increasing demand for instant video messaging and two-way video transmission over mobile communication systems running on general-purpose embedded processors, the encoding complexity needs to be optimized. In this paper, we focus on complexity reduction techniques for the forward DCT, which is one of the more computationally intensive tasks in the encoder. Unlike the inverse DCT, the forward DCT does not operate on sparse input data, but rather generates sparse output data. Thus, complexity reduction must be obtained using different methods from those used for the inverse DCT. In the literature, two major approaches have been applied to speed up the forward DCT computation, namely, frequency selection, in which only a subset of DCT coefficients is computed, and accuracy selection, in which all the DCT coefficients are computed with reduced accuracy. These two approaches can achieve significant computation savings with minor output quality degradation, as long as the coding parameters are such that the quantization error is larger than the error due to the approximate DCT computation. Thus, in order to be useful, these algorithms have to be combined using an efficient mechanism that can select the "right" level of approximation as a function of the characteristics of the input and the target rate, a selection that is often based on heuristic criteria. In this paper, we consider two previously proposed fast, variable complexity, forward DCT algorithms, one based on frequency selection, the other based on accuracy selection.
We provide an explicit analysis of the additional distortion that each scheme introduces as a function of the quantization parameter and the variance of the input block. This analysis then allows us to improve the performance of these algorithms by making it possible to select the best approximation level for each block and a target quantization parameter. We also propose a hybrid algorithm that combines both forms of complexity reduction in order to achieve overall better performance over a broader range of operating rates. We show how our techniques lead to scalable implementations where complexity can be reduced if needed, at the cost of small reductions in video quality. Our hybrid algorithm can speed up the DCT and quantization process by close to a factor of 4 as compared to fixed-complexity forward DCT implementations, with only a slight quality degradation in PSNR.
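Frequency selection can be illustrated directly from the DCT-II definition: compute only the top-left k x k coefficients of an 8x8 block, on the assumption that the remaining high-frequency coefficients would be quantized to zero anyway. This is a textbook sketch of the idea, not the paper's optimized implementation.

```python
import math

# Frequency-selection sketch: compute only the k x k lowest-frequency
# coefficients of the 2-D DCT-II of an n x n block.
def partial_dct_2d(block, k):
    n = len(block)
    def c(u):
        return math.sqrt(1.0 / n) if u == 0 else math.sqrt(2.0 / n)
    out = [[0.0] * n for _ in range(n)]
    for u in range(k):                  # only k of n row frequencies...
        for v in range(k):              # ...and k of n column frequencies
            s = sum(block[x][y]
                    * math.cos((2 * x + 1) * u * math.pi / (2 * n))
                    * math.cos((2 * y + 1) * v * math.pi / (2 * n))
                    for x in range(n) for y in range(n))
            out[u][v] = c(u) * c(v) * s
    return out

flat = [[10.0] * 8 for _ in range(8)]   # constant block: only DC survives
coeffs = partial_dct_2d(flat, k=4)
print(round(coeffs[0][0], 3))  # DC coefficient = 8 * 10 = 80.0
```

Shrinking k cuts the work roughly quadratically, and the skipped coefficients introduce no visible error whenever coarse quantization would have zeroed them anyway, which is the condition under which these approximations are safe.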