
Showing papers in "IEEE Transactions on Multimedia in 2002"


Journal Article•DOI•
TL;DR: A novel watermarking algorithm based on singular value decomposition (SVD) is proposed, and results show that the new watermarking method performs well in both security and robustness.
Abstract: Digital watermarking has been proposed as a solution to the problem of copyright protection of multimedia documents in networked environments. There are two important issues that watermarking algorithms need to address. First, watermarking schemes are required to provide trustworthy evidence for protecting rightful ownership. Second, good watermarking schemes should satisfy the requirement of robustness and resist distortions due to common image manipulations (such as filtering, compression, etc.). In this paper, we propose a novel watermarking algorithm based on singular value decomposition (SVD). Analysis and experimental results show that the new watermarking method performs well in both security and robustness.
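A minimal numpy sketch of the general SVD embedding idea described above (the embedding rule, the strength alpha, and the use of a full-size watermark are illustrative choices, not necessarily the authors' exact scheme): the host's singular values are perturbed by the watermark and the image is rebuilt, with side information kept for later extraction.

```python
import numpy as np

def embed_svd_watermark(host: np.ndarray, watermark: np.ndarray, alpha: float = 0.05):
    """Embed a watermark by perturbing the host image's singular values.

    host:      2-D grayscale image (float array)
    watermark: 2-D array of the same shape as host (illustrative choice)
    alpha:     embedding strength (illustrative value)
    """
    U, S, Vt = np.linalg.svd(host, full_matrices=False)
    # Perturb the diagonal singular-value matrix with the watermark.
    Sw = np.diag(S) + alpha * watermark
    Uw, S1, Vwt = np.linalg.svd(Sw, full_matrices=False)
    watermarked = U @ np.diag(S1) @ Vt
    # Keep (Uw, Vwt, S) as side information for detection.
    return watermarked, (Uw, Vwt, S)

def extract_svd_watermark(watermarked: np.ndarray, side, alpha: float = 0.05):
    """Recover an estimate of the embedded watermark from the (possibly attacked) image."""
    Uw, Vwt, S = side
    _, S_att, _ = np.linalg.svd(watermarked, full_matrices=False)
    D = Uw @ np.diag(S_att) @ Vwt          # undo the inner SVD
    return (D - np.diag(S)) / alpha        # estimate of the watermark

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    host = rng.random((64, 64))
    wm = rng.standard_normal((64, 64))
    marked, side = embed_svd_watermark(host, wm)
    wm_est = extract_svd_watermark(marked, side)
    print("correlation with true watermark:",
          np.corrcoef(wm.ravel(), wm_est.ravel())[0, 1])
```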

978 citations


Journal Article•DOI•
TL;DR: Experimental results show that LSI, together with both textual and visual features, is able to extract the underlying semantic structure of web documents, thus helping to improve the retrieval performance significantly, even when querying is done using only keywords.
Abstract: We present the results of our work that seeks to negotiate the gap between low-level features and high-level concepts in the domain of web document retrieval. This work concerns a technique called latent semantic indexing (LSI), which has been used for textual information retrieval for many years. In this environment, LSI determines clusters of co-occurring keywords so that a query which uses a particular keyword can then retrieve documents that perhaps do not contain this keyword, but contain other keywords from the same cluster. In this paper, we examine the use of this technique for content-based web document retrieval, using both keywords and image features to represent the documents. Two different approaches to image feature representation, namely, color histograms and color anglograms, are adopted and evaluated. Experimental results show that LSI, together with both textual and visual features, is able to extract the underlying semantic structure of web documents, thus helping to improve the retrieval performance significantly, even when querying is done using only keywords.
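A toy sketch of LSI retrieval under the usual truncated-SVD formulation; the term-document matrix, the number of latent dimensions k, and the query fold-in are illustrative, and the paper's visual features (color histograms, anglograms) would simply add rows to the matrix.

```python
import numpy as np

def lsi_retrieval(term_doc: np.ndarray, query: np.ndarray, k: int = 2):
    """Rank documents against a query in a k-dimensional latent semantic space.

    term_doc: (n_terms, n_docs) weighted term/feature-by-document matrix
    query:    (n_terms,) vector in the same feature space
    k:        number of latent dimensions kept (illustrative)
    """
    U, S, Vt = np.linalg.svd(term_doc, full_matrices=False)
    Uk, Sk, Vk = U[:, :k], S[:k], Vt[:k, :]           # truncated SVD
    doc_vecs = Vk.T                                   # documents in latent space
    q_vec = (query @ Uk) / Sk                         # standard query fold-in
    # Cosine similarity between the folded-in query and each document.
    sims = (doc_vecs @ q_vec) / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_vec) + 1e-12)
    return np.argsort(-sims), sims

if __name__ == "__main__":
    # Tiny toy matrix: rows are keywords (or visual features), columns are documents.
    A = np.array([[1, 1, 0, 0],
                  [1, 0, 0, 0],
                  [0, 1, 1, 0],
                  [0, 0, 1, 1],
                  [0, 0, 0, 1]], dtype=float)
    q = np.array([1, 0, 0, 0, 0], dtype=float)        # query uses only the first keyword
    order, scores = lsi_retrieval(A, q)
    print("ranking:", order, "scores:", np.round(scores, 3))
```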

279 citations


Journal Article•DOI•
TL;DR: The experimental results for broadcast sports video of American football games indicate that intermodal collaboration is effective for video indexing by events such as touchdown (TD) and field goal (FG).
Abstract: In this paper, we propose event-based video indexing, a form of indexing by semantic content. Because video data is composed of multimodal information streams such as visual, auditory, and textual [closed caption (CC)] streams, we introduce a strategy of intermodal collaboration, i.e., collaborative processing that takes account of the semantic dependency between these streams. Its aim is to improve the reliability and efficiency of video content analysis. Focusing here on the temporal correspondence between the visual and CC streams, the proposed method attempts to find time spans in which events are likely to take place through extraction of keywords from the CC stream, and then to index shots in the visual stream. The experimental results for broadcast sports video of American football games indicate that intermodal collaboration is effective for video indexing by events such as touchdown (TD) and field goal (FG).
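A small illustrative sketch of the intermodal idea: keyword hits in a time-stamped closed-caption stream define candidate time spans, which are then used to index overlapping shots in the visual stream. The keyword list, time margin, and input formats are assumptions made for illustration.

```python
EVENT_KEYWORDS = {"touchdown": "TD", "goal": "FG"}   # illustrative keyword list

def find_event_spans(cc_words, margin=10.0):
    """cc_words: list of (time_sec, word) pairs from the closed-caption stream.
    Returns (start, end, label) time spans around keyword hits."""
    spans = []
    for t, word in cc_words:
        label = EVENT_KEYWORDS.get(word.lower())
        if label:
            spans.append((max(0.0, t - margin), t + margin, label))
    return spans

def index_shots(shots, spans):
    """shots: list of (shot_start, shot_end) times from the visual stream.
    Tags each shot with the labels of the event spans it overlaps."""
    index = []
    for s0, s1 in shots:
        labels = sorted({lab for a, b, lab in spans if s0 < b and s1 > a})
        index.append(((s0, s1), labels))
    return index

if __name__ == "__main__":
    cc = [(12.0, "great"), (13.1, "touchdown"), (41.0, "field"), (41.4, "goal")]
    shots = [(0, 8), (8, 20), (20, 35), (35, 50)]
    print(index_shots(shots, find_event_spans(cc)))
```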

254 citations


Journal Article•DOI•
TL;DR: The components of bimodal recognizers are reviewed, the accuracy of bimodal recognition is discussed, and some outstanding research issues as well as possible application domains are highlighted; the combination of auditory and visual modalities promises higher recognition accuracy and robustness than can be obtained with a single modality.
Abstract: Speech recognition and speaker recognition by machine are crucial ingredients for many important applications such as natural and flexible human-machine interfaces. Most developments in speech-based automatic recognition have relied on acoustic speech as the sole input signal, disregarding its visual counterpart. However, recognition based on acoustic speech alone can be afflicted with deficiencies that preclude its use in many real-world applications, particularly under adverse conditions. The combination of auditory and visual modalities promises higher recognition accuracy and robustness than can be obtained with a single modality. Multimodal recognition is therefore acknowledged as a vital component of the next generation of spoken language systems. The paper reviews the components of bimodal recognizers, discusses the accuracy of bimodal recognition, and highlights some outstanding research issues as well as possible application domains.

244 citations


Journal Article•DOI•
TL;DR: A statistical model for characterizing texture images based on wavelet-domain hidden Markov models is proposed; once trained, the model can be easily steered to characterize that texture at any other orientation, and a diagonalization operation yields a rotation-invariant model of the texture image.
Abstract: We present a statistical model for characterizing texture images based on wavelet-domain hidden Markov models. With a small number of parameters, the new model captures both the subband marginal distributions and the dependencies across scales and orientations of the wavelet descriptors. Applied to the steerable pyramid, once it is trained for an input texture image, the model can be easily steered to characterize that texture at any other orientation. Furthermore, after a diagonalization operation, we obtain a rotation-invariant model of the texture image. We also propose a fast algorithm to approximate the Kullback-Leibler distance between two wavelet-domain hidden Markov models. We demonstrate the effectiveness of the new texture models in retrieval experiments with large image databases, where significant improvements are shown.
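The paper's model (wavelet-domain hidden Markov models on a steerable pyramid, with an approximate Kullback-Leibler distance) is considerably richer than what fits here; the sketch below only fits per-subband Gaussian marginals on a one-level Haar decomposition and sums symmetric KL divergences, to illustrate the idea of comparing textures by distances between wavelet-domain statistical models.

```python
import numpy as np

def haar_subbands(img: np.ndarray):
    """One-level 2-D Haar decomposition; returns the LH, HL, HH detail subbands."""
    a = img[0::2, 0::2]; b = img[0::2, 1::2]
    c = img[1::2, 0::2]; d = img[1::2, 1::2]
    lh = (a - b + c - d) / 2.0      # horizontal details
    hl = (a + b - c - d) / 2.0      # vertical details
    hh = (a - b - c + d) / 2.0      # diagonal details
    return [lh, hl, hh]

def gaussian_params(subband):
    return float(subband.mean()), float(subband.var() + 1e-9)

def sym_kl_gauss(p, q):
    """Symmetric KL divergence between two 1-D Gaussians given as (mean, variance)."""
    kl = lambda a, b: 0.5 * (np.log(b[1] / a[1]) + (a[1] + (a[0] - b[0]) ** 2) / b[1] - 1)
    return kl(p, q) + kl(q, p)

def texture_distance(img1, img2):
    """Sum of per-subband symmetric KL divergences between marginal Gaussian fits."""
    return sum(sym_kl_gauss(gaussian_params(s1), gaussian_params(s2))
               for s1, s2 in zip(haar_subbands(img1), haar_subbands(img2)))

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    smooth = rng.random((64, 64)).cumsum(axis=1) / 64     # horizontally correlated texture
    noisy = rng.standard_normal((64, 64))                 # white-noise texture
    print("same-texture distance:", round(texture_distance(smooth, smooth), 4))
    print("cross-texture distance:", round(texture_distance(smooth, noisy), 4))
```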

239 citations


Journal Article•DOI•
TL;DR: This work proposes a general active learning framework for content-based information retrieval and uses this framework to guide hidden annotations in order to improve the retrieval performance.
Abstract: We propose a general active learning framework for content-based information retrieval. We use this framework to guide hidden annotations in order to improve the retrieval performance. For each object in the database, we maintain a list of probabilities, each indicating the probability of this object having one of the attributes. During training, the learning algorithm samples objects in the database and presents them to the annotator to assign attributes. For each sampled object, each probability is set to one or zero depending on whether or not the corresponding attribute is assigned by the annotator. For objects that have not been annotated, the learning algorithm estimates their probabilities with biased kernel regression. A knowledge gain criterion is then defined to determine which of the objects that have not been annotated the system is most uncertain about. The system then presents that object as the next sample for the annotator to assign attributes to. During retrieval, the list of probabilities works as a feature vector for us to calculate the semantic distance between two objects, or between the user query and an object in the database. The overall distance between two objects is determined by a weighted sum of the semantic distance and the low-level feature distance. The algorithm is tested on both synthetic databases and real databases of 3D models. In both cases, the retrieval performance of the system improves rapidly with the number of annotated samples. Furthermore, we show that active learning outperforms learning based on random sampling.
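A minimal sketch of the sampling loop: probabilities of unannotated objects are estimated by kernel regression on the annotated ones, and the object with the highest uncertainty is presented next. Plain Nadaraya-Watson regression and a binary-entropy score stand in for the paper's biased kernel regression and knowledge gain; feature dimensions and the kernel width are illustrative.

```python
import numpy as np

def kernel_regression_probs(X_unlabeled, X_labeled, p_labeled, sigma=1.0):
    """Estimate attribute probabilities for unannotated objects by kernel regression
    on the annotated ones (a simple Nadaraya-Watson stand-in)."""
    d2 = ((X_unlabeled[:, None, :] - X_labeled[None, :, :]) ** 2).sum(-1)
    w = np.exp(-d2 / (2 * sigma ** 2)) + 1e-12
    return (w @ p_labeled) / w.sum(axis=1, keepdims=True)

def most_uncertain(probs):
    """Pick the unannotated object whose estimated attribute probabilities are least
    decided, i.e., the one the system is most uncertain about."""
    uncertainty = -(probs * np.log(probs + 1e-12)
                    + (1 - probs) * np.log(1 - probs + 1e-12)).sum(axis=1)
    return int(np.argmax(uncertainty))

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    X = rng.random((50, 4))                          # low-level feature vectors
    labeled_idx = [0, 1]                             # objects already annotated
    p_labeled = np.array([[1.0, 0.0], [0.0, 1.0]])   # binary attribute assignments
    unlabeled_idx = [i for i in range(50) if i not in labeled_idx]
    probs = kernel_regression_probs(X[unlabeled_idx], X[labeled_idx], p_labeled)
    nxt = unlabeled_idx[most_uncertain(probs)]
    print("present object", nxt, "to the annotator next")
```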

220 citations


Journal Article•DOI•
TL;DR: This work develops unique algorithms for assessing the quality of foveated image/video data using a model of human visual response and demonstrates that quality vs. compression is enhanced considerably by the foveation approach.
Abstract: Most image and video compression algorithms that have been proposed to improve picture quality relative to compression efficiency have either been designed based on objective criteria such as signal-to-noise-ratio (SNR) or have been evaluated, post-design, against competing methods using an objective sample measure. However, existing quantitative design criteria and numerical measurements of image and video quality both fail to adequately capture those attributes deemed important by the human visual system, except, perhaps, at very low error rates. We present a framework for assessing the quality of and determining the efficiency of foveated and compressed images and video streams. Image foveation is a process of nonuniform sampling that accords with the acquisition of visual information at the human retina. Foveated image/video compression algorithms seek to exploit this reduction of sensed information by nonuniformly reducing the resolution of the visual data. We develop unique algorithms for assessing the quality of foveated image/video data using a model of human visual response. We demonstrate these concepts on foveated, compressed video streams using modified (foveated) versions of H.263 that are standard-compliant. We find that quality vs. compression is enhanced considerably by the foveation approach.
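A crude illustration of foveation itself (not the authors' foveated H.263 codec or their quality metric): resolution is made to fall off with eccentricity by blending progressively blurred copies of the frame. The fall-off law, level count, and blur strengths are arbitrary assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def foveate(img: np.ndarray, fovea_xy, n_levels: int = 4, max_sigma: float = 8.0):
    """Blend progressively blurred copies of the image so that resolution falls off
    with eccentricity (distance from the fixation point fovea_xy, given as (x, y))."""
    h, w = img.shape
    yy, xx = np.mgrid[0:h, 0:w]
    ecc = np.hypot(yy - fovea_xy[1], xx - fovea_xy[0])
    ecc = ecc / ecc.max()                            # 0 at the fovea, 1 at the far corner
    # Precompute blurred versions at increasing strength.
    levels = [img] + [gaussian_filter(img, s)
                      for s in np.linspace(1.0, max_sigma, n_levels - 1)]
    # Map eccentricity to a fractional level index and blend the two nearest levels.
    idx = ecc * (n_levels - 1)
    lo = np.clip(np.floor(idx).astype(int), 0, n_levels - 1)
    hi = np.clip(lo + 1, 0, n_levels - 1)
    frac = idx - lo
    stack = np.stack(levels)                         # (n_levels, h, w)
    return (1 - frac) * stack[lo, yy, xx] + frac * stack[hi, yy, xx]

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    frame = rng.random((120, 160))
    fov = foveate(frame, fovea_xy=(80, 60))
    print(fov.shape, "center detail kept, periphery smoothed")
```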

178 citations


Journal Article•DOI•
Shashi Shekhar1, Paul Schrater1, Ranga Raju Vatsavai1, Weili Wu1, Sanjay Chawla •
TL;DR: It is argued that the SAR model makes more restrictive assumptions about the distribution of feature values and class boundaries than MRF, and that the relationship between SAR and MRF is analogous to the relationship between regression and Bayesian classifiers.
Abstract: Modeling spatial context (e.g., autocorrelation) is a key challenge in classification problems that arise in geospatial domains. Markov random fields (MRF) is a popular model for incorporating spatial context into image segmentation and land-use classification problems. The spatial autoregression (SAR) model, which is an extension of the classical regression model for incorporating spatial dependence, is popular for prediction and classification of spatial data in regional economics, natural resources, and ecological studies. There is little literature comparing these alternative approaches to facilitate the exchange of ideas. We argue that the SAR model makes more restrictive assumptions about the distribution of feature values and class boundaries than MRF. The relationship between SAR and MRF is analogous to the relationship between regression and Bayesian classifiers. This paper provides comparisons between the two models using a probabilistic and an experimental framework.

153 citations


Journal Article•DOI•
TL;DR: Based on the analysis of temporal slices, novel approaches for clustering and retrieval of video shots are proposed, found to be useful particularly for sports games, where motion and color are important visual cues when searching and browsing the desired video shots.
Abstract: Based on the analysis of temporal slices, we propose novel approaches for clustering and retrieval of video shots. Temporal slices are a set of two-dimensional (2-D) images extracted along the time dimension of an image volume. They encode a rich set of visual patterns for similarity measurement. In this paper, we first demonstrate that tensor histogram features extracted from temporal slices are suitable for motion retrieval. Subsequently, we integrate both tensor and color histograms for constructing a two-level hierarchical clustering structure. Each cluster in the top level contains shots with similar color, while each cluster in the bottom level consists of shots with similar motion. The constructed structure is then used for cluster-based retrieval. The proposed approaches are found to be useful particularly for sports games, where motion and color are important visual cues when searching and browsing the desired video shots.
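A simplified sketch of temporal-slice extraction: cutting a video volume along the time axis yields 2-D slices whose statistics can be compared across shots. The paper builds tensor and color histograms on a richer slice set; here a plain intensity histogram stands in for those features, and the inputs are synthetic.

```python
import numpy as np

def temporal_slices(volume: np.ndarray):
    """Extract horizontal and vertical temporal slices from a video volume.

    volume: (T, H, W) grayscale image volume.
    Returns two 2-D slices cut along the time axis through the frame centre."""
    T, H, W = volume.shape
    horizontal = volume[:, H // 2, :]        # (T, W): centre row over time
    vertical = volume[:, :, W // 2]          # (T, H): centre column over time
    return horizontal, vertical

def slice_histogram(sl: np.ndarray, bins: int = 16):
    """Normalized intensity histogram of a slice, usable as a simple similarity feature."""
    hist, _ = np.histogram(sl, bins=bins, range=(0.0, 1.0))
    return hist / max(hist.sum(), 1)

if __name__ == "__main__":
    rng = np.random.default_rng(4)
    clip_a = rng.random((30, 48, 64))                 # 30 synthetic frames
    clip_b = np.clip(clip_a + 0.1, 0, 1)              # visually similar clip
    ha, _ = temporal_slices(clip_a)
    hb, _ = temporal_slices(clip_b)
    d = np.abs(slice_histogram(ha) - slice_histogram(hb)).sum()
    print("L1 histogram distance between slices:", round(float(d), 3))
```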

132 citations


Journal Article•DOI•
TL;DR: A novel architecture called Unified VoD (UVoD) is proposed that can be configured to achieve cost-performance tradeoff anywhere between the two extremes (i.e., TVoD and NVoD).
Abstract: Current video-on-demand (VoD) systems can be classified into two categories: 1) true-VoD (TVoD) and 2) near-VoD (NVoD). TVoD systems allocate a dedicated channel for every user to achieve short response times so that the user can select what video to play, when to play it, and perform interactive VCR-like controls at will. By contrast, NVoD systems transmit videos repeatedly over multiple broadcast or multicast channels to enable multiple users to share a single video channel so that system cost can be substantially reduced. The tradeoffs are limited video selections, fixed playback schedule, and limited or no interactive control. TVoD systems can be considered as one extreme where service quality is maximized, while NVoD systems can be considered as the other extreme where system cost is minimized. This paper proposes a novel architecture called Unified VoD (UVoD) that can be configured to achieve cost-performance tradeoff anywhere between the two extremes (i.e., TVoD and NVoD). Assuming that a video client can concurrently receive two video channels and has local buffers for caching a portion of the video data, the proposed UVoD architecture can achieve significant performance gains (e.g., 400% more capacity for a 500-channel system) over TVoD under the same latency constraint. This paper presents the UVoD architecture, establishes a performance model, and analyzes UVoD's performance via numerical and simulation results.

123 citations


Journal Article•DOI•
TL;DR: A systematic evaluation of the mutual dependencies of segmentation methods and their performances and introduces a method measuring the quality of a segmentation method and its economic impact rather than the amount of errors.
Abstract: Although various logical story unit (LSU) segmentation methods based on visual content have been presented in the literature, a common ground for comparison is missing. We present a systematic evaluation of the mutual dependencies of segmentation methods and their performances. LSUs are subjective and cannot be defined with full certainty. To limit subjectivity, we present definitions based on film theory. For evaluation, we introduce a method measuring the quality of a segmentation method and its economic impact rather than the amount of errors. Furthermore, the inherent complexity of the segmentation problem given a visual feature is measured. Also, we show to what extent LSU segmentation depends on the quality of shot boundary segmentation. To understand LSU segmentation, we present a unifying framework classifying segmentation methods into four essentially different types. We present results of an evaluation of the four types under similar circumstances, using an unprecedented 20 hours of material from 17 complete videos in different genres. Tools and ground truths are available for interactive use via the Internet.

Journal Article•DOI•
TL;DR: An object tracking method for object-based video processing which uses a two-dimensional Gabor wavelet transform (GWT) and a 2D golden section algorithm that is robust to object deformation and supports object tracking in noisy video sequences is presented.
Abstract: The paper presents an object tracking method for object-based video processing which uses a two-dimensional (2D) Gabor wavelet transform (GWT) and a 2D golden section algorithm. An object in the current frame is modeled by local features from a number of the selected feature points, and the global placement of these feature points. The feature points are stochastically selected based on the energy of their GWT coefficients. Points with higher energy have a higher probability of being selected since they are visually more important. The amplitudes of the GWT coefficients of a feature point are then used as the local feature. The global placement of the feature points is determined by a 2D mesh whose feature is the area of the triangles formed by the feature points. In this way, a local feature is represented by a GWT coefficient amplitude vector, and a global feature is represented by a triangle area vector. One advantage of the 2D mesh is that the direction of its triangle area vector is invariant to affine transform. Consequently, the similarity between two local features or two global features can be defined as a function of the angle and the length ratio between two vectors, and the overall similarity between two objects is a weighted sum of the local and global similarities. In order to find the corresponding object in the next frame, the 2D golden section algorithm is employed. Our results show that the method is robust to object deformation and supports object tracking in noisy video sequences.
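A sketch of the feature-point selection step under stated assumptions: Gabor responses at a few orientations are computed by FFT convolution, and points are sampled with probability proportional to their Gabor energy, echoing the stochastic selection described above. Kernel parameters, orientations, and the number of points are illustrative.

```python
import numpy as np
from scipy.signal import fftconvolve

def gabor_kernel(size=15, sigma=3.0, theta=0.0, lam=6.0):
    """Real part of a 2-D Gabor kernel at orientation theta and wavelength lam."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)
    yr = -x * np.sin(theta) + y * np.cos(theta)
    return np.exp(-(xr ** 2 + yr ** 2) / (2 * sigma ** 2)) * np.cos(2 * np.pi * xr / lam)

def gabor_energy(img, thetas=(0, np.pi / 4, np.pi / 2, 3 * np.pi / 4)):
    """Per-pixel sum of squared Gabor responses over several orientations
    (a simplified stand-in for the full Gabor wavelet transform)."""
    energy = np.zeros_like(img, dtype=float)
    for th in thetas:
        resp = fftconvolve(img, gabor_kernel(theta=th), mode="same")
        energy += resp ** 2
    return energy

def select_feature_points(img, n_points=20, rng=None):
    """Stochastically select feature points with probability proportional to Gabor energy."""
    rng = rng or np.random.default_rng(0)
    e = gabor_energy(img).ravel()
    idx = rng.choice(e.size, size=n_points, replace=False, p=e / e.sum())
    return np.column_stack(np.unravel_index(idx, img.shape))

if __name__ == "__main__":
    rng = np.random.default_rng(5)
    frame = rng.random((64, 64))
    pts = select_feature_points(frame, rng=rng)
    print("selected", len(pts), "feature points; first (row, col):", pts[0])
```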

Journal Article•DOI•
TL;DR: This approach, motivated and directed by the existing cinematic conventions known as film grammar, uses the attributes of motion and shot length to define and compute a novel measure of tempo, confirming tempo as a useful high-level semantic construct in its own right and a promising component of others such as rhythm, tone or mood of a film.
Abstract: The paper addresses the challenge of bridging the semantic gap that exists between the simplicity of features that can be currently computed in automated content indexing systems and the richness of semantics in user queries posed for media search and retrieval. It proposes a unique computational approach to extraction of expressive elements of motion pictures for deriving high-level semantics of stories portrayed, thus enabling rich video annotation and interpretation. This approach, motivated and directed by the existing cinematic conventions known as film grammar, as a first step toward demonstrating its effectiveness, uses the attributes of motion and shot length to define and compute a novel measure of tempo of a movie. Tempo flow plots are defined and derived for a number of full-length movies and edge analysis is performed leading to the extraction of dramatic story sections and events signaled by their unique tempo. The results confirm tempo as a useful high-level semantic construct in its own right and a promising component of others such as rhythm, tone or mood of a film. In addition to the development of this computable tempo measure, a study is conducted as to the usefulness of biasing it toward either of its constituents, namely motion or shot length. Finally, a refinement is made to the shot length normalizing mechanism, driven by the peculiar characteristics of shot length distribution exhibited by movies. Results of these additional studies, and possible applications and limitations are discussed.
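A toy version of the tempo idea: per-shot motion and (inverted) shot length are normalized, combined, and smoothed into a tempo-flow curve whose peaks suggest dramatic sections. The weighting, normalization, and smoothing are illustrative, not the paper's exact measure.

```python
import numpy as np

def tempo(motion, shot_lengths, alpha=0.5, smooth=5):
    """Combine per-shot motion magnitude and (inverted, normalized) shot length into a
    single tempo value per shot, then smooth it into a tempo-flow curve."""
    motion = np.asarray(motion, dtype=float)
    shot_lengths = np.asarray(shot_lengths, dtype=float)
    z = lambda v: (v - v.mean()) / (v.std() + 1e-9)            # standard-score normalization
    raw = alpha * z(motion) - (1 - alpha) * z(shot_lengths)    # short shots -> higher tempo
    kernel = np.ones(smooth) / smooth
    return np.convolve(raw, kernel, mode="same")               # tempo flow plot values

if __name__ == "__main__":
    rng = np.random.default_rng(6)
    motion = rng.random(40)                  # average motion per shot (synthetic)
    lengths = rng.integers(20, 200, 40)      # shot lengths in frames (synthetic)
    flow = tempo(motion, lengths)
    print("tempo peaks near shots:", np.argsort(-flow)[:3])
```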

Journal Article•DOI•
Chai Wah Wu1•
TL;DR: A multimedia authentication scheme is proposed which combines some of the best features of feature-based and hash-based algorithms; the data does not need to be modified in order to be made authenticatable.
Abstract: Recently, a number of authentication schemes have been proposed for multimedia data. The main requirement for such authentication systems is that minor modifications which do not alter the content of the data preserve the authenticity of the data, whereas modifications which do modify the content render the data not authentic. These schemes can be classified into two classes depending on the underlying model of image authentication. We look at some of the advantages and disadvantages of these schemes and their relationship with limitations of the underlying model of image authentication. In particular, we study feature-based algorithms and hash-based algorithms. The main disadvantage of feature-based algorithms is that similar images generate similar features, and therefore it is possible for a forger to generate dissimilar images with the same features. On the other hand, the class of hash-based algorithms utilizes a cryptographic digital signature scheme and inherits the security of digital signatures to thwart forgery attacks. The main disadvantage of hash-based algorithms is that the image needs to be modified in order to be made authenticatable. We propose a multimedia authentication scheme which combines some of the best features of these two classes of algorithms. The proposed scheme utilizes cryptographic digital signature schemes and the data does not need to be modified in order to be made authenticatable. We show how results in sphere packings and coverings can be useful in the design. Several applications including the authentication of images on CD-ROM and handwritten documents are discussed.
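The scheme in the paper relies on sphere packings/coverings and public-key signatures; the sketch below substitutes coarse feature quantization plus an HMAC to illustrate the two properties discussed above: minor, content-preserving changes usually leave the tag valid, and the image itself is never modified. Block size, quantization levels, and the keyed-MAC choice are assumptions.

```python
import hashlib
import hmac
import numpy as np

def feature_digest(img: np.ndarray, block: int = 8, levels: int = 8) -> bytes:
    """Coarsely quantized block means: minor pixel changes usually leave the quantized
    features, and hence the digest, unchanged."""
    h, w = img.shape
    h, w = h - h % block, w - w % block
    blocks = img[:h, :w].reshape(h // block, block, w // block, block)
    means = blocks.mean(axis=(1, 3))
    q = np.floor(means * levels).clip(0, levels - 1).astype(np.uint8)
    return hashlib.sha256(q.tobytes()).digest()

def sign(img, key: bytes) -> bytes:
    """Authentication tag stored alongside the (unmodified) image."""
    return hmac.new(key, feature_digest(img), hashlib.sha256).digest()

def verify(img, tag: bytes, key: bytes) -> bool:
    return hmac.compare_digest(hmac.new(key, feature_digest(img), hashlib.sha256).digest(), tag)

if __name__ == "__main__":
    rng = np.random.default_rng(7)
    key = b"demo-key"
    img = rng.random((64, 64))
    tag = sign(img, key)
    slightly_noisy = img + 1e-4 * rng.standard_normal(img.shape)   # content-preserving change
    tampered = img.copy()
    tampered[:16, :16] = 1.0                                       # content-changing edit
    print("original ok:", verify(img, tag, key))
    print("minor noise ok (usually):", verify(slightly_noisy, tag, key))
    print("tampered ok:", verify(tampered, tag, key))
```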

Journal Article•DOI•
Gaurav Aggarwal1, T. V. Ashwin, Sugata Ghosal•
TL;DR: A CBIR system that efficiently addresses the inherent subjectivity in user perception during a retrieval session by employing a novel idea of intra-query modification and learning is proposed.
Abstract: Most interactive "query-by-example" based image retrieval systems utilize relevance feedback from the user for bridging the gap between the user's implied concept and the low-level image representation in the database. However, traditional relevance feedback usage in the context of content-based image retrieval (CBIR) may not be very efficient due to a significant overhead in database search and image download time in client-server environments. In this paper, we propose a CBIR system that efficiently addresses the inherent subjectivity in user perception during a retrieval session by employing a novel idea of intra-query modification and learning. The proposed system generates an object-level view of the query image using a new color segmentation technique. Color, shape and spatial features of individual segments are used for image representation and retrieval. The proposed system automatically generates a set of modifications by manipulating the features of the query segment(s). An initial estimate of user perception is learned from the user feedback provided on the set of modified images. This largely improves the precision in the first database search itself and alleviates the overheads of database search and image download. Precision-to-recall ratio is improved in further iterations through a new relevance feedback technique that utilizes both positive as well as negative examples. Extensive experiments have been conducted to demonstrate the feasibility and advantages of the proposed system.

Journal Article•DOI•
TL;DR: This paper introduces the local polar coordinate file (LPC-file), a filtering approach for nearest-neighbor searches in high-dimensional image databases that outperforms both the VA-file and the sequential scan in total elapsed time and in the number of disk accesses, and is robust in both "good" and "bad" distributions.
Abstract: Nearest neighbor (NN) search is emerging as an important search paradigm in a variety of applications in which objects are represented as vectors of d numeric features. However, despite decades of efforts, except for the filtering approach such as the VA-file, the current solutions to find exact kNNs are far from satisfactory for large d. The filtering approach represents vectors as compact approximations and by first scanning these smaller approximations, only a small fraction of the real vectors are visited. In this paper, we introduce the local polar coordinate file (LPC-file) using the filtering approach for nearest-neighbor searches in high-dimensional image databases. The basic idea is to partition the vector space into rectangular cells and then to approximate vectors by polar coordinates on the partitioned local cells. The LPC information significantly enhances the discriminatory power of the approximation. To demonstrate the effectiveness of the LPC-file, we conducted extensive experiments and compared the performance with the VA-file and the sequential scan by using synthetic and real data sets. The experimental results demonstrate that the LPC-file outperforms both the VA-file and the sequential scan in total elapsed time and in the number of disk accesses, and that the LPC-file is robust in both "good" distributions (such as random) and "bad" distributions (such as skewed and clustered).
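A minimal sketch of the LPC-style approximation only (not the full filtering search with its distance bounds): each vector is reduced to its grid cell, its distance from the cell centre, and the angle between that offset and the cell diagonal. Cell resolution and the choice of reference diagonal are illustrative.

```python
import numpy as np

def lpc_approximate(vectors: np.ndarray, bits_per_dim: int = 2):
    """Approximate each vector in [0, 1)^d by its rectangular grid cell plus local polar
    coordinates: cell index, radius from the cell centre, and the angle between the
    offset-from-centre and the cell's main diagonal (a simplified LPC-style code)."""
    n, d = vectors.shape
    cells_per_dim = 2 ** bits_per_dim
    cell_width = 1.0 / cells_per_dim
    cell_idx = np.clip((vectors / cell_width).astype(int), 0, cells_per_dim - 1)
    centres = (cell_idx + 0.5) * cell_width
    offset = vectors - centres
    radius = np.linalg.norm(offset, axis=1)
    diagonal = np.ones(d) / np.sqrt(d)                       # unit cell diagonal
    cosang = (offset @ diagonal) / (radius + 1e-12)
    angle = np.arccos(np.clip(cosang, -1.0, 1.0))
    return cell_idx, radius, angle

if __name__ == "__main__":
    rng = np.random.default_rng(8)
    X = rng.random((10, 16))                                 # 16-dimensional feature vectors
    cells, r, a = lpc_approximate(X)
    print("first vector -> cell", cells[0][:4], "... radius %.3f angle %.3f" % (r[0], a[0]))
```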

Journal Article•DOI•
TL;DR: This paper uses speech recognition technology to index spoken audio and video files from the World Wide Web when no transcriptions are available, and shows that, even if the transcription is inaccurate, it can still achieve good retrieval performance for typical user queries.
Abstract: As the Web transforms from a text-only medium into a more multimedia-rich medium, the need arises to perform searches based on the multimedia content. In this paper, we present an audio and video search engine to tackle this problem. The engine uses speech recognition technology to index spoken audio and video files from the World Wide Web (WWW) when no transcriptions are available. If transcriptions (even imperfect ones) are available, we can also take advantage of them to improve the indexing process. Our engine indexes several thousand talk and news radio shows covering a wide range of topics and speaking styles from a selection of public Web sites with multimedia archives. Our Web site is similar in spirit to normal Web search sites; it contains an index, not the actual multimedia content. The audio from these shows suffers in acoustic quality due to bandwidth limitations, coding, compression, and poor acoustic conditions. Our word error rate (WER) results using appropriately trained acoustic models show remarkable resilience to the high compression, although many factors combine to increase the average WERs over standard broadcast news benchmarks. We show that, even if the transcription is inaccurate, we can still achieve good retrieval performance for typical user queries (77.5%).

Journal Article•DOI•
TL;DR: An important feature of this work is to introduce semantic constraints based on structure and silence in the authors' computational model, which results in computable scenes that are more consistent with human observations.
Abstract: We present a computational scene model and also derive novel algorithms for computing audio and visual scenes and within-scene structures in films. We use constraints derived from film-making rules and from experimental results in the psychology of audition in our computational scene model. Central to the computational model is the notion of a causal, finite-memory viewer model. We segment the audio and video data separately. In each case, we determine the degree of correlation of the most recent data in the memory with the past. The audio and video scene boundaries are determined using local maxima and minima, respectively. We derive four types of computable scenes that arise due to different kinds of audio and video scene boundary synchronizations. We show how to exploit the local topology of an image sequence in conjunction with statistical tests to determine dialogs. We also derive a simple algorithm to detect silences in audio. An important feature of our work is to introduce semantic constraints based on structure and silence in our computational model. This results in computable scenes that are more consistent with human observations. The algorithms were tested on a difficult data set: the first hour of each of three commercial films. The best results were 94% for computable scene detection and, for dialogue detection, 91% recall at 100% precision.

Journal Article•DOI•
Guojun Lu1•
TL;DR: This paper provides a survey of techniques and data structures required to organize feature vectors and manage the search process so that objects relevant to the query can be located quickly.
Abstract: As more and more information is captured and stored in digital form, many techniques and systems have been developed for indexing and retrieval of text documents, audio, images, and video. The retrieval is normally based on similarities between extracted feature vectors of the query and stored items. Feature vectors are usually multidimensional. When the number of stored objects and/or the number of dimensions of the feature vectors are large, it will be too slow to linearly search all stored feature vectors to find those that satisfy the query criteria. Techniques and data structures are thus required to organize feature vectors and manage the search process so that objects relevant to the query can be located quickly. This paper provides a survey of these techniques and data structures.

Journal Article•DOI•
TL;DR: A novel approach allowing layered content-based retrieval of video-event shots referring to potentially interesting situations is presented; the events considered refer to potentially dangerous situations, namely abandoned objects and predefined human events.
Abstract: Increased communication capabilities and automatic scene understanding allow human operators to simultaneously monitor multiple environments. Due to the amount of data to be processed in new surveillance systems, the human operator must be helped by automatic processing tools in the work of inspecting video sequences. In this paper, a novel approach allowing layered content-based retrieval of video-event shots referring to potentially interesting situations is presented. Interpretation of events is used for defining new video-event shot detection and indexing criteria. Interesting events refer to potentially dangerous situations: abandoned objects and predefined human events are considered in this paper. Video-event shot detection and indexing capabilities are used for online and offline content-based retrieval of scenes to be detected.

Journal Article•DOI•
TL;DR: This work investigates ways to store or stage partial video in proxy servers to reduce the network bandwidth requirement over WAN, and proposes several frame staging selection algorithms to determine the video frames to be stored in the proxy server.
Abstract: Due to the high bandwidth requirement and rate variability of compressed video, delivering video across wide area networks (WANs) is a challenging issue. Proxy servers have been used to reduce network congestion and improve client access time on the Internet by caching passing data. We investigate ways to store or stage partial video in proxy servers to reduce the network bandwidth requirement over WAN. A client needs to access a portion of the video from a proxy server over a local area network (LAN) and the rest from a central server across a WAN. Therefore, client buffer requirement and video synchronization are to be considered. We study the tradeoffs between client buffer, storage requirement on the proxy server, and bandwidth requirement over WAN. Given a video delivery rate for the WAN, we propose several frame staging selection algorithms to determine the video frames to be stored in the proxy server. A scheme called chunk algorithm, which partitions a video into different segments (chunks of frames) with alternating chunks stored in the proxy server, is shown to offer the best tradeoff. We also investigate an efficient way to utilize client buffer when the combination of video streams from WAN and LAN is considered.
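A tiny sketch of the chunk idea described above: the video is cut into fixed-size chunks and alternating chunks are staged on the proxy, with the remainder fetched over the WAN. Chunk size and which side starts are illustrative parameters; the paper's algorithm additionally accounts for client buffer and the WAN delivery rate.

```python
def chunk_staging(n_frames: int, chunk_size: int, start_with_proxy: bool = True):
    """Partition a video into fixed-size chunks and assign alternating chunks to the
    proxy server (LAN) and the central server (WAN)."""
    proxy, wan = [], []
    for start in range(0, n_frames, chunk_size):
        chunk = list(range(start, min(start + chunk_size, n_frames)))
        even_chunk = (start // chunk_size) % 2 == 0
        (proxy if even_chunk == start_with_proxy else wan).append(chunk)
    return proxy, wan

if __name__ == "__main__":
    proxy_chunks, wan_chunks = chunk_staging(n_frames=20, chunk_size=4)
    print("frames staged on the proxy:", proxy_chunks)
    print("frames fetched over the WAN:", wan_chunks)
```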

Journal Article•DOI•
TL;DR: LucentVision is presented, an instantly indexed multimedia database system developed for the sport of tennis that analyzes video from multiple cameras in real time and captures the activity of the players and the ball in the form of motion trajectories and stores these trajectories in a database.
Abstract: We introduce a new paradigm for real-time conversion of a real world event into a rich multimedia database by processing data from multiple sensors observing the event. A real-time analysis of the sensor data, tightly coupled with domain knowledge, results in instant indexing of multimedia data at capture time. This yields semantic information to answer complex queries about the content and the ability to extract portions of data that correspond to complex actions performed in the real world. The power of such an instantly indexed multimedia database system, in content-based retrieval of multimedia data or in semantic analysis and visualization of the data, far exceeds that of systems which index multimedia data only after it is produced. We present LucentVision, an instantly indexed multimedia database system developed for the sport of tennis. This system analyzes video from multiple cameras in real time and captures the activity of the players and the ball in the form of motion trajectories. The system stores these trajectories in a database along with video, 3D models of the environment, scores, and other domain-specific information. LucentVision has been used to enhance live television and Internet broadcasts with game analyses and virtual replays in more than 250 international tennis matches.

Journal Article•DOI•
Jie Song1, K.J.R. Liu•
TL;DR: By using diversity techniques and OFDM, the frequency selective fading effects in broadband wireless channels can be significantly decreased and it is shown that subchannels in OFDM systems approach Gaussian noisy channels when the diversity gain gets large; as a result, the system performance can be improved in terms of throughput and channel coding efficiency.
Abstract: A joint source-channel coding (JSCC) scheme for robust progressive image transmission over broadband wireless channels using orthogonal frequency division multiplexing (OFDM) systems with spatial diversity is proposed for application environments where no feedback channel is available, such as broadcasting services. Most current research on JSCC focuses on either binary symmetric channels (BSC) or additive white Gaussian noise (AWGN) channels. To deal with fading channels, most previous methods model the fading channel as a two-state Gilbert-Elliott channel and aim the JSCC at the BER of the bad channel state, which is not optimal when the channel is in the good state. By using diversity techniques and OFDM, the frequency-selective fading effects in broadband wireless channels can be significantly decreased, and we show that subchannels in OFDM systems approach Gaussian noisy channels when the diversity gain gets large; as a result, the system performance can be improved in terms of throughput and channel coding efficiency. After analyzing the channel property of OFDM systems with spatial diversity, a practical JSCC scheme for OFDM systems is proposed. Simulation results are presented for transmit diversity with different numbers of antennas and different multipath delay and Doppler spread. It is observed from simulations that the performance can be improved by more than 4 dB in terms of peak signal-to-noise ratio (PSNR) of the received image Lena, and that the performance is not very sensitive to different multipath spread and Doppler frequency.

Journal Article•DOI•
TL;DR: The paper shows that energy-efficient watermarks must satisfy a power-spectrum condition (PSC), which states that the watermark's power spectrum should be directly proportional to the original signal's.
Abstract: The paper presents a model for watermarking and some attacks on watermarks. Given the watermarked signal, the so-called Wiener attack performs minimum mean-squared error (MMSE) estimation of the watermark and subtracts the weighted MMSE estimate from the watermarked signal. Under the assumption of a fixed correlation detector, the attack is shown to minimize the expected correlation statistic for the same attack distortion among linear, shift-invariant filtering attacks. It also leads to the idea of energy-efficient watermarking: watermarking that resists MMSE estimation as much as possible, and provides a meaningful way to evaluate robustness. The paper shows that energy-efficient watermarks must satisfy a power-spectrum condition (PSC), which states that the watermark's power spectrum should be directly proportional to the original signal's. PSC-compliant watermarks are proven to be most robust. Experiments with signal models and natural images demonstrate that watermarks that do not closely fulfill the PSC are vulnerable to the Wiener attack, while PSC-compliant watermarks are highly resistant to it. These theoretical and experimental results justify prior heuristic arguments that, for maximum robustness, a watermark should be closely matched to the spectral content of the original signal. The results also discourage the use of watermarks that do not approximately satisfy the PSC.
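A short numpy sketch of generating a PSC-compliant watermark: white noise is shaped in the frequency domain by the host's magnitude spectrum, so the watermark's power spectrum becomes proportional to the original signal's. The strength normalization and the toy host image are illustrative.

```python
import numpy as np

def psc_watermark(host: np.ndarray, strength: float = 0.05, rng=None):
    """Generate a watermark whose power spectrum is proportional to the host's:
    shape white noise in the frequency domain by the host's magnitude spectrum."""
    rng = rng or np.random.default_rng(0)
    noise = rng.standard_normal(host.shape)
    shaped = np.real(np.fft.ifft2(np.fft.fft2(noise) * np.abs(np.fft.fft2(host))))
    shaped *= strength * host.std() / (shaped.std() + 1e-12)   # normalize the energy
    return shaped

if __name__ == "__main__":
    rng = np.random.default_rng(9)
    # A host with strong low-frequency content (smooth ramp plus a little noise).
    x = np.linspace(0, 1, 128)
    host = np.outer(x, x) + 0.05 * rng.standard_normal((128, 128))
    wm = psc_watermark(host, rng=rng)
    watermarked = host + wm
    # The watermark's spectrum now follows the host's instead of being flat.
    print("host/watermark spectral correlation:",
          round(float(np.corrcoef(np.abs(np.fft.fft2(host)).ravel(),
                                  np.abs(np.fft.fft2(wm)).ravel())[0, 1]), 3))
```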

Journal Article•DOI•
TL;DR: The GC-tree is a new dynamic index structure based on a special subspace partitioning strategy which is optimized for a clustered high-dimensional image dataset and outperforms all other methods for efficient similarity search in image databases.
Abstract: We propose a new dynamic index structure called the GC-tree (or the grid cell tree) for efficient similarity search in image databases. The GC-tree is based on a special subspace partitioning strategy which is optimized for a clustered high-dimensional image dataset. The basic ideas are threefold: 1) we adaptively partition the data space based on a density function that identifies dense and sparse regions in a data space; 2) we concentrate the partition on the dense regions, and the objects in the sparse regions of a certain partition level are treated as if they lie within a single region; and 3) we dynamically construct an index structure that corresponds to the space partition hierarchy. The resultant index structure adapts well to the strongly clustered distribution of high-dimensional image datasets. To demonstrate the practical effectiveness of the GC-tree, we experimentally compared the GC-tree with the IQ-tree, LPC-file, VA-file, and linear scan. The result of our experiments shows that the GC-tree outperforms all other methods.

Journal Article•DOI•
TL;DR: The core of this framework is compression, and it is shown how to exploit two types of data correlation, the intra-pixel and the inter-pixel correlations, in order to achieve a manageable storage size.
Abstract: Image-based modeling and rendering has been demonstrated as a cost-effective and efficient approach to virtual reality applications. The computational model that most image-based techniques are based on is the plenoptic function. Since the original formulation of the plenoptic function does not include illumination, most previous image-based virtual reality applications simply assume that the illumination is fixed. We propose a formulation of the plenoptic function, called the plenoptic illumination function, which explicitly specifies the illumination component. Techniques based on this new formulation can be extended to support relighting as well as view interpolation. To relight images with various illumination configurations, we also propose a local illumination model, which utilizes the rules of image superposition. We demonstrate how this new formulation can be applied to extend two existing image-based representations, panorama representation such as QuickTime VR and two-plane parameterization, to support relighting with trivial modifications. The core of this framework is compression, and we therefore show how to exploit two types of data correlation, the intra-pixel and the inter-pixel correlations, in order to achieve a manageable storage size.
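A minimal sketch of relighting by image superposition, one reading of the superposition rule mentioned above: with one basis image per light source at unit intensity, a new illumination configuration is a weighted sum of the basis images. The basis data and weights are synthetic placeholders; the paper's representation additionally handles view interpolation and compression.

```python
import numpy as np

def relight(basis_images: np.ndarray, light_weights) -> np.ndarray:
    """Relight a scene by superposition: basis_images has shape (n_lights, H, W), with
    each slice the scene lit by a single light at unit intensity; the output is the
    weighted sum corresponding to a new illumination configuration."""
    light_weights = np.asarray(light_weights, dtype=float)
    return np.tensordot(light_weights, basis_images, axes=1)

if __name__ == "__main__":
    rng = np.random.default_rng(10)
    basis = rng.random((3, 32, 32))          # scene under 3 individual lights (synthetic)
    # Turn light 0 down, light 1 off, light 2 up.
    out = relight(basis, [0.3, 0.0, 1.5])
    print(out.shape, "min %.2f max %.2f" % (out.min(), out.max()))
```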

Journal Article•DOI•
Theo Gevers1•
TL;DR: This work aims for content-based image retrieval of textured objects in natural scenes under varying illumination and viewing conditions using a retrieval scheme based on matching feature distributions derived from color invariant gradients.
Abstract: We aim for content-based image retrieval of textured objects in natural scenes under varying illumination and viewing conditions. To achieve this, image retrieval is based on matching feature distributions derived from color invariant gradients. To cope with object cluttering, region-based texture segmentation is applied on the target images prior to the actual image retrieval process. The retrieval scheme is empirically verified on color images taken from textured objects under different lighting conditions.

Journal Article•DOI•
TL;DR: In this paper, an effective wipe detection method is proposed that uses the macroblock (MB) information of MPEG compressed video, analyzing the prediction directions of B frames, which are revealed in the MB types.
Abstract: For video scene analysis, the wipe transition is considered the most complex and difficult to detect. In this paper, an effective wipe detection method is proposed using the macroblock (MB) information of MPEG compressed video. By analyzing the prediction directions of B frames, which are revealed in the MB types, the scene change region of each frame can be found. Once the accumulation of the scene change regions covers most of the area of the frame, the sequence is considered a motionless wipe transition sequence. In addition, uncommon intra-coded MBs in B frames can be used as an indicator of a motion wipe transition. A very simple analysis based on a small amount of MB type information is sufficient to achieve wipe detection directly on MPEG compressed video. Easy extraction of MB type information, a low-complexity analysis algorithm, and robustness to arbitrary shapes and directions of wipe transitions are the main advantages of the proposed method.

Journal Article•DOI•
TL;DR: The proposed retrieval method provides human detection and activity recognition at different resolution levels from low complexity to low false rates and connects low level features to high level semantics by developing relational object and activity presentations.
Abstract: We propose a hierarchical retrieval system where shape, color and motion characteristics of the human body are captured in compressed and uncompressed domains. The proposed retrieval method provides human detection and activity recognition at different resolution levels, from low complexity to low false rates, and connects low-level features to high-level semantics by developing relational object and activity presentations. The information available from standard video compression algorithms is used in order to reduce the amount of time and storage needed for the information retrieval. Principal component analysis is used for activity recognition using MPEG motion vectors, and results are presented for walking, kicking, and running to demonstrate that the classification among activities is clearly visible. For low-resolution and monochrome images it is demonstrated that the structural information of human silhouettes can be captured from AC-DCT coefficients.
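A small stand-in for the PCA-based activity recognition step: per-clip features (which in the paper come from MPEG motion vectors) are projected onto the top principal components and classified by the nearest class mean. The synthetic features, dimensions, and nearest-mean rule are assumptions made for illustration.

```python
import numpy as np

def pca_fit(X: np.ndarray, k: int = 2):
    """Fit PCA on feature rows (one row per clip); return the mean and the top-k axes."""
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:k]

def pca_project(X, mean, axes):
    return (X - mean) @ axes.T

def nearest_class(z, class_means):
    return min(class_means, key=lambda c: np.linalg.norm(z - class_means[c]))

if __name__ == "__main__":
    rng = np.random.default_rng(11)
    # Hypothetical per-clip features built from MPEG motion vectors (e.g., mean motion
    # magnitude and directional histograms); here they are synthetic stand-ins.
    walk = rng.normal(0.2, 0.05, (20, 8))
    run = rng.normal(0.8, 0.05, (20, 8))
    X = np.vstack([walk, run])
    mean, axes = pca_fit(X)
    Z = pca_project(X, mean, axes)
    class_means = {"walking": Z[:20].mean(axis=0), "running": Z[20:].mean(axis=0)}
    test = pca_project(rng.normal(0.78, 0.05, (1, 8)), mean, axes)[0]
    print("test clip classified as:", nearest_class(test, class_means))
```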

Journal Article•DOI•
TL;DR: The framework is described and used to compare several well-known algorithms and to show that a multiresolution decoding is recognized faster than a single large-scale decoding, and that global blurriness slows down recognition more than do localized "splotch" artifacts.
Abstract: Mean squared error (MSE) and peak signal-to-noise-ratio (PSNR) are the most common methods for measuring the quality of compressed images, despite the fact that their inadequacies have long been recognized. Quality for compressed still images is sometimes evaluated using human observers who provide subjective ratings of the images. Both SNR and subjective quality judgments, however, may be inappropriate for evaluating progressive compression methods which are to be used for fast browsing applications. In this paper, we present a novel experimental and statistical framework for comparing progressive coders. The comparisons use response time studies in which human observers view a series of progressive transmissions, and respond to questions about the images as they become recognizable. We describe the framework and use it to compare several well-known algorithms (JPEG, set partitioning in hierarchical trees (SPIHT), and embedded zerotree wavelet (EZW)), and to show that a multiresolution decoding is recognized faster than a single large-scale decoding. Our experiments also show that, for the particular algorithms used, at the same PSNR, global blurriness slows down recognition more than do localized "splotch" artifacts.