
Showing papers in "IEEE Transactions on Multimedia in 2005"


Journal ArticleDOI
TL;DR: A computational framework for affective video content representation and modeling is proposed, based on the dimensional approach to affect known from the field of psychophysiology, in which affect is characterized by the dimensions of arousal (intensity of affect) and valence (type of affect).
Abstract: This paper looks into a new direction in video content analysis - the representation and modeling of affective video content. The affective content of a given video clip can be defined as the intensity and type of feeling or emotion (both are referred to as affect) that are expected to arise in the user while watching that clip. The availability of methodologies for automatically extracting this type of video content will extend the current scope of possibilities for video indexing and retrieval. For instance, we will be able to search for the funniest or the most thrilling parts of a movie, or the most exciting events of a sports program. Furthermore, as the user may want to select a movie not only based on its genre, cast, director and story content, but also on its prevailing mood, affective content analysis is also likely to enhance the quality of personalized video delivery. We propose in this paper a computational framework for affective video content representation and modeling. This framework is based on the dimensional approach to affect that is known from the field of psychophysiology. According to this approach, the affective video content can be represented as a set of points in the two-dimensional (2-D) emotion space that is characterized by the dimensions of arousal (intensity of affect) and valence (type of affect). We map the affective video content onto the 2-D emotion space by using models that link the arousal and valence dimensions to low-level features extracted from video data. This results in arousal and valence time curves that, either considered separately or combined into the so-called affect curve, are introduced as reliable representations of the expected transitions from one feeling to another along a video, as perceived by a viewer.
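The arousal side of such a framework lends itself to a compact sketch: combine a few normalized per-frame low-level features into one curve and smooth it so it varies on the time scale of a mood rather than frame to frame. The particular feature trio (motion, cut density, sound energy) and the Kaiser-window smoothing below are illustrative assumptions, not the paper's exact models.

```python
import numpy as np

def arousal_curve(motion, cut_density, energy, win=51, beta=5.0):
    """Combine per-frame low-level features into a smoothed arousal curve.
    The three feature names are illustrative, not the paper's exact set."""
    def norm(x):
        x = np.asarray(x, dtype=float)
        rng = x.max() - x.min()
        return (x - x.min()) / rng if rng > 0 else np.zeros_like(x)

    raw = (norm(motion) + norm(cut_density) + norm(energy)) / 3.0
    # Smooth with a Kaiser window so the curve varies slowly, as an
    # affective state would, rather than jumping frame to frame.
    w = np.kaiser(win, beta)
    w /= w.sum()
    return np.convolve(raw, w, mode="same")

frames = 500
t = np.linspace(0, 10, frames)
curve = arousal_curve(np.abs(np.sin(t)),
                      np.random.default_rng(0).random(frames),
                      t / 10)
print(curve.shape)   # (500,)
```

Pairing such an arousal curve with an analogous valence curve yields the affect curve described in the abstract.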

625 citations


Journal ArticleDOI
TL;DR: A generic framework of a user attention model is presented, which estimates the attention viewers may pay to video contents, and a set of modeling methods for visual and aural attention is proposed.
Abstract: Due to the information redundancy of video, automatically extracting essential video content is one of the key techniques for accessing and managing large video libraries. In this paper, we present a generic framework of a user attention model, which estimates the attention viewers may pay to video contents. As human attention is an effective and efficient mechanism for information prioritizing and filtering, the user attention model provides an effective approach to video indexing based on importance ranking. In particular, we define viewer attention through multiple sensory perceptions, i.e., visual and aural stimuli, as well as partly semantic understanding. We also propose a set of modeling methods for visual and aural attention. As an important application of the user attention model, a feasible solution to video summarization, requiring neither full semantic understanding of video content nor complex heuristic rules, is implemented to demonstrate the effectiveness, robustness, and generality of the model. The promising results from the user study on video summarization indicate that the user attention model is an effective alternative approach to video understanding.

567 citations


Journal ArticleDOI
TL;DR: An overview of several video transcoding techniques and some of the related research issues is provided; solutions to some of these issues are proposed, and possible research directions are identified.
Abstract: One of the fundamental challenges in deploying multimedia systems, such as telemedicine, education, space endeavors, marketing, crisis management, transportation, and military, is to deliver smooth and uninterruptible flow of audio-visual information, anytime and anywhere. A multimedia system may consist of various devices (PCs, laptops, PDAs, smart phones, etc.) interconnected via heterogeneous wireline and wireless networks. In such systems, multimedia content originally authored and compressed with a certain format may need bit rate adjustment and format conversion in order to allow access by receiving devices with diverse capabilities (display, memory, processing, decoder). Thus, a transcoding mechanism is required to make the content adaptive to the capabilities of diverse networks and client devices. A video transcoder can perform several additional functions. For example, if the bandwidth required for a particular video is fluctuating due to congestion or other causes, a transcoder can provide fine and dynamic adjustments in the bit rate of the video bitstream in the compressed domain without imposing additional functional requirements in the decoder. In addition, a video transcoder can change the coding parameters of the compressed video, adjust spatial and temporal resolution, and modify the video content and/or the coding standard used. This paper provides an overview of several video transcoding techniques and some of the related research issues. We introduce some of the basic concepts of video transcoding, and then review and contrast various approaches while highlighting critical research issues. We propose solutions to some of these research issues, and identify possible research directions.

374 citations


Journal ArticleDOI
TL;DR: This research examines the limitations of selective encryption using cryptanalysis, and proposes another approach that turns entropy coders into encryption ciphers using multiple statistical models that can be applied to most modern compressed audio/video such as MPEG audio, MPEG video, and JPEG/JPEG2000 images.
Abstract: Two approaches for integrating encryption with multimedia compression systems are studied in this research, i.e., selective encryption and modified entropy coders with multiple statistical models. First, we examine the limitations of selective encryption using cryptanalysis, and provide examples that use selective encryption successfully. Two rules to determine whether selective encryption is suitable for a compression system are derived. Next, we propose another approach that turns entropy coders into encryption ciphers using multiple statistical models. Two specific encryption schemes are obtained by applying this approach to the Huffman coder and the QM coder. It is shown that security is achieved without sacrificing the compression performance and the computational speed. This modified entropy coding methodology can be applied to most modern compressed audio/video such as MPEG audio, MPEG video, and JPEG/JPEG2000 images.
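The multiple-statistical-model idea can be illustrated with a toy version: keep several valid prefix-code tables for the same alphabet and let a key-driven stream choose which table encodes each symbol, so an eavesdropper without the key cannot parse the bitstream. The two hand-made tables and the `random.Random` keystream below are stand-ins for real Huffman tables and a cryptographic generator.

```python
import random

# Two valid prefix-code tables over the same alphabet (toy stand-ins for
# Huffman tables built from multiple statistical models).
TABLES = [
    {"a": "0", "b": "10", "c": "110", "d": "111"},
    {"a": "111", "b": "110", "c": "10", "d": "0"},
]

def encode(msg, key):
    rng = random.Random(key)   # keyed stream selects the table per symbol
    return "".join(TABLES[rng.randrange(2)][s] for s in msg)

def decode(bits, key):
    rng = random.Random(key)   # same key regenerates the same selections
    out, i = [], 0
    while i < len(bits):
        inv = {v: k for k, v in TABLES[rng.randrange(2)].items()}
        for j in range(i + 1, len(bits) + 1):
            if bits[i:j] in inv:       # prefix codes match unambiguously
                out.append(inv[bits[i:j]])
                i = j
                break
    return "".join(out)

cipher = encode("abcd", key=42)
print(decode(cipher, key=42))   # abcd
```

Because both tables are prefix codes, compression behavior is preserved while the table choice acts as the cipher, mirroring the trade-off the abstract describes.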

361 citations


Journal ArticleDOI
TL;DR: This work presents a system for producing short, representative samples (or "audio thumbnails") of selections of popular music, and presents a development of the chromagram, a variation on traditional time-frequency distributions that seeks to represent the cyclic attribute of pitch perception, known as chroma.
Abstract: With the growing prevalence of large databases of multimedia content, methods for facilitating rapid browsing of such databases or the results of a database search are becoming increasingly important. However, these methods are necessarily media dependent. We present a system for producing short, representative samples (or "audio thumbnails") of selections of popular music. The system searches for structural redundancy within a given song with the aim of identifying something like a chorus or refrain. To isolate a useful class of features for performing such structure-based pattern recognition, we present a development of the chromagram, a variation on traditional time-frequency distributions that seeks to represent the cyclic attribute of pitch perception, known as chroma. The pattern recognition system itself employs a quantized chromagram that represents the spectral energy at each of the 12 pitch classes. We evaluate the system on a database of popular music and score its performance against a set of "ideal" thumbnail locations. Overall performance is found to be quite good, with the majority of errors resulting from songs that do not meet our structural assumptions.
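A minimal version of the quantized chromagram folds each spectral bin's magnitude onto one of the 12 pitch classes. The reference pitch (A1 = 55 Hz), frame length, and windowing below are illustrative assumptions, not the paper's exact front end.

```python
import numpy as np

def chroma_frame(frame, sr=22050, fmin=55.0):
    """Fold the magnitude spectrum of one audio frame into 12 pitch classes."""
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    chroma = np.zeros(12)
    for f, m in zip(freqs, spec):
        if f < fmin:
            continue  # ignore bins below the lowest pitch of interest
        # Distance in semitones from the reference pitch, folded to one octave.
        pc = int(round(12 * np.log2(f / fmin))) % 12
        chroma[pc] += m
    return chroma / (chroma.sum() + 1e-12)

sr = 22050
t = np.arange(2048) / sr
a440 = np.sin(2 * np.pi * 440 * t)        # A4 tone: pitch class of A
print(np.argmax(chroma_frame(a440, sr)))  # 0 (A is the reference class)
```

Stacking such frames over time gives the chromagram on which the chorus-finding pattern matching operates.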

300 citations


Journal ArticleDOI
TL;DR: A novel approach for clustering shots into scenes by transforming the task into a graph partitioning problem; the method automates the selection of scene-representative images, which is useful for applications such as video-on-demand, digital libraries, and the Internet.
Abstract: This paper presents a method to perform a high-level segmentation of videos into scenes. A scene can be defined as a subdivision of a play in which either the setting is fixed or the action is continuous in one place. We exploit this fact and propose a novel approach for clustering shots into scenes by transforming this task into a graph partitioning problem. This is achieved by constructing a weighted undirected graph called a shot similarity graph (SSG), where each node represents a shot and the edges between the shots are weighted by their similarity based on color and motion information. The SSG is then split into subgraphs by applying normalized cuts for graph partitioning. The partitions so obtained represent individual scenes in the video. When clustering the shots, we consider the global similarities of shots rather than the individual shot pairs. We also propose a method to describe the content of each scene by selecting one representative image from the video as a scene key-frame. Recently, DVDs have become available with a chapter selection option where each chapter is represented by one image. Our algorithm automates this task, which is useful for applications such as video-on-demand, digital libraries, and the Internet. Experiments are presented with promising results on several Hollywood movies and one sitcom.
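In outline, the SSG-plus-normalized-cuts step looks like the sketch below, where shot similarity is reduced to a color-histogram term only (the paper also uses motion) and the normalized cut is approximated by its standard spectral relaxation via the Fiedler vector.

```python
import numpy as np

def shot_similarity_graph(histograms, sigma=0.5):
    """Weighted adjacency matrix from shot color histograms (illustrative)."""
    H = np.asarray(histograms, dtype=float)
    n = len(H)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j:
                # Histogram-intersection distance -> Gaussian similarity.
                d = 1.0 - np.minimum(H[i], H[j]).sum()
                W[i, j] = np.exp(-d / sigma)
    return W

def normalized_cut_bipartition(W):
    """Split nodes in two via the Fiedler vector of the normalized Laplacian,
    the standard spectral relaxation of the normalized-cut criterion."""
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = np.eye(len(W)) - D_inv_sqrt @ W @ D_inv_sqrt
    vals, vecs = np.linalg.eigh(L)
    fiedler = vecs[:, 1]          # eigenvector of the second-smallest eigenvalue
    return fiedler >= 0           # boolean scene labels

# Two "scenes": shots 0-2 share one color profile, shots 3-5 another.
hists = [[0.8, 0.2, 0.0]] * 3 + [[0.0, 0.2, 0.8]] * 3
labels = normalized_cut_bipartition(shot_similarity_graph(hists))
print(labels)
```

Recursive application of the bipartition yields as many scenes as the stopping criterion allows.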

275 citations


Journal ArticleDOI
TL;DR: The goal was to first develop a system for segmentation of the audio signal, and then classification into one of two main categories: speech or music; results show that efficiency is exceptionally good, without sacrificing performance.
Abstract: Over the last several years, major efforts have been made to develop methods for extracting information from audiovisual media, in order that they may be stored and retrieved in databases automatically, based on their content. In this work we deal with the characterization of an audio signal, which may be part of a larger audiovisual system or may be autonomous, as for example in the case of an audio recording stored digitally on disk. Our goal was to first develop a system for segmentation of the audio signal, and then classification into one of two main categories: speech or music. Among the system's requirements are its processing speed and its ability to function in a real-time environment with a small response delay. Because of the restriction to two classes, the extracted characteristics are considerably reduced, and moreover the required computations are straightforward. Experimental results show that efficiency is exceptionally good, without sacrificing performance. Segmentation is based on the mean signal amplitude distribution, whereas classification utilizes an additional characteristic related to frequency. The classification algorithm may be used either in conjunction with the segmentation algorithm, in which case it verifies or refutes a music-speech or speech-music change, or autonomously, with given audio segments. The basic characteristics are computed in 20 ms intervals, resulting in the segments' limits being specified within an accuracy of 20 ms. The smallest segment length is one second. The segmentation and classification algorithms were benchmarked on a large data set, with correct segmentation about 97% of the time and correct classification about 95%.
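The 20-ms framing can be sketched directly. The mean-amplitude feature below matches the segmentation characteristic described, while the synthetic "speech-like" signal (bursts alternating with near-silence) is purely an illustration of why the amplitude distribution separates the two classes.

```python
import numpy as np

def frame_features(signal, sr, frame_ms=20):
    """Mean absolute amplitude per 20-ms frame: the basic segmentation
    characteristic (the classifier adds one frequency-related feature)."""
    n = int(sr * frame_ms / 1000)
    n_frames = len(signal) // n
    frames = signal[: n_frames * n].reshape(n_frames, n)
    return np.abs(frames).mean(axis=1)

sr = 16000
rng = np.random.default_rng(0)
# Speech alternates loud bursts and pauses; music tends to be steadier.
speechlike = np.concatenate([rng.standard_normal(3200) * a
                             for a in (1.0, 0.05) * 5])
amps = frame_features(speechlike, sr)
print(amps.shape, amps.std() > 0.2)   # (100,) True
```

A wide spread in the per-frame amplitude distribution is the kind of cue that flags a speech segment, while a narrow spread suggests music.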

232 citations


Journal ArticleDOI
TL;DR: A unified framework for semantic shot classification in sports video, achieving good classification accuracy of 85%-95% on game videos of five typical ball-type sports, with over 5500 shots covering about 8 h.
Abstract: The extensive amount of multimedia information available necessitates content-based video indexing and retrieval methods. Since humans tend to use high-level semantic concepts when querying and browsing multimedia databases, there is an increasing need for semantic video indexing and analysis. For this purpose, we present a unified framework for semantic shot classification in sports video, which has been widely studied due to its tremendous commercial potential. Unlike most existing approaches, which focus on clustering by aggregating shots or key-frames with similar low-level features, the proposed scheme employs supervised learning to perform a top-down video shot classification. Moreover, the supervised learning procedure is constructed on the basis of effective mid-level representations instead of exhaustive low-level features. This framework consists of three main steps: 1) identify video shot classes for each sport; 2) develop a common set of motion, color, and shot-length-related mid-level representations; and 3) perform supervised learning on the given sports video shots. It is observed that for each sport we can predefine a small number of semantic shot classes, about 5-10, which covers 90%-95% of broadcast sports video. We employ nonparametric feature space analysis to map low-level features to mid-level semantic video shot attributes such as dominant object (a player) motion, camera motion patterns, and court shape. Based on the fusion of those mid-level shot attributes, we classify video shots into the predefined shot classes, each of which has clear semantic meaning. With this framework we have achieved good classification accuracy of 85%-95% on the game videos of five typical ball-type sports (i.e., tennis, basketball, volleyball, soccer, and table tennis) with over 5500 shots covering about 8 h. With correctly classified sports video shots, further structural and temporal analysis, such as event detection, highlight extraction, video skimming, and table-of-contents generation, will be greatly facilitated.

210 citations


Journal ArticleDOI
TL;DR: This comment demonstrates that this watermarking algorithm is fundamentally flawed in that the extracted watermark is not the embedded watermark but is determined by the reference watermark, which biases the false positive detection rate.
Abstract: In a recent paper by Tan and Liu, a watermarking algorithm for digital images based on singular value decomposition (SVD) is proposed. This comment demonstrates that this watermarking algorithm is fundamentally flawed in that the extracted watermark is not the embedded watermark but is determined by the reference watermark. The reference watermark generates the pair of SVD matrices employed in the watermark detector. In the watermark detection stage, the fact that the employed SVD matrices depend on the reference watermark biases the false positive detection rate such that it has a probability of one. Hence, any reference watermark that is being searched for in an arbitrary image can be found. Both theoretical analysis and experimental results are given to support our conclusion.

186 citations


Journal ArticleDOI
TL;DR: Two cross-diamond-hexagonal search (CDHS) algorithms, which differ from each other in the sizes of their hexagonal search patterns, are proposed; experimental results show that the proposed CDHSs perform faster than the diamond search (DS) by about 144% and the cross-diamond search (CDS) by about 73%, while similar prediction quality is maintained.
Abstract: We propose two cross-diamond-hexagonal search (CDHS) algorithms, which differ from each other in the sizes of their hexagonal search patterns. These algorithms basically employ two cross-shaped search patterns consecutively in the very beginning steps and then switch to diamond-shaped patterns. To further reduce the number of checking points, two pairs of hexagonal search patterns are proposed in conjunction with candidates found at diamond corners. Experimental results show that the proposed CDHSs perform faster than the diamond search (DS) by about 144% and the cross-diamond search (CDS) by about 73%, while similar prediction quality is still maintained.
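The pattern-switching idea is easy to sketch for block motion estimation: probe a small set of offsets around the current best vector, move to the best probe, and repeat until the center wins, switching to a wider pattern between stages. The sketch below uses a cross pattern followed by a large diamond and omits the hexagonal refinement stage; the offsets and block size are illustrative, not the paper's exact patterns.

```python
import numpy as np

# Search-point offsets (dy, dx) for the two stages.
CROSS = [(0, 0), (-1, 0), (1, 0), (0, -1), (0, 1)]
DIAMOND = [(0, 0), (-2, 0), (2, 0), (0, -2), (0, 2),
           (-1, -1), (-1, 1), (1, -1), (1, 1)]

def sad(cur, ref, y, x, dy, dx, bs=8):
    """Sum of absolute differences for one candidate motion vector."""
    ry, rx = y + dy, x + dx
    if ry < 0 or rx < 0 or ry + bs > ref.shape[0] or rx + bs > ref.shape[1]:
        return np.inf
    return np.abs(cur[y:y+bs, x:x+bs].astype(int)
                  - ref[ry:ry+bs, rx:rx+bs].astype(int)).sum()

def pattern_search(cur, ref, y, x, bs=8):
    """Cross-shaped steps first, then diamond steps, as in the
    cross-diamond family (hexagonal refinement omitted for brevity)."""
    my, mx = 0, 0
    for pattern in (CROSS, DIAMOND):
        while True:
            costs = [(sad(cur, ref, y, x, my + dy, mx + dx, bs), dy, dx)
                     for dy, dx in pattern]
            best = min(costs)
            if (best[1], best[2]) == (0, 0):
                break                     # center is best: pattern converged
            my, mx = my + best[1], mx + best[2]
    return my, mx

ref = np.zeros((32, 32)); ref[10:18, 12:20] = 255  # bright block in reference
cur = np.zeros((32, 32)); cur[13:21, 14:22] = 255  # same block, shifted
print(pattern_search(cur, ref, 13, 14))   # (-3, -2): the true shift
```

The speed gains reported in the abstract come from how few SAD evaluations such pattern walks need compared with an exhaustive window search.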

179 citations


Journal ArticleDOI
TL;DR: An overview of DIA is provided, its use in multimedia applications is described, and some of the ongoing activities in MPEG on extending DIA for use in rights governed environments are reported on.
Abstract: MPEG-21 Digital Item Adaptation (DIA) has recently been finalized as part of the MPEG-21 Multimedia Framework. DIA specifies metadata for assisting the adaptation of Digital Items according to constraints on the storage, transmission and consumption, thereby enabling various types of quality of service management. This paper provides an overview of DIA, describes its use in multimedia applications, and reports on some of the ongoing activities in MPEG on extending DIA for use in rights governed environments.

Journal ArticleDOI
TL;DR: This paper proposes a time-spread echo as an alternative to the single echo in conventional echo hiding, and shows good imperceptibility and robustness against typical signal processing.
Abstract: Conventional watermarking techniques based on echo hiding provide many benefits, but also have several disadvantages, for example, a lenient decoding process and weakness against multiple-encoding attacks. In this paper, to improve on these weak points of conventional echo hiding, we propose a time-spread echo as an alternative to the single echo. Spreading an echo in the time domain is achieved by using pseudonoise (PN) sequences. By spreading the echo, the amplitude of each echo can be reduced, i.e., the energy of each echo becomes small, so that the distortion induced by watermarking is imperceptible to humans, while the decoding performance of the embedded watermarks is better maintained than in conventional echo hiding. This is shown by computer simulations in which several parameters, such as the amplitude and length of the PN sequences and the analysis window length, were varied. Robustness against typical signal processing was also evaluated in these simulations and showed fair performance. Results of a listening test using some pieces of music showed good imperceptibility.
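The embedding step can be sketched as follows: instead of one large echo at a fixed delay, the host is filtered by a ±1 PN sequence and a small-amplitude copy of that spread signal is added at a secret delay. The white-noise host, sequence length, and the simple correlation-based delay check below are illustrative assumptions (the paper decodes via cepstral analysis).

```python
import numpy as np

rng = np.random.default_rng(7)
N, pn_len, delay, alpha = 4096, 63, 150, 0.05

pn = rng.choice([-1.0, 1.0], size=pn_len)   # secret PN sequence
host = rng.standard_normal(N)               # stand-in for the audio host

# Embed: add a PN-spread echo -- many tiny echoes instead of one large one.
spread = np.convolve(host, pn)[:N]          # host filtered by the PN kernel
marked = host.copy()
marked[delay:] += alpha * spread[: N - delay]

# Detect (knowing the PN sequence): correlate the marked signal with the
# PN-filtered host; the peak lag reveals the embedded delay.
corr = np.array([np.dot(marked[k:], spread[: N - k]) for k in range(400)])
print(int(np.argmax(corr)))   # 150
```

Each individual echo tap has amplitude alpha, which is why the distortion stays small even though the total embedded energy supports reliable detection.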

Journal ArticleDOI
TL;DR: An effective way to model the textures found in a given music signal is described, and it is shown that such timbre models provide new solutions to many issues traditionally encountered in music signal processing and music information retrieval.
Abstract: Electronic Music Distribution is in need of robust and automatically extracted music descriptors. An important attribute of a piece of polyphonic music is what is commonly referred to as "the way it sounds". While there has been a large quantity of research done to model the timbre of individual instruments, little work has been done to analyze "real world" timbre mixtures such as the ones found in popular music. In this paper, we present our research about such "polyphonic timbres". We describe an effective way to model the textures found in a given music signal, and show that such timbre models provide new solutions to many issues traditionally encountered in music signal processing and music information retrieval. Notably, we describe their applications for music similarity, segmentation and pattern induction.

Journal ArticleDOI
TL;DR: This paper addresses issues that arise in copyright protection systems of digital images, which employ blind watermark verification structures in the discrete cosine transform (DCT) domain, by designing a new processor for blind watermark detection using the Cauchy member of the alpha-stable family.
Abstract: This paper addresses issues that arise in copyright protection systems of digital images, which employ blind watermark verification structures in the discrete cosine transform (DCT) domain. First, we observe that statistical distributions with heavy algebraic tails, such as the alpha-stable family, are in many cases more accurate modeling tools for the DCT coefficients of JPEG-analyzed images than families with exponential tails such as the generalized Gaussian. Motivated by our modeling results, we then design a new processor for blind watermark detection using the Cauchy member of the alpha-stable family. The Cauchy distribution is chosen because it is the only non-Gaussian symmetric alpha-stable distribution that exists in closed form and also because it leads to the design of a nearly optimum detector with robust detection performance. We analyze the performance of the new detector in terms of the associated probabilities of detection and false alarm and we compare it to the performance of the generalized Gaussian detector by performing experiments with various test images.
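Under a Cauchy model the blind detector reduces to a closed-form log-likelihood ratio per coefficient. The sketch below embeds a bipolar watermark additively in synthetic Cauchy-distributed "DCT coefficients" (the parameters gamma and a, the additive embedding, and the inverse-CDF sampling are illustrative assumptions) and checks that the LLR is larger for the embedded watermark than for an unrelated one.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, a, n = 1.0, 0.5, 5000

# Cauchy-distributed stand-ins for mid-frequency DCT coefficients,
# sampled via the inverse CDF.
dct = gamma * np.tan(np.pi * (rng.random(n) - 0.5))
wmk = rng.choice([-1.0, 1.0], size=n)   # bipolar watermark
marked = dct + a * wmk                  # additive embedding

def cauchy_llr(y, w, gamma=1.0, a=0.5):
    """Log-likelihood ratio for 'w embedded' vs. 'no watermark' under a
    Cauchy model of the coefficients: sum of per-coefficient log ratios."""
    return float(np.sum(np.log((gamma**2 + y**2)
                               / (gamma**2 + (y - a * w)**2))))

other = rng.choice([-1.0, 1.0], size=n)   # an unrelated watermark
print(cauchy_llr(marked, wmk) > cauchy_llr(marked, other))   # True
```

The heavy algebraic tails of the Cauchy density keep the log ratio bounded for outlier coefficients, which is the robustness property the abstract attributes to this detector.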

Journal ArticleDOI
TL;DR: This paper addresses the challenge of automatically extracting highlights from sports TV broadcasts by developing a generic method of highlight extraction that does not require models for the events thought to be interpreted by users as highlights.
Abstract: This paper addresses the challenge of automatically extracting the highlights from sports TV broadcasts. In particular, we are interested in finding a generic method of highlight extraction that does not require the development of models for the events that are thought to be interpreted by the users as highlights. Instead, we search for highlights in those video segments that are expected to excite the users most. It is realistic to assume that a highlighting event induces a steady increase in a user's excitement, as compared to other, less interesting events. We mimic the expected variations in a user's excitement by observing the temporal behavior of selected audiovisual low-level features and the editing scheme of a video. Relations between this noncontent information and the evoked excitement are drawn partly from psychophysiological research and partly from analyzing live-video directing practice. The expected variations in a user's excitement are represented by the excitement time curve, which is subsequently filtered in an adaptive way to extract the highlights in the prespecified total length and in view of the preferences regarding highlight strength: extraction can be performed with variable sensitivity to capture a few "strong" highlights or more "less strong" ones. We evaluate and discuss the performance of our method on the case study of soccer TV broadcasts.

Journal ArticleDOI
TL;DR: The results show that semantic video indexing significantly benefits from using the TIME framework; three different machine learning techniques are compared, i.e., C4.5 decision tree, maximum entropy, and support vector machine.
Abstract: We propose the time interval multimedia event (TIME) framework as a robust approach for classification of semantic events in multimodal video documents. The representation used in TIME extends the Allen temporal interval relations and allows for proper inclusion of context and synchronization of the heterogeneous information sources involved in multimodal video analysis. To demonstrate the viability of our approach, it was evaluated on the domains of soccer and news broadcasts. For automatic classification of semantic events, we compare three different machine learning techniques, i.e., the C4.5 decision tree, maximum entropy, and the support vector machine. The results show that semantic video indexing significantly benefits from using the TIME framework.
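The Allen temporal interval relations that TIME builds on are simple to state in code: each pair of intervals falls into exactly one of 13 qualitative relations. The sketch below classifies a few of them (an illustrative subset, not TIME's extended relation set).

```python
def allen_relation(a, b):
    """Classify the relation between half-open intervals a=(s,e) and b=(s,e).
    Only a few of Allen's 13 relations are distinguished here, as used for
    synchronizing multimodal evidence (illustrative subset)."""
    (s1, e1), (s2, e2) = a, b
    if e1 < s2:
        return "before"
    if e1 == s2:
        return "meets"
    if s1 < s2 < e1 < e2:
        return "overlaps"
    if s1 >= s2 and e1 <= e2:
        return "during"   # lumping equals/starts/finishes in for brevity
    return "other"

# A camera-flash event occurring inside an "excited speech" segment:
print(allen_relation((3, 4), (2, 8)))   # during
```

In the TIME setting, such relations between feature intervals from different modalities become the inputs to the event classifiers.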

Journal ArticleDOI
TL;DR: A system for automatically extracting the region of interest (ROI) and performing virtual camera control based on panoramic video is presented, targeting applications such as classroom lectures and video conferencing.
Abstract: We present a system for automatically extracting the region of interest (ROI) and performing virtual camera control based on panoramic video. It targets applications such as classroom lectures and video conferencing. For capturing panoramic video, we use the FlyCam system, which produces high-resolution, wide-angle video by stitching video images from multiple stationary cameras. To generate conventional video, a region of interest can be cropped from the panoramic video. We propose methods for ROI detection, tracking, and virtual camera control that work in both the uncompressed and compressed domains. The ROI is located from motion and color information in the uncompressed domain and from macroblock information in the compressed domain, and is tracked using a Kalman filter. This results in virtual camera control that simulates human-controlled video recording. The system has no physical camera motion, and the virtual camera parameters are readily available for video indexing.
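The Kalman-filter tracking step can be sketched with a constant-velocity model over the ROI center. One coordinate is shown for brevity (x and y are filtered alike); the process and measurement noise levels are illustrative assumptions.

```python
import numpy as np

def kalman_track(measurements, dt=1.0, q=1e-3, r=1.0):
    """Constant-velocity Kalman filter over noisy ROI center positions."""
    F = np.array([[1.0, dt], [0.0, 1.0]])   # state transition: [pos, vel]
    H = np.array([[1.0, 0.0]])              # we only measure position
    Q = q * np.eye(2)                       # process noise covariance
    R = np.array([[r]])                     # measurement noise covariance
    x = np.array([[measurements[0]], [0.0]])
    P = np.eye(2)
    out = []
    for z in measurements:
        x, P = F @ x, F @ P @ F.T + Q                   # predict
        K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)    # Kalman gain
        x = x + K @ (np.array([[z]]) - H @ x)           # update
        P = (np.eye(2) - K @ H) @ P
        out.append(float(x[0, 0]))
    return out

true = np.linspace(0, 30, 31)   # ROI center drifting steadily right
noisy = true + np.random.default_rng(1).normal(0, 1.0, true.shape)
est = kalman_track(noisy)       # smoothed trajectory for the virtual camera
```

Driving the virtual crop window from the filtered trajectory, rather than the raw detections, is what gives the camera motion its human-operator smoothness.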

Journal ArticleDOI
TL;DR: A feature classification technique, based on the analysis of two statistical properties in the spatial and DCT domains, is proposed to blindly determine the existence of hidden messages in an image; a nonlinear neural classifier is adopted for effective class separation.
Abstract: In contrast to steganography, steganalysis is focused on detecting (the main goal of this research), tracking, extracting, and modifying secret messages transmitted through a covert channel. In this paper, a feature classification technique, based on the analysis of two statistical properties in the spatial and DCT domains, is proposed to blindly (i.e., without knowledge of the steganographic schemes) determine the existence of hidden messages in an image. To be effective in class separation, a nonlinear neural classifier was adopted. For evaluation, a database composed of 2088 plain and stego images (generated by using six different embedding schemes) was established. Based on this database, extensive experiments were conducted to prove the feasibility and diversity of the proposed system. It was found that the proposed system: 1) achieves a positive-detection rate above 90%; 2) is not limited to the detection of a particular steganographic scheme; 3) is capable of detecting stego images with an embedding rate as low as 0.01 bpp; and 4) remains effective on plain images that have undergone low-pass filtering, sharpening, or JPEG compression.

Journal ArticleDOI
TL;DR: A fast object tracking algorithm that predicts the object contour using motion vector information and is computationally superior to existing region-based methods for object tracking is proposed.
Abstract: We propose a fast object tracking algorithm that predicts the object contour using motion vector information. The segmentation step common in region-based tracking methods is avoided, except for the initialization of the object. Tracking is achieved by predicting the object boundary using block motion vectors followed by updating the contour using occlusions/disocclusion detection. An adaptive block-based approach has been used for estimating motion between frames. An efficient modulation scheme is used to control the gap between frames used for motion estimation. The algorithm for detecting disocclusion proceeds in two steps. First, uncovered regions are estimated from the displaced frame difference. These uncovered regions are classified into actual disocclusions and false alarms by observing the motion characteristics of uncovered regions. Occlusion and disocclusion are considered as dual events and this relationship is explained in detail. The algorithm for detecting occlusion is developed by modifying the disocclusion detection algorithm in accordance with the duality principle. The overall tracking algorithm is computationally superior to existing region-based methods for object tracking. The immediate applications of the proposed tracking algorithm are video compression using MPEG-4 and content retrieval based on standards like H.264. Preliminary simulation results demonstrate the performance of the proposed algorithm.

Journal ArticleDOI
TL;DR: A theoretical framework for the linear collusion analysis of watermarked digital video sequences is presented, and a new theorem is derived equating a definition of statistical invisibility, collusion-resistance, and two practical watermark design rules that play a key role in the subsequent development of a novel collusion-resistant video watermarking algorithm.
Abstract: We present a theoretical framework for the linear collusion analysis of watermarked digital video sequences, and derive a new theorem equating a definition of statistical invisibility, collusion-resistance, and two practical watermark design rules. The proposed framework is simple and intuitive; the basic processing unit is the video frame and we consider second-order statistical descriptions of their temporal inter-relationships. Within this analytical setup, we define the linear frame collusion attack, the analytic notion of a statistically invisible video watermark, and show that the latter is an effective counterattack against the former. Finally, to show how the theoretical results detailed in this paper can easily be applied to the construction of collusion-resistant video watermarks, we encapsulate the analysis into two practical video watermark design rules that play a key role in the subsequent development of a novel collusion-resistant video watermarking algorithm discussed in a companion paper.

Journal ArticleDOI
TL;DR: The concepts behind the metadata and constraint specifications that act as interfaces to the decision-taking component of an adaptation engine in MPEG-21 Part 7 are presented; universal pattern-search-based methods for processing the information in these tools to make decisions are shown, and some adaptation use cases where these tools can be used are described.
Abstract: In order to cater to the diversity of terminals and networks, efficient, and flexible adaptation of multimedia content in the delivery path to end consumers is required. To this end, it is necessary to associate the content with metadata that provides the relationship between feasible adaptation choices and various media characteristics obtained as a function of these choices. Furthermore, adaptation is driven by specification of terminal, network, user preference or rights based constraints on media characteristics that are to be satisfied by the adaptation process. Using the metadata and the constraint specification, an adaptation engine can take an appropriate decision for adaptation, efficiently and flexibly. MPEG-21 Part 7 entitled Digital Item Adaptation standardizes among other things the metadata and constraint specifications that act as interfaces to the decision-taking component of an adaptation engine. This paper presents the concepts behind these tools in the standard, shows universal methods based on pattern search to process the information in the tools to make decisions, and presents some adaptation use cases where these tools can be used.

Journal ArticleDOI
TL;DR: A Dynamic TCP-friendly AIMD (DTAIMD) algorithm is proposed, and extensive simulation results are given to verify the derived necessary and sufficient condition and to demonstrate the performance of the proposed DTAIMD algorithm.
Abstract: In this paper, the performance of TCP-friendly generic AIMD (Additive Increase and Multiplicative Decrease) algorithms for Web-based playback and multirate multimedia applications is investigated. The necessary and sufficient TCP-friendly condition is derived, and the effectiveness and responsiveness of AIMD are studied. Due to practical implications, a Dynamic TCP-friendly AIMD (DTAIMD) algorithm is proposed. Extensive simulation results are given to verify the derived necessary and sufficient condition, and to demonstrate the performance of the proposed DTAIMD algorithm.
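The generic AIMD family is parameterized by an additive increase alpha and a multiplicative decrease beta. A minimal sketch of the window dynamics follows; the loss pattern and the alpha = 4(1 - beta^2)/3 TCP-friendly pairing used for the gentler variant are illustrative.

```python
def aimd(alpha, beta, losses, w0=1.0, steps=20):
    """Generic AIMD congestion window: +alpha per loss-free RTT,
    *beta on a loss event. TCP is the (alpha=1, beta=0.5) member."""
    w, trace = w0, []
    for t in range(steps):
        w = w * beta if t in losses else w + alpha
        trace.append(w)
    return trace

# Standard TCP vs. a gentler-decrease member of the family; the pairing
# alpha = 4(1 - beta**2)/3 keeps the variant roughly TCP-friendly, which
# suits smoother playback rates for multimedia.
tcp = aimd(1.0, 0.5, losses={10})
gentle = aimd(4 * (1 - 0.875**2) / 3, 0.875, losses={10})
print(tcp[9], tcp[10])   # 11.0 5.5
```

The gentler member halves its rate far less abruptly on loss, which is exactly the smoothness playback applications want while still sharing bandwidth fairly with TCP.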

Journal ArticleDOI
B. Ko1, Hyeran Byun1
TL;DR: This work proposes adaptive circular filters for semantic image segmentation, based on both Bayes' theorem and the texture distribution of the image, extracts optimal feature vectors from the segmented regions, and applies them to a stepwise Boolean AND matching scheme.
Abstract: We present our region-based image retrieval tool, Finding Region In the Picture (FRIP), which is able to accommodate, to the extent possible, region scaling, rotation, and translation. Our goal is to develop an effective retrieval system that overcomes a few limitations of existing systems. To do this, we propose adaptive circular filters for semantic image segmentation, based on both Bayes' theorem and the texture distribution of the image. In addition, to decrease the computational complexity without losing accuracy in the search results, we extract optimal feature vectors from the segmented regions and apply them to our stepwise Boolean AND matching scheme. The experimental results using real-world images show that our system can indeed improve retrieval performance compared to other global-property-based or region-of-interest-based image retrieval methods.

Journal ArticleDOI
TL;DR: This paper proposes to generate the watermark from the original image and owner's logo with a one-way function to defend against a counterfeiting attack on an SVD-based ownership watermarking scheme.
Abstract: This paper proposes a counterfeiting attack on an SVD-based ownership watermarking scheme. In the proposed attack, the adversary can claim the rightful ownership of any image by fabricating a bogus "original" image and meaningful logo. To defend against this attack, this paper proposes to generate the watermark from the original image and owner's logo with a one-way function.
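The defense can be sketched as follows: derive the watermark from a one-way hash of the original image together with the owner's logo, so an adversary cannot work backwards from a desired watermark to a fabricated "original". A minimal illustration; SHA-256 and the seed-based bit expansion are our choices here, since the abstract only requires some one-way function:

```python
import hashlib
import numpy as np

def generate_watermark(image, logo, shape):
    """Derive a binary watermark from the original image and the owner's
    logo via a one-way (hash) function; any change to either input yields
    an unrelated watermark, blocking the counterfeiting attack."""
    digest = hashlib.sha256(image.tobytes() + logo.tobytes()).digest()
    # Expand the hash into enough pseudo-random bits by seeding a generator.
    rng = np.random.default_rng(int.from_bytes(digest[:8], "big"))
    return rng.integers(0, 2, size=shape, dtype=np.uint8)

image = np.arange(64, dtype=np.uint8).reshape(8, 8)   # toy "original" image
logo = np.ones((4, 4), dtype=np.uint8)                # toy owner's logo
wm = generate_watermark(image, logo, shape=(8, 8))
```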

Journal ArticleDOI
TL;DR: Results show that the proposed algorithm significantly outperforms other techniques by several dBs in peak signal-to-noise ratio (PSNR), provides good visual quality, and has a rather low complexity, which makes it possible to perform real-time operation with reasonable computational resources.
Abstract: In low bit-rate packet-based video communications, video frames may be so small that each frame fits in the payload of a single network packet; packet losses then correspond to whole-frame losses, to which existing error concealment algorithms are badly suited and generally not applicable. In this paper, we deal with the problem of concealing whole-frame losses and propose a novel technique capable of handling this very critical case. The proposed technique presents two other major innovations with respect to the state of the art: i) it applies optical flow estimation to error concealment and ii) it performs multiframe estimation, thus optimally exploiting the multiple-reference-frame buffer featured by the most modern video coders such as H.263+ and H.264. If data partitioning is employed, e.g., by sending headers, motion vectors, and coding modes in prioritized packets, as can be done in the DiffServ network model, the algorithm can exploit the motion vectors to improve the error concealment results. The algorithm has been embedded in the H.264 test model software and tested under both independent and correlated packet loss models with parameters typical of the wireless environment. Results show that the proposed algorithm significantly outperforms other techniques by several dBs in peak signal-to-noise ratio (PSNR), provides good visual quality, and has rather low complexity, which makes real-time operation possible with reasonable computational resources.
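The core idea of concealing a whole lost frame from motion information can be sketched very crudely as block-based motion extrapolation: each block reuses a motion vector from the previous frame and is filled by copying the correspondingly shifted area of that frame. This is a heavily simplified stand-in for the paper's optical-flow, multiframe estimation; the block size and array shapes are illustrative:

```python
import numpy as np

def conceal_lost_frame(prev_frame, prev_mvs, block=8):
    """Conceal an entirely lost frame: fill every block from the previous
    decoded frame, shifted by that block's motion vector (dy, dx)."""
    h, w = prev_frame.shape
    out = np.empty_like(prev_frame)
    for by in range(0, h, block):
        for bx in range(0, w, block):
            dy, dx = prev_mvs[by // block, bx // block]
            sy = np.clip(by + dy, 0, h - block)   # clamp to frame borders
            sx = np.clip(bx + dx, 0, w - block)
            out[by:by+block, bx:bx+block] = prev_frame[sy:sy+block, sx:sx+block]
    return out

prev = np.tile(np.arange(32, dtype=np.uint8), (32, 1))  # toy previous frame
mvs = np.zeros((4, 4, 2), dtype=int)   # zero motion: repeat previous frame
mvs[0, 0] = (0, 8)                     # except the top-left block moves right
concealed = conceal_lost_frame(prev, mvs)
```

With data partitioning, the motion vectors of the lost frame itself may survive in a prioritized packet and can replace the extrapolated ones, which is how the algorithm above exploits DiffServ-style prioritization.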

Journal ArticleDOI
TL;DR: InsightVideo is introduced, a video analysis and retrieval system, which joins video content hierarchy, hierarchical browsing and retrieval for efficient video access and introduces a video similarity evaluation scheme at different levels.
Abstract: Hierarchical video browsing and feature-based video retrieval are two standard methods for accessing video content. Very little research, however, has addressed the benefits of integrating these two methods for more effective and efficient video content access. In this paper, we introduce InsightVideo, a video analysis and retrieval system that joins a video content hierarchy, hierarchical browsing, and retrieval for efficient video access. We propose several video processing techniques to organize the content hierarchy of the video. We first apply a camera motion classification and key-frame extraction strategy that operates in the compressed domain to extract video features. Then, shot grouping, scene detection, and pairwise scene clustering strategies are applied to construct the video content hierarchy. We introduce a video similarity evaluation scheme at different levels (key-frame, shot, group, scene, and video). By integrating the video content hierarchy and the video similarity evaluation scheme, hierarchical video browsing and retrieval are seamlessly integrated for efficient content access. We construct a progressive video retrieval scheme to refine user queries through the interactions of browsing and retrieval. Experimental results and comparisons of camera motion classification, key-frame extraction, scene detection, and video retrieval are presented to validate the effectiveness and efficiency of the proposed algorithms and the performance of the system.

Journal ArticleDOI
TL;DR: The proposed quantitative method is found to fit closely to subjective ratings by human observers based on preliminary experimental results, and an experimental strategy for verifying and fitting a quantitative model that estimates 3-D perceptual quality is proposed.
Abstract: Many factors, such as the number of vertices and the resolution of texture, can affect the display quality of three-dimensional (3-D) objects. When the resources of a graphics system are not sufficient to render the ideal image, degradation is inevitable. It is, therefore, important to study how individual factors will affect the overall quality, and how the degradation can be controlled given limited resources. In this paper, the essential factors determining the display quality are reviewed. We then integrate two important ones, resolution of texture and resolution of wireframe, and use them in our model as a perceptual metric. We assess this metric using statistical data collected from a 3-D quality evaluation experiment. The statistical model and the methodology to assess the display quality metric are discussed. A preliminary study of the reliability of the estimates is also described. The contribution of this paper lies in: 1) determining the relative importance of wireframe versus texture resolution in perceptual quality evaluation and 2) proposing an experimental strategy for verifying and fitting a quantitative model that estimates 3-D perceptual quality. The proposed quantitative method is found to fit closely to subjective ratings by human observers based on preliminary experimental results.

Journal ArticleDOI
TL;DR: A new and generic rate-distortion-complexity model is proposed that can generate DIA descriptions for image and video decoding algorithms running on various hardware architectures and explicitly model the complexity involved in decoding a bitstream by a generic receiver.
Abstract: Existing research on Universal Multimedia Access has mainly focused on adapting multimedia to the network characteristics while overlooking the receiver capabilities. In contrast, Part 7 of the MPEG-21 standard, entitled Digital Item Adaptation (DIA), defines description tools to guide the multimedia adaptation process based on both the network conditions and the available receiver resources. In this paper, we propose a new and generic rate-distortion-complexity model that can generate such DIA descriptions for image and video decoding algorithms running on various hardware architectures. The novelty of our approach is in virtualizing complexity, i.e., we explicitly model the complexity involved in decoding a bitstream by a generic receiver. This generic complexity is translated dynamically into "real" complexity, which is architecture-specific. The receivers can then negotiate with the media server/proxy the transmission of a bitstream having a desired complexity level based on their resource constraints. Hence, unlike in previous streaming systems, multimedia transmission can be optimized in an integrated rate-distortion-complexity setting by minimizing the incurred distortion under joint rate-complexity constraints.

Journal ArticleDOI
Xin Wang, T. DeMartini, B. Wragg, Muthukrishnan Paramasivam1, C. Barlas 
TL;DR: An overview of the MPEG-21 Rights Expression Language in terms of its data model, expressiveness, authorization model, structure for extensibility and profiling, and usages in digital media, trust management, and web services is provided.
Abstract: The MPEG-21 Rights Expression Language (REL) is an XML-based language for digital rights management (DRM), providing a universal method for specifying rights and conditions associated with the distribution and use of assets such as content, resources, and services. Evolved from the eXtensible rights Markup Language (XrML), the REL facilitates the creation of an open DRM architecture for managing and protecting these assets. As a general-purpose rights expression language, the REL is agnostic to types of assets, platforms, and media, and is expressive enough to support applications even beyond DRM, including privacy protection. It also contains additional capabilities in the areas of extensibility, security, trust management, and life-cycle management of rights. This article provides an overview of the REL in terms of its data model, expressiveness, authorization model, structure for extensibility and profiling, and usages in digital media, trust management, and web services. To support the REL and provide extensive semantics for the management of rights, MPEG-21 also defined a Rights Data Dictionary (RDD). Building on earlier work, the MPEG-21 RDD specifies a methodology and structure for the dictionary. The specification defines a core set of terms and provides a mechanism for the introduction of further terms through a registration authority. The RDD also supports the mapping of terms from different namespaces.

Journal ArticleDOI
TL;DR: The proposed adaptive rule is more robust in the presence of unreliable modalities and outperforms the hard-level max rule and the soft-level weighted summation rule, provided that the employed reliability measure is effective in assessing classifier decisions.
Abstract: We present a multimodal open-set speaker identification system that integrates information coming from the audio, face, and lip-motion modalities. For the fusion of multiple modalities, we propose a new adaptive cascade rule that favors reliable modality combinations through a cascade of classifiers. The order of the classifiers in the cascade is adaptively determined based on the reliability of each modality combination. A novel reliability measure, which genuinely fits the open-set speaker identification problem, is also proposed to assess the accept-or-reject decisions of a classifier. A formal framework is developed based on the probability of correct decision for analytical comparison of the proposed adaptive rule with other classifier combination rules. The proposed adaptive rule is more robust in the presence of unreliable modalities, and it outperforms the hard-level max rule and the soft-level weighted summation rule, provided that the employed reliability measure is effective in assessing classifier decisions. Experimental results that support this assertion are provided.
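The cascade idea above can be sketched as follows: modality combinations are tried in decreasing order of an estimated reliability; the first sufficiently reliable combination's identification is accepted, otherwise the claim is rejected, which is the open-set outcome. All scores, reliability values, and the threshold below are made-up illustrations rather than the paper's actual measure:

```python
def cascade_identify(scores_by_combo, reliability, threshold=0.6):
    """Adaptive cascade sketch: order modality combinations by estimated
    reliability, accept the decision of the first combination that clears
    the threshold, and reject the claim if none does (open-set case)."""
    order = sorted(scores_by_combo, key=reliability.get, reverse=True)
    for combo in order:
        if reliability[combo] >= threshold:
            scores = scores_by_combo[combo]
            speaker = max(scores, key=scores.get)
            return combo, speaker            # accept: identified speaker
    return None, None                        # reject: unknown / impostor

# Hypothetical per-combination speaker scores and reliability estimates.
scores = {
    "audio+face": {"alice": 0.9, "bob": 0.4},
    "audio+lip":  {"alice": 0.6, "bob": 0.55},
    "face+lip":   {"alice": 0.5, "bob": 0.5},
}
reliability = {"audio+face": 0.8, "audio+lip": 0.5, "face+lip": 0.3}
combo, speaker = cascade_identify(scores, reliability)
```

The robustness claim in the abstract corresponds to the ordering step: a degraded modality (say, noisy audio) lowers the reliability of every combination containing it, pushing those combinations down the cascade.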