Author

Jincheng Huang

Other affiliations: Tsinghua University, Lexmark
Bio: Jincheng Huang is an academic researcher from New York University. The author has contributed to research on image segmentation and hidden Markov models, has an h-index of 11, and has co-authored 13 publications receiving 1,186 citations. Previous affiliations of Jincheng Huang include Tsinghua University and Lexmark.

Papers
Journal Article • DOI
TL;DR: This work describes audio and visual features that can effectively characterize scene content, presents selected algorithms for segmentation and classification, and reviews some testbed systems for video archiving and retrieval.
Abstract: Multimedia content analysis refers to the computerized understanding of the semantic meanings of a multimedia document, such as a video sequence with an accompanying audio track. In a multimedia document, the semantics are embedded in multiple forms that are usually complementary to each other. It is therefore necessary to analyze all types of data: image frames, sound tracks, text that can be extracted from image frames, and spoken words that can be deciphered from the audio track. This usually involves segmenting the document into semantically meaningful units, classifying each unit into a predefined scene type, and indexing and summarizing the document for efficient retrieval and browsing. We review advances in using audio and visual information jointly for accomplishing the above tasks. We describe audio and visual features that can effectively characterize scene content, present selected algorithms for segmentation and classification, and review some testbed systems for video archiving and retrieval. We also describe audio and visual descriptors and description schemes that are being considered by the MPEG-7 standard for multimedia content description.

552 citations

Proceedings Article • DOI
07 Dec 1998
TL;DR: A technique for classifying TV broadcast video with hidden Markov models (HMMs), using clip-based audio features as observation vectors to discriminate five types of TV programs: commercials, basketball games, football games, news reports, and weather forecasts.
Abstract: This paper describes a technique for classifying TV broadcast video using a hidden Markov model (HMM). Here we consider the problem of discriminating five types of TV programs, namely commercials, basketball games, football games, news reports, and weather forecasts. Eight frame-based audio features are used to characterize the low-level audio properties, and fourteen clip-based audio features are extracted from these frame-based features to characterize the high-level audio properties. For each of the five program types, we build an ergodic HMM using the clip-based features as observation vectors. The maximum likelihood method is then used to classify test data with the trained models.
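The per-class HMM scheme described above can be sketched in a few lines. The following is a minimal illustration, not the authors' code: it assumes the Python hmmlearn package and pre-extracted clip-based feature sequences, and the number of HMM states is a guess rather than the paper's value. hmmlearn's default fully connected transition matrix gives the ergodic topology.

from hmmlearn.hmm import GaussianHMM

CLASSES = ["commercial", "basketball", "football", "news", "weather"]

def train_class_models(train_data, n_states=4):
    """Fit one ergodic (fully connected) HMM per program type.
    train_data maps a class name to an (n_clips, n_features) array of
    clip-based observation vectors; n_states=4 is an assumption."""
    models = {}
    for label in CLASSES:
        hmm = GaussianHMM(n_components=n_states, covariance_type="diag")
        hmm.fit(train_data[label])  # default transitions are ergodic
        models[label] = hmm
    return models

def classify(models, observations):
    """Maximum likelihood decision: pick the class whose trained HMM
    assigns the highest log-likelihood to the observed clip features."""
    return max(CLASSES, key=lambda c: models[c].score(observations))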

130 citations

Proceedings Article • DOI
23 Jun 1997
TL;DR: Describes several audio features found effective in distinguishing the audio characteristics of different scene classes; based on these features, a neural net classifier was quite successful in separating audio clips from different TV programs.
Abstract: Analysis and classification of the scene content of a video sequence are very important for content-based indexing and retrieval of multimedia databases. We report our research on using the associated audio information for video scene classification. We describe several audio features that have been found effective in distinguishing audio characteristics of different scene classes. Based on these features, a neural net classifier was quite successful in separating audio clips from different TV programs.
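As a rough illustration of the kind of classifier the paper reports, the sketch below trains a small feed-forward network on clip-level audio feature vectors. It uses scikit-learn's MLPClassifier as a stand-in; the network size, feature set, and preprocessing are assumptions, since the abstract does not specify them.

from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def build_audio_scene_classifier():
    # Scale the audio features, then fit a small feed-forward network.
    return make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000),
    )

# Usage: each row of X_train is a vector of clip-level audio features
# (e.g., volume, pitch, and zero-crossing statistics) and y_train holds
# scene labels such as "news" or "commercial".
# clf = build_audio_scene_classifier()
# clf.fit(X_train, y_train)
# predictions = clf.predict(X_test)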

119 citations

Proceedings Article • DOI
01 Jan 1999
TL;DR: Presents four different methods for integrating audio and visual information for video classification based on a hidden Markov model (HMM): direct concatenation, product HMM, two-stage HMM, and integration by neural network.
Abstract: Along with the advances in multimedia and Internet technology, a huge amount of data, including digital video and audio, is generated daily. Tools for the efficient indexing and retrieval of such data are indispensable. With multi-modal information present in the data, effective integration is necessary and is still a challenging problem. In this paper, we present four different methods for integrating audio and visual information for video classification based on a hidden Markov model (HMM): direct concatenation, product HMM, two-stage HMM, and integration by neural network. Our results show significant improvements over using a single modality.
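Two of the four integration schemes are easy to sketch. The snippet below illustrates direct concatenation (a single HMM over joined audio and visual feature vectors) and a product-style combination in which per-modality log-likelihoods are added under an independence assumption. It assumes the hmmlearn package and frame-aligned features, and omits the two-stage HMM and neural-network variants.

import numpy as np
from hmmlearn.hmm import GaussianHMM

def fit_concat_hmm(audio_feats, visual_feats, n_states=4):
    """Direct concatenation: stack the modalities per frame and train a
    single HMM on the joint feature vectors."""
    joint = np.hstack([audio_feats, visual_feats])  # (n_frames, dA + dV)
    return GaussianHMM(n_components=n_states).fit(joint)

def product_score(audio_hmm, visual_hmm, audio_feats, visual_feats):
    """Product-style decision fusion: if the modalities are treated as
    independent, their class log-likelihoods simply add."""
    return audio_hmm.score(audio_feats) + visual_hmm.score(visual_feats)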

119 citations

Proceedings Article • DOI
04 Oct 1998
TL;DR: Proposes using audio information along with image and motion information to accomplish segmentation at different levels; promising results were obtained with videos digitized from TV programs.
Abstract: A video sequence usually consists of separate scenes, and each scene includes many shots. For video understanding purposes, it is most important to detect scene breaks. To analyze the content of each scene, detection of shot breaks is also required. Usually, a scene break is associated with a simultaneous change of image, motion, and audio characteristics, while a shot break is accompanied only by changes in image, motion, or both. We propose to use audio information along with image and motion information to accomplish segmentation at different levels. Promising results have been obtained with videos digitized from TV programs.
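The two-level decision rule in the abstract reduces to a simple pattern, sketched below with placeholder change scores and thresholds (the paper's actual detectors and features are not reproduced here): an image/motion change alone marks a shot break, while a simultaneous audio change upgrades it to a scene break.

def label_break(image_change, motion_change, audio_change,
                visual_thresh=0.5, audio_thresh=0.5):
    # Scores and thresholds in [0, 1] are illustrative assumptions.
    visual_break = (image_change > visual_thresh
                    or motion_change > visual_thresh)
    if visual_break and audio_change > audio_thresh:
        return "scene break"  # image, motion, and audio change together
    if visual_break:
        return "shot break"   # image and/or motion change only
    return "no break"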

85 citations


Cited by
Patent
01 Feb 1999
TL;DR: An adaptive interface for a programmable system that predicts a desired user function from user history, machine internal status, and context; the prediction is presented for confirmation by the user, and the predictive mechanism is updated based on this feedback.
Abstract: An adaptive interface for a programmable system, for predicting a desired user function, based on user history, as well as machine internal status and context. The apparatus receives an input from the user and other data. A predicted input is presented for confirmation by the user, and the predictive mechanism is updated based on this feedback. Also provided is a pattern recognition system for a multimedia device, wherein a user input is matched to a video stream on a conceptual basis, allowing inexact programming of a multimedia device. The system analyzes a data stream for correspondence with a data pattern for processing and storage. The data stream is subjected to adaptive pattern recognition to extract features of interest to provide a highly compressed representation that may be efficiently processed to determine correspondence. Applications of the interface and system include a video cassette recorder (VCR), medical device, vehicle control system, audio device, environmental control system, securities trading terminal, and smart house. The system optionally includes an actuator for effecting the environment of operation, allowing closed-loop feedback operation and automated learning.
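The predict-confirm-update loop at the heart of this patent can be caricatured in a few lines. The count-based predictor below is purely illustrative; the patent's mechanism also weighs machine status and richer context.

from collections import Counter, defaultdict

class AdaptivePredictor:
    def __init__(self):
        # context -> counts of functions the user confirmed in that context
        self.history = defaultdict(Counter)

    def predict(self, context):
        # Propose the function most often confirmed in this context.
        counts = self.history[context]
        return counts.most_common(1)[0][0] if counts else None

    def feedback(self, context, confirmed_function):
        # Update the predictive mechanism from the user's confirmation.
        self.history[context][confirmed_function] += 1

# Usage: p = AdaptivePredictor(); p.feedback("weekday evening", "record_news")
# p.predict("weekday evening")  # -> "record_news"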

1,182 citations

Journal Article • DOI
TL;DR: This survey aims at providing multimedia researchers with a state-of-the-art overview of fusion strategies, which are used for combining multiple modalities in order to accomplish various multimedia analysis tasks.
Abstract: This survey aims at providing multimedia researchers with a state-of-the-art overview of fusion strategies, which are used for combining multiple modalities in order to accomplish various multimedia analysis tasks. The existing literature on multimodal fusion research is presented through several classifications based on the fusion methodology and the level of fusion (feature, decision, and hybrid). The fusion methods are described from the perspective of the basic concept, advantages, weaknesses, and their usage in various analysis tasks as reported in the literature. Moreover, several distinctive issues that influence a multimodal fusion process, such as the use of correlation and independence, confidence level, contextual information, synchronization between different modalities, and optimal modality selection, are also highlighted. Finally, we present the open issues for further research in the area of multimodal fusion.

1,019 citations

Journal Article • DOI
08 Sep 2003
TL;DR: The main components of audiovisual automatic speech recognition (ASR) are reviewed and novel contributions in two main areas are presented: first, the visual front-end design, based on a cascade of linear image transforms of an appropriate video region of interest, and subsequently, audiovisual speech integration.
Abstract: Visual speech information from the speaker's mouth region has been successfully shown to improve noise robustness of automatic speech recognizers, thus promising to extend their usability in the human computer interface. In this paper, we review the main components of audiovisual automatic speech recognition (ASR) and present novel contributions in two main areas: first, the visual front-end design, based on a cascade of linear image transforms of an appropriate video region of interest, and subsequently, audiovisual speech integration. On the latter topic, we discuss new work on feature and decision fusion combination, the modeling of audiovisual speech asynchrony, and incorporating modality reliability estimates into the bimodal recognition process. We also briefly touch upon the issue of audiovisual adaptation. We apply our algorithms to three multisubject bimodal databases, ranging from small- to large-vocabulary recognition tasks, recorded in both visually controlled and challenging environments. Our experiments demonstrate that the visual modality improves ASR over all conditions and data considered, though less so for visually challenging environments and large vocabulary tasks.
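The modality-reliability idea the paper discusses is often realized as an exponent-weighted combination of stream log-likelihoods. The sketch below shows that standard textbook form; the reliability estimator itself is not taken from the paper.

import numpy as np

def fused_log_likelihood(log_p_audio, log_p_video, audio_reliability):
    """Combine stream scores; audio_reliability in [0, 1], where 1.0
    trusts audio fully (it would typically be lowered in acoustic noise)."""
    lam = float(np.clip(audio_reliability, 0.0, 1.0))
    return lam * log_p_audio + (1.0 - lam) * log_p_video

def recognize(word_scores, audio_reliability):
    """word_scores: {word: (log_p_audio, log_p_video)} per hypothesis;
    return the word with the best fused score."""
    return max(word_scores,
               key=lambda w: fused_log_likelihood(*word_scores[w],
                                                  audio_reliability))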

790 citations

Patent
17 Apr 2002
TL;DR: A multimedia production and distribution system collects or assembles a media production (such as, a news program, television programming, or radio broadcast) from a variety of sources, including television stations and other media hosting facilities.
Abstract: A multimedia production and distribution system collects or assembles a media production (such as a news program, television programming, or radio broadcast) from a variety of sources, including television stations and other media hosting facilities. The media production is categorized and indexed for retrieval and distribution across a wired or wireless network, such as the Internet, to any client, such as a personal computer, television, or personal digital assistant. A user can operate the client to display and interact with the media production, or select various options to customize the transmission or request a standard program. Alternatively, the user can establish a template to generate the media production automatically based on personal preferences. The media production is displayed on the client with various media enhancements to add value to the media production. Such enhancements include graphics, extended play segments, opinion research, and URLs. The enhancements also include advertisements, such as commercials, active banners, and sponsorship buttons. An advertisement reporting system monitors the sale and distribution of advertisements within the network. The advertisements are priced according to factors that measure the likelihood of an advertisement actually being presented or viewed by users most likely to purchase the advertised item or service. The advertisement reporting system also collects metrics to invoice and apportion income derived from the advertisements among the network participants, including a portal host and/or producer of the content.

733 citations
