
Showing papers by "Sheng Tang published in 2007"


01 Jan 2007
TL;DR: An EMD-based bag-of-feature method is proposed to exploit visual/spatial information, and WordNet is utilized to expand the semantic meanings of text to boost the generalization of detectors.
Abstract: We participated in the high-level feature extraction task in TRECVID 2007. This paper describes the details of our system for the task. For feature extraction, we propose an EMD-based bag-of-feature method to exploit visual/spatial information, and utilize WordNet to expand the semantic meanings of text to boost the generalization of detectors. We also explore audio features and extract motion cues in the compressed domain for detecting concepts highly associated with audio/motion. We use the Ordered Weighted Average (OWA) fusion method to combine the SVM-based multi-modal concept detection results. Experimental results show that our methods are effective.
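The OWA fusion step can be sketched compactly. This is a minimal illustration, assuming per-concept detector scores in [0, 1] and position weights summing to 1; the scores and weight values below are invented for the example, not taken from the paper:

```python
def owa_fuse(scores, weights):
    """Ordered Weighted Averaging: sort the detector scores in
    descending order, then weight by *position* rather than by source."""
    assert len(scores) == len(weights)
    assert abs(sum(weights) - 1.0) < 1e-9
    ordered = sorted(scores, reverse=True)
    return sum(w * s for w, s in zip(weights, ordered))

# Hypothetical per-modality SVM scores for one concept on one shot
# (visual, text, audio, motion):
scores = [0.9, 0.4, 0.7, 0.2]
# An "optimistic" weight vector that emphasizes the strongest evidence.
fused = owa_fuse(scores, [0.5, 0.3, 0.15, 0.05])
```

Because the weights attach to ranks rather than to fixed modalities, OWA interpolates between a max and a plain mean depending on how top-heavy the weight vector is.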

43 citations


Proceedings ArticleDOI
01 Feb 2007
TL;DR: A new method for highlight extraction in soccer videos is proposed, based on the observations that the appearance of the goal-mouth indicates a high likelihood of exciting action, and that a highlight is composed of certain types of scene views exhibiting certain transition rules.
Abstract: A new method is proposed for highlight extraction in soccer videos based on goal-mouth detection. This approach is based on the observations that the appearance of the goal-mouth indicates a high likelihood of exciting action in soccer videos, and that a highlight is composed of certain types of scene views which exhibit certain transition rules. To exploit these observations, the goal-mouth is first detected and segmented in soccer videos by the Top-Hat Transform and some domain rules; then, using scene transition rules, highlights are extracted based on the goal-mouth detections. The effectiveness and efficiency of this approach are demonstrated by experimental results on shot detection.
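The white Top-Hat Transform mentioned here is standard grayscale morphology: the image minus its morphological opening, which keeps thin bright structures such as goal-post lines. A minimal NumPy sketch with a square structuring element; the frame below is synthetic, and the paper's domain rules are not reproduced:

```python
import numpy as np

def _morph(img, k, op):
    """Apply min (erosion) or max (dilation) over k-by-k neighborhoods."""
    pad = k // 2
    p = np.pad(img, pad, mode='edge')
    out = np.empty_like(img)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = op(p[i:i + k, j:j + k])
    return out

def white_tophat(img, k=3):
    """Image minus its opening (erosion followed by dilation)."""
    opened = _morph(_morph(img, k, np.min), k, np.max)
    return img - opened

# A dark field with one thin bright vertical line, roughly a goal-post edge:
frame = np.zeros((9, 9))
frame[:, 4] = 1.0
response = white_tophat(frame, k=3)
```

Structures narrower than the structuring element survive the subtraction intact, while broad bright regions (sky, stands) are suppressed, which is why top-hat works as a line detector here.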

19 citations


Proceedings ArticleDOI
17 Sep 2007
TL;DR: This paper proposes a lexicon-guided two-level LDA retrieval framework, which uses HowNet to guide the first-level LDA model's parameter estimation and then constructs the second-level LDA models based on the first level's inference results.
Abstract: Topic-based language models have attracted much attention with the rise of semantic retrieval in recent years. Especially for ASR text with errors, a topic representation is more reasonable than an exact term representation. Among these models, Latent Dirichlet Allocation (LDA) has been noted for its ability to discover latent topic structure, and it is broadly applied in many text-related tasks. But up to now its application in information retrieval (IR) has been limited to being a supplement to the standard document models, and, furthermore, it has been pointed out that directly employing the basic LDA model hurts retrieval performance. In this paper, we propose a lexicon-guided two-level LDA retrieval framework. It uses HowNet to guide the first-level LDA model's parameter estimation, and further constructs the second-level LDA models based on the first level's inference results. We use the TRECVID 2005 ASR collection to evaluate it, and compare it with the vector space model (VSM) and Latent Semantic Indexing (LSI). Our experiments show the proposed method is very competitive.
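Only the final matching step of such topic-based retrieval can be sketched compactly. Assuming documents and the query have already been mapped to topic mixtures by the (two-level, HowNet-guided) LDA inference, ranking reduces to similarity in topic space; the mixtures below are invented for illustration:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def rank_by_topics(query_mix, doc_mixes):
    """Rank document indices by topic-space cosine similarity to the query."""
    sims = [(cosine(query_mix, d), i) for i, d in enumerate(doc_mixes)]
    return [i for _, i in sorted(sims, reverse=True)]

# Hypothetical 3-topic mixtures (each sums to 1):
docs = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]]
order = rank_by_topics([0.7, 0.2, 0.1], docs)
```

Matching in topic space rather than term space is what gives tolerance to ASR errors: a misrecognized word shifts the topic mixture only slightly, while it can miss an exact-term match entirely.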

18 citations


Journal ArticleDOI
TL;DR: A secure and incidental-distortion-tolerant signature method for image authentication based on Hotelling's T-square Statistic via Principal Component Analysis (PCA) of block DCT coefficients, yielding a Structural and Statistical Signature (SSS).
Abstract: In this paper, a secure and incidental-distortion-tolerant signature method for image authentication is proposed. The authentication signature is generated from Hotelling's T-square Statistic (HTS) via Principal Component Analysis (PCA) of block DCT coefficients. The HTS values of all blocks construct a unique and stable "block-edge image", i.e., the Structural and Statistical Signature (SSS). The characteristics of SSS are that it is short, tolerates content-preserving manipulations while remaining sensitive to content-changing attacks, and easily localizes tampering. During signature matching, the Fisher criterion is used to obtain an optimal threshold for automatically and universally distinguishing incidental manipulations from malicious attacks. Moreover, the security of SSS is achieved by encrypting the DCT coefficients with chaotic sequences before PCA. Experiments show that the novel method is effective for authentication.
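The per-block Hotelling statistic can be sketched with plain NumPy. This is the generic T-square computation over PCA scores, not the paper's exact pipeline; the random matrix below merely stands in for (chaotically encrypted) block-DCT coefficients:

```python
import numpy as np

def hotelling_t2(blocks, n_components=3):
    """One T-square value per row: the sum of squared PCA scores,
    each normalized by its component's eigenvalue."""
    X = blocks - blocks.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
    order = np.argsort(eigvals)[::-1][:n_components]
    lam, V = eigvals[order], eigvecs[:, order]
    scores = X @ V                       # principal-component scores
    return np.sum(scores ** 2 / lam, axis=1)

rng = np.random.default_rng(0)
blocks = rng.normal(size=(64, 8))        # 64 blocks, 8 DCT coefficients each
t2 = hotelling_t2(blocks)
```

Normalizing each score by its eigenvalue makes T-square a Mahalanobis-style distance, so the statistic is stable under incidental distortions that barely move a block within the learned coefficient distribution.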

12 citations


Book ChapterDOI
18 Nov 2007
TL;DR: A novel statistical framework is proposed that performs shot segmentation and classification jointly: segmentation exploits the intra-shot character used for classification, while classification exploits the inter-shot character used for segmentation, yielding more accurate results.
Abstract: In this paper, a novel statistical framework is proposed for shot segmentation and classification. The proposed framework segments and classifies shots simultaneously, using the same difference features based on statistical inference. The task of shot segmentation and classification is taken as finding the most probable shot sequence given the feature sequence, and it can be formulated by a conditional probability which can be divided into a shot sequence probability and a feature sequence probability. The shot sequence probability is derived from relations between adjacent shots via a bi-gram model, and the feature sequence probability depends on the inherent character of each shot, modeled by an HMM. Thus, the proposed framework segments shots using the intra-shot character needed for classification, while it classifies shots using the inter-shot character needed for segmentation, which yields more accurate results. Experimental results on soccer and badminton videos are promising, and demonstrate the effectiveness of the proposed framework.
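Finding the most probable shot sequence given these two probability terms is a standard Viterbi decoding. A minimal sketch in log space, where the per-shot HMM likelihoods and bigram transition probabilities are assumed to be given; the toy numbers are invented:

```python
import math

def viterbi(emission_ll, trans_ll, init_ll):
    """emission_ll[t][c]: log-likelihood of shot t under the class-c HMM;
    trans_ll[p][c]: bigram log-probability of class c following class p.
    Returns the most probable class label sequence."""
    T, C = len(emission_ll), len(init_ll)
    dp = [[init_ll[c] + emission_ll[0][c] for c in range(C)]]
    back = []
    for t in range(1, T):
        row, ptr = [], []
        for c in range(C):
            p = max(range(C), key=lambda q: dp[-1][q] + trans_ll[q][c])
            row.append(dp[-1][p] + trans_ll[p][c] + emission_ll[t][c])
            ptr.append(p)
        dp.append(row)
        back.append(ptr)
    path = [max(range(C), key=lambda c: dp[-1][c])]
    for ptr in reversed(back):           # trace back-pointers to recover path
        path.append(ptr[path[-1]])
    return path[::-1]

# Two shot classes, three shots; emissions strongly favor classes 0, 0, 1:
ll = [[0.0, -5.0], [0.0, -5.0], [-5.0, 0.0]]
uniform = [[math.log(0.5)] * 2] * 2
labels = viterbi(ll, uniform, [math.log(0.5)] * 2)
```

With non-uniform transition scores, the bigram term would override weak emission evidence, which is exactly how inter-shot relations help segment ambiguous shots.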

10 citations


01 Jan 2007
TL;DR: The overall framework of the video search and retrieval system for both automated and interactive search is shown; the interactive part focuses on designing a high-performance feedback system in which users can apply several auto-feedback and active learning functions to improve retrieval performance.
Abstract: This paper describes the details of our systems for automated and interactive search in TRECVID 2007. The shift from news video to documentary video this year has prompted a series of changes to the processing techniques developed over the past few years. For the automated search task, we employ our previous query-dependent retrieval, which automatically discovers the query class and query-high-level-features (query-HLF) to fuse the available multimodal features. Different from previous work, our system this year gives more emphasis to visual features such as color, texture and motion in the video source. The reasons are: (a) given the low quality of ASR text and the more visual and motion oriented queries, we expect the visual features to be as discriminating as the text feature; and (b) the appropriate use of motion features is highly effective for queries, as they are able to model intra-frame changes. For the interactive task, we first use the automated search results for user feedback. The user can make use of our intuitive retrieval interface with a variety of relevance feedback techniques to refine the search results. In addition, we introduce motion-icons, which allow users to see a dynamic series of keyframes instead of a single keyframe during assessment. Results show that the approach can help provide better discrimination.

1. INTRODUCTION. The overall framework of our video search and retrieval system for both automated and interactive search is shown in Figure 1. There are two main stages: the auto search stage and the interactive search stage. Retrieval starts with the user query, which can simply be a free text query, or be coupled with image and video (a multimedia query). The auto search first processes the multimedia query and performs the retrieval. The emphasis is on understanding the query to infer the roles of HLF, motion and visual features in query processing.
For the interactive search, the user makes use of the automated search results to indicate whether the results are indeed relevant or otherwise. The emphasis is on designing a high-performance feedback system, in which users can apply several auto-feedback and active learning functions to improve retrieval performance. This year's corpus is Dutch documentary video. The videos are preprocessed and segmented into shots, with the speech track automatically recognized using a commercial automated speech recognition (ASR) engine and translated to English text. As a result of ASR and translation, the quality of the ASR text is quite low. This, coupled with a large number of visual and motion oriented queries, suggests that ASR text may not play a critical role in the retrieval process. In fact, visual and motion information will be as important as text as we move from news video to Dutch documentary video retrieval.

9 citations


Proceedings ArticleDOI
Na Cai1, Ming Li1, Shouxun Lin1, Yongdong Zhang1, Sheng Tang1 
26 Jul 2007
TL;DR: The AP-based weighting scheme, along with the more adaptive formulae in the method, makes it outperform the standard Adaboost algorithm as well as many other fusion methods.
Abstract: We propose an improved fusion method for high-level feature extraction at TRECVID: average-precision-based Adaboost (AP-based Adaboost). The AP-based weighting scheme makes use of both the weight and the rank of each sample, all of which contribute to the final average precision. This weighting scheme, along with the more adaptive formulae in our method, makes it outperform the standard Adaboost algorithm as well as many other fusion methods. Experimental results on the TRECVID 2005 development set show that our method is an effective and relatively robust fusion method.
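The quantity the boosting weights are built around is average precision over a ranked list. A minimal sketch of non-interpolated AP follows; the relevance list is invented, and the full AP-based Adaboost update is not reproduced here:

```python
def average_precision(ranked_relevance):
    """Non-interpolated AP: mean of precision@i over the relevant
    positions of a ranked result list (1 = relevant, 0 = not)."""
    hits, total = 0, 0.0
    for i, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            total += hits / i
    return total / hits if hits else 0.0

# A toy ranking with relevant items at ranks 1 and 3:
ap = average_precision([1, 0, 1, 0])
```

Because AP depends on the rank of every relevant sample, a weighting scheme driven by AP penalizes detectors that push relevant shots down the list, not just detectors that misclassify them, which is the intuition behind replacing the classification-error weight of standard Adaboost.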

8 citations


Proceedings ArticleDOI
26 Jul 2007
TL;DR: Based on video structuralization and human attention analysis, action concept annotation for interest-oriented navigation by viewers is realized, and the effectiveness and generality of the human attention model for action movie analysis are demonstrated.
Abstract: Nowadays, automatic highlight detection for movies is indispensable for video management and browsing. In this paper, we present the formulation of a human attention model and its application to action movie analysis. Based on the relationship between stimuli in the visual and audio modalities and changes in human attention, we construct visual and audio attention sub-models respectively. By integrating both sub-models, the human attention model is formulated, and an attention flow plot is derived to simulate the change of human attention over time. Based on video structuralization and human attention analysis, we realize action concept annotation for interest-oriented navigation by viewers. Large-scale experiments demonstrate the effectiveness and generality of the human attention model for action movie analysis.

7 citations


Proceedings ArticleDOI
10 Sep 2007
TL;DR: This paper presents a novel anchorperson detection algorithm based on the spatio-temporal slice (STS); with STS pattern analysis, clustering and decision fusion, anchorperson shots can be detected for browsing news video.
Abstract: For conveniently navigating and editing news programs, it is very important to segment the video into meaningful units. Effective indexing of news videos can be achieved via anchorperson shots, since an anchorperson shot indicates the occurrence of an upcoming news story. This paper presents a novel anchorperson detection algorithm based on the spatio-temporal slice (STS). With STS pattern analysis, clustering and decision fusion, anchorperson shots can be detected for browsing news video. Large-scale experimental results demonstrate that the algorithm is accurate, robust and effective.

6 citations


Proceedings ArticleDOI
02 Jul 2007
TL;DR: The use of a spatio-temporal visual map (STVM) model to supplement Web video retrieval is described, employing spatio-temporal visual similarity to rerank the text-retrieval results and find new results.
Abstract: The massive amount of multimedia information, especially video, available on the Web requires more precise and interactive retrieval. Current operational video retrieval systems do not make use of implicit visual features but rely only on textual metadata supplied by the user during uploading. This greatly affects retrieval performance, as the metadata may not be comprehensive or consistent. In this paper, we describe the use of a spatio-temporal visual map (STVM) model to supplement Web video retrieval. This is done by employing spatio-temporal visual similarity to rerank the text-retrieval results and find new results. Experimental results on a dynamic Web video corpus show significant improvement based on the STVM model, with good usability scores from human users.

5 citations


Proceedings ArticleDOI
26 Jul 2007
TL;DR: A novel approach to SVM (support vector machine) classification named CGSVM (clustering-guided SVM) is presented, which utilizes the clustering result to select the most informative image samples to be labeled and to optimize the penalty coefficient.
Abstract: SVMs (support vector machines) enable effective image classification for semantic image retrieval. However, training accurate image classifiers in a high-dimensional feature space suffers from the problem of choosing proper training samples. To solve this problem, a novel approach named CGSVM (clustering-guided SVM) is presented, which utilizes the clustering result to select the most informative image samples to be labeled and to optimize the penalty coefficient. Experimental results show that our algorithm achieves higher search accuracy than a regular SVM for semantic image retrieval.
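The sample-selection idea can be sketched as: cluster the unlabeled pool, then hand the annotator one representative per cluster first. This is a simplified reading of the abstract; the feature matrix, labels and centers below are made up, and the centers are assumed to come from any prior clustering step:

```python
import numpy as np

def cluster_representatives(X, labels, centers):
    """For each cluster, return the index of the member nearest its
    centroid -- one simple notion of 'most informative sample to label'."""
    reps = []
    for c, center in enumerate(centers):
        members = np.where(labels == c)[0]
        dists = np.linalg.norm(X[members] - center, axis=1)
        reps.append(int(members[dists.argmin()]))
    return reps

# Toy 2-D features, two clusters, precomputed assignments and centers:
X = np.array([[0.0, 0.0], [0.4, 0.0], [5.0, 5.0], [5.0, 5.6]])
labels = np.array([0, 0, 1, 1])
centers = np.array([[0.0, 0.1], [5.0, 5.5]])
reps = cluster_representatives(X, labels, centers)
```

Labeling one representative per cluster spends the annotation budget across the whole feature space instead of on redundant near-duplicates, which is the intuition behind clustering-guided sample selection.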

Book ChapterDOI
Xuefeng Pan1, Jintao Li1, Shan Ba, Yongdong Zhang1, Sheng Tang1 
09 Jan 2007
TL;DR: The experimental results show that the proposed feature extraction method based on spatiotemporal slice analysis is effective and robust across varying video content and formats.
Abstract: In this paper we propose a novel feature extraction method based on spatiotemporal slice analysis. To date, video features have focused on the character of every single video frame. With our method, the video content is no longer represented by every single frame; the temporal variation of visual information is taken as an important feature of the video. We examined this kind of feature with experiments in this paper. The experimental results show that the proposed feature is effective and robust across varying video content and formats.
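A spatiotemporal slice itself is simple to construct: fix one row (or column) of pixels and stack it across frames, so the resulting image has axes of space and time. A minimal sketch with a synthetic frame stack (the frame sizes are arbitrary):

```python
import numpy as np

def horizontal_slice(frames, row=None):
    """frames: (n_frames, height, width) array.
    Returns an (n_frames, width) slice: one pixel row per frame."""
    frames = np.asarray(frames)
    if row is None:
        row = frames.shape[1] // 2       # default: the middle row
    return frames[:, row, :]

# 10 synthetic 4x6 frames whose middle row brightens over time:
frames = np.zeros((10, 4, 6))
for t in range(10):
    frames[t, 2, :] = t / 10.0
sts = horizontal_slice(frames)
```

Camera and object motion leave characteristic stripe patterns in such a slice, which is why its temporal texture can serve as a video feature without examining every full frame.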

Proceedings ArticleDOI
02 Jul 2007
TL;DR: This paper describes an automated retrieval framework which fuses the multimodal features and event structures present in news video to support precise news video retrieval, employing temporal event clusters to provide additional information during story-level retrieval.
Abstract: Current state-of-the-art news video retrieval systems mainly rely on automated speech recognition (ASR) text to perform retrieval. This paradigm greatly affects retrieval performance, as ASR text alone is not sufficient to provide an accurate representation of the entire news video. In this paper, we describe our automated retrieval framework, which fuses the multimodal features and event structures present in news video to support precise news video retrieval. The contributions of this paper are: (a) we uncover and employ temporal event clusters to provide additional information during story-level retrieval; and (b) we integrate other modality features with text features and incorporate event clusters for pseudo relevance feedback (PRF) in shot-level re-ranking. Experiments performed on the video search task using the TRECVID 2005/06 dataset show that the proposed approach is effective.

Proceedings ArticleDOI
09 Jul 2007
TL;DR: A spatio-temporal visual map (STVM) retrieval system coupled with active learning to support user-centered interactive retrieval of news video segments is proposed.
Abstract: Interactive news video retrieval requires the effective communication between the human searchers and the search engine to locate relevant video segments. We propose a spatio-temporal visual map (STVM) retrieval [1] system coupled with active learning to support user-centered interactive retrieval.

Book ChapterDOI
Xuefeng Pan1, Jintao Li1, Yongdong Zhang1, Sheng Tang1, Juan Cao1 
02 Apr 2007
TL;DR: A robust video content retrieval method based on spatiotemporal features is proposed; the experimental results show that the proposed feature is robust across varying video formats.
Abstract: In this paper a robust video content retrieval method based on spatiotemporal features is proposed. To date, most video retrieval methods use the character of video key frames. Such frame-based methods are not robust enough across different video formats. With our method, the temporal variation of visual information is represented using a spatiotemporal slice. The DCT is then used to extract features from the slice. With this kind of feature, a robust video content retrieval algorithm is developed. The experimental results show that the proposed feature is robust across varying video formats.
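The DCT feature extraction over slice rows can be sketched with a naive DCT-II; a real system would use a fast transform, and the number of coefficients kept below is an arbitrary choice for illustration:

```python
import math

def dct2(x):
    """Naive DCT-II of a 1-D sequence."""
    N = len(x)
    return [sum(x[n] * math.cos(math.pi * k * (2 * n + 1) / (2 * N))
                for n in range(N))
            for k in range(N)]

def slice_feature(slice_rows, keep=4):
    """Keep the first few low-frequency DCT coefficients of each slice
    row -- a compact, format-robust description of the slice."""
    return [dct2(row)[:keep] for row in slice_rows]

# A constant row concentrates all its energy in the DC coefficient:
coeffs = dct2([1.0, 1.0, 1.0, 1.0])
```

Keeping only low-frequency coefficients discards the fine detail that re-encoding and resolution changes corrupt, which is what makes the feature robust across video formats.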

Book ChapterDOI
Juan Cao1, Sheng Tang1, Jintao Li1, Yongdong Zhang1, Xuefeng Pan1 
11 Dec 2007
TL;DR: This paper uses lexicon-guided semantic clustering to effectively remove the noise introduced by news video's additional content, and uses cluster-based LSI to automatically mine the semantic structure underlying the term expression.
Abstract: Many researchers try to utilize the semantic information extracted from visual features to directly realize semantic video retrieval or to supplement automated speech recognition (ASR) text retrieval. But bridging the gap between low-level visual features and semantic content is still a challenging task. In this paper, we study how to effectively use Latent Semantic Indexing (LSI) to improve semantic video retrieval through the ASR texts. The basic LSI method has been shown effective in both traditional text retrieval and noisy ASR text retrieval. In this paper, we further use lexicon-guided semantic clustering to effectively remove the noise introduced by news video's additional content, and use cluster-based LSI to automatically mine the semantic structure underlying the term expression. Tests on the TRECVID 2005 dataset show that the above two enhancements achieve 21.3% and 6.9% improvements in performance over the traditional vector space model (VSM) and the basic LSI respectively.
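The LSI step itself is a truncated SVD of a term-document matrix; the paper applies it per semantic cluster rather than globally. A minimal sketch of the document projection, with an invented toy count matrix:

```python
import numpy as np

def lsi_doc_vectors(term_doc, k=2):
    """Project each document (column) into a k-dimensional latent
    space via truncated SVD: docs_k = (S_k V_k^T)^T."""
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    return (np.diag(s[:k]) @ Vt[:k]).T   # one k-dim vector per document

# 5 terms x 4 documents (toy term counts):
A = np.array([[2., 0., 1., 0.],
              [1., 0., 2., 0.],
              [0., 3., 0., 1.],
              [0., 2., 0., 2.],
              [1., 1., 1., 1.]])
doc_vecs = lsi_doc_vectors(A, k=2)
```

Running this decomposition inside each semantic cluster, rather than over the whole collection, keeps each latent space from being dominated by the off-topic terms that clustering was meant to separate out.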

DOI
An-An Liu1, Sheng Tang1, Yongdong Zhang1, Jintao Li1, Zhaoxuan Yang1 
30 May 2007
TL;DR: Face detection and audio classification are implemented to detect the "face" and "speech" concepts for each shot; by integrating the audiovisual information, the "interview" concept is finally detected.
Abstract: According to the concepts of the Large-Scale Concept Ontology for Multimedia (LSCOM) and the requirements of the 4th task in TRECVID 2006, i.e., rushes exploitation, the "interview" concept is important for rushes content analysis. This paper presents a shot-level "interview" concept detection method. Face detection and audio classification are implemented to detect the "face" and "speech" concepts for each shot. By integrating the audiovisual information, the "interview" concept is finally detected. The method will clearly benefit video editing. Large-scale experimental results demonstrate the accuracy and effectiveness of the proposed method.
