
Showing papers by "Thomas Sikora published in 2004"


Journal ArticleDOI
TL;DR: An MPEG-7-based audio classification and retrieval technique targeted at the analysis of film material is presented; using independent component analysis, it achieves better results than the normalized audio spectrum envelope and principal component analysis in a speaker recognition system.
Abstract: In this paper, we present an MPEG-7-based audio classification and retrieval technique targeted at the analysis of film material. The technique consists of low-level descriptors and high-level description schemes. For the low-level descriptors, low-dimensional features such as the audio spectrum projection, based on audio spectrum basis descriptors, are produced in order to find a balanced tradeoff between reducing dimensionality and retaining maximum information content. High-level description schemes are used to describe the modeling of the reduced-dimension features and the procedure of audio classification and retrieval. A classifier based on continuous hidden Markov models is applied. The sound model state path, which is selected according to the maximum-likelihood model, is stored in an MPEG-7 sound database and used as an index for query applications. Various experiments are presented in which the speaker- and sound-recognition rates are compared for different feature extraction methods. Using independent component analysis, we achieved better results than with the normalized audio spectrum envelope and principal component analysis in a speaker recognition system. In audio classification experiments, audio sounds are classified into selected sound classes in real time with an accuracy of 96%.
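As a rough illustration of the feature extraction described in this abstract, the following sketch (not the authors' implementation; the function name, frame length and parameter choices are assumptions) computes MPEG-7-style audio spectrum projection features by normalizing a log-spectrogram and projecting it onto a reduced SVD/PCA basis:

```python
# Hedged sketch of MPEG-7-style audio spectrum projection (ASP) features.
# A normalized log-spectrogram (NASE) is projected onto a reduced spectral
# basis; here the basis comes from SVD/PCA, while the paper also evaluates
# ICA-derived bases.
import numpy as np
from scipy.signal import stft

def audio_spectrum_projection(x, fs, n_basis=10, frame_len=0.03):
    nperseg = int(frame_len * fs)
    _, _, Z = stft(x, fs, nperseg=nperseg)          # short-time spectra
    power = np.abs(Z.T) ** 2                        # frames x frequency bins
    log_spec = 10.0 * np.log10(power + 1e-10)       # dB scale
    # Normalized Audio Spectrum Envelope: unit-norm rows, norms kept separately.
    norms = np.linalg.norm(log_spec, axis=1, keepdims=True) + 1e-10
    nase = log_spec / norms
    # Basis decomposition (PCA via SVD); keep the first n_basis components.
    _, _, Vt = np.linalg.svd(nase - nase.mean(axis=0), full_matrices=False)
    basis = Vt[:n_basis].T                          # bins x n_basis
    projection = nase @ basis                       # frames x n_basis
    return np.hstack([norms, projection])           # low-dimensional ASP features
```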

97 citations


Proceedings ArticleDOI
17 May 2004
TL;DR: The MPEG-7 ASP features and MFCCs are used to train hidden Markov models (HMM) for individual speakers and sounds for audio segmentation, and the MFCC approach yields a sound/speaker recognition rate superior to MPEG-7 implementations.
Abstract: We evaluate the MPEG-7 audio spectrum projection (ASP) features for general sound recognition performance against the well-established MFCC. The recognition tasks of interest are speaker recognition, sound classification, and segmentation of audio using sound/speaker identification. For sound classification we use three approaches: a direct approach, a hierarchical approach without hints, and a hierarchical approach with hints. For audio segmentation, the MPEG-7 ASP features and MFCCs are used to train hidden Markov models (HMM) for individual speakers and sounds. The trained sound/speaker models are then used to segment conversational speech involving a given subset of people in panel discussion television programs. Results show that the MFCC approach yields a sound/speaker recognition rate superior to the MPEG-7 implementations.
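The segmentation step described above can be illustrated with a small sketch. It assumes the hmmlearn package and hypothetical helper names; it is not the authors' toolchain, only a generic maximum-likelihood labelling of feature windows with per-class HMMs:

```python
# Hedged sketch: HMM-based audio segmentation by sound/speaker identification.
# One Gaussian HMM is trained per class on its feature sequences (ASP or MFCC);
# the stream to be segmented is then labelled window by window with the model
# that gives the highest log-likelihood.
import numpy as np
from hmmlearn.hmm import GaussianHMM

def train_models(training_data, n_states=5):
    """training_data: dict mapping class name -> (frames x dims) feature array."""
    models = {}
    for name, feats in training_data.items():
        m = GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=20)
        m.fit(feats)
        models[name] = m
    return models

def segment(stream, models, win=100, hop=50):
    """Label each analysis window of the feature stream with the best model."""
    labels = []
    for start in range(0, len(stream) - win + 1, hop):
        window = stream[start:start + win]
        best = max(models, key=lambda name: models[name].score(window))
        labels.append((start, best))
    return labels
```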

36 citations


01 Jun 2004
TL;DR: This work compares the performance of MPEG-7 Audio Spectrum Projection features based on several basis decomposition algorithms vs. Mel-scale Frequency Cepstrum Coefficients (MFCC) and concludes that the established MFCC features yield better performance than MPEG-7 ASP in general sound recognition under practical constraints.
Abstract: Our challenge is to analyze/classify video sound track content for indexing purposes. To this end we compare the performance of MPEG-7 Audio Spectrum Projection (ASP) features based on several basis decomposition algorithms vs. Mel-scale Frequency Cepstrum Coefficients (MFCC). For basis decomposition in the feature extraction we evaluate three approaches: Principal Component Analysis (PCA), Independent Component Analysis (ICA), and Non-negative Matrix Factorization (NMF). Audio features are computed from these reduced vectors and are fed into a continuous hidden Markov model (CHMM) classifier. Our conclusion is that the established MFCC features yield better performance than MPEG-7 ASP in general sound recognition under practical constraints.
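For the three basis-decomposition choices named in the abstract, a minimal sketch (assuming scikit-learn; the function and its defaults are illustrative, not the paper's code) could look as follows:

```python
# Hedged sketch: reducing normalized spectral envelopes with PCA, ICA, or NMF
# before feeding the projections to an HMM classifier.
import numpy as np
from sklearn.decomposition import PCA, FastICA, NMF

def reduced_features(nase, method="PCA", n_components=12):
    """nase: frames x bins matrix of normalized log-spectrum envelopes."""
    if method == "PCA":
        return PCA(n_components=n_components).fit_transform(nase)
    if method == "ICA":
        return FastICA(n_components=n_components, max_iter=500).fit_transform(nase)
    if method == "NMF":
        # NMF requires non-negative input; shift the envelope if necessary.
        model = NMF(n_components=n_components, init="nndsvda", max_iter=500)
        return model.fit_transform(nase - nase.min())
    raise ValueError(f"unknown method {method}")
```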

25 citations


01 Jan 2004
TL;DR: A new European consortium has been formed as a Network of Excellence to integrate the research work of 19 institutions in the field of 3DTV, covering all its aspects except audio.
Abstract: A new European consortium has been formed as a Network of Excellence to integrate the research work of 19 institutions in the field of 3DTV. The consortium is funded by the EC under the FP6 thematic area Information Society Technologies within the strategic objective Cross-media Content for Leisure and Entertainment. The project will last 48 months, but the collaboration among the partners is expected to last longer. The technical focus of the consortium is 3DTV in all its aspects except audio. Various techniques of 3D scene capture will be investigated and compared. Representation of the captured 3D content in abstract form, using mainly computer graphics approaches, is the key feature which decouples the user from the input. Compression of 3D scene information and forming the bitstream structure for effective streaming are parts of the project. The user may interact with the captured scene and get a visual display based on the choice of display technology. A rich variety of display techniques, including stereoscopic and holographic displays, is among the main foci of the consortium. The plan covers various integration and dissemination activities.

25 citations


Journal Article
TL;DR: This paper presents a modular QBH system using MPEG-7 descriptors in all processing stages; due to the modular design, all components can easily be substituted, and the system is evaluated by changing parameters defined by the MPEG-7 descriptors.
Abstract: Query by Humming (QBH) is a method for searching in a multimedia database system containing metadata descriptions of songs. The database can be searched by hummed queries: a user hums a melody into a microphone connected to the computer hosting the system. The QBH system searches the database for songs which are similar to the input query and presents the result to the user as a list of matching songs. This paper presents a modular QBH system using MPEG-7 descriptors in all processing stages. Due to the modular design, all components can easily be substituted. The system is evaluated by changing parameters defined by the MPEG-7 descriptors.

24 citations


Proceedings ArticleDOI
18 Jan 2004
TL;DR: This paper presents a novel approach to human body posture recognition based on the MPEG-7 contour-based shape descriptor and the widely used projection histogram; an optimal system design with recognition rates of 95.59% for the main posture, 77.84% for the view and 79.77% in combination is achieved.
Abstract: This paper presents a novel approach to human body posture recognition based on the MPEG-7 contour-based shape descriptor and the widely used projection histogram. A combination of them is used to recognize the main posture and the view of a human based on the binary object mask obtained by the segmentation process. The recognition is treated as a typical pattern recognition task and is carried out through a hierarchy of classifiers. Therefore, various structures, both hierarchical and non-hierarchical, in combination with different classifiers, are compared to each other with respect to recognition performance and computational complexity. Based on this, an optimal system design with recognition rates of 95.59% for the main posture, 77.84% for the view and 79.77% in combination is achieved.
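One of the two features combined above, the projection histogram, can be sketched as follows (an illustrative formulation with assumed bin counts, not the authors' exact descriptor):

```python
# Hedged sketch: row/column projection histograms of a binary object mask,
# cropped to the silhouette's bounding box and normalized for scale invariance.
import numpy as np

def projection_histograms(mask, n_bins=32):
    """mask: 2-D binary array (1 = object pixel) from the segmentation step."""
    ys, xs = np.nonzero(mask)
    crop = mask[ys.min():ys.max() + 1, xs.min():xs.max() + 1]  # tight bounding box
    h_proj = crop.sum(axis=1).astype(float)   # object pixels per row
    v_proj = crop.sum(axis=0).astype(float)   # object pixels per column

    def resample(p):
        # Resample a profile to a fixed number of bins and normalize by area.
        idx = np.linspace(0, len(p) - 1, n_bins)
        return np.interp(idx, np.arange(len(p)), p) / p.sum()

    return np.concatenate([resample(h_proj), resample(v_proj)])
```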

22 citations


Proceedings ArticleDOI
04 Oct 2004
TL;DR: An indexing and retrieval system that uses phonetic information only is presented, based on the vector space IR model with phone N-grams as indexing terms; document representations are expanded with phone confusion probabilities to improve the retrieval performance.
Abstract: This paper presents a phone-based approach to spoken document retrieval (SDR), developed in the framework of the emerging MPEG-7 standard. We describe an indexing and retrieval system that uses phonetic information only. The retrieval method is based on the vector space IR model, using phone N-grams as indexing terms. We propose a technique to expand the representation of documents by means of phone confusion probabilities in order to improve the retrieval performance. This method is tested on a collection of short German spoken documents, using 10 city names as queries.
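The vector space retrieval over phone N-grams described here can be illustrated with a small sketch (names and the cosine ranking are assumptions; the confusion-probability expansion proposed in the paper is omitted):

```python
# Hedged sketch: phone N-gram indexing and cosine-similarity ranking.
import numpy as np
from collections import Counter

def ngrams(phones, n=3):
    return [tuple(phones[i:i + n]) for i in range(len(phones) - n + 1)]

def build_index(docs, n=3):
    """docs: list of phone sequences, e.g. [['b', 'e', 'r', 'l', 'i', 'n'], ...]."""
    vocab = sorted({g for d in docs for g in ngrams(d, n)})
    gram_id = {g: i for i, g in enumerate(vocab)}
    mat = np.zeros((len(docs), len(vocab)))
    for r, d in enumerate(docs):
        for g, c in Counter(ngrams(d, n)).items():
            mat[r, gram_id[g]] = c
    return mat, gram_id

def rank(query, mat, gram_id, n=3):
    q = np.zeros(mat.shape[1])
    for g, c in Counter(ngrams(query, n)).items():
        if g in gram_id:
            q[gram_id[g]] = c
    sims = mat @ q / (np.linalg.norm(mat, axis=1) * np.linalg.norm(q) + 1e-10)
    return np.argsort(-sims)           # document indices, best match first
```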

22 citations


Proceedings ArticleDOI
06 Sep 2004
TL;DR: This work proposes an algorithm that makes use of the redundancies between the two views of a stereo image pair to conceal block losses in stereoscopic images.
Abstract: With the increasing number of image communication applications, especially in the low-complexity domain, error concealment has become a very important field of research. Since many compression standards for images and videos are block-based, many methods have been applied to conceal block losses in monocular images. The fast progress of capture, representation and display technologies for 3D image data advances the efforts on 3D concealment strategies. Because of their psycho-visual characteristics, stereoscopic images have to fulfill a very high quality demand. We propose an algorithm that makes use of the redundancies between the two views of a stereo image pair. In many cases erroneous block bursts occur and can be highly disturbing; thus we mainly concentrate on these errors. In addition, we focused on the quality assessment of several error concealment strategies. Besides the objective evaluation measures, we carried out a subjective quality test following the DSCQS methodology as proposed by MPEG. The results of this test demonstrate the efficiency of the approach.

21 citations


Journal Article
TL;DR: A Query by Tapping system is presented which allows users to formulate queries by tapping the rhythm of the requested song's melody line on a MIDI keyboard or an e-drum, and which computes and presents a new search result list after every tap made by the user.
Abstract: A Query by Tapping system is a multimedia database containing rhythmic metadata descriptions of songs. This paper presents a Query by Tapping system called BeatBank. The system allows users to formulate queries by tapping the rhythm of the requested song's melody line on a MIDI keyboard or an e-drum. The query entered is converted into an MPEG-7 compliant representation. The actual search process takes only rhythmic aspects of the melodies into account by comparing the values of the MPEG-7 Beat Description Scheme. An efficiently computable similarity measure is presented which enables the comparison of two database entries. The system works in real time and performs the search online, computing and presenting a new search result list after every tap made by the user.
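A minimal sketch of a rhythmic similarity measure in this spirit is shown below (it compares tempo-normalized inter-onset intervals; it is an illustration, not the MPEG-7 Beat Description Scheme comparison used by BeatBank):

```python
# Hedged sketch: distance between two tapped rhythms based on inter-onset
# intervals, cheap enough to recompute after every additional tap.
import numpy as np

def rhythm_distance(taps_query, taps_entry):
    """taps_*: onset times in seconds for the query and a database entry."""
    ioi_q = np.diff(np.asarray(taps_query, dtype=float))
    ioi_e = np.diff(np.asarray(taps_entry, dtype=float))
    if len(ioi_q) == 0 or len(ioi_e) == 0:
        return np.inf
    ioi_q = ioi_q / ioi_q.mean()          # normalize out the tapping tempo
    ioi_e = ioi_e / ioi_e.mean()
    n = min(len(ioi_q), len(ioi_e))       # compare only the overlapping prefix
    return float(np.abs(ioi_q[:n] - ioi_e[:n]).mean())
```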

21 citations


Proceedings ArticleDOI
24 Oct 2004
TL;DR: A recursive block-based decoder distortion estimation technique for video is introduced, together with results showing the agreement of the estimates with simulation results.
Abstract: We introduce a recursive block-based decoder distortion estimation technique for video and present results showing the agreement of the estimation results with simulation results. Each block in a frame, together with all its corresponding blocks along the video sequence, is modeled as an AR(1) source, where the correlation coefficient of the source depends on the loop-filtering effects, and the additive noise term depends on the motion-compensated block difference as well as on the quantization distortion of the block. The distortion term for each block of each frame in the video sequence is calculated recursively depending on the packet loss rate of the channel.
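A simplified version of such a recursion might look as follows (the inputs and the exact form of the update are assumptions; the paper's model additionally ties the noise term to the motion-compensated block difference):

```python
# Hedged sketch: recursive expected-distortion update for one block position.
# With probability p the block's packet is lost and must be concealed; the
# error propagated from the previous frame is attenuated by the AR(1)
# correlation coefficient rho, modelling the loop-filtering effect.
def expected_block_distortion(d_quant, d_conceal, rho, p, d_prev=0.0):
    """
    d_quant   : quantization distortion of the block in the current frame
    d_conceal : additional distortion introduced by concealing a lost block
    rho       : AR(1) correlation coefficient (per-frame error attenuation)
    p         : packet loss rate of the channel
    d_prev    : expected distortion of the corresponding block in frame n-1
    """
    received = d_quant + (rho ** 2) * d_prev    # packet arrives, old errors propagate attenuated
    lost = d_conceal + (rho ** 2) * d_prev      # packet lost, concealment error added on top
    return (1.0 - p) * received + p * lost

# Usage: iterate over the frames of a sequence, carrying d_prev forward.
d_prev = 0.0
for d_q, d_c in [(2.1, 9.5), (1.8, 8.7), (2.0, 9.0)]:   # hypothetical per-frame values
    d_prev = expected_block_distortion(d_q, d_c, rho=0.93, p=0.05, d_prev=d_prev)
```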

15 citations


01 Jan 2004
TL;DR: A hybrid approach is presented, combining a 2D projective transformation and a monoscopic error concealment technique for block loss in stereoscopic image pairs, utilizing the information of the associated image to fulfill the higher quality demand.
Abstract: Error concealment for stereoscopic images has received little attention in image processing research. While many methods have been proposed for monocular images, this paper considers a concealment strategy for block loss in stereoscopic image pairs, utilizing the information of the associated image to fulfill the higher quality demand. We present a hybrid approach, combining a 2D projective transformation and a monoscopic error concealment technique. Pixel values from the associated stereo image are warped to their corresponding positions in the lost block. To reduce discontinuities at the block borders, a monoscopic error concealment algorithm with low-pass properties is integrated. The stereoscopic depth perception is much less affected in our approach than when using only monoscopic error concealment techniques.
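The warping step can be sketched with OpenCV as follows (an assumed pipeline: the homography estimation from matched points and the hard block replacement are illustrative simplifications of the hybrid method described above):

```python
# Hedged sketch: conceal a lost block by warping pixels from the other view
# of the stereo pair through an estimated 2D projective transformation.
import numpy as np
import cv2

def conceal_block_from_other_view(damaged, other_view, block_rect, src_pts, dst_pts):
    """
    damaged    : image containing the lost block
    other_view : the associated stereo image
    block_rect : (x, y, w, h) of the lost block in `damaged`
    src_pts    : Nx2 points in `other_view` matched to dst_pts in `damaged`
    dst_pts    : Nx2 corresponding points around the lost block
    """
    H, _ = cv2.findHomography(np.float32(src_pts), np.float32(dst_pts), cv2.RANSAC)
    warped = cv2.warpPerspective(other_view, H, damaged.shape[1::-1])
    x, y, w, h = block_rect
    repaired = damaged.copy()
    repaired[y:y + h, x:x + w] = warped[y:y + h, x:x + w]   # fill the lost block
    # A monoscopic, low-pass concealment step could follow here to smooth
    # discontinuities at the block borders, as described in the abstract.
    return repaired
```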

Proceedings Article
01 Sep 2004
TL;DR: Experimental results show that the MFCC features yield better performance than MPEG-7 ASP in sound recognition and audio segmentation.
Abstract: Our challenge is to analyze/classify video sound track content for indexing purposes. To this end we compare the performance of MPEG-7 Audio Spectrum Projection (ASP) features based on basis decomposition vs. Mel-scale Frequency Cepstrum Coefficients (MFCC). For basis decomposition in the feature extraction we have three choices: Principal Component Analysis (PCA), Independent Component Analysis (ICA), and Non-negative Matrix Factorization (NMF). Audio features are computed from these reduced vectors and are fed into a hidden Markov model (HMM) classifier. Experimental results show that the MFCC features yield better performance than MPEG-7 ASP in sound recognition and audio segmentation.

Proceedings ArticleDOI
06 Sep 2004
TL;DR: A phone-based approach to spoken document retrieval (SDR), developed in the framework of the emerging MPEG-7 standard, is presented; it uses phonetic information only and a vector space IR model.
Abstract: In this paper, we present a phone-based approach to spoken document retrieval (SDR), developed in the framework of the emerging MPEG-7 standard. The audio part of MPEG-7 aims at standardizing the indexing of audio documents. It includes a SpokenContent tool that provides a description framework for the semantic content of speech signals. In the context of MPEG-7, we propose an indexing and retrieval method that uses phonetic information only and a vector space IR model. Different strategies based on the use of phone N-gram indexing terms are investigated.

Proceedings ArticleDOI
01 Jan 2004
TL;DR: Different distance measures for the MPEG-7 MelodyContour DS are evaluated, and the use of each measure for melody comparison in a QBH system is discussed.
Abstract: In query by humming (QBH) systems, the melody contour is often used as a symbolic description of music. The MelodyContour description scheme (DS) defined by MPEG-7 is a standardized representation of melody contours. For melody comparison in a QBH system, a distance measure is required. This paper evaluates different distance measures for the MPEG-7 MelodyContour DS. The use of each measure is discussed.
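Two simple contour distance measures of the kind such an evaluation might cover are sketched below (illustrative choices, not necessarily the measures evaluated in the paper); the MelodyContour DS quantizes interval directions into a small alphabet such as {-2, -1, 0, +1, +2}:

```python
# Hedged sketch: edit distance and a truncated L1 distance between two
# quantized melody contour sequences.
def contour_edit_distance(a, b):
    """Levenshtein distance between two contour sequences."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i                          # prev carries dp[i-1][j-1]
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,         # deletion
                                     dp[j - 1] + 1,     # insertion
                                     prev + (ca != cb)) # substitution
    return dp[-1]

def contour_l1_distance(a, b):
    """Element-wise L1 distance over the overlapping prefix, plus a length penalty."""
    n = min(len(a), len(b))
    return sum(abs(x - y) for x, y in zip(a[:n], b[:n])) + abs(len(a) - len(b))
```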

Proceedings ArticleDOI
24 Oct 2004
TL;DR: This work presents a concealment strategy for block loss in stereoscopic image pairs by means of a projective transformation model, where pixel values from the associated stereo image are warped to their corresponding positions in the lost block.
Abstract: Error concealment is an important field of research in image processing. Many methods have been applied to conceal block losses in monocular images. We present a concealment strategy for block loss in stereoscopic image pairs. Unlike the error concealment techniques used for monocular images, the information of the associated image is utilized: by means of a projective transformation model, pixel values from the associated stereo image are warped to their corresponding positions in the lost block. The stereoscopic depth perception is much less affected in our approach than when using monoscopic error concealment techniques.

01 Jan 2004
TL;DR: MPEG-7 basis projection features are evaluated against Mel-scale Frequency Cepstrum Coefficients for speaker recognition in noisy environments; results show that the MFCC features yield better performance than the MPEG-7 features.
Abstract: Our purpose is to evaluate the efficiency of MPEG-7 basis projection (BP) features vs. Mel-scale Frequency Cepstrum Coefficients (MFCC) for speaker recognition in noisy environments. The MPEG-7 feature extraction mainly consists of a Normalized Audio Spectrum Envelope (NASE), a basis decomposition algorithm and a spectrum basis projection. Prior to the feature extraction, a noise reduction algorithm is applied using a modified log-spectral amplitude speech estimator (LSA) and minima-controlled noise estimation (MC). The noise-reduced features can be effectively used in an HMM-based recognition system. The performance is measured by the segmental signal-to-noise ratio and by the recognition results of the MPEG-7 standardized features vs. Mel-scale Frequency Cepstrum Coefficients (MFCC), in comparison to other noise reduction methods. Results show that the MFCC features yield better performance than the MPEG-7 features.

Proceedings ArticleDOI
04 Oct 2004
TL;DR: Robust speech enhancement using noise estimation based on smoothing of the spectral noise floor (SNF) is presented for nonstationary noise environments; the enhanced speech is free of musical tones and reverberation artifacts and sounds very natural compared to methods using other short-time spectrum attenuation techniques.
Abstract: This paper presents robust speech enhancement using noise estimation based on smoothing of the spectral noise floor (SNF) for nonstationary noise environments. The spectral gain function is obtained by the well-known log-spectral amplitude (LSA) estimation criterion associated with the speech presence uncertainty. The noise estimate is given by averaging actual spectral power values, using a smoothing parameter that depends on the smoothed spectral noise floor. The noise estimator is very simple but achieves a good tracking capability for nonstationary noise. The enhanced speech is free of musical tones and reverberation artifacts and sounds very natural compared to methods using other short-time spectrum attenuation techniques. The performance is measured by the segmental signal-to-noise ratio (SNR), the speech/speaker recognition accuracy and the speaker change detection rate for audio segmentation using MFCC features (Mel-scale Frequency Cepstral Coefficients), in comparison to other single-microphone noise reduction methods.
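A generic recursive noise-floor tracker in this spirit is sketched below (the update rule and its constants are assumptions, not the SNF estimator of the paper):

```python
# Hedged sketch: per-bin noise power update with a smoothing parameter that
# tracks quickly where the frame power stays near the current floor (likely
# noise) and freezes the estimate where the power rises well above it
# (likely speech presence).
import numpy as np

def update_noise_estimate(noise_psd, frame_psd, alpha_min=0.7, alpha_max=0.98):
    """
    noise_psd : current noise power estimate per frequency bin
    frame_psd : power spectrum of the current frame
    """
    ratio = frame_psd / (noise_psd + 1e-12)
    alpha = np.clip(alpha_min + 0.1 * np.maximum(ratio - 1.0, 0.0),
                    alpha_min, alpha_max)
    return alpha * noise_psd + (1.0 - alpha) * frame_psd
```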

Proceedings ArticleDOI
17 May 2004
TL;DR: A novel wavelet dispersion measure is developed that obtains a success rate of approximately 78% in identifying unknown complex classical music movements when used in conjunction with a probabilistic radial basis neural network trained with only three independent example sets.
Abstract: Precision audio content description is one of the key components of next-generation Internet multimedia search machines. We examine the usability of a combination of 39 different wavelets and three different types of neural nets for precision audio content description. More specifically, we develop a novel wavelet dispersion measure that operates on the obtained ranks of wavelet coefficients. Our dispersion measure, in conjunction with a probabilistic radial basis neural network trained with only three independent example sets, obtains a success rate of approximately 78% in identifying unknown complex classical music movements.
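One possible reading of a coefficient-rank dispersion measure is sketched below (this interpretation, the PyWavelets decomposition settings and the top-fraction parameter are all assumptions; the paper's exact definition may differ):

```python
# Hedged sketch: decompose a signal frame with a wavelet, rank the coefficient
# magnitudes, and summarize how widely the strongest coefficients are spread
# across the decomposition.
import numpy as np
import pywt

def wavelet_dispersion(x, wavelet="db4", level=5, top_fraction=0.1):
    coeffs = pywt.wavedec(x, wavelet, level=level)        # list of coefficient arrays
    flat = np.concatenate(coeffs)
    ranks = np.argsort(np.argsort(-np.abs(flat)))         # 0 = largest magnitude
    top = ranks < int(top_fraction * len(flat))           # strongest coefficients
    positions = np.nonzero(top)[0] / len(flat)            # normalized positions
    return float(positions.std())                         # spread of the strong coefficients
```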

Proceedings ArticleDOI
18 Jan 2004
TL;DR: This work compares MSVC with TLC, an extension of SC, based on transmission simulations over lossy channels under the assumption that the motion vectors are always available; results show that when motion vectors are received, TLC performs better than MSVC for every coding option tested.
Abstract: Multiple Description Video Coding (MDC) and Layered Coding (LC) are both error-resilient source coding techniques used for transmission over error-prone channels. Both techniques generate multiple streams. The streams generated by MDC correspond to different descriptions of the same source, whereas the streams produced by LC are differentiated as base and enhancement layer streams. Moreover, whereas the MDC streams are independently decodable, the decoding of the enhancement layer stream depends on the decoding of the base layer stream. In this work we concentrate on specific MDC and LC schemes, i.e. Multi-State Video Coding (MSVC) and Temporal Layered Coding (TLC). MSVC was introduced by John Apostolopoulos, and it was shown that if each frame is transmitted in a separate packet and if the motion information for each lost frame is also lost, MSVC outperforms Single Layer Coding (SC) in recovering from single as well as burst losses. Here we compare MSVC with TLC, as an extension of SC, based on transmission simulations over lossy channels under the assumption that the motion vectors are always available. Using different coding modes and specific reconstruction methods, the average reconstructed frame PSNR (peak signal-to-noise ratio) is measured and compared. Results show that when motion vectors are received, TLC performs better than MSVC for every coding option tested. The performance difference is larger for low-motion sequences.