
Showing papers on "Audio signal processing published in 2006"


Journal ArticleDOI
TL;DR: This paper considers four different sets of allowed distortions in blind audio source separation algorithms, from time-invariant gains to time-varying filters, and derives a global performance measure using an energy ratio, plus a separate performance measure for each error term.
Abstract: In this paper, we discuss the evaluation of blind audio source separation (BASS) algorithms. Depending on the exact application, different distortions can be allowed between an estimated source and the wanted true source. We consider four different sets of such allowed distortions, from time-invariant gains to time-varying filters. In each case, we decompose the estimated source into a true source part plus error terms corresponding to interferences, additive noise, and algorithmic artifacts. Then, we derive a global performance measure using an energy ratio, plus a separate performance measure for each error term. These measures are computed and discussed on the results of several BASS problems with various difficulty levels.

2,855 citations
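
As a rough illustration of the decomposition described above, the sketch below computes energy-ratio measures for the simplest distortion class (time-invariant gains), assuming a noiseless mixture; the helper name bss_eval_gain and all implementation details are illustrative, not the paper's reference code.

```python
# Minimal sketch of the energy-ratio measures described above, for the
# simplest distortion class (time-invariant gains) and a noiseless mixture.
# The variable names (sdr, sir, sar) follow common BSS-evaluation usage;
# the decomposition in the paper is more general.
import numpy as np

def bss_eval_gain(est, true_src, other_srcs):
    """Decompose an estimated source and return energy ratios in dB."""
    # Target part: orthogonal projection of the estimate onto the true source.
    s_target = (est @ true_src) / (true_src @ true_src) * true_src
    # Interference part: projection onto the subspace spanned by all
    # sources, minus the target part.
    A = np.stack([true_src] + list(other_srcs))        # sources as rows
    coeffs, *_ = np.linalg.lstsq(A.T, est, rcond=None)
    p_all = A.T @ coeffs
    e_interf = p_all - s_target
    # Artifacts: whatever the source subspace cannot explain.
    e_artif = est - p_all

    db = lambda num, den: 10 * np.log10((num @ num) / (den @ den))
    sdr = db(s_target, e_interf + e_artif)   # global measure
    sir = db(s_target, e_interf)             # interference only
    sar = db(s_target + e_interf, e_artif)   # artifacts only
    return sdr, sir, sar
```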


Journal ArticleDOI
TL;DR: An overview is provided of the approaches currently used in a key area of audio diarization, namely speaker diarization, and their relative merits and limitations are discussed.
Abstract: Audio diarization is the process of annotating an input audio channel with information that attributes (possibly overlapping) temporal regions of signal energy to their specific sources. These sources can include particular speakers, music, background noise sources, and other signal source/channel characteristics. Diarization can be used for helping speech recognition, facilitating the searching and indexing of audio archives, and increasing the richness of automatic transcriptions, making them more readable. In this paper, we provide an overview of the approaches currently used in a key area of audio diarization, namely speaker diarization, and discuss their relative merits and limitations. Performances using the different techniques are compared within the framework of the speaker diarization task in the DARPA EARS Rich Transcription evaluations. We also look at how the techniques are being introduced into real broadcast news systems and their portability to other domains and tasks such as meetings and speaker verification.

634 citations
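
One standard building block of the systems surveyed above is speaker-change detection with the Bayesian information criterion (BIC); the following is a minimal sketch of that idea, with an illustrative penalty weight and window handling rather than any particular system's configuration.

```python
# Hedged sketch of BIC-based speaker-change detection, a common building
# block of diarization systems. Feature extraction, window sizes, and the
# penalty weight are illustrative choices, not values from the paper.
import numpy as np

def delta_bic(X, Y, penalty=1.0):
    """Delta-BIC for modelling X and Y with one Gaussian vs. two Gaussians.
    X, Y: (n_frames, n_dims) feature matrices. Positive values suggest a
    speaker change at the boundary between X and Y."""
    Z = np.vstack([X, Y])
    n, d = Z.shape
    logdet = lambda S: np.linalg.slogdet(S)[1]
    ll = (n * logdet(np.cov(Z.T))
          - len(X) * logdet(np.cov(X.T))
          - len(Y) * logdet(np.cov(Y.T))) / 2.0
    k = penalty * 0.5 * (d + 0.5 * d * (d + 1)) * np.log(n)
    return ll - k

# Usage: slide a window over frame features (e.g. MFCC vectors) and flag
# boundaries where delta_bic(left_half, right_half) > 0.
```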


Patent
05 Jun 2006
TL;DR: In this article, a real-time audio-on-demand communication system is proposed that provides real-time playback of audio data transferred via telephone lines or other communication links. However, the system is not suitable for multimedia applications.
Abstract: An audio-on-demand communication system provides real-time playback of audio data transferred via telephone lines or other communication links. One or more audio servers include memory banks which store compressed audio data. At the request of a user at a subscriber PC, an audio server transmits the compressed audio data over the communication link to the subscriber PC. The subscriber PC receives and decompresses the transmitted audio data in less than real-time using only the processing power of the CPU within the subscriber PC. According to one aspect of the present invention, high quality audio data compressed according to lossless compression techniques is transmitted together with normal quality audio data. According to another aspect of the present invention, metadata, or extra data, such as text, captions, still images, etc., is transmitted with audio data and is simultaneously displayed with corresponding audio data. The audio-on-demand system also provides a table of contents indicating significant divisions in the audio clip to be played and allows the user immediate access to audio data at the listed divisions. According to a further aspect of the present invention, servers and subscriber PCs are dynamically allocated based upon geographic location to provide the highest possible quality in the communication link.

470 citations


Journal ArticleDOI
TL;DR: This paper investigates the feasibility of an audio-based context recognition system; a system is developed and its accuracy is compared to that of human listeners on the same task, with particular emphasis on the computational complexity of the methods.
Abstract: The aim of this paper is to investigate the feasibility of an audio-based context recognition system. Here, context recognition refers to the automatic classification of the context or an environment around a device. A system is developed and compared to the accuracy of human listeners in the same task. Particular emphasis is placed on the computational complexity of the methods, since the application is of particular interest in resource-constrained portable devices. Simplistic low-dimensional feature vectors are evaluated against more standard spectral features. Using discriminative training, competitive recognition accuracies are achieved with very low-order hidden Markov models (1-3 Gaussian components). Slight improvement in recognition accuracy is observed when linear data-driven feature transformations are applied to mel-cepstral features. The recognition rate of the system as a function of the test sequence length appears to converge only after about 30 to 60 s. Some degree of accuracy can be achieved even with less than 1-s test sequence lengths. The average reaction time of the human listeners was 14 s, i.e., somewhat smaller, but of the same order as that of the system. The average recognition accuracy of the system was 58% against 69%, obtained in the listening tests in recognizing between 24 everyday contexts. The accuracies in recognizing six high-level classes were 82% for the system and 88% for the subjects.

436 citations
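
A minimal sketch of the kind of classifier the paper evaluates, mel-cepstral features with very low-order Gaussian mixtures, might look as follows; librosa and scikit-learn are stand-ins here, and the mixture order and feature settings are illustrative, not the paper's exact configuration.

```python
# Illustrative context classifier: MFCC features and a low-order GMM per
# context class, in the spirit of the system described above. All parameter
# values are assumptions, not the paper's setup.
import librosa
import numpy as np
from sklearn.mixture import GaussianMixture

def train_context_models(wavs_by_context, sr=16000, n_mfcc=13):
    models = {}
    for context, files in wavs_by_context.items():
        feats = []
        for f in files:
            y, _ = librosa.load(f, sr=sr)
            feats.append(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T)
        # Very low-order mixture (2 components), echoing the paper's finding
        # that 1-3 Gaussian components can be competitive.
        models[context] = GaussianMixture(n_components=2).fit(np.vstack(feats))
    return models

def classify(wav, models, sr=16000, n_mfcc=13):
    y, _ = librosa.load(wav, sr=sr)
    X = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T
    # Pick the context whose model gives the highest average log-likelihood.
    return max(models, key=lambda c: models[c].score(X))
```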


Patent
04 Dec 2006
TL;DR: In this article, the authors present a method, system, and apparatus for playing an audio signal synchronously on a first mobile audio player and at least a second mobile audio player, where a delay enables the audio signal to be played synchronously with the second audio player.
Abstract: The present invention discloses a method, system and apparatus for playing an audio signal synchronously on a first mobile audio player and at least a second mobile audio player. More particularly, the invention pertains to an audio player device enabled for wireless transmission and reception of an audio signal. In one aspect, a delay enables the audio signal to be played synchronously on the audio player with a second audio player. In another aspect synchronization signals are used to play the audio signal synchronously on the first audio player and the second audio player.

408 citations


Patent
04 May 2006
TL;DR: In this paper, a sound capture unit is configured to identify one or more sound sources and generate data capable of being analyzed to determine a listening zone at which to process sound to the substantial exclusion of sounds outside the listening zone.
Abstract: Sound processing methods and apparatus are provided. A sound capture unit is configured to identify one or more sound sources. The sound capture unit generates data capable of being analyzed to determine a listening zone at which to process sound to the substantial exclusion of sounds outside the listening zone. Sound captured and processed for the listening zone may be used for interactivity with the computer program. The listening zone may be adjusted based on the location of a sound source. One or more listening zones may be pre-calibrated. The apparatus may optionally include an image capture unit configured to capture one or more image frames. The listening zone may be adjusted based on the image. A video game unit may be controlled by generating inertial, optical and/or acoustic signals with a controller and tracking a position and/or orientation of the controller using the inertial, acoustic and/or optical signal.

352 citations


Journal ArticleDOI
TL;DR: Experimental results show that the proposed watermarking scheme is inaudible and robust against various signal processing operations such as noise adding, resampling, requantization, random cropping, and MPEG-1 Layer III (MP3) compression.
Abstract: Synchronization attack is one of the key issues of digital audio watermarking. In this correspondence, a blind digital audio watermarking scheme against synchronization attack using adaptive quantization is proposed. The features of the proposed scheme are as follows: 1) a kind of more steady synchronization code and a new embedding strategy are adopted to resist the synchronization attack more effectively; 2) the multiresolution characteristics of the discrete wavelet transform (DWT) and the energy-compression characteristics of the discrete cosine transform (DCT) are combined to improve the transparency of the digital watermark; 3) the watermark is embedded into the low frequency components by adaptive quantization according to human auditory masking; and 4) the scheme can extract the watermark without the help of the original digital audio signal. Experimental results show that the proposed watermarking scheme is inaudible and robust against various signal processing operations such as noise adding, resampling, requantization, random cropping, and MPEG-1 Layer III (MP3) compression.

275 citations
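
The embedding idea, quantizing low-frequency coefficients in a combined DWT-DCT domain, can be sketched with a fixed quantization step standing in for the paper's adaptive, masking-driven step; the functions below are illustrative only.

```python
# Hedged sketch of quantization-index-modulation embedding in a DWT-DCT
# domain, as described above. The fixed step size replaces the paper's
# adaptive quantization; pywt/scipy usage and all constants are illustrative.
import numpy as np
import pywt
from scipy.fft import dct, idct

def embed(audio, bits, step=0.05):
    approx, detail = pywt.dwt(audio, 'db4')        # one DWT level
    coeffs = dct(approx, norm='ortho')
    for i, b in enumerate(bits):                   # low-frequency coefficients
        q = np.round(coeffs[i] / step)
        if int(q) % 2 != b:                        # force even/odd parity
            q += 1
        coeffs[i] = q * step
    return pywt.idwt(idct(coeffs, norm='ortho'), detail, 'db4')

def extract(audio, n_bits, step=0.05):
    # Blind extraction: no original signal needed, only the step size.
    approx, _ = pywt.dwt(audio, 'db4')
    coeffs = dct(approx, norm='ortho')
    return [int(np.round(coeffs[i] / step)) % 2 for i in range(n_bits)]
```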


Patent
11 Sep 2006
TL;DR: In this article, a method for determining an intensity value of an interaction with a computer program is described, which includes capturing an image of a capture zone, identifying an input object in the image, identifying the initial value of a parameter of the input object, capturing a second image of the capture zone and identifying a second value of the parameter.
Abstract: A method for determining an intensity value of an interaction with a computer program is described. The method and device includes capturing an image of a capture zone, identifying an input object in the image, identifying an initial value of a parameter of the input object, capturing a second image of the capture zone, and identifying a second value of the parameter of the input object. The parameter identifies one or more of a shape, color, or brightness of the input object and is affected by human manipulation of the input object. The extent of change in the parameter is calculated, which is the difference between the second value and the first value. An activity input is provided to the computer program, the activity input including an intensity value representing the extent of change of the parameter. A method for detecting an intensity value from sound generating input objects, and a computer video game are also described. A game controller having LEDs, sound capture and generation, or an accelerometer is also described.

268 citations


Journal ArticleDOI
TL;DR: A content-based audio classification algorithm based on novel multiscale spectro-temporal modulation features inspired by a model of auditory cortical processing to discriminate speech from nonspeech consisting of animal vocalizations, music, and environmental sounds is described.
Abstract: We describe a content-based audio classification algorithm based on novel multiscale spectro-temporal modulation features inspired by a model of auditory cortical processing. The task explored is to discriminate speech from nonspeech consisting of animal vocalizations, music, and environmental sounds. Although this is a relatively easy task for humans, it is still difficult to automate well, especially in noisy and reverberant environments. The auditory model captures basic processes occurring from the early cochlear stages to the central cortical areas. The model generates a multidimensional spectro-temporal representation of the sound, which is then analyzed by a multilinear dimensionality reduction technique and classified by a support vector machine (SVM). Generalization of the system to signals at high levels of additive noise and reverberation is evaluated and compared to two existing approaches (Scheirer and Slaney, 2002, and Kingsbury et al., 2002). The results demonstrate the advantages of the auditory model over the other two systems, especially at low signal-to-noise ratios (SNRs) and high reverberation.

251 citations
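
A greatly simplified stand-in for the spectro-temporal modulation features is a 2-D Fourier transform of a log-mel spectrogram feeding an SVM; the paper's cortical model and multilinear dimensionality reduction are far richer, so treat this purely as a sketch of the feature family.

```python
# Crude approximation of spectro-temporal modulation features: the magnitude
# of a 2-D FFT of the log-mel spectrogram captures joint spectral/temporal
# modulation energy. librosa/sklearn usage and all sizes are assumptions.
import librosa
import numpy as np
from sklearn.svm import SVC

def modulation_features(y, sr, n_keep=16):
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64)
    logS = np.log(S + 1e-10)
    M = np.abs(np.fft.fft2(logS))          # 2-D modulation spectrum
    # Keep only the lowest rates/scales; assumes at least n_keep frames.
    return M[:n_keep, :n_keep].ravel()

# Usage sketch:
# clf = SVC(kernel='rbf').fit(train_features, train_labels)
# where each row of train_features is modulation_features(clip, sr).
```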


Patent
05 Jan 2006
TL;DR: In this paper, a digital audio file search method and apparatus for digital audio files is provided that allows a user to navigate the audio files by generating speech sounds related to the information of the audio file to facilitate searching and playback.
Abstract: A digital audio file search method and apparatus for digital audio files is provided that allows a user to navigate the audio files by generating speech sounds related to the information of the audio files to facilitate searching and playback. The digital audio file search method and apparatus searches for audio files in a portable digital audio player in combination with an automobile audio system through speech sounds by utilizing text-to-speech processing and by prompting response from a user in response to the generated speech sounds. The text-to-speech technology is utilized to generate the speech sound based on tag-data of the audio files. When hearing the speech sounds, the user gives instruction for searching the files without being distracted from driving the automobile.

226 citations


Proceedings ArticleDOI
14 May 2006
TL;DR: The results show that the proposed top-down event detection approach works significantly better than the single level approach.
Abstract: With the increasing use of audio sensors in surveillance and monitoring applications, event detection using audio streams has emerged as an important research problem. This paper presents a hierarchical approach to audio-based event detection for surveillance. The proposed approach first classifies a given audio frame into vocal and nonvocal events, and then performs further classification into normal and excited events. We model the events using a Gaussian mixture model and optimize the parameters for four different audio features: ZCR, LPC, LPCC, and LFCC. Experiments have been performed to evaluate the effectiveness of the features for detecting various normal and excited-state human activities. The results show that the proposed top-down event detection approach works significantly better than the single-level approach.
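
The two-level decision can be sketched with one GMM pair per level; the ZCR feature extraction and mixture settings below are illustrative choices (the paper optimizes over ZCR, LPC, LPCC, and LFCC).

```python
# Hedged sketch of the hierarchical decision described above: one GMM pair
# separates vocal from nonvocal frames, a second pair separates normal from
# excited events. Frame sizes and features are illustrative assumptions.
import numpy as np
from sklearn.mixture import GaussianMixture

def frame_zcr(y, frame=400, hop=160):
    """Per-frame zero-crossing rate as a (n_frames, 1) feature matrix."""
    zcrs = []
    for i in range(0, len(y) - frame, hop):
        f = y[i:i + frame]
        zcrs.append(np.mean(np.abs(np.diff(np.sign(f))) > 0))
    return np.array(zcrs).reshape(-1, 1)

def classify_frames(x, gmm_vocal, gmm_nonvocal, gmm_normal, gmm_excited):
    """x: feature matrix for one audio frame; GMMs trained per class."""
    # Level 1: vocal vs. nonvocal.
    kind = 'vocal' if gmm_vocal.score(x) > gmm_nonvocal.score(x) else 'nonvocal'
    # Level 2: normal vs. excited.
    state = 'excited' if gmm_excited.score(x) > gmm_normal.score(x) else 'normal'
    return kind, state
```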

Journal ArticleDOI
TL;DR: The tempo induction contest organized during the International Conference on Music Information Retrieval (ISMIR) 2004, held at the University Pompeu Fabra in Barcelona, Spain, was the first published large-scale cross-validation of audio tempo induction algorithms, as mentioned in this paper.
Abstract: We report on the tempo induction contest organized during the International Conference on Music Information Retrieval (ISMIR 2004) held at the University Pompeu Fabra in Barcelona, Spain, in October 2004. The goal of this contest was to evaluate some state-of-the-art algorithms in the task of inducing the basic tempo (as a scalar, in beats per minute) from musical audio signals. To our knowledge, this is the first published large scale cross-validation of audio tempo induction algorithms. Participants were invited to submit algorithms to the contest organizer, in one of several allowed formats. No training data was provided. A total of 12 entries (representing the work of seven research teams) were evaluated, 11 of which are reported in this document. Results on the test set of 3199 instances were returned to the participants before they were made public. Anssi Klapuri's algorithm won the contest. This evaluation shows that tempo induction algorithms can reach over 80% accuracy for music with a constant tempo, if we do not insist on finding a specific metrical level. After the competition, the algorithms and results were analyzed in order to discover general lessons for the future development of tempo induction systems. One conclusion is that robust tempo induction entails the processing of frame features rather than that of onset lists. Further, we propose a new "redundant" approach to tempo induction, inspired by knowledge of human perceptual mechanisms, which combines multiple simpler methods using a voting mechanism. Machine emulation of human tempo induction is still an open issue. Many avenues for future work in audio tempo tracking are highlighted, as for instance the definition of the best rhythmic features and the most appropriate periodicity detection method. In order to stimulate further research, the contest results, annotations, evaluation software and part of the data are available at http://ismir2004.ismir.net/ISMIR_Contest.html
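
Consistent with the paper's conclusion that frame features are more robust than onset lists, a minimal frame-feature tempo estimator can be sketched as autocorrelation of an onset-strength envelope; the librosa usage and the 40-200 BPM search range are illustrative and do not correspond to any contestant's algorithm.

```python
# Minimal tempo estimator in the spirit of the contest's frame-feature
# finding: autocorrelate an onset-strength envelope and pick the strongest
# periodicity in a plausible tempo range. All choices are illustrative.
import librosa
import numpy as np

def estimate_tempo(path):
    y, sr = librosa.load(path)
    env = librosa.onset.onset_strength(y=y, sr=sr)   # frame-level feature
    hop = 512                                        # onset_strength default
    ac = librosa.autocorrelate(env)
    # Convert lag (frames) to BPM and search 40-200 BPM.
    frames_per_sec = sr / hop
    lags = np.arange(1, len(ac))
    bpm = 60.0 * frames_per_sec / lags
    mask = (bpm >= 40) & (bpm <= 200)
    best = lags[mask][np.argmax(ac[1:][mask])]
    return 60.0 * frames_per_sec / best              # scalar tempo estimate
```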

Journal ArticleDOI
TL;DR: This paper addresses the problem of audio source separation with one single sensor, using a statistical model of the sources, based on a learning step from samples of each source separately, during which Gaussian scaled mixture models (GSMM) are trained.
Abstract: In this paper, we address the problem of audio source separation with one single sensor, using a statistical model of the sources. The approach is based on a learning step from samples of each source separately, during which we train Gaussian scaled mixture models (GSMM). During the separation step, we derive maximum a posteriori (MAP) and/or posterior mean (PM) estimates of the sources, given the observed audio mixture (Bayesian framework). From the experimental point of view, we test and evaluate the method on real audio examples.
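
A heavily simplified sketch of the approach: each source gets a spectral-domain mixture model (a plain GMM below, standing in for the paper's gain-adapted GSMM), a state is picked per source per frame, and the sources are re-estimated by Wiener-style masking. The per-frame state selection here is a crude heuristic, not the paper's MAP/PM derivation.

```python
# Illustrative single-sensor separation with per-source spectral models.
# gmm1/gmm2 are assumed trained separately on log-power spectra of each
# source (the learning step described above); everything is approximate.
import numpy as np

def separate_frame(x_mag2, gmm1, gmm2):
    """x_mag2: power spectrum of one mixture frame (shape: n_freq,)."""
    logx = np.log(x_mag2 + 1e-10).reshape(1, -1)
    # Heuristic state choice: most responsible component of each model
    # for the mixture frame (a stand-in for joint MAP state selection).
    k1 = np.argmax(gmm1.predict_proba(logx))
    k2 = np.argmax(gmm2.predict_proba(logx))
    s1 = np.exp(gmm1.means_[k1])           # expected source power spectra
    s2 = np.exp(gmm2.means_[k2])
    mask1 = s1 / (s1 + s2)                 # Wiener-style mask
    return mask1 * x_mag2, (1 - mask1) * x_mag2
```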

Proceedings ArticleDOI
09 Jul 2006
TL;DR: This paper utilizes low-level audio features from a mobile robot, investigates high-level features based on spectral analysis for scene characterization, and builds a recognition system to discriminate between different environments based on these audio features.
Abstract: Automatic recognition of unstructured environments is an important problem for mobile robots. We focus on using audio features to recognize different auditory environments, where they are characterized by different types of sounds. The use of audio information provides a complementary means of scene recognition that can effectively augment visual information. In particular, audio can be used toward both the analysis and characterization of the environment at a higher level of abstraction. We begin our investigation of recognizing different auditory environments with the audio information. In this paper, we utilize low-level audio features from a mobile robot and investigate high-level features based on spectral analysis for scene characterization; a recognition system was built to discriminate between different environments based on these audio features.

Journal ArticleDOI
TL;DR: A novel content-dependent localized robust audio watermarking scheme that shows strong robustness against common audio signal processing, time-domain synchronization attacks, and most distortions introduced in Stirmark for Audio.
Abstract: Synchronization attacks like random cropping and time-scale modification are very challenging problems to audio watermarking techniques. To combat these attacks, a novel content-dependent localized robust audio watermarking scheme is proposed. The basic idea is to first select steady high-energy local regions that represent music edges like note attacks, transitions or drum sounds by using different methods, then embed the watermark in these regions. Such regions are of great importance to the understanding of music and will not be changed much for maintaining high auditory quality. In this way, the embedded watermark has the potential to escape all kinds of distortions. Experimental results show strong robustness against common audio signal processing, time-domain synchronization attacks, and most distortions introduced in Stirmark for Audio.

Journal ArticleDOI
TL;DR: A flexible framework is proposed for key audio effect detection in a continuous audio stream, as well as for the semantic inference of an auditory context, and a Bayesian network-based approach is proposed to further discover the high-level semantics of an auditory context by integrating prior knowledge and statistical learning.
Abstract: Key audio effects are those special effects that play critical roles in human perception of an auditory context in audiovisual materials. Based on key audio effects, high-level semantic inference can be carried out to facilitate various content-based analysis applications, such as highlight extraction and video summarization. In this paper, a flexible framework is proposed for key audio effect detection in a continuous audio stream, as well as for the semantic inference of an auditory context. In the proposed framework, key audio effects and the background sounds are comprehensively modeled with hidden Markov models, and a Grammar Network is proposed to connect various models to fully explore the transitions among them. Moreover, a set of new spectral features are employed to improve the representation of each audio effect and the discrimination among various effects. The framework is convenient to add or remove target audio effects in various applications. Based on the obtained key effect sequence, a Bayesian network-based approach is proposed to further discover the high-level semantics of an auditory context by integrating prior knowledge and statistical learning. Evaluations on 12 h of audio data indicate that the proposed framework can achieve satisfying results, both on key audio effect detection and auditory context inference.

Patent
11 Sep 2006
TL;DR: In this article, an overall audio output signal for an electronic device may be generated such that, for at least one of the audio channels corresponding to predictive-manner processing, the generated audio output for that channel included in the overall audio output signal is based at least in part on configuration information associated with a processed audio output signal.
Abstract: In operation of an electronic device, audio based on asynchronous events, such as game playing, is intelligently combined with audio output nominally generated in a predictive manner, such as resulting from media playback. For example, an overall audio output signal for the electronic device may be generated such that, for at least one of the audio channels corresponding to predictive-manner processing, the generated audio output for that channel included in the overall audio output signal is based at least in part on configuration information associated with a processed audio output signal for at least one of the audio channels corresponding to asynchronous-event-based processing. Thus, for example, the game audio processing may control how audio effects from the game are combined with audio effects from media playback.

Journal ArticleDOI
TL;DR: This study focuses on a single music genre but combines a variety of instruments among which are percussion and singing voice, and obtains a taxonomy of musical ensembles which is used to efficiently classify possible combinations of instruments played simultaneously.
Abstract: We propose a new approach to instrument recognition in the context of real music orchestrations ranging from solos to quartets. The strength of our approach is that it does not require prior musical source separation. Thanks to a hierarchical clustering algorithm exploiting robust probabilistic distances, we obtain a taxonomy of musical ensembles which is used to efficiently classify possible combinations of instruments played simultaneously. Moreover, a wide set of acoustic features is studied including some new proposals. In particular, signal-to-mask ratios are found to be useful features for audio classification. This study focuses on a single music genre (i.e., jazz) but combines a variety of instruments among which are percussion and singing voice. Using a varied database of sound excerpts from commercial recordings, we show that the segmentation of music with respect to the instruments played can be achieved with an average accuracy of 53%.

Proceedings ArticleDOI
01 Oct 2006
TL;DR: A novel method to extract the notches that makes it possible to accurately estimate the location of a sound source in both the horizontal and vertical plane using only two microphones and human-like ears is presented.
Abstract: Being able to locate the origin of a sound is important for our capability to interact with the environment. Humans can locate a sound source in both the horizontal and vertical plane with only two ears, using the head-related transfer function (HRTF), or more specifically features like the interaural time difference (ITD), the interaural level difference (ILD), and notches in the frequency spectrum. In robotics, notches have been left out since they are considered complex and difficult to use. As they are the main cue for humans' ability to estimate the elevation of the sound source, this has had to be compensated for by adding more microphones or very large and asymmetric ears. In this paper, we present a novel method to extract the notches that makes it possible to accurately estimate the location of a sound source in both the horizontal and vertical plane using only two microphones and human-like ears. We suggest the use of simple spiral-shaped ears that have similar properties to the human ear and make it easy to calculate the position of the notches. Finally, we show how the robot can learn its HRTF and build audiomotor maps using supervised learning, and how it can automatically update its map using vision and compensate for changes in the HRTF due to changes to the ears or the environment.
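
The simplest of the binaural cues mentioned above, the ITD, can be estimated by cross-correlating the two microphone signals, as sketched below; the notch extraction that provides elevation is the paper's contribution and is not reproduced here.

```python
# Minimal ITD estimation by cross-correlation of the two ear signals.
# The 1 ms maximum lag is an illustrative bound for a human-scale head.
import numpy as np

def itd_seconds(left, right, sr, max_lag_s=0.001):
    """Estimate the interaural time difference for equal-length snippets.
    Positive result: the left channel is delayed (source toward the right)."""
    max_lag = int(max_lag_s * sr)
    corr = np.correlate(left, right, mode='full')
    mid = len(left) - 1                       # index of zero lag
    window = corr[mid - max_lag: mid + max_lag + 1]
    return (np.argmax(window) - max_lag) / sr
```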

Patent
02 Sep 2006
TL;DR: In this article, a method for suppressing receiver audio regeneration is proposed, which includes the steps of receiving a communication signal (502) at a Radio Frequency (RF) unit (102), demodulating the communication signal to an audio signal (504), monitoring a volume level of the audio signal, and shifting the pitch of the audio signal when the volume level reaches a predetermined threshold.
Abstract: The invention concerns a method (500) and system (100) for suppressing receiver audio regeneration. The method (500) includes the steps of receiving a communication signal (502) at a Radio Frequency (RF) unit (102), demodulating the communication signal to an audio signal (504), monitoring a volume level of the audio signal (506), shifting the pitch of the audio signal when the volume level reaches a predetermined threshold (508), and playing the pitch-shifted audio signal out of a speaker to produce a pitch-shifted acoustic signal (510). The method can shift the pitch of the audio signal to produce a pitch-shifted acoustic signal with signal properties suppressing regeneration of the acoustic signal onto the audio signal at the RF unit. The amount of pitch-shifting can be a function of the volume level.
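
The gating logic of the claim can be sketched as follows, with librosa's pitch shifter and all constants serving as illustrative stand-ins for the patent's implementation.

```python
# Hedged sketch: pitch-shift only when the monitored level crosses a
# threshold, with the shift amount a function of the level. The threshold,
# maximum shift, and level-to-shift mapping are illustrative assumptions.
import librosa
import numpy as np

def suppress_regeneration(audio, sr, threshold=0.1, max_steps=2.0):
    level = np.sqrt(np.mean(audio ** 2))       # RMS volume level
    if level < threshold:
        return audio                           # below threshold: pass through
    # Shift grows with how far the level exceeds the threshold (capped).
    steps = min(max_steps, max_steps * (level - threshold) / threshold)
    return librosa.effects.pitch_shift(audio, sr=sr, n_steps=steps)
```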

PatentDOI
TL;DR: A portable audio device suitable for reproducing MPEG encoded data includes a plurality of inputs, a data storage, a display, an audio output, at least one processor, and a battery as discussed by the authors.
Abstract: A portable audio device suitable for reproducing MPEG encoded data includes a plurality of inputs, a data storage, a display, an audio output, at least one processor, and a battery. The plurality of inputs includes a forward input, a play control input, and a random input. The data storage stores compressed digitized audio data. The at least one processor is responsive to selection of at least one of the plurality of inputs to convert selected compressed digitized audio data stored in the data storage for reproduction by the audio output and to provide information to the display.

Patent
17 May 2006
TL;DR: In this paper, a measured level of an acoustic signal within an earpiece of a headset is used to determine compression characteristics without requiring separation of an interfering signal present in the monitored acoustic signal from a component related to the input audio signal.
Abstract: Adapting an audio response addresses perceptual effects of an interfering signal, such as of a residual ambient noise or other interference in an earpiece of a headphone. In one aspect, an input audio signal is presented substantially unmodified when it is at levels substantially above the interfering signal and is compressed when at or below the level of the interfering signal. The approach can make use of a measured level of an acoustic signal, for example, within an earpiece of a headset, and use the measured level in conjunction with the level of an input audio signal to determine compression characteristics without requiring separation of an interfering signal present in the monitored acoustic signal from a component related to the input audio signal. In another aspect, presentation characteristics of an input audio signal are determined to reduce distraction from an interfering signal, such as from a background conversation.

Patent
31 Jul 2006
TL;DR: In this article, an audio system installed in a listening space may include a signal processor and a plurality of loudspeakers, and the audio system may be tuned with an automated audio tuning system to optimize the sound output of the loudspeakers within the listening space.
Abstract: An audio system installed in a listening space may include a signal processor and a plurality of loudspeakers. The audio system may be tuned with an automated audio tuning system to optimize the sound output of the loudspeakers within the listening space. The automated audio tuning system may provide automated processing to determine at least one of a plurality of settings, such as channel equalization settings, delay settings, gain settings, crossover settings, bass optimization settings and group equalization settings. The settings may be generated by the automated audio tuning system based on an audio response produced by the loudspeakers in the audio system. The automated tuning system may generate simulations of the application of settings to the audio response to optimize tuning.

Proceedings ArticleDOI
Shumeet Baluja, Michele Covell
01 Jan 2006
TL;DR: Waveprint uses a combination of computer-vision techniques and large-scale-data-stream processing algorithms to create compact fingerprints of audio data that can be efficiently matched, and explicitly measures the tradeoffs between performance, memory usage, and computation.
Abstract: In this paper, we introduce Waveprint, a novel method for audio identification. Waveprint uses a combination of computer-vision techniques and large-scale-data-stream processing algorithms to create compact fingerprints of audio data that can be efficiently matched. The resulting system has excellent identification capabilities for small snippets of audio that have been degraded in a variety of manners, including competing noise, poor recording quality, and cell-phone playback. We explicitly measure the tradeoffs between performance, memory usage, and computation through extensive experimentation.
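
A toy version of the pipeline: wavelet-transform a spectrogram patch, keep only the signs of the largest coefficients, and hash the resulting sparse bit vector. The patch size, top-t value, and plain Python hash below are illustrative (the actual system uses min-hash with locality-sensitive hashing for matching).

```python
# Toy sketch of a Waveprint-style fingerprint: 2-D Haar wavelet transform
# of a spectrogram patch, sign-encode the top-t coefficients, then hash.
# All sizes and the final hash are illustrative stand-ins.
import numpy as np
import pywt

def fingerprint(spectrogram_patch, top_t=200):
    coeffs = pywt.wavedec2(spectrogram_patch, 'haar', level=3)
    flat, _ = pywt.coeffs_to_array(coeffs)
    flat = flat.ravel()
    idx = np.argsort(np.abs(flat))[-top_t:]        # largest-magnitude wavelets
    bits = np.zeros(2 * len(flat), dtype=np.uint8) # sign-encoded sparse vector
    for i in idx:
        bits[2 * i + (flat[i] > 0)] = 1
    return hash(bits.tobytes())                    # stand-in for min-hash/LSH
```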

Proceedings ArticleDOI
01 Sep 2006
TL;DR: This work presents an extension to NMF that is convolutive and includes a sparseness constraint, and in combination with a spectral magnitude transform, this method discovers auditory objects and their associated sparse activation patterns.
Abstract: Discovering a representation which allows auditory data to be parsimoniously represented is useful for many machine learning and signal processing tasks. Such a representation can be constructed by non-negative matrix factorisation (NMF), a method for finding parts-based representations of non-negative data. We present an extension to NMF that is convolutive and includes a sparseness constraint. In combination with a spectral magnitude transform, this method discovers auditory objects and their associated sparse activation patterns.
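
For reference, the parts-based baseline that the paper extends is standard NMF with multiplicative updates on a magnitude spectrogram V ≈ WH; the convolutive bases and sparseness penalty that are the paper's contribution are omitted from this sketch.

```python
# Standard NMF with Lee-Seung multiplicative updates (Euclidean cost),
# factorising a magnitude spectrogram V into spectral bases W and
# activations H. Rank and iteration count are illustrative.
import numpy as np

def nmf(V, rank, n_iter=200, eps=1e-9):
    F, T = V.shape
    rng = np.random.default_rng(0)
    W = rng.random((F, rank))        # spectral bases ("parts")
    H = rng.random((rank, T))        # their activations over time
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H
```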

PatentDOI
TL;DR: In this paper, the authors proposed a method to improve the quality of output audio by detecting an output acoustic signal and generating a receive audio signal based at least in part on the detected acoustic signal.
Abstract: A method (200) for improving quality of output audio (126). The method can include detecting an output acoustic signal (128) and generating a receive audio signal (134) based, at least in part, on the detected output acoustic signal. A frequency domain representation (140) of the receive audio signal can be compared to a frequency domain representation (138) of a source audio signal (124) from which the output acoustic signal is generated. At least one distortion signal (142) in the receive audio signal can be identified, and the source audio signal can be selectively equalized to reduce an amplitude of the source audio signal at a frequency that correlates to the distortion signal.

Journal ArticleDOI
TL;DR: Results show that the MMP algorithm is very promising for high-quality adaptive coding of audio signals, at the cost of a slight sub-optimality in the rate of convergence of the approximation error.
Abstract: This paper describes the Molecular Matching Pursuit (MMP), an extension of the popular Matching Pursuit (MP) algorithm for the decomposition of signals. The MMP is a practical solution which introduces the notion of structures within the framework of sparse overcomplete representations; these structures are based on the local dependency of significant time-frequency or time-scale atoms. We show that this algorithm is well adapted to the representation of real signals such as percussive audio signals. This is at the cost of a slight sub-optimality in terms of the rate of convergence for the approximation error, but the benefits are numerous, most notably a significant reduction in the computational cost, which facilitates the processing of long signals. Results show that this algorithm is very promising for high-quality adaptive coding of audio signals.
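
Plain Matching Pursuit, the greedy baseline that MMP extends by grouping significant time-frequency atoms into "molecules", can be sketched over an orthonormal DCT dictionary; the dictionary choice and stopping rule below are illustrative.

```python
# Plain Matching Pursuit over a DCT dictionary: greedily pick the atom most
# correlated with the residual and subtract its contribution. MMP's
# structured ("molecular") atom selection is not reproduced here.
import numpy as np
from scipy.fft import idct

def matching_pursuit(signal, n_atoms=50):
    N = len(signal)
    # Orthonormal DCT dictionary: column k is the k-th DCT basis vector.
    D = idct(np.eye(N), norm='ortho', axis=0)
    residual = signal.astype(float).copy()
    atoms = []
    for _ in range(n_atoms):
        corr = D.T @ residual             # correlation with every atom
        k = np.argmax(np.abs(corr))       # best-matching atom
        atoms.append((k, corr[k]))
        residual -= corr[k] * D[:, k]     # subtract its contribution
    return atoms, residual
```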

Journal ArticleDOI
01 Oct 2006
TL;DR: A biologically inspired and technically implemented sound localization system to robustly estimate the position of a sound source in the frontal azimuthal half-plane that is able to localize audible signals, for example human speech signals, even in reverberating environments.
Abstract: This paper proposes a biologically inspired and technically implemented sound localization system to robustly estimate the position of a sound source in the frontal azimuthal half-plane. For localization, binaural cues are extracted using cochleagrams generated by a cochlear model that serve as input to the system. The basic idea of the model is to separately measure interaural time differences and interaural level differences for a number of frequencies and process these measurements as a whole. This leads to two-dimensional frequency versus time-delay representations of binaural cues, so-called activity maps. A probabilistic evaluation is presented to estimate the position of a sound source over time based on these activity maps. Learned reference maps for different azimuthal positions are integrated into the computation to gain time-dependent discrete conditional probabilities. At every timestep these probabilities are combined over frequencies and binaural cues to estimate the sound source position. In addition, they are propagated over time to improve position estimation. This leads to a system that is able to localize audible signals, for example human speech signals, even in reverberating environments.

Journal ArticleDOI
TL;DR: A new algorithm is proposed for audio classification, which is based on weighted GMM Networks (WGN), and a new false alarm compensation procedure is implemented, which can compensate the false alarm rate significantly with little cost to the miss rate.
Abstract: The problem of unsupervised audio classification and segmentation continues to be a challenging research problem which significantly impacts automatic speech recognition (ASR) and spoken document retrieval (SDR) performance. This paper addresses novel advances in 1) audio classification for speech recognition and 2) audio segmentation for unsupervised multispeaker change detection. A new algorithm is proposed for audio classification, which is based on weighted GMM networks (WGN). Two new extended-time features, variance of the spectrum flux (VSF) and variance of the zero-crossing rate (VZCR), are used to preclassify the audio and supply weights to the output probabilities of the GMM networks. The classification is then implemented using weighted GMM networks. Since historically there have been no features specifically designed for audio segmentation, we evaluate 16 potential features including three new proposed features: perceptual minimum variance distortionless response (PMVDR), smoothed zero-crossing rate (SZCR), and filterbank log energy coefficients (FBLC) in 14 noisy environments to determine the best robust features on the average across these conditions. Next, a new distance metric, T²-mean, is proposed which is intended to improve segmentation for short segment turns (i.e., 1-5 s). A new false alarm compensation procedure is implemented, which can compensate the false alarm rate significantly with little cost to the miss rate. Evaluations on a standard data set, the Defense Advanced Research Projects Agency (DARPA) Hub4 Broadcast News 1997 evaluation data, show that the WGN classification algorithm achieves over a 50% improvement versus the GMM network baseline algorithm, and the proposed compound segmentation algorithm achieves a 10%-23% improvement in all metrics versus the baseline Mel-frequency cepstral coefficients (MFCC) and traditional Bayesian information criterion (BIC) algorithm. The new classification and segmentation algorithms also obtain very satisfactory results on the more diverse and challenging National Gallery of the Spoken Word (NGSW) corpus.

Proceedings ArticleDOI
09 Oct 2006
TL;DR: Two general frameworks based on Gaussian mixture models (GMM) and support vector machines (SVM) are presented to achieve shout detection in a railway embedded environment.
Abstract: This paper addresses the problem of automatic audio analysis for aided surveillance applications in public transport. The aim of such an application is to detect critical situations and to warn the control room. We propose a comparative study of two methods of modeling/classification of acoustical segments. The problem is quite similar to the 'audio indexing' framework; nevertheless, the environment here is very noisy. We present two general frameworks based on Gaussian mixture models (GMM) and support vector machines (SVM) to achieve shout detection in a railway embedded environment.