
Showing papers in "IEEE Transactions on Speech and Audio Processing in 2002"


Journal ArticleDOI
TL;DR: The automatic classification of audio signals into a hierarchy of musical genres is explored, and three feature sets for representing timbral texture, rhythmic content, and pitch content are proposed.
Abstract: Musical genres are categorical labels created by humans to characterize pieces of music. A musical genre is characterized by the common characteristics shared by its members. These characteristics typically are related to the instrumentation, rhythmic structure, and harmonic content of the music. Genre hierarchies are commonly used to structure the large collections of music available on the Web. Currently, musical genre annotation is performed manually. Automatic musical genre classification can assist or replace the human user in this process and would be a valuable addition to music information retrieval systems. In addition, automatic musical genre classification provides a framework for developing and evaluating features for any type of content-based analysis of musical signals. In this paper, the automatic classification of audio signals into a hierarchy of musical genres is explored. More specifically, three feature sets for representing timbral texture, rhythmic content, and pitch content are proposed. The performance and relative importance of the proposed features are investigated by training statistical pattern recognition classifiers using real-world audio collections. Both whole-file and real-time frame-based classification schemes are described. Using the proposed feature sets, a classification accuracy of 61% for ten musical genres is achieved. This result is comparable to results reported for human musical genre classification.

2,668 citations
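
The timbral-texture part of such a feature set is easy to make concrete. Below is a minimal sketch that computes three classic frame-level descriptors (spectral centroid, rolloff, and flux) and summarizes them with texture-window statistics; the frame size, hop size, and 85% rolloff threshold are illustrative assumptions, not necessarily the paper's settings.

```python
# Sketch of timbral-texture features: spectral centroid, rolloff, and flux per
# frame, summarized by mean and variance over the whole excerpt.
import numpy as np

def timbral_features(x, sr, frame=512, hop=256):
    feats, prev_mag = [], None
    for start in range(0, len(x) - frame, hop):
        mag = np.abs(np.fft.rfft(x[start:start + frame] * np.hanning(frame)))
        freqs = np.fft.rfftfreq(frame, 1.0 / sr)
        total = mag.sum() + 1e-12
        centroid = (freqs * mag).sum() / total                  # spectral "brightness"
        cum = np.cumsum(mag)
        rolloff = freqs[min(np.searchsorted(cum, 0.85 * total), len(freqs) - 1)]
        flux = 0.0 if prev_mag is None else np.sum((mag - prev_mag) ** 2)
        prev_mag = mag
        feats.append((centroid, rolloff, flux))
    f = np.asarray(feats)
    # "Texture window" statistics: mean and variance of each frame-level feature.
    return np.concatenate([f.mean(axis=0), f.var(axis=0)])

x = np.random.randn(22050)   # stand-in for one second of audio at 22.05 kHz
print(timbral_features(x, 22050))
```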


Journal ArticleDOI
Lie Lu1, Hong-Jiang Zhang1, Hao Jiang1
TL;DR: A robust approach that is capable of classifying and segmenting an audio stream into speech, music, environment sound, and silence is proposed, and an unsupervised speaker segmentation algorithm using a novel scheme based on quasi-GMM and LSP correlation analysis is developed.
Abstract: We present our study of audio content analysis for classification and segmentation, in which an audio stream is segmented according to audio type or speaker identity. We propose a robust approach that is capable of classifying and segmenting an audio stream into speech, music, environment sound, and silence. Audio classification is processed in two steps, which makes it suitable for different applications. The first step of the classification is speech and nonspeech discrimination. In this step, a novel algorithm based on K-nearest-neighbor (KNN) and linear spectral pairs-vector quantization (LSP-VQ) is developed. The second step further divides the nonspeech class into music, environment sounds, and silence with a rule-based classification scheme. A set of new features such as the noise frame ratio and band periodicity are introduced and discussed in detail. We also develop an unsupervised speaker segmentation algorithm using a novel scheme based on quasi-GMM and LSP correlation analysis. Without a priori knowledge, this algorithm supports open-set speakers, online speaker modeling, and real-time segmentation. Experimental results indicate that the proposed algorithms can produce very satisfactory results.

559 citations
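
The band periodicity feature lends itself to a short illustration. The sketch below takes the maximum normalized autocorrelation of a band-limited signal in each frame and averages it across frames; the band edges, frame length, and lag search range are assumptions chosen for the example.

```python
# Rough sketch of a band-periodicity style feature: high for tonal/music-like
# content in a subband, low for noise-like content.
import numpy as np
from scipy.signal import butter, lfilter

def band_periodicity(x, sr, band=(500.0, 1000.0), frame=1024):
    b, a = butter(4, [band[0] / (sr / 2), band[1] / (sr / 2)], btype="band")
    y = lfilter(b, a, x)
    scores = []
    for start in range(0, len(y) - frame, frame):
        seg = y[start:start + frame]
        seg = seg - seg.mean()
        ac = np.correlate(seg, seg, mode="full")[frame - 1:]
        ac /= ac[0] + 1e-12                    # normalized autocorrelation
        scores.append(ac[32:400].max())        # peak over plausible pitch lags
    return float(np.mean(scores))

sr = 16000
t = np.arange(sr) / sr
print(band_periodicity(np.sin(2 * np.pi * 600 * t), sr))   # near 1.0 (periodic)
print(band_periodicity(np.random.randn(sr), sr))           # much lower (noise)
```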


Journal ArticleDOI
TL;DR: This work combines cross-power minimization of second-order source separation with geometric linear constraints used in adaptive beamforming to resolve some of the ambiguities inherent in the independence criterion such as frequency permutations and degrees of freedom provided by additional sensors.
Abstract: Convolutive blind source separation and adaptive beamforming have a similar goal: extracting a source of interest (or multiple sources) while reducing undesired interferences. A benefit of source separation is that it overcomes the conventional cross-talk or leakage problem of adaptive beamforming. Beamforming, on the other hand, exploits geometric information which is often readily available but not utilized in blind algorithms. We propose to join these benefits by combining cross-power minimization of second-order source separation with geometric linear constraints used in adaptive beamforming. We find that the geometric constraints resolve some of the ambiguities inherent in the independence criterion, such as frequency permutations and degrees of freedom provided by additional sensors. We demonstrate the new method in performance comparisons for actual room recordings of two and three simultaneous acoustic sources.

341 citations
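
The mechanics of imposing a geometric linear constraint during adaptation can be sketched with a Frost-style projection. In the toy below, the separation criterion itself (cross-power minimization) is replaced by a stand-in random gradient, so only the constraint-preserving update is illustrated.

```python
# Enforcing a geometric linear constraint C^H w = g during gradient adaptation:
# project the updated weights onto the constraint null space, then restore the
# constrained component.
import numpy as np

rng = np.random.default_rng(0)
M = 4                                   # number of sensors
C = rng.standard_normal((M, 1)) + 1j * rng.standard_normal((M, 1))  # steering vector
g = np.array([1.0 + 0j])                # desired unit response toward the source

Ci = C @ np.linalg.inv(C.conj().T @ C)  # C (C^H C)^{-1}
P = np.eye(M) - Ci @ C.conj().T         # projector onto the constraint null space
f = (Ci @ g).ravel()                    # minimum-norm weights satisfying C^H w = g

w = f.copy()
for _ in range(100):
    grad = rng.standard_normal(M) + 1j * rng.standard_normal(M)  # stand-in gradient
    w = P @ (w - 0.01 * grad) + f       # project, then restore the constraint

print(np.allclose(C.conj().T @ w, g))   # constraint still holds after adaptation
```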


Journal ArticleDOI
TL;DR: The adaptive multirate wideband (AMR-WB) speech codec, selected by the Third Generation Partnership Project (3GPP) for GSM and the third-generation WCDMA mobile communication system to provide wideband speech services, is described.
Abstract: This paper describes the adaptive multirate wideband (AMR-WB) speech codec selected by the Third Generation Partnership Project (3GPP) for GSM and the third-generation mobile communication WCDMA system for providing wideband speech services. The AMR-WB speech codec algorithm was selected in December 2000 and the corresponding specifications were approved in March 2001. The AMR-WB codec was also selected by the International Telecommunication Union Telecommunication Standardization Sector (ITU-T) in July 2001 in the standardization activity for wideband speech coding around 16 kb/s and was approved in January 2002 as Recommendation G.722.2. The adoption of AMR-WB by the ITU-T is of significant importance since, for the first time, the same codec is adopted for wireless as well as wireline services. AMR-WB uses an extended audio bandwidth from 50 Hz to 7 kHz and gives superior speech quality and voice naturalness compared to existing second- and third-generation mobile communication systems. The wideband speech service provided by the AMR-WB codec will give mobile communication speech quality that also substantially exceeds (narrowband) wireline quality. The paper details the AMR-WB standardization history, gives an algorithmic description including novel techniques for efficient ACELP wideband speech coding, and reports the subjective quality performance of the codec.

312 citations


Journal ArticleDOI
Qi Li1, Jinsong Zheng1, A. Tsai1, Qiru Zhou1
TL;DR: Experiments show that the batch-mode algorithm can detect endpoints as accurately as HMM forced alignment while requiring much less computation.
Abstract: When automatic speech recognition (ASR) and speaker verification (SV) are applied in adverse acoustic environments, endpoint detection and energy normalization can be crucial to the functioning of both systems. In low signal-to-noise ratio (SNR) and nonstationary environments, conventional approaches to endpoint detection and energy normalization often fail and ASR performance usually degrades dramatically. The purpose of this paper is to address the endpoint problem. For ASR, we propose a real-time approach. It uses an optimal filter plus a three-state transition diagram for endpoint detection. The filter is designed utilizing several criteria to ensure accuracy and robustness. It has an almost invariant response at various background noise levels. The detected endpoints are then applied to energy normalization sequentially. Evaluation results show that the proposed algorithm significantly reduces the string error rates in low SNR situations. The reduction rates even exceed 50% on several evaluated databases. For SV, we propose a batch-mode approach. It uses the optimal filter plus a two-mixture energy model for endpoint detection. The experiments show that the batch-mode algorithm can detect endpoints as accurately as HMM forced alignment while requiring much less computation.

225 citations
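
A minimal sketch of the filter-plus-state-machine idea for ASR endpointing follows: smooth the frame log-energy with a short filter, then walk a silence/speech/leaving-speech transition diagram. The moving-average filter, thresholds, and hangover length are ad hoc stand-ins for the paper's optimal filter design.

```python
# Filtered-energy endpoint detector with a three-state transition diagram.
import numpy as np

def detect_endpoints(x, frame=256, t_on=3.0, t_off=1.5, hang=8):
    e = np.array([np.log(np.sum(x[i:i + frame] ** 2) + 1e-10)
                  for i in range(0, len(x) - frame, frame)])
    e = np.convolve(e - e.min(), np.ones(5) / 5, mode="same")  # smoothing filter
    state, count, segments, start = "silence", 0, [], None
    for i, v in enumerate(e):
        if state == "silence" and v > t_on:
            state, start = "speech", i
        elif state == "speech" and v < t_off:
            state, count = "leaving", 0
        elif state == "leaving":
            if v > t_on:
                state = "speech"          # energy came back: stay in speech
            else:
                count += 1
                if count >= hang:         # hangover expired: commit the endpoint
                    segments.append((start, i - hang))
                    state = "silence"
    if state != "silence":
        segments.append((start, len(e) - 1))
    return segments                        # list of (start_frame, end_frame)

sig = np.concatenate([0.01 * np.random.randn(4000),
                      np.random.randn(8000),
                      0.01 * np.random.randn(4000)])
print(detect_endpoints(sig))
```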


Journal ArticleDOI
TL;DR: An algorithm is proposed which detects speech pauses by adaptively tracking minima in a noisy signal's power envelope, both for the broadband signal and for high-pass and low-pass filtered versions, and maintains a low false-alarm rate even at poor signal-to-noise ratios (SNRs).
Abstract: A speech pause detection algorithm is an important and sensitive part of most single-microphone noise reduction schemes for enhancement of speech signals corrupted by additive noise as an estimate of the background noise is usually determined when speech is absent. An algorithm is proposed which detects speech pauses by adaptively tracking minima in a noisy signal's power envelope both for the broadband signal and for the high-pass and low-pass filtered signal. In poor signal-to-noise ratios (SNRs), the proposed algorithm maintains a low false-alarm rate in the detection of speech pauses while the standardized algorithm of ITU G.729 shows an increasing false-alarm rate in unfavorable situations. These characteristics are found with different types of noise and indicate that the proposed algorithm is better suited to be used for noise estimation in noise reduction algorithms, as speech deterioration may thus be kept at a low level. It is shown that in connection with the Ephraim-Malah (1984) noise reduction scheme, the speech pause detection performance can even be further increased by using the noise-reduced signal instead of the noisy signal as input for the speech pause decision unit.

219 citations
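
The minima-tracking principle can be illustrated compactly: smooth the frame power, track its minimum over a sliding window, and declare a pause when the current level sits close to that minimum. Window length, smoothing constant, and the closeness factor below are illustrative, not the paper's tuned values.

```python
# Pause detection by tracking minima of a smoothed power envelope.
import numpy as np

def pause_flags(x, frame=256, win=50, alpha=0.9, factor=2.0):
    p = np.array([np.sum(x[i:i + frame] ** 2)
                  for i in range(0, len(x) - frame, frame)])
    smoothed = np.empty_like(p)
    s = p[0]
    for i, v in enumerate(p):
        s = alpha * s + (1 - alpha) * v                # first-order power envelope
        smoothed[i] = s
    flags = np.zeros(len(p), dtype=bool)
    for i in range(len(p)):
        floor = smoothed[max(0, i - win):i + 1].min()  # tracked minimum
        flags[i] = smoothed[i] < factor * floor        # close to the noise floor
    return flags

noise = 0.05 * np.random.randn(16000)
speech = np.random.randn(8000)
flags = pause_flags(np.concatenate([noise, speech, noise]))
third = len(flags) // 3
print("pause fraction (noise, mixed, noise):",
      [round(float(flags[i * third:(i + 1) * third].mean()), 2) for i in range(3)])
```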


Journal ArticleDOI
TL;DR: On-line estimation of the clean speech and model parameters is performed, and it is shown how the adequacy of the chosen statistical models may be determined by combining the particle filter with frequentist methods.
Abstract: This paper applies time-varying autoregressive (TVAR) models with stochastically evolving parameters to the problem of speech modeling and enhancement. The stochastic evolution models for the TVAR parameters are Markovian diffusion processes. The main aim of the paper is to perform on-line estimation of the clean speech and model parameters and to determine the adequacy of the chosen statistical models. Efficient particle methods are developed to solve the optimal filtering and fixed-lag smoothing problems. The algorithms combine sequential importance sampling (SIS), a selection step and Markov chain Monte Carlo (MCMC) methods. They employ several variance reduction strategies to make the best use of the statistical structure of the model. It is also shown how model adequacy may be determined by combining the particle filter with frequentist methods. The modeling and enhancement performance of the models and estimation algorithms are evaluated in simulation studies on both synthetic and real speech data sets.

169 citations
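
A toy bootstrap particle filter conveys the propagate/weight/resample cycle on which such schemes build, here tracking a single TVAR coefficient with a random-walk evolution from noisy observations. This is far simpler than the paper's SIS-plus-MCMC algorithm with variance reduction; all variances below are illustrative.

```python
# Bootstrap particle filter for one time-varying AR(1) coefficient.
import numpy as np

rng = np.random.default_rng(1)
T, N = 400, 500
a_true = 0.05 + 0.9 * np.sin(np.linspace(0, np.pi, T))  # slowly varying coefficient
x = np.zeros(T)
for t in range(1, T):
    x[t] = a_true[t] * x[t - 1] + 0.1 * rng.standard_normal()
y = x + 0.1 * rng.standard_normal(T)                     # noisy observations

particles = rng.uniform(-1.0, 1.0, N)                    # initial coefficient guesses
est = np.zeros(T)
for t in range(1, T):
    particles = particles + 0.02 * rng.standard_normal(N)    # diffusion step
    pred = particles * y[t - 1]                              # predicted observation
    w = np.exp(-0.5 * ((y[t] - pred) / 0.15) ** 2) + 1e-12   # Gaussian likelihood
    w /= w.sum()
    est[t] = np.inner(w, particles)                          # filtered estimate
    particles = particles[rng.choice(N, size=N, p=w)]        # multinomial resampling

print("mean abs tracking error:", np.mean(np.abs(est[50:] - a_true[50:])))
```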


Journal ArticleDOI
TL;DR: Issues in ASR of children's speech are introduced by an analysis of developmental changes in the spectral and temporal characteristics of the speech signal using data obtained from 456 children, ages five to 18 years.
Abstract: Creating conversational interfaces for children is challenging in several respects. These include acoustic modeling for automatic speech recognition (ASR), language and dialog modeling, and multimodal-multimedia user interface design. First, issues in ASR of children's speech are introduced by an analysis of developmental changes in the spectral and temporal characteristics of the speech signal using data obtained from 456 children, ages five to 18 years. Acoustic modeling adaptation and vocal tract normalization algorithms that yielded state-of-the-art ASR performance on children's speech are described. Second, an experiment designed to better understand how children interact with machines using spoken language is described. Realistic conversational multimedia interaction data were obtained from 160 children who played a voice-activated computer game in a Wizard of Oz (WoZ) scenario. Results of using these data in developing novel language and dialog models as well as in a unified maximum likelihood framework for acoustic decoding in ASR and semantic classification for spoken language understanding are described. Leveraging the lessons learned from the WoZ study and a concurrent user experience evaluation, a multimedia personal agent prototype for children was designed. Details of the architecture and application are described. Informal evaluation by children was positive, especially for the animated agent and the speech interface.

163 citations


Journal ArticleDOI
TL;DR: This work proposes the use of a polynomial-based classifier which is highly computationally scalable with the number of speakers, and a new training algorithm which is discriminative, handles large data sets, and has low memory usage.
Abstract: Modern speaker recognition applications require high accuracy at low complexity. We propose the use of a polynomial-based classifier to achieve these objectives. This approach has several advantages. First, polynomial classifier scoring yields a system which is highly computationally scalable with the number of speakers. Second, a new training algorithm is proposed which is discriminative, handles large data sets, and has low memory usage. Third, the output of the polynomial classifier is easily incorporated into a statistical framework allowing it to be combined with other techniques such as hidden Markov models. Results are given for the application of the new methods to the YOHO speaker recognition database.

147 citations
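
The scoring side of a polynomial classifier is simple to sketch: expand each feature frame into monomials up to degree two, train one weight vector per speaker against 1/0 targets, and score an utterance by the mean output over frames. Plain regularized normal equations stand in below for the paper's specialized large-scale training algorithm.

```python
# Polynomial-classifier scoring for speaker recognition on toy data.
import numpy as np

def poly_expand(F):                       # frames (n, d) -> monomials up to degree 2
    n, d = F.shape
    cross = [F[:, i] * F[:, j] for i in range(d) for j in range(i, d)]
    return np.column_stack([np.ones(n), F] + cross)

rng = np.random.default_rng(0)
spk_means = [rng.standard_normal(4) for _ in range(3)]          # 3 toy "speakers"
train = [m + 0.3 * rng.standard_normal((200, 4)) for m in spk_means]

X = poly_expand(np.vstack(train))
W = []
for k in range(3):                        # one-vs-all targets per speaker
    t = np.zeros(X.shape[0]); t[200 * k:200 * (k + 1)] = 1.0
    W.append(np.linalg.solve(X.T @ X + 1e-3 * np.eye(X.shape[1]), X.T @ t))

test = spk_means[1] + 0.3 * rng.standard_normal((50, 4))        # utterance by speaker 1
scores = [poly_expand(test) @ w for w in W]
print("identified speaker:", int(np.argmax([s.mean() for s in scores])))
```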


Journal ArticleDOI
TL;DR: Algorithms for combined acoustic echo cancellation and noise reduction for hands-free telephones are presented and compared; a psychoacoustically motivated weighting rule is mostly preferred in listening tests since it leads to more natural near-end speech and less annoying residual noise.
Abstract: This paper presents and compares algorithms for combined acoustic echo cancellation and noise reduction for hands-free telephones. A structure is proposed, consisting of a conventional acoustic echo canceler and a frequency domain postfilter in the sending path of the hands-free system. The postfilter applies the spectral weighting technique and attenuates both the background noise and the residual echo which remains after imperfect echo cancellation. Two weighting rules for the postfilter are discussed. The first is a conventional one, known from noise reduction, which is extended to attenuate residual echo as well as noise. The second is a psychoacoustically motivated weighting rule. Both rules are evaluated and compared by instrumental and auditory tests. They succeed about equally well in attenuating the noise and the residual echo. In listening tests, however, the psychoacoustically motivated weighting rule is mostly preferred since it leads to more natural near-end speech and to less annoying residual noise.

146 citations
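
The conventional (non-psychoacoustic) weighting rule can be sketched as a per-frequency gain computed from estimated noise and residual-echo power spectra, with a floor to limit musical artifacts. The estimates below are simply given; deriving them, and the psychoacoustically motivated rule, are beyond this sketch.

```python
# Spectral-weighting postfilter: attenuate noise plus residual echo per bin.
import numpy as np

def postfilter(frame, noise_psd, resid_echo_psd, floor=0.1):
    spec = np.fft.rfft(frame * np.hanning(len(frame)))
    psd = np.abs(spec) ** 2
    gain = 1.0 - (noise_psd + resid_echo_psd) / (psd + 1e-12)  # subtraction-type rule
    gain = np.clip(gain, floor, 1.0)                           # gain floor
    return np.fft.irfft(gain * spec, n=len(frame))

n = 256
frame = np.random.randn(n)                  # microphone frame after echo cancellation
noise_psd = np.full(n // 2 + 1, 0.5 * n)    # assumed stationary noise estimate
echo_psd = np.full(n // 2 + 1, 0.1 * n)     # assumed residual-echo estimate
print(postfilter(frame, noise_psd, echo_psd)[:5])
```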


Journal ArticleDOI
TL;DR: A subjective listening test of the combined pre-filter/lossless coder and a state-of-the-art perceptual audio coder (PAC) shows that the new method achieves a comparable compression ratio and audio quality with a lower delay.
Abstract: This paper proposes a versatile perceptual audio coding method that achieves high compression ratios and is capable of low encoding/decoding delay. It accommodates a variety of source signals (including both music and speech) with different sampling rates. It is based on separating irrelevance and redundancy reductions into independent functional units. This contrasts traditional audio coding where both are integrated within the same subband decomposition. The separation allows for the independent optimization of the irrelevance and redundancy reduction units. For both reductions, we rely on adaptive filtering and predictive coding as much as possible to minimize the delay. A psycho-acoustically controlled adaptive linear filter is used for the irrelevance reduction, and the redundancy reduction is carried out by a predictive lossless coding scheme, which is termed the weighted cascaded least mean squared (WCLMS) method. Experiments are carried out on a moderate-size database containing mono signals of different sampling rates and varying nature (music, speech, or mixed). They show that the proposed WCLMS lossless coder outperforms other competing lossless coders in terms of compression ratios and delay, as applied to the pre-filtered signal. Moreover, a subjective listening test of the combined pre-filter/lossless coder and a state-of-the-art perceptual audio coder (PAC) shows that the new method achieves a comparable compression ratio and audio quality with a lower delay.
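
The redundancy-reduction idea is illustrated below with a single NLMS predictor running in lockstep at encoder and decoder, transmitting only integer residuals; the weighted cascade of predictors that defines WCLMS is deliberately reduced to one stage.

```python
# Predictive lossless coding: identical NLMS predictors at encoder and decoder,
# so transmitting the integer residual reconstructs the signal exactly.
import numpy as np

def nlms_residuals(x, order=8, mu=0.5):
    w = np.zeros(order)
    res = np.zeros(len(x), dtype=np.int64)
    for n in range(len(x)):
        past = x[max(0, n - order):n][::-1].astype(float)
        past = np.pad(past, (0, order - len(past)))
        res[n] = x[n] - int(round(w @ past))            # value to entropy-code
        err = x[n] - w @ past
        w += mu * err * past / (past @ past + 1e-6)     # NLMS update
    return res

def nlms_reconstruct(res, order=8, mu=0.5):
    w = np.zeros(order)
    x = np.zeros(len(res), dtype=np.int64)
    for n in range(len(res)):
        past = x[max(0, n - order):n][::-1].astype(float)
        past = np.pad(past, (0, order - len(past)))
        x[n] = int(round(w @ past)) + res[n]            # exact inverse of the encoder
        err = x[n] - w @ past
        w += mu * err * past / (past @ past + 1e-6)     # same update as the encoder
    return x

sig = (1000 * np.sin(np.arange(2000) * 0.05)).astype(np.int64)
r = nlms_residuals(sig)
assert np.array_equal(nlms_reconstruct(r), sig)         # lossless round trip
print("residual std vs signal std:", r.std(), sig.std())
```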

Journal ArticleDOI
TL;DR: The optimal estimate of the Filtered-X LMS with an individually adjusted fixed filter shows the best agreement with the desired estimate.
Abstract: Feedback cancellation in hearing aids based on Filtered-X LMS is analyzed. The data used for identification of the feedback path are the output and input signals of the hearing aid. The identification is thus done in a closed loop. Tracking characteristics and bias of the optimal estimate are discussed. The optimal estimate can be biased when the identification is performed in closed loop and the input signal to the hearing aid is not white. It is shown that the bias can be avoided if the spectrum of the input signal is known and the data used to update the internal feedback path are prefiltered. The effects of different choices of the design variables of the Filtered-X LMS are discussed. Three alternatives of the fixed filter are evaluated on feedback paths of hearing aids on human subjects and with alternative spectra of the input signal. The optimal estimate of the Filtered-X LMS with an individually adjusted fixed filter shows the best agreement with the desired estimate.
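
For orientation, here is a generic Filtered-X LMS skeleton: the input is filtered through an estimate of the path between the adaptive filter's output and the error point, and that filtered-x signal drives the weight update. The closed-loop hearing-aid configuration, the bias analysis, and the fixed-filter design evaluated in the paper are not modeled.

```python
# Generic FxLMS adaptation with a known secondary-path estimate.
import numpy as np

rng = np.random.default_rng(0)
sec = np.array([0.6, 0.2])                 # path from filter output to error point
sec_est = sec.copy()                       # its estimate, assumed accurate here
prim = np.array([0.5, -0.3, 0.1, 0.05])    # unknown path the canceler must counteract
L, mu, T = 12, 0.005, 30000
w = np.zeros(L)
xbuf, ybuf, fxbuf = np.zeros(L), np.zeros(len(sec)), np.zeros(L)
errs = []
for n in range(T):
    xbuf = np.roll(xbuf, 1); xbuf[0] = rng.standard_normal()
    d = prim @ xbuf[:len(prim)]            # disturbance via the unknown path
    ybuf = np.roll(ybuf, 1); ybuf[0] = w @ xbuf          # adaptive filter output
    e = d - sec @ ybuf                     # residual after the secondary path
    fxbuf = np.roll(fxbuf, 1); fxbuf[0] = sec_est @ xbuf[:len(sec_est)]
    w += mu * e * fxbuf                    # FxLMS weight update
    errs.append(e)

print("residual RMS, first vs last 1000 samples:",
      round(float(np.sqrt(np.mean(np.square(errs[:1000])))), 4),
      round(float(np.sqrt(np.mean(np.square(errs[-1000:])))), 4))
```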

Journal ArticleDOI
TL;DR: A spectral-domain speech enhancement algorithm is presented, based on a mixture model for the short-time spectrum of the clean speech signal and on a maximum assumption in the production of the noisy speech spectrum; it shows improved performance compared to alternative speech enhancement algorithms.
Abstract: We present a spectral-domain speech enhancement algorithm. The new algorithm is based on a mixture model for the short-time spectrum of the clean speech signal, and on a maximum assumption in the production of the noisy speech spectrum. In the past this model was used in the context of noise-robust speech recognition. In this paper we show that this model is also effective for improving the quality of speech signals corrupted by additive noise. The computational requirements of the algorithm can be significantly reduced, essentially without paying performance penalties, by incorporating a dual codebook scheme with tied variances. Experiments, using recorded speech signals and actual noise sources, show that in spite of its low computational requirements, the algorithm shows improved performance compared to alternative speech enhancement algorithms.
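
The maximum assumption is easy to check numerically: in the log short-time spectral domain, the noisy spectrum is approximated by the element-wise maximum of the speech and noise log-spectra. The signals below are synthetic stand-ins.

```python
# Numerical check of the "maximum assumption" in the log-spectral domain.
import numpy as np

rng = np.random.default_rng(0)
s = np.sin(2 * np.pi * 440 * np.arange(512) / 8000) * np.hanning(512)  # "speech"
n = 0.3 * rng.standard_normal(512) * np.hanning(512)                   # "noise"

S = np.log(np.abs(np.fft.rfft(s)) + 1e-9)
N = np.log(np.abs(np.fft.rfft(n)) + 1e-9)
Y = np.log(np.abs(np.fft.rfft(s + n)) + 1e-9)      # actual noisy log-spectrum

approx = np.maximum(S, N)                           # max-model approximation
print("mean abs error of max-approximation:", np.mean(np.abs(Y - approx)))
print("spectral dynamic range for comparison:", Y.max() - Y.min())
```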

Journal ArticleDOI
TL;DR: On the German spontaneous speech task Verbmobil, the WSJ task and the German telephone digit string corpus SieTill, the proposed methods for VTN reduce the error rates significantly.
Abstract: This paper presents methods for speaker adaptive modeling using vocal tract normalization (VTN) along with experimental tests on three databases. We propose a new training method for VTN: by using single-density acoustic models per HMM state for selecting the scale factor of the frequency axis, we avoid the problem that a mixture density tends to learn the scale factors of the training speakers and thus cannot be used for selecting the scale factor. We show that using single Gaussian densities for selecting the scale factor in training results in lower error rates than using mixture densities. For the recognition phase, we propose an improvement of the well-known two-pass strategy: by using a non-normalized acoustic model for the first recognition pass instead of a normalized model, lower error rates are obtained. In recognition tests, this method is compared with a fast variant of VTN. The two-pass strategy is an efficient method, but it is suboptimal because the scale factor and the word sequence are determined sequentially. We found that for telephone digit string recognition this suboptimality reduces the VTN gain in recognition performance by 30% relative. In summary, on the German spontaneous speech task Verbmobil, the WSJ task and the German telephone digit string corpus SieTill, the proposed methods for VTN reduce the error rates significantly.
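
The core of VTN, scaling the frequency axis and selecting the factor by likelihood, can be sketched with linear interpolation of magnitude spectra and a grid search scored by a single Gaussian (a stand-in for the HMM-based selection above); piecewise-linear warping details are omitted.

```python
# VTN as spectrum warping plus a likelihood-scored grid search over scale factors.
import numpy as np

def warp_spectrum(mag, alpha):
    bins = np.arange(len(mag))
    return np.interp(bins, bins / alpha, mag)     # stretch/compress frequency axis

def best_alpha(spectra, mean, var, grid=np.arange(0.88, 1.13, 0.02)):
    def score(a):
        logp = 0.0
        for mag in spectra:
            z = warp_spectrum(mag, a) - mean
            logp += -0.5 * np.sum(z * z / var)    # single-Gaussian log-likelihood
        return logp
    return max(grid, key=score)

rng = np.random.default_rng(0)
proto = np.exp(-0.5 * ((np.arange(128) - 40) / 8.0) ** 2)   # reference spectral shape
mean, var = proto, np.full(128, 0.05)
speaker = [warp_spectrum(proto, 1.1) + 0.05 * rng.standard_normal(128)
           for _ in range(20)]                    # speaker with a stretched axis
# The selected factor should be near 1/1.1, undoing the speaker's stretch.
print("selected scale factor:", round(float(best_alpha(speaker, mean, var)), 2))
```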

Journal ArticleDOI
TL;DR: Low-power, portable devices could achieve very high levels of speech-content protection at only 30-45% of the computational load of current techniques, freeing resources for other tasks and enabling longer battery life.
Abstract: Mobile multimedia applications, the focus of many forthcoming wireless services, increasingly demand low-power techniques implementing content protection and customer privacy. In this paper, low-complexity perception-based partial encryption schemes for speech are presented. Speech compressed by a widely used speech coding algorithm, the ITU-T G.729 standard at 8 kb/s, is partitioned into two classes: the most perceptually relevant bits, to be encrypted, and the remainder, left unprotected. Two partial-encryption techniques are developed: a low-protection scheme, aimed at preventing most kinds of eavesdropping, and a high-protection scheme, based on the encryption of a larger share of perceptually important bits and meant to perform as well as full encryption of the compressed bitstream. The high-protection scheme, based on the encryption of about 45% of the bitstream, achieves content protection comparable to that obtained by full encryption, as verified by both objective measures and formal listening tests. For the low-protection scheme, encryption of as little as 30% of the bitstream virtually eliminates intelligibility as well as most of the remaining perceptual information. Low-power, portable devices could therefore achieve very high levels of speech-content protection at only 30-45% of the computational load of current techniques, freeing resources for other tasks and enabling longer battery life.
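
Structurally, partial encryption amounts to XORing only a designated class of perceptually critical bits with a keystream. In the sketch below, the split (first 30% of each frame's bits) is a placeholder rather than G.729's actual bit-sensitivity ordering, and the hash-based keystream is a toy that a real system would replace with a vetted stream cipher.

```python
# Partial encryption of a coded frame: XOR only the "critical" bit class.
import hashlib

def keystream(key: bytes, nbits: int, frame_no: int):
    out, counter = [], 0
    while len(out) < nbits:                       # toy hash-counter keystream
        block = hashlib.sha256(key + frame_no.to_bytes(4, "big")
                               + counter.to_bytes(4, "big")).digest()
        out.extend((byte >> i) & 1 for byte in block for i in range(8))
        counter += 1
    return out[:nbits]

def partial_encrypt(frame_bits, key, frame_no, critical_fraction=0.30):
    ncrit = int(len(frame_bits) * critical_fraction)
    ks = keystream(key, ncrit, frame_no)
    return [b ^ k for b, k in zip(frame_bits[:ncrit], ks)] + frame_bits[ncrit:]

frame = [1, 0, 1, 1, 0, 0, 1, 0] * 10             # stand-in for an 80-bit G.729 frame
enc = partial_encrypt(frame, b"secret", 0)
dec = partial_encrypt(enc, b"secret", 0)          # XOR is its own inverse
print(dec == frame, sum(a != b for a, b in zip(frame, enc)), "bits changed")
```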

Journal ArticleDOI
TL;DR: This paper presents a complete probabilistic formulation for the automatic design of subword units and dictionary, given only the acoustic data and their transcriptions, and permits easy incorporation of external sources of information, such as the spellings of words in terms of a nonideographic script.
Abstract: Large vocabulary continuous speech recognition (LVCSR) systems traditionally represent words in terms of smaller subword units. Both during training and during recognition, they require a mapping table, called the dictionary, which maps words into sequences of these subword units. The performance of the LVCSR system depends critically on the definition of the subword units and the accuracy of the dictionary. In current LVCSR systems, both these components are manually designed. While manually designed subword units generalize well, they may not be the optimal units of classification for the specific task or environment for which an LVCSR system is trained. Moreover, when human expertise is not available, it may not be possible to design good subword units manually. There is clearly a need for data-driven design of these LVCSR components. In this paper, we present a complete probabilistic formulation for the automatic design of subword units and dictionary, given only the acoustic data and their transcriptions. The proposed framework permits easy incorporation of external sources of information, such as the spellings of words in terms of a nonideographic script.

Journal ArticleDOI
TL;DR: This paper presents a system that allows the user to search for information on mobile devices using spoken natural-language queries, combining state-of-the-art speech-recognition and information-retrieval technologies; for mobile devices with high-quality microphones, spoken-query retrieval yields retrieval precision close to that for perfect text input.
Abstract: With the proliferation of handheld devices, information access on mobile devices is a topic of growing relevance. This paper presents a system that allows the user to search for information on mobile devices using spoken natural-language queries. We explore several issues related to the creation of this system, which combines state-of-the-art speech-recognition and information-retrieval technologies. This is the first work that we are aware of which evaluates spoken-query based information retrieval on a commonly available and well researched text database, the Chinese news corpus used in the National Institute of Standards and Technology (NIST) TREC-5 and TREC-6 benchmarks. To compare spoken-query retrieval performance for different relevant scenarios and recognition accuracies, the benchmark queries, read verbatim by 20 speakers, were recorded simultaneously through three channels: headset microphone, PDA microphone, and cellular phone. Our results show that for mobile devices with high-quality microphones, spoken-query retrieval based on existing technologies yields retrieval precision that comes close to that for perfect text input (mean average precision 0.459 and 0.489, respectively, on TREC-6).

Journal ArticleDOI
TL;DR: MiPad is a wireless mobile PDA prototype that enables users to accomplish many common tasks using a multimodal spoken language interface and wireless-data technologies, and provides a novel solution for data entry in PDAs or smart phones.
Abstract: This paper describes the main components of MiPad (multimodal interactive PAD) and especially its distributed speech processing aspects. MiPad is a wireless mobile PDA prototype that enables users to accomplish many common tasks using a multimodal spoken language interface and wireless-data technologies. It fully integrates continuous speech recognition and spoken language understanding, and provides a novel solution for data entry in PDAs or smart phones, often done by pecking with tiny styluses or typing on minuscule keyboards. Our user study indicates that the throughput of MiPad is significantly superior to that of the existing pen-based PDA interface. Acoustic modeling and noise robustness in distributed speech recognition are key components in MiPad's design and implementation. In a typical scenario, the user speaks to the device at a distance so that he or she can see the screen. The built-in microphone thus picks up a lot of background noise, which requires MiPad to be noise robust. For complex tasks, such as dictating e-mails, resource limitations demand the use of a client-server (peer-to-peer) architecture, where the PDA performs primitive feature extraction, feature quantization, and error protection, while the features transmitted to the server are subject to further speech feature enhancement, speech decoding and understanding before a dialog is carried out and actions rendered. Noise robustness can be achieved at the client, at the server or both. Various speech processing aspects of this type of distributed computation as related to MiPad's potential deployment are presented. Previous user interface study results are also described. Finally, we point out future research directions as related to several key MiPad functionalities.

Journal ArticleDOI
TL;DR: Considering the monosyllabic structure of the Chinese language, a whole class of syllable-based indexing features, including overlapping segments of syllables and syllable pairs separated by a few syllables, is extensively investigated based on a Mandarin broadcast news database.
Abstract: With the rapidly growing use of the audio and multimedia information over the Internet, the technology for retrieving speech information using voice queries is becoming more and more important. In this paper, considering the monosyllabic structure of the Chinese language, a whole class of syllable-based indexing features, including overlapping segments of syllables and syllable pairs separated by a few syllables, is extensively investigated based on a Mandarin broadcast news database. The strong discriminating capabilities of such syllable-based features were verified by comparing with the word- or character-based features. Good approaches for better utilizing such capabilities, including fusion with the word- and character-level information and improved approaches to obtain better syllable-based features and query expressions, were extensively investigated. Very encouraging experimental results were obtained.
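
Generating such syllable-level features from a recognized syllable sequence is straightforward, as the sketch below shows: overlapping segments of one to N syllables, plus syllable pairs separated by up to a few intervening syllables. Lattice-based extraction and fusion with word- and character-level features are not shown.

```python
# Syllable-level indexing features: overlapping segments plus gapped pairs.
def syllable_features(syls, max_seg=3, max_gap=2):
    feats = []
    for n in range(1, max_seg + 1):                     # overlapping segments S(n)
        feats += ["+".join(syls[i:i + n]) for i in range(len(syls) - n + 1)]
    for gap in range(1, max_gap + 1):                   # pairs P(gap) with gaps
        feats += [f"{syls[i]}..{syls[i + gap + 1]}"
                  for i in range(len(syls) - gap - 1)]
    return feats

# Toy Mandarin syllable string (Pinyin stand-ins for base syllables).
print(syllable_features(["zhong", "guo", "jing", "ji", "fa", "zhan"]))
```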

Journal ArticleDOI
TL;DR: Together, source coding, channel coding and the modified recognition engine are shown to provide good recognition accuracy over a wide range of communication channels with bit rates of 1.2 kb/s or less.
Abstract: We present a framework for developing source coding, channel coding and decoding as well as erasure concealment techniques adapted for distributed (wireless or packet-based) speech recognition. It is shown that speech recognition, as opposed to speech coding, is more sensitive to channel errors than to channel erasures, and appropriate channel coding design criteria are determined. For channel decoding, we introduce a novel technique for combining soft-decision decoding with error detection at the receiver. Frame erasure concealment techniques are used at the decoder to deal with unreliable frames. At the recognition stage, we present a technique to modify the recognition engine itself to take into account the time-varying reliability of the decoded feature after channel transmission. The resulting engine, referred to as weighted Viterbi recognition, further improves the recognition accuracy. Together, source coding, channel coding and the modified recognition engine are shown to provide good recognition accuracy over a wide range of communication channels with bit rates of 1.2 kb/s or less.
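
The weighted Viterbi idea can be shown on a toy discrete HMM: each frame's observation log-likelihood is scaled by a reliability weight in [0, 1], so frames judged unreliable after channel decoding contribute less to the path score. The HMM, observations, and weights below are invented for illustration.

```python
# Weighted Viterbi: per-frame reliability weights on the emission scores.
import numpy as np

logA = np.log(np.array([[0.6, 0.4], [0.4, 0.6]]))   # transition probabilities
logB = np.log(np.array([[0.8, 0.2], [0.2, 0.8]]))   # P(symbol | state)
obs   = [0, 0, 1, 0, 0, 0]                          # the lone "1" looks like an error
gamma = [1.0, 1.0, 0.2, 1.0, 1.0, 1.0]              # frame 2 deemed unreliable

def weighted_viterbi(obs, gamma):
    delta = gamma[0] * logB[:, obs[0]]              # weighted emission scores
    psi = []
    for t in range(1, len(obs)):
        scores = delta[:, None] + logA              # scores[i, j]: from state i to j
        psi.append(scores.argmax(axis=0))
        delta = scores.max(axis=0) + gamma[t] * logB[:, obs[t]]
    path = [int(delta.argmax())]
    for back in reversed(psi):
        path.append(int(back[path[-1]]))
    return path[::-1]

print(weighted_viterbi(obs, [1.0] * len(obs)))      # unweighted: path chases the outlier
print(weighted_viterbi(obs, gamma))                 # weighted: outlier frame discounted
```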

Journal ArticleDOI
TL;DR: In this paper, a set of techniques for estimating confidence measures using the phone recognizer output in conjunction with the word recognizer outputs is described, based on the construction of "metamodels", which generate alternative word hypotheses for an utterance.
Abstract: We describe some high-level approaches to estimating confidence scores for the words output by a speech recognizer. By "high-level" we mean that the proposed measures do not rely on decoder-specific "side information" and so should find more general applicability than measures that have been developed for specific recognizers. Our main approach is to attempt to decouple the language modeling and acoustic modeling in the recognizer in order to generate independent information from these two sources that can then be used for estimation of confidence. We isolate these two information sources by using a phone recognizer working in parallel with the word recognizer. A set of techniques for estimating confidence measures using the phone recognizer output in conjunction with the word recognizer output is described. The most effective of these techniques is based on the construction of "metamodels," which generate alternative word hypotheses for an utterance. An alternative approach requires no other recognizers or extra information for confidence estimation and is based on the notion that a word that is semantically "distant" from the other decoded words in the utterance is likely to be incorrect. We describe a method for constructing "semantic similarities" between words and hence estimating a confidence. Results for each technique are given using the UK version of the Wall Street Journal corpus.

Journal ArticleDOI
TL;DR: This paper explores packet loss recovery for automatic speech recognition (ASR) in spoken dialog systems, assuming an architecture in which a lightweight client communicates with a remote ASR server, and shows that the approach provides robust ASR performance which degrades gracefully as packet loss rates increase.
Abstract: This paper explores packet loss recovery for automatic speech recognition (ASR) in spoken dialog systems, assuming an architecture in which a lightweight client communicates with a remote ASR server. Speech is transmitted with source and channel codes optimized for the ASR application, i.e., to minimize word error rate. Unequal amounts of forward error correction, depending on the data's effect on ASR performance, are assigned to protect against packet loss. Experiments with simulated packet loss in a range of loss conditions are conducted on the DARPA Communicator (air travel information) task. Results show that the approach provides robust ASR performance which degrades gracefully as packet loss rates increase. Transmitting at 5.2 kb/s with up to 200 ms added delay leads to only a 7% relative degradation in word error rate even under extremely adverse network conditions.

Journal ArticleDOI
TL;DR: The ordinary output probability is replaced with its expected value when the addition of noise is modeled as a stochastic process, which in turn is merged with the hidden Markov model (HMM) in the Viterbi algorithm.
Abstract: This paper proposes the replacement of the ordinary output probability with its expected value if the addition of noise is modeled as a stochastic process, which in turn is merged with the hidden Markov model (HMM) in the Viterbi algorithm. This new output probability is analytically derived for the generic case of a mixture of Gaussians and can be seen as the definition of a stochastic version of the weighted Viterbi algorithm. Moreover, an analytical expression to estimate the uncertainty in noise canceling is also presented. The method is applied in combination with spectral subtraction to improve the robustness to additive noise of a text-dependent speaker verification system. Reductions as high as 30% or 40% in the error rates and improvements of 50% in the stability of the decision thresholds are reported.
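
For a single Gaussian component the expected output probability has a closed form: averaging N(x; mu, sigma^2) over x ~ N(x_hat, sigma_u^2), where sigma_u^2 is the uncertainty left after noise canceling, gives N(x_hat; mu, sigma^2 + sigma_u^2). The sketch below verifies this variance-inflation rule by Monte Carlo; extension to a mixture is a weighted sum over components.

```python
# Expected Gaussian likelihood under observation uncertainty: closed form vs MC.
import numpy as np

def gauss(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

mu, var = 1.0, 0.5            # HMM component parameters
x_hat, var_u = 0.4, 0.3       # noise-canceled feature and its uncertainty

closed_form = gauss(x_hat, mu, var + var_u)              # inflated-variance rule
samples = np.random.default_rng(0).normal(x_hat, np.sqrt(var_u), 200000)
monte_carlo = gauss(samples, mu, var).mean()
print(closed_form, monte_carlo)                          # should agree closely
```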

Journal ArticleDOI
M.J.F. Gales1
TL;DR: This paper describes two new forms of multiple subspace schemes; the problem of handling likelihood consistency between the various subspaces is dealt with by viewing the projection schemes within a maximum likelihood framework.
Abstract: The first stage in many pattern recognition tasks is to generate a good set of features from the observed data. Usually, only a single feature space is used. However, in some complex pattern recognition tasks the choice of a good feature space may vary depending on the signal content. An example is in speech recognition where phone dependent feature subspaces may be useful. Handling multiple subspaces while still maintaining meaningful likelihood comparisons between classes is a key issue. This paper describes two new forms of multiple subspace schemes. For both schemes, the problem of handling likelihood consistency between the various subspaces is dealt with by viewing the projection schemes within a maximum likelihood framework. Efficient estimation formulae for the model parameters for both schemes are derived. In addition, the computational costs for their use during recognition are given. These new projection schemes are evaluated on a large vocabulary speech recognition task in terms of performance, speed of likelihood calculation, and number of parameters.

Journal ArticleDOI
TL;DR: An adaptive nonlinear device is presented that incorporates new knowledge about two-channel echo cancellation in such a way that a pre-specified maximum misalignment is maintained while improving the perceived quality by minimizing the introduced distortion.
Abstract: We expand the knowledge regarding the problems of two-channel (or stereophonic) echo cancellation. The major difference between two-channel and single-channel echo cancellation is the problem of nonunique solutions in the two-channel case. In previous work, this nonuniqueness problem has been linked to the coherence between the two incoming audio channels. One proven solution to this problem is to distort the signals with a nonlinear device. In this work, we present new theory that gives insight into the links between: (i) coherence and level of distortion, and (ii) coherence and achievable misalignment of the stereophonic echo canceler. Furthermore, we present an adaptive nonlinear device that incorporates this new knowledge in such a way that a pre-specified maximum misalignment is maintained while improving the perceived quality by minimizing the introduced distortion. Moreover, all the ideas presented can be generalized to the multichannel (>2) case.
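
A widely used instance of such a nonlinear device adds a small half-wave-rectified component of opposite polarity to each channel. The sketch below applies it with a fixed distortion parameter and measures the coherence reduction; the paper's contribution, adapting that parameter from coherence to meet a target misalignment, is not reproduced.

```python
# Half-wave nonlinear preprocessing to reduce interchannel coherence.
import numpy as np
from scipy.signal import coherence

def nl_device(x1, x2, alpha=0.3):
    y1 = x1 + alpha * (x1 + np.abs(x1)) / 2      # positive half-wave added
    y2 = x2 + alpha * (x2 - np.abs(x2)) / 2      # negative half-wave added
    return y1, y2

rng = np.random.default_rng(0)
s = rng.standard_normal(32768)                   # common far-end source
x1 = 0.9 * s
x2 = 0.7 * s + 0.02 * rng.standard_normal(32768)
y1, y2 = nl_device(x1, x2)

f, c_before = coherence(x1, x2, nperseg=1024)
f, c_after = coherence(y1, y2, nperseg=1024)
print(f"mean coherence before: {c_before.mean():.3f}, after: {c_after.mean():.3f}")
```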

Journal ArticleDOI
TL;DR: The theory and implementation of the probabilistic union model, and a combination of the union model with conventional noise reduction techniques to deal with a mixture of stationary noise and unknown, nonstationary noise, are introduced.
Abstract: This paper introduces a new statistical approach, namely the probabilistic union model, for speech recognition involving partial, unknown frequency-band corruption. Partial frequency-band corruption accounts for the effect of a family of real-world noises. Previous methods based on the missing feature theory usually require the identity of the noisy bands. This identification can be difficult for unexpected noise with unknown, time-varying band characteristics. The new model combines the local frequency-band information based on the union of random events, to reduce the dependence of the model on information about the noise. This model partially accomplishes the target: offering robustness to partial frequency-band corruption, while requiring no information about the noise. This paper introduces the theory and implementation of the union model, and is focused on several important advances. These new developments include a new algorithm for automatic order selection, a generalization of the modeling principle to accommodate partial feature stream corruption, and a combination of the union model with conventional noise reduction techniques to deal with a mixture of stationary noise and unknown, nonstationary noise. For the evaluation, we used the TIDIGITS database for speaker-independent connected digit recognition. The utterances were corrupted by various types of additive noise, stationary or time-varying, assuming no knowledge about the noise characteristics. The results indicate that the new model offers significantly improved robustness in comparison to other models.
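
One common reading of the order-m union combination is a sum, over all ways of leaving out m possibly corrupted bands, of the product of the remaining bands' likelihoods (up to normalization). The sketch below uses that form with invented per-band likelihoods; the paper's automatic order selection is not reproduced.

```python
# Union-style combination of subband likelihoods: order m leaves out m bands.
import numpy as np
from itertools import combinations

def union_score(band_likelihoods, m):
    n = len(band_likelihoods)
    return sum(np.prod([band_likelihoods[i] for i in subset])
               for subset in combinations(range(n), n - m))

clean = [0.9, 0.8, 0.85, 0.9]            # per-band likelihoods, correct model
corrupt = [0.9, 0.8, 0.85, 1e-4]         # one band hit by unexpected noise
for m in (0, 1):
    print(f"order {m}: clean={union_score(clean, m):.4f} "
          f"corrupt={union_score(corrupt, m):.4f}")
# With m=1 the corrupted band no longer forces the whole score toward zero.
```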

Journal ArticleDOI
TL;DR: A fast, vocabulary-independent algorithm for spotting words in speech is presented that has wide-ranging use in distributed and pervasive speech recognition applications such as audio indexing, spoken message retrieval, and video browsing.
Abstract: In this paper, we present a fast, vocabulary-independent algorithm for spotting words in speech. The algorithm consists of a phone-ngram representation (indexing) stage and a coarse-to-detailed search stage for spotting a word/phone sequence in speech. The phone-ngram representation stage provides a phoneme-level representation of the speech that can be searched efficiently. We present a novel method for phoneme recognition using a vocabulary prefix tree to guide the creation of the phone-ngram index. The coarse search, consisting of phone-ngram matching, identifies regions of speech as putative word hits. The detailed acoustic match is then conducted only at the putative hits identified in the coarse match. This gives us vocabulary independence and the desired accuracy and speed in wordspotting. Current lattice-based phoneme-matching algorithms are similar to the coarse-match step of our algorithm. We show that our combined algorithm gives a factor of two improvement over the coarse match. The algorithm has wide-ranging use in distributed and pervasive speech recognition applications such as audio indexing, spoken message retrieval, and video browsing.
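
The coarse-match stage can be sketched as an inverted index of phone n-grams: index the recognized phone stream by trigram positions, then vote for candidate start positions that share trigrams with the query's phone sequence. The prefix-tree-guided phoneme recognition and detailed acoustic verification are omitted.

```python
# Coarse wordspotting: phone-trigram inverted index plus position voting.
from collections import defaultdict

def build_index(phones, n=3):
    index = defaultdict(list)
    for i in range(len(phones) - n + 1):
        index[tuple(phones[i:i + n])].append(i)   # ngram -> positions
    return index

def coarse_match(index, query_phones, n=3, min_hits=1):
    votes = defaultdict(int)
    for i in range(len(query_phones) - n + 1):
        for pos in index.get(tuple(query_phones[i:i + n]), []):
            votes[pos - i] += 1                   # vote for implied start position
    return sorted((s for s, v in votes.items() if v >= min_hits),
                  key=lambda s: -votes[s])        # putative hits, best first

stream = "sil dh ax k ae t s ae t sil aa n dh ax m ae t sil".split()
print(coarse_match(build_index(stream), "k ae t".split()))
```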

Journal ArticleDOI
TL;DR: A new auditory-based speech processing system built on the biologically rooted property of average localized synchrony detection (ALSD) detects periodicity in the speech signal at Bark-scaled frequencies while reducing spurious peaks and sensitivity to implementation mismatches in the response, and hence presents a consistent and robust representation of the formants.
Abstract: A new auditory-based speech processing system based on the biologically rooted property of the average localized synchrony detection (ALSD) is proposed. The system detects periodicity in the speech signal at Bark-scaled frequencies while reducing the response's spurious peaks and sensitivity to implementation mismatches, and hence presents a consistent and robust representation of the formants. The system is evaluated for its formant extraction ability while reducing spurious peaks. It is compared with other auditory-based and traditional systems in the tasks of vowel and consonant recognition on clean speech from the TIMIT database and in the presence of noise. The results illustrate the advantage of the ALSD system in extracting the formants and reducing the spurious peaks. They also indicate the superiority of the synchrony measures over the mean-rate in the presence of noise.

Journal ArticleDOI
TL;DR: The issues and techniques explored are improving robustness and efficiency of the front-end, using multiple microphones for removing extraneous signals from speech via a new multichannel CDCN technique, reducing computation via silence detection, and applying the Bayesian information criterion (BIC) to build smaller and better acoustic models.
Abstract: This paper describes a robust, accurate, efficient, low-resource, medium-vocabulary, grammar-based speech recognition system using hidden Markov models for mobile applications. Among the issues and techniques we explore are improving robustness and efficiency of the front-end, using multiple microphones for removing extraneous signals from speech via a new multichannel CDCN technique, reducing computation via silence detection, applying the Bayesian information criterion (BIC) to build smaller and better acoustic models, minimizing finite state grammars, using hybrid maximum likelihood and discriminative models, and automatically generating baseforms from single new-word utterances.

Journal ArticleDOI
TL;DR: It is demonstrated that a genetic algorithm can efficiently generate accurate low-order pole-zero approximations of head-related transfer functions (HRTFs) from measured impulse responses by minimizing a logarithmic error criterion.
Abstract: We demonstrate that a genetic algorithm (GA) can efficiently generate accurate low-order pole-zero approximations of head-related transfer functions (HRTFs) from measured impulse responses by minimizing a logarithmic error criterion. This approach is much simpler and comparable or superior in efficiency to competing search algorithms. We build on previous work in low-order HRTF approximation. By applying the GA, we converge to solutions of equal quality in about 30 s compared to over 20 min. This work develops a basic steady-state GA using a pole-zero filter design problem as an illustrative example. We propose a domain-appropriate error measure. We then apply the algorithm to designing filters to approximate measured HRTFs. Detailed performance measurements are presented. In the appendix, we propose a widely applicable population variation metric. A lower bound is developed for this metric and is used to detect convergence.