
Showing papers in "IEEE Transactions on Speech and Audio Processing in 2005"


Journal ArticleDOI
TL;DR: This paper explores the detection of domain-specific emotions using language and discourse information in conjunction with acoustic correlates of emotion in speech signals, with a case study of detecting negative and non-negative emotions in spoken language data obtained from a call center application.
Abstract: The importance of automatically recognizing emotions from human speech has grown with the increasing role of spoken language interfaces in human-computer interaction applications. This paper explores the detection of domain-specific emotions using language and discourse information in conjunction with acoustic correlates of emotion in speech signals. The specific focus is on a case study of detecting negative and non-negative emotions using spoken language data obtained from a call center application. Most previous studies in emotion recognition have used only the acoustic information contained in speech. In this paper, a combination of three sources of information-acoustic, lexical, and discourse-is used for emotion recognition. To capture emotion information at the language level, an information-theoretic notion of emotional salience is introduced. Optimization of the acoustic correlates of emotion with respect to classification error was accomplished by investigating different feature sets obtained from feature selection, followed by principal component analysis. Experimental results on our call center data show that the best results are obtained when acoustic and language information are combined. Results show that combining all the information, rather than using only acoustic information, improves emotion classification by 40.7% for males and 36.4% for females (linear discriminant classifier used for acoustic information).

959 citations
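For illustration, the paper's language-level cue can be sketched with a small information-theoretic salience computation: a word is salient for emotion recognition if its posterior over emotion classes differs strongly from the class prior. This is a minimal sketch assuming word/emotion co-occurrence counts are available; the variable names and smoothing are illustrative, not taken from the paper.

```python
import numpy as np

def emotional_salience(counts):
    """Information-theoretic salience of each word for a set of emotion classes.

    counts: (n_words, n_classes) word/emotion co-occurrence counts.
    Returns an (n_words,) array; larger values mean the word is more
    informative about the emotion class.
    """
    counts = np.asarray(counts, dtype=float) + 1e-12          # avoid log(0)
    p_class = counts.sum(axis=0) / counts.sum()               # P(e_k)
    p_class_given_word = counts / counts.sum(axis=1, keepdims=True)  # P(e_k | w)
    # KL divergence between P(e | w) and the class prior P(e)
    return np.sum(p_class_given_word * np.log(p_class_given_word / p_class), axis=1)

# Toy usage with two words and two classes (negative / non-negative):
counts = np.array([[30, 2],     # word almost always co-occurs with "negative"
                   [10, 11]])   # word is roughly neutral
print(emotional_salience(counts))   # first word gets the larger salience
```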


Journal ArticleDOI
TL;DR: Methods based on explicitly predefined signal features (the signal's amplitude envelope, spectral magnitudes and phases, and time-frequency representations) and methods based on probabilistic signal models are discussed.
Abstract: Note onset detection and localization is useful in a number of analysis and indexing techniques for musical signals. The usual way to detect onsets is to look for "transient" regions in the signal, a notion that leads to many definitions: a sudden burst of energy, a change in the short-time spectrum of the signal or in the statistical properties, etc. The goal of this paper is to review, categorize, and compare some of the most commonly used techniques for onset detection, and to present possible enhancements. We discuss methods based on the use of explicitly predefined signal features: the signal's amplitude envelope, spectral magnitudes and phases, time-frequency representations; and methods based on probabilistic signal models: model-based change point detection, surprise signals, etc. Using a choice of test cases, we provide some guidelines for choosing the appropriate method for a given application.

802 citations
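One of the predefined-feature detectors covered by the review, a spectral-magnitude change detector, can be sketched as half-wave-rectified spectral flux followed by simple peak picking. This is a generic illustration of that family of methods, not the paper's exact formulation; the frame sizes and threshold are arbitrary choices.

```python
import numpy as np

def spectral_flux_onsets(x, sr, n_fft=1024, hop=512, delta=0.1):
    """Tiny onset detector: half-wave-rectified spectral flux + peak picking."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, axis=1))
    # Sum only the positive magnitude increases between consecutive frames
    flux = np.maximum(mag[1:] - mag[:-1], 0.0).sum(axis=1)
    flux /= flux.max() + 1e-12
    peaks = [t for t in range(1, len(flux) - 1)
             if flux[t] > flux[t - 1] and flux[t] >= flux[t + 1] and flux[t] > delta]
    return (np.array(peaks) + 1) * hop / sr        # onset times in seconds

# Toy usage: a click every 0.5 s should yield onset times near multiples of 0.5 s
sr = 16000
x = np.zeros(2 * sr)
x[::sr // 2] = 1.0
print(spectral_flux_onsets(x, sr))
```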


Journal ArticleDOI
TL;DR: This work derives an exact solution to the problem of maximum likelihood estimation of the supervector covariance matrix used in extended MAP (or EMAP) speaker adaptation and shows how it can be regarded as a new method of eigenvoice estimation.
Abstract: We derive an exact solution to the problem of maximum likelihood estimation of the supervector covariance matrix used in extended MAP (or EMAP) speaker adaptation and show how it can be regarded as a new method of eigenvoice estimation. Unlike other approaches to the problem of estimating eigenvoices in situations where speaker-dependent training is not feasible, our method enables us to estimate as many eigenvoices from a given training set as there are training speakers. In the limit as the amount of training data for each speaker tends to infinity, it is equivalent to cluster adaptive training.

523 citations


Journal ArticleDOI
TL;DR: Alternative spatial sampling schemes for the positioning of microphones on a sphere are presented, and the errors introduced by a finite number of microphones, spatial aliasing, inaccuracies in microphone positioning, and measurement noise are investigated both theoretically and by using simulations.
Abstract: Spherical microphone arrays have been recently studied for sound-field recordings, beamforming, and sound-field analysis which use spherical harmonics in the design. Although the microphone arrays and the associated algorithms were presented, no comprehensive theoretical analysis of performance was provided. This work presents a spherical-harmonics-based design and analysis framework for spherical microphone arrays. In particular, alternative spatial sampling schemes for the positioning of microphones on a sphere are presented, and the errors introduced by a finite number of microphones, spatial aliasing, inaccuracies in microphone positioning, and measurement noise are investigated both theoretically and by using simulations. The analysis framework can also provide a useful guide for the design and analysis of more general spherical microphone arrays which do not use spherical harmonics explicitly.

522 citations


Journal ArticleDOI
TL;DR: Compared to algorithms based on the Gaussian assumption, such as the Wiener filter or the Ephraim and Malah (1984) MMSE short-time spectral amplitude estimator, the estimators based on these supergaussian densities deliver an improved signal-to-noise ratio.
Abstract: This paper presents a class of minimum mean-square error (MMSE) estimators for enhancing short-time spectral coefficients of a noisy speech signal. In contrast to most of the presently used methods, we do not assume that the spectral coefficients of the noise or of the clean speech signal obey a (complex) Gaussian probability density. We derive analytical solutions to the problem of estimating discrete Fourier transform (DFT) coefficients in the MMSE sense when the prior probability density function of the clean speech DFT coefficients can be modeled by a complex Laplace or by a complex bilateral Gamma density. The probability density function of the noise DFT coefficients may be modeled either by a complex Gaussian or by a complex Laplacian density. Compared to algorithms based on the Gaussian assumption, such as the Wiener filter or the Ephraim and Malah (1984) MMSE short-time spectral amplitude estimator, the estimators based on these supergaussian densities deliver an improved signal-to-noise ratio.

352 citations
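Whatever prior is assumed, the MMSE estimate of a clean DFT coefficient S_k from the noisy coefficient Y_k is its conditional mean; the paper's contribution is evaluating this integral in closed form under Laplacian and bilateral Gamma speech priors with Gaussian or Laplacian noise. A generic statement of the estimator, in notation chosen here for illustration:

```latex
\hat{S}_k \;=\; E\{S_k \mid Y_k\}
          \;=\; \frac{\int s \, p(Y_k \mid s)\, p(s)\, ds}{\int p(Y_k \mid s)\, p(s)\, ds}
```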


Journal ArticleDOI
TL;DR: The paper describes a method of compensating for nonlinear distortions in the speech representation caused by noise, shows how it can be applied to robust speech recognition, and compares it with other compensation techniques.
Abstract: This paper describes a method of compensating for nonlinear distortions in speech representation caused by noise. The method described here is based on the histogram equalization method often used in digital image processing. Histogram equalization is applied to each component of the feature vector in order to improve the robustness of speech recognition systems. The paper describes how the proposed method can be applied to robust speech recognition and it is compared with other compensation techniques. The recognition experiments, including results in the AURORA II framework, demonstrate the effectiveness of histogram equalization when it is applied either alone or in combination with other compensation techniques.

332 citations
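A minimal sketch of the per-component equalization step: each feature dimension is passed through its empirical CDF and mapped onto a reference distribution. A standard Gaussian target is assumed here for illustration; the choice of reference and the recognizer-side evaluation are not taken from the paper.

```python
import numpy as np
from scipy.stats import norm

def histogram_equalize(features, ref_ppf=norm.ppf):
    """Map each feature component through its empirical CDF onto a reference
    distribution (standard Gaussian by default). features: (n_frames, n_dims)."""
    out = np.empty(features.shape, dtype=float)
    n = features.shape[0]
    for d in range(features.shape[1]):
        order = np.argsort(features[:, d])
        ranks = np.empty(n)
        ranks[order] = np.arange(1, n + 1)
        cdf = ranks / (n + 1)            # empirical CDF values in (0, 1)
        out[:, d] = ref_ppf(cdf)         # inverse CDF of the reference
    return out

# Usage: skewed "noisy cepstra" become approximately Gaussian per dimension
noisy = np.random.gamma(shape=2.0, scale=1.0, size=(500, 13))
equalized = histogram_equalize(noisy)
```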


Journal ArticleDOI
TL;DR: A general broadband approach to blind source separation (BSS) for convolutive mixtures based on second-order statistics is presented and constraints are obtained which provide a deeper understanding of the internal permutation problem in traditional narrowband frequency-domain BSS.
Abstract: We present a general broadband approach to blind source separation (BSS) for convolutive mixtures based on second-order statistics. This avoids several known limitations of the conventional narrowband approximation, such as the internal permutation problem. In contrast to traditional narrowband approaches, the new framework simultaneously exploits the nonwhiteness property and nonstationarity property of the source signals. Using a novel matrix formulation, we rigorously derive the corresponding time-domain and frequency-domain broadband algorithms by generalizing a known cost function which inherently allows joint optimization for several time-lags of the correlations. Based on the broadband approach, time-domain constraints are obtained which provide a deeper understanding of the internal permutation problem in traditional narrowband frequency-domain BSS. For both the time-domain and the frequency-domain versions, we discuss links to well-known and also to novel algorithms that constitute special cases. Moreover, using the so-called generalized coherence, links between the time-domain and the frequency-domain algorithms can be established, showing that our cost function leads to an update equation with an inherent normalization ensuring a robust adaptation behavior. The concept is applicable to offline, online, and block-online algorithms by introducing a general weighting function allowing for tracking of time-varying real acoustic environments.

285 citations


Journal ArticleDOI
TL;DR: Bayesian estimators of the short-time spectral magnitude of speech based on perceptually motivated cost functions are proposed, using variants of speech distortion measures, such as the Itakura-Saito and weighted likelihood-ratio distortion measures, which have been used successfully in speech recognition.
Abstract: The traditional minimum mean-square error (MMSE) estimator of the short-time spectral amplitude is based on the minimization of the Bayesian squared-error cost function. The squared-error cost function, however, is not subjectively meaningful in that it does not necessarily produce estimators that emphasize spectral peak (formants) information or estimators which take into account auditory masking effects. To overcome the shortcomings of the MMSE estimator, we propose in this paper Bayesian estimators of the short-time spectral magnitude of speech based on perceptually motivated cost functions. In particular, we use variants of speech distortion measures, such as the Itakura-Saito and weighted likelihood-ratio distortion measures, which have been used successfully in speech recognition. Three classes of Bayesian estimators of the speech magnitude spectrum are derived. The first class of estimators emphasizes spectral peak information, the second class uses a weighted-Euclidean cost function that implicitly takes into account auditory masking effects, and the third class of estimators is designed to penalize spectral attenuation. Of the three classes of Bayesian estimators, the estimators that implicitly take into account auditory masking effects performed the best in terms of having less residual noise and better speech quality.

278 citations
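All three classes of estimators follow the same Bayesian recipe: replace the squared error with a perceptually motivated distortion measure d and minimize its posterior expectation. In generic notation (chosen here, not the paper's):

```latex
\hat{X}_k \;=\; \arg\min_{\hat{x}} \; E\{\, d(X_k, \hat{x}) \mid Y_k \,\}
          \;=\; \arg\min_{\hat{x}} \int d(x, \hat{x})\, p(x \mid Y_k)\, dx
```

With $d(x,\hat{x}) = (x-\hat{x})^2$ this reduces to the traditional MMSE amplitude estimator.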


Journal ArticleDOI
TL;DR: In this paper, the Jacobian determinant of the transformation matrix is computed analytically for three typical warping functions, and it is shown that the matrices are diagonally dominant and thus can be approximated by quindiagonal matrices.
Abstract: Vocal tract normalization (VTN) is a widely used speaker normalization technique which reduces the effect of different lengths of the human vocal tract and results in an improved recognition accuracy of automatic speech recognition systems. We show that VTN results in a linear transformation in the cepstral domain, although VTN and linear cepstral transformations have so far been considered independent approaches to speaker normalization. We are now able to compute the Jacobian determinant of the transformation matrix, which allows the normalization of the probability distributions used in speaker normalization for automatic speech recognition. We show that VTN can be viewed as a special case of Maximum Likelihood Linear Regression (MLLR). Consequently, we can explain previous experimental results showing that improvements obtained by VTN and subsequent MLLR are not additive in some cases. For three typical warping functions the transformation matrix is calculated analytically, and we show that the matrices are diagonally dominant and thus can be approximated by quindiagonal matrices.

217 citations
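The role of the Jacobian determinant is the usual change-of-variables correction: if warping with factor α acts on the cepstrum vector c as a linear map, a properly normalized density for the transformed features must include the determinant of that map. The notation below is illustrative:

```latex
\tilde{c} = A_{\alpha}\, c
\quad\Longrightarrow\quad
p_{\tilde{C}}(\tilde{c}) \;=\; \bigl|\det A_{\alpha}\bigr|^{-1}\, p_{C}\!\bigl(A_{\alpha}^{-1}\tilde{c}\bigr)
```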


Journal ArticleDOI
TL;DR: This paper presents a text-independent speaker verification system using support vector machines (SVMs) with score-space kernels and introduces a technique called spherical normalization that preconditions the Hessian matrix.
Abstract: This paper presents a text-independent speaker verification system using support vector machines (SVMs) with score-space kernels. Score-space kernels generalize Fisher kernels and are based on underlying generative models such as Gaussian mixture models (GMMs). This approach provides direct discrimination between whole sequences, in contrast with the frame-level approaches at the heart of most current systems. The resultant SVMs have a very high dimensionality since it is related to the number of parameters in the underlying generative model. To address problems that arise in the resultant optimization we introduce a technique called spherical normalization that preconditions the Hessian matrix. We have performed speaker verification experiments using the PolyVar database. The SVM system presented here reduces the relative error rates by 34% compared to a GMM likelihood ratio system.

214 citations
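Score-space kernels generalize Fisher kernels, which map a variable-length utterance to the gradient of the GMM log-likelihood with respect to the model parameters. The sketch below is a simplification assuming diagonal covariances and derivatives only with respect to the component means, producing one fixed-length score vector per utterance for a linear SVM; the full score space in the paper also includes other parameter derivatives and the log-likelihood itself.

```python
import numpy as np

def fisher_score_wrt_means(X, weights, means, variances):
    """Fisher score of an utterance X (n_frames, dim) w.r.t. GMM means.

    weights: (K,), means/variances: (K, dim), diagonal covariances.
    Returns a fixed-length vector of size K*dim usable by a linear SVM.
    """
    X = np.atleast_2d(X)
    # Per-frame, per-component log densities (up to a shared constant)
    log_dens = (np.log(weights)
                - 0.5 * np.sum(np.log(variances), axis=1)
                - 0.5 * np.sum((X[:, None, :] - means) ** 2 / variances, axis=2))
    log_dens -= log_dens.max(axis=1, keepdims=True)
    post = np.exp(log_dens)
    post /= post.sum(axis=1, keepdims=True)            # responsibilities gamma_k(t)
    # d/d mu_k log p(x_t) = gamma_k(t) * (x_t - mu_k) / var_k, summed over frames
    grad = np.einsum('tk,tkd->kd', post, (X[:, None, :] - means) / variances)
    return (grad / X.shape[0]).ravel()
```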


Journal ArticleDOI
TL;DR: The unsupervised training of a speech recognizer on automatically recognized transcriptions is studied in this paper, and the effect of the confidence measure used to detect possible recognition errors is examined systematically.
Abstract: For large vocabulary continuous speech recognition systems, the amount of acoustic training data is of crucial importance. In the past, large amounts of speech were recorded from various sources and had to be transcribed manually. It is thus desirable to train a recognizer with as little manually transcribed acoustic data as possible. Since untranscribed speech is available in various forms nowadays, the unsupervised training of a speech recognizer on recognized transcriptions is studied in this paper. A low-cost recognizer trained with between one and six hours of manually transcribed speech is used to recognize 72 hours of untranscribed acoustic data. These transcriptions are then used in combination with a confidence measure to train an improved recognizer. The effect of the confidence measure which is used to detect possible recognition errors is studied systematically. Finally, the unsupervised training is applied iteratively. Starting with only one hour of transcribed acoustic data, a recognition system is trained fully automatically. With this iterative training procedure, the word error rates are reduced from 71.3% to 38.3% on the Broadcast News'96 evaluation test set and from 65.6% to 29.3% on the Broadcast News'98 evaluation test set. In comparison with an optimized system trained with the manually generated transcriptions of the complete 72-hour training corpus, the word error rates increase by 14.3% relative and 18.6% relative, respectively.
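A minimal sketch of the data-selection step, assuming the recognizer returns word-level confidence scores: automatically recognized utterances are kept for retraining only if enough of their words are confidently recognized. The function and threshold names are illustrative, not the paper's.

```python
def select_for_training(hypotheses, word_threshold=0.7, min_kept_fraction=0.5):
    """hypotheses: list of utterances, each a list of (word, confidence) pairs.

    Returns transcriptions judged reliable enough for unsupervised retraining.
    """
    selected = []
    for utt in hypotheses:
        confident = [w for w, conf in utt if conf >= word_threshold]
        if utt and len(confident) / len(utt) >= min_kept_fraction:
            selected.append(" ".join(w for w, _ in utt))
    return selected

# Toy usage: the low-confidence second hypothesis is discarded
hyps = [[("the", 0.9), ("meeting", 0.8), ("uh", 0.3)],
        [("noise", 0.2), ("only", 0.1)]]
print(select_for_training(hyps))
```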

Journal ArticleDOI
TL;DR: A novel algorithm to automatically determine the relative three-dimensional (3-D) positions of audio sensors and actuators in an ad-hoc distributed network of heterogeneous general purpose computing platforms such as laptops, PDAs, and tablets is presented.
Abstract: We present a novel algorithm to automatically determine the relative three-dimensional (3-D) positions of audio sensors and actuators in an ad-hoc distributed network of heterogeneous general purpose computing platforms such as laptops, PDAs, and tablets. A closed form approximate solution is derived, which is further refined by minimizing a nonlinear error function. Our formulation and solution accounts for the lack of temporal synchronization among different platforms. We compare two different estimators, one based on the time of flight and the other based on time difference of flight. We also derive an approximate expression for the mean and covariance of the implicitly defined estimator using the implicit function theorem and an approximate Taylor series expansion. The theoretical performance limits for estimating the sensor 3-D positions are derived via the Cramér-Rao bound (CRB) and analyzed with respect to the number of sensors and actuators, as well as their geometry. We report extensive simulation results and discuss the practical details of implementing our algorithms in a real-life system.

Journal ArticleDOI
TL;DR: This paper describes how to estimate the confidence score for each utterance through an on-line algorithm using the lattice output of a speech recognizer and shows that the amount of labeled data needed for a given word accuracy can be reduced by more than 60% with respect to random sampling.
Abstract: We are interested in the problem of adaptive learning in the context of automatic speech recognition (ASR). In this paper, we propose an active learning algorithm for ASR. Automatic speech recognition systems are trained using human supervision to provide transcriptions of speech utterances. The goal of active learning is to minimize the human supervision for training acoustic and language models and to maximize the performance given the transcribed and untranscribed data. Active learning aims at reducing the number of training examples to be labeled by automatically processing the unlabeled examples, and then selecting the most informative ones with respect to a given cost function for a human to label. In this paper, we describe how to estimate the confidence score for each utterance through an on-line algorithm using the lattice output of a speech recognizer. The utterance scores are filtered through the informativeness function and an optimal subset of training samples is selected. The active learning algorithm has been applied to both batch and on-line learning schemes, and we have experimented with different selective sampling algorithms. Our experiments show that by using active learning, the amount of labeled data needed for a given word accuracy can be reduced by more than 60% with respect to random sampling.
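A minimal sketch of the selective-sampling step, assuming per-utterance confidence scores derived from the recognition lattices are already available: the least confident (presumably most informative) utterances are the ones sent to a human for transcription. The selection criterion and names here are illustrative simplifications of the paper's informativeness function.

```python
import numpy as np

def select_utterances_to_label(confidences, budget):
    """Return indices of the `budget` least-confident utterances.

    confidences: per-utterance confidence scores in [0, 1], e.g. derived
    from word posteriors on the recognition lattice.
    """
    return np.argsort(np.asarray(confidences))[:budget]

scores = [0.92, 0.41, 0.77, 0.15, 0.66]
print(select_utterances_to_label(scores, budget=2))   # -> indices 3 and 1
```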

Journal ArticleDOI
TL;DR: A new technique for dynamic, frame-by-frame compensation of the Gaussian variances in the hidden Markov model (HMM), exploiting the feature variance or uncertainty estimated during the speech feature enhancement process, to improve noise-robust speech recognition.
Abstract: This paper presents a new technique for dynamic, frame-by-frame compensation of the Gaussian variances in the hidden Markov model (HMM), exploiting the feature variance or uncertainty estimated during the speech feature enhancement process, to improve noise-robust speech recognition. The new technique provides an alternative to the Bayesian predictive classification decision rule by carrying out an integration over the feature space instead of over the model-parameter space, offering a much simpler system implementation, lower computational cost, and dynamic compensation capabilities at the frame level. The computation of the feature enhancement variances is carried out using a probabilistic and parametric model of speech distortion, free from the use of any stereo training data. Dynamic compensation of the Gaussian variances in the HMM recognizer is derived, which simply enlarges the HMM Gaussian variances by the feature enhancement variances. Experimental evaluation using the full Aurora2 test data sets demonstrates a significant digit error rate reduction, averaged over all noisy and signal-to-noise-ratio conditions, compared with the baseline that did not exploit the enhancement variance information. When the true enhancement variances are used, further dramatic error rate reduction is observed, indicating the strong potential for the new technique and the strong need for high accuracy in estimating the variances associated with feature enhancement. All the results, using either the true variances of the enhanced features or the estimated ones, show that the greatest contribution to the recognizer's performance improvement is due to the use of the uncertainty for the static features, next due to the delta features, and the least due to the delta-delta features.
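The core compensation step admits a very small sketch: when an enhanced feature is scored against a Gaussian in the HMM, the Gaussian variance is enlarged by the enhancement uncertainty of that frame. Diagonal covariances are assumed here and the names are illustrative.

```python
import numpy as np

def log_gaussian_with_uncertainty(x, mean, var, enh_var):
    """log N(x; mean, var + enh_var) with diagonal covariances.

    x, mean, var: (dim,) arrays for one HMM Gaussian; enh_var: per-dimension
    variance (uncertainty) of the enhanced feature from the front end.
    """
    total_var = var + enh_var                 # frame-by-frame variance inflation
    return -0.5 * np.sum(np.log(2 * np.pi * total_var)
                         + (x - mean) ** 2 / total_var)

# With enh_var = 0 this reduces to the ordinary Gaussian log-likelihood.
```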

Journal ArticleDOI
TL;DR: A statistical model for speech enhancement that takes into account the time-correlation between successive speech spectral components is proposed and it is shown that a special case of the causal estimator degenerates to a "decision-directed" estimator with a time-varying frequency-dependent weighting factor.
Abstract: In this paper, we propose a statistical model for speech enhancement that takes into account the time-correlation between successive speech spectral components. It retains the simplicity associated with the Gaussian statistical model, and enables the extension of existing algorithms to noncausal estimation. The sequence of speech spectral variances is a random process, which is generally correlated with the sequence of speech spectral magnitudes. Causal and noncausal estimators for the a priori SNR are derived in agreement with the model assumptions and the estimation of the speech spectral components. We show that a special case of the causal estimator degenerates to a "decision-directed" estimator with a time-varying frequency-dependent weighting factor. Experimental results demonstrate the improved performance of the proposed algorithms.
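For reference, the "decision-directed" a priori SNR estimator that the causal special case reduces to has the familiar form below, where k and l index frequency and frame, lambda_d is the noise spectral variance, gamma is the a posteriori SNR, and alpha is the weighting factor (fixed in the original formulation, time-varying and frequency-dependent in the proposed model):

```latex
\hat{\xi}(k,\ell) \;=\; \alpha\,\frac{\bigl|\hat{S}(k,\ell-1)\bigr|^{2}}{\lambda_d(k,\ell)}
\;+\;(1-\alpha)\,\max\bigl\{\gamma(k,\ell)-1,\;0\bigr\}
```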

Journal ArticleDOI
TL;DR: A new frequency domain approach to blind source separation (BSS) of audio signals mixed in a reverberant environment using a joint diagonalization procedure on the cross power spectral density matrices to identify the mixing system at each frequency bin up to a scale and permutation ambiguity.
Abstract: In this paper, we propose a new frequency domain approach to blind source separation (BSS) of audio signals mixed in a reverberant environment. We propose a joint diagonalization procedure on the cross power spectral density matrices of the signals at the output of the mixing system to identify the mixing system at each frequency bin up to a scale and permutation ambiguity. The frequency domain joint diagonalization is performed using a new and quickly converging algorithm which uses an alternating least-squares (ALS) optimization method. The inverse of the mixing system is then used to separate the sources. An efficient dyadic algorithm to resolve the frequency dependent permutation ambiguities that exploits the inherent nonstationarity of the sources is presented. The effect of the unknown scaling ambiguities is partially resolved using an initialization procedure for the ALS algorithm. The performance of the proposed algorithm is demonstrated by experiments conducted in real reverberant rooms. Performance comparisons are made with previous methods.

Journal ArticleDOI
TL;DR: This paper proposes effective algorithms to automatically classify and summarize music content and shows a better performance in music classification than traditional Euclidean distance methods and hidden Markov model methods.
Abstract: Automatic music classification and summarization are very useful for music indexing, content-based music retrieval and on-line music distribution, but it is a challenge to extract the most common and salient themes from unstructured raw music data. In this paper, we propose effective algorithms to automatically classify and summarize music content. Support vector machines are applied to classify music into pure music and vocal music by learning from training data. For pure music and vocal music, separate sets of features are extracted to characterize the music content. Based on the calculated features, a clustering algorithm is applied to structure the music content. Finally, a music summary is created based on the clustering results and domain knowledge related to pure and vocal music. Support vector machine learning shows a better performance in music classification than traditional Euclidean distance methods and hidden Markov model methods. Listening tests are conducted to evaluate the quality of summarization. The experiments on different genres of pure and vocal music illustrate that the summarization results are significant and effective.

Journal ArticleDOI
TL;DR: An improved audio classification and categorization technique that makes use of wavelets and support vector machines (SVMs) to accurately classify and categorize audio data.
Abstract: In this paper, an improved audio classification and categorization technique is presented. This technique makes use of wavelets and support vector machines (SVMs) to accurately classify and categorize audio data. When a query audio is given, wavelets are first applied to extract acoustical features such as subband power and pitch information. Then, the proposed method uses a bottom-up SVM over these acoustical features and additional parameters, such as frequency cepstral coefficients, to accomplish audio classification and categorization. A public audio database (Muscle Fish), which consists of 410 sounds in 16 classes, is used to evaluate the performances of the proposed method against other similar schemes. Experimental results show that the classification errors are reduced from 16 (8.1%) to six (3.0%), and the categorization accuracy of a given audio sound can achieve 100% in the Top 2 matches.

Journal ArticleDOI
TL;DR: Evaluation results show that the proposed β-order minimum mean-square error speech enhancement approach can achieve more significant noise reduction and better spectral estimation of weak speech spectral components from a noisy signal compared to many existing speech enhancement algorithms.
Abstract: This paper proposes a β-order minimum mean-square error (MMSE) speech enhancement approach for estimating the short-time spectral amplitude (STSA) of a speech signal. We analyze the characteristics of the β-order STSA MMSE estimator and the relation between the value of β and the spectral amplitude gain function of the MMSE method. We further investigate the effectiveness of a range of fixed-β values in estimating the STSA based on the MMSE criterion, and discuss how the β value could be adapted using the frame signal-to-noise ratio (SNR). The performance of the proposed speech enhancement approach is then evaluated through spectrogram inspection, objective speech distortion measures, and subjective listening tests using several types of noise sources from the NOISEX-92 database. Evaluation results show that our approach can achieve more significant noise reduction and better spectral estimation of weak speech spectral components from a noisy signal compared to many existing speech enhancement algorithms.

Journal ArticleDOI
TL;DR: Tests performed on a large corpus of recorded meetings show classification accuracies of up to 96%, and automatic speech recognition performance close to that obtained using ground truth segmentation.
Abstract: The analysis of scenarios in which a number of microphones record the activity of speakers, such as in a round-table meeting, presents a number of computational challenges. For example, if each participant wears a microphone, speech from both the microphone's wearer (local speech) and from other participants (crosstalk) is received. The recorded audio can be broadly classified in four ways: local speech, crosstalk plus local speech, crosstalk alone and silence. We describe two experiments related to the automatic classification of audio into these four classes. The first experiment attempted to optimize a set of acoustic features for use with a Gaussian mixture model (GMM) classifier. A large set of potential acoustic features were considered, some of which have been employed in previous studies. The best-performing features were found to be kurtosis, "fundamentalness," and cross-correlation metrics. The second experiment used these features to train an ergodic hidden Markov model classifier. Tests performed on a large corpus of recorded meetings show classification accuracies of up to 96%, and automatic speech recognition performance close to that obtained using ground truth segmentation.
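Kurtosis, one of the best-performing features found, can be computed per analysis frame as the normalized fourth moment; overlapped speech from several talkers tends toward a more Gaussian (lower-kurtosis) distribution than clean single-talker speech. The frame length and hop below are illustrative choices, not the paper's settings.

```python
import numpy as np

def frame_kurtosis(x, frame_len=4000, hop=2000):
    """Excess kurtosis of each frame of signal x (e.g. sampled at 16 kHz)."""
    feats = []
    for start in range(0, len(x) - frame_len + 1, hop):
        f = x[start:start + frame_len]
        f = f - f.mean()
        m2 = np.mean(f ** 2) + 1e-12
        feats.append(np.mean(f ** 4) / m2 ** 2 - 3.0)   # 0 for a Gaussian signal
    return np.array(feats)
```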

Journal ArticleDOI
TL;DR: A novel robust feature parameter, adaptive band-partitioning spectral entropy (ABSE), is presented to successfully detect endpoints in adverse environments and is reliable in a real car.
Abstract: In speech processing, endpoint detection in noisy environments is difficult, especially in the presence of nonstationary noise. Robust endpoint detection is one of the most important areas of speech processing. Generally, the feature parameters used for endpoint detection are highly sensitive to the environment. Endpoint detection is severely degraded at low signal-to-noise ratios (SNRs) since those feature parameters cannot adequately describe the characteristics of a speech signal. As a result, this study exploits the banded structure of the speech spectrogram to distinguish speech from nonspeech, especially in adverse environments. First, this study proposes a feature parameter, called band-partitioning spectral entropy (BSE), which exploits the banded structure of the speech spectrogram. A refined adaptive band selection (RABS) method extends the adaptive band selection method proposed by Wu et al. and adaptively selects useful bands not corrupted by noise. The success of the RABS method depends strongly on on-line detection with minimal processing delay. In this paper, the RABS method is combined with the BSE parameter. Finally, a novel robust feature parameter, adaptive band-partitioning spectral entropy (ABSE), is presented to successfully detect endpoints in adverse environments. Experimental results indicate that the ABSE parameter is very effective under various noise conditions with several SNRs. Furthermore, the proposed algorithm outperforms other approaches and is reliable in a real car.
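A minimal sketch of a band-partitioned spectral entropy for a single frame: band energies are normalized into a probability-like distribution and their entropy is computed, so a peaky (formant-like) spectrum gives low entropy and broadband noise gives high entropy. The band count and FFT size are illustrative, and the adaptive band selection of the ABSE parameter is not reproduced.

```python
import numpy as np

def band_spectral_entropy(frame, n_bands=32, n_fft=512):
    """Spectral entropy over band energies rather than individual bins."""
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame)), n_fft)) ** 2
    band_energy = np.array([b.sum() for b in np.array_split(spec, n_bands)]) + 1e-12
    p = band_energy / band_energy.sum()
    return -np.sum(p * np.log(p))   # low for speech-like banded spectra
```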

Journal ArticleDOI
TL;DR: A method based on the vowel onset point (VOP) is proposed for locating the end-points of an utterance, and combining the evidence from spectral, suprasegmental, and source features seems to improve the performance of the system significantly.
Abstract: This paper proposes a text-dependent (fixed-text) speaker verification system which uses different types of information for making a decision regarding the identity claim of a speaker. The baseline system uses the dynamic time warping (DTW) technique for matching. Detection of the end-points of an utterance is crucial for the performance of DTW-based template matching. A method based on the vowel onset point (VOP) is proposed for locating the end-points of an utterance. The proposed method for speaker verification uses suprasegmental and source features in addition to spectral features. The suprasegmental features such as pitch and duration are extracted using the warping path information in the DTW algorithm. Features of the excitation source, extracted using neural network models, are also used in the text-dependent speaker verification system. Although the suprasegmental and source features individually may not yield good performance, combining the evidence from these features seems to improve the performance of the system significantly. Neural network models are used to combine the evidence from multiple sources of information.
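The baseline matching step is standard dynamic time warping; a minimal sketch of the DTW alignment cost between two feature sequences is given below (Euclidean local distance, basic step pattern). The warping path that this dynamic program defines is what the paper reuses to extract pitch and duration features, although path backtracking is omitted here.

```python
import numpy as np

def dtw_distance(a, b):
    """DTW cost between feature sequences a (n, d) and b (m, d)."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])       # local distance
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m] / (n + m)        # length-normalized alignment cost
```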

Journal ArticleDOI
TL;DR: It is deduced that spatial interference and temporal echoes can be separated and an M × N MIMO system will be converted into M SIMO systems that are free of spatial interference.
Abstract: Blind separation of independent speech sources from their convolutive mixtures in a reverberant acoustic environment is a difficult problem and the state-of-the-art blind source separation techniques are still unsatisfactory. The challenge lies in the coexistence of spatial interference from competing sources and temporal echoes due to room reverberation in the observed mixtures. Focusing only on optimizing the signal-to-interference ratio is inadequate for most if not all speech processing systems. In this paper, we deduce that spatial interference and temporal echoes can be separated and an M × N MIMO system will be converted into M SIMO systems that are free of spatial interference. Furthermore, we show that the channel matrices of these SIMO systems are irreducible if the channels from the same source in the MIMO system do not share common zeros. Thereafter we can apply the Bezout theorem to remove reverberation in those SIMO systems. Such a two-stage procedure leads to a novel sequential source separation and speech dereverberation algorithm based on blind multichannel identification. Simulations with measurements obtained in the varechoic chamber at Bell Labs demonstrate the success and robustness of the proposed algorithm in highly reverberant acoustic environments.

Journal ArticleDOI
Doh-Suk Kim
TL;DR: The proposed auditory non-intrusive quality estimation (ANIQUE) model is based on the functional roles of human auditory systems and the characteristics of human articulation systems; experimental evaluations on 35 different tests demonstrate the effectiveness of the proposed model.
Abstract: In predicting subjective quality of speech signal degraded by telecommunication networks, conventional objective models require a reference source speech signal, which is applied as an input to the network, as well as the degraded speech. Non-intrusive estimation of speech quality is a challenging problem in that only the degraded speech signal is available. Non-intrusive estimation can be used in many real applications when source speech signal is not available. In this paper, we propose a new approach for non-intrusive speech quality estimation utilizing the temporal envelope representation of speech. The proposed auditory non-intrusive quality estimation (ANIQUE) model is based on the functional roles of human auditory systems and the characteristics of human articulation systems. Experimental evaluations on 35 different tests demonstrated the effectiveness of the proposed model.

Journal ArticleDOI
TL;DR: A time-domain aperiodicity, periodicity, and pitch (APP) detector is presented that estimates the proportion of periodic and aperiodic energy in a speech signal and the pitch period of the periodic component; the periodic/aperiodic decisions made by the APP system show excellent agreement with the estimates obtained from the EGG data.
Abstract: In this paper, we present a time domain aperiodicity, periodicity, and pitch (APP) detector that estimates 1) the proportion of periodic and aperiodic energy in a speech signal and 2) the pitch period of the periodic component. The APP system is particularly useful in situations where the speech signal contains simultaneous periodic and aperiodic energy, as in the case of breathy vowels and some voiced obstruents. The performance of the APP system was evaluated on synthetic speech-like signals corrupted with noise at various levels of signal-to-noise ratio (SNR) and on three different natural speech databases that consist of simultaneously recorded electroglottograph (EGG) and acoustic data. When compared on a frame basis (at a frame rate of 2.5 ms) the results show excellent agreement between the periodic/aperiodic decisions made by the APP system and the estimates obtained from the EGG data (94.43% for periodicity and 96.32% for aperiodicity). The results also support previous studies that show that voiced obstruents are frequently manifested with either little or no aperiodic energy, or with strong periodic and aperiodic components. The EGG data were used as a reference for evaluating the pitch detection algorithm. The ground truth was not manually checked to rectify or exclude incorrect estimates. The overall gross error rate in pitch prediction across the three speech databases was 5.67%. In the case of synthetic speech-like data, the estimated SNR was found to be in close proportion to the actual SNR, and the pitch was always accurately found regardless of the presence of any shimmer or jitter.

Journal ArticleDOI
TL;DR: This study discusses a number of issues in audio stream phrase recognition and information retrieval for the new National Gallery of the Spoken Word (NGSW), proposes an overall system diagram, and discusses critical tasks associated with effective audio information retrieval.
Abstract: Advances in formulating spoken document retrieval for a new National Gallery of the Spoken Word (NGSW) are addressed. NGSW is the first large-scale repository of its kind, consisting of speeches, news broadcasts, and recordings from the 20th century. After presenting an overview of the audio stream content of the NGSW, with sample audio files from U.S. Presidents from 1893 to the present, an overall system diagram is proposed with a discussion of critical tasks associated with effective audio information retrieval. These include advanced audio segmentation, speech recognition model adaptation for acoustic background noise and speaker variability, and information retrieval using natural language processing for text query requests that include document and query expansion. For segmentation, a new evaluation criterion entitled fused error score (FES) is proposed, followed by application of the CompSeg segmentation scheme on DARPA Hub4 Broadcast News (30.5% relative improvement in FES) and NGSW data. Transcript generation is demonstrated for a six-decade portion of the NGSW corpus. Novel model adaptation using structure maximum likelihood eigenspace mapping shows a relative 21.7% improvement. Issues regarding copyright assessment and metadata construction are also addressed for the purposes of a sustainable audio collection of this magnitude. Advanced parameter-embedded watermarking is proposed with evaluations showing robustness to correlated noise attacks. Our experimental online system entitled "SpeechFind" is presented, which allows for audio retrieval from a portion of the NGSW corpus. Finally, a number of research challenges such as language modeling and lexicon for changing time periods, speaker trait and identification tracking, as well as new directions, are discussed in order to address the overall task of robust phrase searching in unrestricted audio corpora.

Journal ArticleDOI
TL;DR: This paper proposes an echo suppression algorithm that estimates the spectral envelope of the echo signal and removes the echo by spectral modification, a technique originally proposed for noise reduction, and shows that this new approach has several advantages over the traditional AEC.
Abstract: Full-duplex hands-free telecommunication systems employ an acoustic echo canceler (AEC) to remove the undesired echoes that result from the coupling between a loudspeaker and a microphone. Traditionally, the removal is achieved by modeling the echo path impulse response with an adaptive finite impulse response (FIR) filter and subtracting an echo estimate from the microphone signal. It is not uncommon that an adaptive filter with a length of 50-300 ms needs to be considered, which makes an AEC highly computationally expensive. In this paper, we propose an echo suppression algorithm to eliminate the echo effect. Instead of identifying the echo path impulse response, the proposed method estimates the spectral envelope of the echo signal. The suppression is done by spectral modification, a technique originally proposed for noise reduction. It is shown that this new approach has several advantages over the traditional AEC. Properties of human auditory perception are considered by estimating spectral envelopes according to the frequency selectivity of the auditory system, resulting in improved perceptual quality. A conventional AEC is often combined with a post-processor to reduce the residual echoes due to minor echo path changes. It is shown that the proposed algorithm is insensitive to such changes. Therefore, no post-processor is necessary. Furthermore, the new scheme is computationally much more efficient than a conventional AEC.
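A minimal sketch of the suppression step only: given a short-time magnitude estimate of the echo, the microphone spectrum is attenuated with a noise-suppression-style gain. How the echo spectral envelope is estimated (from the far-end signal, across auditory-motivated bands) is the substance of the paper and is not reproduced; the floor and overestimation factor below are illustrative.

```python
import numpy as np

def suppress_echo(mic_spec, echo_mag_est, floor=0.1, overestimate=1.0):
    """Apply a spectral-modification gain to one STFT frame.

    mic_spec:      complex STFT frame of the microphone signal
    echo_mag_est:  magnitude estimate of the echo in the same frame
    """
    mic_mag = np.abs(mic_spec) + 1e-12
    gain = 1.0 - overestimate * echo_mag_est / mic_mag
    gain = np.clip(gain, floor, 1.0)      # spectral floor limits musical artifacts
    return gain * mic_spec
```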

Journal ArticleDOI
TL;DR: This work investigates connections between BSS and ideal beamforming, which lead to a permutation alignment scheme based on microphone array directivity patterns, and proposes a multistage algorithm that aligns the unmixing filter permutations without sacrificing spectral resolution.
Abstract: Acoustic reverberation severely limits the performance of multiple microphone blind speech separation (BSS) methods. We show that the limited performance is due to random permutations of the unmixing filters over frequency. This problem, which we refer to as permutation inconsistency, becomes worse as the length of the room impulse response increases. We explore interesting connections between BSS and ideal beamforming, which leads us to propose a permutation alignment scheme based on microphone array directivity patterns. Given that the permutations are properly aligned, we show that the blind speech separation method outperforms the nonblind beamformer in a highly reverberant environment. Furthermore, we discover the tradeoff where permutations can be aligned by affording a loss in spectral resolution of the unmixing filters. We then propose a multistage algorithm, which aligns the unmixing filter permutations without sacrificing the spectral resolution. For our study, we perform experiments in both real and simulated environments and compare the results to the ideal performance benchmarks that we derive using prior knowledge of the mixing filters.

Journal ArticleDOI
TL;DR: Clear improvements in speech/nonspeech discrimination accuracy demonstrate the effectiveness of the proposed VAD, and an increase of the OSF order leads to a better separation of the speech and noise distributions, thus allowing more effective discrimination and a tradeoff between complexity and performance.
Abstract: An effective voice activity detection (VAD) algorithm is proposed for improving speech recognition performance in noisy environments. The approach is based on the determination of the speech/nonspeech divergence by means of specialized order statistics filters (OSFs) working on the subband log-energies. This algorithm differs from many others in the way the decision rule is formulated. Instead of making the decision based on the current frame alone, it applies OSFs to the subband log-energies, which significantly reduces the error probability when discriminating speech from nonspeech in a noisy signal. Clear improvements in speech/nonspeech discrimination accuracy demonstrate the effectiveness of the proposed VAD. It is shown that an increase of the OSF order leads to a better separation of the speech and noise distributions, thus allowing more effective discrimination and a tradeoff between complexity and performance. The algorithm also incorporates a noise reduction block working in tandem with the VAD, which is shown to further improve its accuracy in detecting speech and nonspeech. The experimental analysis carried out on the AURORA databases and tasks provides an extensive performance evaluation together with an exhaustive comparison to standard VADs, such as ITU G.729, GSM AMR, and ETSI AFE for distributed speech recognition (DSR), and other recently reported VADs.
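A minimal sketch of the decision statistic: subband log-energies are smoothed over a window of frames with an order statistics filter (a percentile; the median is used below) and the average excess over a noise estimate is thresholded. The band count, window length, and threshold are illustrative, and the noise-estimate update is omitted.

```python
import numpy as np

def subband_log_energies(frame, n_bands=8, n_fft=256):
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame)), n_fft)) ** 2
    return np.log(np.array([b.sum() for b in np.array_split(spec, n_bands)]) + 1e-12)

def vad_decision(log_e_frames, noise_log_e, win=9, percentile=50, threshold=2.0):
    """log_e_frames: (n_frames, n_bands) subband log-energies up to the current
    frame; noise_log_e: (n_bands,) noise estimate. Returns True for speech."""
    osf = np.percentile(log_e_frames[-win:], percentile, axis=0)  # order statistics filter
    divergence = np.mean(osf - noise_log_e)      # average subband log-energy excess
    return divergence > threshold
```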

Journal ArticleDOI
TL;DR: An efficient approach for unsupervised audio stream segmentation and clustering via the Bayesian Information Criterion (BIC) is proposed, which is particularly successful for short segment turns of less than 2 s in duration.
Abstract: In many speech and audio applications, it is first necessary to partition and classify acoustic events prior to voice coding for communication or speech recognition for spoken document retrieval. In this paper, we propose an efficient approach for unsupervised audio stream segmentation and clustering via the Bayesian Information Criterion (BIC). The proposed method extends an earlier formulation by Chen and Gopalakrishnan. In our formulation, Hotelling's T² statistic is used to pre-select candidate segmentation boundaries, followed by BIC to perform the segmentation decision. The proposed algorithm also incorporates a variable-size increasing window scheme and a skip-frame test. Our experiments show that we can improve the final algorithm speed by a factor of 100 compared to that of Chen and Gopalakrishnan's while achieving a 6.7% reduction in the acoustic boundary miss rate at the expense of a 5.7% increase in false alarm rate using DARPA Hub4 1997 evaluation data. The approach is particularly successful for short segment turns of less than 2 s in duration. The results suggest that the proposed algorithm is sufficiently effective and efficient for audio stream segmentation applications.
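A minimal sketch of the ΔBIC test for one candidate boundary inside a window of feature vectors, using full-covariance Gaussians: a positive value favors splitting the window at that point. The T²-based pre-selection, the variable-size increasing window, and the skip-frame test are not reproduced, and each side of the boundary is assumed to contain more frames than feature dimensions.

```python
import numpy as np

def delta_bic(X, t, lam=1.0):
    """Delta-BIC for a boundary after frame t of the window X (N, d)."""
    N, d = X.shape

    def logdet_cov(Z):
        return np.linalg.slogdet(np.cov(Z, rowvar=False, bias=True))[1]

    penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(N)
    return (0.5 * N * logdet_cov(X)              # one Gaussian for the whole window
            - 0.5 * t * logdet_cov(X[:t])        # versus one per side of the boundary
            - 0.5 * (N - t) * logdet_cov(X[t:])
            - penalty)                           # model-complexity penalty
```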