Showing papers on "Speech enhancement" published in 2007


Book
07 Jun 2007
TL;DR: Clear and concise, this book explores how human listeners compensate for acoustic noise in noisy environments and suggests steps that can be taken to realize the full potential of these algorithms under realistic conditions.
Abstract: With the proliferation of mobile devices and hearing devices, including hearing aids and cochlear implants, there is a growing and pressing need to design algorithms that can improve speech intelligibility without sacrificing quality. Responding to this need, Speech Enhancement: Theory and Practice, Second Edition introduces readers to the basic problems of speech enhancement and the various algorithms proposed to solve these problems. Updated and expanded, this second edition of the bestselling textbook broadens its scope to include evaluation measures and enhancement algorithms aimed at improving speech intelligibility.

Fundamentals, Algorithms, Evaluation, and Future Steps
Organized into four parts, the book begins with a review of the fundamentals needed to understand and design better speech enhancement algorithms. The second part describes all the major enhancement algorithms and, because these require an estimate of the noise spectrum, also covers noise estimation algorithms. The third part of the book looks at the measures used to assess the performance, in terms of speech quality and intelligibility, of speech enhancement methods. It also evaluates and compares several of the algorithms. The fourth part presents binary mask algorithms for improving speech intelligibility under ideal conditions. In addition, it suggests steps that can be taken to realize the full potential of these algorithms under realistic conditions.

What's New in This Edition
- Updates in every chapter
- A new chapter on objective speech intelligibility measures
- A new chapter on algorithms for improving speech intelligibility
- Real-world noise recordings (on accompanying CD)
- MATLAB code for the implementation of intelligibility measures (on accompanying CD)
- MATLAB and C/C++ code for the implementation of algorithms to improve speech intelligibility (on accompanying CD)

Valuable Insights from a Pioneer in Speech Enhancement
Clear and concise, this book explores how human listeners compensate for acoustic noise in noisy environments. Written by a pioneer in speech enhancement and noise reduction in cochlear implants, it is an essential resource for anyone who wants to implement or incorporate the latest speech enhancement algorithms to improve the quality and intelligibility of speech degraded by noise.

Includes a CD with Code and Recordings
The accompanying CD provides MATLAB implementations of representative speech enhancement algorithms as well as speech and noise databases for the evaluation of enhancement algorithms.

2,269 citations


Journal ArticleDOI
TL;DR: A noisy speech corpus suitable for the evaluation of speech enhancement algorithms is developed, encompassing four classes of algorithms: spectral subtractive, subspace, statistical-model based, and Wiener-type algorithms.

634 citations


Journal ArticleDOI
TL;DR: In this paper, the authors derived minimum mean-square error estimators of speech DFT coefficient magnitudes as well as of complex-valued DFT coefficients based on two classes of generalized gamma distributions, under an additive Gaussian noise assumption.
Abstract: This paper considers techniques for single-channel speech enhancement based on the discrete Fourier transform (DFT). Specifically, we derive minimum mean-square error (MMSE) estimators of speech DFT coefficient magnitudes as well as of complex-valued DFT coefficients based on two classes of generalized gamma distributions, under an additive Gaussian noise assumption. The resulting generalized DFT magnitude estimator has as a special case the existing scheme based on a Rayleigh speech prior, while the complex DFT estimators generalize existing schemes based on Gaussian, Laplacian, and Gamma speech priors. Extensive simulation experiments with speech signals degraded by various additive noise sources verify that significant improvements are possible with the more recent estimators based on super-Gaussian priors. The increase in perceptual evaluation of speech quality (PESQ) over the noisy signals is about 0.5 points for street noise and about 1 point for white noise, nearly independent of input signal-to-noise ratio (SNR). The assumptions made for deriving the complex DFT estimators are less accurate than those for the magnitude estimators, leading to a higher maximum achievable speech quality with the magnitude estimators.
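For orientation, the special case that the paper generalizes (the classical MMSE short-time spectral amplitude gain under a Rayleigh speech prior) can be sketched as below. This is a minimal illustration with my own variable names, not the paper's generalized gamma estimators.

```python
import numpy as np
from scipy.special import i0e, i1e  # exponentially scaled modified Bessel functions

def mmse_stsa_gain(xi, gamma):
    """Per-bin gain; xi = a priori SNR, gamma = a posteriori SNR (both > 0)."""
    gamma = np.maximum(gamma, 1e-12)
    v = xi / (1.0 + xi) * gamma
    # exp(-v/2) * I0(v/2) == i0e(v/2) for v >= 0, which avoids overflow for large v
    return (np.sqrt(np.pi) / 2.0) * (np.sqrt(v) / gamma) * (
        (1.0 + v) * i0e(v / 2.0) + v * i1e(v / 2.0))

# usage sketch: enhanced_mag = mmse_stsa_gain(xi, gamma) * np.abs(noisy_dft)
```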

293 citations


Book ChapterDOI
01 Jun 2007
TL;DR: This chapter gives a comprehensive overview of the main challenges in voice activity detection, the different solutions that have been reported in a complete review of the state of the art, and the evaluation frameworks that are normally used.
Abstract: An important drawback affecting most speech processing systems is environmental noise and its harmful effect on system performance. Examples of such systems are the new wireless communications voice services or digital hearing aid devices. In speech recognition, there are still technical barriers inhibiting such systems from meeting the demands of modern applications. Numerous noise reduction techniques have been developed to palliate the effect of noise on system performance, and they often require an estimate of the noise statistics obtained by means of a precise voice activity detector (VAD). Speech/non-speech detection is an unsolved problem in speech processing and affects numerous applications including robust speech recognition (Karray and Martin, 2003; Ramirez et al., 2003), discontinuous transmission (ITU, 1996; ETSI, 1999), real-time speech transmission on the Internet (Sangwan et al., 2002), and combined noise reduction and echo cancellation schemes in the context of telephony (Basbug et al., 2004; Gustafsson et al., 2002). The speech/non-speech classification task is not as trivial as it appears, and most VAD algorithms fail when the level of background noise increases. During the last decade, numerous researchers have developed different strategies for detecting speech in a noisy signal (Sohn et al., 1999; Cho and Kondoz, 2001; Gazor and Zhang, 2003; Armani et al., 2003) and have evaluated the influence of VAD effectiveness on the performance of speech processing systems (Bouquin-Jeannes and Faucon, 1995). Most of the approaches have focussed on the development of robust algorithms, with special attention paid to the derivation and study of noise-robust features and decision rules (Woo et al., 2000; Li et al., 2002; Marzinzik and Kollmeier, 2002). The different VAD methods include those based on energy thresholds (Woo et al., 2000), pitch detection (Chengalvarayan, 1999), spectrum analysis (Marzinzik and Kollmeier, 2002), zero-crossing rate (ITU, 1996), periodicity measures (Tucker, 1992), higher-order statistics in the LPC residual domain (Nemer et al., 2001), or combinations of different features (ITU, 1993; ETSI, 1999; Tanyer and Ozer, 2000). This chapter gives a comprehensive overview of the main challenges in voice activity detection, the different solutions that have been reported in a complete review of the state of the art, and the evaluation frameworks that are normally used. The application of VADs to speech coding, speech enhancement and robust speech recognition systems is shown and discussed. Three different VAD methods are described and compared to standardized and
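As a minimal illustration of the simplest VAD family listed above (energy thresholds), the sketch below makes a frame-by-frame speech/non-speech decision. The frame length, hop, and threshold are illustrative assumptions rather than values from the chapter.

```python
import numpy as np

def energy_vad(signal, frame_len=256, hop=128, threshold_db=-40.0):
    """Return a boolean speech/non-speech decision per frame, based on frame
    energy relative to the peak level of the signal."""
    peak = np.max(np.abs(signal)) + 1e-12
    decisions = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        energy_db = 10.0 * np.log10(np.mean(frame ** 2) / peak ** 2 + 1e-12)
        decisions.append(energy_db > threshold_db)  # speech if energy above threshold
    return np.array(decisions)
```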

256 citations


Journal ArticleDOI
TL;DR: Overall, the analysis of consonant confusion matrices suggests that in order for noise reduction algorithms to improve speech intelligibility, they need to improve the place and manner feature scores.
Abstract: The evaluation of the intelligibility of noise reduction algorithms is reported. IEEE sentences and consonants were corrupted by four types of noise including babble, car, street and train at two signal-to-noise ratio levels (0 and 5 dB), and then processed by eight speech enhancement methods encompassing four classes of algorithms: spectral subtractive, subspace, statistical-model based and Wiener-type algorithms. The enhanced speech was presented to normal-hearing listeners for identification. With the exception of a single noise condition, no algorithm produced significant improvements in speech intelligibility. Information transmission analysis of the consonant confusion matrices indicated that no algorithm significantly improved the place feature score, which is critically important for speech recognition. The algorithms that were found in previous studies to perform best in terms of overall quality were not the same algorithms that performed best in terms of speech intelligibility. The subspace algorithm, for instance, was previously found to perform the worst in terms of overall quality, but performed well in the present study in terms of preserving speech intelligibility. Overall, the analysis of consonant confusion matrices suggests that in order for noise reduction algorithms to improve speech intelligibility, they need to improve the place and manner feature scores.

251 citations


Journal ArticleDOI
TL;DR: Results suggest that native and non-native listeners apply similar strategies for speech-in-noise perception: the crucial difference is in the signal clarity required for contextual information to be effective, rather than in an inability of non- native listeners to take advantage of this contextual information per se.
Abstract: Previous research has shown that speech recognition differences between native and proficient non-native listeners emerge under suboptimal conditions. Current evidence has suggested that the key deficit that underlies this disproportionate effect of unfavorable listening conditions for non-native listeners is their less effective use of compensatory information at higher levels of processing to recover from information loss at the phoneme identification level. The present study investigated whether this non-native disadvantage could be overcome if enhancements at various levels of processing were presented in combination. Native and non-native listeners were presented with English sentences in which the final word varied in predictability and which were produced in either plain or clear speech. Results showed that, relative to the low-predictability-plain-speech baseline condition, non-native listener final word recognition improved only when both semantic and acoustic enhancements were available (high-predictability-clear-speech). In contrast, the native listeners benefited from each source of enhancement separately and in combination. These results suggest that native and non-native listeners apply similar strategies for speech-in-noise perception: the crucial difference is in the signal clarity required for contextual information to be effective, rather than in an inability of non-native listeners to take advantage of this contextual information per se.

249 citations


Patent
Carlos Avendano1
29 Jan 2007
TL;DR: In this patent, an inter-microphone level difference (ILD), obtained by dividing the energy level of the primary microphone signal by that of the secondary microphone signal, is used by a noise reduction system to attenuate noise and enhance the speech of the primary acoustic signal.
Abstract: Systems and methods for utilizing inter-microphone level differences (ILD) to attenuate noise and enhance speech are provided. In exemplary embodiments, primary and secondary acoustic signals are received by omni-directional microphones, and converted into primary and secondary electric signals. A differential microphone array module processes the electric signals to determine a cardioid primary signal and a cardioid secondary signal. The cardioid signals are filtered through a frequency analysis module which takes the signals and mimics a cochlea implementation (i.e., cochlear domain). Energy levels of the signals are then computed, and the results are processed by an ILD module using a non-linear combination to obtain the ILD. In exemplary embodiments, the non-linear combination comprises dividing the energy level associated with the primary microphone by the energy level associated with the secondary microphone. The ILD is utilized by a noise reduction system to enhance the speech of the primary acoustic signal.
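A minimal sketch of the level-difference cue described in the abstract: the per-band energy of the primary signal is divided by that of the secondary signal. The band representation and names are assumptions, not the patented implementation.

```python
import numpy as np

def inter_mic_level_difference(primary_bands, secondary_bands, eps=1e-12):
    """primary_bands, secondary_bands: arrays of per-band energy levels
    (e.g., outputs of a cochlea-like filterbank). Returns the ILD per band."""
    e1 = np.maximum(primary_bands, eps)
    e2 = np.maximum(secondary_bands, eps)
    # bands dominated by the near (primary) talker yield large ILD values
    return e1 / e2
```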

144 citations


Journal ArticleDOI
TL;DR: A Bayesian minimum mean squared error approach jointly estimates the short-term predictor parameters of speech and noise from the noisy observation using trained codebooks, and experiments indicate that the proposed scheme performs significantly better than competing methods.
Abstract: In this paper, we propose a Bayesian minimum mean squared error approach for the joint estimation of the short-term predictor parameters of speech and noise, from the noisy observation. We use trained codebooks of speech and noise linear predictive coefficients to model the a priori information required by the Bayesian scheme. In contrast to current Bayesian estimation approaches that consider the excitation variances as part of the a priori information, in the proposed method they are computed online for each short-time segment, based on the observation at hand. Consequently, the method performs well in nonstationary noise conditions. The resulting estimates of the speech and noise spectra can be used in a Wiener filter or any state-of-the-art speech enhancement system. We develop both memoryless (using information from the current frame alone) and memory-based (using information from the current and previous frames) estimators. Estimation of functions of the short-term predictor parameters is also addressed, in particular one that leads to the minimum mean squared error estimate of the clean speech signal. Experiments indicate that the scheme proposed in this paper performs significantly better than competing methods

143 citations


Journal ArticleDOI
TL;DR: An extensive overview of the available estimators is presented, and a theoretical estimator is derived to experimentally assess an upper bound to the performance that can be achieved by any subspace-based method.
Abstract: The objective of this paper is threefold: (1) to provide an extensive review of signal subspace speech enhancement, (2) to derive an upper bound for the performance of these techniques, and (3) to present a comprehensive study of the potential of subspace filtering to increase the robustness of automatic speech recognisers against stationary additive noise distortions. Subspace filtering methods are based on the orthogonal decomposition of the noisy speech observation space into a signal subspace and a noise subspace. This decomposition is possible under the assumption of a low-rank model for speech, and on the availability of an estimate of the noise correlation matrix. We present an extensive overview of the available estimators, and derive a theoretical estimator to experimentally assess an upper bound to the performance that can be achieved by any subspace-based method. Automatic speech recognition (ASR) experiments with noisy data demonstrate that subspace-based speech enhancement can significantly increase the robustness of these systems in additive coloured noise environments. Optimal performance is obtained only if no explicit rank reduction of the noisy Hankel matrix is performed. Although this strategy might increase the level of the residual noise, it reduces the risk of removing essential signal information for the recogniser's back end. Finally, it is also shown that subspace filtering compares favourably to the well-known spectral subtraction technique.
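A heavily simplified sketch of the generic signal-subspace recipe reviewed in the paper, assuming white noise: estimate the clean-speech covariance, eigendecompose it, and apply Wiener-like gains in the eigen (KLT) domain. The paper's specific estimators and its Hankel-matrix variants differ in detail; the names and the gain rule below are my own assumptions.

```python
import numpy as np

def subspace_enhance(frames, noise_var, mu=1.0):
    """frames: (n_frames, frame_len) noisy speech frames; noise_var: white-noise variance."""
    Ry = frames.T @ frames / frames.shape[0]            # noisy covariance estimate
    Rx = Ry - noise_var * np.eye(Ry.shape[0])           # clean covariance under white noise
    d, U = np.linalg.eigh(Rx)
    d = np.maximum(d, 0.0)                              # discard the noise subspace
    H = U @ np.diag(d / (d + mu * noise_var)) @ U.T     # eigen-domain Wiener-like filter
    return frames @ H.T                                 # filter each frame
```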

141 citations


Proceedings ArticleDOI
15 Apr 2007
TL;DR: The major aspects of emotion recognition are addressed in view of potential applications in the field, to benchmark today's emotion recognition systems and bridge the gap between commercial interest and current performances: acted vs. spontaneous speech, realistic emotions, noise and microphone conditions, and speaker independence.
Abstract: As automatic emotion recognition based on speech matures, new challenges can be faced. We therefore address the major aspects in view of potential applications in the field, to benchmark today's emotion recognition systems and bridge the gap between commercial interest and current performances: acted vs. spontaneous speech, realistic emotions, noise and microphone conditions, and speaker independence. Three different data-sets are used: the Berlin Emotional Speech Database, the Danish Emotional Speech Database, and the spontaneous AIBO Emotion Corpus. By using different feature types such as word- or turn-based statistics, manual versus forced alignment, and optimization techniques we show how to best cope with this demanding task and how noise addition or different microphone positions affect emotion recognition.

141 citations


Journal ArticleDOI
TL;DR: Two techniques to separate out the speech signal of the speaker of interest from a mixture of speech signals are presented and can result in significant enhancement of individual speakers in mixed recordings, consistently achieving better performance than that obtained with hard binary masks.
Abstract: The problem of single-channel speaker separation attempts to extract a speech signal uttered by the speaker of interest from a signal containing a mixture of acoustic signals. Most algorithms that deal with this problem are based on masking, wherein unreliable frequency components from the mixed signal spectrogram are suppressed, and the reliable components are inverted to obtain the speech signal from speaker of interest. Most current techniques estimate this mask in a binary fashion, resulting in a hard mask. In this paper, we present two techniques to separate out the speech signal of the speaker of interest from a mixture of speech signals. One technique estimates all the spectral components of the desired speaker. The second technique estimates a soft mask that weights the frequency subbands of the mixed signal. In both cases, the speech signal of the speaker of interest is reconstructed from the complete spectral descriptions obtained. In their native form, these algorithms are computationally expensive. We also present fast factored approximations to the algorithms. Experiments reveal that the proposed algorithms can result in significant enhancement of individual speakers in mixed recordings, consistently achieving better performance than that obtained with hard binary masks.

Journal ArticleDOI
TL;DR: The experimental results in terms of signal-to-noise ratio (SNR) and segmental SNR show that soft mask filtering outperforms binary mask and Wiener filtering.
Abstract: We present an approach for separating two speech signals when only one single recording of their linear mixture is available. For this purpose, we derive a filter, which we call the soft mask filter, using minimum mean square error (MMSE) estimation of the log spectral vectors of sources given the mixture's log spectral vectors. The soft mask filter's parameters are estimated using the mean and variance of the underlying sources which are modeled using the Gaussian composite source modeling (CSM) approach. It is also shown that the binary mask filter which has been empirically and extensively used in single-channel speech separation techniques is, in fact, a simplified form of the soft mask filter. The soft mask filtering technique is compared with the binary mask and Wiener filtering approaches when the input consists of male+male, female+female, and male+female mixtures. The experimental results in terms of signal-to-noise ratio (SNR) and segmental SNR show that soft mask filtering outperforms binary mask and Wiener filtering.
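The relationship between a soft mask and its binary simplification can be illustrated with the sketch below, which uses per-bin source power estimates. The paper's actual soft mask is derived from MMSE estimation of log spectra under Gaussian composite source models, which this sketch does not reproduce; names are illustrative.

```python
import numpy as np

def soft_and_binary_masks(power_src1, power_src2, eps=1e-12):
    """power_src1/2: estimated power spectra of the two speakers (same shape)."""
    soft = power_src1 / (power_src1 + power_src2 + eps)   # weights in [0, 1]
    binary = (power_src1 > power_src2).astype(float)      # hard-limited version
    return soft, binary

# usage sketch: separated_mag = soft * np.abs(mixture_stft)
```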

Journal ArticleDOI
TL;DR: A hidden Markov model (HMM)-based speech enhancement method using explicit gain modeling is proposed; through the introduction of stochastic gain variables, energy variation in both speech and noise is explicitly modeled in a unified framework.
Abstract: Accurate modeling and estimation of speech and noise gains facilitate good performance of speech enhancement methods using data-driven prior models. In this paper, we propose a hidden Markov model (HMM)-based speech enhancement method using explicit gain modeling. Through the introduction of stochastic gain variables, energy variation in both speech and noise is explicitly modeled in a unified framework. The speech gain models the energy variations of the speech phones, typically due to differences in pronunciation and/or different vocalizations of individual speakers. The noise gain helps to improve the tracking of the time-varying energy of nonstationary noise. The expectation-maximization (EM) algorithm is used to perform offline estimation of the time-invariant model parameters. The time-varying model parameters are estimated online using the recursive EM algorithm. The proposed gain modeling techniques are applied to a novel Bayesian speech estimator, and the performance of the proposed enhancement method is evaluated through objective and subjective tests. The experimental results confirm the advantage of explicit gain modeling, particularly for nonstationary noise sources

Journal ArticleDOI
TL;DR: Overall results indicate that SNR and SSNR improvements for the proposed approach are comparable to those of the Ephraim-Malah filter, with BWT enhancement giving the best results of all methods for the noisiest (-10 dB and -5 dB input SNR) conditions.

Journal ArticleDOI
TL;DR: A novel algorithm is presented that plugs the audiovisual coherence of speech signals, estimated by statistical tools, into audio blind source separation (BSS) techniques, and is applied to the difficult and realistic case of convolutive mixtures.
Abstract: Looking at the speaker's face can be useful to better hear a speech signal in a noisy environment and extract it from competing sources before identification. This suggests that the visual signals of speech (movements of visible articulators) could be used in speech enhancement or extraction systems. In this paper, we present a novel algorithm that plugs the audiovisual coherence of speech signals, estimated by statistical tools, into audio blind source separation (BSS) techniques. This algorithm is applied to the difficult and realistic case of convolutive mixtures. The algorithm mainly works in the frequency (transform) domain, where the convolutive mixture becomes an additive mixture for each frequency channel. Frequency-by-frequency separation is made by an audio BSS algorithm. The audio and visual information are modeled by a newly proposed statistical model. This model is then used to solve the standard source permutation and scale factor ambiguities encountered for each frequency after the audio blind separation stage. The proposed method is shown to be efficient in the case of 2 x 2 convolutive mixtures and offers promising perspectives for extracting a particular speech source of interest from complex mixtures.

Journal ArticleDOI
TL;DR: It is shown that cepstral smoothing can effectively prevent spectral peaks of short duration that may be perceived as musical noise, and preserves speech onsets, plosives, and quasi-stationary narrowband structures like voiced speech.
Abstract: Many speech enhancement algorithms that modify short-term spectral magnitudes of the noisy signal by means of adaptive spectral gain functions are plagued by annoying spectral outliers. In this letter, we propose cepstral smoothing as a solution to this problem. We show that cepstral smoothing can effectively prevent spectral peaks of short duration that may be perceived as musical noise. At the same time, cepstral smoothing preserves speech onsets, plosives, and quasi-stationary narrowband structures like voiced speech. The proposed recursive temporal smoothing is applied to higher cepstral coefficients only, excluding those representing the pitch information. As the higher cepstral coefficients describe the finer spectral structure of the Fourier spectrum, smoothing them along time prevents single coefficients of the filter function from changing excessively and independently of their neighboring bins, thus suppressing musical noise. The proposed cepstral smoothing technique is very effective in nonstationary noise.
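A rough sketch of the smoothing idea, under simplifying assumptions of my own (FFT-based cepstrum, a fixed quefrency cutoff, and no special handling of the pitch quefrency): the higher cepstral coefficients of the spectral gain are recursively smoothed over time and the gain is then reconstructed.

```python
import numpy as np

def cepstral_smooth_gain(gain, prev_cepstrum, beta=0.7, cutoff=16):
    """gain: spectral gain over the positive-frequency bins of an N-point FFT
    (length N//2 + 1, values > 0). prev_cepstrum: smoothed cepstrum from the
    previous frame (length N; use zeros for the first frame)."""
    cep = np.fft.irfft(np.log(np.maximum(gain, 1e-8)))   # real cepstrum of the gain
    n = len(cep)
    hi = slice(cutoff, n - cutoff + 1)                   # higher quefrencies only (symmetric)
    smoothed = cep.copy()
    smoothed[hi] = beta * prev_cepstrum[hi] + (1.0 - beta) * cep[hi]
    new_gain = np.exp(np.fft.rfft(smoothed).real)        # back to a spectral gain
    return new_gain, smoothed
```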

Journal ArticleDOI
TL;DR: The present study demonstrates that the assumed distribution of the DFT coefficients can have a significant effect on the quality of the enhanced speech, and derives the MMSE estimator under speech presence uncertainty and a Laplacian statistical model.

Proceedings ArticleDOI
15 Apr 2007
TL;DR: A supervised learning approach to monaural segregation of reverberant voiced speech is proposed, which learns to map from a set of pitch-based auditory features to a grouping cue encoding the posterior probability of a time-frequency (T-F) unit being target dominant given observed features.
Abstract: Room reverberation degrades speech signals and poses a major challenge to current monaural speech segregation systems. Previous research relies on inverse filtering as a front-end for partially restoring the harmonicity of the reverberant signal. We show that the inverse filtering approach is sensitive to different room configurations, hence undesirable in general reverberation conditions. We propose a supervised learning approach to map a set of harmonic features into a pitch based grouping cue for each time-frequency (T-F) unit. We use a speech segregation method to estimate an ideal binary T-F mask which retains the reverberant mixture in a local T-F unit if and only if the energy of target is stronger than interference energy. Results show that our approach improves the segregation performance considerably.
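A minimal sketch of the ideal binary mask criterion stated above (retain a T-F unit only where target energy exceeds interference energy). The 0 dB local criterion and the array names are conventional assumptions.

```python
import numpy as np

def ideal_binary_mask(target_energy, interference_energy, lc_db=0.0):
    """target_energy, interference_energy: T-F energy maps of equal shape."""
    local_snr_db = 10.0 * np.log10((target_energy + 1e-12) /
                                   (interference_energy + 1e-12))
    return (local_snr_db > lc_db).astype(float)  # 1 marks a target-dominant unit
```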

Journal ArticleDOI
TL;DR: A robust dereverberation method is presented for speech enhancement in a situation requiring adaptation where a speaker shifts his/her head under reverberant conditions causing the impulse responses to change frequently.
Abstract: A robust dereverberation method is presented for speech enhancement in a situation requiring adaptation where a speaker shifts his/her head under reverberant conditions causing the impulse responses to change frequently. We combine correlation-based blind deconvolution with modified spectral subtraction to improve the quality of inverse-filtered speech degraded by the estimation error of inverse filters obtained in practice. Our method computes inverse filters by using the correlation matrix between input signals that can be observed without measuring room impulse responses. Inverse filtering reduces early reflection, which has most of the power of the reverberation, and then, spectral subtraction suppresses the tail of the inverse-filtered reverberation. The performance of our method in adaptation is demonstrated by experiments using measured room impulse responses. The subjective results indicated that this method provides superior speech quality to each of the individual methods: blind deconvolution and spectral subtraction.
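For reference, a bare-bones power spectral subtraction rule of the kind used in the second stage above; the oversubtraction factor and spectral floor are common heuristics chosen here for illustration, not the paper's modified variant.

```python
import numpy as np

def spectral_subtraction(noisy_power, noise_power, alpha=2.0, floor=0.01):
    """noisy_power, noise_power: power spectra per frame and frequency bin."""
    clean_power = noisy_power - alpha * noise_power      # subtract the noise estimate
    return np.maximum(clean_power, floor * noisy_power)  # flooring limits musical noise
```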

Journal ArticleDOI
15 Apr 2007
TL;DR: In this paper, a visually derived Wiener filter is proposed that obtains clean speech and noise power spectrum statistics from visual speech features and is used to enhance audio speech that has been contaminated by noise.
Abstract: The aim of this work is to examine whether visual speech information can be used to enhance audio speech that has been contaminated by noise. First, an analysis of audio and visual speech features is made, which identifies the pair with highest audio-visual correlation. The study also reveals that higher audio-visual correlation exists within individual phoneme sounds rather than globally across all speech. This correlation is exploited in the proposal of a visually derived Wiener filter that obtains clean speech and noise power spectrum statistics from visual speech features. Clean speech statistics are estimated from visual features using a maximum a posteriori framework that is integrated within the states of a network of hidden Markov models to provide phoneme localization. Noise statistics are obtained through a novel audio-visual voice activity detector which utilizes visual speech features to make robust speech/nonspeech classifications. The effectiveness of the visually derived Wiener filter is evaluated subjectively and objectively and is compared with three different audio-only enhancement methods over a range of signal-to-noise ratios.

Journal ArticleDOI
TL;DR: This work proposes a supervised approach using regression trees to learn the nonlinear transformation of the uncertainty from the linear spectral domain to the cepstral domain, which is used by a decoder that exploits the variance associated with the enhanced cEPstral features to improve robust speech recognition.
Abstract: Recently, several algorithms have been proposed to enhance noisy speech by estimating a binary mask that can be used to select those time-frequency regions of a noisy speech signal that contain more speech energy than noise energy. This binary mask encodes the uncertainty associated with enhanced speech in the linear spectral domain. The use of the cepstral transformation smears the information from the noise dominant time-frequency regions across all the cepstral features. We propose a supervised approach using regression trees to learn the nonlinear transformation of the uncertainty from the linear spectral domain to the cepstral domain. This uncertainty is used by a decoder that exploits the variance associated with the enhanced cepstral features to improve robust speech recognition. Systematic evaluations on a subset of the Aurora4 task using the estimated uncertainty show substantial improvement over the baseline performance across various noise conditions.

Journal ArticleDOI
TL;DR: The advent of micro-technology and human-machine integration promises to improve EL speech quality, and more efficient algorithms enhance EL sound quality, apparently improving the intelligibility of EL speech and thus the quality of life of EL speakers.

Journal ArticleDOI
TL;DR: It is shown that the "decision-directed" approach for speech spectral variance estimation can have an important bias at low SNRs, which generally leads to too much speech suppression.

Journal ArticleDOI
TL;DR: Experimental results demonstrate the advantage of using the proposed simultaneous detection and estimation approach with the proposed a priori SNR estimator, which facilitate suppression of transient noise with a controlled level of speech distortion.
Abstract: In this paper, we present a simultaneous detection and estimation approach for speech enhancement. A detector for speech presence in the short-time Fourier transform domain is combined with an estimator, which jointly minimizes a cost function that takes into account both detection and estimation errors. Cost parameters control the tradeoff between speech distortion, caused by missed detection of speech components and residual musical noise resulting from false-detection. Furthermore, a modified decision-directed a priori signal-to-noise ratio (SNR) estimation is proposed for transient-noise environments. Experimental results demonstrate the advantage of using the proposed simultaneous detection and estimation approach with the proposed a priori SNR estimator, which facilitate suppression of transient noise with a controlled level of speech distortion.
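The unmodified decision-directed a priori SNR estimator that the paper builds on can be sketched as follows; the smoothing constant and variable names are conventional assumptions, and the paper's transient-noise modification is not reproduced.

```python
import numpy as np

def decision_directed_snr(prev_amp_est, noisy_mag, noise_var, alpha=0.98):
    """prev_amp_est: enhanced spectral amplitude of the previous frame,
    noisy_mag: noisy magnitude of the current frame, noise_var: noise PSD estimate."""
    noise_var = np.maximum(noise_var, 1e-12)
    gamma = (noisy_mag ** 2) / noise_var                          # a posteriori SNR
    xi = (alpha * (prev_amp_est ** 2) / noise_var
          + (1.0 - alpha) * np.maximum(gamma - 1.0, 0.0))         # a priori SNR estimate
    return xi, gamma
```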

Journal ArticleDOI
TL;DR: The accurate speaker tracking provided by the audio-visual sensor array proved beneficial for improving both enhancement and recognition performance in a microphone array-based speech recognition system.
Abstract: This paper addresses the problem of distant speech acquisition in multiparty meetings, using multiple microphones and cameras. Microphone array beamforming techniques present a potential alternative to close-talking microphones by providing speech enhancement through spatial filtering. Beamforming techniques, however, rely on knowledge of the speaker location. In this paper, we present an integrated approach, in which an audio-visual multiperson tracker is used to track active speakers with high accuracy. Speech enhancement is then achieved using microphone array beamforming followed by a novel postfiltering stage. Finally, speech recognition is performed to evaluate the quality of the enhanced speech signal. The approach is evaluated on data recorded in a real meeting room for stationary speaker, moving speaker, and overlapping speech scenarios. The results show that the speech enhancement and recognition performance achieved using our approach are significantly better than a single table-top microphone and are comparable to a lapel microphone for some of the scenarios. The results also indicate that the audio-visual-based system performs significantly better than audio-only system, both in terms of enhancement and recognition. This reveals that the accurate speaker tracking provided by the audio-visual sensor array proved beneficial to improve the recognition performance in a microphone array-based speech recognition system.
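A toy delay-and-sum beamformer illustrating the spatial-filtering step described above; the paper's system adds tracking-driven steering and a novel postfilter that this sketch omits, and the integer sample delays are assumed to be known.

```python
import numpy as np

def delay_and_sum(mic_signals, sample_delays):
    """mic_signals: list of equal-length arrays; sample_delays: integer delay per mic
    that time-aligns the target speaker across channels (simplified, circular shift)."""
    aligned = [np.roll(sig, -d) for sig, d in zip(mic_signals, sample_delays)]
    return np.mean(aligned, axis=0)  # coherent sum boosts the target, averages down noise
```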

Journal ArticleDOI
TL;DR: Neuroevolution Artificial Bandwidth Expansion (NEABE) is proposed, a new method that uses spectral folding to create the initial spectral components above the telephone band and was found to be preferred over narrowband speech in about 80% of the test cases.
Abstract: The limited bandwidth of 0.3-3.4 kHz in current telephone systems reduces both the quality and the intelligibility of speech. Artificial bandwidth expansion is a method that expands the bandwidth of the narrowband speech signal in the receiving end of the transmission link by adding new frequency components to the higher frequencies, i.e., up to 8 kHz. In this paper, a new method for artificial bandwidth expansion, termed Neuroevolution Artificial Bandwidth Expansion (NEABE) is proposed. The method uses spectral folding to create the initial spectral components above the telephone band. The spectral envelope is then shaped in the frequency domain, based on a set of parameters given by a neural network. Subjective listening tests were used to evaluate the performance of the proposed algorithm, and the results showed that NEABE speech was preferred over narrowband speech in about 80% of the test cases
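The spectral-folding step that creates the initial highband components can be illustrated as below: inserting a zero between every narrowband sample doubles the sampling rate and mirrors the 0-4 kHz spectrum into 4-8 kHz. The neural-network envelope shaping that follows in NEABE is not reproduced here.

```python
import numpy as np

def spectral_fold(narrowband_8k):
    """narrowband_8k: speech sampled at 8 kHz. Returns a 16 kHz signal whose
    upper band (4-8 kHz) is a mirror image of the telephone band."""
    wideband = np.zeros(2 * len(narrowband_8k))
    wideband[::2] = narrowband_8k   # zero insertion folds the baseband spectrum upward
    return wideband                  # envelope shaping of the new highband would follow
```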

Journal ArticleDOI
TL;DR: A new framework entitled Environmental Sniffing is proposed to detect, classify, and track acoustic environmental conditions, and a new information criterion is defined that incorporates the impact of noise into Environmental Sniffing performance.
Abstract: Automatic speech recognition systems work reasonably well under clean conditions but become fragile in practical applications involving real-world environments. To date, most approaches dealing with environmental noise in speech systems are based on assumptions concerning the noise, or differences in collecting and training on a specific noise condition, rather than exploring the nature of the noise. As such, speech recognition, speaker ID, or coding systems are typically retrained when new acoustic conditions are to be encountered. In this paper, we propose a new framework entitled Environmental Sniffing to detect, classify, and track acoustic environmental conditions. The first goal of the framework is to seek out detailed information about the environmental characteristics instead of just detecting environmental changes. The second goal is to organize this knowledge in an effective manner to allow smart decisions to direct subsequent speech processing systems. Our current framework uses a number of speech processing modules including a hybrid algorithm with T2-BIC segmentation, Gaussian mixture model/hidden Markov model (GMM/HMM)-based classification and noise language modeling to achieve effective noise knowledge estimation. We define a new information criterion that incorporates the impact of noise into Environmental Sniffing performance. We use an in-vehicle speech and noise environment as a test platform for our evaluations and investigate the integration of Environmental Sniffing for automatic speech recognition (ASR) in this environment. Noise sniffing experiments show that our proposed hybrid algorithm achieves a classification error rate of 25.51%, outperforming our baseline system by 7.08%. The sniffing framework is compared to a ROVER solution for automatic speech recognition (ASR) using different noise conditioned recognizers in terms of word error rate (WER) and CPU usage. Results show that the model matching scheme using the knowledge extracted from the audio stream by Environmental Sniffing achieves better performance than a ROVER solution both in accuracy and computation. A relative 11.1% WER improvement is achieved with a relative 75% reduction in CPU resources

Journal ArticleDOI
TL;DR: Evaluations of the performance gain obtained from the proposed post-processing speech restoration module, used as an add-on to standard speech enhancement systems, are presented and show substantial improvements in perceptual quality.
Abstract: This paper presents a post-processing speech restoration module for enhancing the performance of conventional speech enhancement methods. The restoration module aims to retrieve parts of the speech spectrum that may be lost to noise or suppressed when using conventional speech enhancement methods. The proposed restoration method utilizes a harmonic plus noise model (HNM) of speech to retrieve damaged speech structure. A modified HNM of speech is proposed where, instead of the conventional binary labeling of the signal in each subband as voiced or unvoiced, the concept of harmonicity is introduced, which is more adaptable to the codebook mapping method used in the later stage of enhancement. To restore the lost or suppressed information, an HNM codebook mapping technique is proposed. The HNM codebook is trained on speaker-independent speech data. To reduce the sensitivity of the HNM codebook to speaker variability, a spectral energy normalization process is introduced. The proposed post-processing method is tested as an add-on module with several popular noise reduction methods. Evaluations of the performance gain obtained from the proposed post-processing are presented and compared to standard speech enhancement systems, showing substantial improvement gains in perceptual quality.

Journal ArticleDOI
TL;DR: This paper proposes a novel F0 contour estimation algorithm based on a precise parametric description of the voiced parts of speech derived from the power spectrum that is competitive on clean single-speaker speech, and outperforms existing methods in the presence of noise.
Abstract: This paper proposes a novel F0 contour estimation algorithm based on a precise parametric description of the voiced parts of speech derived from the power spectrum. The algorithm is able to perform in a wide variety of noisy environments as well as to estimate the F0s of cochannel concurrent speech. The speech spectrum is modeled as a sequence of spectral clusters governed by a common F0 contour expressed as a spline curve. These clusters are obtained by an unsupervised 2-D time-frequency clustering of the power density using a new formulation of the EM algorithm, and their common F0 contour is estimated at the same time. A smooth F0 contour is extracted for the whole utterance, linking together its voiced parts. A noise model is used to cope with nonharmonic background noise, which would otherwise interfere with the clustering of the harmonic portions of speech. We evaluate our algorithm in comparison with existing methods on several tasks, and show 1) that it is competitive on clean single-speaker speech, 2) that it outperforms existing methods in the presence of noise, and 3) that it outperforms existing methods for the estimation of multiple F0 contours of cochannel concurrent speech.

Proceedings ArticleDOI
13 Jun 2007
TL;DR: The objective of the UTDrive project is to analyze behavior while the driver is interacting with speech-activated systems or performing common secondary tasks, as well as to better understand speech characteristics of the driver undergoing additional cognitive load.
Abstract: This paper describes an overview of the UTDrive project. UTDrive is part of an on-going international collaboration to collect and research rich multi-modal data recorded for modeling driver behavior for in-vehicle environments. The objective of the UTDrive project is to analyze behavior while the driver is interacting with speech-activated systems or performing common secondary tasks, as well as to better understand speech characteristics of the driver undergoing additional cognitive load. The corpus consists of audio, video, gas/brake pedal pressure, forward distance, GPS information, and CAN-Bus information. The resulting corpus, analysis, and modeling will contribute to more effective speech interactive systems with are less distractive and adjustable to the driver's cognitive capacity and driving situations.