Journal ArticleDOI

Perceptual learning for speech in noise after application of binary time-frequency masks.

06 Mar 2013 - Journal of the Acoustical Society of America (Acoustical Society of America) - Vol. 133, Iss. 3, pp. 1687-1692
TL;DR: The study reported here assessed the effect of training on the recognition of speech in noise after processing by ideal TF masks that did not restore perfect speech intelligibility.
Abstract: Ideal time-frequency (TF) masks can reject noise and improve the recognition of speech-noise mixtures. An ideal TF mask is constructed with prior knowledge of the target speech signal. The intelligibility of a processed speech-noise mixture depends upon the threshold criterion used to define the TF mask. The study reported here assessed the effect of training on the recognition of speech in noise after processing by ideal TF masks that did not restore perfect speech intelligibility. Two groups of listeners with normal hearing listened to speech-noise mixtures processed by TF masks calculated with different threshold criteria. For each group, a threshold criterion that initially produced word recognition scores between 0.56 and 0.69 was chosen for training. Listeners practiced with one set of TF-masked sentences until their word recognition performance approached asymptote. Perceptual learning was quantified by comparing word recognition scores in the first and last training sessions. Word recognition scores improved with practice for all listeners, with the greatest improvement observed for the same materials used in training.
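To make the threshold criterion concrete, here is a minimal sketch of how an ideal binary mask could be computed when the premixed target and noise are available; the STFT settings and the lc_db value are illustrative assumptions, not the study's actual parameters.

```python
import numpy as np
from scipy.signal import stft, istft

def ideal_binary_mask(target, noise, fs, lc_db=-6.0):
    """Sketch of an ideal binary mask (IBM): a time-frequency unit is
    kept when its local target-to-noise ratio exceeds the criterion
    lc_db. Raising lc_db rejects more units, which is how masks that
    do not restore perfect intelligibility can be produced."""
    _, _, T = stft(target, fs=fs, nperseg=512)
    _, _, N = stft(noise, fs=fs, nperseg=512)
    local_snr = 10.0 * np.log10((np.abs(T)**2 + 1e-12) / (np.abs(N)**2 + 1e-12))
    mask = (local_snr > lc_db).astype(float)
    _, _, M = stft(target + noise, fs=fs, nperseg=512)  # mixture spectrogram
    _, enhanced = istft(mask * M, fs=fs, nperseg=512)   # resynthesize the masked mixture
    return mask, enhanced
```

Because the mask requires the premixed signals, it is "ideal": a performance ceiling that practical systems, such as the classification-based separators cited below, must approximate.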


Citations
Journal ArticleDOI
TL;DR: This study systematically evaluates a range of promising features for classification-based separation using six nonstationary noises at the low SNR of -5 dB, and proposes a new feature, the multi-resolution cochleagram (MRCG), which experimental results show to give the best classification performance among all evaluated features.
Abstract: Speech separation can be formulated as a classification problem. In classification-based speech separation, supervised learning is employed to classify time-frequency units as either speech-dominant or noise-dominant. In very low signal-to-noise ratio (SNR) conditions, acoustic features extracted from a mixture are crucial for correct classification. In this study, we systematically evaluate a range of promising features for classification-based separation using six nonstationary noises at the low SNR level of -5 dB, which is chosen with the goal of improving human speech intelligibility in mind. In addition, we propose a new feature called the multi-resolution cochleagram (MRCG). The new feature is constructed by combining four cochleagrams at different spectrotemporal resolutions in order to capture both local and contextual information. Experimental results show that MRCG gives the best classification results among all evaluated features. In addition, our results indicate that auto-regressive moving average (ARMA) filtering, a post-processing technique for improving automatic speech recognition features, also improves many acoustic features for speech separation.
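As a rough illustration of the multi-resolution idea, the sketch below assumes two log cochleagrams have already been computed (the gammatone filterbank front end is omitted) and derives the remaining two resolutions by local smoothing; the 11- and 23-point smoothing windows follow our reading of the MRCG description and should be treated as assumptions.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def mrcg(cg_short, cg_long):
    """Sketch of a multi-resolution cochleagram (MRCG) feature.
    cg_short: log cochleagram from short analysis frames (fine local
              detail), shape (channels, frames).
    cg_long:  log cochleagram from long frames, resampled to the same
              shape (broad spectrotemporal context).
    Two further views are derived by smoothing cg_short, and all four
    resolutions are stacked along the channel axis."""
    cg3 = uniform_filter(cg_short, size=11)  # moderate spectrotemporal smoothing
    cg4 = uniform_filter(cg_short, size=23)  # heavier smoothing, wider context
    return np.concatenate([cg_short, cg_long, cg3, cg4], axis=0)
```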

145 citations


Cites background from "Perceptual learning for speech in n..."

  • ...In subject tests, IBM separation has been shown to dramatically improve speech intelligibility in noise for both normal-hearing and hearing-impaired listeners [4], [24], [37], [1]....


Journal ArticleDOI
TL;DR: In this paper, a novel language-, noise-, and speaker-independent deep neural network (DNN) architecture, termed CochleaNet, was proposed for causal or real-time speech enhancement (SE).

43 citations

Proceedings ArticleDOI
02 Sep 2018
TL;DR: In this article, a hybrid deep neural network (DNN) based audiovisual mask estimation model was proposed to integrate the temporal dynamics of both audio and noise-immune visual features for improved mask estimation and speech separation.
Abstract: The human auditory cortex excels at selectively suppressing background noise to focus on a target speaker. The process of selective attention in the brain is known to contextually exploit the available audio and visual cues to better focus on the target speaker while filtering out other noises. In this study, we propose a novel deep neural network (DNN) based audiovisual (AV) mask estimation model. The proposed AV mask estimation model contextually integrates the temporal dynamics of both audio and noise-immune visual features for improved mask estimation and speech separation. For optimal AV feature extraction and ideal binary mask (IBM) estimation, a hybrid DNN architecture is exploited that leverages the complementary strengths of a stacked long short-term memory (LSTM) network and a convolutional LSTM network. Comparative simulation results in terms of speech quality and intelligibility demonstrate the significant performance improvement of the proposed AV mask estimation model over audio-only and visual-only mask estimation approaches in both speaker-dependent and speaker-independent scenarios.
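As a sketch of what such a model can look like, the snippet below fuses an audio stream and a visual stream with stacked LSTMs and maps the result to a per-unit mask. The convolutional-LSTM half of the paper's hybrid is replaced by a plain LSTM here for brevity, and all feature dimensions and layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class AVMaskEstimator(nn.Module):
    """Sketch of an audio-visual binary-mask estimator: two temporal
    streams (acoustic features and lip/visual features) are modeled
    with stacked LSTMs, concatenated frame by frame, and mapped to a
    mask value in [0, 1] for every time-frequency unit."""
    def __init__(self, n_audio=64, n_visual=32, n_freq=64, hidden=128):
        super().__init__()
        self.audio_rnn = nn.LSTM(n_audio, hidden, num_layers=2, batch_first=True)
        self.visual_rnn = nn.LSTM(n_visual, hidden, num_layers=2, batch_first=True)
        self.head = nn.Sequential(nn.Linear(2 * hidden, n_freq), nn.Sigmoid())

    def forward(self, audio, visual):
        # audio: (batch, frames, n_audio); visual: (batch, frames, n_visual)
        a, _ = self.audio_rnn(audio)
        v, _ = self.visual_rnn(visual)
        fused = torch.cat([a, v], dim=-1)  # frame-synchronous AV fusion
        return self.head(fused)            # (batch, frames, n_freq) soft mask
```

Thresholding the soft output at 0.5 would yield the ideal-binary-mask estimate the paper targets.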

33 citations

Journal ArticleDOI
TL;DR: In this paper, the authors examined the effect of three noise perturbations on supervised speech separation at low signal-to-noise ratios (SNRs): noise rate, vocal tract length, and frequency perturbation.

31 citations


Cites background from "Perceptual learning for speech in n..."

  • ...Recent studies show IBM separation improves speech intelligibility in noise for both normal-hearing and hearing-impaired listeners (Ahmadi et al., 2013; Brungart et al., 2006; Li and Loizou, 2008; Wang et al., 2009)....


  • ...One way of dealing with this problem is to apply speech enhancement (Ephraim and Malah, 1984; Erkelens et al., 2007; Jensen and Hendriks, 2012) on a noisy signal, where certain assumptions are made regarding general statistics of the background noise. The speech enhancement approach is usually limited…...


Posted Content
TL;DR: A causal, language-, noise-, and speaker-independent AV deep neural network (DNN) architecture for speech enhancement (SE) is presented that exploits noisy acoustic cues and noise-robust visual cues to focus on the desired speaker and improve speech intelligibility.
Abstract: Noisy situations cause huge problems for sufferers of hearing loss, as hearing aids often make the signal more audible but do not always restore intelligibility. In noisy settings, humans routinely exploit the audio-visual (AV) nature of speech to selectively suppress the background noise and focus on the target speaker. In this paper, we present a causal, language-, noise-, and speaker-independent AV deep neural network (DNN) architecture for speech enhancement (SE). The model exploits noisy acoustic cues and noise-robust visual cues to focus on the desired speaker and improve speech intelligibility. To evaluate the proposed SE framework, a first-of-its-kind AV binaural speech corpus, called ASPIRE, is recorded in real noisy environments, including a cafeteria and a restaurant. We demonstrate superior performance of our approach in terms of objective measures and subjective listening tests over state-of-the-art SE approaches as well as recent DNN-based SE models. In addition, our work challenges the popular belief that a scarcity of multi-language, large-vocabulary AV corpora covering a wide variety of noises is a major bottleneck to building robust language-, speaker-, and noise-independent SE systems. We show that a model trained on a synthetic mixture of the Grid corpus (with 33 speakers and a small English vocabulary) and CHiME 3 noises (consisting of only bus, pedestrian, cafeteria, and street noises) generalises well not only to large-vocabulary corpora but also to completely unrelated languages (such as Mandarin) and a wide variety of speakers and noises.

16 citations


Cites background from "Perceptual learning for speech in n..."

  • ...The IBM has been shown to improve speech quality and intelligibility for hearing-impaired and normal-hearing listeners [9, 10, 11]....


References
Book ChapterDOI
01 Jan 2005
TL;DR: This chapter is an attempt at a computational-theory analysis of auditory scene analysis, where the main task is to understand the character of the CASA problem.
Abstract: In his famous treatise on computational vision, Marr (1982) makes a compelling argument for separating different levels of analysis in order to understand complex information processing. In particular, the computational theory level, concerned with the goal of computation and general processing strategy, must be separated from the algorithm level, or the separation of what from how. This chapter is an attempt at a computational-theory analysis of auditory scene analysis, where the main task is to understand the character of the CASA problem.

617 citations

Journal ArticleDOI
TL;DR: A database of speech samples from eight different talkers has been collected for use in multitalker communications research and the nature of the corpus, the data collection methodology, and the means for obtaining copies of the database are presented.
Abstract: A database of speech samples from eight different talkers has been collected for use in multitalker communications research. Descriptions of the nature of the corpus, the data collection methodology, and the means for obtaining copies of the database are presented.

488 citations

Journal ArticleDOI
TL;DR: This study attempted to isolate the effects that energetic masking, defined as the loss of detectable target information due to the spectral overlap of the target and masking signals, has on multitalker speech perception through the use of ideal time-frequency binary masks.
Abstract: When a target speech signal is obscured by an interfering speech waveform, comprehension of the target message depends both on the successful detection of the energy from the target speech waveform and on the successful extraction and recognition of the spectro-temporal energy pattern of the target out of a background of acoustically similar masker sounds. This study attempted to isolate the effects that energetic masking, defined as the loss of detectable target information due to the spectral overlap of the target and masking signals, has on multitalker speech perception. This was achieved through the use of ideal time-frequency binary masks that retained those spectro-temporal regions of the acoustic mixture that were dominated by the target speech but eliminated those regions that were dominated by the interfering speech. The results suggest that energetic masking plays a relatively small role in the overall masking that occurs when speech is masked by interfering speech, but a much more significant role when speech is masked by interfering noise.

388 citations

Journal ArticleDOI
TL;DR: The findings from this study suggest that algorithms that can reliably estimate the SNR in each T-F unit can improve speech intelligibility.
Abstract: Traditional noise-suppression algorithms have been shown to improve speech quality, but not speech intelligibility. Motivated by prior intelligibility studies of speech synthesized using the ideal binary mask, an algorithm is proposed that decomposes the input signal into time-frequency (T-F) units and makes binary decisions, based on a Bayesian classifier, as to whether each T-F unit is dominated by the target or the masker. Speech corrupted at low signal-to-noise ratio (SNR) levels (-5 and 0 dB) using different types of maskers is synthesized by this algorithm and presented to normal-hearing listeners for identification. Results indicated substantial improvements in intelligibility (over 60 percentage points in -5 dB babble) over that attained by human listeners with unprocessed stimuli. The findings from this study suggest that algorithms that can reliably estimate the SNR in each T-F unit can improve speech intelligibility.
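The per-unit decision can be sketched with any probabilistic classifier. Below, a Gaussian naive Bayes model stands in for the paper's Bayesian classifier, trained on placeholder data; in the actual system, each T-F unit would carry acoustic features (X) and a label derived from its true local SNR (y).

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)

# Placeholder training data: in the real system, X holds acoustic
# features extracted per time-frequency (T-F) unit, and y marks whether
# the unit's true local SNR exceeds the threshold (1 = target-dominated).
X_train = rng.normal(size=(1000, 20))
y_train = (X_train[:, 0] > 0).astype(int)

clf = GaussianNB().fit(X_train, y_train)

# At test time, the classifier produces one keep/discard decision per
# T-F unit of the noisy mixture, forming the estimated binary mask.
X_mixture = rng.normal(size=(200, 20))
binary_mask = clf.predict(X_mixture)
```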

336 citations

Journal ArticleDOI
TL;DR: The data confirm that the hearing handicap of many elderly subjects manifests itself primarily in a noisy environment and that acceptable noise levels in rooms used by the aged must be 5 to 10 dB lower than those for normal-hearing subjects.
Abstract: For 140 male subjects (20 per decade between the ages of 20 and 89) and 72 female subjects (20 per decade between 60 and 89, and 12 for the age interval 90-96), the monaural speech-reception threshold (SRT) for sentences was investigated in quiet and at four noise levels (22.5, 37.5, 52.5, and 67.5 dBA noise with long-term average speech spectra). The median SRT as well as the quartiles are given as a function of age. The data are described in terms of a model published earlier [J. Acoust. Soc. Am. 63, 533-549 (1978)]. According to this model, every hearing loss for speech (SHL) is interpreted as the sum of a loss class A (attenuation), characterized by a reduction of the levels of both speech signal and noise, and a loss class D (distortion), comparable with a decrease in signal-to-noise ratio. Both SHL_(A+D) (hearing loss in quiet) and SHL_D (hearing loss at high noise levels) increase progressively above the age of 50 (reaching typical values of 30 and 6 dB, respectively, at age 85). The spread of SHL_D as a function of SHL_(A+D) for the individual ears is so large (σ = 2.7 dB) that subjects with the same hearing loss for speech in quiet may differ considerably in their ability to understand speech in noise. The data confirm that the hearing handicap of many elderly subjects manifests itself primarily in a noisy environment. Acceptable noise levels in rooms used by the aged must be 5 to 10 dB lower than those for normal-hearing subjects.
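To make the class A / class D decomposition concrete, here is a small sketch of a Plomp-style SRT model, assuming the common power-sum form in which the SRT tracks whichever is higher: the quiet threshold, raised by SHL_(A+D), or the noise-governed threshold, raised by SHL_D. All constants are illustrative, not the paper's fitted values.

```python
import numpy as np

def srt_plomp(noise_level_dba, srt_quiet_normal=20.0, snr_required=-6.0,
              shl_ad=0.0, shl_d=0.0):
    """Sketch of a Plomp-style two-component SRT model (dB values).
    shl_ad: class A+D loss, raising the threshold in quiet.
    shl_d:  class D (distortion) loss, acting like a worsened SNR.
    The SRT is modeled as the power sum of a quiet term and a
    noise-driven term, so it follows the larger of the two."""
    quiet = srt_quiet_normal + shl_ad
    noise = np.asarray(noise_level_dba) + snr_required + shl_d
    return 10.0 * np.log10(10.0 ** (quiet / 10.0) + 10.0 ** (noise / 10.0))

# At high noise levels an impaired listener with shl_d = 6 dB needs a
# 6 dB better signal-to-noise ratio than a normal-hearing listener,
# matching the typical age-85 values quoted in the abstract.
print(srt_plomp([30.0, 50.0, 70.0], shl_ad=30.0, shl_d=6.0))
```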

323 citations
