Journal ArticleDOI

Spatial release from informational masking in speech recognition.

05 Jun 2001-Journal of the Acoustical Society of America (Acoustical Society of America)-Vol. 109, Iss: 5, pp 2112-2122
TL;DR: Three experiments were conducted to determine the extent to which perceived separation of speech and interference improves speech recognition in the free field, and indicated that the advantage of perceived separation is not limited to conditions where the interfering speech is understandable.
Abstract: Three experiments were conducted to determine the extent to which perceived separation of speech and interference improves speech recognition in the free field. Target speech stimuli were 320 grammatically correct but nonmeaningful sentences spoken by a female talker. In the first experiment the interference was a recording of either one or two female talkers reciting a continuous stream of similar nonmeaningful sentences. The target talker was always presented from a loudspeaker directly in front (0°). The interference was either presented from the front loudspeaker (the F-F condition) or from both a right loudspeaker (60°) and the front loudspeaker, with the right leading the front by 4 ms (the F-RF condition). Due to the precedence effect, the interference in the F-RF condition was perceived to be well to the right, while the target talker was heard from the front. For both the single-talker and two-talker interference, there was a sizable improvement in speech recognition in the F-RF condition compared with the F-F condition. However, a second experiment showed that there was no F-RF advantage when the interference was noise modulated by the single- or multi-channel envelope of the two-talker masker. Results of the third experiment indicated that the advantage of perceived separation is not limited to conditions where the interfering speech is understandable.
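The F-RF stimulus construction described above (right loudspeaker leading the front copy by 4 ms) can be sketched in code. This is a minimal illustration of the delay relationship, not the authors' stimulus software; the sample rate and function name are assumptions.

```python
import numpy as np

FS = 44100          # sample rate in Hz (an assumption; the paper does not state one)
LEAD_MS = 4.0       # the right loudspeaker leads the front copy by 4 ms

def frf_interferer(interferer, fs=FS, lead_ms=LEAD_MS):
    """Build the two loudspeaker feeds for the F-RF condition: the front
    feed is the right feed delayed by `lead_ms`, so the precedence effect
    makes the interferer appear to come from well to the right."""
    lag = int(round(fs * lead_ms / 1000.0))            # 4 ms -> 176 samples at 44.1 kHz
    right = np.asarray(interferer, dtype=float)
    front = np.concatenate([np.zeros(lag), right])[:right.size]
    return right, front
```

Playing `right` from the 60° loudspeaker and `front` from the 0° loudspeaker reproduces the lead-lag relationship; the target talker's feed goes to the front loudspeaker only.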
Citations
Journal ArticleDOI
TL;DR: In this paper, the authors propose a supervised learning approach to speech segregation, in which a target speech signal is separated from interfering sounds using spatial localization cues: interaural time differences (ITD) and interaural intensity differences (IID).
Abstract: At a cocktail party, one can selectively attend to a single voice and filter out all the other acoustical interferences. How to simulate this perceptual ability remains a great challenge. This paper describes a novel, supervised learning approach to speech segregation, in which a target speech signal is separated from interfering sounds using spatial localization cues: interaural time differences (ITD) and interaural intensity differences (IID). Motivated by the auditory masking effect, the notion of an "ideal" time-frequency binary mask is suggested, which selects the target if it is stronger than the interference in a local time-frequency (T-F) unit. It is observed that within a narrow frequency band, modifications to the relative strength of the target source with respect to the interference trigger systematic changes for estimated ITD and IID. For a given spatial configuration, this interaction produces characteristic clustering in the binaural feature space. Consequently, pattern classification is performed in order to estimate ideal binary masks. A systematic evaluation in terms of signal-to-noise ratio as well as automatic speech recognition performance shows that the resulting system produces masks very close to ideal binary ones. A quantitative comparison shows that the model yields significant improvement in performance over an existing approach. Furthermore, under certain conditions the model produces large speech intelligibility improvements with normal listeners.

382 citations
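The "ideal" time-frequency binary mask defined in the abstract above has a direct computational form: in each local T-F unit, select the target if it is stronger than the interference. A minimal sketch follows; the function name and the 0 dB local criterion are illustrative choices, not taken from the paper.

```python
import numpy as np

def ideal_binary_mask(target_tf, interf_tf, lc_db=0.0):
    """Return 1 where the target is stronger than the interference in a
    local T-F unit (local SNR above `lc_db` dB), else 0. Inputs are
    magnitude spectrograms or cochleagrams of identical shape."""
    eps = 1e-12  # guard against division by zero and log of zero
    local_snr_db = 20.0 * np.log10((np.abs(target_tf) + eps) /
                                   (np.abs(interf_tf) + eps))
    return (local_snr_db > lc_db).astype(np.uint8)
```

Applying the mask element-wise to the mixture's T-F representation before resynthesis retains only the target-dominant units, which is what the supervised classifier in the paper tries to estimate from ITD/IID features.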

Proceedings ArticleDOI
15 Jul 2001
TL;DR: A technique for speech segregation based on sound localization cues is explored, building on the observation that systematic changes in interaural time and intensity differences occur as the energy ratio of the original signals is modified.
Abstract: We study the cocktail-party effect, which refers to the ability of a listener to attend to a single talker in the presence of adverse acoustical conditions. It has been observed that this ability improves in the presence of binaural cues. In this paper, we explore a technique for speech segregation based on sound localization cues. The auditory masking phenomenon motivates an "ideal" binary mask in which time-frequency regions that correspond to the weak signal are cancelled. In our model we estimate this binary mask by observing that systematic changes of the interaural time differences and intensity differences occur as the energy ratio of the original signals is modified. The performance of our model is comparable with results obtained using the ideal binary mask and it shows a large improvement over existing pitch-based algorithms.

329 citations

Journal ArticleDOI
TL;DR: The results suggest that, in these conditions, the advantage due to spatial separation of sources is greater for informational masking than for energetic masking.
Abstract: The effect of spatial separation of sources on the masking of a speech signal was investigated for three types of maskers, ranging from energetic to informational. Normal-hearing listeners performed a closed-set speech identification task in the presence of a masker at various signal-to-noise ratios. Stimuli were presented in a quiet sound field. The signal was played from 0° azimuth and a masker was played either from the same location or from 90° to the right. Signals and maskers were derived from sentences that were preprocessed by a modified cochlear-implant simulation program that filtered each sentence into 15 frequency bands, extracted the envelopes from each band, and used these envelopes to modulate pure tones at the center frequencies of the bands. In each trial, the signal was generated by summing together eight randomly selected frequency bands from the preprocessed signal sentence. Three maskers were derived from the preprocessed masker sentences: (1) different-band sentence, which was generated by summing together six randomly selected frequency bands out of the seven bands not present in the signal (resulting in primarily informational masking); (2) different-band noise, which was generated by convolving the different-band sentence with Gaussian noise; and (3) same-band noise, which was generated by summing the same eight bands from the preprocessed masker sentence that were used in the signal sentence and convolving the result with Gaussian noise (resulting in primarily energetic masking). Results revealed that in the different-band sentence masker, the effect of spatial separation averaged 18 dB (at 51% correct), while in the different-band and same-band noise maskers the effect was less than 10 dB. These results suggest that, in these conditions, the advantage due to spatial separation of sources is greater for informational masking than for energetic masking.

325 citations
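The band-allocation scheme in the stimulus design above (8 of 15 preprocessed bands for the signal, 6 of the 7 remaining bands for the different-band masker) can be sketched as follows; the function name is illustrative.

```python
import numpy as np

N_BANDS = 15  # each sentence is filtered into 15 frequency bands

def pick_bands(rng):
    """Choose 8 random bands for the signal sentence and 6 of the 7
    remaining bands for the different-band masker, so the two sets never
    overlap (yielding primarily informational masking)."""
    signal_bands = rng.choice(N_BANDS, size=8, replace=False)
    leftover = np.setdiff1d(np.arange(N_BANDS), signal_bands)  # the 7 unused bands
    masker_bands = rng.choice(leftover, size=6, replace=False)
    return set(signal_bands.tolist()), set(masker_bands.tolist())
```

Because the masker occupies only bands absent from the signal, it contributes little spectral overlap (energetic masking); the same-band noise masker reuses the signal's 8 bands instead.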

Journal ArticleDOI
TL;DR: Evidence is interpreted for a significant role of informational masking and modulation interference in cochlear implant speech recognition with fluctuating maskers that may originate from increased target-masker similarity when spectral resolution is reduced.
Abstract: Speech recognition performance was measured in normal-hearing and cochlear-implant listeners with maskers consisting of either steady-state speech-spectrum-shaped noise or a competing sentence. Target sentences from a male talker were presented in the presence of one of three competing talkers (same male, different male, or female) or speech-spectrum-shaped noise generated from this talker at several target-to-masker ratios. For the normal-hearing listeners, target-masker combinations were processed through a noise-excited vocoder designed to simulate a cochlear implant. With unprocessed stimuli, a normal-hearing control group maintained high levels of intelligibility down to target-to-masker ratios as low as 0 dB and showed a release from masking, producing better performance with single-talker maskers than with steady-state noise. In contrast, no masking release was observed in either implant or normal-hearing subjects listening through an implant simulation. The performance of the simulation and implant groups did not improve when the single-talker masker was a different talker compared to the same talker as the target speech, as was found in the normal-hearing control. These results are interpreted as evidence for a significant role of informational masking and modulation interference in cochlear implant speech recognition with fluctuating maskers. This informational masking may originate from increased target-masker similarity when spectral resolution is reduced.

317 citations

Journal ArticleDOI
TL;DR: In this letter, consideration is given to the problems of defining IM and specifying research that is needed to better understand and model IM.
Abstract: Informational masking (IM) has a long history and is currently receiving considerable attention. Nevertheless, there is no clear and generally accepted picture of how IM should be defined, and once defined, explained. In this letter, consideration is given to the problems of defining IM and specifying research that is needed to better understand and model IM.

314 citations

References
Journal ArticleDOI
TL;DR: In this paper, the relation between the messages received by the two ears was investigated, and two types of test were reported: (a) the behavior of a listener when presented with two speech signals simultaneously (statistical filtering problem) and (b) behavior when different speech signals are presented to his two ears.
Abstract: This paper describes a number of objective experiments on recognition, concerning particularly the relation between the messages received by the two ears. Rather than use steady tones or clicks (frequency or time-point signals), continuous speech is used, and the results are interpreted in the main statistically. Two types of test are reported: (a) the behavior of a listener when presented with two speech signals simultaneously (a statistical filtering problem) and (b) behavior when different speech signals are presented to his two ears.

3,562 citations

Journal ArticleDOI
13 Oct 1995-Science
TL;DR: Nearly perfect speech recognition was observed under conditions of greatly reduced spectral information; the presentation of a dynamic temporal pattern in only a few broad spectral regions is sufficient for the recognition of speech.
Abstract: Nearly perfect speech recognition was observed under conditions of greatly reduced spectral information. Temporal envelopes of speech were extracted from broad frequency bands and were used to modulate noises of the same bandwidths. This manipulation preserved temporal envelope cues in each band but restricted the listener to severely degraded information on the distribution of spectral energy. The identification of consonants, vowels, and words in simple sentences improved markedly as the number of bands increased; high speech recognition performance was obtained with only three bands of modulated noise. Thus, the presentation of a dynamic temporal pattern in only a few broad spectral regions is sufficient for the recognition of speech.

2,865 citations
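The envelope-vocoding procedure above (extract temporal envelopes from broad frequency bands, use them to modulate noise of the same bandwidths, sum the bands) can be sketched as below. The filter order, Hilbert-envelope method, and band edges are assumptions for illustration, not the paper's exact processing.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def noise_vocoder(x, fs, edges=(100.0, 800.0, 1500.0, 4000.0), seed=0):
    """Three-band noise vocoder sketch: band-pass the speech, take each
    band's envelope, modulate same-band noise with it, and sum the bands.
    The temporal envelope is preserved; spectral detail within each band
    is discarded."""
    x = np.asarray(x, dtype=float)
    rng = np.random.default_rng(seed)
    out = np.zeros_like(x)
    for lo, hi in zip(edges[:-1], edges[1:]):
        sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        band = sosfiltfilt(sos, x)
        envelope = np.abs(hilbert(band))          # temporal envelope of the band
        carrier = sosfiltfilt(sos, rng.standard_normal(x.size))  # same-band noise
        out += envelope * carrier                 # envelope-modulated noise
    return out
```

Increasing the number of band edges gives more bands, and, per the finding above, intelligibility of the vocoded speech rises rapidly with band count.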

Journal ArticleDOI
TL;DR: The speech-reception threshold (SRT) for sentences presented in a fluctuating interfering background sound of 80 dBA SPL is measured, and it is shown that hearing-impaired individuals perform more poorly than suggested by the loss of audibility for some parts of the speech signal.
Abstract: The speech-reception threshold (SRT) for sentences presented in a fluctuating interfering background sound of 80 dBA SPL is measured for 20 normal-hearing listeners and 20 listeners with sensorineural hearing impairment. The interfering sounds range from steady-state noise, via modulated noise, to a single competing voice. Two voices are used, one male and one female, and the spectrum of the masker is shaped according to these voices. For both voices, the SRT is measured both in noise spectrally shaped according to the target voice and in noise shaped according to the other voice. The results show that, for normal-hearing listeners, the SRT for sentences in modulated noise is 4-6 dB lower than for steady-state noise; for sentences masked by a competing voice, this difference is 6-8 dB. For listeners with moderate sensorineural hearing loss, elevated thresholds are obtained without an appreciable effect of masker fluctuations. The implications of these results for estimating a hearing handicap in everyday conditions are discussed. By using the articulation index (AI), it is shown that hearing-impaired individuals perform more poorly than suggested by the loss of audibility for some parts of the speech signal. Finally, three mechanisms that may contribute to the absence of unmasking by masker fluctuations in hearing-impaired listeners are discussed. The low sensation level at which the impaired listeners receive the masker seems a major determinant; the second and third factors are reduced temporal resolution and a reduction in comodulation masking release, respectively.

844 citations

Journal ArticleDOI
TL;DR: Speech reception thresholds (SRT) for sentences in noise were determined for normal-hearing subjects in a study of the effect of interaural time delay (ITD) and acoustic headshadow on binaural speech intelligibility in noise.
Abstract: A study was made of the effect of interaural time delay (ITD) and acoustic headshadow on binaural speech intelligibility in noise. A free-field condition was simulated by presenting recordings, made with a KEMAR manikin in an anechoic room, through earphones. Recordings were made of speech, reproduced in front of the manikin, and of noise, emanating from seven angles in the azimuthal plane, ranging from 0 degree (frontal) to 180 degrees in steps of 30 degrees. From this noise, two signals were derived, one containing only ITD, the other containing only interaural level differences (ILD) due to headshadow. Using this material, speech reception thresholds (SRT) for sentences in noise were determined for a group of normal-hearing subjects. Results show that (1) for noise azimuths between 30 degrees and 150 degrees, the gain due to ITD lies between 3.9 and 5.1 dB, while the gain due to ILD ranges from 3.5 to 7.8 dB, and (2) ILD decreases the effectiveness of binaural unmasking due to ITD (on the average, the threshold shift drops from 4.6 to 2.6 dB). In a second experiment, also conducted with normal-hearing subjects, similar stimuli were used, but now presented monaurally or with an overall 20-dB attenuation in one channel, in order to simulate hearing loss. In addition, SRTs were determined for noise with fixed ITDs, for comparison with the results obtained with head-induced (frequency dependent) ITDs.(ABSTRACT TRUNCATED AT 250 WORDS)

509 citations