Journal ArticleDOI

Informational and energetic masking effects in the perception of multiple simultaneous talkers.

29 Oct 2001-Journal of the Acoustical Society of America (Acoustical Society of America)-Vol. 110, Iss: 3, pp 2527-2538
TL;DR: In this experiment, the intelligibility of a target phrase masked by a single competing masker phrase was measured as a function of signal-to-noise ratio (SNR) with same-talker, same-sex, and different-sex target and masker voices.
Abstract: Although most recent multitalker research has emphasized the importance of binaural cues, monaural cues can play an equally important role in the perception of multiple simultaneous speech signals. In this experiment, the intelligibility of a target phrase masked by a single competing masker phrase was measured as a function of signal-to-noise ratio (SNR) with same-talker, same-sex, and different-sex target and masker voices. The results indicate that informational masking, rather than energetic masking, dominated performance in this experiment. The amount of masking was highly dependent on the similarity of the target and masker voices: performance was best when different-sex talkers were used and worst when the same talker was used for target and masker. Performance did not, however, improve monotonically with increasing SNR. Intelligibility generally plateaued at SNRs below 0 dB and, in some cases, intensity differences between the target and masking voices produced substantial improvements in performance with decreasing SNR. The results indicate that informational and energetic masking play substantially different roles in the perception of competing speech messages.
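To make the SNR manipulation concrete, below is a minimal sketch (not from the paper) of how a target phrase and a single competing masker phrase could be mixed at a prescribed SNR; the function name and the equal-length, mono-waveform assumptions are illustrative.

```python
import numpy as np

def mix_at_snr(target: np.ndarray, masker: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale the masker so the target-to-masker level difference equals snr_db,
    then return the sum. Both inputs are mono waveforms of equal length."""
    rms_target = np.sqrt(np.mean(target ** 2))
    rms_masker = np.sqrt(np.mean(masker ** 2))
    # Gain that places the masker snr_db below (for negative SNR: above) the target
    gain = rms_target / (rms_masker * 10.0 ** (snr_db / 20.0))
    return target + gain * masker

# Example: a -8 dB condition, i.e. the masker is 8 dB more intense than the target
# mixture = mix_at_snr(target_phrase, masker_phrase, snr_db=-8.0)
```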


Citations
Journal ArticleDOI
TL;DR: An audio-visual corpus consisting of high-quality audio and video recordings of 1000 sentences spoken by each of 34 talkers is presented to support the use of common material in speech perception and automatic speech recognition studies.
Abstract: An audio-visual corpus has been collected to support the use of common material in speech perception and automatic speech recognition studies. The corpus consists of high-quality audio and video recordings of 1000 sentences spoken by each of 34 talkers. Sentences are simple, syntactically identical phrases such as "place green at B 4 now". Intelligibility tests using the audio signals suggest that the material is easily identifiable in quiet and low levels of stationary noise. The annotated corpus is available on the web for research use.
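As an illustration of the fixed sentence syntax, the sketch below draws sentences with the same slot structure as "place green at B 4 now"; the slot vocabularies are hypothetical and not taken from the corpus.

```python
import random

# Hypothetical slot vocabularies; the corpus's actual word lists may differ.
COMMANDS     = ["place", "set", "lay", "bin"]
COLORS       = ["green", "blue", "red", "white"]
PREPOSITIONS = ["at", "by", "in", "with"]
LETTERS      = list("ABCDEFGH")
DIGITS       = [str(d) for d in range(10)]
ADVERBS      = ["now", "again", "please", "soon"]

def random_sentence(rng: random.Random) -> str:
    """Draw one command-color-preposition-letter-digit-adverb sentence."""
    slots = (COMMANDS, COLORS, PREPOSITIONS, LETTERS, DIGITS, ADVERBS)
    return " ".join(rng.choice(words) for words in slots)

print(random_sentence(random.Random(0)))  # e.g. "place green at B 4 now"
```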

1,088 citations

Book ChapterDOI
TL;DR: This chapter reviews recent advancements in studies of vocal adaptations to interference by background noise and relates these to fundamental issues in sound perception in animals and humans.
Abstract: Environmental noise can affect acoustic communication by limiting the broadcast area, or active space, of a signal through reduced signal-to-noise ratios at the position of the receiver. At the same time, noise is ubiquitous in all habitats and is therefore likely to disturb animals, as well as humans, under many circumstances. Both animals and humans, however, have evolved diverse solutions to the background noise problem, and this chapter reviews recent advances in studies of vocal adaptations to interference by background noise and relates them to fundamental issues in sound perception. The chapter starts on the sender's side, considering the potential evolutionary shaping of species-specific signal characteristics and individual short-term adjustments of signal features. It then turns to the receivers of signals, reviews their sensory capacities for signal detection, recognition, and discrimination, and relates these issues to auditory scene analysis and the ecological concept of signal space. Data from studies on insects, anurans, birds, and mammals, including humans, and, to a lesser extent, available work on fish and reptiles, are also discussed in the chapter.

845 citations

Journal ArticleDOI
10 May 2012-Nature
TL;DR: It is demonstrated that population responses in non-primary human auditory cortex encode critical features of attended speech: speech spectrograms reconstructed based on cortical responses to the mixture of speakers reveal the salient spectral and temporal features of the attended speaker, as if subjects were listening to that speaker alone.
Abstract: The neural correlates of how attended speech is internally represented are described, shedding light on the 'cocktail party problem'. The 'cocktail-party problem', the question of what goes on in our brains when we listen selectively for one person's voice while ignoring many others, has puzzled researchers from various disciplines for years. Using electrophysiological recordings from neurosurgery patients listening to two speakers simultaneously, Nima Mesgarani and Edward Chang determine the neural correlates associated with the internal representation of attended speech. They find that the neural responses in the auditory cortex represent the attended voice robustly, almost as if the second voice were not there. With these patterns established, a simple algorithm trained on various speakers predicts which stimulus a subject is attending to, on the basis of the patterns emerging in the secondary auditory cortex. These results suggest that speech representation in the brain reflects not only the acoustic environment, but also the listener's understanding of these signals. As well as shedding light on a long-standing neurobiological problem, this work may give clues as to how automatic speech recognition might be improved to cope with more than one talker.

Humans possess a remarkable ability to attend to a single speaker's voice in a multi-talker background [1-3]. How the auditory system manages to extract intelligible speech under such acoustically complex and adverse listening conditions is not known, and, indeed, it is not clear how attended speech is internally represented [4, 5]. Here, using multi-electrode surface recordings from the cortex of subjects engaged in a listening task with two simultaneous speakers, we demonstrate that population responses in non-primary human auditory cortex encode critical features of attended speech: speech spectrograms reconstructed based on cortical responses to the mixture of speakers reveal the salient spectral and temporal features of the attended speaker, as if subjects were listening to that speaker alone. A simple classifier trained solely on examples of single speakers can decode both attended words and speaker identity. We find that task performance is well predicted by a rapid increase in attention-modulated neural selectivity across both single-electrode and population-level cortical responses. These findings demonstrate that the cortical representation of speech does not merely reflect the external acoustic environment, but instead gives rise to the perceptual aspects relevant for the listener's intended goal.
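The decoding step described above can be illustrated with a schematic correlation classifier: given a spectrogram reconstructed from cortical responses, pick the candidate speaker whose clean spectrogram matches it best. This is a sketch of the idea only, not the authors' pipeline; array shapes and names are assumptions.

```python
import numpy as np

def decode_attended(reconstructed: np.ndarray, candidates: dict) -> str:
    """Return the name of the candidate speaker whose clean spectrogram is most
    correlated with the spectrogram reconstructed from cortical responses."""
    def corr(a: np.ndarray, b: np.ndarray) -> float:
        a, b = a.ravel() - a.mean(), b.ravel() - b.mean()
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    return max(candidates, key=lambda name: corr(reconstructed, candidates[name]))

# Usage with hypothetical arrays of shape [frequency_bins, time_frames]:
# attended = decode_attended(recon_spec, {"speaker_A": spec_A, "speaker_B": spec_B})
```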

804 citations

Journal ArticleDOI
TL;DR: Recording from subjects selectively listening to one of two competing speakers using magnetoencephalography indicates that concurrent auditory objects, even if spectrotemporally overlapping and not resolvable at the auditory periphery, are neurally encoded individually in auditory cortex and emerge as fundamental representational units for top-down attentional modulation and bottom-up neural adaptation.
Abstract: A visual scene is perceived in terms of visual objects. Similar ideas have been proposed for the analogous case of auditory scene analysis, although their hypothesized neural underpinnings have not yet been established. Here, we address this question by recording from subjects selectively listening to one of two competing speakers, of either different or the same sex, using magnetoencephalography. Individual neural representations of the two speakers' speech are observed, each selectively phase-locked to the rhythm of the corresponding speech stream, and each allowing the temporal envelope of that stream alone to be reconstructed. The neural representation of the attended speech dominates responses (with latency near 100 ms) in posterior auditory cortex. Furthermore, when the intensities of the attended and background speakers are varied independently over an 8-dB range, the neural representation of the attended speech adapts only to the intensity of that speaker and not to the intensity of the background speaker, suggesting an object-level intensity gain control. In summary, these results indicate that concurrent auditory objects, even if spectrotemporally overlapping and not resolvable at the auditory periphery, are neurally encoded individually in auditory cortex and emerge as fundamental representational units for top-down attentional modulation and bottom-up neural adaptation.
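A common way to obtain such a reconstruction is a linear backward (stimulus-reconstruction) model that maps time-lagged sensor data onto the speech envelope. The ridge-regression sketch below illustrates that idea under assumed lag and regularization settings; it is not the authors' exact analysis.

```python
import numpy as np

def reconstruct_envelope(meg: np.ndarray, envelope: np.ndarray,
                         max_lag: int = 25, ridge: float = 1.0) -> np.ndarray:
    """Fit a backward model predicting the speech envelope from time-lagged
    MEG channels with ridge regression; return the reconstructed envelope.
    meg has shape [time, channels]; envelope has shape [time]."""
    n_times, n_channels = meg.shape
    # Design matrix with lags 0..max_lag samples for every channel
    design = np.zeros((n_times, n_channels * (max_lag + 1)))
    for lag in range(max_lag + 1):
        design[lag:, lag * n_channels:(lag + 1) * n_channels] = meg[:n_times - lag]
    # Closed-form ridge solution: w = (X'X + lambda*I)^-1 X'y
    gram = design.T @ design + ridge * np.eye(design.shape[1])
    weights = np.linalg.solve(gram, design.T @ envelope)
    return design @ weights
```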

696 citations

Journal ArticleDOI
TL;DR: An automatic speech recognition system, adapted for use with partially specified inputs, was used to identify consonants in noise; a transmitted information analysis revealed that cues to voicing are degraded more in the model than in human auditory processing.
Abstract: Do listeners process noisy speech by taking advantage of "glimpses"-spectrotemporal regions in which the target signal is least affected by the background? This study used an automatic speech recognition system, adapted for use with partially specified inputs, to identify consonants in noise. Twelve masking conditions were chosen to create a range of glimpse sizes. Several different glimpsing models were employed, differing in the local signal-to-noise ratio (SNR) used for detection, the minimum glimpse size, and the use of information in the masked regions. Recognition results were compared with behavioral data. A quantitative analysis demonstrated that the proportion of the time-frequency plane glimpsed is a good predictor of intelligibility. Recognition scores in each noise condition confirmed that sufficient information exists in glimpses to support consonant identification. Close fits to listeners' performance were obtained at two local SNR thresholds: one at around 8 dB and another in the range -5 to -2 dB. A transmitted information analysis revealed that cues to voicing are degraded more in the model than in human auditory processing.
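The central quantity, the proportion of the time-frequency plane that is glimpsed, can be sketched as below; the 3 dB default threshold is only a placeholder, since the study compared several local SNR criteria.

```python
import numpy as np

def glimpse_proportion(target_spec: np.ndarray, masker_spec: np.ndarray,
                       local_snr_db: float = 3.0) -> float:
    """Fraction of time-frequency cells where the target power exceeds the
    masker power by at least local_snr_db (a 'glimpse'). Inputs are power
    spectrograms of identical shape [frequency_bins, time_frames]."""
    eps = 1e-12
    local_snr = 10.0 * np.log10((target_spec + eps) / (masker_spec + eps))
    return float(np.mean(local_snr >= local_snr_db))
```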

693 citations

References
Journal ArticleDOI
TL;DR: In this paper, the relation between the messages received by the two ears was investigated, and two types of test were reported: (a) the behavior of a listener when presented with two speech signals simultaneously (statistical filtering problem) and (b) behavior when different speech signals are presented to his two ears.
Abstract: This paper describes a number of objective experiments on recognition, concerning in particular the relation between the messages received by the two ears. Rather than steady tones or clicks (frequency or time-point signals), continuous speech is used, and the results are interpreted mainly statistically. Two types of test are reported: (a) the behavior of a listener presented with two speech signals simultaneously (a statistical filtering problem) and (b) behavior when different speech signals are presented to the two ears.

3,562 citations

Journal ArticleDOI
TL;DR: The authors discuss the factors which govern the intelligibility of speech sounds and present relationships which have been developed for expressing quantitatively, in terms of the fundamental characteristics of speech and hearing, the capability of the ear in recognizing these sounds.
Abstract: This paper discusses the factors which govern the intelligibility of speech sounds and presents relationships which have been developed for expressing quantitatively, in terms of the fundamental characteristics of speech and hearing, the capability of the ear in recognizing these sounds. The success of the listener in recognizing and interpreting speech sounds depends on their intensity in the ear and the intensity of unwanted sounds that may be present. The relationships presented in this paper deal with the intelligibility of speech sounds as a function of these intensities in the different frequency ranges of speech. Results calculated from these relationships are compared with observed results for a variety of systems.
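In the spirit of those relationships, a simplified band-audibility calculation is sketched below: each band's SNR is mapped to an audibility score over a 30-dB range and weighted by a band-importance function. The (SNR + 15)/30 mapping follows later standardized formulations rather than this paper's exact expressions, and the example values are hypothetical.

```python
import numpy as np

def articulation_index(band_snr_db: np.ndarray, importance: np.ndarray) -> float:
    """Simplified articulation-index-style score: per-band audibility in [0, 1]
    weighted by band-importance values that sum to 1."""
    audibility = np.clip((band_snr_db + 15.0) / 30.0, 0.0, 1.0)
    return float(np.sum(importance * audibility))

# Usage with hypothetical band SNRs and importance weights:
# ai = articulation_index(np.array([12.0, 5.0, -3.0, -20.0]),
#                         np.array([0.20, 0.35, 0.30, 0.15]))
```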

1,030 citations

Journal ArticleDOI
TL;DR: The speech-reception threshold (SRT) for sentences presented in a fluctuating interfering background sound of 80 dBA SPL is measured, and it is shown that hearing-impaired individuals perform more poorly than would be suggested by the loss of audibility of some parts of the speech signal.
Abstract: The speech-reception threshold (SRT) for sentences presented in a fluctuating interfering background sound of 80 dBA SPL is measured for 20 normal-hearing listeners and 20 listeners with sensorineural hearing impairment. The interfering sounds range from steady-state noise, via modulated noise, to a single competing voice. Two voices are used, one male and one female, and the spectrum of the masker is shaped according to these voices. For both voices, the SRT is measured both in noise spectrally shaped according to the target voice and in noise shaped according to the other voice. The results show that, for normal-hearing listeners, the SRT for sentences in modulated noise is 4-6 dB lower than in steady-state noise; for sentences masked by a competing voice, this difference is 6-8 dB. For listeners with moderate sensorineural hearing loss, elevated thresholds are obtained without an appreciable effect of masker fluctuations. The implications of these results for estimating a hearing handicap in everyday conditions are discussed. Using the articulation index (AI), it is shown that hearing-impaired individuals perform more poorly than would be suggested by the loss of audibility of some parts of the speech signal. Finally, three mechanisms that contribute to the absence of unmasking by masker fluctuations in hearing-impaired listeners are discussed. The low sensation level at which the impaired listeners receive the masker appears to be a major determinant; the other two factors are reduced temporal resolution and a reduction in comodulation masking release.
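SRT measurements of this kind are typically obtained with a simple adaptive track in which the SNR is lowered after a correct response and raised after an incorrect one. The sketch below illustrates that logic; the step size, trial count, and averaging rule are assumptions, not the paper's exact procedure.

```python
def track_srt(present_sentence, start_snr_db: float = 0.0,
              step_db: float = 2.0, n_trials: int = 13) -> float:
    """One-up/one-down adaptive track. present_sentence(snr_db) presents a
    sentence at that SNR and returns True if the listener repeats it correctly.
    The SRT estimate is the mean SNR over the later trials."""
    snr = start_snr_db
    levels = []
    for _ in range(n_trials):
        correct = present_sentence(snr)
        levels.append(snr)
        snr += -step_db if correct else step_db
    return sum(levels[3:]) / len(levels[3:])  # discard the first few trials
```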

844 citations

Journal ArticleDOI
TL;DR: A database of speech samples from eight different talkers has been collected for use in multitalker communications research and the nature of the corpus, the data collection methodology, and the means for obtaining copies of the database are presented.
Abstract: A database of speech samples from eight different talkers has been collected for use in multitalker communications research. Descriptions of the nature of the corpus, the data collection methodology, and the means for obtaining copies of the database are presented.

488 citations