Intelligibility-enhancing speech modifications: the Hurricane Challenge
Summary
1. Introduction
- Speech output, whether from mobile phones, public address systems or simply domestic audio devices, is widely used.
- In many listening contexts the intelligibility of the intended message might be compromised by environmental noise or channel distortion.
- Consequently it is of interest to compare the performance of intelligibility-enhancing speech modification algorithms using shared data and metrics.
- The idea of a common evaluation of algorithms was piloted in 2012 within the EU-funded 'Listening Talker' project.
2. The Challenge problem
- Entrants to the Challenge (section 3) were provided with a corpus of speech and noise waveforms (section 2.1), as well as optional data resources to construct/adapt a TTS system (section 2.2).
- Entrants then returned algorithmically-modified or synthetically-generated speech waveforms for the entire corpus.
- These were subjected to evaluation by listeners (section 4).
- Entrants had around 6 weeks to prepare their modified signals, and all made a financial contribution to the cost of listening tests.
2.1. Speech and noise corpora
- The 'Plain' unmodified natural speech corpus consists of the first 180 sentences of the Harvard corpus [17] read by a male British English speaker.
- The Harvard corpus contains sentences such as "the salt breeze came across from the sea" arranged into phonemically-balanced subsets.
- The Plain corpus was elicited as read speech from a highly-intelligible speaker, and can therefore be considered as intrinsically rather clear (i.e. hyper-articulated).
- Entrants also received six sets of noise waveforms for each utterance arising from the combination of two masker types at three signal-to-noise ratios (SNRs).
- Entrants were permitted to modify the overall duration of the speech within these limits (i.e. a maximum total extension of 1 s).
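
The combination of a masker with an utterance at a given SNR can be illustrated with a minimal sketch. It assumes an RMS-based SNR definition and scales the noise rather than the speech; the summary does not state which level convention the Challenge used, and the function names `rms`, `mix_at_snr` and `measured_snr_db` are hypothetical. Plain lists are used so that the sketch has no dependencies.

```python
import math

def rms(x):
    """Root-mean-square level of a sequence of samples."""
    return math.sqrt(sum(s * s for s in x) / len(x))

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that the speech-to-noise RMS ratio equals `snr_db`, then add.

    Assumption: SNR is defined on whole-signal RMS levels and both signals
    have the same length. Returns (mixture, scaled_noise).
    """
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    # Target noise RMS is speech RMS divided by 10^(SNR/20).
    gain = rms(speech) / (rms(noise) * 10 ** (snr_db / 20))
    scaled = [gain * n for n in noise]
    mixture = [s + n for s, n in zip(speech, scaled)]
    return mixture, scaled

def measured_snr_db(speech, noise):
    """SNR in dB computed from the RMS levels of the two signals."""
    return 20 * math.log10(rms(speech) / rms(noise))

# Toy signals standing in for an utterance and a masker (illustrative only).
speech = [1.0, -1.0, 1.0, -1.0]
noise = [0.5, 0.5, -0.5, -0.5]
mixture, scaled_noise = mix_at_snr(speech, noise, snr_db=-3.0)
```

Checking `measured_snr_db(speech, scaled_noise)` after mixing is a quick way to confirm that the requested SNR was actually reached.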
2.2. TTS
- In addition to the speech and noise waveforms outlined above, those entrants wishing to submit a TTS entry had available two natural speech datasets (spoken by the same speaker who produced the Plain material) and associated orthographic transcriptions.
- One consists of about 3 hours of additional unmodified natural speech for three different reading materials: 2023 newspaper-style sentences, 300 sentences containing words from the modified rhyme test [18] inserted in the carrier sentence 'Now the authors will say word again', and the remaining 540 Harvard sentences not used in the evaluation.
- The second dataset consists of just under 1 hour of Lombard speech from the same speaker who produced the Plain corpus, recorded with speech-modulated noise from a male speaker [19] played at 84 dBA over headphones.
- This dataset consists of the same reading material as the Plain set with the exception of the newspaper sentences.
3. Challenge entries
- Each entry has a short name which is used in the results presentation.
- Dynamic range compression is applied to decrease amplitude differences between vowels and consonants.
- In the SSS entry, steady-state portions of speech (syllable nuclei) are detected from spectral transitions and their amplitudes are suppressed, given their lesser importance for speech perception and their greater energy compared with transient portions (syllable onsets and codas) [28].
- The uwSSDRCt entry incorporates additional spectral and time-domain modifications into the Spectral Shaping and Dynamic Range Compression method [15].
- The excitation and duration parameters of the voice 'TTS' were adapted to the Lombard dataset provided in order to mimic a speaker's Lombard duration and F0 changes.
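
The idea behind dynamic range compression, mentioned for several entries above, can be sketched as follows. This is a generic static compressor with a one-pole envelope follower, not the algorithm of any Challenge entry; the parameter values (`threshold`, `ratio`, `smooth`) and the final rescaling to the input RMS are assumptions made for illustration.

```python
import math

def rms(x):
    """Root-mean-square level of a sequence of samples."""
    return math.sqrt(sum(s * s for s in x) / len(x))

def level_db(x):
    """RMS level in dB relative to full scale 1.0."""
    return 20.0 * math.log10(rms(x))

def compress(samples, threshold=0.1, ratio=4.0, smooth=0.99):
    """Attenuate portions whose envelope exceeds `threshold`, then restore the input RMS."""
    out = []
    env = 0.0
    for s in samples:
        # One-pole envelope follower on the rectified signal.
        env = smooth * env + (1.0 - smooth) * abs(s)
        # Above threshold, output level grows only 1/ratio as fast as input level.
        gain = (env / threshold) ** (1.0 / ratio - 1.0) if env > threshold else 1.0
        out.append(gain * s)
    # Assumption: rescale so the output has the same overall RMS as the input,
    # which makes the comparison one of energy redistribution, not loudness.
    scale = rms(samples) / rms(out)
    return [scale * s for s in out]

# Toy signal: a loud "vowel-like" segment followed by a quiet "consonant-like" one.
loud = [1.0 * math.sin(0.3 * i) for i in range(2000)]
quiet = [0.05 * math.sin(0.3 * i) for i in range(2000)]
signal = loud + quiet
processed = compress(signal)

# Level difference between the two segments before and after compression.
before = level_db(signal[:2000]) - level_db(signal[2000:])
after = level_db(processed[:2000]) - level_db(processed[2000:])
```

With these toy signals `after` is smaller than `before`: at equal overall RMS, the quiet segment ends up at a higher level relative to the loud one.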
4. Listener evaluation
- Within each block, entries were mixed such that by listening to 6 blocks (=180 sentences) a single participant would hear 9 sentences from each entry.
- Listeners were given two short practice sessions, one per masker type, presented at 0 dB SNR for the speech-shaped noise (SSN) masker and -3 dB SNR for the competing speaker (CS) masker, using Plain speech Harvard sentences from outside the sets used for the main test.
- The subsequent stimulus was presented automatically after the entry of a response.
- Responses were scored in terms of number of words correctly identified.
- To permit comparison of entries, Fisher's least significant differences in dB and percentage points are also tabulated, computed using separate ANOVAs for each SNR level and masker type with a single factor of modification entry.
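
The design arithmetic and the word-level scoring described above can be sketched in a few lines. The scoring rule below counts every target word that appears in the response, ignoring case and punctuation; the summary does not say whether the Challenge scored all words or only keywords, so this rule and the function name `words_correct` are assumptions. The per-block and per-entry counts are derived from the figures in the text, assuming each sentence is heard once, from exactly one entry.

```python
import string

def words_correct(target, response):
    """Count target words found in the response; each response word is used at most once."""
    def tokens(text):
        # Lower-case and strip surrounding punctuation from each word.
        return [w.strip(string.punctuation) for w in text.lower().split()]

    remaining = tokens(response)
    hits = 0
    for word in tokens(target):
        if word in remaining:
            remaining.remove(word)
            hits += 1
    return hits

# Design arithmetic implied by the text (derived, not stated in the summary).
total_sentences = 180
blocks = 6
sentences_per_entry = 9
sentences_per_block = total_sentences // blocks           # 30
implied_entries = total_sentences // sentences_per_entry  # 20

# Example response typed by a hypothetical listener.
target = "the salt breeze came across from the sea"
response = "The soft breeze came from the sea."
score = words_correct(target, response)   # 6 of the 8 target words
proportion = score / len(target.split())  # 0.75
```

Removing each matched word from `remaining` prevents a single typed "the" from being credited against both occurrences in the target.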
5. Discussion
- Large intelligibility gains equivalent to boosting the level of unmodified speech by up to 5.6 dB were observed, with similar-sized increases over both natural and TTS baselines and for both types of masker.
- Not surprisingly, Plain speech was more intelligible than unmodified TTS, although the gap reduced with decreasing SNR, from around 4 dB to 3 dB for SSN and from around 7 dB to 5 dB for CS.
- One striking outcome of the Challenge is the finding that three modified TTS entries (PSSDRC-syn, TTSLGP-DRC, GlottLombard) reached and even exceeded the intelligibility level of Plain speech in stationary noise, with PSSDRC-syn also showing marginal gains for the CS masker.
- Intriguingly, there was no clear advantage for entries that used prior knowledge of the masker.
- Durational changes were used by nearly half of natural speech entries and all TTS systems and appear to have contributed to good performance in several cases, especially for the GCRetime approach which exploits temporal fluctuations in the masker.
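
Expressing an intelligibility gain as an equivalent level boost in dB amounts to asking at what SNR unmodified speech would reach the score obtained by the modified speech. The sketch below does this by linear interpolation between measured points; the summary does not describe the procedure actually used, the function names are hypothetical, and the scores in the example are invented for illustration rather than taken from the Challenge results.

```python
def snr_for_score(curve, score):
    """Invert a strictly increasing (snr_db, score) curve by linear interpolation."""
    points = sorted(curve)
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if y0 <= score <= y1:
            return x0 + (score - y0) * (x1 - x0) / (y1 - y0)
    raise ValueError("score lies outside the measured range")

def equivalent_gain_db(curve, snr_db, modified_score):
    """Level boost that unmodified speech would need to match `modified_score`."""
    return snr_for_score(curve, modified_score) - snr_db

# Hypothetical baseline: percentage of words correct for unmodified speech.
plain_curve = [(-9.0, 20.0), (-4.0, 50.0), (1.0, 80.0)]

# Hypothetical modified entry scoring 44% at -9 dB SNR.
gain = equivalent_gain_db(plain_curve, snr_db=-9.0, modified_score=44.0)  # 4.0 dB
```

Scores that fall outside the measured range raise an error rather than being extrapolated, since the shape of the psychometric function beyond the tested SNRs is unknown.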