Mel cepstral coefficient modification based on the Glimpse Proportion measure for improving the intelligibility of HMM-generated synthetic speech in noise
Citations
Using linguistic predictability and the Lombard effect to increase the intelligibility of synthetic speech in noise
A phonetic-contrast motivated adaptation to control the degree-of-articulation on Italian HMM-based synthetic voices.
Intelligibility Enhancement of Speech in Noise
Context-aware speech synthesis: A human-inspired model for monitoring and adapting synthetic speech
Exploring Listeners' Speech Rate Preferences.
References
Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: possible role of a repetitive structure in sounds
A glimpsing model of speech perception in noise.
A Speech Parameter Generation Algorithm Considering Global Variance for HMM-Based Speech Synthesis
An adaptive algorithm for mel-cepstral analysis of speech
ICRA noises: artificial noise signals with speech-like spectral and temporal properties for hearing instrument assessment. International Collegium for Rehabilitative Audiology.
Frequently Asked Questions (10)
Q2. What future work is planned in "Mel cepstral coefficient modification based on the Glimpse Proportion measure for improving the intelligibility of HMM-generated synthetic speech in noise"?
In future work, the authors plan to investigate reallocating energy across time. They also plan to operate under a loudness constraint rather than an energy constraint.
Q3. How did the authors use the LTAS for the Lombard voice?
As stopping criteria, the authors used both error convergence and a maximum distortion threshold, set to a 10% relative increase in the Euclidean distance between the STEP representations of the original and modified speech.
Q4. How did the authors extract the Mel cepstral coefficients?
To train, adapt and generate speech, the authors extracted 59 Mel cepstral coefficients with α = 0.77, Mel-scale F0, and 25 aperiodicity energy bands, using STRAIGHT [8].
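As an illustration of what the α parameter controls, the sketch below evaluates the phase response of a first-order all-pass filter, which is the frequency warping commonly used in Mel cepstral analysis. The function name `warp_frequency` is hypothetical, and the source does not state the sampling rate or the analysis tool's internals, so this is an assumption-laden sketch rather than the authors' extraction pipeline.

```python
import numpy as np

def warp_frequency(omega, alpha=0.77):
    """Warp normalized angular frequency (radians, 0..pi) with a
    first-order all-pass filter of coefficient alpha.

    alpha = 0.77 is the value quoted in the text; the function itself
    is an illustrative assumption, not code from the paper.
    """
    omega = np.asarray(omega, dtype=float)
    # Phase response of the all-pass filter (z^-1 - alpha) / (1 - alpha z^-1).
    return omega + 2.0 * np.arctan(
        alpha * np.sin(omega) / (1.0 - alpha * np.cos(omega))
    )

# Example: a uniform grid of frequencies and its warped counterpart.
grid = np.linspace(0.0, np.pi, 5)
warped = warp_frequency(grid)
```

With a positive α the warped axis devotes more resolution to low frequencies, while the endpoints 0 and π map to themselves; with α = 0 the mapping reduces to the identity.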
Q5. How many speakers did the authors use for the listening test?
For the listening test, the authors used 32 native English speakers, who listened to the noisy samples over headphones in soundproof booths and typed in what they heard.
Q6. What is the GP measure for speech intelligibility in noise?
The Glimpse Proportion (GP) measure for speech intelligibility in noise [3] is the proportion of spectro-temporal regions, called glimpses, where speech is more energetic than noise.
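The definition above can be sketched in a few lines, assuming the speech and noise are already available as time-frequency energy maps in dB on the same grid. The function name `glimpse_proportion` and the `threshold_db` parameter are hypothetical; the default of 0 dB mirrors the wording "more energetic than noise", and a positive margin can be passed if a stricter local SNR criterion is wanted. This sketch does not reproduce the auditory (STEP) front end.

```python
import numpy as np

def glimpse_proportion(speech_db, noise_db, threshold_db=0.0):
    """Fraction of time-frequency cells where speech exceeds noise.

    speech_db, noise_db: arrays of identical shape (channels x frames),
    holding energies in dB. threshold_db is an assumed margin that the
    local SNR must exceed for a cell to count as a glimpse.
    """
    speech_db = np.asarray(speech_db, dtype=float)
    noise_db = np.asarray(noise_db, dtype=float)
    if speech_db.shape != noise_db.shape:
        raise ValueError("speech and noise maps must have the same shape")
    # A cell is a glimpse when the local SNR is above the threshold.
    glimpses = (speech_db - noise_db) > threshold_db
    return float(glimpses.mean())

# Toy example: 2 channels x 2 frames.
speech = [[10.0, 0.0], [5.0, 5.0]]
noise = [[0.0, 10.0], [5.0, 0.0]]
gp = glimpse_proportion(speech, noise)
```

In the toy example two of the four cells have speech strictly above noise, so the measure is 0.5; raising the threshold can only keep or lower that value.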
Q7. What can be done to improve intelligibility in noise?
If such data are not available, then the authors can apply noise-independent modifications at the feature level based on known acoustic properties of Lombard speech, such as F0 increase, flattening of spectral tilt and duration stretch [1].
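The three noise-independent modifications named above can be illustrated with simple signal-level operations. Everything below is a hypothetical sketch: the function names, the default amounts (2 semitones, a 1.2 stretch factor, a 0.5 pre-emphasis coefficient) and the use of first-order pre-emphasis to flatten spectral tilt are assumptions chosen for illustration, not values or methods taken from the paper.

```python
import numpy as np

def raise_f0(f0_hz, semitones=2.0):
    """Raise F0 in voiced frames; frames with F0 == 0 are treated as unvoiced."""
    f0 = np.asarray(f0_hz, dtype=float)
    factor = 2.0 ** (semitones / 12.0)
    return np.where(f0 > 0, f0 * factor, 0.0)

def stretch_durations(frames_per_state, factor=1.2):
    """Lengthen per-state durations (in frames), keeping at least one frame."""
    d = np.asarray(frames_per_state, dtype=float)
    return np.maximum(1, np.rint(d * factor)).astype(int)

def flatten_tilt(signal, coeff=0.5):
    """First-order pre-emphasis y[n] = x[n] - coeff * x[n-1].

    This boosts high frequencies relative to low ones, which is one
    simple (assumed) way to flatten spectral tilt.
    """
    x = np.asarray(signal, dtype=float)
    y = x.copy()
    y[1:] -= coeff * x[:-1]
    return y

# Toy inputs: an F0 track with one unvoiced frame, two state durations,
# and a constant (purely low-frequency) signal.
new_f0 = raise_f0([0.0, 100.0], semitones=12.0)
new_dur = stretch_durations([5, 10], factor=1.2)
new_sig = flatten_tilt([1.0, 1.0, 1.0], coeff=0.5)
```

Such fixed modifications ignore the actual noise, which is the contrast the text draws with approaches that adapt to data or to a noise-dependent measure.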
Q8. What are the spectral parameters that define the excitation signal?
The intelligibility gains obtained by the full Lombard voice L over the N-L voice reflect the impact of changes in duration patterns, F0 and the aperiodicity parameters that define the excitation signal, as pointed out in Table 2.
Q9. Why did the authors use an average voice model?
The authors decided to use an average voice model rather than building a speaker-dependent voice because the normal speech dataset was not phonetically balanced.
Q10. What is the difference between the Lombard and the N-L voice?
Moreover, the authors observed that, for the competing talker, the intelligibility gain obtained by the Lombard voice over the modified voice was mainly due to changes in duration, F0 and excitation parameters.