Cepstral analysis based on the glimpse proportion measure for improving the intelligibility of HMM-based synthetic speech in noise
Citations
Evaluating the intelligibility benefit of speech modifications in known noise conditions
The listening talker: A review of human and algorithmic context-induced modifications of speech
Maximizing Phoneme Recognition Accuracy for Enhanced Speech Intelligibility in Noise
Intelligibility enhancement of HMM-generated speech in additive noise by modifying Mel cepstral coefficients to increase the glimpse proportion
Evaluating the predictions of objective intelligibility metrics for modified and synthetic speech
References
Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: possible role of a repetitive structure in sounds
A glimpsing model of speech perception in noise.
A Speech Parameter Generation Algorithm Considering Global Variance for HMM-Based Speech Synthesis
A revision of Zwicker's loudness model
The contribution of changes in F0 and spectral tilt to increased intelligibility of speech produced in noise
Frequently Asked Questions (16)
Q2. What was the speech material used to generate vocoded speech?
The speech material the authors used to generate vocoded speech was the semantically unpredictable sentences (SUS) set from the Blizzard Challenge 2010.
Q3. What do Nt, Nf and L(.) denote in the GP measure?
Nt and Nf are the number of time frames and frequency channels respectively; L(.) is the logistic sigmoid function of zero offset and slope η.
Q4. How many participants were assigned to the experiment?
A total of eight native English speakers participated in the experiment with vocoded speech, and another eight participants were assigned to the experiment with synthetic speech.
Q5. What was the purpose of the experiments?
For these experiments the authors added vocoded and HMM-generated synthetic speech to two different types of stationary noise, speech shaped noise (ssn) and high frequency noise (hf).
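The mixing procedure itself is not described in this excerpt. As a minimal sketch of one common way to add noise to speech at a known, fixed SNR, the noise can be scaled so that the ratio of mean signal powers matches the target. The function name `mix_at_snr` and the use of mean power over the whole signal are assumptions for illustration, not the authors' procedure.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that speech/noise mean power ratio equals snr_db.

    Hypothetical helper: uses global mean power, which is only one of
    several possible SNR definitions.
    Returns (mixture, scaled_noise).
    """
    speech = np.asarray(speech, dtype=float)
    noise = np.asarray(noise, dtype=float)
    if speech.shape != noise.shape:
        raise ValueError("speech and noise must have the same shape")
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    if p_noise == 0.0:
        raise ValueError("noise has zero power")
    # Gain g such that 10*log10(p_speech / (g**2 * p_noise)) == snr_db.
    gain = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    scaled_noise = gain * noise
    return speech + scaled_noise, scaled_noise
```

Returning the scaled noise alongside the mixture makes it easy to verify the achieved SNR or to reuse the same noise signal when computing an intelligibility measure.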
Q6. What are the matrices Gf and S?
$G_f = \mathrm{diag}([g_{f,1} \ldots g_{f,N}])$ is an $N \times N$ diagonal matrix whose diagonal contains the Gammatone filter frequency response for frequency channel $f$; $S = \mathrm{diag}([\upsilon_1 \ldots \upsilon_N])$ is an $N \times N$ diagonal matrix whose diagonal contains the frequency response of the smoothing filter; b = [b1 . . .
Q7. What is the redefined cost function?
The redefined cost function is:
$E_t = \varepsilon_t - \beta \, GP_t$ (5)
where $\varepsilon_t$ is the value of the function described in Eq. (2) in time frame $t$ and $GP_t$ is the time evolution of the GP as defined in Eq. (3):
$GP_t = \frac{100}{N_f} \sum_{f=1}^{N_f} L(y^{sp}_{t,f} - y^{ns}_{t,f})$ (6)
The cepstral coefficient vector $c_t = [c_{t,1} \ldots c_{t,m} \ldots c_{t,M}]^\top$ is given by: $c_t = \arg\min$ …
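To make Eqs. (5) and (6) concrete, here is a minimal sketch that evaluates the per-frame GP and the redefined cost for one time frame. It assumes the logistic sigmoid has the form L(x) = 1 / (1 + exp(-eta * x)); the names `frame_gp` and `frame_cost`, the default `eta=1.0`, and the treatment of the spectral mismatch `eps_t` as a precomputed number are illustrative assumptions, since Eq. (2) is not reproduced here.

```python
import numpy as np

def frame_gp(y_sp_t, y_ns_t, eta=1.0):
    """Per-frame GP (percent), Eq. (6), over the Nf frequency channels.

    Assumes L(x) = 1 / (1 + exp(-eta * x)); eta=1.0 is a placeholder.
    """
    diff = np.asarray(y_sp_t, dtype=float) - np.asarray(y_ns_t, dtype=float)
    return 100.0 / diff.size * np.sum(1.0 / (1.0 + np.exp(-eta * diff)))

def frame_cost(eps_t, y_sp_t, y_ns_t, beta, eta=1.0):
    """Redefined cost E_t = eps_t - beta * GP_t, Eq. (5).

    eps_t is the spectral mismatch of Eq. (2), passed in precomputed.
    """
    return eps_t - beta * frame_gp(y_sp_t, y_ns_t, eta)
```

With `beta = 0` the cost reduces to the spectral mismatch alone; larger `beta` values give more weight to glimpses at the expense of spectral fit.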
Q8. Why is the impact of the proposed method stronger for synthetic speech?
The impact of the proposed method seems to be stronger for synthetic speech, although the GP gains were smaller or similar for synthetic speech, most probably because in harder tasks smaller glimpse variations lead to stronger effects.
Q9. What is the effect of the proposed method on speech intelligibility?
The authors proposed a new cepstral extraction method that aims not only to minimize the mismatch between periodogram and modelled spectrum but also to maximize speech intelligibility in noise, as defined by the Glimpse Proportion measure, given that the noise is known and SNR is known and fixed.
Q10. What funding was used for the research?
The research leading to these results was partly funded from the European Community’s Seventh Framework Programme (FP7/2007-2013) under grant agreements 213850 and 256230 (SCALE and LISTA).
Q11. To which frequency range does the proposed method reallocate energy?
The authors can see that on average the proposed method reallocates energy mostly to the frequency range between 800 Hz and 4.8 kHz, the band where the human auditory system is more sensitive.
Q12. What is the effect of the method for speech shaped noise?
The listening tests with vocoded and synthetic speech showed the effectiveness of the method for speech shaped noise but not for high frequency noise, which might indicate that the amount of distortion introduced into the speech by the modification was too large.
Q13. What is the effect of the proposed method on the word accuracy?
For the high frequency noise case it seems that, although not significantly, the proposed method decreases the word accuracy rates.
Q14. What is the proposed approximated Glimpse Proportion measure?
The proposed approximated Glimpse Proportion measure is then given by:
$GP = \frac{100}{N_f N_t} \sum_{t=1}^{N_t} \sum_{f=1}^{N_f} L(y^{sp}_{t,f} - y^{ns}_{t,f})$ (3)
where $y^{sp}_{t,f}$ and $y^{ns}_{t,f}$ are the approximated STEP representations for speech and noise respectively at analysis window $t$ and frequency channel $f$;
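As an illustration of Eq. (3), the following sketch computes the approximated GP from two arrays of STEP values. It assumes the logistic sigmoid is L(x) = 1 / (1 + exp(-eta * x)) and that the STEP representations are already available as arrays of shape (Nt, Nf); the function names and the default `eta=1.0` are placeholders, not values taken from the paper.

```python
import numpy as np

def logistic(x, eta):
    """Logistic sigmoid with zero offset and slope eta (assumed form)."""
    return 1.0 / (1.0 + np.exp(-eta * x))

def approx_glimpse_proportion(y_sp, y_ns, eta=1.0):
    """Approximated GP in percent, Eq. (3).

    y_sp, y_ns: arrays of shape (Nt, Nf) holding the approximated STEP
    values for speech and noise. eta=1.0 is a placeholder slope.
    """
    y_sp = np.asarray(y_sp, dtype=float)
    y_ns = np.asarray(y_ns, dtype=float)
    if y_sp.ndim != 2 or y_sp.shape != y_ns.shape:
        raise ValueError("expected two arrays of identical shape (Nt, Nf)")
    n_t, n_f = y_sp.shape
    return 100.0 / (n_f * n_t) * np.sum(logistic(y_sp - y_ns, eta))
```

Because each time-frequency cell contributes a value between 0 and 1 rather than a hard 0/1 decision, the result is a smooth function of the speech representation, which is what makes it usable inside a gradient-based cost.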
Q15. What is the gradient of the error defined by Eq. (5)?
According to the definition of the error given by Eq. (5), the gradient vector is:
$\nabla E_t = \nabla \varepsilon_t - \beta \nabla GP_t$ (9)
The formula expressing the value of $\nabla \varepsilon_t$ can be found in [6].
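The excerpt does not give the expression for the gradient of GPt with respect to the cepstral coefficients, which would require the chain rule through the STEP approximation. As a partial sketch only, the code below computes the derivative of the per-frame GP with respect to the speech STEP values, assuming L(x) = 1 / (1 + exp(-eta * x)) so that L'(x) = eta * L(x) * (1 - L(x)), and then combines two already-computed gradient vectors as in Eq. (9). All function names are hypothetical.

```python
import numpy as np

def frame_gp(y_sp_t, y_ns_t, eta):
    """Per-frame GP (percent) under the assumed logistic form."""
    d = np.asarray(y_sp_t, dtype=float) - np.asarray(y_ns_t, dtype=float)
    return 100.0 / d.size * np.sum(1.0 / (1.0 + np.exp(-eta * d)))

def frame_gp_grad_wrt_step(y_sp_t, y_ns_t, eta):
    """d GP_t / d y_sp[t, f] for each channel f (not w.r.t. cepstra)."""
    d = np.asarray(y_sp_t, dtype=float) - np.asarray(y_ns_t, dtype=float)
    s = 1.0 / (1.0 + np.exp(-eta * d))
    # Derivative of the logistic: eta * s * (1 - s), scaled by 100 / Nf.
    return 100.0 / d.size * eta * s * (1.0 - s)

def cost_grad(grad_eps_t, grad_gp_t, beta):
    """Eq. (9): grad E_t = grad eps_t - beta * grad GP_t.

    Both inputs must be gradients with respect to the same variables.
    """
    return np.asarray(grad_eps_t, dtype=float) - beta * np.asarray(grad_gp_t, dtype=float)
```

The derivative is largest where speech and noise STEP values are close and vanishes where one clearly dominates, so under this assumed form the GP term mainly pushes on time-frequency cells that are near the glimpse boundary.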
Q16. What is the GP measure for speech?
The approximated version of the GP measure proposed here obtains correlation coefficients that are smaller but still comparable to the ones obtained by the original GP measure and higher than the ones obtained by any other spectrum-based measure when using the subjective data from the experiment described in [4].