Q2. What is the way to improve the quality of speech?
A combination of DNNs and support vector machines (SVMs) for speech enhancement using binary classification of T-F bands is proposed by Wang and Wang (2013).
Q3. How was the critical distance of the reverberation calculated?
The critical distance (Kuttruff, 2009) at which the sound energy of the reverberation is equal to that of the direct path is approximated as $r_c \approx 0.056\sqrt{gV/T_{60}}$, where $V$ is the room volume, and $g = 1.62$ is the directivity factor modeled by using the average human speech directivity index of 2.1 dB measured by Monson et al. (2012).
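The approximation above can be evaluated numerically. The following is a minimal sketch, assuming SI units (volume in cubic meters, T60 in seconds, distance in meters); the function names and the example room values are hypothetical and not taken from the source.

```python
import math


def directivity_factor(directivity_index_db: float) -> float:
    """Convert a directivity index in dB to a linear directivity factor."""
    return 10.0 ** (directivity_index_db / 10.0)


def critical_distance(volume_m3: float, t60_s: float, g: float = 1.62) -> float:
    """Approximate critical distance r_c = 0.056 * sqrt(g * V / T60).

    Assumes V in cubic meters and T60 in seconds (labeled assumption).
    """
    if volume_m3 <= 0 or t60_s <= 0:
        raise ValueError("volume and T60 must be positive")
    return 0.056 * math.sqrt(g * volume_m3 / t60_s)


# Hypothetical room: V = 100 m^3, T60 = 0.5 s.
r_c = critical_distance(100.0, 0.5)
print(f"g = {directivity_factor(2.1):.2f}, r_c = {r_c:.3f} m")
```

The first helper shows where g = 1.62 comes from: a directivity index of 2.1 dB corresponds to a linear factor of 10^(2.1/10).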
Q4. How was the error backpropagation algorithm used?
The error backpropagation algorithm was used to train the network (8) using stochastic gradient descent with learning rate µ = 0.1.
Q5. What was the error on the training data?
Backpropagation was run on the training data until the error on the testing data reached a minimum, that is, until it did not decrease during five successive iterations.
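The stopping rule described above is a form of early stopping with a patience of five iterations. The following is a minimal sketch of that rule only; the `step` and `evaluate` callbacks, the `max_iterations` cap, and the toy error sequence are hypothetical stand-ins, not the authors' training code.

```python
def train_with_early_stopping(step, evaluate, patience=5, max_iterations=1000):
    """Run training steps until the held-out error stops improving.

    step:     callable performing one training iteration (hypothetical).
    evaluate: callable returning the current held-out error (hypothetical).
    Stops once the error has not decreased for `patience` successive iterations.
    """
    best_error = float("inf")
    best_iteration = -1
    stale = 0  # iterations since the last improvement
    for iteration in range(max_iterations):
        step()
        error = evaluate()
        if error < best_error:
            best_error, best_iteration, stale = error, iteration, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_iteration, best_error


# Toy held-out error curve (made-up numbers for illustration only).
toy_errors = iter([0.9, 0.7, 0.5, 0.6, 0.55, 0.52, 0.51, 0.50, 0.4])
result = train_with_early_stopping(lambda: None, lambda: next(toy_errors))
print(result)
```

In the toy curve the minimum of 0.5 is reached at iteration 2, and the loop stops after the next five iterations fail to go below it, so the later value 0.4 is never evaluated.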
Q6. What are the typical cues for binaural recordings?
Such recordings from microphones placed in the ears are referred to as binaural, and the typical cues are interaural time delay (ITD) and interaural level difference (ILD).
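As an illustration of the two cues, the sketch below estimates ITD as the lag that maximizes the cross-correlation between the two ear signals and ILD as the energy ratio in dB. This is a simple broadband estimator chosen for illustration; the function names, the sign convention (positive lag means the right channel is delayed), and the toy signals are assumptions, not the method of the source.

```python
import math


def estimate_itd_samples(left, right, max_lag):
    """Return the lag in samples maximizing the cross-correlation.

    Positive lag means the right signal is delayed relative to the left
    (sign convention is an assumption of this sketch).
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for n, value in enumerate(left):
            m = n + lag
            if 0 <= m < len(right):
                score += value * right[m]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def estimate_ild_db(left, right):
    """Return the level difference 10*log10(E_left / E_right) in dB."""
    energy_left = sum(x * x for x in left)
    energy_right = sum(x * x for x in right)
    if energy_left == 0 or energy_right == 0:
        raise ValueError("both channels need non-zero energy")
    return 10.0 * math.log10(energy_left / energy_right)


# Toy signals: the right ear hears the same impulse 3 samples later at half amplitude.
left = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
right = [0.0, 0.0, 0.0, 0.0, 0.0, 0.5, 0.0, 0.0]
sample_rate_hz = 16000  # hypothetical sampling rate
itd_seconds = estimate_itd_samples(left, right, max_lag=4) / sample_rate_hz
print(itd_seconds, estimate_ild_db(left, right))
```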
Q7. What did Hummersone et al. argue about IRM?
Hummersone et al. (2014) argued that IRM may be more closely related to auditory processes than IBM and reviewed studies favoring IRM over IBM in certain ASR tasks and speech intelligibility (SI) measurements.
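To make the distinction between the two mask types concrete, the sketch below computes both from per-bin target and interference power. It uses commonly cited definitions (IBM: 1 where the local SNR exceeds a threshold, here 0 dB; IRM: target power divided by total power); these exact forms are an assumption and may differ from the variants used in the cited works. All names and values are hypothetical.

```python
def ideal_binary_mask(target_power, interference_power, threshold_db=0.0):
    """IBM: 1 where target power exceeds interference power by threshold_db, else 0."""
    ratio = 10.0 ** (threshold_db / 10.0)
    return [
        [1 if t > i * ratio else 0 for t, i in zip(t_row, i_row)]
        for t_row, i_row in zip(target_power, interference_power)
    ]


def ideal_ratio_mask(target_power, interference_power):
    """IRM: target power over total power per T-F bin (0 where both are zero)."""
    return [
        [t / (t + i) if (t + i) > 0 else 0.0 for t, i in zip(t_row, i_row)]
        for t_row, i_row in zip(target_power, interference_power)
    ]


# Toy 2x2 time-frequency grid of power values (made-up numbers).
target = [[4.0, 1.0], [0.0, 9.0]]
interference = [[1.0, 4.0], [1.0, 1.0]]
print(ideal_binary_mask(target, interference))
print(ideal_ratio_mask(target, interference))
```

The binary mask makes a hard keep-or-discard decision per bin, whereas the ratio mask takes values between 0 and 1 and therefore retains partial target energy in bins where both sources are active.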
Q8. How many speakers were used in the three source case?
For the three source case, all speaker permutations using two separate sentences from each speaker were generated to produce 32 three-speaker mixtures.
Q9. How many sentences were captured in the TIMIT database?
100 sentences from the TIMIT database (Garofolo et al., 1993) were captured in both rooms for both distances, resulting in 400 sentences.
Q10. What is the purpose of this work?
This work extends the speech enhancement work of Pertilä and Nikunen (2014) to source separation by introducing a post-processing stage to include information between sources into the predicted T-F masks.
Q11. What is the SDR for the MTT-SEP?
The SDR is also highest for the MTT-SEP in the more reverberant room, while in the low reverberant room the SDR scores are high for MVDR, CNMF, and MTT-SEP.
Q12. How do Healy et al. use DNNs to predict IBM?
Healy et al. (2013) utilize DNNs to predict IBM, and demonstrate that the method can significantly improve SI for hearing impaired listeners.
Q13. What is the difference between the two modifications?
Both modifications decrease the leakage of interfering sources into the target speaker, thus improving the framework’s separation performance.
Q14. What is the difference between the two hearing aids?
The authors analyze different types of non-linear feature transforms, using varying amounts of adjacent T-F points, and propose an evolutionary quantization scheme to achieve low transmission rate to exchange feature values between the two hearing aid devices.
Q15. What is the difference between the two spatial cues?
The mentioned spatial cues are channel pairwise, which poses no issues in two-channel recordings, but in general for $M$ microphones the number of pairs grows as $O(M^2)$, which can result in huge spatial cue vectors for microphone arrays.
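The quadratic growth follows from counting unordered microphone pairs, which gives M(M-1)/2. A small sketch enumerating the pairs; the function names are illustrative only.

```python
from itertools import combinations


def microphone_pairs(num_microphones):
    """Enumerate all unordered microphone index pairs."""
    return list(combinations(range(num_microphones), 2))


def pair_count(num_microphones):
    """Closed-form number of unordered pairs: M * (M - 1) / 2."""
    return num_microphones * (num_microphones - 1) // 2


for m in (2, 4, 8, 16):
    print(m, pair_count(m))
```

With one cue value per pair and per frequency band, the feature vector length scales with this pair count, which is why pairwise cues become large for arrays with many microphones.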