Exploring Monaural Features for Classification-Based Speech Segregation
Citations
Deep clustering: Discriminative embeddings for segmentation and separation
On training targets for supervised speech separation
Supervised Speech Separation Based on Deep Learning: An Overview
Complex ratio masking for monaural speech separation
References
Regression Shrinkage and Selection via the Lasso
Model selection and estimation in regression with grouped variables
Suppression of acoustic noise in speech using spectral subtraction
Perceptual linear predictive (PLP) analysis of speech
Related Papers (5)
Assessment for automatic speech recognition II: NOISEX-92: a database and an experiment to study the effect of additive noise on speech recognition systems
Frequently Asked Questions (15)
Q2. What are the future works mentioned in the paper "Exploring monaural features for classification-based speech segregation" ?
The authors plan to address reverberant speech segregation in future work using this combined feature set. It would be interesting to explore new features that characterize both pitch and low modulation frequencies in future research. In addition to pitch, their results suggest that RASTA filtering also plays an important role in good generalization.
Q3. What is the third way to combine features?
The third way is to apply supervised feature transformation such as linear discriminant analysis (LDA) [9] to the concatenated feature vector.
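For concreteness, here is a minimal sketch of this third strategy using scikit-learn's LDA on a concatenated feature vector. The feature names and dimensionalities below are illustrative assumptions, not the paper's exact configuration.

```python
# Hypothetical sketch: supervised transformation (LDA) applied to a
# concatenated per-unit feature vector. Dimensions are assumptions.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
n_units = 1000

# Illustrative per-unit feature groups (dimensions are assumptions).
ams = rng.normal(size=(n_units, 15))        # AMS
rasta_plp = rng.normal(size=(n_units, 13))  # RASTA-PLP
mfcc = rng.normal(size=(n_units, 13))       # MFCC
pitch = rng.normal(size=(n_units, 6))       # pitch-based features

X = np.hstack([ams, rasta_plp, mfcc, pitch])  # concatenated vector
y = rng.integers(0, 2, size=n_units)          # binary IBM labels (0/1)

# With two classes, LDA projects onto at most one discriminative
# dimension; the paper's setting may differ (this is only a sketch).
lda = LinearDiscriminantAnalysis(n_components=1).fit(X, y)
X_reduced = lda.transform(X)
print(X_reduced.shape)  # (1000, 1)
```

A supervised transformation of this kind reorients the concatenated features toward class separability before classification, rather than simply stacking them.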
Q4. What is the acoustic feature vector used in the experiments?
Treating the subband signal as the input, conventional frame-level acoustic feature extraction is carried out, and the feature vector at frame m is taken as the feature representation for the T-F unit at that channel and frame.
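The following is a hedged sketch of this unit-level scheme: each filterbank subband signal is treated as an ordinary input, framed, and a frame-level feature is computed per frame, so that the feature at frame m represents the T-F unit in that channel and frame. The filterbank size, frame parameters, and the log-energy stand-in feature are assumptions.

```python
# Minimal sketch of unit-level feature extraction (assumed reading of
# the answer above). Log frame energy stands in for a real extractor.
import numpy as np

def frame_signal(x, frame_len, hop):
    """Split a 1-D signal into overlapping frames."""
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop: i * hop + frame_len] for i in range(n_frames)])

def unit_level_features(subbands, frame_len=320, hop=160):
    """subbands: (n_channels, n_samples) filterbank outputs.
    Returns features of shape (n_channels, n_frames, feat_dim)."""
    feats = []
    for x_c in subbands:
        frames = frame_signal(x_c, frame_len, hop)
        log_energy = np.log(np.sum(frames ** 2, axis=1) + 1e-10)
        feats.append(log_energy[:, None])  # (n_frames, 1) per channel
    return np.stack(feats)

subbands = np.random.randn(64, 16000)  # stand-in 64-channel filterbank output
print(unit_level_features(subbands).shape)  # (64, 99, 1)
```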
Q5. What is the contribution of GFCC to model fitting?
GFCC’s contribution to model fitting is relatively weak (i.e., its regression coefficients are relatively small), making it almost redundant given AMS, RASTA-PLP, MFCC and PITCH.
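The phrase "regression coefficients" points to the lasso-style analysis cited in the references. The sketch below illustrates the idea with scikit-learn's plain Lasso as a stand-in (the references include the group lasso, which scikit-learn does not provide); the group dimensions and the synthetic data, built so the GFCC block is redundant, are assumptions.

```python
# Illustrative per-group contribution check with a plain lasso.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n = 2000
groups = {"AMS": 15, "RASTA-PLP": 13, "MFCC": 13, "GFCC": 31, "PITCH": 6}
X = rng.normal(size=(n, sum(groups.values())))

# Synthetic target that ignores the GFCC block, mimicking redundancy.
w = rng.normal(size=X.shape[1])
start = sum(list(groups.values())[:3])
w[start:start + groups["GFCC"]] = 0.0
y = X @ w + 0.1 * rng.normal(size=n)

coef = Lasso(alpha=0.05).fit(X, y).coef_
i = 0
for name, dim in groups.items():
    print(f"{name:10s} mean |coef| = {np.abs(coef[i:i + dim]).mean():.3f}")
    i += dim
```

Groups whose coefficients shrink toward zero contribute little to model fitting given the other groups, which is the sense in which GFCC is called almost redundant.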
Q6. What is the expected performance of the classifier when tested on unseen speakers?
The classification performance is expected to degrade when tested on unseen speakers, as is evident from the performance of single features.
Q7. What is the simplest way to extract acoustic features?
The authors also proposed a method to reduce the dimensionality of unit-level features, which derives different acoustic features from bandlimited spectral features.
Q8. Why do the authors use ideal sequential grouping for the tandem algorithm?
The authors use ideal sequential grouping for the tandem algorithm because the algorithm does not address sequential grouping, i.e., it has no way to group pitch contours (and their associated masks) of the same speaker across time to form a segregated sentence.
Q9. How many utterances are used in the training set?
For RASTA-PLP, a 5% gain is achieved by using 100 training utterances instead of 20, and performance appears to keep improving with more training utterances.
Q10. How many seconds of mixtures are used for testing?
For testing, there are approximately 650 seconds of mixtures for the IEEE test set and 700 seconds for the TIMIT test set (see Section IV-I).
Q11. How many windows are used to integrate the acoustic features?
The resulting FFT magnitudes are integrated by 15 triangular windows uniformly spaced from 15.6 to 400 Hz, producing a 15-D AMS feature vector.
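A hedged sketch of that integration step follows: FFT magnitudes of an envelope segment are pooled by 15 triangular windows whose centers are uniformly spaced from 15.6 to 400 Hz. The envelope sampling rate, segment length, and window width are assumptions, not the paper's exact settings.

```python
# Sketch of the triangular-window integration for a 15-D AMS vector.
import numpy as np

def ams_from_envelope(env, fs_env=1000.0, n_windows=15):
    """env: envelope segment of one T-F unit; fs_env is an assumption."""
    mag = np.abs(np.fft.rfft(env))
    freqs = np.fft.rfftfreq(len(env), d=1.0 / fs_env)
    centers = np.linspace(15.6, 400.0, n_windows)
    width = centers[1] - centers[0]
    ams = np.zeros(n_windows)
    for k, c in enumerate(centers):
        # Triangular weight: 1 at the center, 0 one spacing away.
        w = np.clip(1.0 - np.abs(freqs - c) / width, 0.0, None)
        ams[k] = np.sum(w * mag)
    return ams  # 15-D AMS vector for one unit

env = np.abs(np.random.randn(256))  # stand-in envelope segment
print(ams_from_envelope(env).shape)  # (15,)
```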
Q12. Why did the authors opt for using T-F unit level features?
The authors opted for T-F unit-level features mainly because their experiments show that, although frame-level features produce comparable performance in matched-noise conditions, they perform significantly worse than unit-level features in unmatched test conditions.
Q13. How many training utterances are used for the complementary feature set?
It is worth noting that the performance of the complementary feature set using only 20 training utterances surpasses that of the other features using more training utterances.
Q14. What other evaluation criteria have been developed in the speech separation community?
The authors note that other evaluation criteria have been developed in the speech separation community, including SNR and the source-to-distortion ratio (SDR).
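For reference, a minimal sketch of both criteria follows. SNR compares the segregated output against the premixed clean speech; for SDR, a simplified scale-invariant variant is shown rather than the full BSS-Eval decomposition.

```python
# Hedged sketch of SNR and a scale-invariant SDR variant.
import numpy as np

def snr_db(clean, estimate):
    """SNR of the estimate relative to the premixed clean signal."""
    noise = clean - estimate
    return 10 * np.log10(np.sum(clean ** 2) / np.sum(noise ** 2))

def sdr_db(clean, estimate):
    """Scale-invariant SDR: project the estimate onto the clean target."""
    s_target = (np.dot(estimate, clean) / np.dot(clean, clean)) * clean
    distortion = estimate - s_target
    return 10 * np.log10(np.sum(s_target ** 2) / np.sum(distortion ** 2))

clean = np.sin(2 * np.pi * 100 * np.arange(8000) / 8000.0)
estimate = clean + 0.1 * np.random.randn(8000)
print(f"SNR = {snr_db(clean, estimate):.1f} dB, "
      f"SDR = {sdr_db(clean, estimate):.1f} dB")
```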
Q15. How do the authors test the generalization to different speakers?
To further test generalization to different speakers, the authors create a new test set for each gender by mixing 20 utterances from the TIMIT corpus [10] with N1-N6 at 0 dB.
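A minimal sketch of such 0 dB mixing is shown below, assuming a power-based scaling convention; the stand-in signals are placeholders for TIMIT utterances and the N1-N6 noises.

```python
# Sketch: mix speech and noise at a target SNR (0 dB here).
import numpy as np

def mix_at_snr(speech, noise, snr_db=0.0):
    """Scale the noise so the speech-to-noise ratio equals snr_db, then add."""
    noise = noise[:len(speech)]  # truncate noise to the utterance length
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10.0)))
    return speech + scale * noise

speech = np.random.randn(16000)  # stand-in for a TIMIT utterance
noise = np.random.randn(32000)   # stand-in for one of N1-N6
mixture = mix_at_snr(speech, noise, snr_db=0.0)
```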