The AMI System for the Transcription of Speech in Meetings
References
The generalized correlation method for estimation of time delay
Perceptual linear predictive (PLP) analysis of speech
Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains
Robust adaptive beamforming
The ICSI Meeting Corpus
Frequently Asked Questions (18)
Q2. How many classes are adapted in the fourth pass?
In the fourth pass, both the M2 and M3 models are adapted using CMLLR and MLLR with regression class trees for up to four classes.
Q3. What are the main issues addressed in the audio pre-processing stages?
The segmentation of the audio and implicit discarding of silence or noise; the speaker labelling for later acoustic adaptation; the normalisation of input channels; and the suppression of noise.
Q4. How many hours of speech data were used for the IHM?
For the IHM system, about 70 hours of speech data from the ICSI corpus [9], 13 hours from the NIST corpus [4], and 10 hours from the ISL corpus [1] were used.
Q5. What is the process of converting audio signals into feature streams?
After the initial processing, the audio signals are converted into feature streams: each vector comprises 12 MF-PLP features plus raw log energy, to which first and second order derivatives are appended.
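The full MF-PLP front end is beyond the scope of this answer, but the appending of derivatives can be sketched with the standard regression formula; the +/- 2 frame regression window below is a common choice, not stated here.

```python
import numpy as np

def deltas(x: np.ndarray, win: int = 2) -> np.ndarray:
    """First-order regression deltas over a +/- `win` frame window
    (the exact window width is an assumption, not given above)."""
    denom = 2.0 * sum(t * t for t in range(1, win + 1))
    pad = np.pad(x, ((win, win), (0, 0)), mode="edge")
    n = len(x)
    acc = np.zeros_like(x)
    for t in range(1, win + 1):
        acc += t * (pad[win + t : win + t + n] - pad[win - t : win - t + n])
    return acc / denom

base = np.random.randn(500, 13)      # stand-in for 12 MF-PLP features + log energy
d1 = deltas(base)                    # first order derivatives
d2 = deltas(d1)                      # second order derivatives
vectors = np.hstack([base, d1, d2])  # final 39-dimensional feature stream
```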
Q6. What is the energy for the present channel?
The cross-channel normalised energy is calculated as the energy for the present channel divided by the sum of energies across all channels.
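A minimal sketch of that normalisation, assuming the signals have already been windowed into frames:

```python
import numpy as np

def cross_channel_norm_energy(frames: np.ndarray) -> np.ndarray:
    """frames: (n_channels, n_frames, frame_len) windowed samples.
    Returns, per channel and frame, that channel's energy divided by
    the sum of energies over all channels for the same frame."""
    energy = np.sum(frames ** 2, axis=-1)      # (n_channels, n_frames)
    total = energy.sum(axis=0, keepdims=True)  # per-frame sum over channels
    return energy / np.maximum(total, 1e-12)   # guard against silent frames
```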
Q7. What was the solution to this problem?
The solution to this problem was to project the meeting data into a narrowband space where both HLDA statistics can be gathered and discriminative training can be performed without regeneration of training lattices.
Q8. What is the importance of automatic transcription of speech in meetings?
The automatic transcription of speech in meetings is of crucial importance for meeting analysis, content analysis, summarisation, and analysis of dialogue structure.
Q9. What is the purpose of the MDM classifier?
The feature vectors are used to train a Multi-Layer Perceptron (MLP) classifier with a 101-frame input layer, a 50-unit hidden layer, and an output layer of two classes.
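A minimal sketch of such a classifier in PyTorch is given below; the per-frame feature dimension of four (the cross-talk features of Q12) and the sigmoid hidden non-linearity are assumptions, not stated in the answer.

```python
import torch
import torch.nn as nn

N_FRAMES, N_FEATS, N_HIDDEN, N_CLASSES = 101, 4, 50, 2  # 4 features per frame is assumed

mlp = nn.Sequential(
    nn.Flatten(),                             # (batch, 101, 4) -> (batch, 404)
    nn.Linear(N_FRAMES * N_FEATS, N_HIDDEN),  # 50-unit hidden layer
    nn.Sigmoid(),                             # hidden non-linearity (assumed)
    nn.Linear(N_HIDDEN, N_CLASSES),           # two output classes
)

x = torch.randn(8, N_FRAMES, N_FEATS)  # dummy batch of 101-frame feature windows
logits = mlp(x)                        # (8, 2) class scores
```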
Q10. What was the first trial run of the AMI project?
Widespread work on automatic transcription of speech in meetings started with yearly performance evaluations by the U.S. National Institute of Standards and Technology (NIST) with a first trial run in 2002.
Q11. How many LCRC features were used in the SAT style training?
When using SAT-style training on each microphone channel (CHAT), i.e. one set of CMLLR transforms per channel, a small performance gain was observed.
Q12. What features are used for the detection of cross talk?
These features are cross-channel normalised energy, signal kurtosis, mean cross-correlation and maximum normalised cross-correlation.
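A sketch of how the kurtosis and cross-correlation features might be computed for one analysis window; the window length, channel pairing, and use of excess kurtosis are assumptions:

```python
import numpy as np

def kurtosis(x: np.ndarray) -> float:
    """Excess kurtosis of one windowed channel signal."""
    x = x - x.mean()
    m2 = np.mean(x ** 2)
    return float(np.mean(x ** 4) / (m2 ** 2 + 1e-12) - 3.0)

def xcorr_features(a: np.ndarray, b: np.ndarray) -> tuple[float, float]:
    """Mean and maximum normalised cross-correlation of two channels."""
    a, b = a - a.mean(), b - b.mean()
    xc = np.correlate(a, b, mode="full")
    norm = np.sqrt(np.sum(a ** 2) * np.sum(b ** 2)) + 1e-12
    return float(xc.mean() / norm), float(xc.max() / norm)
```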
Q13. What is the process of calculating the background noise?
The background noise spectrum is estimated using the lowest-energy frames in the recording, and this estimate is then used to Wiener-filter the data and remove the stationary noise.
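A minimal sketch of this scheme, operating on precomputed per-frame power spectra; the fraction of frames treated as noise is an assumption:

```python
import numpy as np

def wiener_denoise(spec: np.ndarray, noise_frac: float = 0.1) -> np.ndarray:
    """spec: (n_frames, n_bins) power spectra of the recording.
    Estimates the noise spectrum from the `noise_frac` lowest-energy
    frames (fraction assumed) and applies a Wiener-style gain."""
    frame_energy = spec.sum(axis=1)
    k = max(1, int(noise_frac * len(spec)))
    quiet = np.argsort(frame_energy)[:k]   # lowest-energy frames
    noise = spec[quiet].mean(axis=0)       # stationary noise estimate
    gain = np.maximum(spec - noise, 0.0) / (spec + 1e-12)
    return gain * spec                     # filtered power spectra
```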
Q14. What is the purpose of the paper?
Projects like AMI (Augmented Multiparty Interaction) aim to investigate the use of machine-based techniques to aid people in and outside of meetings in gaining efficient access to information.
Q15. How many hours of training were included in the training set?
The AMI corpus collection was not completed at the time of system development and only 16 hours were included in the training set.
Q16. What is the method for calculating the WERs?
In this case, simply selecting the channel with the highest energy for every time frame was found to yield substantially lower word error rates (WERs).
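Per-frame selection reduces to an argmax over frame energies, as in this sketch (framing of the signals assumed done):

```python
import numpy as np

def pick_channels(frames: np.ndarray) -> np.ndarray:
    """frames: (n_channels, n_frames, frame_len) windowed samples.
    Returns the index of the highest-energy channel for each frame."""
    energy = np.sum(frames ** 2, axis=-1)  # (n_channels, n_frames)
    return energy.argmax(axis=0)           # per-frame channel choice
```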
Q17. How many FB coefficients are extracted every 10ms?
23 filter-bank (FB) coefficients are extracted every 10 ms, and 15 vectors of left context are then used to find the LC state-level phone posterior estimates.
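Stacking the left context is a simple array operation; the sketch below assumes the current frame is included alongside its 15 left-context vectors:

```python
import numpy as np

def left_context(fb: np.ndarray, ctx: int = 15) -> np.ndarray:
    """fb: (n_frames, 23) filter-bank coefficients at a 10 ms frame rate.
    Stacks each frame with its `ctx` left-context frames, oldest first,
    giving the input from which LC phone posteriors are estimated."""
    pad = np.pad(fb, ((ctx, 0), (0, 0)), mode="edge")
    return np.hstack([pad[i : i + len(fb)] for i in range(ctx + 1)])
```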
Q18. How is the gain of acoustic microphones calculated?
It was found that, similarly to CTS, a 10-15% relative WER gain can be obtained using maximum likelihood based vocal tract length normalisation (VTLN) [6].
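Maximum likelihood VTLN picks, per speaker or utterance, the warp factor whose warped features score highest under the acoustic model. This generic sketch assumes a user-supplied scoring function and a typical warp grid, neither taken from the paper:

```python
import numpy as np

def best_warp(score_fn, utterance, warps=np.arange(0.8, 1.21, 0.02)):
    """score_fn(utterance, warp) must return the acoustic-model
    log-likelihood of features extracted with the given warp factor;
    the 0.80-1.20 grid is a common convention, not from the paper."""
    scores = [score_fn(utterance, w) for w in warps]
    return warps[int(np.argmax(scores))]

# dummy usage: a toy scorer that peaks at warp = 1.04
w = best_warp(lambda utt, warp: -abs(warp - 1.04), utterance=None)
```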