Author
Tan Lee
Other affiliations: Agency for Science, Technology and Research, Shenzhen University, University of Central Florida
Bio: Tan Lee is an academic researcher from The Chinese University of Hong Kong. The author has contributed to research in topics: Speech processing & Speaker recognition. The author has an h-index of 24 and has co-authored 230 publications receiving 2,299 citations. Previous affiliations of Tan Lee include Agency for Science, Technology and Research & Shenzhen University.
Papers published on a yearly basis
Papers
TL;DR: Analysis of the annotated data shows that CU Corpora contain rich and balanced phonetic content, and the usefulness of the corpora is also demonstrated with a number of speech recognition and speech synthesis applications.
Abstract: This paper describes the development of CU Corpora, a series of large-scale speech corpora for Cantonese. Cantonese is the most commonly spoken Chinese dialect in Southern China and Hong Kong. CU Corpora are the first of their kind and are intended to serve as an important infrastructure for the advancement of speech recognition and synthesis technologies for this widely used Chinese dialect. They contain a large amount of speech data covering various linguistic units of spoken Cantonese, including isolated syllables, polysyllabic words and continuous sentences. While some of the corpora are created for specific applications of common interest, the others are designed with an emphasis on the coverage and distributions of different phonetic units, including contextual ones. The speech data are annotated manually so as to provide sufficient orthographic and phonetic information for the development of different applications. Statistical analysis of the annotated data shows that CU Corpora contain rich and balanced phonetic content. The usefulness of the corpora is also demonstrated with a number of speech recognition and speech synthesis applications.
110 citations
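The phonetic coverage analysis mentioned in the abstract can be illustrated with a small sketch. The following is a minimal example, assuming utterances are already annotated as sequences of Cantonese syllable labels; the toy data, the inventory, and the helper name `coverage_stats` are hypothetical and not part of the CU Corpora tools.

```python
from collections import Counter

def coverage_stats(transcriptions, inventory):
    """Count how often each phonetic unit (e.g. a Cantonese syllable) occurs
    in a set of annotated utterances, and report what fraction of the target
    inventory is covered."""
    counts = Counter(unit for utt in transcriptions for unit in utt)
    covered = [u for u in inventory if counts[u] > 0]
    return {
        "coverage": len(covered) / len(inventory),  # fraction of inventory seen
        "counts": counts,                           # raw occurrence counts
    }

# Hypothetical example: three utterances annotated as syllable sequences.
utts = [["ngo5", "dei6", "hou2"], ["hou2", "maa3"], ["ngo5", "sik6", "faan6"]]
inventory = ["ngo5", "dei6", "hou2", "maa3", "sik6", "faan6", "m4"]
print(coverage_stats(utts, inventory)["coverage"])  # 6 of 7 units covered
```

The same counting logic extends to contextual units (e.g. syllable pairs) by sliding a window over each transcription before counting.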
TL;DR: This letter describes a speaker verification system that uses complementary acoustic features derived from the vocal source excitation and the vocal tract system, and a new feature set, named the wavelet octave coefficients of residues (WOCOR), to capture the spectro-temporal source excitation characteristics embedded in the linear predictive residual signal.
Abstract: This letter describes a speaker verification system that uses complementary acoustic features derived from the vocal source excitation and the vocal tract system. A new feature set, named the wavelet octave coefficients of residues (WOCOR), is proposed to capture the spectro-temporal source excitation characteristics embedded in the linear predictive residual signal. WOCOR is used to supplement the conventional vocal tract-related features, in this case the Mel-frequency cepstral coefficients (MFCC), for speaker verification. A novel confidence measure-based score fusion technique is applied to integrate WOCOR and MFCC. Speaker verification experiments are carried out on the NIST 2001 database. The equal error rate (EER) attained with the proposed method is 7.67%, compared with 9.30% for the conventional MFCC-based system.
94 citations
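WOCOR features are built from the linear predictive residual of the speech signal and a wavelet decomposition of that residual. Below is a rough sketch of that idea in Python (NumPy, SciPy, PyWavelets); it follows the general recipe of inverse filtering followed by octave-band grouping, but omits the pitch-synchronous framing and the exact parameter settings of the paper, and all function names are illustrative rather than the authors' implementation.

```python
import numpy as np
import pywt                              # PyWavelets, for the wavelet decomposition
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

def lp_residual(frame, order=12):
    """Linear prediction residual of one speech frame (autocorrelation method)."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    a = solve_toeplitz((r[:order], r[:order]), r[1:order + 1])  # predictor coeffs
    return lfilter(np.concatenate(([1.0], -a)), [1.0], frame)   # inverse filtering

def wocor_like_features(frame, wavelet="db4", levels=4, groups=4):
    """WOCOR-style features (sketch): wavelet decomposition of the LP residual,
    with each octave subband split into temporal groups summarized by a 2-norm."""
    residual = lp_residual(frame)
    bands = pywt.wavedec(residual, wavelet, level=levels)[1:]   # detail subbands
    feats = []
    for band in bands:                                          # one band per octave
        for chunk in np.array_split(band, groups):              # temporal groups
            feats.append(np.linalg.norm(chunk))
    return np.array(feats)

# Toy usage on a random "voiced" frame; a real system would use
# pitch-synchronous frames extracted from voiced speech.
frame = np.random.randn(480)
print(wocor_like_features(frame).shape)
```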
17 Sep 2006
TL;DR: This paper combines the embedded approach (using improved F0 smoothing) with explicit tone modeling in rescoring the output lattices of Mandarin automatic speech recognition systems to improve the character error rate on the CTV test set.
Abstract: Tone has a crucial role in Mandarin speech in distinguishing ambiguous words. Most state-of-the-art Mandarin automatic speech recognition systems adopt embedded tone modeling, where tonal acoustic units are used and F0 features are appended to the spectral feature vector. In this paper, we combine the embedded approach (using improved F0 smoothing) with explicit tone modeling in rescoring the output lattices. Oracle experiments indicate that a 32% relative improvement can be achieved by rescoring with perfect tone information. Recognition experiments on Mandarin broadcast news show that, even with an accuracy of only 70%, the explicit tone classifier offers complementary knowledge and improves performance significantly. Through the combination of tone modeling techniques, the character error rate on the CTV test set is improved from 13.0% to 11.5%.
78 citations
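The rescoring step combines the recognizer's original score with the output of the explicit tone classifier. A minimal sketch of that combination is shown below, assuming the lattice has already been reduced to an N-best list and that a tone classifier exposes a per-syllable posterior; the field names, the weight value, and the toy classifier are hypothetical and not taken from the paper.

```python
import math

def rescore_hypotheses(hypotheses, tone_score_fn, tone_weight=0.3):
    """Re-rank N-best hypotheses by adding a weighted explicit-tone score
    (log domain) to the original recognizer score."""
    rescored = []
    for hyp in hypotheses:
        tone_logprob = sum(math.log(max(tone_score_fn(syl, tone), 1e-10))
                           for syl, tone in hyp["tonal_syllables"])
        rescored.append((hyp["asr_score"] + tone_weight * tone_logprob, hyp))
    rescored.sort(key=lambda pair: pair[0], reverse=True)
    return [hyp for _, hyp in rescored]

# Toy usage with a hypothetical tone classifier posterior lookup.
def toy_tone_score(syllable, tone):
    return 0.7 if tone == 1 else 0.1

hyps = [
    {"asr_score": -120.0, "tonal_syllables": [("ma", 1), ("ma", 3)]},
    {"asr_score": -121.0, "tonal_syllables": [("ma", 1), ("ma", 1)]},
]
print(rescore_hypotheses(hyps, toy_tone_score)[0]["asr_score"])
```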
TL;DR: The paper presents an efficient method for tone recognition of isolated Cantonese syllables, in which suprasegmental feature parameters are extracted from the voiced portion of a monosyllabic utterance and a three-layer feedforward neural network is used to classify these feature vectors.
Abstract: Tone identification is essential for the recognition of the Chinese language, specifically for Cantonese, which is well known for being very rich in tones. The paper presents an efficient method for tone recognition of isolated Cantonese syllables. Suprasegmental feature parameters are extracted from the voiced portion of a monosyllabic utterance, and a three-layer feedforward neural network is used to classify these feature vectors. Using a phonologically complete vocabulary of 234 distinct syllables, the recognition accuracies for the single-speaker and multi-speaker cases are 89.0% and 87.6%, respectively.
66 citations
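The classifier described above is a three-layer (input, hidden, output) feedforward network operating on per-syllable suprasegmental features such as F0 contour points, F0 slope, and duration. A minimal sketch using scikit-learn is shown below; the feature dimensionality, the six-tone label set, and the randomly generated data are assumptions for illustration only.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Hypothetical suprasegmental features per syllable: e.g. normalized F0 sampled
# at a few points of the voiced portion, F0 slope, duration, and vowel energy.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 8))        # 600 toy syllables, 8-dim feature vectors
y = rng.integers(0, 6, size=600)     # 6 Cantonese tone classes (toy labels)

# One hidden layer gives the three-layer (input-hidden-output) topology.
clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
clf.fit(X, y)
print(clf.predict(X[:5]))
```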
25 Mar 2012
TL;DR: Experimental results show that the ASM tokenizer outperforms a conventional GMM tokenizer and a language-mismatched phoneme recognizer, and that the performance is significantly improved by applying unsupervised speaker normalization techniques.
Abstract: The framework of posteriorgram-based template matching has been shown to be successful for query-by-example spoken term detection (STD). This framework employs a tokenizer to convert query examples and test utterances into frame-level posteriorgrams, and applies dynamic time warping to match the query posteriorgrams with test posteriorgrams to locate possible occurrences of the query term. It is not trivial to design a reliable tokenizer due to heterogeneous test conditions and the limitation of training resources. This paper presents a study of using acoustic segment models (ASMs) as the tokenizer. ASMs can be obtained following an unsupervised iterative procedure without any training transcriptions. The STD performance of the ASM tokenizer is evaluated on Fisher Corpus with comparison to three alternative tokenizers. Experimental results show that the ASM tokenizer outperforms a conventional GMM tokenizer and a language-mismatched phoneme recognizer. In addition, the performance is significantly improved by applying unsupervised speaker normalization techniques.
66 citations
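Posteriorgram-based template matching aligns a query posteriorgram against test-utterance posteriorgrams with dynamic time warping. The following is a minimal sketch of that matching step, assuming frame-level posteriors over ASM-like units and a negative-log inner-product frame distance; it illustrates the general framework rather than the paper's exact detection pipeline, and the toy posteriorgrams are random.

```python
import numpy as np

def posteriorgram_distance(p, q, eps=1e-8):
    """Frame-level distance commonly used for posteriorgram matching:
    negative log of the inner product between two posterior vectors."""
    return -np.log(np.dot(p, q) + eps)

def dtw_cost(query, segment):
    """Dynamic time warping cost between a query posteriorgram (T1 x D)
    and a test-segment posteriorgram (T2 x D); lower means a better match."""
    t1, t2 = len(query), len(segment)
    cost = np.full((t1 + 1, t2 + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, t1 + 1):
        for j in range(1, t2 + 1):
            d = posteriorgram_distance(query[i - 1], segment[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[t1, t2] / (t1 + t2)      # length-normalized alignment cost

# Toy posteriorgrams over 50 ASM-like units.
rng = np.random.default_rng(1)
query = rng.dirichlet(np.ones(50), size=20)
segment = rng.dirichlet(np.ones(50), size=35)
print(dtw_cost(query, segment))
```

In a full system this cost would be computed over sliding windows of each test utterance, and windows with sufficiently low cost would be reported as detected occurrences of the query term.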
Cited by
TL;DR: There is, I think, something ethereal about i, the square root of minus one, which seemed an odd beast at the time: an intruder hovering on the edge of reality.
Abstract: There is, I think, something ethereal about i —the square root of minus one. I remember first hearing about it at school. It seemed an odd beast at that time—an intruder hovering on the edge of reality.
Usually familiarity dulls this sense of the bizarre, but in the case of i it was the reverse: over the years the sense of its surreal nature intensified. It seemed that it was impossible to write mathematics that described the real world in …
33,785 citations
TL;DR: This paper starts with the fundamentals of automatic speaker recognition, concerning feature extraction and speaker modeling, and elaborates on advanced computational techniques to address robustness and session variability.
Abstract: This paper gives an overview of automatic speaker recognition technology, with an emphasis on text-independent recognition. Speaker recognition has been studied actively for several decades. We give an overview of both the classical and the state-of-the-art methods. We start with the fundamentals of automatic speaker recognition, concerning feature extraction and speaker modeling. We elaborate on advanced computational techniques to address robustness and session variability. The recent progress from vectors towards supervectors opens up a new area of exploration and represents a technology trend. We also provide an overview of this recent development and discuss the evaluation methodology of speaker recognition systems. We conclude the paper with a discussion of future directions.
1,433 citations
1,364 citations