Conference

International Symposium on Chinese Spoken Language Processing 

About: The International Symposium on Chinese Spoken Language Processing is an academic conference. It publishes mainly in the areas of speech processing and computer science. Over its lifetime, the conference has published 1,000 papers, which have received 4,711 citations.


Papers
Book ChapterDOI
13 Dec 2006
TL;DR: The paper describes the design, collection, transcription and analysis of 200 hours of HKUST Mandarin Telephone Speech Corpus (HKUST/MTS), the largest and first of its kind for Mandarin conversational telephone speech, providing abundant and diversified samples for Mandarin speech recognition and other application-dependent tasks.
Abstract: The paper describes the design, collection, transcription and analysis of 200 hours of the HKUST Mandarin Telephone Speech Corpus (HKUST/MTS), gathered from over 2100 Mandarin speakers in mainland China under the DARPA EARS framework. The corpus includes speech data, transcriptions and speaker demographic information. The speech data comprise 1206 ten-minute natural Mandarin conversations between either strangers or friends. Each conversation focuses on a single topic. All calls are recorded over public telephone networks and are manually annotated with standard Chinese characters (GBK) as well as specific mark-ups for spontaneous speech. A file with speaker demographic information is also provided. The corpus is the largest and first of its kind for Mandarin conversational telephone speech, providing abundant and diversified samples for Mandarin speech recognition and other application-dependent tasks, such as topic detection, information retrieval, keyword spotting, speaker recognition, etc. In a 2004 evaluation test by NIST, the corpus was found to improve system performance significantly.

124 citations

Proceedings ArticleDOI
Pan Jia, Cong Liu, Zhi-Guo Wang, Yu Hu, Hui Jiang
01 Dec 2012
TL;DR: This paper investigates DNN for several large vocabulary speech recognition tasks and proposes a few ideas to reconfigure the DNN input features, such as using logarithm spectrum features or VTLN normalized features in DNN.
Abstract: Recently, it has been reported that context-dependent deep neural networks (DNNs) have achieved unprecedented gains in many challenging ASR tasks, including the well-known Switchboard task. In this paper, we first investigate DNNs for several large vocabulary speech recognition tasks. Our results confirm that DNNs can consistently achieve about 25–30% relative error reduction over the best discriminatively trained GMMs, even in ASR tasks with up to 700 hours of training data. Next, we conduct a series of experiments to study where the unprecedented gain of DNNs comes from. Our experiments show that the gain is almost entirely attributable to the DNN's input feature vectors, which are concatenated from several consecutive speech frames within a relatively long context window. Finally, we propose a few ideas to reconfigure the DNN input features, such as using logarithm spectrum features or VTLN-normalized features. Our results show that each of these methods yields over 3% relative error reduction over the traditional MFCC or PLP features in DNNs.
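The frame concatenation the abstract credits for most of the DNN's gain can be sketched as follows. This is a minimal illustration of input-feature splicing, not the authors' implementation; the function name and the edge-padding choice are assumptions.

```python
import numpy as np

def splice_frames(features, context=5):
    """Concatenate each frame with its +/- `context` neighbours,
    repeating the first/last frame at the utterance edges."""
    T, D = features.shape
    padded = np.pad(features, ((context, context), (0, 0)), mode="edge")
    # Each output row stacks 2*context+1 consecutive frames.
    return np.stack(
        [padded[t : t + 2 * context + 1].reshape(-1) for t in range(T)]
    )

# e.g. 100 frames of 39-dim MFCCs -> 100 spliced vectors of 39 * 11 dims
mfcc = np.random.randn(100, 39)
spliced = splice_frames(mfcc, context=5)
assert spliced.shape == (100, 39 * 11)
```

With `context=5`, each DNN input covers 11 frames, which is the kind of "relatively long context window" the abstract refers to.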

102 citations

Proceedings ArticleDOI
27 Oct 2014
TL;DR: Experimental results show that the proposed DNN improves separation performance on several objective measures in a semi-supervised setting: training data is available for the target speaker, while unseen interferers at separation time are handled by mixing multiple interfering speakers with the target speaker during training.
Abstract: In this paper, a novel deep neural network (DNN) architecture is proposed to generate the speech features of both the target speaker and interferer for speech separation. DNN is adopted here to directly model the highly nonlinear relationship between speech features of the mixed signals and the two competing speakers. With the modified output speech features for learning the parameters of the DNN, the generalization capacity to unseen interferers is improved for separating the target speech. Meanwhile, without any prior information from the interferer, the interfering speech can also be separated. Experimental results show that the proposed new DNN enhances the separation performance in terms of different objective measures under the semi-supervised mode where the training data of the target speaker is provided while the unseen interferer in the separation stage is predicted by using multiple interfering speakers mixed with the target speaker in the training stage.
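The dual-output architecture described above, mapping mixed-speech features to estimates for both competing speakers, can be illustrated with a toy forward pass. This is a hypothetical sketch with random weights, not the paper's trained network; all dimensions and names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: D-dim speech features, H hidden units.
D, H = 40, 128

# One hidden layer; the single output layer is split into two
# heads: target-speaker features and interferer features.
W1 = rng.standard_normal((D, H)) * 0.1
b1 = np.zeros(H)
W2 = rng.standard_normal((H, 2 * D)) * 0.1
b2 = np.zeros(2 * D)

def separate(mixed):
    """Map mixed-speech features to (target, interferer) estimates."""
    h = np.maximum(0.0, mixed @ W1 + b1)  # ReLU hidden layer
    out = h @ W2 + b2
    return out[..., :D], out[..., D:]     # split the joint output

target_hat, interferer_hat = separate(rng.standard_normal((10, D)))
assert target_hat.shape == interferer_hat.shape == (10, D)
```

Training such a network on mixtures of the target speaker with many different interferers is what gives the semi-supervised generalization to unseen interferers that the abstract reports.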

62 citations

Book ChapterDOI
13 Dec 2006
TL;DR: The listening test results show that LSP and its dynamic counterpart, both in time and frequency, are preferred for the resultant higher synthesized speech quality.
Abstract: In this paper we present our Hidden Markov Model (HMM)-based Mandarin Chinese Text-to-Speech (TTS) system. Mandarin Chinese, or Putonghua ("the common spoken language"), is a tone language in which each of the 400-plus base syllables can carry up to 5 different lexical tone patterns. Their segmental and supra-segmental information is first modeled by 3 corresponding HMMs, covering: (1) spectral envelope and gain; (2) voiced/unvoiced decision and fundamental frequency; and (3) segment duration. The HMMs are trained on a read-speech database of 1,000 sentences recorded by a female speaker. Specifically, the spectral information is derived from short-time LPC spectral analysis. Among all LPC parameters, the Line Spectrum Pair (LSP) has the closest relevance to the natural resonances, or "formants", of a speech sound, and it is selected to parameterize the spectral information. Furthermore, the clustering of LSPs around spectral peaks justifies augmenting them with their dynamic counterparts, both in time and in frequency, in both HMM modeling and parameter-trajectory synthesis. One hundred sentences synthesized by 4 LSP-based systems were subjectively evaluated with an AB comparison test. The listening-test results show that LSP and its dynamic counterparts, in both time and frequency, are preferred for the resulting higher synthesized speech quality.
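The "dynamic counterparts in time" mentioned above are conventionally computed with the standard regression (delta) formula used throughout HMM-based speech synthesis and recognition. A minimal sketch, assuming the common regression-window form; the function name and edge handling are illustrative.

```python
import numpy as np

def delta(features, window=2):
    """Regression-based delta (dynamic) features over time, as commonly
    used to augment static parameters such as LSPs in HMM synthesis."""
    T, D = features.shape
    denom = 2 * sum(n * n for n in range(1, window + 1))
    padded = np.pad(features, ((window, window), (0, 0)), mode="edge")
    out = np.zeros_like(features)
    for n in range(1, window + 1):
        # Weighted difference of frames n steps ahead and behind.
        out += n * (padded[window + n : window + n + T]
                    - padded[window - n : window - n + T])
    return out / denom

# A linear ramp has a constant delta away from the padded edges.
ramp = np.arange(10, dtype=float).reshape(-1, 1)
d = delta(ramp, window=2)
assert np.allclose(d[2:-2], 1.0)
```

The same operator applied across the LSP index rather than across time would give a frequency-direction dynamic feature, in the spirit of the paper's time-and-frequency augmentation.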

57 citations

Proceedings ArticleDOI
15 Dec 2004
TL;DR: This paper presents an effective method to detect the language boundary (LB) in code-switching utterances, which are mainly produced in Cantonese, a commonly used Chinese dialect, with English words occasionally inserted between Cantonese words.
Abstract: In this paper, we present an effective method to detect the language boundary (LB) in code-switching utterances. The utterances are mainly produced in Cantonese, a commonly used Chinese dialect, whilst occasionally English words are inserted between Cantonese words. Bi-phone probabilities are calculated to measure the confidence that the recognized phones are in Cantonese. Two sets of context-independent mono-phone models are trained by monolingual Cantonese and monolingual English data separately. Both knowledge-based and data-driven model selection approaches are studied in order to retain the language-dependent characteristics and to merge duplicated phone sets between the two languages. The LB detection accuracy is 75.12% for utterances that contain one single code-switching word or phrase.
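The bi-phone confidence idea above can be sketched as a phone-bigram language model scored over a recognized phone sequence: a sequence scores higher under the model of the language it actually belongs to. This is a hypothetical illustration with add-alpha smoothing, not the paper's exact formulation; all names are assumptions.

```python
import math
from collections import defaultdict

def train_biphone_model(phone_sequences, alpha=1.0):
    """Train an add-alpha smoothed bi-phone (phone-bigram) model and
    return a log-probability function logprob(prev_phone, cur_phone)."""
    counts = defaultdict(lambda: defaultdict(float))
    vocab = set()
    for seq in phone_sequences:
        for prev, cur in zip(seq, seq[1:]):
            counts[prev][cur] += 1
            vocab.update((prev, cur))
    V = len(vocab)

    def logprob(prev, cur):
        total = sum(counts[prev].values())
        return math.log((counts[prev][cur] + alpha) / (total + alpha * V))

    return logprob

def biphone_confidence(logprob, seq):
    """Average bi-phone log-probability of a recognized phone sequence;
    higher means the sequence looks more like the training language."""
    pairs = list(zip(seq, seq[1:]))
    return sum(logprob(p, c) for p, c in pairs) / len(pairs)
```

In the paper's setting, a model trained on Cantonese phone data would assign low confidence to stretches of recognized phones that are actually English, and such dips mark candidate language boundaries.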

54 citations

Performance
Metrics
No. of papers from the Conference in previous years
Year    Papers
2022    101
2021    70
2018    94
2016    135
2014    137
2012    96