Author

Tadashi Emori

Bio: Tadashi Emori is an academic researcher from NEC. The author has contributed to research on the topics of cepstrum and acoustic models. The author has an h-index of 9 and has co-authored 34 publications receiving 227 citations.

Papers
Proceedings Article
01 Jan 2001
TL;DR: A new VTLN method is proposed in which the vocal tract length is normalized in the cepstrum space by a linear mapping whose parameter is derived using maximum-likelihood estimation; the method offers greater precision in determining parameters for individual speakers.
Abstract: Recently, vocal tract length normalization (VTLN) techniques have been developed for speaker normalization in speech recognition. This paper proposes a new VTLN method, in which the vocal tract length is normalized in the cepstrum space by means of a linear mapping whose parameter is derived using maximum-likelihood estimation. The computational costs of this method are much lower than those of conventional methods such as ML-VTLN, in which the mapping parameter is selected from among several candidates. Further, the new method offers greater precision in determining parameters for individual speakers. Experimental use of the method resulted in an error reduction rate of 7.1%. Combining the proposed method with the cepstrum mean normalization (CMN) method was also examined and found to reduce the error rate even further, by 14.6%.

33 citations
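To make the idea concrete, here is a minimal sketch of a linear cepstral warp whose parameter is estimated in closed form by maximum likelihood. Everything specific is an assumption for illustration, not the paper's actual construction: the warp is modeled as A(alpha) = I + alpha*B for a hypothetical fixed basis matrix B, and the acoustic model is reduced to a single diagonal-covariance Gaussian, so the log-likelihood is quadratic in alpha.

```python
import numpy as np

# Hedged sketch: ML estimation of a linear VTLN warp in cepstrum space.
# Assumptions (not from the paper): warp A(alpha) = I + alpha*B for a
# fixed, hypothetical basis B; the acoustic model is one diagonal-
# covariance Gaussian, so the log-likelihood is quadratic in alpha and
# its maximizer has a closed form.

def estimate_alpha(C, B, mu, var):
    """C: (T, D) cepstra; B: (D, D) warp basis; mu, var: (D,) Gaussian.
    Maximizes sum_t log N((I + alpha*B) c_t; mu, diag(var)) over alpha."""
    BC = C @ B.T              # (T, D): per-frame warp direction B @ c_t
    R = C - mu                # (T, D): residual at alpha = 0
    # Setting d/dalpha [-0.5 * sum((R + alpha*BC)**2 / var)] = 0:
    return -np.sum(R * BC / var) / np.sum(BC * BC / var)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    C = rng.normal(size=(200, 13))       # toy stand-in cepstra
    B = np.diag(np.ones(12), k=1)        # toy basis: quefrency shift
    alpha = estimate_alpha(C, B, mu=np.zeros(13), var=np.ones(13))
    print(f"closed-form ML warp parameter: {alpha:.4f}")
```

A closed form like this is what would give the cost advantage the abstract claims over ML-VTLN, which instead selects the mapping parameter from a small candidate set.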

Proceedings ArticleDOI
14 Mar 2010
TL;DR: The proposed committee-based active learning method for large vocabulary continuous speech recognition, which applies not only to acoustic models but also to language models and their combination, proved to be significantly better than random selection.
Abstract: We propose a committee-based active learning method for large vocabulary continuous speech recognition. In this approach, multiple recognizers are prepared beforehand, and the recognition results obtained from them are used for selecting utterances. Here, a progressive search method is used for aligning sentences, and voting entropy is used as a measure for selecting utterances. We apply our method not only to acoustic models but also to language models and their combination. Our method was evaluated using 190 hours of speech data from the Corpus of Spontaneous Japanese. It proved to be significantly better than random selection: it required only 63 hours of data to achieve a word accuracy of 74%, while standard training (i.e., random selection) required 97 hours. The recognition accuracy of our proposed method was also better than that of the conventional uncertainty sampling method, which uses word posterior probabilities as the confidence measure for selecting sentences.

22 citations
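As a rough illustration of the selection criterion, the sketch below scores utterances by the entropy of the committee's votes at each aligned word position and keeps the most disputed ones. It is a toy under stated assumptions: the committee hypotheses are already word-aligned (the paper uses a progressive search for alignment, which is not reproduced here), and the data structures are hypothetical.

```python
import math
from collections import Counter

# Hedged sketch of committee-based utterance selection by voting
# entropy. Assumes pre-aligned hypotheses; alignment itself is omitted.

def voting_entropy(votes):
    """Entropy of the committee's votes at one aligned word position."""
    counts = Counter(votes)
    k = len(votes)
    return -sum((c / k) * math.log(c / k) for c in counts.values())

def utterance_score(aligned_hyps):
    """aligned_hyps: list over positions of lists of committee votes."""
    if not aligned_hyps:
        return 0.0
    return sum(voting_entropy(v) for v in aligned_hyps) / len(aligned_hyps)

def select_utterances(pool, budget):
    """pool: dict utt_id -> aligned hypotheses; pick the most disputed."""
    ranked = sorted(pool, key=lambda u: utterance_score(pool[u]), reverse=True)
    return ranked[:budget]

if __name__ == "__main__":
    pool = {
        "utt1": [["a", "a", "a"], ["b", "b", "b"]],   # committee agrees
        "utt2": [["a", "b", "c"], ["x", "y", "x"]],   # committee disagrees
    }
    print(select_utterances(pool, budget=1))          # -> ['utt2']
```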

Patent
29 Feb 2008
TL;DR: A speaker-model selection method that accurately and stably selects speakers whose acoustic feature values are similar to those of the utterance speaker, while adapting to the speaker's acoustic feature values as they change from moment to moment.
Abstract: To enable selection of a speaker whose acoustic feature value is similar to that of the utterance speaker, with accuracy and stability, while adapting to changes even when the speaker's acoustic feature value changes from moment to moment. A speaker score calculating means (22) calculates a long-time speaker score (the log-likelihood of each of a plurality of speaker models stored in a speaker model storage section (31) with respect to the acoustic feature value) based on, for example, an arbitrary number of utterances, and calculates a short-time speaker score based on, for example, a short-time utterance. A long-time speaker selecting means (23) selects speakers corresponding to a predetermined number of speaker models having a high long-time speaker score. A short-time speaker selecting means (24) selects, from among the speakers selected by the long-time speaker selecting means (23), speakers corresponding to speaker models whose short-time speaker score is high, the number of which is smaller than the predetermined number.

22 citations
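A minimal sketch of the two-stage selection could look as follows, assuming each speaker model exposes a log-likelihood scorer over features (the `loglik` callables below are hypothetical stand-ins for GMM scoring): shortlist N speakers by the long-time score for stability, then keep M < N of them by the short-time score to track momentary changes.

```python
# Hedged sketch of the patent's two-stage speaker selection.
# The scoring interface and demo numbers are invented for illustration.

def select_speakers(models, long_feats, short_feats, n_long, m_short):
    """models: dict speaker_id -> loglik(features) callable."""
    long_scores = {s: f(long_feats) for s, f in models.items()}
    shortlist = sorted(long_scores, key=long_scores.get, reverse=True)[:n_long]
    short_scores = {s: models[s](short_feats) for s in shortlist}
    return sorted(short_scores, key=short_scores.get, reverse=True)[:m_short]

if __name__ == "__main__":
    # Toy stand-in scores instead of real GMM log-likelihoods.
    models = {
        "spk_a": lambda feats: feats["a"],
        "spk_b": lambda feats: feats["b"],
        "spk_c": lambda feats: feats["c"],
    }
    long_feats = {"a": -10.0, "b": -12.0, "c": -30.0}
    short_feats = {"a": -15.0, "b": -9.0, "c": -8.0}
    # spk_a and spk_b survive the long-time stage; spk_b wins short-time.
    print(select_speakers(models, long_feats, short_feats, 2, 1))
```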

Proceedings ArticleDOI
14 Oct 2002
TL;DR: An automatic speech-to-speech translation system for personal digital assistants (PDAs) that helps oral communication between Japanese and English speakers in various situations while traveling is presented.
Abstract: We present an automatic speech-to-speech translation system for personal digital assistants (PDAs) that helps oral communication between Japanese and English speakers in various situations while traveling. Our original compact large vocabulary continuous speech recognition engine, compact translation engine based on a lexicalized grammar, and compact Japanese speech synthesis engine lead to the development of a Japanese/English bi-directional speech translation system that works with limited computational resources.

19 citations

Patent
Tadashi Emori, Yoshifumi Onishi
30 May 2007
TL;DR: A language-model learning system that learns a language model on a discriminative criterion related to the word error rate used in speech recognition; it includes a recognizing device (101) that recognizes input speech using an acoustic model and a language model and outputs the recognized word sequence as the recognition result.
Abstract: A language-model learning system for learning a language model on a discriminative criterion related to the word error rate used in speech recognition. The language model learning system (10) includes a recognizing device (101) for recognizing input speech by using an acoustic model and a language model and outputting the recognized word sequence as the recognition result, a reliability degree computing device (103) for computing the degree of reliability of the word sequence, and a language model parameter updating device (104) for updating the parameters of the language model by using the degree of reliability. The language model parameter updating device updates the parameters of the language model so that word sequences whose computed degree of reliability is low become more reliable when the recognizing device performs recognition with the updated language model and the reliability degree computing device recomputes the degree of reliability.

18 citations
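The update loop might be sketched as below. This is deliberately a toy: the language model is reduced to a unigram weight table, the reliability scores are assumed to come from an external confidence estimator (device 103 in the patent's terms), and the additive nudge is a hypothetical stand-in for whatever parameter update the patent actually claims.

```python
# Hedged sketch of a reliability-driven language-model update.
# All names, the unigram table, and the step rule are illustrative.

def update_lm(lm, recognized, reliability, threshold=0.5, step=0.1):
    """lm: dict word -> log-prob weight; recognized: list of words;
    reliability: dict word -> confidence in [0, 1]. Words recognized
    with low reliability get their LM weight nudged upward so the next
    decoding pass can rescore them more favorably."""
    for w in recognized:
        if reliability.get(w, 1.0) < threshold:
            lm[w] = lm.get(w, 0.0) + step * (threshold - reliability[w])
    return lm

if __name__ == "__main__":
    lm = {"hello": -1.0, "world": -1.2}
    hyp = ["hello", "world"]
    conf = {"hello": 0.9, "world": 0.2}   # "world" is unreliable
    print(update_lm(lm, hyp, conf))       # "world" gets a small boost
```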


Cited by
Journal ArticleDOI
Li Deng, Xiao Li
TL;DR: This overview article provides readers with an overview of modern ML techniques as utilized in current, and as relevant to future, ASR research and systems, and presents and analyzes recent developments in deep learning and learning with sparse representations.
Abstract: Automatic Speech Recognition (ASR) has historically been a driving force behind many machine learning (ML) techniques, including the ubiquitously used hidden Markov model, discriminative learning, structured sequence learning, Bayesian learning, and adaptive learning. Moreover, ML can and occasionally does use ASR as a large-scale, realistic application to rigorously test the effectiveness of a given technique, and to inspire new problems arising from the inherently sequential and dynamic nature of speech. On the other hand, even though ASR is available commercially for some applications, it is largely an unsolved problem: for almost all applications, the performance of ASR is not on par with human performance. New insight from modern ML methodology shows great promise to advance the state of the art in ASR technology. This overview article provides readers with an overview of modern ML techniques as utilized in current, and as relevant to future, ASR research and systems. The intent is to foster further cross-pollination between the ML and ASR communities than has occurred in the past. The article is organized according to the major ML paradigms that are either already popular or have potential for making significant contributions to ASR technology. The paradigms presented and elaborated in this overview include: generative and discriminative learning; supervised, unsupervised, semi-supervised, and active learning; adaptive and multi-task learning; and Bayesian learning. These learning paradigms are motivated and discussed in the context of ASR technology and applications. Finally, we present and analyze recent developments in deep learning and learning with sparse representations, focusing on their direct relevance to advancing ASR technology.

346 citations

Patent
30 Sep 2010
TL;DR: In this article, the authors present systems, methods and non-transitory computer-readable media for performing speech recognition across different applications or environments without model customization or prior knowledge of the received speech.
Abstract: Disclosed herein are systems, methods and non-transitory computer-readable media for performing speech recognition across different applications or environments without model customization or prior knowledge of the domain of the received speech. The disclosure includes recognizing received speech with a collection of domain-specific speech recognizers, determining a speech recognition confidence for each of the speech recognition outputs, selecting speech recognition candidates based on a respective speech recognition confidence for each speech recognition output, and combining selected speech recognition candidates to generate text based on the combination.

152 citations
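A minimal sketch of the flow, under assumptions not in the patent: each domain-specific recognizer is a callable returning a (text, confidence) pair, low-confidence candidates are dropped against a fixed floor, and "combining" is reduced to picking the single most confident survivor (the disclosure leaves the exact combination method open).

```python
# Hedged sketch of combining domain-specific recognizers by confidence.
# The recognizer interface and demo strings are invented for illustration.

def combine_recognizers(recognizers, audio, floor=0.3):
    candidates = []
    for name, recognize in recognizers.items():
        text, conf = recognize(audio)
        if conf >= floor:                 # keep only confident candidates
            candidates.append((conf, name, text))
    if not candidates:
        return None
    conf, name, text = max(candidates)    # toy "combination": best survivor
    return {"text": text, "domain": name, "confidence": conf}

if __name__ == "__main__":
    recognizers = {
        "medical": lambda a: ("take two tablets daily", 0.42),
        "banking": lambda a: ("transfer two hundred dollars", 0.91),
        "generic": lambda a: ("tough for two hundred collars", 0.25),
    }
    print(combine_recognizers(recognizers, audio=b"..."))
```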

Patent
06 Nov 2012
TL;DR: In speech processing systems, compensation is made for sudden changes in the background noise in the average signal-to-noise ratio (SNR) calculation. SNR outlier filtering may be used, alone or in conjunction with weighting the average SNR, as discussed by the authors.
Abstract: In speech processing systems, compensation is made for sudden changes in the background noise in the average signal-to-noise ratio (SNR) calculation. SNR outlier filtering may be used, alone or in conjunction with weighting the average SNR. Adaptive weights may be applied to the SNRs per band before computing the average SNR. The weighting function can be a function of noise level, noise type, and/or instantaneous SNR value. Another weighting mechanism applies a null filtering or outlier filtering which sets the weight in a particular band to be zero. This particular band may be characterized as one that exhibits an SNR several times higher than the SNRs in other bands.

114 citations
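The sketch below illustrates one plausible reading of the null/outlier filtering: per-band SNRs are averaged in the linear domain with uniform base weights, and a band is zeroed out when its SNR is several times the median of the other bands. The uniform weights and the fixed ratio threshold are assumptions; the patent allows weights that depend on noise level, noise type, and instantaneous SNR.

```python
import numpy as np

# Hedged sketch of weighted average-SNR with null/outlier filtering.
# Uniform base weights and the 4x-median rule are illustrative choices.

def average_snr(snr_db, outlier_ratio=4.0):
    snr_lin = 10.0 ** (np.asarray(snr_db, dtype=float) / 10.0)
    weights = np.ones_like(snr_lin)
    for i in range(len(snr_lin)):
        others = np.delete(snr_lin, i)
        if snr_lin[i] > outlier_ratio * np.median(others):
            weights[i] = 0.0          # null filtering: drop the outlier band
    if not np.any(weights):
        weights[:] = 1.0              # fall back if every band was nulled
    avg_lin = np.sum(weights * snr_lin) / np.sum(weights)
    return 10.0 * np.log10(avg_lin)

if __name__ == "__main__":
    bands = [5.0, 6.0, 4.0, 25.0]     # one band spikes far above the rest
    print(f"average SNR without the outlier band: {average_snr(bands):.2f} dB")
```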

Patent
25 Feb 2013
TL;DR: A method and apparatus employing classifier adaptation based on field data in a deployed voice-based interactive system, comprising: collecting representations of voice characteristics, in association with corresponding speakers, the representations being generated by the deployed voice-based interactive system; and updating parameters of the classifier, used in speaker recognition, based on the representations collected.
Abstract: Typical speaker verification systems usually employ speakers' audio data collected during an enrollment phase when users enroll with the system and provide respective voice samples. Due to technical, business, or other constraints, the enrollment data may not be large enough or rich enough to encompass different inter-speaker and intra-speaker variations. According to at least one embodiment, a method and apparatus employing classifier adaptation based on field data in a deployed voice-based interactive system comprise: collecting representations of voice characteristics, in association with corresponding speakers, the representations being generated by the deployed voice-based interactive system; updating parameters of the classifier, used in speaker recognition, based on the representations collected; and employing the classifier, with the corresponding parameters updated, in performing speaker recognition.

100 citations
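As a rough sketch of the idea, assume voice characteristics arrive as fixed-length embeddings (an i-vector-like representation is a hypothetical choice here) and the "classifier" is a per-speaker mean embedding scored by cosine similarity; adaptation then becomes a running-mean update from field-collected representations.

```python
import numpy as np

# Hedged sketch of field-data classifier adaptation. The embedding
# representation, mean-based classifier, and cosine scoring are all
# illustrative stand-ins for whatever the patent actually claims.

class SpeakerClassifier:
    def __init__(self):
        self.means = {}    # speaker_id -> (mean_embedding, count)

    def adapt(self, speaker_id, embedding):
        """Fold one field-collected embedding into the speaker's mean."""
        mean, n = self.means.get(speaker_id, (np.zeros_like(embedding), 0))
        self.means[speaker_id] = ((mean * n + embedding) / (n + 1), n + 1)

    def recognize(self, embedding):
        """Return the speaker whose adapted mean is most similar."""
        def cosine(a, b):
            return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)
        return max(self.means, key=lambda s: cosine(self.means[s][0], embedding))

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    clf = SpeakerClassifier()
    for _ in range(5):                               # simulated field data
        clf.adapt("alice", rng.normal(1.0, 0.1, size=8))
        clf.adapt("bob", rng.normal(-1.0, 0.1, size=8))
    print(clf.recognize(np.full(8, 0.9)))            # -> 'alice'
```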

Patent
Antonio Nucci, Ram Keralapura
06 Aug 2009
TL;DR: A method for real-time speaker recognition that extracts a coarse feature of the speaker from the speech data and identifies the speaker as belonging to a pre-determined speaker cluster based on that coarse feature.
Abstract: A method for real-time speaker recognition including: obtaining speech data of a speaker; extracting, using a processor of a computer, a coarse feature of the speaker from the speech data; identifying the speaker as belonging to a pre-determined speaker cluster based on the coarse feature of the speaker; extracting, using the processor of the computer, a plurality of Mel-Frequency Cepstral Coefficients (MFCC) and a plurality of Gaussian Mixture Model (GMM) components from the speech data; determining a biometric signature of the speaker based on the plurality of MFCC and the plurality of GMM components; and determining in real time, using the processor of the computer, an identity of the speaker by comparing the biometric signature of the speaker to one of a plurality of biometric signature libraries associated with the pre-determined speaker cluster.

98 citations
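A minimal sketch of the two-stage flow, with several assumptions not in the patent: MFCC extraction happens upstream (frames arrive as a (T, 13) array), the coarse feature is reduced to a toy pitch-bucket rule, and the biometric signature is approximated by a scikit-learn GaussianMixture fitted to a speaker's frames and compared by average log-likelihood.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hedged sketch: coarse-cluster routing, then GMM-signature matching.
# Cluster names, the pitch rule, and the toy data are illustrative.

def coarse_cluster(mean_pitch_hz):
    """Route to a pre-determined cluster by a coarse feature (toy rule)."""
    return "low_pitch" if mean_pitch_hz < 160.0 else "high_pitch"

def fit_signature(mfcc_frames, n_components=4, seed=0):
    """Fit a GMM 'biometric signature' to one speaker's MFCC frames."""
    return GaussianMixture(n_components, random_state=seed).fit(mfcc_frames)

def identify(mfcc_frames, mean_pitch_hz, libraries):
    """libraries: cluster -> {speaker_id: fitted GMM signature}."""
    library = libraries[coarse_cluster(mean_pitch_hz)]
    return max(library, key=lambda s: library[s].score(mfcc_frames))

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    spk1 = rng.normal(0.0, 1.0, size=(300, 13))   # toy "MFCC" streams
    spk2 = rng.normal(3.0, 1.0, size=(300, 13))
    libraries = {"low_pitch": {"spk1": fit_signature(spk1),
                               "spk2": fit_signature(spk2)},
                 "high_pitch": {}}
    test = rng.normal(3.0, 1.0, size=(100, 13))
    print(identify(test, mean_pitch_hz=120.0, libraries=libraries))  # spk2
```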