Joint Learning of Phonetic Units and Word Pronunciations for ASR

Home
/
Papers
/
Joint Learning of Phonetic Units and Word Pronunciations for ASR

Proceedings Article•

Joint Learning of Phonetic Units and Word Pronunciations for ASR

Chia-ying Lee¹, Yu Zhang¹, James Glass¹•Institutions (1)

01 Oct 2013-pp 182-192

TL;DR: An unsupervised alternative ‐ requiring no language-specific knowledge ‐ to the conventional manual approach for creating pronunciation dictionaries is proposed, which jointly discovers the phonetic inventory and the Letter-to-Sound mapping rules in a language using only transcribed data.

read less

Abstract: The creation of a pronunciation lexicon remains the most inefficient process in developing an Automatic Speech Recognizer (ASR). In this paper, we propose an unsupervised alternative ‐ requiring no language-specific knowledge ‐ to the conventional manual approach for creating pronunciation dictionaries. We present a hierarchical Bayesian model, which jointly discovers the phonetic inventory and the Letter-to-Sound (L2S) mapping rules in a language using only transcribed data. When tested on a corpus of spontaneous queries, the results demonstrate the superiority of the proposed joint learning scheme over its sequential counterpart, in which the latent phonetic inventory and L2S mappings are learned separately. Furthermore, the recognizers built with the automatically induced lexicon consistently outperform grapheme-based recognizers and even approach the performance of recognition systems trained using conventional supervised procedures.

...read moreread less

Content maybe subject to copyright Report

Citations

PDF

Open Access

More filters

Proceedings Article•DOI•

Word Embeddings for Speech Recognition

[...]

Samy Bengio¹, Georg Heigold¹•Institutions (1)

Google¹

14 Sep 2014

TL;DR: This work presents here an alternative construction, where words are projected into a continuous embedding space where words that sound alike are nearby in the Euclidean sense, and shows how embeddings can still allow to score words that were not in the training dictionary.

...read moreread less

Abstract: Speech recognition systems have used the concept of states as a way to decompose words into sub-word units for decades. As the number of such states now reaches the number of words used to train acoustic models, it is interesting to consider approaches that relax the assumption that words are made of states. We present here an alternative construction, where words are projected into a continuous embedding space where words that sound alike are nearby in the Euclidean sense. We show how embeddings can still allow to score words that were not in the training dictionary. Initial experiments using a lattice rescoring approach and model combination on a large realistic dataset show improvements in word error rate.

...read moreread less

166 citations

Cites background from "Joint Learning of Phonetic Units an..."

...ples include grapheme-to-phoneme conversion [2], pronunciation learning [15, 10], and joint learning of phonetic units and word pronunciations [1, 9]....
[...]

Proceedings Article•DOI•

An attentional model for speech translation without transcription

[...]

Long Duong¹, Antonios Anastasopoulos², David Chiang², Steven Bird¹, Steven Bird³, Trevor Cohn¹ - Show less +2 more•Institutions (3)

University of Melbourne¹, University of Notre Dame², University of California, Berkeley³

01 Jun 2016

TL;DR: On the more challenging speech-to-word alignment task, the model nearly matches GIZA++’s performance on gold transcriptions, but without recourse to transcriptions or to a lexicon.

...read moreread less

Abstract: For many low-resource languages, spoken language resources are more likely to be annotated with translations than transcriptions. This bilingual speech data can be used for word-spotting, spoken document retrieval, and even for documentation of endangered languages. We experiment with the neural, attentional model applied to this data. On phoneto-word alignment and translation reranking tasks, we achieve large improvements relative to several baselines. On the more challenging speech-to-word alignment task, our model nearly matches GIZA++'s performance on gold transcriptions, but without recourse to transcriptions or to a lexicon.

...read moreread less

158 citations

Cites background from "Joint Learning of Phonetic Units an..."

...Recent work has introduced models that do not require pronunciation lexicons, but train only on speech with text transcriptions (Lee et al., 2013; Maas et al., 2015; Graves et al., 2006)....
[...]

Journal Article•DOI•

Unsupervised Lexicon Discovery from Acoustic Input

[...]

Chia-ying Lee¹, Timothy J. O'Donnell¹, James Glass¹•Institutions (1)

Massachusetts Institute of Technology¹

17 Jul 2015-Transactions of the Association for Computational Linguistics

TL;DR: It is shown that the model is competitive with state-of-the-art spoken term discovery systems, and analyses exploring the model’s behavior and the kinds of linguistic structures it learns are presented.

...read moreread less

Abstract: We present a model of unsupervised phonological lexicon discovery -- the problem of simultaneously learning phoneme-like and word-like units from acoustic input. Our model builds on earlier models of unsupervised phone-like unit discovery from acoustic data (Lee and Glass, 2012), and unsupervised symbolic lexicon discovery using the Adaptor Grammar framework (Johnson et al., 2006), integrating these earlier approaches using a probabilistic model of phonological variation. We show that the model is competitive with state-of-the-art spoken term discovery systems, and present analyses exploring the model's behavior and the kinds of linguistic structures it learns.

...read moreread less

112 citations

Cites methods from "Joint Learning of Phonetic Units an..."

...2 and employ the backward message-passing and forward-sampling algorithm described in Lee et al. (2013), designed for aligning a letter sequence and speech signals, to propose samples for ~vi and zi....
[...]

Journal Article•DOI•

Variational Inference for Acoustic Unit Discovery

[...]

Lucas Ondel¹, Lukas Burget¹, Jan Cernocký¹•Institutions (1)

Brno University of Technology¹

01 Jan 2016-Procedia Computer Science

TL;DR: This work considers Variational Bayes (VB) as alternative inference process and shows that, notwithstanding VB inference is an order of magnitude faster, it outperforms GS in terms of accuracy.

...read moreread less

104 citations

Journal Article•DOI•

Acoustic segment modeling with spectral clustering methods

[...]

Haipeng Wang¹, Tan Lee¹, Cheung-Chi Leung², Bin Ma², Haizhou Li² - Show less +1 more•Institutions (2)

The Chinese University of Hong Kong¹, Institute for Infocomm Research Singapore²

01 Feb 2015-IEEE Transactions on Audio, Speech, and Language Processing

TL;DR: This paper uses posterior features as the segment representations, and applies spectral clustering algorithms on the posterior representations of a Gaussian component clustering approach and a segment clustering (SC) approach, which could provide consistent improvement on four different testing scenarios with three evaluation metrics.

...read moreread less

Abstract: This paper presents a study of spectral clustering-based approaches to acoustic segment modeling (ASM). ASM aims at finding the underlying phoneme-like speech units and building the corresponding acoustic models in the unsupervised setting, where no prior linguistic knowledge and manual transcriptions are available. A typical ASM process involves three stages, namely initial segmentation, segment labeling, and iterative modeling. This work focuses on the improvement of segment labeling. Specifically, we use posterior features as the segment representations, and apply spectral clustering algorithms on the posterior representations. We propose a Gaussian component clustering (GCC) approach and a segment clustering (SC) approach. GCC applies spectral clustering on a set of Gaussian components, and SC applies spectral clustering on a large number of speech segments. Moreover, to exploit the complementary information of different posterior representations, a multiview segment clustering (MSC) approach is proposed. MSC simultaneously utilizes multiple posterior representations to cluster speech segments. To address the computational problem of spectral clustering in dealing with large numbers of speech segments, we use inner product similarity graph and make reformulations to avoid the explicit computation of the affinity matrix and Laplacian matrix. We carried out two sets of experiments for evaluation. First, we evaluated the ASM accuracy on the OGI-MTS dataset, and it was shown that our approach could yield 18.7% relative purity improvement and 15.1% relative NMI improvement compared with the baseline approach. Second, we examined the performances of our approaches in the real application of zero-resource query-by-example spoken term detection on SWS2012 dataset, and it was shown that our approaches could provide consistent improvement on four different testing scenarios with three evaluation metrics.

...read moreread less

57 citations

1
2
3
4
…
5
6
7
8
9

Collapse

References

PDF

Open Access

More filters

Book•

Bayesian Data Analysis

[...]

Andrew Gelman, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, Donald B. Rubin¹ - Show less +2 more•Institutions (1)

University of California, Irvine¹

01 Jan 1995

TL;DR: Detailed notes on Bayesian Computation Basics of Markov Chain Simulation, Regression Models, and Asymptotic Theorems are provided.

...read moreread less

Abstract: FUNDAMENTALS OF BAYESIAN INFERENCE Probability and Inference Single-Parameter Models Introduction to Multiparameter Models Asymptotics and Connections to Non-Bayesian Approaches Hierarchical Models FUNDAMENTALS OF BAYESIAN DATA ANALYSIS Model Checking Evaluating, Comparing, and Expanding Models Modeling Accounting for Data Collection Decision Analysis ADVANCED COMPUTATION Introduction to Bayesian Computation Basics of Markov Chain Simulation Computationally Efficient Markov Chain Simulation Modal and Distributional Approximations REGRESSION MODELS Introduction to Regression Models Hierarchical Linear Models Generalized Linear Models Models for Robust Inference Models for Missing Data NONLINEAR AND NONPARAMETRIC MODELS Parametric Nonlinear Models Basic Function Models Gaussian Process Models Finite Mixture Models Dirichlet Process Models APPENDICES A: Standard Probability Distributions B: Outline of Proofs of Asymptotic Theorems C: Computation in R and Stan Bibliographic Notes and Exercises appear at the end of each chapter.

...read moreread less

16,079 citations

Journal Article•DOI•

Bayesian data analysis.

[...]

John K. Kruschke¹•Institutions (1)

Indiana University¹

01 Sep 2010-Wiley Interdisciplinary Reviews: Cognitive Science

TL;DR: A fatal flaw of NHST is reviewed and some benefits of Bayesian data analysis are introduced and illustrative examples of multiple comparisons in Bayesian analysis of variance and Bayesian approaches to statistical power are presented.

...read moreread less

Abstract: Bayesian methods have garnered huge interest in cognitive science as an approach to models of cognition and perception. On the other hand, Bayesian methods for data analysis have not yet made much headway in cognitive science against the institutionalized inertia of 20th century null hypothesis significance testing (NHST). Ironically, specific Bayesian models of cognition and perception may not long endure the ravages of empirical verification, but generic Bayesian methods for data analysis will eventually dominate. It is time that Bayesian data analysis became the norm for empirical methods in cognitive science. This article reviews a fatal flaw of NHST and introduces the reader to some benefits of Bayesian data analysis. The article presents illustrative examples of multiple comparisons in Bayesian analysis of variance and Bayesian approaches to statistical power. Copyright © 2010 John Wiley & Sons, Ltd. For further resources related to this article, please visit the WIREs website.

...read moreread less

6,081 citations

"Joint Learning of Phonetic Units an..." refers methods in this paper

...We employ Gibbs sampling (Gelman et al., 2004) to approximate the posterior distribution of the latent variables in our model....
[...]

Journal Article•DOI•

Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences

[...]

S. Davis, Paul Mermelstein¹•Institutions (1)

bell northern research¹

01 Aug 1980-IEEE Transactions on Acoustics, Speech, and Signal Processing

TL;DR: In this article, several parametric representations of the acoustic signal were compared with regard to word recognition performance in a syllable-oriented continuous speech recognition system, and the emphasis was on the ability to retain phonetically significant acoustic information in the face of syntactic and duration variations.

...read moreread less

Abstract: Several parametric representations of the acoustic signal were compared with regard to word recognition performance in a syllable-oriented continuous speech recognition system. The vocabulary included many phonetically similar monosyllabic words, therefore the emphasis was on the ability to retain phonetically significant acoustic information in the face of syntactic and duration variations. For each parameter set (based on a mel-frequency cepstrum, a linear frequency cepstrum, a linear prediction cepstrum, a linear prediction spectrum, or a set of reflection coefficients), word templates were generated using an efficient dynamic warping method, and test data were time registered with the templates. A set of ten mel-frequency cepstrum coefficients computed every 6.4 ms resulted in the best performance, namely 96.5 percent and 95.0 percent recognition with each of two speakers. The superior performance of the mel-frequency cepstrum coefficients may be attributed to the fact that they better represent the perceptually relevant aspects of the short-term speech spectrum.

...read moreread less

4,822 citations

Journal Article•DOI•

Continuous speech recognition by statistical methods

[...]

Frederick Jelinek¹•Institutions (1)

IBM¹

01 Apr 1976

TL;DR: Experimental results are presented that indicate the power of the methods and concern modeling of a speaker and of an acoustic processor, extraction of the models' statistical parameters and hypothesis search procedures and likelihood computations of linguistic decoding.

...read moreread less

Abstract: Statistical methods useful in automatic recognition of continuous speech are described. They concern modeling of a speaker and of an acoustic processor, extraction of the models' statistical parameters and hypothesis search procedures and likelihood computations of linguistic decoding. Experimental results are presented that indicate the power of the methods.

...read moreread less

1,024 citations

"Joint Learning of Phonetic Units an..." refers methods in this paper

...These K HMMs are used to model the phonetic units in the language (Jelinek, 1976)....
[...]

Proceedings Article•DOI•

Tree-based state tying for high accuracy acoustic modelling

[...]

Steve Young¹, JJ Odell¹, Philip C. Woodland¹•Institutions (1)

University of Cambridge¹

08 Mar 1994

TL;DR: This paper describes a method of creating a tied-state continuous speech recognition system using a phonetic decision tree, which is shown to lead to similar recognition performance to that obtained using an earlier data-driven approach but to have the additional advantage of providing a mapping for unseen triphones.

...read moreread less

Abstract: The key problem to be faced when building a HMM-based continuous speech recogniser is maintaining the balance between model complexity and available training data. For large vocabulary systems requiring cross-word context dependent modelling, this is particularly acute since many such contexts will never occur in the training data. This paper describes a method of creating a tied-state continuous speech recognition system using a phonetic decision tree. This tree-based clustering is shown to lead to similar recognition performance to that obtained using an earlier data-driven approach but to have the additional advantage of providing a mapping for unseen triphones. State-tying is also compared with traditional model-based tying and shown to be clearly superior. Experimental results are presented for both the Resource Management and Wall Street Journal tasks.

...read moreread less

781 citations

Additional excerpts

...Conventionally, to train a context-dependent acoustic model, a list of questions based on the linguistic properties of phonetic units is required for growing decision tree classifiers (Young et al., 1994)....
[...]