Author

Georg Stemmer

Bio: Georg Stemmer is an academic researcher from Intel. The author has contributed to research in topics: Word error rate & Language model. The author has an h-index of 19 and has co-authored 76 publications receiving 6,165 citations. Previous affiliations of Georg Stemmer include University of Erlangen-Nuremberg & Siemens.


Papers
Proceedings Article
01 Jan 2011
TL;DR: The design of Kaldi, a free, open-source toolkit for speech recognition research, is described; it provides a speech recognition system based on finite-state transducers, together with detailed documentation and a comprehensive set of scripts for building complete recognition systems.
Abstract: We describe the design of Kaldi, a free, open-source toolkit for speech recognition research. Kaldi provides a speech recognition system based on finite-state transducers (using the freely available OpenFst), together with detailed documentation and a comprehensive set of scripts for building complete recognition systems. Kaldi is written in C++, and the core library supports modeling of arbitrary phonetic-context sizes, acoustic modeling with subspace Gaussian mixture models (SGMM) as well as standard Gaussian mixture models, together with all commonly used linear and affine transforms. Kaldi is released under the Apache License v2.0, which is highly nonrestrictive, making it suitable for a wide community of users.

5,857 citations
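
The abstract above centers on weighted finite-state transducers as the backbone of the recognizer. As a rough illustration of that idea (a toy WFST in plain Python, not Kaldi's actual C++/OpenFst API; the states, symbols, and weights below are invented for the example), a lexicon-style transducer can map a phone sequence to a word while accumulating a path cost:

```python
# Toy deterministic WFST in the tropical semiring: arcs are stored as
# {(state, in_symbol): (out_symbol, next_state, weight)}.
def transduce(fst, start, finals, inputs):
    """Run the transducer over an input symbol sequence.
    Returns (output_symbols, total_weight) or None if the input is rejected."""
    state, out, weight = start, [], 0.0
    for sym in inputs:
        arc = fst.get((state, sym))
        if arc is None:
            return None
        out_sym, state, w = arc
        if out_sym != "<eps>":
            out.append(out_sym)
        weight += w  # tropical semiring: path weights add
    return (out, weight) if state in finals else None

# A one-word lexicon L mapping the phones "k ae t" to the word "cat";
# a real decoding graph composes lexicon, context, HMM and grammar FSTs.
L = {
    (0, "k"): ("<eps>", 1, 0.5),
    (1, "ae"): ("<eps>", 2, 0.3),
    (2, "t"): ("cat", 3, 0.2),
}
print(transduce(L, start=0, finals={3}, inputs=["k", "ae", "t"]))
# (['cat'], 1.0)
```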

Patent
26 Jun 2015
TL;DR: In this article, a language model for a local speech recognition system is modified using remote speech recognition sources: text results for an utterance are received from remote recognizers, a local text result is generated using the local vocabulary, and the received and local results are compared to determine words that are out of the local vocabulary, which are then used to update it.
Abstract: A language model is modified for a local speech recognition system using remote speech recognition sources. In one example, a speech utterance is received. The speech utterance is sent to at least one remote speech recognition system. Text results corresponding to the utterance are received from the remote speech recognition system. A local text result is generated using local vocabulary. The received text results and the generated text result are compared to determine words that are out of the local vocabulary and the local vocabulary is updated using the out of vocabulary words.

92 citations
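
The patent abstract describes a simple comparison loop. A minimal sketch of that loop (the function name and the recognizer stubs below are hypothetical, not the patented implementation) might look like:

```python
# Hedged sketch of the vocabulary-update loop described in the patent abstract.
def update_local_vocabulary(utterance_audio, local_asr, remote_asrs, vocab):
    """Compare remote and local transcripts; add out-of-vocabulary words."""
    local_text = local_asr(utterance_audio)            # decoded with `vocab` only
    local_words = set(local_text.split())
    oov = set()
    for asr in remote_asrs:
        remote_text = asr(utterance_audio)
        # Words the remote recognizer produced that the local system lacks
        oov |= {w for w in remote_text.split()
                if w not in local_words and w not in vocab}
    vocab |= oov  # "the local vocabulary is updated using the OOV words"
    return oov

# Example with trivial stand-in recognizers:
vocab = {"turn", "on", "the", "lights"}
local = lambda audio: "turn on the lights"
remote = [lambda audio: "turn on the chandelier"]
print(update_local_vocabulary(b"...", local, remote, vocab))  # {'chandelier'}
```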

Proceedings ArticleDOI
04 Sep 2005
TL;DR: Based on the observation that the techniques have many similarities, the processing steps of PLP are revised and the filter-bank, the equal-loudness pre-emphasis and the input for the linear prediction are improved.
Abstract: Mel Frequency Cepstral Coefficients (MFCC) and Perceptual Linear Prediction (PLP) are the most popular acoustic features used in speech recognition. Which of the two methods leads to better performance often depends on the task. In this work we develop acoustic features that combine the advantages of MFCC and PLP. Based on the observation that the techniques have many similarities, we revise the processing steps of PLP. In particular, the filter-bank, the equal-loudness pre-emphasis and the input for the linear prediction are improved. It is shown for a broadcast news transcription task and a corpus of children’s speech that the new variant of PLP performs better than both MFCC and conventional PLP for a wide range of clean and noisy acoustic conditions.

84 citations
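
Both MFCC and PLP start from a perceptually motivated filter bank followed by a compressive nonlinearity and a decorrelating transform; that shared structure is the similarity the paper exploits. Below is a minimal numpy sketch of this common pipeline (standard textbook MFCC-style processing; the paper's revised filter bank, equal-loudness pre-emphasis, and linear-prediction input are not reproduced here):

```python
import numpy as np

def mel_filterbank(sr=16000, n_fft=512, n_mels=24):
    """Triangular mel filter bank (the standard construction, not the
    paper's revised PLP filter bank)."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = imel(np.linspace(mel(0), mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    return fb

def cepstra(frame, fb, n_ceps=13):
    """Power spectrum -> log filter-bank energies -> DCT, as in MFCC."""
    spec = np.abs(np.fft.rfft(frame, n=(fb.shape[1] - 1) * 2)) ** 2
    log_e = np.log(fb @ spec + 1e-10)
    n = len(log_e)
    dct = np.cos(np.pi / n * (np.arange(n) + 0.5)[None, :]
                 * np.arange(n_ceps)[:, None])               # DCT-II basis
    return dct @ log_e
```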

Proceedings ArticleDOI
01 Dec 2011
TL;DR: It is shown that read texts and monologues are the most meaningful texts when it comes to the automatic detection of PD based on articulation, voice, and prosodic evaluations.
Abstract: 70% to 90% of patients with Parkinson's disease (PD) show an affected voice. Various studies have revealed that voice and prosody are among the earliest indicators of PD. The aim of this study is to automatically detect whether the speech/voice of a person is affected by PD. We employ acoustic features, prosodic features and features derived from a two-mass model of the vocal folds on different kinds of speech tests: sustained phonations, syllable repetitions, read texts and monologues. Classification is performed in each case by SVMs. A correlation-based feature selection was performed in order to identify the most important features for each of these systems. With prosodic modeling we report recognition rates of 91% when differentiating between normal-speaking persons and speakers with PD in early stages; with acoustic modeling we achieved a recognition rate of 88%, and with vocal modeling 79%. After feature selection these results could be improved considerably, but we expect those improved results to be too optimistic. We show that read texts and monologues are the most meaningful speech tests for the automatic detection of PD based on articulation, voice, and prosodic evaluations. The most important prosodic features were based on energy, pauses and F0. The masses and the spring compliances were found to be the most important parameters of the two-mass vocal fold model.

73 citations
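
The classification setup above (SVMs plus correlation-based feature selection) is straightforward to sketch with scikit-learn. The sketch below uses a univariate F-test as a stand-in for the paper's correlation-based feature selection, whose exact variant the abstract does not specify, and synthetic placeholder data:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 40))    # placeholder prosodic features (energy, pauses, F0, ...)
y = rng.integers(0, 2, size=60)  # synthetic labels: 0 = control, 1 = PD

# Scale, keep the k most discriminative features, then classify with an SVM.
clf = make_pipeline(StandardScaler(), SelectKBest(f_classif, k=10), SVC(kernel="rbf"))
print(cross_val_score(clf, X, y, cv=5).mean())
```

Doing the feature selection inside the cross-validation pipeline, as above, is one way to avoid the over-optimism after feature selection that the authors caution about.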

Proceedings ArticleDOI
18 Mar 2005
TL;DR: An acoustic normalization approach to adaptive training that reduces phonetically irrelevant acoustic variability at the beginning of the training procedure by using CMLLR, estimated w.r.t. a set of target models, to normalize the acoustic features.
Abstract: Adaptive training aims at reducing the influence of speaker, channel and environment variability on the acoustic models. We describe an acoustic normalization approach to adaptive training. Phonetically irrelevant acoustic variability is reduced at the beginning of the training procedure w.r.t. a set of target models. The set of target models can be a set of HMMs or a Gaussian mixture model (GMM). CMLLR is applied to normalize the acoustic features. The normalized data contains less unwanted variability and is used to generate and train the recognition models. Employing a GMM as a target model leads to a text-independent procedure that can be embedded into the acoustic front-end. On a broadcast news transcription task we obtain relative reductions in WER of 7.8% in the first recognition pass over a conventionally trained system and of 3.4% in the second recognition pass over a SAT-trained system.

57 citations
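
CMLLR (constrained MLLR) normalizes features with a single affine transform x -> Ax + b estimated to maximize likelihood under the target model. For the special case of a single full-covariance Gaussian target, the maximum-likelihood solution has a closed form, which makes the idea easy to sketch; this is only an illustration, not the paper's procedure, since real CMLLR against an HMM or GMM target is estimated iteratively row by row:

```python
import numpy as np

def cmllr_single_gaussian(X, mu, Sigma):
    """Affine transform x -> A x + b maximizing the CMLLR objective
    N*log|A| - 0.5 * sum((A x + b - mu)^T Sigma^{-1} (A x + b - mu))
    for a single Gaussian target N(mu, Sigma). The optimum satisfies
    b = mu - A @ xbar and A S A^T = Sigma (S = sample covariance),
    e.g. A = Sigma^{1/2} S^{-1/2}: mean/covariance normalization."""
    xbar = X.mean(axis=0)
    S = np.cov(X, rowvar=False)
    def sqrtm(M):      # symmetric square root via eigendecomposition
        w, V = np.linalg.eigh(M)
        return (V * np.sqrt(w)) @ V.T
    def invsqrtm(M):   # symmetric inverse square root
        w, V = np.linalg.eigh(M)
        return (V / np.sqrt(w)) @ V.T
    A = sqrtm(Sigma) @ invsqrtm(S)
    b = mu - A @ xbar
    return A, b

# Normalized features used to train the recognition models:
#   X_norm = X @ A.T + b
```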


Cited by
Journal Article
TL;DR: It is shown that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets.
Abstract: Deep neural nets with a large number of parameters are very powerful machine learning systems. However, overfitting is a serious problem in such networks. Large networks are also slow to use, making it difficult to deal with overfitting by combining the predictions of many different large neural nets at test time. Dropout is a technique for addressing this problem. The key idea is to randomly drop units (along with their connections) from the neural network during training. This prevents units from co-adapting too much. During training, dropout samples from an exponential number of different "thinned" networks. At test time, it is easy to approximate the effect of averaging the predictions of all these thinned networks by simply using a single unthinned network that has smaller weights. This significantly reduces overfitting and gives major improvements over other regularization methods. We show that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets.

33,597 citations
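
The mechanism described above is easy to state in a few lines of numpy. This sketch uses "inverted" dropout, which rescales at training time so the network needs no change at test time; the paper instead describes scaling the weights down at test time, and the two formulations are equivalent in expectation:

```python
import numpy as np

def dropout(x, p_drop, rng, train=True):
    """Randomly zero units with probability p_drop during training and
    rescale the survivors by 1/(1 - p_drop); identity at test time."""
    if not train or p_drop == 0.0:
        return x
    mask = rng.random(x.shape) >= p_drop  # sample one "thinned" network
    return x * mask / (1.0 - p_drop)

rng = np.random.default_rng(0)
h = rng.normal(size=(4, 8))  # activations of a hidden layer
print(dropout(h, p_drop=0.5, rng=rng))
```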

Posted Content
TL;DR: This work proposes a universal unsupervised learning approach to extract useful representations from high-dimensional data, which it calls Contrastive Predictive Coding, and demonstrates that the approach is able to learn useful representations achieving strong performance on four distinct domains: speech, images, text and reinforcement learning in 3D environments.
Abstract: While supervised learning has enabled great progress in many applications, unsupervised learning has not seen such widespread adoption, and remains an important and challenging endeavor for artificial intelligence. In this work, we propose a universal unsupervised learning approach to extract useful representations from high-dimensional data, which we call Contrastive Predictive Coding. The key insight of our model is to learn such representations by predicting the future in latent space by using powerful autoregressive models. We use a probabilistic contrastive loss which induces the latent space to capture information that is maximally useful to predict future samples. It also makes the model tractable by using negative sampling. While most prior work has focused on evaluating representations for a particular modality, we demonstrate that our approach is able to learn useful representations achieving strong performance on four distinct domains: speech, images, text and reinforcement learning in 3D environments.

5,444 citations
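
The probabilistic contrastive loss mentioned above (InfoNCE) is a softmax cross-entropy in which the predicted latent must score the true future sample above a set of negatives. A minimal numpy sketch of that loss for one prediction follows; the plain dot-product score and the vector arguments are simplifications, since CPC uses a learned bilinear score on top of an autoregressive context network, which is omitted here:

```python
import numpy as np

def info_nce(z_pred, z_pos, z_negs):
    """InfoNCE-style loss for one prediction: -log softmax score of the
    positive sample against the negatives (positive at index 0)."""
    scores = np.array([z_pred @ z_pos] + [z_pred @ n for n in z_negs])
    scores -= scores.max()  # subtract max for numerical stability
    return -np.log(np.exp(scores[0]) / np.exp(scores).sum())

rng = np.random.default_rng(0)
z_pred = rng.normal(size=16)              # prediction of the future latent
z_pos = z_pred + 0.1 * rng.normal(size=16)  # true future (close to prediction)
z_negs = rng.normal(size=(8, 16))         # negative samples
print(info_nce(z_pred, z_pos, list(z_negs)))
```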

Proceedings ArticleDOI
19 Apr 2015
TL;DR: It is shown that acoustic models trained on LibriSpeech give lower error rates on the Wall Street Journal (WSJ) test sets than models trained on WSJ itself.
Abstract: This paper introduces a new corpus of read English speech, suitable for training and evaluating speech recognition systems. The LibriSpeech corpus is derived from audiobooks that are part of the LibriVox project, and contains 1000 hours of speech sampled at 16 kHz. We have made the corpus freely available for download, along with separately prepared language-model training data and pre-built language models. We show that acoustic models trained on LibriSpeech give lower error rates on the Wall Street Journal (WSJ) test sets than models trained on WSJ itself. We are also releasing Kaldi scripts that make it easy to build these systems.

4,770 citations
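
One common way to consume the corpus today is through torchaudio's built-in dataset wrapper; this usage sketch is not part of the paper itself, which releases the raw audio plus Kaldi scripts, and assumes a recent torchaudio where the wrapper returns (waveform, sample rate, transcript, speaker, chapter, utterance) tuples:

```python
import torchaudio

# Download and index the small "dev-clean" split of LibriSpeech.
data = torchaudio.datasets.LIBRISPEECH(root="./data", url="dev-clean",
                                       download=True)
waveform, sample_rate, transcript, speaker_id, chapter_id, utt_id = data[0]
assert sample_rate == 16000  # the corpus is distributed at 16 kHz
print(transcript)
```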

Proceedings ArticleDOI
18 Apr 2019
TL;DR: This work presents SpecAugment, a simple data augmentation method for speech recognition that is applied directly to the feature inputs of a neural network (i.e., filter bank coefficients) and achieves state-of-the-art performance on the LibriSpeech 960h and Switchboard 300h tasks, outperforming all prior work.
Abstract: We present SpecAugment, a simple data augmentation method for speech recognition. SpecAugment is applied directly to the feature inputs of a neural network (i.e., filter bank coefficients). The augmentation policy consists of warping the features, masking blocks of frequency channels, and masking blocks of time steps. We apply SpecAugment on Listen, Attend and Spell networks for end-to-end speech recognition tasks. We achieve state-of-the-art performance on the LibriSpeech 960h and Switchboard 300h tasks, outperforming all prior work. On LibriSpeech, we achieve 6.8% WER on test-other without the use of a language model, and 5.8% WER with shallow fusion with a language model. This compares to the previous state-of-the-art hybrid system of 7.5% WER. For Switchboard, we achieve 7.2%/14.6% on the Switchboard/CallHome portion of the Hub5'00 test set without the use of a language model, and 6.8%/14.1% with shallow fusion, which compares to the previous state-of-the-art hybrid system at 8.3%/17.3% WER.

2,758 citations
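
The masking half of the policy described above is simple to reproduce. The numpy sketch below applies frequency and time masks to a log-mel spectrogram; the time-warping step is omitted, and the mask-size parameters F and T here are illustrative placeholders rather than the paper's tuned values:

```python
import numpy as np

def spec_augment(spec, rng, F=10, T=20, n_freq_masks=1, n_time_masks=1):
    """Zero out random blocks of frequency channels and time steps in a
    (n_mels, n_frames) spectrogram, returning an augmented copy."""
    spec = spec.copy()
    n_mels, n_frames = spec.shape
    for _ in range(n_freq_masks):
        f = rng.integers(0, F + 1)                  # mask width in [0, F]
        f0 = rng.integers(0, max(n_mels - f, 0) + 1)
        spec[f0:f0 + f, :] = 0.0
    for _ in range(n_time_masks):
        t = rng.integers(0, T + 1)                  # mask length in [0, T]
        t0 = rng.integers(0, max(n_frames - t, 0) + 1)
        spec[:, t0:t0 + t] = 0.0
    return spec

rng = np.random.default_rng(0)
mel = rng.normal(size=(80, 300))  # placeholder log-mel spectrogram
aug = spec_augment(mel, rng)
```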

Proceedings ArticleDOI
15 Apr 2018
TL;DR: This paper uses data augmentation, consisting of added noise and reverberation, as an inexpensive method to multiply the amount of training data and improve robustness of deep neural network embeddings for speaker recognition.
Abstract: In this paper, we use data augmentation to improve performance of deep neural network (DNN) embeddings for speaker recognition. The DNN, which is trained to discriminate between speakers, maps variable-length utterances to fixed-dimensional embeddings that we call x-vectors. Prior studies have found that embeddings leverage large-scale training datasets better than i-vectors. However, it can be challenging to collect substantial quantities of labeled data for training. We use data augmentation, consisting of added noise and reverberation, as an inexpensive method to multiply the amount of training data and improve robustness. The x-vectors are compared with i-vector baselines on Speakers in the Wild and NIST SRE 2016 Cantonese. We find that while augmentation is beneficial in the PLDA classifier, it is not helpful in the i-vector extractor. However, the x-vector DNN effectively exploits data augmentation, due to its supervised training. As a result, the x-vectors achieve superior performance on the evaluation datasets.

2,300 citations
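
The additive-noise half of the augmentation described above amounts to mixing a noise recording into the speech at a chosen signal-to-noise ratio. A minimal numpy sketch follows (the function name is hypothetical, and the reverberation half, convolving the speech with a room impulse response, is omitted):

```python
import numpy as np

def add_noise_at_snr(speech, noise, snr_db, rng):
    """Mix 1-D float arrays so that speech power / scaled noise power
    equals 10^(snr_db / 10); the noise is tiled/cropped to fit."""
    reps = int(np.ceil(len(speech) / len(noise)))
    start = rng.integers(0, reps * len(noise) - len(speech) + 1)
    noise = np.tile(noise, reps)[start:start + len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

rng = np.random.default_rng(0)
speech = rng.normal(size=16000)  # placeholder 1 s utterance at 16 kHz
noise = rng.normal(size=4000)    # placeholder noise snippet
noisy = add_noise_at_snr(speech, noise, snr_db=10, rng=rng)
```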