Author

András Zolnay

Bio: András Zolnay is an academic researcher from RWTH Aachen University. The author has contributed to research in the topics of linear discriminant analysis and word error rate, has an h-index of 9, and has co-authored 9 publications receiving 295 citations.

Papers
Proceedings ArticleDOI
18 Mar 2005
TL;DR: Experiments performed on the large-vocabulary task VerbMobil II (German conversational speech) show that the accuracy of automatic speech recognition systems can be improved by the combination of different acoustic features.
Abstract: In this paper, we consider the use of multiple acoustic features of the speech signal for robust speech recognition. We investigate the combination of various auditory based (mel frequency cepstrum coefficients, perceptual linear prediction, etc.) and articulatory based (voicedness) features. Features are combined by linear discriminant analysis and log-linear model combination based techniques. We describe the two feature combination techniques and compare the experimental results. Experiments performed on the large-vocabulary task VerbMobil II (German conversational speech) show that the accuracy of automatic speech recognition systems can be improved by the combination of different acoustic features.
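As a loose illustration of the log-linear model combination technique mentioned above, the following Python sketch combines per-state scores from two hypothetical acoustic models with fixed interpolation weights and renormalizes. The weights, state count, and scores are invented for illustration; in practice such weights would be optimized, not hand-set:

```python
import numpy as np

def log_linear_combine(log_probs_a, log_probs_b, lam_a=0.6, lam_b=0.4):
    """Log-linear combination of two acoustic models' per-state scores:
    log p(s|x) = lam_a * log p_a(s|x) + lam_b * log p_b(s|x) - log Z.
    Weights here are illustrative, not trained values."""
    score = lam_a * log_probs_a + lam_b * log_probs_b
    # Renormalize over states via log-sum-exp for numerical stability
    score -= np.logaddexp.reduce(score, axis=-1, keepdims=True)
    return score

# Two stand-in models (e.g. MFCC-based and PLP-based) scoring 4 HMM states
lp_mfcc = np.log(np.array([0.7, 0.1, 0.1, 0.1]))
lp_plp = np.log(np.array([0.5, 0.3, 0.1, 0.1]))
combined = log_linear_combine(lp_mfcc, lp_plp)
print(np.exp(combined))  # combined posterior over the 4 states
```

The renormalization makes the combined scores a proper distribution again, which is what allows the combined model to be used in place of a single acoustic model during decoding.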

78 citations

Proceedings Article
01 Jan 2002
TL;DR: A voiced-unvoiced measure was combined with the standard Mel Frequency Cepstral Coefficients using linear discriminant analysis (LDA) to choose the most relevant features for continuous speech recognition.
Abstract: In this paper, a voiced-unvoiced measure is used as an acoustic feature for continuous speech recognition. The voiced-unvoiced measure was combined with the standard Mel Frequency Cepstral Coefficients (MFCC) using linear discriminant analysis (LDA) to choose the most relevant features. Experiments were performed on the SieTill (German digit strings recorded over telephone lines) and on the SPINE (English spontaneous speech under different simulated noisy environments) corpora. The additional voiced-unvoiced measure results in improvements in word error rate (WER) of up to 11% relative to using MFCC alone with the same overall number of parameters in the system.
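A crude voicedness feature of this kind can be sketched as the peak of a frame's normalized autocorrelation within the plausible pitch-lag range. This is illustrative only, not the paper's exact voiced-unvoiced measure; the sampling rate and pitch range below are assumptions:

```python
import numpy as np

def voicedness(frame, fs=8000, fmin=50, fmax=400):
    """Peak of the normalized autocorrelation in the pitch-lag range.
    Periodic (voiced) frames score near 1, noise-like (unvoiced) frames low."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    if ac[0] <= 0:
        return 0.0
    ac = ac / ac[0]                      # normalize by frame energy
    lo, hi = fs // fmax, fs // fmin      # lags for the 50-400 Hz pitch range
    return float(ac[lo:hi].max())

fs = 8000
t = np.arange(int(0.032 * fs)) / fs      # one 32 ms frame
voiced = np.sin(2 * np.pi * 120 * t)     # periodic, pitch-like signal
unvoiced = np.random.default_rng(1).normal(size=t.size)  # noise-like signal
print(voicedness(voiced, fs), voicedness(unvoiced, fs))
```

Appending one such scalar per frame to the MFCC vector, then letting LDA pick discriminative directions, mirrors the combination scheme described in the abstract.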

43 citations

Proceedings ArticleDOI
04 Sep 2005
TL;DR: The proposed method exploits the bandlimited interpolation idea (in the frequency domain) to do the necessary frequency-warping and yields exact results as long as the cepstral coefficients are quefrency-limited.
Abstract: In this paper, we show that frequency-warping (including VTLN) can be implemented through linear transformation of conventional MFCC. Unlike the Pitz-Ney [1] continuous-domain approach, we directly determine the relation between frequency-warping and the linear transformation in the discrete domain. The advantage of such an approach is that it can be applied to any frequency-warping and is not limited to cases where an analytical closed-form solution can be found. The proposed method exploits the bandlimited interpolation idea (in the frequency domain) to do the necessary frequency-warping and yields exact results as long as the cepstral coefficients are quefrency-limited. This idea of quefrency-limitedness shows the importance of the filter-bank smoothing of the spectra, which has been ignored in [1, 2]. Furthermore, unlike [1], since we operate in the discrete domain, we can also apply the usual discrete cosine transform (i.e., DCT-II) on the logarithm of the filter-bank output to get conventional MFCC features. Therefore, using our proposed method, we can linearly transform conventional MFCC cepstra to do VTLN and we do not require any recomputation of the warped features. We provide experimental results in support of this approach.
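The key claim, that frequency-warping acts as a single linear transformation on the cepstra, can be sketched numerically. The toy below uses a simple piecewise-linear warp with linear interpolation rather than the paper's bandlimited interpolation, and plain DCT-II cepstra in place of full MFCC, so it illustrates only the linearity, not the exact method:

```python
import numpy as np

N = 20                                    # cepstral coefficients = spectral bins
n = np.arange(N)
# Orthonormal DCT-II matrix: maps log-spectrum samples -> cepstra
C = np.cos(np.pi * np.outer(n, n + 0.5) / N)
C[0] *= np.sqrt(1.0 / N)
C[1:] *= np.sqrt(2.0 / N)
C_inv = C.T                               # orthonormal, so inverse = transpose

def warp_matrix(alpha):
    """Resampling matrix for the warp omega -> alpha*omega (clipped at the
    band edge), using linear interpolation between spectral bins.
    A hypothetical simple warp; the paper uses bandlimited interpolation."""
    W = np.zeros((N, N))
    for k in range(N):
        x = min(alpha * k, N - 1)         # warped, possibly fractional, bin
        i = int(np.floor(x))
        f = x - i
        W[k, i] += 1.0 - f
        if i + 1 < N:
            W[k, i + 1] += f
    return W

alpha = 0.9
A = C @ warp_matrix(alpha) @ C_inv        # cepstral-domain VTLN matrix

cep = np.random.default_rng(0).normal(size=N)
logspec = C_inv @ cep                                # cepstra -> log-spectrum
warped_direct = C @ (warp_matrix(alpha) @ logspec)   # warp spectrum, re-cepstrum
warped_linear = A @ cep                              # one matrix multiply
print(np.allclose(warped_direct, warped_linear))
```

Because the warp is a fixed matrix in the spectral domain, conjugating it by the DCT gives a fixed matrix A in the cepstral domain, so warped features never need to be recomputed from the spectrum.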

35 citations

Journal ArticleDOI
TL;DR: The results show that the accuracy of automatic speech recognition systems can be significantly improved by the combination of auditory and articulatory motivated features.

33 citations

Proceedings Article
01 Jan 2006
TL;DR: It is shown that the combination of acoustic features using LDA does not consistently lead to improvements in word error rate; relative improvements in word error rate of up to 5% were observed for LDA-based combination of multiple acoustic features.
Abstract: In this paper, Linear Discriminant Analysis (LDA) is investigated with respect to the combination of different acoustic features for automatic speech recognition. It is shown that the combination of acoustic features using LDA does not consistently lead to improvements in word error rate. A detailed analysis of the recognition results on the Verbmobil (VM II) and on the English portion of the European Parliament Plenary Sessions (EPPS) corpus is given. This includes an independent analysis of the effect of the dimension of the input to LDA, the effect of strongly correlated input features, as well as a detailed numerical analysis of the generalized eigenvalue problem underlying LDA. Relative improvements in word error rate of up to 5% were observed for LDA-based combination of multiple acoustic features.
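The generalized eigenvalue problem underlying LDA that the analysis refers to can be sketched on toy data. Dimensions, class counts, and separations below are invented, not the paper's setup; the point is that LDA reduces to solving Sb v = lambda Sw v, with at most C-1 informative directions:

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
C_cls, n_per, d = 3, 100, 6
# Toy stand-in for stacked acoustic feature vectors with 3 frame classes
X = rng.normal(size=(C_cls * n_per, d))
y = np.repeat(np.arange(C_cls), n_per)
for c in range(C_cls):
    X[y == c, c] += 3.0                  # shift each class along its own axis

mean = X.mean(axis=0)
Sw = np.zeros((d, d))                    # within-class scatter
Sb = np.zeros((d, d))                    # between-class scatter
for c in range(C_cls):
    Xc = X[y == c]
    mc = Xc.mean(axis=0)
    Sw += (Xc - mc).T @ (Xc - mc)
    Sb += len(Xc) * np.outer(mc - mean, mc - mean)

# LDA projection directions solve the generalized eigenproblem Sb v = lambda Sw v
eigvals, eigvecs = eigh(Sb, Sw)          # eigenvalues in ascending order
print(np.sum(eigvals > 1e-6))            # rank of Sb: at most C_cls - 1 = 2
```

The near-zero eigenvalues are exactly the degeneracy the paper's numerical analysis examines: strongly correlated input features push Sw toward singularity and make these eigenpairs ill-conditioned.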

31 citations


Cited by
Journal ArticleDOI
01 Oct 1980

1,565 citations

Journal ArticleDOI
TL;DR: Current advances in automatic speech recognition (ASR) and spoken language systems are outlined, along with deficiencies in dealing with the variation naturally present in speech.

507 citations

01 Jan 2001
TL;DR: The probability of any event is the ratio between the value at which an expectation depending on the happening of the event ought to be computed, and the value of the thing expected upon its happening.
Abstract: Problem: Given the number of times in which an unknown event has happened and failed: Required the chance that the probability of its happening in a single trial lies somewhere between any two degrees of probability that can be named. SECTION 1. Definitions: 1. Several events are inconsistent, when if one of them happens, none of the rest can. 2. Two events are contrary when one, or other of them must; and both together cannot happen. 3. An event is said to fail, when it cannot happen; or, which comes to the same thing, when its contrary has happened. 4. An event is said to be determined when it has either happened or failed. 5. The probability of any event is the ratio between the value at which an expectation depending on the happening of the event ought to be computed, and the value of the thing expected upon its happening.
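In modern notation, the quantity the essay asks for is, under a uniform prior, the posterior probability that the unknown chance $x$ of the event lies between two bounds $a$ and $b$ after $p$ occurrences and $q$ failures:

```latex
\Pr(a \le x \le b \mid p, q)
  = \frac{\int_a^b x^p (1-x)^q \, dx}{\int_0^1 x^p (1-x)^q \, dx}
```

This is the cumulative probability of a $\mathrm{Beta}(p+1,\, q+1)$ posterior distribution, the result now known as Bayes' theorem in its original form.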

368 citations

Journal ArticleDOI
TL;DR: Alignment templates, phrase-based models, and stochastic finite-state transducers are used to develop computer-assisted translation systems in a European project in two real tasks.
Abstract: Current machine translation (MT) systems are still not perfect. In practice, the output from these systems needs to be edited to correct errors. A way of increasing the productivity of the whole translation process (MT plus human work) is to incorporate the human correction activities within the translation process itself, thereby shifting the MT paradigm to that of computer-assisted translation. This model entails an iterative process in which the human translator activity is included in the loop: In each iteration, a prefix of the translation is validated (accepted or amended) by the human and the system computes its best (or n-best) translation suffix hypothesis to complete this prefix. A successful framework for MT is the so-called statistical (or pattern recognition) framework. Interestingly, within this framework, the adaptation of MT systems to the interactive scenario affects mainly the search process, allowing a great reuse of successful techniques and models. In this article, alignment templates, phrase-based models, and stochastic finite-state transducers are used to develop computer-assisted translation systems. These systems were assessed in a European project (TransType2) in two real tasks: the translation of printer manuals and the translation of the Bulletin of the European Union. In each task, the following three pairs of languages were involved (in both translation directions): English-Spanish, English-German, and English-French.

238 citations

Journal ArticleDOI
TL;DR: This paper expands T-F unit features to include gammatone frequency cepstral coefficients (GFCC), mel-frequency cepstral coefficients, relative spectral transform (RASTA) and perceptual linear prediction (PLP), and proposes to use a group Lasso approach to select complementary features in a principled way.
Abstract: Monaural speech segregation has been a very challenging problem for decades. By casting speech segregation as a binary classification problem, recent advances have been made in computational auditory scene analysis on segregation of both voiced and unvoiced speech. So far, pitch and amplitude modulation spectrogram have been used as two main kinds of time-frequency (T-F) unit level features in classification. In this paper, we expand T-F unit features to include gammatone frequency cepstral coefficients (GFCC), mel-frequency cepstral coefficients, relative spectral transform (RASTA) and perceptual linear prediction (PLP). Comprehensive comparisons are performed in order to identify effective features for classification-based speech segregation. Our experiments in matched and unmatched test conditions show that these newly included features significantly improve speech segregation performance. Specifically, GFCC and RASTA-PLP are the best single features in matched-noise and unmatched-noise test conditions, respectively. We also find that pitch-based features are crucial for good generalization to unseen environments. To further explore complementarity in terms of discriminative power, we propose to use a group Lasso approach to select complementary features in a principled way. The final combined feature set yields promising results in both matched and unmatched test conditions.
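A minimal sketch of group-Lasso feature-group selection, assuming a plain least-squares objective solved by proximal gradient descent. The paper's actual solver, objective, and features are not specified here; the group sizes, regularization strength, and synthetic data below are invented:

```python
import numpy as np

def group_lasso(X, y, groups, lam=0.1, lr=None, n_iter=500):
    """Proximal gradient descent for
    min_w 0.5/n ||y - Xw||^2 + lam * sum_g ||w_g||_2.
    The group penalty zeroes out whole feature groups at once.
    `groups` maps each column of X to a group id."""
    n, d = X.shape
    if lr is None:
        lr = 1.0 / (np.linalg.norm(X, 2) ** 2 / n)  # 1 / Lipschitz constant
    w = np.zeros(d)
    for _ in range(n_iter):
        w = w - lr * (X.T @ (X @ w - y) / n)        # gradient step
        for g in np.unique(groups):                 # group soft-thresholding
            idx = groups == g
            norm = np.linalg.norm(w[idx])
            w[idx] = 0.0 if norm <= lr * lam else w[idx] * (1 - lr * lam / norm)
    return w

rng = np.random.default_rng(0)
n, d = 200, 9
X = rng.normal(size=(n, d))
groups = np.repeat([0, 1, 2], 3)            # three feature groups of size 3
w_true = np.concatenate([[1.0, -1.0, 0.5], np.zeros(6)])  # only group 0 matters
y = X @ w_true + 0.01 * rng.normal(size=n)

w = group_lasso(X, y, groups, lam=0.05)
norms = [float(np.linalg.norm(w[groups == g])) for g in np.unique(groups)]
print([nrm > 1e-6 for nrm in norms])        # which groups were kept
```

Selecting complementary feature groups (rather than individual coefficients) is the property that makes group Lasso a natural fit for choosing among whole feature families like GFCC, RASTA-PLP, and pitch-based features.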

192 citations