
Showing papers presented at "International Conference on Spoken Language Processing in 2004"



Proceedings ArticleDOI
04 Oct 2004
TL;DR: The cosine transform coefficients of the approximated sub-band envelopes, computed recursively from the all-pole polynomials, are used as inputs to a TRAP-based speech recognition system and are shown to improve recognition accuracy.
Abstract: Autoregressive modeling is applied for approximating the temporal evolution of spectral density in critical-band-sized sub-bands of a segment of the speech signal. The generalized autocorrelation linear predictive technique allows for a compromise between fitting the peaks and the troughs of the Hilbert envelope of the signal in the sub-band. The cosine transform coefficients of the approximated sub-band envelopes, computed recursively from the all-pole polynomials, are used as inputs to a TRAP-based speech recognition system and are shown to improve recognition accuracy.
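
A minimal sketch of the frequency-domain linear prediction idea behind this paper, in Python. It assumes a DCT-based implementation with Burg-method LPC from librosa; the authors' generalized autocorrelation technique, the recursive cepstrum computation, and the TRAP back-end are not reproduced, and the band indices, model order, and lengths are illustrative.

    import numpy as np
    import librosa
    from scipy.fftpack import dct

    def fdlp_envelope(x, band, order=20, npts=100):
        """All-pole approximation of the temporal envelope in one sub-band;
        band = (lo, hi) indexes into the DCT (frequency) axis."""
        X = dct(x, norm="ortho")                          # frequency-domain view of the segment
        a = librosa.lpc(X[band[0]:band[1]], order=order)  # LP across frequency
        # The all-pole "spectrum" of frequency-domain LP coefficients traces
        # the squared Hilbert envelope of that sub-band over time.
        H = 1.0 / np.abs(np.fft.rfft(a, 2 * npts))
        return H[:npts] ** 2

    def envelope_features(env, ncoef=8):
        # Cosine-transform coefficients of the log envelope, TRAP-style inputs.
        return dct(np.log(env + 1e-10), norm="ortho")[:ncoef]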

61 citations


Proceedings ArticleDOI
04 Oct 2004
TL;DR: This article showed that a language with more phonemes will allow shorter words and reduced embedding of short words within longer ones, decreasing the potential for spurious lexical competitors to be activated by speech signals.
Abstract: Language-specific differences in the size and distribution of the phonemic repertoire can have implications for the task facing listeners in recognising spoken words. A language with more phonemes will allow shorter words and reduced embedding of short words within longer ones, decreasing the potential for spurious lexical competitors to be activated by speech signals. We demonstrate that this is the case via comparative analyses of the vocabularies of English and Spanish. A language which uses suprasegmental as well as segmental contrasts, however, can substantially reduce the extent of spurious embedding.
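
The embedding analysis lends itself to a small illustration: given phonemic transcriptions, count how often vocabulary items occur embedded inside longer ones. The toy lexicon below is an assumption for demonstration; the authors analysed the full English and Spanish vocabularies.

    def count_embeddings(lexicon):
        """lexicon: dict mapping word -> tuple of phoneme symbols."""
        items = list(lexicon.values())
        total = 0
        for long_w in items:
            for short_w in items:
                if len(short_w) >= len(long_w):
                    continue
                # occurrences of the shorter item inside the longer one
                total += sum(long_w[i:i + len(short_w)] == short_w
                             for i in range(len(long_w) - len(short_w) + 1))
        return total

    toy = {"can": ("k", "ae", "n"),
           "candle": ("k", "ae", "n", "d", "ah", "l")}
    print(count_embeddings(toy))   # 1: "can" is embedded in "candle"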

32 citations


Proceedings Article
01 Jan 2004
TL;DR: In this paper, the authors present analyses, modifications, and first experiments with a new nonsense-syllable database, and discuss the results of preliminary phoneme recognition experiments.
Abstract: The paper presents analyses, modifications, and first experiments with a new nonsense-syllable database. Results of preliminary phoneme recognition experiments are given and discussed.

30 citations


Proceedings Article
01 Oct 2004
TL;DR: Two techniques are presented to bridge the gap between the true and the estimated clean speech features in the context of Model-Based Feature Enhancement (MBFE) for noise-robust speech recognition.
Abstract: In this paper we present two techniques to bridge the gap between the true and the estimated clean speech features in the context of Model-Based Feature Enhancement (MBFE) for noise-robust speech recognition. While some residual uncertainty remains in the output of every feature enhancement algorithm, this information is currently mostly discarded. Firstly, we explain how not only a global MMSE estimate of clean speech but also several alternative (state-conditional) estimates can be generated and supplied to the back-end for recognition. Secondly, we explore the benefits of calculating the variance of the front-end estimate and incorporating it in the acoustic models of the recogniser. Experiments on the Aurora2 task confirmed the superior performance of the resulting system: an average increase in recognition accuracy from 85.65% to 88.50% was obtained for the clean training condition.
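
The second technique amounts to uncertainty decoding: the front-end's residual variance inflates the (diagonal) covariance of each Gaussian when the back-end evaluates likelihoods. A minimal sketch of that evaluation, with assumed names and shapes rather than the authors' MBFE code:

    import numpy as np

    def gaussian_loglik(x, mean, var, feat_var):
        """Log-likelihood of the clean-speech estimate x under N(mean, var),
        with the front-end uncertainty feat_var added to the model variance."""
        v = var + feat_var
        return -0.5 * np.sum(np.log(2 * np.pi * v) + (x - mean) ** 2 / v)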

25 citations


Proceedings ArticleDOI
04 Oct 2004
TL;DR: This paper develops a methodology for the design of audio-video data corpora of the speaking face, using the principles of data specification, data description and statistical representation, both from an application-driven and from a scientifically motivated perspective.
Abstract: This paper develops a methodology for the design of audio-video data corpora of the speaking face. Existing corpora are surveyed, and the principles of data specification, data description and statistical representation are analysed both from an application-driven and from a scientifically motivated perspective. Furthermore, the possibility of “opportunistic” design of speaking-face data corpora is considered.

16 citations


Proceedings ArticleDOI
01 Jan 2004
TL;DR: The experiments indicate that binary distinctive features can be used to effectively represent the phonological context, and that including a pitch accent feature in the input improves the prediction of pronunciation variation on a ToBI-labeled subset of the Switchboard corpus.
Abstract: Pronunciation variation in conversational speech causes a significant number of word errors in large-vocabulary automatic speech recognition. Rule-based and decision-tree-based approaches have previously been proposed to model pronunciation variation. In this paper, we report our work on modeling pronunciation variation using artificial neural networks (ANN). The results we achieved are significantly better than previously published ones on two different corpora, indicating that ANNs may be better suited to modeling pronunciation variation than other statistical models investigated so far. Our experiments indicate that binary distinctive features can be used to effectively represent the phonological context. We also find that including a pitch accent feature in the input improves the prediction of pronunciation variation on a ToBI-labeled subset of the Switchboard corpus.
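
An illustrative sketch (not the authors' system) of the modeling setup: a small MLP maps a window of binary distinctive-feature vectors, plus a pitch-accent bit, to the surface realization of the central phone. Feature counts, window size, and the random placeholder data are assumptions.

    import numpy as np
    from sklearn.neural_network import MLPClassifier

    n_feats = 14          # binary distinctive features per phone (assumed)
    window = 3            # left, centre, right phonological context
    X = np.random.randint(0, 2, (200, window * n_feats + 1))  # +1: pitch accent
    y = np.random.randint(0, 2, 200)          # 0 = canonical, 1 = altered

    clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500,
                        random_state=0).fit(X, y)
    print(clf.predict_proba(X[:1]))           # realization probabilities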

15 citations


Proceedings Article
01 Jan 2004
TL;DR: The OSALPC technique is applied to the problem of speaker identification in noisy conditions, and it is shown that the technique achieves much better results than both LPC and mel-cepstrum parameterizations in this task.
Abstract: The OSALPC (One-Sided Autocorrelation Linear Predictive Coding) representation of the speech signal has been shown to be attractive for speech recognition because of its simplicity and its high recognition performance with respect to standard LPC in severely noisy conditions. In this paper the OSALPC technique is applied to the problem of speaker identification in noisy conditions. Experimental results with additive white noise show that the technique also achieves much better results than both LPC and mel-cepstrum parameterizations in this task.
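
A minimal sketch of the OSALPC idea, assuming librosa for the LPC step: standard LPC analysis is applied to the one-sided autocorrelation sequence of the frame rather than to the frame itself, which is what lends the representation its robustness to additive noise. The model order and lag count are illustrative.

    import numpy as np
    import librosa

    def osalpc(frame, order=12, nlags=64):
        # One-sided (positive-lag) autocorrelation of the windowed frame.
        r = np.correlate(frame, frame, mode="full")[len(frame) - 1:][:nlags]
        # LPC analysis on r rather than on the frame itself.
        return librosa.lpc(r, order=order)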

5 citations


Proceedings ArticleDOI
04 Oct 2004
TL;DR: A novel scheme to analyze the effects of the time variability of the vocal tract for speaker recognition, adopting a pitch-synchronous feature extraction method; the results show that slowly varying components contain more speaker-discriminative information than rapidly varying components do.
Abstract: A novel scheme to analyze the effects of the time variability of the vocal tract for speaker recognition is proposed. We adopt a pitch-synchronous feature extraction method to capture the characteristics of the vocal tract in greater detail, and decompose the features into rapidly varying and slowly varying components with a specified linear filter along the time axis. Speaker identification tasks are performed with a weighted combination of the two decomposed feature sets and their corresponding models, to show the effectiveness of each decomposed feature set. Simulation results show that slowly varying components contain more speaker-discriminative information than rapidly varying components do.
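
The decomposition step can be sketched as follows: each coefficient trajectory (one value per pitch period) is split into a slowly varying part by low-pass filtering along the time axis, with the residual as the rapidly varying part. The filter type and cutoff are assumptions, not the paper's exact filter.

    import numpy as np
    from scipy.signal import butter, filtfilt

    def decompose(features, cutoff=0.1):
        """features: (n_frames, n_dims) pitch-synchronous feature matrix;
        needs a few dozen frames for the zero-phase filter's padding."""
        b, a = butter(2, cutoff)          # normalized cutoff in (0, 1)
        slow = filtfilt(b, a, features, axis=0)
        return slow, features - slow      # slowly / rapidly varying parts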

4 citations


Proceedings ArticleDOI
04 Oct 2004
TL;DR: This paper discusses issues related to the usability of speech interaction systems, including the usability concept, different design approaches, the design process, and evaluation questions.
Abstract: This paper discusses different issues related to the usability of speech interaction systems: the usability concept, different design approaches, the design process, and evaluation questions. Usability is a very fuzzy concept, especially as it relates to speech interaction systems: it is hard to measure and is very much context dependent. The traditional user-centered design approach may not be suitable for speech interaction system design, since users might not have enough knowledge to see what the technology can do. Usage-centered design may be the better method, but there is as yet no comprehensive theory and methodology for its design process and evaluation.

4 citations


Proceedings ArticleDOI
04 Oct 2004
TL;DR: Dutch and English listeners' processing of six English phonemes was studied in a phoneme monitoring experiment; lexical mediation was found to play a similar role for the Dutch and the English listeners, and there were no differences in the amount of lexical mediation between 'difficult' and 'easy' phonemes for the Dutch listeners.
Abstract: This study investigates whether the inaccurate processing of non-native phonemes leads to representations of word forms containing these phonemes that are not native-like. Dutch and English listeners' processing of six English phonemes was studied in a phoneme monitoring experiment. Half of the target phonemes were difficult for the Dutch listeners to identify. Lexical mediation was found to play a similar role for the Dutch and the English listeners, and there were no differences in the amount of lexical mediation between 'difficult' and 'easy' phonemes for the Dutch listeners. This suggests that the inaccurate processing of non-native phonemes does not necessarily lead to representations of word forms containing these phonemes that are not native-like.

Proceedings ArticleDOI
04 Oct 2004
TL;DR: An investigation of the perception of American English phonemes by Dutch listeners proficient in English reveals that noise affected the responses of native and non-native listeners similarly.
Abstract: We report an investigation of the perception of American English phonemes by Dutch listeners proficient in English. Listeners identified either the consonant or the vowel in most of the possible English CV and VC syllables. The syllables were embedded in multispeaker babble at three signal-to-noise ratios (16 dB, 8 dB, and 0 dB). Effects of signal-to-noise ratio on vowel and consonant identification are discussed as a function of syllable position and of the relationship to the native phoneme inventory. Comparison of the results with previously reported data from native listeners reveals that noise affected the responses of native and non-native listeners similarly.
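
Embedding a syllable in babble at a fixed signal-to-noise ratio can be sketched as below; the noise gain follows from the dB definition of SNR. This is an illustrative reconstruction, not the authors' stimulus-preparation code.

    import numpy as np

    def mix_at_snr(speech, babble, snr_db):
        noise = babble[:len(speech)]
        p_s, p_n = np.mean(speech ** 2), np.mean(noise ** 2)
        gain = np.sqrt(p_s / (p_n * 10 ** (snr_db / 10.0)))
        return speech + gain * noise   # SNR = 10*log10(p_s / (gain^2 * p_n))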

Proceedings ArticleDOI
04 Oct 2004
TL;DR: A Logistic Regression Analysis revealed that three parameters (RIME1 DURATION, RIME2 DURATION, and C3 DURATION) contributed most reliably to an S#WS versus SW#S classification.
Abstract: The aim of this study was to determine if Dutch speakers reliably signal phrase-internal lexical boundaries, and if so, how. Six speakers recorded 4 pairs of phonemically identical strong-weak-strong (SWS) strings with matching syllable boundaries but mismatching intended word boundaries (e.g. reis # pastei versus reispas # tij, or more broadly C1V1(C)#C2V2(C)C3V3(C) vs. C1V1(C)C2V2(C)#C3V3(C)). An Analysis of Variance revealed 3 acoustic parameters that were significantly greater in S#WS items (C2 DURATION, RIME1 DURATION, C3 BURST AMPLITUDE) and 5 parameters that were significantly greater in the SW#S items (C2 VOT, C3 DURATION, RIME2 DURATION, RIME3 DURATION, and V2 AMPLITUDE). Additionally, center of gravity measurements suggested that the [s] to [t] coarticulation was greater in reis # pa[st]ei than in reispa[s] # [t]ij. Finally, a Logistic Regression Analysis revealed that three parameters (RIME1 DURATION, RIME2 DURATION, and C3 DURATION) contributed most reliably to an S#WS versus SW#S classification.
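
The classification step is a plain logistic regression over the three most reliable cues; a sketch with placeholder data (the measured durations are not reproduced here):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    X = np.random.rand(48, 3)        # [rime1_dur, rime2_dur, c3_dur] per token
    y = np.random.randint(0, 2, 48)  # 0 = S#WS, 1 = SW#S
    model = LogisticRegression().fit(X, y)
    print(model.coef_)               # each cue's pull toward one boundary type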

Proceedings ArticleDOI
04 Oct 2004
TL;DR: An experimental study indicates that the visual presentation of a multi-modal IVIS can act as a redundant or complementary modality to auditory presentation, which helps relieve the driver's resource demand.
Abstract: Selecting the right modality for the interaction between drivers and the in-vehicle information system (IVIS) is crucial for safety reasons. This paper presents an experimental study addressing this subject, carried out in a 160-degree car-driving simulation lab with 10 participants. We compared the subjects' driving behavior under two interaction modalities with a simple IVIS: speech input/output only, and speech input with speech+visual output. To judge the safety of the subjects' driving performance, two measures of dangerous driving were recorded: the average deviation above the speed limit and the average deviation of the car out of its lane. The results indicate no significant difference in driving performance when synthetic speech replaces the visual display in the IVIS, suggesting that the visual presentation of a multi-modal IVIS can act as a redundant or complementary modality to auditory presentation, which helps relieve the driver's resource demand.

Proceedings ArticleDOI
04 Oct 2004
TL;DR: An investigation of how listeners of two unrelated languages process phonotactically legitimate and illegitimate sounds spoken in Dutch and American English; the authors propose that in non-native speech perception, phonotactic legitimacy in the native language speeds up phoneme recognition, the richness of acoustic-phonetic cues improves listening accuracy, and familiarity with the non-native language modulates the relative influence of these two factors.
Abstract: We investigated how listeners of two unrelated languages, Dutch and Korean, process phonotactically legitimate and illegitimate sounds spoken in Dutch and American English. To Dutch listeners, unreleased word-final stops are phonotactically illegal because word-final stops in Dutch are generally released in isolation; to Korean listeners, released final stops are illegal because word-final stops are never released in Korean. Two phoneme monitoring experiments showed a phonotactic effect: Dutch listeners detected released stops more rapidly than unreleased stops, whereas the reverse was true for Korean listeners. Korean listeners with English stimuli detected released stops more accurately than unreleased stops, however, suggesting that acoustic-phonetic cues associated with released stops improve detection accuracy. We propose that in non-native speech perception, phonotactic legitimacy in the native language speeds up phoneme recognition, the richness of acoustic-phonetic cues improves listening accuracy, and familiarity with the non-native language modulates the relative influence of these two factors.

Proceedings Article
01 Jan 2004
TL;DR: The authors investigate speech rhythm in the different Arabic dialects, which have consistently been described as stress-timed, in comparison with languages belonging to other rhythm categories such as French and Catalan.
Abstract: This paper raises questions about the discrete or continuous nature of rhythm classes. Within this framework, our study investigates speech rhythm in the different Arabic dialects, which have consistently been described as stress-timed in comparison with languages belonging to other rhythm categories. Preliminary evidence from perceptual experiments revealed that listeners use speech rhythm cues to distinguish speakers of Arabic from North Africa from those of the Middle East. In an attempt to elucidate the reasons for this perceptual discrimination, an acoustic investigation based on duration measurements was carried out, using the percentage of vocalic intervals (%V) and the standard deviation of consonantal intervals (ΔC). This experiment reveals that despite their rhythmic differences, all Arabic dialects still cluster with stress-timed languages, exhibiting a distribution different from that of languages belonging to other rhythm categories such as French and Catalan. Moreover, our study suggests that there is no such thing as clear-cut rhythm classes but rather overlapping categories. As a means of comparison, we also used Pairwise Variability Indices to validate the reliability of our findings.
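
The duration metrics used here are straightforward to compute from a segmentation into vocalic and consonantal intervals; a sketch (Ramus-style %V and ΔC, plus a raw Pairwise Variability Index):

    import numpy as np

    def rhythm_metrics(vocalic, consonantal):
        """vocalic, consonantal: lists of interval durations in seconds."""
        v, c = np.asarray(vocalic), np.asarray(consonantal)
        percent_v = 100.0 * v.sum() / (v.sum() + c.sum())
        delta_c = c.std()                      # std of consonantal intervals
        return percent_v, delta_c

    def raw_pvi(durations):
        d = np.asarray(durations, dtype=float)
        return np.mean(np.abs(np.diff(d)))     # normalized variants also exist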

Proceedings ArticleDOI
04 Oct 2004
TL;DR: It is shown that the performance of a speaker recognition system is closely connected to the mutual information between features and speaker, and upper and lower bounds for the performance are derived.
Abstract: In this paper, we develop theory for speaker recognition based on information theory. We show that the performance of a speaker recognition system is closely connected to the mutual information between the features and the speaker, and derive upper and lower bounds for the performance. We apply the theory to the case where the speech is coded and transmitted over a packet-based channel in which packet losses occur. The theory gives important insights into which methods can improve recognition performance and which methods are meaningless.
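
One standard bound of this kind, shown here for illustration (the paper's exact bounds may differ), is Fano's inequality for identifying one of N equiprobable speakers S from features X:

    \[
      P_e \;\ge\; \frac{H(S \mid X) - 1}{\log_2 N}
            \;=\; \frac{\log_2 N - I(X;S) - 1}{\log_2 N},
    \]

so the error probability of any identification rule is lower-bounded once the mutual information I(X;S) is known, and channel degradations such as packet loss that reduce I(X;S) raise that floor.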

Proceedings ArticleDOI
04 Oct 2004
TL;DR: Preliminary results of a perception experiment with English-speaking listeners suggest that the absence of release bursts is most detrimental to the intelligibility of [k], least for [p] and intermediate for [t].
Abstract: This study compared acoustic characteristics of final stops in Korean and Thai. Word-final stops are phonetically realized as unreleased stops in these languages. Native speakers of Korean and Thai produced monosyllabic words ending with [p t k] in each of their native languages (L1). Formant frequencies of /i a u/ at the vowel’s offset were examined. In both languages, the place effect was significant and interacted with the vowel type. For non-front vowels (/a/ and /u/), F2 offset was highest before [t], while for the front vowel (/i/), it was highest before [k]. Preliminary results of a perception experiment with English-speaking listeners suggest that the absence of release bursts is most detrimental to the intelligibility of [k], least for [p] and intermediate for [t].