
Showing papers presented at "International Conference on Spoken Language Processing in 2004"



Proceedings ArticleDOI
04 Oct 2004
TL;DR: The cosine transform coefficients of the approximated sub-band envelopes, computed recursively from the all-pole polynomials, are used as inputs to a TRAP-based speech recognition system and are shown to improve recognition accuracy.
Abstract: Autoregressive modeling is applied for approximating the temporal evolution of spectral density in critical-band-sized sub-bands of a segment of the speech signal. The generalized autocorrelation linear predictive technique allows for a compromise between fitting the peaks and the troughs of the Hilbert envelope of the signal in the sub-band. The cosine transform coefficients of the approximated sub-band envelopes, computed recursively from the all-pole polynomials, are used as inputs to a TRAP-based speech recognition system and are shown to improve recognition accuracy.
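
A minimal sketch of the frequency-domain linear prediction idea behind this paper, in Python. It assumes a DCT-based implementation with Burg-method LPC from librosa; the authors' generalized autocorrelation technique, the recursive cepstrum computation, and the TRAP back-end are not reproduced, and the band indices, model order, and lengths are illustrative.

    import numpy as np
    import librosa
    from scipy.fftpack import dct

    def fdlp_envelope(x, band, order=20, npts=100):
        """All-pole approximation of the temporal envelope in one sub-band;
        band = (lo, hi) indexes into the DCT (frequency) axis."""
        X = dct(x, norm="ortho")                          # frequency-domain view of the segment
        a = librosa.lpc(X[band[0]:band[1]], order=order)  # LP across frequency
        # The all-pole "spectrum" of frequency-domain LP coefficients traces
        # the squared Hilbert envelope of that sub-band over time.
        H = 1.0 / np.abs(np.fft.rfft(a, 2 * npts))
        return H[:npts] ** 2

    def envelope_features(env, ncoef=8):
        # Cosine-transform coefficients of the log envelope, TRAP-style inputs.
        return dct(np.log(env + 1e-10), norm="ortho")[:ncoef]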

61 citations


Proceedings ArticleDOI
04 Oct 2004
TL;DR: This article showed that a language with more phonemes will allow shorter words and reduced embedding of short words within longer ones, decreasing the potential for spurious lexical competitors to be activated by speech signals.
Abstract: Language-specific differences in the size and distribution of the phonemic repertoire can have implications for the task facing listeners in recognising spoken words. A language with more phonemes will allow shorter words and reduced embedding of short words within longer ones, decreasing the potential for spurious lexical competitors to be activated by speech signals. We demonstrate that this is the case via comparative analyses of the vocabularies of English and Spanish. A language which uses suprasegmental as well as segmental contrasts, however, can substantially reduce the extent of spurious embedding.
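
The embedding analysis lends itself to a small illustration: given phonemic transcriptions, count how often vocabulary items occur embedded inside longer ones. The toy lexicon below is an assumption for demonstration; the authors analysed the full English and Spanish vocabularies.

    def count_embeddings(lexicon):
        """lexicon: dict mapping word -> tuple of phoneme symbols."""
        items = list(lexicon.values())
        total = 0
        for long_w in items:
            for short_w in items:
                if len(short_w) >= len(long_w):
                    continue
                # occurrences of the shorter item inside the longer one
                total += sum(long_w[i:i + len(short_w)] == short_w
                             for i in range(len(long_w) - len(short_w) + 1))
        return total

    toy = {"can": ("k", "ae", "n"),
           "candle": ("k", "ae", "n", "d", "ah", "l")}
    print(count_embeddings(toy))   # 1: "can" is embedded in "candle"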

32 citations


Proceedings Article
01 Jan 2004
TL;DR: In this paper, the authors present analyses, modifications, and first experiments with a new nonsense-syllable database, and discuss the results of preliminary phoneme recognition experiments.
Abstract: The paper presents analyses, modifications, and first experiments with a new nonsense-syllable database. Results of preliminary phoneme recognition experiments are given and discussed.

30 citations


Proceedings Article
01 Oct 2004
TL;DR: Two techniques are presented to bridge the gap between the true and the estimated clean speech features in the context of Model-Based Feature Enhancement (MBFE) for noise-robust speech recognition.
Abstract: In this paper we present two techniques to bridge the gap between the true and the estimated clean speech features in the context of Model-Based Feature Enhancement (MBFE) for noise-robust speech recognition. While some residual uncertainty remains in the output of every feature enhancement algorithm, this information is currently mostly discarded. Firstly, we explain how not only a global MMSE estimate of clean speech but also several alternative (state-conditional) estimates can be generated and supplied to the back-end for recognition. Secondly, we explore the benefits of calculating the variance of the front-end estimate and incorporating it in the acoustic models of the recogniser. Experiments on the Aurora2 task confirmed the superior performance of the resulting system: an average increase in recognition accuracy from 85.65% to 88.50% was obtained for the clean training condition.
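
The second technique amounts to uncertainty decoding: the front-end's residual variance inflates the (diagonal) covariance of each Gaussian when the back-end evaluates likelihoods. A minimal sketch of that evaluation, with assumed names and shapes rather than the authors' MBFE code:

    import numpy as np

    def gaussian_loglik(x, mean, var, feat_var):
        """Log-likelihood of the clean-speech estimate x under N(mean, var),
        with the front-end uncertainty feat_var added to the model variance."""
        v = var + feat_var
        return -0.5 * np.sum(np.log(2 * np.pi * v) + (x - mean) ** 2 / v)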

25 citations


Proceedings ArticleDOI
04 Oct 2004
TL;DR: This paper develops a methodology for the design of audio-video data corpora of the speaking face, using the principles of data specification, data description and statistical representation, both from an application-driven and from a scientifically motivated perspective.
Abstract: This paper develops a methodology for the design of audio-video data corpora of the speaking face. Existing corpora are surveyed, and the principles of data specification, data description and statistical representation are analysed both from an application-driven and from a scientifically motivated perspective. Furthermore, the possibility of “opportunistic” design of speaking-face data corpora is considered.

16 citations


Proceedings ArticleDOI
01 Jan 2004
TL;DR: The experiments indicate that binary distinctive features can be used to effectively represent the phonological context, and that including a pitch accent feature in the input improves the prediction of pronunciation variation on a ToBI-labeled subset of the Switchboard corpus.
Abstract: Pronunciation variation in conversational speech causes a significant number of word errors in large-vocabulary automatic speech recognition. Rule-based and decision-tree-based approaches have previously been proposed to model pronunciation variation. In this paper, we report our work on modeling pronunciation variation using artificial neural networks (ANN). The results we achieved are significantly better than previously published ones on two different corpora, indicating that ANNs may be better suited to modeling pronunciation variation than other statistical models investigated so far. Our experiments indicate that binary distinctive features can be used to effectively represent the phonological context. We also find that including a pitch accent feature in the input improves the prediction of pronunciation variation on a ToBI-labeled subset of the Switchboard corpus.
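
An illustrative sketch (not the authors' system) of the modeling setup: a small MLP maps a window of binary distinctive-feature vectors, plus a pitch-accent bit, to the surface realization of the central phone. Feature counts, window size, and the random placeholder data are assumptions.

    import numpy as np
    from sklearn.neural_network import MLPClassifier

    n_feats = 14          # binary distinctive features per phone (assumed)
    window = 3            # left, centre, right phonological context
    X = np.random.randint(0, 2, (200, window * n_feats + 1))  # +1: pitch accent
    y = np.random.randint(0, 2, 200)          # 0 = canonical, 1 = altered

    clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500,
                        random_state=0).fit(X, y)
    print(clf.predict_proba(X[:1]))           # realization probabilities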

15 citations


Proceedings Article
01 Jan 2004
TL;DR: The OSALPC technique is applied to the problem of speaker identification in noisy conditions, and it is shown that the technique achieves much better results than both LPC and mel-cepstrum parameterizations in this task.
Abstract: The OSALPC (One-Sided Autocorrelation Linear Predictive Coding) representation of the speech signal has been shown to be attractive for speech recognition because of its simplicity and its high recognition performance with respect to standard LPC in severely noisy conditions. In this paper the OSALPC technique is applied to the problem of speaker identification in noisy conditions. Experimental results with additive white noise show that the technique also achieves much better results than both LPC and mel-cepstrum parameterizations in this task.
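
A minimal sketch of the OSALPC idea, assuming librosa for the LPC step: standard LPC analysis is applied to the one-sided autocorrelation sequence of the frame rather than to the frame itself, which is what lends the representation its robustness to additive noise. The model order and lag count are illustrative.

    import numpy as np
    import librosa

    def osalpc(frame, order=12, nlags=64):
        # One-sided (positive-lag) autocorrelation of the windowed frame.
        r = np.correlate(frame, frame, mode="full")[len(frame) - 1:][:nlags]
        # LPC analysis on r rather than on the frame itself.
        return librosa.lpc(r, order=order)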

5 citations


Proceedings ArticleDOI
04 Oct 2004
TL;DR: A novel scheme to analyze the effects of the time variability of the vocal tract for speaker recognition, adopting a pitch-synchronous feature extraction method; the results show that slowly varying components contain more speaker-discriminative information than rapidly varying components do.
Abstract: A novel scheme to analyze the effects of the time variability of the vocal tract for speaker recognition is proposed. We adopt a pitch-synchronous feature extraction method to capture the characteristics of the vocal tract in greater detail, and decompose the features into rapidly varying and slowly varying components with a specified linear filter along the time axis. Speaker identification tasks are performed with a weighted combination of the two decomposed feature sets and their corresponding models, to show the effectiveness of each decomposed feature set. Simulation results show that slowly varying components contain more speaker-discriminative information than rapidly varying components do.
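
The decomposition step can be sketched as follows: each coefficient trajectory (one value per pitch period) is split into a slowly varying part by low-pass filtering along the time axis, with the residual as the rapidly varying part. The filter type and cutoff are assumptions, not the paper's exact filter.

    import numpy as np
    from scipy.signal import butter, filtfilt

    def decompose(features, cutoff=0.1):
        """features: (n_frames, n_dims) pitch-synchronous feature matrix;
        needs a few dozen frames for the zero-phase filter's padding."""
        b, a = butter(2, cutoff)          # normalized cutoff in (0, 1)
        slow = filtfilt(b, a, features, axis=0)
        return slow, features - slow      # slowly / rapidly varying parts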

4 citations


Proceedings ArticleDOI
04 Oct 2004
TL;DR: This paper discusses issues related to the usability of speech interaction systems, including the usability concept, different design approaches, the design process, and evaluation questions.
Abstract: This paper discusses different issues related to the usability of speech interaction systems: the usability concept, different design approaches, the design process, and evaluation questions. Usability is a very fuzzy concept, especially as it relates to speech interaction systems: it is hard to measure and is very much context dependent. The traditional user-centered design approach may not be suitable for speech interaction system design, since users might not have enough knowledge to see what the technology can do. Usage-centered design may be the better method, but there is as yet no comprehensive theory and methodology for its design process and evaluation.

4 citations


Proceedings ArticleDOI
04 Oct 2004
TL;DR: Dutch and English listeners' processing of six English phonemes was studied in a phoneme monitoring experiment; lexical mediation was found to play a similar role for the Dutch and the English listeners, and there were no differences in the amount of lexical mediation between 'difficult' and 'easy' phonemes for the Dutch listeners.
Abstract: This study investigates whether the inaccurate processing of non-native phonemes leads to representations of word forms containing these phonemes that are not native-like. Dutch and English listeners' processing of six English phonemes was studied in a phoneme monitoring experiment. Half of the target phonemes were difficult for the Dutch listeners to identify. Lexical mediation was found to play a similar role for the Dutch and the English listeners, and there were no differences in the amount of lexical mediation between 'difficult' and 'easy' phonemes for the Dutch listeners. This suggests that the inaccurate processing of non-native phonemes does not necessarily lead to representations of word forms containing these phonemes that are not native-like.

Proceedings ArticleDOI
04 Oct 2004
TL;DR: An investigation of the perception of American English phonemes by Dutch listeners proficient in English reveals that noise affected the responses of native and non-native listeners similarly.
Abstract: We report an investigation of the perception of American English phonemes by Dutch listeners proficient in English. Listeners identified either the consonant or the vowel in most of the possible English CV and VC syllables. The syllables were embedded in multispeaker babble at three signal-to-noise ratios (16 dB, 8 dB, and 0 dB). Effects of signal-to-noise ratio on vowel and consonant identification are discussed as a function of syllable position and of the relationship to the native phoneme inventory. Comparison of the results with previously reported data from native listeners reveals that noise affected the responses of native and non-native listeners similarly.
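
Embedding a syllable in babble at a fixed signal-to-noise ratio can be sketched as below; the noise gain follows from the dB definition of SNR. This is an illustrative reconstruction, not the authors' stimulus-preparation code.

    import numpy as np

    def mix_at_snr(speech, babble, snr_db):
        noise = babble[:len(speech)]
        p_s, p_n = np.mean(speech ** 2), np.mean(noise ** 2)
        gain = np.sqrt(p_s / (p_n * 10 ** (snr_db / 10.0)))
        return speech + gain * noise   # SNR = 10*log10(p_s / (gain^2 * p_n))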

Proceedings ArticleDOI
04 Oct 2004
TL;DR: A Logistic Regression Analysis revealed that three parameters (RIME1 DURATION, RIME2 DURATION, and C3 DURATION) contributed most reliably to an S#WS versus SW#S classification.
Abstract: The aim of this study was to determine if Dutch speakers reliably signal phrase-internal lexical boundaries, and if so, how. Six speakers recorded 4 pairs of phonemically identical strong-weak-strong (SWS) strings with matching syllable boundaries but mismatching intended word boundaries (e.g. reis # pastei versus reispas # tij, or more broadly C1V1(C)#C2V2(C)C3V3(C) vs. C1V1(C)C2V2(C)#C3V3(C)). An Analysis of Variance revealed 3 acoustic parameters that were significantly greater in S#WS items (C2 DURATION, RIME1 DURATION, C3 BURST AMPLITUDE) and 5 parameters that were significantly greater in the SW#S items (C2 VOT, C3 DURATION, RIME2 DURATION, RIME3 DURATION, and V2 AMPLITUDE). Additionally, center of gravity measurements suggested that the [s] to [t] coarticulation was greater in reis # pa[st]ei than in reispa[s] # [t]ij. Finally, a Logistic Regression Analysis revealed that three parameters (RIME1 DURATION, RIME2 DURATION, and C3 DURATION) contributed most reliably to an S#WS versus SW#S classification.
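
The classification step is a plain logistic regression over the three most reliable cues; a sketch with placeholder data (the measured durations are not reproduced here):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    X = np.random.rand(48, 3)        # [rime1_dur, rime2_dur, c3_dur] per token
    y = np.random.randint(0, 2, 48)  # 0 = S#WS, 1 = SW#S
    model = LogisticRegression().fit(X, y)
    print(model.coef_)               # each cue's pull toward one boundary type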

Proceedings ArticleDOI
04 Oct 2004
TL;DR: An experimental study indicates that the visual presentation of a multi-modal IVIS can act as a redundant or complementary modality to auditory presentation, which helps relieve the driver's resource demand.
Abstract: Selecting the right modality for the interaction between drivers and the in-vehicle information system (IVIS) is crucial for safety reasons. This paper presents an experimental study addressing this subject, carried out in a 160-degree car-driving simulation lab with 10 participants. We compared the subjects' driving behavior under two interaction modalities with a simple IVIS: speech input/output only, and speech input with speech+visual output. To judge the safety of the subjects' driving performance, two measures of dangerous driving were recorded: the average deviation above the speed limit and the average deviation of the car out of its lane. The results indicate no significant difference in driving performance when synthetic speech replaces the visual display in the IVIS, suggesting that the visual presentation of a multi-modal IVIS can act as a redundant or complementary modality to auditory presentation, which helps relieve the driver's resource demand.

Proceedings ArticleDOI
04 Oct 2004
TL;DR: An investigation of how listeners of two unrelated languages process phonotactically legitimate and illegitimate sounds spoken in Dutch and American English; the authors propose that in non-native speech perception, phonotactic legitimacy in the native language speeds up phoneme recognition, the richness of acoustic-phonetic cues improves listening accuracy, and familiarity with the non-native language modulates the relative influence of these two factors.
Abstract: We investigated how listeners of two unrelated languages, Dutch and Korean, process phonotactically legitimate and illegitimate sounds spoken in Dutch and American English. To Dutch listeners, unreleased word-final stops are phonotactically illegal because word-final stops in Dutch are generally released in isolation; to Korean listeners, released final stops are illegal because word-final stops are never released in Korean. Two phoneme monitoring experiments showed a phonotactic effect: Dutch listeners detected released stops more rapidly than unreleased stops, whereas the reverse was true for Korean listeners. Korean listeners with English stimuli detected released stops more accurately than unreleased stops, however, suggesting that acoustic-phonetic cues associated with released stops improve detection accuracy. We propose that in non-native speech perception, phonotactic legitimacy in the native language speeds up phoneme recognition, the richness of acoustic-phonetic cues improves listening accuracy, and familiarity with the non-native language modulates the relative influence of these two factors.

Proceedings Article
01 Jan 2004
TL;DR: The authors investigate speech rhythm in the different Arabic dialects, which have consistently been described as stress-timed, in comparison with languages belonging to other rhythm categories such as French and Catalan.
Abstract: This paper raises questions about the discrete or continuous nature of rhythm classes. Within this framework, our study investigates speech rhythm in the different Arabic dialects, which have consistently been described as stress-timed in comparison with languages belonging to other rhythm categories. Preliminary evidence from perceptual experiments revealed that listeners use speech rhythm cues to distinguish speakers of Arabic from North Africa from those of the Middle East. In an attempt to elucidate the reasons for this perceptual discrimination, an acoustic investigation based on duration measurements was carried out, using the percentage of vocalic intervals (%V) and the standard deviation of consonantal intervals (ΔC). This experiment reveals that despite their rhythmic differences, all Arabic dialects still cluster with stress-timed languages, exhibiting a distribution different from that of languages belonging to other rhythm categories such as French and Catalan. Moreover, our study suggests that there is no such thing as clear-cut rhythm classes but rather overlapping categories. As a means of comparison, we also used Pairwise Variability Indices to validate the reliability of our findings.
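
The duration metrics used here are straightforward to compute from a segmentation into vocalic and consonantal intervals; a sketch (Ramus-style %V and ΔC, plus a raw Pairwise Variability Index):

    import numpy as np

    def rhythm_metrics(vocalic, consonantal):
        """vocalic, consonantal: lists of interval durations in seconds."""
        v, c = np.asarray(vocalic), np.asarray(consonantal)
        percent_v = 100.0 * v.sum() / (v.sum() + c.sum())
        delta_c = c.std()                      # std of consonantal intervals
        return percent_v, delta_c

    def raw_pvi(durations):
        d = np.asarray(durations, dtype=float)
        return np.mean(np.abs(np.diff(d)))     # normalized variants also exist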

Proceedings ArticleDOI
04 Oct 2004
TL;DR: It is shown that the performance of a speaker recognition system is closely connected to the mutual information between features and speaker, and upper and lower bounds for the performance are derived.
Abstract: In this paper, we develop theory for speaker recognition based on information theory. We show that the performance of a speaker recognition system is closely connected to the mutual information between the features and the speaker, and derive upper and lower bounds for the performance. We apply the theory to the case where the speech is coded and transmitted over a packet-based channel in which packet losses occur. The theory gives important insights into which methods can improve recognition performance and which methods are meaningless.
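
One standard bound of this kind, shown here for illustration (the paper's exact bounds may differ), is Fano's inequality for identifying one of N equiprobable speakers S from features X:

    \[
      P_e \;\ge\; \frac{H(S \mid X) - 1}{\log_2 N}
            \;=\; \frac{\log_2 N - I(X;S) - 1}{\log_2 N},
    \]

so the error probability of any identification rule is lower-bounded once the mutual information I(X;S) is known, and channel degradations such as packet loss that reduce I(X;S) raise that floor.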

Proceedings ArticleDOI
04 Oct 2004
TL;DR: Preliminary results of a perception experiment with English-speaking listeners suggest that the absence of release bursts is most detrimental to the intelligibility of [k], least for [p] and intermediate for [t].
Abstract: This study compared acoustic characteristics of final stops in Korean and Thai. Word-final stops are phonetically realized as unreleased stops in these languages. Native speakers of Korean and Thai produced monosyllabic words ending with [p t k] in each of their native languages (L1). Formant frequencies of /i a u/ at the vowel’s offset were examined. In both languages, the place effect was significant and interacted with the vowel type. For non-front vowels (/a/ and /u/), F2 offset was highest before [t], while for the front vowel (/i/), it was highest before [k]. Preliminary results of a perception experiment with English-speaking listeners suggest that the absence of release bursts is most detrimental to the intelligibility of [k], least for [p] and intermediate for [t].