
Showing papers on "Voice published in 1998"


Journal ArticleDOI
TL;DR: This article assessed the contribution of various phonetic and phonological factors to the perception of global foreign accent: Spanish speakers of fluent but heavily accented English recorded English phrases containing sounds, or sequences of sounds, whose production is characteristically difficult for native speakers of Spanish.

221 citations


Journal ArticleDOI
TL;DR: It is found that, contrary to some previous claims, children did not perform better with fricative consonants than with stops in a phoneme recognition task, and that preschoolers and kindergartners were more likely to judge mistakenly that a syllable began with a target phoneme.

65 citations


Journal ArticleDOI
TL;DR: The authors showed that the generative phonological distinction between lexical and surface representation can explain the apparently contradictory orders of acquisition of L2 voice and aspiration contrasts by native speakers of English.
Abstract: In this article, we show that the generative phonological distinction between lexical and surface representation can explain apparently contradictory orders of acquisition of L2 voice and aspiration contrasts by native speakers of English. Cross-language speech perception research has shown that English speakers distinguish synthetic voice onset time counterparts of aspirated–unaspirated minimal pairs more readily than voiced–voiceless. Here, we present evidence that in the perceptual acquisition of the same Thai contrasts, English speakers acquire voicing before aspiration. These divergent orders are argued to be due to the levels of representation tapped by the methodologies employed in each case: surface representations in the earlier studies, and lexical in the present one. The resulting difference in outcomes is attributed to the presence of aspiration in surface, but not lexical, representations in English (Chomsky and Halle, 1968). To address the further question of whether allophonic aspiration in...

59 citations


Proceedings ArticleDOI
12 May 1998
TL;DR: HMM-based connected digit recognition experiments show that voicing features and spectral information are complementary and that improved speech recognition performance is obtained by combining the two sources of information.
Abstract: We investigate a class of features related to voicing parameters that indicate whether the vocal cords are vibrating. Features describing voicing characteristics of speech signals are integrated with an existing 38-dimensional feature vector consisting of first and second order time derivatives of the frame energy and of the cepstral coefficients with their first and second derivatives. HMM-based connected digit recognition experiments comparing the traditional and extended feature sets show that voicing features and spectral information are complementary and that improved speech recognition performance is obtained by combining the two sources of information.
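The abstract does not spell out the voicing features themselves; a minimal sketch of one standard choice is shown below — a normalized-autocorrelation voicing degree appended to an existing spectral feature vector. The function names and lag search range are illustrative assumptions, not the paper's actual features.

```python
import numpy as np

def voicing_degree(frame, min_lag=20, max_lag=160):
    """Peak normalized autocorrelation over candidate pitch lags (0..1).

    High values indicate a periodic (voiced) frame; the lag range is an
    illustrative choice covering roughly 50-400 Hz pitch at 8 kHz sampling.
    """
    frame = frame - frame.mean()
    energy = float(np.dot(frame, frame))
    if energy == 0.0:
        return 0.0
    best = 0.0
    for lag in range(min_lag, min(max_lag, len(frame) - 1)):
        r = float(np.dot(frame[:-lag], frame[lag:])) / energy
        best = max(best, r)
    return min(1.0, max(0.0, best))

def extend_features(spectral_vec, frame):
    """Append a voicing feature to an existing spectral feature vector,
    mirroring the idea of combining the two sources of information."""
    return np.concatenate([spectral_vec, [voicing_degree(frame)]])
```

A periodic frame scores high on the voicing feature while a noise frame scores near zero, which is what makes it complementary to purely spectral features.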

49 citations


Patent
13 Jul 1998
TL;DR: A voicing cut-off frequency quantizer (62) quantizes the estimated voicing cut-off frequency value and provides, for respective samples, a voicing cut-off frequency index signal (6) which may be stored or transmitted.
Abstract: A speech coding system and associated method rely on a speech encoder (15) and a speech decoder (20). The speech encoder (15) includes a voicing cut-off frequency analyzer (60), which comprises a voicing cut-off frequency estimator (61) and a voicing cut-off frequency quantizer (62). The estimator (61) estimates a voicing cut-off frequency value for respective samples of an input speech waveform (1). To accomplish this, it utilizes a bandpass filter to estimate a frequency above which a sample of speech is voiced and below which the sample of speech is unvoiced. The quantizer (62) quantizes the estimated voicing cut-off frequency value and provides, for respective samples, a voicing cut-off frequency index signal (6) which may be stored or transmitted. The voicing cut-off frequency index signal (6) may comprise as few as 1 bit, and in a preferred embodiment, as few as 3 bits.
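The patent summary says the index may use as few as 3 bits but gives no quantizer details; below is a hedged sketch of a uniform 3-bit scalar quantizer. The frequency range (0-4000 Hz, suiting 8 kHz telephone-band speech) and the bin-centre reconstruction are illustrative assumptions, not the patent's design.

```python
def quantize_vcof(fc_hz, n_bits=3, f_min=0.0, f_max=4000.0):
    """Map a voicing cut-off frequency (Hz) to a uniform n-bit index."""
    levels = 2 ** n_bits
    step = (f_max - f_min) / levels
    idx = int((fc_hz - f_min) / step)
    return max(0, min(levels - 1, idx))  # clamp out-of-range inputs

def dequantize_vcof(idx, n_bits=3, f_min=0.0, f_max=4000.0):
    """Reconstruct the cut-off frequency at the centre of the index's bin."""
    step = (f_max - f_min) / (2 ** n_bits)
    return f_min + (idx + 0.5) * step
```

With 3 bits over 4 kHz each bin is 500 Hz wide, so the reconstruction error is at most 250 Hz — the price of transmitting only 3 bits per sample.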

41 citations



Proceedings ArticleDOI
12 May 1998
TL;DR: A knowledge-based acoustic-phonetic system for the automatic recognition of fricatives, in speaker independent continuous speech, is proposed that uses an auditory-based front-end processing and incorporates new algorithms for the extraction and manipulation of the acoustic- phonetic features that proved to be rich in their information content.
Abstract: The acoustic-phonetic characteristics and the automatic recognition of the American English fricatives are investigated. The acoustic features that exist in the literature are evaluated and new features are proposed. To test the value of the extracted features, a knowledge-based acoustic-phonetic system for the automatic recognition of fricatives, in speaker-independent continuous speech, is proposed. The system uses an auditory-based front-end processing and incorporates new algorithms for the extraction and manipulation of the acoustic-phonetic features that proved to be rich in their information content. Several features, which describe the relative amplitude, location of the most dominant peak, spectral shape and duration of unvoiced portion, are combined in the recognition process. Recognition accuracies of 95% for voicing detection and 93% for place-of-articulation detection are obtained for TIMIT database continuous speech of 22 speakers from 5 different dialect regions.
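Of the cues listed, the location of the most dominant spectral peak is the easiest to sketch. The short FFT-based version below is an illustrative stand-in for the paper's auditory-based front end; the window, FFT analysis, and sampling rate are assumptions.

```python
import numpy as np

def dominant_peak_hz(frame, sr=16000):
    """Frequency (Hz) of the largest peak of a frame's magnitude spectrum,
    a coarse cue to fricative place of articulation."""
    windowed = frame * np.hanning(len(frame))
    spec = np.abs(np.fft.rfft(windowed))
    freqs = np.fft.rfftfreq(len(frame), 1.0 / sr)
    return float(freqs[np.argmax(spec)])
```

In a real system this scalar would be combined with relative amplitude, spectral shape, and unvoiced-portion duration before any recognition decision is made.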

27 citations


Journal ArticleDOI
TL;DR: Accuracy increased over four conditions; the authors claim that speech articulations are concerned directly with reproducing perceptual phenomena and that their ability to do so accurately may be constrained by processing load.
Abstract: Speech data from a single child with a phonological impairment were analysed with a view to assessing the influence of utterance mode (spontaneous vs confrontation naming vs repetition), lexical status (word vs non-word) and phonological context (voicing status and position in word) on the accuracy of production of velar targets. Under these conditions, accuracy was found to vary between 'correct' velar and 'incorrect' alveolar place of articulation. First, accuracy increased over four conditions, from spontaneous speech to confrontation naming to real-word repetition to non-word repetition. Second, there was a higher incidence of correct velar targets in initial than final position in the word, and a higher incidence of correct /k/ targets than /g/ targets. These findings are discussed in relation to a proposed model of child speech production, the configuration of which borrows heavily from similar models described recently in the literature. The model attempts to explain how a child represents and processes word-forms, and over time revises their pronunciation. The explanation offered for these findings entails a claim that speech articulations are concerned directly with reproducing perceptual phenomena and that their ability to do so accurately may be constrained by processing load.

26 citations


Journal ArticleDOI
TL;DR: The experimental results showed that the listeners' acceptable range of durational modification was narrower for vowels in the first moraic position in the word than for those in the third moraic position; the acceptable range was also narrower for the vowel /a/ than for the vowel /i/, and similarly narrower for vowels followed by unvoiced consonants than for those followed by voiced consonants.
Abstract: Few perceptual studies of the temporal aspects of speech have investigated the influence of changes in segmental durations in terms of acceptability. Aiming to contribute to the assessment of rules for assigning segmental durations in speech synthesis, the current study measured the perceptual acceptability of changes in the segmental duration of vowels as a function of the segment attributes or context, such as base duration, temporal position in a word, vowel quality, and voicing of the following segment. Seven listeners estimated the acceptability of word stimuli in which one of the vowels was subjected to a temporal modification from -50 ms (for shortening) to +50 ms (for lengthening) in 5-ms steps. The temporal modification was applied to vowel segments in 70 word contexts; their durations ranged from 35-145 ms, the mora position in the word was first or third, the vowel quality was /a/ or /i/, and the following segment was a voiced or an unvoiced consonant. The experimental results showed that the listeners' acceptable range of durational modification was narrower for vowels in the first moraic position in the word than for those in the third moraic position. The acceptable range was also narrower for the vowel /a/ than for the vowel /i/, and similarly narrower for vowels followed by unvoiced consonants than for those followed by voiced consonants. The vowel that fell into the least vulnerable class (the third /i/, followed by a voiced consonant) required 140% of the modification of that which fell into the most vulnerable class (the first /a/, followed by an unvoiced consonant) to yield the same acceptability decrement. In contrast, the effect of the original vowel duration on the acceptability of temporal modifications was not significant despite its wide variation (35-145 ms).

24 citations




Journal ArticleDOI
TL;DR: A knowledge‐based acoustic‐phonetic system for the automatic recognition of stops, in speaker independent continuous speech, is proposed that uses an auditory‐based front‐end processing and incorporates new algorithms for the extraction and manipulation of the acoustic‐ phonetic features that proved to be rich in their information content.
Abstract: Despite the recent successes in the field of automatic speech recognition, more research is still needed in order to understand the variability of speech and the acoustic characteristics of speech sounds in different contexts and for different speakers. In this paper, the acoustic-phonetic characteristics and the automatic recognition of the American English stop consonants are investigated. The acoustic features that exist in the literature are evaluated and new features are proposed. To test the value of the extracted features, a knowledge-based acoustic-phonetic system for the automatic recognition of stops, in speaker-independent continuous speech, is proposed. The system uses an auditory-based front-end processing and incorporates new algorithms for the extraction and manipulation of the acoustic-phonetic features that proved to be rich in their information content. Several features, which describe the burst frequency, formant transitions, relative amplitude, spectral shape, and duration, are combined in the recognition process. Recognition accuracies of 95% for voicing detection and 90% for place-of-articulation detection are obtained for TIMIT database continuous speech of multiple speakers from different dialect regions. The obtained results are analyzed and compared to previous work. [Work was supported by Catalyst Foundation.]

Journal ArticleDOI
25 Aug 1998
TL;DR: Proceedings of the Twenty-Fourth Annual Meeting of the Berkeley Linguistics Society: General Session and Parasession on Phonetics and Phonological Universals (1998).
Abstract: Proceedings of the Twenty-Fourth Annual Meeting of the Berkeley Linguistics Society: General Session and Parasession on Phonetics and Phonological Universals (1998)

Journal ArticleDOI
TL;DR: In this paper, an acoustic and perceptual study of a patient with severe apraxia of speech (AOS) was conducted, focusing on the ability to signal the voiced versus voiceless contrast on real-word and non-word stimuli.

Journal ArticleDOI
01 Jan 1998-Shofar
TL;DR: In this article, the authors used the Voice into Closure (ViC) method to check the phonetic voicing in Hebrew and found that in every voiced stop, the voicing extends longer than in its voiceless counterpart.
Abstract: As some scholars believe that voicing is only an accompanying, non-phonemic feature and that the essential characteristic in Hebrew is the force of articulation, we were encouraged to check the phonetic voicing in Hebrew. Since final sounds cannot be measured with an ordinary VOT method, we also used the "Voice into Closure" (ViC) method. Measurements of fourteen Hebrew subjects revealed that in every voiced stop the phonetic voicing extends longer than in its voiceless counterpart. Comparing our VOT findings to those in twelve other languages revealed that in a "voiced" sound there is always longer voicing than in its voiceless counterpart. Some characteristics that differentiate between languages are reported (e.g., Hebrew speakers resemble Spanish and Polish speakers in having a large span of VOT, in comparison to the nine other languages checked). Perception tests of synthesized words (on English, Spanish, Thai, and Hebrew listeners) demonstrated that the voice timing cue by itself suffices to differentiate between the voiced and voiceless categories. While the hypothesis of "force" could not be supported, "voice timing" was found to be an actual physical feature, and it is reasonable to assume that this feature is the main cause of a categorical differentiation in all the languages.

Journal ArticleDOI
TL;DR: The authors showed that infants do not encode either place or voice distinctions in lexical representations, so that words differing in only these features are treated as identical, and they also found that infants did not respond to a change in voicing.
Abstract: While infants have been demonstrated to be sensitive to a wide variety of phonetic contrasts when tested in speech discrimination tasks [Eimas et al. (1971) et seq.], recent work [Stager and Werker (1997)] has shown that following habituation to a word–object pairing, infants of 14 months fail to notice when the place of articulation of the initial consonant is switched [b/d]. Using the same procedure, the present study has found that infants do not respond to a change in voicing [b/p]. They do, however, notice a switch between dissimilar words [lɪf/nim]. One interpretation of these findings is that 14‐month‐olds do not encode either place or voice distinctions in lexical representations, so that words differing in only these features are treated as identical. To test this hypothesis, the effect of combining featural contrasts is currently being investigated by examining whether infants do respond to a change in both place and voice [d/p]. If there is such an additive effect, the contrasts must be represented. This would entail that an explanation for the failure to distinguish words differing in only a single feature should invoke processing factors, rather than representational ones.

Patent
01 May 1998
TL;DR: In this article, a class of features related to voicing parameters that indicate whether the vocal cords are vibrating was integrated with an existing 38-dimensional feature vector consisting of first and second order time derivatives of the frame energy and of the cepstral coefficients with their first and second derivatives.
Abstract: A class of features related to voicing parameters that indicate whether the vocal cords are vibrating. Features describing voicing characteristics of speech signals are integrated with an existing 38-dimensional feature vector consisting of first and second order time derivatives of the frame energy and of the cepstral coefficients with their first and second derivatives. Hidden Markov Model (HMM)-based connected digit recognition experiments comparing the traditional and extended feature sets show that voicing features and spectral information are complementary and that improved speech recognition performance is obtained by combining the two sources of information.

Proceedings ArticleDOI
12 May 1998
TL;DR: A voicing state determination algorithm (VSDA) that is used to simultaneously estimate the voicing state of two speakers present in a segment of co-channel speech by using a binary tree decision structure.
Abstract: This paper presents a voicing state determination algorithm (VSDA) that is used to simultaneously estimate the voicing state of two speakers present in a segment of co-channel speech. Supervised learning trains a Bayesian classifier to predict the voicing states. The possible voicing states are silence, voiced/voiced, voiced/unvoiced, unvoiced/voiced and unvoiced/unvoiced. We have assumed the silent state to be a subset of the unvoiced class, except when both speakers are silent. We have chosen a binary tree decision structure. Our feature set is a projection of a 37-dimensional feature vector onto a single dimension, applied at each branch of the decision tree using the Fisher linear discriminant. We have produced co-channel speech from the TIMIT database which is used for training and testing. Preliminary results, at a signal-to-interference ratio of 0 dB, show classification accuracies of 82.6%, 73.45%, and 68.24% on male/female, male/male, and female/female mixtures, respectively.
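The per-branch Fisher projection can be sketched generically as below; the ridge regularization term, function names, and synthetic data are illustrative stand-ins, since the paper's 37-dimensional feature set is not given here.

```python
import numpy as np

def fisher_direction(X1, X2):
    """Fisher linear discriminant direction w = Sw^-1 (m1 - m2)
    separating two classes; a small ridge term keeps Sw invertible."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = np.cov(X1, rowvar=False, bias=True) * len(X1)  # class scatter
    S2 = np.cov(X2, rowvar=False, bias=True) * len(X2)
    Sw = S1 + S2  # within-class scatter matrix
    return np.linalg.solve(Sw + 1e-6 * np.eye(Sw.shape[0]), m1 - m2)

def project(X, w):
    """Project feature vectors onto the single discriminant dimension."""
    return X @ w
```

At each branch of the decision tree, one such direction would be trained on the two groups of voicing states that branch separates, and a scalar threshold on the projected value makes the branch decision.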

Journal ArticleDOI
TL;DR: The results obtained imply that the adaptation of the articulatory control system to distorted conditions of articulation and voice generation can be governed not only by acoustic parameters such as formant frequencies, but also by a phonetic element as complex as the voicing cue.

Journal ArticleDOI
TL;DR: The authors showed that Spanish speakers use the preceding closure interval (CI) to help distinguish the voicing characteristics of word-initial stops during production, while native English speakers do not, and that although bilinguals have VOT boundaries similar to those of monolinguals when tokens are presented in isolation, they are more sensitive to CI in sentence contexts, as evidenced by larger shifts in VOT boundaries as a function of preceding CI.
Abstract: Green et al. [J. Acoust. Soc. Am. 102, 3136 (1997)] showed that Spanish speakers (monolingual and bilingual) use preceding closure interval (CI) to help distinguish the voicing characteristics of word‐initial stops during production, while native English (NE) speakers do not. The current study compared NE and Spanish‐English bilinguals with respect to the role of CI in the perception of voicing in word‐initial stops. Members from a /bada‐pada/ continuum, varying in voice onset time (VOT) from prevoiced to long lag values, were presented in an English and a Spanish sentence context using three different preceding CIs: 25, 75 and 125 ms. NE listeners were presented with either the English sentence tokens or the Spanish sentence tokens. One group of Bilinguals (in English mode) was presented with the English sentences while a second group (in Spanish mode) was presented with the Spanish sentences. Each subject was presented with the continuum in isolation after presentation of the sentences. The results show that although the bilinguals have similar VOT boundaries to monolinguals when the tokens are presented in isolation, they exhibit greater sensitivity to CI in the sentence context, as evidenced by larger shifts in VOT boundaries as a function of preceding CI.

Journal ArticleDOI
TL;DR: This article examined the weight of vowel duration and of the first and second formant (F1–F2) frequencies in distinguishing phonologically long and short vowels before a voiceless consonant and before a voiced consonant.
Abstract: Swedish is described as having a distinction between phonologically long and short vowels. This distinction is realized primarily through the duration of the vowels, but in some cases also through resonance characteristics of the vowels. In Swedish, as in many languages, vowel duration is also longer preceding a voiced postvocalic consonant than a voiceless one. This study examines the weight of vowel duration and of the first and second formant (F1–F2) frequencies in distinguishing phonologically long and short vowels before a voiceless consonant (experiment 1) and before a voiced consonant (experiment 2). For three pairs of Swedish vowels ([i:]‐[ɪ], [o:]‐[ɔ], [ɑ:]‐[a]), 100 /kVt/ (experiment 1) and 100 /kVd/ (experiment 2) words were resynthesized having ten degrees of vowel duration and ten degrees of F1 and F2 adjustment. In both experiments listeners decided whether presented words contained a phonologically long or short vowel. Reaction times were also recorded. Results show that vowel duratio...

01 Jan 1998
TL;DR: In this article, a study of the voicing profiles of consonants in Mandarin Chinese and German is presented, where the voicing profile is defined as the frame-by-frame voicing status of a speech sound in continuous speech.
Abstract: In this paper we present a study of the voicing profiles of consonants in Mandarin Chinese and German. The voicing profile is defined as the frame-by-frame voicing status of a speech sound in continuous speech. We are particularly interested in discrepancies between the phonological voicing status of a speech sound and its actual phonetic realization in connected speech. We further examine the contextual factors that cause voicing variations and test the cross-language validity of these factors. The result can be used to improve speech synthesis, and to refine phone models to enhance the performance of automatic speech segmentation and recognition.
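A frame-by-frame voicing profile of the kind described can be sketched with a simple energy-plus-zero-crossing-rate decision per frame. The window sizes and thresholds below are illustrative assumptions, not the study's measurement procedure.

```python
import numpy as np

def voicing_profile(signal, sr=16000, win_ms=25, hop_ms=10,
                    energy_floor=1e-4, zcr_max=0.15):
    """Return a per-frame list of voiced flags: a frame counts as voiced
    when it has non-negligible energy and a low zero-crossing rate."""
    win = int(sr * win_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    flags = []
    for i in range(0, len(signal) - win + 1, hop):
        f = signal[i:i + win]
        energy = float(np.mean(f * f))
        zcr = float(np.mean(np.abs(np.diff(np.sign(f)))) / 2)
        flags.append(energy > energy_floor and zcr < zcr_max)
    return flags
```

The phonology/phonetics discrepancies the paper studies would then surface as, for example, a mostly-unvoiced profile for a phonologically voiced obstruent in connected speech.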

Proceedings Article
01 Jan 1998
TL;DR: Several algorithms to estimate the VOT automatically from continuous speech are described and compared on a speech recognition problem, reducing error rates by as much as 53% over a baseline HMM-based system.
Abstract: We examine the distinctive feature [voice] that separates the voiced from the unvoiced sounds for the case of stop consonants. We conduct acoustic-phonetic analyses on a large database and demonstrate the superior separability of a temporal measure (voice onset time; VOT) over spectral measures. We describe several algorithms to estimate the VOT automatically from continuous speech and compare them on a speech recognition problem, reducing error rates by as much as 53% over a baseline HMM-based system.
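The paper's estimation algorithms are not detailed in this abstract; the sketch below only illustrates the idea of VOT as the interval from the stop burst to the onset of voicing, using an energy jump for the burst and a zero-crossing-rate drop for voicing onset. Frame size and thresholds are illustrative assumptions, not the paper's method.

```python
import numpy as np

def estimate_vot_ms(signal, sr=16000, frame_ms=5,
                    energy_frac=0.05, zcr_max=0.1):
    """Crude VOT estimate (ms): time from the first high-energy frame
    (the burst) to the first subsequent low-zero-crossing-rate frame
    with sustained energy (voicing onset)."""
    n = int(sr * frame_ms / 1000)
    frames = [signal[i:i + n] for i in range(0, len(signal) - n + 1, n)]
    energies = np.array([float(np.mean(f * f)) for f in frames])
    floor = energy_frac * energies.max()
    burst = int(np.argmax(energies > floor))
    for j in range(burst, len(frames)):
        zcr = float(np.mean(np.abs(np.diff(np.sign(frames[j])))) / 2)
        if zcr < zcr_max and energies[j] > floor:
            return (j - burst) * frame_ms
    return 0
```

On a synthetic token of silence, a 20 ms aspiration-noise burst, then a periodic vowel, this returns the 20 ms lag between burst and voicing.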


Proceedings Article
01 Jan 1998
TL;DR: Durational and spectral variation in syllable-onset /l/s dependent on voicing in the coda is investigated, phonetically implemented as a variety of properties spread throughout the syllabic domain.
Abstract: This study investigates durational and spectral variation in syllable-onset /l/s dependent on voicing in the coda. 1560 pairs of (C)lVC monosyllables differing in the voicing of the final stop were read by 4 British English speakers. Onset /l/ was longer before voiced than voiceless codas, and darker (for 3 speakers) as measured by F2 frequency and spectral centre of gravity. Differences due to other variables (lexical status, isolation/carrier context, syllable onset, vowel quality and regional accent) are outlined. It is proposed that coda voicing is a feature associated with the whole syllable, phonetically implemented as a variety of properties spread throughout the syllabic domain. Implications for word recognition are outlined.

Journal ArticleDOI
TL;DR: Results indicated that accurate perception of final consonant voicing was not impaired by changes in the temporal structure of speech that accompanied the inexperienced signers' simultaneous communication.


30 Oct 1998
TL;DR: In this article, the authors explain the development of these two sandhi rules and the exceptions by assuming four stages for the development of three phonological rules, either obligatory or optional, or the orders of their application, on the evidence from regional dialects of Burmese.
Abstract: In Modern Standard Burmese (Myanmar) two distinct types of voicing sandhi are observed: (1) in the environment C1aC2, if both C1 and C2 are any one of /p t t c k s/, both C1 and C2 become voiced, and (2) in the environment where C occurs after nonstop rhymes except atonic ones within a word or phrase with a postposition or postpositions, C becomes voiced if it has a voiced counterpart, hence C = /p ph t t th c ch k kh; s sh/. However, a fair number of varied exceptions are found for the first rule. The aim of the present paper is to explain the development of these two sandhi rules and the exceptions by assuming four stages for the development of three phonological rules, either obligatory or optional, or the orders of their application, on the evidence from the regional dialects of Burmese.

Proceedings Article
01 Jan 1998
TL;DR: There is still a possibility that people may deliberately control the F0 of the following vowel as an additional cue to the phonological difference between voiceless and voiced stop consonants.
Abstract: Data collected from Japanese and English showed that both phonetically fully voiced and (partially) devoiced allophones of /d/ have a very similar perturbatory effect on the F0 of the following vowel. It is considered, therefore, that the phonetic voicing of /d/ (periodicity during the closure) is not clearly correlated with lower levels of F0 on the following vowel. Although the F0 perturbation may be caused by some aspect of the production of the preceding stop that is not necessarily manifested in actual vocal cord vibration, this result indicates that there is still a possibility that people may deliberately control the F0 of the following vowel as an additional cue to the phonological difference between voiceless and voiced stop consonants.

Journal ArticleDOI
TL;DR: This article examined acquisition data from Swedish and American children, aged 24 and 30 months, to determine developmental patterns of vowel duration associated with context-sensitive voicing, testing the hypothesis that 24-month-old children from both language backgrounds would exhibit the phonetically based tendency for vowels to be longer in the voiced context, but that at 30 months of age the effect of final-consonant voicing would be much stronger in English than in Swedish.
Abstract: In most languages of the world, a vowel preceding a voiced obstruent is longer than the same vowel preceding a voiceless obstruent. Although the effect of postvocalic voicing on vowel duration is often considered to be phonetically driven, the extent of influence differs considerably across languages. In English, for example, vowels preceding voiced obstruents are nearly twice as long as those preceding voiceless obstruents, whereas in Swedish, the influence is minimal, perhaps because vowel length is phonemically contrastive in this language. The present study examines acquisition data from Swedish and American children, aged 24 and 30 months, to determine developmental patterns of vowel duration associated with context‐sensitive voicing. It was hypothesized that the 24‐month‐old children from both language backgrounds would exhibit the phonetically based tendency for vowels to be significantly longer in the voiced context, but that at 30 months of age, the effects of final consonant voicing would be much stronger in English than in Swedish. Durational measures of high front vowels (tense/long /i/ and its short/lax counterpart) supported the proposed hypothesis. [Work supported by NICHD.]