
Showing papers on "Voice published in 2001"


Journal ArticleDOI
TL;DR: Differences in phonation type signal important linguistic information in many languages, including contrasts between otherwise identical lexical items and boundaries of prosodic constituents, according to a recurring set of articulatory, acoustic, and timing properties.

479 citations


Journal ArticleDOI
TL;DR: It is argued that laughers use the acoustic features of their vocalizations to shape listener affect, as voiced, songlike laughs were significantly more likely to elicit positive responses than were variants such as unvoiced grunts, pants, and snortlike sounds.
Abstract: We tested whether listeners are differentially responsive to the presence or absence of voicing, a salient, distinguishing acoustic feature, in laughter. Each of 128 participants rated 50 voiced and 20 unvoiced laughs twice according to one of five different rating strategies. Results were highly consistent regardless of whether participants rated their own emotional responses, likely responses of other people, or one of three perceived attributes concerning the laughers, thus indicating that participants were experiencing similarly differentiated affective responses in all these cases. Specifically, voiced, songlike laughs were significantly more likely to elicit positive responses than were variants such as unvoiced grunts, pants, and snortlike sounds. Participants were also highly consistent in their relative dislike of these other sounds, especially those produced by females. Based on these results, we argue that laughers use the acoustic features of their vocalizations to shape listener affect.

238 citations


Journal ArticleDOI
01 Jan 2001-Language
TL;DR: This paper showed that the feature value [-voice], although it is the unmarked value of the laryngeal feature [voice], can be active phonologically in a fashion parallel to the marked value [+voice].
Abstract: This article provides empirical evidence against the claims that [voice] is a privative feature and that word-internal devoicing can occur in a language without word-final devoicing. The study of voice patterns in a number of languages shows that the feature value [-voice], although it is the unmarked value of the laryngeal feature [voice], can be active phonologically in a fashion parallel to the marked value [+voice]. Across languages, voice assimilation may occur independently of devoicing and, although it normally affects both [+voice] and [-voice], it may affect only one value in some languages.

154 citations


Journal ArticleDOI
TL;DR: In this article, the identification and discrimination functions of the features voicing and place-of-articulation were assessed in children with developmental dyslexia and compared to two control groups of children (age-matched and matched on reading level).
Abstract: Problems in reading and spelling may arise from poor perception of speech sounds. To study the integrity of phonological access in children with developmental dyslexia (mean age 8 years, 9 months) as compared to two control groups of children (age-matched and matched on reading level), identification and discrimination functions of the features voicing and place-of-articulation were assessed. No differences were found between groups with respect to identification of place-of-articulation. With respect to identification of the voicing contrast, children with developmental dyslexia performed more poorly than age-matched controls, but similarly to reading-level controls. For the voicing as well as the place-of-articulation contrast, children with developmental dyslexia discriminated more poorly than both control groups. This pattern of identification and discrimination performance is discussed relative to the multidimensionality of the speech perception system. The clinical relevance of these perception tasks could be d...

73 citations


Journal ArticleDOI
TL;DR: The authors showed that the essential cues for understanding spoken language are largely dynamic in nature, derived from the complex modulation spectrum (including both amplitude and phase) below 20 Hz, segmentation of the speech signal into syllabic intervals between 50 and 400 ms, and a multi-time-scale, coarse-grained analysis of phonetic constituents into features based on voicing, manner and place of articulation.
Abstract: Classical models of speech recognition (by both human and machine) assume that a detailed, short‐term analysis of the acoustic signal is essential for accurately decoding spoken language. Several lines of evidence call this assumption into question: (1) intelligibility is relatively unimpaired when the frequency spectrum is distorted under a wide range of conditions (including cross‐spectral asynchrony, reverberation, waveform time reversal and selective deletion of 80% of the spectrum), (2) the acoustic properties of spontaneous speech rarely conform to canonical patterns associated with specific phonetic segments, and (3) automatic‐speech‐recognition phonetic classifiers often require ca. 250 ms of acoustic context (spanning several segments) to function reliably. This pattern of evidence suggests that the essential cues for understanding spoken language are largely dynamic in nature, derived from (1) the complex modulation spectrum (incorporating both amplitude and phase) below 20 Hz, (2) segmentation of the speech signal into syllabic intervals between 50 and 400 ms, and (3) a multi‐time‐scale, coarse‐grained analysis of phonetic constituents into features based on voicing, manner and place of articulation. [Work supported by the U.S. Department of Defense and NSF.]
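The low-frequency modulation spectrum that this abstract appeals to can be illustrated with a few lines of signal processing. The sketch below is a rough illustration under my own assumptions (the function name, the 100 Hz envelope rate, and the synthetic signal are all invented, not the authors' method): rectify a waveform, downsample its envelope by block averaging, and Fourier-transform the envelope. For a carrier amplitude-modulated at a syllable-like 4 Hz, the modulation energy sits well below 20 Hz.

```python
import numpy as np

def modulation_spectrum(signal, fs, env_fs=100):
    """Coarse amplitude-modulation spectrum: rectify, downsample the
    envelope to env_fs by block averaging, then FFT the envelope."""
    block = int(fs / env_fs)                       # samples per envelope frame
    n = (len(signal) // block) * block
    env = np.abs(signal[:n]).reshape(-1, block).mean(axis=1)
    env = env - env.mean()                         # remove the DC component
    spec = np.abs(np.fft.rfft(env))
    freqs = np.fft.rfftfreq(len(env), d=1.0 / env_fs)
    return freqs, spec

# A 1 kHz carrier amplitude-modulated at a syllable-like 4 Hz rate
fs = 16000
t = np.arange(fs * 2) / fs
carrier = np.sin(2 * np.pi * 1000 * t)
sig = (1.0 + 0.8 * np.sin(2 * np.pi * 4 * t)) * carrier
freqs, spec = modulation_spectrum(sig, fs)
peak = freqs[np.argmax(spec)]                      # dominant modulation frequency
```

Here `peak` lands at the 4 Hz modulation rate, consistent with the claim that the perceptually important modulation energy lies below 20 Hz.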

69 citations


Journal ArticleDOI
TL;DR: The results are inconsistent with models of consonant classification in which acoustic–phonetic cues for place of articulation are not involved in the perception of the voicing contrast, but the observed perceptual interaction between place of articulation and voicing may be consistent with either a feature- or segment-based model of consonant classification.

65 citations


Journal ArticleDOI
TL;DR: A statistically guided, knowledge-based, acoustic-phonetic system for the automatic classification of stops, in speaker independent continuous speech, is proposed that uses a new auditory-based front-end processing and incorporates new algorithms for the extraction and manipulation of the acoustic- phonetic features that proved to be rich in their information content.
Abstract: In this paper, the acoustic-phonetic characteristics of the American English stop consonants are investigated. Features studied in the literature are evaluated for their information content and new features are proposed. A statistically guided, knowledge-based, acoustic-phonetic system for the automatic classification of stops, in speaker independent continuous speech, is proposed. The system uses a new auditory-based front-end processing and incorporates new algorithms for the extraction and manipulation of the acoustic-phonetic features that proved to be rich in their information content. Recognition experiments are performed using hard decision algorithms on stops extracted from the TIMIT database continuous speech of 60 speakers (not used in the design process) from seven different dialects of American English. An accuracy of 96% is obtained for voicing detection, 90% for place of articulation detection and 86% for the overall classification of stops.

63 citations


01 Jan 2001
TL;DR: Consonant confusions which were language-dependent – mostly errors in voicing and manner – were not reduced by the addition of visual cues whereas confusions that were common to both listener groups and related to acoustic-phonetic sound characteristics did show improvements.
Abstract: This study was designed to identify English speech contrasts that might be appropriate for the computer-based auditory-visual training of Spanish learners of English. It examines auditory-visual and auditory consonant and vowel confusions by Spanish speaking students of English and a native English control group. 36 Spanish listeners were tested on their identification of 16 consonants and 9 vowels of British English. For consonants, both L2 learners and controls showed significant improvements in the audiovisual condition, with larger effects for syllable final consonants. The patterns of errors by L2 learners were strongly predictable from our knowledge of the relation between the phoneme inventories of Spanish and English. Consonant confusions which were language-dependent – mostly errors in voicing and manner – were not reduced by the addition of visual cues whereas confusions that were common to both listener groups and related to acoustic-phonetic sound characteristics did show improvements. Spanish listeners did not use visual cues that disambiguated contrasts that are phonemic in English but have allophonic status in Spanish. Visual features therefore have different weights when cueing phonemic and allophonic distinctions.
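Confusion patterns like those analyzed here are conventionally tabulated in a stimulus-by-response confusion matrix. A minimal, generic sketch (not the study's actual analysis code; the labels and toy responses are invented):

```python
import numpy as np

def confusion_matrix(stimuli, responses, labels):
    """Count matrix: rows index the consonant presented,
    columns index the consonant the listener reported."""
    idx = {lab: i for i, lab in enumerate(labels)}
    m = np.zeros((len(labels), len(labels)), dtype=int)
    for s, r in zip(stimuli, responses):
        m[idx[s], idx[r]] += 1
    return m

# Toy example: a listener who devoices /b/ half the time
labels = ["b", "p", "v"]
stim = ["b", "b", "p", "v", "b", "b"]
resp = ["b", "p", "p", "v", "p", "b"]
m = confusion_matrix(stim, resp, labels)
```

Row sums give the number of presentations per stimulus, so dividing each row by its sum yields the identification proportions from which language-dependent confusions (e.g., voicing errors) can be read off.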

49 citations


Book
07 Dec 2001
TL;DR: This book discusses laryngeal and phonatory features and representations, along with assimilation, deglottalization, debuccalization, dissimilation, ejective voicing, and fission and fusion.
Abstract: Table of Contents List of Abbreviations 1. Introduction 2. Laryngeal and Phonatory Features and Representations 3. Assimilation 4. Deglottalization 5. Debuccalization 6. Dissimilation 7. Ejective Voicing 8. Fission and Fusion 9. Conclusion Endnotes References

48 citations


Journal ArticleDOI
TL;DR: Listener accuracy in identifying voiced and voiceless stops and fricatives in tracheoesophageal (TE) and laryngeal speech was compared; the participant will be able to identify the most common listener misperceptions of tracheoesophageal speech.

34 citations



Journal ArticleDOI
TL;DR: The authors investigated the effect of bilingualism (Greek/Australian-English) on speakers' ability to perceive unfamiliar speech contrasts (in this case Thai); and whether speakers' speech productions bear any relationship to their speech perception.
Abstract: This study investigates two issues: the effect of bilingualism (Greek/Australian-English) on speakers' ability to perceive unfamiliar speech contrasts (in this case Thai); and whether speakers' speech productions bear any relationship to their speech perception. Thai has three bilabial stop contrasts in word-initial position, voiced /b/, voiceless /p/ and voiceless aspirated /pʰ/. English and Greek both have two-way voicing distinctions, voiced /b/ and voiceless /p/, but in the word-initial position in English, /p/ is realized as an aspirated [pʰ] and only occurs as an unaspirated [p] when in other than the initial position. Experiment 1 examined the perception of Thai bilabial stops by monolingual Australian-English speakers, bilingual Greek/Australian-English speakers, and a control group of Thai speakers. Experiment 2 examined the production of bilabial stops by these speaker groups. The results of Experiment 1 show no difference between the three speaker groups when discriminating the Thai distinctions /ba/ versus /pʰa/ and /pa/ versus /pʰa/. However, there was a tendency for Greek/Australian-English speakers to discriminate /ba/ versus /pa/ better than monolingual English speakers. More importantly, when participants were classified on the basis of their production profiles obtained in Experiment 2, Greek/Australian-English speakers with extreme voice onset times for /ba/ and /pa/ productions showed comparable perceptual performance to that of the Thai speakers. These results suggest that bilinguals who exaggerate the voicing differences between sounds when speaking, best perceive these differences when listening. These findings show that production profiles are an important adjunct to the assessment of bilingual speakers, and have important implications for the interface between perception and production.



Proceedings Article
01 Jan 2001
TL;DR: This paper describes a phase vocoder based technique for voice transformation that provides a flexible way to manipulate various aspects of the input signal, e.g., fundamental frequency of voicing, duration, energy, and formant positions, without explicit extraction.
Abstract: This paper describes a phase vocoder based technique for voice transformation. This method provides a flexible way to manipulate various aspects of the input signal, e.g., fundamental frequency of voicing, duration, energy, and formant positions, without explicit extraction. The modifications to the signal can be specific to any feature dimensions, and can vary dynamically over time. There are many potential applications for this technique. In concatenative speech synthesis, the method can be applied to transform the speech corpus to different voice characteristics, or to smooth any pitch or formant discontinuities between concatenation boundaries. The method can also be used as a tool for language learning. We can modify the prosody of the student's own speech to match that from a native speaker, and use the result as guidance for improvements. The technique can also be used to convert other biological signals, such as killer whale vocalizations, to a signal that is more appropriate for human auditory perception. Our initial experiments show encouraging results for all of these applications.
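The abstract does not include code, but the core phase-vocoder idea, analyzing overlapping FFT frames and resynthesizing at a different hop while keeping per-bin phase increments consistent, can be sketched roughly as below. This is a simplified illustration under my own parameter choices, not the authors' system; a production implementation would add phase locking across bins and careful gain normalization.

```python
import numpy as np

def pv_time_stretch(x, rate, n_fft=1024, hop=256):
    """Naive phase vocoder: analysis hop = hop*rate, synthesis hop = hop,
    so rate=0.5 roughly doubles duration without changing pitch."""
    win = np.hanning(n_fft)
    bins = np.arange(n_fft // 2 + 1)
    expected = 2 * np.pi * hop * bins / n_fft      # nominal phase advance per hop
    positions = np.arange(0, len(x) - n_fft - hop, hop * rate)
    phase = np.angle(np.fft.rfft(win * x[:n_fft]))
    out = np.zeros(len(positions) * hop + n_fft)
    for i, p in enumerate(positions):
        p = int(p)
        s0 = np.fft.rfft(win * x[p:p + n_fft])
        s1 = np.fft.rfft(win * x[p + hop:p + hop + n_fft])
        # instantaneous frequency: wrapped deviation from the bin frequency
        dphi = np.angle(s1) - np.angle(s0) - expected
        dphi -= 2 * np.pi * np.round(dphi / (2 * np.pi))
        phase += expected + dphi                   # accumulate coherent phase
        out[i * hop:i * hop + n_fft] += win * np.fft.irfft(np.abs(s1) * np.exp(1j * phase))
    return out

# Demo: stretch a 1 s, 440 Hz tone to roughly twice its length
fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t)
y = pv_time_stretch(x, rate=0.5)
```

Because the spectra are resynthesized unchanged, the stretched tone keeps its 440 Hz pitch; pitch shifting is conventionally obtained by time-stretching and then resampling.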

Journal ArticleDOI
TL;DR: In this article, the results of an acoustic speech-production experiment were presented, in which speakers repeated simple syllabic forms varying in consonantal voicing in time to a metronome that controlled repetition rate.
Abstract: This paper presents the results of an acoustic speech-production experiment in which speakers repeated simple syllabic forms varying in consonantal voicing in time to a metronome that controlled repetition rate. Speakers exhibited very different patterns of tempo control for syllables with onsets than for syllables with codas. Syllables with codas exhibited internal temporal consistency, leaving junctures between the repeated syllables to take up most of the tempo variation. Open syllables with onsets, by contrast, often exhibited nearly proportional scaling of all of the acoustic portions of the signal. Results also suggest that phonemic use of vowel duration as a cue to voicing acted to constrain temporal patterns with some speakers. These results are discussed with respect to possible models of local temporal adjustment within a context of global timing constraints.

Journal ArticleDOI
TL;DR: Research is reviewed concerning the performance of several neurological groups on the perception and production of voicing contrasts in speech and a model is presented specifying the level of phonemic processing thought to be impaired for each patient group.

Patent
19 Sep 2001
TL;DR: In this paper, a speech segment to be analyzed is cut out with a window having a length of a plurality of pitch periods for RK model voicing source parameter estimation; based on such estimations, an RK model voicing source waveform is generated, its relationship with the speech segment is analyzed by ARX system identification, and then a glottal transform function is estimated.
Abstract: A speech segment to be analyzed is cut out with a window having a length of a plurality of pitch periods for RK model voicing source parameter estimation. GCIs are all estimated for a plurality of voicing source pulses. Based on such estimations, an RK model voicing source waveform is generated, its relationship with the speech segment is analyzed by ARX system identification, and then a glottal transform function is estimated. This process is repeated until the GCIs converge to a predetermined value, at which point the identification is completed. Accordingly, a high quality analysis-synthesis system, which isolates voicing source parameters of speech signals from vocal tract parameters thereof with high accuracy, can be realized.

01 Jan 2001
TL;DR: This article found that syllable-onset /l/s are slightly longer and darker in syllables with voiced codas, compared with voiceless ones, and that, in certain conditions, these acoustic differences play a role in word recognition.
Abstract: Acoustic cues to a given phonological contrast may extend over long stretches of time. Thus, recent findings show that syllable-onset /l/s are slightly longer and darker in syllables with voiced codas, compared with voiceless ones, and that, in certain conditions, these acoustic differences play a role in word recognition. It is argued that coda-dependent variations in onset /l/s contribute to enhancing two major perceptual properties associated with coda voicing. We further suggest that the listener's sensitivity to these phonetic dependencies between syllable onsets and codas is at variance with segmental models of word recognition in which acoustic cues are integrated over short time intervals. Our findings provide support for an alternative, non-segmental approach, according to which distributed acoustic cues are central to word recognition and acoustic-phonetic fine detail is available at the initial contact stage of lexical access.

Journal ArticleDOI
TL;DR: In this article, an intramouth vibrating voice-generation system was developed to aid alaryngeal speech, which fixes a vibrator in artificial teeth as a substitute for a glottal sound source and proper sound control improves the speech.
Abstract: This paper proposes and evaluates an intramouth vibrating voice-generation system we have developed to aid alaryngeal speech. The system fixes a vibrator in artificial teeth as a substitute for a glottal sound source, and proper sound control improves the speech. With this system, we controlled the substitute glottal sound with intraoral pressure, which increases for voiceless consonants, for clearer speech. In addition, the system controls the pitch of speech using pressure from a finger. This concise pitch control is available to all patients for more natural speech. We tested two methods of pitch control by finger pressure: one in which finger pressure directly determines the pitch, and the other in which finger pressure is converted into binary commands of voice and accent that execute pitch pattern generation. Conventional pitch control with expiration pressure served as a reference. Without voicing control, less than 50% of syllables were identified correctly. Voicing control improved this rate to 60%. Similarly, voicing control reduced misidentification of voiceless consonants as corresponding voiced ones from 30% to 10%. Binary pitch control with finger pressure performed better than direct pitch control and was perceived to be as natural as direct pitch control with expiration pressure.


Journal ArticleDOI
TL;DR: In this article, the authors used inverse filtering to obtain the differentiated glottal flow for each vowel in the dynamic speech, which is theoretically robust across different vowel qualities and pitches, and then a baseline voice quality measurement was obtained for each word from each speaker to control for segmental and personal voice quality differences.
Abstract: Different voice qualities are used in normal speech to convey phonological contrasts (e.g., Zapotec, Gujarati, etc.), prosodic information, emotions, etc. When measuring changes in voice quality in dynamic speech, it is necessary to use a technique that can capture small, relative differences under quickly changing conditions. Although a number of techniques are available for assessing voice quality (e.g., perceptual assessment, qualitative assessment of the time amplitude waveform, qualitative assessment of the spectrogram, quantitative measurement of the spectrum, and quantitative measurement of the voicing source), they have, for the most part, been designed for capturing voice quality differences in sustained vowels or carefully matched speech conditions. This study develops a technique for relative voice quality measurement which ascertains voice quality differences across diverse words, pitches, and speakers. Specifically, inverse filtering is used to obtain the differentiated glottal flow for each vowel in the dynamic speech, which is theoretically robust across different vowel qualities and pitches. Then, for comparison, a baseline voice quality measurement is obtained for each word from each speaker to control for segmental and personal voice quality differences. Thus, the relative voice quality measurement will be resistant to the variations in dynamic speech. [Work supported by NIH and NIH/NIDCD.]

06 Nov 2001
TL;DR: In this paper, the authors describe the phonological systems of two closely related Northern Je languages: Mebengokre (the language of the Kayapo and Xikrin nations) and Apinaye (the language of the homonymous nation), and discuss critically the notion of phonological system, showing the way in which certain facts that are normally treated in descriptive studies as “phonological processes”, divorced from the system (which is often thought of as a mere inventory), are directly relevant to the oppositions that constitute the phonological system.
Abstract: This thesis has a double purpose. In the first place, it endeavors to describe the phonological systems of two closely related Northern Je languages: Mebengokre (the language of the Kayapo and Xikrin nations), and Apinaye (the language of the homonymous nation). In the second place, it intends to discuss critically the notion of phonological system, showing the way in which certain facts that are normally treated in descriptive studies as “phonological processes”, divorced from the system (which is often thought of as a mere inventory), are directly relevant to the oppositions that constitute the phonological system. To exemplify these ideas, we devote our attention to certain processes that involve nasality and voicing in these two languages. One of the clearest differences between the phonology of Mebengokre and Apinaye regards the behavior of so-called “nasal” consonants: in the first system, nasal consonants clearly contrast with voiced stops. In Apinaye, on the other hand, fully nasal consonants and voiced stops with nasalized contours are in complementary distribution. We argue initially that to represent the contour segments as being specified for the feature [nasal] leads us to an untenable situation: nasality would exhibit, in these segments, a completely passive behavior, retreating even next to [–nasal]; for this reason we opt for a representation in which nasality could be thought of as an epiphenomenon of the implementation of sonorant voicing. Some facts of the Apinaye language nevertheless suggest that at least coda segments cannot be characterized simply as “sonorants unspecified for nasality”: one of these facts is the permanence of a brief nasal transition between oral segments after the delinking of one of these coda consonants. This thesis takes up some of the points initially raised by D’Angelis (1998) in relation to other languages in the Macro-Je stock. The discussion about the notion of phonological system is mainly inspired by the structuralist paradigm of the Prague Linguistic Circle; later developments are always considered in the light of Trubetzkoy’s (1939) intuitions. Among the more recent reflections regarding the representation of nasals, we here take into account mainly the works of Steriade (1993) and Piggott (1992).

Proceedings ArticleDOI
19 Aug 2001
TL;DR: This paper investigates the effectiveness of using neural networks in classifying Malay speech sounds according to their place of articulation and voicing, and proposes a system that classifies 16 selected Malay syllables into their groups of phonetic features.
Abstract: This paper investigates the effectiveness of using neural networks in classifying Malay speech sounds according to their place of articulation and voicing. The system is very different from conventional speech recognition systems, which do not classify speech sounds into groups of voicing and place of articulation. The proposed system classifies 16 selected Malay syllables into their groups of phonetic features. The Malay syllables begin with stops followed by vowels. The speech tokens are sampled at 16 kHz with 16-bit resolution. LPC-derived cepstrum is used to extract the speech features. A three-layer multilayer perceptron (MLP) is used to train and recognize the Malay syllables. The system gives an encouraging result, with an average accuracy of 92.92%.
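The LPC-derived cepstral features named in this abstract can be computed without any toolbox: estimate the LPC coefficients with the Levinson-Durbin recursion, then convert them to cepstral coefficients with the standard recursion. A rough sketch of that textbook math follows (generic, not the paper's code; the model order and test signal are arbitrary choices of mine; the MLP stage is omitted):

```python
import numpy as np

def lpc(frame, order):
    """LPC polynomial a[0..order] (a[0] = 1) via the autocorrelation
    method and the Levinson-Durbin recursion."""
    n = len(frame)
    r = np.correlate(frame, frame, "full")[n - 1:n + order]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        k = -np.dot(a[:i], r[i:0:-1]) / err        # reflection coefficient
        a[1:i + 1] += k * a[:i][::-1]
        err *= 1.0 - k * k
    return a

def lpc_cepstrum(a, n_ceps):
    """Cepstral coefficients c[1..n_ceps] derived recursively
    from the LPC coefficients."""
    order = len(a) - 1
    c = np.zeros(n_ceps + 1)
    for n in range(1, n_ceps + 1):
        acc = -a[n] if n <= order else 0.0
        for k in range(1, n):
            if n - k <= order:
                acc -= (k / n) * c[k] * a[n - k]
        c[n] = acc
    return c[1:]

# Sanity check: recover the coefficients of a known 2nd-order all-pole signal
rng = np.random.default_rng(0)
e = rng.standard_normal(4000)
x = np.zeros(4000)
for i in range(2, 4000):
    x[i] = 0.9 * x[i - 1] - 0.5 * x[i - 2] + e[i]
a = lpc(x, 2)                                      # expect roughly [1, -0.9, 0.5]
ceps = lpc_cepstrum(a, 8)
```

In a recognizer such as the one described, each windowed frame of the syllable would be mapped to a short cepstral vector like `ceps`, and those vectors would form the MLP's input layer.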

Journal ArticleDOI
Shari R. Baum1
TL;DR: A number of patients in both the fluent and nonfluent aphasic groups could not consistently identify even endpoint stimuli, confirming phonetic categorization impairments previously shown in such individuals.

Proceedings Article
01 Jan 2001
TL;DR: The results suggest that ejectives have a similar pattern to plosives and that therefore a unified explanation for all three types of stops should be sought.
Abstract: Voice onset time after voiceless unaspirated stops demonstrates a dependence on place of articulation, most reliably being shorter for labial and coronal than for velar stops. Some of the proposed explanations for this pattern suggest that a parallel dependence is not to be expected for aspirated or ejective stops. However, similar patterns do occur with both aspirated and unaspirated stops. Cho and Ladefoged [1] have suggested that ejectives do not follow the same trend, but they had little data on bilabial ejectives to compare with more plentiful data on velars. This paper contributes more material to this debate with expanded data on Yapese and the first published material on ejective VOT in Nez Perce. The results suggest that ejectives have a similar pattern to plosives and that therefore a unified explanation for all three types of stops should be sought. 1. Introduction. It is well known that after the release of a prevocalic voiceless unaspirated stop, the time that elapses before voicing begins for the vowel shows dependence on the place of articulation of the stop. The voice onset time (VOT) is quite reliably shorter after a bilabial ([p]) than after a velar stop ([k]), with coronal stops often being intermediate and almost always shorter than velars (see, for example, [2, 3]). One proposed explanation for the patterns seen is that the rate of aperture increase in the releasing gesture differs for different articulators and locations. For example, Stevens [4] estimates that at the release of a labial stop the cross-sectional area is increasing at about 100 cm
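VOT measurements like these are usually made by hand from waveforms and spectrograms, but the quantity itself, the time from burst release to voicing onset, is easy to operationalize. Below is a crude, hypothetical sketch (the thresholds, pitch range, and synthetic test stimulus are all my own assumptions, not the paper's method) that marks the burst as the first strong transient and voicing as the first frame with strong periodicity:

```python
import numpy as np

def estimate_vot(x, fs, frame_ms=5):
    """Rough VOT in seconds: burst = first sample over an amplitude
    threshold; voicing onset = first frame whose autocorrelation has a
    strong peak at a lag in the 75-400 Hz pitch range."""
    frame = int(fs * frame_ms / 1000)
    burst = int(np.argmax(np.abs(x) > 0.1 * np.max(np.abs(x))))
    lo, hi = fs // 400, fs // 75
    for start in range(burst, len(x) - 2 * hi, frame):
        seg = x[start:start + 2 * hi] - np.mean(x[start:start + 2 * hi])
        ac = np.correlate(seg, seg, "full")[len(seg) - 1:]
        if ac[0] > 0 and np.max(ac[lo:hi]) / ac[0] > 0.5:
            return (start - burst) / fs
    return None

# Synthetic stop: 5 ms burst, 40 ms aspiration, then 120 Hz voicing
fs = 16000
rng = np.random.default_rng(1)
burst = rng.uniform(-1, 1, int(0.005 * fs))
aspiration = rng.uniform(-0.05, 0.05, int(0.040 * fs))
t = np.arange(int(0.100 * fs)) / fs
voiced = 0.8 * np.sin(2 * np.pi * 120 * t)
vot = estimate_vot(np.concatenate([burst, aspiration, voiced]), fs)
```

On this synthetic token the estimate comes out near the constructed 45 ms lag, with a few milliseconds of quantization from the frame step and the autocorrelation window.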

DOI
01 Jan 2001
TL;DR: In this paper, the authors investigate seemingly unrelated occurrences of obstruent voicing in derived environments in three different languages: Breton, German, and Italian, and give a unified account of the voicing patterns found in the three languages.
Abstract: In this paper, I investigate seemingly unrelated occurrences of obstruent voicing in derived environments in three different languages: Breton, German, and Italian. In the Breton dialect spoken on Île de Groix, underlyingly voiceless obstruents are realised as voiced when they are followed by a vowel- or sonorant-initial word (1a). Since they do not surface as voiced when followed by a vowel-initial affix (1b), this instance of voicing cannot be explained as intervocalic or intersonorant voicing. (1) Île de Groix Breton: a. !"ùg #záj]! 'Sit down there!' b. !"ùk-ed #záj]! 'Sit down (you.plural) there!' [Ternes 1970:45] In certain varieties of German, where intervocalic voicing is not operative either, root-final underlyingly voiceless obstruents (2a) surface as voiced when followed by a vowel-initial clitic (2b). Voicing does not occur in Standard German pronunciation (2c). (2) German: a. wi[s]en 'to know' b. wei[z] ich 'I know' c. wei[s] ich 'I know' The Italian case is slightly more complicated. Word-internally, s is either voiced intervocalically or realised as a geminate (3a,b). There are some instances where intervocalic voicing does not apply, as with a vocalic prefix and an s-initial stem (3c). In almost the same environment, the final s of the prefix dis- is subject to intervocalic voicing (3d). In connection with a consonant-initial stem, the same prefix surfaces with the voicing specification of the following obstruent (3e,f). (3) Italian: a. ca[z]a 'house' d. di[z]-onesto 'dishonest' b. ca[s:]a 'cash register' e. di[sp]iacere 'displeasure' c. a-[s]ociale 'asocial' f. di[zg]razia 'misfortune' This paper aims at giving a unified account of the voicing patterns found in the three languages. I will argue in particular that this voicing effect is closely tied to the alignment of morphological and prosodic categories. That is, the left stem


Journal ArticleDOI
TL;DR: It is argued that since the rule introduces nondistinctive segments, it is therefore problematic for both the strict cycle condition and structure preservation, but if the rule operates on underspecified segments, all these problems can be avoided.
Abstract: The Old English fricative voicing rule (FVR) has been variously formulated in both linear and nonlinear frameworks, yet to no complete satisfaction. Moreover, no attempt has yet been made to define the nature of the FVR and its status in lexical phonology. The peculiarity of the rule is that it has both lexical and postlexical properties. In general, a lexical rule is structure preserving and a postlexical rule nonstructure preserving; hence, a lexical rule should not create novel allophonic, nonneutralizing segments absent from the underlying inventory. However, unlike other non-structure-preserving rules, the FVR does not apply across the board—it applies only lexically, inside morphemes in nonderived words or stems and in derived words only across inflectional suffix boundaries. In the following discussion, linear and nonlinear versions of the FVR are first reviewed, followed by the restatement of the FVR as lexical but noncyclic and nonstructure preserving—that is, as a word-level rule that applies at level 3 to nonderived and derived forms. It is argued that since the rule introduces nondistinctive segments, it is therefore problematic for both the strict cycle condition and structure preservation, but if the rule operates on underspecified segments, all these problems can be avoided.

Proceedings Article
01 Jan 2001
TL;DR: This paper found that the perception of voicing has a strong dependence on vowel context, with /Ca/ syllables being significantly better discriminated than /Ci/ and /Cu/ syllables, and that labials consistently had a higher threshold SNR (i.e., were more easily confused) than alveolars and velars.
Abstract: Previous research has shown that the VOT and first formant transition are primary perceptual cues for the voicing distinction for syllable-initial plosives (SIP) in quiet environments. This study seeks to determine which cues are important for the perception of voicing for SIP in the presence of noise. Stimuli for the perceptual experiments consisted of naturally spoken CV syllables (six plosives in three vowel contexts) in varying levels of additive white Gaussian noise. In each experiment, plosives which share the same place of articulation (e.g., /p, b/) were presented to subjects in identification tasks. For each voiced/voiceless pair, a threshold SNR value was calculated. It was found that the perception of voicing has a strong dependence on vowel context, with /Ca/ syllables being significantly better discriminated than /Ci/ and /Cu/ syllables. In addition, labials consistently had a higher threshold SNR (i.e., were more easily confused) than alveolars and velars. Threshold SNR values were then correlated ...
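The stimulus-construction step, mixing speech with white Gaussian noise at a target SNR, can be sketched as follows (a generic recipe, not the authors' exact procedure; the function name and 6 dB demo value are my own):

```python
import numpy as np

def mix_at_snr(signal, snr_db, rng):
    """Return signal + white Gaussian noise scaled so that
    10*log10(P_signal / P_noise) equals snr_db."""
    p_sig = np.mean(signal ** 2)
    noise = rng.standard_normal(len(signal))
    gain = np.sqrt(p_sig / (np.mean(noise ** 2) * 10 ** (snr_db / 10)))
    return signal + gain * noise

# Demo: a 440 Hz tone mixed at exactly 6 dB SNR
rng = np.random.default_rng(0)
tone = np.sin(2 * np.pi * 440 * np.arange(8000) / 8000)
noisy = mix_at_snr(tone, 6.0, rng)
achieved = 10 * np.log10(np.mean(tone ** 2) / np.mean((noisy - tone) ** 2))
```

Sweeping `snr_db` over a range of levels and fitting the resulting identification scores yields the threshold SNR per voiced/voiceless pair that the study reports.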