Showing papers on "Voice published in 2004"


Book ChapterDOI
01 Jan 2004
TL;DR: The authors argue that much morphophonemics can be understood as accommodation to phonotactic requirements: a German-like voice-neutralising alternation system, for example, resolves rapidly once the phonotactics of obstruent voicing is recognised.
Abstract: The problem All languages have distributional regularities: patterns which restrict what sounds can appear where, including nowhere, as determined by local syntagmatic factors independent of any particular morphemic alternations. Early Generative Phonology tended to slight the study of distributional relations in favour of morphophonemics, perhaps because word-relatedness phonology was thought to be more productive of theoretical depth, reliably leading the analyst beyond the merely observable. But over the last few decades it has become clear that much morphophonemics can be understood as accommodation to phonotactic requirements, e.g., Kisseberth (1970), Sommerstein (1974), Kiparsky (1980), Goldsmith (1993), etc. A German-like voice-neutralising alternation system resolves rapidly when the phonotactics of obstruent voicing is recognised. And even as celebrated a problem in abstractness and opacity as Yawelmani Yokuts vocalic phonology turns on a surface-visible asymmetry in height-contrasts between long and short vowels. Distributions require nontrivial learning: the data do not explicitly indicate the nature, or even the presence, of distributional regularities, and every distributional statement goes beyond what can be observed as fact, the ‘positive evidence’. From seeing X in this or that environment the learner must somehow conclude ‘X can only appear under these conditions and never anywhere else’ – when such a conclusion is warranted. A familiar learning hazard is immediately encountered. Multiple grammars can be consistent with the same data, grammars which are empirically distinct in that they make different predictions about other forms not represented in the data.

160 citations



Journal ArticleDOI
TL;DR: Results show that both stress and focus can be used to distinguish contrastive from noncontrastive aspects of speech behavior, and suggest that vowel duration differences due to vowel quantity indicate a linguistic contrast, but vowel duration differences due to consonant voicing do not.

132 citations


Journal ArticleDOI
TL;DR: In this article, the authors examined the acquisition of the voicing contrast in German-Spanish bilingual children, on the basis of the acoustic measurement of Voice Onset Time (VOT), and found that the bilingual children displayed three different patterns of VOT development: 1. Delay in the phonetic realization of voicing: two bilingual children did not acquire long lag stops in German during the testing period; 2. Transfer of voicing features: one child produced German voiced stops with lead voicing and Spanish voiceless stops with long lag voicing; and 3. No cross-language influence in the phonetic realization of voicing.
Abstract: This study examines the acquisition of the voicing contrast in German-Spanish bilingual children, on the basis of the acoustic measurement of Voice Onset Time (VOT). VOT in four bilingual children (aged 2;0–3;0) was measured and compared to VOT in three monolingual German children (aged 1;9–2;6), and to previous literature findings in Spanish. All measurements were based on word-initial stops extracted from naturalistic speech recordings. Results revealed that the bilingual children displayed three different patterns of VOT development: 1. Delay in the phonetic realization of voicing: two bilingual children did not acquire long lag stops in German during the testing period; 2. Transfer of voicing features: one child produced German voiced stops with lead voicing and Spanish voiceless stops with long lag voicing; and 3. No cross-language influence in the phonetic realization of voicing. The relevance of the findings for cross-linguistic interaction in bilingual phonetic/phonological development is discussed.
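The three developmental patterns turn on where a stop's VOT falls along the lead / short-lag / long-lag continuum. As a rough illustration of that continuum (not the study's measurement procedure, and with boundary values that are conventional assumptions rather than the authors'), a classifier might look like this:

```python
def vot_category(vot_ms: float) -> str:
    """Map a voice onset time (ms) onto the conventional phonetic
    categories. The boundary values are illustrative assumptions,
    not the measurement criteria used in the study."""
    if vot_ms < 0:
        return "lead voicing"   # vocal fold vibration before the release
    elif vot_ms <= 30:
        return "short lag"      # e.g. Spanish /p t k/, German /b d g/
    else:
        return "long lag"       # e.g. German aspirated /p t k/

# A transfer pattern: German /b/ with lead voicing, Spanish /p/ with long lag.
print(vot_category(-85.0))  # lead voicing
print(vot_category(60.0))   # long lag
```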

119 citations


Journal ArticleDOI
TL;DR: The paradox raised is that although prevoicing is the most reliable cue to the voicing distinction for listeners, it is not reliably produced by speakers.

116 citations


Journal ArticleDOI
TL;DR: The authors showed that initial aspirated and tense consonants correlate with a high tone and lax and voiced consonants correlate with a low tone, and that the consonant-tone correlation is another case of voiceless-high and voiced-low.
Abstract: Korean is thought to be unique in having three kinds of voiceless stops: aspirated /ph th kh/, tense /p* t* k*/, and lax /p t k/. The contrast between tense and lax stops raises two theoretical problems. First, to distinguish them either a new feature [tense] is needed, or the contrast in voicing (or aspiration) must be increased from two to three. Either way there is a large increase in the number of possible stops in the world's languages, but the expansion lacks support beyond Korean. Second, initial aspirated and tense consonants correlate with a high tone, and lax and voiced consonants correlate with a low tone. The correlation cannot be explained in the standard tonogenesis model (voiceless-high and voiced-low). We argue instead that (a) underlyingly "tense" stops are regular voiceless unaspirated stops, and "lax" stops are regular voiced stops, (b) there is no compelling evidence for a new distinctive feature, and (c) the consonant-tone correlation is another case of voiceless-high and voiced-low. We conclude that Korean does not have an unusual phonology, and there is no need to complicate feature theory.

94 citations


Journal ArticleDOI
TL;DR: It was concluded that acoustic properties other than vocalic duration might play more important roles in voicing decisions for final stops than commonly asserted, sometimes even taking precedence over vocalic duration.
Abstract: Adults whose native languages permit syllable-final obstruents, and show a vocalic length distinction based on the voicing of those obstruents, consistently weight vocalic duration strongly in their perceptual decisions about the voicing of final stops, at least in laboratory studies using synthetic speech. Children, on the other hand, generally disregard such signal properties in their speech perception, favoring formant transitions instead. These age-related differences led to the prediction that children learning English as a native language would weight vocalic duration less than adults, but weight syllable-final transitions more in decisions of final-consonant voicing. This study tested that prediction. In the first experiment, adults and children (eight- and six-year-olds) labeled synthetic and natural CVC words with voiced or voiceless stops in final C position. Predictions were strictly supported for synthetic stimuli only. With natural stimuli it appeared that adults and children alike weighted syllable-offset transitions strongly in their voicing decisions. The predicted age-related difference in the weighting of vocalic duration was seen for these natural stimuli almost exclusively when syllable-final transitions signaled a voiced final stop. A second experiment with adults and children (seven- and five-year-olds) replicated these results for natural stimuli with four new sets of natural stimuli. It was concluded that acoustic properties other than vocalic duration might play more important roles in voicing decisions for final stops than commonly asserted, sometimes even taking precedence over vocalic duration.
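Cue weighting of the kind measured here is commonly estimated by regressing listeners' voiced/voiceless labels on the manipulated cue values, with the size of each coefficient indexing the weight of that cue. The sketch below illustrates the idea only; the stimulus values and labels are invented, and this is not the paper's own analysis:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented labeling data: one row per stimulus, columns are
# (vocalic duration in ms, strength of voiced syllable-offset transitions).
X = np.array([[150, 0.2], [180, 0.3], [220, 0.7], [250, 0.8],
              [160, 0.6], [240, 0.4], [200, 0.5], [270, 0.9]], dtype=float)
y = np.array([0, 0, 1, 1, 1, 0, 1, 1])  # 1 = labeled "voiced final stop"

model = LogisticRegression().fit(X, y)
# Scaling each coefficient by its cue's spread makes the weights comparable.
weights = model.coef_[0] * X.std(axis=0)
print(dict(zip(["vocalic duration", "offset transitions"], weights)))
```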

76 citations


Journal ArticleDOI
01 Jan 2004-Language
TL;DR: In Lezgian, a Nakh-Daghestanian language, final and preconsonantal ejectives and voiceless unaspirated obstruents are voiced in certain monosyllabic nouns.
Abstract: In Lezgian, a Nakh-Daghestanian language, final and preconsonantal ejectives and voiceless unaspirated obstruents are voiced in certain monosyllabic nouns. This article offers acoustic evidence confirming that the two coda-voicing series are indeed voiced in final position. Based on comparative evidence, it is demonstrated that this phonetically aberrant neutralization pattern is the result of a series of phonetically natural sound changes. Such 'crazy rules' (Bach and Harms 1972) undermine any direct phonetic licensing approach to phonology, such as LICENSING BY CUE (Steriade 1997).

68 citations


Journal ArticleDOI
TL;DR: An acoustic analysis of German obstruent realizations was carried out focusing on the adequacy of the word material employed and the consistency of the phonetic realizations produced by speakers from different regions of Germany, revealing significant effects of voiced vs. voiceless obstruents in medial (nonneutralizing) position.

67 citations


Journal ArticleDOI
TL;DR: The association of onset darkness and coda voicing does not seem to be ascribable to anticipatory coarticulation of features essential to voicing itself; this observation provides support for nonsegmental models of speech perception in which fine phonetic detail is mapped directly to linguistic structure without reference to phoneme-sized segments.

63 citations


Journal ArticleDOI
TL;DR: The first attempts to develop a method for the objective assessment of quality in substitution voices using a peripheral auditory model with a built-in fundamental frequency (pitch) extractor confirm that the quality of tracheo-esophageal speech is superior to that of esophageal speech, but inferior to that of normal speech and speech with the preservation of one vocal fold.
Abstract: This paper describes our first attempts to develop a method for the objective assessment of quality in substitution voices. The objective analysis deals with acoustic parameters characterising short voice and speech samples like a sequence of isolated vowels, a sequence of VCV and CVCVCV syllables, a short sentence, etc. A database of 113 registrations from 68 patients (53 total laryngectomy patients with tracheo-esophageal speech, 14 total laryngectomy patients with esophageal speech and 5 patients with partial frontolateral laryngectomy) and 6 registrations from healthy control persons was collected. Each registration consisted of seven speech utterances and was subjected to an acoustic analysis as well as to a perceptual evaluation, the latter involving eight parameters like “overall impression”, “tonicity”, etc. Since the goal of our work is to find out the best acoustical measurement for supporting perception and making it precise, it seemed logical to strive for a perceptually based acoustic analysis. We therefore performed the analysis by means of a peripheral auditory model with a built-in fundamental frequency (pitch) extractor. From the frame-level outputs (a frame is 10 ms) of the analyser, global objective parameters, such as (1) the percentage of voiced frames, (2) the average voicing evidence, (3) the voicing length distribution and (4) the fundamental frequency jitter, were computed for the different speech utterances. So as to reduce the parameter variability arising from the nature of the speech utterances (e.g., the presence of pauses in the signal, errors caused by the pitch extractor, etc.), the objective parameters were computed using non-standard averaging schemes involving energy weighting and frame selection. A statistical analysis of the objective parameters confirms that the quality of tracheo-esophageal speech is superior to that of esophageal speech, but inferior to that of normal speech and speech with the preservation of one vocal fold. Correlations between the objective parameters and the perceptual parameters are moderate.
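Two of the four global parameters are easy to make concrete. The sketch below shows simplified forms of the percentage of voiced frames and the fundamental-frequency jitter, computed from hypothetical 10 ms frame-level outputs; the paper's actual versions additionally use energy weighting and frame selection, which are omitted here:

```python
import numpy as np

def percent_voiced(voicing_evidence, threshold=0.5):
    """Percentage of 10 ms frames whose voicing evidence exceeds a
    threshold (simplified: no energy weighting or frame selection)."""
    v = np.asarray(voicing_evidence)
    return 100.0 * np.mean(v > threshold)

def f0_jitter(f0_hz):
    """Mean absolute frame-to-frame F0 difference relative to mean F0,
    over voiced frames only (0 marks an unvoiced frame)."""
    f0 = np.array([f for f in f0_hz if f > 0], dtype=float)
    return np.mean(np.abs(np.diff(f0))) / np.mean(f0)

evidence = [0.1, 0.8, 0.9, 0.85, 0.2, 0.7]  # per-frame voicing evidence
f0_track = [0, 110, 112, 108, 0, 115]       # per-frame F0 in Hz
print(percent_voiced(evidence), f0_jitter(f0_track))
```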

Dissertation
01 Jan 2004
TL;DR: The probabilistic framework makes the acoustic-phonetic approach to speech recognition suitable for practical recognition tasks as well as compatible with probabilistic pronunciation and language models.
Abstract: A probabilistic and statistical framework is presented for automatic speech recognition based on a phonetic feature representation of speech sounds. In this acoustic-phonetic approach, the speech recognition problem is hypothesized as a maximization of the joint posterior probability of a set of phonetic features and the corresponding acoustic landmarks. Binary classifiers of the manner phonetic features—syllabic, sonorant and continuant—are applied for the probabilistic detection of speech landmarks. The landmarks include stop bursts, vowel onsets, syllabic peaks, syllabic dips, fricative onsets and offsets, and sonorant consonant onsets and offsets. The classifiers use automatically extracted knowledge-based acoustic parameters (APs) that are acoustic correlates of those phonetic features. For isolated word recognition with known and limited vocabulary, the landmark sequences are constrained using a manner class pronunciation graph. Probabilistic decisions on place and voicing phonetic features are then made using a separate set of APs extracted using the landmarks. The framework exploits two properties of the knowledge-based acoustic cues of phonetic features: (1) sufficiency of the acoustic cues of a phonetic feature for a decision on that feature and (2) invariance of the acoustic cues with respect to context. The probabilistic framework makes the acoustic-phonetic approach to speech recognition suitable for practical recognition tasks as well as compatible with probabilistic pronunciation and language models. Support vector machines (SVMs) are applied for the binary classification tasks because of their two favorable properties—good generalization and the ability to learn from a relatively small amount of high-dimensional data. Performance comparable to Hidden Markov Model (HMM) based systems is obtained on landmark detection as well as isolated word recognition. Applications to rescoring of lattices from a large vocabulary continuous speech recognizer are also presented.
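The maximization of a joint posterior over binary phonetic features can be illustrated in miniature: if each manner classifier emits an independent per-frame posterior, the probability of a full feature bundle is the product of the per-feature terms. This is a toy reduction of the dissertation's model, with invented posterior values:

```python
# Invented per-frame posteriors P(feature is +) from the three binary
# manner classifiers.
posteriors = {"syllabic": 0.9, "sonorant": 0.95, "continuant": 0.8}

def bundle_posterior(posteriors, bundle):
    """Joint posterior of a manner feature bundle, assuming the binary
    classifiers are independent. `bundle` maps feature -> +1 or -1."""
    p = 1.0
    for feature, value in bundle.items():
        p *= posteriors[feature] if value > 0 else 1.0 - posteriors[feature]
    return p

# A vowel-like frame: [+syllabic, +sonorant, +continuant].
print(bundle_posterior(posteriors, {"syllabic": 1, "sonorant": 1, "continuant": 1}))
```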

Journal ArticleDOI
TL;DR: It is concluded that the interlanguage ranking follows from the frequency of different input structures, given the assumption that constraint rankings are stochastic (Boersma & Hayes, 2001), and that final devoicing is an effect of positional markedness constraints.
Abstract: One of the most interesting features of language contact situations involves the appearance of systematic patterns that are not manifested in either of the two languages in contact. A full analysis of such patterns may require constraint rankings that differ from those of both the native and the target languages. I examine possible sources of these constraint rankings with respect to the devoicing of final obstruents in learners whose native language contains either no final consonants or no final obstruents, and whose target language contains both voiced and voiceless final obstruents. I conclude that the interlanguage ranking follows from the frequency of different input structures, given the assumption that constraint rankings are stochastic (Boersma & Hayes, 2001), and that final devoicing is an effect of positional markedness constraints. I then consider possible alternative explanations of interlanguage final devoicing: as a reflection of native language rankings of positional faithfulness constraints; as an effect of perceptual filtering; and/or as a function of articulatory difficulty of sustaining voicing in final position.
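The stochastic-ranking assumption (Boersma & Hayes, 2001) is easy to sketch: each constraint carries a ranking value on a continuous scale, and at every evaluation a Gaussian noise term is added before the constraints are sorted, so closely ranked constraints occasionally swap and produce variable outputs. A minimal illustration with invented constraint names and ranking values:

```python
import random

# Invented ranking values for a markedness and a faithfulness constraint.
ranking_values = {"*VoicedCoda": 100.0, "Ident(voice)": 98.0}

def evaluation_ranking(values, noise_sd=2.0):
    """One stochastic evaluation: perturb each ranking value with
    Gaussian noise, then sort constraints from dominant to dominated."""
    noisy = {c: v + random.gauss(0.0, noise_sd) for c, v in values.items()}
    return sorted(noisy, key=noisy.get, reverse=True)

# Close ranking values plus noise yield variable final devoicing:
# markedness wins on most, but not all, evaluations.
wins = sum(evaluation_ranking(ranking_values)[0] == "*VoicedCoda"
           for _ in range(10000))
print(wins / 10000)  # roughly 0.76 with these invented values
```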

01 Jan 2004
TL;DR: The authors hypothesize that the phonetic basis of the asymmetry is the tendency for diphthongs to assimilate to their nuclei before voiced codas and to their offglides before voiceless ones.
Abstract: Canadian Raising is the best-known of a diverse class of English allophonic height alternations in /ai/ conditioned by coda voicing. The alternations have been independently re-innovated and show a systematic typology: The voiceless environment selects the higher allophone. We hypothesize that the phonetic basis of the asymmetry is the tendency for diphthongs to assimilate to their nuclei before voiced codas and to their offglides before voiceless ones. Predictions are tested in an instrumental study of the development of a Canadian-Raising-like alternation in and around Cleveland, Ohio, in 28 speakers born between 1878 and 1977. Results support the hypothesis and contradict two widespread views about Canadian Raising, (1) that it arises out of the Great Vowel Shift and (2) that diphthongs are less diphthongal in the short pre-voiceless environment.

Journal ArticleDOI
TL;DR: Overall, findings suggest that in French sentences, voicing assimilation is strictly regressive and complete assimilation is achieved by the covariation of several acoustic correlates, which attests to the complementarity of the underlying articulatory gestures.
Abstract: This study examined the manner in which French speakers used some acoustic correlates to produce the stop voicing distinction in French sentences when syllables containing syllable initial and -final stops were between vowels (/pa_a/) and between voiceless fricatives (/pas_s/). Data analyses revealed that /b, d, g/ were longer, were more frequently phonated, and were preceded by longer vowels than /p, t, k/ in three conditions: syllable-initial stops between vowels and between voiceless fricatives and syllable-final stops between vowels. When a voiceless fricative /s/ followed /b, d, g/, the voicing contrast was reduced as a result of complete regressive voicing assimilation, achieved by the concomitant devoicing of /b, d, g/ closures and the significant reduction in voicing-related differences in preceding vowel and closure durations. When /s/ preceded /b, d, g/, the voicing distinction was enhanced: significant voicing-related duration differences were accompanied by the complete assimilation of /s/ to [z]. Overall, findings suggest that in French sentences, voicing assimilation is strictly regressive and complete assimilation is achieved by the covariation of several acoustic correlates, which attests to the complementarity of the underlying articulatory gestures.

Journal ArticleDOI
TL;DR: In this article, the authors apply the voicing profile method to the analysis of voicing properties of consonants in German and compare the results with those of stops in three other languages, viz. Mandarin Chinese, Hindi, and Mexican Spanish.
Abstract: Within and across languages the realization of consonant voicing is highly variable. This study aims to identify, and quantify, the segmental, prosodic and positional factors that have an influence on consonant voicing. A widely used acoustic measure of voicing, viz. voice onset time, is known to have disadvantages both in a cross-linguistic framework, where it fails to provide sufficient information for certain stop consonant classifications, and across consonant classes because it is not defined for fricatives and sonorants. This study applies the voicing profile method to the analysis of voicing properties of consonants in German. The voicing profile is defined as the frame-by-frame voicing status of speech sound realizations in a speech corpus. The speech database was judiciously constructed to cover systematically all possible speech sound combinations in German and a number of positional and prosodic contexts in which these combinations occur. The results are put in a cross-linguistic perspective by comparing the voicing profiles of German stops to those of stops in three other languages, viz. Mandarin Chinese, Hindi, and Mexican Spanish. The results are also discussed in the context of the production and maintenance of voicing during speech production. The voicing profile analysis is intended to serve as a methodology for investigating the discrepancies between the phonemic voicing specification of a speech sound and its phonetic realization in connected speech.
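Once each frame carries a voiced/unvoiced decision, the voicing profile reduces to a frame-by-frame average over realizations. A minimal sketch, under the assumption (not spelled out in the abstract) that tokens have been time-normalized to the same number of frames:

```python
import numpy as np

def voicing_profile(tokens):
    """Frame-by-frame proportion of voiced frames across realizations of
    one speech sound. `tokens` is a list of equal-length binary vectors
    (1 = frame voiced), already time-normalized."""
    return np.mean(np.array(tokens, dtype=float), axis=0)

# Invented /d/ closures, 8 normalized frames each: closure voicing
# often dies away towards the release.
tokens = [[1, 1, 1, 1, 1, 0, 0, 0],
          [1, 1, 1, 1, 0, 0, 0, 0],
          [1, 1, 1, 1, 1, 1, 0, 0]]
print(voicing_profile(tokens))  # approx. [1. 1. 1. 1. 0.67 0.33 0. 0.]
```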

DissertationDOI
01 Jan 2004
TL;DR: In this paper, the authors investigated the perceptual relevance of prevoicing in Dutch and found that the presence of vocal fold vibration during the closure of initial voiced plosives (negative voice onset time) was the strongest cue that listeners use when classifying plosives as voiced or voiceless.
Abstract: In this dissertation the perceptual relevance of prevoicing in Dutch was investigated. Prevoicing is the presence of vocal fold vibration during the closure of initial voiced plosives (negative voice onset time). The presence or absence of prevoicing is generally used to describe the difference between voiced and voiceless Dutch plosives. The first experiment described in this dissertation showed that prevoicing is frequently absent in Dutch and that several factors affect the production of prevoicing. A detailed acoustic analysis of the voicing distinction identified several acoustic correlates of voicing. Prevoicing appeared to be by far the best predictor. Perceptual classification data revealed that prevoicing was indeed the strongest cue that listeners use when classifying plosives as voiced or voiceless. In the cases where prevoicing was absent, other acoustic cues influenced classification, such that some of these tokens were still perceived as being voiced. In the second part of this dissertation the influence of prevoicing variation on spoken-word recognition was examined. In several cross-modal priming experiments two types of prevoicing variation were contrasted: a difference between the presence and absence of prevoicing (6 versus 0 periods of prevoicing) and a difference in the amount of prevoicing (12 versus 6 periods). All these experiments indicated that primes with 12 and 6 periods of prevoicing had the same effect on lexical decisions to the visual targets. The primes without prevoicing had a different effect, but only when their voiceless counterparts were real words. Phonetic detail appears to influence lexical access only when it is useful: In Dutch, the presence versus absence of prevoicing is informative, while the amount of prevoicing is not.

Journal ArticleDOI
TL;DR: Experimental results show that for Austrian German (AG), as for northern Standard German (NG), the appropriate feature of contrast is [spread glottis].
Abstract: It is well-known that so-called voiced plosives in German, including Austrian German, are voiceless except between vowels where they are (sometimes) voiced (i.e. have vocal fold vibration during closure). Nonetheless, in the phonological literature, the contrast is often treated as one of [voice]. This leaves a rather substantial mismatch between the phonological description and the phonetic facts. Jessen & Ringen (2002) have recently presented experimental evidence in support of the position that the contrast in northern Standard German (NG) is one of [spread glottis]. It is often suggested that in Austrian German there is a two-way contrast of plosives, but no aspiration. This raises a question about whether the contrast in Austrian Standard German (AG) can possibly be one of [spread glottis] vs. non-[spread glottis]. This paper investigates this question. We present experimental results and argue that for AG, like NG, the appropriate feature of contrast is [spread glottis].


Proceedings ArticleDOI
M. Graciarena, H. Franco, Jing Zheng, D. Vergyri, A. Stolcke
17 May 2004
TL;DR: This work augments the Mel cepstral feature representation with voicing features from an independent front end; the voicing features computed are the normalized autocorrelation peak and a newly proposed entropy of the high-order cepstrum, and several alternatives are explored for integrating them into SRI's DECIPHER system.
Abstract: We augment the Mel cepstral (MFCC) feature representation with voicing features from an independent front end. The voicing feature front end parameters are optimized for recognition accuracy. The voicing features computed are the normalized autocorrelation peak and a newly proposed entropy of the high-order cepstrum. We explored several alternatives to integrate the voicing features into SRI's DECIPHER system. Promising early results were obtained in a simple system concatenating the voicing features with MFCC features and optimizing the voicing feature window duration. Best results overall came from a more complex system combining a multiframe voicing feature window with the MFCC plus third differential features using linear discriminant analysis and optimizing the number of voicing feature frames. The best integration approach from the single-pass system experiments was implemented in a multi-pass system for large vocabulary testing on the Switchboard database. An average WER reduction of 2% relative was obtained on the NIST Hub-5 dev2001 and eval2002 databases.
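The first of the two voicing features is straightforward to compute: the maximum of a frame's autocorrelation within the plausible pitch-lag range, normalized by the zero-lag value, so that periodic (voiced) frames score high and noise-like frames score low. A minimal NumPy sketch; the frame length, sampling rate and lag range are illustrative assumptions, not the paper's front-end settings:

```python
import numpy as np

def norm_autocorr_peak(frame, fs=8000, f0_min=60, f0_max=400):
    """Maximum autocorrelation over the pitch-lag range, normalized by
    the zero-lag value: high for periodic (voiced) frames, low for
    noise-like (unvoiced) frames."""
    frame = frame - np.mean(frame)
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(fs / f0_max), int(fs / f0_min)
    return np.max(ac[lo:hi]) / ac[0]

t = np.arange(200) / 8000.0                     # one 25 ms frame
voiced = np.sin(2 * np.pi * 120 * t)            # periodic at ~120 Hz
unvoiced = np.random.randn(200)                 # noise-like
print(norm_autocorr_peak(voiced), norm_autocorr_peak(unvoiced))
```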

Journal ArticleDOI
TL;DR: The overall findings are that attributes relating to the burst spectrum in relation to the vowel contribute most effectively, while attributes relating to formant transition are somewhat less effective.
Abstract: One of the approaches to automatic speech recognition is a distinctive feature-based speech recognition system, in which each of the underlying word segments is represented with a set of distinctive features. This thesis presents a study concerning acoustic attributes used for identifying the place of articulation features for stop consonant segments. The acoustic attributes are selected so that they capture the information relevant to place identification, including amplitude and energy of release bursts, formant movements of adjacent vowels, spectra of noises after the releases, and some temporal cues. An experimental procedure for examining the relative importance of these acoustic attributes for identifying stop place is developed. The ability of each attribute to separate the three places is evaluated by the classification error based on the distributions of its values for the three places, and another quantifier based on F-ratio. These two quantifiers generally agree and show how well each individual attribute separates the three places. Combinations of non-redundant attributes are used for the place classifications based on Mahalanobis distance. When stops contain release bursts, the classification accuracies are better than 90%. It was also shown that voicing and vowel frontness contexts lead to a better classification accuracy of stops in some contexts. When stops are located between two vowels, information on the formant structures in the vowels on both sides can be combined. Such combination yielded the best classification accuracy of 95.5%. By using appropriate methods for stops in different contexts, an overall classification accuracy of 92.1% is achieved. Linear discriminant function analysis is used to address the relative importance of these attributes when combinations are used. Their discriminating abilities and the ranking of their relative importance to the classifications in different vowel and voicing contexts are reported. The overall findings are that attributes relating to the burst spectrum in relation to the vowel contribute most effectively, while attributes relating to formant transition are somewhat less effective. The approach used in this study can be applied to different classes of sounds, as well as stops in different noise environments.
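Classification by Mahalanobis distance, as used in the thesis, assigns a stop to the place whose attribute distribution it lies closest to, given each place's mean vector and covariance. A compact sketch with invented two-dimensional attribute data:

```python
import numpy as np
from scipy.spatial.distance import mahalanobis

rng = np.random.default_rng(0)
# Invented attribute vectors per place: rows are tokens, columns are two
# acoustic attributes (say, a burst-spectrum measure and an F2 slope).
data = {
    "labial":   rng.multivariate_normal([0, 0], np.eye(2), 50),
    "alveolar": rng.multivariate_normal([3, 0], np.eye(2), 50),
    "velar":    rng.multivariate_normal([0, 3], np.eye(2), 50),
}

def classify(x):
    """Assign x to the place with the smallest Mahalanobis distance."""
    def dist(place):
        samples = data[place]
        vi = np.linalg.inv(np.cov(samples, rowvar=False))
        return mahalanobis(x, samples.mean(axis=0), vi)
    return min(data, key=dist)

print(classify(np.array([2.8, 0.2])))  # expected: alveolar
```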

Book ChapterDOI
01 Jan 2004
TL;DR: In English, the coda-voicing durational difference in stressed syllables can be 100 ms or more, and it is well established as one of the strongest perceptual cues to whether the coda is voiced or voiceless.
Abstract: It is well known that syllables in many languages have longer vowels when their codas are voiced rather than voiceless (for English, cf. Jones, 1972; House & Fairbanks, 1953; Peterson & Lehiste, 1960; for other languages, including exceptions, see Keating, 1985). In English, the durational difference in stressed syllables can be 100 ms or more, and it is well-established as one of the strongest perceptual cues to whether the coda is voiced or voiceless (e.g. Denes, 1955; Chen, 1970; Raphael, 1972). More recently, van Santen, Coleman & Randolph (1992) showed for one General American speaker that this coda-dependent durational difference is not restricted to syllabic nuclei, but includes sonorant consonants, while Slater and Coleman (1996) showed that, for a British English speaker, the differences tended to be greatest in a confined region of the syllable, the specific location being determined by the syllable’s segmental structure. In a companion study to the present paper (Nguyen & Hawkins, 1998; Hawkins & Nguyen, submitted), we confirmed the existence of the durational difference and showed that it is accompanied by systematic spectral differences in four accents of British English (one speaker per accent). For three speakers/accents, F2 frequency and the spectral centre of gravity (COG) in the /l/ were lower before voiced compared with voiceless codas, as illustrated in Figure X.1. (The fourth speaker, not discussed further here, had a different pattern, consistent with the fact that his accent realises the /l/-/r/ contrast differently.) Since F1 frequency in onset /l/s did not differ due to coda voicing, whereas both F2 frequency and the COG did, we tentatively concluded that our measured spectral differences reflect degree of velarisation, consistent with impressionistic observations. Thus the general pattern is that onset /l/ is relatively long and dark when the coda of the same syllable is voiced, and relatively short and light when the coda is voiceless. Do these differences in the acoustic shape of onset /l/ affect whether the syllable coda is heard as voiced or voiceless? If they do, the contribution of the onset is likely to be small and subtle, because the measured acoustic differences are small (mean 4.3 ms, 11 Hz COG, 16 Hz F2 over three speakers). However, though small, the durational differences are completely consistent and strongly statistically significant. Spectral differences are more variable but also statistically significant. Moreover, at least some can be heard. Even if only the more extreme variants provide listeners with early perceptual information about coda voicing, there are far-reaching implications for how we model syllable- and word-recognition, because the acoustic-phonetic properties we are concerned with are in nonadjacent segments and, for the most part, seem to be articulatorily and acoustically independent of one another. So, by testing whether these acoustic properties of onset /l/ affect the identification of coda voicing, we are coming closer to testing the standard assumption that lexical items are represented as sequences of discrete phonemic or allophonic units, for in standard phonological theory, longer duration and
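The spectral centre of gravity (COG) used to index /l/ darkness is the amplitude-weighted mean frequency of the spectrum. A minimal sketch of the measure; the window choice and the sine-wave "signals" are illustrative assumptions, not the chapter's analysis settings:

```python
import numpy as np

def spectral_cog(signal, fs):
    """Spectral centre of gravity: mean frequency weighted by the power
    spectrum. A lower COG in onset /l/ indexes a darker (more
    velarized) lateral."""
    spectrum = np.abs(np.fft.rfft(signal * np.hanning(len(signal)))) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    return np.sum(freqs * spectrum) / np.sum(spectrum)

fs = 16000
t = np.arange(int(0.03 * fs)) / fs          # a 30 ms stretch of /l/
dark = np.sin(2 * np.pi * 900 * t)          # energy concentrated lower
light = np.sin(2 * np.pi * 1400 * t)        # energy concentrated higher
print(spectral_cog(dark, fs), spectral_cog(light, fs))
```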

Journal ArticleDOI
TL;DR: This paper examined how consonant manner of articulation interacts with intonation type in shaping the F0 contours in English and found that there are three distinct consonantal effects: F0 interruption due to devoicing, a large but brief (10–40 ms) F0 raising at the onset of voicing, and a smaller but longer-lasting F0 raising throughout a large proportion of the preceding and following vowels.
Abstract: In this study we examine how consonant manner of articulation interacts with intonation type in shaping the F0 contours in English. Native speakers of American English read aloud words differing in vowel length, consonant manner of articulation and consonant position in word. They produced each word in either a statement or question carrier. F0 contours of their speech were extracted by measuring every complete vocal period. Preliminary results based on graphic analysis of three speakers’ data suggest that there are three distinct consonantal effects: F0 interruption due to devoicing, a large but brief (10–40 ms) F0 raising at the onset of voicing, and a smaller but longer‐lasting F0 raising throughout a large proportion of the preceding and following vowels. These effects appear to be imposed on a continuously changing F0 curve that is either rising‐falling or falling‐rising, depending on whether the carrier sentence is a statement or a question. Further analysis will test the hypothesis that these continuous curves result from local pitch targets that are assigned to individual syllables and implemented with them in synchrony regardless of their segmental composition. [Work supported by NIDCD Grant No. R01 DC03902.]
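Extracting F0 "by measuring every complete vocal period" amounts to taking the reciprocal of each period's duration, with devoiced stretches surfacing as gaps in the contour. A trivial sketch with invented period measurements:

```python
# Invented cycle-by-cycle period measurements (seconds); None marks a
# devoiced interval that interrupts the F0 contour.
periods = [0.0091, 0.0089, 0.0088, None, None, 0.0080, 0.0078]

# F0 in Hz is the reciprocal of each complete vocal period.
f0 = [round(1.0 / p, 1) if p else None for p in periods]
print(f0)  # [109.9, 112.4, 113.6, None, None, 125.0, 128.2]
```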

Journal ArticleDOI
TL;DR: In this paper, it is argued that Turkish has a three-way contrast of stops, which is a combination of [spread glottis] and [voice] contrasts, and that *sg and *voice are low-ranking in Turkish.
Abstract: According to the descriptive grammars of Turkish (Kornfilt 1997, Swift 1963, Lewis 1967, among others) and underspecification-based analysis of Turkish (Inkelas 1995), stops have a two-way contrast: voiced unaspirated vs. voiceless aspirated. Moreover, voiced stops are described as fully voiced in all positions. By contrast, it is argued in this paper that Turkish has a three-way contrast of stops, which is a combination of [spread glottis] and [voice] contrasts. More specifically, Turkish stops contrast as voiceless aspirated vs. voiceless unaspirated vs. voiced. Support for the claim comes from an acoustic-phonetic experimental study of Turkish. In this study a native male speaker of Turkish was recorded and analyzed acoustically with a speech analysis software package (Wavesurfer). In addition, a list of words read by a native female speaker of Turkish from the University of Victoria Phonetic Database was analyzed. The acoustic analysis of the data shows that Turkish has voiceless aspirated stops in all positions in a word. Voiced stops are found in intervocalic position, word-finally and in consonant clusters. Moreover, the data reveal voiceless unaspirated stops in word-initial position and in consonant clusters. In order to account for the data, an OT analysis of the three-way contrast of stops is proposed. This analysis entails that *sg and *voice are low-ranking in Turkish. Once this three-way contrast is acknowledged, a straightforward alternative to the underspecification analysis of Inkelas (1995) is possible.

Journal ArticleDOI
TL;DR: This study presents results from high-speed imaging recordings of the voice source, that is, the pharyngo-esophageal segment, in four laryngectomized men, which show that the subjects had a high overall intelligibility as judged by the listeners.
Abstract: This study presents results from high-speed imaging recordings of the voice source, that is the pharyngo-esophageal segment, in four laryngectomized men. The subjects were asked to produce VCV-syllables with voiced and voiceless stop consonants during simultaneous high-speed imaging recordings and audio recordings. A general and detailed visuo-perceptual analysis of the shape and vibratory pattern in the pharyngo-esophageal (PE-) segment was made, as well as acoustical measurements of voice onset time (VOT) and closure duration for each syllable. The syllables were also audio-perceptually evaluated by five expert listeners. Results show that the subjects had a high overall intelligibility as judged by the listeners. All four subjects were able to make opening gestures in the PE-segment while producing voiceless stop consonants. In cases where misperceptions were predominant, the acoustical analysis with spectrograms and the detailed analysis of the vibration in the PE-segment gave information about probab...

Journal ArticleDOI
TL;DR: The authors analyzed the L2 English of Cantonese speakers from Hong Kong and found that the rates of voicing and devoicing vary with the environment, and suggested that transfer plays a significant role in shaping the L2 English ranking.
Abstract: This study offers a description and analysis of consonant voicing and devoicing in the L2 English of Cantonese speakers from Hong Kong. Focusing on stem-final voiced obstruents in distinct structural environments, we find that the rates of voicing and devoicing vary with the environment. Stem-final obstruents are more likely to devoice in prevoiceless and word-final positions than in prevocalic and pre-sonorant positions. Our analysis reveals a systematic pattern, expressible by three ranked constraints: IdOn(Lar) » *Lar » Id(Lar). This ranking retains to a large extent the ranking relations of the constraints in Cantonese. Our finding suggests that transfer plays a significant role in shaping the L2 English ranking. The comparison with L1 English shows that the two varieties are distinct, providing further evidence for a distinct Hong Kong English accent (Hung, 2000) and for the emergence of a distinct variety of English with unique sound structures (Peng and Setter, 2000).
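The ranking IdOn(Lar) » *Lar » Id(Lar) can be read off as a standard OT evaluation: the winning candidate is the one whose violations, ordered by constraint rank, are lexicographically smallest. A toy tableau for a stem-final voiced obstruent in word-final position, with violation profiles assigned by hand:

```python
# Ranking found for the L2 English of Cantonese speakers.
RANKING = ["IdOn(Lar)", "*Lar", "Id(Lar)"]

# Invented violation profiles for a stem-final voiced obstruent in
# word-final position: keep it voiced, or devoice it.
candidates = {
    "faithful [z]": {"IdOn(Lar)": 0, "*Lar": 1, "Id(Lar)": 0},
    "devoiced [s]": {"IdOn(Lar)": 0, "*Lar": 0, "Id(Lar)": 1},
}

def winner(candidates, ranking):
    """Standard OT evaluation: the winner has the lexicographically
    smallest violation vector when read in ranking order."""
    return min(candidates, key=lambda c: [candidates[c][k] for k in ranking])

print(winner(candidates, RANKING))  # devoiced [s]: *Lar outranks Id(Lar)
```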

Journal ArticleDOI
TL;DR: The results indicate a model of syllabic affiliation where specific juncture-marking aspects of the signal dominate parsing, and in their absence other differences provide additional, weaker cues to syllabic affiliation.
Abstract: Stetson (1951) noted that repeating singleton coda consonants at fast speech rates causes them to be perceived as onset consonants affiliated with a following vowel. The current study documents the perception of rate-induced resyllabification, as well as what temporal properties give rise to the perception of syllable affiliation. Stimuli were extracted from a previous study of repeated stop + vowel and vowel + stop syllables (de Jong, 2001a, 2001b). Forced-choice identification tasks show that slow repetitions are clearly distinguished. As speakers increase rate, they reach a point after which listeners disagree as to the affiliation of the stop. This pattern is found for voiced and voiceless consonants using different stimulus extraction techniques. Acoustic models of the identifications indicate that the sudden shift in syllabification occurs with the loss of an acoustic hiatus between successive syllables. Acoustic models of the fast rate identifications indicate that various other qualities, such as consonant voicing, affect the probability that the consonants will be perceived as onsets. These results indicate a model of syllabic affiliation where specific juncture-marking aspects of the signal dominate parsing, and in their absence other differences provide additional, weaker cues to syllabic affiliation.

Amee P. Shah
01 Jan 2004
TL;DR: The author identifies the acoustic deviations in the speech of Spanish-accented speakers of English and their influence on the native perception of accentedness.
Abstract: This study attempted to identify the acoustic deviations in the speech of Spanish-accented speakers of English and their influence on the native perception of accentedness. Recordings of eight multisyllabic target words spoken in sentences by 22 Spanish speakers of English and five native speakers of American English were analyzed for temporal acoustic differences. Acoustic deviations in Spanish-accented speech included overall word duration, unstressed vowel duration, stressed-unstressed (S/U) vowel duration ratios, Voice Onset Time (VOT) and closure duration in intervocalic flaps/stops. Native listeners listened to the nonnative samples and assigned a range of ratings of accentedness. Results showed that the accentedness ratings were correlated to varying degrees with each of overall word duration, S/U vowel duration ratios, VOT duration, and closure duration of intervocalic /t/. Overall, results suggest that Spanish-accented English is characterized by systematic temporal differences from native American English, and that these temporal differences are related to the perception of accentedness as judged by native AE listeners.

BACKGROUND
Relative to the long-standing research in normal speech perception and production, the study of foreign-accented speech is a recent, but fertile area of interest. The speech of non-native speakers of a language or “foreign-accented” speech is usually characterized by presence of acoustic-phonetic and phonological deviations from the norm. Research in this field has attempted to tap these deviations in order to characterize different accents, study the effect of these deviations on listeners’ perception and intelligibility, understand the process of second language learning as it shapes in ways different from the first language, and ultimately, to understand the processes of speech perception and production. The present work adds to, and extends this body of research through an attempt to isolate those nonnative speech deviations that cue listeners’ perception of accentedness, using the case of Spanish-accented English. Previous studies have typically described nonnative speech in terms of phonological characteristics (e.g., MacDonald, 1989; Ortega-Llebaria, 1997, Stockwell & Bowen, 1965) with only a few studies addressing the acoustic characteristics of those productions (e.g. Flege & Port, 1981; Munro, 1993; Magen, 1998; Backman, 1978; Flege & Eefting, 1987; Flege, Munro & Skelton, 1992). Moreover, these acoustic studies were mainly restricted to studying single parameters of English, such as vowel duration differences in voicing contrasts, acoustic vowel spaces, voice onset time of stop consonants, changes in fundamental frequency as related to intonation, or some measure of rate of speech (e.g. Flege & Port, 1981; Elsendoorn, 1985; Flege, 1993; Backman, 1978; Flege, 1991; Schmidt and Flege, 1996). Multiple acoustic parameters as they relate to perceived accentedness have not yet been

Patent
Corey Brady
10 Jun 2004
TL;DR: A speech-to-text conversion and annotation system is described that displays annotated text corresponding to computer-rendered speech, allows a user to adjust voicing and pronunciation parameters of the annotated text, and uses a text-to-speech engine to render the annotated text in a human-like generated voice reflecting the user-selected parameters.
Abstract: A speech to text conversion and annotation system. In an embodiment, the system displays an annotated text corresponding to computer rendered speech, allows a user to adjust voicing and pronunciation parameters of the annotated text, and uses a text to speech engine to render the annotated text to a human-like generated voice that has modified voicing and pronunciation corresponding to the user selected voicing and pronunciation parameters. In another embodiment, a read-aloud coaching system is introduced that allows a student to “incrementally program” a voice synthesis engine, thoughtfully and purposively creating his or her own reading of a literary text.

01 Jan 2004
TL;DR: A probabilistic framework for landmark-based speech recognition that utilizes the sufficiency and context invariance properties of acoustic cues for phonetic features is presented, and preliminary results are reported for manner recognition and the corresponding landmarks.
Abstract: A probabilistic framework for landmark-based speech recognition that utilizes the sufficiency and context invariance properties of acoustic cues for phonetic features is presented. Binary classifiers of the manner phonetic features "sonorant", "continuant" and "syllabic" operate on each frame of speech, each using a small number of relevant and sufficient acoustic parameters to generate probabilistic landmark sequences. The relative nature of the parameters developed for the extraction of acoustic cues for manner phonetic features makes them "invariant" of the manner of neighboring speech frames. This invariance of manner acoustic cues makes the use of only those three classifiers along with the speech/silence classifier complete irrespective of the manner context. The obtained landmarks are then used to extract relevant acoustic cues to make probabilistic binary decisions for the place and voicing phonetic features. Similar to the invariance property of the manner acoustic cues, the acoustic cues for place phonetic features extracted using manner landmarks are invariant of the place of neighboring sounds. Pronunciation models based on phonetic features are used to constrain the landmark sequences and to narrow the classification of place and voicing. Preliminary results have been obtained for manner recognition and the corresponding landmarks. Using classifiers trained from the phonetically rich TIMIT database, 80.2% accuracy was obtained for broad class recognition of the isolated digits in the TIDIGITS database which compares well with the accuracies of 74.8% and 81.0% obtained by a hidden Markov model (HMM) based system using mel-frequency cepstral coefficients (MFCCs) and knowledge-based parameters, respectively.