
Showing papers by "Dan Jurafsky" published in 2003


Journal Article
TL;DR: This study investigates which factors affect the forms of function words, especially whether they have a fuller pronunciation or a more reduced or lenited pronunciation, based on over 8000 occurrences of the ten most frequent English function words in a 4-h sample from conversations from the Switchboard corpus.
Abstract: Function words, especially frequently occurring ones such as (the, that, and, and of), vary widely in pronunciation. Understanding this variation is essential both for cognitive modeling of lexical production and for computer speech recognition and synthesis. This study investigates which factors affect the forms of function words, especially whether they have a fuller pronunciation (e.g., ði, ðæt, ænd, ʌv) or a more reduced or lenited pronunciation (e.g., ðə, ðɨt, n, ə). It is based on over 8000 occurrences of the ten most frequent English function words in a 4-h sample from conversations from the Switchboard corpus. Ordinary linear and logistic regression models were used to examine variation in the length of the words, in the form of their vowel (basic, full, or reduced), and whether final obstruents were present or not. For all these measures, after controlling for segmental context, rate of speech, and other important factors, there are strong independent effects that make high-frequency monosyllabic function words more likely to be longer or have a fuller form (1) when neighboring disfluencies (such as filled pauses uh and um) indicate that the speaker was encountering problems in planning the utterance; (2) when the word is unexpected, i.e., less predictable in context; (3) when the word is either utterance initial or utterance final. Looking at the phenomenon in a different way, frequent function words are more likely to be shorter and to have less-full forms in fluent speech, in predictable positions or multiword collocations, and utterance internally. Also considered are other factors such as sex (women are more likely to use fuller forms, even after controlling for rate of speech, for example), and some of the differences among the ten function words in their response to the factors.

383 citations


Proceedings Article
19 Nov 2003
TL;DR: The authors formulate the semantic parsing problem as a classification problem using support vector machines and use a hand-labeled training set and a set of features drawn from earlier work together with some feature enhancements.
Abstract: There is an ever-growing need to add structure in the form of semantic markup to the huge amounts of unstructured text data now available. We present the technique of shallow semantic parsing, the process of assigning a simple WHO did WHAT to WHOM, etc., structure to sentences in text, as a useful tool in achieving this goal. We formulate the semantic parsing problem as a classification problem using support vector machines. Using a hand-labeled training set and a set of features drawn from earlier work together with some feature enhancements, we demonstrate a system that performs better than all other published results on shallow semantic parsing.

92 citations


Journal Article
TL;DR: This study investigates three factors that have been argued to define "canonical form" in sentence comprehension: Syntactic structure, semantic role, and frequency of usage, and shows that sentences whose structure matches the lexical bias of the main verb are significantly easier than sentences in which structure and lexical biases do not match.

33 citations


16 Apr 2003
TL;DR: The crucial importance of training on the Hispanic-accented data for acoustic model performance is shown, and the tendency of Spanish-accented speakers to use longer, and presumably less-reduced, schwa vowels than native-English speakers is described.
Abstract: We describe a recognition experiment and two analytic experiments on a database of strongly Hispanic-accented English. We show the crucial importance of training on the Hispanic-accented data for acoustic model performance, and describe the tendency of Spanish-accented speakers to use longer, and presumably less-reduced, schwa vowels than native-English speakers.

19 citations


Proceedings Article
11 Jul 2003
TL;DR: This paper systematically surveys the distribution of rhythm in constructions in Chinese from the statistical data acquired from a shallow tree bank, and shows that using the probabilistic rhythm feature significantly improves the performance of the shallow parser.
Abstract: The length of a constituent (number of syllables in a word or number of words in a phrase), or rhythm, plays an important role in Chinese syntax. This paper systematically surveys the distribution of rhythm in constructions in Chinese from the statistical data acquired from a shallow tree bank. Based on our survey, we then used the rhythm feature in a practical shallow parsing task by using rhythm as a statistical feature to augment a PCFG model. Our results show that using the probabilistic rhythm feature significantly improves the performance of our shallow parser.

10 citations


Journal Article
TL;DR: This paper found that the unaccusative verb frame (the apple dropped) is no more difficult than the intransitive with agent subject (Gottfried, Menn, & Holland, 1997, using repetition; Gahl, Menn, Ramsberger, Jurafsky, Elder, & Rewega et al., 2001, using plausibility judgement).

2 citations


Book
01 Jan 2003
TL;DR: This work presents a statistical system for identifying the semantic relationships, or semantic roles, filled by constituents of a sentence, based on statistical classifiers trained on roughly 50,000 sentences hand labeled with semantic roles in the FrameNet semantic labeling project.
Abstract: Over the past decade, natural language processing has been transformed by the adoption of statistical methods. The statistical approach began with shallow problems such as part-of-speech tagging, progressed to syntactic parsing, and is now being applied to higher-level semantic tasks. We present a statistical system for identifying the semantic relationships, or semantic roles, filled by constituents of a sentence. The system operates at the level of frame semantics, which provide us with an intermediate representation between the detail of complete theories of semantics and simpler domain-specific slot-filler representations. Given an input sentence, the system labels constituents with roles such as SPEAKER, MESSAGE, and TOPIC, identifying participants in various types of actions or states. The system is based on statistical classifiers that were trained on roughly 50,000 sentences hand labeled with semantic roles in the FrameNet semantic labeling project. We then parsed each training sentence and extracted various lexical and syntactic features, including the syntactic category of the constituent, its grammatical function, and position in the sentence. These features were combined with knowledge of the target verb, noun, or adjective, as well as information such as the prior probabilities of various combinations of semantic roles. We also used various methods of lexical clustering to generalize across possible fillers of roles. Test sentences were parsed, annotated with these features, and then passed through the classifiers. Our system achieves 80% accuracy in identifying the semantic role of presegmented constituents. At the harder task of simultaneously segmenting constituents and identifying their semantic role, the system achieved 65% precision and 61% recall.
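Note: the abstract reports precision and recall separately for the joint segmentation-and-labeling task. As a rough aid for comparing that result with other systems, the two figures can be combined into a balanced F-measure (their harmonic mean). The sketch below is illustrative only: the F1 value is derived here from the two reported numbers and is not stated in the abstract, and the function name `f1_score` is a hypothetical helper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Balanced F-measure: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both are zero
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract for the joint task.
precision, recall = 0.65, 0.61

# Derived value, not a figure reported by the authors.
f1 = f1_score(precision, recall)
print(f"F1 = {f1:.3f}")  # prints "F1 = 0.629"
```

Because the harmonic mean is pulled toward the lower of its two inputs, the combined score sits slightly below the midpoint of 65% and 61%.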

1 citation