
Showing papers in "Computational Linguistics in 1997"


Journal Article
Marti A. Hearst
TL;DR: The algorithm is fully implemented and is shown to produce segmentation that corresponds well to human judgments of the subtopic boundaries of 12 texts, which should be useful for many text analysis tasks, including information retrieval and summarization.
Abstract: TextTiling is a technique for subdividing texts into multi-paragraph units that represent passages, or subtopics. The discourse cues for identifying major subtopic shifts are patterns of lexical co-occurrence and distribution. The algorithm is fully implemented and is shown to produce segmentation that corresponds well to human judgments of the subtopic boundaries of 12 texts. Multi-paragraph subtopic segmentation should be useful for many text analysis tasks, including information retrieval and summarization.
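
As an illustration of the core idea, here is a minimal sketch of lexical-cohesion scoring between adjacent blocks of tokens, with boundaries proposed at deep "valleys" in the score sequence. This is not Hearst's published algorithm; the block size and depth cutoff are arbitrary placeholders.

    # Minimal sketch of TextTiling's core idea (not the published algorithm):
    # score lexical cohesion between adjacent token blocks and treat deep
    # "valleys" in the score sequence as candidate subtopic boundaries.
    from collections import Counter
    import math

    def cosine(a, b):
        """Cosine similarity between two bags of words."""
        shared = set(a) & set(b)
        num = sum(a[w] * b[w] for w in shared)
        den = math.sqrt(sum(v * v for v in a.values())) * \
              math.sqrt(sum(v * v for v in b.values()))
        return num / den if den else 0.0

    def cohesion_scores(tokens, block_size=20):
        """Similarity of each pair of adjacent, non-overlapping token blocks."""
        blocks = [Counter(tokens[i:i + block_size])
                  for i in range(0, len(tokens), block_size)]
        return [cosine(blocks[i], blocks[i + 1]) for i in range(len(blocks) - 1)]

    def candidate_boundaries(scores, depth_cutoff=0.1):
        """Indices of gaps whose score is a sufficiently deep local minimum."""
        return [i for i in range(1, len(scores) - 1)
                if (scores[i - 1] - scores[i]) + (scores[i + 1] - scores[i]) > depth_cutoff]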

1,381 citations


Journal Article
Mehryar Mohri
TL;DR: This work recalls classical theorems and gives new ones characterizing sequential string-to-string transducers, including algorithms for determinizing and minimizing these transducers very efficiently, and characterizations of the transducers admitting determinization and the corresponding algorithms.
Abstract: Finite-state machines have been used in various domains of natural language processing. We consider here the use of a type of transducer that supports very efficient programs: sequential transducers. We recall classical theorems and give new ones characterizing sequential string-to-string transducers. Transducers that output weights also play an important role in language and speech processing. We give a specific study of string-to-weight transducers, including algorithms for determinizing and minimizing these transducers very efficiently, and characterizations of the transducers admitting determinization and the corresponding algorithms. Some applications of these algorithms in speech recognition are described and illustrated.
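
A toy illustration of what "sequential" buys in practice: because the machine is deterministic on its input, it maps a string to its output in a single left-to-right pass. The representation below is a simplification for illustration, not Mohri's construction.

    # Toy sequential (deterministic) string-to-string transducer: each state
    # and input symbol determine exactly one transition, so the input is
    # mapped to its output in a single left-to-right pass.
    class SequentialTransducer:
        def __init__(self, start, transitions, final_output):
            self.start = start
            self.transitions = transitions    # (state, symbol) -> (next_state, output)
            self.final_output = final_output  # state -> string appended on acceptance

        def apply(self, s):
            state, out = self.start, []
            for ch in s:
                state, emitted = self.transitions[(state, ch)]
                out.append(emitted)
            out.append(self.final_output[state])
            return "".join(out)

    # Example: rewrite "ab" as "ba", copy everything else, over alphabet {a, b}.
    t = SequentialTransducer(
        start=0,
        transitions={(0, "a"): (1, ""), (0, "b"): (0, "b"),
                     (1, "a"): (1, "a"), (1, "b"): (0, "ba")},
        final_output={0: "", 1: "a"},
    )
    assert t.apply("aab") == "aba"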

1,052 citations


Journal Article
TL;DR: A novel stochastic inversion transduction grammar formalism for bilingual language modeling of sentence-pairs, and the concept of bilingual parsing with a variety of parallel corpus analysis applications are introduced.
Abstract: We introduce (1) a novel stochastic inversion transduction grammar formalism for bilingual language modeling of sentence-pairs, and (2) the concept of bilingual parsing with a variety of parallel corpus analysis applications. Aside from the bilingual orientation, three major features distinguish the formalism from the finite-state transducers more traditionally found in computational linguistics: it skips directly to a context-free rather than finite-state base, it permits a minimal extra degree of ordering flexibility, and its probabilistic formulation admits an efficient maximum-likelihood bilingual parsing algorithm. A convenient normal form is shown to exist. Analysis of the formalism's expressiveness suggests that it is particularly well suited to modeling ordering shifts between languages, balancing needed flexibility against complexity constraints. We discuss a number of examples of how stochastic inversion transduction grammars bring bilingual constraints to bear upon problematic corpus analysis tasks such as segmentation, bracketing, phrasal alignment, and parsing.
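
A minimal sketch of the ordering behaviour that distinguishes the formalism: internal nodes of a bilingual parse are either straight (same order in both languages) or inverted (children reversed in the second language). The tree encoding below is illustrative only, not the paper's notation.

    # Sketch of the ordering behaviour of inversion transduction grammars:
    # a binary tree whose internal nodes are either "straight" (same order in
    # both languages) or "inverted" (children reversed in language 2).
    # Leaves are word pairs (word_in_L1, word_in_L2); empty strings mark an
    # empty side.

    def yield_pair(node):
        """Return (L1 word list, L2 word list) for an ITG-style binary tree."""
        if len(node) == 2:                          # leaf: (w1, w2) word pair
            w1, w2 = node
            return ([w1] if w1 else []), ([w2] if w2 else [])
        orient, left, right = node                  # internal: (orient, left, right)
        l1a, l2a = yield_pair(left)
        l1b, l2b = yield_pair(right)
        if orient == "straight":
            return l1a + l1b, l2a + l2b
        return l1a + l1b, l2b + l2a                 # "inverted": reverse language 2

    tree = ("straight",
            ("the", "la"),
            ("inverted", ("white", "blanche"), ("house", "maison")))
    print(yield_pair(tree))
    # (['the', 'white', 'house'], ['la', 'maison', 'blanche'])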

987 citations


Journal ArticleDOI
TL;DR: This paper describes the reliability of a dialogue structure coding scheme based on utterance function, game structure, and higher-level transaction structure that has been applied to a corpus of spontaneous task-oriented spoken dialogues.
Abstract: This paper describes the reliability of a dialogue structure coding scheme based on utterance function, game structure, and higher-level transaction structure that has been applied to a corpus of spontaneous task-oriented spoken dialogues.
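
Reliability here means inter-coder agreement, which for categorical coding schemes is conventionally reported with the kappa statistic. Below is a small sketch of pairwise (Cohen's) kappa over utterance labels; the move labels are illustrative and this is not the paper's exact procedure.

    # Sketch of Cohen's kappa for two coders assigning categories to the same
    # utterances -- the usual way reliability of a coding scheme is quantified.
    from collections import Counter

    def cohen_kappa(labels_a, labels_b):
        assert len(labels_a) == len(labels_b)
        n = len(labels_a)
        observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
        freq_a, freq_b = Counter(labels_a), Counter(labels_b)
        expected = sum((freq_a[c] / n) * (freq_b[c] / n)
                       for c in set(freq_a) | set(freq_b))
        return (observed - expected) / (1 - expected)

    coder1 = ["instruct", "check", "align", "check", "instruct"]
    coder2 = ["instruct", "check", "check", "check", "instruct"]
    print(round(cohen_kappa(coder1, coder2), 3))   # 0.667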

453 citations


Journal ArticleDOI
TL;DR: The first part of this paper presents a method for empirically validating multi-utterance units referred to as discourse segments, and reports highly significant results of segmentations performed by naive subjects, where a commonsense notion of speaker intention is the segmentation criterion.
Abstract: The need to model the relation between discourse structure and linguistic features of utterances is almost universally acknowledged in the literature on discourse. However, there is only weak consensus on what the units of discourse structure are, or the criteria for recognizing and generating them. We present quantitative results of a two-part study using a corpus of spontaneous, narrative monologues. The first part of our paper presents a method for empirically validating multi-utterance units referred to as discourse segments. We report highly significant results of segmentations performed by naive subjects, where a commonsense notion of speaker intention is the segmentation criterion. In the second part of our study, data abstracted from the subjects' segmentations serve as a target for evaluating two sets of algorithms that use utterance features to perform segmentation. On the first algorithm set, we evaluate and compare the correlation of discourse segmentation with three types of linguistic cues (referential noun phrases, cue words, and pauses). We then develop a second set using two methods: error analysis and machine learning. Testing the new algorithms on a new data set shows that when multiple sources of linguistic knowledge are used concurrently, algorithm performance improves.

226 citations


Journal ArticleDOI
Steven Abney
TL;DR: In this article, the authors define stochastic attribute-value grammars and give an algorithm for computing the maximum-likelihood estimate of their parameters, which is adapted from Della Pietra and Lafferty (1995).
Abstract: Probabilistic analogues of regular and context-free grammars are well known in computational linguistics, and currently the subject of intensive research. To date, however, no satisfactory probabilistic analogue of attribute-value grammars has been proposed: previous attempts have failed to define an adequate parameter-estimation algorithm. In the present paper, I define stochastic attribute-value grammars and give an algorithm for computing the maximum-likelihood estimate of their parameters. The estimation algorithm is adapted from Della Pietra, Della Pietra, and Lafferty (1995). To estimate model parameters, it is necessary to compute the expectations of certain functions under random fields. In the application discussed by Della Pietra, Della Pietra, and Lafferty (representing English orthographic constraints), Gibbs sampling can be used to estimate the needed expectations. The fact that attribute-value grammars generate constrained languages makes Gibbs sampling inapplicable, but I show that sampling can be done using the more general Metropolis-Hastings algorithm.
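
As a generic illustration of the sampler mentioned above (not Abney's estimator for attribute-value grammars), the sketch below uses Metropolis-Hastings with a symmetric proposal to estimate an expectation under a distribution known only up to a normalizing constant.

    # Generic Metropolis-Hastings sketch: estimate E[f(x)] under a distribution
    # known only through an unnormalized weight w, using a symmetric
    # random-walk proposal over a small discrete state space.
    import random

    def metropolis_expectation(states, weight, f, steps=100000, seed=0):
        rng = random.Random(seed)
        x = rng.choice(states)
        total = 0.0
        for _ in range(steps):
            proposal = rng.choice(states)                 # symmetric proposal
            accept = min(1.0, weight(proposal) / weight(x))
            if rng.random() < accept:
                x = proposal
            total += f(x)
        return total / steps

    # Toy example: unnormalized weights over three "parses".
    states = ["p1", "p2", "p3"]
    w = {"p1": 1.0, "p2": 2.0, "p3": 3.0}.get
    print(metropolis_expectation(states, w, lambda s: 1.0 if s == "p3" else 0.0))
    # should be close to 3/6 = 0.5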

217 citations


Journal Article
TL;DR: Using the proposed technique, unknown-word-guessing rule sets were induced and integrated into a stochastic tagger and a rule-based tagger, which were then applied to texts with unknown words.
Abstract: Words unknown to the lexicon present a substantial problem to NLP modules that rely on morphosyntactic information, such as part-of-speech taggers or syntactic parsers. In this paper we present a technique for fully automatic acquisition of rules that guess possible part-of-speech tags for unknown words using their starting and ending segments. The learning is performed from a general-purpose lexicon and word frequencies collected from a raw corpus. Three complementary sets of word-guessing rules are statistically induced: prefix morphological rules, suffix morphological rules and ending-guessing rules. Using the proposed technique, unknown-word-guessing rule sets were induced and integrated into a stochastic tagger and a rule-based tagger, which were then applied to texts with unknown words.
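
A much-simplified sketch of the ending-guessing idea: collect, from a tagged lexicon, which tags co-occur with which word endings, and guess an unknown word's tags from its longest ending seen in training. The thresholds and scoring are placeholders, not the paper's induced rules.

    # Sketch of ending-guessing for unknown words: learn, from a tagged
    # lexicon, which part-of-speech tags co-occur with which word endings,
    # then guess the tags of an unknown word from its longest known ending.
    from collections import defaultdict

    def learn_ending_rules(lexicon, max_len=4, min_count=1):
        """lexicon: dict word -> set of possible tags."""
        counts = defaultdict(lambda: defaultdict(int))
        for word, tags in lexicon.items():
            for k in range(1, min(max_len, len(word) - 1) + 1):
                ending = word[-k:]
                for tag in tags:
                    counts[ending][tag] += 1
        return {e: {t for t, c in tags.items() if c >= min_count}
                for e, tags in counts.items()}

    def guess_tags(word, rules, max_len=4, default=frozenset({"NN"})):
        for k in range(min(max_len, len(word) - 1), 0, -1):   # longest ending first
            tags = rules.get(word[-k:])
            if tags:
                return tags
        return set(default)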

198 citations


Journal ArticleDOI
TL;DR: This article presents an efficient, trainable system for sentence boundary disambiguation, called Satz, which makes simple estimates of the parts of speech of the tokens immediately preceding and following each punctuation mark, and uses these estimates as input to a machine learning algorithm that then classifies the punctuation mark.
Abstract: The sentence is a standard textual unit in natural language processing applications. In many languages the punctuation mark that indicates the end-of-sentence boundary is ambiguous; thus the tokenizers of most NLP systems must be equipped with special sentence boundary recognition rules for every new text collection. As an alternative, this article presents an efficient, trainable system for sentence boundary disambiguation. The system, called Satz, makes simple estimates of the parts of speech of the tokens immediately preceding and following each punctuation mark, and uses these estimates as input to a machine learning algorithm that then classifies the punctuation mark. Satz is very fast both in training and sentence analysis, and its combined robustness and accuracy surpass existing techniques. The system needs only a small lexicon and training corpus, and has been shown to transfer quickly and easily from English to other languages, as demonstrated on French and German.
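
A sketch of the general shape of such a system: represent each period by rough category estimates of its neighbouring tokens and let a classifier decide. The category guesser and the decision rule below are crude stand-ins for Satz's trained components.

    # Sketch of the Satz-style setup: describe each period by coarse category
    # estimates of the surrounding tokens and decide whether it ends a
    # sentence.  The guesser and the rule are stand-ins, not trained models.
    ABBREVIATIONS = {"dr.", "mr.", "mrs.", "prof.", "etc.", "vs."}   # illustrative

    def coarse_category(token):
        if token.lower() in ABBREVIATIONS:
            return "abbrev"
        if token[:1].isdigit():
            return "number"
        if token[:1].isupper():
            return "capitalized"
        return "lower"

    def classify_period(prev_token, next_token):
        """Stand-in decision rule: True means sentence boundary."""
        prev_cat = coarse_category(prev_token)
        next_cat = coarse_category(next_token) if next_token else "end"
        if prev_cat == "abbrev":
            return False                      # e.g. "Dr. Smith" usually continues
        return next_cat in {"capitalized", "end"}

    print(classify_period("Dr.", "Smith"))    # False
    print(classify_period("home.", "The"))    # True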

173 citations


Journal ArticleDOI
TL;DR: This paper presents a general approach to lexical choice that can handle multiple, interacting constraints, and focuses on the problem of floating constraints, semantic or pragmatic constraints that float, appearing at a variety of different syntactic ranks, often merged with other semantic constraints.
Abstract: Lexical choice is a computationally complex task, requiring a generation system to consider a potentially large number of mappings between concepts and words. Constraints that aid in determining which word is best come from a wide variety of sources, including syntax, semantics, pragmatics, the lexicon, and the underlying domain. Furthermore, in some situations, different constraints come into play early on, while in others, they apply much later. This makes it difficult to determine a systematic ordering in which to apply constraints. In this paper, we present a general approach to lexical choice that can handle multiple, interacting constraints. We focus on the problem of floating constraints, semantic or pragmatic constraints that float, appearing at a variety of different syntactic ranks, often merged with other semantic constraints. This means that multiple content units can be realized by a single surface element, and conversely, that a single content unit can be realized by a variety of surface elements. Our approach uses the Functional Unification Formalism (FUF) to represent a generation lexicon, allowing for declarative and compositional representation of individual constraints.

96 citations


Journal Article
TL;DR: This paper describes KNIGHT, a robust explanation system that constructs multisentential and multi-paragraph explanations from the Biology Knowledge Base, a large-scale knowledge base in the domain of botanical anatomy, physiology, and development, and introduces the Two-Panel evaluation methodology.
Abstract: To explain complex phenomena, an explanation system must be able to select information from a formal representation of domain knowledge, organize the selected information into multisentential discourse plans, and realize the discourse plans in text. Although recent years have witnessed significant progress in the development of sophisticated computational mechanisms for explanation, empirical results have been limited. This paper reports on a seven-year effort to empirically study explanation generation from semantically rich, large-scale knowledge bases. In particular, it describes KNIGHT, a robust explanation system that constructs multisentential and multi-paragraph explanations from the Biology Knowledge Base, a large-scale knowledge base in the domain of botanical anatomy, physiology, and development. We introduce the Two-Panel evaluation methodology and describe how KNIGHT's performance was assessed with this methodology in the most extensive empirical evaluation conducted on an explanation system. In this evaluation, KNIGHT scored within "half a grade" of domain experts, and its performance exceeded that of one of the domain experts.

90 citations


Journal ArticleDOI
TL;DR: An algorithm capable of identifying the translation for each word in a bilingual corpus by exploiting lexicographic resources is presented, drawing on the two classification systems of words in Longman Lexicon of Contemporary English and Tongyici Cilin.
Abstract: This paper presents an algorithm capable of identifying the translation for each word in a bilingual corpus. Previously proposed methods rely heavily on word-based statistics. Under a word-based approach, frequent words with a consistent translation can be aligned at a high rate of precision. However, words that are less frequent or exhibit diverse translations generally do not have statistically significant evidence for confident alignment, thereby leading to incomplete or incorrect alignments. The algorithm proposed herein attempts to broaden coverage by exploiting lexicographic resources. To this end, we draw on the two classification systems of words in Longman Lexicon of Contemporary English (LLOCE) and Tongyici Cilin (Synonym Forest, CILIN). Automatically acquired class-based alignment rules are used to compensate for what is lacking in a bilingual dictionary such as the English-Chinese version of the Longman Dictionary of Contemporary English (LecDOCE). In addition, this alignment method is implemented using LecDOCE examples and their translations for training and testing, while further examples from a technical manual in both English and Chinese are used for an open test. Quantitative results of the closed and open tests are also summarized.
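
A sketch of the back-off idea described above, with toy data structures of our own: prefer an alignment licensed by the bilingual dictionary, and otherwise accept one licensed by an acquired class-pairing rule.

    # Sketch of dictionary-first word alignment with a class-based fallback:
    # for each source word, prefer a target word the bilingual dictionary
    # licenses; otherwise accept a target word whose semantic class is paired
    # with the source word's class by an acquired class rule.
    def align(source_words, target_words, bilingual_dict, word_class, class_rules):
        links = []
        for i, sw in enumerate(source_words):
            match = None
            for j, tw in enumerate(target_words):
                if tw in bilingual_dict.get(sw, set()):          # dictionary evidence
                    match = j
                    break
            if match is None:
                for j, tw in enumerate(target_words):            # class-based fallback
                    if (word_class.get(sw), word_class.get(tw)) in class_rules:
                        match = j
                        break
            links.append((i, match))
        return links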

Journal Article
TL;DR: In this paper, a bidirectional, head-driven parser for constraint-based grammars is described for the OVIS system, which is a Dutch spoken dialogue system in which information about public transport can be obtained by telephone.
Abstract: This paper describes an efficient and robust implementation of a bidirectional, head-driven parser for constraint-based grammars. This parser is developed for the OVIS system: a Dutch spoken dialogue system in which information about public transport can be obtained by telephone. After a review of the motivation for head-driven parsing strategies, and head-corner parsing in particular, a nondeterministic version of the head-corner parser is presented. A memoization technique is applied to obtain a fast parser. A goal-weakening technique is introduced, which greatly improves average case efficiency, both in terms of speed and space requirements. I argue in favor of such a memoization strategy with goal-weakening in comparison with ordinary chart parsers because such a strategy can be applied selectively and therefore enormously reduces the space requirements of the parser, while no practical loss in time-efficiency is observed. On the contrary, experiments are described in which head-corner and left-corner parsers implemented with selective memoization and goal weakening outperform "standard" chart parsers. The experiments include the grammar of the OVIS system and the Alvey NL Tools grammar. Head-corner parsing is a mix of bottom-up and top-down processing. Certain approaches to robust parsing require purely bottom-up processing. Therefore, it seems that head-corner parsing is unsuitable for such robust parsing techniques. However, it is shown how underspecification (which arises very naturally in a logic programming environment) can be used in the head-corner parser to allow such robust parsing techniques. A particular robust parsing model, implemented in OVIS, is described.
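
The memoization point can be shown in a much reduced form: cache parser results per goal so that each (category, position) goal is solved at most once. The recognizer below is a plain top-down CFG recognizer, not the head-corner parser, and it says nothing about goal weakening; the grammar and lexicon are toy examples.

    # Much-simplified illustration of memoizing parser goals: a top-down CFG
    # recognizer whose results for (category, start position) are cached, so
    # each goal is solved at most once.  (Not head-corner parsing itself.)
    from functools import lru_cache

    GRAMMAR = {
        "S":  [["NP", "VP"]],
        "NP": [["det", "noun"], ["noun"]],
        "VP": [["verb", "NP"], ["verb"]],
    }

    def recognize(tokens, lexicon):
        tokens = tuple(tokens)

        @lru_cache(maxsize=None)
        def goal(cat, start):
            """Set of end positions reachable by parsing cat from start."""
            if cat in lexicon:                       # preterminal category
                ok = start < len(tokens) and tokens[start] in lexicon[cat]
                return frozenset({start + 1}) if ok else frozenset()
            ends = set()
            for rhs in GRAMMAR[cat]:
                frontier = {start}
                for sym in rhs:
                    frontier = {e for p in frontier for e in goal(sym, p)}
                ends |= frontier
            return frozenset(ends)

        return len(tokens) in goal("S", 0)

    lexicon = {"det": {"the"}, "noun": {"cat", "dog"}, "verb": {"saw"}}
    print(recognize("the cat saw the dog".split(), lexicon))   # True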

Journal Article
TL;DR: New and successful approaches to the construction of letter-to-sound rules for English and French are described, which range from the trivial in languages like Spanish or Swahili, to extremely complex in languages such asEnglish and French.
Abstract: Letter-to-sound rules, also known as grapheme-to-phoneme rules, are important computational tools and have been used for a variety of purposes including word or name lookups for database searches and speech synthesis. These rules are especially useful when integrated into database searches on names and addresses, since they can complement orthographic search algorithms that make use of permutation, deletion, and insertion by allowing for a comparison with the phonetic equivalent. In databases, phonetics can help retrieve a word or a proper name without the user needing to know the correct spelling. A phonetic index is built with the vocabulary of the application. This could be an entire dictionary, or a list of proper names. The searched word is then converted into phonetics and retrieved with its information, if the word is in the phonetic index. This phonetic lookup can be used to retrieve a misspelled word in a dictionary or a database, or in a text editor to suggest corrections. Such rules are also necessary to formalize grapheme-phoneme correspondences in speech synthesis architecture. In text-to-speech systems, these rules are typically used to create phonemes from computer text. These phonemic symbols, in turn, are used to feed lower-level phonetic modules (such as timing, intonation, vowel formant trajectories, etc.) which, in turn, feed a vocal tract model and finally output a waveform and, via a digital-analogue converter, synthesized speech. Such rules are a necessary and integral part of a text-to-speech system since a database lookup (dictionary search) is not sufficient to handle derived forms, new words, nonce forms, proper nouns, low-frequency technical jargon, and the like; such forms typically are not included in the database. And while the use of a dictionary is more important now that denser and faster memory is available to smaller systems, letter-to-sound still plays a crucial and central role in speech synthesis technology. Grapheme-to-phoneme technology is also useful in speech recognition, as a way of generating pronunciations for new words that may be available in grapheme form, or for naive users to add new words more easily. In that case, the system must generate the multiple variations of the word. While there are different problems in languages that use non-alphabetic writing systems (syllabaries, as in Japanese, or logographic systems, as in Chinese) (DeFrancis 1984), all alphabetic systems have a structured set of correspondences. These range from the trivial in languages like Spanish or Swahili, to extremely complex in languages such as English and French. This paper will outline some of the previous attempts to construct such rule sets and will describe new and successful approaches to the construction of letter-to-sound rules for English and French.
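
A toy sketch of ordered letter-to-sound rewrite rules, tried longest-match-first at each position. The handful of rules and the phoneme symbols are purely illustrative and nowhere near a realistic English or French rule set.

    # Toy letter-to-sound conversion with ordered rewrite rules: at each
    # position, the first (longest) matching grapheme rule fires and emits a
    # phoneme string.  Rules and symbols below are purely illustrative.
    RULES = [          # (grapheme, phonemes), tried in this order
        ("tion", "SH AH N"),
        ("ph",   "F"),
        ("th",   "TH"),
        ("ch",   "CH"),
        ("ee",   "IY"),
        ("a",    "AE"),
        ("e",    "EH"),
        ("i",    "IH"),
        ("o",    "AA"),
        ("u",    "AH"),
    ]
    CONSONANTS = {c: c.upper() for c in "bdfgklmnprstvz"}

    def letter_to_sound(word):
        word = word.lower()
        i, phones = 0, []
        while i < len(word):
            for grapheme, ph in RULES:
                if word.startswith(grapheme, i):
                    phones.append(ph)
                    i += len(grapheme)
                    break
            else:
                phones.append(CONSONANTS.get(word[i], word[i].upper()))
                i += 1
        return " ".join(phones)

    print(letter_to_sound("nation"))   # N AE SH AH N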

Journal Article
TL;DR: It is believed that critical tokenization provides a precise mathematical description of the principle of maximum tokenization, and forms the sound mathematical foundation for categorizing tokenization ambiguity into critical and hidden types.
Abstract: Tokenization is the process of mapping sentences from character strings into strings of words. This paper sets out to study critical tokenization, a distinctive type of tokenization following the principle of maximum tokenization. The objective in this paper is to develop its mathematical description and understanding. The main results are as follows: (1) Critical points are all and only unambiguous token boundaries for any character string on a complete dictionary; (2) Any critically tokenized word string is a minimal element in the partially ordered set of all tokenized word strings with respect to the word string cover relation; (3) Any tokenized string can be reproduced from a critically tokenized word string but not vice versa; (4) Critical tokenization forms the sound mathematical foundation for categorizing tokenization ambiguity into critical and hidden types, a precise mathematical understanding of conventional concepts like combinational and overlapping ambiguities; (5) Many important maximum tokenization variations, such as forward and backward maximum matching and shortest tokenization, are all true subclasses of critical tokenization. It is believed that critical tokenization provides a precise mathematical description of the principle of maximum tokenization. Important implications and practical applications of critical tokenization in effective ambiguity resolution and in efficient tokenization implementation are also carefully examined.
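
Forward maximum matching, one of the maximum-tokenization strategies the paper analyses, is simple to state in code; a sketch with a toy dictionary:

    # Sketch of forward maximum matching: at each position, take the longest
    # dictionary word that matches, then advance past it.
    def forward_maximum_matching(chars, dictionary, max_word_len=4):
        tokens, i = [], 0
        while i < len(chars):
            for k in range(min(max_word_len, len(chars) - i), 0, -1):
                candidate = chars[i:i + k]
                if k == 1 or candidate in dictionary:   # single char as fallback
                    tokens.append(candidate)
                    i += k
                    break
        return tokens

    dictionary = {"ab", "abc", "cd", "d"}
    print(forward_maximum_matching("abcd", dictionary))   # ['abc', 'd']

Note that the same string also tokenizes as ['ab', 'cd']; this is the kind of overlapping ambiguity that the critical-tokenization framework classifies.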

Journal Article
Andrew Kehler
TL;DR: The fundamental concepts of centering theory are reviewed and some facets of the pronoun interpretation problem that motivate a centering-style analysis are discussed, as well as some problems with a popular Centering-based approach.
Abstract: We review the fundamental concepts of centering theory and discuss some facets of the pronoun interpretation problem that motivate a centering-style analysis. We then demonstrate some problems with a popular centering-based approach with respect to these motivations.

Journal Article
TL;DR: An empirically based system that automatically resolves VP ellipsis in the 644 examples identified in the parsed Penn Treebank is reported on, and the performance of the system is comparable to the best existing systems for pronoun resolution.
Abstract: This paper reports on an empirically based system that automatically resolves VP ellipsis in the 644 examples identified in the parsed Penn Treebank. The results reported here represent the first systematic corpus-based study of VP ellipsis resolution, and the performance of the system is comparable to the best existing systems for pronoun resolution. The methodology and utilities described can be applied to other discourse-processing problems, such as other forms of ellipsis and anaphora resolution. The system determines potential antecedents for ellipsis by applying syntactic constraints, and these antecedents are ranked by combining structural and discourse preference factors such as recency, clausal relations, and parallelism. The system is evaluated by comparing its output to the choices of human coders. The system achieves a success rate of 94.8%, where success is defined as sharing of a head between the system choice and the coder choice, while a baseline recency-based scheme achieves a success rate of 75.0% by this measure. Other criteria for success are also examined. When success is defined as an exact, word-for-word match with the coder choice, the system performs with 76.0% accuracy, and the baseline approach achieves only 14.6% accuracy. Analysis of the individual components of the system shows that each of the structural and discourse constraints used is a strong predictor of the antecedent of VP ellipsis.
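
A sketch of the ranking step described above; the candidate representation, features, and weights are illustrative placeholders, not the paper's tuned preference factors.

    # Sketch of preference-based ranking of candidate antecedents for a VP
    # ellipsis site: combine recency with simple structural and discourse
    # preferences.  Weights and fields are illustrative only.
    def rank_antecedents(candidates, ellipsis_position):
        """candidates: list of dicts with 'position', 'in_same_sentence',
        'is_parallel', and 'in_subordinate_clause' fields."""
        def score(c):
            s = -abs(ellipsis_position - c["position"])      # recency
            if c["in_same_sentence"]:
                s += 2.0
            if c["is_parallel"]:                             # structural parallelism
                s += 3.0
            if c["in_subordinate_clause"]:
                s -= 1.0
            return s
        return sorted(candidates, key=score, reverse=True)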

Journal Article
TL;DR: This issue brings together a collection of papers illustrating recent approaches to empirical research in discourse generation and interpretation, and describes an empirical research strategy that leads from empirical findings to general theories.
Abstract: Computational theories of discourse are concerned with the context-based interpretation or generation of discourse phenomena in text and dialogue. In the past, research in this area focused on specifying the mechanisms underlying particular discourse phenomena; the models proposed were often motivated by a few constructed examples. While this approach led to many theoretical advances, models developed in this manner are difficult to evaluate because it is hard to tell whether they generalize beyond the particular examples used to motivate them. Recently, however, the field has turned to issues of robustness and the coverage of theories of particular phenomena with respect to specific types of data. This new empirical focus is supported by several recent advances: an increasing theoretical consensus on discourse models; the availability of large amounts of online dialogue and textual corpora; and improvements in component technologies and tools for building and testing discourse and dialogue testbeds. This means that it is now possible to determine how representative particular discourse phenomena are, how frequently they occur, whether they are related to other phenomena, what percentage of the cases a particular model covers, the inherent difficulty of the problem, and how well an algorithm for processing or generating the phenomena should perform to be considered a good model. This issue brings together a collection of papers illustrating recent approaches to empirical research in discourse generation and interpretation. Section 2 gives a general overview of empirical studies in discourse and describes an empirical research strategy that leads from empirical findings to general theories. Section 3 discusses how each article exemplifies the empirical research strategy and how empirical methods have been employed in each research project.

Journal Article
TL;DR: Analysis of the dialogue structure of actual human-computer interactions indicates there are differences in user behavior and dialogue structure as a function of the computer's level of initiative, and provides evidence that a spoken natural language dialogue system must be capable of varying itslevel of initiative to facilitate effective interaction with users of varying levels of expertise and experience.
Abstract: This paper presents an analysis of the dialogue structure of actual human-computer interactions. The 141 dialogues analyzed were produced from experiments with a variable initiative spoken natural language dialogue system organized around the paradigm of the Missing Axiom Theory for language use. Results about utterance classification into subdialogues, frequency of user-initiated subdialogue transitions, regularity of subdialogue transitions, frequency of linguistic control shifts, and frequency of user-initiated error corrections are presented. These results indicate there are differences in user behavior and dialogue structure as a function of the computer's level of initiative. Furthermore, they provide evidence that a spoken natural language dialogue system must be capable of varying its level of initiative in order to facilitate effective interaction with users of varying levels of expertise and experience.

Journal Article
TL;DR: The results of the comparison show that the rules are fairly effective in dealing with the generation of anaphora in Chinese, and they are implemented in a Chinese natural language generation system that is able to generate descriptive texts.
Abstract: The goal of this work is to study how to generate various kinds of anaphora in Chinese, including zero, pronominal, and nominal anaphora, from the syntactic and semantic representation of multisentential text. In this research we confine ourselves to descriptive texts. We examine the occurrence of anaphora in human-generated text and those generated by a hypothetical computer equipped with anaphor generation rules, assuming that the computer can generate the same texts as the human except that anaphora are generated by the rules. A sequence of rules using independently motivated linguistic constraints is developed until the results obtained are close to those in the real texts. The best rule obtained for the choice of anaphor type makes use of the following conditions: locality between anaphor and antecedent, syntactic constraints on zero anaphora, discourse segment structures, salience of objects and animacy of objects. We further establish a rule for choosing descriptions if a nominal anaphor is decided on. We have implemented the above rules in a Chinese natural language generation system that is able to generate descriptive texts. We sent some generated texts to a number of native speakers of Chinese and compared human-created results and computer-generated text to investigate the quality of the generated anaphora. The results of the comparison show that the rules are fairly effective in dealing with the generation of anaphora in Chinese.
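
A sketch of an anaphor-type decision rule of the kind the paper develops; the exact conditions and their ordering here are illustrative only, not the paper's final rule.

    # Sketch of an anaphor-type choice rule: pick zero, pronominal, or nominal
    # anaphora from a handful of conditions (locality, syntactic constraints,
    # segment structure, salience, animacy).  Conditions are illustrative.
    def choose_anaphor_type(ref):
        """ref: dict of boolean/int fields describing the referring context."""
        if ref["same_clause_as_antecedent"] and ref["zero_allowed_syntactically"]:
            return "zero"
        if (ref["distance_in_clauses"] <= 1
                and ref["same_discourse_segment"]
                and ref["antecedent_salient"]
                and ref["antecedent_animate"]):
            return "pronoun"
        return "nominal"    # full description when the antecedent is less accessible

    example = {"same_clause_as_antecedent": False, "zero_allowed_syntactically": False,
               "distance_in_clauses": 1, "same_discourse_segment": True,
               "antecedent_salient": True, "antecedent_animate": True}
    print(choose_anaphor_type(example))   # pronoun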

Journal Article
TL;DR: A compiler is described which translates a set of lexical rules and their interaction into a definite clause encoding, which is called by the base lexical entries in the lexicon, and which ensures the automatic transfer of properties not changed by a lexical rule.
Abstract: This paper proposes a new computational treatment of lexical rules as used in the HPSG framework. A compiler is described which translates a set of lexical rules and their interaction into a definite clause encoding, which is called by the base lexical entries in the lexicon. This way, the disjunctive possibilities arising from lexical rule application are encoded as systematic covariation in the specification of lexical entries. The compiler ensures the automatic transfer of properties not changed by a lexical rule. Program transformation techniques are used to advance the encoding. The final output of the compiler constitutes an efficient computational counterpart of the linguistic generalizations captured by lexical rules and allows on-the-fly application of lexical rules.

Journal Article
TL;DR: In this article, the authors attempt to determine whether this dependency arises directly from some uniform relation between the two clauses or indirectly from independently motivated discourse principles that govern pronominal reference.
Abstract: It has long been known that the anaphoric relations contained in the implicit meaning of an elided verb phrase depend on the corresponding anaphoric relations contained in the source of the ellipsis. In order to identify the underlying cause of this dependency, the authors attempt to determine whether it arises directly from some uniform relation between the two clauses, or indirectly from independently motivated discourse principles that govern pronominal reference.

Journal Article
TL;DR: The authors propose a technique for implementing a parser based on the Lexical Functional Grammar formalism for Indian languages in general and Bangla (Bengali) in particular.
Abstract: The authors propose a technique for implementing a parser based on the Lexical Functional Grammar formalism, intended for Indian languages in general and for Bangla (Bengali) in particular. Indian languages are for the most part non-configurational and highly inflectional. Their grammatical functions (GFs) are specified by the case inflections on the head nouns of noun phrases and by the postpositional particles of postpositional phrases. However, since there is no systematic correspondence between GFs and case markers, conventional techniques have had to account for a number of alternations in the syntactic encoding of GFs. The authors show here how these alternations can be reduced by relying on delayed evaluation of the syntactic encoding scheme.
