
Showing papers in "Computational Linguistics in 2002"


Journal ArticleDOI
TL;DR: A system for identifying the semantic relationships, or semantic roles, filled by constituents of a sentence within a semantic frame, based on statistical classifiers trained on roughly 50,000 sentences that were hand-annotated with semantic roles by the FrameNet semantic labeling project.
Abstract: We present a system for identifying the semantic relationships, or semantic roles, filled by constituents of a sentence within a semantic frame. Given an input sentence and a target word and frame, the system labels constituents with either abstract semantic roles, such as AGENT or PATIENT, or more domain-specific semantic roles, such as SPEAKER, MESSAGE, and TOPIC. The system is based on statistical classifiers trained on roughly 50,000 sentences that were hand-annotated with semantic roles by the FrameNet semantic labeling project. We then parsed each training sentence into a syntactic tree and extracted various lexical and syntactic features, including the phrase type of each constituent, its grammatical function, and its position in the sentence. These features were combined with knowledge of the predicate verb, noun, or adjective, as well as information such as the prior probabilities of various combinations of semantic roles. We used various lexical clustering algorithms to generalize across possible fillers of roles. Test sentences were parsed, were annotated with these features, and were then passed through the classifiers. Our system achieves 82% accuracy in identifying the semantic role of presegmented constituents. At the more difficult task of simultaneously segmenting constituents and identifying their semantic role, the system achieved 65% precision and 61% recall. Our study also allowed us to compare the usefulness of different features and feature combination methods in the semantic role labeling task. We also explore the integration of role labeling with statistical syntactic parsing and attempt to generalize to predicates unseen in the training data.
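
The pipeline sketched in the abstract (parse, extract per-constituent features, classify into roles) can be illustrated at toy scale. This is only a minimal sketch of feature-based role classification, not the authors' system: the feature names and training items are invented, and scikit-learn's DictVectorizer with a logistic regression model stands in for the statistical classifiers and probability estimates described above.

    # Toy sketch of feature-based semantic role classification (not the authors' system).
    from sklearn.feature_extraction import DictVectorizer
    from sklearn.linear_model import LogisticRegression

    # One training item per (pre-segmented) constituent: lexical/syntactic features
    # of the kind listed above, plus its gold role. Feature names are invented.
    train_feats = [
        {"phrase_type": "NP", "position": "before", "voice": "active",
         "head_word": "officer", "target": "say"},
        {"phrase_type": "NP", "position": "after", "voice": "active",
         "head_word": "statement", "target": "say"},
        {"phrase_type": "PP", "position": "after", "voice": "active",
         "head_word": "about", "target": "say"},
    ]
    train_roles = ["SPEAKER", "MESSAGE", "TOPIC"]

    vec = DictVectorizer()
    clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(train_feats), train_roles)

    # Label a constituent of an unseen test sentence.
    test_feat = {"phrase_type": "NP", "position": "before", "voice": "active",
                 "head_word": "spokesman", "target": "say"}
    print(clf.predict(vec.transform([test_feat]))[0])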

1,666 citations


Journal ArticleDOI
TL;DR: The authors proposed a strategy for the summarization of scientific articles that concentrates on the rhetorical status of statements in an article: material for summaries is selected in such a way that summaries can highlight the new contribution of the source article and situate it with respect to earlier work.
Abstract: In this article we propose a strategy for the summarization of scientific articles that concentrates on the rhetorical status of statements in an article: Material for summaries is selected in such a way that summaries can highlight the new contribution of the source article and situate it with respect to earlier work. We provide a gold standard for summaries of this kind consisting of a substantial corpus of conference articles in computational linguistics annotated with human judgments of the rhetorical status and relevance of each sentence in the articles. We present several experiments measuring our judges' agreement on these annotations. We also present an algorithm that, on the basis of the annotated training material, selects content from unseen articles and classifies it into a fixed set of seven rhetorical categories. The output of this extraction and classification system can be viewed as a single-document summary in its own right; alternatively, it provides starting material for the generation of task-oriented and user-tailored summaries designed to give users an overview of a scientific field.

584 citations


Journal ArticleDOI
TL;DR: This work focuses on automatic summarization of open-domain multiparty dialogues in diverse genres, and on the development of a robust practical text summarizer based on rhetorical structure extraction.

529 citations


Journal ArticleDOI
TL;DR: A simple modification to the Pk metric is proposed, called WindowDiff, which moves a fixed-sized window across the text and penalizes the algorithm whenever the number of boundaries within the window does not match the true number of boundaries for that window of text.
Abstract: The Pk evaluation metric, initially proposed by Beeferman, Berger, and Lafferty (1997), is becoming the standard measure for assessing text segmentation algorithms. However, a theoretical analysis of the metric finds several problems: the metric penalizes false negatives more heavily than false positives, overpenalizes near misses, and is affected by variation in segment size distribution. We propose a simple modification to the Pk metric that remedies these problems. This new metric, called WindowDiff, moves a fixed-sized window across the text and penalizes the algorithm whenever the number of boundaries within the window does not match the true number of boundaries for that window of text.
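
The metric as described translates almost directly into code. Below is a minimal sketch, not the authors' implementation: segmentations are given as 0/1 boundary flags, and, by the usual convention, the window size k defaults to half the mean reference segment length when not supplied.

    def window_diff(reference, hypothesis, k=None):
        """WindowDiff as described above: slide a window of size k across the text
        and count the positions where the number of reference boundaries inside the
        window differs from the number of hypothesized boundaries. Inputs are
        equal-length lists of 0/1 flags marking a boundary after each position."""
        n = len(reference)
        assert len(hypothesis) == n
        if k is None:
            # Conventional default: half the average reference segment length.
            k = max(1, round(n / (2 * (sum(reference) + 1))))
        errors = sum(
            1 for i in range(n - k)
            if sum(reference[i:i + k]) != sum(hypothesis[i:i + k])
        )
        return errors / (n - k)

    # Example: a hypothesis whose only error is a near-miss boundary.
    print(window_diff([0, 0, 1, 0, 0, 0, 1, 0], [0, 0, 0, 1, 0, 0, 1, 0], k=2))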

439 citations


Journal ArticleDOI
TL;DR: A preliminary theory to account for near-synonymy is proposed, relying crucially on the notion of granularity of representation, in which the meaning of a word arises out of a context-dependent combination of a context-independent core meaning and a set of explicit differences to its near-synonyms.
Abstract: We develop a new computational model for representing the fine-grained meanings of near-synonyms and the differences between them. We also develop a lexical-choice process that can decide which of several near-synonyms is most appropriate in a particular situation. This research has direct applications in machine translation and text generation. We first identify the problems of representing near-synonyms in a computational lexicon and show that no previous model adequately accounts for near-synonymy. We then propose a preliminary theory to account for near-synonymy, relying crucially on the notion of granularity of representation, in which the meaning of a word arises out of a context-dependent combination of a context-independent core meaning and a set of explicit differences to its near-synonyms. That is, near-synonyms cluster together. We then develop a clustered model of lexical knowledge, derived from the conventional ontological model. The model cuts off the ontology at a coarse grain, thus avoiding an awkward proliferation of language-dependent concepts in the ontology, yet maintaining the advantages of efficient computation and reasoning. The model groups near-synonyms into subconceptual clusters that are linked to the ontology. A cluster differentiates near-synonyms in terms of fine-grained aspects of denotation, implication, expressed attitude, and style. The model is general enough to account for other types of variation, for instance, in collocational behavior. An efficient, robust, and flexible fine-grained lexical-choice process is a consequence of a clustered model of lexical knowledge. To make it work, we formalize criteria for lexical choice as preferences to express certain concepts with varying indirectness, to express attitudes, and to establish certain styles. The lexical-choice process itself works on two tiers: between clusters and between near-synonyms of clusters. We describe our prototype implementation of the system, called I-Saurus.

252 citations


Journal ArticleDOI
TL;DR: This paper presents a linear-time algorithm for lexical chain computation, which makes lexical chains a computationally feasible candidate as an intermediate representation for automatic text summarization; a method for evaluating lexical chains as an intermediate step in summarization is also presented and carried out.
Abstract: While automatic text summarization is an area that has received a great deal of attention in recent research, the problem of efficiency in this task has not been frequently addressed. When the size and quantity of documents available on the Internet and from other sources are considered, the need for a highly efficient tool that produces usable summaries is clear. We present a linear-time algorithm for lexical chain computation. The algorithm makes lexical chains a computationally feasible candidate as an intermediate representation for automatic text summarization. A method for evaluating lexical chains as an intermediate step in summarization is also presented and carried out. Such an evaluation was heretofore not possible because of the computational complexity of previous lexical chains algorithms.

174 citations


Journal ArticleDOI
TL;DR: SumUM is a text summarization system that takes a raw technical text as input and produces an indicative-informative summary that motivates the topics, describes entities, and defines concepts.
Abstract: We present and evaluate SumUM, a text summarization system that takes a raw technical text as input and produces an indicative-informative summary. The indicative part of the summary identifies the topics of the document, and the informative part elaborates on some of these topics according to the reader's interest. SumUM motivates the topics, describes entities, and defines concepts. It is a first step for exploring the issue of dynamic summarization. This is accomplished through a process of shallow syntactic and semantic analysis, concept identification, and text regeneration. Our method was developed through the study of a corpus of abstracts written by professional abstractors. Relying on human judgment, we have evaluated indicativeness, informativeness, and text acceptability of the automatic summaries. The results thus far indicate good performance when compared with other summarization technologies.

149 citations


Journal ArticleDOI
TL;DR: This article concerns the estimation of a particular kind of probability, namely, the probability of a noun sense appearing as a particular argument of a predicate, and a procedure is developed that uses a chi-square test to determine a suitable level of generalization.
Abstract: This article concerns the estimation of a particular kind of probability, namely, the probability of a noun sense appearing as a particular argument of a predicate. In order to overcome the accompanying sparse-data problem, the proposal here is to define the probabilities in terms of senses from a semantic hierarchy and exploit the fact that the senses can be grouped into classes consisting of semantically similar senses. There is a particular focus on the problem of how to determine a suitable class for a given sense, or, alternatively, how to determine a suitable level of generalization in the hierarchy. A procedure is developed that uses a chi-square test to determine a suitable level of generalization. In order to test the performance of the estimation method, a pseudo-disambiguation task is used, together with two alternative estimation methods. Each method uses a different generalization procedure; the first alternative uses the minimum description length principle, and the second uses Resnik's measure of selectional preference. In addition, the performance of our method is investigated using both the standard Pearson chi-square statistic and the log-likelihood chi-square statistic.
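
For reference, the two test statistics named at the end are the standard ones; with observed counts O_i and expected counts E_i over the cells of the relevant contingency table they take the usual forms (standard definitions, not notation taken from the article):

    X^2 = \sum_i \frac{(O_i - E_i)^2}{E_i},
    \qquad
    G^2 = 2 \sum_i O_i \ln \frac{O_i}{E_i}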

142 citations


Journal ArticleDOI
TL;DR: A logical perspective is brought to the generation of referring expressions, addressing the incompleteness of existing algorithms in this area, and generalizations and extensions of the Incremental Algorithm of Dale and Reiter (1995) are proposed.
Abstract: This paper brings a logical perspective to the generation of referring expressions, addressing the incompleteness of existing algorithms in this area. After studying references to individual objects, we discuss references to sets, including Boolean descriptions that make use of negated and disjoined properties. To guarantee that a distinguishing description is generated whenever such descriptions exist, the paper proposes generalizations and extensions of the Incremental Algorithm of Dale and Reiter (1995).
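
Since the proposals are generalizations of the Incremental Algorithm of Dale and Reiter (1995), a minimal sketch of that base algorithm is useful background: walk a fixed preference order over attributes and keep any property that rules out at least one remaining distractor, stopping once the referent is distinguished. The domain and attribute names below are invented for illustration, and the sketch omits the extensions to sets and Boolean descriptions that the paper develops.

    def incremental_algorithm(referent, distractors, preferred_attributes, domain):
        """Basic Incremental Algorithm (sketch): returns a list of (attribute, value)
        pairs that distinguishes `referent` from `distractors`, or None if no such
        description exists with the given attributes. `domain` maps each object id
        to a dict of attribute -> value."""
        description = []
        remaining = set(distractors)
        for attr in preferred_attributes:
            value = domain[referent].get(attr)
            if value is None:
                continue
            ruled_out = {d for d in remaining if domain[d].get(attr) != value}
            if ruled_out:
                description.append((attr, value))
                remaining -= ruled_out
            if not remaining:
                return description
        return None  # the incompleteness case the paper is concerned with

    # Invented toy domain: refer to d1 among {d1, d2, d3}.
    domain = {
        "d1": {"type": "dog", "colour": "black", "size": "small"},
        "d2": {"type": "dog", "colour": "white", "size": "small"},
        "d3": {"type": "cat", "colour": "black", "size": "small"},
    }
    print(incremental_algorithm("d1", ["d2", "d3"], ["type", "colour", "size"], domain))
    # -> [('type', 'dog'), ('colour', 'black')]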

128 citations


Journal ArticleDOI
TL;DR: This article presents an approach that treats the interpretation of nominalizations as a disambiguation problem and shows how it can re-create the missing distributional evidence by exploiting partial parsing, smoothing techniques, and contextual information.
Abstract: This article addresses the interpretation of nominalizations, a particular class of compound nouns whose head noun is derived from a verb and whose modifier is interpreted as an argument of this verb. Any attempt to automatically interpret nominalizations needs to take into account: (a) the selectional constraints imposed by the nominalized compound head, (b) the fact that the relation of the modifier and the head noun can be ambiguous, and (c) the fact that these constraints can be easily overridden by contextual or pragmatic factors. The interpretation of nominalizations poses a further challenge for probabilistic approaches since the argument relations between a head and its modifier are not readily available in the corpus. Even an approximation that maps the compound head to its underlying verb provides insufficient evidence. We present an approach that treats the interpretation task as a disambiguation problem and show how we can "re-create" the missing distributional evidence by exploiting partial parsing, smoothing techniques, and contextual information. We combine these distinct information sources using Ripper, a system that learns sets of rules from data, and achieve an accuracy of 86.1% (over a baseline of 61.5%) on the British National Corpus.

121 citations


Journal ArticleDOI
TL;DR: The authors presented an approach for obtaining automatic-extract summaries for human transcripts of multiparty dialogues of four different genres, without any restriction on domain, addressing the following issues: detection and removal of speech disfluencies; detection and insertion of sentence boundaries; and detection and linking of cross-speaker information units (question-answer pairs).
Abstract: Automatic summarization of open-domain spoken dialogues is a relatively new research area. This article introduces the task and the challenges involved and motivates and presents an approach for obtaining automatic-extract summaries for human transcripts of multiparty dialogues of four different genres, without any restriction on domain.We address the following issues, which are intrinsic to spoken-dialogue summarization and typically can be ignored when summarizing written text such as news wire data: (1) detection and removal of speech disfluencies; (2) detection and insertion of sentence boundaries; and (3) detection and linking of cross-speaker information units (question-answer pairs).A system evaluation is performed using a corpus of 23 dialogue excerpts with an average duration of about 10 minutes, comprising 80 topical segments and about 47,000 words total. The corpus was manually annotated for relevant text spans by six human annotators. The global evaluation shows that for the two more informal genres, our summarization system using dialogue-specific components significantly outperforms two baselines: (1) a maximum-marginal-relevance ranking algorithm using TF*IDF term weighting, and (2) a LEAD baseline that extracts the first n words from a text.
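
One of the two baselines is maximum-marginal-relevance (MMR) ranking with TF*IDF term weighting. For orientation, the standard MMR criterion (the general formulation, not a detail specific to this article) selects the next sentence s from the remaining candidates R \ S, given the already selected set S, a query or centroid Q, similarity functions sim_1 and sim_2 (here TF*IDF cosine similarities), and a relevance/redundancy trade-off lambda:

    \mathrm{MMR} = \arg\max_{s \in R \setminus S}
      \Big[ \lambda \, \mathrm{sim}_1(s, Q) - (1 - \lambda) \max_{s' \in S} \mathrm{sim}_2(s, s') \Big]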

Journal ArticleDOI
Hongyan Jing
TL;DR: A hidden Markov model solution to the problem of summary sentence decomposition is proposed, which can lead to better text generation techniques for summarization.
Abstract: Professional summarizers often reuse original documents to generate summaries. The task of summary sentence decomposition is to deduce whether a summary sentence is constructed by reusing the original text and to identify reused phrases. Specifically, the decomposition program needs to answer three questions for a given summary sentence: (1) Is this summary sentence constructed by reusing the text in the original document? (2) If so, what phrases in the sentence come from the original document? and (3) From where in the document do the phrases come? Solving the decomposition problem can lead to better text generation techniques for summarization. Decomposition can also provide large training and testing corpora for extraction-based summarizers. We propose a hidden Markov model solution to the decomposition problem. Evaluations show that the proposed algorithm performs well.
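
The HMM formulation can be sketched generically: treat positions in the original document as hidden states, the words of a summary sentence as observations, and recover the most likely alignment with Viterbi decoding. The transition and emission scores below are toy placeholders, not the models estimated in the article.

    import math

    def viterbi_decompose(summary_words, doc_words, transition, emission):
        """Generic Viterbi decoding for summary-sentence decomposition (sketch).
        Hidden states are positions 0..len(doc_words)-1 in the original document;
        observations are the words of one summary sentence. `transition(i, j)` and
        `emission(j, word)` return log-scores standing in for the article's models.
        Returns, for each summary word, the document position it is aligned to."""
        n_states = len(doc_words)
        prev = [emission(j, summary_words[0]) for j in range(n_states)]
        back = []
        for word in summary_words[1:]:
            curr, ptrs = [], []
            for j in range(n_states):
                best_i = max(range(n_states), key=lambda i: prev[i] + transition(i, j))
                curr.append(prev[best_i] + transition(best_i, j) + emission(j, word))
                ptrs.append(best_i)
            prev, back = curr, back + [ptrs]
        state = max(range(n_states), key=lambda j: prev[j])
        path = [state]
        for ptrs in reversed(back):
            state = ptrs[state]
            path.append(state)
        return list(reversed(path))

    # Toy scoring: reward copying the same word and moving forward one position.
    doc = "the cat sat on the mat".split()
    summary = "cat sat mat".split()
    emission = lambda j, w: 0.0 if doc[j] == w else math.log(1e-3)
    transition = lambda i, j: 0.0 if j == i + 1 else math.log(1e-2)
    print(viterbi_decompose(summary, doc, transition, emission))  # -> [1, 2, 5]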

Journal ArticleDOI
TL;DR: This document-centered approach proved to be robust to domain shifts and new lexica, produced performance on a par with the highest reported results, and, when incorporated into a part-of-speech tagger, helped reduce the error rate significantly on capitalized words and sentence boundaries.
Abstract: In this article we present an approach for tackling three important aspects of text normalization: sentence boundary disambiguation, disambiguation of capitalized words in positions where capitalization is expected, and identification of abbreviations. As opposed to the two dominant techniques of computing statistics or writing specialized grammars, our document-centered approach works by considering suggestive local contexts and repetitions of individual words within a document. This approach proved to be robust to domain shifts and new lexica and produced performance on a par with the highest reported results. When incorporated into a part-of-speech tagger, it helped reduce the error rate significantly on capitalized words and sentence boundaries. We also investigated the portability to other languages and obtained encouraging results.
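
A toy illustration of the document-centered strategy for one of the three subtasks (capitalized words in positions where capitalization is expected): rather than consulting a specialized grammar, look at how the same word is written elsewhere in the document where its capitalization is not forced. The heuristic below is invented for illustration and is not the article's actual decision procedure.

    import re

    def is_proper_name(word, document_tokens):
        """Guess whether a sentence-initial capitalized `word` is a proper name,
        using only evidence from elsewhere in the same document (toy heuristic).
        Capitalized occurrences in non-initial positions suggest a proper name;
        lowercased occurrences anywhere suggest a common word."""
        lower_seen = any(t == word.lower() for t in document_tokens)
        capital_seen_mid = any(
            t == word.capitalize() and i > 0
            and document_tokens[i - 1] not in {".", "!", "?"}
            for i, t in enumerate(document_tokens)
        )
        if capital_seen_mid and not lower_seen:
            return True
        if lower_seen and not capital_seen_mid:
            return False
        return None  # no document-internal evidence; defer to other components

    doc = re.findall(r"\w+|[.!?]",
                     "Nixon resigned . The decision surprised Nixon aides . It was the end .")
    print(is_proper_name("Nixon", doc), is_proper_name("The", doc))  # -> True False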

Journal ArticleDOI
TL;DR: A book on the mathematical foundations of information retrieval, covering information retrieval models, a mathematical theory of information retrieval, relevance effectiveness, and further topics in the field.
Abstract: Acknowledgments. Preface. 1. Introduction. 2. Mathematics Handbook. 3. Information Retrieval Models. 4. Mathematical Theory of Information Retrieval. 5. Relevance Effectiveness in Information Retrieval. 6. Further Topics in Information Retrieval. Appendices. References. Index.

Journal ArticleDOI
Mark Johnson
TL;DR: This note observes that this estimation method is biased and inconsistent; that is, the estimated distribution does not in general converge on the true distribution as the size of the training corpus increases.
Abstract: A data-oriented parsing or DOP model for statistical parsing associates fragments of linguistic representations with numerical weights, where these weights are estimated by normalizing the empirical frequency of each fragment in a training corpus (see Bod [1998] and references cited therein). This note observes that this estimation method is biased and inconsistent; that is, the estimated distribution does not in general converge on the true distribution as the size of the training corpus increases.
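
For orientation, the estimator at issue (the "DOP1" model of Bod 1998, as the note describes it) weights each fragment f by its relative frequency among fragments with the same root label and scores a parse tree T by summing over its derivations; in the usual notation:

    w(f) = \frac{\mathrm{count}(f)}{\sum_{f' :\, \mathrm{root}(f') = \mathrm{root}(f)} \mathrm{count}(f')},
    \qquad
    P(T) = \sum_{d \in \mathrm{derivations}(T)} \; \prod_{f \in d} w(f)

It is this relative-frequency normalization that the note shows to be biased and inconsistent.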

Journal ArticleDOI
TL;DR: An edited volume on the role of the lexicon in sentence understanding, with chapters including a computational model of the grammatical aspects of word recognition as supertagging and the lexical source of unexpressed participants and their role in sentence and discourse understanding.
Abstract:
1. Preface
2. Words, numbers and all that: The lexicon in sentence understanding (by Stevenson, Suzanne)
3. The lexicon in Optimality Theory (by Bresnan, Joan)
4. Optimality-theoretic Lexical Functional Grammar (by Johnson, Mark)
5. The lexicon and the laundromat (by Fodor, Jerry)
6. Semantics in the spin cycle: Competence and performance criteria for the creation of lexical entries (by Weinberg, Amy)
7. Connectionist and symbolist sentence processing (by Steedman, Mark)
8. A computational model of the grammatical aspects of word recognition as supertagging (by Kim, Albert E.)
9. Incrementality and lexicalism: A treebank study (by Lombardo, Vincenzo)
10. Modular architectures and statistical mechanisms: The case from lexical category disambiguation (by Crocker, Matthew)
11. Encoding and storage in working memory during sentence comprehension (by Stowe, Laurie A.)
12. The time course of information integration in sentence processing (by Spivey, Michael J.)
13. The lexical source of unexpressed participants and their role in sentence and discourse understanding (by Mauner, Gail)
14. Reduced relatives judged hard require constraint-based analyses (by Filip, Hana)
15. Predicting thematic role assignments in context (by Altmann, Gerry T.M.)
16. Lexical semantics as a basis for argument structure frequency biases (by Argamann, Vera)
17. Verb sense and verb subcategorization probabilities (by Roland, Doug)
18. Author index
19. Item index

Journal ArticleDOI
TL;DR: Evidence that different people probably associate slightly different meanings with words in a language community is summarized, and its implications for natural language generation, especially for lexical choice (that is, choosing appropriate words for a generated text), are discussed.
Abstract: Much natural language processing research implicitly assumes that word meanings are fixed in a language community, but in fact there is good evidence that different people probably associate slightly different meanings with words. We summarize some evidence for this claim from the literature and from an ongoing research project, and discuss its implications for natural language generation, especially for lexical choice, that is, choosing appropriate words for a generated text.

Journal ArticleDOI
TL;DR: An aposynthetic model of discourse is proposed in which topic continuity, computed across units, and focusing preferences internal to these units are subject to different mechanisms; this model overcomes important problems in anaphora resolution and reconciles seemingly contradictory experimental results reported in the literature.
Abstract: The problem of proposing referents for anaphoric expressions has been extensively researched in the literature and significant insights have been gained through the various approaches. However, no single model is capable of handling all the cases. We argue that this is due to a failure of the models to identify two distinct processes. Drawing on current insights and empirical data from various languages we propose an aposynthetic model of discourse in which topic continuity, computed across units, and focusing preferences internal to these units are subject to different mechanisms. The observed focusing preferences across the units (i.e., intersententially) are best modeled structurally, along the lines suggested in centering theory. The focusing mechanism within the unit is subject to preferences projected by the semantics of the verbs and the connectives in the unit as suggested in semantic/pragmatic focusing accounts. We show that this distinction not only overcomes important problems in anaphora resolution but also reconciles seemingly contradictory experimental results reported in the literature. We specify a model of anaphora resolution that interleaves the two mechanisms. We test the central hypotheses of the proposed model with an experimental study in English and a corpus-based study in Greek.

Journal ArticleDOI
TL;DR: This article describes a simple and equally efficient method for modifying any minimal finite-state automaton (be it acyclic or not) so that a string is added to or removed from the language it accepts; both operations are very important when dictionary maintenance is performed.
Abstract: Daciuk et al. [Computational Linguistics 26(1):3-16 (2000)] describe a method for constructing incrementally minimal, deterministic, acyclic finite-state automata (dictionaries) from sets of strings. But acyclic finite-state automata have limitations: For instance, if one wants a linguistic application to accept all possible integer numbers or Internet addresses, the corresponding finite-state automaton has to be cyclic. In this article, we describe a simple and equally efficient method for modifying any minimal finite-state automaton (be it acyclic or not) so that a string is added to or removed from the language it accepts; both operations are very important when dictionary maintenance is performed and solve the dictionary construction problem addressed by Daciuk et al. as a special case. The algorithms proposed here may be straightforwardly derived from the customary textbook constructions for the intersection and the complementation of finite-state automata; the algorithms exploit the special properties of the automata resulting from the intersection operation when one of the finite-state automata accepts a single string.
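
In set-theoretic terms (the notation here is ours, not the article's), adding or removing a string w amounts to a Boolean combination of L(A) with the single-string language {w}, which is why the customary intersection and complementation constructions suffice:

    L(A_{\mathrm{add}}) = L(A) \cup \{w\} = \overline{\,\overline{L(A)} \cap \overline{\{w\}}\,},
    \qquad
    L(A_{\mathrm{remove}}) = L(A) \setminus \{w\} = L(A) \cap \overline{\{w\}}

The efficiency reported in the article comes from the special structure of the product automaton when one operand accepts just the single string w.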

Journal Article
TL;DR: This book outlines a computational approach to morphology that explicitly includes languages from the Semitic family, in particular Arabic and Syriac, where the linearity hypothesis—every word can be built via string concatenation of its component morphemes—seems to break down.
Abstract: Computational morphology would be an almost trivial exercise if every language were like English. Here, chopp-ing off the occasion-al affix-es, of which there are not too many, is sufficient to isolate the stem, perhaps modulo a few (morpho)graphemic rules to handle phenomena like the consonant doubling we just saw in chopping. This relative ease with which one can identify the core meaning component of a word explains the success of rather simple stemming algorithms for English or the way in which most part-of-speech (POS) taggers get away with just examining bounded initial and final substrings of unknown words for guessing their parts of speech. In contrast, this book outlines a computational approach to morphology that explicitly includes languages from the Semitic family, in particular Arabic and Syriac, where the linearity hypothesis—every word can be built via string concatenation of its component morphemes—seems to break down (we will take up the validity of that assumption below). Example 1 illustrates the problem at hand with Syriac verb forms of the root {q1t.2l3} ‘notion of killing’ (from Kiraz [1996]).

Journal ArticleDOI
TL;DR: A morphosyntactic framework based on Combinatory Categorial Grammar is proposed that provides flexible constituency, flexible category consistency, and lexical projection of morphosyntactic properties and attachment to grammar, in order to establish a morphemic grammar-lexicon.
Abstract: Grammars that expect words from the lexicon may be at odds with the transparent projection of syntactic and semantic scope relations of smaller units. We propose a morphosyntactic framework based on Combinatory Categorial Grammar that provides flexible constituency, flexible category consistency, and lexical projection of morphosyntactic properties and attachment to grammar in order to establish a morphemic grammar-lexicon. These mechanisms provide enough expressive power in the lexicon to formulate semantically transparent specifications without the necessity to confine structure forming to words and phrases. For instance, bound morphemes as lexical items can have phrasal scope or word scope, independent of their attachment characteristics but consistent with their semantics. The controls can be attuned in the lexicon to language-particular properties. The result is a transparent interface of inflectional morphology, syntax, and semantics. We present a computational system and show the application of the framework to English and Turkish.

Journal ArticleDOI
TL;DR: A syllable-pattern-based generalized unknown-morpheme-estimation method is presented for POSTAG (POStech TAGger), a statistical and rule-based hybrid POS tagging system; the method can guess the POS tags of unknown morphemes regardless of their number and/or position in an eojeol.
Abstract: Most errors in Korean morphological analysis and part-of-speech (POS) tagging are caused by unknown morphemes. This paper presents a syllable-pattern-based generalized unknown-morpheme-estimation method with POSTAG (POStech TAGger), which is a statistical and rule-based hybrid POS tagging system. This method of guessing unknown morphemes is based on a combination of a morpheme pattern dictionary that encodes general lexical patterns of Korean morphemes with a posteriori syllable trigram estimation. The syllable trigrams help to calculate lexical probabilities of the unknown morphemes and are utilized to search for the best tagging result. This method can guess the POS tags of unknown morphemes regardless of their numbers and/or positions in an eojeol (a Korean spacing unit similar to an English word), which is not possible with other systems for tagging Korean. In a series of experiments using three different domain corpora, the system achieved a 97% tagging accuracy even though 10% of the morphemes in the test corpora were unknown. It also achieved very high coverage and accuracy of estimation for all classes of unknown morphemes.
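
The syllable trigram component can be sketched generically: estimate trigram counts over the syllables of known morphemes and score a candidate unknown morpheme by its smoothed trigram log-probability. The romanized toy data, padding symbols, and add-alpha smoothing below are placeholders, not the POSTAG model itself.

    from collections import Counter
    import math

    def train_syllable_trigrams(lexicon):
        """Count syllable trigrams and their bigram contexts over a lexicon of
        known morphemes, each given as a list of syllables (sketch)."""
        tri, bi, vocab = Counter(), Counter(), set()
        for syllables in lexicon:
            padded = ["<s>", "<s>"] + list(syllables) + ["</s>"]
            vocab.update(padded)
            tri.update(zip(padded, padded[1:], padded[2:]))
            bi.update(zip(padded, padded[1:]))
        return tri, bi, len(vocab)

    def unknown_morpheme_logprob(syllables, tri, bi, vocab_size, alpha=1.0):
        """Add-alpha smoothed trigram log-probability of an unseen morpheme's
        syllable sequence; a stand-in for the a posteriori estimation above."""
        padded = ["<s>", "<s>"] + list(syllables) + ["</s>"]
        logp = 0.0
        for s1, s2, s3 in zip(padded, padded[1:], padded[2:]):
            logp += math.log((tri[(s1, s2, s3)] + alpha) /
                             (bi[(s1, s2)] + alpha * vocab_size))
        return logp

    # Toy romanized syllable data (placeholders, not real Korean lexicon entries).
    tri, bi, v = train_syllable_trigrams([["ha", "da"], ["ha", "ni"], ["po", "ha", "da"]])
    print(unknown_morpheme_logprob(["ha", "da"], tri, bi, v) >
          unknown_morpheme_logprob(["zz", "qq"], tri, bi, v))  # -> True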




Journal ArticleDOI
TL;DR: An edited volume on spatial language, with chapters including a conceptual model for representing verbal expressions used in route descriptions and goal-directed effects on processing a spatial environment.
Abstract:
Preface (K.R. Coventry, P. Olivier)
Contributors
1. Reasoning about Shape using the Tangential Axis Transform or the Shape's 'Grain' (G. Edwards)
2. A Conceptual Model for Representing Verbal Expressions used in Route Descriptions (A. Gryl, et al.)
3. Resolving Ambiguous Descriptions through Visual Information (I. Duwe, et al.)
4. An Anthropomorphic Agent for the Use of Spatial Language (T. Joerding, I. Wachsmuth)
5. Gesture, Thought, and Spatial Language (K. Emmorey, S. Casey)
6. Organization of Temporal Situations (N. Franklin, T. Federico)
7. Grounding Meaning in Visual Knowledge. A Case Study: Dimensional Adjectives (A. Goy)
8. Understanding How We Think about Space (C. Manning, et al.)
9. The Real Story of 'Over'? (K.R. Coventry, G. Mather)
10. Generating Spatial Descriptions from a Cognitive Point of View (R. Porzel, et al.)
11. Multiple Frames of Reference in Interpreting Complex Projective Terms (C. Eschenbach, et al.)
12. Goal-Directed Effects on Processing a Spatial Environment: Indications from Memory and Language (H.A. Taylor, S.J. Naylor)
13. Memory for Text and Memory for Space: Two Concurrent Memory Systems? (M. Wagener-Wender)
Author Index
Subject Index

Journal Article
TL;DR: The aim when compiling this volume has been to hear from those who participated directly in the earliest years of mechanical translation, or "machine translation" as it is now commonly known, and, in the case of those major figures already deceased, to obtain memories and assessments from people who knew them well.
Abstract: The aim when compiling this volume has been to hear from those who participated directly in the earliest years of mechanical translation, or ‘machine translation’ (MT) as it is now commonly known, and, in the case of those major figures already deceased, to obtain memories and assessments from people who knew them well. Naturally, it has not been possible to cover every one of the pioneers of machine translation, but the principal researchers of the United States, the Soviet Union, and Europe (East and West) are represented here. (page vii)

Journal ArticleDOI
TL;DR: This work has constructed a type signature for an existing broad-coverage grammar of English and implemented a type inference algorithm that operates on the feature structure specifications in the grammar and reports incompatibilities with the signature.
Abstract: Feature structures are used to convey linguistic information in a variety of linguistic formalisms. Various definitions of feature structures exist; one dimension of variation is typing: unlike untyped feature structures, typed ones associate a type with every structure and impose appropriateness constraints on the occurrences of features and on the values that they take. This work demonstrates the benefits that typing can carry even for linguistic formalisms that use untyped feature structures. We present a method for validating the consistency of (untyped) feature structure specifications by imposing a type discipline. This method facilitates a great number of compile-time checks: many possible errors can be detected before the grammar is used for parsing. We have constructed a type signature for an existing broad-coverage grammar of English and implemented a type inference algorithm that operates on the feature structure specifications in the grammar and reports incompatibilities with the signature. We have detected a large number of errors in the grammar, some of which are described in the article.
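
The kind of compile-time check described can be shown at toy scale: a signature lists, for each type, the appropriate features and the required type of each value, and a checker walks an (untyped) feature structure specification and reports incompatibilities. The signature, types, and structures below are invented, not the broad-coverage grammar or signature from the article.

    # Toy appropriateness checking against a type signature (invented example).
    SIGNATURE = {
        # type: {appropriate feature: required type of its value}
        "sign": {"CAT": "cat", "AGR": "agr"},
        "cat":  {},
        "agr":  {"NUM": "num", "PER": "per"},
        "num":  {},
        "per":  {},
    }

    def check(fs, assumed_type, path="fs"):
        """Report features that are not appropriate for `assumed_type`, or whose
        values have the wrong type, recursing into embedded structures.
        `fs` is a nested dict of feature -> (type, substructure)."""
        errors = []
        appropriate = SIGNATURE.get(assumed_type, {})
        for feature, (value_type, substructure) in fs.items():
            if feature not in appropriate:
                errors.append(f"{path}: {feature} not appropriate for {assumed_type}")
            elif value_type != appropriate[feature]:
                errors.append(f"{path}|{feature}: expected {appropriate[feature]}, got {value_type}")
            errors += check(substructure, value_type, f"{path}|{feature}")
        return errors

    # One deliberate error: GEN is not an appropriate feature for agr here.
    spec = {"CAT": ("cat", {}),
            "AGR": ("agr", {"NUM": ("num", {}), "GEN": ("num", {})})}
    print("\n".join(check(spec, "sign")) or "consistent")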

Journal ArticleDOI
Udo Hahn
TL;DR: The monograph proposes a formal specification of admissible text structures that constrains the range of possible semantic and functional connections between text spans and imposes strict well-formedness conditions on valid discourse structures.
Abstract: Marcu’s monograph is based on his Ph.D. thesis—research carried out at the Department of Computer Science, University of Toronto—and subsequent work conducted at the Information Sciences Institute, University of Southern California. It argues for the idea that discourse/rhetorical relations that connect text spans of various length can be computed without a complete semantic analysis of sentences that make up these text segments. As an alternative, a formal specification of admissible text structures is provided, which constrains the range of possible semantic and functional connections between text spans and imposes strict well-formedness conditions on valid discourse structures. For effectively computing these text structures, mainly surface-oriented lexical cues and shallow text-parsing techniques are used. Complementary to these formal and computational considerations, Marcu reports on various evaluations, both intrinsic and extrinsic, in order to assess the strengths and weaknesses of his approach and the generality of the principles it is based on. These experiments were mostly carried out on Scientific American, TREC, MUC, Wall Street Journal, and Brown corpora. The book consists of three main parts. In the first part, linguistic and formal properties of coherent texts are discussed, with a focus on high-level discourse structures. This theoretical framework serves, in the second part, as the background for developing discourse structure parsing algorithms that compute rhetorical relations in realworld free texts. The benefits of such algorithms for building a high-performance text summarization system are dealt with in the third part. In the first part, the author factors out a set of assumptions that are common to prominent approaches to discourse structure. So, consensus has been reached that texts can be segmented into nonoverlapping, elementary textual units, that discourse relations of different types link (elementary and complex) textual units of various sizes, that some textual units are more important to the writer’s communicative intentions and goals than others, and that trees are a good approximation of the abstract structure of most texts. These considerations lead to a compositionality criterion that requires that discourse relations that link two large text spans can be explained by discourse relations that hold between at least two of the most salient text units of the constituent spans. This notion then forms the basis for a first-order logic axiomatization that captures formal properties of valid text structures. Although this formalization is independent of the set of rhetorical relations actually considered, it yields, by proper relation instantiation, a formal characterization of the structural properties that are specific to Rhetorical Structure Theory (RST) (Mann and Thompson 1988). Building on these formal considerations, the author discusses three (nonincremental) algorithmic paradigms that compute some or all valid discourse structures of a text. Two of them employ model-theoretic techniques and encode the problem of text-structure