
Showing papers in "Research on Language and Computation in 2009"


Journal ArticleDOI
TL;DR: This paper compares a variety of pattern models, including previously reported ones and variations of them, and finds that the best performance is obtained by two models which use the majority of relevant portions of the dependency tree without including irrelevant sections.
Abstract: Several techniques for the automatic acquisition of Information Extraction (IE) systems have used dependency trees to form the basis of an extraction pattern representation. These approaches have used a variety of pattern models (schemes for representing IE patterns based on particular parts of the dependency analysis). An appropriate pattern model should be expressive enough to represent the information which is to be extracted from text without being overly complex. Previous investigations into the appropriateness of the currently proposed models have been limited. This paper compares a variety of pattern models, including previously reported ones and variations of them. Each model is evaluated using existing data consisting of IE scenarios from two very different domains (newswire stories and biomedical journal articles). The models are analysed in terms of their ability to represent relevant information, the number of patterns generated, and performance on an IE scenario. The best performance was obtained by two models which use the majority of relevant portions of the dependency tree without including irrelevant sections.
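As a rough illustration of the general idea behind dependency-based extraction patterns (not a reproduction of any of the specific pattern models compared in the paper), the sketch below pulls a subject-verb-object style pattern out of a toy dependency parse; the tree encoding and relation labels are assumptions made for this example only.

    # Illustrative sketch: extracting a subject-verb-object style pattern
    # from a toy dependency parse. The encoding and relation labels are
    # assumptions made for this example, not the paper's pattern models.

    def extract_svo_patterns(tokens, heads, deprels):
        """tokens: words; heads: index of each word's head (-1 for root);
        deprels: dependency relation of each word to its head."""
        patterns = []
        for i, rel in enumerate(deprels):
            if rel == "root":  # treat the root as the predicate
                verb = tokens[i]
                subjects = [tokens[j] for j, (h, r) in enumerate(zip(heads, deprels))
                            if h == i and r == "nsubj"]
                objects = [tokens[j] for j, (h, r) in enumerate(zip(heads, deprels))
                           if h == i and r == "obj"]
                for s in subjects:
                    for o in objects:
                        patterns.append((s, verb, o))
        return patterns

    # Example: "Acme acquired Widgets"
    print(extract_svo_patterns(["Acme", "acquired", "Widgets"],
                               [1, -1, 1],
                               ["nsubj", "root", "obj"]))
    # -> [('Acme', 'acquired', 'Widgets')]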

32 citations


Journal ArticleDOI
TL;DR: This work investigates the problem of computing the partition function of a probabilistic context-free grammar and considers a number of applicable methods, with particular attention to PCFGs that result from the intersection of another PCFG and a finite automaton.
Abstract: We investigate the problem of computing the partition function of a probabilistic context-free grammar, and consider a number of applicable methods. Particular attention is devoted to PCFGs that result from the intersection of another PCFG and a finite automaton. We report experiments involving the Wall Street Journal corpus.
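A standard way to approximate such a partition function, and presumably related to the methods considered here, is fixed-point iteration on the equations Z(A) = sum over rules A -> alpha of p(A -> alpha) times the product of Z(B) for the nonterminals B in alpha, starting from zero. A minimal sketch, with a toy grammar invented for the example:

    # Fixed-point iteration for the partition function of a PCFG.
    # The grammar encoding and the toy grammar are assumptions for this sketch.
    # Rules: nonterminal -> list of (probability, [rhs symbols]); terminals are
    # any symbols that do not appear as keys of the grammar.

    def partition_function(grammar, iterations=200):
        Z = {A: 0.0 for A in grammar}            # start from the zero vector
        for _ in range(iterations):
            for A, rules in grammar.items():
                total = 0.0
                for p, rhs in rules:
                    prod = p
                    for sym in rhs:
                        prod *= Z.get(sym, 1.0)  # terminals contribute 1
                    total += prod
                Z[A] = total
        return Z

    # Toy grammar where S is improper: it loses probability mass to
    # infinite derivations, so Z[S] < 1.
    toy = {"S": [(0.6, ["S", "S"]), (0.4, ["a"])]}
    print(partition_function(toy))   # Z[S] converges towards 2/3, the least fixed point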

31 citations


Journal ArticleDOI
TL;DR: This article investigates the challenges that need to be met in both methodology and evaluation when moving towards the acquisition of more comprehensive conceptual representations from corpora, and examines the usefulness of three types of knowledge in guiding the extraction process: encyclopedic, syntactic and semantic.
Abstract: In recent years a number of methods have been proposed for the automatic acquisition of feature-based conceptual representations from text corpora. Such methods could offer valuable support for theoretical research on conceptual representation. However, existing methods do not target the full range of concept-relation-feature triples occurring in human-generated norms (e.g. flute produce sound) but rather focus on concept-feature pairs (e.g. flute --- sound) or triples involving specific relations only (e.g. is-a or part-of relations). In this article we investigate the challenges that need to be met in both methodology and evaluation when moving towards the acquisition of more comprehensive conceptual representations from corpora. In particular, we investigate the usefulness of three types of knowledge in guiding the extraction process: encyclopedic, syntactic and semantic. We first present a semantic analysis of existing, human-generated feature production norms, which reveals information about co-occurring concept and feature classes. We then introduce a novel method for large-scale feature extraction which uses this class-based information to guide the acquisition process. The method involves extracting candidate triples consisting of concepts, relations and features (e.g. deer have antlers, flute produce sound) from corpus data parsed for grammatical dependencies, and re-weighting the triples on the basis of conditional probabilities calculated from our semantic analysis. We apply this method to an automatically parsed Wikipedia corpus which includes encyclopedic information and evaluate its accuracy using a number of different methods: direct evaluation against the McRae norms in terms of feature types and frequencies, human evaluation, and a novel evaluation in terms of conceptual structure variables. Our investigation highlights a number of issues which need to be addressed in both methodology and evaluation in order to further improve the accuracy of unconstrained feature extraction.
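The re-weighting step described above can be pictured roughly as follows: candidate triples are re-scored by how plausible the feature's semantic class is for the concept's class, using conditional probabilities of the kind derived from the norm analysis. The class inventory, probabilities and counts below are invented for the sketch and are not the paper's actual resources.

    # Illustrative re-weighting of candidate (concept, relation, feature) triples.
    # Class labels, probabilities and counts are invented for this example.

    # P(feature class | concept class), e.g. estimated from feature norms
    p_featclass_given_conceptclass = {
        ("animal", "body_part"):     0.30,
        ("animal", "sound"):         0.10,
        ("instrument", "sound"):     0.40,
        ("instrument", "body_part"): 0.01,
    }

    concept_class = {"deer": "animal", "flute": "instrument"}
    feature_class = {"antlers": "body_part", "sound": "sound"}

    def reweight(triple, corpus_count):
        concept, relation, feature = triple
        prior = p_featclass_given_conceptclass.get(
            (concept_class[concept], feature_class[feature]), 1e-6)
        return corpus_count * prior   # simple product score, for illustration only

    candidates = [(("deer", "have", "antlers"), 12),
                  (("flute", "produce", "sound"), 30),
                  (("deer", "produce", "sound"), 8)]

    for triple, count in sorted(candidates, key=lambda c: -reweight(*c)):
        print(triple, round(reweight(triple, count), 2))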

31 citations


Journal ArticleDOI
Chris Biemann1
TL;DR: A method is presented that constructs a statistical syntactic tagger model from a large amount of unlabeled text data; its application results in corpus annotation comparable to what POS-taggers provide, but with slightly different categories than those assumed by a linguistically motivated POS-tagger.
Abstract: Syntactic preprocessing is a step that is widely used in NLP applications. Traditionally, rule-based or statistical Part-of-Speech (POS) taggers are employed that either need considerable rule development time or a sufficient amount of manually labeled data. To alleviate this acquisition bottleneck and to enable preprocessing for minority languages and specialized domains, a method is presented that constructs a statistical syntactic tagger model from a large amount of unlabeled text data. The method presented here is called unsupervised POS-tagging, as its application results in corpus annotation comparable to what POS-taggers provide, although with slightly different categories than those assumed by a linguistically motivated POS-tagger. These differences hamper evaluation procedures that compare the output of the unsupervised POS-tagger to a tagging produced by a supervised tagger. To measure the extent to which unsupervised POS-tagging can contribute in application-based settings, the system is evaluated in supervised POS-tagging, word sense disambiguation, named entity recognition and chunking. Unsupervised POS-tagging has been explored since the beginning of the 1990s. Unlike in previous approaches, the kind and number of different tags is here generated by the method itself. Another difference from other methods is that not all words above a certain frequency rank are assigned a tag; the method is allowed to exclude words from the clustering if their distribution does not match closely enough with that of other words. The lexicon size is considerably larger than in previous approaches, resulting in a lower out-of-vocabulary (OOV) rate and in a more consistent tagging. The system presented here is available for download as open-source software along with tagger models for several languages, so the contributions of this work can be easily incorporated into other applications.
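The underlying intuition, grouping words into induced categories according to the contexts they occur in and then using the cluster labels as tags, can be sketched as below. The sketch uses k-means over neighbour-count vectors purely for illustration; the system described here uses its own clustering and determines the number of categories itself, which k-means does not.

    # Toy sketch of distributional word clustering as a stand-in for
    # unsupervised tag induction. Uses k-means with a fixed k, unlike the
    # described method, which induces the number of categories itself.
    from collections import defaultdict
    import numpy as np
    from sklearn.cluster import KMeans

    corpus = "the cat sat on the mat the dog sat on the rug a cat saw a dog".split()

    vocab = sorted(set(corpus))
    idx = {w: i for i, w in enumerate(vocab)}

    # Context vectors: counts of left and right neighbours for each word.
    vectors = defaultdict(lambda: np.zeros(2 * len(vocab)))
    for i, w in enumerate(corpus):
        if i > 0:
            vectors[w][idx[corpus[i - 1]]] += 1
        if i < len(corpus) - 1:
            vectors[w][len(vocab) + idx[corpus[i + 1]]] += 1

    words = list(vectors)
    X = np.array([vectors[w] for w in words])
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

    for w, tag in zip(words, labels):
        print(w, "-> CLUSTER_%d" % tag)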

26 citations


Journal ArticleDOI
TL;DR: This paper proposes an alternative definition of MCTAG that characterizes the trees in the tree language of an MCTAG via the properties of the derivation trees (in the underlying TAG) the MCTAG licences, and provides similar characterizations for various types of MCTAG.
Abstract: Multicomponent Tree Adjoining Grammars (MCTAGs) are a formalism that has been shown to be useful for many natural language applications. The definition of non-local MCTAG, however, is problematic since it refers to the process of the derivation itself: a simultaneity constraint must be respected concerning the way the members of the elementary tree sets are added. Looking only at the result of a derivation (i.e., the derived tree and the derivation tree), this simultaneity is no longer visible and therefore cannot be checked; in other words, this way of characterizing MCTAG does not allow one to abstract away from the concrete order of derivation. In this paper, we propose an alternative definition of MCTAG that characterizes the trees in the tree language of an MCTAG via the properties of the derivation trees (in the underlying TAG) the MCTAG licences. We provide similar characterizations for various types of MCTAG. These characterizations give a better understanding of the formalisms, allow a more systematic comparison of different types of MCTAG, and, furthermore, can be exploited for parsing.

20 citations


Journal ArticleDOI
TL;DR: It is shown in this article how an approach developed for the task of recognizing textual entailment relations can be extended to identify paraphrase and elaboration relations, and how the extended approaches offer significantly better results than several baselines.
Abstract: We show in this article how an approach developed for the task of recognizing textual entailment relations can be extended to identify paraphrase and elaboration relations. Entailment is a unidirectional relation between two sentences in which one sentence logically entails the other. There seems to be a close relation between entailment and two other sentence-to-sentence relations: elaboration and paraphrase. This close relation is discussed to theoretically justify the newly derived approaches. The proposed approaches use lexical and syntactic information as well as shallow negation handling. They offer significantly better results than several baselines, and when compared to other paraphrase and elaboration approaches they produce similar or better results. We report results on two data sets: the Microsoft Research Paraphrase corpus, a benchmark for evaluating approaches to paraphrase identification, and a data set collected from high-school students' interactions with the intelligent tutoring system iSTART, which includes both paraphrase and elaboration utterances.
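For comparison, a minimal lexical-overlap baseline with a crude negation check, of the kind such systems are typically measured against (not the authors' actual approach), could look like this; the threshold and negation list are assumptions made for the sketch.

    # Minimal lexical-overlap paraphrase baseline with shallow negation handling.
    # The threshold and the negation word list are assumptions for this sketch.

    NEGATIONS = {"not", "no", "never", "n't", "without"}

    def tokens(sentence):
        return set(sentence.lower().split())

    def is_paraphrase(s1, s2, threshold=0.6):
        t1, t2 = tokens(s1), tokens(s2)
        overlap = len(t1 & t2) / max(len(t1 | t2), 1)      # Jaccard overlap
        # Shallow negation handling: a negation mismatch blocks the decision.
        if bool(t1 & NEGATIONS) != bool(t2 & NEGATIONS):
            return False
        return overlap >= threshold

    print(is_paraphrase("The cat is on the mat", "The cat sits on the mat"))   # True
    print(is_paraphrase("The cat is on the mat", "The cat is not on the mat")) # False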

13 citations


Journal ArticleDOI
TL;DR: The merge operator on multisets and a ‘min’ operation expressed in terms of harmonic inequality provide a semiring over violation profiles that allows standard optimization algorithms to be used for OT grammars with weighted finite-state constraints in which the weights are violation-multisets.
Abstract: This paper provides a brief algebraic characterization of constraint violations in Optimality Theory (OT). I show that if violations are taken to be multisets over a fixed basis set Con then the merge operator on multisets and a ‘min’ operation expressed in terms of harmonic inequality provide a semiring over violation profiles. This semiring allows standard optimization algorithms to be used for OT grammars with weighted finite-state constraints in which the weights are violation-multisets. Most usefully, because multisets are unordered, the merge operation is commutative and thus it is possible to give a single graph representation of the entire class of grammars (i.e. rankings) for a given constraint set. This allows a neat factorization of the optimization problem that isolates the main source of complexity into a single constant γ denoting the size of the graph representation of the whole constraint set. I show that the computational cost of optimization is linear in the length of the underlying form with the multiplicative constant γ. This perspective thus makes it straightforward to evaluate the complexity of optimization for different constraint sets.
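The two operations can be made concrete as follows: merge is multiset sum (combining the violations incurred along a path) and 'min' picks the harmonically better profile under a given ranking of Con. The encoding below (a Counter for multisets, a ranking as an ordered list, and the example constraints) is an assumption made to illustrate the operations, not the paper's formal construction.

    # Illustrative implementations of the two operations over violation
    # multisets: merge (multiset sum) and harmonic 'min' relative to a
    # constraint ranking. Encodings and constraints are assumptions.
    from collections import Counter

    def merge(v1, v2):
        """Multiset sum of two violation profiles."""
        return v1 + v2                     # Counter addition is multiset sum

    def harmonic_min(v1, v2, ranking):
        """Return the profile that is harmonically better under `ranking`
        (constraints listed from highest- to lowest-ranked)."""
        for constraint in ranking:
            if v1[constraint] != v2[constraint]:
                return v1 if v1[constraint] < v2[constraint] else v2
        return v1                          # equal profiles: either one

    ranking = ["Max", "Dep", "NoCoda"]     # a hypothetical ranking of Con
    a = Counter({"NoCoda": 2})
    b = Counter({"Dep": 1})
    print(merge(a, b))                     # Counter({'NoCoda': 2, 'Dep': 1})
    print(harmonic_min(a, b, ranking))     # a wins: fewer Dep violations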

5 citations


Journal ArticleDOI
TL;DR: First-order transitive closure logic is proposed, which is capable of defining non-regular tree languages that are of interest to linguistics and is more expressive in defining classes of tree languages than monadic second-order logic.
Abstract: Model theoretic syntax is concerned with studying the descriptive complexity of grammar formalisms for natural languages by defining their derivation trees in suitable logical formalisms. The central tool for model theoretic syntax has been monadic second-order logic (MSO). Much of the recent research in this area has been concerned with finding more expressive logics to capture the derivation trees of grammar formalisms that generate non-context-free languages. The motivation behind this search for more expressive logics is to describe formally certain mildly context-sensitive phenomena of natural languages. Several extensions to MSO have been proposed, most of which no longer define the derivation trees of grammar formalisms directly, while others introduce logically odd restrictions. We therefore propose to consider first-order transitive closure logic. In this logic, derivation trees can be defined in a direct way. Our main result is that transitive closure logic, even deterministic transitive closure logic, is more expressive in defining classes of tree languages than MSO. (Deterministic) transitive closure logics are capable of defining non-regular tree languages that are of interest to linguistics.
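As a small illustration of the kind of definition first-order transitive closure logic supports (not a formula taken from the paper), proper dominance in a tree can be written as the transitive closure of the immediate-dominance (child) relation:

    \mathrm{dom}(u,v) \;\equiv\; \bigl[\mathrm{TC}_{x,y}\,\mathrm{child}(x,y)\bigr](u,v)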

4 citations