
Showing papers in "Computational Linguistics in 2014"


Journal ArticleDOI
Tobias Kuhn1
TL;DR: A comprehensive survey of existing English-based controlled natural languages (CNLs) can be found in this article, where the authors provide a common terminology and a common model for CNL, contribute to the understanding of their general nature, provide a starting point for researchers interested in the area, and help developers to make design decisions.
Abstract: What is here called controlled natural language (CNL) has traditionally been given many different names. Especially during the last four decades, a wide variety of such languages have been designed. They are applied to improve communication among humans, to improve translation, or to provide natural and intuitive representations for formal notations. Despite the apparent differences, it seems sensible to put all these languages under the same umbrella. To bring order to the variety of languages, a general classification scheme is presented here. A comprehensive survey of existing English-based CNLs is given, listing and describing 100 languages from 1930 until today. Classification of these languages reveals that they form a single scattered cloud filling the conceptual space between natural languages such as English on the one end and formal languages such as propositional logic on the other. The goal of this article is to provide a common terminology and a common model for CNL, to contribute to the understanding of their general nature, to provide a starting point for researchers interested in the area, and to help developers to make design decisions.

308 citations


Journal ArticleDOI
TL;DR: This article describes the creation of a novel Arabic resource with dialect annotations, and uses the data to train and evaluate automatic classifiers for dialect identification, and establishes that classifiers using dialectal data significantly and dramatically outperform baselines that use MSA-only data, achieving near-human classification accuracy.
Abstract: The written form of the Arabic language, Modern Standard Arabic (MSA), differs in a non-trivial manner from the various spoken regional dialects of Arabic, the "true native language" of Arabic speakers. Those dialects, in turn, differ quite a bit from each other. However, due to MSA's prevalence in written form, almost all Arabic data sets have predominantly MSA content. In this article, we describe the creation of a novel Arabic resource with dialect annotations. We have created a large monolingual data set rich in dialectal Arabic content called the Arabic On-line Commentary Data set (Zaidan and Callison-Burch 2011). We describe our annotation effort to identify the dialect level and dialect itself in each of more than 100,000 sentences from the data set by crowdsourcing the annotation task, and delve into interesting annotator behaviors like over-identification of one's own dialect. Using this new annotated data set, we consider the task of Arabic dialect identification: Given the word sequence forming an Arabic sentence, determine the variety of Arabic in which it is written. We use the data to train and evaluate automatic classifiers for dialect identification, and establish that classifiers using dialectal data significantly and dramatically outperform baselines that use MSA-only data, achieving near-human classification accuracy. Finally, we apply our classifiers to discover dialectal data from a large Web crawl consisting of 3.5 million pages mined from on-line Arabic newspapers.
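The article's classifiers are trained on the dialect-annotated data described above; as a rough, hedged illustration of the basic task setup (not the authors' actual models or features), a sentence-level dialect classifier over character n-grams might be sketched as follows, with invented placeholder sentences and dialect labels standing in for the annotated corpus.

```python
# Illustrative sketch of sentence-level dialect identification (not the
# authors' system or features): a character n-gram Naive Bayes classifier
# trained on dialect-labeled sentences. The tiny inline corpus and the
# dialect tags are made-up placeholders standing in for the annotated
# Arabic On-line Commentary data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_sentences = ["... dialectal sentence one ...",
                   "... MSA sentence one ...",
                   "... dialectal sentence two ...",
                   "... MSA sentence two ..."]
train_labels = ["LEV", "MSA", "EGY", "MSA"]       # hypothetical dialect tags

model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),  # char n-grams
    MultinomialNB(),
)
model.fit(train_sentences, train_labels)
print(model.predict(["... a new sentence to classify ..."]))
```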

264 citations


Journal ArticleDOI
TL;DR: This article presents a WSD algorithm based on random walks over large Lexical Knowledge Bases (LKB) that performs better than other graph-based methods when run on a graph built from WordNet and eXtended WordNet.
Abstract: Word Sense Disambiguation (WSD) systems automatically choose the intended meaning of a word in context. In this article we present a WSD algorithm based on random walks over large Lexical Knowledge Bases (LKB). We show that our algorithm performs better than other graph-based methods when run on a graph built from WordNet and eXtended WordNet. Our algorithm and LKB combination compares favorably to other knowledge-based approaches in the literature that use similar knowledge on a variety of English data sets and a data set on Spanish. We include a detailed analysis of the factors that affect the algorithm. The algorithm and the LKBs used are publicly available, and the results easily reproducible.
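Random-walk WSD over a lexical knowledge base is commonly realized as Personalized PageRank on the sense graph. The minimal sketch below shows that computation on a tiny invented graph, with the personalization mass placed on context senses; it illustrates the general technique, not the authors' released implementation.

```python
# Minimal Personalized PageRank sketch over a toy sense graph (numpy only).
# Node names and edges are invented; in the article's setting the graph is
# built from WordNet/eXtended WordNet and the personalization mass comes
# from the context of the target word.
import numpy as np

nodes = ["bank#1", "bank#2", "money#1", "river#1", "deposit#1"]
edges = [(0, 4), (1, 3), (2, 0), (2, 4)]      # undirected toy relations

n = len(nodes)
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
P = A / A.sum(axis=0, keepdims=True)          # column-stochastic transitions

v = np.zeros(n)
v[[2, 4]] = 0.5                               # personalization: context senses
alpha, r = 0.85, np.full(n, 1.0 / n)
for _ in range(100):                          # power iteration
    r = alpha * P @ r + (1 - alpha) * v

best = max([0, 1], key=lambda i: r[i])        # candidate senses of "bank"
print("chosen sense:", nodes[best], r[[0, 1]])
```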

263 citations


Journal ArticleDOI
TL;DR: A two-stage statistical model that takes lexical targets in their sentential contexts and predicts frame-semantic structures, producing qualitatively better structures than naïve local predictors and outperforming the prior state of the art by significant margins.
Abstract: Frame semantics is a linguistic theory that has been instantiated for English in the FrameNet lexicon. We solve the problem of frame-semantic parsing using a two-stage statistical model that takes lexical targets (i.e., content words and phrases) in their sentential contexts and predicts frame-semantic structures. Given a target in context, the first stage disambiguates it to a semantic frame. This model uses latent variables and semi-supervised learning to improve frame disambiguation for targets unseen at training time. The second stage finds the target's locally expressed semantic arguments. At inference time, a fast exact dual decomposition algorithm collectively predicts all the arguments of a frame at once in order to respect declaratively stated linguistic constraints, resulting in qualitatively better structures than naïve local predictors. Both components are feature-based and discriminatively trained on a small set of annotated frame-semantic parses. On the SemEval 2007 benchmark data set, the approach, along with a heuristic identifier of frame-evoking targets, outperforms the prior state of the art by significant margins. Additionally, we present experiments on the much larger FrameNet 1.5 data set. We have released our frame-semantic parser as open-source software.

257 citations


Journal ArticleDOI
TL;DR: The importance of the NER task is demonstrated, the main characteristics of the Arabic language are highlighted, and the aspects of standardization in annotating named entities are illustrated.
Abstract: As more and more Arabic textual information becomes available through the Web in homes and businesses, via Internet and Intranet services, there is an urgent need for technologies and tools to process the relevant information. Named Entity Recognition (NER) is an Information Extraction task that has become an integral part of many other Natural Language Processing (NLP) tasks, such as Machine Translation and Information Retrieval. Arabic NER has begun to receive attention in recent years. The characteristics and peculiarities of Arabic, a member of the Semitic language family, make dealing with NER a challenge. The performance of an Arabic NER component affects the overall performance of the NLP system in a positive manner. This article attempts to describe and detail the recent increase in interest and progress made in Arabic NER research. The importance of the NER task is demonstrated, the main characteristics of the Arabic language are highlighted, and the aspects of standardization in annotating named entities are illustrated. Moreover, the different Arabic linguistic resources are presented and the approaches used in the Arabic NER field are explained. The features of common tools used in Arabic NER are described, and standard evaluation metrics are illustrated. In addition, the state of the art of Arabic NER research is reviewed. Finally, we present our conclusions. Throughout the presentation, illustrative examples are used for clarification.

188 citations


Journal ArticleDOI
TL;DR: This book captures the essence of a series of highly successful workshops organized over the last few years in corpus-based natural language processing, including part-of-speech tagging, word sense disambiguation, parsing on real-life texts, working with parallel corpora and improved techniques for document processing.
Abstract: This book is intended for researchers who want to keep abreast of current developments in corpus-based natural language processing. It captures the essence of a series of highly successful workshops organized over the last few years. The papers cover a range of current research topics in this field, including part-of-speech tagging, word sense disambiguation, parsing on real-life texts, working with parallel corpora, and improved techniques for document processing.

129 citations


Journal ArticleDOI
TL;DR: A comprehensive introduction to the Penn Discourse Treebank is provided to correct some wrong (or perhaps inadvertent) assumptions about the PDTB and its annotation and to explain variations seen in the annotation of comparable resources in other languages and genres to allow developers of future comparable resources to recognize whether the variations are relevant to them.
Abstract: The Penn Discourse Treebank (PDTB) was released to the public in 2008. It remains the largest manually annotated corpus of discourse relations to date. Its focus on discourse relations that are either lexically-grounded in explicit discourse connectives or associated with sentential adjacency has not only facilitated its use in language technology and psycholinguistics but also has spawned the annotation of comparable corpora in other languages and genres. Given this situation, this paper has four aims: (1) to provide a comprehensive introduction to the PDTB for those who are unfamiliar with it; (2) to correct some wrong (or perhaps inadvertent) assumptions about the PDTB and its annotation that may have weakened previous results or the performance of decision procedures induced from the data; (3) to explain variations seen in the annotation of comparable resources in other languages and genres, which should allow developers of future comparable resources to recognize whether the variations are relevant to them; and (4) to enumerate and explain relationships between PDTB annotation and complementary annotation of other linguistic phenomena. The paper draws on work done by ourselves and others since the corpus was released.

89 citations


Journal ArticleDOI
TL;DR: This work addresses the challenge of identifying the authors of anonymous texts by using topic models to obtain author representations and presents experimental results that demonstrate the applicability of topical author representations to two other problems: inferring the sentiment polarity of texts, and predicting the ratings that users would give to items such as movies.
Abstract: Authorship attribution deals with identifying the authors of anonymous texts. Traditionally, research in this field has focused on formal texts, such as essays and novels, but recently more attention has been given to texts generated by on-line users, such as e-mails and blogs. Authorship attribution of such on-line texts is a more challenging task than traditional authorship attribution, because such texts tend to be short, and the number of candidate authors is often larger than in traditional settings. We address this challenge by using topic models to obtain author representations. In addition to exploring novel ways of applying two popular topic models to this task, we test our new model that projects authors and documents to two disjoint topic spaces. Utilizing our model in authorship attribution yields state-of-the-art performance on several data sets, containing either formal texts written by a few authors or informal texts generated by tens to thousands of on-line users. We also present experimental results that demonstrate the applicability of topical author representations to two other problems: inferring the sentiment polarity of texts, and predicting the ratings that users would give to items such as movies.
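As a rough approximation of the general idea, each known author can be represented by the average topic distribution of his or her documents, and an anonymous text can then be attributed to the most similar author. The sketch below uses plain LDA from scikit-learn and made-up placeholder documents; it is a stand-in for, not a reproduction of, the article's models (including the disjoint-space model).

```python
# Rough sketch of topic-based author representations (plain LDA as a
# stand-in, not the article's disjoint-space model). Each known author is
# represented by the mean topic distribution of their documents; an
# anonymous text is attributed to the author with the closest profile.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the sea and the ships in the harbour",          # made-up placeholders
        "old ships sail the grey sea at dawn",
        "markets rallied after strong earnings reports",
        "earnings growth lifted the markets again"]
authors = ["A", "A", "B", "B"]

vec = CountVectorizer()
X = vec.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=4, random_state=0)
theta = lda.fit_transform(X)                      # document-topic proportions

profiles = {a: theta[[i for i, x in enumerate(authors) if x == a]].mean(axis=0)
            for a in set(authors)}

query = lda.transform(vec.transform(["the ships left the harbour at dawn"]))[0]
cosine = lambda u, v: u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
print(max(profiles, key=lambda a: cosine(profiles[a], query)))
```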

78 citations


Journal ArticleDOI
TL;DR: Bagel is presented, a fully data-driven generation method that treats the language generation task as a search for the most likely sequence of semantic concepts and realization phrases, according to Factored Language Models (FLMs).
Abstract: Most previous work on trainable language generation has focused on paradigms that rely on the existence of a handcrafted generation component, which is likely to limit their scalability to new domains. The first contribution of this article is to present Bagel, a fully data-driven generation method that treats the language generation task as a search for the most likely sequence of semantic concepts and realization phrases, according to Factored Language Models (FLMs). As domain utterances are not readily available for most natural language generation tasks, a large creative effort is required to produce the data necessary to represent human linguistic variation for nontrivial domains. This article is based on the assumption that learning to produce paraphrases can be facilitated by collecting data from a large sample of untrained annotators using crowdsourcing, rather than a few domain experts, by relying on a coarse meaning representation. A second contribution of this article is to use crowdsourced data to show how dialogue naturalness can be improved by learning to vary the output utterances generated for a given semantic input. Two data-driven methods for generating paraphrases in dialogue are presented: (a) by sampling from the n-best list of realizations produced by Bagel's FLM reranker; and (b) by learning a structured perceptron predicting whether candidate realizations are valid paraphrases. We train Bagel on a set of 1,956 utterances produced by 137 annotators, which covers 10 types of dialogue acts and 128 semantic concepts in a tourist information system for Cambridge. An automated evaluation shows that Bagel outperforms utterance class LM baselines on this domain. A human evaluation of 600 resynthesized dialogue extracts shows that Bagel's FLM output produces utterances comparable to a handcrafted baseline, whereas the perceptron classifier performs worse. Interestingly, human judges find the system sampling from the n-best list to be more natural than a system always returning the first-best utterance. The judges are also more willing to interact with the n-best system in the future. These results suggest that capturing the large variation found in human language using data-driven methods is beneficial for dialogue interaction.

76 citations


Journal ArticleDOI
TL;DR: A new class of unsupervised, nonparametric Bayesian models with the purpose of probabilistically inferring coreference clusters of event mentions from a collection of unlabeled documents is described, which combines an infinite latent class model with a discrete time series model.
Abstract: The task of event coreference resolution plays a critical role in many natural language processing applications such as information extraction, question answering, and topic detection and tracking. In this article, we describe a new class of unsupervised, nonparametric Bayesian models with the purpose of probabilistically inferring coreference clusters of event mentions from a collection of unlabeled documents. In order to infer these clusters, we automatically extract various lexical, syntactic, and semantic features for each event mention from the document collection. Extracting a rich set of features for each event mention allows us to cast event coreference resolution as the task of grouping together the mentions that share the same features (they have the same participating entities, share the same location, happen at the same time, etc.). Some of the most important challenges posed by the resolution of event coreference in an unsupervised way stem from (a) the choice of representing event mentions through a rich set of features and (b) the ability of modeling events described both within the same document and across multiple documents. Our first unsupervised model that addresses these challenges is a generalization of the hierarchical Dirichlet process. This new extension preserves the hierarchical Dirichlet process's ability to capture the uncertainty regarding the number of clustering components and, additionally, takes into account any finite number of features associated with each event mention. Furthermore, to overcome some of the limitations of this extension, we devised a new hybrid model, which combines an infinite latent class model with a discrete time series model. The main advantage of this hybrid model lies in its capability to automatically infer the number of features associated with each event mention from data and, at the same time, to perform an automatic selection of the most informative features for the task of event coreference. The evaluation performed for solving both within- and cross-document event coreference shows significant improvements of these models when compared against two baselines for this task.

65 citations


Journal ArticleDOI
TL;DR: This paper provides a realistic simulation of large-scale evaluation for the Word Sense Disambiguation task and puts forward two novel approaches to the wide-coverage generation of semantically aware pseudowords, which enable a large-scale experimental framework for the comparison of state-of-the-art supervised and knowledge-based algorithms.
Abstract: The evaluation of several tasks in lexical semantics is often limited by the lack of large amounts of manual annotations, not only for training purposes, but also for testing purposes. Word Sense Disambiguation (WSD) is a case in point, as hand-labeled datasets are particularly hard and time-consuming to create. Consequently, evaluations tend to be performed on a small scale, which does not allow for in-depth analysis of the factors that determine a system's performance. In this paper we address this issue by means of a realistic simulation of large-scale evaluation for the WSD task. We do this by providing two main contributions: First, we put forward two novel approaches to the wide-coverage generation of semantically aware pseudowords (i.e., artificial words capable of modeling real polysemous words); second, we leverage the most suitable type of pseudoword to create large pseudosense-annotated corpora, which enable a large-scale experimental framework for the comparison of state-of-the-art supervised and knowledge-based algorithms. Using this framework, we study the impact of supervision and knowledge on the two major disambiguation paradigms and perform an in-depth analysis of the factors which affect their performance.
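In its simplest, pre-existing form, a pseudoword conflates two real words into one artificial ambiguous token whose "senses" are the original words, so that sense-labeled test instances come for free. The toy sketch below shows only that basic construction; the article's contribution is the semantically aware generation of such pseudowords, which is not reproduced here.

```python
# Toy construction of a classic pseudoword (not the article's semantically
# aware variants): the words "banana" and "door" are conflated into the
# ambiguous token "banana_door", and the original word serves as the
# gold "sense" label for evaluation.
corpus = [
    "she ate a ripe banana for breakfast",
    "he closed the door quietly",
    "the door creaked in the wind",
]
components = ("banana", "door")
pseudoword = "_".join(components)

labeled = []
for sent in corpus:
    tokens = sent.split()
    for i, tok in enumerate(tokens):
        if tok in components:
            masked = tokens[:i] + [pseudoword] + tokens[i + 1:]
            labeled.append((" ".join(masked), tok))   # (instance, gold sense)

for instance, gold in labeled:
    print(gold, "->", instance)
```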

Journal ArticleDOI
TL;DR: Novel techniques for extracting features from n-gram models, Hidden Markov Models, and other statistical language models are investigated, including a novel Partial Lattice Markov Random Field model.
Abstract: Finding the right representations for words is critical for building accurate NLP systems when domain-specific labeled data for the task is scarce. This article investigates novel techniques for extracting features from n-gram models, Hidden Markov Models, and other statistical language models, including a novel Partial Lattice Markov Random Field model. Experiments on part-of-speech tagging and information extraction, among other tasks, indicate that features taken from statistical language models, in combination with more traditional features, outperform traditional representations alone, and that graphical model representations outperform n-gram models, especially on sparse and polysemous words.

Journal ArticleDOI
TL;DR: A novel method in which words are the vertices in a graph, synonyms are linked by edges, and the bits assigned to a word are determined by a vertex coding algorithm, which ensures that each word represents a unique sequence of bits, without cutting out large numbers of synonyms, and thus maintains a reasonable embedding capacity.
Abstract: Linguistic steganography is concerned with hiding information in natural language text. One of the major transformations used in linguistic steganography is synonym substitution. However, few existing studies have examined the practical application of this approach. In this article we propose two improvements to the use of synonym substitution for encoding hidden bits of information. First, we use the Google n-gram corpus for checking the applicability of a synonym in context, and we evaluate this method using data from the SemEval lexical substitution task and human annotated data. Second, we address the problem that arises from words with more than one sense, which creates a potential ambiguity in terms of which bits are represented by a particular word. We develop a novel method in which words are the vertices in a graph, synonyms are linked by edges, and the bits assigned to a word are determined by a vertex coding algorithm. This method ensures that each word represents a unique sequence of bits, without cutting out large numbers of synonyms, and thus maintains a reasonable embedding capacity.
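As a simplified stand-in for the vertex coding idea (not the article's actual algorithm), one can greedily color the synonym graph so that no two linked synonyms receive the same color and then use each color's binary form as the word's bit code; within a set of mutual synonyms, every word then carries a distinct code. The sketch below uses a tiny invented synonym graph.

```python
# Simplified stand-in for vertex coding (not the article's algorithm):
# greedily color a synonym graph so that no two linked synonyms share a
# color, then use the binary form of each color as the word's bit code.
# Within any group of mutual synonyms every word then carries a distinct
# code, so the choice of synonym can encode hidden bits.
synonym_edges = [("big", "large"), ("big", "huge"), ("large", "huge"),
                 ("small", "little"), ("small", "tiny"), ("little", "tiny")]

adj = {}
for u, v in synonym_edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

colors = {}
for word in sorted(adj, key=lambda w: -len(adj[w])):   # highest degree first
    used = {colors[n] for n in adj[word] if n in colors}
    colors[word] = next(c for c in range(len(adj)) if c not in used)

width = max(colors.values()).bit_length() or 1
codes = {w: format(c, f"0{width}b") for w, c in colors.items()}
print(codes)   # e.g. {'big': '00', 'large': '01', 'huge': '10', ...}
```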

Journal ArticleDOI
TL;DR: A structure learning system for unrestricted coreference resolution that explores two key modeling techniques: latent coreference trees and automatic entropy-guided feature induction, which is the best performing system for each of the three languages.
Abstract: We describe a structure learning system for unrestricted coreference resolution that explores two key modeling techniques: latent coreference trees and automatic entropy-guided feature induction. The latent tree modeling makes the learning problem computationally feasible because it incorporates a meaningful hidden structure. Additionally, using an automatic feature induction method, we can efficiently build enhanced nonlinear models using linear model learning algorithms. We present empirical results that highlight the contribution of each modeling technique used in the proposed system. Empirical evaluation is performed on the multilingual unrestricted coreference CoNLL-2012 Shared Task datasets, which comprise three languages: Arabic, Chinese and English. We apply the same system to all languages, except for minor adaptations to some language-dependent features such as nested mentions and specific static pronoun lists. A previous version of this system was submitted to the CoNLL-2012 Shared Task closed track, achieving an official score of 58.69, the best among the competitors. The unique enhancement added to the current system version is the inclusion of candidate arcs linking nested mentions for the Chinese language. By including such arcs, the score increases by almost 4.5 points for that language. The current system shows a score of 60.15, which corresponds to a 3.5% error reduction, and is the best performing system for each of the three languages.

Journal ArticleDOI
TL;DR: The working hypothesis of this article is that semantic roles can be induced without human supervision from a corpus of syntactically parsed sentences based on three linguistic principles, and a method is presented that implements these principles and formalizes the task as a graph partitioning problem.
Abstract: As in many natural language processing tasks, data-driven models based on supervised learning have become the method of choice for semantic role labeling. These models are guaranteed to perform well when given sufficient amount of labeled training data. Producing this data is costly and time-consuming, however, thus raising the question of whether unsupervised methods offer a viable alternative. The working hypothesis of this article is that semantic roles can be induced without human supervision from a corpus of syntactically parsed sentences based on three linguistic principles: (1) arguments in the same syntactic position (within a specific linking) bear the same semantic role, (2) arguments within a clause bear a unique role, and (3) clusters representing the same semantic role should be more or less lexically and distributionally equivalent. We present a method that implements these principles and formalizes the task as a graph partitioning problem, whereby argument instances of a verb are represented as vertices in a graph whose edges express similarities between these instances. The graph consists of multiple edge layers, each one capturing a different aspect of argument-instance similarity, and we develop extensions of standard clustering algorithms for partitioning such multi-layer graphs. Experiments for English and German demonstrate that our approach is able to induce semantic role clusters that are consistently better than a strong baseline and are competitive with the state of the art.

Journal ArticleDOI
TL;DR: A new training method, feature-frequency–adaptive on-line training, for fast and accurate training of natural language processing systems, based on the core idea that higher frequency features should have a learning rate that decays faster.
Abstract: Training speed and accuracy are two major concerns of large-scale natural language processing systems. Typically, we need to make a tradeoff between speed and accuracy. It is trivial to improve the training speed via sacrificing accuracy or to improve the accuracy via sacrificing speed. Nevertheless, it is nontrivial to improve the training speed and the accuracy at the same time, which is the target of this work. To reach this target, we present a new training method, feature-frequency–adaptive on-line training, for fast and accurate training of natural language processing systems. It is based on the core idea that higher frequency features should have a learning rate that decays faster. Theoretical analysis shows that the proposed method is convergent with a fast convergence rate. Experiments are conducted based on well-known benchmark tasks, including named entity recognition, word segmentation, phrase chunking, and sentiment analysis. These tasks consist of three structured classification tasks and one non-structured classification task, with binary features and real-valued features, respectively. Experimental results demonstrate that the proposed method is faster and at the same time more accurate than existing methods, achieving state-of-the-art scores on the tasks with different characteristics.
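The core idea, that a feature's learning rate should decay with how often that feature has been updated, can be illustrated with a short on-line logistic-regression sketch. The schedule and toy feature stream below are illustrative assumptions, not the article's actual update rule or its convergence machinery.

```python
# Minimal sketch of the core idea (not the article's full method): an on-line
# logistic-regression update where each feature keeps its own learning rate,
# and rates decay faster for features that are updated more often.
import math
from collections import defaultdict

w = defaultdict(float)        # feature weights
counts = defaultdict(int)     # how many times each feature has been updated
eta0 = 0.5                    # base learning rate (hypothetical value)

def update(features, label):  # features: dict name -> value, label in {0, 1}
    score = sum(w[f] * v for f, v in features.items())
    prob = 1.0 / (1.0 + math.exp(-score))
    error = label - prob
    for f, v in features.items():
        counts[f] += 1
        eta = eta0 / (1.0 + counts[f])      # frequent features decay faster
        w[f] += eta * error * v

stream = [({"w=good": 1.0, "bias": 1.0}, 1),
          ({"w=bad": 1.0, "bias": 1.0}, 0),
          ({"w=good": 1.0, "bias": 1.0}, 1)]
for feats, y in stream:
    update(feats, y)
print(dict(w))
```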

Journal ArticleDOI
TL;DR: It is shown that by using only a small corpus of non-adaptive dialogues and user knowledge profiles it is possible to learn an adaptive user modeling policy using a sense-predict-adapt approach, and that the learned policy generalises better to unseen user profiles than these hand-coded policies, while having comparable performance on known user profiles.
Abstract: We address the problem of dynamically modeling and adapting to unknown users in resource-scarce domains in the context of interactive spoken dialogue systems. As an example, we show how a system can learn to choose referring expressions to refer to domain entities for users with different levels of domain expertise, and whose domain knowledge is initially unknown to the system. We approach this problem using a three step process: collecting data using a Wizard-of-Oz method, building simulated users, and learning to model and adapt to users using Reinforcement Learning techniques. We show that by using only a small corpus of non-adaptive dialogues and user knowledge profiles it is possible to learn an adaptive user modeling policy using a sense-predict-adapt approach. Our evaluation results show that the learned user modeling and adaptation strategies performed better in terms of adaptation than some simple hand-coded baseline policies, with both simulated and real users. With real users, the learned policy produced around a 20% increase in adaptation in comparison to an adaptive hand-coded baseline. We also show that adaptation to users' domain knowledge results in improving task success (99.47% for the learned policy vs. 84.7% for a hand-coded baseline) and reducing dialogue time of the conversation (11% relative difference). We also compared the learned policy to a variety of carefully hand-crafted adaptive policies that employ the user knowledge profiles to adapt their choices of referring expressions throughout a conversation. We show that the learned policy generalises better to unseen user profiles than these hand-coded policies, while having comparable performance on known user profiles. We discuss the overall advantages of this method and how it can be extended to other levels of adaptation such as content selection and dialogue management, and to other domains where adapting to users' domain knowledge is useful, such as travel and healthcare.

Journal ArticleDOI
TL;DR: The use of pushdown automata (PDA) for statistical machine translation and alignment under a synchronous context-free grammar is described, and a two-pass decoding strategy involving a weaker language model in the first pass is proposed to address the results of the PDA complexity analysis.
Abstract: This article describes the use of pushdown automata (PDA) in the context of statistical machine translation and alignment under a synchronous context-free grammar. We use PDAs to compactly represent the space of candidate translations generated by the grammar when applied to an input sentence. General-purpose PDA algorithms for replacement, composition, shortest path, and expansion are presented. We describe HiPDT, a hierarchical phrase-based decoder using the PDA representation and these algorithms. We contrast the complexity of this decoder with a decoder based on a finite state automata representation, showing that PDAs provide a more suitable framework to achieve exact decoding for larger synchronous context-free grammars and smaller language models. We assess this experimentally on a large-scale Chinese-to-English alignment and translation task. In translation, we propose a two-pass decoding strategy involving a weaker language model in the first pass to address the results of the PDA complexity analysis. We study in depth the experimental conditions and tradeoffs in which HiPDT can achieve state-of-the-art performance for large-scale SMT.

Journal ArticleDOI
TL;DR: A probabilistic framework for acquiring selectional preferences of linguistic predicates and for using the acquired representations to model the effects of context on word meaning is described, and it is argued that probabilistic methods provide an effective and flexible methodology for distributional semantics.
Abstract: We describe a probabilistic framework for acquiring selectional preferences of linguistic predicates and for using the acquired representations to model the effects of context on word meaning. Our framework uses Bayesian latent-variable models inspired by, and extending, the well-known Latent Dirichlet Allocation (LDA) model of topical structure in documents; when applied to predicate-argument data, topic models automatically induce semantic classes of arguments and assign each predicate a distribution over those classes. We consider LDA and a number of extensions to the model and evaluate them on a variety of semantic prediction tasks, demonstrating that our approach attains state-of-the-art performance. More generally, we argue that probabilistic methods provide an effective and flexible methodology for distributional semantics.
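The basic recipe of applying a topic model to predicate-argument data can be sketched by treating each predicate slot as a "document" whose "words" are its observed argument heads, so that induced topics act as latent argument classes. The example below uses plain LDA from scikit-learn and invented counts; it stands in for, rather than reproduces, the article's extended models.

```python
# Sketch of the LDA-over-predicate-argument-data recipe (plain LDA, not the
# article's extensions): each predicate slot is a "document" whose "words"
# are its observed argument head words; topics act as induced argument classes.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

arguments = ["wine", "beer", "water", "bread", "pasta", "meat"]   # vocabulary
# Hypothetical argument counts for the direct-object slot of two verbs.
counts = np.array([
    [30, 25, 40,  1,  0,  1],    # drink: mostly liquids
    [ 0,  1,  2, 35, 30, 28],    # eat:   mostly foods
])

lda = LatentDirichletAllocation(n_components=2, random_state=0)
theta = lda.fit_transform(counts)          # predicate -> class distribution
phi = lda.components_ / lda.components_.sum(axis=1, keepdims=True)

for k, row in enumerate(phi):              # show the induced argument classes
    top = [arguments[i] for i in row.argsort()[::-1][:3]]
    print(f"class {k}: {top}")
print("drink:", theta[0].round(2), " eat:", theta[1].round(2))
```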

Journal ArticleDOI
TL;DR: This work defines a criterion of T-non-theoretical grounding as guidance to avoid circularities in empirical computational linguistics, and exemplifies how this criterion can be met by crowdsourcing, by task-related data annotation, or by data in the wild.
Abstract: Philosophy of science has pointed out a problem of theoretical terms in empirical sciences. This problem arises if all known measuring procedures for a quantity of a theory presuppose the validity of this very theory, because then statements containing theoretical terms are circular. We argue that a similar circularity can happen in empirical computational linguistics, especially in cases where data are manually annotated by experts. We define a criterion of T-non-theoretical grounding as guidance to avoid such circularities, and exemplify how this criterion can be met by crowdsourcing, by task-related data annotation, or by data in the wild. We argue that this criterion should be considered as a necessary condition for an empirical science, in addition to measures for reliability of data annotation.

Journal ArticleDOI
TL;DR: This work shows how arc-eager dependency parsers can be constrained to respect two different types of conditions on the output dependency graph: span constraints, which require certain spans to correspond to subtrees of the graph, and arc constraints, who require certain arcs to be present in the graph.
Abstract: Arc-eager dependency parsers process sentences in a single left-to-right pass over the input and have linear time complexity with greedy decoding or beam search. We show how such parsers can be constrained to respect two different types of conditions on the output dependency graph: span constraints, which require certain spans to correspond to subtrees of the graph, and arc constraints, which require certain arcs to be present in the graph. The constraints are incorporated into the arc-eager transition system as a set of preconditions for each transition and preserve the linear time complexity of the parser.
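The mechanism can be pictured as a precondition filter on the usual arc-eager transitions. The sketch below implements the standard legality checks plus one illustrative precondition for an arc constraint (blocking any transition that would attach a constrained dependent to the wrong head); it is a partial illustration of the idea, not the full precondition set derived in the article.

```python
# Illustrative sketch (not the article's full precondition set): standard
# arc-eager legality checks extended with one simple precondition for an
# arc constraint, namely that a dependent with a required head must not be
# attached to any other head. `heads` maps each token to its head so far
# (None if unattached); token 0 is the artificial root.
def legal(action, stack, buffer, heads, required):
    s = stack[-1] if stack else None
    b = buffer[0] if buffer else None
    if action == "SHIFT":
        return b is not None
    if action == "LEFT-ARC":          # would add the arc b -> s
        return (s not in (None, 0) and b is not None and heads[s] is None
                and (s not in required or required[s] == b))
    if action == "RIGHT-ARC":         # would add the arc s -> b
        return (s is not None and b is not None
                and (b not in required or required[b] == s))
    if action == "REDUCE":
        return s is not None and heads[s] is not None
    return False

# Constraint: token 3 must be headed by token 1.
heads = {0: None, 1: None, 2: 1, 3: None}
required = {3: 1}
print(legal("RIGHT-ARC", [0, 1, 2], [3], heads, required))  # False: 2 would head 3
print(legal("RIGHT-ARC", [0, 1], [3], heads, required))     # True: 1 heads 3
```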

Journal ArticleDOI
TL;DR: A simple modification to the arc-eager system for transition-based dependency parsing is proposed that enforces the tree constraint without requiring any modification of the parser training procedure.
Abstract: The arc-eager system for transition-based dependency parsing is widely used in natural language processing despite the fact that it does not guarantee that the output is a well-formed dependency tree. We propose a simple modification to the original system that enforces the tree constraint without requiring any modification to the parser training procedure. Experiments on multiple languages show that the method on average achieves 72% of the error reduction possible and consistently outperforms the standard heuristic in current use.

Journal ArticleDOI
TL;DR: A tree-to-tree machine translation system inspired by quasi-synchronous grammar is presented that combines phrases and dependency syntax, integrating the advantages of phrase-based and syntax-based translation.
Abstract: Recent research has shown clear improvement in translation quality by exploiting linguistic syntax for either the source or target language. However, when using syntax for both languages ("tree-to-tree" translation), there is evidence that syntactic divergence can hamper the extraction of useful rules (Ding and Palmer 2005). Smith and Eisner (2006) introduced quasi-synchronous grammar, a formalism that treats non-isomorphic structure softly using features rather than hard constraints. Although a natural fit for translation modeling, its flexibility has proved challenging for building real-world systems. In this article, we present a tree-to-tree machine translation system inspired by quasi-synchronous grammar. The core of our approach is a new model that combines phrases and dependency syntax, integrating the advantages of phrase-based and syntax-based translation. We report statistically significant improvements over a phrase-based baseline on five of seven test sets across four language pairs. We also present encouraging preliminary results on the use of unsupervised dependency parsing for syntax-based machine translation.

Journal ArticleDOI
TL;DR: A method for identifying the sentiment polarity of words that applies a Markov random walk model to a large word relatedness graph, and produces a polarity estimate for any given word is proposed.
Abstract: Automatically identifying the sentiment polarity of words is a very important task that has been used as the essential building block of many natural language processing systems such as text classification, text filtering, product review analysis, survey response analysis, and on-line discussion mining. We propose a method for identifying the sentiment polarity of words that applies a Markov random walk model to a large word relatedness graph, and produces a polarity estimate for any given word. The model can accurately and quickly assign a polarity sign and magnitude to any word. It can be used both in a semi-supervised setting where a training set of labeled words is used, and in a weakly supervised setting where only a handful of seed words is used to define the two polarity classes. The method is experimentally tested using a gold standard set of positive and negative words from the General Inquirer lexicon. We also show how our method can be used for three-way classification which identifies neutral words in addition to positive and negative words. Our experiments show that the proposed method outperforms the state-of-the-art methods in the semi-supervised setting and is comparable to the best reported values in the weakly supervised setting. In addition, the proposed method is faster and does not need a large corpus. We also present extensions of our methods for identifying the polarity of foreign words and out-of-vocabulary words.
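The random-walk intuition can be made concrete with mean hitting times: a word is labeled positive if a walk started at it reaches the positive seed set faster, on average, than the negative seed set. The sketch below computes hitting times on a tiny invented relatedness graph; it illustrates this family of methods, not the article's exact model or normalization.

```python
# Minimal sketch of polarity by random-walk hitting times on a toy word
# relatedness graph (graph and seed sets are invented; the article's graph
# is much larger). A word is called positive if the walk started at it hits
# the positive seed set faster, on average, than the negative seed set.
import numpy as np

words = ["good", "excellent", "nice", "bad", "terrible", "boring"]
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 5)]
pos_seeds, neg_seeds = {0}, {3}

n = len(words)
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
P = A / A.sum(axis=1, keepdims=True)          # row-stochastic transition matrix

def mean_hitting_time(P, targets):
    """Expected number of steps to reach `targets` from each node."""
    free = [i for i in range(len(P)) if i not in targets]
    Q = P[np.ix_(free, free)]
    h = np.linalg.solve(np.eye(len(free)) - Q, np.ones(len(free)))
    out = np.zeros(len(P))
    out[free] = h
    return out

h_pos = mean_hitting_time(P, pos_seeds)
h_neg = mean_hitting_time(P, neg_seeds)
for i, w in enumerate(words):
    if i in pos_seeds or i in neg_seeds:
        continue
    print(w, "positive" if h_pos[i] < h_neg[i] else "negative")
```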

Journal ArticleDOI
TL;DR: The ultimate goal is to develop a hybrid corpus annotation methodology that combines fully automatic annotation and manual parse selection, in order to make the annotation task more efficient while maintaining high accuracy and the high degree of consistency necessary for any foreseen uses of a treebank.
Abstract: This article presents an ensemble parse approach to detecting and selecting high-quality linguistic analyses output by a hand-crafted HPSG grammar of Spanish implemented in the LKB system. The approach uses full agreement (i.e., exact syntactic match) along with a MaxEnt parse selection model and a statistical dependency parser trained on the same data. The ultimate goal is to develop a hybrid corpus annotation methodology that combines fully automatic annotation and manual parse selection, in order to make the annotation task more efficient while maintaining high accuracy and the high degree of consistency necessary for any foreseen uses of a treebank.

Journal ArticleDOI
Claire Cardie1
TL;DR: This 2012 book is written as a comprehensive introductory and survey text for sentiment analysis and opinion mining, a field of study that investigates computational techniques for analyzing text to uncover the opinions, sentiment, emotions, and evaluations expressed therein.
Abstract: This 2012 book is written as a comprehensive introductory and survey text for sentiment analysis and opinion mining, a field of study that investigates computational techniques for analyzing text to uncover the opinions, sentiment, emotions, and evaluations expressed therein. As such, it aims to be accessible to a broad audience that includes students, researchers, and practitioners, as well as to cover all important topics in the field. With regard to the first aim, the book is very much a success: The writing is clear and concise, informative examples motivate each new topic, terminology is clearly defined, and descriptions of key algorithms are provided in the running text along with short (usually one-line) descriptions of each piece of relevant related work. The latter, in particular, makes the book an excellent platform from which to dive into the quickly expanding body of literature on sentiment and opinion analysis. In addition, I believe that the book should be easily accessible to anyone with a computer science background.

Journal ArticleDOI
TL;DR: A Markov chain Monte Carlo algorithm is developed which corrects for the bias introduced by unbalanced forests, and experiments using the algorithm to learn Synchronous Context-Free Grammar rules for machine translation are presented.
Abstract: We study the problem of sampling trees from forests, in the setting where probabilities for each tree may be a function of arbitrarily large tree fragments. This setting extends recent work for sampling to learn Tree Substitution Grammars to the case where the tree structure (TSG derived tree) is not fixed. We develop a Markov chain Monte Carlo algorithm which corrects for the bias introduced by unbalanced forests, and we present experiments using the algorithm to learn Synchronous Context-Free Grammar rules for machine translation. In this application, the forests being sampled represent the set of Hiero-style rules that are consistent with fixed input word-level alignments. We demonstrate equivalent machine translation performance to standard techniques but with much smaller grammars.

Journal ArticleDOI
Dan Jurafsky1
TL;DR: Charles J. Fillmore was one of the world’s pre-eminent scholars of lexical meaning and its relationship with context, grammar, corpora, and computation, and his work had an enormous impact on computational linguistics.
Abstract: Charles J. Fillmore died at his home in San Francisco on February 13, 2014, of brain cancer. He was 84 years old. Fillmore was one of the world’s pre-eminent scholars of lexical meaning and its relationship with context, grammar, corpora, and computation, and his work had an enormous impact on computational linguistics. His early theoretical work in the 1960s, 1970s, and 1980s on case grammar and then frame semantics significantly influenced computational linguistics, AI, and knowledge representation. More recent work in the last two decades on FrameNet, a computational lexicon and annotated corpus, influenced corpus linguistics and computational lexicography, and led to modern natural language understanding tasks like semantic role labeling. Fillmore was born and raised in St. Paul, Minnesota, and studied linguistics at the University of Minnesota. As an undergraduate he worked on a pre-computational Latin corpus linguistics project, alphabetizing index cards and building concordances. During his service in the Army in the early 1950s he was stationed for three years in Japan. After his service he became the first US soldier to be discharged locally in Japan, and stayed for three years studying Japanese. He supported himself by teaching English, pioneering a way to make ends meet that afterwards became popular with generations of young Americans abroad. In 1957 he moved back to the United States to attend graduate school at the University of Michigan. At Michigan, Fillmore worked on phonetics, phonology, and syntax, first in the American Structuralist tradition of developing what were called “discovery procedures” for linguistic analysis, algorithms for inducing phones or parts of speech. Discovery procedures were thought of as a methodological tool, a formal procedure that linguists could apply to data to discover linguistic structure, for example inducing parts of speech from the slots in “sentence frames” informed by the distribution of surrounding words. Like many linguistic graduate students of the period, he also worked partly on machine translation, and was interviewed at the time by Yehoshua Bar-Hillel, who was touring US machine translation laboratories in preparation for his famous report on the state of MT (Bar-Hillel 1960). Early in his graduate career, however, Fillmore read Noam Chomsky’s Syntactic Structures and became an immediate proponent of the new transformational grammar. He graduated with his PhD in 1962 and moved to the linguistics department at Ohio State University. In his early work there Fillmore developed a number of early formal properties of generative grammar, such as the idea that rules would re-apply to representations in iterative stages called cycles (Fillmore 1963), a formal mechanism that still plays a role in modern theories of generative grammar.

Journal ArticleDOI
TL;DR: Two instantiations of binary lexicographic semirings are presented, one involving a pair of tropical weights, and the other a tropical weight paired with a novel string semiring the authors term the categorial semiring, used to yield an exact encoding of backoff models with epsilon transitions.
Abstract: This paper explores lexicographic semirings and their application to problems in speech and language processing. Specifically, we present two instantiations of binary lexicographic semirings, one involving a pair of tropical weights, and the other a tropical weight paired with a novel string semiring we term the categorial semiring. The first of these is used to yield an exact encoding of backoff models with epsilon transitions. This lexicographic language model semiring allows for off-line optimization of exact models represented as large weighted finite-state transducers in contrast to implicit (on-line) failure transition representations. We present empirical results demonstrating that, even in simple intersection scenarios amenable to the use of failure transitions, the use of the more powerful lexicographic semiring is competitive in terms of time of intersection. The second of these lexicographic semirings is applied to the problem of extracting, from a lattice of word sequences tagged for part of speech, only the single best-scoring part of speech tagging for each word sequence. We do this by incorporating the tags as a categorial weight in the second component of a {Tropical, Categorial} lexicographic semiring, determinizing the resulting word lattice acceptor in that semiring, and then mapping the tags back as output labels of the word lattice transducer. We compare our approach to a competing method due to Povey et al. (2012).
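The algebra of the tropical-pair lexicographic semiring is small enough to write out directly: the semiring "plus" keeps the lexicographically smaller weight pair, and the semiring "times" adds the pairs componentwise. The sketch below just makes those two operations concrete; it is not tied to any FST toolkit and omits the categorial string component.

```python
# Tiny sketch of the <Tropical, Tropical> lexicographic semiring: "plus"
# keeps the lexicographically smaller weight pair, "times" adds pairwise.
# This only makes the algebra concrete; it is not tied to any FST toolkit.
INF = float("inf")
ZERO = (INF, INF)   # additive identity
ONE = (0.0, 0.0)    # multiplicative identity

def plus(a, b):
    """Lexicographic minimum: compare first components, break ties on second."""
    return a if a <= b else b          # Python tuples compare lexicographically

def times(a, b):
    """Componentwise tropical product, i.e. pairwise addition."""
    return (a[0] + b[0], a[1] + b[1])

# Two paths with equal primary cost are disambiguated by the secondary
# component, which is how a second weight can separate otherwise
# equal-cost alternatives (e.g. penalizing backoff arcs).
path1 = times((2.5, 0.0), (1.0, 1.0))   # total (3.5, 1.0)
path2 = times((1.5, 0.0), (2.0, 0.0))   # total (3.5, 0.0)
print(plus(path1, path2))               # (3.5, 0.0): tie broken on 2nd weight
print(plus(ZERO, path1) == path1, times(ONE, path1) == path1)  # identity laws
```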

Journal ArticleDOI
TL;DR: More accurate entropy estimators are described and their performance both in simulations and on evaluation of WSI systems is analyzed.
Abstract: Information-theoretic measures are among the most standard techniques for evaluation of clustering methods including word sense induction (WSI) systems. Such measures rely on sample-based estimates of the entropy. However, the standard maximum likelihood estimates of the entropy are heavily biased with the bias dependent on, among other things, the number of clusters and the sample size. This makes the measures unreliable and unfair when the number of clusters produced by different systems vary and the sample size is not exceedingly large. This corresponds exactly to the setting of WSI evaluation where a ground-truth cluster sense number arguably does not exist and the standard evaluation scenarios use a small number of instances of each word to compute the score. We describe more accurate entropy estimators and analyze their performance both in simulations and on evaluation of WSI systems.
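As a generic illustration of the underlying problem (not necessarily the estimators proposed in the article), the maximum-likelihood plug-in estimate can be compared with the classic Miller-Madow bias correction on samples from a known distribution; at small sample sizes the plug-in estimate systematically underestimates the true entropy.

```python
# Generic illustration of entropy-estimator bias (not necessarily the
# estimators studied in the article): the maximum-likelihood (plug-in)
# estimate versus the classic Miller-Madow bias correction, on samples
# from a known uniform distribution over 20 "senses".
import numpy as np

rng = np.random.default_rng(0)
K, n = 20, 50                                  # support size, sample size
true_entropy = np.log(K)

def mle_entropy(counts):
    p = counts[counts > 0] / counts.sum()
    return -np.sum(p * np.log(p))

def miller_madow(counts):
    observed = np.count_nonzero(counts)
    return mle_entropy(counts) + (observed - 1) / (2 * counts.sum())

estimates = []
for _ in range(1000):
    sample = rng.integers(K, size=n)
    counts = np.bincount(sample, minlength=K)
    estimates.append((mle_entropy(counts), miller_madow(counts)))
mle_mean, mm_mean = np.mean(estimates, axis=0)
print(f"true {true_entropy:.3f}  MLE {mle_mean:.3f}  Miller-Madow {mm_mean:.3f}")
```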