
Showing papers in "Computational Linguistics in 2015"


Journal ArticleDOI
TL;DR: SimLex-999 is presented, a gold standard resource for evaluating distributional semantic models that improves on existing resources in several important ways, and explicitly quantifies similarity rather than association or relatedness so that pairs of entities that are associated but not actually similar have a low rating.
Abstract: We present SimLex-999, a gold standard resource for evaluating distributional semantic models that improves on existing resources in several important ways. First, in contrast to gold standards such as WordSim-353 and MEN, it explicitly quantifies similarity rather than association or relatedness so that pairs of entities that are associated but not actually similar (Freud, psychology) have a low rating. We show that, via this focus on similarity, SimLex-999 incentivizes the development of models with a different, and arguably wider, range of applications than those which reflect conceptual association. Second, SimLex-999 contains a range of concrete and abstract adjective, noun, and verb pairs, together with an independent rating of concreteness and free association strength for each pair. This diversity enables fine-grained analyses of the performance of models on concepts of different types, and consequently greater insight into how architectures can be improved. Further, unlike existing gold standard evaluations, for which automatic approaches have reached or surpassed the inter-annotator agreement ceiling, state-of-the-art models perform well below this ceiling on SimLex-999. There is therefore plenty of scope for SimLex-999 to quantify future improvements to distributional semantic models, guiding the development of the next generation of representation-learning architectures.
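A minimal sketch of how such a resource is typically used to evaluate a distributional model: compute the model's similarity for each word pair and report the Spearman correlation with the human ratings. The toy vectors, pairs, and ratings below are invented stand-ins, not SimLex-999 data.

```python
import numpy as np
from scipy.stats import spearmanr

# Toy word vectors standing in for any distributional model (assumed for illustration).
vectors = {
    "coast":  np.array([0.90, 0.10, 0.30]),
    "shore":  np.array([0.85, 0.15, 0.35]),
    "cup":    np.array([0.10, 0.80, 0.20]),
    "mug":    np.array([0.12, 0.78, 0.22]),
    "coffee": np.array([0.20, 0.75, 0.60]),   # associated with "cup", but not similar
}

# (word1, word2, human similarity rating) triples in the spirit of SimLex-999.
pairs = [
    ("coast", "shore", 9.0),   # genuinely similar -> high rating
    ("cup", "coffee", 3.0),    # associated but not similar -> low rating
    ("cup", "mug", 8.6),
]

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

model_scores = [cosine(vectors[w1], vectors[w2]) for w1, w2, _ in pairs]
human_scores = [rating for _, _, rating in pairs]

# Evaluation = rank correlation between model similarities and human ratings.
rho, _ = spearmanr(model_scores, human_scores)
print(f"Spearman rho against human similarity ratings: {rho:.3f}")
```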

986 citations


Journal ArticleDOI
TL;DR: Deep Learning waves have lapped at the shores of computational linguistics for several years, but 2015 seems like the year when the full force of the tsunami hit the major Natural Language Processing (NLP) conferences, with some pundits predicting that the field is about to be steamrollered.
Abstract: Deep Learning waves have lapped at the shores of computational linguistics for several years now, but 2015 seems like the year when the full force of the tsunami hit the major Natural Language Processing (NLP) conferences. However, some pundits are predicting that the final damage will be even worse. Accompanying ICML 2015 in Lille, France, there was another, almost as big, event: the 2015 Deep Learning Workshop. The workshop ended with a panel discussion, and at it, Neil Lawrence said, “NLP is kind of like a rabbit in the headlights of the Deep Learning machine, waiting to be flattened.” Now that is a remark that the computational linguistics community has to take seriously! Is it the end of the road for us? Where are these predictions of steamrollering coming from? At the June 2015 opening of the Facebook AI Research Lab in Paris, its director Yann LeCun said: “The next big step for Deep Learning is natural language understanding, which aims to give machines the power to understand not just individual words but entire sentences and paragraphs.”1 In a November 2014 Reddit AMA (Ask Me Anything), Geoff Hinton said, “I think that the most exciting areas over the next five years will be really understanding text and videos. I will be disappointed if in five years’ time we do not have something that can watch a YouTube video and tell a story about what happened. In a few years time we will put [Deep Learning] on a chip that fits into someone’s ear and have an English-decoding chip that’s just like a real Babel fish.”2 And Yoshua Bengio, the third giant of modern Deep Learning, has also increasingly oriented his group’s research toward language, including recent exciting new developments in neural machine translation systems. It’s not just Deep Learning researchers. When leading machine learning researcher Michael Jordan was asked at a September 2014 AMA, “If you got a billion dollars to spend on a huge research project that you get to lead, what would you like to do?”, he answered: “I’d use the billion dollars to build a NASA-size program focusing on natural language processing, in all of its glory (semantics, pragmatics, etc.).” He went on: “Intellectually I think that NLP is fascinating, allowing us to focus on highly structured inference problems, on issues that go to the core of ‘what is thought’ but remain eminently practical, and on a technology

201 citations


Journal ArticleDOI
TL;DR: It is demonstrated that CODRA significantly outperforms the state-of-the-art, often by a wide margin, and that a reranking of the k-best parse hypotheses generated by CODRA can potentially improve the accuracy even further.
Abstract: Clauses and sentences rarely stand on their own in an actual discourse; rather, the relationship between them carries important information that allows the discourse to express a meaning as a whole beyond the sum of its individual parts. Rhetorical analysis seeks to uncover this coherence structure. In this article, we present CODRA, a COmplete probabilistic Discriminative framework for performing Rhetorical Analysis in accordance with Rhetorical Structure Theory, which posits a tree representation of a discourse. CODRA comprises a discourse segmenter and a discourse parser. First, the discourse segmenter, which is based on a binary classifier, identifies the elementary discourse units in a given text. Then the discourse parser builds a discourse tree by applying an optimal parsing algorithm to probabilities inferred from two Conditional Random Fields: one for intra-sentential parsing and the other for multi-sentential parsing. We present two approaches to combine these two stages of parsing effectively. By conducting a series of empirical evaluations over two different data sets, we demonstrate that CODRA significantly outperforms the state-of-the-art, often by a wide margin. We also show that a reranking of the k-best parse hypotheses generated by CODRA can potentially improve the accuracy even further.

197 citations


Journal ArticleDOI
TL;DR: The proposed topic-to-question generation approach significantly outperforms state-of-the-art results, and the use of syntactic tree kernels is proposed for automatically judging the syntactic correctness of the generated questions.
Abstract: This paper is concerned with automatic generation of all possible questions from a topic of interest. Specifically, we consider that each topic is associated with a body of texts containing useful information about the topic. Then, questions are generated by exploiting the named entity information and the predicate argument structures of the sentences present in the body of texts. The importance of the generated questions is measured using Latent Dirichlet Allocation by identifying the subtopics which are closely related to the original topic in the given body of texts and applying the Extended String Subsequence Kernel to calculate their similarity with the questions. We also propose the use of syntactic tree kernels for the automatic judgment of the syntactic correctness of the questions. The questions are ranked by considering both their importance in the context of the given body of texts and syntactic correctness. To the best of our knowledge, no previous study has accomplished this task in our setting. A series of experiments demonstrate that the proposed topic-to-question generation approach can significantly outperform the state-of-the-art results.

71 citations


Journal ArticleDOI
Li Dong, Furu Wei, Shujie Liu, Ming Zhou, Ke Xu
TL;DR: This article develops a statistical parser that directly analyzes the sentiment structure of a sentence, and shows that complicated phenomena in sentiment analysis (e.g., negation, intensification, and contrast) can be handled in the same way as simple and straightforward sentiment expressions, in a unified and probabilistic way.
Abstract: We present a statistical parsing framework for sentence-level sentiment classification in this article. Unlike previous works that use syntactic parsing results for sentiment analysis, we develop a statistical parser to directly analyze the sentiment structure of a sentence. We show that complicated phenomena in sentiment analysis (e.g., negation, intensification, and contrast) can be handled the same way as simple and straightforward sentiment expressions in a unified and probabilistic way. We formulate the sentiment grammar upon Context-Free Grammars (CFGs), and provide a formal description of the sentiment parsing framework. We develop the parsing model to obtain possible sentiment parse trees for a sentence, from which the polarity model is proposed to derive the sentiment strength and polarity, and the ranking model is dedicated to selecting the best sentiment tree. We train the parser directly from examples of sentences annotated only with sentiment polarity labels but without any syntactic annotations or polarity annotations of constituents within sentences. Therefore we can obtain training data easily. In particular, we train a sentiment parser, s.parser, from a large amount of review sentences with users' ratings as rough sentiment polarity labels. Extensive experiments on existing benchmark data sets show significant improvements over baseline sentiment classification approaches.
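The toy sketch below is only meant to make the compositional idea concrete: negation, intensification, and contrast are treated as grammar-like rules that operate on the polarities of sub-spans rather than as ad hoc post-processing. It is not the authors' s.parser; the rule inventory, scores, and hand-built tree are invented for illustration.

```python
# A hand-built "sentiment parse tree": (rule, children...) tuples.
# Rules and scores are illustrative assumptions, not the paper's learned grammar.

LEXICON = {"good": 0.8, "great": 0.9, "bad": -0.7, "movie": 0.0, "the": 0.0}

def polarity(node):
    """Recursively compute a sentiment score in [-1, 1] for a tree node."""
    if isinstance(node, str):                      # terminal: look up prior polarity
        return LEXICON.get(node, 0.0)
    rule, *children = node
    scores = [polarity(c) for c in children]
    if rule == "NEG":                              # negation flips and dampens polarity
        return -0.6 * scores[0]
    if rule == "INT":                              # intensification amplifies polarity
        return max(-1.0, min(1.0, 1.5 * scores[0]))
    if rule == "CONTRAST":                         # "X but Y": second conjunct dominates
        return 0.3 * scores[0] + 0.7 * scores[1]
    return sum(scores) / len(scores)               # default: average the children

tree = ("CONTRAST",
        ("NEG", ("INT", "great")),                 # roughly: "not very great"
        ("S", "good", "movie"))                    # "... but a good movie"
print(f"sentence polarity: {polarity(tree):+.2f}")
```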

68 citations


Journal ArticleDOI
TL;DR: The goal of this article is to review the system features and evaluation strategies that have been proposed for the metaphor processing task, and to analyze their benefits and downsides, with the aim of identifying the desired properties of metaphor processing systems and a set of requirements for their evaluation.
Abstract: System design and evaluation methodologies receive significant attention in natural language processing (NLP), with the systems typically being evaluated on a common task and against shared data sets. This enables direct system comparison and facilitates progress in the field. However, computational work on metaphor is considerably more fragmented than similar research efforts in other areas of NLP and semantics. Recent years have seen a growing interest in computational modeling of metaphor, with many new statistical techniques opening routes for improving system accuracy and robustness. However, the lack of a common task definition, shared data set, and evaluation strategy makes the methods hard to compare, and thus hampers our progress as a community in this area. The goal of this article is to review the system features and evaluation strategies that have been proposed for the metaphor processing task, and to analyze their benefits and downsides, with the aim of identifying the desired properties of metaphor processing systems and a set of requirements for their evaluation.

67 citations


Journal ArticleDOI
TL;DR: The new versatile measure γ is proposed, which fulfills this requirement and copes with both paradigms, and it is shown that this new method performs as well as, or even better than, other more specialized methods devoted to categorization or segmentation, while combining the two paradigms at the same time.
Abstract: Agreement measures have been widely used in computational linguistics for more than 15 years to check the reliability of annotation processes. Although considerable effort has been made concerning categorization, fewer studies address unitizing, and when both paradigms are combined even fewer methods are available and discussed. The aim of this article is threefold. First, we advocate that to deal with unitizing, alignment and agreement measures should be considered as a unified process, because a relevant measure should rely on an alignment of the units from different annotators, and this alignment should be computed according to the principles of the measure. Second, we propose the new versatile measure γ, which fulfills this requirement and copes with both paradigms, and we introduce its implementation. Third, we show that this new method performs as well as, or even better than, other more specialized methods devoted to categorization or segmentation, while combining the two paradigms at the same time.

53 citations


Journal ArticleDOI
TL;DR: This work generates concrete models for this setting by developing algorithms to construct tensors and linear maps that instantiate the abstract parameters using empirical data, and evaluates these models in several experiments measuring how well they align with human judgments in a paraphrase detection task.
Abstract: Modeling compositional meaning for sentences using empirical distributional methods has been a challenge for computational linguists. The categorical model of Clark, Coecke, and Sadrzadeh (2008) and Coecke, Sadrzadeh, and Clark (2010) provides a solution by unifying a categorial grammar and a distributional model of meaning. It takes into account syntactic relations during semantic vector composition operations. But the setting is abstract: It has not been evaluated on empirical data and applied to any language tasks. We generate concrete models for this setting by developing algorithms to construct tensors and linear maps and instantiate the abstract parameters using empirical data. We then evaluate our concrete models against several experiments, both existing and new, based on measuring how well models align with human judgments in a paraphrase detection task. Our results show the implementation of this general abstract framework to perform on par with or outperform other leading models in these experiments.
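A minimal numerical sketch of the core compositional idea under evaluation: a relational word (here an adjective) is a matrix, its argument is a vector, composition is matrix-vector contraction, and composed phrase vectors are compared by cosine, as in a paraphrase-detection setup. The dimensionality and the random matrices below are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Noun meanings as distributional vectors (toy, 4-dimensional).
fido = rng.normal(size=4)
dog  = fido + 0.1 * rng.normal(size=4)            # a near-synonymous noun

# An adjective as a (4 x 4) matrix: a linear map acting on noun vectors.
furry = rng.normal(size=(4, 4))

def compose(adj_matrix, noun_vec):
    """Adjective-noun composition as matrix-vector contraction."""
    return adj_matrix @ noun_vec

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

phrase1 = compose(furry, fido)
phrase2 = compose(furry, dog)
print(f"cosine('furry fido', 'furry dog') = {cosine(phrase1, phrase2):.3f}")
```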

45 citations


Journal ArticleDOI
TL;DR: A novel machine translation model, the Operation Sequence Model (OSM), which combines the benefits of phrase-based and N-gram-based statistical machine translation (SMT) and remedies their drawbacks and outperforms lexicalized reordering on all translation tasks.
Abstract: In this article, we present a novel machine translation model, the Operation Sequence Model (OSM), which combines the benefits of phrase-based and N-gram-based statistical machine translation (SMT) and remedies their drawbacks. The model represents the translation process as a linear sequence of operations. The sequence includes not only translation operations but also reordering operations. As in N-gram-based SMT, the model (i) is based on minimal translation units, (ii) takes both source and target information into account, (iii) does not make a phrasal independence assumption, and (iv) avoids the spurious phrasal segmentation problem. As in phrase-based SMT, the model (i) has the ability to memorize lexical reordering triggers, (ii) builds the search graph dynamically, and (iii) decodes with large translation units during search. The unique properties of the model are (i) its strong coupling of reordering and translation, where translation and reordering decisions are conditioned on n previous translation and reordering decisions, and (ii) the ability to model local and long-range reorderings consistently. Using BLEU as a metric of translation accuracy, we found that our system performs significantly better than state-of-the-art phrase-based systems (Moses and Phrasal) and N-gram-based systems (Ncode) on standard translation tasks. We compare the reordering component of the OSM to the Moses lexical reordering model by integrating it into Moses. Our results show that OSM outperforms lexicalized reordering on all translation tasks. The translation quality is shown to be improved further by learning generalized representations with a POS-based OSM.

41 citations


Journal ArticleDOI
TL;DR: This article studies word ordering using a syntax-based approach and a discriminative model, develops a learning-guided search framework based on best-first search, and investigates several alternative training algorithms.
Abstract: Word ordering is a fundamental problem in text generation. In this article, we study word ordering using a syntax-based approach and a discriminative model. Two grammar formalisms are considered: Combinatory Categorial Grammar (CCG) and dependency grammar. Given the search for a likely string and syntactic analysis, the search space is massive, making discriminative training challenging. We develop a learning-guided search framework, based on best-first search, and investigate several alternative training algorithms. The framework we present is flexible in that it allows constraints to be imposed on output word orders. To demonstrate this flexibility, a variety of input conditions are considered. First, we investigate a "pure" word-ordering task in which the input is a multi-set of words, and the task is to order them into a grammatical and fluent sentence. This task has been tackled previously, and we report improved performance over existing systems on a standard Wall Street Journal test set. Second, we tackle the same reordering problem, but with a variety of input conditions, from the bare case with no dependencies or POS tags specified, to the extreme case where all POS tags and unordered, unlabeled dependencies are provided as input, and various conditions in between. When applied to the NLG 2011 shared task, our system gives competitive results compared with the best-performing systems, providing a further demonstration of the practical utility of our system.
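The sketch below shows the shape of a best-first search for the "pure" word-ordering condition: partial orders sit in a priority queue ranked by model score plus an optimistic bound on the remaining score, and the first complete hypothesis popped is returned. The bigram scorer is a stand-in assumption; the article's models are syntax-based and discriminatively trained, and its search guidance is learned rather than a fixed bound.

```python
import heapq
from itertools import count

# Stand-in scorer: rewards word pairs assumed to have been seen adjacently in training data.
BIGRAM_SCORES = {("the", "cat"): 2.0, ("cat", "sat"): 1.5, ("sat", "down"): 1.5}
MAX_BIGRAM = max(BIGRAM_SCORES.values())

def score(prefix):
    return sum(BIGRAM_SCORES.get((a, b), 0.0) for a, b in zip(prefix, prefix[1:]))

def order_words(bag):
    """Best-first search over partial orderings of a multiset of words.

    Priority = score so far + an optimistic bound on the remaining score,
    so the first complete hypothesis popped is the highest-scoring order.
    """
    tiebreak = count()                                   # avoids comparing word tuples on ties
    heap = [(0.0, next(tiebreak), (), tuple(sorted(bag)))]
    while heap:
        _, _, prefix, remaining = heapq.heappop(heap)
        if not remaining:
            return list(prefix), score(prefix)
        for i, w in enumerate(remaining):
            new_prefix = prefix + (w,)
            rest = remaining[:i] + remaining[i + 1:]
            optimistic = score(new_prefix) + MAX_BIGRAM * len(rest)
            heapq.heappush(heap, (-optimistic, next(tiebreak), new_prefix, rest))
    return [], 0.0

print(order_words(["sat", "the", "down", "cat"]))        # -> (['the', 'cat', 'sat', 'down'], 5.0)
```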

40 citations


Journal ArticleDOI
TL;DR: This article formalizes the task of cross-lingual sentiment lexicon learning as a learning problem on a bilingual word graph, and shows that both synonym and antonym word relations can be used to build the intra-language relation.
Abstract: In this article we address the task of cross-lingual sentiment lexicon learning, which aims to automatically generate sentiment lexicons for the target languages with available English sentiment lexicons. We formalize the task as a learning problem on a bilingual word graph, in which the intra-language relations among the words in the same language and the inter-language relations among the words between different languages are properly represented. With the words in the English sentiment lexicon as seeds, we propose a bilingual word graph label propagation approach to induce sentiment polarities of the unlabeled words in the target language. Particularly, we show that both synonym and antonym word relations can be used to build the intra-language relation, and that the word alignment information derived from bilingual parallel sentences can be effectively leveraged to build the inter-language relation. The evaluation of Chinese sentiment lexicon learning shows that the proposed approach outperforms existing approaches in both precision and recall. Experiments conducted on the NTCIR data set further demonstrate the effectiveness of the learned sentiment lexicon in sentence-level sentiment classification.
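A minimal sketch of label propagation on a word graph in the spirit of the approach described above: seed words carry known polarities, edges carry weights (with antonym edges encoded here as negative weights), and unlabeled words repeatedly take on the weighted average polarity of their neighbors. The graph, weights, toy romanized Chinese forms, and update rule are illustrative assumptions, not the paper's exact formulation.

```python
# Edges: (word_a, word_b, weight); negative weight encodes an antonym relation.
# Cross-language edges stand in for word-alignment links from parallel sentences.
EDGES = [
    ("good", "excellent", 1.0),     # synonym (intra-language)
    ("good", "bad", -1.0),          # antonym (intra-language)
    ("good", "hao3", 0.8),          # alignment link (inter-language, toy romanized form)
    ("bad", "huai4", 0.8),
    ("excellent", "hao3", 0.5),
]
SEEDS = {"good": 1.0, "bad": -1.0}  # English sentiment lexicon entries

# Build an undirected adjacency list.
neighbors = {}
for a, b, w in EDGES:
    neighbors.setdefault(a, []).append((b, w))
    neighbors.setdefault(b, []).append((a, w))

scores = {node: SEEDS.get(node, 0.0) for node in neighbors}
for _ in range(20):                          # propagate until approximately stable
    updated = {}
    for node in scores:
        if node in SEEDS:                    # seed polarities stay clamped
            updated[node] = SEEDS[node]
            continue
        total = sum(abs(w) for _, w in neighbors[node])
        updated[node] = sum(w * scores[other] for other, w in neighbors[node]) / total
    scores = updated

for word, s in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{word:10s} {s:+.2f}")
```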

Journal ArticleDOI
TL;DR: This article presents methods for learning transitive graphs that contain tens of thousands of nodes, where nodes represent predicates and edges correspond to entailment rules (termed entailment graphs), and demonstrates that these methods for the first time scale to large graphs containing 20,000 nodes and more than 100,000 edges.
Abstract: Entailment rules between predicates are fundamental to many semantic-inference applications. Consequently, learning such rules has been an active field of research in recent years. Methods for learning entailment rules between predicates that take into account dependencies between different rules (e.g., entailment is a transitive relation) have been shown to improve rule quality, but suffer from scalability issues; that is, the number of predicates handled is often quite small. In this article, we present methods for learning transitive graphs that contain tens of thousands of nodes, where nodes represent predicates and edges correspond to entailment rules (termed entailment graphs). Our methods are able to scale to a large number of predicates by exploiting structural properties of entailment graphs such as the fact that they exhibit a "tree-like" property. We apply our methods on two data sets and demonstrate that our methods find high-quality solutions faster than methods proposed in the past, and moreover our methods for the first time scale to large graphs containing 20,000 nodes and more than 100,000 edges.
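To make the transitivity constraint concrete, the sketch below computes the transitive closure of a small set of candidate entailment rules, so every rule implied by chaining becomes explicit. The article's contribution is scaling this kind of reasoning to tens of thousands of predicates by exploiting graph structure; the brute-force closure and the toy rules here are purely illustrative.

```python
def transitive_closure(edges):
    """Add every rule implied by chaining (a -> b and b -> c imply a -> c)."""
    closure = set(edges)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure

# Toy candidate entailment rules between predicates.
rules = {
    ("X purchase Y", "X acquire Y"),
    ("X acquire Y", "X own Y"),
}
for lhs, rhs in sorted(transitive_closure(rules)):
    print(f"{lhs}  ->  {rhs}")          # the chained rule "X purchase Y -> X own Y" is added
```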

Journal ArticleDOI
TL;DR: A mathematical and empirical verification of computational constancy measures for natural language text is presented, showing how K is essentially equivalent to an approximation of the second-order Rényi entropy, thus indicating its signification within language science.
Abstract: This article presents a mathematical and empirical verification of computational constancy measures for natural language text. A constancy measure characterizes a given text by having an invariant value for any size larger than a certain amount. The study of such measures has a 70-year history dating back to Yule's K, with the original intended application of author identification. We examine various measures proposed since Yule and reconsider reports made so far, thus overviewing the study of constancy measures. We then explain how K is essentially equivalent to an approximation of the second-order Rényi entropy, thus indicating its signification within language science. We then empirically examine constancy measure candidates within this new, broader context. The approximated higher-order entropy exhibits stable convergence across different languages and kinds of text. We also show, however, that it cannot identify authors, contrary to Yule's intention. Lastly, we apply K to two unknown scripts, the Voynich manuscript and Rongorongo, and show how the results support previous hypotheses about these scripts.
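The connection the abstract points to can be stated concretely: Yule's K = 10^4 (S2 - S1) / S1^2, where S1 is the number of tokens and S2 the sum of squared type frequencies, while the second-order Rényi entropy is H2 = -log Σ_w p_w^2; since Σ_w p_w^2 = S2 / S1^2, K is essentially 10^4 exp(-H2) up to a small-sample correction. The sketch below computes both from raw token counts; the toy text is an assumption for illustration.

```python
import math
from collections import Counter

def yules_k(tokens):
    """Yule's K = 10^4 * (S2 - S1) / S1^2, with S1 = token count, S2 = sum of squared frequencies."""
    counts = Counter(tokens)
    n = sum(counts.values())
    s2 = sum(c * c for c in counts.values())
    return 1e4 * (s2 - n) / (n * n)

def renyi_h2(tokens):
    """Second-order Renyi entropy H2 = -log( sum_w p_w^2 )."""
    counts = Counter(tokens)
    n = sum(counts.values())
    return -math.log(sum((c / n) ** 2 for c in counts.values()))

text = ("the cat sat on the mat and the dog sat on the rug " * 50).split()
k, h2 = yules_k(text), renyi_h2(text)
print(f"Yule's K       = {k:.2f}")
print(f"Renyi H2       = {h2:.4f}")
print(f"1e4 * exp(-H2) = {1e4 * math.exp(-h2):.2f}   # close to K for large samples")
```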

Journal ArticleDOI
TL;DR: A method for extracting narrative recall scores automatically and highly accurately from a word-level alignment between a retelling and the source narrative is presented and improvements to existing machine translation–based systems for word alignment are proposed, including a novel method of word alignment relying on random walks on a graph that achieves alignment accuracy superior to that of standard expectation maximization–based techniques.
Abstract: Among the more recent applications for natural language processing algorithms has been the analysis of spoken language data for diagnostic and remedial purposes, fueled by the demand for simple, objective, and unobtrusive screening tools for neurological disorders such as dementia. The automated analysis of narrative retellings in particular shows potential as a component of such a screening tool since the ability to produce accurate and meaningful narratives is noticeably impaired in individuals with dementia and its frequent precursor, mild cognitive impairment, as well as other neurodegenerative and neurodevelopmental disorders. In this article, we present a method for extracting narrative recall scores automatically and highly accurately from a word-level alignment between a retelling and the source narrative. We propose improvements to existing machine translation-based systems for word alignment, including a novel method of word alignment relying on random walks on a graph that achieves alignment accuracy superior to that of standard expectation maximization-based techniques for word alignment in a fraction of the time required for expectation maximization. In addition, the narrative recall score features extracted from these high-quality word alignments yield diagnostic classification accuracy comparable to that achieved using manually assigned scores and significantly higher than that achieved with summary-level text similarity metrics used in other areas of NLP. These methods can be trivially adapted to spontaneous language samples elicited with non-linguistic stimuli, thereby demonstrating the flexibility and generalizability of these methods.

Journal ArticleDOI
TL;DR: This article investigates the formal significance of the difference between the classical CCG formalism, which allows per-grammar restriction of combinatory rules, and modern CCG, which assumes a universal rule set and isolates all cross-linguistic variation in the lexicon; the main result is that lexicalized versions of the classical formalism are strictly less powerful than TAG.
Abstract: The weak equivalence of Combinatory Categorial Grammar (CCG) and Tree-Adjoining Grammar (TAG) is a central result of the literature on mildly context-sensitive grammar formalisms. However, the categorial formalism for which this equivalence has been established differs significantly from the versions of CCG that are in use today. In particular, it allows restriction of combinatory rules on a per grammar basis, whereas modern CCG assumes a universal set of rules, isolating all cross-linguistic variation in the lexicon. In this article we investigate the formal significance of this difference. Our main result is that lexicalized versions of the classical CCG formalism are strictly less powerful than TAG.

Journal ArticleDOI
TL;DR: An effective framework for the difficult task of inducing implicit arguments and their antecedents in discourse is established and empirically demonstrate the importance of modeling this phenomenon in discourse-level tasks.
Abstract: In this article, we investigate aspects of sentential meaning that are not expressed in local predicate-argument structures. In particular, we examine instances of semantic arguments that are only inferable from discourse context. The goal of this work is to automatically acquire and process such instances, which we also refer to as implicit arguments, to improve computational models of language. As contributions towards this goal, we establish an effective framework for the difficult task of inducing implicit arguments and their antecedents in discourse and empirically demonstrate the importance of modeling this phenomenon in discourse-level tasks. Our framework builds upon a novel projection approach that allows for the accurate detection of implicit arguments by aligning and comparing predicate-argument structures across pairs of comparable texts. As part of this framework, we develop a graph-based model for predicate alignment that significantly outperforms previous approaches. Based on such alignments, we show that implicit argument instances can be automatically induced and applied to improve a current model of linking implicit arguments in discourse. We further validate that decisions on argument realization, although being a subtle phenomenon most of the time, can considerably affect the perceived coherence of a text. Our experiments reveal that previous models of coherence are not able to predict this impact. Consequently, we develop a novel coherence model, which learns to accurately predict argument realization based on automatically aligned pairs of implicit and explicit arguments.

Journal ArticleDOI
TL;DR: Computational Lexicography at Brighton was born, and Adam was one of the founding organizers of SENSEVAL, an initiative to bring international teams of researchers together to work in friendly competition on a pre-determined word sense disambiguation task.
Abstract: A long time ago now (maybe 1988?), Gerald (Gazdar) and I supervised Adam’s DPhil at the University of Sussex. Adam was my age, give or take a year, having come to academia a little late, and was my first doctoral student. Adam’s topic was polysemy, and I’m not really sure that much supervision was actually required, though I recall fun exchanges trying to model the subtleties of word meaning using symbolic knowledge representation techniques—an experience that was clearly enough to convince Adam later that this was a bad idea. In fact, Adam’s thesis title itself was Polysemy. Much as we encourage short thesis titles, pulling off the one-word title is a tall order, requiring a unique combination of focus and coverage, breadth and depth, and, most of all, authority. Adam completely nailed it, at least from the perspective of the pre-empirical Computational Linguistics of the early 1990s. Three years later, after a spell working for dictionary publishers, Adam joined me as a research fellow, now at the University of Brighton. I had a project to explore the automatic enrichment of lexical databases to support the latest trends in language analysis, and, in particular, task-specific lexical resources. I was really pleased and excited to recruit Adam—he had lost none of his intellectual independence, a quality I particularly valued. Within a few weeks he came to me with his own plan for the research—a “detour,” as he put it, from the original workplan. I still have the e-mail, dated 6 April 1995, in which he proposed that, instead of chasing a prescriptive notion of a single lexical resource that needed to be customized to each domain, we should let the domain determine the lexicon, providing lexicographic tools to explore words, and particularly word senses, that were significant for that domain. In that e-mail, Computational Lexicography at Brighton was born. Over the next eight years or so, Computational Lexicography became a key part of our group’s success, increasingly under Adam’s direct leadership. The key project, WASPS, developed the WASPbench—the direct precursor of the Sketch Engine, recruiting David (Tugwell) to the team. In addition, Adam was one of the founding organizers of SENSEVAL, an initiative to bring international teams of researchers together to work in friendly competition on a pre-determined word sense disambiguation task (and which has now transformed into SEMEVAL). Together we secured funding to support the first two rounds of SENSEVAL; each round required the preparation of standardized data sets, guided by Adam’s highly tuned intuitions about lexical data preparation and management. And we engaged somewhat in the European funding merry-go-round, most fondly in the CONCEDE project, working on dictionaries for Central European languages with amazing teams from the MULTEXT-EAST consortium, and with Georgian and German colleagues in the GREG project.

Journal ArticleDOI
Mark Dras
TL;DR: This work investigates one approach to evaluating human evaluation in NLP, Log-Linear Bradley-Terry models, and applies it to sample NLP data.
Abstract: Human evaluation plays an important role in NLP, often in the form of preference judgments. Although there has been some use of classical non-parametric and bespoke approaches to evaluating these sorts of judgments, there is an entire body of work on this in the context of sensory discrimination testing and the human judgments that are central to it, backed by rigorous statistical theory and freely available software, that NLP can draw on. We investigate one approach, Log-Linear Bradley-Terry models, and apply it to sample NLP data.
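As a concrete illustration of the basic Bradley-Terry machinery behind such preference-judgment analyses, the sketch below fits "worth" parameters to toy pairwise win counts using the standard iterative (MM) update; the counts are invented, and the log-linear extensions discussed in the article (covariate structure on the worths) are not included.

```python
from collections import defaultdict

# Toy preference counts: wins[(a, b)] = times judges preferred system a over system b.
wins = {("A", "B"): 7, ("B", "A"): 3,
        ("A", "C"): 8, ("C", "A"): 2,
        ("B", "C"): 6, ("C", "B"): 4}

items = sorted({x for pair in wins for x in pair})
total_wins = defaultdict(float)
comparisons = defaultdict(float)
for (a, b), n in wins.items():
    total_wins[a] += n
    comparisons[(a, b)] += n        # total head-to-head comparisons, counted symmetrically
    comparisons[(b, a)] += n

# Minorization-maximization updates for Bradley-Terry "worth" parameters.
worth = {i: 1.0 for i in items}
for _ in range(100):
    new = {}
    for i in items:
        denom = sum(comparisons[(i, j)] / (worth[i] + worth[j]) for j in items if j != i)
        new[i] = total_wins[i] / denom
    z = sum(new.values())           # normalize so the worths sum to 1
    worth = {i: v / z for i, v in new.items()}

for i in items:
    print(f"system {i}: worth = {worth[i]:.3f}")
```

The fitted worths yield preference probabilities via P(i preferred to j) = worth_i / (worth_i + worth_j).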

Journal ArticleDOI
TL;DR: By ignoring the statistical dependence of the text messages published in social media, standard cross-validation can result in misleading conclusions in a machine learning task, and this work explores alternative evaluation methods that explicitly deal with statistical dependence in text.
Abstract: In recent years, many studies have been published on data collected from social media, especially microblogs such as Twitter. However, rather few of these studies have considered evaluation methodologies that take into account the statistically dependent nature of such data, which breaks the theoretical conditions for using cross-validation. Despite concerns raised in the past about using cross-validation for data of similar characteristics, such as time series, some of these studies evaluate their work using standard k-fold cross-validation. Through experiments on Twitter data collected during a two-year period that includes disastrous events, we show that by ignoring the statistical dependence of the text messages published in social media, standard cross-validation can result in misleading conclusions in a machine learning task. We explore alternative evaluation methods that explicitly deal with statistical dependence in text. Our work also raises concerns for any other data for which similar conditions might hold.
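A minimal sketch of the evaluation contrast at issue: standard shuffled k-fold cross-validation mixes temporally dependent messages across training and test folds, whereas a temporal split keeps training data strictly earlier than test data. The random features, labels, and the use of scikit-learn's TimeSeriesSplit are illustrative choices, not the specific alternatives evaluated in the article.

```python
import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit

# Placeholder "tweets": 100 messages ordered by timestamp, with arbitrary binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                 # stand-in feature vectors
y = rng.integers(0, 2, size=100)              # stand-in labels
# (In real data, nearby indices = nearby timestamps = statistically dependent messages.)

print("Standard shuffled k-fold (mixes past and future, ignores dependence):")
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    print(f"  train up to idx {train_idx.max():3d}, test indices span "
          f"{test_idx.min():3d}-{test_idx.max():3d}")

print("Temporal split (training data always precedes test data):")
for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X):
    print(f"  train idx 0-{train_idx.max():3d}, test idx {test_idx.min():3d}-{test_idx.max():3d}")
```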

Journal ArticleDOI
TL;DR: It is shown that similarity equations of an important class of composition methods can be decomposed into operations performed on the subparts of the input phrases, establishing a strong link between these models and convolution kernels.
Abstract: Distributional semantics has been extended to phrases and sentences by means of composition operations. We look at how these operations affect similarity measurements, showing that similarity equations of an important class of composition methods can be decomposed into operations performed on the subparts of the input phrases. This establishes a strong link between these models and convolution kernels.
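For the simplest composition operation, vector addition, the decomposition referred to above can be verified directly: the dot product of two composed phrase vectors equals the sum of dot products between all pairs of their subparts, which has exactly the form of a convolution-kernel computation over the phrases' components. The vectors below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
u1, u2 = rng.normal(size=4), rng.normal(size=4)       # words of phrase 1 (e.g., "red car")
v1, v2 = rng.normal(size=4), rng.normal(size=4)       # words of phrase 2 (e.g., "scarlet auto")

# Additive composition: phrase vector = sum of word vectors.
p, q = u1 + u2, v1 + v2

whole = p @ q                                          # similarity of the composed phrases
parts = (u1 @ v1) + (u1 @ v2) + (u2 @ v1) + (u2 @ v2)  # sum of part-wise similarities
print(f"dot(p, q)            = {whole:.6f}")
print(f"sum of part products = {parts:.6f}   # identical: the similarity decomposes")
```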

Journal ArticleDOI
TL;DR: The problem of annotation adaptation and the intrinsic principles of the solutions are described, and a series of successively enhanced models that can automatically adapt the divergence between different annotation formats are presented.
Abstract: Manually annotated corpora are indispensable resources, yet for many annotation tasks, such as the creation of treebanks, there exist multiple corpora with different and incompatible annotation guidelines. This leads to an inefficient use of human expertise, but it could be remedied by integrating knowledge across corpora with different annotation guidelines. In this article we describe the problem of annotation adaptation and the intrinsic principles of the solutions, and present a series of successively enhanced models that can automatically adapt the divergence between different annotation formats. We evaluate our algorithms on the tasks of Chinese word segmentation and dependency parsing. For word segmentation, where there are no universal segmentation guidelines because of the lack of morphology in Chinese, we perform annotation adaptation from the much larger People's Daily corpus to the smaller but more popular Penn Chinese Treebank. For dependency parsing, we perform annotation adaptation from the Penn Chinese Treebank to a semantics-oriented Dependency Treebank, which is annotated using significantly different annotation guidelines. In both experiments, automatic annotation adaptation brings significant improvement, achieving state-of-the-art performance despite the use of purely local features in training.

Journal ArticleDOI
TL;DR: The resulting algorithm combines the benefits of independent derivations with those of Feature-Based grammars and accounts for a range of interactions between dependent vs. independent derivation on the one hand, and syntactic constraints, linear ordering, and scopal vs. nonscopal semantic dependencies on the other hand.
Abstract: In parsing with Tree Adjoining Grammar (TAG), independent derivations have been shown by Schabes and Shieber (1994) to be essential for correctly supporting syntactic analysis, semantic interpretation, and statistical language modeling. However, the parsing algorithm they propose is not directly applicable to Feature-Based TAGs (FB-TAG). We provide a recognition algorithm for FB-TAG that supports both dependent and independent derivations. The resulting algorithm combines the benefits of independent derivations with those of Feature-Based grammars. In particular, we show that it accounts for a range of interactions between dependent vs. independent derivation on the one hand, and syntactic constraints, linear ordering, and scopal vs. nonscopal semantic dependencies on the other hand.

Journal ArticleDOI
TL;DR: This article evaluates two algorithmic heuristics applied to design large text corpora in English and French for covering phonological information or POS labels and proposes a generalization where the constraints on each covering feature can be multi-valued.
Abstract: Linguistic corpus design is a critical concern for building rich annotated corpora useful in different domains of applications. For example, speech technologies such as ASR (Automatic Speech Recognition) or TTS (Text-to-Speech) need a huge amount of speech data to train data-driven models or to produce synthetic speech. Collecting data is always related to costs (recording speech, verifying annotations, etc.), and as a rule of thumb, the more data you gather, the more costly your application will be. Within this context, we present in this article solutions to reduce the amount of linguistic text content while maintaining a sufficient level of linguistic richness required by a model or an application. This problem can be formalized as a Set Covering Problem (SCP), and we evaluate two algorithmic heuristics applied to design large text corpora in English and French for covering phonological information or POS labels. The first considered algorithm is a standard greedy solution with an agglomerative/splitting strategy, and we propose a second algorithm based on Lagrangian relaxation. The latter approach provides a lower bound to the cost of each covering solution. This lower bound can be used as a metric to evaluate the quality of a reduced corpus whatever the algorithm applied. Experiments show that a suboptimal algorithm like a greedy algorithm achieves good results; the cost of its solutions is not so far from the lower bound (about 4.35% for 3-phoneme coverings). Usually, constraints in SCP are binary; we propose here a generalization where the constraints on each covering feature can be multi-valued.
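The standard greedy heuristic for the Set Covering Problem can be stated in a few lines: repeatedly add the sentence that covers the most still-uncovered features. The candidate sentences and toy diphone-like features below are invented, and neither the article's agglomerative/splitting variant nor its Lagrangian-relaxation lower bound is reproduced here.

```python
# Each candidate sentence is mapped to the set of covering features it contributes
# (toy diphone-like units; in the article these are phonological units or POS labels).
CANDIDATES = {
    "s1": {"a-b", "b-c", "c-d"},
    "s2": {"a-b", "d-e"},
    "s3": {"c-d", "d-e", "e-f"},
    "s4": {"e-f"},
}
TARGET = {"a-b", "b-c", "c-d", "d-e", "e-f"}

def greedy_cover(candidates, target):
    """Greedy SCP heuristic: always add the sentence covering the most uncovered features."""
    uncovered, chosen = set(target), []
    while uncovered:
        best = max(candidates, key=lambda s: len(candidates[s] & uncovered))
        if not candidates[best] & uncovered:
            raise ValueError("target features cannot all be covered")
        chosen.append(best)
        uncovered -= candidates[best]
    return chosen

print(greedy_cover(CANDIDATES, TARGET))   # e.g. ['s1', 's3'] covers all five features
```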

Journal ArticleDOI
TL;DR: This paper presents findings from a larger investigation of authorship attribution methods, focusing on the effects of normalization methods and distance measures across different languages, and describes the aims, data, and methods of the study.
Abstract: Authorship Attribution is a research area in quantitative text analysis concerned with attributing texts of unknown or disputed authorship to their actual author based on quantitatively measured linguistic evidence (see Juola 2006; Stamatatos 2009; Koppel et al. 2009). Authorship attribution has applications in literary studies, history, forensics and many other fields, e.g. corpus stylistics (Oakes 2009). The fundamental assumption in authorship attribution is that individuals have idiosyncratic habits of language use, leading to a stylistic similarity of texts written by the same person. Many of these stylistic habits can be measured by assessing the relative frequencies of function words or parts of speech, vocabulary richness, and many other linguistic features. Distance metrics between the resulting feature vectors indicate the overall similarity of texts to each other, and can be used for attributing a text of unknown authorship to the most similar of a (usually closed) set of candidate authors. The aim of this paper is to present findings from a larger investigation of authorship attribution methods which centres around the following questions: (a) How and why exactly does authorship attribution based on distance measures work? (b) Why do different distance measures and normalization strategies perform differently? (c) Specifically, why do they perform differently for different languages and language families, and (d) How can such knowledge be used to improve authorship attribution methods? First, we describe current issues in authorship attribution and contextualize our own work. Second, we report some of our earlier research into the question. Then, we present our most recent investigation, which pertains to the effects of normalization methods and distance measures in different languages, describing our aims, data and methods. We conclude with a summary of our results.
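A minimal sketch of the distance-based pipeline the abstract describes: function-word relative frequencies are z-scored across the corpus (one of the normalization strategies under study) and the disputed text is attributed to the candidate with the smallest distance, here the mean absolute difference of z-scores (a Burrows-Delta-style measure). All frequencies below are invented.

```python
import numpy as np

FUNCTION_WORDS = ["the", "of", "and", "to", "in"]

# Relative frequencies (per 1,000 tokens) of the function words in each text (toy numbers).
texts = {
    "author_A_known": np.array([61.0, 30.2, 27.5, 24.9, 20.1]),
    "author_B_known": np.array([52.3, 35.8, 22.1, 28.4, 17.6]),
    "disputed":       np.array([60.1, 30.9, 26.8, 25.3, 19.7]),
}

# z-score each feature across all texts (the normalization step under investigation).
matrix = np.vstack(list(texts.values()))
mu, sigma = matrix.mean(axis=0), matrix.std(axis=0)
z = {name: (freqs - mu) / sigma for name, freqs in texts.items()}

# Mean absolute difference between z-score profiles (a "Delta"-style distance).
for candidate in ["author_A_known", "author_B_known"]:
    delta = np.abs(z[candidate] - z["disputed"]).mean()
    print(f"distance(disputed, {candidate}) = {delta:.3f}")
```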

Journal ArticleDOI
TL;DR: Analysis of a set of typed texts in Brazilian Portuguese shows that diacritical marks play a major role, as indicated by the frequency of mistakes involving them, rendering Damerau's original findings largely unfit for spelling correction systems in this language, although they remain useful if such marks are set aside.
Abstract: Fifty years after Damerau set up his statistics for the distribution of errors in typed texts, his findings are still used in a range of different languages. Because these statistics were derived from texts in English, the question of whether they actually apply to other languages has been raised. We address this issue through the analysis of a set of typed texts in Brazilian Portuguese, deriving statistics tailored to this language. Results show that diacritical marks play a major role, as indicated by the frequency of mistakes involving them, thereby rendering Damerau's original findings mostly unfit for spelling correction systems, although still holding them useful, should one set aside such marks. Furthermore, a comparison between these results and those published for Spanish shows no statistically significant differences between the two languages, an indication that the distribution of spelling errors depends on the adopted character set rather than the language itself.
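For reference, Damerau's single-error categories are a wrong letter (substitution), a missing letter (deletion), an extra letter (insertion), and a transposition of two adjacent letters. The sketch below classifies a (typo, intended word) pair into these categories, the kind of bookkeeping behind error-distribution statistics like these; the example pairs are invented, including one whose only error is a dropped diacritic, the case the article shows to matter for Brazilian Portuguese.

```python
def damerau_error_type(typo: str, target: str):
    """Classify a single-error typo against its intended form."""
    if typo == target:
        return None
    if len(typo) == len(target):
        diffs = [i for i, (a, b) in enumerate(zip(typo, target)) if a != b]
        if len(diffs) == 1:
            return "substitution"
        if (len(diffs) == 2 and diffs[1] == diffs[0] + 1
                and typo[diffs[0]] == target[diffs[1]]
                and typo[diffs[1]] == target[diffs[0]]):
            return "transposition"
    if len(typo) == len(target) + 1:                      # typo has one extra character
        for i in range(len(typo)):
            if typo[:i] + typo[i + 1:] == target:
                return "insertion"
    if len(typo) == len(target) - 1:                      # typo is missing one character
        for i in range(len(target)):
            if target[:i] + target[i + 1:] == typo:
                return "deletion"
    return "multiple/other"

# Invented examples; "nao" vs "não" shows a diacritic-only substitution.
for typo, target in [("teh", "the"), ("ct", "cat"), ("catt", "cat"), ("nao", "não")]:
    print(f"{typo!r:8} -> {target!r:8}: {damerau_error_type(typo, target)}")
```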



Journal ArticleDOI
TL;DR: This volume starts its own introduction by praising Web corpora for their size, ease of construction, and availability as a source of new text types.
Abstract: The Web is the main source of data in modern computational linguistics. Other volumes in the same series, for example, Introductions to Opinion Mining (Liu 2012) and Semisupervised Machine Learning (Søgaard 2013), start their problem statements by referring to data from the Web. This volume starts its own introduction by praising Web corpora for their size, ease of construction, and availability as a source of new text types. A random check of papers from the most recent ACL meeting also shows that the majority of them use Web data in one way or another. Our field definitely needs a comprehensive overview and a DIY manual for the task of constructing a corpus from the Web. This book is, to the best of my knowledge, the first attempt at providing such an overview.

Journal ArticleDOI
TL;DR: The main contribution of the book, driving semantic processing from the ground up by a formal domain-specific ontology, is elaborated in ten well-structured chapters spanning 143 pages of content.
Abstract: A book aiming to build a bridge between two fields that share the subject of research but do not share the same views necessarily puts itself in a difficult position: The authors have either to strike a fair balance at peril of dissatisfying both sides or nail their colors to the mast and cater mainly to one of two communities. For semantic processing of natural language with either NLP methods or Semantic Web approaches, the authors clearly favor the latter and propose a strictly ontology-driven interpretation of natural language. The main contribution of the book, driving semantic processing from the ground up by a formal domain-specific ontology, is elaborated in ten well-structured chapters spanning 143 pages of content.

Journal ArticleDOI
TL;DR: The recipient of the ACL Lifetime Achievement Award, a veteran of NLP research, is fortunate to witness and be a part of its long yet inspiring journey in China and wants to share his experience and thoughts with you.
Abstract: Good afternoon, ladies and gentlemen. I am standing here, grateful, excited, and proud. I see so many friends, my colleagues, students, and many more researchers in this room. I see that the work we started 50 years ago is now flourishing and is embedded in people’s everyday lives. I see for the first time that the ACL conference is held here in Beijing, China. And I am deeply honored to be awarded the Lifetime Achievement Award of 2015. I want to thank the ACL for giving me the Lifetime Achievement Award of 2015. It is the appreciation of not only my work, but also of the work that my fellow researchers, my colleagues, and my students have done through all these years. It is an honor for all of us. As a veteran of NLP research, I am fortunate to witness and be a part of its long yet inspiring journey in China. So today, to everyone here, my friends, colleagues, and students, either well-known scientists or young researchers: I’d like to share my experience and thoughts with you.