
Showing papers in "Computational Linguistics in 2009"


Journal ArticleDOI
TL;DR: The goal of this work is to automatically distinguish between prior and contextual polarity, with a focus on understanding which features are important for this task, and it is shown that the presence of neutral instances greatly degrades the performance of features for distinguishing between positive and negative polarity.
Abstract: Many approaches to automatic sentiment analysis begin with a large lexicon of words marked with their prior polarity (also called semantic orientation). However, the contextual polarity of the phrase in which a particular instance of a word appears may be quite different from the word's prior polarity. Positive words are used in phrases expressing negative sentiments, or vice versa. Also, quite often words that are positive or negative out of context are neutral in context, meaning they are not even being used to express a sentiment. The goal of this work is to automatically distinguish between prior and contextual polarity, with a focus on understanding which features are important for this task. Because an important aspect of the problem is identifying when polar terms are being used in neutral contexts, features for distinguishing between neutral and polar instances are evaluated, as well as features for distinguishing between positive and negative contextual polarity. The evaluation includes assessing the performance of features across multiple machine learning algorithms. For all learning algorithms except one, the combination of all features together gives the best performance. Another facet of the evaluation considers how the presence of neutral instances affects the performance of features for distinguishing between positive and negative polarity. These experiments show that the presence of neutral instances greatly degrades the performance of these features, and that perhaps the best way to improve performance across all polarity classes is to improve the system's ability to identify when an instance is neutral.
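
To make the two-step setup concrete, here is a minimal, hedged sketch of the pipeline the abstract describes: one classifier separates neutral from polar instances, and a second assigns positive or negative contextual polarity to the instances judged polar. The toy lexicon, the feature names, and the helper extract_features are illustrative assumptions, not the paper's actual feature set.

```python
# Hedged sketch of the two-step setup: neutral vs. polar first, then
# positive vs. negative on the polar instances only. All features and data
# below are illustrative placeholders.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

PRIOR_LEXICON = {"brilliant": "positive", "awful": "negative", "trust": "positive"}

def extract_features(instance):
    """Toy contextual features for one clue word in its sentence."""
    word, sentence = instance
    tokens = sentence.lower().split()
    return {
        "prior_polarity": PRIOR_LEXICON.get(word, "neutral"),
        "negation_in_sentence": any(t in {"not", "never", "no"} for t in tokens),
        "intensifier_in_sentence": any(t in {"very", "really"} for t in tokens),
    }

def train_two_step(instances, gold):
    """gold labels are 'neutral', 'positive', or 'negative' (contextual polarity)."""
    vec = DictVectorizer()
    X = vec.fit_transform([extract_features(i) for i in instances])
    # step 1: neutral vs. polar
    neutral_clf = LogisticRegression().fit(X, [g == "neutral" for g in gold])
    # step 2: positive vs. negative, trained on polar instances only
    polar_idx = [i for i, g in enumerate(gold) if g != "neutral"]
    polarity_clf = LogisticRegression().fit(X[polar_idx], [gold[i] for i in polar_idx])
    return vec, neutral_clf, polarity_clf
```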

677 citations


Journal ArticleDOI
TL;DR: Alignment templates, phrase-based models, and stochastic finite-state transducers are used to develop computer-assisted translation systems in a European project in two real tasks.
Abstract: Current machine translation (MT) systems are still not perfect. In practice, the output from these systems needs to be edited to correct errors. A way of increasing the productivity of the whole translation process (MT plus human work) is to incorporate the human correction activities within the translation process itself, thereby shifting the MT paradigm to that of computer-assisted translation. This model entails an iterative process in which the human translator activity is included in the loop: In each iteration, a prefix of the translation is validated (accepted or amended) by the human and the system computes its best (or n-best) translation suffix hypothesis to complete this prefix. A successful framework for MT is the so-called statistical (or pattern recognition) framework. Interestingly, within this framework, the adaptation of MT systems to the interactive scenario affects mainly the search process, allowing a great reuse of successful techniques and models. In this article, alignment templates, phrase-based models, and stochastic finite-state transducers are used to develop computer-assisted translation systems. These systems were assessed in a European project (TransType2) in two real tasks: the translation of printer manuals and the translation of the Bulletin of the European Union. In each task, the following three pairs of languages were involved (in both translation directions): English-Spanish, English-German, and English-French.
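
The interactive loop the abstract describes can be sketched as follows. Here best_suffix is only a placeholder for the system's constrained search for the most probable completion of the validated prefix, and get_user_correction stands in for the human in the loop; both names are assumptions for illustration.

```python
# Minimal sketch of the interactive prefix-completion protocol. `best_suffix`
# is a stub for the system's search over the translation models; it is not an
# implementation of any particular model from the article.
def best_suffix(source, prefix):
    """Placeholder for constrained search: best suffix completing `prefix`."""
    raise NotImplementedError

def interactive_translate(source, get_user_correction):
    prefix = ""
    while True:
        suffix = best_suffix(source, prefix)            # system proposes a completion
        hypothesis = prefix + suffix
        accepted, correction = get_user_correction(hypothesis, prefix)
        if accepted:                                    # full hypothesis validated
            return hypothesis
        # the user amends the first wrong word; everything up to and including
        # the correction becomes the new validated prefix for the next iteration
        prefix = correction
```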

238 citations


Journal ArticleDOI
TL;DR: The results of two studies of how well some metrics which are popular in other areas of NLP correlate with human judgments in the domain of computer-generated weather forecasts suggest that, at least in this domain, metrics may provide a useful measure of language quality, although the evidence for this is not as strong as one would ideally like to see.
Abstract: There is growing interest in using automatically computed corpus-based evaluation metrics to evaluate Natural Language Generation (NLG) systems, because these are often considerably cheaper than the human-based evaluations which have traditionally been used in NLG. We review previous work on NLG evaluation and on validation of automatic metrics in NLP, and then present the results of two studies of how well some metrics which are popular in other areas of NLP (notably BLEU and ROUGE) correlate with human judgments in the domain of computer-generated weather forecasts. Our results suggest that, at least in this domain, metrics may provide a useful measure of language quality, although the evidence for this is not as strong as we would ideally like to see; however, they do not provide a useful measure of content quality. We also discuss a number of caveats which must be kept in mind when interpreting this and other validation studies.
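
The core of such a validation study is a correlation between automatic metric scores and human judgments over the same generated texts. A minimal sketch, with invented placeholder scores:

```python
# Correlate an automatic metric's scores with human ratings over the same
# items. The two score lists are illustrative placeholders only.
from scipy.stats import pearsonr, spearmanr

metric_scores = [0.42, 0.55, 0.31, 0.60, 0.48]   # e.g., BLEU or ROUGE per forecast
human_scores  = [3.1, 4.0, 2.5, 4.2, 3.6]        # e.g., mean human quality ratings

r, r_p = pearsonr(metric_scores, human_scores)
rho, rho_p = spearmanr(metric_scores, human_scores)
print(f"Pearson r = {r:.2f} (p = {r_p:.3f}), Spearman rho = {rho:.2f} (p = {rho_p:.3f})")
```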

194 citations


Journal ArticleDOI
TL;DR: This article develops statistical measures that each model a specific property of idiomatic expressions by looking at their actual usage patterns in text, and uses some of the measures in a token identification task where they distinguish idiomatic and literal usages of potentially idiomatic expressions in context.
Abstract: Idiomatic expressions are plentiful in everyday language, yet they remain mysterious, as it is not clear exactly how people learn and understand them. They are of special interest to linguists, psycholinguists, and lexicographers, mainly because of their syntactic and semantic idiosyncrasies as well as their unclear lexical status. Despite a great deal of research on the properties of idioms in the linguistics literature, there is not much agreement on which properties are characteristic of these expressions. Because of their peculiarities, idiomatic expressions have mostly been overlooked by researchers in computational linguistics. In this article, we look into the usefulness of some of the identified linguistic properties of idioms for their automatic recognition. Specifically, we develop statistical measures that each model a specific property of idiomatic expressions by looking at their actual usage patterns in text. We use these statistical measures in a type-based classification task where we automatically separate idiomatic expressions (expressions with a possible idiomatic interpretation) from similar-on-the-surface literal phrases (for which no idiomatic interpretation is possible). In addition, we use some of the measures in a token identification task where we distinguish idiomatic and literal usages of potentially idiomatic expressions in context.
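
As one hedged illustration of this kind of statistical measure (not the article's exact formulation), a lexical fixedness score can compare the association strength of a verb-noun pair with that of variants in which the noun is replaced by near-synonyms. The helpers below assume co-occurrence counts are available from a corpus.

```python
# Illustrative lexical-fixedness sketch: an idiomatic pair such as "spill the
# beans" should have a PMI that stands out from its lexical variants
# ("spill the peas", "spill the lentils", ...). Counts are assumed inputs.
import math

def pmi(pair_count, verb_count, noun_count, total_pairs):
    """Pointwise mutual information of a verb-noun co-occurrence."""
    p_pair = pair_count / total_pairs
    p_verb = verb_count / total_pairs
    p_noun = noun_count / total_pairs
    return math.log(p_pair / (p_verb * p_noun))

def lexical_fixedness(target_pmi, variant_pmis):
    """How far the target pair's PMI deviates from its variants (z-score style)."""
    mean = sum(variant_pmis) / len(variant_pmis)
    var = sum((x - mean) ** 2 for x in variant_pmis) / len(variant_pmis)
    std = math.sqrt(var) or 1.0
    return (target_pmi - mean) / std
```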

188 citations


Journal ArticleDOI
TL;DR: A Chinese word segmentation model learned from punctuation marks, which are perfect word delimiters, is presented; it is considerably more effective than previous methods in unknown word recognition.
Abstract: We present a Chinese word segmentation model learned from punctuation marks which are perfect word delimiters. The learning is aided by a manually segmented corpus. Our method is considerably more effective than previous methods in unknown word recognition. This is a step toward addressing one of the toughest problems in Chinese word segmentation.
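
The core idea can be sketched as follows: since punctuation is a perfect word delimiter, the characters just before and just after a punctuation mark yield reliable word-end and word-start examples on which a boundary classifier can be trained. The window size, punctuation set, and labels below are illustrative assumptions.

```python
# Harvest word-boundary training examples from punctuation positions: the
# character context before a punctuation mark must end a word, and the context
# after it must start one. Features and labels here are illustrative only.
PUNCT = set("，。！？；：、（）")

def harvest_boundary_examples(text, window=2):
    examples = []                       # (character_window, label) pairs
    for i, ch in enumerate(text):
        if ch in PUNCT:
            before = text[max(0, i - window):i]
            after = text[i + 1:i + 1 + window]
            if before:
                examples.append((before, "ends_word"))    # chars before punct end a word
            if after:
                examples.append((after, "starts_word"))   # chars after punct start a word
    return examples

print(harvest_boundary_examples("我爱北京，天安门。"))
```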

113 citations


Journal ArticleDOI
TL;DR: A generic architecture for a visually situated dialog system is described and the interactions between the spatial cognition module, which provides the interface to the models of prepositional semantics, and the other components in the architecture are highlighted.
Abstract: This article describes the application of computational models of spatial prepositions to visually situated dialog systems. In these dialogs, spatial prepositions are important because people often use them to refer to entities in the visual context of a dialog. We first describe a generic architecture for a visually situated dialog system and highlight the interactions between the spatial cognition module, which provides the interface to the models of prepositional semantics, and the other components in the architecture. Following this, we present two new computational models of topological and projective spatial prepositions. The main novelty within these models is the fact that they account for the contextual effect which other distractor objects in a visual scene can have on the region described by a given preposition. We next present psycholinguistic tests evaluating our approach to distractor interference on prepositional semantics, and illustrate how these models are used for both interpretation and generation of prepositional expressions.

91 citations


Journal ArticleDOI
TL;DR: A novel bootstrapping approach for improving the quality of feature vector weighting in distributional word similarity, motivated by attempts to utilize distributional similarity for identifying the concrete semantic relationship of lexical entailment.
Abstract: This article presents a novel bootstrapping approach for improving the quality of feature vector weighting in distributional word similarity. The method was motivated by attempts to utilize distributional similarity for identifying the concrete semantic relationship of lexical entailment. Our analysis revealed that a major reason for the rather loose semantic similarity obtained by distributional similarity methods is insufficient quality of the word feature vectors, caused by deficient feature weighting. This observation led to the definition of a bootstrapping scheme which yields improved feature weights, and hence higher quality feature vectors. The underlying idea of our approach is that features which are common to similar words are also most characteristic for their meanings, and thus should be promoted. This idea is realized via a bootstrapping step applied to an initial standard approximation of the similarity space. The superior performance of the bootstrapping method was assessed in two different experiments, one based on direct human gold-standard annotation and the other based on an automatically created disambiguation dataset. These results are further supported by applying a novel quantitative measurement of the quality of feature weighting functions. Improved feature weighting also allows massive feature reduction, which indicates that the most characteristic features for a word are indeed concentrated at the top ranks of its vector. Finally, experiments with three prominent similarity measures and two feature weighting functions showed that the bootstrapping scheme is robust and is independent of the original functions over which it is applied.
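
A minimal sketch of the bootstrapping idea follows, under simplifying assumptions: vectors are plain feature-weight dictionaries, similarity is cosine, and a feature's weight is promoted in proportion to how many of the word's nearest neighbors share it. The article's actual re-weighting scheme differs; this only illustrates the principle.

```python
# Bootstrapping sketch: compute an initial similarity space from standard
# weights, then promote features shared with a word's most similar words.
import math

def cosine(u, v):
    dot = sum(w * v.get(f, 0.0) for f, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def bootstrap_weights(vectors, k=10):
    new_vectors = {}
    for word, vec in vectors.items():
        neighbours = sorted((w for w in vectors if w != word),
                            key=lambda w: cosine(vec, vectors[w]), reverse=True)[:k]
        new_vec = {}
        for feat, weight in vec.items():
            # promote features that also appear in the neighbours' vectors
            support = sum(1 for n in neighbours if feat in vectors[n])
            new_vec[feat] = weight * (1 + support / k)
        new_vectors[word] = new_vec
    return new_vectors
```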

77 citations


Journal ArticleDOI
TL;DR: The transition from annotated data to a gold standard, that is, a subset that is sufficiently noise-free with high confidence, is discussed, together with a mathematical framework for estimating the noise level of the agreed subset, which helps promote cautious benchmarking.
Abstract: This article discusses the transition from annotated data to a gold standard, that is, a subset that is sufficiently noise-free with high confidence. Unless appropriately reinterpreted, agreement coefficients do not indicate the quality of the data set as a benchmarking resource: High overall agreement is neither sufficient nor necessary to distill some amount of highly reliable data from the annotated material. A mathematical framework is developed that allows estimation of the noise level of the agreed subset of annotated data, which helps promote cautious benchmarking.
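
For orientation only, the sketch below computes raw agreement and Cohen's kappa for two annotators and extracts the agreed subset from which a gold standard would be distilled. This is standard agreement bookkeeping, not the article's framework for estimating the noise level of that subset.

```python
# Raw agreement, Cohen's kappa, and the agreed subset for two annotators.
from collections import Counter

def kappa_and_agreed(ann1, ann2):
    n = len(ann1)
    agreed = [(i, a) for i, (a, b) in enumerate(zip(ann1, ann2)) if a == b]
    p_o = len(agreed) / n                                  # observed agreement
    c1, c2 = Counter(ann1), Counter(ann2)
    p_e = sum((c1[l] / n) * (c2[l] / n) for l in set(ann1) | set(ann2))
    kappa = (p_o - p_e) / (1 - p_e) if p_e < 1 else 1.0    # chance-corrected agreement
    return kappa, agreed

ann1 = ["pos", "neg", "pos", "neu", "pos", "neg"]
ann2 = ["pos", "neg", "neu", "neu", "pos", "pos"]
print(kappa_and_agreed(ann1, ann2))
```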

76 citations


Journal ArticleDOI
TL;DR: Although NLP in general has benefitted from advances in those areas where prepositions have received attention, there are still many issues to be addressed and accurate models of preposition usage are essential to avoid repeatedly making errors.
Abstract: Prepositions—as well as prepositional phrases (PPs) and markers of various sorts—have a mixed history in computational linguistics (CL), as well as related fields such as artificial intelligence, information retrieval (IR), and computational psycholinguistics: On the one hand they have been championed as being vital to precise language understanding (e.g., in information extraction), and on the other they have been ignored on the grounds of being syntactically promiscuous and semantically vacuous, and relegated to the ignominious rank of “stop word” (e.g., in text classification and IR). Although NLP in general has benefitted from advances in those areas where prepositions have received attention, there are still many issues to be addressed. For example, in machine translation, generating a preposition (or “case marker” in languages such as Japanese) incorrectly in the target language can lead to critical semantic divergences over the source language string. Equivalently in information retrieval and information extraction, it would seem desirable to be able to predict that book on NLP and book about NLP mean largely the same thing, but paranoid about drugs and paranoid on drugs suggest very different things. Prepositions are often among the most frequent words in a language. For example, based on the British National Corpus (BNC; Burnard 2000), four out of the top-ten most-frequent words in English are prepositions (of, to, in, and for). In terms of both parsing and generation, therefore, accurate models of preposition usage are essential to avoid repeatedly making errors. Despite their frequency, however, they are notoriously difficult to master, even for humans (Chodorow, Tetreault, and Han 2007). For example, Lindstromberg (2001) estimates that less than 10% of upper-level English as a Second

62 citations


Journal ArticleDOI
TL;DR: In large-scale experiments, it is found that almost all rules are binarizable and the resulting binarized rule set significantly improves the speed and accuracy of a state-of-the-art syntax-based machine translation system.
Abstract: Systems based on synchronous grammars and tree transducers promise to improve the quality of statistical machine translation output, but are often very computationally intensive. The complexity is exponential in the size of individual grammar rules due to arbitrary re-orderings between the two languages. We develop a theory of binarization for synchronous context-free grammars and present a linear-time algorithm for binarizing synchronous rules when possible. In our large-scale experiments, we found that almost all rules are binarizable and the resulting binarized rule set significantly improves the speed and accuracy of a state-of-the-art syntax-based machine translation system. We also discuss the more general, and computationally more difficult, problem of finding good parsing strategies for non-binarizable rules, and present an approximate polynomial-time algorithm for this problem.
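
A simple way to see binarizability is through the permutation of a rule's nonterminals on the target side: push spans onto a stack and merge the top two whenever they cover a contiguous range; the rule is binarizable if everything reduces to a single span. The sketch below illustrates this shift-reduce check and is not the article's algorithm verbatim.

```python
# Shift-reduce binarizability check over the target-side permutation of a
# synchronous rule's nonterminals. Illustrative sketch only.
def binarizable(permutation):
    """permutation: target-side positions of the nonterminals, e.g. [2, 4, 1, 3]."""
    stack = []
    for v in permutation:
        stack.append((v, v))
        # reduce while the two topmost spans form a contiguous range
        while len(stack) >= 2:
            lo2, hi2 = stack[-1]
            lo1, hi1 = stack[-2]
            if hi1 + 1 == lo2 or hi2 + 1 == lo1:
                stack[-2:] = [(min(lo1, lo2), max(hi1, hi2))]
            else:
                break
    return len(stack) == 1

print(binarizable([2, 1, 4, 3]))   # True
print(binarizable([2, 4, 1, 3]))   # False: the classic non-binarizable pattern
```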

61 citations


Journal ArticleDOI
TL;DR: The main result is that the simplest metric (which relies exclusively on NOCB transitions) sets a robust baseline that cannot be outperformed by other metrics which make use of additional centering-based features.
Abstract: In this article we discuss several metrics of coherence defined using centering theory and investigate the usefulness of such metrics for information ordering in automatic text generation. We estimate empirically which is the most promising metric and how useful this metric is using a general methodology applied on several corpora. Our main result is that the simplest metric (which relies exclusively on NOCB transitions) sets a robust baseline that cannot be outperformed by other metrics which make use of additional centering-based features. This baseline can be used for the development of both text-to-text and concept-to-text generation systems.
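
The baseline metric can be sketched directly: count NOCB transitions, i.e. adjacent sentence pairs that share no entity, and prefer the ordering that minimizes them. Entity extraction is assumed to happen upstream; sentences are represented here simply as sets of entity mentions.

```python
# Count NOCB transitions for a candidate ordering and pick the ordering that
# minimizes them (brute force over a toy example).
from itertools import permutations

def nocb_count(ordering):
    return sum(1 for a, b in zip(ordering, ordering[1:]) if not (a & b))

sentences = [{"museum", "city"}, {"museum", "exhibit"}, {"river"}, {"river", "city"}]

best = min(permutations(sentences), key=nocb_count)
print(nocb_count(sentences), nocb_count(best))
```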

Journal ArticleDOI
TL;DR: A combination of basic kernel functions is used to independently estimate syntagmatic and domain similarity, building a set of word-expert classifiers that share a common domain model acquired from a large corpus of unlabeled data.
Abstract: We present a semi-supervised technique for word sense disambiguation that exploits external knowledge acquired in an unsupervised manner. In particular, we use a combination of basic kernel functions to independently estimate syntagmatic and domain similarity, building a set of word-expert classifiers that share a common domain model acquired from a large corpus of unlabeled data. The results show that the proposed approach achieves state-of-the-art performance on a wide range of lexical sample tasks and on the English all-words task of Senseval-3, although it uses a considerably smaller number of training examples than other methods.
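
A hedged sketch of the kernel-combination idea: two basic kernels, one standing in for syntagmatic (local-context) similarity and one for domain similarity, are summed into a single Gram matrix and handed to an SVM. The random toy matrices are placeholders for the real representations.

```python
# Combine two basic kernels by summing their Gram matrices and train an SVM
# on the precomputed kernel. Toy data only.
import numpy as np
from sklearn.svm import SVC

def linear_gram(X):
    return X @ X.T

X_syntagmatic = np.random.rand(20, 50)    # stand-in for local collocation features
X_domain = np.random.rand(20, 10)         # stand-in for domain features from unlabeled data
y = np.random.randint(0, 2, size=20)      # toy sense labels

K = linear_gram(X_syntagmatic) + linear_gram(X_domain)   # kernel combination
clf = SVC(kernel="precomputed").fit(K, y)
print(clf.predict(K[:5]))                 # rows of K against the training set
```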

Journal ArticleDOI
TL;DR: In this paper, semantic role annotations provided by the Penn Treebank and FrameNet tagged corpora are used for preposition disambiguation and a common inventory is derived from these in support of definition analysis, which is the motivation for this work.
Abstract: This article describes how semantic role resources can be exploited for preposition disambiguation. The main resources include the semantic role annotations provided by the Penn Treebank and FrameNet tagged corpora. The resources also include the assertions contained in the Factotum knowledge base, as well as information from Cyc and Conceptual Graphs. A common inventory is derived from these in support of definition analysis, which is the motivation for this work. The disambiguation concentrates on relations indicated by prepositional phrases, and is framed as word-sense disambiguation for the preposition in question. A new type of feature for word-sense disambiguation is introduced, using WordNet hypernyms as collocations rather than just words. Various experiments over the Penn Treebank and FrameNet data are presented, including prepositions classified separately versus together, and illustrating the effects of filtering. Similar experimentation is done over the Factotum data, including a method for inferring likely preposition usage from corpora, as knowledge bases do not generally indicate how relationships are expressed in English (in contrast to the explicit annotations on this in the Penn Treebank and FrameNet). Other experiments are included with the FrameNet data mapped into the common relation inventory developed for definition analysis, illustrating how preposition disambiguation might be applied in lexical acquisition.
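
The "hypernyms as collocations" feature type can be sketched as follows: besides the head noun of the preposition's object, all of its WordNet hypernyms are emitted as features, so that, say, "on the table" and "on the desk" can share a furniture-level feature. This assumes NLTK with the WordNet data installed; the feature naming is illustrative.

```python
# Emit the object noun plus its WordNet hypernyms as features for preposition
# word-sense disambiguation. Requires the NLTK WordNet corpus.
from nltk.corpus import wordnet as wn

def hypernym_features(noun):
    feats = {f"obj={noun}"}
    for synset in wn.synsets(noun, pos=wn.NOUN)[:1]:          # most frequent sense only
        for hyper in synset.closure(lambda s: s.hypernyms()):
            feats.add(f"obj_hyper={hyper.name()}")
    return feats

print(sorted(hypernym_features("table"))[:5])
```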

Journal ArticleDOI
TL;DR: This article shows how the finite-state approach to multimodal language processing can be extended to support multimodal applications combining speech with complex freehand pen input, and evaluates the approach in the context of a multimodal conversational system (MATCH).
Abstract: Multimodal grammars provide an effective mechanism for quickly creating integration and understanding capabilities for interactive systems supporting simultaneous use of multiple input modalities. However, like other approaches based on hand-crafted grammars, multimodal grammars can be brittle with respect to unexpected, erroneous, or disfluent input. In this article, we show how the finite-state approach to multimodal language processing can be extended to support multimodal applications combining speech with complex freehand pen input, and evaluate the approach in the context of a multimodal conversational system (MATCH). We explore a range of different techniques for improving the robustness of multimodal integration and understanding. These include techniques for building effective language models for speech recognition when little or no multimodal training data is available, and techniques for robust multimodal understanding that draw on classification, machine translation, and sequence edit methods. We also explore the use of edit-based methods to overcome mismatches between the gesture stream and the speech stream.

Journal ArticleDOI
TL;DR: The language documentation community uses technology to process language, but is largely ignorant of the field of natural language processing.
Abstract: March 2009 marked an important milestone: the First International Conference on Language Documentation and Conservation, held at the University of Hawai‘i. The scale of the event was striking, with five parallel tracks running over three days. The organizers coped magnificently with three times the expected participation (over 300). The buzz among the participants was that we were at the start of something big, that we were already part of a significant and growing community dedicated to supporting small languages together, the conference subtitle. The event was full of computation and linguistics, yet devoid of computational linguistics. The language documentation community uses technology to process language, but is largely ignorant of the field of natural language processing. I pondered what we have to offer this community: “Send us your 10 million words of Nahuatl-English bitext and we’ll do you a machine translation system!” “Show us your Bambara WordNet and we’ll use it to train a word sense disambiguation tool!” “Write up the word-formation rules of Inuktitut in this arcane format and we’ll give you a morphological analyzer!” Is there not some more immediate contribution we could offer?

Journal ArticleDOI
TL;DR: A single unified referential semantic probability model is described which brings several kinds of context to bear in speech decoding, and performs accurate recognition in real time on large domains in the absence of example in-domain training sentences.
Abstract: This article describes a framework for incorporating referential semantic information from a world model or ontology directly into a probabilistic language model of the sort commonly used in speech recognition, where it can be probabilistically weighted together with phonological and syntactic factors as an integral part of the decoding process. Introducing world model referents into the decoding search greatly increases the search space, but by using a single integrated phonological, syntactic, and referential semantic language model, the decoder is able to incrementally prune this search based on probabilities associated with these combined contexts. The result is a single unified referential semantic probability model which brings several kinds of context to bear in speech decoding, and performs accurate recognition in real time on large domains in the absence of example in-domain training sentences.

Journal ArticleDOI
TL;DR: In the following pages, a computational linguist calls for the return of linguistics to computational linguistics.
Abstract: One of the most thought-provoking proposals I have heard recently came from Lori Levin during the discussion that concluded the EACL 2009 Workshop on the Interaction between Linguistics and Computational Linguistics. Lori proposed that we should form an ACL Special Interest Group on Linguistics. At first blush, I found the idea weird: Isn’t it a little like the American Academy of Pediatrics forming a SIG on Medicine (or on Children)? Second thoughts, however, revealed the appropriateness of the idea: In essence, linguistics is altogether missing in contemporary natural language engineering research. In the following pages I want to call for the return of linguistics to computational linguistics. The last two decades were marked by a complete paradigm shift in computational linguistics. Frustrated by the inability of applications based on explicit linguistic knowledge to scale up to real-world needs, and, perhaps more deeply, frustrated with the dominating theories in formal linguistics, we looked instead to corpora that reflect language use as our sources of (implicit) knowledge. With the shift in methodology came a subtle change in the goals of our entire enterprise. Two decades ago, a computational linguist could be interested in developing NLP applications; or in formalizing (and reasoning about) linguistic processes. These days, it is the former only. A superficial look at the papers presented in our main conferences reveals that the vast majority of them are engineering papers, discussing engineering solutions to practical problems. Virtually none addresses fundamental issues in linguistics. There’s nothing wrong with engineering work, of course. Every school of technology has departments of engineering in areas as diverse as Chemical Engineering, Mechanical Engineering, Aeronautical Engineering, or Biomedical Engineering; there’s no reason why there shouldn’t also be a discipline of Natural Language Engineering. But in the more established disciplines, engineering departments conduct research that is informed by some well-defined branch of science. Chemical engineers study chemistry; electrical engineers study physics; aeronautical engineers study dynamics; and biomedical engineers study biology, physiology, medical sciences, and so on. The success of engineering is also in part due to the choice of the “right” mathematics. The theoretical development of several scientific areas, notably physics, went alongside mathematical developments. Physics could not have accounted for natural phenomena without such mathematical infrastructure. For example, the development of (partial) differential equations went hand in hand with some of the greatest achievements in physics, and this branch of mathematics later turned out to be applicable also to chemistry, electrical engineering, and economics, among many other scientific fields.

Journal ArticleDOI
Anja Belz
TL;DR: The talk included progress reports on the current size of the artificial brain, its structure, update rate, and power consumption, and explained how intelligent behavior was going to develop by mechanisms simulating biological evolution.

Abstract: A regular fixture on the mid 1990s international research seminar circuit was the billion-neuron artificial brain talk. The idea behind this project was simple: in order to create artificial intelligence, what was needed first of all was a very large artificial brain; if a big enough set of interconnected modules of neurons could be implemented, then it would be possible to evolve mammalian-level behavior with current computational-neuron technology. The talk included progress reports on the current size of the artificial brain, its structure, update rate, and power consumption, and explained how intelligent behavior was going to develop by mechanisms simulating biological evolution. What the talk didn't mention was what kind of functionality the team had so far managed to evolve, and so the first comment at the end of the talk was inevitably “nice work, but have you actually done anything with the brain yet?” In human language technology (HLT) research, we currently report a range of evaluation scores that measure and assess various aspects of systems, in particular the similarity of their outputs to samples of human language or to human-produced gold-standard annotations, but are we leaving ourselves open to the same question as the billion-neuron artificial brain researchers?

Journal ArticleDOI
TL;DR: Given a training set of English nominal phrases and compounds along with their translations in the five Romance languages, the algorithm automatically learns classification rules and applies them to unseen test instances for semantic interpretation; results are compared against two state-of-the-art models reported in the literature.
Abstract: In this article we explore the syntactic and semantic properties of prepositions in the context of the semantic interpretation of nominal phrases and compounds. We investigate the problem based on cross-linguistic evidence from a set of six languages: English, Spanish, Italian, French, Portuguese, and Romanian. The focus on English and Romance languages is well motivated. Most of the time, English nominal phrases and compounds translate into constructions of the form N P N in Romance languages, where the P (preposition) may vary in ways that correlate with the semantics. Thus, we present empirical observations on the distribution of nominal phrases and compounds and the distribution of their meanings on two different corpora, based on two state-of-the-art classification tag sets: Lauer's set of eight prepositions and our list of 22 semantic relations. A mapping between the two tag sets is also provided. Furthermore, given a training set of English nominal phrases and compounds along with their translations in the five Romance languages, our algorithm automatically learns classification rules and applies them to unseen test instances for semantic interpretation. Experimental results are compared against two state-of-the-art models reported in the literature.
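
A hedged sketch of the cross-linguistic learning setup: each English nominal phrase or compound is represented by the prepositions observed in its Romance translations, and a classifier maps that evidence to a semantic relation. The tiny training set and the choice of a decision tree are illustrative assumptions, not the article's algorithm.

```python
# Learn to map cross-linguistic preposition evidence to a semantic relation.
# The training examples below are invented purely for illustration.
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier

train = [
    ({"es": "de", "fr": "de", "it": "di", "pt": "de", "ro": "de"}, "POSSESSION"),
    ({"es": "para", "fr": "pour", "it": "per", "pt": "para", "ro": "pentru"}, "PURPOSE"),
    ({"es": "de", "fr": "en", "it": "di", "pt": "de", "ro": "din"}, "SOURCE"),
]
X_dicts, y = zip(*train)

vec = DictVectorizer()
X = vec.fit_transform(X_dicts)
clf = DecisionTreeClassifier().fit(X, y)
print(clf.predict(vec.transform([{"es": "para", "fr": "pour", "it": "per",
                                  "pt": "para", "ro": "pentru"}])))
```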

Journal ArticleDOI
TL;DR: Introduction to Information Retrieval is the first textbook with a coherent treatment of classical and web information retrieval, including web search and the related areas of text classification and text clustering.
Abstract: Introduction to Information Retrieval is the first textbook with a coherent treatment of classical and web information retrieval, including web search and the related areas of text classification and text clustering. Written from a computer science perspective, it gives an up-to-date treatment of all aspects of the design and implementation of systems for gathering, indexing, and searching documents and of methods for evaluating systems, along with an introduction to the use of machine learning methods on text collections. Designed as the primary text for a graduate or advanced undergraduate course in information retrieval, the book will also interest researchers and professionals. A complete set of lecture slides and exercises that accompany the book are available on the web.

Journal ArticleDOI
TL;DR: In this acceptance speech, the author recounts his beginnings and motivations and the contributions of the three innovative groups he headed over the last 47 years, at Cornell, IBM, and now at Johns Hopkins, with a focus on his IBM team.
Abstract: I am very grateful for the award you have bestowed on me. To understand your generosity I have to assume that you are honoring the leadership of three innovative groups that I headed in the last 47 years: at Cornell, IBM, and now at Johns Hopkins. You know my co-workers in the last two teams. The Cornell group was in Information Theory and included Toby Berger, Terrence Fine, and Neil J. A. Sloane (earlier my Ph.D. student), all of whom earned their own laurels. I was told that I should give an acceptance speech and was furnished with example texts by previous recipients. They wrote about the development and impact of their ideas. So I will tell you about my beginnings and motivations and then focus on the contributions of my IBM team. In this way the text will have some historical value and may clear up certain widely held misconceptions.

Journal ArticleDOI
TL;DR: This work uses the three-valued logic of Elementary Ranking Conditions to show that the VCD of Optimality Theory with k constraints is k-1 and establishes that the complexity of OT is a well-behaved function of k and that the hardness of learning in OT is linear in k for a variety of frameworks that employ probabilistic definitions of learnability.
Abstract: Given a constraint set with k constraints in the framework of Optimality Theory (OT), what is its capacity as a classification scheme for linguistic data? One useful measure of this capacity is the size of the largest data set of which each subset is consistent with a different grammar hypothesis. This measure is known as the Vapnik-Chervonenkis dimension (VCD) and is a standard complexity measure for concept classes in computational learnability theory. In this work, I use the three-valued logic of Elementary Ranking Conditions to show that the VCD of Optimality Theory with k constraints is k-1. Analysis of OT in terms of the VCD establishes that the complexity of OT is a well-behaved function of k and that the 'hardness' of learning in OT is linear in k for a variety of frameworks that employ probabilistic definitions of learnability.
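
For readers unfamiliar with the notation, the result can be restated compactly (a restatement, not a proof):

```latex
% H_k denotes the family of OT languages generated by the k! rankings of a
% set of k constraints; VCD is the Vapnik-Chervonenkis dimension.
\[
  \mathrm{VCD}(\mathcal{H}) = \max\bigl\{\, |S| : \mathcal{H} \text{ shatters } S \,\bigr\},
  \qquad
  \mathcal{H} \text{ shatters } S \iff \bigl|\{\, h \cap S : h \in \mathcal{H} \,\}\bigr| = 2^{|S|}.
\]
\[
  \mathrm{VCD}(\mathcal{H}_k) = k - 1,
\]
% so standard PAC-style sample-complexity bounds for learning a constraint
% ranking grow only linearly in the number of constraints k.
```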

Journal ArticleDOI
TL;DR: An investigation of corpus-based methods for the automation of help-desk e-mail responses along two operational dimensions: information-gathering technique, and granularity of the information.
Abstract: This article presents an investigation of corpus-based methods for the automation of help-desk e-mail responses. Specifically, we investigate this problem along two operational dimensions: (1) information-gathering technique, and (2) granularity of the information. We consider two information-gathering techniques (retrieval and prediction) applied to information represented at two levels of granularity (document-level and sentence-level). Document-level methods correspond to the reuse of an existing response e-mail to address new requests. Sentence-level methods correspond to applying extractive multi-document summarization techniques to collate units of information from more than one e-mail. Evaluation of the performance of the different methods shows that in combination they are able to successfully automate the generation of responses for a substantial portion of e-mail requests in our corpus. We also investigate a meta-selection process that learns to choose one method to address a new inquiry e-mail, thus providing a unified response automation solution.
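
The document-level retrieval method can be sketched as nearest-neighbour reuse of past responses; the TF-IDF representation and the tiny corpus below are illustrative assumptions, not the article's exact system.

```python
# Retrieve the past request most similar to a new inquiry and reuse its
# stored response (document-level retrieval). Toy corpus only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

past_requests = ["printer driver will not install",
                 "cannot connect scanner over usb",
                 "how do I reset my password"]
past_responses = ["Please download the latest driver from ...",
                  "Check that the USB cable ...",
                  "Use the self-service portal to ..."]

vectorizer = TfidfVectorizer()
R = vectorizer.fit_transform(past_requests)

def answer(new_request):
    q = vectorizer.transform([new_request])
    best = cosine_similarity(q, R).argmax()
    return past_responses[best]

print(answer("my printer driver won't install on windows"))
```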


Journal ArticleDOI
TL;DR: The article investigates the distinction between static and directional locatives, and between different types of directional locative PPs, and shows how this analysis can be incorporated into Minimal Recursion Semantics (MRS) (Copestake et al. 2005).
Abstract: The article describes a pilot implementation of a grammar containing different types of locative PPs. In particular, we investigate the distinction between static and directional locatives, and between different types of directional locatives. Locatives may act as modifiers as well as referring expressions depending on the syntactic context. We handle this with a single lexical entry. The implementation is of Norwegian locatives, but English locatives are both discussed and compared to Norwegian locatives. The semantic analysis is based on a proposal by Markus Kracht (2002), and we show how this analysis can be incorporated into Minimal Recursion Semantics (MRS) (Copestake et al. 2005). We discuss how the resulting system may be applied in a transfer-based machine translation system, and how we can map from a shallow MRS representation to a deeper semantic representation.

Journal ArticleDOI
TL;DR: Learning Machine Translation is a book focused on the application of machine learning to SMT, presenting a number of approaches applying discriminative machine learning techniques within an SMT decoder.
Abstract: Attending recent computational linguistics conferences, it is hard to ignore the phenomenal amount of research devoted to statistical machine translation (SMT). Driven by the wide availability of open-source translation systems, corpora, and evaluation tools, a research area that was once the preserve of large research groups has become accessible to those of more modest resources. Although the current state-of-the-art SMT systems have matured into robust commercial systems, capable of providing reasonable quality translations for a variety of domains, they remain limited by naive modeling assumptions and a heavy reliance on heuristics. These limitations have led researchers to ask the question of whether the adoption of techniques from the machine learning literature could allow more complex translations to be modeled effectively. As such, this book, focused on the application of machine learning to SMT, is particularly timely in capturing the current interest of the machine translation community. Learning Machine Translation is presented in two parts. The first, titled “Enabling Technologies,” focuses on research peripheral to machine translation. Topics covered include the acquisition of parallel corpora, cross-language named-entity processing, and language modeling. The second part covers core machine translation system building, presenting a number of approaches applying discriminative machine learning techniques within an SMT decoder. Much of the content of the book arose from the Machine Learning for Multilingual Access Workshop held at the Neural Information Processing conference in 2006. As SMT is not a frequent topic at that conference, the bridging of research from the mainstream machine learning community with research on MT is particularly promising. A fine example of this cross-over is Chapter 9, “Kernel-Based Machine Translation,” in which a novel approach to estimating translation models is presented. However, this promise is not entirely fulfilled, as some contributions either fail to make use of machine learning or are somewhat obscure, unlikely to impact on the mainstream SMT community.


Journal ArticleDOI
TL;DR: The authors proposed two methodological approaches that combine the quantitative and qualitative views into a corpus-based description of discourse organization, which provides detailed analyses of individual texts and the generalization of these analyses across all the texts of a genre-specific corpus.
Abstract: The study of discourse can be undertaken from different perspectives (e.g., linguistic, cognitive, or computational) with differing purposes in mind (e.g., to study language use or to analyze social practices). The aim of Discourse on the Move is to show that it is possible and profitable to join quantitative and qualitative analyses to study discourse structures. Whereas corpus-based quantitative discourse analysis focuses on the distributional discourse patterns of a corpus as a whole with no indication of how patterns are distributed in individual texts, manual qualitative analysis is always carried out on a small number of texts and does not support large generalizations of the findings. The book proposes two methodological approaches—top-down and bottom-up—that combine the quantitative and qualitative views into a corpus-based description of discourse organization. Such a description provides detailed analyses of individual texts and the generalization of these analyses across all the texts of a genre-specific corpus. Top-down is the more traditional (not necessarily corpus-based) approach in which researchers establish functional–qualitative methods to develop an analytical framework capable of describing the types of discourse units in a target corpus. In this approach, linguistic–quantitative analyses come as a later step to facilitate the interpretation of discourse types. In contrast, the bottom-up approach begins with a linguistic–quantitative analysis based on the automatic segmentation of texts into discourse units on the basis of vocabulary distributional patterns. In this approach, the functional–qualitative analysis that provides an interpretation of the discourse types is performed as a later step. Both top-down and bottom-up analyses can be broken down into seven procedural steps, but the order of the steps in the two approaches is not the same. The steps to be followed in top-down methods are these: