
Showing papers in "Computational Linguistics in 2010"


Journal ArticleDOI
TL;DR: The Distributional Memory approach is shown to be tenable despite the constraints imposed by its multi-purpose nature, and performs competitively against task-specific algorithms recently reported in the literature for the same tasks, and against several state-of-the-art methods.
Abstract: Research into corpus-based semantics has focused on the development of ad hoc models that treat single tasks, or sets of closely related tasks, as unrelated challenges to be tackled by extracting different kinds of distributional information from the corpus. As an alternative to this "one task, one model" approach, the Distributional Memory framework extracts distributional information once and for all from the corpus, in the form of a set of weighted word-link-word tuples arranged into a third-order tensor. Different matrices are then generated from the tensor, and their rows and columns constitute natural spaces to deal with different semantic problems. In this way, the same distributional information can be shared across tasks such as modeling word similarity judgments, discovering synonyms, concept categorization, predicting selectional preferences of verbs, solving analogy problems, classifying relations between word pairs, harvesting qualia structures with patterns or example pairs, predicting the typical properties of concepts, and classifying verbs into alternation classes. Extensive empirical testing in all these domains shows that a Distributional Memory implementation performs competitively against task-specific algorithms recently reported in the literature for the same tasks, and against our implementations of several state-of-the-art methods. The Distributional Memory approach is thus shown to be tenable despite the constraints imposed by its multi-purpose nature.
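
As a rough, hedged sketch of the tuple-to-matrix idea (not the authors' implementation; the tuples, weights, and helper names below are invented for illustration), the same sparse word-link-word tensor can be sliced into different matrix views, each serving a different family of tasks:

```python
# A minimal sketch of the word-link-word tensor idea: distributional
# information is stored once as weighted tuples, then "matricized" into
# different views for different tasks. Tuple weights here are invented.
from collections import defaultdict

# Third-order tensor as a sparse dict: (word, link, word) -> weight
tuples = {
    ("dog", "subj_of", "bark"): 3.2,
    ("dog", "obj_of", "walk"): 1.7,
    ("teacher", "subj_of", "explain"): 2.9,
    ("teacher", "obj_of", "praise"): 1.1,
}

def word_by_link_word(tensor):
    """Matricize into a word x (link, word) matrix: rows are word vectors
    usable for similarity, synonym discovery, and categorization tasks."""
    matrix = defaultdict(dict)
    for (w1, link, w2), weight in tensor.items():
        matrix[w1][(link, w2)] = weight
    return matrix

def word_word_by_link(tensor):
    """Alternative view: (word, word) pairs as rows, links as columns,
    usable for relation classification or analogy-style tasks."""
    matrix = defaultdict(dict)
    for (w1, link, w2), weight in tensor.items():
        matrix[(w1, w2)][link] = weight
    return matrix

if __name__ == "__main__":
    print(word_by_link_word(tuples)["dog"])
    print(word_word_by_link(tuples)[("teacher", "explain")])
```

The word-by-link-word view gives row vectors for word-level tasks, while the pair-by-link view supports relational tasks, which is the sense in which one extraction pass serves many problems.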

671 citations


Journal ArticleDOI
TL;DR: A comprehensive and application-independent survey of data-driven phrasal and sentential paraphrase generation methods is conducted, which also conveys an appreciation for the importance and potential use of paraphrases in the field of NLP research.
Abstract: The task of paraphrasing is inherently familiar to speakers of all languages. Moreover, the task of automatically generating or extracting semantic equivalences for the various units of language (words, phrases, and sentences) is an important part of natural language processing (NLP) and is being increasingly employed to improve the performance of several NLP applications. In this article, we attempt to conduct a comprehensive and application-independent survey of data-driven phrasal and sentential paraphrase generation methods, while also conveying an appreciation for the importance and potential use of paraphrases in the field of NLP research. Recent work done in manual and automatic construction of paraphrase corpora is also examined. We also discuss the strategies used for evaluating paraphrase generation techniques and briefly explore some future trends in paraphrase generation.

308 citations


Journal ArticleDOI
TL;DR: This work presents a vector space–based model for selectional preferences that predicts plausibility scores for argument headwords and obtains consistent benefits from using the disambiguation and semantic role information provided by a semantically tagged primary corpus.
Abstract: We present a vector space-based model for selectional preferences that predicts plausibility scores for argument headwords. It does not require any lexical resources (such as WordNet). It can be trained either on one corpus with syntactic annotation, or on a combination of a small semantically annotated primary corpus and a large, syntactically analyzed generalization corpus. Our model is able to predict inverse selectional preferences, that is, plausibility scores for predicates given argument heads. We evaluate our model on one NLP task (pseudo-disambiguation) and one cognitive task (prediction of human plausibility judgments), gauging the influence of different parameters and comparing our model against other model classes. We obtain consistent benefits from using the disambiguation and semantic role information provided by a semantically tagged primary corpus. As for parameters, we identify settings that yield good performance across a range of experimental conditions. However, frequency remains a major influence of prediction quality, and we also identify more robust parameter settings suitable for applications with many infrequent items.
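
A minimal sketch of one common vector-space formulation of selectional preferences (a prototype/centroid variant, not necessarily the exact model evaluated in the article; the toy vectors and seen-argument lists are invented):

```python
# Plausibility of an argument head as cosine similarity to the centroid of
# heads previously observed in that predicate slot. Toy values only.
import numpy as np

space = {
    "wine":  np.array([0.9, 0.1, 0.2]),
    "beer":  np.array([0.8, 0.2, 0.1]),
    "idea":  np.array([0.1, 0.9, 0.3]),
    "stone": np.array([0.2, 0.1, 0.9]),
}
seen_heads = {("drink", "obj"): ["wine", "beer"]}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def plausibility(pred, role, head):
    """Score a candidate head for a predicate slot against the prototype
    (centroid) of heads observed in that slot."""
    prototype = np.mean([space[h] for h in seen_heads[(pred, role)]], axis=0)
    return cosine(space[head], prototype)

print(plausibility("drink", "obj", "beer"))   # high
print(plausibility("drink", "obj", "idea"))   # low
```

Inverse preferences (plausibility of a predicate given an argument head) follow the same recipe with the roles of predicate slot and head swapped.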

115 citations


Journal ArticleDOI
TL;DR: A discourse-informed model which is capable of producing document compressions that are coherent and informative is presented, inspired by theories of local coherence and formulated within the framework of integer linear programming.
Abstract: Sentence compression holds promise for many applications ranging from summarization to subtitle generation. The task is typically performed on isolated sentences without taking the surrounding context into account, even though most applications would operate over entire documents. In this article we present a discourse-informed model which is capable of producing document compressions that are coherent and informative. Our model is inspired by theories of local coherence and formulated within the framework of integer linear programming. Experimental results show significant improvements over a state-of-the-art discourse agnostic approach.
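
For readers unfamiliar with the ILP framing, here is a deliberately tiny sketch of compression as constrained optimization, using the PuLP library as an assumed solver interface; the article's model adds discourse-level constraints and operates over whole documents, which this toy does not attempt:

```python
# A toy ILP in the spirit of compression-as-optimization. Binary variables
# decide which tokens to keep; the relevance scores are invented.
import pulp

tokens = ["the", "minister", "yesterday", "announced", "new", "cuts"]
relevance = [0.1, 0.9, 0.3, 0.8, 0.4, 0.9]
max_len = 4

prob = pulp.LpProblem("sentence_compression", pulp.LpMaximize)
keep = [pulp.LpVariable(f"keep_{i}", cat="Binary") for i in range(len(tokens))]

# Objective: keep the most relevant tokens.
prob += pulp.lpSum(relevance[i] * keep[i] for i in range(len(tokens)))
# Length constraint on the compression.
prob += pulp.lpSum(keep) <= max_len
# A crude well-formedness constraint: if the subject is kept, keep the verb.
prob += keep[3] >= keep[1]

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print([t for t, k in zip(tokens, keep) if k.value() == 1])
```

Discourse-informed compression replaces the invented relevance scores and the single grammatical constraint with scores and constraints derived from local-coherence theory across sentences.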

91 citations


Journal ArticleDOI
TL;DR: The proposed method is compared with regression methods and a state-of-the-art classification method, and an application called Terrace is presented, which retrieves texts with readability similar to that of a given input text.
Abstract: This article presents a novel approach for readability assessment through sorting. A comparator that judges the relative readability between two texts is generated through machine learning, and a given set of texts is sorted by this comparator. Our proposal is advantageous because it solves the problem of a lack of training data: the construction of the comparator only requires training data annotated with two reading levels. The proposed method is compared with regression methods and a state-of-the-art classification method. Moreover, we present our application, called Terrace, which retrieves texts with readability similar to that of a given input text.
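
A hedged sketch of the comparator-and-sort idea (the features, training pairs, and use of scikit-learn below are illustrative assumptions, not the article's setup):

```python
# Train a binary comparator on feature differences of text pairs with known
# relative reading levels, then sort arbitrary texts with that comparator.
from functools import cmp_to_key
import numpy as np
from sklearn.linear_model import LogisticRegression

def features(text):
    words = text.split()
    avg_word_len = sum(len(w) for w in words) / len(words)
    return np.array([len(words), avg_word_len])

easy = ["the cat sat on the mat", "we like to play in the sun"]
hard = ["photosynthesis converts electromagnetic radiation into chemical energy",
        "the epistemological ramifications remain fundamentally contested"]

# Training pairs: label 1 if the first text is harder than the second.
X, y = [], []
for h in hard:
    for e in easy:
        X.append(features(h) - features(e)); y.append(1)
        X.append(features(e) - features(h)); y.append(0)

clf = LogisticRegression().fit(np.array(X), y)

def harder(a, b):
    """Comparator: positive if a reads as harder than b."""
    return 1 if clf.predict([features(a) - features(b)])[0] == 1 else -1

texts = ["the cat sat on the mat",
         "the epistemological ramifications remain fundamentally contested",
         "we like to play in the sun"]
print(sorted(texts, key=cmp_to_key(harder)))
```

Sorting with a learned comparator never requires absolute reading-level annotations, which is the core advantage the abstract highlights.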

90 citations


Journal ArticleDOI
Stefan Riezler1, Yi Liu1
TL;DR: It is shown in an extrinsic evaluation in a real-world Web search task that the combination of a query-to-snippet translation model with a query language model achieves improved contextual query expansion compared to a state-of-the-art query expansion model that is trained on the same query log data.
Abstract: Long queries often suffer from low recall in Web search due to conjunctive term matching. The chances of matching words in relevant documents can be increased by rewriting query terms into new terms with similar statistical properties. We present a comparison of approaches that deploy user query logs to learn rewrites of query terms into terms from the document space. We show that the best results are achieved by adopting the perspective of bridging the "lexical chasm" between queries and documents by translating from a source language of user queries into a target language of Web documents. We train a state-of-the-art statistical machine translation model on query-snippet pairs from user query logs, and extract expansion terms from the query rewrites produced by the monolingual translation system. We show in an extrinsic evaluation in a real-world Web search task that the combination of a query-to-snippet translation model with a query language model achieves improved contextual query expansion compared to a state-of-the-art query expansion model that is trained on the same query log data.
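
As a toy illustration of combining a query-to-snippet translation model with a query language model (all probabilities and table names below are invented; the article learns them from query-snippet pairs in user logs):

```python
# Score candidate expansion terms by mixing a monolingual "translation"
# probability with a query language model probability.
import math

# p(snippet_term | query_term) from a query-to-snippet translation table
translation = {
    "cheap": {"cheap": 0.5, "inexpensive": 0.2, "affordable": 0.2, "budget": 0.1},
}
# p(term | rest of query) from a query language model
query_lm = {
    ("flights",): {"inexpensive": 0.01, "affordable": 0.03, "budget": 0.04},
}

def expansion_score(term, query_term, context, lam=0.5):
    """Log-linear mix of translation probability and query LM probability."""
    p_tm = translation[query_term].get(term, 1e-9)
    p_lm = query_lm[context].get(term, 1e-9)
    return lam * math.log(p_tm) + (1 - lam) * math.log(p_lm)

candidates = ["inexpensive", "affordable", "budget"]
ranked = sorted(candidates,
                key=lambda t: expansion_score(t, "cheap", ("flights",)),
                reverse=True)
print(ranked)  # expansion terms for the query "cheap flights"
```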

88 citations


Journal ArticleDOI
TL;DR: A model of syntactic processing that operates successfully within severe constraints, by recognizing constituents in a right-corner transformed representation and mapping this representation to random variables in a Hierarchic Hidden Markov Model, a factored time-series model which probabilistically models the contents of a bounded memory store over time.
Abstract: Human syntactic processing shows many signs of taking place within a general-purpose short-term memory. But this kind of memory is known to have a severely constrained storage capacity, possibly as few as three or four distinct elements. This article describes a model of syntactic processing that operates successfully within these severe constraints, by recognizing constituents in a right-corner transformed representation (a variant of left-corner parsing) and mapping this representation to random variables in a Hierarchic Hidden Markov Model, a factored time-series model which probabilistically models the contents of a bounded memory store over time. Evaluations of the coverage of this model on a large syntactically annotated corpus of English sentences, and the accuracy of a bounded-memory parsing strategy based on this model, suggest this model may be cognitively plausible.

74 citations


Journal ArticleDOI
TL;DR: HiFST, a lattice-based decoder for hierarchical phrase-based translation and alignment is described, finding that the use of WFSTs rather than k-best lists requires less pruning in translation search, resulting in fewer search errors, better parameter optimization, and improved translation performance.
Abstract: In this article we describe HiFST, a lattice-based decoder for hierarchical phrase-based translation and alignment. The decoder is implemented with standard Weighted Finite-State Transducer (WFST) operations as an alternative to the well-known cube pruning procedure. We find that the use of WFSTs rather than k-best lists requires less pruning in translation search, resulting in fewer search errors, better parameter optimization, and improved translation performance. The direct generation of translation lattices in the target language can improve subsequent rescoring procedures, yielding further gains when applying long-span language models and Minimum Bayes Risk decoding. We also provide insights as to how to control the size of the search space defined by hierarchical rules. We show that shallow-n grammars, low-level rule catenation, and other search constraints can help to match the power of the translation system to specific language pairs.

70 citations


Journal ArticleDOI
TL;DR: An approach to the automatic creation of extractive summaries of literary short stories is presented, relying on assorted surface indicators about clauses in the short story; an evaluation suggests that the summaries are helpful in achieving the original objective.
Abstract: We present an approach to the automatic creation of extractive summaries of literary short stories. The summaries are produced with a specific objective in mind: to help a reader decide whether she would be interested in reading the complete story. To this end, the summaries give the user relevant information about the setting of the story without revealing its plot. The system relies on assorted surface indicators about clauses in the short story, the most important of which are those related to the aspectual type of a clause and to the main entities in a story. Fifteen judges evaluated the summaries on a number of extrinsic and intrinsic measures. The outcome of this evaluation suggests that the summaries are helpful in achieving the original objective.

58 citations


Journal ArticleDOI
TL;DR: A novel string-to-dependency algorithm for statistical machine translation that employs a target dependency language model during decoding to exploit long distance word relations, which cannot be modeled with a traditional n-gram language model.
Abstract: We propose a novel string-to-dependency algorithm for statistical machine translation. This algorithm employs a target dependency language model during decoding to exploit long distance word relations, which cannot be modeled with a traditional n-gram language model. Experiments show that the algorithm achieves significant improvement in MT performance over a state-of-the-art hierarchical string-to-string system on NIST MT06 and MT08 newswire evaluation sets.

58 citations


Journal ArticleDOI
TL;DR: A passage retrieval system that uses off-the-shelf retrieval technology with a re-ranking step incorporating structural information is extended, based on relatively lightweight overlap measures incorporating syntactic constituents, cue words, and document structure.
Abstract: While developing an approach to why-QA, we extended a passage retrieval system that uses off-the-shelf retrieval technology with a re-ranking step incorporating structural information. We get significantly higher scores in terms of MRR@150 (from 0.25 to 0.34) and success@10. The 23% improvement that we reach in terms of MRR is comparable to the improvement reached on different QA tasks by other researchers in the field, although our re-ranking approach is based on relatively lightweight overlap measures incorporating syntactic constituents, cue words, and document structure.
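
A lightweight sketch of the re-ranking step (the weights, cue-word list, and overlap measure are illustrative stand-ins, not the article's trained configuration):

```python
# Re-rank retrieved passages for why-questions by combining the retrieval
# engine's score with shallow overlap and cue-word features.
import re

CUE_WORDS = {"because", "since", "due", "therefore", "reason"}

def tokens(text):
    return re.findall(r"\w+", text.lower())

def overlap(a, b):
    a_set, b_set = set(tokens(a)), set(tokens(b))
    return len(a_set & b_set) / max(len(a_set), 1)

def rerank(question, passages, w_base=1.0, w_overlap=2.0, w_cue=0.5):
    """Combine the base retrieval score with overlap and cue-word features."""
    scored = []
    for base_score, passage in passages:
        cues = sum(1 for w in tokens(passage) if w in CUE_WORDS)
        score = (w_base * base_score
                 + w_overlap * overlap(question, passage)
                 + w_cue * cues)
        scored.append((score, passage))
    return [p for _, p in sorted(scored, reverse=True)]

passages = [(1.2, "The bridge closed due to storm damage."),
            (1.5, "The bridge is 2 km long and was built in 1932.")]
print(rerank("Why was the bridge closed?", passages))
# The causal passage outranks the higher-scored but non-causal one.
```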

Journal ArticleDOI
TL;DR: Three modifications to the MT training data are presented to improve the accuracy of a state-of-the-art syntax MT system: re-structuring changes the syntactic structure of training parse trees to enable reuse of substructures; re-labeling alters bracket labels to enrich rule application context; and re-aligning unifies word alignment across sentences to remove bad word alignments and refine good ones.
Abstract: This article shows that the structure of bilingual material from standard parsing and alignment tools is not optimal for training syntax-based statistical machine translation (SMT) systems. We present three modifications to the MT training data to improve the accuracy of a state-of-the-art syntax MT system: re-structuring changes the syntactic structure of training parse trees to enable reuse of substructures; re-labeling alters bracket labels to enrich rule application context; and re-aligning unifies word alignment across sentences to remove bad word alignments and refine good ones. Better structures, labels, and word alignments are learned by the EM algorithm. We show that each individual technique leads to improvement as measured by BLEU, and we also show that the greatest improvement is achieved by combining them. We report an overall 1.48 BLEU improvement on the NIST08 evaluation set over a strong baseline in Chinese/English translation.

Journal ArticleDOI
TL;DR: A discriminative framework for word alignment based on a linear model that achieves state-of-the-art alignment quality on three word alignment shared tasks for five language pairs with varying divergence and richness of resources and improves translation performance for various statistical machine translation systems.
Abstract: Word alignment plays an important role in many NLP tasks as it indicates the correspondence between words in a parallel text. Although widely used to align large bilingual corpora, generative models are hard to extend to incorporate arbitrary useful linguistic information. This article presents a discriminative framework for word alignment based on a linear model. Within this framework, all knowledge sources are treated as feature functions, which depend on a source language sentence, a target language sentence, and the alignment between them. We describe a number of features that could produce symmetric alignments. Our model is easy to extend and can be optimized with respect to evaluation metrics directly. The model achieves state-of-the-art alignment quality on three word alignment shared tasks for five language pairs with varying divergence and richness of resources. We further show that our approach improves translation performance for various statistical machine translation systems.
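
A bare-bones sketch of the linear-model idea (feature functions and weights below are invented stand-ins for the knowledge sources the article describes; the real model is trained to optimize alignment metrics directly):

```python
# Score a candidate alignment as a weighted sum of feature functions of the
# source sentence, target sentence, and the links between them.
def features(src, tgt, alignment, lexicon):
    n_links = len(alignment)
    lex_matches = sum(1 for i, j in alignment if (src[i], tgt[j]) in lexicon)
    distortion = sum(abs(i - j) for i, j in alignment)
    return {"links": n_links, "lexicon": lex_matches, "distortion": distortion}

def score(src, tgt, alignment, weights, lexicon):
    f = features(src, tgt, alignment, lexicon)
    return sum(weights[name] * value for name, value in f.items())

src = ["das", "haus"]
tgt = ["the", "house"]
lexicon = {("das", "the"), ("haus", "house")}
weights = {"links": 0.5, "lexicon": 2.0, "distortion": -0.3}

candidates = [[(0, 0), (1, 1)], [(0, 1), (1, 0)]]
best = max(candidates, key=lambda a: score(src, tgt, a, weights, lexicon))
print(best)  # the monotone, lexicon-supported alignment wins
```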

Journal ArticleDOI
TL;DR: In this first study of novel blends, an accuracy of 40% is achieved on the task of inferring a blend's source words, which corresponds to a reduction in error rate of 39% over an informed baseline.
Abstract: Newly coined words pose problems for natural language processing systems because they are not in a system's lexicon, and therefore no lexical information is available for such words. A common way to form new words is lexical blending, as in cosmeceutical, a blend of cosmetic and pharmaceutical. We propose a statistical model for inferring a blend's source words drawing on observed linguistic properties of blends; these properties are largely based on the recognizability of the source words in a blend. We annotate a set of 1,186 recently coined expressions which includes 515 blends, and evaluate our methods on a 324-item subset. In this first study of novel blends we achieve an accuracy of 40% on the task of inferring a blend's source words, which corresponds to a reduction in error rate of 39% over an informed baseline. We also give preliminary results showing that our features for source word identification can be used to distinguish blends from other kinds of novel words.
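
A hedged sketch of candidate generation for blend source words (the tiny lexicon and the recognizability score are illustrative; the article's statistical model uses richer features):

```python
# Split the blend at each position, look up lexicon words sharing the prefix
# and suffix, and prefer splits where a large fraction of each source word
# remains recognizable.
LEXICON = ["cosmetic", "pharmaceutical", "costume", "cosmos", "critical"]

def candidate_sources(blend, lexicon=LEXICON):
    candidates = []
    for split in range(2, len(blend) - 1):
        prefix, suffix = blend[:split], blend[split:]
        firsts = [w for w in lexicon if w.startswith(prefix)]
        seconds = [w for w in lexicon if w.endswith(suffix)]
        for w1 in firsts:
            for w2 in seconds:
                # Recognizability proxy: fraction of each source word kept.
                score = len(prefix) / len(w1) + len(suffix) / len(w2)
                candidates.append((score, w1, w2))
    return sorted(candidates, reverse=True)

print(candidate_sources("cosmeceutical")[:3])
# Top candidate: ('cosmetic', 'pharmaceutical')
```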

Journal ArticleDOI
TL;DR: A multi-level approach to presenting user-tailored information in spoken dialogues is described which brings together for the first time multi-attribute decision models, strategic content planning, surface realization that incorporates prosody prediction, and unit selection synthesis that takes the resulting prosodic structure into account.
Abstract: Generating responses that take user preferences into account requires adaptation at all levels of the generation process. This article describes a multi-level approach to presenting user-tailored information in spoken dialogues which brings together for the first time multi-attribute decision models, strategic content planning, surface realization that incorporates prosody prediction, and unit selection synthesis that takes the resulting prosodic structure into account. The system selects the most important options to mention and the attributes that are most relevant to choosing between them, based on the user model. Multiple options are selected when each offers a compelling trade-off. To convey these trade-offs, the system employs a novel presentation strategy which straightforwardly lends itself to the determination of information structure, as well as the contents of referring expressions. During surface realization, the prosodic structure is derived from the information structure using Combinatory Categorial Grammar in a way that allows phrase boundaries to be determined in a flexible, data-driven fashion. This approach to choosing pitch accents and edge tones is shown to yield prosodic structures with significantly higher acceptability than baseline prosody prediction models in an expert evaluation. These prosodic structures are then shown to enable perceptibly more natural synthesis using a unit selection voice that aims to produce the target tunes, in comparison to two baseline synthetic voices. An expert evaluation and f0 analysis confirm the superiority of the generator-driven intonation and its contribution to listeners' ratings.

Journal ArticleDOI
TL;DR: This article uses the Posterior Regularization framework to incorporate complex constraints into probabilistic models during learning without changing the efficiency of the underlying model, and presents an efficient learning algorithm for incorporating approximate bijectivity and symmetry constraints.
Abstract: Word-level alignment of bilingual text is a critical resource for a growing variety of tasks. Probabilistic models for word alignment present a fundamental trade-off between richness of captured constraints and correlations versus efficiency and tractability of inference. In this article, we use the Posterior Regularization framework (Graca, Ganchev, and Taskar 2007) to incorporate complex constraints into probabilistic models during learning without changing the efficiency of the underlying model. We focus on the simple and tractable hidden Markov model, and present an efficient learning algorithm for incorporating approximate bijectivity and symmetry constraints. Models estimated with these constraints produce a significant boost in performance as measured by both precision and recall of manually annotated alignments for six language pairs. We also report experiments on two different tasks where word alignments are required: phrase-based machine translation and syntax transfer, and show promising improvements over standard methods.

Journal ArticleDOI
TL;DR: The model uses coarse-grained semantic classes for S internally, the effect of using different levels of granularity on WSD performance is explored, and performance on noun disambiguation is better than that of most previously reported unsupervised systems and close to the best supervised systems.
Abstract: We introduce a generative probabilistic model, the noisy channel model, for unsupervised word sense disambiguation. In our model, each context C is modeled as a distinct channel through which the speaker intends to transmit a particular meaning S using a possibly ambiguous word W. To reconstruct the intended meaning the hearer uses the distribution of possible meanings in the given context P(S|C) and possible words that can express each meaning P(W|S). We assume P(W|S) is independent of the context and estimate it using WordNet sense frequencies. The main problem of unsupervised WSD is estimating context-dependent P(S|C) without access to any sense-tagged text. We show one way to solve this problem using a statistical language model based on large amounts of untagged text. Our model uses coarse-grained semantic classes for S internally and we explore the effect of using different levels of granularity on WSD performance. The system outputs fine-grained senses for evaluation, and its performance on noun disambiguation is better than most previously reported unsupervised systems and close to the best supervised systems.
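
The decoding step follows directly from the noisy-channel decomposition: choose the sense S maximizing P(S|C) * P(W|S). A toy sketch with invented probabilities (the article estimates P(W|S) from WordNet sense frequencies and P(S|C) from a language model over untagged text):

```python
# Noisy-channel WSD decoding: pick the sense maximizing P(S|C) * P(W|S).
p_sense_given_context = {          # P(S | C) for a river context
    "financial_institution": 0.1,
    "sloping_ground": 0.9,
}
p_word_given_sense = {             # P(W = "bank" | S)
    "financial_institution": 0.6,
    "sloping_ground": 0.3,
}

def disambiguate(word_probs, context_probs):
    """Return the sense maximizing P(S|C) * P(W|S)."""
    return max(context_probs,
               key=lambda s: context_probs[s] * word_probs.get(s, 0.0))

print(disambiguate(p_word_given_sense, p_sense_given_context))
# -> 'sloping_ground' for "bank" in a river context
```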

Journal ArticleDOI
TL;DR: This article delimits what the focus of paraphrase extraction and coreference resolution tasks should be, and to what extent they can help each other, in terms of similarities and differences in their linguistic nature.
Abstract: By providing a better understanding of paraphrase and coreference in terms of similarities and differences in their linguistic nature, this article delimits what the focus of paraphrase extraction and coreference resolution tasks should be, and to what extent they can help each other. We argue for the relevance of this discussion to Natural Language Processing.

Journal ArticleDOI
TL;DR: This article introduces an alternative method of measuring the semantic distance between texts that integrates distributional information and ontological knowledge within a network flow formalism, and develops a new measure of semantic coherence that enables us to account for the performance difference across the three data sets.
Abstract: Many NLP applications entail that texts are classified based on their semantic distance (how similar or different the texts are). For example, comparing the text of a new document to that of documents of known topics can help identify the topic of the new text. Typically, a distributional distance is used to capture the implicit semantic distance between two pieces of text. However, such approaches do not take into account the semantic relations between words. In this article, we introduce an alternative method of measuring the semantic distance between texts that integrates distributional information and ontological knowledge within a network flow formalism. We first represent each text as a collection of frequency-weighted concepts within an ontology. We then make use of a network flow method which provides an efficient way of explicitly measuring the frequency-weighted ontological distance between the concepts across two texts. We evaluate our method in a variety of NLP tasks, and find that it performs well on two of three tasks. We develop a new measure of semantic coherence that enables us to account for the performance difference across the three data sets, shedding light on the properties of a data set that lends itself well to our method.

Journal ArticleDOI
Emiel Krahmer1
TL;DR: Language technology has been particularly successful for tasks where huge amounts of textual data is available to which statistical machine learning techniques can be applied, and mainstream computational linguistics is now a successful, application-oriented discipline which is particularly good at extracting information from sequences of words.
Abstract: Sometimes I am amazed by how much the field of computational linguistics has changed in the past 15 to 20 years. In the mid-nineties, I was working in a research institute where language and speech technologists worked in relatively close quarters. Speech technology seemed on the verge of a major breakthrough; this was around the time that Bill Gates was quoted in Business Week as saying that speech was not just the future of Windows, but the future of computing itself. At the same time, language technology was, well, nowhere. Bill Gates certainly wasn’t championing language technology in those days. And while the possible applications of speech technology seemed endless (who would use a keyboard in 2010, when speech-driven user interfaces would have replaced traditional computers?) the language people were thinking hard about possible applications for their admittedly somewhat immature technologies. Predicting the future is a tricky thing. No major breakthrough came for speech technology; I am still typing this. However, language technology did change almost beyond recognition. Perhaps one of the main reasons for this has been the explosive growth of the internet, which helped language technology in two different ways. On the one hand it instigated the development and refinement of techniques needed for searching in document collections of unprecedented size, on the other it resulted in a large increase of freely available text data. Recently, language technology has been particularly successful for tasks where huge amounts of textual data is available to which statistical machine learning techniques can be applied (Halevy, Norvig, and Pereira 2009). As a result of these developments, mainstream computational linguistics is now a successful, application-oriented discipline which is particularly good at extracting information from sequences of words. But there is more to language than that. For speakers, words are the result of a complex speech production process; for listeners they are what starts off the similarly complex comprehension process. However, in many current applications no attention is given to the processes by which words are produced nor to the processes by which they can be understood. Language is treated as a product not as a process, in the terminology of Clark (1996). In addition, we use language not only as a vehicle for factual information exchange; speakers may have all sorts of other intentions with their words; they may want to convince others to do or buy something, they may want to induce a particular emotion in the addressee, etc. These days, most of computational linguistics (with a few notable exceptions, more about which below) has little to say

Journal ArticleDOI
TL;DR: A more accurate description of the authors' work is presented, the straw man argument used in Sproat (2010) is pointed out, and a more complete characterization of the Indus script debate is provided.
Abstract: In a recent Last Words column (Sproat 2010), Richard Sproat laments the reviewing practices of “general science journals” after dismissing our work and that of Lee, Jonathan, and Ziman (2010) as “useless” and “trivially and demonstrably wrong.” Although we expect such categorical statements to have already raised some red flags in the minds of readers, we take this opportunity to present a more accurate description of our work, point out the straw man argument used in Sproat (2010), and provide a more complete characterization of the Indus script debate. A separate response by Lee and colleagues in this issue provides clarification of issues not covered here.

Journal ArticleDOI
TL;DR: Until recently nobody had argued that statistical techniques could be used to determine that a symbol system is linguistic, and it was therefore quite a surprise when a short article by Rajesh Rao of the University of Washington and colleagues appeared in Science.
Abstract: Few archaeological finds are as evocative as artifacts inscribed with symbols. Whenever an archaeologist finds a potsherd or a seal impression that seems to have symbols scratched or impressed on the surface, it is natural to want to "read" the symbols. And if the symbols come from an undeciphered or previously unknown symbol system it is common to ask what language the symbols supposedly represent and whether the system can be deciphered. Of course the first question that really should be asked is whether the symbols are in fact writing. A writing system, as linguists usually define it, is a symbol system that is used to represent language. Familiar examples are alphabets such as the Latin, Greek, Cyrillic or Hangul alphabets, alphasyllabaries such as Devanagari or Tamil, syllabaries such as Cherokee or Kana, and morphosyllabic systems like Chinese characters. But symbol systems that do not encode language abound: European heraldry, mathematical notation, labanotation (used to represent dance), and boy scout merit badges are all examples of symbol systems that represent things, but do not function as part of a system that represents language. Whether an unknown system is writing or not is a difficult question to answer. It can only be answered definitively in the affirmative if one can develop a verifiable decipherment into some language or languages. Statistical techniques have been used in decipherment for years, but these have always been used under the assumption that the system one is dealing with is writing, and the techniques are used to uncover patterns or regularities that might aid in the decipherment. Patterns of symbol distribution might suggest that a symbol system is not linguistic: for example, odd repetition patterns might make it seem that a symbol system is unlikely to be writing. But until recently nobody had argued that statistical techniques could be used to determine that a system is linguistic. It was therefore quite a surprise when, in April 2009, there appeared in Science a short article by Rajesh Rao of the University of Washington and colleagues.


Journal ArticleDOI
TL;DR: This article investigates whether the Brown et al. (1993) word alignment algorithm makes search errors when it computes Viterbi alignments, that is, whether it returns alignments that are sub-optimal according to a trained model.
Abstract: Word alignment is a critical procedure within statistical machine translation (SMT). Brown et al. (1993) have provided the most popular word alignment algorithm to date, one that has been implemented in the GIZA (Al-Onaizan et al., 1999) and GIZA++ (Och and Ney 2003) software and adopted by nearly every SMT project. In this article, we investigate whether this algorithm makes search errors when it computes Viterbi alignments, that is, whether it returns alignments that are sub-optimal according to a trained model.

Journal ArticleDOI
TL;DR: Initially, as the size of the text increases, the hapax/vocabulary ratio decreases; however, after the text size reaches about 3,000,000 words, the ratio starts to increase steadily, and a computer simulation shows that as the text size continues to increase, the ratio would approach 1.0.
Abstract: In the known literature, hapax legomena in an English text or a collection of texts roughly account for about 50% of the vocabulary. This sort of constancy is baffling. The 100-million-word British National Corpus was used to study this phenomenon. The result reveals that the hapax/vocabulary ratio follows a U-shaped pattern. Initially, as the size of text increases, the hapax/vocabulary ratio decreases; however, after the text size reaches about 3,000,000 words, the hapax/vocabulary ratio starts to increase steadily. A computer simulation shows that as the text size continues to increase, the hapax/vocabulary ratio would approach 1.
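
The statistic itself is straightforward to trace on any corpus; a short sketch follows (the synthetic Zipf-like data is only a placeholder, and the U-shaped pattern reported in the article emerges only on much larger, real corpora):

```python
# Trace the hapax/vocabulary ratio as a token stream grows.
import random
from collections import Counter

def hapax_ratio_curve(tokens, step=100_000):
    counts = Counter()
    curve = []
    for i, tok in enumerate(tokens, 1):
        counts[tok] += 1
        if i % step == 0:
            vocab = len(counts)
            hapaxes = sum(1 for c in counts.values() if c == 1)
            curve.append((i, hapaxes / vocab))
    return curve

# Example with synthetic Zipf-like data (illustrative only):
vocab = [f"w{k}" for k in range(1, 50_000)]
weights = [1.0 / k for k in range(1, 50_000)]
tokens = random.choices(vocab, weights=weights, k=500_000)
print(hapax_ratio_curve(tokens))
```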

Journal ArticleDOI
TL;DR: This book is a textbook on computational linguistics for science and engineering students; it also serves as practical documentation for the NLTK library; and, finally, it attempts to provide an introduction to programming and algorithm design for humanities students.

Journal ArticleDOI
TL;DR: In this speech, I hope you’ll find an appreciation for some of the ideas and where they came from, but also a trajectory that continues forward and suggests some solutions to problems not yet solved.
Abstract: Good morning. I want to thank the ACL for awarding me the 2010 Lifetime Achievement Award. I’m honored to be included in the ranks of my respected colleagues who have received this award previously. I want to talk to you this morning about the evolution of some ideas that I think are important, with a little bit of historical and biographical context thrown in. I hope you’ll find in what I say not only an appreciation for some of the ideas and where they came from, but also a trajectory that continues forward and suggests some solutions to problems not yet solved.

Journal ArticleDOI
TL;DR: In his article "Ancient symbols and computational linguistics" (Sproat 2010), Professor Sproat raised two concerns over a method: first, that the method is unable to detect random but non-equiprobable systems; and second, that it misclassifies kudurru texts.
Abstract: In his article "Ancient symbols and computational linguistics" (Sproat 2010), Professor Sproat raised two concerns over a method that we have proposed for analyzing small data sets of symbols using entropy (Lee, Jonathan, and Ziman 2010): first, that the method is unable to detect random but non-equiprobable systems; and second, that it misclassifies kudurru texts. We address these concerns in the following response.
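
For readers who want to see the kind of statistic at issue, here is a minimal conditional (bigram) entropy estimate for a small symbol sequence; the smoothing and corpus details in the original papers differ, so this is only a sketch:

```python
# Conditional (bigram) entropy of a symbol sequence from raw counts.
import math
from collections import Counter

def conditional_entropy(symbols):
    """H(next symbol | current symbol), estimated without smoothing."""
    bigrams = Counter(zip(symbols, symbols[1:]))
    unigrams = Counter(symbols[:-1])
    total = sum(bigrams.values())
    h = 0.0
    for (a, b), n_ab in bigrams.items():
        p_ab = n_ab / total
        p_b_given_a = n_ab / unigrams[a]
        h -= p_ab * math.log2(p_b_given_a)
    return h

print(conditional_entropy(list("ABABABAB")))      # 0.0: perfectly predictable
print(conditional_entropy(list("ABCDABDCACBD")))  # higher: less regular
```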

Journal ArticleDOI
TL;DR: This article presents a parsing algorithm that improves on the baseline parsing method and runs in polynomial time when both the fan-out and rank of the input grammar are bounded, and offers an optimal, efficient algorithm for factorizing a grammar to produce a strongly equivalent TL-MCTAG grammar with the rank minimized.
Abstract: Tree-Local Multi-Component Tree-Adjoining Grammar (TL-MCTAG) is an appealing formalism for natural language representation because it arguably allows the encapsulation of the appropriate domain of locality within its elementary structures. Its multicomponent structure allows modeling of lexical items that may ultimately have elements far apart in a sentence, such as quantifiers and wh-words. When used as the base formalism for a synchronous grammar, its flexibility allows it to express both the close relationships and the divergent structure necessary to capture the links between the syntax and semantics of a single language or the syntax of two different languages. Its limited expressivity provides constraints on movement and, we posit, may have generated additional popularity based on a misconception about its parsing complexity. Although TL-MCTAG was shown to be equivalent in expressivity to TAG when it was first introduced, the complexity of TL-MCTAG is still not well understood. This article offers a thorough examination of the problem of TL-MCTAG recognition, showing that even highly restricted forms of TL-MCTAG are NP-complete to recognize. However, in spite of the provable difficulty of the recognition problem, we offer several algorithms that can substantially improve processing efficiency. First, we present a parsing algorithm that improves on the baseline parsing method and runs in polynomial time when both the fan-out and rank of the input grammar are bounded. Second, we offer an optimal, efficient algorithm for factorizing a grammar to produce a strongly equivalent TL-MCTAG grammar with the rank of the grammar minimized.