Showing papers on "Phrase" published in 2008

Proceedings Article•DOI•

A Simple and Effective Hierarchical Phrase Reordering Model

Michel Galley¹, Christopher D. Manning¹•Institutions (1)

25 Oct 2008

TL;DR: This paper presents a novel hierarchical phrase reordering model aimed at improving non-local reorderings; it integrates seamlessly with a standard phrase-based system with little loss of computational efficiency.

Abstract: While phrase-based statistical machine translation systems currently deliver state-of-the-art performance, they remain weak on word order changes. Current phrase reordering models can properly handle swaps between adjacent phrases, but they typically lack the ability to perform the kind of long-distance re-orderings possible with syntax-based systems. In this paper, we present a novel hierarchical phrase reordering model aimed at improving non-local reorderings, which seamlessly integrates with a standard phrase-based system with little loss of computational efficiency. We show that this model can successfully handle the key examples often used to motivate syntax-based systems, such as the rotation of a prepositional phrase around a noun phrase. We contrast our model with reordering models commonly used in phrase-based systems, and show that our approach provides statistically significant BLEU point gains for two language pairs: Chinese-English (+0.53 on MT05 and +0.71 on MT08) and Arabic-English (+0.55 on MT05).

346 citations

Journal Article•DOI•

Syntactic priming persists while the lexical boost decays: Evidence from written and spoken dialogue

Robert J. Hartsuiker¹, Sarah Bernolet¹, Sofie Schoonbaert¹, Sara Speybroeck², Dieter Vanderelst•Institutions (2)

Ghent University¹, Katholieke Universiteit Leuven²

01 Feb 2008-Journal of Memory and Language

TL;DR: This article showed that syntactic priming is a form of implicit learning, and that lexically based, short-term mechanisms operating in tandem with abstract, longer-term learning mechanisms can explain the full pattern of results.

264 citations

Proceedings Article•DOI•

Syntactic Constraints on Paraphrases Extracted from Parallel Corpora

Chris Callison-Burch¹•Institutions (1)

Johns Hopkins University¹

25 Oct 2008

TL;DR: This work improves the quality of paraphrases extracted from parallel corpora by requiring that phrases and their paraphrases be the same syntactic type and by altering the phrase extraction algorithm to extract phrase labels alongside bilingual phrase pairs.

Abstract: We improve the quality of paraphrases extracted from parallel corpora by requiring that phrases and their paraphrases be the same syntactic type. This is achieved by parsing the English side of a parallel corpus and altering the phrase extraction algorithm to extract phrase labels alongside bilingual phrase pairs. In order to retain broad coverage of non-constituent phrases, complex syntactic labels are introduced. A manual evaluation indicates a 19% absolute improvement in paraphrase quality over the baseline method.

207 citations
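
The same-syntactic-type constraint described in the abstract above can be illustrated with a toy filter. This is only a sketch under stated assumptions: it supposes a pivot-style setup in which two English phrases are paraphrase candidates when they share a foreign translation, which the abstract does not spell out. The `PhrasePair` tuple, the `paraphrases` function, and the table entries are hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from typing import NamedTuple


class PhrasePair(NamedTuple):
    english: str  # English phrase
    foreign: str  # aligned foreign phrase (assumed pivot)
    label: str    # syntactic label of the English phrase, e.g. from a parser


def paraphrases(pairs, phrase, label):
    """Return English phrases sharing both a foreign pivot and a label with `phrase`."""
    # Index the phrase table by foreign phrase.
    by_foreign = defaultdict(set)
    for pair in pairs:
        by_foreign[pair.foreign].add((pair.english, pair.label))

    found = set()
    for pair in pairs:
        if pair.english == phrase and pair.label == label:
            for english, other_label in by_foreign[pair.foreign]:
                # Syntactic constraint: keep only candidates with the same label.
                if english != phrase and other_label == label:
                    found.add(english)
    return sorted(found)


# Hypothetical toy phrase table (invented entries, not data from the paper).
TABLE = [
    PhrasePair("under control", "unter kontrolle", "PP"),
    PhrasePair("in check", "unter kontrolle", "PP"),
    PhrasePair("checked", "unter kontrolle", "VBN"),
]

print(paraphrases(TABLE, "under control", "PP"))
```

Without the label check, "checked" would also be returned for "under control", which is the kind of mismatched paraphrase the constraint is meant to exclude.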

Journal Article•DOI•

Semantic indeterminacy in object relative clauses

Silvia P. Gennari¹, Maryellen C. MacDonald²•Institutions (2)

University of York¹, University of Wisconsin-Madison²

01 Feb 2008-Journal of Memory and Language

TL;DR: The results showed that comprehension difficulty was modulated by animacy configuration and voice, and differences were well correlated with the availability of alternative interpretations as the relative clause unfolds, as revealed by the completion data.

204 citations

Journal Article•DOI•

German Children’s Comprehension of Word Order and Case Marking in Causative Sentences

Miriam Dittmar¹, Kirsten Abbot-Smith², Elena Lieven¹, Michael Tomasello¹•Institutions (2)

Max Planck Society¹, University of Plymouth²

01 Jul 2008-Child Development

TL;DR: Findings suggest that prototypical instances of linguistic constructions with redundant grammatical marking play a special role in early acquisition, and only later do children isolate and weigh individual grammatical cues appropriately.

Abstract: Two comprehension experiments were conducted to investigate whether German children are able to use the grammatical cues of word order and word endings (case markers) to identify agents and patients in a causative sentence and whether they weigh these two cues differently across development. Two-year-olds correctly understood only sentences with both cues supporting each other--the prototypical form. Five-year-olds were able to use word order by itself but not case markers. Only 7-year-olds behaved like adults by relying on case markers over word order when the two cues conflicted. These findings suggest that prototypical instances of linguistic constructions with redundant grammatical marking play a special role in early acquisition, and only later do children isolate and weigh individual grammatical cues appropriately.

196 citations

Journal Article•DOI•

The emotional weight of I love you in multilinguals' languages

Jean-Marc Dewaele¹•Institutions (1)

Birkbeck, University of London¹

01 Oct 2008-Journal of Pragmatics

TL;DR: In this article, the perceived emotional weight of the phrase I love you in multilinguals' different languages was investigated and found to be associated with self-perceived language dominance, context of acquisition of the second language, age of onset of learning the first language, degree of socialization in the second language, nature of the network of interlocutors in the L2, and self-perceived oral proficiency.

185 citations

Journal Article•DOI•

Efficient Phrase-Based Document Similarity for Clustering

Hung Chim¹, Xiaotie Deng¹•Institutions (1)

City University of Hong Kong¹

01 Sep 2008-IEEE Transactions on Knowledge and Data Engineering

TL;DR: A phrase-based document similarity is applied to the group-average Hierarchical Agglomerative Clustering (HAC) algorithm to develop a new clustering approach, which is very effective at clustering the documents of two standard benchmark corpora, OHSUMED and RCV1.

Abstract: In this paper, we propose a phrase-based document similarity to compute the pair-wise similarities of documents based on the suffix tree document (STD) model. By mapping each node in the suffix tree of the STD model into a unique feature term in the vector space document (VSD) model, the phrase-based document similarity naturally inherits the term tf-idf weighting scheme in computing the document similarity with phrases. We apply the phrase-based document similarity to the group-average Hierarchical Agglomerative Clustering (HAC) algorithm and develop a new document clustering approach. Our evaluation experiments indicate that the new clustering approach is very effective at clustering the documents of two standard document benchmark corpora, OHSUMED and RCV1. The quality of the clustering results significantly surpasses the results of the traditional single-word tf-idf similarity measure in the same HAC algorithm, especially in large document data sets. Furthermore, by studying the properties of the STD model, we conclude that the feature vector of phrase terms in the STD model can be considered as an expanded feature vector of the traditional single-word terms in the VSD model. This conclusion sufficiently explains why the phrase-based document similarity works much better than the single-word tf-idf similarity measure.

176 citations
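
The idea in the abstract above, that phrase terms extend a single-word tf-idf vector, can be sketched in a few lines. This is a simplified illustration rather than the authors' method: contiguous word n-grams stand in for suffix tree nodes, and the weighting `(1 + log tf) * log(1 + N / df)` is one common tf-idf variant chosen here as an assumption. The function names are hypothetical.

```python
import math
from collections import Counter


def phrase_terms(text, max_len=3):
    """All contiguous word sequences of 1..max_len words (stand-in for suffix tree nodes)."""
    words = text.lower().split()
    return [
        " ".join(words[i:i + n])
        for n in range(1, max_len + 1)
        for i in range(len(words) - n + 1)
    ]


def tfidf_vectors(docs, max_len=3):
    """Map each document to a sparse {term: weight} vector over phrase terms."""
    term_counts = [Counter(phrase_terms(doc, max_len)) for doc in docs]
    # Document frequency: number of documents containing each term.
    df = Counter(term for counts in term_counts for term in counts)
    n_docs = len(docs)
    return [
        {t: (1 + math.log(tf)) * math.log(1 + n_docs / df[t]) for t, tf in counts.items()}
        for counts in term_counts
    ]


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


docs = ["new york times", "times new york"]
words_only = tfidf_vectors(docs, max_len=1)
with_phrases = tfidf_vectors(docs, max_len=3)
print(cosine(words_only[0], words_only[1]))      # same bag of words
print(cosine(with_phrases[0], with_phrases[1]))  # word order now matters
```

With `max_len=1` the two toy documents are indistinguishable, while the phrase terms lower their similarity because only some multi-word sequences are shared.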

Book Chapter•DOI•

A Hybrid Approach to Word Segmentation of Vietnamese Texts

Le Hong Phuong, Nguyên Thi Minh Huyên¹, Azim Roussanaly, Hô Tuòng Vinh•Institutions (1)

Vietnam National University, Hanoi¹

01 Jun 2008

TL;DR: The approach combines finite-state automata, regular expression parsing, and a maximal-matching strategy augmented by statistical methods to resolve segmentation ambiguities; it is implemented as vnTokenizer, a highly accurate tokenizer for Vietnamese texts.

Abstract: We present in this article a hybrid approach to automatically tokenize Vietnamese text. The approach combines finite-state automata techniques, regular expression parsing, and the maximal-matching strategy, which is augmented by statistical methods to resolve ambiguities of segmentation. The Vietnamese lexicon in use is compactly represented by a minimal finite-state automaton. A text to be tokenized is first parsed into lexical phrases and other patterns using pre-defined regular expressions. The automaton is then deployed to build linear graphs corresponding to the phrases to be segmented. The application of a maximal-matching strategy on a graph results in all candidate segmentations of a phrase. It is the responsibility of an ambiguity resolver, which uses a smoothed bigram language model, to choose the most probable segmentation of the phrase. The hybrid approach is implemented to create vnTokenizer, a highly accurate tokenizer for Vietnamese texts.

153 citations
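
The last two steps in the abstract above (candidate segmentations from maximal matching, then a bigram model to pick one) can be sketched as follows. This is not vnTokenizer's implementation. It assumes that maximal matching means keeping the segmentations with the fewest words, it uses a plain Python set instead of a finite-state automaton, and add-one smoothing stands in for whatever smoothing the authors used. The lexicon and counts are invented, and tone marks are omitted to keep the toy data ASCII-only.

```python
import math
from collections import Counter


def candidate_segmentations(syllables, lexicon, max_len=4):
    """Enumerate every split of `syllables` into lexicon words.

    Single syllables are always accepted so that at least one path exists.
    """
    n = len(syllables)
    results = []

    def walk(start, path):
        if start == n:
            results.append(list(path))
            return
        for end in range(start + 1, min(n, start + max_len) + 1):
            word = " ".join(syllables[start:end])
            if end == start + 1 or word in lexicon:
                path.append(word)
                walk(end, path)
                path.pop()

    walk(0, [])
    return results


def maximal_matching(candidates):
    """Keep the segmentations with the fewest words (assumed reading of maximal matching)."""
    fewest = min(len(c) for c in candidates)
    return [c for c in candidates if len(c) == fewest]


def bigram_logprob(words, unigrams, bigrams):
    """Log-probability of a word sequence under an add-one smoothed bigram model."""
    vocab_size = len(unigrams)
    logp, prev = 0.0, "<s>"
    for word in words:
        logp += math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size))
        prev = word
    return logp


def segment(syllables, lexicon, unigrams, bigrams):
    """Resolve ambiguity among the shortest segmentations with the bigram model."""
    shortest = maximal_matching(candidate_segmentations(syllables, lexicon))
    return max(shortest, key=lambda c: bigram_logprob(c, unigrams, bigrams))


# Hypothetical toy data: "hoc sinh" (student), "hoc" (to study), "sinh hoc" (biology).
SYLLABLES = ["hoc", "sinh", "hoc", "sinh", "hoc"]
LEXICON = {"hoc sinh", "sinh hoc"}
UNIGRAMS = Counter({"<s>": 5, "hoc sinh": 4, "hoc": 4, "sinh hoc": 3})
BIGRAMS = Counter({("<s>", "hoc sinh"): 4, ("hoc sinh", "hoc"): 3, ("hoc", "sinh hoc"): 3})

print(segment(SYLLABLES, LEXICON, UNIGRAMS, BIGRAMS))
```

Three segmentations tie at three words each, so the lexicon alone cannot decide; the bigram counts are what select the "student / studies / biology" reading in this toy setup.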

Journal Article•DOI•

Automatic Prosodic Event Detection Using Acoustic, Lexical, and Syntactic Evidence

Sankaranarayanan Ananthakrishnan¹, Shrikanth S. Narayanan¹•Institutions (1)

University of Southern California¹

01 Jan 2008-IEEE Transactions on Audio, Speech, and Language Processing

TL;DR: This paper builds an automatic detector and classifier for prosodic events in American English, based on their acoustic, lexical, and syntactic correlates, and focuses on accent (prominence, or "stress") and prosodic phrase boundary detection at the syllable level.

Abstract: With the advent of prosody annotation standards such as tones and break indices (ToBI), speech technologists and linguists alike have been interested in automatically detecting prosodic events in speech. This is because the prosodic tier provides an additional layer of information over the short-term segment-level features and lexical representation of an utterance. As the prosody of an utterance is closely tied to its syntactic and semantic content in addition to its lexical content, knowledge of the prosodic events within and across utterances can assist spoken language applications such as automatic speech recognition and translation. On the other hand, corpora annotated with prosodic events are useful for building natural-sounding speech synthesizers. In this paper, we build an automatic detector and classifier for prosodic events in American English, based on their acoustic, lexical, and syntactic correlates. Following previous work in this area, we focus on accent (prominence, or "stress") and prosodic phrase boundary detection at the syllable level. Our experiments achieved a performance rate of 86.75% agreement on the accent detection task, and 91.61% agreement on the phrase boundary detection task on the Boston University Radio News Corpus.

153 citations

Patent•

Context-sensitive searches and functionality for instant messaging applications

John S. Holmes¹, Heather Ferguson¹, Adam C. Czeisler¹, Joshua T. Goodman¹•Institutions (1)

Microsoft¹

24 Jan 2008

TL;DR: In an instant messaging application, a conversation is analyzed, contextually or textually relevant keywords and phrases are identified, and they are highlighted in a visually identifiable manner for selection by an individual participating in the conversation.

Abstract: In the context of an instant messaging application, a conversation is analyzed and contextually or textually relevant keywords and/or phrases are identified. Keywords or phrases are highlighted in a visually-identifiable manner for selection by an individual participating in the conversation. Once selected by an individual, a user interface is presented and exposes various contextually- or textually-relevant material or functionality that pertains to the selected word or phrase. An individual can also manually select a word or phrase to access the user interface. At least some of this relevant material or functionality is presented to the user in the context of the instant messaging application and in a manner in which it can be consumed by the individual within the instant messaging application itself.

150 citations

Journal Article•DOI•

Multilevel modeling of between-speaker and within-speaker variation in spontaneous speech tempo.

Hugo Quené¹•Institutions (1)

Utrecht University¹

01 Feb 2008-Journal of the Acoustical Society of America

TL;DR: Investigation of a corpus of spoken Dutch consisting of interviews with 160 high-school teachers shows that speech tempo depends mainly on phrase length, due to anticipatory shortening, and on the speaker's country, with different speaking styles in The Netherlands and in Flanders.

Abstract: Speech tempo (articulation rate) varies both between and within speakers. The present study investigates several factors affecting tempo in a corpus of spoken Dutch, consisting of interviews with 160 high-school teachers. Speech tempo was observed for each phrase separately, and analyzed by means of multilevel modeling of the speaker's sex, age, country, and dialect region (between speakers) and length, sequential position of phrase, and autocorrelated tempo (within speakers). Results show that speech tempo in this corpus depends mainly on phrase length, due to anticipatory shortening, and on the speaker's country, with different speaking styles in The Netherlands (faster, less varied) and in Flanders (slower, more varied). Additional analyses showed that phrase length itself is shorter in The Netherlands than in Flanders, and decreases with speaker's age. Older speakers tend to vary their phrase length more (within speakers), perhaps due to their accumulated verbal proficiency.

Journal Article•DOI•

Learn locally, act globally: learning language from variation set cues.

Luca Onnis¹, Heidi Waterfall², Shimon Edelman²•Institutions (2)

University of Hawaii¹, Cornell University²

01 Dec 2008-Cognition

TL;DR: The benefits of variation set structure are demonstrated directly: in miniature artificial languages, arranging a certain proportion of utterances in a training corpus in variation sets facilitated word and phrase constituent learning in adults.

Journal Article•DOI•

Abstract sentence representations in 3-year-olds: Evidence from language production and comprehension

Giulia Bencini¹, Virginia Valian¹•Institutions (1)

City University of New York¹

01 Jul 2008-Journal of Memory and Language

TL;DR: This article used syntactic priming to test the abstractness of sentence representations of young 3-year-olds (35-42 months) and found that children who were primed with passives produced more passives than did children primed with actives.

Proceedings Article•DOI•

A Phrase-Based Alignment Model for Natural Language Inference

Bill MacCartney¹, Michel Galley¹, Christopher D. Manning¹•Institutions (1)

Stanford University¹

25 Oct 2008

TL;DR: The MANLI system is presented, a new NLI aligner designed to address the alignment problem, which uses a phrase-based alignment representation, exploits external lexical resources, and capitalizes on a new set of supervised training data.

Abstract: The alignment problem---establishing links between corresponding phrases in two related sentences---is as important in natural language inference (NLI) as it is in machine translation (MT). But the tools and techniques of MT alignment do not readily transfer to NLI, where one cannot assume semantic equivalence, and for which large volumes of bitext are lacking. We present a new NLI aligner, the MANLI system, designed to address these challenges. It uses a phrase-based alignment representation, exploits external lexical resources, and capitalizes on a new set of supervised training data. We compare the performance of MANLI to existing NLI and MT aligners on an NLI alignment task over the well-known Recognizing Textual Entailment data. We show that MANLI significantly outperforms existing aligners, achieving gains of 6.2% in F1 over a representative NLI aligner and 10.5% over GIZA++.

Journal Article•DOI•

Bootstrapping lexical and syntactic acquisition.

Anne Christophe¹, Séverine Millotte¹, Savita Bernal¹, Jeffrey Lidz²•Institutions (2)

École Normale Supérieure¹, University of Maryland, College Park²

01 Mar 2008-Language and Speech

TL;DR: Experimental results show that infants have access to intermediate prosodic phrases during the first year of life and use these to constrain lexical segmentation; adult results are also presented that test the plausibility of the hypothesis that infants build a partial syntactic structure from phonological phrase boundaries and function words.

Abstract: This paper focuses on how phrasal prosody and function words may interact during early language acquisition. Experimental results show that infants have access to intermediate prosodic phrases (phonological phrases) during the first year of life, and use these to constrain lexical segmentation. These same intermediate prosodic phrases are used by adults to constrain on-line syntactic analysis. In addition, by two years of age infants can exploit function words to infer the syntactic category of unknown content words (nouns vs. verbs) and guess their plausible meaning (object vs. action). We speculate on how infants may build a partial syntactic structure by relying on both phonological phrase boundaries and function words, and present adult results that test the plausibility of this hypothesis. These results are tied together within a model of the architecture of the first stages of language processing, and their acquisition.

DOI•

Different phrasal prominence realizations in VO and OV languages

Marina Nespor, Mohinish Shukla, Ruben van de Vijver, Cinzia Avesani, Hanna Schraudolf, Caterina Donati

01 Jan 2008

TL;DR: This paper hypothesizes that the iambic-trochaic law determines the physical realization of main prominence within phonological phrases that contain more than one word, and shows this to be the case both across languages and within a language.

Abstract: How do infants start learning the syntax of the language they are exposed to? In this paper, we examine a plausible mechanism for the acquisition of the relative order of heads and complements. We hypothesize that the iambic-trochaic law determines the physical realization of main prominence within phonological phrases that contain more than one word: if it is realized mainly through pitch and intensity, it is in a phonological phrase that is stress-initial and has a complement-head structure, otherwise it is in a phonological phrase that is stress-final and has a head-complement structure. We show this to be the case both across languages (French and Turkish), and within a language (German, where both orders of head and complement are found). Our finding allows us to consider a psychologically plausible mechanism for the acquisition of the relative order of heads and complements, one of the basic properties of syntax. Because the mechanism is based on auditory perception, it can be utilized before any knowledge of words, thus accounting for the flawlessness in infants' first words combinations.

Patent•

System and method for targeted advertising

Patrick Jason Morrison¹•Institutions (1)

AT&T¹

24 Oct 2008

TL;DR: A method of receiving an audio stream containing user speech from a first device, generating text based on the user speech, identifying a key phrase in the text, receiving from an advertiser an advertisement related to the identified key phrase, and displaying the advertisement, which can be displayed after the audio stream terminates.

Abstract: Disclosed is a method of receiving an audio stream containing user speech from a first device, generating text based on the user speech, identifying a key phrase in the text, receiving from an advertiser an advertisement related to the identified key phrase, and displaying the advertisement. The method can include receiving from an advertiser a set of rules associated with the advertisement and displaying the advertisement in accordance with the associated set of rules. The method can display the advertisement on one or both of a first device and a second device. A central server can generate text based on the speech. A key phrase in the text can be identified based on a confidence score threshold. The advertisement can be displayed after the audio stream terminates.

Proceedings Article•

A Tree Sequence Alignment-based Tree-to-Tree Translation Model

Min Zhang¹, Hongfei Jiang², Aiti Aw³, Haizhou Li¹, Chew Lim Tan², Sheng Li¹•Institutions (3)

Agency for Science, Technology and Research¹, Harbin Institute of Technology², National University of Singapore³

01 Jun 2008

TL;DR: A translation model that is based on tree sequence alignment, where a tree sequence refers to a single sequence of subtrees that covers a phrase, that statistically significantly outperforms the baseline systems and supports multi-level structure reordering of tree typology with larger span.

Abstract: This paper presents a translation model that is based on tree sequence alignment, where a tree sequence refers to a single sequence of subtrees that covers a phrase. The model leverages the strengths of both phrase-based and linguistically syntax-based methods. It automatically learns aligned tree sequence pairs with mapping probabilities from word-aligned biparsed parallel texts. Compared with previous models, it not only captures non-syntactic phrases and discontinuous phrases with linguistically structured features, but also supports multi-level structure reordering of tree typology with larger span. This gives our model stronger expressive power than other reported models. Experimental results on the NIST MT-2005 Chinese-English translation task show that our method statistically significantly outperforms the baseline systems.

Patent•

Phrase based snippet generation

Sasha Blair-Goldensohn¹, Kerry Hannan¹, Ryan McDonald¹, Tyler Neylon¹, Jeffrey C. Reynar¹•Institutions (1)

Google¹

25 Jan 2008

TL;DR: A method, a system and a computer product for generating a snippet for an entity, wherein each snippet comprises a plurality of sentiments about the entity: textual reviews associated with the entity are selected, sentiment phrases are identified from them, and one or more of those sentiment phrases are selected to generate a snippet.

Abstract: Disclosed herein is a method, a system and a computer product for generating a snippet for an entity, wherein each snippet comprises a plurality of sentiments about the entity. One or more textual reviews associated with the entity is selected. A plurality of sentiment phrases are identified based on the one or more textual reviews, wherein each sentiment phrase comprises a sentiment about the entity. One or more sentiment phrases from the plurality of sentiment phrases are selected to generate a snippet.

Proceedings Article•DOI•

Generating Chinese Couplets using a Statistical MT Approach

Long Jiang¹, Ming Zhou¹•Institutions (1)

Microsoft¹

18 Aug 2008

TL;DR: A phrase-based SMT approach to generate the second sentence of Chinese couplets, where corresponding words in the two sentences match each other by obeying certain constraints on semantic, syntactic, and lexical relatedness.

Abstract: Part of the unique cultural heritage of China is the game of Chinese couplets (duilian). One person challenges the other person with a sentence (first sentence). The other person then replies with a sentence (second sentence) equal in length and word segmentation, in a way that corresponding words in the two sentences match each other by obeying certain constraints on semantic, syntactic, and lexical relatedness. This task is viewed as a difficult problem in AI and has not been explored in the research community. In this paper, we regard this task as a kind of machine translation process. We present a phrase-based SMT approach to generate the second sentence. First, the system takes as input the first sentence, and generates as output an N-best list of proposed second sentences, using a phrase-based SMT decoder. Then, a set of filters is used to remove candidates violating linguistic constraints. Finally, a Ranking SVM is applied to rerank the candidates. A comprehensive evaluation, using both human judgments and BLEU scores, has been conducted, and the results demonstrate that this approach is very successful.

Proceedings Article•DOI•

Four Techniques for Online Handling of Out-of-Vocabulary Words in Arabic-English Statistical Machine Translation

Nizar Habash¹•Institutions (1)

Columbia University¹

16 Jun 2008

TL;DR: Four techniques for online handling of Out-of-Vocabulary words in Phrase-based Statistical Machine Translation using spelling expansion, morphological expansion, dictionary term expansion and proper name transliteration to reuse or extend a phrase table are presented.

Abstract: We present four techniques for online handling of Out-of-Vocabulary words in Phrase-based Statistical Machine Translation. The techniques use spelling expansion, morphological expansion, dictionary term expansion and proper name transliteration to reuse or extend a phrase table. We compare the performance of these techniques and combine them. Our results show a consistent improvement over a state-of-the-art baseline in terms of BLEU and a manual error analysis.

Journal Article•DOI•

Acoustical cues and grammatical units in speech to two preverbal infants.

Melanie Soderstrom¹, Megan Blossom², Rina Foygel¹, James L. Morgan¹•Institutions (2)

Brown University¹, University of Kansas²

01 Nov 2008-Journal of Child Language

TL;DR: Examining the syntactic and prosodic characteristics of the maternal speech to two infants between six and ten months finds infant-directed speech to be characterized by generally short utterances, isolated words and phrases, and large numbers of questions, but longer utterances are also found.

Abstract: The current study examines the syntactic and prosodic characteristics of the maternal speech to two infants between six and ten months. Consistent with previous work, we find infant-directed speech to be characterized by generally short utterances, isolated words and phrases, and large numbers of questions, but longer utterances are also found. Prosodic information provides cues to grammatical units not only at utterance boundaries, but also at utterance-internal clause boundaries. Subject-verb phrase boundaries in questions also show reliable prosodic cues, although those of declaratives do not. Prosodic information may thus play an important role in providing preverbal infants with information about the grammatically relevant word groupings. Furthermore, questions may play an important role in infants' discovery of verb phrases in English.

Patent•

Community Translation On A Social Network

Yishan Wong¹, Stephen M. Grimm, Nicolas Vera, Marcel Laverdet, Ting Yin Kwan, Christopher W. Putnam, Javier Olivan-Lopez, Katherine Losse, Rebekah Cox, Chad Little•Institutions (1)

Facebook¹

05 Dec 2008

TL;DR: Translations of text phrases, which include content displayed in a social networking system such as content from social networking objects, are received from members of the social network; a member who is provided with content including a text phrase in a first language can request its translation into another language.

Abstract: Embodiments of the invention provide techniques for translating text in a social network. In one embodiment, translations of text phrases are received from members of the social network. These text phrases include content displayed in a social networking system, such as content from social networking objects. A particular member is provided with content including a text phrase in a first language, and the member requests translation into another language. Responsive to this request, a translation of the text phrase is selected from a set of available translations. The selection is based on actions by friends of the member in the social network, the actions being associated with the set of available translations. These actions can be the viewing of or approval of translations by the friends, for example. The selected translation is then presented to the member requesting the translation.

Phrase Detectives: A Web-based collaborative annotation game

Jon Chamberlain¹, Massimo Poesio¹, Udo Kruschwitz¹•Institutions (1)

University of Essex¹

01 Jan 2008

TL;DR: The first version of Phrase Detectives is presented, to the authors' knowledge the first game designed for collaborative linguistic annotation on the Web; applying this method to linguistic annotation tasks like anaphoric annotation requires methods for presenting text and identifying the components that need to be annotated.

Abstract: Annotated corpora of the size needed for modern computational linguistics research cannot be created by small groups of hand annotators. One solution is to exploit collaborative work on the Web and one way to do this is through games like the ESP game. Applying this methodology however requires developing methods for teaching subjects the rules of the game and evaluating their contribution while maintaining the game entertainment. In addition, applying this method to linguistic annotation tasks like anaphoric annotation requires developing methods for presenting text and identifying the components of the text that need to be annotated. In this paper we present the first version of Phrase Detectives (http://www.phrasedetectives.org), to our knowledge the first game designed for collaborative linguistic annotation on the Web.

Proceedings Article•DOI•

A Systematic Comparison of Phrase-Based, Hierarchical and Syntax-Augmented Statistical MT

Andreas Zollmann¹, Ashish Venugopal¹, Franz Josef Och¹, Jay Ponte¹•Institutions (1)

Google¹

18 Aug 2008

TL;DR: This work investigates the source of the improvements in translation quality reported when using two PSCFG translation models (hierarchical and syntax-augmented), when extending a state-of-the-art phrase-based baseline that serves as the lexical support for both PSCFG models.

Abstract: Probabilistic synchronous context-free grammar (PSCFG) translation models define weighted transduction rules that represent translation and reordering operations via nonterminal symbols. In this work, we investigate the source of the improvements in translation quality reported when using two PSCFG translation models (hierarchical and syntax-augmented), when extending a state-of-the-art phrase-based baseline that serves as the lexical support for both PSCFG models. We isolate the impact on translation quality for several important design decisions in each model. We perform this comparison on three NIST language translation tasks: Chinese-to-English, Arabic-to-English and Urdu-to-English, each representing unique challenges.

Arabic Named Entity Recognition using Conditional Random Fields

Yassine Benajiba¹, Paolo Rosso¹•Institutions (1)

Polytechnic University of Valencia¹

01 May 2008

TL;DR: A further attempt to enhance the accuracy of ANERsys by changing the probabilistic model from Maximum Entropy to Conditional Random Fields, which helped to improve the results significantly.

Abstract: The Named Entity Recognition (NER) task consists of identifying and classifying proper names within an open-domain text. This Natural Language Processing task has proved harder for languages with a complex morphology, such as Arabic. NER has also been shown to help Natural Language Processing tasks such as Machine Translation, Information Retrieval and Question Answering achieve higher performance. In our previous work we presented the first and second versions of ANERsys, an Arabic Named Entity Recognition system, whose performance we improved by more than 10 points from the first to the second version by adopting a different architecture and using additional information such as Part-Of-Speech tags and Base Phrase Chunks. In this paper, we present a further attempt to enhance the accuracy of ANERsys by changing the probabilistic model from Maximum Entropy to Conditional Random Fields, which helped to improve the results significantly.

Journal Article•DOI•

The Extension of Dependency Beyond the Sentence

Marianne Mithun

01 Jan 2008-Language

TL;DR: This article examines several grammatical developments that have received relatively little attention but may be more pervasive than previously recognized, involving the functional extension of markers of grammatical dependency from sentence-level syntax into larger discourse and pragmatic domains.

Abstract: This article examines several grammatical developments that have received relatively little attention, but that may be more pervasive than previously recognized. They involve the functional extension of markers of grammatical dependency from sentence-level syntax into larger discourse and pragmatic domains. Such developments are first illustrated with material from Navajo and Central Alaskan Yup'ik, then surveyed more briefly in several other unrelated languages. In some cases, secondary effects of such changes can reshape basic clause structure. An awareness of these processes can accordingly aid in understanding certain recurring but hitherto unexplained arrays of basic morphological and syntactic patterns, exemplified here with cases of homophonous grammatical markers and of ergative/accusative splits. Like developments described by Gildea (1997, 1998) and Evans (2007), they involve the use of dependent clauses as independent sentences, but the processes described here differ from those in both the mechanisms at work and their results.

Journal Article•DOI•

HMM Word and Phrase Alignment for Statistical Machine Translation

Yonggang Deng¹, William Byrne¹•Institutions (1)

IBM¹

01 Mar 2008-IEEE Transactions on Audio, Speech, and Language Processing

TL;DR: In analyzing alignment performance, Chinese-English word alignments are shown to be comparable to those of IBM Model 4 even when models are trained over large parallel texts.

Abstract: Estimation and alignment procedures for word and phrase alignment hidden Markov models (HMMs) are developed for the alignment of parallel text. The development of these models is motivated by an analysis of the desirable features of IBM Model 4, one of the original and most effective models for word alignment. These models are formulated to capture the desirable aspects of Model 4 in an HMM alignment formalism. Alignment behavior is analyzed and compared to human-generated reference alignments, and the ability of these models to capture different types of alignment phenomena is evaluated. In analyzing alignment performance, Chinese-English word alignments are shown to be comparable to those of IBM Model 4 even when models are trained over large parallel texts. In translation performance, phrase-based statistical machine translation systems based on these HMM alignments can equal and exceed systems based on Model 4 alignments, and this is shown in Arabic-English and Chinese-English translation. These alignment models can also be used to generate posterior statistics over collections of parallel text, and this is used to refine and extend phrase translation tables with a resulting improvement in translation quality.

Proceedings Article•DOI•

Sampling Alignment Structure under a Bayesian Translation Model

John DeNero¹, Alexandre Bouchard-Côté¹, Dan Klein¹•Institutions (1)

University of California, Berkeley¹

25 Oct 2008

TL;DR: This work describes the first tractable Gibbs sampling procedure for estimating phrase pair frequencies under a probabilistic model of phrase alignment and proposes and evaluates two nonparametric priors that successfully avoid the degenerate behavior noted in previous work.

Abstract: We describe the first tractable Gibbs sampling procedure for estimating phrase pair frequencies under a probabilistic model of phrase alignment. We propose and evaluate two nonparametric priors that successfully avoid the degenerate behavior noted in previous work, where overly large phrases memorize the training data. Phrase table weights learned under our model yield an increase in BLEU score over the word-alignment based heuristic estimates used regularly in phrase-based translation systems.

Proceedings Article•DOI•

Viewing term proximity from a different perspective

Ruihua Song¹, Michael J. Taylor², Ji-Rong Wen², Hsiao-Wuen Hon², Yong Yu¹•Institutions (2)

Shanghai Jiao Tong University¹, Microsoft²

30 Mar 2008

TL;DR: The state-of-the-art probabilistic model BM25 is extended to utilize term proximity from a new perspective, and the relevance contribution of a term occurrence is measured by how many query terms occur in the context phrase and how compact they are.

Abstract: This paper extends the state-of-the-art probabilistic model BM25 to utilize term proximity from a new perspective. Most previous work considers only dependencies between pairs of terms and regards phrases as additional independent evidence. It is difficult to estimate the importance of a phrase and its extra contribution to a relevance score, as the phrase actually overlaps with the component terms. This paper proposes a new approach. First, query terms are grouped locally into non-overlapping phrases that may contain one or more query terms. Second, these phrases are not scored independently but are instead treated as providing a context for the component query terms. The relevance contribution of a term occurrence is measured by how many query terms occur in the context phrase and how compact they are. Third, we replace term frequency by the accumulated relevance contribution. Consequently, term proximity is easily integrated into the probabilistic model. Experimental results on TREC-10 and TREC-11 collections show stable improvements in terms of average precision and significant improvements in terms of top precisions.
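
The third step in the abstract above (swapping term frequency for an accumulated relevance contribution) can be illustrated with a minimal sketch of the standard BM25 term weight. This is not the paper's implementation: the function name `bm25_term_weight`, the defaults `k1=1.2` and `b=0.75`, and the smoothed IDF variant are assumptions chosen for illustration, and the abstract does not give the formula for the per-occurrence contribution, so the value passed in below is a made-up number.

```python
import math


def bm25_term_weight(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Standard BM25 weight of one query term in one document.

    `tf` may be fractional, which is what lets a proximity-based
    "accumulated relevance contribution" take the place of a raw count.
    k1, b and the smoothed IDF form are assumed, commonly used choices.
    """
    # Smoothed inverse document frequency (always non-negative).
    idf = math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1.0)
    # Document-length normalisation of the saturation constant.
    norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    # Saturating function of the (possibly fractional) frequency.
    return idf * tf * (k1 + 1.0) / (tf + norm)


# Plain BM25: the term occurs twice in the document.
plain = bm25_term_weight(2.0, doc_len=100, avg_doc_len=100, n_docs=1000, doc_freq=10)

# Proximity-aware variant: the same slot receives an accumulated contribution.
# The value 2.6 is hypothetical; the abstract does not say how it is computed.
with_proximity = bm25_term_weight(2.6, doc_len=100, avg_doc_len=100, n_docs=1000, doc_freq=10)
```

Because the weight grows monotonically and saturates in its first argument, any non-negative per-term quantity can be substituted for the raw count without changing the rest of the ranking function, which is consistent with the abstract's remark that proximity is easily integrated into the probabilistic model.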
