
Showing papers on "Phrase" published in 2008


Proceedings ArticleDOI
25 Oct 2008
TL;DR: A novel hierarchical phrase reordering model is presented that aims to improve non-local reorderings and integrates seamlessly with a standard phrase-based system with little loss of computational efficiency.
Abstract: While phrase-based statistical machine translation systems currently deliver state-of-the-art performance, they remain weak on word order changes. Current phrase reordering models can properly handle swaps between adjacent phrases, but they typically lack the ability to perform the kind of long-distance re-orderings possible with syntax-based systems. In this paper, we present a novel hierarchical phrase reordering model aimed at improving non-local reorderings, which seamlessly integrates with a standard phrase-based system with little loss of computational efficiency. We show that this model can successfully handle the key examples often used to motivate syntax-based systems, such as the rotation of a prepositional phrase around a noun phrase. We contrast our model with reordering models commonly used in phrase-based systems, and show that our approach provides statistically significant BLEU point gains for two language pairs: Chinese-English (+0.53 on MT05 and +0.71 on MT08) and Arabic-English (+0.55 on MT05).
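To make the contrast between local and hierarchical reordering concrete, here is a simplified sketch that labels each phrase as monotone (M), swap (S), or discontinuous (D). The function name `orientations`, the span encoding, and the merge-adjacent-blocks strategy are illustrative assumptions, not the paper's implementation. With `hierarchical=False`, a phrase is compared only with the previous phrase. With `hierarchical=True`, previously translated phrases that are contiguous in the source are first merged into larger blocks.

```python
def orientations(source_spans, hierarchical=True):
    """Label each phrase M (monotone), S (swap), or D (discontinuous).

    source_spans: inclusive (start, end) source positions of the phrases,
    listed in target-side order.
    """
    stack = [(-1, -1)]  # sentinel block standing for the sentence start
    labels = []
    for start, end in source_spans:
        prev_start, prev_end = stack[-1]
        if start == prev_end + 1:
            labels.append("M")
        elif end == prev_start - 1:
            labels.append("S")
        else:
            labels.append("D")
        stack.append((start, end))
        if not hierarchical:
            stack = stack[-1:]  # remember only the previous phrase
            continue
        # Reduce: merge the top two blocks while they are adjacent in the source.
        while len(stack) >= 2:
            (a_start, a_end), (b_start, b_end) = stack[-2], stack[-1]
            if a_end + 1 == b_start or b_end + 1 == a_start:
                stack[-2:] = [(min(a_start, b_start), max(a_end, b_end))]
            else:
                break
    return labels


spans = [(0, 0), (2, 2), (3, 3), (1, 1)]
print(orientations(spans, hierarchical=False))  # ['M', 'D', 'M', 'D']
print(orientations(spans, hierarchical=True))   # ['M', 'D', 'M', 'S']
```

In this toy input the last phrase jumps back over a two-phrase block. The phrase-local view can only call that discontinuous, while the block-merging view recognizes it as a swap.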

346 citations


Journal ArticleDOI
TL;DR: This article showed that syntactic priming is a form of implicit learning, and that lexically based, short-term mechanisms operating in tandem with abstract, longer-term learning mechanisms can explain the full pattern of results.

264 citations


Proceedings ArticleDOI
25 Oct 2008
TL;DR: This work improves the quality of paraphrases extracted from parallel corpora by requiring that phrases and their paraphrases be of the same syntactic type and by altering the phrase extraction algorithm to extract phrase labels alongside bilingual phrase pairs.
Abstract: We improve the quality of paraphrases extracted from parallel corpora by requiring that phrases and their paraphrases be the same syntactic type. This is achieved by parsing the English side of a parallel corpus and altering the phrase extraction algorithm to extract phrase labels alongside bilingual phrase pairs. In order to retain broad coverage of non-constituent phrases, complex syntactic labels are introduced. A manual evaluation indicates a 19% absolute improvement in paraphrase quality over the baseline method.
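As a rough illustration of the same-syntactic-type constraint, the sketch below scores paraphrase candidates by pivoting through shared foreign phrases and optionally keeps only candidates whose label matches the input phrase. The pivot scoring, the toy `PHRASE_TABLE`, its counts and labels, and the function name `paraphrases` are all assumptions made for illustration. The abstract does not specify these details.

```python
from collections import defaultdict

# Hypothetical toy phrase table rows:
# (english_phrase, syntactic_label, foreign_phrase, count)
PHRASE_TABLE = [
    ("under control", "ADJP", "unter kontrolle", 4),
    ("in check", "ADJP", "unter kontrolle", 2),
    ("checking", "VBG", "unter kontrolle", 2),
]


def paraphrases(phrase, label, same_type_only=True):
    """Score paraphrases of `phrase` by pivoting through shared foreign phrases."""
    # Counts of foreign phrases aligned to the input phrase under this label.
    pivots = defaultdict(int)
    for e, l, f, n in PHRASE_TABLE:
        if e == phrase and l == label:
            pivots[f] += n
    total = sum(pivots.values())

    scores = defaultdict(float)
    for f, n_f in pivots.items():
        # English phrases aligned to the same foreign phrase,
        # optionally restricted to the same syntactic label.
        rows = [(e, n) for e, l, ff, n in PHRASE_TABLE
                if ff == f and (l == label or not same_type_only)]
        denom = sum(n for _, n in rows)
        for e, n in rows:
            if e != phrase:
                scores[e] += (n_f / total) * (n / denom)
    return dict(scores)


print(paraphrases("under control", "ADJP"))                        # label-constrained
print(paraphrases("under control", "ADJP", same_type_only=False))  # unconstrained
```

With the constraint on, only "in check" survives. With it off, the mismatched "checking" is also proposed.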

207 citations


Journal ArticleDOI
TL;DR: The results showed that comprehension difficulty was modulated by animacy configuration and voice, and differences were well correlated with the availability of alternative interpretations as the relative clause unfolds, as revealed by the completion data.

204 citations


Journal ArticleDOI
TL;DR: Findings suggest that prototypical instances of linguistic constructions with redundant grammatical marking play a special role in early acquisition, and only later do children isolate and weigh individual grammatical cues appropriately.
Abstract: Two comprehension experiments were conducted to investigate whether German children are able to use the grammatical cues of word order and word endings (case markers) to identify agents and patients in a causative sentence and whether they weigh these two cues differently across development. Two-year-olds correctly understood only sentences with both cues supporting each other--the prototypical form. Five-year-olds were able to use word order by itself but not case markers. Only 7-year-olds behaved like adults by relying on case markers over word order when the two cues conflicted. These findings suggest that prototypical instances of linguistic constructions with redundant grammatical marking play a special role in early acquisition, and only later do children isolate and weigh individual grammatical cues appropriately.

196 citations


Journal ArticleDOI
TL;DR: In this article, the perceived emotional weight of the phrase "I love you" in multilinguals' different languages was investigated and found to be associated with self-perceived language dominance, context of acquisition of the second language, age of onset of learning the first language, degree of socialization in the second language, nature of the network of interlocutors in the L2, and self-perceived oral proficiency.

185 citations


Journal ArticleDOI
TL;DR: The phrase-based document similarity is applied to the group-average Hierarchical Agglomerative Clustering (HAC) algorithm to develop a new clustering approach, which is very effective at clustering the documents of two standard document benchmark corpora, OHSUMED and RCV1.
Abstract: In this paper, we propose a phrase-based document similarity to compute the pair-wise similarities of documents based on the suffix tree document (STD) model. By mapping each node in the suffix tree of the STD model into a unique feature term in the vector space document (VSD) model, the phrase-based document similarity naturally inherits the term tf-idf weighting scheme in computing the document similarity with phrases. We apply the phrase-based document similarity to the group-average Hierarchical Agglomerative Clustering (HAC) algorithm and develop a new document clustering approach. Our evaluation experiments indicate that the new clustering approach is very effective at clustering the documents of two standard document benchmark corpora, OHSUMED and RCV1. The quality of the clustering results significantly surpasses the results of the traditional single-word tf-idf similarity measure in the same HAC algorithm, especially in large document data sets. Furthermore, by studying the properties of the STD model, we conclude that the feature vector of phrase terms in the STD model can be considered an expanded feature vector of the traditional single-word terms in the VSD model. This conclusion sufficiently explains why the phrase-based document similarity works much better than the single-word tf-idf similarity measure.
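The idea of treating phrases as vector-space features can be sketched in a few lines. This is a minimal approximation, not the paper's method: word n-grams up to a hypothetical `max_len` stand in for suffix-tree nodes, the particular tf-idf variant is an assumption, and the names `phrases`, `tfidf_vectors`, and `cosine` are illustrative.

```python
import math
from collections import Counter


def phrases(text, max_len=3):
    """All word n-grams up to max_len; a stand-in for suffix-tree node phrases."""
    words = text.lower().split()
    return [" ".join(words[i:i + n])
            for n in range(1, max_len + 1)
            for i in range(len(words) - n + 1)]


def tfidf_vectors(docs, max_len=3):
    """Map each document to a sparse {phrase: tf-idf weight} vector."""
    tfs = [Counter(phrases(d, max_len)) for d in docs]
    df = Counter(p for tf in tfs for p in tf)  # document frequency per phrase
    n = len(docs)
    return [{p: (1 + math.log(c)) * math.log(1 + n / df[p]) for p, c in tf.items()}
            for tf in tfs]


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v[p] for p, w in u.items() if p in v)
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0


docs = ["the cat sat on the mat",
        "the cat sat on the sofa",
        "stock markets fell sharply"]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

A pairwise similarity matrix built this way could then be handed to a group-average agglomerative clustering routine. Setting `max_len=1` reduces the features to single words, which corresponds to the single-word baseline the abstract compares against.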

176 citations


Book ChapterDOI
01 Jun 2008
TL;DR: The approach combines finite-state automata techniques, regular expression parsing, and a maximal-matching strategy that is augmented by statistical methods to resolve segmentation ambiguities, and is implemented as vnTokenizer, a highly accurate tokenizer for Vietnamese texts.
Abstract: We present in this article a hybrid approach to automatically tokenize Vietnamese text. The approach combines finite-state automata techniques, regular expression parsing, and a maximal-matching strategy that is augmented by statistical methods to resolve segmentation ambiguities. The Vietnamese lexicon in use is compactly represented by a minimal finite-state automaton. A text to be tokenized is first parsed into lexical phrases and other patterns using pre-defined regular expressions. The automaton is then deployed to build linear graphs corresponding to the phrases to be segmented. The application of a maximal-matching strategy on a graph results in all candidate segmentations of a phrase. It is the responsibility of an ambiguity resolver, which uses a smoothed bigram language model, to choose the most probable segmentation of the phrase. The hybrid approach is implemented to create vnTokenizer, a highly accurate tokenizer for Vietnamese texts.
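The two-stage idea (generate candidate segmentations, then let a bigram model choose) can be sketched as follows. This is a simplified illustration and not vnTokenizer's code: a plain set replaces the minimal finite-state automaton, "maximal matching" is interpreted as keeping the segmentations with the fewest words, add-one smoothing is an assumed smoothing scheme, and the toy `LEXICON`, `BIGRAMS`, and `UNIGRAMS` counts are invented.

```python
import math

# Hypothetical toy lexicon: each word is a tuple of syllables.
LEXICON = {("học",), ("sinh",), ("học", "sinh"), ("sinh", "học")}
# Hypothetical bigram and unigram counts ("<s>" marks the sentence start).
BIGRAMS = {("<s>", "học sinh"): 5, ("học sinh", "học"): 4, ("học", "sinh học"): 3}
UNIGRAMS = {"<s>": 10, "học sinh": 6, "học": 5, "sinh học": 3}


def segmentations(syllables, lexicon, max_len=3):
    """Enumerate every way to split the syllable sequence into lexicon words."""
    if not syllables:
        return [[]]
    results = []
    for n in range(1, min(max_len, len(syllables)) + 1):
        head = tuple(syllables[:n])
        if head in lexicon:
            results += [[" ".join(head)] + rest
                        for rest in segmentations(syllables[n:], lexicon, max_len)]
    return results


def score(words, bigrams, unigrams, vocab_size):
    """Add-one smoothed bigram log-probability of a word sequence."""
    logp, prev = 0.0, "<s>"
    for w in words:
        logp += math.log((bigrams.get((prev, w), 0) + 1)
                         / (unigrams.get(prev, 0) + vocab_size))
        prev = w
    return logp


def tokenize(text, lexicon, bigrams, unigrams):
    """Keep the shortest segmentations, then pick the most probable one."""
    candidates = segmentations(text.split(), lexicon)
    if not candidates:
        return text.split()  # fall back to syllables if nothing matches
    fewest = min(len(c) for c in candidates)
    shortest = [c for c in candidates if len(c) == fewest]
    vocab_size = len(lexicon) + 1  # lexicon words plus the start symbol
    return max(shortest, key=lambda c: score(c, bigrams, unigrams, vocab_size))


print(tokenize("học sinh học sinh học", LEXICON, BIGRAMS, UNIGRAMS))
# ['học sinh', 'học', 'sinh học']
```

The example sentence has three equally short segmentations, so the length criterion alone cannot decide. The bigram scores break the tie.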

153 citations


Journal ArticleDOI
TL;DR: This paper builds an automatic detector and classifier for prosodic events in American English, based on their acoustic, lexical, and syntactic correlates, and focuses on accent (prominence, or "stress") and prosodic phrase boundary detection at the syllable level.
Abstract: With the advent of prosody annotation standards such as tones and break indices (ToBI), speech technologists and linguists alike have been interested in automatically detecting prosodic events in speech. This is because the prosodic tier provides an additional layer of information over the short-term segment-level features and lexical representation of an utterance. As the prosody of an utterance is closely tied to its syntactic and semantic content in addition to its lexical content, knowledge of the prosodic events within and across utterances can assist spoken language applications such as automatic speech recognition and translation. On the other hand, corpora annotated with prosodic events are useful for building natural-sounding speech synthesizers. In this paper, we build an automatic detector and classifier for prosodic events in American English, based on their acoustic, lexical, and syntactic correlates. Following previous work in this area, we focus on accent (prominence, or "stress") and prosodic phrase boundary detection at the syllable level. Our experiments achieved a performance rate of 86.75% agreement on the accent detection task, and 91.61% agreement on the phrase boundary detection task on the Boston University Radio News Corpus.

153 citations


Patent
24 Jan 2008
TL;DR: In an instant messaging application, a conversation is analyzed, and contextually or textually relevant keywords and phrases are identified and highlighted in a visually identifiable manner for selection by an individual participating in the conversation.
Abstract: In the context of an instant messaging application, a conversation is analyzed and contextually or textually relevant keywords and/or phrases are identified. Keywords or phrases are highlighted in a visually-identifiable manner for selection by an individual participating in the conversation. Once selected by an individual, a user interface is presented and exposes various contextually- or textually-relevant material or functionality that pertains to the selected word or phrase. An individual can also manually select a word or phrase to access the user interface. At least some of this relevant material or functionality is presented to the user in the context of the instant messaging application and in a manner in which it can be consumed by the individual within the instant messaging application itself.

150 citations


Journal ArticleDOI
Hugo Quené
TL;DR: Investigation of a corpus of spoken Dutch consisting of interviews with 160 high-school teachers shows that speech tempo depends mainly on phrase length, due to anticipatory shortening, and on the speaker's country, with different speaking styles in The Netherlands and in Flanders.
Abstract: Speech tempo (articulation rate) varies both between and within speakers. The present study investigates several factors affecting tempo in a corpus of spoken Dutch, consisting of interviews with 160 high-school teachers. Speech tempo was observed for each phrase separately, and analyzed by means of multilevel modeling of the speaker's sex, age, country, and dialect region (between speakers) and length, sequential position of phrase, and autocorrelated tempo (within speakers). Results show that speech tempo in this corpus depends mainly on phrase length, due to anticipatory shortening, and on the speaker's country, with different speaking styles in The Netherlands (faster, less varied) and in Flanders (slower, more varied). Additional analyses showed that phrase length itself is shorter in The Netherlands than in Flanders, and decreases with speaker's age. Older speakers tend to vary their phrase length more (within speakers), perhaps due to their accumulated verbal proficiency.

Journal ArticleDOI
TL;DR: The benefits of variation set structure are demonstrated directly: in miniature artificial languages, arranging a certain proportion of utterances in a training corpus in variation sets facilitated word and phrase constituent learning in adults.

Journal ArticleDOI
TL;DR: This article used syntactic priming to test the abstractness of sentence representations of young 3-year-olds (35-42 months) and found that children who were primed with passives produced more passives than did children primed with actives.

Proceedings ArticleDOI
25 Oct 2008
TL;DR: The MANLI system is presented, a new NLI aligner designed to address the alignment problem, which uses a phrase-based alignment representation, exploits external lexical resources, and capitalizes on a new set of supervised training data.
Abstract: The alignment problem---establishing links between corresponding phrases in two related sentences---is as important in natural language inference (NLI) as it is in machine translation (MT). But the tools and techniques of MT alignment do not readily transfer to NLI, where one cannot assume semantic equivalence, and for which large volumes of bitext are lacking. We present a new NLI aligner, the MANLI system, designed to address these challenges. It uses a phrase-based alignment representation, exploits external lexical resources, and capitalizes on a new set of supervised training data. We compare the performance of MANLI to existing NLI and MT aligners on an NLI alignment task over the well-known Recognizing Textual Entailment data. We show that MANLI significantly outperforms existing aligners, achieving gains of 6.2% in F1 over a representative NLI aligner and 10.5% over GIZA++.

Journal ArticleDOI
TL;DR: Experimental results show that infants have access to intermediate prosodic phrases during the first year of life, and use these to constrain lexical segmentation, and adult results are presented that test the plausibility of this hypothesis.
Abstract: This paper focuses on how phrasal prosody and function words may interact during early language acquisition. Experimental results show that infants have access to intermediate prosodic phrases (phonological phrases) during the first year of life, and use these to constrain lexical segmentation. These same intermediate prosodic phrases are used by adults to constrain on-line syntactic analysis. In addition, by two years of age infants can exploit function words to infer the syntactic category of unknown content words (nouns vs. verbs) and guess their plausible meaning (object vs. action). We speculate on how infants may build a partial syntactic structure by relying on both phonological phrase boundaries and function words, and present adult results that test the plausibility of this hypothesis. These results are tied together within a model of the architecture of the first stages of language processing, and their acquisition.

DOI
01 Jan 2008
TL;DR: This paper hypothesizes that the iambic-trochaic law determines the physical realization of main prominence within phonological phrases that contain more than one word, and shows this to be the case both across languages and within a language.
Abstract: How do infants start learning the syntax of the language they are exposed to? In this paper, we examine a plausible mechanism for the acquisition of the relative order of heads and complements. We hypothesize that the iambic-trochaic law determines the physical realization of main prominence within phonological phrases that contain more than one word: if it is realized mainly through pitch and intensity, it is in a phonological phrase that is stress-initial and has a complement-head structure, otherwise it is in a phonological phrase that is stress-final and has a head-complement structure. We show this to be the case both across languages (French and Turkish), and within a language (German, where both orders of head and complement are found). Our finding allows us to consider a psychologically plausible mechanism for the acquisition of the relative order of heads and complements, one of the basic properties of syntax. Because the mechanism is based on auditory perception, it can be utilized before any knowledge of words, thus accounting for the flawlessness in infants' first words combinations.

Patent
Patrick Jason Morrison
24 Oct 2008
TL;DR: A method is disclosed for receiving an audio stream containing user speech from a first device, generating text based on the user speech, identifying a key phrase in the text, receiving from an advertiser an advertisement related to the identified key phrase, and displaying the advertisement, which can be displayed after the audio stream terminates.
Abstract: Disclosed is a method of receiving an audio stream containing user speech from a first device, generating text based on the user speech, identifying a key phrase in the text, receiving from an advertiser an advertisement related to the identified key phrase, and displaying the advertisement. The method can include receiving from an advertiser a set of rules associated with the advertisement and displaying the advertisement in accordance with the associated set of rules. The method can display the advertisement on one or both of a first device and a second device. A central server can generate text based on the speech. A key phrase in the text can be identified based on a confidence score threshold. The advertisement can be displayed after the audio stream terminates.

Proceedings Article
01 Jun 2008
TL;DR: A translation model based on tree sequence alignment is presented, where a tree sequence refers to a single sequence of subtrees that covers a phrase; the model supports multi-level structure reordering of tree typology with larger span and statistically significantly outperforms the baseline systems.
Abstract: This paper presents a translation model that is based on tree sequence alignment, where a tree sequence refers to a single sequence of subtrees that covers a phrase. The model leverages the strengths of both phrase-based and linguistically syntax-based methods. It automatically learns aligned tree sequence pairs with mapping probabilities from word-aligned biparsed parallel texts. Compared with previous models, it not only captures non-syntactic phrases and discontinuous phrases with linguistically structured features, but also supports multi-level structure reordering of tree typology with larger span. This gives our model stronger expressive power than other reported models. Experimental results on the NIST MT-2005 Chinese-English translation task show that our method statistically significantly outperforms the baseline systems.

Patent
Sasha Blair-Goldensohn, Kerry Hannan, Ryan McDonald, Tyler Neylon, Jeffrey C. Reynar
25 Jan 2008
TL;DR: A method, a system, and a computer product are disclosed for generating a snippet for an entity, wherein each snippet comprises a plurality of sentiments about the entity; one or more sentiment phrases identified from textual reviews of the entity are selected to generate the snippet.
Abstract: Disclosed herein is a method, a system and a computer product for generating a snippet for an entity, wherein each snippet comprises a plurality of sentiments about the entity. One or more textual reviews associated with the entity are selected. A plurality of sentiment phrases are identified based on the one or more textual reviews, wherein each sentiment phrase comprises a sentiment about the entity. One or more sentiment phrases from the plurality of sentiment phrases are selected to generate a snippet.

Proceedings ArticleDOI
Long Jiang, Ming Zhou
18 Aug 2008
TL;DR: A phrase-based SMT approach to generate the second sentence of Chinese couplets, where corresponding words in the two sentences match each other by obeying certain constraints on semantic, syntactic, and lexical relatedness.
Abstract: Part of the unique cultural heritage of China is the game of Chinese couplets (duilian). One person challenges the other person with a sentence (first sentence). The other person then replies with a sentence (second sentence) equal in length and word segmentation, in a way that corresponding words in the two sentences match each other by obeying certain constraints on semantic, syntactic, and lexical relatedness. This task is viewed as a difficult problem in AI and has not been explored in the research community. In this paper, we regard this task as a kind of machine translation process. We present a phrase-based SMT approach to generate the second sentence. First, the system takes as input the first sentence, and generates as output an N-best list of proposed second sentences, using a phrase-based SMT decoder. Then, a set of filters is used to remove candidates violating linguistic constraints. Finally, a Ranking SVM is applied to rerank the candidates. A comprehensive evaluation, using both human judgments and BLEU scores, has been conducted, and the results demonstrate that this approach is very successful.

Proceedings ArticleDOI
Nizar Habash
16 Jun 2008
TL;DR: Four techniques for online handling of Out-of-Vocabulary words in Phrase-based Statistical Machine Translation using spelling expansion, morphological expansion, dictionary term expansion and proper name transliteration to reuse or extend a phrase table are presented.
Abstract: We present four techniques for online handling of Out-of-Vocabulary words in Phrase-based Statistical Machine Translation. The techniques use spelling expansion, morphological expansion, dictionary term expansion and proper name transliteration to reuse or extend a phrase table. We compare the performance of these techniques and combine them. Our results show a consistent improvement over a state-of-the-art baseline in terms of BLEU and a manual error analysis.

Journal ArticleDOI
TL;DR: Examining the syntactic and prosodic characteristics of the maternal speech to two infants between six and ten months finds infant-directed speech to be characterized by generally short utterances, isolated words and phrases, and large numbers of questions, but longer utterances are also found.
Abstract: The current study examines the syntactic and prosodic characteristics of the maternal speech to two infants between six and ten months. Consistent with previous work, we find infant-directed speech to be characterized by generally short utterances, isolated words and phrases, and large numbers of questions, but longer utterances are also found. Prosodic information provides cues to grammatical units not only at utterance boundaries, but also at utterance-internal clause boundaries. Subject-verb phrase boundaries in questions also show reliable prosodic cues, although those of declaratives do not. Prosodic information may thus play an important role in providing preverbal infants with information about the grammatically relevant word groupings. Furthermore, questions may play an important role in infants' discovery of verb phrases in English.

Patent
05 Dec 2008
TL;DR: Translations of text phrases are received from members of a social network, where the text phrases include content displayed in a social networking system, such as content from social networking objects; when a member is shown content including a text phrase in a first language and requests translation into another language, a translation is selected from the available translations based on actions by the member's friends.
Abstract: Embodiments of the invention provide techniques for translating text in a social network. In one embodiment, translations of text phrases are received from members of the social network. These text phrases include content displayed in a social networking system, such as content from social networking objects. A particular member is provided with content including a text phrase in a first language, and the member requests translation into another language. Responsive to this request, a translation of the text phrase is selected from a set of available translations. The selection is based on actions by friends of the member in the social network, the actions being associated with the set of available translations. These actions can include the viewing or approval of translations by the friends, for example. The selected translation is then presented to the member requesting the translation.

01 Jan 2008
TL;DR: The first version of Phrase Detectives is presented, to the authors' knowledge the first game designed for collaborative linguistic annotation on the Web, applied to linguistic annotation tasks such as anaphoric annotation.
Abstract: Annotated corpora of the size needed for modern computational linguistics research cannot be created by small groups of hand annotators. One solution is to exploit collaborative work on the Web and one way to do this is through games like the ESP game. Applying this methodology however requires developing methods for teaching subjects the rules of the game and evaluating their contribution while maintaining the game entertainment. In addition, applying this method to linguistic annotation tasks like anaphoric annotation requires developing methods for presenting text and identifying the components of the text that need to be annotated. In this paper we present the first version of Phrase Detectives (http://www.phrasedetectives.org), to our knowledge the first game designed for collaborative linguistic annotation on the Web.

Proceedings ArticleDOI
18 Aug 2008
TL;DR: This work investigates the source of the improvements in translation quality reported when using two PSCFG translation models (hierarchical and syntax-augmented), when extending a state-of-the-art phrase-based baseline that serves as the lexical support for both PSCFG models.
Abstract: Probabilistic synchronous context-free grammar (PSCFG) translation models define weighted transduction rules that represent translation and reordering operations via nonterminal symbols. In this work, we investigate the source of the improvements in translation quality reported when using two PSCFG translation models (hierarchical and syntax-augmented), when extending a state-of-the-art phrase-based baseline that serves as the lexical support for both PSCFG models. We isolate the impact on translation quality for several important design decisions in each model. We perform this comparison on three NIST language translation tasks: Chinese-to-English, Arabic-to-English, and Urdu-to-English, each representing unique challenges.

01 May 2008
TL;DR: A further attempt to enhance the accuracy of ANERsys by changing the probabilistic model from Maximum Entropy to Conditional Random Fields which helped to improve the results significantly.
Abstract: The Named Entity Recognition (NER) task consists in determining and classifying proper names within an open-domain text. This Natural Language Processing task has proved to be harder for languages with a complex morphology such as the Arabic language. NER has also been shown to help Natural Language Processing tasks such as Machine Translation, Information Retrieval and Question Answering to obtain a higher performance. In our previous works we have presented the first and the second version of ANERsys: an Arabic Named Entity Recognition system, whose performance we have succeeded in improving by more than 10 points, from the first to the second version, by adopting a different architecture and using additional information such as Part-Of-Speech tags and Base Phrase Chunks. In this paper, we present a further attempt to enhance the accuracy of ANERsys by changing the probabilistic model from Maximum Entropy to Conditional Random Fields, which helped to improve the results significantly.

Journal ArticleDOI
01 Jan 2008 - Language
TL;DR: This article examines several grammatical developments that have received relatively little attention but may be more pervasive than previously recognized; they involve the functional extension of markers of grammatical dependency from sentence-level syntax into larger discourse and pragmatic domains.
Abstract: This article examines several grammatical developments that have received relatively little attention, but that may be more pervasive than previously recognized. They involve the functional extension of markers of grammatical dependency from sentence-level syntax into larger discourse and pragmatic domains. Such developments are first illustrated with material from Navajo and Central Alaskan Yup'ik, then surveyed more briefly in several other unrelated languages. In some cases, secondary effects of such changes can reshape basic clause structure. An awareness of these processes can accordingly aid in understanding certain recurring but hitherto unexplained arrays of basic morphological and syntactic patterns, exemplified here with cases of homophonous grammatical markers and of ergative/accusative splits. Like developments described by Gildea (1997, 1998) and Evans (2007), they involve the use of dependent clauses as independent sentences, but the processes described here differ from those in both the mechanisms at work and their results.

Journal ArticleDOI
Yonggang Deng, William Byrne
TL;DR: In analyzing alignment performance, Chinese-English word alignments are shown to be comparable to those of IBM Model 4 even when models are trained over large parallel texts.
Abstract: Estimation and alignment procedures for word and phrase alignment hidden Markov models (HMMs) are developed for the alignment of parallel text. The development of these models is motivated by an analysis of the desirable features of IBM Model 4, one of the original and most effective models for word alignment. These models are formulated to capture the desirable aspects of Model 4 in an HMM alignment formalism. Alignment behavior is analyzed and compared to human-generated reference alignments, and the ability of these models to capture different types of alignment phenomena is evaluated. In analyzing alignment performance, Chinese-English word alignments are shown to be comparable to those of IBM Model 4 even when models are trained over large parallel texts. In translation performance, phrase-based statistical machine translation systems based on these HMM alignments can equal and exceed systems based on Model 4 alignments, and this is shown in Arabic-English and Chinese-English translation. These alignment models can also be used to generate posterior statistics over collections of parallel text, and this is used to refine and extend phrase translation tables with a resulting improvement in translation quality.

Proceedings ArticleDOI
25 Oct 2008
TL;DR: This work describes the first tractable Gibbs sampling procedure for estimating phrase pair frequencies under a probabilistic model of phrase alignment and proposes and evaluates two nonparametric priors that successfully avoid the degenerate behavior noted in previous work.
Abstract: We describe the first tractable Gibbs sampling procedure for estimating phrase pair frequencies under a probabilistic model of phrase alignment. We propose and evaluate two nonparametric priors that successfully avoid the degenerate behavior noted in previous work, where overly large phrases memorize the training data. Phrase table weights learned under our model yield an increase in BLEU score over the word-alignment based heuristic estimates used regularly in phrase-based translation systems.

Proceedings ArticleDOI
30 Mar 2008
TL;DR: The state-of-the-art probabilistic model BM25 is extended to utilize term proximity from a new perspective, and the relevance contribution of a term occurrence is measured by how many query terms occur in the context phrase and how compact they are.
Abstract: This paper extends the state-of-the-art probabilistic model BM25 to utilize term proximity from a new perspective. Most previous work considers only dependencies between pairs of terms, and regards phrases as additional independent evidence. It is difficult to estimate the importance of a phrase and its extra contribution to a relevance score, as the phrase actually overlaps with the component terms. This paper proposes a new approach. First, query terms are grouped locally into non-overlapping phrases that may contain one or more query terms. Second, these phrases are not scored independently but are instead treated as providing a context for the component query terms. The relevance contribution of a term occurrence is measured by how many query terms occur in the context phrase and how compact they are. Third, we replace term frequency by the accumulated relevance contribution. Consequently, term proximity is easily integrated into the probabilistic model. Experimental results on TREC-10 and TREC-11 collections show stable improvements in terms of average precision and significant improvements in terms of top precisions.
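The third step, substituting an accumulated contribution for raw term frequency, can be sketched as follows. The BM25 term weight uses a common non-negative idf variant. The contribution function is only a hypothetical placeholder, since the abstract does not give the paper's formula: each occurrence of a term contributes the fraction of distinct query terms found within a fixed word window. The names `bm25_weight` and `accumulated_contribution` and the `window` parameter are illustrative assumptions.

```python
import math


def bm25_weight(tf, df, n_docs, doc_len, avg_len, k1=1.2, b=0.75):
    """BM25 weight of one term in one document; `tf` may be fractional."""
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
    return idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))


def accumulated_contribution(doc_words, term, query_terms, window=3):
    """Hypothetical stand-in for the accumulated relevance contribution.

    Each occurrence of `term` adds the fraction of distinct query terms
    that appear within `window` words on either side of it.
    """
    query = set(query_terms)
    total = 0.0
    for i, w in enumerate(doc_words):
        if w == term:
            context = set(doc_words[max(0, i - window): i + window + 1])
            total += len(context & query) / len(query)
    return total


query = ["phrase", "translation"]
near = "phrase translation model".split()
far = "phrase a b c d e translation".split()

for doc in (near, far):
    pseudo_tf = accumulated_contribution(doc, "phrase", query)
    # n_docs, df, and avg_len are made-up collection statistics.
    print(pseudo_tf, bm25_weight(pseudo_tf, df=1, n_docs=10,
                                 doc_len=len(doc), avg_len=5))
```

Both toy documents contain "phrase" exactly once, so plain term frequency cannot tell them apart. The windowed contribution is 1.0 when the other query term is nearby and 0.5 when it is not.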