
Showing papers on "Phrase" published in 2010


Journal ArticleDOI
TL;DR: This article proposes a framework for representing the meaning of word combinations in vector space in terms of additive and multiplicative functions, and introduces a wide range of composition models that are evaluated empirically on a phrase similarity task.

981 citations
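To make the composition idea concrete, here is a minimal sketch of the additive and multiplicative composition functions this framework evaluates; the toy vectors, their dimensionality, and the cosine check are illustrative assumptions, not the paper's data.

```python
import numpy as np

def compose_additive(u, v):
    """p = u + v: sum the constituents dimension-wise."""
    return u + v

def compose_multiplicative(u, v):
    """p = u * v: component-wise product, stressing shared dimensions."""
    return u * v

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy distributional vectors for a phrase similarity check.
vec = {"practical":  np.array([0.8, 0.1, 0.3]),
       "difficulty": np.array([0.2, 0.9, 0.4]),
       "problem":    np.array([0.3, 0.8, 0.5])}

for name, f in [("additive", compose_additive),
                ("multiplicative", compose_multiplicative)]:
    phrase = f(vec["practical"], vec["difficulty"])
    print(name, round(cosine(phrase, vec["problem"]), 3))
```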


Journal ArticleDOI
TL;DR: The authors showed that comprehenders are sensitive to the frequencies of compositional four-word phrases (e.g. don't have to worry) and that more frequent phrases are processed faster.

492 citations


Proceedings Article
09 Oct 2010
TL;DR: A new approach to SMT adaptation is described that weights out-of-domain phrase pairs according to their relevance to the target domain, determined both by how similar to the target domain they appear to be and by whether they belong to general language or not.
Abstract: We describe a new approach to SMT adaptation that weights out-of-domain phrase pairs according to their relevance to the target domain, determined by both how similar to it they appear to be, and whether they belong to general language or not. This extends previous work on discriminative weighting by using a finer granularity, focusing on the properties of instances rather than corpus components, and using a simpler training procedure. We incorporate instance weighting into a mixture-model framework, and find that it yields consistent improvements over a wide range of baselines.

232 citations
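As a rough illustration of instance weighting (not the paper's trained discriminative model), the sketch below down-weights out-of-domain phrase-pair counts by a relevance score before relative-frequency estimation; the relevance() stub and the counts are invented.

```python
from collections import defaultdict

def relevance(src, tgt):
    # Stand-in for the learned per-instance weight (domain similarity
    # plus general-language features in the paper); fixed toy value here.
    return 0.7

ood_counts = {("maison", "house"): 40, ("maison", "home"): 10}

weighted, totals = defaultdict(float), defaultdict(float)
for (src, tgt), c in ood_counts.items():
    w = relevance(src, tgt) * c          # weighted count
    weighted[(src, tgt)] += w
    totals[src] += w

# Relative-frequency translation probabilities over weighted counts.
p = {pair: weighted[pair] / totals[pair[0]] for pair in weighted}
print(p)
```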


Journal ArticleDOI
TL;DR: The Corpus of Contemporary American English is the first large, genre-balanced corpus of any language, which has been designed and constructed from the ground up as a 'monitor corpus', and which can be used to accurately track and study recent changes in the language.
Abstract: The Corpus of Contemporary American English is the first large, genre-balanced corpus of any language, which has been designed and constructed from the ground up as a 'monitor corpus', and which can be used to accurately track and study recent changes in the language. The 400 million word corpus is evenly divided between spoken, fiction, popular magazines, newspapers, and academic journals. Most importantly, the genre balance stays almost exactly the same from year to year, which allows it to accurately model changes in the 'real world'. After discussing the corpus design, we provide a number of concrete examples of how the corpus can be used to look at recent changes in English, including morphology (new suffixes -friendly and -gate), syntax (including prescriptive rules, quotative like, so not ADJ, the get passive, resultatives, and verb complementation), semantics (such as changes in meaning with web, green, or gay), and lexis, including word and phrase frequency by year, and using the corpus architecture to produce lists of all words that have had large shifts in frequency between specific historical periods.

221 citations
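The kind of frequency-shift query described at the end of the abstract can be mimicked in a few lines; the counts below are invented, and COCA itself is queried through its web interface rather than code like this.

```python
counts = {"web":   {"1990s": 120, "2000s": 940},
          "gay":   {"1990s": 300, "2000s": 310},
          "green": {"1990s": 410, "2000s": 700}}
totals = {"1990s": 200_000, "2000s": 210_000}   # corpus size per period

def shift(word):
    f_old = counts[word]["1990s"] / totals["1990s"]
    f_new = counts[word]["2000s"] / totals["2000s"]
    return f_new / f_old               # ratio > 1 means rising use

for w in sorted(counts, key=shift, reverse=True):
    print(w, round(shift(w), 2))
```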


Patent
Jung-Eun Kim, Jeong-mi Cho
30 Sep 2010
TL;DR: In this paper, an apparatus and system for analyzing intention are presented; the apparatus applies a context-free grammar to each of one or more sentences, in units of one or more phrases, to perform phrase spotting on each sentence, thereby extending the recognition range for out-of-grammar (OOG) expressions.
Abstract: An apparatus and system for analyzing intention are provided. The apparatus for analyzing an intention applies a context-free grammar to each of one or more sentences in units of one or more phrases to perform phrase spotting on each sentence, thereby extending a recognition range for an out-of-grammar (OOG) expression. Meanwhile, the apparatus for analyzing an intention determines whether sentences that have undergone phrase spotting are grammatically valid by applying a dependency grammar to the sentences to filter out invalid sentences, and generates the intention analysis result of a valid sentence, thereby grammatically and/or semantically verifying a sentence that has undergone speech recognition while extending the speech recognition range.

206 citations


Proceedings Article
02 Jun 2010
TL;DR: An algorithm is developed that takes a trending phrase or any phrase specified by a user, collects a large number of posts containing the phrase, and provides an automatically created summary of the posts related to the term.
Abstract: In this paper, we focus on a recent Web trend called microblogging, and in particular a site called Twitter. The content of such a site is an extraordinarily large number of small textual messages, posted by millions of users, at random or in response to perceived events or situations. We have developed an algorithm that takes a trending phrase or any phrase specified by a user, collects a large number of posts containing the phrase, and provides an automatically created summary of the posts related to the term. We present examples of summaries we produce along with initial evaluation.

203 citations
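The sketch below is not the paper's algorithm, but a naive stand-in showing the overall shape: collect posts containing the phrase, then return the post that best covers the collection's frequent vocabulary.

```python
from collections import Counter

def summarize(posts, phrase):
    hits = [p for p in posts if phrase in p.lower()]
    freq = Counter(w for p in hits for w in p.lower().split())
    def coverage(post):
        words = set(post.lower().split())
        return sum(freq[w] for w in words) / len(words)
    return max(hits, key=coverage, default=None)

posts = ["Oil spill reaches the coast today",
         "So sad about the oil spill",
         "The oil spill reaches the coast, cleanup crews deployed"]
print(summarize(posts, "oil spill"))
```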


Journal ArticleDOI
TL;DR: A new approach to phrase-level sentiment analysis is presented that first determines whether an expression is neutral or polar and then disambiguates the polarity of the polar expressions, achieving results that are significantly better than baseline.
Abstract: There has been a recent swell of interest in the automatic identification and extraction of opinions, emotions, and sentiments in text. Motivation for this task comes from the desire to provide tools for information analysts in government, commercial, and political domains, who want to automatically track attitudes and feelings in the news and on-line forums. How do people feel about recent events in the Middle East? Is the rhetoric from a particular opposition group intensifying? What is the range of opinions being expressed in the world press about the best course of action in Iraq? A system that could automatically identify opinions and emotions from text would be an enormous help to someone trying to answer these kinds of questions. Researchers from many subareas of Artificial Intelligence and Natural Language Processing have been working on the automatic identification of opinions and related tasks. To date, most such work has focused on sentiment or subjectivity classification at the document or sentence level. Document classification tasks include, for example, distinguishing editorials from news articles and classifying reviews as positive or negative. A common sentence-level task is to classify sentences as subjective or objective. This paper presents a new approach to phrase-level sentiment analysis that first determines whether an expression is neutral or polar and then disambiguates the polarity of the polar expressions. With this approach, the system is able to automatically identify the contextual polarity for a large subset of sentiment expressions, achieving results that are significantly better than baseline.

202 citations
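A minimal sketch of the two-step design follows: first decide whether an expression is neutral or polar, then disambiguate the contextual polarity of polar expressions. The tiny prior-polarity lexicon and the single negation rule are illustrative assumptions, far simpler than the paper's feature set.

```python
PRIOR = {"brilliant": "positive", "terrible": "negative", "succeed": "positive"}
NEGATORS = {"not", "never", "hardly"}

def classify_phrase(tokens):
    polar = [w for w in tokens if w in PRIOR]
    if not polar:                        # step 1: neutral vs. polar
        return "neutral"
    word = polar[0]                      # step 2: contextual polarity
    if any(t in NEGATORS for t in tokens):
        return "negative" if PRIOR[word] == "positive" else "positive"
    return PRIOR[word]

print(classify_phrase("did not succeed".split()))   # -> negative
```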


Journal Article
TL;DR: This article used a Bayesian framework for grammar induction and showed that an ideal learner could recognize the hierarchical phrase structure of language without having this knowledge innately specified as part of the language faculty.

191 citations


Journal ArticleDOI
TL;DR: The results suggest that the violation of expectations and the difficulty of memory retrieval both contribute to the difficulties of object relative clauses, but that these two sources of difficulty have qualitatively distinct behavioral consequences in normal reading.

176 citations


Patent
25 Oct 2010
TL;DR: In this paper, a method for transliteration includes receiving input such as a word, a sentence, a phrase, and a paragraph, in a source language, creating source language sub-phonetic units for the word, and converting the source language sub-phonetic units to target language sub-phonetic units.
Abstract: A method for transliteration includes receiving input such as a word, a sentence, a phrase, and a paragraph, in a source language, creating source language sub-phonetic units for the word and converting the source language sub-phonetic units for the word to target language sub-phonetic units, retrieving ranking for each of the target language sub-phonetic units from a database and creating target language words for the word in the source language based on the target language sub-phonetic units and ranking of each of the target language sub-phonetic units. The method further includes identifying candidate target language words based on predefined criteria, and displaying the candidate target language words.

164 citations
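A hedged sketch of the pipeline's shape: split the source word into sub-phonetic units, map each unit to ranked target-language units, and compose candidate words from the rankings. The unit inventory and scores are invented for illustration.

```python
SUB_PHONETIC = {"ka": [("क", 0.9), ("का", 0.6)],
                "mal": [("मल", 0.8), ("माल", 0.7)]}   # unit -> ranked targets

def transliterate(units):
    candidates = [("", 1.0)]
    for u in units:
        candidates = [(word + t, score * r)
                      for word, score in candidates
                      for t, r in SUB_PHONETIC.get(u, [(u, 0.1)])]
    return sorted(candidates, key=lambda c: -c[1])[:3]

print(transliterate(["ka", "mal"]))   # top-ranked candidate words
```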


Proceedings Article
11 Jul 2010
TL;DR: Bagel is presented, a statistical language generator which uses dynamic Bayesian networks to learn from semantically-aligned data produced by 42 untrained annotators, and can generate natural and informative utterances from unseen inputs in the information presentation domain.
Abstract: Most previous work on trainable language generation has focused on two paradigms: (a) using a statistical model to rank a set of generated utterances, or (b) using statistics to inform the generation decision process. Both approaches rely on the existence of a handcrafted generator, which limits their scalability to new domains. This paper presents Bagel, a statistical language generator which uses dynamic Bayesian networks to learn from semantically-aligned data produced by 42 untrained annotators. A human evaluation shows that Bagel can generate natural and informative utterances from unseen inputs in the information presentation domain. Additionally, generation performance on sparse datasets is improved significantly by using certainty-based active learning, yielding ratings close to the human gold standard with a fraction of the data.

Journal ArticleDOI
TL;DR: A new concept-based mining model that analyzes terms on the sentence, document, and corpus levels rather than the traditional analysis of the document only is introduced and can efficiently find significant matching concepts between documents, according to the semantics of their sentences.
Abstract: Most of the common techniques in text mining are based on the statistical analysis of a term, either word or phrase. Statistical analysis of a term frequency captures the importance of the term within a document only. However, two terms can have the same frequency in their documents, but one term contributes more to the meaning of its sentences than the other term. Thus, the underlying text mining model should indicate terms that capture the semantics of text. In this case, the mining model can capture terms that present the concepts of the sentence, which leads to discovery of the topic of the document. A new concept-based mining model that analyzes terms on the sentence, document, and corpus levels is introduced. The concept-based mining model can effectively discriminate between nonimportant terms with respect to sentence semantics and terms which hold the concepts that represent the sentence meaning. The proposed mining model consists of sentence-based concept analysis, document-based concept analysis, corpus-based concept analysis, and concept-based similarity measure. The term which contributes to the sentence semantics is analyzed on the sentence, document, and corpus levels rather than the traditional analysis of the document only. The proposed model can efficiently find significant matching concepts between documents, according to the semantics of their sentences. The similarity between documents is calculated based on a new concept-based similarity measure. The proposed similarity measure takes full advantage of using the concept analysis measures on the sentence, document, and corpus levels in calculating the similarity between documents. Large sets of experiments using the proposed concept-based mining model on different data sets in text clustering are conducted. The experiments demonstrate extensive comparison between the concept-based analysis and the traditional analysis. Experimental results demonstrate the substantial enhancement of the clustering quality using the sentence-based, document-based, corpus-based, and combined approach concept analysis.
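As a toy rendering of the idea (an illustrative assumption, not the paper's exact measures), the sketch below combines a document-level term frequency with a sentence-level frequency so that terms carrying sentence meaning outrank merely frequent ones.

```python
def concept_weights(doc):
    sentences = [s.split() for s in doc.lower().split(".") if s.strip()]
    words = [w for s in sentences for w in s]
    tf = {w: words.count(w) / len(words) for w in set(words)}
    # Sentence-level "conceptual" frequency: share of sentences containing w.
    ctf = {w: sum(w in s for s in sentences) / len(sentences) for w in set(words)}
    return {w: tf[w] * ctf[w] for w in set(words)}

weights = concept_weights("The engine failed. The engine was repaired. Costs rose.")
for w, v in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(w, round(v, 3))
```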

Journal ArticleDOI
TL;DR: It is shown that progressive sentences about hand motion facilitate manual action in the same direction, while perfect sentences that are identical in every way except their aspect do not.

Proceedings Article
11 Jul 2010
TL;DR: Experimental results show that the model's output is comparable to human-written highlights in terms of both grammaticality and content.
Abstract: In this paper we present a joint content selection and compression model for single-document summarization. The model operates over a phrase-based representation of the source document which we obtain by merging information from PCFG parse trees and dependency graphs. Using an integer linear programming formulation, the model learns to select and combine phrases subject to length, coverage and grammar constraints. We evaluate the approach on the task of generating "story highlights"---a small number of brief, self-contained sentences that allow readers to quickly gather information on news stories. Experimental results show that the model's output is comparable to human-written highlights in terms of both grammaticality and content.
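A toy instance of the ILP formulation is sketched below with the PuLP package: select phrases to maximize salience under a length budget. The scores, lengths, and budget are invented, and the real model adds coverage and grammar constraints linking dependent phrases.

```python
from pulp import LpProblem, LpMaximize, LpVariable, lpSum, LpBinary

# phrase -> (salience score, length in words); invented values.
phrases = {"oil spill reaches coast": (3.0, 4),
           "cleanup crews deployed": (2.0, 3),
           "officials met yesterday": (0.5, 3)}

prob = LpProblem("highlight_selection", LpMaximize)
x = {p: LpVariable(f"x{i}", cat=LpBinary) for i, p in enumerate(phrases)}
prob += lpSum(sal * x[p] for p, (sal, _) in phrases.items())    # objective: salience
prob += lpSum(n * x[p] for p, (_, n) in phrases.items()) <= 7   # length budget
prob.solve()
print([p for p in phrases if x[p].value() == 1])
```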

Journal ArticleDOI
TL;DR: The data suggest that the scope of advance planning during grammatical encoding in sentence production is flexible, rather than structurally fixed.
Abstract: Three picture-word interference experiments addressed the question of whether the scope of grammatical advance planning in sentence production corresponds to some fixed unit or rather is flexible. Subjects produced sentences of different formats under varying amounts of cognitive load. When speakers described 2-object displays with simple sentences of the form "the frog is next to the mug," the 2 nouns were found to be lexically-semantically activated to similar degrees at speech onset, as indexed by similarly sized interference effects from semantic distractors related to either the first or the second noun. When speakers used more complex sentences (including prenominal color adjectives; e.g., "the blue frog is next to the blue mug") much larger interference effects were observed for the first than the second noun, suggesting that the second noun was lexically-semantically activated before speech onset on only a subset of trials. With increased cognitive load, introduced by an additional conceptual decision task and variable utterance formats, the interference effect for the first noun was increased and the interference effect for the second noun disappeared, suggesting that the scope of advance planning had been narrowed. By contrast, if cognitive load was induced by a secondary working memory task to be performed during speech planning, the interference effect for both nouns was increased, suggesting that the scope of advance planning had not been affected. In all, the data suggest that the scope of advance planning during grammatical encoding in sentence production is flexible, rather than structurally fixed.

Journal ArticleDOI
TL;DR: An eye-tracking study explored Korean-speaking adults' and 4- and 5-year-olds' ability to recover from misinterpretations of temporarily ambiguous phrases during spoken language comprehension, finding that children, but not adults, had difficulty in recovering from these misinterpretations despite strong disambiguating evidence at the end of the sentence.

Journal ArticleDOI
TL;DR: This work uses blogs as the object and data source for Chinese emotional expression analysis; based on the proposed model, a relatively fine-grained annotation scheme is put forward for the manual annotation of an emotion corpus.

Proceedings Article
Dmitriy Genzel
23 Aug 2010
TL;DR: An approach is developed to automatically learn reordering rules applied as a preprocessing step in phrase-based machine translation; rules learned for 8 different language pairs show BLEU improvements for all of them, demonstrating that many important order transformations can be captured.
Abstract: We describe an approach to automatically learn reordering rules to be applied as a preprocessing step in phrase-based machine translation. We learn rules for 8 different language pairs, showing BLEU improvements for all of them, and demonstrate that many important order transformations (SVO to SOV or VSO, head-modifier, verb movement) can be captured by this approach.
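For intuition, here is a single hand-written stand-in for one such rule (the paper learns its rules from data): reorder a toy POS-tagged SVO clause into SOV before handing it to a phrase-based system.

```python
def svo_to_sov(tagged):
    """Move the first verb to clause-final position (SVO -> SOV)."""
    verbs = [i for i, (_, tag) in enumerate(tagged) if tag == "V"]
    if not verbs:
        return tagged
    i = verbs[0]
    return tagged[:i] + tagged[i + 1:] + [tagged[i]]

sent = [("she", "PRN"), ("reads", "V"), ("the", "DET"), ("book", "N")]
print([w for w, _ in svo_to_sov(sent)])   # ['she', 'the', 'book', 'reads']
```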

Patent
20 May 2010
TL;DR: In this article, a system and a method for phrase-based translation are disclosed, which includes receiving source language text to be translated into target language text, and a translation, based on the hypothesis scores, is then output.
Abstract: A system and a method for phrase-based translation are disclosed. The method includes receiving source language text to be translated into target language text. One or more dynamic bi-phrases are generated, based on the source text and the application of one or more rules, which may be based on user descriptions. A dynamic feature value is associated with each of the dynamic bi-phrases. For a sentence of the source text, static bi-phrases are retrieved from a bi-phrase table, each of the static bi-phrases being associated with one or more values of static features. Any of the dynamic bi-phrases which each cover at least one word of the source text are also retrieved, which together form a set of active bi-phrases. Translation hypotheses are generated using active bi-phrases from the set and scored with a translation scoring model which takes into account the static and dynamic feature values of the bi-phrases used in the respective hypothesis. A translation, based on the hypothesis scores, is then output.
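The scoring step can be pictured as below: a hypothesis built from static and dynamic bi-phrases is scored by a weighted sum over the feature values attached to each bi-phrase it uses. The feature names, weights, and values are illustrative, not taken from the patent.

```python
WEIGHTS = {"p_translation": 1.0, "dynamic_rule": 0.5}   # tuned log-linear weights

def score_hypothesis(biphrases):
    return sum(WEIGHTS[f] * v
               for bp in biphrases
               for f, v in bp["features"].items())

hypothesis = [
    {"src": "trois euros", "tgt": "three euros",
     "features": {"p_translation": -0.2}},    # static bi-phrase from the table
    {"src": "3,50", "tgt": "3.50",
     "features": {"dynamic_rule": 1.0}},      # dynamic bi-phrase from a rule
]
print(score_hypothesis(hypothesis))
```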

Journal ArticleDOI
TL;DR: This paper investigated the role of shared word order and alignment with a dialogue partner in the production of code-switched sentences and found that participants had a clear preference for using the shared order when they switched languages, but also aligned their word order choices and code switching patterns with the confederate.

Proceedings Article
Jianfeng Gao, Xiaolong Li, Daniel Micol, Chris Quirk, Xu Sun
23 Aug 2010
TL;DR: The noisy channel model is subsumed by a more general ranker, which allows a variety of features to be easily incorporated, and a distributed infrastructure is proposed for training and applying Web-scale n-gram language models.
Abstract: This paper makes three significant extensions to a noisy channel speller designed for standard written text to target the challenging domain of search queries. First, the noisy channel model is subsumed by a more general ranker, which allows a variety of features to be easily incorporated. Second, a distributed infrastructure is proposed for training and applying Web scale n-gram language models. Third, a new phrase-based error model is presented. This model places a probability distribution over transformations between multi-word phrases, and is estimated using large amounts of query-correction pairs derived from search logs. Experiments show that each of these extensions leads to significant improvements over the state-of-the-art baseline methods.
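For reference, the noisy-channel backbone that the paper generalizes ranks corrections c for a query q by P(c) * P(q | c): a language model times an error model. The toy log-probability tables below stand in for the Web-scale n-gram language model and the phrase-based error model trained from search logs.

```python
LM = {"britney spears": -2.0, "britny spears": -9.0}            # log P(c)
ERR = {("britny", "britney"): -1.0, ("britny", "britny"): -0.1,
       ("spears", "spears"): -0.05}                              # log P(q_i | c_i)

def channel_score(query, candidate):
    lm = LM.get(candidate, -20.0)
    err = sum(ERR.get((q, c), -15.0)
              for q, c in zip(query.split(), candidate.split()))
    return lm + err

query = "britny spears"
print(max(LM, key=lambda c: channel_score(query, c)))   # -> britney spears
```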

Journal ArticleDOI
TL;DR: This paper investigated the relationship between syntactic and prosodic phrase structures in the production and perception of spontaneous speech and found that syntax influences prosody production, listeners' perception of prosodic boundaries is sensitive to acoustic duration, and syntax directly influences boundary perception.
Abstract: The relationship between syntactic and prosodic phrase structures is investigated in the production and perception of spontaneous speech. Three hypotheses are tested: (1) syntax influences prosody production; (2) listeners' perception of prosodic boundaries is sensitive to acoustic duration; and (3) syntax directly influences boundary perception, (partly) independent of the acoustic evidence for boundaries. Data are from the Buckeye corpus of conversational speech, and the real-time prosodic transcription of those data by 97 untrained listeners. Inter-transcriber agreement codes boundary strength at word junctures, and boundary scores are shown to be correlated with both the syntactic context and vowel duration of a word. Vowel duration is also correlated with syntactic context, but the effect of syntactic context on boundary perception is not fully explained by vowel duration. Regression analyses show that syntactic clause boundaries and vowel duration are the first and second strongest predictors of boundary perception.

Proceedings ArticleDOI
26 Oct 2010
TL;DR: This paper provides a quantitative analysis of the language discrepancy issue, and explores the use of clickthrough data to bridge documents and queries, and demonstrates that standard statistical machine translation techniques can be adapted for building a better Web document retrieval system.
Abstract: Web search is challenging partly due to the fact that search queries and Web documents use different language styles and vocabularies. This paper provides a quantitative analysis of the language discrepancy issue, and explores the use of clickthrough data to bridge documents and queries. We assume that a query is parallel to the titles of documents clicked on for that query. Two translation models are trained and integrated into retrieval models: A word-based translation model that learns the translation probability between single words, and a phrase-based translation model that learns the translation probability between multi-term phrases. Experiments are carried out on a real world data set. The results show that the retrieval systems that use the translation models outperform significantly the systems that do not. The paper also demonstrates that standard statistical machine translation techniques such as word alignment, bilingual phrase extraction, and phrase-based decoding, can be adapted for building a better Web document retrieval system.
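A sketch of the word-based translation retrieval model (the simpler of the two models; the probabilities here are toy values): score a document title d for query q as P(q | d) = prod_i sum_w t(q_i | w) P(w | d), with self-translation entries t(w | w) so exact matches still count.

```python
import math

T = {("car", "automobile"): 0.4, ("car", "car"): 0.6,
     ("cheap", "affordable"): 0.5, ("cheap", "cheap"): 0.5}   # t(q | w)

def log_p_query(query, doc_words):
    p_w = 1.0 / len(doc_words)                    # uniform P(w | d)
    score = 0.0
    for q in query:
        mass = sum(T.get((q, w), 0.0) * p_w for w in doc_words)
        score += math.log(mass or 1e-9)           # crude smoothing of zero mass
    return score

print(log_p_query(["cheap", "car"], ["affordable", "automobile", "sale"]))
```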

Proceedings Article
15 Jul 2010
TL;DR: A novel reordering model for the hierarchical phrase-based approach is introduced which further enhances translation performance, and the effect some recent extended lexicon models have on the performance of the system is analyzed.
Abstract: We present Jane, RWTH's hierarchical phrase-based translation system, which has been open sourced for the scientific community. This system has been in development at RWTH for the last two years and has been successfully applied in different machine translation evaluations. It includes extensions to the hierarchical approach developed by RWTH as well as other research institutions. In this paper we give an overview of its main features. We also introduce a novel reordering model for the hierarchical phrase-based approach which further enhances translation performance, and analyze the effect some recent extended lexicon models have on the performance of the system.

Book
02 May 2010
TL;DR: This book discusses the architecture of the Linguistic-Spatial Interface, as well as Morphological and Semantic Regularities in the Lexicon, and the Ecology of English Noun-Noun Compounds.
Abstract: Contents: 1. Prologue: The Parallel Architecture and its Components; 2. Morphological and Semantic Regularities in the Lexicon; 3. On Beyond Zebra: The Relation of Linguistics and Visual Information; 4. The Architecture of the Linguistic-Spatial Interface; 5. Parts and Boundaries; 6. The Proper Treatment of Measuring Out, Telicity, and Perhaps Even Quantification in English; 7. English Particle Constructions, the Lexicon, and the Autonomy of Syntax; 8. Twistin' the Night Away; 9. The English Resultative as a Family of Constructions; 10. On the Phrase The Phrase 'the phrase'; 11. Contrastive Focus Reduplication in English (the salad-salad paper); 12. Construction After Construction and its Theoretical Challenges; 13. The Ecology of English Noun-Noun Compounds.

Proceedings Article
11 Jul 2010
TL;DR: A novel leaving-one-out approach to prevent over-fitting is described that allows us to train phrase models that show improved translation performance on the WMT08 Europarl German-English task.
Abstract: Several attempts have been made to learn phrase translation probabilities for phrase-based statistical machine translation that go beyond pure counting of phrases in word-aligned training data. Most approaches report problems with over-fitting. We describe a novel leaving-one-out approach to prevent over-fitting that allows us to train phrase models that show improved translation performance on the WMT08 Europarl German-English task. In contrast to most previous work where phrase models were trained separately from other models used in translation, we include all components such as single word lexica and reordering models in training. Using this consistent training of phrase models we are able to achieve improvements of up to 1.4 points in BLEU. As a side effect, the phrase table size is reduced by more than 80%.
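The leave-one-out device can be illustrated as follows (toy counts, not the paper's full training loop): when force-aligning a training sentence, its own contribution is first subtracted from the phrase-pair counts, so long singleton phrases memorized from that very sentence cannot dominate.

```python
from collections import Counter

corpus = Counter({("das haus", "the house"): 50,
                  ("das haus ist", "the house is"): 1})
sentence = Counter({("das haus", "the house"): 1,
                    ("das haus ist", "the house is"): 1})   # counts from one sentence

def loo_prob(pair, floor=1e-4):
    c = corpus[pair] - sentence[pair]             # leave this sentence out
    total = sum(corpus.values()) - sum(sentence.values())
    return max(c, floor) / total

for pair in corpus:
    print(pair, round(loo_prob(pair), 4))   # the singleton pair is floored
```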

Proceedings Article
01 Jan 2010
TL;DR: Kirsh, as mentioned in this paper, explored how dancers and choreographers use their bodies to think about dance phrases and found that the body in motion can serve as an anchor and vehicle for thought.
Abstract: To explore the question of physical thinking – using the body as an instrument of cognition – we collected extensive video and interview data on the creative process of a noted choreographer and his company as they made a new dance. A striking case of physical thinking is found in the phenomenon of marking. Marking refers to dancing a phrase in a less than complete manner. Dancers mark to save energy. But they also mark to explore the tempo of a phrase, or its movement sequence, or the intention behind it. Because of its representational nature, marking can serve as a vehicle for thought. Importantly, this vehicle is less complex than the version of the same phrase danced 'full-out'. After providing evidence for distinguishing different types of marking, three ways of understanding marking as a form of thought are considered: marking as a gestural language for encoding aspects of a target movement, marking as a method of priming neural systems involved in the target movement, and marking as a method for improving the precision of mentally projecting aspects of the target. Keywords: marking; multimodality; thinking; embodied cognition; ethnography.
This paper explores how dancers and choreographers use their bodies to think about dance phrases. My specific focus is a technique called 'marking'. Marking refers to dancing a phrase in a less than complete manner. See Fig. 1 for an example of hand marking, a form that is far smaller than the more typical method of marking that involves modeling a phrase with the whole body. Marking is part of the practice of dance, pervasive in all phases of creation, practice, rehearsal, and reflection. Virtually all English-speaking dancers know the term, though few, if any, scholarly articles exist that describe the process or give instructions on how to do it. When dancers mark a phrase, they use their body's movement and form as a representational vehicle. They do not recreate the full dance phrase they normally perform; instead, they create a simplified or abstracted version – a model. Dancers mark to save energy, to avoid strenuous movement such as jumps, and sometimes to review or explore specific aspects of a phrase, such as tempo, movement sequence, or underlying intention, without the mental complexity involved in creating the phrase 'full-out'. Marking is not the only way dancers 'mentally' explore phrases. Many imagine themselves performing a phrase. Some of the professional dancers we studied reported visualizing their phrase in bed before going to sleep; others reported mentally reviewing their phrases while traveling on the tube on their way home. Our evidence suggests that marking, however, gives more insight than mental rehearsal: by physically executing a synoptic version of the whole phrase – by creating a simplified version externally – dancers are able to understand the shape, dynamics, emotion, and spatial elements of a phrase better than through imagination alone. They use marking as an anchor and vehicle for thought. It is this idea – that a body in motion can serve as an anchor and vehicle of thought – that is explored in this paper. It is a highly general claim.
It has been said that gesture can facilitate thought [Goldin-Meadow 05]; that physically simulating a process can help a thinker understand that process [Collins et al 91]; and that mental rehearsal is improved by overt physical movement [Coffman 90]. Why? What extra can physical action or physical structure offer to imagination? The answer, I suggest, is that creating an external structure connected to a thought – whether that external structure be a gesture, dance form, or linguistic structure – is part of an interactive strategy of bootstrapping thought by providing an anchor for mental projection [Hutchins 05; Kirsh 09, 10]. Marking a phrase provides the scaffold to mentally project more detailed structure than could otherwise be held in mind. It is part of an interactive strategy for augmenting cognition. By marking, dancers harness their bodies to drive thought deeper than through mental simulation and unaided thinking alone.
[Fig. 1, hand marking: in Fig. 1a an Irish river dancer is caught in mid-move; in Fig. 1b the same move is marked using just the hands. River dancing is a type of step dancing in which the arms are kept still. Typically, river dancers mark steps and positions using one hand for the movement and the other for the floor. Most marking involves modeling phrases with the whole body, not just the hands.]
(A search by professional librarians of dance in the UK and US has yet to turn up scholarly articles on the practice of marking.)

Patent
23 Dec 2010
TL;DR: The authors describe word-dependent language models, as well as their creation and use, which can be useful in many contexts, including those where one or more letters of the expected phrase are known to the speaker.
Abstract: This document describes word-dependent language models, as well as their creation and use. A word-dependent language model can permit a speech-recognition engine to accurately verify that a speech utterance matches a multi-word phrase. This is useful in many contexts, including those where one or more letters of the expected phrase are known to the speaker.

Proceedings Article
02 Jun 2010
TL;DR: Manual evaluation of generated output shows that the random-walk-based approach to learning paraphrases from bilingual parallel corpora outperforms the state-of-the-art system of Callison-Burch (2008).
Abstract: We present a random-walk-based approach to learning paraphrases from bilingual parallel corpora. The corpora are represented as a graph in which a node corresponds to a phrase, and an edge exists between two nodes if their corresponding phrases are aligned in a phrase table. We sample random walks to compute the average number of steps it takes to reach a ranking of paraphrases with better ones being "closer" to a phrase of interest. This approach allows "feature" nodes that represent domain knowledge to be built into the graph, and incorporates truncation techniques to prevent the graph from growing too large for efficiency. Current approaches, by contrast, implicitly presuppose the graph to be bipartite, are limited to finding paraphrases that are of length two away from a phrase, and do not generally permit easy incorporation of domain knowledge. Manual evaluation of generated output shows that our approach outperforms the state-of-the-art system of Callison-Burch (2008).
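A small sketch of the graph view under invented data: nodes are phrases, edges link phrases aligned in a phrase table, and paraphrase candidates are ranked by the estimated average number of random-walk steps (hitting time) to the phrase of interest, shorter being better.

```python
import random

# Adjacency from a toy phrase table: English phrases linked via a German pivot.
GRAPH = {"under control": ["unter kontrolle"],
         "unter kontrolle": ["under control", "in check"],
         "in check": ["unter kontrolle"]}

def hitting_time(start, target, walks=2000, max_steps=20):
    total = reached = 0
    for _ in range(walks):
        node = start
        for step in range(1, max_steps + 1):
            node = random.choice(GRAPH[node])
            if node == target:
                total += step
                reached += 1
                break
    return total / reached if reached else float("inf")

# Average steps from "in check" to "under control"; smaller is a better candidate.
print(hitting_time("in check", "under control"))
```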

Proceedings Article
15 Jul 2010
TL;DR: The SemEval-2010 Cross-Lingual Lexical Substitution task is described, in which, given an English target word in context, participating systems had to find an alternative substitute word or phrase in Spanish.
Abstract: In this paper we describe the SemEval-2010 Cross-Lingual Lexical Substitution task, where given an English target word in context, participating systems had to find an alternative substitute word or phrase in Spanish. The task is based on the English Lexical Substitution task run at SemEval-2007. In this paper we provide background and motivation for the task, we describe the data annotation process and the scoring system, and present the results of the participating systems.