Showing papers on "Phrase published in 2007"


Proceedings Article
01 Jun 2007
TL;DR: In a number of experiments, it is shown that factored translation models lead to better translation performance, both in terms of automatic scores, as well as more grammatical coherence.
Abstract: We present an extension of phrase-based statistical machine translation models that enables the straight-forward integration of additional annotation at the word-level — may it be linguistic markup or automatically generated word classes. In a number of experiments we show that factored translation models lead to better translation performance, both in terms of automatic scores, as well as more grammatical coherence.

582 citations


Proceedings ArticleDOI
28 Oct 2007
TL;DR: Topical n-grams, as discussed by the authors, is a probabilistic model that generates words in their textual order by, for each word, first sampling a topic, then sampling its status as a unigram or bigram, and then sampling the word from a topic-specific unigram or bigram distribution.
Abstract: Most topic models, such as latent Dirichlet allocation, rely on the bag-of-words assumption. However, word order and phrases are often critical to capturing the meaning of text in many text mining tasks. This paper presents topical n-grams, a topic model that discovers topics as well as topical phrases. The probabilistic model generates words in their textual order by, for each word, first sampling a topic, then sampling its status as a unigram or bigram, and then sampling the word from a topic-specific unigram or bigram distribution. Thus our model can model "white house" as a special meaning phrase in the 'politics' topic, but not in the 'real estate' topic. Successive bigrams form longer phrases. We present experiments showing meaningful phrases and more interpretable topics from the NIPS data and improved information retrieval performance on a TREC collection.

510 citations
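
The generative order described in the abstract above (topic, then unigram/bigram status, then word) can be sketched roughly as follows. This is only a toy illustration, not the authors' model: the function name `generate`, the uniform topic draw, the rule that the bigram status depends on the previous word alone, and all probability tables are assumptions made up for the example.

```python
import random


def generate(n_words, topics, bigram_prob, unigram, bigram, rng):
    """Toy generator: per word, draw a topic, a unigram/bigram status, then the word.

    unigram[topic]        -> {word: probability}
    bigram[(topic, prev)] -> {word: probability}
    bigram_prob[prev]     -> probability that the next word continues a bigram
    All tables are hypothetical inputs, not values from the paper.
    """
    words = []
    for _ in range(n_words):
        topic = rng.choice(topics)  # assumption: uniform topic draw
        prev = words[-1] if words else None
        # Status draw: a bigram is only possible when a previous word exists
        # and a topic-specific bigram table is available for it.
        is_bigram = (
            prev is not None
            and (topic, prev) in bigram
            and rng.random() < bigram_prob.get(prev, 0.0)
        )
        dist = bigram[(topic, prev)] if is_bigram else unigram[topic]
        choices, weights = zip(*dist.items())
        words.append(rng.choices(choices, weights=weights)[0])
    return words
```

With a `politics` topic whose bigram table maps `white` to `house`, the phrase "white house" is produced as a unit in that topic, while a topic without such a table would fall back to its unigram distribution.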


Proceedings Article
01 Jun 2007
TL;DR: This paper investigates a new strategy for integrating WSD into an SMT system, that performs fully phrasal multi-word disambiguation, and provides the first known empirical evidence that lexical semantics are indeed useful for SMT, despite claims to the contrary.
Abstract: We show for the first time that incorporating the predictions of a word sense disambiguation system within a typical phrase-based statistical machine translation (SMT) model consistently improves translation quality across all three different IWSLT Chinese-English test sets, as well as producing statistically significant improvements on the larger NIST Chinese-English MT task, and moreover never hurts performance on any test set, according not only to BLEU but to all eight most commonly used automatic evaluation metrics. Recent work has challenged the assumption that word sense disambiguation (WSD) systems are useful for SMT. Yet SMT translation quality still obviously suffers from inaccurate lexical choice. In this paper, we address this problem by investigating a new strategy for integrating WSD into an SMT system, that performs fully phrasal multi-word disambiguation. Instead of directly incorporating a Senseval-style WSD system, we redefine the WSD task to match the exact same phrasal translation disambiguation task faced by phrase-based SMT systems. Our results provide the first known empirical evidence that lexical semantics are indeed useful for SMT, despite claims to the contrary.

392 citations


Proceedings Article
01 Jun 2007
TL;DR: This work develops faster approaches for efficient decoding based on k-best parsing algorithms and demonstrates their effectiveness on both phrase-based and syntax-based MT systems.
Abstract: Efficient decoding has been a fundamental problem in machine translation, especially with an integrated language model which is essential for achieving good translation quality. We develop faster approaches for this problem based on k-best parsing algorithms and demonstrate their effectiveness on both phrase-based and syntax-based MT systems. In both cases, our methods achieve significant speed improvements, often by more than a factor of ten, over the conventional beam-search method at the same levels of search error and translation accuracy.

313 citations
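
The abstract above does not spell out the algorithm, so the following is only a generic sketch of the lazy k-best idea that k-best parsing methods build on: enumerating the k cheapest combinations of two sorted candidate lists with a heap instead of scoring every pair. The function name `k_best_sums` and the additive cost model are assumptions; in particular, an integrated language model makes combination costs non-additive, which this sketch does not handle.

```python
import heapq


def k_best_sums(a, b, k):
    """Return the k smallest values a[i] + b[j]; a and b must be sorted ascending."""
    if not a or not b or k <= 0:
        return []
    heap = [(a[0] + b[0], 0, 0)]  # best combination is always the two list heads
    seen = {(0, 0)}
    best = []
    while heap and len(best) < k:
        cost, i, j = heapq.heappop(heap)
        best.append(cost)
        # Only the two grid neighbours of a popped cell can be the next best.
        for ni, nj in ((i + 1, j), (i, j + 1)):
            if ni < len(a) and nj < len(b) and (ni, nj) not in seen:
                seen.add((ni, nj))
                heapq.heappush(heap, (a[ni] + b[nj], ni, nj))
    return best
```

Only on the order of k cells of the full grid are ever scored, which is where the saving over exhaustive combination comes from.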


Proceedings Article
01 Jun 2007
TL;DR: It is shown for the first time that integrating a WSD system improves the performance of a state-of-the-art statistical MT system on an actual translation task, and the improvement is statistically significant.
Abstract: Recent research presents conflicting evidence on whether word sense disambiguation (WSD) systems can help to improve the performance of statistical machine translation (MT) systems. In this paper, we successfully integrate a state-of-the-art WSD system into a state-of-the-art hierarchical phrase-based MT system, Hiero. We show for the first time that integrating a WSD system improves the performance of a state-of-the-art statistical MT system on an actual translation task. Furthermore, the improvement is statistically significant.

307 citations


Proceedings ArticleDOI
17 Jun 2007
TL;DR: A fast and principled solution to the discovery of significant spatial co-occurrent patterns using frequent itemset mining; a pattern summarization method that deals with the compositional uncertainties in visual phrases; and a top-down refinement scheme of the visual word lexicon by feeding back discovered phrases to tune the similarity measure through metric learning.
Abstract: A visual word lexicon can be constructed by clustering primitive visual features, and a visual object can be described by a set of visual words. Such a "bag-of-words" representation has led to many significant results in various vision tasks including object recognition and categorization. However, in practice, the clustering of primitive visual features tends to result in synonymous visual words that over-represent visual patterns, as well as polysemous visual words that bring large uncertainties and ambiguities in the representation. This paper aims at generating a higher-level lexicon, i.e. visual phrase lexicon, where a visual phrase is a meaningful spatially co-occurrent pattern of visual words. This higher-level lexicon is much less ambiguous than the lower-level one. The contributions of this paper include: (1) a fast and principled solution to the discovery of significant spatial co-occurrent patterns using frequent itemset mining; (2) a pattern summarization method that deals with the compositional uncertainties in visual phrases; and (3) a top-down refinement scheme of the visual word lexicon by feeding back discovered phrases to tune the similarity measure through metric learning.

269 citations


Journal ArticleDOI
TL;DR: The authors conducted a large-scale corpus analysis indicating that pronominal object relative clauses are significantly more frequent than subject relative clauses when the embedded pronoun is personal, but this difference was reversed when impersonal pronouns constituted the embedded noun phrase.

253 citations


Journal ArticleDOI
TL;DR: Two experiments are reported which examine how manipulations of visual attention affect speakers' linguistic choices regarding word order, verb use and syntactic structure when describing simple pictured scenes, finding that early endogenous shifts in attention influence word order choices.

251 citations


Proceedings Article
01 Jun 2007
TL;DR: A set of syntactic reordering rules that exploit systematic differences between Chinese and English word order are described, which are used as a preprocessor for both training and test sentences, transforming Chinese sentences to be much closer to English in terms of their word order.
Abstract: Syntactic reordering approaches are an effective method for handling word-order differences between source and target languages in statistical machine translation (SMT) systems. This paper introduces a reordering approach for translation from Chinese to English. We describe a set of syntactic reordering rules that exploit systematic differences between Chinese and English word order. The resulting system is used as a preprocessor for both training and test sentences, transforming Chinese sentences to be much closer to English in terms of their word order. We evaluated the reordering approach within the MOSES phrase-based SMT system (Koehn et al., 2007). The reordering approach improved the BLEU score for the MOSES system from 28.52 to 30.86 on the NIST 2006 evaluation data. We also conducted a series of experiments to analyze the accuracy and impact of different types of reordering rules.

240 citations


Journal ArticleDOI
TL;DR: This paper explores the consequences of adopting recent proposals by Chomsky, according to which the syntactic derivation proceeds in terms of phases, and argues that the notion of phase allows for a theory of the fact that syntactic constituents receive default phrase stress not across the board, but as a function of yet-to-be-explicated conditions on their syntactic context.
Abstract: In this article we will explore the consequences of adopting recent proposals by Chomsky, according to which the syntactic derivation proceeds in terms of phases. The notion of phase - through the associated notion of spellout - allows for an insightful theory of the fact that syntactic constituents receive default phrase stress not across the board, but as a function of yet-to-be-explicated conditions on their syntactic context. We will see that the phonological evidence requires us to modify somewhat the theory of which functional categories actually define a phase. Patterns of default, syntax-determined, phrase stress are argued to result from prosodic spellout requiring the highest phrase in the spellout domain to correspond to a major prosodic phrase in phonological representation, and carry major phrase stress.

237 citations


Proceedings ArticleDOI
23 Jun 2007
TL;DR: The English Lexical Substitution task for SemEval is described. In the task, annotators and systems find an alternative substitute word or phrase for a target word in context, which involves both finding the synonyms and disambiguating the context.
Abstract: In this paper we describe the English Lexical Substitution task for SemEval. In the task, annotators and systems find an alternative substitute word or phrase for a target word in context. The task involves both finding the synonyms and disambiguating the context. Participating systems are free to use any lexical resource. There is a subtask which requires identifying cases where the word is functioning as part of a multiword in the sentence and detecting what that multiword is.

Proceedings Article
01 Apr 2007
TL;DR: The phrase translation strategy significantly outperformed the sentence translation strategy and its relative performance was 0.92 to 0.97 compared to directly trained SMT systems.
Abstract: We compare two pivot strategies for phrase-based statistical machine translation (SMT), namely phrase translation and sentence translation. The phrase translation strategy means that we directly construct a phrase translation table (phrase-table) of the source and target language pair from two phrase-tables; one constructed from the source language and English and one constructed from English and the target language. We then use that phrase-table in a phrase-based SMT system. The sentence translation strategy means that we first translate a source language sentence into n English sentences and then translate these n sentences into target language sentences separately. Then, we select the highest scoring sentence from these target sentences. We conducted controlled experiments using the Europarl corpus to evaluate the performance of these pivot strategies as compared to directly trained SMT systems. The phrase translation strategy significantly outperformed the sentence translation strategy. Its relative performance was 0.92 to 0.97 compared to directly trained SMT systems.
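
The phrase translation strategy described above builds one phrase-table from two. The abstract does not give the combination formula, so the sketch below assumes the simplest choice: summing over shared English pivot phrases the product of the two translation probabilities. The function name `pivot_phrase_table`, the nested-dictionary layout, and the example entries are hypothetical.

```python
from collections import defaultdict


def pivot_phrase_table(src_to_en, en_to_tgt):
    """Compose two phrase-tables through English.

    src_to_en[src][en] and en_to_tgt[en][tgt] hold translation probabilities.
    Assumption: p(tgt | src) = sum over en of p(en | src) * p(tgt | en).
    """
    table = defaultdict(lambda: defaultdict(float))
    for src, en_probs in src_to_en.items():
        for en, p_en in en_probs.items():
            # English phrases missing from the second table contribute nothing.
            for tgt, p_tgt in en_to_tgt.get(en, {}).items():
                table[src][tgt] += p_en * p_tgt
    return {src: dict(tgts) for src, tgts in table.items()}


# Made-up toy entries, for illustration only.
fr_en = {"maison": {"house": 0.8, "home": 0.2}}
en_de = {"house": {"Haus": 1.0}, "home": {"Haus": 0.5, "Heim": 0.5}}
fr_de = pivot_phrase_table(fr_en, en_de)
```

Source phrases whose English translations never appear in the second table drop out of the result entirely, which is one reason a pivoted table may cover less than a directly trained one.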

BookDOI
31 Jan 2007
TL;DR: The book is an important reference not only for all phonologists, but for all linguists interested in the problem of interfaces and for psycholinguists and for cognitive scientists working on perception of language and language acquisition.
Abstract: Prosodic Phonology by Marina Nespor and Irene Vogel is finally available again. "Nespor & Vogel 1986" is a citation classic, and even after twenty years, it is still recognized as the standard resource on Prosodic Phonology. This groundbreaking work introduces all of the prosodic domains (syllable, foot, word, clitic group, phonological phrase, intonational phrase and utterance) and comments on the evidence in their favor from numerous languages. It also contains a chapter on the phonology of poetic meter, and a chapter on the experimental testing of the role of the prosodic constituents in the perception of ambiguous sentences. The book is an important reference not only for all phonologists, but for all linguists interested in the problem of interfaces. It is a basic resource also for psycholinguists and for cognitive scientists working on perception of language and language acquisition.

Proceedings Article
01 Apr 2007
TL;DR: The output of the automatic post-editing (APE) system is not only better quality than the rule-based MT (both in terms of the BLEU and TER metrics), it is also better than the output of a state-of-the-art phrase-based MT system used in standalone translation mode.
Abstract: We propose to use a statistical phrase-based machine translation system in a post-editing task: the system takes as input raw machine translation output (from a commercial rule-based MT system), and produces post-edited target-language text. We report on experiments that were performed on data collected in precisely such a setting: pairs of raw MT output and their manually post-edited versions. In our evaluation, the output of our automatic post-editing (APE) system is not only better quality than the rule-based MT (both in terms of the BLEU and TER metrics), it is also better than the output of a state-of-the-art phrase-based MT system used in standalone translation mode. These results indicate that automatic post-editing constitutes a simple and efficient way of combining rule-based and statistical MT technologies.

Proceedings Article
01 Apr 2007
TL;DR: The word-level combination provides the most robust gains but the best results on the development test sets (NIST MT05 and the newsgroup portion of GALE 2006 dry-run) were achieved by combining all three methods.
Abstract: Currently there are several approaches to machine translation (MT) based on different paradigms; e.g., phrasal, hierarchical and syntax-based. These three approaches yield similar translation accuracy despite using fairly different levels of linguistic knowledge. The availability of such a variety of systems has led to a growing interest toward finding better translations by combining outputs from multiple systems. This paper describes three different approaches to MT system combination. These combination methods operate on sentence, phrase and word level exploiting information from N-best lists, system scores and target-to-source phrase alignments. The word-level combination provides the most robust gains but the best results on the development test sets (NIST MT05 and the newsgroup portion of GALE 2006 dry-run) were achieved by combining all three methods.

Patent
16 Nov 2007
TL;DR: A practical system/method is described for predicting spoken text (a spoken word or a spoken sentence/phrase) given that text's partial spelling (for example, initial characters forming the spelling of a word/sentence).
Abstract: This disclosure describes a practical system/method for predicting spoken text (a spoken word or a spoken sentence/phrase) given that text's partial spelling (for example, initial characters forming the spelling of a word/sentence). The partial spelling may be given using “Speech” or may be inputted using the keyboard/keypad or may be obtained using other input methods. The disclosed system is an alternative method for inputting text into devices; the method is faster (especially for long words or phrases) compared to existing predictive-text-input and/or word-completion methods.

Journal ArticleDOI
TL;DR: The different views discussed in the literature are reviewed, data from crucial experiments investigating the temporal and neurotopological parameters of different information types encoded in verbs are reported and the neurophysiological indices for non-local dependency relations vary as a function of the morphological richness of the language.

Book ChapterDOI
12 Sep 2007
TL;DR: The developed Affect Analysis Model was designed to handle not only correctly written text, but also informal messages written in abbreviated or expressive manner, and an avatar was created in order to reflect the detected affective information and social behaviour.
Abstract: In this paper, we address the tasks of recognition and interpretation of affect communicated through text messaging. The evolving nature of language in online conversations is a main issue in affect sensing from this media type, since sentence parsing might fail during syntactic structure analysis. The developed Affect Analysis Model was designed to handle not only correctly written text, but also informal messages written in an abbreviated or expressive manner. The proposed rule-based approach processes each sentence in sequential stages, including symbolic cue processing, detection and transformation of abbreviations, sentence parsing, and word/phrase/sentence-level analyses. In a study based on 160 sentences, the system result agrees with at least two out of three human annotators in 70% of the cases. In order to reflect the detected affective information and social behaviour, an avatar was created.

Journal ArticleDOI
TL;DR: The authors showed that interference effects from structural relationships that are inconsistent with any grammatical parse of the perceived input can be observed when items occurring between a head and a dependent overlapped with either syntactic or semantic features of the dependent.
Abstract: Evidence from 3 experiments reveals interference effects from structural relationships that are inconsistent with any grammatical parse of the perceived input. Processing disruption was observed when items occurring between a head and a dependent overlapped with either (or both) syntactic or semantic features of the dependent. Effects of syntactic interference occur in the earliest online measures in the region where the retrieval of a long-distance dependent occurs. Semantic interference effects occur in later online measures at the end of the sentence. Both effects endure in offline comprehension measures, suggesting that interfering items participate in incorrect interpretations that resist reanalysis. The data are discussed in terms of a cue-based retrieval account of parsing, which reconciles the fact that the parser must violate the grammar in order for these interference effects to occur. Broader implications of this research indicate a need for a precise specification of the interface between the parsing mechanism and the memory system that supports language comprehension.

Patent
30 Nov 2007
TL;DR: An interactive speech recognition system is described that includes a database containing a plurality of reference terms, a list memory that receives the reference terms of category "n", a processing circuit that populates the list memory with the reference terms corresponding to category "n", and a recognition circuit that processes the reference terms and the terms of a spoken phrase.
Abstract: An interactive speech recognition system includes a database containing a plurality of reference terms, a list memory that receives the reference terms of category “n,” a processing circuit that populates the list memory with the reference terms corresponding to the category “n,” and a recognition circuit that processes the reference terms and terms of a spoken phrase. The recognition circuit determines if a reference term of category “n” matches a term of the spoken phrase.

Proceedings Article
01 Jun 2007
TL;DR: Central to this approach is triangulation, the process of translating from a source to a target language via an intermediate third language; experimental results demonstrate BLEU improvements for triangulated models over a standard phrase-based system.
Abstract: Current phrase-based SMT systems perform poorly when using small training sets. This is a consequence of unreliable translation estimates and low coverage over source and target phrases. This paper presents a method which alleviates this problem by exploiting multiple translations of the same source phrase. Central to our approach is triangulation, the process of translating from a source to a target language via an intermediate third language. This allows the use of a much wider range of parallel corpora for training, and can be combined with a standard phrase-table using conventional smoothing methods. Experimental results demonstrate BLEU improvements for triangulated models over a standard phrase-based system.

Journal ArticleDOI
TL;DR: In this article, the authors exploit agreement attraction in order to examine the mechanisms underlying the production of subject-verb agreement in Slovak, and show that the likelihood of interference from the local noun depends on the relative markedness of the subject phrase's head and local noun gender.

Journal ArticleDOI
TL;DR: Two experiments examining the effect of prosodic structure and phrase length on pause duration showed a significant post-boundary effect of prosodic branching and significant pre- and post-boundary phrase length effects.

Patent
27 Aug 2007
TL;DR: In this article, a system and method for facilitating transactions utilizing phrase tokens are provided, where individual entities can be associated with unambiguous transaction phrase tokens, such as multiple word phrases.
Abstract: A system and method for facilitating transactions utilizing phrase tokens are provided. Individual entities can be associated with unambiguous transaction phrase tokens, such as multiple word phrases. The transaction phrase tokens are associated with transaction accounts by a service provider such that the entities can complete a transaction without having to exchange transaction account information. In a transaction, a transaction phrase token is offered to an accepting party, which tenders the offered transaction phrase token to the service provider. The service provider processes the offered transaction phrase token according to configuration information specified for the transaction phrase token. The service provider can automatically process the transaction request or request additional information.

Journal ArticleDOI
TL;DR: The authors found that laughter "punctuates" speech, occurring during pauses, at phrase boundaries, and before and after statements and questions, the places where punctuation would be placed in a transcript of a conversation.
Abstract: Laughter “punctuates” speech, occurring during pauses, at phrase boundaries, and before and after statements and questions—the places where punctuation would be placed in a transcript of a conversation. Such punctuation indicates that language is dominant over laughter in competition for the vocal tract because laughter seldom interrupts spoken phrases. The punctuation effect is shown here to extend to emoticon placement in website text messages, a nonvocal linguistic medium. As in earlier studies of speaking and manual signing, the phrase structure of language was preserved, indicating the regulation of emotional expression by a common, higher-order linguistic process.

Patent
20 Aug 2007
TL;DR: A speech interaction device is described, comprising a candidate generation section 112 that recognizes speech and generates response candidates together with a likelihood for each, a response sentence generation section 113 that generates a response sentence, an output section 102 that outputs synthesized speech of the response sentence, and a correction phrase generation section 114 that generates at least one correction phrase for a phrase in the response sentence by analyzing the recognition result for speech the user utters while the synthesized speech is being output.
Abstract: PROBLEM TO BE SOLVED: To provide a speech interaction device capable of easily correcting an error part without interrupting interaction. SOLUTION: The speech interaction device comprises: a candidate generation section 112 which recognizes speech, and which generates a candidate of response and a likelihood for showing probability of the candidate of response; a response sentence generation section 113 for generating a response sentence including a phrase for expressing a content that the candidate of the most likely response is selected; an output section 102 for outputting synthesis speech of response sentence; a correction phrase generation section 114 for generating at least one correction phrase corresponding to the phrase included in the response sentence by analyzing the recognition result for the speech a user utters during an output of synthesis speech; a selection section 115 which obtains the candidate of the response including the phrase of the same meaning content with the generated correction phrase from the generated candidate of the response, and which selects the candidate of the most likely response in the obtained response candidates; and an update section 116 for updating the response sentence with the phrase expressing the content of the candidate of the selected response. The output section 102 outputs synthesis speech of the response sentence after updating. COPYRIGHT: (C)2009,JPO&INPIT

Journal ArticleDOI
TL;DR: A quartile analysis showed that for both types of violations, larger average violation effects were associated with lower relative amplitudes of oscillatory activity, implying an inverse relation between ERP amplitude and event-related power magnitude change in sentence processing.

Journal ArticleDOI
TL;DR: Studies of Hindi investigate whether responses to syntactic agreement violations vary as a function of the type and number of incorrect agreement features, using both electrophysiological (ERP) and behavioral measures, and suggest that the P600 response to agreement violations is not additive based on the number of mismatching features and does not reflect top-down, predictive mechanisms.

Proceedings Article
01 Dec 2007
TL;DR: Improved results are obtained by inverting a semantic parser that uses SMT methods to map sentences into meaning representations to natural language, and it is shown that hybridizing these two approaches results in still more accurate generation systems.
Abstract: This paper explores the use of statistical machine translation (SMT) methods for tactical natural language generation. We present results on using phrase-based SMT for learning to map meaning representations to natural language. Improved results are obtained by inverting a semantic parser that uses SMT methods to map sentences into meaning representations. Finally, we show that hybridizing these two approaches results in still more accurate generation systems. Automatic and human evaluation of generated sentences are presented across two domains and four languages.

Proceedings Article
01 Jun 2007
TL;DR: This work describes new lookup algorithms for hierarchical phrase-based translation that reduce the empirical computation time by nearly two orders of magnitude, making on-the-fly lookup feasible for source phrases with gaps.
Abstract: A major engineering challenge in statistical machine translation systems is the efficient representation of extremely large translation rulesets. In phrase-based models, this problem can be addressed by storing the training data in memory and using a suffix array as an efficient index to quickly look up and extract rules on the fly. Hierarchical phrase-based translation introduces the added wrinkle of source phrases with gaps. Lookup algorithms used for contiguous phrases no longer apply and the best approximate pattern matching algorithms are much too slow, taking several minutes per sentence. We describe new lookup algorithms for hierarchical phrase-based translation that reduce the empirical computation time by nearly two orders of magnitude, making on-the-fly lookup feasible for source phrases with gaps.
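
As background for the abstract above, the contiguous-phrase case it mentions (a suffix array over in-memory training data) can be sketched as below. This is a minimal illustration under stated assumptions, not the paper's algorithm: the helper names `build_suffix_array` and `find_phrase` are hypothetical, the construction is the naive sort of all suffixes, and phrases with gaps, which are the paper's actual contribution, are not handled.

```python
def build_suffix_array(tokens):
    """Naive construction: sort suffix start positions by the suffix they begin."""
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def find_phrase(tokens, suffix_array, phrase):
    """Return the sorted start positions where the contiguous phrase occurs."""
    n = len(phrase)

    def prefix(rank):
        start = suffix_array[rank]
        return tokens[start:start + n]

    # Lower bound: first suffix whose n-token prefix is >= phrase.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < phrase:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    # Upper bound: first suffix whose n-token prefix is > phrase.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= phrase:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[first:lo])


corpus = "the white house is the white building".split()
index = build_suffix_array(corpus)
positions = find_phrase(corpus, index, ["the", "white"])
```

All occurrences of a phrase sit in one contiguous block of the sorted suffixes, so two binary searches locate them without scanning the corpus. A gap breaks that contiguity, which is why this lookup does not carry over to hierarchical phrases.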