
Showing papers on "Phrase" published in 2012


Proceedings Article
12 Jul 2012
TL;DR: Shallow approaches to modeling compositional meaning for phrases and sentences using distributional methods are found to be as good as more computationally intensive alternatives with regards to two particular tests: phrase similarity and paraphrase detection.
Abstract: In this paper we address the problem of modeling compositional meaning for phrases and sentences using distributional methods. We experiment with several possible combinations of representation and composition, exhibiting varying degrees of sophistication. Some are shallow while others operate over syntactic structure, rely on parameter learning, or require access to very large corpora. We find that shallow approaches are as good as more computationally intensive alternatives with regards to two particular tests: (1) phrase similarity and (2) paraphrase detection. The sizes of the involved training corpora and the generated vectors are not as important as the fit between the meaning representation and compositional method.
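The shallow composition functions the paper compares can be illustrated with a toy sketch (plain Python with made-up vectors; the function names are illustrative, not taken from the paper): additive and multiplicative vector composition, scored with cosine similarity.

```python
import math

def cosine(u, v):
    # Cosine similarity between two context-count vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def compose_add(u, v):
    # Shallow additive composition: phrase vector = sum of word vectors.
    return [a + b for a, b in zip(u, v)]

def compose_mult(u, v):
    # Shallow multiplicative composition: component-wise product.
    return [a * b for a, b in zip(u, v)]

# Toy distributional vectors over three context dimensions.
red, car, auto = [2.0, 0.0, 1.0], [0.0, 3.0, 1.0], [0.0, 2.0, 1.0]
print(round(cosine(compose_add(red, car), compose_add(red, auto)), 3))
```

A phrase-similarity test in this setting simply compares the composed vectors of two phrases, exactly as the word-level similarity test compares word vectors.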

304 citations


Proceedings Article
08 Jul 2012
TL;DR: By relatively careful phrase-based paraphrasing this model achieves similar simplification results to state-of-the-art systems, while generating better formed output, and argues that text readability metrics such as the Flesch-Kincaid grade level should be used with caution when evaluating the output of simplification systems.
Abstract: In this paper we describe a method for simplifying sentences using Phrase Based Machine Translation, augmented with a re-ranking heuristic based on dissimilarity, and trained on a monolingual parallel corpus. We compare our system to a word-substitution baseline and two state-of-the-art systems, all trained and tested on paired sentences from the English part of Wikipedia and Simple Wikipedia. Human test subjects judge the output of the different systems. Analysing the judgements shows that by relatively careful phrase-based paraphrasing our model achieves similar simplification results to state-of-the-art systems, while generating better formed output. We also argue that text readability metrics such as the Flesch-Kincaid grade level should be used with caution when evaluating the output of simplification systems.

287 citations


Journal ArticleDOI
TL;DR: This survey looks at the use of vector space models to describe the meaning of words and phrases: the phenomena that vector space models address, and the techniques that they use to do so.
Abstract: Distributional models represent a word through the contexts in which it has been observed. They can be used to predict similarity in meaning, based on the distributional hypothesis, which states that two words that occur in similar contexts tend to have similar meanings. Distributional approaches are often implemented in vector space models. They represent a word as a point in high-dimensional space, where each dimension stands for a context item, and a word's coordinates represent its context counts. Occurrence in similar contexts then means proximity in space. In this survey we look at the use of vector space models to describe the meaning of words and phrases: the phenomena that vector space models address, and the techniques that they use to do so. Many word meaning phenomena can be described in terms of semantic similarity: synonymy, priming, categorization, and the typicality of a predicate's arguments. But vector space models can do more than just predict semantic similarity. They are a very flexible tool, because they can make use of all of linear algebra, with all its data structures and operations. The dimensions of a vector space can stand for many things: context words, or non-linguistic context like images, or properties of a concept. And vector space models can use matrices or higher-order arrays instead of vectors for representing more complex relationships. Polysemy is a tough problem for distributional approaches, as a representation that is learned from all of a word's contexts will conflate the different senses of the word. It can be addressed, using either clustering or vector combination techniques. Finally, we look at vector space models for phrases, which are usually constructed by combining word vectors. Vector space models for phrases can predict phrase similarity, and some argue that they can form the basis for a general-purpose representation framework for natural language semantics.
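The basic construction the survey describes, representing a word by counts of the context items it occurs with, can be sketched in a few lines (illustrative code, not from the survey; the window size is an arbitrary choice):

```python
from collections import Counter, defaultdict

def build_vectors(tokens, window=2):
    # Map each word to a Counter of context words seen within +/-window positions.
    vecs = defaultdict(Counter)
    for i, w in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                vecs[w][tokens[j]] += 1
    return vecs

corpus = "the red car sped past the blue car on the road".split()
vecs = build_vectors(corpus)
print(vecs["car"].most_common(3))
```

Each Counter is one row of the co-occurrence matrix; "occurrence in similar contexts means proximity in space" then falls out of comparing these rows with any vector similarity measure.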

284 citations


Journal ArticleDOI
TL;DR: The PHRASE List is presented, a list of the 505 most frequent non-transparent multiword expressions in English, intended especially for receptive use, to provide a basis for the systematic integration of multiword lexical items into teaching materials, vocabulary tests, and learning syllabuses.
Abstract: There is little dispute that formulaic sequences form an important part of the lexicon, but to date there has been no principled way to prioritize the inclusion of such items in pedagogic materials, such as ESL/EFL textbooks or tests of vocabulary knowledge. While wordlists have been used for decades, they have only provided information about individual word forms (e.g. the General Service List (West 1953) and the Academic Word List (Coxhead 2000)). This article addresses this deficiency by presenting the PHRASal Expressions List (PHRASE List), a list of the 505 most frequent non-transparent multiword expressions in English, intended especially for receptive use. The rationale and development of the list are discussed, as well as its compatibility with British National Corpus single-word frequency lists. It is hoped that the PHRASE List will provide a basis for the systematic integration of multiword lexical items into teaching materials, vocabulary tests, and learning syllabuses.

244 citations


Proceedings Article
01 May 2012
TL;DR: A corpus generation experiment is described where regular and sarcastic Amazon product reviews are collected and the resulting corpus can be used for identifying sarcasm on two levels: a document and a text utterance.
Abstract: The ability to reliably identify sarcasm and irony in text can improve the performance of many Natural Language Processing (NLP) systems including summarization, sentiment analysis, etc. The existing sarcasm detection systems have focused on identifying sarcasm on a sentence level or for a specific phrase. However, often it is impossible to identify a sentence containing sarcasm without knowing the context. In this paper we describe a corpus generation experiment where we collect regular and sarcastic Amazon product reviews. We perform qualitative and quantitative analysis of the corpus. The resulting corpus can be used for identifying sarcasm on two levels: a document and a text utterance (where a text utterance can be as short as a sentence and as long as a whole document).

204 citations


Journal ArticleDOI
TL;DR: The study shows that even short writing samples can be useful in assessing general proficiency at the lower levels of L2 proficiency and that a cross-sectional study of samples at different proficiency levels can give worthwhile insights into dynamic L2 developmental patterns.

199 citations


Journal ArticleDOI
TL;DR: College professionals and student leaders must acknowledge that the phrase "that's so gay" is a form of heterosexist harassment, and policies addressing diversity and harassment should address students' use of this phrase, aiming to reduce its use.
Abstract: Objective: The investigators examined the health and well-being correlates of hearing the popular phrase “that's so gay” among gay, lesbian, and bisexual (GLB) emerging adults. Participants: Participants were 114 self-identified GLB students aged 18 to 25 years. Methods: An online survey was distributed to students at a large public university in the Midwest during winter 2009. Results: Participants’ social and physical well-being was negatively associated with hearing this phrase, specifically feeling isolated and experiencing physical health symptoms (ie, headaches, poor appetite, or eating problems). Conclusions: College professionals and student leaders must acknowledge that the phrase is a form of heterosexist harassment. As such, policies addressing diversity and harassment should address students’ use of this phrase, aiming to reduce its use. Additionally, colleges and universities should develop practices that counteract poorer well-being associated with hearing the phrase.

139 citations


Proceedings Article
01 Dec 2012
TL;DR: Experimental evidence is provided that the approach seems to be able to infer meaningful translation probabilities for phrase pairs not seen in the training data, or even predict a list of the most likely translations given a source phrase.
Abstract: This paper presents a new approach to perform the estimation of the translation model probabilities of a phrase-based statistical machine translation system. We use neural networks to directly learn the translation probability of phrase pairs using continuous representations. The system can be easily trained on the same data used to build standard phrase-based systems. We provide experimental evidence that the approach seems to be able to infer meaningful translation probabilities for phrase pairs not seen in the training data, or even predict a list of the most likely translations given a source phrase. The approach can be used to rescore n-best lists, but we also discuss an integration into the Moses decoder. A preliminary evaluation on the English/French IWSLT task achieved improvements in the BLEU score and a human analysis showed that the new model often chooses semantically better translations. Several extensions of this work are discussed.
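For contrast with the neural estimates, the standard phrase-table probabilities of a phrase-based system are relative frequencies over extracted phrase pairs, which assign zero to any pair unseen in training. A minimal sketch of that baseline estimation (toy data; this is not the paper's neural model):

```python
from collections import Counter

def phrase_translation_probs(phrase_pairs):
    # Relative-frequency estimate p(target | source) over extracted phrase pairs,
    # the standard baseline that the neural model is meant to generalize beyond.
    pair_counts = Counter(phrase_pairs)
    src_counts = Counter(src for src, _ in phrase_pairs)
    return {(s, t): c / src_counts[s] for (s, t), c in pair_counts.items()}

pairs = [("la maison", "the house"), ("la maison", "the house"),
         ("la maison", "the home"), ("maison", "house")]
probs = phrase_translation_probs(pairs)
print(probs[("la maison", "the house")])  # 2/3
```

The neural approach instead scores phrase pairs through continuous representations, which is what lets it assign nonzero probability to pairs absent from the training extraction.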

129 citations


Patent
09 Jul 2012
TL;DR: In this paper, the recognition results produced by a speech processing system (which may include two or more recognition results, including a top recognition result and one or more alternative recognition results) based on an analysis of a speech input, are evaluated for indications of potential significant errors.
Abstract: In some embodiments, the recognition results produced by a speech processing system (which may include two or more recognition results, including a top recognition result and one or more alternative recognition results) based on an analysis of a speech input, are evaluated for indications of potential significant errors. In some embodiments, the recognition results may be evaluated to determine whether a meaning of any of the alternative recognition results differs from a meaning of the top recognition result in a manner that is significant for a domain, such as the medical domain. In some embodiments, words and/or phrases that may be confused by an ASR system may be determined and associated in sets of words and/or phrases. Words and/or phrases that may be determined include those that change a meaning of a phrase or sentence when included in the phrase/sentence.

126 citations


Proceedings Article
08 Jul 2012
TL;DR: This work uses character-level translation trained on n-gram-character-aligned bitexts and tuned using word-level BLEU, further augmented with character-based transliteration at the word level and combined with a word-level translation model.
Abstract: We propose several techniques for improving statistical machine translation between closely-related languages with scarce resources. We use character-level translation trained on n-gram-character-aligned bitexts and tuned using word-level BLEU, which we further augment with character-based transliteration at the word level and combine with a word-level translation model. The evaluation on Macedonian-Bulgarian movie subtitles shows an improvement of 2.84 BLEU points over a phrase-based word-level baseline.

119 citations


Patent
09 May 2012
TL;DR: An n-dimensional biometric security system is disclosed, as well as a method of identifying and validating a user through automated random one-time passphrase generation.
Abstract: There is disclosed an n-dimensional biometric security system as well as a method of identifying and validating a user through the use of an automated random one-time passphrase generation. The use of tailored templates to generate one-time pass phrase text as well as the use of update subscriptions of templates ensures a high level of security. A verification session preferably uses short, text-independent one-time pass phrases and secure audio tokens with master audio generated from an internal text-to-speech security processor. An automated enrollment process may be implemented in an ongoing and seamless fashion with a user's interactions with the system. Various calibration and tuning techniques are also disclosed.

Journal ArticleDOI
TL;DR: This article introduces a different kind of index that replaces the text using essentially the same space required by the compressed text alone (compression ratio around 35%).
Abstract: The inverted index supports efficient full-text searches on natural language text collections. It requires some extra space over the compressed text that can be traded for search speed. It is usually fast for single-word searches, yet phrase searches require more expensive intersections. In this article we introduce a different kind of index. It replaces the text using essentially the same space required by the compressed text alone (compression ratio around 35%). Within this space it supports not only decompression of arbitrary passages, but efficient word and phrase searches. Searches are orders of magnitude faster than those over inverted indexes when looking for phrases, and still faster on single-word searches when little space is available. Our new indexes are particularly fast at counting the occurrences of words or phrases. This is useful for computing relevance of words or phrases. We adapt self-indexes that succeeded in indexing arbitrary strings within compressed space to deal with large alphabets. Natural language texts are then regarded as sequences of words, not characters, to achieve word-based self-indexes. We design an architecture that separates the searchable sequence from its presentation aspects. This permits applying case folding, stemming, removing stopwords, etc. as is usual on inverted indexes.
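The inverted-index phrase search that the article compares against can be sketched with positional postings: a phrase starting at position p matches if its i-th word occurs at p + i, which is checked by intersecting the (shifted) posting lists. This is illustrative baseline code, not the article's self-index:

```python
from collections import defaultdict

def build_positional_index(tokens):
    # word -> sorted list of positions; positional postings enable phrase queries.
    index = defaultdict(list)
    for pos, w in enumerate(tokens):
        index[w].append(pos)
    return index

def phrase_search(index, phrase):
    # Intersect shifted postings: p starts a match iff word i occurs at p + i.
    words = phrase.split()
    if any(w not in index for w in words):
        return []
    candidates = set(index[words[0]])
    for i, w in enumerate(words[1:], start=1):
        candidates &= {p - i for p in index[w]}
    return sorted(candidates)

text = "to be or not to be that is the question".split()
idx = build_positional_index(text)
print(phrase_search(idx, "to be"))  # matches at positions 0 and 4
```

The per-word intersections are exactly the extra cost the article attributes to phrase queries on inverted indexes, and what its word-based self-index is designed to avoid.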

Patent
23 May 2012
TL;DR: The techniques identify at least one text segment in a textual representation having a plurality of text segments, and annotate the textual representation with disambiguating information to help disambiguate that text segment from an acoustically similar word and/or phrase.
Abstract: Techniques for disambiguating at least one text segment from at least one acoustically similar word and/or phrase. The techniques include identifying at least one text segment, in a textual representation having a plurality of text segments, having at least one acoustically similar word and/or phrase, annotating the textual representation with disambiguating information to help disambiguate the at least one text segment from the at least one acoustically similar word and/or phrase, and synthesizing a speech signal, at least in part, by performing text-to-speech synthesis on at least a portion of the textual representation that includes the at least one text segment, wherein the speech signal includes speech corresponding to the disambiguating information located proximate the portion of the speech signal corresponding to the at least one text segment.

Journal ArticleDOI
27 Mar 2012-PLOS ONE
TL;DR: In both experiments, naming latencies decreased with increasing frequency of the multi-word phrase, and were unaffected by the frequency of the object name in the utterance.
Abstract: A classic debate in the psychology of language concerns the question of the grain-size of the linguistic information that is stored in memory. One view is that only morphologically simple forms are stored (e.g., ‘car’, ‘red’), and that more complex forms of language such as multi-word phrases (e.g., ‘red car’) are generated on-line from the simple forms. In two experiments we tested this view. In Experiment 1, participants produced noun+adjective and noun+noun phrases that were elicited by experimental displays consisting of colored line drawings and two superimposed line drawings. In Experiment 2, participants produced noun+adjective and determiner+noun+adjective utterances elicited by colored line drawings. In both experiments, naming latencies decreased with increasing frequency of the multi-word phrase, and were unaffected by the frequency of the object name in the utterance. These results suggest that the language system is sensitive to the distribution of linguistic information at grain-sizes beyond individual words.

Journal Article
TL;DR: The use of gesture is proposed as a facilitating educational tool that integrates body and mind and indicates that the neural representation of words consists of complex multimodal networks connecting perception and motor acts that occur during learning.
Abstract: Language and gesture are highly interdependent systems that reciprocally influence each other. For example, performing a gesture when learning a word or a phrase enhances its retrieval compared to pure verbal learning. Although the enhancing effects of co-speech gestures on memory are known to be robust, the underlying neural mechanisms are still unclear. Here, we summarize the results of behavioral and neuroscientific studies. They indicate that the neural representation of words consists of complex multimodal networks connecting perception and motor acts that occur during learning. In this context, gestures can reinforce the sensorimotor representation of a word or a phrase, making it resistant to decay. Also, gestures can favor embodiment of abstract words by creating a sensorimotor representation from scratch. Thus, we propose the use of gesture as a facilitating educational tool that integrates body and mind.

Proceedings Article
23 Apr 2012
TL;DR: The degradation in translation performance when bilingually estimated translation probabilities are removed is examined and it is shown that 80%+ of the loss can be recovered with monolingually estimated features alone.
Abstract: We estimate the parameters of a phrase-based statistical machine translation system from monolingual corpora instead of a bilingual parallel corpus. We extend existing research on bilingual lexicon induction to estimate both lexical and phrasal translation probabilities for MT-scale phrase-tables. We propose a novel algorithm to estimate reordering probabilities from monolingual data. We report translation results for an end-to-end translation system using these monolingual features alone. Our method only requires monolingual corpora in source and target languages, a small bilingual dictionary, and a small bitext for tuning feature weights. In this paper, we examine an idealization where a phrase-table is given. We examine the degradation in translation performance when bilingually estimated translation probabilities are removed and show that 80%+ of the loss can be recovered with monolingually estimated features alone. We further show that our monolingual features add 1.5 BLEU points when combined with standard bilingually estimated phrase table features.

Proceedings Article
10 Jul 2012
TL;DR: The key innovation provided by the toolkit is that the decoder can work with various grammars and offers different choices of decoding algorithms, such as phrase-based decoding, decoding as parsing/tree-parsing, and forest-based decoding.
Abstract: We present a new open source toolkit for phrase-based and syntax-based machine translation. The toolkit supports several state-of-the-art models developed in statistical machine translation, including the phrase-based model, the hierarchical phrase-based model, and various syntax-based models. The key innovation provided by the toolkit is that the decoder can work with various grammars and offers different choices of decoding algorithms, such as phrase-based decoding, decoding as parsing/tree-parsing, and forest-based decoding. Moreover, several useful utilities were distributed with the toolkit, including a discriminative reordering model, a simple and fast language model, and an implementation of minimum error rate training for weight tuning.

Proceedings Article
12 Jul 2012
TL;DR: This work proposes a stochastic local search decoding method for phrase-based SMT, which permits free document-wide dependencies in the models and explores the stability and the search parameters of this method and demonstrates that it can be successfully used to optimise a document-level semantic language model.
Abstract: Independence between sentences is an assumption deeply entrenched in the models and algorithms used for statistical machine translation (SMT), particularly in the popular dynamic programming beam search decoding algorithm. This restriction is an obstacle to research on more sophisticated discourse-level models for SMT. We propose a stochastic local search decoding method for phrase-based SMT, which permits free document-wide dependencies in the models. We explore the stability and the search parameters of this method and demonstrate that it can be successfully used to optimise a document-level semantic language model.

Journal ArticleDOI
TL;DR: This article used prosodic cues such as pitch declination and final lengthening to segment continuous speech into phrases and to group these phrases into sentences, finding that prosodic cues signaled hierarchical structures (i.e., phrases embedded within sentences) and non-adjacent relations (e.g., AxB rules within phrases), while transitional probabilities favored adjacent dependencies that straddled phrase and sentence boundaries.

Proceedings Article
Xiaodong He1, Li Deng1
08 Jul 2012
TL;DR: The proposed method, evaluated on the Europarl German-to-English dataset, leads to a 1.1 BLEU point improvement over a state-of-the-art baseline translation system.
Abstract: This paper proposes a new discriminative training method in constructing phrase and lexicon translation models. In order to reliably learn a myriad of parameters in these models, we propose an expected BLEU score-based utility function with KL regularization as the objective, and train the models on a large parallel dataset. For training, we derive growth transformations for phrase and lexicon translation probabilities to iteratively improve the objective. The proposed method, evaluated on the Europarl German-to-English dataset, leads to a 1.1 BLEU point improvement over a state-of-the-art baseline translation system. In IWSLT 2011 Benchmark, our system using the proposed method achieves the best Chinese-to-English translation result on the task of translating TED talks.

Journal ArticleDOI
Mark Baltin1
TL;DR: The paper shows that, within a phase-based syntax, Voice rather than v must be a phase, though both functional heads must exist; it offers a new explanation for the incompatibility of passive and British English do, as well as an account of why some languages, like English, lack impersonal passives while others, such as Dutch, allow them.
Abstract: This paper examines an anaphoric construction, British English do, and locates it within the dichotomy in the ellipsis literature between deleted phrases and null pro-forms, concluding that the choice is a false one, in that pro-forms involve deletion as well; the question, then, is how to account for the differential permeability to dependencies that require external licensing of the various deleted constituents. British English do has some characteristics of a fully deleted phrase, and some of a pro-form. The paper proposes that deletion is involved in this construction, but of a smaller constituent than can host wh-movement or long quantifier-raising. Therefore, deletion must occur within the syntax, in order to bleed syntactic processes. It is further shown that, within a phase-based syntax, Voice must be a phase rather than v, but that both functional heads must exist, and offers a new explanation for the incompatibility of passive and British English do, as well as an account of why some languages, like English, lack impersonal passives, while others, such as Dutch, allow them.

Proceedings Article
12 Jul 2012
TL;DR: This article presents a hierarchical generative probabilistic model of topical phrases that simultaneously infers the location, length, and topic of phrases within a corpus and relaxes the bag-of-words assumption within phrases by using a hierarchy of Pitman-Yor processes.
Abstract: Topic models traditionally rely on the bag-of-words assumption. In data mining applications, this often results in end-users being presented with inscrutable lists of topical unigrams, single words inferred as representative of their topics. In this article, we present a hierarchical generative probabilistic model of topical phrases. The model simultaneously infers the location, length, and topic of phrases within a corpus and relaxes the bag-of-words assumption within phrases by using a hierarchy of Pitman-Yor processes. We use Markov chain Monte Carlo techniques for approximate inference in the model and perform slice sampling to learn its hyperparameters. We show via an experiment on human subjects that our model finds substantially better, more interpretable topical phrases than do competing models.

Proceedings ArticleDOI
16 Jun 2012
TL;DR: A randomized approach to deriving visual phrases, in the form of spatial random partition, lends itself to easy parallelization and allows a flexible trade-off between accuracy and speed by adjusting the number of partition times.
Abstract: Accurate matching of local features plays an essential role in visual object search. Instead of matching individual features separately, using the spatial context, e.g., bundling a group of co-located features into a visual phrase, has shown to enable more discriminative matching. Despite previous work, it remains a challenging problem to extract appropriate spatial context for matching. We propose a randomized approach to deriving visual phrase, in the form of spatial random partition. By averaging the matching scores over multiple randomized visual phrases, our approach offers three benefits: 1) the aggregation of the matching scores over a collection of visual phrases of varying sizes and shapes provides robust local matching; 2) object localization is achieved by simple thresholding on the voting map, which is more efficient than subimage search; 3) our algorithm lends itself to easy parallelization and also allows a flexible trade-off between accuracy and speed by adjusting the number of partition times. Both theoretical studies and experimental comparisons with the state-of-the-art methods validate the advantages of our approach.

Proceedings Article
01 Dec 2012
TL;DR: It is shown that the phrase-based SMT approach is effective in correcting frequent errors that can be identified by local context, and that it is difficult for phrase- based SMT to correct errors that need long range contextual information.
Abstract: English as a Second Language (ESL) learners’ writings contain various grammatical errors. Previous research on automatic error correction for ESL learners’ grammatical errors deals with restricted types of learners’ errors. Some types of errors can be corrected by rules using heuristics, while others are difficult to correct without statistical models using native corpora and/or learner corpora. Since adding error annotation to learners’ text is time-consuming, it was not until recently that large scale learner corpora became publicly available. However, little is known about the effect of learner corpus size in ESL grammatical error correction. Thus, in this paper, we investigate the effect of learner corpus size on various types of grammatical errors, using an error correction system based on phrase-based statistical machine translation (SMT) trained on a large scale error-tagged learner corpus. We show that the phrase-based SMT approach is effective in correcting frequent errors that can be identified by local context, and that it is difficult for phrase-based SMT to correct errors that need long range contextual information.

Patent
Hoai Nguyen1
11 Oct 2012
TL;DR: In this article, a mobile computerized device receives an indication of a first user input comprising a button actuation to initiate a push-to-talk voice search, and then the device generates a search query using the one or more search terms in the spoken search phrase, responsive to receiving the second user input.
Abstract: A mobile computerized device receives an indication of a first user input comprising a button actuation to initiate a push-to-talk voice search. The device receives from the user a spoken search phrase comprising one or more search terms, and receives an indication of a second user input comprising releasing the actuated button which indicates that the user has finished speaking the search phrase. The mobile device generates a search query using the one or more search terms in the spoken search phrase, responsive to receiving the second user input. In a further example, the computerized mobile device displays one or more likely text search phrases derived from the spoken search phrase via voice-to-text conversion, receives a user input indicating which of the likely text search phrases is an intended search phrase, and uses the intended search phrase as the one or more search terms used in generating the search query.

Book ChapterDOI
02 Mar 2012
TL;DR: Analysis of patterns of human choice in a passphrase-based authentication system deployed by Amazon, a large online merchant, finds that phrase selection is far from random, with users strongly preferring simple noun bigrams which are common in natural language.
Abstract: We examine patterns of human choice in a passphrase-based authentication system deployed by Amazon, a large online merchant. We tested the availability of a large corpus of over 100,000 possible phrases at Amazon's registration page, which prohibits using any phrase already registered by another user. A number of large, readily-available lists such as movie and book titles prove effective in guessing attacks, suggesting that passphrases are vulnerable to dictionary attacks like all schemes involving human choice. Extending our analysis with natural language phrases extracted from linguistic corpora, we find that phrase selection is far from random, with users strongly preferring simple noun bigrams which are common in natural language. The distribution of chosen passphrases is less skewed than the distribution of bigrams in English text, indicating that some users have attempted to choose phrases randomly. Still, the distribution of bigrams in natural language is not nearly random enough to resist offline guessing, nor are longer three- or four-word phrases for which we see rapidly diminishing returns.
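The style of guessing attack described, ordering candidate phrases by their frequency in a natural-language corpus and measuring how many user choices fall within a fixed guess budget, can be sketched as follows (toy data; illustrative only, not the study's actual corpora):

```python
from collections import Counter

def guessing_success(chosen_phrases, guess_corpus, budget):
    # Rank guesses by corpus frequency, then report the fraction of
    # user-chosen passphrases cracked within `budget` guesses.
    guesses = [p for p, _ in Counter(guess_corpus).most_common(budget)]
    hits = sum(1 for p in chosen_phrases if p in guesses)
    return hits / len(chosen_phrases)

users = ["red car", "blue sky", "blue sky", "correct horse"]
corpus = ["blue sky", "blue sky", "blue sky", "red car", "red car", "green tea"]
print(guessing_success(users, corpus, budget=2))  # 3 of 4 cracked in 2 guesses
```

The study's finding that users prefer common noun bigrams means the real distribution of chosen phrases is skewed toward the top of such a frequency-ordered guess list, which is why offline guessing succeeds.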

Journal ArticleDOI
TL;DR: Evidence that 21-month-olds use what they have learned about noun order in English sentences to understand new transitive verbs is capitalized on, suggesting that toddlers exploit partial representations of sentence structure to guide sentence interpretation.

Proceedings Article
07 Jun 2012
TL;DR: It is shown that the out-of-domain data improves coverage and translation of rare words, but may degrade the translation quality for more common words.
Abstract: In statistical machine translation (SMT), it is known that performance declines when the training data is in a different domain from the test data. Nevertheless, it is frequently necessary to supplement scarce in-domain training data with out-of-domain data. In this paper, we first try to relate the effect of the out-of-domain data on translation performance to measures of corpus similarity, then we separately analyse the effect of adding the out-of-domain data at different parts of the training pipeline (alignment, phrase extraction, and phrase scoring). Through experiments in 2 domains and 8 language pairs it is shown that the out-of-domain data improves coverage and translation of rare words, but may degrade the translation quality for more common words.

Journal ArticleDOI
TL;DR: It is argued that the thematic licensing of causer arguments is not a strictly lexical property but depends on the event configuration within the verbal phrase, and that the verbal layers introducing causative-resultative event structure must be dissociated from those introducing external arguments syntactically.
Abstract: . This article argues that the thematic licensing of causer arguments is not a strictly lexical property but depends on the event configuration within the verbal phrase. The central observation leading to this conclusion is that three morphosyntactically different types of causer-DPs are subject to the same licensing condition: they are licit only in the context of a bi-eventive, resultative event structure. This licensing constellation is not only provided by lexically bi-eventive verbs, but also by overt syntactic composition of a mono-eventive verb with a secondary result predicate where the mono-eventive verb does not license causers on its own. The latter constellation argues against coding causer-roles in a verb’s lexical entry. Instead, it argues for an account that assumes event decomposition of lexically resultative verbs and some version of a configurational θ-theory. Concentrating on existing syntactic versions of such an account, it is shown that they need to be updated to cover the set of data presented in this paper. A central claim put forward is the need to dissociate the verbal layers introducing causative-resultative event structure (which acts as thematic licenser of causers) from those layers introducing external arguments syntactically (formal licensers). Concerning the latter, it is shown that causers, although thematically external arguments, are not necessarily introduced by a Voice projection on top of the verbal predicate.

Journal ArticleDOI
TL;DR: Event-related potentials were used to examine the extent to which structural distance impacts the processing of Spanish number and gender agreement; within-phrase agreement yielded more positive waveforms than across-phrase agreement, suggesting that agreement violations are processed similarly at the brain level.