
Showing papers on "Phrase" published in 2006


Journal ArticleDOI
TL;DR: Experiments demonstrate that the proposed method provides a similarity measure that shows a significant correlation to human intuition and can be used in a variety of applications that involve text knowledge representation and discovery.
Abstract: Sentence similarity measures play an increasingly important role in text-related research and applications in areas such as text mining, Web page retrieval, and dialogue systems. Existing methods for computing sentence similarity have been adopted from approaches used for long text documents. These methods process sentences in a very high-dimensional space and are consequently inefficient, require human input, and are not adaptable to some application domains. This paper focuses directly on computing the similarity between very short texts of sentence length. It presents an algorithm that takes account of semantic information and word order information implied in the sentences. The semantic similarity of two sentences is calculated using information from a structured lexical database and from corpus statistics. The use of a lexical database enables our method to model human common sense knowledge and the incorporation of corpus statistics allows our method to be adaptable to different domains. The proposed method can be used in a variety of applications that involve text knowledge representation and discovery. Experiments on two sets of selected sentence pairs demonstrate that the proposed method provides a similarity measure that shows a significant correlation to human intuition.

850 citations
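
The sentence-similarity entry above combines semantic information with word order information. A minimal sketch of that combination follows. It is not the authors' implementation: exact word matching stands in for the lexical-database and corpus-statistics scoring, and the function names and the `delta` weight are assumptions made for illustration.

```python
import math

def joint_words(t1, t2):
    # Distinct words of both sentences, in order of first appearance.
    seen = []
    for w in t1 + t2:
        if w not in seen:
            seen.append(w)
    return seen

def semantic_vector(tokens, joint):
    # Assumption: 1.0 for an exact match, 0.0 otherwise. A fuller system
    # would score near-synonyms using a lexical database.
    return [1.0 if w in tokens else 0.0 for w in joint]

def order_vector(tokens, joint):
    # 1-based position of each joint word in the sentence, 0 if absent.
    return [tokens.index(w) + 1 if w in tokens else 0 for w in joint]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def order_similarity(r1, r2):
    # 1 minus the normalized distance between the two position vectors.
    diff = math.sqrt(sum((x - y) ** 2 for x, y in zip(r1, r2)))
    total = math.sqrt(sum((x + y) ** 2 for x, y in zip(r1, r2)))
    return 1.0 - diff / total if total else 1.0

def sentence_similarity(s1, s2, delta=0.85):
    # delta (hypothetical default) weights semantics against word order.
    t1, t2 = s1.lower().split(), s2.lower().split()
    joint = joint_words(t1, t2)
    sem = cosine(semantic_vector(t1, joint), semantic_vector(t2, joint))
    order = order_similarity(order_vector(t1, joint), order_vector(t2, joint))
    return delta * sem + (1 - delta) * order
```

With this sketch, "dog bites man" and "man bites dog" share every word, so only the word order term separates them and the score falls slightly below that of two identical sentences.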


Journal ArticleDOI
TL;DR: An emerging theoretical framework for a working memory system that incorporates several independently motivated principles of memory: a sharply limited attentional focus, rapid retrieval of item information subject to interference from similar items, and activation decay (forgetting over time).

609 citations


Journal ArticleDOI
TL;DR: This paper shows that discourse context can immediately overrule local lexical-semantic violations, which suggests that language comprehension does not involve an initially context-free semantic analysis.
Abstract: In linguistic theories of how sentences encode meaning, a distinction is often made between the context-free rule-based combination of lexical-semantic features of the words within a sentence (“semantics”), and the contributions made by wider context (“pragmatics”). In psycholinguistics, this distinction has led to the view that listeners initially compute a local, context-independent meaning of a phrase or sentence before relating it to the wider context. An important aspect of such a two-step perspective on interpretation is that local semantics cannot initially be overruled by global contextual factors. In two spoken-language event-related potential experiments, we tested the viability of this claim by examining whether discourse context can overrule the impact of the core lexical-semantic feature animacy, considered to be an innate organizing principle of cognition. Two-step models of interpretation predict that verb-object animacy violations, as in “The girl comforted the clock,” will always perturb the unfolding interpretation process, regardless of wider context. When presented in isolation, such anomalies indeed elicit a clear N400 effect, a sign of interpretive problems. However, when the anomalies were embedded in a supportive context (e.g., a girl talking to a clock about his depression), this N400 effect disappeared completely. Moreover, given a suitable discourse context (e.g., a story about an amorous peanut), animacy-violating predicates (“the peanut was in love”) were actually processed more easily than canonical predicates (“the peanut was salted”). Our findings reveal that discourse context can immediately overrule local lexical-semantic violations, and therefore suggest that language comprehension does not involve an initially context-free semantic analysis.

414 citations


Journal ArticleDOI
TL;DR: A neurocognitive model of online comprehension that accounts for cross-linguistic unity and diversity in the processing of core constituents (verbs and arguments) and can derive the appearance of similar neurophysiological and neuroanatomical processing correlates in seemingly disparate structures in different languages.
Abstract: Real-time language comprehension is a principal cognitive ability and thereby relates to central properties of the human cognitive architecture. Yet how do the presumably universal cognitive and neural substrates of language processing relate to the astounding diversity of human languages (over 5,000)? The authors present a neurocognitive model of online comprehension, the extended argument dependency model (eADM), that accounts for cross-linguistic unity and diversity in the processing of core constituents (verbs and arguments). The eADM postulates that core constituent processing proceeds in three hierarchically organized phases: (1) constituent structure building without relational interpretation, (2) argument role assignment via a restricted set of cross-linguistically motivated information types (e.g., case, animacy), and (3) completion of argument interpretation using information from further domains (e.g., discourse context, plausibility). This basic architecture is assumed to be universal, with cross-linguistic variation deriving primarily from the information types applied in Phase 2 of comprehension. This conception can derive the appearance of similar neurophysiological and neuroanatomical processing correlates in seemingly disparate structures in different languages and, conversely, of cross-linguistic differences in the processing of similar sentence structures.

389 citations


Proceedings Article
04 Dec 2006
TL;DR: This work demonstrates that the trend toward predictability-sensitive syntactic reduction (Jaeger, 2006) is robust in the face of a wide variety of control variables, and presents evidence that speakers use both surface and structural cues for predictability estimation.
Abstract: If language users are rational, they might choose to structure their utterances so as to optimize communicative properties. In particular, information-theoretic and psycholinguistic considerations suggest that this may include maximizing the uniformity of information density in an utterance. We investigate this possibility in the context of syntactic reduction, where the speaker has the option of either marking a higher-order unit (a phrase) with an extra word, or leaving it unmarked. We demonstrate that speakers are more likely to reduce less information-dense phrases. In a second step, we combine a stochastic model of structured utterance production with a logistic-regression model of syntactic reduction to study which types of cues speakers employ when estimating the predictability of upcoming elements. We demonstrate that the trend toward predictability-sensitive syntactic reduction (Jaeger, 2006) is robust in the face of a wide variety of control variables, and present evidence that speakers use both surface and structural cues for predictability estimation.

387 citations
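
The information-density entry above can be illustrated with a small surprisal calculation. This is a sketch under invented numbers, not the authors' model: the probabilities and the function names are assumptions chosen only to show why an optional marker word can flatten a peak in information.

```python
import math

def surprisal(p):
    # Information carried by an event of probability p, in bits.
    return -math.log2(p)

def clause_onset_surprisals(p_clause, p_word_given_clause, marked):
    # Marked: the extra word signals the clause, then the first clause
    # word carries only its own information.
    if marked:
        return [surprisal(p_clause), surprisal(p_word_given_clause)]
    # Unmarked: the first clause word carries both pieces at once.
    return [surprisal(p_clause * p_word_given_clause)]

# Hypothetical probabilities for an unexpected clause.
full = clause_onset_surprisals(0.1, 0.1, marked=True)
reduced = clause_onset_surprisals(0.1, 0.1, marked=False)
```

Both variants convey the same total information, but the marked variant spreads it over two words, so its per-word peak is lower. When the clause is already predictable, the marker adds little, which is the situation in which the abstract reports that speakers tend to reduce.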


Proceedings ArticleDOI
08 Jun 2006
TL;DR: This work uses a target language parser to generate parse trees for each sentence on the target side of the bilingual training corpus, matching them with phrase table lattices built for the corresponding source sentence.
Abstract: We present translation results on the shared task "Exploiting Parallel Texts for Statistical Machine Translation" generated by a chart parsing decoder operating on phrase tables augmented and generalized with target language syntactic categories. We use a target language parser to generate parse trees for each sentence on the target side of the bilingual training corpus, matching them with phrase table lattices built for the corresponding source sentence. Considering phrases that correspond to syntactic categories in the parse trees we develop techniques to augment (declare a syntactically motivated category for a phrase pair) and generalize (form mixed terminal and nonterminal phrases) the phrase table into a synchronous bilingual grammar. We present results on the French-to-English task for this workshop, representing significant improvements over the workshop's baseline system. Our translation system is available open-source under the GNU General Public License.

347 citations


Book ChapterDOI
01 Aug 2006
TL;DR: This chapter outlines the Competition Model and reviews major findings of the research it has inspired, with a focus on sentence comprehension in Japanese and Korean.
Abstract: Introduction. One outgrowth of psycholinguists' increasing attention to languages with various structural features is the Competition Model (CM) of MacWhinney and Bates (1989). Invoking emergentist concepts from functional linguistics and cognitive psychology, this model seeks to integrate the traditions of L1 acquisition, L2 acquisition, and adult processing research without relying on hard-wiring of principles from Universal Grammar. This chapter will outline the model, and then review some of the major findings of research it has inspired, with a focus on sentence comprehension in Japanese and Korean. Outline of the competition model. Cue coalition and competition. Although the Competition Model addresses issues in both production and comprehension, the majority of studies have focused on comprehension, because it is easier to control experimentally. Many of those studies have examined comprehension of simple sentences with two noun phrases and one transitive verb phrase. Others have looked at comprehension of datives (McDonald, 1987), causatives (Sasaki, 1998), relative clauses (MacWhinney & Pleh, 1988), and pronouns (McDonald & MacWhinney, 1995), as well as sentence production (Bates & Devescovi, 1989). In standard CM experiments, participants listen to sentences and then judge which of the two nouns was the actor. Young children do this by selecting between toys, or enacting the scene with them (enactment task). Older children and adults may press a button or name the noun.

266 citations


Proceedings ArticleDOI
17 Jul 2006
TL;DR: A novel reordering model for phrase-based statistical machine translation (SMT) uses a maximum entropy (MaxEnt) model to predict reorderings of neighbor blocks (phrase pairs) and obtains significant improvements in BLEU score on the NIST MT-05 and IWSLT-04 tasks.
Abstract: We propose a novel reordering model for phrase-based statistical machine translation (SMT) that uses a maximum entropy (MaxEnt) model to predict reorderings of neighbor blocks (phrase pairs). The model provides content-dependent, hierarchical phrasal reordering with generalization based on features automatically learned from a real-world bitext. We present an algorithm to extract all reordering events of neighbor blocks from bilingual data. In our experiments on Chinese-to-English translation, this MaxEnt-based reordering model obtains significant improvements in BLEU score on the NIST MT-05 and IWSLT-04 tasks.

264 citations


Journal ArticleDOI
TL;DR: In this article, the authors test the assumption that idioms have their own lexical entry, which is linked to its constituent lemmas (Cutting & Bock, 1997).

244 citations


Proceedings ArticleDOI
22 Jul 2006
TL;DR: SPMT, a new class of statistical translation models that use syntactified target-language phrases, is introduced; the models outperform a state-of-the-art phrase-based baseline on both BLEU and a human-based quality metric.
Abstract: We introduce SPMT, a new class of statistical Translation Models that use Syntactified target language Phrases. The SPMT models outperform a state of the art phrase-based baseline model by 2.64 Bleu points on the NIST 2003 Chinese-English test corpus and 0.28 points on a human-based quality metric that ranks translations on a scale from 1 to 5.

239 citations


Proceedings Article
08 Aug 2006
TL;DR: An extended tree-to-string transducer models syntax-directed translation, with a direct probability model, a linear-time dynamic programming search for the best derivation, and a simple-yet-effective algorithm that generates non-duplicate k-best translations for n-gram rescoring.
Abstract: In syntax-directed translation, the source-language input is first parsed into a parse-tree, which is then recursively converted into a string in the target-language. We model this conversion by an extended tree-to-string transducer that has multi-level trees on the source-side, which gives our system more expressive power and flexibility. We also define a direct probability model and use a linear-time dynamic programming algorithm to search for the best derivation. The model is then extended to the general log-linear framework in order to incorporate other features like n-gram language models. We devise a simple-yet-effective algorithm to generate non-duplicate k-best translations for n-gram rescoring. Preliminary experiments on English-to-Chinese translation show a significant improvement in terms of translation quality compared to a state-of-the-art phrase-based system.

Patent
15 Nov 2006
TL;DR: An exclusionary phrase index is determined for each cluster in a document cluster hierarchy, and representative phrases selected from the indexes may be used as cluster labels in an interactive information exploration interface.
Abstract: Disclosed information exploration system and method embodiments operate on a document set to determine a document cluster hierarchy. An exclusionary phrase index is determined for each cluster, and representative phrases are selected from the indexes. The selection process may enforce pathwise uniqueness and balanced sub-cluster representation. The representative phrases may be used as cluster labels in an interactive information exploration interface.

Patent
26 May 2006
TL;DR: A method and system for authenticating a user based on a recognized spoken phrase, the user's biometric voice print, and a device identifier.
Abstract: A method (700) and system (900) for authenticating a user is provided. The method can include receiving one or more spoken utterances from a user (702), recognizing a phrase corresponding to one or more spoken utterances (704), identifying a biometric voice print of the user from one or more spoken utterances of the phrase (706), determining a device identifier associated with the device (708), and authenticating the user based on the phrase, the biometric voice print, and the device identifier (710). A location of the handset or the user can be employed as criteria for granting access to one or more resources (712).

Journal ArticleDOI
TL;DR: This paper found a significant on-line interaction between syntactic complexity (subject- and object-extracted relative clauses) and similarity between the memory-nouns and the sentence-nouns in the three memory-noun conditions.

Journal ArticleDOI
TL;DR: This paper examined the effect of pronouns that are inconsistent with the bias of a preceding implicit causality verb (e.g., “David praised Linda because he…”), and found that bias-inconsistent pronouns slowed down reading at the two words immediately following the pronoun.

Journal ArticleDOI
TL;DR: Online measures of speech processing are used in a looking-while-listening procedure and suggest familiar frames may enable the infant to 'listen ahead' more efficiently for the focused word at the end of the sentence.
Abstract: In child-directed speech (CDS), adults often use utterances with very few words; many include short, frequently used sentence frames, while others consist of a single word in isolation. Do such features of CDS provide perceptual advantages for the child? Based on descriptive analyses of parental speech, some researchers argue that isolated words should help infants in word recognition by facilitating segmentation, while others predict no advantage. To address this question directly, we used online measures of speech processing in a looking-while-listening procedure. In two experiments, 18-month-olds were presented with familiar object names in isolation and in a sentence frame. Infants were 120 ms slower to interpret target words in isolation than when the same words were preceded by a familiar carrier phrase, suggesting that the sentence frame facilitated word recognition. Familiar frames may enable the infant to ‘listen ahead’ more efficiently for the focused word at the end of the sentence.

Patent
27 Nov 2006
TL;DR: A semantic abstract includes state vectors in a topological vector space, each representing one lexeme or lexeme phrase about the document; the state vectors can also correspond to the words that are most significant to the document's meaning.
Abstract: Codifying the “most prominent measurement points” of a document can be used to measure semantic distances given an area of study (e.g., white papers on some subject area). A semantic abstract is created for each document. The semantic abstract is a semantic measure of the subject or theme of the document providing a new and unique mechanism for characterizing content. The semantic abstract includes state vectors in the topological vector space, each state vector representing one lexeme or lexeme phrase about the document. The state vectors can be dominant phrase vectors in the topological vector space mapped from dominant phrases extracted from the document. The state vectors can also correspond to words in the document that are most significant to the document's meaning (the state vectors are called dominant vectors in this case). One semantic abstract can be directly compared with another semantic abstract, resulting in a numeric semantic distance between the semantic abstracts being compared.

Patent
30 Dec 2006
TL;DR: A method for analyzing a verbal expression for use in a vehicle navigation system, which can include a database having a map with text describing street names and points of interest of the map.
Abstract: In embodiments the present invention includes a method for analyzing a verbal expression for use in a vehicle navigation system. Such a navigation system can include a database having a map with text describing street names and points of interest of the map. The method can include the steps of obtaining from the database text of multisyllable words describing at least one of road names and points of interest of the map, applying a greedy algorithm to the text of the multisyllable words to construct a set of phrases comprising each syllable of the multisyllable words, recording the verbal expression of each phrase of the set of phrases, analyzing each recorded phrase of the set of phrases to define a start and an end of each syllable in each phrase for the set of phrases, forming a separate recording for each syllable as defined by the start and the end of each syllable, and storing in a phrase database each separate recording of a syllable.
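
The abstract above mentions a greedy algorithm that builds a set of phrases containing every syllable, without saying which greedy rule is used. One plausible reading is a greedy set cover, sketched here. The function name, the selection rule, and the syllable splits in the example are all assumptions made for illustration.

```python
def greedy_phrase_set(phrases):
    """Pick phrases until every syllable occurs in a chosen phrase.

    phrases: dict mapping phrase text to its list of syllables.
    """
    uncovered = set()
    for syllables in phrases.values():
        uncovered.update(syllables)
    chosen = []
    while uncovered:
        # Greedy step: take the phrase covering the most uncovered syllables.
        best = max(phrases, key=lambda name: len(uncovered & set(phrases[name])))
        chosen.append(best)
        uncovered -= set(phrases[best])
    return chosen

# Hypothetical map entries with hand-written syllable splits.
street_names = {
    "Main Street": ["main", "street"],
    "Maple Street": ["ma", "ple", "street"],
    "Maple Avenue": ["ma", "ple", "av", "e", "nue"],
}
to_record = greedy_phrase_set(street_names)
```

In this example two recordings, "Maple Avenue" and "Main Street", cover all seven syllables, so "Maple Street" never needs to be recorded.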

Journal ArticleDOI
TL;DR: Findings support the theory that the cost of syntactic processing on a verb is influenced by the precise thematic relationships between that verb and its preceding arguments.
Abstract: Event-related potentials were measured as subjects read sentences presented word by word. A small N400 and a robust P600 effect were elicited by verbs that assigned the thematic role of Agent to their preceding noun-phrase argument when this argument was inanimate in nature. The amplitude of the P600, but not the N400, was modulated by the transitivity of the critical verbs and by plausibility ratings of passivised versions of these sentences (reflecting the fit between the critical verb and the inanimate noun-phrase as the verb's Theme). The P600 was similar in scalp distribution although smaller in amplitude, than that elicited by verbs with morphosyntactic violations. Pragmatically unlikely verbs that did not violate thematic constraints elicited a larger N400 but no P600 effect. These findings support the theory that the cost of syntactic processing on a verb is influenced by the precise thematic relationships between that verb and its preceding arguments.

Journal ArticleDOI
TL;DR: The ERP results support the view that language perception involves a monitoring process, reflected by the P600, that is triggered whenever a conflict between a strong tendency to accept a word and a tendency to reject it brings the cognitive system into a state of indecision.

Journal ArticleDOI
TL;DR: This article examined the role of punctuation in reading and found that first pass times were longer at the end of comma-marked clauses than clauses without a comma (or the same material in clause medial position).

Proceedings ArticleDOI
Yaser Al-Onaizan, Kishore Papineni
17 Jul 2006
TL;DR: A new distortion model is proposed that can be used with existing phrase-based SMT decoders to address n-gram language model limitations and a novel metric to measure word order similarity (or difference) between any pair of languages based on word alignments is proposed.
Abstract: In this paper, we argue that n-gram language models are not sufficient to address word reordering required for Machine Translation. We propose a new distortion model that can be used with existing phrase-based SMT decoders to address those n-gram language model limitations. We present empirical results in Arabic to English Machine Translation that show statistically significant improvements when our proposed model is used. We also propose a novel metric to measure word order similarity (or difference) between any pair of languages based on word alignments.

Proceedings ArticleDOI
23 Oct 2006
TL;DR: This paper draws an analogy between image retrieval and text retrieval, proposes a visual phrase-based approach to retrieving images that contain desired objects, and devises methods for constructing visual phrases from images and encoding them for indexing and retrieval.
Abstract: In this paper, we draw an analogy between image retrieval and text retrieval and propose a visual phrase-based approach to retrieve images containing desired objects. The visual phrase is defined as a pair of adjacent local image patches and is constructed using data mining. We devise methods on how to construct visual phrases from images and how to encode the visual phrase for indexing and retrieval. Our experiments demonstrate that visual phrase-based retrieval approach can be very efficient and can be 20% more effective than its visual word-based counterpart.
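
The abstract above defines a visual phrase as a pair of adjacent local image patches constructed using data mining. The following is a minimal sketch of that idea, not the authors' method. It assumes each patch is a `(visual_word_id, x, y)` tuple, that "adjacent" means within a pixel `radius`, and that mining means keeping pairs found in at least `min_support` images; all of these names and choices are illustrative.

```python
import math
from collections import Counter
from itertools import combinations

def image_phrases(patches, radius):
    # Unordered pairs of visual words whose patches lie within radius.
    phrases = set()
    for (w1, x1, y1), (w2, x2, y2) in combinations(patches, 2):
        if math.hypot(x1 - x2, y1 - y2) <= radius:
            phrases.add(tuple(sorted((w1, w2))))
    return phrases

def frequent_phrases(images, radius, min_support):
    # Count each phrase once per image, then keep the frequent ones.
    counts = Counter()
    for patches in images:
        counts.update(image_phrases(patches, radius))
    return {phrase for phrase, n in counts.items() if n >= min_support}

# Hypothetical patches from three images.
images = [
    [(1, 0, 0), (2, 1, 0), (3, 10, 10)],
    [(2, 5, 5), (1, 5, 6)],
    [(1, 0, 0), (3, 0, 1)],
]
vocabulary = frequent_phrases(images, radius=2, min_support=2)
```

Here the pair of visual words 1 and 2 is adjacent in two images and survives as a visual phrase, while the pair 1 and 3 is adjacent in only one image and is dropped.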

Patent
09 Feb 2006
TL;DR: A computer-implemented method, system, and program for checking text in an electronic document: words in the text are scanned and parsed, and a determination is made of whether one or more words form a contact phrase providing information to identify or address a person or entity.
Abstract: Provided is a computer implemented method, system, and program for checking text in an electronic document. Words in the text are scanned and parsed. For each set of one or more scanned and parsed words, a determination is made of whether one or more words form a contact phrase providing information to identify or address a person or entity. After one contact phrase is scanned, contact information is accessed including contact phrases. The contact information is searched to determine if the scanned contact phrase matches contact phrases in the searched contact information.

Journal ArticleDOI
TL;DR: The data establish that a pivotal syntactic planning process is affected by verbal working memory limitations, and constrain existing proposals about the role of working memory in language production.
Abstract: In order to study the role of working memory in sentence formulation, we elicited errors of subject-verb agreement in spoken sentence completion, while speakers did or did not maintain an extrinsic memory load (a word list). We compared participants with low and high speaking spans (a measure of verbal working memory for sentence production). As in previous studies, agreement errors occurred more frequently for sentence fragments with a singular subject noun and a plural noun than for corresponding fragments in which both nouns were singular. Agreement errors also occurred more frequently when the fragment had a distributive interpretation, so that the conceptual number of the subject mismatched its grammatical number, than when the fragment was not distributive. Importantly, there were effects of memory span and of memory load, and these variables interacted: Load affected only low-span speakers. Distributivity did not interact with either load or span. These data establish that a pivotal syntactic planning process is affected by verbal working memory limitations. As such, they constrain existing proposals about the role of working memory in language production.

01 Jan 2006
TL;DR: This paper explores the part of the prosodic hierarchy that falls between the prosodic word and the phonological phrase, and develops a framework that reduces the types of genuine prosodic categories while making systematic use of adjunction structures and concomitant functional notions like maximal and minimal instantiations of categories.
Abstract: This paper explores a particular part of the prosodic hierarchy—the area falling between the prosodic word and the phonological phrase. It develops a framework that reduces the types of genuine prosodic categories while at the same time making systematic use of adjunction structures and concomitant functional notions like maximal and minimal instantiations of categories. A detailed analysis of the prosodic typology of compounds in Japanese suggests that the theory maintains enough flexibility to distinguish what needs to be distinguished but avoids multiplying prosodic categories beyond necessity.

Journal ArticleDOI
TL;DR: It is proposed that a processing theory, together with a syntactic account, does a better job of describing and explaining the data on verb phrase-ellipsis.

Patent
20 Apr 2006
TL;DR: An author-centric search facilitates identifying a source commonly associated with a topic by providing a ranked listing of experts in a field of knowledge related to a search phrase.
Abstract: Disclosed is an author-centric search that facilitates identifying a source commonly associated with a topic by, for example, providing a ranked listing of experts in a field of knowledge related to a search phrase. The search phrase can be captured and parsed into the individual words (e.g., substrings) of the search phrase. Based on occurrences of the words in one or more documented communications, statistics can be generated to determine the relevancy of each documented communication in relation to the search phrase. Further, additional statistics can be generated describing the occurrence of multiple words in a documented communication and/or a distance of words between the search phrase words in a documented communication. The statistics can be utilized to generate expert scores. The expert scores can be sorted for and/or displayed to the user.
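
The steps in the abstract above (parse the search phrase into words, score each documented communication, aggregate into expert scores, sort) can be sketched as follows. The scoring formula is an assumption: the abstract does not give one, so this sketch counts word occurrences and adds a small bonus when several search words co-occur. The word-distance statistic is left out, and the function names are hypothetical.

```python
from collections import defaultdict

def relevancy(query_words, text):
    tokens = text.lower().split()
    # Occurrences of each search word in the communication.
    hits = sum(tokens.count(w) for w in query_words)
    # Assumed bonus when more than one distinct search word appears.
    distinct = sum(1 for w in query_words if w in tokens)
    return hits + (distinct - 1 if distinct > 1 else 0)

def rank_experts(search_phrase, communications):
    """communications: iterable of (author, text) pairs."""
    query_words = search_phrase.lower().split()
    scores = defaultdict(float)
    for author, text in communications:
        scores[author] += relevancy(query_words, text)
    # Highest expert score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical documented communications.
messages = [
    ("ana", "phrase table pruning for phrase based translation"),
    ("bo", "lunch menu"),
    ("ana", "translation notes"),
    ("cy", "phrase structure"),
]
ranking = rank_experts("phrase translation", messages)
```

Summing per-communication relevancy by author is one simple way to turn document statistics into an expert score; a weighting that normalizes by message length would be an equally reasonable design choice.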

01 Jan 2006
TL;DR: A novel sentence segmentation method which is specifically tailored to the requirements of machine translation algorithms and is competitive with state-of-the-art approaches for detecting sentence-like units is presented.
Abstract: This paper studies the impact of automatic sentence segmentation and punctuation prediction on the quality of machine translation of automatically recognized speech. We present a novel sentence segmentation method which is specifically tailored to the requirements of machine translation algorithms and is competitive with state-of-the-art approaches for detecting sentence-like units. We also describe and compare three strategies for predicting punctuation in a machine translation framework, including the simple and effective implicit punctuation generation by a statistical phrase-based machine translation system. Our experiments show the robust performance of the proposed sentence segmentation and punctuation prediction approaches on the IWSLT Chinese-to-English and TC-STAR English-to-Spanish speech translation tasks in terms of translation quality.

Journal ArticleDOI
TL;DR: This article examined the question of how indeterminate phrases in Japanese associate with relevant particles higher in the structure and proposed a straightforward uniform picture of the syntax-semantics mapping of the universal construction and wh-questions, building upon Hamblin semantics for wh-phrases as sets of alternatives.
Abstract: This paper examines the question of how so-called indeterminate phrases in Japanese (Kuroda 1965) associate with relevant particles higher in the structure. In the universal construction in Japanese, the restrictor (provided by an indeterminate phrase) sometimes appears to be separate from the universal particle mo. It is proposed that quantification at a distance is only apparent, and that the restriction is in fact provided locally by the sister constituent of mo as a whole. The proposal leads us to a straightforward uniform picture of the syntax-semantics mapping of the universal construction and wh-questions, building upon Hamblin’s (1973) semantics for wh-phrases as sets of alternatives. It allows for a switch of perspective on a long-standing puzzle regarding locality effects in the indeterminate–particle association by deriving the locality pattern from the way indeterminate phrases are interpreted and associated with particles, without any stipulations.