
Showing papers on "Natural language published in 2007"


Proceedings Article
06 Jan 2007
TL;DR: This work proposes Explicit Semantic Analysis (ESA), a novel method that represents the meaning of texts in a high-dimensional space of concepts derived from Wikipedia, yielding substantial improvements in the correlation of computed relatedness scores with human judgments.
Abstract: Computing semantic relatedness of natural language texts requires access to vast amounts of common-sense and domain-specific world knowledge. We propose Explicit Semantic Analysis (ESA), a novel method that represents the meaning of texts in a high-dimensional space of concepts derived from Wikipedia. We use machine learning techniques to explicitly represent the meaning of any text as a weighted vector of Wikipedia-based concepts. Assessing the relatedness of texts in this space amounts to comparing the corresponding vectors using conventional metrics (e.g., cosine). Compared with the previous state of the art, using ESA results in substantial improvements in correlation of computed relatedness scores with human judgments: from r = 0.56 to 0.75 for individual words and from r = 0.60 to 0.72 for texts. Importantly, due to the use of natural concepts, the ESA model is easy to explain to human users.

2,285 citations
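The comparison step the abstract describes, cosine similarity between weighted concept vectors, can be sketched in a few lines. The concept names and weights below are invented stand-ins for Wikipedia-derived vectors, not values from the paper:

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts: concept -> weight)."""
    dot = sum(w * v.get(c, 0.0) for c, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# invented Wikipedia-concept weights for three input words
cat = {"Cat": 0.9, "Pet": 0.7, "Mammal": 0.5}
dog = {"Dog": 0.9, "Pet": 0.8, "Mammal": 0.6}
moon = {"Moon": 0.9, "Astronomy": 0.7}

# the semantically related pair shares concepts, so it scores higher
assert cosine(cat, dog) > cosine(cat, moon)
```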


Book
01 Apr 2007
TL;DR: The authors describe algorithms in a C-like language, with examples drawn from automatic natural language processing, molecular sequence analysis, and textual database management.
Abstract: Describing algorithms in a C-like language, this text presents examples related to the automatic processing of natural language, to the analysis of molecular sequences and to the management of textual databases.

686 citations


Proceedings Article
01 Dec 2007
TL;DR: This paper defines the tasks of the different tracks, describes how the data sets were created from existing treebanks for ten languages, characterizes the approaches of the participating systems, reports the test results, and provides a first analysis of these results.
Abstract: The Conference on Computational Natural Language Learning features a shared task, in which participants train and test their learning systems on the same data sets. In 2007, as in 2006, the shared task has been devoted to dependency parsing, this year with both a multilingual track and a domain adaptation track. In this paper, we define the tasks of the different tracks and describe how the data sets were created from existing treebanks for ten languages. In addition, we characterize the different approaches of the participating systems, report the test results, and provide a first analysis of these results.

606 citations


Journal ArticleDOI
TL;DR: It is argued that the social brain 'gates' the computational mechanisms involved in human language learning.
Abstract: I advance the hypothesis that the earliest phases of language acquisition, the developmental transition from an initial universal state of language processing to one that is language-specific, require social interaction. Relating human language learning to a broader set of neurobiological cases of communicative development, I argue that the social brain 'gates' the computational mechanisms involved in human language learning.

574 citations


Journal ArticleDOI
TL;DR: The authors used simple convolution and superposition mechanisms to learn distributed holographic representations for words, which can be used for higher order models of language comprehension, relieving the complexity required at the higher level.
Abstract: The authors present a computational model that builds a holographic lexicon representing both word meaning and word order from unsupervised experience with natural language. The model uses simple convolution and superposition mechanisms (cf. B. B. Murdock, 1982) to learn distributed holographic representations for words. The structure of the resulting lexicon can account for empirical data from classic experiments studying semantic typicality, categorization, priming, and semantic constraint in sentence completions. Furthermore, order information can be retrieved from the holographic representations, allowing the model to account for limited word transitions without the need for built-in transition rules. The model demonstrates that a broad range of psychological data can be accounted for directly from the structure of lexical representations learned in this way, without the need for complexity to be built into either the processing mechanisms or the representations. The holographic representations are an appropriate knowledge representation to be used by higher order models of language comprehension, relieving the complexity required at the higher level.

549 citations
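The binding-by-convolution and superposition mechanisms described above can be sketched numerically. This is a minimal illustration of the general idea (cf. BEAGLE-style models), with invented words, an arbitrary dimensionality, and a placeholder vector assumed for the order trace:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 512

def rand_vec():
    # random environmental vector with expected unit length
    return rng.normal(0.0, 1.0 / np.sqrt(dim), dim)

def cconv(a, b):
    # circular convolution (the binding operation), computed via FFT
    return np.real(np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)))

def cos(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

dog, bark, phi = rand_vec(), rand_vec(), rand_vec()

# memory vector for "dog": its meaning superposed with an order trace
# binding a neighbouring word to a placeholder vector phi
mem_dog = dog + cconv(bark, phi)

# the superposed trace stays similar to its component vector...
assert cos(mem_dog, dog) > 0.5
# ...but is nearly orthogonal to an unrelated random vector
assert abs(cos(mem_dog, rand_vec())) < 0.3
```

The key property is that superposition preserves similarity to each component while bound pairs stay distinguishable from unrelated vectors, which is what lets both meaning and order live in one lexicon vector.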


Proceedings ArticleDOI
24 May 2007
TL;DR: This work investigates using natural language processing (NLP) techniques to identify duplicates in defect reports at Sony Ericsson mobile communications, and shows that about 2/3 of the duplicates can possibly be found using the NLP techniques.
Abstract: Defect reports are generated from various testing and development activities in software engineering. Sometimes two reports are submitted that describe the same problem, leading to duplicate reports. These reports are mostly written in structured natural language, and as such, it is hard to compare two reports for similarity with formal methods. In order to identify duplicates, we investigate using natural language processing (NLP) techniques to support the identification. A prototype tool is developed and evaluated in a case study analyzing defect reports at Sony Ericsson mobile communications. The evaluation shows that about 2/3 of the duplicates can possibly be found using the NLP techniques. Different variants of the techniques provide only minor result differences, indicating a robust technology. User testing shows that the overall attitude towards the technique is positive and that it has a growth potential.

535 citations
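The core of such duplicate detection is a textual-similarity measure over report pairs. A naive bag-of-words cosine, a much simpler stand-in for the paper's NLP pipeline, illustrates the idea; the report texts are invented:

```python
from collections import Counter
import math
import re

def vectorize(text):
    """Bag-of-words term counts for a report's free text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def similarity(a, b):
    """Cosine similarity between two reports' term-count vectors."""
    va, vb = vectorize(a), vectorize(b)
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

r1 = "Phone crashes when opening the camera app"
r2 = "Camera app crash on opening the phone camera"
r3 = "Battery drains quickly in standby mode"

# the duplicate pair shares vocabulary, so it scores higher
assert similarity(r1, r2) > similarity(r1, r3)
```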


Book
03 May 2007
TL;DR: From Corpus to Classroom summarises and makes accessible recent work in corpus research, focusing particularly on spoken data, based on analysis of corpora such as CANCODE and Cambridge International Corpus.
Abstract: From Corpus to Classroom summarises and makes accessible recent work in corpus research, focusing particularly on spoken data. It is based on analysis of corpora such as CANCODE and the Cambridge International Corpus, and written with particular reference to the development of corpus-informed pedagogy. The book explains how corpora can be designed and used, and focuses on what they tell us about language teaching. It examines the relevance of corpora to materials writers, course designers and language teachers and considers the needs of the learner in relation to authentic data. It shows how the answers to key questions such as 'Is there a basic, everyday vocabulary for English?', 'How should idioms be taught?' and 'What are the most common spoken language chunks?' are best explored by means of a clearer understanding of the workings of language in context.

492 citations


Patent
11 Dec 2007
TL;DR: In this paper, a conversational, natural language voice user interface may provide an integrated voice navigation services environment, where the user can speak conversationally, using natural language, to issue queries, commands, or other requests relating to the navigation services provided in the environment.
Abstract: A conversational, natural language voice user interface may provide an integrated voice navigation services environment. The voice user interface may enable a user to make natural language requests relating to various navigation services, and further, may interact with the user in a cooperative, conversational dialogue to resolve the requests. Through dynamic awareness of context, available sources of information, domain knowledge, user behavior and preferences, and external systems and devices, among other things, the voice user interface may provide an integrated environment in which the user can speak conversationally, using natural language, to issue queries, commands, or other requests relating to the navigation services provided in the environment.

450 citations


Book Chapter
13 Dec 2007
TL;DR: This paper describes a general methodology for rapidly collecting, building, and aligning parallel corpora for medium density languages, illustrated on the case of Hungarian, Romanian, and Slovenian.
Abstract: The choice of natural language technology appropriate for a given language is greatly impacted by density (availability of digitally stored material). More than half of the world speaks medium density languages, yet many of the methods appropriate for high or low density languages yield suboptimal results when applied to the medium density case. In this paper we describe a general methodology for rapidly collecting, building, and aligning parallel corpora for medium density languages, illustrating our main points on the case of Hungarian, Romanian, and Slovenian. We also describe and evaluate the hybrid sentence alignment method we are using.
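Sentence alignment commonly starts from a length-based score of the kind hybrid methods combine with lexical cues. A toy length-ratio penalty sketches the idea; the sentence pair below is invented and the scoring function is illustrative, not the paper's actual method:

```python
def length_score(src, tgt):
    """Penalty for aligning two sentences, based on character-length ratio:
    0.0 for equal lengths, approaching 1.0 for very different lengths."""
    ls, lt = len(src), len(tgt)
    return abs(ls - lt) / max(ls, lt)

src = "A kutya ugat."          # Hungarian: "The dog barks."
good = "The dog barks."
bad = "This is a much longer, completely unrelated sentence."

# the true translation pair has similar length, so a lower penalty
assert length_score(src, good) < length_score(src, bad)
```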

447 citations


Journal ArticleDOI
TL;DR: The conclusions drawn are that automatic summarisation has made valuable progress, with useful applications, better evaluation, and more task understanding, but summarising systems are still poorly motivated in relation to the factors affecting them, and evaluation needs taking much further to engage with the purposes summaries are intended to serve.
Abstract: This paper reviews research on automatic summarising in the last decade. This work has grown, stimulated by technology and by evaluation programmes. The paper uses several frameworks to organise the review, for summarising itself, for the factors affecting summarising, for systems, and for evaluation. The review examines the evaluation strategies applied to summarising, the issues they raise, and the major programmes. It considers the input, purpose and output factors investigated in recent summarising research, and discusses the classes of strategy, extractive and non-extractive, that have been explored, illustrating the range of systems built. The conclusions drawn are that automatic summarisation has made valuable progress, with useful applications, better evaluation, and more task understanding. But summarising systems are still poorly motivated in relation to the factors affecting them, and evaluation needs taking much further to engage with the purposes summaries are intended to serve and the contexts in which they are used.

Journal ArticleDOI
01 Mar 2007-Lingua
TL;DR: This paper explores the possibility that the linguistic forms and structures employed by our earliest language-using ancestors might have been significantly different from those observed in the languages we are most familiar with today, not because of a biological difference between them and us, but because the communicative context in which they operated was fundamentally different from that of most modern humans.


Proceedings ArticleDOI
14 Mar 2007
TL;DR: A semi-automated concern location and comprehension tool designed to reduce the time developers spend on maintenance tasks and to increase their confidence in the results of these tasks, which is effective because it searches a unique natural language-based representation of source code.
Abstract: Most current software systems contain undocumented high-level ideas implemented across multiple files and modules. When developers perform program maintenance tasks, they often waste time and effort locating and understanding these scattered concerns. We have developed a semi-automated concern location and comprehension tool, Find-Concept, designed to reduce the time developers spend on maintenance tasks and to increase their confidence in the results of these tasks. Find-Concept is effective because it searches a unique natural language-based representation of source code, uses novel techniques to expand initial queries into more effective queries, and displays search results in an easy-to-comprehend format. We describe the Find-Concept tool, the underlying program analysis, and an experimental study comparing Find-Concept's search effectiveness with two state-of-the-art lexical and information retrieval-based search tools. Across nine action-oriented concern location tasks derived from open source bug reports, our Eclipse-based tool produced more effective queries more consistently than either competing search tool with similar user effort.

Journal ArticleDOI
TL;DR: In this article, the authors focus on international periodical publications where more than 75 percent of the articles in the social sciences and humanities and well over 90 percent in the natural sciences are written in English.
Abstract: Throughout the 20th century, international communication has shifted from a plural use of several languages to a clear pre-eminence of English, especially in the field of science. This paper focuses on international periodical publications where more than 75 percent of the articles in the social sciences and humanities and well over 90 percent in the natural sciences are written in English. The shift towards English implies that an increasing number of scientists whose mother tongue is not English have already moved to English for publication. Consequently, other international languages, namely French, German, Russian, Spanish and Japanese lose their attraction as languages of science. Many observers conclude that it has become inevitable to publish in English, even in English only. The central question is whether the actual hegemony of English will create a total monopoly, at least at an international level, or if changing global conditions and language policies may allow alternative solutions. The paper analyses how the conclusions of an inevitable monopoly of English are constructed, and what possible disadvantages such a process might entail. Finally, some perspectives of a new plurilingual approach in scientific production and communication are sketched.

Proceedings Article
01 Apr 2007
TL;DR: This work evaluates a system that uses interpolated predictions of reading difficulty based on both vocabulary and grammatical features, and indicates that grammatical features may play a more important role in second language readability than in first language readability.
Abstract: This work evaluates a system that uses interpolated predictions of reading difficulty that are based on both vocabulary and grammatical features. The combined approach is compared to individual grammar- and language modeling-based approaches. While the vocabulary-based language modeling approach outperformed the grammar-based approach, grammar-based predictions can be combined using confidence scores with the vocabulary-based predictions to produce more accurate predictions of reading difficulty for both first and second language texts. The results also indicate that grammatical features may play a more important role in second language readability than in first language readability.
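The confidence-weighted combination of the two predictors can be sketched as a simple interpolation. The prediction values and confidence scores below are hypothetical, chosen only to show the mechanics:

```python
def interpolate(vocab_pred, vocab_conf, grammar_pred, grammar_conf):
    """Combine two difficulty predictions, weighted by confidence scores."""
    total = vocab_conf + grammar_conf
    return (vocab_pred * vocab_conf + grammar_pred * grammar_conf) / total

# hypothetical grade-level predictions from the two component models
level = interpolate(vocab_pred=6.0, vocab_conf=0.8,
                    grammar_pred=8.0, grammar_conf=0.2)
assert abs(level - 6.4) < 1e-9
```

When the vocabulary model is confident, its prediction dominates; when it is uncertain, the grammar-based prediction pulls the estimate toward its own value.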

Journal ArticleDOI
TL;DR: A striking finding is reported: Infants are better able to extract rules from sequences of nonspeech—such as sequences of musical tones, animal sounds, or varying timbres—if they first hear those rules instantiated in sequences of speech.
Abstract: Sequences of speech sounds play a central role in human cognitive life, and the principles that govern such sequences are crucial in determining the syntax and semantics of natural languages. Infants are capable of extracting both simple transitional probabilities and simple algebraic rules from sequences of speech, as demonstrated by studies using ABB grammars (la ta ta, gai mu mu, etc.). Here, we report a striking finding: Infants are better able to extract rules from sequences of nonspeech—such as sequences of musical tones, animal sounds, or varying timbres—if they first hear those rules instantiated in sequences of speech.

Journal Article
TL;DR: This paper explores the implementation of SemEQUAL using OrdPath, a positional representation for nodes of a hierarchy that is used successfully for supporting XML documents in relational systems, and proposes the use of OrdPath to represent position within the WordNet hierarchy, leveraging its ability to compute transitive closures efficiently.
Abstract: The volume of information in natural languages in electronic format is increasing exponentially. The demographics of users of information management systems are becoming increasingly multilingual. Together these trends create a requirement for information management systems to support processing of information in multiple natural languages seamlessly. Database systems, the backbones of information management, should support this requirement effectively and efficiently. Earlier research in this area had proposed multilingual operators [7, 8] for relational database systems, and discussed their implementation using existing database features. In this paper, we specifically focus on the SemEQUAL operator [8], implementing a multilingual semantic matching predicate using WordNet [12]. We explore the implementation of SemEQUAL using OrdPath [10], a positional representation for nodes of a hierarchy that is used successfully for supporting XML documents in relational systems. We propose the use of OrdPath to represent position within the WordNet hierarchy, leveraging its ability to compute transitive closures efficiently. We show theoretically that an implementation using OrdPath will outperform those implementations proposed previously. Our initial experimental results confirm this analysis, and show that the OrdPath implementation performs significantly better. Further, since our technique is not specifically rooted to linguistic hierarchies, the same approach may benefit other applications that utilize alternative hierarchical ontologies.
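The property that makes positional labels attractive here is that ancestry reduces to a prefix test, so transitive closure needs no recursion. Real OrdPath uses a compressed binary encoding; this sketch keeps only that property, using tuple labels and an invented miniature hierarchy:

```python
# each node's label is its parent's label plus one component, so
# "is-a ancestor of" becomes a prefix test on the labels
def is_ancestor(anc, node):
    """True if anc is a strict ancestor of node in the hierarchy."""
    return node[:len(anc)] == anc and len(node) > len(anc)

# toy hierarchy: entity > animal > dog; entity > vehicle
entity = (1,)
animal = (1, 3)
dog = (1, 3, 5)
vehicle = (1, 7)

assert is_ancestor(entity, dog)       # transitive, no recursive query needed
assert is_ancestor(animal, dog)
assert not is_ancestor(vehicle, dog)
```

A SemEQUAL-style subsumption check then becomes a single range or prefix comparison per candidate pair, which is what makes the relational implementation fast.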

Book ChapterDOI
03 Jun 2007
TL;DR: PANTO is presented, a Portable nAtural laNguage inTerface to Ontologies, which accepts generic natural language queries and outputs SPARQL queries, and adopts a triple-based data model to interpret the parse trees output by an off-the-shelf parser.
Abstract: Providing a natural language interface to ontologies will not only offer ordinary users the convenience of acquiring needed information from ontologies, but also expand the influence of ontologies and the semantic web consequently. This paper presents PANTO, a Portable nAtural laNguage inTerface to Ontologies, which accepts generic natural language queries and outputs SPARQL queries. Based on a special consideration on nominal phrases, it adopts a triple-based data model to interpret the parse trees output by an off-the-shelf parser. Complex modifications in natural language queries such as negations, superlative and comparative are investigated. The experiments have shown that PANTO provides state-of-the-art results.
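The final step of a PANTO-style pipeline, serializing an interpreted triple as a SPARQL query, can be sketched directly. The variable name, predicate URI, and literal below are invented for illustration; the real system handles far richer structures (negation, superlatives, comparatives):

```python
def triple_to_sparql(subj_var, pred_uri, obj_literal):
    """Serialize one (subject-variable, predicate, object-literal) triple
    as a minimal SPARQL SELECT query."""
    return (
        f"SELECT ?{subj_var} WHERE {{ "
        f'?{subj_var} <{pred_uri}> "{obj_literal}" . }}'
    )

# e.g. for the question "Which cities are located in Germany?"
q = triple_to_sparql("city", "http://example.org/onto#locatedIn", "Germany")
assert q == 'SELECT ?city WHERE { ?city <http://example.org/onto#locatedIn> "Germany" . }'
```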

Journal ArticleDOI
TL;DR: Investigation of the development of early language and literacy skills among Spanish-speaking students in 2 large urban school districts, 1 middle-size urban district, and 1 border district suggests that pedagogical decisions for ELLs should not only consider effective instructional literacy strategies but also acknowledge that the language of instruction for Spanish-speaking ELLs may produce varying results for different students.
Abstract: Purpose The purpose of this study was to examine the effects of initial first and second language proficiencies as well as the language of instruction that a student receives on the relationship be...

Proceedings ArticleDOI
06 Jul 2007
TL;DR: This work describes an execution strategy based on translation to datalog with constraints, and table-based resolution that is sound, complete, and always terminates, despite recursion and negation, as long as simple syntactic conditions are met.
Abstract: We present a declarative authorization language that strikes a careful balance between syntactic and semantic simplicity, policy expressiveness, and execution efficiency. The syntax is close to natural language, and the semantics consists of just three deduction rules. The language can express many common policy idioms using constraints, controlled delegation, recursive predicates, and negated queries. We describe an execution strategy based on translation to datalog with constraints, and table-based resolution. We show that this execution strategy is sound, complete, and always terminates, despite recursion and negation, as long as simple syntactic conditions are met.
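Deduction over such policies amounts to computing a fixpoint of the rules over a fact base. The toy forward-chaining sketch below uses invented relations (it is not the paper's language or its datalog translation) to show how controlled delegation terminates at a fixpoint:

```python
# rule (invented for illustration): if P is trusted on "read" and P grants
# read to Q, then Q can read and may in turn delegate read
facts = {
    ("admin", "trusted_on", "read"),
    ("admin", "grants_read", "alice"),
    ("alice", "grants_read", "bob"),
}

def closure(facts):
    """Forward-chain to a fixpoint; terminates because the fact set over a
    finite vocabulary only grows."""
    derived = set(facts)
    while True:
        new = set()
        for (p, rel, q) in derived:
            if rel == "grants_read" and (p, "trusted_on", "read") in derived:
                new.add((q, "can", "read"))
                new.add((q, "trusted_on", "read"))  # controlled delegation
        if new <= derived:  # fixpoint reached
            return derived
        derived |= new

db = closure(facts)
# delegation chains through alice to bob
assert ("bob", "can", "read") in db
```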

Book ChapterDOI
11 Nov 2007
TL;DR: This paper introduces four interfaces, each allowing a different query language, and presents a usability study benchmarking them; the results confirm that NLIs are useful for querying Semantic Web data.
Abstract: Natural language interfaces offer end-users a familiar and convenient option for querying ontology-based knowledge bases. Several studies have shown that they can achieve high retrieval performance as well as domain independence. This paper focuses on usability and investigates if NLIs are useful from an end-user's point of view. To that end, we introduce four interfaces each allowing a different query language and present a usability study benchmarking these interfaces. The results of the study reveal a clear preference for full sentences as query language and confirm that NLIs are useful for querying Semantic Web data.

Journal ArticleDOI
TL;DR: A method to systematically bridge the disparate biology and engineering domains using natural language analysis is described, able to algorithmically generate several biologically meaningful keywords, including defend, that are not obviously related to the engineering problem.
Abstract: Biomimetic, or biologically inspired, design uses analogous biological phenomena to develop solutions for engineering problems. Several instances of biomimetic design result from personal observations of biological phenomena. However, many engineers' knowledge of biology may be limited, thus reducing the potential of biologically inspired solutions. Our approach to biomimetic design takes advantage of the large amount of biological knowledge already available in books, journals, and so forth, by performing keyword searches on these existing natural-language sources. Because of the ambiguity and imprecision of natural language, challenges inherent to natural language processing were encountered. One challenge of retrieving relevant cross-domain information involves differences in domain vocabularies, or lexicons. A keyword meaningful to biologists may not occur to engineers. For an example problem that involved cleaning, that is, removing dirt, a biochemist suggested the keyword "defend." Defend is not an obvious keyword to most engineers for this problem, nor are the words defend and "clean/remove" directly related within lexical references. However, previous work showed that biological phenomena retrieved by the keyword defend provided useful stimuli and produced successful concepts for the clean/remove problem. In this paper, we describe a method to systematically bridge the disparate biology and engineering domains using natural language analysis. For the clean/remove example, we were able to algorithmically generate several biologically meaningful keywords, including defend, that are not obviously related to the engineering problem. We developed a method to organize and rank the set of biologically meaningful keywords identified, and confirmed that we could achieve similar results for two other examples in encapsulation and microassembly.
Although we specifically address cross-domain information retrieval from biology, the bridging process presented in this paper is not limited to biology, and can be used for any other domain given the availability of appropriate domain-specific knowledge sources and references.
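One way to bridge lexicons is to search for chains of related words between the engineering term and candidate biology keywords. The breadth-first search below runs over a tiny hand-invented relation graph (toy data, not the paper's lexical references) purely to illustrate the bridging idea:

```python
from collections import deque

# hypothetical word-relation graph; edges are invented for illustration
related = {
    "clean": ["remove", "wash"],
    "remove": ["eliminate", "rid"],
    "rid": ["defend", "free"],
}

def bridge(start, goal):
    """Breadth-first search for a chain of related words from start to goal."""
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in related.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no chain found

assert bridge("clean", "defend") == ["clean", "remove", "rid", "defend"]
```

In this toy graph, the non-obvious biology keyword "defend" is reachable from the engineering term "clean" in three hops, mirroring the cross-domain connection the paper reports.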

Book
23 Mar 2007
TL;DR: The International Guide to Speech Acquisition.
Abstract: The International Guide to Speech Acquisition.

Book ChapterDOI
12 Sep 2007
TL;DR: The developed Affect Analysis Model was designed to handle not only correctly written text but also informal messages written in an abbreviated or expressive manner, and an avatar was created to reflect the detected affective information and social behaviour.
Abstract: In this paper, we address the tasks of recognition and interpretation of affect communicated through text messaging. The evolving nature of language in online conversations is a main issue in affect sensing from this media type, since sentence parsing might fail during syntactic structure analysis. The developed Affect Analysis Model was designed to handle not only correctly written text, but also informal messages written in an abbreviated or expressive manner. The proposed rule-based approach processes each sentence in sequential stages, including symbolic cue processing, detection and transformation of abbreviations, sentence parsing, and word/phrase/sentence-level analyses. In a study based on 160 sentences, the system result agrees with at least two out of three human annotators in 70% of the cases. In order to reflect the detected affective information and social behaviour, an avatar was created.

Reference EntryDOI
01 Jun 2007
TL;DR: Major theories of language acquisition are presented, along with the basic facts of language development, from children's acquisition of simple grammatical constructions to complex constructions and discourse.
Abstract: This chapter is about how young children master the use of a language, with a focus on grammatical constructions. Major theories of language acquisition are presented, along with the basic facts of language development, from children's acquisition of simple grammatical constructions to complex constructions and discourse. Also covered are the language children hear, the acquisition of morphology, individual differences, and atypical development. Keywords: analogy; constructions; distribution learning; grammar; language acquisition; learning

Book
19 Apr 2007
TL;DR: This book discusses language in the Wild, Gesture, Sign, and Speech, and the Ritualization of Language, as well as conceptual Spaces and Embodied Actions, and The Gesture-Language Interface.
Abstract: 1. Grasping Language: Sign and the Evolution of Language 2. Language in the Wild: Paleontological and Primatological Evidence for Gestural Origins 3. Gesture, Sign, and Speech 4. Gesture, Sign, and Grammar: The Ritualization of Language 5. Conceptual Spaces and Embodied Actions 6. The Gesture-Language Interface 7. Invention of Visual Languages

Journal Article
TL;DR: The authors describe three professional development contexts in the U.S., where teachers have engaged in language analysis based on functional linguistics that has given them new insights into both content and learning processes.
Abstract: Classrooms around the world are becoming more multilingual and teachers in all subject areas are faced with new challenges in enabling learners' academic language development without losing focus on content. These challenges require new ways of conceptualizing the relationship between language and content as well as new pedagogies that incorporate a dual focus on language and content in subject matter instruction. This article describes three professional development contexts in the U.S., where teachers have engaged in language analysis based on functional linguistics (for example, Halliday & Hasan, 1989; Christie, 1989) that has given them new insights into both content and learning processes. In these contexts, teachers in history classrooms with English Language Learners and teachers of languages other than English in classrooms with heritage speakers needed support to develop students' academic language development in a second language. The functional linguistics metalanguage and analysis skills they developed gave them new ways of approaching the texts read and written in their classrooms and enabled them to recognize how language constructs the content they are teaching, to critically assess how the content is presented in their teaching materials, and to engage students in richer conversation about content.

BookDOI
TL;DR: This paper introduces the novel notion of Formal Classification, a graph structure whose labels are written in a propositional concept language, which makes it possible to reason about classifications and to reduce document classification and query answering to reasoning about subsumption.
Abstract: Classifications have been used for centuries with the goal of cataloguing and searching large sets of objects. In the early days it was mainly books; lately it has also become Web pages, pictures and any kind of digital resources. Classifications describe their contents using natural language labels, an approach which has proved very effective in manual classification. However natural language labels show their limitations when one tries to automate the process, as they make it very hard to reason about classifications and their contents. In this paper we introduce the novel notion of Formal Classification, as a graph structure where labels are written in a propositional concept language. Formal Classifications turn out to be some form of lightweight ontologies. This, in turn, allows us to reason about them, to associate to each node a normal form formula which univocally describes its contents, and to reduce document classification and query answering to reasoning about subsumption.
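For the special case where a node's label normalizes to a conjunction of atomic concepts, subsumption reduces to set containment. This is a deliberately lightweight sketch (the full propositional language in the paper is richer), with invented concept atoms:

```python
def subsumes(general, specific):
    """True if the general concept subsumes the specific one, i.e. every
    atom of the general conjunction also occurs in the specific one."""
    return set(general) <= set(specific)

# a node labelled {science, physics} is subsumed by one labelled {science},
# so a query for "science" documents should reach the physics node
assert subsumes({"science"}, {"science", "physics"})
assert not subsumes({"art"}, {"science", "physics"})
```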

Patent
15 May 2007
TL;DR: In this article, a method for recognizing a named entity included in natural language, comprising the steps of: performing gradual parsing model training with the natural language to obtain a classification model, performing gradually parsing and recognition according to the obtained classification model to obtain information on positions and types of candidate named entities; performing a refusal recognition process for the candidate named entity; and generating a candidate named Entity lattice from the refusal-recognition-processed candidate namedEntity, and searching for a optimal path.
Abstract: The present invention provides a method for recognizing a named entity included in natural language, comprising the steps of: performing gradual parsing model training with the natural language to obtain a classification model; performing gradual parsing and recognition according to the obtained classification model to obtain information on positions and types of candidate named entities; performing a refusal recognition process for the candidate named entities; and generating a candidate named entity lattice from the refusal-recognition-processed candidate named entities, and searching for a optimal path. The present invention uses a one-class classifier to score or evaluate these results to obtain the most reliable beginning and end borders of the named entities on the basis of the forward and backward parsing and recognizing results obtained only by using the local features.
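The "searching for an optimal path" step over a candidate lattice can be sketched as a simple dynamic program. The lattice edges, labels, and scores below are invented; the patent's actual scoring comes from its one-class classifier:

```python
# lattice[i] = list of (end_position, label, score) candidate spans starting
# at token position i; scores are hypothetical classifier outputs
lattice = {
    0: [(1, "O", 0.9), (2, "PER", 0.6)],
    1: [(2, "O", 0.5)],
    2: [(3, "LOC", 0.8)],
}

def best_path(lattice, n):
    """Dynamic program over positions 0..n, maximizing the summed edge scores."""
    best = {0: (0.0, [])}  # position -> (best score, path of (start, end, label))
    for i in range(n):
        if i not in best:
            continue
        score, path = best[i]
        for end, label, s in lattice.get(i, []):
            cand = (score + s, path + [(i, end, label)])
            if end not in best or cand[0] > best[end][0]:
                best[end] = cand
    return best[n][1]

path = best_path(lattice, 3)
# the higher-scoring segmentation wins over the single "PER" span
assert path == [(0, 1, "O"), (1, 2, "O"), (2, 3, "LOC")]
```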