
Showing papers on "Natural language published in 2010"


Journal ArticleDOI
01 Dec 2010
TL;DR: The book is a practical guide to NLP, achieving a balance between NLP theory and practical programming skills, and alternates between focusing on natural language, supported by pertinent programming examples, and focusing on the Python programming language, with linguistic examples in a supporting role.
Abstract: Natural Language Processing (NLP) is experiencing rapid growth as its theories and methods are deployed in an ever wider range of fields. In the humanities, work on corpora is gaining prominence; within industry, NLP is needed for market analysis and web software development, to name a few examples. For this reason it is important for many people to have some working knowledge of NLP. The book "Natural Language Processing with Python" by Steven Bird, Ewan Klein and Edward Loper is a recent contribution that meets this demand. It introduces the freely available Natural Language Toolkit (NLTK), a project by the same authors, designed with the following goals: simplicity, consistency, extensibility and modularity. The book pursues pedagogical aims and is intended for students and others who want to learn to write programs that analyze natural language. Programming knowledge is not necessarily expected, since the book is written for people "new to programming", "new to Python" and "already dreaming in Python" (p. x). It also targets lecturers, who can use it in their courses. The book is a practical guide to NLP, achieving a balance between NLP theory and practical programming skills. It alternates between focusing on natural language, supported by pertinent programming examples, and focusing on the Python programming language, with linguistic examples in a supporting role. The reader gets to know many real-world NLP applications and learns by example. The book is well structured: each chapter starts with key questions that give a rough idea of the information it will provide, and finishes with a summary and exercises at "easy", "intermediate" and "difficult" levels.

1,358 citations


PatentDOI
TL;DR: In this paper, a system is presented for receiving speech and non-speech communications of natural language questions and commands, transcribing the speech and non-speech communications to textual messages, and executing the questions and/or commands.
Abstract: Systems and methods are provided for receiving speech and non-speech communications of natural language questions and/or commands, transcribing the speech and non-speech communications to textual messages, and executing the questions and/or commands. The invention applies context, prior information, domain knowledge, and user specific profile data to achieve a natural environment for one or more users presenting questions or commands across multiple domains. The systems and methods create, store and use extensive personal profile information for each user, thereby improving the reliability of determining the context of the speech and non-speech communications and presenting the expected results for a particular question or command.

1,164 citations


Journal ArticleDOI
20 Jan 2010-PLOS ONE
TL;DR: The proposed Linguistic Niche Hypothesis holds that language structures are subject to different evolutionary pressures in different social environments and appear to adapt to the environment (niche) in which they are learned and used.
Abstract (Lupyan, Gary; Dale, Rick): Background: Languages differ greatly both in their syntactic and morphological systems and in the social environments in which they exist. We challenge the view that language grammars are unrelated to the social environments in which they are learned and used. Methodology/principal findings: We conducted a statistical analysis of over 2,000 languages using a combination of demographic sources and the World Atlas of Language Structures, a database of structural language properties. We found strong relationships between linguistic factors related to morphological complexity and demographic/socio-historical factors such as the number of language users, geographic spread, and degree of language contact. The analyses suggest that languages spoken by large groups have simpler inflectional morphology than languages spoken by smaller groups, as measured on a variety of factors such as case systems and complexity of conjugations. Additionally, languages spoken by large groups are much more likely to use lexical strategies in place of inflectional morphology to encode evidentiality, negation, aspect, and possession. Our findings indicate that just as biological organisms are shaped by ecological niches, language structures appear to adapt to the environment (niche) in which they are being learned and used. As adults learn a language, features that are difficult for them to acquire are less likely to be passed on to subsequent learners. Languages used for communication in large groups that include adult learners appear to have been subjected to such selection. Conversely, the morphological complexity common to languages used in small groups increases redundancy, which may facilitate language learning by infants. Conclusions/significance: We hypothesize that language structures are subjected to different evolutionary pressures in different social environments.
The proposed Linguistic Niche Hypothesis has implications for answering the broad question of why languages differ in the way they do and makes empirical predictions regarding language acquisition capacities of children versus adults.

460 citations


Journal ArticleDOI
TL;DR: Key ideas from the two areas of paraphrasing and textual entailment are summarized by considering in turn recognition, generation, and extraction methods, also pointing to prominent articles and resources.
Abstract: Paraphrasing methods recognize, generate, or extract phrases, sentences, or longer natural language expressions that convey almost the same information. Textual entailment methods, on the other hand, recognize, generate, or extract pairs of natural language expressions, such that a human who reads (and trusts) the first element of a pair would most likely infer that the other element is also true. Paraphrasing can be seen as bidirectional textual entailment and methods from the two areas are often similar. Both kinds of methods are useful, at least in principle, in a wide range of natural language processing applications, including question answering, summarization, text generation, and machine translation. We summarize key ideas from the two areas by considering in turn recognition, generation, and extraction methods, also pointing to prominent articles and resources.
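As a concrete (and deliberately crude) illustration of the recognition direction discussed above, a word-overlap baseline is a common starting point for entailment and paraphrase recognition. This is not a method from the survey itself; the function names, the naive tokenizer, and the 0.8 threshold below are assumptions for this sketch:

```python
# Word-overlap baseline for entailment/paraphrase recognition (illustrative
# only; real systems use syntactic, semantic, and learned similarity).

def tokens(text):
    """Lowercase word tokens via a naive alphanumeric split."""
    return set("".join(c if c.isalnum() else " " for c in text.lower()).split())

def overlap_score(premise, hypothesis):
    """Fraction of hypothesis tokens that also appear in the premise."""
    h = tokens(hypothesis)
    return len(tokens(premise) & h) / len(h) if h else 0.0

def entails(premise, hypothesis, threshold=0.8):
    """Guess entailment when most hypothesis words are covered by the premise."""
    return overlap_score(premise, hypothesis) >= threshold

def is_paraphrase(a, b, threshold=0.8):
    """Paraphrase approximated as bidirectional entailment."""
    return entails(a, b, threshold) and entails(b, a, threshold)
```

Implementing paraphrase as bidirectional entailment mirrors the relationship between the two tasks described in the abstract.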

435 citations


Journal ArticleDOI
TL;DR: The idea that iconicity need also be recognized as a general property of language, which may serve the function of reducing the gap between linguistic form and conceptual representation to allow the language system to “hook up” to motor, perceptual, and affective experience is put forward.
Abstract: Current views about language are dominated by the idea of arbitrary connections between linguistic form and meaning. However, if we look beyond the more familiar Indo-European languages and also include both spoken and signed language modalities, we find that motivated, iconic form-meaning mappings are, in fact, pervasive in language. In this paper, we review the different types of iconic mappings that characterize languages in both modalities, including the predominantly visually iconic mappings found in signed languages. Having shown that iconic mappings are present across languages, we then proceed to review evidence showing that language users (signers and speakers) exploit iconicity in language processing and language acquisition. While not discounting the presence and importance of arbitrariness in language, we put forward the idea that iconicity need also be recognized as a general property of language, which may serve the function of reducing the gap between linguistic form and conceptual representation to allow the language system to "hook up" to motor, perceptual, and affective experience.

435 citations


Proceedings ArticleDOI
02 Mar 2010
TL;DR: This work presents a system that follows natural language directions by extracting a sequence of spatial description clauses from the linguistic input and then inferring the most probable path through the environment, given only information about the environmental geometry and detected visible objects.
Abstract: Speaking using unconstrained natural language is an intuitive and flexible way for humans to interact with robots. Understanding this kind of linguistic input is challenging because diverse words and phrases must be mapped into structures that the robot can understand, and elements in those structures must be grounded in an uncertain environment. We present a system that follows natural language directions by extracting a sequence of spatial description clauses from the linguistic input and then inferring the most probable path through the environment given only information about the environmental geometry and detected visible objects. We use a probabilistic graphical model that factors into three key components. The first component grounds landmark phrases such as "the computers" in the perceptual frame of the robot by exploiting co-occurrence statistics from a database of tagged images such as Flickr. Second, a spatial reasoning component judges how well spatial relations such as "past the computers" describe a path. Finally, verb phrases such as "turn right" are modeled according to the amount of change in orientation in the path. Our system follows 60% of the directions in our corpus to within 15 meters of the true destination, significantly outperforming other approaches.
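The factored model described above can be caricatured in a few lines: each spatial description clause contributes independent factors (landmark grounding, spatial relation, verb), and candidate paths are ranked by the sum of their log probabilities. The stub probability functions and dictionary layout below are invented for illustration; the paper's actual factors are learned from tagged-image statistics and path geometry:

```python
import math

# Caricature of factored path scoring: three independent per-clause factors
# combined in log space. All probabilities here are stand-in stubs.

def landmark_prob(phrase, path):
    """P(landmark phrase grounds to something visible along the path) - stub."""
    return 0.9 if phrase in path["visible_objects"] else 0.1

def spatial_prob(relation, path):
    """P(path satisfies a spatial relation such as 'past') - stub."""
    return 0.8 if relation in path["relations"] else 0.2

def verb_prob(verb, path):
    """P(verb matches the path's orientation change, e.g. 'right') - stub."""
    return 0.8 if verb == path["turn"] else 0.2

def score_path(clause, path):
    """Log-probability score of one clause for one candidate path."""
    return (math.log(landmark_prob(clause["landmark"], path))
            + math.log(spatial_prob(clause["relation"], path))
            + math.log(verb_prob(clause["verb"], path)))
```

The most probable path is then simply `max(candidate_paths, key=lambda p: score_path(clause, p))`, with per-clause scores summed when a direction contains several clauses.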

368 citations


Journal ArticleDOI
TL;DR: This paper found that language exerts pervasive and indirect influences that are not described by the effect sizes used in the meta-analysis, and unlike code-related skills that develop rapidly during the years studied, language develops over an extended time span.
Abstract: Although the National Early Literacy Panel report provides an important distillation of research, the manner in which the data are reported underrepresents the importance of language. Unlike other predictors with moderate associations with later reading, language exerts pervasive and indirect influences that are not described by the effect sizes used in the meta-analysis. Also, unlike code-related skills that develop rapidly during the years studied, language develops over an extended time span. Because it is relatively difficult to devise interventions that dramatically alter children’s language abilities, the authors of this response are concerned that schools will target the more malleable code-based skills. They warn against such a move.

325 citations


Journal ArticleDOI
17 Jun 2010
TL;DR: An image parsing to text description (I2T) framework that generates text descriptions of image and video content based on image understanding and uses automatic methods to parse image/video in specific domains and generate text reports that are useful for real-world applications.
Abstract: In this paper, we present an image parsing to text description (I2T) framework that generates text descriptions of image and video content based on image understanding. The proposed I2T framework follows three steps: 1) input images (or video frames) are decomposed into their constituent visual patterns by an image parsing engine, in a spirit similar to parsing sentences in natural language; 2) the image parsing results are converted into semantic representation in the form of Web ontology language (OWL), which enables seamless integration with general knowledge bases; and 3) a text generation engine converts the results from previous steps into semantically meaningful, human readable, and query-able text reports. The centerpiece of the I2T framework is an and-or graph (AoG) visual knowledge representation, which provides a graphical representation serving as prior knowledge for representing diverse visual patterns and provides top-down hypotheses during the image parsing. The AoG embodies vocabularies of visual elements including primitives, parts, objects, scenes as well as a stochastic image grammar that specifies syntactic relations (i.e., compositional) and semantic relations (e.g., categorical, spatial, temporal, and functional) between these visual elements. Therefore, the AoG is a unified model of both categorical and symbolic representations of visual knowledge. The proposed I2T framework has two objectives. First, we use a semiautomatic method to parse images from the Internet in order to build an AoG for visual knowledge representation. Our goal is to make the parsing process increasingly automatic using the learned AoG model. Second, we use automatic methods to parse image/video in specific domains and generate text reports that are useful for real-world applications.
In the case studies at the end of this paper, we demonstrate two automatic I2T systems: a maritime and urban scene video surveillance system and a real-time automatic driving scene understanding system.

322 citations


Patent
22 Feb 2010
TL;DR: In this article, a system for processing multi-modal device interactions in a natural language voice services environment is presented, in which context relating to the non-voice interaction and the natural language utterance may be extracted and combined to determine an intent of the multi-modal device interaction, and a request may then be routed to one or more of the electronic devices based on the determined intent.
Abstract: A system and method for processing multi-modal device interactions in a natural language voice services environment may be provided. In particular, one or more multi-modal device interactions may be received in a natural language voice services environment that includes one or more electronic devices. The multi-modal device interactions may include a non-voice interaction with at least one of the electronic devices or an application associated therewith, and may further include a natural language utterance relating to the non-voice interaction. Context relating to the non-voice interaction and the natural language utterance may be extracted and combined to determine an intent of the multi-modal device interaction, and a request may then be routed to one or more of the electronic devices based on the determined intent of the multi-modal device interaction.

321 citations


Proceedings Article
09 Oct 2010
TL;DR: This paper uses higher-order unification to define a hypothesis space containing all grammars consistent with the training data, and develops an online learning algorithm that efficiently searches this space while simultaneously estimating the parameters of a log-linear parsing model.
Abstract: This paper addresses the problem of learning to map sentences to logical form, given training data consisting of natural language sentences paired with logical representations of their meaning. Previous approaches have been designed for particular natural languages or specific meaning representations; here we present a more general method. The approach induces a probabilistic CCG grammar that represents the meaning of individual words and defines how these meanings can be combined to analyze complete sentences. We use higher-order unification to define a hypothesis space containing all grammars consistent with the training data, and develop an online learning algorithm that efficiently searches this space while simultaneously estimating the parameters of a log-linear parsing model. Experiments demonstrate high accuracy on benchmark data sets in four languages with two different meaning representations.

304 citations


Proceedings Article
15 Jul 2010
TL;DR: A general overview of the CoNLL-2010 Shared Task, including the annotation protocols of the training and evaluation datasets, the exact task definitions, the evaluation metrics employed and the overall results is provided.
Abstract: The CoNLL-2010 Shared Task was dedicated to the detection of uncertainty cues and their linguistic scope in natural language texts. The motivation behind this task was that distinguishing factual and uncertain information in texts is of essential importance in information extraction. This paper provides a general overview of the shared task, including the annotation protocols of the training and evaluation datasets, the exact task definitions, the evaluation metrics employed and the overall results. The paper concludes with an analysis of the prominent approaches and an overview of the systems submitted to the shared task.
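The first subtask described above, cue detection, can be caricatured as a lexicon lookup. The cue list below is a small invented sample and the sentence-level rule is a simplification; systems submitted to the shared task used supervised sequence labeling over the annotated training data rather than fixed lexicons:

```python
# Toy uncertainty-cue detector (illustrative lexicon approach only).

HEDGE_CUES = {"may", "might", "suggest", "suggests", "possibly", "appear",
              "appears", "likely", "perhaps", "unclear"}

def find_cues(sentence):
    """Return (index, token) pairs for uncertainty cues in the sentence."""
    words = sentence.lower().replace(",", " ").replace(".", " ").split()
    return [(i, w) for i, w in enumerate(words) if w in HEDGE_CUES]

def is_uncertain(sentence):
    """Flag a sentence as uncertain if it contains at least one cue."""
    return bool(find_cues(sentence))
```

The harder half of the task, resolving each cue's linguistic scope, is what the supervised systems competed on and is not attempted here.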

Proceedings ArticleDOI
Sumit Gulwani1
26 Jul 2010
TL;DR: This tutorial will describe the three key dimensions that should be taken into account in designing any program synthesis system: expression of user intent, space of programs over which to search, and the search technique.
Abstract: Program Synthesis, which is the task of discovering programs that realize user intent, can be useful in several scenarios: discovery of new algorithms, helping regular programmers automatically discover tricky/mundane programming details, enabling people with no programming background to develop scripts for performing repetitive tasks (end-user programming), and even problem solving in the context of automating teaching. In this tutorial, I will describe the three key dimensions that should be taken into account in designing any program synthesis system: expression of user intent, space of programs over which to search, and the search technique [1]. (i) The user intent can be expressed in the form of logical relations between inputs and outputs, input-output examples, demonstrations, natural language, and inefficient or related programs. (ii) The search space can be over imperative or functional programs (with possible restrictions on the control structure or the operator set), or over restricted models of computations such as regular/context-free grammars/transducers, or succinct logical representations. (iii) The search technique can be based on exhaustive search, version space algebras, machine learning techniques (such as belief propagation or genetic programming), or logical reasoning techniques based on SAT/SMT solvers. I will illustrate these concepts with brief descriptions of various program synthesis projects that target synthesis of a wide variety of programs such as standard undergraduate textbook algorithms (e.g., sorting, dynamic programming), program inverses (e.g., decoders, deserializers), bitvector manipulation routines, deobfuscated programs, graph algorithms, text-manipulating routines, geometry algorithms, etc.
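The three dimensions can be made concrete with a toy synthesizer: user intent expressed as input-output examples, a search space of small arithmetic expressions over a single input, and exhaustive enumeration as the search technique. The grammar, the constants 1-3, and the depth bound are arbitrary choices for this sketch, not anything from the tutorial:

```python
import itertools

# Toy enumerative synthesizer: search small arithmetic expressions over x
# until one is consistent with every input-output example.

OPS = {"+": lambda a, b: a + b, "-": lambda a, b: a - b, "*": lambda a, b: a * b}

def enumerate_programs(depth):
    """Yield (description, function) pairs for expressions at a given depth."""
    if depth == 0:
        yield "x", lambda x: x
        for c in (1, 2, 3):
            yield str(c), (lambda x, c=c: c)
        return
    # Combine two sub-expressions of the previous depth with each operator.
    for (da, fa), (db, fb) in itertools.product(enumerate_programs(depth - 1),
                                                repeat=2):
        for name, op in OPS.items():
            yield (f"({da} {name} {db})",
                   lambda x, fa=fa, fb=fb, op=op: op(fa(x), fb(x)))

def synthesize(examples, max_depth=2):
    """Return the first enumerated program consistent with all examples."""
    for depth in range(max_depth + 1):
        for desc, f in enumerate_programs(depth):
            if all(f(x) == y for x, y in examples):
                return desc
    return None
```

Replacing the enumeration with, say, an SMT-solver query or a version-space algebra changes the system along exactly the third dimension while leaving the intent and search space fixed.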

Patent
10 Nov 2010
TL;DR: In this article, a system and method described in this paper may provide a natural language content dedication service in a voice services environment, which may generally include detecting multi-modal device interactions that include requests to dedicate content, identifying the content requested for dedication from natural language utterances included in the multiuser device interactions, processing transactions for the content requests for dedication, processing natural language to customize the content for recipients of the dedications, and delivering the customized content to the recipients of dedications.
Abstract: The system and method described herein may provide a natural language content dedication service in a voice services environment. In particular, providing the natural language content dedication service may generally include detecting multi-modal device interactions that include requests to dedicate content, identifying the content requested for dedication from natural language utterances included in the multi-modal device interactions, processing transactions for the content requested for dedication, processing natural language to customize the content for recipients of the dedications, and delivering the customized content to the recipients of the dedications.

BookDOI
TL;DR: This work fills the gap in empirical data that has been missing from qualitative analyses of the problems involved in building natural language interfaces (NLIs) and proposes a methodology to address these problems.
Abstract: While qualitative analyses of the problems involved in building natural language interfaces (NLIs) have been available, a quantitative grounding in empirical data has been missing. We fill this gap ...

Patent
22 Oct 2010
TL;DR: The authors generalized disambiguation of natural language expressions to disjunctive sets of interpretations that can be specialized hierarchically, in accordance with one or more specialization hierarchies between semantic descriptors.
Abstract: Disambiguation of the meaning of a natural language expression proceeds by constructing an initial meaning representation for the expression, and then incrementally specializing the meaning representation to more specific meanings as more information and constraints are obtained, in accordance with one or more specialization hierarchies between semantic descriptors. The method is generalized to disjunctive sets of interpretations that can be specialized hierarchically.

Patent
16 Sep 2010
TL;DR: In this paper, a system and method for hybrid processing in a natural language voice services environment that includes a plurality of multi-modal devices may be provided, in particular, the hybrid processing may generally include the plurality of multidomal devices cooperatively interpreting and processing one or more natural language utterances included in one or multiple multimodal requests.
Abstract: A system and method for hybrid processing in a natural language voice services environment that includes a plurality of multi-modal devices may be provided. In particular, the hybrid processing may generally include the plurality of multi-modal devices cooperatively interpreting and processing one or more natural language utterances included in one or more multi-modal requests. For example, a virtual router may receive various messages that include encoded audio corresponding to a natural language utterance contained in a multi-modal interaction provided to one or more of the devices. The virtual router may then analyze the encoded audio to select the cleanest sample of the natural language utterance and communicate with one or more other devices in the environment to determine an intent of the multi-modal interaction. The virtual router may then coordinate resolving the multi-modal interaction based on the intent of the multi-modal interaction.

Proceedings Article
11 Jul 2010
TL;DR: A system that learns to follow navigational natural language directions by learning by apprenticeship from routes through a map paired with English descriptions using a reinforcement learning algorithm, which grounds the meaning of spatial terms like above and south into geometric properties of paths.
Abstract: We present a system that learns to follow navigational natural language directions. Where traditional models learn from linguistic annotation or word distributions, our approach is grounded in the world, learning by apprenticeship from routes through a map paired with English descriptions. Lacking an explicit alignment between the text and the reference path makes it difficult to determine what portions of the language describe which aspects of the route. We learn this correspondence with a reinforcement learning algorithm, using the deviation of the route we follow from the intended path as a reward signal. We demonstrate that our system successfully grounds the meaning of spatial terms like above and south into geometric properties of paths.

Journal ArticleDOI
01 Dec 2010
TL;DR: In this article, the authors present a declarative authorization language, where policies and credentials are expressed using predicates defined by logical clauses, in the style of constraint logic programming.
Abstract: We present a declarative authorization language. Policies and credentials are expressed using predicates defined by logical clauses, in the style of constraint logic programming. Access requests are mapped to logical authorization queries, consisting of predicates and constraints combined by conjunctions, disjunctions, and negations. Access is granted if the query succeeds against the current database of clauses. Predicates ascribe rights to particular principals, with flexible support for delegation and revocation. At the discretion of the delegator, delegated rights can be further delegated, either to a fixed depth, or arbitrarily deeply. Our language strikes a careful balance between syntactic and semantic simplicity, policy expressiveness, and execution efficiency. The syntax is close to natural language, and the semantics consists of just three deduction rules. The language can express many common policy idioms using constraints, controlled delegation, recursive predicates, and negated queries. We describe an execution strategy based on translation to Datalog with Constraints, and table-based resolution. We show that this execution strategy is sound, complete, and always terminates, despite recursion and negation, as long as simple syntactic conditions are met.
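The flavor of clause-defined rights with recursive delegation can be sketched in a few lines. The relation names and the example policy below are invented, and the sketch omits the constraints, negation, depth-limited delegation, revocation, and Datalog translation that the actual language provides:

```python
# Toy clause-based authorization with transitive delegation (illustrative).

GRANTS = {("admin", "read_file")}      # direct rights: (principal, right)
DELEGATIONS = {("admin", "alice"),     # admin delegates its rights to alice
               ("alice", "bob")}       # alice re-delegates to bob

def has_right(principal, right, seen=None):
    """True if the principal holds the right directly, or some delegator
    who holds it has delegated to the principal (recursive, cycle-safe)."""
    seen = seen or set()
    if principal in seen:              # guard against delegation cycles
        return False
    if (principal, right) in GRANTS:
        return True
    return any(has_right(delegator, right, seen | {principal})
               for delegator, delegatee in DELEGATIONS
               if delegatee == principal)
```

An access check like `has_right("bob", "read_file")` plays the role of the paper's authorization queries succeeding against the current database of clauses.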

Book ChapterDOI
30 May 2010
TL;DR: This work presents FREyA, which combines syntactic parsing with the knowledge encoded in ontologies in order to reduce the customisation effort, and is evaluated using Mooney Geoquery dataset with very high precision and recall.
Abstract: With large datasets such as Linked Open Data available, there is a need for more user-friendly interfaces which will bring the advantages of these data closer to casual users. Several recent studies have shown user preference for Natural Language Interfaces (NLIs) in comparison to others. Although many NLIs to ontologies have been developed, those that have reasonable performance are domain-specific and tend to require customisation for each new domain which, from a developer's perspective, makes them expensive to maintain. We present our system FREyA, which combines syntactic parsing with the knowledge encoded in ontologies in order to reduce the customisation effort. If the system fails to automatically derive an answer, it will generate clarification dialogs for the user. The user's selections are saved and used for training the system in order to improve its performance over time. FREyA is evaluated using the Mooney Geoquery dataset with very high precision and recall.

Journal ArticleDOI
TL;DR: A position paper is started, written by some of the members of the CIS fuzzy systems technical committee task force on CWW, that answers the question "What does CWW mean to me"?
Abstract: Computing with words (CWW) means different things to different people. This article is the start of a position paper, written by some of the members of the CIS fuzzy systems technical committee task force on CWW, that answers the question "What does CWW mean to me"?

Journal ArticleDOI
TL;DR: A central challenge for the future will be to explore how different forms of information transmission affect this process, and how individual-level behaviours result in population-level linguistic phenomena.

Patent
10 May 2010
TL;DR: In this article, a system for and method of providing reusable software service information based on natural language queries is presented, where the system and method may include receiving, from a user system, query data in a natural language format that indicates a request for a plurality of reusable software services applications that are configured to perform a particular task, processing the query data to generate search criteria that include query values, and searching a database, for the plurality of reuse software service applications based on the query values.
Abstract: A system for and method of providing reusable software service information based on natural language queries. The system and method may include receiving, from a user system, query data in a natural language format that indicates a request for a plurality of reusable software service applications that are configured to perform a particular task, processing the query data to generate search criteria that include query values, and searching a database for the plurality of reusable software service applications based on the query values.

Journal ArticleDOI
TL;DR: Diffusion chains, but not isolate learners, were found to cumulatively increase predictability of plural marking by lexicalising the choice of plural marker, suggesting that such gradual, cumulative population-level processes offer a possible explanation for regularity in language.

Journal ArticleDOI
TL;DR: In this article, a system capable of analyzing the combinatorics of a wide range of conventionally implicated and expressive constructions in natural language via an extension of Potts's (2005) L_CI logic for supplementary conventional implicatures is presented.
Abstract: This paper provides a system capable of analyzing the combinatorics of a wide range of conventionally implicated and expressive constructions in natural language via an extension of Potts's (2005) L_CI logic for supplementary conventional implicatures. In particular, the system is capable of analyzing objects of mixed conventionally implicated/expressive and at-issue type, and objects with conventionally implicated or expressive meanings which provide the main content of their utterances. The logic is applied to a range of constructions and lexical items in several languages. doi:10.3765/sp.3.8

Proceedings ArticleDOI
02 Mar 2010
TL;DR: This work investigates how statistical machine translation techniques can be used to bridge the gap between natural language route instructions and a map of an environment built by a robot.
Abstract: Mobile robots that interact with humans in an intuitive way must be able to follow directions provided by humans in unconstrained natural language. In this work we investigate how statistical machine translation techniques can be used to bridge the gap between natural language route instructions and a map of an environment built by a robot. Our approach uses training data to learn to translate from natural language instructions to an automatically-labeled map. The complexity of the translation process is controlled by taking advantage of physical constraints imposed by the map. As a result, our technique can efficiently handle uncertainty in both map labeling and parsing. Our experiments demonstrate the promising capabilities achieved by our approach.

Proceedings Article
15 Jul 2010
TL;DR: TIPSem, a system to extract temporal information from natural language texts in English and Spanish, learns CRF models from training data and achieved the best F1 score in all tasks for Spanish.
Abstract: This paper presents TIPSem, a system to extract temporal information from natural language texts in English and Spanish. TIPSem learns CRF models from training data. Although the features used draw on different levels of language analysis, the approach is focused on semantic information. For Spanish, TIPSem achieved the best F1 score in all the tasks. For English, it obtained the best F1 in tasks B (events) and D (event-dct links), and was among the best systems in the rest.

Journal ArticleDOI
TL;DR: This study evaluates how measures of language productivity and organization in two languages converge with children's measured language abilities on the Bilingual English Spanish Assessment (BESA), a standardized measure of language ability.


Journal ArticleDOI
TL;DR: Individual differences in prediction performances are shown to strongly correlate with participants' sentence processing of complex, long-distance dependencies in natural language.
Abstract: Prediction-based processes appear to play an important role in language. Few studies, however, have sought to test the relationship within individuals between prediction learning and natural language processing. This paper builds upon existing statistical learning work using a novel paradigm for studying the on-line learning of predictive dependencies. Within this paradigm, a new "prediction task" is introduced that provides a sensitive index of individual differences for developing probabilistic sequential expectations. Across three interrelated experiments, the prediction task and results thereof are used to bridge knowledge of the empirical relation between statistical learning and language within the context of nonadjacency processing. We first chart the trajectory for learning nonadjacencies, documenting individual differences in prediction learning. Subsequent simple recurrent network simulations then closely capture human performance patterns in the new paradigm. Finally, individual differences in prediction performances are shown to strongly correlate with participants' sentence processing of complex, long-distance dependencies in natural language.

Book
29 Jan 2010
TL;DR: Grammar for English Language Teachers is designed to help practising and trainee teachers to develop their knowledge of English grammar systems and evaluates the 'rules of thumb' presented to learners in course materials.
Abstract: An invaluable resource helping teachers at all levels of experience to develop their understanding of English grammar. Grammar for English Language Teachers is designed to help practising and trainee teachers to develop their knowledge of English grammar systems. It encourages teachers to appreciate factors that affect grammatical choices, and evaluates the 'rules of thumb' presented to learners in course materials. Consolidation exercises provide an opportunity for teachers to test these rules against real language use and to evaluate classroom and reference materials.