
Showing papers on "Natural language published in 2013"


Journal ArticleDOI
TL;DR: The proposed system to automatically generate natural language descriptions from images is very effective at producing relevant sentences for images and generates descriptions that are notably more true to the specific image content than previous work.
Abstract: We present a system to automatically generate natural language descriptions from images. This system consists of two parts. The first part, content planning, smooths the output of computer vision-based detection and recognition algorithms with statistics mined from large pools of visually descriptive text to determine the best content words to use to describe an image. The second step, surface realization, chooses words to construct natural language sentences based on the predicted content and general statistics from natural language. We present multiple approaches for the surface realization step and evaluate each using automatic measures of similarity to human generated reference descriptions. We also collect forced choice human evaluations between descriptions from the proposed generation system and descriptions from competing approaches. The proposed system is very effective at producing relevant sentences for images. It also generates descriptions that are notably more true to the specific image content than previous work.

791 citations
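
The two-step pipeline described above (content planning followed by surface realization) can be illustrated with a toy sketch; the detector scores, co-occurrence statistics, and templates below are hypothetical placeholders, not the paper's actual model.

```python
# Minimal sketch of a two-stage description pipeline (content planning +
# surface realization). All detections, statistics, and templates are toy values.

# Step 1: content planning -- re-rank noisy detector outputs with
# co-occurrence statistics mined from visually descriptive text.
detections = {"dog": 0.9, "frisbee": 0.6, "cat": 0.3}            # vision scores
text_cooccurrence = {("dog", "frisbee"): 0.8, ("dog", "cat"): 0.1}

def plan_content(detections, cooccurrence, top_k=2):
    """Pick the top_k content words, boosting pairs that co-occur in text."""
    scores = dict(detections)
    for (a, b), strength in cooccurrence.items():
        if a in scores and b in scores:
            scores[a] += 0.5 * strength
            scores[b] += 0.5 * strength
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Step 2: surface realization -- turn the chosen content words into a sentence.
def realize(content_words):
    if len(content_words) == 2:
        return f"A {content_words[0]} with a {content_words[1]}."
    return f"A {content_words[0]}."

print(realize(plan_content(detections, text_cooccurrence)))
# -> "A dog with a frisbee."
```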


Proceedings ArticleDOI
01 Dec 2013
TL;DR: This paper presents a solution that takes a short video clip and outputs a brief sentence that sums up the main activity in the video, such as the actor, the action and its object, and uses a Web-scale language model to ``fill in'' novel verbs.
Abstract: Despite a recent push towards large-scale object recognition, activity recognition remains limited to narrow domains and small vocabularies of actions. In this paper, we tackle the challenge of recognizing and describing activities ``in-the-wild''. We present a solution that takes a short video clip and outputs a brief sentence that sums up the main activity in the video, such as the actor, the action and its object. Unlike previous work, our approach works on out-of-domain actions: it does not require training videos of the exact activity. If it cannot find an accurate prediction for a pre-trained model, it finds a less specific answer that is also plausible from a pragmatic standpoint. We use semantic hierarchies learned from the data to help choose an appropriate level of generalization, and priors learned from Web-scale natural language corpora to penalize unlikely combinations of actors/actions/objects. We also use a Web-scale language model to ``fill in'' novel verbs, i.e. when the verb does not appear in the training set. We evaluate our method on a large YouTube corpus and demonstrate that it is able to generate short sentence descriptions of video clips better than baseline approaches.

555 citations
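
The back-off idea in this abstract, choosing a more general label when the specific one cannot be predicted confidently, can be sketched as a walk up a semantic hierarchy; the hierarchy and confidence scores below are invented for illustration.

```python
# Toy sketch of "back off to a more general label": if the most specific action
# cannot be predicted confidently, walk up a semantic hierarchy until a
# sufficiently confident ancestor is found. Hierarchy and scores are hypothetical.
hierarchy = {"slicing": "cutting", "cutting": "preparing food",
             "preparing food": "activity"}
scores = {"slicing": 0.2, "cutting": 0.45, "preparing food": 0.8,
          "activity": 0.99}

def generalize(label, threshold=0.6):
    """Return the most specific label on the path whose score passes the threshold."""
    while label in hierarchy and scores.get(label, 0.0) < threshold:
        label = hierarchy[label]
    return label

print(generalize("slicing"))   # -> "preparing food"
```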


Proceedings ArticleDOI
13 May 2013
TL;DR: ClausIE is a novel, clause-based approach to open information extraction, which extracts relations and their arguments from natural language text using a small set of domain-independent lexica, operates sentence by sentence without any post-processing, and requires no training data.
Abstract: We propose ClausIE, a novel, clause-based approach to open information extraction, which extracts relations and their arguments from natural language text. ClausIE fundamentally differs from previous approaches in that it separates the detection of ``useful'' pieces of information expressed in a sentence from their representation in terms of extractions. In more detail, ClausIE exploits linguistic knowledge about the grammar of the English language to first detect clauses in an input sentence and to subsequently identify the type of each clause according to the grammatical function of its constituents. Based on this information, ClausIE is able to generate high-precision extractions; the representation of these extractions can be flexibly customized to the underlying application. ClausIE is based on dependency parsing and a small set of domain-independent lexica, operates sentence by sentence without any post-processing, and requires no training data (whether labeled or unlabeled). Our experimental study on various real-world datasets suggests that ClausIE obtains higher recall and higher precision than existing approaches, on both high-quality text and noisy text such as that found on the web.

537 citations
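
A rough sketch of the clause-typing idea: map the grammatical functions found in a clause to one of the basic English clause types and emit an extraction. The hand-written "parses" below stand in for real dependency-parser output and are not ClausIE's actual implementation.

```python
# Toy clause typing: classify a clause by the grammatical functions it contains,
# then emit a (subject, verb, argument) extraction. Inputs are hand-written
# stand-ins for dependency-parser output.

def clause_type(functions):
    """Map the set of non-core grammatical functions to a basic clause type."""
    if "dobj" in functions and "iobj" in functions:
        return "SVOO"          # subject - verb - indirect object - direct object
    if "dobj" in functions:
        return "SVO"
    if "complement" in functions:
        return "SVC"
    return "SV"

def extract(clause):
    ctype = clause_type(set(clause) - {"subj", "verb"})
    triple = (clause["subj"], clause["verb"],
              clause.get("dobj") or clause.get("complement") or "")
    return ctype, triple

clause = {"subj": "Albert Einstein", "verb": "died"}
print(extract(clause))   # -> ('SV', ('Albert Einstein', 'died', ''))

clause = {"subj": "Albert Einstein", "verb": "won", "dobj": "the Nobel Prize"}
print(extract(clause))   # -> ('SVO', ('Albert Einstein', 'won', 'the Nobel Prize'))
```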


Journal ArticleDOI
TL;DR: This paper shows that context can be used within a grounded CCG semantic parsing approach that learns a joint model of meaning and context for interpreting and executing natural language instructions, using various types of weak supervision.
Abstract: The context in which language is used provides a strong signal for learning to recover its meaning. In this paper, we show it can be used within a grounded CCG semantic parsing approach that learns a joint model of meaning and context for interpreting and executing natural language instructions, using various types of weak supervision. The joint nature provides crucial benefits by allowing situated cues, such as the set of visible objects, to directly influence learning. It also enables algorithms that learn while executing instructions, for example by trying to replicate human actions. Experiments on a benchmark navigational dataset demonstrate strong performance under differing forms of supervision, including correctly executing 60% more instruction sets relative to the previous state of the art.

530 citations


Proceedings ArticleDOI
04 Sep 2013
TL;DR: This paper discusses some implementation and data processing challenges encountered while developing a new multilingual version of DBpedia Spotlight that is faster, more accurate and easier to configure, and compares the solution to the previous system.
Abstract: There has recently been an increased interest in named entity recognition and disambiguation systems at major conferences such as WWW, SIGIR, ACL, KDD, etc. However, most work has focused on algorithms and evaluations, leaving little space for implementation details. In this paper, we discuss some implementation and data processing challenges we encountered while developing a new multilingual version of DBpedia Spotlight that is faster, more accurate and easier to configure. We compare our solution to the previous system, considering time performance, space requirements and accuracy in the context of the Dutch and English languages. Additionally, we report results for 9 additional languages among the largest Wikipedias. Finally, we present challenges and experiences to foment the discussion with other developers interested in recognition and disambiguation of entities in natural language text.

529 citations
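
For readers who want to try entity recognition and disambiguation in this style, a hedged example of calling a DBpedia Spotlight HTTP endpoint follows; the endpoint URL, parameters, and response fields reflect commonly documented usage and should be verified against the current deployment.

```python
# Hedged sketch of calling a DBpedia Spotlight annotation endpoint over HTTP.
# The URL, parameter names, and response fields below follow commonly documented
# usage (assumptions, not taken from the paper); check the current Spotlight docs.
import json
import urllib.parse
import urllib.request

def annotate(text, lang="en", confidence=0.5):
    params = urllib.parse.urlencode({"text": text, "confidence": confidence})
    url = f"https://api.dbpedia-spotlight.org/{lang}/annotate?{params}"
    req = urllib.request.Request(url, headers={"Accept": "application/json"})
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    # Each resource carries the matched surface form and the linked DBpedia URI.
    return [(r["@surfaceForm"], r["@URI"]) for r in data.get("Resources", [])]

if __name__ == "__main__":
    print(annotate("Berlin is the capital of Germany."))
```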


Proceedings ArticleDOI
01 Dec 2013
TL;DR: This paper generates a rich semantic representation of the visual content including e.g. object and activity labels and proposes to formulate the generation of natural language as a machine translation problem using the semantic representation as source language and the generated sentences as target language.
Abstract: Humans use rich natural language to describe and communicate visual perceptions. In order to provide natural language descriptions for visual content, this paper combines two important ingredients. First, we generate a rich semantic representation of the visual content including e.g. object and activity labels. To predict the semantic representation we learn a CRF to model the relationships between different components of the visual input. And second, we propose to formulate the generation of natural language as a machine translation problem using the semantic representation as source language and the generated sentences as target language. For this we exploit the power of a parallel corpus of videos and textual descriptions and adapt statistical machine translation to translate between our two languages. We evaluate our video descriptions on the TACoS dataset, which contains video snippets aligned with sentence descriptions. Using automatic evaluation and human judgments we show significant improvements over several baseline approaches, motivated by prior work. Our translation approach also shows improvements over related work on an image description task.

438 citations


Patent
28 Feb 2013
TL;DR: In this patent, the authors introduce Z-webs, including Z-factors and Z-nodes, for the understanding of relationships between objects, subjects, abstract ideas, concepts, or the like, including face, car, images, people, emotions, mood, text, natural language, voice, music, video, locations, formulas, facts, historical data, landmarks, personalities, ownership, family, friends, love, happiness, social behavior, voting behavior, and the like.
Abstract: Here, we introduce Z-webs, including Z-factors and Z-nodes, for the understanding of relationships between objects, subjects, abstract ideas, concepts, or the like, including face, car, images, people, emotions, mood, text, natural language, voice, music, video, locations, formulas, facts, historical data, landmarks, personalities, ownership, family, friends, love, happiness, social behavior, voting behavior, and the like, to be used for many applications in our life, including on the search engine, analytics, Big Data processing, natural language processing, economy forecasting, face recognition, dealing with reliability and certainty, medical diagnosis, pattern recognition, object recognition, biometrics, security analysis, risk analysis, fraud detection, satellite image analysis, machine generated data analysis, machine learning, training samples, extracting data or patterns (from the video, images, and the like), editing video or images, and the like. Z-factors include reliability factor, confidence factor, expertise factor, bias factor, and the like, which is associated with each Z-node in the Z-web.

398 citations


Book ChapterDOI
01 Jan 2013
TL;DR: This work discusses the problem of parsing natural language commands to actions and control structures that can be readily implemented in a robot execution system, and learns a parser based on example pairs of English commands and corresponding control language expressions.
Abstract: As robots become more ubiquitous and capable of performing complex tasks, the importance of enabling untrained users to interact with them has increased. In response, unconstrained natural-language interaction with robots has emerged as a significant research area. We discuss the problem of parsing natural language commands to actions and control structures that can be readily implemented in a robot execution system. Our approach learns a parser based on example pairs of English commands and corresponding control language expressions. We evaluate this approach in the context of following route instructions through an indoor environment, and demonstrate that our system can learn to translate English commands into sequences of desired actions, while correctly capturing the semantic intent of statements involving complex control structures. The procedural nature of our formal representation allows a robot to interpret route instructions online while moving through a previously unknown environment.

385 citations
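
A toy sketch of mapping route instructions into a small control language follows; the phrasebook, action names, and "and then" splitting are hypothetical stand-ins, whereas the chapter's system learns the mapping from paired commands and control expressions.

```python
# Toy mapping from route instructions to a small control language. The grammar,
# action names, and phrasebook are hypothetical; the real system learns this
# mapping from example command/expression pairs rather than hand-written rules.
PHRASEBOOK = {
    "turn left": "Turn(left)",
    "turn right": "Turn(right)",
    "go forward": "Move(forward)",
    "go to the kitchen": "GoTo(kitchen)",
}

def parse_command(command):
    """Split on 'and then' and translate each step, yielding an action sequence."""
    steps = [s.strip() for s in command.lower().split("and then")]
    translated = [PHRASEBOOK.get(s, f"Unknown({s!r})") for s in steps]
    return "Sequence(" + ", ".join(translated) + ")"

print(parse_command("Go to the kitchen and then turn left"))
# -> Sequence(GoTo(kitchen), Turn(left))
```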


Proceedings Article
01 Aug 2013
TL;DR: An analysis of the performance of publicly available, state-of-the-art tools on all layers and languages in the OntoNotes v5.0 corpus should set the benchmark for future development of various NLP components in syntax and semantics, and possibly encourage research towards an integrated system that makes use of the various layers jointly to improve overall performance.
Abstract: Large-scale linguistically annotated corpora have played a crucial role in advancing the state of the art of key natural language technologies such as syntactic, semantic and discourse analyzers, and they serve as training data as well as evaluation benchmarks. Until now, however, most of the evaluation has been done on monolithic corpora such as the Penn Treebank and the Proposition Bank. As a result, it is still unclear how the state-of-the-art analyzers perform in general on data from a variety of genres or domains. The completion of the OntoNotes corpus, a large-scale, multi-genre, multilingual corpus manually annotated with syntactic, semantic and discourse information, makes it possible to perform such an evaluation. This paper presents an analysis of the performance of publicly available, state-of-the-art tools on all layers and languages in the OntoNotes v5.0 corpus. This should set the benchmark for future development of various NLP components in syntax and semantics, and possibly encourage research towards an integrated system that makes use of the various layers jointly to improve overall performance.

381 citations


Proceedings Article
01 Aug 2013
TL;DR: The dialog state tracking challenge seeks to address this by providing a heterogeneous corpus of 15K human-computer dialogs in a standard format, along with a suite of 11 evaluation metrics, and shows that the suite of performance metrics cluster into 4 natural groups.
Abstract: In a spoken dialog system, dialog state tracking deduces information about the user's goal as the dialog progresses, synthesizing evidence such as dialog acts over multiple turns with external data sources. Recent approaches have been shown to overcome ASR and SLU errors in some applications. However, there are currently no common testbeds or evaluation measures for this task, hampering progress. The dialog state tracking challenge seeks to address this by providing a heterogeneous corpus of 15K human-computer dialogs in a standard format, along with a suite of 11 evaluation metrics. The challenge received a total of 27 entries from 9 research groups. The results show that the suite of performance metrics cluster into 4 natural groups. Moreover, the dialog systems that benefit most from dialog state tracking are those with less discriminative speech recognition confidence scores. Finally, generalization is a key problem: in 2 of the 4 test sets, fewer than half of the entries out-performed simple baselines.

1 Overview and motivation: Spoken dialog systems interact with users via natural language to help them achieve a goal. As the interaction progresses, the dialog manager maintains a representation of the state of the dialog in a process called dialog state tracking (DST). For example, in a bus schedule information system, the dialog state might indicate the user's desired bus route, origin, and destination. Dialog state tracking is difficult because automatic speech recognition (ASR) and spoken language understanding (SLU) errors are common, and can cause the system to misunderstand the user's needs. At the same time, state tracking is crucial because the system relies on the estimated dialog state to choose actions, for example, which bus schedule information to present to the user. Most commercial systems use hand-crafted heuristics for state tracking, selecting the SLU result with the highest confidence score and discarding alternatives. In contrast, statistical approaches compute scores for many hypotheses for the dialog state. By exploiting correlations between turns and information from external data sources, such as maps, bus timetables, or models of past dialogs, statistical approaches can overcome some SLU errors. Numerous techniques for dialog state tracking have been proposed, including heuristic scores (Higashinaka et al., 2003), Bayesian networks (Paek and Horvitz, 2000; Williams and Young, 2007), kernel density estimators (Ma et al., 2012), and discriminative models (Bohus and Rudnicky, 2006). Techniques have been fielded which scale to realistically sized dialog problems and operate in real time (Young et al., 2010; Thomson and Young, 2010; Williams, 2010; Mehta et al., 2010). In end-to-end dialog systems, dialog state tracking has been shown to improve overall system performance (Young et al., 2010; Thomson and Young, 2010). Despite this progress, direct comparisons between methods have not been possible because past studies use different domains and system components for speech recognition, spoken language understanding, dialog control, etc. Moreover, there is little agreement on how to evaluate dialog state tracking. Together these issues limit progress in this research area. The Dialog State Tracking Challenge (DSTC) provides a first common testbed and evaluation framework for this task.

379 citations
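
The core idea of statistical dialog state tracking, maintaining scores over many dialog-state hypotheses and updating them each turn, can be sketched minimally; the slot values, scores, and additive update below are illustrative, not one of the challenge entries.

```python
# Minimal sketch of dialog state tracking over SLU n-best lists: keep a
# distribution over slot values and fold in each turn's (noisy) SLU scores.
# Values and scores are hypothetical; real trackers use learned models.
from collections import defaultdict

def update_belief(belief, slu_hypotheses, weight=1.0):
    """Fold one turn's SLU hypotheses (value -> score) into the belief and renormalize."""
    for value, score in slu_hypotheses.items():
        belief[value] += weight * score
    total = sum(belief.values())
    return defaultdict(float, {v: s / total for v, s in belief.items()})

belief = defaultdict(float)
belief = update_belief(belief, {"route 61C": 0.6, "route 61B": 0.3})   # turn 1
belief = update_belief(belief, {"route 61C": 0.5, "route 64": 0.2})    # turn 2
print(max(belief, key=belief.get))   # -> "route 61C"
```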


Proceedings ArticleDOI
23 Jun 2013
TL;DR: This paper proposes a hybrid system consisting of a low level multimodal latent topic model for initial keyword annotation, a middle level of concept detectors and a high level module to produce final lingual descriptions that captures the most relevant contents of a video in a natural language description.
Abstract: The problem of describing images through natural language has gained importance in the computer vision community. Solutions to image description have either focused on a top-down approach of generating language through combinations of object detections and language models or bottom-up propagation of keyword tags from training images to test images through probabilistic or nearest neighbor techniques. In contrast, describing videos with natural language is a less studied problem. In this paper, we combine ideas from the bottom-up and top-down approaches to image description and propose a method for video description that captures the most relevant contents of a video in a natural language description. We propose a hybrid system consisting of a low level multimodal latent topic model for initial keyword annotation, a middle level of concept detectors and a high level module to produce final lingual descriptions. We compare the results of our system to human descriptions in both short and long forms on two datasets, and demonstrate that final system output has greater agreement with the human descriptions than any single level.

Proceedings Article
Zhengdong Lu, Hang Li
05 Dec 2013
TL;DR: This paper proposes a new deep architecture to more effectively model the complicated matching relations between two objects from heterogeneous domains and applies this model to matching tasks in natural language, e.g., finding sensible responses for a tweet, or relevant answers to a given question.
Abstract: Many machine learning problems can be interpreted as learning for matching two types of objects (e.g., images and captions, users and products, queries and documents, etc.). The matching level of two objects is usually measured as the inner product in a certain feature space, while the modeling effort focuses on mapping of objects from the original space to the feature space. This schema, although proven successful on a range of matching tasks, is insufficient for capturing the rich structure in the matching process of more complicated objects. In this paper, we propose a new deep architecture to more effectively model the complicated matching relations between two objects from heterogeneous domains. More specifically, we apply this model to matching tasks in natural language, e.g., finding sensible responses for a tweet, or relevant answers to a given question. This new architecture naturally combines the localness and hierarchy intrinsic to the natural language problems, and therefore greatly improves upon the state-of-the-art models.
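
A generic illustration of deep text matching (word-by-word interactions pooled into a score) is sketched below with numpy; the embeddings and weights are random placeholders and the architecture is deliberately simplified, not the paper's specific model.

```python
# Minimal numpy sketch of scoring the match between two short texts via
# word-by-word interactions followed by a small nonlinear layer. Embeddings
# and weights are random placeholders; this is a generic illustration only.
import numpy as np

rng = np.random.default_rng(0)
vocab = {w: i for i, w in enumerate("how do i reset my password account".split())}
E = rng.normal(size=(len(vocab), 8))           # toy word embeddings

def embed(text):
    return np.stack([E[vocab[w]] for w in text.split() if w in vocab])

def match_score(x, y, W, w_out):
    inter = embed(x) @ embed(y).T              # word-by-word interaction matrix
    pooled = np.array([inter.max(), inter.mean()])
    hidden = np.tanh(W @ pooled)               # small nonlinear layer
    return float(w_out @ hidden)

W, w_out = rng.normal(size=(4, 2)), rng.normal(size=4)
print(match_score("how do i reset my password", "reset password", W, w_out))
```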

Proceedings ArticleDOI
18 May 2013
TL;DR: A novel solution to adapt, configure and effectively use a topic modeling technique, namely Latent Dirichlet Allocation (LDA), to achieve better (acceptable) performance across various SE tasks is proposed.
Abstract: Information Retrieval (IR) methods, and in particular topic models, have recently been used to support essential software engineering (SE) tasks, by enabling software textual retrieval and analysis. In all these approaches, topic models have been used on software artifacts in a similar manner as they were used on natural language documents (e.g., using the same settings and parameters) because the underlying assumption was that source code and natural language documents are similar. However, applying topic models on software data using the same settings as for natural language text did not always produce the expected results. Recent research investigated this assumption and showed that source code is much more repetitive and predictable as compared to the natural language text. Our paper builds on this new fundamental finding and proposes a novel solution to adapt, configure and effectively use a topic modeling technique, namely Latent Dirichlet Allocation (LDA), to achieve better (acceptable) performance across various SE tasks. Our paper introduces a novel solution called LDA-GA, which uses Genetic Algorithms (GA) to determine a near-optimal configuration for LDA in the context of three different SE tasks: (1) traceability link recovery, (2) feature location, and (3) software artifact labeling. The results of our empirical studies demonstrate that LDA-GA is ableto identify robust LDA configurations, which lead to a higher accuracy on all the datasets for these SE tasks as compared to previously published results, heuristics, and the results of a combinatorial search.
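
The LDA-GA idea, searching LDA configurations with a genetic algorithm, can be sketched with a toy fitness function; in the paper the fitness is derived from clustering quality on the SE task, whereas the optimum below is invented for illustration.

```python
# Compact sketch of the LDA-GA idea: search LDA configurations (number of
# topics, alpha, beta) with a simple genetic algorithm. The fitness function
# is a placeholder with an invented optimum; the paper uses clustering quality.
import random

random.seed(42)

def fitness(cfg):
    """Placeholder: plug in e.g. the silhouette of document clusters under cfg."""
    k, alpha, beta = cfg
    return -abs(k - 25) - abs(alpha - 0.1) - abs(beta - 0.01)   # toy optimum

def mutate(cfg):
    k, alpha, beta = cfg
    return (max(2, k + random.choice([-5, 0, 5])),
            max(0.01, alpha * random.uniform(0.5, 1.5)),
            max(0.001, beta * random.uniform(0.5, 1.5)))

def crossover(a, b):
    return tuple(random.choice(pair) for pair in zip(a, b))

population = [(random.randint(5, 100), random.uniform(0.01, 1.0), random.uniform(0.001, 0.1))
              for _ in range(20)]
for _ in range(30):                            # generations
    population.sort(key=fitness, reverse=True)
    parents = population[:10]
    population = parents + [mutate(crossover(random.choice(parents), random.choice(parents)))
                            for _ in range(10)]

print("near-optimal LDA configuration:", max(population, key=fitness))
```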

Proceedings ArticleDOI
01 Dec 2013
TL;DR: This paper starts by building a manually annotated dataset and then takes the reader through the detailed steps of building the lexicon, which addresses both approaches to SA for the Arabic language.
Abstract: The emergence of the Web 2.0 technology generated a massive amount of raw data by enabling Internet users to post their opinions, reviews, and comments on the web. Processing this raw data to extract useful information can be a very challenging task. An example of important information that can be automatically extracted from users' posts and comments is their opinions on different issues, events, services, products, etc. This problem of Sentiment Analysis (SA) has been studied well for the English language, and two main approaches have been devised: corpus-based and lexicon-based. This paper addresses both approaches to SA for the Arabic language. Since there are only a limited number of publicly available Arabic datasets and Arabic lexicons for SA, this paper starts by building a manually annotated dataset and then takes the reader through the detailed steps of building the lexicon. Experiments are conducted throughout the different stages of this process to observe the improvements gained in the accuracy of the system and to compare them to the corpus-based approach.
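
The lexicon-based side of sentiment analysis can be sketched in a few lines; the tiny lexicon below is a placeholder, while the paper's contribution is the manually built Arabic dataset and the much larger lexicon.

```python
# Minimal sketch of lexicon-based sentiment scoring: sum the polarities of
# known words in a post. The three-entry lexicon is only a placeholder.
LEXICON = {"جميل": +1, "رائع": +1, "سيء": -1}   # beautiful, great, bad

def score(text):
    total = sum(LEXICON.get(w, 0) for w in text.split())
    if total > 0:
        return "positive"
    if total < 0:
        return "negative"
    return "neutral"

print(score("الفيلم جميل"))   # "the film is beautiful" -> positive
```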

Book ChapterDOI
01 Jan 2013
TL;DR: The latest iteration of ConceptNet is presented, ConceptNet 5, with a focus on its fundamental design decisions and ways to interoperate with it.
Abstract: ConceptNet is a knowledge representation project, providing a large semantic graph that describes general human knowledge and how it is expressed in natural language. Here we present the latest iteration, ConceptNet 5, with a focus on its fundamental design decisions and ways to interoperate with it.
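
A hedged example of interoperating with ConceptNet over its public Web API follows; the endpoint and JSON field names match the current api.conceptnet.io documentation, which postdates the ConceptNet 5 release described here, so treat them as assumptions.

```python
# Hedged sketch of querying ConceptNet over its public Web API. The endpoint
# and JSON field names follow the current api.conceptnet.io documentation
# (a later deployment than ConceptNet 5); verify before relying on them.
import json
import urllib.request

def edges_for(concept, lang="en", limit=5):
    url = f"https://api.conceptnet.io/c/{lang}/{concept}?limit={limit}"
    with urllib.request.urlopen(url) as resp:
        data = json.load(resp)
    # Each edge links a start node to an end node via a labeled relation.
    return [(e["start"]["label"], e["rel"]["label"], e["end"]["label"])
            for e in data.get("edges", [])]

if __name__ == "__main__":
    for triple in edges_for("coffee"):
        print(triple)
```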

Journal ArticleDOI
TL;DR: This paper presents an information granulation of the linguistic information used in group decision making problems defined in heterogeneous contexts, i.e., where the experts have associated importance degrees reflecting their ability to handle the problem.

Patent
Ming Zhou, Furu Wei, Xiaohua Liu, Hong Sun, Yajuan Duan, Chengjie Sun, Heung-Yeung Shum
02 Jul 2013
TL;DR: In this article, a natural language question is analyzed to extract query units and to determine a question type, answer type, and/or lexical answer type using rules-based heuristics and machine learning trained classifiers.
Abstract: Techniques described enable answering a natural language question using machine learning-based methods to gather and analyze evidence from web searches. A received natural language question is analyzed to extract query units and to determine a question type, answer type, and/or lexical answer type using rules-based heuristics and/or machine learning trained classifiers. Query generation templates are employed to generate a plurality of ranked queries to be used to gather evidence to determine the answer to the natural language question. Candidate answers are extracted from the results based on the answer type and/or lexical answer type, and ranked using a ranker previously trained offline. Confidence levels are calculated for the candidate answers and top answer(s) may be provided to the user if the confidence levels of the top answer(s) surpass a threshold.

Patent
08 Jan 2013
TL;DR: In this article, a phrase-based modeling of generic structures of verbal interaction is proposed for the purpose of automating part of the design of grammar networks, which can regulate, control, and define the content and scope of human-machine interaction in natural language voice user interfaces.
Abstract: The invention enables creation of grammar networks that can regulate, control, and define the content and scope of human-machine interaction in natural language voice user interfaces (NLVUI). More specifically, the invention concerns a phrase-based modeling of generic structures of verbal interaction and use of these models for the purpose of automating part of the design of such grammar networks.

Patent
17 May 2013
TL;DR: In this article, a system and method for representing, storing and retrieving real-world knowledge on a computer or network of computers is disclosed, where knowledge is broken down into permanent atomic "facts" which can be stored in a standard relational database and processed very efficiently.
Abstract: A system and method for representing, storing and retrieving real-world knowledge on a computer or network of computers is disclosed. Knowledge is broken down into permanent atomic “facts” which can be stored in a standard relational database and processed very efficiently. It also provides for the efficient querying of a knowledge base, efficient inference of new knowledge and translation into and out of natural language. Queries can also be processed with full natural language explanations of where the answers came from. The method can also be used in a distributed fashion enabling the system to be a large network of computers and the technology can be integrated into a web browser adding to the browser's functionality.
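
The "atomic facts in a relational database" idea can be sketched with SQLite; the table layout and facts below are illustrative only and do not reproduce the patented system.

```python
# Minimal sketch of storing atomic subject-relation-object facts in a
# relational database and answering a simple query. Contents are illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE facts (subject TEXT, relation TEXT, object TEXT)")
conn.executemany("INSERT INTO facts VALUES (?, ?, ?)", [
    ("paris", "is_capital_of", "france"),
    ("france", "is_a", "country"),
])

def query(subject, relation):
    rows = conn.execute(
        "SELECT object FROM facts WHERE subject = ? AND relation = ?",
        (subject, relation)).fetchall()
    return [r[0] for r in rows]

print(query("paris", "is_capital_of"))   # -> ['france']
```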

Journal ArticleDOI
TL;DR: This paper overviews the relationship between decision making and CWW, and focuses on symbolic linguistic computing models that have been widely used in linguistic decision making to analyse if all of them can be considered inside of the CWW paradigm.
Abstract: It is common that experts involved in complex real-world decision problems use natural language for expressing their knowledge in uncertain frameworks. The language is inherent vague, hence probabilistic decision models are not very suitable in such cases. Therefore, other tools such as fuzzy logic and fuzzy linguistic approaches have been successfully used to model and manage such vagueness. The use of linguistic information implies to operate with such a type of information, i.e. processes of computing with words (CWW). Different schemes have been proposed to deal with those processes, and diverse symbolic linguistic computing models have been introduced to accomplish the linguistic computations. In this paper, we overview the relationship between decision making and CWW, and focus on symbolic linguistic computing models that have been widely used in linguistic decision making to analyse if all of them can be considered inside of the CWW paradigm.
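
As a concrete example of a symbolic linguistic computing model used in CWW, the widely cited 2-tuple representation encodes an aggregated value without loss of information; the term set and expert opinions below are illustrative.

```python
# Worked example of the 2-tuple linguistic representation: an aggregated value
# beta in [0, g] is encoded as (s_i, alpha) with i = round(beta) and
# alpha = beta - i, so no information is lost. The term set is illustrative.
S = ["none", "very_low", "low", "medium", "high", "very_high", "perfect"]

def to_two_tuple(beta):
    i = round(beta)
    return S[i], beta - i

# Aggregate three expert opinions given as term indices, then re-encode.
opinions = [3, 4, 4]                     # medium, high, high
beta = sum(opinions) / len(opinions)     # 3.666...
print(to_two_tuple(beta))                # -> ('high', -0.333...)
```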

Journal ArticleDOI
TL;DR: This paper introduces Logical Semantics with Perception (LSP), a model for grounded language acquisition that learns to map natural language statements to their referents in a physical environment and finds that LSP outperforms existing, less expressive models that cannot represent relational language.
Abstract: This paper introduces Logical Semantics with Perception (LSP), a model for grounded language acquisition that learns to map natural language statements to their referents in a physical environment. For example, given an image, LSP can map the statement “blue mug on the table” to the set of image segments showing blue mugs on tables. LSP learns physical representations for both categorical (“blue,” “mug”) and relational (“on”) language, and also learns to compose these representations to produce the referents of entire statements. We further introduce a weakly supervised training procedure that estimates LSP’s parameters using annotated referents for entire statements, without annotated referents for individual words or the parse structure of the statement. We perform experiments on two applications: scene understanding and geographical question answering. We find that LSP outperforms existing, less expressive models that cannot represent relational language. We further find that weakly supervised training is competitive with fully supervised training while requiring significantly less annotation effort.

Journal ArticleDOI
TL;DR: The results showed that language ability predicts gains in data analysis/probability and geometry, but not in arithmetic or algebra, after controlling for visual-spatial working memory, reading ability, and sex, which suggests that early language experiences are important for later mathematical development regardless of language background.

Journal ArticleDOI
01 Jan 2013-Database
TL;DR: A simple extensible mark-up language format to share text documents and annotations is proposed, which allows a large number of different annotations to be represented, including sentences, tokens, parts of speech, named entities such as genes or diseases, and relationships between named entities.
Abstract: A vast amount of scientific information is encoded in natural language text, and the quantity of such text has become so great that it is no longer economically feasible to have a human as the first step in the search process. Natural language processing and text mining tools have become essential to facilitate the search for and extraction of information from text. This has led to vigorous research efforts to create useful tools and to create humanly labeled text corpora, which can be used to improve such tools. To encourage combining these efforts into larger, more powerful and more capable systems, a common interchange format to represent, store and exchange the data in a simple manner between different language processing systems and text mining tools is highly desirable. Here we propose a simple extensible mark-up language format to share text documents and annotations. The proposed annotation approach allows a large number of different annotations to be represented including sentences, tokens, parts of speech, named entities such as genes or diseases and relationships between named entities. In addition, we provide simple code to hold this data, read it from and write it back to extensible mark-up language files and perform some sample processing. We also describe completed as well as ongoing work to apply the approach in several directions. Code and data are available at http://bioc.sourceforge.net/. Database URL: http://bioc.sourceforge.net/
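
A hedged sketch of reading a small BioC-style document with the standard library follows; the element and attribute names are taken from published BioC examples and should be checked against the DTD at http://bioc.sourceforge.net/.

```python
# Hedged sketch of parsing a small BioC-style XML document with the standard
# library. Element and attribute names follow published BioC examples
# (collection/document/passage/annotation with infon/location); check the DTD.
import xml.etree.ElementTree as ET

BIOC_XML = """
<collection>
  <source>example</source>
  <document>
    <id>12345</id>
    <passage>
      <offset>0</offset>
      <text>BRCA1 is associated with breast cancer.</text>
      <annotation id="T1">
        <infon key="type">gene</infon>
        <location offset="0" length="5"/>
        <text>BRCA1</text>
      </annotation>
    </passage>
  </document>
</collection>
"""

root = ET.fromstring(BIOC_XML)
for ann in root.iter("annotation"):
    ann_type = ann.find("infon[@key='type']").text
    print(ann.get("id"), ann_type, ann.find("text").text)
# -> T1 gene BRCA1
```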

Patent
25 Jun 2013
TL;DR: In this article, an electronic device with one or more processors and memory provides a digital photograph of a real-world scene and provides a natural language text string corresponding to a speech input associated with the digital photograph.
Abstract: The electronic device with one or more processors and memory provides a digital photograph of a real-world scene. The electronic device provides a natural language text string corresponding to a speech input associated with the digital photograph. The electronic device performs natural language processing on the text string to identify one or more terms associated with an entity, an activity, or a location. The electronic device tags the digital photograph with the one or more terms and their associated entity, activity, or location.

Journal ArticleDOI
01 Jan 2013
TL;DR: The negative association between the number of second language speakers and nominal case complexity generalizes to different language areas and families, supporting the idea that morphosyntactic complexity is reduced by a high degree of language contact involving adult learners.
Abstract: In this paper, we provide quantitative evidence showing that languages spoken by many second language speakers tend to have relatively small nominal case systems or no nominal case at all. In our sample, all languages with more than 50% second language speakers had no nominal case. The negative association between the number of second language speakers and nominal case complexity generalizes to different language areas and families. As there are many studies attesting to the difficulty of acquiring morphological case in second language acquisition, this result supports the idea that languages adapt to the cognitive constraints of their speakers, as well as to the sociolinguistic niches of their speaking communities. We discuss our results with respect to sociolinguistic typology and the Linguistic Niche Hypothesis, as well as with respect to qualitative data from historical linguistics. All in all, multiple lines of evidence converge on the idea that morphosyntactic complexity is reduced by a high degree of language contact involving adult learners.

Journal ArticleDOI
22 Jul 2013
TL;DR: A new metaphor of two-dimensional text for data-driven semantic modeling of natural language is proposed, which provides an entirely new angle on the representation of text: not only syntagmatic relations are annotated in the text, but also paradigmatic relations are made explicit by generating lexical expansions.
Abstract: A new metaphor of two-dimensional text for data-driven semantic modeling of natural language is proposed, which provides an entirely new angle on the representation of text: not only syntagmatic relations are annotated in the text, but also paradigmatic relations are made explicit by generating lexical expansions. We operationalize distributional similarity in a general framework for large corpora, and describe a new method to generate similar terms in context. Our evaluation shows that distributional similarity is able to produce high-quality lexical resources in an unsupervised and knowledge-free way, and that our highly scalable similarity measure yields better scores in a WordNet-based evaluation than previous measures for very large corpora. Evaluating on a lexical substitution task, we find that our contextualization method improves over a non-contextualized baseline across all parts of speech, and we show how the metaphor can be applied successfully to part-of-speech tagging. A number of ways to extend and improve the contextualization method within our framework are discussed. As opposed to comparable approaches, our framework defines a model of lexical expansions in context that can generate the expansions as opposed to ranking a given list, and thus does not require existing lexical-semantic resources.
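
The distributional-similarity backbone of the framework can be sketched with toy context counts and cosine similarity; real counts would come from a large corpus, and the paper's contribution is the scalable, contextualized version of this computation.

```python
# Small sketch of distributional similarity: represent each term by its
# context counts and compare terms with cosine similarity. The toy counts
# stand in for counts collected from a large corpus.
import math

contexts = {
    "coffee": {"drink": 4, "hot": 3, "cup": 5},
    "tea":    {"drink": 3, "hot": 4, "cup": 4},
    "car":    {"drive": 5, "road": 4, "fast": 2},
}

def cosine(a, b):
    dot = sum(a[k] * b.get(k, 0) for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

print(cosine(contexts["coffee"], contexts["tea"]))   # high similarity
print(cosine(contexts["coffee"], contexts["car"]))   # 0.0
```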


Proceedings Article
01 Aug 2013
TL;DR: A sophisticated template-based approach is introduced that incorporates semantic role labels into a system that automatically generates natural language questions to support online learning; preliminary evaluation indicates it is a promising approach for supporting learning.
Abstract: When instructors prepare learning materials for students, they frequently develop accompanying questions to guide learning. Natural language processing technology can be used to automatically generate such questions but techniques used have not fully leveraged semantic information contained in the learning materials or the full context in which the question generation task occurs. We introduce a sophisticated template-based approach that incorporates semantic role labels into a system that automatically generates natural language questions to support online learning. While we have not yet incorporated the full learning context into our approach, our preliminary evaluation and evaluation methodology indicate our approach is a promising one for supporting learning.
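
A toy sketch of template-based question generation over semantic role labels follows; the SRL frame and templates are hand-written placeholders rather than the system's actual templates.

```python
# Toy template-based question generation over semantic role labels: given a
# predicate and its arguments, fill a matching template. The frame and
# templates are hand-written placeholders; a real system obtains frames from
# an SRL parser and selects templates more carefully.
TEMPLATES = {
    ("ARG0", "ARG1"): "What did {ARG0} {predicate}?",
    ("ARG0",):        "Who {predicate}?",
}

def generate_question(frame):
    args = tuple(sorted(k for k in frame if k.startswith("ARG")))
    template = TEMPLATES.get(args)
    return template.format(**frame) if template else None

frame = {"predicate": "discover", "ARG0": "Marie Curie", "ARG1": "radium"}
print(generate_question(frame))   # -> "What did Marie Curie discover?"
```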

Journal ArticleDOI
22 Oct 2013-PLOS ONE
TL;DR: It is argued that this consensus figure vastly underestimates the danger of digital language death, in that less than 5% of all languages can still ascend to the digital realm.
Abstract: Of the approximately 7,000 languages spoken today, some 2,500 are generally considered endangered. Here we argue that this consensus figure vastly underestimates the danger of digital language death, in that less than 5% of all languages can still ascend to the digital realm. We present evidence of a massive die-off caused by the digital divide.

Book ChapterDOI
01 Jan 2013
TL;DR: Permission statements contain occurrences of a certain one-place existential operator which are often concealed in the surface structure; this operator is represented by the letter 'F' and called the 'focus operator', as its function is to move the sub-formula that stands within its scope into focus and thereby subject it to the particular semantic-pragmatic operation associated with the mode, or pragmatic function.
Abstract: Permission statements contain occurrences of a certain one-place existential operator which are often concealed in the surface structure. We represent this operator by the letter 'F' and call F the 'focus operator', as its function is to 'move' the sub-formula that stands within its scope into focus and thereby subject it to the particular semantic-pragmatic operation associated with the mode, or pragmatic function. It has been a tacit assumption behind most formal analyses of natural language that at least all assertions have the same pragmatic function. The semantic analysis of natural language is of fundamentally greater complexity than is imagined by those who believe that the semantics of at least the declarative parts of natural languages can be described by theories based on no other concepts than those of satisfaction and truth. Keywords: language; permission; pragmatic; semantics