
Showing papers on "Natural language published in 2006"


Proceedings ArticleDOI
17 Jul 2006
TL;DR: The Natural Language Toolkit has been rewritten, simplifying many linguistic data structures and taking advantage of recent enhancements in the Python language.
Abstract: The Natural Language Toolkit is a suite of program modules, data sets and tutorials supporting research and teaching in computational linguistics and natural language processing. NLTK is written in Python and distributed under the GPL open source license. Over the past year the toolkit has been rewritten, simplifying many linguistic data structures and taking advantage of recent enhancements in the Python language. This paper reports on the simplified toolkit and explains how it is used in teaching NLP.
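For readers new to the toolkit, a minimal sketch of the kind of task NLTK supports, tokenizing and part-of-speech tagging a sentence, is shown below; it uses the present-day NLTK API (the 2006 interfaces described in the paper differ in detail) and assumes the required tokenizer and tagger resources have been downloaded.

```python
# Minimal NLTK sketch: tokenize and POS-tag one sentence.
# Uses the modern NLTK API; resource names vary across NLTK versions.
import nltk

nltk.download("punkt", quiet=True)                       # tokenizer models
nltk.download("averaged_perceptron_tagger", quiet=True)  # tagger model

sentence = "The Natural Language Toolkit supports teaching and research in NLP."
tokens = nltk.word_tokenize(sentence)   # ['The', 'Natural', 'Language', 'Toolkit', ...]
tagged = nltk.pos_tag(tokens)           # [('The', 'DT'), ('Natural', 'NNP'), ...]
print(tagged)
```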

2,835 citations


Journal ArticleDOI
TL;DR: This article is an attempt to validate the output of Leximancer, using a set of evaluation criteria taken from content analysis that are appropriate for knowledge discovery tasks.
Abstract: The Leximancer system is a relatively new method for transforming lexical co-occurrence information from natural language into semantic patterns in an unsupervised manner. It employs two stages of co-occurrence information extraction, semantic and relational, using a different algorithm for each stage. The algorithms used are statistical, but they employ nonlinear dynamics and machine learning. This article is an attempt to validate the output of Leximancer, using a set of evaluation criteria taken from content analysis that are appropriate for knowledge discovery tasks.
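Leximancer itself is proprietary, but the first stage described here, gathering lexical co-occurrence statistics from raw text, can be sketched as below; the window size and tokenization are illustrative assumptions, not the system's actual settings.

```python
# Sketch of lexical co-occurrence counting within a sliding window.
# Window size and tokenization are illustrative; Leximancer's algorithms differ.
from collections import Counter
import re

def cooccurrence_counts(text, window=5):
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    for i, left in enumerate(tokens):
        for right in tokens[i + 1 : i + window]:
            if left != right:
                counts[tuple(sorted((left, right)))] += 1
    return counts

text = ("natural language text yields semantic patterns; "
        "language patterns emerge from natural text")
print(cooccurrence_counts(text).most_common(3))
```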

1,034 citations


Proceedings ArticleDOI
17 Jul 2006
TL;DR: It is shown that an approximation to the hierarchical Pitman-Yor language model recovers the exact formulation of interpolated Kneser-Ney, one of the best smoothing methods for n-gram language models.
Abstract: We propose a new hierarchical Bayesian n-gram model of natural languages. Our model makes use of a generalization of the commonly used Dirichlet distributions called Pitman-Yor processes which produce power-law distributions more closely resembling those in natural languages. We show that an approximation to the hierarchical Pitman-Yor language model recovers the exact formulation of interpolated Kneser-Ney, one of the best smoothing methods for n-gram language models. Experiments verify that our model gives cross entropy results superior to interpolated Kneser-Ney and comparable to modified Kneser-Ney.
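For context on the equivalence claimed here, interpolated Kneser-Ney for bigrams can be written directly from counts, as in the sketch below; the fixed discount is an illustrative constant rather than one estimated from count-of-count statistics.

```python
# Bigram interpolated Kneser-Ney smoothing (standard formulation, illustrative fixed discount d).
from collections import Counter, defaultdict

def train_kn(tokens, d=0.75):
    bigram_c = Counter(zip(tokens, tokens[1:]))
    history_c = Counter(tokens[:-1])          # counts of each word as a bigram history
    followers = defaultdict(set)              # distinct continuations of each history
    preceders = defaultdict(set)              # distinct histories preceding each word
    for h, w in bigram_c:
        followers[h].add(w)
        preceders[w].add(h)
    n_bigram_types = len(bigram_c)

    def p_continuation(w):
        return len(preceders[w]) / n_bigram_types

    def p_kn(w, h):
        if history_c[h] == 0:                 # unseen history: back off to continuation prob.
            return p_continuation(w)
        discounted = max(bigram_c[(h, w)] - d, 0) / history_c[h]
        lam = d * len(followers[h]) / history_c[h]
        return discounted + lam * p_continuation(w)

    return p_kn

p_kn = train_kn("the cat sat on the mat the cat ate".split())
print(p_kn("cat", "the"), p_kn("mat", "the"))
```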

580 citations


Journal ArticleDOI
TL;DR: The authors argue that language selection depends on a set of factors that vary according to the experience of the bilinguals, the demands of the production task, and the degree of activity of the nontarget language.
Abstract: Bilingual speech requires that the language of utterances be selected prior to articulation. Past research has debated whether the language of speaking can be determined in advance of speech planning and, if not, the level at which it is eventually selected. We argue that the reason that it has been difficult to come to an agreement about language selection is that there is not a single locus of selection. Rather, language selection depends on a set of factors that vary according to the experience of the bilinguals, the demands of the production task, and the degree of activity of the nontarget language. We demonstrate that it is possible to identify some conditions that restrict speech planning to one language alone and others that open the process to cross-language influences. We conclude that the presence of language nonselectivity at all levels of planning spoken utterances renders the system itself fundamentally nonselective.

539 citations


Patent
04 Aug 2006
TL;DR: In this article, a conversational human-machine interface is presented that includes a conversational speech analyzer, a general cognitive model, an environmental model, and a personalized cognitive model to determine context and domain knowledge and to invoke prior information to interpret a spoken utterance or a received non-spoken message.
Abstract: A system and method are provided for receiving speech and/or non-speech communications of natural language questions and/or commands and executing the questions and/or commands. The invention provides a conversational human-machine interface that includes a conversational speech analyzer, a general cognitive model, an environmental model, and a personalized cognitive model to determine context, domain knowledge, and invoke prior information to interpret a spoken utterance or a received non-spoken message. The system and method creates, stores and uses extensive personal profile information for each user, thereby improving the reliability of determining the context of the speech or non-speech communication and presenting the expected results for a particular question or command.

430 citations


Journal Article
Shi Bing1
TL;DR: Text categorization, the assignment of natural language texts to one or more predefined categories based on their content, is an important component in many information organization and management tasks.

Abstract: Text categorization, the assignment of natural language texts to one or more predefined categories based on their content, is an important component in many information organization and management tasks. Different automatic learning algorithms for text categorization achieve different levels of classification accuracy. Very accurate text classifiers can be learned automatically from training examples.
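As a concrete illustration of learning a text categorizer from labelled examples (not the specific algorithms surveyed in this entry), a bag-of-words naive Bayes pipeline in scikit-learn looks roughly like the sketch below; the tiny training set and categories are invented.

```python
# Hypothetical sketch: learning a text categorizer from a handful of labelled examples.
# Training documents and category labels are invented for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

docs = [
    "stock markets fell sharply on interest rate fears",
    "the central bank raised rates again",
    "the striker scored twice in the final",
    "the home team won the championship match",
]
labels = ["finance", "finance", "sports", "sports"]

classifier = make_pipeline(TfidfVectorizer(), MultinomialNB())
classifier.fit(docs, labels)
print(classifier.predict(["rates rose after the bank meeting"]))  # expected: ['finance']
```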

384 citations


Proceedings Article
01 Apr 2006
TL;DR: This paper provides a simple algorithm to compute tree kernels in linear average running time, and a study on the classification properties of diverse tree kernels shows that kernel combinations always improve the traditional methods.

Abstract: In recent years tree kernels have been proposed for the automatic learning of natural language applications. Unfortunately, they show (a) an inherent super-linear complexity and (b) a lower accuracy than traditional attribute/value methods. In this paper, we show that tree kernels are very helpful in the processing of natural language as (a) we provide a simple algorithm to compute tree kernels in linear average running time and (b) our study on the classification properties of diverse tree kernels shows that kernel combinations always improve the traditional methods. Experiments with Support Vector Machines on the predicate argument classification task provide empirical support to our thesis.
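To make the notion of a tree kernel concrete, the sketch below gives the standard recursive formulation that counts shared fragments between two parse trees (in the style of Collins and Duffy); note this is the textbook quadratic version, not the linear-average-time algorithm the paper proposes.

```python
# Subset-tree kernel sketch: counts common fragments between two parse trees.
# Textbook O(|T1|*|T2|) recursion; trees are (label, children...) tuples, leaves are strings.
def nodes(tree):
    yield tree
    for child in tree[1:]:
        if isinstance(child, tuple):
            yield from nodes(child)

def production(tree):
    return (tree[0], tuple(c[0] if isinstance(c, tuple) else c for c in tree[1:]))

def delta(n1, n2, lam):
    if production(n1) != production(n2):
        return 0.0
    score = lam
    for c1, c2 in zip(n1[1:], n2[1:]):
        if isinstance(c1, tuple) and isinstance(c2, tuple):
            score *= 1.0 + delta(c1, c2, lam)
    return score

def tree_kernel(t1, t2, lam=0.4):
    return sum(delta(n1, n2, lam) for n1 in nodes(t1) for n2 in nodes(t2))

t1 = ("S", ("NP", "John"), ("VP", ("V", "runs")))
t2 = ("S", ("NP", "Mary"), ("VP", ("V", "runs")))
print(tree_kernel(t1, t2))  # shared S, VP and V fragments contribute
```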

335 citations


Patent
14 Aug 2006
TL;DR: A human-computer interface system and methods for providing intelligent, adaptive, multimodal interaction with users while accomplishing tasks on their behalf in some particular domain or combination of domains as mentioned in this paper.
Abstract: A human-computer interface system and methods for providing intelligent, adaptive, multimodal interaction with users while accomplishing tasks on their behalf in some particular domain or combination of domains. Specifically, this system accepts user input via natural language text, mouse actions, human speech, whistles, gestures, pedal movements, facial or postural changes, and conveys results via natural language text, automatically-generated speech, and displays of graphs, tables, animation, video, and mechanical and chemical effectors that convey heat, tactile sensation, taste and smell.

311 citations


Proceedings ArticleDOI
17 Jul 2006
TL;DR: This paper used temporal reasoning as an over-sampling method to dramatically expand the amount of training data, resulting in predictive accuracy on link labeling as high as 93% using a Maximum Entropy classifier on human annotated data.
Abstract: This paper investigates a machine learning approach for temporally ordering and anchoring events in natural language texts. To address data sparseness, we used temporal reasoning as an over-sampling method to dramatically expand the amount of training data, resulting in predictive accuracy on link labeling as high as 93% using a Maximum Entropy classifier on human annotated data. This method compared favorably against a series of increasingly sophisticated baselines involving expansion of rules derived from human intuitions.
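The oversampling idea, using temporal reasoning to derive extra labelled links from the hand-annotated ones, can be illustrated with its simplest case, transitivity of BEFORE, as sketched below; the event names and seed annotations are invented.

```python
# Sketch of closure-based oversampling: infer additional BEFORE links by transitivity.
# Event names and seed annotations are hypothetical; full temporal closure covers more relations.
def transitive_closure(before_pairs):
    closure = set(before_pairs)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure

annotated = {("wake_up", "breakfast"), ("breakfast", "commute"), ("commute", "meeting")}
expanded = transitive_closure(annotated)
print(len(annotated), "annotated links ->", len(expanded), "training links")
# Inferred links such as (wake_up, meeting) become additional training examples.
```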

293 citations


Proceedings ArticleDOI
17 Jul 2006
TL;DR: This work presents a new approach for mapping natural language sentences to their formal meaning representations using string-kernel-based classifiers, which compares favorably to other existing systems and is particularly robust to noise.
Abstract: We present a new approach for mapping natural language sentences to their formal meaning representations using string-kernel-based classifiers. Our system learns these classifiers for every production in the formal language grammar. Meaning representations for novel natural language sentences are obtained by finding the most probable semantic parse using these string classifiers. Our experiments on two real-world data sets show that this approach compares favorably to other existing systems and is particularly robust to noise.

253 citations


Journal ArticleDOI
TL;DR: This essay questions the "inference" model, which turns to a multilingual writer's first language and culture to explain difficulties in composing an essay in English, and its "correlationist" modification, arguing that such approaches fail to do justice to the mediation that complicates the realization of texts in different languages and to the creativity of multilingual writers.
Abstract: The dominant approaches to studying multilingual writing have been hampered by monolingualist assumptions that conceive literacy as a unidirectional acquisition of competence, preventing us from fully understanding the resources multilinguals bring to their texts. In this essay, I attempt to change the questions and frameworks of such inquiry in order to do justice to the creativity of multilingual writers. How do teachers and researchers of English writing orient to linguistic and cultural difference in the essays they read? In what I will call the "inference" model, if they see a peculiar tone, style, organization, or discourse, many teachers instinctively turn to the first language (L1) or "native" culture (C1) of the writer for an explanation. This was the practice of some early versions of contrastive rhetoric (see Kaplan). Even now, sympathetic scholars in our field seek explanations from L1 or C1 for what they perceive as difficulties for multilingual writers in composing an essay in English (see Fox). Among other problems, the writer is treated as being conditioned so strongly by L1 and C1 that even when he or she writes in another language, those influences are supposed to manifest themselves in the new text. There is also the misleading assumption that one can unproblematically describe the traditions of L1 literacy by studying the English essay of a multilingual writer (even if the writer is a student in a developmental writing program). While the inference model fails to acknowledge the different types of mediation that can complicate the realization of texts in different languages, some scholars have now slightly modified their approach in what I call a "correlationist" model. They study the texts in L1 descriptively before they draw on this information to

BookDOI
15 Jan 2006
TL;DR: This volume presents in-depth introductions to major aspects of language documentation, including overviews on fieldwork ethics and data processing, guidelines for the basic annotation of digitally-stored multimedia corpora and a discussion on how to build and maintain a language archive.
Abstract: Language documentation is a rapidly emerging new field in linguistics which is concerned with the methods, tools and theoretical underpinnings for compiling a representative and lasting multipurpose record of a natural language. This volume presents in-depth introductions to major aspects of language documentation, including overviews on fieldwork ethics and data processing, guidelines for the basic annotation of digitally-stored multimedia corpora and a discussion on how to build and maintain a language archive. It combines theoretical and practical considerations and makes specific suggestions for the most common problems encountered in language documentation. Key features: a textbook introduction to language documentation that considers all common problems.

Book
Joakim Nivre1
01 Jan 2006
TL;DR: This book provides an in-depth description of the framework of inductive dependency parsing, a methodology for robust and efficient syntactic analysis of unrestricted natural language text.
Abstract: This book provides an in-depth description of the framework of inductive dependency parsing, a methodology for robust and efficient syntactic analysis of unrestricted natural language text. This me ...

Book
06 Jan 2006
TL;DR: With a facsimile of the Siloam Inscription by Euting, J., a table of alphabets by Lidzbarski, M., and a dictionary of all the alphabetic characters.
Abstract: With a facsimile of the Siloam Inscription by: Euting, J.; a table of alphabets by: Lidzbarski, M.;

Book
01 Jan 2006
TL;DR: The relation of language to thought is discussed, along with the claim that linguistics is not psychology, positions on psychological reality, and arguments for the representational thesis.
Abstract: I. Linguistics is not psychology. II. Positions on psychological reality. III. 'Philosophical' arguments for the representational thesis. IV. The relation of language to thought. V. Language use and acquisition.

Journal ArticleDOI
TL;DR: This paper presents an approach to ontology‐based GI retrieval that contributes to solving existing problems of semantic heterogeneity and hides most of the complexity of the required procedure from the requester.
Abstract: Discovering and accessing suitable geographic information (GI) in the open and distributed environments of current Spatial Data Infrastructures (SDIs) is a crucial task. Catalogues provide searchable repositories of information descriptions, but the mechanisms to support GI retrieval are still insufficient. Problems of semantic heterogeneity caused by the ambiguity of natural language can arise during keyword‐based search in catalogues and when formulating a query to access the discovered data. In this paper, we present an approach to ontology‐based GI retrieval that contributes to solving existing problems of semantic heterogeneity and hides most of the complexity of the required procedure from the requester. A query language and graphical user interface allow a requester to intuitively formulate a query using a well‐known domain vocabulary. From this query, an ontology concept is derived, which is then used to search a catalogue for a data source that provides all the information required to answer the ...

Patent
Brian Tunning1, Evan Gridley1
06 Jun 2006
TL;DR: In this article, a PIM application provides a single page natural language interface for entering and managing PIM data, which can be associated with a task, calendar, contact, or other data type.
Abstract: A PIM application provides a single page natural language interface for entering and managing PIM data. The natural language interface may receive a natural language entry as a text character string. The entry may be associated with a task, calendar, contact or other PIM data type. The received entries are processed (for example, parsed) to determine the PIM data type and other information. The original entry is not discarded from the natural language interface as a result of processing. After processing one or more received natural language entries, the entries remain in the natural language interface to be viewed and managed. The entry is maintained so it can be managed with other natural language entries provided in the interface.
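The kind of lightweight processing the patent describes can be pictured with the hypothetical sketch below, which assigns a free-text entry to a calendar, contact, or task type; the patterns and type names are invented, not taken from the patent.

```python
# Hypothetical sketch: classify a natural language PIM entry by simple pattern matching.
# Patterns and data-type names are invented for illustration.
import re

def classify_entry(entry):
    text = entry.lower()
    if re.search(r"\b(\d{1,2}(:\d{2})?\s*(am|pm)|tomorrow|monday|tuesday|wednesday|"
                 r"thursday|friday|saturday|sunday)\b", text):
        return "calendar"
    if re.search(r"\b(phone|email|call)\b|@", text):
        return "contact"
    return "task"

entries = [
    "Lunch with Sam tomorrow at 12:30pm",
    "Jane Doe, email jane@example.com",
    "Buy printer paper",
]
for entry in entries:
    # The original entry text is kept alongside its inferred type, as the patent describes.
    print(classify_entry(entry), "-", entry)
```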

Proceedings ArticleDOI
08 Jun 2006
TL;DR: The best performing systems at the TREC Question Answering track employ parsing for analyzing sentences in order to identify the query focus, to extract relations and to disambiguate meanings of words.
Abstract: Parsing natural language is an essential step in several applications that involve document analysis, e.g. knowledge extraction, question answering, summarization, filtering. The best performing systems at the TREC Question Answering track employ parsing for analyzing sentences in order to identify the query focus, to extract relations and to disambiguate meanings of words.

Journal ArticleDOI
TL;DR: The computational theory of perceptions (CTP) as discussed by the authors is based on the methodology of computing with words, which is inspired by the remarkable human capability to perform a wide variety of physical and mental tasks without any measurements and any computations.
Abstract: Interest in issues relating to consciousness has grown markedly during the last several years. And yet, nobody can claim that consciousness is a well-understood concept that lends itself to precise analysis. It may be argued that, as a concept, consciousness is much too complex to fit into the conceptual structure of existing theories based on Aristotelian logic and probability theory. An approach suggested in this paper links consciousness to perceptions and perceptions to their descriptors in a natural language. In this way, those aspects of consciousness which relate to reasoning and concept formation are linked to what is referred to as the methodology of computing with words (CW). Computing, in its usual sense, is centered on manipulation of numbers and symbols. In contrast, computing with words, or CW for short, is a methodology in which the objects of computation are words and propositions drawn from a natural language (e.g., small, large, far, heavy, not very likely, the price of gas is low and declining, Berkeley is near San Francisco, it is very unlikely that there will be a significant increase in the price of oil in the near future, etc.). Computing with words is inspired by the remarkable human capability to perform a wide variety of physical and mental tasks without any measurements and any computations. Familiar examples of such tasks are parking a car, driving in heavy traffic, playing golf, riding a bicycle, understanding speech, and summarizing a story. Underlying this remarkable capability is the brain's crucial ability to manipulate perceptions--perceptions of distance, size, weight, color, speed, time, direction, force, number, truth, likelihood, and other characteristics of physical and mental objects. Manipulation of perceptions plays a key role in human recognition, decision and execution processes. As a methodology, computing with words provides a foundation for a computational theory of perceptions: a theory which may have an important bearing on how humans make--and machines might make--perception-based rational decisions in an environment of imprecision, uncertainty, and partial truth. A basic difference between perceptions and measurements is that, in general, measurements are crisp, whereas perceptions are fuzzy. One of the fundamental aims of science has been and continues to be that of progressing from perceptions to measurements. Pursuit of this aim has led to brilliant successes. We have sent men to the moon; we can build computers that are capable of performing billions of computations per second; we have constructed telescopes that can explore the far reaches of the universe; and we can date the age of rocks that are millions of years old. But alongside the brilliant successes stand conspicuous underachievements and outright failures. We cannot build robots that can move with the agility of animals or humans; we cannot automate driving in heavy traffic; we cannot translate from one language to another at the level of a human interpreter; we cannot create programs that can summarize non-trivial stories; our ability to model the behavior of economic systems leaves much to be desired; and we cannot build machines that can compete with children in the performance of a wide variety of physical and cognitive tasks. It may be argued that underlying the underachievements and failures is the unavailability of a methodology for reasoning and computing with perceptions rather than measurements. 
An outline of such a methodology--referred to as a computational theory of perceptions--is presented in this paper. The computational theory of perceptions (CTP) is based on the methodology of CW. In CTP, words play the role of labels of perceptions, and, more generally, perceptions are expressed as propositions in a natural language. CW-based techniques are employed to translate propositions expressed in a natural language into what is called the Generalized Constraint Language (GCL). In this language, the meaning of a proposition is expressed as a generalized constraint, X isr R, where X is the constrained variable, R is the constraining relation, and isr is a variable copula in which r is an indexing variable whose value defines the way in which R constrains X. Among the basic types of constraints are possibilistic, veristic, probabilistic, random set, Pawlak set, fuzzy graph, and usuality. The wide variety of constraints in GCL makes GCL a much more expressive language than the language of predicate logic. In CW, the initial and terminal data sets, IDS and TDS, are assumed to consist of propositions expressed in a natural language. These propositions are translated, respectively, into antecedent and consequent constraints. Consequent constraints are derived from antecedent constraints through the use of rules of constraint propagation. The principal constraint propagation rule is the generalized extension principle. (ABSTRACT TRUNCATED)
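As a small worked illustration of the canonical form mentioned above (constructed here, not quoted from the paper), the proposition "Carol is young" translates into a possibilistic generalized constraint on the variable Age(Carol):

```latex
% Illustrative GCL translation of "Carol is young" (canonical form: X isr R).
% With r left blank, the constraint is possibilistic in Zadeh's notation.
\[
\underbrace{\mathrm{Age}(\mathrm{Carol})}_{X}\ \text{is}\ \underbrace{\mathit{young}}_{R},
\qquad
\mathrm{Poss}\{\mathrm{Age}(\mathrm{Carol}) = u\} = \mu_{\mathit{young}}(u),
\]
% where \mu_{young} is the membership function of the fuzzy set "young".
```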

Book ChapterDOI
05 Nov 2006
TL;DR: This paper introduces GINO, a guided input natural language ontology editor that allows users to edit and query ontologies in a language akin to English, and argues that the use of guided entry overcomes the habitability problem, which adversely affects most natural language systems.

Abstract: The casual user is typically overwhelmed by the formal logic of the Semantic Web. The gap between the end user and the logic-based scaffolding has to be bridged if the Semantic Web's capabilities are to be utilized by the general public. This paper proposes that controlled natural languages offer one way to bridge the gap. We introduce GINO, a guided input natural language ontology editor that allows users to edit and query ontologies in a language akin to English. It uses a small static grammar, which it dynamically extends with elements from the loaded ontologies. The usability evaluation shows that GINO is well-suited for novice users when editing ontologies. We believe that the use of guided entry overcomes the habitability problem, which adversely affects most natural language systems. Additionally, the approach's dynamic grammar generation allows for easy adaptation to new ontologies.

Proceedings ArticleDOI
20 Aug 2006
TL;DR: The authors used deep linguistic structures instead of surface text patterns to extract pairs of a given semantic relation from text documents and applied them to a corpus to find new pairs, and demonstrated the benefits of their approach by extensive experiments with their prototype system LEILA.
Abstract: The World Wide Web provides a nearly endless source of knowledge, which is mostly given in natural language. A first step towards exploiting this data automatically could be to extract pairs of a given semantic relation from text documents - for example all pairs of a person and her birthdate. One strategy for this task is to find text patterns that express the semantic relation, to generalize these patterns, and to apply them to a corpus to find new pairs. In this paper, we show that this approach profits significantly when deep linguistic structures are used instead of surface text patterns. We demonstrate how linguistic structures can be represented for machine learning, and we provide a theoretical analysis of the pattern matching approach. We show the benefits of our approach by extensive experiments with our prototype system LEILA.

Patent
17 Jan 2006
TL;DR: A natural language generation (NLG) software system that generates rich, content-sensitive human language descriptions based on unparsed raw domain-specific data is described in this paper.
Abstract: The invention is directed to a natural language generation (NLG) software system that generates rich, content-sensitive human language descriptions based on unparsed raw domain-specific data. In one embodiment, the NLG software system may include a data parser/normalizer, a comparator, a language engine, and a document generator. The data parser/normalizer may be configured to retrieve specification information for items to be described by the NLG software system, to extract pertinent information from the raw specification information, and to convert and normalize the extracted information so that the items may be compared specification by specification. The comparator may be configured to use the normalized data from the data parser/normalizer to compare the specifications of the items using comparison functions and interpretation rules to determine outcomes of the comparisons. The language engine may be configured to cycle through all or a subset of the normalized specification information, to retrieve all sentence templates associated with each of the item specifications, to call the comparator to compute or retrieve the results of the comparisons between the item specifications, and to recursively generate every possible syntactically legal sentence associated with the specifications based on the retrieved sentence templates. The document generator may be configured to select one or more discourse models having instructions regarding the selection, organization and modification of the generated sentences, and to apply the instructions of the discourse model to the generated sentences to generate a natural language description of the selected items.
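A hypothetical miniature of the comparator-plus-template idea described here is sketched below: normalized specifications are compared and the outcomes fill sentence templates; the product data, preference rules, and templates are all invented for illustration.

```python
# Hypothetical miniature of template-based NLG over compared item specifications.
# Item data, comparison rules, and sentence templates are invented for illustration.
specs = {
    "Camera A": {"weight_g": 380, "zoom_x": 10},
    "Camera B": {"weight_g": 495, "zoom_x": 12},
}
templates = {
    "weight_g": "{winner} is lighter than {loser} ({w_val} g vs. {l_val} g).",
    "zoom_x":   "{winner} offers more optical zoom than {loser} ({w_val}x vs. {l_val}x).",
}
prefer_lower = {"weight_g": True, "zoom_x": False}   # lower weight wins, higher zoom wins

def compare_and_describe(a, b):
    sentences = []
    for attr, template in templates.items():
        va, vb = specs[a][attr], specs[b][attr]
        if va == vb:
            continue                                  # no contrastive sentence for a tie
        a_wins = (va < vb) if prefer_lower[attr] else (va > vb)
        winner, loser = (a, b) if a_wins else (b, a)
        w_val, l_val = (va, vb) if a_wins else (vb, va)
        sentences.append(template.format(winner=winner, loser=loser, w_val=w_val, l_val=l_val))
    return " ".join(sentences)

print(compare_and_describe("Camera A", "Camera B"))
```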


Journal ArticleDOI
TL;DR: Evidence that analogical comparison is instrumental in language learning is reviewed, suggesting a larger role for general learning processes in the acquisition of language.
Abstract: The acquisition of language has long stood as a challenge to general learning accounts, leading many theorists to propose domain-specific knowledge and processes to explain language acquisition. Here we review evidence that analogical comparison is instrumental in language learning, suggesting a larger role for general learning processes in the acquisition of language.

Patent
14 Aug 2006
TL;DR: In this article, linguistic analysis is used to identify queries that use different natural language formations to request similar information, and common intent categories are identified for the queries requesting similar information. Intent responses can then be provided that are associated with the identified intent categories.
Abstract: Linguistic analysis is used to identify queries that use different natural language formations to request similar information. Common intent categories are identified for the queries requesting similar information. Intent responses can then be provided that are associated with the identified intent categories. An intent management tool can be used for identifying new intent categories, identifying obsolete intent categories, or refining existing intent categories.

Proceedings ArticleDOI
12 Jul 2006
TL;DR: The successful implementation of the parsing capabilities that are part of the functional version of the SPARCLE authoring utility is presented, including a set of grammars, executed on a shallow parser, that are designed to identify the rule elements in privacy policy rules.
Abstract: Today organizations do not have good ways of linking their written privacy policies with the implementation of those policies. To assist organizations in addressing this issue, our human-centered research has focused on understanding organizational privacy management needs, and, based on those needs, creating a usable and effective policy workbench called SPARCLE. SPARCLE will enable organizational users to enter policies in natural language, parse the policies to identify policy elements and then generate a machine readable (XML) version of the policy. In the future, SPARCLE will then enable mapping of policies to the organization's configuration and provide audit and compliance tools to ensure that the policy implementation operates as intended. In this paper, we present the strategies employed in the design and implementation of the natural language parsing capabilities that are part of the functional version of the SPARCLE authoring utility. We have created a set of grammars which execute on a shallow parser that are designed to identify the rule elements in privacy policy rules. We present empirical usability evaluation data from target organizational users of the SPARCLE system and highlight the parsing accuracy of the system with the organizations' privacy policies. The successful implementation of the parsing capabilities is an important step towards our goal of providing a usable and effective method for organizations to link the natural language version of privacy policies to their implementation, and subsequent verification through compliance auditing of the enforcement logs.
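A hypothetical sketch of the shallow-parsing step described here, pulling rule elements (user category, action, data category, purpose) out of a policy sentence and emitting them as XML, is shown below; the pattern and element names are invented and are not SPARCLE's actual grammars or schema.

```python
# Hypothetical sketch: shallow extraction of privacy-rule elements into XML.
# The regular expression and XML element names are invented; SPARCLE's grammars and schema differ.
import re
import xml.etree.ElementTree as ET

RULE_PATTERN = re.compile(
    r"(?P<user>.+?) (?:can|may) (?P<action>\w+) (?P<data>.+?) (?:for|to) (?P<purpose>.+)\.",
    re.IGNORECASE,
)

def policy_rule_to_xml(sentence):
    match = RULE_PATTERN.match(sentence)
    if match is None:
        return None
    rule = ET.Element("rule")
    for element in ("user", "action", "data", "purpose"):
        ET.SubElement(rule, element).text = match.group(element).strip()
    return ET.tostring(rule, encoding="unicode")

print(policy_rule_to_xml(
    "Billing staff can access customer account records for payment processing."
))
```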

Journal ArticleDOI
TL;DR: The focus of this work is to exploit data and to use machine learning techniques to create scalable SLU systems which can be quickly deployed for new domains with minimal human intervention.
Abstract: Spoken language understanding (SLU) aims at extracting meaning from natural language speech. Over the past decade, a variety of practical goal-oriented spoken dialog systems have been built for limited domains. SLU in these systems ranges from understanding predetermined phrases through fixed grammars, extracting some predefined named entities, extracting users' intents for call classification, to combinations of users' intents and named entities. In this paper, we present the SLU system of VoiceTone® (a service provided by AT&T), which includes extending statistical classifiers to seamlessly integrate hand-crafted classification rules with the rules learned from data, and developing an active learning framework to minimize the human labeling effort for quickly building the classifier models and adapting them to changes. We present an evaluation of this system using two deployed applications of VoiceTone®.

Proceedings ArticleDOI
26 Sep 2006
TL;DR: A better way to use synonym substitution is proposed, one that is no longer entirely guided by the mark-insertion process, but is also guided by a resilience requirement, subject to a maximum allowed distortion constraint.
Abstract: Information-hiding in natural language text has mainly consisted of carrying out approximately meaning-preserving modifications on the given cover text until it encodes the intended mark. A major technique for doing so has been synonym-substitution. In these previous schemes, synonym substitutions were done until the text "confessed", i.e., carried the intended mark message. We propose here a better way to use synonym substitution, one that is no longer entirely guided by the mark-insertion process: It is also guided by a resilience requirement, subject to a maximum allowed distortion constraint. Previous schemes for information hiding in natural language text did not use numeric quantification of the distortions introduced by transformations, they mainly used heuristic measures of quality based on conformity to a language model (and not in reference to the original cover text). When there are many alternatives to carry out a substitution on a word, we prioritize these alternatives according to a quantitative resilience criterion and use them in that order. In a nutshell, we favor the more ambiguous alternatives. In fact not only do we attempt to achieve the maximum ambiguity, but we want to simultaneously be as close as possible to the above-mentioned distortion limit, as that prevents the adversary from doing further transformations without exceeding the damage threshold; that is, we continue to modify the document even after the text has "confessed" to the mark, for the dual purpose of maximizing ambiguity while deliberately getting as close as possible to the distortion limit. The quantification we use makes possible an application of the existing information-theoretic framework, to the natural language domain, which has unique challenges not present in the image or audio domains. The resilience stems from both (i) the fact that the adversary does not know where the changes were made, and (ii) the fact that automated disambiguation is a major difficulty faced by any natural language processing system (what is bad news for the natural language processing area, is good news for our scheme's resilience). In addition to the above mentioned design and analysis, another contribution of this paper is the description of the implementation of the scheme and of the experimental data obtained.

Proceedings ArticleDOI
05 Apr 2006
TL;DR: Simple techniques based on comparing corpus frequencies, coupled with large quantities of data, are shown to be effective for identifying the events underlying changes in global moods.
Abstract: We describe a method for discovering irregularities in temporal mood patterns appearing in a large corpus of blog posts, and labeling them with a natural language explanation. Simple techniques based on comparing corpus frequencies, coupled with large quantities of data, are shown to be effective for identifying the events underlying changes in global moods.
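The frequency-comparison idea can be pictured as a simple ratio between a mood's share of posts in one time slice and its background share, as in the sketch below; the counts and threshold are invented, and the paper's actual scoring is more involved.

```python
# Sketch: flag moods whose share in a time slice far exceeds their background share.
# Counts and threshold are invented for illustration.
def mood_spikes(slice_counts, background_counts, threshold=2.0):
    slice_total = sum(slice_counts.values())
    background_total = sum(background_counts.values())
    spikes = {}
    for mood, count in slice_counts.items():
        slice_rate = count / slice_total
        background_rate = background_counts.get(mood, 1) / background_total
        ratio = slice_rate / background_rate
        if ratio >= threshold:
            spikes[mood] = round(ratio, 2)
    return spikes

background = {"happy": 5000, "sad": 3000, "excited": 1500, "anxious": 800}
one_day    = {"happy": 40,   "sad": 35,   "excited": 60,   "anxious": 9}
print(mood_spikes(one_day, background))   # 'excited' stands out on this invented day
```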

Journal ArticleDOI
TL;DR: This paper describes in detail an algorithm for the unsupervised learning of natural language morphology, with emphasis on challenges that are encountered in languages typologically similar to European languages.
Abstract: This paper describes in detail an algorithm for the unsupervised learning of natural language morphology, with emphasis on challenges that are encountered in languages typologically similar to European languages. It utilizes the Minimum Description Length analysis described in Goldsmith (2001), and has been implemented in software that is available for downloading and testing.
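To illustrate the Minimum Description Length intuition the algorithm builds on, the toy scorer below compares the cost of listing whole words against the cost of a stem-plus-suffix analysis; the flat per-character cost model is a deliberate simplification of Goldsmith's formulation.

```python
# Toy MDL comparison: cost of listing whole words vs. a stem + suffix analysis.
# The flat per-character cost is a simplification; Goldsmith (2001) uses a fuller coding scheme.
COST_PER_CHAR = 5.0   # illustrative bits per character

def listing_cost(words):
    return sum(COST_PER_CHAR * len(w) for w in words)

def signature_cost(stems, suffixes):
    # Each stem and suffix is stored once; pointer costs for the words are ignored here.
    return (sum(COST_PER_CHAR * len(s) for s in stems)
            + sum(COST_PER_CHAR * len(x) for x in suffixes))

words = ["jump", "jumps", "jumped", "jumping", "walk", "walks", "walked", "walking"]
stems, suffixes = ["jump", "walk"], ["", "s", "ed", "ing"]

print("whole-word listing cost:", listing_cost(words))
print("stem + suffix analysis cost:", signature_cost(stems, suffixes))
# The shorter description wins, so the stem/suffix analysis is preferred for this toy lexicon.
```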