
Showing papers on "Natural language published in 1995"


Journal ArticleDOI
TL;DR: WordNet provides a more effective combination of traditional lexicographic information and modern computing: an online lexical database designed for use under program control.
Abstract: Because meaningful sentences are composed of meaningful words, any system that hopes to process natural languages as people do must have information about words and their meanings. This information is traditionally provided through dictionaries, and machine-readable dictionaries are now widely available. But dictionary entries evolved for the convenience of human readers, not for machines. WordNet provides a more effective combination of traditional lexicographic information and modern computing. WordNet is an online lexical database designed for use under program control. English nouns, verbs, adjectives, and adverbs are organized into sets of synonyms, each representing a lexicalized concept. Semantic relations link the synonym sets [4].
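
For readers who want to see what "use under program control" looks like in practice, here is a minimal sketch using NLTK's WordNet interface, which postdates this paper but exposes the same synset-and-relation structure:

```python
# Minimal sketch: querying WordNet synsets and semantic relations via NLTK.
# Assumes NLTK and its WordNet data are installed (pip install nltk, then
# nltk.download('wordnet')). The NLTK interface postdates this 1995 paper.
from nltk.corpus import wordnet as wn

# Each synset is a set of synonyms representing one lexicalized concept.
for synset in wn.synsets('bank'):
    print(synset.name(), '-', synset.definition())

# Semantic relations link the synsets, e.g. hypernymy (is-a).
dog = wn.synset('dog.n.01')
print([h.name() for h in dog.hypernyms()])
```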

15,068 citations


Book
01 Jan 1995

916 citations


01 Jan 1995
TL;DR: Shows how lexical chains can be constructed by means of WordNet, and how they can be applied in one particular linguistic task: the detection and correction of malapropisms.
Abstract: Natural language utterances are, in general, highly ambiguous, and a unique interpretation can usually be determined only by taking into account the constraining influence of the context in which the utterance occurred. Much of the research in natural language understanding in the last twenty years can be thought of as attempts to characterize and represent context and then derive interpretations that fit best with that context. Typically, this research was heavy with AI, taking context to be nothing less than a complete conceptual understanding of the preceding utterances. This was reasonable, as such an understanding of a text was often the main task anyway. However, there are many text-processing tasks that require only a partial understanding of the text, and hence a ‘lighter’ representation of context is sufficient. In this paper, we examine the idea of lexical chains as such a representation. We show how they can be constructed by means of WordNet, and how they can be applied in one particular linguistic task: the detection and correction of malapropisms. A malapropism is the confounding of an intended word with another word of similar sound or similar spelling that has a quite different and malapropos meaning, e.g., an ingenuous [for ingenious] machine for peeling oranges. In this example, there is a one-letter difference between the malapropism and the correct word. Ignorance, or a simple typing mistake, might cause such errors. However, since ingenuous is a correctly spelled word, traditional spelling checkers cannot detect this kind of mistake. In section 4, we will propose an algorithm for detecting and correcting malapropisms that is based on the construction of lexical chains.
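
The chain-based idea can be sketched in a few lines. This is a toy, not the paper's section-4 algorithm: relatedness is approximated with WordNet path similarity, and the one-edit spelling-variant generator `spelling_variants` is assumed rather than implemented:

```python
# Toy sketch of chain-based malapropism detection. A word is suspicious if
# it relates to nothing in its context; a correction is a spelling-close
# variant that does relate. The paper's actual algorithm is more elaborate.
from nltk.corpus import wordnet as wn

def related(w1, w2, threshold=0.2):
    """Crude relatedness: best WordNet path similarity over noun senses."""
    for a in wn.synsets(w1, 'n'):
        for b in wn.synsets(w2, 'n'):
            sim = a.path_similarity(b)
            if sim is not None and sim >= threshold:
                return True
    return False

def suspect(word, context_words):
    """Suspicious if the word chains with no other word in the context."""
    return not any(related(word, c) for c in context_words if c != word)

def correction(word, context_words, spelling_variants):
    """Prefer a spelling-close variant that does chain with the context."""
    for candidate in spelling_variants(word):
        if not suspect(candidate, context_words):
            return candidate
    return None
```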

915 citations


Posted Content
TL;DR: This article developed a formal grammatical system called a link grammar and showed how English grammar can be encoded in such a system, and gave algorithms for efficiently parsing with a link grammar.
Abstract: We develop a formal grammatical system called a link grammar, show how English grammar can be encoded in such a system, and give algorithms for efficiently parsing with a link grammar. Although the expressive power of link grammars is equivalent to that of context free grammars, encoding natural language grammars appears to be much easier with the new system. We have written a program for general link parsing and written a link grammar for the English language. The performance of this preliminary system -- both in the breadth of English phenomena that it captures and in the computational resources used -- indicates that the approach may have practical uses as well as linguistic significance. Our program is written in C and may be obtained through the internet.
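
The connector idea behind link grammars can be illustrated with a toy checker (this is not the authors' C parser; real link grammars allow disjunctive connector lists and long, non-crossing links, not just adjacent words):

```python
# Toy illustration of the link-grammar idea: each word lists connectors it
# must satisfy to its left or right, and a sentence is accepted if matching
# connectors can be linked. Only adjacent-word links are tried here.
LEXICON = {
    'the': {'left': [],    'right': ['D']},   # determiner links right to a noun
    'cat': {'left': ['D'], 'right': ['S']},   # noun needs D on left, S to a verb
    'ran': {'left': ['S'], 'right': []},      # verb needs a subject link on left
}

def parses(words):
    needed = [{'left': list(LEXICON[w]['left']),
               'right': list(LEXICON[w]['right'])} for w in words]
    for i in range(len(words) - 1):
        for connector in list(needed[i]['right']):
            if connector in needed[i + 1]['left']:
                needed[i]['right'].remove(connector)
                needed[i + 1]['left'].remove(connector)
    return all(not n['left'] and not n['right'] for n in needed)

print(parses(['the', 'cat', 'ran']))   # True: D and S links both satisfied
```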

839 citations


Journal ArticleDOI
01 Mar 1995-Language
TL;DR: This reference work covers topics ranging from cohesion and coherence in literature to formalization and functionalism in linguistic criticism, generative grammar, and chart parsing with WFSSTs.
Abstract: Academics and standards; Ageing and language; Alphabet - religious beliefs; Alternate sign languages; Aphasia; Australian languages; Automatic speech recognition - stochastic techniques; Black English in education; Blasphemy; Burushaski; Caddoan languages; Cargo cults; Case grammar; Chart parsing and WFSSTs; Chomsky's philosophy of grammar; Classroom language - observation and research; Cognitive grammar; Cohesion and coherence in literature; Communicative intention; Computers and language use; Conversation analysis; Counterfactuals; Deaf community and culture; Deconstruction; Dependency phonology; Dictionaries, rhyming; Diderot, Denis; Dravidian languages; Dysarthrias, developmental; Ethnicity and language; Ethnopoetics; Factivity; Fluency - disorders; Formal semantics; Formalization and functionalism in linguistic criticism; Formulaic speech; Generative grammar; Gestures; Grammar - typological and areal issues; Historiography of linguistics; Immigrant languages in education - Sweden; Indirect speech acts; Information theory; Interjections; Intonation - pragmatics; Irish bardic grammarians; Italic languages; Japanese writing system; Journalism; Language as a platonic reality; Language death; Language promotion by governments; Language - Hindu views; Lexical semantics; Lexicography, post-classical Greek; Linguistic philosophy; Logical positivism; Mesoamerican writing; Metaphor in language; Mood and modality; Morphological universals; Multilingual states - political implications of language policies; Namibia - language situation; Naming of children; Natural language generation; New Englishes; Occitan; Ogden, Charles Kay; Origins of language - recent theories; Oxford English dictionary; Palaeontology, linguistic; Pathology of language - evaluation; Performative hypothesis; Philosophy of linguistics; Phonology - redundancy rules; Picture theory of meaning; Planudes, Maximus; Pragmatic presuppositions; Preaching; Procedural semantics; Pronoun systems; Propositional calculus; Proxemics; Pseudolinguistics; Puns; Rajasthani; Reading processes in adults; Shorthand; Sign bilingualism - applications to education; Slang - sociology; Social networks and language; South America - sign languages; Speaker-characterization in speech technology; Speech aerodynamics; Structuralism and semiotics, literary; Subcategorization; Syntax and semantics - relationship; Technical vocabulary - medieval and renaissance English; Telegraph and telephone; Tense; Text pragmatics; Text-to-speech conversion systems; Thought and language; Translation, machine-aided; Translinguistics; Ugaritic; Universals of language; Urban dialectology; Valency changing alternations; Voice quality; Whistles and whistled speech; Word-formation processes. (Part contents)

727 citations


01 Aug 1995
TL;DR: The authors developed a formal grammatical system called a link grammar and showed how English grammar can be encoded in such a system, and gave algorithms for efficiently parsing with a link grammar.
Abstract: We develop a formal grammatical system called a link grammar, show how English grammar can be encoded in such a system, and give algorithms for efficiently parsing with a link grammar. Although the expressive power of link grammars is equivalent to that of context free grammars, encoding natural language grammars appears to be much easier with the new system. We have written a program for general link parsing and written a link grammar for the English language. The performance of this preliminary system -- both in the breadth of English phenomena that it captures and in the computational resources used -- indicates that the approach may have practical uses as well as linguistic significance. Our program is written in C and may be obtained through the internet.

726 citations


Posted Content
TL;DR: Natural language interfaces to databases (NLIDBs) are surveyed and compared with formal query languages, form-based interfaces, and graphical interfaces.
Abstract: This paper is an introduction to natural language interfaces to databases (NLIDBs). A brief overview of the history of NLIDBs is first given. Some advantages and disadvantages of NLIDBs are then discussed, comparing NLIDBs to formal query languages, form-based interfaces, and graphical interfaces. An introduction to some of the linguistic problems NLIDBs have to confront follows, for the benefit of readers less familiar with computational linguistics. The discussion then moves on to NLIDB architectures, portability issues, restricted natural language input systems (including menu-based NLIDBs), and NLIDBs with reasoning capabilities. Some less explored areas of NLIDB research are then presented, namely database updates, meta-knowledge questions, temporal questions, and multi-modal NLIDBs. The paper ends with reflections on the current state of the art.

694 citations


Journal ArticleDOI
TL;DR: This paper is an introduction to natural language interfaces to databases (NLIDBs); some less explored areas of NLIDB research are presented, namely database updates, meta-knowledge questions, temporal questions, and multi-modal NLIDBs.
Abstract: This paper is an introduction to natural language interfaces to databases (NLIDBs). A brief overview of the history of NLIDBs is first given. Some advantages and disadvantages of NLIDBs are then discussed, comparing NLIDBs to formal query languages, form-based interfaces, and graphical interfaces. An introduction to some of the linguistic problems NLIDBs have to confront follows, for the benefit of readers less familiar with computational linguistics. The discussion then moves on to NLIDB architectures, portability issues, restricted natural language input systems (including menu-based NLIDBs), and NLIDBs with reasoning capabilities. Some less explored areas of NLIDB research are then presented, namely database updates, meta-knowledge questions, temporal questions, and multi-modal NLIDBs. The paper ends with reflections on the current state of the art.

679 citations


Journal ArticleDOI
10 Feb 1995-Science
TL;DR: A language-independent means of gauging topical similarity in unrestricted text by combining information derived from n-grams with a simple vector-space technique that makes sorting, categorization, and retrieval feasible in a large multilingual collection of documents.
Abstract: A language-independent means of gauging topical similarity in unrestricted text is described. The method combines information derived from n-grams (consecutive sequences of n characters) with a simple vector-space technique that makes sorting, categorization, and retrieval feasible in a large multilingual collection of documents. No prior information about document content or language is required. Context, as it applies to document similarity, can be accommodated by a well-defined procedure. When an existing document is used as an exemplar, the completeness and accuracy with which topically related documents are retrieved is comparable to that of the best existing systems. The results of a formal evaluation are discussed, and examples are given using documents in English and Japanese.
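
The core of the method, character n-gram vectors compared by cosine similarity, fits in a few lines (the published system adds frequency weighting and other refinements not shown here):

```python
# Minimal sketch of language-independent similarity from character n-grams.
from collections import Counter
import math

def ngram_vector(text, n=4):
    """Counter of overlapping character n-grams, whitespace-normalized."""
    text = ' '.join(text.lower().split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(u, v):
    dot = sum(u[g] * v[g] for g in u if g in v)
    norm = math.sqrt(sum(c * c for c in u.values())) * \
           math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

a = ngram_vector("natural language processing of documents")
b = ngram_vector("processing documents in natural language")
print(round(cosine(a, b), 3))   # high overlap: works for any character set
```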

630 citations


Book
01 Jan 1995
TL;DR: Discusses the nature of gesture, the differing organization of signed and spoken languages, and the origin of syntax: gesture as name and relation.
Abstract: This book proposes a radical alternative to dominant views of the evolution of language, and in particular the origins of syntax. The authors argue that manual and vocal communication developed in parallel, and that the basic elements of syntax are intrinsic to gesture. They draw on evidence from areas such as primatology, anthropology, and linguistics, to present a groundbreaking account of the notion that language emerged through visible bodily action. They go on to examine the implications of their findings for linguistic theory and theories of the biological evolution of the capacity for language. Written in a clear and accessible style, Gesture and the Nature of Language will be indispensable reading for all those interested in the origins of language.

422 citations


Proceedings ArticleDOI
Tony Robinson, J. Fransen, D. Pye, J.T. Foote, Steve Renals
09 May 1995
TL;DR: The motivation for the corpus, the processes undertaken in its construction, and the utilities needed as support tools are described, and the paper concludes with comparative results on these tasks for British and American English.
Abstract: A significant new speech corpus of British English has been recorded at Cambridge University. Derived from the Wall Street Journal text corpus, WSJCAM0 constitutes one of the largest corpora of spoken British English currently in existence. It has been specifically designed for the construction and evaluation of speaker-independent speech recognition systems. The database consists of 140 speakers each speaking about 110 utterances. This paper describes the motivation for the corpus, the processes undertaken in its construction and the utilities needed as support tools. All utterance transcriptions have been verified and a phonetic dictionary has been developed to cover the training data and evaluation tasks. Two evaluation tasks have been defined using standard 5000 word bigram and 20000 word trigram language models. The paper concludes with comparative results on these tasks for British and American English.

Journal ArticleDOI
TL;DR: The description focuses on the development of the multilingual MIT Voyager spoken language system, which can engage in verbal dialogues with users about a geographical region within Cambridge, MA, USA.

Patent
21 Jun 1995
TL;DR: In this paper, the input natural language information is sequentially processed word by word, and the kind of a subsequent word is expected from the currently processed word by using knowledge concerning the word order of words in the information.
Abstract: A natural language processing method, by which a sequence of natural language information is analyzed so as to derive a concept represented by the information. In this method, the input natural language information is sequentially processed word by word. At that time, the kind of a subsequent word is expected from the currently processed word by using knowledge concerning the word order of words in the natural language information. Thus the processing is performed by eliminating ambiguity in the information on the basis of such an expectation.
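
The patent's claims are abstract, but the expectation mechanism it describes behaves like tag-transition filtering. In this toy sketch the lexicon and expectation table are purely illustrative:

```python
# Toy sketch of expectation-driven disambiguation: process words left to
# right and use word-order knowledge (allowed tag transitions) to choose
# among each next word's possible classes. All tables are illustrative.
EXPECT = {                       # tag -> tags expected to follow it
    'DET':  {'NOUN', 'ADJ'},
    'ADJ':  {'NOUN'},
    'NOUN': {'VERB'},
    'VERB': {'DET', 'NOUN'},
}
LEXICON = {'the': {'DET'}, 'duck': {'NOUN', 'VERB'}, 'runs': {'VERB', 'NOUN'}}

def disambiguate(words):
    tags, prev = [], None
    for word in words:
        candidates = LEXICON[word]
        if prev is not None:
            expected = candidates & EXPECT[prev]
            candidates = expected or candidates   # fall back if nothing matches
        tag = sorted(candidates)[0]               # toy tie-break
        tags.append(tag)
        prev = tag
    return tags

print(disambiguate(['the', 'duck', 'runs']))   # ['DET', 'NOUN', 'VERB']
```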

Journal ArticleDOI
TL;DR: The Terminology Server provides external referencing to concept entities and coercion between data types, and makes its services available through a uniform applications programming interface.
Abstract: GALEN is developing a Terminology Server to support the development and integration of clinical systems through a range of key terminological services, built around a language-independent, re-usable, shared system of concepts--the CORE model. The focus is on supporting applications for medical records, clinical user interfaces and clinical information systems, but also includes systems for natural language understanding, clinical decision support, management of coding and classification schemes, and bibliographic retrieval. The Terminology Server integrates three modules: the Concept Module which implements the GRAIL formalism and manages the internal representation of concept entities, the Multilingual Module which manages the mapping of concept entities to natural language, and the Code Conversion Module which manages the mapping of concept entities to and from existing coding and classification schemes. The Terminology Server also provides external referencing to concept entities, coercion between data types, and makes its services available through a uniform applications programming interface. Taken together these services represent a new approach to the development of clinical systems and the sharing of medical knowledge.

Book
23 Mar 1995
TL;DR: This book attempts to define a "Linnaean taxonomy" for the English language: an annotation scheme, the SUSANNE scheme, which yields a labelled constituency structure for any string of English, comprehensively identifying all of its surface and logical structural properties.
Abstract: Computer processing of natural language is a burgeoning field, but until now there has been no agreement on a standardized classification of the diverse structural elements that occur in real-life language material. This book attempts to define a "Linnaean taxonomy" for the English language: an annotation scheme, the SUSANNE scheme, which yields a labelled constituency structure for any string of English, comprehensively identifying all of its surface and logical structural properties. The structure is specified with sufficient rigour that analysts working independently must produce identical annotations for a given example. The scheme is based on a large sample of real-life use of British and American written and spoken English. The book also describes the SUSANNE electronic corpus of English which is annotated in accordance with the scheme. It is freely available as a research resource to anyone working at a computer connected to the Internet, and since 1992 has come into widespread use in academic and commercial research environments on four continents.

Book
01 Oct 1995
TL;DR: The role of teachers in the development of linguistic, cognitive, and academic skills of limited-English-proficient (LEP/ELL) students is discussed in this article.
Abstract: Limited-English-proficient students in the mainstream classroom; limited-English-proficient students/English language learners - who are they?; cultural and linguistic diversity in the classroom; alternatives to mainstreaming; the integrated development of oral and written language; instructional strategies for LEP/ELL students' oral and written English language development; integrating language and social studies learning; integrating language and science learning; integrating language and mathematics learning; the role of teachers in the development of linguistic, cognitive, and academic skills of LEP/ELL students.

Journal ArticleDOI
TL;DR: A new data structure is described, the semantic classification tree (SCT), that learns semantic rules from training data and can be a building block for robust matchers for NLU tasks.
Abstract: This article describes a new method for building a natural language understanding (NLU) system, in which the system's rules are learnt automatically from training data. The method has been applied to the design of a speech understanding (SU) system. Designers of such systems rely increasingly on robust matchers to perform the task of extracting meaning from one or several word sequence hypotheses generated by a speech recognizer. We describe a new data structure, the semantic classification tree (SCT), that learns semantic rules from training data and can be a building block for robust matchers for NLU tasks. By reducing the need for handcoding and debugging a large number of rules, this approach facilitates rapid construction of an NLU system. In the case of an SU system, the rules learned by an SCT are highly resistant to errors by the speaker or by the speech recognizer because they depend on a small number of words in each utterance. Our work shows that semantic rules can be learned automatically from training data, yielding successful NLU for a realistic application.
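
An SCT is a specialized structure not reproduced here, but a standard decision tree over word-presence features gives the flavor of learning semantic rules that key on a few words per utterance (scikit-learn stands in for the authors' algorithm; the data is invented):

```python
# Rough stand-in for semantic classification trees: a decision tree over
# binary word-presence features learns rules keyed on a few words.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier

utterances = ["show me flights to boston",
              "i want to fly to denver",
              "what is the fare to boston",
              "how much does the ticket cost"]
meanings = ["LIST_FLIGHTS", "LIST_FLIGHTS", "GET_FARE", "GET_FARE"]

vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(utterances)
tree = DecisionTreeClassifier().fit(X, meanings)

test = vectorizer.transform(["flights to dallas please"])
print(tree.predict(test))   # plausibly ['LIST_FLIGHTS']: hinges on few words
```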

Patent
07 Jun 1995
TL;DR: In this article, a program is provided for organizing a natural language such as English into binary units of two basic elements, Nounness and Verbness, which combine in two idea word patterns, called Primary Image and Conditional Image, and two Detail word patterns.
Abstract: A program is provided for organizing a natural language, such as English, into binary units of two basic elements, Nounness and Verbness, which combine in two idea word patterns, called Primary Image and Conditional Image, and two Detail word patterns, called Process Detail and Background Detail. These two basic elements, Nounness and Verbness, function binarily within the program, either in combination for the two Image word patterns or separately for the two Detail word patterns. All word patterns, except the verb-in-tense in the two Image word patterns, function binarily within the program in one of two positions: as Nounness, called Nesting, or as modifiers, called Qualifying. Since meaning in an English sentence is determined solely by word and word pattern location, binary units can be created which allow meaning to be changed by moving words or word patterns from one location to another, called Flipping. Natural language, thus organized into binary units, can be thus analyzed in computer programs for purposes such as, but not limited to, natural language processing which is not restricted to limited language domains, voice activation, machine translation from one natural language to another, context analysis of documents, data base searching, syntax analysis of documents, and the teaching of writing in natural language.

Journal ArticleDOI
TL;DR: The need for multidisciplinary research, for the development of shared corpora and related resources, for computational support, and for rapid communication among researchers is reviewed, along with the expected benefits of this technology.
Abstract: A spoken language system combines speech recognition, natural language processing and human interface technology. It functions by recognizing the person's words, interpreting the sequence of words to obtain a meaning in terms of the application, and providing an appropriate response back to the user. Potential applications of spoken language systems range from simple tasks, such as retrieving information from an existing database (traffic reports, airline schedules), to interactive problem solving tasks involving complex planning and reasoning (travel planning, traffic routing), to support for multilingual interactions. We examine eight key areas in which basic research is needed to produce spoken language systems: (1) robust speech recognition; (2) automatic training and adaptation; (3) spontaneous speech; (4) dialogue models; (5) natural language response generation; (6) speech synthesis and speech generation; (7) multilingual systems; and (8) interactive multimodal systems. In each area, we identify key research challenges, the infrastructure needed to support research, and the expected benefits. We conclude by reviewing the need for multidisciplinary research, for development of shared corpora and related resources, for computational support, and for rapid communication among researchers. The successful development of this technology will increase accessibility of computers to a wide range of users, will facilitate multinational communication and trade, and will create new research specialties and jobs in this rapidly expanding area.

Journal ArticleDOI
Vivian Cook
TL;DR: The author discusses the persistent tendency in L2 pedagogy, from the 1920s to the present, to make fallacious comparisons between multi-competent L2 learners and monoglot speakers of the target language.
Abstract: The term ‘multi‐competence’ is used to define an individual's knowledge of a native language and a second language, that is L1 linguistic competence plus L2 interlanguage. The paper discusses the persistent tendency in L2 pedagogy, from the 1920s to the present, to make fallacious comparisons between multi‐competent L2 learners and monoglot speakers of the target language. The fallacy is perpetuated by many formal models of language acquisition, such as Universal Grammar, which is opposed to any notion of multiple competences. The paper lists and describes the principal elements of multi‐competence and presents a number of their implications for the construction of syllabi and examinations and the development of teaching methods.

BookDOI
20 Nov 1995
TL;DR: This article proposed a new generation model in which a first pass builds a draft containing only the essential new facts to report and a second pass incrementally revises this draft to opportunistically add as many background facts as can fit within the space limit.
Abstract: Automatically summarizing vast amounts of on-line quantitative data with a short natural language paragraph has a wide range of real-world applications. However, this specific task raises a number of difficult issues that are quite distinct from the generic task of language generation: conciseness, complex sentences, floating concepts, historical background, paraphrasing power and implicit content. In this thesis, I address these specific issues by proposing a new generation model in which a first pass builds a draft containing only the essential new facts to report and a second pass incrementally revises this draft to opportunistically add as many background facts as can fit within the space limit. This model requires a new type of linguistic knowledge: revision operations, which specify the various ways a draft can be transformed in order to concisely accommodate a new piece of information. I present an in-depth corpus analysis of human-written sports summaries that resulted in an extensive set of such revision operations. I also present the implementation, based on functional unification grammars, of the system STREAK, which relies on these operations to incrementally generate complex sentences summarizing basketball games. This thesis also contains two quantitative evaluations. The first shows that the new revision-based generation model is far more robust than the one-pass model of previous generators. The second evaluation demonstrates that the revision operations acquired during the corpus analysis and implemented in STREAK are, for the most part, portable to at least one other quantitative domain (the stock market). STREAK is the first report generator that systematically places the facts which it summarizes in their historical perspective. It is more concise than previous systems thanks to its ability to generate more complex sentences and to opportunistically convey facts by adding a few words to carefully chosen draft constituents. The revision operations on which STREAK is based constitute the first set of corpus-based linguistic knowledge geared towards incremental generation. The evaluation presented in this thesis is also the first attempt to quantitatively assess the robustness of a new generation model and the portability of a new type of linguistic knowledge.
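
The two-pass model can be shown schematically. The single revision operation below is an invented, simplified stand-in for the corpus-derived operations the thesis catalogs:

```python
# Schematic of draft-then-revise generation: draft the essential facts,
# then opportunistically adjoin background facts while space remains.
def draft(essential_facts):
    return ' '.join(essential_facts) + '.'

def revise_adjoin(sentence, background_fact, limit_words):
    """Invented revision operation: adjoin one fact as a trailing modifier."""
    candidate = sentence[:-1] + ', ' + background_fact + '.'
    return candidate if len(candidate.split()) <= limit_words else sentence

text = draft(['Karl Malone scored 39 points'])
for fact in ['his season high', 'his 11th straight game over 20 points']:
    text = revise_adjoin(text, fact, limit_words=14)
print(text)   # the first fact fits the limit, the second is dropped
```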

Journal ArticleDOI
TL;DR: The PALKA (Parallel Automatic Linguistic Knowledge Acquisition) system is presented that acquires linguistic patterns from a set of domain specific training texts and their desired outputs and a specialized representation of patterns called FP structures has been defined.
Abstract: The paper presents an automatic acquisition of linguistic patterns that can be used for knowledge based information extraction from texts. In knowledge based information extraction, linguistic patterns play a central role in the recognition and classification of input texts. Although the knowledge based approach has been proved effective for information extraction on limited domains, there are difficulties in the construction of a large number of domain specific linguistic patterns. Manual creation of patterns is time consuming and error prone, even for a small application domain. To solve the scalability and the portability problem, an automatic acquisition of patterns must be provided. We present the PALKA (Parallel Automatic Linguistic Knowledge Acquisition) system that acquires linguistic patterns from a set of domain specific training texts and their desired outputs. A specialized representation of patterns called FP structures has been defined. Patterns are constructed in the form of FP structures from training texts, and the acquired patterns are tuned further through the generalization of semantic constraints. An inductive learning mechanism is applied in the generalization step. The PALKA system has been used to generate patterns for our information extraction system developed for the fourth Message Understanding Conference (MUC-4).
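
The generalization step can be sketched as widening a pattern slot's semantic constraint to the most specific concept covering all fillers seen in training; the concept hierarchy below is illustrative, not PALKA's:

```python
# Toy constraint generalization: widen a slot's semantic constraint to the
# lowest common ancestor of the training fillers in a concept hierarchy.
PARENT = {'car': 'vehicle', 'truck': 'vehicle', 'vehicle': 'physical-object',
          'building': 'physical-object', 'physical-object': 'thing'}

def ancestors(concept):
    chain = [concept]
    while concept in PARENT:
        concept = PARENT[concept]
        chain.append(concept)
    return chain

def generalize(fillers):
    """Most specific concept shared by every filler's ancestor chain."""
    common = set(ancestors(fillers[0]))
    for filler in fillers[1:]:
        common &= set(ancestors(filler))
    return max(common, key=lambda c: len(ancestors(c)))   # deepest survivor

print(generalize(['car', 'truck']))      # 'vehicle'
print(generalize(['car', 'building']))   # 'physical-object'
```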

Journal ArticleDOI
TL;DR: How the natural language system was made compatible with the existing CIS is described, and engineering issues involving performance, robustness, and accessibility of the data from the end users' viewpoint are discussed.
Abstract: This paper describes a natural language text extraction system, called MEDLEE, that has been applied to the medical domain. The system extracts, structures, and encodes clinical information from textual patient reports. It was integrated with the Clinical Information System (CIS), which was developed at Columbia-Presbyterian Medical Center (CPMC) to help improve patient care. MEDLEE is currently used on a daily basis to routinely process radiological reports of patients at CPMC. In order to describe how the natural language system was made compatible with the existing CIS, this paper will also discuss engineering issues which involve performance, robustness, and accessibility of the data from the end users' viewpoint. Also described are the three evaluations that have been performed on the system. The first evaluation was useful primarily for further refinement of the system. The two other evaluations involved an actual clinical application which consisted of retrieving reports that were associated with specified diseases. Automated queries were written by a medical expert based on the structured output forms generated as a result of text processing. The retrievals obtained by the automated system were compared to the retrievals obtained by independent medical experts who read the reports manually to determine whether they were associated with the specified diseases. MEDLEE was shown to perform comparably to the experts. The technique used to perform the last two evaluations was found to be a realistic evaluation technique for a natural language processor.

BookDOI
02 Feb 1995
TL;DR: This book presents an integrated approach to spoken dialog processing, covering dialog processing theory, a computational model, parsing, system implementation, and experimental results, including the performance of the speech recognizer and parser.
Abstract: 1. Achieving Spoken Communication with Computers 2. Foundational Work in Integrated Dialog Processing 3. Dialog Processing Theory 4. Computational Model 5. Parsing 6. System Implementation 7. Experimental Results 8. Performance of the Speech Recognizer and Parser 9. Enhanced Dialog Processing: Verifying Doubtful Inputs 10. Extending the State of the Art A. The Goal and Action Description Language B. User's Guide for the Interruptible Prolog Simulator (IPSIM) C. Obtaining the System Software Via the Anonymous FTP

Book
04 May 1995
TL;DR: Covers the aspectual verbs (the linguistic data, negation and duality as basic tools, monotonicity properties, the aspectual cube) along with cognition and semantic representation, referring with parameters, and naturalized semantic realism and universal grammar.
Abstract: Part 1, Introduction: what are aspectual classes?; controlling the flow of information - filters, plugs and holes; situated reasoning about time. Part 2, The aspectual verbs: the linguistic data; negation and duality - the basic tools; monotonicity properties; the aspectual cube. Part 3, Dynamic aspect trees: aspect as control structure; DATs for texts; reasoning with DATs - chronoscopes; DATs - their syntax and semantics. Part 4, States, generic information and constraints: transient states; progressive and perfect states; generic information; conditionals and temporal quantification. Part 5, Perspectives: perspectival coherence and chronoscopes; perspectival refinement; perspectival binding; scenes and scenarios. Part 6, A fragment of English: syntax and lexicon; DAT rules; semantics; further issues. Part 7, Epilogue: cognition and semantic representation; referring with parameters; naturalized semantic realism and universal grammar.

Proceedings ArticleDOI
20 Feb 1995
TL;DR: A class of systems is discussed, called FAQ FINDER systems, that uses a natural language question-based interface to distributed information sources, specifically files organized as question/answer pairs such as FAQ files.
Abstract: In this paper, we will discuss a class of systems, called FAQ FINDER systems, that use a natural language question-based interface to distributed information sources, specifically files organized as question/answer pairs such as FAQ files. In using these systems, users enter questions in natural language and the system attempts to answer them, using FAQ files as a resource. We combined two technologies in developing these systems: statistically based IR engines and more semantically based "language matchers". The power of our approach arises out of two features: first, we are using knowledge sources that have already been designed to answer the commonly asked questions in a domain and are more highly organized than free text; second, these systems do not have to comprehend the queries they receive. They only have to identify the relevant files and then match against the segments of text that are used to organize the files themselves.
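
The statistical half of that combination can be sketched with tf-idf matching against the stored questions; the semantically based "language matcher" half is omitted, and the FAQ data is invented:

```python
# Condensed sketch of the statistical side of FAQ matching: retrieve the
# answer whose stored question is most tf-idf-similar to the user's query.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

faq = [("How do I reset my password?", "Use the account settings page."),
       ("What file formats are supported?", "PNG, JPEG, and TIFF."),
       ("How can I contact support?", "Email the support address.")]

vectorizer = TfidfVectorizer()
Q = vectorizer.fit_transform([question for question, _ in faq])

def answer(user_question):
    sims = cosine_similarity(vectorizer.transform([user_question]), Q)[0]
    best = sims.argmax()
    return faq[best][1] if sims[best] > 0 else "No matching FAQ entry."

print(answer("how to reset a password"))   # "Use the account settings page."
```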


Journal Article
TL;DR: Five language- and tagset-independent stochastic taggers, handling morphological and contextual information, are presented and tested on corpora of seven European languages, using two sets of grammatical tags, and it is shown that the taggers' performance is satisfactory even when only a small training text is available.
Abstract: Five language- and tagset-independent stochastic taggers, handling morphological and contextual information, are presented and tested on corpora of seven European languages (Dutch, English, French, German, Greek, Italian and Spanish), using two sets of grammatical tags: a small set containing the eleven main grammatical classes and a large set of grammatical categories common to all languages. The unknown words are tagged using an experimentally proven stochastic hypothesis that links the stochastic behavior of the unknown words with that of the less probable known words. A fully automatic training and tagging program has been implemented on an IBM PC-compatible 80386-based computer. Measurements of error rate, time response, and memory requirements have shown that the taggers' performance is satisfactory even when only a small training text is available. The error rate is improved when new texts are used to update the stochastic model parameters.
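
The unknown-word hypothesis can be sketched directly: estimate the tag distribution of unknown words from the rarest known words in the training corpus (the counts and rarity threshold here are illustrative):

```python
# Toy version of the unknown-word hypothesis: unknown words are assumed to
# behave like the least probable known words, so their tag distribution is
# estimated from words seen at most `rare_max` times in training.
from collections import Counter

def unknown_tag_distribution(tagged_corpus, rare_max=1):
    word_freq = Counter(word for word, _ in tagged_corpus)
    rare_tags = Counter(tag for word, tag in tagged_corpus
                        if word_freq[word] <= rare_max)
    total = sum(rare_tags.values())
    return {tag: count / total for tag, count in rare_tags.items()}

corpus = [('the', 'DET'), ('cat', 'NOUN'), ('the', 'DET'),
          ('zyzzyva', 'NOUN'), ('absquatulate', 'VERB'), ('cat', 'NOUN')]
print(unknown_tag_distribution(corpus))   # {'NOUN': 0.5, 'VERB': 0.5}
```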

Journal ArticleDOI
TL;DR: The EDR Electronic Dictionary is described, which seeks to provide a foundation for linguistic databases, and the relation of electronic dictionaries to very large knowledge bases is explained.
Abstract: Natural language processing will grow into a vital industrial technology in the next five to ten years. But this growth depends on the development of large linguistic databases that capture natural language phenomena [1, 2]. Another important theme for future work is development of large knowledge bases that are shared widely by different groups. One promising approach to such knowledge bases draws on natural language processing and linguistic knowledge. This article describes the EDR Electronic Dictionary [3], which seeks to provide a foundation for linguistic databases, and explains the relation of electronic dictionaries to very large knowledge bases.

Patent
21 Dec 1995
TL;DR: In this article, a natural language understanding system takes a sentence as input and returns some representation of the possible meanings of the sentence as output (the interpretation) using a run-time interpreter that assigns interpretations to sentences and a compiler that produces (in a computer memory) an internal specification needed for the runtime interpreter from a user specification of the semantics of the application.
Abstract: A computerized method for building and running natural language understanding systems, wherein a natural language understanding system takes a sentence as input and returns some representation of the possible meanings of the sentence as output (the “interpretation”) using a run-time interpreter that assigns interpretations to sentences and a compiler that produces (in a computer memory) an internal specification needed for the run-time interpreter from a user specification of the semantics of the application. The compiler builds a natural language system, while the run-time interpreter runs the system.