
Showing papers on "Shallow parsing published in 2004"


Journal ArticleDOI
TL;DR: This paper proposes to bridge the gap between term acquisition and thesaurus construction by offering a framework for automatic structuring of multi-word candidate terms with the help of corpus-based links between single-word terms.
Abstract: Recent developments in computational terminology call for the design of multiple and complementary tools for the acquisition, the structuring and the exploitation of terminological data. This paper proposes to bridge the gap between term acquisition and thesaurus construction by offering a framework for automatic structuring of multi-word candidate terms with the help of corpus-based links between single-word terms. First, we present a system for corpus-based acquisition of terminological relationships through discursive patterns. This system is built on previous work on automatic extraction of hyponymy links through shallow parsing. Second, we show how hypernym links between single-word terms can be extended to semantic links between multi-word terms through corpus-based extraction of semantic variants. The induced hierarchy is incomplete but provides an automatic generalization of single-word term relations to the multi-word terms that are pervasive in technical thesauri and corpora.
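The abstract mentions corpus-based extraction of hyponymy links through discursive patterns. As a rough, hedged illustration of that general technique (the patterns, example sentence and expected output below are invented, not the paper's own), a minimal pattern-based extractor might look like this:

```python
import re

# A minimal sketch of pattern-based hyponymy extraction, in the spirit of
# Hearst-style lexico-syntactic patterns. The patterns and example text are
# illustrative assumptions, not the ones used in the paper.
PATTERNS = [
    # "NP such as NP (, NP)* (and|or NP)?"  ->  hypernym, hyponyms
    re.compile(r"(?P<hyper>\w+(?: \w+)?) such as (?P<hypos>\w+(?:, \w+)*(?:,? (?:and|or) \w+)?)"),
    # "NP (, NP)* and other NP"             ->  hyponyms, hypernym
    re.compile(r"(?P<hypos>\w+(?:, \w+)*) and other (?P<hyper>\w+(?: \w+)?)"),
]

def extract_hyponymy(sentence):
    """Return (hypernym, hyponym) links found by the surface patterns."""
    links = []
    for pattern in PATTERNS:
        for match in pattern.finditer(sentence):
            hyper = match.group("hyper").strip()
            for hypo in re.split(r",\s*|\s+(?:and|or)\s+", match.group("hypos")):
                if hypo:
                    links.append((hyper, hypo.strip()))
    return links

print(extract_hyponymy("image formats such as JPEG, PNG and TIFF are supported"))
# [('image formats', 'JPEG'), ('image formats', 'PNG'), ('image formats', 'TIFF')]
```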

76 citations


Journal ArticleDOI
TL;DR: This paper focuses on the second phase, presenting the basic guidelines for syntactic annotation and the boundaries of the work being done, and justifies methodological principles and syntactic criteria to build Cast3LB: a treebank for Spanish.
Abstract: In this paper we present and justify methodological principles and syntactic criteria to build Cast3LB: a treebank for Spanish. As preliminary work necessary to develop it, several automatic and semi-automatic processes have been carried out: automatic morphological analysis and disambiguation; manual validation of the tagging process, which guarantees the quality of the data; and, finally, automatic shallow parsing. The syntactic annotation consists of the labelling of constituents, including some elliptical elements, and syntactic functions. In this paper we focus on the second phase, presenting the basic guidelines for syntactic annotation and the boundaries of the work being done.

51 citations


Book ChapterDOI
15 Feb 2004
TL;DR: A robust syntactic analyser for Basque and its different modules are presented; the analysis is carried out using the Constraint Grammar (CG) formalism, and the standardisation of the parsing formats using XML is also described.
Abstract: This article presents a robust syntactic analyser for Basque and the different modules it contains. The analyser is structured in analysis layers, each of which takes the information provided by the previous layer as its input, thus creating a gradually deeper syntactic analysis in cascade. This analysis is carried out using the Constraint Grammar (CG) formalism. Moreover, the article describes the standardisation process of the parsing formats using XML.

37 citations


Patent
07 Jun 2004
TL;DR: This paper proposes a method for parsing Chinese sentences that employs lexical and syntactic information to extract the more prominent entities in a Chinese sentence; the sentence is then transformed into a Triple representation by employing Triple rules that refer to elemental Chinese syntax.
Abstract: A method for processing natural language Chinese sentences can transform a Chinese sentence into a Triple representation using shallow parsing techniques. The method is concerned with parsing Chinese sentences by employing lexical and syntactic information to extract the more prominent entities in a Chinese sentence; the sentence is then transformed into a Triple representation by employing Triple rules that refer to elemental Chinese syntax: SVO (subject, verb, and object, in that order). The lexical and syntactic information in our method refers to a lexicon with part-of-speech (POS) information and to phrase-level Chinese syntax, respectively. The Triple representation consists of three elements: the agent, predicate, and patient of a sentence.
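As a rough, hedged illustration of the Triple idea (the chunk labels, rule and example below are invented and are not the paper's actual Triple rules or lexicon), an agent/predicate/patient triple can be read off a shallow-parsed SVO sequence like this:

```python
# Illustrative sketch only: extracting an (agent, predicate, patient) triple
# from a shallow-parsed (chunked) sentence assumed to follow SVO order.
# Chunk labels and the example sentence are assumptions, not the paper's rules.

def extract_triple(chunks):
    """chunks: list of (chunk_type, text) pairs in sentence order."""
    agent = predicate = patient = None
    for chunk_type, text in chunks:
        if chunk_type == "NP" and agent is None:
            agent = text                      # first noun phrase -> agent
        elif chunk_type == "VP" and predicate is None:
            predicate = text                  # first verb phrase -> predicate
        elif chunk_type == "NP" and predicate is not None and patient is None:
            patient = text                    # first NP after the verb -> patient
    return agent, predicate, patient

# e.g. a chunked rendering of "the cat chased the mouse"
print(extract_triple([("NP", "the cat"), ("VP", "chased"), ("NP", "the mouse")]))
# ('the cat', 'chased', 'the mouse')
```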

27 citations


01 Jan 2004
TL;DR: In this article, the authors present the improvements in the computational treatment of Basque, and more specifically, in the areas of morphosyntactic disambiguation and shallow parsing.
Abstract: Our goal in this article is to show the improvements in the computational treatment of Basque, and more specifically, in the areas of morphosyntactic disambiguation and shallow parsing. The improvements presented in this paper include the following: analyses of previously identified ambiguities in morphosyntax and in syntactic functions, their disambiguation, and finally, an outline of possible steps in terms of shallow parsing based on the results provided by the disambiguation process. The work is part of the current research within the field of Natural Language Processing (NLP) in Basque, and more specifically, part of the work that is being done within the IXA group.

25 citations


Proceedings Article
01 Jan 2004
TL;DR: It is concluded that existing memory-based parsing approaches can be applied to spoken Dutch successfully, but that there is room for improvement in the tagger-chunker.
Abstract: We describe the development of a Dutch memory-based shallow parser. The availability of large treebanks for Dutch, such as the one provided by the Spoken Dutch Corpus, allows memory-based learners to be trained on examples of shallow parsing taken from the treebank and to act as a shallow parser after training. An overview is given of a modular memory-based learning approach to shallow parsing, composed of a part-of-speech tagger-chunker and two grammatical relation finders, which was originally developed for English. This approach is applied to the syntactically annotated part of the Spoken Dutch Corpus to construct a Dutch shallow parser. From the generalisation scores of the parser we conclude that existing memory-based parsing approaches can be applied to spoken Dutch successfully, but that there is room for improvement in the tagger-chunker.
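As a rough illustration of the memory-based idea behind the tagger-chunker (classifying each token by analogy with the most similar stored training instances), here is a minimal toy sketch; the POS tags, chunk labels and training data are invented and do not reflect the Spoken Dutch Corpus or the memory-based software actually used in the paper:

```python
# Toy sketch of memory-based (nearest-neighbour) chunking: each token is
# classified from a window of POS tags around it by analogy with the most
# similar stored training instance (feature overlap as the similarity).

def windows(pos_tags):
    """One feature tuple per token: the POS tags at positions -1, 0, +1."""
    padded = ["_"] + list(pos_tags) + ["_"]
    return [tuple(padded[i:i + 3]) for i in range(len(pos_tags))]

# Tiny invented training set: POS windows paired with IOB chunk tags.
memory = []
for tags, labels in [(["DET", "NOUN", "VERB", "DET", "NOUN"],
                      ["B-NP", "I-NP", "B-VP", "B-NP", "I-NP"]),
                     (["PRON", "VERB", "ADJ", "NOUN"],
                      ["B-NP", "B-VP", "B-NP", "I-NP"])]:
    memory.extend(zip(windows(tags), labels))

def classify(feature):
    """Return the chunk tag of the most similar stored instance."""
    overlap = lambda a, b: sum(x == y for x, y in zip(a, b))
    return max(memory, key=lambda inst: overlap(inst[0], feature))[1]

test = ["DET", "NOUN", "VERB", "PRON"]
print([(tag, classify(feat)) for tag, feat in zip(test, windows(test))])
```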

10 citations


Journal Article
TL;DR: This paper describes the application of lemmatization and shallow parsing as a linguistically-based alternative to stemming in Text Retrieval, with the aim of managing linguistic variation at both word level and phrase level.
Abstract: This article describes the application of lemmatization and shallow parsing as a linguistically-based alternative to stemming in Text Retrieval, with the aim of managing linguistic variation at both word level and phrase level. Several alternatives for selecting the index terms among the syntactic dependencies detected by the parser are evaluated. Though this article focuses on Spanish, this approach is extensible to other languages by simply adapting the grammar used by the parser.
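To make the indexing idea concrete, here is a small, hedged sketch of using lemmas and head-modifier dependency pairs as index terms; the lemma table, dependency triples and the "+" term notation are assumptions made for illustration only:

```python
# Minimal sketch of indexing with lemmas and head-modifier dependency pairs
# instead of stems. The lemmatisation table and the dependency triples are
# invented; the paper derives them from a Spanish lemmatiser and shallow parser.

LEMMAS = {"retrieved": "retrieve", "documents": "document", "relevant": "relevant"}

def index_terms(tokens, dependencies):
    """tokens: word forms; dependencies: (head, relation, modifier) triples."""
    terms = {LEMMAS.get(t, t) for t in tokens}                 # word-level variation
    for head, _rel, mod in dependencies:                       # phrase-level variation
        terms.add(f"{LEMMAS.get(head, head)}+{LEMMAS.get(mod, mod)}")
    return terms

print(index_terms(["retrieved", "relevant", "documents"],
                  [("documents", "mod", "relevant"), ("retrieved", "obj", "documents")]))
# e.g. {'retrieve', 'relevant', 'document', 'document+relevant', 'retrieve+document'}
```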

9 citations


28 Aug 2004
TL;DR: The Dependency Parser, called Maxuxta, is presented for the linguistic processing of Basque, which can serve as a representative of agglutinative languages that are also characterized by the free order of their constituents.
Abstract: We present the Dependency Parser, called Maxuxta, for the linguistic processing of Basque, which can serve as a representative of agglutinative languages that are also characterized by the free order of their constituents. The dependency syntactic model is applied to establish the dependency-based grammatical relations between the components within the clause. Such a deep analysis is used to improve the output of the shallow parsing, where syntactic structure ambiguity is not fully and explicitly resolved. Prior to the completion of the grammar for the dependency parsing, the design of the Dependency Structure-based Scheme had to be accomplished; we concentrated on issues that must be resolved by any practical system that uses such models. This scheme was used both for the manual tagging of the corpus and for the development of the parser. The manually tagged corpus has been used to evaluate the accuracy of the parser. We have evaluated the application of the grammar to the corpus, measuring the linking of the verb with its dependents, with satisfactory results.

9 citations


Book ChapterDOI
30 Aug 2004
TL;DR: The application of lemmatization and shallow parsing is described as a linguistically-based alternative to stemming in Text Retrieval, with the aim of managing linguistic variation at both word level and phrase level.
Abstract: This article describes the application of lemmatization and shallow parsing as a linguistically-based alternative to stemming in Text Retrieval, with the aim of managing linguistic variation at both word level and phrase level. Several alternatives for selecting the index terms among the syntactic dependencies detected by the parser are evaluated. Though this article focuses on Spanish, this approach is extensible to other languages by simply adapting the grammar used by the parser.

8 citations


01 Jan 2004
TL;DR: It is shown how the combination of shallow and deep semantic NLP techniques can improve the effectiveness of eLearning systems which support communication in free natural language and can make them more satisfactory and pleasant for their users.
Abstract: Computer-Aided Language Learning (CALL) should play an important role in the modern training process because it provides easily accessible, adaptive and flexible ways of learning. This paper addresses the scenario of tutor-learner question answering and attempts to automate free-answer evaluation using the advantages of Natural Language Processing (NLP). Our current approach integrates shallow parsing for analysing the answers and allows the learners to enter various utterances to express themselves. However, this variety does not impede the assessment of the student's answer, as we check the utterances against the automatically generated scope of the correct answers. The usage of a “set of answers” instead of one predefined correct answer enables feedback elaboration that helps learners better understand their knowledge gaps. Briefly, in this paper we show how the combination of shallow and deep semantic NLP techniques can improve the effectiveness of eLearning systems which support communication in free natural language and can make them more satisfactory and pleasant for their users.
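As a toy, hedged illustration of checking a free answer against a set of acceptable answers rather than a single gold string (the key terms, lemma table and acceptable set below are invented; the paper's automatic answer-scope generation and deeper semantic analysis are not reproduced):

```python
# Toy sketch: normalise a free-text learner answer to its key content terms
# and accept it if the result is in a set of acceptable answers.
ACCEPTABLE = {("water", "boil", "100"), ("water", "boil", "hundred")}
LEMMAS = {"boils": "boil", "boiling": "boil"}
KEY_TERMS = {"water", "boil", "100", "hundred"}

def normalise(answer):
    content = [LEMMAS.get(w, w) for w in answer.lower().split()]
    return tuple(w for w in content if w in KEY_TERMS)

student = "Water boils at 100 degrees"
print("correct" if normalise(student) in ACCEPTABLE else "needs feedback")
# correct
```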

8 citations


Book ChapterDOI
01 Jan 2004
TL;DR: An approach based on recent results in formal and computational linguistics is proposed, which takes into consideration the morphosyntactic and syntactic structure of Polish and which avoids various known problems of previous valence dictionaries.
Abstract: This article presents the design of a syntactico-semantic dictionary for Polish, i.e., a valence dictionary enriched with certain semantic information. Valence dictionaries, specifying the number and morphosyntactic form of the arguments of verbs, are useful in many Natural Language Processing applications, including deep parsing, e.g., for the purpose of machine translation; shallow parsing, e.g., for the purpose of information extraction; and rule-based morphosyntactic disambiguation, e.g., for the purpose of corpus annotation. An approach based on recent results in formal and computational linguistics is proposed, which takes into consideration the morphosyntactic and syntactic structure of Polish and which avoids various known problems of previous valence dictionaries, some of them stemming from their impoverished theoretical frameworks, unable to take proper care of the syntax-semantics interface, case variations and raising predicates. An implementation of a grammar of Polish deploying the ideas presented here is currently under development.
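For illustration only, one possible shape for a syntactico-semantic valence entry as a data structure; the field names, the example verb and its frame are assumptions, not the dictionary's actual format:

```python
# Illustrative sketch of a valence dictionary entry: a verb lemma with one or
# more frames, each listing the morphosyntactic requirements of its arguments
# plus a coarse semantic label. All labels here are invented examples.
from dataclasses import dataclass, field

@dataclass
class ValenceFrame:
    arguments: list          # e.g. morphosyntactic form of each argument
    semantics: str = ""      # coarse semantic label for the frame

@dataclass
class VerbEntry:
    lemma: str
    frames: list = field(default_factory=list)

dac = VerbEntry("dać",       # Polish 'to give'
                [ValenceFrame(["subj:nom", "obj:acc", "iobj:dat"], "transfer")])
print(dac)
```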

01 Mar 2004
TL;DR: The authors first introduce the promising field of information extraction, and then describe in detail how shallow parsing techniques are used in project SOKRATES, a project to analyze German free-form battlefield reports.
Abstract: A natural way to communicate with C2 systems would be to use natural language. Natural language components are already used in military systems; for example, CommandTalk is a spoken-language interface to the ModSAF battlefield simulator. In project SOKRATES, the authors use shallow parsing techniques for written language to analyze German free-form battlefield reports. These reports are processed by transducers. The extraction result is formalized in feature structures, semantically enriched by the semantic analysis, and the augmented result is then stored in the ATCCIS database. After storage in the database, triggers initiate a change in the position of a tactical symbol on the tactical map. Shallow parsing techniques are the basis for information extraction. In this paper, the authors first introduce the promising field of information extraction, and then describe in detail how shallow parsing techniques are used in project SOKRATES. Twenty-four briefing charts summarize the presentation. (2 figures, 15 refs.)

Proceedings Article
01 Jul 2004
TL;DR: The PolyU Treebank is based on shallow parsing in which only partial syntactical structures are annotated, and well-designed workflow and effective semiautomatic and automatic annotation checking are used to ensure annotation accuracy and consistency.
Abstract: This paper presents the construction of a manually annotated Chinese shallow Treebank, named the PolyU Treebank. Different from traditional Chinese Treebanks based on full parsing, the PolyU Treebank is based on shallow parsing in which only partial syntactic structures are annotated. This Treebank can be used to support shallow parser training, testing and other natural language applications. Phrase-based Grammar, proposed by Peking University, is used to guide the design and implementation of the PolyU Treebank. The design principles include good resource sharing, low structural complexity, sufficient syntactic information and large data scale. The design issues, including corpus material preparation, standards for word segmentation and POS tagging, and the guidelines for phrase bracketing and annotation, are presented in this paper. A well-designed workflow and effective semi-automatic and automatic annotation checking are used to ensure annotation accuracy and consistency. Currently, the PolyU Treebank has completed the annotation of a 1-million-word corpus. The evaluation shows that the accuracy of annotation is higher than 98%.

DOI
01 Feb 2004
TL;DR: A method is proposed here for automatically acquiring large-scale NVEF knowledge without human intervention in order to identify a large, varied range of NVEF-sentences (sentences containing at least one NVEF word-pair).
Abstract: Noun-verb event frame (NVEF) knowledge in conjunction with an NVEF word-pair identifier [Tsai et al. 2002] comprises a system that can be used to support natural language processing (NLP) and natural language understanding (NLU). In [Tsai et al. 2002a], we demonstrated that NVEF knowledge can be used effectively to solve the Chinese word-sense disambiguation (WSD) problem with 93.7% accuracy for nouns and verbs. In [Tsai et al. 2002b], we showed that NVEF knowledge can be applied to the Chinese syllable-to-word (STW) conversion problem to achieve 99.66% accuracy for the NVEF-related portions of Chinese sentences. In [Tsai et al. 2002a], we defined a collection of NVEF knowledge as an NVEF word-pair (a meaningful NV word-pair) and its corresponding NVEF sense-pairs. No methods exist that can fully and automatically find collections of NVEF knowledge from Chinese sentences. We propose a method here for automatically acquiring large-scale NVEF knowledge without human intervention in order to identify a large, varied range of NVEF-sentences (sentences containing at least one NVEF word-pair). The auto-generation of NVEF knowledge (AUTO-NVEF) includes four major processes: (1) segmentation checking; (2) Initial Part-of-Speech (IPOS) sequence generation; (3) NV knowledge generation; and (4) NVEF knowledge auto-confirmation. Our experimental results show that AUTO-NVEF achieved 98.52% accuracy for news and 96.41% for specific text types, which included research reports, classical literature and modern literature. AUTO-NVEF automatically discovered over 400,000 NVEF word-pairs from the 2001 United Daily News (2001 UDN) corpus. According to our estimation, the acquired NVEF knowledge from 2001 UDN helped to identify 54% of the NVEF-sentences in the Academia Sinica Balanced Corpus (ASBC), and 60% in the 2001 UDN corpus. We plan to expand NVEF knowledge so that it is able to identify more than 75% of NVEF-sentences in ASBC. We will also apply the acquired NVEF knowledge to support other NLP and NLU research, such as machine translation, shallow parsing, syllable and speech understanding, and text indexing. The auto-generation of bilingual, especially Chinese-English, NVEF knowledge will also be addressed in our future work.
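As a hedged sketch of the simplest ingredient of such acquisition, collecting candidate noun-verb pairs from a POS-tagged sentence (the tag set and example are assumptions; AUTO-NVEF's sense-pair generation and auto-confirmation steps are not reproduced):

```python
# Rough sketch: collect candidate noun-verb word pairs from a POS-tagged
# sentence. Tags and the example sentence are invented for illustration.
def candidate_nv_pairs(tagged_sentence):
    """tagged_sentence: list of (word, pos) pairs; returns noun-verb pairs."""
    nouns = [w for w, pos in tagged_sentence if pos.startswith("N")]
    verbs = [w for w, pos in tagged_sentence if pos.startswith("V")]
    return [(n, v) for n in nouns for v in verbs]

print(candidate_nv_pairs([("學生", "Na"), ("閱讀", "VC"), ("報紙", "Na")]))
# [('學生', '閱讀'), ('報紙', '閱讀')]
```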


Book ChapterDOI
13 Dec 2004
TL;DR: A novel approach to identifying case roles is proposed which goes beyond shallow parsing to a deeper level of language understanding, while preserving robustness, without being bogged down in a complete linguistic analysis.
Abstract: A novel approach to identifying case roles is proposed. The approach makes use of an attributed string matching technique which takes full advantage of the huge number of sentence patterns in a Treebank. Based on the syntactic and semantic tags encoded in the Treebank, the approach goes beyond shallow parsing to a deeper level of language understanding, while preserving robustness, without being bogged down in a complete linguistic analysis. An evaluation on 5,000 Chinese sentences is carried out in order to establish its statistical significance.
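A rough, hedged sketch of the matching idea: render the input as a sequence of syntactic/semantic tags and borrow the case roles of the closest stored Treebank pattern. The tag sets, patterns and similarity measure below are assumptions made for the example:

```python
# Illustrative sketch of attributed string matching: the input tag sequence is
# compared against stored sentence patterns, and the roles of the best match
# are carried over. difflib ratio stands in for the paper's similarity measure.
from difflib import SequenceMatcher

PATTERNS = [
    (("Nh", "VC", "Na"), ("agent", "act", "theme")),
    (("Na", "VH"),       ("theme", "state")),
]

def closest_roles(tags):
    best = max(PATTERNS, key=lambda p: SequenceMatcher(None, p[0], tags).ratio())
    return list(zip(tags, best[1]))

print(closest_roles(("Nh", "VC", "Na")))
# [('Nh', 'agent'), ('VC', 'act'), ('Na', 'theme')]
```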

Journal Article
TL;DR: A method of skeleton parsing for domain-specific Chinese text is put forward: shallow parsing with a cascaded hidden Markov Model combines phrases, and the sentence skeleton is then obtained by template matching over the shallow parse tree.
Abstract: A method of skeleton parsing for domain-specific Chinese text is put forward in this paper. The method includes two key steps: shallow parsing and template matching. A template is adopted to represent the sentence skeleton. We use shallow parsing, based on a cascaded hidden Markov Model, to combine phrases. The skeleton parse is then obtained by template matching over the tree produced by shallow parsing. An experiment on sports news shows that the proposed method achieves 98.04% precision and 81.43% recall for template matching, and 96.97% precision and 84.85% recall at the sentence level.
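A minimal, hedged sketch of template matching over chunker output; the chunk labels, template and example are invented and are not the paper's templates for sports news:

```python
# Toy sketch of skeleton extraction: slide a template of chunk labels over the
# shallow-parse (chunk) sequence and return the texts that fill it.
TEMPLATE = ["NP", "VP", "NP"]      # a skeleton: who did what to whom

def match_skeleton(chunks, template=TEMPLATE):
    """chunks: (label, text) pairs; returns texts filling the template, or None."""
    labels = [label for label, _ in chunks]
    for start in range(len(labels) - len(template) + 1):
        if labels[start:start + len(template)] == template:
            return [text for _, text in chunks[start:start + len(template)]]
    return None

print(match_skeleton([("NP", "the home team"), ("VP", "won"),
                      ("NP", "the match"), ("PP", "in overtime")]))
# ['the home team', 'won', 'the match']
```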

Journal Article
TL;DR: In this article, one of the few linguistics-based systems for word-to-word alignment is presented; most existing systems are purely statistical and assume hypotheses about the structure of texts which are often not borne out.
Abstract: This paper describes an algorithm which represents one of the few linguistics-based systems for word-to-word alignment. Most systems are purely statistical and assume some hypotheses about the structure of texts which are often not borne out. Our approach combines statistical methods with positional and linguistic ones so that it can be successfully applied to any kind of bitext as far as the internal structure of the texts is concerned. The linguistic part uses shallow parsing by regular expressions and relies on very general linguistic principles. However, a component of language-specific methods can be developed to improve results. Our word-alignment system was evaluated on a Romanian-English bitext.

01 Jan 2004
TL;DR: This paper describes experiments made with Logus, a spoken understanding system based on an incremental methodology, and presents the first step of the parsing, a chunking based on rules of categorial grammars and pregroups.
Abstract: Spoken language understanding is a challenge for the development of Spoken Dialogue Systems. Recognition errors and speech repairs make it impossible to obtain a complete syntactic analysis. Shallow parsing and chunking appear to be efficient ways to begin an analysis that is both robust and precise. This paper describes experiments made with Logus, a spoken understanding system based on an incremental methodology. It presents the first step of the parsing, a chunking based on rules of categorial grammars and pregroups. These formalisms are very appropriate for this treatment, and we argue that they could be more widely used for applications of this type.

Book ChapterDOI
Ana-Maria Barbu
08 Sep 2004
TL;DR: This paper describes an algorithm which represents one of the few linguistics-based systems for word-to-word alignment; it combines statistical methods with positional and linguistic ones so that it can be successfully applied to any kind of bitext as far as the internal structure of the texts is concerned.
Abstract: This paper describes an algorithm which represents one of the few linguistics-based systems for word-to-word alignment. Most systems are purely statistical and assume some hypotheses about the structure of texts which are often not borne out. Our approach combines statistical methods with positional and linguistic ones so that it can be successfully applied to any kind of bitext as far as the internal structure of the texts is concerned. The linguistic part uses shallow parsing by regular expressions and relies on very general linguistic principles. However, a component of language-specific methods can be developed to improve results. Our word-alignment system was evaluated on a Romanian-English bitext.

Patent
20 May 2004
TL;DR: In this article, a simple sentence range recognizer and a necessary component generator were used to extract information from a complex event sentence by using the sentence form information and considering a modifier clause.
Abstract: PURPOSE: A device and a method for analyzing the simple-sentence structure of an event sentence for information extraction are provided, improving the performance of an information extraction system by dividing a long, complex event sentence into simple sentences based on declinable words and extracting the information after analyzing the structure of the divided sentences. CONSTITUTION: A simple sentence range recognizer (10) recognizes the range of a simple sentence within the input complex event sentence by using sentence-form information and taking modifier clauses into account. A necessary component generator (20) generates the necessary components from the recognized simple sentence by using the lower-category information and the object-name co-occurrence information. A necessary component extender (30) obtains the final analyzed structure of the simple sentence by recognizing and extending the generated necessary components through a compound noun dictionary and the object-name information.

Proceedings ArticleDOI
28 Oct 2004
TL;DR: This work uses the centering model of local discourse coherence to resolve zero anaphora in Chinese and to identify the topic, which is the most salient element in a sentence.
Abstract: XML topic maps enable multiple concurrent views of sets of information objects and can be used in different applications, for example thesaurus-like interfaces to corpora, navigational tools for cross-references or citation systems, and information filtering or delivery depending on user profiles. However, enriching the information of a topic map or connecting it with a document's URI is very labor-intensive and time-consuming. To solve this problem, we propose an approach based on natural language processing techniques to identify and extract useful information in raw Chinese text. Unlike most traditional approaches to parsing sentences, which are based on the integration of complex linguistic information and domain knowledge, we work on the output of a part-of-speech tagger and use shallow parsing instead of complex parsing to identify the topics of sentences. The key elements of the centering model of local discourse coherence are employed to extract the structures of discourse segments. We use the local discourse structure to solve the problem of zero anaphora in Chinese and then identify the topic, which is the most salient element in a sentence. After we obtain all the topics of a document, we may assign the document to a topic node of the topic map and add the information of the document to the topic element simultaneously.

Dissertation
17 Dec 2004
TL;DR: This project takes a corpus of aviation safety reports parsed by Cass, an existing partial parser, with a particular given grammar, and looks for instances of linguistic constructs whose treatment by the parser could be improved by modifications to the grammar.
Abstract: With the growth of the World Wide Web in the nineties, alongside the increase in storage and processing capabilities of computer hardware, the problem of information overload resulted in an increased interest in finite-state techniques for Natural Language Analysis as an alternative to fragile, slower algorithms that would attempt to find complete parses for sentences based on general theories of language. As it turns out, shallow parsing, a set of robust parsing techniques based on finite state machines, provides incomplete yet very useful parses for unconstrained running text. The technique, however, will never provide 100% accuracy and requires that grammars be geared to the needs of particular data samples. In this project, we take a corpus of aviation safety reports parsed by Cass, an existing partial parser, with a particular given grammar, and look for instances of linguistic constructs whose treatment by the parser could be improved by modifications to the grammar. A few such constructs are discussed, and the grammar is edited to reflect the desired improvements. A parser accuracy measure is implemented and evaluated before and after the grammar modifications.
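The dissertation's accuracy measure is not specified in this abstract; as a hedged illustration, one common way to score a partial parser is precision, recall and F1 over labelled constituent spans against a gold standard:

```python
# Hedged sketch of a parser accuracy measure: precision, recall and F1 over
# labelled spans (label, start, end) compared with a gold annotation.
def prf(gold_spans, predicted_spans):
    gold, pred = set(gold_spans), set(predicted_spans)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = [("NP", 0, 2), ("VP", 2, 3), ("NP", 3, 5)]
pred = [("NP", 0, 2), ("VP", 2, 3), ("NP", 4, 5)]
print(prf(gold, pred))   # (0.666..., 0.666..., 0.666...)
```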

Book ChapterDOI
01 Jan 2004
TL;DR: It is suggested that the use of Language Technologies, and more specifically of Information Extraction technologies, provides substantial help in Customer Opinion Monitoring when compared to alternative approaches, including both the “traditional” methodology of employing human operators to read documents and formalize relevant opinions/facts to be stored, and data mining techniques based on the non-linguistic structure of the page.
Abstract: The paper addresses a crucial topic in current CRM processes, i.e. that of constantly monitoring customer opinions. We use the label “Real Time Customer Opinion Monitoring” to denote the process of retrieving, analyzing and assessing opinions, judgments and criticisms about products and brands from newsgroups, message boards, consumer association sites and other public sources on the Internet. We suggest that the use of Language Technologies, and more specifically of Information Extraction technologies, provides substantial help in Customer Opinion Monitoring when compared to alternative approaches, including both the “traditional” methodology of employing human operators to read documents and formalize relevant opinions/facts to be stored, and data mining techniques based either on the non-linguistic structure of the page (web mining) or on statistical rather than linguistic analysis of the text (text mining in its standard meaning). In the light of these considerations, a novel application (ArgoServer) is presented, where different technologies cooperate with the core linguistic information extraction engine in order to achieve the result of constantly updating a database of product- or brand-related customer opinions automatically gathered from newsgroups. The paper emphasizes how far the currently implemented shallow parsing techniques can go in understanding the contents of customers' and users' messages, thus extracting database records from relevant textual segments. It also stresses the limits inherently associated with the use of pure shallow techniques for the comprehension of language, and shows how a new emerging linguistic technology, to be developed in the context of the European project Deep Thought, could in principle overcome such limits.

Book ChapterDOI
15 Feb 2004
TL;DR: A statistical translation model incorporating linguistic knowledge of syntactic and phrasal information for better translations is presented and it is shown that the structural relationship helps construct a better translation model for structurally different languages like Korean and English.
Abstract: As part of work on the alignment of an English-Korean parallel corpus, this paper presents a statistical translation model incorporating linguistic knowledge of syntactic and phrasal information for better translations. For this, we propose three models: first, we incorporate syntactic information such as part of speech into the word-based lexical alignment. Based on this model, we propose a second model which finds phrasal correspondences in the parallel corpus. Phrasal mapping through chunk-based shallow parsing makes it possible to resolve the mismatch of meaningful units in the two languages. Lastly, we develop a two-level alignment model by combining these two models in order to construct both a word- and phrase-based translation model. Model parameters are automatically estimated from a set of bilingual sentence pairs by applying the EM algorithm. Experiments show that the structural relationship helps construct a better translation model for structurally different languages like Korean and English.
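As a hedged illustration of EM-estimated parameters for word alignment, in the style of IBM Model 1 (the usual word-based starting point), the sentence pairs below are invented and the paper's POS- and chunk-based extensions are not reproduced:

```python
# Very small sketch of EM estimation of lexical translation probabilities
# t(f|e) from sentence pairs, in the style of IBM Model 1.
from collections import defaultdict

pairs = [(["the", "house"], ["das", "haus"]),
         (["the", "book"], ["das", "buch"])]

t = defaultdict(lambda: 0.25)            # uniform initialisation of t(f|e)
for _ in range(10):                      # EM iterations
    counts, totals = defaultdict(float), defaultdict(float)
    for e_sent, f_sent in pairs:
        for f in f_sent:                 # E-step: expected alignment counts
            norm = sum(t[(f, e)] for e in e_sent)
            for e in e_sent:
                counts[(f, e)] += t[(f, e)] / norm
                totals[e] += t[(f, e)] / norm
    for (f, e), c in counts.items():     # M-step: re-estimate t(f|e)
        t[(f, e)] = c / totals[e]

print(round(t[("haus", "house")], 3), round(t[("haus", "the")], 3))
# t('haus'|'house') ends up much larger than t('haus'|'the')
```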

01 Jan 2004
TL;DR: A model is explored that does not attempt to be a global characterisation of the sequence, as in the case of PCFGs/HMMs, yet does not assume independence among the generative processes of the subsequent elements in the sequence.
Abstract: Probabilistic models for sequence data can be mainly divided into two categories: I. fairly sophisticated models that are aimed at finding an all-encompassing characterisation of the whole sequence by means of an interdependent generative process (such as PCFGs and HMMs); and II. relatively simple models that make an independence assumption regarding the generation process of each of the elements of the sequence (such as Unigram, Naive Bayes, Clustering, PLSA and LDA). In this paper we explore the interval between these two extremes with a model that does not attempt to be a global characterisation of the sequence, as in the case of PCFGs/HMMs, yet does not assume independence among the generative processes of the subsequent elements in the sequence.
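As a hedged illustration of the two extremes the abstract contrasts (the paper's own intermediate model is not given in this abstract), independence-based models factor the sequence probability into per-element terms, while an HMM-style model ties the elements together through latent structure:

```latex
% Illustrative contrast only; the paper's intermediate model is not reproduced here.
% Independence assumption (e.g. unigram / Naive Bayes):
P(w_1, \dots, w_n) = \prod_{i=1}^{n} P(w_i)
% Globally interdependent generation (e.g. an HMM with hidden states s_i):
P(w_1, \dots, w_n) = \sum_{s_1, \dots, s_n} \prod_{i=1}^{n} P(s_i \mid s_{i-1}) \, P(w_i \mid s_i)
```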