
Showing papers on "Shallow parsing published in 2010"


Journal ArticleDOI
TL;DR: The proposed system, technique for concept relation identification using shallow parsing (CRISP), utilizes a shallow parser to extract semantic knowledge from construction contract documents which can be used to improve electronic document management functions such as document categorization and retrieval.
Abstract: The objective of this research is to present an innovative technique for managing the knowledge contained in construction contract documents to facilitate quick access and efficient use of such knowledge for project management and contract administration tasks. Knowledge Management has become the focus of a lot of scientific research during the second half of the 20th century as researchers discovered the importance of the knowledge resource to business organizations. Despite early expectations of improved document management techniques, document management systems used in the construction industry have failed to deliver the anticipated performance. Recent research attempts to utilize analysis of the contents of documents to improve document categorization and retrieval functions. It is hypothesized that natural language processing can be effectively used to perform document text analysis. The proposed system, technique for concept relation identification using shallow parsing (CRISP), utilizes a shallow parser to extract semantic knowledge from construction contract documents which can be used to improve electronic document management functions such as document categorization and retrieval. When compared with human evaluators, CRISP achieved almost 80% of the average kappa score attained by the evaluators, and approximately 90% of their F-measure score.

70 citations


Journal ArticleDOI
TL;DR: The objective of this work is to develop an NLP infrastructure for Urdu that is customizable and capable of providing basic analysis on which more advanced information extraction tools can be built.
Abstract: There has been an increase in the amount of multilingual text on the Internet due to the proliferation of news sources and blogs. The Urdu language, in particular, has experienced explosive growth on the Web. Text mining for information discovery, which includes tasks such as identifying topics, relationships and events, and sentiment analysis, requires sophisticated natural language processing (NLP). NLP systems begin with modules such as word segmentation, part-of-speech tagging, and morphological analysis and progress to modules such as shallow parsing and named entity tagging. While there have been considerable advances in developing such comprehensive NLP systems for English, the work for Urdu is still in its infancy. The tasks of interest in Urdu NLP include analyzing data sources such as blogs and comments to news articles to provide insight into social and human behavior. All of this requires a robust NLP system. The objective of this work is to develop an NLP infrastructure for Urdu that is customizable and capable of providing basic analysis on which more advanced information extraction tools can be built. This system assimilates resources from various online sources to facilitate improved named entity tagging and Urdu-to-English transliteration. The annotated data required to train the learning models used here is acquired by standardizing the currently limited resources available for Urdu. Techniques such as bootstrap learning and resource sharing from a syntactically similar language, Hindi, are explored to augment the available annotated Urdu data. Each of the new Urdu text processing modules has been integrated into a general text-mining platform. The evaluations performed demonstrate that the accuracies have either met or exceeded the state of the art.

55 citations


Book ChapterDOI
08 Nov 2010
TL;DR: This paper uses a sentiment-annotated, lexicon-based approach for sentiment analysis in Urdu, and aims to highlight the linguistic as well as technical aspects of this multidimensional research problem.
Abstract: Like other languages, Urdu websites are becoming more popular, because people prefer to share opinions and express sentiments in their own language. Sentiment analyzers developed for other well-studied languages, like English, are not workable for Urdu, due to its script, morphological, and grammatical differences. As a result, this language should be studied as an independent problem domain. Our approach towards sentiment analysis is based on the identification and extraction of SentiUnits from the given text, using shallow parsing. SentiUnits are the expressions which carry the sentiment information in a sentence. We use a sentiment-annotated, lexicon-based approach. Unfortunately, no such lexicon exists for the Urdu language, so a major part of this research consists in developing one. Hence, this paper is presented as a baseline for this colossal and complex task. Our goal is to highlight the linguistic (grammar and morphology) as well as technical aspects of this multidimensional research problem. The performance of the system is evaluated on multiple texts and the achieved results are quite satisfactory.
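A minimal illustration of the SentiUnit idea described above might look like the sketch below; the lexicon entries, chunk format, and negation handling are invented for the example and are not taken from the paper.

```python
# Hypothetical sketch of a lexicon-based SentiUnit scorer: chunks produced by a
# shallow parser are looked up in a sentiment-annotated lexicon and their
# polarities are aggregated. Lexicon entries and chunk format are invented here.
from typing import List

# Toy sentiment lexicon: word -> polarity in [-1, 1]
SENTI_LEXICON = {"acha": 0.8, "kharab": -0.7, "behtareen": 1.0}

# Negation markers that flip the polarity of the unit they attach to
NEGATIONS = {"nahi", "na"}

def score_sentiunits(chunks: List[List[str]]) -> float:
    """Each chunk is a list of tokens from the shallow parser; the sentence
    score is the sum of the polarities of its SentiUnits."""
    total = 0.0
    for chunk in chunks:
        polarity = sum(SENTI_LEXICON.get(token, 0.0) for token in chunk)
        if any(tok in NEGATIONS for tok in chunk):
            polarity = -polarity          # negation flips the unit's orientation
        total += polarity
    return total

print(score_sentiunits([["film", "behtareen"], ["kahani", "acha", "nahi"]]))
```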

51 citations


BookDOI
15 Dec 2010
TL;DR: This book discusses second language processing and parsing in English and second language gap processing in Japanese scrambling under a Simpler Syntax account, and the processing of subject-object ambiguities by English and Dutch L2 learners of German.
Abstract: 1. Preface 2. Part I. Introduction 3. Second language processing and parsing: The issues (by VanPatten, Bill) 4. Part II. Relative clauses and wh-movement 5. Relative clause attachment preferences of Turkish L2 speakers of English: Shallow parsing in the L2? (by Dinctopal-Deniz, Nazik) 6. Evidence of syntactic constraints in the processing of wh-movement: A study of Najdi Arabic learners of English (by Aldwayan, Saad) 7. Constraints on L2 learners' processing of wh-dependencies: Evidence from eye movements (by Cunnings, Ian) 8. Part III. Gender and number 9. The effects of linear distance and working memory on the processing of gender agreement in Spanish (by Keating, Gregory D.) 10. Feature assembly in early stages of L2 acquisition: Processing evidence from L2 French (by Renaud, Claire) 11. Part IV. Subjects and objects 12. Second language processing in Japanese scrambled sentences (by Mitsugi, Sanako) 13. Second language gap processing of Japanese scrambling under a Simpler Syntax account (by Hara, Masahiro) 14. The processing of subject-object ambiguities by English and Dutch L2 learners of German (by Jackson, Carrie N.) 15. Connections between processing, production and placement: Acquiring object pronouns in spanish as a second language (by Malovrh, Paul A.) 16. Part V. Phonology and lexicon 17. The exploitation of fine phonetic detail in the processing of L2 French (by Shoemaker, Ellenor M.) 18. Translation ambiguity: Consequences for learning and processing (by Tokowicz, Natasha) 19. Part VI. Prosody and context 20. Reading aloud in two languages: The interplay of syntax and prosody (by Fernandez, Eva M.) 21. Near-nativelike processing of contrastive focus in L2 French (by Reichle, Robert) 22. Author index 23. Subject index

44 citations


Proceedings Article
02 Jun 2010
TL;DR: This work uses a classification method to aid human annotation of output parses and shows that knowledge about multiword expressions leads to an increase of between 7.5% and 9.5% in shallow parsing accuracy.
Abstract: There is significant evidence in the literature that integrating knowledge about multiword expressions can improve shallow parsing accuracy. We present an experimental study to quantify this improvement, focusing on compound nominals, proper names and adjective-noun constructions. The evaluation set of multiword expressions is derived from WordNet and the textual data are downloaded from the web. We use a classification method to aid human annotation of output parses. This method allows us to conduct experiments on a large dataset of unannotated data. Experiments show that knowledge about multiword expressions leads to an increase of between 7.5% and 9.5% in accuracy of shallow parsing in sentences containing these multiword expressions.
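As a rough sketch of how such knowledge could be integrated (the MWE list and the token-joining convention below are assumptions, not the authors' setup), known multiword expressions can be merged into single tokens before chunking so the chunker cannot split them:

```python
# Illustrative sketch: known multiword expressions are merged into single
# tokens before shallow parsing. The MWE inventory is a placeholder.
MWES = {("new", "york", "city"), ("machine", "learning")}
MAX_LEN = max(len(m) for m in MWES)

def merge_mwes(tokens):
    out, i = [], 0
    while i < len(tokens):
        for n in range(min(MAX_LEN, len(tokens) - i), 1, -1):
            if tuple(t.lower() for t in tokens[i:i + n]) in MWES:
                out.append("_".join(tokens[i:i + n]))   # one token for the whole MWE
                i += n
                break
        else:
            out.append(tokens[i])
            i += 1
    return out

print(merge_mwes("He studies machine learning in New York City".split()))
```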

32 citations


Proceedings Article
01 May 2010
TL;DR: The paper concentrates on the delimitation of syntactic words (analytical forms, reflexive verbs, discontinuous conjunctions, etc.) and syntactic groups, as well as on problems encountered during the annotation process: syntactic group boundaries, multiword entities, abbreviations, discontinuous phrases and syntactic words.
Abstract: The paper presents the procedure of syntactic annotation of the National Corpus of Polish. The paper concentrates on the delimitation of syntactic words (analytical forms, reflexive verbs, discontinuous conjunctions, etc.) and syntactic groups, as well as on problems encountered during the annotation process: syntactic group boundaries, multiword entities, abbreviations, discontinuous phrases and syntactic words. It includes the complete tagset for syntactic words and the list of syntactic groups recognized in NKJP. The tagset defines grammatical classes and categories according to morphosyntactic and syntactic criteria only. Syntactic annotation in the National Corpus of Polish is limited to making constituents of combinations of words. Annotation depends on shallow parsing and manual post-editing of the results by annotators. Manual annotation is performed by two independent annotators, with a referee in cases of disagreement. The manually constructed grammar, both for syntactic words and for syntactic groups, is encoded in the shallow parsing system Spejd.

30 citations


Proceedings ArticleDOI
23 Aug 2010
TL;DR: A new information extraction system based on statistical shallow parsing in unconstrained handwritten documents is introduced; it relies on a strong and powerful global handwriting model in which an entire text line is modeled with Hidden Markov Models.
Abstract: In this paper, a new information extraction system based on statistical shallow parsing in unconstrained handwritten documents is introduced. Unlike classical approaches found in the literature, such as keyword spotting or full document recognition, our approach relies on a strong and powerful global handwriting model. An entire text line is considered as an indivisible entity and is modeled with Hidden Markov Models. In this way, text line shallow parsing allows fast extraction of the relevant information in any document while rejecting irrelevant information at the same time. First results are promising and show the interest of the approach.

25 citations


Dissertation
20 Sep 2010
TL;DR: The results show that it is possible to recognise multiword expressions and decide their compositionality in an unsupervised manner, based on cooccurrence statistics and distributional semantics, and that multiword expressions are beneficial for other fundamental applications of Natural Language Processing, either by direct integration or as an evaluation tool.
Abstract: Multiword expressions are expressions consisting of two or more words that correspond to some conventional way of saying things (Manning & Schutze 1999). Due to the idiomatic nature of many of them and their high frequency of occurrence in all sorts of text, they cause problems in many Natural Language Processing (NLP) applications and are frequently responsible for their shortcomings. Efficiently recognising multiword expressions and deciding the degree of their idiomaticity would be useful to all applications that require some degree of semantic processing, such as question-answering, summarisation, parsing, language modelling and language generation. In this thesis we investigate the issues of recognising multiword expressions, domain-specific or not, and of deciding whether they are idiomatic. Moreover, we inspect the extent to which multiword expressions can contribute to a basic NLP task such as shallow parsing, and ways that the basic property of multiword expressions, idiomaticity, can be employed to define a novel task for Compositional Distributional Semantics (CDS). The results show that it is possible to recognise multiword expressions and decide their compositionality in an unsupervised manner, based on cooccurrence statistics and distributional semantics. Further, multiword expressions are beneficial for other fundamental applications of Natural Language Processing, either by direct integration or as an evaluation tool. In particular, termhood-based methods, which are based on nestedness information, are shown to outperform unithood-based methods, which measure the strength of association among the constituents of a multiword candidate term. A simple heuristic was shown to perform better than more sophisticated methods. A new graph-based algorithm employing sense induction is proposed to address multiword expression compositionality and is shown to perform better than a standard vector space model. Its parameters were estimated by an unsupervised scheme based on graph connectivity. Multiword expressions are shown to contribute to shallow parsing. Moreover, they are used to define a new evaluation task for distributional semantic composition models.
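A toy sketch of the cooccurrence-statistics idea follows, scoring bigram candidates by pointwise mutual information; the miniature corpus and the choice of PMI are placeholders for the thesis's actual association measures.

```python
# Rough sketch: score bigram MWE candidates by pointwise mutual information
# over a (tiny, made-up) corpus. Higher PMI suggests the pair behaves like a
# fixed, possibly idiomatic unit.
import math
from collections import Counter

corpus = ("the stock market fell while the stock market rally faded "
          "a red herring confused the reader a red herring again").split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
N = len(corpus)

def pmi(w1, w2):
    p_xy = bigrams[(w1, w2)] / (N - 1)
    p_x, p_y = unigrams[w1] / N, unigrams[w2] / N
    return math.log2(p_xy / (p_x * p_y)) if p_xy > 0 else float("-inf")

for pair in [("red", "herring"), ("the", "stock")]:
    print(pair, round(pmi(*pair), 2))
```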

18 citations


Proceedings Article
23 Aug 2010
TL;DR: The crux of the approach is to use a powerful morphological analyzer backed by a high-coverage lexicon to generate rich features for a CRF-based sequence classifier for shallow parsing of a morphologically rich language, Marathi.
Abstract: Verb suffixes and verb complexes of morphologically rich languages carry a lot of information. We show that this information, if harnessed for the task of shallow parsing, can lead to dramatic improvements in accuracy for a morphologically rich language, Marathi. The crux of the approach is to use a powerful morphological analyzer backed by a high-coverage lexicon to generate rich features for a CRF-based sequence classifier. Accuracy figures of 94% for Part of Speech Tagging and 97% for Chunking using a modestly sized corpus (20K words) vindicate our claim that for morphologically rich languages, linguistic insight can obviate the need for large amounts of annotated corpora.
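A hedged sketch of the feature-generation step is given below; the token forms, morphological tags, and feature names are invented for illustration, and a real system would pass such per-token dictionaries to a CRF toolkit such as CRFsuite.

```python
# Sketch of rich feature extraction for a CRF chunker: each token becomes a
# feature dict combining surface form, suffix slices and (hypothetical)
# morphological analyzer output.
def token_features(tokens, morph_tags, i):
    """tokens: word list; morph_tags: per-token output of a morphological
    analyzer (root, POS, case/aspect markers) - placeholder strings here."""
    word = tokens[i]
    return {
        "word": word.lower(),
        "suffix2": word[-2:],          # verb/case suffixes carry much information
        "suffix3": word[-3:],
        "morph": morph_tags[i],        # e.g. "root=kar|pos=VM|aspect=perf"
        "is_first": i == 0,
        "prev_morph": morph_tags[i - 1] if i > 0 else "BOS",
    }

tokens = ["tyane", "kaam", "kele"]
morph = ["root=to|pos=PRP|case=erg", "root=kaam|pos=NN", "root=kar|pos=VM|aspect=perf"]
print(token_features(tokens, morph, 2))
```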

17 citations


Book ChapterDOI
01 Jan 2010
TL;DR: The emphasis is put on how understanding the syntactic and lexical characteristics of this specialised language has practical importance in the development of domain-specific Knowledge Management applications.
Abstract: This work is an investigation into the peculiarities of legal language with respect to ordinary language. Based on the idea that a shallow parsing approach can help to provide enough detailed linguistic information, this work presents the results obtained by shallow parsing (i.e. chunking) corpora of Italian and English legal texts and comparing them with corpora of ordinary language. In particular, this paper puts the emphasis on how understanding the syntactic and lexical characteristics of this specialised language has practical importance in the development of domain-specific Knowledge Management applications.

16 citations


Book ChapterDOI
08 Nov 2010
TL;DR: DiSeg is presented, the first discourse segmenter for Spanish, which uses the framework of Rhetorical Structure Theory and is based on lexical and syntactic rules, obtaining promising results.
Abstract: Nowadays discourse parsing is a very prominent research topic. However, there is not a discourse parser for Spanish texts. The first stage in order to develop this tool is discourse segmentation. In this work, we present DiSeg, the first discourse segmenter for Spanish, which uses the framework of Rhetorical Structure Theory and is based on lexical and syntactic rules. We describe the system and we evaluate its performance against a gold standard corpus, obtaining promising results.
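For illustration only, a marker-based segmenter in the spirit of DiSeg might propose boundaries before a small list of Spanish discourse markers; the marker list and the regex below are simplifications, not DiSeg's actual lexical and syntactic rules.

```python
# Toy rule-based discourse segmentation: split a sentence before a few
# discourse markers. Real segmenters also use syntactic cues (finite verbs,
# clause boundaries) that are omitted here.
import re

MARKERS = ["aunque", "porque", "cuando", "para que", "sin embargo"]
pattern = re.compile(r"\s*(?=\b(?:" + "|".join(map(re.escape, MARKERS)) + r")\b)",
                     flags=re.IGNORECASE)

def segment(sentence: str):
    return [p.strip() for p in pattern.split(sentence) if p.strip()]

print(segment("Iremos al cine aunque llueva porque tenemos entradas"))
```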

Proceedings ArticleDOI
16 Nov 2010
TL;DR: The shallow parsing of isolated text lines allows quick information extraction in any document while rejecting at the same time irrelevant information in unconstrained handwritten documents.
Abstract: In this paper, we introduce an alpha-numerical sequence extraction system (keywords, numerical fields or alpha-numerical sequences) for unconstrained handwritten documents. Contrary to most of the approaches presented in the literature, our system relies on a global handwriting line model describing two kinds of information: i) the relevant information and ii) the irrelevant information, represented by a shallow parsing model. The shallow parsing of isolated text lines allows quick information extraction in any document while rejecting irrelevant information at the same time. Results on a public French incoming-mail database show the efficiency of the approach.


Proceedings Article
Xian Qian, Qi Zhang, Yaqian Zhou, Xuanjing Huang, Lide Wu
09 Oct 2010
TL;DR: A novel method which integrates graph structures of two sub-tasks into one using virtual nodes, and performs joint training and decoding in the factorized state space is presented.
Abstract: Many sequence labeling tasks in NLP require solving a cascade of segmentation and tagging subtasks, such as Chinese POS tagging, named entity recognition, and so on. Traditional pipeline approaches usually suffer from error propagation. Joint training/decoding in the cross-product state space could cause too many parameters and high inference complexity. In this paper, we present a novel method which integrates graph structures of two sub-tasks into one using virtual nodes, and performs joint training and decoding in the factorized state space. Experimental evaluations on CoNLL 2000 shallow parsing data set and Fourth SIGHAN Bakeoff CTB POS tagging data set demonstrate the superiority of our method over cross-product, pipeline and candidate reranking approaches.

Proceedings Article
Weiwei Sun
11 Jul 2010
TL;DR: This work proposes semantics-driven shallow parsing, which takes into account both syntactic structures and predicate-argument structures, and introduces several new "path" features to improve shallow parsing based SRL method.
Abstract: One deficiency of current shallow parsing based Semantic Role Labeling (SRL) methods is that syntactic chunks are too small to effectively group words. To partially resolve this problem, we propose semantics-driven shallow parsing, which takes into account both syntactic structures and predicate-argument structures. We also introduce several new "path" features to improve shallow parsing based SRL method. Experiments indicate that our new method obtains a significant improvement over the best reported Chinese SRL result.

01 Jan 2010
TL;DR: This paper describes UAIC’s Question Answering systems participating in the ResPubliQA 2010 competition, designed to answer questions on a juridical corpus in Romanian, English and French monolingual tasks.
Abstract: This paper describes UAIC’s Question Answering systems participating in the ResPubliQA 2010 competition, designed to answer questions on a juridical corpus in Romanian, English and French monolingual tasks. Our systems adhere to the classical architecture of a Question Answering system, with an emphasis on simplicity and real-time answers: only shallow parsing was used for question processing, the indexes for the retrieval module were built at coarse-grained paragraph level, and the answer extraction component used simple pattern-based rules and lexical similarity metrics for candidate answer ranking.
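A minimal sketch of the lexical-similarity ranking step is shown below; the stopword list and the Jaccard scoring are illustrative assumptions, not UAIC's actual metrics.

```python
# Toy candidate-paragraph ranking by lexical overlap (Jaccard) with the question.
STOPWORDS = {"the", "of", "a", "is", "what", "for", "in"}

def tokens(text):
    return {w.strip("?,.;:").lower() for w in text.split()} - STOPWORDS

def jaccard(q, p):
    tq, tp = tokens(q), tokens(p)
    return len(tq & tp) / len(tq | tp) if tq | tp else 0.0

question = "What is the deadline for submitting the annual report?"
paragraphs = [
    "The annual report shall be submitted before the deadline of 31 March.",
    "Member states may adopt additional measures on imports.",
]
# Rank retrieved paragraphs by similarity to the question and keep the best one
best = max(paragraphs, key=lambda p: jaccard(question, p))
print(best)
```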

01 Jan 2010
TL;DR: A method for classifying Croatian sentences by structure and detecting independent and dependent clauses within these sentences is presented and evaluated, together with a discussion of the obtained results and future research directions.
Abstract: We present a method for classifying Croatian sentences by structure and detecting independent and dependent clauses within these sentences and provide its evaluation. A prototype system applying the method was implemented by using the NooJ linguistic development environment, both for purposes of this experiment and for further utilization in a prototype rule-based chunking and shallow parsing system for Croatian. With regards to pre-processing, we implemented and evaluated three different approaches to designing the system: (1) no pre-processing of input sentences, (2) automatic morphosyntactic tagging of sentences by using the CroTag stochastic tagger and (3) manual morphosyntactic annotation of input sentences. All three approaches were evaluated for sentence classification and clause detection accuracy in terms of precision and recall. The highest scoring system was the one using sentences with manually assigned morphosyntactic tags as input and it scored an overall F1-measure of 0.861 (P: 0.928, R: 0.813). In the paper, a more detailed discussion of system design and experiment setup is provided, followed by a discussion of the obtained results and future research directions.

DOI
08 Aug 2010
TL;DR: The paper describes Aelred, a web application that demonstrates the use of language technology in the Google App Engine cloud computing environment, serving English literary texts with a range of linguistic annotations including part-of-speech tagging, shallow parsing, and word sense definitions from WordNet.
Abstract: The paper describes Aelred, a web application that demonstrates the use of language technology in the Google App Engine cloud computing environment. Aelred serves up English literary texts with optional concordances for any word and a range of linguistic annotations including part-of-speech tagging, shallow parsing, and word sense definitions from WordNet. Two alternative approaches are described. In the first approach, annotations are created offline and uploaded to the cloud datastore. In the second approach, annotations are created online within the cloud computing framework. In both cases standard HTML is generated with a template engine so that the annotations can be viewed in ordinary web browsers.

Proceedings ArticleDOI
17 Jan 2010
TL;DR: Although the program finds date of birth information with high precision and recall, this type of information extraction task seems to be negatively impacted by OCR errors.
Abstract: This paper presents the implementation and evaluation of a pattern-based program to extract date of birth information from OCR text. Although the program finds date of birth information with high precision and recall, this type of information extraction task seems to be negatively impacted by OCR errors.
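A small pattern-based extractor of this kind could be sketched as follows; the regular expressions are illustrative, not the ones evaluated in the paper.

```python
# Sketch of pattern-based date-of-birth extraction from noisy OCR text.
import re

DOB_PATTERNS = [
    re.compile(r"born\s+(?:on\s+)?(\d{1,2}\s+\w+\s+\d{4})", re.IGNORECASE),
    re.compile(r"\(b\.\s*(\d{4})\)"),
    re.compile(r"date of birth[:\s]+(\d{1,2}[/-]\d{1,2}[/-]\d{2,4})", re.IGNORECASE),
]

def extract_dob(text):
    hits = []
    for pat in DOB_PATTERNS:
        hits.extend(pat.findall(text))
    return hits

# OCR noise ("SM1TH") does not break the date patterns in this toy example
ocr_text = "JOHN SM1TH, born 12 March 1874 in Boston (b. 1874), date of birth: 12/03/1874"
print(extract_dob(ocr_text))
```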

01 Jan 2010
TL;DR: In this thesis, the authors propose a hybrid approach, which combines shallow parsing and pattern matching to extract relations between drugs from biomedical texts, and a second approximation based on a supervised machine learning approach, in particular kernel methods.
Abstract: A drug-drug interaction occurs when one drug influences the level or activity of another drug. The detection of drug interactions is an important research area in patient safety since these interactions can become very dangerous and increase health care costs. Although there are different databases supporting health care professionals in the detection of drug interactions, this kind of resource is rarely complete. Drug interactions are frequently reported in journals of clinical pharmacology, making medical literature the most effective source for the detection of drug interactions. However, the increasing volume of the literature overwhelms health care professionals trying to keep an up-to-date collection of all reported drug-drug interactions. The development of automatic methods for collecting, maintaining and interpreting this information is crucial to achieving a real improvement in their early detection. Information Extraction techniques can provide an interesting way to reduce the time spent by health care professionals on reviewing the literature. Nevertheless, only a few approaches have tackled the extraction of drug-drug interactions. In this thesis, we have conducted a detailed study of various information extraction techniques applied to the biomedical domain. Based on this study, we have proposed two different approximations for the extraction of drug-drug interactions from texts. The first approximation proposes a hybrid approach, which combines shallow parsing and pattern matching to extract relations between drugs from biomedical texts. The second approximation is based on a supervised machine learning approach, in particular, kernel methods. In addition, we have created DrugDDI, the first corpus annotated with drug-drug interactions, which allows us to evaluate and compare both approximations. We think the DrugDDI corpus is an important contribution because it could encourage other research groups to investigate this problem. To the best of our knowledge, the DrugDDI corpus is the only available corpus annotated for drug-drug interactions, and this thesis is the first work which addresses the problem of extracting drug-drug interactions from biomedical texts. We have also defined three auxiliary processes to provide crucial information, which will be used by the aforementioned approximations. These auxiliary tasks are as follows: (1) a process for text analysis based on the UMLS MetaMap Transfer tool (MMTx) to provide shallow syntactic and semantic information from texts, (2) a process for drug name recognition and classification, and (3) a process for drug anaphora resolution. Finally, we have developed a pipeline prototype which integrates the different auxiliary processes. The pipeline architecture allows us to easily integrate these modules with each of the approaches proposed in this thesis: pattern matching or kernels. Several experiments were performed on the DrugDDI corpus. They show that while the first approximation based on pattern matching achieves low performance, the approach based on kernel methods achieves a performance comparable to that obtained by approaches which carry out a similar task, such as the extraction of protein-protein interactions.
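The pattern-matching approximation could be sketched roughly as follows; the drug list, trigger phrases, and pattern are invented examples, not DrugDDI patterns.

```python
# Simplified sketch: lexical patterns over sentences containing two recognized
# drug names propose candidate drug-drug interactions.
import re

DRUGS = {"warfarin", "aspirin", "ibuprofen"}
TRIGGERS = r"(?:increases|decreases|inhibits|potentiates)\s+the\s+effect\s+of"

def find_interactions(sentence):
    pattern = re.compile(r"\b(\w+)\s+" + TRIGGERS + r"\s+(\w+)\b", re.IGNORECASE)
    pairs = []
    for d1, d2 in pattern.findall(sentence):
        # keep only pairs where both arguments are known drug names
        if d1.lower() in DRUGS and d2.lower() in DRUGS:
            pairs.append((d1, d2))
    return pairs

print(find_interactions("Aspirin potentiates the effect of warfarin in some patients."))
```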

Proceedings ArticleDOI
23 Oct 2010
TL;DR: To improve retrieval performance, a shallow parsing technique for text was introduced and a Chinese Web information retrieval model was designed that evaluates the matching degree between indexed documents and users’ interests based on semantic similarity calculation.
Abstract: To improve retrieval performance, a shallow parsing technique for text was introduced for Chinese Web information retrieval. Firstly, the predicate, the prepositive nominal component and the succedent nominal component close to the predicate were extracted from the Chinese sentence. Then, the semantic vector of the Chinese text was acquired by converting the predicate and nominal components to concepts. An algorithm was presented for calculating the similarity of semantic vectors, and a Chinese Web information retrieval model was designed. The model evaluates the matching degree between indexed documents and users’ interests based on semantic similarity calculation. Users’ interests were expressed by providing representative documents. Experimental results show that the precision is improved noticeably compared with a popular Web search engine.
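A minimal sketch of the matching step, comparing concept vectors by cosine similarity, is shown below; the concept identifiers and weights are placeholders for the predicate and nominal concepts described above.

```python
# Cosine similarity between a document's concept vector and a user-interest vector.
import math

def cosine(u, v):
    keys = set(u) | set(v)
    dot = sum(u.get(k, 0.0) * v.get(k, 0.0) for k in keys)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

user_interest = {"concept:purchase": 0.9, "concept:stock": 0.6}
document = {"concept:purchase": 0.4, "concept:stock": 0.8, "concept:weather": 0.2}
print(round(cosine(user_interest, document), 3))
```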

Journal ArticleDOI
TL;DR: The conditional random fields model is a valid probabilistic model for segmenting and labeling sequence data and can be used to realize chunk analysis and entity relation extraction in Chinese text.
Abstract: Currently, large amounts of information exist in Web sites and various digital media. Most of it is in natural language: easy to browse, but difficult for a computer to understand. Chunk parsing and entity relation extraction are important for understanding the semantics of information in natural language processing. Chunk analysis is a shallow parsing method, and entity relation extraction is used to establish relationships between entities. Because full syntactic parsing is complex for Chinese text understanding, many researchers are more interested in chunk analysis and relation extraction. The conditional random fields (CRFs) model is a valid probabilistic model for segmenting and labeling sequence data. This paper models the chunking and entity relation problems in Chinese text; by transforming them into a sequence labeling problem, we can use CRFs to realize chunk analysis and entity relation extraction.
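The "transform into a labeling problem" step can be illustrated by converting chunk spans into per-token BIO labels that a CRF can then learn; the tokens and chunk types below are illustrative only.

```python
# Convert chunk spans into per-token BIO labels for sequence labeling.
def spans_to_bio(n_tokens, spans):
    """spans: list of (start, end, type) with end exclusive."""
    labels = ["O"] * n_tokens
    for start, end, chunk_type in spans:
        labels[start] = f"B-{chunk_type}"
        for i in range(start + 1, end):
            labels[i] = f"I-{chunk_type}"
    return labels

tokens = ["研究", "人员", "提出", "新", "方法"]
spans = [(0, 2, "NP"), (2, 3, "VP"), (3, 5, "NP")]
print(list(zip(tokens, spans_to_bio(len(tokens), spans))))
```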

Proceedings Article
01 Jan 2010
TL;DR: The current paper is mainly focused on testing the suitability of PNEPs for shallow parsing, which analyzes the main components of sentences rather than complete sentences.
Abstract: PNEPs (Parsing Networks of Evolutionary Processors) extend NEPs with context free (instead of substituting) rules, leftmost derivation, bad terminals check and indexes to rebuild the derivation tree. It is possible to build a PNEP from any context free grammar without additional constraints, able to generate all the different derivations for ambiguous grammars with a temporal performance bound by the depth of the derivation tree. One of the main difficulties encountered by parsing techniques when building complete parsing trees for natural languages is the spatial and temporal performance of the analysis. Shallow parsing tries to overcome these difficulties. The goal of shallow parsing is to analyze the main components of the sentences (for example, noun groups, verb groups, etc.) rather than complete sentences. The current paper is mainly focused on testing the suitability of PNEPs to shallow parsing.

Journal Article
TL;DR: Discusses the integration of statistical learning methods and artificial rule methods for PP recognition, based on several typical PP recognition models at the shallow parsing level, and proposes that the combination of statistical learning methods and artificial rule methods is the future direction of development.
Abstract: In the recognition of prepositional phrases, statistical learning methods and artificial rule methods are the two major methods used. This paper discusses the integration of statistical learning methods and artificial rule methods for PP recognition, based on several typical PP recognition models at the shallow parsing level, and then points out that feature extraction is an abstraction of pragmatic rules derived from the corpus. It proposes that the combination of statistical learning methods and artificial rule methods is the future direction of development.

Journal Article
TL;DR: KorLexClas 1.5 is described, which provides a very large list of Korean numeral classifiers, along with the co-occurring noun categories that select each numeral classifier, and is expected to be used in a variety of NLP applications, including MT.
Abstract: This paper aims to describe KorLexClas 1.5, which provides a very large list of Korean numeral classifiers, along with the co-occurring noun categories that select each numeral classifier. Unlike KorLex for other parts of speech, whose structure depends largely on its reference model (Princeton WordNet), KorLexClas 1.0 and its extended version 1.5 adopt a direct building method, which demands considerable time and expert knowledge to establish the hierarchies of numeral classifiers and the relationships between lexical items. For the efficiency of construction as well as the reliability of KorLexClas 1.5, we use the following processes: (1) using various language resources, cross-checked against each other, for the selection of classifier candidates; (2) extending the list of numeral classifiers by using shallow parsing techniques; (3) setting up the hierarchies of the numeral classifiers based on previous linguistic studies; and (4) determining the LUB (Least Upper Bound) of the numeral classifiers in KorLexNoun 1.5. The last process provides an open, extensible list of co-occurring nouns for KorLexClas 1.5. KorLexClas 1.5 is expected to be used in a variety of NLP applications, including MT.

01 Jan 2010
TL;DR: The approach is an effective way to parse a tennis game from a stream of events with minimal human intervention, and makes use of some extra contextual information, namely the time gap between two adjacent match events, which is in itself a reasonable indicator of segmentation.
Abstract: This paper proposes a method to infer the syntactical units of a sports game (tennis) from a stream of game events. We assume that we are given a sequence of events within the game (examples of events are “serve”, “rally”, “score announcement” etc.), with their durations, and our goal is to segment them into “units” that are meaningful for the game, such as a “point”. Such a segmentation is essential for understanding the way that the events relate to each other, and hence for inferring automatically the structure of the game. We use a multi-gram based technique to segment the event stream into variable-length sequences by estimating the optimal (maximum-likelihood) segmentation using the Viterbi algorithm. We then make use of some extra contextual information, namely the time gap between two adjacent match events, which is in itself a reasonable indicator of segmentation. By integrating this feature into the multigram segmentation, we considerably enhance segmentation performance. The results show that our approach is an effective way to parse a tennis game from a stream of events with minimal human intervention. Keywords: shallow parsing; variable-length unit; segmentation; game learning.
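A rough sketch of the maximum-likelihood multigram segmentation follows, using a Viterbi-style dynamic program over segment boundaries; the unit inventory and probabilities are made up for the example, whereas a real model estimates them iteratively and also exploits the time-gap feature.

```python
# Viterbi-style dynamic program: choose the most probable partition of the
# event stream into variable-length units, given (hypothetical) unit log-probabilities.
import math

UNIT_LOGP = {
    ("serve", "rally", "score"): math.log(0.20),
    ("serve", "fault"): math.log(0.15),
    ("serve",): math.log(0.05),
    ("rally",): math.log(0.05),
    ("score",): math.log(0.05),
    ("fault",): math.log(0.05),
}
MAX_UNIT = max(len(u) for u in UNIT_LOGP)

def segment(events):
    n = len(events)
    best = [float("-inf")] * (n + 1)   # best log-probability of a segmentation of events[:i]
    back = [0] * (n + 1)               # length of the last unit in that best segmentation
    best[0] = 0.0
    for i in range(1, n + 1):
        for k in range(1, min(MAX_UNIT, i) + 1):
            unit = tuple(events[i - k:i])
            if unit in UNIT_LOGP and best[i - k] + UNIT_LOGP[unit] > best[i]:
                best[i] = best[i - k] + UNIT_LOGP[unit]
                back[i] = k
    # recover the segmentation by walking back through the chosen unit lengths
    units, i = [], n
    while i > 0:
        units.append(tuple(events[i - back[i]:i]))
        i -= back[i]
    return list(reversed(units))

print(segment(["serve", "fault", "serve", "rally", "score"]))
```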
