Topic

Shallow parsing

About: Shallow parsing is a research topic. Over its lifetime, 397 publications have been published within this topic, receiving 10,211 citations.


Papers
Dissertation
06 Feb 1998
TL;DR: In this paper, a relaxation labelling algorithm is applied to NLP disambiguation, where language is modelled through context constraints inspired by Constraint Grammars.
Abstract: The thesis describes the application of the relaxation labelling algorithm to NLP disambiguation. Language is modelled through context constraints inspired by Constraint Grammars. The constraints enable the use of real values stating "compatibility". The technique is applied to POS tagging, Shallow Parsing and Word Sense Disambiguation. Experiments and results are reported. The proposed approach enables the use of multi-feature constraint models, the simultaneous resolution of several NL disambiguation tasks, and the collaboration of linguistic and statistical models.
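As a rough illustration of the technique described above, the sketch below implements the generic relaxation labelling update for tag disambiguation. The update rule is the standard textbook one, not the thesis's constraint formalism, and the candidate tags, weights and compatibility values are invented for illustration.

```python
# Minimal sketch of relaxation labelling for tag disambiguation.
# Assumes non-negative compatibility values; all numbers are illustrative.

def relaxation_labelling(candidates, compatibility, iterations=10):
    """candidates: one {tag: initial_weight} dict per word.
    compatibility(i, tag_i, j, tag_j) -> non-negative real "compatibility"."""
    # Normalise the initial weights into probability distributions.
    probs = [{t: w / sum(c.values()) for t, w in c.items()} for c in candidates]

    for _ in range(iterations):
        new_probs = []
        for i, dist in enumerate(probs):
            # Support for each candidate tag: compatibility with the current
            # weighted labelling of every other word.
            support = {
                tag: sum(p * compatibility(i, tag, j, other_tag)
                         for j, other in enumerate(probs) if j != i
                         for other_tag, p in other.items())
                for tag in dist
            }
            # Classic update: boost each label in proportion to its support,
            # then renormalise.
            denom = sum(dist[t] * (1.0 + support[t]) for t in dist)
            new_probs.append({t: dist[t] * (1.0 + support[t]) / denom for t in dist})
        probs = new_probs
    return [max(d, key=d.get) for d in probs]

# Toy usage: a determiner to the left favours the noun reading of the next word.
cands = [{"DT": 1.0}, {"NN": 0.5, "VBZ": 0.5}]
compat = lambda i, ti, j, tj: 0.9 if {ti, tj} == {"DT", "NN"} else 0.1
print(relaxation_labelling(cands, compat))   # -> ['DT', 'NN']
```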

62 citations

Proceedings ArticleDOI
26 Oct 2003
TL;DR: The results from the conducted evaluation suggest that the new procedure is very effective, saving time and labour considerably, and that the test items produced with the help of the program are not of inferior quality to those produced manually.
Abstract: Summary form only given. The paper describes a novel automatic procedure for the generation of multiple-choice tests from electronic documents. In addition to employing various NLP techniques, including term extraction and shallow parsing, the system makes use of language resources such as corpora and ontologies. The system operates in a fully automatic mode and also in a semiautomatic environment where the user is offered the option to post-edit the generated test items. The results from the conducted evaluation suggest that the new procedure is very effective, saving time and labour considerably, and that the test items produced with the help of the program are not of inferior quality to those produced manually.
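To make the kind of pipeline described above concrete, here is a small gap-fill item generator that combines term extraction with a shallow parse. spaCy and its en_core_web_sm model are stand-ins chosen for this sketch; the paper does not name its tools, and the heuristics for picking the key term and the distractors are invented for illustration.

```python
# Hedged sketch of multiple-choice (gap-fill) item generation from a document,
# using noun chunks from a shallow parse as candidate terms. spaCy is an
# assumption; the paper does not specify which NLP toolkit it uses.
import random
import spacy

nlp = spacy.load("en_core_web_sm")

def generate_items(text, n_distractors=3):
    doc = nlp(text)
    # Term extraction via shallow parsing: multi-word noun chunks.
    terms = list({chunk.text for chunk in doc.noun_chunks if len(chunk) > 1})
    items = []
    for sent in doc.sents:
        sent_terms = [t for t in terms if t in sent.text]
        if not sent_terms:
            continue
        answer = max(sent_terms, key=len)            # longest in-sentence term is the key
        stem = sent.text.replace(answer, "_____")    # blank it out in the stem
        pool = [t for t in terms if t != answer and t not in sent.text]
        distractors = random.sample(pool, k=min(n_distractors, len(pool)))
        items.append({"stem": stem, "answer": answer, "distractors": distractors})
    return items

# Example: items = generate_items(open("chapter.txt").read())
```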

62 citations

Posted Content
TL;DR: In this paper, the problem of shallow parsing of Hindi-English code-mixed social media text (CSMT) has been addressed, and a shallow parser has been developed.
Abstract: In this study, the problem of shallow parsing of Hindi-English code-mixed social media text (CSMT) has been addressed. We have annotated the data, developed a language identifier, a normalizer, a part-of-speech tagger and a shallow parser. To the best of our knowledge, we are the first to attempt shallow parsing on CSMT. The pipeline developed has been made available to the research community with the goal of enabling better text analysis of Hindi-English CSMT. The pipeline is accessible at 1 .
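The pipeline described above (language identifier, normalizer, POS tagger, shallow parser) can be pictured with the following structural sketch. Every component here is a hypothetical placeholder with toy rules; this is not the authors' released pipeline, and the lexicons, tags and chunking rule are invented for illustration.

```python
# Structural sketch of a shallow-parsing pipeline for code-mixed social media
# text: language ID -> normalisation -> POS tagging -> NP chunking.
# All components are toy placeholders, not the authors' released tools.
import re

ROMAN_HINDI_HINTS = {"hai", "nahi", "kya", "aur", "bahut"}   # toy lexicon
NOUNISH_ENGLISH = {"movie", "song", "friend", "day"}         # toy lexicon

def identify_language(token):
    # Toy rule: lexicon lookup for romanised Hindi, otherwise assume English.
    return "hi" if token in ROMAN_HINDI_HINTS else "en"

def normalize(token):
    # Collapse character elongation common in social media ("sooooo" -> "soo").
    return re.sub(r"(.)\1{2,}", r"\1\1", token.lower())

def pos_tag(token, lang):
    # Placeholder: a real pipeline would back off to per-language taggers here.
    if lang == "hi":
        return "X" if token in {"hai", "nahi"} else "NOUN"
    return "NOUN" if token in NOUNISH_ENGLISH else "X"

def chunk(tagged):
    # Naive shallow parse: maximal runs of NOUN tags become NP chunks.
    chunks, current = [], []
    for tok, tag in tagged:
        if tag == "NOUN":
            current.append(tok)
        elif current:
            chunks.append(("NP", current))
            current = []
    if current:
        chunks.append(("NP", current))
    return chunks

def shallow_parse(sentence):
    tokens = [normalize(t) for t in sentence.split()]
    tagged = [(t, pos_tag(t, identify_language(t))) for t in tokens]
    return chunk(tagged)

print(shallow_parse("movie bahut achhi hai yaar sooooo good"))
```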

59 citations

Proceedings Article
09 Aug 2003
TL;DR: This work employs a DBN for natural language processing and shows how to assemble a wealth of emerging linguistic instruments for shallow parsing, syntactic and semantic tagging, morphological decomposition, named entity recognition, etc. in order to incrementally build a robust information extraction system.
Abstract: Dynamic Bayesian networks (DBNs) offer an elegant way to integrate various aspects of language in one model. Many existing algorithms developed for learning and inference in DBNs are applicable to probabilistic language modeling. To demonstrate the potential of DBNs for natural language processing, we employ a DBN in an information extraction task. We show how to assemble a wealth of emerging linguistic instruments for shallow parsing, syntactic and semantic tagging, morphological decomposition, named entity recognition, etc. in order to incrementally build a robust information extraction system. Our method outperforms previously published results on an established benchmark domain.
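As a reduced illustration of the modelling idea, the sketch below decodes a joint (POS, named-entity) state sequence with Viterbi; a factored-state HMM like this is the simplest special case of a DBN. The state sets and callable probability interfaces are placeholders, far simpler than the model described in the paper.

```python
# Minimal sketch: joint decoding over a factored hidden state (POS tag, NE label),
# the simplest DBN-style integration of two linguistic layers in one model.
# All state sets are illustrative; probabilities are supplied by the caller.
import itertools

POS = ["NOUN", "VERB", "OTHER"]
NE = ["PER", "O"]
STATES = list(itertools.product(POS, NE))   # joint (POS, NE) hidden states

def viterbi(words, emit_prob, trans_prob, init_prob):
    """emit_prob(state, word), trans_prob(prev_state, state) and
    init_prob(state) are callables returning probabilities."""
    V = [{s: init_prob(s) * emit_prob(s, words[0]) for s in STATES}]
    back = []
    for w in words[1:]:
        scores, pointers = {}, {}
        for s in STATES:
            best_prev = max(STATES, key=lambda p: V[-1][p] * trans_prob(p, s))
            scores[s] = V[-1][best_prev] * trans_prob(best_prev, s) * emit_prob(s, w)
            pointers[s] = best_prev
        V.append(scores)
        back.append(pointers)
    # Trace back the most probable joint (POS, NE) sequence.
    state = max(STATES, key=lambda s: V[-1][s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return list(reversed(path))
```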

58 citations

01 Jan 2005
TL;DR: Linguistics is considered as a way to predict query difficulty rather than a means to model IR, in the context of the ARIEL research project, in which the impact of linguistic processing in IR systems is investigated.
Abstract: Query difficulty can be linked to a number of causes. Some of these causes can be related to the query expression itself, and can therefore be detected through a linguistic analysis of the query text. Using 16 different linguistic features, automatically computed on TREC queries, we looked for significant correlations between these features and the average recall and precision scores obtained by systems. Each of these features can be viewed as a clue to a linguistically specific characteristic, either morphological, syntactical or semantic. Two of these features (syntactic links span and polysemy value) are shown to have a significant impact on either recall or precision scores for previous ad hoc TREC campaigns. Although the correlation values are not very high, they indicate a promising link between some linguistic characteristics and query difficulty.

1. CONTEXT
This study has been conducted in the context of the ARIEL research project, in which we investigate the impact of linguistic processing in IR systems. The ultimate objective is to build an adaptive IR system, in which several natural language processing (NLP) techniques are available but are selectively used for a given query, depending on the predicted efficiency of each technique.

2. OBJECTIVE
Although linguistics and NLP have been viewed as natural solutions for IR, the overall efficiency of the techniques used in IR systems is doubtful at best. From fine-grained morphological analysis to query expansion based on semantic word classes, the use of linguistically sound techniques and resources has often been proven to be merely as efficient as other, cruder techniques [5] [8]. In this paper, we consider linguistics as a way to predict query difficulty rather than a means to model IR.

3. RELATED WORK
A closely related approach is the analysis performed by [7] on the CLEF topics. Their intent was to discover whether some query features could be correlated with system performance, and thus indicate a kind of bias in this evaluation campaign, and further to build a fusion-based IR engine. The linguistic features they used to describe each topic mostly concerned syntactic and word-form aspects, and were calculated by hand. They used a correlation measure between these features and the average precision, but the only significant result was a correlation of 0.4 between the number of proper nouns and average precision. Further studies led the authors to named entities as a useful feature, and they were able to propose a fusion-based model that improved overall precision after a classification of topics according to the number of named entities. The precision increase using this feature varied from 0 to 10% across several tasks (mono- and multi-lingual). Our study deals with more linguistic features, especially in order to deal with syntactic complexity. In addition, we only used automatic analysis methods with NLP techniques. Focusing on documents instead of queries, [6] also used linguistic features in order to characterize documents in IR collections. His main point was to study the notion of relevance, to test whether it could be related to stylistic features, and to see whether the genre of a document could be useful for relevant document selection. [3] also used documents in order to predict query difficulty, using a clarity score that depends on both the query and the target collection.
Both of the previous studies therefore need exhaustive information on the collection, while we decided to focus on queries only, in order to deal with a wider range of IR situations. In [2], several classes of topic failures were drawn manually, but no elements were given on how to assign a topic to a category automatically.

4. METHOD
We selected the following data: TREC 3, 5, 6 and 7 results for the ad hoc task, which corresponds to a total of 200 queries (50 per year). Each query in these collections was automatically analysed and described with 16 variables, each corresponding to a specific linguistic feature. We considered the title part of the query, as its length and format are the closest to a real user's query. Because the TREC web site makes participants' runs available (i.e. lists of retrieved documents for each query), it was possible to compute the average recall and precision scores for each run and each query (using the trec-eval utility). We then computed the average recall and precision values over runs for each query. Finally, we computed the correlation between these scores and the linguistic feature variables. These correlation values were tested for statistical significance. As a first result, while simple features dealing with the number or size of words in a query, or the presence of certain parts of speech, do not have clear consequences on a query's difficulty, more sophisticated variables led to interesting results. Globally, the syntactic complexity of a query has a negative impact on the precision scores, and the semantic ambiguity of the query words has a negative impact on the recall scores. A little less significantly, the morphological complexity of words also has a negative effect on recall.

4.1. Linguistic Features
The use of linguistic features in order to study a document is a well-known technique. It has been thoroughly used in several NLP tasks, ranging from classification to genre analysis. The principles are quite simple: the text (i.e. the query in our case) is first analysed using some generic parsing techniques (e.g. part-of-speech tagging, chunking, and parsing). Based on the tagged text data, simple programs compute the corresponding information. We used:
- TreeTagger (by H. Schmidt; available at www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/) for part-of-speech tagging and lemmatisation: this tool attributes a single morphosyntactic category to each word in the input text, based on a general lexicon and a language model;
- Syntex [4] for shallow parsing (syntactic link detection): this analyser identifies syntactic relations between words in a sentence, based on grammatical rules.
In addition, we used the following resources:
- the WordNet 1.6 semantic network to compute semantic ambiguity: this database provides, among other information, the possible meanings for a given word;
- the CELEX database for derivational morphology: this resource gives the morphological decomposition of a given word.
According to the final objective, which is an automatic classification of queries, all the features considered are computed without any human intervention, and are as such prone to processing errors.
The 16 linguistic features we computed are listed in Table 1, categorized in three classes according to their level of linguistic analysis:

Table 1: List of linguistic features

Morphological features:
  NBWORDS    number (#) of words
  LENGTH     average word length
  MORPH      average # of morphemes per word
  SUFFIX     average # of suffixed tokens per word
  PN         average # of proper nouns
  ACRO       average # of acronyms
  NUM        average # of numeral values (dates, quantities, etc.)
  UNKNOWN    average # of unknown tokens

Syntactical features:
  CONJ       average # of conjunctions
  PREP       average # of prepositions
  PP         average # of personal pronouns
  SYNTDEPTH  average syntactic depth
  SYNTDIST   average syntactic links span
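A minimal sketch of the correlation step described in the METHOD section above: it assumes per-query feature values and per-query average precision (or recall) have already been computed, and tests each feature for a significant Pearson correlation. The use of scipy and all numeric values are assumptions for illustration, not taken from the paper.

```python
# Hedged sketch: correlate per-query linguistic feature values with per-query
# average precision/recall and flag statistically significant features.
# scipy is an assumption; the paper does not state its statistical tooling.
from scipy.stats import pearsonr

def correlate_features(features, scores, alpha=0.05):
    """features: dict mapping feature name -> list of per-query values.
    scores: list of per-query average precision (or recall) values."""
    results = {}
    for name, values in features.items():
        r, p = pearsonr(values, scores)
        results[name] = {"r": r, "p": p, "significant": p < alpha}
    return results

# Toy example with made-up values for two of the Table 1 features.
features = {
    "SYNTDIST": [1.0, 2.3, 1.7, 3.1, 2.8, 1.2],
    "NBWORDS":  [3, 5, 4, 7, 6, 3],
}
avg_precision = [0.41, 0.22, 0.35, 0.12, 0.18, 0.44]
for name, res in correlate_features(features, avg_precision).items():
    print(f"{name}: r={res['r']:.2f}, p={res['p']:.3f}, significant={res['significant']}")
```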

57 citations


Network Information
Related Topics (5)
Machine translation
22.1K papers, 574.4K citations
81% related
Natural language
31.1K papers, 806.8K citations
79% related
Language model
17.5K papers, 545K citations
79% related
Parsing
21.5K papers, 545.4K citations
79% related
Query language
17.2K papers, 496.2K citations
74% related
Performance Metrics
No. of papers in the topic in previous years

Year  Papers
2021  7
2020  12
2019  6
2018  5
2017  11
2016  11