
Showing papers presented at "International Joint Conference on Natural Language Processing in 2008"




Proceedings Article
01 Jan 2008
TL;DR: An approach to adapting a parser to a new language using existing annotations in the source language achieves performance equivalent to that obtained by training on 1546 trees in the target language.
Abstract: The present paper describes an approach to adapting a parser to a new language. The target language is assumed to be much poorer in linguistic resources than the source language. The technique has been tested on two European languages due to test data availability; however, it is easily applicable to any pair of sufficiently related languages, including some of the Indic language group. Our adaptation technique, using existing annotations in the source language, achieves performance equivalent to that obtained by training on 1546 trees in the target language.

210 citations


Proceedings Article
01 Jan 2008
TL;DR: A novel approach categorizes sentences in scientific abstracts into four sections (objective, methods, results, and conclusions); experiments showed that CRFs can suitably model the rhetorical structure of abstracts.
Abstract: OBJECTIVE: The prior knowledge about the rhetorical structure of scientific abstracts is useful for various text-mining tasks such as information extraction, information retrieval, and automatic summarization. This paper presents a novel approach to categorizing sentences in scientific abstracts into four sections: objective, methods, results, and conclusions. METHOD: Formalizing the categorization task as a sequential labeling problem, we employ Conditional Random Fields (CRFs) to assign section labels to abstract sentences. The training corpus is acquired automatically from Medline abstracts. RESULTS: The proposed method outperformed the previous approaches, achieving 95.5% per-sentence accuracy and 68.8% per-abstract accuracy. CONCLUSION: The experimental results showed that CRFs can suitably model the rhetorical structure of abstracts.

172 citations
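
A minimal sketch of the kind of sequential labeling described in the abstract above, assuming the sklearn-crfsuite package; the feature set, sentences, and labels below are invented for illustration and are not those used in the paper.

import sklearn_crfsuite

def sent_features(sents, i):
    # Illustrative per-sentence features: position in the abstract plus lexical cues.
    feats = {
        "relative_position": i / len(sents),    # early sentences tend to be OBJECTIVE
        "first_word": sents[i].split()[0].lower(),
        "has_percent": "%" in sents[i],         # numbers often signal RESULTS
    }
    if i > 0:
        feats["prev_first_word"] = sents[i - 1].split()[0].lower()
    return feats

# One training abstract: a list of sentences with one section label each.
abstract = [
    "We study the rhetorical structure of abstracts.",
    "We train a conditional random field on Medline data.",
    "The method reaches high per-sentence accuracy.",
    "CRFs model abstract structure well.",
]
labels = ["OBJECTIVE", "METHOD", "RESULT", "CONCLUSION"]

X, y = [[sent_features(abstract, i) for i in range(len(abstract))]], [labels]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, y)
print(crf.predict(X)[0])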


Proceedings Article
01 Jan 2008
TL;DR: The motivation for following the Paninian framework as the annotation scheme is provided, and it is argued that the Paninian framework is better suited to model the various linguistic phenomena manifest in Indian languages.
Abstract: The paper introduces a dependency annotation effort which aims to fully annotate a million word Hindi corpus. It is the first attempt of its kind to develop a large scale tree-bank for an Indian language. In this paper we provide the motivation for following the Paninian framework as the annotation scheme and argue that the Paninian framework is better suited to model the various linguistic phenomena manifest in Indian languages. We present the basic annotation scheme. We also show how the scheme handles some phenomena such as complex verbs, ellipses, etc. Empirical results of some experiments done on the currently annotated sentences are also reported.

138 citations


Proceedings Article
01 Jan 2008
TL;DR: A decision-tree approach inspired by contextual spelling systems is used for detection and correction suggestions, and a large language model trained on the Gigaword corpus provides additional information to filter out spurious suggestions.
Abstract: We present a modular system for detection and correction of errors made by non-native (English as a Second Language, ESL) writers. We focus on two error types: the incorrect use of determiners and the choice of prepositions. We use a decision-tree approach inspired by contextual spelling systems for detection and correction suggestions, and a large language model trained on the Gigaword corpus to provide additional information to filter out spurious suggestions. We show how this system performs on a corpus of non-native English text and discuss strategies for future enhancements.

137 citations


Proceedings Article
01 Jan 2008
TL;DR: The system makes use of contextual information of the words along with a variety of features that are helpful in predicting the various named entity (NE) classes, and is shown to outperform other existing Bengali NER systems.
Abstract: Named Entity Recognition (NER) aims to classify each word of a document into predefined target named entity classes and is nowadays considered fundamental for many Natural Language Processing (NLP) tasks such as information retrieval, machine translation, information extraction, question answering systems and others. This paper reports on the development of an NER system for Bengali using Support Vector Machines (SVM). Though this state-of-the-art machine learning method has been widely applied to NER in several well-studied languages, this is our first attempt to apply it to Indian languages (ILs), and particularly to Bengali. The system makes use of the different contextual information of the words along with the variety of features that are helpful in predicting the various named entity (NE) classes. A portion of a partially NE-tagged Bengali news corpus, developed from the archive of a leading Bengali newspaper available on the web, has been used to develop the SVM-based NER system. The training set consists of approximately 150K words and has been manually annotated with sixteen NE tags. Experimental results of the 10-fold cross validation test show the effectiveness of the proposed SVM-based NER system, with overall average Recall, Precision and F-Score of 94.3%, 89.4% and 91.8%, respectively. The system is shown to outperform other existing Bengali NER systems.

105 citations
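
A small illustration of word-level NE classification with an SVM over contextual window features, in the spirit of the system above; it uses scikit-learn, and the toy sentence, tagset, and features are assumptions rather than the paper's Bengali setup.

from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def token_features(tokens, i):
    # Context window of one word to each side plus simple orthographic cues.
    return {
        "word": tokens[i].lower(),
        "prev": tokens[i - 1].lower() if i > 0 else "<s>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>",
        "is_title": tokens[i].istitle(),
        "suffix2": tokens[i][-2:].lower(),
    }

train_sent = ["Rabindranath", "Tagore", "was", "born", "in", "Kolkata", "."]
train_tags = ["B-PER", "I-PER", "O", "O", "O", "B-LOC", "O"]

model = make_pipeline(DictVectorizer(), LinearSVC())
model.fit([token_features(train_sent, i) for i in range(len(train_sent))], train_tags)

test_sent = ["Tagore", "visited", "Kolkata", "."]
print(model.predict([token_features(test_sent, i) for i in range(len(test_sent))]))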


Proceedings Article
01 Jan 2008
TL;DR: The approach eschews the use of parsing or other sophisticated linguistic tools for the target language (Hindi) making it a useful framework for statistical machine translation from English to Indian languages in general, since such tools are not widely available for Indian languages currently.
Abstract: In this paper, we report our work on incorporating syntactic and morphological information for English to Hindi statistical machine translation. Two simple and computationally inexpensive ideas have proven to be surprisingly effective: (i) reordering the English source sentence as per Hindi syntax, and (ii) using the suffixes of Hindi words. The former is done by applying simple transformation rules on the English parse tree. The latter, by using a simple suffix separation program. With only a small amount of bilingual training data and limited tools for Hindi, we achieve reasonable performance and substantial improvements over the baseline phrase-based system. Our approach eschews the use of parsing or other sophisticated linguistic tools for the target language (Hindi), making it a useful framework for statistical machine translation from English to Indian languages in general, since such tools are not widely available for Indian languages currently.

90 citations
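
A toy sketch of the source-side reordering idea described above, assuming the nltk package: a single hand-written rule moves the verb after its object to approximate Hindi SOV order. The paper's actual transformation rules and its suffix separation step are more elaborate; this only illustrates the idea.

from nltk import Tree

def reorder_vp(tree):
    # Recursively rewrite (VP (VB* ...) (NP ...)) as (VP (NP ...) (VB* ...)).
    if not isinstance(tree, Tree):
        return tree
    children = [reorder_vp(child) for child in tree]
    if (tree.label() == "VP" and len(children) == 2
            and isinstance(children[0], Tree) and isinstance(children[1], Tree)
            and children[0].label().startswith("VB") and children[1].label() == "NP"):
        children = [children[1], children[0]]
    return Tree(tree.label(), children)

english = Tree.fromstring("(S (NP (NNP John)) (VP (VBZ reads) (NP (DT a) (NN book))))")
print(" ".join(reorder_vp(english).leaves()))   # John a book reads -- SOV-like order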


Proceedings Article
Bo Wang, Houfeng Wang
01 Jan 2008
TL;DR: Empirical results on three kinds of product reviews indicate the effectiveness of the proposed bootstrapping iterative learning strategy, and a mapping function from opinion words to features is proposed to identify implicit features in sentences.
Abstract: We consider the problem of identifying product features and opinion words in a unified process from Chinese customer reviews when only a small seed set of opinion words is available. In particular, we consider a problem setting motivated by the task of identifying product features with opinion words and learning opinion words through features alternately and iteratively. In customer reviews, opinion words usually have a close relationship with product features, and the association between them is measured by a revised formula of mutual information in this paper. A bootstrapping iterative learning strategy is proposed to alternately identify both of them. A linguistic rule is adopted to identify low-frequency features and opinion words. Furthermore, a mapping function from opinion words to features is proposed to identify implicit features in sentences. Empirical results on three kinds of product reviews indicate the effectiveness of our method.

90 citations
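
A minimal sketch of scoring the association between a product feature and an opinion word from sentence co-occurrence, using plain pointwise mutual information as a stand-in for the revised formula in the paper; the reviews are invented English examples.

import math

reviews = [
    "the battery life is great",
    "great screen but the battery drains fast",
    "screen resolution is poor",
    "poor battery and poor service",
]

def pmi(feature, opinion, sentences):
    # P(feature), P(opinion) and P(feature, opinion) estimated from sentence counts.
    n = len(sentences)
    p_f = sum(feature in s for s in sentences) / n
    p_o = sum(opinion in s for s in sentences) / n
    p_fo = sum(feature in s and opinion in s for s in sentences) / n
    return math.log2(p_fo / (p_f * p_o)) if p_fo else float("-inf")

print(pmi("battery", "great", reviews))   # co-occur often -> positive association
print(pmi("screen", "great", reviews))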


Proceedings Article
01 Jan 2008
TL;DR: This work studies the task of automatically categorizing sentences in a text into Ekman’s six basic emotion categories and achieves F-measure values that outperform the rule-based baseline method for all emotion classes.
Abstract: Recognizing the emotive meaning of text can add another dimension to the understanding of text. We study the task of automatically categorizing sentences in a text into Ekman’s six basic emotion categories. We experiment with corpus-based features as well as features derived from two emotion lexicons. One lexicon is automatically built using the classification system of Roget’s Thesaurus, while the other consists of words extracted from WordNet-Affect. Experiments on data obtained from blogs show that a combination of corpus-based unigram features with emotion-related features provides superior classification performance. We achieve F-measure values that outperform the rule-based baseline method for all emotion classes.

89 citations


Proceedings Article
01 Jan 2008
TL;DR: This work proposes a machine learning based method for sentiment classification of sentences using word-level polarity and empirically shows that the method improves the performance of sentiment classification of sentences, especially when only a small amount of training data is available.
Abstract: We propose a machine learning based method of sentiment classification of sentences using word-level polarity. The polarities of words in a sentence are not always the same as the polarity of the sentence, because there can be polarity-shifters such as negation expressions. The proposed method models the polarity-shifters. Our model can be trained in two different ways: word-wise and sentence-wise learning. In sentence-wise learning, the model can be trained so that the prediction of sentence polarities is accurate. The model can also be combined with features used in previous work, such as bag-of-words and n-grams. We empirically show that our method almost always improves the performance of sentiment classification of sentences, especially when only a small amount of training data is available.

88 citations
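
A toy illustration of combining word-level polarities with a negation shifter, as described above; the lexicon and the flip rule are hand-written assumptions, whereas the paper learns the shifter behaviour from data.

POLARITY = {"good": 1, "great": 1, "bad": -1, "boring": -1}
NEGATIONS = {"not", "never", "no"}

def sentence_polarity(tokens):
    score, flip = 0, 1
    for tok in tokens:
        if tok in NEGATIONS:
            flip = -1                     # shift the polarity of the next polar word
            continue
        score += flip * POLARITY.get(tok, 0)
        if tok in POLARITY:
            flip = 1                      # the shifter is consumed once applied
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(sentence_polarity("the movie was not good".split()))        # negative
print(sentence_polarity("a great and not boring film".split()))   # positive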


Proceedings Article
01 Jan 2008
TL;DR: Novel unsupervised techniques are used, including a one-word 'seed' vocabulary and iterative retraining for sentiment processing, and a criterion of 'sentiment density' for determining the extent to which a document is opinionated.
Abstract: We address the problem of sentiment and objectivity classification of product reviews in Chinese. Our approach is distinctive in that it treats both positive / negative sentiment and subjectivity / objectivity not as distinct classes but rather as a continuum; we argue that this is desirable from the perspective of would-be customers who read the reviews. We use novel unsupervised techniques, including a one-word 'seed' vocabulary and iterative retraining for sentiment processing, and a criterion of 'sentiment density' for determining the extent to which a document is opinionated. The classifier achieves up to 87% F-measure for sentiment polarity detection.
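
A minimal sketch of a 'sentiment density' style criterion, computed here as the share of sentiment-bearing tokens in a document; the seed vocabulary and threshold are invented, and the paper additionally grows its vocabulary by iterative retraining from a one-word seed.

SENTIMENT_WORDS = {"good", "bad", "excellent", "terrible", "love", "hate"}

def sentiment_density(text, threshold=0.15):
    # Fraction of tokens that carry sentiment; above the threshold => opinionated.
    tokens = text.lower().split()
    density = sum(tok in SENTIMENT_WORDS for tok in tokens) / max(len(tokens), 1)
    return density, ("opinionated" if density >= threshold else "objective")

print(sentiment_density("I love this phone, excellent camera and good battery"))
print(sentiment_density("The phone was released in March and ships with a charger"))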

Proceedings Article
01 Jan 2008
TL;DR: The Fourth International Chinese Language Processing Bakeoff was held in 2007 to assess the state of the art in three important tasks: Chinese word segmentation, named entity recognition and Chinese POS tagging.
Abstract: The Fourth International Chinese Language Processing Bakeoff was held in 2007 to assess the state of the art in three important tasks: Chinese word segmentation, named entity recognition and Chinese POS tagging. Twenty-eight groups submitted result sets in the three tasks across two tracks and a total of seven corpora. Strong results have been found in all the tasks as well as continuing challenges.

Proceedings Article
01 Jan 2008
TL;DR: Experimental results of the 10-fold cross validation test show the effectiveness of the proposed CRF based NER system, with overall average Recall, Precision and F-Score values of 93.8%, 87.8% and 90.7%, respectively.
Abstract: This paper reports on the development of a Named Entity Recognition (NER) system for Bengali using statistical Conditional Random Fields (CRFs). The system makes use of the different contextual information of the words along with the variety of features that are helpful in predicting the various named entity (NE) classes. A portion of the partially NE-tagged Bengali news corpus, developed from the archive of a leading Bengali newspaper available on the web, has been used to develop the system. The training set consists of 150K words and has been manually annotated with an NE tagset of seventeen tags. Experimental results of the 10-fold cross validation test show the effectiveness of the proposed CRF-based NER system, with overall average Recall, Precision and F-Score values of 93.8%, 87.8% and 90.7%, respectively.

Proceedings Article
01 Jan 2008
TL;DR: This paper reports about the development of a Named Entity Recognition system for South and South East Asian languages, particularly for Bengali, Hindi, Telugu, Oriya and Urdu, as part of the IJCNLP-08 NER Shared Task.
Abstract: This paper reports about the development of a Named Entity Recognition (NER) system for South and South East Asian languages, particularly for Bengali, Hindi, Telugu, Oriya and Urdu, as part of the IJCNLP-08 NER Shared Task.

Proceedings Article
01 Jan 2008
TL;DR: NAZEQA, a Japanese why-QA system based on the proposed corpus-based approach, clearly outperforms a baseline that uses hand-crafted patterns with a Mean Reciprocal Rank (top-5) of 0.305, making it presumably the best-performing fully implemented why- QA system.
Abstract: This paper proposes a corpus-based approach for answering why-questions. Conventional systems use hand-crafted patterns to extract and evaluate answer candidates. However, such hand-crafted patterns are likely to have low coverage of causal expressions, and it is also difficult to assign suitable weights to the patterns by hand. In our approach, causal expressions are automatically collected from corpora tagged with semantic relations. From the collected expressions, features are created to train an answer candidate ranker that maximizes the QA performance with regard to the corpus of why-questions and answers. NAZEQA, a Japanese why-QA system based on our approach, clearly outperforms a baseline that uses hand-crafted patterns, with a Mean Reciprocal Rank (top-5) of 0.305, making it presumably the best-performing fully implemented why-QA system.
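
For reference, a short sketch of the Mean Reciprocal Rank (top-5) measure quoted above, evaluated on invented ranked answer lists.

def mrr_at_5(ranked_answer_lists, gold_answers):
    total = 0.0
    for ranked, gold in zip(ranked_answer_lists, gold_answers):
        for rank, answer in enumerate(ranked[:5], start=1):
            if answer == gold:
                total += 1.0 / rank       # only the first correct answer counts
                break
    return total / len(gold_answers)

# Two why-questions: the correct answer appears at rank 2 and rank 1 respectively.
system_output = [["a3", "a1", "a7"], ["b2", "b5"]]
gold = ["a1", "b2"]
print(mrr_at_5(system_output, gold))      # (1/2 + 1/1) / 2 = 0.75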

Proceedings Article
01 Jan 2008
TL;DR: The effort in developing a Named Entity Recognition (NER) system for Hindi using a Maximum Entropy (MaxEnt) approach is described, along with an NER-annotated corpus developed for the purpose.
Abstract: We describe our effort in developing a Named Entity Recognition (NER) system for Hindi using a Maximum Entropy (MaxEnt) approach. We developed an NER-annotated corpus for the purpose. We have tried to identify the most relevant features for the Hindi NER task to enable us to develop an efficient NER system from the limited corpus developed. Apart from the orthographic and collocation features, we have experimented with the efficiency of using gazetteer lists as features. We also worked on semi-automatic induction of context patterns and experimented with using these as features of the MaxEnt method. We have evaluated the performance of the system against a blind test set with four classes: Person, Organization, Location and Date. Our system achieved an F-value of 81.52%.

Proceedings Article
01 Jan 2008
TL;DR: A training set selection method for translation model training using linear translation model interpolation and a language model technique reduces the translation model size by 50% and improves the BLEU score by 1.76% in comparison with using the baseline training corpus.
Abstract: Target task matched parallel corpora are required for statistical translation model training. However, training corpora sometimes include both target task matched and unmatched sentences. In such a case, training set selection can reduce the size of the translation model. In this paper, we propose a training set selection method for translation model training using linear translation model interpolation and a language model technique. According to the experimental results, the proposed method reduces the translation model size by 50% and improves the BLEU score by 1.76% in comparison with using the baseline training corpus.
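
A simplified sketch of selecting training sentences with language-model scores (a cross-entropy-difference style criterion), shown only to illustrate the general idea; it is not a reimplementation of the interpolation method above, and the unigram models and corpora are toy assumptions.

import math
from collections import Counter

def make_lm(corpus):
    counts = Counter(tok for sent in corpus for tok in sent.split())
    return counts, sum(counts.values())

def unigram_logprob(sentence, counts, total, vocab_size):
    # Add-one smoothed unigram log-probability of the sentence.
    return sum(math.log((counts[tok] + 1) / (total + vocab_size))
               for tok in sentence.split())

in_domain = ["the flight to tokyo departs at noon", "please book a hotel in kyoto"]
general = ["the cat sat on the mat", "stock prices fell sharply today"]
candidates = ["book a flight to osaka", "the mat was red"]

in_counts, in_total = make_lm(in_domain)
gen_counts, gen_total = make_lm(general)
vocab = len(set(in_counts) | set(gen_counts))

for sent in candidates:
    score = (unigram_logprob(sent, in_counts, in_total, vocab)
             - unigram_logprob(sent, gen_counts, gen_total, vocab))
    print(f"{score:+.2f}  {sent}")        # keep the highest-scoring sentences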

Proceedings Article
01 Jan 2008
TL;DR: This paper describes first steps towards extending the METU Turkish Corpus from a sentence-level language resource to a discourse-level resource by annotating its discourse connectives and their arguments with respect to free word order in Turkish and punctuation.
Abstract: This paper describes first steps towards extending the METU Turkish Corpus from a sentence-level language resource to a discourse-level resource by annotating its discourse connectives and their arguments. The project is based on the same principles as the Penn Discourse TreeBank (http://www.seas.upenn.edu/~pdtb) and is supported by TUBITAK, The Scientific and Technological Research Council of Turkey. We first present the goals of the project and the METU Turkish corpus. We then describe how we decided what to take as explicit discourse connectives and the range of syntactic classes they come from. With representative examples of each class, we examine explicit connectives, their linear ordering, and types of syntactic units that can serve as their arguments. We then touch upon connectives with respect to free word order in Turkish and punctuation, as well as the important issue of how much material is needed to specify an argument. We close with a brief discussion of current plans.

Proceedings Article
01 Jan 2008
TL;DR: An algorithm that relies on web frequency counts is described to identify and correct writing errors made by non-native writers of English; results suggest that a web-based approach should be combined with local linguistic resources to achieve both effectiveness and efficiency.
Abstract: We describe an algorithm that relies on web frequency counts to identify and correct writing errors made by non-native writers of English. Evaluation of the system on a real-world ESL corpus showed very promising performance on the very difficult problem of critiquing English determiner use: 62% precision and 41% recall, with a false flag rate of only 2% (compared to a random-guessing baseline of 5% precision, 7% recall, and more than 80% false flag rate). Performance on collocation errors was weaker, suggesting that a web-based approach should be combined with local linguistic resources to achieve both effectiveness and efficiency.
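
A hedged sketch of picking a determiner by comparing web frequency counts of rewritten phrases, in the spirit of the approach above. get_web_count and the canned counts are hypothetical stand-ins, since no public hit-count API is assumed here.

FAKE_COUNTS = {"I have a cars": 1200, "I have the cars": 45000, "I have cars": 310000}

def get_web_count(phrase):
    # Hypothetical lookup; in practice this would query a search engine or n-gram corpus.
    return FAKE_COUNTS.get(phrase, 0)

def best_determiner(before, noun_phrase):
    candidates = [f"{before} a {noun_phrase}",
                  f"{before} the {noun_phrase}",
                  f"{before} {noun_phrase}"]
    return max(candidates, key=get_web_count)

print(best_determiner("I have", "cars"))   # the zero-determiner variant wins on counts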

Proceedings Article
01 Jan 2008
TL;DR: This paper investigates the task of labeling Wikipedia pages with standard named entity tags, which can be used further by a range of information extraction and language processing tools and builds a Web service that classifies any Wikipedia page.
Abstract: Wikipedia is the largest organized knowledge repository on the Web, increasingly employed by natural language processing and search tools. In this paper, we investigate the task of labeling Wikipedia pages with standard named entity tags, which can be used further by a range of information extraction and language processing tools. To train the classifiers, we manually annotated a small set of Wikipedia pages and then extrapolated the annotations using the Wikipedia category information to a much larger training set. We employed several distinct features for each page: bag-of-words, page structure, abstract, titles, and entity mentions. We report high accuracies for several of the classifiers built. As a result of this work, a Web service that classifies any Wikipedia page has been made available to the academic community.

Proceedings Article
01 Jan 2008
TL;DR: This paper views medical coding as a multi-label classification problem, where each code is treated as a label for patient records, and compares two efficient algorithms for diagnosis coding on a large patient dataset.
Abstract: A critical, yet not very well studied problem in medical applications is the issue of accurately labeling patient records according to diagnoses and procedures that patients have undergone. This labeling problem, known as coding, consists of assigning standard medical codes (ICD9 and CPT) to patient records. Each patient record can have several corresponding labels/codes, many of which are correlated to specific diseases. The current, most frequent coding approach involves manual labeling, which requires considerable human effort and is cumbersome for large patient databases. In this paper we view medical coding as a multi-label classification problem, where we treat each code as a label for patient records. Due to government regulations concerning patient medical data, previous studies in automatic coding have been quite limited. In this paper, we compare two efficient algorithms for diagnosis coding on a large patient dataset.
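
A small sketch of treating code assignment as multi-label text classification with scikit-learn; the records, ICD-9 codes, and the one-vs-rest logistic regression model are illustrative assumptions rather than the two algorithms compared in the paper.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

records = [
    "patient admitted with chest pain and shortness of breath",
    "routine screening, family history of diabetes",
    "chest pain resolved, diabetes mellitus type 2 managed with diet",
]
codes = [["786.50"], ["V77.1"], ["786.50", "250.00"]]    # example ICD-9 labels

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(codes)                             # records x codes indicator matrix

model = make_pipeline(TfidfVectorizer(), OneVsRestClassifier(LogisticRegression()))
model.fit(records, Y)
print(mlb.inverse_transform(model.predict(["new patient reporting chest pain"])))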

Proceedings Article
01 Jan 2008
TL;DR: By using a machine learning technique and pattern matching, this work was able to extract more than 6.3 × 10^5 relations from hierarchical layouts in the Japanese Wikipedia, with a precision of 76.4%.
Abstract: This paper describes a method for extracting a large set of hyponymy relations from Wikipedia. Wikipedia is much more consistently structured than generic HTML documents, and we can extract a large number of hyponymy relations with simple methods. In this work, we managed to extract more than 1.4 × 10^6 hyponymy relations with 75.3% precision from the Japanese version of Wikipedia. To the best of our knowledge, this is the largest machine-readable thesaurus for Japanese. The main contribution of this paper is a method for hyponymy acquisition from hierarchical layouts in Wikipedia. By using a machine learning technique and pattern matching, we were able to extract more than 6.3 × 10^5 relations from hierarchical layouts in the Japanese Wikipedia, and their precision was 76.4%. The remaining hyponymy relations were acquired by existing methods for extracting relations from definition sentences and category pages. This means that extraction from the hierarchical layouts almost doubled the number of relations extracted.
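
A toy sketch of reading hyponymy candidates off a hierarchical heading/bullet layout, the kind of structure mined from Wikipedia above; the markup and extraction pattern are simplified assumptions, and the paper additionally filters candidates with a learned classifier.

wiki_section = """\
== Musical instruments ==
* Guitar
* Violin
== Programming languages ==
* Python
* OCaml
"""

def layout_hyponyms(text):
    # Treat the nearest preceding heading as the hypernym of each bullet item.
    pairs, hypernym = [], None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("==") and line.endswith("=="):
            hypernym = line.strip("= ").strip()
        elif line.startswith("*") and hypernym:
            pairs.append((hypernym, line.lstrip("* ").strip()))
    return pairs

for hyper, hypo in layout_hyponyms(wiki_section):
    print(f"{hypo}  is-a  {hyper}")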

Proceedings Article
01 Jan 2008
TL;DR: Experiments show that description length gain outperforms other measures because of its strength in identifying short words; further performance improvement is reported through proper candidate pruning and ensemble segmentation that integrates the strengths of individual measures.
Abstract: This paper reports our empirical evaluation and comparison of several popular goodness measures for unsupervised segmentation of Chinese texts using Bakeoff-3 data sets within a unified framework. Assuming no prior knowledge about Chinese, this framework relies on a goodness measure to identify word candidates from unlabeled texts and then applies a generalized decoding algorithm to find the optimal segmentation of a sentence into such candidates with the greatest sum of goodness scores. Experiments show that description length gain outperforms other measures because of its strength in identifying short words. Further performance improvement is also reported, achieved by proper candidate pruning and by ensemble segmentation to integrate the strengths of individual measures.
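
A minimal sketch of the decoding step described above: segmenting a string into word candidates so that the sum of their goodness scores is maximal, via dynamic programming. The goodness table is invented; in the paper such scores come from measures like description length gain computed on unlabeled text.

GOODNESS = {"中国": 2.0, "人民": 1.8, "中": 0.2, "国": 0.2, "人": 0.3, "民": 0.3}

def segment(text, goodness, max_len=4):
    best = [0.0] + [float("-inf")] * len(text)    # best[i]: best score for text[:i]
    back = [0] * (len(text) + 1)
    for i in range(1, len(text) + 1):
        for j in range(max(0, i - max_len), i):
            score = best[j] + goodness.get(text[j:i], -1.0)   # penalty for unknown strings
            if score > best[i]:
                best[i], back[i] = score, j
    words, i = [], len(text)
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return list(reversed(words))

print(segment("中国人民", GOODNESS))   # ['中国', '人民']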

Proceedings Article
01 Jan 2008
TL;DR: A cross platform multilingual multimedia Indian Sign Language (ISL) dictionary building tool that facilitates the phonological annotation of Indian signs in the form of HamNoSys structure.
Abstract: This paper presents a cross platform multilingual multimedia Indian Sign Language (ISL) dictionary building tool. ISL is a linguistically under-investigated language with no source of well documented electronic data. Research on ISL linguistics also gets hindered due to a lack of ISL knowledge and the unavailability of any educational tools. Our system can be used to associate signs corresponding to a given text. The current system also facilitates the phonological annotation of Indian signs in the form of HamNoSys structure. The generated HamNoSys string can be given as input to an avatar module to produce an animated sign representation.

Proceedings Article
01 Jan 2008
TL;DR: A method for script independent word spotting in multilingual handwritten and machine printed documents that accepts a query in the form of text from the user and returns a ranked list of word images from a document image corpus based on similarity with the query word.
Abstract: This paper describes a method for script independent word spotting in multilingual handwritten and machine printed documents. The system accepts a query in the form of text from the user and returns a ranked list of word images from a document image corpus based on similarity with the query word. The system is divided into two main components. The first component, known as the Indexer, performs indexing of all word images present in the document image corpus. This is achieved by extracting Moment Based features from word images and storing them as an index. A template is generated for keyword spotting which stores the mapping of a keyword string to its corresponding word image, which is used for generating the query feature vector. The second component, the Similarity Matcher, returns a ranked list of word images which are most similar to the query based on a cosine similarity metric. Manual relevance feedback is applied based on Rocchio’s formula, which re-formulates the query vector to return an improved ranked listing of word images. The performance of the system is seen to be better on printed text than on handwritten text. Experiments are reported on documents of three different languages: English, Hindi and Sanskrit. For handwritten English, an average precision of 67% was obtained for 30 query words. For machine printed Hindi, an average precision of 71% was obtained for 75 query words, and for Sanskrit, an average precision of 87% was obtained with 100 queries.
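
A small sketch of the retrieval side described above: ranking indexed feature vectors by cosine similarity to a query vector and refining the query with Rocchio-style relevance feedback. The random vectors stand in for the moment-based features, and the Rocchio weights are conventional textbook values rather than the paper's.

import numpy as np

rng = np.random.default_rng(0)
index = rng.random((6, 8))                # 6 indexed word images, 8-dim feature vectors
query = rng.random(8)

def cosine_rank(query, index):
    sims = index @ query / (np.linalg.norm(index, axis=1) * np.linalg.norm(query))
    return np.argsort(-sims)

def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    # Move the query towards relevant vectors and away from non-relevant ones.
    q = alpha * query
    if len(relevant):
        q = q + beta * relevant.mean(axis=0)
    if len(nonrelevant):
        q = q - gamma * nonrelevant.mean(axis=0)
    return q

order = cosine_rank(query, index)
print("initial ranking:", order)

# Suppose the user marks the top hit relevant and the bottom hit non-relevant.
new_query = rocchio(query, index[order[:1]], index[order[-1:]])
print("updated ranking:", cosine_rank(new_query, index))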

Proceedings Article
01 Jan 2008
TL;DR: A system built for Named Entity Recognition for South and South East Asian Languages is described, specifically tuned for Hindi and Telugu, which uses CRF (Conditional Random Fields) based machine learning, followed by post-processing involving heuristics or rules.
Abstract: This paper, submitted as an entry for the NERSSEAL-2008 shared task, describes a system built for Named Entity Recognition for South and South East Asian Languages. Our paper combines machine learning techniques with language specific heuristics to model the problem of NER for Indian languages. The system has been tested on five languages: Telugu, Hindi, Bengali, Urdu and Oriya. It uses CRF (Conditional Random Fields) based machine learning, followed by post-processing which involves using some heuristics or rules. The system is specifically tuned for Hindi and Telugu; we also report results for the other languages.

Proceedings Article
01 Jan 2008
TL;DR: This paper summarizes the corpus and lexical resources being developed for Urdu by the CRULP, in Pakistan.
Abstract: Urdu is spoken by more than 100 million speakers. This paper summarizes the corpus and lexical resources being developed for Urdu by the CRULP, in Pakistan.


Proceedings Article
01 Jan 2008
TL;DR: The potential for identifying computationally relevant typological features from a multilingual corpus of language data built from readily available language data collected off the Web is explored.
Abstract: In this paper we explore the potential for identifying computationally relevant typological features from a multilingual corpus of language data built from readily available language data collected off the Web. Our work builds on previous structural projection work, where we extend the work of projection to building individual CFGs for approximately 100 languages. We then use the CFGs to discover the values of typological parameters such as word order, the presence or absence of definite and indefinite determiners, etc. Our methods have the potential of being extended to many more languages and parameters, and can have significant effects on current research focused on tool and resource development for low-density languages and grammar induction from raw corpora.

Proceedings Article
01 Jan 2008
TL;DR: This work introduces a novel approach for automatic correction of spelling mistakes by deploying finite state automata to propose candidate corrections within a specified edit distance from the misspelled word.
Abstract: Many natural language applications, like machine translation and information extraction, are required to operate on text with spelling errors. Those spelling mistakes have to be corrected automatically to avoid deteriorating the performance of such applications. In this work, we introduce a novel approach for automatic correction of spelling mistakes by deploying finite state automata to propose candidate corrections within a specified edit distance from the misspelled word. After choosing candidate corrections, a language model is used to assign scores to the candidate corrections and choose the best correction in the given context. The proposed approach is language independent and requires only a dictionary and text data for building a language model. The approach has been tested on both Arabic and English text and achieved an accuracy of 89%.
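
A simplified sketch of the two stages described above: generate in-dictionary candidates within a small edit distance, then pick the one a language model prefers in context. A brute-force edit-distance filter stands in for the finite-state generator, and the toy bigram model, dictionary, and test sentence are assumptions.

from collections import Counter

DICTIONARY = {"the", "cat", "sat", "on", "mat", "man", "hat"}
CORPUS = "the cat sat on the mat the man sat on the hat".split()

def edit_distance(a, b):
    # Standard dynamic-programming Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def candidates(word, max_dist=1):
    return [w for w in DICTIONARY if edit_distance(word, w) <= max_dist] or [word]

bigrams, unigrams = Counter(zip(CORPUS, CORPUS[1:])), Counter(CORPUS)

def bigram_score(prev_word, word):
    # Add-one smoothed bigram probability used to rank candidate corrections.
    return (bigrams[(prev_word, word)] + 1) / (unigrams[prev_word] + len(DICTIONARY))

def correct(sentence):
    out = ["<s>"]
    for word in sentence.split():
        out.append(max(candidates(word), key=lambda c: bigram_score(out[-1], c)))
    return " ".join(out[1:])

print(correct("the cst sat on the mzt"))   # -> "the cat sat on the mat"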