
Showing papers presented at "International Joint Conference on Natural Language Processing in 2008"




Proceedings Article
01 Jan 2008
TL;DR: An approach to adapting a parser to a new language using existing annotations in the source language achieves performance equivalent to that obtained by training on 1546 trees in the target language.
Abstract: The present paper describes an approach to adapting a parser to a new language. The target language is assumed to be much poorer in linguistic resources than the source language. The technique has been tested on two European languages due to test data availability; however, it is easily applicable to any pair of sufficiently related languages, including some of the Indic language group. Our adaptation technique, using existing annotations in the source language, achieves performance equivalent to that obtained by training on 1546 trees in the target language.

210 citations


Proceedings Article
01 Jan 2008
TL;DR: A novel approach categorizes sentences in scientific abstracts into four sections (objective, methods, results, and conclusions); experiments showed that CRFs can suitably model the rhetorical structure of abstracts.
Abstract: OBJECTIVE: The prior knowledge about the rhetorical structure of scientific abstracts is useful for various text-mining tasks such as information extraction, information retrieval, and automatic summarization. This paper presents a novel approach to categorizing sentences in scientific abstracts into four sections: objective, methods, results, and conclusions. METHOD: Formalizing the categorization task as a sequential labeling problem, we employ Conditional Random Fields (CRFs) to assign section labels to abstract sentences. The training corpus is acquired automatically from Medline abstracts. RESULTS: The proposed method outperformed the previous approaches, achieving 95.5% per-sentence accuracy and 68.8% per-abstract accuracy. CONCLUSION: The experimental results showed that CRFs can suitably model the rhetorical structure of abstracts.

172 citations
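
A minimal sketch of the kind of sequential labeling described in the abstract above, assuming the sklearn-crfsuite package; the feature set, sentences, and labels below are invented for illustration and are not those used in the paper.

import sklearn_crfsuite

def sent_features(sents, i):
    # Illustrative per-sentence features: position in the abstract plus lexical cues.
    feats = {
        "relative_position": i / len(sents),    # early sentences tend to be OBJECTIVE
        "first_word": sents[i].split()[0].lower(),
        "has_percent": "%" in sents[i],         # numbers often signal RESULTS
    }
    if i > 0:
        feats["prev_first_word"] = sents[i - 1].split()[0].lower()
    return feats

# One training abstract: a list of sentences with one section label each.
abstract = [
    "We study the rhetorical structure of abstracts.",
    "We train a conditional random field on Medline data.",
    "The method reaches high per-sentence accuracy.",
    "CRFs model abstract structure well.",
]
labels = ["OBJECTIVE", "METHOD", "RESULT", "CONCLUSION"]

X, y = [[sent_features(abstract, i) for i in range(len(abstract))]], [labels]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, y)
print(crf.predict(X)[0])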


Proceedings Article
01 Jan 2008
TL;DR: The motivation for following the Paninian framework as the annotation scheme is provided, and it is argued that the Paninian framework is better suited to model the various linguistic phenomena manifest in Indian languages.
Abstract: The paper introduces a dependency annotation effort which aims to fully annotate a million word Hindi corpus. It is the first attempt of its kind to develop a large scale tree-bank for an Indian language. In this paper we provide the motivation for following the Paninian framework as the annotation scheme and argue that the Paninian framework is better suited to model the various linguistic phenomena manifest in Indian languages. We present the basic annotation scheme. We also show how the scheme handles some phenomena such as complex verbs, ellipses, etc. Empirical results of some experiments done on the currently annotated sentences are also reported.

138 citations


Proceedings Article
01 Jan 2008
TL;DR: A decision-tree approach inspired by contextual spelling systems is used for detection and correction suggestions, and a large language model trained on the Gigaword corpus provides additional information to filter out spurious suggestions.
Abstract: We present a modular system for detection and correction of errors made by non-native (English as a Second Language, ESL) writers. We focus on two error types: the incorrect use of determiners and the choice of prepositions. We use a decision-tree approach inspired by contextual spelling systems for detection and correction suggestions, and a large language model trained on the Gigaword corpus to provide additional information to filter out spurious suggestions. We show how this system performs on a corpus of non-native English text and discuss strategies for future enhancements.

137 citations


Proceedings Article
01 Jan 2008
TL;DR: The system makes use of contextual information of the words along with a variety of features that are helpful in predicting the various named entity (NE) classes, and is shown to outperform other existing Bengali NER systems.
Abstract: Named Entity Recognition (NER) aims to classify each word of a document into predefined target named entity classes and is nowadays considered fundamental for many Natural Language Processing (NLP) tasks such as information retrieval, machine translation, information extraction, question answering systems and others. This paper reports on the development of an NER system for Bengali using Support Vector Machines (SVM). Though this state-of-the-art machine learning method has been widely applied to NER in several well-studied languages, this is our first attempt to apply it to Indian languages (ILs), and particularly to Bengali. The system makes use of the different contextual information of the words along with the variety of features that are helpful in predicting the various named entity (NE) classes. A portion of a partially NE-tagged Bengali news corpus, developed from the archive of a leading Bengali newspaper available on the web, has been used to develop the SVM-based NER system. The training set consists of approximately 150K words and has been manually annotated with sixteen NE tags. Experimental results of the 10-fold cross validation test show the effectiveness of the proposed SVM-based NER system, with overall average Recall, Precision and F-Score of 94.3%, 89.4% and 91.8%, respectively. The system is shown to outperform other existing Bengali NER systems.

105 citations
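
A small illustration of word-level NE classification with an SVM over contextual window features, in the spirit of the system above; it uses scikit-learn, and the toy sentence, tagset, and features are assumptions rather than the paper's Bengali setup.

from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def token_features(tokens, i):
    # Context window of one word to each side plus simple orthographic cues.
    return {
        "word": tokens[i].lower(),
        "prev": tokens[i - 1].lower() if i > 0 else "<s>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>",
        "is_title": tokens[i].istitle(),
        "suffix2": tokens[i][-2:].lower(),
    }

train_sent = ["Rabindranath", "Tagore", "was", "born", "in", "Kolkata", "."]
train_tags = ["B-PER", "I-PER", "O", "O", "O", "B-LOC", "O"]

model = make_pipeline(DictVectorizer(), LinearSVC())
model.fit([token_features(train_sent, i) for i in range(len(train_sent))], train_tags)

test_sent = ["Tagore", "visited", "Kolkata", "."]
print(model.predict([token_features(test_sent, i) for i in range(len(test_sent))]))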


Proceedings Article
01 Jan 2008
TL;DR: The approach eschews the use of parsing or other sophisticated linguistic tools for the target language (Hindi) making it a useful framework for statistical machine translation from English to Indian languages in general, since such tools are not widely available for Indian languages currently.
Abstract: In this paper, we report our work on incorporating syntactic and morphological information for English to Hindi statistical machine translation. Two simple and computationally inexpensive ideas have proven to be surprisingly effective: (i) reordering the English source sentence as per Hindi syntax, and (ii) using the suffixes of Hindi words. The former is done by applying simple transformation rules on the English parse tree. The latter, by using a simple suffix separation program. With only a small amount of bilingual training data and limited tools for Hindi, we achieve reasonable performance and substantial improvements over the baseline phrase-based system. Our approach eschews the use of parsing or other sophisticated linguistic tools for the target language (Hindi), making it a useful framework for statistical machine translation from English to Indian languages in general, since such tools are not widely available for Indian languages currently.

90 citations
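
A toy sketch of the source-side reordering idea described above, assuming the nltk package: a single hand-written rule moves the verb after its object to approximate Hindi SOV order. The paper's actual transformation rules and its suffix separation step are more elaborate; this only illustrates the idea.

from nltk import Tree

def reorder_vp(tree):
    # Recursively rewrite (VP (VB* ...) (NP ...)) as (VP (NP ...) (VB* ...)).
    if not isinstance(tree, Tree):
        return tree
    children = [reorder_vp(child) for child in tree]
    if (tree.label() == "VP" and len(children) == 2
            and isinstance(children[0], Tree) and isinstance(children[1], Tree)
            and children[0].label().startswith("VB") and children[1].label() == "NP"):
        children = [children[1], children[0]]
    return Tree(tree.label(), children)

english = Tree.fromstring("(S (NP (NNP John)) (VP (VBZ reads) (NP (DT a) (NN book))))")
print(" ".join(reorder_vp(english).leaves()))   # John a book reads -- SOV-like order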


Proceedings Article
Bo Wang, Houfeng Wang
01 Jan 2008
TL;DR: Empirical results on three kinds of product reviews indicate the effectiveness of the proposed bootstrapping iterative learning strategy, and a mapping function from opinion words to features is proposed to identify implicit features in sentences.
Abstract: We consider the problem of identifying product features and opinion words in a unified process from Chinese customer reviews when only a small seed set of opinion words is available. In particular, we consider a problem setting motivated by the task of identifying product features with opinion words and learning opinion words through features alternately and iteratively. In customer reviews, opinion words usually have a close relationship with product features, and the association between them is measured by a revised formula of mutual information in this paper. A bootstrapping iterative learning strategy is proposed to alternately identify both of them. A linguistic rule is adopted to identify low-frequency features and opinion words. Furthermore, a mapping function from opinion words to features is proposed to identify implicit features in sentences. Empirical results on three kinds of product reviews indicate the effectiveness of our method.

90 citations
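
A minimal sketch of scoring the association between a product feature and an opinion word from sentence co-occurrence, using plain pointwise mutual information as a stand-in for the revised formula in the paper; the reviews are invented English examples.

import math

reviews = [
    "the battery life is great",
    "great screen but the battery drains fast",
    "screen resolution is poor",
    "poor battery and poor service",
]

def pmi(feature, opinion, sentences):
    # P(feature), P(opinion) and P(feature, opinion) estimated from sentence counts.
    n = len(sentences)
    p_f = sum(feature in s for s in sentences) / n
    p_o = sum(opinion in s for s in sentences) / n
    p_fo = sum(feature in s and opinion in s for s in sentences) / n
    return math.log2(p_fo / (p_f * p_o)) if p_fo else float("-inf")

print(pmi("battery", "great", reviews))   # co-occur often -> positive association
print(pmi("screen", "great", reviews))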


Proceedings Article
01 Jan 2008
TL;DR: This work studies the task of automatically categorizing sentences in a text into Ekman’s six basic emotion categories and achieves F-measure values that outperform the rule-based baseline method for all emotion classes.
Abstract: Recognizing the emotive meaning of text can add another dimension to the understanding of text. We study the task of automatically categorizing sentences in a text into Ekman’s six basic emotion categories. We experiment with corpus-based features as well as features derived from two emotion lexicons. One lexicon is automatically built using the classification system of Roget’s Thesaurus, while the other consists of words extracted from WordNet-Affect. Experiments on data obtained from blogs show that a combination of corpus-based unigram features with emotion-related features provides superior classification performance. We achieve F-measure values that outperform the rule-based baseline method for all emotion classes.

89 citations


Proceedings Article
01 Jan 2008
TL;DR: This work proposes a machine learning based method for sentiment classification of sentences using word-level polarity and empirically shows that the method improves the performance of sentiment classification of sentences, especially when only a small amount of training data is available.
Abstract: We propose a machine learning based method of sentiment classification of sentences using word-level polarity. The polarities of words in a sentence are not always the same as the polarity of the sentence, because there can be polarity-shifters such as negation expressions. The proposed method models the polarity-shifters. Our model can be trained in two different ways: word-wise and sentence-wise learning. In sentence-wise learning, the model can be trained so that the prediction of sentence polarities is accurate. The model can also be combined with features used in previous work, such as bag-of-words and n-grams. We empirically show that our method almost always improves the performance of sentiment classification of sentences, especially when only a small amount of training data is available.

88 citations
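
A toy illustration of combining word-level polarities with a negation shifter, as described above; the lexicon and the flip rule are hand-written assumptions, whereas the paper learns the shifter behaviour from data.

POLARITY = {"good": 1, "great": 1, "bad": -1, "boring": -1}
NEGATIONS = {"not", "never", "no"}

def sentence_polarity(tokens):
    score, flip = 0, 1
    for tok in tokens:
        if tok in NEGATIONS:
            flip = -1                     # shift the polarity of the next polar word
            continue
        score += flip * POLARITY.get(tok, 0)
        if tok in POLARITY:
            flip = 1                      # the shifter is consumed once applied
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(sentence_polarity("the movie was not good".split()))        # negative
print(sentence_polarity("a great and not boring film".split()))   # positive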


Proceedings Article
01 Jan 2008
TL;DR: Novel unsupervised techniques are used, including a one-word 'seed' vocabulary and iterative retraining for sentiment processing, and a criterion of 'sentiment density' for determining the extent to which a document is opinionated.
Abstract: We address the problem of sentiment and objectivity classification of product reviews in Chinese. Our approach is distinctive in that it treats both positive / negative sentiment and subjectivity / objectivity not as distinct classes but rather as a continuum; we argue that this is desirable from the perspective of would-be customers who read the reviews. We use novel unsupervised techniques, including a one-word 'seed' vocabulary and iterative retraining for sentiment processing, and a criterion of 'sentiment density' for determining the extent to which a document is opinionated. The classifier achieves up to 87% F-measure for sentiment polarity detection.
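
A minimal sketch of a 'sentiment density' style criterion, computed here as the share of sentiment-bearing tokens in a document; the seed vocabulary and threshold are invented, and the paper additionally grows its vocabulary by iterative retraining from a one-word seed.

SENTIMENT_WORDS = {"good", "bad", "excellent", "terrible", "love", "hate"}

def sentiment_density(text, threshold=0.15):
    # Fraction of tokens that carry sentiment; above the threshold => opinionated.
    tokens = text.lower().split()
    density = sum(tok in SENTIMENT_WORDS for tok in tokens) / max(len(tokens), 1)
    return density, ("opinionated" if density >= threshold else "objective")

print(sentiment_density("I love this phone, excellent camera and good battery"))
print(sentiment_density("The phone was released in March and ships with a charger"))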

Proceedings Article
01 Jan 2008
TL;DR: The Fourth International Chinese Language Processing Bakeoff was held in 2007 to assess the state of the art in three important tasks: Chinese word segmentation, named entity recognition and Chinese POS tagging.
Abstract: The Fourth International Chinese Language Processing Bakeoff was held in 2007 to assess the state of the art in three important tasks: Chinese word segmentation, named entity recognition and Chinese POS tagging. Twenty-eight groups submitted result sets in the three tasks across two tracks and a total of seven corpora. Strong results have been found in all the tasks as well as continuing challenges.

Proceedings Article
01 Jan 2008
TL;DR: Experimental results of the 10-fold cross validation test show the effectiveness of the proposed CRF based NER system, with overall average Recall, Precision and F-Score values of 93.8%, 87.8% and 90.7%, respectively.
Abstract: This paper reports on the development of a Named Entity Recognition (NER) system for Bengali using statistical Conditional Random Fields (CRFs). The system makes use of the different contextual information of the words along with the variety of features that are helpful in predicting the various named entity (NE) classes. A portion of the partially NE-tagged Bengali news corpus, developed from the archive of a leading Bengali newspaper available on the web, has been used to develop the system. The training set consists of 150K words and has been manually annotated with an NE tagset of seventeen tags. Experimental results of the 10-fold cross validation test show the effectiveness of the proposed CRF-based NER system, with overall average Recall, Precision and F-Score values of 93.8%, 87.8% and 90.7%, respectively.

Proceedings Article
01 Jan 2008
TL;DR: This paper reports about the development of a Named Entity Recognition system for South and South East Asian languages, particularly for Bengali, Hindi, Telugu, Oriya and Urdu, as part of the IJCNLP-08 NER Shared Task.
Abstract: This paper reports about the development of a Named Entity Recognition (NER) system for South and South East Asian languages, particularly for Bengali, Hindi, Telugu, Oriya and Urdu, as part of the IJCNLP-08 NER Shared Task.

Proceedings Article
01 Jan 2008
TL;DR: NAZEQA, a Japanese why-QA system based on the proposed corpus-based approach, clearly outperforms a baseline that uses hand-crafted patterns with a Mean Reciprocal Rank (top-5) of 0.305, making it presumably the best-performing fully implemented why- QA system.
Abstract: This paper proposes a corpus-based approach for answering why-questions. Conventional systems use hand-crafted patterns to extract and evaluate answer candidates. However, such hand-crafted patterns are likely to have low coverage of causal expressions, and it is also difficult to assign suitable weights to the patterns by hand. In our approach, causal expressions are automatically collected from corpora tagged with semantic relations. From the collected expressions, features are created to train an answer candidate ranker that maximizes the QA performance with regard to the corpus of why-questions and answers. NAZEQA, a Japanese why-QA system based on our approach, clearly outperforms a baseline that uses hand-crafted patterns, with a Mean Reciprocal Rank (top-5) of 0.305, making it presumably the best-performing fully implemented why-QA system.
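
For reference, a short sketch of the Mean Reciprocal Rank (top-5) measure quoted above, evaluated on invented ranked answer lists.

def mrr_at_5(ranked_answer_lists, gold_answers):
    total = 0.0
    for ranked, gold in zip(ranked_answer_lists, gold_answers):
        for rank, answer in enumerate(ranked[:5], start=1):
            if answer == gold:
                total += 1.0 / rank       # only the first correct answer counts
                break
    return total / len(gold_answers)

# Two why-questions: the correct answer appears at rank 2 and rank 1 respectively.
system_output = [["a3", "a1", "a7"], ["b2", "b5"]]
gold = ["a1", "b2"]
print(mrr_at_5(system_output, gold))      # (1/2 + 1/1) / 2 = 0.75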

Proceedings Article
01 Jan 2008
TL;DR: The effort in developing a Named Entity Recognition (NER) system for Hindi using a Maximum Entropy (MaxEnt) approach is described, along with an NER-annotated corpus developed for the purpose.
Abstract: We describe our effort in developing a Named Entity Recognition (NER) system for Hindi using a Maximum Entropy (MaxEnt) approach. We developed an NER-annotated corpus for the purpose. We have tried to identify the most relevant features for the Hindi NER task to enable us to develop an efficient NER system from the limited corpus developed. Apart from the orthographic and collocation features, we have experimented with the efficiency of using gazetteer lists as features. We also worked on semi-automatic induction of context patterns and experimented with using these as features of the MaxEnt method. We have evaluated the performance of the system against a blind test set with four classes: Person, Organization, Location and Date. Our system achieved an F-value of 81.52%.

Proceedings Article
01 Jan 2008
TL;DR: A training set selection method for translation model training using linear translation model interpolation and a language model technique reduces the translation model size by 50% and improves the BLEU score by 1.76% in comparison with using the baseline training corpus.
Abstract: Target task matched parallel corpora are required for statistical translation model training. However, training corpora sometimes include both target task matched and unmatched sentences. In such a case, training set selection can reduce the size of the translation model. In this paper, we propose a training set selection method for translation model training using linear translation model interpolation and a language model technique. According to the experimental results, the proposed method reduces the translation model size by 50% and improves the BLEU score by 1.76% in comparison with using the baseline training corpus.
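
A simplified sketch of selecting training sentences with language-model scores (a cross-entropy-difference style criterion), shown only to illustrate the general idea; it is not a reimplementation of the interpolation method above, and the unigram models and corpora are toy assumptions.

import math
from collections import Counter

def make_lm(corpus):
    counts = Counter(tok for sent in corpus for tok in sent.split())
    return counts, sum(counts.values())

def unigram_logprob(sentence, counts, total, vocab_size):
    # Add-one smoothed unigram log-probability of the sentence.
    return sum(math.log((counts[tok] + 1) / (total + vocab_size))
               for tok in sentence.split())

in_domain = ["the flight to tokyo departs at noon", "please book a hotel in kyoto"]
general = ["the cat sat on the mat", "stock prices fell sharply today"]
candidates = ["book a flight to osaka", "the mat was red"]

in_counts, in_total = make_lm(in_domain)
gen_counts, gen_total = make_lm(general)
vocab = len(set(in_counts) | set(gen_counts))

for sent in candidates:
    score = (unigram_logprob(sent, in_counts, in_total, vocab)
             - unigram_logprob(sent, gen_counts, gen_total, vocab))
    print(f"{score:+.2f}  {sent}")        # keep the highest-scoring sentences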

Proceedings Article
01 Jan 2008
TL;DR: This paper describes first steps towards extending the METU Turkish Corpus from a sentence-level language resource to a discourse-level resource by annotating its discourse connectives and their arguments with respect to free word order in Turkish and punctuation.
Abstract: This paper describes first steps towards extending the METU Turkish Corpus from a sentence-level language resource to a discourse-level resource by annotating its discourse connectives and their arguments. The project is based on the same principles as the Penn Discourse TreeBank (http://www.seas.upenn.edu/~pdtb) and is supported by TUBITAK, The Scientific and Technological Research Council of Turkey. We first present the goals of the project and the METU Turkish corpus. We then describe how we decided what to take as explicit discourse connectives and the range of syntactic classes they come from. With representative examples of each class, we examine explicit connectives, their linear ordering, and types of syntactic units that can serve as their arguments. We then touch upon connectives with respect to free word order in Turkish and punctuation, as well as the important issue of how much material is needed to specify an argument. We close with a brief discussion of current plans.

Proceedings Article
01 Jan 2008
TL;DR: An algorithm that relies on web frequency counts is described to identify and correct writing errors made by non-native writers of English; results suggest that a web-based approach should be combined with local linguistic resources to achieve both effectiveness and efficiency.
Abstract: We describe an algorithm that relies on web frequency counts to identify and correct writing errors made by non-native writers of English. Evaluation of the system on a real-world ESL corpus showed very promising performance on the very difficult problem of critiquing English determiner use: 62% precision and 41% recall, with a false flag rate of only 2% (compared to a random-guessing baseline of 5% precision, 7% recall, and more than 80% false flag rate). Performance on collocation errors was weaker, suggesting that a web-based approach should be combined with local linguistic resources to achieve both effectiveness and efficiency.
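
A hedged sketch of picking a determiner by comparing web frequency counts of rewritten phrases, in the spirit of the approach above. get_web_count and the canned counts are hypothetical stand-ins, since no public hit-count API is assumed here.

FAKE_COUNTS = {"I have a cars": 1200, "I have the cars": 45000, "I have cars": 310000}

def get_web_count(phrase):
    # Hypothetical lookup; in practice this would query a search engine or n-gram corpus.
    return FAKE_COUNTS.get(phrase, 0)

def best_determiner(before, noun_phrase):
    candidates = [f"{before} a {noun_phrase}",
                  f"{before} the {noun_phrase}",
                  f"{before} {noun_phrase}"]
    return max(candidates, key=get_web_count)

print(best_determiner("I have", "cars"))   # the zero-determiner variant wins on counts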

Proceedings Article
01 Jan 2008
TL;DR: This paper investigates the task of labeling Wikipedia pages with standard named entity tags, which can be used further by a range of information extraction and language processing tools and builds a Web service that classifies any Wikipedia page.
Abstract: Wikipedia is the largest organized knowledge repository on the Web, increasingly employed by natural language processing and search tools. In this paper, we investigate the task of labeling Wikipedia pages with standard named entity tags, which can be used further by a range of information extraction and language processing tools. To train the classifiers, we manually annotated a small set of Wikipedia pages and then extrapolated the annotations using the Wikipedia category information to a much larger training set. We employed several distinct features for each page: bag-of-words, page structure, abstract, titles, and entity mentions. We report high accuracies for several of the classifiers built. As a result of this work, a Web service that classifies any Wikipedia page has been made available to the academic community.

Proceedings Article
01 Jan 2008
TL;DR: This paper views medical coding as a multi-label classification problem, where each code is treated as a label for patient records, and compares two efficient algorithms for diagnosis coding on a large patient dataset.
Abstract: A critical, yet not very well studied problem in medical applications is the issue of accurately labeling patient records according to diagnoses and procedures that patients have undergone. This labeling problem, known as coding, consists of assigning standard medical codes (ICD9 and CPT) to patient records. Each patient record can have several corresponding labels/codes, many of which are correlated to specific diseases. The current, most frequent coding approach involves manual labeling, which requires considerable human effort and is cumbersome for large patient databases. In this paper we view medical coding as a multi-label classification problem, where we treat each code as a label for patient records. Due to government regulations concerning patient medical data, previous studies in automatic coding have been quite limited. In this paper, we compare two efficient algorithms for diagnosis coding on a large patient dataset.
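
A small sketch of treating code assignment as multi-label text classification with scikit-learn; the records, ICD-9 codes, and the one-vs-rest logistic regression model are illustrative assumptions rather than the two algorithms compared in the paper.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

records = [
    "patient admitted with chest pain and shortness of breath",
    "routine screening, family history of diabetes",
    "chest pain resolved, diabetes mellitus type 2 managed with diet",
]
codes = [["786.50"], ["V77.1"], ["786.50", "250.00"]]    # example ICD-9 labels

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(codes)                             # records x codes indicator matrix

model = make_pipeline(TfidfVectorizer(), OneVsRestClassifier(LogisticRegression()))
model.fit(records, Y)
print(mlb.inverse_transform(model.predict(["new patient reporting chest pain"])))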

Proceedings Article
01 Jan 2008
TL;DR: By using a machine learning technique and pattern matching, this work was able to extract more than 6.3 × 10^5 relations from hierarchical layouts in the Japanese Wikipedia, with a precision of 76.4%.
Abstract: This paper describes a method for extracting a large set of hyponymy relations from Wikipedia. Wikipedia is much more consistently structured than generic HTML documents, and we can extract a large number of hyponymy relations with simple methods. In this work, we managed to extract more than 1.4 × 10^6 hyponymy relations with 75.3% precision from the Japanese version of Wikipedia. To the best of our knowledge, this is the largest machine-readable thesaurus for Japanese. The main contribution of this paper is a method for hyponymy acquisition from hierarchical layouts in Wikipedia. By using a machine learning technique and pattern matching, we were able to extract more than 6.3 × 10^5 relations from hierarchical layouts in the Japanese Wikipedia, and their precision was 76.4%. The remaining hyponymy relations were acquired by existing methods for extracting relations from definition sentences and category pages. This means that extraction from the hierarchical layouts almost doubled the number of relations extracted.
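
A toy sketch of reading hyponymy candidates off a hierarchical heading/bullet layout, the kind of structure mined from Wikipedia above; the markup and extraction pattern are simplified assumptions, and the paper additionally filters candidates with a learned classifier.

wiki_section = """\
== Musical instruments ==
* Guitar
* Violin
== Programming languages ==
* Python
* OCaml
"""

def layout_hyponyms(text):
    # Treat the nearest preceding heading as the hypernym of each bullet item.
    pairs, hypernym = [], None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("==") and line.endswith("=="):
            hypernym = line.strip("= ").strip()
        elif line.startswith("*") and hypernym:
            pairs.append((hypernym, line.lstrip("* ").strip()))
    return pairs

for hyper, hypo in layout_hyponyms(wiki_section):
    print(f"{hypo}  is-a  {hyper}")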

Proceedings Article
01 Jan 2008
TL;DR: Experiments show that description length gain outperforms other measures because of its strength in identifying short words; further performance improvement is reported through proper candidate pruning and ensemble segmentation that integrates the strengths of individual measures.
Abstract: This paper reports our empirical evaluation and comparison of several popular goodness measures for unsupervised segmentation of Chinese texts using Bakeoff-3 data sets within a unified framework. Assuming no prior knowledge about Chinese, this framework relies on a goodness measure to identify word candidates from unlabeled texts and then applies a generalized decoding algorithm to find the optimal segmentation of a sentence into such candidates with the greatest sum of goodness scores. Experiments show that description length gain outperforms other measures because of its strength in identifying short words. Further performance improvement is also reported, achieved by proper candidate pruning and by ensemble segmentation to integrate the strengths of individual measures.
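
A minimal sketch of the decoding step described above: segmenting a string into word candidates so that the sum of their goodness scores is maximal, via dynamic programming. The goodness table is invented; in the paper such scores come from measures like description length gain computed on unlabeled text.

GOODNESS = {"中国": 2.0, "人民": 1.8, "中": 0.2, "国": 0.2, "人": 0.3, "民": 0.3}

def segment(text, goodness, max_len=4):
    best = [0.0] + [float("-inf")] * len(text)    # best[i]: best score for text[:i]
    back = [0] * (len(text) + 1)
    for i in range(1, len(text) + 1):
        for j in range(max(0, i - max_len), i):
            score = best[j] + goodness.get(text[j:i], -1.0)   # penalty for unknown strings
            if score > best[i]:
                best[i], back[i] = score, j
    words, i = [], len(text)
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return list(reversed(words))

print(segment("中国人民", GOODNESS))   # ['中国', '人民']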

Proceedings Article
01 Jan 2008
TL;DR: A cross platform multilingual multimedia Indian Sign Language (ISL) dictionary building tool that facilitates the phonological annotation of Indian signs in the form of HamNoSys structure.
Abstract: This paper presents a cross platform multilingual multimedia Indian Sign Language (ISL) dictionary building tool. ISL is a linguistically under-investigated language with no source of well documented electronic data. Research on ISL linguistics also gets hindered due to a lack of ISL knowledge and the unavailability of any educational tools. Our system can be used to associate signs corresponding to a given text. The current system also facilitates the phonological annotation of Indian signs in the form of HamNoSys structure. The generated HamNoSys string can be given as input to an avatar module to produce an animated sign representation.

Proceedings Article
01 Jan 2008
TL;DR: A method for script independent word spotting in multilingual handwritten and machine printed documents that accepts a query in the form of text from the user and returns a ranked list of word images from a document image corpus based on similarity with the query word.
Abstract: This paper describes a method for script independent word spotting in multilingual handwritten and machine printed documents. The system accepts a query in the form of text from the user and returns a ranked list of word images from a document image corpus based on similarity with the query word. The system is divided into two main components. The first component, known as the Indexer, performs indexing of all word images present in the document image corpus. This is achieved by extracting Moment Based features from word images and storing them as an index. A template is generated for keyword spotting which stores the mapping of a keyword string to its corresponding word image, which is used for generating the query feature vector. The second component, the Similarity Matcher, returns a ranked list of word images which are most similar to the query based on a cosine similarity metric. Manual relevance feedback is applied based on Rocchio’s formula, which re-formulates the query vector to return an improved ranked listing of word images. The performance of the system is seen to be better on printed text than on handwritten text. Experiments are reported on documents of three different languages: English, Hindi and Sanskrit. For handwritten English, an average precision of 67% was obtained for 30 query words. For machine printed Hindi, an average precision of 71% was obtained for 75 query words, and for Sanskrit, an average precision of 87% was obtained with 100 queries.
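
A small sketch of the retrieval side described above: ranking indexed feature vectors by cosine similarity to a query vector and refining the query with Rocchio-style relevance feedback. The random vectors stand in for the moment-based features, and the Rocchio weights are conventional textbook values rather than the paper's.

import numpy as np

rng = np.random.default_rng(0)
index = rng.random((6, 8))                # 6 indexed word images, 8-dim feature vectors
query = rng.random(8)

def cosine_rank(query, index):
    sims = index @ query / (np.linalg.norm(index, axis=1) * np.linalg.norm(query))
    return np.argsort(-sims)

def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    # Move the query towards relevant vectors and away from non-relevant ones.
    q = alpha * query
    if len(relevant):
        q = q + beta * relevant.mean(axis=0)
    if len(nonrelevant):
        q = q - gamma * nonrelevant.mean(axis=0)
    return q

order = cosine_rank(query, index)
print("initial ranking:", order)

# Suppose the user marks the top hit relevant and the bottom hit non-relevant.
new_query = rocchio(query, index[order[:1]], index[order[-1:]])
print("updated ranking:", cosine_rank(new_query, index))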

Proceedings Article
01 Jan 2008
TL;DR: A system built for Named Entity Recognition for South and South East Asian Languages is described, specifically tuned for Hindi and Telugu, which uses CRF (Conditional Random Fields) based machine learning, followed by post-processing involving heuristics or rules.
Abstract: This paper, submitted as an entry for the NERSSEAL-2008 shared task, describes a system built for Named Entity Recognition for South and South East Asian Languages. Our paper combines machine learning techniques with language specific heuristics to model the problem of NER for Indian languages. The system has been tested on five languages: Telugu, Hindi, Bengali, Urdu and Oriya. It uses CRF (Conditional Random Fields) based machine learning, followed by post-processing which involves using some heuristics or rules. The system is specifically tuned for Hindi and Telugu; we also report results for the other languages.

Proceedings Article
01 Jan 2008
TL;DR: This paper summarizes the corpus and lexical resources being developed for Urdu by the CRULP, in Pakistan.
Abstract: Urdu is spoken by more than 100 million speakers. This paper summarizes the corpus and lexical resources being developed for Urdu by the CRULP, in Pakistan.


Proceedings Article
01 Jan 2008
TL;DR: The potential for identifying computationally relevant typological features from a multilingual corpus of language data built from readily available language data collected off the Web is explored.
Abstract: In this paper we explore the potential for identifying computationally relevant typological features from a multilingual corpus of language data built from readily available language data collected off the Web. Our work builds on previous structural projection work, where we extend the work of projection to building individual CFGs for approximately 100 languages. We then use the CFGs to discover the values of typological parameters such as word order, the presence or absence of definite and indefinite determiners, etc. Our methods have the potential of being extended to many more languages and parameters, and can have significant effects on current research focused on tool and resource development for low-density languages and grammar induction from raw corpora.

Proceedings Article
01 Jan 2008
TL;DR: This work introduces a novel approach for automatic correction of spelling mistakes by deploying finite state automata to propose candidate corrections within a specified edit distance from the misspelled word.
Abstract: Many natural language applications, like machine translation and information extraction, are required to operate on text with spelling errors. Those spelling mistakes have to be corrected automatically to avoid deteriorating the performance of such applications. In this work, we introduce a novel approach for automatic correction of spelling mistakes by deploying finite state automata to propose candidate corrections within a specified edit distance from the misspelled word. After choosing candidate corrections, a language model is used to assign scores to the candidate corrections and choose the best correction in the given context. The proposed approach is language independent and requires only a dictionary and text data for building a language model. The approach has been tested on both Arabic and English text and achieved an accuracy of 89%.
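
A simplified sketch of the two stages described above: generate in-dictionary candidates within a small edit distance, then pick the one a language model prefers in context. A brute-force edit-distance filter stands in for the finite-state generator, and the toy bigram model, dictionary, and test sentence are assumptions.

from collections import Counter

DICTIONARY = {"the", "cat", "sat", "on", "mat", "man", "hat"}
CORPUS = "the cat sat on the mat the man sat on the hat".split()

def edit_distance(a, b):
    # Standard dynamic-programming Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def candidates(word, max_dist=1):
    return [w for w in DICTIONARY if edit_distance(word, w) <= max_dist] or [word]

bigrams, unigrams = Counter(zip(CORPUS, CORPUS[1:])), Counter(CORPUS)

def bigram_score(prev_word, word):
    # Add-one smoothed bigram probability used to rank candidate corrections.
    return (bigrams[(prev_word, word)] + 1) / (unigrams[prev_word] + len(DICTIONARY))

def correct(sentence):
    out = ["<s>"]
    for word in sentence.split():
        out.append(max(candidates(word), key=lambda c: bigram_score(out[-1], c)))
    return " ".join(out[1:])

print(correct("the cst sat on the mzt"))   # -> "the cat sat on the mat"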