
Showing papers in "ACM Transactions on Asian Language Information Processing in 2010"


Journal ArticleDOI
TL;DR: The results show that a simple learning system with six n-gram feature templates and a six-tag set obtains competitive performance when learning only from a training corpus; an ensemble learning technique, the assistant segmenter, is also proposed and its effectiveness verified.
Abstract: Chinese word segmentation is an active area of Chinese language processing, although it continues to suffer from the debate over what precisely constitutes a word in Chinese. We take corpus-defined segmentation standards as our starting point and regard Chinese word segmentation as a character-based tagging problem, noting a strong trend toward character-based tagging approaches in this field. In particular, learning from a segmented corpus with or without additional linguistic resources is treated in a unified way, the only difference being how the feature template set is selected. Our approach differs from existing work in that both feature template selection and tag set selection are considered, rather than focusing on feature templates alone. We show that the choice of tag set leads to significant performance differences, and that a six-tag set in particular is good enough for most current segmented corpora. The linguistic meaning of a tag set is also discussed. Our results show that a simple learning system with six n-gram feature templates and a six-tag set obtains competitive performance when learning only from a training corpus. When additional linguistic resources are available, an ensemble learning technique, the assistant segmenter, is proposed and its effectiveness verified. The assistant segmenter also proves to be an effective method for segmentation standard adaptation that outperforms existing ones. Based on the proposed approach, our system provides state-of-the-art performance on all 12 corpora of three international Chinese word segmentation bakeoffs.

86 citations
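
To make the character-based tagging formulation concrete, here is a minimal sketch assuming a six-tag set {B, B2, B3, M, E, S} and simple character n-gram feature templates; both are common illustrative choices, not necessarily the paper's exact configuration.

```python
# Minimal sketch of character-based tagging for Chinese word segmentation.
# The six-tag set {B, B2, B3, M, E, S} and the n-gram feature templates below
# are illustrative assumptions, not necessarily the exact ones used in the paper.

def ngram_features(chars, i):
    """Unigram and bigram feature templates around position i."""
    c = lambda k: chars[i + k] if 0 <= i + k < len(chars) else "<PAD>"
    return [
        f"C-1={c(-1)}", f"C0={c(0)}", f"C+1={c(1)}",      # character unigrams
        f"C-1C0={c(-1)}{c(0)}", f"C0C+1={c(0)}{c(1)}",    # character bigrams
        f"C-1C+1={c(-1)}{c(1)}",                          # skip bigram
    ]

def tags_to_words(chars, tags):
    """Decode a six-tag sequence (B, B2, B3, M, E, S) into words."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        if tag == "S":                 # single-character word
            if current:
                words.append(current)
                current = ""
            words.append(ch)
        elif tag == "B":               # word-initial character
            if current:
                words.append(current)
            current = ch
        else:                          # B2, B3, M continue the word; E closes it
            current += ch
            if tag == "E":
                words.append(current)
                current = ""
    if current:
        words.append(current)
    return words

chars = list("我爱北京天安门")
tags  = ["S", "S", "B", "E", "B", "B2", "E"]   # hypothetical tagger output
print(tags_to_words(chars, tags))              # ['我', '爱', '北京', '天安门']
print(ngram_features(chars, 2))
```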


Journal ArticleDOI
TL;DR: The objective of this work is to develop an NLP infrastructure for Urdu that is customizable and capable of providing basic analysis on which more advanced information extraction tools can be built.
Abstract: There has been an increase in the amount of multilingual text on the Internet due to the proliferation of news sources and blogs. The Urdu language, in particular, has experienced explosive growth on the Web. Text mining for information discovery, which includes tasks such as identifying topics, relationships and events, and sentiment analysis, requires sophisticated natural language processing (NLP). NLP systems begin with modules such as word segmentation, part-of-speech tagging, and morphological analysis and progress to modules such as shallow parsing and named entity tagging. While there have been considerable advances in developing such comprehensive NLP systems for English, the work for Urdu is still in its infancy. The tasks of interest in Urdu NLP include analyzing data sources such as blogs and comments to news articles to provide insight into social and human behavior. All of this requires a robust NLP system. The objective of this work is to develop an NLP infrastructure for Urdu that is customizable and capable of providing basic analysis on which more advanced information extraction tools can be built. This system assimilates resources from various online sources to facilitate improved named entity tagging and Urdu-to-English transliteration. The annotated data required to train the learning models used here is acquired by standardizing the currently limited resources available for Urdu. Techniques such as bootstrap learning and resource sharing from a syntactically similar language, Hindi, are explored to augment the available annotated Urdu data. Each of the new Urdu text processing modules has been integrated into a general text-mining platform. The evaluations performed demonstrate that the accuracies have either met or exceeded the state of the art.

55 citations
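
A minimal sketch of the kind of modular pipeline described above, where each stage (segmentation, POS tagging, named entity tagging) consumes the previous stage's output and can be swapped or retrained independently; the stage implementations and the Document fields are placeholders, not the paper's models.

```python
# Sketch of a modular text-processing pipeline: each stage reads the previous
# stage's output, so individual modules can be replaced without touching the rest.
# The stage implementations below are trivial placeholders, not the paper's models.

from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Document:
    text: str
    tokens: List[str] = field(default_factory=list)
    pos_tags: List[str] = field(default_factory=list)
    entities: List[str] = field(default_factory=list)

def segment(doc: Document) -> Document:
    doc.tokens = doc.text.split()                           # placeholder segmenter
    return doc

def pos_tag(doc: Document) -> Document:
    doc.pos_tags = ["NOUN"] * len(doc.tokens)               # placeholder tagger
    return doc

def tag_entities(doc: Document) -> Document:
    doc.entities = [t for t in doc.tokens if t.istitle()]   # placeholder NER
    return doc

class Pipeline:
    def __init__(self, stages: List[Callable[[Document], Document]]):
        self.stages = stages

    def run(self, text: str) -> Document:
        doc = Document(text)
        for stage in self.stages:
            doc = stage(doc)
        return doc

nlp = Pipeline([segment, pos_tag, tag_entities])
print(nlp.run("Lahore is a city in Pakistan").entities)     # ['Lahore', 'Pakistan']
```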


Journal ArticleDOI
TL;DR: Based on a comparison of word-based and language-independent approaches, it is found that the trunc-4 indexing scheme tends to result in performance levels statistically similar to those of an aggressive stemmer, yet better than the 4-gram indexing scheme.
Abstract: The main goal of this article is to describe and evaluate various indexing and search strategies for the Hindi, Bengali, and Marathi languages. These three languages are ranked among the world's 20 most spoken languages, and they share similar syntax, morphology, and writing systems. In this article we examine these languages from an Information Retrieval (IR) perspective by describing the key elements of their inflectional and derivational morphologies, and we suggest a light and a more aggressive stemming approach based on them. In our evaluation of these stemming strategies we make use of the FIRE 2008 test collections, and to broaden our comparisons we implement and evaluate two language-independent indexing methods: the n-gram and trunc-n (truncation to the first n letters). We evaluate these solutions by applying our various IR models, including the Okapi, Divergence from Randomness (DFR), and statistical language models (LM), together with two classical vector-space approaches: tf-idf and Lnu-ltc. Experiments performed with all three languages demonstrate that the I(ne)C2 model derived from the Divergence from Randomness paradigm tends to provide the best mean average precision (MAP). Our own tests suggest that improved retrieval effectiveness would be obtained by applying more aggressive stemmers, especially those accounting for certain derivational suffixes, compared to those involving a light stemmer or ignoring this type of word normalization procedure. Comparisons between no-stemming and stemming indexing schemes show that performance differences are almost always statistically significant. When, for example, an aggressive stemmer is applied, the relative improvements obtained are ~28% for the Hindi language, ~42% for Marathi, and ~18% for Bengali, as compared to a no-stemming approach. Based on a comparison of word-based and language-independent approaches, we find that the trunc-4 indexing scheme tends to result in performance levels statistically similar to those of an aggressive stemmer, yet better than the 4-gram indexing scheme. A query-by-query analysis reveals the reasons for this, and also demonstrates the advantage of applying a stemming or a trunc-4 indexing scheme.

38 citations
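
The two language-independent indexing schemes compared above can be sketched in a few lines; the tokenizer and example words are illustrative.

```python
# Sketch of the two language-independent indexing schemes compared in the article:
# trunc-n keeps only the first n letters of each word, while n-gram indexing
# replaces each word by its overlapping character n-grams.

def trunc_n_index(tokens, n=4):
    """trunc-4: truncate every token to its first n characters."""
    return [t[:n] for t in tokens]

def char_ngram_index(tokens, n=4):
    """4-gram: overlapping character n-grams (whole token kept if shorter)."""
    terms = []
    for t in tokens:
        if len(t) <= n:
            terms.append(t)
        else:
            terms.extend(t[i:i + n] for i in range(len(t) - n + 1))
    return terms

tokens = ["information", "retrieval"]
print(trunc_n_index(tokens))     # ['info', 'retr']
print(char_ngram_index(tokens))  # ['info', 'nfor', 'form', ..., 'eval']
```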


Journal ArticleDOI
TL;DR: The utility and practicality of the compositional approach is underscored by showing that a CLIR system integrated with compositional transliteration systems performs consistently on par with, and sometimes better than, one integrated with a direct transliteration system.
Abstract: Machine transliteration is an important problem in an increasingly multilingual world, as it plays a critical role in many downstream applications, such as machine translation or crosslingual information retrieval systems. In this article, we propose compositional machine transliteration systems, where multiple transliteration components may be composed either to improve existing transliteration quality, or to enable transliteration functionality between languages even when no direct parallel names corpora exist between them. Specifically, we propose two distinct forms of composition: serial and parallel. A serial compositional system chains individual transliteration components, say X → Y and Y → Z systems, to provide X → Z transliteration functionality. In parallel composition, evidence from multiple transliteration paths between X and Z is aggregated to improve the quality of a direct system. We demonstrate the functionality and performance benefits of the compositional methodology using a state-of-the-art machine transliteration framework in English and a set of Indian languages, namely Hindi, Marathi, and Kannada. Finally, we underscore the utility and practicality of our compositional approach by showing that a CLIR system integrated with compositional transliteration systems performs consistently on par with, and sometimes better than, one integrated with a direct transliteration system.

27 citations
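
A minimal sketch of the two composition schemes, treating a transliteration system as a function that returns scored candidates; the toy English-Hindi-Kannada systems and their scores are placeholders rather than the paper's trained models.

```python
# Sketch of serial and parallel composition of transliteration systems.
# A "system" is modeled as a function returning scored candidates; the toy
# systems and their scores below are placeholders, not the paper's models.

from collections import defaultdict

def compose_serial(sys_xy, sys_yz, name, beam=2):
    """Serial composition: chain X->Y and Y->Z, multiplying path scores."""
    scores = defaultdict(float)
    for y, p_xy in sys_xy(name)[:beam]:
        for z, p_yz in sys_yz(y)[:beam]:
            scores[z] += p_xy * p_yz
    return sorted(scores.items(), key=lambda kv: -kv[1])

def compose_parallel(systems, name):
    """Parallel composition: aggregate evidence from several X->Z paths."""
    scores = defaultdict(float)
    for system in systems:
        for z, p in system(name):
            scores[z] += p
    return sorted(scores.items(), key=lambda kv: -kv[1])

# Toy English->Hindi, Hindi->Kannada, and direct English->Kannada systems.
eng_hin = lambda s: [("दिल्ली", 0.7), ("डेल्ही", 0.3)]
hin_kan = lambda s: [("ದೆಹಲಿ", 0.8), ("ಡೆಲ್ಲಿ", 0.2)]
eng_kan_direct = lambda s: [("ದೆಹಲಿ", 0.6), ("ಡೆಲ್ಲಿ", 0.4)]

serial = lambda s: compose_serial(eng_hin, hin_kan, s)
print(compose_serial(eng_hin, hin_kan, "Delhi"))          # chained X->Y->Z
print(compose_parallel([eng_kan_direct, serial], "Delhi"))  # direct + chained paths
```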


Journal ArticleDOI
TL;DR: This article shows that by properly harnessing monolingual resources in conjunction with a manually created rule base, one can achieve reasonable transliteration performance by exploiting the power of Character Sequence Modeling (CSM), which requires only monolingual resources.
Abstract: Today, parallel corpus-based systems dominate the transliteration landscape. But resource-scarce languages do not enjoy the luxury of a large parallel transliteration corpus. For these languages, rule-based transliteration is the only viable option. In this article, we show that by properly harnessing monolingual resources in conjunction with a manually created rule base, one can achieve reasonable transliteration performance. We achieve this performance by exploiting the power of Character Sequence Modeling (CSM), which requires only monolingual resources. We present the results of our rule-based system for Hindi-to-English, English-to-Hindi, and Persian-to-English transliteration tasks. We also perform an extrinsic evaluation of transliteration systems in the context of Cross-Lingual Information Retrieval. Another important contribution of our work is to explain the widely varying accuracy numbers reported in the transliteration literature in terms of the entropy of the language pairs and the datasets involved.

26 citations
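
A minimal sketch of the overall idea, assuming that Character Sequence Modeling amounts to a character n-gram language model estimated from monolingual target-language text and used to rerank candidates produced by hand-written rewrite rules; the rules and the tiny word list are invented for illustration.

```python
# Sketch: rule-based transliteration candidates reranked by a character sequence
# model (a character bigram LM estimated from monolingual target-language text).
# The rewrite rules and the tiny word list are illustrative, not the paper's.

import itertools, math
from collections import Counter

RULES = {"kh": ["kh", "k"], "aa": ["a", "aa"], "v": ["v", "w"]}  # toy rules

def candidates(src):
    """Apply rules left-to-right, branching wherever a rule offers alternatives."""
    segments, i = [], 0
    while i < len(src):
        if src[i:i + 2] in RULES:
            segments.append(RULES[src[i:i + 2]]); i += 2
        elif src[i] in RULES:
            segments.append(RULES[src[i]]); i += 1
        else:
            segments.append([src[i]]); i += 1
    return ["".join(p) for p in itertools.product(*segments)]

def char_bigram_lm(words):
    """Maximum-likelihood character bigram model with add-one smoothing."""
    bigrams, unigrams = Counter(), Counter()
    for w in words:
        w = f"^{w}$"
        unigrams.update(w[:-1])
        bigrams.update(zip(w, w[1:]))
    vocab = len(set("".join(words)) | {"^", "$"})
    def logprob(word):
        w = f"^{word}$"
        return sum(math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab))
                   for a, b in zip(w, w[1:]))
    return logprob

lm = char_bigram_lm(["khan", "kanpur", "varanasi", "washington"])  # monolingual side
print(sorted(candidates("khaan"), key=lm, reverse=True)[:3])       # rule outputs reranked
```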


Journal ArticleDOI
TL;DR: The test collections used at FIRE 2008 are described, the approaches adopted by various participants are summarized, the limitations of the datasets are discussed, and the tasks planned for the next iteration of FIRE are outlined.
Abstract: The aim of the Forum for Information Retrieval Evaluation (FIRE) is to create an evaluation framework in the spirit of TREC (Text REtrieval Conference), CLEF (Cross-Language Evaluation Forum), and NTCIR (NII Test Collection for IR Systems), for Indian language Information Retrieval. The first evaluation exercise conducted by FIRE was completed in 2008. This article describes the test collections used at FIRE 2008, summarizes the approaches adopted by various participants, discusses the limitations of the datasets, and outlines the tasks planned for the next iteration of FIRE.

23 citations


Journal ArticleDOI
TL;DR: This article reconsiders the task of MRD-based word sense disambiguation, extending the basic Lesk algorithm to investigate the impact of different tokenization schemes and methods of definition extension on WSD performance, and demonstrates the utility of ontology induction.
Abstract: This article reconsiders the task of MRD-based word sense disambiguation, extending the basic Lesk algorithm to investigate the impact of different tokenization schemes and methods of definition extension on WSD performance. In experimentation over the Hinoki Sensebank and the Japanese Senseval-2 dictionary task, we demonstrate that sense-sensitive definition extension over hyponyms, hypernyms, and synonyms, combined with definition extension and word tokenization, leads to WSD accuracy above both unsupervised and supervised baselines. In doing so, we demonstrate the utility of ontology induction and establish new opportunities for the development of baseline unsupervised WSD methods.

21 citations
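
A minimal sketch of Lesk-style disambiguation with definition extension: each sense's gloss is extended with the glosses of related senses (hypernyms, hyponyms, synonyms) before overlap with the context is counted. The toy sense inventory is invented and the overlap measure is simplified relative to the article's setup.

```python
# Sketch of Lesk-style WSD with sense-sensitive definition extension: a sense's
# gloss is extended with the glosses of related senses, then the sense with the
# largest overlap with the context wins. The toy sense inventory is made up.

def tokenize(text):
    return set(text.lower().split())

def extended_gloss(sense, inventory):
    words = tokenize(inventory[sense]["gloss"])
    for related in inventory[sense].get("related", []):   # hypernyms/hyponyms/synonyms
        words |= tokenize(inventory[related]["gloss"])
    return words

def lesk(context, senses, inventory):
    """Pick the sense whose extended gloss overlaps most with the context."""
    context_words = tokenize(context)
    return max(senses,
               key=lambda s: len(extended_gloss(s, inventory) & context_words))

inventory = {
    "bank/finance": {"gloss": "an institution that accepts deposits and lends money",
                     "related": ["institution"]},
    "bank/river":   {"gloss": "sloping land beside a body of water",
                     "related": ["slope"]},
    "institution":  {"gloss": "an organization founded for a purpose such as money lending"},
    "slope":        {"gloss": "ground that forms a natural incline beside water"},
}

print(lesk("she deposited money at the bank", ["bank/finance", "bank/river"], inventory))
```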


Journal ArticleDOI
TL;DR: A three-tier maximum entropy model incorporating linguistically-motivated features for generating the commonly used Chinese punctuation marks in unpunctuated sentences output by a surface realizer is implemented and significantly outperforms a baseline 5-gram language model.
Abstract: This article investigates a relatively underdeveloped subject in natural language processing---the generation of punctuation marks. From a theoretical perspective, we study 16 Chinese punctuation marks as defined in the Chinese national standard of punctuation usage, and categorize these punctuation marks into three different types according to their syntactic properties. We implement a three-tier maximum entropy model incorporating linguistically-motivated features for generating the commonly used Chinese punctuation marks in unpunctuated sentences output by a surface realizer. Furthermore, we present a method to automatically extract cue words indicating sentence-final punctuation marks as a specialized feature to construct a more precise model. Evaluated on the Penn Chinese Treebank data, the MaxEnt model achieves an f-score of 79.83% for punctuation insertion and 74.61% for punctuation restoration using gold data input, and 79.50% for insertion and 73.32% for restoration using parser-based imperfect input. The experiments show that the MaxEnt model significantly outperforms a baseline 5-gram language model that scores 54.99% for punctuation insertion and 52.01% for restoration. We show that our results are not far from human performance on the same task, with human insertion f-scores in the range of 81-87% and human restoration in the range of 71-82%. Finally, a manual error analysis of the generation output shows that close to 40% of the mismatched punctuation marks do in fact result in acceptable choices, a fact obscured in the automatic string-matching-based evaluation scores.

18 citations
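
A minimal sketch of punctuation insertion cast as per-token classification with a maximum entropy model (multinomial logistic regression); the word-window features and the tiny training set are illustrative and much simpler than the article's three-tier model.

```python
# Sketch of punctuation insertion as per-token classification with a maximum
# entropy model (multinomial logistic regression). The word-window features and
# the toy training data are illustrative, not the paper's three-tier model.

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def features(tokens, i):
    return {
        "w0": tokens[i],
        "w-1": tokens[i - 1] if i > 0 else "<S>",
        "w+1": tokens[i + 1] if i + 1 < len(tokens) else "</S>",
        "is_last": i == len(tokens) - 1,
    }

# Training sentences paired with the punctuation mark that follows each token
# ("O" = no mark). Toy data only.
train = [
    (["他", "来", "了"], ["O", "O", "。"]),
    (["你", "好", "吗"], ["O", "O", "？"]),
    (["我", "想", "去", "北京"], ["O", "O", "O", "。"]),
]

X, y = [], []
for tokens, labels in train:
    for i, label in enumerate(labels):
        X.append(features(tokens, i))
        y.append(label)

vec = DictVectorizer()
clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(X), y)

test = ["他", "想", "去", "吗"]
pred = clf.predict(vec.transform([features(test, i) for i in range(len(test))]))
print(list(zip(test, pred)))    # predicted punctuation (or "O") after each token
```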


Journal ArticleDOI
TL;DR: The results show that the proposed method gives better perplexity than the comparative baselines, including a word-based/class-based n-gram LM, their interpolated LM, a cache-based LM, and a topic-dependent LM based on Latent Dirichlet Allocation (LDA).
Abstract: Language models (LMs) are an important field of study in automatic speech recognition (ASR) systems. The LM helps the acoustic model find the corresponding word sequence of a given speech signal; without it, ASR systems would not understand the language and it would be hard to find the correct word sequence. During the past few years, researchers have tried to incorporate long-range dependencies into statistical word-based n-gram LMs. One of these long-range dependencies is topic. Unlike words, topic is unobservable, so it is necessary to find the meanings behind the words to get at the topic. This research is based on the belief that nouns carry topic information. We propose a new approach to a topic-dependent LM, where the topic is decided in an unsupervised manner. Latent Semantic Analysis (LSA) is employed to reveal hidden (latent) relations among nouns in the context words. To decide the topic of an event, a fixed-size word history sequence (window) is observed, and voting is then carried out based on noun class occurrences weighted by a confidence measure. Experiments were conducted on an English corpus and a Japanese corpus: the Wall Street Journal corpus and the Mainichi Shimbun (Japanese newspaper) corpus. The results show that our proposed method gives better perplexity than the comparative baselines, including a word-based/class-based n-gram LM, their interpolated LM, a cache-based LM, a topic-dependent LM based on n-grams, and a topic-dependent LM based on Latent Dirichlet Allocation (LDA). N-best list rescoring was conducted to validate its applicability in ASR systems.

15 citations
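
A minimal sketch of the unsupervised topic decision: nouns are grouped by LSA over a noun-document co-occurrence matrix, and the topic of a history window is chosen by voting over noun class occurrences. The tiny corpus, noun list, number of classes, and the unweighted vote are illustrative assumptions.

```python
# Sketch of the unsupervised topic decision: nouns are clustered with LSA (SVD
# over a noun-document count matrix), and the topic of a word-history window is
# chosen by voting over the noun classes that occur in it. The tiny corpus, noun
# list, and number of classes are illustrative assumptions.

from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

docs = [
    "stocks fell as investors worried about bank earnings",
    "the bank raised interest rates and stocks rallied",
    "the team won the match after a late goal",
    "the coach praised the team for the second goal",
]
nouns = ["stocks", "investors", "bank", "earnings", "rates",
         "team", "match", "goal", "coach"]

# Noun-by-document count matrix, reduced with LSA, then clustered into noun classes.
vec = CountVectorizer(vocabulary=nouns)
noun_doc = vec.fit_transform(docs).T.toarray()          # rows = nouns, cols = docs
lsa = TruncatedSVD(n_components=2, random_state=0).fit_transform(noun_doc)
classes = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(lsa)
noun_class = dict(zip(nouns, classes))

def topic_of(history, window=8):
    """Vote on the topic using noun class occurrences in the last `window` words."""
    votes = Counter(noun_class[w] for w in history[-window:] if w in noun_class)
    return votes.most_common(1)[0][0] if votes else None

history = "the bank said interest rates would rise and stocks".split()
print(topic_of(history))   # class id shared by the finance nouns
```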


Journal ArticleDOI
TL;DR: Experimental results show that the proposed framework can effectively retrieve the set of snippets which may contain synonymous transliterations and then extract the target terms; most of the extracted synonymous transliterations rank higher in similarity to the input transliteration than other noise terms.
Abstract: The World Wide Web has been considered one of the important sources of information. Using search engines to retrieve Web pages can gather lots of information, including foreign information. However, to be better understood by local readers, proper names in a foreign language, such as English, are often transliterated to a local language such as Chinese. Due to different translators and the lack of a translation standard, translating foreign proper nouns may result in different transliterations and pose a notorious headache. In particular, it may cause incomplete search results: using one transliteration as a query keyword will fail to retrieve the Web pages which use a different word as the transliteration. Consequently, important information may be missed. We present a framework for mining as many synonymous transliterations as possible from the Web for a given transliteration. The results can be used to construct a database of synonymous transliterations, which can be utilized for query expansion so as to alleviate the incomplete search problem. Experimental results show that the proposed framework can effectively retrieve the set of snippets which may contain synonymous transliterations and then extract the target terms. Most of the extracted synonymous transliterations rank higher in similarity to the input transliteration than other noise terms.

13 citations
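
A minimal sketch of the final ranking step, where candidate terms mined from snippets are ordered by similarity to the input transliteration so that synonymous transliterations outrank noise terms; the candidate list and the character-bigram Dice similarity are assumed for illustration.

```python
# Sketch of the final ranking step: candidate terms extracted from web snippets
# are ranked by similarity to the input transliteration so that true synonymous
# transliterations outrank noise terms. The candidates are illustrative and the
# character-bigram Dice similarity is an assumed measure.

def char_bigrams(term):
    return {term[i:i + 2] for i in range(len(term) - 1)} or {term}

def dice(a, b):
    x, y = char_bigrams(a), char_bigrams(b)
    return 2 * len(x & y) / (len(x) + len(y))

def rank_candidates(query, candidates):
    return sorted(candidates, key=lambda c: dice(query, c), reverse=True)

query = "歐巴馬"                                     # an input transliteration of "Obama"
candidates = ["奧巴馬", "歐巴瑪", "白宮", "美國總統"]  # terms mined from snippets
for cand in rank_candidates(query, candidates):
    print(cand, round(dice(query, cand), 2))         # transliteration variants rank first
```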


Journal ArticleDOI
TL;DR: The major findings are that the corpus-based stemming approach is effective as a knowledge-light term conflation step and useful when few language-specific resources are available, and that selecting the number of relevance feedback terms based on the ratio of word vocabulary size to sub-word vocabulary size almost always slightly increases information retrieval effectiveness.
Abstract: The Forum for Information Retrieval Evaluation (FIRE) provides document collections, topics, and relevance assessments for information retrieval (IR) experiments on Indian languages. Several research questions are explored in this article: 1) how to create a simple, language-independent corpus-based stemmer, 2) how to identify sub-words and which types of sub-words are suitable as indexing units, and 3) how to apply blind relevance feedback on sub-words and how feedback term selection is affected by the type of indexing unit. More than 140 IR experiments are conducted using the BM25 retrieval model on the topic titles and descriptions (TD) for the FIRE 2008 English, Bengali, Hindi, and Marathi document collections. The major findings are: The corpus-based stemming approach is effective as a knowledge-light term conflation step and useful when few language-specific resources are available. For English, the corpus-based stemmer performs nearly as well as the Porter stemmer and significantly better than the baseline of indexing words when combined with query expansion. In combination with blind relevance feedback, it also performs significantly better than the baseline for Bengali and Marathi IR. Sub-words such as consonant-vowel sequences and word prefixes can yield similar or better performance in comparison to word indexing. There is no best-performing method for all languages: for English, indexing using the Porter stemmer performs best; for Bengali and Marathi, overlapping 3-grams obtain the best result; and for Hindi, 4-prefixes yield the highest MAP. However, in combination with blind relevance feedback using 10 documents and 20 terms, 6-prefixes for English and 4-prefixes for Bengali, Hindi, and Marathi IR yield the highest MAP. Sub-word identification is a general case of decompounding. It results in one or more index terms for a single word form and increases the number of index terms but decreases their average length. The corresponding retrieval experiments show that relevance feedback on sub-words benefits from selecting a larger number of index terms in comparison with retrieval on word forms. Similarly, selecting the number of relevance feedback terms depending on the ratio of word vocabulary size to sub-word vocabulary size almost always slightly increases information retrieval effectiveness compared to using a fixed number of terms for different languages.
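
Two of the sub-word indexing units compared above, and the idea of scaling the blind-relevance-feedback term budget by the word/sub-word vocabulary ratio, can be sketched as follows; the scaling formula and the vocabulary sizes in the example are illustrative assumptions.

```python
# Sketch of two sub-word indexing units compared in the article (n-prefixes and
# overlapping character n-grams), plus the idea of scaling the number of blind
# relevance feedback terms by the word/sub-word vocabulary ratio. The exact
# scaling formula and the vocabulary sizes below are illustrative assumptions.

def prefixes(tokens, n=4):
    """n-prefix indexing: keep only the first n characters of every word."""
    return [t[:n] for t in tokens]

def overlapping_ngrams(tokens, n=3):
    """Overlapping character n-gram indexing (whole token kept if shorter than n)."""
    return [t[i:i + n] for t in tokens for i in range(max(len(t) - n + 1, 1))]

def feedback_term_count(word_vocab_size, subword_vocab_size, base_terms=20):
    """Scale the feedback term budget by the word/sub-word vocabulary ratio, so
    sub-word indexes (which typically have smaller vocabularies) get more terms."""
    return max(1, round(base_terms * word_vocab_size / subword_vocab_size))

tokens = "retrieval retrieved retrieving information informative".split()
print(prefixes(tokens))                          # ['retr', 'retr', 'retr', 'info', 'info']
print(sorted(set(overlapping_ngrams(tokens)))[:8])
print(feedback_term_count(120_000, 40_000))      # hypothetical vocabulary sizes -> 60
```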

Journal ArticleDOI
TL;DR: A hybrid approach that uses a small amount of linguistic knowledge, in the form of orthographic rewrite rules, to refine an existing MI-produced segmentation: underlying analyses of morphs are derived from the surface morph segmentation, and from these a morpheme-level segmentation is learned.
Abstract: Allomorphic variation, or form variation among morphs with the same meaning, is a stumbling block to morphological induction (MI). To address this problem, we present a hybrid approach that uses a small amount of linguistic knowledge in the form of orthographic rewrite rules to help refine an existing MI-produced segmentation. Using rules, we derive underlying analyses of morphs---generalized with respect to contextual spelling differences---from an existing surface morph segmentation, and from these we learn a morpheme-level segmentation. To learn morphemes, we have extended the Morfessor segmentation algorithm [Creutz and Lagus 2004; 2005; 2006] by using rules to infer possible underlying analyses from surface segmentations. A segmentation produced by Morfessor Categories-MAP Software v. 0.9.2 is used as input to our procedure and as a baseline that we evaluate against. To suggest analyses for our procedure, a set of language-specific orthographic rules is needed. Our procedure has yielded promising improvements for English and Turkish over the baseline approach when tested on the Morpho Challenge 2005 and 2007 style evaluations. On the Morpho Challenge 2007 test evaluation, we report gains over the current best unsupervised contestant for Turkish, where our technique shows a 2.5% absolute F-score improvement.
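
A minimal sketch of the rule component: orthographic rewrite rules, applied in reverse, propose underlying analyses for surface morphs so that spelling variants of one morpheme collapse. The two English rules and the surface segmentations are illustrative; in the paper such analyses feed a Morfessor-style learner rather than being printed.

```python
# Sketch of deriving candidate underlying analyses for surface morphs with
# orthographic rewrite rules, so spelling variants of one morpheme ("bak"/"bake",
# "stopp"/"stop") can collapse to a single underlying form. The two English rules
# and the surface segmentations are illustrative only.

def underlying_analyses(stem, suffix):
    """Candidate underlying forms of a surface stem, given its suffix context."""
    candidates = {stem}                                   # identity analysis
    if suffix.startswith(("ing", "ed")) and not stem.endswith("e"):
        candidates.add(stem + "e")                        # undo e-deletion: bak+ing -> bake
    if len(stem) >= 2 and stem[-1] == stem[-2]:
        candidates.add(stem[:-1])                         # undo consonant doubling: stopp+ed -> stop
    return candidates

surface = [("bak", "ing"), ("bake", "s"), ("stopp", "ed"), ("stop", "s")]
for stem, suffix in surface:
    print(f"{stem}+{suffix}: {sorted(underlying_analyses(stem, suffix))}")
```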

Journal ArticleDOI
TL;DR: This article presents a pipeline framework for identifying soundbites and their speaker names from Mandarin broadcast news transcripts, and finds that both components impact the pipeline system, with speaker name recognition errors causing more degradation in overall system performance than soundbite segment detection errors.
Abstract: This article presents a pipeline framework for identifying soundbites and their speaker names from Mandarin broadcast news transcripts. Both modules, soundbite segment detection and soundbite speaker name recognition, are based on a supervised classification approach using multiple linguistic features. We systematically evaluated performance for each module as well as the entire system, and investigated the effect of using speech recognition (ASR) output and automatic sentence segmentation. We found that both components affect the pipeline system, with automatic speaker name recognition errors causing more degradation in overall system performance than soundbite segment detection errors. In addition, our experimental results show that using ASR output degrades the system performance significantly, and that using automatic sentence segmentation greatly impacts soundbite detection, but has much less effect on speaker name recognition.

Journal ArticleDOI
TL;DR: This special issue of TALIP devoted to Indian language IR provides a good opportunity to take stock of the lessons learned, as well as to identify interesting problems and issues that can be fruitfully investigated in the future.
Abstract: Over the last few years, there has been a significant growth in the amount of research activity on Indian language Information Retrieval (IR) and allied problems. For much of this time, however, standard benchmark datasets were not publicly available. Thus, it was difficult to measure how much actual progress was being made as a result of this research. More recently, efforts have been made to create evaluation frameworks that provide reusable test collections for experimentation, as well as a common forum for comparing models and techniques. For example, the Forum for Information Retrieval Evaluation (FIRE) provides Cranfield-style text collections in Indian languages that may be used to evaluate systems for monolingual and cross-lingual IR. Owing to the availability of the FIRE data, groups have been able to conduct IR experiments on Indian languages on a larger scale than was hitherto possible. Similarly, the Named Entities Workshop (NEWS 2009) held in connection with ACL-IJCNLP 2009 provided training and test data for a task requiring machine transliteration of proper names from English to a set of languages (including the Indic languages Hindi, Kannada, and Tamil). This is thus a good time to take stock of the lessons learned, as well as to identify interesting problems and issues that can be fruitfully investigated in the future. The idea of a special issue of TALIP devoted to Indian language IR was rooted in this context. A total of sixteen submissions were received in response to the call for articles for this issue. Each article was carefully reviewed by three reviewers, and six articles were finally selected for publication. The first article, “The FIRE 2008 Evaluation Exercise” by Majumder et al., provides the motivation and background for the FIRE initiative. It describes how the FIRE 2008 test collection was constructed, and summarizes the approaches adopted by various participants. The authors also discuss the limitations of the datasets and outline the tasks planned for the next iteration of FIRE.