
Showing papers by Pushpak Bhattacharyya published in 2009


Proceedings ArticleDOI
02 Aug 2009
TL;DR: The results on 400 test sentences, translated using an SMT system trained on around 13000 parallel sentences, show that suffix + semantic relation → case marker/suffix is a very useful translation factor, in the sense of making a significant difference to output quality as indicated by subjective evaluation as well as BLEU scores.
Abstract: We report in this paper our work on accurately generating case markers and suffixes in English-to-Hindi SMT. Hindi is a relatively free word-order language, and makes use of a comparatively richer set of case markers and morphological suffixes for correct meaning representation. From our experience of large-scale English-Hindi MT, we are convinced that fluency and fidelity in the Hindi output get an order of magnitude facelift if accurate case markers and suffixes are produced. Now, the moot question is: what entity on the English side encodes the information contained in case markers and suffixes on the Hindi side? Our studies of correspondences in the two languages show that case markers and suffixes in Hindi are predominantly determined by the combination of suffixes and semantic relations on the English side. We, therefore, augment the aligned corpus of the two languages, with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers. Our results on 400 test sentences, translated using an SMT system trained on around 13000 parallel sentences, show that suffix + semantic relation → case marker/suffix is a very useful translation factor, in the sense of making a significant difference to output quality as indicated by subjective evaluation as well as BLEU scores.
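The factor described above can be pictured with a small corpus-augmentation sketch: each English token is annotated with its suffix and its semantic relation so that a factored SMT system can learn the mapping (suffix, semantic relation) → Hindi case marker/suffix. The factor layout follows the common Moses-style word|factor notation, and the suffix stripper and relation labels below are illustrative assumptions, not the paper's actual annotation scheme.

```python
# Minimal sketch: augmenting English tokens with suffix and semantic-relation
# factors so that a factored SMT system can learn the mapping
#   (suffix, semantic relation) -> Hindi case marker / suffix.
# The relation labels ("agt", "obj", "ins") and the suffix stripper are
# illustrative assumptions, not the paper's actual annotation scheme.

def split_suffix(word):
    """Very rough suffix stripper, used only for illustration."""
    for suf in ("ing", "ed", "es", "s"):
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)], suf
    return word, "NULL"

def factorize(tokens, relations):
    """Attach word|lemma|suffix|relation factors to each English token.

    `relations` maps token index -> semantic relation of that token with
    respect to its head (e.g. from a dependency/semantic analyzer).
    """
    factored = []
    for i, tok in enumerate(tokens):
        lemma, suffix = split_suffix(tok.lower())
        rel = relations.get(i, "NULL")
        factored.append(f"{tok}|{lemma}|{suffix}|{rel}")
    return " ".join(factored)

if __name__ == "__main__":
    sentence = ["The", "boy", "cuts", "the", "apple", "with", "a", "knife"]
    rels = {1: "agt", 4: "obj", 7: "ins"}   # hypothetical relation labels
    print(factorize(sentence, rels))
    # The|the|NULL|NULL boy|boy|NULL|agt cuts|cut|s|NULL ...
```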

52 citations


Proceedings ArticleDOI
07 Aug 2009
TL;DR: The results reported show that performance can be improved using a word language model to disambiguate the output produced by the transducer-only approach, especially when diacritic marks are not present in the Urdu input.
Abstract: We report in this paper a novel hybrid approach for Urdu to Hindi transliteration that combines finite-state machine (FSM) based techniques with a statistical word language model based approach. The output from the FSM is filtered with the word language model to produce the correct Hindi output. The main problem handled is the omission of diacritical marks from the input Urdu text. Our system produces the correct Hindi output even when crucial information in the form of diacritic marks is absent. The approach improves the accuracy of the transducer-only approach from 50.7% to 79.1%. The results reported show that performance can be improved using a word language model to disambiguate the output produced by the transducer-only approach, especially when diacritic marks are not present in the Urdu input.
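A minimal sketch of the hybrid idea, assuming a toy candidate generator in place of the actual FSM: when diacritics are missing, the FSM overgenerates several Hindi spellings, and a word-level language model picks the most plausible one. The unigram model and the example words are stand-ins for the paper's real components.

```python
# Minimal sketch of the hybrid idea: an FSM proposes several Hindi spellings
# for an Urdu word (ambiguous when diacritics are missing), and a word-level
# language model picks the most plausible one.  The candidate generator and
# the unigram counts are toy stand-ins for the paper's actual FSM and LM.

import math
from collections import Counter

class WordLM:
    """Add-one-smoothed unigram LM over a Hindi word list (illustrative only)."""
    def __init__(self, corpus_words):
        self.counts = Counter(corpus_words)
        self.total = sum(self.counts.values())
        self.vocab = len(self.counts) + 1

    def logprob(self, word):
        return math.log((self.counts[word] + 1) / (self.total + self.vocab))

def pick_best(fsm_candidates, lm):
    """Rescore the FSM's candidate transliterations with the word LM."""
    return max(fsm_candidates, key=lm.logprob)

if __name__ == "__main__":
    # Toy Hindi corpus and toy FSM output for one diacritic-less Urdu token.
    lm = WordLM(["किताब", "किताब", "कताब", "कुतुब"])
    candidates = ["कताब", "किताब", "कुताब"]   # hypothetical FSM outputs
    print(pick_best(candidates, lm))          # -> "किताब"
```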

32 citations


Proceedings ArticleDOI
06 Aug 2009
TL;DR: A way of doing Word Sense Disambiguation (WSD) that has its origin in multilingual MT and that is cognizant of the fact that parallel corpora, wordnets and sense annotated corpora are scarce resources is reported.
Abstract: We report in this paper a way of doing Word Sense Disambiguation (WSD) that has its origin in multilingual MT and that is cognizant of the fact that parallel corpora, wordnets and sense annotated corpora are scarce resources. With respect to these resources, languages show different levels of readiness; however, a more resource-fortunate language can help a less resource-fortunate language. Our WSD method can be applied to a language even when no sense-tagged corpus is available for that language. This is achieved by projecting wordnet and corpus parameters from another language to the language in question. The approach is centered around a novel synset-based multilingual dictionary and the empirical observation that within a domain the distribution of senses remains more or less invariant across languages. The effectiveness of our approach is verified by doing parameter projection and then running two different WSD algorithms. The accuracy values of approximately 75% (F1-score) for three languages in two different domains establish the fact that within a domain it is possible to circumvent the problem of scarcity of resources by projecting parameters like sense distributions, corpus co-occurrences, conceptual distance, etc., from one language to another.
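The parameter-projection idea can be sketched for one parameter, the sense distribution: counts estimated from a sense-tagged corpus in a resource-rich language are carried over to a resource-poor language through synset-aligned dictionary entries, and the projected distribution then acts as a most-frequent-sense predictor. The synset IDs, dictionary entries and counts below are invented purely for illustration.

```python
# Minimal sketch of sense-distribution projection: sense counts from a
# sense-tagged corpus in a "resource-rich" language are carried over to a
# "resource-poor" language through synset-aligned dictionary entries, and the
# projected distribution is then used as a most-frequent-sense predictor.
# All synset IDs, dictionary entries and counts are invented for illustration.

# (language, word) -> candidate synset IDs, i.e. the synset-aligned
# multilingual dictionary (toy entries).
dictionary = {
    ("hi", "सागर"): ["SYN_sea", "SYN_large_quantity"],
    ("mr", "सागर"): ["SYN_sea", "SYN_large_quantity"],
}

# Sense counts observed in a sense-tagged Hindi tourism-domain corpus (toy numbers).
hindi_sense_counts = {"SYN_sea": 40, "SYN_large_quantity": 3}

def project_sense_distribution(word, src_counts, tgt_lang):
    """P(sense | word) for the target language, borrowed from the source corpus."""
    synsets = dictionary[(tgt_lang, word)]
    total = sum(src_counts.get(s, 0) for s in synsets) or 1
    return {s: src_counts.get(s, 0) / total for s in synsets}

def most_frequent_sense(word, src_counts, tgt_lang):
    """Predict the sense with the highest projected probability."""
    dist = project_sense_distribution(word, src_counts, tgt_lang)
    return max(dist, key=dist.get)

if __name__ == "__main__":
    # Marathi "सागर" gets its sense distribution projected from Hindi counts.
    print(most_frequent_sense("सागर", hindi_sense_counts, "mr"))  # -> SYN_sea
```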

26 citations


Book ChapterDOI
03 Sep 2009
TL;DR: It is shown empirically that a feedback term is neither good nor bad in itself in general; the behavior of a term depends very much on other expansion terms, implying that a good expansion set can not be found by making term independence assumption in general.
Abstract: It is well known that pseudo-relevance feedback (PRF) improves the retrieval performance of Information Retrieval (IR) systems in general. However, a recent study by Cao et al. [3] has shown that a non-negligible fraction of expansion terms used by PRF algorithms are harmful to retrieval. In other words, a PRF algorithm would be better off if it were to use only a subset of the feedback terms. The challenge then is to find a good expansion set from the set of all candidate expansion terms. A natural approach to the problem is to make the term independence assumption and use one or more term selection criteria or a statistical classifier to identify good expansion terms independently of each other. In this work, we challenge this approach and show empirically that a feedback term is, in general, neither good nor bad in itself; the behavior of a term depends very much on other expansion terms. Our finding implies that a good expansion set cannot, in general, be found under the term independence assumption. As a principled solution to the problem, we propose spectral partitioning of expansion terms using a specific term-term interaction matrix. We demonstrate on several test collections that expansion terms can be partitioned into two sets and the better of the two sets gives substantial improvements in retrieval performance over model-based feedback.
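As a rough illustration of spectral partitioning, the sketch below splits candidate expansion terms into two sets using the sign of the Fiedler vector (the eigenvector of the graph Laplacian associated with its second-smallest eigenvalue) of a symmetric term-term matrix. The toy interaction scores are stand-ins for the specific matrix used in the paper, and the better of the two sets would still have to be chosen by a retrieval evaluation.

```python
# Minimal sketch of spectral partitioning of candidate expansion terms.
# A symmetric term-term interaction matrix (here: toy scores, not the paper's
# actual matrix) is split into two groups by the sign of the eigenvector of
# the graph Laplacian associated with its second-smallest eigenvalue.

import numpy as np

def spectral_partition(terms, W):
    """Split `terms` into two sets using the Fiedler vector of L = D - W."""
    W = np.asarray(W, dtype=float)
    D = np.diag(W.sum(axis=1))
    L = D - W
    eigvals, eigvecs = np.linalg.eigh(L)    # symmetric matrix -> real spectrum
    fiedler = eigvecs[:, 1]                 # second-smallest eigenvalue
    set_a = [t for t, v in zip(terms, fiedler) if v >= 0]
    set_b = [t for t, v in zip(terms, fiedler) if v < 0]
    return set_a, set_b

if __name__ == "__main__":
    terms = ["bank", "river", "water", "loan", "credit"]
    # Toy interaction scores: "river/water" and "loan/credit" hang together.
    W = [[0, 1, 1, 2, 2],
         [1, 0, 3, 0, 0],
         [1, 3, 0, 0, 0],
         [2, 0, 0, 0, 3],
         [2, 0, 0, 3, 0]]
    set_a, set_b = spectral_partition(terms, W)
    print(set_a, set_b)   # the better expansion set is chosen by evaluation
```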

12 citations


Proceedings ArticleDOI
07 Aug 2009
TL;DR: A framework for transliteration which uses a word-origin detection engine (pre-processing) and a re-ranking model based on lexicon-lookup (post-processing); the results obtained show that the pre-processing and post-processing modules improve the top-1 accuracy.
Abstract: We propose a framework for transliteration which uses (i) a word-origin detection engine (pre-processing) (ii) a CRF based transliteration engine and (iii) a re-ranking model based on lexicon-lookup (post-processing). The results obtained for English-Hindi and English-Kannada transliteration show that the preprocessing and post-processing modules improve the top-1 accuracy by 7.1%.
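The three-stage design can be sketched as a simple pipeline in which a word-origin detector picks the transliteration model, a character-level engine proposes a ranked candidate list, and candidates found in a target-language lexicon are promoted. All three components below are hypothetical stubs standing in for the paper's word-origin classifier, CRF engine and lexicon lookup.

```python
# Minimal sketch of the three-stage pipeline: (i) word-origin detection picks
# which transliteration model to use, (ii) a character-level engine produces a
# ranked candidate list, (iii) candidates found in a target-language lexicon
# are promoted.  All three components are stubs, not the paper's actual
# word-origin classifier, CRF engine or lexicon.

def detect_origin(word):
    """Stub word-origin detector (e.g. Indian vs. foreign origin)."""
    return "foreign" if word.lower().endswith("tion") else "indian"

def crf_candidates(word, origin):
    """Stub for the origin-specific engine: a real system would decode the
    character sequence with a trained CRF; here we return a canned list."""
    toy = {
        ("delhi", "indian"): [("दिल्ली", -1.2), ("डेल्ही", -2.0), ("देलही", -3.1)],
    }
    return toy.get((word.lower(), origin), [(word, 0.0)])

def rerank_with_lexicon(candidates, lexicon):
    """Promote candidates that occur in the target-language lexicon."""
    in_lex = [c for c in candidates if c[0] in lexicon]
    out_lex = [c for c in candidates if c[0] not in lexicon]
    return [word for word, _ in in_lex + out_lex]

def transliterate(word, lexicon):
    origin = detect_origin(word)                 # pre-processing
    candidates = crf_candidates(word, origin)    # transliteration engine
    return rerank_with_lexicon(candidates, lexicon)  # post-processing

if __name__ == "__main__":
    print(transliterate("Delhi", lexicon={"दिल्ली"}))  # -> ['दिल्ली', 'डेल्ही', 'देलही']
```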

10 citations


Book ChapterDOI
18 Feb 2009
TL;DR: It is shown that various formal as well as semantic features of the verbal roots noted by Pāṇini should be taken into account and stored, and that these will serve the purpose of disambiguation.
Abstract: This paper aims to present a way of storing Sanskrit verbal roots in a proposed Sanskrit WordNet. The synsets of verbal roots are proposed to be created using all the available dhātupāṭhas. While doing so, it is shown that various formal as well as semantic features of the verbal roots noted by Pāṇini should be taken into account and stored. This will serve the purpose of disambiguation. It is also shown that verbal roots that denote a different meaning when they occur with upasargas should be stored separately and linked to the synset of the changed meaning. This feature is peculiar to Sanskrit WordNet. Since IIT Bombay has already developed Hindi as well as Marathi WordNets, information on storing verbal roots in these two is also presented.
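One plausible way to organize the information the paper argues should be stored is sketched below: each verbal root keeps its dhātupāṭha gloss, the formal features noted by Pāṇini (gaṇa, pada, seṭ/aniṭ, and so on), and separate synset links for upasarga-modified meanings. The field names and example values are guesses for illustration, not the actual Sanskrit WordNet schema.

```python
# Minimal sketch of one way the proposed information could be stored.  Field
# names and example values are illustrative guesses, not the actual Sanskrit
# WordNet schema.

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class VerbalRoot:
    root: str                       # e.g. "gam"
    dhatupatha_gloss: str           # meaning given in the dhātupāṭha
    gana: str                       # conjugation class (gaṇa)
    pada: str                       # parasmaipada / ātmanepada / ubhayapada
    set_anit: str                   # seṭ / aniṭ / veṭ
    synset_ids: List[int] = field(default_factory=list)
    # upasarga -> synset ID of the changed meaning, stored and linked separately
    upasarga_senses: Dict[str, int] = field(default_factory=dict)

if __name__ == "__main__":
    gam = VerbalRoot(
        root="gam", dhatupatha_gloss="gatau (to go)",
        gana="bhvādi", pada="parasmaipada", set_anit="aniṭ",
        synset_ids=[101],
        upasarga_senses={"ava": 205},   # ava + gam -> "to understand"
    )
    print(gam.root, gam.upasarga_senses)
```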

6 citations


04 Jun 2009
TL;DR: This third international workshop on Cross Lingual Information Access aims to bring together various trends in multi-source, cross and multilingual information retrieval and access, and provide a venue for researchers and practitioners from academia, government, and industry to interact and share a broad spectrum of ideas, views and applications.
Abstract: The development of digital and online information repositories is creating many opportunities and also new challenges in information retrieval. The availability of online documents in many different languages makes it possible for users around the world to directly access previously unimagined sources of information. However, in conventional information retrieval systems the user must enter a search query in the language of the documents in order to retrieve them. This requires that users can express their queries in the languages in which the information is available and can understand the documents returned by the retrieval process. This restriction clearly limits the amount and type of information that an individual user really has access to.

Cross Lingual Information Access is concerned with technologies that let users express their query in their native language and, irrespective of the language in which the information is available, present the information in the user-preferred language or set of languages, in a manner that satisfies the user's information needs. The additional processing may take the form of machine translation of snippets, summarization and subsequent translation of summaries, and/or information extraction. In recent times, research in Cross Lingual Information Access has been vigorously pursued through several international fora, such as the Cross-Language Evaluation Forum (CLEF) and the NTCIR Asian Language Retrieval and Question-answering Workshop. A workshop geared towards cross-language information retrieval in Indian languages (FIRE) was organized in December 2008. In addition to CLIR, significant results have been obtained in multilingual summarization workshops and cross-language named entity extraction challenges by the ACL (Association for Computational Linguistics) and the Geographic Information Retrieval (GeoCLEF) track of CLEF.

The previous two editions of this workshop were held in January 2007, during IJCAI 2007 in Hyderabad, India (http://search.iiit.ac.in/CLIA2007/), and subsequently during IJCNLP 2008 in Hyderabad, India (http://search.iiit.ac.in/CLIA2008/). Both previous workshops attracted an encouraging number of submissions and a large number of registered participants. This third international workshop on Cross Lingual Information Access aims to bring together various trends in multi-source, cross-lingual and multilingual information retrieval and access, and to provide a venue for researchers and practitioners from academia, government, and industry to interact and share a broad spectrum of ideas, views and applications. The present workshop includes an invited keynote talk and presentations of technical papers selected after peer review, followed by a panel discussion.

4 citations



01 Jan 2009
TL;DR: Challenges involved in one of the toughest annotation tasks - sense marking of English, Hindi and Marathi texts are discussed.
Abstract: Annotation plays a key role in today's NLP scenario, and this paper discusses the challenges involved in one of the toughest annotation tasks - sense marking. In an effort to train the machine to understand written language, and thus to ensure speedy and high-quality translation, a huge amount of data needs to be sense-marked accurately by humans using an authentic and standard lexicon. In the work reported here, the corpus is taken from the tourism domain; the Princeton WordNet (version 2.1) is used as the sense inventory for the English text, while the Hindi and Marathi wordnets are used for the Hindi and Marathi texts respectively. A word may have a number of senses, and identifying which particular sense has been used in a given context makes word sense disambiguation a critical necessity. The corpus was independently tagged by different sense-markers, and it was found that the inter-annotator agreement on word sense disambiguation was about 80% across the three languages, i.e., English, Hindi and Marathi. Though the sense distinctions in the wordnets are quite fine-grained, there have been cases where the senses provided were inadequate and the human sense-markers faced problems. The study records such challenges and their handling.
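The reported ~80% figure is a raw observed-agreement number; as an illustration, the sketch below computes the fraction of tokens on which two annotators chose the same synset ID. The token and synset-ID format is invented; the real figure of course comes from the tagged tourism corpus, not from this toy data.

```python
# Minimal sketch of raw inter-annotator agreement on sense marking: two
# annotators assign a wordnet synset ID to each content word, and agreement is
# the fraction of tokens on which the IDs match.  The IDs below are invented.

def observed_agreement(tags_a, tags_b):
    """Fraction of tokens where both annotators chose the same synset ID."""
    assert len(tags_a) == len(tags_b), "annotations must cover the same tokens"
    matches = sum(1 for a, b in zip(tags_a, tags_b) if a == b)
    return matches / len(tags_a)

if __name__ == "__main__":
    annotator1 = [4123, 887, 887, 1502, 93]
    annotator2 = [4123, 887, 640, 1502, 93]
    print(f"{observed_agreement(annotator1, annotator2):.0%}")   # 80%
```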

3 citations


Proceedings Article
11 Jul 2009
TL;DR: A novel approach to context sensitive semantic smoothing by making use of an intermediate, "semantically light" representation for sentences, called Semantically Relatable Sequences (SRS), which shows significant improvements over the individual mixture models.
Abstract: We propose a novel approach to context sensitive semantic smoothing by making use of an intermediate, "semantically light" representation for sentences, called Semantically Relatable Sequences (SRS). SRSs of a sentence are tuples of words appearing in the semantic graph of the sentence as linked nodes depicting dependency relations. In contrast to patterns based on consecutive words, SRSs make use of groupings of nonconsecutive but semantically related words. Our experiments on TREC AP89 collection show that the mixture model of SRS translation model and Two Stage Language Model (TSLM) of Lafferty and Zhai achieves MAP scores better than the mixture model of MultiWord Expression (MWE) translation model and TSLM. Furthermore, a system, which for each test query selects either the SRS or the MWE mixture model based on better query MAP score, shows significant improvements over the individual mixture models.
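The mixture-model scoring can be sketched generically: for each query term, a "semantic translation" probability (here driven by a toy table of SRS-style head-relation-modifier pairings) is interpolated with a smoothed language-model probability. The interpolation weight, the SRS table and both component estimators are assumptions for illustration, not the paper's exact estimates.

```python
# Minimal sketch of the mixture idea: the document score for each query term
# interpolates a "semantic translation" probability (here estimated from a toy
# table of SRS-style pairings) with a smoothed language-model probability.
# Lambda and both component estimators are illustrative assumptions.

import math

def mixture_score(query_terms, doc, p_translation, p_lm, lam=0.5):
    """log P(query | doc) under a lambda-mixture of the two component models."""
    score = 0.0
    for q in query_terms:
        p = lam * p_translation(q, doc) + (1 - lam) * p_lm(q, doc)
        score += math.log(max(p, 1e-12))   # guard against zero probabilities
    return score

if __name__ == "__main__":
    doc = "the court sentenced the accused to five years".split()

    def p_lm(q, d):                        # add-one unigram estimate (toy)
        return (d.count(q) + 1) / (len(d) + 1000)

    # Toy "SRS translation" table: P(query term | semantically related word).
    srs_table = {("punishment", "sentenced"): 0.3, ("trial", "court"): 0.2}

    def p_translation(q, d):
        return sum(srs_table.get((q, w), 0.0) for w in d) / len(d)

    print(mixture_score(["punishment", "trial"], doc, p_translation, p_lm))
```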

2 citations


01 Jan 2009
TL;DR: A novel approach to re-ranking documents using language modeling and manual relevance feedback, with a ranking function modified to operate at the local-set level, which achieves better ranking performance than existing approaches that employ both LM and RF.
Abstract: We present a novel approach to re-ranking documents using language modeling (LM) and manual relevance feedback (RF). The set of documents returned by an initial search algorithm, called the Local Set, is re-ranked based on manual relevance feedback using a ranking function modified to operate at the local-set level. Instead of using the query-independent collection model, which is too general, we use the query-specific local set to model the background distribution. The resultant relevance model learns a more specific set of terms relevant to the query. We achieve better ranking performance than existing approaches that employ both LM and RF. We are guided by efficiency considerations and the needs of new search paradigms, like personalization, that require re-ranking of initial search results based on various criteria rather than launching a fresh search into the entire corpus.
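The key change, estimating the background model from the Local Set rather than the whole collection, can be sketched as below; Jelinek-Mercer smoothing and the toy documents are stand-ins for the paper's exact ranking function.

```python
# Minimal sketch: smooth document language models against a background model
# estimated from the Local Set (the initially retrieved documents) rather than
# the whole collection, then re-rank by query log-likelihood.  Jelinek-Mercer
# smoothing and the toy documents are stand-ins for the paper's ranking function.

import math
from collections import Counter

def background_model(local_set):
    """Unigram model over the concatenation of the local-set documents."""
    counts = Counter(w for doc in local_set for w in doc)
    total = sum(counts.values())
    return lambda w: counts[w] / total if total else 0.0

def rerank(query, local_set, lam=0.7):
    p_bg = background_model(local_set)
    def score(doc):
        dl = len(doc)
        s = 0.0
        for q in query:
            p_doc = doc.count(q) / dl if dl else 0.0
            p = lam * p_doc + (1 - lam) * p_bg(q)   # smoothed with local set
            s += math.log(max(p, 1e-12))
        return s
    return sorted(local_set, key=score, reverse=True)

if __name__ == "__main__":
    local_set = [
        "jaguar car review engine".split(),
        "jaguar speed animal habitat".split(),
        "car engine maintenance tips".split(),
    ]
    for doc in rerank(["jaguar", "engine"], local_set):
        print(doc)
```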

01 Jan 2009
TL;DR: Progress towards building an interlingua based machine translation system is described, by capturing the semantics of the source language sentences in the form of Universal Networking Language (UNL) graphs from which the target language sentences can be produced.
Abstract: In this paper we describe our progress towards building an interlingua based machine translation system, by capturing the semantics of the source language sentences in the form of Universal Networking Language (UNL) graphs from which the target language sentences can be produced. There are two stages to the UNL graph generation: first, the conceptual arguments of a situation are identified in the form of semantically relatable sequences (SRS) which are potential candidates for linking with semantic relations; next, the conceptual relations such as instrument, source, goal, reason or agent are recognized, irrespective of their different syntactic configurations. The system has been tested against gold standard UNL expressions collected from various sources like Oxford Advanced Learners’ Dictionary, XTAG corpus and Framenet corpus. Results indicate the promise and effectiveness of our approach on the difficult task of interlingua generation from text.
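The two stages can be illustrated with a toy second stage: given SRS triples from stage one, a labelling step assigns UNL relations such as agt, obj or ins to each pair. The single preposition-based lookup used here is a deliberately simplistic stand-in for the paper's actual relation-recognition machinery.

```python
# Minimal sketch of the two-stage pipeline: semantically relatable sequences
# (SRS) from stage one group the words that can be linked, and stage two
# assigns a UNL relation label (agt, obj, ins, plc, ...) to each pair.  The
# preposition lookup and role hints are toy stand-ins for the actual system.

# Hypothetical mapping from a linking preposition to a UNL relation label.
PREP_TO_RELATION = {"with": "ins", "from": "plf", "to": "plt", "in": "plc"}

def label_relations(srs_tuples):
    """srs_tuples: (head, link, dependent, role) from stage one, where `link`
    is a preposition (or None) and `role` is 'subj'/'obj'/None."""
    relations = []
    for head, link, dep, role in srs_tuples:
        if link in PREP_TO_RELATION:
            rel = PREP_TO_RELATION[link]
        elif role == "subj":
            rel = "agt"
        elif role == "obj":
            rel = "obj"
        else:
            rel = "mod"
        relations.append(f"{rel}({head}, {dep})")
    return relations

if __name__ == "__main__":
    # "The boy cut the apple with a knife"
    srs = [("cut", None, "boy", "subj"),
           ("cut", None, "apple", "obj"),
           ("cut", "with", "knife", None)]
    print(label_relations(srs))
    # ['agt(cut, boy)', 'obj(cut, apple)', 'ins(cut, knife)']
```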