Proceedings ArticleDOI
Mixed-script query labelling using supervised learning and ad hoc retrieval using sub word indexing
Abhinav Mukherjee, Anirudh Ravi, Kaustav Datta
- pp 86-90
TLDR
The authors used back transliteration and a set of hand-tailored consonant-mapping rules to reduce spelling variations, sub-word indexing to handle the breaking and joining of transliterated words, and a supervised learning approach to query labelling of mixed-script content, in which an SVM classifier was trained on character n-gram features for language identification.
Abstract:
Much of the user-generated content on the internet is written in transliterated form rather than in its native script. As a result, search engines receive a large number of transliterated search queries. This paper presents our approach to labelling queries and to ad hoc retrieval of documents based on these queries, as part of the FIRE 2014 shared task on transliterated search. The content of each document is written in the native Devanagari script, in its transliterated form in Roman script, or in a combination of both. The queries used to retrieve these documents can also be in mixed script. The task is challenging primarily because of the spelling variations that occur in the transliterated form of search queries. We address this problem by using back transliteration to reduce spelling variations, together with a set of hand-tailored rules for consonant mapping. Sub-word indexing handles the breaking and joining of transliterated words. Query labelling of the mixed-script content was implemented with a supervised learning approach in which an SVM classifier was trained using character n-grams as features for language identification. A Naive Bayes classifier was used to classify transliterated words that, taken in isolation, could belong to either Hindi or English. The two runs submitted by our team (BITS-Lipyantaran) performed best across all metrics for Subtask 2 among all participating teams, with an MRR of 0.8171 and a MAP of 0.6421.
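The abstract does not spell out how the sub-word index is built, so the following is only a minimal sketch of the general idea, assuming character trigrams as the sub-word unit and a simple overlap count as the score. The function names (`char_ngrams`, `build_index`, `search`) and the toy documents are hypothetical and are not taken from the paper. The sketch shows why indexing sub-word units tolerates spelling variation: the variant spelling "kabhie" still shares most of its trigrams with the indexed word "kabhi".

```python
from collections import defaultdict


def char_ngrams(word, n=3):
    """Return the set of character n-grams of a word, with '#' as a boundary marker."""
    padded = f"#{word.lower()}#"
    if len(padded) <= n:
        return {padded}
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def build_index(docs, n=3):
    """Map each character n-gram to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            for gram in char_ngrams(word, n):
                index[gram].add(doc_id)
    return index


def search(index, query, n=3):
    """Rank documents by how many query n-grams they share (assumed scoring, not the paper's)."""
    scores = defaultdict(int)
    for word in query.split():
        for gram in char_ngrams(word, n):
            for doc_id in index.get(gram, ()):
                scores[doc_id] += 1
    # Highest overlap first; ties broken by document id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Toy, made-up documents in Roman-script transliteration.
docs = {
    "d1": "tum hi ho",
    "d2": "kabhi kabhi mere dil mein",
    "d3": "dil to pagal hai",
}
index = build_index(docs)

# The query uses a different spelling ("kabhie") from the indexed text ("kabhi").
ranking = search(index, "kabhie kabhie")
print(ranking)
```

An exact word-level index would find no match for "kabhie" here, whereas the trigram index retrieves `d2` through the shared trigrams `#ka`, `kab`, `abh`, and `bhi`.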
Citations
Journal ArticleDOI
Machine transliteration and transliterated text retrieval: a survey
TL;DR: A survey of the recent body of work in the field of transliteration followed by various deterministic and non-deterministic approaches used to tackle transliterated text-related issues in machine translation and information retrieval.
Journal ArticleDOI
MSIR@FIRE: A Comprehensive Report from 2013 to 2016
Somnath Banerjee, Monojit Choudhury, Kunal Chakma, Sudip Kumar Naskar, Amitava Das, Sivaji Bandyopadhyay, Paolo Rosso
TL;DR: This document is a comprehensive report on the 4 years of MSIR track evaluated at FIRE between 2013 and 2016, which aimed to systematically formalize several research problems that one must solve to tackle the code mixing in Web search for users of many languages around the world.
Proceedings ArticleDOI
Joint Approach to Deromanization of Code-mixed Texts
TL;DR: The results of the experiments establish the state of the art for the task of deromanization of code-mixed texts and propose a novel approach for handling these two problems together in a single system.
Journal ArticleDOI
Query Expansion for Transliterated Text Retrieval
TL;DR: Experiments on Hindi song lyrics retrieval in the mixed-script domain with three different retrieval models show that the proposed approaches outperform existing techniques in a majority of cases (sometimes statistically significantly) on metrics such as nDCG@1, nDCG@5, nDCG@10, MAP, MRR, and Recall.
References
Journal ArticleDOI
Support-Vector Networks
Corinna Cortes, Vladimir Vapnik
TL;DR: High generalization ability of support-vector networks utilizing polynomial input transformations is demonstrated and the performance of the support-vector network is compared to various classical learning algorithms that all took part in a benchmark study of Optical Character Recognition.
Proceedings Article
Labeling the Languages of Words in Mixed-Language Documents using Weakly Supervised Methods
Ben King, Steven Abney
TL;DR: In this paper, the problem of labeling the languages of words in mixed-language documents is considered in a weakly supervised fashion, as a sequence labeling problem with monolingual text samples for training data.
Proceedings ArticleDOI
Query expansion for mixed-script information retrieval
TL;DR: This paper formally introduces the concept of Mixed-Script IR, and through analysis of the query logs of Bing search engine, the prevalence of this problem is estimated, and gives a principled solution to handle the mixed-script term matching and spelling variation.
Proceedings Article
Query word labeling and Back Transliteration for Indian Languages: Shared task system description
TL;DR: This paper proposes a supervised approach that builds a classifier from monolingual samples together with a context-switching probability from Indian Language (IL) to English (Eng), and reports the best-performing results.