Proceedings Article

Mixed-script query labelling using supervised learning and ad hoc retrieval using sub word indexing

TLDR
The authors use back transliteration and a set of hand-tailored rules for consonant mapping to reduce spelling variations, sub-word indexing to handle the breaking and joining of transliterated words, and a supervised learning approach to query labelling of mixed-script content, in which an SVM classifier is trained on character n-grams as features for language identification.
Abstract
Much of the user-generated content on the internet is written in transliterated form rather than in its native script. As a result, search engines receive a large number of transliterated search queries. This paper presents our approach to labelling such queries and to ad hoc retrieval of documents based on them, as part of the FIRE2014 shared task on transliterated search. The content of each document is written in the native Devanagari script, in its transliterated form in Roman script, or in a combination of both. The queries used to retrieve these documents can also be in mixed script. The task is challenging primarily because of the spelling variations that occur in the transliterated form of search queries. We address this problem by using back transliteration to reduce spelling variations, together with a set of hand-tailored rules for consonant mapping. Sub-word indexing is used to handle the breaking and joining of transliterated words. Query labelling of the mixed-script content is implemented with a supervised learning approach in which an SVM classifier is trained on character n-grams as features for language identification. A Naive Bayes classifier is used to classify transliterated words that, taken individually, could belong to either Hindi or English. The 2 runs submitted by our team (BITS-Lipyantaran) perform best across all metrics for Subtask 2 among all participating teams, with an MRR score of 0.8171 and a MAP score of 0.6421.
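To make the idea of character n-gram features for word-level language identification concrete, here is a minimal sketch. It is not the authors' system: the abstract describes an SVM, whereas this example swaps in a small multinomial Naive Bayes with add-one smoothing so that it runs with the standard library alone. The n-gram range (1 to 3), the boundary markers `^` and `$`, the labels `hi`/`en`, and the toy training words are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(word, n_min=1, n_max=3):
    """Return all character n-grams of a word, padded with boundary markers."""
    padded = f"^{word.lower()}$"
    return [
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    ]


class NgramNaiveBayes:
    """Multinomial Naive Bayes over character n-grams (stand-in for the SVM)."""

    def fit(self, words, labels):
        self.label_counts = Counter(labels)
        self.ngram_counts = defaultdict(Counter)
        self.vocab = set()
        for word, label in zip(words, labels):
            grams = char_ngrams(word)
            self.ngram_counts[label].update(grams)
            self.vocab.update(grams)
        return self

    def predict(self, word):
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.label_counts.items():
            # Log prior plus add-one smoothed log likelihood of each n-gram.
            score = math.log(count / total)
            denom = sum(self.ngram_counts[label].values()) + len(self.vocab)
            for gram in char_ngrams(word):
                score += math.log((self.ngram_counts[label][gram] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data (hypothetical): Romanized Hindi words vs. English words.
train_words = ["zindagi", "pyaar", "dhadkan", "the", "with", "school"]
train_labels = ["hi", "hi", "hi", "en", "en", "en"]
model = NgramNaiveBayes().fit(train_words, train_labels)

# Label each token of a mixed-script style query independently.
for token in ["zindagi", "school"]:
    print(token, model.predict(token))
```

A real system would train on far more data and, as the abstract notes, would need extra handling for words that are plausible in both languages when seen in isolation.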

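The two metrics reported in the abstract, MRR and MAP, can be sketched as follows. This is a generic illustration that assumes binary relevance judgments; the shared task's official evaluation may differ (for example by using graded relevance or a rank cutoff), and the document and query identifiers below are made up.

```python
def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def average_precision(ranked, relevant):
    """Mean of precision@k over the ranks k at which relevant documents occur."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_over_queries(metric, runs):
    """Average a per-query metric over all queries in a run."""
    return sum(metric(ranked, rel) for ranked, rel in runs.values()) / len(runs)


# Hypothetical run: query id -> (ranked result list, set of relevant documents).
runs = {
    "q1": (["d3", "d1", "d2"], {"d1"}),
    "q2": (["d5", "d4"], {"d4", "d5"}),
}

mrr = mean_over_queries(reciprocal_rank, runs)
map_score = mean_over_queries(average_precision, runs)
print(f"MRR = {mrr:.4f}, MAP = {map_score:.4f}")
```

MRR rewards placing the first relevant document near the top, while MAP also accounts for how early the remaining relevant documents appear.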

Citations
Journal Article

Machine transliteration and transliterated text retrieval: a survey

TL;DR: A survey of the recent body of work in the field of transliteration followed by various deterministic and non-deterministic approaches used to tackle transliterated text-related issues in machine translation and information retrieval.
Journal Article

MSIR@FIRE: A Comprehensive Report from 2013 to 2016

TL;DR: This document is a comprehensive report on the 4 years of MSIR track evaluated at FIRE between 2013 and 2016, which aimed to systematically formalize several research problems that one must solve to tackle the code mixing in Web search for users of many languages around the world.
Proceedings Article

Joint Approach to Deromanization of Code-mixed Texts

TL;DR: The results of the experiments establish the state of the art for the task of deromanization of code-mixed texts and propose a novel approach for handling these two problems together in a single system.
Journal Article

Query Expansion for Transliterated Text Retrieval

TL;DR: Experiments performed over Hindi song lyrics retrieval in mixed script domain with three different retrieval models show that the proposed approaches outperform the existing techniques in a majority of the cases (sometimes statistically significantly) for a number of metrics like nDCG@1, nDCG@5, nDCG@10, MAP, MRR, and Recall.