Mixed-script query labelling using supervised learning and ad hoc retrieval using sub word indexing

doi:10.1145/2824864.2824873

Proceedings Article

- pp 86-90

TLDR

The authors use back transliteration and a set of hand-tailored consonant-mapping rules to reduce spelling variations, sub-word indexing to handle the breaking and joining of transliterated words, and a supervised learning approach to query labelling of mixed-script content, in which an SVM classifier trained on character n-gram features performs language identification.

Abstract:

Much of the user-generated content on the internet is written in transliterated form rather than in its native script. As a result, search engines receive a large number of transliterated search queries. This paper presents our approach to labelling queries and to ad hoc retrieval of documents based on these queries, as part of the FIRE2014 shared task on transliterated search. The content of each document is written in the native Devanagari script, in its transliterated form in Roman script, or in a combination of both. The queries used to retrieve these documents can also be in mixed script. The task is challenging primarily because of the spelling variations that occur in the transliterated form of search queries. We address this problem by using back transliteration and a set of hand-tailored rules for consonant mapping to reduce spelling variations. Sub-word indexing is used to handle the breaking and joining of transliterated words. Query labelling of the mixed-script content was implemented with a supervised learning approach in which an SVM classifier was trained using character n-grams as features for language identification. A Naive Bayes classifier was used to classify transliterated words that can belong to either Hindi or English when looked at individually. The two runs submitted by our team (BITS-Lipyantaran) perform best across all metrics for Subtask 2 among all participating teams, with an MRR score of 0.8171 and a MAP score of 0.6421.
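For readers unfamiliar with the two metrics reported above, the following is a minimal sketch of how MRR (mean reciprocal rank) and MAP (mean average precision) are conventionally computed. It assumes binary relevance judgements; the document IDs, the toy rankings, and the function names are hypothetical and are not taken from the paper or from the shared task's evaluation scripts, which may differ in detail.

```python
from typing import List, Set


def reciprocal_rank(ranked: List[str], relevant: Set[str]) -> float:
    """Return 1/rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Average the precision values at each rank where a relevant document appears."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    # Normalise by the number of relevant documents, retrieved or not.
    return precision_sum / len(relevant)


def mean(values: List[float]) -> float:
    return sum(values) / len(values) if values else 0.0


# Hypothetical toy data: (ranked result list, set of relevant document IDs) per query.
queries = [
    (["d3", "d1", "d7"], {"d1", "d7"}),
    (["d2", "d5", "d9"], {"d2"}),
]

mrr = mean([reciprocal_rank(ranked, rel) for ranked, rel in queries])
map_score = mean([average_precision(ranked, rel) for ranked, rel in queries])
print(f"MRR = {mrr:.4f}, MAP = {map_score:.4f}")
```

MRR rewards placing the first relevant document near the top of the ranking, while MAP also accounts for how early the remaining relevant documents appear, which is why the two scores for the same system can differ noticeably.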
