Showing papers on "Malayalam published in 2020"

PDF

Open Access

Proceedings Article•DOI•

Overview of the HASOC Track at FIRE 2020: Hate Speech and Offensive Language Identification in Tamil, Malayalam, Hindi, English and German

[...]

Thomas Mandl¹, Sandip Modha, Anand Kumar M², Bharathi Raja Chakravarthi³•Institutions (3)

University of Hildesheim¹, National Institute of Technology, Karnataka², National University of Ireland³

16 Dec 2020

TL;DR: The HASOC track as mentioned in this paper is dedicated to evaluate technology for finding offensive language and hate speech, which has attracted much interest and over 40 research groups have participated as well as described their approaches in papers.

...read moreread less

Abstract: This paper presents the HASOC track and its two parts. HASOC is dedicated to evaluate technology for finding Offensive Language and Hate Speech. HASOC is creating test collections for languages with few resources and English for comparison. The first track within HASOC has continued work from 2019 and provided a testbed of Twitter posts for Hindi, German and English. The second track within HASOC has created test resources for Tamil and Malayalam in native and Latin script. Posts were extracted mainly from Youtube and Twitter. Both tracks have attracted much interest and over 40 research groups have participated as well as described their approaches in papers. In this overview, we present the tasks, the data and the main results.

...read moreread less

127 citations

Proceedings Article•DOI•

Overview of the track on Sentiment Analysis for Dravidian Languages in Code-Mixed Text

[...]

Bharathi Raja Chakravarthi¹, Ruba Priyadharshini², Vigneshwaran Muralidaran³, Shardul Suryawanshi¹, Navya Jose⁴, Elizabeth Sherly⁴, John P. McCrae⁵ - Show less +3 more•Institutions (5)

National University of Ireland¹, ULTra², Cardiff University³, Indian Institute of Information Technology and Management, Kerala⁴, National University of Ireland, Galway⁵

16 Dec 2020

TL;DR: The Dravidian-CodeMix-FIRE 2020 Track as discussed by the authors focused on sentiment analysis of code-mixed text in code mixed text for Tamil and Malayalam, and participants were given a dataset of YouTube comments and the goal of the shared task submissions was to recognise the sentiment of each comment by classifying them into positive, negative, neutral, mixed-feeling classes or by recognizing whether the comment is not in the intended language.

...read moreread less

Abstract: Sentiment analysis of Dravidian languages has received attention in recent years However, most social media text is code-mixed and there is no research available on sentiment analysis of code-mixed Dravidian languages The Dravidian-CodeMix-FIRE 2020, a track on Sentiment Analysis for Dravidian Languages in Code-Mixed Text, focused on creating a platform for researchers to come together and investigate the problem There were two languages for this track: (i) Tamil, and (ii) Malayalam The participants were given a dataset of YouTube comments and the goal of the shared task submissions was to recognise the sentiment of each comment by classifying them into positive, negative, neutral, mixed-feeling classes or by recognising whether the comment is not in the intended language The performance of the systems was evaluated by weighted-F1 score

...read moreread less

87 citations

Proceedings Article•

Open-source Multi-speaker Speech Corpora for Building Gujarati, Kannada, Malayalam, Marathi, Tamil and Telugu Speech Synthesis Systems

[...]

Fei He, Shan Hui Cathy Chu, Oddur Kjartansson¹, Clara E. Rivera¹, Anna Katanova, Alexander Gutkin¹, Isin Demirsahin¹, Cibu Johny¹, Martin Jansche², Supheakmungkol Sarin, Knot Pipatsrisawat¹ - Show less +7 more•Institutions (2)

Google¹, Amazon.com²

01 May 2020

TL;DR: Free high quality multi-speaker speech corpora for Gujarati, Kannada, Malayalam, Marathi, Tamil and Telugu, which are six of the twenty two official languages of India spoken by 374 million native speakers are presented.

...read moreread less

Abstract: We present free high quality multi-speaker speech corpora for Gujarati, Kannada, Malayalam, Marathi, Tamil and Telugu, which are six of the twenty two official languages of India spoken by 374 million native speakers. The datasets are primarily intended for use in text-to-speech (TTS) applications, such as constructing multilingual voices or being used for speaker or language adaptation. Most of the corpora (apart from Marathi, which is a female-only database) consist of at least 2,000 recorded lines from female and male native speakers of the language. We present the methodological details behind corpora acquisition, which can be scaled to acquiring data for other languages of interest. We describe the experiments in building a multilingual text-to-speech model that is constructed by combining our corpora. Our results indicate that using these corpora results in good quality voices, with Mean Opinion Scores (MOS) > 3.6, for all the languages tested. We believe that these resources, released with an open-source license, and the described methodology will help in the progress of speech applications for the languages described and aid corpora development for other, smaller, languages of India and beyond.

...read moreread less

34 citations

Proceedings Article•DOI•

Efficient Neural Machine Translation for Low-Resource Languages via Exploiting Related Languages

[...]

Vikrant Goyal, Sourav Kumar, Dipti Misra Sharma¹•Institutions (1)

International Institute of Information Technology, Hyderabad¹

01 Jul 2020

TL;DR: This work proposes a technique called Unified Transliteration and Subword Segmentation to leverage language similarity while exploiting parallel data from related language pairs and proposes a Multilingual Transfer Learning technique to leverage parallelData from multiple related languages to assist translation for low resource language pair of interest.

...read moreread less

Abstract: A large percentage of the world’s population speaks a language of the Indian subcontinent, comprising languages from both Indo-Aryan (eg Hindi, Punjabi, Gujarati, etc) and Dravidian (eg Tamil, Telugu, Malayalam, etc) families A universal characteristic of Indian languages is their complex morphology, which, when combined with the general lack of sufficient quantities of high-quality parallel data, can make developing machine translation (MT) systems for these languages difficult Neural Machine Translation (NMT) is a rapidly advancing MT paradigm and has shown promising results for many language pairs, especially in large training data scenarios Since the condition of large parallel corpora is not met for Indian-English language pairs, we present our efforts towards building efficient NMT systems between Indian languages (specifically Indo-Aryan languages) and English via efficiently exploiting parallel data from the related languages We propose a technique called Unified Transliteration and Subword Segmentation to leverage language similarity while exploiting parallel data from related language pairs We also propose a Multilingual Transfer Learning technique to leverage parallel data from multiple related languages to assist translation for low resource language pair of interest Our experiments demonstrate an overall average improvement of 5 BLEU points over the standard Transformer-based NMT baselines

...read moreread less

28 citations

Proceedings Article•

"A Passage to India": Pre-trained Word Embeddings for Indian Languages.

[...]

Saurav Kumar, Saunack Kumar, Diptesh Kanojia¹, Pushpak Bhattacharyya¹•Institutions (1)

Indian Institute of Technology Bombay¹

01 May 2020

TL;DR: This paper uses various existing approaches to create multiple word embeddings for 14 Indian languages using both contextual and non-contextual approaches, and releases a total of 436 models using 8 different approaches.

...read moreread less

Abstract: Dense word vectors or ‘word embeddings’ which encode semantic properties of words, have now become integral to NLP tasks like Machine Translation (MT), Question Answering (QA), Word Sense Disambiguation (WSD), and Information Retrieval (IR). In this paper, we use various existing approaches to create multiple word embeddings for 14 Indian languages. We place these embeddings for all these languages, viz., Assamese, Bengali, Gujarati, Hindi, Kannada, Konkani, Malayalam, Marathi, Nepali, Odiya, Punjabi, Sanskrit, Tamil, and Telugu in a single repository. Relatively newer approaches that emphasize catering to context (BERT, ELMo, etc.) have shown significant improvements, but require a large amount of resources to generate usable models. We release pre-trained embeddings generated using both contextual and non-contextual approaches. We also use MUSE and XLM to train cross-lingual embeddings for all pairs of the aforementioned languages. To show the efficacy of our embeddings, we evaluate our embedding models on XPOS, UPOS and NER tasks for all these languages. We release a total of 436 models using 8 different approaches. We hope they are useful for the resource-constrained Indian language NLP. The title of this paper refers to the famous novel “A Passage to India” by E.M. Forster, published initially in 1924.

...read moreread less

21 citations

Posted Content•

A Multilingual Parallel Corpora Collection Effort for Indian Languages

[...]

Shashank Siripragada¹, Jerin Philip¹, Vinay P. Namboodiri², C. V. Jawahar¹•Institutions (2)

International Institute of Information Technology, Hyderabad¹, Indian Institute of Technology Kanpur²

15 Jul 2020-arXiv: Computation and Language

TL;DR: The methods of constructing sentence aligned parallel corpora using tools enabled by recent advances in machine translation and cross-lingual retrieval using deep neural network based methods are reported on.

...read moreread less

Abstract: We present sentence aligned parallel corpora across 10 Indian Languages - Hindi, Telugu, Tamil, Malayalam, Gujarati, Urdu, Bengali, Oriya, Marathi, Punjabi, and English - many of which are categorized as low resource. The corpora are compiled from online sources which have content shared across languages. The corpora presented significantly extends present resources that are either not large enough or are restricted to a specific domain (such as health). We also provide a separate test corpus compiled from an independent online source that can be independently used for validating the performance in 10 Indian languages. Alongside, we report on the methods of constructing such corpora using tools enabled by recent advances in machine translation and cross-lingual retrieval using deep neural network based methods.

...read moreread less

17 citations

Book Chapter•DOI•

Dynamic Mode-Based Feature with Random Mapping for Sentiment Analysis

[...]

S. Sachin Kumar¹, M. Anand Kumar², K. P. Soman¹, Prabaharan Poornachandran¹•Institutions (2)

Amrita Vishwa Vidyapeetham¹, National Institute of Technology, Karnataka²

01 Jan 2020

TL;DR: The present article discusses the use of dynamic modes from dynamic mode decomposition (DMD) method with random mapping for sentiment classification, and observes that the proposed approach provides competing result.

...read moreread less

Abstract: Sentiment analysis (SA) or polarity identification is a research topic which receives considerable number of attention. The work in this research attempts to explore the sentiments or opinions in text data related to any event, politics, movies, product reviews, sports, etc. The present article discusses the use of dynamic modes from dynamic mode decomposition (DMD) method with random mapping for sentiment classification. Random mapping is performed using random kitchen sink (RKS) method. The present work aims to explore the use of dynamic modes as the feature for sentiment classification task. In order to conduct the experiment and analysis, the dataset used consists of tweets from SAIL 2015 shared task (tweets in Tamil, Bengali, Hindi) and Malayalam languages. The dataset for Malayalam is prepared by us for the work. The evaluations are performed using accuracy, F1-score, recall, and precision. It is observed from the evaluations that the proposed approach provides competing result.

...read moreread less

17 citations

Proceedings Article•

A multilingual parallel corpora collection effort for Indian languages

[...]

Shashank Siripragada¹, Jerin Philip¹, Vinay P. Namboodiri², C. V. Jawahar¹•Institutions (2)

International Institute of Information Technology, Hyderabad¹, Indian Institute of Technology Kanpur²

31 May 2020

TL;DR: This article presented sentence aligned parallel corpora across 10 Indian Languages - Hindi, Telugu, Tamil, Malayalam, Gujarati, Urdu, Bengali, Oriya, Marathi, Punjabi, and English - many of which are categorized as low resource.

...read moreread less

15 citations

Journal Article•DOI•

Parts-of-Speech tagging for Malayalam using deep learning techniques

[...]

K. K. Akhil¹, R. Rajimol¹, V. S. Anoop¹•Institutions (1)

Indian Institute of Information Technology and Management, Kerala¹

01 Sep 2020-International Journal of Information Technology

TL;DR: Experiments conducted on real datasets show that the proposed deep learning-based approach for parts-of-speech tagging for the Malayalam language outperforms some of the already available methods in terms of precision and accuracy.

...read moreread less

Abstract: Parts-of-speech tagging is a process in linguistics which deals with tagging each word in a sentence with their corresponding parts-of-speech This process is considered to be one of the pre-processing steps for many natural language processing tasks Earlier approaches were based on simple heuristics and later several methods were reported in the literature that incorporated machine learning techniques such as artificial neural networks Very recently, with the advancement of deep learning-based approaches, parts-of-speech tagging process became more accurate and a reasonable number of taggers are now available for high resource languages such as English But the low resource languages such as Malayalam is still lacking computationally efficient and accurate methods and techniques for parts-of-speech tagging In this direction, this work proposes a deep learning-based approach for parts-of-speech tagging for the Malayalam language Experiments conducted on real datasets show that the proposed method outperforms some of the already available methods in terms of precision and accuracy

...read moreread less

12 citations

WLV-RIT at HASOC-Dravidian-CodeMix-FIRE2020: Offensive Language Identification in Code-switched YouTube Comments.

[...]

Tharindu Ranasinghe, Sarthak Gupte, Marcos Zampieri, Ifeoma Nwogu

01 Jan 2020

TL;DR: In this paper, the WLV-RIT entry to the Hate Speech and Offensive Content Identification in Indo-European Languages (HASOC) shared task 2020 was described. And they achieved 0.89 weighted average F1 score for the test set and ranked 5th place out of 12 participants.

...read moreread less

Abstract: This paper describes the WLV-RIT entry to the Hate Speech and Offensive Content Identification in Indo-European Languages (HASOC) shared task 2020. The HASOC 2020 organizers provided participants with annotated datasets containing social media posts of code-mixed in Dravidian languages (Malayalam-English and Tamil-English). We participated in task 1: Offensive comment identification in Code-mixed Malayalam Youtube comments. In our methodology, we take advantage of available English data by applying cross-lingual contextual word embeddings and transfer learning to make predictions to Malayalam data. We further improve the results using various fine tuning strategies. Our system achieved 0.89 weighted average F1 score for the test set and it ranked 5th place out of 12 participants.

...read moreread less

10 citations

Journal Article•DOI•

Polyglossic Malabar: Arabi-Malayalam and the Muhiyuddinmala in the age of transition (1600s–1750s)

[...]

P. K. Yasser Arafath

01 Jul 2020-Journal of the Royal Asiatic Society

TL;DR: The authors examines the relations between trade, faith, and textual traditions in early modern Indian Ocean region and the birth of Arabi-Malayalam, a new system of writing which has facilitated the growth of a vernacular Islamic textual tradition in Malabar since the seventeenth century.

...read moreread less

Abstract: This article examines the relations between trade, faith, and textual traditions in early modern Indian Ocean region and the birth of Arabi-Malayalam, a new system of writing which has facilitated the growth of a vernacular Islamic textual tradition in Malabar since the seventeenth century. As a transliterated scriptorial-literary tradition, Arabi-Malayalam emerged out of the polyglossic lingual sphere of the Malabar Coast, and remains as one of the important legacies of social and religious interactions in precolonial south Asia. The first part of this article examines the social, epistemic and normative reasons that led to the scriptorial birth of Arabi-Malayalam, moving beyond a handful of Malayalam writings that locate its origin in the social and economic necessities of Arab traders in the early centuries of Islam. The second part looks at the complex relationship between Muslim scribes and their vernacular audience in the aftermath of Portuguese violence and destruction of Calicut—one of the largest Indian Ocean ports before the sixteenth century. This part focuses on Qadi Muhammed bin Abdul Aziz and his Muhiyuddinmala, the first identifiable text in Arabi-Malayalam, examining how the Muhiyuddinmala represents a transition from classical Arabic theological episteme to the vernacular-popular poetic discourse which changed the pietistic behaviour of the Mappila Muslims of Malabar.

...read moreread less

KBCNMUJAL@HASOC-Dravidian-CodeMix-FIRE2020: Using Machine Learning for Detection of Hate Speech and Offensive Code-Mixed Social Media text.

[...]

Varsha M. Pathak, Manish Joshi, Prasad Joshi, Monica Mundada, Tanmay Joshi - Show less +1 more

01 Jan 2020

TL;DR: This paper describes the system submitted by the team, KBCNMUJAL, for Task 2 of the shared task “Hate Speech and Offensive Content Identification in Indo-European Languages (HASOC)” at Forum for Information Retrieval Evaluation, December 16-20, 2020, Hyderabad, India.

...read moreread less

Abstract: This paper describes the system submitted by our team, KBCNMUJAL, for Task 2 of the shared task Hate Speech and Offensive Content Identification in Indo-European Languages (HASOC), at Forum for Information Retrieval Evaluation, December 16-20, 2020, Hyderabad, India. The datasets of two Dravidian languages Viz. Malayalam and Tamil of size 4000 observations, each were shared by the HASOC organizers. These datasets are used to train the machine using different machine learning algorithms, based on classification and regression models. The datasets consist of tweets or YouTube comments with two class labels offensive and not offensive. The machine is trained to classify such social media messages in these two categories. Appropriate n-gram feature sets are extracted to learn the specific characteristics of the Hate Speech text messages. These feature models are based on TFIDF weights of n-gram. The referred work and respective experiments show that the features such as word, character and combined model of word and character n-grams could be used to identify the term patterns of offensive text contents. As a part of the HASOC shared task, the test data sets are made available by the HASOC track organizers. The best performing classification models developed for both languages are applied on test datasets. The model which gives the highest accuracy result on training dataset for Malayalam language was experimented to predict the categories of respective test data. This system has obtained an F1 score of 0.77. Similarly the best performing model for Tamil language has obtained an F1 score of 0.87. This work has received 2nd and 3rd rank in this shared Task 2 for Malayalam and Tamil language respectively. The proposed system is named HASOC_kbcnmujal.

...read moreread less

Journal Article•DOI•

Image-based features for speech signal classification

[...]

Himadri Mukherjee¹, Ankita Dhar¹, Sk Md Obaidullah², Santanu Phadikar³, Kaushik Roy¹ - Show less +1 more•Institutions (3)

West Bengal State University¹, Aliah University², Islamic Azad University³

01 Dec 2020-Multimedia Tools and Applications

TL;DR: This paper proposes image-based features for speech signal classification because it is possible to identify different patterns by visualizing their speech patterns and the highest accuracy of 94.51% was obtained.

...read moreread less

Abstract: Like other applications, under the purview of pattern classification, analyzing speech signals is crucial. People often mix different languages while talking which makes this task complicated. This happens mostly in India, since different languages are used from one state to another. Among many, Southern part of India suffers a lot from this situation, where distinguishing their languages is important. In this paper, we propose image-based features for speech signal classification because it is possible to identify different patterns by visualizing their speech patterns. Modified Mel frequency cepstral coefficient (MFCC) features namely MFCC- Statistics Grade (MFCC-SG) were extracted which were visualized by plotting techniques and thereafter fed to a convolutional neural network. In this study, we used the top 4 languages namely Telugu, Tamil, Malayalam, and Kannada. Experiments were performed on more than 900 hours of data collected from YouTube leading to over 150000 images and the highest accuracy of 94.51% was obtained.

...read moreread less

Proceedings Article•DOI•

Malayalam-English Code-Switched: Grapheme to Phoneme System.

[...]

Sreeja Manghat, Sreeram Manghat, Tanja Schultz¹•Institutions (1)

University of Bremen¹

25 Oct 2020

TL;DR: A G2P system which can handle code-switching developed based onMalayalam-English code-switch speech and text corpora is presented and the overlapping phonemes for English – Malayalam are identified and analysed.

...read moreread less

Abstract: Grapheme to phoneme conversion is an integral aspect of speech processing. Conversational speech in Malayalam – a low resource Indic language has inter-sentential, intrasentential code-switching as well as frequent intra-word codeswitching with English. Monolingual G2P systems cannot process such special intra-word code-switching scenarios. A G2P system which can handle code-switching developed based on Malayalam-English code-switch speech and text corpora is presented. Since neither Malayalam nor English are phonetic subset of each other, the overlapping phonemes for English – Malayalam are identified and analysed. Additional rules used to handle special cases of Malayalam phonemes and intra-word code-switching in the G2P system is also presented specifically.

...read moreread less

Proceedings Article•DOI•

An Effective Neural Machine Translation for English to Hindi Language

[...]

Saikiran Gogineni, G. Suryanarayana, Sravan Kumar Surendran

01 Sep 2020

TL;DR: In this work, 6 different Indian languages such as Hindi, Bengali, Gujarati, Malayalam, Tamil and Telugu is worked on, and how BLEU varies with the usage of Word embedding technique have been clearly shown.

...read moreread less

Abstract: Machine translation involves the conversion of text from one language to the other language. In the world of web, a huge number of resources are made available in English. Many of the people are not familiar with this global language. Manually transmuting them into native languages such as Hindi (Indian National language) is a tedious task. In such scenarios, automatic machine translation is an efficient approach. In our work, 8 advanced architectures have been experimented and contrasted their efficiencies. Six different Indian languages such as Hindi, Bengali, Gujarati, Malayalam, Tamil and Telugu is worked on. How BLEU varies with the usage of Word embedding technique have been clearly shown.. Encoder to decoder networks are found fine for short sentences. But if the length of the sentence exceeds 20, then attention architecture is suitable. The 4 Layer Bi-directional LSTM is a great choice in these networks to achieve higher BLEU is also observed. In our work, CFILT, UFAL, ILCC datasets have been considered and achieved a BLEU score of 21.97.

...read moreread less

Book Chapter•DOI•

Quantitative Analysis of the Morphological Complexity of Malayalam Language

[...]

Kavya Manohar¹, A. R. Jayan², Rajeev Rajan¹•Institutions (2)

College of Engineering, Trivandrum¹, Government Engineering College, Sreekrishnapuram²

08 Sep 2020

TL;DR: In this work, morphological complexity of Malayalam is quantitatively analyzed on a text corpus containing 8 million words based on the parameters type-token growth rate (TTGR), type-Token ratio (TTR) and moving average type- token ratio (MATTR).

...read moreread less

Abstract: This paper presents a quantitative analysis on the morphological complexity of Malayalam language. Malayalam is a Dravidian language spoken in India, predominantly in the state of Kerala with about 38 million native speakers. Malayalam words undergo inflections, derivations and compounding leading to an infinitely extending lexicon. In this work, morphological complexity of Malayalam is quantitatively analyzed on a text corpus containing 8 million words. The analysis is based on the parameters type-token growth rate (TTGR), type-token ratio (TTR) and moving average type-token ratio (MATTR). The values of the parameters obtained in the current study is compared to that of the values of other morphologically complex languages.

...read moreread less

Book Chapter•DOI•

Comparing English and Malayalam Spelling Errors of Children using a Bilingual Screening Tool

[...]

Mithun Haridas, Nirmala Vasudevan, Georg Gutjahr, Raghu Raman¹, Prema Nedungadi - Show less +1 more•Institutions (1)

Amrita Vishwa Vidyapeetham¹

01 Jan 2020

TL;DR: In this article, the authors compared reading difficulties that arise when studying English and Malayalam and found that reading difficulties manifest differently depending on the characteristics of the language being studied, and the errors were classified into multiple categories.

...read moreread less

Abstract: Despite the high prevalence of reading disabilities among Indian children, many school teachers are not adept at identifying and assessing these difficulties. Screening tools for reading disabilities are available in English but are unavailable in many Indian languages. Reading disabilities manifest differently depending on the characteristics of the language being studied. This paper compares reading difficulties that arise when studying English and Malayalam. In a previous study, we designed a bilingual screening test in English and Malayalam and tested it with 135 school children in Kerala. In the current study, the screening test was modified in light of the findings from our previous study. We administered our updated bilingual screening test to 25 second grade children, ages 7–8, who were studying at two other schools in Kerala. Student errors were classified into multiple categories. Similarities and differences between errors in English and Malayalam were identified, and the errors that were specific to Malayalam were analyzed in further detail.

...read moreread less

Proceedings Article•

IndicSpeech: Text-to-Speech Corpus for Indian Languages

[...]

Nimisha Srivastava¹, Rudrabha Mukhopadhyay², K R Prajwal², C. V. Jawahar²•Institutions (2)

International Institute of Information Technology¹, International Institute of Information Technology, Hyderabad²

01 May 2020

TL;DR: A 24 hour text-to-speech corpus for 3 major Indian languages namely Hindi, Malayalam and Bengali is released and a state-of-the-art TTS system for each of these languages is trained.

...read moreread less

Abstract: India is a country where several tens of languages are spoken by over a billion strong population. Text-to-speech systems for such languages will thus be extremely beneficial for wide-spread content creation and accessibility. Despite this, the current TTS systems for even the most popular Indian languages fall short of the contemporary state-of-the-art systems for English, Chinese, etc. We believe that one of the major reasons for this is the lack of large, publicly available text-to-speech corpora in these languages that are suitable for training neural text-to-speech systems. To mitigate this, we release a 24 hour text-to-speech corpus for 3 major Indian languages namely Hindi, Malayalam and Bengali. In this work, we also train a state-of-the-art TTS system for each of these languages and report their performances. The collected corpus, code, and trained models are made publicly available.

...read moreread less

Book Chapter•DOI•

Setting up a neural machine translation system for English to Indian languages

[...]

Sandeep Saini¹, Vineet Sahula²•Institutions (2)

Myanmar Institute of Information Technology¹, Malaviya National Institute of Technology, Jaipur²

01 Jan 2020

TL;DR: It is observed in this work that NMT requires very less amount of data size for training and thus exhibits satisfactory translation for few thousands of training sentences as well.

...read moreread less

Abstract: Natural language translation is one of the most difficult tasks being handled by computer scientist community. This is certainly one task in which machine is definitely lagging behind the cognitive powers of human beings. From the initial ages of computer science a lot of approaches have been proposed for solving this task. Statistical machine translation (SMT) is one of the conventional and matured ways of solving the problem of MT. This approach is based on the Bayes rule and requires huge datasets to train the system. SMT performs well on similar grammar structured language pairs. In recent years, neural MT (NMT) has emerged as an alternate way of addressing the same issue. In this approach, we train a neural network on the source and target language pairs and train the system to develop the translation rules. In this work, we explore different configurations for setting up an NMT system for six different Indian languages. We have focused more on Hindi, which is the most widely spoken language, and more datasets are available for this language. We have also deployed the same system on Bangla, Tamil, Telugu, Urdu, and Malayalam as well. We have experimented with eight different architecture combinations of NMT for English to Indian languages and compared our results with conventional MT techniques. We have also observed in this work that NMT requires very less amount of data size for training and thus exhibits satisfactory translation for few thousands of training sentences as well.

...read moreread less

Posted Content•

WLV-RIT at HASOC-Dravidian-CodeMix-FIRE2020: Offensive Language Identification in Code-switched YouTube Comments

[...]

Tharindu Ranasinghe, Sarthak Gupte, Marcos Zampieri, Ifeoma Nwogu

01 Nov 2020-arXiv: Computation and Language

TL;DR: This paper describes the WLV-RIT entry to the Hate Speech and Offensive Content Identification in IndoEuropean Languages (HASOC) shared task 2020, and takes advantage of available English data by applying cross-lingual contextual word embeddings and transfer learning to make predictions to Malayalam data.

...read moreread less

Proceedings Article•

Bilingual Lexicon Induction across Orthographically-distinct Under-Resourced Dravidian Languages

[...]

Bharathi Raja Chakravarthi¹, Navaneethan Rajasekaran, Mihael Arcan², Kevin McGuinness³, Noel E. O'Connor³, John P. McCrae² - Show less +2 more•Institutions (3)

National University of Ireland¹, National University of Ireland, Galway², Dublin City University³

13 Dec 2020

TL;DR: In this article, the authors focus on the Dravidian languages, namely Tamil, Telugu, Kannada, and Malayalam, and bring the related languages into a single script.

...read moreread less

Abstract: Bilingual lexicons are a vital tool for under-resourced languages and recent state-of-the-art approaches to this leverage pretrained monolingual word embeddings using supervised or semi- supervised approaches. However, these approaches require cross-lingual information such as seed dictionaries to train the model and find a linear transformation between the word embedding spaces. Especially in the case of low-resourced languages, seed dictionaries are not readily available, and as such, these methods produce extremely weak results on these languages. In this work, we focus on the Dravidian languages, namely Tamil, Telugu, Kannada, and Malayalam, which are even more challenging as they are written in unique scripts. To take advantage of orthographic information and cognates in these languages, we bring the related languages into a single script. Previous approaches have used linguistically sub-optimal measures such as the Levenshtein edit distance to detect cognates, whereby we demonstrate that the longest common sub-sequence is linguistically more sound and improves the performance of bilingual lexicon induction. We show that our approach can increase the accuracy of bilingual lexicon induction methods on these languages many times, making bilingual lexicon induction approaches feasible for such under-resourced languages.

...read moreread less

Journal Article•DOI•

Towards an efficient Malayalam Named Entity Recognizer Analysis on the Challenges

[...]

Sreeja P. S¹, Anitha S. Pillai¹•Institutions (1)

Hindustan University¹

01 Jan 2020-Procedia Computer Science

TL;DR: The challenges in building an efficient NER for one of the south Indian language namely Malayalam are highlighted and the different issues that the authors need to address are presented.

...read moreread less

Posted Content•

Characterising User Content on a Multi-lingual Social Network

[...]

Pushkal Agarwal¹, Kiran Garimella², Sagar Joglekar¹, Nishanth Sastry¹, Gareth Tyson³ - Show less +1 more•Institutions (3)

King's College London¹, Massachusetts Institute of Technology², Queen Mary University of London³

23 Apr 2020-arXiv: Social and Information Networks

TL;DR: In this paper, the authors investigated the cross-lingual dynamics by clustering visually similar images together and exploring how they move across language barriers, and found that Telugu, Malayalam, Tamil and Kannada languages tend to be dominant in soliciting political images.

...read moreread less

Abstract: Social media has been on the vanguard of political information diffusion in the 21st century. Most studies that look into disinformation, political influence and fake-news focus on mainstream social media platforms. This has inevitably made English an important factor in our current understanding of political activity on social media. As a result, there has only been a limited number of studies into a large portion of the world, including the largest, multilingual and multi-cultural democracy: India. In this paper we present our characterisation of a multilingual social network in India called ShareChat. We collect an exhaustive dataset across 72 weeks before and during the Indian general elections of 2019, across 14 languages. We investigate the cross lingual dynamics by clustering visually similar images together, and exploring how they move across language barriers. We find that Telugu, Malayalam, Tamil and Kannada languages tend to be dominant in soliciting political images (often referred to as memes), and posts from Hindi have the largest cross-lingual diffusion across ShareChat (as well as images containing text in English). In the case of images containing text that cross language barriers, we see that language translation is used to widen the accessibility. That said, we find cases where the same image is associated with very different text (and therefore meanings). This initial characterisation paves the way for more advanced pipelines to understand the dynamics of fake and political content in a multi-lingual and non-textual setting.

...read moreread less

Journal Article•DOI•

Machine Learning Approach to Suffix Separation on a Sandhi Rule Annotated Malayalam Data Set

[...]

Mary Priya Sebastian¹, Mary Priya Sebastian², G. Santhosh Kumar¹•Institutions (2)

Cochin University of Science and Technology¹, Rajagiri²

29 May 2020-South Asia Research

TL;DR: The article discusses the results and issues encountered in developing this word-splitting tool for Malayalam, mainly in the context of improving the alignments between parallel texts that form a core resource in the Machine Translation task.

...read moreread less

Abstract: This article explores in depth various sandhi (joining) rules in Kerala’s Malayalam language, which play a vital role in framing of the inflected and agglutinated forms of words and their compounds...

...read moreread less

Journal Article•DOI•

Home and school literacy practices of children attending Malayalam classes in Singapore

[...]

Suvita Thanagopalasamy¹, Anitha Devi Pillai²•Institutions (2)

National University of Singapore¹, National Institute of Education²

31 Dec 2020-Journal of Modern Languages

TL;DR: This article investigated the home and school literacy practices of the children enrolled in the Malayalam classes at Organization X with a view to understand the impact of these lessons on the community in Malayalee.

...read moreread less

Abstract: The Malayalee community, a minority language group in Singapore, lacks institutional support for learning their mother tongue in schools. In school, most Malayalee children opt to study Tamil or Hindi, a Non-Tamil Indian Language (NTIL) in place of Malayalam. Over the years, ground-up initiatives by volunteers have resulted in ad-hoc community-run classes which are conducted by volunteers. In 2010, a community-run initiative, Organization X, was set up to formalize the learning and teaching of Malayalam in Singapore. This paper aims to investigate the home and school literacy practices of the children enrolled in the Malayalam classes at Organization X with a view to understand the impact of these lessons on the community. The study found that when both home and school literacy practices were viewed as social activities, it contributed to the maintenance of the language in the community.

...read moreread less

Journal Article•DOI•

Cinema and the mask of capital: Labour debates in the Malayalam film industry

[...]

Darshana Sreedhar Mini¹•Institutions (1)

University of Wisconsin-Madison¹

01 Dec 2020-Studies in South Asian Film & Media

TL;DR: The authors examined how the film industry's apprenticeship and unpaid labour arrangements affect below-the-line labour and less influential job profiles on a film set, and explored how labour and bargaining rights are conceptualized differently by film organizations based on their ideological positions.

...read moreread less

Abstract: Labour discourses in the film industry are often couched in the language of ‘welfare’ and an effort to maintain harmony among different filmmaking sectors. But such arrangements do not proffer equal participation or bargaining rights to everyone in the industry. Focusing on the Malayalam language film industry based in Kerala, this article examines how the film industry’s apprenticeship and unpaid labour arrangements affect below-the-line labour and less influential job profiles on a film set. In corollary, I also explore how labour and bargaining rights are conceptualized differently by film organizations based on their ideological positions. Using a mixed-methods approach, including media ethnography and interviews with members of different trade guilds who form part of Malayalam cinema’s professional, technical and service sectors, I demonstrate how structural inequalities in the film industry are overlooked while the cine-worker’s agency is co-opted by a neoliberal system that masquerades as welfare.

...read moreread less

Journal Article•DOI•

Sleepless fathers in Malayalam cinema: Unraveling the dynamics of caste and masculinity

[...]

Navaneetha Mokkil¹•Institutions (1)

Jawaharlal Nehru University¹

03 Mar 2020-South Asian Popular Culture

TL;DR: In this paper, the formations of masculinity and its links to the perceived vulnerability of young women's sexualized bodies by focusing on popular Malayalam cinema from Kerala are explored. But, their focus was on women.

...read moreread less

Abstract: This paper explores the formations of masculinity and its links to the perceived vulnerability of young women’s sexualized bodies by focusing on popular Malayalam cinema from Kerala. I will specifi...

...read moreread less

Journal Article•DOI•

Bilingual phonology in dichotic perception: A case study of Malayalam and English voicing

[...]

Sayantan Mandal¹, Catherine T. Best², Jason A. Shaw³, Anne Cutler⁴•Institutions (4)

Concordia University¹, University of Sydney², Yale University³, Max Planck Society⁴

22 Jul 2020-Glossa

TL;DR: This paper used a dichotic listening task with fluent Malayalam-English bilinguals, in which they were presented with synchronized nonce words, one in each language in separate ears, with competing onsets of a labial stop (Malayalam) and an English, both voiced or both voiceless.

...read moreread less

Abstract: Listeners often experience cocktail-party situations, encountering multiple ongoing conversations while tracking just one. Capturing the words spoken under such conditions requires selective attention and processing, which involves using phonetic details to discern phonological structure. How do bilinguals accomplish this in L1-L2 competition? We addressed that question using a dichotic listening task with fluent Malayalam-English bilinguals, in which they were presented with synchronized nonce words, one in each language in separate ears, with competing onsets of a labial stop (Malayalam) and a labial fricative (English), both voiced or both voiceless. They were required to attend to the Malayalam or the English item, in separate blocks, and report the initial consonant they heard. We found that perceptual intrusions from the unattended to the attended language were influenced by voicing, with more intrusions on voiced than voiceless trials. This result supports our proposal for the feature specification of consonants in Malayalam-English bilinguals, which makes use of privative features, underspecification and the “standard approach” to laryngeal features, as against “laryngeal realism”. Given this representational account, we observe that intrusions result from phonetic properties in the unattended signal being assimilated to the closest matching phonological category in the attended language, and are more likely for segments with a greater number of phonological feature specifications.

...read moreread less

Proceedings Article•DOI•

Harnessing Cross-lingual Features to Improve Cognate Detection for Low-resource Languages

[...]

Diptesh Kanojia¹, Diptesh Kanojia², Raj Dabre³, Shubham Dewangan¹, Pushpak Bhattacharyya¹, Gholamreza Haffari, Malhar Kulkarni¹ - Show less +3 more•Institutions (3)

Indian Institute of Technology Bombay¹, Monash University², National Institute of Information and Communications Technology³

01 Dec 2020

TL;DR: The use of cross-lingual word embeddings for detecting cognates among fourteen Indian Languages is demonstrated and the use of context from a knowledge graph to generate improved feature representations for cognate detection is introduced.

...read moreread less

Abstract: Cognates are variants of the same lexical form across different languages; for example “fonema” in Spanish and “phoneme” in English are cognates, both of which mean “a unit of sound”. The task of automatic detection of cognates among any two languages can help downstream NLP tasks such as Cross-lingual Information Retrieval, Computational Phylogenetics, and Machine Translation. In this paper, we demonstrate the use of cross-lingual word embeddings for detecting cognates among fourteen Indian Languages. Our approach introduces the use of context from a knowledge graph to generate improved feature representations for cognate detection. We, then, evaluate the impact of our cognate detection mechanism on neural machine translation (NMT), as a downstream task. We evaluate our methods to detect cognates on a challenging dataset of twelve Indian languages, namely, Sanskrit, Hindi, Assamese, Oriya, Kannada, Gujarati, Tamil, Telugu, Punjabi, Bengali, Marathi, and Malayalam. Additionally, we create evaluation datasets for two more Indian languages, Konkani and Nepali. We observe an improvement of up to 18% points, in terms of F-score, for cognate detection. Furthermore, we observe that cognates extracted using our method help improve NMT quality by up to 2.76 BLEU. We also release our code, newly constructed datasets and cross-lingual models publicly.

...read moreread less

Journal Article•DOI•

A Deep Level Tagger for Malayalam, a Morphologically Rich Language

[...]

A. P. Ajees¹, K. J. Abrar, Mary Idicula Sumam¹, M. Sreenathan•Institutions (1)

Cochin University of Science and Technology¹

05 Jul 2020-Journal of intelligent systems

TL;DR: A methodology for deep level tagging of Malayalam text is demonstrated, which is the process of assigning deeper level information to every noun and verb in the text along with normal POS tags.

...read moreread less

Abstract: Abstract In recent years, there has been tremendous growth in the amount of natural language text through various sources. Computational analysis of this text has got considerable attention among the NLP researchers. Automatic analysis and representation of natural language text is a step by step procedure. Deep level tagging is one of such steps applied over the text. In this paper, we demonstrate a methodology for deep level tagging of Malayalam text. Deep level tagging is the process of assigning deeper level information to every noun and verb in the text along with normal POS tags. In this study, we move towards a direction that is not much explored in the case of Malayalam language. Malayalam is a morphologically rich and agglutinative language. The morphological features of the language are effectively utilized for the computational analysis of Malayalam text. The language level details required for the study are provided by Thunjath Ezhuthachan Malayalam University, Tirur.

...read moreread less