Showing papers on "Marathi published in 2012"


Proceedings Article
01 Dec 2012
TL;DR: The crux of the idea is to use the linked WordNets of two languages to bridge the language gap by using WordNet senses as features for supervised sentiment classification in Hindi and Marathi.
Abstract: Cross-Lingual Sentiment Analysis (CLSA) is the task of predicting the polarity of the opinion expressed in a text in a language L_test using a classifier trained on the corpus of another language L_train. Popular approaches use Machine Translation (MT) to convert the test document in L_test to L_train and use the classifier of L_train. However, MT systems do not exist for most pairs of languages and even if they do, their translation accuracy is low. So we present an alternative approach to CLSA using WordNet senses as features for supervised sentiment classification. A document in L_test is tested for polarity through a classifier trained on sense marked and polarity labeled corpora of L_train. The crux of the idea is to use the linked WordNets of two languages to bridge the language gap. We report our results on two widely spoken Indian languages, Hindi (450 million speakers) and Marathi (72 million speakers), which do not have an MT system between them. The sense-based approach gives a CLSA accuracy of 72% and 84% for Hindi and Marathi sentiment classification respectively. This is an improvement of 14%-15% over an approach that uses a bilingual dictionary.

75 citations
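
The sense-bridging idea in the abstract above can be illustrated with a minimal sketch. It assumes two toy lexicons that map Hindi and Marathi words to shared synset IDs; the IDs (`S_GOOD`, `S_BAD`, `S_FILM`), the lexicon entries, the training documents, and the choice of a Naive Bayes classifier are all hypothetical and are not taken from the paper. The paper works with sense-marked corpora, whereas this sketch uses a plain dictionary lookup and so skips word sense disambiguation.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy lexicons: linked WordNets are assumed to give the same
# synset ID to equivalent words in both languages. Entries are illustrative.
HINDI_SENSES = {"अच्छा": "S_GOOD", "बुरा": "S_BAD", "फिल्म": "S_FILM"}
MARATHI_SENSES = {"चांगला": "S_GOOD", "वाईट": "S_BAD", "चित्रपट": "S_FILM"}


def to_senses(tokens, lexicon):
    """Replace each known word by its language-independent synset ID."""
    return [lexicon[t] for t in tokens if t in lexicon]


def train_nb(docs):
    """Train multinomial Naive Bayes on (sense_id_list, label) pairs."""
    priors, counts, vocab = Counter(), defaultdict(Counter), set()
    for senses, label in docs:
        priors[label] += 1
        counts[label].update(senses)
        vocab.update(senses)
    return priors, counts, vocab


def classify(senses, model):
    """Return the label with the highest add-one-smoothed log score."""
    priors, counts, vocab = model
    total_docs = sum(priors.values())
    best_label, best_score = None, -math.inf
    for label in priors:
        denom = sum(counts[label].values()) + len(vocab)
        score = math.log(priors[label] / total_docs)
        for s in senses:
            if s in vocab:  # ignore senses never seen in training
                score += math.log((counts[label][s] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Train on Hindi (playing the role of L_train), test on Marathi (L_test).
train_docs = [
    (to_senses(["अच्छा", "फिल्म"], HINDI_SENSES), "pos"),
    (to_senses(["बुरा", "फिल्म"], HINDI_SENSES), "neg"),
]
model = train_nb(train_docs)
prediction = classify(to_senses(["चांगला", "चित्रपट"], MARATHI_SENSES), model)
```

Because both languages are projected into the same synset-ID feature space, the classifier never sees a Marathi word during training, yet it can still score a Marathi document.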


Proceedings Article
01 Jan 2012
TL;DR: This paper discusses the efforts in collecting speech databases for Indian languages – Bengali, Hindi, Kannada, Malayalam, Marathi, Tamil and Telugu, and discusses relevant design considerations in collecting these databases.
Abstract: This paper discusses the efforts in collecting speech databases for Indian languages – Bengali, Hindi, Kannada, Malayalam, Marathi, Tamil and Telugu. We discuss relevant design considerations in collecting these databases, and demonstrate their usage in speech synthesis. By releasing these speech databases in the public domain without any restrictions for non commercial and commercial purposes, we hope to promote research and developmental activities in building speech synthesis systems in Indian languages.

70 citations


01 Jan 2012
TL;DR: The efforts made by various researchers to develop automatic speech recognition systems for most of the Indo-Aryan languages are analysed, and their applicability to Punjabi is discussed so that concrete work can be initiated for the Punjabi language.
Abstract: Punjabi, Hindi, Marathi, Gujarati, Sindhi, Bengali, Nepali, Sinhala, Oriya, Assamese and Urdu are prominent members of the Indo-Aryan language family. These languages are mainly spoken in India, Pakistan, Bangladesh, Nepal, Sri Lanka and the Maldive Islands. All these languages have highly diverse phonetic content. In the last two decades, few researchers have worked on the development of Automatic Speech Recognition systems for most of these languages so that this technology can reach par with the research that has been done, and is being done, for other languages in the rest of the world. Punjabi is the 10th most widely spoken language in the world, yet no considerable work has been done for it in the area of automatic speech recognition. As a member of the Indo-Aryan language family and a language rich in literature, Punjabi deserves attention in this fast-growing field. In this paper, the efforts made by various researchers to develop automatic speech recognition systems for most of the Indo-Aryan languages are analysed, and their applicability to Punjabi is discussed so that concrete work can be initiated for the Punjabi language.

25 citations


Proceedings ArticleDOI
13 Nov 2012
TL;DR: Some research issues, like the ambiguity between frication and aspirated plosives, are addressed in this paper, and the anusvara in both of these languages is shown to be produced based on the immediately following consonant.
Abstract: This paper addresses phonetic transcription related issues in Gujarati and Marathi (Indian languages). Some ad hoc approaches to fix the relationship between the general alphabetical symbols and phonetic symbols may not always work. Hence, some research issues, like the ambiguity between frication and aspirated plosives, are addressed in this paper. The anusvara in both of these languages is produced based on the immediately following consonant. The implication of this finding for the problem of phonetic transcription is presented. Furthermore, the effect of dialectal variations on phonetic transcription is also analyzed for Marathi. Finally, some examples of phonetic transcription for sentences of these two languages are presented.

12 citations


Journal ArticleDOI
TL;DR: Findings suggest distinct patterns of bilingualism effects on cognition for this previously unexamined language pair, and that the rate of cognates may modulate the association between bilingualism and verbal performance on neuropsychological tests.
Abstract: The present study aimed to examine if bilingualism affects executive functions and verbal fluency in Marathi and Hindi, two major languages in India, with a considerable cognate (e.g., activity is actividad in Spanish) overlap. A total of 174 native Marathi speakers from Pune, India, with varying levels of Hindi proficiency were administered tests of executive functioning and verbal performance in Marathi. A bilingualism index was generated using self-reported Hindi and Marathi proficiency. After controlling for demographic variables, the association between bilingualism and cognitive performance was examined. Degree of bilingualism predicted better performance on the switching (Color Trails-2) and inhibition (Stroop Color-Word) components of executive functioning; but not for the abstraction component (Halstead Category Test). In the verbal domain, bilingualism was more closely associated with noun generation (where the languages share many cognates) than verb generation (which are more disparate across these languages), as predicted. However, contrary to our hypothesis that the bilingualism "disadvantage" would be attenuated on noun generation, bilingualism was associated with an advantage on these measures. These findings suggest distinct patterns of bilingualism effects on cognition for this previously unexamined language pair, and that the rate of cognates may modulate the association between bilingualism and verbal performance on neuropsychological tests.

12 citations


Proceedings Article
01 Dec 2012
TL;DR: During the last two decades, most of the named entity (NE) machine transliteration work in India has been carried out using English as the source language and Indian languages as the target languages, using a grapheme model with statistical probability approaches and classification tools.
Abstract: During the last two decades, most of the named entity (NE) machine transliteration work in India has been carried out using English as the source language and Indian languages as the target languages, using a grapheme model with statistical probability approaches and classification tools. It is evident that less work has been carried out on machine transliteration from Indian languages to English.

11 citations


Journal ArticleDOI
TL;DR: The phoneme used in the Marathi language is discussed as a possible basic unit of speech recognition, for which there is some empirical psychoacoustic support in the case of humans and some engineering justification in the case of machines striving to imitate human abilities.
Abstract: This paper discusses the phoneme used in the Marathi language as a possible basic unit of speech recognition, for which there is some empirical psychoacoustic support in the case of humans and some engineering justification in the case of machines striving to imitate human abilities. For the purpose of the research described in this paper, a basic unit of speech recognition is the intermediate form of speech information around which much of the recognition processing is organized for human beings or for machines. The general opinion of phoneticians and psycholinguists is that there is indeed such a unit with relatively few distinct types [1]. For this research a basic unit is ideally an output of acoustic-phonetic processing and an input to the lexical processing stages.

10 citations


Proceedings Article
01 Dec 2012
TL;DR: This work is the first of its kind on a systematic and exhaustive study of the morphotactics of a suffix-stacking language, leading to a high quality morphological analyzer for Marathi, a highly inflectional language with agglutinative features.
Abstract: In this paper we describe and evaluate a Finite State Machine (FSM) based Morphological Analyzer (MA) for Marathi, a highly inflectional language with agglutinative suffixes. Marathi belongs to the Indo-European family and is considerably influenced by Dravidian languages. Adroit handling of participial constructions and other derived forms (Krudantas and Taddhitas) in addition to inflected forms is crucial to NLP and MT of Marathi. We first describe Marathi morphological phenomena, detailing the complexities of inflectional and derivational morphology, and then go into the construction and working of the MA. The MA produces the root word and the features. A thorough evaluation against gold standard data establishes the efficacy of this MA. To the best of our knowledge, this work is the first of its kind on a systematic and exhaustive study of the morphotactics of a suffix-stacking language, leading to a high quality morph analyzer. The system forms part of a Marathi-Hindi transfer based machine translation system. The methodology delineated in the paper can be replicated for other languages showing similar suffix stacking behaviour as Marathi.

10 citations


Proceedings Article
01 Dec 2012
TL;DR: In this paper, Sanskrit compounding system is examined thoroughly and the insight gained from the Sanskrit grammar is applied for the analysis of compounds in Hindi and Marathi.
Abstract: Compounds occur very frequently in Indian Languages. There are no strict orthographic conventions for compounds in modern Indian Languages. In this paper, Sanskrit compounding system is examined thoroughly and the insight gained from the Sanskrit grammar is applied for the analysis of compounds in Hindi and Marathi. It is interesting to note that compounding in Hindi deviates from that in Sanskrit in two aspects. The data analysed for Hindi does not

10 citations


29 Sep 2012
TL;DR: A new methodology called ‘Information Retrieval in Multilingual Environment’ provides the functionality of processing and retrieval of Indian languages like Hindi, Marathi, Telugu, Gujarati, Urdu, Bengali, Malayalam, Kannada etc., and retrieves Indian language documents in response to a query given in English or any Indian language.
Abstract: In today’s world of globalization, storage and retrieval in local languages is essential for developing nations like India. As our country is linguistically diverse and only 10% of the population knows English, this diversity of languages is becoming a barrier to understanding and getting acquainted with the digital world. It has been found that when services are provided in local languages, they are strongly accepted and used. A new methodology called ‘Information Retrieval in Multilingual Environment’ provides the functionality of processing and retrieval of Indian languages like Hindi, Marathi, Telugu, Gujarati, Urdu, Bengali, Malayalam, Kannada etc. A Cross Lingual Information Retrieval System retrieves Indian language documents in response to a query given in English or any Indian language.

8 citations


01 Mar 2012
TL;DR: A novel method is provided to recognize handwritten Marathi characters based on feature extraction and an adaptive smoothing technique, and it is noted that no single technique achieves 100% accuracy in handwritten character recognition.
Abstract: The growing need for handwritten Marathi character recognition in Indian offices such as passport, railways etc. has made it a vital area of research. Characters with similar shapes are more prone to misclassification. In this paper a novel method is provided to recognize handwritten Marathi characters based on feature extraction and an adaptive smoothing technique. Feature selection methods avoid unnecessary patterns in an image, whereas the adaptive smoothing technique smooths the shape of characters. Combining both approaches leads to better results. Previous studies show that no single technique achieves 100% accuracy in handwritten character recognition. This approach of combining adaptive smoothing and feature extraction gives better results (approximately 75-100) and the expected outcomes.


Journal Article
TL;DR: The proposed procedure to be followed for collecting the isolated words data from the farmers of the Aurangabad District for developing an Automatic Speech Recognition System in Marathi Language is described.
Abstract: Development of Speech Database is the very first step for developing an Automatic Speech Recognition system. The Accuracy of speech recognition depends on the quality of the speech data collected and the training set data quality. This paper describes the proposed procedure to be followed for collecting the isolated words data from the farmers of the Aurangabad District for developing an Automatic Speech Recognition System in Marathi Language.

01 Jan 2012
TL;DR: The Indic-Phonetic approach is found to be more efficient and accurate than the other two approaches, evaluated by generating cases like length-of-string (LOS), difference in vowel and compound words for Hindi and Marathi.
Abstract: Phonetic matching plays an important role in multilingual information retrieval, where data is manipulated in multiple languages. Users need information in their local language, which may be different from the language in which the data has been maintained. In such an environment, we need a system which matches strings phonetically, either exactly or approximately, irrespective of errors. Many errors or variations can be considered, but here we consider typographical errors, spelling errors such as a difference in vowel, and matching of compound words. Many approaches have been proposed, like soundex, q-gram, phoenix etc., but they may produce ambiguity in matching or may not be applicable to Indian languages. In this paper, we propose approaches which match strings in either Hindi or Marathi accurately. We evaluated three approaches, namely Soundex, Q-gram and Indic-Phonetic, by generating cases like length-of-string (LOS), difference in vowel and compound words for Hindi and Marathi. We found that the Indic-Phonetic approach is more efficient and accurate than the other two approaches.
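
Of the three approaches named in the abstract above, Q-gram matching is a standard technique that can be sketched briefly. The version below is a minimal illustration, assuming padded bigrams and a Dice coefficient over Unicode code points; the function names, the padding character, and the example words are assumptions made here, and the abstract does not describe the paper's own Indic-Phonetic method, so it is not reproduced.

```python
from collections import Counter


def qgrams(text, q=2, pad="#"):
    """Return the list of overlapping q-grams of a padded string."""
    padded = pad * (q - 1) + text + pad * (q - 1)
    return [padded[i:i + q] for i in range(len(padded) - q + 1)]


def qgram_similarity(a, b, q=2):
    """Dice coefficient over q-gram multisets: 1.0 identical, 0.0 disjoint."""
    grams_a, grams_b = Counter(qgrams(a, q)), Counter(qgrams(b, q))
    overlap = sum((grams_a & grams_b).values())  # multiset intersection
    total = sum(grams_a.values()) + sum(grams_b.values())
    return 2 * overlap / total if total else 1.0


# Two Devanagari strings that differ only by one vowel sign.
score = qgram_similarity("कमल", "कमाल")
```

Note that this sketch compares raw code points, so a vowel sign counts as a separate symbol; a matcher designed for Indian scripts would more plausibly work on syllable or phoneme units.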

Proceedings ArticleDOI
02 Nov 2012
TL;DR: The methods by which English Wikipedia data can be used to bootstrap the identification of NEs in other languages, generating a list of NEs, are described, and this NE list is used to improve multilingual Entity Filling, with promising results.
Abstract: This paper details the approach to identify Named Entities (NEs) from a large non-English corpus and associate them with appropriate tags, requiring minimal human intervention and no linguistic expertise. The main objective in this paper is to focus on Indian languages like Telugu, Hindi, Tamil, Marathi, etc., which are considered to be resource-poor languages when compared to English. The inherent structure of Wikipedia was exploited in developing an efficient co-occurrence frequency based NE identification algorithm for Indian languages. We describe the methods by which English Wikipedia data can be used to bootstrap the identification of NEs in other languages, which generates a list of NEs. Later, the paper focuses on utilizing this NE list to improve multilingual Entity Filling, which showed promising results. On a dataset of 2,622 Marathi Wikipedia articles, with around 10,000 NEs manually tagged, an F-measure of 81.25% was achieved by our system without availing language expertise. Similarly, an F-measure of 80.42% was achieved on around 12,000 NEs tagged within 2,935 Hindi Wikipedia articles.

01 Jan 2012
TL;DR: The process of machine translation can be expanded to include the use of spelling and grammatical checks, intermediate language, sentiment analysis, proverbs and phrases, and more.
Abstract: Machine Translation is one of the fastest growing research areas in the field of Natural Language Processing, with a special area of focus being Asian languages. Substantial work has been done in the case of Hindi and Bengali. The scope of this paper is to discuss the future scope of machine translation, with specific focus on translation of Marathi – a language spoken by over 70 million people [1]. The process of machine translation can be expanded to include the use of spelling and grammatical checks, intermediate language, sentiment analysis, proverbs and phrases.

01 Dec 2012
TL;DR: The proposed phonetic based statistical approach uses phoneme and named entity length as features for supervised learning and transliterates named entities into English using a full consonant based phonetic scheme without the support of a corpus.
Abstract: Machine transliteration has received significant research attention in the last two decades. It is observed that Hindi to English and Marathi to English named entity machine transliteration is comparatively less studied. Currently, research work in this domain is carried out using grapheme based statistical approaches. But, to achieve better transliteration accuracy, an adequate bilingual text corpus is a mandatory requirement for statistical approaches. This paper focuses on Hindi to English and Marathi to English direct machine transliteration of Indian-origin named entities such as proper names, place names and organization names. The proposed phonetic based statistical approach uses phoneme and named entity length as features for supervised learning and transliterates them into English using a full consonant based phonetic scheme without the support of a corpus. This system takes Indian-origin named entities as input in Hindi and Marathi using Devanagari script and transliterates them into English using only two weights.

Journal ArticleDOI
TL;DR: Important issues which frequently occur in Hindi to English and Marathi to English named entity machine transliteration are the focus of this paper.
Abstract: Almost all transactions across domains such as travel, shopping, insurance, entertainment, hotels, appointments etc. are available through Internet based applications. Needless to say, all these applications require knowledge of English. As the number of Internet users grows day by day, there is great demand for tools and applications that support Indian languages for them. The solution for providing local language support in web based commercial applications is Machine Translation, which can be used to translate static labels on web forms, and Machine Transliteration, which transliterates dynamic user inputs from the local language into the default language, English. It is challenging to transliterate names and technical terms occurring in the user input across languages with different alphabets and sound inventories. This paper focuses on important issues which frequently occur in Hindi to English and Marathi to English named entity machine transliteration.

01 Jan 2012
TL;DR: This paper covers a comprehensive analysis and comparison of the effect of language structure related factors (morphology, phonetics, WSD, synonyms) on the performance of search engines supporting the Hindi language.
Abstract: With the internet growing at an exponential rate, the web is increasingly hosting web pages in different languages. It is essential for search engines to be able to search information stored in a specific language. Native users also tend to look for information on the web nowadays. This leads to the need for effective search engines that fulfill native users’ needs and provide them information in their native languages. The majority of India's population uses Hindi as a first language. The Indian constitution identifies 22 languages, of which six (Hindi, Telugu, Tamil, Bengali, Marathi and Gujarati) are spoken by at least 50 million people within the boundaries of the country; a large number of speakers also live outside the country. Hindi language web information retrieval is not in a satisfactory condition. The presence of Hindi on the World Wide Web is still limited and tentative because of attitudinal and technical factors. Besides the other technical setbacks, Hindi language search engines face the problems of morphology, phonetics, word sense disambiguation etc. The performance of search engines is affected by these problems. This paper covers a comprehensive analysis and comparison of the effect of language structure related factors (morphology, phonetics, WSD, synonyms) on the performance of search engines supporting the Hindi language.

Journal ArticleDOI
TL;DR: This article presents a review of earlier research work related to Devanagari character recognition along with some applications of optical character recognition systems.
Abstract: Optical character recognition is a vital task in the field of pattern recognition. English character recognition has been extensively studied by many researchers, but for Indian languages, which are complicated, the research work is very limited. Devanagari is an Indian script used by a huge number of Indian people. Devanagari forms the basis for several Indian languages including Hindi, Sanskrit, Kashmiri, Marathi and so on. This article presents a review of earlier research work related to Devanagari character recognition along with some applications of optical character recognition systems.

Journal ArticleDOI
TL;DR: A text independent language recognition system using a common code book and discrete hidden Markov models (DHMM) achieves very good LID performance with less computation time compared with state-of-the-art phone based systems available in the literature.
Abstract: Language Identification is the task of recognizing the language from an unknown utterance of speech. The ability of machines to distinguish between different languages becomes an important concern with the emerging trends in global communications, which are multilingual in nature. This paper describes a text independent language recognition system using a common code book and discrete hidden Markov models (DHMM) to achieve very good LID performance with less computation time compared with state-of-the-art phone based systems available in the literature. This approach includes generation of a common codebook and training of DHMMs, one for each language. The experiments are carried out on a database of Indian languages consisting of six languages, namely Telugu, Tamil, Hindi, Marathi, Malayalam and Kannada.
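
The decision rule implied by the abstract above (one discrete HMM per language over a shared codebook, with the best-scoring model chosen) can be sketched as follows. This is only an illustration under stated assumptions: the utterance is taken to be already vector-quantized into codebook indices, the two toy models `lang_a` and `lang_b` and all their probabilities are invented here, and training of the models is omitted.

```python
import math


def log_likelihood(obs, start, trans, emit):
    """Scaled forward algorithm: log P(obs | discrete HMM)."""
    if not obs:
        return 0.0
    n = len(start)
    alpha = [start[s] * emit[s][obs[0]] for s in range(n)]
    log_p = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Propagate through the transitions, then weight by the emission.
            alpha = [
                sum(alpha[r] * trans[r][s] for r in range(n)) * emit[s][obs[t]]
                for s in range(n)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return -math.inf
        log_p += math.log(scale)
        alpha = [a / scale for a in alpha]  # rescale to avoid underflow
    return log_p


def identify(obs, models):
    """Pick the language whose DHMM gives the observation the highest score."""
    return max(models, key=lambda name: log_likelihood(obs, *models[name]))


# Hypothetical two-state models over a two-symbol codebook: (start, trans, emit).
uniform = [[0.5, 0.5], [0.5, 0.5]]
models = {
    "lang_a": ([0.5, 0.5], uniform, [[0.9, 0.1], [0.8, 0.2]]),
    "lang_b": ([0.5, 0.5], uniform, [[0.1, 0.9], [0.2, 0.8]]),
}
utterance = [0, 0, 1, 0]  # codebook indices, assumed already quantized
best = identify(utterance, models)
```

Sharing one codebook means the quantization step runs once per utterance, and only the comparatively cheap forward pass is repeated for each language model.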

01 Dec 2012
TL;DR: In this article, the authors identify paralinguistic markers for emotion in the language, referred to as emotiphons, in two Indian languages: Marathi from Indo-Aryan and Kannada from Dravidian family.
Abstract: In spontaneous speech, emotion information is embedded at several levels: acoustic, linguistic, gestural (non-verbal), etc. For emotion recognition in speech, there is much attention to acoustic level and some attention at the linguistic level. In this study, we identify paralinguistic markers for emotion in the language. We study two Indian languages belonging to two distinct language families. We consider Marathi from Indo-Aryan and Kannada from Dravidian family. We show that there exist large numbers of specific paralinguistic emotion markers in these languages, referred to as emotiphons. They are inter-twined with prosody and semantics. Preprocessing of speech signal with respect to emotiphons would facilitate emotion recognition in speech for Indian languages. Some of them are common between the two languages, indicating cultural influence in language usage.

Journal Article
TL;DR: This work aims at recognizing Marathi language Barakhadi characters by recognizing a vowel and a consonant separately using quadratic classifier.
Abstract: Handwritten character recognition (HCR) is an important subset within the pattern recognition area. Very little work is happening on Marathi Barakhadi characters, which are formed by the combination of one of the 12 vowels and 36 consonants, resulting in 432 characters. As the number of characters to be uniquely identified is very large, the proposed method aims at recognizing Marathi language Barakhadi characters by recognizing a vowel and a consonant separately. Based on the Devanagari character shape analysis and data set, the whole image is split into a top region image with information above the header line and a middle region image with information below the header line. The middle region is further processed to detect and separate the side modifiers, if any, for vowel recognition. Invariant moment features are extracted from the top region and from the side modifiers and classified using a quadratic classifier for recognition of the vowel matra. If no vowel matra is found, the image is cut by 20-30% from the bottom to detect the presence of lower modifiers. Invariant moment features are extracted from the cut image and classified using a quadratic classifier. The core consonant is divided into various zones and invariant moment features are extracted from each zone. These features are compressed using principal component analysis and classified using a quadratic classifier for consonant recognition. These features will be trained and tested for both vowel and consonant recognition using a quadratic classifier. Keywords: Handwritten character recognition; Marathi Barakhadi; zonal moments; classifier; feature extraction.
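
The counting argument in the abstract above (12 vowel forms times 36 consonants gives 432 characters, while separate recognition needs only 12 + 36 classes) can be checked with a short sketch. The particular 36 consonants listed below, including the conjuncts क्ष and ज्ञ, and the ordering of the vowel signs are assumptions made for illustration; the abstract does not enumerate them.

```python
# Assumed inventory of 36 consonants (34 basic letters plus two conjuncts).
CONSONANTS = (
    "क ख ग घ ङ च छ ज झ ञ ट ठ ड ढ ण त थ द ध न "
    "प फ ब भ म य र ल व श ष स ह ळ क्ष ज्ञ"
).split()

# 12 vowel forms: the inherent vowel (no sign), ten dependent vowel signs,
# then anusvara and visarga, written as Unicode escapes for clarity.
VOWEL_SIGNS = [
    "", "\u093e", "\u093f", "\u0940", "\u0941", "\u0942",
    "\u0947", "\u0948", "\u094b", "\u094c", "\u0902", "\u0903",
]

# Every Barakhadi character is one consonant followed by one vowel form.
BARAKHADI = [c + v for c in CONSONANTS for v in VOWEL_SIGNS]


def split_label(index):
    """Map a flat class index (0..431) back to (consonant, vowel sign)."""
    c_idx, v_idx = divmod(index, len(VOWEL_SIGNS))
    return CONSONANTS[c_idx], VOWEL_SIGNS[v_idx]


joint_classes = len(BARAKHADI)                          # one classifier
separate_classes = len(CONSONANTS) + len(VOWEL_SIGNS)   # two classifiers
```

Under these assumptions, a single classifier would have to separate 432 classes, whereas the two-stage design handles 36 and 12 classes independently.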

Journal ArticleDOI
TL;DR: This paper reads the key character of the Nachya (effeminate male performer) in representations of Tamasha in the Marathi film Natrang, in order to interrogate a productive ambivalence in this character, one that is simultaneously heteronormative and queerly transgressive in its regional Indian context.
Abstract: In this paper, I read the key character of the Nachya (effeminate male performer) in representations of Tamasha in the Marathi film Natrang, in order to interrogate a productive ambivalence in this character – one that is simultaneously heteronormative and queerly transgressive in its regional Indian context. The Nachya is coded as homosexual in Marathi cinema, through his exaggeratedly effeminate appearance, gestures and high-pitched singing voice. He traditionally functions as a comic, ‘wrong’ body by emphasizing the difference between ‘real’ and ‘fake’ femininity. However, he also accrues subversive value and serves as a queer, cultural point of identification. Therefore, by focusing on Tamasha song and dance sequences (specifically, the Lavani as a site of Marathi cinema's sex and gender play), I argue that representation which is normative in the context of the film's production and target mainstream audience can be reclaimed and re-coded through the lens of what could be termed a dynamic, queer, reg...


03 Oct 2012
TL;DR: Comparative study of various character recognition techniques used for feature extraction and recognition of handwritten Marathi character is presented.
Abstract: Different pattern recognition models have been proposed in recent years, and different research groups are working on improving the recognition result. Handwritten character recognition for any Indian writing system is rendered complex because of the presence of composite characters. Hence the selection of a feature extraction method is probably the most important factor in achieving high recognition performance for Marathi character recognition. The goal of this paper is to present a comparative study of various character recognition techniques used for feature extraction and recognition of handwritten Marathi characters.

01 Jan 2012
TL;DR: In this paper, the authors describe two tests developed and designed for Marathi using non-words: a) plural formation for non-words, and b) an intuition test for gender assignment in which subjects were asked to assign gender to non-words.
Abstract: In this paper, we describe two tests developed and designed for Marathi using non-words: a) plural formation for non-words, and b) an intuition test for gender assignment in which subjects were asked to assign gender to non-words. We look at the distribution of nouns across noun classes and genders and discuss the congruence between the problematic classes as observed in the tests and the actual class distribution and frequency in the language.


Dissertation
02 Nov 2012
TL;DR: In this article, the authors propose a method to solve the problem of "uniformity" and "uncertainty" in the context of data mining, e.g.