Showing papers on "Malayalam published in 2011"


01 Nov 2011
TL;DR: This paper shows how to build a cross-language part-of-speech tagger for Kannada exploiting the resources of Telugu, and reveals that cross-language taggers are as efficient as monolingual taggers.
Abstract: Indian languages are known to have a large speaker base, yet some of these languages have minimal or inefficient linguistic resources. For example, Kannada is relatively resource-poor compared to Malayalam, Tamil and Telugu, which in turn are relatively poor compared to Hindi. Many Indian language pairs exhibit high similarities in morphology and syntactic behaviour; for example, Kannada is highly similar to Telugu. In this paper, we show how to build a cross-language part-of-speech tagger for Kannada exploiting the resources of Telugu. We also build large corpora and a morphological analyser (including lemmatisation) for Kannada. Our experiments reveal that cross-language taggers are as efficient as monolingual taggers. We aim to extend our work to other Indian languages. Our tools are efficient and significantly faster than the existing monolingual tools.

75 citations


Journal ArticleDOI
TL;DR: A Malayalam morphological analyzer and a Tamil morphological generator, both using a bilingual dictionary, are described; a morphological analyzer is a program for analyzing the morphology of an input word.
Abstract: Natural Language Processing (NLP) is both a modern computational technology and a method of investigating and evaluating claims about human language itself. Some prefer the term Computational Linguistics in order to capture this latter function, but NLP is a term that links back into the history of Artificial Intelligence (AI), the general study of cognitive function by computational processes, normally with an emphasis on the role of knowledge representations, that is to say the need for representations of our knowledge of the world in order to understand human language with computers. A morphological analyzer or generator supplies information concerning morphosyntactic properties of the words it analyses or constructs. Morphological analysis and generation are important components for building computational grammars as well as machine translation. A morphological analyzer is a program for analyzing the morphology of an input word: it reads the inflected surface form of each word in a text and provides its lexical form, while generation is the inverse process. Both analysis and generation make use of a lexicon. Malayalam, like the other languages in the Dravidian family, exhibits the characteristics of an agglutinative language. Here, the Malayalam morphological analyzer and the Tamil morphological generator, which use a bilingual dictionary, are described.

38 citations


Proceedings ArticleDOI
23 Mar 2011
TL;DR: This paper proposes a system for recognition of offline handwritten Malayalam vowels using chain code and image centroid for feature extraction and a two-layer feed-forward network with scaled conjugate gradient for classification.
Abstract: Optical Character Recognition plays an important role in Digital Image Processing and Pattern Recognition. Even though ample study has been performed on foreign languages like Chinese and Japanese, work on Indian scripts is still immature. OCR in the Malayalam language is more complex as it is enriched with the largest number of characters among all Indian languages. The challenge of recognizing characters is even greater in the handwritten domain, due to the varying writing style of each individual. In this paper we propose a system for recognition of offline handwritten Malayalam vowels. The proposed method uses chain code and image centroid for extracting features and a two-layer feed-forward network with scaled conjugate gradient for classification.

30 citations
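
The chain code and image centroid features named in the handwritten-vowel entry above can be illustrated with a minimal sketch. It is not the authors' implementation: the direction numbering (0 = east, counted counter-clockwise), the function names `chain_code` and `centroid`, and the assumption that the character boundary is already available as an ordered list of 8-connected points are all illustrative choices.

```python
# Freeman direction codes for the 8 neighbours, keyed by (row step, col step).
# The numbering (0 = east, counter-clockwise) is an assumption; the abstract
# does not say which convention was used.
DIRECTIONS = {
    (0, 1): 0, (-1, 1): 1, (-1, 0): 2, (-1, -1): 3,
    (0, -1): 4, (1, -1): 5, (1, 0): 6, (1, 1): 7,
}


def chain_code(boundary):
    """Return the Freeman chain code of an ordered list of (row, col) points.

    Consecutive points must be 8-connected neighbours.
    """
    codes = []
    for (r0, c0), (r1, c1) in zip(boundary, boundary[1:]):
        codes.append(DIRECTIONS[(r1 - r0, c1 - c0)])
    return codes


def centroid(image):
    """Return (mean row, mean col) of the foreground pixels of a binary image."""
    points = [(r, c) for r, row in enumerate(image)
              for c, value in enumerate(row) if value]
    if not points:
        raise ValueError("image has no foreground pixels")
    n = len(points)
    return (sum(r for r, _ in points) / n, sum(c for _, c in points) / n)
```

A feature vector for a classifier could then combine the chain code (or a histogram of its direction codes) with the centroid coordinates; how the paper combines them is not stated in the abstract.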


Posted Content
TL;DR: This paper provides an overview of offline handwritten character recognition in South Indian scripts, namely Malayalam, Tamil, Kannada and Telugu.
Abstract: Handwritten character recognition has always been a frontier area of research in the field of pattern recognition and image processing, and there is a large demand for OCR on handwritten documents. Even though sufficient studies have been performed on foreign scripts like Chinese, Japanese and Arabic characters, only a few works can be traced on handwritten character recognition of Indian scripts, especially the South Indian scripts. This paper provides an overview of offline handwritten character recognition in South Indian scripts, namely Malayalam, Tamil, Kannada and Telugu.

26 citations


Journal ArticleDOI
TL;DR: The authors investigated akshara knowledge in a group of Grade III children learning to read an unexplored alphasyllabary, Malayalam (a Dravidian language spoken in Kerala, India) by extending Nag's study.
Abstract: The reading acquisition literature is mainly based on alphabetic orthographies and is largely silent on reading acquisition in alphasyllabic orthographies. In this context, as a preliminary attempt, the present study investigated akshara (orthographic character of alphasyllabary) knowledge in a group of Grade III children learning to read an unexplored alphasyllabary, Malayalam (a Dravidian language spoken in Kerala, India) by extending Nag’s study [Nag, S. (2007). Early reading in Kannada: The pace of acquisition of orthographic knowledge and phonemic awareness. Journal of Research in Reading, 30(1): 7–22]. Specifically, the study investigated participants’ knowledge of the following akshara types: (a) consonants with inherent vowels (e.g. ക /ka/); (b) consonants with vowel diacritics (e.g. ലി /li/); (c) consonant clusters (e.g. ത്ര /thra/); and (d) vowels in primary form (e.g. ഇ /i/). The results showed that children master consonants with inherent vowels and vowels in primary form by Grade III. However, they did ...

26 citations


Proceedings ArticleDOI
03 Nov 2011
TL;DR: An algorithm has been developed and successfully used for splitting compound words in Malayalam, and a 90% success rate has been established in an initial scrutiny of around 4000 compound words.
Abstract: Morphological analyzers are essential for any type of natural language processing work. As Malayalam, like other Dravidian languages, is an agglutinative language, it needs a compound word splitter as a preprocessor. An algorithm has been developed and successfully used for splitting compound words. A 90% success rate has been established in an initial scrutiny of around 4000 compound words. The splitter can be used for developing and implementing a full-fledged morphological analyzer.

19 citations


Proceedings ArticleDOI
08 Apr 2011
TL;DR: The salient features of Malayalam script are introduced and the approaches used for character recognition are listed and the overall structure of OCR system is presented.
Abstract: This paper proposes an algorithm for the recognition of handwritten characters in Malayalam, a South Indian language. It introduces the salient features of Malayalam script and lists the approaches used for character recognition. Malayalam scripts are rich in patterns because of their complex curved form, larger number of basic elements and the presence of conjuncts. The combinations of such patterns make the recognition of characters much more complex, and these patterns should be exploited to arrive at the solution. Here an image of handwritten Malayalam characters is given as the input and an editable document of Malayalam characters in a predefined format is produced as output. In this paper, initially the overall structure of the OCR system is presented. Then, the OCR process is presented in three modules: Pre-processing, Skeletonization and Recognition. In Pre-processing, we scan the input image and separate each character from it. In Skeletonization, we obtain a one-pixel-thick skeleton of the character. In Recognition, we classify the characters based on their features. The features of the characters are extracted based on the analysis of the position and count of the horizontal and vertical lines.

17 citations


Journal ArticleDOI
TL;DR: An algorithm which uses the inveterate characteristic features to recognize these characters with perceptive accuracy by utilizing the intensity variations in the way in which they may be written is proposed.
Abstract: People start learning to read and write during the early stages of education, and as the years pass they acquire good reading and writing skills. It is not difficult for them to read printed or handwritten characters of any kind. Most people have no problem reading light or heavy print, upside-down print, print in different fonts and styles, or handwriting, whether neatly or sloppily written. Computers, however, may have difficulty deciphering many kinds of printed characters in different fonts and styles, as well as handwritten characters. Various research efforts have sought a remedy to this problem. This paper is a humble attempt at the recognition of handwritten Malayalam (a South Indian language) characters. In our study we have classified the connected characters into 3 categories. Here we propose an algorithm which uses the inveterate characteristic features to recognize these characters with perceptive accuracy by utilizing the intensity variations in the way in which they may be written. This algorithm recognizes the antediluvian script of Malayalam characters which are connected in nature. The input is a 24-bit BMP image which can be inscribed using a light pen. The output is an editable version of the recognized Malayalam characters. The algorithm is tested on 3 sets of samples ranging 402 letters in a noiseless environment and produces an accuracy of 94%.

13 citations


01 Jan 2011
TL;DR: A chronology of the major scientific discoveries and discoveries in the history of embryology is presented.
Abstract: [Front matter only: List of Tables, List of Figures, Acronyms and Abbreviations, Chapter 1]

11 citations


30 Jul 2011
TL;DR: The development and evaluation of a syllable-based Indian language Text-To-Speech (TTS) synthesis system (built around Festival TTS) integrated with ORCA and NVDA, for Linux and Windows environments respectively.
Abstract: This paper describes the integration of commonly used screen readers, namely, NVDA [NVDA 2011] and ORCA [ORCA 2011], with Text-to-Speech (TTS) systems for Indian languages. A participatory design approach was followed in the development of the integrated system to ensure that the expectations of visually challenged people are met. Given that India is a multilingual country (22 official languages), a uniform framework for integrating text-to-speech synthesis systems with screen readers across six Indian languages is developed, which can be easily extended to other languages as well. Since Indian languages are syllable-centred, syllable-based concatenative speech synthesizers are built. This paper describes the development and evaluation of a syllable-based Indian language Text-To-Speech (TTS) synthesis system (built around Festival TTS) with ORCA and NVDA, for Linux and Windows environments respectively. TTS systems for six Indian languages, namely, Hindi, Tamil, Marathi, Bengali, Malayalam and Telugu, were built. Usability studies of the screen readers were performed. The system usability was evaluated by a group of visually challenged people based on a questionnaire provided to them. A Mean Opinion Score (MoS) of 62.27% was achieved.

11 citations



01 Oct 2011
TL;DR: This paper presents a meta-modelling system that automates the labor-intensive, time-consuming and expensive process of manually cataloging individual components of a distributed system.
Abstract: International Journal of Advanced Information Technology (IJAIT) Vol. 1, No.5, October 2011

Proceedings ArticleDOI
01 Dec 2011
TL;DR: The paper presents a strategy for developing a Malayalam Text Generator for an English-Malayalam Machine Aided Translation System using AnglaBharati technology, which uses the Interlingua approach for translation.
Abstract: The paper presents a strategy for developing a Malayalam Text Generator for an English-Malayalam Machine Aided Translation System using AnglaBharati technology. AnglaBharati uses the Interlingua approach for translation. The Interlingua is a language-independent, unambiguous representation of the input text. The text generator converts this intermediate representation into the target language (Malayalam) text. In this paper we examine the major tasks involved in the development of a Malayalam Text Generator for translation from English.

01 May 2011
TL;DR: A set of novel features exclusively for Malayalam language is described, which are fused to form the feature vector or knowledge vector that is used in all the phases of the writer identification scheme.
Abstract: This paper presents a writer identification scheme for Malayalam documents. As the success rate of a scheme is highly dependent on the features extracted from the documents, the process of feature selection and extraction is highly relevant. The paper describes a set of novel features exclusively for the Malayalam language. The features were studied in detail, which resulted in a comparative study of all the features. The features are fused to form the feature vector or knowledge vector. This knowledge vector is then used in all the phases of the writer identification scheme. The scheme has been tested on a test bed of 280 writers, of whom 50 have only one page, 215 have at least 2 pages and 15 have at least 4 pages. To perform a comparative evaluation of the scheme, the test is also conducted using the WD-LBP method. A recognition rate of around 95% was obtained for the proposed approach.

Proceedings ArticleDOI
08 Apr 2011
TL;DR: An algorithm is proposed which can accept scanned image of printed characters as input and produce editable Malayalam and English characters in a predefined format as output and an efficiency of 87.25% is obtained.
Abstract: India is a multilingual and multi-script country where a line of a bilingual document page may contain words both in a regional language and in English. Recognition of documents containing multiple scripts is a challenging task, which needs more effort from OCR designers to improve the accuracy rate. This paper presents a bilingual OCR system for printed Malayalam and English text. Here we propose an algorithm which can accept a scanned image of printed characters as input and produce editable Malayalam and English characters in a predefined format as output. The acquired image is segmented line-wise and character-wise using a pixel-by-pixel approach, scanning from the top-left of the image to the bottom-right. The character image obtained after segmentation is resized to a 16 × 16 bitmap which is used for comparison. The database contains characters in various fonts of both languages. This database is used for comparison with the resized character image. The comparison is done using a pixel-match algorithm. The matched character is displayed in Notepad. An efficiency of 87.25% is obtained using this approach.
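
The resize-and-compare step described in this abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's algorithm: nearest-neighbour resizing, the agreement-fraction score, and the names `resize_nearest`, `pixel_match_score`, `recognise` and the `templates` dictionary (label mapped to a 16 × 16 bitmap) are all hypothetical choices, since the abstract does not define its pixel-match algorithm in detail.

```python
def resize_nearest(image, size=16):
    """Resize a binary image (list of rows) to size x size.

    Uses nearest-neighbour sampling, which is an assumed choice.
    """
    rows, cols = len(image), len(image[0])
    return [[image[r * rows // size][c * cols // size] for c in range(size)]
            for r in range(size)]


def pixel_match_score(a, b):
    """Return the fraction of pixel positions at which two bitmaps agree."""
    total = len(a) * len(a[0])
    same = sum(pa == pb for row_a, row_b in zip(a, b)
               for pa, pb in zip(row_a, row_b))
    return same / total


def recognise(char_image, templates, size=16):
    """Return the label of the template that best matches the character image.

    templates maps a label to a size x size bitmap (a hypothetical stand-in
    for the multi-font character database mentioned in the abstract).
    """
    bitmap = resize_nearest(char_image, size)
    return max(templates,
               key=lambda label: pixel_match_score(bitmap, templates[label]))
```

Because every script's characters sit in the same template set, a lookup of this kind needs no separate script-identification step, at the cost of comparing against a larger database.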

Journal ArticleDOI
TL;DR: This paper presents a report on the development of a speaker independent, continuous transcription system for Malayalam that employs Hidden Markov Model for acoustic modeling and Mel Frequency Cepstral Coefficient for feature extraction.
Abstract: Malayalam is one of the 22 scheduled languages in India with more than 130 million speakers. This paper presents a report on the development of a speaker independent, continuous transcription system for Malayalam. The system employs Hidden Markov Model (HMM) for acoustic modeling and Mel Frequency Cepstral Coefficient (MFCC) for feature extraction. It is trained with 21 male and female speakers in the age group ranging from 20 to 40 years. The system obtained a word recognition accuracy of 87.4% and a sentence recognition accuracy of 84%, when tested with a set of continuous speech data.

Journal ArticleDOI
TL;DR: A corpus-driven Malayalam text-to-speech (TTS) system based on the concatenative synthesis approach that resembles natural human voice and provides utility to save the synthesized output.
Abstract: In a text-to-speech system, spoken utterances are automatically produced from text. In this paper, we present a corpus-driven Malayalam text-to-speech (TTS) system based on the concatenative synthesis approach. The most important qualities of synthesized speech are naturalness and intelligibility. In this system, words and syllables are used as the basic units for synthesis. Our corpus consists of speech waveforms that are collected for the most frequently used words in different domains. The speaker is selected through subjective and objective evaluation of natural and synthesized waveforms. The proposed Malayalam text-to-speech system is implemented in the Java Media Framework (JMF) and runs on both Windows and Linux platforms. The proposed system provides a utility to save the synthesized output. The output generated by the proposed Malayalam text-to-speech synthesis system resembles natural human voice. Our text-to-speech reader software converts Malayalam text to a speech WAV file that has high rates of intelligibility and comprehensibility.

Book ChapterDOI
05 Aug 2011
TL;DR: The novelty of the scheme lies in the fact that the graphemes were used in the training and identification phase of the system, and the scheme has been tested on a test bed of 280 writers.
Abstract: This paper proposes a writer identification scheme for Malayalam handwritten documents. The novelty of the scheme lies in the fact that graphemes were used in the training and identification phases of the system. Graphemes are small writing fragments extracted from the handwritten documents which contain meaningful patterns and possess the individuality of each writer. The scheme has been tested on a test bed of 280 writers, of whom 50 have only one page, 215 have at least 2 pages and 15 have at least 4 pages. A recognition rate of 89.28% was achieved.

Book ChapterDOI
01 Jan 2011
TL;DR: The role of Hindi-Bollywood cinema is not crucial in the southern regions of Tamil Nadu (Tamil), Andhra Pradesh (Telugu), Kerala (Malayalam) or Karnataka (Kannada) as mentioned in this paper.
Abstract: At first sight, all Indian cinemas seem to follow the same aesthetic principles: excessively long movies, song-and-dance scenes and star-studded casts. Most call this phenomenon ‘Bollywood’ and assume all Indian film industries fall under this common name. However, the term ‘Bollywood’ covers only the Hindi-speaking film companies from Mumbai (Bombay). India is a multi-lingual nation: while Hindi is the most significant language in the North, the Dravidian language family is of particular importance in the South. Thus, the role of the Hindi-Bollywood cinema is not crucial in the southern regions of Tamil Nadu (Tamil), Andhra Pradesh (Telugu), Kerala (Malayalam) or Karnataka (Kannada). Significantly, these regional cinemas of the South produce more than half of India’s total output of films. And even if these moving pictures share some aspects of style in common with Hindi blockbusters from the North, it should be noted that South Indian cinema differs from Bollywood.

Proceedings ArticleDOI
12 Dec 2011
TL;DR: A chunking method for Malayalam sentences based on a morpheme-based augmented transition network that works with good accuracy with the proposed set of chunk rules and has good potential for use as a full-fledged parser for the Malayalam language.
Abstract: Various methods have been proposed for chunking sentences in agglutinative languages. For Malayalam, a South Indian language, the chunking methods proposed are mainly statistical. This paper describes a chunking method for Malayalam sentences based on a morpheme-based augmented transition network. For the trial set of sentences, the system works with good accuracy with the proposed set of chunk rules. The chunking system has good potential for use as a full-fledged parser for the Malayalam language.

Journal ArticleDOI
TL;DR: The learning process of multiple interrogatives is explored, considering several sources of evidence children are getting in the input, and children acquiring English and Malayalam demonstrated perfect knowledge of the properties of multiple questions, while Russian-acquiring children exhibited some difficulties with the language-specific syntax of these expressions.
Abstract: This article presents the results of four studies exploring the acquisition of the language-specific syntactic and semantic properties of multiple interrogatives in English, Russian, and Malayalam, languages that behave differently with respect to the syntax and semantics of multiple interrogatives. A corpus analysis investigated the frequency of occurrence of multiple interrogatives in parental speech, demonstrating that children face quite limited evidence in the input. It was followed with studies where multiple interrogatives were elicited from children and adults in specific contexts in English, Russian, and Malayalam. Children acquiring English and Malayalam demonstrated perfect knowledge of the properties of multiple questions, while Russian-acquiring children exhibited some difficulties with the language-specific syntax of these expressions. The learning process of multiple interrogatives is explored, considering several sources of evidence children are getting in the input.

Book ChapterDOI
09 Mar 2011
TL;DR: This paper is focused on the classification of verbs based on the past forms and the morphophonemic changes in the verb roots and can be used in the similar NLP applications.
Abstract: In applications like Morphological Analyzer, Machine Aided Translation (MAT), Spell checker, etc., verb synthesis and/or generation are prime tasks. For the paradigm approach, verb classification is needed. There exist many verb classifications in Malayalam. Suranad Kunjan Pillai’s classification contains sixteen classes, Wickremasinghe and Menon proposed eight, Sekhar and Glazov have twelve, Asher and Prabodhchandran Nair have four and Valentine has two.[1] All descriptions focus on past tense forms, because the much simpler present and future tense forms are easily predictable. In regard to verbs an entirely new item of work had to be undertaken. The verbs in the language present a multiplicity of conjugational forms which may perplex anyone who is not thoroughly familiar with them.[3] This paper is focused on the classification of verbs based on the past forms and the morphophonemic changes in the verb roots. This classification is basically done for the rule-based MAT system and can be used in similar NLP applications.

Journal ArticleDOI
TL;DR: The system presented here is a Named Entity (NE) classifier created using multiclass Support Vector Machines based on linguistic grammar principles; experimental results show that the average precision, recall and F-measure values are 89.12%, 89.15% and 89.13% respectively.
Abstract: Named Entity Recognition (NER) seeks to locate and classify atomic elements in text into predefined categories such as names of persons, organizations, locations, quantities, percentages, etc. Named entities tell us the roles of each meaning-bearing word in a sentence, and hence identification of these entities helps us to extract the essence of the text, which is very important in Question Answering (QA), Information Extraction (IE) and Summarization. The system presented here is a Named Entity (NE) classifier created using multiclass Support Vector Machines based on linguistic grammar principles. Malayalam NER is a difficult task as the words of a named entity have no specific feature such as the capitalization feature in English. NER systems for other languages are not suitable for Malayalam since its morphology, syntax and lexical semantics are different from theirs. Also, there is no tagged corpus available for training. For testing this system, documents from well-known Malayalam newspapers and magazines containing passages from five different fields, namely sports, health, politics, science and agriculture, are selected. Experimental results show that the average precision, recall and F-measure values are 89.12%, 89.15% and 89.13% respectively.
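
For readers unfamiliar with the metric, the F-measure is commonly defined as the harmonic mean of precision and recall. The sketch below assumes that standard balanced definition (often written F1); the abstract does not say which variant was used, and its figures are averages, so they need not satisfy the formula exactly. The function name `f_measure` is illustrative.

```python
def f_measure(precision, recall):
    """Return the balanced F-measure (harmonic mean of precision and recall).

    Inputs may be fractions or percentages, as long as both use the same scale.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Applying the formula to the averages quoted in the abstract gives a value
# close to the reported F-measure of 89.13%.
reported_f = f_measure(89.12, 89.15)
```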

Proceedings Article
01 Jan 2011

Book ChapterDOI
22 Jul 2011
TL;DR: The algorithm proposed recognizes these characters mainly based on the strokes and lines contained in them, which undergoes different phases of processing to produce an editable document of Malayalam characters in a predefined format as output.
Abstract: This paper mainly focuses on the recognition of both simple and conjunct handwritten characters in Malayalam, a South Indian language. The proposed algorithm recognizes these characters mainly based on the strokes and lines contained in them. Here the input is an image of handwritten Malayalam characters, which undergoes different phases of processing to produce an editable document of Malayalam characters in a predefined format as output. In this paper, a detailed description of the methods for character identification is given. The whole OCR process is presented in three different modules: Pre-processing, Skeletonization and Recognition. In Pre-processing, the input image is scanned and subjected to line and character separation. In Skeletonization, the digital image is transformed into a set of original components. In Recognition, the characters are classified based on their features. The feature extraction of the characters is done by analyzing the position and count of the horizontal and vertical lines. A classification of the simple and conjunct characters is also devised based on the count and position of the horizontal and vertical lines which make up those characters.
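
One way to picture the "position and count of horizontal and vertical lines" feature is the sketch below. It is an assumption-laden illustration rather than the paper's method: it treats any run of at least `min_len` consecutive foreground pixels in a row (or column) of the skeleton as a horizontal (or vertical) line, and the names `find_runs`, `line_features` and the `min_len` threshold are hypothetical.

```python
def find_runs(line, min_len):
    """Return (start, length) for each foreground run of at least min_len pixels."""
    runs, start = [], None
    for i, value in enumerate(list(line) + [0]):  # trailing 0 closes a final run
        if value and start is None:
            start = i
        elif not value and start is not None:
            if i - start >= min_len:
                runs.append((start, i - start))
            start = None
    return runs


def line_features(skeleton, min_len=3):
    """Collect horizontal and vertical line segments from a binary skeleton.

    Each segment is (row or column index, start offset, length).
    """
    horizontal = [(r, s, n) for r, row in enumerate(skeleton)
                  for s, n in find_runs(row, min_len)]
    vertical = [(c, s, n) for c, col in enumerate(zip(*skeleton))
                for s, n in find_runs(col, min_len)]
    return {"horizontal": horizontal, "vertical": vertical}
```

The counts (`len` of each list) and the positions (the indices and offsets) could then feed a rule-based or learned classifier; the abstract does not specify which is used.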


01 Jan 2011
TL;DR: This paper presents a meta-analysis of the immune system’s response to natural disasters and shows clear patterns of decline in the immune systems of earthquake-triggered disasters.
Abstract: Procedia Technology 00 (2011) 000–000,2nd International Conference on Communication, Computing & Security

Book ChapterDOI
28 Jan 2011
TL;DR: The inherent advantage of the system is that the recognition of Malayalam, English words and numerals present in a bilingual document was achieved without performing script identification initially, which avoids the script identification process which is computationally expensive.
Abstract: In India, bilingual documentation is very common, especially in government forms and formats, technical documents, reports, postal documents, railway reservation forms, etc. Printed documents in a single Indian language often contain English words and numerals, since English is considered a link language in India. The proposed system is designed to recognize bilingual script having Malayalam and English interspersed at word level. This problem was considered as it is more realistic. Here, a combined database approach is employed: the scripts involved are treated alike, and hence a single OCR is sufficient for recognition of bilingual script. The inherent advantage of the system is that the recognition of Malayalam and English words and numerals present in a bilingual document was achieved without performing script identification initially. This method avoids the script identification process, which is computationally expensive. The proposed system achieves recognition rates of 97.5% and 98.5% for the two feature extraction approaches respectively.

01 Jan 2011
TL;DR: Results show that this sound has the phonetic characteristics of a clear post-alveolar central approximant, therefore suggesting that Malayalam has a third rhotic.
Abstract: As part of its liquid inventory, Malayalam has two rhotics, two laterals and a fifth liquid which has been called an ‘r-sound’ by some researchers and a lateral by others. This paper presents findings on the phonetic and phonological nature of the fifth liquid in Malayalam, which has never been the subject of experimental research before. Results show that this sound has the phonetic characteristics of a clear post-alveolar central approximant, therefore suggesting that Malayalam has a third rhotic. Interestingly, however, its phonological behaviour displays patterns that are typical of other retroflex rather than of alveolar sounds in the language. An extrinsic phonetic interpretation of phonology is suggested to account for the results.