The Indian Languages.

01 Jan 1969
About: The article was published on 1969-01-01 and is currently open access. It has received 9 citations to date. The article focuses on the topics Languages of Asia and Applied linguistics.
Citations
Journal Article
TL;DR: The paper proposes BURA (Burdwan University Research Archive), a Unicode-compliant information retrieval and representation (IRR) system for Indian universities.
Abstract: The paper describes the growth and development of open access repositories (OARs) in India. It proposes a Unicode-compliant information retrieval and representation (IRR) system, BURA (Burdwan University Research Archive), for Indian universities. The system has been developed using a number of open standards and open source software (OSS). Its Unicode-compliant interface allows administrators to perform various system-level operations and allows end users to browse and search resources in the Bengali language. The paper also describes the need to integrate an Indic-script-based, SKOS-enabled subject access system (here DDC, the Dewey Decimal Classification) into the proposed model in order to support users' subject searches. Finally, it offers a single-window search interface for harvesting metadata from multiple interoperable OARs.

7 citations

Proceedings Article
12 Jun 2023
TL;DR: The International Institute of Information Technology Hyderabad-Crowd Sourced Telugu Database (IIITH-CSTD) is a crowd-sourced Telugu speech corpus of approximately 2000 hours of transcribed audio, intended to mitigate the low-resource problem for Telugu.
Abstract: Due to the lack of a large annotated speech corpus, many low-resource Indian languages struggle to utilize recent advancements in deep neural network architectures for Automatic Speech Recognition (ASR) tasks. Collecting large-scale databases is an expensive and time-consuming task. Current approaches lack extensive traditional expert-based data acquisition guidelines as they are tedious and complex. In this work, we present the International Institute of Information Technology Hyderabad-Crowd Sourced Telugu Database (IIITH-CSTD), a Telugu corpus collected through crowd-sourcing. In particular, our main objective is to mitigate the low-resource problem for Telugu. We also present the sources, crowd-sourcing pipeline, and the protocols used to collect the corpus for a low-resource language, namely, Telugu. Approximately 2000 hours of transcribed audio are presented and released in this paper, covering three major regional dialects of the Telugu language in three different speaking styles (read, conversational, and spontaneous) on topics such as politics, sports, arts, and science. We also present the experimental results of the collected corpus on ASR tasks. We hope this work will motivate researchers to curate large-scale annotated speech data for other low-resource Indic languages.
References
Dissertation
01 Jan 2011
TL;DR: In this article, an ethnographic study was conducted to explore the status of Punjabi language in our society by looking at the language usage and linguistic practices of native speakers residing in selected urban and rural areas.
Abstract: Pakistan is a land of linguistic diversity having more than sixty languages. Punjabi, along with its numerous mutually intelligible dialects, is an ancient language. It is mainly spoken in the Pakistani province of Punjab and Indian Punjab in the subcontinent. It is a member of the Indo-Aryan branch of the Indo-European language family. The aim of this ethnographic study is to explore the status of Punjabi language in our society by looking at the language usage and linguistic practices of Punjabi native speakers residing in selected urban and rural areas. Ten families, five from urban area and five from rural area, participated in the study. The participants were selected on the basis of their educational level, marital status, monthly income, occupation, family background and the size of land owned by them. The theoretical framework which informs this research is the constructivist qualitative paradigm. The tools of data collection include semi-structured interviews and recordings of informal conversation of the research participants. The analysis of the collected data reveals that in the urban areas, Punjabi language is not the dominant medium of communication among the research participants.
The participants do not consider it important and worthwhile to maintain Punjabi language, as they do not see it as economically advantageous and profitable to them. It is just a part of their cultural heritage, but they do not use it for communicative purposes. In the rural areas, however, the research participants expressed a strong sense of association and affiliation with Punjabi language; Punjabi language is their dominant medium of communication with others; they consider Punjabi an inevitable part of their cultural heritage and identity; they support the idea of learning English and Urdu languages but not at the cost of Punjabi language. These findings suggest that language desertion is an urban phenomenon, as Punjabi language is not maintained by the urban research participants due to certain wider socio-political factors which have disrupted and distorted the status of Punjabi language while consolidating the role of English and Urdu in the society.

26 citations

Dissertation
01 Nov 2008

17 citations

Proceedings Article
01 Jun 2020
TL;DR: A Sanskrit-specific OCR system for printed classical Indic documents written in Sanskrit is developed, and an attention-based LSTM model for reading Sanskrit characters in line images is presented, setting the stage for application of OCRs on large corpora of classic Sanskrit texts containing arbitrarily long and highly conjoined words.
Abstract: OCR for printed classical Indic documents written in Sanskrit is a challenging research problem. It involves complexities such as image degradation, lack of datasets and long-length words. Due to these challenges, the word accuracy of available OCR systems, both academic and industrial, is not very high for such documents. To address these shortcomings, we develop a Sanskrit-specific OCR system. We present an attention-based LSTM model for reading Sanskrit characters in line images. We introduce a dataset of Sanskrit document images annotated at line level. To augment real data and enable high performance for our OCR, we also generate synthetic data via curated font selection and rendering designed to incorporate crucial glyph substitution rules. Consequently, our OCR achieves a word error rate of 15.97% and a character error rate of 3.71% on challenging Indic document texts and outperforms strong baselines. Overall, our contributions set the stage for application of OCRs on large corpora of classic Sanskrit texts containing arbitrarily long and highly conjoined words.

15 citations

01 Jan 2010
TL;DR: In this paper, sociolinguistic research conducted among speakers of five Austro-Asiatic language varieties in northwest Bangladesh: Koda, Kol, Mahali, Mundari, and Santali is reported.
Abstract: This paper reports on sociolinguistic research conducted among speakers of five Austro-Asiatic language varieties in northwest Bangladesh: Koda, Kol, Mahali, Mundari, and Santali. These are collectively referred to as the Santali Cluster because Santali is the most populous and developed language among these five varieties. Linguistic variation within and across these varieties, long-term viability of each variety, and attitudes of speakers towards their own and other language varieties were investigated. The degree of intelligibility in Santali by speakers of the other varieties and the bilingual ability in Bangla of speakers from each variety were also studied. This research was carried out from November 2004 through January 2005 through the use of word lists, questionnaires, a Bangla Sentence Repetition Test, and stories recorded in Santali, Mundari, and Mahali.

9 citations