Home
/
Authors
/
N C Gokul

Author

N C Gokul

Bio: N C Gokul is an academic researcher. The author has contributed to research in topics: Language model & Sign (mathematics). The author has an hindex of 2, co-authored 3 publications receiving 89 citations.

Papers

PDF

Open Access

More filters

Proceedings Article•DOI•

IndicNLPSuite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained Multilingual Language Models for Indian Languages

[...]

Divyanshu Kakwani, Anoop Kunchukuttan¹, Satish Golla, N C Gokul, Avik Bhattacharyya², Mitesh M. Khapra², Pratyush Kumar² - Show less +3 more•Institutions (2)

Microsoft¹, Indian Institute of Technology Madras²

01 Nov 2020

TL;DR: This paper introduces NLP resources for 11 major Indian languages from two major language families, and creates datasets for the following tasks: Article Genre Classification, Headline Prediction, Wikipedia Section-Title Prediction, Cloze-style Multiple choice QA, Winograd NLI and COPA.

...read moreread less

Abstract: In this paper, we introduce NLP resources for 11 major Indian languages from two major language families. These resources include: (a) large-scale sentence-level monolingual corpora, (b) pre-trained word embeddings, (c) pre-trained language models, and (d) multiple NLU evaluation datasets (IndicGLUE benchmark). The monolingual corpora contains a total of 8.8 billion tokens across all 11 languages and Indian English, primarily sourced from news crawls. The word embeddings are based on FastText, hence suitable for handling morphological complexity of Indian languages. The pre-trained language models are based on the compact ALBERT model. Lastly, we compile the (IndicGLUE benchmark for Indian language NLU. To this end, we create datasets for the following tasks: Article Genre Classification, Headline Prediction, Wikipedia Section-Title Prediction, Cloze-style Multiple choice QA, Winograd NLI and COPA. We also include publicly available datasets for some Indic languages for tasks like Named Entity Recognition, Cross-lingual Sentence Retrieval, Paraphrase detection, etc. Our embeddings are competitive or better than existing pre-trained embeddings on multiple tasks. We hope that the availability of the dataset will accelerate Indic NLP research which has the potential to impact more than a billion people. It can also help the community in evaluating advances in NLP over a more diverse pool of languages. The data and models are available at https://indicnlp.ai4bharat.org.

...read moreread less

257 citations

Posted Content•

AI4Bharat-IndicNLP Corpus: Monolingual Corpora and Word Embeddings for Indic Languages.

[...]

Anoop Kunchukuttan¹, Divyanshu Kakwani, Satish Golla, N C Gokul, Avik Bhattacharyya², Mitesh M. Khapra², Pratyush Kumar² - Show less +3 more•Institutions (2)

Microsoft¹, Indian Institute of Technology Madras²

30 Apr 2020-arXiv: Computation and Language

TL;DR: The IndicNLP corpus, a large-scale, general-domain corpus containing 2.7 billion words for 10 Indian languages from two language families, is presented and it is shown that the IndiNLP embeddings significantly outperform publicly available pre-trained embedding on multiple evaluation tasks.

...read moreread less

Abstract: We present the IndicNLP corpus, a large-scale, general-domain corpus containing 2.7 billion words for 10 Indian languages from two language families. We share pre-trained word embeddings trained on these corpora. We create news article category classification datasets for 9 languages to evaluate the embeddings. We show that the IndicNLP embeddings significantly outperform publicly available pre-trained embedding on multiple evaluation tasks. We hope that the availability of the corpus will accelerate Indic NLP research. The resources are available at this https URL.

...read moreread less

38 citations

Posted Content•

OpenHands: Making Sign Language Recognition Accessible with Pose-based Pretrained Models across Languages

[...]

Prem Selvaraj, N C Gokul, Pratyush Kumar¹, Mitesh M. Khapra¹•Institutions (1)

Indian Institute of Technology Madras¹

12 Oct 2021-arXiv: Computation and Language

TL;DR: OpenHands as mentioned in this paper uses pose extracted through pretrained models as the standard modality of data to reduce training time and enable efficient inference, and provides standardized pose datasets for 6 sign languages - American, Argentinian, Chinese, Greek, Indian, and Turkish.

...read moreread less

Abstract: AI technologies for Natural Languages have made tremendous progress recently. However, commensurate progress has not been made on Sign Languages, in particular, in recognizing signs as individual words or as complete sentences. We introduce OpenHands, a library where we take four key ideas from the NLP community for low-resource languages and apply them to sign languages for word-level recognition. First, we propose using pose extracted through pretrained models as the standard modality of data to reduce training time and enable efficient inference, and we release standardized pose datasets for 6 different sign languages - American, Argentinian, Chinese, Greek, Indian, and Turkish. Second, we train and release checkpoints of 4 pose-based isolated sign language recognition models across all 6 languages, providing baselines and ready checkpoints for deployment. Third, to address the lack of labelled data, we propose self-supervised pretraining on unlabelled data. We curate and release the largest pose-based pretraining dataset on Indian Sign Language (Indian-SL). Fourth, we compare different pretraining strategies and for the first time establish that pretraining is effective for sign language recognition by demonstrating (a) improved fine-tuning performance especially in low-resource settings, and (b) high crosslingual transfer from Indian-SL to few other sign languages. We open-source all models and datasets in OpenHands with a hope that it makes research in sign languages more accessible, available here at https://github.com/AI4Bharat/OpenHands .

...read moreread less

1 citations

Cited by

PDF

Open Access

More filters

Journal Article•DOI•

BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

[...]

Teven Le Scao, Angela Fan, Christopher Akiki, Elizabeth-Jane Pavlick +383 more

09 Nov 2022-arXiv.org

TL;DR: BLOOM as discussed by the authors is a decoder-only Transformer language model that was trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages (59 in total).

...read moreread less

Abstract: Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access language model designed and built thanks to a collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer language model that was trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages (59 in total). We find that BLOOM achieves competitive performance on a wide variety of benchmarks, with stronger results after undergoing multitask prompted finetuning. To facilitate future research and applications using LLMs, we publicly release our models and code under the Responsible AI License.

...read moreread less

407 citations

Book Chapter•DOI•

Exploring the Limits

[...]

Ian S. Dunn

14 Dec 2009

339 citations

Book Chapter•DOI•

Overview of CONSTRAINT 2021 Shared Tasks: Detecting English COVID-19 Fake News and Hindi Hostile Posts

[...]

Parth Patwa¹, Mohit Bhardwaj², Vineeth Guptha³, Gitanjali Kumari⁴, Shivam Sharma³, Shivam Sharma², Srinivas Pykl¹, Amitava Das³, Asif Ekbal⁴, Md. Shad Akhtar², Tanmoy Chakraborty² - Show less +7 more•Institutions (4)

Indian Institutes of Information Technology¹, Indraprastha Institute of Information Technology², Wipro³, Indian Institute of Technology Patna⁴

08 Feb 2021

TL;DR: The findings of the shared tasks conducted at the CONSTRAINT Workshop at AAAI 2021 are presented and the most successful models were BERT or its variations.

...read moreread less

Abstract: Fake news, hostility, defamation are some of the biggest problems faced in social media. We present the findings of the shared tasks (https://constraint-shared-task-2021.github.io/) conducted at the CONSTRAINT Workshop at AAAI 2021. The shared tasks are ‘COVID19 Fake News Detection in English’ and ‘Hostile Post Detection in Hindi’. The tasks attracted 166 and 44 team submissions respectively. The most successful models were BERT or its variations.

...read moreread less

90 citations

Proceedings Article•DOI•

The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset

[...]

Hugo Laurenccon, Lucile Saulnier, Thomas Wang, Christopher Akiki, A. Villanova del Moral, Teven Le Scao, Leandro von Werra, Chenghao Mou, Eduardo G. Ponferrada, Huu Nguyen, Jorg Frohberg, Mario vSavsko, Quentin Lhoest, Angelina McMillan-Major, Gérard Dupont, Stella Biderman, Anna Rogers, Loubna Ben Allal, Francesco De Toni, Giada Pistilli, Olivier Nguyen, Somaieh Nikpoor, Maraim Masoud, Pierre Colombo, Javier Galiana de la Rosa, Paulo Villegas, Tristan Thrush, Shayne Longpre, Sebastian Nagel, Leon Weber, Manuel Sevilla Muñoz, Jian Zhou, Daniel van Strien, Zaid Alyafeai, Khalid Almubarak, Minh Chien Vu, Itziar Gonzalez-Dios, Aitor Soroa, Kyle Lo, Manan Dey, Pedro Javier Ortiz Suárez, Aaron Gokaslan, Shamik Bose, David Ifeoluwa Adelani, Long Phan, H. Tran, I. Yu, Suhas Pai, Jenny Chim, Violette Lepercq, Suzana Ilic, Margaret Mitchell, Sasha Luccioni, Yacine Jernite - Show less +50 more

07 Mar 2023

TL;DR: The Responsible Open-science Open-collaboration Text Sources (ROOTS) corpus as mentioned in this paper is a 1.6TB dataset spanning 59 languages that was used to train the 176-billion-parameter BigScience Large Open science Open-access Multilingual (BLOOM) language model.

...read moreread less

Abstract: As language models grow ever larger, the need for large-scale high-quality text datasets has never been more pressing, especially in multilingual settings. The BigScience workshop, a 1-year international and multidisciplinary initiative, was formed with the goal of researching and training large language models as a values-driven undertaking, putting issues of ethics, harm, and governance in the foreground. This paper documents the data creation and curation efforts undertaken by BigScience to assemble the Responsible Open-science Open-collaboration Text Sources (ROOTS) corpus, a 1.6TB dataset spanning 59 languages that was used to train the 176-billion-parameter BigScience Large Open-science Open-access Multilingual (BLOOM) language model. We further release a large initial subset of the corpus and analyses thereof, and hope to empower large-scale monolingual and multilingual modeling projects with both the data and the processing tools, as well as stimulate research around this large multilingual corpus.

...read moreread less

43 citations

Posted Content•

Samanantar: The Largest Publicly Available Parallel Corpora Collection for 11 Indic Languages.

[...]

G. Ramesh, Sumanth Doddapaneni, Aravinth Bheemaraj, Mayank Jobanputra, A. Raghavan, Ajitesh Sharma, Sujit Sahoo¹, Harshita Diddee, Mahalakshmi J, Divyanshu Kakwani, Navneet Kumar, Aswin Pradeep, Kumar Deepak, Vivek Raghavan, Anoop Kunchukuttan, Pratyush Kumar, Mitesh Shantadevi Khapra - Show less +13 more•Institutions (1)

Indian Institute of Technology Madras¹

12 Apr 2021-arXiv: Computation and Language

TL;DR: Samanantar as discussed by the authors is the largest publicly available parallel corpora collection for Indic languages, which contains 46.9 million sentence pairs between English and 11 languages (from two language families).

...read moreread less

Abstract: We present Samanantar, the largest publicly available parallel corpora collection for Indic languages. The collection contains a total of 46.9 million sentence pairs between English and 11 Indic languages (from two language families). In particular, we compile 12.4 million sentence pairs from existing, publicly-available parallel corpora, and we additionally mine 34.6 million sentence pairs from the web, resulting in a 2.8X increase in publicly available sentence pairs. We mine the parallel sentences from the web by combining many corpora, tools, and methods. In particular, we use (a) web-crawled monolingual corpora, (b) document OCR for extracting sentences from scanned documents (c) multilingual representation models for aligning sentences, and (d) approximate nearest neighbor search for searching in a large collection of sentences. Human evaluation of samples from the newly mined corpora validate the high quality of the parallel sentences across 11 language pairs. Further, we extracted 82.7 million sentence pairs between all 55 Indic language pairs from the English-centric parallel corpus using English as the pivot language. We trained multilingual NMT models spanning all these languages on Samanantar and compared with other baselines and previously reported results on publicly available benchmarks. Our models outperform existing models on these benchmarks, establishing the utility of Samanantar. Our data (this https URL) and models (this https URL) will be available publicly and we hope they will help advance research in Indic NMT and multilingual NLP for Indic languages.

...read moreread less

32 citations

1
2
3
4
…
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57

Collapse