mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer

doi:10.18653/V1/2021.NAACL-MAIN.41

Citations

PDF

Open Access

More filters

Journal Article•DOI•

Image-and-Language Understanding from Pixels Only

[...]

Michael Tschannen, Basil Mustafa, Neil Houlsby

arXiv.org

TL;DR: In this article , a pure pixel-based model called CLIP-Pixels Only (CLIPPO) is proposed to perform image, text, and multimodal tasks.

...read moreread less

Abstract: Multimodal models are becoming increasingly effective, in part due to uniﬁed components, such as the Transformer architecture. However, multimodal models still often consist of many task- and modality-speciﬁc pieces and training procedures. For example, CLIP (Radford et al., 2021) trains independent text and image towers via a contrastive loss. We explore an additional uniﬁcation: the use of a pure pixel-based model to perform image, text, and multimodal tasks. Our model is trained with contrastive loss alone, so we call it CLIP-Pixels Only (CLIPPO). CLIPPO uses a single encoder that processes both regular images and text rendered as images. CLIPPO performs image-based tasks such as retrieval and zero-shot image classiﬁcation almost as well as CLIP, with half the number of parameters and no text-speciﬁc tower or embedding. When trained jointly via image-text contrastive learning and next-sentence contrastive learning, CLIPPO can perform well on natural language understanding tasks, without any word-level loss (language modelling or masked language modelling), outperforming pixel-based prior work. Surprisingly, CLIPPO can obtain good accuracy in visual question answering, simply by rendering the question and image together. Finally, we exploit the fact that CLIPPO does not require a tokenizer to show that it can achieve strong performance on multilingual multimodal retrieval without modiﬁcations.

...read moreread less

9 citations

Journal Article•DOI•

Named entity recognition using neural language model and CRF for Hindi language

[...]

Richa Sharma, Sudha Morwal, Basant Agarwal

01 Jan 2022-Computer Speech & Language

TL;DR: In this paper , a state-of-the-art Hindi NER system based on MuRIL language model and CRF is proposed. But, the model is not suitable for the Hindi named entity recognition task.

...read moreread less

9 citations

Proceedings Article•DOI•

AfroLID: A Neural Language Identification Tool for African Languages

[...]

Ifeoluwanimi Adebara, AbdelRahim A. Elmadany, Muhammad Abdul-Mageed, Alcides Alcoba Inciarte

21 Oct 2022

TL;DR: This work introduces AfroLID, a neural LID toolkit for 517 African languages and varieties, and compares it to existing LID tools that each cover a small number of African languages, finding it to outperform them on most languages.

...read moreread less

Abstract: Language identification (LID) is a crucial precursor for NLP, especially for mining web data. Problematically, most of the world’s 7000+ languages today are not covered by LID technologies. We address this pressing issue for Africa by introducing AfroLID, a neural LID toolkit for 517 African languages and varieties. AfroLID exploits a multi-domain web dataset manually curated from across 14 language families utilizing five orthographic systems. When evaluated on our blind Test set, AfroLID achieves 95.89 F_1-score. We also compare AfroLID to five existing LID tools that each cover a small number of African languages, finding it to outperform them on most languages. We further show the utility of AfroLID in the wild by testing it on the acutely under-served Twitter domain. Finally, we offer a number of controlled case studies and perform a linguistically-motivated error analysis that allow us to both showcase AfroLID’s powerful capabilities and limitations

...read moreread less

9 citations

Proceedings Article•DOI•

LINGUIST: Language Model Instruction Tuning to Generate Annotated Utterances for Intent Classification and Slot Tagging

[...]

Andrew Rosenbaum, Saleh Soltan, Wael Hamza, Yannick Versley, Markus Boese - Show less +1 more

20 Sep 2022

TL;DR: This work presents LINGUIST, a method for generating annotated data for Intent Classification and Slot Tagging (IC+ST), via fine-tuning AlexaTM 5B, a 5-billion-parameter multilingual sequence-to-sequence (seq2seq) model, on a flexible instruction prompt, and is the first to demonstrate instruction fine- tuning of a large-scale seq2seq model to control the outputs of multilingual intent- and slot-labeled data generation

...read moreread less

Abstract: We present LINGUIST, a method for generating annotated data for Intent Classification and Slot Tagging (IC+ST), via fine-tuning AlexaTM 5B, a 5-billion-parameter multilingual sequence-to-sequence (seq2seq) model, on a flexible instruction prompt. In a 10-shot novel intent setting for the SNIPS dataset, LINGUIST surpasses state-of-the-art approaches (Back-Translation and Example Extrapolation) by a wide margin, showing absolute improvement for the target intents of +1.9 points on IC Recall and +2.5 points on ST F1 Score. In the zero-shot cross-lingual setting of the mATIS++ dataset, LINGUIST out-performs a strong baseline of Machine Translation with Slot Alignment by +4.14 points absolute on ST F1 Score across 6 languages, while matching performance on IC. Finally, we verify our results on an internal large-scale multilingual dataset for conversational agent IC+ST and show significant improvements over a baseline which uses Back-Translation, Paraphrasing and Slot Catalog Resampling. To our knowledge, we are the first to demonstrate instruction fine-tuning of a large-scale seq2seq model to control the outputs of multilingual intent- and slot-labeled data generation.

...read moreread less

9 citations

Proceedings Article•DOI•

Maestro-U: Leveraging Joint Speech-Text Representation Learning for Zero Supervised Speech ASR

[...]

Zhehuai Chen, Ankur Bapna, Andrew Rosenberg, Yu Zhang, Bhuvana Ramabhadran, Pedro J. Moreno, Nanxin Chen - Show less +3 more

18 Oct 2022

TL;DR: This work demonstrates that a modality-matched joint speech and text model introduced in [1] can be leveraged to train a massively multilingual ASR model without any supervised (manually transcribed) speech for some languages and shows that Maestro-U can promote knowledge transfer from languages with supervised speech even when there is limited to no graphemic overlap.

...read moreread less

Abstract: Training state-of-the-art Automated Speech Recognition (ASR) models typically requires a substantial amount of transcribed speech. In this work, we demonstrate that a modality-matched joint speech and text model introduced in [1] can be leveraged to train a massively multilingual ASR model without any supervised (manually transcribed) speech for some languages. This paper explores the use of jointly learnt speech and text representations in a massively multilingual, zero supervised speech, real-world setting to expand the set of languages covered by ASR with only unlabeled speech and text in the target languages. Using the FLEURS dataset, we define the task to cover 102 languages, where transcribed speech is available in 52 of these languages and can be used to improve end-to-end ASR quality on the remaining 50. First, we show that by combining speech representations with byte-level text representations and use of language embeddings, we can dramatically reduce the Character Error Rate (CER) on languages with no supervised speech from 64.8% to 30.8%, a relative reduction of 53%. Second, using a subset of South Asian languages we show that Maestro-U can promote knowledge transfer from languages with supervised speech even when there is limited to no graphemic overlap. Overall, Maestro-U closes the gap to oracle performance by 68.5% relative and reduces the CER of 19 languages below 15%.

...read moreread less

8 citations

Collapse

mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer

Citations

References

"mT5: A Massively Multilingual Pre-t..." refers methods in this paper

"mT5: A Massively Multilingual Pre-t..." refers background or methods in this paper

Related Papers (5)

Trending Questions (3)