How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models
Phillip Rust, Jonas Pfeiffer, Ivan Vulić, Sebastian Ruder, Iryna Gurevych
pp. 3118–3135
TLDR
The authors provide a systematic and comprehensive empirical comparison of pretrained multilingual language models versus their monolingual counterparts with regard to their monolingual task performance, and find that while pretraining data size is an important factor in downstream performance, a dedicated monolingual tokenizer plays an equally important role.
Abstract
In this work, we provide a systematic and comprehensive empirical comparison of pretrained multilingual language models versus their monolingual counterparts with regard to their monolingual task performance. We study a set of nine typologically diverse languages with readily available pretrained monolingual models on a set of five diverse monolingual downstream tasks. We first aim to establish, via fair and controlled comparisons, if a gap between the multilingual and the corresponding monolingual representation of that language exists, and subsequently investigate the reason for any performance difference. To disentangle conflating factors, we train new monolingual models on the same data, with monolingually and multilingually trained tokenizers. We find that while the pretraining data size is an important factor, a designated monolingual tokenizer plays an equally important role in the downstream performance. Our results show that languages that are adequately represented in the multilingual model’s vocabulary exhibit negligible performance decreases over their monolingual counterparts. We further find that replacing the original multilingual tokenizer with the specialized monolingual tokenizer improves the downstream performance of the multilingual model for almost every task and language.
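The abstract's central variable, tokenizer quality, is often measured by "fertility": the average number of subword pieces a tokenizer produces per word, where lower is better. The toy sketch below (not the authors' code; the vocabularies and words are hypothetical, chosen only for illustration) shows how a small shared multilingual-style vocabulary fragments words into more pieces than a larger dedicated monolingual-style vocabulary, using WordPiece-like greedy longest-match segmentation.

```python
def greedy_segment(word, vocab):
    """Split a word into greedy longest-match pieces drawn from vocab.
    Characters not covered by any vocabulary entry fall back to
    single-character pieces."""
    pieces, i = [], 0
    while i < len(word):
        # Try the longest prefix of the remaining string that is in vocab.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])  # out-of-vocabulary character
            i += 1
    return pieces

def fertility(words, vocab):
    """Average number of subword pieces per word (lower is better)."""
    return sum(len(greedy_segment(w, vocab)) for w in words) / len(words)

words = ["tokenizer", "performance", "multilingual"]

# Hypothetical vocabularies: a small shared one, and a larger dedicated
# one that also contains the full words.
shared_vocab = {"token", "izer", "per", "for", "man", "ce",
                "multi", "ling", "ual"}
mono_vocab = shared_vocab | set(words)

print(fertility(words, shared_vocab))  # 3.0: words split into many pieces
print(fertility(words, mono_vocab))    # 1.0: each word is a single token
```

With the dedicated vocabulary every sample word stays whole (fertility 1.0), while the shared vocabulary splits them into three pieces on average, mirroring the paper's finding that languages poorly represented in a shared multilingual vocabulary suffer downstream.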
Citations
Proceedings ArticleDOI
When does pretraining help?: assessing self-supervised learning for law and the CaseHOLD dataset of 53,000+ legal holdings
TL;DR: In this article, the authors present a new dataset called Case Holdings On Legal Decisions (CaseHOLD), which consists of over 53,000 multiple-choice questions that ask the reader to identify the relevant holding of a cited case.
Proceedings Article
UNKs Everywhere: Adapting Multilingual Language Models to New Scripts.
TL;DR: This paper proposes a series of data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts. However, their methods rely on matrix factorization, which is not suitable for low-resource languages.
Proceedings Article
MAD-G: Multilingual Adapter Generation for Efficient Cross-Lingual Transfer.
Alan Ansell, Edoardo Maria Ponti, Jonas Pfeiffer, Sebastian Ruder, Goran Glavaš, Ivan Vulić, Anna Korhonen
TL;DR: The authors propose MAD-G (Multilingual ADapter Generation), which generates language adapters from language representations based on typological features, noting that training a dedicated adapter per language is not viable for the vast majority of languages due to limitations in corpus size or compute budget.
Posted Content
What to Pre-Train on? Efficient Intermediate Task Selection
TL;DR: This article showed that efficient embedding-based methods that rely solely on the respective datasets outperform computationally expensive few-shot fine-tuning approaches, demonstrating that they can efficiently identify the best datasets for intermediate training.
Posted Content
KLUE: Korean Language Understanding Evaluation.
Sungjoon Park, Jihyung Moon, Sungdong Kim, Won Ik Cho, Jiyoon Han, Jang-Won Park, Chisung Song, Junseong Kim, Yongsook Song, Tae-Hwan Oh, Joohong Lee, Juhyun Oh, Sungwon Lyu, Younghoon Jeong, Inkwon Lee, Sangwoo Seo, Dongjun Lee, Hyunwoo Kim, Myeonghwa Lee, Seongbo Jang, Seungwon Do, Sunkyoung Kim, KyungTae Lim, Jongwon Lee, Kyumin Park, Jamin Shin, Seonghyun Kim, Lucy Park, Alice Oh, Jung-Woo Ha, Kyunghyun Cho
TL;DR: The Korean Language Understanding Evaluation (KLUE) benchmark presented in this paper is a collection of 8 Korean NLP tasks: Topic Classification, Semantic Textual Similarity, Natural Language Inference, Named Entity Recognition, Relation Extraction, Dependency Parsing, Machine Reading Comprehension, and Dialogue State Tracking.
References
Proceedings ArticleDOI
Improving Bi-LSTM Performance for Indonesian Sentiment Analysis Using Paragraph Vector
TL;DR: In this paper, a paragraph vector is concatenated to each word vector of the document to provide context for each sequence during processing, which can help to disambiguate ambiguous Indonesian words.