How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models

doi:10.18653/V1/2021.ACL-LONG.243

Citations

Proceedings Article

When does pretraining help?: assessing self-supervised learning for law and the CaseHOLD dataset of 53,000+ legal holdings

Lucia Zheng¹, Neel Guha¹, Brandon R. Anderson¹, Peter Henderson¹, Daniel E. Ho¹

Stanford University¹

21 Jun 2021

TL;DR: The authors present Case Holdings On Legal Decisions (CaseHOLD), a new dataset of more than 53,000 multiple-choice questions that require identifying the relevant holding of a cited case.

Abstract: While self-supervised learning has made rapid advances in natural language processing, it remains unclear when researchers should engage in resource-intensive domain-specific pretraining (domain pretraining). The law, puzzlingly, has yielded few documented instances of substantial gains to domain pretraining in spite of the fact that legal language is widely seen to be unique. We hypothesize that these existing results stem from the fact that existing legal NLP tasks are too easy and fail to meet conditions for when domain pretraining can help. To address this, we first present CaseHOLD (Case Holdings On Legal Decisions), a new dataset comprised of over 53,000+ multiple choice questions to identify the relevant holding of a cited case. This dataset presents a fundamental task to lawyers and is both legally meaningful and difficult from an NLP perspective (F1 of 0.4 with a BiLSTM baseline). Second, we assess performance gains on CaseHOLD and existing legal NLP datasets. While a Transformer architecture (BERT) pretrained on a general corpus (Google Books and Wikipedia) improves performance, domain pretraining (on a corpus of ≈3.5M decisions across all courts in the U.S. that is larger than BERT's) with a custom legal vocabulary exhibits the most substantial performance gains with CaseHOLD (gain of 7.2% on F1, representing a 12% improvement on BERT) and consistent performance gains across two other legal tasks. Third, we show that domain pretraining may be warranted when the task exhibits sufficient similarity to the pretraining corpus: the level of performance increase in three legal tasks was directly tied to the domain specificity of the task. Our findings inform when researchers should engage in resource-intensive pretraining and show that Transformer-based architectures, too, learn embeddings suggestive of distinct legal language.

83 citations

Proceedings Article

UNKs Everywhere: Adapting Multilingual Language Models to New Scripts.

Jonas Pfeiffer¹, Ivan Vulić², Iryna Gurevych³, Sebastian Ruder⁴

Technische Universität Darmstadt¹, University of Cambridge², University of Paderborn³, Google⁴

26 Aug 2021

TL;DR: This paper proposes a series of data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to low-resource languages and unseen scripts. The methods rely on matrix factorization to exploit the latent multilingual knowledge already present in the pretrained model's embedding matrix.

Abstract: Massively multilingual language models such as multilingual BERT offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks. However, due to limited capacity and large differences in pretraining data sizes, there is a profound performance gap between resource-rich and resource-poor target languages. The ultimate challenge is dealing with under-resourced languages not covered at all by the models and written in scripts unseen during pretraining. In this work, we propose a series of novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts. Relying on matrix factorization, our methods capitalize on the existing latent knowledge about multiple languages already available in the pretrained model’s embedding matrix. Furthermore, we show that learning of the new dedicated embedding matrix in the target language can be improved by leveraging a small number of vocabulary items (i.e., the so-called lexically overlapping tokens) shared between mBERT’s and target language vocabulary. Our adaptation techniques offer substantial performance gains for languages with unseen scripts. We also demonstrate that they can yield improvements for low-resource languages written in scripts covered by the pretrained model.

25 citations

Proceedings Article

MAD-G: Multilingual Adapter Generation for Efficient Cross-Lingual Transfer.

Alan Ansell, Edoardo Maria Ponti¹, Jonas Pfeiffer², Sebastian Ruder³, Goran Glavaš⁴, Ivan Vulić¹, Anna Korhonen⁵

University of Cambridge¹, Technische Universität Darmstadt², Google³, University of Mannheim⁴, Technion – Israel Institute of Technology⁵

01 Nov 2021

TL;DR: The authors propose MAD-G (Multilingual ADapter Generation), which generates language adapters from language representations based on typological features. It targets the many languages for which training dedicated language-specific adapters is not viable because of limited corpus size or compute budgets.

Abstract: Adapter modules have emerged as a general parameter-efficient means to specialize a pretrained encoder to new domains. Massively multilingual transformers (MMTs) have particularly benefited from additional training of language-specific adapters. However, this approach is not viable for the vast majority of languages, due to limitations in their corpus size or compute budgets. In this work, we propose MAD-G (Multilingual ADapter Generation), which contextually generates language adapters from language representations based on typological features. In contrast to prior work, our time- and space-efficient MAD-G approach enables (1) sharing of linguistic knowledge across languages and (2) zero-shot inference by generating language adapters for unseen languages. We thoroughly evaluate MAD-G in zero-shot cross-lingual transfer on part-of-speech tagging, dependency parsing, and named entity recognition. While offering (1) improved fine-tuning efficiency (by a factor of around 50 in our experiments), (2) a smaller parameter budget, and (3) increased language coverage, MAD-G remains competitive with more expensive methods for language-specific adapter training across the board. Moreover, it offers substantial benefits for low-resource languages, particularly on the NER task in low-resource African languages. Finally, we demonstrate that MAD-G’s transfer performance can be further improved via: (i) multi-source training, i.e., by generating and combining adapters of multiple languages with available task-specific training data; and (ii) by further fine-tuning generated MAD-G adapters for languages with monolingual data.

10 citations

Posted Content

What to Pre-Train on? Efficient Intermediate Task Selection

Clifton Poth, Jonas Pfeiffer¹, Andreas Rücklé¹, Iryna Gurevych²

Technische Universität Darmstadt¹, University of Paderborn²

16 Apr 2021, arXiv: Computation and Language

TL;DR: This article shows that efficient embedding-based methods that rely solely on the respective datasets outperform computationally expensive few-shot fine-tuning approaches and can efficiently identify the best datasets for intermediate training.

Abstract: Intermediate task fine-tuning has been shown to culminate in large transfer gains across many NLP tasks. With an abundance of candidate datasets as well as pre-trained language models, it has become infeasible to run the cross-product of all combinations to find the best transfer setting. In this work we first establish that similar sequential fine-tuning gains can be achieved in adapter settings, and subsequently consolidate previously proposed methods that efficiently identify beneficial tasks for intermediate transfer learning. We experiment with a diverse set of 42 intermediate and 11 target English classification, multiple choice, question answering, and sequence tagging tasks. Our results show that efficient embedding-based methods that rely solely on the respective datasets outperform computationally expensive few-shot fine-tuning approaches. Our best methods achieve an average Regret@3 of less than 1% across all target tasks, demonstrating that we are able to efficiently identify the best datasets for intermediate training.
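The abstract reports Regret@3 without defining it. As a rough illustration, the sketch below assumes the usual reading of Regret@k: the relative gap, in percent, between the best intermediate task overall and the best task among the top k that a selection method ranks first. The function name, task names, and scores are hypothetical and are not taken from the paper.

```python
from typing import Mapping, Sequence


def regret_at_k(scores: Mapping[str, float], ranking: Sequence[str], k: int) -> float:
    """Relative gap (in percent) between the best task overall and the
    best task among the top-k entries of a predicted ranking.

    Assumed definition, not confirmed by the abstract:
        100 * (best_overall - best_in_top_k) / best_overall
    """
    if k < 1 or not ranking:
        raise ValueError("k must be >= 1 and ranking must be non-empty")
    best_overall = max(scores.values())
    best_in_top_k = max(scores[task] for task in ranking[:k])
    return 100.0 * (best_overall - best_in_top_k) / best_overall


# Hypothetical target-task scores after transferring from each intermediate task.
scores = {"task_a": 80.0, "task_b": 78.0, "task_c": 75.0, "task_d": 70.0}
# Hypothetical ranking produced by a selection method (best guess first).
ranking = ["task_b", "task_c", "task_d", "task_a"]

print(regret_at_k(scores, ranking, 3))  # 2.5
```

Under this reading, a Regret@3 below 1% means that one of the three top-ranked intermediate tasks comes within 1% of the best achievable transfer result.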

8 citations

Posted Content

KLUE: Korean Language Understanding Evaluation.

Sungjoon Park¹, Jihyung Moon, Sungdong Kim², Won Ik Cho³, Jiyoon Han⁴, Jang-Won Park, Chisung Song, Junseong Kim, Yongsook Song⁵, Tae-Hwan Oh⁴, Joohong Lee, Juhyun Oh³, Sungwon Lyu, Younghoon Jeong⁶, Inkwon Lee², Sangwoo Seo, Dongjun Lee, Hyunwoo Kim³, Myeonghwa Lee¹, Seongbo Jang, Seungwon Do, Sunkyoung Kim¹, KyungTae Lim⁷, Jongwon Lee, Kyumin Park¹, Jamin Shin, Seonghyun Kim, Lucy Park, Alice Oh, Jung-Woo Ha², Kyunghyun Cho⁸

KAIST¹, Naver Corporation², Seoul National University³, Yonsei University⁴, Kyung Hee University⁵, Sogang University⁶, Hanbat National University⁷, New York University⁸

20 May 2021, arXiv: Computation and Language

TL;DR: The Korean Language Understanding Evaluation (KLUE) benchmark is a collection of 8 Korean natural language understanding tasks: Topic Classification, Semantic Textual Similarity, Natural Language Inference, Named Entity Recognition, Relation Extraction, Dependency Parsing, Machine Reading Comprehension, and Dialogue State Tracking.

Abstract: We introduce the Korean Language Understanding Evaluation (KLUE) benchmark. KLUE is a collection of 8 Korean natural language understanding (NLU) tasks, including Topic Classification, Semantic Textual Similarity, Natural Language Inference, Named Entity Recognition, Relation Extraction, Dependency Parsing, Machine Reading Comprehension, and Dialogue State Tracking. We build all of the tasks from scratch from diverse source corpora while respecting copyrights, to ensure accessibility for anyone without any restrictions. With ethical considerations in mind, we carefully design annotation protocols. Along with the benchmark tasks and data, we provide suitable evaluation metrics and fine-tuning recipes for pretrained language models for each task. We furthermore release the pretrained language models (PLM), KLUE-BERT and KLUE-RoBERTa, to help reproduce baseline models on KLUE and thereby facilitate future research. We make a few interesting observations from the preliminary experiments using the proposed KLUE benchmark suite, already demonstrating the usefulness of this new benchmark suite. First, we find KLUE-RoBERTa-large outperforms other baselines, including multilingual PLMs and existing open-source Korean PLMs. Second, we see minimal degradation in performance even when we replace personally identifiable information from the pretraining corpus, suggesting that privacy and NLU capability are not at odds with each other. Lastly, we find that using BPE tokenization in combination with morpheme-level pre-tokenization is effective in tasks involving morpheme-level tagging, detection and generation. In addition to accelerating Korean NLP research, our comprehensive documentation on creating KLUE will facilitate creating similar resources for other languages in the future. KLUE is available at this https URL.

7 citations
