Open Access · Posted Content

KR-BERT: A Small-Scale Korean-Specific Language Model

TLDR
This paper presents KR-BERT, a Korean-specific model trained with a smaller vocabulary and dataset, adjusting the minimal span of tokens for tokenization from the sub-character level to the character level to construct a better vocabulary for the model.
Abstract
Since the appearance of BERT, recent works including XLNet and RoBERTa utilize sentence embedding models pre-trained on large corpora with a large number of parameters. Because such models require large amounts of hardware and data, they take a long time to pre-train. It is therefore important to attempt to build smaller models that perform comparably. In this paper, we trained a Korean-specific model, KR-BERT, using a smaller vocabulary and dataset. Since Korean is a morphologically rich, low-resource language written in a non-Latin alphabet, it is also important to capture language-specific linguistic phenomena that the Multilingual BERT model misses. We tested several tokenizers, including our BidirectionalWordPiece Tokenizer, and adjusted the minimal span of tokens for tokenization from the sub-character level to the character level to construct a better vocabulary for our model. With these adjustments, our KR-BERT model performed comparably to, and in some cases better than, other existing pre-trained models while using a corpus about 1/10 of the size.
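As a rough illustration of the tokenization range the abstract describes, the sketch below contrasts character-level Hangul syllable blocks with sub-character (jamo) units using standard Unicode normalization. It is not the paper's BidirectionalWordPiece tokenizer, and the example word is arbitrary.

```python
# Minimal sketch of the character- vs. sub-character-level distinction for Korean.
# This is NOT the paper's BidirectionalWordPiece tokenizer, only an illustration
# of how a Hangul syllable block can be broken into jamo via Unicode normalization.
import unicodedata

word = "한국어"  # "Korean (language)"

# Character level: each Hangul syllable block is one minimal unit.
char_level = list(word)                                   # ['한', '국', '어']

# Sub-character level: NFD decomposes each syllable into its conjoining jamo.
sub_char_level = list(unicodedata.normalize("NFD", word))
# ['ᄒ', 'ᅡ', 'ᆫ', 'ᄀ', 'ᅮ', 'ᆨ', 'ᄋ', 'ᅥ']

print(char_level)
print(sub_char_level)
```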


Citations
Proceedings Article

How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models

TL;DR: The authors provide a systematic and comprehensive empirical comparison of pretrained multilingual language models versus their monolingual counterparts with regard to their monolingual task performance, and find that while the pretraining data size is an important factor in downstream performance, a designated monolingual tokenizer plays an equally important role.
Posted Content

Bertinho: Galician BERT Representations.

TL;DR: In this article, a monolingual BERT model for Galician is presented, with 6- and 12-layer transformer variants evaluated on POS-tagging, dependency parsing, and named entity recognition.
Journal Article

A pre-trained BERT for Korean medical natural language processing

TL;DR: In this article, a Korean medical language model based on deep-learning NLP is presented; the model was trained using the pre-training framework of BERT for the medical context, building on a state-of-the-art Korean language model.
Posted Content

KLUE: Korean Language Understanding Evaluation.

TL;DR: The Korean Language Understanding Evaluation (KLUE) benchmark, as presented in this paper, is a collection of 8 Korean NLP tasks: Topic Classification, Semantic Textual Similarity, Natural Language Inference, Named Entity Recognition, Relation Extraction, Dependency Parsing, Machine Reading Comprehension, and Dialogue State Tracking.
References
Proceedings Article

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

TL;DR: BERT, as described in this paper, pre-trains deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, and can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
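As a hedged sketch of the "one additional output layer" fine-tuning pattern described above, the snippet below uses the Hugging Face transformers library (an assumption, not the original TensorFlow release); the checkpoint name, label count, and example sentence are illustrative.

```python
# Sketch: fine-tuning-style usage of a pre-trained BERT encoder with a single
# classification head on top. Checkpoint name and num_labels are illustrative.
import torch
from transformers import BertTokenizerFast, BertForSequenceClassification

tokenizer = BertTokenizerFast.from_pretrained("bert-base-multilingual-cased")
# BertForSequenceClassification adds one output layer over the pre-trained
# encoder; only that head is newly initialized before fine-tuning.
model = BertForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=2
)

inputs = tokenizer("영화가 정말 재미있었다", return_tensors="pt")  # "The movie was really fun"
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.shape)  # torch.Size([1, 2])
```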
Posted Content

RoBERTa: A Robustly Optimized BERT Pretraining Approach

TL;DR: It is found that BERT was significantly undertrained and can match or exceed the performance of every model published after it; the best model achieves state-of-the-art results on GLUE, RACE, and SQuAD.
Proceedings Article

Neural Machine Translation of Rare Words with Subword Units

TL;DR: This paper introduces a simpler and more effective approach that makes the NMT model capable of open-vocabulary translation by encoding rare and unknown words as sequences of subword units, and empirically shows that subword models improve over a back-off dictionary baseline on the WMT 15 English-German and English-Russian translation tasks by up to 1.3 BLEU.
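The core byte-pair-encoding (BPE) procedure behind subword units can be sketched in a few lines: repeatedly merge the most frequent adjacent symbol pair in a word-frequency vocabulary. The toy corpus and number of merges below are illustrative, adapted from the style of the published pseudo-code.

```python
# Minimal BPE sketch: learn merge operations from a toy word-frequency vocabulary.
import re
from collections import Counter

def get_pair_stats(vocab):
    """Count adjacent symbol pairs over a {'space-separated word': freq} vocab."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Merge every occurrence of `pair` into a single symbol."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Words are represented as space-separated characters plus an end-of-word marker.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}

for _ in range(5):
    best = max(get_pair_stats(vocab), key=get_pair_stats(vocab).get)
    vocab = merge_pair(best, vocab)
    print(best)  # learned merges, e.g. ('e', 's'), ('es', 't'), ...
```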
Posted Content

Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation

TL;DR: GNMT, Google's Neural Machine Translation system, is presented, which attempts to address many of the weaknesses of conventional phrase-based translation systems and provides a good balance between the flexibility of "character"-delimited models and the efficiency of "word"-delimited models.
Posted Content

NLTK: The Natural Language Toolkit

TL;DR: NLTK, the Natural Language Toolkit, is a suite of open source program modules, tutorials and problem sets, providing ready-to-use computational linguistics courseware that covers symbolic and statistical natural language processing.
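A minimal usage sketch of the toolkit described above: tokenization and part-of-speech tagging are two of its ready-to-use modules. The example sentence is illustrative, and resource names can vary slightly across NLTK versions.

```python
# Small NLTK sketch: tokenize a sentence and tag parts of speech.
import nltk

# One-time resource downloads (names may differ in newer NLTK releases).
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

tokens = nltk.word_tokenize("NLTK provides ready-to-use computational linguistics courseware.")
print(nltk.pos_tag(tokens))
```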