Open Access · Posted Content
KR-BERT: A Small-Scale Korean-Specific Language Model
TL;DR
This paper trained a Korean-specific model, KR-BERT, utilizing a smaller vocabulary and dataset, and adjusted the minimal span of tokens for tokenization, ranging from the sub-character level to the character level, to construct a better vocabulary for the model.
Abstract:
Since the appearance of BERT, recent works including XLNet and RoBERTa utilize sentence embedding models pre-trained on large corpora with large numbers of parameters. Because such models demand heavy hardware and a huge amount of data, they take a long time to pre-train. Therefore it is important to attempt to make smaller models that perform comparably. In this paper, we trained a Korean-specific model, KR-BERT, utilizing a smaller vocabulary and dataset. Since Korean is a morphologically rich, low-resource language written in a non-Latin alphabet, it is also important to capture language-specific linguistic phenomena that the Multilingual BERT model misses. We tested several tokenizers, including our BidirectionalWordPiece Tokenizer, and adjusted the minimal span of tokens for tokenization, ranging from the sub-character level to the character level, to construct a better vocabulary for our model. With those adjustments, our KR-BERT model performed comparably to, and even better than, other existing pre-trained models using a corpus about 1/10 of the size.
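The sub-character adjustment mentioned in the abstract decomposes precomposed Hangul syllables into their constituent jamo before WordPiece is applied. The snippet below is a minimal Python sketch of that decomposition using standard Unicode Hangul composition arithmetic; it is illustrative only and not the authors' released tokenizer code.

```python
# Minimal sketch: decompose precomposed Hangul syllables into lead/vowel/tail
# jamo using standard Unicode composition arithmetic (illustrative only).
HANGUL_BASE, LEAD_BASE, VOWEL_BASE, TAIL_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
NUM_VOWELS, NUM_TAILS = 21, 28

def to_jamo(text: str) -> str:
    out = []
    for ch in text:
        idx = ord(ch) - HANGUL_BASE
        if 0 <= idx < 11172:                      # precomposed syllable block
            lead, rem = divmod(idx, NUM_VOWELS * NUM_TAILS)
            vowel, tail = divmod(rem, NUM_TAILS)
            out.append(chr(LEAD_BASE + lead))
            out.append(chr(VOWEL_BASE + vowel))
            if tail:                              # the tail consonant is optional
                out.append(chr(TAIL_BASE + tail))
        else:
            out.append(ch)                        # pass non-Hangul through unchanged
    return "".join(out)

print(len("한국어"), len(to_jamo("한국어")))        # 3 syllables -> 8 jamo
```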
Citations
Proceedings Article · DOI
How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models
TL;DR: The authors provide a systematic and comprehensive empirical comparison of pretrained multilingual language models versus their monolingual counterparts with regard to their monolingual task performance, and find that while pretraining data size is an important factor in downstream performance, a designated monolingual tokenizer plays an equally important role.
Posted Content
How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models.
TL;DR: This paper provided a systematic and comprehensive empirical comparison of pretrained multilingual language models versus their monolingual counterparts with regard to their monolingual task performance, and found that while pretraining data size is an important factor in the downstream performance of the multilingual model, a designated monolingual tokenizer plays an equally important role.
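As a rough illustration of the comparison these two entries describe, the sketch below contrasts how a multilingual tokenizer and a Korean-specific one split the same word with the Hugging Face transformers library; the KR-BERT checkpoint id snunlp/KR-BERT-char16424 is an assumption for illustration, not taken from the entries above.

```python
# Sketch: compare how a multilingual and a Korean-specific tokenizer split the
# same word. Requires the `transformers` library; the KR-BERT checkpoint id is
# an assumed example, not confirmed by the entries above.
from transformers import AutoTokenizer

word = "먹었겠더라"  # a morphologically complex Korean verb form

multilingual = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
korean = AutoTokenizer.from_pretrained("snunlp/KR-BERT-char16424")  # assumed id

print(multilingual.tokenize(word))  # tends to produce short, poorly aligned pieces
print(korean.tokenize(word))        # fewer pieces, closer to Korean morphology
```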
Posted Content · DOI
Bertinho: Galician BERT Representations.
TL;DR: In this article, monolingual BERT models for Galician with 6 and 12 transformer layers are presented and evaluated on POS-tagging, dependency parsing and named entity recognition.
Journal Article · DOI
A pre-trained BERT for Korean medical natural language processing
Yoojoong Kim, Jong-Ho Kim, Jeong Moon Lee, Moon Joung Jang, Yunjin J. Yum, Seong Eon Kim, Unsub Shin, Young Min Kim, Hyung Joon Joo, Sanghoun Song +9 more
TL;DR: In this article, a Korean medical language model based on deep learning NLP is presented; it was trained using the pre-training framework of BERT for the medical context, starting from a state-of-the-art Korean language model.
Posted Content
KLUE: Korean Language Understanding Evaluation.
Sungjoon Park, Jihyung Moon, Sungdong Kim, Won Ik Cho, Jiyoon Han, Jang-Won Park, Chisung Song, Junseong Kim, Yongsook Song, Tae-Hwan Oh, Joohong Lee, Juhyun Oh, Sungwon Lyu, Younghoon Jeong, Inkwon Lee, Sangwoo Seo, Dongjun Lee, Hyunwoo Kim, Myeonghwa Lee, Seongbo Jang, Seungwon Do, Sunkyoung Kim, KyungTae Lim, Jongwon Lee, Kyumin Park, Jamin Shin, Seonghyun Kim, Lucy Park, Alice Oh, Jung-Woo Ha, Kyunghyun Cho +30 more
TL;DR: The Korean Language Understanding Evaluation (KLUE) benchmark as mentioned in this paper is a collection of 8 Korean NLP tasks, including Topic Classification, Semantic Textual Similarity, Natural Language Inference, Named Entity Recognition, Relation Extraction, Dependency Parsing, Machine Reading Comprehension, and Dialogue State Tracking.
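For reference, a KLUE task can typically be loaded with the Hugging Face datasets library; the snippet below assumes the benchmark is published on the Hub under the dataset id "klue" with a "ynat" (Topic Classification) configuration.

```python
# Sketch: load one KLUE task with the `datasets` library; the dataset id and
# configuration name are assumptions based on the Hugging Face Hub layout.
from datasets import load_dataset

klue_tc = load_dataset("klue", "ynat")
print(klue_tc["train"][0])                       # one labeled headline example
print(klue_tc["train"].features["label"].names)  # the topic label set
```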
References
Proceedings Article · DOI
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
TL;DR: BERT as mentioned in this paper pre-trains deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers; the pre-trained model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
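A minimal sketch of the "one additional output layer" fine-tuning recipe that summary describes, using the Hugging Face transformers API; the checkpoint name and two-label setup are illustrative assumptions, not details from the entry above.

```python
# Sketch: fine-tune BERT with a single added classification layer via the
# Hugging Face `transformers` API. Checkpoint and label count are illustrative.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-cased", num_labels=2   # pre-trained encoder + fresh output layer
)

batch = tokenizer(["a sentence to classify"], return_tensors="pt")
labels = torch.tensor([1])
loss = model(**batch, labels=labels).loss   # minimize this loss to fine-tune
loss.backward()
```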
Posted Content
RoBERTa: A Robustly Optimized BERT Pretraining Approach
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov
TL;DR: The authors find that BERT was significantly undertrained and that it can match or exceed the performance of every model published after it; their best model achieves state-of-the-art results on GLUE, RACE and SQuAD.
Proceedings Article · DOI
Neural Machine Translation of Rare Words with Subword Units
TL;DR: This paper introduces a simpler and more effective approach, making the NMT model capable of open-vocabulary translation by encoding rare and unknown words as sequences of subword units, and empirically shows that subword models improve over a back-off dictionary baseline for the WMT 15 translation tasks English-German and English-Russian by 1.3 BLEU.
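The subword-unit approach described above learns a vocabulary by repeatedly merging the most frequent adjacent symbol pair; below is a compact sketch of that merge loop on a toy word-frequency vocabulary (the corpus and merge count are illustrative).

```python
# Sketch of the BPE merge loop: repeatedly merge the most frequent adjacent
# symbol pair in a word-frequency vocabulary. Toy corpus, 10 merges.
import collections
import re

def pair_stats(vocab):
    pairs = collections.Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# words are space-separated symbols with an end-of-word marker
vocab = {"l o w </w>": 5, "l o w e r </w>": 2, "n e w e s t </w>": 6, "w i d e s t </w>": 3}
for _ in range(10):
    stats = pair_stats(vocab)
    if not stats:
        break
    best = max(stats, key=stats.get)
    vocab = merge_pair(best, vocab)
    print(best)                      # the merge learned at this step
```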
Posted Content
Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation
Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Łukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason A. Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg S. Corrado, Macduff Hughes, Jeffrey Dean
TL;DR: GNMT, Google's Neural Machine Translation system, is presented, which attempts to address many of the weaknesses of conventional phrase-based translation systems and provides a good balance between the flexibility of "character"-delimited models and the efficiency of "word"-delimited models.
Posted Content
NLTK: The Natural Language Toolkit
Edward Loper, Steven Bird
TL;DR: NLTK, the Natural Language Toolkit, is a suite of open source program modules, tutorials and problem sets, providing ready-to-use computational linguistics courseware that covers symbolic and statistical natural language processing.
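A small usage sketch of the toolkit: tokenizing and POS-tagging a sentence with NLTK. The two downloads fetch the standard tokenizer and tagger resources the calls below rely on.

```python
# Sketch: tokenize and POS-tag a sentence with NLTK; the downloads fetch the
# standard resources required by word_tokenize and pos_tag.
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

tokens = nltk.word_tokenize("NLTK provides ready-to-use courseware for NLP.")
print(nltk.pos_tag(tokens))          # list of (token, part-of-speech tag) pairs
```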