Home
/
Authors
/
Sungdong Kim

Author

Sungdong Kim

Bio: Sungdong Kim is an academic researcher from Naver Corporation. The author has contributed to research in topics: Computer science & Question answering. The author has an hindex of 6, co-authored 10 publications receiving 1790 citations.

Papers

PDF

Open Access

More filters

Journal Article•DOI•

BioBERT: a pre-trained biomedical language representation model for biomedical text mining.

[...]

Jinhyuk Lee¹, Wonjin Yoon¹, Sungdong Kim², Donghyeon Kim¹, Sunkyu Kim¹, Chan Ho So¹, Jaewoo Kang¹ - Show less +3 more•Institutions (2)

Korea University¹, Naver Corporation²

25 Jan 2019-Bioinformatics

TL;DR: This article proposed BioBERT (Bidirectional Encoder Representations from Transformers for Biomedical Text Mining), which is a domain-specific language representation model pre-trained on large-scale biomedical corpora.

...read moreread less

Abstract: Motivation Biomedical text mining is becoming increasingly important as the number of biomedical documents rapidly grows. With the progress in natural language processing (NLP), extracting valuable information from biomedical literature has gained popularity among researchers, and deep learning has boosted the development of effective biomedical text mining models. However, directly applying the advancements in NLP to biomedical text mining often yields unsatisfactory results due to a word distribution shift from general domain corpora to biomedical corpora. In this article, we investigate how the recently introduced pre-trained language model BERT can be adapted for biomedical corpora. Results We introduce BioBERT (Bidirectional Encoder Representations from Transformers for Biomedical Text Mining), which is a domain-specific language representation model pre-trained on large-scale biomedical corpora. With almost the same architecture across tasks, BioBERT largely outperforms BERT and previous state-of-the-art models in a variety of biomedical text mining tasks when pre-trained on biomedical corpora. While BERT obtains performance comparable to that of previous state-of-the-art models, BioBERT significantly outperforms them on the following three representative biomedical text mining tasks: biomedical named entity recognition (0.62% F1 score improvement), biomedical relation extraction (2.80% F1 score improvement) and biomedical question answering (12.24% MRR improvement). Our analysis results show that pre-training BERT on biomedical corpora helps it to understand complex biomedical texts. Availability and implementation We make the pre-trained weights of BioBERT freely available at https://github.com/naver/biobert-pretrained, and the source code for fine-tuning BioBERT available at https://github.com/dmis-lab/biobert.

...read moreread less

2,680 citations

Proceedings Article•DOI•

Efficient Dialogue State Tracking by Selectively Overwriting Memory

[...]

Sungdong Kim¹, Sohee Yang², Gyuwan Kim¹, Sang Woo Lee¹•Institutions (2)

Naver Corporation¹, KAIST²

01 Jul 2020

TL;DR: The authors proposes a selectively overwriting mechanism for more efficient DST by decomposing DST into two sub-tasks and guiding the decoder to focus only on one of the tasks.

...read moreread less

Abstract: Recent works in dialogue state tracking (DST) focus on an open vocabulary-based setting to resolve scalability and generalization issues of the predefined ontology-based approaches. However, they are inefficient in that they predict the dialogue state at every turn from scratch. Here, we consider dialogue state as an explicit fixed-sized memory and propose a selectively overwriting mechanism for more efficient DST. This mechanism consists of two steps: (1) predicting state operation on each of the memory slots, and (2) overwriting the memory with new values, of which only a few are generated according to the predicted state operations. Our method decomposes DST into two sub-tasks and guides the decoder to focus only on one of the tasks, thus reducing the burden of the decoder. This enhances the effectiveness of training and DST performance. Our SOM-DST (Selectively Overwriting Memory for Dialogue State Tracking) model achieves state-of-the-art joint goal accuracy with 51.72% in MultiWOZ 2.0 and 53.01% in MultiWOZ 2.1 in an open vocabulary-based DST setting. In addition, we analyze the accuracy gaps between the current and the ground truth-given situations and suggest that it is a promising direction to improve state operation prediction to boost the DST performance.

...read moreread less

150 citations

Proceedings Article•DOI•

On the Effect of Pretraining Corpora on In-context Learning by a Large-scale Language Model

[...]

Seongjin Shin, Sangwook Lee, Hwijeen Ahn, Sungdong Kim, Hyoungseok Kim, Boseop Kim, Kyunghyun Cho, Gichang Lee, Woo Chul Park, Jung-Hyup Ha, Nako Sung - Show less +7 more

28 Apr 2022

TL;DR: This in-depth investigation of the effects of the source and size of the pretraining corpus on in-context learning in HyperCLOVA, a Korean-centric GPT-3 model finds that in- context learning performance heavily depends on the corpus domain source.

...read moreread less

Abstract: Many recent studies on large-scale language models have reported successful in-context zero- and few-shot learning ability. However, the in-depth analysis of when in-context learning occurs is still lacking. For example, it is unknown how in-context learning performance changes as the training corpus varies. Here, we investigate the effects of the source and size of the pretraining corpus on in-context learning in HyperCLOVA, a Korean-centric GPT-3 model. From our in-depth investigation, we introduce the following observations: (1) in-context learning performance heavily depends on the corpus domain source, and the size of the pretraining corpus does not necessarily determine the emergence of in-context learning, (2) in-context learning ability can emerge when a language model is trained on a combination of multiple corpora, even when each corpus does not result in in-context learning on its own, (3) pretraining with a corpus related to a downstream task does not always guarantee the competitive in-context learning performance of the downstream task, especially in the few-shot setting, and (4) the relationship between language modeling (measured in perplexity) and in-context learning does not always correlate: e.g., low perplexity does not always imply high in-context few-shot learning performance.

...read moreread less

33 citations

Posted Content•

Efficient Dialogue State Tracking by Selectively Overwriting Memory

[...]

Sungdong Kim¹, Sohee Yang², Gyuwan Kim¹, Sang Woo Lee¹•Institutions (2)

Naver Corporation¹, KAIST²

10 Nov 2019-arXiv: Computation and Language

TL;DR: The accuracy gaps between the current and the ground truth-given situations are analyzed and it is suggested that it is a promising direction to improve state operation prediction to boost the DST performance.

...read moreread less

16 citations

Journal Article•DOI•

QADiver: Interactive Framework for Diagnosing QA Models

[...]

Gyeongbok Lee¹, Sungdong Kim², Seung-won Hwang¹•Institutions (2)

Yonsei University¹, Naver Corporation²

17 Jul 2019

TL;DR: The authors proposed a web-based UI that provides how each model contributes to QA performances, by integrating visualization and analysis tools for model explanation, and they expect this framework can help QA model researchers to refine and improve their models.

...read moreread less

Abstract: Question answering (QA) extracting answers from text to the given question in natural language, has been actively studied and existing models have shown a promise of outperforming human performance when trained and evaluated with SQuAD dataset. However, such performance may not be replicated in the actual setting, for which we need to diagnose the cause, which is non-trivial due to the complexity of model. We thus propose a web-based UI that provides how each model contributes to QA performances, by integrating visualization and analysis tools for model explanation. We expect this framework can help QA model researchers to refine and improve their models.

...read moreread less

14 citations

1
2
3
4
…
5
6

Collapse

Cited by

PDF

Open Access

More filters

Journal Article•DOI•

The STRING database in 2021: customizable protein-protein networks, and functional characterization of user-uploaded gene/measurement sets.

[...]

Damian Szklarczyk¹, Annika L. Gable¹, Katerina C. Nastou², David Lyon¹, Rebecca Kirsch², Sampo Pyysalo³, Nadezhda Tsankova Doncheva², Marc Legeay², Tao Fang¹, Peer Bork, Lars Juhl Jensen², Christian von Mering¹ - Show less +8 more•Institutions (3)

Swiss Institute of Bioinformatics¹, University of Copenhagen², University of Turku³

08 Jan 2021-Nucleic Acids Research

TL;DR: Changes to the text-mining system, a new scoring-mode for physical interactions, as well as extensive user interface features for customizing, extending and sharing protein networks are described.

...read moreread less

Abstract: Cellular life depends on a complex web of functional associations between biomolecules. Among these associations, protein-protein interactions are particularly important due to their versatility, specificity and adaptability. The STRING database aims to integrate all known and predicted associations between proteins, including both physical interactions as well as functional associations. To achieve this, STRING collects and scores evidence from a number of sources: (i) automated text mining of the scientific literature, (ii) databases of interaction experiments and annotated complexes/pathways, (iii) computational interaction predictions from co-expression and from conserved genomic context and (iv) systematic transfers of interaction evidence from one organism to another. STRING aims for wide coverage; the upcoming version 11.5 of the resource will contain more than 14 000 organisms. In this update paper, we describe changes to the text-mining system, a new scoring-mode for physical interactions, as well as extensive user interface features for customizing, extending and sharing protein networks. In addition, we describe how to query STRING with genome-wide, experimental data, including the automated detection of enriched functionalities and potential biases in the user's query data. The STRING resource is available online, at https://string-db.org/.

...read moreread less

3,253 citations

Proceedings Article•DOI•

SciBERT: A Pretrained Language Model for Scientific Text

[...]

Iz Beltagy¹, Kyle Lo², Arman Cohan³•Institutions (3)

Allen Institute for Artificial Intelligence¹, University of Washington², Adobe Systems³

01 Nov 2019

TL;DR: SciBERT leverages unsupervised pretraining on a large multi-domain corpus of scientific publications to improve performance on downstream scientific NLP tasks and demonstrates statistically significant improvements over BERT.

...read moreread less

Abstract: Obtaining large-scale annotated data for NLP tasks in the scientific domain is challenging and expensive. We release SciBERT, a pretrained language model based on BERT (Devlin et. al., 2018) to address the lack of high-quality, large-scale labeled scientific data. SciBERT leverages unsupervised pretraining on a large multi-domain corpus of scientific publications to improve performance on downstream scientific NLP tasks. We evaluate on a suite of tasks including sequence tagging, sentence classification and dependency parsing, with datasets from a variety of scientific domains. We demonstrate statistically significant improvements over BERT and achieve new state-of-the-art results on several of these tasks. The code and pretrained models are available at https://github.com/allenai/scibert/.

...read moreread less

1,864 citations

Proceedings Article•DOI•

Don't Stop Pretraining: Adapt Language Models to Domains and Tasks

[...]

Suchin Gururangan¹, Ana Marasović², Ana Marasović¹, Swabha Swayamdipta¹, Kyle Lo¹, Iz Beltagy¹, Doug Downey¹, Noah A. Smith¹, Noah A. Smith² - Show less +5 more•Institutions (2)

Allen Institute for Artificial Intelligence¹, University of Washington²

23 Apr 2020

TL;DR: It is consistently found that multi-phase adaptive pretraining offers large gains in task performance, and it is shown that adapting to a task corpus augmented using simple data selection strategies is an effective alternative, especially when resources for domain-adaptive pretraining might be unavailable.

...read moreread less

Abstract: Language models pretrained on text from a wide variety of sources form the foundation of today’s NLP. In light of the success of these broad-coverage models, we investigate whether it is still helpful to tailor a pretrained model to the domain of a target task. We present a study across four domains (biomedical and computer science publications, news, and reviews) and eight classification tasks, showing that a second phase of pretraining in-domain (domain-adaptive pretraining) leads to performance gains, under both high- and low-resource settings. Moreover, adapting to the task’s unlabeled data (task-adaptive pretraining) improves performance even after domain-adaptive pretraining. Finally, we show that adapting to a task corpus augmented using simple data selection strategies is an effective alternative, especially when resources for domain-adaptive pretraining might be unavailable. Overall, we consistently find that multi-phase adaptive pretraining offers large gains in task performance.

...read moreread less

1,532 citations

Proceedings Article•DOI•

Publicly Available Clinical BERT Embeddings

[...]

Emily Alsentzer, John Murphy, William Boag, Wei-Hung Weng, Di Jindi, Tristan Naumann, Matthew B. A. McDermott - Show less +3 more

06 Apr 2019

TL;DR: This paper explored and released BERT models for clinical text: one for generic clinical text and another for discharge summaries specifically, and demonstrated that using a domain-specific model yields performance improvements on 3/5 clinical NLP tasks, establishing a new state-of-the-art on the MedNLI dataset.

...read moreread less

Abstract: Contextual word embedding models such as ELMo and BERT have dramatically improved performance for many natural language processing (NLP) tasks in recent months. However, these models have been minimally explored on specialty corpora, such as clinical text; moreover, in the clinical domain, no publicly-available pre-trained BERT models yet exist. In this work, we address this need by exploring and releasing BERT models for clinical text: one for generic clinical text and another for discharge summaries specifically. We demonstrate that using a domain-specific model yields performance improvements on 3/5 clinical NLP tasks, establishing a new state-of-the-art on the MedNLI dataset. We find that these domain-specific models are not as performant on 2 clinical de-identification tasks, and argue that this is a natural consequence of the differences between de-identified source text and synthetically non de-identified task text.

...read moreread less

782 citations

Journal Article•DOI•

Pre-trained Models for Natural Language Processing: A Survey

[...]

Xipeng Qiu¹, Tianxiang Sun¹, Yige Xu¹, Yunfan Shao¹, Ning Dai¹, Xuanjing Huang¹ - Show less +2 more•Institutions (1)

Fudan University¹

18 Mar 2020-Science China-technological Sciences

TL;DR: Recently, the emergence of pre-trained models (PTMs) has brought natural language processing (NLP) to a new era as mentioned in this paper, and a comprehensive review of PTMs for NLP can be found in this survey.

...read moreread less

Abstract: Recently, the emergence of pre-trained models (PTMs) has brought natural language processing (NLP) to a new era. In this survey, we provide a comprehensive review of PTMs for NLP. We first briefly introduce language representation learning and its research progress. Then we systematically categorize existing PTMs based on a taxonomy from four different perspectives. Next, we describe how to adapt the knowledge of PTMs to downstream tasks. Finally, we outline some potential directions of PTMs for future research. This survey is purposed to be a hands-on guide for understanding, using, and developing PTMs for various NLP tasks.

...read moreread less

755 citations

1
2
3
4
…
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200

Collapse