Author

Jinhyuk Lee

Bio: Jinhyuk Lee is an academic researcher from Korea University. The author has contributed to research on topics including question answering and biomedical text mining. The author has an h-index of 14 and has co-authored 46 publications receiving 2,274 citations. Previous affiliations of Jinhyuk Lee include the University of Washington and the Ulsan National Institute of Science and Technology.

Papers published on a yearly basis

Papers
Journal ArticleDOI
TL;DR: This article proposed BioBERT (Bidirectional Encoder Representations from Transformers for Biomedical Text Mining), which is a domain-specific language representation model pre-trained on large-scale biomedical corpora.
Abstract: Motivation: Biomedical text mining is becoming increasingly important as the number of biomedical documents rapidly grows. With the progress in natural language processing (NLP), extracting valuable information from biomedical literature has gained popularity among researchers, and deep learning has boosted the development of effective biomedical text mining models. However, directly applying the advancements in NLP to biomedical text mining often yields unsatisfactory results due to a word distribution shift from general domain corpora to biomedical corpora. In this article, we investigate how the recently introduced pre-trained language model BERT can be adapted for biomedical corpora. Results: We introduce BioBERT (Bidirectional Encoder Representations from Transformers for Biomedical Text Mining), which is a domain-specific language representation model pre-trained on large-scale biomedical corpora. With almost the same architecture across tasks, BioBERT largely outperforms BERT and previous state-of-the-art models in a variety of biomedical text mining tasks when pre-trained on biomedical corpora. While BERT obtains performance comparable to that of previous state-of-the-art models, BioBERT significantly outperforms them on the following three representative biomedical text mining tasks: biomedical named entity recognition (0.62% F1 score improvement), biomedical relation extraction (2.80% F1 score improvement) and biomedical question answering (12.24% MRR improvement). Our analysis results show that pre-training BERT on biomedical corpora helps it to understand complex biomedical texts. Availability and implementation: We make the pre-trained weights of BioBERT freely available at https://github.com/naver/biobert-pretrained, and the source code for fine-tuning BioBERT available at https://github.com/dmis-lab/biobert.

2,680 citations
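The abstract above presents BioBERT as drop-in pre-trained weights that are later fine-tuned per task. As a hedged illustration only, the sketch below loads a community-mirrored BioBERT checkpoint through the Hugging Face transformers library and extracts contextual embeddings; the checkpoint name is an assumption, since the official weights are distributed through the repositories linked above.

```python
# Minimal sketch (not the authors' fine-tuning code): encode biomedical text
# with BioBERT weights via Hugging Face `transformers`.
from transformers import AutoTokenizer, AutoModel
import torch

model_name = "dmis-lab/biobert-base-cased-v1.1"  # assumed public mirror of the released weights
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

text = "Mutations in BRCA1 are associated with breast cancer."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Contextual token embeddings on which a task head (NER, RE, QA) would be fine-tuned.
print(outputs.last_hidden_state.shape)  # (1, num_tokens, 768)
```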

Proceedings ArticleDOI
01 Jul 2019
TL;DR: In this article, the authors proposed a query-agnostic indexable representation of document phrases that can drastically speed up open-domain QA by capturing syntactic, semantic, and lexical information of the phrases and eliminating pipeline filtering of context documents.
Abstract: Existing open-domain question answering (QA) models are not suitable for real-time usage because they need to process several long documents on-demand for every input query, which is computationally prohibitive. In this paper, we introduce query-agnostic indexable representations of document phrases that can drastically speed up open-domain QA. In particular, our dense-sparse phrase encoding effectively captures syntactic, semantic, and lexical information of the phrases and eliminates the pipeline filtering of context documents. Leveraging strategies for optimizing training and inference time, our model can be trained and deployed even in a single 4-GPU server. Moreover, by representing phrases as pointers to their start and end tokens, our model indexes phrases in the entire English Wikipedia (up to 60 billion phrases) using under 2TB. Our experiments on SQuAD-Open show that our model is on par with or more accurate than previous models with 6000x reduced computational cost, which translates into end-to-end inference that is at least 68x faster on CPUs. Code and demo are available at nlp.cs.washington.edu/denspi.

103 citations
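To make the indexing idea in the abstract concrete, here is a toy sketch (not the authors' implementation) of query-agnostic phrase retrieval: phrase vectors are computed once offline, and answering a query reduces to a maximum inner product search over the frozen index. The encoders are replaced with random vectors purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Offline: encode every candidate phrase in the corpus once. The paper combines
# dense and sparse features; only a dense stand-in is shown here.
phrases = ["BRCA1 gene", "insulin resistance", "the 1918 influenza pandemic"]
phrase_index = rng.normal(size=(len(phrases), 128))  # stand-in for a learned phrase encoder

# Online: encode the query independently of any document and search the index.
query_vec = rng.normal(size=128)                     # stand-in for a learned query encoder
scores = phrase_index @ query_vec                    # maximum inner product search
best = int(np.argmax(scores))
print("Predicted answer phrase:", phrases[best])
```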

Journal ArticleDOI
TL;DR: BERN uses high-performance BioBERT named entity recognition models that recognize known entities and discover new entities, and integrates various named entity normalization models to assign a distinct identifier to each recognized entity.
Abstract: The amount of biomedical literature is vast and growing quickly, and accurate text mining techniques could help researchers to efficiently extract useful information from the literature. However, existing named entity recognition models used by text mining tools such as tmTool and ezTag are not effective enough, and cannot accurately discover new entities. Moreover, traditional text mining tools do not consider overlapping entities, which are frequently observed in multi-type named entity recognition results. We propose a neural biomedical named entity recognition and multi-type normalization tool called BERN. BERN uses high-performance BioBERT named entity recognition models which recognize known entities and discover new entities. Probability-based decision rules are developed to identify the types of overlapping entities. Furthermore, various named entity normalization models are integrated into BERN for assigning a distinct identifier to each recognized entity. BERN provides a Web service for tagging entities in PubMed articles or raw text. Researchers can use the BERN Web service for their text mining tasks, such as new named entity discovery, information retrieval, question answering, and relation extraction. The application programming interfaces and demonstrations of BERN are publicly available at https://bern.korea.ac.kr.

95 citations
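Since the abstract advertises a public Web API, a hedged sketch of calling it with the requests library follows; the endpoint path and payload field are assumptions based on the public demo, and the authoritative API description is the one hosted at https://bern.korea.ac.kr.

```python
import requests

text = "Autophagy maintains tumour growth through circulating arginine."

# Assumed plain-text annotation endpoint and parameter name; check the official
# documentation before relying on either.
resp = requests.post(
    "https://bern.korea.ac.kr/plain",
    data={"sample_text": text},
    timeout=30,
)
resp.raise_for_status()

# Expected shape: recognized entities with type, character span, and normalized ID.
for entity in resp.json().get("denotations", []):
    print(entity)
```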

Proceedings ArticleDOI
Jinhyuk Lee, Seongjun Yun, Hyunjae Kim, Miyoung Ko, Jaewoo Kang
30 Sep 2018
TL;DR: In this article, the authors introduced Paragraph Ranker, which ranks paragraphs of retrieved documents for a higher answer recall with less noise, and showed that ranking paragraphs and aggregating answers using Paragraph Ranker improves the performance of the open-domain QA pipeline.
Abstract: Recently, open-domain question answering (QA) has been combined with machine comprehension models to find answers in a large knowledge source. As open-domain QA requires retrieving relevant documents from text corpora to answer questions, its performance largely depends on the performance of document retrievers. However, since traditional information retrieval systems are not effective in obtaining documents with a high probability of containing answers, they lower the performance of QA systems. Simply extracting more documents increases the number of irrelevant documents, which also degrades the performance of QA systems. In this paper, we introduce Paragraph Ranker, which ranks paragraphs of retrieved documents for a higher answer recall with less noise. We show that ranking paragraphs and aggregating answers using Paragraph Ranker improves the performance of the open-domain QA pipeline on four open-domain QA datasets by 7.8% on average.

86 citations
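As a toy illustration of the ranking step described above (the paper uses a learned neural ranker, not TF-IDF), the sketch below scores retrieved paragraphs against the question and forwards only the top-k to the reading model, trading a little recall for much less noise.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

question = "Which protein does the BRCA1 gene encode?"
paragraphs = [
    "BRCA1 is a human tumor suppressor gene that encodes the BRCA1 protein.",
    "The 1918 influenza pandemic infected roughly a third of the world's population.",
    "Breast cancer screening guidelines vary between countries.",
]

# Stand-in scorer: lexical similarity between the question and each paragraph.
vectorizer = TfidfVectorizer().fit([question] + paragraphs)
scores = cosine_similarity(vectorizer.transform([question]),
                           vectorizer.transform(paragraphs))[0]

top_k = sorted(range(len(paragraphs)), key=lambda i: -scores[i])[:2]
print("Paragraphs forwarded to the reader:", top_k)
```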

Journal ArticleDOI
TL;DR: This paper proposed CollaboNet, which utilizes a combination of multiple NER models to reduce the number of false positives and misclassified entities including polysemous words, and achieved state-of-the-art performance.
Abstract: Finding biomedical named entities is one of the most essential tasks in biomedical text mining. Recently, deep learning-based approaches have been applied to biomedical named entity recognition (BioNER) and showed promising results. However, as deep learning approaches need an abundant amount of training data, a lack of data can hinder performance. BioNER datasets are scarce resources and each dataset covers only a small subset of entity types. Furthermore, many bio entities are polysemous, which is one of the major obstacles in named entity recognition. To address the lack of data and the entity type misclassification problem, we propose CollaboNet, which utilizes a combination of multiple NER models. In CollaboNet, models trained on different datasets are connected to each other so that a target model obtains information from other collaborator models to reduce false positives. Every model is an expert on its target entity type and takes turns serving as a target and a collaborator model during training. The experimental results show that CollaboNet can be used to greatly reduce the number of false positives and misclassified entities, including polysemous words. CollaboNet achieved state-of-the-art performance in terms of precision, recall and F1 score. We demonstrated the benefits of combining multiple models for BioNER. Our model has successfully reduced the number of misclassified entities and improved the performance by leveraging multiple datasets annotated for different entity types. Given the state-of-the-art performance of our model, we believe that CollaboNet can improve the accuracy of downstream biomedical text mining applications such as bio-entity relation extraction.

69 citations
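The following conceptual sketch (an assumption about the idea, not the released implementation) shows how collaborator models trained on other entity types could pass their token-level predictions to a target model; a crude veto rule stands in for the feature connections used in the actual network.

```python
from typing import Callable, Dict, List

Labeler = Callable[[List[str]], List[str]]  # tokens -> BIO labels

def combine_collaborators(tokens: List[str],
                          target: Labeler,
                          collaborators: Dict[str, Labeler]) -> List[str]:
    """Let collaborator predictions temper the target model's labels."""
    hints = {name: model(tokens) for name, model in collaborators.items()}
    target_labels = target(tokens)
    final = []
    for i, label in enumerate(target_labels):
        # If some collaborator claims this token for a different entity type, back off.
        other_claims = {h[i] for h in hints.values()} - {"O"}
        final.append("O" if label != "O" and other_claims else label)
    return final

# Trivial stand-in labelers for illustration; "T" is a polysemous false positive
# for the gene model that the cell-type collaborator helps suppress.
tokens = ["IL-2", "activates", "T", "cells"]
gene_model = lambda ts: ["B-GENE" if t in {"IL-2", "T"} else "O" for t in ts]
cell_model = lambda ts: ["B-CELL" if t in {"T", "cells"} else "O" for t in ts]
print(combine_collaborators(tokens, target=gene_model, collaborators={"cell": cell_model}))
```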


Cited by
Journal ArticleDOI
TL;DR: This update describes changes to the text-mining system, a new scoring mode for physical interactions, and extensive user interface features for customizing, extending and sharing protein networks.
Abstract: Cellular life depends on a complex web of functional associations between biomolecules. Among these associations, protein-protein interactions are particularly important due to their versatility, specificity and adaptability. The STRING database aims to integrate all known and predicted associations between proteins, including both physical interactions as well as functional associations. To achieve this, STRING collects and scores evidence from a number of sources: (i) automated text mining of the scientific literature, (ii) databases of interaction experiments and annotated complexes/pathways, (iii) computational interaction predictions from co-expression and from conserved genomic context and (iv) systematic transfers of interaction evidence from one organism to another. STRING aims for wide coverage; the upcoming version 11.5 of the resource will contain more than 14 000 organisms. In this update paper, we describe changes to the text-mining system, a new scoring-mode for physical interactions, as well as extensive user interface features for customizing, extending and sharing protein networks. In addition, we describe how to query STRING with genome-wide, experimental data, including the automated detection of enriched functionalities and potential biases in the user's query data. The STRING resource is available online, at https://string-db.org/.

3,253 citations
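For readers who want to consume STRING programmatically rather than through the web interface, here is a hedged sketch against its public REST API; the endpoint, output format, and parameter names are assumptions, and https://string-db.org/ documents the authoritative interface.

```python
import requests

params = {
    "identifiers": "TP53",  # protein of interest
    "species": 9606,        # NCBI taxon identifier for human
}
# Assumed network endpoint returning tab-separated association records.
resp = requests.get("https://string-db.org/api/tsv/network", params=params, timeout=30)
resp.raise_for_status()

# Each row describes a scored protein-protein association.
for line in resp.text.splitlines()[:5]:
    print(line)
```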

Proceedings ArticleDOI
01 Nov 2019
TL;DR: SciBERT leverages unsupervised pretraining on a large multi-domain corpus of scientific publications to improve performance on downstream scientific NLP tasks and demonstrates statistically significant improvements over BERT.
Abstract: Obtaining large-scale annotated data for NLP tasks in the scientific domain is challenging and expensive. We release SciBERT, a pretrained language model based on BERT (Devlin et al., 2018), to address the lack of high-quality, large-scale labeled scientific data. SciBERT leverages unsupervised pretraining on a large multi-domain corpus of scientific publications to improve performance on downstream scientific NLP tasks. We evaluate on a suite of tasks including sequence tagging, sentence classification and dependency parsing, with datasets from a variety of scientific domains. We demonstrate statistically significant improvements over BERT and achieve new state-of-the-art results on several of these tasks. The code and pretrained models are available at https://github.com/allenai/scibert/.

1,864 citations
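A hedged sketch of preparing SciBERT for a downstream sentence-classification task with the transformers library follows; the checkpoint name is an assumption about how the release linked above is mirrored, and the untrained classification head would still need fine-tuning on labeled data.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "allenai/scibert_scivocab_uncased"  # assumed name of the released checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=3)

batch = tokenizer(["We evaluate on dependency parsing."], return_tensors="pt")
logits = model(**batch).logits  # randomly initialized head; fine-tune before use
print(logits.shape)
```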

Proceedings ArticleDOI
23 Apr 2020
TL;DR: The authors consistently find that multi-phase adaptive pretraining offers large gains in task performance, and show that adapting to a task corpus augmented using simple data selection strategies is an effective alternative, especially when resources for domain-adaptive pretraining are unavailable.
Abstract: Language models pretrained on text from a wide variety of sources form the foundation of today’s NLP. In light of the success of these broad-coverage models, we investigate whether it is still helpful to tailor a pretrained model to the domain of a target task. We present a study across four domains (biomedical and computer science publications, news, and reviews) and eight classification tasks, showing that a second phase of pretraining in-domain (domain-adaptive pretraining) leads to performance gains, under both high- and low-resource settings. Moreover, adapting to the task’s unlabeled data (task-adaptive pretraining) improves performance even after domain-adaptive pretraining. Finally, we show that adapting to a task corpus augmented using simple data selection strategies is an effective alternative, especially when resources for domain-adaptive pretraining might be unavailable. Overall, we consistently find that multi-phase adaptive pretraining offers large gains in task performance.

1,532 citations
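The recipe in the abstract amounts to continuing masked-language-model pretraining on unlabeled in-domain or task text before the usual fine-tuning. The sketch below shows one hedged way to do that with the transformers Trainer; the corpus path, base model, and hyperparameters are placeholders rather than the paper's settings.

```python
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")

# Unlabeled in-domain (or task) corpus, one document per line (placeholder path).
corpus = load_dataset("text", data_files={"train": "domain_corpus.txt"})["train"]
corpus = corpus.map(lambda x: tokenizer(x["text"], truncation=True, max_length=256),
                    remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="dapt-roberta",
                           num_train_epochs=1,
                           per_device_train_batch_size=8),
    train_dataset=corpus,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
)
trainer.train()  # afterwards, fine-tune the adapted encoder on the labeled end task
```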