BIOSSES: a semantic sentence similarity estimation system for the biomedical domain.

doi:10.1093/BIOINFORMATICS/BTX238

Open Access, Journal Article


Gizem Sogancioglu, +2 more

Bioinformatics, Vol. 33, Iss. 14, 15 Jul 2017

TLDR

This work proposes several approaches for sentence‐level semantic similarity computation in the biomedical domain, including string similarity measures and measures based on the distributed vector representations of sentences learned in an unsupervised manner from a large biomedical corpus.

Abstract:

Motivation: The amount of information available in textual format is rapidly increasing in the biomedical domain. Therefore, natural language processing (NLP) applications are becoming increasingly important to facilitate the retrieval and analysis of these data. Computing the semantic similarity between sentences is an important component in many NLP tasks, including text retrieval and summarization. A number of approaches have been proposed for semantic sentence similarity estimation for generic English. However, our experiments showed that such approaches do not effectively cover biomedical knowledge and produce poor results for biomedical text.

Methods: We propose several approaches for sentence-level semantic similarity computation in the biomedical domain, including string similarity measures and measures based on the distributed vector representations of sentences learned in an unsupervised manner from a large biomedical corpus. In addition, ontology-based approaches are presented that utilize general and domain-specific ontologies. Finally, a supervised regression-based model is developed that effectively combines the different similarity computation metrics. A benchmark data set consisting of 100 sentence pairs from the biomedical literature is manually annotated by five human experts and used for evaluating the proposed methods.

Results: The experiments showed that the supervised semantic sentence similarity computation approach obtained the best performance (0.836 correlation with gold standard human annotations) and improved over the state-of-the-art domain-independent systems by up to 42.6% in terms of the Pearson correlation metric.

Availability and implementation: A web-based system for biomedical semantic sentence similarity computation, the source code, and the annotated benchmark data set are available at: http://tabilab.cmpe.boun.edu.tr/BIOSSES/

Contact: gizemsogancioglu@gmail.com or arzucan.ozgur@boun.edu.tr
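To make the evaluation setup described in the abstract more concrete, here is a minimal sketch of two simple similarity measures and the Pearson correlation used to compare system scores against human annotations. This is not the BIOSSES implementation: the tokenizer, the function names, the choice of Jaccard overlap as the string measure, and the bag-of-words cosine (a stand-in for the learned sentence vectors) are all assumptions made for illustration, and the sentence pairs and human scores in the usage section are invented, not taken from the benchmark.

```python
import math
from collections import Counter


def tokenize(sentence):
    """Lowercase whitespace tokenizer that strips basic punctuation (an assumption)."""
    tokens = (t.strip(".,;:()").lower() for t in sentence.split())
    return [t for t in tokens if t]


def jaccard_similarity(a, b):
    """String-level measure: token-set overlap divided by token-set union."""
    set_a, set_b = set(tokenize(a)), set(tokenize(b))
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def cosine_similarity(a, b):
    """Cosine over token-count vectors; a stand-in for learned sentence vectors."""
    vec_a, vec_b = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(count * vec_b[token] for token, count in vec_a.items())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least two values")
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if std_x == 0 or std_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (std_x * std_y)


if __name__ == "__main__":
    # Invented sentence pairs and invented human scores, for illustration only.
    pairs = [
        ("The gene is expressed in liver cells.", "The gene is expressed in liver tissue."),
        ("The protein binds DNA.", "The protein interacts with DNA directly."),
        ("Mice were fed a high-fat diet.", "The gene is expressed in liver cells."),
    ]
    human_scores = [3.6, 3.0, 0.2]
    system_scores = [jaccard_similarity(a, b) for a, b in pairs]
    print("Pearson correlation with (invented) human scores:",
          pearson(system_scores, human_scores))
```

A supervised combination, as the abstract describes, would take several such per-pair scores as features and fit a regression model against the human annotations; that step is left out here because the abstract does not specify the model or its features.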



