Assessment of approximate string matching in a biomedical text retrieval problem

Open Access

TLDR

The authors tested the Smith–Waterman algorithm with affine gap penalty as a method for biomedical literature retrieval and found that performance was optimal at a string identity of 88%, at which recall and precision were 96.9% and 97.3%, respectively.

Abstract:

Text-based search is widely used for biomedical data mining and knowledge discovery. Character errors in the literature reduce the accuracy of data mining, and methods for solving this problem are being explored. This work tests the usefulness of the Smith–Waterman algorithm with affine gap penalty as a method for biomedical literature retrieval. Names of medicinal herbs collected from the herbal medicine literature are matched with those from the medicinal chemistry literature by using this algorithm at different string identity levels (80–100%). The optimum performance is at a string identity of 88%, at which the recall and precision are 96.9% and 97.3%, respectively. Our study suggests that the Smith–Waterman algorithm is useful for improving the success rate of biomedical text retrieval. © 2004 Elsevier Ltd. All rights reserved.
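The matching procedure described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the scoring values (`match`, `mismatch`, `gap_open`, `gap_extend`), the function names, and the definition of string identity (identical aligned characters divided by the length of the longer name) are all assumptions made for illustration, since the abstract does not state them. The recurrences follow the usual three-state formulation of local alignment with affine gaps.

```python
from typing import Iterable, Set, Tuple

NEG_INF = float("-inf")


def local_identity(a: str, b: str, match: int = 2, mismatch: int = -1,
                   gap_open: int = -3, gap_extend: int = -1) -> float:
    """Local alignment with affine gaps; returns an identity in [0, 1].

    Assumption: identity = identical aligned characters / longer length.
    A gap of length k costs gap_open + (k - 1) * gap_extend.
    """
    if not a or not b:
        return 0.0
    zero = (0, 0)            # (score, identical characters so far)
    closed = (NEG_INF, 0)    # state that cannot be reached yet
    prev_h = [zero] * (len(b) + 1)    # best alignment ending at (i-1, j)
    prev_f = [closed] * (len(b) + 1)  # same, ending in a vertical gap
    best = zero
    for i in range(1, len(a) + 1):
        cur_h = [zero] * (len(b) + 1)
        cur_f = [closed] * (len(b) + 1)
        e = closed                     # ending in a horizontal gap
        for j in range(1, len(b) + 1):
            # Extend an existing gap or open a new one (affine penalty).
            e = max((e[0] + gap_extend, e[1]),
                    (cur_h[j - 1][0] + gap_open, cur_h[j - 1][1]))
            f = max((prev_f[j][0] + gap_extend, prev_f[j][1]),
                    (prev_h[j][0] + gap_open, prev_h[j][1]))
            same = a[i - 1] == b[j - 1]
            diag = (prev_h[j - 1][0] + (match if same else mismatch),
                    prev_h[j - 1][1] + int(same))
            # Local alignment: never let the score drop below zero.
            h = max(zero, diag, e, f)
            cur_h[j], cur_f[j] = h, f
            best = max(best, h)
        prev_h, prev_f = cur_h, cur_f
    return best[1] / max(len(a), len(b))


def match_names(herbal: Iterable[str], chemistry: Iterable[str],
                threshold: float = 0.88) -> Set[Tuple[str, str]]:
    """Pair names whose case-insensitive identity reaches the threshold."""
    chemistry = list(chemistry)
    return {(h, c) for h in herbal for c in chemistry
            if local_identity(h.lower(), c.lower()) >= threshold}


def precision_recall(predicted: Set[Tuple[str, str]],
                     truth: Set[Tuple[str, str]]) -> Tuple[float, float]:
    """Precision and recall of predicted pairs against a reference set."""
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall
```

With these assumed scores, `local_identity("ginseng", "ginsemg")` evaluates to 6/7 (about 0.857), so a single-character error in a seven-letter name falls just below a 0.88 threshold, while longer names tolerate one error. Sweeping `threshold` from 0.80 to 1.00 and calling `precision_recall` against a hand-checked set of pairs would reproduce the kind of comparison the abstract reports.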

Citations

Proceedings Article

Efficient approximate entity extraction with edit distance constraints

Wei Wang, +3 more

TL;DR: This paper studies the problem of approximate dictionary matching with edit distance constraints and proposes an improved neighborhood generation method employing novel partitioning and prefix pruning techniques that outperforms alternative approaches by up to an order of magnitude.

Journal Article

Bioinformatics opportunities for identification and study of medicinal plants

Vivekanand Sharma, +1 more

- 01 Mar 2013 -

Briefings in Bioinformatics

TL;DR: This work highlights areas in medicinal plant research where the application of bioinformatics methodologies may result in quicker and potentially cost-effective leads toward finding plant-based remedies.

Journal Article

Mapping biological entities using the longest approximately common prefix method

Alex Rudniy, +2 more

- 14 Jun 2014 -

BMC Bioinformatics

TL;DR: The Longest Approximately Common Prefix method is introduced as an algorithm for approximate string matching that runs in linear time and is compared to nine other well-known string matching algorithms for performance, precision and speed.

Journal Article

Improving hash-q exact string matching algorithm with perfect hashing for DNA sequences.

Abdullah Ammar Karcioglu, +1 more

- 22 Feb 2021 -

Computers in Biology and Medicine

TL;DR: This paper proposes a hash function for DNA sequences that eliminates hash collisions, providing perfect hashing while computing hash values efficiently, and presents two exact string matching algorithms based on it.

Journal Article

Research on Uyghur Pattern Matching Based on Syllable Features

Wayit Abliz, +6 more

- 02 May 2020 -

Information-an International Interdiscip...

TL;DR: This work proposes a retrievable syllable coding format based on the syllable features of the Uyghur language, together with an improved Boyer–Moore (BM) algorithm. The approach addresses the problem of vowel weakening and better matches words whose stems change shape.

References

Journal Article

Identification of common molecular subsequences.

Temple F. Smith, +1 more

- 25 Mar 1981 -

Journal of Molecular Biology

TL;DR: This letter extends the heuristic homology algorithm of Needleman & Wunsch (1970) to find a pair of segments, one from each of two long sequences, such that there is no other pair of segments with greater similarity (homology).

Journal Article

A guided tour to approximate string matching

Gonzalo Navarro

- 01 Mar 2001 -

ACM Computing Surveys

TL;DR: This work surveys the current techniques to cope with the problem of string matching that allows errors, and focuses on online searching and mostly on edit distance, explaining the problem and its relevance, its statistical behavior, its history and current developments, and the central ideas of the algorithms.

Journal Article

An improved algorithm for matching biological sequences

Osamu Gotoh

- 15 Dec 1982 -

Journal of Molecular Biology

TL;DR: The algorithm of Waterman et al. (1976) for matching biological sequences was modified under some limitations to be accomplished in essentially MN steps, instead of the M²N steps necessary in the original algorithm.

Journal Article

Techniques for automatically correcting words in text

Karen Kukich

- 01 Dec 1992 -

ACM Computing Surveys

TL;DR: Research aimed at correcting words in text has focused on three progressively more difficult problems: (1) nonword error detection; (2) isolated-word error correction; and (3) context-dependent word correction. The article also surveys documented findings on spelling error patterns.

Proceedings Article

Automatic Extraction of Biological Information from Scientific Text: Protein-Protein Interactions

Christian Blaschke, +3 more

TL;DR: The basic design of a system for automatic detection of protein-protein interactions extracted from scientific abstracts is described and the feasibility of developing a fully automated system able to describe networks of protein interactions with sufficient accuracy is demonstrated.

Related Papers (5)

A guided tour to approximate string matching

Applying Bayesian belief networks in approximate string matching for robust keyword-based retrieval

A robust model for intelligent text classification

A Concept Similarity Based Text Classification Algorithm

An Advanced Fuzzy Constructing Algorithm for Feature Discovery in Text Mining