Open Access

Assessment of approximate string matching in a biomedical text retrieval problem

TLDR
The authors used the Smith-Waterman algorithm with affine gap penalty as a method for biomedical literature retrieval and found that the optimum performance was at string identity of 88%, at which the recall and precision were 96.9% and 97.3%, respectively.
Abstract
Text-based search is widely used for biomedical data mining and knowledge discovery. Character errors in the literature affect the accuracy of data mining, and methods for solving this problem are being explored. This work tests the usefulness of the Smith–Waterman algorithm with affine gap penalty as a method for biomedical literature retrieval. Names of medicinal herbs collected from the herbal medicine literature are matched with those from the medicinal chemistry literature by using this algorithm at different string identity levels (80–100%). The optimum performance is at a string identity of 88%, at which the recall and precision are 96.9% and 97.3%, respectively. Our study suggests that the Smith–Waterman algorithm is useful for improving the success rate of biomedical text retrieval. © 2004 Elsevier Ltd. All rights reserved.
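
The matching step described in the abstract can be illustrated with a minimal sketch of Smith–Waterman local alignment with affine gap penalties (the Gotoh recurrences). Several details here are assumptions, not taken from the paper: the scoring values (`match`, `mismatch`, `gap_open`, `gap_extend`), the case-insensitive comparison, the function names, and above all the definition of string identity as identical aligned characters divided by the length of the longer name. The paper's own scoring scheme and identity definition may differ.

```python
from typing import Tuple

NEG_INF = float("-inf")


def smith_waterman_affine(a: str, b: str, match: int = 2, mismatch: int = -1,
                          gap_open: int = 3, gap_extend: int = 1) -> Tuple[int, int]:
    """Return (best local score, identical aligned characters).

    Scoring values are illustrative assumptions. A gap of length k costs
    gap_open + (k - 1) * gap_extend.
    """
    n, m = len(a), len(b)
    # Each cell stores (score, matches); tuples compare by score first,
    # so ties in score are broken in favour of more identical characters.
    H = [[(0, 0)] * (m + 1) for _ in range(n + 1)]        # best alignment ending at (i, j)
    E = [[(NEG_INF, 0)] * (m + 1) for _ in range(n + 1)]  # ends with a gap in a
    F = [[(NEG_INF, 0)] * (m + 1) for _ in range(n + 1)]  # ends with a gap in b
    best = (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = a[i - 1] == b[j - 1]
            # Open a new gap from H, or extend an existing one.
            E[i][j] = max((H[i][j - 1][0] - gap_open, H[i][j - 1][1]),
                          (E[i][j - 1][0] - gap_extend, E[i][j - 1][1]))
            F[i][j] = max((H[i - 1][j][0] - gap_open, H[i - 1][j][1]),
                          (F[i - 1][j][0] - gap_extend, F[i - 1][j][1]))
            diag = (H[i - 1][j - 1][0] + (match if same else mismatch),
                    H[i - 1][j - 1][1] + int(same))
            cell = max(diag, E[i][j], F[i][j])
            # Local alignment: a non-positive score restarts the alignment.
            H[i][j] = cell if cell[0] > 0 else (0, 0)
            best = max(best, H[i][j])
    return best


def string_identity(a: str, b: str) -> float:
    """Assumed definition: identical aligned characters / longer name length."""
    a, b = a.lower(), b.lower()
    if not a or not b:
        return 0.0
    _, matches = smith_waterman_affine(a, b)
    return matches / max(len(a), len(b))


def is_match(a: str, b: str, threshold: float = 0.88) -> bool:
    """Accept a pair of names when identity reaches the chosen threshold."""
    return string_identity(a, b) >= threshold
```

Under these assumptions, `is_match("Panax ginseng", "Panax ginsang")` accepts a name with one substituted character, while `is_match("Panax ginseng", "Ginkgo biloba")` rejects an unrelated name. The default threshold of 0.88 mirrors the optimum identity level reported in the abstract; sweeping it over 0.80 to 1.00 corresponds to the range of identity levels the study evaluated.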


Citations
Proceedings ArticleDOI

Efficient approximate entity extraction with edit distance constraints

TL;DR: This paper studies the problem of approximate dictionary matching with edit distance constraints and proposes an improved neighborhood generation method employing novel partitioning and prefix pruning techniques that outperforms alternative approaches by up to an order of magnitude.
Journal ArticleDOI

Bioinformatics opportunities for identification and study of medicinal plants

TL;DR: This work highlights areas in medicinal plant research where the application of bioinformatics methodologies may result in quicker and potentially cost-effective leads toward finding plant-based remedies.
Journal ArticleDOI

Mapping biological entities using the longest approximately common prefix method

TL;DR: The Longest Approximately Common Prefix method is introduced as an algorithm for approximate string matching that runs in linear time and is compared to nine other well-known string matching algorithms for performance, precision and speed.
Journal ArticleDOI

Improving hash-q exact string matching algorithm with perfect hashing for DNA sequences.

TL;DR: This paper proposes a hash function that eliminates hash collisions for DNA sequences, providing perfect hashing and producing hash values in a time-efficient manner, along with two exact string matching algorithms based on that hash function.
Journal ArticleDOI

Research on Uyghur Pattern Matching Based on Syllable Features

TL;DR: A retrievable syllable coding format based on the syllable features of the Uyghur language and an improvement of the Boyer–Moore (BM) algorithm are proposed; the approach effectively handles vowel weakening and better matches words whose stems change shape.
References
Journal ArticleDOI

Identification of common molecular subsequences.

TL;DR: This letter extends the heuristic homology algorithm of Needleman & Wunsch (1970) to find a pair of segments, one from each of two long sequences, such that there is no other pair of segments with greater similarity (homology).
Journal ArticleDOI

A guided tour to approximate string matching

TL;DR: This work surveys the current techniques to cope with the problem of string matching that allows errors, and focuses on online searching and mostly on edit distance, explaining the problem and its relevance, its statistical behavior, its history and current developments, and the central ideas of the algorithms.
Journal ArticleDOI

An improved algorithm for matching biological sequences

TL;DR: The algorithm of Waterman et al. (1976) for matching biological sequences was modified under some limitations to be accomplished in essentially MN steps, instead of the M²N steps necessary in the original algorithm.
Journal ArticleDOI

Techniques for automatically correcting words in text

Karen Kukich
TL;DR: Research aimed at correcting words in text has focused on three progressively more difficult problems: (1) nonword error detection; (2) isolated-word error correction; and (3) context-dependent word correction. The article surveys documented findings on spelling error patterns.
Proceedings Article

Automatic Extraction of Biological Information from Scientific Text: Protein-Protein Interactions

TL;DR: The basic design of a system for automatic detection of protein-protein interactions extracted from scientific abstracts is described and the feasibility of developing a fully automated system able to describe networks of protein interactions with sufficient accuracy is demonstrated.