
Showing papers by "Eugene Agichtein published in 2003"


Journal ArticleDOI
TL;DR: Explores four complementary approaches for extracting gene and protein synonyms from text: unsupervised, partially supervised, and supervised machine-learning techniques, as well as a manual knowledge-based approach.
Abstract: Motivation: Genes and proteins are often associated with multiple names, and more names are added as new functional or structural information is discovered. Because authors can use any one of the known names for a gene or protein, information retrieval and extraction would benefit from identifying the gene and protein terms that are synonyms of the same substance. Results: We have explored four complementary approaches for extracting gene and protein synonyms from text, namely unsupervised, partially supervised, and supervised machine-learning techniques, as well as a manual knowledge-based approach. We report results of a large-scale evaluation of these alternatives over an archive of biological journal articles. Our evaluation shows that our extraction techniques could be a valuable supplement to resources such as SWISSPROT, as our systems were able to capture gene and protein synonyms not listed in the SWISSPROT database. Data Availability: The extracted gene and protein synonyms …
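
As a rough illustration of the pattern-based flavor of such synonym extraction, the sketch below scans sentences for cue phrases such as “also known as” and keeps pairs seen more than once. The patterns, names, and frequency filter are hypothetical stand-ins, not the paper's actual systems.

```python
import re

# Hypothetical cue patterns suggesting two terms name the same substance.
# A real system would learn such patterns (supervised or partially
# supervised) or take them from a hand-built knowledge base.
PATTERNS = [
    re.compile(r"(\w[\w-]*) \(also known as (\w[\w-]*)\)"),
    re.compile(r"(\w[\w-]*), also called (\w[\w-]*)"),
    re.compile(r"(\w[\w-]*) \(hereafter (\w[\w-]*)\)"),
]

def extract_synonyms(sentences):
    """Collect candidate (term, synonym) pairs with simple frequency counts."""
    counts = {}
    for sentence in sentences:
        for pattern in PATTERNS:
            for a, b in pattern.findall(sentence):
                pair = tuple(sorted((a, b)))
                counts[pair] = counts.get(pair, 0) + 1
    # Keep pairs seen more than once as a crude confidence filter.
    return {pair: n for pair, n in counts.items() if n > 1}
```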

113 citations


Proceedings ArticleDOI
05 Mar 2003
TL;DR: Develops an automatic query-based technique for retrieving documents useful for extracting user-defined relations from large text databases; the technique adapts to new domains, databases, or target relations with minimal human effort.
Abstract: A wealth of information is hidden within unstructured text. This information is often best exploited in structured or relational form, which is suited for sophisticated query processing, for integration with relational databases, and for data mining. Current information extraction techniques extract relations from a text database by examining every document in the database, or use filters to select promising documents for extraction. The exhaustive scanning approach is not practical or even feasible for large databases, and the current filtering techniques require human involvement to maintain and to adapt to new databases and domains. We develop an automatic query-based technique to retrieve documents useful for the extraction of user-defined relations from large text databases; our technique can be adapted to new domains, databases, or target relations with minimal human effort. We report a thorough experimental evaluation over a large newspaper archive showing that we significantly improve the efficiency of the extraction process by focusing only on promising documents.
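
A minimal sketch of the query-based idea, under the assumption that an initial sample of documents has already been labeled useful or useless for the target relation: rank terms by a smoothed log-odds score and use the top terms as keyword queries. The function names and scoring are illustrative, not the paper's exact algorithm.

```python
import math
from collections import Counter

def learn_queries(useful_docs, useless_docs, max_queries=10):
    """Rank terms by smoothed log-odds of occurring in useful vs. useless
    documents; the top terms become single-keyword queries. A crude
    stand-in for learning promising queries from labeled documents."""
    useful = Counter(w for d in useful_docs for w in set(d.lower().split()))
    useless = Counter(w for d in useless_docs for w in set(d.lower().split()))

    def score(word):
        p = (useful[word] + 1) / (len(useful_docs) + 2)    # P(word | useful)
        q = (useless[word] + 1) / (len(useless_docs) + 2)  # P(word | useless)
        return math.log(p / q)

    vocabulary = set(useful) | set(useless)
    return sorted(vocabulary, key=score, reverse=True)[:max_queries]
```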

110 citations


Proceedings Article
01 Jan 2003
TL;DR: Develops a graph-based “reachability” metric that characterizes when an application’s query-based strategy will successfully “reach” all documents the application needs, complemented by an efficient sampling-based technique that accurately estimates the reachability associated with a text database and an application’s query-based strategy.
Abstract: Searchable text databases abound on the web. Applications that require access to such databases often resort to querying to extract relevant documents, for two main reasons. First, some text databases on the web are not “crawlable,” and hence the only way to retrieve their documents is via querying. Second, applications often require only a small fraction of a database’s contents, so retrieving relevant documents via querying is an attractive choice from an efficiency viewpoint, even for crawlable databases. Often an application’s query-based strategy starts with a small number of user-provided queries. Then, new queries are extracted, in an application-dependent way, from the documents in the initial query results, and the process iterates. The success of this common type of strategy relies on retrieved documents “contributing” new queries. If new documents fail to produce new queries, then the process might stall before all relevant documents are retrieved. In this paper, we develop a graph-based “reachability” metric that characterizes when an application’s query-based strategy will successfully “reach” all documents that the application needs. We complement our metric with an efficient sampling-based technique that accurately estimates the reachability associated with a text database and an application’s query-based strategy. We report preliminary experiments supporting the usefulness of our metric and the accuracy of the associated estimation technique over real text databases and for two applications.
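
The querying process can be viewed as traversal of a graph linking queries to the documents they retrieve and documents to the queries they contribute. Below is a minimal sketch that computes reachability by breadth-first search over that graph; `search` and `extract_queries` are application-supplied stand-ins, and this exhaustive traversal is roughly what the paper's sampling-based technique would estimate without visiting the whole database.

```python
from collections import deque

def reachability(seed_queries, search, extract_queries, relevant_docs):
    """Fraction of the relevant documents (a set) that a query-based
    strategy reaches. `search` maps a query to matching documents and
    `extract_queries` maps a document to the new queries it contributes;
    both are application-supplied callables."""
    reached = set()
    seen_queries = set(seed_queries)
    frontier = deque(seed_queries)
    while frontier:
        query = frontier.popleft()
        for doc in search(query):
            if doc in reached:
                continue
            reached.add(doc)
            for new_query in extract_queries(doc):
                if new_query not in seen_queries:
                    seen_queries.add(new_query)
                    frontier.append(new_query)
    return len(reached & relevant_docs) / len(relevant_docs)
```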

37 citations


Proceedings ArticleDOI
01 Dec 2003
TL;DR: Addresses the data scarcity problem by combining text and sequence analysis; the approach's effectiveness is demonstrated by predicting protein sub-cellular localization and determining localization-specific functional regions of these proteins.
Abstract: Recently presented protein sequence classification models can identify relevant regions of the sequence. This observation has many potential applications to detecting functional regions of proteins. However, identifying such sequence regions automatically is difficult in practice, as relatively few types of information have enough annotated sequences to support this analysis. Our approach addresses this data scarcity problem by combining text and sequence analysis. First, we train a text classifier over the explicit textual annotations available for some of the sequences in the dataset, and use the trained classifier to predict the class for the rest of the unlabeled sequences. We then train a joint sequence-text classifier over the text contained in the functional annotations of the sequences and the actual sequences in this larger, automatically extended dataset. Finally, we project the classifier onto the original sequences to determine the relevant regions of the sequences. We demonstrate the effectiveness of our approach by predicting protein sub-cellular localization and determining localization-specific functional regions of these proteins.
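
A minimal sketch of the first step (extending the labeled dataset via a text classifier), assuming scikit-learn is available; the bag-of-words Naive Bayes model and all names are illustrative assumptions, not the classifier the paper actually uses.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

def extend_dataset(annotated, unlabeled_texts):
    """Train a text classifier on (annotation_text, class_label) pairs and
    label the remaining sequences by their annotation text, yielding the
    larger dataset on which a joint sequence-text classifier is trained."""
    texts, labels = zip(*annotated)
    vectorizer = CountVectorizer()
    features = vectorizer.fit_transform(texts)
    classifier = MultinomialNB().fit(features, labels)
    predicted = classifier.predict(vectorizer.transform(unlabeled_texts))
    return list(zip(unlabeled_texts, predicted))
```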

28 citations


Book ChapterDOI
13 Oct 2003
TL;DR: This work proposes a novel, partially supervised approach for extracting user-defined relations from XML documents with unknown schema, which attempts to automatically capture the lexical and structural features that indicate the relevant portions of the input document, based on a few user-annotated examples.
Abstract: XML is becoming a prevalent format for data exchange. Many XML documents have complex schemas that are not always known and can vary widely between information sources and applications. In contrast, database applications rely mainly on the flat relational model. We propose a novel, partially supervised approach for extracting user-defined relations from XML documents with unknown schema. The extracted relations can be used directly by an RDBMS, or utilized for information integration or data mining tasks. Our method attempts to automatically capture the lexical and structural features that indicate the relevant portions of the input document, based on a few user-annotated examples. This information can then be used to extract the relation of interest from documents whose schemas potentially differ from those of the training examples. We present preliminary experiments showing that our method can extract the target relation from XML documents even in the presence of significant variations in the document schemas.
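
For intuition, a small sketch of one kind of structural feature: collect root-to-node tag paths that hold annotated values in the training examples and rank them by frequency; documents with different schemas would then be matched against the highest-ranked paths. All names are hypothetical, and the paper's features are richer (lexical as well as structural).

```python
import xml.etree.ElementTree as ET
from collections import Counter

def tag_paths(xml_text, target_value):
    """Root-to-node tag paths whose element text equals an annotated value."""
    paths = []

    def walk(node, prefix):
        path = prefix + "/" + node.tag
        if node.text and node.text.strip() == target_value:
            paths.append(path)
        for child in node:
            walk(child, path)

    walk(ET.fromstring(xml_text), "")
    return paths

def learn_paths(examples):
    """Rank structural features (paths) by how often they hold annotated
    values across the user-annotated (xml_text, value) examples."""
    counts = Counter(path for xml_text, value in examples
                     for path in tag_paths(xml_text, value))
    return [path for path, _ in counts.most_common()]
```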

8 citations


Proceedings ArticleDOI
09 Jun 2003
TL;DR: A wealth of information is hidden within unstructured text; it is often best utilized in structured or relational form, which is suited for sophisticated query processing, integration with relational databases, and data mining.
Abstract: Background: A wealth of information is hidden within unstructured text. This information is often best utilized in structured or relational form, which is suited for sophisticated query processing, for integration with relational databases, and for data mining. For example, newspaper and e-mail archives contain information that could be useful to analysts and government agencies. Information extraction systems produce a structured representation of the information that is “buried” in text documents. Unfortunately, processing each document is computationally expensive, and is not feasible for large text databases or for the web. With many databases exceeding millions of documents, processing time is becoming a bottleneck for exploiting information extraction technology.

5 citations