Topic

Document retrieval

About: Document retrieval is a research topic. Over its lifetime, 6,821 publications have been published on this topic, receiving 214,383 citations.


Papers
Proceedings Article
12 May 1998
TL;DR: This paper investigates the use of different acoustic and language models in the speech recognizer in an effort to improve phonetic recognition performance, and examines a variety of subword unit indexing terms to measure their ability to perform effective spoken document retrieval.
Abstract: This paper describes the development and application of a phonetic recognition system to the task of spoken document retrieval. The recognizer is used to generate phonetic transcriptions of the speech messages, which are then processed to produce subword unit representations for indexing and retrieval. Subword units are used as an alternative to word units generated by either keyword spotting or word recognition. We first investigate the use of different acoustic and language models in the speech recognizer in an effort to improve phonetic recognition performance. Then we examine a variety of subword unit indexing terms and measure their ability to perform effective spoken document retrieval. Finally, we look at some simple robust indexing and retrieval methods that take into account the characteristics of the recognition errors in an attempt to improve retrieval performance.

59 citations
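
The indexing idea in the abstract above can be illustrated with a minimal sketch. The abstract does not say which subword units were used, so overlapping phone n-grams are an assumption here, as are the function names, the toy phone symbols, and the simple count-based scoring.

```python
from collections import Counter


def phone_ngrams(phones, n=3):
    """Return overlapping phone n-grams to use as indexing terms (assumed unit type)."""
    return ["_".join(phones[i:i + n]) for i in range(len(phones) - n + 1)]


def build_index(transcripts, n=3):
    """Build an inverted index: n-gram term -> {doc_id: occurrence count}."""
    index = {}
    for doc_id, phones in transcripts.items():
        for term, count in Counter(phone_ngrams(phones, n)).items():
            index.setdefault(term, {})[doc_id] = count
    return index


def search(index, query_phones, n=3):
    """Score documents by summed counts of the query n-grams they contain."""
    scores = Counter()
    for term in phone_ngrams(query_phones, n):
        for doc_id, count in index.get(term, {}).items():
            scores[doc_id] += count
    return scores.most_common()
```

Because a query is matched on short phone sequences rather than whole words, a document can still be retrieved when the recognizer gets only part of a word right, which is the motivation the abstract gives for subword units.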

Proceedings Article
04 Jun 2006
TL;DR: Using TMI indexes that are only five times larger than corresponding linear-text indexes, phrase spotting was improved over searching top-1 transcripts by 25-35%, and relevance ranking by 14%, at only a small loss compared to unindexed lattice search.
Abstract: Large-scale web-search engines are generally designed for linear text. The linear text representation is suboptimal for audio search, where accuracy can be significantly improved if the search includes alternate recognition candidates, commonly represented as word lattices. This paper proposes a method for indexing word lattices that is suitable for large-scale web-search engines, requiring only limited code changes. The proposed method, called Time-based Merging for Indexing (TMI), first converts the word lattice to a posterior-probability representation and then merges word hypotheses with similar time boundaries to reduce the index size. Four alternative approximations are presented, which differ in index size and the strictness of the phrase-matching constraints. Results are presented for three types of typical web audio content, podcasts, video clips, and online lectures, for phrase spotting and relevance ranking. Using TMI indexes that are only five times larger than corresponding linear-text indexes, phrase spotting was improved over searching top-1 transcripts by 25-35%, and relevance ranking by 14%, at only a small loss compared to unindexed lattice search.

59 citations
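
The merging step described in the abstract above can be sketched roughly as follows. This is not one of the paper's four approximations: it assumes word posteriors are already computed, and the fixed-width time bucket, the `merge_by_time` name, and the cap at 1.0 are illustrative choices.

```python
from collections import defaultdict


def merge_by_time(hypotheses, bucket=0.1):
    """Merge word hypotheses whose start times fall in the same time bucket.

    hypotheses: iterable of (word, start_sec, end_sec, posterior).
    Returns {slot_number: {word: merged_posterior}} ordered by slot.
    """
    merged = defaultdict(float)
    for word, start, _end, posterior in hypotheses:
        slot = round(start / bucket)        # assumed quantization of the start time
        merged[(slot, word)] += posterior   # same word in same slot: add posteriors

    slots = defaultdict(dict)
    for (slot, word), posterior in merged.items():
        slots[slot][word] = min(posterior, 1.0)  # keep the value a valid probability
    return dict(sorted(slots.items()))
```

Each slot then holds a short list of competing words with weights, which resembles the position-plus-term layout of a text index while keeping the alternate recognition candidates.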

01 Dec 2008
TL;DR: This paper presents a novel approach to robust, content-based retrieval of digital music that formulates the hashing and retrieval problems analogously to those of text retrieval and leverages established results for this application.
Abstract: This paper presents a novel approach to robust, content-based retrieval of digital music. We formulate the hashing and retrieval problems analogously to those of text retrieval and leverage established results for this unique application. Accordingly, songs are represented as a "Bag-of-Audio-Words" and similarity calculations follow directly from the well-known Vector Space model [12]. We evaluate our system on a 4000-song data set to demonstrate its practical applicability, and evaluation shows our technique to be robust to a variety of signal distortions. Most interestingly, the system is capable of matching studio recordings to live recordings of the same song with high accuracy.

59 citations
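
A minimal sketch of the bag-of-audio-words comparison mentioned in the abstract above, assuming each song has already been reduced to a list of integer audio-word IDs (the quantization step is not shown). The TF-IDF weighting and cosine similarity are standard Vector Space model choices, not details confirmed by the abstract, and the function names are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(songs):
    """songs: {name: [audio_word_id, ...]} -> {name: {audio_word_id: weight}}."""
    n = len(songs)
    # Document frequency: number of songs containing each audio word.
    df = Counter(w for words in songs.values() for w in set(words))
    vectors = {}
    for name, words in songs.items():
        tf = Counter(words)
        vectors[name] = {w: c * (math.log(n / df[w]) + 1.0) for w, c in tf.items()}
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b[w] for w, weight in a.items() if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Two recordings that share many audio words, such as a studio and a live version, score close to 1, while songs with no words in common score 0.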

Journal Article
TL;DR: This paper presents an integrated approach based on the concept of specialized construction data models, with a detailed account of how the approach was followed to integrate a quality assurance document management system within a steel fabrication company.
Abstract: Electronic document management (EDM) systems have revolutionized the construction document storage and retrieval process. An electronically managed document consists of a computer representation of the main document body, usually in the form of a bitmap, and a reference or indexing structure used to retrieve the document. In construction, the reference data includes information that relates the document to certain aspects of the constructed facility and the construction company. It is argued that this relationship needs to be represented explicitly and in an integrated manner as opposed to the approach used by traditional EDM systems. This paper presents an integrated approach based on the concept of specialized construction data models. A detailed account of how this approach was followed to integrate a quality assurance document management system within a steel fabrication company is also presented.

59 citations

Journal Article
TL;DR: The HITS-PR-HHblits predictor is a protocol for protein remote homology detection using different sets of programs, which will become a very useful computational tool for proteome analysis.
Abstract: As one of the most important fundamental problems in protein sequence analysis, protein remote homology detection is critical for both theoretical research (protein structure and function studies) and real-world applications (drug design). Although several computational predictors have been proposed, their detection performance is still limited. In this study, we treat protein remote homology detection as a document retrieval task, where proteins are treated as documents and the aim is to find the documents in a database that are most closely related to the query documents. A protein similarity network was constructed based on the true labels of proteins in the database, and the query proteins were then connected into the network based on the similarity scores calculated by three ranking methods: PSI-BLAST, Hmmer and HHblits. The PageRank algorithm and the Hyperlink-Induced Topic Search (HITS) algorithm were each performed on this network to move the homologous proteins of query proteins to the neighbors of the query proteins in the network. Finally, the PageRank and HITS algorithms were combined, and a predictor called HITS-PR-HHblits was proposed to further improve the predictive performance. Tested on the SCOP and SCOPe benchmark datasets, the experimental results showed that the proposed protocols outperformed other state-of-the-art methods. For the convenience of most experimental scientists, a web server for HITS-PR-HHblits was established at http://bioinformatics.hitsz.edu.cn/HITS-PR-HHblits, through which users can easily get the results without the need to go through the mathematical details. The HITS-PR-HHblits predictor is a protocol for protein remote homology detection using different sets of programs, which will become a very useful computational tool for proteome analysis.

58 citations
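
To make the link-analysis step in the abstract above concrete, here is a minimal sketch of plain PageRank by power iteration on a small weighted graph. It is not the authors' HITS-PR-HHblits method: the graph layout, the damping factor of 0.85, the iteration count, and the `pagerank` name are all assumptions for illustration.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Plain PageRank by power iteration.

    edges: {node: {neighbor: weight}} describing directed, weighted links.
    Returns {node: rank}, with ranks summing to 1.
    """
    nodes = set(edges) | {v for nbrs in edges.values() for v in nbrs}
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        new_rank = {u: (1.0 - damping) / n for u in nodes}
        for u in nodes:
            nbrs = edges.get(u, {})
            total = sum(nbrs.values())
            if total == 0:
                # Dangling node: spread its rank uniformly over all nodes.
                for v in nodes:
                    new_rank[v] += damping * rank[u] / n
            else:
                # Pass rank along each link in proportion to its weight.
                for v, weight in nbrs.items():
                    new_rank[v] += damping * rank[u] * weight / total
        rank = new_rank
    return rank
```

In a setting like the one described, nodes would be proteins and edge weights would be similarity scores, so database proteins strongly linked to a query end up with higher rank than weakly linked ones.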


Network Information
Related Topics (5)
Web page: 50.3K papers, 975.1K citations, 81% related
Metadata: 43.9K papers, 642.7K citations, 79% related
Recommender system: 27.2K papers, 598K citations, 79% related
Ontology (information science): 57K papers, 869.1K citations, 78% related
Natural language: 31.1K papers, 806.8K citations, 77% related
Performance Metrics
No. of papers in the topic in previous years
Year    Papers
2023    9
2022    39
2021    107
2020    130
2019    144
2018    111