Topic

Document retrieval

About: Document retrieval is a research topic. Over its lifetime, 6,821 publications have been published within this topic, receiving 214,383 citations.

Papers

Proceedings Article

Phonetic recognition for spoken document retrieval

Kenney Ng¹, Victor W. Zue¹

Massachusetts Institute of Technology¹

12 May 1998

TL;DR: This paper investigates the use of different acoustic and language models in the speech recognizer in an effort to improve phonetic recognition performance, and examines a variety of subword unit indexing terms to measure their ability to perform effective spoken document retrieval.

Abstract: This paper describes the development and application of a phonetic recognition system to the task of spoken document retrieval. The recognizer is used to generate phonetic transcriptions of the speech messages, which are then processed to produce subword unit representations for indexing and retrieval. Subword units are used as an alternative to word units generated by either keyword spotting or word recognition. We first investigate the use of different acoustic and language models in the speech recognizer in an effort to improve phonetic recognition performance. Then we examine a variety of subword unit indexing terms and measure their ability to perform effective spoken document retrieval. Finally, we look at some simple robust indexing and retrieval methods that take into account the characteristics of the recognition errors in an attempt to improve retrieval performance.

59 citations
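
Illustration: the abstract above describes turning phonetic transcriptions into subword unit indexing terms. A minimal sketch of one such scheme is given below, assuming overlapping phone trigrams as the subword units and a simple count of matching terms as the score. The function names, the phone strings, and the scoring rule are all assumptions made for illustration; they are not taken from the paper.

```python
from collections import Counter, defaultdict


def phone_ngrams(phones, n=3):
    """Return the overlapping phone n-grams of a phone sequence as tuples."""
    return [tuple(phones[i:i + n]) for i in range(len(phones) - n + 1)]


def build_index(transcripts, n=3):
    """Build an inverted index: n-gram term -> {doc_id: occurrence count}."""
    index = defaultdict(Counter)
    for doc_id, phones in transcripts.items():
        for term in phone_ngrams(phones, n):
            index[term][doc_id] += 1
    return index


def search(index, query_phones, n=3):
    """Score each document by how often the query's n-grams occur in it."""
    scores = Counter()
    for term in phone_ngrams(query_phones, n):
        for doc_id, count in index.get(term, {}).items():
            scores[doc_id] += count
    return scores.most_common()


# Hypothetical phonetic transcripts (made-up phone strings, not recognizer output).
transcripts = {
    "msg1": "b ao s t ax n w eh dh er".split(),
    "msg2": "n uw y ao r k t r ae f ih k".split(),
}
index = build_index(transcripts)
results = search(index, "w eh dh er".split())
print(results)
```

Because matching is done on short phone sequences rather than whole words, a query can still hit a message when the recognizer gets some phones wrong elsewhere in the word, which is the kind of robustness the abstract is concerned with.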

Proceedings Article

Towards Spoken-Document Retrieval for the Internet: Lattice Indexing For Large-Scale Web-Search Architectures

Zhengyu Zhou¹, Peng Yu², Ciprian Chelba², Frank Seide²

The Chinese University of Hong Kong¹, Microsoft²

04 Jun 2006

TL;DR: Using TMI indexes that are only five times larger than corresponding linear-text indexes, phrase spotting was improved over searching top-1 transcripts by 25-35%, and relevance ranking by 14%, at only a small loss compared to unindexed lattice search.

Abstract: Large-scale web-search engines are generally designed for linear text. The linear text representation is suboptimal for audio search, where accuracy can be significantly improved if the search includes alternate recognition candidates, commonly represented as word lattices. This paper proposes a method for indexing word lattices that is suitable for large-scale web-search engines, requiring only limited code changes. The proposed method, called Time-based Merging for Indexing (TMI), first converts the word lattice to a posterior-probability representation and then merges word hypotheses with similar time boundaries to reduce the index size. Four alternative approximations are presented, which differ in index size and the strictness of the phrase-matching constraints. Results are presented for three types of typical web audio content, podcasts, video clips, and online lectures, for phrase spotting and relevance ranking. Using TMI indexes that are only five times larger than corresponding linear-text indexes, phrase spotting was improved over searching top-1 transcripts by 25-35%, and relevance ranking by 14%, at only a small loss compared to unindexed lattice search.

59 citations
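
Illustration: the abstract above says that TMI merges word hypotheses with similar time boundaries after converting the lattice to posterior probabilities. The sketch below shows only that general merging idea, assuming the posteriors are already computed. The `Hyp` record, the `tol` threshold, and the rule of summing posteriors for the same word are assumptions made for illustration; this is not one of the paper's four approximations.

```python
from dataclasses import dataclass


@dataclass
class Hyp:
    """One word hypothesis with time boundaries (seconds) and a posterior."""
    word: str
    start: float
    end: float
    posterior: float


def merge_hypotheses(hyps, tol=0.05):
    """Merge same-word hypotheses whose start and end times differ by <= tol.

    Assumption: merged entries keep the boundaries of the first hypothesis
    and accumulate posterior mass, capped at 1.0.
    """
    merged = []
    for h in sorted(hyps, key=lambda x: (x.word, x.start, x.end)):
        last = merged[-1] if merged else None
        if (last is not None and last.word == h.word
                and abs(last.start - h.start) <= tol
                and abs(last.end - h.end) <= tol):
            last.posterior = min(1.0, last.posterior + h.posterior)
        else:
            merged.append(Hyp(h.word, h.start, h.end, h.posterior))
    return merged


# Hypothetical lattice arcs with made-up posteriors.
hyps = [
    Hyp("retrieval", 1.00, 1.50, 0.5),
    Hyp("retrieval", 1.02, 1.52, 0.25),
    Hyp("retrieve", 1.00, 1.30, 0.25),
]
merged = merge_hypotheses(hyps)
for m in merged:
    print(m)
```

Here the two near-identical "retrieval" arcs collapse into one index entry, which is how this kind of merging reduces index size while keeping the alternate candidate "retrieve" searchable.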

A text retrieval approach to content-based audio retrieval

Matthew Riley¹, Eric Heinen¹, Joydeep Ghosh¹

University of Texas at Austin¹

01 Dec 2008

TL;DR: This paper presents a novel approach to robust, content-based retrieval of digital music, formulating the hashing and retrieval problems analogously to those of text retrieval and leveraging established results for this unique application.

Abstract: This paper presents a novel approach to robust, content-based retrieval of digital music. We formulate the hashing and retrieval problems analogously to those of text retrieval and leverage established results for this unique application. Accordingly, songs are represented as a "Bag-of-Audio-Words" and similarity calculations follow directly from the well-known Vector Space model [12]. We evaluate our system on a 4000-song data set to demonstrate its practical applicability, and evaluation shows our technique to be robust to a variety of signal distortions. Most interestingly, the system is capable of matching studio recordings to live recordings of the same song with high accuracy.

59 citations
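
Illustration: the abstract above represents songs as a "Bag-of-Audio-Words" and compares them with the Vector Space model. A minimal sketch of that comparison follows, assuming each song has already been quantized into integer audio-word IDs and using plain TF-IDF weights with cosine similarity. The song names, the IDs, and the specific weighting are assumptions made for illustration; the paper's own hashing and weighting are not reproduced here.

```python
import math
from collections import Counter


def tfidf_vectors(songs):
    """Return (TF-IDF vector per song, IDF table) for lists of audio-word IDs."""
    n = len(songs)
    df = Counter()
    for words in songs.values():
        df.update(set(words))
    idf = {w: math.log(n / df[w]) for w in df}
    vecs = {}
    for name, words in songs.items():
        tf = Counter(words)
        vecs[name] = {w: tf[w] * idf[w] for w in tf}
    return vecs, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def rank(query_words, songs):
    """Rank songs by cosine similarity to the query's TF-IDF vector."""
    vecs, idf = tfidf_vectors(songs)
    tf = Counter(w for w in query_words if w in idf)  # drop unseen audio words
    q = {w: tf[w] * idf[w] for w in tf}
    scores = {name: cosine(q, v) for name, v in vecs.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical audio-word IDs; the query mimics a distorted take of "studio_a".
songs = {
    "studio_a": [3, 3, 5, 7, 9, 12],
    "studio_b": [1, 4, 4, 5, 8, 15],
}
ranked = rank([3, 5, 7, 9, 9, 20], songs)
print(ranked)
```

Order within a song is ignored entirely, so the match depends only on which audio words occur and how often. That is the property that lets a text-retrieval formulation tolerate local distortions.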

Journal Article

Integrating document management with project and company data

Dany Hajjar, Simaan AbouRizk

01 Jan 2000 - Journal of Computing in Civil Engineering

TL;DR: An integrated approach based on the concept of specialized construction data models is presented, along with a detailed account of how this approach was followed to integrate a quality assurance document management system within a steel fabrication company.

Abstract: Electronic document management (EDM) systems have revolutionized the construction document storage and retrieval process. An electronically managed document consists of a computer representation of the main document body, usually in the form of a bitmap, and a reference or indexing structure used to retrieve the document. In construction, the reference data includes information that relates the document to certain aspects of the constructed facility and the construction company. It is argued that this relationship needs to be represented explicitly and in an integrated manner as opposed to the approach used by traditional EDM systems. This paper presents an integrated approach based on the concept of specialized construction data models. A detailed account of how this approach was followed to integrate a quality assurance document management system within a steel fabrication company is also presented.

59 citations

Journal Article

HITS-PR-HHblits: protein remote homology detection by combining PageRank and Hyperlink-Induced Topic Search.

Bin Liu¹, Shuangyan Jiang¹, Quan Zou²

Harbin Institute of Technology¹, University of Electronic Science and Technology of China²

07 Nov 2018 - Briefings in Bioinformatics

TL;DR: The HITS-PR-HHblits predictor is a protocol for protein remote homology detection using different sets of programs, which will become a very useful computational tool for proteome analysis.

Abstract: As one of the most important fundamental problems in protein sequence analysis, protein remote homology detection is critical for both theoretical research (protein structure and function studies) and real-world applications (drug design). Although several computational predictors have been proposed, their detection performance is still limited. In this study, we treat protein remote homology detection as a document retrieval task, where proteins are considered as documents and the aim is to find the documents in a database that are highly related to the query documents. A protein similarity network was constructed based on the true labels of proteins in the database, and the query proteins were then connected into the network based on the similarity scores calculated by three ranking methods, including PSI-BLAST, Hmmer and HHblits. The PageRank algorithm and Hyperlink-Induced Topic Search (HITS) algorithm were respectively performed on this network to move the homologous proteins of query proteins to the neighbors of the query proteins in the network. Finally, PageRank and HITS algorithms were combined, and a predictor called HITS-PR-HHblits was proposed to further improve the predictive performance. Tested on the SCOP and SCOPe benchmark datasets, the experimental results showed that the proposed protocols outperformed other state-of-the-art methods. For the convenience of most experimental scientists, a web server for HITS-PR-HHblits was established at http://bioinformatics.hitsz.edu.cn/HITS-PR-HHblits, by which the users can easily get the results without the need to go through the mathematical details. The HITS-PR-HHblits predictor is a protocol for protein remote homology detection using different sets of programs, which will become a very useful computational tool for proteome analysis.

58 citations
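
Illustration: the abstract above runs PageRank and HITS over a protein similarity network. The sketch below shows textbook power-iteration versions of both algorithms on a tiny unweighted directed graph. The node names and edges are hypothetical, edge weights from similarity scores are left out, and the way the paper combines the two algorithms is not stated in the abstract, so each is simply run on its own here.

```python
import math


def pagerank(graph, damping=0.85, iters=50):
    """Power-iteration PageRank; graph maps every node to its out-neighbors."""
    nodes = list(graph)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        new = {v: (1.0 - damping) / n for v in nodes}
        for v, outs in graph.items():
            # Dangling nodes spread their rank evenly over all nodes.
            targets = outs if outs else nodes
            share = damping * rank[v] / len(targets)
            for u in targets:
                new[u] += share
        rank = new
    return rank


def hits(graph, iters=50):
    """Return (hub, authority) scores, L2-normalized after each update."""
    nodes = list(graph)
    hub = {v: 1.0 for v in nodes}
    auth = {v: 1.0 for v in nodes}
    for _ in range(iters):
        # Authority: sum of hub scores of nodes that link to v.
        auth = {v: sum(hub[u] for u in nodes if v in graph[u]) for v in nodes}
        norm = math.sqrt(sum(a * a for a in auth.values())) or 1.0
        auth = {v: a / norm for v, a in auth.items()}
        # Hub: sum of authority scores of the nodes v links to.
        hub = {v: sum(auth[u] for u in graph[v]) for v in nodes}
        norm = math.sqrt(sum(h * h for h in hub.values())) or 1.0
        hub = {v: h / norm for v, h in hub.items()}
    return hub, auth


# Hypothetical network: a query protein linked to candidate proteins.
graph = {
    "query": ["p1", "p2"],
    "p1": ["p2", "query"],
    "p2": ["p1", "query"],
    "p3": ["p1"],
}
ranks = pagerank(graph)
hub, auth = hits(graph)
print(sorted(ranks.items(), key=lambda kv: kv[1], reverse=True))
print(auth)
```

In this toy graph "p3" has no incoming links, so it receives only the baseline PageRank mass and an authority score of zero, while the mutually linked nodes reinforce each other.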

Network Information

Performance

Metrics

Papers: 6,866

Citations: 224,605

No. of papers in the topic in previous years
Year	Papers
2023	9
2022	39
2021	107
2020	130
2019	144
2018	111
