Topic

Document retrieval

About: Document retrieval is a research topic. Over its lifetime, 6,821 publications have appeared within this topic, receiving 214,383 citations.


Papers
Journal ArticleDOI
TL;DR: Describes an all-computer document retrieval system that can find documents related to a request even when they are not indexed by the exact terms of the request, and can present them in the order of their relevance to the request.
Abstract: This paper describes an all computer document retrieval system which can find documents related to a request even though they may not be indexed by the exact terms of the request, and can present these documents in the order of their relevance to the request. The key to this ability lies in the application of a statistical formula by which the computer calculates the degree of association between pairs of index terms. With proper manipulation of these associations (entirely within the machine) a vocabulary of synonyms, near synonyms and other words closely related to any given term or group of terms is derived. Such a vocabulary related to a group of request terms is believed to be a much more powerful tool for selecting documents from a collection than has been available heretofore. By noting the number of matching terms between this extended list of request terms and the terms used to index a document, and with due regard for their degree of association, documents are selected by the computer and arranged in the order of their relevance to the request. Like all other documentalists who are operating large coordinate indexes, we are searching for better ways to exploit this type of information system. In our library we have already eliminated the time-consuming job of posting document numbers manually by enlisting the aid of a 705 computer. (The computer periodically prepares revised posting cards to replace the outdated ones.) Now we are searching for better solutions to our retrieval problems. One obvious retrieval problem in any large system is the time required to "coordinate" heavily posted terms. We are convinced we must mechanize if we are to allow our collection to grow indefinitely. A second problem is the retrieval of so many documents related to a single request that the customer finds it difficult to decide which to examine first. Since he has no precise means of determining which document is most closely related to a request, we have been forced into assisting him to use somewhat arbitrary or subjective means. The date of the document is sometimes used as a relevance criterion with the hope that the most recent document will be the most pertinent, or the name of the author is used with the hope that a known author will answer the request better than an unknown one. The pitfalls of such criteria are apparent. The third, and …
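The association-based retrieval this abstract describes can be illustrated with a minimal Python sketch. The abstract does not give its statistical formula, so the Dice coefficient over term co-occurrence is used here purely as a stand-in assumption; the function names, the threshold, and the sample index are also hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def term_associations(index):
    """index maps doc_id -> set of index terms.

    Returns assoc[a][b], the Dice coefficient of terms a and b.
    (Assumption: Dice stands in for the paper's unnamed formula.)
    """
    doc_freq = defaultdict(int)
    co_freq = defaultdict(int)
    for terms in index.values():
        for t in terms:
            doc_freq[t] += 1
        for a, b in combinations(sorted(terms), 2):
            co_freq[(a, b)] += 1
    assoc = defaultdict(dict)
    for (a, b), n in co_freq.items():
        w = 2 * n / (doc_freq[a] + doc_freq[b])
        assoc[a][b] = w
        assoc[b][a] = w
    return assoc


def expand_request(request, assoc, threshold=0.5):
    """Extend the request with associated terms, weighted by association."""
    weights = {t: 1.0 for t in request}
    for t in request:
        for other, w in assoc.get(t, {}).items():
            if w >= threshold:
                weights[other] = max(weights.get(other, 0.0), w)
    return weights


def rank(index, weights):
    """Score each document by the summed weights of its matching terms."""
    scores = {d: sum(weights.get(t, 0.0) for t in terms)
              for d, terms in index.items()}
    hits = [d for d in scores if scores[d] > 0]
    return sorted(hits, key=lambda d: (-scores[d], d))


index = {
    "d1": {"aircraft", "wing"},
    "d2": {"airplane", "wing"},
    "d3": {"aircraft", "airplane"},
    "d4": {"soil"},
}
weights = expand_request({"aircraft"}, term_associations(index))
ranking = rank(index, weights)
```

In this toy index, d2 is not indexed under "aircraft" but is still retrieved, ranked below d1 and d3, because its terms are associated with the request term.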

100 citations

Journal ArticleDOI
TL;DR: In this article, the authors show how to use wavelet trees to solve fundamental algorithmic problems such as range quantile queries, range next value queries, and range intersection queries.
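As a rough illustration of a range quantile query on a wavelet tree, here is a minimal Python sketch. It is not the authors' implementation: plain prefix-count lists stand in for the rank-supporting bitvectors a practical wavelet tree would use, the input is assumed to be a non-empty list of integers, and the class and method names are hypothetical.

```python
class WaveletTree:
    """Wavelet tree over a sequence of integers (illustrative sketch)."""

    def __init__(self, seq, lo=None, hi=None):
        if lo is None:
            lo, hi = min(seq), max(seq)
        self.lo, self.hi = lo, hi
        self.left = self.right = None
        if lo == hi or not seq:
            return  # leaf, or an empty node that no valid query reaches
        mid = (lo + hi) // 2
        # prefix[i] = how many of the first i symbols go to the left child
        self.prefix = [0]
        for x in seq:
            self.prefix.append(self.prefix[-1] + (x <= mid))
        self.left = WaveletTree([x for x in seq if x <= mid], lo, mid)
        self.right = WaveletTree([x for x in seq if x > mid], mid + 1, hi)

    def quantile(self, l, r, k):
        """Return the k-th smallest value (0-based) in seq[l:r].

        Requires 0 <= l < r <= len(seq) and 0 <= k < r - l.
        """
        if self.lo == self.hi:
            return self.lo
        in_left = self.prefix[r] - self.prefix[l]
        if k < in_left:
            # The answer is among the symbols routed to the left child.
            return self.left.quantile(self.prefix[l], self.prefix[r], k)
        # Otherwise skip the in_left smaller symbols and go right.
        return self.right.quantile(l - self.prefix[l],
                                   r - self.prefix[r],
                                   k - in_left)


seq = [3, 1, 4, 1, 5, 9, 2, 6]
wt = WaveletTree(seq)
third_smallest = wt.quantile(4, 8, 2)  # third smallest of seq[4:8]
```

Each query descends one root-to-leaf path, so its cost depends on the height of the tree (logarithmic in the value range) rather than on the length of the queried range.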

100 citations

Patent
Robin Green
31 Jul 2001
TL;DR: A character look-up table, a boolean array with one-bit elements, is processed in groups whose size corresponds to the maximum bit processing count of the computer, effectively culling non-matching words simultaneously.
Abstract: A computer-operated document retrieval system includes a lexicon of words contained in system documents, and a document look-up table that relates words by unique word numbers to the documents. A word look-up table identifies sets of words with common characteristics, specifically prefix value and word length, and a character look-up table identifies whether any word contains a specified character. A target set generator accesses the word look-up table to compose a target word set with characteristics corresponding to the search string. A refining module reduces the target set by selecting a set of characters from the search string, and accessing the character look-up table to identify which target words use the character set. The character look-up table is a boolean array with one bit elements that are processed in groups whose size corresponds to the maximum bit processing count of the computer, effectively culling non-matching words simultaneously. A string comparison module determines whether any word remaining in the target set matches the search string. The system quickly executes various searches, including prefix, exact match, wildcard, and fuzzy searches.
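The look-up tables in this abstract can be sketched loosely in Python. This is an illustrative reading, not the patented design: Python integers serve as the one-bit boolean arrays (so a single `&` culls many words at once), the "prefix value" is assumed to be the first character, `?` is assumed to be a single-character wildcard, and the document look-up table is omitted. All names are hypothetical.

```python
from collections import defaultdict


class Lexicon:
    """Word and character look-up tables stored as bitmasks."""

    def __init__(self, words):
        # Assumes non-empty words; word number = position in sorted list.
        self.words = sorted(set(words))
        self.by_shape = defaultdict(int)  # (first char, length) -> bitmask
        self.by_char = defaultdict(int)   # character -> bitmask
        for n, w in enumerate(self.words):
            self.by_shape[(w[0], len(w))] |= 1 << n
            for c in set(w):
                self.by_char[c] |= 1 << n

    def search(self, pattern):
        """Match a non-empty pattern; '?' matches any one character.

        Assumes the first character of the pattern is a literal.
        """
        # Target set: words with the same prefix value and length.
        target = self.by_shape.get((pattern[0], len(pattern)), 0)
        # Refine: keep only words that contain every literal character.
        for c in set(pattern) - {"?"}:
            target &= self.by_char.get(c, 0)
        # String comparison on whatever survives the culling.
        hits = []
        n = 0
        while target:
            if target & 1:
                word = self.words[n]
                if all(p in ("?", c) for p, c in zip(pattern, word)):
                    hits.append(word)
            target >>= 1
            n += 1
        return hits


lex = Lexicon(["cat", "cot", "cut", "car", "dog", "coat"])
wildcard_hits = lex.search("c?t")
exact_hits = lex.search("cat")
```

The bitmask steps only narrow the candidate set; the final per-word comparison is still needed because containing the right characters does not guarantee they appear in the right positions.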

100 citations

Journal ArticleDOI
TL;DR: An approach to document enrichment is presented, which consists of developing and integrating formal knowledge models with archives of documents, to provide intelligent knowledge retrieval and (possibly) additional knowledge-intensive services, beyond what is currently available using “standard” information retrieval and search facilities.
Abstract: In this paper, we present an approach to document enrichment, which consists of developing and integrating formal knowledge models with archives of documents, to provide intelligent knowledge retrieval and (possibly) additional knowledge-intensive services, beyond what is currently available using “standard” information retrieval and search facilities. Our approach is ontology-driven, in the sense that the construction of the knowledge model is carried out in a top-down fashion, by populating a given ontology, rather than in a bottom-up fashion, by annotating a particular document. In this paper, we give an overview of the approach and we examine the various types of issues (e.g. modelling, organizational and user interface issues) which need to be tackled to effectively deploy our approach in the workplace. In addition, we also discuss a number of technologies we have developed to support ontology-driven document enrichment and we illustrate our ideas in the domains of electronic news publishing, scholarly discourse and medical guidelines.

100 citations

Patent
15 Sep 2015
TL;DR: A retrieval request for one or more documents, containing search terms descriptive of those documents, can be processed by identifying a set of candidate documents tagged with subjects, then using affinity values to adjust the aggregate score for the terms in the dictionaries.
Abstract: Techniques for managing big data include retrieval using per-subject dictionaries having multiple levels of sub-classification hierarchy within the subject. Entries may include subject-determining-power (SDP) scores that provide an indication of the descriptive power of the entry term with respect to the subject of the dictionary containing the term. The same term may have entries in multiple dictionaries with different SDP scores in each of the dictionaries. A retrieval request for one or more documents containing search terms descriptive of the one or more documents can be processed by identifying a set of candidate documents tagged with subjects, i.e., identifiers of per-subject dictionaries having entries corresponding to a search term, then using affinity values to adjust the aggregate score for the terms in the dictionaries. Documents are then selected for best match to the subject based on the adjusted scores. Alternatively, the adjustment may be performed after selecting the documents by re-ordering them according to adjusted scores.
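One possible reading of this scoring scheme is sketched below in Python. The abstract does not define how affinity values are computed or applied, so everything here is an assumption made for illustration: affinities are taken to be pairwise subject weights that add a share of a related subject's score, a document is scored by its best matching subject, and all names, data, and scores are hypothetical. The sub-classification hierarchy is left out.

```python
def subject_scores(query_terms, dictionaries):
    """Aggregate SDP score of the query terms in each subject dictionary."""
    return {s: sum(d.get(t, 0.0) for t in query_terms)
            for s, d in dictionaries.items()}


def adjust(scores, affinity):
    """Assumed adjustment: add affinity-weighted scores of other subjects."""
    return {s: v + sum(affinity.get((s, o), 0.0) * scores[o]
                       for o in scores if o != s)
            for s, v in scores.items()}


def retrieve(query_terms, dictionaries, doc_subjects, affinity):
    """Rank documents tagged with a subject whose dictionary has a query term."""
    matched = {s for s, d in dictionaries.items()
               if any(t in d for t in query_terms)}
    adjusted = adjust(subject_scores(query_terms, dictionaries), affinity)
    scored = {doc: max(adjusted[s] for s in subjects & matched)
              for doc, subjects in doc_subjects.items()
              if subjects & matched}
    return sorted(scored, key=lambda d: (-scored[d], d))


# Hypothetical per-subject dictionaries: term -> SDP score.
dictionaries = {
    "astronomy": {"star": 0.75, "orbit": 0.5},
    "film": {"star": 0.5, "director": 1.0},
    "physics": {"orbit": 0.5},
}
doc_subjects = {
    "d1": {"film"},
    "d2": {"physics"},
    "d3": {"astronomy"},
    "d4": {"cooking"},
}
affinity = {("astronomy", "physics"): 0.5, ("physics", "astronomy"): 0.5}

ranking = retrieve(["star", "orbit"], dictionaries, doc_subjects, affinity)
baseline = retrieve(["star", "orbit"], dictionaries, doc_subjects, {})
```

Here the same term, "star", carries a different SDP score in two dictionaries, and the assumed affinity between astronomy and physics moves the physics document ahead of the film document, which it would otherwise tie with.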

100 citations


Network Information
Related Topics (5)
Web page: 50.3K papers, 975.1K citations, 81% related
Metadata: 43.9K papers, 642.7K citations, 79% related
Recommender system: 27.2K papers, 598K citations, 79% related
Ontology (information science): 57K papers, 869.1K citations, 78% related
Natural language: 31.1K papers, 806.8K citations, 77% related
Performance
Metrics
No. of papers in the topic in previous years
Year    Papers
2023    9
2022    39
2021    107
2020    130
2019    144
2018    111