Topic

Document retrieval

About: Document retrieval is a research topic. Over its lifetime, 6,821 publications have been published on this topic, receiving 214,383 citations.


Papers
Journal Article
TL;DR: A novel approach that automatically retrieves keywords and then uses genetic algorithms to adapt the keyword weights; it is faster and uses less memory than the PAT-tree based approach.
Abstract: This paper proposes a novel approach to automatically retrieve keywords and then uses genetic algorithms to adapt the keyword weights. One of the contributions of the paper is to combine the Bigram model (Chen, A., He, J., Xu, L., Gey, F. C., & Meggs, J. 1997. Chinese text retrieval without using a dictionary, ACM SIGIR’97, Philadelphia, PA, USA, pp. 42–49; Yang, Y.-Y., Chang, J.-S., & Chen, K.-J. 1993. Document automatic classification and ranking, Master thesis, Department of Computer Science, National Tsing Hua University) and the PAT-tree structure (Chien, L.-F., Huang, T.-I., & Chien, M.-C. 1997. Pat-tree-based keyword extraction for Chinese information retrieval, ACM SIGIR’97, Philadelphia, PA, USA, pp. 50–59) to retrieve keywords. The approach extracts bigrams from documents and uses the bigrams to construct a PAT-tree to retrieve keywords. The proposed approach can retrieve any type of keyword, such as technical keywords and a person’s name. The effectiveness of the proposed approach is demonstrated by comparing the keywords it finds with those found by the PAT-tree based approach. This comparison reveals that our keyword retrieval approach is as accurate as the PAT-tree based approach, yet our approach is faster and uses less memory. The study then applies genetic algorithms to tune the weight of retrieved keywords. Moreover, several documents obtained from web sites are tested and experimental results are compared with those of other approaches, indicating that the proposed approach is highly promising for applications.

132 citations
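
The bigram extraction step mentioned in the abstract above can be illustrated with a minimal sketch. This is not the paper's implementation: it only counts overlapping character bigrams and keeps the frequent ones as keyword candidates, and it does not build a PAT-tree or tune weights with a genetic algorithm. The function names and the `min_count` threshold are assumptions made for illustration.

```python
from collections import Counter


def extract_bigrams(text: str) -> Counter:
    """Count overlapping character bigrams in a string.

    Pairs that contain whitespace or punctuation are skipped, so a
    bigram never spans a word or sentence boundary.
    """
    counts = Counter()
    for first, second in zip(text, text[1:]):
        if first.isalnum() and second.isalnum():
            counts[first + second] += 1
    return counts


def candidate_keywords(documents, min_count=2):
    """Return bigrams seen at least `min_count` times across all documents.

    `min_count` is a hypothetical threshold, not a value from the paper.
    """
    total = Counter()
    for doc in documents:
        total.update(extract_bigrams(doc))
    return {bigram: n for bigram, n in total.items() if n >= min_count}


if __name__ == "__main__":
    docs = ["信息检索", "信息系统"]
    print(candidate_keywords(docs))
```

Character bigrams suit text without explicit word boundaries, which is presumably why the cited work applies them to Chinese retrieval without a dictionary.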

01 Oct 1985
TL;DR: The main goal of this thesis is to compare clustered file searches and inverted file searches in order to determine under what circumstances one search is to be preferred over the other.
Abstract: The major component of a document retrieval system is the component that searches the document collection and selects the documents to be returned in response to a query. Since users wait for the results of the search, the component must be efficient as well as effective. The main goal of this thesis is to compare clustered file searches and inverted file searches in order to determine under what circumstances one search is to be preferred over the other. A preliminary goal is to define a good cluster search. Three types of agglomerative clustering strategies, the single link, the complete link, and the group average link methods, are investigated. Searches of the single link hierarchy, the cluster hierarchy used extensively in previous research, are shown to be inferior to searches of the other hierarchy types. Searches of the group average link and complete link hierarchies perform similarly for small collections; for larger collections, searches of the complete link hierarchy are more effective. A top-down search of the group average link hierarchy is the most time efficient search asymptotically. The experimental evidence suggests that the difference in the efficiency and effectiveness of the complete link and group average link searches is due to the restricted depth of the complete link hierarchy. The depth of the group average link hierarchy increases as the size of the collection increases, but the depth of the complete link hierarchy does not. Thus the largest clusters in the complete link hierarchy are not very large, and the clusters can be accurately represented by centroids. Since the depth of the hierarchy does not increase with collection size, searches of the complete link hierarchy should remain effective for larger collections. The top-down search of the complete link hierarchy is somewhat more effective than the inverted file search. 
The relative efficiency of the two searches depends on the relative efficiency of accessing a page and computing a similarity, since the cluster search accesses many more pages but computes fewer similarities than the inverted file search. For an inexpensive similarity measure, the inverted file search is much more efficient.

131 citations
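
The three agglomerative strategies named in the abstract above differ only in how they measure the closeness of two clusters. The sketch below shows that difference under stated assumptions: documents are plain coordinate tuples and closeness is Euclidean distance, whereas a retrieval system would more likely compare term vectors with a similarity measure. The function names are illustrative and do not come from the thesis.

```python
from itertools import product
from math import dist


def single_link(cluster_a, cluster_b):
    """Distance between the closest pair of members (single link)."""
    return min(dist(p, q) for p, q in product(cluster_a, cluster_b))


def complete_link(cluster_a, cluster_b):
    """Distance between the farthest pair of members (complete link)."""
    return max(dist(p, q) for p, q in product(cluster_a, cluster_b))


def group_average_link(cluster_a, cluster_b):
    """Mean distance over all cross-cluster pairs (group average link)."""
    pair_count = len(cluster_a) * len(cluster_b)
    return sum(dist(p, q) for p, q in product(cluster_a, cluster_b)) / pair_count


if __name__ == "__main__":
    a = [(0, 0), (0, 1)]
    b = [(3, 0), (4, 0)]
    for criterion in (single_link, group_average_link, complete_link):
        print(criterion.__name__, criterion(a, b))
```

An agglomerative method repeatedly merges the two clusters with the smallest value of the chosen criterion, so the choice of criterion shapes the hierarchy that a cluster search later descends.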

Book Chapter
25 Mar 2002
TL;DR: The research focuses on the degree to which implicit evidence of document relevance can be substituted for explicit evidence in terms of both user opinion and search effectiveness.
Abstract: In this paper we report on the application of two contrasting types of relevance feedback for web retrieval. We compare two systems; one using explicit relevance feedback (where searchers explicitly have to mark documents relevant) and one using implicit relevance feedback (where the system endeavours to estimate relevance by mining the searcher's interaction). The feedback is used to update the display according to the user's interaction. Our research focuses on the degree to which implicit evidence of document relevance can be substituted for explicit evidence. We examine the two variations in terms of both user opinion and search effectiveness.

131 citations

Proceedings Article
01 Aug 1998
TL;DR: This paper implemented the model and ran a series of experiments to show that, in addition to the added functionality, the use of the structural information embedded in SGML documents can improve the effectiveness of document retrieval, compared to the case where no such information is used.
Abstract: In traditional information retrieval (IR) systems, a document as a whole is the target for a query. With increasing interests in structured documents like SGML documents, there is a growing need to build an IR system that can retrieve parts of documents, which satisfy not only content-based but also structure-based requirements. In this paper, we describe an inference-net-based approach to this problem. The model is capable of retrieving elements at any level in a principled way, satisfying certain containment constraints in a query. Moreover, while the model is general enough to reproduce the ranking strategy adopted by conventional document retrieval systems by making use of document and collection level statistics such as TF and IDF, its flexibility allows for incorporation of a variety of pragmatic and semantic information associated with document structures. We implemented the model and ran a series of experiments to show that, in addition to the added functionality, the use of the structural information embedded in SGML documents can improve the effectiveness of document retrieval, compared to the case where no such information is used. We also show that giving a pragmatic preference to a certain element type of the SGML documents can enhance retrieval effectiveness.

131 citations

Patent
13 Sep 2001
TL;DR: A document retrieval method in which a computer program generates a first query expression, retrieves a first set of documents with it, and changes the query expression based on the user's evaluation of that set.
Abstract: A document retrieval method using a computer program includes retrieving a first set of documents using a first query expression generated by the computer program. The first set of documents is provided to a user. An evaluation of the first set of documents is received from the user. The first query expression is changed to a second query expression generated by the computer program based on the evaluation.

130 citations
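
The patent abstract above does not say how the second query expression is derived from the user's evaluation. One classic technique for this kind of query revision is Rocchio relevance feedback, sketched below as a stand-in rather than as the patented method. Queries and documents are assumed to be dictionaries mapping terms to weights, and the default `alpha`, `beta`, and `gamma` values are conventional choices, not values from the source.

```python
from collections import defaultdict


def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return a revised query as a term -> weight dictionary.

    The revised query moves toward the centroid of documents the user
    marked relevant and away from the centroid of those marked
    non-relevant. Terms whose weight is not positive are dropped.
    """
    revised = defaultdict(float)
    for term, weight in query.items():
        revised[term] += alpha * weight
    for doc in relevant:
        for term, weight in doc.items():
            revised[term] += beta * weight / len(relevant)
    for doc in non_relevant:
        for term, weight in doc.items():
            revised[term] -= gamma * weight / len(non_relevant)
    return {term: weight for term, weight in revised.items() if weight > 0}


if __name__ == "__main__":
    first_query = {"retrieval": 1.0}
    marked_relevant = [{"retrieval": 1.0, "document": 1.0}]
    marked_non_relevant = [{"image": 1.0}]
    print(rocchio(first_query, marked_relevant, marked_non_relevant))
```

In this sketch the user's evaluation is reduced to two lists of documents; a real system would also have to decide how many feedback rounds to run and how many new terms to admit.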


Network Information
Related Topics (5)
Web page: 50.3K papers, 975.1K citations, 81% related
Metadata: 43.9K papers, 642.7K citations, 79% related
Recommender system: 27.2K papers, 598K citations, 79% related
Ontology (information science): 57K papers, 869.1K citations, 78% related
Natural language: 31.1K papers, 806.8K citations, 77% related
Performance
Metrics
No. of papers in the topic in previous years
Year  Papers
2023  9
2022  39
2021  107
2020  130
2019  144
2018  111