
Document retrieval

About: Document retrieval is a research topic. Over its lifetime, 6821 publications have appeared within this topic, receiving 214383 citations.


Papers
Journal Article
TL;DR: Investigation of the degree to which variations in relevance judgments affect the evaluation of retrieval performance found that in no case was there a noticeable or material difference in retrieval performance due to variation in relevance judgment.
Abstract: The relevance judgments used to evaluate the performance of information retrieval systems are known to vary among judges and to vary under certain conditions extraneous to the relevance relationship between queries and documents. The study reported here investigated the degree to which variations in relevance judgments affect the evaluation of retrieval performance. Four sets of relevance judgments were used to test the retrieval effectiveness of six document representations. In no case was there a noticeable or material difference in retrieval performance due to variations in relevance judgment. Additionally, for each set of relevance judgments, the relative performance of the six different document representations was the same. Reasons why variations in relevance judgments may not affect recall and precision results were examined in further detail.

67 citations
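
The comparison described in the abstract above can be illustrated with a minimal sketch: compute precision and recall for each document representation under each set of relevance judgments, then check whether the relative ordering of the representations changes. The run names, document ids, and judgment sets below are hypothetical toy data, not the study's data, and ordering systems by precision and then recall is an assumption made for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a retrieved list against a set of relevant ids."""
    retrieved_set = set(retrieved)
    hits = len(retrieved_set & relevant)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def rank_systems(runs, relevant):
    """Order system names from best to worst by precision, then recall (assumed tie-break)."""
    scores = {name: precision_recall(docs, relevant) for name, docs in runs.items()}
    return sorted(scores, key=lambda name: scores[name], reverse=True)


# Hypothetical toy data: two document representations, two judges.
runs = {
    "rep_a": ["d1", "d2", "d3", "d4"],
    "rep_b": ["d1", "d5", "d6", "d7"],
}
judge_1 = {"d1", "d2", "d3"}
judge_2 = {"d1", "d2", "d4", "d8"}

ranking_1 = rank_systems(runs, judge_1)
ranking_2 = rank_systems(runs, judge_2)
print(ranking_1, ranking_2)
```

With these toy judgments, both judges place rep_a above rep_b even though the absolute recall values differ between judges.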

Journal Article
TL;DR: A linkage similarity measure which takes into account both the bibliographic coupling of documents and their cocitations produced improved document retrieval over a measure based only on bibliographic coupling.
Abstract: A linkage similarity measure which takes into account both the bibliographic coupling of documents and their cocitations (both cited and citing papers) produced improved document retrieval over a measure based only on bibliographic coupling. The test collection consisted of 1712 papers whose relevance to specific queries had been judged by users. To evaluate the effect of using cocitation data, we calculated for each query two measures of similarity between each relevant paper and every other paper retrieved. Papers were then sorted by the similarity measures, producing two ordered lists. We then compared the resulting predictions of relevance, partial relevance, and non-relevance to the user's evaluations of the same papers. Overall, the change from the bibliographic coupling measure to the linkage similarity measure, representing the introduction of cocitation data, resulted in better retrieval performance.

67 citations
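
The two similarity measures in the abstract above can be sketched as follows. Bibliographic coupling counts the references two papers share, and cocitation counts the papers that cite both. The abstract does not give the exact linkage formula, so adding the two counts is an assumption made here for illustration, and the citation graph is hypothetical toy data.

```python
def bibliographic_coupling(cites, a, b):
    """Number of references shared by papers a and b."""
    return len(cites.get(a, set()) & cites.get(b, set()))


def cocitation(cites, a, b):
    """Number of papers whose reference lists contain both a and b."""
    return sum(1 for refs in cites.values() if a in refs and b in refs)


def linkage_similarity(cites, a, b):
    """Assumed combination: coupling count plus cocitation count."""
    return bibliographic_coupling(cites, a, b) + cocitation(cites, a, b)


def rank_by(measure, cites, seed, candidates):
    """Sort candidate papers by decreasing similarity to the seed paper."""
    return sorted(candidates, key=lambda p: measure(cites, seed, p), reverse=True)


# Hypothetical citation graph: paper -> set of papers it cites.
cites = {
    "p1": {"r1", "r2"},
    "p2": {"r2", "r3"},
    "p3": {"r4"},
    "c1": {"p1", "p3"},
    "c2": {"p1", "p3"},
}

coupling_order = rank_by(bibliographic_coupling, cites, "p1", ["p2", "p3"])
linkage_order = rank_by(linkage_similarity, cites, "p1", ["p2", "p3"])
print(coupling_order, linkage_order)
```

On this toy graph the coupling-only list places p2 first, while the combined measure places p3 first because p1 and p3 are cocited twice.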

Journal Article
TL;DR: This paper proposes a new document retrieval (DR) and plagiarism detection (PD) system using multilayer self-organizing map (MLSOM), and shows that the tree-structured data is effective for DR and PD.
Abstract: This paper proposes a new document retrieval (DR) and plagiarism detection (PD) system using multilayer self-organizing map (MLSOM). A document is modeled by a rich tree-structured representation, and a SOM-based system is used as a computationally effective solution. Instead of relying on keywords/lines, the proposed scheme compares a full document as a query for performing retrieval and PD. The tree-structured representation hierarchically includes document features as document, pages, and paragraphs. Thus, it can reflect underlying context that is difficult to acquire from the currently used word-frequency information. We show that the tree-structured data is effective for DR and PD. To handle tree-structured representation in an efficient way, we use an MLSOM algorithm, which was previously developed by the authors for the application of image retrieval. In this study, it serves as an effective clustering algorithm. Using the MLSOM, local matching techniques are developed for comparing text documents. Two novel MLSOM-based PD methods are proposed. Detailed simulations are conducted and the experimental results corroborate that the proposed approach is computationally efficient and accurate for DR and PD.

67 citations

Journal Article
TL;DR: Minor adjustments have been made for the display of full text databases, allowing words resulting in retrieval to be displayed in context; but changes have not been made in retrieval techniques.
Abstract: Complete texts of many journals are now available for online searching. Most of these full text databases have been made available on the same or similar search systems that provide access to bibliographic information. The systems use inverted files that retain limited context information (e.g., paragraphs and location of words within paragraphs). The retrieval techniques used are simply those that were developed earlier for bibliographic databases. Retrieval relies on Boolean logic, word stem searching with truncation, and word proximity specification. Minor adjustments have been made for the display of full text databases, allowing words resulting in retrieval to be displayed in context; but changes have not been made in retrieval techniques. This is due to the reliance on search systems that provide access to many types of databases, all of which are by‐products of improved techniques for creating printed publications.

67 citations
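
The retrieval techniques named in the abstract above (an inverted file that keeps paragraph and word-position context, Boolean logic, stem truncation, and word proximity) can be sketched in a few lines. This is a minimal illustration, not the design of any particular search system: the function names, the trailing `*` truncation syntax, and the sample texts are all assumptions.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercased word to a list of (doc_id, paragraph_no, word_pos) postings."""
    index = defaultdict(list)
    for doc_id, paragraphs in docs.items():
        for para_no, text in enumerate(paragraphs):
            for pos, word in enumerate(text.lower().split()):
                index[word.strip(".,;:")].append((doc_id, para_no, pos))
    return index


def docs_with(index, term):
    """Documents containing the term; a trailing '*' requests stem truncation."""
    if term.endswith("*"):
        words = [w for w in index if w.startswith(term[:-1])]
    else:
        words = [term]
    return {doc for w in words for doc, _, _ in index.get(w, [])}


def boolean_and(index, *terms):
    """Documents that contain every term."""
    sets = [docs_with(index, t) for t in terms]
    return set.intersection(*sets) if sets else set()


def within(index, a, b, max_gap):
    """Documents where a and b share a paragraph and lie at most max_gap words apart."""
    hits = set()
    for doc_a, para_a, pos_a in index.get(a, []):
        for doc_b, para_b, pos_b in index.get(b, []):
            if (doc_a, para_a) == (doc_b, para_b) and abs(pos_a - pos_b) <= max_gap:
                hits.add(doc_a)
    return hits


# Hypothetical two-document collection; each document is a list of paragraphs.
docs = {
    "j1": ["Full text retrieval uses inverted files.", "Boolean logic is common."],
    "j2": ["Retrieving text is hard.", "Inverted files store word positions."],
}
index = build_index(docs)
print(boolean_and(index, "retriev*", "boolean"), within(index, "text", "retrieval", 1))
```

Storing the paragraph number alongside the word position is what lets the proximity check reject pairs that fall in different paragraphs.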

Patent
09 Jun 2003
TL;DR: A document retrieval method and system that separately performs a process for correcting erroneously recognized characters in characteristic character strings within a seed document or the documents to be registered, and a process for tolerating erroneously recognized characters in the documents targeted for retrieval.
Abstract: Disclosed are a document retrieval method and system for separately performing a process for correcting erroneously recognized characters existing in characteristic character strings within a seed document or the documents to be registered and a process for tolerating erroneously recognized characters existing in the documents targeted for retrieval. The process for correcting erroneously recognized characters existing in characteristic character strings extracts characteristic character strings from a read document, replaces the extracted characteristic character strings containing erroneously recognized characters with character strings appropriate for document retrieval, and selects characteristic character strings for use in actual document retrieval.

67 citations
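
The separation between correcting misrecognized strings and tolerating them at search time can be illustrated with a rough sketch. The patent abstract does not say how either process is implemented, so the code below swaps in generic string similarity from Python's standard `difflib` module; the function names, the vocabulary, the cutoff value, and the sample strings are hypothetical.

```python
import difflib


def correct_terms(ocr_terms, vocabulary, cutoff=0.75):
    """Replace each OCR-extracted term with its closest vocabulary entry, if one is close enough."""
    corrected = []
    for term in ocr_terms:
        if term in vocabulary:
            corrected.append(term)
            continue
        match = difflib.get_close_matches(term, vocabulary, n=1, cutoff=cutoff)
        # Keep the original string when nothing in the vocabulary is similar enough.
        corrected.append(match[0] if match else term)
    return corrected


def tolerant_match(query_term, document_terms, cutoff=0.75):
    """Return document terms similar enough to the query term to count as a match."""
    return [
        t for t in document_terms
        if difflib.SequenceMatcher(None, query_term, t).ratio() >= cutoff
    ]


# Hypothetical strings with typical OCR confusions ("1" for "l", "rn" for "m").
vocabulary = ["retrieval", "document", "system"]
print(correct_terms(["retrieva1", "docurnent", "xyz"], vocabulary))
print(tolerant_match("retrieval", ["retrieva1", "system"]))
```

In this sketch, correction is applied to the strings extracted from a seed or registered document, while tolerance is applied when comparing a clean query term against uncorrected text in the retrieval targets.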


Network Information
Related Topics (5)
Web page: 50.3K papers, 975.1K citations, 81% related
Metadata: 43.9K papers, 642.7K citations, 79% related
Recommender system: 27.2K papers, 598K citations, 79% related
Ontology (information science): 57K papers, 869.1K citations, 78% related
Natural language: 31.1K papers, 806.8K citations, 77% related
Performance Metrics
No. of papers in the topic in previous years
Year    Papers
2023    9
2022    39
2021    107
2020    130
2019    144
2018    111