scispace - formally typeset
Search or ask a question
Topic

Document retrieval

About: Document retrieval is a research topic. Over the lifetime, 6821 publications have been published within this topic receiving 214383 citations.


Papers
More filters
Patent
05 Feb 1997
TL;DR: In this paper, the indexer traverses the hypertext database and finds hypertext information including the address of the document the hyperlinks point to and the anchor text of each hyperlink.
Abstract: A search engine for retrieving documents pertinent to a query indexes documents in accordance with hyperlinks pointing to those documents. The indexer traverses the hypertext database and finds hypertext information including the address of the document the hyperlinks point to and the anchor text of each hyperlink. The information is stored in an inverted index file, which may also be used to calculate document link vectors for each hyperlink pointing to a particular document. When a query is entered, the search engine finds all document vectors for documents having the query terms in their anchor text. A query vector is also calculated, and the dot product of the query vector and each document link vector is calculated. The dot products relating to a particular document are summed to determine the relevance ranking for each document.

373 citations

Journal ArticleDOI
TL;DR: This article describes GlOSS, Glossary of Servers Server, with two versions: bGloss, which provides a Boolean query retrieval model, and vGlOSS, which providing a vector-space retrieval model and extensively describes the methodology for measuring the retrieval effectiveness of these systems.
Abstract: The dramatic growth of the Internet has created a new problem for users: location of the relevant sources of documents. This article presents a framework for (and experimentally analyzes a solution to) this problem, which we call the text-source discovery problem. Our approach consists of two phases. First, each text source exports its contents to a centralized service. Second, users present queries to the service, which returns an ordered list of promising text sources. This article describes GlOSS, Glossary of Servers Server, with two versions: bGlOSS, which provides a Boolean query retrieval model, and vGlOSS, which provides a vector-space retrieval model. We also present hGlOSS, which provides a decentralized version of the system. We extensively describe the methodology for measuring the retrieval effectiveness of these systems and provide experimental evidence, based on actual data, that all three systems are highly effective in determining promising text sources for a given query.

371 citations

Book ChapterDOI
20 Oct 2003
TL;DR: A simplistic upper-level ontology is introduced which starts with some basic philosophic distinctions and goes down to the most popular entity types, thus providing many of the inter-domain common sense concepts and allowing easy domain-specific extensions.
Abstract: The Semantic Web realization depends on the availability of critical mass of metadata for the web content, linked to formal knowledge about the world. This paper presents our vision about a holistic system allowing annotation, indexing, and retrieval of documents with respect to real-world entities. A system (called KIM), partially implementing this concept is shortly presented and used for evaluation and demonstration. Our understanding is that a system for semantic annotation should be based upon specific knowledge about the world, rather than indifferent to any ontological commitments and general knowledge. To assure efficiency and reusability of the metadata we introduce a simplistic upper-level ontology which starts with some basic philosophic distinctions and goes down to the most popular entity types (people, companies, cities, etc.), thus providing many of the inter-domain common sense concepts and allowing easy domain-specific extensions. Based on the ontology, an extensive knowledge base of entities descriptions is maintained. Semantically enhanced information extraction system providing automatic annotation with references to classes in the ontology and instances in the knowledge base is presented. Based on these annotations, we perform IR-like indexing and retrieval, further extended using the ontology and knowledge about the specific entities.

366 citations

Patent
07 Nov 1990
TL;DR: In this article, a dictionary of context vectors provides a context vector for each word stem in the dictionary, and a normalized summary vector is stored for each document by combining the context vectors of the words remaining in the document after uninteresting words are removed.
Abstract: A method for storing and searching documents also useful in disambiguating word senses and a method for generating a dictionary of context vectors. The dictionary of context vectors provides a context vector for each word stem in the dictionary. A context vector is a fixed length list of component values corresponding to a list of word-based features, the component values being an approximate measure of the conceptual relationship between the word stem and the word-based feature. Documents are stored by combining the context vectors of the words remaining in the document after uninteresting words are removed. The summary vector obtained by adding all of the context vectors of the remaining words is normalized. The normalized summary vector is stored for each document. The data base of normalized summary vectors is searched using a query vector and identifying the document whose vector is closest to that query vector. The normalized summary vectors of each document can be stored using cluster trees according to a centroid consistent algorithm to accelerate the searching process. Said searching process also gives an efficient way of finding nearest neighbor vectors in high-dimensional spaces.

361 citations

Journal ArticleDOI
TL;DR: The problems with query matching systems are discussed, which were designed for skilled search intermediaries rather than end‐users, and the knowledge and skills they require in the information‐seeking process, illustrated with examples of searching card and online catalogs.
Abstract: Author(s): Borgman, Christine L. | Abstract: We return to arguments made 10 years ago (Borgman, 1986a) that online catalogs are difficult to use because their design does not incorporate sufficient understanding of searching behavior. The earlier article examined studies of information retrieval system searching for their implications for online catalog design; this article examines the implications of card catalog design for online catalogs. With this analysis, we hope to contribute to a better understanding of user behavior and to lay to rest the card catalog design model for online catalogs. We discuss the problems with query matching systems, which were designed for skilled search intermediaries rather than end-users, and the knowledge and skills they require in the information-seeking process, illustrated with examples of searching card and online catalogs. Searching requires conceptual knowledge of the information retrieval process—translating an information need into a searchable query; semantic knowledge of how to implement a query in a given system—the how and when to use system features; and technical skills in executing the query—basic computing skills and the syntax of entering queries as specific search statements. In the short term, we can help make online catalogs easier to use through improved training and documentation that is based on information-seeking behavior, with the caveat that good training is not a substitute for good system design. Our long term goal should be to design intuitive systems that require a minimum of instruction. Given the complexity of the information retrieval problem and the limited capabilities of today's systems, we are far from achieving that goal. If libraries are to provide primary information services for the networked world, they need to put research results on the information-seeking process into practice in designing the next generation of online public access information retrieval systems.

357 citations


Network Information
Related Topics (5)
Web page
50.3K papers, 975.1K citations
81% related
Metadata
43.9K papers, 642.7K citations
79% related
Recommender system
27.2K papers, 598K citations
79% related
Ontology (information science)
57K papers, 869.1K citations
78% related
Natural language
31.1K papers, 806.8K citations
77% related
Performance
Metrics
No. of papers in the topic in previous years
YearPapers
20239
202239
2021107
2020130
2019144
2018111