Topic

Document retrieval

About: Document retrieval is a research topic. Over its lifetime, 6,821 publications have been published within this topic, receiving 214,383 citations.


Papers
01 Jan 2001
TL;DR: An overview of concept-based information retrieval techniques and software tools currently available as prototypes or commercial products, using a feature classification that incorporates general characteristics of the tools and their information retrieval features.
Abstract: In order to solve the problem of information overkill on the web, current information retrieval tools need to be improved. Much more "intelligence" should be embedded in search tools to manage search, retrieval, filtering, and presentation of relevant information effectively. This can be done by concept-based (or ontology-driven) information retrieval, which is considered one of the high-impact technologies for the next ten years. Nevertheless, most commercial products in the search and retrieval category do not report concept-based search features. The paper provides an overview of concept-based information retrieval techniques and software tools currently available as prototypes or commercial products. Tools are evaluated using a feature classification that incorporates general characteristics of the tools and their information retrieval features.

109 citations

Journal ArticleDOI
TL;DR: Analysis of users' verbal data shows that high precision does not always mean high quality to users because of different users' expectations, and four related measures of recall and precision are found to be significantly correlated with success.
Abstract: The appropriateness of evaluation criteria and measures has been a subject of debate and a vital concern in the information retrieval evaluation literature. A study was conducted to investigate the appropriateness of 20 measures for evaluating interactive information retrieval performance, representing four major evaluation criteria. Among the 20 measures studied were the two most well-known relevance-based measures of effectiveness, recall and precision. The user's judgment of information retrieval success was used as the devised criterion measure with which all other 20 measures were to be correlated. A sample of 40 end-users with individual information problems from an academic environment were observed, interacting with six professional intermediaries searching on their behalf in large operational systems. Quantitative data consisting of values for all measures studied and verbal data containing users' reasons for assigning certain values to selected measures were collected. Statistical analysis of the quantitative data showed that precision, one of the most important traditional measures of effectiveness, is not significantly correlated with the user's judgment of success. Users appear to be more concerned with absolute recall than with precision, although absolute recall was not directly tested in the study. Four related measures of recall and precision are found to be significantly correlated with success. Among these are user's satisfaction with completeness of search results and user's satisfaction with precision of the search. This article explores the possible explanations for this outcome through content analysis of users' verbal data. The analysis shows that high precision does not always mean high quality (relevancy, completeness, etc.) to users because of different users' expectations. The user's purpose in obtaining information is suggested to be the primary cause for the high concern for recall. Implications for research and practice are discussed. © 1994 John Wiley & Sons, Inc.

109 citations
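
For readers unfamiliar with the two relevance-based measures named in the abstract above, here is a minimal sketch of the standard set-based definitions of precision and recall. The function name `precision_recall` and the document IDs are illustrative assumptions; this is not the study's instrument, which covered 20 measures and user judgments.

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall for a single search.

    retrieved: IDs the system returned (hypothetical IDs in the example below)
    relevant:  IDs judged relevant to the user's information problem
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    # Precision: share of retrieved documents that are relevant.
    precision = hits / len(retrieved) if retrieved else 0.0
    # Recall: share of relevant documents that were retrieved.
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Example: 4 documents retrieved, 5 relevant overall, 2 in common.
p, r = precision_recall(["d1", "d2", "d3", "d4"],
                        ["d1", "d3", "d5", "d6", "d7"])
```

In this example half of what was retrieved is relevant, yet most relevant documents were missed, which is the kind of gap between precision and completeness the abstract describes users caring about.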

Proceedings Article
01 Apr 1995
TL;DR: A statistical analysis of the TREC-3 data shows that performance differences across queries are greater than performance differences across participants' runs.
Abstract: A statistical analysis of the TREC-3 data shows that performance differences across queries are greater than performance differences across participants' runs. Generally, groups of runs which do not differ significantly are large, sometimes accounting for over half the runs. Correlation among the various performance measures is high.

108 citations

Posted Content
TL;DR: A Deep Boltzmann Machine model suitable for modeling and extracting latent semantic representations from a large unstructured collection of documents is introduced and it is shown that the model assigns better log probability to unseen data than the Replicated Softmax model.
Abstract: We introduce a Deep Boltzmann Machine model suitable for modeling and extracting latent semantic representations from a large unstructured collection of documents. We overcome the apparent difficulty of training a DBM with judicious parameter tying. This parameter tying enables an efficient pretraining algorithm and a state initialization scheme that aids inference. The model can be trained just as efficiently as a standard Restricted Boltzmann Machine. Our experiments show that the model assigns better log probability to unseen data than the Replicated Softmax model. Features extracted from our model outperform LDA, Replicated Softmax, and DocNADE models on document retrieval and document classification tasks.

108 citations

Journal ArticleDOI
TL;DR: To deal with the organization problems of data in this conceptual model, the conventional concept of a list is extended to a fuzzy list and the notion of an inverted file structure can be extended to the fuzzy data in the retrieval model.
Abstract: This paper is concerned with the organization and retrieval of records in document retrieval systems which admit of imprecision in the form of fuzziness in document characterization and retrieval rules. A mathematical model for such systems, based on the theory of fuzzy sets, is introduced. A document retrieval system, as defined in this paper, is a quadruple (X, D, Q, γ), where X is a collection of the document descriptions (also referred to as index records, or records); D is the descriptor set; Q is a query set; γ: Q × X → [0, 1] (called the matching function) assigns to each pair (q, x), where q ∈ Q and x ∈ X, a number γ(q, x) in the interval [0, 1], called the matching index for the query q and the document description x. In our system model, each document description x is defined as a fuzzy set in the descriptor set D. As a fuzzy subset of D, each x is characterized by a membership function μ_x: D → [0, 1], where μ_x(d), representing the grade of membership of d in x, is referred to as the index weight of the descriptor d for the document representation x. The retrieval response of the system is defined in terms of the matching function γ. More specifically, given a query q, the index record retrieval response, f(q), is defined to be a fuzzy set in X whose membership function is given by μ_{f(q)}(x) = γ(q, x). To deal with the organization problems of data in our conceptual model, the conventional concept of a list is extended to a fuzzy list. Specifically, L(d), the fuzzy list corresponding to a descriptor d, is defined as a fuzzy set in the document description set X whose membership function is given by μ_{L(d)}(x) = μ_x(d). In this way, the notion of an inverted file structure can be extended to the fuzzy data in our retrieval model.

108 citations
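
The fuzzy list L(d) described in the abstract above can be illustrated with a small sketch: each record x maps descriptors to index weights μ_x(d), and inverting that mapping gives one fuzzy list per descriptor. The abstract does not specify the matching function γ, so the conjunctive query below uses the minimum of the weights (the usual fuzzy-set intersection) purely as an assumption. The names `build_fuzzy_lists` and `retrieve_and` and the sample records are hypothetical.

```python
from collections import defaultdict


def build_fuzzy_lists(records):
    """Invert records {x: {d: mu_x(d)}} into fuzzy lists {d: {x: mu_x(d)}}."""
    lists = defaultdict(dict)
    for x, weights in records.items():
        for d, w in weights.items():
            if w > 0.0:  # keep only records with nonzero membership in L(d)
                lists[d][x] = w
    return dict(lists)


def retrieve_and(fuzzy_lists, descriptors):
    """Fuzzy response f(q) for a conjunctive query over descriptors.

    Assumption: gamma(q, x) is the minimum index weight over the query's
    descriptors; the paper's actual matching function may differ.
    """
    response = None
    for d in descriptors:
        postings = fuzzy_lists.get(d, {})
        if response is None:
            response = dict(postings)
        else:
            # A record missing from L(d) has membership 0 and drops out.
            response = {x: min(g, postings[x])
                        for x, g in response.items() if x in postings}
    return response or {}


# Hypothetical index records with descriptor weights in [0, 1].
records = {
    "x1": {"fuzzy": 0.9, "retrieval": 0.4},
    "x2": {"fuzzy": 0.3},
    "x3": {"fuzzy": 0.6, "retrieval": 0.8},
}
fuzzy_lists = build_fuzzy_lists(records)
result = retrieve_and(fuzzy_lists, ["fuzzy", "retrieval"])
```

Storing only nonzero memberships keeps each fuzzy list as short as a conventional posting list while still carrying the graded weights.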


Network Information
Related Topics (5)
Web page
50.3K papers, 975.1K citations
81% related
Metadata
43.9K papers, 642.7K citations
79% related
Recommender system
27.2K papers, 598K citations
79% related
Ontology (information science)
57K papers, 869.1K citations
78% related
Natural language
31.1K papers, 806.8K citations
77% related
Performance
Metrics
No. of papers in the topic in previous years
Year    Papers
2023    9
2022    39
2021    107
2020    130
2019    144
2018    111