Topic

Document retrieval

About: Document retrieval is a research topic. Over its lifetime, 6,821 publications have been published within this topic, receiving 214,383 citations.


Papers
Journal Article
01 Mar 2007
TL;DR: Measured by miss and false alarm rates, the EP-supported ET (EPET) technique exhibits better tracking effectiveness than a traditional ET technique and suggests that the proposed EP technique could effectively discover event episodes and EPs in sequences of documents.
Abstract: Recent advances in information and networking technologies have contributed significantly to global connectivity and greatly facilitated and fostered information creation, distribution, and access. The resultant ever-increasing volume of online textual documents creates an urgent need for new text mining techniques that can intelligently and automatically extract implicit and potentially useful knowledge from these documents for decision support. This research focuses on identifying and discovering event episodes together with their temporal relationships that occur frequently (referred to as evolution patterns (EPs) in this paper) in sequences of documents. The discovery of such EPs can be applied in domains such as knowledge management and used to facilitate existing document management and retrieval techniques [e.g., event tracking (ET)]. Specifically, we propose and design an EP discovery technique for mining EPs from sequences of documents. We experimentally evaluate our proposed EP technique in the context of facilitating ET. Measured by miss and false alarm rates, the EP-supported ET (EPET) technique exhibits better tracking effectiveness than a traditional ET technique. The encouraging performance of the EPET technique demonstrates the potential usefulness of EPs in supporting ET and suggests that the proposed EP technique could effectively discover event episodes and EPs in sequences of documents.

46 citations
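
The abstract above evaluates tracking by miss and false alarm rates. As a rough illustration of those two measures (not the authors' evaluation code), the sketch below computes them for one tracked event. The function name `miss_and_false_alarm` and the toy document IDs are hypothetical, and the definitions assumed are the usual ones: missed on-topic documents over all on-topic documents, and wrongly flagged off-topic documents over all off-topic documents.

```python
def miss_and_false_alarm(predicted, relevant, all_docs):
    """Return (miss_rate, false_alarm_rate) for one tracked event.

    predicted: documents the tracker flagged as on-topic
    relevant:  documents that truly belong to the event
    all_docs:  every document in the evaluated stream
    """
    predicted, relevant, all_docs = set(predicted), set(relevant), set(all_docs)
    off_topic = all_docs - relevant
    missed = relevant - predicted          # on-topic documents the tracker did not flag
    false_alarms = predicted & off_topic   # off-topic documents the tracker flagged
    miss_rate = len(missed) / len(relevant) if relevant else 0.0
    fa_rate = len(false_alarms) / len(off_topic) if off_topic else 0.0
    return miss_rate, fa_rate


# Hypothetical toy stream: three on-topic documents, three off-topic.
all_docs = ["d1", "d2", "d3", "d4", "d5", "d6"]
relevant = ["d1", "d2", "d3"]
predicted = ["d1", "d2", "d5"]
miss, fa = miss_and_false_alarm(predicted, relevant, all_docs)
```

Lower values are better for both rates; a tracker that flags everything has a miss rate of zero but a false alarm rate of one.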

Book Chapter
24 Jun 2013
TL;DR: The basic idea is to measure the distance between candidate concepts using the PMING distance, a collaborative semantic proximity measure that can be computed from the statistical results of a web search engine. Experiments show that the proposed technique can provide users with more satisfying expansion results and improve the quality of web document retrieval.
Abstract: In this work several semantic approaches to concept-based query expansion and re-ranking schemes are studied and compared with different ontology-based expansion methods in web document search and retrieval. In particular, we focus on concept-based query expansion schemes where, in order to effectively increase the precision of web document retrieval and to decrease the users’ browsing time, the main goal is to quickly provide users with the most suitable query expansion. Two key tasks for query expansion in web document retrieval are to find the expansion candidates, as the closest concepts in web document domain, and to rank the expanded queries properly. The approach we propose aims at improving the expansion phase for better web document retrieval and precision. The basic idea is to measure the distance between candidate concepts using the PMING distance, a collaborative semantic proximity measure, i.e. a measure which can be computed using statistical results from a web search engine. Experiments show that the proposed technique can provide users with more satisfying expansion results and improve the quality of web document retrieval.

46 citations
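
The abstract above relies on a proximity measure computed from search engine statistics, but does not give the PMING formula, so it is not reproduced here. As a stand-in, the sketch below uses the Normalized Google Distance (NGD), a well-known measure of the same kind that needs only hit counts. The function names, the hit counts, and the index size `n` are all hypothetical.

```python
import math


def normalized_google_distance(fx, fy, fxy, n):
    """NGD from hit counts: fx and fy for each term alone, fxy for both together,
    n for the (assumed) total number of indexed pages, with n > fx and n > fy."""
    if fx <= 0 or fy <= 0 or fxy <= 0:
        return float("inf")  # terms that never co-occur are treated as maximally distant
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(log_fx, log_fy) - log_fxy) / (math.log(n) - min(log_fx, log_fy))


def rank_candidates(query_hits, candidates, n):
    """Order expansion candidates from closest to farthest.

    candidates maps term -> (hits for the term, hits for query AND term).
    """
    scored = [
        (normalized_google_distance(query_hits, hits, joint, n), term)
        for term, (hits, joint) in candidates.items()
    ]
    return [term for _, term in sorted(scored)]


# Hypothetical hit counts; a real system would obtain them from a search engine.
ranking = rank_candidates(
    query_hits=10_000,
    candidates={"engine": (10_000, 5_000), "banana": (10_000, 100)},
    n=1_000_000,
)
```

A candidate that co-occurs often with the query gets a small distance and is ranked first, which is the role the proximity measure plays in choosing expansion terms.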

Journal Article
TL;DR: This paper presents a new method for QE based on fuzzy logic that treats the top-retrieved documents as relevance feedback documents for mining additional QE terms, and that increases the precision and recall rates of information retrieval systems for document retrieval.
Abstract: Efficient query expansion (QE) term selection methods are important for improving the accuracy and efficiency of the system by removing the irrelevant and redundant terms from the top-retrieved feedback documents corpus with respect to a user query. Each individual QE term selection method has its weaknesses and strengths. To overcome the weaknesses and to utilize the strengths of the individual methods, we used multiple term selection methods together. In this paper, we present a new method for QE based on fuzzy logic considering the top-retrieved documents as relevance feedback documents for mining additional QE terms. Different QE term selection methods calculate the degrees of importance of all unique terms of the top-retrieved documents collection for mining additional expansion terms. These methods give different relevance scores for each term. The proposed method combines the different weights of each term by using fuzzy rules to infer the weights of the additional query terms. Then, the weights of the additional query terms and the weights of the original query terms are used to form the new query vector, and we use this new query vector to retrieve documents. All the experiments are performed on TREC and FIRE benchmark datasets. The proposed QE method increases the precision rates and the recall rates of information retrieval systems for dealing with document retrieval. It achieves a significantly higher average recall rate, average precision rate and F measure on both datasets.

46 citations
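
The pipeline in the abstract above (score terms with several methods, combine the scores, add the best terms to the query vector) can be sketched roughly as follows. The paper combines scores with fuzzy rules, which the abstract does not spell out; this sketch swaps in a plain average of min-max normalized scores instead. The function names, the parameters `k` and `beta`, and the score tables are hypothetical.

```python
def combine_scores(score_tables):
    """Min-max normalize each method's scores, then average them per term.

    NOTE: a simple average stands in for the paper's fuzzy rules.
    Each table is a non-empty dict mapping term -> score from one method.
    """
    terms = set().union(*score_tables)
    combined = {t: 0.0 for t in terms}
    for table in score_tables:
        lo, hi = min(table.values()), max(table.values())
        span = (hi - lo) or 1.0  # avoid division by zero when all scores are equal
        for t in terms:
            combined[t] += (table.get(t, lo) - lo) / span
    return {t: s / len(score_tables) for t, s in combined.items()}


def expand_query(original, score_tables, k=2, beta=0.5):
    """Add the k best-scoring new terms to the query, down-weighted by beta."""
    combined = combine_scores(score_tables)
    candidates = sorted(
        (t for t in combined if t not in original),
        key=lambda t: (-combined[t], t),
    )
    new_query = dict(original)
    for t in candidates[:k]:
        new_query[t] = beta * combined[t]
    return new_query


# Hypothetical scores from two term selection methods over the feedback documents.
method_a = {"search": 0.9, "index": 0.5, "banana": 0.1}
method_b = {"search": 4.0, "index": 3.0, "banana": 1.0}
new_query = expand_query({"retrieval": 1.0}, [method_a, method_b])
```

Normalizing before combining keeps a method with a large score range from dominating the others; the fuzzy rules in the paper serve the same purpose of reconciling scores that disagree.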

Journal Article
TL;DR: This work proposes a new spectral-based information retrieval method that is able to utilize many different levels of document resolution by examining the term patterns that occur in the documents, and takes advantage of the multiresolution analysis properties of the wavelet transform.
Abstract: Current information retrieval methods either ignore the term positions or deal with exact term positions; the former can be seen as coarse document resolution, the latter as fine document resolution. We propose a new spectral-based information retrieval method that is able to utilize many different levels of document resolution by examining the term patterns that occur in the documents. To do this, we take advantage of the multiresolution analysis properties of the wavelet transform. We show that we are able to achieve higher precision when compared to vector space and proximity retrieval methods, while producing fast query times and using a compact index.

46 citations
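
To make the idea of document resolution in the abstract above more concrete, the sketch below builds a per-term position signal and applies a wavelet transform to it. The abstract names neither the wavelet nor the signal construction, so the Haar wavelet, the equal-width position bins, and the function names are all assumptions made for illustration.

```python
import math


def term_signal(tokens, term, bins=8):
    """Count occurrences of `term` in each of `bins` equal-width position bins."""
    signal = [0.0] * bins
    n = len(tokens)
    for i, tok in enumerate(tokens):
        if tok == term:
            signal[i * bins // n] += 1.0
    return signal


def haar_transform(signal):
    """Orthonormal Haar wavelet transform; len(signal) must be a power of two.

    Output layout: [overall average, coarsest detail, ..., finest details].
    """
    out = list(signal)
    length = len(out)
    while length > 1:
        half = length // 2
        avg = [(out[2 * i] + out[2 * i + 1]) / math.sqrt(2) for i in range(half)]
        diff = [(out[2 * i] - out[2 * i + 1]) / math.sqrt(2) for i in range(half)]
        out[:length] = avg + diff  # averages are refined again on the next pass
        length = half
    return out


tokens = "wavelet transform of term signals uses wavelet basis".split()
signal = term_signal(tokens, "wavelet")
coeffs = haar_transform(signal)
```

The first coefficient depends only on how often the term occurs (coarse resolution, as in position-blind methods), while later coefficients capture where in the document it occurs at progressively finer scales.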

Journal Article
TL;DR: It is shown that document retrieval - specifically, access to intellectual content - is a complex process which is most strongly influenced by three factors: the size of the document collection; the type of search (exhaustive, existence or sample); and, the determinacy of document representation.
Abstract: With the growing focus on what is collectively known as "knowledge management", a shift continues to take place in commercial information system development: a shift away from the well-understood data retrieval/database model, to the more complex and challenging development of commercial document/ information retrieval models. While document retrieval has had a long and rich legacy of research, its impact on commercial applications has been modest. At the enterprise level most large organizations have little understanding of, or commitment to, high quality document access and management. Part of the reason for this is that we still do not have a good framework for understanding the major factors which affect the performance of large-scale corporate document retrieval systems. The thesis of this discussion is that document retrieval - specifically, access to intellectual content - is a complex process which is most strongly influenced by three factors: the size of the document collection; the type of search (exhaustive, existence or sample); and, the determinacy of document representation. Collectively, these factors can be used to provide a useful framework for, or taxonomy of, document retrieval, and highlight some of the fundamental issues facing the design and development of commercial document retrieval systems. This is the first of a series of three articles. Part II (D.C. Blair, The challenge of commercial document retrieval. Part II. A strategy for document searching based on identifiable document partitions, Information Processing and Management, 2001b, this issue) will discuss the implications of this framework for search strategy, and Part III (D.C. Blair, Some thoughts on the reported results of Text REtrieval Conference (TREC), Information Processing and Management, 2002, forthcoming) will consider the importance of the TREC results for our understanding of operating information retrieval systems.

46 citations


Network Information
Related Topics (5)
Web page: 50.3K papers, 975.1K citations, 81% related
Metadata: 43.9K papers, 642.7K citations, 79% related
Recommender system: 27.2K papers, 598K citations, 79% related
Ontology (information science): 57K papers, 869.1K citations, 78% related
Natural language: 31.1K papers, 806.8K citations, 77% related
Performance
Metrics
No. of papers in the topic in previous years
Year    Papers
2023    9
2022    39
2021    107
2020    130
2019    144
2018    111