Topic

Document retrieval

About: Document retrieval is a research topic. Over its lifetime, 6,821 publications have been published on this topic, receiving 214,383 citations.


Papers
Proceedings Article
27 Apr 2006 - Scopus
TL;DR: A novel signature retrieval strategy is presented. It includes a technique for removing noise and printed text from signature images previously extracted from business documents, and it matches signatures with a normalized correlation similarity measure over global shape-based binary feature vectors.
Abstract: When searching a repository of business documents, a task of interest is using a query signature image to retrieve other matching signatures from a database. The signature retrieval task involves a two-step process of extracting all the signatures from the documents and then performing a match on these signatures. This paper presents a novel signature retrieval strategy, which includes a technique for noise and printed text removal from signature images previously extracted from business documents. Signature matching is based on a normalized correlation similarity measure using global shape-based binary feature vectors. In a retrieval task involving a database of 447 signatures, on average 4.43 of the top 5 choices were signatures belonging to the writer of the queried signature. Considering the top 10 ranks, an F-measure of 76.3 was obtained, with precision and recall at this F-measure of 74.5% and 78.28%, respectively.
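As a rough illustration of the matching and evaluation steps this abstract describes, the sketch below scores two binary feature vectors and recomputes the F-measure from the reported precision and recall. The abstract does not give the paper's exact similarity formula, so the cosine-style normalized correlation used here is an assumption, and the function names are hypothetical.

```python
import math

def normalized_correlation(a, b):
    """Cosine-style normalized correlation of two equal-length 0/1 vectors.

    Assumption: the paper's exact definition is not stated in the abstract
    and may differ from this form.
    """
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    shared_ones = sum(x & y for x, y in zip(a, b))
    ones_a, ones_b = sum(a), sum(b)
    if ones_a == 0 or ones_b == 0:
        return 0.0  # an all-zero vector matches nothing
    return shared_ones / math.sqrt(ones_a * ones_b)

def rank_signatures(query, database):
    """Return (name, score) pairs sorted from best to worst match."""
    scored = [(name, normalized_correlation(query, vec))
              for name, vec in database.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

def f_measure(precision, recall):
    """Harmonic mean of precision and recall (balanced F-measure)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall as reported in the abstract (percentages).
print(round(f_measure(74.5, 78.28), 1))
```

With the reported precision of 74.5% and recall of 78.28%, the harmonic mean rounds to 76.3, which agrees with the F-measure quoted in the abstract.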

47 citations

Proceedings Article
11 Jun 2007
TL;DR: In the WISDM project at UIUC, a prototype entity search engine is built and evaluated over a 2TB Web corpus, showing the feasibility and promise of a large-scale system architecture to support entity search.
Abstract: As the Web has evolved into a data-rich repository, with the standard "page view," current search engines are increasingly inadequate. While we often search for various data "entities" (e.g., phone number, paper PDF, date), today's engines only take us indirectly to pages. Therefore, we propose the concept of entity search, a significant departure from traditional document retrieval. Towards our goal of supporting entity search, in the WISDM project at UIUC we build and evaluate our prototype search engine over a 2TB Web corpus. Our demonstration shows the feasibility and promise of a large-scale system architecture to support entity search.

47 citations

Journal Article
TL;DR: This paper aims to improve the learning of a ranking model in a target domain by leveraging knowledge from outdated or out-of-domain data, and proposes two novel methods that transfer knowledge at the feature level and the instance level.
Abstract: Recently, learning to rank technology has been attracting increasing attention from both academia and industry in the areas of machine learning and information retrieval. A number of algorithms have been proposed to rank documents according to the user-given query using a human-labeled training dataset. A basic assumption behind general learning to rank algorithms is that the training and test data are drawn from the same data distribution. However, this assumption does not always hold true in real world applications. For example, it can be violated when the labeled training data become outdated or originally come from a domain different from that of the test data. Such situations bring a new problem, which we define as cross domain learning to rank. In this paper, we aim at improving the learning of a ranking model in the target domain by leveraging knowledge from the outdated or out-of-domain data (both are referred to as source domain data). We first give a formal definition of the cross domain learning to rank problem. Following this, two novel methods are proposed to conduct knowledge transfer at feature level and instance level, respectively. Both methods use Ranking SVM as the basic learner. In the experiments, we evaluate these two methods using data from benchmark datasets for document retrieval. The results show that the feature-level transfer method performs better, with steady improvements over baseline approaches across different datasets, while the instance-level transfer method shows varying performance depending on the dataset used.
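The abstract names Ranking SVM as the base learner. Ranking SVM is commonly trained on pairwise preferences built from per-query relevance labels, and the sketch below shows only that generic preprocessing step. It does not attempt the paper's feature-level or instance-level transfer methods, whose details the abstract does not give. The function name `pairwise_transform` and the toy data are hypothetical.

```python
from itertools import combinations

def pairwise_transform(features, labels, query_ids):
    """Build pairwise preference examples for a Ranking SVM style learner.

    For every pair of documents retrieved for the same query with different
    relevance labels, emit the feature difference and a sign saying which
    document should rank higher. A linear classifier trained on these
    differences induces a ranking function.
    """
    by_query = {}
    for x, y, q in zip(features, labels, query_ids):
        by_query.setdefault(q, []).append((x, y))

    diffs, signs = [], []
    for docs in by_query.values():
        for (xi, yi), (xj, yj) in combinations(docs, 2):
            if yi == yj:
                continue  # equal relevance carries no preference
            diffs.append([a - b for a, b in zip(xi, xj)])
            signs.append(1 if yi > yj else -1)
    return diffs, signs

# Toy data (hypothetical): two documents for q1, one for q2.
X = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
y = [2, 1, 1]
q = ["q1", "q1", "q2"]
print(pairwise_transform(X, y, q))
```

Pairs are formed only within a query, since relevance labels from different queries are not directly comparable.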

47 citations

Journal Article
TL;DR: This paper defines a set of document content description tags, or metadata encodings, that promote disciplined search access to Internet medical documents and facilitate their retrieval by Internet search engines.

47 citations

Dissertation
03 Oct 1996
TL;DR: This study investigated whether the information obtained by matching causal relations expressed in documents with the causal relations expressed in users' queries could be used to improve document retrieval results in comparison to using just term matching without considering relations.
Abstract: This study represents one attempt to make use of relations expressed in text to improve information retrieval effectiveness. In particular, the study investigated whether the information obtained by matching causal relations expressed in documents with the causal relations expressed in users' queries could be used to improve document retrieval results in comparison to using just term matching without considering relations. An automatic method for identifying and extracting cause-effect information in Wall Street Journal text was developed. The method uses linguistic clues to identify causal relations without recourse to knowledge-based inferencing. The method was successful in identifying and extracting about 68% of the causal relations that were clearly expressed within a sentence or between adjacent sentences in Wall Street Journal text. Of the instances that the computer program identified as causal relations, 72% can be considered to be correct. The automatic method was used in an experimental information retrieval system to identify causal relations in a database of full-text Wall Street Journal documents. Causal relation matching was found to yield a small but significant improvement in retrieval results when the weights used for combining the scores from different types of matching were customized for each query, as in an SDI or routing queries situation. The best results were obtained when causal relation matching was combined with word proximity matching (matching pairs of causally related words in the query with pairs of words that co-occur within document sentences). An analysis using manually identified causal relations indicates that bigger retrieval improvements can be expected with more accurate identification of causal relations. The best kind of causal relation matching was found to be one in which one member of the causal relation (either the cause or the effect) was represented as a wildcard that could match with any term. The study also investigated whether using Roget's International Thesaurus (3rd ed.) to expand query terms with synonymous and related terms would improve retrieval effectiveness. Using Roget category codes in addition to keywords did give better retrieval results. However, the Roget codes were better at identifying the nonrelevant documents than the relevant ones.
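The wildcard form of causal relation matching mentioned in this abstract can be pictured with a small sketch. It assumes each relation is stored as a (cause, effect) pair of single terms and that a query may replace either member with a wildcard. The representation, the `WILDCARD` marker, and the function names are hypothetical and are not taken from the dissertation.

```python
WILDCARD = "*"  # hypothetical marker for "matches any term"

def relation_matches(query_rel, doc_rel):
    """True if a query (cause, effect) pair matches a document pair.

    Either member of the query pair may be the wildcard.
    """
    q_cause, q_effect = query_rel
    d_cause, d_effect = doc_rel
    cause_ok = q_cause == WILDCARD or q_cause == d_cause
    effect_ok = q_effect == WILDCARD or q_effect == d_effect
    return cause_ok and effect_ok

def causal_match_score(query_rels, doc_rels):
    """Count the query relations matched by at least one document relation."""
    return sum(
        1 for q in query_rels
        if any(relation_matches(q, d) for d in doc_rels)
    )

# Toy relations (hypothetical), as if extracted from one document.
doc_rels = [("strike", "shortage"), ("drought", "price rise")]
query_rels = [(WILDCARD, "shortage"), ("merger", "layoffs")]
print(causal_match_score(query_rels, doc_rels))
```

In a fuller system this count would be one of several scores, combined with keyword and word proximity matching using per-query weights, as the abstract describes.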

47 citations


Network Information
Related Topics (5)
Web page: 50.3K papers, 975.1K citations, 81% related
Metadata: 43.9K papers, 642.7K citations, 79% related
Recommender system: 27.2K papers, 598K citations, 79% related
Ontology (information science): 57K papers, 869.1K citations, 78% related
Natural language: 31.1K papers, 806.8K citations, 77% related
Performance Metrics
No. of papers in the topic in previous years
Year    Papers
2023    9
2022    39
2021    107
2020    130
2019    144
2018    111