Topic

Document retrieval

About: Document retrieval is a research topic. Over its lifetime, 6821 publications have been published within this topic, receiving 214383 citations.


Papers
01 Jan 1997
TL;DR: It is shown that users who are able to read more than one language will likely prefer a multilingual text retrieval system over a collection of monolingual systems and cross language text retrieval was selected as the preferred term in the interest of standardization.
Abstract: The explosive growth of the Internet and other sources of networked information has made automatic mediation of access to networked information sources an increasingly important problem. Much of this information is expressed as electronic text, and it is becoming practical to automatically convert some printed documents and recorded speech to electronic text as well. Thus, automated systems capable of detecting useful documents are finding widespread application. With even a small number of languages it can be inconvenient to issue the same query repeatedly in every language, so users who are able to read more than one language will likely prefer a multilingual text retrieval system over a collection of monolingual systems. And since reading ability in a language does not always imply fluent writing ability in that language, such users will likely find cross-language text retrieval particularly useful for languages in which they are less confident of their ability to express their information needs effectively. The use of such systems can also be beneficial if the user is able to read only a single language. For example, when only a small portion of the document collection will ever be examined by the user, performing retrieval before translation can be significantly more economical than performing translation before retrieval. So when the application is sufficiently important to justify the time and effort required for translation, those costs can be minimized if an effective cross-language text retrieval system is available. Even when translation is not available, there are circumstances in which cross-language text retrieval could be useful to a monolingual user. For example, a researcher might find a paper published in an unfamiliar language useful if that paper contains references to works by the same author that are in the researcher's native language. Multilingual text retrieval can be defined as selection of useful documents from collections that may contain several languages (English, French, Chinese, etc.). This formulation allows for the possibility that individual documents might contain more than one language, a common occurrence in some applications. Both cross-language and within-language retrieval are included in this formulation, but it is the cross-language aspect of the problem which distinguishes multilingual text retrieval from its well-studied monolingual counterpart. At the SIGIR workshop on Cross-Linguistic Information Retrieval, the participants discussed the proliferation of terminology being used to describe the field and settled on "Cross-Language" as the best single description of the salient aspect of the problem. "Multilingual" was felt to be too broad, since that term has also been used to describe systems able to perform within-language retrieval in more than one language but that lack any cross-language capability. "Cross-lingual" and "cross-linguistic" were felt to be equally good descriptions of the field, but "cross-language" was selected as the preferred term in the interest of standardization. Unfortunately, at about the same time, the U.S. Defense Advanced Research Projects Agency (DARPA) introduced "translingual" as their preferred term, so we are still some distance from reaching consensus on this matter.

99 citations

Proceedings ArticleDOI
13 Oct 2003
TL;DR: A re-ranking method to improve Web image retrieval by reordering the images retrieved from an image search engine based on a relevance model, which is a probabilistic model that evaluates the relevance of the HTML document linking to the image, and assigns a probability of relevance.
Abstract: Web image retrieval is a challenging task that requires efforts from image processing, link structure analysis, and Web text retrieval. Since content-based image retrieval is still considered very difficult, most current large-scale Web image search engines exploit text and link structure to "understand" the content of the Web images. However, local text information, such as caption, filenames and adjacent text, is not always reliable and informative. Therefore, global information should be taken into account when a Web image retrieval system makes relevance judgment. We propose a re-ranking method to improve Web image retrieval by reordering the images retrieved from an image search engine. The re-ranking process is based on a relevance model, which is a probabilistic model that evaluates the relevance of the HTML document linking to the image, and assigns a probability of relevance. The experiment results showed that the re-ranked image retrieval achieved better performance than original Web image retrieval, suggesting the effectiveness of the re-ranking method. The relevance model is learned from the Internet without preparing any training data and independent of the underlying algorithm of the image search engines. The re-ranking process should be applicable to any image search engines with little effort.

99 citations
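
The re-ranking idea in the abstract above can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the function name `rerank`, the representation of the relevance model as a word-probability dictionary, and the smoothed log-ratio score are not taken from the paper, which does not give its scoring formula in the abstract.

```python
import math

def rerank(results, relevance_model, background, lam=0.5, floor=1e-6):
    """Reorder (image_id, page_text) pairs by a relevance score of the page text.

    relevance_model and background map words to probabilities (hypothetical
    inputs). The score is a length-normalized log-ratio between a smoothed
    relevance model and the background model; this is an illustrative choice.
    """
    def score(text):
        tokens = text.lower().split()
        if not tokens:
            return float("-inf")
        total = 0.0
        for tok in tokens:
            p_bg = background.get(tok, floor)
            p_rel = lam * relevance_model.get(tok, 0.0) + (1 - lam) * p_bg
            total += math.log(p_rel / p_bg)
        return total / len(tokens)

    # Highest score first; the original order breaks ties (sorted is stable).
    return sorted(results, key=lambda r: score(r[1]), reverse=True)


results = [
    ("img1", "cheap flights hotel deals"),
    ("img2", "golden gate bridge san francisco"),
]
relevance_model = {"golden": 0.2, "gate": 0.2, "bridge": 0.3,
                   "san": 0.15, "francisco": 0.15}
background = {w: 0.01 for _, text in results for w in text.split()}
reranked = rerank(results, relevance_model, background)
```

Because the score depends only on the text of the linking page, a sketch like this stays independent of the search engine that produced the initial ranking, which matches the property the abstract claims for the method.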

Journal ArticleDOI
TL;DR: It is shown that the concept of threshold values resolves the problems inherent with relevance weights, and possible evaluation mechanisms for retrieval of documents, based on fuzzy-set-theoretic considerations are explored.
Abstract: Several papers have appeared that have analyzed recent developments in the problem of processing, in a document retrieval system, queries expressed as Boolean expressions. The purpose of this paper is to continue that analysis. We shall show that the concept of threshold values resolves the problems inherent with relevance weights. Moreover, we shall explore possible evaluation mechanisms for retrieval of documents, based on fuzzy-set-theoretic considerations.

99 citations
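
As a rough illustration of Boolean query evaluation with thresholds under fuzzy-set semantics, the sketch below uses the standard fuzzy operators (min for AND, max for OR, complement for NOT). The threshold rule, where a term counts only if its membership degree reaches the query threshold, and the helper names `term`, `and_`, `or_`, `not_` are assumptions for illustration, not the evaluation function proposed in the paper.

```python
def term(name, threshold=0.0):
    """Query term: returns the document's membership degree for `name`,
    or 0.0 if that degree falls below the query threshold (assumed rule)."""
    def evaluate(doc):
        weight = doc.get(name, 0.0)
        return weight if weight >= threshold else 0.0
    return evaluate

def and_(*clauses):
    # Fuzzy intersection: minimum of the clause values.
    return lambda doc: min(c(doc) for c in clauses)

def or_(*clauses):
    # Fuzzy union: maximum of the clause values.
    return lambda doc: max(c(doc) for c in clauses)

def not_(clause):
    # Fuzzy complement.
    return lambda doc: 1.0 - clause(doc)


# A document is a mapping from index terms to membership degrees in [0, 1].
doc = {"fuzzy": 0.8, "retrieval": 0.4}

strict = and_(term("fuzzy", 0.5), term("retrieval", 0.6))
loose = and_(term("fuzzy", 0.5), term("retrieval", 0.3))
strict_score = strict(doc)  # "retrieval" is below its threshold
loose_score = loose(doc)    # both terms pass, so the minimum is returned
```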

Proceedings ArticleDOI
01 Jul 2000
TL;DR: This work proposes a novel method for phonetic retrieval in the CueVideo system based on the probabilistic formulation of term weighting using phone confusion data in a Bayesian framework and evaluates this method of spoken document retrieval against word-based retrieval for the search levels identified in a realistic video-based distributed learning setting.
Abstract: Combined word-based index and phonetic indexes have been used to improve the performance of spoken document retrieval systems primarily by addressing the out-of-vocabulary retrieval problem. However, a known problem with phonetic recognition is its limited accuracy in comparison with word level recognition. We propose a novel method for phonetic retrieval in the CueVideo system based on the probabilistic formulation of term weighting using phone confusion data in a Bayesian framework. We evaluate this method of spoken document retrieval against word-based retrieval for the search levels identified in a realistic video-based distributed learning setting. Using our test data, we achieved an average recall of 0.88 with an average precision of 0.69 for retrieval of out-of-vocabulary words on phonetic transcripts with 35% word error rate. For in-vocabulary words, we achieved a 17% improvement in recall over word-based retrieval with a 17% loss in precision for word error rates ranging from 35 to 65%.

98 citations

Proceedings ArticleDOI
30 Mar 2008
TL;DR: It is found that even a simple normalization method leads to improvements of early precision, both for document and passage retrieval, and better normalization results in better retrieval performance.
Abstract: In the named entity normalization task, a system identifies a canonical unambiguous referent for names like Bush or Alabama. Resolving synonymy and ambiguity of such names can benefit end-to-end information access tasks. We evaluate two entity normalization methods based on Wikipedia in the context of both passage and document retrieval for question answering. We find that even a simple normalization method leads to improvements of early precision, both for document and passage retrieval. Moreover, better normalization results in better retrieval performance.

98 citations


Network Information
Related Topics (5)
Web page
50.3K papers, 975.1K citations
81% related
Metadata
43.9K papers, 642.7K citations
79% related
Recommender system
27.2K papers, 598K citations
79% related
Ontology (information science)
57K papers, 869.1K citations
78% related
Natural language
31.1K papers, 806.8K citations
77% related
Performance
Metrics
No. of papers in the topic in previous years
Year    Papers
2023    9
2022    39
2021    107
2020    130
2019    144
2018    111