Topic

Document retrieval

About: Document retrieval is a research topic. Over its lifetime, 6,821 publications have been published within this topic, receiving 214,383 citations.


Papers
Journal Article
TL;DR: This paper describes a process whereby a morpho-syntactic analysis of phrases or user queries is used to generate a structured representation of text to evaluate the effectiveness or quality of the matching and scoring of phrases.
Abstract: The application of automatic natural language processing techniques to the indexing and retrieval of text information has been a target of information retrieval researchers for some time. Incorporating semantic-level processing of language into retrieval has led to conceptual information retrieval, which is effective but usually restricted in its domain. Syntactic-level analysis is domain-independent, but has not yet yielded significant improvements in retrieval quality. This paper describes a process whereby a morpho-syntactic analysis of phrases or user queries is used to generate a structured representation of text. A process of matching these structured representations is then described that generates a metric value or score indicating the degree of match between phrases. This score can then be used for ranking the phrases. To evaluate the effectiveness or quality of the matching and scoring of phrases, some experiments are described that indicate the method to be quite useful. Ultimately, the phrase-matching technique described here would be used as part of an overall document retrieval strategy, and some future work in this direction is outlined.

40 citations

Proceedings Article
03 Jul 2014
TL;DR: This paper models the relationships among images by constructing a voting graph, and proposes an adaptive teleportation random walk process on the voting graph, in which a confidence factor is introduced to control the teleportation probability.
Abstract: Social tags are known to be a valuable source of information for image retrieval and organization. However, contrary to the conventional document retrieval, rich tag frequency information in social sharing systems, such as Flickr, is not available, thus we cannot directly use the tag frequency (analogous to the term frequency in a document) to represent the relevance of tags. Many heuristic approaches have been proposed to address this problem, among which the well-known neighbor voting based approaches are the most effective methods. The basic assumption of these methods is that a tag is considered as relevant to the visual content of a target image if this tag is also used to annotate the visual neighbor images of the target image by lots of different users. The main limitation of these approaches is that they treat the voting power of each neighbor image either equally or simply based on its visual similarity. In this paper, we cast the social tag relevance learning problem as an adaptive teleportation random walk process on the voting graph. In particular, we model the relationships among images by constructing a voting graph, and then propose an adaptive teleportation random walk, in which a confidence factor is introduced to control the teleportation probability, on the voting graph. Through this process, direct and indirect relationships among images can be explored to cooperatively estimate the tag relevance. To quantify the performance of our approach, we compare it with state-of-the-art methods on two publicly available datasets (NUS-WIDE and MIR Flickr). The results indicate that our method achieves substantial performance gains on these datasets.
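The idea of a random walk whose teleportation probability is controlled by a per-image confidence factor can be illustrated with a minimal sketch. The update rule, the function name `adaptive_teleport_walk`, and the inputs `transition`, `prior`, and `confidence` are assumptions made for illustration; the abstract does not give the paper's actual formulation.

```python
def adaptive_teleport_walk(transition, prior, confidence, iters=100, tol=1e-9):
    """Iterate a random walk with a per-node teleportation probability.

    transition[i][j]: assumed row-stochastic weight of neighbor j's vote for image i.
    prior[i]: initial tag relevance of image i (e.g. 1 if the tag was assigned).
    confidence[i]: assumed per-image factor in [0, 1]; it is used here as the
        probability of teleporting back to the prior instead of following votes.
    """
    n = len(prior)
    scores = list(prior)
    for _ in range(iters):
        updated = []
        for i in range(n):
            # Weighted vote collected from the neighbors of image i.
            neighbor_vote = sum(transition[i][j] * scores[j] for j in range(n))
            # Blend the neighbor vote with the prior according to the confidence.
            updated.append((1.0 - confidence[i]) * neighbor_vote
                           + confidence[i] * prior[i])
        delta = max(abs(a - b) for a, b in zip(updated, scores))
        scores = updated
        if delta < tol:
            break
    return scores


# Hypothetical three-image voting graph: images 0 and 2 carry the tag.
transition = [[0.0, 0.5, 0.5], [0.5, 0.0, 0.5], [0.5, 0.5, 0.0]]
relevance = adaptive_teleport_walk(transition, [1, 0, 1], [0.9, 0.1, 0.5])
```

In this sketch an untagged image with low confidence inherits most of its score from its visual neighbors, while a high-confidence image stays close to its own prior.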

40 citations

Proceedings Article
01 Sep 1991
TL;DR: A genetic algorithm for MDAP is developed and the effects of varying the communication cost matrix representing the interprocessor communication topology and the uniformity of the distribution of documents to the clusters are studied.
Abstract: Information retrieval is the selection of documents that are potentially relevant to a user’s information need. Given the vast volume of data stored in modern information retrieval systems, searching the document database requires vast computational resources. To meet these computational demands, various researchers have developed parallel information retrieval systems. As efficient exploitation of parallelism demands fast access to the documents, data organization and placement significantly affect the total processing time. We describe and evaluate a data placement strategy for distributed memory, distributed I/O multicomputers. Initially, a formal description of the Multiprocessor Document Allocation Problem (MDAP) and a proof that MDAP is NP-complete are presented. A document allocation algorithm for MDAP based on Genetic Algorithms is developed. This algorithm assumes that the documents are clustered using any one of the many clustering algorithms. We define a cost function for the derived allocation and evaluate the performance of our algorithm using this function. As part of the experimental analysis, the effects of varying the number of documents and their distribution across the clusters, as well as the exploitation of various architectural interconnection topologies, are studied. We also experiment with several parameters common to Genetic Algorithms, e.g., the probability of mutation and the population size.
1.0 Introduction
An efficient multiprocessor information retrieval system must maintain a low system response time and require relatively little storage overhead. As the volume of stored data continues to increase daily, the multiprocessor engines must likewise scale to a large number of processors. This demand for system scalability necessitates a distributed memory architecture, as a large number of processors is not currently possible in a shared-memory configuration.
A distributed memory system, however, introduces the problem associated with the proper placement of data onto the given architecture. We refer to this problem as the Multiprocessor Document Allocation Problem (MDAP), a derivative of the Mapping Problem originally described by Bokhari [Bok81]. We assume a clustered document database. A clustered approach is taken since an index file organization can introduce vast storage overhead (up to roughly 300% according to Haskin [Has81]) and a full-text or signature analysis technique results in lengthy search times. In this context, a proper solution to MDAP is any mapping of the documents onto the processors such that the average cluster diameter is kept to a minimum while still providing for an even document distribution across the nodes. To achieve a significant reduction in the total query processing time using parallelism, the allocation of data among the processors should be distributed as evenly as possible and the interprocessor communication among the nodes should be minimized. Achieving such an allocation is NP-complete. Thus, it is necessary to use heuristics to obtain satisfactory mappings, which may indeed be suboptimal. Genetic Algorithms [DeJ89, Gol89, Gre85, Gre87, Hol87, Rag87] approximate optimal solutions to computationally intractable problems.
Hava Tova Siegelmann, Dept. of Computer Science, Rutgers University, New Brunswick, NJ 08903
Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission. © 1991 ACM 0-89791-448-1/91/0009/0230...$1.50
We develop a genetic algorithm for MDAP and examine the effects of varying the communication cost matrix representing the interprocessor communication topology and the uniformity of the distribution of documents to the clusters.
1.1 Mapping Problem Approximations
As the Mapping Problem and some of its derivatives are NP-complete, heuristic algorithms are commonly employed to approximate the optimal solutions. Some of these approaches [Bok81, Bol88, Lee87] deal, in some manner,
This work was partially supported by grants from DCS, Inc. under contract number 5-35071 and the Center for Innovative Technology under contract number 5-34042.
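A genetic algorithm for a document-to-processor allocation of this kind can be sketched as follows. This is a minimal illustration, not the paper's algorithm: the cost function (pairwise communication cost between processors holding documents of the same cluster, plus a load-imbalance penalty) is an assumed stand-in for the cluster-diameter objective described above, and all names (`allocation_cost`, `evolve_allocation`, `balance_weight`) are hypothetical.

```python
import random


def allocation_cost(assignment, cluster_of, comm_cost, n_procs, balance_weight=1.0):
    """Assumed cost: intra-cluster communication plus a load-imbalance penalty."""
    total = 0.0
    n_docs = len(assignment)
    for a in range(n_docs):
        for b in range(a + 1, n_docs):
            if cluster_of[a] == cluster_of[b]:
                # Documents of one cluster placed on distant processors cost more.
                total += comm_cost[assignment[a]][assignment[b]]
    loads = [assignment.count(p) for p in range(n_procs)]
    return total + balance_weight * (max(loads) - min(loads))


def evolve_allocation(cluster_of, comm_cost, n_procs, pop_size=30,
                      generations=60, mutation_prob=0.05, seed=0):
    """Evolve a list mapping each document index to a processor index."""
    rng = random.Random(seed)
    n_docs = len(cluster_of)  # the sketch assumes at least two documents

    def cost(individual):
        return allocation_cost(individual, cluster_of, comm_cost, n_procs)

    population = [[rng.randrange(n_procs) for _ in range(n_docs)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Keep the cheaper half unchanged (elitist selection).
        population.sort(key=cost)
        survivors = population[: pop_size // 2]
        children = []
        while len(survivors) + len(children) < pop_size:
            mom, dad = rng.sample(survivors, 2)
            cut = rng.randrange(1, n_docs)  # one-point crossover
            child = mom[:cut] + dad[cut:]
            # Mutation: reassign a document to a random processor.
            child = [rng.randrange(n_procs) if rng.random() < mutation_prob else gene
                     for gene in child]
            children.append(child)
        population = survivors + children
    return min(population, key=cost)


# Hypothetical instance: six documents in two clusters, two processors.
clusters = [0, 0, 0, 1, 1, 1]
comm = [[0, 1], [1, 0]]
best = evolve_allocation(clusters, comm, n_procs=2)
```

The mutation probability and population size are exposed as parameters because the abstract names them as quantities varied in the experiments; the default values here are arbitrary.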

40 citations

Proceedings Article
01 Sep 2015
TL;DR: This work uses the images in image-text documents of each language as the hub and derives a common semantic subspace bridging two languages by means of generalized canonical correlation analysis, which substantially enhances retrieval accuracy in zero-shot and few-shot scenarios where text-to-text examples are scarce.
Abstract: We propose an image-mediated learning approach for cross-lingual document retrieval where no or only a few parallel corpora are available. Using the images in image-text documents of each language as the hub, we derive a common semantic subspace bridging two languages by means of generalized canonical correlation analysis. For the purpose of evaluation, we create and release a new document dataset consisting of three types of data (English text, Japanese text, and images). Our approach substantially enhances retrieval accuracy in zero-shot and few-shot scenarios where text-to-text examples are scarce.

40 citations

Proceedings Article
07 Nov 2002
TL;DR: A new visualization approach for metadata that combines different visualizations into a so-called SuperTable accompanied by a Scatterplot, solving a problem that seemed to be immanent to visualizations in document retrieval: the change of modalities.
Abstract: We present a new visualization approach for metadata that combines different visualizations into a so-called SuperTable accompanied by a Scatterplot. The goal is to improve user experience during the information seeking process. Our new visualizations are based on our experiences developing a visual information retrieval system called INSYDER to supply small and medium-sized enterprises with business information from the Internet. Based on extensive user tests, the original visualizations have been redesigned in two different design variants. Instead of offering multiple visualizations to choose from, the SuperTable + Scatterplot combines them in a new way. The user therefore has the feeling of working with one single visualization in different states. Further, the SuperTable solves a problem that seemed to be immanent to visualizations in document retrieval: the change of modalities.

40 citations


Network Information
Related Topics (5)
Web page
50.3K papers, 975.1K citations
81% related
Metadata
43.9K papers, 642.7K citations
79% related
Recommender system
27.2K papers, 598K citations
79% related
Ontology (information science)
57K papers, 869.1K citations
78% related
Natural language
31.1K papers, 806.8K citations
77% related
Performance
Metrics
No. of papers in the topic in previous years
Year    Papers
2023    9
2022    39
2021    107
2020    130
2019    144
2018    111