Topic

Document retrieval

About: Document retrieval is a research topic. Over its lifetime, 6,821 publications have been published within this topic, receiving 214,383 citations.

Papers

Journal Article

Privacy-Preserving Multi-Keyword Top-$k$ Similarity Search Over Encrypted Data

Ding Xiaofeng¹, Liu Peng¹, Hai Jin¹

Huazhong University of Science and Technology¹

01 Mar 2019-IEEE Transactions on Dependable and Secure Computing

TL;DR: Compared with state-of-the-art methods, the proposed approach significantly improves defense against privacy breaches, scalability, and the time efficiency of query processing.

Abstract: Cloud computing provides individuals and enterprises massive computing power and scalable storage capacities to support a variety of big data applications in domains like health care and scientific research, therefore more and more data owners are involved to outsource their data on cloud servers for great convenience in data management and mining. However, data sets like health records in electronic documents usually contain sensitive information, which brings about privacy concerns if the documents are released or shared to partially untrusted third-parties in cloud. A practical and widely used technique for data privacy preservation is to encrypt data before outsourcing to the cloud servers, which however reduces the data utility and makes many traditional data analytic operators like keyword-based top-$k$ document retrieval obsolete. In this paper, we investigate the multi-keyword top-$k$ search problem for big data encryption against privacy breaches, and attempt to identify an efficient and secure solution to this problem. Specifically, for the privacy concern of query data, we construct a special tree-based index structure and design a random traversal algorithm, which makes even the same query to produce different visiting paths on the index, and can also maintain the accuracy of queries unchanged under stronger privacy. For improving the query efficiency, we propose a group multi-keyword top-$k$ search scheme based on the idea of partition, where a group of tree-based indexes are constructed for all documents. Finally, we combine these methods together into an efficient and secure approach to address our proposed top-$k$ similarity search. Extensive experimental results on real-life data sets demonstrate that our proposed approach can significantly improve the capability of defending the privacy breaches, the scalability and the time efficiency of query processing over the state-of-the-art methods.

53 citations

Proceedings Article

MedMentions: A Large Biomedical Corpus Annotated with UMLS Concepts

Sunil Mohan¹, Donghui Li

National Institutes of Health¹

12 Mar 2019

TL;DR: MedMentions is a manually annotated resource for the recognition of biomedical concepts, comprising over 4,000 abstracts and over 350,000 linked mentions.

Abstract: This paper presents the formal release of MedMentions, a new manually annotated resource for the recognition of biomedical concepts. What distinguishes MedMentions from other annotated biomedical corpora is its size (over 4,000 abstracts and over 350,000 linked mentions), as well as the size of the concept ontology (over 3 million concepts from UMLS 2017) and its broad coverage of biomedical disciplines. In addition to the full corpus, a sub-corpus of MedMentions is also presented, comprising annotations for a subset of UMLS 2017 targeted towards document retrieval. To encourage research in Biomedical Named Entity Recognition and Linking, data splits for training and testing are included in the release, and a baseline model and its metrics for entity linking are also described.

53 citations

Journal Article

Searching biases in large interactive document retrieval systems

David C. Blair¹

University of Michigan¹

01 Aug 1980-Journal of the Association for Information Science and Technology

TL;DR: A searching algorithm is suggested that helps inquirers on a large interactive document retrieval system avoid the systematic biases that cause them to construct and modify queries inefficiently.

Abstract: The way that individuals construct and modify search queries on a large interactive document retrieval system is subject to systematic biases similar to those that have been demonstrated in experiments on judgments under uncertainty. These biases are shared by both naive and sophisticated subjects and cause the inquirer searching for documents on a large interactive system to construct and modify queries inefficiently. A searching algorithm is suggested that helps the inquirer to avoid the effect of these biases.

53 citations

Book Chapter

Efficient Graph-Based Document Similarity

Christian Paul¹, Achim Rettinger¹, Aditya Mogadala¹, Craig A. Knoblock², Pedro Szekely²

Karlsruhe Institute of Technology¹, University of Southern California²

29 May 2016

TL;DR: This paper presents an efficient semantic similarity approach exploiting explicit hierarchical and transversal relations and shows that its similarity measure provides a significantly higher correlation with human notions of document similarity than comparable measures.

Abstract: Assessing the relatedness of documents is at the core of many applications such as document retrieval and recommendation. Most similarity approaches operate on word-distribution-based document representations - fast to compute, but problematic when documents differ in language, vocabulary or type, and neglecting the rich relational knowledge available in Knowledge Graphs. In contrast, graph-based document models can leverage valuable knowledge about relations between entities - however, due to expensive graph operations, similarity assessments tend to become infeasible in many applications. This paper presents an efficient semantic similarity approach exploiting explicit hierarchical and transversal relations. We show in our experiments that (i) our similarity measure provides a significantly higher correlation with human notions of document similarity than comparable measures, (ii) this also holds for short documents with few annotations, and (iii) document similarity can be calculated efficiently compared to other graph-traversal based approaches.

53 citations

Proceedings Article

Effects of out of vocabulary words in spoken document retrieval

Philip C. Woodland, S.E. Johnson, Pierre Jourlin, Karen Sparck Jones

28 Jul 2000

TL;DR: The use of a parallel corpus for query and document expansion is found to be especially beneficial, and with this data set, good retrieval performance can be achieved even for fairly high OOV rates.

Abstract: The effects of out-of-vocabulary (OOV) items in spoken document retrieval (SDR) are investigated. Several sets of transcriptions were created for the TREC-8 SDR task using a speech recognition system varying the vocabulary sizes and OOV rates, and the relative retrieval performance measured. The effects of OOV terms on a simple baseline IR system and on more sophisticated retrieval systems are described. The use of a parallel corpus for query and document expansion is found to be especially beneficial, and with this data set, good retrieval performance can be achieved even for fairly high OOV rates.

53 citations

Performance

Metrics

Papers: 6,866
Citations: 224,605

No. of papers in the topic in previous years
Year	Papers
2023	9
2022	39
2021	107
2020	130
2019	144
2018	111