Showing papers by "Jeffrey Scott Vitter" published in 2022


Journal Article
TL;DR: For top-k document retrieval in the external memory model, two data structures that report answers in unsorted order of relevance are presented, together with an O(n log (d/B))-space data structure that handles sorted top-k retrieval with optimal query cost.
Abstract: The ranked (or top-k) document retrieval problem is defined as follows: preprocess a collection {T1, T2, …, Td} of d strings (called documents) of total length n into a data structure, such that for any given query (P, k), where P is a string (called pattern) of length p ≥ 1 and k ∈ [1, d] is an integer, the identifiers of those k documents that are most relevant to P can be reported, ideally in the sorted order of their relevance. The seminal work by Hon et al. [FOCS 2009 and Journal of the ACM 2014] presented an O(n)-space (in words) data structure with O(p + k log k) query time. The query time was later improved to O(p + k) [SODA 2012] and further to O(p/log_σ n + k) [SIAM Journal on Computing 2017] by Navarro and Nekrich, where σ is the alphabet size. We revisit this problem in the external memory model and present three data structures. The first one takes O(n) space and answers queries in O(p/B + log_B n + k/B + log*(n/B)) I/Os, where B is the block size. The second one takes O(n log*(n/B)) space and answers queries in optimal O(p/B + log_B n + k/B) I/Os. In both cases, the answers are reported in the unsorted order of relevance. To handle sorted top-k document retrieval, we present an O(n log (d/B))-space data structure with optimal query cost.
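
To make the problem statement concrete, the following is a minimal brute-force sketch of a top-k query, not the data structures from the paper. It assumes that relevance means the number of (possibly overlapping) occurrences of P in a document, which is one common choice; the abstract does not fix a relevance measure. The function names `count_occurrences` and `top_k_documents` are hypothetical, and ties are broken by smaller document identifier purely for determinism.

```python
import heapq


def count_occurrences(text: str, pattern: str) -> int:
    """Count possibly overlapping occurrences of pattern in text."""
    if not pattern:
        raise ValueError("pattern must have length p >= 1")
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        # Advance by one position so overlapping matches are counted.
        start = text.find(pattern, start + 1)
    return count


def top_k_documents(documents: list[str], pattern: str, k: int) -> list[int]:
    """Return identifiers (1-based) of the k most relevant documents.

    Assumption: relevance is the occurrence count of the pattern.
    Documents that do not contain the pattern are never reported.
    The result is in sorted (descending) order of relevance.
    """
    scored = []
    for doc_id, doc in enumerate(documents, start=1):
        score = count_occurrences(doc, pattern)
        if score > 0:
            scored.append((score, doc_id))
    # Higher score first; among equal scores, the smaller identifier wins.
    best = heapq.nlargest(k, scored, key=lambda item: (item[0], -item[1]))
    return [doc_id for _, doc_id in best]


if __name__ == "__main__":
    docs = ["abab", "ab", "ababab", "cd"]
    print(top_k_documents(docs, "ab", 2))
```

This scan reads every document on every query, so its cost grows with n rather than with p and k. The data structures described in the abstract avoid that by preprocessing the collection so that a query costs O(p/B + log_B n + k/B) I/Os or close to it.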