Proceedings ArticleDOI

Trustworthy keyword search for regulatory-compliant records retention

TL;DR: This paper proposes a novel scheme for efficient creation of a trustworthy inverted index and demonstrates, through extensive simulations and experiments with an enterprise keyword search engine, that the scheme can achieve online update speeds while maintaining good query performance.
Abstract: Recent litigation and intense regulatory focus on secure retention of electronic records have spurred a rush to introduce Write-Once-Read-Many (WORM) storage devices for retaining business records such as electronic mail. However, simply storing records in WORM storage is insufficient to ensure that the records are trustworthy, i.e., able to provide irrefutable proof and accurate details of past events. Specifically, some form of index is needed for timely access to the records, but unless the index is maintained securely, the records can in effect be hidden or altered, even if stored in WORM storage. In this paper, we systematically analyze the requirements for establishing a trustworthy inverted index to enable keyword-based search queries. We propose a novel scheme for efficient creation of such an index and demonstrate, through extensive simulations and experiments with an enterprise keyword search engine, that the scheme can achieve online update speeds while maintaining good query performance. In addition, we present a secure index structure for multi-keyword queries that supports insert, lookup and range queries in time logarithmic in the number of documents.
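To make the core requirement concrete, the sketch below shows an append-only inverted index in the spirit described above: posting lists only ever grow, mirroring the write-once constraint, and multi-keyword queries are answered by intersecting them. This is a minimal illustration, not the paper's scheme; the class and method names are invented.

```python
# Minimal sketch (not the paper's exact scheme): an append-only inverted
# index in which posting lists can only grow, mimicking the constraint a
# WORM device imposes. Names and structure here are illustrative.
from collections import defaultdict

class AppendOnlyInvertedIndex:
    def __init__(self):
        # keyword -> doc IDs appended in arrival order; entries are never
        # rewritten or deleted once added.
        self._postings = defaultdict(list)

    def index_document(self, doc_id, text):
        """Append doc_id to the posting list of every distinct keyword."""
        for term in set(text.lower().split()):
            self._postings[term].append(doc_id)

    def lookup(self, keyword):
        """Return an immutable view of the posting list for one keyword."""
        return tuple(self._postings.get(keyword.lower(), ()))

    def conjunctive_query(self, keywords):
        """Documents containing all keywords (multi-keyword AND query)."""
        sets = [set(self.lookup(k)) for k in keywords]
        return set.intersection(*sets) if sets else set()

idx = AppendOnlyInvertedIndex()
idx.index_document(1, "retention policy for electronic mail")
idx.index_document(2, "records retention schedule")
print(idx.conjunctive_query(["records", "retention"]))  # {2}
```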


Citations
Proceedings Article
10 Aug 2009
TL;DR: Vanish is presented, a system that makes all copies of archived data unreadable after a user-specified time through a novel integration of cryptographic techniques with global-scale, P2P distributed hash tables (DHTs).
Abstract: Today's technical and legal landscape presents formidable challenges to personal data privacy. First, our increasing reliance on Web services causes personal data to be cached, copied, and archived by third parties, often without our knowledge or control. Second, the disclosure of private data has become commonplace due to carelessness, theft, or legal actions. Our research seeks to protect the privacy of past, archived data -- such as copies of emails maintained by an email provider -- against accidental, malicious, and legal attacks. Specifically, we wish to ensure that all copies of certain data become unreadable after a user-specified time, without any specific action on the part of a user, and even if an attacker obtains both a cached copy of that data and the user's cryptographic keys and passwords. This paper presents Vanish, a system that meets this challenge through a novel integration of cryptographic techniques with global-scale, P2P, distributed hash tables (DHTs). We implemented a proof-of-concept Vanish prototype to use both the million-plus-node Vuze BitTorrent DHT and the restricted-membership OpenDHT. We evaluate experimentally and analytically the functionality, security, and performance properties of Vanish, demonstrating that it is practical to use and meets the privacy-preserving goals described above. We also describe two applications that we prototyped on Vanish: a Firefox plugin for Gmail and other Web sites and a Vanishing File application.
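The mechanism behind Vanish is to encrypt data under a random key, split that key into shares, and scatter the shares across a churning DHT so that the key (and hence the data) becomes unrecoverable once enough shares expire. The sketch below illustrates only the key-splitting and recombination flow; Vanish itself uses Shamir threshold sharing over the Vuze/OpenDHT networks, whereas this uses a simple n-of-n XOR split and an in-memory dict as a stand-in for the DHT.

```python
# Illustrative sketch of the Vanish idea, not its implementation: split a
# random data-encryption key into shares and scatter them so the key expires
# once enough shares are lost. A plain dict stands in for the global DHT.
import secrets

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def split_key(key, n):
    """n-of-n XOR split: all shares are needed to rebuild the key."""
    shares = [secrets.token_bytes(len(key)) for _ in range(n - 1)]
    last = key
    for s in shares:
        last = xor_bytes(last, s)
    return shares + [last]

def recombine(shares):
    key = shares[0]
    for s in shares[1:]:
        key = xor_bytes(key, s)
    return key

fake_dht = {}                       # stand-in for a real DHT
key = secrets.token_bytes(32)       # data encryption key (encryption omitted)
for i, share in enumerate(split_key(key, 10)):
    fake_dht[f"share-{i}"] = share  # real DHT entries time out on their own

recovered = recombine([fake_dht[f"share-{i}"] for i in range(10)])
assert recovered == key             # possible only while all shares survive
```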

404 citations


Cites background from "Trustworthy keyword search for regu..."

  • ...After that time, access to that data should be revoked for everyone — including the legitimate users of that data, the known or unknown entities holding copies of it, and the attackers....


Proceedings Article
10 Aug 2009
TL;DR: A tree-based data structure is described that can generate tamper-evident proofs with logarithmic size and space, improving over previous linear constructions and allowing large-scale log servers to selectively delete old events, in an agreed-upon fashion, while generating efficient proofs that no inappropriate events were deleted.
Abstract: Many real-world applications wish to collect tamper-evident logs for forensic purposes. This paper considers the case of an untrusted logger, serving a number of clients who wish to store their events in the log, and kept honest by a number of auditors who will challenge the logger to prove its correct behavior. We propose semantics of tamper-evident logs in terms of this auditing process. The logger must be able to prove that individual logged events are still present, and that the log, as seen now, is consistent with how it was seen in the past. To accomplish this efficiently, we describe a tree-based data structure that can generate such proofs with logarithmic size and space, improving over previous linear constructions. Where a classic hash chain might require an 800 MB trace to prove that a randomly chosen event is in a log with 80 million events, our prototype returns a 3 KB proof with the same semantics. We also present a flexible mechanism for the log server to present authenticated and tamper-evident search results for all events matching a predicate. This can allow large-scale log servers to selectively delete old events, in an agreed-upon fashion, while generating efficient proofs that no inappropriate events were deleted. We describe a prototype implementation and measure its performance on an 80 million event syslog trace at 1,750 events per second using a single CPU core. Performance improves to 10,500 events per second if cryptographic signatures are offloaded, corresponding to 1.1 TB of logging throughput per week.
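The logarithmic proofs come from organizing log entries as leaves of a hash tree, so that proving one entry's presence only requires the sibling hashes along its root-to-leaf path. The sketch below shows that membership-proof idea for a static Merkle tree; it omits the paper's history tree, incremental consistency proofs, and deletion support.

```python
# Sketch of the core idea behind logarithmic tamper-evidence proofs: a
# Merkle tree over log entries, where a membership proof is the list of
# sibling hashes along one root-to-leaf path.
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_levels(leaves):
    """Return all tree levels, leaf hashes first, root level last."""
    levels = [[h(b"leaf:" + e) for e in leaves]]
    while len(levels[-1]) > 1:
        cur, nxt = levels[-1], []
        for i in range(0, len(cur), 2):
            right = cur[i + 1] if i + 1 < len(cur) else cur[i]
            nxt.append(h(b"node:" + cur[i] + right))
        levels.append(nxt)
    return levels

def membership_proof(levels, index):
    """Sibling hashes from the leaf up to the root (O(log n) of them)."""
    proof = []
    for level in levels[:-1]:
        sib = index ^ 1
        proof.append(level[sib] if sib < len(level) else level[index])
        index //= 2
    return proof

def verify(entry, index, proof, root):
    digest = h(b"leaf:" + entry)
    for sibling in proof:
        if index % 2 == 0:
            digest = h(b"node:" + digest + sibling)
        else:
            digest = h(b"node:" + sibling + digest)
        index //= 2
    return digest == root

events = [f"event-{i}".encode() for i in range(8)]
levels = build_levels(events)
root = levels[-1][0]
proof = membership_proof(levels, 5)   # 3 hashes for 8 events
assert verify(events[5], 5, proof, root)
```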

219 citations

Proceedings ArticleDOI
24 Mar 2009
TL;DR: This paper presents Zerber+R -- a ranking model which allows for privacy-preserving top-k retrieval from an outsourced inverted index and proposes a relevance score transformation function which makes relevance scores of different terms indistinguishable, such that even if stored on an untrusted server they do not reveal information about the indexed data.
Abstract: Privacy-preserving document exchange among collaboration groups in an enterprise as well as across enterprises requires techniques for sharing and search of access-controlled information through largely untrusted servers. In these settings search systems need to provide confidentiality guarantees for shared information while offering IR properties comparable to the ordinary search engines. Top-k is a standard IR technique which enables fast query execution on very large indexes and makes systems highly scalable. However, indexing access-controlled information for top-k retrieval is a challenging task due to the sensitivity of the term statistics used for ranking. In this paper we present Zerber+R -- a ranking model which allows for privacy-preserving top-k retrieval from an outsourced inverted index. We propose a relevance score transformation function which makes relevance scores of different terms indistinguishable, such that even if stored on an untrusted server they do not reveal information about the indexed data. Experiments on two real-world data sets show that Zerber+R makes economical usage of bandwidth and offers retrieval properties comparable with an ordinary inverted index.
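To give a feel for what "indistinguishable relevance scores" can mean, the sketch below replaces each term's raw scores by their rank quantiles within that term's posting list: score ranges then look alike across terms, while the per-term ordering needed for top-k merging is preserved. This is an illustrative stand-in, not Zerber+R's actual transformation function.

```python
# Illustrative only (not Zerber+R's transformation): map each raw relevance
# score to its rank quantile inside its own posting list, so different
# terms' score distributions become indistinguishable in range while the
# within-term ordering used by top-k retrieval is preserved.
def quantile_transform(posting):
    """posting: list of (doc_id, raw_score). Returns (doc_id, quantile)."""
    ranked = sorted(posting, key=lambda p: p[1])
    n = len(ranked)
    return [(doc_id, (i + 1) / n) for i, (doc_id, _) in enumerate(ranked)]

# Two terms with very different raw score distributions...
term_a = [("d1", 12.7), ("d2", 3.1), ("d3", 9.4)]
term_b = [("d1", 0.02), ("d4", 0.90), ("d5", 0.15)]

# ...end up with the same score range after transformation, yet each list
# still supports descending-score traversal for top-k merging.
print(sorted(quantile_transform(term_a), key=lambda p: -p[1]))
print(sorted(quantile_transform(term_b), key=lambda p: -p[1]))
```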

148 citations

Journal ArticleDOI
01 Dec 2006
TL;DR: The monotonicity principle is presented, and it is shown how it leads to the use of top-K mappings rather than a single mapping.
Abstract: In this paper we analyze the problem of schema matching, explain why it is such a "tough" problem and suggest directions for handling it effectively. In particular, we present the monotonicity principle and see how it leads to the use of top-K mappings rather than a single mapping.

65 citations

Proceedings ArticleDOI
25 Mar 2008
TL;DR: The r-confidential Zerber indexing facility for sensitive documents is proposed, which uses secret splitting and term merging to provide tunable limits on information leakage, even under statistical attacks; requires only limited trust in a central indexing authority; and is extremely easy to use and administer.
Abstract: To carry out work assignments, small groups distributed within a larger enterprise often need to share documents among themselves while shielding those documents from others' eyes. In this situation, users need an indexing facility that can quickly locate relevant documents that they are allowed to access, without (1) leaking information about the remaining documents, (2) imposing a large management burden as users, groups, and documents evolve, or (3) requiring users to agree on a central completely trusted authority. To address this problem, we propose the concept of r-confidentiality, which captures the degree of information leakage from an index about the terms contained in inaccessible documents. Then we propose the r-confidential Zerber indexing facility for sensitive documents, which uses secret splitting and term merging to provide tunable limits on information leakage, even under statistical attacks; requires only limited trust in a central indexing authority; and is extremely easy to use and administer. Experiments with real-world data show that Zerber offers excellent performance for index insertions and lookups while requiring only a modest amount of storage space and network bandwidth.
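A rough illustration of the two ingredients named in the abstract, under invented parameters and without Zerber's actual protocol: term merging hashes many terms into a shared posting bucket, and secret splitting stores each posting as shares on separate index servers so that no single server sees it in the clear.

```python
# Illustrative sketch, not Zerber itself: "term merging" maps several terms
# onto one shared bucket, and "secret splitting" stores each posting as
# random-looking shares on different index servers. Bucket and server counts
# are made-up parameters.
import hashlib, secrets

NUM_BUCKETS = 16
NUM_SERVERS = 3
servers = [dict() for _ in range(NUM_SERVERS)]   # bucket -> list of shares

def bucket_of(term: str) -> int:
    """Term merging: many terms hash into the same bucket."""
    digest = hashlib.sha256(term.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_BUCKETS

def store_posting(term: str, doc_id: int):
    """Split the posting into XOR shares and spread them across servers."""
    payload = f"{term}:{doc_id}".encode()
    shares = [secrets.token_bytes(len(payload)) for _ in range(NUM_SERVERS - 1)]
    last = payload
    for s in shares:
        last = bytes(a ^ b for a, b in zip(last, s))
    for server, share in zip(servers, shares + [last]):
        server.setdefault(bucket_of(term), []).append(share)

store_posting("merger", 42)
store_posting("acquisition", 42)
# A single server sees only random-looking shares grouped by merged bucket.
print({bucket: len(shares) for bucket, shares in servers[0].items()})
```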

52 citations

References
Book
01 Jan 1949

5,898 citations

Proceedings Article
01 Jan 1994
TL;DR: Much of the work involved investigating plausible methods of applying Okapi-style weighting to phrases; query expansion used terms from the top documents retrieved by a pilot search on topic terms.
Abstract: City submitted two runs each for the automatic ad hoc, very large collection track, automatic routing and Chinese track; and took part in the interactive and filtering tracks. The method used was: expansion using terms from the top documents retrieved by a pilot search on topic terms. Additional runs seem to show that we would have done better without expansion. Two runs using the method of city96al were also submitted for the Very Large Collection track. The training database and its relevant documents were partitioned into three parts. Working on a pool of terms extracted from the relevant documents for one partition, an iterative procedure added or removed terms and/or varied their weights. After each change in query content or term weights, a score was calculated by using the current query to search a second portion of the training database and evaluating the results against the corresponding set of relevant documents. Methods were compared by evaluating queries predictively against the third training partition. Queries from different methods were then merged and the results evaluated in the same way. Two runs were submitted, one based on character searching and the other on words or phrases. Much of the work involved investigating plausible methods of applying Okapi-style weighting to phrases.
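For reference, the Okapi BM25 weighting mentioned here (and cited by the indexed paper as [25]) can be written as a small self-contained scorer. The sketch uses the common k1 and b defaults and generic BM25, not the exact variant tuned for these TREC runs.

```python
# Generic Okapi BM25 term weighting as a tiny, self-contained scorer.
# k1 and b are set to common defaults; documents are plain token lists.
import math
from collections import Counter

def bm25_score(query_terms, doc, corpus, k1=1.2, b=0.75):
    """Score one document (token list) against a bag-of-words query."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    tf = Counter(doc)
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)     # document frequency
        if df == 0 or tf[term] == 0:
            continue
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1.0)
        norm = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
        score += idf * (tf[term] * (k1 + 1)) / norm
    return score

corpus = [doc.split() for doc in [
    "records retention policy",
    "keyword search over electronic records",
    "email archive compliance",
]]
query = ["records", "retention"]
best = max(range(len(corpus)), key=lambda i: bm25_score(query, corpus[i], corpus))
print(best)  # 0: the only document matching both query terms
```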

2,459 citations


"Trustworthy keyword search for regu..." refers methods in this paper

  • ...posting lists are assigned scores based on similarity measures like cosine [28] or Okapi BM-25 [25]....


Book
11 May 1999
TL;DR: This book covers compressing and indexing large document and image collections, spanning text compression, indexing, querying, and index construction, with guides to the MG system and the NZDL.
Abstract: Preface. 1. Overview. 2. Text Compression. 3. Indexing. 4. Querying. 5. Index Construction. 6. Image Compression. 7. Textual Images. 8. Mixed Text and Images. 9. Implementation. 10. The Information Explosion. A. Guide to the MG System. B. Guide to the NZDL. References. Index.

2,068 citations


"Trustworthy keyword search for regu..." refers background or methods in this paper

  • ...do not perform well in practice [28] and can omit relevant documents....


  • ...posting lists are assigned scores based on similarity measures like cosine [28] or Okapi BM-25 [25]....


Book
01 Jan 2001
TL;DR: This introduction to database systems offers a readable comprehensive approach with engaging, real-world examples, and users will learn how to successfully plan a database application before building it.
Abstract: From the Publisher: This introduction to database systems offers a readable comprehensive approach with engaging, real-world examples—users will learn how to successfully plan a database application before building it. The first half of the book provides in-depth coverage of databases from the point of view of the database designer, user, and application programmer, while the second half of the book provides in-depth coverage of databases from the point of view of the DBMS implementor. The first half of the book focuses on database design, database use, and implementation of database applications and database management systems—it covers the latest database standards SQL:1999, SQL/PSM, SQL/CLI, JDBC, ODL, and XML, with broader coverage of SQL than most other books. The second half of the book focuses on storage structures, query processing, and transaction management—it covers the main techniques in these areas with broader coverage of query optimization than most other books, along with advanced topics including multidimensional and bitmap indexes, distributed transactions, and information integration techniques. A professional reference for database designers, users, and application programmers.

1,405 citations


"Trustworthy keyword search for regu..." refers methods in this paper

  • ...For example, one can exploit the fact that the posting lists are sorted on document ID and use the zigzag join [14] algorithm (Figure 5), together with an auxiliary in-...

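The zigzag join mentioned in the excerpt above intersects posting lists that are sorted on document ID by repeatedly skipping each list forward to the other's current candidate instead of scanning linearly. A hedged sketch of that intersection step (illustrative, not the paper's code):

```python
# Sketch of intersecting posting lists sorted by document ID in the spirit
# of a zigzag join: advance each list to the other's current candidate with
# binary search, skipping runs of non-matching IDs.
from bisect import bisect_left

def zigzag_intersect(list_a, list_b):
    """Both inputs are sorted lists of document IDs; returns the matches."""
    result, i, j = [], 0, 0
    while i < len(list_a) and j < len(list_b):
        if list_a[i] == list_b[j]:
            result.append(list_a[i])
            i += 1
            j += 1
        elif list_a[i] < list_b[j]:
            i = bisect_left(list_a, list_b[j], i + 1)   # skip ahead in A
        else:
            j = bisect_left(list_b, list_a[i], j + 1)   # skip ahead in B
    return result

print(zigzag_intersect([2, 7, 9, 40, 41, 90], [7, 8, 41, 100]))  # [7, 41]
```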

Journal ArticleDOI
01 Sep 1999
TL;DR: It is shown that web users type in short queries, mostly look at the first 10 results only, and seldom modify the query, suggesting that traditional information retrieval techniques may not work well for answering web search requests.
Abstract: In this paper we present an analysis of an AltaVista Search Engine query log consisting of approximately 1 billion entries for search requests over a period of six weeks. This represents almost 285 million user sessions, each an attempt to fill a single information need. We present an analysis of individual queries, query duplication, and query sessions. We also present results of a correlation analysis of the log entries, studying the interaction of terms within queries. Our data supports the conjecture that web users differ significantly from the user assumed in the standard information retrieval literature. Specifically, we show that web users type in short queries, mostly look at the first 10 results only, and seldom modify the query. This suggests that traditional information retrieval techniques may not work well for answering web search requests. The correlation analysis showed that the most highly correlated items are constituents of phrases. This result indicates it may be useful for search engines to consider search terms as parts of phrases even if the user did not explicitly specify them as such.

1,255 citations