Proceedings ArticleDOI

Trustworthy keyword search for regulatory-compliant records retention

TL;DR: This paper proposes a novel scheme for efficient creation of a trustworthy inverted index and demonstrates, through extensive simulations and experiments with an enterprise keyword search engine, that the scheme can achieve online update speeds while maintaining good query performance.
Abstract: Recent litigation and intense regulatory focus on secure retention of electronic records have spurred a rush to introduce Write-Once-Read-Many (WORM) storage devices for retaining business records such as electronic mail. However, simply storing records in WORM storage is insufficient to ensure that the records are trustworthy, i.e., able to provide irrefutable proof and accurate details of past events. Specifically, some form of index is needed for timely access to the records, but unless the index is maintained securely, the records can in effect be hidden or altered, even if stored in WORM storage. In this paper, we systematically analyze the requirements for establishing a trustworthy inverted index to enable keyword-based search queries. We propose a novel scheme for efficient creation of such an index and demonstrate, through extensive simulations and experiments with an enterprise keyword search engine, that the scheme can achieve online update speeds while maintaining good query performance. In addition, we present a secure index structure for multi-keyword queries that supports insert, lookup and range queries in time logarithmic in the number of documents.
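To make the core requirement concrete, the sketch below shows an append-only inverted index in the spirit described above: posting lists only ever grow, mirroring the write-once constraint, and multi-keyword queries are answered by intersecting them. This is a minimal illustration, not the paper's scheme; the class and method names are invented.

```python
# Minimal sketch (not the paper's exact scheme): an append-only inverted
# index in which posting lists can only grow, mimicking the constraint a
# WORM device imposes. Names and structure here are illustrative.
from collections import defaultdict

class AppendOnlyInvertedIndex:
    def __init__(self):
        # keyword -> doc IDs appended in arrival order; entries are never
        # rewritten or deleted once added.
        self._postings = defaultdict(list)

    def index_document(self, doc_id, text):
        """Append doc_id to the posting list of every distinct keyword."""
        for term in set(text.lower().split()):
            self._postings[term].append(doc_id)

    def lookup(self, keyword):
        """Return an immutable view of the posting list for one keyword."""
        return tuple(self._postings.get(keyword.lower(), ()))

    def conjunctive_query(self, keywords):
        """Documents containing all keywords (multi-keyword AND query)."""
        sets = [set(self.lookup(k)) for k in keywords]
        return set.intersection(*sets) if sets else set()

idx = AppendOnlyInvertedIndex()
idx.index_document(1, "retention policy for electronic mail")
idx.index_document(2, "records retention schedule")
print(idx.conjunctive_query(["records", "retention"]))  # {2}
```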


Citations
Proceedings Article
10 Aug 2009
TL;DR: Vanish is presented, a system that makes all copies of archived data unreadable after a user-specified time through a novel integration of cryptographic techniques with global-scale, P2P distributed hash tables (DHTs).
Abstract: Today's technical and legal landscape presents formidable challenges to personal data privacy. First, our increasing reliance on Web services causes personal data to be cached, copied, and archived by third parties, often without our knowledge or control. Second, the disclosure of private data has become commonplace due to carelessness, theft, or legal actions. Our research seeks to protect the privacy of past, archived data -- such as copies of emails maintained by an email provider -- against accidental, malicious, and legal attacks. Specifically, we wish to ensure that all copies of certain data become unreadable after a user-specified time, without any specific action on the part of a user, and even if an attacker obtains both a cached copy of that data and the user's cryptographic keys and passwords. This paper presents Vanish, a system that meets this challenge through a novel integration of cryptographic techniques with global-scale, P2P, distributed hash tables (DHTs). We implemented a proof-of-concept Vanish prototype to use both the million-plus-node Vuze BitTorrent DHT and the restricted-membership OpenDHT. We evaluate experimentally and analytically the functionality, security, and performance properties of Vanish, demonstrating that it is practical to use and meets the privacy-preserving goals described above. We also describe two applications that we prototyped on Vanish: a Firefox plugin for Gmail and other Web sites and a Vanishing File application.
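The mechanism behind Vanish is to encrypt data under a random key, split that key into shares, and scatter the shares across a churning DHT so that the key (and hence the data) becomes unrecoverable once enough shares expire. The sketch below illustrates only the key-splitting and recombination flow; Vanish itself uses Shamir threshold sharing over the Vuze/OpenDHT networks, whereas this uses a simple n-of-n XOR split and an in-memory dict as a stand-in for the DHT.

```python
# Illustrative sketch of the Vanish idea, not its implementation: split a
# random data-encryption key into shares and scatter them so the key expires
# once enough shares are lost. A plain dict stands in for the global DHT.
import secrets

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def split_key(key, n):
    """n-of-n XOR split: all shares are needed to rebuild the key."""
    shares = [secrets.token_bytes(len(key)) for _ in range(n - 1)]
    last = key
    for s in shares:
        last = xor_bytes(last, s)
    return shares + [last]

def recombine(shares):
    key = shares[0]
    for s in shares[1:]:
        key = xor_bytes(key, s)
    return key

fake_dht = {}                       # stand-in for a real DHT
key = secrets.token_bytes(32)       # data encryption key (encryption omitted)
for i, share in enumerate(split_key(key, 10)):
    fake_dht[f"share-{i}"] = share  # real DHT entries time out on their own

recovered = recombine([fake_dht[f"share-{i}"] for i in range(10)])
assert recovered == key             # possible only while all shares survive
```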

404 citations


Cites background from "Trustworthy keyword search for regu..."

  • ...After that time, access to that data should be revoked for everyone — including the legitimate users of that data, the known or unknown entities holding copies of it, and the attackers....


Proceedings Article
10 Aug 2009
TL;DR: A tree-based data structure is described that can generate tamper-evident proofs with logarithmic size and space, improving over previous linear constructions and allowing large-scale log servers to selectively delete old events, in an agreed-upon fashion, while generating efficient proofs that no inappropriate events were deleted.
Abstract: Many real-world applications wish to collect tamper-evident logs for forensic purposes. This paper considers the case of an untrusted logger, serving a number of clients who wish to store their events in the log, and kept honest by a number of auditors who will challenge the logger to prove its correct behavior. We propose semantics of tamper-evident logs in terms of this auditing process. The logger must be able to prove that individual logged events are still present, and that the log, as seen now, is consistent with how it was seen in the past. To accomplish this efficiently, we describe a tree-based data structure that can generate such proofs with logarithmic size and space, improving over previous linear constructions. Where a classic hash chain might require an 800 MB trace to prove that a randomly chosen event is in a log with 80 million events, our prototype returns a 3 KB proof with the same semantics. We also present a flexible mechanism for the log server to present authenticated and tamper-evident search results for all events matching a predicate. This can allow large-scale log servers to selectively delete old events, in an agreed-upon fashion, while generating efficient proofs that no inappropriate events were deleted. We describe a prototype implementation and measure its performance on an 80 million event syslog trace at 1,750 events per second using a single CPU core. Performance improves to 10,500 events per second if cryptographic signatures are offloaded, corresponding to 1.1 TB of logging throughput per week.
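The logarithmic proofs come from organizing log entries as leaves of a hash tree, so that proving one entry's presence only requires the sibling hashes along its root-to-leaf path. The sketch below shows that membership-proof idea for a static Merkle tree; it omits the paper's history tree, incremental consistency proofs, and deletion support.

```python
# Sketch of the core idea behind logarithmic tamper-evidence proofs: a
# Merkle tree over log entries, where a membership proof is the list of
# sibling hashes along one root-to-leaf path.
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_levels(leaves):
    """Return all tree levels, leaf hashes first, root level last."""
    levels = [[h(b"leaf:" + e) for e in leaves]]
    while len(levels[-1]) > 1:
        cur, nxt = levels[-1], []
        for i in range(0, len(cur), 2):
            right = cur[i + 1] if i + 1 < len(cur) else cur[i]
            nxt.append(h(b"node:" + cur[i] + right))
        levels.append(nxt)
    return levels

def membership_proof(levels, index):
    """Sibling hashes from the leaf up to the root (O(log n) of them)."""
    proof = []
    for level in levels[:-1]:
        sib = index ^ 1
        proof.append(level[sib] if sib < len(level) else level[index])
        index //= 2
    return proof

def verify(entry, index, proof, root):
    digest = h(b"leaf:" + entry)
    for sibling in proof:
        if index % 2 == 0:
            digest = h(b"node:" + digest + sibling)
        else:
            digest = h(b"node:" + sibling + digest)
        index //= 2
    return digest == root

events = [f"event-{i}".encode() for i in range(8)]
levels = build_levels(events)
root = levels[-1][0]
proof = membership_proof(levels, 5)   # 3 hashes for 8 events
assert verify(events[5], 5, proof, root)
```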

219 citations

Proceedings ArticleDOI
24 Mar 2009
TL;DR: This paper presents Zerber+R -- a ranking model which allows for privacy-preserving top-k retrieval from an outsourced inverted index and proposes a relevance score transformation function which makes relevance scores of different terms indistinguishable, such that even if stored on an untrusted server they do not reveal information about the indexed data.
Abstract: Privacy-preserving document exchange among collaboration groups in an enterprise as well as across enterprises requires techniques for sharing and search of access-controlled information through largely untrusted servers. In these settings search systems need to provide confidentiality guarantees for shared information while offering IR properties comparable to the ordinary search engines. Top-k is a standard IR technique which enables fast query execution on very large indexes and makes systems highly scalable. However, indexing access-controlled information for top-k retrieval is a challenging task due to the sensitivity of the term statistics used for ranking. In this paper we present Zerber+R -- a ranking model which allows for privacy-preserving top-k retrieval from an outsourced inverted index. We propose a relevance score transformation function which makes relevance scores of different terms indistinguishable, such that even if stored on an untrusted server they do not reveal information about the indexed data. Experiments on two real-world data sets show that Zerber+R makes economical usage of bandwidth and offers retrieval properties comparable with an ordinary inverted index.
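To give a feel for what "indistinguishable relevance scores" can mean, the sketch below replaces each term's raw scores by their rank quantiles within that term's posting list: score ranges then look alike across terms, while the per-term ordering needed for top-k merging is preserved. This is an illustrative stand-in, not Zerber+R's actual transformation function.

```python
# Illustrative only (not Zerber+R's transformation): map each raw relevance
# score to its rank quantile inside its own posting list, so different
# terms' score distributions become indistinguishable in range while the
# within-term ordering used by top-k retrieval is preserved.
def quantile_transform(posting):
    """posting: list of (doc_id, raw_score). Returns (doc_id, quantile)."""
    ranked = sorted(posting, key=lambda p: p[1])
    n = len(ranked)
    return [(doc_id, (i + 1) / n) for i, (doc_id, _) in enumerate(ranked)]

# Two terms with very different raw score distributions...
term_a = [("d1", 12.7), ("d2", 3.1), ("d3", 9.4)]
term_b = [("d1", 0.02), ("d4", 0.90), ("d5", 0.15)]

# ...end up with the same score range after transformation, yet each list
# still supports descending-score traversal for top-k merging.
print(sorted(quantile_transform(term_a), key=lambda p: -p[1]))
print(sorted(quantile_transform(term_b), key=lambda p: -p[1]))
```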

148 citations

Journal ArticleDOI
01 Dec 2006
TL;DR: The monotonicity principle is presented, and it is shown how it leads to the use of top-K mappings rather than a single mapping.
Abstract: In this paper we analyze the problem of schema matching, explain why it is such a "tough" problem and suggest directions for handling it effectively. In particular, we present the monotonicity principle and see how it leads to the use of top-K mappings rather than a single mapping.

65 citations

Proceedings ArticleDOI
25 Mar 2008
TL;DR: The r-confidential Zerber indexing facility for sensitive documents is proposed, which uses secret splitting and term merging to provide tunable limits on information leakage, even under statistical attacks; requires only limited trust in a central indexing authority; and is extremely easy to use and administer.
Abstract: To carry out work assignments, small groups distributed within a larger enterprise often need to share documents among themselves while shielding those documents from others' eyes. In this situation, users need an indexing facility that can quickly locate relevant documents that they are allowed to access, without (1) leaking information about the remaining documents, (2) imposing a large management burden as users, groups, and documents evolve, or (3) requiring users to agree on a central completely trusted authority. To address this problem, we propose the concept of r-confidentiality, which captures the degree of information leakage from an index about the terms contained in inaccessible documents. Then we propose the r-confidential Zerber indexing facility for sensitive documents, which uses secret splitting and term merging to provide tunable limits on information leakage, even under statistical attacks; requires only limited trust in a central indexing authority; and is extremely easy to use and administer. Experiments with real-world data show that Zerber offers excellent performance for index insertions and lookups while requiring only a modest amount of storage space and network bandwidth.
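A rough illustration of the two ingredients named in the abstract, under invented parameters and without Zerber's actual protocol: term merging hashes many terms into a shared posting bucket, and secret splitting stores each posting as shares on separate index servers so that no single server sees it in the clear.

```python
# Illustrative sketch, not Zerber itself: "term merging" maps several terms
# onto one shared bucket, and "secret splitting" stores each posting as
# random-looking shares on different index servers. Bucket and server counts
# are made-up parameters.
import hashlib, secrets

NUM_BUCKETS = 16
NUM_SERVERS = 3
servers = [dict() for _ in range(NUM_SERVERS)]   # bucket -> list of shares

def bucket_of(term: str) -> int:
    """Term merging: many terms hash into the same bucket."""
    digest = hashlib.sha256(term.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_BUCKETS

def store_posting(term: str, doc_id: int):
    """Split the posting into XOR shares and spread them across servers."""
    payload = f"{term}:{doc_id}".encode()
    shares = [secrets.token_bytes(len(payload)) for _ in range(NUM_SERVERS - 1)]
    last = payload
    for s in shares:
        last = bytes(a ^ b for a, b in zip(last, s))
    for server, share in zip(servers, shares + [last]):
        server.setdefault(bucket_of(term), []).append(share)

store_posting("merger", 42)
store_posting("acquisition", 42)
# A single server sees only random-looking shares grouped by merged bucket.
print({bucket: len(shares) for bucket, shares in servers[0].items()})
```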

52 citations

References
Book
01 Jan 1949

5,898 citations

Proceedings Article
01 Jan 1994
TL;DR: Much of the work involved investigating plausible methods of applying Okapi-style weighting to phrases; query expansion used terms from the top documents retrieved by a pilot search on topic terms.
Abstract: City submitted two runs each for the automatic ad hoc, very large collection track, automatic routing and Chinese track; and took part in the interactive and filtering tracks. The method used was: expansion using terms from the top documents retrieved by a pilot search on topic terms. Additional runs seem to show that we would have done better without expansion. Two runs using the method of city96al were also submitted for the Very Large Collection track. The training database and its relevant documents were partitioned into three parts. Working on a pool of terms extracted from the relevant documents for one partition, an iterative procedure added or removed terms and/or varied their weights. After each change in query content or term weights, a score was calculated by using the current query to search a second portion of the training database and evaluating the results against the corresponding set of relevant documents. Methods were compared by evaluating queries predictively against the third training partition. Queries from different methods were then merged and the results evaluated in the same way. Two runs were submitted, one based on character searching and the other on words or phrases. Much of the work involved investigating plausible methods of applying Okapi-style weighting to phrases.
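For reference, the Okapi BM25 weighting mentioned here (and cited by the indexed paper as [25]) can be written as a small self-contained scorer. The sketch uses the common k1 and b defaults and generic BM25, not the exact variant tuned for these TREC runs.

```python
# Generic Okapi BM25 term weighting as a tiny, self-contained scorer.
# k1 and b are set to common defaults; documents are plain token lists.
import math
from collections import Counter

def bm25_score(query_terms, doc, corpus, k1=1.2, b=0.75):
    """Score one document (token list) against a bag-of-words query."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    tf = Counter(doc)
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)     # document frequency
        if df == 0 or tf[term] == 0:
            continue
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1.0)
        norm = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
        score += idf * (tf[term] * (k1 + 1)) / norm
    return score

corpus = [doc.split() for doc in [
    "records retention policy",
    "keyword search over electronic records",
    "email archive compliance",
]]
query = ["records", "retention"]
best = max(range(len(corpus)), key=lambda i: bm25_score(query, corpus[i], corpus))
print(best)  # 0: the only document matching both query terms
```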

2,459 citations


"Trustworthy keyword search for regu..." refers methods in this paper

  • ...posting lists are assigned scores based on similarity measures like cosine [28] or Okapi BM-25 [25]....


Book
11 May 1999
TL;DR: This book covers compressing and indexing large document and image collections, spanning text compression, indexing, querying, and index construction, with guides to the MG system and the NZDL.
Abstract: Preface. 1. Overview. 2. Text Compression. 3. Indexing. 4. Querying. 5. Index Construction. 6. Image Compression. 7. Textual Images. 8. Mixed Text and Images. 9. Implementation. 10. The Information Explosion. A. Guide to the MG System. B. Guide to the NZDL. References. Index.

2,068 citations


"Trustworthy keyword search for regu..." refers background or methods in this paper

  • ...do not perform well in practice [28] and can omit relevant documents....


  • ...posting lists are assigned scores based on similarity measures like cosine [28] or Okapi BM-25 [25]....


Book
01 Jan 2001
TL;DR: This introduction to database systems offers a readable comprehensive approach with engaging, real-world examples, and users will learn how to successfully plan a database application before building it.
Abstract: From the Publisher: This introduction to database systems offers a readable comprehensive approach with engaging, real-world examples—users will learn how to successfully plan a database application before building it. The first half of the book provides in-depth coverage of databases from the point of view of the database designer, user, and application programmer, while the second half of the book provides in-depth coverage of databases from the point of view of the DBMS implementor. The first half of the book focuses on database design, database use, and implementation of database applications and database management systems—it covers the latest database standards SQL:1999, SQL/PSM, SQL/CLI, JDBC, ODL, and XML, with broader coverage of SQL than most other books. The second half of the book focuses on storage structures, query processing, and transaction management—it covers the main techniques in these areas with broader coverage of query optimization than most other books, along with advanced topics including multidimensional and bitmap indexes, distributed transactions, and information integration techniques. A professional reference for database designers, users, and application programmers.

1,405 citations


"Trustworthy keyword search for regu..." refers methods in this paper

  • ...For example, one can exploit the fact that the posting lists are sorted on document ID and use the zigzag join [14] algorithm (Figure 5), together with an auxiliary in-...

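The zigzag join mentioned in the excerpt above intersects posting lists that are sorted on document ID by repeatedly skipping each list forward to the other's current candidate instead of scanning linearly. A hedged sketch of that intersection step (illustrative, not the paper's code):

```python
# Sketch of intersecting posting lists sorted by document ID in the spirit
# of a zigzag join: advance each list to the other's current candidate with
# binary search, skipping runs of non-matching IDs.
from bisect import bisect_left

def zigzag_intersect(list_a, list_b):
    """Both inputs are sorted lists of document IDs; returns the matches."""
    result, i, j = [], 0, 0
    while i < len(list_a) and j < len(list_b):
        if list_a[i] == list_b[j]:
            result.append(list_a[i])
            i += 1
            j += 1
        elif list_a[i] < list_b[j]:
            i = bisect_left(list_a, list_b[j], i + 1)   # skip ahead in A
        else:
            j = bisect_left(list_b, list_a[i], j + 1)   # skip ahead in B
    return result

print(zigzag_intersect([2, 7, 9, 40, 41, 90], [7, 8, 41, 100]))  # [7, 41]
```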

Journal ArticleDOI
01 Sep 1999
TL;DR: It is shown that web users type in short queries, mostly look at the first 10 results only, and seldom modify the query, suggesting that traditional information retrieval techniques may not work well for answering web search requests.
Abstract: In this paper we present an analysis of an AltaVista Search Engine query log consisting of approximately 1 billion entries for search requests over a period of six weeks. This represents almost 285 million user sessions, each an attempt to fill a single information need. We present an analysis of individual queries, query duplication, and query sessions. We also present results of a correlation analysis of the log entries, studying the interaction of terms within queries. Our data supports the conjecture that web users differ significantly from the user assumed in the standard information retrieval literature. Specifically, we show that web users type in short queries, mostly look at the first 10 results only, and seldom modify the query. This suggests that traditional information retrieval techniques may not work well for answering web search requests. The correlation analysis showed that the most highly correlated items are constituents of phrases. This result indicates it may be useful for search engines to consider search terms as parts of phrases even if the user did not explicitly specify them as such.

1,255 citations