scispace - formally typeset
Search or ask a question
Author

Vitaly Shmatikov

Bio: Vitaly Shmatikov is an academic researcher from Cornell University. The author has contributed to research in topics: Anonymity & Information privacy. The author has an hindex of 64, co-authored 148 publications receiving 17801 citations. Previous affiliations of Vitaly Shmatikov include University of Texas at Austin & French Institute for Research in Computer Science and Automation.


Papers
More filters
Posted Content
TL;DR: This article developed gradient-based approaches for generating semantic collisions and demonstrate that state-of-the-art models for many tasks which rely on analyzing the meaning and similarity of texts, including paraphrase identification, document retrieval, response suggestion, and extractive summarization are vulnerable to semantic collisions.
Abstract: We study semantic collisions: texts that are semantically unrelated but judged as similar by NLP models. We develop gradient-based approaches for generating semantic collisions and demonstrate that state-of-the-art models for many tasks which rely on analyzing the meaning and similarity of texts-- including paraphrase identification, document retrieval, response suggestion, and extractive summarization-- are vulnerable to semantic collisions. For example, given a target query, inserting a crafted collision into an irrelevant document can shift its retrieval rank from 1000 to top 3. We show how to generate semantic collisions that evade perplexity-based filtering and discuss other potential mitigations. Our code is available at this https URL.

1 citations

Posted Content
TL;DR: In this paper, the authors demonstrate that the size of encrypted writes to durable write-ahead logs can reveal sensitive information about the inputs and activities of MongoDB applications and design, implement, and evaluate BigFoot, a WAL modification that mitigates size leakage.
Abstract: Modern databases and data-warehousing systems separate query processing and durable storage. Storage systems have idiosyncratic bugs and security vulnerabilities, thus attacks that compromise only storage are a realistic threat. In this paper, we show that encryption alone is not sufficient to protect databases from compromised storage. Using MongoDB WiredTiger as a concrete example, we demonstrate that sizes of encrypted writes to a durable write-ahead log can reveal sensitive information about the inputs and activities of MongoDB applications. We then design, implement, and evaluate BigFoot, a WAL modification that mitigates size leakage.
19 Jul 2023
TL;DR: In this paper , an attacker generates an adversarial perturbation corresponding to the prompt and blends it into an image or audio recording, which steers the model to output the attacker-chosen text and/or make the subsequent dialog follow the attacker's instruction.
Abstract: We demonstrate how images and sounds can be used for indirect prompt and instruction injection in multi-modal LLMs. An attacker generates an adversarial perturbation corresponding to the prompt and blends it into an image or audio recording. When the user asks the (unmodified, benign) model about the perturbed image or audio, the perturbation steers the model to output the attacker-chosen text and/or make the subsequent dialog follow the attacker's instruction. We illustrate this attack with several proof-of-concept examples targeting LLaVa and PandaGPT.
Journal Article
TL;DR: This study demonstrates the difficulty of combining fairness with timeliness in the context of probabilistic contract signing by identifying a strategy for a misbehaving participant that brings the protocol to an unfair state with arbitrarily high probability.
Abstract: We present three case studies, investigating the use of probabilistic model checking to automatically analyse properties of probabilistic contract signing protocols. We use the probabilistic model checker PRISM to analyse three protocols: Rabin's probabilistic protocol for fair commitment exchange; the probabilistic contract signing protocol of Ben-Or, Goldreich, Micali, and Rivest; and a randomised protocol for signing contracts of Even, Goldreich, and Lempel. These case studies illustrate the general methodology for applying probabilistic model checking to formal verification of probabilistic security protocols. For the Ben-Or et al. protocol, we demonstrate the difficulty of combining fairness with timeliness. If, as required by timeliness, the judge responds to participants' messages immediately upon receiving them, then there exists a strategy for a misbehaving participant that brings the protocol to an unfair state with arbitrarily high probability, unless unusually strong assumptions are made about the quality of the communication channels between the judge and honest participants. We quantify the tradeoffs involved in the attack strategy, and discuss possible modifications of the protocol that ensure both fairness and timeliness. For the Even et al. protocol, we demonstrate that the responder enjoys a distinct advantage. With probability 1, the protocol reaches a state in which the responder possesses the initiator's commitment, but the initiator does not possess the responder's commitment. We then analyse several variants of the protocol, exploring the tradeoff between fairness and the number of messages that must be exchanged between participants.
09 Jul 2006
TL;DR: This workshop presentation will discuss ways in which the computational semantics lead to different reasoning methods and report progress to date in several directions from the computational difference between efficiently recognizing and efficiently producing a value.
Abstract: We have been developing a cryptographically sound formal logic for proving protocol security properties without explicitly reasoning about probability, asymptotic complexity, or the actions of a malicious attacker. The approach rests on a probabilistic, polynomial-time semantics for a protocol security logic that was originally developed using nondeterministic symbolic semantics. This workshop presentation will discuss ways in which the computational semantics lead to different reasoning methods and report our progress to date in several directions. One significant difference between the symbolic and computational settings results from the computational difference between efficiently recognizing and efficiently producing a value. Among the more recent developments are a compositional method for proving cryptographically sound properties of key exchange protocols, and some work on secrecy properties that illustrates the computational interpretation of inductive properties of protocol roles.

Cited by
More filters
Journal ArticleDOI

[...]

08 Dec 2001-BMJ
TL;DR: There is, I think, something ethereal about i —the square root of minus one, which seems an odd beast at that time—an intruder hovering on the edge of reality.
Abstract: There is, I think, something ethereal about i —the square root of minus one. I remember first hearing about it at school. It seemed an odd beast at that time—an intruder hovering on the edge of reality. Usually familiarity dulls this sense of the bizarre, but in the case of i it was the reverse: over the years the sense of its surreal nature intensified. It seemed that it was impossible to write mathematics that described the real world in …

33,785 citations

Journal ArticleDOI
TL;DR: Machine learning addresses many of the same research questions as the fields of statistics, data mining, and psychology, but with differences of emphasis.
Abstract: Machine Learning is the study of methods for programming computers to learn. Computers are applied to a wide range of tasks, and for most of these it is relatively easy for programmers to design and implement the necessary software. However, there are many tasks for which this is difficult or impossible. These can be divided into four general categories. First, there are problems for which there exist no human experts. For example, in modern automated manufacturing facilities, there is a need to predict machine failures before they occur by analyzing sensor readings. Because the machines are new, there are no human experts who can be interviewed by a programmer to provide the knowledge necessary to build a computer system. A machine learning system can study recorded data and subsequent machine failures and learn prediction rules. Second, there are problems where human experts exist, but where they are unable to explain their expertise. This is the case in many perceptual tasks, such as speech recognition, hand-writing recognition, and natural language understanding. Virtually all humans exhibit expert-level abilities on these tasks, but none of them can describe the detailed steps that they follow as they perform them. Fortunately, humans can provide machines with examples of the inputs and correct outputs for these tasks, so machine learning algorithms can learn to map the inputs to the outputs. Third, there are problems where phenomena are changing rapidly. In finance, for example, people would like to predict the future behavior of the stock market, of consumer purchases, or of exchange rates. These behaviors change frequently, so that even if a programmer could construct a good predictive computer program, it would need to be rewritten frequently. A learning program can relieve the programmer of this burden by constantly modifying and tuning a set of learned prediction rules. Fourth, there are applications that need to be customized for each computer user separately. Consider, for example, a program to filter unwanted electronic mail messages. Different users will need different filters. It is unreasonable to expect each user to program his or her own rules, and it is infeasible to provide every user with a software engineer to keep the rules up-to-date. A machine learning system can learn which mail messages the user rejects and maintain the filtering rules automatically. Machine learning addresses many of the same research questions as the fields of statistics, data mining, and psychology, but with differences of emphasis. Statistics focuses on understanding the phenomena that have generated the data, often with the goal of testing different hypotheses about those phenomena. Data mining seeks to find patterns in the data that are understandable by people. Psychological studies of human learning aspire to understand the mechanisms underlying the various learning behaviors exhibited by people (concept learning, skill acquisition, strategy change, etc.).

13,246 citations

Posted Content
H. Brendan McMahan1, Eider Moore1, Daniel Ramage1, Seth Hampson, Blaise Aguera y Arcas1 
TL;DR: This work presents a practical method for the federated learning of deep networks based on iterative model averaging, and conducts an extensive empirical evaluation, considering five different model architectures and four datasets.
Abstract: Modern mobile devices have access to a wealth of data suitable for learning models, which in turn can greatly improve the user experience on the device. For example, language models can improve speech recognition and text entry, and image models can automatically select good photos. However, this rich data is often privacy sensitive, large in quantity, or both, which may preclude logging to the data center and training there using conventional approaches. We advocate an alternative that leaves the training data distributed on the mobile devices, and learns a shared model by aggregating locally-computed updates. We term this decentralized approach Federated Learning. We present a practical method for the federated learning of deep networks based on iterative model averaging, and conduct an extensive empirical evaluation, considering five different model architectures and four datasets. These experiments demonstrate the approach is robust to the unbalanced and non-IID data distributions that are a defining characteristic of this setting. Communication costs are the principal constraint, and we show a reduction in required communication rounds by 10-100x as compared to synchronized stochastic gradient descent.

5,936 citations

Book
11 Aug 2014
TL;DR: The preponderance of this monograph is devoted to fundamental techniques for achieving differential privacy, and application of these techniques in creative combinations, using the query-release problem as an ongoing example.
Abstract: The problem of privacy-preserving data analysis has a long history spanning multiple disciplines. As electronic data about individuals becomes increasingly detailed, and as technology enables ever more powerful collection and curation of these data, the need increases for a robust, meaningful, and mathematically rigorous definition of privacy, together with a computationally rich class of algorithms that satisfy this definition. Differential Privacy is such a definition.After motivating and discussing the meaning of differential privacy, the preponderance of this monograph is devoted to fundamental techniques for achieving differential privacy, and application of these techniques in creative combinations, using the query-release problem as an ongoing example. A key point is that, by rethinking the computational goal, one can often obtain far better results than would be achieved by methodically replacing each step of a non-private computation with a differentially private implementation. Despite some astonishingly powerful computational results, there are still fundamental limitations — not just on what can be achieved with differential privacy but on what can be achieved with any method that protects against a complete breakdown in privacy. Virtually all the algorithms discussed herein maintain differential privacy against adversaries of arbitrary computational power. Certain algorithms are computationally intensive, others are efficient. Computational complexity for the adversary and the algorithm are both discussed.We then turn from fundamentals to applications other than queryrelease, discussing differentially private methods for mechanism design and machine learning. The vast majority of the literature on differentially private algorithms considers a single, static, database that is subject to many analyses. Differential privacy in other models, including distributed databases and computations on data streams is discussed.Finally, we note that this work is meant as a thorough introduction to the problems and techniques of differential privacy, but is not intended to be an exhaustive survey — there is by now a vast amount of work in differential privacy, and we can cover only a small portion of it.

5,190 citations

Book ChapterDOI
Cynthia Dwork1
25 Apr 2008
TL;DR: This survey recalls the definition of differential privacy and two basic techniques for achieving it, and shows some interesting applications of these techniques, presenting algorithms for three specific tasks and three general results on differentially private learning.
Abstract: Over the past five years a new approach to privacy-preserving data analysis has born fruit [13, 18, 7, 19, 5, 37, 35, 8, 32]. This approach differs from much (but not all!) of the related literature in the statistics, databases, theory, and cryptography communities, in that a formal and ad omnia privacy guarantee is defined, and the data analysis techniques presented are rigorously proved to satisfy the guarantee. The key privacy guarantee that has emerged is differential privacy. Roughly speaking, this ensures that (almost, and quantifiably) no risk is incurred by joining a statistical database. In this survey, we recall the definition of differential privacy and two basic techniques for achieving it. We then show some interesting applications of these techniques, presenting algorithms for three specific tasks and three general results on differentially private learning.

3,314 citations