
Showing papers in "arXiv: Cryptography and Security in 2020"


Posted Content
TL;DR: This paper demonstrates that when large language models trained on private datasets are published, an adversary can perform a training data extraction attack to recover individual training examples by querying the language model, and finds that larger models are more vulnerable than smaller models.
Abstract: It has become common to publish large (billion parameter) language models that have been trained on private datasets. This paper demonstrates that in such settings, an adversary can perform a training data extraction attack to recover individual training examples by querying the language model. We demonstrate our attack on GPT-2, a language model trained on scrapes of the public Internet, and are able to extract hundreds of verbatim text sequences from the model's training data. These extracted examples include (public) personally identifiable information (names, phone numbers, and email addresses), IRC conversations, code, and 128-bit UUIDs. Our attack is possible even though each of the above sequences is included in just one document in the training data. We comprehensively evaluate our extraction attack to understand the factors that contribute to its success. Worryingly, we find that larger models are more vulnerable than smaller models. We conclude by drawing lessons and discussing possible safeguards for training large language models.

496 citations


Posted Content
TL;DR: This document describes the aggregation and anonymization process applied to the initial version of Google COVID-19 Community Mobility Reports (published at this http URL on April 2, 2020), a publicly available resource intended to help public health authorities understand what has changed in response to work-from-home, shelter-in-place, and other recommended policies aimed at flattening the curve of the COVID-19 pandemic.
Abstract: This document describes the aggregation and anonymization process applied to the initial version of Google COVID-19 Community Mobility Reports (published at this http URL on April 2, 2020), a publicly available resource intended to help public health authorities understand what has changed in response to work-from-home, shelter-in-place, and other recommended policies aimed at flattening the curve of the COVID-19 pandemic. Our anonymization process is designed to ensure that no personal data, including an individual's location, movement, or contacts, can be derived from the resulting metrics. The high-level description of the procedure is as follows: we first generate a set of anonymized metrics from the data of Google users who opted in to Location History. Then, we compute percentage changes of these metrics from a baseline based on the historical part of the anonymized metrics. We then discard a subset which does not meet our bar for statistical reliability, and release the rest publicly in a format that compares the result to the private baseline.
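The last two steps of the procedure described above (compare a metric against a historical baseline, then drop values that are not statistically reliable) can be illustrated with a minimal sketch. The function name, the `min_baseline` threshold, and the sample numbers are hypothetical: the abstract does not specify the actual reliability criterion, and this sketch does not implement any anonymization.

```python
from typing import Optional


def percent_change(current: float, baseline: float,
                   min_baseline: float = 100.0) -> Optional[float]:
    """Return the percentage change of `current` relative to `baseline`.

    Returns None when the baseline is below `min_baseline`, a hypothetical
    stand-in for the report's (unspecified) statistical reliability bar.
    """
    if baseline < min_baseline:
        return None  # too little data to report a reliable change
    return 100.0 * (current - baseline) / baseline


# Illustrative, made-up metric values: (current, baseline) per category.
metrics = {
    "retail": (600.0, 1000.0),
    "parks": (30.0, 40.0),  # baseline below threshold, so it is discarded
}
report = {name: percent_change(cur, base) for name, (cur, base) in metrics.items()}
released = {name: change for name, change in report.items() if change is not None}
```

Only the relative change is released in this sketch, never the baseline itself, which mirrors the abstract's statement that results are compared to a private baseline.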

363 citations


Posted Content
TL;DR: This document is a response to some of the privacy characteristics of direct contact tracing apps like TraceTogether and an early-stage Request for Comments to the community to encourage community efforts to develop alternative effective solutions with stronger privacy protection for the users.
Abstract: Contact tracing is an essential tool for public health officials and local communities to fight the spread of novel diseases, such as for the COVID-19 pandemic. The Singaporean government just released a mobile phone app, TraceTogether, that is designed to assist health officials in tracking down exposures after an infected individual is identified. However, there are important privacy implications of the existence of such tracking apps. Here, we analyze some of those implications and discuss ways of ameliorating the privacy concerns without decreasing usefulness to public health. We hope in writing this document to ensure that privacy is a central feature of conversations surrounding mobile contact tracing apps and to encourage community efforts to develop alternative effective solutions with stronger privacy protection for the users. Importantly, though we discuss potential modifications, this document is not meant as a formal research paper, but instead is a response to some of the privacy characteristics of direct contact tracing apps like TraceTogether and an early-stage Request for Comments to the community. Date written: 2020-03-24 Minor correction: 2020-03-30

344 citations


Posted Content
TL;DR: This paper summarizes and categorizes existing backdoor attacks and defenses based on their characteristics, and provides a unified framework for analyzing poisoning-based backdoor attacks.
Abstract: Backdoor attacks aim to embed hidden backdoors into deep neural networks (DNNs), such that the attacked model performs well on benign samples, whereas its prediction will be maliciously changed if the hidden backdoor is activated by the attacker-defined trigger. This threat could happen when the training process is not fully controlled, such as training on third-party datasets or adopting third-party models, which poses a new and realistic threat. Although backdoor learning is an emerging and rapidly growing research area, a systematic review of it is still missing. In this paper, we present the first comprehensive survey of this realm. We summarize and categorize existing backdoor attacks and defenses based on their characteristics, and provide a unified framework for analyzing poisoning-based backdoor attacks. We also analyze the relation between backdoor attacks and relevant fields (i.e., adversarial attacks and data poisoning), and summarize widely adopted benchmark datasets. Finally, we briefly outline certain future research directions relying upon reviewed works. A curated list of backdoor-related resources is also available at this https URL.

260 citations


Posted Content
TL;DR: This paper provides a concise introduction to the concept of FL, and a unique taxonomy covering threat models and two major attacks on FL: 1) poisoning attacks and 2) inference attacks, and provides an accessible review of this important topic.
Abstract: With the emergence of data silos and popular privacy awareness, the traditional centralized approach of training artificial intelligence (AI) models is facing strong challenges. Federated learning (FL) has recently emerged as a promising solution under this new reality. Existing FL protocol design has been shown to exhibit vulnerabilities which can be exploited by adversaries both within and without the system to compromise data privacy. It is thus of paramount importance to make FL system designers aware of the implications of future FL algorithm design on privacy-preservation. Currently, there is no survey on this topic. In this paper, we bridge this important gap in FL literature. By providing a concise introduction to the concept of FL, and a unique taxonomy covering threat models and two major attacks on FL: 1) poisoning attacks and 2) inference attacks, this paper provides an accessible review of this important topic. We highlight the intuitions, key techniques as well as fundamental assumptions adopted by various attacks, and discuss promising future research directions towards more robust privacy preservation in FL.

227 citations


Posted Content
TL;DR: The label-only membership inference attacks introduced in this paper evaluate the robustness of a model's predicted labels under perturbations to obtain a fine-grained membership signal, and are shown empirically to perform on par with prior attacks that required access to model confidences.
Abstract: Membership inference attacks are one of the simplest forms of privacy leakage for machine learning models: given a data point and model, determine whether the point was used to train the model. Existing membership inference attacks exploit models' abnormal confidence when queried on their training data. These attacks do not apply if the adversary only gets access to models' predicted labels, without a confidence measure. In this paper, we introduce label-only membership inference attacks. Instead of relying on confidence scores, our attacks evaluate the robustness of a model's predicted labels under perturbations to obtain a fine-grained membership signal. These perturbations include common data augmentations or adversarial examples. We empirically show that our label-only membership inference attacks perform on par with prior attacks that required access to model confidences. We further demonstrate that label-only attacks break multiple defenses against membership inference attacks that (implicitly or explicitly) rely on a phenomenon we call confidence masking. These defenses modify a model's confidence scores in order to thwart attacks, but leave the model's predicted labels unchanged. Our label-only attacks demonstrate that confidence-masking is not a viable defense strategy against membership inference. Finally, we investigate worst-case label-only attacks that infer membership for a small number of outlier data points. We show that label-only attacks also match confidence-based attacks in this setting. We find that training models with differential privacy and (strong) L2 regularization are the only known defense strategies that successfully prevent all attacks. This remains true even when the differential privacy budget is too high to offer meaningful provable guarantees.
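The core idea of a label-only attack (score how stable the predicted label is under perturbation, then threshold that score) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: Gaussian noise stands in for the data augmentations and adversarial examples the abstract mentions, and `toy_model`, `sigma`, `n_trials`, and `threshold` are hypothetical choices.

```python
import numpy as np


def robustness_score(predict_label, x, label, n_trials=200, sigma=0.5, seed=0):
    """Fraction of noisy copies of `x` that the model still labels `label`.

    `predict_label` maps a batch of inputs (2-D array) to predicted labels;
    no confidence scores are used.
    """
    rng = np.random.default_rng(seed)
    noisy = x + sigma * rng.normal(size=(n_trials, x.shape[0]))
    return float(np.mean(predict_label(noisy) == label))


def infer_membership(predict_label, x, label, threshold=0.9, **kwargs):
    """Guess 'member' when the predicted label is robust to perturbation."""
    return robustness_score(predict_label, x, label, **kwargs) >= threshold


# Toy stand-in model: a linear classifier on 2-D inputs (hypothetical).
def toy_model(batch):
    return (batch[:, 0] + batch[:, 1] > 0).astype(int)


far_point = np.array([5.0, 5.0])    # far from the decision boundary
near_point = np.array([0.05, 0.0])  # close to the decision boundary
```

With these toy inputs, `far_point` keeps its label under noise and is flagged as a likely member, while `near_point` flips often and is not. In practice the threshold would have to be calibrated, for example on data whose membership status is known.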

192 citations


Posted Content
TL;DR: A comprehensive survey on blockchain for big data, focusing on up-to-date approaches, opportunities, and future directions is provided, including blockchain for secure big data acquisition, data storage, data analytics, and data privacy preservation.
Abstract: Big data has generated strong interest in various scientific and engineering domains over the last few years. Despite many advantages and applications, there are many challenges in big data to be tackled for better quality of service, e.g., big data analytics, big data management, and big data privacy and security. Blockchain, with its decentralized and secure nature, has great potential to improve big data services and applications. In this article, we provide a comprehensive survey on blockchain for big data, focusing on up-to-date approaches, opportunities, and future directions. First, we present a brief overview of blockchain and big data as well as the motivation behind their integration. Next, we survey various blockchain services for big data, including blockchain for secure big data acquisition, data storage, data analytics, and data privacy preservation. Then, we review the state-of-the-art studies on the use of blockchain for big data applications in different vertical domains such as smart city, smart healthcare, smart transportation, and smart grid. For a better understanding, some representative blockchain-big data projects are also presented and analyzed. Finally, challenges and future directions are discussed to further drive research in this promising area.

173 citations


Posted Content
TL;DR: This system, referred to as DP3T, provides a technological foundation to help slow the spread of SARS-CoV-2 by simplifying and accelerating the process of notifying people who might have been exposed to the virus so that they can take appropriate measures to break its transmission chain.
Abstract: This document describes and analyzes a system for secure and privacy-preserving proximity tracing at large scale. This system, referred to as DP3T, provides a technological foundation to help slow the spread of SARS-CoV-2 by simplifying and accelerating the process of notifying people who might have been exposed to the virus so that they can take appropriate measures to break its transmission chain. The system aims to minimise privacy and security risks for individuals and communities and guarantee the highest level of data protection. The goal of our proximity tracing system is to determine who has been in close physical proximity to a COVID-19 positive person and thus exposed to the virus, without revealing the contact's identity or where the contact occurred. To achieve this goal, users run a smartphone app that continually broadcasts an ephemeral, pseudo-random ID representing the user's phone and also records the pseudo-random IDs observed from smartphones in close proximity. When a patient is diagnosed with COVID-19, she can upload pseudo-random IDs previously broadcast from her phone to a central server. Prior to the upload, all data remains exclusively on the user's phone. Other users' apps can use data from the server to locally estimate whether the device's owner was exposed to the virus through close-range physical proximity to a COVID-19 positive person who has uploaded their data. In case the app detects a high risk, it will inform the user.
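The data flow described above (broadcast ephemeral IDs, record IDs heard nearby, upload one's own IDs after a diagnosis, match locally) can be sketched in a few lines. This is a simplified illustration and not the DP3T protocol itself: the `Phone` class, the 16-byte ID size, and the use of plain random bytes are assumptions, and the abstract does not say how the real system derives or rotates its IDs.

```python
import secrets


class Phone:
    """Toy model of one user's app state; all data stays on the phone."""

    def __init__(self):
        self.broadcast_ids = []    # ephemeral IDs this phone has sent out
        self.observed_ids = set()  # ephemeral IDs heard from nearby phones

    def next_ephemeral_id(self) -> bytes:
        """Create and remember a fresh random 16-byte ID to broadcast."""
        eph_id = secrets.token_bytes(16)
        self.broadcast_ids.append(eph_id)
        return eph_id

    def observe(self, eph_id: bytes) -> None:
        """Record an ID received from a phone in close proximity."""
        self.observed_ids.add(eph_id)

    def is_exposed(self, server_ids) -> bool:
        """Locally check whether any uploaded ID was observed nearby."""
        return not self.observed_ids.isdisjoint(server_ids)


alice, bob, carol = Phone(), Phone(), Phone()
bob.observe(alice.next_ephemeral_id())   # Bob was near Alice
alice.next_ephemeral_id()                # a broadcast nobody heard
carol.observe(bob.next_ephemeral_id())   # Carol was near Bob only

server = set()                           # stand-in for the central server
server.update(alice.broadcast_ids)       # Alice is diagnosed and uploads
```

Here `bob.is_exposed(server)` is true and `carol.is_exposed(server)` is false. The matching runs on each phone, so the server in this sketch never learns who observed whom.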

172 citations


Posted Content
TL;DR: This paper proposes the first class of dynamic backdooring techniques (Random Backdoor, BaN, and c-BaN), whose triggers can have random patterns and locations; BaN and c-BaN are the first two schemes that algorithmically generate triggers, relying on a novel generative network, and the techniques can bypass current state-of-the-art defense mechanisms against backdoor attacks, including Neural Cleanse, ABS, and STRIP.
Abstract: Machine learning (ML) has made tremendous progress during the past decade and is being adopted in various critical real-world applications. However, recent research has shown that ML models are vulnerable to multiple security and privacy attacks. In particular, backdoor attacks against ML models have recently raised a lot of awareness. A successful backdoor attack can cause severe consequences, such as allowing an adversary to bypass critical authentication systems. Current backdooring techniques rely on adding static triggers (with fixed patterns and locations) on ML model inputs. In this paper, we propose the first class of dynamic backdooring techniques: Random Backdoor, Backdoor Generating Network (BaN), and conditional Backdoor Generating Network (c-BaN). Triggers generated by our techniques can have random patterns and locations, which reduce the efficacy of the current backdoor detection mechanisms. In particular, BaN and c-BaN are the first two schemes that algorithmically generate triggers, which rely on a novel generative network. Moreover, c-BaN is the first conditional backdooring technique which, given a target label, can generate a target-specific trigger. Both BaN and c-BaN are essentially a general framework which gives the adversary the flexibility to further customize backdoor attacks. We extensively evaluate our techniques on three benchmark datasets: MNIST, CelebA, and CIFAR-10. Our techniques achieve almost perfect attack performance on backdoored data with a negligible utility loss. We further show that our techniques can bypass current state-of-the-art defense mechanisms against backdoor attacks, including Neural Cleanse, ABS, and STRIP.

147 citations


Posted Content
TL;DR: A novel backdoor attack technique in which the triggers vary from input to input, implemented with an input-aware trigger generator driven by diversity loss and a cross-trigger test that enforces trigger nonreusability, making backdoor verification impossible.
Abstract: In recent years, neural backdoor attack has been considered to be a potential security threat to deep learning systems. Such systems, while achieving the state-of-the-art performance on clean data, perform abnormally on inputs with predefined triggers. Current backdoor techniques, however, rely on uniform trigger patterns, which are easily detected and mitigated by current defense methods. In this work, we propose a novel backdoor attack technique in which the triggers vary from input to input. To achieve this goal, we implement an input-aware trigger generator driven by diversity loss. A novel cross-trigger test is applied to enforce trigger nonreusability, making backdoor verification impossible. Experiments show that our method is efficient in various attack scenarios as well as multiple datasets. We further demonstrate that our backdoor can bypass state-of-the-art defense methods. An analysis with a famous neural network inspector again proves the stealthiness of the proposed attack. Our code is publicly available at this https URL.

139 citations


Posted Content
TL;DR: This work advocates for a third-party free approach to assisted mobile contact tracing, because such an approach mitigates the security and privacy risks of requiring a trusted third party.
Abstract: The global health threat from COVID-19 has been controlled in a number of instances by large-scale testing and contact tracing efforts. We created this document to suggest three functionalities on how we might best harness computing technologies to support the goals of public health organizations in minimizing morbidity and mortality associated with the spread of COVID-19, while protecting the civil liberties of individuals. In particular, this work advocates for a third-party free approach to assisted mobile contact tracing, because such an approach mitigates the security and privacy risks of requiring a trusted third party. We also explicitly consider the inferential risks involved in any contact tracing system, where any alert to a user could itself give rise to de-anonymizing information. More generally, we hope to participate in bringing together colleagues in industry, academia, and civil society to discuss and converge on ideas around a critical issue rising with attempts to mitigate the COVID-19 pandemic.

Journal Article
TL;DR: A blockchain-based secure FL framework to create smart contracts and prevent malicious or unreliable participants from being involved in FL is proposed, which can effectively deter poisoning and membership inference attacks, thereby improving the security of FL in 5G networks.
Abstract: Federated Learning (FL) has been recently proposed as an emerging paradigm to build machine learning models using distributed training datasets that are locally stored and maintained on different devices in 5G networks while providing privacy preservation for participants. In FL, the central aggregator accumulates local updates uploaded by participants to update a global model. However, there are two critical security threats: poisoning and membership inference attacks. These attacks may be carried out by malicious or unreliable participants, resulting in the construction failure of global models or privacy leakage of FL models. Therefore, it is crucial to develop defense mechanisms for FL. In this article, we propose a blockchain-based secure FL framework to create smart contracts and prevent malicious or unreliable participants from being involved in FL. In doing so, the central aggregator recognizes malicious and unreliable participants by automatically executing smart contracts to defend against poisoning attacks. Further, we use local differential privacy techniques to prevent membership inference attacks. Numerical results suggest that the proposed framework can effectively deter poisoning and membership inference attacks, thereby improving the security of FL in 5G networks.

Posted Content
TL;DR: This paper conducts the first comprehensive survey on privacy and robustness in federated learning (FL), providing a concise introduction to the concept of FL and a unique taxonomy covering: 1) threat models; 2) poisoning attacks and defenses against robustness; 3) inference attacks and defenses against privacy.
Abstract: As data are increasingly being stored in different silos and societies become more aware of data privacy issues, the traditional centralized training of artificial intelligence (AI) models is facing efficiency and privacy challenges. Recently, federated learning (FL) has emerged as an alternative solution and continues to thrive in this new reality. Existing FL protocol design has been shown to be vulnerable to adversaries within or outside of the system, compromising data privacy and system robustness. Besides training powerful global models, it is of paramount importance to design FL systems that have privacy guarantees and are resistant to different types of adversaries. In this paper, we conduct the first comprehensive survey on this topic. Through a concise introduction to the concept of FL, and a unique taxonomy covering: 1) threat models; 2) poisoning attacks and defenses against robustness; 3) inference attacks and defenses against privacy, we provide an accessible review of this important topic. We highlight the intuitions, key techniques as well as fundamental assumptions adopted by various attacks and defenses. Finally, we discuss promising future research directions towards robust and privacy-preserving federated learning.

Posted Content
TL;DR: New classes of backdoors strictly more powerful than those in prior literature are demonstrated: single-pixel and physical backdoors in ImageNet models, backdoors that switch the model to a covert, privacy-violating task, and backdoors that do not require inference-time input modifications.
Abstract: We investigate a new method for injecting backdoors into machine learning models, based on compromising the loss-value computation in the model-training code. We use it to demonstrate new classes of backdoors strictly more powerful than those in the prior literature: single-pixel and physical backdoors in ImageNet models, backdoors that switch the model to a covert, privacy-violating task, and backdoors that do not require inference-time input modifications. Our attack is blind: the attacker cannot modify the training data, nor observe the execution of his code, nor access the resulting model. The attack code creates poisoned training inputs "on the fly," as the model is training, and uses multi-objective optimization to achieve high accuracy on both the main and backdoor tasks. We show how a blind attack can evade any known defense and propose new ones.

Posted Content
TL;DR: This paper outlines the technological approaches to mobile-phone based contact-tracing to date, describes advanced security enhancing approaches that can mitigate their risks, and discusses the trade-offs one must make when developing and deploying any mass contact-tracing technology.
Abstract: Containment, the key strategy in quickly halting an epidemic, requires rapid identification and quarantine of the infected individuals, determination of whom they have had close contact with in the previous days and weeks, and decontamination of locations the infected individual has visited. Achieving containment demands accurate and timely collection of the infected individual's location and contact history. Traditionally, this process is labor intensive, susceptible to memory errors, and fraught with privacy concerns. With the recent almost ubiquitous availability of smart phones, many people carry a tool which can be utilized to quickly identify an infected individual's contacts during an epidemic, such as the current 2019 novel Coronavirus crisis. Unfortunately, the very same first-generation contact tracing tools have been used to expand mass surveillance, limit individual freedoms and expose the most private details about individuals. We seek to outline the different technological approaches to mobile-phone based contact-tracing to date and elaborate on the opportunities and the risks that these technologies pose to individuals and societies. We describe advanced security enhancing approaches that can mitigate these risks and describe trade-offs one must make when developing and deploying any mass contact-tracing technology. With this paper, our aim is to continue to grow the conversation regarding contact-tracing for epidemic and pandemic containment and discuss opportunities to advance this space. We invite feedback and discussion.

Posted Content
TL;DR: VerifyMed is the first proof-of-concept platform, built on Ethereum, for transparently validating the authorization and competence of medical professionals using blockchain technology and enables a healthcare professional to build a portfolio of real-life work experience and further validates the competence by storing outcome metrics reported by the patients.
Abstract: Patients living in a digitized world can now interact with medical professionals through online services such as chat applications, video conferencing or indirectly through consulting services. These applications need to tackle several fundamental trust issues: 1. Checking and confirming that the person they are interacting with is a real person; 2. Validating that the healthcare professional has competence within the field in question; and 3. Confirming that the healthcare professional has a valid license to practice. In this paper, we present VerifyMed -- the first proof-of-concept platform, built on Ethereum, for transparently validating the authorization and competence of medical professionals using blockchain technology. Our platform models trust relationships within the healthcare industry to validate professional clinical authorization. Furthermore, it enables a healthcare professional to build a portfolio of real-life work experience and further validates the competence by storing outcome metrics reported by the patients. The extensive realistic simulations show that with our platform, an average cost for creating a smart contract for a treatment and getting it approved is around 1 USD, and the cost for evaluating a treatment is around 50 cents.

Posted Content
TL;DR: This paper is the first to explore the implication of flash loans for the nascent decentralized finance (DeFi) ecosystem and shows how two previously executed attacks can be "boosted" to result in profits of 829.5k USD and 1.1M USD, a boost of 2.37x and 1.73x, respectively.
Abstract: Credit allows a lender to loan out surplus capital to a borrower. In the traditional economy, credit bears the risk that the borrower may default on its debt; the lender hence requires upfront collateral from the borrower, plus interest fee payments. Due to the atomicity of blockchain transactions, lenders can offer flash loans, i.e. loans that are only valid within one transaction and must be repaid by the end of that transaction. This concept has led to a number of interesting attack possibilities, some of which have been exploited recently (February 2020). This paper is the first to explore the implication of flash loans for the nascent decentralized finance (DeFi) ecosystem. We analyze two existing attack vectors with significant ROIs (beyond 500k%), and then go on to formulate finding flash loan-based attack parameters as an optimization problem over the state of the underlying Ethereum blockchain as well as the state of the DeFi ecosystem. Specifically, we show how two previously executed attacks can be "boosted" to result in a profit of 829.5k USD and 1.1M USD, respectively, which is a boost of 2.37x and 1.73x, respectively.
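As a back-of-the-envelope reading of the figures quoted above, dividing each boosted profit by its boost factor gives the profit implied for the originally executed attack. These implied values are derived here from the rounded numbers in the abstract and are not reported in it; the labels `attack_1` and `attack_2` are placeholders.

```python
# Boosted profits (USD) and boost factors as quoted in the abstract.
boosted = {"attack_1": (829_500, 2.37), "attack_2": (1_100_000, 1.73)}

# Implied original profit = boosted profit / boost factor (approximate,
# because the quoted profits and factors are rounded).
implied_original = {
    name: profit / factor for name, (profit, factor) in boosted.items()
}
```

Under this reading, the originally executed attacks would have yielded roughly 350k USD and 636k USD.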

Posted Content
TL;DR: This paper presents the first systematic investigation of the backdoor attack against models designed for natural language processing (NLP) tasks, and proposes three methods to construct triggers in the NLP setting, including Char-level, Word-level, and Sentence-level triggers.
Abstract: Machine learning (ML) has progressed rapidly during the past decade and ML models have been deployed in various real-world applications. Meanwhile, machine learning models have been shown to be vulnerable to various security and privacy attacks. One attack that has attracted a great deal of attention recently is the backdoor attack. Specifically, the adversary poisons the target model training set, to mislead any input with an added secret trigger to a target class, while keeping the accuracy for original inputs unchanged. Previous backdoor attacks mainly focus on computer vision tasks. In this paper, we present the first systematic investigation of the backdoor attack against models designed for natural language processing (NLP) tasks. Specifically, we propose three methods to construct triggers in the NLP setting, including Char-level, Word-level, and Sentence-level triggers. Our attacks achieve an almost perfect success rate without jeopardizing the original model utility. For instance, using the word-level triggers, our backdoor attack achieves 100% backdoor accuracy with only a drop of 0.18%, 1.26%, and 0.19% in the model's utility, for the IMDB, Amazon, and Stanford Sentiment Treebank datasets, respectively.

Journal Article
TL;DR: This paper presents a practical, lightweight deep learning DDoS detection system called LUCID, which exploits the properties of Convolutional Neural Networks (CNNs) to classify traffic flows as either malicious or benign, with a 40x reduction in processing time compared to the state-of-the-art.
Abstract: Distributed Denial of Service (DDoS) attacks are one of the most harmful threats in today's Internet, disrupting the availability of essential services. The challenge of DDoS detection is the combination of attack approaches coupled with the volume of live traffic to be analysed. In this paper, we present a practical, lightweight deep learning DDoS detection system called LUCID, which exploits the properties of Convolutional Neural Networks (CNNs) to classify traffic flows as either malicious or benign. We make four main contributions: (1) an innovative application of a CNN to detect DDoS traffic with low processing overhead, (2) a dataset-agnostic preprocessing mechanism to produce traffic observations for online attack detection, (3) an activation analysis to explain LUCID's DDoS classification, and (4) an empirical validation of the solution on a resource-constrained hardware platform. Using the latest datasets, LUCID matches existing state-of-the-art detection accuracy whilst presenting a 40x reduction in processing time, as compared to the state-of-the-art. With our evaluation results, we prove that the proposed approach is suitable for effective DDoS detection in resource-constrained operational environments.

Posted Content
TL;DR: This article proposes to integrate federated learning and local differential privacy (LDP) so that crowdsourcing applications can obtain a machine learning model, and proposes four LDP mechanisms to perturb gradients generated by vehicles.
Abstract: Internet of Vehicles (IoV) is a promising branch of the Internet of Things. IoV simulates a large variety of crowdsourcing applications such as Waze, Uber, and Amazon Mechanical Turk. Users of these applications report the real-time traffic information to the cloud server which trains a machine learning model based on traffic information reported by users for intelligent traffic management. However, crowdsourcing application owners can easily infer users' location information, which raises severe location privacy concerns of the users. In addition, as the number of vehicles increases, the frequent communication between vehicles and the cloud server incurs an unexpected amount of communication cost. To avoid the privacy threat and reduce the communication cost, in this paper, we propose to integrate federated learning and local differential privacy (LDP) so that crowdsourcing applications can obtain the machine learning model. Specifically, we propose four LDP mechanisms to perturb gradients generated by vehicles. The Three-Outputs mechanism is proposed which introduces three different output possibilities to deliver a high accuracy when the privacy budget is small. The output possibilities of Three-Outputs can be encoded with two bits to reduce the communication cost. Besides, to maximize the performance when the privacy budget is large, an optimal piecewise mechanism (PM-OPT) is proposed. We further propose a suboptimal mechanism (PM-SUB) with a simple formula and comparable utility to PM-OPT. Then, we build a novel hybrid mechanism by combining Three-Outputs and PM-SUB.

Posted Content
TL;DR: This paper proposes to benchmark membership inference privacy risks by improving existing non-neural network based inference attacks and proposing a new inference attack method based on a modification of prediction entropy, and introduces a new approach for fine-grained privacy analysis by formulating and deriving a new metric called the privacy risk score.
Abstract: Machine learning models are prone to memorizing sensitive data, making them vulnerable to membership inference attacks in which an adversary aims to guess if an input sample was used to train the model. In this paper, we show that prior work on membership inference attacks may severely underestimate the privacy risks by relying solely on training custom neural network classifiers to perform attacks and focusing only on the aggregate results over data samples, such as the attack accuracy. To overcome these limitations, we first propose to benchmark membership inference privacy risks by improving existing non-neural network based inference attacks and proposing a new inference attack method based on a modification of prediction entropy. We also propose benchmarks for defense mechanisms by accounting for adaptive adversaries with knowledge of the defense and also accounting for the trade-off between model accuracy and privacy risks. Using our benchmark attacks, we demonstrate that existing defense approaches are not as effective as previously reported. Next, we introduce a new approach for fine-grained privacy analysis by formulating and deriving a new metric called the privacy risk score. Our privacy risk score metric measures an individual sample's likelihood of being a training member, which allows an adversary to identify samples with high privacy risks and perform attacks with high confidence. We experimentally validate the effectiveness of the privacy risk score and demonstrate that the distribution of privacy risk score across individual samples is heterogeneous. Finally, we perform an in-depth investigation for understanding why certain samples have high privacy risks, including correlations with model sensitivity, generalization error, and feature embeddings. Our work emphasizes the importance of a systematic and rigorous evaluation of privacy risks of machine learning models.
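The abstract does not spell out the modified prediction entropy, so the sketch below assumes one plausible form: a score that is zero when the model puts all probability on the true label and grows large when it confidently predicts a wrong label. The function names and the threshold rule are illustrative assumptions, not the paper's exact procedure.

```python
import math


def modified_entropy(probs, label, tiny=1e-30):
    """Label-aware entropy score for one prediction vector.

    Assumed form: -(1 - p_y) * log(p_y) - sum_{i != y} p_i * log(1 - p_i).
    Low scores suggest the sample behaves like a training member.
    """
    def safe_log(value):
        # Clamp to avoid log(0) on fully confident predictions.
        return math.log(max(value, tiny))

    score = -(1.0 - probs[label]) * safe_log(probs[label])
    for index, p in enumerate(probs):
        if index != label:
            score -= p * safe_log(1.0 - p)
    return score


def infer_membership(probs, label, threshold):
    """Guess 'member' when the score falls at or below a chosen threshold."""
    return modified_entropy(probs, label) <= threshold
```

In practice the threshold would be calibrated on data whose membership status is known to the attacker, for example through shadow models; that calibration step is omitted here.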

Posted Content
TL;DR: A longitudinal study of 30 papers from top-tier security conferences within the past 10 years confirms common pitfalls in the design, implementation, and evaluation of learning-based security systems, and derives a list of actionable recommendations to support researchers and the community in avoiding pitfalls.
Abstract: With the growing processing power of computing systems and the increasing availability of massive datasets, machine learning algorithms have led to major breakthroughs in many different areas. This development has influenced computer security, spawning a series of work on learning-based security systems, such as for malware detection, vulnerability discovery, and binary code analysis. Despite great potential, machine learning in security is prone to subtle pitfalls that undermine its performance and render learning-based systems potentially unsuitable for security tasks and practical deployment. In this paper, we look at this problem with critical eyes. First, we identify common pitfalls in the design, implementation, and evaluation of learning-based security systems. We conduct a longitudinal study of 30 papers from top-tier security conferences within the past 10 years, confirming that these pitfalls are widespread in the current security literature. In an empirical analysis, we further demonstrate how individual pitfalls can lead to unrealistic performance and interpretations, obstructing the understanding of the security problem at hand. As a remedy, we derive a list of actionable recommendations to support researchers and our community in avoiding pitfalls, promoting a sound design, development, evaluation, and deployment of learning-based systems for computer security.

Posted Content
TL;DR: Entangled Watermarking Embeddings (EWE) is introduced, which encourages the model to learn common features for classifying both data sampled from the task distribution and data that encodes watermarks, forcing an adversary who attempts to remove watermarks entangled with legitimate data to sacrifice performance on legitimate data.
Abstract: Machine learning involves expensive data collection and training procedures. Model owners may be concerned that valuable intellectual property can be leaked if adversaries mount model extraction attacks. As it is difficult to defend against model extraction without sacrificing significant prediction accuracy, watermarking instead leverages unused model capacity to have the model overfit to outlier input-output pairs. Such pairs are watermarks, which are not sampled from the task distribution and are only known to the defender. The defender then demonstrates knowledge of the input-output pairs to claim ownership of the model at inference. The effectiveness of watermarks remains limited because they are distinct from the task distribution and can thus be easily removed through compression or other forms of knowledge transfer. We introduce Entangled Watermarking Embeddings (EWE). Our approach encourages the model to learn features for classifying data that is sampled from the task distribution and data that encodes watermarks. An adversary attempting to remove watermarks that are entangled with legitimate data is also forced to sacrifice performance on legitimate data. Experiments on MNIST, Fashion-MNIST, CIFAR-10, and Speech Commands validate that the defender can claim model ownership with 95% confidence with less than 100 queries to the stolen copy, at a modest cost below 0.81 percentage points on average in the defended model's performance.

Proceedings ArticleDOI
TL;DR: In this article, the authors present CrypTFlow2, a cryptographic framework for secure inference over realistic deep neural networks (DNNs) using secure 2-party computation.
Abstract: We present CrypTFlow2, a cryptographic framework for secure inference over realistic Deep Neural Networks (DNNs) using secure 2-party computation. CrypTFlow2 protocols are both correct -- i.e., their outputs are bitwise equivalent to the cleartext execution -- and efficient -- they outperform the state-of-the-art protocols in both latency and scale. At the core of CrypTFlow2, we have new 2PC protocols for secure comparison and division, designed carefully to balance round and communication complexity for secure inference tasks. Using CrypTFlow2, we present the first secure inference over ImageNet-scale DNNs like ResNet50 and DenseNet121. These DNNs are at least an order of magnitude larger than those considered in the prior work of 2-party DNN inference. Even on the benchmarks considered by prior work, CrypTFlow2 requires an order of magnitude less communication and 20x-30x less time than the state-of-the-art.

Posted Content
TL;DR: Epione is introduced, a lightweight system for contact tracing with strong privacy protections and a new cryptographic tool for secure two-party private set intersection cardinality (PSI-CA), which allows two parties, each holding a set of items, to learn the intersection size of two private sets without revealing intersection items.
Abstract: Contact tracing is an essential tool in containing infectious diseases such as COVID-19. Many countries and research groups have launched or announced mobile apps to facilitate contact tracing by recording contacts between users with some privacy considerations. Most of the focus has been on using random tokens, which are exchanged during encounters and stored locally on users' phones. Prior systems allow users to search over released tokens in order to learn if they have recently been in the proximity of a user that has since been diagnosed with the disease. However, prior approaches do not provide end-to-end privacy in the collection and querying of tokens. In particular, these approaches are vulnerable to either linkage attacks by users using token metadata, linkage attacks by the server, or false reporting by users. In this work, we introduce Epione, a lightweight system for contact tracing with strong privacy protections. Epione alerts users directly if any of their contacts have been diagnosed with the disease, while protecting the privacy of users' contacts from both central services and other users, and provides protection against false reporting. As a key building block, we present a new cryptographic tool for secure two-party private set intersection cardinality (PSI-CA), which allows two parties, each holding a set of items, to learn the intersection size of two private sets without revealing intersection items. We specifically tailor it to the case of large-scale contact tracing where clients have small input sets and the server's database of tokens is much larger.
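To illustrate what a private set intersection cardinality (PSI-CA) protocol computes, the sketch below uses the classic Diffie-Hellman-style double-blinding idea. This is an assumption chosen for brevity, not Epione's construction, and both parties are simulated in one process. The modulus is toy-sized and the code is not secure for real use; all names are hypothetical.

```python
import hashlib
import math
import secrets

P = 2**127 - 1  # Mersenne prime used as a toy modulus (far too small for real use)


def hash_to_group(item: str) -> int:
    """Map a token to an element of [1, P - 1] via SHA-256."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (P - 1) + 1


def random_exponent() -> int:
    """Pick a secret exponent coprime to P - 1 so blinding is a bijection."""
    while True:
        k = secrets.randbelow(P - 3) + 2
        if math.gcd(k, P - 1) == 1:
            return k


def blind(items, key):
    """Hash each item into the group and raise it to the secret key."""
    return [pow(hash_to_group(item), key, P) for item in items]


def reblind_and_shuffle(values, key):
    """Apply a second secret key and shuffle, hiding which item is which."""
    out = [pow(value, key, P) for value in values]
    secrets.SystemRandom().shuffle(out)
    return out


def psi_cardinality(client_items, server_items):
    """Simulate both parties; the client learns only the intersection size."""
    a = random_exponent()  # client secret
    b = random_exponent()  # server secret
    client_msg = blind(client_items, a)                   # client -> server
    double_client = reblind_and_shuffle(client_msg, b)    # server -> client
    server_msg = blind(server_items, b)                   # server -> client
    double_server = {pow(value, a, P) for value in server_msg}
    return sum(1 for value in double_client if value in double_server)
```

For example, `psi_cardinality(["t1", "t2", "t3"], ["t2", "t3", "t4", "t5"])` counts two shared tokens. Because the server shuffles the doubly blinded client values, the client cannot tell which of its tokens matched. Note that this sketch sends the whole blinded server set to the client, which does not suit the setting the abstract targets, where the server's database is much larger than the client's input.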

Journal ArticleDOI
TL;DR: This work evaluates seven widely-used mutation-based fuzzers against Magma, a ground-truth fuzzing benchmark that enables uniform fuzzer evaluation and comparison, and draws conclusions about the fuzzers' exploration and detection capabilities.
Abstract: High scalability and low running costs have made fuzz testing the de facto standard for discovering software bugs. Fuzzing techniques are constantly being improved in a race to build the ultimate bug-finding tool. However, while fuzzing excels at finding bugs in the wild, evaluating and comparing fuzzer performance is challenging due to the lack of metrics and benchmarks. For example, crash count, perhaps the most commonly-used performance metric, is inaccurate due to imperfections in deduplication techniques. Additionally, the lack of a unified set of targets results in ad hoc evaluations that hinder fair comparison. We tackle these problems by developing Magma, a ground-truth fuzzing benchmark that enables uniform fuzzer evaluation and comparison. By introducing real bugs into real software, Magma allows for the realistic evaluation of fuzzers against a broad set of targets. By instrumenting these bugs, Magma also enables the collection of bug-centric performance metrics independent of the fuzzer. Magma is an open benchmark consisting of seven targets that perform a variety of input manipulations and complex computations, presenting a challenge to state-of-the-art fuzzers. We evaluate seven widely-used mutation-based fuzzers (AFL, AFLFast, AFL++, FairFuzz, MOpt-AFL, honggfuzz, and SymCC-AFL) against Magma over 200,000 CPU-hours. Based on the number of bugs reached, triggered, and detected, we draw conclusions about the fuzzers' exploration and detection capabilities. This provides insight into fuzzer performance evaluation, highlighting the importance of ground truth in performing more accurate and meaningful evaluations.

Proceedings ArticleDOI
TL;DR: UNICORN is presented, an anomaly-based APT detector that effectively leverages data provenance analysis, outperforms an existing state-of-the-art APT detection system, and detects real-life APT scenarios with high accuracy.
Abstract: Advanced Persistent Threats (APTs) are difficult to detect due to their "low-and-slow" attack patterns and frequent use of zero-day exploits. We present UNICORN, an anomaly-based APT detector that effectively leverages data provenance analysis. From modeling to detection, UNICORN tailors its design specifically for the unique characteristics of APTs. Through extensive yet time-efficient graph analysis, UNICORN explores provenance graphs that provide rich contextual and historical information to identify stealthy anomalous activities without pre-defined attack signatures. Using a graph sketching technique, it summarizes long-running system execution with space efficiency to combat slow-acting attacks that take place over a long time span. UNICORN further improves its detection capability using a novel modeling approach to understand long-term behavior as the system evolves. Our evaluation shows that UNICORN outperforms an existing state-of-the-art APT detection system and detects real-life APT scenarios with high accuracy.

Journal ArticleDOI
TL;DR: This paper proposes Pivot, a novel solution for privacy preserving vertical decision tree training and prediction, ensuring that no intermediate information is disclosed other than those the clients have agreed to release (i.e., the final tree model and the prediction output).
Abstract: Federated learning (FL) is an emerging paradigm that enables multiple organizations to jointly train a model without revealing their private data to each other. This paper studies vertical federated learning, which tackles the scenarios where (i) collaborating organizations own data of the same set of users but with disjoint features, and (ii) only one organization holds the labels. We propose Pivot, a novel solution for privacy preserving vertical decision tree training and prediction, ensuring that no intermediate information is disclosed other than what the clients have agreed to release (i.e., the final tree model and the prediction output). Pivot does not rely on any trusted third party and provides protection against a semi-honest adversary that may compromise $m-1$ out of $m$ clients. We further identify two privacy leakages when the trained decision tree model is released in plaintext and propose an enhanced protocol to mitigate them. The proposed solution can also be extended to tree ensemble models, e.g., random forest (RF) and gradient boosting decision tree (GBDT), by treating single decision trees as building blocks. Theoretical and experimental analyses suggest that Pivot is efficient for the privacy achieved.

Posted Content
TL;DR: It is found that existing randomized smoothing methods have limited effectiveness at defending against backdoor attacks, which highlights the need for new theory and methods to certify robustness against backdoor attacks.
Abstract: Backdoor attacks are a severe security threat to deep neural networks (DNNs). We envision that, like adversarial examples, there will be a cat-and-mouse game for backdoor attacks, i.e., new empirical defenses are developed to defend against backdoor attacks but they are soon broken by strong adaptive backdoor attacks. To prevent such a cat-and-mouse game, we take the first step towards certified defenses against backdoor attacks. Specifically, in this work, we study the feasibility and effectiveness of certifying robustness against backdoor attacks using a recent technique called randomized smoothing. Randomized smoothing was originally developed to certify robustness against adversarial examples. We generalize randomized smoothing to defend against backdoor attacks. Our results show the theoretical feasibility of using randomized smoothing to certify robustness against backdoor attacks. However, we also find that existing randomized smoothing methods have limited effectiveness at defending against backdoor attacks, which highlights the need for new theory and methods to certify robustness against backdoor attacks.
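For readers unfamiliar with randomized smoothing, the sketch below shows only its basic test-time form, as originally used against adversarial examples: classify many Gaussian-noised copies of the input and take a majority vote. The paper's generalization to backdoor attacks is not reproduced here, and the statistical certification step that turns the vote share into a robustness radius is omitted. The function names, the toy classifier, and the default parameters are assumptions.

```python
import random
from collections import Counter


def smoothed_predict(classifier, x, sigma, num_samples=1000, seed=0):
    """Majority vote of `classifier` over Gaussian-perturbed copies of x.

    Returns the winning label and the fraction of votes it received.
    """
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(num_samples):
        # Add independent N(0, sigma^2) noise to every input coordinate.
        noisy = [value + rng.gauss(0.0, sigma) for value in x]
        votes[classifier(noisy)] += 1
    label, count = votes.most_common(1)[0]
    return label, count / num_samples


def toy_classifier(x):
    """Hypothetical base classifier: label 1 if the coordinates sum above zero."""
    return 1 if sum(x) > 0.0 else 0
```

Calling `smoothed_predict(toy_classifier, [5.0, 5.0], sigma=0.1)` votes on 1000 noisy copies of the input. A vote share close to 1 is what a certification procedure would then convert into a guaranteed radius, while a share near 0.5 means no meaningful guarantee can be given.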

Posted Content
TL;DR: This work proposes a subgraph based backdoor attack on GNN based graph classification, in which the GNN classifier predicts an attacker-chosen target label for a testing graph once a predefined subgraph is injected into the testing graph.
Abstract: Node classification and graph classification are two basic graph analytics tools. Node classification aims to predict a label for each node in a graph, while graph classification aims to predict a label for the entire graph. Existing studies on graph neural networks (GNNs) in adversarial settings mainly focused on node classification, leaving GNN based graph classification largely unexplored. We aim to bridge this gap in this work. Specifically, we propose a subgraph based backdoor attack to GNN based graph classification. In our backdoor attack, a GNN classifier predicts an attacker-chosen target label for a testing graph once the attacker injects a predefined subgraph to the testing graph. Our empirical results on three real-world graph datasets show that our backdoor attacks are effective with small impact on a GNN's prediction accuracy for clean testing graphs. We generalize a state-of-the-art randomized smoothing based certified defense to defend against our backdoor attacks. Our empirical results show that the defense is ineffective in some cases, highlighting the needs of new defenses for our backdoor attacks.