Author

Lei Wang

Bio: Lei Wang is an academic researcher from Indiana University. The author has contributed to research in topics including locality-sensitive hashing and information privacy, has an h-index of 9, and has co-authored 12 publications receiving 392 citations.

Papers
Journal ArticleDOI
27 Apr 2018-PLOS ONE
TL;DR: Hoaxy, as discussed by the authors, is an open platform that enables large-scale, systematic studies of how misinformation and fact-checking spread and compete on Twitter; the authors use it to quantify how effectively the misinformation diffusion network can be disrupted by penalizing the most central nodes.
Abstract: Massive amounts of fake news and conspiratorial content have spread over social media before and after the 2016 US Presidential Elections despite intense fact-checking efforts. How do the spread of misinformation and fact-checking compete? What are the structural and dynamic characteristics of the core of the misinformation diffusion network, and who are its main purveyors? How to reduce the overall amount of misinformation? To explore these questions we built Hoaxy, an open platform that enables large-scale, systematic studies of how misinformation and fact-checking spread and compete on Twitter. Hoaxy captures public tweets that include links to articles from low-credibility and fact-checking sources. We perform k-core decomposition on a diffusion network obtained from two million retweets produced by several hundred thousand accounts over the six months before the election. As we move from the periphery to the core of the network, fact-checking nearly disappears, while social bots proliferate. The number of users in the main core reaches equilibrium around the time of the election, with limited churn and increasingly dense connections. We conclude by quantifying how effectively the network can be disrupted by penalizing the most central nodes. These findings provide a first look at the anatomy of a massive online misinformation diffusion network.
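For readers unfamiliar with the method named in the abstract, here is a minimal sketch (not the authors' Hoaxy code) of how k-core decomposition separates a diffusion network's dense core from its periphery, using networkx. The toy edge list and account names are hypothetical.

```python
# Sketch: k-core decomposition of a retweet diffusion network with networkx.
# Nodes stand for accounts; an edge means one account retweeted another.
import networkx as nx

G = nx.Graph()
G.add_edges_from([
    ("a", "b"), ("a", "c"), ("b", "c"),   # densely connected "core" accounts
    ("c", "d"), ("d", "e"),               # peripheral accounts
])

core_numbers = nx.core_number(G)   # k-shell index of every node
main_core = nx.k_core(G)           # subgraph induced by the highest k-shell

print(core_numbers)                # {'a': 2, 'b': 2, 'c': 2, 'd': 1, 'e': 1}
print(list(main_core.nodes()))     # the main core: ['a', 'b', 'c']
```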

183 citations

Posted Content
TL;DR: It is demonstrated that even a well-generalized model contains instances vulnerable to a new generalized MIA (GMIA), which uses novel techniques for selecting vulnerable instances and detecting their subtle influences ignored by overfitting metrics.
Abstract: Membership Inference Attack (MIA) determines the presence of a record in a machine learning model's training data by querying the model. Prior work has shown that the attack is feasible when the model is overfitted to its training data or when the adversary controls the training algorithm. However, when the model is not overfitted and the adversary does not control the training algorithm, the threat is not well understood. In this paper, we report a study that discovers overfitting to be a sufficient but not a necessary condition for an MIA to succeed. More specifically, we demonstrate that even a well-generalized model contains vulnerable instances subject to a new generalized MIA (GMIA). In GMIA, we use novel techniques for selecting vulnerable instances and detecting their subtle influences ignored by overfitting metrics. Specifically, we successfully identify individual records with high precision in real-world datasets by querying black-box machine learning models. Further, we show that a vulnerable record can even be indirectly attacked by querying other related records, and existing generalization techniques are found to be less effective in protecting the vulnerable instances. Our findings sharpen the understanding of the fundamental cause of the problem: the unique influences the training instance may have on the model.
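As background for the abstract, a minimal, hedged sketch of the generic black-box membership inference idea it builds on (not the paper's GMIA instance-selection procedure): query the model for its confidence on the record's true label and compare against a calibrated threshold. The helper names, the `predict_proba` interface, and the threshold value are assumptions for illustration.

```python
# Sketch: black-box membership inference via confidence thresholding.
import numpy as np

def membership_score(predict_proba, x, true_label):
    """Higher score -> more likely the record was in the training set."""
    probs = predict_proba(x.reshape(1, -1))[0]
    return probs[true_label]              # model confidence on the correct class

def infer_membership(predict_proba, x, true_label, threshold=0.9):
    # The threshold is hypothetical; in practice it is calibrated on
    # reference (shadow) data drawn from the same distribution.
    return membership_score(predict_proba, x, true_label) >= threshold
```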

183 citations

Journal ArticleDOI
Kaiyuan Liu, Sujun Li, Lei Wang, Yuzhen Ye, Haixu Tang
TL;DR: A deep learning approach that predicts complete spectra (both backbone and non-backbone ions) directly from peptide sequences, together with a multi-task learning (MTL) approach for predicting spectra with insufficient training samples, which allows the model to make accurate predictions for electron transfer dissociation (ETD) spectra and HCD spectra of less abundant charges.
Abstract: The ability to predict tandem mass (MS/MS) spectra from peptide sequences can significantly enhance our understanding of the peptide fragmentation process and could improve peptide identification in proteomics. However, current approaches for predicting high-energy collisional dissociation (HCD) spectra are limited to predicting the intensities of expected ion types, that is, the a/b/c/x/y/z ions and their neutral-loss derivatives (referred to as backbone ions). In practice, backbone ions account for less than 70% of the total ion intensity in HCD spectra, indicating that many intense ions are ignored by current predictors. In this paper, we present a deep learning approach that can predict complete spectra (both backbone and non-backbone ions) directly from peptide sequences. We make no assumptions about which kinds of ions to predict; instead, we predict the intensities for all possible m/z values. Training this model requires no annotation of fragment ions nor any prior knowledge of fragmentation rules. Our analyses show that the predicted 2+ and 3+ HCD spectra are highly similar to the experimental spectra, with average full-spectrum cosine similarities of 0.820 (±0.088) and 0.786 (±0.085), respectively, very close to the similarities between experimental replicate spectra. In contrast, the best-performing backbone-only models achieve average similarities below 0.75 and 0.70 for 2+ and 3+ spectra, respectively. Furthermore, we developed a multi-task learning (MTL) approach for predicting spectra with insufficient training samples, which allows our model to make accurate predictions for electron transfer dissociation (ETD) spectra and HCD spectra of less abundant charges (1+ and 4+).
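To make the evaluation metric concrete, here is a small sketch of a full-spectrum cosine similarity: the predicted and experimental spectra are binned across the whole m/z range and compared as vectors. The bin width, m/z range, and toy peak lists are assumptions for illustration, not the paper's exact settings.

```python
# Sketch: full-spectrum cosine similarity between two binned MS/MS spectra.
import numpy as np

def bin_spectrum(mz, intensity, max_mz=2000.0, bin_width=1.0):
    """Accumulate peak intensities into fixed-width m/z bins."""
    bins = np.zeros(int(max_mz / bin_width))
    for m, i in zip(mz, intensity):
        idx = int(m / bin_width)
        if idx < len(bins):
            bins[idx] += i
    return bins

def cosine_similarity(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

pred = bin_spectrum([175.1, 347.2, 476.3], [0.2, 1.0, 0.6])
expt = bin_spectrum([175.1, 347.2, 476.2], [0.3, 0.9, 0.7])
print(cosine_similarity(pred, expt))   # close to 1.0 for similar spectra
```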

41 citations

Proceedings ArticleDOI
01 Sep 2020
TL;DR: This work revisits membership inference attacks from the perspective of a pragmatic adversary who carefully selects targets and makes predictions conservatively, and designs a new evaluation methodology that evaluates the membership privacy risk at the level of individuals rather than only in aggregate.
Abstract: Membership Inference Attacks (MIAs) aim to determine the presence of a record in a machine learning model's training data by querying the model. Recent work has demonstrated the effectiveness of MIA on various machine learning models, and corresponding defenses have been proposed. However, both attacks and defenses have focused on an adversary that indiscriminately attacks all the records without regard to the cost of false positives or negatives. In this work, we revisit membership inference attacks from the perspective of a pragmatic adversary who carefully selects targets and makes predictions conservatively. We design a new evaluation methodology that allows us to evaluate the membership privacy risk at the level of individuals and not only in aggregate. We experimentally demonstrate that highly vulnerable records exist even when the aggregate attack precision is close to 50% (baseline). Specifically, on the MNIST dataset, our pragmatic adversary achieves a precision of 95.05% whereas the prior attack only achieves a precision of 51.7%.
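A rough sketch of the evaluation idea described in the abstract, under assumed data: the pragmatic adversary only labels records as members when an attack score clears a high bar, and precision is measured over those selected targets rather than over all records. The scores and labels below are made up for illustration.

```python
# Sketch: precision of a target-selecting ("pragmatic") membership adversary.
import numpy as np

def precision_on_selected(scores, is_member, threshold):
    scores = np.asarray(scores)
    is_member = np.asarray(is_member, dtype=bool)
    selected = scores >= threshold           # records the adversary attacks
    if selected.sum() == 0:
        return None                           # adversary abstains entirely
    return is_member[selected].mean()         # fraction of correct "member" calls

# Hypothetical numbers: aggregate precision near 0.5, but high precision
# on the few records the adversary actually targets.
scores    = [0.99, 0.98, 0.40, 0.55, 0.51, 0.97]
is_member = [True, True, False, True, False, True]
print(precision_on_selected(scores, is_member, threshold=0.95))  # 1.0
```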

40 citations

Journal ArticleDOI
TL;DR: The outcomes suggest that applying highly optimized privacy-preserving and secure computation techniques to safeguard genomic data sharing and analysis is useful; however, the results also indicate that further efforts are needed to refine these techniques into practical solutions.
Abstract: The human genome can reveal sensitive information and is potentially re-identifiable, which raises privacy and security concerns about sharing such data on wide scales. In 2016, we organized the third Critical Assessment of Data Privacy and Protection competition as a community effort to bring together biomedical informaticists, computer privacy and security researchers, and scholars in ethical, legal, and social implications (ELSI) to assess the latest advances on privacy-preserving techniques for protecting human genomic data. Teams were asked to develop novel protection methods for emerging genome privacy challenges in three scenarios: Track (1), data sharing through the Beacon service of the Global Alliance for Genomics and Health; Track (2), collaborative discovery of similar genomes between two institutions; and Track (3), data outsourcing to public cloud services. The latter two tracks represent continuing themes from our 2015 competition, while the former was new and a response to a recently established vulnerability. The winning strategy for Track 1 mitigated the privacy risk by hiding approximately 11% of the variation in the database while permitting around 160,000 queries, a significant improvement over the baseline. The winning strategies in Tracks 2 and 3 showed significant progress over the previous competition by achieving multiple orders of magnitude performance improvement in terms of computational runtime and memory requirements. The outcomes suggest that applying highly optimized privacy-preserving and secure computation techniques to safeguard genomic data sharing and analysis is useful. However, the results also indicate that further efforts are needed to refine these techniques into practical solutions.
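For context on Track 1, a toy sketch (not any team's winning entry) of a Beacon-style presence/absence service in which a fraction of the variation is hidden from answers, loosely mirroring the mitigation summarized in the abstract. The class name, variant encoding, and the way the hide fraction is applied here are illustrative assumptions.

```python
# Sketch: a Beacon-style query service that hides a fraction of its variants.
import random

class Beacon:
    def __init__(self, variants, hide_fraction=0.11, seed=0):
        rng = random.Random(seed)
        self.variants = set(variants)
        # Hide roughly `hide_fraction` of the variation from query responses.
        self.hidden = {v for v in self.variants if rng.random() < hide_fraction}

    def query(self, variant):
        """Return True only for variants that are present and not hidden."""
        return variant in self.variants and variant not in self.hidden

beacon = Beacon(variants=["1-123-A-G", "2-456-C-T", "X-789-G-A"])
print(beacon.query("1-123-A-G"))   # True unless this variant was hidden
print(beacon.query("3-000-T-C"))   # False: not in the database
```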

35 citations


Cited by
Proceedings ArticleDOI
19 May 2019
TL;DR: In this article, passive and active inference attacks are proposed to exploit the leakage of information about participants' training data in federated learning, allowing an adversarial participant to infer the presence of exact data points as well as properties that hold only for a subset of the training data and are independent of the properties the joint model aims to capture.
Abstract: Collaborative machine learning and related techniques such as federated learning allow multiple participants, each with his own training dataset, to build a joint model by training locally and periodically exchanging model updates. We demonstrate that these updates leak unintended information about participants' training data and develop passive and active inference attacks to exploit this leakage. First, we show that an adversarial participant can infer the presence of exact data points -- for example, specific locations -- in others' training data (i.e., membership inference). Then, we show how this adversary can infer properties that hold only for a subset of the training data and are independent of the properties that the joint model aims to capture. For example, he can infer when a specific person first appears in the photos used to train a binary gender classifier. We evaluate our attacks on a variety of tasks, datasets, and learning configurations, analyze their limitations, and discuss possible defenses.
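A hedged sketch of why shared updates leak, under assumptions: an adversarial participant can apply the other participants' aggregated update to a copy of the model and check how much it reduces the loss on a candidate record; a large drop hints that the record (or something very like it) was in their data. This illustrates the attack surface with hypothetical helper names in PyTorch, not the paper's actual attacks.

```python
# Sketch: membership signal from an aggregated federated-learning update.
import torch
import torch.nn.functional as F

def loss_on(model, x, y):
    """Cross-entropy loss of a classification model on one batch."""
    return F.cross_entropy(model(x), y).item()

def membership_signal(model, aggregated_update, candidate_x, candidate_y, lr=1.0):
    """Loss drop on the candidate record after applying others' update.

    `aggregated_update` is assumed to be a list of tensors aligned with
    model.parameters(); a larger drop suggests membership.
    """
    before = loss_on(model, candidate_x, candidate_y)
    with torch.no_grad():
        for p, u in zip(model.parameters(), aggregated_update):
            p -= lr * u                      # apply the shared update locally
    after = loss_on(model, candidate_x, candidate_y)
    return before - after
```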

1,084 citations

Proceedings ArticleDOI
19 May 2019
TL;DR: The reasons why deep learning models may leak information about their training data are investigated and new algorithms tailored to the white-box setting are designed by exploiting the privacy vulnerabilities of the stochastic gradient descent algorithm, which is the algorithm used to train deep neural networks.
Abstract: Deep neural networks are susceptible to various inference attacks as they remember information about their training data. We design white-box inference attacks to perform a comprehensive privacy analysis of deep learning models. We measure the privacy leakage through parameters of fully trained models as well as the parameter updates of models during training. We design inference algorithms for both centralized and federated learning, with respect to passive and active inference attackers, and assuming different adversary prior knowledge. We evaluate our novel white-box membership inference attacks against deep learning algorithms to trace their training data records. We show that a straightforward extension of the known black-box attacks to the white-box setting (through analyzing the outputs of activation functions) is ineffective. We therefore design new algorithms tailored to the white-box setting by exploiting the privacy vulnerabilities of the stochastic gradient descent algorithm, which is the algorithm used to train deep neural networks. We investigate the reasons why deep learning models may leak information about their training data. We then show that even well-generalized models are significantly susceptible to white-box membership inference attacks, by analyzing state-of-the-art pre-trained and publicly available models for the CIFAR dataset. We also show how adversarial participants, in the federated learning setting, can successfully run active membership inference attacks against other participants, even when the global model achieves high prediction accuracies.
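As a concrete illustration of the kind of white-box signal the abstract describes, a simplified sketch that computes the norm of the loss gradient with respect to all model parameters for a single record; training members typically yield smaller gradients on a well-trained model. This is one illustrative feature in PyTorch, not the paper's full attack model.

```python
# Sketch: per-record gradient-norm feature for white-box membership inference.
import torch
import torch.nn.functional as F

def grad_norm_feature(model, x, y):
    """L2 norm of d(loss)/d(parameters) for a single (batched) record."""
    model.zero_grad()
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    total = 0.0
    for p in model.parameters():
        if p.grad is not None:
            total += p.grad.detach().pow(2).sum().item()
    return total ** 0.5   # smaller values tend to indicate training members
```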

783 citations

Journal ArticleDOI
TL;DR: An increasing trend is observed in published articles on health-related misinformation and the role of social media in its propagation. The most extensively studied topics involving misinformation relate to vaccination, Ebola, and Zika virus, although others, such as nutrition, cancer, fluoridation of water, and smoking, also feature.

773 citations

Journal ArticleDOI
TL;DR: In this article, the authors use a dataset of 171 million tweets in the five months preceding the election day to identify 30 million tweets, from 2.2 million users, which contain a link to news outlets.
Abstract: The dynamics and influence of fake news on Twitter during the 2016 US presidential election remains to be clarified. Here, we use a dataset of 171 million tweets in the five months preceding the election day to identify 30 million tweets, from 2.2 million users, which contain a link to news outlets. Based on a classification of news outlets curated by www.opensources.co , we find that 25% of these tweets spread either fake or extremely biased news. We characterize the networks of information flow to find the most influential spreaders of fake and traditional news and use causal modeling to uncover how fake news influenced the presidential election. We find that, while top influencers spreading traditional center and left leaning news largely influence the activity of Clinton supporters, this causality is reversed for the fake news: the activity of Trump supporters influences the dynamics of the top fake news spreaders.

576 citations

Posted Content
TL;DR: This paper demonstrates that in such settings, an adversary can perform a training data extraction attack to recover individual training examples by querying the language model, and finds that larger models are more vulnerable than smaller models.
Abstract: It has become common to publish large (billion parameter) language models that have been trained on private datasets. This paper demonstrates that in such settings, an adversary can perform a training data extraction attack to recover individual training examples by querying the language model. We demonstrate our attack on GPT-2, a language model trained on scrapes of the public Internet, and are able to extract hundreds of verbatim text sequences from the model's training data. These extracted examples include (public) personally identifiable information (names, phone numbers, and email addresses), IRC conversations, code, and 128-bit UUIDs. Our attack is possible even though each of the above sequences is included in just one document in the training data. We comprehensively evaluate our extraction attack to understand the factors that contribute to its success. Worryingly, we find that larger models are more vulnerable than smaller models. We conclude by drawing lessons and discussing possible safeguards for training large language models.
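A condensed, hedged sketch of the extraction pipeline the abstract outlines: sample many continuations from the model, then rank them by the model's own perplexity and inspect the lowest-perplexity samples as candidates for memorized text. The model checkpoint ("gpt2"), sampling settings, and sample count are illustrative assumptions; the paper's actual attack uses additional ranking metrics.

```python
# Sketch: sample-and-rank training data extraction against a language model.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def perplexity(text):
    """The model's perplexity on a text; low values suggest memorization."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss     # mean token cross-entropy
    return torch.exp(loss).item()

def generate_samples(n=10, max_length=64):
    out = model.generate(do_sample=True, top_k=40, max_length=max_length,
                         num_return_sequences=n,
                         pad_token_id=tokenizer.eos_token_id)
    return [tokenizer.decode(o, skip_special_tokens=True) for o in out]

samples = generate_samples()
for text in sorted(samples, key=perplexity)[:3]:   # lowest-perplexity candidates
    print(round(perplexity(text), 1), text[:60])
```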

496 citations