
Showing papers on "Bernoulli sampling" published in 2019


Journal ArticleDOI
TL;DR: Extensive experimental evaluations demonstrate the superiority of the proposed sharp attention model for person re-ID over published state-of-the-art methods on three challenging benchmarks: CUHK03, Market-1501, and DukeMTMC-reID.
Abstract: In this paper, we present novel sharp attention networks by adaptively sampling feature maps from convolutional neural networks for person re-identification (re-ID). Due to the introduction of sampling-based attention models, the proposed approach can adaptively generate sharper attention-aware feature masks. This greatly differs from the gating-based attention mechanism that relies on soft gating functions to select the relevant features for person re-ID. In contrast, the proposed sampling-based attention mechanism allows us to effectively trim irrelevant features by enforcing the resultant feature masks to focus on the most discriminative features. It can produce sharper attention that is more assertive in localizing subtle features relevant to re-identifying people across cameras. For this purpose, a differentiable Gumbel-Softmax sampler is employed to approximate Bernoulli sampling when training the sharp attention networks. Extensive experimental evaluations demonstrate the superiority of this sharp attention model for person re-ID over published state-of-the-art methods on three challenging benchmarks: CUHK03, Market-1501, and DukeMTMC-reID.
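To make the mechanism concrete, here is a minimal sketch of a binary-concrete (Gumbel-Softmax) relaxation of Bernoulli sampling applied to an attention mask; the tensor shapes, temperature, and names are illustrative rather than taken from the paper.

```python
import torch

def gumbel_sigmoid(logits: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    """Differentiable approximation to Bernoulli(sigmoid(logits)) sampling.

    Binary concrete / Gumbel-Softmax relaxation: add logistic noise to the
    logits and pass the result through a tempered sigmoid. As tau -> 0 the
    output approaches a hard 0/1 mask while staying differentiable.
    """
    u = torch.rand_like(logits).clamp(1e-6, 1 - 1e-6)
    noise = torch.log(u) - torch.log1p(-u)  # Logistic(0, 1) sample
    return torch.sigmoid((logits + noise) / tau)

# Illustrative usage: sharpen a feature map with a sampled attention mask.
feature_map = torch.randn(8, 256, 24, 12)    # (batch, C, H, W), hypothetical
mask_logits = torch.randn(8, 1, 24, 12)      # from an attention branch
mask = gumbel_sigmoid(mask_logits, tau=0.5)  # near-binary, differentiable
attended = feature_map * mask                # trims irrelevant features
```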

46 citations


Journal ArticleDOI
TL;DR: In this article, the authors propose randomized incomplete $U$-statistics with sparse weights whose computational cost can be made independent of the order of the $U$-statistic.
Abstract: This paper studies inference for the mean vector of a high-dimensional $U$-statistic. In the era of big data, the dimension $d$ of the $U$-statistic and the sample size $n$ of the observations tend to be both large, and the computation of the $U$-statistic is prohibitively demanding. Data-dependent inferential procedures such as the empirical bootstrap for $U$-statistics are even more computationally expensive. To overcome such a computational bottleneck, incomplete $U$-statistics obtained by sampling fewer terms of the $U$-statistic are attractive alternatives. In this paper, we introduce randomized incomplete $U$-statistics with sparse weights whose computational cost can be made independent of the order of the $U$-statistic. We derive nonasymptotic Gaussian approximation error bounds for the randomized incomplete $U$-statistics in high dimensions, namely in cases where the dimension $d$ is possibly much larger than the sample size $n$, for both nondegenerate and degenerate kernels. In addition, we propose generic bootstrap methods for the incomplete $U$-statistics that are computationally much less demanding than existing bootstrap methods, and establish finite sample validity of the proposed bootstrap methods. Our methods are illustrated on the application to nonparametric testing for the pairwise independence of a high-dimensional random vector under weaker assumptions than those appearing in the literature.
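As a rough illustration of the sampling idea (not the paper's exact construction), the following sketch forms an order-2 incomplete $U$-statistic by Bernoulli-sampling the kernel evaluations at a rate chosen to meet a computational budget; `incomplete_u_stat` and its arguments are hypothetical names.

```python
import numpy as np

def incomplete_u_stat(x, kernel, budget, seed=None):
    """Randomized incomplete U-statistic of order 2 (illustrative sketch).

    Each of the N = n(n-1)/2 kernel evaluations is kept independently with
    probability q = budget / N, so only about `budget` kernels are ever
    computed. Equivalently: draw K ~ Binomial(N, q), then pick K distinct
    index pairs uniformly at random and average the kernel over them.
    """
    rng = np.random.default_rng(seed)
    n = len(x)
    total = n * (n - 1) // 2
    q = min(1.0, budget / total)
    n_keep = rng.binomial(total, q)
    # Enumerate index pairs once (cheap relative to kernel calls), then
    # evaluate the kernel only on the sampled pairs.
    ii, jj = np.triu_indices(n, k=1)
    picked = rng.choice(total, size=n_keep, replace=False)
    vals = [kernel(x[i], x[j]) for i, j in zip(ii[picked], jj[picked])]
    return float(np.mean(vals)) if vals else 0.0

# Example: Gini mean difference, whose kernel is h(a, b) = |a - b|.
x = np.random.default_rng(0).normal(size=400)
est = incomplete_u_stat(x, lambda a, b: abs(a - b), budget=2000)
```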

38 citations


Posted Content
TL;DR: In this article, the authors investigate the robustness of sampling against adaptive adversarial attacks in a streaming setting, where an adversary sends a stream of elements from a universe $U$ to a sampling algorithm (e.g., Bernoulli sampling or reservoir sampling), with the goal of making the sample "very unrepresentative" of the underlying data stream.
Abstract: Random sampling is a fundamental primitive in modern algorithms, statistics, and machine learning, used as a generic method to obtain a small yet "representative" subset of the data. In this work, we investigate the robustness of sampling against adaptive adversarial attacks in a streaming setting: An adversary sends a stream of elements from a universe $U$ to a sampling algorithm (e.g., Bernoulli sampling or reservoir sampling), with the goal of making the sample "very unrepresentative" of the underlying data stream. The adversary is fully adaptive in the sense that it knows the exact content of the sample at any given point along the stream, and can choose which element to send next accordingly, in an online manner. Well-known results in the static setting indicate that if the full stream is chosen in advance (non-adaptively), then a random sample of size $\Omega(d / \varepsilon^2)$ is an $\varepsilon$-approximation of the full data with good probability, where $d$ is the VC-dimension of the underlying set system $(U,R)$. Does this sample size suffice for robustness against an adaptive adversary? The simplistic answer is \emph{negative}: We demonstrate a set system where a constant sample size (corresponding to VC-dimension $1$) suffices in the static setting, yet an adaptive adversary can make the sample very unrepresentative, as long as the sample size is (strongly) sublinear in the stream length, using a simple and easy-to-implement attack. However, this attack is "theoretical only", requiring the set system size to (essentially) be exponential in the stream length. This is not a coincidence: We show that to make Bernoulli or reservoir sampling robust against adaptive adversaries, the modification required is solely to replace the VC-dimension term $d$ in the sample size with the cardinality term $\log |R|$. This nearly matches the bound imposed by the attack.
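For reference, minimal versions of the two samplers the paper analyzes, plain Bernoulli sampling and reservoir sampling (Vitter's Algorithm R), look as follows; both satisfy the static guarantee, but per the paper they need the larger $\log |R|$ sample size to withstand an adaptive adversary.

```python
import random

def bernoulli_sample(stream, p, rng=random):
    """Keep each stream element independently with probability p."""
    return [x for x in stream if rng.random() < p]

def reservoir_sample(stream, k, rng=random):
    """Uniform sample of k elements from a stream of unknown length
    (Algorithm R): element i replaces a random slot with probability k/(i+1)."""
    reservoir = []
    for i, x in enumerate(stream):
        if i < k:
            reservoir.append(x)
        else:
            j = rng.randint(0, i)  # uniform on {0, ..., i}, inclusive
            if j < k:
                reservoir[j] = x
    return reservoir
```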

14 citations


Book ChapterDOI
18 Feb 2019
TL;DR: In this paper, the authors present a method and software for ballot-polling risk-limiting audits based on Bernoulli sampling, where ballots are included in the sample with probability p, independently.
Abstract: We present a method and software for ballot-polling risk-limiting audits (RLAs) based on Bernoulli sampling: ballots are included in the sample with probability $p$, independently. Bernoulli sampling has several advantages: (1) it does not require a ballot manifest; (2) it can be conducted independently at different locations, rather than requiring a central authority to select the sample from the whole population of cast ballots or requiring stratified sampling; (3) it can start in polling places on election night, before margins are known. If the reported margins for the 2016 U.S. Presidential election are correct, a Bernoulli ballot-polling audit with a risk limit of 5% and a sampling rate of $p_0 = 1\%$ would have had at least a 99% probability of confirming the outcome in 42 states. (The remaining states would more likely have needed additional ballots examined.) The logistical and security advantages of auditing in the polling place may outweigh the cost of examining more ballots than some other methods would require.
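A hedged sketch of the audit loop: each ballot enters the sample independently with probability p, and the sampled ballots drive a Wald sequential probability ratio test of the reported outcome (a BRAVO-style statistic; the paper's exact test and its handling of invalid ballots are not reproduced here, and all names are illustrative).

```python
import random

def bernoulli_ballot_poll(ballots, p, reported_winner_share,
                          risk_limit=0.05, rng=random):
    """Sketch of a Bernoulli ballot-polling audit for a two-candidate race.

    Each ballot is sampled independently with probability p. The sampled
    ballots update a Wald sequential probability ratio: under the reported
    outcome the winner's share is s > 1/2; under the null it is exactly 1/2.
    The audit stops when the ratio reaches 1/risk_limit.
    """
    s = reported_winner_share  # must satisfy 0.5 < s < 1
    ratio = 1.0
    for ballot in ballots:
        if rng.random() >= p:            # Bernoulli sampling step
            continue
        if ballot == "winner":
            ratio *= s / 0.5
        elif ballot == "loser":
            ratio *= (1 - s) / 0.5
        if ratio >= 1.0 / risk_limit:
            return "outcome confirmed"   # risk limit attained
    return "escalate: examine more ballots or hand count"
```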

11 citations


Posted Content
TL;DR: This work studies the limitations of offline samples in approximating join queries and shows that a hybrid scheme combining stratified, universe, and Bernoulli sampling achieves the information-theoretical lower bound on estimator variance within a constant factor.
Abstract: Despite decades of research on approximate query processing (AQP), our understanding of sample-based joins has remained limited and, to some extent, even superficial. The common belief in the community is that joining random samples is futile. This belief is largely based on an early result showing that the join of two uniform samples is not an independent sample of the original join, and that it leads to quadratically fewer output tuples. Unfortunately, this result has little applicability to the key questions practitioners face. For example, the success metric is often the final approximation's accuracy, rather than output cardinality. Moreover, there are many non-uniform sampling strategies that one can employ. Is sampling for joins still futile in all of these settings? If not, what is the best sampling strategy in each case? To the best of our knowledge, there is no formal study answering these questions. This paper aims to improve our understanding of sample-based joins and offer a guideline for practitioners building and using real-world AQP systems. We study limitations of offline samples in approximating join queries: given an offline sampling budget, how well can one approximate the join of two tables? We answer this question for two success metrics: output size and estimator variance. We show that maximizing output size is easy, while there is an information-theoretical lower bound on the lowest variance achievable by any sampling strategy. We then define a hybrid sampling scheme that captures all combinations of stratified, universe, and Bernoulli sampling, and show that this scheme with our optimal parameters achieves the theoretical lower bound within a constant factor. Since computing these optimal parameters requires shuffling statistics across the network, we also propose a decentralized variant where each node acts autonomously using minimal statistics.
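To illustrate two of the three primitives in the hybrid scheme, here is a sketch of universe sampling (a shared hash on the join key, so matching tuples survive together) composed with Bernoulli sampling; function names and parameters are illustrative, and stratification is only noted in a comment.

```python
import hashlib
import random

def universe_keep(key, fraction):
    """Universe sampling: keep a tuple iff its join key hashes into the
    chosen fraction of hash space. Both tables use the same hash, so
    tuples that would join are kept or dropped together."""
    h = int.from_bytes(hashlib.sha1(str(key).encode()).digest()[:8], "big")
    return h / 2**64 < fraction

def hybrid_sample(table, key_of, universe_frac, bernoulli_p, rng=random):
    """Universe sampling on the join key composed with Bernoulli sampling
    of the surviving tuples. (Stratification would further split the table
    by key frequency and assign per-stratum rates, omitted here.)"""
    return [t for t in table
            if universe_keep(key_of(t), universe_frac)
            and rng.random() < bernoulli_p]
```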

11 citations


Journal ArticleDOI
09 Dec 2019
TL;DR: In this article, the authors study sample-based joins in approximate query processing, asking how well offline samples with a fixed budget can approximate the join of two tables under two success metrics: output size and estimator variance.
Abstract: Despite decades of research on AQP (approximate query processing), our understanding of sample-based joins has remained limited and, to some extent, even superficial. The common belief in the community is that joining random samples is futile. This belief is largely based on an early result showing that the join of two uniform samples is not an independent sample of the original join, and that it leads to quadratically fewer output tuples. Unfortunately, this early result has little applicability to the key questions practitioners face. For example, the success metric is often the final approximation's accuracy, rather than output cardinality. Moreover, there are many non-uniform sampling strategies that one can employ. Is sampling for joins still futile in all of these settings? If not, what is the best sampling strategy in each case? To the best of our knowledge, there is no formal study answering these questions. This paper aims to improve our understanding of sample-based joins and offer a guideline for practitioners building and using real-world AQP systems. We study limitations of offline samples in approximating join queries: given an offline sampling budget, how well can one approximate the join of two tables? We answer this question for two success metrics: output size and estimator variance. We show that maximizing output size is easy, while there is an information-theoretical lower bound on the lowest variance achievable by any sampling strategy. We then define a hybrid sampling scheme that captures all combinations of stratified, universe, and Bernoulli sampling, and show that this scheme with our optimal parameters achieves the theoretical lower bound within a constant factor. Since computing these optimal parameters requires shuffling statistics across the network, we also propose a decentralized variant in which each node acts autonomously using minimal statistics. We also empirically validate our findings on popular SQL and AQP engines.
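Continuing the sketch given for the preprint version above, under the same illustrative assumptions a joining pair survives the hybrid sample with probability universe_frac * p1 * p2, which yields a simple inverse-probability (Horvitz-Thompson style) join-size estimate; this illustrates the estimator-variance metric, not the paper's optimal-parameter estimator.

```python
from collections import Counter

def estimate_join_size(sample1, sample2, key_of, universe_frac, p1, p2):
    """Unbiased join-size estimate from hybrid samples of two tables.

    A joining pair (t1, t2) survives with probability
    universe_frac * p1 * p2: the universe decision on the shared key is
    made once for both tables, and the two Bernoulli decisions are
    independent. Scaling the sampled join count by the inverse of that
    inclusion probability gives a Horvitz-Thompson style estimate.
    """
    counts2 = Counter(key_of(t) for t in sample2)
    sampled_join_size = sum(counts2[key_of(t)] for t in sample1)
    return sampled_join_size / (universe_frac * p1 * p2)
```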

9 citations


Posted Content
TL;DR: In this paper, the authors investigate the robustness of sampling against adaptive adversarial attacks in a streaming setting: an adversary sends a stream of elements from a universe U to a sampling algorithm (e.g., Bernoulli sampling or reservoir sampling), with the goal of making the sample "very unrepresentative" of the underlying data stream.
Abstract: Random sampling is a fundamental primitive in modern algorithms, statistics, and machine learning, used as a generic method to obtain a small yet "representative" subset of the data. In this work, we investigate the robustness of sampling against adaptive adversarial attacks in a streaming setting: An adversary sends a stream of elements from a universe $U$ to a sampling algorithm (e.g., Bernoulli sampling or reservoir sampling), with the goal of making the sample "very unrepresentative" of the underlying data stream. The adversary is fully adaptive in the sense that it knows the exact content of the sample at any given point along the stream, and can choose which element to send next accordingly, in an online manner. Well-known results in the static setting indicate that if the full stream is chosen in advance (non-adaptively), then a random sample of size $\Omega(d/\varepsilon^2)$ is an $\varepsilon$-approximation of the full data with good probability, where $d$ is the VC-dimension of the underlying set system $(U, R)$. Does this sample size suffice for robustness against an adaptive adversary? The simplistic answer is negative: We demonstrate a set system where a constant sample size (corresponding to a VC-dimension of $1$) suffices in the static setting, yet an adaptive adversary can make the sample very unrepresentative, as long as the sample size is (strongly) sublinear in the stream length, using a simple and easy-to-implement attack. However, this attack is "theoretical only", requiring the set system size to (essentially) be exponential in the stream length. This is not a coincidence: We show that in order to make the sampling algorithm robust against adaptive adversaries, the modification required is solely to replace the VC-dimension term $d$ in the sample size with the cardinality term $\log |R|$. That is, the Bernoulli and reservoir sampling algorithms with sample size $\Omega(\log |R| / \varepsilon^2)$ output a representative sample of the stream with good probability, even in the presence of an adaptive adversary. This nearly matches the bound imposed by the attack.
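The following toy demonstration (a simplification of the paper's attack, which is interactive along the stream) shows why an enormous set system defeats the static VC bound: choosing the range after seeing the sample makes a Bernoulli sample arbitrarily unrepresentative; all names are illustrative.

```python
import random

def unrepresentative_range_demo(n=10_000, p=0.01, seed=1):
    """Why a huge set system defeats the static VC guarantee (a sketch).

    Bernoulli-sample the stream 0..n-1 with rate p, then, adversarially
    and only after seeing the sample, query the range R* of all unsampled
    elements. Such an R* is available only when R contains essentially all
    subsets, i.e., the set system is exponential in the stream length.
    The sample reports R*'s density as 0; its true density is about 1 - p.
    """
    rng = random.Random(seed)
    stream = range(n)
    sample = {x for x in stream if rng.random() < p}
    r_star = [x for x in stream if x not in sample]  # chosen adaptively
    true_density = len(r_star) / n                   # roughly 1 - p
    sample_density = len(sample & set(r_star)) / max(len(sample), 1)  # 0.0
    return true_density, sample_density
```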

3 citations


Proceedings Article
01 Jan 2019
TL;DR: This work constructs an inference network conditioned on the symbolic representation of entities and relation types in the Knowledge Graph, to provide the variational distributions, and introduces two models from this highly scalable probabilistic framework, namely the Latent Information and Latent Fact models, for reasoning over knowledge graph-based representations.
Abstract: Recent advances in Neural Variational Inference allowed for a renaissance in latent variable models in a variety of domains involving high-dimensional data. While traditional variational methods derive an analytical approximation for the intractable distribution over the latent variables, here we construct an inference network conditioned on the symbolic representation of entities and relation types in the Knowledge Graph, to provide the variational distributions. The new framework results in a highly-scalable method. Under a Bernoulli sampling framework, we provide an alternative justification for commonly used techniques in large-scale stochastic variational inference, which drastically reduce training time at a cost of an additional approximation to the variational lower bound. We introduce two models from this highly scalable probabilistic framework, namely the Latent Information and Latent Fact models, for reasoning over knowledge graph-based representations. Our Latent Information and Latent Fact models improve upon baseline performance under certain conditions. We use the learnt embedding variance to estimate predictive uncertainty during link prediction, and discuss the quality of these learnt uncertainty estimates. Our source code and datasets are publicly available online at https://github.com/alexanderimanicowenrivers/Neural-Variational-Knowledge-Graphs.
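A minimal sketch of the Bernoulli-sampling justification mentioned above, assuming each fact's ELBO term is kept independently with probability q and reweighted by 1/q so the subsampled bound is unbiased for the full one; the function and argument names are hypothetical.

```python
import numpy as np

def subsampled_elbo(per_fact_elbo_terms, q, seed=None):
    """Unbiased Bernoulli-subsampled estimate of a variational lower bound.

    Each fact's ELBO term is kept independently with probability q and
    the kept terms are reweighted by 1/q, so the expectation of the
    estimate equals the full sum over all facts. This is the standard
    justification for minibatch stochastic variational inference, here
    framed as Bernoulli sampling.
    """
    rng = np.random.default_rng(seed)
    terms = np.asarray(per_fact_elbo_terms)
    keep = rng.random(len(terms)) < q   # independent Bernoulli(q) flips
    return np.sum(terms[keep]) / q
```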

2 citations


Posted Content
TL;DR: In this paper, the authors proposed a probabilistic framework for neural variational inference over knowledge graph-based representations, which can reduce training time at a cost of an additional approximation to the variational lower bound.
Abstract: Recent advances in Neural Variational Inference allowed for a renaissance in latent variable models in a variety of domains involving high-dimensional data. While traditional variational methods derive an analytical approximation for the intractable distribution over the latent variables, here we construct an inference network conditioned on the symbolic representation of entities and relation types in the Knowledge Graph, to provide the variational distributions. The new framework results in a highly-scalable method. Under a Bernoulli sampling framework, we provide an alternative justification for commonly used techniques in large-scale stochastic variational inference, which drastically reduce training time at a cost of an additional approximation to the variational lower bound. We introduce two models from this highly scalable probabilistic framework, namely the Latent Information and Latent Fact models, for reasoning over knowledge graph-based representations. Our Latent Information and Latent Fact models improve upon baseline performance under certain conditions. We use the learnt embedding variance to estimate predictive uncertainty during link prediction, and discuss the quality of these learnt uncertainty estimates. Our source code and datasets are publicly available online at https://github.com/alexanderimanicowenrivers/Neural-Variational-Knowledge-Graphs.

1 citation


Patent
01 Feb 2019
TL;DR: In this paper, a unified coding visualization method for near-earth-space 3D point clouds is proposed: Hilbert coding maps three-dimensional data to a one-dimensional linear space for compatibility with mainstream big-data platforms, the display scope and hierarchy are determined automatically during visualization, and the queried data are uniformly downsampled at the server by Bernoulli sampling to ensure the continuity and proximity of the displayed data.
Abstract: The invention discloses a unified coding visualization method for near-earth-space three-dimensional point clouds. The method unifies codes using Hilbert coding, mapping three-dimensional data to a one-dimensional linear space and combining them with the main code value so as to be compatible with mainstream big-data platforms; the collected data are accessed and retrieved hierarchically; the scope and hierarchy are determined automatically when visualizing the three-dimensional point cloud; and the displayed data are uniformly sampled by Bernoulli sampling when the data are queried at the server, ensuring the continuity and proximity of the displayed data. The invention has the advantages that the data are Bernoulli-sampled after being queried from the big-data platform, saving bandwidth; the range and hierarchy of the query are automatically updated to follow the user's operations; and the burden on the rendering engine is reduced during visualization. The invention can automatically determine the range and hierarchy of the point cloud to be displayed according to the user's visualization requirements, and plays an important role in three-dimensional visualization exploration, investigation, and analysis.
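A hedged sketch of the server-side display-sampling step described above: choose the Bernoulli rate from a target display budget and thin the queried points before shipping them to the renderer (the Hilbert-coding and hierarchy logic are omitted, and all names are illustrative).

```python
import numpy as np

def thin_for_display(points, target_count, seed=None):
    """Server-side Bernoulli thinning of a queried point-cloud block.

    Each point is kept independently with probability
    p = target_count / len(points), which preserves spatial uniformity
    (every point has the same inclusion probability) while capping the
    expected number of points shipped to the rendering engine.
    """
    rng = np.random.default_rng(seed)
    p = min(1.0, target_count / max(len(points), 1))
    keep = rng.random(len(points)) < p
    return points[keep]

# e.g. points = np.ndarray of shape (N, 3) returned by a range query:
# display_block = thin_for_display(points, target_count=50_000)
```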