
Showing papers by "Vladimir Braverman published in 2010"


Patent
02 Apr 2010
TL;DR: In this patent, a method of effectively representing and processing data sets with time series is disclosed; it may comprise representing time series as a virtual part of data in a data store layer of a user system.
Abstract: A method of effectively representing and processing data sets with time series is disclosed. The method may comprise representing time series as a virtual part of data in a data store layer of a user system, thereby allowing processing of time-series related queries in said data store layer of said user system.

112 citations


Proceedings ArticleDOI
05 Jun 2010
TL;DR: This paper provides the first zero-one law in the streaming model for a wide class of functions, and shows a lower bound requiring more than polylogarithmic memory for computing an approximation to Σi∈[n] G(mi) by any one-pass streaming algorithm.
Abstract: Data streams emerged as a critical model for multiple applications that handle vast amounts of data. One of the most influential and celebrated papers in streaming is the "AMS" paper on computing frequency moments by Alon, Matias and Szegedy. The main question left open (and explicitly asked) by AMS in 1996 is to give a precise characterization of the functions G on frequency vectors mi (1 ≤ i ≤ n) for which Σi∈[n] G(mi) can be approximated efficiently, where "efficiently" means by a single pass over the data stream and poly-logarithmic memory. No such characterization was known despite a tremendous amount of research on frequency-based functions in the streaming literature. In this paper we finally resolve the AMS main question and give a precise characterization (in fact, a zero-one law) for all monotonically increasing functions on frequencies that are zero at the origin. That is, we consider all monotonic functions G: R → R such that G(0) = 0 and G can be computed in poly-logarithmic time and space, and ask: for which G in this class is there a (1±ε)-approximation algorithm for computing Σi∈[n] G(mi) for any polylogarithmic ε? We give an algebraic characterization for all such G so that: for all functions G in our class that satisfy our algebraic condition, we provide a very general and constructive way to derive an efficient (1±ε)-approximation algorithm for computing Σi∈[n] G(mi) with polylogarithmic memory and a single pass over the data stream; while for all functions G in our class that do not satisfy our algebraic characterization, we show a lower bound: more than polylog memory is required to compute an approximation to Σi∈[n] G(mi) by any one-pass streaming algorithm. Thus, we provide a zero-one law for all monotonically increasing functions G which are zero at the origin. Our results are quite general.
As just one illustrative example, our main theorem implies a lower bound for G(x) = (x(x-1))^0.5 arctan(x+1), while for the function G(x) = (x(x+1))^0.5 arctan(x+1) our main theorem automatically yields a polylog-memory one-pass (1±ε)-approximation algorithm for computing Σi∈[n] G(mi). For both of these examples no lower or upper bounds were known. Of course, these are just illustrative examples, and there are many others. One might argue that these two functions may not be of interest in practical applications -- we stress that our law works for all functions in this class, and the above examples illustrate the power of our method. To the best of our knowledge, this is the first zero-one law in the streaming model for a wide class of functions, though we suspect that there are many more such laws to be discovered. Surprisingly, our upper bound requires only 4-wise independence and does not need the stronger machinery of Nisan's pseudorandom generators, even though our class captures multiple functions that previously required Nisan's generators. Furthermore, we believe that our methods can be extended to more general models and complexity classes. For instance, the law also holds for a smaller class of non-decreasing and symmetric functions (i.e., G(x) = G(-x) and G(0) = 0) which, due to negative values, allow deletions.
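For intuition, the quantity the streaming algorithm approximates can be computed exactly offline. A minimal sketch, assuming a small hypothetical stream (the streaming algorithm achieves this with polylog memory and one pass; this brute-force version stores the whole frequency vector):

```python
import math
from collections import Counter

def sum_of_g(stream, G):
    """Exact (offline) value of sum_i G(m_i), where m_i are element frequencies."""
    freq = Counter(stream)               # frequency vector (m_1, ..., m_n)
    return sum(G(m) for m in freq.values())

# The two illustrative functions from the abstract:
# admits a polylog-memory one-pass (1±ε)-approximation
G_upper = lambda x: math.sqrt(x * (x + 1)) * math.atan(x + 1)
# provably requires more than polylog memory
G_lower = lambda x: math.sqrt(x * (x - 1)) * math.atan(x + 1)

stream = ["a", "a", "b", "b", "b", "c"]  # frequencies: a=2, b=3, c=1
```

Note that G(x) = x recovers the stream length and G(x) = x² recovers the second frequency moment F₂, both of which fall on the "easy" side of the law.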

88 citations


Posted Content
TL;DR: This paper provides a different yet simple approach to obtain a $O(\log(m)\log(nm)\cdot (\log\log n)^4\cdot n^{1-{2\over k}})$ algorithm for constant $\epsilon$ and shows that this algorithm requires only $4$-wise independence, in contrast to existing methods that use pseudo-random generators for computing large frequency moments.
Abstract: In a ground-breaking paper, Indyk and Woodruff (STOC 05) showed how to compute $F_k$ (for $k>2$) in space complexity $O(\mbox{\em poly-log}(n,m)\cdot n^{1-\frac2k})$, which is optimal up to (large) poly-logarithmic factors in $n$ and $m$, where $m$ is the length of the stream and $n$ is the upper bound on the number of distinct elements in a stream. The best known lower bound for large moments is $\Omega(\log(n)n^{1-\frac2k})$. A follow-up work of Bhuvanagiri, Ganguly, Kesh and Saha (SODA 2006) reduced the poly-logarithmic factors of Indyk and Woodruff to $O(\log^2(m)\cdot (\log n+ \log m)\cdot n^{1-{2\over k}})$. Further reduction of poly-log factors has been an elusive goal since 2006, when the Indyk and Woodruff method seemed to hit a natural "barrier". Using our simple recursive sketch, we provide a different yet simple approach to obtain an $O(\log(m)\log(nm)\cdot (\log\log n)^4\cdot n^{1-{2\over k}})$ algorithm for constant $\epsilon$ (our bound is, in fact, somewhat stronger: the $(\log\log n)$ term can be replaced by any constant number of $\log$ iterations instead of just two or three, thus approaching $\log^* n$). Our bound also works for non-constant $\epsilon$ (for details see the body of the paper). Further, our algorithm requires only $4$-wise independence, in contrast to existing methods that use pseudo-random generators for computing large frequency moments.

37 citations


Posted Content
TL;DR: The results are obtained by connecting the moments of an order-2 Rademacher chaos to the combinatorial properties of random Eulerian multigraphs; estimating the chance that a random multigraph is composed of a given number of node-disjoint Eulerian components leads to a new tail bound on the chaos.
Abstract: The celebrated dimension reduction lemma of Johnson and Lindenstrauss has numerous computational and other applications. Due to its application in practice, speeding up the computation of a Johnson-Lindenstrauss style dimension reduction is an important question. Recently, Dasgupta, Kumar, and Sarlos (STOC 2010) constructed such a transform that uses a sparse matrix. This is motivated by the desire to speed up the computation when applied to sparse input vectors, a scenario that comes up in applications. The sparsity of their construction was further improved by Kane and Nelson (ArXiv 2010). We improve the previous bound on the number of non-zero entries per column of Kane and Nelson from $O(1/\epsilon \log(1/\delta)\log(k/\delta))$ (where the target dimension is $k$, the distortion is $1\pm \epsilon$, and the failure probability is $\delta$) to $$ O\left({1\over\epsilon} \left({\log(1/\delta)\log\log\log(1/\delta) \over \log\log(1/\delta)}\right)^2\right). $$ We also improve the amount of randomness needed to generate the matrix. Our results are obtained by connecting the moments of an order 2 Rademacher chaos to the combinatorial properties of random Eulerian multigraphs. Estimating the chance that a random multigraph is composed of a given number of node-disjoint Eulerian components leads to a new tail bound on the chaos. Our estimates may be of independent interest, and as this part of the argument is decoupled from the analysis of the coefficients of the chaos, we believe that our methods can be useful in the analysis of other chaoses.
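The construction at issue keeps only $s$ non-zero entries per column of the embedding matrix. A toy version is sketched below, assuming random positions and $\pm 1/\sqrt{s}$ values; the dimensions and $s$ here are illustrative, not the paper's optimized parameters:

```python
import math
import random

def sparse_jl(d, k, s, rng):
    """k x d embedding matrix stored column-wise as {row: value} dicts;
    each column has exactly s non-zeros of magnitude 1/sqrt(s)."""
    cols = []
    for _ in range(d):
        rows = rng.sample(range(k), s)          # s distinct rows per column
        cols.append({r: rng.choice((-1.0, 1.0)) / math.sqrt(s) for r in rows})
    return cols

def embed(cols, k, x):
    """Multiply the sparse matrix by x; cost is s updates per non-zero of x,
    which is the speedup for sparse inputs the abstract motivates."""
    y = [0.0] * k
    for j, v in enumerate(x):
        if v:
            for r, sgn in cols[j].items():
                y[r] += sgn * v
    return y
```

Each basis vector maps to a vector of norm exactly 1 (its column has $s$ entries of squared value $1/s$); the JL guarantee is that general vectors have their norms preserved up to $1\pm\epsilon$ with probability $1-\delta$.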

31 citations


Proceedings ArticleDOI
05 Jun 2010
TL;DR: In this article, the authors present an algorithm that computes an (ε, δ)-approximation of the statistical distance between the joint and product distributions defined by a stream of k-tuples, using O((1/ε · log(nm/δ))^((30+k)^k)) memory and a single pass over the data stream.
Abstract: Approximating pairwise, or k-wise, independence with sublinear memory is of considerable importance in the data stream model. In the streaming model the joint distribution is given by a stream of k-tuples, with the goal of testing correlations among the components measured over the entire stream. Indyk and McGregor (SODA 08) recently gave exciting new results for measuring pairwise independence in this model. Statistical distance is one of the most fundamental metrics for measuring the similarity of two distributions, and it has been a metric of choice in many papers that discuss distribution closeness. For pairwise independence, the Indyk and McGregor methods provide a log n-approximation under the statistical distance between the joint and product distributions in the streaming model. Indyk and McGregor leave, as their main open question, the problem of improving their log n-approximation for the statistical distance metric. In this paper we solve the main open problem posed by Indyk and McGregor for the statistical distance for pairwise independence and extend this result to any constant k. In particular, we present an algorithm that computes an (ε, δ)-approximation of the statistical distance between the joint and product distributions defined by a stream of k-tuples. Our algorithm requires O((1/ε · log(nm/δ))^((30+k)^k)) memory and a single pass over the data stream.
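The quantity being approximated can be stated concretely with a brute-force reference computation, a minimal sketch for small supports (it needs memory linear in the product of the marginal supports, which is exactly what the streaming algorithm avoids):

```python
import itertools
from collections import Counter

def statistical_distance(stream):
    """Total variation distance between the empirical joint distribution of a
    stream of k-tuples and the product of its empirical marginals."""
    m = len(stream)
    k = len(stream[0])
    joint = Counter(stream)
    marginals = [Counter(t[c] for t in stream) for c in range(k)]
    total = 0.0
    # Sum |joint - product| over the full product of marginal supports.
    for combo in itertools.product(*(sorted(marg) for marg in marginals)):
        p_joint = joint.get(combo, 0) / m
        p_prod = 1.0
        for c, v in enumerate(combo):
            p_prod *= marginals[c][v] / m
        total += abs(p_joint - p_prod)
    return total / 2.0
```

The distance is 0 exactly when the components are empirically independent, which is why it serves as the correlation measure here.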

28 citations


Journal ArticleDOI
TL;DR: This paper presents a novel smooth histogram method that is more general and achieves stronger bounds than the exponential histogram, and provides the first approximation algorithms for the following functions: $L_p$ norms, frequency moments, the length of the longest increasing subsequence, and the geometric mean.
Abstract: In the streaming model, elements arrive sequentially and can be observed only once. Maintaining statistics and aggregates is an important and nontrivial task in this model. These tasks become even more challenging in the sliding windows model, where statistics must be maintained only over the most recent $n$ elements. In their pioneering paper, Datar et al. [SIAM J. Comput., 31 (2002), pp. 1794-1813] presented the exponential histogram, an effective method for estimating statistics on sliding windows. In this paper we present a novel smooth histogram method that is more general and achieves stronger bounds than the exponential histogram. In particular, the smooth histogram method improves the approximation error rate obtained via exponential histograms. Furthermore, the smooth histogram method not only captures and improves multiple previous results on sliding windows but also extends the class of functions that can be approximated on sliding windows. In particular, we provide the first approximation algorithms for the following functions: $L_p$ norms, frequency moments, the length of the longest increasing subsequence, and the geometric mean.
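The smooth-histogram idea can be sketched for the simplest smooth function, the sum of non-negative values over a sliding window. This toy version (class and parameter names are ours) stores exact suffix sums; the full framework composes the same checkpoint scheme with a streaming estimator for each suffix:

```python
class SmoothHistogramSum:
    """Keeps a short list of suffix-sum checkpoints; a checkpoint is dropped as
    soon as its neighbors agree to within a (1-eps) factor, so only about
    O(log(total)/eps) checkpoints survive."""
    def __init__(self, eps, window):
        self.eps = eps
        self.window = window
        self.t = 0
        self.buckets = []  # [start_time, sum of elements from start_time to now]

    def add(self, value):
        self.t += 1
        for b in self.buckets:
            b[1] += value
        self.buckets.append([self.t, value])
        # Merge: drop a checkpoint whose neighbors already agree to (1-eps).
        i = 0
        while i + 2 < len(self.buckets):
            if self.buckets[i + 2][1] >= (1.0 - self.eps) * self.buckets[i][1]:
                del self.buckets[i + 1]
            else:
                i += 1
        # Expire checkpoints that fell out of the window, keeping one straddler.
        while len(self.buckets) >= 2 and self.buckets[1][0] <= self.t - self.window:
            self.buckets.pop(0)

    def estimate(self):
        return self.buckets[0][1] if self.buckets else 0.0
```

The first surviving checkpoint may start slightly before the window, but the merge rule ensures its sum is within a $(1\pm\epsilon)$-type factor of the true window sum; that "nearby suffixes have nearby values" property is exactly what the paper calls smoothness.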

25 citations


Proceedings ArticleDOI
04 Mar 2010
TL;DR: This work extends that of Indyk and McGregor, who showed the result for $k = 2$, and gives a randomized algorithm that is a $(1\pm \epsilon)$-approximation requiring space logarithmic in $n$ and $m$ and proportional to $3^k$.
Abstract: In their seminal work, Alon, Matias, and Szegedy introduced several sketching techniques, including showing that $4$-wise independence is sufficient to obtain good approximations of the second frequency moment. In this work, we show that their sketching technique can be extended to product domains $[n]^k$ by using the product of $4$-wise independent functions on $[n]$. Our work extends that of Indyk and McGregor, who showed the result for $k = 2$. Their primary motivation was the problem of identifying correlations in data streams. In their model, a stream of pairs $(i,j) \in [n]^2$ arrive, giving a joint distribution $(X,Y)$, and they find approximation algorithms for how close the joint distribution is to the product of the marginal distributions under various metrics, which naturally corresponds to how close $X$ and $Y$ are to being independent. By using our technique, we obtain a new result for the problem of approximating the $\ell_2$ distance between the joint distribution and the product of the marginal distributions for $k$-ary vectors, instead of just pairs, in a single pass. Our analysis gives a randomized algorithm that is a $(1\pm \epsilon)$ approximation (with probability $1-\delta$) that requires space logarithmic in $n$ and $m$ and proportional to $3^k$.
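The product-domain sketch can be illustrated directly: draw one sign function per coordinate, sum the product of signs over the stream, and square. For brevity this toy (function name ours) uses fully random sign tables; the paper's point is that 4-wise independent signs per coordinate suffice, at the cost of the $3^k$ variance factor:

```python
import random

def f2_product_sketch(stream, n, k, reps, rng):
    """Estimate F2 = sum_t freq(t)^2 for a stream of k-tuples over [n]^k.
    Each repetition: Z = sum over the stream of prod_c signs[c][t_c];
    E[Z^2] = F2 when the signs are (at least) 4-wise independent."""
    estimates = []
    for _ in range(reps):
        signs = [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(k)]
        z = 0
        for t in stream:
            prod = 1
            for c in range(k):
                prod *= signs[c][t[c]]  # product of per-coordinate signs
            z += prod
        estimates.append(z * z)
    return sum(estimates) / reps  # in practice, a median of means of z^2
```

As a sanity check, a stream repeating a single tuple $m$ times has $F_2 = m^2$, and every repetition returns exactly $m^2$ since the fixed product of signs squares to 1.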

18 citations