
Showing papers by "Vladimir Braverman published in 2010"


Patent
02 Apr 2010
TL;DR: In this patent, a method of effectively representing and processing data sets with time series is disclosed; it may comprise representing time series as a virtual part of data in a data store layer of a user system.
Abstract: A method of effectively representing and processing data sets with time series is disclosed. The method may comprise representing time series as a virtual part of data in a data store layer of a user system, thereby allowing processing of time-series related queries in said data store layer of said user system.

112 citations


Proceedings ArticleDOI
05 Jun 2010
TL;DR: This paper provides the first zero-one law in the streaming model for a wide class of functions, and shows a lower bound requiring more than polylogarithmic memory for computing an approximation to Σi∈[n] G(mi) by any one-pass streaming algorithm.
Abstract: Data streams emerged as a critical model for multiple applications that handle vast amounts of data. One of the most influential and celebrated papers in streaming is the "AMS" paper on computing frequency moments by Alon, Matias and Szegedy. The main question left open (and explicitly asked) by AMS in 1996 is to give a precise characterization of the functions G on frequency vectors mi (1 ≤ i ≤ n) for which Σi∈[n] G(mi) can be approximated efficiently, where "efficiently" means by a single pass over the data stream and poly-logarithmic memory. No such characterization was known despite a tremendous amount of research on frequency-based functions in the streaming literature. In this paper we finally resolve the AMS main question and give a precise characterization (in fact, a zero-one law) for all monotonically increasing functions on frequencies that are zero at the origin. That is, we consider all monotonic functions G: R → R such that G(0) = 0 and G can be computed in poly-logarithmic time and space, and ask: for which G in this class is there a (1±ε)-approximation algorithm for computing Σi∈[n] G(mi) for any polylogarithmic ε? We give an algebraic characterization for all such G so that: for all functions G in our class that satisfy our algebraic condition, we provide a very general and constructive way to derive an efficient (1±ε)-approximation algorithm for computing Σi∈[n] G(mi) with polylogarithmic memory and a single pass over the data stream; while for all functions G in our class that do not satisfy our algebraic characterization, we show a lower bound: more than polylog memory is required to compute an approximation to Σi∈[n] G(mi) by any one-pass streaming algorithm. Thus, we provide a zero-one law for all monotonically increasing functions G which are zero at the origin. Our results are quite general.
As just one illustrative example, our main theorem implies a lower bound for G(x) = (x(x-1))^0.5 arctan(x+1), while for the function G(x) = (x(x+1))^0.5 arctan(x+1) our main theorem automatically yields a polylog-memory one-pass (1±ε)-approximation algorithm for computing Σi∈[n] G(mi). For both of these examples no lower or upper bounds were known. Of course, these are just illustrative examples, and there are many others. One might argue that these two functions may not be of interest in practical applications -- we stress that our law works for all functions in this class, and the above examples illustrate the power of our method. To the best of our knowledge, this is the first zero-one law in the streaming model for a wide class of functions, though we suspect that there are many more such laws to be discovered. Surprisingly, our upper bound requires only 4-wise independence and does not need the stronger machinery of Nisan's pseudorandom generators, even though our class captures multiple functions that previously required Nisan's generators. Furthermore, we believe that our methods can be extended to more general models and complexity classes. For instance, the law also holds for a smaller class of non-decreasing and symmetric functions (i.e., G(x) = G(-x) and G(0) = 0) which, due to negative values, allow deletions.
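For intuition, the quantity the streaming algorithm approximates can be computed exactly offline. A minimal sketch, assuming a small hypothetical stream (the streaming algorithm achieves this with polylog memory and one pass; this brute-force version stores the whole frequency vector):

```python
import math
from collections import Counter

def sum_of_g(stream, G):
    """Exact (offline) value of sum_i G(m_i), where m_i are element frequencies."""
    freq = Counter(stream)               # frequency vector (m_1, ..., m_n)
    return sum(G(m) for m in freq.values())

# The two illustrative functions from the abstract:
# admits a polylog-memory one-pass (1±ε)-approximation
G_upper = lambda x: math.sqrt(x * (x + 1)) * math.atan(x + 1)
# provably requires more than polylog memory
G_lower = lambda x: math.sqrt(x * (x - 1)) * math.atan(x + 1)

stream = ["a", "a", "b", "b", "b", "c"]  # frequencies: a=2, b=3, c=1
```

Note that G(x) = x recovers the stream length and G(x) = x² recovers the second frequency moment F₂, both of which fall on the "easy" side of the law.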

88 citations


Posted Content
TL;DR: This paper provides a different yet simple approach to obtain a $O(\log(m)\log(nm)\cdot (\log\log n)^4\cdot n^{1-{2\over k}})$ algorithm for constant $\epsilon$ and shows that this algorithm requires only $4$-wise independence, in contrast to existing methods that use pseudo-random generators for computing large frequency moments.
Abstract: In a ground-breaking paper, Indyk and Woodruff (STOC 05) showed how to compute $F_k$ (for $k>2$) in space complexity $O(\mbox{\em poly-log}(n,m)\cdot n^{1-\frac2k})$, which is optimal up to (large) poly-logarithmic factors in $n$ and $m$, where $m$ is the length of the stream and $n$ is the upper bound on the number of distinct elements in a stream. The best known lower bound for large moments is $\Omega(\log(n)n^{1-\frac2k})$. A follow-up work of Bhuvanagiri, Ganguly, Kesh and Saha (SODA 2006) reduced the poly-logarithmic factors of Indyk and Woodruff to $O(\log^2(m)\cdot (\log n+ \log m)\cdot n^{1-{2\over k}})$. Further reduction of poly-log factors has been an elusive goal since 2006, when the Indyk and Woodruff method seemed to hit a natural "barrier". Using our simple recursive sketch, we provide a different yet simple approach to obtain an $O(\log(m)\log(nm)\cdot (\log\log n)^4\cdot n^{1-{2\over k}})$ algorithm for constant $\epsilon$ (our bound is, in fact, somewhat stronger: the $(\log\log n)$ term can be replaced by any constant number of $\log$ iterations instead of just two or three, thus approaching $\log^* n$). Our bound also works for non-constant $\epsilon$ (for details see the body of the paper). Further, our algorithm requires only $4$-wise independence, in contrast to existing methods that use pseudo-random generators for computing large frequency moments.

37 citations


Posted Content
TL;DR: The results are obtained by connecting the moments of an order-2 Rademacher chaos to the combinatorial properties of random Eulerian multigraphs; estimating the chance that a random multigraph is composed of a given number of node-disjoint Eulerian components leads to a new tail bound on the chaos.
Abstract: The celebrated dimension reduction lemma of Johnson and Lindenstrauss has numerous computational and other applications. Due to its application in practice, speeding up the computation of a Johnson-Lindenstrauss style dimension reduction is an important question. Recently, Dasgupta, Kumar, and Sarlos (STOC 2010) constructed such a transform that uses a sparse matrix. This is motivated by the desire to speed up the computation when applied to sparse input vectors, a scenario that comes up in applications. The sparsity of their construction was further improved by Kane and Nelson (ArXiv 2010). We improve the previous bound on the number of non-zero entries per column of Kane and Nelson from $O(1/\epsilon \log(1/\delta)\log(k/\delta))$ (where the target dimension is $k$, the distortion is $1\pm \epsilon$, and the failure probability is $\delta$) to $$ O\left({1\over\epsilon} \left({\log(1/\delta)\log\log\log(1/\delta) \over \log\log(1/\delta)}\right)^2\right). $$ We also improve the amount of randomness needed to generate the matrix. Our results are obtained by connecting the moments of an order 2 Rademacher chaos to the combinatorial properties of random Eulerian multigraphs. Estimating the chance that a random multigraph is composed of a given number of node-disjoint Eulerian components leads to a new tail bound on the chaos. Our estimates may be of independent interest, and as this part of the argument is decoupled from the analysis of the coefficients of the chaos, we believe that our methods can be useful in the analysis of other chaoses.
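The construction at issue keeps only $s$ non-zero entries per column of the embedding matrix. A toy version is sketched below, assuming random positions and $\pm 1/\sqrt{s}$ values; the dimensions and $s$ here are illustrative, not the paper's optimized parameters:

```python
import math
import random

def sparse_jl(d, k, s, rng):
    """k x d embedding matrix stored column-wise as {row: value} dicts;
    each column has exactly s non-zeros of magnitude 1/sqrt(s)."""
    cols = []
    for _ in range(d):
        rows = rng.sample(range(k), s)          # s distinct rows per column
        cols.append({r: rng.choice((-1.0, 1.0)) / math.sqrt(s) for r in rows})
    return cols

def embed(cols, k, x):
    """Multiply the sparse matrix by x; cost is s updates per non-zero of x,
    which is the speedup for sparse inputs the abstract motivates."""
    y = [0.0] * k
    for j, v in enumerate(x):
        if v:
            for r, sgn in cols[j].items():
                y[r] += sgn * v
    return y
```

Each basis vector maps to a vector of norm exactly 1 (its column has $s$ entries of squared value $1/s$); the JL guarantee is that general vectors have their norms preserved up to $1\pm\epsilon$ with probability $1-\delta$.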

31 citations


Proceedings ArticleDOI
05 Jun 2010
TL;DR: In this article, the authors present an algorithm that computes an (ε, δ)-approximation of the statistical distance between the joint and product distributions defined by a stream of k-tuples, using O((1/ε · log(nm/δ))^((30+k)^k)) memory and a single pass over the data stream.
Abstract: Approximating pairwise, or k-wise, independence with sublinear memory is of considerable importance in the data stream model. In the streaming model the joint distribution is given by a stream of k-tuples, with the goal of testing correlations among the components measured over the entire stream. Indyk and McGregor (SODA 08) recently gave exciting new results for measuring pairwise independence in this model. Statistical distance is one of the most fundamental metrics for measuring the similarity of two distributions, and it has been a metric of choice in many papers that discuss distribution closeness. For pairwise independence, the Indyk and McGregor methods provide a log n-approximation under the statistical distance between the joint and product distributions in the streaming model. Indyk and McGregor leave, as their main open question, the problem of improving their log n-approximation for the statistical distance metric. In this paper we solve the main open problem posed by Indyk and McGregor for the statistical distance for pairwise independence and extend this result to any constant k. In particular, we present an algorithm that computes an (ε, δ)-approximation of the statistical distance between the joint and product distributions defined by a stream of k-tuples. Our algorithm requires O((1/ε · log(nm/δ))^((30+k)^k)) memory and a single pass over the data stream.
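The quantity being approximated can be stated concretely with a brute-force reference computation, a minimal sketch for small supports (it needs memory linear in the product of the marginal supports, which is exactly what the streaming algorithm avoids):

```python
import itertools
from collections import Counter

def statistical_distance(stream):
    """Total variation distance between the empirical joint distribution of a
    stream of k-tuples and the product of its empirical marginals."""
    m = len(stream)
    k = len(stream[0])
    joint = Counter(stream)
    marginals = [Counter(t[c] for t in stream) for c in range(k)]
    total = 0.0
    # Sum |joint - product| over the full product of marginal supports.
    for combo in itertools.product(*(sorted(marg) for marg in marginals)):
        p_joint = joint.get(combo, 0) / m
        p_prod = 1.0
        for c, v in enumerate(combo):
            p_prod *= marginals[c][v] / m
        total += abs(p_joint - p_prod)
    return total / 2.0
```

The distance is 0 exactly when the components are empirically independent, which is why it serves as the correlation measure here.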

28 citations


Journal ArticleDOI
TL;DR: This paper presents a novel smooth histogram method that is more general and achieves stronger bounds than the exponential histogram, and provides the first approximation algorithms for the following functions: $L_p$ norms, frequency moments, the length of the longest increasing subsequence, and the geometric mean.
Abstract: In the streaming model, elements arrive sequentially and can be observed only once. Maintaining statistics and aggregates is an important and nontrivial task in this model. These tasks become even more challenging in the sliding windows model, where statistics must be maintained only over the most recent $n$ elements. In their pioneering paper, Datar et al. [SIAM J. Comput., 31 (2002), pp. 1794-1813] presented the exponential histogram, an effective method for estimating statistics on sliding windows. In this paper we present a novel smooth histogram method that is more general and achieves stronger bounds than the exponential histogram. In particular, the smooth histogram method improves the approximation error rate obtained via exponential histograms. Furthermore, the smooth histogram method not only captures and improves multiple previous results on sliding windows but also extends the class of functions that can be approximated on sliding windows. In particular, we provide the first approximation algorithms for the following functions: $L_p$ norms, frequency moments, the length of the longest increasing subsequence, and the geometric mean.
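The smooth-histogram idea can be sketched for the simplest smooth function, the sum of non-negative values over a sliding window. This toy version (class and parameter names are ours) stores exact suffix sums; the full framework composes the same checkpoint scheme with a streaming estimator for each suffix:

```python
class SmoothHistogramSum:
    """Keeps a short list of suffix-sum checkpoints; a checkpoint is dropped as
    soon as its neighbors agree to within a (1-eps) factor, so only about
    O(log(total)/eps) checkpoints survive."""
    def __init__(self, eps, window):
        self.eps = eps
        self.window = window
        self.t = 0
        self.buckets = []  # [start_time, sum of elements from start_time to now]

    def add(self, value):
        self.t += 1
        for b in self.buckets:
            b[1] += value
        self.buckets.append([self.t, value])
        # Merge: drop a checkpoint whose neighbors already agree to (1-eps).
        i = 0
        while i + 2 < len(self.buckets):
            if self.buckets[i + 2][1] >= (1.0 - self.eps) * self.buckets[i][1]:
                del self.buckets[i + 1]
            else:
                i += 1
        # Expire checkpoints that fell out of the window, keeping one straddler.
        while len(self.buckets) >= 2 and self.buckets[1][0] <= self.t - self.window:
            self.buckets.pop(0)

    def estimate(self):
        return self.buckets[0][1] if self.buckets else 0.0
```

The first surviving checkpoint may start slightly before the window, but the merge rule ensures its sum is within a $(1\pm\epsilon)$-type factor of the true window sum; that "nearby suffixes have nearby values" property is exactly what the paper calls smoothness.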

25 citations


Proceedings ArticleDOI
04 Mar 2010
TL;DR: This work extends that of Indyk and McGregor, who showed the result for $k = 2$, and gives a randomized algorithm that is a $(1\pm \epsilon)$-approximation requiring space logarithmic in $n$ and $m$ and proportional to $3^k$.
Abstract: In their seminal work, Alon, Matias, and Szegedy introduced several sketching techniques, including showing that $4$-wise independence is sufficient to obtain good approximations of the second frequency moment. In this work, we show that their sketching technique can be extended to product domains $[n]^k$ by using the product of $4$-wise independent functions on $[n]$. Our work extends that of Indyk and McGregor, who showed the result for $k = 2$. Their primary motivation was the problem of identifying correlations in data streams. In their model, a stream of pairs $(i,j) \in [n]^2$ arrive, giving a joint distribution $(X,Y)$, and they find approximation algorithms for how close the joint distribution is to the product of the marginal distributions under various metrics, which naturally corresponds to how close $X$ and $Y$ are to being independent. By using our technique, we obtain a new result for the problem of approximating the $\ell_2$ distance between the joint distribution and the product of the marginal distributions for $k$-ary vectors, instead of just pairs, in a single pass. Our analysis gives a randomized algorithm that is a $(1\pm \epsilon)$ approximation (with probability $1-\delta$) that requires space logarithmic in $n$ and $m$ and proportional to $3^k$.
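The product-domain sketch can be illustrated directly: draw one sign function per coordinate, sum the product of signs over the stream, and square. For brevity this toy (function name ours) uses fully random sign tables; the paper's point is that 4-wise independent signs per coordinate suffice, at the cost of the $3^k$ variance factor:

```python
import random

def f2_product_sketch(stream, n, k, reps, rng):
    """Estimate F2 = sum_t freq(t)^2 for a stream of k-tuples over [n]^k.
    Each repetition: Z = sum over the stream of prod_c signs[c][t_c];
    E[Z^2] = F2 when the signs are (at least) 4-wise independent."""
    estimates = []
    for _ in range(reps):
        signs = [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(k)]
        z = 0
        for t in stream:
            prod = 1
            for c in range(k):
                prod *= signs[c][t[c]]  # product of per-coordinate signs
            z += prod
        estimates.append(z * z)
    return sum(estimates) / reps  # in practice, a median of means of z^2
```

As a sanity check, a stream repeating a single tuple $m$ times has $F_2 = m^2$, and every repetition returns exactly $m^2$ since the fixed product of signs squares to 1.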

18 citations