About: Coreset is a research topic. Over its lifetime, 570 publications have been published within this topic, receiving 8139 citations.
13 Jun 2004
Abstract: In this paper, we show the existence of small coresets for the problems of computing k-median and k-means clustering for points in low dimension. In other words, we show that given a point set P in ℝ^d, one can compute a weighted set S ⊆ P, of size O(k ε^{-d} log n), such that one can compute the k-median/means clustering on S instead of on P, and get a (1+ε)-approximation. As a result, we improve the fastest known algorithms for (1+ε)-approximate k-means and k-median. Our algorithms have linear running time for a fixed k and ε. In addition, we can maintain the (1+ε)-approximate k-median or k-means clustering of a stream when points are only being inserted, using polylogarithmic space and update time.
Topics: Correlation clustering (60%), k-medians clustering (59%), Cluster analysis (56%)
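The weighted-subset idea in the abstract above can be illustrated with a much simpler construction than the one in the paper. The sketch below is an assumption-laden toy, not the authors' algorithm: it uses a single uniform grid (the names `grid_coreset`, `kmeans_cost`, and the `cell` parameter are hypothetical), keeps one input point per occupied cell, and weights it by the number of points in that cell. It then compares the k-means cost of a fixed set of centers on the full set P and on the weighted set S.

```python
import math
import random
from collections import defaultdict

def grid_coreset(points, cell):
    """Keep one input point per occupied grid cell, weighted by the cell's point count."""
    cells = defaultdict(list)
    for p in points:
        key = tuple(math.floor(c / cell) for c in p)
        cells[key].append(p)
    # Each representative is an actual input point, so S is a subset of P.
    return [(members[0], len(members)) for members in cells.values()]

def kmeans_cost(weighted_points, centers):
    """Weighted sum of squared distances from each point to its nearest center."""
    return sum(
        w * min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
        for p, w in weighted_points
    )

# Toy data (an assumption for illustration): three Gaussian blobs in the plane.
random.seed(0)
blob_centers = [(0, 0), (10, 0), (0, 10)]
P = [(random.gauss(cx, 0.5), random.gauss(cy, 0.5))
     for cx, cy in blob_centers for _ in range(2000)]

S = grid_coreset(P, cell=0.25)
full_cost = kmeans_cost([(p, 1) for p in P], blob_centers)
small_cost = kmeans_cost(S, blob_centers)
print(len(P), len(S), small_cost / full_cost)
```

A single uniform grid gives no (1+ε) guarantee in general; the size bound quoted in the abstract comes from the paper's own construction, which this sketch does not reproduce.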
Kenneth L. Clarkson
03 Sep 2010-ACM Transactions on Algorithms
Abstract: The problem of maximizing a concave function f(x) in the unit simplex Δ can be solved approximately by a simple greedy algorithm. For given k, the algorithm can find a point x_(k) on a k-dimensional face of Δ, such that f(x_(k)) ≥ f(x_*) − O(1/k). Here f(x_*) is the maximum value of f in Δ, and the constant factor depends on f. This algorithm and analysis were known before, and related to problems of statistics and machine learning, such as boosting, regression, and density mixture estimation. In other work, coming from computational geometry, the existence of ϵ-coresets was shown for the minimum enclosing ball problem by means of a simple greedy algorithm. Similar greedy algorithms, which are special cases of the Frank-Wolfe algorithm, were described for other enclosure problems. Here these results are tied together, stronger convergence results are reviewed, and several coreset bounds are generalized or strengthened.
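The link between the greedy simplex algorithm and minimum enclosing ball (MEB) coresets can be sketched in a few lines. The code below is a minimal illustration, not code from the paper: it assumes the standard MEB dual objective f(x) = Σ x_i |p_i|² − |Σ x_i p_i|² and a step size of 1/(k+1), which is one common choice among several. The function name `frank_wolfe_meb` is hypothetical. The indices with nonzero weight play the role of the coreset.

```python
import math
import random

def frank_wolfe_meb(points, steps):
    """Greedy Frank-Wolfe sketch for the minimum enclosing ball dual.

    With c = sum_i x_i * p_i, the i-th gradient coordinate of the dual
    objective equals |p_i - c|^2 - |c|^2, so the best simplex vertex is
    the input point farthest from the current center c.
    """
    n = len(points)
    x = [0.0] * n
    x[0] = 1.0                        # start at a vertex of the simplex
    center = list(points[0])
    for k in range(1, steps + 1):
        far = max(range(n), key=lambda i: math.dist(points[i], center))
        step = 1.0 / (k + 1)          # assumed step size (one common choice)
        x = [(1.0 - step) * xi for xi in x]
        x[far] += step
        center = [(1.0 - step) * c + step * p
                  for c, p in zip(center, points[far])]
    support = [i for i, xi in enumerate(x) if xi > 0.0]
    return center, support

# Toy input (an assumption): square corners plus random interior points,
# so the exact ball has center (0, 0) and radius sqrt(2).
random.seed(1)
points = [(1.0, 1.0), (1.0, -1.0), (-1.0, 1.0), (-1.0, -1.0)]
points += [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(200)]

center, support = frank_wolfe_meb(points, steps=400)
radius = max(math.dist(p, center) for p in points)
print(center, radius, len(support))
```

After k steps at most k+1 coordinates of x are nonzero, which matches the abstract's statement that the iterate lies on a k-dimensional face of the simplex.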
01 Jan 2007
Abstract: The paradigm of coresets has recently emerged as a powerful tool for efficiently approximating various extent measures of a point set P. Using this paradigm, one quickly computes a small subset Q of P, called a coreset, that approximates the original set P, and then solves the problem on Q using a relatively inefficient algorithm. The solution for Q is then translated to an approximate solution to the original point set P. This paper describes the ways in which this paradigm has been successfully applied to various optimization and extent measure problems.
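The two-stage paradigm described above (build a small Q quickly, then run a slow algorithm on Q) can be sketched for one extent measure, the diameter of a planar point set. This is a simplified illustration and not a construction taken from the survey: it assumes 2D input, keeps the extreme point in each of m evenly spaced directions, and the names `directional_coreset` and `diameter` are hypothetical. For even m, the diameter of Q is at least cos(π/m) times the diameter of P, because some sampled direction lies within angle π/m of the true diameter direction.

```python
import math
import random
from itertools import combinations

def directional_coreset(points, m):
    """Keep, for each of m evenly spaced directions, the point extreme in it."""
    chosen = set()
    for i in range(m):
        angle = 2.0 * math.pi * i / m
        ux, uy = math.cos(angle), math.sin(angle)
        chosen.add(max(points, key=lambda p: p[0] * ux + p[1] * uy))
    return list(chosen)

def diameter(points):
    """Brute-force diameter: quadratic time, so best reserved for small sets."""
    return max(math.dist(p, q) for p, q in combinations(points, 2))

# Toy input (an assumption): random points in a 4-by-1 rectangle.
random.seed(2)
P = [(random.uniform(0, 4), random.uniform(0, 1)) for _ in range(500)]

m = 16                                # even, so opposite directions are both sampled
Q = directional_coreset(P, m)
print(len(Q), diameter(Q), diameter(P))
```

Here the linear-time pass builds Q with at most m points, and the quadratic-time diameter routine then runs on Q instead of P.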
12 Jul 2018-arXiv: Data Structures and Algorithms
Abstract: We develop and analyze a method to reduce the size of a very large set of data points in a high-dimensional Euclidean space ℝ^d to a small set of weighted points such that the result of a predetermined data analysis task on the reduced set is approximately the same as that for the original point set. For example, computing the first k principal components of the reduced set will return approximately the first k principal components of the original set, or computing the centers of a k-means clustering on the reduced set will return an approximation for the original set. Such a reduced set is also known as a coreset. The main new feature of our construction is that the cardinality of the reduced set is independent of the dimension d of the input space and that the sets are mergeable. The latter property means that the union of two reduced sets is a reduced set for the union of the two original sets (this property has recently also been called composability, see Indyk et al., PODS 2014). It allows us to turn our methods into streaming or distributed algorithms using standard approaches. For problems such as k-means and subspace approximation, the coreset sizes are also independent of the number of input points. Our method is based on projecting the points onto a low-dimensional subspace and reducing the cardinality of the points inside this subspace using known methods. The proposed approach works for a wide range of data analysis techniques including k-means clustering, principal component analysis and subspace clustering. The main conceptual contribution is a new coreset definition that allows us to charge costs that appear for every solution to an additive constant.
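Mergeability and the additive constant can both be seen in the simplest special case, k-means with k = 1, where a one-point summary is exact. The sketch below is only that special case, not the paper's construction, and the names `summarize`, `merge`, `reduce_summary`, and `cost` are hypothetical. It relies on the identity Σ |p − c|² = n |mean − c|² + Σ |p − mean|², whose last term does not depend on the center c and is stored as the constant.

```python
import random

def summarize(points):
    """Exact 1-mean summary: ([(mean, count)], additive constant)."""
    n, d = len(points), len(points[0])
    mean = tuple(sum(p[j] for p in points) / n for j in range(d))
    const = sum(sum((p[j] - mean[j]) ** 2 for j in range(d)) for p in points)
    return [(mean, n)], const

def merge(a, b):
    """The union of two summaries is a summary of the union of their inputs."""
    return a[0] + b[0], a[1] + b[1]

def reduce_summary(summary):
    """Collapse a merged summary back to one weighted point plus a constant."""
    weighted, const = summary
    total = sum(w for _, w in weighted)
    d = len(weighted[0][0])
    mean = tuple(sum(w * p[j] for p, w in weighted) / total for j in range(d))
    extra = sum(w * sum((p[j] - mean[j]) ** 2 for j in range(d))
                for p, w in weighted)
    return [(mean, total)], const + extra

def cost(summary, center):
    """Sum of squared distances to `center`, evaluated on the summary only."""
    weighted, const = summary
    return const + sum(w * sum((x - c) ** 2 for x, c in zip(p, center))
                       for p, w in weighted)

# Toy data (an assumption): two chunks of a stream in 5 dimensions.
random.seed(3)
A = [tuple(random.gauss(0, 1) for _ in range(5)) for _ in range(300)]
B = [tuple(random.gauss(2, 1) for _ in range(5)) for _ in range(200)]
query = (1.0, 0.0, -1.0, 0.5, 2.0)

merged = merge(summarize(A), summarize(B))
reduced = reduce_summary(merged)
exact = sum(sum((x - c) ** 2 for x, c in zip(p, query)) for p in A + B)
print(exact, cost(merged, query), cost(reduced, query))
```

Alternating `merge` and `reduce_summary` over chunks is the usual merge-and-reduce pattern for streaming; for k > 1 the summaries are approximate rather than exact, and the size bounds are those stated in the abstract.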
06 Jan 2013
Abstract: ... ℝ^d can be approximated up to a (1 + ε)-factor, for an arbitrarily small ε > 0, using the O(k/ε²)-rank approximation of A and a constant. This implies, for example, that the optimal k-means clustering of the rows of A is (1 + ε)-approximated by an optimal k-means clustering of their projection on the O(k/ε²) first right singular vectors (principal components) of A. A (j, k)-coreset for projective clustering is a small set of points that yields a (1 + ε)-approximation to the sum of squared distances from the n rows of A to any set of k affine subspaces, each of dimension at most j. Our embedding yields (0, k)-coresets of size O(k) for handling k-means queries, (j, 1)-coresets of size O(j) for PCA queries, and (j, k)-coresets of size (log n)^{O(jk)} for any j, k ≥ 1 and constant ε ∈ (0, 1/2). Previous coresets usually have a size which is linearly or even exponentially dependent on d, which makes them useless when d ~ n. Using our coresets with the merge-and-reduce approach, we obtain embarrassingly parallel streaming algorithms for problems such as k-means, PCA and projective clustering. These algorithms use update time per point and memory that is polynomial in log n and only linear in d. For cost functions other than squared Euclidean distances we suggest a simple recursive coreset construction that produces coresets of size