Open access · Book Chapter · DOI: 10.1007/978-3-319-94667-2_28

Faster Coreset Construction for Projective Clustering via Low-Rank Approximation

16 Jul 2018 - pp. 336-348
Abstract: In this work, we present a randomized coreset construction for projective clustering, which involves computing a set of k closest j-dimensional linear (affine) subspaces of a given set of n vectors in d dimensions. Let \(A \in \mathbb{R}^{n\times d}\) be an input matrix. An earlier deterministic coreset construction of Feldman et al. [10] relied on computing the SVD of A. The best known algorithms for SVD require \(\min\{nd^2, n^2d\}\) time, which may not be feasible for large values of n and d. We present a coreset construction that projects the matrix A onto orthonormal vectors that closely approximate the right singular vectors of A. As a consequence, when the values of k and j are small, we achieve a faster algorithm than [10] while maintaining almost the same approximation guarantee. We also benefit in terms of space and can exploit the sparsity of the input dataset. Another advantage of our approach is that the coreset can be constructed quite efficiently in a streaming setting.
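The construction hinges on replacing the exact SVD with orthonormal vectors that approximately span the top right singular subspace. A minimal sketch of one standard way to obtain such vectors, a randomized range finder in the Halko-Martinsson-Tropp style (not the paper's exact algorithm; all names and sizes are illustrative):

```python
import numpy as np

def approx_right_singular_vectors(A, r, oversample=10, seed=0):
    """Return d x r orthonormal columns that approximately span the
    top-r right singular subspace of the n x d matrix A."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    # Sketch the row space of A with a Gaussian test matrix.
    Y = rng.standard_normal((r + oversample, n)) @ A   # (r+p) x d
    # Orthonormalize; columns of V approximate right singular vectors.
    V, _ = np.linalg.qr(Y.T)                           # d x (r+p)
    return V[:, :r]

# Projecting A onto V is far cheaper than a full min{nd^2, n^2d} SVD and
# yields the low-dimensional representation a coreset can be built from.
A = np.random.rand(10000, 500)
V = approx_right_singular_vectors(A, r=20)
A_proj = A @ V                                         # n x r
```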


Topics: Coreset (70%), Orthonormality (53%), Low-rank approximation (51%)
Citations

Open access · Posted Content
Samarth Sinha, Han Zhang, Anirudh Goyal, Yoshua Bengio, +2 more
Abstract: Recent work by Brock et al. (2018) suggests that Generative Adversarial Networks (GANs) benefit disproportionately from large mini-batch sizes. Unfortunately, using large batches is slow and expensive on conventional hardware. Thus, it would be nice if we could generate batches that were effectively large though actually small. In this work, we propose a method to do this, inspired by the use of Coreset-selection in active learning. When training a GAN, we draw a large batch of samples from the prior and then compress that batch using Coreset-selection. To create effectively large batches of 'real' images, we create a cached dataset of Inception activations of each training image, randomly project them down to a smaller dimension, and then use Coreset-selection on those projected activations at training time. We conduct experiments showing that this technique substantially reduces training time and memory usage for modern GAN variants, that it reduces the fraction of dropped modes in a synthetic dataset, and that it allows GANs to reach a new state of the art in anomaly detection.
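The "Coreset-selection" step described here is commonly implemented with the greedy k-center heuristic from active learning (Sener & Savarese); a hedged sketch of that selection applied to randomly projected activations, with all sizes and names purely illustrative:

```python
import numpy as np

def greedy_k_center(X, m, seed=0):
    """Pick m points so that every point lies near some selected point:
    the standard greedy k-center coreset heuristic."""
    rng = np.random.default_rng(seed)
    selected = [int(rng.integers(len(X)))]
    # Distance of every point to its nearest selected point so far.
    dists = np.linalg.norm(X - X[selected[0]], axis=1)
    for _ in range(m - 1):
        nxt = int(np.argmax(dists))        # farthest point joins the coreset
        selected.append(nxt)
        dists = np.minimum(dists, np.linalg.norm(X - X[nxt], axis=1))
    return selected

# Hypothetical use: compress a large candidate batch into a small one.
activations = np.random.rand(2048, 2048)           # cached feature activations
proj = activations @ np.random.randn(2048, 32)     # random projection to 32-d
small_batch_idx = greedy_k_center(proj, m=64)      # "effectively large" batch
```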


13 Citations


Open access · Posted Content
Jae-hun Shim, Kyeongbo Kong, Suk-Ju Kang
08 Jul 2021 - arXiv: Learning
Abstract: Neural architecture search (NAS), an important branch of automatic machine learning, has become an effective approach to automating the design of deep learning models. However, the major issue in NAS is the large search time imposed by the heavy computational burden. While most recent approaches focus on pruning redundant sets or developing new search methodologies, this paper formulates the problem from a data curation perspective. Our key strategy is to search the architecture on a summarized data distribution, i.e., a core-set. Since many NAS algorithms separate the searching and training stages, and the proposed core-set methodology is used only in the search stage, any performance degradation can be minimized. In our experiments, we reduced the overall computational time from 30.8 hours to 3.5 hours, an 8.8x reduction, on a single RTX 3090 GPU without sacrificing accuracy.
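A schematic of the two-stage split the abstract describes: the expensive search loop sees only the core-set, while final training uses all the data. This is a hypothetical skeleton, not the paper's API:

```python
def nas_with_coreset(full_dataset, select_coreset, search, train):
    """Search on a summarized data distribution, train on the full data."""
    core = select_coreset(full_dataset)   # e.g., greedy k-center as sketched above
    best_arch = search(core)              # expensive stage sees only the core-set
    return train(best_arch, full_dataset) # final training stage is unaffected
```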


Topics: Deep learning (53%), Pruning (decision trees) (53%), Data curation (51%)
References

Open access · Proceedings Article
01 Jan 1996
Abstract: Clustering algorithms are attractive for the task of class identification in spatial databases. However, applying them to large spatial databases raises the following requirements: minimal domain knowledge to determine the input parameters, discovery of clusters of arbitrary shape, and good efficiency on large databases. The well-known clustering algorithms offer no solution to the combination of these requirements. In this paper, we present the new clustering algorithm DBSCAN, which relies on a density-based notion of clusters and is designed to discover clusters of arbitrary shape. DBSCAN requires only one input parameter and supports the user in determining an appropriate value for it. We performed an experimental evaluation of the effectiveness and efficiency of DBSCAN using synthetic data and real data from the SEQUOIA 2000 benchmark. The results demonstrate that (1) DBSCAN is significantly more effective at discovering clusters of arbitrary shape than the well-known algorithm CLARANS, and (2) DBSCAN outperforms CLARANS by a factor of more than 100 in terms of efficiency.
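A quick illustration of the density-based behavior using scikit-learn's implementation (the library exposes eps and min_samples; the paper's heuristic helps choose eps, and the values below are illustrative):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two concentric noisy rings: arbitrary-shape clusters that centroid-based
# methods such as k-means cannot separate, but density-based clustering can.
rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 500)
radius = np.repeat([1.0, 3.0], 250)
X = np.c_[radius * np.cos(theta), radius * np.sin(theta)]
X += rng.normal(scale=0.1, size=X.shape)

labels = DBSCAN(eps=0.4, min_samples=5).fit_predict(X)
print(np.unique(labels))   # cluster ids; -1 marks noise points
```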


Topics: OPTICS algorithm (76%), SUBCLU (75%), DBSCAN (72%)

14,280 Citations


Proceedings Article · DOI: 10.1109/FOCS.2006.37
Tamas Sarlos
21 Oct 2006
Abstract: Recently several results appeared that show significant reduction in time for matrix multiplication, singular value decomposition, as well as linear (\(\ell_2\)) regression, all based on data-dependent random sampling. Our key idea is that low-dimensional embeddings can be used to eliminate data dependence and provide more versatile, linear-time, pass-efficient matrix computation. Our main contribution is summarized as follows.
-- Independent of the recent results of Har-Peled and of Deshpande and Vempala, one of the first, and to the best of our knowledge the most efficient, relative-error \((1+\epsilon)\Vert A - A_k\Vert_F\) approximation algorithms for the singular value decomposition of an \(m \times n\) matrix A with M non-zero entries, requiring 2 passes over the data and running in time \(O\left(\left(M\left(\tfrac{k}{\epsilon} + k\log k\right) + (n+m)\left(\tfrac{k}{\epsilon} + k\log k\right)^2\right)\log\tfrac{1}{\delta}\right)\).
-- The first \(o(nd^2)\)-time \((1+\epsilon)\) relative-error approximation algorithm for \(n \times d\) linear (\(\ell_2\)) regression.
-- A matrix multiplication and norm approximation algorithm that easily applies to implicitly given matrices and can be used as a black-box probability-boosting tool.
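The regression result is the root of the now-standard "sketch-and-solve" paradigm. A hedged illustration using a dense random-sign sketch (Sarlos's fast \(o(nd^2)\) variant uses a structured fast Johnson-Lindenstrauss transform instead; this plain version only shows the principle):

```python
import numpy as np

def sketched_lstsq(A, b, sketch_rows, seed=0):
    """Sketch-and-solve least squares: compress the tall n x d system
    with a random sign matrix, then solve the small system exactly."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    S = rng.choice([-1.0, 1.0], size=(sketch_rows, n)) / np.sqrt(sketch_rows)
    x, *_ = np.linalg.lstsq(S @ A, S @ b, rcond=None)
    return x

A = np.random.randn(10000, 50)
b = A @ np.random.randn(50) + 0.01 * np.random.randn(10000)
x_sk = sketched_lstsq(A, b, sketch_rows=500)   # approximate minimizer
x_ex, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.linalg.norm(x_sk - x_ex))             # small for modest sketch sizes
```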


763 Citations


Journal Article · DOI: 10.1023/A:1009783824328
Tian Zhang, Raghu Ramakrishnan, Miron Livny
Abstract: Data clustering is an important technique for exploratory data analysis and has been studied for several years. It has been shown to be useful in many practical domains such as data classification and image processing. Recently, there has been a growing emphasis on exploratory analysis of very large datasets to discover useful patterns and/or correlations among attributes. This is called data mining, and data clustering is regarded as a particular branch of it. However, existing data clustering methods do not adequately address the problem of processing large datasets with a limited amount of resources (e.g., memory and CPU cycles), so as the dataset size increases, they do not scale well in terms of memory requirements, running time, and result quality. In this paper, an efficient and scalable data clustering method is proposed, based on a new in-memory data structure called the CF-tree, which serves as an in-memory summary of the data distribution. We have implemented it in a system called BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) and studied its performance extensively in terms of memory requirements, running time, clustering quality, stability, and scalability; we also compare it with other available methods. Finally, BIRCH is applied to solve two real-life problems: one is building an iterative and interactive pixel classification tool, and the other is generating the initial codebook for image compression.
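Because the CF-tree summarizes data incrementally, BIRCH suits data that arrives in chunks; a short usage sketch with scikit-learn's implementation (parameter values are illustrative):

```python
import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(loc=c, scale=0.5, size=(10000, 2))
                    for c in (0.0, 5.0, 10.0)])

# threshold bounds the radius of each CF-tree subcluster (memory vs. detail).
model = Birch(threshold=0.5, n_clusters=3)
for chunk in np.array_split(X, 10):   # feed the data in streaming fashion
    model.partial_fit(chunk)
labels = model.predict(X)
```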


Topics: Data stream clustering (72%), Cluster analysis (71%), CURE data clustering algorithm (71%)

699 Citations


Open access · Proceedings Article
10 Sep 2000
Abstract: Nearest neighbor search in high dimensional spaces is an interesting and important problem which is relevant for a wide variety of novel database applications. As recent results show, however, the problem is a very difficult one, not only with regard to the performance issue but also to the quality issue. In this paper, we discuss the quality issue and identify a new generalized notion of nearest neighbor search as the relevant problem in high dimensional space. In contrast to previous approaches, our new notion of nearest neighbor search does not treat all dimensions equally but uses a quality criterion to select relevant dimensions (projections) with respect to the given query. As an example of a useful quality criterion, we rate how well the data is clustered around the query point within the selected projection. We then propose an efficient and effective algorithm to solve the generalized nearest neighbor problem. Our experiments based on a number of real and synthetic data sets show that our new approach provides new insights into the nature of nearest neighbor search on high-dimensional data.
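A hedged sketch of the idea: score each dimension by how tightly the data concentrates around the query in that coordinate, then search only in the best-scoring subspace. The scoring rule below (mean absolute deviation from the query) is an illustrative stand-in for the paper's actual quality criterion:

```python
import numpy as np

def generalized_nn(X, q, n_dims, k=1):
    """Query-dependent projection: keep the n_dims coordinates where the
    data clusters most tightly around q, then do exact NN in that subspace."""
    scores = np.mean(np.abs(X - q), axis=0)   # lower = better clustered near q
    dims = np.argsort(scores)[:n_dims]        # the "relevant" projections
    d2 = np.sum((X[:, dims] - q[dims]) ** 2, axis=1)
    return np.argsort(d2)[:k], dims

X = np.random.rand(5000, 100)
q = np.random.rand(100)
neighbors, used_dims = generalized_nn(X, q, n_dims=10, k=5)
```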


Topics: Nearest neighbor search (72%), Best bin first (70%), k-nearest neighbors algorithm (57%)

499 Citations


Proceedings Article · DOI: 10.1145/1007352.1007400
Sariel Har-Peled, Soham Mazumdar
13 Jun 2004
Abstract: In this paper, we show the existence of small coresets for the problems of computing k-median and k-means clustering for points in low dimension. In other words, we show that given a point set P in \(\mathbb{R}^d\), one can compute a weighted set \(S \subseteq P\) of size \(O(k\epsilon^{-d}\log n)\), such that one can compute the k-median/k-means clustering on S instead of on P and get a \((1+\epsilon)\)-approximation. As a result, we improve the fastest known algorithms for \((1+\epsilon)\)-approximate k-means and k-median. Our algorithms have linear running time for fixed k and \(\epsilon\). In addition, we can maintain the \((1+\epsilon)\)-approximate k-median or k-means clustering of a stream, when points are only being inserted, using polylogarithmic space and update time.
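Har-Peled and Mazumdar's construction is grid-based and deterministic; a simpler, later relative of the same "cluster the summary instead of the data" idea is the lightweight coreset of Bachem et al., sketched here for k-means (not this paper's algorithm):

```python
import numpy as np

def lightweight_coreset(P, m, seed=0):
    """Importance-sample a weighted k-means coreset: points far from the
    global mean are more influential, so they are sampled more often."""
    rng = np.random.default_rng(seed)
    d2 = np.sum((P - P.mean(axis=0)) ** 2, axis=1)
    q = 0.5 / len(P) + 0.5 * d2 / d2.sum()   # mixture sampling distribution
    idx = rng.choice(len(P), size=m, p=q)
    weights = 1.0 / (m * q[idx])             # reweight to keep costs unbiased
    return P[idx], weights

P = np.random.rand(100000, 2)
S, w = lightweight_coreset(P, m=500)   # run weighted k-means on S instead of P
```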


Topics: Correlation clustering (60%), k-medians clustering (59%), Cluster analysis (56%)

492 Citations


Performance Metrics
No. of citations received by the paper in previous years:

Year    Citations
2021    1
2019    1