Showing papers by "Sakti Pramanik" published in 2020


Journal Article
TL;DR: This paper proposes ERC (effective multiple range queries-based clustering), a novel global clustering method that takes advantage of the structure of a CF tree, and a CF$^+$ tree, which optimizes the node split scheme used in the CF tree. The combined CF$^+$-ERC method effectively computes clusters over large data sets.
Abstract: Many existing clustering methods usually compute clusters from the reduced data sets obtained by summarizing the original very large data sets. BIRCH is a popular summary-based clustering method that first builds a CF tree, and then performs a global clustering using the leaf entries of the tree. However, to the best of our knowledge, no prior studies have proposed a global clustering method that uses the structure of a CF tree. Therefore, we propose a novel global clustering method ERC (effective multiple range queries-based clustering), which takes advantage of the structure of a CF tree. We further propose a CF$^+$ tree, which optimizes the node split scheme used in the CF tree. As a result, the CF$^+$-ERC (CF$^+$ tree-based ERC) method effectively computes clusters over large data sets. Furthermore, it does not require a predefined number of clusters to compute the clusters. We present in-depth theoretical and experimental analyses of our method. Experimental results on very large synthetic data sets demonstrate that the proposed approach is effective in terms of cluster quality and robustness and is significantly faster than existing clustering methods. In addition, we apply our clustering method to real data sets and achieve promising results.
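
For readers unfamiliar with the summaries that BIRCH stores in a CF tree, the following is a minimal sketch of a standard clustering feature (point count, linear sum, sum of squared norms) and its additive merge. It illustrates plain BIRCH summaries only; it is not the ERC method or the CF$^+$ tree node split from the paper, and the class and method names are assumptions made for illustration.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class ClusteringFeature:
    """Standard BIRCH summary of a set of points (illustrative names)."""
    n: int            # number of points summarized
    ls: List[float]   # per-dimension linear sum of the points
    ss: float         # sum of squared norms of the points

    @classmethod
    def from_point(cls, p: Sequence[float]) -> "ClusteringFeature":
        return cls(1, list(p), sum(x * x for x in p))

    def merge(self, other: "ClusteringFeature") -> "ClusteringFeature":
        # CFs are additive: merging two summaries needs no access to raw points.
        return ClusteringFeature(
            self.n + other.n,
            [a + b for a, b in zip(self.ls, other.ls)],
            self.ss + other.ss,
        )

    def centroid(self) -> List[float]:
        return [x / self.n for x in self.ls]

    def radius(self) -> float:
        # Root mean squared distance of the summarized points to the centroid.
        c = self.centroid()
        variance = self.ss / self.n - sum(x * x for x in c)
        return math.sqrt(max(variance, 0.0))
```

Because the merge is additive, a tree node can summarize all points beneath it by summing the features of its children.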

9 citations


Journal Article
TL;DR: This paper presents a novel method, called the VA-Store, to reduce the large space requirement for repetitive data in prevailing genome sequence analysis tasks using $k$-mers with multiple $k$ values. The VA-Store transforms a given query on a virtual store into one or more queries on the physical store for execution.
Abstract: In recent years, we have witnessed an increasing demand to process big data in numerous applications. It is observed that there often exist substantial amounts of repetitive data in different portions of a big data repository/dataset for applications such as genome sequence analyses. In this paper, we present a novel method, called the VA-Store, to reduce the large space requirement for repetitive data in prevailing genome sequence analysis tasks using $k$-mers (i.e., subsequences of length $k$) with multiple $k$ values. The VA-Store maintains a physical store for one portion of the input dataset (i.e., $k_0$-mers) and supports multiple virtual stores for other portions of the dataset (i.e., $k$-mers with $k \ne k_0$). Utilizing important relationships among repetitive data, the VA-Store transforms a given query on a virtual store into one or more queries on the physical store for execution. Both precise and approximate transformations are considered. Accuracy estimation models for approximate solutions are derived. Query optimization strategies are suggested to improve query performance. Our experiments using real and synthetic datasets demonstrate that the VA-Store is quite promising in providing effective storage and efficient query processing for solving a kernel database problem on repetitive big data for genome sequence analysis applications.
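
To make the idea of answering a query on a virtual store from the physical store concrete, here is a small sketch for the case $k < k_0$ on a single sequence. It relies only on the fact that every $k$-mer occurrence is either a prefix of some $k_0$-mer or lies in the last $k_0 - 1$ characters of the sequence. This is an assumed, simplified illustration and not necessarily the transformation used by the VA-Store; the function names and the separately kept `tail` are hypothetical.

```python
from collections import Counter


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every length-k substring of seq (the 'physical store' when k = k0)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def count_via_physical(physical: Counter, tail: str, query: str) -> int:
    """Count occurrences of a shorter k-mer using only k0-mer counts.

    physical: k0-mer counts of the sequence (assumes len(query) < k0).
    tail:     the last k0 - 1 characters of the sequence, kept separately
              (a hypothetical helper for the exact count).
    """
    k = len(query)
    # Each k0-mer that starts with the query contributes its occurrences.
    prefix_hits = sum(c for kmer, c in physical.items() if kmer.startswith(query))
    # Occurrences too close to the end are not a prefix of any k0-mer.
    tail_hits = sum(1 for i in range(len(tail) - k + 1) if tail[i:i + k] == query)
    return prefix_hits + tail_hits


seq = "ACGTACGA"
k0 = 4
physical = kmer_counts(seq, k0)
tail = seq[len(seq) - k0 + 1:]
two_mer_count = count_via_physical(physical, tail, "CG")
```

Dropping the `tail` term would give an undercount for queries that occur near the end of the sequence, which is one simple way an approximate answer can arise in this toy setting.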

5 citations


Proceedings Article
09 Mar 2020
TL;DR: This study shows that the bottom-up update method can be more efficient than the traditional top-down update method, especially when only a small number of a vector's dimensions need to be updated.
Abstract: There is an increasing demand from numerous applications such as bioinformatics and cybersecurity to efficiently process various types of queries on datasets in a multidimensional Non-ordered Discrete Data Space (NDDS). An NDDS consists of vectors with values coming from a non-ordered discrete domain for each dimension. The BoND-tree index was recently developed to efficiently process box queries on a large dataset from an NDDS on disk. The original work of the BoND-tree focused on developing the index construction and query algorithms. No work has been reported on exploring efficient and effective update strategies for the BoND-tree. In this paper, we study two update methods based on two different strategies for updating the index tree in an NDDS. Our study shows that using the bottom-up update method can provide improved efficiency, compared with the traditional top-down update method, especially when the number of dimensions of a vector that need to be updated is small. On the other hand, our study also shows that the two update methods have comparable effectiveness, which indicates that the bottom-up update method is generally more advantageous.
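
As a rough illustration of what a box query over an NDDS looks like, the sketch below assumes the usual formulation in which the query gives a set of allowed values per dimension and a vector matches when every component falls in the corresponding set. It is a brute-force scan for illustration only; it does not model the BoND-tree index or either update method, and the function name and sample data are assumptions.

```python
from typing import List, Sequence, Set, Tuple


def box_query(vectors: Sequence[Tuple[str, ...]],
              box: Sequence[Set[str]]) -> List[Tuple[str, ...]]:
    """Return the vectors whose value on every dimension is in that dimension's set."""
    return [
        v for v in vectors
        if all(value in allowed for value, allowed in zip(v, box))
    ]


# Hypothetical 3-dimensional data over the alphabet {a, c, g, t}.
data = [("a", "g", "t"), ("c", "g", "t"), ("a", "t", "t")]
query_box = [{"a", "c"}, {"g"}, {"t", "a"}]
matches = box_query(data, query_box)
```

Because the domain values have no order, the query is expressed with sets of values rather than numeric ranges.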