Showing papers by "Sameep Mehta" published in 2005


Proceedings ArticleDOI
21 Aug 2005
TL;DR: A general framework to discover spatial associations and spatio-temporal episodes for scientific datasets is presented and it is shown that such episodes can be used to reason about critical events.
Abstract: In this paper, we present a general framework to discover spatial associations and spatio-temporal episodes for scientific datasets. In contrast to previous work in this area, features are modeled as geometric objects rather than points. We define multiple distance metrics that take into account objects' extent and are thus more robust in capturing the influence of an object on other objects in its spatial neighborhood. We have developed algorithms to discover four different types of spatial object interaction (association) patterns. We also extend our approach to accommodate temporal information and propose a simple algorithm to derive spatio-temporal episodes. We show that such episodes can be used to reason about critical events. We evaluate our framework on real datasets to demonstrate its efficacy. The datasets originate from two different areas: Computational Molecular Dynamics and Computational Fluid Flow. We present results highlighting the importance of the identified patterns and episodes by using knowledge from the underlying domains. We also show that the proposed algorithms scale linearly with respect to the dataset size.

75 citations
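
The abstract above contrasts point-based features with geometric objects whose extent matters. The sketch below illustrates that difference only; it is not the paper's method. Representing a feature as an axis-aligned 2D bounding box, and the names `Box`, `centroid_distance`, `boundary_distance`, and `are_neighbors`, are assumptions made for this example, since the abstract does not state the actual metrics.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical stand-in for a feature: an axis-aligned 2D bounding box."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float

    def center(self):
        return ((self.xmin + self.xmax) / 2.0, (self.ymin + self.ymax) / 2.0)


def centroid_distance(a, b):
    """Point-based distance: each object is reduced to its center."""
    (ax, ay), (bx, by) = a.center(), b.center()
    return math.hypot(ax - bx, ay - by)


def boundary_distance(a, b):
    """Extent-aware distance: smallest gap between the boxes (0 if they overlap)."""
    dx = max(a.xmin - b.xmax, b.xmin - a.xmax, 0.0)
    dy = max(a.ymin - b.ymax, b.ymin - a.ymax, 0.0)
    return math.hypot(dx, dy)


def are_neighbors(a, b, threshold, metric=boundary_distance):
    """Two objects are neighbors if the chosen metric is within the threshold."""
    return metric(a, b) <= threshold


# A long object next to a small one: the centers are far apart,
# but the boundaries are close.
big = Box(0.0, 0.0, 10.0, 2.0)
small = Box(11.0, 0.0, 12.0, 2.0)
print(centroid_distance(big, small))   # distance between centers
print(boundary_distance(big, small))   # gap between boundaries
```

With a neighborhood threshold of 2.0, the two boxes count as neighbors under the boundary metric but not under the centroid metric, which is the kind of case where treating features as points would miss an interaction.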


Journal ArticleDOI
TL;DR: A novel PCA-based unsupervised algorithm for the discretization of continuous attributes in multivariate data sets that leverages the underlying correlation structure in the data set to obtain the discrete intervals and ensures that the inherent correlations are preserved.
Abstract: Discretization is a crucial preprocessing technique used for a variety of data warehousing and mining tasks. In this paper, we present a novel PCA-based unsupervised algorithm for the discretization of continuous attributes in multivariate data sets. The algorithm leverages the underlying correlation structure in the data set to obtain the discrete intervals and ensures that the inherent correlations are preserved. Previous efforts on this problem are largely supervised and consider only piecewise correlation among attributes. We consider the correlation among continuous attributes and, at the same time, also take into account the interactions between continuous and categorical attributes. Our approach also extends easily to data sets containing missing values. We demonstrate the efficacy of the approach on real data sets and as a preprocessing step for both classification and frequent itemset mining tasks. We show that the intervals are meaningful and can uncover hidden patterns in data. We also show that large compression factors can be obtained on the discretized data sets. The approach is task independent, i.e., the same discretized data set can be used for different data mining tasks. Thus, the data sets can be discretized, compressed, and stored once and can be used again and again.

52 citations
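
To make the idea of correlation-aware discretization concrete, here is a deliberately simplified sketch. It is not the algorithm from the paper: it assumes only two continuous attributes, finds the first principal direction in closed form, and cuts the projected values into equal-frequency intervals. The function names `principal_angle` and `discretize_along_pc`, and the equal-frequency rule, are assumptions chosen for illustration.

```python
import math


def principal_angle(points):
    """Angle of the first principal axis of 2D data (closed form from the covariance)."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    return 0.5 * math.atan2(2.0 * sxy, sxx - syy)


def discretize_along_pc(points, k):
    """Assign each point to one of k equal-frequency intervals along the first PC."""
    n = len(points)
    theta = principal_angle(points)
    ux, uy = math.cos(theta), math.sin(theta)
    # Project every point onto the principal direction.
    scores = [x * ux + y * uy for x, y in points]
    # Rank the projections and split the ranks into k equal-sized groups.
    order = sorted(range(n), key=scores.__getitem__)
    labels = [0] * n
    for rank, idx in enumerate(order):
        labels[idx] = rank * k // n
    return labels


# Two perfectly correlated attributes: cuts follow the y = x direction.
pts = [(float(i), float(i)) for i in range(6)]
print(discretize_along_pc(pts, 3))
```

Because the intervals are chosen along the direction of greatest joint variation rather than per attribute, correlated attributes are cut together. How the paper maps such cuts back to the original attributes, and how it handles categorical attributes and missing values, is not shown here.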


Proceedings Article
01 Jan 2005
TL;DR: This application paper presents a two-step dynamic classifier to classify anomalous structures (defects) in data generated from ab initio Molecular Dynamics simulations of Silicon (Si) atom systems that is robust and scalable in the size of the atom systems.
Abstract: In this application paper, we explore techniques to classify anomalous structures (defects) in data generated from ab initio Molecular Dynamics (MD) simulations of Silicon (Si) atom systems. These systems are studied to understand the processes behind the formation of various defects, as they have a profound impact on the electrical and mechanical properties of Silicon. In our prior work, we presented techniques for defect detection [11, 12, 14]. Here, we present a two-step dynamic classifier to classify the defects. The first step uses up to third-order shape moments to provide a smaller set of candidate defect classes. The second step assigns the correct class to the defect structure by considering the actual spatial positions of the individual atoms. The dynamic classifier is robust and scalable in the size of the atom systems. Each phase is immune to noise, which is characterized after a study of the simulation data. We also validate the proposed solutions by using a physical model and properties of lattices. We demonstrate the efficacy and correctness of our approach on several large datasets. Our approach is able to recognize previously seen defects and also identify new defects in real time.

11 citations
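
The first classification step described above narrows the candidate classes using shape moments up to third order. The sketch below shows one plausible reading of that step, under stated assumptions: a defect is a list of 3D atom positions, the moments are plain central moments, and a class stays a candidate when every moment matches a catalog entry within a tolerance. The names `central_moments` and `candidate_classes`, the catalog layout, and the tolerance are hypothetical; the paper's exact moment definition and its second, position-based step are not reproduced.

```python
from itertools import product


def central_moments(points, max_order=3):
    """Central moments mu_pqr of a 3D point set for 2 <= p + q + r <= max_order."""
    n = len(points)
    cx = sum(p[0] for p in points) / n
    cy = sum(p[1] for p in points) / n
    cz = sum(p[2] for p in points) / n
    moments = {}
    for p, q, r in product(range(max_order + 1), repeat=3):
        # Orders 0 and 1 are constant for central moments, so skip them.
        if 2 <= p + q + r <= max_order:
            moments[(p, q, r)] = sum(
                (x - cx) ** p * (y - cy) ** q * (z - cz) ** r
                for x, y, z in points
            ) / n
    return moments


def candidate_classes(defect_points, catalog, tol=1e-6):
    """Return catalog class names whose moments all match within tol."""
    target = central_moments(defect_points)
    matches = []
    for name, ref_points in catalog.items():
        ref = central_moments(ref_points)
        if all(abs(target[key] - ref[key]) <= tol for key in target):
            matches.append(name)
    return matches


# Hypothetical two-entry catalog and a translated copy of one entry.
catalog = {
    "line": [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)],
    "triangle": [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
}
shifted_line = [(5.0, -3.0, 2.0), (6.0, -3.0, 2.0), (7.0, -3.0, 2.0)]
print(candidate_classes(shifted_line, catalog))
```

Central moments are unchanged by translation, so the shifted structure still matches its catalog entry. They are not rotation-invariant in this form, which is one reason a second step that examines actual atom positions would still be needed.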


Proceedings Article
30 Jul 2005
TL;DR: This work presents algorithms to discover two types of spatial association patterns in scientific data, modeling features as geometric objects rather than points and defining multiple distance metrics that take into account objects' extent.
Abstract: In this paper, we present efficient algorithms to discover spatial associations among features extracted from scientific datasets. In contrast to previous work in this area, features are modeled as geometric objects rather than points. We define multiple distance metrics that take into account objects' extent. We have developed algorithms to discover two types of spatial association patterns in scientific data. We present experimental results to demonstrate the efficacy of our approach on real datasets drawn from the bioinformatics domain. We also highlight the importance of the discovered patterns by integrating the underlying domain knowledge.

9 citations


Proceedings ArticleDOI
04 Apr 2005
TL;DR: This paper presents a case study in creating a parallel and scalable implementation of a scientific data analysis application which analyzes datasets produced by molecular dynamics simulations, and uses a system called FREERIDE, which was originally developed for parallelizing data mining algorithms.
Abstract: This paper presents a case study in creating a parallel and scalable implementation of a scientific data analysis application. We focus on a defect detection and categorization application which analyzes datasets produced by molecular dynamics (MD) simulations. In parallelizing this application, we had the following three goals. First, we wanted to achieve high parallel efficiency. Second, we wanted to create an implementation that can scale to disk-resident datasets. Third, we wanted to create an implementation that is easy to maintain and modify, which is possible only through using high-level interfaces. We used a number of techniques for organizing the input data, achieving load balance, and efficiently parallelizing the step for updating and matching with the defect catalog. To meet our third goal, we used a system called FREERIDE (FRamework for Rapid Implementation of Datamining Engines), which was originally developed for parallelizing data mining algorithms. We have carried out a detailed evaluation of our implementation. The main observations from our experiments are as follows: 1) our implementation achieves high parallel efficiency, 2) the execution time remains proportional to the amount of computation even as the dataset becomes disk-resident, and 3) our scheme for load balancing and the method we use for parallelizing updating and matching of the defect catalog are crucial for parallel efficiency of the defect categorization phase.

6 citations