Showing papers by "Sameep Mehta" published in 2005

Proceedings Article

A generalized framework for mining spatio-temporal patterns in scientific data

Hui Yang¹, Srinivasan Parthasarathy¹, Sameep Mehta¹

21 Aug 2005

TL;DR: A general framework to discover spatial associations and spatio-temporal episodes for scientific datasets is presented and it is shown that such episodes can be used to reason about critical events.

Abstract: In this paper, we present a general framework to discover spatial associations and spatio-temporal episodes for scientific datasets. In contrast to previous work in this area, features are modeled as geometric objects rather than points. We define multiple distance metrics that take into account objects' extent and thus are more robust in capturing the influence of an object on other objects in spatial neighborhood. We have developed algorithms to discover four different types of spatial object interaction (association) patterns. We also extend our approach to accommodate temporal information and propose a simple algorithm to derive spatio-temporal episodes. We show that such episodes can be used to reason about critical events. We evaluate our framework on real datasets to demonstrate its efficacy. The datasets originate from two different areas: Computational Molecular Dynamics and Computational Fluid Flow. We present results highlighting the importance of the identified patterns and episodes by using knowledge from the underlying domains. We also show that the proposed algorithms scale linearly with respect to the dataset size.
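To illustrate why a distance that accounts for an object's extent can behave differently from a point-based one, here is a minimal sketch. It is not the paper's metrics: the abstract does not define them, so this example assumes each feature is summarized by a hypothetical axis-aligned bounding box given as `(min_corner, max_corner)`, and compares a centroid distance with a boundary-to-boundary distance.

```python
import math

def centroid_distance(a, b):
    """Euclidean distance between box centers (treats each object as a point)."""
    ca = [(lo + hi) / 2 for lo, hi in zip(a[0], a[1])]
    cb = [(lo + hi) / 2 for lo, hi in zip(b[0], b[1])]
    return math.dist(ca, cb)

def boundary_distance(a, b):
    """Smallest distance between two axis-aligned boxes; 0 if they touch or overlap."""
    gaps = []
    for a_lo, a_hi, b_lo, b_hi in zip(a[0], a[1], b[0], b[1]):
        # Per-axis gap between the two intervals, clamped at zero when they overlap.
        gaps.append(max(b_lo - a_hi, a_lo - b_hi, 0.0))
    return math.sqrt(sum(g * g for g in gaps))

# Hypothetical features: a long thin object and a small one just past its end.
long_box = ((0.0, 0.0), (10.0, 1.0))
small_box = ((11.0, 0.0), (12.0, 1.0))

print(centroid_distance(long_box, small_box))  # far apart when reduced to points
print(boundary_distance(long_box, small_box))  # close once extent is considered
```

With a neighborhood threshold of, say, 2 units, the two objects would count as neighbors under the boundary distance but not under the centroid distance, which is the kind of difference an extent-aware metric is meant to capture.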

75 citations

Journal Article

Toward unsupervised correlation preserving discretization

Sameep Mehta¹, Srinivasan Parthasarathy¹, Hui Yang¹

Ohio State University¹

01 Sep 2005 - IEEE Transactions on Knowledge and Data Engineering

TL;DR: A novel PCA-based unsupervised algorithm for the discretization of continuous attributes in multivariate data sets that leverages the underlying correlation structure in the data set to obtain the discrete intervals and ensures that the inherent correlations are preserved.

Abstract: Discretization is a crucial preprocessing technique used for a variety of data warehousing and mining tasks. In this paper, we present a novel PCA-based unsupervised algorithm for the discretization of continuous attributes in multivariate data sets. The algorithm leverages the underlying correlation structure in the data set to obtain the discrete intervals and ensures that the inherent correlations are preserved. Previous efforts on this problem are largely supervised and consider only piecewise correlation among attributes. We consider the correlation among continuous attributes and, at the same time, also take into account the interactions between continuous and categorical attributes. Our approach also extends easily to data sets containing missing values. We demonstrate the efficacy of the approach on real data sets and as a preprocessing step for both classification and frequent itemset mining tasks. We show that the intervals are meaningful and can uncover hidden patterns in data. We also show that large compression factors can be obtained on the discretized data sets. The approach is task independent, i.e., the same discretized data set can be used for different data mining tasks. Thus, the data sets can be discretized, compressed, and stored once and can be used again and again.
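As a rough illustration of how correlation structure can guide the choice of intervals, the sketch below projects two continuous attributes onto their first principal component and places equal-frequency cut points along that direction. This is an assumption-laden toy, not the algorithm from the paper: the abstract does not give the procedure, and the function names, the two-attribute restriction, and the equal-frequency rule are all chosen here for illustration only.

```python
import math

def first_principal_axis(rows):
    """Mean and unit direction of the leading principal component of two-column data."""
    n = len(rows)
    mx = sum(r[0] for r in rows) / n
    my = sum(r[1] for r in rows) / n
    sxx = sum((r[0] - mx) ** 2 for r in rows) / n
    syy = sum((r[1] - my) ** 2 for r in rows) / n
    sxy = sum((r[0] - mx) * (r[1] - my) for r in rows) / n
    # Closed-form orientation of the major axis of a 2x2 covariance matrix.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    return (mx, my), (math.cos(theta), math.sin(theta))

def discretize(rows, bins):
    """Assign each row an interval index using equal-frequency cuts along the first PC."""
    mean, axis = first_principal_axis(rows)
    scores = [(r[0] - mean[0]) * axis[0] + (r[1] - mean[1]) * axis[1] for r in rows]
    ordered = sorted(scores)
    cuts = [ordered[len(ordered) * k // bins] for k in range(1, bins)]
    # The interval index is the number of cut points at or below the score.
    return [sum(s >= c for c in cuts) for s in scores]

# Hypothetical data: two perfectly correlated attributes.
rows = [(float(i), 2.0 * i) for i in range(8)]
labels = discretize(rows, bins=2)
print(labels)
```

Because the cut is placed along the shared direction of variation, both attributes are split at the same rows, so the relationship between them is still visible in the discretized data.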

52 citations

Proceedings Article

Dynamic Classification of Defect Structures in Molecular Dynamics Simulation Data.

Sameep Mehta, Steve Barr, Tat-Sang Choy, Hui Yang, Srinivasan Parthasarathy, Raghu Machiraju, John W. Wilkins

01 Jan 2005

TL;DR: This application paper presents a two-step dynamic classifier for anomalous structures (defects) in data generated from ab initio Molecular Dynamics simulations of Silicon (Si) atom systems; the classifier is robust and scalable in the size of the atom systems.

Abstract: In this application paper we explore techniques to classify anomalous structures (defects) in data generated from ab initio Molecular Dynamics (MD) simulations of Silicon (Si) atom systems. These systems are studied to understand the processes behind the formation of various defects as they have a profound impact on the electrical and mechanical properties of Silicon. In our prior work we presented techniques for defect detection [11, 12, 14]. Here, we present a two-step dynamic classifier to classify the defects. The first step uses up to third-order shape moments to provide a smaller set of candidate defect classes. The second step assigns the correct class to the defect structure by considering the actual spatial positions of the individual atoms. The dynamic classifier is robust and scalable in the size of the atom systems. Each phase is immune to noise, which is characterized after a study of the simulation data. We also validate the proposed solutions by using a physical model and properties of lattices. We demonstrate the efficacy and correctness of our approach on several large datasets. Our approach is able to recognize previously seen defects and also identify new defects in real time.
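The two-step idea described in the abstract (a coarse filter on shape moments, then a comparison of atom positions) can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' classifier: the catalog layout, the tolerance parameters, the nearest-neighbor position error, and the absence of any rotation or noise handling are all hypothetical choices made here.

```python
import math
from itertools import product

def central_moments(points, max_order=3):
    """Central moments of a 3D point set for orders 2 through max_order."""
    n = len(points)
    cx, cy, cz = (sum(p[i] for p in points) / n for i in range(3))
    moments = []
    for p, q, r in product(range(max_order + 1), repeat=3):
        if 2 <= p + q + r <= max_order:
            total = sum((x - cx) ** p * (y - cy) ** q * (z - cz) ** r
                        for x, y, z in points)
            moments.append(total / n)
    return moments

def centered(points):
    """Translate a point set so that its centroid sits at the origin."""
    n = len(points)
    c = [sum(p[i] for p in points) / n for i in range(3)]
    return [tuple(p[i] - c[i] for i in range(3)) for p in points]

def position_error(a, b):
    """Mean distance from each centered point of a to its nearest centered point of b."""
    ca, cb = centered(a), centered(b)
    return sum(min(math.dist(p, q) for q in cb) for p in ca) / len(ca)

def classify(defect, catalog, moment_tol, pos_tol):
    """Return the matching catalog class name, or None for an unseen structure."""
    m = central_moments(defect)
    # Step 1: shape moments narrow the catalog to a few candidate classes.
    names = [name for name, ref in catalog.items()
             if math.dist(m, central_moments(ref)) <= moment_tol]
    # Step 2: actual atom positions pick the best candidate.
    best = min(names, key=lambda nm: position_error(defect, catalog[nm]), default=None)
    if best is None or position_error(defect, catalog[best]) > pos_tol:
        return None
    return best

# Hypothetical catalog of previously seen defect shapes.
catalog = {
    "pair": [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    "triangle": [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
}
shifted_pair = [(5.0, 5.0, 5.0), (6.0, 5.0, 5.0)]
chain = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]

print(classify(shifted_pair, catalog, moment_tol=0.05, pos_tol=0.1))
print(classify(chain, catalog, moment_tol=0.05, pos_tol=0.1))
```

Central moments are unchanged by translation, so the shifted pair still matches its catalog entry, while the four-atom chain matches nothing and is reported as `None`, which a caller could treat as a new defect class to add to the catalog.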

11 citations

Proceedings Article

Mining spatial object associations for scientific data

Hui Yang¹, Srinivasan Parthasarathy¹, Sameep Mehta¹

Ohio State University¹

30 Jul 2005

TL;DR: This work develops algorithms to discover two types of spatial association patterns in scientific data, modeling features as geometric objects rather than points and defining multiple distance metrics that take into account objects' extent.

Abstract: In this paper, we present efficient algorithms to discover spatial associations among features extracted from scientific datasets. In contrast to previous work in this area, features are modeled as geometric objects rather than points. We define multiple distance metrics that take into account objects' extent. We have developed algorithms to discover two types of spatial association patterns in scientific data. We present experimental results to demonstrate the efficacy of our approach on real datasets drawn from the bioinformatic domain. We also highlight the importance of the discovered patterns by integrating the underlying domain knowledge.

9 citations

Proceedings Article

Parallelizing a defect detection and categorization application

Leonid Glimcher¹, Gagan Agrawal¹, Sameep Mehta¹, Ruoming Jin¹, Raghu Machiraju¹

Ohio State University¹

04 Apr 2005

TL;DR: This paper presents a case study in creating a parallel and scalable implementation of a scientific data analysis application which analyzes datasets produced by molecular dynamics simulations, and uses a system called FREERIDE, which was originally developed for parallelizing data mining algorithms.

Abstract: This paper presents a case study in creating a parallel and scalable implementation of a scientific data analysis application. We focus on a defect detection and categorization application which analyzes datasets produced by molecular dynamics (MD) simulations. In parallelizing this application, we had the following three goals. First, we obviously wanted to achieve high parallel efficiency. Second, we wanted to create an implementation that can scale to disk-resident datasets. Third, we wanted to create an easy to maintain and modify implementation, which is possible only through using high-level interfaces. We used a number of techniques for organizing the input data, achieving load balance, and efficiently parallelizing the step for updating and matching with the defect catalog. To meet our third goal, we used a system called FREERIDE (FRamework for Rapid Implementation of Datamining Engines), which was originally developed for parallelizing data mining algorithms. We have carried out a detailed evaluation of our implementation. The main observations from our experiments are as follows: 1) our implementation achieves high parallel efficiency, 2) the execution time remains proportional to the amount of computation even as the dataset becomes disk-resident, and 3) our scheme for load balancing and the method we use for parallelizing updating and matching of the defect catalog are crucial for parallel efficiency of the defect categorization phase.

6 citations