Showing papers in "IEEE Transactions on Knowledge and Data Engineering in 2002"

PDF

Open Access

Journal Article•DOI•

CLARANS: a method for clustering objects for spatial data mining

[...]

Raymond T. Ng¹, Jiawei Han²•Institutions (2)

University of British Columbia¹, Simon Fraser University²

01 Sep 2002-IEEE Transactions on Knowledge and Data Engineering

TL;DR: A new clustering method is proposed, called CLARANS, whose aim is to identify spatial structures that may be present in the data, and two spatial data mining algorithms that aim to discover relationships between spatial and nonspatial attributes are developed.

...read moreread less

Abstract: Spatial data mining is the discovery of interesting relationships and characteristics that may exist implicitly in spatial databases. To this end, this paper has three main contributions. First, it proposes a new clustering method called CLARANS, whose aim is to identify spatial structures that may be present in the data. Experimental results indicate that, when compared with existing clustering methods, CLARANS is very efficient and effective. Second, the paper investigates how CLARANS can handle not only point objects, but also polygon objects efficiently. One of the methods considered, called the IR-approximation, is very efficient in clustering convex and nonconvex polygon objects. Third, building on top of CLARANS, the paper develops two spatial data mining algorithms that aim to discover relationships between spatial and nonspatial attributes. Both algorithms can discover knowledge that is difficult to find with existing spatial data mining algorithms.

...read moreread less

1,165 citations

Journal Article•DOI•

Practical data-oriented microaggregation for statistical disclosure control

[...]

Josep Domingo-Ferrer, Josep Maria Mateo-Sanz

01 Jan 2002-IEEE Transactions on Knowledge and Data Engineering

TL;DR: In this paper, candidate optimal solutions to the multivariate and univariate microaggregation problems are characterized and two heuristics based on hierarchical clustering and genetic algorithms are introduced which are data-oriented in that they try to preserve natural data aggregates.

...read moreread less

Abstract: Microaggregation is a statistical disclosure control technique for microdata disseminated in statistical databases. Raw microdata (i.e., individual records or data vectors) are grouped into small aggregates prior to publication. Each aggregate should contain at least k data vectors to prevent disclosure of individual information, where k is a constant value preset by the data protector. No exact polynomial algorithms are known to date to microaggregate optimally, i.e., with minimal variability loss. Methods in the literature rank data and partition them into groups of fixed-size; in the multivariate case, ranking is performed by projecting data vectors onto a single axis. In this paper, candidate optimal solutions to the multivariate and univariate microaggregation problems are characterized. In the univariate case, two heuristics based on hierarchical clustering and genetic algorithms are introduced which are data-oriented in that they try to preserve natural data aggregates. In the multivariate case, fixed-size and hierarchical clustering microaggregation algorithms are presented which do not require data to be projected onto a single dimension; such methods clearly reduce variability loss as compared to conventional multivariate microaggregation on projected data.

...read moreread less

596 citations

Journal Article•DOI•

An instance-weighting method to induce cost-sensitive trees

[...]

Kai Ming Ting¹•Institutions (1)

Monash University¹

01 May 2002-IEEE Transactions on Knowledge and Data Engineering

TL;DR: The algorithm incorporating the instance-weighting method is found to be better than the original algorithm in in of total misclassification costs, the number of high cost errors, and tree size two-class data sets.

...read moreread less

Abstract: We introduce an instance-weighting method to induce cost-sensitive trees. It is a generalization of the standard tree induction process where only the initial instance weights determine the type of tree to be induced-minimum error trees or minimum high cost error trees. We demonstrate that it can be easily adapted to an existing tree learning algorithm. Previous research provides insufficient evidence to support the idea that the greedy divide-and-conquer algorithm can effectively induce a truly cost-sensitive tree directly from the training data. We provide this empirical evidence in this paper. The algorithm incorporating the instance-weighting method is found to be better than the original algorithm in in of total misclassification costs, the number of high cost errors, and tree size two-class data sets. The instance-weighting method is simpler and more effective in implementation than a previous method based on altered priors.

...read moreread less

459 citations

Journal Article•DOI•

A survey of temporal knowledge discovery paradigms and methods

[...]

John F. Roddick¹, Myra Spiliopoulou•Institutions (1)

University of South Australia¹

01 Jul 2002-IEEE Transactions on Knowledge and Data Engineering

TL;DR: The confluence of temporal databases and data mining is investigated, the work to date is surveyed, and the issues involved and the outstanding problems in temporal data mining are explored.

...read moreread less

Abstract: With the increase in the size of data sets, data mining has recently become an important research topic and is receiving substantial interest from both academia and industry. At the same time, interest in temporal databases has been increasing and a growing number of both prototype and implemented systems are using an enhanced temporal understanding to explain aspects of behavior associated with the implicit time-varying nature of the universe. This paper investigates the confluence of these two areas, surveys the work to date, and explores the issues involved and the outstanding problems in temporal data mining.

...read moreread less

442 citations

Journal Article•DOI•

Efficient C4.5 [classification algorithm]

[...]

Salvatore Ruggieri

01 Mar 2002-IEEE Transactions on Knowledge and Data Engineering

TL;DR: An analytic evaluation of the runtime behavior of the C4.5 algorithm is presented which highlights some efficiency improvements and a more efficient version of the algorithm is implemented, called EC 4.5.

...read moreread less

Abstract: We present an analytic evaluation of the runtime behavior of the C4.5 algorithm which highlights some efficiency improvements. Based on the analytic evaluation, we have implemented a more efficient version of the algorithm, called EC4.5. It improves on C4.5 by adopting the best among three strategies for computing the information gain of continuous attributes. All the strategies adopt a binary search of the threshold in the whole training set starting from the local threshold computed at a node. The first strategy computes the local threshold using the algorithm of C4.5, which, in particular, sorts cases by means of the quicksort method. The second strategy also uses the algorithm of C4.5, but adopts a counting sort method. The third strategy calculates the local threshold using a main-memory version of the RainForest algorithm, which does not need sorting. Our implementation computes the same decision trees as C4.5 with a performance gain of up to five times.

...read moreread less

352 citations

Journal Article•DOI•

An efficient path computation model for hierarchically structured topographical road maps

[...]

Sungwon Jung¹, Sakti Pramanik²•Institutions (2)

Sogang University¹, Michigan State University²

01 Sep 2002-IEEE Transactions on Knowledge and Data Engineering

TL;DR: The authors' performance analysis of SPAH on grid graphs showed that it significantly reduces the search space over existing methods and Experimental results show that inter query shortest path problem provides more opportunity for scalable parallelism than the intra query shortest paths problem.

...read moreread less

Abstract: In this paper, we have developed a HiTi (Hierarchical MulTi) graph model for structuring large topographical road maps to speed up the minimum cost route computation. The HiTi graph model provides a novel approach to abstracting and structuring a topographical road map in a hierarchical fashion. We propose a new shortest path algorithm named SPAH, which utilizes HiTi graph model of a topographical road map for its computation. We give the proof for the optimality of SPAH. Our performance analysis of SPAH on grid graphs showed that it significantly reduces the search space over existing methods. We also present an in-depth experimental analysis of HiTi graph method by comparing it with other similar works on grid graphs. Within the HiTi graph framework, we also propose a parallel shortest path algorithm named ISPAH. Experimental results show that inter query shortest path problem provides more opportunity for scalable parallelism than the intra query shortest path problem.

...read moreread less

242 citations

Journal Article•DOI•

Unsupervised learning with mixed numeric and nominal data

[...]

C. Li¹, Gautam Biswas²•Institutions (2)

Middle Tennessee State University¹, Vanderbilt University²

01 Jul 2002-IEEE Transactions on Knowledge and Data Engineering

TL;DR: A similarity-based agglomerative clustering (SBAC) algorithm that works well for data with mixed numeric and nominal features and demonstrates the effectiveness of this algorithm in unsupervised discovery tasks.

...read moreread less

Abstract: Presents a similarity-based agglomerative clustering (SBAC) algorithm that works well for data with mixed numeric and nominal features. A similarity measure proposed by D.W. Goodall (1966) for biological taxonomy, that gives greater weight to uncommon feature value matches in similarity computations and makes no assumptions about the underlying distributions of the feature values, is adopted to define the similarity measure between pairs of objects. An agglomerative algorithm is employed to construct a dendrogram, and a simple distinctness heuristic is used to extract a partition of the data. The performance of the SBAC algorithm has been studied on real and artificially-generated data sets. The results demonstrate the effectiveness of this algorithm in unsupervised discovery tasks. Comparisons with other clustering schemes illustrate the superior performance of this approach.

...read moreread less

222 citations

Journal Article•DOI•

A time-bound cryptographic key assignment scheme for access control in a hierarchy

[...]

Wen-Guey Tzeng¹•Institutions (1)

National Chiao Tung University¹

01 Jan 2002-IEEE Transactions on Knowledge and Data Engineering

TL;DR: A time-bound cryptographic key assignment scheme in which the cryptographic keys of a class are different for each time period, which is to broadcast data to authorized users in a multilevel-security way and the other is to construct a flexible cryptographic key backup system.

...read moreread less

Abstract: The cryptographic key assignment problem is to assign cryptographic keys to a set of partially ordered classes so that the cryptographic key of a higher class can be used to derive the cryptographic key of a lower class. In this paper, we propose a time-bound cryptographic key assignment scheme in which the cryptographic keys of a class are different for each time period, that is, the cryptographic key of class C/sub i/ at time t is K/sub i, t./ Key derivation is constrained not only by the class relation, but also the time period. In our scheme, each user holds some secret parameters whose number is independent of the number of the classes in the hierarchy and the total time periods. We present two novel applications of our scheme. One is to broadcast data to authorized users in a multilevel-security way and the other is to construct a flexible cryptographic key backup system.

...read moreread less

205 citations

Journal Article•DOI•

Fast indexing and visualization of metric data sets using slim-trees

[...]

Caetano Traina¹, Agma J. M. Traina¹, Christos Faloutsos², Bernhard Seeger³•Institutions (3)

University of São Paulo¹, Carnegie Mellon University², University of Marburg³

01 Mar 2002-IEEE Transactions on Knowledge and Data Engineering

TL;DR: The slim-tree is a metric access method that tackles the problem of overlaps between nodes in metric spaces and that allows one to minimize the overlap, and how to improve the performance of a metric tree through the proposed "slim-down" algorithm is shown.

...read moreread less

Abstract: Many recent database applications need to deal with similarity queries. For such applications, it is important to measure the similarity between two objects using the distance between them. Focusing on this problem, this paper proposes the slim-tree, a new dynamic tree for organizing metric data sets in pages of fixed size. The slim-tree uses the triangle inequality to prune the distance calculations that are needed to answer similarity queries over objects in metric spaces. The proposed insertion algorithm uses new policies to select the nodes where incoming objects are stored. When a node overflows, the slim-tree uses a minimal spanning tree to help with the splitting. The new insertion algorithm leads to a tree with high storage utilization and improved query performance. The slim-tree is a metric access method that tackles the problem of overlaps between nodes in metric spaces and that allows one to minimize the overlap. The proposed "fat-factor" is a way to quantify whether a given tree can be improved and also to compare two trees. We show how to use the fat-factor to achieve accurate estimates of the search performance and also how to improve the performance of a metric tree through the proposed "slim-down" algorithm. This paper also presents a new tool in the slim-tree's arsenal of resources, aimed at visualizing it. Visualization is a powerful tool for interactive data mining and for the visual tracking of the behavior of a tree under updates. Finally, we present a formula to estimate the number of disk accesses in range queries. Results from experiments with real and synthetic data sets show that the new slim-tree algorithms lead to performance improvements. These results show that the slim-tree outperforms the M-tree by up to 200% for range queries. For insertion and splitting, the minimal-spanning-tree-based algorithm achieves up to 40 times faster insertions. We observed improvements of up to 40% in range queries after applying the slim-down algorithm.

...read moreread less

201 citations

Journal Article•DOI•

Optimizing main-memory join on modern hardware

[...]

Stefan Manegold, Peter Boncz, Martin L. Kersten

01 Jul 2002-IEEE Transactions on Knowledge and Data Engineering

TL;DR: The partitioned hash-join is refined with a new partitioning algorithm called radix-cluster, which is specifically designed to optimize memory access, and the effect of implementation techniques that optimize CPU resource usage is investigated.

...read moreread less

Abstract: In the past decade, the exponential growth in commodity CPU's speed has far outpaced advances in memory latency. A second trend is that CPU performance advances are not only brought by increased clock rates, but also by increasing parallelism inside the CPU. Current database systems have not yet adapted to these trends and show poor utilization of both CPU and memory resources on current hardware. In this paper, we show how these resources can be optimized for large joins and translate these insights into guidelines for future database architectures, encompassing data structures, algorithms, cost modeling and implementation. In particular, we discuss how vertically fragmented data structures optimize cache performance on sequential data access. On the algorithmic side, we refine the partitioned hash-join with a new partitioning algorithm called "radix-cluster", which is specifically designed to optimize memory access. The performance of this algorithm is quantified using a detailed analytical model that incorporates memory access costs in terms of a limited number of parameters, such as cache sizes and miss penalties. We also present a calibration tool that extracts such parameters automatically from any computer hardware. The accuracy of our models is proven by exhaustive experiments conducted with the Monet database system on three different hardware platforms. Finally, we investigate the effect of implementation techniques that optimize CPU resource usage. Our experiments show that large joins can be accelerated almost an order of magnitude on modern RISC hardware when both memory and CPU resources are optimized.

...read moreread less

187 citations

Journal Article•DOI•

Coordinated placement and replacement for large-scale distributed caches

[...]

Madhukar R. Korupolu, Mike Dahlin¹•Institutions (1)

University of Texas at Austin¹

01 Nov 2002-IEEE Transactions on Knowledge and Data Engineering

TL;DR: Simulation examines three practical cooperative placement algorithms, including one that is provably close to optimal, and compares these algorithms to the optimal placement algorithm and several cooperative and noncooperative replacement algorithms.

...read moreread less

Abstract: In a large-scale information system such as a digital library or the Web, a set of distributed caches can improve their effectiveness by coordinating their data placement decisions. Using simulation, we examine three practical cooperative placement algorithms, including one that is provably close to optimal, and we compare these algorithms to the optimal placement algorithm and several cooperative and noncooperative replacement algorithms. We draw five conclusions from these experiments: 1) cooperative placement can significantly improve performance compared to local replacement algorithms, particularly when the size of individual caches is limited compared to the universe of objects; 2) although the amortizing placement algorithm is only guaranteed to be within 14 times the optimal, in practice it seems to provide an excellent approximation of the optimal; 3) in a cooperative caching scenario, the recent greedy-dual local replacement algorithm performs much better than the other local replacement algorithms; 4) our hierarchical-greedy-dual replacement algorithm yields further improvements over the greedy-dual algorithm especially when there are idle caches in the system; and 5) a key challenge to coordinated placement algorithms is generating good predictions of access patterns based on past accesses.

...read moreread less

Journal Article•DOI•

Spatio-temporal predicates

[...]

Martin Erwig¹, Markus Schneider²•Institutions (2)

Oregon State University¹, FernUniversität Hagen²

01 Jul 2002-IEEE Transactions on Knowledge and Data Engineering

TL;DR: A framework is developed in which spatio-temporal predicates can be obtained by temporal aggregation of elementary spatial predicates and sequential composition, and two approaches are considered to consider possible transitions between spatial configurations.

...read moreread less

Abstract: Investigates temporal changes of topological relationships and thereby integrates two important research areas: first, 2D topological relationships that have been investigated quite intensively, and second, the change of spatial information over time We investigate spatio-temporal predicates, which describe developments of well-known spatial topological relationships A framework is developed in which spatio-temporal predicates can be obtained by temporal aggregation of elementary spatial predicates and sequential composition We compare our framework with two other possible approaches: one is based on the observation that spatio-temporal objects correspond to 3D spatial objects for which existing topological predicates can be exploited The other approach is to consider possible transitions between spatial configurations These considerations help to identify a canonical set of spatio-temporal predicates

...read moreread less

Journal Article•DOI•

A modified Chi2 algorithm for discretization

[...]

Francis E. H. Tay¹, Lixiang Shen¹•Institutions (1)

National University of Singapore¹

01 May 2002-IEEE Transactions on Knowledge and Data Engineering

TL;DR: The Chi2 algorithm, a modification to the ChiMerge method, automates the discretization process by introducing an inconsistency rate as the stopping criterion and it automatically selects the significance value and becomes a completely automatic discretized method.

...read moreread less

Abstract: Since the ChiMerge algorithm was first proposed by Kerber (1992), it has become a widely used and discussed discretization method. The Chi2 algorithm is a modification to the ChiMerge method. It automates the discretization process by introducing an inconsistency rate as the stopping criterion and it automatically selects the significance value. In addition, it adds a finer phase aimed at feature selection to broaden the applications of the ChiMerge algorithm. However, the Chi2 algorithm does not consider the inaccuracy inherent in ChiMerge's merging criterion. The user-defined inconsistency rate also brings about inaccuracy to the discretization process. These two drawbacks are first discussed in this paper and modifications to overcome them are then proposed. By comparison, results with the original Chi2 algorithm using C4.5, the modified Chi2 algorithm, performs better than the original Chi2 algorithm. It becomes a completely automatic discretization method.

...read moreread less

Journal Article•DOI•

Recovery from malicious transactions

[...]

Paul Ammann¹, Sushil Jajodia¹, Peng Liu²•Institutions (2)

George Mason University¹, Penn State College of Information Sciences and Technology²

01 Sep 2002-IEEE Transactions on Knowledge and Data Engineering

TL;DR: This work considers recovery from malicious but committed transactions in the database context and presents algorithms to restore only the damaged part of the database, using an information warfare perspective.

...read moreread less

Abstract: Preventive measures sometimes fail to deflect malicious attacks. We adopt an information warfare perspective, which assumes success by the attacker in achieving partial, but not complete, damage. In particular, we work in the database context and consider recovery from malicious but committed transactions. Traditional recovery mechanisms do not address this problem, except for complete rollbacks, which undo the work of benign transactions as well as malicious ones, and compensating transactions, whose utility depends on application semantics. Recovery is complicated by the presence of benign transactions that depend, directly or indirectly, on the malicious transactions. We present algorithms to restore only the damaged part of the database. We identify the information that needs to be maintained for such algorithms. The initial algorithms repair damage to quiescent databases; subsequent algorithms increase availability by allowing new transactions to execute concurrently with the repair process. Also, via a study of benchmarks, we show practical examples of how offline analysis can efficiently provide the necessary data to repair the damage of malicious transactions.

...read moreread less

Journal Article•DOI•

Clustering for approximate similarity search in high-dimensional spaces

[...]

Chen Li¹, Edward Y. Chang², Hector Garcia-Molina¹, Gio Wiederhold¹•Institutions (2)

Stanford University¹, University of California, Santa Barbara²

01 Jul 2002-IEEE Transactions on Knowledge and Data Engineering

TL;DR: A clustering and indexing paradigm for high-dimensional search spaces based on finding clusters and building a simple but efficient index for them, which can find near points with high recall in very few IOs and perform significantly better than other approaches.

...read moreread less

Abstract: We present a clustering and indexing paradigm (called Clindex) for high-dimensional search spaces. The scheme is designed for approximate similarity searches, where one would like to find many of the data points near a target point, but where one can tolerate missing a few near points. For such searches, our scheme can find near points with high recall in very few IOs and perform significantly better than other approaches. Our scheme is based on finding clusters and, then, building a simple but efficient index for them. We analyze the trade-offs involved in clustering and building such an index structure, and present extensive experimental results.

...read moreread less

Journal Article•DOI•

Mining sequential patterns with regular expression constraints

[...]

Minos Garofalakis¹, Rajeev Rastogi¹, Kyuseok Shim²•Institutions (2)

Bell Labs¹, Seoul National University²

01 May 2002-IEEE Transactions on Knowledge and Data Engineering

TL;DR: A family of novel algorithms for mining frequent sequential patterns that also satisfy user-specified RE constraints that provide valuable insights into the trade-offs that arise when constraints that do not subscribe to nice properties are integrated into the mining process.

...read moreread less

Abstract: Discovering sequential patterns is an important problem in data mining with a host of application domains including medicine, telecommunications, and the World Wide Web. Conventional sequential pattern mining systems provide users with only a very restricted mechanism (based on minimum support) for specifying patterns of interest. As a consequence, the pattern mining process is typically characterized by lack of focus and users often end up paying inordinate computational costs just to be inundated with an overwhelming number of useless results. We propose the use of Regular Expressions (REs) as a flexible constraint specification tool that enables user-controlled focus to be incorporated into the pattern mining process. We develop a family of novel algorithms (termed SPIRIT-Sequential Pattern mining with Regular expression consTraints) for mining frequent sequential patterns that also satisfy user-specified RE constraints. The main distinguishing factor among the proposed schemes is the degree to which the RE constraints are enforced to prune the search space of patterns during computation. Our solutions provide valuable insights into the trade-offs that arise when constraints that do not subscribe to nice properties (like anti monotonicity) are integrated into the mining process.

...read moreread less

Journal Article•DOI•

Finding localized associations in market basket data

[...]

Charu C. Aggarwal¹, Cecilia M. Procopiuc¹, Philip S. Yu¹•Institutions (1)

IBM¹

01 Jan 2002-IEEE Transactions on Knowledge and Data Engineering

TL;DR: Empirical results show that the method is indeed able to find a significantly larger number of associations than what can be discovered by analysis of the aggregate data.

...read moreread less

Abstract: In this paper, we discuss a technique for discovering localized associations in segments of the data using clustering. Often, the aggregate behavior of a data set may be very different from localized segments. In such cases, it is desirable to design algorithms which are effective in discovering localized associations because they expose a customer pattern which is more specific than the aggregate behavior. This information may be very useful for target marketing. We present empirical results which show that the method is indeed able to find a significantly larger number of associations than what can be discovered by analysis of the aggregate data.

...read moreread less

Journal Article•DOI•

A content-based authorization model for digital libraries

[...]

Nabil R. Adam¹, Vijayalakshmi Atluri, Elisa Bertino, Elena Ferrari•Institutions (1)

Rutgers University¹

01 Mar 2002-IEEE Transactions on Knowledge and Data Engineering

TL;DR: This work proposes a content-based authorization model that is suitable for a DL environment and features both content-dependent and content-independent access control to digital library objects, and the varying granularity of authorization objects ranging from sets of library objects to specific portions of objects.

...read moreread less

Abstract: Digital libraries (DLs) introduce several challenging requirements with respect to the formulation, specification and enforcement of adequate data protection policies. Unlike conventional database environments, a DL environment is typically characterized by a dynamic user population, often making accesses from remote locations, and by an extraordinarily large amount of multimedia information, stored in a variety of formats. Moreover, in a DL environment, access policies are often specified based on user qualifications and characteristics, rather than on user identity (e.g. a user can be given access to an R-rated video only if he/ she is more than 18 years old). Another crucial requirement is the support for content-dependent authorizations on digital library objects (e.g. all documents containing discussions on how to operate guns must be made available only to users who are 18 or older). Since traditional authorization models do not adequately meet the access control requirements typical of DLs, we propose a content-based authorization model that is suitable for a DL environment. Specifically, the most innovative features of our authorization model are: (1) flexible specification of authorizations based on the qualifications and (positive and negative) characteristics of users, (2) both content-dependent and content-independent access control to digital library objects, and (3) the varying granularity of authorization objects ranging from sets of library objects to specific portions of objects.

...read moreread less

Journal Article•DOI•

SemQuery: semantic clustering and querying on heterogeneous features for visual data

[...]

G. Sheikholeslami¹, W. Chang², Aidong Zhang³•Institutions (3)

Cisco Systems, Inc.¹, Indiana University², State University of New York System³

01 Sep 2002-IEEE Transactions on Knowledge and Data Engineering

TL;DR: In this article, a semantics-based clustering and indexing approach, termed SemQuery, is presented to support visual queries on heterogeneous features of images, where each semantic image cluster contains a set of subclusters that are represented by the heterogenous features that the images contain.

...read moreread less

Abstract: The effectiveness of content-based image retrieval can be enhanced using heterogeneous features embedded in the images. However, since the features in texture, color, and shape are generated using different computation methods and thus may require different similarity measurements, the integration of the retrievals on heterogeneous features is a nontrivial task. We present a semantics-based clustering and indexing approach, termed SemQuery, to support visual queries on heterogeneous features of images. Using this approach, the database images are classified based on their heterogeneous features. Each semantic image cluster contains a set of subclusters that are represented by the heterogeneous features that the images contain. An image is included in a semantic cluster if it falls within the scope of all the heterogeneous clusters of the semantic cluster. We also design a neural network model to merge the results of basic queries on individual features. A query processing strategy is then presented to support visual queries on heterogeneous features. An experimental analysis is conducted and presented to demonstrate the effectiveness and efficiency of the proposed approach.

...read moreread less

Journal Article•DOI•

Indexing and retrieval for genomic databases

[...]

Hugh E. Williams¹, Justin Zobel²•Institutions (2)

Melbourne Institute of Technology¹, RMIT University²

01 Jan 2002-IEEE Transactions on Knowledge and Data Engineering

TL;DR: It is shown experimentally that the indexed approach results in significant savings in computationally intensive local alignments and that index-based searching is as accurate as existing exhaustive search schemes.

...read moreread less

Abstract: Genomic sequence databases are widely used by molecular biologists for homology searching. Amino acid and nucleotide databases are increasing in size exponentially, and mean sequence lengths are also increasing. In searching such databases, it is desirable to use heuristics to perform computationally intensive local alignments on selected sequences and to reduce the costs of the alignments that are attempted. We present an index-based approach for both selecting sequences that display broad similarity to a query and for fast local alignment. We show experimentally that the indexed approach results in significant savings in computationally intensive local alignments and that index-based searching is as accurate as existing exhaustive search schemes.

...read moreread less

Journal Article•DOI•

Weighted fuzzy reasoning using weighted fuzzy Petri nets

[...]

Shyi-Ming Chen¹•Institutions (1)

National Taiwan University of Science and Technology¹

01 Mar 2002-IEEE Transactions on Knowledge and Data Engineering

TL;DR: The proposed weighted fuzzy reasoning algorithm can allow the rule-based systems to perform fuzzy reasoning in a more flexible and more intelligent manner.

...read moreread less

Abstract: This paper presents a Weighted Fuzzy Petri Net model (WFPN) and proposes a weighted fuzzy reasoning algorithm for rule-based systems based on Weighted Fuzzy Petri Nets. The fuzzy production rules in the knowledge base of a rule-based system are modeled by Weighted Fuzzy Petri Nets, where the truth values of the propositions appearing in the fuzzy production rules and the certainty factors of the rules are represented by fuzzy numbers. Furthermore, the weights of the propositions appearing in the rules are also represented by fuzzy numbers. The proposed weighted fuzzy reasoning algorithm can allow the rule-based systems to perform fuzzy reasoning in a more flexible and more intelligent manner.

...read moreread less

Journal Article•DOI•

ImageMap: an image indexing method based on spatial similarity

[...]

Euripides G. M. Petrakis¹, Christos Faloutsos², King-Ip Lin³•Institutions (3)

University of Crete¹, Carnegie Mellon University², University of Memphis³

01 Sep 2002-IEEE Transactions on Knowledge and Data Engineering

TL;DR: The most general image content representation is adopted, that is, Attributed Relational Graphs (ARGs), in conjunction with the well-accepted ARG editing distance on ARGs, as a method for indexing and similarity searching in image databases (IDBs).

...read moreread less

Abstract: We introduce ImageMap, as a method for indexing and similarity searching in image databases (IDBs). ImageMap answers "queries by example" involving any number of objects or regions and taking into account their interrelationships. We adopt the most general image content representation, that is, Attributed Relational Graphs (ARGs), in conjunction with the well-accepted ARG editing distance on ARGs. We tested ImageMap on real and realistic medical images. Our method not only provides for visualization of the data set, clustering and data mining, but it also achieves up to 1,000-fold speed-up in search over sequential scanning, with zero or very few false dismissals.

...read moreread less

Journal Article•DOI•

Pincer-search: an efficient algorithm for discovering the maximum frequent set

[...]

Dao-I Lin¹, Zvi M. Kedem²•Institutions (2)

Redback Networks¹, New York University²

01 May 2002-IEEE Transactions on Knowledge and Data Engineering

TL;DR: A new algorithm is presented which combines both the bottom-up and the top-down searches for finding frequent itemsets, and does not require explicit examination of every frequent itemset.

...read moreread less

Abstract: Discovering frequent itemsets is a key problem in important data mining applications, such as the discovery of association rules, strong rules, episodes, and minimal keys. Typical algorithms for solving this problem operate in a bottom-up, breadth-first search direction. The computation starts from frequent 1-itemsets (the minimum length frequent itemsets) and continues until all maximal (length) frequent itemsets are found. During the execution, every frequent itemset is explicitly considered. Such algorithms perform well when all maximal frequent itemsets are short. However, performance drastically deteriorates when some of the maximal frequent itemsets are long. We present a new algorithm which combines both the bottom-up and the top-down searches. The primary search direction is still bottom-up, but a restricted search is also conducted in the top-down direction. This search is used only for maintaining and updating a new data structure, the maximum frequent candidate set. It is used to prune early candidates that would be normally encountered in the bottom-up search. A very important characteristic of the algorithm is that it does not require explicit examination of every frequent itemset. We evaluate the performance of the algorithm using well-known synthetic benchmark databases, real-life census, and stock market databases.

...read moreread less

Journal Article•DOI•

Redefining clustering for high-dimensional applications

[...]

Charu C. Aggarwal¹, Philip S. Yu•Institutions (1)

IBM¹

01 Mar 2002-IEEE Transactions on Knowledge and Data Engineering

TL;DR: This work introduces a very general concept of projected clustering which is able to construct clusters in arbitrarily aligned subspaces of lower dimensionality, and provides a new concept of using extended cluster feature vectors in order to make the algorithm scalable for very large databases.

...read moreread less

Abstract: Clustering problems are well-known in the database literature for their use in numerous applications, such as customer segmentation, classification, and trend analysis. High-dimensional data has always been a challenge for clustering algorithms because of the inherent sparsity of the points. Recent research results indicate that, in high-dimensional data, even the concept of proximity or clustering may not be meaningful. We introduce a very general concept of projected clustering which is able to construct clusters in arbitrarily aligned subspaces of lower dimensionality. The subspaces are specific to the clusters themselves. This definition is substantially more general and realistic than the currently available techniques which limit the method to only projections from the original set of attributes. The generalized projected clustering technique may also be viewed as a way of trying to redefine clustering for high-dimensional applications by searching for hidden subspaces with clusters which are created by interattribute correlations. We provide a new concept of using extended cluster feature vectors in order to make the algorithm scalable for very large databases. The running time and space requirements of the algorithm are adjustable and are likely to trade-off with better accuracy.

...read moreread less

Journal Article•DOI•

Causal maps: theory, implementation, and practical applications in multiagent environments

[...]

Brahim Chaib-draa¹•Institutions (1)

Laval University¹

01 Nov 2002-IEEE Transactions on Knowledge and Data Engineering

TL;DR: This work proposes a formal model for causal maps with a precise semantics based on relational algebra and investigates the issue of using this tool in multiagent environments by explaining through different examples how and why this tool is useful for the following aspects.

...read moreread less

Abstract: Analytical techniques are generally inadequate for dealing with causal interrelationships among a set of individual and social concepts. Usually, causal maps are used to cope with this type of interrelationships. However, the classical view of causal maps is based on an intuitive view with ad hoc rules and no precise semantics of the primitive concepts, nor a sound formal treatment of relations between concepts. We solve this problem by proposing a formal model for causal maps with a precise semantics based on relational algebra and the software tool, CM-RELVIEW, in which it has been implemented. Then, we investigate the issue of using this tool in multiagent environments by explaining through different examples how and why this tool is useful for the following aspects: 1) the reasoning on agents' subjective views, 2) the qualitative distributed decision making, and 3) the organization of agents considered as a holistic approach. For each of these aspects, we focus on the computational mechanisms developed within CM-RELVIEW to support it.

...read moreread less

Journal Article•DOI•

An ElGamal-like cryptosystem for enciphering large messages

[...]

Min-Shiang Hwang¹, Chin-Chen Chang², Kuo-Feng Hwang²•Institutions (2)

Chaoyang University of Technology¹, National Chung Cheng University²

01 Mar 2002-IEEE Transactions on Knowledge and Data Engineering

TL;DR: This article proposes a simple cryptosystem which allows a large confidential message to be encrypted efficiently and is based on the Diffie-Hellman distribution scheme, together with the ElGamal cryptosSystem.

...read moreread less

Abstract: In practice, we usually require two cryptosystems, an asymmetric one and a symmetric one, to encrypt a large confidential message. The asymmetric cryptosystem is used to deliver secret key SK, while the symmetric cryptosystem is used to encrypt a large confidential message with the secret key SK. In this article, we propose a simple cryptosystem which allows a large confidential message to be encrypted efficiently. Our scheme is based on the Diffie-Hellman distribution scheme, together with the ElGamal cryptosystem.

...read moreread less

Journal Article•DOI•

A distance-based approach to entity reconciliation in heterogeneous databases

[...]

Debabrata Dey¹, Sumit Sarkar², Prabuddha De²•Institutions (2)

University of Washington¹, University of Texas at Dallas²

01 May 2002-IEEE Transactions on Knowledge and Data Engineering

TL;DR: A decision-theoretic model to resolve the entity heterogeneity problem, which arises when the same real-world entity type is represented using different identifiers in different applications, is proposed and shown to be computationally efficient.

...read moreread less

Abstract: In modern organizations, decision makers must often be able to quickly access information from diverse sources in order to make timely decisions. A critical problem facing many such organizations is the inability to easily reconcile the information contained in heterogeneous data sources. To overcome this limitation, an organization must resolve several types of heterogeneity problems that may exist across different sources. We examine one such problem called the entity heterogeneity problem, which arises when the same real-world entity type is represented using different identifiers in different applications. A decision-theoretic model to resolve the problem is proposed. Our model uses a distance measure to express the similarity between two entity instances. We have implemented the model and tested it on real-world data. The results indicate that the model performs quite well in terms of its ability to predict whether two entity instances should be matched or not. The model is shown to be computationally efficient. It also scales well to large relations from the perspective of the accuracy of prediction. Overall, the test results imply that this is certainly a viable approach in practical situations.

...read moreread less

Journal Article•DOI•

Mining optimized association rules with categorical and numeric attributes

[...]

Rajeev Rastogi¹, Kyuseok Shim²•Institutions (2)

Bell Labs¹, KAIST²

01 Jan 2002-IEEE Transactions on Knowledge and Data Engineering

TL;DR: The generalized association rules enable us to extract more useful information about seasonal and local patterns involving multiple attributes and present effective techniques for pruning the search space when computing optimized association rules for both categorical and numeric attributes.

...read moreread less

Abstract: Mining association rules on large data sets has received considerable attention in recent years. Association rules are useful for determining correlations between attributes of a relation and have applications in marketing, financial, and retail sectors. Furthermore, optimized association rules are an effective way to focus on the most interesting characteristics involving certain attributes. Optimized association rules are permitted to contain uninstantiated attributes and the problem is to determine instantiations such that either the support or confidence of the rule is maximized. In this paper, we generalize the optimized association rules problem in three ways: (1) association rules are allowed to contain disjunctions over uninstantiated attributes, (2) association rules are permitted to contain an arbitrary number of uninstantiated attributes, and (3) uninstantiated attributes can be either categorical or numeric. Our generalized association rules enable us to extract more useful information about seasonal and local patterns involving multiple attributes. We present effective techniques for pruning the search space when computing optimized association rules for both categorical and numeric attributes. Finally, we report the results of our experiments that indicate that our pruning algorithms are efficient for a large number of uninstantiated attributes, disjunctions, and values in the domain of the attributes.

...read moreread less

Journal Article•DOI•

The EVE approach: view synchronization in dynamic distributed environments

[...]

A.J. Lee, Anisoara Nica, Elke A. Rundensteiner¹•Institutions (1)

Worcester Polytechnic Institute¹

01 Sep 2002-IEEE Transactions on Knowledge and Data Engineering

TL;DR: This paper is the first to introduce and address the problem of schema changes of ISs, while previous work in this area, such as incremental view maintenance, has mainly dealt with data changes at ISs.

...read moreread less

Abstract: The construction and maintenance of data warehouses (views) in large-scale environments composed of numerous distributed and evolving information sources (ISs) such as the WWW has received great attention recently. Such environments are plagued with changing information because ISs tend to continuously evolve by modifying not only their content but also their query capabilities and interface and by joining or leaving the environment at any time. We are the first to introduce and address the problem of schema changes of ISs, while previous work in this area, such as incremental view maintenance, has mainly dealt with data changes at ISs. We outline our solution approach to this challenging new problem of how to adapt views in such evolving environments. We identify a new view adaptation problem for view evolution in the context of ISs schema changes, which we call view synchronization. We also outline the Evolvable View Environment (EVE) approach that we propose as framework for solving the view synchronization problem, along with our decisions concerning the key design issues surrounding EVE. The main contributions of this paper are: 1) we provide an E-SQL view definition language with which the view definer can direct the view evolution process, 2) we introduce a model for information source description which allows a large class of ISs to participate in our system dynamically, 3) we formally define what constitutes a legal view rewriting, 4) we develop replacement strategies for affected view components which are designed to meet the preferences expressed by E-SQL, 5) we prove the correctness of the replacement strategies, and 6) we provide a set of view synchronization algorithms based on those strategies. A prototype of our EVE system has successfully been built using Java, JDBC, Oracle, and MS Access.

...read moreread less

Journal Article•DOI•

Transaction processing in mobile, heterogeneous database systems

[...]

J.B. Lim, A.R. Hurson¹•Institutions (1)

Pennsylvania State University¹

01 Nov 2002-IEEE Transactions on Knowledge and Data Engineering

TL;DR: The proposed concurrency control algorithm, v-lock, uses global locking tables created with semantic information contained within the hierarchy that are used to serialize global transactions, detect and remove global deadlocks.

...read moreread less

Abstract: As technological advances are made in software and hardware, the feasibility of accessing information "any time, anywhere" is becoming a reality. Furthermore, the diversity and amount of information available to a given user is increasing at a rapid rate. In a mobile computing environment, a potentially large number of users may simultaneously access the global data; therefore, there is a need to provide a means to allow concurrent management of transactions. Current multidatabase concurrency control schemes do not address the limited bandwidth and frequent disconnection associated with wireless networks. This paper proposes a new hierarchical concurrency control algorithm. The proposed concurrency control algorithm, v-lock, uses global locking tables created with semantic information contained within the hierarchy. The locking tables are used to serialize global transactions, detect and remove global deadlocks. Additionally, data replication, at the mobile unit, is used to limit the effects of the restrictions imposed by a mobile environment. The replicated data provides additional availability in case of a weak connection or disconnection. Current research has concentrated on page and file-based caching or replication schemes to address the availability and consistency issues in a mobile environment. In a mobile, multidatabase environment, local autonomy restrictions prevent the use of a page or file-based data replication scheme. This paper proposes a new data replication scheme to address the limited bandwidth and local autonomy restrictions. Queries and the associated data are cached at the mobile unit as a complete object. Consistency is maintained by using a parity-based invalidation scheme. A simple prefetching scheme is used in conjunction with caching to further improve the effectiveness of the proposed scheme. Finally, a simulator was developed to evaluate the performance of the proposed algorithms. The simulation results are presented and discussed.

...read moreread less