
Showing papers in "IEEE Transactions on Knowledge and Data Engineering in 2001"


Journal ArticleDOI
TL;DR: This paper addresses the problem of releasing microdata while safeguarding the anonymity of respondents to which the data refer and introduces the concept of minimal generalization that captures the property of the release process not distorting the data more than needed to achieve k-anonymity.
Abstract: Today's globally networked society places great demands on the dissemination and sharing of information. While in the past released information was mostly in tabular and statistical form, many situations call for the release of specific data (microdata). In order to protect the anonymity of the entities (called respondents) to which information refers, data holders often remove or encrypt explicit identifiers such as names, addresses, and phone numbers. Deidentifying data, however, provides no guarantee of anonymity. Released information often contains other data, such as race, birth date, sex, and ZIP code, that can be linked to publicly available information to reidentify respondents and infer information that was not intended for disclosure. In this paper we address the problem of releasing microdata while safeguarding the anonymity of respondents to which the data refer. The approach is based on the definition of k-anonymity. A table provides k-anonymity if attempts to link explicitly identifying information to its content map the information to at least k entities. We illustrate how k-anonymity can be provided without compromising the integrity (or truthfulness) of the information released by using generalization and suppression techniques. We introduce the concept of minimal generalization that captures the property of the release process not distorting the data more than needed to achieve k-anonymity, and present an algorithm for the computation of such a generalization. We also discuss possible preference policies to choose among different minimal generalizations.
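
A minimal sketch of the idea (not the authors' algorithm): generalize the quasi-identifiers step by step and stop at the first point of a small generalization lattice at which every combination occurs at least k times. The table, column names, and generalization hierarchy below are invented for illustration.

```python
from collections import Counter
from itertools import product

# Hypothetical microdata; the quasi-identifiers are ZIP code and birth year.
rows = [
    ("47677", 1972), ("47602", 1973), ("47678", 1972),
    ("47905", 1964), ("47909", 1965), ("47906", 1967),
]

def generalize(zip_code, year, zip_level, year_level):
    """Truncate ZIP digits and widen the birth-year bucket as the levels grow."""
    z = zip_code[:5 - zip_level] + "*" * zip_level
    bucket = (1, 5, 10, 20)[year_level]
    return z, (year // bucket) * bucket

def is_k_anonymous(table, k):
    return all(count >= k for count in Counter(table).values())

def minimal_generalization(rows, k):
    # Enumerate points of the generalization lattice in order of total level and
    # return the first (least distorting) one that achieves k-anonymity.
    for zl, yl in sorted(product(range(6), range(4)), key=sum):
        table = [generalize(z, y, zl, yl) for z, y in rows]
        if is_k_anonymous(table, k):
            return (zl, yl), table
    return None

print(minimal_generalization(rows, k=2))
```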

2,291 citations


Journal ArticleDOI
TL;DR: This work analyzes the clustering property of the Hilbert space-filling curve by deriving closed-form formulas for the number of clusters in a given query region of an arbitrary shape and shows that the Hilbert curve achieves better clustering than the z curve.
Abstract: Several schemes for the linear mapping of a multidimensional space have been proposed for various applications, such as access methods for spatio-temporal databases and image compression. In these applications, one of the most desired properties from such linear mappings is clustering, which means the locality between objects in the multidimensional space being preserved in the linear space. It is widely believed that the Hilbert space-filling curve achieves the best clustering (Abel and Mark, 1990; Jagadish, 1990). We analyze the clustering property of the Hilbert space-filling curve by deriving closed-form formulas for the number of clusters in a given query region of an arbitrary shape (e.g., polygons and polyhedra). Both the asymptotic solution for the general case and the exact solution for a special case generalize previous work. They agree with the empirical results that the number of clusters depends on the hypersurface area of the query region and not on its hypervolume. We also show that the Hilbert curve achieves better clustering than the z curve. From a practical point of view, the formulas given provide a simple measure that can be used to predict the required disk access behaviors and, hence, the total access time.
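
To make the clustering notion concrete, the sketch below maps grid cells to positions on the Hilbert and z-order curves and counts the contiguous runs (clusters) covered by a query rectangle. The grid size and query window are arbitrary choices, and the closed-form formulas of the paper are not reproduced here.

```python
def hilbert_index(n, x, y):
    """Position of cell (x, y) of an n x n grid (n a power of two) along the
    Hilbert space-filling curve (standard bit-manipulation conversion)."""
    d, s = 0, n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:                      # rotate/flip the quadrant
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d

def z_index(n, x, y):
    """Z-order (bit-interleaving) index of the same cell, for comparison."""
    z = 0
    for i in range(n.bit_length()):
        z |= ((x >> i) & 1) << (2 * i) | ((y >> i) & 1) << (2 * i + 1)
    return z

def clusters(curve, n, x0, y0, x1, y1):
    """Number of contiguous curve segments (clusters) covering a query rectangle."""
    keys = sorted(curve(n, x, y) for x in range(x0, x1 + 1) for y in range(y0, y1 + 1))
    return 1 + sum(1 for a, b in zip(keys, keys[1:]) if b != a + 1)

# A 3 x 4 query window on a 32 x 32 grid; fewer clusters means fewer non-contiguous
# disk reads when the linear order is used as the storage order.
print(clusters(hilbert_index, 32, 5, 9, 7, 12), clusters(z_index, 32, 5, 9, 7, 12))
```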

740 citations


Journal ArticleDOI
TL;DR: This work develops a family of algorithms for mining association rules when the items of interest occur infrequently, employing a combination of random sampling and hashing techniques, and provides an analysis of the algorithms developed along with experiments on real and synthetic data for a comparative performance analysis.
Abstract: Association-rule mining has heretofore relied on the condition of high support to do its work efficiently. In particular, the well-known a priori algorithm is only effective when the only rules of interest are relationships that occur very frequently. However, there are a number of applications, such as data mining, identification of similar Web documents, clustering, and collaborative filtering, where the rules of interest have comparatively few instances in the data. In these cases, we must look for highly correlated items, or possibly even causal relationships between infrequent items. We develop a family of algorithms for solving this problem, employing a combination of random sampling and hashing techniques. We provide analysis of the algorithms developed and conduct experiments on real and synthetic data to obtain a comparative performance analysis.
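
One standard hashing technique in this spirit is min-hashing, which estimates the Jaccard similarity of item columns without any high-support requirement. The sketch below is an illustration over invented toy data, not the authors' specific algorithms.

```python
import random
from itertools import combinations

# Hypothetical transactions (baskets of item ids); supports are deliberately low.
baskets = [{"a", "b"}, {"a", "b", "c"}, {"c", "d"}, {"a", "b"}, {"d"}, {"a", "b", "d"}]
items = sorted(set().union(*baskets))

def minhash_signatures(baskets, items, num_hashes=100, seed=0):
    """For each random permutation of the baskets, record per item the position of
    the first basket (in permuted order) that contains it."""
    rng = random.Random(seed)
    sigs = {item: [] for item in items}
    for _ in range(num_hashes):
        order = list(range(len(baskets)))
        rng.shuffle(order)
        for item in items:
            sigs[item].append(next(pos for pos, b in enumerate(order) if item in baskets[b]))
    return sigs

def estimated_jaccard(sigs, i, j):
    # The fraction of permutations on which two min-hashes agree estimates the
    # Jaccard similarity of the two items' sets of baskets.
    return sum(a == b for a, b in zip(sigs[i], sigs[j])) / len(sigs[i])

sigs = minhash_signatures(baskets, items)
pairs = sorted(((estimated_jaccard(sigs, i, j), i, j) for i, j in combinations(items, 2)),
               reverse=True)
print(pairs[:3])   # highly correlated pairs, found without a high-support requirement
```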

370 citations


Journal ArticleDOI
TL;DR: An affinity based unification method for global view construction is proposed and experiences of applying the proposed unification method and the associated tool environment ARTEMIS on databases of the Italian Public Administration information systems are described.
Abstract: The problem of defining global views of heterogeneous data sources to support querying and cooperation activities is becoming more and more important due to the availability of multiple data sources within complex organizations and in global information systems. Global views are defined to provide a unified representation of the information in the different sources by analyzing conceptual schemas associated with them and resolving possible semantic heterogeneity. We propose an affinity based unification method for global view construction. In the method: (1) the concept of affinity is introduced to assess the level of semantic relationship between elements in different schemas by taking into account semantic heterogeneity; (2) schema elements are classified by affinity levels using clustering procedures so that their different representations can be analyzed for unification; (3) global views are constructed starting from selected clusters by unifying representations of their elements. Experiences of applying the proposed unification method and the associated tool environment ARTEMIS on databases of the Italian Public Administration information systems are described.

344 citations


Journal ArticleDOI
TL;DR: This work presents a comprehensive survey of the various approaches to the problem of storing, querying, and updating the location of objects in mobile computing, identifying the fundamental techniques underlying the proposed approaches along various dimensions.
Abstract: In current distributed systems, the notion of mobility is emerging in many forms and applications. Mobility arises naturally in wireless computing since the location of users changes as they move. Besides mobility in wireless computing, software mobile agents are another popular form of moving objects. Locating objects, i.e., identifying their current location, is central to mobile computing. We present a comprehensive survey of the various approaches to the problem of storing, querying, and updating the location of objects in mobile computing. The fundamental techniques underlying the proposed approaches are identified, analyzed, and classified along various dimensions.

276 citations


Journal ArticleDOI
TL;DR: A new practical delivery technique is proposed, called hierarchical multicast stream merging (HMSM), whose required server bandwidth is lower than that of the partitioned dynamic skyscraper and is reasonably close to the minimum achievable server bandwidth over a wide range of client request rates.
Abstract: Two recent techniques for multicast or broadcast delivery of streaming media can provide immediate service to each client request, yet achieve considerable client stream sharing which leads to significant server and network bandwidth savings. The paper considers: 1) how well these recently proposed techniques perform relative to each other and 2) whether there are new practical delivery techniques that can achieve better bandwidth savings than the previous techniques over a wide range of client request rates. The principal results are as follows: First, the recent partitioned dynamic skyscraper technique is adapted to provide immediate service to each client request more simply and directly than the original dynamic skyscraper method. Second, at moderate to high client request rates, the dynamic skyscraper method has a required server bandwidth that is significantly lower than that of the recent optimized stream tapping/patching/controlled multicast technique. Third, the minimum required server bandwidth for any delivery technique that provides immediate real-time delivery to clients increases logarithmically (with constant factor equal to one) as a function of the client request arrival rate. Furthermore, it is (theoretically) possible to achieve very close to the minimum required server bandwidth if client receive bandwidth is equal to two times the data streaming rate and client storage capacity is sufficient for buffering data from shared streams. Finally, we propose a new practical delivery technique, called hierarchical multicast stream merging (HMSM), which has a required server bandwidth that is lower than that of the partitioned dynamic skyscraper and is reasonably close to the minimum achievable required server bandwidth over a wide range of client request rates.
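
The logarithmic bound can be stated concretely. The following is a hedged reconstruction of the standard argument for Poisson request arrivals of rate lambda and a media object of duration T; the notation is ours and is not quoted from the paper.

```latex
% Hedged reconstruction (our notation): with Poisson request arrivals of rate
% \lambda, the data at playback offset x requested at time t must be delivered
% by t + x, so one multicast of that offset can be shared only by requests
% arriving within a window of length x; the expected transmission rate for
% offset x is therefore at least \lambda/(\lambda x + 1). Integrating over the object:
\[
  B_{\min} \;\ge\; \int_{0}^{T} \frac{\lambda}{\lambda x + 1}\, dx
           \;=\; \ln(\lambda T + 1) \;=\; \ln(N + 1),
  \qquad N = \lambda T,
\]
% i.e., the required server bandwidth (in units of the streaming rate) grows
% logarithmically in the request arrival rate with constant factor one,
% consistent with the statement in the abstract.
```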

225 citations


Journal ArticleDOI
TL;DR: It is shown how the Hausdorff and Correlation fractal dimensions of a data set can yield extremely accurate formulas that can predict the I/O performance to within one standard deviation on multiple real and synthetic data sets.
Abstract: Spatial queries in high-dimensional spaces have been studied extensively. Among them, nearest neighbor queries are important in many settings, including spatial databases (Find the k closest cities) and multimedia databases (Find the k most similar images). Previous analyses have concluded that nearest-neighbor search is hopeless in high dimensions due to the notorious "curse of dimensionality". We show that this may be overpessimistic. We show that what determines the search performance (at least for R-tree-like structures) is the intrinsic dimensionality of the data set and not the dimensionality of the address space (referred to as the embedding dimensionality). The typical (and often implicit) assumption in many previous studies is that the data is uniformly distributed, with independence between attributes. However, real data sets overwhelmingly disobey these assumptions; rather, they typically are skewed and exhibit intrinsic ("fractal") dimensionalities that are much lower than their embedding dimension, e.g. due to subtle dependencies between attributes. We show how the Hausdorff and Correlation fractal dimensions of a data set can yield extremely accurate formulas that can predict the I/O performance to within one standard deviation on multiple real and synthetic data sets.
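
As a rough illustration of how an intrinsic ("fractal") dimensionality can be measured, the sketch below estimates the correlation fractal dimension as the slope of log(sum of squared grid-cell occupancies) versus log(cell side). The data set (a noisy line embedded in 3D) is invented, and the paper's I/O cost formulas are not reproduced.

```python
import numpy as np

def correlation_dimension(points, radii):
    """Estimate the correlation fractal dimension D2 as the slope of
    log(sum of squared grid-cell occupancies) versus log(cell side)."""
    points = np.asarray(points, dtype=float)
    lo, hi = points.min(axis=0), points.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)
    logs = []
    for r in radii:
        cells = np.floor((points - lo) / (span * r)).astype(int)
        _, counts = np.unique(cells, axis=0, return_counts=True)
        p = counts / counts.sum()
        logs.append((np.log(r), np.log(np.sum(p ** 2))))
    xs, ys = zip(*logs)
    slope, _ = np.polyfit(xs, ys, 1)
    return slope

# A noisy line embedded in 3D: embedding dimension 3, intrinsic dimension ~1.
t = np.random.default_rng(0).random(5000)
line3d = np.c_[t, 2 * t, -t] + 0.001 * np.random.default_rng(1).normal(size=(5000, 3))
print(correlation_dimension(line3d, radii=[0.5, 0.25, 0.125, 0.0625]))  # should be close to 1
```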

217 citations


Journal ArticleDOI
TL;DR: It is argued that images don't have an intrinsic meaning, but that they are endowed with a meaning by placing them in the context of other images and by the user interaction.
Abstract: In this paper, we briefly discuss some aspects of image semantics and the role that it plays for the design of image databases. We argue that images don't have an intrinsic meaning, but that they are endowed with a meaning by placing them in the context of other images and by the user interaction. From this observation, we conclude that, in an image database, users should be allowed to manipulate not only the individual images, but also the relations between them. We present an interface model based on the manipulation of configurations of images.

215 citations


Journal ArticleDOI
TL;DR: Novel techniques that help significantly reduce the set of statistics that need to be created without sacrificing the quality of query plans generated are introduced.
Abstract: Statistics play a key role in influencing the quality of plans chosen by a database query optimizer. In this paper, we identify the statistics that are essential for an optimizer. We introduce novel techniques that help significantly reduce the set of statistics that need to be created without sacrificing the quality of query plans generated. We discuss how these techniques can be leveraged to automate statistics management in databases. We have implemented and experimentally evaluated our approach on Microsoft SQL Server 7.0.

210 citations


Journal ArticleDOI
TL;DR: A new dimension, called the data span dimension, is introduced, which allows user-defined selections of a temporal subset of the database, and a generic algorithm is described that takes any traditional incremental model maintenance algorithm and transforms it into an algorithm that allows restrictions on the data span dimension.
Abstract: Data mining algorithms have been the focus of much research. In practice, the input data to a data mining process resides in a large data warehouse whose data is kept up-to-date through periodic or occasional addition and deletion of blocks of data. Most data mining algorithms have either assumed that the input data is static, or have been designed for arbitrary insertions and deletions of data records. We consider a dynamic environment that evolves through systematic addition or deletion of blocks of data. We introduce a new dimension, called the data span dimension, which allows user-defined selections of a temporal subset of the database. Taking this new degree of freedom into account, we describe efficient model maintenance algorithms for frequent item sets and clusters. We then describe a generic algorithm that takes any traditional incremental model maintenance algorithm and transforms it into an algorithm that allows restrictions on the data span dimension. We also develop an algorithm for automatically discovering a specific class of interesting block selection sequences. In a detailed experimental study, we examine the validity and performance of our ideas on synthetic and real datasets.
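
A toy sketch of the block-oriented view (not the paper's algorithms): per-block summaries are maintained incrementally, and a query over a user-selected data span combines only the summaries of the selected blocks. The block contents and support threshold below are invented.

```python
from collections import Counter

# Hypothetical blocks of transactions arriving over time (e.g., one block per day).
blocks = [
    [{"a", "b"}, {"a"}, {"b", "c"}],          # block 0
    [{"a", "b"}, {"a", "c"}],                 # block 1
    [{"c"}, {"b", "c"}, {"a", "b", "c"}],     # block 2
]

# Incremental maintenance: one summary (item counts) per block, so blocks can be
# added or dropped without revisiting the raw transactions.
block_counts = [Counter(item for t in blk for item in t) for blk in blocks]
block_sizes = [len(blk) for blk in blocks]

def frequent_items(selected_blocks, min_support=0.5):
    """Frequent items restricted to a user-selected span of blocks."""
    total = sum(block_sizes[i] for i in selected_blocks)
    counts = sum((block_counts[i] for i in selected_blocks), Counter())
    return {item: c / total for item, c in counts.items() if c / total >= min_support}

print(frequent_items([0, 1]))   # data span: only the first two blocks
print(frequent_items([1, 2]))   # a different temporal subset of the database
```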

140 citations


Journal ArticleDOI
TL;DR: In this paper, linear relational embedding is introduced as a means of learning a distributed representation of concepts from data consisting of binary relations between these concepts, and the operation of applying a relation to a concept as a matrix-vector multiplication that produces an approximation to the related concept is learned by maximizing an appropriate discriminative goodness function using gradient ascent.
Abstract: We introduce linear relational embedding as a means of learning a distributed representation of concepts from data consisting of binary relations between these concepts. The key idea is to represent concepts as vectors, binary relations as matrices, and the operation of applying a relation to a concept as a matrix-vector multiplication that produces an approximation to the related concept. A representation for concepts and relations is learned by maximizing an appropriate discriminative goodness function using gradient ascent. On a task involving family relationships, learning is fast and leads to good generalization.
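
A small numeric sketch of the stated idea, with invented family-style triples, a low embedding dimension, and gradient ascent on a softmax-over-negative-squared-distances goodness as an assumed stand-in for the paper's exact objective.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy relational data (invented): (concept, relation, related concept) triples.
concepts = ["alice", "bob", "carol", "dave"]
relations = ["parent_of", "child_of"]
triples = [("alice", "parent_of", "bob"), ("bob", "child_of", "alice"),
           ("carol", "parent_of", "dave"), ("dave", "child_of", "carol")]

dim = 4
V = {c: 0.1 * rng.normal(size=dim) for c in concepts}          # concepts as vectors
M = {r: 0.1 * rng.normal(size=(dim, dim)) for r in relations}  # relations as matrices

lr = 0.05
for _ in range(2000):
    for a, r, b in triples:
        p = M[r] @ V[a]                          # applying a relation: matrix-vector product
        d = {c: np.sum((p - V[c]) ** 2) for c in concepts}
        q = np.array([np.exp(-d[c]) for c in concepts])
        q /= q.sum()                             # softmax over negative squared distances
        vbar = sum(qc * V[c] for qc, c in zip(q, concepts))
        g_p = 2 * (V[b] - vbar)                  # gradient of log q[b] w.r.t. the prediction p
        M[r] += lr * np.outer(g_p, V[a])
        V[a] += lr * (M[r].T @ g_p)
        for qc, c in zip(q, concepts):           # pull the target toward p, push the rest away
            V[c] += lr * (2 * (1 - qc) * (p - V[c]) if c == b else -2 * qc * (p - V[c]))

pred = M["parent_of"] @ V["alice"]
print(min(concepts, key=lambda c: np.linalg.norm(V[c] - pred)))   # expected: bob
```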

Journal ArticleDOI
Charu C. Aggarwal, Philip S. Yu
TL;DR: The problem of online mining of association rules in a large database of sales transactions is discussed, with the use of nonredundant association rules helping significantly in the reduction of irrelevant noise in the data mining process.
Abstract: We discuss the problem of online mining of association rules in a large database of sales transactions. The online mining is performed by preprocessing the data effectively in order to make it suitable for repeated online queries. We store the preprocessed data in such a way that online processing may be done by applying a graph theoretic search algorithm whose complexity is proportional to the size of the output. The result is an online algorithm which is independent of the size of the transactional data and the size of the preprocessed data. The algorithm is almost instantaneous in the size of the output. The algorithm also supports techniques for quickly discovering association rules from large itemsets. The algorithm is capable of finding rules with specific items in the antecedent or consequent. These association rules are presented in a compact form, eliminating redundancy. The use of nonredundant association rules helps significantly in the reduction of irrelevant noise in the data mining process.

Journal ArticleDOI
TL;DR: This work proposes a generic framework, called the parametric framework, as a unifying umbrella for IB frameworks, and develops the declarative, fixpoint, and proof-theoretic semantics of programs in this framework and shows their equivalence.
Abstract: Numerous frameworks have been proposed in recent years for deductive databases with uncertainty. On the basis of how uncertainty is associated with the facts and rules in a program, we classify these frameworks into implication-based (IB) and annotation-based (AB) frameworks. We take the IB approach and propose a generic framework, called the parametric framework, as a unifying umbrella for IB frameworks. We develop the declarative, fixpoint, and proof-theoretic semantics of programs in our framework and show their equivalence. Using the framework as a basis, we then study the query optimization problem of containment of conjunctive queries in this framework and establish necessary and sufficient conditions for containment for several classes of parametric conjunctive queries. Our results yield tools for use in the query optimization for large classes of query programs in IB deductive databases with uncertainty.

Journal ArticleDOI
TL;DR: This paper presents a comprehensive and detailed framework for characterizing problem solving methods and their development process and suggests that PSM development consists of introducing assumptions and commitments along a three-dimensional space defined in terms of problem-solving strategy, task commitments, and domain (knowledge) assumptions.
Abstract: Problem solving methods (PSMs) describe the reasoning components of knowledge-based systems as patterns of behavior that can be reused across applications. While the availability of extensive problem solving method libraries and the emerging consensus on problem solving method specification languages indicate the maturity of the field, a number of important research issues are still open. In particular, very little progress has been achieved on foundational and methodological issues. Hence, despite the number of libraries which have been developed, it is still not clear what organization principles should be adopted to construct truly comprehensive libraries, covering large numbers of applications and encompassing both task-specific and task-independent problem solving methods. In this paper, we address these "fundamental" issues and present a comprehensive and detailed framework for characterizing problem solving methods and their development process. In particular, we suggest that PSM development consists of introducing assumptions and commitments along a three-dimensional space defined in terms of problem-solving strategy, task commitments, and domain (knowledge) assumptions. Individual moves through this space can be formally described by means of adapters. In the paper, we illustrate our approach and argue that our architecture provides answers to three fundamental problems related to research in problem solving methods: 1) what is the epistemological structure and what are the modeling primitives of PSMs? 2) how can we model the PSM development process? and 3) how can we develop and organize truly comprehensive and manageable libraries of problem solving methods?

Journal ArticleDOI
TL;DR: An approach for indexing animated objects and efficiently answering queries about their position in time and space by using a 2D access method that is made partially persistent and an optimization problem for which the optimal solution for the case where objects move linearly is provided.
Abstract: We present an approach for indexing animated objects and efficiently answering queries about their position in time and space. In particular, we consider an animated movie as a spatiotemporal evolution. A movie is viewed as an ordered sequence of frames, where each frame is a 2D space occupied by the objects that appear in that frame. The queries of interest are range queries of the form, "find the objects that appear in area S between frames f_i and f_j", as well as nearest neighbor queries such as, "find the q nearest objects to a given position A between frames f_i and f_j". The straightforward approach to index such objects considers the frame sequence as another dimension and uses a 3D access method (such as an R-Tree or its variants). This, however, assigns long "lifetime" intervals to objects that appear through many consecutive frames. Long intervals are difficult to cluster efficiently in a 3D index. Instead, we propose to reduce the problem to a partial-persistence problem. Namely, we use a 2D access method that is made partially persistent. We show that this approach leads to faster query performance while still using storage proportional to the total number of changes in the frame evolution. What differentiates this problem from traditional temporal indexing approaches is that objects are allowed to move and/or change their extent continuously between frames. We present novel methods to approximate such object evolutions. We formulate an optimization problem for which we provide an optimal solution for the case where objects move linearly. Finally, we present an extensive experimental study of the proposed methods. While we concentrate on animated movies, our approach is general and can be applied to other spatiotemporal applications as well.

Journal ArticleDOI
TL;DR: A comparative description is provided of the sparse binary distributed representation developed in the framework of the associative-projective neural network architecture, the better-known holographic reduced representations of T.A. Plate, and the binary spatter codes of P. Kanerva.
Abstract: The schemes for compositional distributed representations include those allowing on-the-fly construction of fixed dimensionality codevectors to encode structures of various complexity. Similarity of such codevectors takes into account both structural and semantic similarity of represented structures. We provide a comparative description of the sparse binary distributed representation developed in the framework of the associative-projective neural network architecture, the better-known holographic reduced representations of T.A. Plate (1995), and the binary spatter codes of P. Kanerva (1996). The key procedure in associative-projective neural networks is context-dependent thinning, which binds codevectors and maintains their sparseness. The codevectors are stored in a structured memory array which can be realized as a distributed auto-associative memory. Examples of distributed representation of structured data are given. Fast estimation of the similarity of analogical episodes by the overlap of their codevectors is used in the modeling of analogical reasoning, both for retrieval of analogs from memory and for analogical mapping.
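
Of the schemes compared, Kanerva-style binary spatter codes are the simplest to sketch: binding by XOR, superposition by bitwise majority, similarity by overlap. The sketch below illustrates that scheme only; it is not the associative-projective / context-dependent thinning procedure itself, and the dimensionality and example structure are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
DIM = 10_000

def random_code():
    """A dense random binary codevector (Kanerva-style spatter code)."""
    return rng.integers(0, 2, DIM, dtype=np.uint8)

bind = np.bitwise_xor                       # role-filler binding; XOR is its own inverse

def bundle(*vectors):
    """Superpose codevectors by bitwise majority vote (ties broken at random)."""
    s = np.sum(vectors, axis=0)
    out = (2 * s > len(vectors)).astype(np.uint8)
    ties = 2 * s == len(vectors)
    out[ties] = rng.integers(0, 2, int(ties.sum()))
    return out

def similarity(a, b):
    """Overlap as 1 - normalized Hamming distance; about 0.5 for unrelated codes."""
    return 1.0 - float(np.mean(a != b))

# Encode the structure eat(agent=Mary, object=fish) at fixed dimensionality.
roles = {r: random_code() for r in ("verb", "agent", "object")}
fillers = {f: random_code() for f in ("eat", "Mary", "fish", "John")}
sentence = bundle(bind(roles["verb"], fillers["eat"]),
                  bind(roles["agent"], fillers["Mary"]),
                  bind(roles["object"], fillers["fish"]))

# Unbind the agent role and recover the filler by similarity against item memory.
probe = bind(sentence, roles["agent"])
print(max(fillers, key=lambda f: similarity(probe, fillers[f])))   # expected: Mary
```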

Journal ArticleDOI
TL;DR: This work considers the problem of aggregation using an imprecise probability data model that allows us to represent imprecision by partial probabilities and uncertainty using probability distributions to perform the operations necessary for knowledge discovery in databases.
Abstract: Information stored in a database is often subject to uncertainty and imprecision. Probability theory provides a well-known and well understood way of representing uncertainty and may thus be used to provide a mechanism for storing uncertain information in a database. We consider the problem of aggregation using an imprecise probability data model that allows us to represent imprecision by partial probabilities and uncertainty using probability distributions. Most work to date has concentrated on providing functionality for extending the relational algebra with a view to executing traditional queries on uncertain or imprecise data. However, for imprecise and uncertain data, we often require aggregation operators that provide information on patterns in the data. Thus, while traditional query processing is tuple-driven, processing of uncertain data is often attribute-driven where we use aggregation operators to discover attribute properties. The aggregation operator that we define uses the Kullback-Leibler information divergence between the aggregated probability distribution and the individual tuple values to provide a probability distribution for the domain values of an attribute or group of attributes. The provision of such aggregation operators is a central requirement in furnishing a database with the capability to perform the operations necessary for knowledge discovery in databases.
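
A minimal sketch of the flavor of such an operator: each tuple carries a probability distribution over the same attribute domain, an aggregate distribution is formed, and the Kullback-Leibler divergence between tuple distributions and the aggregate is reported. The domain, the use of a simple mean as the aggregate, and the direction of the divergence are assumptions made for illustration.

```python
import numpy as np

# Hypothetical imprecise attribute: every tuple stores a probability distribution
# over the same domain values of the attribute.
domain = ["sun", "rain", "snow"]
tuples = np.array([
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.8, 0.1],
])

def kl(p, q, eps=1e-12):
    """Kullback-Leibler information divergence D(p || q), in nats."""
    p, q = np.clip(p, eps, 1.0), np.clip(q, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

aggregate = tuples.mean(axis=0)                 # aggregated distribution for the attribute
divergences = [kl(t, aggregate) for t in tuples]

print(dict(zip(domain, aggregate.round(3))))    # an attribute-level summary
print([round(d, 3) for d in divergences])       # tuples far from the aggregate stand out
```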

Journal ArticleDOI
TL;DR: This work proposes a graph-based approach to generate various types of association rules from a large database of customer transactions, and shows that its algorithms outperform other algorithms which need to make multiple passes over the database.
Abstract: Mining association rules is an important task for knowledge discovery. We can analyze past transaction data to discover customer behaviors such that the quality of business decisions can be improved. Various types of association rules may exist in a large database of customer transactions. The strategy of mining association rules focuses on discovering large item sets, which are groups of items which appear together in a sufficient number of transactions. We propose a graph-based approach to generate various types of association rules from a large database of customer transactions. This approach scans the database once to construct an association graph and then traverses the graph to generate all large item sets. Empirical evaluations show that our algorithms outperform other algorithms which need to make multiple passes over the database.
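
A compact sketch of the one-scan-plus-graph-traversal idea (not the authors' exact algorithm): one bit vector per item is built in a single scan, an association graph connects items whose pairs are frequent, and itemsets are grown by traversing the graph while supports are computed by AND-ing bit vectors. The transactions and support threshold are invented.

```python
from itertools import combinations

# Invented transactions; minsup is the minimum number of supporting transactions.
transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c", "d"}]
minsup = 2

# Single scan: one bit vector (a Python int) per item, bit i set if the item
# occurs in transaction i.
bits = {}
for i, t in enumerate(transactions):
    for item in t:
        bits[item] = bits.get(item, 0) | (1 << i)

def support(bv):
    return bin(bv).count("1")

frequent_items = sorted(it for it, bv in bits.items() if support(bv) >= minsup)

# Association graph: an edge from a to b (a < b) if the pair {a, b} is frequent.
graph = {it: set() for it in frequent_items}
for a, b in combinations(frequent_items, 2):
    if support(bits[a] & bits[b]) >= minsup:
        graph[a].add(b)

# Traverse the graph to grow itemsets; supports come from AND-ing bit vectors,
# so the database is never rescanned.
large = []
def extend(itemset, bv, last):
    large.append((itemset, support(bv)))
    for nxt in sorted(graph[last]):
        new_bv = bv & bits[nxt]
        if support(new_bv) >= minsup and all(nxt in graph[i] for i in itemset):
            extend(itemset + (nxt,), new_bv, nxt)

for it in frequent_items:
    extend((it,), bits[it], it)
print(large)
```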

Journal ArticleDOI
TL;DR: The technique of hierarchical case based reasoning, which allows complex problems to be solved by reusing multiple cases at various levels of abstraction, is described in the context of Deja Vu, a CBR system aimed at automating plant-control software design.
Abstract: Case based reasoning (CBR) is an artificial intelligence technique that emphasises the role of past experience during future problem solving. New problems are solved by retrieving and adapting the solutions to similar problems, solutions that have been stored and indexed for future reuse as cases in a case-base. The power of CBR is severely curtailed if problem solving is limited to the retrieval and adaptation of a single case, so most CBR systems dealing with complex problem solving tasks have to use multiple cases. The paper describes and evaluates the technique of hierarchical case based reasoning, which allows complex problems to be solved by reusing multiple cases at various levels of abstraction. The technique is described in the context of Deja Vu, a CBR system aimed at automating plant-control software design.

Journal ArticleDOI
TL;DR: This paper presents a scalable content-based image indexing and retrieval system based on vector wavelet coefficients of color images that shows that, in a database of 5,000 images, query search takes less than 30 msec on a 266 MHz Pentium II processor.
Abstract: This paper presents a scalable content-based image indexing and retrieval system based on vector wavelet coefficients of color images. Highly decorrelated wavelet coefficient planes are used to acquire a search efficient feature space. The feature space is subsequently indexed using properties of all the images in the database. Therefore, the feature key of an image not only corresponds to the content of the image itself but also to how much the image is different from the other images being stored in the database. The search time linearly depends on the number of images similar to the query image and is independent of the database size. We show that, in a database of 5,000 images, query search takes less than 30 msec on a 266 MHz Pentium II processor, compared to several seconds of retrieval time in the earlier systems proposed in the literature.
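
A much-simplified sketch of wavelet-style feature keys (not the paper's vector-wavelet, database-relative indexing scheme): each color channel is reduced to its low-frequency Haar band, and the bands are concatenated into a compact key compared by Euclidean distance. Image sizes and data are synthetic.

```python
import numpy as np

def haar_lowpass(channel, levels=3):
    """Repeatedly average 2x2 blocks: the low-frequency Haar band of a channel."""
    a = channel.astype(float)
    for _ in range(levels):
        a = (a[0::2, :] + a[1::2, :]) / 2.0   # vertical pairs
        a = (a[:, 0::2] + a[:, 1::2]) / 2.0   # horizontal pairs
    return a

def feature_key(image_rgb, levels=3):
    """Concatenate the low-pass bands of the three color channels into one vector."""
    return np.concatenate([haar_lowpass(image_rgb[:, :, c], levels).ravel()
                           for c in range(3)])

# Synthetic database of 64x64 RGB images plus a slightly perturbed query image.
rng = np.random.default_rng(0)
database = [rng.integers(0, 256, (64, 64, 3)) for _ in range(100)]
keys = np.stack([feature_key(img) for img in database])

query = database[42] + rng.integers(-5, 6, (64, 64, 3))
dists = np.linalg.norm(keys - feature_key(query), axis=1)
print(int(np.argmin(dists)))   # expected: 42
```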

Journal ArticleDOI
TL;DR: The proposed TreeGCS algorithm refines and builds upon the GCS base, overcoming an inconsistency in the original GCS algorithm, where the network topology is susceptible to the ordering of the input vectors.
Abstract: We propose a hierarchical clustering algorithm (TreeGCS) based upon the Growing Cell Structure (GCS) neural network of B. Fritzke (1993). Our algorithm refines and builds upon the GCS base, overcoming an inconsistency in the original GCS algorithm, where the network topology is susceptible to the ordering of the input vectors. Our algorithm is unsupervised, flexible, and dynamic and we have imposed no additional parameters on the underlying GCS algorithm. Our ultimate aim is a hierarchical clustering neural network that is both consistent and stable and identifies the innate hierarchical structure present in vector-based data. We demonstrate improved stability of the GCS foundation and evaluate our algorithm against the hierarchy generated by an ascendant hierarchical clustering dendrogram. Our approach emulates the hierarchical clustering of the dendrogram. It demonstrates the importance of the parameter settings for GCS and how they affect the stability of the clustering.

Journal ArticleDOI
TL;DR: The ZYX model is developed, a comprehensive means for advanced multimedia content creation: support for template-driven authoring of multimedia content and support for flexible, dynamic composition of multimedia documents customized to the user's local context and needs.
Abstract: Advanced multimedia applications require adequate support for the modeling of multimedia content by multimedia document models. More and more this support calls for not only the adequate modeling of the temporal and spatial course of a multimedia presentation and its interactions, but also for the partial reuse of multimedia documents and adaptation to a given user context. However, our thorough investigation of existing standards for multimedia document models such as HTML, MHEG, SMIL, and HyTime leads us to the conclusion that these standard models do not provide sufficient modeling support for reuse and adaptation. Therefore, we propose a new approach for the modeling of adaptable and reusable multimedia content, the ZYX model. The model offers primitives that provide, beyond the more or less common primitives for temporal, spatial, and interaction modeling, a variform support for reuse of structure and layout of document fragments and for the adaptation of the content and its presentation to the user context. We present the model in detail and illustrate the application and effectiveness of these concepts by samples taken from our Cardio-OP application in the domain of cardiac surgery. With the ZYX model, we developed a comprehensive means for advanced multimedia content creation: support for template-driven authoring of multimedia content and support for flexible, dynamic composition of multimedia documents customized to the user's local context and needs. The approach significantly impacts and supports the authoring process in terms of methodology and economic aspects.

Journal ArticleDOI
TL;DR: An automated process for constructing the combined dependency structure of a multiagent probabilistic network is proposed; the constructed dependency structure is a perfect-map of the minimal cover, which means that every probabilistic conditional independency logically implied by the minimal cover can be inferred from the dependency structure.
Abstract: A probabilistic network consists of a dependency structure and corresponding probability tables. The dependency structure is a graphical representation of the conditional independencies that are known to hold in the problem domain. We propose an automated process for constructing the combined dependency structure of a multiagent probabilistic network. Each domain expert supplies any known conditional independency information and not necessarily an explicit dependency structure. Our method determines a succinct representation of all the supplied independency information called a minimal cover. This process involves detecting all inconsistent information and removing all redundant information. A unique dependency structure of the multiagent probabilistic network can be constructed directly from this minimal cover. The main result is that the constructed dependency structure is a perfect-map of the minimal cover. That is, every probabilistic conditional independency logically implied by the minimal cover can be inferred from the dependency structure and every probabilistic conditional independency inferred from the dependency structure is logically implied by the minimal cover.

Journal ArticleDOI
TL;DR: This paper provides a preliminary investigation of the potential applications of fuzzy logic in multimedia databases, and distinguishes two types of request, namely, those which can be handled within some extended version of an SQL-like language and those for which one has to elicit user's preference through examples.
Abstract: Fuzzy logic is known for providing a convenient tool for interfacing linguistic categories with numerical data and for expressing user's preference in a gradual and qualitative way. Fuzzy set methods have been already applied to the representation of flexible queries and to the modeling of uncertain pieces of information in database systems, as well as in information retrieval. This methodology seems to be even more promising in multimedia databases which have a complex structure and from which documents have to be retrieved and selected not only from their contents, but also from "the idea" the user has of their appearance, through queries specified in terms of user's criteria. This paper provides a preliminary investigation of the potential applications of fuzzy logic in multimedia databases. The problem of comparing semistructured documents is first discussed. Querying issues are then more particularly emphasized. We distinguish two types of request, namely, those which can be handled within some extended version of an SQL-like language and those for which one has to elicit user's preference through examples.

Journal ArticleDOI
TL;DR: An abstract semantic model based on an augmented transition network (ATN) is presented, which provides three major capabilities: multimedia presentations, temporal/spatial multimedia database searching, and multimedia browsing.
Abstract: As more information sources become available in multimedia systems, the development of abstract semantic models for video, audio, text, and image data is becoming very important. An abstract semantic model has two requirements: it should be rich enough to provide a friendly interface of multimedia presentation synchronization schedules to the users and it should be a good programming data structure for implementation in order to control multimedia playback. An abstract semantic model based on an augmented transition network (ATN) is presented. The inputs for ATNs are modeled by multimedia input strings. Multimedia input strings provide an efficient means for iconic indexing of the temporal/spatial relations of media streams and semantic objects. An ATN and its subnetworks are used to represent the appearing sequence of media streams and semantic objects. The arc label is a substring of a multimedia input string. In this design, a presentation is driven by a multimedia input string. Each subnetwork has its own multimedia input string. Database queries relative to text, image, and video can be answered via substring matching at subnetworks. Multimedia browsing allows users the flexibility to select any part of the presentation they prefer to see. This means that the ATN and its subnetworks can be included in multimedia database systems which are controlled by a database management system (DBMS). User interactions and loops are also provided in an ATN. Therefore, ATNs provide three major capabilities: multimedia presentations, temporal/spatial multimedia database searching, and multimedia browsing.

Journal ArticleDOI
TL;DR: It is demonstrated how the employment of hyperwords implies a reduction, based on the a priori knowledge of semantics contained in the thesaurus, in the number of features to be used for document classification.
Abstract: A connectionist scheme, namely, the σ-Fuzzy Lattice Neurocomputing scheme, or σ-FLN for short, which has been introduced in the literature lately for clustering in a lattice data domain, is employed for computing clusters of directed graphs in a master-graph. New tools are presented and used, including a convenient inclusion measure function for clustering graphs. A directed graph is treated by σ-FLN as a single datum in the mathematical lattice of subgraphs stemming from a master-graph. A series of experiments is detailed where the master-graph emanates from a thesaurus of spoken language synonyms. The words of the thesaurus are fed to σ-FLN in order to compute clusters of semantically related words, namely hyperwords. The arithmetic parameters of σ-FLN can be adjusted so as to calibrate the total number of hyperwords computed in a specific application. It is demonstrated how the employment of hyperwords implies a reduction, based on the a priori knowledge of semantics contained in the thesaurus, in the number of features to be used for document classification. In a series of comparative experiments for document classification, it appears that the proposed method favorably improves classification accuracy in problems involving longer documents, whereas performance deteriorates in problems involving short documents.

Journal ArticleDOI
TL;DR: A graph-theoretic approach presented in the paper provides a sound mathematical basis for representing a query and searching for an execution plan, and an algorithm is devised on this basis that finds a near-optimal execution plan using only polynomial time.
Abstract: Although many query tree optimization strategies have been proposed in the literature, there still is a lack of a formal and complete representation of all possible permutations of query operations (i.e., execution plans) in a uniform manner. A graph-theoretic approach presented in the paper provides a sound mathematical basis for representing a query and searching for an execution plan. In this graph model, a node represents an operation and a directed edge between two nodes indicates the order of executing these two operations in an execution plan. Each node is associated with a weight and so is an edge. The weight is an expression containing optimization-required parameters, such as relation size, tuple size, and join selectivity factors. All possible execution plans are representable in this graph and each spanning tree of the graph becomes an execution plan. It is a general model which can be used in the optimizer of a DBMS for internal query representation. On the basis of this model, we devise an algorithm that finds a near-optimal execution plan using only polynomial time. The algorithm is compared with a few other popular optimization methods. Experiments show that the proposed algorithm is superior to the others under most circumstances.
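
As a rough illustration of the spanning-tree view (with fixed numeric weights standing in for the paper's parametric weight expressions, and an undirected graph standing in for the directed one), the sketch below builds a small operation graph and extracts a cheap spanning tree with Prim's algorithm; node names and costs are invented.

```python
import heapq

# Invented operation graph: nodes are query operations; weights stand in for the
# parametric cost expressions (relation sizes, selectivities, ...) of the model.
nodes = ["scan_R", "scan_S", "select_R", "join_RS", "project"]
edges = {
    ("scan_R", "select_R"): 10, ("select_R", "join_RS"): 25,
    ("scan_S", "join_RS"): 40, ("scan_R", "join_RS"): 60,
    ("join_RS", "project"): 5, ("scan_S", "select_R"): 70,
}

def cheap_spanning_tree(nodes, edges, start):
    """Prim's algorithm: greedily grow a spanning tree of small total weight.
    In the graph model, each spanning tree corresponds to one execution plan."""
    adj = {n: [] for n in nodes}
    for (a, b), w in edges.items():
        adj[a].append((w, b))
        adj[b].append((w, a))
    seen, tree = {start}, []
    heap = [(w, n, start) for w, n in adj[start]]
    heapq.heapify(heap)
    while heap and len(seen) < len(nodes):
        w, n, parent = heapq.heappop(heap)
        if n in seen:
            continue
        seen.add(n)
        tree.append((parent, n, w))
        for w2, m in adj[n]:
            heapq.heappush(heap, (w2, m, n))
    return tree, sum(w for _, _, w in tree)

print(cheap_spanning_tree(nodes, edges, "scan_R"))
```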

Journal ArticleDOI
TL;DR: This work introduces a generic scheme for data mining algorithms and investigates two orthogonal approaches, reducing I/O cost as well as CPU cost, to speed-up the processing of multiple similarity queries in metric databases.
Abstract: Metric databases are databases where a metric distance function is defined for pairs of database objects. In such databases, similarity queries in the form of range queries or k-nearest-neighbor queries are the most important query types. In traditional query processing, single queries are issued independently by different users. In many data mining applications, however, the database is typically explored by iteratively asking similarity queries for answers of previous similarity queries. We introduce a generic scheme for such data mining algorithms and we investigate two orthogonal approaches, reducing I/O cost as well as CPU cost, to speed-up the processing of multiple similarity queries. The proposed techniques apply to any type of similarity query and to an implementation based on an index or using a sequential scan. Parallelization yields an additional impressive speed-up. An extensive performance evaluation confirms the efficiency of our approach.

Journal ArticleDOI
TL;DR: In this paper, an extended merge-join is used to evaluate the unnested fuzzy queries, which significantly improves the performance of evaluating nested fuzzy queries. But the results are limited to a subset of nested queries.
Abstract: In a fuzzy relational database where a relation is a fuzzy set of tuples and ill-known data are represented by possibility distributions, nested fuzzy queries can be expressed in the Fuzzy SQL language. Although it provides a very convenient way for users to express complex queries, a nested fuzzy query may be very inefficient to process with the naive evaluation method based on its semantics. In conventional databases, nested queries are unnested to improve the efficiency of their evaluation. In this paper, we extend the unnesting techniques to process several types of nested fuzzy queries. An extended merge-join is used to evaluate the unnested fuzzy queries. As shown by both theoretical analysis and experimental results, the unnesting techniques with the extended merge-join significantly improve the performance of evaluating nested fuzzy queries.

Journal ArticleDOI
TL;DR: By restructuring the FROM clause via a subquery, SQL/SDA is well-adapted to the general spatial analysis procedures using current GIS packages and stretches the capabilities of previous ones.
Abstract: An important trend of current GIS development is to provide easy and effective access to spatial analysis functionalities for supporting decision making based on geo-referenced data. Within the framework of the ongoing SQL standards for spatial extensions, a spatial query language, called SQL/SDA, has been designed to meet such a requirement. Since the language needs to incorporate the important derivation functions (e.g., map-overlay and feature-fusion) as well as the spatial relationship and metric functions, the functionality of the FROM clause in SQL is developed in addition to the SELECT and WHERE clauses. By restructuring the FROM clause via a subquery, SQL/SDA is well-adapted to the general spatial analysis procedures using current GIS packages. Such an extended SQL, therefore, stretches the capabilities of previous ones. The implementation of SQL/SDA on the Internet adopts a hybrid model, which takes advantage of the Web GIS design methods in both the client side and server side. The client side of SQL/SDA, programmed in the Java language, provides a query interface by introducing visual constructs such as icons, listboxes, and comboboxes to assist in the composition of queries, thereby enhancing the usability of the language. The server side of SQL/SDA, which is composed of a query processor and a Spatial Database Engine (SDE), carries out query processing on spatial databases after receiving user requests.