
Showing papers in "IEEE Transactions on Knowledge and Data Engineering in 2001"


Journal ArticleDOI
TL;DR: This paper addresses the problem of releasing microdata while safeguarding the anonymity of respondents to which the data refer and introduces the concept of minimal generalization that captures the property of the release process not distorting the data more than needed to achieve k-anonymity.
Abstract: Today's globally networked society places great demands on the dissemination and sharing of information. While in the past released information was mostly in tabular and statistical form, many situations call for the release of specific data (microdata). In order to protect the anonymity of the entities (called respondents) to which information refers, data holders often remove or encrypt explicit identifiers such as names, addresses, and phone numbers. Deidentifying data, however, provides no guarantee of anonymity. Released information often contains other data, such as race, birth date, sex, and ZIP code, that can be linked to publicly available information to reidentify respondents and infer information that was not intended for disclosure. In this paper we address the problem of releasing microdata while safeguarding the anonymity of respondents to which the data refer. The approach is based on the definition of k-anonymity. A table provides k-anonymity if attempts to link explicitly identifying information to its content map the information to at least k entities. We illustrate how k-anonymity can be provided without compromising the integrity (or truthfulness) of the information released by using generalization and suppression techniques. We introduce the concept of minimal generalization that captures the property of the release process not distorting the data more than needed to achieve k-anonymity, and present an algorithm for the computation of such a generalization. We also discuss possible preference policies to choose among different minimal generalizations.
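
A minimal sketch of the idea (not the authors' algorithm): generalize the quasi-identifiers step by step and stop at the first point of a small generalization lattice at which every combination occurs at least k times. The table, column names, and generalization hierarchy below are invented for illustration.

```python
from collections import Counter
from itertools import product

# Hypothetical microdata; the quasi-identifiers are ZIP code and birth year.
rows = [
    ("47677", 1972), ("47602", 1973), ("47678", 1972),
    ("47905", 1964), ("47909", 1965), ("47906", 1967),
]

def generalize(zip_code, year, zip_level, year_level):
    """Truncate ZIP digits and widen the birth-year bucket as the levels grow."""
    z = zip_code[:5 - zip_level] + "*" * zip_level
    bucket = (1, 5, 10, 20)[year_level]
    return z, (year // bucket) * bucket

def is_k_anonymous(table, k):
    return all(count >= k for count in Counter(table).values())

def minimal_generalization(rows, k):
    # Enumerate points of the generalization lattice in order of total level and
    # return the first (least distorting) one that achieves k-anonymity.
    for zl, yl in sorted(product(range(6), range(4)), key=sum):
        table = [generalize(z, y, zl, yl) for z, y in rows]
        if is_k_anonymous(table, k):
            return (zl, yl), table
    return None

print(minimal_generalization(rows, k=2))
```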

2,291 citations


Journal ArticleDOI
TL;DR: This work analyzes the clustering property of the Hilbert space-filling curve by deriving closed-form formulas for the number of clusters in a given query region of an arbitrary shape and shows that the Hilbert curve achieves better clustering than the z curve.
Abstract: Several schemes for the linear mapping of a multidimensional space have been proposed for various applications, such as access methods for spatio-temporal databases and image compression. In these applications, one of the most desired properties from such linear mappings is clustering, which means the locality between objects in the multidimensional space being preserved in the linear space. It is widely believed that the Hilbert space-filling curve achieves the best clustering (Abel and Mark, 1990; Jagadish, 1990). We analyze the clustering property of the Hilbert space-filling curve by deriving closed-form formulas for the number of clusters in a given query region of an arbitrary shape (e.g., polygons and polyhedra). Both the asymptotic solution for the general case and the exact solution for a special case generalize previous work. They agree with the empirical results that the number of clusters depends on the hypersurface area of the query region and not on its hypervolume. We also show that the Hilbert curve achieves better clustering than the z curve. From a practical point of view, the formulas given provide a simple measure that can be used to predict the required disk access behaviors and, hence, the total access time.
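
To make the clustering notion concrete, the sketch below maps grid cells to positions on the Hilbert and z-order curves and counts the contiguous runs (clusters) covered by a query rectangle. The grid size and query window are arbitrary choices, and the closed-form formulas of the paper are not reproduced here.

```python
def hilbert_index(n, x, y):
    """Position of cell (x, y) of an n x n grid (n a power of two) along the
    Hilbert space-filling curve (standard bit-manipulation conversion)."""
    d, s = 0, n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:                      # rotate/flip the quadrant
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d

def z_index(n, x, y):
    """Z-order (bit-interleaving) index of the same cell, for comparison."""
    z = 0
    for i in range(n.bit_length()):
        z |= ((x >> i) & 1) << (2 * i) | ((y >> i) & 1) << (2 * i + 1)
    return z

def clusters(curve, n, x0, y0, x1, y1):
    """Number of contiguous curve segments (clusters) covering a query rectangle."""
    keys = sorted(curve(n, x, y) for x in range(x0, x1 + 1) for y in range(y0, y1 + 1))
    return 1 + sum(1 for a, b in zip(keys, keys[1:]) if b != a + 1)

# A 3 x 4 query window on a 32 x 32 grid; fewer clusters means fewer non-contiguous
# disk reads when the linear order is used as the storage order.
print(clusters(hilbert_index, 32, 5, 9, 7, 12), clusters(z_index, 32, 5, 9, 7, 12))
```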

740 citations


Journal ArticleDOI
TL;DR: This work develops a family of algorithms for mining association rules when the items of interest occur infrequently, employing a combination of random sampling and hashing techniques, and provides an analysis of the algorithms developed along with experiments on real and synthetic data for a comparative performance analysis.
Abstract: Association-rule mining has heretofore relied on the condition of high support to do its work efficiently. In particular, the well-known a priori algorithm is only effective when the only rules of interest are relationships that occur very frequently. However, there are a number of applications, such as data mining, identification of similar Web documents, clustering, and collaborative filtering, where the rules of interest have comparatively few instances in the data. In these cases, we must look for highly correlated items, or possibly even causal relationships between infrequent items. We develop a family of algorithms for solving this problem, employing a combination of random sampling and hashing techniques. We provide analysis of the algorithms developed and conduct experiments on real and synthetic data to obtain a comparative performance analysis.
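
One standard hashing technique in this spirit is min-hashing, which estimates the Jaccard similarity of item columns without any high-support requirement. The sketch below is an illustration over invented toy data, not the authors' specific algorithms.

```python
import random
from itertools import combinations

# Hypothetical transactions (baskets of item ids); supports are deliberately low.
baskets = [{"a", "b"}, {"a", "b", "c"}, {"c", "d"}, {"a", "b"}, {"d"}, {"a", "b", "d"}]
items = sorted(set().union(*baskets))

def minhash_signatures(baskets, items, num_hashes=100, seed=0):
    """For each random permutation of the baskets, record per item the position of
    the first basket (in permuted order) that contains it."""
    rng = random.Random(seed)
    sigs = {item: [] for item in items}
    for _ in range(num_hashes):
        order = list(range(len(baskets)))
        rng.shuffle(order)
        for item in items:
            sigs[item].append(next(pos for pos, b in enumerate(order) if item in baskets[b]))
    return sigs

def estimated_jaccard(sigs, i, j):
    # The fraction of permutations on which two min-hashes agree estimates the
    # Jaccard similarity of the two items' sets of baskets.
    return sum(a == b for a, b in zip(sigs[i], sigs[j])) / len(sigs[i])

sigs = minhash_signatures(baskets, items)
pairs = sorted(((estimated_jaccard(sigs, i, j), i, j) for i, j in combinations(items, 2)),
               reverse=True)
print(pairs[:3])   # highly correlated pairs, found without a high-support requirement
```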

370 citations


Journal ArticleDOI
TL;DR: An affinity based unification method for global view construction is proposed and experiences of applying the proposed unification method and the associated tool environment ARTEMIS on databases of the Italian Public Administration information systems are described.
Abstract: The problem of defining global views of heterogeneous data sources to support querying and cooperation activities is becoming more and more important due to the availability of multiple data sources within complex organizations and in global information systems. Global views are defined to provide a unified representation of the information in the different sources by analyzing conceptual schemas associated with them and resolving possible semantic heterogeneity. We propose an affinity based unification method for global view construction. In the method: (1) the concept of affinity is introduced to assess the level of semantic relationship between elements in different schemas by taking into account semantic heterogeneity; (2) schema elements are classified by affinity levels using clustering procedures so that their different representations can be analyzed for unification; (3) global views are constructed starting from selected clusters by unifying representations of their elements. Experiences of applying the proposed unification method and the associated tool environment ARTEMIS on databases of the Italian Public Administration information systems are described.

344 citations


Journal ArticleDOI
TL;DR: This work presents a comprehensive survey of the various approaches to the problem of storing, querying, and updating the location of objects in mobile computing, identifying the fundamental techniques underlying the proposed approaches along various dimensions.
Abstract: In current distributed systems, the notion of mobility is emerging in many forms and applications. Mobility arises naturally in wireless computing since the location of users changes as they move. Besides mobility in wireless computing, software mobile agents are another popular form of moving objects. Locating objects, i.e., identifying their current location, is central to mobile computing. We present a comprehensive survey of the various approaches to the problem of storing, querying, and updating the location of objects in mobile computing. The fundamental techniques underlying the proposed approaches are identified, analyzed, and classified along various dimensions.

276 citations


Journal ArticleDOI
TL;DR: A new practical delivery technique is proposed, called hierarchical multicast stream merging (HMSM), whose required server bandwidth is lower than that of the partitioned dynamic skyscraper and is reasonably close to the minimum achievable server bandwidth over a wide range of client request rates.
Abstract: Two recent techniques for multicast or broadcast delivery of streaming media can provide immediate service to each client request, yet achieve considerable client stream sharing which leads to significant server and network bandwidth savings. The paper considers: 1) how well these recently proposed techniques perform relative to each other and 2) whether there are new practical delivery techniques that can achieve better bandwidth savings than the previous techniques over a wide range of client request rates. The principal results are as follows: First, the recent partitioned dynamic skyscraper technique is adapted to provide immediate service to each client request more simply and directly than the original dynamic skyscraper method. Second, at moderate to high client request rates, the dynamic skyscraper method has a required server bandwidth that is significantly lower than that of the recent optimized stream tapping/patching/controlled multicast technique. Third, the minimum required server bandwidth for any delivery technique that provides immediate real-time delivery to clients increases logarithmically (with constant factor equal to one) as a function of the client request arrival rate. Furthermore, it is (theoretically) possible to achieve very close to the minimum required server bandwidth if client receive bandwidth is equal to two times the data streaming rate and client storage capacity is sufficient for buffering data from shared streams. Finally, we propose a new practical delivery technique, called hierarchical multicast stream merging (HMSM), which has a required server bandwidth that is lower than that of the partitioned dynamic skyscraper and is reasonably close to the minimum achievable required server bandwidth over a wide range of client request rates.
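
The logarithmic bound can be stated concretely. The following is a hedged reconstruction of the standard argument for Poisson request arrivals of rate lambda and a media object of duration T; the notation is ours and is not quoted from the paper.

```latex
% Hedged reconstruction (our notation): with Poisson request arrivals of rate
% \lambda, the data at playback offset x requested at time t must be delivered
% by t + x, so one multicast of that offset can be shared only by requests
% arriving within a window of length x; the expected transmission rate for
% offset x is therefore at least \lambda/(\lambda x + 1). Integrating over the object:
\[
  B_{\min} \;\ge\; \int_{0}^{T} \frac{\lambda}{\lambda x + 1}\, dx
           \;=\; \ln(\lambda T + 1) \;=\; \ln(N + 1),
  \qquad N = \lambda T,
\]
% i.e., the required server bandwidth (in units of the streaming rate) grows
% logarithmically in the request arrival rate with constant factor one,
% consistent with the statement in the abstract.
```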

225 citations


Journal ArticleDOI
TL;DR: It is shown how the Hausdorff and Correlation fractal dimensions of a data set can yield extremely accurate formulas that can predict the I/O performance to within one standard deviation on multiple real and synthetic data sets.
Abstract: Spatial queries in high-dimensional spaces have been studied extensively. Among them, nearest neighbor queries are important in many settings, including spatial databases (Find the k closest cities) and multimedia databases (Find the k most similar images). Previous analyses have concluded that nearest-neighbor search is hopeless in high dimensions due to the notorious "curse of dimensionality". We show that this may be overpessimistic. We show that what determines the search performance (at least for R-tree-like structures) is the intrinsic dimensionality of the data set and not the dimensionality of the address space (referred to as the embedding dimensionality). The typical (and often implicit) assumption in many previous studies is that the data is uniformly distributed, with independence between attributes. However, real data sets overwhelmingly disobey these assumptions; rather, they typically are skewed and exhibit intrinsic ("fractal") dimensionalities that are much lower than their embedding dimension, e.g. due to subtle dependencies between attributes. We show how the Hausdorff and Correlation fractal dimensions of a data set can yield extremely accurate formulas that can predict the I/O performance to within one standard deviation on multiple real and synthetic data sets.
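
As a rough illustration of how an intrinsic ("fractal") dimensionality can be measured, the sketch below estimates the correlation fractal dimension as the slope of log(sum of squared grid-cell occupancies) versus log(cell side). The data set (a noisy line embedded in 3D) is invented, and the paper's I/O cost formulas are not reproduced.

```python
import numpy as np

def correlation_dimension(points, radii):
    """Estimate the correlation fractal dimension D2 as the slope of
    log(sum of squared grid-cell occupancies) versus log(cell side)."""
    points = np.asarray(points, dtype=float)
    lo, hi = points.min(axis=0), points.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)
    logs = []
    for r in radii:
        cells = np.floor((points - lo) / (span * r)).astype(int)
        _, counts = np.unique(cells, axis=0, return_counts=True)
        p = counts / counts.sum()
        logs.append((np.log(r), np.log(np.sum(p ** 2))))
    xs, ys = zip(*logs)
    slope, _ = np.polyfit(xs, ys, 1)
    return slope

# A noisy line embedded in 3D: embedding dimension 3, intrinsic dimension ~1.
t = np.random.default_rng(0).random(5000)
line3d = np.c_[t, 2 * t, -t] + 0.001 * np.random.default_rng(1).normal(size=(5000, 3))
print(correlation_dimension(line3d, radii=[0.5, 0.25, 0.125, 0.0625]))  # should be close to 1
```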

217 citations


Journal ArticleDOI
TL;DR: It is argued that images don't have an intrinsic meaning, but that they are endowed with a meaning by placing them in the context of other images and by the user interaction.
Abstract: In this paper, we briefly discuss some aspects of image semantics and the role that it plays for the design of image databases. We argue that images don't have an intrinsic meaning, but that they are endowed with a meaning by placing them in the context of other images and by the user interaction. From this observation, we conclude that, in an image database, users should be allowed to manipulate not only the individual images, but also the relations between them. We present an interface model based on the manipulation of configurations of images.

215 citations


Journal ArticleDOI
TL;DR: Novel techniques that help significantly reduce the set of statistics that need to be created without sacrificing the quality of query plans generated are introduced.
Abstract: Statistics play a key role in influencing the quality of plans chosen by a database query optimizer. In this paper, we identify the statistics that are essential for an optimizer. We introduce novel techniques that help significantly reduce the set of statistics that need to be created without sacrificing the quality of query plans generated. We discuss how these techniques can be leveraged to automate statistics management in databases. We have implemented and experimentally evaluated our approach on Microsoft SQL Server 7.0.

210 citations


Journal ArticleDOI
TL;DR: A new dimension, called the data span dimension, is introduced, which allows user-defined selections of a temporal subset of the database, and a generic algorithm is described that takes any traditional incremental model maintenance algorithm and transforms it into an algorithm that allows restrictions on the data span dimension.
Abstract: Data mining algorithms have been the focus of much research. In practice, the input data to a data mining process resides in a large data warehouse whose data is kept up-to-date through periodic or occasional addition and deletion of blocks of data. Most data mining algorithms have either assumed that the input data is static, or have been designed for arbitrary insertions and deletions of data records. We consider a dynamic environment that evolves through systematic addition or deletion of blocks of data. We introduce a new dimension, called the data span dimension, which allows user-defined selections of a temporal subset of the database. Taking this new degree of freedom into account, we describe efficient model maintenance algorithms for frequent item sets and clusters. We then describe a generic algorithm that takes any traditional incremental model maintenance algorithm and transforms it into an algorithm that allows restrictions on the data span dimension. We also develop an algorithm for automatically discovering a specific class of interesting block selection sequences. In a detailed experimental study, we examine the validity and performance of our ideas on synthetic and real datasets.
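
A toy sketch of the block-oriented view (not the paper's algorithms): per-block summaries are maintained incrementally, and a query over a user-selected data span combines only the summaries of the selected blocks. The block contents and support threshold below are invented.

```python
from collections import Counter

# Hypothetical blocks of transactions arriving over time (e.g., one block per day).
blocks = [
    [{"a", "b"}, {"a"}, {"b", "c"}],          # block 0
    [{"a", "b"}, {"a", "c"}],                 # block 1
    [{"c"}, {"b", "c"}, {"a", "b", "c"}],     # block 2
]

# Incremental maintenance: one summary (item counts) per block, so blocks can be
# added or dropped without revisiting the raw transactions.
block_counts = [Counter(item for t in blk for item in t) for blk in blocks]
block_sizes = [len(blk) for blk in blocks]

def frequent_items(selected_blocks, min_support=0.5):
    """Frequent items restricted to a user-selected span of blocks."""
    total = sum(block_sizes[i] for i in selected_blocks)
    counts = sum((block_counts[i] for i in selected_blocks), Counter())
    return {item: c / total for item, c in counts.items() if c / total >= min_support}

print(frequent_items([0, 1]))   # data span: only the first two blocks
print(frequent_items([1, 2]))   # a different temporal subset of the database
```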

140 citations


Journal ArticleDOI
TL;DR: In this paper, linear relational embedding is introduced as a means of learning a distributed representation of concepts from data consisting of binary relations between these concepts, and the operation of applying a relation to a concept as a matrix-vector multiplication that produces an approximation to the related concept is learned by maximizing an appropriate discriminative goodness function using gradient ascent.
Abstract: We introduce linear relational embedding as a means of learning a distributed representation of concepts from data consisting of binary relations between these concepts. The key idea is to represent concepts as vectors, binary relations as matrices, and the operation of applying a relation to a concept as a matrix-vector multiplication that produces an approximation to the related concept. A representation for concepts and relations is learned by maximizing an appropriate discriminative goodness function using gradient ascent. On a task involving family relationships, learning is fast and leads to good generalization.
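
A small numeric sketch of the stated idea, with invented family-style triples, a low embedding dimension, and gradient ascent on a softmax-over-negative-squared-distances goodness as an assumed stand-in for the paper's exact objective.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy relational data (invented): (concept, relation, related concept) triples.
concepts = ["alice", "bob", "carol", "dave"]
relations = ["parent_of", "child_of"]
triples = [("alice", "parent_of", "bob"), ("bob", "child_of", "alice"),
           ("carol", "parent_of", "dave"), ("dave", "child_of", "carol")]

dim = 4
V = {c: 0.1 * rng.normal(size=dim) for c in concepts}          # concepts as vectors
M = {r: 0.1 * rng.normal(size=(dim, dim)) for r in relations}  # relations as matrices

lr = 0.05
for _ in range(2000):
    for a, r, b in triples:
        p = M[r] @ V[a]                          # applying a relation: matrix-vector product
        d = {c: np.sum((p - V[c]) ** 2) for c in concepts}
        q = np.array([np.exp(-d[c]) for c in concepts])
        q /= q.sum()                             # softmax over negative squared distances
        vbar = sum(qc * V[c] for qc, c in zip(q, concepts))
        g_p = 2 * (V[b] - vbar)                  # gradient of log q[b] w.r.t. the prediction p
        M[r] += lr * np.outer(g_p, V[a])
        V[a] += lr * (M[r].T @ g_p)
        for qc, c in zip(q, concepts):           # pull the target toward p, push the rest away
            V[c] += lr * (2 * (1 - qc) * (p - V[c]) if c == b else -2 * qc * (p - V[c]))

pred = M["parent_of"] @ V["alice"]
print(min(concepts, key=lambda c: np.linalg.norm(V[c] - pred)))   # expected: bob
```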

Journal ArticleDOI
Charu C. Aggarwal, Philip S. Yu
TL;DR: The problem of online mining of association rules in a large database of sales transactions is discussed, with the use of nonredundant association rules helping significantly in the reduction of irrelevant noise in the data mining process.
Abstract: We discuss the problem of online mining of association rules in a large database of sales transactions. The online mining is performed by preprocessing the data effectively in order to make it suitable for repeated online queries. We store the preprocessed data in such a way that online processing may be done by applying a graph theoretic search algorithm whose complexity is proportional to the size of the output. The result is an online algorithm which is independent of the size of the transactional data and the size of the preprocessed data. The algorithm is almost instantaneous in the size of the output. The algorithm also supports techniques for quickly discovering association rules from large itemsets. The algorithm is capable of finding rules with specific items in the antecedent or consequent. These association rules are presented in a compact form, eliminating redundancy. The use of nonredundant association rules helps significantly in the reduction of irrelevant noise in the data mining process.

Journal ArticleDOI
TL;DR: This work proposes a generic framework, called the parametric framework, as a unifying umbrella for IB frameworks, and develops the declarative, fixpoint, and proof-theoretic semantics of programs in this framework and shows their equivalence.
Abstract: Numerous frameworks have been proposed in recent years for deductive databases with uncertainty. On the basis of how uncertainty is associated with the facts and rules in a program, we classify these frameworks into implication-based (IB) and annotation-based (AB) frameworks. We take the IB approach and propose a generic framework, called the parametric framework, as a unifying umbrella for IB frameworks. We develop the declarative, fixpoint, and proof-theoretic semantics of programs in our framework and show their equivalence. Using the framework as a basis, we then study the query optimization problem of containment of conjunctive queries in this framework and establish necessary and sufficient conditions for containment for several classes of parametric conjunctive queries. Our results yield tools for use in the query optimization for large classes of query programs in IB deductive databases with uncertainty.

Journal ArticleDOI
TL;DR: This paper presents a comprehensive and detailed framework for characterizing problem solving methods and their development process and suggests that PSM development consists of introducing assumptions and commitments along a three-dimensional space defined in terms of problem-solving strategy, task commitments, and domain (knowledge) assumptions.
Abstract: Problem solving methods (PSMs) describe the reasoning components of knowledge-based systems as patterns of behavior that can be reused across applications. While the availability of extensive problem solving method libraries and the emerging consensus on problem solving method specification languages indicate the maturity of the field, a number of important research issues are still open. In particular, very little progress has been achieved on foundational and methodological issues. Hence, despite the number of libraries which have been developed, it is still not clear what organization principles should be adopted to construct truly comprehensive libraries, covering large numbers of applications and encompassing both task-specific and task-independent problem solving methods. In this paper, we address these "fundamental" issues and present a comprehensive and detailed framework for characterizing problem solving methods and their development process. In particular, we suggest that PSM development consists of introducing assumptions and commitments along a three-dimensional space defined in terms of problem-solving strategy, task commitments, and domain (knowledge) assumptions. Individual moves through this space can be formally described by means of adapters. In the paper, we illustrate our approach and argue that our architecture provides answers to three fundamental problems related to research in problem solving methods: 1) what is the epistemological structure and what are the modeling primitives of PSMs? 2) how can we model the PSM development process? and 3) how can we develop and organize truly comprehensive and manageable libraries of problem solving methods?

Journal ArticleDOI
TL;DR: An approach for indexing animated objects and efficiently answering queries about their position in time and space by using a 2D access method that is made partially persistent and an optimization problem for which the optimal solution for the case where objects move linearly is provided.
Abstract: We present an approach for indexing animated objects and efficiently answering queries about their position in time and space. In particular, we consider an animated movie as a spatiotemporal evolution. A movie is viewed as an ordered sequence of frames, where each frame is a 2D space occupied by the objects that appear in that frame. The queries of interest are range queries of the form, "find the objects that appear in area S between frames f_i and f_j", as well as nearest neighbor queries such as, "find the q nearest objects to a given position A between frames f_i and f_j". The straightforward approach to index such objects considers the frame sequence as another dimension and uses a 3D access method (such as an R-Tree or its variants). This, however, assigns long "lifetime" intervals to objects that appear through many consecutive frames. Long intervals are difficult to cluster efficiently in a 3D index. Instead, we propose to reduce the problem to a partial-persistence problem. Namely, we use a 2D access method that is made partially persistent. We show that this approach leads to faster query performance while still using storage proportional to the total number of changes in the frame evolution. What differentiates this problem from traditional temporal indexing approaches is that objects are allowed to move and/or change their extent continuously between frames. We present novel methods to approximate such object evolutions. We formulate an optimization problem for which we provide an optimal solution for the case where objects move linearly. Finally, we present an extensive experimental study of the proposed methods. While we concentrate on animated movies, our approach is general and can be applied to other spatiotemporal applications as well.

Journal ArticleDOI
TL;DR: A comparative description is provided of the sparse binary distributed representation developed in the framework of the associative-projective neural network architecture, the better-known holographic reduced representations of T.A. Plate, and the binary spatter codes of P. Kanerva.
Abstract: The schemes for compositional distributed representations include those allowing on-the-fly construction of fixed dimensionality codevectors to encode structures of various complexity. Similarity of such codevectors takes into account both structural and semantic similarity of represented structures. We provide a comparative description of the sparse binary distributed representation developed in the framework of the associative-projective neural network architecture, the better-known holographic reduced representations of T.A. Plate (1995), and the binary spatter codes of P. Kanerva (1996). The key procedure in associative-projective neural networks is context-dependent thinning, which binds codevectors and maintains their sparseness. The codevectors are stored in a structured memory array which can be realized as a distributed auto-associative memory. Examples of distributed representation of structured data are given. Fast estimation of the similarity of analogical episodes by the overlap of their codevectors is used in the modeling of analogical reasoning, both for retrieval of analogs from memory and for analogical mapping.
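
Of the schemes compared, Kanerva-style binary spatter codes are the simplest to sketch: binding by XOR, superposition by bitwise majority, similarity by overlap. The sketch below illustrates that scheme only; it is not the associative-projective / context-dependent thinning procedure itself, and the dimensionality and example structure are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
DIM = 10_000

def random_code():
    """A dense random binary codevector (Kanerva-style spatter code)."""
    return rng.integers(0, 2, DIM, dtype=np.uint8)

bind = np.bitwise_xor                       # role-filler binding; XOR is its own inverse

def bundle(*vectors):
    """Superpose codevectors by bitwise majority vote (ties broken at random)."""
    s = np.sum(vectors, axis=0)
    out = (2 * s > len(vectors)).astype(np.uint8)
    ties = 2 * s == len(vectors)
    out[ties] = rng.integers(0, 2, int(ties.sum()))
    return out

def similarity(a, b):
    """Overlap as 1 - normalized Hamming distance; about 0.5 for unrelated codes."""
    return 1.0 - float(np.mean(a != b))

# Encode the structure eat(agent=Mary, object=fish) at fixed dimensionality.
roles = {r: random_code() for r in ("verb", "agent", "object")}
fillers = {f: random_code() for f in ("eat", "Mary", "fish", "John")}
sentence = bundle(bind(roles["verb"], fillers["eat"]),
                  bind(roles["agent"], fillers["Mary"]),
                  bind(roles["object"], fillers["fish"]))

# Unbind the agent role and recover the filler by similarity against item memory.
probe = bind(sentence, roles["agent"])
print(max(fillers, key=lambda f: similarity(probe, fillers[f])))   # expected: Mary
```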

Journal ArticleDOI
TL;DR: This work considers the problem of aggregation using an imprecise probability data model that allows us to represent imprecision by partial probabilities and uncertainty using probability distributions to perform the operations necessary for knowledge discovery in databases.
Abstract: Information stored in a database is often subject to uncertainty and imprecision. Probability theory provides a well-known and well understood way of representing uncertainty and may thus be used to provide a mechanism for storing uncertain information in a database. We consider the problem of aggregation using an imprecise probability data model that allows us to represent imprecision by partial probabilities and uncertainty using probability distributions. Most work to date has concentrated on providing functionality for extending the relational algebra with a view to executing traditional queries on uncertain or imprecise data. However, for imprecise and uncertain data, we often require aggregation operators that provide information on patterns in the data. Thus, while traditional query processing is tuple-driven, processing of uncertain data is often attribute-driven where we use aggregation operators to discover attribute properties. The aggregation operator that we define uses the Kullback-Leibler information divergence between the aggregated probability distribution and the individual tuple values to provide a probability distribution for the domain values of an attribute or group of attributes. The provision of such aggregation operators is a central requirement in furnishing a database with the capability to perform the operations necessary for knowledge discovery in databases.
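
A minimal sketch of the flavor of such an operator: each tuple carries a probability distribution over the same attribute domain, an aggregate distribution is formed, and the Kullback-Leibler divergence between tuple distributions and the aggregate is reported. The domain, the use of a simple mean as the aggregate, and the direction of the divergence are assumptions made for illustration.

```python
import numpy as np

# Hypothetical imprecise attribute: every tuple stores a probability distribution
# over the same domain values of the attribute.
domain = ["sun", "rain", "snow"]
tuples = np.array([
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.8, 0.1],
])

def kl(p, q, eps=1e-12):
    """Kullback-Leibler information divergence D(p || q), in nats."""
    p, q = np.clip(p, eps, 1.0), np.clip(q, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

aggregate = tuples.mean(axis=0)                 # aggregated distribution for the attribute
divergences = [kl(t, aggregate) for t in tuples]

print(dict(zip(domain, aggregate.round(3))))    # an attribute-level summary
print([round(d, 3) for d in divergences])       # tuples far from the aggregate stand out
```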

Journal ArticleDOI
TL;DR: This work proposes a graph-based approach to generate various types of association rules from a large database of customer transactions, and shows that its algorithms outperform other algorithms which need to make multiple passes over the database.
Abstract: Mining association rules is an important task for knowledge discovery. We can analyze past transaction data to discover customer behaviors such that the quality of business decisions can be improved. Various types of association rules may exist in a large database of customer transactions. The strategy of mining association rules focuses on discovering large item sets, which are groups of items which appear together in a sufficient number of transactions. We propose a graph-based approach to generate various types of association rules from a large database of customer transactions. This approach scans the database once to construct an association graph and then traverses the graph to generate all large item sets. Empirical evaluations show that our algorithms outperform other algorithms which need to make multiple passes over the database.
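
A compact sketch of the one-scan-plus-graph-traversal idea (not the authors' exact algorithm): one bit vector per item is built in a single scan, an association graph connects items whose pairs are frequent, and itemsets are grown by traversing the graph while supports are computed by AND-ing bit vectors. The transactions and support threshold are invented.

```python
from itertools import combinations

# Invented transactions; minsup is the minimum number of supporting transactions.
transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c", "d"}]
minsup = 2

# Single scan: one bit vector (a Python int) per item, bit i set if the item
# occurs in transaction i.
bits = {}
for i, t in enumerate(transactions):
    for item in t:
        bits[item] = bits.get(item, 0) | (1 << i)

def support(bv):
    return bin(bv).count("1")

frequent_items = sorted(it for it, bv in bits.items() if support(bv) >= minsup)

# Association graph: an edge from a to b (a < b) if the pair {a, b} is frequent.
graph = {it: set() for it in frequent_items}
for a, b in combinations(frequent_items, 2):
    if support(bits[a] & bits[b]) >= minsup:
        graph[a].add(b)

# Traverse the graph to grow itemsets; supports come from AND-ing bit vectors,
# so the database is never rescanned.
large = []
def extend(itemset, bv, last):
    large.append((itemset, support(bv)))
    for nxt in sorted(graph[last]):
        new_bv = bv & bits[nxt]
        if support(new_bv) >= minsup and all(nxt in graph[i] for i in itemset):
            extend(itemset + (nxt,), new_bv, nxt)

for it in frequent_items:
    extend((it,), bits[it], it)
print(large)
```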

Journal ArticleDOI
TL;DR: The technique of hierarchical case based reasoning, which allows complex problems to be solved by reusing multiple cases at various levels of abstraction, is described in the context of Deja Vu, a CBR system aimed at automating plant-control software design.
Abstract: Case based reasoning (CBR) is an artificial intelligence technique that emphasises the role of past experience during future problem solving. New problems are solved by retrieving and adapting the solutions to similar problems, solutions that have been stored and indexed for future reuse as cases in a case-base. The power of CBR is severely curtailed if problem solving is limited to the retrieval and adaptation of a single case, so most CBR systems dealing with complex problem solving tasks have to use multiple cases. The paper describes and evaluates the technique of hierarchical case based reasoning, which allows complex problems to be solved by reusing multiple cases at various levels of abstraction. The technique is described in the context of Deja Vu, a CBR system aimed at automating plant-control software design.

Journal ArticleDOI
TL;DR: This paper presents a scalable content-based image indexing and retrieval system based on vector wavelet coefficients of color images that shows that, in a database of 5,000 images, query search takes less than 30 msec on a 266 MHz Pentium II processor.
Abstract: This paper presents a scalable content-based image indexing and retrieval system based on vector wavelet coefficients of color images. Highly decorrelated wavelet coefficient planes are used to acquire a search efficient feature space. The feature space is subsequently indexed using properties of all the images in the database. Therefore, the feature key of an image not only corresponds to the content of the image itself but also to how much the image is different from the other images being stored in the database. The search time linearly depends on the number of images similar to the query image and is independent of the database size. We show that, in a database of 5,000 images, query search takes less than 30 msec on a 266 MHz Pentium II processor, compared to several seconds of retrieval time in the earlier systems proposed in the literature.
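
A much-simplified sketch of wavelet-style feature keys (not the paper's vector-wavelet, database-relative indexing scheme): each color channel is reduced to its low-frequency Haar band, and the bands are concatenated into a compact key compared by Euclidean distance. Image sizes and data are synthetic.

```python
import numpy as np

def haar_lowpass(channel, levels=3):
    """Repeatedly average 2x2 blocks: the low-frequency Haar band of a channel."""
    a = channel.astype(float)
    for _ in range(levels):
        a = (a[0::2, :] + a[1::2, :]) / 2.0   # vertical pairs
        a = (a[:, 0::2] + a[:, 1::2]) / 2.0   # horizontal pairs
    return a

def feature_key(image_rgb, levels=3):
    """Concatenate the low-pass bands of the three color channels into one vector."""
    return np.concatenate([haar_lowpass(image_rgb[:, :, c], levels).ravel()
                           for c in range(3)])

# Synthetic database of 64x64 RGB images plus a slightly perturbed query image.
rng = np.random.default_rng(0)
database = [rng.integers(0, 256, (64, 64, 3)) for _ in range(100)]
keys = np.stack([feature_key(img) for img in database])

query = database[42] + rng.integers(-5, 6, (64, 64, 3))
dists = np.linalg.norm(keys - feature_key(query), axis=1)
print(int(np.argmin(dists)))   # expected: 42
```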

Journal ArticleDOI
TL;DR: The proposed TreeGCS algorithm refines and builds upon the GCS base, overcoming an inconsistency in the original GCS algorithm, where the network topology is susceptible to the ordering of the input vectors.
Abstract: We propose a hierarchical clustering algorithm (TreeGCS) based upon the Growing Cell Structure (GCS) neural network of B. Fritzke (1993). Our algorithm refines and builds upon the GCS base, overcoming an inconsistency in the original GCS algorithm, where the network topology is susceptible to the ordering of the input vectors. Our algorithm is unsupervised, flexible, and dynamic and we have imposed no additional parameters on the underlying GCS algorithm. Our ultimate aim is a hierarchical clustering neural network that is both consistent and stable and identifies the innate hierarchical structure present in vector-based data. We demonstrate improved stability of the GCS foundation and evaluate our algorithm against the hierarchy generated by an ascendant hierarchical clustering dendrogram. Our approach emulates the hierarchical clustering of the dendrogram. It demonstrates the importance of the parameter settings for GCS and how they affect the stability of the clustering.

Journal ArticleDOI
TL;DR: The ZYX model is developed, a comprehensive means for advanced multimedia content creation: support for template-driven authoring of multimedia content and support for flexible, dynamic composition of multimedia documents customized to the user's local context and needs.
Abstract: Advanced multimedia applications require adequate support for the modeling of multimedia content by multimedia document models. More and more this support calls for not only the adequate modeling of the temporal and spatial course of a multimedia presentation and its interactions, but also for the partial reuse of multimedia documents and adaptation to a given user context. However, our thorough investigation of existing standards for multimedia document models such as HTML, MHEG, SMIL, and HyTime leads us to the conclusion that these standard models do not provide sufficient modeling support for reuse and adaptation. Therefore, we propose a new approach for the modeling of adaptable and reusable multimedia content, the ZYX model. The model offers primitives that provide, beyond the more or less common primitives for temporal, spatial, and interaction modeling, a variform support for reuse of structure and layout of document fragments and for the adaptation of the content and its presentation to the user context. We present the model in detail and illustrate the application and effectiveness of these concepts by samples taken from our Cardio-OP application in the domain of cardiac surgery. With the ZYX model, we developed a comprehensive means for advanced multimedia content creation: support for template-driven authoring of multimedia content and support for flexible, dynamic composition of multimedia documents customized to the user's local context and needs. The approach significantly impacts and supports the authoring process in terms of methodology and economic aspects.

Journal ArticleDOI
TL;DR: An automated process for constructing the combined dependency structure of a multiagent probabilistic network is proposed; the constructed dependency structure is a perfect-map of the minimal cover, which means that every probabilistic conditional independency logically implied by the minimal cover can be inferred from the dependency structure.
Abstract: A probabilistic network consists of a dependency structure and corresponding probability tables. The dependency structure is a graphical representation of the conditional independencies that are known to hold in the problem domain. We propose an automated process for constructing the combined dependency structure of a multiagent probabilistic network. Each domain expert supplies any known conditional independency information and not necessarily an explicit dependency structure. Our method determines a succinct representation of all the supplied independency information called a minimal cover. This process involves detecting all inconsistent information and removing all redundant information. A unique dependency structure of the multiagent probabilistic network can be constructed directly from this minimal cover. The main result is that the constructed dependency structure is a perfect-map of the minimal cover. That is, every probabilistic conditional independency logically implied by the minimal cover can be inferred from the dependency structure and every probabilistic conditional independency inferred from the dependency structure is logically implied by the minimal cover.

Journal ArticleDOI
TL;DR: This paper provides a preliminary investigation of the potential applications of fuzzy logic in multimedia databases, and distinguishes two types of request, namely, those which can be handled within some extended version of an SQL-like language and those for which one has to elicit user's preference through examples.
Abstract: Fuzzy logic is known for providing a convenient tool for interfacing linguistic categories with numerical data and for expressing user's preference in a gradual and qualitative way. Fuzzy set methods have been already applied to the representation of flexible queries and to the modeling of uncertain pieces of information in database systems, as well as in information retrieval. This methodology seems to be even more promising in multimedia databases which have a complex structure and from which documents have to be retrieved and selected not only from their contents, but also from "the idea" the user has of their appearance, through queries specified in terms of user's criteria. This paper provides a preliminary investigation of the potential applications of fuzzy logic in multimedia databases. The problem of comparing semistructured documents is first discussed. Querying issues are then more particularly emphasized. We distinguish two types of request, namely, those which can be handled within some extended version of an SQL-like language and those for which one has to elicit user's preference through examples.

Journal ArticleDOI
TL;DR: An abstract semantic model based on an augmented transition network (ATN) is presented, which provides three major capabilities: multimedia presentations, temporal/spatial multimedia database searching, and multimedia browsing.
Abstract: As more information sources become available in multimedia systems, the development of abstract semantic models for video, audio, text, and image data is becoming very important. An abstract semantic model has two requirements: it should be rich enough to provide a friendly interface of multimedia presentation synchronization schedules to the users and it should be a good programming data structure for implementation in order to control multimedia playback. An abstract semantic model based on an augmented transition network (ATN) is presented. The inputs for ATNs are modeled by multimedia input strings. Multimedia input strings provide an efficient means for iconic indexing of the temporal/spatial relations of media streams and semantic objects. An ATN and its subnetworks are used to represent the appearing sequence of media streams and semantic objects. The arc label is a substring of a multimedia input string. In this design, a presentation is driven by a multimedia input string. Each subnetwork has its own multimedia input string. Database queries relative to text, image, and video can be answered via substring matching at subnetworks. Multimedia browsing allows users the flexibility to select any part of the presentation they prefer to see. This means that the ATN and its subnetworks can be included in multimedia database systems which are controlled by a database management system (DBMS). User interactions and loops are also provided in an ATN. Therefore, ATNs provide three major capabilities: multimedia presentations, temporal/spatial multimedia database searching, and multimedia browsing.

Journal ArticleDOI
TL;DR: It is demonstrated how the employment of hyperwords implies a reduction, based on the a priori knowledge of semantics contained in the thesaurus, in the number of features to be used for document classification.
Abstract: A connectionist scheme, namely, the σ-Fuzzy Lattice Neurocomputing scheme, or σ-FLN for short, which has been introduced in the literature lately for clustering in a lattice data domain, is employed for computing clusters of directed graphs in a master-graph. New tools are presented and used, including a convenient inclusion measure function for clustering graphs. A directed graph is treated by σ-FLN as a single datum in the mathematical lattice of subgraphs stemming from a master-graph. A series of experiments is detailed where the master-graph emanates from a thesaurus of spoken language synonyms. The words of the thesaurus are fed to σ-FLN in order to compute clusters of semantically related words, namely hyperwords. The arithmetic parameters of σ-FLN can be adjusted so as to calibrate the total number of hyperwords computed in a specific application. It is demonstrated how the employment of hyperwords implies a reduction, based on the a priori knowledge of semantics contained in the thesaurus, in the number of features to be used for document classification. In a series of comparative experiments for document classification, it appears that the proposed method favorably improves classification accuracy in problems involving longer documents, whereas performance deteriorates in problems involving short documents.

Journal ArticleDOI
TL;DR: A graph-theoretic approach presented in the paper provides a sound mathematical basis for representing a query and searching for an execution plan, and an algorithm is devised on this basis that finds a near-optimal execution plan using only polynomial time.
Abstract: Although many query tree optimization strategies have been proposed in the literature, there still is a lack of a formal and complete representation of all possible permutations of query operations (i.e., execution plans) in a uniform manner. A graph-theoretic approach presented in the paper provides a sound mathematical basis for representing a query and searching for an execution plan. In this graph model, a node represents an operation and a directed edge between two nodes indicates the order of executing these two operations in an execution plan. Each node is associated with a weight and so is an edge. The weight is an expression containing optimization-required parameters, such as relation size, tuple size, and join selectivity factors. All possible execution plans are representable in this graph and each spanning tree of the graph becomes an execution plan. It is a general model which can be used in the optimizer of a DBMS for internal query representation. On the basis of this model, we devise an algorithm that finds a near-optimal execution plan using only polynomial time. The algorithm is compared with a few other popular optimization methods. Experiments show that the proposed algorithm is superior to the others under most circumstances.
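
As a rough illustration of the spanning-tree view (with fixed numeric weights standing in for the paper's parametric weight expressions, and an undirected graph standing in for the directed one), the sketch below builds a small operation graph and extracts a cheap spanning tree with Prim's algorithm; node names and costs are invented.

```python
import heapq

# Invented operation graph: nodes are query operations; weights stand in for the
# parametric cost expressions (relation sizes, selectivities, ...) of the model.
nodes = ["scan_R", "scan_S", "select_R", "join_RS", "project"]
edges = {
    ("scan_R", "select_R"): 10, ("select_R", "join_RS"): 25,
    ("scan_S", "join_RS"): 40, ("scan_R", "join_RS"): 60,
    ("join_RS", "project"): 5, ("scan_S", "select_R"): 70,
}

def cheap_spanning_tree(nodes, edges, start):
    """Prim's algorithm: greedily grow a spanning tree of small total weight.
    In the graph model, each spanning tree corresponds to one execution plan."""
    adj = {n: [] for n in nodes}
    for (a, b), w in edges.items():
        adj[a].append((w, b))
        adj[b].append((w, a))
    seen, tree = {start}, []
    heap = [(w, n, start) for w, n in adj[start]]
    heapq.heapify(heap)
    while heap and len(seen) < len(nodes):
        w, n, parent = heapq.heappop(heap)
        if n in seen:
            continue
        seen.add(n)
        tree.append((parent, n, w))
        for w2, m in adj[n]:
            heapq.heappush(heap, (w2, m, n))
    return tree, sum(w for _, _, w in tree)

print(cheap_spanning_tree(nodes, edges, "scan_R"))
```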

Journal ArticleDOI
TL;DR: This work introduces a generic scheme for data mining algorithms and investigates two orthogonal approaches, reducing I/O cost as well as CPU cost, to speed-up the processing of multiple similarity queries in metric databases.
Abstract: Metric databases are databases where a metric distance function is defined for pairs of database objects. In such databases, similarity queries in the form of range queries or k-nearest-neighbor queries are the most important query types. In traditional query processing, single queries are issued independently by different users. In many data mining applications, however, the database is typically explored by iteratively asking similarity queries for answers of previous similarity queries. We introduce a generic scheme for such data mining algorithms and we investigate two orthogonal approaches, reducing I/O cost as well as CPU cost, to speed-up the processing of multiple similarity queries. The proposed techniques apply to any type of similarity query and to an implementation based on an index or using a sequential scan. Parallelization yields an additional impressive speed-up. An extensive performance evaluation confirms the efficiency of our approach.

Journal ArticleDOI
TL;DR: In this paper, an extended merge-join is used to evaluate the unnested fuzzy queries, which significantly improves the performance of evaluating nested fuzzy queries. But the results are limited to a subset of nested queries.
Abstract: In a fuzzy relational database where a relation is a fuzzy set of tuples and ill-known data are represented by possibility distributions, nested fuzzy queries can be expressed in the Fuzzy SQL language. Although it provides a very convenient way for users to express complex queries, a nested fuzzy query may be very inefficient to process with the naive evaluation method based on its semantics. In conventional databases, nested queries are unnested to improve the efficiency of their evaluation. In this paper, we extend the unnesting techniques to process several types of nested fuzzy queries. An extended merge-join is used to evaluate the unnested fuzzy queries. As shown by both theoretical analysis and experimental results, the unnesting techniques with the extended merge-join significantly improve the performance of evaluating nested fuzzy queries.

Journal ArticleDOI
TL;DR: By restructuring the FROM clause via a subquery, SQL/SDA is well-adapted to the general spatial analysis procedures using current GIS packages and stretches the capabilities of previous ones.
Abstract: An important trend of current GIS development is to provide easy and effective access to spatial analysis functionalities for supporting decision making based on geo-referenced data. Within the framework of the ongoing SQL standards for spatial extensions, a spatial query language, called SQL/SDA, has been designed to meet such a requirement. Since the language needs to incorporate the important derivation functions (e.g., map-overlay and feature-fusion) as well as the spatial relationship and metric functions, the functionality of the FROM clause in SQL is developed in addition to the SELECT and WHERE clauses. By restructuring the FROM clause via a subquery, SQL/SDA is well-adapted to the general spatial analysis procedures using current GIS packages. Such an extended SQL, therefore, stretches the capabilities of previous ones. The implementation of SQL/SDA on the Internet adopts a hybrid model, which takes advantage of the Web GIS design methods in both the client side and server side. The client side of SQL/SDA, programmed in the Java language, provides a query interface by introducing visual constructs such as icons, listboxes, and comboboxes to assist in the composition of queries, thereby enhancing the usability of the language. The server side of SQL/SDA, which is composed of a query processor and a Spatial Database Engine (SDE), carries out query processing on spatial databases after receiving user requests.