
Showing papers in "IEEE Transactions on Knowledge and Data Engineering in 1996"


Journal Article•DOI•
TL;DR: In this paper, a survey of the available data mining techniques is provided and a comparative study of such techniques is presented, from a database researcher's point of view.
Abstract: Mining information and knowledge from large databases has been recognized by many researchers as a key research topic in database systems and machine learning, and by many industrial companies as an important area with an opportunity of major revenues. Researchers in many different fields have shown great interest in data mining. Several emerging applications in information-providing services, such as data warehousing and online services over the Internet, also call for various data mining techniques to better understand user behavior, to improve the service provided and to increase business opportunities. In response to such a demand, this article provides a survey, from a database researcher's point of view, on the data mining techniques developed recently. A classification of the available data mining techniques is provided and a comparative study of such techniques is presented.

2,327 citations


Journal Article•DOI•
TL;DR: This work considers the problem of mining association rules on a shared nothing multiprocessor and presents three algorithms that explore a spectrum of trade-offs between computation, communication, memory usage, synchronization, and the use of problem specific information.
Abstract: We consider the problem of mining association rules on a shared nothing multiprocessor. We present three algorithms that explore a spectrum of trade-offs between computation, communication, memory usage, synchronization, and the use of problem specific information. The best algorithm exhibits near perfect scaleup behavior, yet requires only minimal overhead compared to the current best serial algorithm.

1,121 citations


Journal Article•DOI•
TL;DR: The focus of the paper is on studying subjective measures of interestingness, which are classified into actionable and unexpected, and the relationship between them is examined.
Abstract: One of the central problems in the field of knowledge discovery is the development of good measures of interestingness of discovered patterns. Such measures of interestingness are divided into objective measures-those that depend only on the structure of a pattern and the underlying data used in the discovery process, and the subjective measures-those that also depend on the class of users who examine the pattern. The focus of the paper is on studying subjective measures of interestingness. These measures are classified into actionable and unexpected, and the relationship between them is examined. The unexpected measure of interestingness is defined in terms of the belief system that the user has. Interestingness of a pattern is expressed in terms of how it affects the belief system. The paper also discusses how this unexpected measure of interestingness can be used in the discovery process.

746 citations


Journal Article•DOI•
TL;DR: The literature review presented discusses different methods under the general rubric of learning Bayesian networks from data, and includes some overlapping work on more general probabilistic networks.
Abstract: The literature review presented discusses different methods under the general rubric of learning Bayesian networks from data, and includes some overlapping work on more general probabilistic networks. Connections are drawn between the statistical, neural network, and uncertainty communities, and between the different methodological communities, such as Bayesian, description length, and classical statistics. Basic concepts for learning and Bayesian networks are introduced and methods are then reviewed. Methods are discussed for learning parameters of a probabilistic network, for learning the structure, and for learning hidden variables. The article avoids formal definitions and theorems, as these are plentiful in the literature, and instead illustrates key concepts with simplified examples.

544 citations


Journal Article•DOI•
TL;DR: This article describes and evaluates a new visualization-based approach to mining large databases and compares it to other well-known visualization techniques for multidimensional data: the parallel coordinate and stick-figure visualization techniques.
Abstract: Visual data mining techniques have proven to be of high value in exploratory data analysis, and they also have a high potential for mining large databases. In this article, we describe and evaluate a new visualization-based approach to mining large databases. The basic idea of our visual data mining techniques is to represent as many data items as possible on the screen at the same time by mapping each data value to a pixel of the screen and arranging the pixels adequately. The major goal of this article is to evaluate our visual data mining techniques and to compare them to other well-known visualization techniques for multidimensional data: the parallel coordinate and stick-figure visualization techniques. For the evaluation of visual data mining techniques, the perception of data properties counts most, while the CPU time and the number of secondary storage accesses are only of secondary importance. In addition to testing the visualization techniques using real data, we developed a testing environment for database visualizations similar to the benchmark approach used for comparing the performance of database systems. The testing environment allows the generation of test data sets with predefined data characteristics which are important for comparing the perceptual abilities of visual data mining techniques.

405 citations


Journal Article•DOI•
TL;DR: The paper presents an approach to discover symbolic classification rules using neural networks, and demonstrates the effectiveness of the proposed approach by the experimental results on a set of standard data mining test problems.
Abstract: Classification is one of the data mining problems receiving great attention recently in the database community. The paper presents an approach to discover symbolic classification rules using neural networks. Neural networks have not been considered well suited for data mining because the way they arrive at classifications is not explicitly stated as symbolic rules suitable for verification or interpretation by humans. With the proposed approach, concise symbolic rules with high accuracy can be extracted from a neural network. The network is first trained to achieve the required accuracy rate. Redundant connections of the network are then removed by a network pruning algorithm. The activation values of the hidden units in the network are analyzed, and classification rules are generated using the result of this analysis. The effectiveness of the proposed approach is clearly demonstrated by the experimental results on a set of standard data mining test problems.
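
For intuition, here is a heavily simplified Python sketch of the train-prune-analyze-generate pipeline the abstract outlines. The weights are random stand-ins for a trained network, and the pruning threshold, the activation discretization, and the rule format are illustrative assumptions rather than the paper's algorithm.

```python
import numpy as np

# Toy sketch of the pipeline: (assumed) trained network -> prune small
# weights -> discretize hidden activations -> emit one rule per pattern.
rng = np.random.default_rng(0)

W1 = rng.normal(size=(4, 3))          # input -> hidden weights ("trained", made up)
W2 = rng.normal(size=(3, 2))          # hidden -> output weights ("trained", made up)
X = rng.integers(0, 2, size=(20, 4))  # binary training inputs

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Step 1: prune redundant connections (here: simply zero out small weights).
W1_pruned = np.where(np.abs(W1) < 0.5, 0.0, W1)

# Step 2: discretize hidden activations into a small number of levels.
H = sigmoid(X @ W1_pruned)
levels = np.round(H * 2) / 2          # cluster activations to {0, 0.5, 1}

# Step 3: group examples by their discrete hidden pattern and map each
# pattern to the majority predicted class -> one symbolic rule per pattern.
pred = np.argmax(sigmoid(levels @ W2), axis=1)
rules = {}
for pattern, cls in zip(map(tuple, levels), pred):
    rules.setdefault(pattern, []).append(cls)

for pattern, classes in rules.items():
    majority = max(set(classes), key=classes.count)
    print(f"IF hidden activations ~ {pattern} THEN class {majority}")
```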

396 citations


Journal Article•DOI•
TL;DR: An efficient algorithm called DMA (Distributed Mining of Association rules) is proposed for mining association rules in distributed databases; it generates a small number of candidate sets and requires only O(n) messages for support-count exchange for each candidate set.
Abstract: Many sequential algorithms have been proposed for the mining of association rules. However, very little work has been done in mining association rules in distributed databases. A direct application of sequential algorithms to distributed databases is not effective, because it requires a large amount of communication overhead. In this study, an efficient algorithm called DMA (Distributed Mining of Association rules), is proposed. It generates a small number of candidate sets and requires only O(n) messages for support-count exchange for each candidate set, where n is the number of sites in a distributed database. The algorithm has been implemented on an experimental testbed, and its performance is studied. The results show that DMA has superior performance, when compared with the direct application of a popular sequential algorithm, in distributed databases.
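
As a rough illustration of the O(n)-messages-per-candidate idea (each of the n sites reports one local support count per candidate itemset), here is a toy Python sketch. The partitioned transactions, the candidate enumeration, and the support threshold are made up, and DMA's actual candidate-set generation and pruning are not modeled.

```python
from itertools import combinations

# Assumed horizontally partitioned transaction data at 3 sites.
sites = [
    [{"a", "b", "c"}, {"a", "c"}, {"b", "c"}],
    [{"a", "b"}, {"a", "b", "c"}, {"c"}],
    [{"a", "c"}, {"b", "c"}, {"a", "b", "c"}],
]
min_support = 5  # absolute support threshold over all 9 transactions

def local_count(partition, itemset):
    """One 'message' per site: the local support count of the candidate."""
    return sum(1 for t in partition if itemset <= t)

items = sorted({i for part in sites for t in part for i in t})
for size in (1, 2):
    for candidate in combinations(items, size):
        cand = set(candidate)
        messages = [local_count(part, cand) for part in sites]  # O(n) messages
        if sum(messages) >= min_support:
            print(f"frequent: {sorted(cand)} support={sum(messages)}")
```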

365 citations


Journal Article•DOI•
TL;DR: A survey of methods for representing and reasoning with imperfect information can be found in this paper, where a classification of the different types of imperfection and the sources of such imperfections is discussed.
Abstract: This paper surveys methods for representing and reasoning with imperfect information. It opens with an attempt to classify the different types of imperfection that may pervade data, and a discussion of the sources of such imperfections. The classification is then used as a framework for considering work that explicitly concerns the representation of imperfect information, and related work on how imperfect information may be used as a basis for reasoning. The work that is surveyed is drawn from both the field of databases and the field of artificial intelligence. Both of these areas have long been concerned with the problems caused by imperfect information, and this paper stresses the relationships between the approaches developed in each.

293 citations


Journal Article•DOI•
TL;DR: The scope covers gradient descent and polynomial line search, from backpropagation through conjugate gradients and quasi-Newton methods; a consensus among researchers is reported that adaptive step gains (learning rates) can stabilize and accelerate convergence and that a good starting weight set improves both the training speed and the learning quality.
Abstract: We survey research of recent years on the supervised training of feedforward neural networks. The goal is to expose how the networks work, how to engineer them so they can learn data with less extraneous noise, how to train them efficiently, and how to assure that the training is valid. The scope covers gradient descent and polynomial line search, from backpropagation through conjugate gradients and quasi Newton methods. There is a consensus among researchers that adaptive step gains (learning rates) can stabilize and accelerate convergence and that a good starting weight set improves both the training speed and the learning quality. The training problem includes both the design of a network function and the fitting of the function to a set of input and output data points by computing a set of coefficient weights. The form of the function can be adjusted by adjoining new neurons and pruning existing ones and setting other parameters such as biases and exponential rates. Our exposition reveals several useful results that are readily implementable.
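
The "adaptive step gain" idea reported above can be illustrated with a generic bold-driver-style heuristic on a simple least-squares fit: grow the learning rate while the error keeps dropping, shrink it and undo the step when the error rises. This sketch is not any specific method from the survey, and the growth/shrink factors (1.1 and 0.5) are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(50, 3))
x_true = np.array([1.0, -2.0, 0.5])
b = A @ x_true + 0.01 * rng.normal(size=50)

def loss_and_grad(x):
    r = A @ x - b
    return 0.5 * r @ r, A.T @ r

x = np.zeros(3)
lr = 1e-3
loss, grad = loss_and_grad(x)
for step in range(200):
    candidate = x - lr * grad
    new_loss, new_grad = loss_and_grad(candidate)
    if new_loss < loss:          # accept the step and be bolder
        x, loss, grad = candidate, new_loss, new_grad
        lr *= 1.1
    else:                        # reject the step and be more cautious
        lr *= 0.5

print("recovered weights:", np.round(x, 3), "final loss:", round(loss, 5))
```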

178 citations



Journal Article•DOI•
TL;DR: A discretionary access control model in which authorizations contain temporal intervals of validity is presented, together with an approach, based on establishing an ordering among authorizations and derivation rules, that guarantees a unique set of valid authorizations.
Abstract: The paper presents a discretionary access control model in which authorizations contain temporal intervals of validity. An authorization is automatically revoked when the associated temporal interval expires. The proposed model provides rules for the automatic derivation of new authorizations from those explicitly specified. Both positive and negative authorizations are supported. A formal definition of those concepts is presented, together with the semantic interpretation of authorizations and derivation rules as clauses of a general logic program. Issues deriving from the presence of negative authorizations are discussed. We also allow negation in rules: it is possible to derive new authorizations on the basis of the absence of other authorizations. The presence of this type of rule may lead to the generation of different sets of authorizations, depending on the evaluation order. An approach is presented, based on establishing an ordering among authorizations and derivation rules, which guarantees a unique set of valid authorizations. Moreover, we give an algorithm detecting whether such an ordering can be established for a given set of authorizations and rules. Administrative operations for adding, removing, or modifying authorizations and derivation rules are presented and efficiency issues related to these operations are also tackled in the paper. A materialization approach is proposed, allowing access control to be performed efficiently.
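
A minimal Python sketch of the central idea, temporal intervals of validity attached to authorizations, is given below. The Authorization record, the denial-wins handling of negative authorizations, and the omission of derivation rules are simplifying assumptions for illustration, not the paper's model.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Authorization:
    subject: str
    obj: str
    privilege: str
    valid_from: date
    valid_to: date
    positive: bool = True  # the model also supports negative authorizations

def holds(auths, subject, obj, privilege, when):
    """True iff a positive (and no negative) authorization is valid at `when`."""
    valid = [a for a in auths
             if (a.subject, a.obj, a.privilege) == (subject, obj, privilege)
             and a.valid_from <= when <= a.valid_to]
    if any(not a.positive for a in valid):   # assumption: negatives win here
        return False
    return any(a.positive for a in valid)

auths = [Authorization("ann", "report.doc", "read", date(1996, 1, 1), date(1996, 6, 30))]
print(holds(auths, "ann", "report.doc", "read", date(1996, 3, 1)))   # True
print(holds(auths, "ann", "report.doc", "read", date(1996, 12, 1)))  # False (expired)
```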

Journal Article•DOI•
TL;DR: A unified representation for spatial relationships, 2D Projection Interval Relationships (2D-PIR), that integrates both directional and topological relationships is proposed and techniques for similarity retrieval based on the 2D- PIR representation are developed.
Abstract: Spatial relationships are important ingredients for expressing constraints in retrieval systems for pictorial or multimedia databases. We propose a unified representation for spatial relationships, 2D Projection Interval Relationships (2D-PIR), that integrates both directional and topological relationships. We develop techniques for similarity retrieval based on the 2D-PIR representation, including a method for dealing with rotated and reflected images.

Journal Article•DOI•
TL;DR: The main contribution of the paper is the development of Algorithm GenCom (Generalization for Commonality extraction) that makes use of concept generalization to effectively derive many meaningful commonalities that cannot be found otherwise.
Abstract: Studies two spatial knowledge discovery problems involving proximity relationships between clusters and features. The first problem is: given a cluster of points, how can we efficiently find features (represented as polygons) that are closest to the majority of points in the cluster? We measure proximity in an aggregate sense due to the nonuniform distribution of points in a cluster (e.g. houses on a map), and the different shapes and sizes of features (e.g. natural or man-made geographic features). The second problem is: given n clusters of points, how can we extract the aggregate proximity commonalities (i.e. features) that apply to most, if not all, of the n clusters? Regarding the first problem, the main contribution of the paper is the development of Algorithm CRH (Circle, Rectangle and Hull), which uses geometric approximations (i.e. encompassing circles, isothetic rectangles and convex hulls) to filter and select features. The highly scalable and incremental Algorithm CRH can examine over 50,000 features and their spatial relationships with a given cluster in approximately one second of CPU time. Regarding the second problem, the key contribution is the development of Algorithm GenCom (Generalization for Commonality extraction) that makes use of concept generalization to effectively derive many meaningful commonalities that cannot be found otherwise.
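
The filter-and-refine flavor of Algorithm CRH can be sketched as below: cheap circle and isothetic-rectangle tests prune candidate features before a more expensive aggregate-proximity computation on the survivors. The data and thresholds are invented, and an exact aggregate distance stands in for the convex-hull stage for brevity.

```python
import math

cluster = [(1.0, 1.0), (1.5, 0.8), (0.9, 1.4), (1.2, 1.1)]
# Candidate features as polygons (lists of vertices), all made up.
features = {
    "park":   [(1.0, 2.0), (2.0, 2.0), (2.0, 3.0), (1.0, 3.0)],
    "lake":   [(8.0, 8.0), (9.0, 8.0), (9.0, 9.0), (8.0, 9.0)],
    "school": [(0.0, 0.0), (0.5, 0.0), (0.5, 0.5), (0.0, 0.5)],
}

cx = sum(p[0] for p in cluster) / len(cluster)
cy = sum(p[1] for p in cluster) / len(cluster)

def circle_filter(poly, radius=5.0):
    """Keep features whose encompassing circle overlaps a circle around the cluster."""
    fx = sum(v[0] for v in poly) / len(poly)
    fy = sum(v[1] for v in poly) / len(poly)
    f_radius = max(math.dist((fx, fy), v) for v in poly)
    return math.dist((cx, cy), (fx, fy)) <= radius + f_radius

def rectangle_filter(poly, margin=4.0):
    """Keep features whose isothetic bounding box is near the cluster's box."""
    xs, ys = [v[0] for v in poly], [v[1] for v in poly]
    cxs, cys = [p[0] for p in cluster], [p[1] for p in cluster]
    return (min(xs) - margin <= max(cxs) and max(xs) + margin >= min(cxs) and
            min(ys) - margin <= max(cys) and max(ys) + margin >= min(cys))

def aggregate_distance(poly):
    """Expensive final step: average distance from cluster points to the feature."""
    return sum(min(math.dist(p, v) for v in poly) for p in cluster) / len(cluster)

survivors = {name: poly for name, poly in features.items()
             if circle_filter(poly) and rectangle_filter(poly)}
ranking = sorted(survivors, key=lambda n: aggregate_distance(survivors[n]))
print("features closest to the cluster:", ranking)
```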

Journal Article•DOI•
Jennifer Widom1•
TL;DR: The Starburst Rule System is an active database rules facility integrated into the Starburst extensible relational database system at the IBM Almaden Research Center; its rule language is based on arbitrary database state transitions rather than tuple- or statement-level changes.
Abstract: The paper describes the development of the Starburst Rule System, an active database rules facility integrated into the Starburst extensible relational database system at the IBM Almaden Research Center. The Starburst rule language is based on arbitrary database state transitions rather than tuple or statement level changes, yielding a clear and flexible execution semantics. The rule system has been implemented completely. Its rapid implementation was facilitated by the extensibility features of Starburst, and rule management and rule processing are integrated into all aspects of database processing.

Journal Article•DOI•
TL;DR: A knowledge based approach for retrieving images by content supports the answering of conceptual image queries involving similar-to predicates, spatial semantic operators, and references to conceptual terms.
Abstract: A knowledge based approach is introduced for retrieving images by content. It supports the answering of conceptual image queries involving similar-to predicates, spatial semantic operators, and references to conceptual terms. Interested objects in the images are represented by contours segmented from images. Image content such as shapes and spatial relationships are derived from object contours according to domain specific image knowledge. A three layered model is proposed for integrating image representations, extracted image features, and image semantics. With such a model, images can be retrieved based on the features and content specified in the queries. The knowledge based query processing is based on a query relaxation technique. The image features are classified by an automatic clustering algorithm and represented by Type Abstraction Hierarchies (TAHs) for knowledge based query processing. Since the features selected for TAH generation are based on context and user profile, and the TAHs can be generated automatically by a clustering algorithm from the feature database, our proposed image retrieval approach is scalable and context sensitive. The performance of the proposed knowledge based query processing is also discussed.

Journal Article•DOI•
TL;DR: The architecture and associated algorithms for generating the supported subsuming queries and filters for Boolean queries in one rich front-end language are introduced, and it is shown that the generated subsuming queries return a minimal number of documents.
Abstract: Searching over heterogeneous information sources is difficult because of the nonuniform query languages. Our approach is to allow a user to compose Boolean queries in one rich front end language. For each user query and target source, we transform the user query into a subsuming query that can be supported by the source but that may return extra documents. The results are then processed by a filter query to yield the correct final result. We introduce the architecture and associated algorithms for generating the supported subsuming queries and filters. We show that generated subsuming queries return a minimal number of documents; we also discuss how minimal cost filters can be obtained. We have implemented prototype versions of these algorithms and demonstrated them on heterogeneous Boolean systems.
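
A toy sketch of the subsuming-query-plus-filter idea: a hypothetical source that cannot evaluate NOT receives a weaker query guaranteed to return a superset of the answers, and the full query is applied locally as the filter. The document collection and the query translation below are invented for illustration; the paper's front-end language and translation algorithms are far more general.

```python
documents = {
    1: {"database", "mining", "survey"},
    2: {"database", "transaction"},
    3: {"mining", "neural"},
}

# Front-end query: database AND (NOT transaction)
full_query = lambda terms: "database" in terms and "transaction" not in terms

# Subsuming query supported by the source: just "database" (may return extras).
subsuming_query = lambda terms: "database" in terms

candidates = {doc_id for doc_id, terms in documents.items() if subsuming_query(terms)}
answers = {doc_id for doc_id in candidates if full_query(documents[doc_id])}
print("returned by source:", sorted(candidates))  # [1, 2]
print("after filter query:", sorted(answers))     # [1]
```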

Journal Article•DOI•
TL;DR: A set of extended aggregate operations, namely sum, average, count, maximum, and minimum, is defined that can be applied to an attribute containing partial values, and it is pointed out that for sum and average the computations in general take exponential time.
Abstract: Imprecise data in databases were originally denoted as null values, which represent the meaning of "values unknown at present." More generally, a partial value corresponds to a finite set of possible values for an attribute in which exactly one of the values is the "true" value. We define a set of extended aggregate operations, namely sum, average, count, maximum, and minimum, which can be applied to an attribute containing partial values. Two types of aggregate operators are considered: scalar aggregates and aggregate functions. We study the properties of the aggregate operations and develop efficient algorithms for count, maximum and minimum. However, for sum and average, we point out that in general it takes exponential time complexity to do the computations.
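
A small worked example of aggregates over partial values, under the semantics sketched above: COUNT is exact, MAX and MIN can be bounded without enumeration, and SUM in general requires the cross product of the partial values. The salary data is invented, and the paper's actual algorithms are not reproduced.

```python
from itertools import product

salaries = [{30}, {25, 35}, {40, 45, 50}]   # three tuples, partial values

# COUNT is exact: the number of tuples.
print("count =", len(salaries))

# MAX/MIN: the result is itself a partial value, bounded without enumeration.
print("max is between", max(min(s) for s in salaries),
      "and", max(max(s) for s in salaries))          # between 40 and 50
print("min is between", min(min(s) for s in salaries),
      "and", min(max(s) for s in salaries))          # between 25 and 30

# SUM: in general every combination can yield a distinct possible total, so
# the set of possible sums is built from the cross product (1*2*3 = 6 here).
possible_sums = {sum(combo) for combo in product(*salaries)}
print("possible sums:", sorted(possible_sums))
```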

Journal Article•DOI•
TL;DR: The paper describes the World Wide Web Index and Search Engine (WISE) for Internet resource discovery, which is designed around a resource database containing meta-information about WWW resources that is built automatically by an indexer robot, a special WWW client agent.
Abstract: The paper describes the World Wide Web Index and Search Engine (WISE) for Internet resource discovery. The system is designed around a resource database containing meta information about WWW resources and is automatically built using an indexer robot, a special WWW client agent. The resource database allows users to search for resources based on keywords, and to learn about potentially relevant resources without having to directly access them. Such capabilities can significantly reduce the amount of time that a user needs to spend in order to find the information of his/her interest. We discuss WISE's main components: the resource database, the indexer robot, the search engine, and the user interface, and through the technical discussions, we highlight the research issues involved in the design, the implementation and the evaluation of such a system.

Journal Article•DOI•
TL;DR: This study shows that knowledge discovery substantially broadens the spectrum of intelligent query answering and may have deep implications on query answering in data- and knowledge-base systems.
Abstract: Knowledge discovery facilitates querying database knowledge and intelligent query answering in database systems. We investigate the application of discovered knowledge, concept hierarchies, and knowledge discovery tools for intelligent query answering in database systems. A knowledge-rich data model is constructed to incorporate discovered knowledge and knowledge discovery tools. Queries are classified into data queries and knowledge queries. Both types of queries can be answered directly by simple retrieval or intelligently by analyzing the intent of query and providing generalized, neighborhood or associated information using stored or discovered knowledge. Techniques have been developed for intelligent query answering using discovered knowledge and/or knowledge discovery tools, which includes generalization, data summarization, concept clustering, rule discovery, query rewriting, deduction, lazy evaluation, application of multiple-layered databases, etc. Our study shows that knowledge discovery substantially broadens the spectrum of intelligent query answering and may have deep implications on query answering in data- and knowledge-base systems.

Journal Article•DOI•
E.N. Hanson1•
TL;DR: The design and implementation of the Ariel DBMS and its tightly coupled forward-chaining rule system are described; Ariel supports traditional relational database query and update operations efficiently, using a System R-like query processing strategy.
Abstract: Describes the design and implementation of the Ariel DBMS and its tightly-coupled forward-chaining rule system. The query language of Ariel is a subset of POSTQUEL (the POSTGRES QUEry Language), extended with a new production-rule sublanguage. Ariel supports traditional relational database query and update operations efficiently, using a System R-like query processing strategy. In addition, the Ariel rule system is tightly coupled with query and update processing. Ariel rules can have conditions based on a mix of selections, joins, events and transitions. For testing rule conditions, Ariel makes use of a discrimination network composed of a special data structure for testing single-relation selection conditions efficiently, and a modified version of the TREAT algorithm, called A-TREAT, for testing join conditions. The key modification to TREAT (which could also be used in the Rete algorithm) is the use of virtual /spl alpha/-memory nodes which save storage since they contain only the predicate associated with the memory node instead of copies of data matching the predicate. In addition, the notions of tokens and /spl alpha/-memory nodes are generalized to support event and transition conditions. The rule-action executor in Ariel binds the data matching a rule's condition to the action of the rule at rule fire time, and executes the rule action using the query processor.
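
The space argument behind virtual alpha-memory nodes can be illustrated as follows: a conventional alpha-memory materializes copies of every tuple matching a selection predicate, whereas a virtual one keeps only the predicate and recomputes (or index-fetches) matches on demand. The relation and predicate below are invented, and this is a simplification for intuition, not Ariel's A-TREAT implementation.

```python
employees = [
    {"name": "ann", "dept": "toys", "salary": 52000},
    {"name": "bob", "dept": "shoes", "salary": 38000},
    {"name": "eve", "dept": "toys", "salary": 61000},
]

predicate = lambda t: t["dept"] == "toys" and t["salary"] > 50000

# Stored alpha-memory: materializes copies of every matching tuple.
stored_alpha_memory = [t for t in employees if predicate(t)]

# Virtual alpha-memory: stores just the predicate; matches are recomputed
# against the base relation when a join needs them.
class VirtualAlphaMemory:
    def __init__(self, pred):
        self.pred = pred          # only the predicate is kept
    def matches(self, relation):
        return (t for t in relation if self.pred(t))

virtual = VirtualAlphaMemory(predicate)
print("stored copies:", len(stored_alpha_memory))
print("virtual matches:", [t["name"] for t in virtual.matches(employees)])
```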

Journal Article•DOI•
TL;DR: This work presents an authorization model for distributed hypertext systems that supports authorizations at different granularity levels, takes into consideration different types of data and the relationships among them, and allows administrative privileges to be delegated.
Abstract: Digital libraries support quick and efficient access to a large number of information sources that are distributed but interlinked. As the amount of information to be shared grows, the need to restrict access only to specific users or for specific usage will surely arise. The protection of information in digital libraries, however, is difficult because of the peculiarity of the hypertext paradigm which is generally used to represent information in digital libraries, together with the fact that related data in a hypertext are often distributed at different sites. We present an authorization model for distributed hypertext systems. Our model supports authorizations at different granularity levels, takes into consideration different types of data and the relationships among them, and allows administrative privileges to be delegated.

Journal Article•DOI•
TL;DR: This paper defines a consistent framework of temporal equivalents of the important conventional database design concepts: functional dependencies, primary keys, and third and Boyce-Codd normal forms, which apply equally well to all temporal data models that have timeslice operators.
Abstract: Normal forms play a central role in the design of relational databases. Several normal forms for temporal relational databases have been proposed. These definitions are particular to specific temporal data models, which are numerous and incompatible. The paper attempts to rectify this situation. We define a consistent framework of temporal equivalents of the important conventional database design concepts: functional dependencies, primary keys, and third and Boyce-Codd normal forms. This framework is enabled by making a clear distinction between the logical concept of a temporal relation and its physical representation. As a result, the role played by temporal normal forms during temporal database design closely parallels that of normal forms during conventional database design. These new normal forms apply equally well to all temporal data models that have timeslice operators, including those employing tuple timestamping, backlogs, and attribute value timestamping. As a basis for our research, we conduct a thorough examination of existing proposals for temporal dependencies, keys, and normal forms. To demonstrate the generality of our approach, we outline how normal forms and dependency theory can also be applied to spatial and spatiotemporal databases.
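
One way to make the timeslice-based framework concrete: a temporal functional dependency X → Y holds exactly when the ordinary dependency holds in every snapshot of the temporal relation. The sketch below assumes tuple timestamping with integer time points purely for illustration.

```python
from collections import defaultdict

# (emp, dept, time) tuples: emp determines dept within each snapshot,
# even though an employee may change departments over time.
r = [
    ("ann", "toys", 1), ("bob", "shoes", 1),
    ("ann", "books", 2), ("bob", "shoes", 2),
]

def temporal_fd_holds(rel, x_idx, y_idx, t_idx):
    snapshots = defaultdict(dict)          # time -> {X value: Y value}
    for tup in rel:
        t, x, y = tup[t_idx], tup[x_idx], tup[y_idx]
        if x in snapshots[t] and snapshots[t][x] != y:
            return False                   # FD violated inside one snapshot
        snapshots[t][x] = y
    return True

print(temporal_fd_holds(r, 0, 1, 2))                          # True: emp -> dept in each snapshot
print(temporal_fd_holds(r + [("ann", "games", 2)], 0, 1, 2))  # False: conflict at time 2
```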

Journal Article•DOI•
TL;DR: New techniques are developed for uncertainty management in expert systems for two generic classes of problems, using fuzzy Petri nets that represent logical connectivity among a set of imprecise propositions, together with an algorithm for selecting one piece of evidence from each set of mutually inconsistent evidences, a process referred to as nonmonotonic reasoning.
Abstract: The paper aims at developing new techniques for uncertainty management in expert systems for two generic classes of problems using fuzzy Petri nets that represent logical connectivity among a set of imprecise propositions. One class of problems deals with the computation of fuzzy belief of any proposition from the fuzzy beliefs of a set of independent initiating propositions in a given network. The other class of problems is concerned with the computation of steady-state fuzzy beliefs of the propositions embedded in the network, from their initial fuzzy beliefs through a process called belief revision. During belief revision, a fuzzy Petri net with cycles may exhibit "limit cycle behavior" of fuzzy beliefs for some propositions in the network. No decisions can be arrived at from a fuzzy Petri net with such behavior. To circumvent this problem, techniques have been developed for the detection and elimination of limit cycles. Further, an algorithm for selecting one piece of evidence from each set of mutually inconsistent evidences, referred to as nonmonotonic reasoning, has also been presented in connection with the problems of belief revision. Finally, the concepts proposed for solving the problems of belief revision have been applied successfully for tackling imprecision, uncertainty, and nonmonotonicity of evidences in an illustrative expert system for criminal investigation.
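
The notions of belief revision and limit-cycle detection can be illustrated on a two-place cyclic net, as in the sketch below. The synchronous min-based update rule is a deliberately simple stand-in chosen so that a cycle actually appears, and the certainty factors are invented; this is not the paper's transition model or its cycle-elimination procedure.

```python
# Rules: (antecedent places, consequent place, rule certainty factor).
# Note: each place has a single incoming rule here; combining several rules
# for one consequent (e.g. by max) is omitted for brevity.
rules = [
    (("p1",), "p2", 1.0),
    (("p2",), "p1", 1.0),     # the cycle p1 -> p2 -> p1
]
beliefs = {"p1": 0.3, "p2": 0.7}

def revise(old):
    new = dict(old)
    for antecedents, consequent, cf in rules:
        new[consequent] = min([old[a] for a in antecedents] + [cf])
    return new

seen = []
for step in range(50):
    state = tuple(sorted(beliefs.items()))
    if state in seen:
        period = len(seen) - seen.index(state)
        kind = "steady state" if period == 1 else f"limit cycle of period {period}"
        print(f"{kind} detected after {step} revisions: {dict(state)}")
        break
    seen.append(state)
    beliefs = revise(beliefs)
else:
    print("no repetition within 50 revisions:", beliefs)
```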

Journal Article•DOI•
TL;DR: The theory tightly unifies the constraint logic programming scheme of Jaffar and Lassez (1987), the generalized annotated logic programming theory of Kifer and Subrahmanian (1989), and the stable model semantics of Gelfond and Lifschitz (1988).
Abstract: Deductive databases that interact with, and are accessed by, reasoning agents in the real world (such as logic controllers in automated manufacturing, weapons guidance systems, aircraft landing systems, land-vehicle maneuvering systems, and air-traffic control systems) must have the ability to deal with multiple modes of reasoning. Specifically, the types of reasoning we are concerned with include, among others, reasoning about time, reasoning about quantitative relationships that may be expressed in the form of differential equations or optimization problems, and reasoning about numeric modes of uncertainty about the domain which the database seeks to describe. Such databases may need to handle diverse forms of data structures, and frequently they may require use of the assumption-based nonmonotonic representation of knowledge. A hybrid knowledge base is a theoretical framework capturing all the above modes of reasoning. The theory tightly unifies the constraint logic programming scheme of Jaffar and Lassez (1987), the generalized annotated logic programming theory of Kifer and Subrahmanian (1989), and the stable model semantics of Gelfond and Lifschitz (1988). New techniques are introduced which extend both the work on annotated logic programming and the stable model semantics.

Journal Article•DOI•
Heping Shang1, T.H. Merrett1•
TL;DR: A trie based method whose cost is independent of document size is presented, and it is shown that this new method significantly outperforms the nearest competitor for k=0 and k=1, which are arguably the most important cases.
Abstract: Tries offer text searches with costs which are independent of the size of the document being searched, and so are important for large documents requiring spelling checkers, case insensitivity, and limited approximate regular expression matching. Approximate searches, in which the search pattern differs from the document by k substitutions, transpositions, insertions or deletions, have hitherto been carried out only at costs linear in the size of the document. We present a trie based method whose cost is independent of document size. Our experiments show that this new method significantly outperforms the nearest competitor for k=0 and k=1, which are arguably the most important cases. The linear cost (in k) of the other methods begins to catch up, for our small files, only at k=2. For larger files, complexity arguments indicate that tries will outperform the linear methods for larger values of k. The indexes combine suffixes and so are compact in storage. When the text itself does not need to be stored, as in a spelling checker, we even obtain negative overhead: 50% compression. We discuss a variety of applications and extensions, including best match (for spelling checkers), case insensitivity, and limited approximate regular expression matching.
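
For intuition about searching a trie with an error budget k, here is a compact Python sketch that maintains one edit-distance row per trie node and prunes any branch whose best possible distance already exceeds k. It handles substitutions, insertions, and deletions only (no transpositions) and ignores the paper's suffix-combining compression.

```python
def build_trie(words):
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node["$"] = w           # mark end of word
    return root

def search(trie, pattern, k):
    results = []
    first_row = list(range(len(pattern) + 1))

    def walk(node, row):
        if "$" in node and row[-1] <= k:
            results.append((node["$"], row[-1]))
        if min(row) > k:        # every extension is already too far away
            return
        for ch, child in node.items():
            if ch == "$":
                continue
            new_row = [row[0] + 1]
            for j in range(1, len(pattern) + 1):
                cost = 0 if pattern[j - 1] == ch else 1
                new_row.append(min(new_row[j - 1] + 1,   # insertion
                                   row[j] + 1,           # deletion
                                   row[j - 1] + cost))   # match/substitution
            walk(child, new_row)

    walk(trie, first_row)
    return results

trie = build_trie(["trie", "tree", "tried", "trial", "tries"])
print(search(trie, "tris", 1))   # [('trie', 1), ('tries', 1)]
```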

Journal Article•DOI•
TL;DR: A programmable system that supports implementation-independent specification of application-specific extended transaction models (ETMs) and configuration of transaction management mechanisms (TMMs) to enforce specified ETMs is discussed in the context of a distributed object management system.
Abstract: A Transaction Specification and Management Environment (TSME) is a programmable system that supports implementation-independent specification of application-specific extended transaction models (ETMs) and configuration of transaction management mechanisms (TMMs) to enforce specified ETMs. The TSME can ensure correctness and reliability while allowing the functionality required by workflows and other advanced applications that require access to multiple heterogeneous, autonomous, and/or distributed (HAD) systems. To support ETM specification, the TSME provides a transaction specification language that describes dependencies between transactions. Unlike other ETM specification languages, TSME's dependency descriptors use a common set of primitives, and are enforceable, i.e., can be evaluated at any time during transaction execution to determine whether operations issued violate ETM specifications. To determine whether an ETM can be enforced in a specific HAD system environment, the TSME supports specification of the transactional capabilities of HAD systems, and comparison of these with ETM specifications to determine mismatches. To enforce ETMs that are more restrictive than those supported by the union of the transactional capabilities of HAD systems, the TSME provides a collection of transactional services. These services are programmable and configurable, i.e., they accept instructions that change their behavior as required by an ETM and can be combined in specific ways to create a run-time TMM capable of enforcing the ETM. We discuss the TSME in the context of a distributed object management system. We give ETM specification examples and describe corresponding TMM configurations for a telecommunications application.

Journal Article•DOI•
TL;DR: This paper describes a framework for electromechanical product data that has been implemented in a structure editor and is being used to support a range of engineering applications.
Abstract: IT support for engineering involves the integration of existing, evolving and future product data, and software that processes that data. Thus, there is increasing interest in the representation of product data in the computer to support CAE applications. To avoid duplication and inconsistency, and to support the use of new implementation technology as it emerges, conceptual models of product data are required. Such models are independent of the software and hardware environments in which they are implemented. System architectures to support the integration of applications at implementation time are becoming an accepted part of engineering information systems. To use these software support environments effectively, integrated product data is required. It must also be possible to extend the integrated product data in a controlled fashion if it is to evolve to support future engineering applications effectively. A framework that is a part of the product data at the conceptual modeling stage helps to satisfy these requirements. The framework presented is a structure for the information content of product data rather than for the implementation of such data. Product data based on the framework can be successfully implemented in a number of different database forms. This paper describes a framework for electromechanical product data that has been implemented in a structure editor and is being used to support a range of engineering applications. The process of product data integration can be improved by using existing integration strategies together with a framework that provides an overall organization for the data.

Journal Article•DOI•
TL;DR: It is shown that most distributed query optimization problems can be transformed into an optimization problem comprising a set of binary decisions, termed the Sum Product Optimization (SPO) problem; SPO is proved NP-hard, and five classes of distributed query optimization problems are then proved NP-hard by polynomially reducing SPO to each of them.
Abstract: While a significant amount of research effort has been reported on developing algorithms, based on joins and semijoins, to tackle distributed query processing, there is relatively little progress made toward exploring the complexity of the problems studied. As a result, proving NP-hardness of or devising polynomial-time algorithms for certain distributed query optimization problems has been elaborated upon by many researchers. However, due to its inherent difficulty, the complexity of the majority of problems on distributed query optimization remains unknown. In this paper we generally characterize the distributed query optimization problems and provide a framework to explore their complexity. As will be shown, most distributed query optimization problems can be transformed into an optimization problem comprising a set of binary decisions, termed the Sum Product Optimization (SPO) problem. We first prove SPO is NP-hard in light of the NP-completeness of a well-known problem, Knapsack (KNAP). Then, using this result as a basis, we prove that five classes of distributed query optimization problems, which cover the majority of distributed query optimization problems previously studied in the literature, are NP-hard by polynomially reducing SPO to each of them. The detail for each problem transformation is derived. We not only prove the conjecture that many prior studies relied upon, but also provide a framework for future related studies.

Journal Article•DOI•
TL;DR: The results from training a recurrent neural network to recognize a known non-trivial, randomly-generated regular grammar show that not only do the networks preserve correct rules but that they are able to correct, through training, inserted rules which were initially incorrect.
Abstract: Recurrent neural networks readily process, recognize and generate temporal sequences. By encoding grammatical strings as temporal sequences, recurrent neural networks can be trained to behave like deterministic sequential finite-state automata. Algorithms have been developed for extracting grammatical rules from trained networks. Using a simple method for inserting prior knowledge (or rules) into recurrent neural networks, we show that recurrent neural networks are able to perform rule revision. Rule revision is performed by comparing the inserted rules with the rules in the finite-state automata extracted from trained networks. The results from training a recurrent neural network to recognize a known non-trivial, randomly-generated regular grammar show that not only do the networks preserve correct rules but that they are able to correct through training inserted rules which were initially incorrect (i.e. the rules were not the ones in the randomly generated grammar).

Journal Article•DOI•
TL;DR: Theoretical and experimental observations show that the method presented is more practical than existing ones with respect to the use of dynamic key sets, the storage of information associated with keys, and the compression of transitions, and that it can uniquely determine information corresponding to keys while a DAWG cannot.
Abstract: A trie structure is frequently used for various applications, such as natural language dictionaries, database systems and compilers. However, the total number of states of a trie (and transitions between them) becomes large, so that the space cost may not be acceptable for a huge key set. In order to resolve this disadvantage, this paper presents a new scheme, called a "two-trie", that enables us to perform efficient retrievals, insertions and deletions for the key sets. The essential idea is to construct two tries for both front and rear compressions of keys, which is similar to a DAWG (directed acyclic word-graph). The approach differs from a DAWG in that the two-trie approach presented can uniquely determine information corresponding to keys while a DAWG cannot. For an efficient implementation of the two-trie, two types of data structures are introduced. Theoretical and experimental observations show that the method presented is more practical than existing ones considering the use of dynamic key sets, information storage of keys and compression of transitions.
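
As a back-of-the-envelope illustration of why compressing both key fronts and key rears pays off, the sketch below compares the node counts of a plain trie and of a trie over the reversed keys for a small invented key set. It only motivates the idea and does not implement the paper's linked two-trie structure or its information storage for keys.

```python
def trie_nodes(words):
    """Build a plain trie and count its nodes (excluding the root)."""
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
    count = 0
    stack = [root]
    while stack:
        node = stack.pop()
        count += len(node)
        stack.extend(node.values())
    return count

keys = ["nation", "station", "relation", "national", "stationed"]
plain = sum(len(k) for k in keys)                 # characters with no sharing
front = trie_nodes(keys)                          # front (prefix) compression only
rear = trie_nodes([k[::-1] for k in keys])        # rear (suffix) compression only
print(f"no sharing: {plain}  front trie: {front}  rear trie: {rear}")
```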