
Showing papers in "Distributed and Parallel Databases in 1995"


Journal ArticleDOI
TL;DR: This paper provides a high-level overview of the current workflow management methodologies and software products and discusses how distributed object management and customized transaction management can support further advances in the commercial state of the art in this area.
Abstract: Today's business enterprises must deal with global competition, reduce the cost of doing business, and rapidly develop new services and products. To address these requirements, enterprises must constantly reconsider and optimize the way they do business and change their information systems and applications to support evolving business processes. Workflow technology facilitates these efforts by providing methodologies and software to support (i) business process modeling to capture business processes as workflow specifications, (ii) business process reengineering to optimize specified processes, and (iii) workflow automation to generate workflow implementations from workflow specifications. This paper provides a high-level overview of the current workflow management methodologies and software products. In addition, we discuss the infrastructure technologies that can address the limitations of current commercial workflow technology and extend the scope and mission of workflow management systems to support increased workflow automation in complex real-world environments involving heterogeneous, autonomous, and distributed information systems. In particular, we discuss how distributed object management and customized transaction management can support further advances in the commercial state of the art in this area.

1,687 citations


Journal ArticleDOI
TL;DR: The classification of conflict resolution techniques includes not only those necessary for resolving the schematic conflicts identified in the earlier paper, but also those for additional conflicts that arise when OODBs become part of the databases to be integrated.
Abstract: The objective of a multidatabase system is to provide a single uniform interface for accessing multiple independent databases being managed by multiple independent, and possibly heterogeneous, database systems. One crucial element in the design of a multidatabase system is the design of a data definition language for specifying a schema that represents the integration of the schemas of multiple independent databases. The design of such a language in turn requires a comprehensive classification of the conflicts (i.e., discrepancies) among the schemas of the independent databases and development of techniques for resolving (i.e., homogenizing) all of the conflicts in the classification. An earlier paper provided a comprehensive classification of schematic conflicts that may arise when integrating multiple independent relational database (RDB) schemas into a single multidatabase (MDB) schema. In this paper, we provide a comprehensive classification of techniques for resolving the schematic conflicts that may arise when integrating multiple RDB schemas, or RDB schemas and object-oriented database (OODB) schemas, or multiple OODB schemas. The classification of conflict resolution techniques includes not only those necessary for resolving the schematic conflicts identified in the earlier paper, but also those for additional conflicts that arise when OODBs become part of the databases to be integrated. Most of the conflict resolution techniques discussed in the paper have already been incorporated into SQL/M, a multidatabase language implemented in UniSQL/M, a commercially available multidatabase system from UniSQL, Inc., which integrates SQL-based relational database systems and the UniSQL/X unified relational and object-oriented database system.
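As a rough illustration of what such conflict resolution involves, the sketch below homogenizes two invented source schemas that disagree on attribute names and units. In UniSQL/M these mappings would be expressed declaratively in SQL/M rather than in application code; everything here (database names, attributes, conversion factors) is made up for illustration.

```python
# Hedged sketch: resolving naming conflicts (same attribute, different names)
# and domain/unit conflicts (same attribute, different units) between two
# invented source schemas, by mapping each source into one integrated schema.

db1_row = {"emp_name": "Ada", "salary_usd": 50000}
db2_row = {"name": "Ben", "salary_cents": 6_000_000}

# Per-source mapping into the integrated multidatabase schema.
mappings = {
    "db1": {"name": lambda r: r["emp_name"],
            "salary": lambda r: r["salary_usd"]},
    "db2": {"name": lambda r: r["name"],
            "salary": lambda r: r["salary_cents"] / 100},  # unit conversion
}

def to_global(source, row):
    """Homogenize a source tuple into the integrated schema."""
    return {attr: f(row) for attr, f in mappings[source].items()}

print(to_global("db1", db1_row))  # {'name': 'Ada', 'salary': 50000}
print(to_global("db2", db2_row))  # {'name': 'Ben', 'salary': 60000.0}
```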

212 citations


Journal ArticleDOI
TL;DR: This paper presents a computational model for workflows that captures the behavior of both transactional and non-transactional tasks of different types, and develops two languages for specifying a workflow at different levels of abstraction.
Abstract: The computing environment in most medium-sized and large enterprises involves old mainframe-based (legacy) applications and systems as well as new workstation-based distributed computing systems. The objective of the METEOR project is to support multi-system workflow applications that automate enterprise operations. This paper deals with the modeling and specification of workflows in such applications. Tasks in our heterogeneous environment can be submitted through different types of interfaces on different processing entities. We first present a computational model for workflows that captures the behavior of both transactional and non-transactional tasks of different types. We then develop two languages for specifying a workflow at different levels of abstraction: the Workflow Specification Language (WFSL) is a declarative rule-based language used to express the application-level interactions between multiple tasks, while the Task Specification Language (TSL) focuses on the issues related to individual tasks. These languages are designed to address the important issues of inter-task dependencies, data formatting, data exchange, error handling, and recovery. The paper also presents an architecture for the workflow management system that supports the model and the languages.
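WFSL itself is not reproduced here, so the following minimal Python sketch only illustrates the underlying idea of declarative inter-task dependencies: rules state which predecessor outcomes allow a task to start. The task names and rule encoding are invented for illustration.

```python
# Hedged sketch of declarative inter-task dependencies, in the spirit of a
# rule-based workflow specification. Each rule says: <task> may start once
# every listed predecessor has reached the named state.
rules = {
    "check_credit":  [("receive_order", "done")],
    "bill_customer": [("check_credit", "done")],
    "notify_reject": [("check_credit", "failed")],
}

def runnable(task, states):
    """A task is runnable when all of its dependencies hold in `states`."""
    return all(states.get(dep) == required for dep, required in rules.get(task, []))

states = {"receive_order": "done", "check_credit": "done"}
print(runnable("bill_customer", states))  # True
print(runnable("notify_reject", states))  # False
```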

188 citations


Journal ArticleDOI
TL;DR: A taxonomy of the fragmentation problem in a distributed object base is reviewed, and a comprehensive set of polynomial-time algorithms for horizontally fragmenting the four realizable class models in the taxonomy is presented.
Abstract: Optimal application performance on a Distributed Object Based System (DOBS) requires class fragmentation and the development of allocation schemes to place fragments at distributed sites so data transfer is minimized. Fragmentation enhances application performance by reducing the amount of irrelevant data accessed and the amount of data transferred unnecessarily between distributed sites. Algorithms for effecting horizontal and vertical fragmentation of relations exist, but fragmentation techniques for class objects in a distributed object based system are yet to appear in the literature. This paper first reviews a taxonomy of the fragmentation problem in a distributed object base. The paper then contributes by presenting a comprehensive set of algorithms for horizontally fragmenting the four realizable class models on the taxonomy. The fundamental approach is top-down, where the entity of fragmentation is the class object. Our approach consists of first generating primary horizontal fragments of a class based on only applications accessing this class, and secondly generating derived horizontal fragments of the class arising from primary fragments of its subclasses, its complex attributes (contained classes), and/or its complex methods classes. Finally, we combine the sets of primary and derived fragments of each class to produce the best possible fragments. Thus, these algorithms account for inheritance and class composition hierarchies as well as method nesting among objects, and are shown to be polynomial time.
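To make the first step concrete, here is a minimal sketch of primary horizontal fragmentation: instances of a class are partitioned by the simple predicates that applications actually use. The class, attributes, and predicates are invented examples; the paper's full algorithms also derive fragments from subclasses, contained classes, and nested methods, which this sketch omits.

```python
# Hedged sketch: primary horizontal fragments of one class, driven only by
# the predicates of the applications that access it (invented example data).

employees = [
    {"id": 1, "dept": "sales", "salary": 40000},
    {"id": 2, "dept": "sales", "salary": 90000},
    {"id": 3, "dept": "r&d",   "salary": 70000},
]

# Minterm-style predicates drawn from two hypothetical applications.
predicates = {
    "sales_low":  lambda e: e["dept"] == "sales" and e["salary"] < 60000,
    "sales_high": lambda e: e["dept"] == "sales" and e["salary"] >= 60000,
    "other":      lambda e: e["dept"] != "sales",
}

def primary_fragments(instances, preds):
    """Group instances into disjoint fragments, one per predicate."""
    return {name: [i for i in instances if p(i)] for name, p in preds.items()}

for name, frag in primary_fragments(employees, predicates).items():
    print(name, [e["id"] for e in frag])
```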

67 citations


Journal ArticleDOI
TL;DR: Various optimizations are presented and analyzed in terms of reliability, savings in log writes and network traffic, and reduction in resource lock time and the feasibility and performance of several optimization combinations are analyzed.
Abstract: An atomic commit protocol can ensure that all participants in a distributed transaction reach consistent states, whether or not system or network failures occur. The atomic commit protocol used in industry and academia is the well-known two-phase commit (2PC) protocol, which has been the subject of considerable work and technical literature for some years.
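Since the abstract centers on 2PC, a minimal control-flow sketch may help orient the reader. This is a toy rendering of the protocol's two phases under the assumptions noted in the comments: messaging, forced log writes, and failure handling are reduced to placeholders, and all class and function names are invented.

```python
# Hedged sketch of a two-phase commit (2PC) coordinator. Transport and log
# storage are stand-ins; a real protocol must survive crashes via its logs.
from enum import Enum

class Vote(Enum):
    YES = "yes"
    NO = "no"

class Participant:
    """Toy participant that votes on and then applies a transaction."""
    def __init__(self, name, will_commit=True):
        self.name = name
        self.will_commit = will_commit
        self.state = "initial"

    def prepare(self):
        # Phase 1: a real participant force-writes a prepare record, then votes.
        self.state = "prepared" if self.will_commit else "aborted"
        return Vote.YES if self.will_commit else Vote.NO

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"

def two_phase_commit(participants):
    """Coordinator: collect votes, then broadcast the global decision."""
    # Phase 1 (voting): any NO vote forces a global abort.
    votes = [p.prepare() for p in participants]
    decision = "commit" if all(v == Vote.YES for v in votes) else "abort"
    # The coordinator would force-write the decision to its log here (omitted).
    # Phase 2 (decision): broadcast and apply.
    for p in participants:
        p.commit() if decision == "commit" else p.abort()
    return decision

group = [Participant("A"), Participant("B"), Participant("C", will_commit=False)]
print(two_phase_commit(group))  # -> "abort": one participant voted NO
```

The optimizations the TL;DR mentions (fewer log writes, fewer messages, shorter lock times) all target the forced-log and broadcast steps sketched in the comments above.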

65 citations


Journal ArticleDOI
TL;DR: The intention of this article is to demonstrate that a general-purpose programming language can serve both aspects of work flow specification and task specification.
Abstract: Work flow management requires language support for work flow specification and task specification. Many approaches and systems for work flow management therefore offer at least one new language for work flow specification; task specification is usually done in a traditional language. This is motivated in particular by the fact that many components already exist and the task of the work flow tool is the specification of the interaction between these components. The intention of this article is to demonstrate that a general-purpose programming language can serve both aspects. We do not really see the need to develop yet another language that a user or application programmer must learn. If an existing programming language like C or Prolog is extended towards work flow capabilities, it is easy to reuse existing autonomous software components and to build interfaces among them.
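The article's thesis can be illustrated in miniature: below, both the workflow and its tasks are ordinary Python, with plain functions standing in for existing autonomous components. All names and data are invented for illustration.

```python
# Hedged sketch: the workflow is just control flow in the host language,
# and tasks are wrappers around (here, fake) existing components.

def fetch_order(order_id):      # stands in for an existing component
    return {"id": order_id, "amount": 120}

def check_credit(order):        # another existing component, wrapped as a task
    return order["amount"] < 1000

def workflow(order_id):
    """Inter-task control and data flow, expressed in the language itself."""
    order = fetch_order(order_id)
    if check_credit(order):
        return f"order {order['id']} approved"
    return f"order {order['id']} rejected"

print(workflow(42))
```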

31 citations


Journal ArticleDOI
TL;DR: This paper presents an approach for incrementally updating a distributed, replicated database without requiring multi-site atomic commit protocols, and proves that the mechanism is correct, as it asymptotically performs all the updates on all the copies.
Abstract: Update propagation and transaction atomicity are major obstacles to the development of replicated databases. Many practical applications, such as automated teller machine (ATM) networks, flight reservation, and part inventory control, do not really require these properties. In this paper we present an approach for incrementally updating a distributed, replicated database without requiring multi-site atomic commit protocols. We prove that the mechanism is correct, as it asymptotically performs all the updates on all the copies. Our approach has two important characteristics: it is progressive and non-blocking. Progressive means that the transaction's coordinator always commits, possibly together with a group of other sites. The update is later propagated asynchronously to the remaining sites. Non-blocking means that each site can take unilateral decisions at each step of the algorithm. Sites which cannot commit updates are brought to the same final state by means of a reconciliation mechanism. This mechanism uses the history logs, which are stored locally at each site, to bring sites to agreement. It requires a small auxiliary data structure, called the reception vector, to keep track of the time until which the other sites are guaranteed to be up-to-date. Several optimizations to the basic mechanism are also discussed.
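A rough sketch of the reception-vector bookkeeping described above follows. The timestamps, log format, and propagation step are simplified illustrations under the stated assumptions, not the paper's actual protocol.

```python
# Hedged sketch: each site keeps a local history log plus a reception vector
# recording, per peer, the time up to which that peer is known to be
# up-to-date; missing updates are shipped asynchronously.

class Site:
    def __init__(self, name, all_sites):
        self.name = name
        self.log = []                       # local history log: (ts, update)
        self.clock = 0
        self.reception = {s: 0 for s in all_sites}

    def local_commit(self, update):
        """Coordinator always commits locally (the 'progressive' property)."""
        self.clock += 1
        self.log.append((self.clock, update))
        self.reception[self.name] = self.clock

    def propagate_to(self, other):
        """Asynchronously ship log entries the peer has not yet received."""
        since = self.reception[other.name]
        for ts, update in self.log:
            if ts > since:
                other.log.append((ts, update))
        self.reception[other.name] = self.clock  # peer now known up-to-date

sites = ["A", "B"]
a, b = Site("A", sites), Site("B", sites)
a.local_commit("x=1")   # commits without any multi-site atomic commit
a.propagate_to(b)       # the update reaches B later, asynchronously
print(b.log)            # [(1, 'x=1')]
```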

25 citations


Journal ArticleDOI
Calton Pu1, Wenwey Hseush1, Gail E. Kaiser1, Kun-Lung Wu2, Philip S. Yu2 
TL;DR: A divergence control algorithm for a heterogeneous distributed database system, where the local orderings of all the sub-transactions of a distributed epsilon transaction may not be the same and the total inconsistency may be greater than the sum of those of all its sub-transactions.
Abstract: This paper presents distributed divergence control algorithms for epsilon serializability for both homogeneous and heterogeneous distributed databases. Epsilon serializability allows for more concurrency by permitting non-serializable interleavings of database operations among epsilon transactions. We first present a strict 2-phase locking divergence control algorithm and an optimistic divergence control algorithm for a homogeneous distributed database system, where the local orderings of all the sub-transactions of a distributed epsilon transaction are the same. In such an environment, the total inconsistency of a distributed epsilon transaction is simply the sum of those of all its sub-transactions. We then describe a divergence control algorithm for a heterogeneous distributed database system, where the local orderings of all the sub-transactions of a distributed epsilon transaction may not be the same and the total inconsistency of a distributed epsilon transaction may be greater than the sum of those of all its sub-transactions. As a result, in addition to executing a local divergence control algorithm at each site to maintain the local inconsistency, a global mechanism is needed to take into account the additional inconsistency.
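The core idea can be sketched very simply: an epsilon transaction may tolerate non-serializable reads only while its accumulated inconsistency stays within its budget. The inconsistency units and conflict test below are simplified illustrations, not the paper's algorithms.

```python
# Hedged sketch of divergence control under epsilon serializability: permit
# a read that conflicts with an uncommitted write only if the transaction's
# inconsistency budget (epsilon) still covers it.

class EpsilonQuery:
    def __init__(self, epsilon):
        self.epsilon = epsilon      # tolerated inconsistency budget
        self.inconsistency = 0      # accumulated so far (sum over
                                    # sub-transactions in the homogeneous case)

    def try_read(self, item_has_uncommitted_write, weight=1):
        """Permit a non-serializable read only if the budget allows it."""
        if item_has_uncommitted_write:
            if self.inconsistency + weight > self.epsilon:
                return False        # would exceed the bound: block or abort
            self.inconsistency += weight
        return True

q = EpsilonQuery(epsilon=2)
print(q.try_read(True))   # True  (inconsistency = 1)
print(q.try_read(True))   # True  (inconsistency = 2)
print(q.try_read(True))   # False (bound of 2 would be exceeded)
```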

22 citations


Journal ArticleDOI
TL;DR: This framework extends the traditional algebraic transformation framework to include two-way outerjoins and GAD operations, and demonstrates that properties of selection/join predicates and attribute derivation functions can be used to provide interesting transformation alternatives.
Abstract: Existence of semantic conflicts between component databases severely impacts query processing in a multidatabase system. In this paper, we describe two types of semantic conflicts that have to be dealt with in the integration of databases modeling information about related sets of real-world entities. These are the entity identification problem and the attribute value conflict problem. While the two-way outerjoin operation has been commonly used for resolving the entity identification problem between two component relations, outerjoins using regular equality comparisons between component relation keys are shown to produce counter-intuitive entity identification results. We remedy this by defining a new key-equality comparator, in place of the regular equality comparator, for outerjoins. For the attribute value conflict problem, we define a Generalized Attribute Derivation (GAD) operation which allows user-defined attribute derivation functions to be used to compute new attributes from the component relations' attributes. By adding two-way outerjoin and GAD to the set of relational operations, the traditional algebraic transformation framework for relational queries is no longer adequate for multidatabase query processing and optimization. As a result, we introduce the constrained query tree as the multidatabase query representation. We show that some knowledge about query predicates and attribute derivation functions can be used to simplify queries. Such knowledge is modeled as an outerjoin graph attached to every outerjoin operation in the query tree. Based on this, we further extend the traditional algebraic transformation framework to include two-way outerjoins and GAD operations. Our framework demonstrates that properties of selection/join predicates and attribute derivation functions can be used to provide interesting transformation alternatives. This framework also serves as a formal ground for developing optimization strategies for multidatabase queries.
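The two operations the paper adds can be sketched as follows: a two-way outerjoin that keeps unmatched tuples from both sides, and a GAD step that applies a user-defined function to reconcile conflicting attribute values. The relations, the averaging function, and the plain key-equality test are invented illustrations; the paper's key-equality comparator has subtleties this sketch omits.

```python
# Hedged sketch: two-way outerjoin followed by Generalized Attribute
# Derivation (GAD) over invented example relations with unique keys.

r1 = [{"key": "k1", "price": 10}, {"key": "k2", "price": 20}]
r2 = [{"key": "k2", "price": 22}, {"key": "k3", "price": 30}]

def two_way_outerjoin(left, right, key="key"):
    """Pair tuples with matching keys; keep unmatched tuples from both sides."""
    right_by_key = {t[key]: t for t in right}   # assumes unique keys per relation
    matched = set()
    out = []
    for l in left:
        r = right_by_key.get(l[key])
        if r is not None:
            matched.add(l[key])
        out.append((l, r))                      # r is None when unmatched
    out.extend((None, r) for r in right if r[key] not in matched)
    return out

def gad(joined, derive):
    """Derive reconciled attributes from each (left, right) pair."""
    return [derive(l, r) for l, r in joined]

def merge_price(l, r):
    """Example user-defined derivation: average conflicting prices."""
    prices = [t["price"] for t in (l, r) if t is not None]
    return {"key": (l or r)["key"], "price": sum(prices) / len(prices)}

print(gad(two_way_outerjoin(r1, r2), merge_price))
# k1 -> 10.0, k2 -> 21.0 (conflict reconciled), k3 -> 30.0
```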

18 citations


Journal ArticleDOI
TL;DR: A theory of partitioned data is presented that formalizes the concept and establishes the basis to develop a correctness criterion and a concurrency control protocol for partitioned databases.
Abstract: In many distributed databases “locality of reference” is crucial to achieve acceptable performance. However, the purpose of data distribution is to spread the data among several remote sites. One way to solve this contradiction is to use partitioned data techniques. Instead of accessing the entire data, a site works on a fraction that is made locally available, thereby increasing the site's autonomy. We present a theory of partitioned data that formalizes the concept and establishes the basis to develop a correctness criterion and a concurrency control protocol for partitioned databases. Set-serializability is proposed as a correctness criterion and we suggest an implementation that integrates partitioned and non-partitioned data. To complete this study, the policies required in a real implementation are also analyzed.
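A small sketch of the partitioned-data idea may help: instead of every site coordinating on one global value, the value is split into per-site fractions that each site updates autonomously, and the global value is the sum of the fractions. The inventory example and numbers are invented.

```python
# Hedged sketch: a global stock of 100 units partitioned across two sites,
# so each site can satisfy withdrawals using only its local fraction.

partitions = {"site_A": 60, "site_B": 40}   # global value = sum of fractions

def local_withdraw(site, qty):
    """Succeeds using only the site's fraction: no remote coordination."""
    if partitions[site] >= qty:
        partitions[site] -= qty
        return True
    return False    # local fraction exhausted; would require repartitioning

print(local_withdraw("site_A", 50))  # True, handled entirely at site_A
print(sum(partitions.values()))      # 50 units remain globally
```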

Journal ArticleDOI
TL;DR: A proof of the intuitively well-understood fact that the “eigenorder” of a “chain” join is the best pre-defined combinatorial order for implementing the algorithm in [21], and a significant reduction of the time complexity from $O(m^2n^2 + m^3n)$ to $O(mn + m^2)$.
Abstract: This paper investigates the optimization problem of executing a join in a distributed database environment. The minimization of the communication cost for sending data through links has been adopted as the optimization criterion. We explore in this paper the approach of judiciously using join operations as reducers in distributed query processing. In general, this problem is computationally intractable. Restricting the execution of a join to a pre-defined combinatorial order leads to a possible solution in polynomial time. An algorithm for chain query computation has been proposed in [21]. The time complexity of the algorithm is $O(m^2n^2 + m^3n)$, where $n$ is the number of sites in the network and $m$ is the number of relations (fragments) involved in the join. In this paper, we first present a proof of the intuitively well-understood fact that the "eigenorder" of a "chain" join is the best pre-defined combinatorial order for implementing the algorithm in [21]. Second, we show a necessary and sufficient condition for a chain query with the eigenordering to be a "simple" query. For processing the class of simple queries, we show a significant reduction of the time complexity from $O(m^2n^2 + m^3n)$ to $O(mn + m^2)$. It is encouraging that, in practice, the most frequent queries belong to the category of simple queries.
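To see why the execution order matters at all, here is a toy cost comparison. The relation sizes and uniform selectivity are invented, and the crude size estimate is only illustrative; the paper's algorithm chooses a pre-defined order (the eigenorder) against a proper cost model.

```python
# Hedged sketch: communication cost of reducing the chain R1 - R2 - R3 in
# different orders, with invented sizes and an assumed uniform selectivity.

sizes = {"R1": 1000, "R2": 10, "R3": 500}   # estimated tuple counts (invented)
sel = 0.01                                   # assumed join selectivity

def shipped(order):
    """Tuples shipped when the running result is sent to each next site."""
    total, current = 0, sizes[order[0]]
    for rel in order[1:]:
        total += current                      # ship the intermediate result
        current = current * sizes[rel] * sel  # crude estimate of the join size
    return total

for order in [("R1", "R2", "R3"), ("R2", "R3", "R1"), ("R3", "R2", "R1")]:
    print(order, shipped(order))
# Starting from the small middle relation ships far fewer tuples (60 vs 1100).
```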

Journal ArticleDOI
TL;DR: This paper presents a load-balanced parallel sorting algorithm for shared-nothing architectures: a multiple-input multiple-output algorithm with four stages, based on a generalization of Batcher's odd-even merge, whose performance is guaranteed as long as $n$ is greater than $p^3$, which is the case of interest for sorting large relations.
Abstract: With the popularity of parallel database machines based on the shared-nothing architecture, it has become important to find external sorting algorithms which lead to a load-balanced computation, i.e., balanced execution, communication and output. If during the course of the sorting algorithm each processor is equally loaded, parallelism is fully exploited. Similarly, balanced communication will not congest the network traffic. Since sorting can be used to support a number of other relational operations (joins, duplicate elimination, building indexes, etc.), data skew produced by sorting can further lead to execution skew at later stages of these operations. In this paper we present a load-balanced parallel sorting algorithm for shared-nothing architectures. It is a multiple-input multiple-output algorithm with four stages, based on a generalization of Batcher's odd-even merge. At each stage the $n$ keys are evenly distributed among the $p$ processors (i.e., there is no final sequential merge phase) and the distribution of keys between stages ensures against network congestion. There is no assumption made on the key distribution and the algorithm performs equally well in the presence of duplicate keys. Hence our approach always guarantees its performance, as long as $n$ is greater than $p^3$, which is the case of interest for sorting large relations. In addition, processors can be added incrementally.
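For background on the merge the paper generalizes, here is the classical sequential formulation of Batcher's odd-even merge sort for power-of-two inputs. It emits a data-independent sequence of compare-exchange pairs; in a parallel setting, pairs within one stage can run concurrently on different processors. This is the textbook network, not the paper's four-stage algorithm.

```python
# Classical Batcher odd-even merge sort as a compare-exchange network
# (input length must be a power of two).

def oddeven_merge(lo, hi, r):
    """Compare-exchange pairs merging two sorted runs at stride r in [lo, hi]."""
    step = r * 2
    if step < hi - lo:
        yield from oddeven_merge(lo, hi, step)        # merge even subsequence
        yield from oddeven_merge(lo + r, hi, step)    # merge odd subsequence
        yield from ((i, i + r) for i in range(lo + r, hi - r, step))
    else:
        yield (lo, lo + r)

def oddeven_merge_sort(lo, hi):
    """Sort indices lo..hi inclusive; hi - lo + 1 must be a power of two."""
    if hi - lo >= 1:
        mid = lo + (hi - lo) // 2
        yield from oddeven_merge_sort(lo, mid)
        yield from oddeven_merge_sort(mid + 1, hi)
        yield from oddeven_merge(lo, hi, 1)

data = [7, 3, 6, 1, 8, 2, 5, 4]
for i, j in oddeven_merge_sort(0, len(data) - 1):
    if data[i] > data[j]:
        data[i], data[j] = data[j], data[i]           # compare-exchange
print(data)  # [1, 2, 3, 4, 5, 6, 7, 8]
```

Because the network is data-independent, it behaves identically regardless of key distribution, which is one reason odd-even merging is an attractive basis for skew-free parallel sorting.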

Journal ArticleDOI
TL;DR: Most files are small; about 60% of files on a system are never accessed again after being created, and very few files are ever modified.
Abstract: This paper discusses the collection, analysis, and interpretation of data pertaining to files in personal computer (PC) environments. We developed programs to collect and analyze data from PCs running the OS/2 operating system and using the High Performance File System (HPFS). The data collection program gathers information about file sizes, the times and dates of file creation, the last file access, and the last file update by scanning the contents of disk storage devices. The gathered information is used to analyze the distributions of file sizes, functional file lifetimes, and functional lifetimes of files' data. The analysis shows that most files are small (more than 60% of files on a system are smaller than 8 Kbytes), about 60% of files on a system are never accessed again after being created, and very few files are ever modified.
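A small modern analogue of the collection program is easy to sketch: walk a directory tree and bucket file sizes. The bucket boundaries below are invented; the authors' scanner also recorded creation, last-access, and last-update times, which this sketch omits.

```python
# Hedged sketch: count files per size bucket under a directory tree.
import os

def size_histogram(root, buckets=(1024, 8192, 65536, 1048576)):
    """Count files per size bucket (in bytes) under `root`."""
    counts = {f"<= {b} bytes": 0 for b in buckets}
    counts["larger"] = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                size = os.path.getsize(os.path.join(dirpath, name))
            except OSError:
                continue            # file vanished or unreadable: skip it
            for b in buckets:
                if size <= b:
                    counts[f"<= {b} bytes"] += 1
                    break
            else:
                counts["larger"] += 1
    return counts

print(size_histogram("."))
```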

Journal ArticleDOI
TL;DR: Three successively less restrictive definitions of validity are presented, each providing progressively improved handling of incomplete information and replacing the notion of global reconstructability with the less restrictive, yet intuitively natural notion of object reconstructability.
Abstract: This paper examines correctness issues that arise in distributed database design. A distributed relational database design is traditionally considered to be valid if every global relation can be reconstructed from its fragments by join operations. In this paper, three successively less restrictive definitions of validity are presented, each providing progressively improved handling of incomplete information. Building on these forms, a hybrid reconstruction approach involving inner and outer joins is proposed, and we briefly describe its application to query formulation.
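The role of outer joins in handling incomplete information can be sketched with two vertical fragments that share a key: an inner join silently drops entities missing from one fragment, while an outer join preserves them. The relations below are invented examples.

```python
# Hedged sketch: reconstructing a global relation from two vertical
# fragments. Inner join loses entities with incomplete information;
# a left outer join keeps them, with the missing attributes absent.

emp_names = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Ben"}]
emp_salaries = [{"id": 1, "salary": 50000}]          # id 2 is missing here

def inner_join(l, r, key="id"):
    idx = {t[key]: t for t in r}
    return [{**lt, **idx[lt[key]]} for lt in l if lt[key] in idx]

def left_outer_join(l, r, key="id"):
    idx = {t[key]: t for t in r}
    return [{**lt, **idx.get(lt[key], {})} for lt in l]

print(inner_join(emp_names, emp_salaries))       # Ben is lost
print(left_outer_join(emp_names, emp_salaries))  # Ben kept, salary unknown
```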