Book Chapter (DOI)

An Efficient Heuristic for Logical Optimization of ETL Workflows

TL;DR: This paper identifies activities that can be transferred between linear segments in an ETL flow and uses the re-orderings of the linear segments to obtain a cost-optimal semantically equivalent flow for a given complex flow.
Abstract: An ETL process is used to extract data from various sources, transform it, and load it into a Data Warehouse. In this paper, we analyse an ETL flow and observe that only some of the dependencies in an ETL flow are essential, while others merely represent the flow of data. For linear flows, we exploit the underlying dependency graph and develop a greedy heuristic technique to determine a reordering that significantly improves the quality of the flow. Rather than adopting a state-space search approach, we use the cost functions and selectivities to determine the best option at each position in a right-to-left manner. To deal with complex flows, we identify activities that can be transferred between their linear segments and position those activities appropriately. We then use the re-orderings of the linear segments to obtain a cost-optimal, semantically equivalent flow for a given complex flow. Experimental evaluation has shown that by using the proposed techniques, ETL flows can be better optimized, and with much less effort, compared to existing methods.
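The greedy idea in the abstract can be illustrated with a toy sketch (hypothetical activity names, costs, and selectivities; a simplified left-to-right variant in the spirit of the paper, not its exact right-to-left algorithm): respecting only the essential dependencies, repeatedly place the admissible activity with the lowest rank, (selectivity - 1) / cost, so that cheap, highly selective activities run early.

```python
# Illustrative greedy reordering of a linear ETL flow. All names and numbers
# below are invented for the example; only essential dependencies constrain
# the order, mirroring the paper's observation that the rest of the edges
# merely represent data flow.

def greedy_reorder(activities, cost, sel, deps):
    """Place, one position at a time, the admissible activity with the
    lowest rank = (selectivity - 1) / cost.
    deps maps an activity to the set of activities it depends on."""
    placed, order = set(), []
    while len(order) < len(activities):
        # Admissible = not yet placed and all prerequisites already placed.
        ready = [a for a in activities
                 if a not in placed and deps.get(a, set()) <= placed]
        best = min(ready, key=lambda a: (sel[a] - 1.0) / cost[a])
        order.append(best)
        placed.add(best)
    return order

acts = ["filter_nulls", "surrogate_key", "filter_region"]
cost = {"filter_nulls": 1.0, "surrogate_key": 5.0, "filter_region": 1.0}
sel  = {"filter_nulls": 0.9, "surrogate_key": 1.0, "filter_region": 0.2}
deps = {"surrogate_key": {"filter_nulls"}}   # the only essential dependency
print(greedy_reorder(acts, cost, sel, deps))
# → ['filter_region', 'filter_nulls', 'surrogate_key']
```

The highly selective `filter_region` moves ahead of the expensive `surrogate_key`, which must still follow `filter_nulls` because of the essential dependency.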
Citations
Journal Article (DOI)
01 Dec 2017
TL;DR: This paper explains the existing techniques for constructing a conceptual and a logical model of an ETL workflow, its corresponding physical implementation, and its optimization, and proposes a theoretical ETL framework for ETL optimization.
Abstract: In this paper, we discuss the state of the art and current trends in designing and optimizing ETL workflows. We explain the existing techniques for: (1) constructing a conceptual and a logical model of an ETL workflow, (2) its corresponding physical implementation, and (3) its optimization, illustrated by examples. The discussed techniques are analyzed w.r.t. their advantages, disadvantages, and challenges in the context of metrics such as autonomous behavior, support for quality metrics, and support for ETL activities as user-defined functions. We draw conclusions on still open research and technological issues in the field of ETL. Finally, we propose a theoretical ETL framework for ETL optimization.

54 citations


Cites background or methods from "An Efficient Heuristic for Logical ..."

  • ...[11] propose the dependency graph that is used for narrowing the space of allowed rearrangements of tasks within a given workflow....

  • ...The optimization concept contributed in [11] draws upon the idea of rearranging tasks (activities) in an ETL workflow (as proposed in [12]), in order to construct a more efficient variant of this workflow....

  • ...The following assumptions are made in [11]:...

  • ...[11] introduced a dependency graph, a structure that represents dependencies between tasks in a workflow....

  • ...Finding an optimal execution plan is based on: (1) identifying different possible re-orderings of operators (tasks, activities) in a given SE and (2) estimating their execution costs, in the spirit of [11,12]....

Journal Article (DOI)
TL;DR: In this paper, the authors present the main dimensions of the relevant optimization problems and the types of optimizations that occur before flow execution and provide a concise overview of the existing approaches with a view to highlighting key observations and areas deserving more attention from the community.
Abstract: Workflow technology is rapidly evolving and, rather than being limited to modeling the control flow in business processes, is becoming a key mechanism to perform advanced data management, such as big data analytics. This survey focuses on data-centric workflows (or workflows for data analytics or data flows), where a key aspect is data passing through and getting manipulated by a sequence of steps. The large volume and variety of data, the complexity of operations performed, and the long time such workflows take to compute give rise to the need for optimization. In general, data-centric workflow optimization is a technology in evolution. This survey focuses on techniques applicable to workflows comprising arbitrary types of data manipulation steps and semantic inter-dependencies between such steps. Further, it serves a twofold purpose: firstly, to present the main dimensions of the relevant optimization problems and the types of optimizations that occur before flow execution and secondly, to provide a concise overview of the existing approaches with a view to highlighting key observations and areas deserving more attention from the community.

34 citations

Proceedings Article
01 Jan 2014
TL;DR: This paper proposes an optimization framework to choose a set of statistics to collect for a given workflow, using which the optimizer can estimate the cost of any alternative plan for the workflow, and experimentally demonstrates the effectiveness and efficiency of the proposed algorithms.
Abstract: Most of the ETL products in the market today provide tools for the design of ETL workflows, with very little or no support for optimization of such workflows. Optimization of ETL workflows poses several new challenges compared to traditional query optimization in database systems. There have been many attempts, both in industry and the research community, to support cost-based optimization techniques for ETL workflows, but with limited success. Non-availability of source statistics in ETL is one of the major challenges that precludes the use of a cost-based optimization strategy. However, the basic philosophy of ETL workflows, design once and execute repeatedly, allows interesting possibilities for determining the statistics of the input. In this paper, we propose a framework to determine various sets of statistics to collect for a given workflow, using which the optimizer can estimate the cost of any alternative plan for the workflow. The initial few runs of the workflow are used to collect the statistics, and future runs are optimized based on the learned statistics. Since there can be several alternative sets of statistics that are sufficient, we propose an optimization framework to choose a set of statistics that can be measured with the least overhead. We experimentally demonstrate the effectiveness and efficiency of the proposed algorithms.
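The learn-then-optimize loop this abstract describes can be sketched roughly as follows (hypothetical helper names and figures, not the paper's actual framework): early runs record per-operator selectivities, and later runs cost candidate plans from the learned averages.

```python
# Rough sketch of "design once, execute repeatedly": learn selectivities
# from the initial runs, then use them to cost alternative plans. Operator
# names, costs, and row counts are invented for illustration.

def record_run(stats, op, rows_in, rows_out):
    """After an execution, remember the observed selectivity of each operator."""
    stats.setdefault(op, []).append(rows_out / rows_in)

def estimated_cost(plan, stats, input_rows, per_tuple_cost):
    """Estimate a plan's cost by propagating learned average selectivities."""
    rows, total = input_rows, 0.0
    for op in plan:
        total += rows * per_tuple_cost[op]
        rows *= sum(stats[op]) / len(stats[op])   # learned avg selectivity
    return total

stats = {}
record_run(stats, "clean", 1000, 900)   # clean keeps 90% of rows
record_run(stats, "dedup", 900, 450)    # dedup keeps 50% of rows
cost = {"clean": 1.0, "dedup": 2.0}
print(estimated_cost(["clean", "dedup"], stats, 1000, cost))  # → 2800.0
```

With such estimates in hand, the optimizer can compare alternative orderings of the same workflow without fresh source statistics.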

29 citations


Cites background from "An Efficient Heuristic for Logical ..."

  • ...Most of the papers [17, 23, 16, 14, 21] focused on conceptual modeling of ETL optimization....

  • ...[14] touches upon efficient heuristics for logical optimization of the ETL workflows....

Posted Content
TL;DR: This survey focuses on data-centric workflows, where a key aspect is data passing through and getting manipulated by a sequence of steps, and techniques applicable to workflows comprising arbitrary types of data manipulation steps and semantic inter-dependencies between such steps.
Abstract: Workflow technology is rapidly evolving and, rather than being limited to modeling the control flow in business processes, is becoming a key mechanism to perform advanced data management, such as big data analytics. This survey focuses on data-centric workflows (or workflows for data analytics or data flows), where a key aspect is data passing through and getting manipulated by a sequence of steps. The large volume and variety of data, the complexity of operations performed, and the long time such workflows take to compute give rise to the need for optimization. In general, data-centric workflow optimization is a technology in evolution. This survey focuses on techniques applicable to workflows comprising arbitrary types of data manipulation steps and semantic inter-dependencies between such steps. Further, it serves a twofold purpose. Firstly, to present the main dimensions of the relevant optimization problems and the types of optimizations that occur before flow execution. Secondly, to provide a concise overview of the existing approaches with a view to highlighting key observations and areas deserving more attention from the community.

18 citations

Proceedings Article (DOI)
07 Nov 2014
TL;DR: It is demonstrated that the current state-of-the-art in flow optimizations needs to be extended and a promising direction for optimizing flows at the logical level is proposed, and more specifically, for deciding the sequence of flow tasks.
Abstract: Modern data analysis is increasingly employing data-intensive flows for processing very large volumes of data. As the data flows become more and more complex and operate in a highly dynamic environment, we argue that we need to resort to automated cost-based optimization solutions rather than relying on efficient designs by human experts. We further demonstrate that the current state-of-the-art in flow optimizations needs to be extended and we propose a promising direction for optimizing flows at the logical level, and more specifically, for deciding the sequence of flow tasks.

12 citations


Additional excerpts

  • ...The algorithm is presented in [12]....

References
Book
01 Jan 1989
TL;DR: Fundamentals of Database Systems combines clear explanations of theory and design, broad coverage of models and real systems, and excellent examples with up-to-date introductions to modern database technologies.
Abstract: From the Publisher: Fundamentals of Database Systems combines clear explanations of theory and design, broad coverage of models and real systems, and excellent examples with up-to-date introductions to modern database technologies. This edition is completely revised and updated, and reflects the latest trends in technological and application development. Professors Elmasri and Navathe focus on the relational model and include coverage of recent object-oriented developments. They also address advanced modeling and system enhancements in the areas of active databases, temporal and spatial databases, and multimedia information systems. This edition also surveys the latest application areas of data warehousing, data mining, web databases, digital libraries, GIS, and genome databases. New to the Third Edition:

  • Reorganized material on data modeling to clearly separate entity relationship modeling, extended entity relationship modeling, and object-oriented modeling
  • Expanded coverage of the object-oriented and object/relational approach to data management, including ODMG and SQL3
  • Uses examples from real database systems, including Oracle and Microsoft Access
  • Includes discussion of decision support applications of data warehousing and data mining, as well as emerging technologies of web databases, multimedia, and mobile databases
  • Covers advanced modeling in the areas of active, temporal, and spatial databases
  • Provides coverage of issues of physical database tuning
  • Discusses current database application areas of GIS, genome, and digital libraries

4,242 citations


"An Efficient Heuristic for Logical ..." refers methods in this paper

  • ...We have used a simple cost model described in [13]....

Book
01 Jan 1992
TL;DR: This Second Edition of Building the Data Warehouse is revised and expanded to include new techniques and applications of data warehouse technology and update existing topics to reflect the latest thinking.
Abstract: From the Publisher: The data warehouse solves the problem of getting information out of legacy systems quickly and efficiently. If designed and built right, data warehouses can provide significant freedom of access to data, thereby delivering enormous benefits to any organization. In this unique handbook, W. H. Inmon, "the father of the data warehouse," provides detailed discussion and analysis of all major issues related to the design and construction of the data warehouse, including granularity of data, partitioning data, metadata, lack of creditability of decision support systems (DSS) data, the system of record, migration and more. This Second Edition of Building the Data Warehouse is revised and expanded to include new techniques and applications of data warehouse technology and update existing topics to reflect the latest thinking. It includes a useful review checklist to help evaluate the effectiveness of the design.

2,820 citations

Proceedings Article (DOI)
08 Nov 2002
TL;DR: The proposed conceptual model is customized for the tracing of inter-attribute relationships and the respective ETL activities in the early stages of a data warehouse project and constructed in a customizable and extensible manner, so that the designer can enrich it with his own re-occurring patterns forETL activities.
Abstract: Extraction-Transformation-Loading (ETL) tools are pieces of software responsible for the extraction of data from several sources, their cleansing, customization and insertion into a data warehouse. In this paper, we focus on the problem of the definition of ETL activities and provide formal foundations for their conceptual representation. The proposed conceptual model is (a) customized for the tracing of inter-attribute relationships and the respective ETL activities in the early stages of a data warehouse project; (b) enriched with a 'palette' of a set of frequently used ETL activities, like the assignment of surrogate keys, the check for null values, etc; and (c) constructed in a customizable and extensible manner, so that the designer can enrich it with his own re-occurring patterns for ETL activities.

393 citations


"An Efficient Heuristic for Logical ..." refers background in this paper

  • ...At the Conceptual level [2,3] details like source schema, target schema and the relationships between their attributes are defined....

Book
01 Mar 2002
TL;DR: This third edition of W. H. Inmon's Building the Data Warehouse is completely revised to reflect exciting new techniques and applications, update existing topics, and examine data marts, operational data stores, and the corporate information factory.
Abstract: From the Publisher: Since it was first published in 1990, W. H. Inmon's Building the Data Warehouse has become the bible of data warehousing -- the first and best introduction to the subject. A lot has changed in data warehousing technology since the last edition appeared in 1996, and this latest volume is completely revised to reflect exciting new techniques and applications, update existing topics, and examine data marts, operational data stores, and the corporate information factory. In this Third Edition, Inmon explains what a data warehouse is (and isn't), why it's needed, how it works, and how the traditional data warehouse can be integrated with new technologies, including the Web, to provide enhanced customer service and support. He also addresses the trade-offs between normalized data warehouses and dimensional data marts.

353 citations


"An Efficient Heuristic for Logical ..." refers background in this paper

  • ...A Data Warehouse (DW) [1] plays an important role in decision making for any organization....

Proceedings Article (DOI)
05 Apr 2005
TL;DR: This paper delves into the logical optimization of ETL processes, modeling it as a state-space search problem and provides algorithms towards the minimization of the execution cost of an ETL workflow.
Abstract: Extraction-transformation-loading (ETL) tools are pieces of software responsible for the extraction of data from several sources, their cleansing, customization and insertion into a data warehouse. Usually, these processes must be completed in a certain time window; thus, it is necessary to optimize their execution time. In this paper, we delve into the logical optimization of ETL processes, modeling it as a state-space search problem. We consider each ETL workflow as a state and fabricate the state space through a set of correct state transitions. Moreover, we provide algorithms towards the minimization of the execution cost of an ETL workflow.
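In spirit, the state-space approach treats each ordering of the workflow's activities as a state and explores transitions such as dependency-respecting swaps of adjacent activities, keeping the cheapest state found. A minimal sketch under those assumptions (hypothetical cost model and data, not the authors' actual transition set):

```python
# Toy state-space search over orderings of a linear flow. States are
# orderings; a transition swaps two adjacent activities when no dependency
# forbids it. Costs and selectivities below are invented for illustration.

def flow_cost(order, cost, sel, input_rows):
    rows, total = input_rows, 0.0
    for a in order:
        total += rows * cost[a]
        rows *= sel[a]
    return total

def swap_ok(a, b, deps):
    # deps maps an activity to the set of activities it depends on.
    return a not in deps.get(b, set()) and b not in deps.get(a, set())

def search(initial, cost, sel, deps, input_rows):
    best, frontier, seen = initial, [initial], {tuple(initial)}
    while frontier:
        state = frontier.pop()
        if flow_cost(state, cost, sel, input_rows) < flow_cost(best, cost, sel, input_rows):
            best = state
        for i in range(len(state) - 1):
            if swap_ok(state[i], state[i + 1], deps):
                nxt = state[:i] + [state[i + 1], state[i]] + state[i + 2:]
                if tuple(nxt) not in seen:
                    seen.add(tuple(nxt))
                    frontier.append(nxt)
    return best

order = search(["a", "b"], {"a": 5.0, "b": 1.0}, {"a": 1.0, "b": 0.1}, {}, 1000)
print(order)  # the cheap, highly selective "b" moves first: ['b', 'a']
```

Enumerating states this way is exhaustive over the reachable orderings, which is exactly why the heuristic approach of the main paper avoids a full state-space search.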

218 citations


Additional excerpts

  • ...The State Space based algorithms described in [14,8] consider only the converging linear groups (i....
