
Showing papers on "Data warehouse published in 2008"


Proceedings ArticleDOI
09 Jun 2008
TL;DR: It is concluded that while it is not impossible for a row-store to achieve some of the performance advantages of a column-store, changes must be made to both the storage layer and the query executor to fully obtain the benefits of a column-oriented approach.
Abstract: There has been a significant amount of excitement and recent work on column-oriented database systems ("column-stores"). These database systems have been shown to perform more than an order of magnitude better than traditional row-oriented database systems ("row-stores") on analytical workloads such as those found in data warehouses, decision support, and business intelligence applications. The elevator pitch behind this performance difference is straightforward: column-stores are more I/O efficient for read-only queries since they only have to read from disk (or from memory) those attributes accessed by a query. This simplistic view leads to the assumption that one can obtain the performance benefits of a column-store using a row-store: either by vertically partitioning the schema, or by indexing every column so that columns can be accessed independently. In this paper, we demonstrate that this assumption is false. We compare the performance of a commercial row-store under a variety of different configurations with a column-store and show that the row-store performance is significantly slower on a recently proposed data warehouse benchmark. We then analyze the performance difference and show that there are some important differences between the two systems at the query executor level (in addition to the obvious differences at the storage layer level). Using the column-store, we then tease apart these differences, demonstrating the impact on performance of a variety of column-oriented query execution techniques, including vectorized query processing, compression, and a new join algorithm we introduce in this paper. We conclude that while it is not impossible for a row-store to achieve some of the performance advantages of a column-store, changes must be made to both the storage layer and the query executor to fully obtain the benefits of a column-oriented approach.

526 citations
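To make the I/O argument in the abstract above concrete, here is a small, hedged illustration (not taken from the paper, and with purely assumed table sizes): when a query touches only a few attributes, a columnar layout lets the scan read just those columns, while a row layout reads every attribute of every tuple.

```python
# Toy comparison of bytes scanned by a row layout vs. a column layout
# for a query that reads only 2 of 10 fixed-width attributes.
# Sizes are illustrative assumptions, not measurements from the paper.

NUM_ROWS = 1_000_000
NUM_COLS = 10
BYTES_PER_VALUE = 8  # assume fixed-width 8-byte attributes

def row_store_bytes_scanned() -> int:
    # A row store reads whole tuples even if only a few columns are needed.
    return NUM_ROWS * NUM_COLS * BYTES_PER_VALUE

def column_store_bytes_scanned(cols_needed: int) -> int:
    # A column store reads only the referenced columns.
    return NUM_ROWS * cols_needed * BYTES_PER_VALUE

if __name__ == "__main__":
    needed = 2  # e.g. SELECT SUM(price) ... GROUP BY region
    rs = row_store_bytes_scanned()
    cs = column_store_bytes_scanned(needed)
    print(f"row store   : {rs / 1e6:.0f} MB scanned")
    print(f"column store: {cs / 1e6:.0f} MB scanned")
    print(f"ratio: {rs / cs:.1f}x less I/O for the column layout")
```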


Journal ArticleDOI
TL;DR: Three types of approaches for tackling the respective challenges are distinguished, mapped to a three-layer BI framework, and discussed regarding challenges and business potential.
Abstract: In the course of the evolution of management support towards corporate-wide Business Intelligence infrastructures, the integration of components for handling unstructured data comes into focus. In this paper, three types of approaches for tackling the respective challenges are distinguished. The approaches are mapped to a three-layer BI framework and discussed regarding challenges and business potential. The application of the framework is exemplified for the domains of Competitive Intelligence and Customer Relationship Management.

348 citations


Proceedings ArticleDOI
09 Jun 2008
TL;DR: This paper describes the first completely self-configuring data integration system, based on the new concept of a probabilistic mediated schema that is automatically created from the data sources, and shows that it is able to produce high-quality answers with no human intervention.
Abstract: Data integration systems offer a uniform interface to a set of data sources. Despite recent progress, setting up and maintaining a data integration application still requires significant upfront effort of creating a mediated schema and semantic mappings from the data sources to the mediated schema. Many application contexts involving multiple data sources (e.g., the web, personal information management, enterprise intranets) do not require full integration in order to provide useful services, motivating a pay-as-you-go approach to integration. With that approach, a system starts with very few (or inaccurate) semantic mappings and these mappings are improved over time as deemed necessary. This paper describes the first completely self-configuring data integration system. The goal of our work is to investigate how advanced a starting point we can provide for a pay-as-you-go system. Our system is based on the new concept of a probabilistic mediated schema that is automatically created from the data sources. We automatically create probabilistic schema mappings between the sources and the mediated schema. We describe experiments in multiple domains, including 50-800 data sources, and show that our system is able to produce high-quality answers with no human intervention.

273 citations
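As a rough, hypothetical sketch of the probabilistic-mapping idea summarized above (not the authors' system; the attribute names, probabilities, and scoring rule are invented): each candidate mapping from a source attribute to a mediated-schema attribute carries a probability, and answers are scored by the best-scoring mapping that produced them.

```python
# Hypothetical sketch: probabilistic schema mappings score query answers.
# Attribute names, probabilities, and data are invented for illustration.

from collections import defaultdict

# Candidate mappings: source attribute -> [(mediated attribute, probability)]
mappings = {
    "src1.phone_no":  [("contact_phone", 0.9)],
    "src1.cell":      [("contact_phone", 0.6), ("fax", 0.4)],
    "src2.telephone": [("contact_phone", 0.8)],
}

# Source tuples keyed by source attribute.
source_data = {
    "src1.phone_no":  ["555-0100"],
    "src1.cell":      ["555-0101"],
    "src2.telephone": ["555-0100"],
}

def answer(mediated_attr: str) -> list[tuple[str, float]]:
    """Return values for a mediated attribute, scored by mapping probability.
    A value reachable through several mappings keeps its best score."""
    scores: dict[str, float] = defaultdict(float)
    for src_attr, candidates in mappings.items():
        for target, prob in candidates:
            if target == mediated_attr:
                for value in source_data.get(src_attr, []):
                    scores[value] = max(scores[value], prob)
    return sorted(scores.items(), key=lambda kv: -kv[1])

if __name__ == "__main__":
    for value, score in answer("contact_phone"):
        print(f"{value}\t(confidence {score:.1f})")
```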


Book
15 Jan 2008
TL;DR: This book serves as an introduction to the state of the art on data warehouse design, with many references to more detailed sources, and may help experienced data warehouse designers to enlarge their analysis possibilities by incorporating spatial and temporal information.
Abstract: A data warehouse stores large volumes of historical data required for analytical purposes. This data is extracted from operational databases; transformed into a coherent whole using a multidimensional model that includes measures, dimensions, and hierarchies; and loaded into a data warehouse during the extraction-transformation-loading (ETL) process. Malinowski and Zimányi explain in detail conventional data warehouse design, covering in particular complex hierarchy modeling. Additionally, they address two innovative domains recently introduced to extend the capabilities of data warehouse systems, namely the management of spatial and temporal information. Their presentation covers different phases of the design process, such as requirements specification, conceptual, logical, and physical design. They include three different approaches for requirements specification depending on whether users, operational data sources, or both are the driving force in the requirements gathering process, and they show how each approach leads to the creation of a conceptual multidimensional model. Throughout the book the concepts are illustrated using many real-world examples and completed by sample implementations for Microsoft's Analysis Services 2005 and Oracle 10g with the OLAP and the Spatial extensions. For researchers this book serves as an introduction to the state of the art on data warehouse design, with many references to more detailed sources. Providing a clear and a concise presentation of the major concepts and results of data warehouse design, it can also be used as the basis of a graduate or advanced undergraduate course. The book may help experienced data warehouse designers to enlarge their analysis possibilities by incorporating spatial and temporal information. Finally, experts in spatial databases or in geographical information systems could benefit from the data warehouse vision for building innovative spatial analytical applications.

223 citations


Journal ArticleDOI
01 Apr 2008
TL;DR: This paper proposes GRAnD, a goal-oriented approach to requirement analysis for data warehouses based on the Tropos methodology, which can be employed within both a demand-driven and a mixed supply/demand-driven design framework.
Abstract: Several surveys indicate that a significant percentage of data warehouses fail to meet business objectives or are outright failures. One of the reasons for this is that requirement analysis is typically overlooked in real projects. In this paper we propose GRAnD, a goal-oriented approach to requirement analysis for data warehouses based on the Tropos methodology. Two different perspectives are integrated for requirement analysis: organizational modeling, centered on stakeholders, and decisional modeling, focused on decision makers. Our approach can be employed within both a demand-driven and a mixed supply/demand-driven design framework.

215 citations


Book
28 Sep 2008
TL;DR: Master Data Management equips you with a deeply practical, business-focused way of thinking about MDM, an understanding that will greatly enhance your ability to communicate with stakeholders and win their support.
Abstract: The key to a successful MDM initiative isn't technology or methods, it's people: the stakeholders in the organization and their complex ownership of the data that the initiative will affect. Master Data Management equips you with a deeply practical, business-focused way of thinking about MDM, an understanding that will greatly enhance your ability to communicate with stakeholders and win their support. Moreover, it will help you deserve their support: you'll master all the details involved in planning and executing an MDM project that leads to measurable improvements in business productivity and effectiveness.
* Presents a comprehensive roadmap that you can adapt to any MDM project.
* Emphasizes the critical goal of maintaining and improving data quality.
* Provides guidelines for determining which data to master.
* Examines special issues relating to master data metadata.
* Considers a range of MDM architectural styles.
* Covers the synchronization of master data across the application infrastructure.

190 citations


Book
09 Jul 2008
TL;DR: This book describes the future of data warehousing that is technologically possible now, at both an architectural and a technology level, and gives the experienced data warehouse professional exactly what is needed in order to implement the new-generation DW 2.0.
Abstract: Data warehousing has been around for 20 years and has become part of the information technology infrastructure. Data warehousing originally grew in response to the corporate need for information--not data--and it supplies integrated, granular, and historical data to the corporation. There are many kinds of data warehouses, in large part due to evolution and different paths of software and hardware vendors. But DW 2.0, defined by this author in many talks, articles, and his b-eye-network newsletter that reaches 65,000 professionals monthly, is the well-identified and defined next-generation data warehouse. The book carries that theme and describes the future of data warehousing that is technologically possible now, at both an architectural level and a technology level. The perspective of the book is from the top down: looking at the overall architecture and then delving into the issues underlying the components. The benefit for people who are building or using a data warehouse is that they can see what lies ahead and can determine what new technology to buy, how to plan extensions to the data warehouse, what can be salvaged from the current system, and how to justify the expense--at the most practical level. All of this gives the experienced data warehouse professional exactly what is needed in order to implement the new-generation DW 2.0.
* First book on the new generation of data warehouse architecture, DW 2.0.
* Written by the "father of the data warehouse", Bill Inmon, a columnist and newsletter editor of The Bill Inmon Channel on the Business Intelligence Network.
* Long overdue comprehensive coverage of the implementation of technology and tools that enable the new generation of the DW: metadata, temporal data, ETL, unstructured data, and data quality control.

167 citations


Journal ArticleDOI
TL;DR: The paper addresses the application of information retrieval technology in a DW to exploit text-rich document collections and introduces the problem of dealing with semi-structured data in a DW.
Abstract: This paper surveys the most relevant research on combining Data Warehouse (DW) and Web data. It studies the XML technologies that are currently being used to integrate, store, query and retrieve web data, and their application to DWs. The paper reviews different DW distributed architectures and the use of XML languages as an integration tool in these systems. It also introduces the problem of dealing with semi-structured data in a DW. It studies Web data repositories, the design of multidimensional databases for XML data sources and the XML extensions of On-Line Analytical Processing techniques. The paper addresses the application of information retrieval technology in a DW to exploit text-rich document collections. The authors hope that the paper will help to discover the main limitations and opportunities offered by the combination of the DW and Web fields, as well as to identify open research lines.

160 citations


Journal ArticleDOI
01 Apr 2008
TL;DR: This paper describes how to build the different MDA models for the DW repository by using an extension of the Unified Modeling Language (UML) and the Common Warehouse Metamodel (CWM).
Abstract: Different modeling approaches have been proposed to overcome every design pitfall of different data warehouse (DW) components. However, most of them offer partial solutions that deal only with isolated aspects of the DW and do not provide developers with an integrated and standard framework for designing all DW relevant components, such as ETL processes, data sources, DW repository and so on. To overcome this problem, this paper describes how to align the whole DW development process with a Model Driven Architecture (MDA) framework. We then focus on describing one part of it: an MDA approach for the development of the DW repository, because it is the cornerstone of any DW system. Therefore, we describe how to build the different MDA models for the DW repository by using an extension of the Unified Modeling Language (UML) and the Common Warehouse Metamodel (CWM). Transformations between models are also clearly and formally established by using the Query/View/Transformation (QVT) language. Finally, a case study is provided to exemplify the benefits of our MDA framework.

157 citations


Journal ArticleDOI
01 Aug 2008
TL;DR: Additional benefits resulting from the Knowledge Grid for compressed, column-oriented databases are demonstrated, including assistance in query optimization and execution by minimizing the need for data reads and data decompression.
Abstract: Brighthouse is a column-oriented data warehouse with an automatically tuned, ultra-small-overhead metadata layer called Knowledge Grid, which is used as an alternative to classical indexes. The advantages of column-oriented data storage, as well as data compression, have already been well documented, especially in the context of analytic, decision support querying. This paper demonstrates additional benefits resulting from the Knowledge Grid for compressed, column-oriented databases. In particular, we explain how it assists in query optimization and execution, by minimizing the need for data reads and data decompression.

149 citations
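A minimal sketch of the kind of rough, per-pack metadata the abstract describes, written from scratch rather than taken from Brighthouse (pack size and data are assumptions): keeping min/max statistics for each block of a column lets a range filter skip blocks that cannot contain matching rows, avoiding both the read and the decompression.

```python
# Sketch of a "knowledge grid"-style rough filter: per-pack min/max metadata
# decides which packs must be read at all. Pack size and data are assumptions.

PACK_SIZE = 4  # tiny packs for illustration; real systems use far larger ones

def build_pack_stats(column: list[int]) -> list[tuple[int, int]]:
    """Compute (min, max) for each fixed-size pack of the column."""
    stats = []
    for i in range(0, len(column), PACK_SIZE):
        pack = column[i:i + PACK_SIZE]
        stats.append((min(pack), max(pack)))
    return stats

def scan_with_stats(column, stats, lo, hi):
    """Return values in [lo, hi], touching only packs whose range overlaps."""
    hits, packs_read = [], 0
    for p, (pmin, pmax) in enumerate(stats):
        if pmax < lo or pmin > hi:
            continue  # irrelevant pack: skipped without read or decompression
        packs_read += 1
        start = p * PACK_SIZE
        hits.extend(v for v in column[start:start + PACK_SIZE] if lo <= v <= hi)
    return hits, packs_read

if __name__ == "__main__":
    col = [1, 2, 3, 4, 50, 51, 52, 53, 7, 8, 9, 10]
    stats = build_pack_stats(col)
    values, packs_read = scan_with_stats(col, stats, 50, 60)
    print(values, f"- read {packs_read} of {len(stats)} packs")
```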



Proceedings ArticleDOI
15 Dec 2008
TL;DR: This paper proposes a text-cube model on a multidimensional text database, conducts systematic studies on efficient text-cube implementation, OLAP execution, and query processing, and shows the high promise of the methods.
Abstract: Since Jim Gray introduced the concept of "data cube" in 1997, the data cube, associated with online analytical processing (OLAP), has become a driving engine in the data warehouse industry. Because the boom of the Internet has given rise to an ever increasing amount of text data associated with other multidimensional information, it is natural to propose a data cube model that integrates the power of traditional OLAP and IR techniques for text. In this paper, we propose a text-cube model on a multidimensional text database and study effective OLAP over such data. Two kinds of hierarchies are distinguishable inside: the dimensional hierarchy and the term hierarchy. By incorporating these hierarchies, we conduct systematic studies on efficient text-cube implementation, OLAP execution and query processing. Our performance study shows the high promise of our methods.
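A toy sketch of the text-cube idea as I read the abstract (my own simplification, with invented dimensions and documents): each cell groups documents by dimension values and aggregates a term-frequency measure, which can then be rolled up by dropping dimensions from the group-by.

```python
# Toy text-cube sketch: documents carry dimension values plus text;
# a cell aggregates term frequencies for a group-by over chosen dimensions.
# Schema and data are invented for illustration.

from collections import Counter, defaultdict

docs = [
    {"location": "NY", "year": 2007, "text": "flight delay weather delay"},
    {"location": "NY", "year": 2008, "text": "baggage delay service"},
    {"location": "LA", "year": 2008, "text": "weather fog delay"},
]

def text_cube_cells(documents, group_by):
    """Aggregate term frequencies per cell defined by the group_by dimensions."""
    cells = defaultdict(Counter)
    for doc in documents:
        key = tuple(doc[d] for d in group_by)
        cells[key].update(doc["text"].split())
    return cells

if __name__ == "__main__":
    # Cuboid on (location); rolling up means removing a dimension from group_by.
    for cell, terms in text_cube_cells(docs, ["location"]).items():
        print(cell, terms.most_common(2))
```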

Proceedings ArticleDOI
09 Jun 2008
TL;DR: This work proposes a new join geography called PRPD (Partial Redistribution & Partial Duplication) to improve the performance and scalability of parallel joins in the presence of data skew in a shared-nothing system.
Abstract: Parallel processing continues to be important in large data warehouses. The processing requirements continue to expand in multiple dimensions. These include greater volumes, increasing number of concurrent users, more complex queries, and more applications which define complex logical, semantic, and physical data models. Shared nothing parallel database management systems [16] can scale up "horizontally" by adding more nodes. Most parallel algorithms, however, do not take into account data skew. Data skew occurs naturally in many applications. A query processing skewed data not only slows down its response time, but generates hot nodes, which become a bottleneck throttling the overall system performance. Motivated by real business problems, we propose a new join geography called PRPD (Partial Redistribution & Partial Duplication) to improve the performance and scalability of parallel joins in the presence of data skew in a shared-nothing system. Our experimental results show that PRPD significantly speeds up query elapsed time in the presence of data skew. Our experience shows that eliminating system bottlenecks caused by data skew improves the throughput of the whole system which is important in parallel data warehouses that often run high concurrency workloads.
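A highly simplified, hypothetical sketch of the partial-redistribution / partial-duplication idea (not the paper's implementation; the node count, skew list, and rows are invented): rows of one input carrying skewed join values stay on their local node, matching rows of the other input are duplicated to every node, and everything else is hash-redistributed as usual.

```python
# Sketch of PRPD-style routing for a parallel hash join on "key".
# Node layout, skew detection, and data are illustrative assumptions.

NUM_NODES = 3
SKEWED_KEYS = {"popular"}  # would normally come from statistics

def hash_node(key) -> int:
    return hash(key) % NUM_NODES

def route_prpd(r_rows, s_rows):
    """Return per-node buckets for both join inputs.
    R rows with skewed keys stay on their origin node; matching S rows are
    duplicated to all nodes; all other rows are hash-redistributed."""
    r_at = [[] for _ in range(NUM_NODES)]
    s_at = [[] for _ in range(NUM_NODES)]
    for node, key, payload in r_rows:            # (origin node, key, payload)
        target = node if key in SKEWED_KEYS else hash_node(key)
        r_at[target].append((key, payload))
    for node, key, payload in s_rows:
        if key in SKEWED_KEYS:
            for n in range(NUM_NODES):            # partial duplication
                s_at[n].append((key, payload))
        else:
            s_at[hash_node(key)].append((key, payload))
    return r_at, s_at

if __name__ == "__main__":
    r = [(0, "popular", "r1"), (1, "popular", "r2"), (2, "rare", "r3")]
    s = [(0, "popular", "s1"), (1, "rare", "s2")]
    r_at, s_at = route_prpd(r, s)
    for n in range(NUM_NODES):
        print(f"node {n}: R={r_at[n]} S={s_at[n]}")
```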

Book
07 Jan 2008
TL;DR: This book describes how to build a data warehouse completely from scratch and shows practical examples of how to do it, as well as some practical issues the author has experienced that developers are likely to encounter in their first data warehousing project, along with solutions and advice.
Abstract: Building a Data Warehouse: With Examples in SQL Server describes how to build a data warehouse completely from scratch and shows practical examples of how to do it. Author Vincent Rainardi also describes some practical issues he has experienced that developers are likely to encounter in their first data warehousing project, along with solutions and advice. The RDBMS used in the examples is SQL Server; the version will not be an issue as long as the user has SQL Server 2005 or later. The book is organized as follows. In the beginning of this book (Chapters 1 through 6), you learn how to build a data warehouse, for example, defining the architecture, understanding the methodology, gathering the requirements, designing the data models, and creating the databases. Then in Chapters 7 through 10, you learn how to populate the data warehouse, for example, extracting from source systems, loading the data stores, maintaining data quality, and utilizing the metadata. After you populate the data warehouse, in Chapters 11 through 15, you explore how to present data to users using reports and multidimensional databases and how to use the data in the data warehouse for business intelligence, customer relationship management, and other purposes. Chapters 16 and 17 wrap up the book: after you have built your data warehouse, before it can be released to production, you need to test it thoroughly, and after your application is in production, you need to understand how to administer data warehouse operation. What you'll learn:
* A detailed understanding of what it takes to build a data warehouse
* The implementation code in SQL Server to build the data warehouse
* Dimensional modeling, data extraction methods, data warehouse loading, populating dimension and fact tables, data quality, data warehouse architecture, and database design
* Practical data warehousing applications such as business intelligence reports, analytics applications, and customer relationship management
Who is this book for? There are three audiences for the book. The first is the people who implement the data warehouse; this could be considered a field guide for them. The second is database users/admins who want to get a good understanding of what it would take to build a data warehouse. Finally, the third audience is managers who must make decisions about aspects of the data warehousing task before them and use the book to learn about these issues.

Book
05 Jun 2008
TL;DR: This book systematically introduces MDM key concepts and technical themes, explains its business case, and illuminates how it interrelates with and enables SOA.
Abstract: The only complete technical primer for MDM planners, architects, and implementers. Companies moving toward flexible SOA architectures often face difficult information management and integration challenges. The master data they rely on is often stored and managed in ways that are redundant, inconsistent, inaccessible, non-standardized, and poorly governed. Using Master Data Management (MDM), organizations can regain control of their master data, improve corresponding business processes, and maximize its value in SOA environments. Enterprise Master Data Management provides an authoritative, vendor-independent MDM technical reference for practitioners: architects, technical analysts, consultants, solution designers, and senior IT decision makers. Written by the IBM data management innovators who are pioneering MDM, this book systematically introduces MDM's key concepts and technical themes, explains its business case, and illuminates how it interrelates with and enables SOA. Drawing on their experience with cutting-edge projects, the authors introduce MDM patterns, blueprints, solutions, and best practices published nowhere else: everything you need to establish a consistent, manageable set of master data, and use it for competitive advantage. Coverage includes:
* How MDM and SOA complement each other
* Using the MDM Reference Architecture to position and design MDM solutions within an enterprise
* Assessing the value and risks to master data and applying the right security controls
* Using PIM-MDM and CDI-MDM Solution Blueprints to address industry-specific information management challenges
* Explaining MDM patterns as enablers to accelerate consistent MDM deployments
* Incorporating MDM solutions into existing IT landscapes via MDM Integration Blueprints
* Leveraging master data as an enterprise asset, bringing people, processes, and technology together with MDM and data governance
* Best practices in MDM deployment, including data warehouse and SAP integration

Proceedings ArticleDOI
02 Jun 2008
TL;DR: This paper uses a machine learning approach that takes the query plan, combines it with the observed load vector of the system and uses the new vector to predict the execution time of the query.
Abstract: Modern enterprise data warehouses have complex workloads that are notoriously difficult to manage. One of the key pieces to managing workloads is an estimate of how long a query will take to execute. An accurate estimate of this query execution time is critical to self managing Enterprise Class Data Warehouses. In this paper we study the problem of predicting the execution time of a query on a loaded data warehouse with a dynamically changing workload. We use a machine learning approach that takes the query plan, combines it with the observed load vector of the system and uses the new vector to predict the execution time of the query. The predictions are made as time ranges. We validate our solution using real databases and real workloads. We show experimentally that our machine learning approach works well. This technology is slated for incorporation into a commercial, enterprise class DBMS.
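A hedged sketch of the general approach described above; the paper's actual features, model choice, and time buckets are not reproduced here, so the feature vector, the gradient-boosted regressor, and the bucket edges below are assumptions.

```python
# Sketch: predict query elapsed-time ranges from plan features + load vector.
# Feature names, the model choice, and the bucket edges are assumptions.

from sklearn.ensemble import GradientBoostingRegressor
import numpy as np

# Each row: [est_rows, est_cost, num_joins, cpu_load, io_load, mem_load]
X_train = np.array([
    [1e4,  12.0, 1, 0.2, 0.1, 0.3],
    [1e6, 450.0, 3, 0.7, 0.6, 0.5],
    [5e5, 200.0, 2, 0.4, 0.3, 0.4],
    [2e6, 900.0, 4, 0.9, 0.8, 0.7],
])
y_train = np.array([1.5, 95.0, 40.0, 260.0])  # observed elapsed seconds

BUCKETS = [(0, 10), (10, 60), (60, 300), (300, float("inf"))]

def to_range(seconds: float) -> tuple[float, float]:
    """Map a point prediction onto a coarse time range."""
    for lo, hi in BUCKETS:
        if lo <= seconds < hi:
            return (lo, hi)
    return BUCKETS[-1]

model = GradientBoostingRegressor().fit(X_train, y_train)

if __name__ == "__main__":
    # A new query: plan features combined with the current load vector.
    plan_and_load = np.array([[8e5, 300.0, 3, 0.6, 0.5, 0.4]])
    predicted = float(model.predict(plan_and_load)[0])
    print("predicted range (s):", to_range(predicted))
```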

Journal ArticleDOI
01 Aug 2008
TL;DR: A system with which a non-expert user can author new query templates and Web forms, to be reused by anyone with related information needs, and which learns to assign costs to sources and associations according to the user's specific information need, thereby changing the ranking of the queries used to generate results.
Abstract: The number of potentially-related data resources available for querying --- databases, data warehouses, virtual integrated schemas --- continues to grow rapidly. Perhaps no area has seen this problem as acutely as the life sciences, where hundreds of large, complex, interlinked data resources are available on fields like proteomics, genomics, disease studies, and pharmacology. The schemas of individual databases are often large on their own, but users also need to pose queries across multiple sources, exploiting foreign keys and schema mappings. Since the users are not experts, they typically rely on the existence of pre-defined Web forms and associated query templates, developed by programmers to meet the particular scientists' needs. Unfortunately, such forms are scarce commodities, often limited to a single database, and mismatched with biologists' information needs that are often context-sensitive and span multiple databases. We present a system with which a non-expert user can author new query templates and Web forms, to be reused by anyone with related information needs. The user poses keyword queries that are matched against source relations and their attributes; the system uses sequences of associations (e.g., foreign keys, links, schema mappings, synonyms, and taxonomies) to create multiple ranked queries linking the matches to keywords; the set of queries is attached to a Web query form. Now the user and his or her associates may pose specific queries by filling in parameters in the form. Importantly, the answers to this query are ranked and annotated with data provenance, and the user provides feedback on the utility of the answers, from which the system ultimately learns to assign costs to sources and associations according to the user's specific information need, as a result changing the ranking of the queries used to generate results. We evaluate the effectiveness of our method against "gold standard" costs from domain experts and demonstrate the method's scalability.

Patent
07 Jan 2008
TL;DR: In this article, a data mining algorithm is applied to the collected data requests to predict a set of data that is likely to be requested during an upcoming time period, and it is determined whether the complete set of predicted data exists in the data cache.
Abstract: Methods and apparatus, including computer program products, implementing and using techniques for populating a data cache on a server. Data requests received by the server are collected in a repository. A data mining algorithm is applied to the collected data requests to predict a set of data that is likely to be requested during an upcoming time period. It is determined whether the complete set of predicted data exists in the data cache. If the complete set of predicted data does not exist in the data cache, the missing data is retrieved from a database and added to the data cache.
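Not the patented method itself, but a toy sketch of the workflow the abstract describes, with a simple per-hour frequency count standing in for the data mining step and invented keys and tables: predict the items likely to be requested in the upcoming period and pre-load whatever is missing from the cache.

```python
# Toy sketch of predictive cache population: a frequency-based stand-in for
# the data mining step predicts next-period requests; missing entries are
# fetched from the database into the cache. All names are hypothetical.

from collections import Counter

request_log = [  # (hour_of_day, requested_key)
    (9, "sales_eu"), (9, "sales_us"), (9, "sales_eu"),
    (10, "inventory"), (9, "sales_eu"),
]

database = {"sales_eu": "...", "sales_us": "...", "inventory": "..."}
cache: dict[str, str] = {"sales_us": "..."}

def predict_requests(log, upcoming_hour, top_n=2):
    """Tiny stand-in for the mining step: most frequent keys for that hour."""
    counts = Counter(key for hour, key in log if hour == upcoming_hour)
    return [key for key, _ in counts.most_common(top_n)]

def populate_cache(upcoming_hour):
    for key in predict_requests(request_log, upcoming_hour):
        if key not in cache:              # only the missing part is fetched
            cache[key] = database[key]    # retrieve from the database

if __name__ == "__main__":
    populate_cache(9)
    print(sorted(cache))  # sales_eu has been pre-loaded alongside sales_us
```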

Book ChapterDOI
24 Aug 2008
TL;DR: It is often not desirable to maintain data at the level of detail required for operational reporting in the data warehouse, due to both the exploding size of the warehouse and the required update frequency, while using an ODS as the source for operational reporting exhibits a similar information latency to informational reporting.
Abstract: Operational reporting differs from informational reporting in that its scope is on day-to-day operations and thus requires data at the level of detail of individual transactions. It is often not desirable to maintain data at such a detailed level in the data warehouse, due to both the exploding size of the warehouse and the update frequency required for operational reports. Using an ODS as the source for operational reporting exhibits a similar information latency.

Journal ArticleDOI
TL;DR: A specialized join algorithm, termed mesh join (MESHJOIN), is proposed, which compensates for the difference in the access cost of the two join inputs by 1) relying entirely on fast sequential scans of R and 2) sharing the I/O cost of accessing R across multiple tuples of S.
Abstract: Active data warehousing has emerged as an alternative to conventional warehousing practices in order to meet the high demand of applications for up-to-date information. In a nutshell, an active warehouse is refreshed online and thus achieves a higher consistency between the stored information and the latest data updates. The need for online warehouse refreshment introduces several challenges in the implementation of data warehouse transformations, with respect to their execution time and their overhead to the warehouse processes. In this paper, we focus on a frequently encountered operation in this context, namely, the join of a fast stream S of source updates with a disk-based relation R, under the constraint of limited memory. This operation lies at the core of several common transformations such as surrogate key assignment, duplicate detection, or identification of newly inserted tuples. We propose a specialized join algorithm, termed mesh join (MESHJOIN), which compensates for the difference in the access cost of the two join inputs by 1) relying entirely on fast sequential scans of R and 2) sharing the I/O cost of accessing R across multiple tuples of S. We detail the MESHJOIN algorithm and develop a systematic cost model that enables the tuning of MESHJOIN for two objectives: maximizing throughput under a specific memory budget or minimizing memory consumption for a specific throughput. We present an experimental study that validates the performance of MESHJOIN on synthetic and real-life data. Our results verify the scalability of MESHJOIN to fast streams and large relations and demonstrate its numerous advantages over existing join algorithms.
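A compact, simplified sketch of the MESHJOIN access pattern summarized above (the memory budgeting and cost model are omitted, the probing is a plain nested loop rather than a hash lookup, and page size and data are assumptions): stream tuples are admitted in batches while R is scanned cyclically in pages; each batch is probed against every page, and a batch expires once it has met one full cycle of R.

```python
# Simplified MESHJOIN-style sketch: join a stream S with a large relation R
# using only sequential, cyclic scans of R. Page size and data are assumptions.

from collections import deque

R = [("k1", "dim1"), ("k2", "dim2"), ("k3", "dim3"), ("k4", "dim4")]
PAGE_SIZE = 2
NUM_PAGES = (len(R) + PAGE_SIZE - 1) // PAGE_SIZE

def r_pages():
    """Endless cyclic, sequential scan of R in pages."""
    while True:
        for p in range(NUM_PAGES):
            yield R[p * PAGE_SIZE:(p + 1) * PAGE_SIZE]

def meshjoin(stream_batches):
    """Each iteration admits one stream batch and reads one page of R.
    A batch expires once it has been probed against a full cycle of R."""
    pages = r_pages()
    in_memory = deque()              # entries: [batch, pages_seen]
    batches = iter(stream_batches)
    while True:
        batch = next(batches, None)
        if batch is not None:
            in_memory.append([batch, 0])
        elif not in_memory:
            break                    # stream drained and all batches expired
        page = next(pages)
        for entry in in_memory:
            for s_key, s_payload in entry[0]:
                for r_key, r_payload in page:
                    if s_key == r_key:
                        yield (s_key, s_payload, r_payload)
            entry[1] += 1
        while in_memory and in_memory[0][1] >= NUM_PAGES:
            in_memory.popleft()      # this batch has met every page of R

if __name__ == "__main__":
    stream = [[("k2", "s-a")], [("k4", "s-b"), ("k1", "s-c")]]
    for match in meshjoin(stream):
        print(match)
```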

Proceedings ArticleDOI
10 Sep 2008
TL;DR: This paper presents a methodology for adapting data warehouse schemas and user-end OLAP queries to efficiently support real-time data integration, using techniques such as table structure replication and query predicate restrictions for selecting data, to enable continuously loading data into the data warehouse with minimal impact on query execution time.
Abstract: A data warehouse provides information for analytical processing, decision making and data mining tools. As the concept of the real-time enterprise evolves, the synchronism between transactional data and data warehouses, statically implemented, has been redefined. Traditional data warehouse systems have static structures in their schemas and relationships between data, and therefore are not able to support any dynamics in their structure and content. Their data is only periodically updated because they are not prepared for continuous data integration. For real-time enterprises with decision support needs, real-time data warehouses seem very promising. In this paper we present a methodology on how to adapt data warehouse schemas and user-end OLAP queries for efficiently supporting real-time data integration. To accomplish this, we use techniques such as table structure replication and query predicate restrictions for selecting data, to enable continuously loading data into the data warehouse with minimal impact on query execution time. We demonstrate the efficiency of the method by analyzing its impact on query performance using the TPC-H benchmark, executing query workloads while simultaneously performing continuous data integration at various insertion time rates.
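A small sketch of the general technique described above, using an in-memory SQLite database purely for illustration (the paper works with TPC-H-style schemas, not this toy one): new rows are continuously loaded into a structurally replicated table, and the user-end query is rewritten to read the union of the static table and the replica.

```python
# Illustration (not the paper's code): replicate a fact table's structure
# for continuous loading and answer queries over the union of both tables.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Static fact table, loaded by the periodic batch ETL.
cur.execute("CREATE TABLE sales (day TEXT, amount REAL)")
cur.executemany("INSERT INTO sales VALUES (?, ?)",
                [("2008-01-01", 100.0), ("2008-01-02", 150.0)])

# Structure-replicated table receiving continuous (real-time) inserts.
cur.execute("CREATE TABLE sales_rt (day TEXT, amount REAL)")
cur.execute("INSERT INTO sales_rt VALUES ('2008-01-03', 75.0)")

# User-end query rewritten to read both tables transparently.
cur.execute("""
    SELECT SUM(amount) FROM (
        SELECT amount FROM sales
        UNION ALL
        SELECT amount FROM sales_rt
    )
""")
print("total sales:", cur.fetchone()[0])  # 325.0, including real-time rows
conn.close()
```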

Journal ArticleDOI
TL;DR: The vision for data management and mining addressing challenges in data preparation, representation, and analysis is presented, together with related research results from previous work, as well as recent developments in data mining on text-based, web-based, image-based, and network-based construction databases.

Proceedings ArticleDOI
13 Jun 2008
TL;DR: This work investigates how the traditional data cube model is adapted to trajectory warehouses in order to transform raw location data into valuable information.
Abstract: The flow of data generated from low-cost modern sensing technologies and wireless telecommunication devices enables novel research fields related to the management of this new kind of data and the implementation of appropriate analytics for knowledge extraction. In this work, we investigate how the traditional data cube model is adapted to trajectory warehouses in order to transform raw location data into valuable information. In particular, we focus our research on three issues that are critical to trajectory data warehousing: (a) the trajectory reconstruction procedure that takes place when loading a moving object database with sampled location data originating, e.g., from GPS recordings, (b) the ETL procedure that feeds a trajectory data warehouse, and (c) the aggregation of cube measures for OLAP purposes. We provide design solutions for all these issues and we test their applicability and efficiency in real-world settings.
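A toy sketch of the trajectory reconstruction step mentioned in point (a), using one possible heuristic rather than the paper's exact procedure (the gap threshold and GPS samples are assumptions): consecutive samples of an object are stitched into one trajectory unless the temporal gap exceeds a threshold, in which case a new trajectory starts.

```python
# Toy trajectory reconstruction: split an object's GPS samples into
# trajectories whenever the time gap exceeds MAX_GAP seconds.
# Threshold and sample data are illustrative assumptions.

MAX_GAP = 300  # seconds

def reconstruct(samples):
    """samples: list of (timestamp_s, x, y) sorted by time -> trajectories."""
    trajectories, current = [], []
    last_t = None
    for t, x, y in samples:
        if last_t is not None and t - last_t > MAX_GAP:
            trajectories.append(current)   # gap too large: close trajectory
            current = []
        current.append((t, x, y))
        last_t = t
    if current:
        trajectories.append(current)
    return trajectories

if __name__ == "__main__":
    gps = [(0, 0.0, 0.0), (60, 0.1, 0.0), (120, 0.2, 0.1),
           (1200, 5.0, 5.0), (1260, 5.1, 5.0)]   # long gap -> new trajectory
    for i, traj in enumerate(reconstruct(gps)):
        print(f"trajectory {i}: {len(traj)} points")
```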

Book
25 Feb 2008
TL;DR: In this article, the authors present an overview of the state of the art on data warehouse design, including three different approaches for requirements specification depending on whether users, operational data sources or both are the driving force in the requirements gathering process, and how each approach leads to the creation of a conceptual multidimensional model.
Abstract: A data warehouse stores large volumes of historical data required for analytical purposes. This data is extracted from operational databases; transformed into a coherent whole using a multidimensional model that includes measures, dimensions, and hierarchies; and loaded into a data warehouse during the extraction-transformation-loading (ETL) process. Malinowski and Zimányi explain in detail conventional data warehouse design, covering in particular complex hierarchy modeling. Additionally, they address two innovative domains recently introduced to extend the capabilities of data warehouse systems, namely the management of spatial and temporal information. Their presentation covers different phases of the design process, such as requirements specification, conceptual, logical, and physical design. They include three different approaches for requirements specification depending on whether users, operational data sources, or both are the driving force in the requirements gathering process, and they show how each approach leads to the creation of a conceptual multidimensional model. Throughout the book the concepts are illustrated using many real-world examples and completed by sample implementations for Microsoft's Analysis Services 2005 and Oracle 10g with the OLAP and the Spatial extensions. For researchers this book serves as an introduction to the state of the art on data warehouse design, with many references to more detailed sources. Providing a clear and a concise presentation of the major concepts and results of data warehouse design, it can also be used as the basis of a graduate or advanced undergraduate course. The book may help experienced data warehouse designers to enlarge their analysis possibilities by incorporating spatial and temporal information. Finally, experts in spatial databases or in geographical information systems could benefit from the data warehouse vision for building innovative spatial analytical applications.

Patent
28 Mar 2008
TL;DR: In this article, a method and apparatus for scanning structured data from a data repository having an arbitrary data schema and for applying a policy to the data of the data repository are described.
Abstract: A method and apparatus for scanning structured data from a data repository having an arbitrary data schema and for applying a policy to the data of the data repository are described. In one embodiment, the structured data is converted to unstructured text data to allow a schema-independent policy to be applied to the text data in order to detect a policy violation in the data repository regardless of the data schema used by the data repository.
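Not the patented implementation, but a small sketch of the approach the abstract describes (the regex policy and the sample rows are invented): rows from an arbitrary schema are flattened into plain text, and a schema-independent policy is applied to that text to flag violations.

```python
# Sketch: flatten structured rows into text and apply a schema-independent
# policy. The regex and sample schema are illustrative, not from the patent.
import re

POLICY = re.compile(r"\b\d{4}-\d{4}-\d{4}-\d{4}\b")  # toy card-number pattern

rows = [
    {"id": 1, "note": "call customer", "payment": "4111-1111-1111-1111"},
    {"id": 2, "note": "shipped", "payment": "on file"},
]

def to_text(row: dict) -> str:
    """Convert a row with an arbitrary schema into unstructured text."""
    return " ".join(str(v) for v in row.values())

violations = [row["id"] for row in rows if POLICY.search(to_text(row))]
print("rows violating policy:", violations)  # -> [1]
```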

Book
23 May 2008
TL;DR: This six-volume set offers tools, designs, and outcomes of the utilization of data warehousing and mining technologies, such as algorithms, concept lattices, multidimensional data, and online analytical processing.
Abstract: Data Warehousing and Mining: Concepts, Methodologies, Tools and Applications provides the most comprehensive compilation of research available in this emerging and increasingly important field. This six-volume set offers tools, designs, and outcomes of the utilization of data warehousing and mining technologies, such as algorithms, concept lattices, multidimensional data, and online analytical processing. With more than 300 chapters contributed by over 575 experts from around the globe, this authoritative collection will provide libraries with the essential reference on data warehousing and mining.

Patent
Theodore Johnson1
15 Jan 2008
TL;DR: In this article, the authors proposed a method of updating a data storage system using an input data stream based on an input temporal value associated with the data stream and a raw temporal value corresponding to the raw database.
Abstract: The invention relates to a method of updating a data storage system. The method updates a raw database using an input data stream, based on an input temporal value associated with the input data stream and a raw temporal value associated with the raw database. The method includes updating a derived database associated with the data storage system using the updated raw database, based on the input temporal value, a derived temporal value and a user-defined relationship, the derived temporal value being associated with the derived database. The invention also relates to a computer-readable medium. The computer-readable medium includes instructions, wherein execution of the instructions by at least one computing device updates a data storage system. The invention further relates to a data storage system. The system includes a raw database, a derived database and a computing device operatively coupled to the raw database and the derived database.

Journal ArticleDOI
01 Jan 2008
TL;DR: A temporal extension of the MultiDim model is introduced, which allows different temporality types: valid time, transaction time, and lifespan, which are obtained from source systems, and loading time, which is generated in the data warehouse.
Abstract: The MultiDim model is a conceptual multidimensional model for data warehouse and OLAP applications. These applications require the presence of a time dimension to track changes in measure values. However, the time dimension cannot be used to represent changes in other dimensions. In this paper we introduce a temporal extension of the MultiDim model. This extension is based on research carried out in temporal databases. We allow different temporality types: valid time, transaction time, and lifespan, which are obtained from source systems, and loading time, which is generated in the data warehouse. Our model provides temporal support for levels, attributes, hierarchies, and measures. For hierarchies we discuss different cases depending on whether the changes in levels or in the relationships between them must be kept. For measures, we give different scenarios that show the usefulness of the different temporality types. Further, since measures can be aggregated before being inserted into data warehouses, we discuss the issues related to different time granularities between source systems and data warehouses. We finish the paper by presenting a transformation of the MultiDim model into the entity-relationship and the object-relational models.

Journal ArticleDOI
TL;DR: A case study of Continental Airlines describes how business intelligence at Continental has evolved over time and identifies Continental's challenges with its mature data warehouse and provides suggestions for how companies can work to overcome these kinds of obstacles.
Abstract: As the business intelligence industry matures, it is increasingly important to investigate and understand the nature of mature data warehouses. Although data warehouse research is prevalent, existing research primarily addresses new implementations and initial challenges. This case study of Continental Airlines describes how business intelligence at Continental has evolved over time. It identifies Continental's challenges with its mature data warehouse and provides suggestions for how companies can work to overcome these kinds of obstacles.

Proceedings ArticleDOI
07 Apr 2008
TL;DR: RiTE ("Right-Time ETL"), a middleware system that provides "the best of both worlds", i.e., INSERT-like data availability, but with bulk-load speeds (up to 10 times faster).
Abstract: Data warehouses (DWs) have traditionally been loaded with data at regular time intervals, e.g., monthly, weekly, or daily, using fast bulk loading techniques. Recently, the trend is to insert all (or only some) new source data very quickly into DWs, called near-realtime DWs (right-time DWs). This is done using regular INSERT statements, resulting in too low insert speeds. There is thus a great need for a solution that makes inserted data available quickly, while still providing bulk-load insert speeds. This paper presents RiTE ("Right-Time ETL"), a middleware system that provides exactly that. A data producer (ETL) can insert data that becomes available to data consumers on demand. RiTE includes an innovative main-memory based catalyst that provides fast storage and offers concurrency control. A number of policies controlling the bulk movement of data based on user requirements for persistency, availability, freshness, etc. are supported. The system works transparently to both producer and consumers. The system is integrated with an open source DBMS, and experiments show that it provides "the best of both worlds", i.e., INSERT-like data availability, but with bulk-load speeds (up to 10 times faster).