
Showing papers on "Data warehouse published in 2008"


Proceedings ArticleDOI
09 Jun 2008
TL;DR: It is concluded that while it is not impossible for a row-store to achieve some of the performance advantages of a column-store, changes must be made to both the storage layer and the query executor to fully obtain the benefits of a column-oriented approach.
Abstract: There has been a significant amount of excitement and recent work on column-oriented database systems ("column-stores"). These database systems have been shown to perform more than an order of magnitude better than traditional row-oriented database systems ("row-stores") on analytical workloads such as those found in data warehouses, decision support, and business intelligence applications. The elevator pitch behind this performance difference is straightforward: column-stores are more I/O efficient for read-only queries since they only have to read from disk (or from memory) those attributes accessed by a query. This simplistic view leads to the assumption that one can obtain the performance benefits of a column-store using a row-store: either by vertically partitioning the schema, or by indexing every column so that columns can be accessed independently. In this paper, we demonstrate that this assumption is false. We compare the performance of a commercial row-store under a variety of different configurations with a column-store and show that the row-store performance is significantly slower on a recently proposed data warehouse benchmark. We then analyze the performance difference and show that there are some important differences between the two systems at the query executor level (in addition to the obvious differences at the storage layer level). Using the column-store, we then tease apart these differences, demonstrating the impact on performance of a variety of column-oriented query execution techniques, including vectorized query processing, compression, and a new join algorithm we introduce in this paper. We conclude that while it is not impossible for a row-store to achieve some of the performance advantages of a column-store, changes must be made to both the storage layer and the query executor to fully obtain the benefits of a column-oriented approach.

526 citations
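To make the I/O argument in the abstract above concrete, here is a small, hedged illustration (not taken from the paper, and with purely assumed table sizes): when a query touches only a few attributes, a columnar layout lets the scan read just those columns, while a row layout reads every attribute of every tuple.

```python
# Toy comparison of bytes scanned by a row layout vs. a column layout
# for a query that reads only 2 of 10 fixed-width attributes.
# Sizes are illustrative assumptions, not measurements from the paper.

NUM_ROWS = 1_000_000
NUM_COLS = 10
BYTES_PER_VALUE = 8  # assume fixed-width 8-byte attributes

def row_store_bytes_scanned() -> int:
    # A row store reads whole tuples even if only a few columns are needed.
    return NUM_ROWS * NUM_COLS * BYTES_PER_VALUE

def column_store_bytes_scanned(cols_needed: int) -> int:
    # A column store reads only the referenced columns.
    return NUM_ROWS * cols_needed * BYTES_PER_VALUE

if __name__ == "__main__":
    needed = 2  # e.g. SELECT SUM(price) ... GROUP BY region
    rs = row_store_bytes_scanned()
    cs = column_store_bytes_scanned(needed)
    print(f"row store   : {rs / 1e6:.0f} MB scanned")
    print(f"column store: {cs / 1e6:.0f} MB scanned")
    print(f"ratio: {rs / cs:.1f}x less I/O for the column layout")
```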


Journal ArticleDOI
TL;DR: Three types of approaches for tackling the respective challenges are distinguished, mapped to a three-layer BI framework, and discussed regarding challenges and business potential.
Abstract: In the course of the evolution of management support towards corporate-wide Business Intelligence infrastructures, the integration of components for handling unstructured data comes into focus. In this paper, three types of approaches for tackling the respective challenges are distinguished. The approaches are mapped to a three-layer BI framework and discussed regarding challenges and business potential. The application of the framework is exemplified for the domains of Competitive Intelligence and Customer Relationship Management.

348 citations


Proceedings ArticleDOI
09 Jun 2008
TL;DR: This paper describes the first completely self-configuring data integration system, based on the new concept of a probabilistic mediated schema that is automatically created from the data sources, and shows that it is able to produce high-quality answers with no human intervention.
Abstract: Data integration systems offer a uniform interface to a set of data sources. Despite recent progress, setting up and maintaining a data integration application still requires significant upfront effort of creating a mediated schema and semantic mappings from the data sources to the mediated schema. Many application contexts involving multiple data sources (e.g., the web, personal information management, enterprise intranets) do not require full integration in order to provide useful services, motivating a pay-as-you-go approach to integration. With that approach, a system starts with very few (or inaccurate) semantic mappings and these mappings are improved over time as deemed necessary. This paper describes the first completely self-configuring data integration system. The goal of our work is to investigate how advanced a starting point we can provide for a pay-as-you-go system. Our system is based on the new concept of a probabilistic mediated schema that is automatically created from the data sources. We automatically create probabilistic schema mappings between the sources and the mediated schema. We describe experiments in multiple domains, including 50-800 data sources, and show that our system is able to produce high-quality answers with no human intervention.

273 citations
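As a rough, hypothetical sketch of the probabilistic-mapping idea summarized above (not the authors' system; the attribute names, probabilities, and scoring rule are invented): each candidate mapping from a source attribute to a mediated-schema attribute carries a probability, and answers are scored by the best-scoring mapping that produced them.

```python
# Hypothetical sketch: probabilistic schema mappings score query answers.
# Attribute names, probabilities, and data are invented for illustration.

from collections import defaultdict

# Candidate mappings: source attribute -> [(mediated attribute, probability)]
mappings = {
    "src1.phone_no":  [("contact_phone", 0.9)],
    "src1.cell":      [("contact_phone", 0.6), ("fax", 0.4)],
    "src2.telephone": [("contact_phone", 0.8)],
}

# Source tuples keyed by source attribute.
source_data = {
    "src1.phone_no":  ["555-0100"],
    "src1.cell":      ["555-0101"],
    "src2.telephone": ["555-0100"],
}

def answer(mediated_attr: str) -> list[tuple[str, float]]:
    """Return values for a mediated attribute, scored by mapping probability.
    A value reachable through several mappings keeps its best score."""
    scores: dict[str, float] = defaultdict(float)
    for src_attr, candidates in mappings.items():
        for target, prob in candidates:
            if target == mediated_attr:
                for value in source_data.get(src_attr, []):
                    scores[value] = max(scores[value], prob)
    return sorted(scores.items(), key=lambda kv: -kv[1])

if __name__ == "__main__":
    for value, score in answer("contact_phone"):
        print(f"{value}\t(confidence {score:.1f})")
```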


Book
15 Jan 2008
TL;DR: This book serves as an introduction to the state of the art on data warehouse design, with many references to more detailed sources, and may help experienced data warehouse designers to enlarge their analysis possibilities by incorporating spatial and temporal information.
Abstract: A data warehouse stores large volumes of historical data required for analytical purposes. This data is extracted from operational databases; transformed into a coherent whole using a multidimensional model that includes measures, dimensions, and hierarchies; and loaded into a data warehouse during the extraction-transformation-loading (ETL) process. Malinowski and Zimányi explain in detail conventional data warehouse design, covering in particular complex hierarchy modeling. Additionally, they address two innovative domains recently introduced to extend the capabilities of data warehouse systems, namely the management of spatial and temporal information. Their presentation covers different phases of the design process, such as requirements specification, conceptual, logical, and physical design. They include three different approaches for requirements specification depending on whether users, operational data sources, or both are the driving force in the requirements gathering process, and they show how each approach leads to the creation of a conceptual multidimensional model. Throughout the book the concepts are illustrated using many real-world examples and completed by sample implementations for Microsoft's Analysis Services 2005 and Oracle 10g with the OLAP and the Spatial extensions. For researchers this book serves as an introduction to the state of the art on data warehouse design, with many references to more detailed sources. Providing a clear and a concise presentation of the major concepts and results of data warehouse design, it can also be used as the basis of a graduate or advanced undergraduate course. The book may help experienced data warehouse designers to enlarge their analysis possibilities by incorporating spatial and temporal information. Finally, experts in spatial databases or in geographical information systems could benefit from the data warehouse vision for building innovative spatial analytical applications.

223 citations


Journal ArticleDOI
01 Apr 2008
TL;DR: This paper proposes GRAnD, a goal-oriented approach to requirement analysis for data warehouses based on the Tropos methodology, which can be employed within both a demand-driven and a mixed supply/demand-driven design framework.
Abstract: Several surveys indicate that a significant percentage of data warehouses fail to meet business objectives or are outright failures. One of the reasons for this is that requirement analysis is typically overlooked in real projects. In this paper we propose GRAnD, a goal-oriented approach to requirement analysis for data warehouses based on the Tropos methodology. Two different perspectives are integrated for requirement analysis: organizational modeling, centered on stakeholders, and decisional modeling, focused on decision makers. Our approach can be employed within both a demand-driven and a mixed supply/demand-driven design framework.

215 citations


Book
28 Sep 2008
TL;DR: Master Data Management equips you with a deeply practical, business-focused way of thinking about MDM, an understanding that will greatly enhance your ability to communicate with stakeholders and win their support.
Abstract: The key to a successful MDM initiative isn't technology or methods, it's people: the stakeholders in the organization and their complex ownership of the data that the initiative will affect. Master Data Management equips you with a deeply practical, business-focused way of thinking about MDM, an understanding that will greatly enhance your ability to communicate with stakeholders and win their support. Moreover, it will help you deserve their support: you'll master all the details involved in planning and executing an MDM project that leads to measurable improvements in business productivity and effectiveness.
* Presents a comprehensive roadmap that you can adapt to any MDM project.
* Emphasizes the critical goal of maintaining and improving data quality.
* Provides guidelines for determining which data to master.
* Examines special issues relating to master data metadata.
* Considers a range of MDM architectural styles.
* Covers the synchronization of master data across the application infrastructure.

190 citations


Book
09 Jul 2008
TL;DR: This book describes the future of data warehousing that is technologically possible now, at both an architectural and a technology level, and gives the experienced data warehouse professional exactly what is needed in order to implement the new-generation DW 2.0.
Abstract: Data warehousing has been around for 20 years and has become part of the information technology infrastructure. Data warehousing originally grew in response to the corporate need for information--not data--and it supplies integrated, granular, and historical data to the corporation. There are many kinds of data warehouses, in large part due to evolution and different paths of software and hardware vendors. But DW 2.0, defined by this author in many talks, articles, and his b-eye-network newsletter that reaches 65,000 professionals monthly, is the well-identified and defined next-generation data warehouse. The book carries that theme and describes the future of data warehousing that is technologically possible now, at both an architectural level and a technology level. The perspective of the book is from the top down: looking at the overall architecture and then delving into the issues underlying the components. The benefit for people who are building or using a data warehouse is that they can see what lies ahead and can determine what new technology to buy, how to plan extensions to the data warehouse, what can be salvaged from the current system, and how to justify the expense--at the most practical level. All of this gives the experienced data warehouse professional exactly what is needed in order to implement the new-generation DW 2.0.
* First book on the new generation of data warehouse architecture, DW 2.0.
* Written by the "father of the data warehouse", Bill Inmon, a columnist and newsletter editor of The Bill Inmon Channel on the Business Intelligence Network.
* Long overdue comprehensive coverage of the implementation of technology and tools that enable the new generation of the DW: metadata, temporal data, ETL, unstructured data, and data quality control.

167 citations


Journal ArticleDOI
TL;DR: The paper addresses the application of information retrieval technology in a DW to exploit text-rich document collections and introduces the problem of dealing with semi-structured data in a DW.
Abstract: This paper surveys the most relevant research on combining Data Warehouse (DW) and Web data. It studies the XML technologies that are currently being used to integrate, store, query and retrieve web data, and their application to DWs. The paper reviews different DW distributed architectures and the use of XML languages as an integration tool in these systems. It also introduces the problem of dealing with semi-structured data in a DW. It studies Web data repositories, the design of multidimensional databases for XML data sources and the XML extensions of On-Line Analytical Processing techniques. The paper addresses the application of information retrieval technology in a DW to exploit text-rich document collections. The authors hope that the paper will help to discover the main limitations and opportunities offered by the combination of the DW and Web fields, as well as to identify open research lines.

160 citations


Journal ArticleDOI
01 Apr 2008
TL;DR: This paper describes how to build the different MDA models for the DW repository by using an extension of the Unified Modeling Language (UML) and the Common Warehouse Metamodel (CWM).
Abstract: Different modeling approaches have been proposed to overcome every design pitfall of different data warehouse (DW) components. However, most of them offer partial solutions that deal only with isolated aspects of the DW and do not provide developers with an integrated and standard framework for designing all DW relevant components, such as ETL processes, data sources, DW repository and so on. To overcome this problem, this paper describes how to align the whole DW development process with a Model Driven Architecture (MDA) framework. We then focus on describing one part of it: an MDA approach for the development of the DW repository, because it is the cornerstone of any DW system. Therefore, we describe how to build the different MDA models for the DW repository by using an extension of the Unified Modeling Language (UML) and the Common Warehouse Metamodel (CWM). Transformations between models are also clearly and formally established by using the Query/View/Transformation (QVT) language. Finally, a case study is provided to exemplify the benefits of our MDA framework.

157 citations


Journal ArticleDOI
01 Aug 2008
TL;DR: Additional benefits resulting from the Knowledge Grid for compressed, column-oriented databases are demonstrated, including assistance in query optimization and execution by minimizing the need for data reads and data decompression.
Abstract: Brighthouse is a column-oriented data warehouse with an automatically tuned, ultra-small-overhead metadata layer called Knowledge Grid, which is used as an alternative to classical indexes. The advantages of column-oriented data storage, as well as data compression, have already been well documented, especially in the context of analytic, decision support querying. This paper demonstrates additional benefits resulting from the Knowledge Grid for compressed, column-oriented databases. In particular, we explain how it assists in query optimization and execution, by minimizing the need for data reads and data decompression.

149 citations
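A minimal sketch of the kind of rough, per-pack metadata the abstract describes, written from scratch rather than taken from Brighthouse (pack size and data are assumptions): keeping min/max statistics for each block of a column lets a range filter skip blocks that cannot contain matching rows, avoiding both the read and the decompression.

```python
# Sketch of a "knowledge grid"-style rough filter: per-pack min/max metadata
# decides which packs must be read at all. Pack size and data are assumptions.

PACK_SIZE = 4  # tiny packs for illustration; real systems use far larger ones

def build_pack_stats(column: list[int]) -> list[tuple[int, int]]:
    """Compute (min, max) for each fixed-size pack of the column."""
    stats = []
    for i in range(0, len(column), PACK_SIZE):
        pack = column[i:i + PACK_SIZE]
        stats.append((min(pack), max(pack)))
    return stats

def scan_with_stats(column, stats, lo, hi):
    """Return values in [lo, hi], touching only packs whose range overlaps."""
    hits, packs_read = [], 0
    for p, (pmin, pmax) in enumerate(stats):
        if pmax < lo or pmin > hi:
            continue  # irrelevant pack: skipped without read or decompression
        packs_read += 1
        start = p * PACK_SIZE
        hits.extend(v for v in column[start:start + PACK_SIZE] if lo <= v <= hi)
    return hits, packs_read

if __name__ == "__main__":
    col = [1, 2, 3, 4, 50, 51, 52, 53, 7, 8, 9, 10]
    stats = build_pack_stats(col)
    values, packs_read = scan_with_stats(col, stats, 50, 60)
    print(values, f"- read {packs_read} of {len(stats)} packs")
```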



Proceedings ArticleDOI
15 Dec 2008
TL;DR: This paper proposes a text-cube model on a multidimensional text database, conducts systematic studies on efficient text-cube implementation, OLAP execution, and query processing, and shows the high promise of the methods.
Abstract: Since Jim Gray introduced the concept of "data cube" in 1997, the data cube, associated with online analytical processing (OLAP), has become a driving engine in the data warehouse industry. Because the boom of the Internet has given rise to an ever increasing amount of text data associated with other multidimensional information, it is natural to propose a data cube model that integrates the power of traditional OLAP and IR techniques for text. In this paper, we propose a text-cube model on a multidimensional text database and study effective OLAP over such data. Two kinds of hierarchies are distinguishable inside: the dimensional hierarchy and the term hierarchy. By incorporating these hierarchies, we conduct systematic studies on efficient text-cube implementation, OLAP execution and query processing. Our performance study shows the high promise of our methods.
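A toy sketch of the text-cube idea as I read the abstract (my own simplification, with invented dimensions and documents): each cell groups documents by dimension values and aggregates a term-frequency measure, which can then be rolled up by dropping dimensions from the group-by.

```python
# Toy text-cube sketch: documents carry dimension values plus text;
# a cell aggregates term frequencies for a group-by over chosen dimensions.
# Schema and data are invented for illustration.

from collections import Counter, defaultdict

docs = [
    {"location": "NY", "year": 2007, "text": "flight delay weather delay"},
    {"location": "NY", "year": 2008, "text": "baggage delay service"},
    {"location": "LA", "year": 2008, "text": "weather fog delay"},
]

def text_cube_cells(documents, group_by):
    """Aggregate term frequencies per cell defined by the group_by dimensions."""
    cells = defaultdict(Counter)
    for doc in documents:
        key = tuple(doc[d] for d in group_by)
        cells[key].update(doc["text"].split())
    return cells

if __name__ == "__main__":
    # Cuboid on (location); rolling up means removing a dimension from group_by.
    for cell, terms in text_cube_cells(docs, ["location"]).items():
        print(cell, terms.most_common(2))
```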

Proceedings ArticleDOI
09 Jun 2008
TL;DR: This work proposes a new join geography called PRPD (Partial Redistribution & Partial Duplication) to improve the performance and scalability of parallel joins in the presence of data skew in a shared-nothing system.
Abstract: Parallel processing continues to be important in large data warehouses. The processing requirements continue to expand in multiple dimensions. These include greater volumes, increasing number of concurrent users, more complex queries, and more applications which define complex logical, semantic, and physical data models. Shared nothing parallel database management systems [16] can scale up "horizontally" by adding more nodes. Most parallel algorithms, however, do not take into account data skew. Data skew occurs naturally in many applications. A query processing skewed data not only slows down its response time, but generates hot nodes, which become a bottleneck throttling the overall system performance. Motivated by real business problems, we propose a new join geography called PRPD (Partial Redistribution & Partial Duplication) to improve the performance and scalability of parallel joins in the presence of data skew in a shared-nothing system. Our experimental results show that PRPD significantly speeds up query elapsed time in the presence of data skew. Our experience shows that eliminating system bottlenecks caused by data skew improves the throughput of the whole system which is important in parallel data warehouses that often run high concurrency workloads.
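A highly simplified, hypothetical sketch of the partial-redistribution / partial-duplication idea (not the paper's implementation; the node count, skew list, and rows are invented): rows of one input carrying skewed join values stay on their local node, matching rows of the other input are duplicated to every node, and everything else is hash-redistributed as usual.

```python
# Sketch of PRPD-style routing for a parallel hash join on "key".
# Node layout, skew detection, and data are illustrative assumptions.

NUM_NODES = 3
SKEWED_KEYS = {"popular"}  # would normally come from statistics

def hash_node(key) -> int:
    return hash(key) % NUM_NODES

def route_prpd(r_rows, s_rows):
    """Return per-node buckets for both join inputs.
    R rows with skewed keys stay on their origin node; matching S rows are
    duplicated to all nodes; all other rows are hash-redistributed."""
    r_at = [[] for _ in range(NUM_NODES)]
    s_at = [[] for _ in range(NUM_NODES)]
    for node, key, payload in r_rows:            # (origin node, key, payload)
        target = node if key in SKEWED_KEYS else hash_node(key)
        r_at[target].append((key, payload))
    for node, key, payload in s_rows:
        if key in SKEWED_KEYS:
            for n in range(NUM_NODES):            # partial duplication
                s_at[n].append((key, payload))
        else:
            s_at[hash_node(key)].append((key, payload))
    return r_at, s_at

if __name__ == "__main__":
    r = [(0, "popular", "r1"), (1, "popular", "r2"), (2, "rare", "r3")]
    s = [(0, "popular", "s1"), (1, "rare", "s2")]
    r_at, s_at = route_prpd(r, s)
    for n in range(NUM_NODES):
        print(f"node {n}: R={r_at[n]} S={s_at[n]}")
```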

Book
07 Jan 2008
TL;DR: This book describes how to build a data warehouse completely from scratch and shows practical examples of how to do it, as well as some practical issues the author has experienced that developers are likely to encounter in their first data warehousing project, along with solutions and advice.
Abstract: Building a Data Warehouse: With Examples in SQL Server describes how to build a data warehouse completely from scratch and shows practical examples of how to do it. Author Vincent Rainardi also describes some practical issues he has experienced that developers are likely to encounter in their first data warehousing project, along with solutions and advice. The RDBMS used in the examples is SQL Server; the version will not be an issue as long as the user has SQL Server 2005 or later. The book is organized as follows. In the beginning of this book (Chapters 1 through 6), you learn how to build a data warehouse, for example, defining the architecture, understanding the methodology, gathering the requirements, designing the data models, and creating the databases. Then in Chapters 7 through 10, you learn how to populate the data warehouse, for example, extracting from source systems, loading the data stores, maintaining data quality, and utilizing the metadata. After you populate the data warehouse, in Chapters 11 through 15, you explore how to present data to users using reports and multidimensional databases and how to use the data in the data warehouse for business intelligence, customer relationship management, and other purposes. Chapters 16 and 17 wrap up the book: after you have built your data warehouse, before it can be released to production, you need to test it thoroughly, and after your application is in production, you need to understand how to administer data warehouse operation. What you'll learn:
* A detailed understanding of what it takes to build a data warehouse
* The implementation code in SQL Server to build the data warehouse
* Dimensional modeling, data extraction methods, data warehouse loading, populating dimension and fact tables, data quality, data warehouse architecture, and database design
* Practical data warehousing applications such as business intelligence reports, analytics applications, and customer relationship management
Who is this book for? There are three audiences for the book. The first is the people who implement the data warehouse; this could be considered a field guide for them. The second is database users/admins who want to get a good understanding of what it would take to build a data warehouse. Finally, the third audience is managers who must make decisions about aspects of the data warehousing task before them and use the book to learn about these issues.

Book
05 Jun 2008
TL;DR: This book systematically introduces MDM key concepts and technical themes, explains its business case, and illuminates how it interrelates with and enables SOA.
Abstract: The only complete technical primer for MDM planners, architects, and implementers. Companies moving toward flexible SOA architectures often face difficult information management and integration challenges. The master data they rely on is often stored and managed in ways that are redundant, inconsistent, inaccessible, non-standardized, and poorly governed. Using Master Data Management (MDM), organizations can regain control of their master data, improve corresponding business processes, and maximize its value in SOA environments. Enterprise Master Data Management provides an authoritative, vendor-independent MDM technical reference for practitioners: architects, technical analysts, consultants, solution designers, and senior IT decision makers. Written by the IBM data management innovators who are pioneering MDM, this book systematically introduces MDM's key concepts and technical themes, explains its business case, and illuminates how it interrelates with and enables SOA. Drawing on their experience with cutting-edge projects, the authors introduce MDM patterns, blueprints, solutions, and best practices published nowhere else: everything you need to establish a consistent, manageable set of master data, and use it for competitive advantage. Coverage includes:
* How MDM and SOA complement each other
* Using the MDM Reference Architecture to position and design MDM solutions within an enterprise
* Assessing the value and risks to master data and applying the right security controls
* Using PIM-MDM and CDI-MDM Solution Blueprints to address industry-specific information management challenges
* Explaining MDM patterns as enablers to accelerate consistent MDM deployments
* Incorporating MDM solutions into existing IT landscapes via MDM Integration Blueprints
* Leveraging master data as an enterprise asset, bringing people, processes, and technology together with MDM and data governance
* Best practices in MDM deployment, including data warehouse and SAP integration

Proceedings ArticleDOI
02 Jun 2008
TL;DR: This paper uses a machine learning approach that takes the query plan, combines it with the observed load vector of the system and uses the new vector to predict the execution time of the query.
Abstract: Modern enterprise data warehouses have complex workloads that are notoriously difficult to manage. One of the key pieces to managing workloads is an estimate of how long a query will take to execute. An accurate estimate of this query execution time is critical to self managing Enterprise Class Data Warehouses. In this paper we study the problem of predicting the execution time of a query on a loaded data warehouse with a dynamically changing workload. We use a machine learning approach that takes the query plan, combines it with the observed load vector of the system and uses the new vector to predict the execution time of the query. The predictions are made as time ranges. We validate our solution using real databases and real workloads. We show experimentally that our machine learning approach works well. This technology is slated for incorporation into a commercial, enterprise class DBMS.
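A hedged sketch of the general approach described above; the paper's actual features, model choice, and time buckets are not reproduced here, so the feature vector, the gradient-boosted regressor, and the bucket edges below are assumptions.

```python
# Sketch: predict query elapsed-time ranges from plan features + load vector.
# Feature names, the model choice, and the bucket edges are assumptions.

from sklearn.ensemble import GradientBoostingRegressor
import numpy as np

# Each row: [est_rows, est_cost, num_joins, cpu_load, io_load, mem_load]
X_train = np.array([
    [1e4,  12.0, 1, 0.2, 0.1, 0.3],
    [1e6, 450.0, 3, 0.7, 0.6, 0.5],
    [5e5, 200.0, 2, 0.4, 0.3, 0.4],
    [2e6, 900.0, 4, 0.9, 0.8, 0.7],
])
y_train = np.array([1.5, 95.0, 40.0, 260.0])  # observed elapsed seconds

BUCKETS = [(0, 10), (10, 60), (60, 300), (300, float("inf"))]

def to_range(seconds: float) -> tuple[float, float]:
    """Map a point prediction onto a coarse time range."""
    for lo, hi in BUCKETS:
        if lo <= seconds < hi:
            return (lo, hi)
    return BUCKETS[-1]

model = GradientBoostingRegressor().fit(X_train, y_train)

if __name__ == "__main__":
    # A new query: plan features combined with the current load vector.
    plan_and_load = np.array([[8e5, 300.0, 3, 0.6, 0.5, 0.4]])
    predicted = float(model.predict(plan_and_load)[0])
    print("predicted range (s):", to_range(predicted))
```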

Journal ArticleDOI
01 Aug 2008
TL;DR: A system with which a non-expert user can author new query templates and Web forms, to be reused by anyone with related information needs, and which learns to assign costs to sources and associations according to the user's specific information need, thereby changing the ranking of the queries used to generate results.
Abstract: The number of potentially-related data resources available for querying --- databases, data warehouses, virtual integrated schemas --- continues to grow rapidly. Perhaps no area has seen this problem as acutely as the life sciences, where hundreds of large, complex, interlinked data resources are available on fields like proteomics, genomics, disease studies, and pharmacology. The schemas of individual databases are often large on their own, but users also need to pose queries across multiple sources, exploiting foreign keys and schema mappings. Since the users are not experts, they typically rely on the existence of pre-defined Web forms and associated query templates, developed by programmers to meet the particular scientists' needs. Unfortunately, such forms are scarce commodities, often limited to a single database, and mismatched with biologists' information needs that are often context-sensitive and span multiple databases. We present a system with which a non-expert user can author new query templates and Web forms, to be reused by anyone with related information needs. The user poses keyword queries that are matched against source relations and their attributes; the system uses sequences of associations (e.g., foreign keys, links, schema mappings, synonyms, and taxonomies) to create multiple ranked queries linking the matches to keywords; the set of queries is attached to a Web query form. Now the user and his or her associates may pose specific queries by filling in parameters in the form. Importantly, the answers to this query are ranked and annotated with data provenance, and the user provides feedback on the utility of the answers, from which the system ultimately learns to assign costs to sources and associations according to the user's specific information need, as a result changing the ranking of the queries used to generate results. We evaluate the effectiveness of our method against "gold standard" costs from domain experts and demonstrate the method's scalability.

Patent
07 Jan 2008
TL;DR: In this article, a data mining algorithm is applied to the collected data requests to predict a set of data that is likely to be requested during an upcoming time period, and it is determined whether the complete set of predicted data exists in the data cache.
Abstract: Methods and apparatus, including computer program products, implementing and using techniques for populating a data cache on a server. Data requests received by the server are collected in a repository. A data mining algorithm is applied to the collected data requests to predict a set of data that is likely to be requested during an upcoming time period. It is determined whether the complete set of predicted data exists in the data cache. If the complete set of predicted data does not exist in the data cache, the missing data is retrieved from a database and added to the data cache.
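Not the patented method itself, but a toy sketch of the workflow the abstract describes, with a simple per-hour frequency count standing in for the data mining step and invented keys and tables: predict the items likely to be requested in the upcoming period and pre-load whatever is missing from the cache.

```python
# Toy sketch of predictive cache population: a frequency-based stand-in for
# the data mining step predicts next-period requests; missing entries are
# fetched from the database into the cache. All names are hypothetical.

from collections import Counter

request_log = [  # (hour_of_day, requested_key)
    (9, "sales_eu"), (9, "sales_us"), (9, "sales_eu"),
    (10, "inventory"), (9, "sales_eu"),
]

database = {"sales_eu": "...", "sales_us": "...", "inventory": "..."}
cache: dict[str, str] = {"sales_us": "..."}

def predict_requests(log, upcoming_hour, top_n=2):
    """Tiny stand-in for the mining step: most frequent keys for that hour."""
    counts = Counter(key for hour, key in log if hour == upcoming_hour)
    return [key for key, _ in counts.most_common(top_n)]

def populate_cache(upcoming_hour):
    for key in predict_requests(request_log, upcoming_hour):
        if key not in cache:              # only the missing part is fetched
            cache[key] = database[key]    # retrieve from the database

if __name__ == "__main__":
    populate_cache(9)
    print(sorted(cache))  # sales_eu has been pre-loaded alongside sales_us
```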

Book ChapterDOI
24 Aug 2008
TL;DR: It is often not desirable to maintain data at the level of detail required for operational reporting in the data warehouse, due to both the exploding size of the warehouse and the required update frequency, while using an ODS as the source for operational reporting exhibits a similar information latency to informational reporting.
Abstract: Operational reporting differs from informational reporting in that its scope is on day-to-day operations and thus requires data at the level of detail of individual transactions. It is often not desirable to maintain data at such a detailed level in the data warehouse, due to both the exploding size of the warehouse and the update frequency required for operational reports. Using an ODS as the source for operational reporting exhibits a similar information latency.

Journal ArticleDOI
TL;DR: A specialized join algorithm, termed mesh join (MESHJOIN), is proposed, which compensates for the difference in the access cost of the two join inputs by 1) relying entirely on fast sequential scans of R and 2) sharing the I/O cost of accessing R across multiple tuples of S.
Abstract: Active data warehousing has emerged as an alternative to conventional warehousing practices in order to meet the high demand of applications for up-to-date information. In a nutshell, an active warehouse is refreshed online and thus achieves a higher consistency between the stored information and the latest data updates. The need for online warehouse refreshment introduces several challenges in the implementation of data warehouse transformations, with respect to their execution time and their overhead to the warehouse processes. In this paper, we focus on a frequently encountered operation in this context, namely, the join of a fast stream S of source updates with a disk-based relation R, under the constraint of limited memory. This operation lies at the core of several common transformations such as surrogate key assignment, duplicate detection, or identification of newly inserted tuples. We propose a specialized join algorithm, termed mesh join (MESHJOIN), which compensates for the difference in the access cost of the two join inputs by 1) relying entirely on fast sequential scans of R and 2) sharing the I/O cost of accessing R across multiple tuples of S. We detail the MESHJOIN algorithm and develop a systematic cost model that enables the tuning of MESHJOIN for two objectives: maximizing throughput under a specific memory budget or minimizing memory consumption for a specific throughput. We present an experimental study that validates the performance of MESHJOIN on synthetic and real-life data. Our results verify the scalability of MESHJOIN to fast streams and large relations and demonstrate its numerous advantages over existing join algorithms.
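A compact, simplified sketch of the MESHJOIN access pattern summarized above (the memory budgeting and cost model are omitted, the probing is a plain nested loop rather than a hash lookup, and page size and data are assumptions): stream tuples are admitted in batches while R is scanned cyclically in pages; each batch is probed against every page, and a batch expires once it has met one full cycle of R.

```python
# Simplified MESHJOIN-style sketch: join a stream S with a large relation R
# using only sequential, cyclic scans of R. Page size and data are assumptions.

from collections import deque

R = [("k1", "dim1"), ("k2", "dim2"), ("k3", "dim3"), ("k4", "dim4")]
PAGE_SIZE = 2
NUM_PAGES = (len(R) + PAGE_SIZE - 1) // PAGE_SIZE

def r_pages():
    """Endless cyclic, sequential scan of R in pages."""
    while True:
        for p in range(NUM_PAGES):
            yield R[p * PAGE_SIZE:(p + 1) * PAGE_SIZE]

def meshjoin(stream_batches):
    """Each iteration admits one stream batch and reads one page of R.
    A batch expires once it has been probed against a full cycle of R."""
    pages = r_pages()
    in_memory = deque()              # entries: [batch, pages_seen]
    batches = iter(stream_batches)
    while True:
        batch = next(batches, None)
        if batch is not None:
            in_memory.append([batch, 0])
        elif not in_memory:
            break                    # stream drained and all batches expired
        page = next(pages)
        for entry in in_memory:
            for s_key, s_payload in entry[0]:
                for r_key, r_payload in page:
                    if s_key == r_key:
                        yield (s_key, s_payload, r_payload)
            entry[1] += 1
        while in_memory and in_memory[0][1] >= NUM_PAGES:
            in_memory.popleft()      # this batch has met every page of R

if __name__ == "__main__":
    stream = [[("k2", "s-a")], [("k4", "s-b"), ("k1", "s-c")]]
    for match in meshjoin(stream):
        print(match)
```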

Proceedings ArticleDOI
10 Sep 2008
TL;DR: This paper presents a methodology for adapting data warehouse schemas and user-end OLAP queries to efficiently support real-time data integration, using techniques such as table structure replication and query predicate restrictions for selecting data, to enable continuously loading data into the data warehouse with minimal impact on query execution time.
Abstract: A data warehouse provides information for analytical processing, decision making and data mining tools. As the concept of the real-time enterprise evolves, the synchronism between transactional data and data warehouses, statically implemented, has been redefined. Traditional data warehouse systems have static structures in their schemas and relationships between data, and therefore are not able to support any dynamics in their structure and content. Their data is only periodically updated because they are not prepared for continuous data integration. For real-time enterprises with decision support needs, real-time data warehouses seem very promising. In this paper we present a methodology on how to adapt data warehouse schemas and user-end OLAP queries for efficiently supporting real-time data integration. To accomplish this, we use techniques such as table structure replication and query predicate restrictions for selecting data, to enable continuously loading data into the data warehouse with minimal impact on query execution time. We demonstrate the efficiency of the method by analyzing its impact on query performance using the TPC-H benchmark, executing query workloads while simultaneously performing continuous data integration at various insertion time rates.
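A small sketch of the general technique described above, using an in-memory SQLite database purely for illustration (the paper works with TPC-H-style schemas, not this toy one): new rows are continuously loaded into a structurally replicated table, and the user-end query is rewritten to read the union of the static table and the replica.

```python
# Illustration (not the paper's code): replicate a fact table's structure
# for continuous loading and answer queries over the union of both tables.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Static fact table, loaded by the periodic batch ETL.
cur.execute("CREATE TABLE sales (day TEXT, amount REAL)")
cur.executemany("INSERT INTO sales VALUES (?, ?)",
                [("2008-01-01", 100.0), ("2008-01-02", 150.0)])

# Structure-replicated table receiving continuous (real-time) inserts.
cur.execute("CREATE TABLE sales_rt (day TEXT, amount REAL)")
cur.execute("INSERT INTO sales_rt VALUES ('2008-01-03', 75.0)")

# User-end query rewritten to read both tables transparently.
cur.execute("""
    SELECT SUM(amount) FROM (
        SELECT amount FROM sales
        UNION ALL
        SELECT amount FROM sales_rt
    )
""")
print("total sales:", cur.fetchone()[0])  # 325.0, including real-time rows
conn.close()
```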

Journal ArticleDOI
TL;DR: The vision for data management and mining addressing challenges in data preparation, representation, and analysis is presented, together with related research results from previous work, as well as recent developments in data mining on text-based, web-based, image-based, and network-based construction databases.

Proceedings ArticleDOI
13 Jun 2008
TL;DR: This work investigates how the traditional data cube model is adapted to trajectory warehouses in order to transform raw location data into valuable information.
Abstract: The flow of data generated from low-cost modern sensing technologies and wireless telecommunication devices enables novel research fields related to the management of this new kind of data and the implementation of appropriate analytics for knowledge extraction. In this work, we investigate how the traditional data cube model is adapted to trajectory warehouses in order to transform raw location data into valuable information. In particular, we focus our research on three issues that are critical to trajectory data warehousing: (a) the trajectory reconstruction procedure that takes place when loading a moving object database with sampled location data originating, e.g., from GPS recordings, (b) the ETL procedure that feeds a trajectory data warehouse, and (c) the aggregation of cube measures for OLAP purposes. We provide design solutions for all these issues and we test their applicability and efficiency in real-world settings.
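A toy sketch of the trajectory reconstruction step mentioned in point (a), using one possible heuristic rather than the paper's exact procedure (the gap threshold and GPS samples are assumptions): consecutive samples of an object are stitched into one trajectory unless the temporal gap exceeds a threshold, in which case a new trajectory starts.

```python
# Toy trajectory reconstruction: split an object's GPS samples into
# trajectories whenever the time gap exceeds MAX_GAP seconds.
# Threshold and sample data are illustrative assumptions.

MAX_GAP = 300  # seconds

def reconstruct(samples):
    """samples: list of (timestamp_s, x, y) sorted by time -> trajectories."""
    trajectories, current = [], []
    last_t = None
    for t, x, y in samples:
        if last_t is not None and t - last_t > MAX_GAP:
            trajectories.append(current)   # gap too large: close trajectory
            current = []
        current.append((t, x, y))
        last_t = t
    if current:
        trajectories.append(current)
    return trajectories

if __name__ == "__main__":
    gps = [(0, 0.0, 0.0), (60, 0.1, 0.0), (120, 0.2, 0.1),
           (1200, 5.0, 5.0), (1260, 5.1, 5.0)]   # long gap -> new trajectory
    for i, traj in enumerate(reconstruct(gps)):
        print(f"trajectory {i}: {len(traj)} points")
```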

Book
25 Feb 2008
TL;DR: In this article, the authors present an overview of the state of the art on data warehouse design, including three different approaches for requirements specification depending on whether users, operational data sources or both are the driving force in the requirements gathering process, and how each approach leads to the creation of a conceptual multidimensional model.
Abstract: A data warehouse stores large volumes of historical data required for analytical purposes. This data is extracted from operational databases; transformed into a coherent whole using a multidimensional model that includes measures, dimensions, and hierarchies; and loaded into a data warehouse during the extraction-transformation-loading (ETL) process. Malinowski and Zimányi explain in detail conventional data warehouse design, covering in particular complex hierarchy modeling. Additionally, they address two innovative domains recently introduced to extend the capabilities of data warehouse systems, namely the management of spatial and temporal information. Their presentation covers different phases of the design process, such as requirements specification, conceptual, logical, and physical design. They include three different approaches for requirements specification depending on whether users, operational data sources, or both are the driving force in the requirements gathering process, and they show how each approach leads to the creation of a conceptual multidimensional model. Throughout the book the concepts are illustrated using many real-world examples and completed by sample implementations for Microsoft's Analysis Services 2005 and Oracle 10g with the OLAP and the Spatial extensions. For researchers this book serves as an introduction to the state of the art on data warehouse design, with many references to more detailed sources. Providing a clear and a concise presentation of the major concepts and results of data warehouse design, it can also be used as the basis of a graduate or advanced undergraduate course. The book may help experienced data warehouse designers to enlarge their analysis possibilities by incorporating spatial and temporal information. Finally, experts in spatial databases or in geographical information systems could benefit from the data warehouse vision for building innovative spatial analytical applications.

Patent
28 Mar 2008
TL;DR: In this article, a method and apparatus for scanning structured data from a data repository having an arbitrary data schema and for applying a policy to the data of the data repository are described.
Abstract: A method and apparatus for scanning structured data from a data repository having an arbitrary data schema and for applying a policy to the data of the data repository are described. In one embodiment, the structured data is converted to unstructured text data to allow a schema-independent policy to be applied to the text data in order to detect a policy violation in the data repository regardless of the data schema used by the data repository.
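Not the patented implementation, but a small sketch of the approach the abstract describes (the regex policy and the sample rows are invented): rows from an arbitrary schema are flattened into plain text, and a schema-independent policy is applied to that text to flag violations.

```python
# Sketch: flatten structured rows into text and apply a schema-independent
# policy. The regex and sample schema are illustrative, not from the patent.
import re

POLICY = re.compile(r"\b\d{4}-\d{4}-\d{4}-\d{4}\b")  # toy card-number pattern

rows = [
    {"id": 1, "note": "call customer", "payment": "4111-1111-1111-1111"},
    {"id": 2, "note": "shipped", "payment": "on file"},
]

def to_text(row: dict) -> str:
    """Convert a row with an arbitrary schema into unstructured text."""
    return " ".join(str(v) for v in row.values())

violations = [row["id"] for row in rows if POLICY.search(to_text(row))]
print("rows violating policy:", violations)  # -> [1]
```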

Book
23 May 2008
TL;DR: This six-volume set offers tools, designs, and outcomes of the utilization of data warehousing and mining technologies, such as algorithms, concept lattices, multidimensional data, and online analytical processing.
Abstract: Data Warehousing and Mining: Concepts, Methodologies, Tools and Applications provides the most comprehensive compilation of research available in this emerging and increasingly important field. This six-volume set offers tools, designs, and outcomes of the utilization of data warehousing and mining technologies, such as algorithms, concept lattices, multidimensional data, and online analytical processing. With more than 300 chapters contributed by over 575 experts from around the globe, this authoritative collection will provide libraries with the essential reference on data warehousing and mining.

Patent
Theodore Johnson1
15 Jan 2008
TL;DR: In this article, the authors proposed a method of updating a data storage system using an input data stream based on an input temporal value associated with the data stream and a raw temporal value corresponding to the raw database.
Abstract: The invention relates to a method of updating a data storage system. The method updates a raw database using an input data stream, based on an input temporal value associated with the input data stream and a raw temporal value associated with the raw database. The method includes updating a derived database associated with the data storage system using the updated raw database, based on the input temporal value, a derived temporal value and a user-defined relationship, the derived temporal value being associated with the derived database. The invention also relates to a computer-readable medium. The computer-readable medium includes instructions, wherein execution of the instructions by at least one computing device updates a data storage system. The invention further relates to a data storage system. The system includes a raw database, a derived database and a computing device operatively coupled to the raw database and the derived database.

Journal ArticleDOI
01 Jan 2008
TL;DR: A temporal extension of the MultiDim model is introduced, which allows different temporality types: valid time, transaction time, and lifespan, which are obtained from source systems, and loading time, which is generated in the data warehouse.
Abstract: The MultiDim model is a conceptual multidimensional model for data warehouse and OLAP applications. These applications require the presence of a time dimension to track changes in measure values. However, the time dimension cannot be used to represent changes in other dimensions. In this paper we introduce a temporal extension of the MultiDim model. This extension is based on research carried out in temporal databases. We allow different temporality types: valid time, transaction time, and lifespan, which are obtained from source systems, and loading time, which is generated in the data warehouse. Our model provides temporal support for levels, attributes, hierarchies, and measures. For hierarchies we discuss different cases depending on whether the changes in levels or in the relationships between them must be kept. For measures, we give different scenarios that show the usefulness of the different temporality types. Further, since measures can be aggregated before being inserted into data warehouses, we discuss the issues related to different time granularities between source systems and data warehouses. We finish the paper by presenting a transformation of the MultiDim model into the entity-relationship and the object-relational models.

Journal ArticleDOI
TL;DR: A case study of Continental Airlines describes how business intelligence at Continental has evolved over time and identifies Continental's challenges with its mature data warehouse and provides suggestions for how companies can work to overcome these kinds of obstacles.
Abstract: As the business intelligence industry matures, it is increasingly important to investigate and understand the nature of mature data warehouses. Although data warehouse research is prevalent, existing research primarily addresses new implementations and initial challenges. This case study of Continental Airlines describes how business intelligence at Continental has evolved over time. It identifies Continental's challenges with its mature data warehouse and provides suggestions for how companies can work to overcome these kinds of obstacles.

Proceedings ArticleDOI
07 Apr 2008
TL;DR: RiTE ("Right-Time ETL"), a middleware system that provides "the best of both worlds", i.e., INSERT-like data availability, but with bulk-load speeds (up to 10 times faster).
Abstract: Data warehouses (DWs) have traditionally been loaded with data at regular time intervals, e.g., monthly, weekly, or daily, using fast bulk loading techniques. Recently, the trend is to insert all (or only some) new source data very quickly into DWs, called near-realtime DWs (right-time DWs). This is done using regular INSERT statements, resulting in too low insert speeds. There is thus a great need for a solution that makes inserted data available quickly, while still providing bulk-load insert speeds. This paper presents RiTE ("Right-Time ETL"), a middleware system that provides exactly that. A data producer (ETL) can insert data that becomes available to data consumers on demand. RiTE includes an innovative main-memory based catalyst that provides fast storage and offers concurrency control. A number of policies controlling the bulk movement of data based on user requirements for persistency, availability, freshness, etc. are supported. The system works transparently to both producer and consumers. The system is integrated with an open source DBMS, and experiments show that it provides "the best of both worlds", i.e., INSERT-like data availability, but with bulk-load speeds (up to 10 times faster).