
Showing papers on "Data warehouse published in 2009"


Journal ArticleDOI
TL;DR: This article places data fusion into the greater context of data integration, precisely defines the goals of data fusion, namely, complete, concise, and consistent data, and highlights the challenges of data fusion, namely, uncertain and conflicting data values.
Abstract: The development of the Internet in recent years has made it possible and useful to access many different information systems anywhere in the world to obtain information. While there is much research on the integration of heterogeneous information systems, most commercial systems stop short of the actual integration of available data. Data fusion is the process of fusing multiple records representing the same real-world object into a single, consistent, and clean representation. This article places data fusion into the greater context of data integration, precisely defines the goals of data fusion, namely, complete, concise, and consistent data, and highlights the challenges of data fusion, namely, uncertain and conflicting data values. We give an overview and classification of different ways of fusing data and present several techniques based on standard and advanced operators of the relational algebra and SQL. Finally, the article features a comprehensive survey of data integration systems from academia and industry, showing if and how data fusion is performed in each.

1,797 citations
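
The survey above classifies fusion strategies built from relational operators; below is a minimal sketch of the attribute-wise, conflict-resolving flavor of fusion it describes, assuming hypothetical record shapes and resolution functions (an illustration, not the authors' implementation).

```python
# Group duplicate records for the same real-world object and resolve each
# attribute with a chosen conflict-resolution function (hypothetical names).
from collections import defaultdict

def longest(values):                 # prefer the most complete value
    return max(values, key=len)

RESOLUTION = {"name": longest, "address": longest}

def fuse(records, key="id"):
    groups = defaultdict(list)
    for r in records:
        groups[r[key]].append(r)
    fused = []
    for k, dups in groups.items():
        out = {key: k}
        for attr, resolve in RESOLUTION.items():
            values = [r[attr] for r in dups if r.get(attr)]   # drop nulls/blanks
            out[attr] = resolve(values) if values else None
        fused.append(out)
    return fused

print(fuse([{"id": 1, "name": "J. Doe", "address": "12 Main St"},
            {"id": 1, "name": "Jane Doe", "address": ""}]))
# [{'id': 1, 'name': 'Jane Doe', 'address': '12 Main St'}]
```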


Book
01 Jan 2009
TL;DR: This second edition of The Data Warehouse Lifecycle Toolkit again sets the standard in data warehousing for the next decade.
Abstract: The world of data warehousing has changed remarkably since the first edition of The Data Warehouse Lifecycle Toolkit was published in 1998. With this new edition, Ralph Kimball and his colleagues have refined the original set of Lifecycle methods and techniques based on their consulting and training experience. They walk you through the detailed steps of designing, developing, and deploying a data warehousing/business intelligence system. With substantial new and updated content, this second edition again sets the standard in data warehousing for the next decade.

627 citations


Journal ArticleDOI
TL;DR: There has been rapid growth in the application of data mining to manufacturing processes and enterprises over the last three years, and a review of the literature reveals the progressive applications and the gaps that remain.
Abstract: In modern manufacturing environments, vast amounts of data are collected in database management systems and data warehouses from all involved areas, including product and process design, assembly, materials planning, quality control, scheduling, maintenance, fault detection etc. Data mining has emerged as an important tool for knowledge acquisition from the manufacturing databases. This paper reviews the literature dealing with knowledge discovery and data mining applications in the broad domain of manufacturing with a special emphasis on the type of functions to be performed on the data. The major data mining functions to be performed include characterization and description, association, classification, prediction, clustering and evolution analysis. The papers reviewed have therefore been categorized in these five categories. It has been shown that there is a rapid growth in the application of data mining in the context of manufacturing processes and enterprises in the last 3 years. This review reveals the progressive applications and existing gaps identified in the context of data mining in manufacturing. A novel text mining approach has also been used on the abstracts and keywords of 150 papers to identify the research gaps and find the linkages between knowledge area, knowledge type and the applied data mining tools and techniques.

450 citations


BookDOI
01 Aug 2009
TL;DR: Knowledge Discovery from Data Streams presents a coherent overview of state-of-the-art research in learning from data streams, covering the fundamentals that are imperative to understanding data streams and describing important applications such as TCP/IP traffic, GPS data, sensor networks, and customer click streams.
Abstract: Since the beginning of the Internet age and the increased use of ubiquitous computing devices, the large volume and continuous flow of distributed data have imposed new constraints on the design of learning algorithms. Exploring how to extract knowledge structures from evolving and time-changing data, Knowledge Discovery from Data Streams presents a coherent overview of state-of-the-art research in learning from data streams. The book covers the fundamentals that are imperative to understanding data streams and describes important applications, such as TCP/IP traffic, GPS data, sensor networks, and customer click streams. It also addresses several challenges of data mining in the future, when stream mining will be at the core of many applications. These challenges involve designing useful and efficient data mining solutions applicable to real-world problems. In the appendix, the author includes examples of publicly available software and online data sets. This practical, up-to-date book focuses on the new requirements of the next generation of data mining. Although the concepts presented in the text are mainly about data streams, they also are valid for different areas of machine learning and data mining.

423 citations
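
One of the stream-processing fundamentals a book like this covers is maintaining a fixed-size summary of an unbounded stream in a single pass; below is a sketch of classic reservoir sampling under that assumption (a standard technique used for illustration, not an algorithm attributed to the book).

```python
import random

def reservoir_sample(stream, k):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)              # fill the reservoir first
        else:
            j = random.randint(0, i)         # keep item with probability k/(i+1)
            if j < k:
                sample[j] = item
    return sample

print(reservoir_sample(range(1_000_000), 5))
```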


Journal ArticleDOI
TL;DR: The pragmatic approach adopted has addressed the key challenges in establishing a national databank of anonymised person-based records, so that the data are available for research and evaluation whilst meeting the requirements of Information Governance.
Abstract: Background: Vast quantities of electronic data are collected about patients and service users as they pass through health service and other public sector organisations, and these data present enormous potential for research and policy evaluation. The Health Information Research Unit (HIRU) aims to realise the potential of electronically-held, person-based, routinely-collected data to conduct and support health-related studies. However, there are considerable challenges that must be addressed before such data can be used for these purposes, to ensure compliance with the legislation and guidelines generally known as Information Governance. Methods: A set of objectives was identified to address the challenges and establish the Secure Anonymised Information Linkage (SAIL) system in accordance with Information Governance. These were to: 1) ensure data transportation is secure; 2) operate a reliable record matching technique to enable accurate record linkage across datasets; 3) anonymise and encrypt the data to prevent re-identification of individuals; 4) apply measures to address disclosure risk in data views created for researchers; 5) ensure data access is controlled and authorised; 6) establish methods for scrutinising proposals for data utilisation and approving output; and 7) gain external verification of compliance with Information Governance. Results: The SAIL databank has been established and it operates on a DB2 platform (Data Warehouse Edition on AIX) running on an IBM 'P' series Supercomputer: Blue-C. The findings of an independent internal audit were favourable and concluded that the systems in place provide adequate assurance of compliance with Information Governance. This expanding databank already holds over 500 million anonymised and encrypted individual-level records from a range of sources relevant to health and well-being. This includes national datasets covering the whole of Wales (approximately 3 million population) and local provider-level datasets, with further growth in progress. The utility of the databank is demonstrated by increasing engagement in high quality research studies. Conclusion: Through the pragmatic approach that has been adopted, we have been able to address the key challenges in establishing a national databank of anonymised person-based records, so that the data are available for research and evaluation whilst meeting the requirements of Information Governance.

419 citations
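
Objectives 2 and 3 above combine reliable matching with anonymisation; one common general approach (not necessarily SAIL's actual method) is to derive a keyed, non-reversible linkage key from person identifiers before the data leave the source. A sketch with a hypothetical secret key and field set:

```python
import hashlib, hmac

SECRET_KEY = b"replace-with-a-managed-secret"    # hypothetical key held by a trusted party

def linkage_key(person_id: str, surname: str, dob: str) -> str:
    """Derive a stable, non-reversible identifier so that records for the same
    person can be linked across datasets without exposing raw identifiers."""
    material = "|".join([person_id.strip(), surname.strip().upper(), dob])
    return hmac.new(SECRET_KEY, material.encode("utf-8"), hashlib.sha256).hexdigest()

# The same person yields the same key in every source dataset, enabling linkage.
print(linkage_key("9434765919", "Jones", "1970-01-31"))
```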


Proceedings ArticleDOI
29 Jun 2009
TL;DR: This paper questions some of the fundamentals of the OLAP and OLTP separation and presents a new proposal for an enterprise data management concept that promises to revolutionize transactional applications while providing an optimal platform for analytical data processing.
Abstract: When SQL and the relational data model were introduced 25 years ago as a general data management concept, enterprise software migrated quickly to this new technology. It is fair to say that SQL and the various implementations of RDBMSs became the backbone of enterprise systems. In those days, we believed that business planning, transaction processing and analytics should reside in one single system. Despite the incredible improvements in computer hardware, high-speed networks, display devices and the associated software, speed and flexibility remained an issue. The nature of RDBMSs, being organized along rows, prohibited us from providing instant analytical insight and finally led to the introduction of so-called data warehouses. This paper will question some of the fundamentals of the OLAP and OLTP separation. Based on the analysis of real customer environments and experience in some prototype implementations, a new proposal for an enterprise data management concept will be presented. In our proposal, the participants in enterprise applications (customers, orders, accounting documents, products, employees, etc.) will be modeled as objects and also stored and maintained as such. Despite that, the vast majority of business functions will operate on an in-memory representation of their objects. Using the relational algebra and a column-based organization of data storage will allow us to revolutionize transactional applications while providing an optimal platform for analytical data processing. The unification of OLTP and OLAP workloads on a shared architecture and the reintegration of planning activities promise significant gains in application development while simplifying enterprise systems drastically. The latest trends in computer technology (e.g., blade architectures with multiple CPUs per blade and multiple cores per CPU) allow for a significant parallelization of application processes. The organization of data in columns supports the parallel use of cores for filtering and aggregation. Elements of application logic can be implemented as highly efficient stored procedures operating on columns. The vast increase in main memory, combined with improvements in L1, L2, and L3 caching and the high data compression rates of column storage, will allow us to support substantial data volumes on one single blade. Distributing data across multiple blades using a shared-nothing approach provides further scalability.

404 citations
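
A toy illustration of why the column-based organization argued for above favors analytics: an aggregate query scans only the columns it touches. This is just a sketch of the general idea, not the system described in the paper.

```python
# Toy column store: each attribute is a separate list ("column"), so an
# aggregate query reads only the columns it needs.
orders = {
    "customer": ["acme", "globex", "acme", "initech"],
    "region":   ["EU",   "US",     "EU",   "US"],
    "amount":   [120.0,  80.0,     200.0,  50.0],
}

def sum_where(columns, filter_col, filter_val, agg_col):
    # positions matching the predicate, found by scanning a single column
    hits = [i for i, v in enumerate(columns[filter_col]) if v == filter_val]
    return sum(columns[agg_col][i] for i in hits)

# SELECT SUM(amount) FROM orders WHERE region = 'EU'
print(sum_where(orders, "region", "EU", "amount"))   # 320.0
```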


Book
03 Mar 2009
TL;DR: This book walks through the full data warehouse system lifecycle, from analysis of data sources and user requirements through conceptual, logical, and physical design to project documentation and business intelligence beyond the data warehouse.
Abstract: Chapter 1. Introduction to Data Warehousing Chapter 2. Data Warehouse System Lifecycle Chapter 3. Analysis and Reconciliation of Data Sources Chapter 4. User Requirement Analysis Chapter 5. Conceptual Modeling Chapter 6. Conceptual Design Chapter 7. Workload and Data Volume Chapter 8. Logical Modeling Chapter 9. Logical Design Chapter 10. Data-staging Design Chapter 11. Indexes for the Data Warehouse Chapter 12. Physical Design Chapter 13. Data Warehouse Project Documentation Chapter 14. A Case Study Chapter 15. Business Intelligence: Beyond the Data Warehouse Glossary Bibliography Index

284 citations


Patent
31 Mar 2009
TL;DR: This patent presents an improved method of and apparatus for joining and aggregating data elements within a relational database management system (RDBMS) using a non-relational multi-dimensional data structure (MDD).
Abstract: Improved method of and apparatus for joining and aggregating data elements integrated within a relational database management system (RDBMS) using a non-relational multi-dimensional data structure (MDD). The improved RDBMS system of the present invention can be used to achieve a significant increase in system performance (e.g. decreased access/search time), user flexibility and ease of use. The improved RDBMS system of the present invention can be used to realize an improved Data Warehouse for supporting on-line analytical processing (OLAP) operations or to realize an improved informational database system or the like.

265 citations


Journal ArticleDOI
TL;DR: This survey covers the conceptual and logical modeling of ETL processes, along with some design methods, and visits each stage of the E-T-L triplet, and examines problems that fall within each of these stages.
Abstract: The software processes that facilitate the original loading and the periodic refreshment of the data warehouse contents are commonly known as Extraction-Transformation-Loading (ETL) processes. The intention of this survey is to present the research work in the field of ETL technology in a structured way. To this end, we organize the coverage of the field as follows: (a) first, we cover the conceptual and logical modeling of ETL processes, along with some design methods, (b) we visit each stage of the E-T-L triplet, and examine problems that fall within each of these stages, (c) we discuss problems that pertain to the entirety of an ETL process, and, (d) we review some research prototypes of academic origin.

255 citations
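
To make the E-T-L triplet the survey is organized around concrete, here is a minimal, self-contained pipeline sketch (hypothetical source format and target table; real ETL tools add scheduling, error handling, and incremental loading on top of this skeleton).

```python
import csv, io, sqlite3

SOURCE = "order_id,amount,currency\n1,10.5,usd\n2,7.0,eur\n"   # hypothetical feed

def extract(raw):                       # E: read rows from a source
    return list(csv.DictReader(io.StringIO(raw)))

def transform(rows):                    # T: cleanse and conform values
    for r in rows:
        yield (int(r["order_id"]), float(r["amount"]), r["currency"].upper())

def load(rows, conn):                   # L: append into the warehouse table
    conn.execute("CREATE TABLE IF NOT EXISTS fact_orders (id INT, amount REAL, currency TEXT)")
    conn.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(SOURCE)), conn)
print(conn.execute("SELECT * FROM fact_orders").fetchall())   # [(1, 10.5, 'USD'), (2, 7.0, 'EUR')]
```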


Book ChapterDOI
TL;DR: A simple way to adjoin physical copies of dimension columns to the fact table, dicing data to effectively cluster query retrieval, and how such dicing can be achieved on database products other than DB2 is shown.
Abstract: We provide a benchmark measuring star schema queries retrieving data from a fact table with Where clause column restrictions on dimension tables. Clustering is crucial to performance with modern disk technology, since retrievals with filter factors down to 0.0005 are now performed most efficiently by sequential table search rather than by indexed access. DB2's Multi-Dimensional Clustering (MDC) provides methods to "dice" the fact table along a number of orthogonal "dimensions", but only when these dimensions are columns in the fact table. The diced cells cluster fact rows on several of these "dimensions" at once so queries restricting several such columns can access crucially localized data, with much faster query response. Unfortunately, columns of dimension tables of a star schema are not usually represented in the fact table. In this paper, we show a simple way to adjoin physical copies of dimension columns to the fact table, dicing data to effectively cluster query retrieval, and explain how such dicing can be achieved on database products other than DB2. We provide benchmark measurements to show successful use of this methodology on three commercial database products.

230 citations
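
A small sketch of the core trick described above, assuming a toy fact table and date dimension: the dimension attribute is physically copied into the fact table so that rows can be clustered on it, and a dimension-restricted query then reads a localized run of rows. Actual MDC dicing operates on disk blocks; this only illustrates the data-layout idea.

```python
# Adjoin a dimension column to the fact table and cluster ("dice") on it.
date_dim = {1: "2009-Q1", 2: "2009-Q2", 3: "2009-Q3"}   # date_key -> quarter
fact = [
    {"date_key": 2, "amount": 10.0},
    {"date_key": 1, "amount": 5.0},
    {"date_key": 2, "amount": 7.5},
]

for row in fact:                                # physical copy of the attribute
    row["quarter"] = date_dim[row["date_key"]]
fact.sort(key=lambda r: r["quarter"])           # cluster the fact rows on it

# SELECT SUM(amount) ... WHERE quarter = '2009-Q2' now scans a contiguous slice.
print(sum(r["amount"] for r in fact if r["quarter"] == "2009-Q2"))   # 17.5
```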


Journal ArticleDOI
01 Aug 2009
TL;DR: Modern data management applications often require integrating available data sources and providing a uniform interface for users to access data from different sources, and such requirements have been driving fruitful research on data integration over the last two decades.
Abstract: The amount of information produced in the world increases by 30% every year and this rate will only go up. With advanced network technology, more and more sources are available either over the Internet or in enterprise intranets. Modern data management applications, such as setting up Web portals, managing enterprise data, managing community data, and sharing scientific data, often require integrating available data sources and providing a uniform interface for users to access data from different sources; such requirements have been driving fruitful research on data integration over the last two decades [11, 13].

Proceedings ArticleDOI
24 Mar 2009
TL;DR: The requirements for data integration flows in this next generation of operational BI system are described, the limitations of current technologies, the research challenges in meeting these requirements, and a framework for addressing these challenges are described.
Abstract: Business Intelligence (BI) refers to technologies, tools, and practices for collecting, integrating, analyzing, and presenting large volumes of information to enable better decision making. Today's BI architecture typically consists of a data warehouse (or one or more data marts), which consolidates data from several operational databases, and serves a variety of front-end querying, reporting, and analytic tools. The back-end of the architecture is a data integration pipeline for populating the data warehouse by extracting data from distributed and usually heterogeneous operational sources; cleansing, integrating and transforming the data; and loading it into the data warehouse. Since BI systems have been used primarily for off-line, strategic decision making, the traditional data integration pipeline is a one-way, batch process, usually implemented by extract-transform-load (ETL) tools. The design and implementation of the ETL pipeline is largely a labor-intensive activity, and typically consumes a large fraction of the effort in data warehousing projects. Increasingly, as enterprises become more automated, data-driven, and real-time, the BI architecture is evolving to support operational decision making. This imposes additional requirements and tradeoffs, resulting in even more complexity in the design of data integration flows. These include reducing the latency so that near real-time data can be delivered to the data warehouse, extracting information from a wider variety of data sources, extending the rigidly serial ETL pipeline to more general data flows, and considering alternative physical implementations. We describe the requirements for data integration flows in this next generation of operational BI system, the limitations of current technologies, the research challenges in meeting these requirements, and a framework for addressing these challenges. The goal is to facilitate the design and implementation of optimal flows to meet business requirements.

01 Jan 2009
TL;DR: In this paper, the whole DW development process is aligned with a Model Driven Architecture (MDA) framework, with a focus on an MDA approach for the development of the DW repository, which is the cornerstone of any DW system.
Abstract: Different modeling approaches have been proposed to overcome every design pitfall of different data warehouse (DW) components. However, most of them offer partial solutions that deal only with isolated aspects of the DW and do not provide developers with an integrated and standard framework for designing all DW relevant components, such as ETL processes, data sources, DW repository and so on. To overcome this problem, this paper describes how to align the whole DW development process with a Model Driven Architecture (MDA) framework. We then focus on describing one part of it: an MDA approach for the development of the DW repository, because it is the cornerstone of any DW system. Therefore, we describe how to build the different MDA models for the DW repository by using an extension of the Unified Modeling Language (UML) and the Common Warehouse Metamodel (CWM). Transformations between models are also clearly and formally established by using the Query/View/Transformation (QVT) language. Finally, a case study is provided to exemplify the benefits of our MDA framework.
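
As a rough analogue of the model-to-model transformations the paper formalizes with QVT, the sketch below maps a toy conceptual multidimensional model (a plain dict standing in for the UML/CWM models) to relational DDL for the DW repository; the structures and naming are illustrative assumptions, not the paper's metamodels.

```python
# Toy "conceptual model -> relational repository" transformation producing star schema DDL.
conceptual = {
    "fact": "sales",
    "measures": ["amount", "quantity"],
    "dimensions": {"date": ["day", "month", "year"], "product": ["name", "category"]},
}

def to_relational(model):
    ddl = []
    for dim, attrs in model["dimensions"].items():
        cols = ", ".join([f"{dim}_key INT PRIMARY KEY"] + [f"{a} TEXT" for a in attrs])
        ddl.append(f"CREATE TABLE dim_{dim} ({cols});")
    fact_cols = [f"{d}_key INT" for d in model["dimensions"]] + \
                [f"{m} REAL" for m in model["measures"]]
    ddl.append(f"CREATE TABLE fact_{model['fact']} ({', '.join(fact_cols)});")
    return "\n".join(ddl)

print(to_relational(conceptual))
```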

Journal ArticleDOI
TL;DR: This article presents the most relevant methodologies introduced in the literature, together with a detailed comparison showing the main features of each approach.
Abstract: Many methodologies have been presented to support the multidimensional design of the data warehouse. The first methodologies introduced were requirement-driven, but the semantics of a data warehouse require data sources to also be considered along the design process. In the following years, data sources gained relevance in multidimensional modeling and gave rise to several data-driven methodologies that automate the data warehouse design process from relational sources. Currently, research on multidimensional modeling is still a hot topic, and there are two main research lines. On the one hand, new hybrid automatic methodologies have been introduced that propose to combine data-driven and requirement-driven approaches. On the other hand, new approaches focus on considering other kinds of structured data sources that have gained relevance in recent years, such as ontologies or XML. In this article we present the most relevant methodologies introduced in the literature and a detailed comparison showing the main features of each approach.

Patent
15 May 2009
TL;DR: In this article, the authors present methods and systems to model and acquire data from a variety of data and information sources, to integrate the data into a structured database, and to manage the continuing reintegration of updated data from those sources over time.
Abstract: Methods and systems to model and acquire data from a variety of data and information sources, to integrate the data into a structured database, and to manage the continuing reintegration of updated data from those sources over time. For any given domain, a variety of individual information and data sources that contain information relevant to the schema can be identified. Data elements associated with a schema may be identified in a training source, such as by user tagging. A formal grammar may be induced appropriate to the schema and layout of the training source. A Hidden Markov Model (HMM) corresponding to the grammar may learn where in the sources the elements can be found. The system can automatically mutate its schema into a grammar matching the structure of the source documents. By following an inverse transformation sequence, data that is parsed by the mutated grammar can be fit back into the original grammar structure, matching the original data schema defined through domain modeling. Features disclosed herein may be implemented with respect to web-scraping and data acquisition, and to represent data in support of data-editing and data-merging tasks. A schema may be defined with respect to a graph-based domain model.

Journal ArticleDOI
TL;DR: This work presents an ontology-based approach for BI applications, specifically in statistical analysis and data mining, and implements this approach in financial knowledge management system (FKMS), which is able to do data extraction, transformation and loading, and data cubes creation and retrieval.
Abstract: Business intelligence (BI) applications within an enterprise range over enterprise reporting, cube and ad hoc query analysis, statistical analysis, data mining, and proactive report delivery and alerting. The most sophisticated applications of BI are statistical analysis and data mining, which involve mathematical and statistical treatment of data for correlation analysis, trend analysis, hypothesis testing, and predictive analysis. They are used by relatively small groups of users consisting of information analysts and power users, for whom data and analysis are their primary jobs. We present an ontology-based approach for BI applications, specifically in statistical analysis and data mining. We implemented our approach in a financial knowledge management system (FKMS), which is able to do: (i) data extraction, transformation and loading, (ii) data cube creation and retrieval, (iii) statistical analysis and data mining, (iv) experiment metadata management, and (v) experiment retrieval for new problem solving. The resulting knowledge from each experiment, defined as a knowledge set consisting of strings of data, model, parameters, and reports, is stored, shared, and disseminated, and thus helps support decision making. We finally illustrate the above claims with a process of applying data mining techniques to support corporate bond classification.

01 Jan 2009
TL;DR: This paper is an overview of artificial neural networks and questions their position as a preferred tool by data mining practitioners.
Abstract: Companies have been collecting data for decades, building massive data warehouses in which to store it. Even though this data is available, very few companies have been able to realize the actual value stored in it. The question these companies are asking is how to extract this value. The answer is Data mining. There are many technologies available to data mining practitioners, including Artificial Neural Networks, Regression, and Decision Trees. Many practitioners are wary of Neural Networks due to their black box nature, even though they have proven themselves in many situations. This paper is an overview of artificial neural networks and questions their position as a preferred tool by data mining practitioners.

Journal ArticleDOI
TL;DR: A comprehensive analysis of common data stream processing operators and their impact on data quality enables sound data evaluation and reduces incorrect business decisions; a data quality model control is also proposed to adapt the data quality granularity to the interestingness of the data stream.
Abstract: Sensors in smart-item environments capture data about product conditions and usage to support business decisions as well as production automation processes. A challenging issue in this application area is the restricted quality of sensor data due to limited sensor precision and sensor failures. Moreover, data stream processing to meet resource constraints in streaming environments introduces additional noise and decreases the data quality. In order to avoid wrong business decisions due to dirty data, quality characteristics have to be captured, processed, and provided to the respective business task. However, the issue of how to efficiently provide applications with information about data quality is still an open research problem.In this article, we address this problem by presenting a flexible model for the propagation and processing of data quality. The comprehensive analysis of common data stream processing operators and their impact on data quality allows a fruitful data evaluation and diminishes incorrect business decisions. Further, we propose the data quality model control to adapt the data quality granularity to the data stream interestingness.
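
A sketch of the propagation idea: every tuple carries a quality indicator, and each stream operator combines input qualities into an output quality alongside the data value. The combination rule used here (a simple mean) is an illustrative assumption, not the article's exact model.

```python
# Propagate a per-tuple quality score through an aggregation operator.
readings = [            # (sensor value, quality in [0, 1])
    (21.4, 0.9),
    (21.9, 0.8),
    (35.0, 0.2),        # suspicious reading from a degraded sensor
]

def aggregate_with_quality(tuples):
    values = [v for v, _ in tuples]
    qualities = [q for _, q in tuples]
    aggregated = sum(values) / len(values)
    quality = sum(qualities) / len(qualities)   # quality attached to the result
    return aggregated, quality

value, quality = aggregate_with_quality(readings)
print(f"avg={value:.1f} quality={quality:.2f}")  # downstream tasks see both
```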

Journal ArticleDOI
01 Oct 2009
TL;DR: The frequent items problem (also known as the heavy hitters problem) is one of the most heavily studied questions in data streams, and is important both in itself, and as a subroutine within more advanced data stream computations.
Abstract: Many data generation processes can be modeled as data streams. They produce huge numbers of pieces of data, each of which is simple in isolation, but which taken together lead to a complex whole. For example, the sequence of queries posed to an Internet search engine can be thought of as a stream, as can the collection of transactions across all branches of a supermarket chain. In aggregate, this data can arrive at enormous rates, easily in the realm of hundreds of gigabytes per day or higher. While this data may be archived and indexed within a data warehouse, it is also important to process the data "as it happens," to provide up to the minute analysis and statistics on current trends. Methods to achieve this must be quick to respond to each new piece of information, and use resources which are very small when compared to the total quantity of data. These applications and others like them have led to the formulation of the so-called "streaming model." In this abstraction, algorithms take only a single pass over their input, and must accurately compute various functions while using resources (space and time per item) that are strictly sublinear in the size of the input---ideally, polynomial in the logarithm of the input size. The output must be produced at the end of the stream, or when queried on the prefix of the stream that has been observed so far. (Other variations ask for the output to be maintained continuously in the presence of updates, or on a "sliding window" of only the most recent updates.) Some problems are simple in this model: for example, given a stream of transactions, finding the mean and standard deviation of the bill totals can be accomplished by retaining a few "sufficient statistics" (sum of all values, sum of squared values, etc.). Others can be shown to require a large amount of information to be stored, such as determining whether a particular search query has already appeared anywhere within a large stream of queries. Determining which problems can be solved effectively within this model remains an active research area. The frequent items problem (also known as the heavy hitters problem) is one of the most heavily studied questions in data streams. The problem is popular due to its simplicity to state, and its intuitive interest and value. It is important both in itself, and as a subroutine within more advanced data stream computations. Informally, given a sequence of items, the problem is simply to find those items which occur most frequently. Typically, this is formalized as finding all items whose frequency exceeds a specified fraction of the total number of items. This is shown in Figure 1. Variations arise when the items are given weights, and further when these weights can also be negative. This abstract problem captures a wide variety of settings. The items can represent packets on the Internet, and the weights are the size of the packets. Then the frequent items represent the most popular destinations, or the heaviest bandwidth users (depending on how the items are extracted from the flow identifiers). This knowledge can help in optimizing routing decisions, for in-network caching, and for planning where to add new capacity. Or, the items can represent queries made to an Internet search engine, and the frequent items are now the (currently) popular terms. 
These are not simply hypothetical examples, but genuine cases where algorithms for this problem have been applied by large corporations such as AT&T. Existing work is sometimes claimed to be incapable of a certain guarantee, which in truth it can provide with only minor modifications, and experimental evaluations do not always compare against the most suitable methods. In this paper, we present the main ideas in this area, by describing some of the most significant algorithms for the core problem of finding frequent items using common notation and terminology. In doing so, we also present the historical development of these algorithms. Studying these algorithms is instructive, as they are relatively simple, but can be shown to provide formal guarantees on the quality of their output as a function of an accuracy parameter ε. We also provide baseline implementations of many of these algorithms against which future algorithms can be compared, and on top of which algorithms for different problems can be built. We perform experimental evaluation of the algorithms over a variety of data sets to indicate their performance in practice. From this, we are able to identify clear distinctions among the algorithms that are not apparent from their theoretical analysis alone.
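
One of the classic counter-based algorithms surveyed for this problem is Misra-Gries (often called Frequent): with k-1 counters it guarantees that every item occurring more than n/k times in a stream of n items survives in the counter set. A compact sketch for illustration, not the paper's baseline implementation:

```python
def misra_gries(stream, k):
    """Return candidate heavy hitters: items occurring > n/k times are guaranteed kept."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:                                   # decrement every counter
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

print(misra_gries(list("abracadabraaaab"), k=3))   # 'a' (8 of 15 items) survives
```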

Proceedings ArticleDOI
29 Jun 2009
TL;DR: The DataDepot architecture is discussed, with an emphasis on several of its novel and critical features; DataDepot is currently being used for five very large warehousing projects within AT&T.
Abstract: We describe DataDepot, a tool for generating warehouses from streaming data feeds, such as network-traffic traces, router alerts, financial tickers, transaction logs, and so on. DataDepot is a streaming data warehouse designed to automate the ingestion of streaming data from a wide variety of sources and to maintain complex materialized views over these sources. As a streaming warehouse, DataDepot is similar to Data Stream Management Systems (DSMSs) with its emphasis on temporal data, best-effort consistency, and real-time response. However, as a data warehouse, DataDepot is designed to store tens to hundreds of terabytes of historical data, allow time windows measured in years or decades, and allow both real-time queries on recent data and deep analyses on historical data. In this paper we discuss the DataDepot architecture, with an emphasis on several of its novel and critical features. DataDepot is currently being used for five very large warehousing projects within AT&T; one of these warehouses ingests 500 Mbytes per minute (and is growing). We use these installations to illustrate streaming warehouse use and behavior, and design choices made in developing DataDepot. We conclude with a discussion of DataDepot applications and the efficacy of some optimizations.

Journal ArticleDOI
01 Aug 2009
TL;DR: This work describes an augmentation of traditional query engines that improves join throughput in large-scale concurrent data warehouses by using an "always-on" pipeline of non-blocking operators, coupled with a controller that continuously examines the current query mix and performs run-time optimizations.
Abstract: Conventional data warehouses employ the query-at-a-time model, which maps each query to a distinct physical plan. When several queries execute concurrently, this model introduces contention, because the physical plans, unaware of each other, compete for access to the underlying I/O and computation resources. As a result, while modern systems can efficiently optimize and evaluate a single complex data analysis query, their performance suffers significantly when multiple complex queries run at the same time. We describe an augmentation of traditional query engines that improves join throughput in large-scale concurrent data warehouses. In contrast to the conventional query-at-a-time model, our approach employs a single physical plan that can share I/O, computation, and tuple storage across all in-flight join queries. We use an "always-on" pipeline of non-blocking operators, coupled with a controller that continuously examines the current query mix and performs run-time optimizations. Our design allows the query engine to scale gracefully to large data sets, provide predictable execution times, and reduce contention. In our empirical evaluation, we found that our prototype outperforms conventional commercial systems by an order of magnitude for tens to hundreds of concurrent queries.
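
A toy illustration of the sharing principle (not the paper's actual operator pipeline): a single pass over the fact table evaluates the predicates of every in-flight query, instead of one scan per query as in the query-at-a-time model. Query identifiers and predicates below are hypothetical.

```python
# One shared scan serves all concurrent queries.
fact = [{"region": "EU", "amount": 10}, {"region": "US", "amount": 4},
        {"region": "EU", "amount": 6},  {"region": "APAC", "amount": 9}]

queries = {                                   # hypothetical concurrent aggregates
    "q1": lambda r: r["region"] == "EU",
    "q2": lambda r: r["amount"] > 5,
}

def shared_scan(rows, queries):
    totals = {qid: 0 for qid in queries}
    for row in rows:                          # single pass shared by all queries
        for qid, predicate in queries.items():
            if predicate(row):
                totals[qid] += row["amount"]
    return totals

print(shared_scan(fact, queries))             # {'q1': 16, 'q2': 25}
```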

Book
15 Jun 2009
TL;DR: The book introduces the various features and architecture of a data warehouse, followed by a detailed study of business requirements and dimensional modelling, and leads up to the core of the subject by providing a thorough understanding of building and maintaining a data warehouse.
Abstract: Data Warehousing is designed to serve as a textbook for students of Computer Science & Engineering (BE/BTech), computer applications (BCA/MCA) and computer science (B.Sc) for an introductory course on Data Warehousing. It provides a thorough understanding of the fundamentals of Data Warehousing and aims to impart a sound knowledge to users for creating and managing a Data Warehouse. The book introduces the various features and architecture of a Data Warehouse followed by a detailed study of the Business Requirements and Dimensional Modelling. It goes on to discuss the components of a Data Warehouse and thereby leads up to the core area of the subject by providing a thorough understanding of the building and maintenance of a Data Warehouse. This is then followed up by an overview of planning and project management, testing and growth, and then finishes with Data Warehouse solutions and the latest trends in this field. The book is finally rounded off with a broad overview of its related field of study, Data Mining. The text is ably supported by plenty of examples to illustrate concepts and contains several review questions and other end-chapter exercises to test the understanding of students. The book also carries a running case study that aims to bring out the practical aspects of the subject. This will be useful for students to master the basics and apply them to real-life scenarios.

Journal ArticleDOI
TL;DR: This paper introduces the main concepts and terminology of temporal databases and discusses the open research issues, also in connection with their implementation in commercial tools.
Abstract: Data warehouses are information repositories specialized in supporting decision making. Since the decisional process typically requires an analysis of historical trends, time and its management acquire a huge importance. In this paper we consider the variety of issues, often grouped under the term temporal data warehousing, implied by the need for accurately describing how information changes over time in data warehousing systems. We recognize that, with reference to a three-level architecture, these issues can be classified into a few topics, namely: handling data/schema changes in the data warehouse, handling data/schema changes in the data mart, querying temporal data, and designing temporal data warehouses. After introducing the main concepts and terminology of temporal databases, we separately survey these topics. Finally, we discuss the open research issues, also in connection with their implementation on commercial tools.
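
One concrete instance of handling data changes in the data mart is the slowly changing dimension pattern; the sketch below shows a Type 2 update, where a changed attribute closes the current row and opens a new versioned one so history stays queryable. This is a standard technique used here to ground the discussion, not a proposal specific to this paper.

```python
# Type 2 slowly changing dimension: version rows instead of overwriting them.
from datetime import date

customer_dim = [
    {"key": 1, "customer_id": "C42", "city": "Turin", "valid_from": date(2005, 1, 1), "valid_to": None},
]

def scd2_update(dim, customer_id, new_city, change_date):
    for row in dim:
        if row["customer_id"] == customer_id and row["valid_to"] is None:
            if row["city"] == new_city:
                return                                    # nothing changed
            row["valid_to"] = change_date                 # close the current version
    dim.append({"key": max(r["key"] for r in dim) + 1, "customer_id": customer_id,
                "city": new_city, "valid_from": change_date, "valid_to": None})

scd2_update(customer_dim, "C42", "Bologna", date(2009, 6, 1))
for row in customer_dim:
    print(row)          # history preserved: facts can be analysed as-was or as-is
```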

Proceedings Article
01 Jan 2009
TL;DR: This paper describes the Continuous Analytics approach and outlines some of the key technical arguments behind it, creating a powerful and flexible system that can run SQL over tables, streams, and combinations of the two.
Abstract: Modern data analysis applications driven by the Network Effect are pushing traditional database and data warehousing technologies beyond their limits due to their massively increasing data volumes and demands for low latency. To address this problem, we advocate an integrated query processing approach that runs SQL continuously and incrementally over data before that data is stored in the database. Continuous Analytics technology is seamlessly integrated into a full-function database system, creating a powerful and flexible system that can run SQL over tables, streams, and combinations of the two. A continuous analytics system can run many orders of magnitude more efficiently than traditional store-first-query-later technologies. In this paper, we describe the Continuous Analytics approach and outline some of the key technical arguments behind it.
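
A sketch of the "query the data before it is stored" flavor of processing advocated above: a windowed aggregate is maintained incrementally as events arrive, so the result is always available without a store-first-query-later round trip. The window length and event shape are assumptions for illustration, not the paper's engine.

```python
from collections import deque

class SlidingSum:
    """Sum of event values over the last `window` seconds, updated per event."""
    def __init__(self, window):
        self.window = window
        self.events = deque()              # (timestamp, value)
        self.total = 0.0

    def add(self, ts, value):
        self.events.append((ts, value))
        self.total += value
        while self.events and self.events[0][0] <= ts - self.window:
            _, old = self.events.popleft() # expire events outside the window
            self.total -= old
        return self.total                  # continuously available result

w = SlidingSum(window=60)
for ts, v in [(0, 5.0), (30, 2.0), (70, 1.0)]:
    print(ts, w.add(ts, v))                # 5.0, 7.0, then 3.0 once ts=0 expires
```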

Journal ArticleDOI
TL;DR: The SEAMLESS integrated database on European agricultural systems contains data on cropping patterns, production, farm structural data, soil and climate conditions, current agricultural management, and policy information; a shared ontology was developed through a collaborative process.

Proceedings ArticleDOI
06 Nov 2009
TL;DR: This paper proposes a platform-independent conceptual model of ETL processes based on the Business Process Model Notation (BPMN) standard and shows how such a conceptual model can be implemented using Business Process Execution Language (BPEL), a standard executable language for specifying interactions with web services.
Abstract: Decisional systems are crucial for enterprise improvement. They allow the consolidation of heterogeneous data from distributed enterprise data stores into strategic indicators. An essential component of this data consolidation is the Extract, Transform, and Load (ETL) process. In the research literature there has been very little work on defining conceptual models for ETL processes. At the same time, there are currently many tools that manage such processes. However, each tool uses its own model, which is not necessarily able to communicate with the models of other tools. In this paper, we propose a platform-independent conceptual model of ETL processes based on the Business Process Model Notation (BPMN) standard. We also show how such a conceptual model can be implemented using the Business Process Execution Language (BPEL), a standard executable language for specifying interactions with web services.

Book ChapterDOI
01 Jan 2009
TL;DR: The state of the art for both conventional and near real time ETL is reviewed, the background, the architecture, and the technical issues that arise are discussed, and interesting research challenges for future work are pinpointed.
Abstract: Near real time ETL deviates from the traditional conception of data warehouse refreshment, which is performed off-line in a batch mode, and adopts the strategy of propagating changes that take place in the sources towards the data warehouse to the extent that both the sources and the warehouse can sustain the incurred workload. In this article, we review the state of the art for both conventional and near real time ETL, we discuss the background, the architecture, and the technical issues that arise in the area of near real time ETL, and we pinpoint interesting research challenges for future work.


Journal ArticleDOI
01 Aug 2009
TL;DR: This paper details the design of index compression in DB2 LUW, discusses the challenges encountered in meeting the design goals, and demonstrates its effectiveness with performance results on typical customer scenarios.
Abstract: In database systems, the cost of data storage and retrieval are important components of the total cost and response time of the system. A popular mechanism to reduce the storage footprint is by compressing the data residing in tables and indexes. Compressing indexes efficiently, while maintaining response time requirements, is known to be challenging. This is especially true when designing for a workload spectrum covering both data warehousing and transaction processing environments. DB2 Linux, UNIX, Windows (LUW) recently introduced index compression for use in both environments. This uses techniques that are able to compress index data efficiently while incurring virtually no performance penalty for query processing. On the contrary, for certain operations, the performance is actually better. In this paper, we detail the design of index compression in DB2 LUW and discuss the challenges that were encountered in meeting the design goals. We also demonstrate its effectiveness by showing performance results on typical customer scenarios.
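
The abstract does not disclose DB2 LUW's on-disk format, but one generic technique behind index compression is front (prefix) compression of sorted keys; the sketch below illustrates that general idea only.

```python
# Front compression: store the length of the prefix shared with the previous
# key plus the differing suffix, instead of each full key.
def compress(sorted_keys):
    out, prev = [], ""
    for key in sorted_keys:
        common = 0
        while common < min(len(prev), len(key)) and prev[common] == key[common]:
            common += 1
        out.append((common, key[common:]))
        prev = key
    return out

def decompress(entries):
    keys, prev = [], ""
    for common, suffix in entries:
        prev = prev[:common] + suffix
        keys.append(prev)
    return keys

keys = ["warehouse", "warehousing", "warranty", "workload"]
packed = compress(keys)
print(packed)            # [(0, 'warehouse'), (8, 'ing'), (3, 'ranty'), (1, 'orkload')]
assert decompress(packed) == keys
```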

Proceedings ArticleDOI
29 Mar 2009
TL;DR: The notion of data staleness is defined, the problem of scheduling updates in a way that minimizes average data staleness is formalized, and scheduling algorithms designed to handle the complex environment of a real-time stream warehouse are presented.
Abstract: This paper discusses updating a data warehouse that collects near-real-time data streams from a variety of external sources. The objective is to keep all the tables and materialized views up-to-date as new data arrive over time. We define the notion of data staleness, formalize the problem of scheduling updates in a way that minimizes average data staleness, and present scheduling algorithms designed to handle the complex environment of a real-time stream warehouse. A novel feature of our scheduling framework is that it considers the effect of an update on the staleness of the underlying tables rather than any property of the update job itself (such as deadline).
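
A toy scheduler in the spirit of the framework described above (the paper's staleness definition and algorithms are more involved): each pending update is prioritized by how much accumulated staleness applying it would remove from its table, rather than by any property of the update job itself. Table names and timestamps are hypothetical.

```python
import heapq

def schedule(pending, now):
    """pending: list of (table, arrival_time_of_newest_unloaded_data); returns update order."""
    # staleness of a table ~ how long its newest unloaded data has been waiting
    heap = [(-(now - arrival), table) for table, arrival in pending]
    heapq.heapify(heap)
    order = []
    while heap:
        neg_staleness, table = heapq.heappop(heap)   # most stale table first
        order.append((table, -neg_staleness))
    return order

pending = [("link_traffic", 10.0), ("router_alerts", 42.0), ("ticker", 55.0)]
print(schedule(pending, now=60.0))
# [('link_traffic', 50.0), ('router_alerts', 18.0), ('ticker', 5.0)]
```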