
Showing papers on "Data warehouse published in 2009"


Journal ArticleDOI
TL;DR: This article places data fusion into the greater context of data integration, precisely defines the goals of data fusion, namely, complete, concise, and consistent data, and highlights the challenges of data fusion, namely, uncertain and conflicting data values.
Abstract: The development of the Internet in recent years has made it possible and useful to access many different information systems anywhere in the world to obtain information. While there is much research on the integration of heterogeneous information systems, most commercial systems stop short of the actual integration of available data. Data fusion is the process of fusing multiple records representing the same real-world object into a single, consistent, and clean representation. This article places data fusion into the greater context of data integration, precisely defines the goals of data fusion, namely, complete, concise, and consistent data, and highlights the challenges of data fusion, namely, uncertain and conflicting data values. We give an overview and classification of different ways of fusing data and present several techniques based on standard and advanced operators of the relational algebra and SQL. Finally, the article features a comprehensive survey of data integration systems from academia and industry, showing if and how data fusion is performed in each.

1,797 citations
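
The survey above classifies fusion strategies built from relational operators; below is a minimal sketch of the attribute-wise, conflict-resolving flavor of fusion it describes, assuming hypothetical record shapes and resolution functions (an illustration, not the authors' implementation).

```python
# Group duplicate records for the same real-world object and resolve each
# attribute with a chosen conflict-resolution function (hypothetical names).
from collections import defaultdict

def longest(values):                 # prefer the most complete value
    return max(values, key=len)

RESOLUTION = {"name": longest, "address": longest}

def fuse(records, key="id"):
    groups = defaultdict(list)
    for r in records:
        groups[r[key]].append(r)
    fused = []
    for k, dups in groups.items():
        out = {key: k}
        for attr, resolve in RESOLUTION.items():
            values = [r[attr] for r in dups if r.get(attr)]   # drop nulls/blanks
            out[attr] = resolve(values) if values else None
        fused.append(out)
    return fused

print(fuse([{"id": 1, "name": "J. Doe", "address": "12 Main St"},
            {"id": 1, "name": "Jane Doe", "address": ""}]))
# [{'id': 1, 'name': 'Jane Doe', 'address': '12 Main St'}]
```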


Book
01 Jan 2009
TL;DR: This second edition of The Data Warehouse Lifecycle Toolkit again sets the standard in data warehousing for the next decade.
Abstract: The world of data warehousing has changed remarkably since the first edition of The Data Warehouse Lifecycle Toolkit was published in 1998. With this new edition, Ralph Kimball and his colleagues have refined the original set of Lifecycle methods and techniques based on their consulting and training experience. They walk you through the detailed steps of designing, developing, and deploying a data warehousing/business intelligence system. With substantial new and updated content, this second edition again sets the standard in data warehousing for the next decade.

627 citations


Journal ArticleDOI
TL;DR: There has been rapid growth in the application of data mining to manufacturing processes and enterprises over the last three years, and a review of the literature reveals the progressive applications and the gaps that remain.
Abstract: In modern manufacturing environments, vast amounts of data are collected in database management systems and data warehouses from all involved areas, including product and process design, assembly, materials planning, quality control, scheduling, maintenance, fault detection etc. Data mining has emerged as an important tool for knowledge acquisition from the manufacturing databases. This paper reviews the literature dealing with knowledge discovery and data mining applications in the broad domain of manufacturing with a special emphasis on the type of functions to be performed on the data. The major data mining functions to be performed include characterization and description, association, classification, prediction, clustering and evolution analysis. The papers reviewed have therefore been categorized in these five categories. It has been shown that there is a rapid growth in the application of data mining in the context of manufacturing processes and enterprises in the last 3 years. This review reveals the progressive applications and existing gaps identified in the context of data mining in manufacturing. A novel text mining approach has also been used on the abstracts and keywords of 150 papers to identify the research gaps and find the linkages between knowledge area, knowledge type and the applied data mining tools and techniques.

450 citations


BookDOI
01 Aug 2009
TL;DR: Knowledge Discovery from Data Streams presents a coherent overview of state-of-the-art research in learning from data streams, covering the fundamentals that are imperative to understanding data streams and describing important applications such as TCP/IP traffic, GPS data, sensor networks, and customer click streams.
Abstract: Since the beginning of the Internet age and the increased use of ubiquitous computing devices, the large volume and continuous flow of distributed data have imposed new constraints on the design of learning algorithms. Exploring how to extract knowledge structures from evolving and time-changing data, Knowledge Discovery from Data Streams presents a coherent overview of state-of-the-art research in learning from data streams. The book covers the fundamentals that are imperative to understanding data streams and describes important applications, such as TCP/IP traffic, GPS data, sensor networks, and customer click streams. It also addresses several challenges of data mining in the future, when stream mining will be at the core of many applications. These challenges involve designing useful and efficient data mining solutions applicable to real-world problems. In the appendix, the author includes examples of publicly available software and online data sets. This practical, up-to-date book focuses on the new requirements of the next generation of data mining. Although the concepts presented in the text are mainly about data streams, they also are valid for different areas of machine learning and data mining.

423 citations
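
One of the stream-processing fundamentals a book like this covers is maintaining a fixed-size summary of an unbounded stream in a single pass; below is a sketch of classic reservoir sampling under that assumption (a standard technique used for illustration, not an algorithm attributed to the book).

```python
import random

def reservoir_sample(stream, k):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)              # fill the reservoir first
        else:
            j = random.randint(0, i)         # keep item with probability k/(i+1)
            if j < k:
                sample[j] = item
    return sample

print(reservoir_sample(range(1_000_000), 5))
```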


Journal ArticleDOI
TL;DR: The pragmatic approach adopted has addressed the key challenges in establishing a national databank of anonymised person-based records, so that the data are available for research and evaluation whilst meeting the requirements of Information Governance.
Abstract: Background: Vast quantities of electronic data are collected about patients and service users as they pass through health service and other public sector organisations, and these data present enormous potential for research and policy evaluation. The Health Information Research Unit (HIRU) aims to realise the potential of electronically-held, person-based, routinely-collected data to conduct and support health-related studies. However, there are considerable challenges that must be addressed before such data can be used for these purposes, to ensure compliance with the legislation and guidelines generally known as Information Governance. Methods: A set of objectives was identified to address the challenges and establish the Secure Anonymised Information Linkage (SAIL) system in accordance with Information Governance. These were to: 1) ensure data transportation is secure; 2) operate a reliable record matching technique to enable accurate record linkage across datasets; 3) anonymise and encrypt the data to prevent re-identification of individuals; 4) apply measures to address disclosure risk in data views created for researchers; 5) ensure data access is controlled and authorised; 6) establish methods for scrutinising proposals for data utilisation and approving output; and 7) gain external verification of compliance with Information Governance. Results: The SAIL databank has been established and it operates on a DB2 platform (Data Warehouse Edition on AIX) running on an IBM 'P' series Supercomputer: Blue-C. The findings of an independent internal audit were favourable and concluded that the systems in place provide adequate assurance of compliance with Information Governance. This expanding databank already holds over 500 million anonymised and encrypted individual-level records from a range of sources relevant to health and well-being. This includes national datasets covering the whole of Wales (approximately 3 million population) and local provider-level datasets, with further growth in progress. The utility of the databank is demonstrated by increasing engagement in high quality research studies. Conclusion: Through the pragmatic approach that has been adopted, we have been able to address the key challenges in establishing a national databank of anonymised person-based records, so that the data are available for research and evaluation whilst meeting the requirements of Information Governance.

419 citations
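
Objectives 2 and 3 above combine reliable matching with anonymisation; one common general approach (not necessarily SAIL's actual method) is to derive a keyed, non-reversible linkage key from person identifiers before the data leave the source. A sketch with a hypothetical secret key and field set:

```python
import hashlib, hmac

SECRET_KEY = b"replace-with-a-managed-secret"    # hypothetical key held by a trusted party

def linkage_key(person_id: str, surname: str, dob: str) -> str:
    """Derive a stable, non-reversible identifier so that records for the same
    person can be linked across datasets without exposing raw identifiers."""
    material = "|".join([person_id.strip(), surname.strip().upper(), dob])
    return hmac.new(SECRET_KEY, material.encode("utf-8"), hashlib.sha256).hexdigest()

# The same person yields the same key in every source dataset, enabling linkage.
print(linkage_key("9434765919", "Jones", "1970-01-31"))
```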


Proceedings ArticleDOI
29 Jun 2009
TL;DR: This paper questions some of the fundamentals of the OLAP and OLTP separation and presents a new proposal for an enterprise data management concept that promises to revolutionize transactional applications while providing an optimal platform for analytical data processing.
Abstract: When SQL and the relational data model were introduced 25 years ago as a general data management concept, enterprise software migrated quickly to this new technology. It is fair to say that SQL and the various implementations of RDBMSs became the backbone of enterprise systems. In those days, we believed that business planning, transaction processing and analytics should reside in one single system. Despite the incredible improvements in computer hardware, high-speed networks, display devices and the associated software, speed and flexibility remained an issue. The nature of RDBMSs, being organized along rows, prohibited us from providing instant analytical insight and finally led to the introduction of so-called data warehouses. This paper will question some of the fundamentals of the OLAP and OLTP separation. Based on the analysis of real customer environments and experience in some prototype implementations, a new proposal for an enterprise data management concept will be presented. In our proposal, the participants in enterprise applications (customers, orders, accounting documents, products, employees, etc.) will be modeled as objects and also stored and maintained as such. Despite that, the vast majority of business functions will operate on an in-memory representation of their objects. Using the relational algebra and a column-based organization of data storage will allow us to revolutionize transactional applications while providing an optimal platform for analytical data processing. The unification of OLTP and OLAP workloads on a shared architecture and the reintegration of planning activities promise significant gains in application development while simplifying enterprise systems drastically. The latest trends in computer technology (e.g., blade architectures with multiple CPUs per blade and multiple cores per CPU) allow for a significant parallelization of application processes. The organization of data in columns supports the parallel use of cores for filtering and aggregation. Elements of application logic can be implemented as highly efficient stored procedures operating on columns. The vast increase in main memory, combined with improvements in L1, L2, and L3 caching and the high data compression rates of column storage, will allow us to support substantial data volumes on one single blade. Distributing data across multiple blades using a shared-nothing approach provides further scalability.

404 citations
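
A toy illustration of why the column-based organization argued for above favors analytics: an aggregate query scans only the columns it touches. This is just a sketch of the general idea, not the system described in the paper.

```python
# Toy column store: each attribute is a separate list ("column"), so an
# aggregate query reads only the columns it needs.
orders = {
    "customer": ["acme", "globex", "acme", "initech"],
    "region":   ["EU",   "US",     "EU",   "US"],
    "amount":   [120.0,  80.0,     200.0,  50.0],
}

def sum_where(columns, filter_col, filter_val, agg_col):
    # positions matching the predicate, found by scanning a single column
    hits = [i for i, v in enumerate(columns[filter_col]) if v == filter_val]
    return sum(columns[agg_col][i] for i in hits)

# SELECT SUM(amount) FROM orders WHERE region = 'EU'
print(sum_where(orders, "region", "EU", "amount"))   # 320.0
```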


Book
03 Mar 2009
TL;DR: This book walks through the full data warehouse system lifecycle, from analysis of data sources and user requirements through conceptual, logical, and physical design to project documentation and business intelligence beyond the data warehouse.
Abstract: Chapter 1. Introduction to Data Warehousing Chapter 2. Data Warehouse System Lifecycle Chapter 3. Analysis and Reconciliation of Data Sources Chapter 4. User Requirement Analysis Chapter 5. Conceptual Modeling Chapter 6. Conceptual Design Chapter 7. Workload and Data Volume Chapter 8. Logical Modeling Chapter 9. Logical Design Chapter 10. Data-staging Design Chapter 11. Indexes for the Data Warehouse Chapter 12. Physical Design Chapter 13. Data Warehouse Project Documentation Chapter 14. A Case Study Chapter 15. Business Intelligence: Beyond the Data Warehouse Glossary Bibliography Index

284 citations


Patent
31 Mar 2009
TL;DR: This patent presents an improved method of and apparatus for joining and aggregating data elements within a relational database management system (RDBMS) using a non-relational multi-dimensional data structure (MDD).
Abstract: Improved method of and apparatus for joining and aggregating data elements integrated within a relational database management system (RDBMS) using a non-relational multi-dimensional data structure (MDD). The improved RDBMS system of the present invention can be used to achieve a significant increase in system performance (e.g. decreased access/search time), user flexibility and ease of use. The improved RDBMS system of the present invention can be used to realize an improved Data Warehouse for supporting on-line analytical processing (OLAP) operations or to realize an improved informational database system or the like.

265 citations


Journal ArticleDOI
TL;DR: This survey covers the conceptual and logical modeling of ETL processes, along with some design methods, and visits each stage of the E-T-L triplet, and examines problems that fall within each of these stages.
Abstract: The software processes that facilitate the original loading and the periodic refreshment of the data warehouse contents are commonly known as Extraction-Transformation-Loading (ETL) processes. The intention of this survey is to present the research work in the field of ETL technology in a structured way. To this end, we organize the coverage of the field as follows: (a) first, we cover the conceptual and logical modeling of ETL processes, along with some design methods, (b) we visit each stage of the E-T-L triplet, and examine problems that fall within each of these stages, (c) we discuss problems that pertain to the entirety of an ETL process, and, (d) we review some research prototypes of academic origin.

255 citations
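
To make the E-T-L triplet the survey is organized around concrete, here is a minimal, self-contained pipeline sketch (hypothetical source format and target table; real ETL tools add scheduling, error handling, and incremental loading on top of this skeleton).

```python
import csv, io, sqlite3

SOURCE = "order_id,amount,currency\n1,10.5,usd\n2,7.0,eur\n"   # hypothetical feed

def extract(raw):                       # E: read rows from a source
    return list(csv.DictReader(io.StringIO(raw)))

def transform(rows):                    # T: cleanse and conform values
    for r in rows:
        yield (int(r["order_id"]), float(r["amount"]), r["currency"].upper())

def load(rows, conn):                   # L: append into the warehouse table
    conn.execute("CREATE TABLE IF NOT EXISTS fact_orders (id INT, amount REAL, currency TEXT)")
    conn.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(SOURCE)), conn)
print(conn.execute("SELECT * FROM fact_orders").fetchall())   # [(1, 10.5, 'USD'), (2, 7.0, 'EUR')]
```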


Book ChapterDOI
TL;DR: A simple way to adjoin physical copies of dimension columns to the fact table, dicing data to effectively cluster query retrieval, and how such dicing can be achieved on database products other than DB2 is shown.
Abstract: We provide a benchmark measuring star schema queries retrieving data from a fact table with Where clause column restrictions on dimension tables. Clustering is crucial to performance with modern disk technology, since retrievals with filter factors down to 0.0005 are now performed most efficiently by sequential table search rather than by indexed access. DB2's Multi-Dimensional Clustering (MDC) provides methods to "dice" the fact table along a number of orthogonal "dimensions", but only when these dimensions are columns in the fact table. The diced cells cluster fact rows on several of these "dimensions" at once so queries restricting several such columns can access crucially localized data, with much faster query response. Unfortunately, columns of dimension tables of a star schema are not usually represented in the fact table. In this paper, we show a simple way to adjoin physical copies of dimension columns to the fact table, dicing data to effectively cluster query retrieval, and explain how such dicing can be achieved on database products other than DB2. We provide benchmark measurements to show successful use of this methodology on three commercial database products.

230 citations
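
A small sketch of the core trick described above, assuming a toy fact table and date dimension: the dimension attribute is physically copied into the fact table so that rows can be clustered on it, and a dimension-restricted query then reads a localized run of rows. Actual MDC dicing operates on disk blocks; this only illustrates the data-layout idea.

```python
# Adjoin a dimension column to the fact table and cluster ("dice") on it.
date_dim = {1: "2009-Q1", 2: "2009-Q2", 3: "2009-Q3"}   # date_key -> quarter
fact = [
    {"date_key": 2, "amount": 10.0},
    {"date_key": 1, "amount": 5.0},
    {"date_key": 2, "amount": 7.5},
]

for row in fact:                                # physical copy of the attribute
    row["quarter"] = date_dim[row["date_key"]]
fact.sort(key=lambda r: r["quarter"])           # cluster the fact rows on it

# SELECT SUM(amount) ... WHERE quarter = '2009-Q2' now scans a contiguous slice.
print(sum(r["amount"] for r in fact if r["quarter"] == "2009-Q2"))   # 17.5
```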


Journal ArticleDOI
01 Aug 2009
TL;DR: Modern data management applications often require integrating available data sources and providing a uniform interface for users to access data from different sources, and such requirements have been driving fruitful research on data integration over the last two decades.
Abstract: The amount of information produced in the world increases by 30% every year and this rate will only go up. With advanced network technology, more and more sources are available either over the Internet or in enterprise intranets. Modern data management applications, such as setting up Web portals, managing enterprise data, managing community data, and sharing scientific data, often require integrating available data sources and providing a uniform interface for users to access data from different sources; such requirements have been driving fruitful research on data integration over the last two decades [11, 13].

Proceedings ArticleDOI
24 Mar 2009
TL;DR: The requirements for data integration flows in this next generation of operational BI system are described, the limitations of current technologies, the research challenges in meeting these requirements, and a framework for addressing these challenges are described.
Abstract: Business Intelligence (BI) refers to technologies, tools, and practices for collecting, integrating, analyzing, and presenting large volumes of information to enable better decision making. Today's BI architecture typically consists of a data warehouse (or one or more data marts), which consolidates data from several operational databases, and serves a variety of front-end querying, reporting, and analytic tools. The back-end of the architecture is a data integration pipeline for populating the data warehouse by extracting data from distributed and usually heterogeneous operational sources; cleansing, integrating and transforming the data; and loading it into the data warehouse. Since BI systems have been used primarily for off-line, strategic decision making, the traditional data integration pipeline is a one-way, batch process, usually implemented by extract-transform-load (ETL) tools. The design and implementation of the ETL pipeline is largely a labor-intensive activity, and typically consumes a large fraction of the effort in data warehousing projects. Increasingly, as enterprises become more automated, data-driven, and real-time, the BI architecture is evolving to support operational decision making. This imposes additional requirements and tradeoffs, resulting in even more complexity in the design of data integration flows. These include reducing the latency so that near real-time data can be delivered to the data warehouse, extracting information from a wider variety of data sources, extending the rigidly serial ETL pipeline to more general data flows, and considering alternative physical implementations. We describe the requirements for data integration flows in this next generation of operational BI system, the limitations of current technologies, the research challenges in meeting these requirements, and a framework for addressing these challenges. The goal is to facilitate the design and implementation of optimal flows to meet business requirements.

01 Jan 2009
TL;DR: In this paper, the whole DW development process is aligned with a Model Driven Architecture (MDA) framework, with a focus on an MDA approach for the development of the DW repository, which is the cornerstone of any DW system.
Abstract: Different modeling approaches have been proposed to overcome every design pitfall of different data warehouse (DW) components. However, most of them offer partial solutions that deal only with isolated aspects of the DW and do not provide developers with an integrated and standard framework for designing all DW relevant components, such as ETL processes, data sources, DW repository and so on. To overcome this problem, this paper describes how to align the whole DW development process with a Model Driven Architecture (MDA) framework. We then focus on describing one part of it: an MDA approach for the development of the DW repository, because it is the cornerstone of any DW system. Therefore, we describe how to build the different MDA models for the DW repository by using an extension of the Unified Modeling Language (UML) and the Common Warehouse Metamodel (CWM). Transformations between models are also clearly and formally established by using the Query/View/Transformation (QVT) language. Finally, a case study is provided to exemplify the benefits of our MDA framework.
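
As a rough analogue of the model-to-model transformations the paper formalizes with QVT, the sketch below maps a toy conceptual multidimensional model (a plain dict standing in for the UML/CWM models) to relational DDL for the DW repository; the structures and naming are illustrative assumptions, not the paper's metamodels.

```python
# Toy "conceptual model -> relational repository" transformation producing star schema DDL.
conceptual = {
    "fact": "sales",
    "measures": ["amount", "quantity"],
    "dimensions": {"date": ["day", "month", "year"], "product": ["name", "category"]},
}

def to_relational(model):
    ddl = []
    for dim, attrs in model["dimensions"].items():
        cols = ", ".join([f"{dim}_key INT PRIMARY KEY"] + [f"{a} TEXT" for a in attrs])
        ddl.append(f"CREATE TABLE dim_{dim} ({cols});")
    fact_cols = [f"{d}_key INT" for d in model["dimensions"]] + \
                [f"{m} REAL" for m in model["measures"]]
    ddl.append(f"CREATE TABLE fact_{model['fact']} ({', '.join(fact_cols)});")
    return "\n".join(ddl)

print(to_relational(conceptual))
```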

Journal ArticleDOI
TL;DR: This article presents the most relevant methodologies introduced in the literature, together with a detailed comparison showing the main features of each approach.
Abstract: Many methodologies have been presented to support the multidimensional design of the data warehouse. The first methodologies introduced were requirement-driven, but the semantics of a data warehouse require data sources to also be considered along the design process. In the following years, data sources gained relevance in multidimensional modeling and gave rise to several data-driven methodologies that automate the data warehouse design process from relational sources. Currently, research on multidimensional modeling is still a hot topic, and there are two main research lines. On the one hand, new hybrid automatic methodologies have been introduced that propose to combine data-driven and requirement-driven approaches. On the other hand, new approaches focus on considering other kinds of structured data sources that have gained relevance in recent years, such as ontologies or XML. In this article we present the most relevant methodologies introduced in the literature and a detailed comparison showing the main features of each approach.

Patent
15 May 2009
TL;DR: In this article, the authors present methods and systems to model and acquire data from a variety of data and information sources, to integrate the data into a structured database, and to manage the continuing reintegration of updated data from those sources over time.
Abstract: Methods and systems to model and acquire data from a variety of data and information sources, to integrate the data into a structured database, and to manage the continuing reintegration of updated data from those sources over time. For any given domain, a variety of individual information and data sources that contain information relevant to the schema can be identified. Data elements associated with a schema may be identified in a training source, such as by user tagging. A formal grammar may be induced appropriate to the schema and layout of the training source. A Hidden Markov Model (HMM) corresponding to the grammar may learn where in the sources the elements can be found. The system can automatically mutate its schema into a grammar matching the structure of the source documents. By following an inverse transformation sequence, data that is parsed by the mutated grammar can be fit back into the original grammar structure, matching the original data schema defined through domain modeling. Features disclosed herein may be implemented with respect to web-scraping and data acquisition, and to represent data in support of data-editing and data-merging tasks. A schema may be defined with respect to a graph-based domain model.

Journal ArticleDOI
TL;DR: This work presents an ontology-based approach for BI applications, specifically in statistical analysis and data mining, and implements this approach in financial knowledge management system (FKMS), which is able to do data extraction, transformation and loading, and data cubes creation and retrieval.
Abstract: Business intelligence (BI) applications within an enterprise range over enterprise reporting, cube and ad hoc query analysis, statistical analysis, data mining, and proactive report delivery and alerting. The most sophisticated applications of BI are statistical analysis and data mining, which involve mathematical and statistical treatment of data for correlation analysis, trend analysis, hypothesis testing, and predictive analysis. They are used by relatively small groups of users consisting of information analysts and power users, for whom data and analysis are their primary jobs. We present an ontology-based approach for BI applications, specifically in statistical analysis and data mining. We implemented our approach in a financial knowledge management system (FKMS), which is able to do: (i) data extraction, transformation and loading, (ii) data cube creation and retrieval, (iii) statistical analysis and data mining, (iv) experiment metadata management, and (v) experiment retrieval for new problem solving. The resulting knowledge from each experiment, defined as a knowledge set consisting of strings of data, model, parameters, and reports, is stored, shared, and disseminated, and thus helps support decision making. We finally illustrate the above claims with a process of applying data mining techniques to support corporate bond classification.

01 Jan 2009
TL;DR: This paper is an overview of artificial neural networks and questions their position as a preferred tool by data mining practitioners.
Abstract: Companies have been collecting data for decades, building massive data warehouses in which to store it. Even though this data is available, very few companies have been able to realize the actual value stored in it. The question these companies are asking is how to extract this value. The answer is Data mining. There are many technologies available to data mining practitioners, including Artificial Neural Networks, Regression, and Decision Trees. Many practitioners are wary of Neural Networks due to their black box nature, even though they have proven themselves in many situations. This paper is an overview of artificial neural networks and questions their position as a preferred tool by data mining practitioners.

Journal ArticleDOI
TL;DR: A comprehensive analysis of common data stream processing operators and their impact on data quality enables sound data evaluation and reduces incorrect business decisions; a data quality model control is also proposed to adapt the data quality granularity to the interestingness of the data stream.
Abstract: Sensors in smart-item environments capture data about product conditions and usage to support business decisions as well as production automation processes. A challenging issue in this application area is the restricted quality of sensor data due to limited sensor precision and sensor failures. Moreover, data stream processing to meet resource constraints in streaming environments introduces additional noise and decreases the data quality. In order to avoid wrong business decisions due to dirty data, quality characteristics have to be captured, processed, and provided to the respective business task. However, the issue of how to efficiently provide applications with information about data quality is still an open research problem.In this article, we address this problem by presenting a flexible model for the propagation and processing of data quality. The comprehensive analysis of common data stream processing operators and their impact on data quality allows a fruitful data evaluation and diminishes incorrect business decisions. Further, we propose the data quality model control to adapt the data quality granularity to the data stream interestingness.
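
A sketch of the propagation idea: every tuple carries a quality indicator, and each stream operator combines input qualities into an output quality alongside the data value. The combination rule used here (a simple mean) is an illustrative assumption, not the article's exact model.

```python
# Propagate a per-tuple quality score through an aggregation operator.
readings = [            # (sensor value, quality in [0, 1])
    (21.4, 0.9),
    (21.9, 0.8),
    (35.0, 0.2),        # suspicious reading from a degraded sensor
]

def aggregate_with_quality(tuples):
    values = [v for v, _ in tuples]
    qualities = [q for _, q in tuples]
    aggregated = sum(values) / len(values)
    quality = sum(qualities) / len(qualities)   # quality attached to the result
    return aggregated, quality

value, quality = aggregate_with_quality(readings)
print(f"avg={value:.1f} quality={quality:.2f}")  # downstream tasks see both
```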

Journal ArticleDOI
01 Oct 2009
TL;DR: The frequent items problem (also known as the heavy hitters problem) is one of the most heavily studied questions in data streams, and is important both in itself, and as a subroutine within more advanced data stream computations.
Abstract: Many data generation processes can be modeled as data streams. They produce huge numbers of pieces of data, each of which is simple in isolation, but which taken together lead to a complex whole. For example, the sequence of queries posed to an Internet search engine can be thought of as a stream, as can the collection of transactions across all branches of a supermarket chain. In aggregate, this data can arrive at enormous rates, easily in the realm of hundreds of gigabytes per day or higher. While this data may be archived and indexed within a data warehouse, it is also important to process the data "as it happens," to provide up to the minute analysis and statistics on current trends. Methods to achieve this must be quick to respond to each new piece of information, and use resources which are very small when compared to the total quantity of data. These applications and others like them have led to the formulation of the so-called "streaming model." In this abstraction, algorithms take only a single pass over their input, and must accurately compute various functions while using resources (space and time per item) that are strictly sublinear in the size of the input---ideally, polynomial in the logarithm of the input size. The output must be produced at the end of the stream, or when queried on the prefix of the stream that has been observed so far. (Other variations ask for the output to be maintained continuously in the presence of updates, or on a "sliding window" of only the most recent updates.) Some problems are simple in this model: for example, given a stream of transactions, finding the mean and standard deviation of the bill totals can be accomplished by retaining a few "sufficient statistics" (sum of all values, sum of squared values, etc.). Others can be shown to require a large amount of information to be stored, such as determining whether a particular search query has already appeared anywhere within a large stream of queries. Determining which problems can be solved effectively within this model remains an active research area. The frequent items problem (also known as the heavy hitters problem) is one of the most heavily studied questions in data streams. The problem is popular due to its simplicity to state, and its intuitive interest and value. It is important both in itself, and as a subroutine within more advanced data stream computations. Informally, given a sequence of items, the problem is simply to find those items which occur most frequently. Typically, this is formalized as finding all items whose frequency exceeds a specified fraction of the total number of items. This is shown in Figure 1. Variations arise when the items are given weights, and further when these weights can also be negative. This abstract problem captures a wide variety of settings. The items can represent packets on the Internet, and the weights are the size of the packets. Then the frequent items represent the most popular destinations, or the heaviest bandwidth users (depending on how the items are extracted from the flow identifiers). This knowledge can help in optimizing routing decisions, for in-network caching, and for planning where to add new capacity. Or, the items can represent queries made to an Internet search engine, and the frequent items are now the (currently) popular terms. 
These are not simply hypothetical examples, but genuine cases where algorithms for this problem have been applied by large corporations such as AT&T. Existing work is sometimes claimed to be incapable of a certain guarantee, which in truth it can provide with only minor modifications, and experimental evaluations do not always compare against the most suitable methods. In this paper, we present the main ideas in this area, by describing some of the most significant algorithms for the core problem of finding frequent items using common notation and terminology. In doing so, we also present the historical development of these algorithms. Studying these algorithms is instructive, as they are relatively simple, but can be shown to provide formal guarantees on the quality of their output as a function of an accuracy parameter ε. We also provide baseline implementations of many of these algorithms against which future algorithms can be compared, and on top of which algorithms for different problems can be built. We perform experimental evaluation of the algorithms over a variety of data sets to indicate their performance in practice. From this, we are able to identify clear distinctions among the algorithms that are not apparent from their theoretical analysis alone.
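
One of the classic counter-based algorithms surveyed for this problem is Misra-Gries (often called Frequent): with k-1 counters it guarantees that every item occurring more than n/k times in a stream of n items survives in the counter set. A compact sketch for illustration, not the paper's baseline implementation:

```python
def misra_gries(stream, k):
    """Return candidate heavy hitters: items occurring > n/k times are guaranteed kept."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:                                   # decrement every counter
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

print(misra_gries(list("abracadabraaaab"), k=3))   # 'a' (8 of 15 items) survives
```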

Proceedings ArticleDOI
29 Jun 2009
TL;DR: The DataDepot architecture is discussed, with an emphasis on several of its novel and critical features; DataDepot is currently being used for five very large warehousing projects within AT&T.
Abstract: We describe DataDepot, a tool for generating warehouses from streaming data feeds, such as network-traffic traces, router alerts, financial tickers, transaction logs, and so on. DataDepot is a streaming data warehouse designed to automate the ingestion of streaming data from a wide variety of sources and to maintain complex materialized views over these sources. As a streaming warehouse, DataDepot is similar to Data Stream Management Systems (DSMSs) with its emphasis on temporal data, best-effort consistency, and real-time response. However, as a data warehouse, DataDepot is designed to store tens to hundreds of terabytes of historical data, allow time windows measured in years or decades, and allow both real-time queries on recent data and deep analyses on historical data. In this paper we discuss the DataDepot architecture, with an emphasis on several of its novel and critical features. DataDepot is currently being used for five very large warehousing projects within AT&T; one of these warehouses ingests 500 Mbytes per minute (and is growing). We use these installations to illustrate streaming warehouse use and behavior, and design choices made in developing DataDepot. We conclude with a discussion of DataDepot applications and the efficacy of some optimizations.

Journal ArticleDOI
01 Aug 2009
TL;DR: This work describes an augmentation of traditional query engines that improves join throughput in large-scale concurrent data warehouses by using an "always-on" pipeline of non-blocking operators, coupled with a controller that continuously examines the current query mix and performs run-time optimizations.
Abstract: Conventional data warehouses employ the query-at-a-time model, which maps each query to a distinct physical plan. When several queries execute concurrently, this model introduces contention, because the physical plans, unaware of each other, compete for access to the underlying I/O and computation resources. As a result, while modern systems can efficiently optimize and evaluate a single complex data analysis query, their performance suffers significantly when multiple complex queries run at the same time. We describe an augmentation of traditional query engines that improves join throughput in large-scale concurrent data warehouses. In contrast to the conventional query-at-a-time model, our approach employs a single physical plan that can share I/O, computation, and tuple storage across all in-flight join queries. We use an "always-on" pipeline of non-blocking operators, coupled with a controller that continuously examines the current query mix and performs run-time optimizations. Our design allows the query engine to scale gracefully to large data sets, provide predictable execution times, and reduce contention. In our empirical evaluation, we found that our prototype outperforms conventional commercial systems by an order of magnitude for tens to hundreds of concurrent queries.
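
A toy illustration of the sharing principle (not the paper's actual operator pipeline): a single pass over the fact table evaluates the predicates of every in-flight query, instead of one scan per query as in the query-at-a-time model. Query identifiers and predicates below are hypothetical.

```python
# One shared scan serves all concurrent queries.
fact = [{"region": "EU", "amount": 10}, {"region": "US", "amount": 4},
        {"region": "EU", "amount": 6},  {"region": "APAC", "amount": 9}]

queries = {                                   # hypothetical concurrent aggregates
    "q1": lambda r: r["region"] == "EU",
    "q2": lambda r: r["amount"] > 5,
}

def shared_scan(rows, queries):
    totals = {qid: 0 for qid in queries}
    for row in rows:                          # single pass shared by all queries
        for qid, predicate in queries.items():
            if predicate(row):
                totals[qid] += row["amount"]
    return totals

print(shared_scan(fact, queries))             # {'q1': 16, 'q2': 25}
```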

Book
15 Jun 2009
TL;DR: The book introduces the various features and architecture of a data warehouse, followed by a detailed study of business requirements and dimensional modelling, and leads up to the core of the subject by providing a thorough understanding of building and maintaining a data warehouse.
Abstract: Data Warehousing is designed to serve as a textbook for students of Computer Science & Engineering (BE/BTech), computer applications (BCA/MCA) and computer science (B.Sc) for an introductory course on Data Warehousing. It provides a thorough understanding of the fundamentals of Data Warehousing and aims to impart a sound knowledge to users for creating and managing a Data Warehouse. The book introduces the various features and architecture of a Data Warehouse followed by a detailed study of the Business Requirements and Dimensional Modelling. It goes on to discuss the components of a Data Warehouse and thereby leads up to the core area of the subject by providing a thorough understanding of the building and maintenance of a Data Warehouse. This is then followed up by an overview of planning and project management, testing and growth, and then finishes with Data Warehouse solutions and the latest trends in this field. The book is finally rounded off with a broad overview of its related field of study, Data Mining. The text is ably supported by plenty of examples to illustrate concepts and contains several review questions and other end-chapter exercises to test the understanding of students. The book also carries a running case study that aims to bring out the practical aspects of the subject. This will be useful for students to master the basics and apply them to real-life scenarios.

Journal ArticleDOI
TL;DR: This paper introduces the main concepts and terminology of temporal databases and discusses the open research issues, also in connection with their implementation in commercial tools.
Abstract: Data warehouses are information repositories specialized in supporting decision making. Since the decisional process typically requires an analysis of historical trends, time and its management acquire a huge importance. In this paper we consider the variety of issues, often grouped under the term temporal data warehousing, implied by the need for accurately describing how information changes over time in data warehousing systems. We recognize that, with reference to a three-level architecture, these issues can be classified into a few topics, namely: handling data/schema changes in the data warehouse, handling data/schema changes in the data mart, querying temporal data, and designing temporal data warehouses. After introducing the main concepts and terminology of temporal databases, we separately survey these topics. Finally, we discuss the open research issues, also in connection with their implementation on commercial tools.
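
One concrete instance of handling data changes in the data mart is the slowly changing dimension pattern; the sketch below shows a Type 2 update, where a changed attribute closes the current row and opens a new versioned one so history stays queryable. This is a standard technique used here to ground the discussion, not a proposal specific to this paper.

```python
# Type 2 slowly changing dimension: version rows instead of overwriting them.
from datetime import date

customer_dim = [
    {"key": 1, "customer_id": "C42", "city": "Turin", "valid_from": date(2005, 1, 1), "valid_to": None},
]

def scd2_update(dim, customer_id, new_city, change_date):
    for row in dim:
        if row["customer_id"] == customer_id and row["valid_to"] is None:
            if row["city"] == new_city:
                return                                    # nothing changed
            row["valid_to"] = change_date                 # close the current version
    dim.append({"key": max(r["key"] for r in dim) + 1, "customer_id": customer_id,
                "city": new_city, "valid_from": change_date, "valid_to": None})

scd2_update(customer_dim, "C42", "Bologna", date(2009, 6, 1))
for row in customer_dim:
    print(row)          # history preserved: facts can be analysed as-was or as-is
```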

Proceedings Article
01 Jan 2009
TL;DR: This paper describes the Continuous Analytics approach and outlines some of the key technical arguments behind it, creating a powerful and flexible system that can run SQL over tables, streams, and combinations of the two.
Abstract: Modern data analysis applications driven by the Network Effect are pushing traditional database and data warehousing technologies beyond their limits due to their massively increasing data volumes and demands for low latency. To address this problem, we advocate an integrated query processing approach that runs SQL continuously and incrementally over data before that data is stored in the database. Continuous Analytics technology is seamlessly integrated into a full-function database system, creating a powerful and flexible system that can run SQL over tables, streams, and combinations of the two. A continuous analytics system can run many orders of magnitude more efficiently than traditional store-first-query-later technologies. In this paper, we describe the Continuous Analytics approach and outline some of the key technical arguments behind it.
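
A sketch of the "query the data before it is stored" flavor of processing advocated above: a windowed aggregate is maintained incrementally as events arrive, so the result is always available without a store-first-query-later round trip. The window length and event shape are assumptions for illustration, not the paper's engine.

```python
from collections import deque

class SlidingSum:
    """Sum of event values over the last `window` seconds, updated per event."""
    def __init__(self, window):
        self.window = window
        self.events = deque()              # (timestamp, value)
        self.total = 0.0

    def add(self, ts, value):
        self.events.append((ts, value))
        self.total += value
        while self.events and self.events[0][0] <= ts - self.window:
            _, old = self.events.popleft() # expire events outside the window
            self.total -= old
        return self.total                  # continuously available result

w = SlidingSum(window=60)
for ts, v in [(0, 5.0), (30, 2.0), (70, 1.0)]:
    print(ts, w.add(ts, v))                # 5.0, 7.0, then 3.0 once ts=0 expires
```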

Journal ArticleDOI
TL;DR: The SEAMLESS integrated database on European agricultural systems contains data on cropping patterns, production, farm structural data, soil and climate conditions, current agricultural management, and policy information; a shared ontology was developed through a collaborative process.

Proceedings ArticleDOI
06 Nov 2009
TL;DR: This paper proposes a platform-independent conceptual model of ETL processes based on the Business Process Model Notation (BPMN) standard and shows how such a conceptual model can be implemented using Business Process Execution Language (BPEL), a standard executable language for specifying interactions with web services.
Abstract: Decisional systems are crucial for enterprise improvement. They allow the consolidation of heterogeneous data from distributed enterprise data stores into strategic indicators. An essential component of this data consolidation is the Extract, Transform, and Load (ETL) process. In the research literature there has been very little work on defining conceptual models for ETL processes. At the same time, there are currently many tools that manage such processes. However, each tool uses its own model, which is not necessarily able to communicate with the models of other tools. In this paper, we propose a platform-independent conceptual model of ETL processes based on the Business Process Model Notation (BPMN) standard. We also show how such a conceptual model can be implemented using the Business Process Execution Language (BPEL), a standard executable language for specifying interactions with web services.

Book ChapterDOI
01 Jan 2009
TL;DR: The state of the art for both conventional and near real time ETL is reviewed, the background, the architecture, and the technical issues that arise are discussed, and interesting research challenges for future work are pinpointed.
Abstract: Near real time ETL deviates from the traditional conception of data warehouse refreshment, which is performed off-line in a batch mode, and adopts the strategy of propagating changes that take place in the sources towards the data warehouse to the extent that both the sources and the warehouse can sustain the incurred workload. In this article, we review the state of the art for both conventional and near real time ETL, we discuss the background, the architecture, and the technical issues that arise in the area of near real time ETL, and we pinpoint interesting research challenges for future work.


Journal ArticleDOI
01 Aug 2009
TL;DR: This paper details the design of index compression in DB2 LUW, discusses the challenges encountered in meeting the design goals, and demonstrates its effectiveness with performance results on typical customer scenarios.
Abstract: In database systems, the cost of data storage and retrieval are important components of the total cost and response time of the system. A popular mechanism to reduce the storage footprint is by compressing the data residing in tables and indexes. Compressing indexes efficiently, while maintaining response time requirements, is known to be challenging. This is especially true when designing for a workload spectrum covering both data warehousing and transaction processing environments. DB2 Linux, UNIX, Windows (LUW) recently introduced index compression for use in both environments. This uses techniques that are able to compress index data efficiently while incurring virtually no performance penalty for query processing. On the contrary, for certain operations, the performance is actually better. In this paper, we detail the design of index compression in DB2 LUW and discuss the challenges that were encountered in meeting the design goals. We also demonstrate its effectiveness by showing performance results on typical customer scenarios.
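
The abstract does not disclose DB2 LUW's on-disk format, but one generic technique behind index compression is front (prefix) compression of sorted keys; the sketch below illustrates that general idea only.

```python
# Front compression: store the length of the prefix shared with the previous
# key plus the differing suffix, instead of each full key.
def compress(sorted_keys):
    out, prev = [], ""
    for key in sorted_keys:
        common = 0
        while common < min(len(prev), len(key)) and prev[common] == key[common]:
            common += 1
        out.append((common, key[common:]))
        prev = key
    return out

def decompress(entries):
    keys, prev = [], ""
    for common, suffix in entries:
        prev = prev[:common] + suffix
        keys.append(prev)
    return keys

keys = ["warehouse", "warehousing", "warranty", "workload"]
packed = compress(keys)
print(packed)            # [(0, 'warehouse'), (8, 'ing'), (3, 'ranty'), (1, 'orkload')]
assert decompress(packed) == keys
```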

Proceedings ArticleDOI
29 Mar 2009
TL;DR: The notion of data staleness is defined, the problem of scheduling updates in a way that minimizes average data staleness is formalized, and scheduling algorithms designed to handle the complex environment of a real-time stream warehouse are presented.
Abstract: This paper discusses updating a data warehouse that collects near-real-time data streams from a variety of external sources. The objective is to keep all the tables and materialized views up-to-date as new data arrive over time. We define the notion of data staleness, formalize the problem of scheduling updates in a way that minimizes average data staleness, and present scheduling algorithms designed to handle the complex environment of a real-time stream warehouse. A novel feature of our scheduling framework is that it considers the effect of an update on the staleness of the underlying tables rather than any property of the update job itself (such as deadline).
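
A toy scheduler in the spirit of the framework described above (the paper's staleness definition and algorithms are more involved): each pending update is prioritized by how much accumulated staleness applying it would remove from its table, rather than by any property of the update job itself. Table names and timestamps are hypothetical.

```python
import heapq

def schedule(pending, now):
    """pending: list of (table, arrival_time_of_newest_unloaded_data); returns update order."""
    # staleness of a table ~ how long its newest unloaded data has been waiting
    heap = [(-(now - arrival), table) for table, arrival in pending]
    heapq.heapify(heap)
    order = []
    while heap:
        neg_staleness, table = heapq.heappop(heap)   # most stale table first
        order.append((table, -neg_staleness))
    return order

pending = [("link_traffic", 10.0), ("router_alerts", 42.0), ("ticker", 55.0)]
print(schedule(pending, now=60.0))
# [('link_traffic', 50.0), ('router_alerts', 18.0), ('ticker', 5.0)]
```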