
Showing papers on "Data warehouse" published in 2013


Proceedings ArticleDOI
20 May 2013
TL;DR: This paper presents an overview of big data's content, scope, samples, methods, advantages, and challenges, and discusses the privacy concerns it raises.
Abstract: Big data is a term for massive data sets having a large, varied, and complex structure, with difficulties in storing, analyzing, and visualizing them for further processes or results. The process of researching massive amounts of data to reveal hidden patterns and secret correlations is named big data analytics. This useful information helps companies and organizations gain richer and deeper insights and an advantage over the competition. For this reason, big data implementations need to be analyzed and executed as accurately as possible. This paper presents an overview of big data's content, scope, samples, methods, advantages, and challenges, and discusses the privacy concerns it raises.

1,077 citations


Proceedings ArticleDOI
07 Jan 2013
TL;DR: The issues and challenges of big data are analyzed as the authors begin a collaborative research program into methodologies for big data analysis and design.
Abstract: Big data refers to data volumes in the range of exabytes (10^18 bytes) and beyond. Such volumes exceed the capacity of current on-line storage systems and processing systems. Data, information, and knowledge are being created and collected at a rate that is rapidly approaching the exabyte/year range. But their creation and aggregation are accelerating and will approach the zettabyte/year range within a few years. Volume is only one aspect of big data; other attributes are variety, velocity, value, and complexity. Storage and data transport are technology issues, which seem to be solvable in the near term, but they represent long-term challenges that require research and new paradigms. We analyze the issues and challenges as we begin a collaborative research program into methodologies for big data analysis and design.

835 citations


Book
01 Jan 2013
TL;DR: The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling, 3rd Edition is a complete library of updated dimensional modeling techniques, the most comprehensive collection ever.
Abstract: Updated new edition of Ralph Kimball's groundbreaking book on dimensional modeling for data warehousing and business intelligence! The first edition of Ralph Kimball's The Data Warehouse Toolkit introduced the industry to dimensional modeling, and now his books are considered the most authoritative guides in this space. This new third edition is a complete library of updated dimensional modeling techniques, the most comprehensive collection ever. It covers new and enhanced star schema dimensional modeling patterns, adds two new chapters on ETL techniques, includes new and expanded business matrices for 12 case studies, and more. Authored by Ralph Kimball and Margy Ross, known worldwide as educators, consultants, and influential thought leaders in data warehousing and business intelligence, the book begins with fundamental design recommendations and progresses through increasingly complex scenarios. It presents unique modeling techniques for business applications such as inventory management, procurement, invoicing, accounting, customer relationship management, big data analytics, and more, and draws real-world case studies from a variety of industries, including retail sales, financial services, telecommunications, education, health care, insurance, and e-commerce. Design dimensional databases that are easy to understand and provide fast query response with The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling, 3rd Edition.

372 citations


Journal ArticleDOI
TL;DR: This paper presents an overview of techniques that allow databases to be linked across organizations while preserving the privacy of the data, and presents a taxonomy that characterizes these privacy-preserving record linkage (PPRL) techniques along 15 dimensions.

241 citations


Journal ArticleDOI
H. G. Miller, P. Mork
TL;DR: With exponential growth in data, enterprises must act to make the most of the vast data landscape: to thoughtfully apply multiple technologies, carefully select key data for specific investigations, and innovatively tailor large integrated datasets to support specific queries and analyses.
Abstract: With exponential growth in data, enterprises must act to make the most of the vast data landscape: to thoughtfully apply multiple technologies, carefully select key data for specific investigations, and innovatively tailor large integrated datasets to support specific queries and analyses. All these actions will flow from a data value chain, a framework to manage data holistically from capture to decision making and to support a variety of stakeholders and their technologies.

240 citations


Patent
14 Oct 2013
TL;DR: In this article, an Extract, Transform and Load (ETL) data replication method for Chart of Account (COA) standardization is proposed that extracts data from a data source in a non-intrusive manner.
Abstract: Remote data collection systems and methods retrieve data, including financial, sales, marketing, operational, and similar data, from a plurality of databases and database types remotely over a network in an automated, platform-agnostic manner. An Extract, Transform and Load (ETL) data replication method for Chart of Account (COA) standardization includes receiving a request for remote data collection to extract data from a data source; extracting data in a non-intrusive manner from the data source, wherein the data comprises non-standard COA data; and transforming either the entire set or a subset of the extracted data, based on the request, according to a template or a standardized form desired for comparisons.
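The patent abstract stays at the level of method steps; the fragment below is a minimal, hypothetical Python sketch of the transform step it describes, mapping non-standard chart-of-account labels onto a standardized template so that extracts from different sources become comparable. The template contents and sample rows are invented for illustration and are not taken from the patent.

```python
# Hypothetical sketch of the "transform" step for Chart of Account (COA)
# standardization: map source-specific account labels onto a standard template.
# The template and sample rows below are illustrative, not from the patent.

STANDARD_COA_TEMPLATE = {
    "sales revenue": "4000-Revenue",
    "revenue": "4000-Revenue",
    "cost of goods sold": "5000-COGS",
    "salaries": "6100-Payroll",
    "wages": "6100-Payroll",
}

def standardize_coa(rows, template=STANDARD_COA_TEMPLATE):
    """Replace each row's non-standard account label with the standard code.

    Rows whose label is unknown are kept but flagged for manual review.
    """
    transformed = []
    for row in rows:
        label = row["account"].strip().lower()
        standard = template.get(label)
        transformed.append({
            "account": standard or row["account"],
            "amount": row["amount"],
            "needs_review": standard is None,
        })
    return transformed

if __name__ == "__main__":
    extracted = [
        {"account": "Sales Revenue", "amount": 1200.0},
        {"account": "Wages", "amount": -300.0},
        {"account": "Misc adjustments", "amount": -15.0},  # unknown label
    ]
    for row in standardize_coa(extracted):
        print(row)
```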

177 citations


Book
02 May 2013
TL;DR: This book will help you navigate the complex layers of Big Data and data warehousing while providing information on how to think effectively about using these technologies and architectures to design the next-generation data warehouse.
Abstract: Data Warehousing in the Age of Big Data will help you and your organization make the most of unstructured data with your existing data warehouse. As Big Data continues to revolutionize how we use data, it doesn't have to create more confusion. Expert author Krish Krishnan helps you make sense of how Big Data fits into the world of data warehousing in clear and concise detail. The book is presented in three distinct parts. Part 1 discusses Big Data, its technologies, and use cases from early adopters. Part 2 addresses data warehousing, its shortcomings, and new architecture options, workloads, and integration techniques for Big Data and the data warehouse. Part 3 deals with data governance, data visualization, information life-cycle management, data scientists, and implementing a Big Data-ready data warehouse. Extensive appendixes include case studies from vendor implementations and a special segment on how we can build a healthcare information factory. Ultimately, this book will help you navigate the complex layers of Big Data and data warehousing while providing information on how to think effectively about using these technologies and architectures to design the next-generation data warehouse. Learn how to leverage Big Data by effectively integrating it into your data warehouse. Includes real-world examples and use cases that clearly demonstrate Hadoop, NoSQL, HBASE, Hive, and other Big Data technologies. Understand how to optimize and tune your current data warehouse infrastructure and integrate newer infrastructure matching data processing workloads and requirements. Table of Contents: Part 1 - Big Data: Chapter 1 - Introduction to Big Data; Chapter 2 - Complexity of Big Data; Chapter 3 - Big Data Processing Architectures; Chapter 4 - Big Data Technologies; Chapter 5 - Big Data Business Value. Part 2 - The Data Warehouse: Chapter 6 - Data Warehouse; Chapter 7 - Re-Engineering the Data Warehouse; Chapter 8 - Workload Management in the Data Warehouse; Chapter 9 - New Technology Approaches. Part 3 - Extending Big Data into the Data Warehouse: Chapter 10 - Integration of Big Data and Data Warehouse; Chapter 11 - Data Driven Architecture; Chapter 12 - Information Management and Lifecycle; Chapter 13 - Big Data Analytics, Visualization and Data Scientist; Chapter 14 - Implementing The "Big Data" Data Warehouse. Appendix A - Customer Case Studies From Vendors; Appendix B - Building The HealthCare Information Factory.

154 citations


Journal ArticleDOI
TL;DR: The underlying core idea is the notion of fusion cubes, i.e., multidimensional cubes that can be dynamically extended both in their schema and their instances, and in which situational data and metadata are associated with quality and provenance annotations.
Abstract: Self-service business intelligence is about enabling non-expert users to make well-informed decisions by enriching the decision process with situational data, i.e., data that have a narrow focus on a specific business problem and, typically, a short lifespan for a small group of users. Often, these data are not owned and controlled by the decision maker; their search, extraction, integration, and storage for reuse or sharing should be accomplished by decision makers without any intervention by designers or programmers. The goal of this paper is to present the framework we envision to support self-service business intelligence and the related research challenges; the underlying core idea is the notion of fusion cubes, i.e., multidimensional cubes that can be dynamically extended both in their schema and their instances, and in which situational data and metadata are associated with quality and provenance annotations.
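The paper defines fusion cubes conceptually rather than as code; purely as a rough illustration of the idea, the sketch below models a cube whose schema (dimensions) and instances (cells) can be extended dynamically, with each loaded value carrying provenance and quality annotations. All class and field names are assumptions.

```python
# Rough illustration of the "fusion cube" idea: a multidimensional cube whose
# schema (dimensions) and instances (cells) can be extended dynamically, and
# where situational data carries provenance/quality annotations.
# All class and field names are illustrative assumptions.

class FusionCube:
    def __init__(self, dimensions):
        self.dimensions = list(dimensions)          # extensible schema
        self.cells = {}                             # coordinates -> measure
        self.annotations = {}                       # coordinates -> metadata

    def add_dimension(self, name, default="N/A"):
        """Extend the schema; existing cells get a default coordinate."""
        self.dimensions.append(name)
        self.cells = {coords + (default,): v for coords, v in self.cells.items()}
        self.annotations = {coords + (default,): a
                            for coords, a in self.annotations.items()}

    def load(self, coords, measure, source, quality):
        """Add an instance together with provenance and quality annotations."""
        assert len(coords) == len(self.dimensions)
        self.cells[coords] = measure
        self.annotations[coords] = {"source": source, "quality": quality}

cube = FusionCube(["product", "month"])
cube.load(("bikes", "2013-01"), 42, source="corporate DW", quality=1.0)
cube.add_dimension("weather")                       # situational dimension
cube.load(("bikes", "2013-02", "rainy"), 17, source="web feed", quality=0.6)
print(cube.cells)
print(cube.annotations)
```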

130 citations


Proceedings ArticleDOI
28 Oct 2013
TL;DR: Open problems and current research trends in the field of Data Warehousing and OLAP over Big Data are highlighted, and several novel research directions arising in this field are derived.
Abstract: In this paper, we highlight open problems and current research trends in the field of Data Warehousing and OLAP over Big Data, an emerging topic in Data Warehousing and OLAP research. We also derive several novel research directions arising in this field and put emphasis on possible contributions to be achieved by future research efforts.

120 citations


Book
25 Nov 2013
TL;DR: This book is ideal for R developers who are looking for a way to perform big data analytics with Hadoop and focuses on all the powerful big data tasks that can be achieved by integrating R andHadoop.
Abstract: Set up an integrated infrastructure of R and Hadoop to turn your data analytics into Big Data analytics. Overview: Write Hadoop MapReduce within R; learn data analytics with R and the Hadoop platform; handle HDFS data within R; understand Hadoop streaming with R; encode and enrich datasets into R. In Detail: Big data analytics is the process of examining large amounts of data of a variety of types to uncover hidden patterns, unknown correlations, and other useful information. Such information can provide competitive advantages over rival organizations and result in business benefits, such as more effective marketing and increased revenue. New methods of working with big data, such as Hadoop and MapReduce, offer alternatives to traditional data warehousing. Big Data Analytics with R and Hadoop focuses on techniques for integrating R and Hadoop with various tools such as RHIPE and RHadoop. A powerful data analytics engine can be built that processes analytics algorithms over a large-scale dataset in a scalable manner, implemented through the data analytics operations of R and the MapReduce and HDFS components of Hadoop. You will start with the installation and configuration of R and Hadoop. Next, you will discover information on various practical data analytics examples with R and Hadoop. Finally, you will learn how to import/export from various data sources to R. Big Data Analytics with R and Hadoop will also give you an easy understanding of the R and Hadoop connectors RHIPE, RHadoop, and Hadoop streaming. What you will learn from this book: Integrate R and Hadoop via RHIPE, RHadoop, and Hadoop streaming; develop and run a MapReduce application that runs with R and Hadoop; handle HDFS data from within R using RHIPE and RHadoop; run Hadoop streaming and MapReduce with R; import and export from various data sources to R. Approach: Big Data Analytics with R and Hadoop is a tutorial-style book that focuses on all the powerful big data tasks that can be achieved by integrating R and Hadoop. Who this book is written for: This book is ideal for R developers who are looking for a way to perform big data analytics with Hadoop. It is also aimed at those who know Hadoop and want to build intelligent applications over Big Data with R packages. It would be helpful if readers have basic knowledge of R.
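Hadoop streaming, which the book covers for R, passes records to any executable over stdin/stdout; the sketch below shows that contract with a minimal word-count mapper and reducer written in Python as a stand-in (the book's own examples use R). The command line in the comment is illustrative.

```python
# Minimal Hadoop-streaming style mapper and reducer (word count).
# Hadoop streaming pipes input lines to the mapper over stdin and the sorted
# key/value pairs to the reducer; this stdin/stdout contract is what R
# streaming scripts rely on as well. Python is used here purely as a stand-in.
import sys

def mapper(lines):
    for line in lines:
        for word in line.strip().split():
            print(f"{word}\t1")

def reducer(lines):
    current, count = None, 0
    for line in lines:
        word, value = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = word, 0
        count += int(value)
    if current is not None:
        print(f"{current}\t{count}")

if __name__ == "__main__":
    # Local simulation:  cat input.txt | python wc.py map | sort | python wc.py reduce
    mapper(sys.stdin) if sys.argv[1] == "map" else reducer(sys.stdin)
```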

107 citations


Patent
04 Mar 2013
TL;DR: A big data network or system for a process control system or plant includes a big data apparatus including a data storage area configured to store, using a common data schema, multiple types of process data and/or plant data (such as configuration and real-time data).
Abstract: A big data network or system for a process control system or plant includes a big data apparatus including a data storage area configured to store, using a common data schema, multiple types of process data and/or plant data (such as configuration and real-time data) that is used in, generated by or received by the process control system, and one or more data receiver computing devices to receive the data from multiple nodes or devices. The data may be cached and time-stamped at the nodes and streamed to the big data apparatus for storage. The process control system big data system provides services and/or data analyses to automatically or manually discover prescriptive and/or predictive knowledge, and to determine, based on the discovered knowledge, changes and/or additions to the process control system and to the set of services and/or analyses to optimize the process control system or plant.
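The patent describes nodes that time-stamp and cache data before streaming it to the big data appliance under a common schema; the following is a hypothetical sketch of that node-side behavior, with the schema fields and flush threshold invented for illustration.

```python
# Hypothetical sketch of the node-side behavior the patent describes:
# measurements are time-stamped, cached locally, and flushed ("streamed")
# to the big data appliance in batches under a common record schema.
# The schema fields and flush threshold are illustrative assumptions.
import time

COMMON_SCHEMA = ("timestamp", "node_id", "tag", "value")

class NodeCache:
    def __init__(self, node_id, flush_size=3, sink=None):
        self.node_id = node_id
        self.flush_size = flush_size
        self.sink = sink if sink is not None else []   # stands in for the appliance
        self.buffer = []

    def record(self, tag, value):
        self.buffer.append((time.time(), self.node_id, tag, value))
        if len(self.buffer) >= self.flush_size:
            self.flush()

    def flush(self):
        self.sink.extend(self.buffer)                  # "stream" the cached batch
        self.buffer.clear()

appliance = []
node = NodeCache("controller-7", sink=appliance)
for v in (20.1, 20.4, 20.3):
    node.record("reactor_temp", v)
print(len(appliance), "records stored under schema", COMMON_SCHEMA)
```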

Book ChapterDOI
05 Dec 2013
TL;DR: In this chapter, the authors tell the Big Data story from both the business perspective and the technology perspective, and survey the associated data, process, and management challenges.
Abstract: Chapter outline: Introduction; The Story as it is Told from the Business Perspective; The Story as it is Told from the Technology Perspective; Data Challenges (Volume; Variety, Combining Multiple Data Sets; Velocity; Veracity, Data Quality, Data Availability; Data Discovery; Quality and Relevance; Data Comprehensiveness; Personally Identifiable Information; Data Dogmatism; Scalability); Process Challenges; Management Challenges; Big Data Platforms Technology: Current State of the Art (Take the Analysis to the Data!; What Is Apache Hadoop?; Who Are the Hadoop Users?; An Example of an Advanced User: Amazon; Big Data in Data Warehouse or in Hadoop?; Big Data in the Database World (Early 1980s Till Now); Big Data in the Systems World (Late 1990s Till Now); Enterprise Search; Big Data "Dichotomy"; Hadoop and the Cloud; Hadoop Pros; Hadoop Cons); Technological Solutions for Big Data Analytics (Scalability and Performance at eBay; Unstructured Data; Cloud Computing and Open Source). Every day, 2.5 quintillion bytes of data are created. These data come from digital pictures, videos, posts to social media sites, intelligent sensors, purchase transaction records, and cell phone GPS signals, to name a few. This is known as Big Data.

Proceedings ArticleDOI
23 May 2013
TL;DR: An approach is proposed to achieve data security & privacy throughout the complete data lifecycle: data generation/collection, transfer, storage, processing, and sharing.
Abstract: A framework for maintaining security & preserving privacy for analysis of sensor data from smart homes, without compromising on data utility, is presented. Storing the personally identifiable data as hashed values withholds identifiable information from any computing nodes. However, the very nature of smart home data analytics is establishing preventive care, and data processing results should be identifiable to certain users responsible for direct care. Through a separate encrypted identifier dictionary with hashed and actual values of all unique sets of identifiers, we suggest re-identification of any data processing results. However, the level of re-identification needs to be controlled, depending on the type of user accessing the results. Generalization and suppression on identifiers from the identifier dictionary before re-introduction could achieve different levels of privacy preservation. In this paper we propose an approach to achieve data security & privacy throughout the complete data lifecycle: data generation/collection, transfer, storage, processing, and sharing.
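The abstract describes the mechanism only in prose; a simplified sketch of it follows: identifiers are stored as hashes, a separate identifier dictionary maps hashes back to actual values, and re-identification is generalized or suppressed depending on the user role. A real deployment would encrypt the dictionary and use keyed or salted hashes; both are omitted here, and the field names and roles are assumptions.

```python
# Simplified sketch of the scheme in the abstract: personally identifiable
# fields are stored as hashes, a separate dictionary maps hashes back to the
# real values, and re-identification is generalized/suppressed per user role.
# A real system would keep the dictionary encrypted and use keyed hashes;
# both are omitted here for brevity. Field names and roles are illustrative.
import hashlib

def h(value: str) -> str:
    return hashlib.sha256(value.encode()).hexdigest()

identifier_dictionary = {}          # hash -> actual identifiers (kept separately)

def store_reading(name, address, reading):
    key = h(name + "|" + address)
    identifier_dictionary[key] = {"name": name, "address": address}
    return {"subject": key, "reading": reading}       # what analytics nodes see

def reidentify(record, role):
    ids = identifier_dictionary[record["subject"]]
    if role == "care_provider":                       # full re-identification
        return {**record, **ids}
    if role == "researcher":                          # generalized identifiers
        return {**record, "name": "SUPPRESSED",
                "address": ids["address"].split(",")[-1].strip()}
    return record                                     # no re-identification

rec = store_reading("Alice Smith", "12 Elm St, Springfield", {"motion": 0})
print(reidentify(rec, "care_provider"))
print(reidentify(rec, "researcher"))
```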

Book
27 Jun 2013
TL;DR: The complementary nature of traditional data warehouses and big-data analytics platforms and how they feed each other are described, with a greater focus on architectures that leverage the scale and power of big data and the ability to integrate and apply analytics principles to data which earlier was not accessible.
Abstract: Big Data Imperatives focuses on resolving the key questions on everyone's mind: Which data matters? Do you have enough data volume to justify the usage? How do you want to process this amount of data? How long do you really need to keep it active for your analysis, marketing, and BI applications? Big data is emerging from the realm of one-off projects to mainstream business adoption; however, the real value of big data is not in its overwhelming size, but in its effective use. This book addresses the following big data characteristics: very large, distributed aggregations of loosely structured data that is often incomplete and inaccessible; petabytes/exabytes of data; millions/billions of people providing/contributing to the context behind the data; flat schemas with few complex interrelationships; time-stamped events; incomplete data; connections between data elements that must be probabilistically inferred. Big Data Imperatives explains what big data can do: it can batch process millions and billions of records, both unstructured and structured, much faster and cheaper. Big data analytics provide a platform to merge all analysis, which enables data analysis to be more accurate, well-rounded, reliable, and focused on a specific business capability. Big Data Imperatives describes the complementary nature of traditional data warehouses and big-data analytics platforms and how they feed each other. This book aims to bring the big data and analytics realms together, with a greater focus on architectures that leverage the scale and power of big data and the ability to integrate and apply analytics principles to data which earlier was not accessible. This book can also be used as a handbook for practitioners, helping them on methodology, technical architecture, analytics techniques, and best practices. At the same time, this book intends to hold the interest of those new to big data and analytics by giving them a deep insight into the realm of big data. What you'll learn: understanding the technology, implementation of big data platforms and their usage for analytics; big data architectures; big data design patterns; implementation best practices. Who this book is for: This book is designed for IT professionals, data warehousing and business intelligence professionals, data analysis professionals, architects, developers, and business users. Table of Contents: The New Information Management Paradigm; Big Data's Implication for Businesses; Big Data Implications for Information Management; Defining Big Data Architecture Characteristics; Co-Existent Architectures; Data Quality for Big Data; Data Security and Privacy Considerations for Big Data; Big Data and Analytics; Big Data Implications for Practitioners.

Patent
Peter Ciurea
01 Aug 2013
TL;DR: In this paper, a computing apparatus is described that includes a data warehouse storing first data representing a personal privacy policy of a user; a portal coupled between merchant systems and points of interaction of users; and a transaction handler coupled between acquirer processors and issuer processors.
Abstract: A computing apparatus includes: a data warehouse storing first data representing a personal privacy policy of a user; a portal coupled between merchant systems and points of interaction of users; and a transaction handler coupled between acquirer processors and issuer processors. The portal and the transaction handler are configured to shield user data, such as address, account information, etc., from the merchant systems in accordance with the personal privacy policy of the user.

Journal ArticleDOI
TL;DR: This paper considers whether current and proposed regulation strikes the right balance between the risks and benefits of Big Data, issues which cut across the whole of the Big Data lifecycle: collection, combination, analysis, and use.

Proceedings ArticleDOI
23 Dec 2013
TL;DR: This paper analyzes existing big data applications by taking into consideration the core elements of a business (via the business model canvas) and presents how these applications provide value to their customers while making a profit from using big data.
Abstract: Large and complex data that is difficult to handle with traditional data processing applications has triggered the development of big data applications, which have become more pervasive than ever before. In the era of big data, data exploration and analysis have turned into a difficult problem in many sectors, such as the smart routing and health care sectors. Companies that can adapt their businesses well to leverage big data have significant advantages over those that lag in this capability. The need to explore new approaches to address the challenges of big data forces companies to shape their business models accordingly. In this paper, we summarize and share our findings regarding the business models deployed in big data applications in different sectors. We analyze existing big data applications by taking into consideration the core elements of a business (via the business model canvas) and present how these applications provide value to their customers while making a profit from using big data.

Proceedings ArticleDOI
22 Jul 2013
TL;DR: This paper explores the convergence of Data Warehousing, OLAP, and data-intensive Cloud Infrastructures in the context of so-called analytics over Big Data.
Abstract: This paper explores the convergence of Data Warehousing, OLAP, and data-intensive Cloud Infrastructures in the context of so-called analytics over Big Data. The paper briefly reviews some state-of-the-art proposals, highlights open research issues and, finally, draws possible research directions in this scientific field.

Proceedings ArticleDOI
13 Oct 2013
TL;DR: The aim is to enable organizations to better manage and architect a very large Big Data application to gain competitive advantage by allowing management to have a better handle on data processing.
Abstract: We are constantly being told that we live in the Information Era - the Age of BIG data. It is clearly apparent that organizations need to employ data-driven decision making to gain competitive advantage. Processing, integrating and interacting with more data should make it better data, providing both more panoramic and more granular views to aid strategic decision making. This is made possible via Big Data exploiting affordable and usable Computational and Storage Resources. Many offerings are based on the Map-Reduce and Hadoop paradigms, and most focus solely on the analytical side. Nonetheless, in many respects it remains unclear what Big Data actually is; current offerings appear as isolated silos that are difficult to integrate and/or make it difficult to better utilize existing data and systems. The paper addresses this lacuna by characterising the facets of Big Data and proposing a framework in which Big Data applications can be developed. The framework consists of three Stages and seven Layers to divide a Big Data application into modular blocks. The aim is to enable organizations to better manage and architect a very large Big Data application to gain competitive advantage by allowing management to have a better handle on data processing.

Patent
Peter Ciurea
10 Jun 2013
TL;DR: In this paper, a rule engine coupled with the data warehouse and the portal is used to identify a policy customization to bridge the gap between the personal privacy policy of the user and the site privacy policy.
Abstract: A computing apparatus includes: a data warehouse storing first data representing a personal privacy policy of a user; a portal configured to communicate with a remote system storing second data representing a site privacy policy of an entity with whom the user interacts; and a rule engine coupled with the data warehouse and the portal to identify a policy customization to bridge a gap between the personal privacy policy of the user and the site privacy policy of the entity.
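The claim is stated abstractly; as an illustration only, the sketch below compares a user's personal policy with a site's policy and emits the customizations needed to close the gap. The policy fields and the "most restrictive wins" rule are invented for the example, not taken from the patent.

```python
# Illustrative sketch of the claimed rule engine: compare the user's personal
# privacy policy with the site's policy and emit the customizations needed to
# bridge the gap. Policy fields and the "most restrictive wins" rule are
# invented for this example, not taken from the patent.
RESTRICTIVENESS = {"share": 0, "aggregate_only": 1, "do_not_share": 2}

def bridge_gap(personal_policy, site_policy):
    customizations = {}
    for field, user_pref in personal_policy.items():
        site_pref = site_policy.get(field, "share")
        if RESTRICTIVENESS[user_pref] > RESTRICTIVENESS[site_pref]:
            customizations[field] = user_pref     # tighten the site's handling
    return customizations

personal = {"address": "do_not_share", "purchase_history": "aggregate_only"}
site = {"address": "share", "purchase_history": "aggregate_only"}
print(bridge_gap(personal, site))                 # {'address': 'do_not_share'}
```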

Journal ArticleDOI
TL;DR: A comprehensive review of the commonly used data analysis methods applied to building-related data is conducted, and a data analysis process and data mining framework are proposed that enable building-related data to be analyzed more efficiently.
Abstract: Energy management systems provide an opportunity to collect vast amounts of building-related data. The data contain abundant knowledge about the interactions between a building’s energy consumption and the influencing factors. It is highly desirable that the hidden knowledge can be extracted from the data in order to help improve building energy performance. However, the data are rarely translated into useful knowledge due to their complexity and a lack of effective data analysis techniques. This paper first conducts a comprehensive review of the commonly used data analysis methods applied to building-related data. Both the strengths and weaknesses of each method are discussed. Then, the critical analysis of the previous solutions to three fundamental problems of building energy performance improvement that remain significant barriers is performed. Considering the limitations of those commonly used data analysis methods, data mining techniques are proposed as a primary tool to analyze building-related data. Moreover, a data analysis process and a data mining framework are proposed that enable building-related data to be analyzed more efficiently. The process refers to a series of sequential steps in analyzing data. The framework includes different data mining techniques and algorithms, from which a set of efficient data analysis methodologies can be developed. The applications of the process and framework to two sets of collected data demonstrate their applicability and abilities to extract useful knowledge. Particularly, four data analysis methodologies were developed to solve the three problems. For demonstration purposes, these methodologies were applied to the collected data. These methodologies are introduced in the published papers and are summarized in this paper. More extensive investigations will be performed in order to further evaluate the effectiveness of the framework.
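The paper's own methodologies are only summarized above; as a minimal sketch of the kind of data mining step the framework advocates, the example below clusters synthetic daily load profiles with k-means to reveal typical operating patterns. Both the data and the choice of algorithm are assumptions for illustration.

```python
# Minimal sketch of one data mining step of the kind the review advocates:
# clustering daily energy-load profiles to reveal typical operating patterns.
# The data below are synthetic and the choice of k-means is an assumption for
# illustration; the paper's own methodologies are only summarized in prose.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
weekday = 50 + 30 * np.exp(-((np.arange(24) - 13) ** 2) / 18.0)   # midday peak
weekend = np.full(24, 55.0)                                        # flat profile
profiles = np.vstack([weekday + rng.normal(0, 2, 24) for _ in range(20)] +
                     [weekend + rng.normal(0, 2, 24) for _ in range(8)])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(profiles)
print("cluster sizes:", np.bincount(labels))       # expect roughly 20 and 8
```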

Proceedings ArticleDOI
01 Nov 2013
TL;DR: This work presents a unified cloud platform for batch log data analysis combining Hadoop, which provides a distributed file system and an off-line batch computing framework, with Spark, whose computing pattern is based on distributed memory.
Abstract: Logs are the main source of information about system operation status, user behavior analysis, and so on. A log analysis system needs not only massive and stable data processing ability but also adaptation to a variety of scenarios under efficiency requirements, which can't be achieved with standalone analysis tools or even a single cloud computing framework. We present a unified cloud platform for batch log data analysis with the combination of Hadoop and Spark. Hadoop provides a distributed file system and an off-line batch computing framework, while the computing pattern in Spark is based on distributed memory. The combination of Hadoop, Spark, and the data warehouse and analysis tools Hive and Shark makes it possible to provide a unified cloud platform with batch analysis and in-memory computing capacity, in order to process logs in a highly available, stable, and efficient way.
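As a rough sketch of the in-memory (Spark) half of such a platform, the PySpark job below counts ERROR lines per hour from HDFS-resident logs; the HDFS path and the assumed log format are illustrative, not taken from the paper.

```python
# Rough sketch of the in-memory (Spark) half of such a platform: a batch job
# that reads log files from HDFS and counts ERROR lines per hour.
# The HDFS path and the "timestamp level message" log format are assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("batch-log-analysis").getOrCreate()
lines = spark.sparkContext.textFile("hdfs:///logs/app/2013-11-01/*")

errors_per_hour = (
    lines.filter(lambda line: " ERROR " in line)
         .map(lambda line: (line[:13], 1))      # "YYYY-MM-DD HH" prefix as key
         .reduceByKey(lambda a, b: a + b)
)

for hour, count in errors_per_hour.collect():
    print(hour, count)
spark.stop()
```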

Journal ArticleDOI
TL;DR: In this paper, the authors design a self-scaling registry technology for collaborative data sharing, based upon the widely adopted Informatics for Integrating Biology and the Bedside (i2b2) data warehousing framework and the Shared Health Research Information Network (SHRINE) peer-to-peer networking software.

Proceedings ArticleDOI
22 Jun 2013
TL;DR: This paper gives an overview of SQL Server's column stores and batch processing, in particular the enhancements introduced in the upcoming release.
Abstract: SQL Server 2012 introduced two innovations targeted for data warehousing workloads: column store indexes and batch (vectorized) processing mode. Together they greatly improve performance of typical data warehouse queries, routinely by 10X and in some cases by a 100X or more. The main limitations of the initial version are addressed in the upcoming release. Column store indexes are updatable and can be used as the base storage for a table. The repertoire of batch mode operators has been expanded, existing operators have been improved, and query optimization has been enhanced. This paper gives an overview of SQL Server's column stores and batch processing, in particular the enhancements introduced in the upcoming release.
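SQL Server's internals are not reproduced here; as a language-neutral illustration of why column storage plus batch (vectorized) execution pays off, the sketch below contrasts a row-at-a-time scan with a vectorized scan over a single column using NumPy. The data and sizes are arbitrary.

```python
# Conceptual illustration (not SQL Server code) of why column storage plus
# batch/vectorized execution is fast: a predicate touches only the column it
# needs, and the filter is evaluated over whole batches of values at once.
import numpy as np
import time

n = 500_000
rows = [{"qty": int(q), "price": float(p)}
        for q, p in zip(np.random.randint(1, 100, n), np.random.rand(n))]
qty_column = np.array([r["qty"] for r in rows])    # columnar copy of one column

t0 = time.perf_counter()
row_at_a_time = sum(1 for r in rows if r["qty"] > 90)      # tuple-at-a-time scan
t1 = time.perf_counter()
vectorized = int((qty_column > 90).sum())                  # batch-mode style scan
t2 = time.perf_counter()

print(row_at_a_time, vectorized)                   # same answer from both scans
print(f"row-at-a-time: {t1 - t0:.3f}s, vectorized: {t2 - t1:.4f}s")
```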

Journal ArticleDOI
TL;DR: Aggregating multiple data sources in a data warehouse, combined with tools for extracting relevant parameters, is beneficial for data collection times and offers the ability to improve data quality.

Book ChapterDOI
01 Jan 2013
TL;DR: This chapter provides a concise, example-driven introduction to what Big Data is and why any organization needs to understand its value.
Abstract: Why this book? Why now? The goal of this book is to provide readers with a concise perspective into the biggest buzz in the industry, Big Data, and, more importantly, its impact on data processing, management, decision support, and data warehousing. At the time of this writing, there is a lot of interest in adopting Big Data solutions, but also profound confusion about the future of data warehousing and of the many investments that have been made over the years in building decision support platforms. This book addresses those areas of concern and provides readers with an introduction to the next generation of data management and data warehousing. This chapter provides a concise, example-driven introduction to what Big Data is and why any organization needs to understand its value.

Proceedings ArticleDOI
03 Jul 2013
TL;DR: A DaaS approach for intelligent sharing and processing of large data collections is proposed, with the aim of abstracting the data location (by making it relevant to the needs of sharing and accessing) and fully decoupling the data and its processing.
Abstract: Data as a Service (DaaS) is among the latest kinds of services being investigated in the Cloud computing community. The main aim of DaaS is to overcome limitations of state-of-the-art approaches in data technologies, according to which data is stored and accessed from repositories whose location is known and is relevant for sharing and processing. Besides limitations for data sharing, current approaches also do not fully separate/decouple software services from data and thus impose limitations on interoperability. In this paper we propose a DaaS approach for intelligent sharing and processing of large data collections, with the aim of abstracting the data location (by making it relevant to the needs of sharing and accessing) and fully decoupling the data and its processing. The aim of our approach is to build a Cloud computing platform offering DaaS to support large communities of users that need to share, access, and process the data for collectively building knowledge from data. We exemplify the approach with large data collections from the health and biology domains.
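As an illustration of the core idea only, the sketch below lets clients name a data collection logically while the service resolves its physical location and applies the requested processing, keeping data, location, and computation decoupled. The registry contents and functions are invented.

```python
# Illustration of the core DaaS idea in the abstract: clients name a data
# collection logically, and the service resolves its physical location and
# runs the requested processing, decoupling data, location, and computation.
# The registry contents and processing functions are invented for the example.
REGISTRY = {
    "genome-variants": "s3://bio-bucket/variants/",
    "patient-visits": "hdfs:///health/visits/",
}

def fetch(location):
    # Stand-in for the actual retrieval from the resolved location.
    return [f"record from {location}"]

def daas_process(collection_name, processing):
    location = REGISTRY[collection_name]    # location stays hidden from the client
    return [processing(record) for record in fetch(location)]

print(daas_process("patient-visits", str.upper))
```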

Patent
25 Nov 2013
TL;DR: In this article, a distributed data warehouse system may maintain data blocks on behalf of clients in multiple clusters in a data store, each cluster may include a single leader node and multiple compute nodes, each including multiple disks storing data.
Abstract: A distributed data warehouse system may maintain data blocks on behalf of clients in multiple clusters in a data store. Each cluster may include a single leader node and multiple compute nodes, each including multiple disks storing data. The warehouse system may store primary and secondary copies of each data block on different disks or nodes in a cluster. Each node may include a data structure that maintains metadata about each data block stored on the node, including its unique identifier. The warehouse system may back up data blocks in a remote key-value backup storage system with high durability. A streaming restore operation may be used to retrieve data blocks from backup storage using their unique identifiers as keys. The warehouse system may service incoming queries (and may satisfy some queries by retrieving data from backup storage on an as-needed basis) prior to completion of the restore operation.
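A minimal sketch, using invented names, of two elements the patent describes: the per-node data structure holding metadata about each data block (keyed by its unique identifier), and a streaming restore that fetches blocks from a key-value backup store using those identifiers as keys.

```python
# Minimal sketch of two elements the patent describes: per-node metadata about
# each data block (keyed by the block's unique identifier) and a streaming
# restore that pulls blocks from a key-value backup store using those IDs.
# Class and field names are invented for the illustration.
import uuid

class ComputeNode:
    def __init__(self):
        self.block_metadata = {}        # block_id -> {disk, restored, ...}
        self.blocks = {}                # block_id -> payload

    def store_block(self, payload, disk):
        block_id = str(uuid.uuid4())
        self.blocks[block_id] = payload
        self.block_metadata[block_id] = {"disk": disk, "restored": False}
        return block_id

backup_store = {}                        # stands in for the remote key-value backup

def backup(node):
    for block_id, payload in node.blocks.items():
        backup_store[block_id] = payload

def streaming_restore(node, block_ids):
    """Restore blocks one by one; already-restored blocks can serve queries."""
    for block_id in block_ids:
        node.blocks[block_id] = backup_store[block_id]
        node.block_metadata[block_id] = {"disk": "disk0", "restored": True}
        yield block_id

node = ComputeNode()
bid = node.store_block(b"customer rows 0-9999", disk="disk1")
backup(node)
fresh = ComputeNode()                    # e.g. a replacement node after failure
for restored in streaming_restore(fresh, [bid]):
    print("restored block", restored)
```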

Patent
03 Jun 2013
TL;DR: In this article, a system and computer-implemented method for automating data warehousing processes is provided, which comprises a code generator configured to generate codes for Extract, Transform and Load (ETL) tools, wherein the codes facilitate the ETL tools in extracting, transforming and loading data read from data sources.
Abstract: A system and computer-implemented method for automating data warehousing processes is provided. The system comprises a code generator configured to generate codes for Extract, Transform and Load (ETL) tools, wherein the codes facilitate the ETL tools in extracting, transforming and loading data read from data sources. The system further comprises a code reviewer configured to review and analyze the generated codes. Furthermore, the system comprises a data migration module configured to facilitate migrating the data read from the data sources to one or more data warehouses. Also, the system comprises a data generator configured to mask the data read from the data sources to generate processed data. In addition, the system comprises a Data Warehouse Quality Assurance module configured to facilitate testing the read and the processed data. The system further comprises a reporting module configured to provide status reports on the data warehousing processes.
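The patent lists a data generator that masks source data before it is used for testing; the fragment below is a hypothetical sketch of such masking, with the field names and masking rules invented for illustration.

```python
# Hypothetical sketch of the masking step the patent's "data generator"
# performs: sensitive fields are obscured before the data is used for testing,
# while non-sensitive fields pass through. Field names and rules are invented.
import hashlib

MASK_RULES = {
    "email": lambda v: hashlib.md5(v.encode()).hexdigest()[:8] + "@example.com",
    "ssn": lambda v: "***-**-" + v[-4:],
}

def mask_row(row, rules=MASK_RULES):
    return {k: rules[k](v) if k in rules else v for k, v in row.items()}

source_row = {"id": 17, "email": "alice@corp.com", "ssn": "123-45-6789", "region": "EU"}
print(mask_row(source_row))
```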

Book ChapterDOI
01 Jan 2013
TL;DR: This chapter reviews some of the main contributions of Perm, a DBMS that generates different types of provenance information for complex SQL queries (including nested and correlated subqueries and aggregation).
Abstract: In applications such as data warehousing or data exchange, the ability to efficiently generate and query provenance information is crucial to understand the origin of data. In this chapter, we review some of the main contributions of Perm, a DBMS that generates different types of provenance information for complex SQL queries (including nested and correlated subqueries and aggregation). The two key ideas behind Perm are representing data and its provenance together in a single relation and relying on query rewrites to generate this representation. Through this, Perm supports fully integrated, on-demand provenance generation and querying using SQL. Since Perm rewrites a query requesting provenance into a regular SQL query and generates easily optimizable SQL code, its performance greatly benefits from the query optimization techniques provided by the underlying DBMS.
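Perm's actual mechanism is SQL query rewriting inside the DBMS; purely to illustrate the "data and its provenance in a single relation" idea, the sketch below evaluates a selection and an aggregation over in-memory tuples that carry a provenance set alongside their values.

```python
# Illustration only (Perm itself rewrites SQL inside the DBMS): each tuple
# carries its provenance alongside its values, and operators propagate the
# provenance so every result row can be traced back to the source rows.
orders = [  # (values, provenance: set of source tuple ids)
    ({"cust": "acme", "amount": 100}, {"orders:1"}),
    ({"cust": "acme", "amount": 250}, {"orders:2"}),
    ({"cust": "zeta", "amount": 40},  {"orders:3"}),
]

def select(rel, pred):
    return [(vals, prov) for vals, prov in rel if pred(vals)]

def sum_by(rel, key, measure):
    groups = {}
    for vals, prov in rel:
        total, provs = groups.get(vals[key], (0, set()))
        groups[vals[key]] = (total + vals[measure], provs | prov)
    return [({key: k, "sum": t}, p) for k, (t, p) in groups.items()]

big_orders = select(orders, lambda v: v["amount"] >= 100)
print(sum_by(big_orders, "cust", "amount"))
# [({'cust': 'acme', 'sum': 350}, {'orders:1', 'orders:2'})]
```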