
Showing papers on "Data warehouse published in 2014"


Journal ArticleDOI
TL;DR: A HACE theorem is presented that characterizes the features of the Big Data revolution, and a Big Data processing model is proposed, from the data mining perspective, which involves demand-driven aggregation of information sources, mining and analysis, user interest modeling, and security and privacy considerations.
Abstract: Big Data concern large-volume, complex, growing data sets with multiple, autonomous sources. With the fast development of networking, data storage, and the data collection capacity, Big Data are now rapidly expanding in all science and engineering domains, including physical, biological and biomedical sciences. This paper presents a HACE theorem that characterizes the features of the Big Data revolution, and proposes a Big Data processing model, from the data mining perspective. This data-driven model involves demand-driven aggregation of information sources, mining and analysis, user interest modeling, and security and privacy considerations. We analyze the challenging issues in the data-driven model and also in the Big Data revolution.

2,233 citations


Journal ArticleDOI
TL;DR: This paper presents a systematic framework to decompose big data systems into four sequential modules, namely data generation, data acquisition, data storage, and data analytics, and presents the prevalent Hadoop framework for addressing big data challenges.
Abstract: Recent technological advancements have led to a deluge of data from distinctive domains (e.g., health care and scientific sensors, user-generated data, Internet and financial companies, and supply chain systems) over the past two decades. The term big data was coined to capture the meaning of this emerging trend. In addition to its sheer volume, big data also exhibits other unique characteristics as compared with traditional data. For instance, big data is commonly unstructured and requires more real-time analysis. This development calls for new system architectures for data acquisition, transmission, storage, and large-scale data processing mechanisms. In this paper, we present a literature survey and system tutorial for big data analytics platforms, aiming to provide an overall picture for nonexpert readers and instill a do-it-yourself spirit for advanced audiences to customize their own big-data solutions. First, we present the definition of big data and discuss big data challenges. Next, we present a systematic framework to decompose big data systems into four sequential modules, namely data generation, data acquisition, data storage, and data analytics. These four modules form a big data value chain. Following that, we present a detailed survey of numerous approaches and mechanisms from research and industry communities. In addition, we present the prevalent Hadoop framework for addressing big data challenges. Finally, we outline several evaluation benchmarks and potential research directions for big data systems.

1,002 citations


Book
30 Aug 2014
TL;DR: This book is intended to review the tasks that fill the gap between data acquisition from the source and the data mining process, offering a comprehensive look from a practical point of view that covers basic concepts and surveys the techniques proposed in the specialized literature.
Abstract: Data Preprocessing for Data Mining addresses one of the most important issues within the well-known Knowledge Discovery from Data process. Data taken directly from the source will likely have inconsistencies and errors or, most importantly, will not be ready to be considered for a data mining process. Furthermore, the increasing amount of data in recent science, industry and business applications calls for more complex tools to analyze it. Thanks to data preprocessing, it is possible to convert the impossible into the possible, adapting the data to fulfill the input demands of each data mining algorithm. Data preprocessing includes data reduction techniques, which aim at reducing the complexity of the data and detecting or removing irrelevant and noisy elements from it. This book is intended to review the tasks that fill the gap between data acquisition from the source and the data mining process. A comprehensive look from a practical point of view, including basic concepts and a survey of the techniques proposed in the specialized literature, is given. Each chapter is a stand-alone guide to a particular data preprocessing topic, from basic concepts and detailed descriptions of classical algorithms to an exhaustive catalog of recent developments. The in-depth technical descriptions make this book suitable for technical professionals, researchers, and senior undergraduate and graduate students in data science, computer science and engineering.

678 citations


Journal ArticleDOI
Lei Xu1, Chunxiao Jiang1, Jian Wang1, Jian Yuan1, Yong Ren1 
TL;DR: This paper identifies four different types of users involved in data mining applications, namely, data provider, data collector, data miner, and decision maker, and examines various approaches that can help to protect sensitive information.
Abstract: The growing popularity and development of data mining technologies bring serious threats to the security of individuals' sensitive information. An emerging research topic in data mining, known as privacy-preserving data mining (PPDM), has been extensively studied in recent years. The basic idea of PPDM is to modify the data in such a way that data mining algorithms can be performed effectively without compromising the security of sensitive information contained in the data. Current studies of PPDM mainly focus on how to reduce the privacy risk brought by data mining operations, while in fact, unwanted disclosure of sensitive information may also happen in the process of data collecting, data publishing, and information (i.e., the data mining results) delivering. In this paper, we view the privacy issues related to data mining from a wider perspective and investigate various approaches that can help to protect sensitive information. In particular, we identify four different types of users involved in data mining applications, namely, data provider, data collector, data miner, and decision maker. For each type of user, we discuss his privacy concerns and the methods that can be adopted to protect sensitive information. We briefly introduce the basics of related research topics, review state-of-the-art approaches, and present some preliminary thoughts on future research directions. Besides exploring the privacy-preserving approaches for each type of user, we also review the game theoretical approaches, which are proposed for analyzing the interactions among different users in a data mining scenario, each of whom has his own valuation on the sensitive information. By differentiating the responsibilities of different users with respect to security of sensitive information, we would like to provide some useful insights into the study of PPDM.

528 citations


Journal ArticleDOI
01 Dec 2014
TL;DR: The overall system architecture design decisions are presented, Stratosphere is introduced through example queries, and the internal workings of the system’s components relating to extensibility, programming model, optimization, and query execution are examined in detail.
Abstract: We present Stratosphere, an open-source software stack for parallel data analysis. Stratosphere brings together a unique set of features that allow the expressive, easy, and efficient programming of analytical applications at very large scale. Stratosphere's features include "in situ" data processing, a declarative query language, treatment of user-defined functions as first-class citizens, automatic program parallelization and optimization, support for iterative programs, and a scalable and efficient execution engine. Stratosphere covers a variety of "Big Data" use cases, such as data warehousing, information extraction and integration, data cleansing, graph analysis, and statistical analysis applications. In this paper, we present the overall system architecture design decisions, introduce Stratosphere through example queries, and then dive into the internal workings of the system's components that relate to extensibility, programming model, optimization, and query execution. We experimentally compare Stratosphere against popular open-source alternatives, and we conclude with a research outlook for the next years.

491 citations


Journal ArticleDOI
24 Mar 2014
TL;DR: The HMORN VDW data model, its governance principles, data content, and quality assurance procedures are highlighted to help those wishing to implement a distributed interoperable health care data system.
Abstract: The HMO Research Network (HMORN) Virtual Data Warehouse (VDW) is a public, non-proprietary, research-focused data model implemented at 17 health care systems across the United States. The HMORN has created a governance structure and specified policies concerning the VDW’s content, development, implementation, and quality assurance. Data extracted from the VDW have been used by thousands of studies published in peer-reviewed journal articles. Advances in software supporting care delivery and claims processing and the availability of new data sources have greatly expanded the data available for research, but substantially increased the complexity of data management. The VDW data model incorporates software and data advances to ensure that comprehensive, up-to-date data of known quality are available for research. VDW governance works to accommodate new data and system complexities. This article highlights the HMORN VDW data model, its governance principles, data content, and quality assurance procedures. Our goal is to share the VDW data model and its operations with those wishing to implement a distributed interoperable health care data system.

307 citations


Journal ArticleDOI
TL;DR: Big data's success in creating value in the health care sector may require changes in current policies to balance the potential societal benefits of big-data approaches and the protection of patients' confidentiality.
Abstract: Big data has the potential to create significant value in health care by improving outcomes while lowering costs. Big data’s defining features include the ability to handle massive data volume and variety at high velocity. New, flexible, and easily expandable information technology (IT) infrastructure, including so-called data lakes and cloud data storage and management solutions, makes big-data analytics possible. However, most health IT systems still rely on data warehouse structures. Without the right IT infrastructure, analytic tools, visualization approaches, work flows, and interfaces, the insights provided by big data are likely to be limited. Big data’s success in creating value in the health care sector may require changes in current policies to balance the potential societal benefits of big-data approaches and the protection of patients’ confidentiality. Other policy implications of using big data are that many current practices and policies related to data use, access, sharing, privacy, and stewa...

266 citations


Proceedings ArticleDOI
Barna Saha1, Divesh Srivastava1
19 May 2014
TL;DR: This tutorial presents recent results that are relevant to big data quality management, focusing on the two major dimensions of (i) discovering quality issues from the data itself and (ii) trading off accuracy vs. efficiency, and identifies a range of open problems for the community.
Abstract: In our Big Data era, data is being generated, collected and analyzed at an unprecedented scale, and data-driven decision making is sweeping through all aspects of society. Recent studies have shown that poor quality data is prevalent in large databases and on the Web. Since poor quality data can have serious consequences on the results of data analyses, the importance of veracity, the fourth ‘V’ of big data, is increasingly being recognized. In this tutorial, we highlight the substantial challenges that the first three ‘V’s, volume, velocity and variety, bring to dealing with veracity in big data. Due to the sheer volume and velocity of data, one needs to understand and (possibly) repair erroneous data in a scalable and timely manner. With the variety of data, often from a diversity of sources, data quality rules cannot be specified a priori; one needs to let the “data speak for itself” in order to discover the semantics of the data. This tutorial presents recent results that are relevant to big data quality management, focusing on the two major dimensions of (i) discovering quality issues from the data itself, and (ii) trading off accuracy vs. efficiency, and identifies a range of open problems for the community.

203 citations


Book
27 Nov 2014
TL;DR: This book will show you how to analyze data, uncover hidden patterns and relationships to aid important decisions and predictions, and implement a simple step-by-step process for predicting an outcome or discovering hidden relationships using RapidMiner, an open source GUI based data mining tool.
Abstract: Put Predictive Analytics into Action. Learn the basics of Predictive Analysis and Data Mining through an easy to understand conceptual framework and immediately practice the concepts learned using the open source RapidMiner tool. Whether you are brand new to Data Mining or working on your tenth project, this book will show you how to analyze data, uncover hidden patterns and relationships to aid important decisions and predictions. Data Mining has become an essential tool for any enterprise that collects, stores and processes data as part of its operations. This book is ideal for business users, data analysts, business analysts, business intelligence and data warehousing professionals and for anyone who wants to learn Data Mining. You'll be able to: 1. Gain the necessary knowledge of different data mining techniques, so that you can select the right technique for a given data problem and create a general purpose analytics process. 2. Get up and running fast with more than two dozen commonly used powerful algorithms for predictive analytics using practical use cases. 3. Implement a simple step-by-step process for predicting an outcome or discovering hidden relationships from the data using RapidMiner, an open source GUI-based data mining tool. Predictive analytics and Data Mining techniques covered: Exploratory Data Analysis, Visualization, Decision trees, Rule induction, k-Nearest Neighbors, Naïve Bayesian, Artificial Neural Networks, Support Vector Machines, Ensemble models, Bagging, Boosting, Random Forests, Linear regression, Logistic regression, Association analysis using Apriori and FP-Growth, K-Means clustering, Density-based clustering, Self-Organizing Maps, Text Mining, Time series forecasting, Anomaly detection and Feature selection. Implementation files can be downloaded from the book companion site at www.LearnPredictiveAnalytics.com. Demystifies data mining concepts with easy to understand language. Shows how to get up and running fast with 20 commonly used powerful techniques for predictive analysis. Explains the process of using the open source RapidMiner tools. Discusses a simple 5-step process for implementing algorithms that can be used for performing predictive analytics. Includes practical use cases and examples.

194 citations


Journal ArticleDOI
01 Oct 2014
TL;DR: AsterixDB as mentioned in this paper is a full-function BDMS (Big Data Management System) with a feature set that distinguishes it from other platforms in today's open source Big Data ecosystem.
Abstract: AsterixDB is a new, full-function BDMS (Big Data Management System) with a feature set that distinguishes it from other platforms in today's open source Big Data ecosystem. Its features make it well-suited to applications like web data warehousing, social data storage and analysis, and other use cases related to Big Data. AsterixDB has a flexible NoSQL style data model; a query language that supports a wide range of queries; a scalable runtime; partitioned, LSM-based data storage and indexing (including B+-tree, R-tree, and text indexes); support for external as well as natively stored data; a rich set of built-in types; support for fuzzy, spatial, and temporal types and queries; a built-in notion of data feeds for ingestion of data; and transaction support akin to that of a NoSQL store. Development of AsterixDB began in 2009 and led to a mid-2013 initial open source release. This paper is the first complete description of the resulting open source AsterixDB system. Covered herein are the system's data model, its query language, and its software architecture. Also included are a summary of the current status of the project and a first glimpse into how AsterixDB performs when compared to alternative technologies, including a parallel relational DBMS, a popular NoSQL store, and a popular Hadoop-based SQL data analytics platform, for things that both technologies can do. Also included is a brief description of some initial trials that the system has undergone and the lessons learned (and plans laid) based on those early "customer" engagements.

185 citations


Posted Content
TL;DR: This paper is the first complete description of the resulting open source AsterixDB system, covering the system's data model, its query language, and its software architecture.
Abstract: AsterixDB is a new, full-function BDMS (Big Data Management System) with a feature set that distinguishes it from other platforms in today's open source Big Data ecosystem. Its features make it well-suited to applications like web data warehousing, social data storage and analysis, and other use cases related to Big Data. AsterixDB has a flexible NoSQL style data model; a query language that supports a wide range of queries; a scalable runtime; partitioned, LSM-based data storage and indexing (including B+-tree, R-tree, and text indexes); support for external as well as natively stored data; a rich set of built-in types; support for fuzzy, spatial, and temporal types and queries; a built-in notion of data feeds for ingestion of data; and transaction support akin to that of a NoSQL store. Development of AsterixDB began in 2009 and led to a mid-2013 initial open source release. This paper is the first complete description of the resulting open source AsterixDB system. Covered herein are the system's data model, its query language, and its software architecture. Also included are a summary of the current status of the project and a first glimpse into how AsterixDB performs when compared to alternative technologies, including a parallel relational DBMS, a popular NoSQL store, and a popular Hadoop-based SQL data analytics platform, for things that both technologies can do. Also included is a brief description of some initial trials that the system has undergone and the lessons learned (and plans laid) based on those early "customer" engagements.

Patent
08 Oct 2014
TL;DR: In this paper, a system and methods for detecting intrusions in the operation of a computer system comprises a sensor configured to gather information regarding the operation, to format the information in a data record having a predetermined format, and to transmit the data in the predetermined data format.
Abstract: A system and methods for detecting intrusions in the operation of a computer system comprises a sensor configured to gather information regarding the operation of the computer system, to format the information in a data record having a predetermined format, and to transmit the data in the predetermined data format. A data warehouse is configured to receive the data record from the sensor in the predetermined data format and to store the data in a SQL database. A detection model generator is configured to request data records from the data warehouse in the predetermined data format, to generate an intrusion detection model based on said data records, and to transmit the intrusion detection model to the data warehouse according to the predetermined data format. A detector is configured to receive a data record in the predetermined data format from the sensor and to classify the data record in real-time as one of normal operation and an attack based on said intrusion detection model. A data analysis engine is configured to request data records from the data warehouse according to the predetermined data format and to perform a data processing function on the data records.
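
As a rough illustration of the architecture these claims describe (a sensor, a SQL-backed warehouse, a model generator, and a detector sharing one record format), the Python sketch below wires toy versions of the components together; the record fields, the SQLite stand-in for the SQL database, and the thresholding "model" are illustrative assumptions, not the patented implementation.

```python
import sqlite3
import time
from statistics import mean, pstdev

def sensor_emit(source, syscalls_per_sec, failed_logins):
    """Format a raw observation into the predetermined record layout."""
    return (time.time(), source, float(syscalls_per_sec), int(failed_logins))

class Warehouse:
    """Receives records in the predetermined format and stores them in SQL."""
    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS records "
            "(timestamp REAL, source TEXT, syscalls_per_sec REAL, failed_logins INTEGER)"
        )

    def insert(self, record):
        self.db.execute("INSERT INTO records VALUES (?, ?, ?, ?)", record)

    def fetch_all(self):
        return self.db.execute("SELECT * FROM records").fetchall()

def build_model(records):
    """Toy stand-in for the detection model generator: mean and spread of one feature."""
    values = [r[2] for r in records]               # syscalls_per_sec column
    return {"mu": mean(values), "sigma": pstdev(values) or 1.0}

def detect(record, model, k=3.0):
    """Classify a freshly formatted record against the generated model."""
    return "attack" if abs(record[2] - model["mu"]) > k * model["sigma"] else "normal"

wh = Warehouse()
for load in (10, 12, 11, 9, 13):                   # baseline traffic from the sensor
    wh.insert(sensor_emit("host-1", load, 0))
model = build_model(wh.fetch_all())
print(detect(sensor_emit("host-1", 11, 0), model))   # -> normal
print(detect(sensor_emit("host-1", 250, 4), model))  # -> attack
```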

Proceedings ArticleDOI
18 Jun 2014
TL;DR: A community-based effort on technical advancements in Hive provides significant improvements on storage efficiency and query execution performance and shows how academic research lays a foundation for Hive to improve its daily operations.
Abstract: Apache Hive is a widely used data warehouse system for Apache Hadoop, and has been adopted by many organizations for various big data analytics applications. Closely working with many users and organizations, we have identified several shortcomings of Hive in its file formats, query planning, and query execution, which are key factors determining the performance of Hive. In order to make Hive continuously satisfy the requests and requirements of processing increasingly high volumes of data in a scalable and efficient way, we have set two goals related to storage and runtime performance in our efforts on advancing Hive. First, we aim to maximize the effective storage capacity and to accelerate data accesses to the data warehouse by updating the existing file formats. Second, we aim to significantly improve cluster resource utilization and runtime performance of Hive by developing a highly optimized query planner and a highly efficient query execution engine. In this paper, we present a community-based effort on technical advancements in Hive. Our performance evaluation shows that these advancements provide significant improvements on storage efficiency and query execution performance. This paper also shows how academic research lays a foundation for Hive to improve its daily operations.
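
The storage-side goal mentioned above centres on columnar file formats such as ORC. As a hedged sketch (placeholders throughout, and a reachable HiveServer2 instance plus the third-party PyHive client are assumed), declaring an ORC-backed table and running an aggregate from Python might look like this:

```python
from pyhive import hive   # pip install pyhive; assumes HiveServer2 on localhost:10000

conn = hive.Connection(host="localhost", port=10000)
cur = conn.cursor()

# Store the fact table as ORC, the columnar format whose built-in statistics
# and stripe-level indexes let Hive read far less data per query.
cur.execute("""
    CREATE TABLE IF NOT EXISTS sales_orc (
        order_id    BIGINT,
        customer_id BIGINT,
        amount      DOUBLE,
        order_date  STRING
    )
    STORED AS ORC
""")

# A simple aggregate; the optimized planner and execution engine described in
# the paper aim to make scans like this cheaper and better parallelized.
cur.execute("SELECT order_date, SUM(amount) FROM sales_orc GROUP BY order_date")
for row in cur.fetchall():
    print(row)
```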

Journal ArticleDOI
TL;DR: Researchers need to study and document use cases that explain how specific, novel data, so-called Big Data, can be used to support decision-making.
Abstract: People and the computers they use are generating large amounts of varied data. The phenomenon of capturing and trying to use all of the semi-structured and unstructured data has been called by vendors and bloggers ‘Big Data’. Organisations can capture and store data of many types from almost any source, but capturing and storing data only adds value when it has a useful purpose. Big Data must be used to provide input to analytics and decision support capabilities if it is to create real value for organisations. Some bloggers, industry leaders and academics have become disillusioned by the term Big Data. It is a marketing term and not a technical term. More descriptive terms like unstructured data, process data and machine data are more useful for information technology (IT) professionals. Researchers need to study and document use cases that explain how specific, novel data, so-called Big Data, can be used to support decision-making.

BookDOI
11 Sep 2014
TL;DR: Students, practitioners and researchers alike will find this book the most comprehensive reference work on data warehouses, with key topics described in a clear and educational style.
Abstract: With this textbook, Vaisman and Zimányi deliver excellent coverage of data warehousing and business intelligence technologies ranging from the most basic principles to recent findings and applications. To this end, their work is structured into three parts. Part I describes Fundamental Concepts including multi-dimensional models; conceptual and logical data warehouse design and MDX and SQL/OLAP. Subsequently, Part II details Implementation and Deployment, which includes physical data warehouse design; data extraction, transformation, and loading (ETL) and data analytics. Lastly, Part III covers Advanced Topics such as spatial data warehouses; trajectory data warehouses; semantic technologies in data warehouses and novel technologies like MapReduce, column-store databases and in-memory databases. As a key characteristic of the book, most of the topics are presented and illustrated using application tools. Specifically, a case study based on the well-known Northwind database illustrates how the concepts presented in the book can be implemented using Microsoft Analysis Services and Pentaho Business Analytics. All chapters are summarized using review questions and exercises to support comprehensive student learning. Supplemental material to assist instructors using this book as a course text is available at http://cs.ulb.ac.be/DWSDIbook/, including electronic versions of the figures, solutions to all exercises, and a set of slides accompanying each chapter. Overall, students, practitioners and researchers alike will find this book the most comprehensive reference work on data warehouses, with key topics described in a clear and educational style.

Journal ArticleDOI
TL;DR: G-MR, a system for executing sequences of MapReduce jobs on geo-distributed data sets that implements the optimization framework, is introduced; evaluations show that using G-MR significantly improves processing time and cost for geo-distributed data sets.
Abstract: Efficiently analyzing big data is a major issue in our current era. Examples of analysis tasks include identification or detection of global weather patterns, economic changes, social phenomena, or epidemics. The cloud computing paradigm along with software tools such as implementations of the popular MapReduce framework offer a response to the problem by distributing computations among large sets of nodes. In many scenarios, input data are, however, geographically distributed (geodistributed) across data centers, and straightforwardly moving all data to a single data center before processing it can be prohibitively expensive. Above-mentioned tools are designed to work within a single cluster or data center and perform poorly or not at all when deployed across data centers. This paper deals with executing sequences of MapReduce jobs on geo-distributed data sets. We analyze possible ways of executing such jobs, and propose data transformation graphs that can be used to determine schedules for job sequences which are optimized either with respect to execution time or monetary cost. We introduce G-MR, a system for executing such job sequences, which implements our optimization framework. We present empirical evidence in Amazon EC2 and VICCI of the benefits of G-MR over common, naive deployments for processing geodistributed data sets. Our evaluations show that using G-MR significantly improves processing time and cost for geodistributed data sets.
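
A toy illustration of the trade-off that such data transformation graphs capture is sketched below: ship all raw geo-distributed input to one data center, or run a reducing first stage locally and ship only the intermediates. The data-center names, sizes, rates, and prices are invented, and the two hand-written plans merely stand in for paths that an optimizer like G-MR would enumerate.

```python
# Invented sizes, rates, and prices; two hand-written candidate schedules.
RAW_GB = {"us-east": 800, "eu-west": 600, "ap-south": 400}   # input per data center
REDUCTION = 0.05            # fraction of data surviving a local first stage
WAN_GB_PER_S = 0.5          # effective inter-datacenter copy rate
COST_PER_GB = 0.09          # egress price per GB moved between data centers

def copy_all_then_process(target="us-east"):
    """Move every raw byte to one data center, then run the whole job there."""
    moved = sum(gb for dc, gb in RAW_GB.items() if dc != target)
    return {"schedule": "copy-all", "moved_gb": moved,
            "copy_seconds": moved / WAN_GB_PER_S, "dollars": moved * COST_PER_GB}

def process_locally_then_combine(target="us-east"):
    """Run a reducing stage in each data center, then ship only intermediates."""
    moved = sum(gb * REDUCTION for dc, gb in RAW_GB.items() if dc != target)
    return {"schedule": "local-first", "moved_gb": moved,
            "copy_seconds": moved / WAN_GB_PER_S, "dollars": moved * COST_PER_GB}

for plan in (copy_all_then_process(), process_locally_then_combine()):
    print(plan)
# An optimizer would enumerate such paths through the transformation graph
# and pick the one minimizing execution time or monetary cost.
```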

Proceedings ArticleDOI
18 Jun 2014
TL;DR: This paper proposes a fine-grained blocking technique that reorganizes the data tuples into blocks with a goal of enabling queries to skip blocks aggressively, and shows that this technique leads to 2-5x improvement in query response time over traditional range-based blocking techniques.
Abstract: Modern query engines are increasingly being required to process enormous datasets in near real-time. While much can be done to speed up the data access, a promising technique is to reduce the need to access data through data skipping. By maintaining some metadata for each block of tuples, a query may skip a data block if the metadata indicates that the block does not contain relevant data. The effectiveness of data skipping, however, depends on how well the blocking scheme matches the query filters. In this paper, we propose a fine-grained blocking technique that reorganizes the data tuples into blocks with a goal of enabling queries to skip blocks aggressively. We first extract representative filters in a workload as features using frequent itemset mining. Based on these features, each data tuple can be represented as a feature vector. We then formulate the blocking problem as an optimization problem on the feature vectors, called Balanced MaxSkip Partitioning, which we prove is NP-hard. To find an approximate solution efficiently, we adopt the bottom-up clustering framework. We prototyped our blocking techniques on Shark, an open-source data warehouse system. Our experiments on TPC-H and a real-world workload show that our blocking technique leads to 2-5x improvement in query response time over traditional range-based blocking techniques.
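
A much-simplified sketch of the idea follows (it is not the paper's Balanced MaxSkip partitioner): treat frequent workload filters as features, give each tuple a feature bit vector, group tuples with identical vectors into blocks, and let a per-block union vector tell a query which blocks it can safely skip. All data and the grouping rule are invented for illustration.

```python
from collections import Counter
from itertools import groupby

tuples = [{"country": "US", "status": "open"},
          {"country": "US", "status": "closed"},
          {"country": "DE", "status": "open"},
          {"country": "DE", "status": "closed"}] * 3

workload = [("country", "US"), ("status", "open"), ("country", "US")]

# 1. Pick the most frequent filters as features (a stand-in for itemset mining).
features = [f for f, _ in Counter(workload).most_common(2)]

def vector(t):
    """One bit per feature: does this tuple satisfy that filter?"""
    return tuple(int(t[col] == val) for col, val in features)

# 2. Group tuples with identical vectors into blocks (a stand-in for clustering).
blocks = []
for vec, group in groupby(sorted(tuples, key=vector), key=vector):
    blocks.append({"union": vec, "rows": list(group)})

# 3. At query time, skip any block whose union vector rules the filter out.
def query(col, val):
    hits, scanned = [], 0
    for b in blocks:
        if (col, val) in features and b["union"][features.index((col, val))] == 0:
            continue                      # no tuple in this block can match
        scanned += len(b["rows"])
        hits += [r for r in b["rows"] if r[col] == val]
    return hits, scanned

hits, scanned = query("country", "US")
print(f"matched {len(hits)} rows, scanned {scanned} of {len(tuples)}")
```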

Journal ArticleDOI
TL;DR: In this article, the importance of data science proficiency and resources for instructors to implement data science in their own statistics curricula are discussed, as well as case studies from seven institutions.
Abstract: A growing number of students are completing undergraduate degrees in statistics and entering the workforce as data analysts. In these positions, they are expected to understand how to utilize databases and other data warehouses, scrape data from Internet sources, program solutions to complex problems in multiple languages, and think algorithmically as well as statistically. These data science topics have not traditionally been a major component of undergraduate programs in statistics. Consequently, a curricular shift is needed to address additional learning outcomes. The goal of this paper is to motivate the importance of data science proficiency and to provide examples and resources for instructors to implement data science in their own statistics curricula. We provide case studies from seven institutions. These varied approaches to teaching data science demonstrate curricular innovations to address new needs. Also included here are examples of assignments designed for courses that foster engagement of undergraduates with data and data science.

Proceedings ArticleDOI
27 Jun 2014
TL;DR: A semantic Extract-Transform-Load (ETL) framework that uses semantic technologies to integrate and publish data from multiple sources as open linked data, as well as to create a distributed Web of data using the Resource Description Framework (RDF) as the graph data model.
Abstract: Big Data has become the new ubiquitous term used to describe massive collections of datasets that are difficult to process using traditional database and software techniques. Most of this data is inaccessible to users, as we need technology and tools to find, transform, analyze, and visualize data in order to make it consumable for decision-making. One aspect of Big Data research is dealing with the Variety of data, which includes various formats such as structured, numeric, unstructured text data, email, video, audio, stock ticker, etc. Managing, merging, and governing a variety of data is the focus of this paper. This paper proposes a semantic Extract-Transform-Load (ETL) framework that uses semantic technologies to integrate and publish data from multiple sources as open linked data. This includes: the creation of a semantic data model to provide a basis for integration and understanding of knowledge from multiple sources; the creation of a distributed Web of data using the Resource Description Framework (RDF) as the graph data model; and the extraction of useful knowledge and information from the combined data using SPARQL as the semantic query language.
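
A minimal sketch of that extract-transform-publish-query flow using the rdflib library is shown below; the example.org vocabulary, the sample rows, and the aggregation query are all invented for illustration and are not the framework proposed in the paper.

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, XSD

EX = Namespace("http://example.org/schema/")

rows = [  # "extracted" records from two hypothetical sources
    {"id": "p1", "name": "Alice", "source": "crm", "spend": 120.0},
    {"id": "p1", "name": "Alice", "source": "web", "spend": 45.5},
    {"id": "p2", "name": "Bob",   "source": "crm", "spend": 80.0},
]

g = Graph()
g.bind("ex", EX)
for r in rows:  # transform: one ex:Purchase node per row, linked to a customer
    purchase = URIRef(f"http://example.org/purchase/{r['id']}-{r['source']}")
    customer = URIRef(f"http://example.org/customer/{r['id']}")
    g.add((purchase, RDF.type, EX.Purchase))
    g.add((purchase, EX.byCustomer, customer))
    g.add((purchase, EX.amount, Literal(r["spend"], datatype=XSD.decimal)))
    g.add((customer, EX.name, Literal(r["name"])))

# Query the integrated graph with SPARQL: total spend per customer.
q = """
    SELECT ?name (SUM(?amount) AS ?total) WHERE {
        ?p a ex:Purchase ; ex:byCustomer ?c ; ex:amount ?amount .
        ?c ex:name ?name .
    } GROUP BY ?name
"""
for name, total in g.query(q, initNs={"ex": EX}):
    print(name, total)
```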

Journal ArticleDOI
TL;DR: Daniel E. O'Leary investigates using different AI and crowdsourcing applications in that lake in order to integrate disparate data sources, facilitate master data management and analyze data quality.
Abstract: Daniel E. O'Leary examines the notion of the Big Data Lake and contrasts it with decision support-based data warehouses. In addition, some of the risks of the emerging Lake concept that ultimately require data governance are analyzed. O'Leary investigates using different AI and crowdsourcing (human intelligence) applications in that lake in order to integrate disparate data sources, facilitate master data management and analyze data quality. Although data governance often is not seen as a technology issue, it is seen as a critical component of making the Big Data Lake "work".

Journal ArticleDOI
01 Aug 2014
TL;DR: The Mesa system is presented along with the performance and scale that it achieves, including near real-time data ingestion and queryability, as well as high availability, reliability, fault tolerance, and scalability for large data and query volumes.
Abstract: Mesa is a highly scalable analytic data warehousing system that stores critical measurement data related to Google's Internet advertising business. Mesa is designed to satisfy a complex and challenging set of user and systems requirements, including near real-time data ingestion and queryability, as well as high availability, reliability, fault tolerance, and scalability for large data and query volumes. Specifically, Mesa handles petabytes of data, processes millions of row updates per second, and serves billions of queries that fetch trillions of rows per day. Mesa is geo-replicated across multiple datacenters and provides consistent and repeatable query answers at low latency, even when an entire datacenter fails. This paper presents the Mesa system and reports the performance and scale that it achieves.
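
A toy sketch of the data model described here, with keyed rows, a value column carrying an aggregation function (SUM), and versioned update batches aggregated at query time, is given below; it only illustrates the idea and is in no sense Google's implementation.

```python
from collections import defaultdict

class MiniMesa:
    """Keyed deltas arrive in versioned batches; queries aggregate them with SUM."""
    def __init__(self):
        self.deltas = []                      # list of (version, {key: value})

    def apply_update(self, batch):
        """Atomically append one update batch as the next version."""
        version = len(self.deltas) + 1
        self.deltas.append((version, dict(batch)))
        return version

    def query(self, at_version):
        """Repeatable read: aggregate every delta up to the given version."""
        totals = defaultdict(int)
        for version, batch in self.deltas:
            if version > at_version:
                break
            for key, value in batch.items():
                totals[key] += value          # SUM is the aggregation function
        return dict(totals)

mesa = MiniMesa()
mesa.apply_update({("ad-1", "2014-08-01"): 10, ("ad-2", "2014-08-01"): 3})
v2 = mesa.apply_update({("ad-1", "2014-08-01"): 5})
print(mesa.query(at_version=1))   # state as of the first batch
print(mesa.query(at_version=v2))  # ad-1 row now aggregates to 15
```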

Journal ArticleDOI
01 Aug 2014
TL;DR: TPC-DI, an innovative benchmark for data integration, is released; the reasons behind its development are explained, and its main characteristics, including workload, run rules and metric, are described along with key design decisions.
Abstract: Historically, the process of synchronizing a decision support system with data from operational systems has been referred to as Extract, Transform, Load (ETL) and the tools supporting such a process have been referred to as ETL tools. Recently, ETL was replaced by the more comprehensive acronym, data integration (DI). DI describes the process of extracting and combining data from a variety of data source formats, transforming that data into a unified data model representation and loading it into a data store. This is done in the context of a variety of scenarios, such as data acquisition for business intelligence, analytics and data warehousing, but also synchronization of data between operational applications, data migrations and conversions, master data management, enterprise data sharing and delivery of data services in a service-oriented architecture context, amongst others. With these scenarios relying on up-to-date information, it is critical to implement a highly performing, scalable and easy to maintain data integration system. This is especially important as the complexity, variety and volume of data is constantly increasing and performance of data integration systems is becoming very critical. Despite the significance of having a highly performing DI system, there has been no industry standard for measuring and comparing their performance. The TPC, acknowledging this void, has released TPC-DI, an innovative benchmark for data integration. This paper motivates the reasons behind its development, describes its main characteristics, including workload, run rules and metric, and explains key design decisions.

Patent
31 Jan 2014
TL;DR: In this article, a big data network or system for a process control system or plant includes a data storage device configured to receive process control data from control system devices and store the process data.
Abstract: A big data network or system for a process control system or plant includes a data storage device configured to receive process control data from control system devices and store the process control data. The big data network or system identifies various parameters or attributes from the process control data, and creates and uses rowkeys to store the parameters according to various combinations, such as combinations using timestamps. The big data network or system may also store certain aggregate data analyses associated with time periods specified by the timestamps. Accordingly, the big data network or system efficiently stores real-time data having measurements within a database schema, and users or administrators can leverage the aggregate data to analyze certain data associated with certain time periods.
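
A hedged sketch of the rowkey idea follows: combining a tag identifier with a zero-padded timestamp yields keys that sort by tag and then by time, so a time window becomes one contiguous key-range scan. The key layout, tag name, and the sorted in-memory list standing in for the wide-column store are invented for illustration, not the claimed design.

```python
import bisect
from datetime import datetime, timezone

def rowkey(tag: str, ts: datetime) -> str:
    """Tag id plus zero-padded epoch seconds: sorts by tag, then by time."""
    epoch = int(ts.replace(tzinfo=timezone.utc).timestamp())
    return f"{tag}#{epoch:012d}"

table = []  # sorted (rowkey, value) pairs stand in for the wide-column store
for hour, minute, value in [(8, 0, 21.5), (8, 1, 21.7), (8, 2, 22.1), (9, 0, 23.0)]:
    ts = datetime(2014, 1, 31, hour, minute)
    bisect.insort(table, (rowkey("reactor-7/temperature", ts), value))

def scan(tag, start: datetime, end: datetime):
    """A time window for one tag is a single contiguous key range."""
    lo, hi = rowkey(tag, start), rowkey(tag, end)
    return [(k, v) for k, v in table if lo <= k <= hi]

# Hourly aggregate for one tag, computed from a single key-range scan.
window = scan("reactor-7/temperature",
              datetime(2014, 1, 31, 8, 0), datetime(2014, 1, 31, 8, 59))
print(len(window), sum(v for _, v in window) / len(window))
```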

Proceedings ArticleDOI
04 Nov 2014
TL;DR: SATO can generate much more balanced partitioning that significantly improves spatial query performance with MapReduce compared to traditional spatial partitioning approaches, and can be used to significantly improve window-based queries in cloud-based spatial query processing systems.
Abstract: Scalable spatial query processing relies on effective spatial data partitioning for query parallelization, data pruning, and load balancing. These are often challenged by the intrinsic characteristics of spatial data, such as high skew in data distribution and high complexity of irregular multi-dimensional objects. In this demo, we present SATO, a spatial data partitioning framework that can quickly analyze and partition spatial data with an optimal spatial partitioning strategy for scalable query processing. SATO works in the following steps: 1) Sample, which samples a small fraction of input data for analysis, 2) Analyze, which quickly analyzes sampled data to find an optimal partition strategy, 3) Tear, which provides data-skew-aware partitioning and supports MapReduce-based scalable partitioning, and 4) Optimize, which collects succinct partition statistics for potential query optimization. SATO also provides multiple-level partitioning, which can be used to significantly improve window-based queries in cloud-based spatial query processing systems. SATO comes with a visualization component that provides heat maps and histograms for qualitative evaluation. SATO has been implemented within Hadoop-GIS, a high performance spatial data warehousing system over MapReduce. SATO is also released as an independent software package to support various scalable spatial query processing systems. Our experiments have demonstrated that SATO can generate much more balanced partitioning that can significantly improve spatial query performance with MapReduce compared to traditional spatial partitioning approaches.
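
The sketch below walks through a much-simplified Sample/Analyze/Tear/Optimize flow on synthetic 2-D points (quantile cuts on a sample, then per-partition statistics). It illustrates the workflow only; it is not SATO's skew-aware partitioner or its MapReduce implementation.

```python
import random
import statistics

random.seed(7)
points = [(random.gauss(0, 1), random.gauss(0, 3)) for _ in range(10_000)]

# Sample: look at a small fraction of the input.
sample = random.sample(points, 500)

# Analyze: split along the dimension with the larger spread in the sample.
spread_x = statistics.pstdev(p[0] for p in sample)
spread_y = statistics.pstdev(p[1] for p in sample)
dim = 0 if spread_x >= spread_y else 1
cuts = statistics.quantiles([p[dim] for p in sample], n=4)   # 4 partitions

# Tear: assign every point to a partition using the sampled cut points.
def part_id(p):
    return sum(p[dim] > c for c in cuts)

partitions = {i: [] for i in range(4)}
for p in points:
    partitions[part_id(p)].append(p)

# Optimize: keep succinct per-partition statistics (count and bounding box)
# that a query planner could later use for pruning.
for i, pts in sorted(partitions.items()):
    if not pts:
        continue
    xs, ys = [p[0] for p in pts], [p[1] for p in pts]
    print(i, len(pts), (min(xs), min(ys), max(xs), max(ys)))
```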

Patent
26 Mar 2014
TL;DR: In this paper, a heterogeneous large data integration method and system based on data warehouses is presented, where all kinds of data are integrated by combining the advantages of a relational database, a distributed database and a memory database, deep data analysis is carried out on the basis of the data warehouses, and data mining is deepened continuously.
Abstract: The invention provides a heterogeneous large data integration method and system based on data warehouses. An incidence relation between structured, semi-structured and unstructured data is established; all kinds of data are integrated by combining the advantages of a relational database, a distributed database and an in-memory database; deep data analysis is carried out on the basis of the data warehouses; and data mining is deepened continuously, so that high-efficiency and high-quality heterogeneous large data analysis is achieved. The structured, semi-structured and unstructured data in Internet applications are associated; through Map/Reduce distributed processing and data mining, the processing results and relevant data are written into memory in a database structure mode, forming a simple in-memory database that supports high-speed calculation and fast response.

Book ChapterDOI
02 Sep 2014
TL;DR: This paper extends the QB4OLAP RDF vocabulary to represent balanced, recursive, and ragged hierarchies, and shows how complex real-world OLAP queries expressed in SPARQL can be posed to the resulting QB4OLAP model.
Abstract: The web is changing the way in which data warehouses are designed and exploited. Nowadays, for many data analysis tasks, data contained in a conventional data warehouse may not suffice, and external data sources, like the web, can provide useful multidimensional information. Also, large repositories of semantically annotated data are becoming available on the web, opening new opportunities for enhancing current decision-support systems. Representation of multidimensional data via semantic web standards is crucial to achieve such goal. In this paper we extend the QB4OLAP RDF vocabulary to represent balanced, recursive, and ragged hierarchies. We also present a set of rules to obtain a QB4OLAP representation of a conceptual multidimensional model, and a procedure to populate the result from a relational implementation of the multidimensional model. We conclude the paper showing how complex real-world OLAP queries expressed in SPARQL can be posed to the resulting QB4OLAP model.
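
To give a flavor of such queries, the snippet below runs an OLAP-style roll-up in SPARQL over a tiny cube expressed with the W3C Data Cube (qb:) vocabulary that QB4OLAP extends. The dataset, dimension, and measure IRIs are invented, and QB4OLAP's own level and hierarchy terms are deliberately omitted rather than guessed.

```python
from rdflib import Graph

# A tiny cube: three observations in one dataset, with an invented
# ex:storeRegion dimension and ex:salesAmount measure.
ttl = """
@prefix qb: <http://purl.org/linked-data/cube#> .
@prefix ex: <http://example.org/cube/> .

ex:o1 a qb:Observation ; qb:dataSet ex:sales ; ex:storeRegion ex:North ; ex:salesAmount 120 .
ex:o2 a qb:Observation ; qb:dataSet ex:sales ; ex:storeRegion ex:North ; ex:salesAmount 80 .
ex:o3 a qb:Observation ; qb:dataSet ex:sales ; ex:storeRegion ex:South ; ex:salesAmount 45 .
"""

g = Graph()
g.parse(data=ttl, format="turtle")

# OLAP-style roll-up: total sales per region, expressed directly in SPARQL.
query = """
PREFIX qb: <http://purl.org/linked-data/cube#>
PREFIX ex: <http://example.org/cube/>

SELECT ?region (SUM(?amount) AS ?total)
WHERE {
  ?obs a qb:Observation ;
       qb:dataSet     ex:sales ;
       ex:storeRegion ?region ;
       ex:salesAmount ?amount .
}
GROUP BY ?region
ORDER BY DESC(?total)
"""
for region, total in g.query(query):
    print(region, total)
```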

Journal ArticleDOI
TL;DR: This work describes the design, application architecture, and use of a self-service model for enterprise data delivery within Duke Medicine, designed to be responsive to source data and to allow modification through alterations in metadata rather than programming, allowing an agile response to source system changes.

Journal ArticleDOI
TL;DR: A new key-value pair data model better suited for high-dimensional data storage and querying, optimized for database scalability and performance is introduced and used as a basis for migrating tranSMART's implementation to a more scalable solution for Big Data.
Abstract: High-throughput transcriptomic data generated by microarray experiments is the most abundant and frequently stored kind of data currently used in translational medicine studies. Although microarray data is supported in data warehouses such as tranSMART, queries against relational databases for hundreds of different patient gene expression records are slow. Non-relational data models, such as the key-value model implemented in NoSQL databases, hold promise to be more performant solutions. Our motivation is to improve the performance of the tranSMART data warehouse with a view to supporting Next Generation Sequencing data. In this paper we introduce a new data model better suited for high-dimensional data storage and querying, optimized for database scalability and performance. We have designed a key-value pair data model to support faster queries over large-scale microarray data and implemented the model using HBase, an implementation of Google's BigTable storage system. An experimental performance comparison was carried out against the traditional relational data model implemented in both MySQL Cluster and MongoDB, using a large publicly available transcriptomic data set taken from NCBI GEO concerning Multiple Myeloma. Our new key-value data model implemented on HBase exhibits an average 5.24-fold increase in high-dimensional biological data query performance compared to the relational model implemented on MySQL Cluster, and an average 6.47-fold increase in query performance compared to MongoDB. The performance evaluation found that the new key-value data model, in particular its implementation in HBase, outperforms the relational model currently implemented in tranSMART. We propose that NoSQL technology holds great promise for large-scale data management, in particular for high-dimensional biological data such as that demonstrated in the performance evaluation described in this paper. We aim to use this new data model as a basis for migrating tranSMART's implementation to a more scalable solution for Big Data.
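
To illustrate the general key-value layout (not the paper's actual schema), the sketch below uses happybase, a Python Thrift client for HBase, to store one expression value per probe column under a study-plus-patient row key, so that a whole profile is one row fetch and a study is one prefix scan. The table name, column family, identifiers, and a local HBase Thrift server are all assumptions.

```python
import happybase   # pip install happybase; assumes an HBase Thrift server on localhost

connection = happybase.Connection("localhost")
if b"expression" not in connection.tables():
    connection.create_table("expression", {"e": dict()})   # one column family 'e'

table = connection.table("expression")

# Row key = study + patient; one column per probe. A patient's whole profile
# is then a single row fetch instead of thousands of relational rows.
table.put(b"study1:patient-0001", {
    b"e:probe_000001": b"7.91",
    b"e:probe_000002": b"5.42",
})

print(table.row(b"study1:patient-0001"))          # one patient's full profile

for key, data in table.scan(row_prefix=b"study1:"):
    print(key, len(data), "probes")               # every patient in the study
```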

BookDOI
01 Jan 2014
TL;DR: This research presents a meta-modelling architecture for social media data management that automates the very labor-intensive and therefore time-heavy and expensive process of manually cataloging and managing social media accounts.
Abstract: Data warehousing; Database integration; Mobile databases; Cloud, distributed, and parallel databases; High-dimensional and temporal data; Image/video retrieval and databases; Database performance and tuning; Privacy and security in databases; Query processing and optimization; Semi-structured data and XML; Spatial data processing and management; Stream and sensor data management; Uncertain and probabilistic databases; Web databases; Graph databases; Web service management; Social media data management.

Proceedings ArticleDOI
07 Apr 2014
TL;DR: This work fully redesigns, from the bottom up, core data analytics concepts and tools in the context of RDF data, leading to the first complete formal framework for warehouse-style RDF analytics.
Abstract: The development of Semantic Web (RDF) brings new requirements for data analytics tools and methods, going beyond querying to semantics-rich analytics through warehouse-style tools. In this work, we fully redesign, from the bottom up, core data analytics concepts and tools in the context of RDF data, leading to the first complete formal framework for warehouse-style RDF analytics. Notably, we define i) analytical schemas tailored to heterogeneous, semantics-rich RDF graph, ii) analytical queries which (beyond relational cubes) allow flexible querying of the data and the schema as well as powerful aggregation and iii) OLAP-style operations. Experiments on a fully-implemented platform demonstrate the practical interest of our approach.