Showing papers on "Data warehouse published in 2020"

PDF

Open Access

Journal Article•DOI•

Delta lake: high-performance ACID table storage over cloud object stores

[...]

Michael Armbrust, Tathagata Das, Liwen Sun, Burak Yavuz, Shixiong Zhu, Mukul Murthy, Joseph Torres, Herman van Hovell, Adrian Ionescu, Alicja Łuszczak, Michał Świtakowski, Michał Szafrański, Xiao Li, Takuya Ueshin, Mostafa Mokhtar, Peter Boncz¹, Ali Ghodsi², Sameer Paranjpye, Pieter Senster, Reynold Xin, Matei Zaharia³ - Show less +17 more•Institutions (3)

Centrum Wiskunde & Informatica¹, University of California, Berkeley², Stanford University³

01 Aug 2020

TL;DR: Delta Lake is presented, an open source ACID table storage layer over cloud object stores initially developed at Databricks that uses a transaction log that is compacted into Apache Parquet format to provide ACID properties, time travel, and significantly faster metadata operations for large tabular datasets.

...read moreread less

Abstract: Cloud object stores such as Amazon S3 are some of the largest and most cost-effective storage systems on the planet, making them an attractive target to store large data warehouses and data lakes. Unfortunately, their implementation as key-value stores makes it difficult to achieve ACID transactions and high performance: metadata operations such as listing objects are expensive, and consistency guarantees are limited. In this paper, we present Delta Lake, an open source ACID table storage layer over cloud object stores initially developed at Databricks. Delta Lake uses a transaction log that is compacted into Apache Parquet format to provide ACID properties, time travel, and significantly faster metadata operations for large tabular datasets (e.g., the ability to quickly search billions of table partitions for those relevant to a query). It also leverages this design to provide high-level features such as automatic data layout optimization, upserts, caching, and audit logs. Delta Lake tables can be accessed from Apache Spark, Hive, Presto, Redshift and other systems. Delta Lake is deployed at thousands of Databricks customers that process exabytes of data per day, with the largest instances managing exabyte-scale datasets and billions of objects.

...read moreread less

59 citations

Posted Content•

Tsunami: A Learned Multi-dimensional Index for Correlated Data and Skewed Workloads

[...]

Jialin Ding, Vikram Nathan, Mohammad Alizadeh, Tim Kraska

23 Jun 2020-arXiv: Databases

TL;DR: This paper introduces Tsunami, which addresses limitations to achieve up to 6X faster query performance and up to 8X smaller index size than existing learned multi-dimensional indexes, in addition to up to 11X faster queries performance and 170X smallerindex size than optimally-tuned traditional indexes.

...read moreread less

Abstract: Filtering data based on predicates is one of the most fundamental operations for any modern data warehouse Techniques to accelerate the execution of filter expressions include clustered indexes, specialized sort orders (eg, Z-order), multi-dimensional indexes, and, for high selectivity queries, secondary indexes However, these schemes are hard to tune and their performance is inconsistent Recent work on learned multi-dimensional indexes has introduced the idea of automatically optimizing an index for a particular dataset and workload However, the performance of that work suffers in the presence of correlated data and skewed query workloads, both of which are common in real applications In this paper, we introduce Tsunami, which addresses these limitations to achieve up to 6X faster query performance and up to 8X smaller index size than existing learned multi-dimensional indexes, in addition to up to 11X faster query performance and 170X smaller index size than optimally-tuned traditional indexes

...read moreread less

56 citations

Posted Content•

A new paradigm for accelerating clinical data science at Stanford Medicine.

[...]

Somalee Datta, Jose D. Posada, Garrick Olson, Wencheng Li, Ciaran O'Reilly, Deepa Balraj, Joseph Mesterhazy, Joseph Pallas, Priyamvada Desai, Nigam H. Shah - Show less +6 more

17 Mar 2020-arXiv: Computers and Society

TL;DR: A new secure Big Data platform that aims to reduce time to access and analyze data and is designed to bring the modern data science community to highly sensitive clinical data in a secure and collaborative big data analytics environment with a goal to enable bigger, better and faster science.

...read moreread less

Abstract: Stanford Medicine is building a new data platform for our academic research community to do better clinical data science. Hospitals have a large amount of patient data and researchers have demonstrated the ability to reuse that data and AI approaches to derive novel insights, support patient care, and improve care quality. However, the traditional data warehouse and Honest Broker approaches that are in current use, are not scalable. We are establishing a new secure Big Data platform that aims to reduce time to access and analyze data. In this platform, data is anonymized to preserve patient data privacy and made available preparatory to Institutional Review Board (IRB) submission. Furthermore, the data is standardized such that analysis done at Stanford can be replicated elsewhere using the same analytical code and clinical concepts. Finally, the analytics data warehouse integrates with a secure data science computational facility to support large scale data analytics. The ecosystem is designed to bring the modern data science community to highly sensitive clinical data in a secure and collaborative big data analytics environment with a goal to enable bigger, better and faster science.

...read moreread less

46 citations

Journal Article•DOI•

Leaf: an open-source, model-agnostic, data-driven web application for cohort discovery and translational biomedical research.

[...]

Nicholas J. Dobbins¹, Clifford H Spital¹, Robert A Black¹, Jason A. Morrison, Bas de Veer¹, Elizabeth Zampino¹, Robert D. Harrington¹, Bethene D Britt¹, Kari A. Stephens¹, Adam B. Wilcox¹, Peter Tarczy-Hornoch¹, Sean D. Mooney¹ - Show less +8 more•Institutions (1)

University of Washington¹

01 Jan 2020-Journal of the American Medical Informatics Association

TL;DR: Leaf is a lightweight self-service web application for querying clinical data from heterogeneous data models and sources that does not specify a required data model and is designed to seamlessly leverage existing user authentication systems and clinical databases in situ.

...read moreread less

37 citations

Journal Article•DOI•

Tsunami: A Learned Multi-dimensional Index for Correlated Data and Skewed Workloads

[...]

Jialin Ding, Vikram Nathan, Mohammad Alizadeh, Tim Kraska

23 Jun 2020

TL;DR: In this paper, Tsunami, a learned multi-dimensional index is proposed to achieve up to 6X faster query performance and up to 8X smaller index size than existing learned multidimensional indexes.

...read moreread less

Abstract: Filtering data based on predicates is one of the most fundamental operations for any modern data warehouse. Techniques to accelerate the execution of filter expressions include clustered indexes, specialized sort orders (e.g., Z-order), multi-dimensional indexes, and, for high selectivity queries, secondary indexes. However, these schemes are hard to tune and their performance is inconsistent. Recent work on learned multi-dimensional indexes has introduced the idea of automatically optimizing an index for a particular dataset and workload. However, the performance of that work suffers in the presence of correlated data and skewed query workloads, both of which are common in real applications. In this paper, we introduce Tsunami, which addresses these limitations to achieve up to 6X faster query performance and up to 8X smaller index size than existing learned multi-dimensional indexes, in addition to up to 11X faster query performance and 170X smaller index size than optimally-tuned traditional indexes.

...read moreread less

35 citations

Journal Article•DOI•

A Survey on Big Data for Trajectory Analytics

[...]

Damião Ribeiro de Almeida, Cláudio de Souza Baptista, Fabio Gomes de Andrade, Amílcar Soares

01 Feb 2020-ISPRS international journal of geo-information

TL;DR: Understanding how the research in trajectory data are being conducted, what main techniques have been used, and how they can be embedded in an Online Analytical Processing (OLAP) architecture can enhance the efficiency and development of decision-making systems that deal with trajectory data.

...read moreread less

Abstract: Trajectory data allow the study of the behavior of moving objects, from humans to animals. Wireless communication, mobile devices, and technologies such as Global Positioning System (GPS) have contributed to the growth of the trajectory research field. With the considerable growth in the volume of trajectory data, storing such data into Spatial Database Management Systems (SDBMS) has become challenging. Hence, Spatial Big Data emerges as a data management technology for indexing, storing, and retrieving large volumes of spatio-temporal data. A Data Warehouse (DW) is one of the premier Big Data analysis and complex query processing infrastructures. Trajectory Data Warehouses (TDW) emerge as a DW dedicated to trajectory data analysis. A list and discussions on problems that use TDW and forward directions for the works in this field are the primary goals of this survey. This article collected state-of-the-art on Big Data trajectory analytics. Understanding how the research in trajectory data are being conducted, what main techniques have been used, and how they can be embedded in an Online Analytical Processing (OLAP) architecture can enhance the efficiency and development of decision-making systems that deal with trajectory data.

...read moreread less

35 citations

Journal Article•DOI•

Electronic health records for the diagnosis of rare diseases

[...]

Nicolas Garcelon¹, Nicolas Garcelon², Anita Burgun¹, Rémi Salomon³, Antoine Neuraz¹, Antoine Neuraz³ - Show less +2 more•Institutions (3)

Paris Descartes University¹, French Institute of Health and Medical Research², Necker-Enfants Malades Hospital³

01 Apr 2020-Kidney International

TL;DR: Research is examined that provides solutions to unlock barriers and accelerate translational research: structured electronic health records and free-text search engines to find patients, data warehouses and natural language processing to extract phenotypes, machine learning algorithms to classify patients, and similarity metrics to diagnose patients.

...read moreread less

27 citations

Book Chapter•DOI•

Proactive and Predictive Maintenance of Cyber-Physical Systems

[...]

Maxim Shcherbakov, Artem V. Glotov, Sergey V. Cheremisinov

01 Jan 2020

TL;DR: A concept model for proactive decision support system based on (real-time) predictive analytics and designed for maintenance of cyber-physical systems (CPSs) in order to optimize its downtime is described.

...read moreread less

Abstract: The following chapter describes a concept model for proactive decision support system based on (real-time) predictive analytics and designed for maintenance of cyber-physical systems (CPSs) in order to optimize its downtime. This concept later is referred to as proactive and predictive maintenance decision support systems or P2M for short. The concept is based on (i) the axioms of predictive decisions making, (ii) the proactive computing principles and (iii) models and methods for intelligent data processing. The aforementioned concept extends an idea of data-driven intelligent systems by using two approaches. The first approach implements predictive analytics, i.e. detection of a pre-failure event (called a proactive event) over a certain time period. This approach is based on the sequence of the following operational processes: to detect–to predict–to decide–to act. The second approach helps to automate maintenance decisions, which allows to exclude operational roles and move to supervisory level positions in the operational management structure. The concept includes the following primary components: ontology, a data warehouse (data lake), data factory as a set of data processing methods, flexible pipelines for data handling and processing and business processes with predictive decision logic for cyber-physical systems maintenance. This concept model is considered as the platform for the design of cyber-physical asset performance management systems.

...read moreread less

26 citations

Journal Article•DOI•

Design of a sports culture data fusion system based on a data mining algorithm

[...]

Lan Zhang

01 Feb 2020-Personal and Ubiquitous Computing

TL;DR: The actual case analysis and performance test results show that the realized sports cultural goods consumption data fusion system can provide a scientific reference model and basis for the modern sports stationery industry to use data mining and other new technologies to establish a decision-making information system.

...read moreread less

Abstract: The sports industry is an important component of social life and the national economy. With the advent of the era of big data, promoting the decision-making and scientific construction of the sports and cultural goods industry is conducive to the transformation and upgrading of the sports industry. In view of the shortcomings of the current sports stationery industry consumption data system, this paper combines K-means spatial clustering, fusion decision tree, naive Bayes, and other data mining algorithms and data warehouse technologies to the sports stationery industry. The consumption data system is the research object, and the analysis of geospatial feature clustering, customer segmentation, and consumption preference prediction of sports stationery industry consumption is carried out. The data mining–based sports cultural product industry data fusion system model is constructed, and the architecture, technology path, and function realization of the model are clarified. The actual case analysis and performance test results show that the realized sports cultural goods consumption data fusion system can provide a scientific reference model and basis for the modern sports stationery industry to use data mining and other new technologies to establish a decision-making information system.

...read moreread less

25 citations

Journal Article•DOI•

Optimizing Timeliness and Cost in Geo-Distributed Streaming Analytics

[...]

Benjamin Heintz¹, Abhishek Chandra¹, Ramesh K. Sitaraman²•Institutions (2)

University of Minnesota¹, University of Massachusetts Amherst²

01 Jan 2020-IEEE Transactions on Cloud Computing

TL;DR: This paper develops algorithms to optimize two key metrics: WAN traffic and staleness and presents a family of optimal offline algorithms that jointly minimize these metrics, and uses these to guide the design of practical online algorithms.

...read moreread less

Abstract: Rapid data streams are generated continuously from diverse sources including users, devices, and sensors located around the globe. This results in the need for efficient geo-distributed streaming analytics to extract timely information. A typical geo-distributed analytics service uses a hub-and-spoke model, comprising multiple edges connected by a wide-area-network (WAN) to a central data warehouse. In this paper, we focus on the widely used primitive of windowed grouped aggregation , and examine the question of how much computation should be performed at the edges versus the center . We develop algorithms to optimize two key metrics: WAN traffic and staleness (delay in getting results). We present a family of optimal offline algorithms that jointly minimize these metrics, and we use these to guide our design of practical online algorithms based on the insight that windowed grouped aggregation can be modeled as a caching problem where the cache size varies over time. We evaluate our algorithms through an implementation in Apache Storm deployed on PlanetLab. Using workloads derived from anonymized traces of a popular analytics service from a large commercial CDN, our experiments show that our online algorithms achieve near-optimal traffic and staleness for a variety of system configurations, stream arrival rates, and queries.

...read moreread less

25 citations

Journal Article•DOI•

COVID-WAREHOUSE: A Data Warehouse of Italian COVID-19, Pollution, and Climate Data.

[...]

Giuseppe Agapito¹, Chiara Zucco¹, Mario Cannataro¹•Institutions (1)

Magna Græcia University¹

03 Aug 2020-International Journal of Environmental Research and Public Health

TL;DR: The preliminary findings indicate that COVID-19 pandemic is significantly spread in regions characterized by high concentration of particulate in the air and the absence of rain and wind, as even stated in other works available in literature.

...read moreread less

Abstract: The management of the COVID-19 pandemic presents several unprecedented challenges in different fields, from medicine to biology, from public health to social science, that may benefit from computing methods able to integrate the increasing available COVID-19 and related data (e.g., pollution, demographics, climate, etc.). With the aim to face the COVID-19 data collection, harmonization and integration problems, we present the design and development of COVID-WAREHOUSE, a data warehouse that models, integrates and stores the COVID-19 data made available daily by the Italian Protezione Civile Department and several pollution and climate data made available by the Italian Regions. After an automatic ETL (Extraction, Transformation and Loading) step, COVID-19 cases, pollution measures and climate data, are integrated and organized using the Dimensional Fact Model, using two main dimensions: time and geographical location. COVID-WAREHOUSE supports OLAP (On-Line Analytical Processing) analysis, provides a heatmap visualizer, and allows easy extraction of selected data for further analysis. The proposed tool can be used in the context of Public Health to underline how the pandemic is spreading, with respect to time and geographical location, and to correlate the pandemic to pollution and climate data in a specific region. Moreover, public decision-makers could use the tool to discover combinations of pollution and climate conditions correlated to an increase of the pandemic, and thus, they could act in a consequent manner. Case studies based on data cubes built on data from Lombardia and Puglia regions are discussed. Our preliminary findings indicate that COVID-19 pandemic is significantly spread in regions characterized by high concentration of particulate in the air and the absence of rain and wind, as even stated in other works available in literature.

...read moreread less

Journal Article•DOI•

Challenges and Solutions for Processing Real-Time Big Data Stream: A Systematic Literature Review

[...]

Erum Mehmood¹, Tayyaba Anees¹•Institutions (1)

University of Management and Technology, Lahore¹

26 Jun 2020-IEEE Access

TL;DR: This study found that there exists various algorithms for implementing real-time join processing at ETL stage for structured data whereas less work for un-structured data is found in this subject matter.

...read moreread less

Abstract: Contribution: Recently, real-time data warehousing (DWH) and big data streaming have become ubiquitous due to the fact that a number of business organizations are gearing up to gain competitive advantage The capability of organizing big data in efficient manner to reach a business decision empowers data warehousing in terms of real-time stream processing A systematic literature review for real-time stream processing systems is presented in this paper which rigorously look at the recent developments and challenges of real-time stream processing systems and can serve as a guide for the implementation of real-time stream processing framework for all shapes of data streams Background: Published surveys and reviews either cover papers focusing on stream analysis in applications other than real-time DWH or focusing on extraction, transformation, loading (ETL) challenges for traditional DWH This systematic review attempts to answer four specific research questions Research Questions: 1)Which are the relevant publication channels for real-time stream processing research? 2) Which challenges have been faced during implementation of real-time stream processing? 3) Which approaches/tools have been reported to address challenges introduced at ETL stage while processing real-time stream for real-time DWH? 4) What evidence have been reported while addressing different challenges for processing real-time stream? Methodology: A systematic literature was conducted to compile studies related to publication channels targeting real-time stream processing/joins challenges and developments Following a formal protocol, semi-automatic and manual searches were performed for work from 2011 to 2020 excluding research in traditional data warehousing Of 679,547 papers selected for data extraction, 74 were retained after quality assessment Findings: This systematic literature highlights implementation challenges along with developed approaches for real-time DWH and big data stream processing systems and provides their comparisons This study found that there exists various algorithms for implementing real-time join processing at ETL stage for structured data whereas less work for un-structured data is found in this subject matter

...read moreread less

Journal Article•DOI•

A new hybridization of DBSCAN and fuzzy earthworm optimization algorithm for data cube clustering

[...]

Mina Hosseini Rad¹, Majid Abdolrazzagh-Nezhad•Institutions (1)

Islamic Azad University¹

01 Oct 2020

TL;DR: In this article, a hybridization of the fuzzy earthworm optimization algorithm and DBSCAN is proposed to solve the challenges of data aggregation from different databases into a data warehouse creates multidimensional data such as data cubes.

...read moreread less

Abstract: Data aggregation from different databases into a data warehouse creates multidimensional data such as data cubes. With regard to the 3D structure of data, data cube clustering has significant challenges to perform on data cube. In this paper, new preprocessing techniques and a novel hybridization of DBSCAN and fuzzy earthworm optimization algorithm (EWOA) are proposed to solve the challenges. Proposed preprocessing consists of an assigned address to each cube cell and dimension move to create a related 2D data from the data cube and new similarity metric. The DBSCAN algorithm, as a density-based clustering algorithm, is adopted based on both Euclidean and newly proposed similarity metric, which are called DBSCAN1 and DBSCAN2 for the related 2D data. A new hybridization of the EWOA and DBSCAN is proposed to improve the DBSCAN, and it is called EWOA–DBSCAN. Also, to dynamically tune parameters of EWOA, a fuzzy logic controller is designed with two fuzzy group rules of Mamdani (EWOA–DBSCAN-Mamdani) and Sugeno (EWOA–DBSCAN-Sugeno), separately. These ideas are proposed to present efficient and flexible unsupervised analysis for a data cube by utilizing a meta-heuristic algorithm to optimize DBSCAN’s parameters and increasing the efficiency of the idea by applying dynamic tuning parameters of the algorithm. To evaluate the efficiency, the proposed algorithms are compared with DBSCAN1 and GA-DBSCAN1, GA-DBSCAN1-Mamdani and GA-DBSCAN1-Sugeno. The experimental results, consisting of 20 runs, indicate that the proposed ideas achieved their targets.

...read moreread less

Journal Article•DOI•

Semantic Integration of Heterogeneous Databases of Same Domain Using Ontology

[...]

Muhammad Asfand-e-yar¹, Ramis Ali¹•Institutions (1)

Bahria University¹

20 Apr 2020-IEEE Access

TL;DR: The semantic Web ontology model is experimented and discussed in the article, which is based on the query execution model and shows the effectiveness and scalability of the methodology.

...read moreread less

Abstract: Heterogeneous database integration is the study of integrating data from multiple databases. Integrating the heterogeneous database of the same domain has three main challenges that make the heterogeneity problem difficult to solve. The three problems are Semantic, Syntactic and Structural Heterogeneity. Conventional heterogeneous database integration schemes, like De-duplication Techniques, Data Warehouse, and Information Retrieval (IR) Search technique lack the capability to solve the integration of databases completely. The only reason is they cannot deal with Semantic heterogeneity problems efficiently. The semantic Web ontology model is experimented and discussed in the article, which is based on the query execution model. The ontology modeling is divided into two phases, initially to translate the database rules according to ontology rules to find an abstract ontology model. Secondly, to extend the abstract ontology model according to the databases. The method facilitates to apply similarly SPQRAL queries to search the data in the databases. Therefore, the Jena API is used to retrieve semantically similar records. The experiment is based on the two heterogeneous Universities Library Databases. The results show the effectiveness and scalability of the methodology.

...read moreread less

Journal Article•DOI•

EVA: Efficient Versatile Auditing Scheme for IoT-Based Datamarket in Jointcloud

[...]

Ke Huang¹, Xiaosong Zhang¹, Yi Mu², Fatemeh Rezaeibagha³, Xiaofen Wang¹, Li Jingwei¹, Qi Xia¹, Jing Qin⁴ - Show less +4 more•Institutions (4)

University of Electronic Science and Technology of China¹, Fujian Normal University², Murdoch University³, Shandong University⁴

01 Feb 2020

TL;DR: This article proposes a versatile auditing scheme (EVA) that ensures that data are securely, efficiently, and dynamically stored in the jointcloud meanwhile supported by data trades via blockchain and gives a comprehensive security analysis based on security definitions and experiments to support claims.

...read moreread less

Abstract: Cloud storage offers convenient outsourcing services to users, and it serves as a basic platform to drive Internet-of-Things (IoT) where massive devices are connected to the cloud storage and interact with each other. However, cloud storage is more than a data warehouse. In the literature, data market was proposed as a novel model to empower IoT, where data are circulated as merchandise in the digital marketplace with financial activities. When storing IoT data in cloud storage, security and efficiency rules should be applied. Meanwhile, data dynamics is counted as a critical factor to the feasibility of datamarket as data are supposed to be manipulated through circulation and exploitation for IoT. Another issue is the single-point-of-failure (SPoF) of cloud server in which the initiative of jointcloud was suggested. Since providing data security, efficiency, and dynamics simultaneously is challenging, in this article, we propose a versatile auditing scheme (EVA) as a solution to problems. Our proposal ensures that data are securely, efficiently, and dynamically stored in the jointcloud meanwhile supported by data trades via blockchain. We give a comprehensive security analysis based on our security definitions and experiments to support our claims. The evidence has shown that our EVA is efficient for processing large files when proper parameters are chosen.

...read moreread less

Journal Article•DOI•

POLARIS: the distributed SQL engine in azure synapse

[...]

Josep Aguilar-Saborit¹, Raghu Ramakrishnan¹, Krish Srinivasan¹, Kevin Bocksrocker¹, Ioannis Alagiannis¹, Mahadevan Sankara¹, Moe Shafiei¹, José A. Blakeley¹, Girish Dasarathy¹, Sumeet Dash¹, Lazar Davidovic¹, Maja Damjanic¹, Slobodan Djunic¹, Nemanja Djurkic¹, Charles Feddersen¹, Cesar A. Galindo-Legaria¹, Alan Halverson¹, Milana Kovacevic¹, Nikola Kicovic¹, Goran Lukic¹, Djordje Maksimovic¹, Ana Manic¹, Nikola Markovic¹, Bosko Mihic¹, Ugljesa Milic¹, Marko Milojevic¹, Tapas Kumar Nayak¹, Milan Potocnik¹, Milos Radic¹, Bozidar Radivojevic¹, Srikumar Rangarajan¹, Milan Ruzic¹, Milan Simic¹, Marko Sosic¹, Igor Stanko¹, Maja Stikic¹, Sasa Stanojkov¹, Vukasin Stefanovic¹, Milos Sukovic¹, Aleksandar Tomic¹, Dragan Tomic¹, Steve Toscano¹, Djordje Trifunovic¹, Veljko Vasic¹, Tomer Verona¹, Aleksandar Vujic¹, Nikola Vujic¹, Marko Vukovic¹, Marko Zivanovic¹ - Show less +45 more•Institutions (1)

Microsoft¹

01 Aug 2020

TL;DR: The Polaris distributed SQL query engine in Azure Synapse is the result of a multi-year project to rearchitect the query processing framework in the SQL DW parallel data warehouse service, and addresses two main goals: converge data warehousing and big data workloads, and separate compute and state for cloud-native execution.

...read moreread less

Abstract: In this paper, we describe the Polaris distributed SQL query engine in Azure Synapse. It is the result of a multi-year project to re-architect the query processing framework in the SQL DW parallel data warehouse service, and addresses two main goals: (i) converge data warehousing and big data workloads, and (ii) separate compute and state for cloud-native execution.From a customer perspective, these goals translate into many useful features, including the ability to resize live workloads, deliver predictable performance at scale, and to efficiently handle both relational and unstructured data. Achieving these goals required many innovations, including a novel "cell" data abstraction, and flexible, fine-grained, task monitoring and scheduling capable of handling partial query restarts and PB-scale execution. Most importantly, while we develop a completely new scale-out framework, it is fully compatible with T-SQL and leverages decades of investment in the SQL Server single-node runtime and query optimizer. The scalability of the system is highlighted by a 1PB scale run of all 22 TPC-H queries; to our knowledge, this is the first reported run with scale larger than 100TB.

...read moreread less

Proceedings Article•DOI•

Data Integration in ETL Using TALEND

[...]

Sreemathy J, Infant Joseph, Nisha S, Chaaru Prabha I, Gokula Priya Rm - Show less +1 more

06 Mar 2020

TL;DR: The various steps involved in integrating data from various sources using the ETL process - The Extract, Transform and Load process is described, how the Talend Open Studio acting as a Data Integration and ETL tool helps in transforming heterogeneous data into homogeneous data for easy analysis and how all the integrated data is stored in a Data Warehouse.

...read moreread less

Abstract: Data Integration is the process of combining data from different sources to support Data Analytics in organizations. The best definition of data integration is given by IBM, stating “Data Integration is the combination of technical processes and business processes used to combine data from disparate sources into valuable and meaningful information.” The important terms here are “combine data… into valuable and meaningful data” where it's about making the data more organized and useful. There are various methods of combining data into an integrated view. This paper describes the various steps involved in integrating data from various sources using the ETL process - The Extract, Transform and Load process, how the Talend Open Studio acting as a Data Integration and ETL tool helps in transforming heterogeneous data into homogeneous data for easy analysis and how all the integrated data is stored in a Data Warehouse to provide Business Intelligence users with suitable data for easy analysis.

...read moreread less

Journal Article•DOI•

An Integrated Sensor Data Logging, Survey, and Analytics Platform for Field Research and Its Application in HAPIN, a Multi-Center Household Energy Intervention Trial

[...]

Daniel Wilson, Kendra N. Williams, Ajay Pillarisetti

28 Feb 2020-Sustainability

TL;DR: Geocene, an integrated sensor data logging, survey, and analytics platform for field research, is described and an example of Geocene’s ongoing use in the Household Air Pollution Intervention Network (HAPIN) is provided.

...read moreread less

Abstract: Researchers rely on sensor-derived data to gain insights on numerous human behaviors and environmental characteristics. While commercially available data-logging sensors can be deployed for a range of measurements, there have been limited resources for integrated hardware, software, and analysis platforms targeting field researcher use cases. In this paper, we describe Geocene, an integrated sensor data logging, survey, and analytics platform for field research. We provide an example of Geocene’s ongoing use in the Household Air Pollution Intervention Network (HAPIN). HAPIN is a large, multi-center, randomized controlled trial evaluating the impacts of a clean cooking fuel and stove intervention in Guatemala, India, Peru, and Rwanda. The platform includes Bluetooth-enabled, data-logging temperature sensors; a mobile application to survey participants, provision sensors, download sensor data, and tag sensor missions with metadata; and a cloud-based application for data warehousing, visualization, and analysis. Our experience deploying the Geocene platform within HAPIN suggests that the platform may have broad applicability to facilitate sensor-based monitoring and evaluation efforts and projects. This data platform can unmask heterogeneity in study participant behavior by using sensors that capture both compliance with and utilization of the intervention. Platforms like this could help researchers measure adoption of technology, collect more robust intervention and covariate data, and improve study design and impact assessments.

...read moreread less

Journal Article•DOI•

What You Need to Know Before Implementing a Clinical Research Data Warehouse: Comparative Review of Integrated Data Repositories in Health Care Institutions.

[...]

Kristina K. Gagalova¹, M Angelica Leon Elizalde¹, Elodie Portales-Casamar¹, Matthias Görges¹•Institutions (1)

University of British Columbia¹

27 Aug 2020

TL;DR: The underlying models and common features of IDRs are explored, a high-level overview is provided for those entering the field, and a set of guiding principles for small- to medium-sized health institutions embarking on IDR implementation are proposed.

...read moreread less

Abstract: Background: Integrated data repositories (IDRs), also referred to as clinical data warehouses, are platforms used for the integration of several data sources through specialized analytical tools that facilitate data processing and analysis. IDRs offer several opportunities for clinical data reuse, and the number of institutions implementing an IDR has grown steadily in the past decade. Objective: The architectural choices of major IDRs are highly diverse and determining their differences can be overwhelming. This review aims to explore the underlying models and common features of IDRs, provide a high-level overview for those entering the field, and propose a set of guiding principles for small- to medium-sized health institutions embarking on IDR implementation. Methods: We reviewed manuscripts published in peer-reviewed scientific literature between 2008 and 2020, and selected those that specifically describe IDR architectures. Of 255 shortlisted articles, we found 34 articles describing 29 different architectures. The different IDRs were analyzed for common features and classified according to their data processing and integration solution choices. Results: Despite common trends in the selection of standard terminologies and data models, the IDRs examined showed heterogeneity in the underlying architecture design. We identified 4 common architecture models that use different approaches for data processing and integration. These different approaches were driven by a variety of features such as data sources, whether the IDR was for a single institution or a collaborative project, the intended primary data user, and purpose (research-only or including clinical or operational decision making). Conclusions: IDR implementations are diverse and complex undertakings, which benefit from being preceded by an evaluation of requirements and definition of scope in the early planning stage. Factors such as data source diversity and intended users of the IDR influence data flow and synchronization, both of which are crucial factors in IDR architecture planning.

...read moreread less

Journal Article•DOI•

Efficient incremental loading in ETL processing for real-time data integration

[...]

Neepa Biswas¹, Anamitra Sarkar¹, Kartick Chandra Mondal¹•Institutions (1)

Jadavpur University¹

01 Mar 2020-Innovations in Systems and Software Engineering

TL;DR: This paper focuses on alternative ETL developmental approach taken by hand coding, and presents a comparative evaluation of some well-known code-based open-source ETL tools developed by the academic world.

...read moreread less

Abstract: ETL (extract transform load) is the widely used standard process for creating and maintaining a data warehouse (DW). ETL is the most resource-, cost- and time-demanding process in DW implementation and maintenance. Nowadays, many graphical user interfaces (GUI)-based solutions are available to facilitate the ETL processes. In spite of the high popularity of GUI-based tool, there is still some downside of such approach. This paper focuses on alternative ETL developmental approach taken by hand coding. In some contexts like research and academic work, it is appropriate to go for custom-coded solution which can be cheaper, faster and maintainable compared to any GUI-based tools. Some well-known code-based open-source ETL tools developed by the academic world have been studied in this article. Their architecture and implementation details are addressed here. The aim of this paper is to present a comparative evaluation of these code-based ETL tools. Finally, an efficient ETL model is designed to meet the near real-time responsibility of the present days.

...read moreread less

Journal Article•DOI•

A Survey on Trajectory Data Warehouse

[...]

Tariq Alsahfi¹, Mousa Almotairi¹, Mousa Almotairi², Ramez Elmasri¹•Institutions (2)

University of Texas at Austin¹, King Saud University²

01 Feb 2020-Spatial Information Research

TL;DR: A framework that aims to provide the requirements for building the Trajectory Data Warehouse (TDW) is proposed and discussed, which discusses different applications using the TDW and how these applications utilize theTDW.

...read moreread less

Abstract: Advanced technologies in location acquisition allow us to track the movement of moving objects (people, planes, vehicles, animals, ships, ...) in geographical space. These technologies generate a vast amount of trajectory data (TD). Several applications in different fields can utilize such TD, for example, traffic management control, social behavior analysis, wildlife migrations and movements, ship trajectories, shoppers behavior in a mall, facial nerve trajectory, location-based services and many others. Trajectory data can be mainly handled either with Moving Object Databases (MOD) or Trajectory Data Warehouse (TDW). In this paper, we aim to review existing studies on storing, managing, and analyzing TD using data warehouse technologies. We propose a framework that aims to provide the requirements for building the TDW. Furthermore, we discuss different applications using the TDW and how these applications utilize the TDW. We address some issues with existing TDWs and discuss future work in this field.

...read moreread less

Journal Article•DOI•

Data governance and citizen participation in the digital welfare state

[...]

Liesbet van Zoonen

06 Jul 2020

TL;DR: In this article, the authors zoom in on The Netherlands and show in detail how sound data governance is lacking at three levels: data experimentation and practices take place in a so-called "institutional void" without any clear democratic mandate; and they are often based on disputable quality of data and analytic models.

...read moreread less

Abstract: U.S., UK, and European municipalities are increasingly experimenting with data as an instrument for social policy. This movement pertains often to the design of municipal data warehouses, dashboards, and predictive analytics, the latter mostly to identify risk of fraud. This transition to data-driven social policy, captured by the term “digital welfare state,” almost completely takes place out of political and social view, and escapes democratic decision making. In this article, I zoom in on The Netherlands and show in detail how sound data governance is lacking at three levels: data experiments and practices take place in a so-called “institutional void” without any clear democratic mandate; moreover, they are often based on disputable quality of data and analytic models; and they tend to transgress the recent EU General Data Protection Regulation (GDPR) about privacy and data protection. I also assess that key stakeholders in this data transition, that is the citizens whose data are used, are not actively informed let alone invited to participate. As a result, a practice of top-down monitoring, containment and control is evolving despite the desire of civil servants in this domain to do “good” with data. I explore several data and policy alternatives in the conclusion to contribute to a higher quality and more democratic usage of data in the digital welfare state.

...read moreread less

Journal Article•DOI•

Towards an Ontology Proposal Model in Data Lake for Real-time COVID-19 Cases Prevention

[...]

Jabrane Kachaoui, Jihane Larioui, Abdessamad Belangour

13 Aug 2020

TL;DR: A novel approach that combines between the Semantic Web Services (SWS) and the Big Data characteristics in order to extract a significant information from multiple Data sources that can be exploitable for generating real-time statistics and reports is presented.

...read moreread less

Abstract: Globally, the coronavirus epidemic has now hit lives of millions and thousands of people around the world. The growing threat of this virus continues rising as new cases appear every day. Yet, affected countries by coronavirus are currently taking important measures to remedy it by using artificial intelligence (AI) and Big Data technologies. According to the World Health Organization (WHO), AI and Big Data have performed an important role in China's response to COVID-19, the genetic mutation name for coronavirus. Predicting an epidemic emergence, from the corona virus appearance to a person's predisposition to develop it, is fundamental to combating it. In this battle, Big Data is on the front line. However, Big Data cannot provide all of the expected insights and derive value from manipulated data. This is why we propose a semantic approach to facilitate the use of these data. In this paper, we present a novel approach that combines between the Semantic Web Services (SWS) and the Big Data characteristics in order to extract a significant information from multiple Data sources that can be exploitable for generating real-time statistics and reports.

...read moreread less

Journal Article•DOI•

Applying the ETL Process to Blockchain Data. Prospect and Findings

[...]

Roberta Galici, Laura Ordile, Michele Marchesi, Andrea Pinna, Roberto Tonelli - Show less +1 more

10 Apr 2020-Information-an International Interdisciplinary Journal

TL;DR: The ETL process to analyze blockchain data is proven to be able to perform a reliable and scalable data acquisition process, whose result makes stored data available for further analysis and business.

...read moreread less

Abstract: We present a novel strategy, based on the Extract, Transform and Load (ETL) process, to collect data from a blockchain, elaborate and make it available for further analysis. The study aims to satisfy the need for increasingly efficient data extraction strategies and effective representation methods for blockchain data. For this reason, we conceived a system to make scalable the process of blockchain data extraction and clustering, and to provide a SQL database which preserves the distinction between transaction and addresses. The proposed system satisfies the need to cluster addresses in entities, and the need to store the extracted data in a conventional database, making possible the data analysis by querying the database. In general, ETL processes allow the automation of the operation of data selection, data collection and data conditioning from a data warehouse, and produce output data in the best format for subsequent processing or for business. We focus on the Bitcoin blockchain transactions, which we organized in a relational database to distinguish between the input section and the output section of each transaction. We describe the implementation of address clustering algorithms specific for the Bitcoin blockchain and the process to collect and transform data and to load them in the database. To balance the input data rate with the elaboration time, we manage blockchain data according to the lambda architecture. To evaluate our process, we first analyzed the performances in terms of scalability, and then we checked its usability by analyzing loaded data. Finally, we present the results of a toy analysis, which provides some findings about blockchain data, focusing on a comparison between the statistics of the last year of transactions, and previous results of historical blockchain data found in the literature. The ETL process we realized to analyze blockchain data is proven to be able to perform a reliable and scalable data acquisition process, whose result makes stored data available for further analysis and business.

...read moreread less

Journal Article•DOI•

Obi-Wan: ontology-based RDF integration of heterogeneous data

[...]

Maxime Buron¹, François Goasdoué², Ioana Manolescu¹, Marie-Laure Mugnier¹•Institutions (2)

French Institute for Research in Computer Science and Automation¹, University of Rennes²

01 Aug 2020

TL;DR: The novelty of Obi-Wan is to combine maximum integration power (GLAV mappings) with the highest query answering power supported by an RDF mediator: RDF queries not only over the data but also over the integration ontologies, which makes it more flexible and powerful than comparable systems.

...read moreread less

Abstract: We consider the problem of integrating heterogeneous data (relational, JSON, key-values, graphs etc.) and querying it efficiently. Traditional data integration systems fall into two classes: data warehousing, where all data source content is materialized in a single repository, and mediation, where data remains in their original stores and all data can be queried through a mediator.We propose to demonstrate Obi-Wan, a novel mediator following the Ontology-Based Data access (OBDA) paradigm. Obi-Wan integrates data sources of many data models under an interface based on RDF graphs and ontologies (classes, properties, and relations between them). The novelty of Obi-Wan is to combine maximum integration power (GLAV mappings, see below) with the highest query answering power supported by an RDF mediator: RDF queries not only over the data but also over the integration ontologies. This makes it more flexible and powerful than comparable systems.

...read moreread less

Journal Article•DOI•

Autonomic performance prediction framework for data warehouse queries using lazy learning approach

[...]

Basit Raza¹, Adeel Aslam², Asma Sher¹, Ahmad Kamran Malik¹, Muhammad Faheem³ - Show less +1 more•Institutions (3)

COMSATS Institute of Information Technology¹, Huazhong University of Science and Technology², Abdullah Gül University³

01 Jun 2020-Applied Soft Computing

TL;DR: A cluster-based autonomic performance prediction framework using a case-based reasoning approach that determines the performance metrics of the data warehouse in advance by incorporating autonomic computing characteristics is proposed, which is helpful for query monitoring and management.

...read moreread less

Journal Article•DOI•

An improved join‐free snowflake schema for ETL and OLAP of data warehouse

[...]

Wang Jianmin, Zhao Wenbin, Fan Tongrang, Yang Shilong, Lv Hongwei - Show less +1 more

10 Dec 2020-Concurrency and Computation: Practice and Experience

TL;DR: The results show the proposed scheme can provide lower overhead than the traditional SQL‐based database while facilitating the scope and flexibility of data warehouse services.

...read moreread less

Abstract: The emergence of big data makes more and more enterprise change data management strategy, from simple data storage to OLAP query analysis; meanwhile, NoSQL‐based data warehouse receive more increasing attention than traditional SQL‐based database. By improving the JFSS model for ETL, this paper proposes the uniform distribution code (UDC), model identification code (MIC), standard dimension code (SDC), and attribute dimensional code (ADC); defines the data storage format of ; and identifies the extraction, transformation, and loading strategies of data warehouse. Several experiments are carried out to analyze single record and range record queries as typical OLAP based on Hadoop database (HBase). The results show the proposed scheme can provide lower overhead than the traditional SQL‐based database while facilitating the scope and flexibility of data warehouse services.

...read moreread less

Journal Article•DOI•

Automated pipeline framework for processing of large-scale building energy time series data.

[...]

Arash Khalilnejad¹, Ahmad Maroof Karimi¹, Shreyas Kamath¹, Rojiar Haddadian¹, Roger H. French¹, Alexis R. Abramson¹ - Show less +2 more•Institutions (1)

Case Western Reserve University¹

01 Dec 2020-PLOS ONE

TL;DR: This paper demonstrates virtual energy audits applied to large populations of buildings’ time-series smart-meter data using a systematic approach and a fully automated Building Energy Analytics (BEA) Pipeline that unifies, cleans, stores and analyzes building energy datasets in a non-relational data warehouse for efficient insights and results.

...read moreread less

Abstract: Commercial buildings account for one third of the total electricity consumption in the United States and a significant amount of this energy is wasted. Therefore, there is a need for “virtual” energy audits, to identify energy inefficiencies and their associated savings opportunities using methods that can be non-intrusive and automated for application to large populations of buildings. Here we demonstrate virtual energy audits applied to large populations of buildings’ time-series smart-meter data using a systematic approach and a fully automated Building Energy Analytics (BEA) Pipeline that unifies, cleans, stores and analyzes building energy datasets in a non-relational data warehouse for efficient insights and results. This BEA pipeline is based on a custom compute job scheduler for a high performance computing cluster to enable parallel processing of Slurm jobs. Within the analytics pipeline, we introduced a data qualification tool that enhances data quality by fixing common errors, while also detecting abnormalities in a building’s daily operation using hierarchical clustering. We analyze the HVAC scheduling of a population of 816 buildings, using this analytics pipeline, as part of a cross-sectional study. With our approach, this sample of 816 buildings is improved in data quality and is efficiently analyzed in 34 minutes, which is 85 times faster than the time taken by a sequential processing. The analytical results for the HVAC operational hours of these buildings show that among 10 building use types, food sales buildings with 17.75 hours of daily HVAC cooling operation are decent targets for HVAC savings. Overall, this analytics pipeline enables the identification of statistically significant results from population based studies of large numbers of building energy time-series datasets with robust results. These types of BEA studies can explore numerous factors impacting building energy efficiency and virtual building energy audits. This approach enables a new generation of data-driven buildings energy analysis at scale.

...read moreread less

Journal Article•DOI•

Fast and effective distribution-key recommendation for amazon redshift

[...]

Panos Parchas¹, Yonatan Naamad¹, Peter Van Bouwel¹, Christos Faloutsos², Michalis Petropoulos¹ - Show less +1 more•Institutions (2)

Amazon.com¹, Carnegie Mellon University²

01 Jul 2020

TL;DR: A generally-applicable data distribution framework initially designed for Amazon Redshift, a fully-managed petabyte-scale data warehouse in the cloud, and proposes BaW, a hybrid approach that combines heuristic and exact algorithms to find a good data distribution scheme.

...read moreread less

Abstract: How should we split data among the nodes of a distributed data warehouse in order to boost performance for a forecasted workload? In this paper, we study the effect of different data partitioning schemes on the overall network cost of pairwise joins. We describe a generally-applicable data distribution framework initially designed for Amazon Redshift, a fully-managed petabyte-scale data warehouse in the cloud. To formalize the problem, we first introduce the Join Multi-Graph, a concise graph-theoretic representation of the workload history of a cluster. We then formulate the "Distribution-Key Recommendation" problem - a novel combinatorial problem on the Join Multi-Graph - and relate it to problems studied in other subfields of computer science. Our theoretical analysis proves that "Distribution-Key Recommendation" is NP-complete and is hard to approximate efficiently. Thus, we propose BaW, a hybrid approach that combines heuristic and exact algorithms to find a good data distribution scheme. Our extensive experimental evaluation on real and synthetic data showcases the efficacy of our method into recommending optimal (or close to optimal) distribution keys, which improve the cluster performance by reducing network cost up to 32x in some real workloads.

...read moreread less

Journal Article•DOI•

The Moderating Role of Business Intelligence in the Impact of Big Data on Financial Reports Quality in Jordanian Telecom Companies

[...]

Omar Zraqat

29 Jan 2020-Modern Applied Science

TL;DR: In this article, a study aimed at discovering the impact of big data in terms of its dimensions (Variety, Velocity, Volume, and Veracity) on financial reports quality in the present business intelligence (OLAP, Data Mining, and Data Warehouse) as a moderating variable in Jordanian telecom companies.

...read moreread less

Abstract: This study aimed at discovering the impact of big data in terms of its dimensions (Variety, Velocity, Volume, and Veracity) on financial reports quality in the present business intelligence in terms of its dimensions (Online Analytical Processing (OLAP), Data Mining, and Data Warehouse) as a moderating variable in Jordanian telecom companies. The sample included (139) employees in Jordanian Telecom Companies. Multiple and Stepwise Linear Regression were used to test the effect of the independent variable on the dependent variable. And Hierarchical Regression analysis, to test the effect of the independent variable on the dependent variable in the presence of the moderating variable. The study reached a set of results, the most prominent of which was the presence of a statistically significant effect of using big data in improve the quality of financial reports, Business intelligence contributes to improving the impact of big data in terms of its dimensions (Volume, Velocity, Variety, and Veracity) on the quality of financial reports. The study recommends the necessity of working on making use of big data and resorting to business intelligence solutions because of its great role in improving the quality of financial reports and thus supporting decision-making functions for a large group of users.

...read moreread less

Collapse