
Showing papers on "Online analytical processing" published in 2018


Journal ArticleDOI
TL;DR: The survey can help practitioners understand existing AQP techniques and select appropriate methods for their applications, and it outlines research challenges and opportunities in AQP.
Abstract: Online analytical processing (OLAP) is a core functionality in database systems. The performance of OLAP is crucial for making online decisions in many applications. However, it is rather costly to support OLAP on large datasets, especially big data, and methods that compute exact answers cannot meet the high-performance requirement. To alleviate this problem, approximate query processing (AQP) has been proposed, which aims to efficiently find an approximate answer that is as close as possible to the exact answer. Existing AQP techniques fall into two broad categories. (1) Online aggregation: select samples online and use these samples to answer OLAP queries. (2) Offline synopses generation: generate synopses offline based on a-priori knowledge (e.g., data statistics or query workload) and use these synopses to answer OLAP queries. We discuss the research challenges in AQP and summarize existing techniques to address these challenges. In addition, we review how to use AQP to support other complex data types, e.g., spatial data and trajectory data, and to support other applications, e.g., data visualization and data cleaning. We also introduce existing AQP systems and summarize their advantages and limitations. Lastly, we discuss research challenges and opportunities in AQP. We believe that the survey can help practitioners understand existing AQP techniques and select appropriate methods for their applications.

99 citations
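The online-aggregation category summarized in the abstract above can be pictured with a short, self-contained sketch: draw a growing uniform sample and answer an AVG query with a CLT-based confidence interval, stopping once the interval is tight enough. The synthetic data, batch size, and stopping threshold below are illustrative assumptions, not taken from the survey.

```python
import numpy as np

rng = np.random.default_rng(0)
population = rng.gamma(shape=2.0, scale=50.0, size=1_000_000)  # stand-in for a fact-table column

def approximate_avg(data, batch=1_000, z=1.96, max_rel_error=0.01):
    """Online aggregation: keep sampling until the 95% confidence interval is tight enough."""
    sample = np.empty(0)
    while len(sample) < len(data):
        sample = np.append(sample, rng.choice(data, size=batch, replace=False))
        estimate = sample.mean()
        half_width = z * sample.std(ddof=1) / np.sqrt(len(sample))
        if half_width <= max_rel_error * estimate:
            break
    return estimate, half_width, len(sample)

est, hw, n = approximate_avg(population)
print(f"AVG ~ {est:.2f} +/- {hw:.2f} from {n} sampled rows (exact: {population.mean():.2f})")
```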


Proceedings ArticleDOI
19 Jul 2018
TL;DR: This work presents REACT, a recommender system designed for modern IDA platforms that identifies and generalizes relevant (previous) sessions to generate personalized next-action suggestions to the user.
Abstract: Modern Interactive Data Analysis (IDA) platforms, such as Kibana, Splunk, and Tableau, are gradually replacing traditional OLAP/SQL tools, as they allow for easy-to-use data exploration, visualization, and mining, even for users lacking SQL and programming skills. Nevertheless, data analysis is still a difficult task, especially for non-expert users. To that end, we present REACT, a recommender system designed for modern IDA platforms. In these platforms, analysis sessions interweave high-level actions of multiple types and operate over diverse datasets. REACT identifies and generalizes relevant (previous) sessions to generate personalized next-action suggestions to the user. We model the user's analysis context using a generic tree-based model, where the edges represent the user's recent actions, and the nodes represent their result "screens". A dedicated context-similarity metric is employed for efficient indexing and retrieval of relevant candidate next-actions. These are then generalized to abstract actions that convey common fragments, then adapted to the specific user context. To prove the utility of REACT we performed an extensive online and offline experimental evaluation over real-world analysis logs from the cyber security domain, which we also publish to serve as a benchmark dataset for future work.

61 citations


Journal ArticleDOI
01 Dec 2018
TL;DR: This paper shows how to exploit hardware acceleration in a hybrid CPU+FPGA system that provides on-the-fly data transformation combined with an FPGA-based coordinate-descent engine, resulting in a column-store DBMS that preserves its important features while offering high-performance machine learning capabilities.
Abstract: The ability to perform machine learning (ML) tasks in a database management system (DBMS) provides the data analyst with a powerful tool. Unfortunately, integration of ML into a DBMS is challenging for reasons varying from differences in execution model to data layout requirements. In this paper, we assume a column-store main-memory DBMS, optimized for online analytical processing, as our initial system. On this system, we explore the integration of coordinate-descent based methods working natively on columnar format to train generalized linear models. We use a cache-efficient, partitioned stochastic coordinate descent algorithm providing linear throughput scalability with the number of cores while preserving convergence quality, up to 14 cores in our experiments. Existing column-oriented DBMSs rely on compression and even encryption to store data in memory. When those features are considered, the performance of a CPU-based solution suffers. Thus, in this paper we also show how to exploit hardware acceleration as part of a hybrid CPU+FPGA system to provide on-the-fly data transformation combined with an FPGA-based coordinate-descent engine. The resulting system is a column-store DBMS with its important features preserved (e.g., data compression) that offers high-performance machine learning capabilities.

40 citations
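As a rough illustration of coordinate descent working natively on a columnar layout (the setting of the abstract above, minus compression, partitioning, and the FPGA engine), the sketch below runs ridge-regularized least-squares coordinate descent over a list of column arrays. The synthetic data, regularization strength, and epoch count are assumptions for the example only.

```python
import numpy as np

def ridge_coordinate_descent(columns, y, lam=1.0, epochs=50):
    """Columns is a list of 1-D arrays (a columnar layout); returns the weight vector."""
    w = np.zeros(len(columns))
    residual = y.copy()                 # residual = y - X @ w (w starts at zero)
    col_sq = [c @ c for c in columns]   # per-column squared norms, computed once
    for _ in range(epochs):
        for j, x_j in enumerate(columns):
            residual += x_j * w[j]      # remove column j's current contribution
            w[j] = (x_j @ residual) / (col_sq[j] + lam)   # closed-form 1-D ridge update
            residual -= x_j * w[j]      # restore the full residual
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(10_000, 5))
true_w = np.array([2.0, -1.0, 0.5, 0.0, 3.0])
y = X @ true_w + rng.normal(scale=0.1, size=10_000)
cols = [np.ascontiguousarray(X[:, j]) for j in range(X.shape[1])]  # column-store view
print(ridge_coordinate_descent(cols, y, lam=0.1))
```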


Book ChapterDOI
01 Jan 2018
TL;DR: More than 20 years of research on data warehouse systems are surveyed, from their early relational implementations (still widely adopted in corporate environments), to the new architectures solicited by Business Intelligence 2.0 scenarios during the last decade, and up to the exciting challenges now posed by the integration with big data settings.
Abstract: Data Warehouses are the core of modern systems for decision making. They store integrated information extracted from various and heterogeneous data sources, making it available in multidimensional form for analyses aimed at improving the users' knowledge of their business. Though the first use of the term dates back to the 80s, only during the late 90s did data warehousing emerge as a research area in its own right, though in strict correlation with several other research topics such as database integration, view materialization, data visualization, etc. This paper surveys more than 20 years of research on data warehouse systems, from their early relational implementations (still widely adopted in corporate environments), to the new architectures solicited by Business Intelligence 2.0 scenarios during the last decade, and up to the exciting challenges now posed by the integration with big data settings. The timeline of research is organized into three interrelated tracks: techniques, architectures, and methodologies.

38 citations


Proceedings ArticleDOI
27 May 2018
TL;DR: In this article, a system is proposed to detect, explain, and resolve bias in decision-support queries; it performs a set of independence tests on the data to detect bias, and the authors also develop an automated method for rewriting a biased query into an unbiased query.
Abstract: Online analytical processing (OLAP) is an essential element of decision-support systems. OLAP tools provide insights and understanding needed for improved decision making. However, the answers to OLAP queries can be biased and lead to perplexing and incorrect insights. In this paper, we propose HypDB, a system to detect, explain, and resolve bias in decision-support queries. We give a simple definition of a biased query, which performs a set of independence tests on the data to detect bias. We propose a novel technique that gives explanations for bias, thus assisting an analyst in understanding what goes on. Additionally, we develop an automated method for rewriting a biased query into an unbiased query, which shows what the analyst intended to examine. In a thorough evaluation on several real datasets we show both the quality and the performance of our techniques, including the completely automatic discovery of the revolutionary insights from a famous 1973 discrimination case.

35 citations
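A minimal sketch of the kind of independence test such a bias detector might run: a chi-square test between a grouping attribute and the outcome, applied to the overall contingency table and then within each stratum of a confounder, so that a Simpson's-paradox-style bias shows up as overall dependence that vanishes per stratum. The synthetic counts below are made up for illustration and are not HypDB's actual test suite.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Synthetic 2x2 contingency tables per stratum: rows = group A/B, cols = accepted/rejected.
strata = {
    "dept_X": np.array([[80, 20],     # group A: 80% acceptance rate
                        [16,  4]]),   # group B: also 80%
    "dept_Y": np.array([[ 5, 45],     # group A: 10%
                        [20, 180]]),  # group B: also 10%
}

def looks_biased(table, alpha=0.05):
    """Chi-square independence test between group and outcome."""
    _, p_value, _, _ = chi2_contingency(table)
    return p_value < alpha, p_value

overall = sum(strata.values())
print("overall dependence:", looks_biased(overall))      # aggregate query looks biased
for name, table in strata.items():
    print(name, "dependence:", looks_biased(table))       # per-stratum queries do not
```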


Journal ArticleDOI
01 Aug 2018
TL;DR: This paper presents the end-to-end design of F1 Query, a stand-alone, federated query processing platform that executes SQL queries against data stored in different file-based formats as well as different storage systems at Google.
Abstract: F1 Query is a stand-alone, federated query processing platform that executes SQL queries against data stored in different file-based formats as well as different storage systems at Google (e.g., Bigtable, Spanner, Google Spreadsheets, etc.). F1 Query eliminates the need to maintain the traditional distinction between different types of data processing workloads by simultaneously supporting: (i) OLTP-style point queries that affect only a few records; (ii) low-latency OLAP querying of large amounts of data; and (iii) large ETL pipelines. F1 Query has also significantly reduced the need for developing hard-coded data processing pipelines by enabling declarative queries integrated with custom business logic. F1 Query satisfies key requirements that are highly desirable within Google: (i) it provides a unified view over data that is fragmented and distributed over multiple data sources; (ii) it leverages datacenter resources for performant query processing with high throughput and low latency; (iii) it provides high scalability for large data sizes by increasing computational parallelism; and (iv) it is extensible and uses innovative approaches to integrate complex business logic in declarative query processing. This paper presents the end-to-end design of F1 Query. Evolved out of F1, the distributed database originally built to manage Google's advertising data, F1 Query has been in production for multiple years at Google and serves the querying needs of a large number of users and systems.

32 citations


Proceedings ArticleDOI
27 May 2018
TL;DR: Pinot is presented, a single system used in production at LinkedIn that can serve tens of thousands of analytical queries per second, offers near-realtime data ingestion from streaming data sources, and handles the operational requirements of large web properties.
Abstract: Modern users demand analytical features on fresh, real-time data. Offering these analytical features to hundreds of millions of users is a relevant problem encountered by many large-scale web companies. Relational databases and key-value stores can be scaled to provide point lookups for a large number of users but fall apart under the combination of high ingest rates and high query rates at low latency for analytical queries. Online analytical databases typically rely on bulk data loads and are not built to handle nonstop operation in demanding web environments. Offline analytical systems have high throughput but neither offer low query latencies nor scale to serving tens of thousands of queries per second. We present Pinot, a single system used in production at LinkedIn that can serve tens of thousands of analytical queries per second, offers near-realtime data ingestion from streaming data sources, and handles the operational requirements of large web properties. We also provide a performance comparison with Druid, a system similar to Pinot.

27 citations


Journal ArticleDOI
TL;DR: This paper describes an approach, called iMOLD, that enables non-technical users to enrich an RDF cube with multidimensional knowledge by discovering aggregation hierarchies in LOD through a user-guided process that recognizes in the LOD the recurring modeling patterns expressing roll-up relationships between RDF concepts.

24 citations


Journal ArticleDOI
TL;DR: A game-theory-based framework for materialized view selection is proposed; experimental results show that the resulting GTMV method substantially outperforms previous algorithms.

22 citations


Journal ArticleDOI
01 Aug 2018
TL;DR: This paper presents an execution strategy, called OLTPShare, that implements a novel batching scheme for OLTP workloads and enables SAP HANA to deliver a significant throughput increase in high-load scenarios compared to the conventional execution strategy without sharing.
Abstract: In the past, resource sharing has been extensively studied for OLAP workloads. Naturally, the question arises why studies mainly focus on OLAP and not on OLTP workloads. At first sight, OLTP queries - due to their short runtime - may not have enough potential for the additional overhead. In addition, OLTP workloads do not only execute read operations but also updates. In this paper, we address query sharing for OLTP workloads. We first analyze the sharing potential in real-world OLTP workloads. Based on those findings, we then present an execution strategy, called OLTPShare, that implements a novel batching scheme for OLTP workloads. We analyze the sharing benefits by integrating OLTPShare into a prototype version of the commercial database system SAP HANA. Our results show for different OLTP workloads that OLTPShare enables SAP HANA to provide a significant throughput increase in high-load scenarios compared to the conventional execution strategy without sharing.

22 citations
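The general batching idea behind the abstract above can be pictured with a toy sketch that merges several queued point lookups into one shared IN-list statement and then demultiplexes the result rows back to the individual requests. It uses SQLite from the Python standard library purely for illustration; the table, schema, and batching policy are assumptions and not OLTPShare's implementation inside SAP HANA.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [(i, 100.0 * i) for i in range(1, 1001)])

def run_batched(pending_ids):
    """Merge many queued point lookups into one shared statement, then demultiplex."""
    placeholders = ",".join("?" * len(pending_ids))
    rows = conn.execute(
        f"SELECT id, balance FROM accounts WHERE id IN ({placeholders})",
        list(pending_ids)).fetchall()
    by_id = dict(rows)
    return {qid: by_id.get(qid) for qid in pending_ids}

# Each entry stands for one queued single-row query from a different client.
pending = [42, 7, 999, 512]
print(run_batched(pending))
```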


Proceedings ArticleDOI
27 May 2018
TL;DR: This approach, termed AnKer, follows the current trend of co-designing underlying system components and the DBMS to overcome the restrictions of the OS, introducing a custom system call vm_snapshot that allows fine-granular snapshot creation orders of magnitude faster than state-of-the-art approaches.
Abstract: Efficient transaction management is a delicate task. As systems face transactions of inherently different types, ranging from point updates to long-running analytical queries, it is hard to satisfy their requirements with a single execution engine. Unfortunately, most systems rely on a design that implements parallelism using multi-version concurrency control. While MVCC parallelizes short-running OLTP transactions well, it struggles in the presence of mixed workloads containing long-running OLAP queries, as scans have to work their way through vast amounts of versioned data. To overcome this problem, we reintroduce the concept of hybrid processing and combine it with state-of-the-art MVCC: OLAP queries are outsourced to run on separate virtual snapshots while OLTP transactions run on the most recent version of the database. Inside both execution engines, we still apply MVCC. The most significant challenge of a hybrid approach is to generate the snapshots at a high frequency. Previous approaches heavily suffered from the high cost of snapshot creation. In our approach, termed AnKer, we follow the current trend of co-designing underlying system components and the DBMS to overcome the restrictions of the OS by introducing a custom system call vm_snapshot. It allows fine-granular snapshot creation that is orders of magnitude faster than state-of-the-art approaches. Our experimental evaluation on an HTAP workload based on TPC-C transactions and OLAP queries shows that our snapshotting mechanism is more than a factor of 100x faster than fork-based snapshotting and that the latency of OLAP queries is up to a factor of 4x lower than MVCC in a single execution engine. Moreover, our approach enables a higher OLTP throughput than all state-of-the-art methods.
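For contrast with the custom vm_snapshot call described above, here is a minimal sketch of the fork-based snapshotting baseline the paper compares against: os.fork gives the child process a copy-on-write view of the parent's memory, so an analytical reader can scan a consistent snapshot while the writer keeps updating the latest version. Unix-only, and the toy dictionary "table" is an assumption for illustration.

```python
import os
import time

db = {i: 0 for i in range(5)}          # toy in-memory table

pid = os.fork()                        # copy-on-write snapshot of the whole process (Unix only)
if pid == 0:
    # Child = OLAP engine: sees the state as of the fork, unaffected by later parent writes.
    time.sleep(0.2)
    print("OLAP snapshot sum:", sum(db.values()))   # still the pre-fork values
    os._exit(0)
else:
    # Parent = OLTP engine: keeps mutating the most recent version of the database.
    for i in db:
        db[i] += 10
    os.waitpid(pid, 0)
    print("OLTP current sum:", sum(db.values()))
```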

Posted Content
TL;DR: A novel technique is proposed that gives explanations for bias, thus assisting an analyst in understanding what goes on, and an automated method for rewriting a biased query into an unbiased query, which shows what the analyst intended to examine.
Abstract: Online analytical processing (OLAP) is an essential element of decision-support systems. OLAP tools provide insights and understanding needed for improved decision making. However, the answers to OLAP queries can be biased and lead to perplexing and incorrect insights. In this paper, we propose HypDB, a system to detect, explain, and resolve bias in decision-support queries. We give a simple definition of a biased query, which performs a set of independence tests on the data to detect bias. We propose a novel technique that gives explanations for bias, thus assisting an analyst in understanding what goes on. Additionally, we develop an automated method for rewriting a biased query into an unbiased query, which shows what the analyst intended to examine. In a thorough evaluation on several real datasets we show both the quality and the performance of our techniques, including the completely automatic discovery of the revolutionary insights from a famous 1973 discrimination case.

Proceedings ArticleDOI
10 Jun 2018
TL;DR: Gremlinator is presented, the first translator from SPARQL, the W3C-standardized language for RDF, to Gremlin, a popular property graph traversal language, making Gremlin a desirable choice for supporting interoperability when querying graph databases.
Abstract: In the past decade, knowledge graphs have become very popular and frequently rely on the Resource Description Framework (RDF) or Property Graphs (PG) as their data models. However, the query languages for these two data models - SPARQL for RDF and the PG traversal language Gremlin - lack basic interoperability. In this demonstration paper, we present Gremlinator, the first translator from SPARQL - the W3C standardized language for RDF - to Gremlin - a popular property graph traversal language. Gremlinator translates SPARQL queries to Gremlin path traversals for executing graph pattern matching queries over graph databases. This allows a user who is well versed in SPARQL to access and query a wide variety of graph databases, avoiding the steep learning curve of adapting to a new Graph Query Language (GQL). Gremlin is a graph computing system-agnostic traversal language (covering both OLTP graph databases and OLAP graph processors), making it a desirable choice for supporting interoperability in querying graph databases. Gremlinator is planned to be released as an Apache TinkerPop plugin in an upcoming release.

Proceedings ArticleDOI
01 Feb 2018
TL;DR: A dual-addressable memory architecture based on non-volatile memory, called RC-NVM, is proposed to support both row-oriented and column-oriented accesses, together with a group caching technique that combines IMDB knowledge with the memory architecture to further optimize the system.
Abstract: Ever increasing DRAM capacity has fostered the development of in-memory databases (IMDB). The massive performance improvements provided by IMDBs have enabled transactions and analytics on the same database. In other words, the integration of OLTP (on-line transactional processing) and OLAP (on-line analytical processing) systems is becoming a general trend. However, conventional DRAM-based main memory is optimized for row-oriented accesses generated by OLTP workloads in row-based databases. OLAP queries scanning on specified columns cause so-called strided accesses and result in poor memory performance. Since memory access latency dominates in IMDB processing time, it can degrade overall performance significantly. To overcome this problem, we propose a dual-addressable memory architecture based on non-volatile memory, called RC-NVM, to support both row-oriented and column-oriented accesses. We first present circuit-level analysis to prove that such a dual-addressable architecture is only practical with RC-NVM rather than DRAM technology. Then, we rethink the addressing schemes, data layouts, cache synonym, and coherence issues of RC-NVM in architectural level to make it applicable for IMDBs. Finally, we propose a group caching technique that combines the IMDB knowledge with the memory architecture to further optimize the system. Experimental results show that the memory access performance can be improved up to 14.5X with only 15% area overhead.

Journal ArticleDOI
Babak Salimi, Corey Cole, Peter Li, Johannes Gehrke, Dan Suciu
01 Aug 2018
TL;DR: This work presents HypDB, the first system to detect, explain and resolve bias in OLAP queries, and demonstrates step-by-step how it eliminates the bias via query rewriting and generates decision-support insights.
Abstract: Online analytical processing (OLAP) is an essential element of decision-support systems. However, OLAP queries can be biased and lead to perplexing and incorrect insights. In this demo, we present HypDB, the first system to detect, explain and resolve bias in OLAP queries. Our demonstration shows several examples of OLAP queries from real-world datasets that are biased and could lead to statistical anomalies such as Simpson's paradox. Then, we demonstrate step-by-step how HypDB: (1) detects whether an OLAP query is biased, (2) explains the root causes of the bias and reveals illuminating insights about the domain and the data collection process, and (3) eliminates the bias via query rewriting and generates decision-support insights.

Proceedings ArticleDOI
01 Dec 2018
TL;DR: The Polypheny-DB vision of a distributed polystore system that seamlessly combines replication and partitioning with local polystores and that is able to dynamically adapt all parts of the system when the workload changes is presented.
Abstract: Cloud providers are more and more confronted with very diverse and heterogeneous requirements that their customers impose on the management of data. First, these requirements stem from service-level agreements that specify a desired degree of availability and a guaranteed latency. As a consequence, Cloud providers replicate data across data centers or availability zones and/or partition data and place it close to the location of their customers. Second, the workload at each Cloud data center or availability zone is diverse and may significantly change over time - e.g., an OLTP workload during regular business hours and OLAP analyses overnight. For this, polystore and multistore databases have recently been introduced, as they are intrinsically able to cope with such mixed and varying workloads. While the problem of heterogeneous requirements on data management in the Cloud is addressed either at the global level, by replicating and partitioning data across data centers, or at the local level, by providing polystore systems in a Cloud data center, there is no integrated solution that leverages the benefits of both approaches. In this paper, we present the Polypheny-DB vision of a distributed polystore system that seamlessly combines replication and partitioning with local polystores and that is able to dynamically adapt all parts of the system when the workload changes. We present the basic building blocks for both parts of the system and we discuss open challenges towards the implementation of the Polypheny-DB vision.

Journal ArticleDOI
TL;DR: This work postulates that the challenge of dynamically configuring hardware accelerators to match a given OLAP query can only be met in a scalable fashion when providing a cooperative optimization between global and FPGA-specific optimizers, and demonstrates how this is addressed in two current research projects on FPGA-based query processing.
Abstract: In the presence of exponential growth of the data produced every day in volume, velocity, and variety, online analytical processing (OLAP) is becoming increasingly challenging. FPGAs offer hardware reconfiguration to enable query-specific pipelined and parallel data processing with the potential of maximizing throughput, speedup, as well as energy and resource efficiency. However, dynamically configuring hardware accelerators to match a given OLAP query is a complex task. Furthermore, resource limitations restrict the coverage of OLAP operators. As a consequence, query optimization through partitioning the processing onto components of heterogeneous hardware/software systems seems a promising direction. While there exists work on operator placement for heterogeneous systems, it mainly targets systems combining multi-core CPUs with GPUs. However, an inclusion of FPGAs, which uniquely offer efficient and high-throughput pipelined processing at the expense of potential reconfiguration overheads, is still an open problem. We postulate that this challenge can only be met in a scalable fashion when providing a cooperative optimization between global and FPGA-specific optimizers. We demonstrate how this is addressed in two current research projects on FPGA-based query processing.

Proceedings ArticleDOI
27 May 2018
TL;DR: This paper addresses the problem of computing data statistics for workloads with rapid data ingestion, proposes a lightweight statistics-collection framework that exploits the properties of LSM storage, and performs an in-depth empirical evaluation.
Abstract: Data sources, such as social media, mobile apps and IoT sensors, generate billions of records each day. Keeping up with this influx of data while providing useful analytics to the users is a major challenge for today's data-intensive systems. A popular solution that allows such systems to handle rapidly incoming data is to rely on log-structured merge (LSM) storage models. LSM-based systems provide a tunable trade-off between ingesting vast amounts of data at a high rate and running efficient analytical queries on top of that data. For queries, it is well-known that the query processing performance largely depends on the ability to generate efficient execution plans. Previous research showed that OLAP query workloads rely on having small, yet precise, statistical summaries of the underlying data, which can drive the cost-based query optimization. In this paper we address the problem of computing data statistics for workloads with rapid data ingestion and propose a lightweight statistics-collection framework that exploits the properties of LSM storage. Our approach is designed to piggyback on the events (flush and merge) of the LSM lifecycle. This allows us to easily create initial statistics and then keep them in sync with rapidly changing data while minimizing the overhead to the existing system. We have implemented and adapted well-known algorithms to produce various types of statistical synopses, including equi-width histograms, equi-height histograms, and wavelets. We performed an in-depth empirical evaluation that considers both the cardinality estimation accuracy and runtime overheads of collecting and using statistics. The experiments were conducted by prototyping our approach on top of Apache AsterixDB, an open source Big Data management system that has an entirely LSM-based storage backend.
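A sketch of what piggybacking statistics on LSM lifecycle events could look like: each flush produces an equi-width histogram for the newly written component, and compaction merges component histograms bucket-wise. The key range, bucket count, and callback names are illustrative assumptions rather than AsterixDB's implementation.

```python
import numpy as np

NUM_BUCKETS, LO, HI = 8, 0, 1000                 # fixed equi-width bucket layout
edges = np.linspace(LO, HI, NUM_BUCKETS + 1)

def on_flush(memtable_keys):
    """Called when an in-memory component is written to disk: build its synopsis."""
    counts, _ = np.histogram(memtable_keys, bins=edges)
    return counts                                # stored alongside the on-disk component

def on_merge(*component_hists):
    """Called on LSM merge (compaction): synopses simply add up bucket-wise."""
    return np.sum(component_hists, axis=0)

rng = np.random.default_rng(2)
h1 = on_flush(rng.integers(LO, HI, size=10_000))
h2 = on_flush(rng.integers(LO, HI, size=5_000))
global_hist = on_merge(h1, h2)
# The optimizer can now estimate selectivities, e.g. "key < 250" covers the first 2 buckets.
print("estimated rows with key < 250:", int(global_hist[: NUM_BUCKETS // 4].sum()))
```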

Proceedings ArticleDOI
01 Jun 2018
TL;DR: The PMU data recovery performance of three existing algorithms, namely the Singular Value Thresholding (SVT) algorithm, the OnLine Algorithm for PMU data processing (OLAP), and the Jones-Pal-Thorp extrapolation algorithm, is investigated using historic PMU data from the New York and New England power systems.
Abstract: Rising numbers of Phasor Measurement Units (PMUs) are being installed in the North American power grid, allowing for a large amount of system dynamics to be collected at high sampling rates. The data collection process requires a phasor data network, which due to congestion, results in missing data affecting the reliability of applications using PMU data. Low-rank matrix methods have been proposed as tools to recover missing PMU data. The PMU data recovery performance of three existing algorithms, namely, the Singular Value Thresholding (SVT) algorithm, the OnLine Algorithm for PMU data processing (OLAP), and the Jones-Pal-Thorp extrapolation algorithm, as well as a modified version of the OLAP algorithm proposed in this paper, is investigated using historic PMU data from the New York and New England power systems.
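A compact sketch of the Singular Value Thresholding idea referenced above: iteratively soft-threshold the singular values of the current iterate and correct it on the observed entries only, so that missing entries of an approximately low-rank measurement matrix are filled in. The synthetic rank-4 matrix, missing-data rate, and parameter heuristics below are assumptions for illustration, not the paper's PMU setup.

```python
import numpy as np

def svt_complete(observed, mask, iters=300):
    """Recover missing entries of a low-rank matrix via singular value thresholding."""
    tau = 5 * np.sqrt(observed.size)              # threshold/step heuristics from the SVT literature
    delta = 1.2 * observed.size / mask.sum()
    Y = np.zeros_like(observed)
    X = np.zeros_like(observed)
    for _ in range(iters):
        U, s, Vt = np.linalg.svd(Y, full_matrices=False)
        X = (U * np.maximum(s - tau, 0.0)) @ Vt   # soft-threshold the singular values
        Y = Y + delta * mask * (observed - X)     # enforce agreement on observed entries only
    return X

rng = np.random.default_rng(3)
true = rng.normal(size=(60, 4)) @ rng.normal(size=(4, 40))   # synthetic rank-4 "measurement" matrix
mask = rng.random(true.shape) > 0.3                          # roughly 30% of entries missing
recovered = svt_complete(np.where(mask, true, 0.0), mask)
gap = np.linalg.norm((recovered - true)[~mask]) / np.linalg.norm(true[~mask])
print(f"relative error on the missing entries: {gap:.3f}")
```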

Book ChapterDOI
06 Dec 2018
TL;DR: New rules are proposed in this paper for transforming a multidimensional conceptual model into a NoSQL graph-oriented model, enabling the implementation of a data warehouse under the NoSQL paradigm.
Abstract: Big volumes of data cannot be processed by traditional warehouses and OLAP servers that are based on RDBMS solutions. As an alternative, Not only SQL (NoSQL) databases are becoming increasingly popular, as they have interesting strengths such as scalability and flexibility for an OLAP system. As NoSQL databases offer great flexibility, they can improve the classic solution based on data warehouses (DW). In recent years, many web applications have been moving towards the use of data in the form of graphs. For example, social media and the emergence of Facebook, LinkedIn and Twitter have accelerated the emergence of NoSQL databases and in particular graph-oriented databases, which represent the basic format with which data in these media is stored. Based on these findings, and given the absence of a clear approach for implementing a data warehouse under a NoSQL model, we propose, in this paper, new rules for transforming a multidimensional conceptual model into a NoSQL graph-oriented model.
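To make the flavor of such transformation rules concrete, here is a toy sketch in plain Python that maps a star-schema fact and its dimension members to property-graph nodes and edges, with roll-up relationships becoming edges between member nodes. The rule numbering, labels, and the tiny sales schema are illustrative assumptions, not the rules proposed in the chapter.

```python
# Toy star schema: one sales fact table with Product and Date dimensions.
facts = [
    {"fact_id": 1, "amount": 120.0, "product": "P42", "date": "2018-03-01"},
    {"fact_id": 2, "amount":  80.0, "product": "P42", "date": "2018-03-02"},
]
hierarchy = {"2018-03-01": "2018-03", "2018-03-02": "2018-03", "2018-03": "2018"}

nodes, edges = {}, []        # a property graph represented with plain dicts/lists

def add_node(key, label, **props):
    nodes.setdefault(key, {"label": label, **props})

for f in facts:
    add_node(f"fact:{f['fact_id']}", "Sale", amount=f["amount"])    # rule 1: fact -> node
    for dim in ("product", "date"):                                 # rule 2: dimension member -> node
        add_node(f"{dim}:{f[dim]}", dim.capitalize())
        edges.append((f"fact:{f['fact_id']}", "HAS_" + dim.upper(), f"{dim}:{f[dim]}"))

for child, parent in hierarchy.items():                             # rule 3: roll-up -> edge
    add_node(f"date:{child}", "Date")
    add_node(f"date:{parent}", "Date")
    edges.append((f"date:{child}", "ROLLS_UP_TO", f"date:{parent}"))

print(len(nodes), "nodes,", len(edges), "edges")
```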

Book ChapterDOI
01 Jan 2018
TL;DR: The process of integrating NoSQL and relational database with Hadoop Cluster, during an academic project using the Scrum Agile Method is described, which resulted in processing time significantly decreased.
Abstract: The project entitled as Big Data, Internet of Things, and Mobile Devices, in Portuguese Banco de Dados, Internet das Coisas e Dispositivos Moveis (BDIC-DM) was implemented at the Brazilian Aeronautics Institute of Technology (ITA) on the 1st Semester of 2015. It involved 60 graduate students within just 17 academic weeks. As a starting point for some features of real time Online Transactional Processing (OLTP) system, the Relational Database Management System (RDBMS) MySQL was used along with the NoSQL Cassandra to store transaction data generated from web portal and mobile applications. Considering batch data analysis, the Apache Hadoop Ecosystem was used for Online Analytical Processing (OLAP). The infrastructure based on the Apache Sqoop tool has allowed exporting data from the relational database MySQL to the Hadoop File System (HDFS), while Python scripts were used to export transaction data from the NoSQL database to the HDFS. The main objective of the BDIC-DM project was to implement an e-Commerce prototype system to manage credit card transactions, involving large volumes of data, by using different technologies. The used tools involved generation, storage, and consumption of Big Data. This paper describes the process of integrating NoSQL and relational database with Hadoop Cluster, during an academic project using the Scrum Agile Method. At the end, processing time significantly decreased, by using appropriate tools and available data. For future work, it is suggested the investigation of other tools and datasets.

Journal ArticleDOI
01 Oct 2018
TL;DR: This paper proposes many-query join (MQJoin), a novel method for sharing the execution of a join that can efficiently deal with hundreds of concurrent queries by minimizing redundant work and making efficient use of main-memory bandwidth and multi-core architectures.
Abstract: Database architectures typically process queries one at a time, executing concurrent queries in independent execution contexts. Often, such a design leads to unpredictable performance and poor scalability. One approach to circumvent the problem is to take advantage of sharing opportunities across concurrently running queries. In this paper, we propose many-query join (MQJoin), a novel method for sharing the execution of a join that can efficiently deal with hundreds of concurrent queries. This is achieved by minimizing redundant work and making efficient use of main-memory bandwidth and multi-core architectures. Compared to existing proposals, MQJoin is able to efficiently handle larger workloads regardless of the schema by exploiting more sharing opportunities. We also compared MQJoin to two commercial main-memory column-store databases. For a TPC-H-based workload, we show that MQJoin provides 2-5× higher throughput with significantly more stable response times.
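A toy sketch of the shared-join principle described above: build the hash table on the dimension side once, probe the fact side once, and route each matching row to every concurrent query whose predicate it satisfies. The query set, predicates, and aggregation are illustrative assumptions and not MQJoin's actual pipeline.

```python
from collections import defaultdict

# Concurrent queries that differ only in their filter on the dimension table.
queries = {"q1": lambda d: d["region"] == "EU",
           "q2": lambda d: d["region"] == "US",
           "q3": lambda d: True}                        # no filter

dim = [{"key": k, "region": r} for k, r in
       [(1, "EU"), (2, "US"), (3, "EU"), (4, "APAC")]]
fact = [{"dim_key": k, "amount": a} for k, a in
        [(1, 10.0), (2, 20.0), (3, 5.0), (4, 7.0), (1, 3.0)]]

# Build once: the hash table maps each join key to the set of interested queries.
build = defaultdict(set)
for d in dim:
    for qid, pred in queries.items():
        if pred(d):
            build[d["key"]].add(qid)

# Probe once: each fact row is routed to every query that wants it.
results = defaultdict(list)
for f in fact:
    for qid in build.get(f["dim_key"], ()):
        results[qid].append(f["amount"])

for qid in queries:
    print(qid, "SUM(amount) =", sum(results[qid]))
```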

Journal ArticleDOI
TL;DR: This work proposes a novel aggregation function for textual data based on the discovery of frequent closed patterns in a generated documents/keywords matrix that largely outperforms four state-of-the-art textual aggregation methods in terms of recall, precision, F-measure and runtime.
Abstract: Text mining approaches are commonly used to discover relevant information and relationships in huge amounts of text data. The term data mining refers to methods for analyzing data with the objective of finding patterns that aggregate the main properties of the data. The merger between the data mining approaches and on-line analytical processing (OLAP) tools allows us to refine techniques used in textual aggregation. In this paper, we propose a novel aggregation function for textual data based on the discovery of frequent closed patterns in a generated documents/keywords matrix. Our contribution aims at using a data mining technique, mainly a closed pattern mining algorithm, to aggregate keywords. An experimental study on a real corpus of more than 700 scientific papers collected on Microsoft Academic Search shows that the proposed algorithm largely outperforms four state-of-the-art textual aggregation methods in terms of recall, precision, F-measure and runtime.
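A naive sketch of the underlying notion of frequent closed keyword sets over a documents/keywords matrix: enumerate frequent keyword sets by brute force and keep only those with no proper superset of equal support. The tiny corpus and support threshold are made-up assumptions, and a real system would use a dedicated closed-pattern mining algorithm rather than this enumeration.

```python
from itertools import combinations

# Each document is reduced to its set of keywords (one row of the documents/keywords matrix).
docs = [
    {"olap", "cube", "aggregation"},
    {"olap", "cube", "text"},
    {"olap", "text", "mining"},
    {"cube", "aggregation"},
]
min_support = 2
keywords = sorted(set().union(*docs))

def support(itemset):
    return sum(itemset <= d for d in docs)

frequent = []
for size in range(1, len(keywords) + 1):
    for combo in combinations(keywords, size):
        s = support(set(combo))
        if s >= min_support:
            frequent.append((frozenset(combo), s))

# A frequent set is closed if no proper superset has the same support.
closed = [(items, s) for items, s in frequent
          if not any(items < other and s == s2 for other, s2 in frequent)]
print(closed)   # these closed keyword sets would serve as the aggregated summary
```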

Journal ArticleDOI
TL;DR: This work shows how a sizable amount of data, spread across a wide range of file formats and structures, and originating from a number of different sources belonging to various business domains, can be integrated in a single system that researchers can use for global data analysis and mining.
Abstract: Processing data that originates from different sources (such as environmental and medical data) can prove to be a difficult task, due to the heterogeneity of variables, storage systems, and file formats that can be used. Moreover, once the amount of data reaches a certain threshold, conventional mining methods (based on spreadsheets or statistical software) become cumbersome or even impossible to apply. Data Extract, Transform, and Load (ETL) solutions provide a framework to normalize and integrate heterogeneous data into a local data store. Additionally, the application of Online Analytical Processing (OLAP), a set of Business Intelligence (BI) methodologies and practices for multidimensional data analysis, can be an invaluable tool for its examination and mining. In this article, we describe a solution based on an ETL + OLAP tandem used for the on-the-fly analysis of tens of millions of individual medical, meteorological, and air quality observations from 16 provinces in Spain, provided by 20 different national and regional entities in a diverse array of file types and formats, with the intention of evaluating the effect of several environmental variables on human health in future studies. Our work shows how a sizable amount of data, spread across a wide range of file formats and structures, and originating from a number of different sources belonging to various business domains, can be integrated in a single system that researchers can use for global data analysis and mining.

Journal ArticleDOI
TL;DR: This paper proposes Janus, a hybrid scalable cloud datastore, which enables the efficient execution of diverse workloads by storing data in different representations, and uses row and column-oriented representations, which are the most efficient representations for these workloads.
Abstract: Cloud-based data-intensive applications have to process high volumes of transactional and analytical requests on large-scale data. Businesses base their decisions on the results of analytical requests, creating a need for real-time analytical processing. We propose Janus, a hybrid scalable cloud datastore, which enables the efficient execution of diverse workloads by storing data in different representations. Janus manages big datasets in the context of datacenters, thus supporting scaling out by partitioning the data across multiple servers. This requires Janus to efficiently support distributed transactions. In order to support the different datacenter requirements, Janus also allows diverse partitioning strategies for the different representations. Janus proposes a novel data movement pipeline to continuously ensure up-to-date data across the different representations. Unlike existing multi-representation storage systems and Change Data Capture (CDC) pipelines, the data movement pipeline in Janus supports partitioning and handles both distributed transactions and diverse partitioning strategies. In this paper, we focus on supporting Online Transaction Processing (OLTP) and Online Analytical Processing (OLAP) workloads, and hence use row and column-oriented representations, which are the most efficient representations for these workloads. Our evaluations over Amazon AWS illustrate that Janus can provide real-time analytical results, in addition to processing high-throughput transactional workloads.

Proceedings ArticleDOI
01 Jul 2018
TL;DR: A cost-based optimization framework is proposed that identifies appropriate ML models to combine at query time; extensive experiments on real-world and synthetic datasets indicate that the framework can support analytic queries on ML models with superior performance, achieving dramatic speedups of several orders of magnitude on very large datasets.
Abstract: Machine learning has become an essential toolkit for complex analytic processing. Data is typically stored in large data warehouses with multiple dimension hierarchies. Often, data used for building an ML model are aligned on OLAP hierarchies such as location or time. In this paper, we investigate the feasibility of efficiently constructing approximate ML models for new queries from previously constructed ML models by leveraging the concepts of model materialization and reuse. For example, is it possible to construct an approximate ML model for data from the year 2017 if one already has ML models for each of its quarters? We propose algorithms that can support a wide variety of ML models such as generalized linear models for classification along with K-Means and Gaussian Mixture models for clustering. We propose a cost-based optimization framework that identifies appropriate ML models to combine at query time, and we conduct extensive experiments on real-world and synthetic datasets. Our results indicate that our framework can support analytic queries on ML models, with superior performance, achieving dramatic speedups of several orders of magnitude on very large datasets.
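One simple way to picture model materialization and reuse along an OLAP hierarchy, as in the quarter-to-year example above: keep per-quarter models as sufficient statistics (count, sum, sum of squares) and combine them to answer a year-level query without rescanning the base data. The quarterly data and Gaussian summary below are illustrative assumptions, not the paper's model classes or cost-based optimizer.

```python
import numpy as np

rng = np.random.default_rng(4)
quarters = {q: rng.normal(loc=mu, scale=3.0, size=50_000)
            for q, mu in [("2017Q1", 10), ("2017Q2", 12), ("2017Q3", 11), ("2017Q4", 15)]}

# "Materialize" one model per quarter as sufficient statistics: (count, sum, sum of squares).
models = {q: (len(x), x.sum(), (x ** 2).sum()) for q, x in quarters.items()}

def combine(stats):
    """Merge sufficient statistics to answer a coarser-grained (year-level) query."""
    n = sum(s[0] for s in stats)
    total = sum(s[1] for s in stats)
    total_sq = sum(s[2] for s in stats)
    mean = total / n
    var = total_sq / n - mean ** 2
    return mean, var

year_mean, year_var = combine(list(models.values()))
all_data = np.concatenate(list(quarters.values()))
print(f"combined model: mean={year_mean:.3f}, var={year_var:.3f}")
print(f"exact          : mean={all_data.mean():.3f}, var={all_data.var():.3f}")
```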

Journal ArticleDOI
TL;DR: In this paper, the authors propose a set of intentional OLAP operators, namely, describe, assess, explain, predict, and suggest, which express the user's need for results.
Abstract: This paper structures a novel vision for OLAP by fundamentally redefining several of the pillars on which OLAP has been based for the last 20 years. We redefine OLAP queries, in order to move to higher degrees of abstraction from roll-ups and drill-downs, and we propose a set of novel intentional OLAP operators, namely, describe, assess, explain, predict, and suggest, which express the user's need for results. We fundamentally redefine what a query answer is, and escape from the constraint that the answer is a set of tuples; on the contrary, we complement the set of tuples with models (typically, but not exclusively, results of data mining algorithms over the involved data) that concisely represent the internal structure or correlations of the data. Due to the diverse nature of the involved models, we come up (for the first time ever, to the best of our knowledge) with a unifying framework for them that places its pillars on the extension of each data cell of a cube with information about the models that pertain to it, practically converting the small parts that build up the models to data that annotate each cell. We exploit this data-to-model mapping to provide highlights of the data, by isolating data and models that maximize the delivery of new information to the user. We introduce a novel method for assessing the surprise that a new query result brings to the user, with respect to the information contained in previous results the user has seen, via a new interestingness measure. The individual parts of our proposal are integrated in a new data model for OLAP, which we call the Intentional Analytics Model. We complement our contribution with a list of significant open problems for the community to address.

Journal ArticleDOI
TL;DR: This paper proposes the creation, integration, and implementation into a multidimensional model of a new dimension, called the Contextual Dimension, built from texts obtained from social networks; the dimension is created automatically by applying hierarchical clustering algorithms and is fully independent of the language of the texts.
Abstract: Due to the continuous growth of social networks, the textual information available has increased exponentially. Data warehouses (DW) and online analytical processing (OLAP) are some of the established technologies to process and analyze structured data. However, one of their main limitations is the lack of automatic processing and analysis of unstructured data (specifically, textual data) and its integration with structured data. This paper proposes the creation, integration, and implementation of a new dimension, called the Contextual Dimension, built from texts obtained from social networks, into a multidimensional model. Such a dimension is automatically created after applying hierarchical clustering algorithms and is fully independent of the language of the texts. This dimension allows the inclusion of multidimensional analysis of texts using contexts and topics integrated with conventional dimensions into business decisions. The experiments were carried out by means of a freeware OLAP system (Wonder 3.0) using real data from social networks.
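As a rough sketch of how a contextual dimension could be derived from short texts, the snippet below vectorizes a handful of posts with TF-IDF and groups them with agglomerative (hierarchical) clustering; each cluster would then become a member of the new dimension. The sample posts, cluster count, and use of scikit-learn are assumptions for illustration, not the Wonder 3.0 pipeline from the paper.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import AgglomerativeClustering

posts = [
    "battery drains fast after the update",
    "the new update killed my battery life",
    "great camera in low light",
    "low light photos look amazing",
    "delivery was late again",
    "package arrived two days late",
]

# Vectorize without language-specific preprocessing, then cluster hierarchically.
X = TfidfVectorizer().fit_transform(posts).toarray()
labels = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X)

# Each cluster becomes one member of the new "Context" dimension.
context_dim = {}
for post, label in zip(posts, labels):
    context_dim.setdefault(f"context_{label}", []).append(post)
for member, member_posts in context_dim.items():
    print(member, "->", member_posts)
```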

Journal ArticleDOI
TL;DR: An algebraic query language based on "incident patterns" with four operators inspired by the Business Process Model and Notation representation is developed, allowing the user to formulate ad hoc queries directly over workflow logs and bypassing the traditional methodology for more flexibility in querying.
Abstract: A business process or workflow is an assembly of tasks that accomplishes a business goal. Business process management is the study of the design, configuration/implementation, enactment and monitoring, analysis, and re-design of workflows. The traditional methodology for the re-design and improvement of workflows relies on the well-known sequence of extract, transform, and load (ETL), data/process warehousing, and online analytical processing (OLAP) tools. In this paper, we study the ad hoc querying of process enactments for (data-centric) business processes, bypassing the traditional methodology for more flexibility in querying. We develop an algebraic query language based on "incident patterns" with four operators inspired by the Business Process Model and Notation (BPMN) representation, allowing the user to formulate ad hoc queries directly over workflow logs. A formal semantics of this query language, a preliminary query evaluation algorithm, and a group of elementary properties of the operators are provided.

Proceedings ArticleDOI
27 May 2018