
Showing papers on "Online analytical processing published in 2012"


Book
16 Nov 2012
TL;DR: This monograph provides an accessible introduction and reference to materialized views, explains their core ideas, highlights their recent developments, and points out their sometimes subtle connections to other research topics in databases.
Abstract: Materialized views are a natural embodiment of the ideas of precomputation and caching in databases. Instead of computing a query from scratch, a system can use results that have already been computed, stored, and kept in sync with database updates. The ability of materialized views to speed up queries benefits most database applications, ranging from traditional querying and reporting to web database caching, online analytical processing, and data mining. By reducing dependency on the availability of base data, materialized views have also laid much of the foundation for information integration and data warehousing systems. The database tradition of declarative querying distinguishes materialized views from generic applications of precomputation and caching in other contexts, and makes materialized views especially interesting, powerful, and challenging at the same time. Study of materialized views has generated a rich research literature and mature commercial implementations, aimed at providing efficient, effective, automated, and general solutions to the selection, use, and maintenance of materialized views. This monograph provides an accessible introduction and reference to materialized views, explains their core ideas, highlights their recent developments, and points out their sometimes subtle connections to other research topics in databases.

172 citations
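
To make the precomputation-and-caching idea concrete, here is a minimal Python sketch (not from the monograph) of an incrementally maintained aggregate view: queries are answered from stored state, and base-table updates apply deltas instead of recomputing from scratch.

```python
from collections import defaultdict

class MaterializedSumView:
    """Materializes SELECT key, SUM(val) ... GROUP BY key."""

    def __init__(self, base_rows):
        self.totals = defaultdict(float)
        for key, val in base_rows:      # initial full computation
            self.totals[key] += val

    def on_insert(self, key, val):
        self.totals[key] += val         # delta maintenance: O(1) per update

    def on_delete(self, key, val):
        self.totals[key] -= val

    def query(self, key):
        return self.totals[key]         # answered from precomputed state

view = MaterializedSumView([("a", 1.0), ("b", 2.0)])
view.on_insert("a", 3.0)
print(view.query("a"))  # 4.0, without rescanning the base table
```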


Journal ArticleDOI
01 Jul 2012
TL;DR: This work studies the problem of efficient processing of multi-way Theta-join queries using MapReduce from a cost-effective perspective and achieves significant improvements in join processing efficiency.
Abstract: Multi-way Theta-join queries are powerful in describing complex relations and therefore widely employed in practice. However, existing solutions from traditional distributed and parallel databases for multi-way Theta-join queries cannot be easily extended to fit a shared-nothing distributed computing paradigm, which is proven to be able to support OLAP applications over immense data volumes. In this work, we study the problem of efficient processing of multi-way Theta-join queries using MapReduce from a cost-effective perspective. Although there have been some works using the (key, value) pair-based programming model to support join operations, efficient processing of multi-way Theta-join queries has never been fully explored. The substantial challenge lies in, given a number of processing units (that can run Map or Reduce tasks), mapping a multi-way Theta-join query to a number of MapReduce jobs and having them executed in a well-scheduled sequence, such that the total processing time span is minimized. Our solution mainly includes two parts: 1) cost metrics for both a single MapReduce job and a sequence of MapReduce jobs executed in a certain order; 2) the efficient execution of a chain-typed Theta-join with only one MapReduce job. Compared with the query evaluation strategy proposed in [23] and the widely adopted Pig Latin and Hive SQL solutions, our method achieves significant improvements in join processing efficiency.

101 citations
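
As a toy illustration of evaluating a theta-join in the (key, value) MapReduce style the abstract mentions, the following in-memory sketch partitions R across reducers and replicates S to each of them (a fragment-and-replicate simplification, not the paper's cost-based job mapping).

```python
from itertools import product

def theta_join_mapreduce(R, S, theta, n_reducers=4):
    # "Map" phase: partition R round-robin across reducers and replicate S
    # to every reducer, so each (r, s) pair meets at exactly one reducer.
    buckets = [([], list(S)) for _ in range(n_reducers)]
    for i, r in enumerate(R):
        buckets[i % n_reducers][0].append(r)
    # "Reduce" phase: each reducer evaluates the theta predicate locally.
    out = []
    for rs, ss in buckets:
        out.extend((r, s) for r, s in product(rs, ss) if theta(r, s))
    return out

R, S = [1, 5, 9], [2, 5, 7]
print(theta_join_mapreduce(R, S, lambda r, s: r < s))
# [(1, 2), (1, 5), (1, 7), (5, 7)]
```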


Book ChapterDOI
27 May 2012
TL;DR: This work investigates the problem of executing OLAP queries via SPARQL on an RDF store, defines projection, slice, dice and roll-up operations on single data cubes published as Linked Data reusing the RDF Data Cube vocabulary, and shows how a nested set of operations leads to an OLAP query.
Abstract: Online Analytical Processing (OLAP) promises an interface to analyse Linked Data containing statistics going beyond other interaction paradigms such as follow-your-nose browsers, faceted-search interfaces and query builders. Transforming statistical Linked Data into a star schema to populate a relational database and applying a common OLAP engine do not allow optimising OLAP queries on RDF or directly propagating changes of Linked Data sources to clients. Therefore, as a new way to interact with statistics published as Linked Data, we investigate the problem of executing OLAP queries via SPARQL on an RDF store. First, we define projection, slice, dice and roll-up operations on single data cubes published as Linked Data reusing the RDF Data Cube vocabulary, and show how a nested set of operations leads to an OLAP query. Second, we show how to transform an OLAP query into a SPARQL query which generates all required tuples from the data cube. In a small experiment, we show the applicability of our OLAP-to-SPARQL mapping in answering a business question in the financial domain.

81 citations
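
The flavor of such an OLAP-to-SPARQL mapping can be sketched as follows: a roll-up over a cube published with the W3C RDF Data Cube (qb:) vocabulary becomes a SPARQL aggregate query over observations. The qb: terms are the real vocabulary; the dataset, dimension, and measure URIs below are hypothetical.

```python
def rollup_to_sparql(dataset, dim, measure):
    # Group observations of one qb:DataSet by a dimension and sum a measure.
    return f"""
PREFIX qb: <http://purl.org/linked-data/cube#>
SELECT ?level (SUM(?v) AS ?total)
WHERE {{
  ?obs a qb:Observation ;
       qb:dataSet <{dataset}> ;
       <{dim}> ?level ;
       <{measure}> ?v .
}}
GROUP BY ?level
"""

print(rollup_to_sparql(
    "http://example.org/ds/sales",        # hypothetical dataset URI
    "http://example.org/dim/year",        # hypothetical dimension property
    "http://example.org/measure/amount",  # hypothetical measure property
))
```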


Journal ArticleDOI
01 Jul 2012
TL;DR: A programming model oriented around deltas is presented, the execution and optimization of such programs in the authors' REX runtime system is described, and it is validated that the platform handles failures gracefully.
Abstract: In today's Web and social network environments, query workloads include ad hoc and OLAP queries, as well as iterative algorithms that analyze data relationships (e.g., link analysis, clustering, learning). Modern DBMSs support ad hoc and OLAP queries, but most are not robust enough to scale to large clusters. Conversely, "cloud" platforms like MapReduce execute chains of batch tasks across clusters in a fault tolerant way, but have too much overhead to support ad hoc queries. Moreover, both classes of platform incur significant overhead in executing iterative data analysis algorithms. Most such iterative algorithms repeatedly refine portions of their answers, until some convergence criterion is reached. However, general cloud platforms typically must reprocess all data in each step. DBMSs that support recursive SQL are more efficient in that they propagate only the changes in each step --- but they still accumulate each iteration's state, even if it is no longer useful. User-defined functions are also typically harder to write for DBMSs than for cloud platforms. We seek to unify the strengths of both styles of platforms, with a focus on supporting iterative computations in which changes, in the form of deltas, are propagated from iteration to iteration, and state is efficiently updated in an extensible way. We present a programming model oriented around deltas, describe how we execute and optimize such programs in our REX runtime system, and validate that our platform also handles failures gracefully. We experimentally validate our techniques, and show speedups over the competing methods ranging from 2.5 to nearly 100 times.

65 citations
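
A minimal sketch of the delta-propagation idea (the general semi-naive pattern, not REX's actual API): each iteration processes only the tuples changed in the previous one and updates state in place, keeping no per-iteration history.

```python
def connected_components(edges, nodes):
    label = {n: n for n in nodes}        # state: current component label
    delta = dict(label)                  # initially everything is "changed"
    adj = {n: [] for n in nodes}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    while delta:
        new_delta = {}
        for u, lab in delta.items():     # propagate only the changes
            for v in adj[u]:
                if lab < label[v]:
                    label[v] = lab       # state updated in place
                    new_delta[v] = lab
        delta = new_delta
    return label

print(connected_components([(1, 2), (2, 3), (4, 5)], [1, 2, 3, 4, 5]))
# {1: 1, 2: 1, 3: 1, 4: 4, 5: 4}
```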


Journal ArticleDOI
01 Mar 2012
TL;DR: The Semantic Web deployment is now a realization and the amount of semantic annotations is ever increasing, thanks to several initiatives that promote a change in the current Web towards the Semantic Web.
Abstract: The Semantic Web (SW) deployment is now a realization and the amount of semantic annotations is ever increasing thanks to several initiatives that promote a change in the current Web towards the We...

62 citations


01 Jan 2012
TL;DR: This work proposes a new vocabulary, denoted QB4OLAP, which extends QB to fully support OLAP models and operators, provides algorithms that build the structures that allow performing both kinds of analysis, and shows that compatibility between QB and QB4OLAP can be achieved at low cost.
Abstract: On-Line Analytical Processing (OLAP) tools allow querying large multidimensional (MD) databases called data warehouses (DW). OLAP-style data analysis over the semantic web (SW) is gaining momentum, and thus SW technologies will be needed to model, manipulate, and share MD data. To achieve this, the definition of a vocabulary that adequately represents OLAP data is required. Unfortunately, so far, the proposals in this direction have followed different roads. On the one hand, the QB vocabulary (a proposal by the W3C Government Linked Data Working Group) follows a model initially devised for analyzing statistical data, but does not adequately support OLAP multidimensional data. Another recent proposal, the Open Cube vocabulary (OC) follows closely the classic MD models for OLAP and allows implementing OLAP operators as SPARQL queries, but does not provide a mechanism for reusing data already published using QB. In this work, we propose a new vocabulary, denoted QB4OLAP, which extends QB to fully support OLAP models and operators. We show how data already published in QB can be analyzed a la OLAP using the QB4OLAP vocabulary, and vice versa. To this end we provide algorithms that build the structures that allow performing both kinds of analysis, and show that compatibility between QB and QB4OLAP can be achieved at low cost, only adding dimensional information.

60 citations
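
The core move, enriching already-published QB observations with the dimensional structure OLAP needs, might look roughly like this rdflib sketch. The qb: namespace is the real W3C vocabulary; the qb4o: term names are placeholders approximating QB4OLAP's level and hierarchy terms, and the example URIs are hypothetical.

```python
from rdflib import Graph, Namespace, RDF, Literal

QB   = Namespace("http://purl.org/linked-data/cube#")
QB4O = Namespace("http://purl.org/qb4olap/cubes#")  # QB4OLAP namespace
EX   = Namespace("http://example.org/")             # hypothetical data

g = Graph()
# A QB observation as it might already be published ...
g.add((EX.obs1, RDF.type, QB.Observation))
g.add((EX.obs1, EX.city, EX.Montevideo))
g.add((EX.obs1, EX.amount, Literal(42)))
# ... plus the extra dimensional structure QB lacks: the dimension property
# is declared as a level, and the member rolls up to a coarser one.
# (Term names below are illustrative stand-ins for QB4OLAP's actual terms.)
g.add((EX.city, RDF.type, QB4O.LevelProperty))
g.add((EX.Montevideo, QB4O.parentLevelMember, EX.Uruguay))

print(g.serialize(format="turtle"))
```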


Posted Content
TL;DR: This approach reorganizes and compresses transactional data online and yet hardly affects the mission-critical OLTP throughput by unburdening the OLTP threads from all additional processing and performing these tasks asynchronously.
Abstract: Growing main memory sizes have facilitated database management systems that keep the entire database in main memory. The drastic performance improvements that came along with these in-memory systems have made it possible to reunite the two areas of online transaction processing (OLTP) and online analytical processing (OLAP): An emerging class of hybrid OLTP and OLAP database systems allows processing analytical queries directly on the transactional data. By offering arbitrarily current snapshots of the transactional data for OLAP, these systems enable real-time business intelligence. Despite memory sizes of several terabytes in a single commodity server, RAM is still a precious resource: Since free memory can be used for intermediate results in query processing, the amount of memory determines query performance to a large extent. Consequently, we propose the compaction of memory-resident databases. Compaction consists of two tasks: First, separating the mutable working set from the immutable "frozen" data. Second, compressing the immutable data and optimizing it for efficient, memory-consumption-friendly snapshotting. Our approach reorganizes and compresses transactional data online and yet hardly affects the mission-critical OLTP throughput. This is achieved by unburdening the OLTP threads from all additional processing and performing these tasks asynchronously.

59 citations
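
A minimal sketch of the hot/cold split the abstract describes: new tuples live in a mutable working set, and full chunks are frozen into compressed, immutable blobs. Thresholds and formats are illustrative, and the real system performs freezing asynchronously rather than inline.

```python
import pickle
import zlib

class CompactingStore:
    def __init__(self, freeze_after=1000):
        self.hot = []                    # mutable working set (uncompressed)
        self.frozen = []                 # immutable, compressed chunks
        self.freeze_after = freeze_after

    def insert(self, row):
        self.hot.append(row)
        if len(self.hot) >= self.freeze_after:
            self._freeze()

    def _freeze(self):
        # Inline here for simplicity; done asynchronously in the real system
        # so the OLTP threads are not burdened.
        self.frozen.append(zlib.compress(pickle.dumps(self.hot)))
        self.hot = []

    def scan(self):                      # OLAP-style full scan
        for blob in self.frozen:
            yield from pickle.loads(zlib.decompress(blob))
        yield from self.hot

store = CompactingStore(freeze_after=2)
for r in [("t1", 10), ("t2", 20), ("t3", 30)]:
    store.insert(r)
print(list(store.scan()))  # [('t1', 10), ('t2', 20), ('t3', 30)]
```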


Book ChapterDOI
27 May 2012
TL;DR: This paper introduces Open Cubes, an RDFS vocabulary for the specification and publication of multidimensional cubes on the Semantic Web, and shows how classical OLAP operations can be implemented over Open Cubes using SPARQL 1.1 without the need of mapping the multidimensional information to the local database.
Abstract: Traditional OLAP tools have proven to be successful in analyzing large sets of enterprise data. For today's business dynamics, sometimes this highly curated data is not enough. External data (particularly web data) may be useful to enhance local analysis. In this paper we discuss the extraction of multidimensional data from web sources, and their representation in RDFS. We introduce Open Cubes, an RDFS vocabulary for the specification and publication of multidimensional cubes on the Semantic Web, and show how classical OLAP operations can be implemented over Open Cubes using SPARQL 1.1, without the need of mapping the multidimensional information to the local database (the usual approach to multidimensional analysis of Semantic Web data). We show that our approach is plausible for the data sizes that can usually be retrieved to enhance local data repositories.

56 citations


Journal ArticleDOI
01 Jul 2012
TL;DR: In this article, the compaction of memory-resident databases is proposed, which separates the mutable working set from the immutable "frozen" data and compresses the immutable data and optimizes it for efficient, memory-consumption-friendly snapshotting.
Abstract: Growing main memory sizes have facilitated database management systems that keep the entire database in main memory. The drastic performance improvements that came along with these in-memory systems have made it possible to reunite the two areas of online transaction processing (OLTP) and online analytical processing (OLAP): An emerging class of hybrid OLTP and OLAP database systems allows processing analytical queries directly on the transactional data. By offering arbitrarily current snapshots of the transactional data for OLAP, these systems enable real-time business intelligence. Despite memory sizes of several terabytes in a single commodity server, RAM is still a precious resource: Since free memory can be used for intermediate results in query processing, the amount of memory determines query performance to a large extent. Consequently, we propose the compaction of memory-resident databases. Compaction consists of two tasks: First, separating the mutable working set from the immutable "frozen" data. Second, compressing the immutable data and optimizing it for efficient, memory-consumption-friendly snapshotting. Our approach reorganizes and compresses transactional data online and yet hardly affects the mission-critical OLTP throughput. This is achieved by unburdening the OLTP threads from all additional processing and performing these tasks asynchronously.

54 citations


Proceedings ArticleDOI
01 Apr 2012
TL;DR: DPCube, a component in the HIDE framework, is demonstrated for releasing differentially private data cubes (or multi-dimensional histograms) for sensitive data; the released cubes can serve as a sanitized synopsis of the raw database and can support various Online Analytical Processing queries and learning tasks.
Abstract: We propose to demonstrate DPCube, a component in our Health Information DE-identification (HIDE) framework, for releasing differentially private data cubes (or multidimensional histograms) for sensitive data. HIDE is a framework we developed for integrating heterogeneous structured and unstructured health information and provides methods for privacy-preserving data publishing. The DPCube component provides the differentially private multidimensional data cube release. The DPCube algorithm uses the differentially private access mechanisms as provided by HIDE and guarantees differential privacy for the released data. It utilizes an innovative two-step multidimensional partitioning technique to publish a generalized data cube or multi-dimensional histogram that achieves good utility while satisfying the privacy requirement. We demonstrate that the released data cubes can serve as a sanitized synopsis of the raw database and, together with an optional synthesized dataset based on the data cubes, can support various Online Analytical Processing (OLAP) queries and learning tasks.

51 citations
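
The differential-privacy primitive underneath such data-cube releases can be sketched in a few lines: perturb each histogram count with Laplace noise calibrated to the count's sensitivity. This shows only the standard noisy-count step, not DPCube's two-step partitioning technique.

```python
import numpy as np

def dp_histogram(counts, epsilon):
    # Each individual affects exactly one cell, so the L1 sensitivity is 1
    # and the Laplace noise scale is 1/epsilon.
    scale = 1.0 / epsilon
    return counts + np.random.laplace(loc=0.0, scale=scale, size=counts.shape)

true_counts = np.array([120, 45, 60, 9], dtype=float)  # hypothetical cells
print(dp_histogram(true_counts, epsilon=0.5))
```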


Journal ArticleDOI
TL;DR: A peer-to-peer data warehousing architecture is envisioned, based on a network of heterogeneous peers, each exposing query answering functionalities aimed at sharing business information, and a query reformulation framework is introduced that relies on the translation of mappings, queries, and multidimensional schemata onto the relational level.

Proceedings ArticleDOI
27 Mar 2012
TL;DR: This is the first proposal that allows analyzing discrete and continuous spatiotemporal data and OLAP cubes together, using just the traditional OLAP operations, thus providing a very general framework for spatiotemporal data analysis.
Abstract: Nowadays, organizations need to use OLAP (On Line Analytical Processing) tools together with geographical information. To support this, the notion of SOLAP (Spatial OLAP) arose, aimed at exploring spatial data in the same way as OLAP operates over tables. SOLAP, however, only accounts for discrete spatial data. More sophisticated GIS-based decision support systems are increasingly needed to handle more complex types of data, like continuous fields. Fields describe physical phenomena that change continuously in time and/or space (e.g., temperature). Although many models have been proposed for adding spatial information to OLAP tools, none allows the user to perceive data as a cube, and analyze any type of spatial data, continuous or discrete, together with typical alphanumerical discrete OLAP data, using only the classic OLAP operators (e.g., Roll-up, Drill-down). In this paper we propose an algebra that operates over data cubes, independently of the underlying data types and physical data representation. That means, in our approach, the final user only sees the typical OLAP operators at the query level. At lower abstraction levels we provide discrete and continuous spatial data support as well as different ways of partitioning the space. We also describe a proof-of-concept implementation to illustrate the ideas presented in the paper. As far as we are aware, this is the first proposal that allows analyzing discrete and continuous spatiotemporal data and OLAP cubes together, using just the traditional OLAP operations, thus providing a very general framework for spatiotemporal data analysis.
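
The abstraction the paper argues for, one set of OLAP operators over interchangeable cube representations, can be caricatured as follows; the class names and sampling scheme are illustrative, not the paper's algebra.

```python
class DiscreteCube:
    def __init__(self, cells):           # {(region, month): value}
        self.cells = cells
    def value(self, region, month):
        return self.cells.get((region, month), 0.0)

class FieldCube:
    def __init__(self, field_fn):        # continuous field, e.g. temperature
        self.field_fn = field_fn
    def value(self, region, month):
        return self.field_fn(region, month)  # sampled on demand

def roll_up(cube, regions, month):       # same operator for both kinds
    return sum(cube.value(r, month) for r in regions) / len(regions)

sales = DiscreteCube({("north", 1): 10.0, ("south", 1): 20.0})
temp  = FieldCube(lambda region, month: 15.0 if region == "north" else 25.0)
print(roll_up(sales, ["north", "south"], 1))  # 15.0
print(roll_up(temp,  ["north", "south"], 1))  # 20.0
```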

12 Nov 2012
TL;DR: In this paper, the authors propose a new vocabulary, denoted QB4OLAP, which extends QB to fully support OLAP models and operators, and show how data already published in QB can be analyzed a la OLAP using the new vocabulary and vice versa.
Abstract: On-Line Analytical Processing (OLAP) tools allow querying large multidimensional (MD) databases called data warehouses (DW). OLAP-style data analysis over the semantic web (SW) is gaining momentum, and thus SW technologies will be needed to model, manipulate, and share MD data. To achieve this, the definition of a vocabulary that adequately represents OLAP data is required. Unfortunately, so far, the proposals in this direction have followed different roads. On the one hand, the QB vocabulary (a proposal by the W3C Government Linked Data Working Group) follows a model initially devised for analyzing statistical data, but does not adequately support OLAP multidimensional data. Another recent proposal, the Open Cube vocabulary (OC) follows closely the classic MD models for OLAP and allows implementing OLAP operators as SPARQL queries, but does not provide a mechanism for reusing data already published using QB. In this work, we propose a new vocabulary, denoted QB4OLAP, which extends QB to fully support OLAP models and operators. We show how data already published in QB can be analyzed a la OLAP using the QB4OLAP vocabulary, and vice versa. To this end we provide algorithms that build the structures that allow performing both kinds of analysis, and show that compatibility between QB and QB4OLAP can be achieved at low cost, only adding dimensional information.

Book ChapterDOI
28 Nov 2012
TL;DR: A graph data model, GOLAP, is proposed for online analytical processing on graphs; it extends decision support to multidimensional networks, considering both data objects and the relationships among them, and extends SPARQL to support n-dimensional computations.
Abstract: Graphs are essential modeling and analytical objects for representing information networks. Existing approaches to online analytical processing on graphs took the first step by supporting multi-level and multi-dimensional queries on graphs, but they do not provide a semantic-driven framework and a language to support n-dimensional computations, which are frequent in OLAP environments. The major challenge here is how to extend decision support to multidimensional networks considering both data objects and the relationships among them. Moreover, one of the critical deficiencies of graph query languages, e.g. SPARQL, is the lack of support for n-dimensional computations. In this paper, we propose a graph data model, GOLAP, for online analytical processing on graphs. This data model enables extending decision support to multidimensional networks considering both data objects and the relationships among them. Moreover, we extend SPARQL to support n-dimensional computations. The approaches presented in this paper have been implemented on top of FPSPARQL, a Folder-Path enabled extension of SPARQL, and experimentally validated on synthetic and real-world datasets.

Proceedings ArticleDOI
26 Aug 2012
TL;DR: This work describes the process of transforming the original stream into a set of related multidimensional cubes and demonstrates how the resulting data warehouse can be used for solving a variety of analytical tasks.
Abstract: In recent years, Twitter has evolved into an extremely popular social network and has revolutionized the ways of interacting and exchanging information on the Internet. By making its public stream available through a set of APIs, Twitter has triggered a wave of research initiatives aimed at analysis and knowledge discovery from the data about its users and their messaging activities. While most of the projects and tools are tailored towards solving specific tasks, we pursue the goal of providing an application-independent and universal analytical platform for supporting any kind of analysis and knowledge discovery. We employ the well-established data warehousing technology with its underlying multidimensional data model, an ETL routine for loading and consolidating data from different sources, OLAP functionality for exploring the data, and data mining tools for more sophisticated analysis. In this work we describe the process of transforming the original stream into a set of related multidimensional cubes and demonstrate how the resulting data warehouse can be used for solving a variety of analytical tasks. We expect our proposed approach to be applicable for analyzing the data of other social networks as well.
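
One step of such an ETL routine might look like the following sketch, which flattens a tweet into a fact row plus dimension rows for a star schema; the field names follow the general shape of Twitter's stream but are simplified and partly hypothetical.

```python
def tweet_to_star(tweet):
    # Dimension rows (user and time), then a fact row holding the measures.
    user = tweet["user"]
    dim_user = {"user_id": user["id"], "screen_name": user["screen_name"],
                "followers": user["followers_count"]}
    dim_time = {"time_id": tweet["created_at"][:13]}  # hour granularity
    fact = {"tweet_id": tweet["id"],
            "user_id": dim_user["user_id"],
            "time_id": dim_time["time_id"],
            "retweets": tweet.get("retweet_count", 0),       # measure
            "hashtags": len(tweet["entities"]["hashtags"])}  # measure
    return fact, dim_user, dim_time

sample = {"id": 1, "created_at": "2012-08-26T14:02:11",
          "retweet_count": 3,
          "entities": {"hashtags": [{"text": "olap"}]},
          "user": {"id": 7, "screen_name": "alice", "followers_count": 150}}
print(tweet_to_star(sample))
```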

Patent
16 May 2012
TL;DR: In this paper, a concurrent on-line analytical processing (OLAP)-oriented database query processing method is described, for performing, on the basis of predicate vector-based memory OLAP star-join optimization, concurrent OLAP query processing based on a batch query predicate vector bit operation.
Abstract: A concurrent on-line analytical processing (OLAP)-oriented database query processing method is described, for performing, on the basis of predicate vector-based memory OLAP star-join optimization, concurrent OLAP query processing based on a batch query predicate vector bit operation. The concurrent query processing optimization technology is implemented for I/O performance and parallel OLAP processing performance in a database management system, and setting of concurrent OLAP processing load in an optimized way catering to the I/O performance is supported, thereby improving predictable processing performance oriented to diversified OLAP queries and implementing concurrent query star-join bitmap filtering processing based on predicate vector arrays.
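
A toy rendering of the batched predicate-vector idea from the abstract (not the patent's actual implementation): each concurrent query contributes a bitmap over the dimension table, and one shared scan of the fact table answers the whole batch with bit tests instead of joins.

```python
queries = {
    "q1": lambda region: region == "EU",
    "q2": lambda region: region in ("EU", "US"),
}
dim_region = ["EU", "US", "ASIA", "EU"]     # dimension table (by row id)
fact = [(0, 10.0), (2, 5.0), (3, 7.5)]      # (region_row_id, measure)

# Build one predicate vector per query over the dimension table.
pred = {q: [f(r) for r in dim_region] for q, f in queries.items()}

# One shared scan of the fact table serves every query in the batch.
totals = {q: 0.0 for q in queries}
for rid, m in fact:
    for q in queries:
        if pred[q][rid]:                    # bit test replaces the star join
            totals[q] += m
print(totals)  # {'q1': 17.5, 'q2': 17.5}
```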

Patent
16 May 2012
TL;DR: In this paper, an OLAP query processing method oriented to a database and Hadoop hybrid platform is described, in which the database technology and the Hadoop technology are combined, uniting the storage performance of the database with the high expandability and high availability of Hadoop; the database query processing and the MapReduce query processing are integrated in a loosely-coupled mode.
Abstract: An OLAP query processing method oriented to a database and Hadoop hybrid platform is described. When OLAP query processing is performed, the processing is executed first on a main working copy, and the query processing result is recorded in an aggregate result table of a local database; when a working node is faulty, node information of a fault-tolerant copy corresponding to the main working copy is searched for through the namenode, and a MapReduce task is invoked to complete the OLAP query processing task on the fault-tolerant copy. The database technology and the Hadoop technology are combined, uniting the storage performance of the database with the high expandability and high availability of Hadoop; the database query processing and the MapReduce query processing are integrated in a loosely-coupled mode, thereby ensuring high query processing performance as well as high fault tolerance.

Journal ArticleDOI
01 Jan 2012
TL;DR: An Information Volatility Measure (IVM) is proposed to complement business intelligence (BI) tools when considering aggregated data (intra-cell) or when observing trends in data (inter-cell), drawn from volatility measures found in the field of finance.
Abstract: Health care decision makers and researchers often use reporting tools (e.g. Online Analytical Processing (OLAP)) that present data aggregated from multiple medical registries and electronic medical records to gain insights into health care practices and to understand and improve patient outcomes and quality of care. An important limitation is that the data are usually displayed as point estimates without full description of the instability of the underlying data, thus decision makers are often unaware of the presence of outliers or data errors. To manage this problem, we propose an Information Volatility Measure (IVM) to complement business intelligence (BI) tools when considering aggregated data (intra-cell) or when observing trends in data (inter-cell). The IVM definitions and calculations are drawn from volatility measures found in the field of finance, since the underlying data in both arenas display similar behaviors. The presentation of the IVM is supplemented with three types of benchmarking to support improved user understanding of the measure: numerical benchmarking, graphical benchmarking, and categorical benchmarking. The IVM is designed and evaluated using exploratory and confirmatory focus groups.
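
A finance-style volatility computation of the kind the abstract alludes to might look like this; the formula (standard deviation of period-over-period relative changes) is illustrative, not the paper's actual IVM definition.

```python
import statistics

def volatility(series):
    # Period-over-period relative changes ("returns" in finance terms).
    returns = [(b - a) / a for a, b in zip(series, series[1:]) if a != 0]
    return statistics.pstdev(returns)

# Hypothetical inter-cell trend for one aggregated indicator: the spike to
# 150 drives the volatility up, flagging an outlier or data error.
readmissions_per_month = [100, 104, 98, 150, 101]
print(round(volatility(readmissions_per_month), 3))
```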

Patent
13 Jan 2012
TL;DR: Techniques are described for managing aggregation of data in a distributed manner, such as for a particular client based on specified configuration information, including receiving information about multi-stage data manipulation operations that are to be performed as part of the data aggregation.
Abstract: Techniques are described for managing aggregation of data in a distributed manner, such as for a particular client based on specified configuration information. The described techniques may include receiving information about multi-stage data manipulation operations that are to be performed as part of the data aggregation, with each stage able to be performed in a distributed manner using multiple computing nodes—for example, a map-reduce architecture may be used, with a first stage involving the use of one or more specified map functions to be performed, and with at least a second stage involving the use of one or more specified reduce functions to be performed. In some situations, a particular set of input data may be used to generate the data for a multi-dimensional OLAP (“online analytical processing”) cube, such as for input data corresponding to a large quantity of transactions of one or more types.

Journal ArticleDOI
01 Mar 2012
TL;DR: This work argues for considering spatiality as a personalization feature within a formal design process, so that each decision maker will be able to access their own personalized SMD schema with the required spatial structures and instances, suitable to be properly analyzed at a glance.
Abstract: Spatial data warehouses (SDW) rely on extended multidimensional (MD) models in order to provide decision makers with appropriate structures to intuitively explore spatial data by using different analysis techniques such as OLAP (On-Line Analytical Processing) or data mining. Current development approaches are focused on defining a unique and static spatial multidimensional (SMD) schema at the conceptual level over which all decision makers fulfill their current spatial information needs. However, considering the required spatiality for each decision maker is likely to result in a potentially misleading SMD schema (even if a departmental DW or data mart is being defined). Furthermore, the spatial needs of each decision maker could change over time or depending on the context, thus requiring the SMD schema to be continuously updated with changes that can hamper decision making. Therefore, if a unique and static SMD schema is designed, acquiring the required spatial information is more costly than expected for decision makers and they may get frustrated during the analysis. To overcome these drawbacks, we argue for considering spatiality as a personalization feature within a formal design process. In this way, each decision maker will be able to access their own personalized SMD schema with the required spatial structures and instances, suitable to be properly analyzed at a glance. Our approach considers several novel artifacts: (i) a UML profile for spatial multidimensional modeling at the conceptual level; (ii) a spatial-aware user model in order to define the decision maker profile; and (iii) a spatial personalization language to define spatial needs of decision makers as personalization rules. The definition of personalized SMD schemas by using these artifacts is formally defined using the Software Process Engineering Metamodel Specification (SPEM) standard. Finally, the applicability of our approach is shown through a running example based on our Eclipse-based tool for SDW development.

Journal ArticleDOI
TL;DR: The authors define all modeling and querying requirements to do this integration, and then present a SOLAP model and algebra that support map generalization concepts, and extend SOLAP spatial hierarchies introducing multi-association relationships.
Abstract: Map generalization can be used as a central component of Spatial Decision Support Systems to provide a simplified and more readable cartographic visualization of geographic information. Indeed, it supports the user's mental process for discovering important and unknown geospatial relations, trends and patterns. Spatial OLAP (SOLAP) integrates spatial data into OLAP and data warehouse systems. SOLAP models and tools are based on the concepts of spatial dimensions and measures that represent the axes and the subjects of the spatio-multidimensional analysis. Although powerful in some respects, current SOLAP models cannot support map generalization capabilities. This paper provides the first effort to integrate map generalization and OLAP. First, the authors define all modeling and querying requirements for this integration, and then present a SOLAP model and algebra that support map generalization concepts. The approach extends SOLAP spatial hierarchies by introducing multi-association relationships, supports imprecise measures, and takes into account spatial dimension constraints generated by multiple map generalization hierarchies.

Patent
16 May 2012
TL;DR: In this article, a multi-dimensional OLAP query processing method oriented to a column store data warehouse is described, which is divided into a bitmap filtering operation, a group-by operation and an aggregate operation.
Abstract: A multi-dimensional OLAP query processing method oriented to a column store data warehouse is described. With this method, an OLAP query is divided into a bitmap filtering operation, a group-by operation and an aggregate operation. In the bitmap filtering operation, a predicate is first executed on a dimension table to generate a predicate vector bitmap, and a join operation is converted, through address mapping of a surrogate key, into a direct dimension table tuple access operation; in the group-by operation, a fact table tuple satisfying a filtering condition is pre-assigned to a group-by unit according to a group-by attribute in an SQL command and is given an increasing ID; and in the aggregate operation, group-by aggregate calculation is performed according to a group item of a fact table filtering group-by vector through a one-pass column scan on a fact table measure attribute.
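
The three phases can be caricatured over plain Python lists standing in for columns; the data layout and names below are illustrative, not the patented implementation.

```python
# Phase 1: a predicate on the dimension table yields a predicate vector.
dim_nation = ["FR", "DE", "FR", "US"]                # dimension column
pred_vec = [n in ("FR", "DE") for n in dim_nation]   # bitmap

fact_nation_fk = [0, 1, 3, 2, 1]                     # surrogate keys into dim
fact_price     = [9.0, 4.0, 2.0, 6.0, 1.0]           # measure column

# Phase 2: build the group-by vector (group id per fact row; None = filtered).
groups, group_of = {}, []
for fk in fact_nation_fk:
    if pred_vec[fk]:                                 # bit test, no join
        group_of.append(groups.setdefault(dim_nation[fk], len(groups)))
    else:
        group_of.append(None)

# Phase 3: one pass over the measure column aggregates per group.
agg = [0.0] * len(groups)
for gid, price in zip(group_of, fact_price):
    if gid is not None:
        agg[gid] += price
print(dict(zip(groups, agg)))  # {'FR': 15.0, 'DE': 5.0}
```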

Patent
12 Sep 2012
TL;DR: In this article, a multi-dimensional OLAP (On Line Analytical Processing) inquiry processing method facing a column storage data warehouse is presented. But the method is not suitable for the case of large data sets.
Abstract: The invention discloses a multi-dimensional OLAP (On-Line Analytical Processing) query processing method for a column-store data warehouse. In this method, the OLAP query is decomposed into a bitmap filtering operation, a grouping operation and an aggregation operation. In the bitmap filtering operation, the predicate is first executed on a dimension table and a predicate vector bitmap is generated; a join operation is converted into a direct dimension table record access operation by surrogate key address mapping, so as to implement access by position. In the grouping operation, grouping units are pre-generated for the fact table records which meet the filtering conditions, according to grouping attributes in an SQL (Structured Query Language) command, and increasing IDs are assigned. In the aggregation operation, the grouping aggregation calculation, carried out according to a grouping item of a grouping filtering vector of the fact table, is implemented by a single column scan of the metric attribute of the fact table. With this invention, all OLAP processing tasks can be completed with only one column scan of the fact table, avoiding the cost of repeated scanning.

Proceedings ArticleDOI
02 Nov 2012
TL;DR: This work proposes to extend the capabilities of OLAP via content-driven discovery of measures and dimensional characteristics in the original dataset, enabling structural enrichment of the original data beyond the scope of the existing methods for generating multidimensional models from relational or semi-structured data.
Abstract: With the standard OLAP technology, cubes are constructed from the input data based on the available data fields and known relationships between them. Structuring the data into a set of numeric measures distributed along a set of uniformly structured dimensions may be unrealistic for applications dealing with semi-structured data. We propose to extend the capabilities of OLAP via content-driven discovery of measures and dimensional characteristics in the original dataset. New structural elements are discovered by means of data mining and other techniques and are therefore prone to changes as the underlying dataset evolves. In this work we focus on the challenge of generating, maintaining, and querying such discovered elements of the cube. We demonstrate the benefits of our approach by providing OLAP to the public stream of user-generated content of the popular microblogging service Twitter. We were able to enrich the original set by discovering dynamic characteristics such as user activity, popularity, messaging behavior, as well as classifying messages by topic, impact, origin, method of generation, etc. Application of knowledge discovery techniques coupled with human expertise enables structural enrichment of the original data beyond the scope of the existing methods for generating multidimensional models from relational or semi-structured data.

01 Jan 2012
TL;DR: This work presents context-sensitive ranking, a ranking framework that integrates structured queries and relevance ranking, and leverages and extends the materialized view research in OLAP to deliver algorithms and data structures that evaluate context-sensitive ranking efficiently.
Abstract: We are witnessing a growing number of applications that involve both structured data and unstructured data. A simple example is academic citations: while the citation's content is unstructured text, the citation is associated with structured data such as author list, categories and publication time. To query such hybrid data, a natural approach is to combine structured queries with keyword search. Two fundamental problems arise for this unique marriage: (1) How to evaluate hybrid queries efficiently? (2) How to model relevance ranking? The second problem is especially difficult, because all the foundations of relevance ranking in information retrieval are built on unstructured text and no structures are considered. We present context-sensitive ranking, a ranking framework that integrates structured queries and relevance ranking. The key insight is that structured queries provide expressive search contexts. The ranking model collects keyword statistics in the contexts and feeds them into conventional ranking formulas to compute ranking scores. The query evaluation challenge is the computation of keyword statistics at runtime, which involves expensive online aggregations. At the core of our solution to overcome the efficiency issue is an innovative reduction from computing keyword statistics to answering aggregation queries. Many statistics, such as document frequency, require aggregations over the data space returned by the structured query. This is analogous to analytical queries in OLAP applications, which involve a large number of aggregations. We leverage and extend the materialized view research in OLAP to deliver algorithms and data structures that evaluate context-sensitive ranking efficiently.
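
The reduction at the heart of the thesis, computing a keyword statistic as an aggregation over the context selected by the structured query, can be sketched as follows (data and field names are hypothetical).

```python
docs = [
    {"year": 2010, "venue": "VLDB", "text": "olap cube aggregation"},
    {"year": 2012, "venue": "VLDB", "text": "olap on rdf"},
    {"year": 2012, "venue": "WWW",  "text": "graph mining"},
]

def context_df(docs, predicate, term):
    # The structured query defines the search context ...
    context = [d for d in docs if predicate(d)]
    # ... and document frequency is an aggregation over that context.
    df = sum(term in d["text"].split() for d in context)
    return df, len(context)

df, n = context_df(docs, lambda d: d["year"] == 2012, "olap")
print(f"df('olap' | year=2012) = {df}/{n}")  # 1/2
```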

Book ChapterDOI
29 May 2012
TL;DR: A new Intrusion Detection System (IDS), called OMC-IDS, is introduced, which integrates data mining techniques and On Line Analytical Processing (OLAP) tools; the association of the two fields can be a powerful solution to deal with the defects of IDS.
Abstract: Due to the growing threat of network attacks, efficient detection as well as network abuse assessment are of paramount importance. In this respect, Intrusion Detection Systems (IDS) are intended to protect information systems against intrusions. However, IDS are plagued by several problems that slow down their development, such as low detection accuracy and high false alarm rates. In this paper, we introduce a new IDS, called OMC-IDS, which integrates data mining techniques and On Line Analytical Processing (OLAP) tools. The association of the two fields can be a powerful solution to deal with the defects of IDS. Our experiment results show the effectiveness of our approach in comparison with related approaches.

Proceedings ArticleDOI
02 Nov 2012
TL;DR: This work develops a novel HMGraph OLAP (Heterogeneous and Multi-dimensional Graph OLAP) framework for the purpose of providing more dimensions and operations to mine multi-dimensional heterogeneous information networks, and proposes entity dimensions, an important dimension for heterogeneous network analysis.
Abstract: As information continues to grow at an explosive rate, more and more heterogeneous network data sources are coming into being. While OLAP (On-Line Analytical Processing) techniques have been proven effective for analyzing and mining structured data, unfortunately, to the best of our knowledge, there are no OLAP tools available that are able to analyze multi-dimensional heterogeneous networks from different perspectives and with multiple granularities. Therefore, we have developed a novel HMGraph OLAP (Heterogeneous and Multi-dimensional Graph OLAP) framework for the purpose of providing more dimensions and operations to mine multi-dimensional heterogeneous information networks. Following information dimensions and topological dimensions, we are the first to propose entity dimensions, an important dimension for heterogeneous network analysis. On the basis of this notion, we designed HMGraph OLAP operations named Rotate and Stretch for entity dimensions, which are able to mine relationships between different entities. We then propose the HMGraph Cube, an efficient data warehousing model for HMGraph OLAP. In addition, through comparison with common strategies, we show that the optimizations we propose deliver better performance. Finally, we have implemented an HMGraph OLAP prototype, LiterMiner, which has proven effective for the analysis of multi-dimensional heterogeneous networks.

Journal ArticleDOI
Lili Wu, Roshan Rajesh Sumbaly, Chris Riccomini, Gordon Koo, Hyung Jin Kim, Jay Kreps, Sam Shah
01 Aug 2012
TL;DR: To serve LinkedIn's growing 160 million member base, the company built a scalable and fast OLAP serving system called Avatara to solve the many, small cubes problem.
Abstract: Multidimensional data generated by members on websites has seen massive growth in recent years. OLAP is a well-suited solution for mining and analyzing this data. Providing insights derived from this analysis has become crucial for these websites to give members greater value. For example, LinkedIn, the largest professional social network, provides its professional members rich analytics features like "Who's Viewed My Profile?" and "Who's Viewed This Job?" The data behind these features form cubes that must be efficiently served at scale, and can be neatly sharded to do so. To serve our growing 160 million member base, we built a scalable and fast OLAP serving system called Avatara to solve this many, small cubes problem. At LinkedIn, Avatara has been powering several analytics features on the site for the past two years.
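
The "many, small cubes" serving pattern can be sketched as a key-value lookup: shard by member so each query touches one tiny cube. The storage layout below is illustrative, not Avatara's actual format.

```python
small_cubes = {  # key-value store: member_id -> that member's tiny cube
    42: {("2012-07", "recruiter"): 5, ("2012-07", "engineer"): 2},
    43: {("2012-07", "engineer"): 1},
}

def whos_viewed_my_profile(member_id, month):
    # One key lookup fetches one small cube; slicing it is cheap.
    cube = small_cubes.get(member_id, {})
    return {occ: n for (m, occ), n in cube.items() if m == month}

print(whos_viewed_my_profile(42, "2012-07"))
# {'recruiter': 5, 'engineer': 2}
```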

Proceedings ArticleDOI
21 May 2012
TL;DR: This work defines the problem of bank and value conflict optimization for data processing operators using the CUDA platform and uses two database operations, foreign-key join and grouped aggregation, to analyze the impact of these two factors on operator performance.
Abstract: Implementations of database operators on GPU processors have shown dramatic performance improvement compared to multicore-CPU implementations. GPU threads can cooperate using shared memory, which is organized in interleaved banks and is fast only when threads read and modify addresses belonging to distinct memory banks. Therefore, data processing operators implemented on a GPU, in addition to contention caused by popular values, have to deal with a new performance limiting factor: thread serialization when accessing values belonging to the same bank. Here, we define the problem of bank and value conflict optimization for data processing operators using the CUDA platform. To analyze the impact of these two factors on operator performance we use two database operations: foreign-key join and grouped aggregation. We suggest and evaluate techniques for optimizing the data arrangement offline by creating clones of values to reduce overall memory contention. Results indicate that columns used for writes, such as grouping columns, need to be optimized to fully exploit the maximum bandwidth of shared memory.
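
The shared-memory bank rule the paper builds on can be checked with quick arithmetic: with 32 banks and 4-byte words, word address w maps to bank w % 32, and a warp serializes when threads touch different words in the same bank. A small Python sketch:

```python
N_BANKS = 32  # banks per multiprocessor on current CUDA hardware

def conflict_degree(word_addrs):
    # Degree 1 means conflict-free; degree k means k-way serialization.
    by_bank = {}
    for w in word_addrs:
        by_bank.setdefault(w % N_BANKS, set()).add(w)
    return max(len(words) for words in by_bank.values())

stride_1  = list(range(32))           # one word per bank: conflict-free
stride_32 = [t * 32 for t in range(32)]  # all words land in bank 0
print(conflict_degree(stride_1), conflict_degree(stride_32))  # 1 32
```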

Proceedings ArticleDOI
25 Jun 2012
TL;DR: The cloud hosting of BI is demonstrated with the help of a simulation on OPNET comprising a cloud model with multiple OLAP application servers applying parallel query loads on an array of servers hosting relational databases, reflecting that true and extensible parallel processing of database servers on the cloud can efficiently process OLAP application demands.
Abstract: Cloud computing is gradually gaining popularity among businesses due to its distinct advantages over self-hosted IT infrastructures. The software-as-a-service providers serve as the primary interface to the business user community. However, the strategies and methods for hosting mission-critical business intelligence (BI) applications on the cloud are still being researched. BI is a highly resource-intensive system requiring large-scale parallel processing and significant storage capacities to host the data warehouses. OLAP (online analytical processing) is the user-end interface of BI that is designed to present multi-dimensional graphical reports to the end users. OLAP employs data cubes formed as a result of multidimensional queries run on an array of data warehouses. In self-hosted environments it was feared that BI would eventually face a resource crunch, because it would not be feasible for companies to keep adding resources to host the never-ending expansion of data warehouses and the OLAP demands on the underlying networking. Cloud computing has instigated new hope for the future prospects of BI. But how will BI be implemented on the cloud, and what will the traffic and demand profile look like? This paper attempts to answer these key questions pertaining to taking BI to the cloud. The cloud hosting of BI is demonstrated with the help of a simulation on OPNET comprising a cloud model with multiple OLAP application servers applying parallel query loads on an array of servers hosting relational databases. The simulation results show that true and extensible parallel processing of database servers on the cloud can efficiently process OLAP application demands. Hence, the BI designer needs to plan for a highly partitioned database running on massively parallel database servers in which each server hosts at least one partition of the underlying database serving the OLAP demands.