Author
Xinjian Lu
Bio: Xinjian Lu is an academic researcher from California State University. The author has contributed to research on topics including Data warehouse and Fact table. The author has an h-index of 1, having co-authored 1 publication receiving 7 citations.
Topics: Data warehouse, Fact table, Page, Foreign key
Papers
TL;DR: A cost model is formulated to express the expected time to read the desired data as a function of the disk system's parameters (seek time, rotational latency, and reading speed) and the lengths of the foreign keys, and an algorithm is provided for identifying the most desirable disk page size.
Abstract: This paper examines the strategic arrangement of fact data in a data warehouse in order to answer analytical queries efficiently. Usually, the composite of the foreign keys from the dimension tables is defined as the fact table's primary key. We focus on analytical queries that specify a value for a randomly chosen foreign key. The desired data for answering a query are typically located at different parts of the disk, thus requiring multiple disk I/Os to read them from disk to memory. We formulate a cost model to express the expected time to read the desired data as a function of the disk system's parameters (seek time, rotational latency, and reading speed) and the lengths of the foreign keys. For a predetermined disk page size, we search for an arrangement of the fact data that minimizes the expected time cost. An algorithm is then provided for identifying the most desirable disk page size. Finally, we present a heuristic for answering complex queries that specify values for multiple foreign keys.
7 citations
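The abstract names the cost model's inputs (seek time, rotational latency, reading speed, foreign-key lengths) without reproducing the formula. The Python sketch below is purely illustrative, assuming a simplified model in which every page holding qualifying rows costs one random I/O; the paper's actual formulation also depends on how the fact records are arranged on disk.

```python
import math

def expected_read_time(page_size, row_len, n_rows, key_cardinalities,
                       seek_ms=8.0, rot_latency_ms=4.2, transfer_mb_s=150.0):
    """Illustrative cost model (not the paper's exact formula): expected
    time to answer a query that fixes the value of one foreign key chosen
    uniformly at random. Assumes the qualifying rows for a key of
    cardinality c occupy n_rows / c rows, scattered so that each page
    holding qualifying rows costs one random I/O (seek + rotational
    latency) plus the sequential transfer of the page."""
    ms_per_byte = 1000.0 / (transfer_mb_s * 1024 * 1024)
    rows_per_page = max(1, page_size // row_len)
    total = 0.0
    for card in key_cardinalities:
        pages = math.ceil((n_rows / card) / rows_per_page)
        total += pages * (seek_ms + rot_latency_ms + page_size * ms_per_byte)
    return total / len(key_cardinalities)

def best_page_size(candidates, **kwargs):
    """Pick the candidate page size minimizing the expected cost above."""
    return min(candidates, key=lambda p: expected_read_time(p, **kwargs))

print(best_page_size([4096, 8192, 16384, 65536], row_len=64,
                     n_rows=10_000_000, key_cardinalities=[100, 1000, 5000]))
```

Sweeping candidate page sizes against such a model is the essence of the page-size selection step: larger pages amortize the seek over more rows but transfer more unneeded bytes per I/O.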
Cited by
TL;DR: Data fragmentation is formalised as a constrained optimisation problem, and the application of the particle swarm optimisation (PSO) technique to design an optimal fragmentation schema is proposed.
Abstract: Data fragmentation is one of the physical database design techniques that significantly improves data manageability, accessibility and query execution time. An optimal fragmentation schema is designed from the workload gathered during data exploitation. However, adapting this technique to data warehouses must take into account their specific characteristics, such as the complexity of OLAP queries and the dynamic nature of the data model and workload. In this paper, data fragmentation is formalised as a constrained optimisation problem, and we propose applying the particle swarm optimisation (PSO) technique to design an optimal fragmentation schema.
8 citations
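To make the PSO idea concrete, here is a minimal binary-PSO sketch. The schema encoding (one bit per candidate fragmentation predicate) and the toy cost function are assumptions for illustration, not the paper's formulation; the paper's constraint (for example, a bound on the number of fragments) would enter through the cost function or a repair step.

```python
import math
import random

def pso_fragmentation(n_predicates, cost, swarm=20, iters=100,
                      w=0.7, c1=1.5, c2=1.5):
    """Minimal binary-PSO sketch: a candidate schema is a 0/1 vector
    saying which selection predicates drive the fragmentation, and
    `cost` scores a schema (lower is better)."""
    sigmoid = lambda x: 1.0 / (1.0 + math.exp(-x))
    pos = [[random.randint(0, 1) for _ in range(n_predicates)]
           for _ in range(swarm)]
    vel = [[0.0] * n_predicates for _ in range(swarm)]
    pbest = [p[:] for p in pos]
    pbest_cost = [cost(p) for p in pos]
    g = min(range(swarm), key=lambda i: pbest_cost[i])
    gbest, gbest_cost = pbest[g][:], pbest_cost[g]
    for _ in range(iters):
        for i in range(swarm):
            for d in range(n_predicates):
                r1, r2 = random.random(), random.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                # Binary PSO: the velocity sets the probability of the bit.
                pos[i][d] = 1 if random.random() < sigmoid(vel[i][d]) else 0
            c = cost(pos[i])
            if c < pbest_cost[i]:
                pbest[i], pbest_cost[i] = pos[i][:], c
                if c < gbest_cost:
                    gbest, gbest_cost = pos[i][:], c
    return gbest, gbest_cost

# Toy cost: pretend predicates 1, 3 and 6 are useful and extra fragments hurt.
useful = {1, 3, 6}
print(pso_fragmentation(8, lambda s: -sum(s[i] for i in useful) + 0.4 * sum(s)))
```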
TL;DR: An approach based on the exploitation of recent statistical data-access information is proposed for dynamic data fragmentation in data warehouses, a technique that significantly improves data management, accessibility and query execution time.
Abstract: The large size of a data warehouse and the complexity of OLAP queries pose query performance challenges. Several techniques have been developed to reduce query response time. Data fragmentation significantly improves data management, accessibility and query execution time, and an optimal fragmentation schema is designed from the workload gathered during data exploitation. In the context of relational and object-oriented databases these techniques remain suitable, because the workload is almost stable. However, the specific characteristics of data warehouses, and more particularly the nature of OLAP queries, make the data model and workload very dynamic, rendering a statically designed fragmentation schema ineffective. To address this problem, we propose in this paper an approach based on the exploitation of recent statistical data-access information for dynamic data fragmentation in data warehouses.
7 citations
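A rough sketch of what "exploiting recent statistical data access" could look like in practice, under assumed design choices (a sliding window of predicate occurrences, a top-k hot set, and an overlap-based drift test) that are not taken from the paper:

```python
from collections import Counter, deque

class AccessMonitor:
    """Sketch of one possible design: keep a sliding window of the
    selection predicates used by recent queries and trigger
    re-fragmentation when the current schema's predicates no longer
    overlap enough with the hottest ones."""

    def __init__(self, window=1000, top_k=4, drift_threshold=0.5):
        self.window = deque(maxlen=window)   # recent predicate occurrences
        self.top_k = top_k                   # size of the "hot" predicate set
        self.drift_threshold = drift_threshold
        self.current_schema = set()          # predicates behind the current schema

    def record(self, predicates):
        self.window.extend(predicates)

    def should_refragment(self):
        hot = {p for p, _ in Counter(self.window).most_common(self.top_k)}
        if not self.current_schema:
            return True
        overlap = len(hot & self.current_schema) / len(self.current_schema)
        return overlap < self.drift_threshold

    def refragment(self):
        self.current_schema = {p for p, _ in
                               Counter(self.window).most_common(self.top_k)}
        return self.current_schema
```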
TL;DR: A model for optimizing the implementation of the shrink operation is proposed, covering two problem types; both are modelled as set partitioning problems with a side constraint, and the resulting methods are compared with the original greedy heuristic and a commercial general-purpose MIP solver.
Abstract: Pivot tables are one of the most popular tools for data visualization in both business and research applications. Although they are in general easy to use, their comprehensibility becomes progressively lower as the quantity of cells to be visualized increases (the information flooding problem). Pivot tables are largely adopted in OLAP, the main approach to multidimensional data analysis. To cope with the information flooding problem in OLAP, the shrink operation enables users to balance the size of query results against their approximation, exploiting the presence of multidimensional hierarchies. The only implementation of the shrink operator proposed in the literature is based on a greedy heuristic that, in many cases, is far from reaching a desired level of effectiveness. In this paper we propose a model for optimizing the implementation of the shrink operation which considers two possible problem types. The first type minimizes the loss of precision while ensuring that the resulting data do not exceed the maximum allowed size. The second minimizes the size of the resulting data while ensuring that the loss of precision does not exceed a given maximum value. We model both problems as set partitioning problems with a side constraint. To solve the models we propose a dual ascent procedure based on a Lagrangian pricing approach, a Lagrangian heuristic, and an exact method. Experimental results show the effectiveness of the proposed approaches, which are compared with both the original greedy heuristic and a commercial general-purpose MIP solver.
3 citations
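For intuition, the sketch below implements a toy greedy variant of the type-1 shrink problem (minimize precision loss subject to a maximum result size). It merges sibling cells under the same hierarchy parent, which is precisely the kind of heuristic the paper's set-partitioning models are designed to outperform; the loss measure is an assumption.

```python
import itertools

def greedy_shrink(groups, max_cells):
    """Toy greedy shrink, type-1 problem: minimize precision loss while
    keeping the result within max_cells cells. `groups` maps a hierarchy
    parent to its children's cell values; only siblings under the same
    parent may be merged, and the loss of a merge is the absolute spread
    of the merged values around their average (an assumed measure)."""
    parts = {parent: [[v] for v in vals] for parent, vals in groups.items()}
    size = sum(len(blocks) for blocks in parts.values())
    total_loss = 0.0
    while size > max_cells:
        best = None  # (loss, parent, i, j)
        for parent, blocks in parts.items():
            for i, j in itertools.combinations(range(len(blocks)), 2):
                merged = blocks[i] + blocks[j]
                avg = sum(merged) / len(merged)
                loss = sum(abs(v - avg) for v in merged)
                if best is None or loss < best[0]:
                    best = (loss, parent, i, j)
        if best is None:          # nothing left to merge
            break
        loss, parent, i, j = best
        parts[parent][i] += parts[parent][j]
        del parts[parent][j]
        total_loss += loss
        size -= 1
    return parts, total_loss

print(greedy_shrink({"EU": [10, 11, 30], "US": [5, 6]}, max_cells=3))
```

Because each merge is chosen myopically, such a heuristic can lock in poor early merges, which is the gap the exact set-partitioning formulation closes.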
23 Oct 2019
TL;DR: A horizontal data partitioning approach tailored to a large data warehouse queried by a large number of queries, which reduces both query response time and the number of fact-table partitions, the latter being the major drawback of existing partitioning techniques.
Abstract: Data partitioning is a well-known technique for optimizing the performance of decision-support queries. In this paper, we present a horizontal data partitioning approach tailored to a large data warehouse queried by a large number of queries. The idea behind our approach is to horizontally partition only the large fact table, based on partitioning predicates elected from the set of selection predicates used by the analytic queries. The election of partitioning predicates depends on their numbers of occurrences, their access frequencies, and their selectivities. With the Star Schema Benchmark under Oracle 12c, we demonstrate that our partitioning technique reduces both query response time and the number of fact partitions, the latter being the major drawback of existing partitioning techniques. We also show that our partitioning algorithm is around 66% faster than the primary and derived partitioning techniques based on the genetic algorithm.
2 citations
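The predicate-election step can be illustrated with a short sketch. The scoring function and the fragment-count bound below are assumptions, not the paper's exact criteria, and the SSB-style predicate names and statistics are made up for the example.

```python
def elect_partitioning_predicates(predicates, max_fragments):
    """Sketch of the election idea: rank selection predicates by how
    often they occur, how frequently the queries using them run, and
    how selective they are, then keep the top-ranked ones while the
    fragment count stays within a bound."""
    def score(p):
        # Favour common, frequently executed, highly selective predicates
        # (low selectivity value = few matching rows = more selective).
        return p["occurrences"] * p["access_frequency"] * (1.0 - p["selectivity"])

    elected, fragments = [], 1
    for p in sorted(predicates, key=score, reverse=True):
        # Each elected predicate splits every fragment in two (match / no match).
        if fragments * 2 > max_fragments:
            break
        elected.append(p["name"])
        fragments *= 2
    return elected

# Hypothetical SSB-style predicates, with made-up statistics.
preds = [
    {"name": "d_year = 1997",       "occurrences": 12, "access_frequency": 0.4, "selectivity": 0.14},
    {"name": "c_region = 'ASIA'",   "occurrences": 7,  "access_frequency": 0.3, "selectivity": 0.20},
    {"name": "p_brand = 'MFGR#12'", "occurrences": 3,  "access_frequency": 0.1, "selectivity": 0.04},
]
print(elect_partitioning_predicates(preds, max_fragments=4))
```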
01 Jan 2015
TL;DR: This article concentrates on the technique of data fragmentation (also known as partitioning), which is used in data warehousing to significantly improve data manageability, accessibility and query execution time.
Abstract: Data Warehousing (DW) and Online Analytical Processing (OLAP) are becoming critical components of decision support. Analytical queries usually identify business trends rather than individual values and are much more complex than transactional ones; processing them may take hours or days. To improve performance, several techniques have been developed, including materialized views (Chuan & Xin, 2001), indexes (Chaudhuri, 2004), data fragmentation (Bellatreche, 2005; Boukhalfa, 2009), and distributed and parallel processing (Furtado, 2004). This article concentrates on the technique of data fragmentation, also known as partitioning (we use both terms interchangeably). Partitioning tables, indexes and materialized views into fragments stored and accessed separately significantly improves data manageability, accessibility and query execution time. Thus, traditional fragmentation techniques, more particularly horizontal and vertical fragmentation developed for relational DBMSs, were applied to the data warehouse. These approaches are designed from a statistical analysis of the most frequent queries, based on both qualitative and quantitative information, so the algorithms used to design an optimal partitioning schema are static: their inputs are based on the workload gathered during data exploitation. If a change occurs in the inputs of these algorithms, they must be rerun to determine a new optimal fragmentation schema. Moreover, these algorithms rest on the clustering principle, which is a combinatorial problem and requires heuristic methods for its resolution. In the case of model evolution and/or changes in the workload, these algorithms therefore become very complicated, or unworkable.

In the context of relational and object-oriented databases, and in any environment (centralized, parallel, distributed), much of the literature has addressed this issue. Researchers have concentrated their work on data redistribution or fragment reallocation in the event of performance degradation, considering that the solution lies at the physical level, through strategies that balance processing load and data between nodes. The logical aspect, namely the design of the fragmentation schema itself, remains suitable in those settings because the workload is almost stable. Conversely, in data warehousing the data model and workload evolve dynamically, due more particularly to the specific characteristics of OLAP queries. An inappropriate, badly conceived fragmentation schema therefore has a considerable influence on the system's performance, particularly during the execution of expensive operations such as joins and multi-joins, which characterize decisional queries. Xinjian (2005) has clearly demonstrated, through theorems and lemmas, that the choice of partition keys and the arrangement of the records in the fact table have a great impact on OLAP query response times. For efficient use of the fragmentation technique in a data warehouse, it is not enough to analyze data access frequencies to choose an optimal fragmentation schema; that choice must also be made dynamic and adapted to the changing workload.

Hacène Derrar, USTHB, Algeria
2 citations