scispace - formally typeset
Open AccessProceedings ArticleDOI

ROLAP based data warehouse schema to XML schema conversion

TLDR
This research work generates equivalent XML schema from the existing data warehouse schema for an organization which does not has the XML platform to manage the web data.
Abstract
Data Warehouse is one of the powerful tools for analytical processing. XML on the other hand is widely used to handle data in web environment. XML to data warehouse integration is a subject of interest for the business organization to use the semi-structured XML for analytical processing. However, in this research work we approached the problem in reverse direction. Here we generate equivalent XML schema from the existing data warehouse schema for an organization which does not has the XML platform to manage the web data. The proposed reverse engineering framework uses one of the existing methodologies of converting the XML schema to data warehouse schema. However, we have applied it in a reverse approach. Moreover we have established a formalism to prove the soundness and correctness of both the conversion mechanisms.

read more

Content maybe subject to copyright    Report

ROLAP Based Data Warehouse Schema to XML
Schema Conversion
Authors Name/s per 1st Affiliation (Author)
line 1 (of Affiliation): dept. name of organization
line 2-name of organization, acronyms a
Authors Name/s per 2nd Affiliation
Abstract—Data Warehouse is one of the powerful tools for
analytical processing. XML on the other hand is widely used to
handle data in web environment. XML to data warehouse
integration is a subject of interest for the business organization to
use the semi-structured XML for analytical processing. However,
in this research work we approached the problem in reverse
direction. Here we generate equivalent XML schema from the
existing data warehouse schema for an organization which does
not has the XML platform to manage the web data. The proposed
reverse engineering framework uses one of the existing
methodologies of converting the XML schema to data warehouse
schema. However, we have applied it in a reverse approach.
Moreover we have established a formalism to prove the
soundness and correctness of both the conversion mechanisms.
Keywords—Data Warehouse Schema; ROLAP; XML Schema;
Schema Graph; Formalism; Reverse Engineering
I.
I
NTRODUCTION
A data warehouse is a subject-oriented, integrated, non-
volatile, time-variant representation to organize data for
analytical processing. The integrated property of data
warehouse emphasize its capability to work with
heterogeneous data sources. The most widely used data
warehouse is called ROLAP (Relational Online Analytical
Processing), where analytical data is represented in relational
format. Therefore the different types of source data of OLTP
(Online Transactional Processing) are converted to ROLAP to
perform the analytical processing on heterogeneous data in a
unified approach. There are numbers of way to represent
OLTP data. In this research work we focused on XML
(Extended Mark-up Language) as in web environment XML is
the most important language to represent transactional data.
Moreover XML is semi-structured in nature where as
relational model is structured. Hence this conversion has
additional challenge to work with two different types of data
model.
XML data is represented in terms of XML DTD or XML
schema. However XML DTD has several limitations. XML
DTD did not fully support user requirements. Moreover
neither the DTD has built-in data types nor supports user-
derived data types. Along with this DTD allows only limited
control over cardinality. XML Schema has been designed to
provide a robust mechanism to define XML document
structure and limitations. XML Schemas are capable to
represent XML documents. They reference the XML schema
namespace and even have their own DTD process. As XML
schema has several advantages over XML DTD the research
interest is more on XML schema.
XML schema design generally follows 2 types namely
Russian Doll Design [1] and Salami Slice Design [1]. The
Russian Doll design has a single global element that nests
local elements. The Salami design corresponds to having all of
the elements defined within the global namespace and then
referencing the elements. Russian Doll design has more use in
the Industry and therefore in the research work also this
schema structure gets more priority.
In the next section we would discuss on different
approaches of XML to data warehouse conversion method.
However in this research work we view the problem from
reverse side, where data warehouse schema is converted to
XML schema. This seems to us as a potential problem because
numbers of organization are deploying their existing business
in web environment. In some cases they already have data
warehouse for analytical processing which is build based on
other OLTP languages except XML. Thus the reverse
engineering solution that we propose and validate in this work
bears high relevance in the state of the art context. Majority of
the organizations use Relational OLAP (ROLAP) to
implement their data warehouses. Therefore to incorporate
web based environment in the existing implementation it
could be an intelligent decision to use database structure in
web which is equivalent to the ongoing system.
II. R
ELATED
W
ORK
In this section at first we discuss about different existing
approaches of converting data warehouse schemas from the
XML sources. Among this we would choose one of the
efficient methods and then approach with that method in
reverse way. We would also incorporate formalism on these
methods to proof the correctness and soundness and to
establish the reverse engineering.
Data warehouse schema conversion is performed on both
XML DTD and XML schema. At first we discuss few
methods based on XML DTD. In [2], algorithms were
proposed to automatically construct UML diagrams from
XML data and the application of the diagrams on the
conceptual design of (virtual) data warehouses were based on
web data. The UML diagram has been chosen here because
UML is a standardized conceptual data modeling language
Soumya Sen Agostino Cortesi Nabendu Chaki
A. K. Choudhury School of Computer Science Department of Computer
Information Technology Department Science & Engineering
University of Calcutta Ca Foscari University University of Calcutta
iamsoumyasen@gmail.com cortesi@unive.it nabendu@ieee.org
l-))) 

SE
Level -1 Level -2
Level -3 Level -4
CE
HE
and is powerful enough to express a document described by a
DTD. A semi-automatic approach for conceptual designing of
a data mart from XML DTD was described in [3]. It [3]
explained how the semi-structured nature of the source
increases the level of uncertainty on the structure of data, thus
requiring access to the source documents and need to ask the
designer to find out one-to-one or many-to-one relationships.
However the sources were constrained by a DTD using sub-
elements. In the previous section we have discussed about
other limitations of XML DTD. XML schema overcomes the
shortcomings of XML DTD. Now we discuss some
approaches based on XML schema. XML schema conversion
to OLAP cube by identifying fact table and dimension tables
has been showed in [4]. OLAP cube formation using XML
source is an important area of research. Conceptual designing
based on dispersed XML documents has been done to form
both XML warehouse and XML marts [5]. Another research
work on this multi-dimensional model based on XML
database has been carried out specifically for multimedia data
[6]. As the size of multimedia database is usually huge the
work [6] is significant for handling high volume data. A
generic work on XML schema shows how to convert the
contents of the XML schema to multiple schemas of the multi
dimensional model [7]. Further the work has been extended to
design multiple cubes [8] of multidimensional model from
XML schema. A semi-automatic approach [9] was proposed
for XML data warehouse design starting from XML schemas
as data sources. It generates numbers of UML class diagram
from XML schema and then the numbers of classes are
reduced using a set of rules. Finally a multi-dimensional (MD)
element extraction algorithm [9] is used to automatically
identify facts, measures and their corresponding dimensions.
An automatic approach for designing the logical schema for a
data mart starting from the XML schema describing XML
sources using UML and QVT transformation language was
described in [10]. It [10] showed a simplification process and
a set of rules that applies successive transformations to create
the star schema. All of these generated schemas are converted
to star schema only. In order to address the other schemas, a
formalization method to model star and snowflake schema
within XML schema based on attribute tree was proposed and
termed as X-Warehousing [11]. It merges users analysis
objectives represented through XML schema with XML data
sources. A secure data warehouse [12] was proposed on XML
schemas by focusing on the security issues relevant to XML
schemas. In another research work [13] XML schema to data
warehouse schema has been done at first by converting the
XML schema to ER-diagram. In the next phase ER-diagram
has been converted to ROLAP based data warehouse schema.
As ER diagram is generated we could easily convert this to
relational model also. The main significance of the work [13]
is it supports both OLTP (through ER- diagram and relational
model) and OLAP (through data warehouse schema).
However in this research work only star schema and
snowflake schema are identified. This limitation has been
sorted out in [14], where the fact constellation is also
identified. The proposed methodology in this paper is capable
of accepting multiple related XML schemas. The XML
schemas of [14] follow Russian doll design. The given XML
schema(s) is converted to a data structure named as Schema
Graph. In the next phase Schema Graph is converted to data
warehouse schema.
We choose the method of [14] as it can work with multiple
XML schemas and supports star schema, snowflake schema
and fact constellation. In the next section we briefly describe
the proposed framework of [14] and then we propose the
reverse methodology based on [14] to generate XML schema
from the existing data warehouse schema.
III. A
B
RIEF
O
VERVIEW OF THE FRAMEWORK IN
[14]
The proposed framework of [14] accepts more than one
related XML schemas. The proposed algorithm [14] has two
phases. At the first phase XML schemas are converted to a
new data structure named as Schema Graph. Once the Schema
Graph is constructed, then in the next phase data warehouse
schemas are generated and the type of the schema is identified.
Schema Graph is a level wise separable graph. Every
entity of XML schema acts as a vertex in Schema Graph and
the name of the vertex is same as the entity name in the XML
schema. The entities that appear in the Schema Graph are
classified into three types.
A. Holder Element (HE): These elements that have no
predecessor in the Schema Graph. Holder Elements are placed
at the Level-1 of the graph.
B. Contained Element (CE): These elements are directly
connected to the HEs and are called Contained Elements.
Contained Elements are placed at the Level-2 of the graph.
C. Secondary Elements (SE): The elements that are
directly connected to the CEs are called Secondary Elements.
They are placed at the Level-3 of the graph. If elements in the
graph appear as connected to SE, they would be placed in
level-4. The new vertices that would be connected to the
vertices of level-4 would be placed in level-5 and so on.
Subsequently new level could be created if the new entities
appear in the graph connected to the previous level. All the
elements beyond level-3 are termed as Secondary Elements.
A generic structure of Schema Graph is shown in Fig. 1.
Fig. 1: Schema Graph along with HE, CE and SE
Once the Schema Graph is constructed fact table and
dimension tables are identified. If some of the entities do not


have sufficient attributes to form the primary key a key
attribute is added to those entities. This is necessary as the
ROLAP implementation which is based on relational model
requires primary key for each table.
In the proposed methodology of [14] each HE corresponds
to a fact table and makes an entry in the fact table, the key
attribute of the CEs that are connected to the HE are placed in
the corresponding fact table for that HE. CEs appear as the
dimension tables. If SEs are found connected with CE the
primary keys of SEs are placed in CE. If SEs are present even
after level-3, primary keys of the higher level are placed in the
table corresponding to the SE of immediate lower level. After
this the type of data warehouse schema is identified. A data
warehouse schema is identified as star schema if the schema
graph consists of HE and CEs only. Snowflake schema is
identified if the schema graph consists of HE, CEs and SEs. If
there is more than one data warehouse schema the framework
checks whether fact constellation is present or not. If it is
found atleast one dimension table is shared by more than one
fact tables then the overall data warehouse schema is marked
as fact constellation.
IV. P
ROPOSED METHODOLOGY TO GENERATE XML SCHEMA
FROM DATA WAREHOUSE SCHEMA
In this section we introduce the framework to generate
XML schema from the given data warehouse schema based on
the proposed methodology of [14] but in the reverse way.
Once an organization decides to convert the data
warehouse schema to XML schema they need to decide which
dimension table to act as the root element of the XML schema.
They have to select one of the dimension tables from those
which are directly connected to the fact table. Otherwise
system would randomly choose one of the dimension tables
from those which are directly connected to the fact table. This
dimension table is named as First Dimension Table (FDT).
FDT would appear as HE in Schema Graph. Rest of the
dimension tables those are directly connected with fact table
are categorized as Connected Dimension Table (CDT). CDTs
would correspond to CE in the Schema Graph. Other
dimension tables which are not connected to the fact table are
categorized as Secondary Dimension Table (SDT). SDTs
would correspond to SE in the Schema Graph.
In this research work we also establish the correctness of
the method described in [14]. Hence we proof that once an
XML schema has been converted to data warehouse schema
using the reverse methodology we would get back the original
schema. However in some cases newly generated XML
schema may differs from the old one. As primary keys has
been added for those entities which did not have sufficient
attributes to form the primary key. In these cases we can claim
that our proposed methodology helps to re-generate better
XML schema which is more structured than the original. In
this case we have the knowledge of HE, CE and SE. Thus the
dimension tables correspond to HE, CE and SE are
categorized as FDT, CDT and SDT respectively.
A. Methodology to Construct Schema Graph from Data
Warehouse Schema
Here we form Schema Graph. As Schema Graph is a level
wise separable graph HEs are placed at the most left and
labelled as level-1. CEs are placed at right to respective HEs
and labelled as level-2. CEs are also connected to the
respective HE. Now SDTs are placed in Schema Graph level
wise from level-3 onwards. The attributes corresponding to
each entity of the Schema Graph are connected to the
respective entities.
If there is more than one fact table the above process is
repeated for each of them.
Algorithm:
Step 1: N = Numbers of fact table in the system
Step 2: FOR J = 1 to N repeat the following steps
Step 3: Find out the First Dimension Table (FDT) for each J.
FDT is either given by the user or already known if the
data warehouse schema has been constructed from some XML
schema. If the user does not specify it then the system
randomly chose any one of the dimension table among those
which are directly connected to fact table.
Step 4: FDT corresponds to the first level elements of the
Schema Graph. These are the Header Element (HE) of the
Schema Graph and placed at the level-1 of the Schema Graph
Step 5: The attributes corresponding to each FDT are also
connected as attributes to the respective HE of the Schema
Graph.
Step 6: The dimension tables except the FDT which are
connected to the fact table are termed as Connected
Dimension Table (CDT). CDTs appear as the level-2 elements
of the Schema Graph. These are the Contained Element (CE)
of the Schema Graph. CEs are connected with the respective
HE in Schema Graph.
Step 7: The attributes corresponding to each CDT are also
connected as attributes to the respective CE of the Schema
Graph.
Step 8: Other dimension tables which are neither FDT nor
CDT are termed as Secondary Dimension Table (SDT).
Step 9: Dimension tables connected with FDT and CDTs
appear as the level-3 elements of the Schema Graph. These are
the Secondary Element (SE) of the Schema Graph. SEs are
connected with the respective CE in the Schema Graph.
Step10: The attributes corresponding to each dimension table
at this level are also connected as attributes to the respective
SE of the Schema Graph.
Step 11: IF there are further Secondary Dimension Tables
(SDT) in the schema THEN
a) I=3
b) Repeat Steps 12 to 14 until all the dimension tables are
not included in schema graph
Step 12: IF there are SDTs which are connected with the
dimension tables correspond to the I
th
level of schema graph
THEN
Place the SDTs of (I+1)
th
level in the Schema Graph and
connect with the elements at I
th
level. These new elements are
also called Secondary Element (SE) in Schema Graph.


Step 13: The attributes corresponding to each dimension table
at this level are also connected as attributes to the respective
SE of the schema graph.
ENDIF /*Corresponding to IF of Step 12*/
Step 14: I=I+1
End of Repeat /*Corresponding to Step 11 b) */
ENDIF /*Corresponding to IF of Step 11*/
Step 15: ENDFOR /*Corresponding to Step 2*/
B. XML Schema Generation from Schema Graph
After getting the Schema Graph, we head forward to the
last step of generating the XML schema. As we are dealing
with Russian Doll types of XML schema we use the concept
of the nesting of elements.
We denote the HE as the root of the XML schema. CEs are
nested under the root element separately in the XML schema.
SEs corresponding to the level-3 of Schema Graph is nested
under respective CE. If there are further levels of SEs they are
nested under their predecessor level of SE of Schema Graph in
the XML schema.
Algorithm:
Step 1: Repeat FOR every element at level-1 or Header
element (HE)
Step 2: Each HE corresponds to root element of XML schema
Step 3: All the elements at level-2 or Contained Element (CE)
of the schema graph connected with the particular HE is
nested under the root element separately.
Step 4: FOR each element of level-2 of the schema graph find
the elements connected at level-3 or Secondary Element (SE)
and nest them under the element correspond to level-2.
I=3
Step 5: Repeat till all the levels are traversed
a) Select the elements at (I+1)
th
level in Schema Graph which
are connected with the I
th
level elements in Schema Graph.
b) The selected elements of previous step is nested under I
th
level elements in XML schema
c) I=I+1
End of Repeat
ENDFOR /*Corresponds to FOR of Step-4*/
Step 6: ENDFOR
The type of each attribute is obtained from data warehouse
schema definition.
V. I
LLUSTRATION WITH EXAMPLE
In this section, we present an example to describe the
execution of our methodology. We are starting with a given
data warehouse schema as shown in Fig. 2. The given schema
consists of single fact table named Flightorder_fact. The
measure is given as No. of Tickets.
Here we explain the stepwise execution of the algorithm of
sub-section-(IV.A) to construct the Schema Graph.
Step 1: N=1 as the schema has one fact table
Step 2: The following steps are going to be executed only once
Step 3: We take FlightOrder as First Dimension Table (FDT)
as this dimension table has the same name as fact table.
Fig. 2: A Data Warehouse Schema
Step 4: FlightOrder is going to be the HE of the Schema
Graph.
Step 5: The attributes of FlightOrder are connected with the
HE in Schema Graph.
Step 6: All other dimension tables except FDT that are directly
connected with fact table are termed as Connected Dimension
Table (CDT). In this example CDTs are Item, Flight_to and
Flight_from. All these CDTs are placed in level-2 of Schema
Graph and denotes as CE.
Step 7: The attributes of Item, Flight_to and Flight_from are
now added with these CEs in the Schema Graph.
Step 8: Other dimension tables are termed as SDTs. Here
Secondary Dimension Table (SDT) is Supplier.
Step 9: Supplier is placed at the level-3 of the Schema Graph.
Supplier acts as SE in the Schema Graph and also connected
with the CE Item in the previous level.
Step 10: The attributes of Item are connected with it in the
Schema Graph.
Step 11 to 14: These steps are not executed as there is no
further SDT in the Schema Graph.
Step 15: End of Algorithm Execution
The output of the above execution is shown in Fig. 3.
Element FlightOrder is at level-1, elements Item, Flight_to
and Flight_from are at level-2 and finally the element Supplier
is at level-2 are shown in Schema Graph.
Finally the XML schema is build by applying the
algorithm of sub-section-(IV.B) on the data warehouse schema
of Fig. 3.
item_id
Item
supplier_id
name
title
No. of Tickets
flight_to_id
flight_from_id
odr_id
FlightOrder_
Fact
item_id
address
name
flight_from_id
Flight_from
name
address
flight_to_id
Flight_to
odr_id
FlightOrder
odr_person
supplier_id
Supplier
name


The XML schema is given below.
<xsd:elementname="FlightOrder">
<xsd:complexType>
<xsd:sequence>
<xsd:elementname="odr_id" type="xs:string"
use=”required”>
<xsd:elementname="Flight_from"
type="FlightfromType">
<xsd:sequence>
<xsd:elementname="flight_from_id"
type="xsd:string" use="required"/>
<xsd:elementname="name" type="xsd:string"
use="required"/>
<xsd:elementname="addr" type="xsd:string"
use="required"/>
</xsd:sequence>
<xsd:elementname="Flight_to"
type="FlighttoType">
<xsd:sequence>
<xsd:elementname="flight_to_id"
type="xsd:string" use="required"/>
<xsd:elementname="name" type="xsd:string"
use="required"/>
<xsd:elementname="addr" type="xsd:string"
use="required"/>
</xsd:sequence>
<xsd:elementname="Item" type="ItemType">
<xsd:sequence>
<xsd:elementname="title" type="xsd:string"
use="required"/>
<xsd:elementname="name" type="xsd:string"
use="required"/>
<xsd:elementname="Supplier"
type="SupplierType" use="required"/>
</xsd:sequence>
<xsd:complexTypename="SupplierType">
<xsd:sequence>
<xsd:elementname="name" type="xsd:string"
use="required"/>
<xsd:elementname="supplier_id"
type="xsd:string" use="required"/>
</xsd:sequence>
<xsd:attributename="odr_person" type="xsd:string"
use="required"/>
</xsd:sequence>
</xsd:complextype>
</xsd:element>
VI. F
ORMALISM ON
XML
SCHEMA TO AND FROM
D
ATA
W
AREHOUSE
S
CHEMA
C
ONVERSION
In this section we are going to proof that once an XML
schema is converted to the data warehouse schema using the
method of [14] and when we get back the XML schema from
the converted data warehouse schema applying the
methodology of this paper they are equivalent to each other. In
fact, the re-generated XML schema is often better than the
original XML schema in terms of structure. This is because,
the primary keys are added during XML to data warehouse
conversion to those XML elements not having sufficient
attributes to form the primary key. From this point onwards if
we continue the conversion mechanism in both ways the result
would be same. It is depicted in Fig. 4.
Fig. 4: XML schema to/from data warehouse schema
conversion
Flight_to
add
r
Flight_fro
m
add
r
t
itle
Item
odr_id
FlightOrder
name
name
Supplier
supplier_id
name
odr_person
flight_from_id
flight_to_id
item_id
name
Fig. 3: Schema Graph corresponding to the data warehouse schema of Fig. 2
f
g
XML Schema’
DW Schema
g
f
XML Schema
DW Schema


Citations
More filters
Journal ArticleDOI

A Survey on Industrial Information Integration 2016–2019

TL;DR: Industrial information integration engineering is a set of foundational concepts and techniques that facilitate the industrial information integration process and in recent years, many applicat...
Journal ArticleDOI

Skyline computation over multiple points and dimensions

TL;DR: In this paper, the capability of skyline is extended to work with multiple dimensions and to search the multiple interesting points from the given search space. But, they restrict the computational complexity within a fixed upper bound.
References
More filters
Proceedings ArticleDOI

Conceptual design of data warehouses from E/R schemes

TL;DR: A graphical conceptual model for data warehouses, called Dimensional Fact model, is presented and a semi-automated methodology to build it from the pre-existing entity/relationship schemes describing a database is proposed.
Proceedings ArticleDOI

Data warehouse design from XML sources

TL;DR: This paper shows how the design of a data mart can be carried out starting directly from an XML source, and proposes a semi-automatic approach for building the conceptual schema for a dataMart starting from the XML sources.
Journal ArticleDOI

Specifying OLAP cubes on XML data

TL;DR: This paper presents an approach to specification of OLAP DBs based on web data, using Unified Modeling Language (UML) as a basis for so-called UML snowflake diagrams that precisely capture the multidimensional structure of the data.
Book ChapterDOI

X-warehousing: an XML-based approach for warehousing complex data

TL;DR: An XML-based methodology, named X-Warehousing, which designs warehouses at a logical level, and populates them with XML documents at a physical level, which represents the logical model of a data warehouse and populate the physical model of the data warehouse, called the XML cube.

Converting XML Data To UML Diagrams For Conceptual Data Integration.

TL;DR: In this article, the authors present algorithms for automatically constructing UML diagrams from XML DTDs, enabling fast and easy graphical browsing of XML data sources on the web, and how the generated diagrams can be used for the conceptual design of data warehouses based on web data.
Related Papers (5)