A Unified Approach to Multisource Data Analyses

Franck Ravat, Jiefu Song
01 Jan 2018, Vol. 162, Iss. 4, pp. 311-359

This is an author’s version published in: http://oatao.univ-toulouse.fr/22413
To cite this version: Ravat, Franck and Song, Jiefu. A Unified Approach to Multisource Data Analyses. (2018)
Fundamenta Informaticae, 162 (4). 311-359. ISSN 1875-8681
Official URL (DOI): https://doi.org/10.3233/FI-2018-1727

DOI 10.3233/FI-2018-1700
A Unified Approach to Multisource Data Analyses
Franck Ravat, Jiefu Song
IRIT - Université Toulouse I Capitole
2 Rue du Doyen Gabriel Marty, F-31042 Toulouse Cedex 09, France
ravat@irit.fr, song@irit.fr
Abstract. Classically, Data Warehouses (DWs) support business analyses on data coming from inside
an organization. Nevertheless, Linked Open Data (LOD) might sensibly complement these business analyses by
providing complementary perspectives during a decision-making process. In this paper, we propose a
conceptual modeling solution, named Unified Cube, which blends together multidimensional data from DWs
and LOD datasets without materializing them in a stationary repository. We complete the conceptual modeling
with an implementation framework which manages the relations between a Unified Cube and multiple data
sources at both schema and instance levels. We also propose an analysis process which queries
different sources in a way that is transparent to decision-makers. The practical value of our proposal is
illustrated through real-world data and benchmarks.
Keywords: Data Warehouse, Linked Open Data, Conceptual Modeling, Multisource analyses,
Experimental Assessments
1. Introduction
Well-informed and effective decision-making relies on appropriate data for business analyses. Data
are considered appropriate if they include enough information to provide an overall perspective to
decision-makers. To obtain as many appropriate data as possible, decision-makers must have access
to the company’s business data at any time. Since the 1990s, Business Intelligence (BI) has been
providing methods, techniques and tools to collect, extract and analyze business data stored in a Data
Warehouse (DW) [9]. However, an overall perspective during decision-making requires not only business
data from inside a company but also other data from outside a company. In today’s constantly
evolving business context, one promising approach consists of blending web data with warehoused
data [32]. The concept of BI 2.0 is introduced to envision a new generation of BI enhanced by web-based
content [39].
Among the various kinds of web-based content, Linked Open Data (LOD)¹ provide a set of interconnected
and machine-readable data to enhance business analyses on a web scale [45]. Since data are now produced
and updated at high speed, a stationary repository that materializes all data related to analyses
(e.g., warehoused data and LOD) can hardly be kept synchronized with changes in the data sources. It is
therefore necessary to unify data from various sources without integrating all data into a stationary
repository. To support up-to-date decision-making, business dashboards must be created in an on-demand
manner. Such dashboards should include all appropriate data required by decision-makers.
Case Study. In a government organization managing social housing, internal data are periodically
extracted, transformed and loaded into a DW. As shown in figure 1(a), the DW describes the number of
applications (i.e. Applications) according to two analysis axes: one about the geographical location of
social housing (i.e. Housing Ward and Housing District) and the other related to the applicant’s profile
(i.e. Applicant Status). This DW gives only a partial view of the demand for social housing. To
support effective decision-making, additional information should be included in analyses. Therefore,
a decision-maker browses a second dataset, named LOD1, to obtain complementary views on social
housing allocation. Published by the UK Department for Communities and Local Government², LOD1
describes the accepted applications for social housing (i.e. acceptance) according to district
and status (cf. figure 1(b)). LOD1 follows a multidimensional structure expressed in the RDF Data Cube
Vocabulary (QB)³. The QB format only allows including one granularity in each analysis axis. The
decision-maker needs new analysis possibilities to aggregate data based on multiple granularities. To
discover more geographical granularities, the decision-maker looks into another dataset named LOD2.
This dataset is managed by the Office for National Statistics of the UK⁴; it associates several areas
(including districts) with one corresponding region (cf. figure 1(c)). Both LOD1 and LOD2 are real-world
LOD datasets which can be accessed through querying endpoints⁵ ⁶.

Figure 1: An extract of data in a DW and two LOD datasets

¹ http://linkeddata.org
² http://opendatacommunities.org/data/housingmarket/core/tenancies/econstatus

The above-mentioned warehoused data and LOD share some similar multidimensional features, as
they are organized according to analysis subjects and analysis axes. However, analyzing data scattered
in several sources is difficult without a unified data representation. During analyses, decision-makers
must search for useful information in several sources. The efficiency of such analyses is low, since
different sources may follow different schemas and contain different data instances. Facing these
issues, the decision-maker needs a business-oriented view unifying data from both the DW and the
LOD datasets. She/he makes the following requests regarding the view:
• An analysis subject should include all related numeric indicators from different sources, even
  though these indicators cannot be aggregated according to the same analytical granularities. To
  support real-time analyses, numeric indicators (e.g. Applications from the DW, Acceptances
  from the LOD1 dataset) and their descriptive attributes (e.g. Housing Ward, Housing District
  and Applicant Status from the DW, District and Status from the LOD1 dataset) at different
  analytical granularities should be queried on-the-fly from sources;
• Analytical granularities related to the same analysis axis should be grouped together. For
  instance, the Housing Ward and Housing District granularities from the DW, the District
  granularity from the LOD1 dataset, and the Area and Region granularities from the LOD2 dataset
  should be merged into one analysis axis;
• Attributes describing the same analytical granularity should be grouped together. The correlative
  relationships between instances of these attributes should be managed. For instance, the
  attribute Housing District from the DW, the attribute District from the LOD1 dataset and the
  attribute Area from the LOD2 dataset should all be included in one analytical granularity related
  to districts. Correlative instances Birmingham from the DW, Birmingham E08000025 from
  the LOD1 dataset and Birmingham xsd:string from the LOD2 dataset should be associated
  together, since they all refer to the same district;
• Summarizable analytical granularities should be indicated for each numeric indicator. For
  instance, only the measure Applications from the DW can be aggregated according to the Ward
  analytical granularity. The other measure, Acceptances from the LOD1 dataset, is only
  summarizable starting from the district analytical granularity on the geographical analysis axis.
³ http://www.w3.org/TR/vocab-data-cube
⁴ https://www.ons.gov.uk/
⁵ http://opendatacommunities.org/sparql
⁶ http://statistics.data.gov.uk/sparql
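
To make the four requests above more concrete, the following minimal Python sketch shows one possible way to record them as a data structure. It is only an illustration, not the Unified Cube model defined later in the paper: the class names (Attribute, Level, Measure, Axis), the field names and the helper methods are all hypothetical. The sketch also assumes, as a simplification, that a measure summarizable at one granularity is summarizable at every coarser granularity of the same axis.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Attribute:
    source: str  # hypothetical source tag, e.g. "DW", "LOD1", "LOD2"
    name: str    # attribute name as it appears in that source


@dataclass
class Level:
    """One analytical granularity grouping attributes from several sources."""
    name: str
    attributes: List[Attribute]
    # Each dict links the instances (one per source) that denote the same entity.
    correlations: List[Dict[str, str]] = field(default_factory=list)

    def correlated(self, source: str, instance: str) -> Dict[str, str]:
        """Return the matching instances found in the other sources."""
        for group in self.correlations:
            if group.get(source) == instance:
                return {s: i for s, i in group.items() if s != source}
        return {}


@dataclass(frozen=True)
class Measure:
    source: str
    name: str
    finest_level: str  # finest granularity at which the measure is summarizable


@dataclass
class Axis:
    name: str
    levels: List[Level]  # ordered from finest to coarsest

    def summarizable_levels(self, measure: Measure) -> List[str]:
        # Assumption: summarizable from finest_level upward along the axis.
        names = [level.name for level in self.levels]
        return names[names.index(measure.finest_level):]


# Geographical axis of the case study, merging the DW, LOD1 and LOD2 granularities.
geography = Axis("Geography", [
    Level("Ward", [Attribute("DW", "Housing Ward")]),
    Level(
        "District",
        [Attribute("DW", "Housing District"),
         Attribute("LOD1", "District"),
         Attribute("LOD2", "Area")],
        correlations=[{"DW": "Birmingham",
                       "LOD1": "Birmingham E08000025",
                       "LOD2": "Birmingham xsd:string"}],
    ),
    Level("Region", [Attribute("LOD2", "Region")]),
])

applications = Measure("DW", "Applications", finest_level="Ward")
acceptances = Measure("LOD1", "Acceptances", finest_level="District")

print(geography.summarizable_levels(applications))
print(geography.summarizable_levels(acceptances))
print(geography.levels[1].correlated("DW", "Birmingham"))
```

In this sketch the axis groups granularities, each granularity groups attributes and their correlated instances, and each measure declares where its summarizable granularities begin, which mirrors the second, third and fourth requests respectively.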

Contribution. Our aim is to make full use of as much information as possible to support effective
and well-informed decisions. To this end, we propose a unified view of data from both DWs and
LOD datasets. At the schema level, the unified view should include in a single schema all information
about an analysis subject described by all available analysis axes as well as all granularities (coming
from multiple sources). At the instance level, the unified view should not materialize data that can
be directly queried from the source. Nevertheless, it should manage the correlation relations between
related attribute instances referring to the same real-world entity. With the help of the unified view, a
decision-maker can easily obtain an overall perspective of an analysis subject. In the previous example,
a unified view would enable decision-makers to analyze on-the-fly the number of applications and
acceptances according to applicant’s status and district as well as region (cf. figure 1(d)).
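
The idea of analyzing both indicators on-the-fly, without materializing source data, can be sketched in Python as follows. This is a toy illustration under stated assumptions rather than the analysis process proposed in the paper: the three fetch_* functions are hypothetical stand-ins for queries sent to the DW and to the two LOD endpoints, "District B", "District C", "Region Z" and their codes are invented placeholders, and all numeric values are made up. Only the correlation table is stored by the unified view; measure values are read from the sources at analysis time.

```python
from collections import defaultdict

# Instance-level correlations kept by the unified view:
# (source, source-specific label) -> shared district key.
CORRELATIONS = {
    ("DW", "Birmingham"): "birmingham",
    ("LOD1", "Birmingham E08000025"): "birmingham",
    ("LOD2", "Birmingham xsd:string"): "birmingham",
    ("DW", "District B"): "district-b",            # hypothetical district
    ("LOD1", "District B X00000001"): "district-b",
    ("LOD2", "District B xsd:string"): "district-b",
    ("DW", "District C"): "district-c",            # hypothetical district
    ("LOD1", "District C X00000002"): "district-c",
    ("LOD2", "District C xsd:string"): "district-c",
}


def fetch_dw():
    """Stand-in for a DW query: (Housing District, Applications). Toy values."""
    return [("Birmingham", 120), ("District B", 80), ("District C", 50)]


def fetch_lod1():
    """Stand-in for a LOD1 query: (District, Acceptances). Toy values."""
    return [("Birmingham E08000025", 30),
            ("District B X00000001", 20),
            ("District C X00000002", 10)]


def fetch_lod2():
    """Stand-in for a LOD2 query: (Area, Region). Regions B and C are invented."""
    return [("Birmingham xsd:string", "West Midlands"),
            ("District B xsd:string", "West Midlands"),
            ("District C xsd:string", "Region Z")]


def analyze_by_region():
    """Roll both measures up to the Region granularity at query time."""
    region_of = {CORRELATIONS[("LOD2", area)]: region
                 for area, region in fetch_lod2()}
    result = defaultdict(lambda: {"Applications": 0, "Acceptances": 0})
    for district, count in fetch_dw():
        result[region_of[CORRELATIONS[("DW", district)]]]["Applications"] += count
    for district, count in fetch_lod1():
        result[region_of[CORRELATIONS[("LOD1", district)]]]["Acceptances"] += count
    return dict(result)


for region, values in analyze_by_region().items():
    print(region, values)
```

The design choice illustrated here is that the unified view holds only the links between instances; replacing the stand-in functions with real queries would leave the roll-up logic unchanged.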
In this paper, we describe a generic modeling solution, named Unified Cube, which provides
a business-oriented view unifying both warehoused data and LOD. Section 2 presents different approaches
to unifying data from DWs and LOD datasets. Section 3 describes the conceptual modeling
and graphical notation of Unified Cubes. Section 4 presents an implementation framework for Unified
Cubes. Section 5 shows how analyses are carried out on a Unified Cube in a user-friendly manner.
Section 6 illustrates the feasibility and the efficiency of our proposal through some experimental
assessments.
2. Related work
Disparate data silos from different sources make decision-making difficult and tedious [43]. To provide
decision-makers with an overall perspective during business analyses, an effective data integration
strategy is needed. In accordance with our research context, we focus on work related to the integration
of multidimensional data from DWs and LOD datasets. We classify existing research into three
categories.
The first category is named ETL-based. With the arrival of LOD, the BI community intuitively
treated LOD as external data sources that should be integrated into a DW through ETL processes [15, 29,
36]. The resulting multidimensional DW is used as a centralized repository of LOD [38, 6]. Decision-makers
can use classical DW analysis tools to analyze LOD stored in such DWs. However, existing
ETL techniques tend to populate a DW with LOD rather than update the LOD already stored in a
DW. No effective technique has been proposed to guarantee the freshness of warehoused LOD presented to
decision-makers during analyses. One promising avenue is to extend on-demand ETL processes [4]
so that LOD are integrated into a DW at the right time during business analyses. Otherwise, current
ETL-based approaches are not suitable in today’s highly dynamic context where large amounts of data
are constantly published and updated; they conflict with the distributed nature and the high volatility
of LOD [24, 17].
The second category is named semantic web modeling. Since multidimensional models have been
proven successful in supporting complex business analyses [35], the LOD community introduces new
modeling vocabularies to semantically describe the multidimensionality of LOD through RDF triples.
Among the proposed modeling vocabularies, the RDF Data Cube Vocabulary⁷ (QB) is the current W3C

⁷ http://www.w3.org/TR/vocab-data-cube
