A Unified Approach to Multisource Data Analyses

doi:10.3233/FI-2018-1727

This is an author’s version published in:

http://oatao.univ-toulouse.fr/22413

To cite this version: Ravat, Franck and Song, Jiefu. A Unified Approach to Multisource Data Analyses. (2018) Fundamenta Informaticae, 162 (4). 311-359. ISSN 1875-8681

Official URL

DOI : https://doi.org/10.3233/FI-2018-1727

DOI 10.3233/FI-2018-1700

A Unified Approach to Multisource Data Analyses

Franck Ravat, Jiefu Song∗

IRIT - Université Toulouse I Capitole

2 Rue du Doyen Gabriel Marty, F-31042 Toulouse Cedex 09, France

ravat@irit.fr, song@irit.fr

Abstract. Classically, Data Warehouses (DWs) support business analyses on data coming from inside an organization. Nevertheless, Linked Open Data (LOD) can meaningfully enrich these business analyses by providing complementary perspectives during a decision-making process. In this paper, we propose a conceptual modeling solution, named Unified Cube, which blends together multidimensional data from DWs and LOD datasets without materializing them in a stationary repository. We complete the conceptual modeling with an implementation framework which manages the relations between a Unified Cube and multiple data sources at both schema and instance levels. We also propose an analysis process which queries different sources in a way that is transparent to decision-makers. The practical value of our proposal is illustrated through real-world data and benchmarks.

Keywords: Data Warehouse, Linked Open Data, Conceptual Modeling, Multisource analyses, Experimental Assessments

1. Introduction

Well-informed and effective decision-making relies on appropriate data for business analyses. Data are considered appropriate if they include enough information to provide an overall perspective to decision-makers. To obtain as much appropriate data as possible, decision-makers must have access to the company’s business data at any time. Since the 1990s, Business Intelligence (BI) has been providing methods, techniques and tools to collect, extract and analyze business data stored in a Data Warehouse (DW) [9]. However, an overall perspective during decision-making requires not only business data from inside a company but also data from outside it. In today’s constantly evolving business context, one promising approach consists of blending web data with warehoused data [32]. The concept of BI 2.0 has been introduced to envision a new generation of BI enhanced by web-based content [39].

∗ Address for correspondence: IRIT - Université Toulouse I Capitole, 2 Rue du Doyen Gabriel Marty F-31042 Toulouse Cedex 09, France

Among the various kinds of web-based content, Linked Open Data (LOD)¹ provide a set of inter-connected and machine-readable data to enhance business analyses on a web scale [45]. Since data are nowadays produced and updated at high speed, a stationary repository that materializes all data related to analyses (e.g., warehoused data and LOD) can hardly be kept synchronized with changes in the data sources. It is therefore necessary to unify data from various sources without integrating all of them into a stationary repository. To support up-to-date decision-making, business dashboards must be created on demand. Such dashboards should include all appropriate data required by decision-makers.

Case Study. In a government organization managing social housing, internal data are periodically extracted, transformed and loaded into a DW. As shown in figure 1(a), the DW describes the number of applications (i.e. Applications) according to two analysis axes: one about the geographical location of social housing (i.e. Housing Ward and Housing District) and the other related to the applicant’s profile (i.e. Applicant Status). This DW gives only a partial view of the demand for social housing. To support effective decision-making, additional information should be included in analyses. Therefore, a decision-maker browses a second dataset, named LOD1, to obtain complementary views on social housing allocation. Published by the UK Department for Communities and Local Government², LOD1 describes the accepted applications for social housing (i.e. acceptance) according to district and status (cf. figure 1(b)). LOD1 follows a multidimensional structure expressed in the RDF Data Cube Vocabulary (QB)³. The QB format only allows one granularity in each analysis axis, whereas the decision-maker needs to aggregate data based on multiple granularities. To discover more geographical granularities, the decision-maker looks into another dataset named LOD2. This dataset is managed by the Office for National Statistics of the UK⁴; it associates several areas (including districts) with one corresponding region (cf. figure 1(c)). Both LOD1 and LOD2 are real-world LOD which can be accessed through querying endpoints⁵,⁶.

Figure 1: An extract of data in a DW and two LOD datasets

¹ http://linkeddata.org

² http://opendatacommunities.org/data/housingmarket/core/tenancies/econstatus

The above-mentioned warehoused data and LOD share some similar multidimensional features, as they are organized according to analysis subjects and analysis axes. However, analyzing data scattered in several sources is difficult without a unified data representation. During analyses, decision-makers must search for useful information in several sources. The efficiency of such analyses is low, since different sources may follow different schemas and contain different data instances. Facing these issues, the decision-maker needs a business-oriented view unifying data from both the DW and the LOD datasets. She/he makes the following requests regarding the view:

• An analysis subject should include all related numeric indicators from different sources, even though these indicators cannot be aggregated according to the same analytical granularities. To support real-time analyses, numeric indicators (e.g. Applications from the DW, Acceptances from the LOD1 dataset) and their descriptive attributes (e.g. Housing Ward, Housing District and Applicant Status from the DW, District and Status from the LOD1 dataset) at different analytical granularities should be queried on-the-fly from sources;

• Analytical granularities related to the same analysis axis should be grouped together. For instance, the Housing Ward and Housing District granularities from the DW, the District granularity from the LOD1 dataset, and the Area and Region granularities from the LOD2 dataset should be merged into one analysis axis;

• Attributes describing the same analytical granularity should be grouped together. The correlative relationships between instances of these attributes should be managed. For instance, the attribute Housing District from the DW, the attribute District from the LOD1 dataset and the attribute Area from the LOD2 dataset should all be included in one analytical granularity related to districts. Correlative instances "Birmingham" from the DW, "Birmingham E08000025" from the LOD1 dataset and "Birmingham xsd:string" from the LOD2 dataset should be associated together, since they all refer to the same district;

• Summarizable analytical granularities should be indicated for each numeric indicator. For instance, only the measure Applications from the DW can be aggregated according to the Ward analytical granularity. The other measure, Acceptances from the LOD1 dataset, is only summarizable starting from the district analytical granularity on the geographical analysis axis.
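The schema-level requests above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Unified Cube model itself: the class names `Level`, `Dimension` and `Measure`, the dimension name "Geography" and the method `summarizable_at` are hypothetical, while the attribute, source and measure names are taken from the case study.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Level:
    """One analytical granularity grouping attributes from several sources."""
    name: str
    attributes: List[Tuple[str, str]]  # (source, attribute) pairs


@dataclass
class Dimension:
    """One analysis axis; levels are ordered from finest to coarsest."""
    name: str
    levels: List[Level]

    def level_names(self) -> List[str]:
        return [level.name for level in self.levels]


@dataclass
class Measure:
    """A numeric indicator with its finest summarizable level per axis."""
    name: str
    source: str
    finest_level: Dict[str, str]  # dimension name -> finest level name

    def summarizable_at(self, dim: Dimension, level: str) -> bool:
        # A measure can be aggregated at its finest level and at any coarser one.
        names = dim.level_names()
        return names.index(level) >= names.index(self.finest_level[dim.name])


# Hypothetical axis name; the grouping of attributes follows the case study.
geography = Dimension("Geography", [
    Level("Ward", [("DW", "Housing Ward")]),
    Level("District", [("DW", "Housing District"),
                       ("LOD1", "District"),
                       ("LOD2", "Area")]),
    Level("Region", [("LOD2", "Region")]),
])

applications = Measure("Applications", "DW", {"Geography": "Ward"})
acceptances = Measure("Acceptances", "LOD1", {"Geography": "District"})
```

In this sketch, `applications.summarizable_at(geography, "Ward")` holds whereas the same check fails for `acceptances`, which mirrors the last request: each measure records the level from which it can be aggregated.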

³ http://www.w3.org/TR/vocab-data-cube

⁴ https://www.ons.gov.uk/

⁵ http://opendatacommunities.org/sparql

⁶ http://statistics.data.gov.uk/sparql

Contribution. Our aim is to make full use of as much information as possible to support effective and well-informed decisions. To this end, we propose a unified view of data from both DWs and LOD datasets. At the schema level, the unified view should include in a single schema all information about an analysis subject described by all available analysis axes as well as all granularities (coming from multiple sources). At the instance level, the unified view should not materialize data that can be directly queried from the source. Nevertheless, it should manage the correlation relations between related attribute instances referring to the same real-world entity. With the help of the unified view, a decision-maker can easily obtain an overall perspective of an analysis subject. In the previous example, a unified view would enable decision-makers to analyze on-the-fly the number of applications and acceptances according to applicant’s status and district as well as region (cf. figure 1(d)).
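The instance-level idea can be illustrated with a small Python sketch: only the correlation between attribute instances is stored, and measure values are fetched from each source at query time. Everything here is an assumption made for illustration. The three `fetch_*` functions are hypothetical stand-ins for a SQL query on the DW and SPARQL queries on the LOD endpoints, the Coventry rows and all figures are invented toy values, and only the three Birmingham labels come from the case study.

```python
from collections import defaultdict

# Stored correlation relations: one key per real-world district, mapped to
# the label used in each source. The Coventry row is an illustrative assumption.
CORRELATED = {
    "birmingham": {"DW": "Birmingham",
                   "LOD1": "Birmingham E08000025",
                   "LOD2": "Birmingham xsd:string"},
    "coventry": {"DW": "Coventry",
                 "LOD1": "Coventry E08000026",
                 "LOD2": "Coventry xsd:string"},
}

# Reverse index: (source, label) -> district key.
TO_KEY = {(source, label): key
          for key, labels in CORRELATED.items()
          for source, label in labels.items()}


def fetch_dw():
    """Stand-in for a DW query: (Housing District, Applications). Toy values."""
    return [("Birmingham", 120), ("Coventry", 40)]


def fetch_lod1():
    """Stand-in for a LOD1 query: (District, Acceptances). Toy values."""
    return [("Birmingham E08000025", 30), ("Coventry E08000026", 10)]


def fetch_lod2():
    """Stand-in for a LOD2 query: (Area, Region)."""
    return [("Birmingham xsd:string", "West Midlands"),
            ("Coventry xsd:string", "West Midlands")]


def by_region():
    """Roll both measures up to the Region level without materializing them."""
    region_of = {TO_KEY[("LOD2", area)]: region for area, region in fetch_lod2()}
    totals = defaultdict(lambda: {"Applications": 0, "Acceptances": 0})
    for label, value in fetch_dw():
        totals[region_of[TO_KEY[("DW", label)]]]["Applications"] += value
    for label, value in fetch_lod1():
        totals[region_of[TO_KEY[("LOD1", label)]]]["Acceptances"] += value
    return dict(totals)
```

With the toy figures above, `by_region()` returns a single West Midlands entry with 160 applications and 40 acceptances. The point of the design is that the correlation table is the only persistent piece; the measures themselves stay in their sources.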

In this paper, we describe a generic modeling solution, named Unified Cube, which provides a business-oriented view unifying both warehoused data and LOD. Section 2 presents different approaches to unifying data from DWs and LOD datasets. Section 3 describes the conceptual modeling and graphical notation of Unified Cubes. Section 4 presents an implementation framework for Unified Cubes. Section 5 shows how analyses are carried out on a Unified Cube in a user-friendly manner. Section 6 illustrates the feasibility and the efficiency of our proposal through some experimental assessments.

2. Related work

Disparate data silos from different sources make decision-making difficult and tedious [43]. To provide decision-makers with an overall perspective during business analyses, an effective data integration strategy is needed. In accordance with our research context, we focus on work related to the integration of multidimensional data from DWs and LOD datasets. We classify existing research into three categories.

The first category is named ETL-based. With the arrival of LOD, the BI community intuitively treated LOD as external data sources that should be integrated in a DW through ETL processes [15, 29, 36]. The obtained multidimensional DW is used as a centralized repository of LOD [38, 6]. Decision-makers can use classical DW analysis tools to analyze LOD stored in such DWs. However, the existing ETL techniques tend to populate a DW with LOD rather than update the LOD already stored in a DW. No effective technique is proposed to guarantee the freshness of warehoused LOD presented to decision-makers during analyses. One promising avenue is to extend on-demand ETL processes [4] so that LOD are integrated in a DW at the right time during business analyses. Otherwise, current ETL-based approaches are not suitable in today’s highly dynamic context where large amounts of data are constantly published and updated; they collide with the distributed nature and the high volatility of LOD [24, 17].

The second category is named semantic web modeling. Since multidimensional models have been proven successful in supporting complex business analyses [35], the LOD community introduces new modeling vocabularies to semantically describe the multidimensionality of LOD through RDF triples. Among the proposed modeling vocabularies, the RDF Data Cube Vocabulary⁷ (QB) is the current W3C

⁷ http://www.w3.org/TR/vocab-data-cube
