A Unified Approach to Multisource Data Analyses

doi:10.3233/FI-2018-1727

01 Jan 2018 - Fundamenta Informaticae (IOS Press) - Vol. 162, Iss. 4, pp. 311-359

TL;DR: The paper proposes Unified Cube, a conceptual modeling solution that blends multidimensional data from DWs and LOD datasets without materializing them in a stationary repository, together with an analysis process that queries the different sources transparently for decision-makers.

Abstract: Classically, Data Warehouses (DWs) support business analyses on data coming from the inside of an organization. Nevertheless, Linked Open Data (LOD) might sensibly complete these business analyses by providing complementary perspectives during a decision-making process. In this paper, we propose a conceptual modeling solution, named Unified Cube, which blends together multidimensional data from DWs and LOD datasets without materializing them in a stationary repository. We complete the conceptual modeling with an implementation framework which manages the relations between a Unified Cube and multiple data sources at both schema and instance levels. We also propose an analysis process which queries different sources in a transparent way to decision-makers. The practical value of our proposal is illustrated through real-world data and benchmarks.

Summary

1. Introduction

Well-informed and effective decision-making relies on appropriate data for business analyses.
An analysis subject should include all related numeric indicators from different sources, even though these indicators cannot be aggregated according to the same analytical granularities.
The authors describe a generic modeling solution, named Unified Cube, which provides a business-oriented view unifying both warehoused data and LOD.
Section 3 describes the conceptual modeling and graphical notation of Unified Cubes.

3.1. Analysis subject: fact

Classically, a fact models an analysis subject.
To support real-time analyses, the Unified Cube modeling extends the concept of measure by allowing on-the-fly extraction of measure values.
Table 1 shows the algebraic form of commonly used SPARQL queries.
The fact named Social Housings contains two measures, namely mAcceptances and mApplications.
The extraction formula of the measure mAcceptances is defined upon SPARQL algebra.

3.2. Analysis axis: dimension

The concept of dimension in a Unified Cube follows the classical definition.
If several analytical granularities are defined, one or several aggregation paths (also known as hierarchies) can be found.
Without this constraint, a dimension may start at any level.
Definition 3.3. District named Geography-District with LDGeography\lGeo.District = {lGeo. …
Footnote 11: ∃=1 represents the unique existential quantification, meaning "there exists only one".

3.3. Analytical granularity: level

Classically, a level indicates a distinct analytical granularity described by a set of attributes from the same data source.
In the context of Unified Cubes, the classical definition of level needs to be extended to group together attributes from different sources.
Status associates the instances of the attribute aStatus with its equivalent instances of the attribute aApplicant Status, such as: C lEconomic.
In order to simplify the notation, the level-measure mapping is drawn between a measure and its lowest summarizable levels within corresponding dimensions.
By including business-oriented concepts and a graphical notation, a Unified Cube can support analyses on multiple data sources in a user-friendly way without requiring specialized knowledge on logical or physical data modeling.

4.1. Schema module

The schema module manages the multidimensional representation of a Unified Cube.
It is worth noticing that extraction formulae of measures and attributes are translated into executable queries (i.e., queryM and queryA).
Binary relations within a dimension are used to instantiate the association between a child level instance and a parent level instance (cf. lines 13 - 15).
The instantiated metamodel contains (i) one instance of Fact, (ii) two instances of Measure, (iii) two instances of …
Algorithm 1: Metamodel Instantiation. Input: a Unified Cube = {F; D; LM}. Output: …
In the snapshot, the Geography dimension includes the level Geo.District.

4.2. Instance module

In a Unified Cube, equivalent attribute instances from different sources are associated together by correlative mappings (cf. section 3.3).
Classically, problems of matching instances from sources of different types are handled in a simplistic way by transforming heterogeneous sources into a common format and then following a matching process designed for homogeneous sources.
Matching based on string similarity measures is often reinforced by some auxiliary techniques.
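
As an illustration of matching based on a string similarity measure, the following minimal Python sketch computes a normalized Levenshtein similarity and keeps the pairs whose score reaches a threshold. The function names, the normalization by the longer string, and the threshold value are assumptions made for this example; they are not taken from the authors' implementation, which also evaluates many other measures.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) by dynamic programming."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Normalized similarity in [0, 1]; 1.0 means identical strings (assumed normalization)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def match(left, right, threshold):
    """Return (left, right, score) triples whose similarity reaches the threshold."""
    pairs = []
    for l in left:
        for r in right:
            score = similarity(l, r)
            if score >= threshold:
                pairs.append((l, r, score))
    return pairs


# Hypothetical candidates modeled on the case-study district labels.
matches = match(["Birmingham"], ["Birmingham E08000025", "Leeds E08000035"], threshold=0.5)
```

With these inputs only the Birmingham pair is retained. The choice of threshold strongly affects precision and recall, which is why the experimental assessments report an optimal threshold per measure.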
By referring to the links between LOD datasets, 243 pairs of districts from the LOD1 dataset and the LOD2 dataset are associated together with a perfect confidence score (i.e. score = 1); Housing District’s instances from the DW share similar description with instances of Area from the LOD1 dataset.
The authors first describe how queries are automatically generated for an analysis (cf. section 5.1).

5.1. Queries generation

To facilitate decision-makers' tasks, the authors propose a process whose goal is to extract data related to an analysis from multiple sources (cf. algorithm 2).
During the execution, the algorithm picks out attributes linked to a chosen measure (i.e. AM in lines 1 and 23).
The third analysis calculates a measure according to an attribute from a different source.
The second query is generated by referring to the extraction formulae of mAcceptances, aStatus and aDistrict (cf. lines 15 and 18).

5.2. Analysis result generation

After the execution of generated queries, several query results are returned from different sources.
In the following, the authors provide more details about how multiple query results are fused together to form one analysis result at the output of the algorithm.
Note that the abstract notation in algorithm 3 follows the conceptual Unified Cube modeling presented in sections 3, while the operations correspond to those in the metamodel described in section 4.1.
Then, attribute instances from Rtemp are grouped with those from R2.
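
The fusion step can be pictured with a small Python sketch. It assumes that each query result is a list of (district, value) pairs and that the table of correspondences is a dictionary mapping a LOD instance to its equivalent DW instance. These structures, the function name, and the numeric values are hypothetical and only illustrate the idea; they do not reproduce algorithm 3.

```python
def fuse_results(dw_result, lod_result, correspondences):
    """Merge two query results into one analysis result keyed by the DW instance.

    dw_result:       list of (dw_district, applications) pairs
    lod_result:      list of (lod_district, acceptances) pairs
    correspondences: dict mapping a LOD instance to its equivalent DW instance
    """
    fused = {}
    for district, applications in dw_result:
        fused.setdefault(district, {})["applications"] = applications
    for lod_district, acceptances in lod_result:
        district = correspondences.get(lod_district)
        if district is None:
            continue  # no known equivalent instance: skipped in this sketch
        fused.setdefault(district, {})["acceptances"] = acceptances
    return fused


# Illustrative values only; they are not taken from the paper's datasets.
analysis_result = fuse_results(
    dw_result=[("Birmingham", 120)],
    lod_result=[("Birmingham E08000025", 45), ("Unmatched area", 3)],
    correspondences={"Birmingham E08000025": "Birmingham"},
)
```

Keying the fused rows by one canonical instance keeps the source-specific labels out of the result presented to the decision-maker.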
Second, their proposed decision-support process facilitates decision-makers' analysis tasks.

5.3. Multisource analysis framework

To enable analyses on multiple sources in a user-friendly manner, the authors develop a multisource analysis framework which presents only business-oriented concepts to decision-makers during analyses.
The instance manager deals with correlative attribute instances from different sources.
Both Java programs are included in the analysis processing manager.
The proposed analysis framework supports a user-friendly decision-making process.
After query execution, the analysis processing manager receives data extracted from different sources (cf. arrows 5).

6. Experimental assessments

To enable analyses on data from multiple sources, one key step consists of unifying data extracted from different sources together to form one unique analysis result.
This unification is based on the table of correspondences which is populated through an instance matching process.
To study the feasibility and efficiency of their proposed matching process, the authors carry out some experimental assessments.
Second, the authors present the results of their experimental assessments.
Third, based on the experimental results, the authors propose some generic guidelines for efficient use of string similarity measures to match correlative instances in a Unified Cube (cf. section 6.3).

6.1. Input

During their experimental assessments, the authors use two collections of real-world data.
Most CLG datasets contain a geographic dimension composed of one level named District.
Two different implementations of the bibliographic data are managed by the European Network of Excellence ReSIST and the L3S Research Center.
The average string length is about 290 characters including data type descriptors and name spaces.

6.2.1. Protocol

The objective of the experimental assessments is to find out if string similarity measures can be used to match correlative attribute instances.
The matching candidates are formed according to four combinations of matching setups, namely Concatenated&Unprocessed, Concatenated&Optimized, Separated&Unprocessed, and Separated& Optimized (cf. section 4.2).
The efficiency of each string similarity measure is evaluated through the F-measure and the runtime.
The precision is the ratio of the number of true positives to the retrieved mappings, while the recall is the ratio of the number of true positives to the actual mappings expected.
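
The evaluation metrics can be made concrete with a short Python sketch. Precision and recall follow the definitions given above; the F-measure is assumed here to be the balanced harmonic mean of the two (F1), since the summary does not state the weighting. The function name and the sample mappings are hypothetical.

```python
def evaluate(retrieved: set, expected: set):
    """Return (precision, recall, f_measure) for a set of retrieved mappings."""
    true_positives = len(retrieved & expected)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Balanced F-measure (F1): an assumption, as the weighting is not stated.
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical mappings: two of three retrieved pairs are correct, four are expected.
retrieved = {("a", "x"), ("b", "y"), ("c", "z")}
expected = {("a", "x"), ("b", "y"), ("d", "w"), ("e", "v")}
precision, recall, f_measure = evaluate(retrieved, expected)
```

Here precision is 2/3, recall is 1/2, and the F-measure is 4/7, which shows how a measure that retrieves few wrong pairs can still score modestly when it misses expected mappings.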
The hardware configuration of the multithreaded execution environment is as follows: two AMD Opteron 6262HE CPUs with 16 cores, 128 GB of RAM and a SAS 10K disk.

6.2.2. Observations and discussions

The first tests consist of comparing the F-measure of string similarity measures according to different influence factors.
Conclusion. Based on the previous observations, the authors conclude that string similarity measures in the same group produce almost the same F-measure when the same matching setup is used.
To do so, the authors first choose the top five similarity measures producing the highest F-measure according to four matching setups when being applied to seven datasets (cf. table 5).
For bibliographic data, the authors notice that even though the data volume increases by 225% from DBLP0 to DBLP50, the runtime of all similarity measures increases by only 5.2% to 16.9%, with an average of 8% (owing to parallel computation).
Independently of similarity measures, the fastest matching setup is Separated&Optimized, which produces the shortest matching candidates among all matching setups.

6.3. Guidelines

Based on the results of their experimental assessments, the authors describe some generic guidelines to make full use of similarity measures.
First, Soundex, Jaro, Overlap Coefficient, and Euclidean Distance are only suited for specific needs of matching [16, 11, 13], e.g. using Soundex to match homophones in English.
In the context of Unified Cubes, the quality of the correlative mappings based on these similarity measures cannot be guaranteed due to their poor performance in matching instances from some generic sources.
Therefore, they are not considered in the guidelines.
Third, in the case where attribute instances contain relatively short strings (about 200 characters or less), N Grams Distance, Levenshtein, Smith Waterman, Smith Waterman Gotoh, Monge Elkan, Needleman Wunch, and Jaro Winkler are possible choices of similarity measures which can produce a high F-measure.

6.4. Validation

To validate their proposed guidelines, the authors use the benchmark of the instance matching track in the Ontology Alignment Evaluation Initiative campaigns 2016 and 2017.
Due to the large number of highly similar music works, the authors must first write queries to extract corresponding descriptions of music works, for instance title versus title, composers versus composers, etc.
This leaves four similarity measures that may be appropriate for the matching in the benchmark: Jaro Winkler, Levenshtein, N Grams Distance, and Smith Waterman.
As no semantic-based technique is used to improve the matching result, the authors expect some low similarity score between instances.
After the execution, the authors obtain some surprisingly good results.

7. Conclusion

To this end, the authors have defined a generic conceptual multidimensional model, named Unified Cube, which blends data from multiple sources together.
The authors have proposed an implementation framework which manages interactions between a Unified Cube and multiple data sources at both the schema and the instance levels.
Based on their proposed implementation framework, the authors have designed an analysis processing process which enables analyses on multisource data in a user-friendly way.
The results of their experimental assessments have been integrated into some generic guidelines for identifying the most appropriate string similarity measures according to matching setup, string length, and requirement on runtime.
With regard to the maintenance of the table of correspondences, several update alternatives would be included in the process, such as periodically executing their proposed matching process, triggering an update after each evolution detected in sources [23, 14], or triggering an update in an on-demand manner to support right-time business analyses [41].

Figures (25)

Table 4: Sixteen string similarity measures according to six groups

Figure 3: Graphical notation of a Unified Cube

Figure 11: Combining query results from the DW and the LOD1 dataset

Table 5: Ranking of string similarity measures based on their occurrence among the top five ones according to matching setups

Figure 12: Combining Rtemp with query result from the LOD2 dataset

Figure 13: Analysis result about number of submitted and accepted applications by applicant’s status and housing district, region

Figure 5: Snapshot of instantiated metamodel

Table 3: Details of the datasets used in our experimental assessments

Figure 20: Average runtime of similarity measures in geographic and bibliographic datasets

Table 2: Analysis needs with corresponding measures and attributes

Figure 1: An extract of data in a DW and two LOD datasets

Figure 16: Top 5 F-measures with the corresponding optimal thresholds according to matching setup

Figure 17: Highest F-measure according to similarity measure groups and matching setups

Figure 8: A snapshot of the table of correspondences

Figure 7: Identifying equivalent attribute instances within the level lGeo.District

Figure 15: Graphical interface of the multisource analysis framework

Figure 4: Class diagram of the metamodel for Unified Cubes

Table 1: SPARQL queries and their algebraic representation.

Figure 18: F-measure produced by each similarity measure in different datasets

Figure 9: Mediator-based approach versus analysis processing process for Unified Cube

This is an author's version published in: http://oatao.univ-toulouse.fr/22413

To cite this version: Ravat, Franck and Song, Jiefu. A Unified Approach to Multisource Data Analyses. (2018) Fundamenta Informaticae, 162 (4). 311-359. ISSN 1875-8681

Official URL, DOI: https://doi.org/10.3233/FI-2018-1727

DOI 10.3233/FI-2018-1727

A Unified Approach to Multisource Data Analyses

Franck Ravat, Jiefu Song∗

IRIT - Université Toulouse I Capitole, 2 Rue du Doyen Gabriel Marty, F-31042 Toulouse Cedex 09, France

ravat@irit.fr, song@irit.fr

Keywords: Data Warehouse, Linked Open Data, Conceptual Modeling, Multisource analyses, Experimental Assessments

1. Introduction

Well-informed and effective decision-making relies on appropriate data for business analyses. Data are considered appropriate if they include enough information to provide an overall perspective to decision-makers. To obtain as many appropriate data as possible, decision-makers must have access to the company's business data at any time. Since the 1990s, Business Intelligence (BI) has been providing methods, techniques and tools to collect, extract and analyze business data stored in a Data Warehouse (DW) [9]. However, an overall perspective during decision-making requires not only business data from inside a company but also other data from outside a company. In today's constantly evolving business context, one promising approach consists of blending web data with warehoused data [32]. The concept of BI 2.0 is introduced to envision a new generation of BI enhanced by web-based content [39].

∗ Address for correspondence: IRIT - Université Toulouse I Capitole, 2 Rue du Doyen Gabriel Marty, F-31042 Toulouse Cedex 09, France

Among various web-based content, Linked Open Data (LOD) provide a set of inter-connected and machine-readable data to enhance business analyses on a web scale [45]. Since data are produced and updated at a high speed nowadays, a stationary repository that materializes all data related to analyses (e.g., warehoused data and LOD) can hardly stay synchronized with changes in the data sources. It is necessary to unify data from various sources without integrating all data into a stationary repository. To support up-to-date decision-making, business dashboards must be created in an on-demand manner. Such dashboards should include all appropriate data required by decision-makers.

Case Study. In a government organization managing social housings, internal data are periodically extracted, transformed and loaded in a DW. As shown in figure 1(a), the DW describes the number of applications (i.e. Applications) according to two analysis axes: one about the geographical location of social housings (i.e. Housing Ward and Housing District) and the other related to the applicant's profile (i.e. Applicant Status). This DW only gives a partial view on the demand for social housings. To support effective decision-making, additional information should be included in analyses. Therefore, a decision-maker browses in a second dataset, named LOD1, to obtain complementary views on social housing allocation. Published by the UK Department for Communities and Local Government, LOD1 describes the accepted applications for social housing (i.e. acceptance) according to district and status (cf. figure 1(b)). LOD1 follows a multidimensional structure expressed in RDF Data Cube Vocabulary (QB). The QB format only allows including one granularity in each analysis axis. The decision-maker needs new analysis possibilities to aggregate data based on multiple granularities. To discover more geographical granularities, the decision-maker looks into another dataset named LOD2. This dataset is managed by the Office for National Statistics of the UK; it associates several areas (including districts) with one corresponding region (cf. figure 1(c)). Both LOD1 and LOD2 are real-world LOD which can be accessed through querying endpoints.

Figure 1: An extract of data in a DW and two LOD datasets

Footnote URLs: Linked Open Data: http://linkeddata.org; LOD1 dataset: http://opendatacommunities.org/data/housingmarket/core/tenancies/econstatus

The above-mentioned warehoused data and LOD share some similar multidimensional features, as they are organized according to analysis subjects and analysis axes. However, analyzing data scattered in several sources is difficult without a unified data representation. During analyses, decision-makers must search for useful information in several sources. The efficiency of such analyses is low, since different sources may follow different schemas and contain different data instances. Facing these issues, the decision-maker needs a business-oriented view unifying data from both the DW and the LOD datasets. She/he makes the following requests regarding the view:

• An analysis subject should include all related numeric indicators from different sources, even though these indicators cannot be aggregated according to the same analytical granularities. To support real-time analyses, numeric indicators (e.g. Applications from the DW, Acceptances from the LOD1 dataset) and their descriptive attributes (e.g. Housing Ward, Housing District and Applicant Status from the DW, District and Status from the LOD1 dataset) at different analytical granularities should be queried on-the-fly from sources;

• Analytical granularities related to the same analysis axis should be grouped together. For instance, the Housing Ward and Housing District granularities from the DW, the District granularity from the LOD1 dataset, the Area and Region granularities from the LOD2 dataset should be merged into one analysis axis;

• Attributes describing the same analytical granularity should be grouped together. The correlative relationships between instances of these attributes should be managed. For instance, the attribute Housing District from the DW, the attribute District from the LOD1 dataset and the attribute Area from the LOD2 dataset should all be included in one analytical granularity related to districts. Correlative instances "Birmingham" from the DW, "Birmingham E08000025" from the LOD1 dataset and "Birmingham xsd:string" from the LOD2 dataset should be associated together, since they all refer to the same district;

• Summarizable analytical granularities should be indicated for each numeric indicator. For instance, only the measure Applications from the DW can be aggregated according to the Ward analytical granularity. The other measure, Acceptances from the LOD1 dataset, is only summarizable starting from the district analytical granularity on the geographical analysis axis.
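
The last request can be illustrated with a minimal Python sketch that records, for each measure, the lowest level at which it is summarizable on the geographical axis. The level ordering (Ward, then District, then Region), the dictionary layout, and the function name are assumptions made for this example rather than part of the Unified Cube definitions.

```python
# Levels of the geographical analysis axis, ordered from finest to coarsest.
# The Ward -> District -> Region ordering is an assumption based on the case study.
GEOGRAPHY_LEVELS = ["Ward", "District", "Region"]

# Lowest summarizable level per measure, as described in the case study.
LOWEST_SUMMARIZABLE_LEVEL = {
    "Applications": "Ward",      # measure from the DW
    "Acceptances": "District",   # measure from the LOD1 dataset
}


def is_summarizable(measure: str, level: str) -> bool:
    """True if the measure can be aggregated at the given level or any coarser one."""
    lowest = LOWEST_SUMMARIZABLE_LEVEL[measure]
    return GEOGRAPHY_LEVELS.index(level) >= GEOGRAPHY_LEVELS.index(lowest)


def measures_available_at(level: str):
    """List the measures that an analysis at this level may include."""
    return [m for m in LOWEST_SUMMARIZABLE_LEVEL if is_summarizable(m, level)]
```

For example, `measures_available_at("Ward")` would return only Applications, while an analysis by District or Region could combine both measures.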

Footnote URLs: RDF Data Cube Vocabulary: http://www.w3.org/TR/vocab-data-cube; Office for National Statistics: https://www.ons.gov.uk/; querying endpoints: http://opendatacommunities.org/sparql and http://statistics.data.gov.uk/sparql

Contribution. Our aim is to make full use of as much information as possible to support effective and well-informed decisions. To this end, we propose a unified view of data from both DWs and LOD datasets. At the schema level, the unified view should include in a single schema all information about an analysis subject described by all available analysis axes as well as all granularities (coming from multiple sources). At the instance level, the unified view should not materialize data that can be directly queried from the source. Nevertheless, it should manage the correlation relations between related attribute instances referring to the same real-world entity. With the help of the unified view, a decision-maker can easily obtain an overall perspective of an analysis subject. In the previous example, a unified view would enable decision-makers to analyze on-the-fly the number of applications and acceptances according to applicant's status and district as well as region (cf. figure 1(d)).

In this paper, we describe a generic modeling solution, named Unified Cube, which provides a business-oriented view unifying both warehoused data and LOD. Section 2 presents different approaches to unifying data from DWs and LOD datasets. Section 3 describes the conceptual modeling and graphical notation of Unified Cubes. Section 4 presents an implementation framework for Unified Cubes. Section 5 shows how analyses are carried out on a Unified Cube in a user-friendly manner. Section 6 illustrates the feasibility and the efficiency of our proposal through some experimental assessments.

2. Related work

Disparate data silos from different sources make decision-making difficult and tedious [43]. To provide decision-makers with an overall perspective during business analyses, an effective data integration strategy is needed. In accordance with our research context, we focus on work related to the integration of multidimensional data from DWs and LOD datasets. We classify existing research into three categories.

The first category is named ETL-based. With the arrival of LOD, the BI community intuitively treated LOD as external data sources that should be integrated in a DW through ETL processes [15, 29, 36]. The obtained multidimensional DW is used as a centralized repository of LOD [38, 6]. Decision-makers can use classical DW analysis tools to analyze LOD stored in such DWs. However, the existing ETL techniques are inclined to populate a DW with LOD rather than updating existing LOD in a DW. No effective technique is proposed to guarantee the freshness of warehoused LOD presented to decision-makers during analyses. One promising avenue is to extend on-demand ETL processes [4] to fit the integration of LOD in a DW at the right time during business analyses. Otherwise, current ETL-based approaches are not suitable in today's highly dynamic context where large amounts of data are constantly published and updated; they collide with the distributed nature and the high volatility of LOD [24, 17].

The second category is named semantic web modeling. Since multidimensional models have been proven successful in supporting complex business analyses [35], the LOD community introduces new modeling vocabularies to semantically describe the multidimensionality of LOD through RDF triples. Among the proposed modeling vocabularies, the RDF Data Cube Vocabulary (QB, http://www.w3.org/TR/vocab-data-cube) is the current W3C
