Book Chapter · DOI

Metadata Management for Data Lakes

08 Sep 2019 · pp 37-44
TL;DR: A metadata conceptual schema which considers different types (structured, semi-structured and unstructured) of raw or processed data is presented and is implemented in two DBMSs to validate the proposal.
Abstract: To prevent data lakes from being invisible and inaccessible to users, an efficient metadata management system is necessary. In this paper, we propose such a system based on a generic and extensible classification of metadata. A metadata conceptual schema which considers different types (structured, semi-structured and unstructured) of raw or processed data is presented. This schema is implemented in two DBMSs (relational and graph) to validate our proposal.

Summary (2 min read)

1 Introduction

  • To prevent data lakes from turning into data swamps, metadata management is essential [1, 8, 20].
  • Nevertheless, many papers focus on a single zone of data lakes (especially the ingestion zone) or a single data type.
  • In the third section, the authors propose their metadata conceptual schema with a classification.

3.1 Metadata Classification

  • The authors' metadata classification has the advantage of integrating both intra-metadata and inter-metadata for all datasets.
  • Logical cluster signifies that some datasets belong to the same domain.
  • Lineage metadata consist of the original source of datasets and the processing history.
  • Security metadata consist of data sensitivity and access level.

3.2 Metadata Conceptual Schema

  • From the functional architecture point of view [5, 11, 13, 20], a data lake contains four essential zones.
  • A raw data zone ingests data without processing and stores raw data in their native format.
  • A governance zone is in charge of ensuring data quality, security and the data life-cycle.
  • The authors' metadata classification is applied to the multi-zone functional architecture of data lakes (see Fig. 1).
  • Information on each single dataset and relationships between different datasets are stored.

4 Metadata Implementation

  • The University Hospital Center (UHC) of Toulouse is the largest hospital center in the south of France.
  • All medical, financial and administrative data are stored in the information system of this center.
  • The UHC of Toulouse plans to launch a data lake through an iterative process.
  • The objective of this data lake is to combine these different data sources in order to allow data analysts and BI professionals to analyse available data to improve medical treatments.
  • Regarding metadata management systems, metadata are stored in key-value stores [9], XML documents [15, 17], relational databases [17] or through an ontology [1].

4.1 Relational Database

  • After the implementation, the authors collected the needs of data scientists from a metadata point of view.
  • The first questions were about data trust and data lineage analysis.
  • To validate their proposal, the authors wrote several queries to compare the feasibility and usability of the different environments.
  • Two examples follow; a sketch of both queries appears after this list.
  • (i) One example finds the original dataset of 'COMEDIMS' (see Fig. 4(a)). (ii) Besides finding the origin of one dataset, users may also want to find relevant datasets that come from the same origin.
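As an illustration of these two questions, here is a minimal runnable sketch of both lineage queries in a relational setting, using Python's built-in sqlite3 module. The tables dataset(id, name) and lineage(source_id, target_id), and every dataset name except 'COMEDIMS', are our own illustrative assumptions; the paper does not publish its exact relational schema.

```python
import sqlite3

# Toy lineage store. Hypothetical schema: dataset(id, name) and
# lineage(source_id, target_id); 'RAW_PHARMACY' and 'DRUG_STATS'
# are invented names, only 'COMEDIMS' comes from the paper.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dataset (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE lineage (source_id INTEGER, target_id INTEGER);
    INSERT INTO dataset VALUES (1,'RAW_PHARMACY'),(2,'COMEDIMS'),(3,'DRUG_STATS');
    INSERT INTO lineage VALUES (1, 2), (1, 3);
""")

# (i) Origin of 'COMEDIMS': walk provenance edges upwards with a
# recursive CTE and keep only roots (datasets that are no target).
origin_sql = """
WITH RECURSIVE ancestors(id) AS (
    SELECT source_id FROM lineage
    WHERE target_id = (SELECT id FROM dataset WHERE name = ?)
    UNION
    SELECT l.source_id FROM lineage l JOIN ancestors a ON l.target_id = a.id
)
SELECT d.name FROM dataset d JOIN ancestors a ON d.id = a.id
WHERE a.id NOT IN (SELECT target_id FROM lineage);
"""
print(conn.execute(origin_sql, ("COMEDIMS",)).fetchall())  # [('RAW_PHARMACY',)]

# (ii) Relevant datasets sharing the same direct origin as 'COMEDIMS'.
siblings_sql = """
SELECT d2.name
FROM lineage l1
JOIN lineage l2 ON l2.source_id = l1.source_id
JOIN dataset d2 ON d2.id = l2.target_id
WHERE l1.target_id = (SELECT id FROM dataset WHERE name = ?)
  AND d2.name <> ?;
"""
print(conn.execute(siblings_sql, ("COMEDIMS", "COMEDIMS")).fetchall())  # [('DRUG_STATS',)]
```

The recursive CTE also covers multi-step lineages, so the same query keeps working when a dataset is the result of a chain of processes.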

4.2 Graph Database

  • The second implementation uses a graph database.
  • In addition, two queries that answer the same questions as in the previous subsection are executed.
  • The authors extended the mapping from UML class diagrams to property graphs proposed by [3] to the Neo4j Cypher query language.
  • To test this implementation, the authors answered the same questions as with the relational database; a sketch in Cypher follows this list.
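For comparison, here is a minimal sketch of the same two questions in Cypher, issued through the official Neo4j Python driver. The node label Dataset, the relationship type SOURCE_OF and the connection settings are illustrative assumptions; the paper's exact UML-to-property-graph mapping is not reproduced here.

```python
from neo4j import GraphDatabase  # official Neo4j Python driver

# Hypothetical property-graph mapping: (:Dataset)-[:SOURCE_OF]->(:Dataset).
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# (i) Origin of a dataset: follow SOURCE_OF edges back to a root node,
# i.e. a Dataset with no incoming SOURCE_OF relationship.
ORIGIN = """
MATCH (root:Dataset)-[:SOURCE_OF*]->(d:Dataset {name: $name})
WHERE NOT ()-[:SOURCE_OF]->(root)
RETURN root.name AS origin
"""

# (ii) Other datasets derived from that same origin.
SIBLINGS = """
MATCH (root:Dataset)-[:SOURCE_OF*]->(d:Dataset {name: $name}),
      (root)-[:SOURCE_OF*]->(other:Dataset)
WHERE NOT ()-[:SOURCE_OF]->(root) AND other <> d
RETURN DISTINCT other.name AS sibling
"""

with driver.session() as session:
    print([r["origin"] for r in session.run(ORIGIN, name="COMEDIMS")])
    print([r["sibling"] for r in session.run(SIBLINGS, name="COMEDIMS")])
driver.close()
```

Where the relational version needs a recursive CTE, the graph version expresses the transitive walk directly with a variable-length pattern, which is the usual argument for graph databases in lineage-heavy workloads.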


Official URL
DOI: https://doi.org/10.1007/978-3-030-30278-8_5
Any correspondence concerning this service should be sent to the repository administrator: tech-oatao@listes-diff.inp-toulouse.fr
This is an author’s version published in: http://oatao.univ-toulouse.fr/25044
Open Archive Toulouse Archive Ouverte (OATAO) is an open access repository that collects the work of Toulouse researchers and makes it freely available over the web where possible.
To cite this version: Ravat, Franck and Zhao, Yan. Metadata Management for Data Lakes. (2019) In: 23rd East-European Conference on Advances in Databases and Information Systems (ADBIS 2019), 8-11 September 2019, Bled, Slovenia.

Metadata Management for Data Lakes
Franck Ravat¹ and Yan Zhao¹,²
¹ Institut de Recherche en Informatique de Toulouse, IRIT-CNRS (UMR 5505), Université Toulouse 1 Capitole, Toulouse, France
{Franck.Ravat,Yan.Zhao}@irit.fr
² Centre Hospitalier Universitaire (CHU) de Toulouse, Toulouse, France
Abstract. To prevent data lakes from being invisible and inaccessible to users, an efficient metadata management system is necessary. In this paper, we propose such a system based on a generic and extensible classification of metadata. A metadata conceptual schema which considers different types (structured, semi-structured and unstructured) of raw or processed data is presented. This schema is implemented in two DBMSs (relational and graph) to validate our proposal.
Keywords: Data lake · Metadata management · Metadata classification
1 Introduction
The concept of Data Lake (DL) was created by Dixon [4] and extended by various authors [5,8,20]. A DL allows raw data to be ingested from various sources, stored in their native format and processed upon usage; it ensures the availability of data, provides access to data scientists, analysts and BI professionals, and governs data to ensure data quality, security and the data life cycle.
DLs facilitate different types of analysis such as machine learning algorithms, statistics, data visualisation... (unlike Data Warehouses (DW) [16]). The main characteristic of DLs is ’schema-on-read’ [5]: data are only processed upon usage. Compared to DWs, which are structured data repositories dedicated to predetermined analyses, DLs have great flexibility and can avoid losing information.
However, a data lake that contains a great amount of structured, semi-structured and unstructured data without explicit schema or description can easily turn into a data swamp which is invisible, inaccessible and unreliable to users [18]. To prevent data lakes from turning into data swamps, metadata management is essential [1,8,20]. Metadata can help users find data that correspond to their needs, accelerate data access, verify data origin and processing history to gain confidence, and find relevant data to enrich their analyses [1,14].
Nevertheless, many papers focus on a single zone of data lakes (especially the ingestion zone) or a single data type. Therefore, the goal of this paper is to propose a metadata management system dedicated to data lakes and applied to the whole life-cycle (multiple zones) of data. The remainder of the paper is organized as follows:

the second section introduces related work on metadata. In the third section,
we propose our metadata conceptual schema with a classification. The fourth
section describes the implementation of a metadata management system.
2 Related Work
DL metadata, inspired by the DW classifications [6,7], are classified in two ways. A first classification includes three categories [12,14]: Technical metadata concern data type, format and structure (schema), Operational metadata concern data processing information, and Business metadata concern business objects and descriptions. A second classification includes not only the information of each dataset (intra-metadata) but also the relationships between datasets (inter-metadata). Intra-metadata are classified into data characteristics, definitional, navigational, activity, lineage, rating and assessment [2,6,19]. Inter-metadata describe relationships between datasets; they are classified into dataset containment, provenance, logical cluster and content similarity [9].
Compared to the first classification, the second one is more specific. Nevertheless, the second classification can be improved. Some sub-categories are not adapted to data lakes. For instance, the rating subcategory that concerns user preferences [19] needs to be removed because, in data lakes, datasets can be processed and analysed by different users [5]: a dataset that makes no sense to BI professionals can be of great value to data scientists. What’s more, this classification can be extended with more sub-categories. For instance, data sensitivity and accessibility also need to be controlled in data lakes.
Concerning metadata management, various solutions for data lakes have been presented with different emphases [1,9,15,17]. Across these solutions, authors mainly focused on a few points. Firstly, the detection of relationships between different datasets is always presented [1,9,15]; relationships between datasets can help users find as many relevant datasets as possible to enrich data analysis. We, however, want a metadata model that shows not only the relationships between datasets but also the information of each single dataset. Secondly, authors often focused on unstructured data (mostly textual data) [15,17] because of the difficulty of extracting information; yet in a data lake there are various types of data (images, pdf files...). Thirdly, data ingestion is the phase most considered for extracting metadata [1,9,17]. Nevertheless, the information that is produced during the process and access phases has value too [6,17].
Until now, there is no generic metadata management system that works on both structured and unstructured data for the whole data life-cycle in data lakes. The objective of this paper is to define a metadata management system that addresses these weaknesses.
3 Metadata Model
Considering the diversity of data structural types and the different processes applied to datasets, our solution is based on intra- and inter-metadata.

Fig. 1. Metadata classification
3.1 Metadata Classification
Our metadata classification has the advantage of integrating both intra-metadata and inter-metadata for all datasets. Intra-metadata allow users to understand datasets with their characteristics, meaning, quality and security level [2,19]. Inter-metadata help users find relevant datasets that can answer their requirements to make their data discovery easier [9,17].
Inter-metadata. We complete the classification of [9] and obtain five sub-categories. Dataset containment signifies that a dataset is contained in other datasets. Partial overlap signifies that some attributes with corresponding data in some datasets overlap. For instance, in a hospital, health care and billing databases contain the same attributes and data about patients, prescriptions and stays, but these two databases also contain their own specific data. Provenance signifies that one dataset is the source of another dataset. Logical cluster signifies that some datasets belong to the same domain, for example different versions or duplications of the same logical dataset. Content similarity signifies that different datasets share the same attributes.
Intra-metadata. We extend the classification of [2,19] to include access, quality and security.
Data characteristics consist of information such as the identification, name, size, structural type and creation date of datasets. This information helps users to get a general idea of a dataset.
Definitional metadata specify datasets’ meanings. In the original taxonomy, there are vocabulary and schema subcategories. We classify definitional metadata into semantic and schematic metadata. Structured and unstructured datasets can be semantically described by a text or by some keywords (vocabularies). Schematically, a structured dataset can be represented by a database schema.
Navigational metadata concern the location of datasets, for instance file paths and database connection URLs.
Lineage presents the data life-cycle. It consists of the original source of datasets and the processing history. Information on dataset sources and processing history makes datasets more reliable.

Access metadata present access information, for example the names of the users who accessed datasets and the access tools. This information helps users to find relevant datasets through the users who accessed them and to trust data through other users’ access histories.
Quality metadata consist of data consistency and completeness [10] to ensure datasets’ reliability.
Security metadata consist of data sensitivity and access level. Data lakes
store datasets from various sources. Some datasets may contain sensitive
information that can only be accessed by certain users. Security metadata
can support the verification of access. This information ensures the safety
of sensitive data.
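To make the classification concrete, the sketch below encodes the categories of Fig. 1 as plain Python types. The class and field names are our own illustration; the paper fixes only the categories, not this particular schema.

```python
from dataclasses import dataclass, field
from enum import Enum

# Inter-metadata: the five relationship types between datasets (Sect. 3.1).
class InterRelation(Enum):
    CONTAINMENT = "dataset containment"
    PARTIAL_OVERLAP = "partial overlap"
    PROVENANCE = "provenance"
    LOGICAL_CLUSTER = "logical cluster"
    CONTENT_SIMILARITY = "content similarity"

# Intra-metadata for one dataset: characteristics, definitional,
# navigational, lineage, access, quality and security information.
# Field names are illustrative, not taken from the paper.
@dataclass
class IntraMetadata:
    name: str                     # data characteristics
    size_bytes: int
    structural_type: str          # structured / semi-structured / unstructured
    semantic_description: str     # definitional (semantic)
    schema: str | None            # definitional (schematic), structured data only
    location: str                 # navigational: file path or connection URL
    source: str | None            # lineage: original source
    access_log: list[str] = field(default_factory=list)  # access history
    completeness: float = 1.0     # quality
    sensitivity: str = "public"   # security

# Example: a processed dataset plus one inter-metadata link to its source.
comedims = IntraMetadata(
    name="COMEDIMS", size_bytes=2_000_000, structural_type="structured",
    semantic_description="drug prescription committee data",
    schema="prescription(id, drug, patient, date)",
    location="jdbc:postgresql://datalake/access", source="RAW_PHARMACY",
)
link = ("RAW_PHARMACY", InterRelation.PROVENANCE, "COMEDIMS")
```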
3.2 Metadata Conceptual Schema
From the functional architecture point of view [5,11,13,20], a data lake contains four essential zones. A raw data zone ingests data without processing and stores raw data in their native format. A process zone processes raw data upon usage and provides intermediate storage areas. The access zone stores refined data and ensures data availability. Finally, a governance zone is in charge of ensuring data quality, security and the data life-cycle.
Fig. 2. Class diagram of the metadata conceptual model
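As a small illustration of how this zone architecture can surface in the metadata itself, the sketch below tags each dataset with the zone it currently lives in; the enumeration merely restates the four zones above and is not the paper's implementation.

```python
from enum import Enum

# The four essential data lake zones of Sect. 3.2 as a metadata tag.
class Zone(Enum):
    RAW = "raw data zone"           # data ingested as-is, native format
    PROCESS = "process zone"        # intermediate transformations
    ACCESS = "access zone"          # refined, analysis-ready data
    GOVERNANCE = "governance zone"  # cross-cutting quality/security rules

# A dataset typically moves RAW -> PROCESS -> ACCESS, while the
# governance zone applies to its metadata throughout the life-cycle.
dataset_zone = {"RAW_PHARMACY": Zone.RAW, "COMEDIMS": Zone.ACCESS}
```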

Citations
Journal ArticleDOI
TL;DR: This paper provides a comprehensive state of the art of the different approaches to data lake design, particularly on data lake architectures and metadata management, which are key issues in successful data lakes.
Abstract: Over the past two decades, we have witnessed an exponential increase of data production in the world. So-called big data generally come from transactional systems, and even more so from the Internet of Things and social media. They are mainly characterized by volume, velocity, variety and veracity issues. Big data-related issues strongly challenge traditional data management and analysis systems. The concept of data lake was introduced to address them. A data lake is a large, raw data repository that stores and manages all company data bearing any format. However, the data lake concept remains ambiguous or fuzzy for many researchers and practitioners, who often confuse it with the Hadoop technology. Thus, we provide in this paper a comprehensive state of the art of the different approaches to data lake design. We particularly focus on data lake architectures and metadata management, which are key issues in successful data lakes. We also discuss the pros and cons of data lakes and their design alternatives.

68 citations


Cites background from "Metadata Management for Data Lakes"

  • ...5 (Ravat and Zhao 2019a))....
  • ...Oram’s metadata classification is the most cited, especially in the industrial literature [9,28,46,47], presumably because it is inspired from metadata categories from data warehouses [45]....
  • ...Existing reviews on data lake architectures commonly distinguish pond and zone architectures (Giebler et al. 2019; Ravat and Zhao 2019a)....
  • ...Such architectures indeed generally differ in the number and characteristics of zones (Giebler et al. 2019), e.g., some architectures include a transient zone (LaPlante and Sharma 2016; Tharrington 2017; Zikopoulos et al. 2015) while others do not (Hai et al. 2016; Ravat and Zhao 2019a)....
  • ...Oram’s metadata classification is the most cited, especially in the industrial literature (Diamantini et al. 2018; LaPlante and Sharma 2016; Ravat and Zhao 2019b; Russom 2017), presumably because it is inspired from metadata categories from data warehouses (Ravat and Zhao 2019a)....

Book ChapterDOI
26 Aug 2019
TL;DR: This work studies the existing work and proposes a complete definition and a generic and extensible architecture of data lake and introduces three future research axes related to metadata management that consists of intra- and inter-metadata.
Abstract: As a relatively new concept, data lake has neither a standard definition nor an acknowledged architecture. Thus, we study the existing work and propose a complete definition and a generic and extensible architecture of data lake. What’s more, we introduce three future research axes in connection with our health-care Information Technology (IT) activities. They are related to (i) metadata management that consists of intra- and inter-metadata, (ii) a unified ecosystem for companies’ data warehouses and data lakes and (iii) data lake governance.

58 citations


Cites methods from "Metadata Management for Data Lakes"

  • ...– For inter-metadata [28], we propose to integrate Dataset containment which means a dataset is contained in another dataset....
  • ...– For intra-metadata [28], we retain data characteristics, definitional, navigational and lineage metadata proposed in [2] and add the access, quality and security metadata....


Book ChapterDOI
14 Sep 2020
TL;DR: This work presents HANDLE, a generic metadata model for data lakes, which supports the flexible integration of metadata, data lake zones, metadata on various granular levels, and any metadata categorization and enables comprehensive metadata management in data lakes.
Abstract: The substantial increase in generated data induced the development of new concepts such as the data lake. A data lake is a large storage repository designed to enable flexible extraction of the data’s value. A key aspect of exploiting data value in data lakes is the collection and management of metadata. To store and handle the metadata, a generic metadata model is required that can reflect metadata of any potential metadata management use case, e.g., data versioning or data lineage. However, an evaluation of existent metadata models yields that none so far are sufficiently generic. In this work, we present HANDLE, a generic metadata model for data lakes, which supports the flexible integration of metadata, data lake zones, metadata on various granular levels, and any metadata categorization. With these capabilities HANDLE enables comprehensive metadata management in data lakes. We show HANDLE’s feasibility through the application to an exemplary access-use-case and a prototypical implementation. A comparison with existent models yields that HANDLE can reflect the same information and provides additional capabilities needed for metadata management in data lakes.

11 citations


Cites background or methods from "Metadata Management for Data Lakes"

  • ..., [2,16,18], are not sufficiently generic as they cannot support every potential metadata management use case....
  • ...Ravat and Zhao’s model is partially checked as they support adding keywords describing their data elements, which does not however, suffice for modeling, e.g., an actor accessing the data....
  • ...More general models also created for data lakes include those by Ravat and Zhao [16], Diamantini et al. [2], and lastly, Sawadogo et al.’s model MEtadata for DAta Lakes (MEDAL) [18]....
  • ...Set of metadata categorizations, the first five belong to the selected metadata models [18,15,10,16,2]....
  • ...Ravat and Zhao mention dataset containment, but it is not clear whether this can be used to implement the granularity topic....

Journal ArticleDOI
01 Nov 2021
TL;DR: HANDLE as mentioned in this paper is a generic metadata model for data lakes that supports the acquisition of metadata on varying granular levels, any metadata categorization, including metadata that belongs to a specific data element as well as metadata that applies to a broader range of data.
Abstract: Data contains important knowledge and has the potential to provide new insights. Due to new technological developments such as the Internet of Things, data is generated in increasing volumes. In order to deal with these data volumes and extract the data’s value new concepts such as the data lake were created. The data lake is a data management platform designed to handle data at scale for analytical purposes. To prevent a data lake from becoming inoperable and turning into a data swamp, metadata management is needed. To store and handle metadata, a generic metadata model is required that can reflect metadata of any potential metadata management use case, e.g., data versioning or data lineage. However, an evaluation of existent metadata models yields that none so far are sufficiently generic as their design basis is not suited. In this work, we use a different design approach to build HANDLE, a generic metadata model for data lakes. The new metadata model supports the acquisition of metadata on varying granular levels, any metadata categorization, including the acquisition of both metadata that belongs to a specific data element as well as metadata that applies to a broader range of data. HANDLE supports the flexible integration of metadata and can reflect the same metadata in various ways according to the intended utilization. Furthermore, it is created for data lakes and therefore also supports data lake characteristics like data lake zones. With these capabilities HANDLE enables comprehensive metadata management in data lakes. HANDLE’s feasibility is shown through the application to an exemplary access-use-case and a prototypical implementation. By comparing HANDLE with existing models we demonstrate that it can provide the same information as the other models as well as adding further capabilities needed for metadata management in data lakes.

10 citations

References
Journal ArticleDOI
TL;DR: A research model is proposed to explain the acquisition intention of big data analytics mainly from the theoretical perspectives of data quality management and data usage experience and empirical investigation reveals that a firm's intention for big data Analytics can be positively affected by its competence in maintaining the quality of corporate data.

550 citations


"Metadata Management for Data Lakes" refers background in this paper

  • ...• Quality metadata consist of data consistency and completeness [10] to ensure datasets’ reliability....

Proceedings ArticleDOI
26 Jun 2016
TL;DR: Constance is a Data Lake system with sophisticated metadata management over raw data extracted from heterogeneous data sources that discovers, extracts, and summarizes the structural metadata from the data sources, and annotates data and metadata with semantic information to avoid ambiguities.
Abstract: As the challenge of our time, Big Data still has many research hassles, especially the variety of data. The high diversity of data sources often results in information silos, a collection of non-integrated data management systems with heterogeneous schemas, query languages, and APIs. Data Lake systems have been proposed as a solution to this problem, by providing a schema-less repository for raw data with a common access interface. However, just dumping all data into a data lake without any metadata management, would only lead to a 'data swamp'. To avoid this, we propose Constance, a Data Lake system with sophisticated metadata management over raw data extracted from heterogeneous data sources. Constance discovers, extracts, and summarizes the structural metadata from the data sources, and annotates data and metadata with semantic information to avoid ambiguities. With embedded query rewriting engines supporting structured data and semi-structured data, Constance provides users a unified interface for query processing and data exploration. During the demo, we will walk through each functional component of Constance. Constance will be applied to two real-life use cases in order to show attendees the importance and usefulness of our generic and extensible data lake system.

188 citations


"Metadata Management for Data Lakes" refers background or methods in this paper

  • ...The concept of Data Lake (DL) was created by Dixon [4] and extended by various authors [5,8,20]....
  • ...To prevent data lakes from turning into data swamps, metadata management is essential [1,8,20]....

Proceedings ArticleDOI
08 Jun 2015
TL;DR: The concept of a data lake is emerging as a popular way to organize and build the next generation of systems to master new big data challenges, but there are lots of concerns and questions for large enterprises to implement data lakes.
Abstract: The concept of a data lake is emerging as a popular way to organize and build the next generation of systems to master new big data challenges, but there are lots of concerns and questions for large enterprises to implement data lakes. The paper discusses the concept of data lakes and shares the author's thoughts and practices of data lakes.

139 citations


"Metadata Management for Data Lakes" refers background or methods in this paper

  • ...The main characteristic of DL is ’schema-on-read’ [5], data are only processed upon usage....
  • ...From the functional architecture point of view [5,11,13,20], a data lake contains four essential zones....
  • ...The concept of Data Lake (DL) was created by Dixon [4] and extended by various authors [5,8,20]....
  • ...Because in data lakes, datasets can be processed and analysed by different users [5], a dataset that makes no sense to BI professionals can be of great value to data scientists....

Proceedings ArticleDOI
26 Aug 2015
TL;DR: This paper presents Personal Data Lake, a unified storage facility for storing, analyzing and querying personal data, and allows third-party plugins so that the unstructured data can be analyzed and queried.
Abstract: This paper presents Personal Data Lake, a unified storage facility for storing, analyzing and querying personal data. A data lake stores data regardless of format and thus provides an intuitive way to store personal data fragments of any type. Metadata management is a central part of the lake architecture. For structured/semi-structured data fragments, metadata may contain information about the schema of the data so that the data can be transformed into queryable data objects when required. For unstructured data, enabling gravity pull means allowing third-party plugins so that the unstructured data can be analyzed and queried.

80 citations


"Metadata Management for Data Lakes" refers background or methods in this paper

  • ...From the functional architecture point of view [5,11,13,20], a data lake contains four essential zones....
  • ...The concept of Data Lake (DL) was created by Dixon [4] and extended by various authors [5,8,20]....
  • ...To prevent data lakes from turning into data swamps, metadata management is essential [1,8,20]....

Journal ArticleDOI
TL;DR: Many data warehouses are currently underutilized by managers and knowledge workers, so can high-quality end-user metadata help to increase levels of adoption and use?
Abstract: Many data warehouses are currently underutilized by managers and knowledge workers. Can high-quality end-user metadata help to increase levels of adoption and use?

59 citations

Frequently Asked Questions (14)
Q1. What have the authors contributed in "Metadata Management for Data Lakes"?

In this paper, the authors propose such a system based on a generic and extensible classification of metadata. A metadata conceptual schema which considers different types (structured, semi-structured and unstructured) of raw or processed data is presented.

For this automatic extraction, the authors plan to adapt existing works to their context, such as automatic detection of relationships between datasets [1] and automatic extraction of data structures, metadata properties and semantic data [1].

Their long-term goal is to accomplish a metadata management system which integrates automatic extraction of data, effective searches of metadata, and automatic generation of dashboards or other analyses.

DLs facilitate different types of analysis such as machine learning algorithms, statistics, data visualisation... (unlike Data Warehouses (DW) [16]). 

DLs allow ingesting raw data from various sources, storing data in their native format, processing data upon usage, ensuring the availability of data, providing access to data scientists, analysts and BI professionals, and governing data to ensure data quality, security and the data life cycle.

A data lake that contains a great amount of structured, semi-structured and unstructured data without explicit schema or description can easily turn into a data swamp which is invisible, inaccessible and unreliable to users [18].

Intra-metadata are classified into data characteristics, definitional, navigational, activity, lineage, rating and assessment [2,6,19]. 

Metadata can help users find data that correspond to their needs, accelerate data accesses, verify data origin and processing history to gain confidence and find relevant data to enrich their analyses [1,14]. 

The authors have chosen to implement a relational database and a graph database for the following reasons: relational databases have a standard query language (SQL) and a high security level ensured by many RDBMSs (Relational Database Management Systems); graph databases ensure scalability and flexibility.

Considering the diversity of data structural types and the different processes applied to datasets, their solution is based on intra- and inter-metadata.

The aim of this project is to ingest (i) all the internal relational databases and medical e-documents (including scans of handwritten medical reports), (ii) external data coming from other French UHCs and (iii) some public medical data.

Regarding metadata management systems, metadata are stored in key-value stores [9], XML documents [15,17], relational databases [17] or through an ontology [1].

The goal of this paper is to propose a metadata management system dedicated to data lakes and applied to the whole life-cycle (multiple zones) of data.

What’s more, to validate the conceptual schema, the authors implemented the metadata management system in both a relational DBMS and a graph DBMS at the UHC of Toulouse.