Book Chapter · DOI

Metadata Management for Data Lakes

08 Sep 2019 · pp 37-44
TL;DR: A metadata conceptual schema which considers different types (structured, semi-structured and unstructured) of raw or processed data is presented and is implemented in two DBMSs to validate the proposal.
Abstract: To prevent data lakes from being invisible and inaccessible to users, an efficient metadata management system is necessary. In this paper, we propose such a system based on a generic and extensible classification of metadata. A metadata conceptual schema which considers different types (structured, semi-structured and unstructured) of raw or processed data is presented. This schema is implemented in two DBMSs (relational and graph) to validate our proposal.

Summary (2 min read)

1 Introduction

  • To prevent data lakes from turning into data swamps, metadata management is essential [1, 8, 20].
  • Nevertheless, many papers focus on a single zone of data lakes (especially the ingestion zone) or a single data type.
  • In the third section, the authors propose their metadata conceptual schema with a classification.

3.1 Metadata Classification

  • The authors' metadata classification has the advantage of integrating both intra-metadata and inter-metadata for all datasets.
  • Logical cluster signifies that some datasets belong to the same domain.
  • Lineage metadata consist of the original source of datasets and the processing history.
  • Security metadata consist of data sensitivity and access level.

3.2 Metadata Conceptual Schema

  • From the functional architecture point of view [5, 11, 13, 20], a data lake contains four essential zones.
  • A raw data zone ingests data without processing and stores raw data in their native format.
  • A governance zone is in charge of ensuring data quality, security and the data life-cycle.
  • The authors' metadata classification is applied to the multi-zone functional architecture of data lakes (see Fig. 1).
  • Information on each single dataset and relationships between different datasets are stored.

4 Metadata Implementation

  • The University Hospital Center (UHC) of Toulouse is the largest hospital center in the south of France.
  • All medical, financial and administrative data are stored in the information system of this center.
  • The UHC of Toulouse plans to launch a data lake through an iterative process.
  • The objective of this data lake is to combine these different data sources in order to allow data analysts and BI professionals to analyse available data to improve medical treatments.
  • Regarding metadata management systems, metadata are stored in key-value stores [9], XML documents [15, 17], relational databases [17] or through an ontology [1].

4.1 Relational Database

  • After the implementation, the authors collected the needs of data scientists from a metadata point of view.
  • The first questions were about data trust and data lineage analysis.
  • To validate their proposal, the authors wrote several queries to compare the feasibility and usability of the different environments.
  • Two examples follow; a sketch of both queries appears after this list.
  • (i) One example finds the original dataset of 'COMEDIMS' (see Fig. 4(a)). (ii) Besides finding the origin of one dataset, users may also want to find relevant datasets that come from the same origin.
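As an illustration of these two questions, here is a minimal runnable sketch of both lineage queries in a relational setting, using Python's built-in sqlite3 module. The tables dataset(id, name) and lineage(source_id, target_id), and every dataset name except 'COMEDIMS', are our own illustrative assumptions; the paper does not publish its exact relational schema.

```python
import sqlite3

# Toy lineage store. Hypothetical schema: dataset(id, name) and
# lineage(source_id, target_id); 'RAW_PHARMACY' and 'DRUG_STATS'
# are invented names, only 'COMEDIMS' comes from the paper.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dataset (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE lineage (source_id INTEGER, target_id INTEGER);
    INSERT INTO dataset VALUES (1,'RAW_PHARMACY'),(2,'COMEDIMS'),(3,'DRUG_STATS');
    INSERT INTO lineage VALUES (1, 2), (1, 3);
""")

# (i) Origin of 'COMEDIMS': walk provenance edges upwards with a
# recursive CTE and keep only roots (datasets that are no target).
origin_sql = """
WITH RECURSIVE ancestors(id) AS (
    SELECT source_id FROM lineage
    WHERE target_id = (SELECT id FROM dataset WHERE name = ?)
    UNION
    SELECT l.source_id FROM lineage l JOIN ancestors a ON l.target_id = a.id
)
SELECT d.name FROM dataset d JOIN ancestors a ON d.id = a.id
WHERE a.id NOT IN (SELECT target_id FROM lineage);
"""
print(conn.execute(origin_sql, ("COMEDIMS",)).fetchall())  # [('RAW_PHARMACY',)]

# (ii) Relevant datasets sharing the same direct origin as 'COMEDIMS'.
siblings_sql = """
SELECT d2.name
FROM lineage l1
JOIN lineage l2 ON l2.source_id = l1.source_id
JOIN dataset d2 ON d2.id = l2.target_id
WHERE l1.target_id = (SELECT id FROM dataset WHERE name = ?)
  AND d2.name <> ?;
"""
print(conn.execute(siblings_sql, ("COMEDIMS", "COMEDIMS")).fetchall())  # [('DRUG_STATS',)]
```

The recursive CTE also covers multi-step lineages, so the same query keeps working when a dataset is the result of a chain of processes.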

4.2 Graph Database

  • The second implementation uses a graph database.
  • In addition, two queries that answer the same questions as in the previous subsection are executed.
  • The authors extended the mapping from UML class diagrams to property graphs proposed by [3] to the Neo4j Cypher query language.
  • To test this implementation, the authors answered the same questions as with the relational database; a sketch in Cypher follows this list.
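For comparison, here is a minimal sketch of the same two questions in Cypher, issued through the official Neo4j Python driver. The node label Dataset, the relationship type SOURCE_OF and the connection settings are illustrative assumptions; the paper's exact UML-to-property-graph mapping is not reproduced here.

```python
from neo4j import GraphDatabase  # official Neo4j Python driver

# Hypothetical property-graph mapping: (:Dataset)-[:SOURCE_OF]->(:Dataset).
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# (i) Origin of a dataset: follow SOURCE_OF edges back to a root node,
# i.e. a Dataset with no incoming SOURCE_OF relationship.
ORIGIN = """
MATCH (root:Dataset)-[:SOURCE_OF*]->(d:Dataset {name: $name})
WHERE NOT ()-[:SOURCE_OF]->(root)
RETURN root.name AS origin
"""

# (ii) Other datasets derived from that same origin.
SIBLINGS = """
MATCH (root:Dataset)-[:SOURCE_OF*]->(d:Dataset {name: $name}),
      (root)-[:SOURCE_OF*]->(other:Dataset)
WHERE NOT ()-[:SOURCE_OF]->(root) AND other <> d
RETURN DISTINCT other.name AS sibling
"""

with driver.session() as session:
    print([r["origin"] for r in session.run(ORIGIN, name="COMEDIMS")])
    print([r["sibling"] for r in session.run(SIBLINGS, name="COMEDIMS")])
driver.close()
```

Where the relational version needs a recursive CTE, the graph version expresses the transitive walk directly with a variable-length pattern, which is the usual argument for graph databases in lineage-heavy workloads.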


Official URL
DOI: https://doi.org/10.1007/978-3-030-30278-8_5
Any correspondence concerning this service should be sent to the repository administrator: tech-oatao@listes-diff.inp-toulouse.fr
This is an author’s version published in: http://oatao.univ-toulouse.fr/25044
Open Archive Toulouse Archive Ouverte (OATAO) is an open access repository that collects the work of Toulouse researchers and makes it freely available over the web where possible.
To cite this version: Ravat, Franck and Zhao, Yan. Metadata Management for Data Lakes. (2019) In: 23rd East-European Conference on Advances in Databases and Information Systems (ADBIS 2019), 8-11 September 2019, Bled, Slovenia.

Metadata Management for Data Lakes
Franck Ravat¹ and Yan Zhao¹,²
¹ Institut de Recherche en Informatique de Toulouse, IRIT-CNRS (UMR 5505), Université Toulouse 1 Capitole, Toulouse, France
{Franck.Ravat,Yan.Zhao}@irit.fr
² Centre Hospitalier Universitaire (CHU) de Toulouse, Toulouse, France
Abstract. To prevent data lakes from being invisible and inaccessible to users, an efficient metadata management system is necessary. In this paper, we propose such a system based on a generic and extensible classification of metadata. A metadata conceptual schema which considers different types (structured, semi-structured and unstructured) of raw or processed data is presented. This schema is implemented in two DBMSs (relational and graph) to validate our proposal.
Keywords: Data lake · Metadata management · Metadata classification
1 Introduction
The concept of Data Lake (DL) was created by Dixon [4] and extended by various authors [5,8,20]. A DL allows raw data to be ingested from various sources, stored in their native format and processed upon usage; it ensures the availability of data, provides access to data scientists, analysts and BI professionals, and governs data to ensure data quality, security and the data life cycle.
DLs facilitate different types of analysis such as machine learning algorithms, statistics, data visualisation... (unlike Data Warehouses (DW) [16]). The main characteristic of DLs is ’schema-on-read’ [5]: data are only processed upon usage. Compared to DWs, which are structured data repositories dedicated to predetermined analyses, DLs have great flexibility and can avoid losing information.
However, a data lake that contains a great amount of structured, semi-structured and unstructured data without explicit schema or description can easily turn into a data swamp which is invisible, inaccessible and unreliable to users [18]. To prevent data lakes from turning into data swamps, metadata management is essential [1,8,20]. Metadata can help users find data that correspond to their needs, accelerate data access, verify data origin and processing history to gain confidence, and find relevant data to enrich their analyses [1,14].
Nevertheless, many papers focus on a single zone of data lakes (especially the ingestion zone) or a single data type. Therefore, the goal of this paper is to propose a metadata management system dedicated to data lakes and applied to the whole life-cycle (multiple zones) of data. The remainder of the paper is organized as follows:

the second section introduces related work on metadata. In the third section,
we propose our metadata conceptual schema with a classification. The fourth
section describes the implementation of a metadata management system.
2 Related Work
DL metadata, inspired by the DW classifications [6,7], are classified in two ways. A first classification includes three categories [12,14]: Technical metadata concern data type, format and structure (schema), Operational metadata concern data processing information, and Business metadata concern business objects and descriptions. A second classification includes not only the information of each dataset (intra-metadata) but also the relationships between datasets (inter-metadata). Intra-metadata are classified into data characteristics, definitional, navigational, activity, lineage, rating and assessment [2,6,19]. Inter-metadata describe relationships between datasets; they are classified into dataset containment, provenance, logical cluster and content similarity [9].
Compared to the first classification, the second one is more specific. Nevertheless, the second classification can be improved. Some sub-categories are not adapted to data lakes. For instance, the rating subcategory that concerns user preferences [19] needs to be removed because, in data lakes, datasets can be processed and analysed by different users [5]: a dataset that makes no sense to BI professionals can be of great value to data scientists. What’s more, this classification can be extended with more sub-categories. For instance, data sensitivity and accessibility also need to be controlled in data lakes.
Concerning metadata management, various solutions for data lakes have been presented with different emphases [1,9,15,17]. Across these solutions, authors mainly focused on a few points. Firstly, the detection of relationships between different datasets is always presented [1,9,15]; relationships between datasets can help users find as many relevant datasets as possible to enrich data analysis. We, however, want a metadata model that shows not only the relationships between datasets but also the information of each single dataset. Secondly, authors often focused on unstructured data (mostly textual data) [15,17] because of the difficulty of extracting information; yet in a data lake there are various types of data (images, pdf files...). Thirdly, data ingestion is the phase most considered for extracting metadata [1,9,17]. Nevertheless, the information that is produced during the process and access phases has value too [6,17].
Until now, there is no generic metadata management system that works on both structured and unstructured data for the whole data life-cycle in data lakes. The objective of this paper is to define a metadata management system that addresses these weaknesses.
3 Metadata Model
Considering the diversity of data structural types and the different processes applied to datasets, our solution is based on intra- and inter-metadata.

Fig. 1. Metadata classification
3.1 Metadata Classification
Our metadata classification has the advantage of integrating both intra-metadata and inter-metadata for all datasets. Intra-metadata allow users to understand datasets with their characteristics, meaning, quality and security level [2,19]. Inter-metadata help users find relevant datasets that can answer their requirements to make their data discovery easier [9,17].
Inter-metadata. We complete the classification of [9] and obtain five sub-categories. Dataset containment signifies that a dataset is contained in other datasets. Partial overlap signifies that some attributes with corresponding data in some datasets overlap. For instance, in a hospital, health care and billing databases contain the same attributes and data about patients, prescriptions and stays, but these two databases also contain their own specific data. Provenance signifies that one dataset is the source of another dataset. Logical cluster signifies that some datasets belong to the same domain, for example different versions or duplications of the same logical dataset. Content similarity signifies that different datasets share the same attributes.
Intra-metadata. We extend the classification of [2,19] to include access, quality and security.
Data characteristics consist of information such as the identification, name, size, structural type and creation date of datasets. This information helps users to get a general idea of a dataset.
Definitional metadata specify datasets’ meanings. In the original taxonomy, there are vocabulary and schema subcategories. We classify definitional metadata into semantic and schematic metadata. Structured and unstructured datasets can be semantically described by a text or by some keywords (vocabularies). Schematically, a structured dataset can be represented by a database schema.
Navigational metadata concern the location of datasets, for instance file paths and database connection URLs.
Lineage presents the data life-cycle. It consists of the original source of datasets and the processing history. Information on dataset sources and processing history makes datasets more reliable.

Access metadata present access information, for example the names of the users who accessed datasets and the access tools. This information helps users to find relevant datasets through the users who accessed them and to trust data through other users’ access histories.
Quality metadata consist of data consistency and completeness [10] to ensure datasets’ reliability.
Security metadata consist of data sensitivity and access level. Data lakes
store datasets from various sources. Some datasets may contain sensitive
information that can only be accessed by certain users. Security metadata
can support the verification of access. This information ensures the safety
of sensitive data.
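To make the classification concrete, the sketch below encodes the categories of Fig. 1 as plain Python types. The class and field names are our own illustration; the paper fixes only the categories, not this particular schema.

```python
from dataclasses import dataclass, field
from enum import Enum

# Inter-metadata: the five relationship types between datasets (Sect. 3.1).
class InterRelation(Enum):
    CONTAINMENT = "dataset containment"
    PARTIAL_OVERLAP = "partial overlap"
    PROVENANCE = "provenance"
    LOGICAL_CLUSTER = "logical cluster"
    CONTENT_SIMILARITY = "content similarity"

# Intra-metadata for one dataset: characteristics, definitional,
# navigational, lineage, access, quality and security information.
# Field names are illustrative, not taken from the paper.
@dataclass
class IntraMetadata:
    name: str                     # data characteristics
    size_bytes: int
    structural_type: str          # structured / semi-structured / unstructured
    semantic_description: str     # definitional (semantic)
    schema: str | None            # definitional (schematic), structured data only
    location: str                 # navigational: file path or connection URL
    source: str | None            # lineage: original source
    access_log: list[str] = field(default_factory=list)  # access history
    completeness: float = 1.0     # quality
    sensitivity: str = "public"   # security

# Example: a processed dataset plus one inter-metadata link to its source.
comedims = IntraMetadata(
    name="COMEDIMS", size_bytes=2_000_000, structural_type="structured",
    semantic_description="drug prescription committee data",
    schema="prescription(id, drug, patient, date)",
    location="jdbc:postgresql://datalake/access", source="RAW_PHARMACY",
)
link = ("RAW_PHARMACY", InterRelation.PROVENANCE, "COMEDIMS")
```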
3.2 Metadata Conceptual Schema
From the functional architecture point of view [5,11,13,20], a data lake contains four essential zones. A raw data zone ingests data without processing and stores raw data in their native format. A process zone processes raw data upon usage and provides intermediate storage areas. The access zone stores refined data and ensures data availability. Finally, a governance zone is in charge of ensuring data quality, security and the data life-cycle.
Fig. 2. Class diagram of the metadata conceptual model
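As a small illustration of how this zone architecture can surface in the metadata itself, the sketch below tags each dataset with the zone it currently lives in; the enumeration merely restates the four zones above and is not the paper's implementation.

```python
from enum import Enum

# The four essential data lake zones of Sect. 3.2 as a metadata tag.
class Zone(Enum):
    RAW = "raw data zone"           # data ingested as-is, native format
    PROCESS = "process zone"        # intermediate transformations
    ACCESS = "access zone"          # refined, analysis-ready data
    GOVERNANCE = "governance zone"  # cross-cutting quality/security rules

# A dataset typically moves RAW -> PROCESS -> ACCESS, while the
# governance zone applies to its metadata throughout the life-cycle.
dataset_zone = {"RAW_PHARMACY": Zone.RAW, "COMEDIMS": Zone.ACCESS}
```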

Citations
Journal ArticleDOI
TL;DR: This paper provides a comprehensive state of the art of the different approaches to data lake design, particularly on data lake architectures and metadata management, which are key issues in successful data lakes.
Abstract: Over the past two decades, we have witnessed an exponential increase of data production in the world. So-called big data generally come from transactional systems, and even more so from the Internet of Things and social media. They are mainly characterized by volume, velocity, variety and veracity issues. Big data-related issues strongly challenge traditional data management and analysis systems. The concept of data lake was introduced to address them. A data lake is a large, raw data repository that stores and manages all company data bearing any format. However, the data lake concept remains ambiguous or fuzzy for many researchers and practitioners, who often confuse it with the Hadoop technology. Thus, we provide in this paper a comprehensive state of the art of the different approaches to data lake design. We particularly focus on data lake architectures and metadata management, which are key issues in successful data lakes. We also discuss the pros and cons of data lakes and their design alternatives.

68 citations


Cites background from "Metadata Management for Data Lakes"

  • ...5 (Ravat and Zhao 2019a))....
  • ...Oram’s metadata classification is the most cited, especially in the industrial literature [9,28,46,47], presumably because it is inspired from metadata categories from data warehouses [45]....
  • ...Existing reviews on data lake architectures commonly distinguish pond and zone architectures (Giebler et al. 2019; Ravat and Zhao 2019a)....
  • ...Such architectures indeed generally differ in the number and characteristics of zones (Giebler et al. 2019), e.g., some architectures include a transient zone (LaPlante and Sharma 2016; Tharrington 2017; Zikopoulos et al. 2015) while others do not (Hai et al. 2016; Ravat and Zhao 2019a)....
  • ...Oram’s metadata classification is the most cited, especially in the industrial literature (Diamantini et al. 2018; LaPlante and Sharma 2016; Ravat and Zhao 2019b; Russom 2017), presumably because it is inspired from metadata categories from data warehouses (Ravat and Zhao 2019a)....

Book ChapterDOI
26 Aug 2019
TL;DR: This work studies the existing work and proposes a complete definition and a generic and extensible architecture of data lake and introduces three future research axes related to metadata management that consists of intra- and inter-metadata.
Abstract: As a relatively new concept, data lake has neither a standard definition nor an acknowledged architecture. Thus, we study the existing work and propose a complete definition and a generic and extensible architecture of data lake. What’s more, we introduce three future research axes in connection with our health-care Information Technology (IT) activities. They are related to (i) metadata management that consists of intra- and inter-metadata, (ii) a unified ecosystem for companies’ data warehouses and data lakes and (iii) data lake governance.

58 citations


Cites methods from "Metadata Management for Data Lakes"

  • ...– For inter-metadata [28], we propose to integrate Dataset containment which means a dataset is contained in another dataset....
  • ...– For intra-metadata [28], we retain data characteristics, definitional, navigational and lineage metadata proposed in [2] and add the access, quality and security metadata....


Book ChapterDOI
14 Sep 2020
TL;DR: This work presents HANDLE, a generic metadata model for data lakes, which supports the flexible integration of metadata, data lake zones, metadata on various granular levels, and any metadata categorization and enables comprehensive metadata management in data lakes.
Abstract: The substantial increase in generated data induced the development of new concepts such as the data lake. A data lake is a large storage repository designed to enable flexible extraction of the data’s value. A key aspect of exploiting data value in data lakes is the collection and management of metadata. To store and handle the metadata, a generic metadata model is required that can reflect metadata of any potential metadata management use case, e.g., data versioning or data lineage. However, an evaluation of existent metadata models yields that none so far are sufficiently generic. In this work, we present HANDLE, a generic metadata model for data lakes, which supports the flexible integration of metadata, data lake zones, metadata on various granular levels, and any metadata categorization. With these capabilities HANDLE enables comprehensive metadata management in data lakes. We show HANDLE’s feasibility through the application to an exemplary access-use-case and a prototypical implementation. A comparison with existent models yields that HANDLE can reflect the same information and provides additional capabilities needed for metadata management in data lakes.

11 citations


Cites background or methods from "Metadata Management for Data Lakes"

  • ..., [2,16,18], are not sufficiently generic as they cannot support every potential metadata management use case....
  • ...Ravat and Zhao’s model is partially checked as they support adding keywords describing their data elements, which does not however, suffice for modeling, e.g., an actor accessing the data....
  • ...More general models also created for data lakes include those by Ravat and Zhao [16], Diamantini et al. [2], and lastly, Sawadogo et al.’s model MEtadata for DAta Lakes (MEDAL) [18]....
  • ...Set of metadata categorizations, the first five belong to the selected metadata models [18,15,10,16,2]....
  • ...Ravat and Zhao mention dataset containment, but it is not clear whether this can be used to implement the granularity topic....

Journal ArticleDOI
01 Nov 2021
TL;DR: HANDLE as mentioned in this paper is a generic metadata model for data lakes that supports the acquisition of metadata on varying granular levels, any metadata categorization, including metadata that belongs to a specific data element as well as metadata that applies to a broader range of data.
Abstract: Data contains important knowledge and has the potential to provide new insights. Due to new technological developments such as the Internet of Things, data is generated in increasing volumes. In order to deal with these data volumes and extract the data’s value new concepts such as the data lake were created. The data lake is a data management platform designed to handle data at scale for analytical purposes. To prevent a data lake from becoming inoperable and turning into a data swamp, metadata management is needed. To store and handle metadata, a generic metadata model is required that can reflect metadata of any potential metadata management use case, e.g., data versioning or data lineage. However, an evaluation of existent metadata models yields that none so far are sufficiently generic as their design basis is not suited. In this work, we use a different design approach to build HANDLE, a generic metadata model for data lakes. The new metadata model supports the acquisition of metadata on varying granular levels, any metadata categorization, including the acquisition of both metadata that belongs to a specific data element as well as metadata that applies to a broader range of data. HANDLE supports the flexible integration of metadata and can reflect the same metadata in various ways according to the intended utilization. Furthermore, it is created for data lakes and therefore also supports data lake characteristics like data lake zones. With these capabilities HANDLE enables comprehensive metadata management in data lakes. HANDLE’s feasibility is shown through the application to an exemplary access-use-case and a prototypical implementation. By comparing HANDLE with existing models we demonstrate that it can provide the same information as the other models as well as adding further capabilities needed for metadata management in data lakes.

10 citations

References
Journal ArticleDOI
TL;DR: A research model is proposed to explain the acquisition intention of big data analytics mainly from the theoretical perspectives of data quality management and data usage experience and empirical investigation reveals that a firm's intention for big data Analytics can be positively affected by its competence in maintaining the quality of corporate data.

550 citations


"Metadata Management for Data Lakes" refers background in this paper

  • ...• Quality metadata consist of data consistency and completeness [10] to ensure datasets’ reliability....

Proceedings ArticleDOI
26 Jun 2016
TL;DR: Constance is a Data Lake system with sophisticated metadata management over raw data extracted from heterogeneous data sources that discovers, extracts, and summarizes the structural metadata from the data sources, and annotates data and metadata with semantic information to avoid ambiguities.
Abstract: As the challenge of our time, Big Data still has many research hassles, especially the variety of data. The high diversity of data sources often results in information silos, a collection of non-integrated data management systems with heterogeneous schemas, query languages, and APIs. Data Lake systems have been proposed as a solution to this problem, by providing a schema-less repository for raw data with a common access interface. However, just dumping all data into a data lake without any metadata management, would only lead to a 'data swamp'. To avoid this, we propose Constance, a Data Lake system with sophisticated metadata management over raw data extracted from heterogeneous data sources. Constance discovers, extracts, and summarizes the structural metadata from the data sources, and annotates data and metadata with semantic information to avoid ambiguities. With embedded query rewriting engines supporting structured data and semi-structured data, Constance provides users a unified interface for query processing and data exploration. During the demo, we will walk through each functional component of Constance. Constance will be applied to two real-life use cases in order to show attendees the importance and usefulness of our generic and extensible data lake system.

188 citations


"Metadata Management for Data Lakes" refers background or methods in this paper

  • ...The concept of Data Lake (DL) was created by Dixon [4] and extended by various authors [5,8,20]....
  • ...To prevent data lakes from turning into data swamps, metadata management is essential [1,8,20]....

Proceedings ArticleDOI
08 Jun 2015
TL;DR: The concept of a data lake is emerging as a popular way to organize and build the next generation of systems to master new big data challenges, but there are lots of concerns and questions for large enterprises to implement data lakes.
Abstract: The concept of a data lake is emerging as a popular way to organize and build the next generation of systems to master new big data challenges, but there are lots of concerns and questions for large enterprises to implement data lakes. The paper discusses the concept of data lakes and shares the author's thoughts and practices of data lakes.

139 citations


"Metadata Management for Data Lakes" refers background or methods in this paper

  • ...The main characteristic of DL is ’schema-on-read’ [5], data are only processed upon usage....
  • ...From the functional architecture point of view [5,11,13,20], a data lake contains four essential zones....
  • ...The concept of Data Lake (DL) was created by Dixon [4] and extended by various authors [5,8,20]....
  • ...Because in data lakes, datasets can be processed and analysed by different users [5], a dataset that makes no sense to BI professionals can be of great value to data scientists....

Proceedings ArticleDOI
26 Aug 2015
TL;DR: This paper presents Personal Data Lake, a unified storage facility for storing, analyzing and querying personal data, and allows third-party plugins so that the unstructured data can be analyzed and queried.
Abstract: This paper presents Personal Data Lake, a unified storage facility for storing, analyzing and querying personal data. A data lake stores data regardless of format and thus provides an intuitive way to store personal data fragments of any type. Metadata management is a central part of the lake architecture. For structured/semi-structured data fragments, metadata may contain information about the schema of the data so that the data can be transformed into queryable data objects when required. For unstructured data, enabling gravity pull means allowing third-party plugins so that the unstructured data can be analyzed and queried.

80 citations


"Metadata Management for Data Lakes" refers background or methods in this paper

  • ...From the functional architecture point of view [5,11,13,20], a data lake contains four essential zones....
  • ...The concept of Data Lake (DL) was created by Dixon [4] and extended by various authors [5,8,20]....
  • ...To prevent data lakes from turning into data swamps, metadata management is essential [1,8,20]....

Journal ArticleDOI
TL;DR: Many data warehouses are currently underutilized by managers and knowledge workers, so can high-quality end-user metadata help to increase levels of adoption and use?
Abstract: Many data warehouses are currently underutilized by managers and knowledge workers. Can high-quality end-user metadata help to increase levels of adoption and use?

59 citations

Frequently Asked Questions (14)
Q1. What have the authors contributed in "Metadata Management for Data Lakes"?

In this paper, the authors propose such a system based on a generic and extensible classification of metadata. A metadata conceptual schema which considers different types (structured, semi-structured and unstructured) of raw or processed data is presented.

For this automatic extraction, the authors plan to adapt existing works to their context, such as automatic detection of relationships between datasets [1] and automatic extraction of data structures, metadata properties and semantic data [1].

Their long-term goal is to accomplish a metadata management system which integrates automatic extraction of data, effective searches of metadata, and automatic generation of dashboards or other analyses.

DLs facilitate different types of analysis such as machine learning algorithms, statistics, data visualisation... (unlike Data Warehouses (DW) [16]). 

DLs allow ingesting raw data from various sources, storing data in their native format, processing data upon usage, ensuring the availability of data, providing access to data scientists, analysts and BI professionals, and governing data to ensure data quality, security and the data life cycle.

A data lake that contains a great amount of structured, semi-structured and unstructured data without explicit schema or description can easily turn into a data swamp which is invisible, inaccessible and unreliable to users [18].

Intra-metadata are classified into data characteristics, definitional, navigational, activity, lineage, rating and assessment [2,6,19]. 

Metadata can help users find data that correspond to their needs, accelerate data accesses, verify data origin and processing history to gain confidence and find relevant data to enrich their analyses [1,14]. 

The authors have chosen to implement a relational database and a graph database for the following reasons: relational databases have a standard query language (SQL) and a high security level ensured by many RDBMSs (Relational Database Management Systems); graph databases ensure scalability and flexibility.

Considering the diversity of data structural types and the different processes applied to datasets, their solution is based on intra- and inter-metadata.

The aim of this project is to ingest (i) all the internal relational databases and medical e-documents (including scans of handwritten medical reports), (ii) external data coming from other French UHCs and (iii) some public medical data.

Regarding metadata management systems, metadata are stored in key-value stores [9], XML documents [15,17], relational databases [17] or through an ontology [1].

The goal of this paper is to propose a metadata management system dedicated to data lakes and applied to the whole life-cycle (multiple zones) of data.

What’s more, to validate the conceptual schema, the authors implemented the metadata management system in both a relational DBMS and a graph DBMS at the UHC of Toulouse.