Metadata Management for Data Lakes
Summary (2 min read)
1 Introduction
- To prevent data lakes from turning into data swamps, metadata management is essential [1, 8, 20].
- Nevertheless, many papers focus on a single zone (especially the ingestion zone) or on a single data type of data lakes.
- In the third section, the authors propose their metadata conceptual schema with a classification.
3.1 Metadata Classification
- The authors' metadata classification has the advantage of integrating both intra-metadata and inter-metadata for all datasets.
- Logical clusters signify that some datasets belong to the same domain.
- Lineage metadata consist of the original source of datasets and the processing history.
- Security metadata consist of data sensitivity and access level.
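The classification above can be sketched as a small data model. This is only an illustration under stated assumptions, not the authors' schema: the class and field names (`Lineage`, `Security`, `IntraMetadata`, `MetadataCatalog`) and the dataset names in the usage note are hypothetical. Only the ideas that lineage covers the original source and processing history, that security covers sensitivity and access level, and that logical clusters group datasets of the same domain come from the text.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class Lineage:
    """Intra-metadata: original source of a dataset and its processing history."""
    original_source: str
    processing_history: List[str] = field(default_factory=list)


@dataclass
class Security:
    """Intra-metadata: data sensitivity and access level."""
    sensitivity: str
    access_level: str


@dataclass
class IntraMetadata:
    """Metadata that describes one dataset on its own (hypothetical layout)."""
    name: str
    lineage: Lineage
    security: Security


@dataclass
class MetadataCatalog:
    """Holds intra-metadata per dataset plus one kind of inter-metadata."""
    datasets: Dict[str, IntraMetadata] = field(default_factory=dict)
    # Inter-metadata: cluster name -> names of datasets in the same domain.
    logical_clusters: Dict[str, Set[str]] = field(default_factory=dict)

    def add(self, meta: IntraMetadata, cluster: Optional[str] = None) -> None:
        """Register a dataset and, optionally, put it in a logical cluster."""
        self.datasets[meta.name] = meta
        if cluster is not None:
            self.logical_clusters.setdefault(cluster, set()).add(meta.name)

    def same_domain(self, name: str) -> Set[str]:
        """Return the other datasets that share a logical cluster with `name`."""
        related: Set[str] = set()
        for members in self.logical_clusters.values():
            if name in members:
                related |= members
        related.discard(name)
        return related
```

For instance, after adding two made-up datasets `patients_raw` and `patients_clean` to a cluster named `medical`, `same_domain("patients_raw")` returns a set containing only `patients_clean`.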
3.2 Metadata Conceptual Schema
- From the functional architecture point of view [5, 11, 13, 20], a data lake contains four essential zones.
- A raw data zone ingests data without processing and stores raw data in their native format.
- A governance zone is in charge of ensuring data quality, security and data life-cycle.
- The authors' metadata classification is applied to the multi-zone functional architecture of data lakes (see Fig. 1).
- Information on each single dataset and relationships between different datasets are stored.
4 Metadata Implementation
- The University Hospital Center (UHC) of Toulouse is the largest hospital center in the south of France.
- All medical, financial and administrative data are stored in the information system of this center.
- The UHC of Toulouse plans to launch a data lake through an iterative process.
- The objective of this data lake is to combine these different data sources in order to allow data analysts and BI professionals to analyse available data to improve medical treatments.
- Regarding metadata management systems, existing approaches store metadata as key-value pairs [9], in XML documents [15, 17], in relational databases [17] or in ontologies [1].
4.1 Relational Database
- After the implementation, the authors collected the needs of data scientists from a metadata point of view.
- The first questions were about data trust and data lineage analysis.
- To validate their proposal, the authors have written several queries to compare the feasibility and usability of different environments.
- In the following paragraphs, you will find two examples.
- The first example finds the original dataset of 'COMEDIMS' (see Fig. 4 (a)).
- Besides finding the origin of one dataset, users may also want to find relevant datasets that share the same origin as that dataset.
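The two lineage questions above can be illustrated with a relational sketch. The queries of Fig. 4 are not reproduced in this summary, so everything below is an assumption made for illustration: the `dataset` and `derivation` tables, the intermediate dataset names and the use of SQLite are hypothetical, and only the dataset name 'COMEDIMS' comes from the text. The sketch relies on recursive common table expressions to walk the processing history.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dataset (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
-- One row per processing step: the target was derived from the source.
CREATE TABLE derivation (
    source_id INTEGER NOT NULL REFERENCES dataset(id),
    target_id INTEGER NOT NULL REFERENCES dataset(id)
);
INSERT INTO dataset (id, name) VALUES
    (1, 'raw_export'), (2, 'cleaned'), (3, 'COMEDIMS'), (4, 'other_report');
INSERT INTO derivation (source_id, target_id) VALUES (1, 2), (2, 3), (2, 4);
""")

# Walk derivations backwards, then keep ancestors that have no source themselves.
ORIGIN_SQL = """
WITH RECURSIVE ancestors(id) AS (
    SELECT id FROM dataset WHERE name = :name
    UNION
    SELECT d.source_id FROM derivation d JOIN ancestors a ON d.target_id = a.id
)
SELECT ds.name FROM ancestors a JOIN dataset ds ON ds.id = a.id
WHERE NOT EXISTS (SELECT 1 FROM derivation d WHERE d.target_id = a.id)
ORDER BY ds.name
"""

# Walk derivations forwards from an origin to list everything built on it.
RELATED_SQL = """
WITH RECURSIVE descendants(id) AS (
    SELECT id FROM dataset WHERE name = :origin
    UNION
    SELECT d.target_id FROM derivation d JOIN descendants x ON d.source_id = x.id
)
SELECT ds.name FROM descendants x JOIN dataset ds ON ds.id = x.id
WHERE ds.name NOT IN (:origin, :name)
ORDER BY ds.name
"""


def find_origins(name):
    """Return the names of the original dataset(s) that `name` derives from."""
    return [row[0] for row in conn.execute(ORIGIN_SQL, {"name": name})]


def find_related(name):
    """Return other datasets that share an origin with `name`."""
    related = set()
    for origin in find_origins(name):
        rows = conn.execute(RELATED_SQL, {"origin": origin, "name": name})
        related.update(row[0] for row in rows)
    return sorted(related)
```

With the made-up rows above, `find_origins("COMEDIMS")` returns `["raw_export"]` and `find_related("COMEDIMS")` returns `["cleaned", "other_report"]`.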
4.2 Graph Database
- The second implementation solution is a graph database.
- In addition, two queries that answer the same questions as in the previous subsection are executed.
- The authors extended the mapping from UML class diagrams to property graphs proposed by [3] to the Neo4j Cypher query language.
- To test the implementation, the authors also answered the same questions as for the relational database.
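The same two lineage questions can be sketched as graph traversals. The authors work with Neo4j and Cypher; the stand-in below uses plain Python instead, so it shows only the traversal idea, not their implementation. The edge type `DERIVED_INTO` and every dataset name except 'COMEDIMS' are hypothetical.

```python
from collections import deque

# A tiny property-graph stand-in: edges are (source, type, target) triples.
EDGES = [
    ("raw_export", "DERIVED_INTO", "cleaned"),
    ("cleaned", "DERIVED_INTO", "COMEDIMS"),
    ("cleaned", "DERIVED_INTO", "other_report"),
]


def neighbours(node, reverse=False):
    """Follow DERIVED_INTO edges forwards, or backwards when reverse=True."""
    if reverse:
        return [s for s, _, t in EDGES if t == node]
    return [t for s, _, t in EDGES if s == node]


def reachable(start, reverse=False):
    """Breadth-first search returning every node reachable from `start`."""
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for nxt in neighbours(current, reverse):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def find_origins(name):
    """Ancestors of `name` (or `name` itself) that have no incoming edge."""
    ancestors = reachable(name, reverse=True)
    return sorted(n for n in ancestors if not neighbours(n, reverse=True))


def find_related(name):
    """Other datasets reachable from any origin of `name`."""
    origins = find_origins(name)
    related = set()
    for origin in origins:
        related |= reachable(origin)
    return sorted(related - {name} - set(origins))
```

With these made-up edges, `find_origins("COMEDIMS")` returns `["raw_export"]` and `find_related("COMEDIMS")` returns `["cleaned", "other_report"]`. In a graph database the breadth-first search would typically be expressed as a variable-length path pattern rather than written by hand.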
Citations
68 citations
Cites background from "Metadata Management for Data Lakes"
...5 (Ravat and Zhao 2019a))....
[...]
...Oram’s metadata classification is the most cited, especially in the industrial literature [9,28,46,47], presumably because it is inspired from metadata categories from data warehouses [45]....
[...]
...Existing reviews on data lake architectures commonly distinguish pond and zone architectures (Giebler et al. 2019; Ravat and Zhao 2019a)....
[...]
...Such architectures indeed generally differ in the number and characteristics of zones (Giebler et al. 2019), e.g., some architectures include a transient zone (LaPlante and Sharma 2016; Tharrington 2017; Zikopoulos et al. 2015) while others do not (Hai et al. 2016; Ravat and Zhao 2019a)....
[...]
...Oram’s metadata classification is the most cited, especially in the industrial literature (Diamantini et al. 2018; LaPlante and Sharma 2016; Ravat and Zhao 2019b; Russom 2017), presumably because it is inspired from metadata categories from data warehouses (Ravat and Zhao 2019a)....
[...]
58 citations
Cites methods from "Metadata Management for Data Lakes"
...– For inter-metadata [28], we propose to integrate Dataset containment which means a dataset is contained in another dataset....
[...]
...– For intra-metadata [28], we retain data characteristics, definitional, navigational and lineage metadata proposed in [2] and add the access, quality and security metadata....
[...]
44 citations
11 citations
Cites background or methods from "Metadata Management for Data Lakes"
..., [2,16,18], are not sufficiently generic as they cannot support every potential metadata management use case....
[...]
...Ravat and Zhao’s model is partially checked as they support adding keywords describing their data elements, which does not however, suffice for modeling, e.g., an actor accessing the data....
[...]
...More general models also created for data lakes include those by Ravat and Zhao [16], Diamantini et al. [2], and lastly, Sawadogo et al.’s model MEtadata for DAta Lakes (MEDAL) [18]....
[...]
...Set of metadata categorizations, the first five belong to the selected metadata models [18,15,10,16,2]....
[...]
...Ravat and Zhao mention dataset containment, but it is not clear whether this can be used to implement the granularity topic....
[...]
10 citations
References
550 citations
"Metadata Management for Data Lakes" refers background in this paper
...• Quality metadata consist of data consistency and completeness [10] to ensure datasets’ reliability....
[...]
188 citations
"Metadata Management for Data Lakes" refers background or methods in this paper
...The concept of Data Lake (DL) was created by Dixon [4] and extended by various authors [5,8,20]....
[...]
...To prevent data lakes from turning into data swamps, metadata management is essential [1,8,20]....
[...]
139 citations
"Metadata Management for Data Lakes" refers background or methods in this paper
...The main characteristic of DL is ’schema-on-read’ [5], data are only processed upon usage....
[...]
...From the functional architecture point of view [5,11,13,20], a data lake contains four essential zones....
[...]
...The concept of Data Lake (DL) was created by Dixon [4] and extended by various authors [5,8,20]....
[...]
...Because in data lakes, datasets can be processed and analysed by different users [5], a dataset that makes no sense to BI professionals can be of great value to data scientists....
[...]
80 citations
"Metadata Management for Data Lakes" refers background or methods in this paper
...From the functional architecture point of view [5,11,13,20], a data lake contains four essential zones....
[...]
...The concept of Data Lake (DL) was created by Dixon [4] and extended by various authors [5,8,20]....
[...]
...To prevent data lakes from turning into data swamps, metadata management is essential [1,8,20]....
[...]
59 citations
Frequently Asked Questions (14)
Q2. What future works have the authors mentioned in the paper "Metadata management for data lakes" ?
For this automatic extraction, the authors plan to adapt existing works to their context, such as automatic detection of relationships between datasets [1] and automatic extraction of data structure, metadata properties and semantic data [1].
Q3. What is the long-term goal of the project?
Their long-term goal is a metadata management system that integrates automatic extraction of data, effective search of metadata, and automatic generation of dashboards or other analyses.
Q4. What are the main characteristics of DLs?
DLs facilitate different types of analysis such as machine learning algorithms, statistics, data visualisation... (unlike Data Warehouses (DW) [16]).
Q5. What is the purpose of the paper?
A DL ingests raw data from various sources, stores data in their native format, processes data upon usage, ensures the availability of data, provides access to data scientists, analysts and BI professionals, and governs data to ensure data quality, security and data life-cycle.
Q6. What is the main characteristic of a data lake?
A data lake that contains a great amount of structured, semi-structured and unstructured data without explicit schema or description can easily turn into a data swamp, which is invisible, inaccessible and unreliable to users [18].
Q7. What are the main categories of intra-metadata?
Intra-metadata are classified into data characteristics, definitional, navigational, activity, lineage, rating and assessment [2,6,19].
Q8. What is the main purpose of the paper?
Metadata can help users find data that correspond to their needs, accelerate data accesses, verify data origin and processing history to gain confidence and find relevant data to enrich their analyses [1,14].
Q9. Why did the authors choose to implement a relational database and a graph database?
The authors have chosen to implement a relational database and a graph database for the following reasons: relational databases have a standard query language (SQL) and a high security level ensured by many RDBMSs (Relational Database Management Systems); graph databases ensure scalability and flexibility.
Q10. What is the purpose of this paper?
Considering the diversity of data structural types and the different processes applied to datasets, their solution is based on intra- and inter-metadata.
Q11. What is the aim of this project?
The aim of this project is to ingest (i) all the internal relational databases and medical e-documents (including scans of handwritten medical reports), (ii) external data coming from other French UHCs and (iii) some public medical data.
Q12. How do existing metadata management systems store metadata?
Regarding metadata management systems, existing approaches store metadata as key-value pairs [9], in XML documents [15, 17], in relational databases [17] or in ontologies [1].
Q13. What is the goal of this paper?
The goal of this paper is to propose a metadata management system dedicated to data lakes and applied to the whole life-cycle (multiple zones) of data.
Q14. What is the conceptual schema for data lakes?
Moreover, to validate the conceptual schema, the authors implemented the metadata management system at the UHC of Toulouse on both a graph DBMS and a relational DBMS.