scispace - formally typeset
Open AccessJournal ArticleDOI

A comparative analysis of methodologies for database schema integration

TLDR
The aim of the paper is to provide first a unifying framework for the problem of schema integration, then a comparative review of the work done thus far in this area, providing a basis for identifying strengths and weaknesses of individual methodologies, as well as general guidelines for future improvements and extensions.
Abstract
One of the fundamental principles of the database approach is that a database allows a nonredundant, unified representation of all data managed in an organization. This is achieved only when methodologies are available to support integration across organizational and application boundaries.Methodologies for database design usually perform the design activity by separately producing several schemas, representing parts of the application, which are subsequently merged. Database schema integration is the activity of integrating the schemas of existing or proposed databases into a global, unified schema.The aim of the paper is to provide first a unifying framework for the problem of schema integration, then a comparative review of the work done thus far in this area. Such a framework, with the associated analysis of the existing approaches, provides a basis for identifying strengths and weaknesses of individual methodologies, as well as general guidelines for future improvements and extensions.

read more

Content maybe subject to copyright    Report

A Comparative Analysis of Methodologies
for Database Schema Integration
C. BATINI and M. LENZERINI
Dipartimento di Informutica e Sistemistica, University of Rome, Rome, Italy
S. B. NAVATHE
Database Systems Research and Development Center, Computer and Information Sciences Department,
University of Florida, Gainesville, Florida 32601
One of the fundamental principles of the database approach is that a database allows a
nonredundant, unified representation of all data managed in an organization. This is
achieved only when methodologies are available to support integration across
organizational and application boundaries.
Methodologies for database design usually perform the design activity by separately
producing several schemas, representing parts of the application, which are subsequently
merged. Database schema integration is the activity of integrating the schemas of existing
or proposed databases into a global, unified schema.
The aim of the paper is to provide first a unifying framework for the problem of
schema integration, then a comparative review of the work done thus far in this area.
Such a framework, with the associated analysis of the existing approaches, provides a
basis for identifying strengths and weaknesses of individual methodologies, as well as
general guidelines for future improvements and extensions.
Categories and Subject Descriptors: H.0 [Information Systems]: General; H.2.1
[Database Management]: Data Models; Schema and Subschema; H.2.5: [Database
Management]: Heterogeneous Databases; D.2.1: [Requirements/Specifications]:
Methodologies
General Terms: Management
Additional Key Words and Phrases: Conceptual database design, database integration,
database schema integration, information systems design, models, view integration
INTRODUCTION
1.1 Schema Integration
Database management systems (DBMSs)
have been developed in the past two dec-
ades using various data models and archi-
tectures. The primary data models used for
implementation are the hierarchical, net-
work, and relational data models. More re-
cently, several so-called semantic data
models, significantly more powerful than
primary data models in representing the
application of interest, have been proposed
(e.g., see Smith’s abstraction hierarchy
model [Smith and Smith 19771, the Seman-
tic Data Model [Hammer and McLeod
19811, the TAXIS data model [Mylopoulos
et al. 19801, DAPLEX [Shipman 19801, and
recently, the Galileo data model [Albano et
al. 19851).
Authors in alphabetical order.
Permission to copy without fee all or part of this material is granted provided that the copies are not made or
distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its
date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To
copy otherwise, or to republish, requires a fee and/or specific permission.
0 1987 ACM 0360-0300/86/1200-0323 $1.50
ACM Computing Surveys, Vol. 18, No. 4, December 1986

324
l
C. Batini, M. Lenzerini, and S. B. Navathe
CONTENTS
traditional
“data-processing-using-files”
approach is that a database management
system makes it possible to define an inte-
grated view of relevant data for all appli-
cations. This eliminates duplication, avoids
problems of multiple updates, and mini-
mizes inconsistencies across applications.
Whereas the above claims of the database
approach are highly touted, database text-
books and survey literature to date have
paid scant attention to this topic. At the
same time, research on the problem of
integration has proceeded, and most
of the researchers have suggested perform-
ing the integration activity as a part of
the conceptual design step. In this paper
we refer to the integration activity by a
generic term, schema integration, which is
defined as the activity of integrating the
schemas of existing or proposed databases
into a global, unified schema. Schema
integration,
as defined here, occurs in
two contexts:
INTRODUCTION
I.1 Schema Integration
1.2 View Integration in Database Design
1.3 Database Integration
I.4 Organisational Context for Integration
1.5 Structure of the Paper
1. METHODOLOGIES FOR SCHEMA
INTEGRATION
1.1 An example
1.2 Causes for Schema Diversity
1.3 Steps and Goals
of the Integration Process
1.4 Influence of the Conceptual Model
on the Integration Process
2. A COMPARISON OF METHODOLOGIES
2.1 Introduction
2.2 Applicability of Integration Methodologies
2.3 Methodologies Viewed as Black Boxes
2.4 Gross Architecture of Methodologies
2.5 Preintegration
2.6 Comparison of Schema8
2.7 Conforming of Schemas
2.8 Merging and Restructuring
3. CONCLUSIONS AND FUTURE WORK
3.1 General Remarks
3.2 Missing Aspects
3.3 Future Research Directions
APPENDIX 1. A SUMMARY DESCRIPTION
OF METHODOLOGIES
APPENDIX 2. THE ENTITY-RELATIONSHIP
MODEL
ACKNOWLEDGMENTS
REFERENCES
Since semantic models allow data to be
described in a very abstract and under-
standable manner, they are currently used
in designing the conceptual structure of
databases. This conceptual activity is called
conceptual database design. Its goal is to
produce an abstract, global view of the data
of the application,
called conceptual
schema.
The introduction of a conceptual step in
design methodologies is a fairly recent de-
velopment. It allows designers and users to
cooperate in collecting requirements and
provides a high-level specification of the
data involved in the application. Further-
more, it simplifies the integration of differ-
ing perspectives and expectations that
various users have of the application.
One of the basic motivations for using
the database approach instead of the
(1) View integration (in database design)
produces a global conceptual descrip-
tion of a proposed database.
(2) Database integration
(in distributed
database management) produces the
global schema of a collection of data-
bases. This global schema is a virtual
view of all databases taken together in
a distributed database environment.
The database technology has progressed
to a level where thousands of organizations
are using databases for their day-to-day,
tactical, and strategic management appli-
cations. The distributed database manage-
ment area is also becoming sufficiently well
understood, and we expect to see a large
number of organizations changing to dis-
tributed databases by integrating their
current diverse databases.
The contributions to the state of the art
of database design methodologies, and in
particular schema integration, have been
particularly significant in the last ten years.
Our goal is to provide first a framework by
which the problem of schema integration
can be better understood, and second a
comparative review of the work done thus
far on this problem. Such a framework with
an associated analysis of the prevalent
ACM Computing Surveys, Vol. 18, No. 4, December 1986

Comparison of Methodologies for Database Schema Integration
l
325
approaches provides
(1) a conceptual foundation to the problem
of schema integration;
(2) a basis upon which to identify strengths
and weaknesses and the missing fea-
tures about individual methodologies;
(3) general guidelines for future improve-
ments and extensions to the present
approaches.
In the next section we explain the view
integration activity; Section I.3 is devoted
to database integration. In Section I.4 we
elaborate on the motivation for investi-
gating integration. Finally, in Section I.5
we describe the general structure of the
remainder of the paper.
1.2 View Integration in Database Design
The problem of database design is one of
designing the structure of a database in a
given environment of users and applica-
tions such that all users’ data requirements
and all applications’ process requirements
are “best satisfied.” This problem has ex-
isted ever since DBMSs came into being.
The DBMSs that store and manipulate
a database must have a definition of the
database in the form of a schema. This is
termed the intension of the database. The
actual values of data in a database are
called instances or occurrences of data.
Sometimes they are termed the extension
of a database, or just “the database.”
Whereas the extension of a database keeps
changing with time, the intension of the
database is supposed to be time invariant.
The database design problem aims at de-
signing the intension schema of the data-
base, which includes logical specifications
such as groupings of attributes and rela-
tionships among these groupings (logical
schema), as well as physical specifications
such as the type of access to records, in-
dexes, ordering, and physical placement
(physical schema). On the basis of this dis-
tinction, the corresponding database design
activities are termed logical schema design
andphysical schema design. Logical schema
design involves the problem of designing
the conceptual schema and mapping such
a schema into the schema definition lan-
guage (or data definition language) of a
specific DBMS. Figure 1 shows the phases
of database design and the intermediate
schema representations. The phases of
database design are
(1)
(2)
(3)
(4)
Requirements Specification and Analy-
sis. An analysis of the information re-
quirements of various areas within an
organization resulting in a preliminary
specification of the information needs
of various user groups.
Conceptual Design. Modeling and rep-
resentation of users’ and applications’
views of information and possibly a
specification of the processing or use of
the information. The final result of this
activity is a conceptual schema that
represents a global, high-level descrip-
tion of the requirements.
Implementation Design. Transforming
a conceptual schema into the logical
schema of a DBMS. The second and
third phases taken together are called
logical database design.
Physical Schema Design and Optimi-
zation. Mapping the logical schema of
a database into an appropriate stored
representation in a DBMS, including
new physical parameters to optimize
the database performance against a set
of transactions.
Typically, the application design activity
proceeds in parallel with database design.
Hence, Figure 1 also shows specifications
related to applications as the outputs of the
last two phases.
As shown in Figure 1, the activity of view
integration can be performed at several
points of the database design process. It
usually is performed during conceptual de-
sign. In that case, its goal is to produce an
integrated schema starting from several ap-
plication views that have been produced
independent1y.l
‘There is a body of work that regards conceptual
design as an activity that considers the application as
a whole, thus producing a single schema. This includes
Batini et al. 119841, Biller and Neuhold 119821, Brodie
[1981], Brodie and Zilles [1981], Ceri [i983];Ceri et
al. [1981], Chen [1983], Lum et al. [1970], Olle et al.
[1982], Rolland and Richards [1982], and Sakai
[1981].
ACM Computing Surveys,
Vol. 18, No. 4, December 1986

326 .
C. Batini, M. Lenzerini, and 5’. B. Navathe
User Requirements in
an Application Domain
1
Logical
Design
1 Requirement Analysis 1
1
Specification and Representation
of Information Needs
*----v---y
of Organization and User Groups
\
1
I
c
I
Global Conceptual Schema
+,
1 1
i
*--------~
J
ktgical Schema
Functional Design
of Applications
Implemented
Application
Database Schema Programs
Figure 1.
Phases of database design. (Adapted from Navathe and Schkolnick [1978].)
(1) The structure of the database for large
annlications (orzanizations) is too com-
The reason for integration is twofold:
their own reauirements and exnecta-
tions of data, -which may conflict with
other user groups.
plex to be modeTed by a single designer
Another possibility (Figure 1) is to per-
in a single view.
form integration even before the “concep-
(2) User groups typically operate inde-
tual design” step is undertaken. In this
pendently in organizations and have
case, view integration still occurs; however,
ACM Computing Surveys, Vol. 18, No. 4, December 1986

Comparison
of
Methodologies
for
Database Schema Integration
l
327
views are less formal and are mostly in the
form of narrative descriptions of require-
ments. The last possibility shown in the
figure is to perform integration after the
implementation design step, that is, start
from schemas expressed as implementable
logical schemas. This is the approach fol-
lowed in methodologies based strictly on
the relational model (see Al-Fedaghi and
Scheuermann [ 19811 and Casanova and Vi-
da1 [1983]) that do not advocate a concep-
tual step and model requirements directly
in terms of the relational model.
1.3 Database Integration
Database integration
is a relatively recent
problem that has appeared in the context
of distributed databases. A distributed da-
tabase is a collection of data that logically
belong to the same system but are spread
over the sites of a computer network [Ceri
and Pelagatti 19841. Distributed databases
and distributed database management sys-
tems can be classified into two major cate-
gories:
homogeneous,
dealing with local
databases having the same data model
and identical DBMSs, and
heterogeneous,
having a diversity in data models and
DBMSs. The term
Federated Database
is
used (e.g., in McLeod and Heimbigner
[1980]) to refer to a collection of databases
in which the sharing is made more explicit
by allowing
export schemas,
which define
the sharable part of each local database.
Each application is able to design its own
global schema by integrating the export
schemas.
The above contexts require that an in-
tegrated global schema be designed from
the local schemas, which refer to existing
databases. This too can be considered a
database design activity. Existing work on
database integration included in our survey
implicitly addresses this problem. The au-
thors of these works [Dayal and Hwang
1984; ElMasri et al. 1987; Mannino and
Effelsberg 1984a; Motro and Buneman
19811 use a semantic data model as an
intermediate model to facilitate the inte-
gration. Another implicit assumption they
make is that the heterogeneous database
management system is able to map the
requests of users-retrievals as well as up-
dates-from such a semantic data model
into the actual databases.
The database integration activity is de-
scribed in a general way in Figure 2. It
shows that this activity has as input the
local schemas and the local queries and
transactions. Most existing work, however,
does not explicitly take into account the
latter process-oriented information in de-
veloping the integrated schema. It is
strictly used in mapping the queries (query
mapping) between the global and the local
levels. Hence, we show the global schema
as well as the data and query-mapping spec-
ifications to be the outputs of the database
integration activity.
1.4 Organizational Context for Integration
Thus far we have pointed out how schema
integration arises in database design. As we
survey the work on schema integration, it
is worthwhile to point out an organizational
context for this important area.
There is a growing trend to regard data
as an autonomous resource of the organi-
zation, independent of the functions cur-
rently in use in the organization [National
Bureau of Standards 19821. There is a need
to capture the meaning of data for the
whole organization in order to manage it
effectively. Because of this awareness, in-
tegration of data has become an area of
growing interest in recent years.
One of the fundamental principles of the
database approach is that a database allows
a nonredundant, unified representation of
all data managed in an organization. This
is true only when methodologies are avail-
able to support integration across organi-
zational and application boundaries. More
and more organizations are becoming aware
of the potential of database systems and
wish to use them for integrated applications
and not just as software for fast retrieval
and updating of data.
Even when applications and user groups
are structurally disconnected, as in most
governmental and large administrative set-
ups, there is something to be gained by
having an enterprise-wide view of the data
resource. This potentially affords individ-
ACM Computing Surveys, Vol. 18, No. 4, December 1986

Citations
More filters
Journal ArticleDOI

A survey of approaches to automatic schema matching

TL;DR: A taxonomy is presented that distinguishes between schema-level and instance-level, element- level and structure- level, and language-based and constraint-based matchers and is intended to be useful when comparing different approaches to schema matching, when developing a new match algorithm, and when implementing a schema matching component.
Journal ArticleDOI

Federated database systems for managing distributed, heterogeneous, and autonomous databases

TL;DR: In this paper, the authors define a reference architecture for distributed database management systems from system and schema viewpoints and show how various FDBS architectures can be developed, and define a methodology for developing one of the popular architectures of an FDBS.
Journal ArticleDOI

Data fusion

TL;DR: This article places data fusion into the greater context of data integration, precisely defines the goals of data fusion, namely, complete, concise, and consistent data, and highlights the challenges of data Fusion.
Journal Article

Data Cleaning: Problems and Current Approaches.

TL;DR: This work classifies data quality problems that are addressed by data cleaning and provides an overview of the main solution approaches and discusses current tool support for data cleaning.
Journal ArticleDOI

Query evaluation techniques for large databases

TL;DR: This survey describes a wide array of practical query evaluation techniques for both relational and postrelational database systems, including iterative execution of complex query evaluation plans, the duality of sort- and hash-based set-matching algorithms, types of parallel query execution and their implementation, and special operators for emerging database application domains.
References
More filters
Book

The entity-relationship model: toward a unified view of data

TL;DR: A data model, called the entity-relationship model, is proposed that incorporates some of the important semantic information about the real world and can be used as a basis for unification of different views of data: the network model, the relational model, and the entity set model.

Principles of Database Systems

TL;DR: A large part of as mentioned in this paper is a description of relations, their algebra and calculus, and query languages that have been designed using these concepts, and explanations of how the theory can be used to design good systems.
Book

The theory of relational databases

David Maier
TL;DR: In this article, the graphite moderator is replaced by a gaseous mixture of carbon dioxide and helium, in which the helium comprises 40-60 volume percent of the mixture.
Journal ArticleDOI

Database abstractions: aggregation and generalization

TL;DR: A triggering mechanism for automatically maintaining these invariants during update operations is proposed, and a simple mapping of aggregation/generalization hierarchies onto owner-coupled set structures is given.
Journal ArticleDOI

The functional data model and the data languages DAPLEX

TL;DR: This paper presents and motivates the DAPLEX language and the underlying data model on which it is based.
Related Papers (5)
Frequently Asked Questions (1)
Q1. What are the contributions mentioned in the paper "A comparative analysis of methodologies for database schema integration" ?

The aim of the paper is to provide first a unifying framework for the problem of schema integration, then a comparative review of the work done thus far in this area. Such a framework, with the associated analysis of the existing approaches, provides a basis for identifying strengths and weaknesses of individual methodologies, as well as general guidelines for future improvements and extensions.