What are the contributions mentioned in the paper "A comparative analysis of methodologies for database schema integration" ?

(Open Access) A comparative analysis of methodologies for database schema integration (1986) | Carlo Batini

A Comparative Analysis of Methodologies

for Database Schema Integration

C. BATINI and M. LENZERINI

Dipartimento di Informutica e Sistemistica, University of Rome, Rome, Italy

S. B. NAVATHE

Database Systems Research and Development Center, Computer and Information Sciences Department,

University of Florida, Gainesville, Florida 32601

One of the fundamental principles of the database approach is that a database allows a

nonredundant, unified representation of all data managed in an organization. This is

achieved only when methodologies are available to support integration across

organizational and application boundaries.

Methodologies for database design usually perform the design activity by separately

producing several schemas, representing parts of the application, which are subsequently

merged. Database schema integration is the activity of integrating the schemas of existing

or proposed databases into a global, unified schema.

The aim of the paper is to provide first a unifying framework for the problem of

schema integration, then a comparative review of the work done thus far in this area.

Such a framework, with the associated analysis of the existing approaches, provides a

basis for identifying strengths and weaknesses of individual methodologies, as well as

general guidelines for future improvements and extensions.

Categories and Subject Descriptors: H.0 [Information Systems]: General; H.2.1

[Database Management]: Data Models; Schema and Subschema; H.2.5: [Database

Management]: Heterogeneous Databases; D.2.1: [Requirements/Specifications]:

Methodologies

General Terms: Management

Additional Key Words and Phrases: Conceptual database design, database integration,

database schema integration, information systems design, models, view integration

INTRODUCTION

1.1 Schema Integration

Database management systems (DBMSs)

have been developed in the past two dec-

ades using various data models and archi-

tectures. The primary data models used for

implementation are the hierarchical, net-

work, and relational data models. More re-

cently, several so-called semantic data

models, significantly more powerful than

primary data models in representing the

application of interest, have been proposed

(e.g., see Smith’s abstraction hierarchy

model [Smith and Smith 19771, the Seman-

tic Data Model [Hammer and McLeod

19811, the TAXIS data model [Mylopoulos

et al. 19801, DAPLEX [Shipman 19801, and

recently, the Galileo data model [Albano et

al. 19851).

Authors in alphabetical order.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or

distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its

date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To

copy otherwise, or to republish, requires a fee and/or specific permission.

0 1987 ACM 0360-0300/86/1200-0323 $1.50

ACM Computing Surveys, Vol. 18, No. 4, December 1986

324

C. Batini, M. Lenzerini, and S. B. Navathe

CONTENTS

traditional

“data-processing-using-files”

approach is that a database management

system makes it possible to define an inte-

grated view of relevant data for all appli-

cations. This eliminates duplication, avoids

problems of multiple updates, and mini-

mizes inconsistencies across applications.

Whereas the above claims of the database

approach are highly touted, database text-

books and survey literature to date have

paid scant attention to this topic. At the

same time, research on the problem of

integration has proceeded, and most

of the researchers have suggested perform-

ing the integration activity as a part of

the conceptual design step. In this paper

we refer to the integration activity by a

generic term, schema integration, which is

defined as the activity of integrating the

schemas of existing or proposed databases

into a global, unified schema. Schema

integration,

as defined here, occurs in

two contexts:

INTRODUCTION

I.1 Schema Integration

1.2 View Integration in Database Design

1.3 Database Integration

I.4 Organisational Context for Integration

1.5 Structure of the Paper

1. METHODOLOGIES FOR SCHEMA

INTEGRATION

1.1 An example

1.2 Causes for Schema Diversity

1.3 Steps and Goals

of the Integration Process

1.4 Influence of the Conceptual Model

on the Integration Process

2. A COMPARISON OF METHODOLOGIES

2.1 Introduction

2.2 Applicability of Integration Methodologies

2.3 Methodologies Viewed as Black Boxes

2.4 Gross Architecture of Methodologies

2.5 Preintegration

2.6 Comparison of Schema8

2.7 Conforming of Schemas

2.8 Merging and Restructuring

3. CONCLUSIONS AND FUTURE WORK

3.1 General Remarks

3.2 Missing Aspects

3.3 Future Research Directions

APPENDIX 1. A SUMMARY DESCRIPTION

OF METHODOLOGIES

APPENDIX 2. THE ENTITY-RELATIONSHIP

MODEL

ACKNOWLEDGMENTS

REFERENCES

Since semantic models allow data to be

described in a very abstract and under-

standable manner, they are currently used

in designing the conceptual structure of

databases. This conceptual activity is called

conceptual database design. Its goal is to

produce an abstract, global view of the data

of the application,

called conceptual

schema.

The introduction of a conceptual step in

design methodologies is a fairly recent de-

velopment. It allows designers and users to

cooperate in collecting requirements and

provides a high-level specification of the

data involved in the application. Further-

more, it simplifies the integration of differ-

ing perspectives and expectations that

various users have of the application.

One of the basic motivations for using

the database approach instead of the

(1) View integration (in database design)

produces a global conceptual descrip-

tion of a proposed database.

(2) Database integration

(in distributed

database management) produces the

global schema of a collection of data-

bases. This global schema is a virtual

view of all databases taken together in

a distributed database environment.

The database technology has progressed

to a level where thousands of organizations

are using databases for their day-to-day,

tactical, and strategic management appli-

cations. The distributed database manage-

ment area is also becoming sufficiently well

understood, and we expect to see a large

number of organizations changing to dis-

tributed databases by integrating their

current diverse databases.

The contributions to the state of the art

of database design methodologies, and in

particular schema integration, have been

particularly significant in the last ten years.

Our goal is to provide first a framework by

which the problem of schema integration

can be better understood, and second a

comparative review of the work done thus

far on this problem. Such a framework with

an associated analysis of the prevalent

ACM Computing Surveys, Vol. 18, No. 4, December 1986

Comparison of Methodologies for Database Schema Integration

325

approaches provides

(1) a conceptual foundation to the problem

of schema integration;

(2) a basis upon which to identify strengths

and weaknesses and the missing fea-

tures about individual methodologies;

(3) general guidelines for future improve-

ments and extensions to the present

approaches.

In the next section we explain the view

integration activity; Section I.3 is devoted

to database integration. In Section I.4 we

elaborate on the motivation for investi-

gating integration. Finally, in Section I.5

we describe the general structure of the

remainder of the paper.

1.2 View Integration in Database Design

The problem of database design is one of

designing the structure of a database in a

given environment of users and applica-

tions such that all users’ data requirements

and all applications’ process requirements

are “best satisfied.” This problem has ex-

isted ever since DBMSs came into being.

The DBMSs that store and manipulate

a database must have a definition of the

database in the form of a schema. This is

termed the intension of the database. The

actual values of data in a database are

called instances or occurrences of data.

Sometimes they are termed the extension

of a database, or just “the database.”

Whereas the extension of a database keeps

changing with time, the intension of the

database is supposed to be time invariant.

The database design problem aims at de-

signing the intension schema of the data-

base, which includes logical specifications

such as groupings of attributes and rela-

tionships among these groupings (logical

schema), as well as physical specifications

such as the type of access to records, in-

dexes, ordering, and physical placement

(physical schema). On the basis of this dis-

tinction, the corresponding database design

activities are termed logical schema design

andphysical schema design. Logical schema

design involves the problem of designing

the conceptual schema and mapping such

a schema into the schema definition lan-

guage (or data definition language) of a

specific DBMS. Figure 1 shows the phases

of database design and the intermediate

schema representations. The phases of

database design are

(1)

(2)

(3)

(4)

Requirements Specification and Analy-

sis. An analysis of the information re-

quirements of various areas within an

organization resulting in a preliminary

specification of the information needs

of various user groups.

Conceptual Design. Modeling and rep-

resentation of users’ and applications’

views of information and possibly a

specification of the processing or use of

the information. The final result of this

activity is a conceptual schema that

represents a global, high-level descrip-

tion of the requirements.

Implementation Design. Transforming

a conceptual schema into the logical

schema of a DBMS. The second and

third phases taken together are called

logical database design.

Physical Schema Design and Optimi-

zation. Mapping the logical schema of

a database into an appropriate stored

representation in a DBMS, including

new physical parameters to optimize

the database performance against a set

of transactions.

Typically, the application design activity

proceeds in parallel with database design.

Hence, Figure 1 also shows specifications

related to applications as the outputs of the

last two phases.

As shown in Figure 1, the activity of view

integration can be performed at several

points of the database design process. It

usually is performed during conceptual de-

sign. In that case, its goal is to produce an

integrated schema starting from several ap-

plication views that have been produced

independent1y.l

‘There is a body of work that regards conceptual

design as an activity that considers the application as

a whole, thus producing a single schema. This includes

Batini et al. 119841, Biller and Neuhold 119821, Brodie

[1981], Brodie and Zilles [1981], Ceri [i983];Ceri et

al. [1981], Chen [1983], Lum et al. [1970], Olle et al.

[1982], Rolland and Richards [1982], and Sakai

[1981].

ACM Computing Surveys,

Vol. 18, No. 4, December 1986

326 .

C. Batini, M. Lenzerini, and 5’. B. Navathe

User Requirements in

an Application Domain

Logical

Design

1 Requirement Analysis 1

Specification and Representation

of Information Needs

*----v---y

of Organization and User Groups

Global Conceptual Schema

1 1

*--------~

ktgical Schema

Functional Design

of Applications

Implemented

Application

Database Schema Programs

Figure 1.

Phases of database design. (Adapted from Navathe and Schkolnick [1978].)

(1) The structure of the database for large

annlications (orzanizations) is too com-

The reason for integration is twofold:

their own reauirements and exnecta-

tions of data, -which may conflict with

other user groups.

plex to be modeTed by a single designer

Another possibility (Figure 1) is to per-

in a single view.

form integration even before the “concep-

(2) User groups typically operate inde-

tual design” step is undertaken. In this

pendently in organizations and have

case, view integration still occurs; however,

ACM Computing Surveys, Vol. 18, No. 4, December 1986

Comparison

Methodologies

for

Database Schema Integration

327

views are less formal and are mostly in the

form of narrative descriptions of require-

ments. The last possibility shown in the

figure is to perform integration after the

implementation design step, that is, start

from schemas expressed as implementable

logical schemas. This is the approach fol-

lowed in methodologies based strictly on

the relational model (see Al-Fedaghi and

Scheuermann [ 19811 and Casanova and Vi-

da1 [1983]) that do not advocate a concep-

tual step and model requirements directly

in terms of the relational model.

1.3 Database Integration

Database integration

is a relatively recent

problem that has appeared in the context

of distributed databases. A distributed da-

tabase is a collection of data that logically

belong to the same system but are spread

over the sites of a computer network [Ceri

and Pelagatti 19841. Distributed databases

and distributed database management sys-

tems can be classified into two major cate-

gories:

homogeneous,

dealing with local

databases having the same data model

and identical DBMSs, and

heterogeneous,

having a diversity in data models and

DBMSs. The term

Federated Database

used (e.g., in McLeod and Heimbigner

[1980]) to refer to a collection of databases

in which the sharing is made more explicit

by allowing

export schemas,

which define

the sharable part of each local database.

Each application is able to design its own

global schema by integrating the export

schemas.

The above contexts require that an in-

tegrated global schema be designed from

the local schemas, which refer to existing

databases. This too can be considered a

database design activity. Existing work on

database integration included in our survey

implicitly addresses this problem. The au-

thors of these works [Dayal and Hwang

1984; ElMasri et al. 1987; Mannino and

Effelsberg 1984a; Motro and Buneman

19811 use a semantic data model as an

intermediate model to facilitate the inte-

gration. Another implicit assumption they

make is that the heterogeneous database

management system is able to map the

requests of users-retrievals as well as up-

dates-from such a semantic data model

into the actual databases.

The database integration activity is de-

scribed in a general way in Figure 2. It

shows that this activity has as input the

local schemas and the local queries and

transactions. Most existing work, however,

does not explicitly take into account the

latter process-oriented information in de-

veloping the integrated schema. It is

strictly used in mapping the queries (query

mapping) between the global and the local

levels. Hence, we show the global schema

as well as the data and query-mapping spec-

ifications to be the outputs of the database

integration activity.

1.4 Organizational Context for Integration

Thus far we have pointed out how schema

integration arises in database design. As we

survey the work on schema integration, it

is worthwhile to point out an organizational

context for this important area.

There is a growing trend to regard data

as an autonomous resource of the organi-

zation, independent of the functions cur-

rently in use in the organization [National

Bureau of Standards 19821. There is a need

to capture the meaning of data for the

whole organization in order to manage it

effectively. Because of this awareness, in-

tegration of data has become an area of

growing interest in recent years.

One of the fundamental principles of the

database approach is that a database allows

a nonredundant, unified representation of

all data managed in an organization. This

is true only when methodologies are avail-

able to support integration across organi-

zational and application boundaries. More

and more organizations are becoming aware

of the potential of database systems and

wish to use them for integrated applications

and not just as software for fast retrieval

and updating of data.

Even when applications and user groups

are structurally disconnected, as in most

governmental and large administrative set-

ups, there is something to be gained by

having an enterprise-wide view of the data

resource. This potentially affords individ-

ACM Computing Surveys, Vol. 18, No. 4, December 1986

A comparative analysis of methodologies for database schema integration

Figures

Citations

A survey of approaches to automatic schema matching

Federated database systems for managing distributed, heterogeneous, and autonomous databases

Data fusion

Data Cleaning: Problems and Current Approaches.

Query evaluation techniques for large databases

References

The entity-relationship model: toward a unified view of data

Principles of Database Systems

The theory of relational databases

Database abstractions: aggregation and generalization

The functional data model and the data languages DAPLEX

Related Papers (5)

Federated database systems for managing distributed, heterogeneous, and autonomous databases

A survey of approaches to automatic schema matching

The entity-relationship model: toward a unified view of data

Mediators in the architecture of future information systems

A federated architecture for information management

Frequently Asked Questions (1)

Q1. What are the contributions mentioned in the paper "A comparative analysis of methodologies for database schema integration" ?