
A survey of approaches to automatic schema matching

Erhard Rahm, Philip A. Bernstein

01 Dec 2001-Vol. 10, Iss: 4, pp 334-350

TL;DR: A taxonomy is presented that distinguishes between schema-level and instance-level, element-level and structure-level, and language-based and constraint-based matchers and is intended to be useful when comparing different approaches to schema matching, when developing a new match algorithm, and when implementing a schema matching component.


Abstract: Schema matching is a basic problem in many database application domains, such as data integration, E-business, data warehousing, and semantic query processing. In current implementations, schema matching is typically performed manually, which has significant limitations. On the other hand, previous research papers have proposed many techniques to achieve a partial automation of the match operation for specific application domains. We present a taxonomy that covers many of these existing approaches, and we describe the approaches in some detail. In particular, we distinguish between schema-level and instance-level, element-level and structure-level, and language-based and constraint-based matchers. Based on our classification we review some previous match implementations thereby indicating which part of the solution space they cover. We intend our taxonomy and review of past work to be useful when comparing different approaches to schema matching, when developing a new match algorithm, and when implementing a schema matching component.


Summary (5 min read)


1. Introduction

A fundamental operation in the manipulation of schema information is Match, which takes two schemas as input and produces a mapping between elements of the two schemas that correspond semantically to each other [LC94, MIR94, MZ98, PSU98, MWJ99, DDL00].
Match plays a central role in numerous applications, such as web-oriented data integration, electronic commerce, schema integration, schema evolution and migration, application evolution, data warehousing, database design, web site creation and management, and component-based development.
Currently, schema matching is typically performed manually, perhaps supported by a graphical user interface.
Section 5 classifies the different ways to perform Match automatically and illustrates both the complexity of the problem and (at least part of) the solution space.
Section 9 is a literature review, which describes some integrated solutions and how they fit in the authors' classification.

2.1. Schema integration

Most work on schema match has been motivated by schema integration, a problem that has been investigated since the early 1980s: Given a set of independently developed schemas, construct a global view [BLN86, EP90, SL90, PS98].
In an artificial intelligence setting, this is the problem of integrating independently developed ontologies into a single ontology.
Independently developed schemas often differ in structure and terminology even if they model the same real-world domain, simply because they were developed by different people in different real-world contexts.
Thus, a first step in integrating the schemas is to identify and characterize these interschema relationships.
Integrating an independently developed schema with a given conceptual schema likewise requires reconciling the structure and terminology of the two schemas, which involves schema matching.

2.2. Data warehouses

A variation of the schema integration problem that became popular in the 1990s is that of integrating data sources into a data warehouse.
The extraction process requires transforming data from the source format into the warehouse format.
As shown in [BR00], the match operation is useful for designing transformations.
After an initial mapping is created, the data warehouse designer needs to examine the detailed semantics of each source element and create transformations that reconcile those semantics with those of the target.
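To integrate a new data source S′ by reusing an existing source-to-warehouse transformation S⇒W, the common elements of S′ and S are found first (a match operation), and then S⇒W is reused for those common elements.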

2.3. E-commerce

In the current decade, E-commerce has led to a new motivation for schema matching: message translation.
Trading partners frequently exchange messages that describe business transactions.
Fields are grouped into structures that also may differ between the two formats.
Translating between different message schemas is, in part, a schema matching problem.
Today, application designers need to specify manually how message formats are related.

2.4. Semantic query processing

Schema integration, data warehousing, and E-commerce are all similar in that they involve the design-time analysis of schemas to produce mappings and, possibly, an integrated schema.
A somewhat different scenario is semantic query processing – a run-time scenario where a user specifies the output of a query (e.g., the SELECT clause in SQL), and the system figures out how to produce that output (e.g., by determining the FROM and WHERE clauses in SQL).
The user’s specification is stated in terms of concepts familiar to her, which may not be the same as the names of elements specified in the database schema.
Therefore, in the first phase of processing the query, the system must map the user-specified concepts in the query output to schema elements.
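After this mapping, the system must derive a qualification (e.g., a WHERE clause) that gives the semantics of the mapping; techniques for deriving it have been developed over the past 20 years [MRSS82, KKFG84, WS90, RYAC00].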

3. The match operator

To define the match operator, Match, the authors need to choose a representation for its input schemas and output mapping.
Each mapping element of the match result specifies that certain elements of schema S1 logically correspond to, i.e., match, certain elements of S2, where the semantics of this correspondence is expressed by the mapping element’s mapping expression.
A complete specification of the result of the invocation of Match would also include the mapping expression of each element, that is “Cust.C# = Customer.CustID”, “Cust.CName = Customer.Company”, and “Concatenate(Cust.FirstName, Cust.LastName) = Customer.Contact”.
The similarity of Match and Join extends to OuterMatch operations, which are useful counterparts to Match in much the same way that OuterJoin is a counterpart to Join.
A right (or left) OuterMatch ensures that every element of S2 (or S1) is referenced by the mapping.

4. Architecture for generic match

XML schema editors, portal development kits, database modeling tools and the like may access libraries to select existing schemas, shown in the lower left of Fig.1.
A uniform internal schema representation significantly reduces the complexity of Match, because Match does not have to deal with the large number of different representations of schemas.
Tools that are tightly integrated with the framework can work directly on the internal representation.
The implementation of Match should therefore only determine match candidates, which the user can accept, reject or change.

5. Classification of schema matching approaches

In this section the authors classify the major approaches to schema matching.
Fig.2 shows part of their classification scheme together with some sample approaches.
An implementation of Match may use multiple match algorithms or matchers.
For individual matchers, the authors consider the following largely-orthogonal classification criteria: Instance vs schema: matching approaches can consider instance data (i.e., data contents) or only schema-level information.
In addition, each mapping element may interrelate one or more elements of the two schemas.

6. Schema-level matchers

Schema-level matchers only consider schema information, not instance data.
The available information includes the usual properties of schema elements, such as name, description, data type, relationship types (part-of, is-a, etc.), constraints, and schema structure.
In general, a matcher will find multiple match candidates.
For each candidate, it is customary to estimate the degree of similarity by a normalized numeric value in the range 0–1, in order to identify the best match candidates (as in [PSU98, BCV99, DDL00, CDD01]).
Then the authors cover linguistic and constraint-based matchers.
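
The candidate-ranking step described above can be sketched in a few lines of Python. This is only an illustration: the function name `best_candidates`, the threshold of 0.5, and the cut-off of three candidates are assumptions made here, not values taken from the paper.

```python
def best_candidates(candidates, threshold=0.5, top_k=3):
    """Rank match candidates for one S1 element.

    candidates: dict mapping an S2 element name to a similarity in [0, 1].
    threshold and top_k are illustrative tuning parameters (assumptions).
    """
    # Keep only candidates whose normalized similarity reaches the threshold.
    kept = [(name, sim) for name, sim in candidates.items() if sim >= threshold]
    # Highest similarity first; ties are broken alphabetically for determinism.
    kept.sort(key=lambda pair: (-pair[1], pair[0]))
    return kept[:top_k]


# Hypothetical similarity values for the S1 element Cust.C#.
ranked = best_candidates(
    {"Customer.CustID": 0.9, "Customer.Phone": 0.2, "Customer.Company": 0.6}
)
```

A user-facing tool would typically present such a ranked list for acceptance or rejection rather than commit to the top candidate automatically.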

6.1. Granularity of match (element-level vs structure-level)

The authors distinguish two main alternatives for the granularity of Match, element-level and structure-level matching.
For each element of the first schema, element-level matching determines the matching elements in the second input schema.
Structure-level matching, on the other hand, refers to matching combinations of elements that appear together in a structure.
The fact that the elements “Address” and “CustomerAddress” in Table 2 are likely to match can be derived by a name-based element-level matching without considering their underlying components.
Element-level matching can be implemented by algorithms similar to relational join processing.
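
To make the analogy with join processing concrete, the following Python sketch treats element-level matching as a nested-loop join over two lists of element names. The predicate `name_contains` is a deliberately naive stand-in chosen here for illustration; it is an assumption, not a matcher proposed in the paper.

```python
def element_level_match(s1_names, s2_names, similar):
    """Nested-loop 'join' of two element lists on a similarity predicate."""
    return [(a, b) for a in s1_names for b in s2_names if similar(a, b)]


def name_contains(a, b):
    """Naive predicate (assumption): one name contains the other, ignoring case."""
    a, b = a.lower(), b.lower()
    return a in b or b in a


pairs = element_level_match(
    ["Address", "Name"], ["CustomerAddress", "Phone"], name_contains
)
```

As with joins, the nested loop could be replaced by a hash- or sort-based strategy when the predicate is plain equality.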

6.2. Match cardinality

An S1 (or S2) element can participate in zero, one or many mapping elements of the match result between the two input schemas S1 and S2.
Thus, the authors have the usual relationship cardinalities, namely 1:1 and the set-oriented cases 1:n, n:1, and n:m, between matching elements both with respect to different mapping elements (global cardinality) and with respect to an individual mapping element (local cardinality).
Obtaining n:m mapping elements usually requires considering the structural embedding of the schema elements and thus requires structure-level matching.
Row 3 explains how FirstName and LastName are extracted from Name.
For the first three examples in Table 3, one S1 instance is matched with one S2 instance (1:1 instance-level match).
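
As a small illustration of local cardinality, the sketch below classifies a single mapping element by counting the elements it references on each side. The function name and the string labels are assumptions made for this example.

```python
def local_cardinality(s1_elements, s2_elements):
    """Return '1:1', '1:n', 'n:1' or 'n:m' for one mapping element."""
    if not s1_elements or not s2_elements:
        raise ValueError("a mapping element must reference both schemas")
    left = "1" if len(s1_elements) == 1 else "n"
    if len(s2_elements) == 1:
        right = "1"
    else:
        # Use 'm' on the right only when the left side is already set-valued.
        right = "n" if left == "1" else "m"
    return f"{left}:{right}"


# The Name -> FirstName, LastName case mentioned above is a 1:n mapping element.
kind = local_cardinality({"Name"}, {"FirstName", "LastName"})
```

Global cardinality would instead be determined by counting how many mapping elements each schema element participates in.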

6.3. Linguistic approaches

Language-based or linguistic matchers use names and text (i.e., words or sentences) to find semantically similar schema elements.
Name matching Name-based matching matches schema elements with equal or similar names.
Similarity of names can be defined and measured in various ways, including: Equality of names.
Homonyms are equal or similar names that refer to different elements.
Name-based matching is not limited to finding 1:1 matches.
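
A minimal sketch of name matching, assuming a simple normalization step followed by a generic string-similarity measure. Python's standard `difflib.SequenceMatcher` is used here purely as a convenient stand-in; the paper does not prescribe this measure, and a purely syntactic score of this kind cannot recognize synonyms or rule out homonyms.

```python
import re
from difflib import SequenceMatcher


def normalize(name):
    """Lower-case a name and drop separators such as '_', '.', and spaces."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def name_similarity(a, b):
    """Similarity in [0, 1]; 1.0 means equal names after normalization."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


equal_after_normalization = name_similarity("Cust_ID", "custid")
partial = name_similarity("CustID", "CustomerID")
```

In practice such a score would be combined with dictionaries or thesauri to handle synonyms and hypernyms.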

D (name1, name2,

This assumes that D contains all relevant pairs of the transitive closure over similar names.
Intuitively, the authors would expect the similarity value σ to be 0.9 × 0.8 = 0.72, but this depends on the type of similarity, the use of homonyms, and perhaps other factors.
These comments can also be evaluated linguistically to determine the similarity between schema elements.
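
The multiplication of similarity values along a chain of similar names can be sketched as follows. The element names are hypothetical, and the product rule is only the intuitive choice mentioned above; as the text notes, the appropriate combination depends on the type of similarity and other factors.

```python
def add_transitive_pairs(d):
    """Add one step of transitive name pairs to a similarity dictionary.

    d maps (name1, name2) to a similarity in [0, 1]. For pairs (a, b) and
    (b, c) already in d, the pair (a, c) is added with the product of the
    two similarities (an assumption, not a rule fixed by the paper).
    """
    result = dict(d)
    for (a, b), s_ab in d.items():
        for (b2, c), s_bc in d.items():
            if b == b2 and a != c and (a, c) not in result:
                result[(a, c)] = s_ab * s_bc
    return result


# Hypothetical dictionary entries; the derived value is approximately 0.72.
d = {("Cust", "Customer"): 0.9, ("Customer", "Client"): 0.8}
closed = add_transitive_pairs(d)
```

Repeating the step until no new pairs appear would yield the full transitive closure that the dictionary D is assumed to contain.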

6.4. Constraint-based approaches

Schemas often contain constraints to define data types and value ranges, uniqueness, optionality, relationship types and cardinalities, etc. Certain structural information can be interpreted as constraints, such as intra-schema references (e.g., foreign keys) and adjacency-related information (e.g., part-of relationships).
When performing a match based on hierarchical structures, an algorithm can traverse the structure either top-down or bottom-up.
This allows us to determine the correct n:m SQL-like match mapping S2.

6.5. Reusing schema and mapping information

The authors have already discussed the use of auxiliary information in addition to the input schemas, such as dictionaries, thesauri, and user-provided match or mismatch information.
Another way to use auxiliary information to improve the effectiveness of Match is to support and exploit the reuse of common schema components and previously determined mappings.
The authors also want to reuse entire structures, which is useful when matching different but similar schemas to the same destination schema, as may occur when integrating new sources into a data warehouse or digital library.
The authors already have the match result between S and S2, illustrated by the arrows.
Salary and Income may be considered identical in a payroll application but not in a tax reporting application.

7. Instance-level approaches

Instance-level data can give important insight into the contents and meaning of schema elements.
It can help disambiguate between equally plausible schema-level matches by choosing to match the elements whose instances are more similar.
The main benefit of evaluating instances is a precise characterization of the actual contents of schema elements.
Then, the S2 instances are matched one-by-one against the characterizations of S1 elements.
Instance-level matching can also be performed by utilizing auxiliary information, e.g., previous mappings obtained from matching different schemas.
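
The two-step idea of characterizing S1 elements from their instances and then matching S2 instances one by one against those characterizations might look like the sketch below. The "shape" abstraction (digit runs become 9, letter runs become a) and the voting scheme are assumptions chosen for brevity, and the element names and values are made up.

```python
import re
from collections import Counter


def shape(value):
    """Abstract a value into a character-class pattern, e.g. '555-1234' -> '9-9'."""
    value = re.sub(r"[0-9]+", "9", value)
    return re.sub(r"[A-Za-z]+", "a", value)


def characterize(instances):
    """Characterize an S1 element by the set of shapes of its instance values."""
    return {shape(v) for v in instances}


def match_by_instances(s1_data, s2_instances):
    """Each S2 instance votes for every S1 element whose characterization it fits."""
    profiles = {elem: characterize(vals) for elem, vals in s1_data.items()}
    votes = Counter()
    for value in s2_instances:
        for elem, shapes in profiles.items():
            if shape(value) in shapes:
                votes[elem] += 1
    return votes.most_common()


# Hypothetical instance data for two S1 elements and one S2 element.
s1_data = {"S1.Phone": ["555-1234", "555-9876"], "S1.CName": ["Acme Corp", "Initech"]}
ranking = match_by_instances(s1_data, ["444-0000", "123-4567"])
```

Richer characterizations, such as value ranges, keyword frequencies, or learned classifiers, would slot into `characterize` without changing the overall flow.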

8. Combining different matchers

The authors have reviewed several types of matchers and many different variations.
Structure-level matching also benefits from being combined with other approaches such as name matching.
On the other hand, one can use a composite matcher that combines the results of several independently executed matchers, including hybrid matchers.
Selection of matchers, and determining their execution order and the combination of independently determined match results can be done either automatically by the implementation of Match itself or its clients (e.g., tools), or manually by a human user.
An automatic approach can reduce the number of user interactions, but it is difficult to achieve a generic solution that is adaptable to different application domains (although the approach could be controlled by tuning parameters).
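
A composite matcher of the kind described above can be sketched as a weighted average over independently executed matchers. The two toy matchers, the dictionary-based element representation, and the equal weights are all assumptions made for this illustration; the weights play the role of the tuning parameters mentioned in the text.

```python
def name_matcher(e1, e2):
    """Toy matcher: 1.0 if the element names are equal (ignoring case), else 0.0."""
    return 1.0 if e1["name"].lower() == e2["name"].lower() else 0.0


def type_matcher(e1, e2):
    """Toy matcher: 1.0 if the declared data types are identical, else 0.0."""
    return 1.0 if e1["type"] == e2["type"] else 0.0


def composite_similarity(e1, e2, matchers, weights):
    """Combine independently computed similarities by a weighted average."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * m(e1, e2) for m, w in zip(matchers, weights)) / total


salary = {"name": "Salary", "type": "decimal"}
income = {"name": "Income", "type": "decimal"}
score = composite_similarity(salary, income, [name_matcher, type_matcher], [0.5, 0.5])
```

A hybrid matcher, by contrast, would interleave several criteria inside a single algorithm instead of combining finished results.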

9.1. Prototype schema matchers

In Table 5 the authors show how seven published prototype implementations fit the classification criteria introduced in Sect. 5.
The table thus indicates which part of the solution space is covered by which implementations, thereby supporting a comparison of the approaches.
The table shows that all systems support multiple matching criteria, six in the form of a hybrid matcher and only one, LSD, by a composite match approach.
A global matcher that uses the same machine-learning technology is used to merge the lists into a combined list of match candidates for each schema element.
It computes matches by a weighted sum of name and data type affinity and structural affinity.

10. Conclusion

Schema matching is a basic problem in many database application domains, such as heterogeneous database integration, E-commerce, data warehousing, and semantic query processing.
The authors hope that the taxonomy will be useful to programmers who need to implement a match algorithm and to researchers looking to develop more effective and comprehensive schema matching algorithms.
More attention should be given to the utilization of instance-level information and reuse opportunities to perform Match.
The authors are grateful for many helpful suggestions from Sonia Bergamaschi, Silvana Castano, Chris Clifton, Hai Hong Do, An Hai Doan, Alon Halevy, Jayant Madhavan, Sergey Melnik, Renée Miller, Rachel Pottinger, Arnie Rosenthal, Dennis Shasha, and the anonymous referees.


Figures (7)

Table 5. Characteristics of proposed schema match approaches

Table 4. Constraint-based matching (example)

Table 2. Full vs partial structural match (example)

Fig. 2. Classification of schema matching approaches


The VLDB Journal 10: 334–350 (2001) / Digital Object Identifier (DOI) 10.1007/s007780100057

A survey of approaches to automatic schema matching

Erhard Rahm, Philip A. Bernstein

Universität Leipzig, Institut für Informatik, 04109 Leipzig, Germany; (e-mail: rahm@informatik.uni-leipzig.de)

Microsoft Research, Redmond, WA 98052-6399, USA; (e-mail: philbe@microsoft.com)

Edited by P. Scheuermann. Received: 5 February 2001 / Accepted: 6 September 2001

Published online: 21 November 2001 – © Springer-Verlag 2001

Abstract. Schema matching is a basic problem in many database application domains, such as data integration, E-business, data warehousing, and semantic query processing. In current implementations, schema matching is typically performed manually, which has significant limitations. On the other hand, previous research papers have proposed many techniques to achieve a partial automation of the match operation for specific application domains. We present a taxonomy that covers many of these existing approaches, and we describe the approaches in some detail. In particular, we distinguish between schema-level and instance-level, element-level and structure-level, and language-based and constraint-based matchers. Based on our classification we review some previous match implementations thereby indicating which part of the solution space they cover. We intend our taxonomy and review of past work to be useful when comparing different approaches to schema matching, when developing a new match algorithm, and when implementing a schema matching component.

Keywords: Schema matching – Schema integration – Graph matching – Model management – Machine learning

1. Introduction

A fundamental operation in the manipulation of schema information is Match, which takes two schemas as input and produces a mapping between elements of the two schemas that correspond semantically to each other [LC94, MIR94, MZ98, PSU98, MWJ99, DDL00]. Match plays a central role in numerous applications, such as web-oriented data integration, electronic commerce, schema integration, schema evolution and migration, application evolution, data warehousing, database design, web site creation and management, and component-based development.

Currently, schema matching is typically performed manually, perhaps supported by a graphical user interface. Obviously, manually specifying schema matches is a tedious, time-consuming, error-prone, and therefore expensive process. This is a growing problem given the rapidly increasing number of web data sources and E-businesses to integrate. Moreover, as systems become able to handle more complex databases and applications, their schemas become larger, further increasing the number of matches to be performed. The level of effort is at least linear in the number of matches to be performed, maybe worse than linear if one needs to evaluate each match in the context of other possible matches of the same elements. A faster and less labor-intensive integration approach is needed. This requires automated support for schema matching.

To provide this automated support, we would like to see a generic, customizable implementation of Match that is usable across application areas. This would make it easier to build application-specific tools that include automatic schema match. Such a generic implementation can also be a key component within a more comprehensive model management approach, such as the one proposed in [BHP00, Be00, BR00], where the mapping returned by a match operation may be used as input to operations to merge schemas and compose mappings.

Fortunately, there is a lot of previous work on schema matching developed in the context of schema translation and integration, knowledge representation, machine learning, and information retrieval. The main goals of this paper are to survey these past approaches and to present a taxonomy that explains their common features. We expect the survey to be helpful both to designers of new approaches and to users who need to select from a library of approaches.

In the next section, we summarize some example applications of schema matching. Section 3 defines the match operator, and Section 4 describes a high-level architecture for implementing it. Section 5 provides a classification of different ways to perform Match automatically. This section illustrates both the complexity of the problem and (at least part of) the solution space. We use the classification in Sects. 6 through 8 to organize our presentation of previously proposed techniques and to explain how they may be applied in the overall architecture. Section 9 is a literature review, which describes some integrated solutions and how they fit in our classification. Section 10 is the conclusion.

2. Application domains

To motivate the importance of schema matching, we summarize its use in several database application domains.

E. Rahm, P.A. Bernstein: A survey of approaches to automatic schema matching 335

2.1. Schema integration

Most work on schema match has been motivated by schema integration, a problem that has been investigated since the early 1980s: Given a set of independently developed schemas, construct a global view [BLN86, EP90, SL90, PS98]. In an artificial intelligence setting, this is the problem of integrating independently developed ontologies into a single ontology.

Since the schemas are independently developed, they often have different structure and terminology. This can obviously occur when the schemas are from different domains, such as a real estate schema and property tax schema. However, it also occurs even if they model the same real world domain, just because they were developed by different people in different real-world contexts. Thus, a first step in integrating the schemas is to identify and characterize these interschema relationships. This is a process of schema matching. Once they are identified, matching elements can be unified under a coherent, integrated schema or view. During this integration, or sometimes as a separate step, programs or queries are created that permit translation of data from the original schemas into the integrated representation.

A variation of the schema integration problem is to integrate an independently developed schema with a given conceptual schema. Again, this requires reconciling the structure and terminology of the two schemas, which involves schema matching.

2.2. Data warehouses

A variation of the schema integration problem that became popular in the 1990s is that of integrating data sources into a data warehouse. A data warehouse is a decision support database that is extracted from a set of data sources. The extraction process requires transforming data from the source format into the warehouse format. As shown in [BR00], the match operation is useful for designing transformations. Given a data source, one approach to creating appropriate transformations is to start by finding those elements of the source that are also present in the warehouse. This is a match operation. After an initial mapping is created, the data warehouse designer needs to examine the detailed semantics of each source element and create transformations that reconcile those semantics with those of the target.

Another approach to integrating a new data source S′ is to reuse an existing source-to-warehouse transformation S⇒W. First, the common elements of S′ and S are found (a match operation) and then S⇒W is reused for those common elements.

2.3. E-commerce

In the current decade, E-commerce has led to a new motivation for schema matching: message translation. Trading partners frequently exchange messages that describe business transactions. Usually, each trading partner uses its own message format. Message formats may differ in their syntax, such as EDI (electronic data interchange) structures, XML, or custom data structures. They may also use different message schemas. To enable systems to exchange messages, application developers need to convert messages between the formats required by different trading partners.

Part of the message translation problem is translating between different message schemas. Message schemas may use different names, somewhat different data types, and different ranges of allowable values. Fields are grouped into structures that also may differ between the two formats. For example, one may be a flat structure that simply lists fields while another may group related fields. Or both formats may use nested structures but may group fields in different combinations.

Translating between different message schemas is, in part, a schema matching problem. Today, application designers need to specify manually how message formats are related. A match operation would reduce the amount of manual work by generating a draft mapping between the two message schemas, which an application designer can subsequently validate and modify as needed.

Schema match may also be helpful to applications being considered for the semantic web [BHL01], such as mapping messages between autonomous agents or matching declarative mediator definitions.

2.4. Semantic query processing

Schema integration, data warehousing, and E-commerce are all similar in that they involve the design-time analysis of schemas to produce mappings and, possibly, an integrated schema. A somewhat different scenario is semantic query processing – a run-time scenario where a user specifies the output of a query (e.g., the SELECT clause in SQL), and the system figures out how to produce that output (e.g., by determining the FROM and WHERE clauses in SQL). The user’s specification is stated in terms of concepts familiar to her, which may not be the same as the names of elements specified in the database schema. Therefore, in the first phase of processing the query, the system must map the user-specified concepts in the query output to schema elements. This too is a natural application of the match operation.

After mapping the query output to the schema elements, the system must derive a qualification (e.g., a WHERE clause) that gives the semantics of the mapping. Techniques for deriving this qualification have been developed over the past 20 years [MRSS82, KKFG84, WS90, RYAC00]. We expect that these techniques can be generalized to specify the semantics of a mapping produced by the match operation. However, an investigation of this hypothesis is beyond the scope of this paper.

3. The match operator

To define the match operator, Match, we need to choose a representation for its input schemas and output mapping. We want to explore many approaches to Match. These approaches depend a lot on the kinds of schema information they use and how they interpret it. However, they depend hardly at all on that information’s internal representation, except to the extent that it is expressive enough to represent the information of interest. Therefore, for the purposes of this paper, we define a schema to be simply a set of elements connected by some structure.

Table 1. Sample input schemas

S1 elements        S2 elements
Cust               Customer
  C#                 CustID
  CName              Company
  FirstName          Contact
  LastName           Phone

In practice, a particular representation must be chosen, such as an entity-relationship (ER) model, an object-oriented (OO) model, XML, or directed graphs. In each case, there is a natural correspondence between the building blocks of the representation and the notions of elements and structure: entities and relationships in ER models; objects and relationships in OO models; elements, subelements, and IDREFs in XML; and nodes and edges in graphs.

We define a mapping to be a set of mapping elements, each of which indicates that certain elements of schema S1 are mapped to certain elements in S2. Furthermore, each mapping element can have a mapping expression which specifies how the S1 and S2 elements are related. The mapping expression may be directional, for example, a certain function from the S1 elements referenced by the mapping element to the S2 elements referenced by the mapping element, or it may be non-directional, that is, a relation between a combination of elements of S1 and S2. It may use simple relations over scalars (e.g., =, ≤), functions (e.g., addition or concatenation), ER-style relationships (e.g., is-a, part-of), set-oriented relationships (e.g., overlaps, contains [LNE89]), or any other terms that are defined in the expression language being used.

For example, Table 1 shows two schemas S1 and S2 representing customer information. A mapping between S1 and S2 could contain a mapping element relating Cust.C# to Customer.CustID with the mapping expression “Cust.C# = Customer.CustID”. A mapping element with the expression “Concatenate(Cust.FirstName, Cust.LastName) = Customer.Contact” describes a mapping between two S1 elements and one S2 element.
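
One possible in-memory representation of these definitions is sketched below in Python. The class name `MappingElement`, its field names, and the choice to keep the mapping expression as an uninterpreted string are assumptions made for illustration; the paper deliberately leaves the representation open.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class MappingElement:
    """One element of a mapping: some S1 elements related to some S2 elements."""

    s1_elements: FrozenSet[str]
    s2_elements: FrozenSet[str]
    # Optional mapping expression, stored here as plain text (an assumption).
    expression: Optional[str] = None


# The two mapping elements described in the text for the schemas of Table 1.
mapping = [
    MappingElement(
        frozenset({"Cust.C#"}),
        frozenset({"Customer.CustID"}),
        "Cust.C# = Customer.CustID",
    ),
    MappingElement(
        frozenset({"Cust.FirstName", "Cust.LastName"}),
        frozenset({"Customer.Contact"}),
        "Concatenate(Cust.FirstName, Cust.LastName) = Customer.Contact",
    ),
]
```

Using sets on both sides lets a single mapping element relate several S1 elements to several S2 elements, which a plain pair of names could not express.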

We define the match operation to be a function that takes two schemas S1 and S2 as input and returns a mapping between those two schemas as output, called the match result. Each mapping element of the match result specifies that certain elements of schema S1 logically correspond to, i.e., match, certain elements of S2, where the semantics of this correspondence is expressed by the mapping element’s mapping expression.

Unfortunately, the criteria used to match elements of S1 and S2 are based on heuristics that are not easily captured in a precise mathematical way that can guide us in the implementation of Match. Thus, we are left with the practical, though mathematically unsatisfying, goal of producing a mapping that is consistent with heuristics that approximate our understanding of what users consider to be a good match.

Similar to previous work we focus mostly on match algorithms that return a mapping that does not include mapping expressions. We therefore often represent a mapping as a similarity relation, ∼, over the powersets of S1 and S2, where each pair in ∼ represents one mapping element of the mapping. For example, the result of calling Match on the schemas of Table 1 could be “Cust.C# ∼ Customer.CustID”, “Cust.CName ∼ Customer.Company”, and “{Cust.FirstName, Cust.LastName} ∼ Customer.Contact”. A complete specification of the result of the invocation of Match would also include the mapping expression of each element, that is “Cust.C# = Customer.CustID”, “Cust.CName = Customer.Company”, and “Concatenate(Cust.FirstName, Cust.LastName) = Customer.Contact”. In what follows, when mapping expressions are involved, we will explicitly mention them. Otherwise, we will simply use ∼.

As we will see, some implementations of Match are similar to join processing in relational databases, in that both Match and Join are binary operations that determine pairs of corresponding elements from their input operands. There are many differences, of course. Match operates on metadata (schema elements) and Join on data (rows of tables). Moreover, Match is more complex than Join. Each element in the Join result combines only one element of the first with one matching element of the second input, while an element in a match result can relate multiple elements from both inputs. Furthermore, Join semantics is specified by a single comparison expression (e.g., an equality condition for natural join) that must hold for all matching input elements. By contrast, each element in a match result may have a different mapping expression. Hence, the semantics of Match is less restricted than that of Join and is more difficult to capture in a consistent way.

The similarity of Match and Join extends to OuterMatch

operations, which are useful counterparts to Match in much

the same way that OuterJoin is a counterpart to Join. A right

(or left) OuterMatch ensures that every element of S2 (or S1)

isreferenced by themapping.AfullOuterMatch ensures every

element of both S1 and S2 are referenced by the mapping. By

ensuring that every element of a schema S is referenced in the

mapping returned by Match, the mapping can be more easily

composed with other mappings that refer to S. Examples of

such compositions appear in [BR00], which introduced the

OuterMatch operation. Although the usage of OuterMatch involves

some subtlety, its implementation is a straightforward

extension of Match: given an algorithm for the match opera-

tion, OuterMatch can easily be computed by adding elements

to the match result that reference the otherwise non-referenced

elements of S1 or S2. We therefore do not consider Outer-

Match further in this paper.
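The extension from Match to OuterMatch can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: a mapping is taken to be a list of pairs of name tuples, and an otherwise non-referenced element is recorded as a pair whose other side is empty. Both choices, the function name `outer_match`, and the element names `S1.a`, `S1.b`, `S2.x`, `S2.y` are hypothetical and not taken from [BR00].

```python
def outer_match(mapping, s1_elements, s2_elements, side="full"):
    """Extend a Match result so that unreferenced elements are referenced too.

    mapping: list of (s1_names, s2_names) pairs, each side a tuple of names.
    side: "left" covers S1, "right" covers S2, "full" covers both.
    The empty-tuple encoding of a dangling element is an assumption.
    """
    if side not in ("left", "right", "full"):
        raise ValueError("side must be 'left', 'right' or 'full'")
    result = list(mapping)
    if side in ("left", "full"):
        referenced = {name for s1_names, _ in mapping for name in s1_names}
        result += [((name,), ()) for name in s1_elements if name not in referenced]
    if side in ("right", "full"):
        referenced = {name for _, s2_names in mapping for name in s2_names}
        result += [((), (name,)) for name in s2_elements if name not in referenced]
    return result


# Hypothetical schemas: only S1.a and S2.x are referenced by the Match result.
match_result = [(("S1.a",), ("S2.x",))]
full = outer_match(match_result, ["S1.a", "S1.b"], ["S2.x", "S2.y"])
print(full)
```

In this encoding, a left OuterMatch adds only the entry for `S1.b`, a right OuterMatch adds only the entry for `S2.y`, and a full OuterMatch adds both.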

4. Architecture for generic match

When reviewing and comparing approaches to Match, it helps

to have an implementation architecture in mind. We therefore

describe a high-level architecture for a generic, customizable

implementation of Match.

Figure 1 illustrates the overall architecture. The clients are

schema-related applications and tools from different domains,

such as E-business, portals, and data warehousing. Each client

uses the generic implementation of Match to automatically

determine matches between two input schemas. XML schema

editors, portal development kits, database modeling tools and

the like may access libraries to select existing schemas, shown

in the lower left of Fig. 1. The implementation of Match may


[Figure 1 components: Tool 1 (portal schemas), Tool 2 (E-business schemas), Tool 3 (data warehousing schemas), Tool 4 (database design schemas), schema import/export, generic Match implementation, internal schema representation, global libraries (dictionaries, schemas, …)]

Fig. 1. High-level architecture of generic Match

Table 2. Full vs partial structural match (example)

S1 elements         S2 elements
-----------         -----------
Address             CustomerAddress
  Street              Street
  City                City
  State               USState
  ZIP                 PostalCode
(full structural match of Address and CustomerAddress)

AccountOwner        Customer
  Name                Cname
  Address             CAddress
  Birthdate           CPhone
  TaxExempt
(partial structural match of AccountOwner and Customer)

also use the libraries and other auxiliary information, such as

dictionaries and thesauri, to help ﬁnd matches.

We assume that the generic implementation of Match rep-

resents the schemas to be matched in a uniform internal rep-

resentation. This uniform representation signiﬁcantly reduces

the complexity of Match by not having to deal with the large

number of different (heterogeneous) representations of

schemas. Tools that are tightly integrated with the framework

can work directly on the internal representation. Other tools

need import/export programs to translate between their na-

tive schema representation (such as XML, SQL, or UML) and

the internal representation. A semantics-preserving importer

translates input schemas from their native representation into

the internal representation. Similarly, an exporter translates

mappings produced by the generic implementation of Match

from the internal representation into the representation re-

quired by each tool. This allows the generic implementation

of Match to operate exclusively on the internal representation.
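To make the role of the importer concrete, the following Python sketch shows one possible uniform internal representation and a toy importer for relational schemas. Everything here is an assumption made for illustration: the class `SchemaElement`, the function `import_relational`, and the dictionary format used as the "native" input are hypothetical, and the sketch makes no claim to preserve full schema semantics.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Optional


@dataclass
class SchemaElement:
    """Node of a hypothetical internal schema representation (a simple tree)."""
    name: str
    data_type: Optional[str] = None
    children: List["SchemaElement"] = field(default_factory=list)

    def paths(self, prefix: str = "") -> Iterator[str]:
        """Yield the dotted path of every leaf element below this node."""
        path = f"{prefix}.{self.name}" if prefix else self.name
        if not self.children:
            yield path
        for child in self.children:
            yield from child.paths(path)


def import_relational(tables: Dict[str, Dict[str, str]], schema_name: str) -> SchemaElement:
    """Toy importer: translate {table: {column: type}} into the internal tree."""
    root = SchemaElement(schema_name)
    for table, columns in tables.items():
        column_nodes = [SchemaElement(col, col_type) for col, col_type in columns.items()]
        root.children.append(SchemaElement(table, children=column_nodes))
    return root


# Hypothetical native description of one table; the column types are made up.
s1 = import_relational({"Address": {"Street": "varchar", "ZIP": "char(5)"}}, "S1")
print(list(s1.paths()))
```

With importers of this kind for each native format, a matcher would only ever need to traverse `SchemaElement` trees, which is the simplification the text attributes to the uniform representation.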

In general, it is not possible to determine fully automat-

ically all matches between two schemas, primarily because

most schemas have some semantics that affects the match-

ing criteria but is not formally expressed or often even docu-

mented. The implementation of Match should therefore only

determine match candidates, which the user can accept, reject

or change. Furthermore, the user should be able to specify

matches for elements for which the system was unable to ﬁnd

satisfactory match candidates.

5. Classiﬁcation of schema matching approaches

In this section we classify the major approaches to schema

matching. Fig. 2 shows part of our classification scheme together

with some sample approaches.

An implementation of Match may use multiple match al-

gorithms or matchers. This allows us to select the matchers

depending on the application domain and schema types. Given

that we want to use multiple matchers we distinguish two sub-

problems. First, there is the realization of individual matchers,

each of which computes a mapping based on a single match-

ing criterion. Second, there is the combination of individ-

ual matchers, either by using multiple matching criteria (e.g.,

name and type equality) within an integrated hybrid matcher

or by combining multiple match results produced by different

match algorithms within a composite matcher. For individual

matchers, we consider the following largely-orthogonal clas-

siﬁcation criteria:

• Instance vs schema: matching approaches can consider

instance data (i.e., data contents) or only schema-level in-

formation.

• Element vs structure matching: match can be performed

for individual schema elements, such as attributes, or for

combinations of elements, such as complex schema struc-

tures.

• Language vs constraint: a matcher can use a linguistic-

based approach (e.g., based on names and textual descrip-

tions of schema elements) or a constraint-based approach

(e.g., based on keys and relationships).

• Matching cardinality: the overall match result may relate

one or more elements of one schema to one or more ele-

ments of the other, yielding four cases: 1:1, 1:n, n:1, n:m.

In addition, each mapping element may interrelate one

or more elements of the two schemas. Furthermore, there

may be different match cardinalities at the instance level.

• Auxiliary information: most matchers rely not only on the

input schemas S1 and S2 but also on auxiliary information,

such as dictionaries, global schemas, previous matching

decisions, and user input.
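The difference between a hybrid matcher and a composite matcher, as described before the list above, can be illustrated with a short Python sketch. The element format, the function names, and the weighted-average combination rule are assumptions chosen for this example; the paper does not prescribe a particular way of combining match results.

```python
from typing import Callable, Dict, List, Tuple

Element = Dict[str, str]                       # e.g. {"name": "ZIP", "type": "char"}
Matcher = Callable[[Element, Element], float]  # returns a similarity in [0, 1]


def name_matcher(a: Element, b: Element) -> float:
    """Individual matcher using a single criterion: name equality."""
    return 1.0 if a["name"].lower() == b["name"].lower() else 0.0


def type_matcher(a: Element, b: Element) -> float:
    """Individual matcher using a single criterion: type equality."""
    return 1.0 if a["type"] == b["type"] else 0.0


def hybrid_matcher(a: Element, b: Element) -> float:
    """Hybrid: one integrated matcher that applies both criteria itself."""
    same_name = a["name"].lower() == b["name"].lower()
    same_type = a["type"] == b["type"]
    return 1.0 if same_name and same_type else 0.0


def composite_matcher(weighted: List[Tuple[Matcher, float]]) -> Matcher:
    """Composite: combine the results of independently run matchers.

    The weighted average used here is an assumption, one of many possible rules.
    """
    total = sum(weight for _, weight in weighted)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")

    def combined(a: Element, b: Element) -> float:
        return sum(weight * matcher(a, b) for matcher, weight in weighted) / total

    return combined


zip_s1 = {"name": "ZIP", "type": "char"}  # hypothetical elements
zip_s2 = {"name": "zip", "type": "int"}
composite = composite_matcher([(name_matcher, 0.75), (type_matcher, 0.25)])
print(hybrid_matcher(zip_s1, zip_s2))  # 0.0: the type criterion fails
print(composite(zip_s1, zip_s2))       # 0.75: only the name matcher agrees
```

The composite form lets matchers be added, removed, or reweighted per application domain without touching the individual matchers, whereas the hybrid form fixes the combination inside one algorithm.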


[Figure 2: taxonomy tree. Schema matching approaches divide into individual matcher approaches (schema-only based vs instance/contents-based; element-level vs structure-level; linguistic vs constraint-based) and combining matchers (hybrid matchers; composite matchers with manual or automatic composition). Sample approaches named in the figure: name similarity, description similarity, global namespaces, type similarity, key properties, graph matching, IR techniques (word frequencies, key terms), value pattern and ranges. Further criteria: match cardinality, auxiliary information used, …]

Fig. 2. Classification of schema matching approaches

Note that our classiﬁcation does not distinguish between dif-

ferent types of schemas (relational, XML, object-oriented,

etc.) and their internal representation, because algorithms de-

pend mostly on the kind of information they exploit, not on

its representation.

In the following three sections, we discuss the main alternatives

according to the above classification criteria. We discuss

schema-level matching in Sect. 6, instance-level matching

in Sect. 7, and combinations of multiple matchers in Sect. 8.

6. Schema-level matchers

Schema-level matchers only consider schema information, not

instance data. The available information includes the usual

properties of schema elements, such as name, description,

data type, relationship types (part-of, is-a, etc.), constraints,

and schema structure. In general, a matcher will ﬁnd multiple

match candidates. For each candidate, it is customary to esti-

mate the degree of similarity by a normalized numeric value

in the range 0–1, in order to identify the best match candidates

(as in [PSU98, BCV99, DDL00, CDD01]).
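As a hedged illustration of a normalized similarity value, the sketch below scores name pairs with Python's standard `difflib.SequenceMatcher`, whose `ratio()` lies in the range 0–1, and ranks the candidates for each S1 element. The choice of string similarity, the threshold of 0.5, and the function names are assumptions made for this example; they are not the measures used by the cited systems. The element names come from the first row of Table 2.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Tuple


def name_similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1] (an assumed measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_candidates(s1_names: List[str], s2_names: List[str],
                     threshold: float = 0.5) -> Dict[str, List[Tuple[float, str]]]:
    """For each S1 name, list S2 names scoring at least `threshold`, best first."""
    candidates = {}
    for n1 in s1_names:
        scored = [(name_similarity(n1, n2), n2) for n2 in s2_names]
        kept = [pair for pair in scored if pair[0] >= threshold]
        candidates[n1] = sorted(kept, key=lambda pair: -pair[0])
    return candidates


s1 = ["Street", "City", "State", "ZIP"]
s2 = ["Street", "City", "USState", "PostalCode"]
for name, ranked in match_candidates(s1, s2).items():
    print(name, [(round(score, 2), candidate) for score, candidate in ranked])
```

With these inputs, `State` ranks `USState` first, while `ZIP` receives no candidate at all because it shares almost no characters with `PostalCode`. This is the kind of case where purely string-based scoring falls short and auxiliary information such as dictionaries or thesauri becomes useful.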

We first discuss the main alternatives for match granularity

and match cardinality. Then we cover linguistic and constraint-based

matchers. Finally, we outline approaches based on the

reuse of auxiliary data, such as previously deﬁned schemas

and previous match results.

6.1. Granularity of match (element-level vs structure-level)

We distinguish two main alternatives for the granularity of

Match, element-level and structure-level matching. For each

element of the ﬁrst schema, element-level matching deter-

mines the matching elements in the second input schema. In

the simplest case, only elements at the ﬁnest level of granular-

ity are considered, which we call the atomic level, such as at-

tributes in an XML schema or columns in a relational schema.

For the schema fragments shown in Table 2, a sample atomic-level

match is “Address.ZIP ∼ CustomerAddress.PostalCode”

(recall that “∼” means “matches”).

Structure-level matching, on the other hand, refers to

matching combinations of elements that appear together in a

structure. A range of cases is possible, depending on how complete

and precise a match of the structure is required. In the

ideal case, all components of the structures in the two schemas

fully match. Alternatively, only some of the components may

be required to match (i.e., a partial structural match). Exam-

ples of the two cases are shown in Table 2. The need for partial

matches sometimes arises because subschemas of different domains

are being compared. For example, in the second row of

Table 2, AccountOwner may come from a ﬁnance database

while Customer comes from a sales database.
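The distinction between a full and a partial structural match can be sketched as a coverage check over the components of the two structures. This is only one possible reading: the function name, the coverage formula, and the component pairs assumed for AccountOwner and Customer (Name with Cname, Address with CAddress) are illustrative assumptions and are not stated in the paper. The component names themselves come from Table 2.

```python
from typing import List, Tuple


def structural_match(s1_components: List[str], s2_components: List[str],
                     element_matches: List[Tuple[str, str]]) -> Tuple[str, float]:
    """Classify a structure pair as 'full', 'partial' or 'none'.

    Coverage is the share of components, over both structures, that take part
    in at least one element-level match (an assumed measure).
    """
    valid = [(a, b) for a, b in element_matches
             if a in s1_components and b in s2_components]
    matched1 = {a for a, _ in valid}
    matched2 = {b for _, b in valid}
    total = len(s1_components) + len(s2_components)
    coverage = (len(matched1) + len(matched2)) / total if total else 0.0
    if coverage == 1.0:
        return "full", coverage
    return ("partial", coverage) if coverage > 0 else ("none", coverage)


# First row of Table 2: every component has a counterpart.
print(structural_match(
    ["Street", "City", "State", "ZIP"],
    ["Street", "City", "USState", "PostalCode"],
    [("Street", "Street"), ("City", "City"), ("State", "USState"), ("ZIP", "PostalCode")],
))

# Second row of Table 2: the component pairs below are assumed for illustration.
print(structural_match(
    ["Name", "Address", "Birthdate", "TaxExempt"],
    ["Cname", "CAddress", "CPhone"],
    [("Name", "Cname"), ("Address", "CAddress")],
))
```

A matcher that requires full structural matches would accept only the first pair, while one that tolerates partial matches could accept the second pair whenever its coverage exceeds a threshold chosen for the application.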

For more complex cases, the effectiveness of structure

matching can be enhanced by considering known equivalence

patterns, which may be kept in a library. One simple pattern

is shown in Fig. 3 relating two structures in an is-a hierarchy

to a single structure. The subclass of the ﬁrst schema is repre-

sented by a Boolean attribute in the second schema. Another

well-known pattern consists of two structures interconnected

by a referential relationship being equivalent to a single struc-

ture (essentially, the join of the two). We will see an example

of this in Sect. 6.4.

Element-level matching is not restricted to the atomic level,

but may also be applied to coarser grained, higher (non-atomic)


Frequently Asked Questions (11)

Q1. What have the authors contributed in "A survey of approaches to automatic schema matching" ?

The authors present a taxonomy that covers many of these existing approaches, and they describe the approaches in some detail. Based on their classification the authors review some previous match implementations thereby indicating which part of the solution space they cover. The authors intend their taxonomy and review of past work to be useful when comparing different approaches to schema matching, when developing a new match algorithm, and when implementing a schema matching component.

Q2. What have the authors stated for future works in "A survey of approaches to automatic schema matching" ?

The authors hope that the taxonomy will be useful to programmers who need to implement a match algorithm and to researchers looking to develop more effective and comprehensive schema matching algorithms. In the future, the authors would like to see quantitative work on the relative performance and accuracy of different approaches. Such results could tell us which of the existing approaches dominate the others and could help identify weaknesses in the existing approaches that suggest opportunities for future research. Since the problem is so fundamental, the authors believe the field would benefit from treating it as an independent problem, as they have begun doing here.

Q3. What is the role of match in various applications?

Match plays a central role in numerous applications, such as web-oriented data integration, electronic commerce, schema integration, schema evolution and migration, application evolution, data warehousing, database design, web site creation and management, and component-based development.

Q4. What is the way to use auxiliary information to improve the effectiveness of Match?

Another way to use auxiliary information to improve the effectiveness of Match is to support and exploit the reuse of common schema components and previously determined mappings.

Q5. What is the level of effort required to perform a match?

The level of effort is at least linear in the number of matches to be performed, maybe worse than linear if one needs to evaluate each match in the context of other possible matches of the same elements.

Q6. Why do the authors think the field would benefit from treating it as an independent problem?

Since the problem is so fundamental, the authors believe the field would benefit from treating it as an independent problem, as the authors have begun doing here.

Q7. What are the main classification criteria for a matcher?

For individual matchers, the authors consider the following largely-orthogonal classification criteria: instance vs schema (matching approaches can consider instance data, i.e., data contents, or only schema-level information), element vs structure matching, language vs constraint, matching cardinality, and auxiliary information.

Q8. What is the way to combine structure- with element-level matching?

One way to combine structure- with element-level matching is to use one algorithm to generate a partial mapping and the other to complete the mapping.

Q9. What is the way to simplify the automatic generation of match candidates?

If S1 is more similar to S than to S2, this can simplify the automatic generation of match candidates by reusing matches from the existing result of Match(S, S2), although some care is needed since matches are sometimes not transitive.

Q10. What is the process of generating a list of match candidates in S2?

The per-instance match results need to be merged and abstracted to the schema level, to generate a ranked list of match candidates in S1 for each (schema-level) element in S2.

Q11. What is the way to deal with input schemas?

General natural language dictionaries may be useful, perhaps even multi-language dictionaries (e.g., English-German) to deal with input schemas of different languages.
