
A survey of approaches to automatic schema matching

01 Dec 2001-Vol. 10, Iss: 4, pp 334-350
TL;DR: A taxonomy is presented that distinguishes between schema-level and instance-level, element-level and structure-level, and language-based and constraint-based matchers and is intended to be useful when comparing different approaches to schema matching, when developing a new match algorithm, and when implementing a schema matching component.
Abstract: Schema matching is a basic problem in many database application domains, such as data integration, E-business, data warehousing, and semantic query processing. In current implementations, schema matching is typically performed manually, which has significant limitations. On the other hand, previous research papers have proposed many techniques to achieve a partial automation of the match operation for specific application domains. We present a taxonomy that covers many of these existing approaches, and we describe the approaches in some detail. In particular, we distinguish between schema-level and instance-level, element-level and structure-level, and language-based and constraint-based matchers. Based on our classification we review some previous match implementations thereby indicating which part of the solution space they cover. We intend our taxonomy and review of past work to be useful when comparing different approaches to schema matching, when developing a new match algorithm, and when implementing a schema matching component.

Summary (5 min read)

1. Introduction

  • A fundamental operation in the manipulation of schema information is Match, which takes two schemas as input and produces a mapping between elements of the two schemas that correspond semantically to each other [LC94, MIR94, MZ98, PSU98, MWJ99, DDL00].
  • Match plays a central role in numerous applications, such as web-oriented data integration, electronic commerce, schema integration, schema evolution and migration, application evolution, data warehousing, database design, web site creation and management, and component-based development.
  • Currently, schema matching is typically performed manually, perhaps supported by a graphical user interface.
  • This section illustrates both the complexity of the problem and (at least part of) the solution space.
  • Section 9 is a literature review, which describes some integrated solutions and how they fit in their classification.

2.1. Schema integration

  • Most work on schema match has been motivated by schema integration, a problem that has been investigated since the early 1980s: Given a set of independently developed schemas, construct a global view [BLN86, EP90, SL90, PS98].
  • In an artificial intelligence setting, this is the problem of integrating independently developed ontologies into a single ontology.
  • It also occurs even if they model the same real world domain, just because they were developed by different people in different real-world contexts.
  • Thus, a first step in integrating the schemas is to identify and characterize these interschema relationships.
  • Again, this requires reconciling the structure and terminology of the two schemas, which involves schema matching.

2.2. Data warehouses

  • A variation of the schema integration problem that became popular in the 1990s is that of integrating data sources into a data warehouse.
  • The extraction process requires transforming data from the source format into the warehouse format.
  • As shown in [BR00], the match operation is useful for designing transformations.
  • After an initial mapping is created, the data warehouse designer needs to examine the detailed semantics of each source element and create transformations that reconcile those semantics with those of the target.
  • First, the common elements of S′ and S are found (a match operation) and then S⇒W is reused for those common elements.

2.3. E-commerce

  • In the current decade, E-commerce has led to a new motivation for schema matching: message translation.
  • Trading partners frequently exchange messages that describe business transactions.
  • Fields are grouped into structures that also may differ between the two formats.
  • Translating between different message schemas is, in part, a schema matching problem.
  • Today, application designers need to specify manually how message formats are related.

2.4. Semantic query processing

  • Schema integration, data warehousing, and E-commerce are all similar in that they involve the design-time analysis of schemas to produce mappings and, possibly an integrated schema.
  • A somewhat different scenario is semantic query processing – a run-time scenario where a user specifies the output of a query (e.g., the SELECT clause in SQL), and the system figures out how to produce that output (e.g., by determining the FROM and WHERE clauses in SQL).
  • The user’s specification is stated in terms of concepts familiar to her, which may not be the same as the names of elements specified in the database schema.
  • Therefore, in the first phase of processing the query, the system must map the user-specified concepts in the query output to schema elements.
  • Techniques for deriving this qualification have been developed over the past 20 years [MRSS82, KKFG84, WS90, RYAC00].

3. The match operator

  • To define the match operator, Match, the authors need to choose a representation for its input schemas and output mapping.
  • Each mapping element of the match result specifies that certain elements of schema S1 logically correspond to, i.e., match, certain elements of S2, where the semantics of this correspondence is expressed by the mapping element’s mapping expression.
  • A complete specification of the result of the invocation of Match would also include the mapping expression of each element, that is “Cust.C# = Customer.CustID”, “Cust.CName = Customer.Company”, and “Concatenate(Cust.FirstName, Cust.LastName) = Customer.Contact”.
  • The similarity of Match and Join extends to OuterMatch operations, which are useful counterparts to Match in much the same way that OuterJoin is a counterpart to Join.
  • A right (or left) OuterMatch ensures that every element of S2 (or S1) is referenced by the mapping.

4. Architecture for generic match

  • XML schema editors, portal development kits, database modeling tools and the like may access libraries to select existing schemas, shown in the lower left of Fig.1.
  • This uniform representation significantly reduces the complexity of Match by not having to deal with the large number of different representations of schemas.
  • Tools that are tightly integrated with the framework can work directly on the internal representation.
  • The implementation of Match should therefore only determine match candidates, which the user can accept, reject or change.

5. Classification of schema matching approaches

  • In this section the authors classify the major approaches to schema matching.
  • Fig.2 shows part of their classification scheme together with some sample approaches.
  • An implementation of Match may use multiple match algorithms or matchers.
  • For individual matchers, the authors consider the following largely-orthogonal classification criteria: Instance vs schema: matching approaches can consider instance data (i.e., data contents) or only schema-level information.
  • In addition, each mapping element may interrelate one or more elements of the two schemas.

6. Schema-level matchers

  • Schema-level matchers only consider schema information, not instance data.
  • The available information includes the usual properties of schema elements, such as name, description, data type, relationship types (part-of, is-a, etc.), constraints, and schema structure.
  • In general, a matcher will find multiple match candidates.
  • For each candidate, it is customary to estimate the degree of similarity by a normalized numeric value in the range 0–1, in order to identify the best match candidates (as in [PSU98, BCV99, DDL00, CDD01]).
  • Then the authors cover linguistic and constraint-based matchers.
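
The use of normalized similarity values to pick out the best match candidates can be sketched as follows. This is a minimal Python sketch under stated assumptions: the function name `best_candidates`, the threshold of 0.5, and the similarity values themselves are invented for illustration and do not come from any of the cited systems.

```python
from typing import Dict, List, Tuple

def best_candidates(
    similarity: Dict[Tuple[str, str], float],
    threshold: float = 0.5,  # assumed cut-off, not prescribed by the survey
) -> Dict[str, List[Tuple[str, float]]]:
    """Group candidates per S1 element, keep those at or above the
    threshold, and rank them by descending similarity."""
    ranked: Dict[str, List[Tuple[str, float]]] = {}
    for (s1_element, s2_element), score in similarity.items():
        if score >= threshold:
            ranked.setdefault(s1_element, []).append((s2_element, score))
    for candidates in ranked.values():
        candidates.sort(key=lambda candidate: candidate[1], reverse=True)
    return ranked

# Hypothetical similarity values in the range 0-1.
similarity = {
    ("Address", "CAddress"): 0.7,
    ("Address", "CustomerAddress"): 0.8,
    ("Address", "CPhone"): 0.1,
}
result = best_candidates(similarity)
```

The ranked lists are only candidates; in line with the architecture section, a user would still accept, reject or change them.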

6.1. Granularity of match (element-level vs structure-level)

  • The authors distinguish two main alternatives for the granularity of Match, element-level and structure-level matching.
  • For each element of the first schema, element-level matching determines the matching elements in the second input schema.
  • Structure-level matching, on the other hand, refers to matching combinations of elements that appear together in a structure.
  • The fact that the elements “Address” and “CustomerAddress” in Table 2 are likely to match can be derived by a name-based element-level matching without considering their underlying components.
  • Element-level matching can be implemented by algorithms similar to relational join processing.
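
The analogy between element-level matching and join processing can be illustrated with a hash join on element names. The sketch below is a simplification made for illustration: it only handles equality of names after a case-insensitive normalization, and the names `normalize` and `name_equality_join` are hypothetical. Matching on similar rather than equal names would need a similarity join instead.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

def normalize(name: str) -> str:
    """Assumed normalization: ignore case and underscores."""
    return name.lower().replace("_", "")

def name_equality_join(
    s1_names: Sequence[str], s2_names: Sequence[str]
) -> List[Tuple[str, str]]:
    """Hash join: build an index on S2 names, then probe it with S1 names."""
    index: Dict[str, List[str]] = defaultdict(list)
    for s2_name in s2_names:
        index[normalize(s2_name)].append(s2_name)
    return [
        (s1_name, s2_name)
        for s1_name in s1_names
        for s2_name in index.get(normalize(s1_name), [])
    ]

pairs = name_equality_join(["Street", "City", "ZIP"], ["street", "CITY", "PostalCode"])
```

With these inputs, "ZIP" and "PostalCode" stay unmatched, which is the kind of case where dictionaries or other matchers are needed.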

6.2. Match cardinality

  • An S1 (or S2) element can participate in zero, one or many mapping elements of the match result between the two input schemas S1 and S2.
  • Thus, the authors have the usual relationship cardinalities, namely 1:1 and the set-oriented cases 1:n, n:1, and n:m, between matching elements both with respect to different mapping elements (global cardinality) and with respect to an individual mapping element (local cardinality).
  • Obtaining n:m mapping elements usually requires considering the structural embedding of the schema elements and thus requires structure-level matching.
  • Row 3 explains how FirstName and LastName are extracted from Name.
  • For the first three examples in Table 3, one S1 instance is matched with one S2 instance (1:1 instance-level match).

6.3. Linguistic approaches

  • Language-based or linguistic matchers use names and text (i.e., words or sentences) to find semantically similar schema elements.
  • Name matching Name-based matching matches schema elements with equal or similar names.
  • Similarity of names can be defined and measured in various ways, including: Equality of names.
  • Homonyms are equal or similar names that refer to different elements.
  • Name-based matching is not limited to finding 1:1 matches.

D (name1, name2,

  • This assumes that D contains all relevant pairs of the transitive closure over similar names.
  • Intuitively, the authors would expect the similarity value σ to be .9 × .8 = .72, but this depends on the type of similarity, the use of homonyms, and perhaps other factors.
  • These comments can also be evaluated linguistically to determine the similarity between schema elements.
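
The arithmetic behind the .72 value can be written out as a small worked example. This is a minimal Python sketch, assuming that D is stored as a dictionary from name pairs to similarity values and that similarity along a chain of names is combined by multiplication, which is only the intuitive option mentioned above. The names `ship_to`, `deliver_to`, and `send_to` and the function `chained_similarity` are hypothetical.

```python
from typing import Dict, Tuple

# Hypothetical excerpt of a similarity dictionary D(name1, name2, sigma).
D: Dict[Tuple[str, str], float] = {
    ("ship_to", "deliver_to"): 0.9,
    ("deliver_to", "send_to"): 0.8,
}

def chained_similarity(
    d: Dict[Tuple[str, str], float], first: str, via: str, last: str
) -> float:
    """Multiply the similarities along first -> via -> last.
    Multiplication is an assumption; other combination rules are possible."""
    return d[(first, via)] * d[(via, last)]

sigma = chained_similarity(D, "ship_to", "deliver_to", "send_to")
print(round(sigma, 2))  # 0.72
```

If D already contains the transitive closure, as assumed in the text, the derived pair would be stored in D directly rather than computed at lookup time.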

6.4. Constraint-based approaches

  • Schemas often contain constraints to define data types and value ranges, uniqueness, optionality, relationship types and cardinalities, etc. Certain structural information can be interpreted as constraints, such as intra-schema references (e.g., foreign keys) and adjacency-related information (e.g., part-of relationships).
  • When performing a match based on hierarchical structures, an algorithm can traverse the structure either top-down or bottom-up.
  • This allows us to determine the correct n:m SQL-like match mapping S2.

6.5. Reusing schema and mapping information

  • The authors have already discussed the use of auxiliary information in addition to the input schemas, such as dictionaries, thesauri, and user-provided match or mismatch information.
  • Another way to use auxiliary information to improve the effectiveness of Match is to support and exploit the reuse of common schema components and previously determined mappings.
  • The authors also want to reuse entire structures, which is useful when matching different but similar schemas to the same destination schema, as may occur when integrating new sources into a data warehouse or digital library.
  • The authors already have the match result between S and S2, illustrated by the arrows.
  • Salary and Income may be considered identical in a payroll application but not in a tax reporting application.

7. Instance-level approaches

  • Instance-level data can give important insight into the contents and meaning of schema elements.
  • It can help disambiguate between equally plausible schema-level matches by choosing to match the elements whose instances are more similar.
  • The main benefit of evaluating instances is a precise characterization of the actual contents of schema elements.
  • Then, the S2 instances are matched one-by-one against the characterizations of S1 elements.
  • Instance-level matching can also be performed by utilizing auxiliary information, e.g., previous mappings obtained from matching different schemas.

8. Combining different matchers

  • The authors have reviewed several types of matchers and many different variations.
  • Structure-level matching also benefits from being combined with other approaches such as name matching.
  • On the other hand, one can use a composite matcher that combines the results of several independently executed matchers, including hybrid matchers.
  • Selection of matchers, and determining their execution order and the combination of independently determined match results can be done either automatically by the implementation of Match itself or its clients (e.g., tools), or manually by a human user.
  • An automatic approach can reduce the number of user interactions, but it is difficult to achieve a generic solution that is adaptable to different application domains (although the approach could be controlled by tuning parameters).
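
A composite matcher of the kind described above can be sketched as follows. This is a minimal Python sketch, not the method of any reviewed system: the two toy matchers (`equal_names`, `shared_prefix`), the function `composite_match`, and the choice of a weighted average as the combination rule are all assumptions made for illustration. The weights play the role of the tuning parameters mentioned in the last bullet.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Matcher = Callable[[str, str], float]  # returns a similarity in the range 0-1

def equal_names(a: str, b: str) -> float:
    """Toy matcher: 1.0 for case-insensitive equality, else 0.0."""
    return 1.0 if a.lower() == b.lower() else 0.0

def shared_prefix(a: str, b: str) -> float:
    """Toy matcher: length of the common prefix over the longer name."""
    a, b = a.lower(), b.lower()
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return common / (max(len(a), len(b)) or 1)

def composite_match(
    s1: Sequence[str], s2: Sequence[str], matchers: List[Tuple[Matcher, float]]
) -> Dict[Tuple[str, str], float]:
    """Run every matcher independently on every pair and combine the results
    by a weighted average (assumes a non-empty list with positive weights)."""
    total_weight = sum(weight for _, weight in matchers)
    return {
        (a, b): sum(weight * matcher(a, b) for matcher, weight in matchers) / total_weight
        for a in s1
        for b in s2
    }

scores = composite_match(
    ["CName"], ["Cname", "CPhone"], [(equal_names, 1.0), (shared_prefix, 1.0)]
)
```

Because each matcher is just a function with a weight, a client or a user can change the selection and weighting without touching the combination step.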

9.1. Prototype schema matchers

  • In Table 5 the authors show how seven published prototype implementations fit the classification criteria introduced in Sect.5.
  • The table thus indicates which part of the solution space is covered by which implementations, thereby supporting a comparison of the approaches.
  • The table shows that all systems support multiple matching criteria, six in the form of a hybrid matcher and only one, LSD, by a composite match approach.
  • A global matcher that uses the same machine-learning technology is used to merge the lists into a combined list of match candidates for each schema element.
  • It computes matches by a weighted sum of name and data type affinity and structural affinity.

10. Conclusion

  • Schema matching is a basic problem in many database application domains, such as heterogeneous database integration, E-commerce, data warehousing, and semantic query processing.
  • The authors hope that the taxonomy will be useful to programmers who need to implement a match algorithm and to researchers looking to develop more effective and comprehensive schema matching algorithms.
  • More attention should be given to the utilization of instance-level information and reuse opportunities to perform Match.
  • The authors are grateful for many helpful suggestions from Sonia Bergamaschi, Silvana Castano, Chris Clifton, Hai Hong Do, An Hai Doan, Alon Halevy, Jayant Madhavan, Sergey Melnik, Renée Miller, Rachel Pottinger, Arnie Rosenthal, Dennis Shasha, and the anonymous referees.


The VLDB Journal 10: 334–350 (2001) / Digital Object Identifier (DOI) 10.1007/s007780100057
A survey of approaches to automatic schema matching
Erhard Rahm (1), Philip A. Bernstein (2)
(1) Universität Leipzig, Institut für Informatik, 04109 Leipzig, Germany; (e-mail: rahm@informatik.uni-leipzig.de)
(2) Microsoft Research, Redmond, WA 98052-6399, USA; (e-mail: philbe@microsoft.com)
Edited by P. Scheuermann. Received: 5 February 2001 / Accepted: 6 September 2001
Published online: 21 November 2001
© Springer-Verlag 2001
Abstract. Schema matching is a basic problem in many
database application domains, such as data integration, E-
business, data warehousing, and semantic query processing.
In current implementations, schema matching is typically per-
formed manually, which has significant limitations. On the
other hand, previous research papers have proposed many
techniques to achieve a partial automation of the match op-
eration for specific application domains. We present a taxon-
omy that covers many of these existing approaches, and we
describe the approaches in some detail. In particular, we distinguish
between schema-level and instance-level, element-level
and structure-level, and language-based and constraint-based
matchers. Based on our classification we review some pre-
vious match implementations thereby indicating which part
of the solution space they cover. We intend our taxonomy and
review of past work to be useful when comparing different ap-
proaches to schema matching, when developing a new match
algorithm, and when implementing a schema matching com-
ponent.
Keywords: Schema matching, Schema integration, Graph matching,
Model management, Machine learning
1. Introduction
A fundamental operation in the manipulation of schema in-
formation is Match, which takes two schemas as input and
produces a mapping between elements of the two schemas
that correspond semantically to each other [LC94, MIR94,
MZ98, PSU98, MWJ99, DDL00]. Match plays a central role
in numerous applications, such as web-oriented data integra-
tion, electronic commerce, schema integration, schema evo-
lution and migration, application evolution, data warehous-
ing, database design, web site creation and management, and
component-based development.
Currently, schema matching is typically performed manually,
perhaps supported by a graphical user interface. Obviously,
manually specifying schema matches is a tedious, time-consuming,
error-prone, and therefore expensive process. This
is a growing problem given the rapidly increasing number of
web data sources and E-businesses to integrate. Moreover, as
systems become able to handle more complex databases and
applications, their schemas become larger, further increasing
the number of matches to be performed. The level of effort
is at least linear in the number of matches to be performed,
maybe worse than linear if one needs to evaluate each match in
the context of other possible matches of the same elements. A
faster and less labor-intensive integration approach is needed.
This requires automated support for schema matching.
To provide this automated support, we would like to see
a generic, customizable implementation of Match that is us-
able across application areas. This would make it easier to
build application-specific tools that include automatic schema
match. Such a generic implementation can also be a key com-
ponent within a more comprehensive model management ap-
proach, such as the one proposed in [BHP00, Be00, BR00],
where the mapping returned by a match operation may be
used as input to operations to merge schemas and compose
mappings.
Fortunately, there is a lot of previous work on schema
matching developed in the context of schema translation and
integration, knowledge representation, machine learning, and
information retrieval. The main goals of this paper are to survey
these past approaches and to present a taxonomy that explains
their common features. We expect the survey to be helpful
both to designers of new approaches and to users who need
to select from a library of approaches.
In the next section, we summarize some example applica-
tions of schema matching. Section 3 defines the match oper-
ator, and Section 4 describes a high-level architecture for im-
plementing it. Section 5 provides a classification of different
ways to perform Match automatically. This section illustrates
both the complexity of the problem and (at least part of) the
solution space. We use the classification in Sects. 6 through
8 to organize our presentation of previously proposed tech-
niques and to explain how they may be applied in the overall
architecture. Section 9 is a literature review, which describes
some integrated solutions and how they fit in our classification.
Section 10 is the conclusion.
2. Application domains
To motivate the importance of schema matching, we summa-
rize its use in several database application domains.

2.1. Schema integration
Most work on schema match has been motivated by schema
integration, a problem that has been investigated since the
early 1980s: Given a set of independently developed schemas,
construct a global view [BLN86, EP90, SL90, PS98]. In an
artificial intelligence setting, this is the problem of integrating
independently developed ontologies into a single ontology.
Since the schemas are independently developed, they often
have different structure and terminology. This can obviously
occur when the schemas are from different domains, such as
a real estate schema and property tax schema. However, it
also occurs even if they model the same real world domain,
just because they were developed by different people in dif-
ferent real-world contexts. Thus, a first step in integrating the
schemas is to identify and characterize these interschema re-
lationships. This is a process of schema matching. Once they
are identified, matching elements can be unified under a co-
herent, integrated schema or view. During this integration, or
sometimes as a separate step, programs or queries are created
that permit translation of data from the original schemas into
the integrated representation.
A variation of the schema integration problem is to inte-
grate an independently developed schema with a given con-
ceptual schema. Again, this requires reconciling the structure
and terminology of the two schemas, which involves schema
matching.
2.2. Data warehouses
A variation of the schema integration problem that became
popular in the 1990s is that of integrating data sources into
a data warehouse. A data warehouse is a decision support
database that is extracted from a set of data sources. The ex-
traction process requires transforming data from the source
format into the warehouse format. As shown in [BR00], the
match operation is useful for designing transformations. Given
a data source, one approach to creating appropriate transfor-
mations is to start by finding those elements of the source that
are also present in the warehouse. This is a match operation.
After an initial mapping is created, the data warehouse de-
signer needs to examine the detailed semantics of each source
element and create transformations that reconcile those se-
mantics with those of the target.
Another approach to integrating a new data source S′ is to
reuse an existing source-to-warehouse transformation S⇒W.
First, the common elements of S′ and S are found (a match
operation) and then S⇒W is reused for those common elements.
2.3. E-commerce
In the current decade, E-commerce has led to a new motivation
for schema matching: message translation. Trading partners
frequently exchange messages that describe business trans-
actions. Usually, each trading partner uses its own message
format. Message formats may differ in their syntax, such as
EDI (electronic data interchange) structures, XML, or custom
data structures. They may also use different message schemas.
To enable systems to exchange messages, application devel-
opers need to convert messages between the formats required
by different trading partners.
Part of the message translation problem is translating be-
tween different message schemas. Message schemas may use
different names, somewhat different data types, and different
ranges of allowable values. Fields are grouped into structures
that also may differ between the two formats. For example,
one may be a flat structure that simply lists fields while another
may group related fields. Or both formats may use nested
structures but may group fields in different combinations.
Translating between different message schemas is, in part,
a schema matching problem. Today, application designers
need to specify manually how message formats are related. A
match operation would reduce the amount of manual work by
generating a draft mapping between the two message schemas,
which an application designer can subsequently validate and
modify as needed.
Schema match may also be helpful to applications being
considered for the semantic web [BHL01], such as mapping
messages between autonomous agents or matching declarative
mediator definitions.
2.4. Semantic query processing
Schema integration, data warehousing, and E-commerce are
all similar in that they involve the design-time analysis of
schemas to produce mappings and, possibly an integrated
schema. A somewhat different scenario is semantic query processing,
a run-time scenario where a user specifies the output
of a query (e.g., the SELECT clause in SQL), and the system
figures out how to produce that output (e.g., by determining
the FROM and WHERE clauses in SQL). The user’s speci-
fication is stated in terms of concepts familiar to her, which
may not be the same as the names of elements specified in the
database schema. Therefore, in the first phase of processing
the query, the system must map the user-specified concepts
in the query output to schema elements. This too is a natural
application of the match operation.
After mapping the query output to the schema elements,
the system must derive a qualification (e.g., a WHERE clause)
that gives the semantics of the mapping. Techniques for de-
riving this qualification have been developed over the past 20
years [MRSS82, KKFG84, WS90, RYAC00]. We expect that
these techniques can be generalized to specify the semantics
of a mapping produced by the match operation. However, an
investigation of this hypothesis is beyond the scope of this
paper.
3. The match operator
To define the match operator, Match, we need to choose a
representation for its input schemas and output mapping. We
want to explore many approaches to Match. These approaches
depend a lot on the kinds of schema information they use and
how they interpret it. However, they depend hardly at all on
that information’s internal representation, except to the extent
that it is expressive enough to represent the information of
interest. Therefore, for the purposes of this paper, we define
a schema to be simply a set of elements connected by some
structure.

Table 1. Sample input schemas
S1 elements        S2 elements
Cust               Customer
  C#                 CustID
  CName              Company
  FirstName          Contact
  LastName           Phone

In practice, a particular representation must be chosen,
such as an entity-relationship (ER) model, an object-oriented
(OO) model, XML, or directed graphs. In each case, there is
a natural correspondence between the building blocks of the
representation and the notions of elements and structure: enti-
ties and relationships in ER models; objects and relationships
in OO models; elements, subelements, and IDREFs in XML;
and nodes and edges in graphs.
We define a mapping to be a set of mapping elements,
each of which indicates that certain elements of schema S1
are mapped to certain elements in S2. Furthermore, each map-
ping element can have a mapping expression which specifies
how the S1 and S2 elements are related. The mapping ex-
pression may be directional, for example, a certain function
from the S1 elements referenced by the mapping element to
the S2 elements referenced by the mapping element, or it may
be non-directional, that is, a relation between a combination
of elements of S1 and S2. It may use simple relations over
scalars (e.g., =, ), functions (e.g., addition or concatena-
tion), ER-style relationships (e.g., is-a, part-of), set-oriented
relationships (e.g., overlaps, contains [LNE89]), or any other
terms that are defined in the expression language being used.
For example, Table 1 shows two schemas S1 and S2
representing customer information. A mapping between S1
and S2 could contain a mapping element relating Cust.C#
to Customer.CustID with the mapping expression “Cust.C#
= Customer.CustID”. A mapping element with the expression
“Concatenate(Cust.FirstName, Cust.LastName) = Customer.Contact”
describes a mapping between two S1 elements
and one S2 element.
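
The two mapping elements just described can be written down as a small data structure. The following is a minimal Python sketch. The class name `MappingElement`, its field names, the encoding of a directional mapping expression as a function, and the space separator used in the concatenation are illustrative assumptions, not notation from this paper.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

@dataclass(frozen=True)
class MappingElement:
    """One mapping element: certain S1 elements mapped to certain S2 elements."""
    s1_elements: Tuple[str, ...]
    s2_elements: Tuple[str, ...]
    # Optional directional mapping expression, encoded here (hypothetically)
    # as a function from S1 values to the S2 value.
    expression: Optional[Callable[..., str]] = None

# "Cust.C# = Customer.CustID"
id_element = MappingElement(("Cust.C#",), ("Customer.CustID",), lambda c: c)

# "Concatenate(Cust.FirstName, Cust.LastName) = Customer.Contact"
# The space separator is an assumption made for this example.
contact_element = MappingElement(
    ("Cust.FirstName", "Cust.LastName"),
    ("Customer.Contact",),
    lambda first, last: f"{first} {last}",
)

# A mapping is a set of mapping elements; a list keeps the example simple.
mapping = [id_element, contact_element]
```

A non-directional expression, i.e., a relation over S1 and S2 elements, would need a different encoding, for example a predicate over values from both sides.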
We define the match operation to be a function that takes
two schemas S1 and S2 as input and returns a mapping be-
tween those two schemas as output, called the match result.
Each mapping element of the match result specifies that certain
elements of schema S1 logically correspond to, i.e., match,
certain elements of S2, where the semantics of this corre-
spondence is expressed by the mapping element’s mapping
expression.
Unfortunately, the criteria used to match elements of S1
and S2 are based on heuristics that are not easily captured in a
precise mathematical way that can guide us in the implemen-
tation of Match. Thus, we are left with the practical, though
mathematically unsatisfying, goal of producing a mapping that
is consistent with heuristics that approximate our understand-
ing of what users consider to be a good match.
Similar to previous work we focus mostly on match algorithms
that return a mapping that does not include mapping
expressions. We therefore often represent a mapping as a similarity
relation, ≅, over the powersets of S1 and S2, where each
pair in ≅ represents one mapping element of the mapping. For
example, the result of calling Match on the schemas of Table
1 could be “Cust.C# ≅ Customer.CustID”, “Cust.CName ≅
Customer.Company”, and “{Cust.FirstName, Cust.LastName}
≅ Customer.Contact”. A complete specification of the result
of the invocation of Match would also include the mapping
expression of each element, that is “Cust.C# = Customer.CustID”,
“Cust.CName = Customer.Company”, and “Concatenate
(Cust.FirstName, Cust.LastName) = Customer.Contact”. In
what follows, when mapping expressions are involved, we
will explicitly mention them. Otherwise, we will simply use ≅.
As we will see, some implementations of Match are similar
to join processing in relational databases, in that both Match
and Join are binary operations that determine pairs of corre-
sponding elements from their input operands. There are many
differences, of course. Match operates on metadata (schema
elements) and Join on data (rows of tables). Moreover, Match
is more complex than Join. Each element in the Join result
combines only one element of the first with one matching el-
ement of the second input, while an element in a match result
can relate multiple elements from both inputs. Furthermore,
Join semantics is specified by a single comparison expression
(e.g., an equality condition for natural join) that must hold
for all matching input elements. By contrast, each element
in a match result may have a different mapping expression.
Hence, the semantics of Match is less restricted than that of
Join and is more difficult to capture in a consistent way.
The similarity of Match and Join extends to OuterMatch
operations, which are useful counterparts to Match in much
the same way that OuterJoin is a counterpart to Join. A right
(or left) OuterMatch ensures that every element of S2 (or S1)
is referenced by the mapping. A full OuterMatch ensures every
element of both S1 and S2 are referenced by the mapping. By
ensuring that every element of a schema S is referenced in the
mapping returned by Match, the mapping can be more easily
composed with other mappings that refer to S. Examples of
such compositions appear in [BR00], which introduced the
OuterMatch operation. Although the usage of OuterMatch involves
some subtlety, its implementation is a straightforward
extension of Match: given an algorithm for the match opera-
tion, OuterMatch can easily be computed by adding elements
to the match result that reference the otherwise non-referenced
elements of S1 or S2. We therefore do not consider Outer-
Match further in this paper.
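
The construction of OuterMatch from a given match result can be sketched in a few lines. This is a minimal Python sketch under assumptions: a match result is represented as a set of pairs of frozensets (one pair per mapping element, in the style of the similarity relation above), and an otherwise non-referenced element is paired with an empty set. The function name `outer_match` and its `side` parameter are hypothetical, and [BR00] may define the added elements differently.

```python
from typing import FrozenSet, Set, Tuple

Pair = Tuple[FrozenSet[str], FrozenSet[str]]  # (S1 elements, S2 elements)

def outer_match(
    s1: Set[str], s2: Set[str], match_result: Set[Pair], side: str = "full"
) -> Set[Pair]:
    """Add mapping elements for schema elements the match result misses.
    side = "left" covers S1, "right" covers S2, "full" covers both."""
    referenced_s1 = {e for left, _ in match_result for e in left}
    referenced_s2 = {e for _, right in match_result for e in right}
    result = set(match_result)
    if side in ("left", "full"):
        result |= {(frozenset({e}), frozenset()) for e in s1 - referenced_s1}
    if side in ("right", "full"):
        result |= {(frozenset(), frozenset({e})) for e in s2 - referenced_s2}
    return result

# Attribute-level elements of Table 1 (the top-level elements are left out).
s1 = {"Cust.C#", "Cust.CName", "Cust.FirstName", "Cust.LastName"}
s2 = {"Customer.CustID", "Customer.Company", "Customer.Contact", "Customer.Phone"}
match_result = {
    (frozenset({"Cust.C#"}), frozenset({"Customer.CustID"})),
    (frozenset({"Cust.CName"}), frozenset({"Customer.Company"})),
    (frozenset({"Cust.FirstName", "Cust.LastName"}), frozenset({"Customer.Contact"})),
}
full = outer_match(s1, s2, match_result)
```

In this example only Customer.Phone is not referenced by the match result, so the full and right variants add one mapping element and the left variant adds none.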
4. Architecture for generic match
When reviewing and comparing approaches to Match, it helps
to have an implementation architecture in mind. We therefore
describe a high-level architecture for a generic, customizable
implementation of Match.
Figure 1 illustrates the overall architecture. The clients are
schema-related applications and tools from different domains,
such as E-business, portals, and data warehousing. Each client
uses the generic implementation of Match to automatically
determine matches between two input schemas. XML schema
editors, portal development kits, database modeling tools and
the like may access libraries to select existing schemas, shown
in the lower left of Fig. 1. The implementation of Match may

[Figure 1 components: Tool 1 (Portal schemas); Tool 2 (E-business schemas); Tool 3 (Data warehousing schemas); Tool 4 (Database design schemas); Schema import/export; Internal schema representation; Generic Match implementation; Global libraries (dictionaries, schemas, …)]
Fig. 1. High-level architecture of generic Match
Table 2. Full vs partial structural match (example)
S1 elements     S2 elements
Address         CustomerAddress     full structural match of
  Street          Street            Address and CustomerAddress
  City            City
  State           USState
  ZIP             PostalCode
AccountOwner    Customer            partial structural match of
  Name            Cname             AccountOwner and Customer
  Address         CAddress
  Birthdate       CPhone
  TaxExempt
also use the libraries and other auxiliary information, such as
dictionaries and thesauri, to help find matches.
We assume that the generic implementation of Match represents
the schemas to be matched in a uniform internal representation.
This uniform representation significantly reduces the complexity
of Match, because Match does not have to deal with the large
number of different (heterogeneous) representations of
schemas. Tools that are tightly integrated with the framework
can work directly on the internal representation. Other tools
need import/export programs to translate between their na-
tive schema representation (such as XML, SQL, or UML) and
the internal representation. A semantics-preserving importer
translates input schemas from their native representation into
the internal representation. Similarly, an exporter translates
mappings produced by the generic implementation of Match
from the internal representation into the representation re-
quired by each tool. This allows the generic implementation
of Match to operate exclusively on the internal representation.
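The role of the importer and exporter can be illustrated with a minimal Python sketch. Everything in it is an assumption for illustration: the tree-shaped `SchemaElement` class, the simplified relational input (table name to column name to type), and the functions `import_relational` and `export_pairs` are hypothetical names, not part of any system described here.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class SchemaElement:
    """Node of the assumed uniform internal representation."""
    name: str
    data_type: Optional[str] = None          # None for non-atomic elements
    children: List["SchemaElement"] = field(default_factory=list)

    def paths(self, prefix: str = "") -> List[str]:
        """Return dotted paths of this element and all its descendants."""
        here = f"{prefix}.{self.name}" if prefix else self.name
        found = [here]
        for child in self.children:
            found.extend(child.paths(here))
        return found


def import_relational(schema_name: str, tables: Dict[str, Dict[str, str]]) -> SchemaElement:
    """Hypothetical importer: tables and their typed columns become a tree."""
    root = SchemaElement(schema_name)
    for table, columns in tables.items():
        node = SchemaElement(table)
        node.children = [SchemaElement(col, typ) for col, typ in columns.items()]
        root.children.append(node)
    return root


def export_pairs(mapping: List[Tuple[str, str]]) -> List[str]:
    """Hypothetical exporter: render internal path pairs for a client tool."""
    return [f"{left} -> {right}" for left, right in mapping]


s1 = import_relational("S1", {"Address": {"Street": "varchar", "ZIP": "char(5)"}})
print(s1.paths())
print(export_pairs([("S1.Address.ZIP", "S2.CustomerAddress.PostalCode")]))
```

With this separation, a matcher only ever sees `SchemaElement` trees; supporting another native format would mean adding an importer, not changing the matcher.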
In general, it is not possible to determine fully automat-
ically all matches between two schemas, primarily because
most schemas have some semantics that affects the match-
ing criteria but is not formally expressed or often even docu-
mented. The implementation of Match should therefore only
determine match candidates, which the user can accept, reject
or change. Furthermore, the user should be able to specify
matches for elements for which the system was unable to find
satisfactory match candidates.
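The interaction described above, in which the system proposes candidates and the user accepts, rejects, or changes them and may add matches of their own, could be organized as in the following Python sketch. The `Candidate` class, the decision labels, and `apply_review` are hypothetical and only show one way to structure the step.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Pair = Tuple[str, str]   # (S1 element, S2 element)


@dataclass(frozen=True)
class Candidate:
    s1_element: str
    s2_element: str
    similarity: float    # assumed to be normalized to the range 0-1


def apply_review(
    candidates: List[Candidate],
    decisions: Dict[Pair, str],
    changes: Dict[Pair, Pair],
    manual: List[Pair],
) -> List[Pair]:
    """Turn system-proposed candidates into a user-approved mapping.

    decisions maps a candidate pair to "accept", "reject" or "change";
    candidates without a decision are treated as pending and left out.
    """
    approved: List[Pair] = []
    for cand in candidates:
        pair = (cand.s1_element, cand.s2_element)
        verdict = decisions.get(pair)
        if verdict == "accept":
            approved.append(pair)
        elif verdict == "change":
            approved.append(changes[pair])   # pair as corrected by the user
        # rejected and undecided candidates are dropped
    # Matches specified by the user where the system found no good candidate.
    approved.extend(p for p in manual if p not in approved)
    return approved
```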
5. Classification of schema matching approaches
In this section we classify the major approaches to schema
matching. Fig. 2 shows part of our classification scheme together
with some sample approaches.
An implementation of Match may use multiple match al-
gorithms or matchers. This allows us to select the matchers
depending on the application domain and schema types. Given
that we want to use multiple matchers we distinguish two sub-
problems. First, there is the realization of individual matchers,
each of which computes a mapping based on a single match-
ing criterion. Second, there is the combination of individ-
ual matchers, either by using multiple matching criteria (e.g.,
name and type equality) within an integrated hybrid matcher
or by combining multiple match results produced by different
match algorithms within a composite matcher. For individual
matchers, we consider the following largely-orthogonal clas-
sification criteria:
• Instance vs schema: matching approaches can consider instance
  data (i.e., data contents) or only schema-level information.
• Element vs structure matching: match can be performed for
  individual schema elements, such as attributes, or for
  combinations of elements, such as complex schema structures.
• Language vs constraint: a matcher can use a linguistic-based
  approach (e.g., based on names and textual descriptions of
  schema elements) or a constraint-based approach (e.g., based
  on keys and relationships).
• Matching cardinality: the overall match result may relate one
  or more elements of one schema to one or more elements of the
  other, yielding four cases: 1:1, 1:n, n:1, n:m. In addition,
  each mapping element may interrelate one or more elements of
  the two schemas. Furthermore, there may be different match
  cardinalities at the instance level.
• Auxiliary information: most matchers rely not only on the
  input schemas S1 and S2 but also on auxiliary information,
  such as dictionaries, global schemas, previous matching
  decisions, and user input.
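The two ways of combining matchers introduced before the list of criteria can be contrasted in a short Python sketch. The element shape (a name and a data type), the two single-criterion matchers, and the averaging rule in the composite are all assumptions chosen for brevity; they are not prescribed by the classification.

```python
from typing import Callable, List, Tuple

Element = Tuple[str, str]                 # (name, data type), assumed shape
Matcher = Callable[[Element, Element], float]


def name_matcher(a: Element, b: Element) -> float:
    """Individual matcher using a single criterion: name equality."""
    return 1.0 if a[0].lower() == b[0].lower() else 0.0


def type_matcher(a: Element, b: Element) -> float:
    """Individual matcher using a single criterion: type equality."""
    return 1.0 if a[1] == b[1] else 0.0


def hybrid_matcher(a: Element, b: Element) -> float:
    """Hybrid: several criteria evaluated together inside one matcher."""
    return 1.0 if a[0].lower() == b[0].lower() and a[1] == b[1] else 0.0


def composite_matcher(matchers: List[Matcher]) -> Matcher:
    """Composite: run independent matchers, then combine their results.

    Averaging is only one possible combination rule, chosen for brevity.
    """
    def combined(a: Element, b: Element) -> float:
        return sum(m(a, b) for m in matchers) / len(matchers)
    return combined


city_s1 = ("City", "varchar")
city_s2 = ("city", "char")
composite = composite_matcher([name_matcher, type_matcher])
print(hybrid_matcher(city_s1, city_s2))   # 0.0: the type criterion fails
print(composite(city_s1, city_s2))        # 0.5: name agrees, type does not
```

The composite keeps each matcher independent, so matchers can be selected per application domain, whereas the hybrid fixes its criteria inside a single algorithm.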

[Figure 2 content: Schema Matching Approaches divide into Individual matcher approaches and Combining matchers. Individual matchers are Schema-only based (Element-level: Linguistic, Constraint-based; Structure-level: Constraint-based) or Instance/contents-based (Element-level: Linguistic, Constraint-based). Further criteria: Match cardinality, Auxiliary information used. Combining matchers are Hybrid matchers or Composite matchers (Manual composition, Automatic composition). Sample approaches: Name similarity, Description similarity, Global namespaces; Type similarity, Key properties; Graph matching; IR techniques (word frequencies, key terms); Value pattern and ranges.]
Fig. 2. Classification of schema matching approaches
Note that our classification does not distinguish between dif-
ferent types of schemas (relational, XML, object-oriented,
etc.) and their internal representation, because algorithms de-
pend mostly on the kind of information they exploit, not on
its representation.
In the following three sections, we discuss the main alternatives
according to the above classification criteria. We discuss
schema-level matching in Sect. 6, instance-level matching
in Sect. 7, and combinations of multiple matchers in Sect. 8.
6. Schema-level matchers
Schema-level matchers only consider schema information, not
instance data. The available information includes the usual
properties of schema elements, such as name, description,
data type, relationship types (part-of, is-a, etc.), constraints,
and schema structure. In general, a matcher will find multiple
match candidates. For each candidate, it is customary to esti-
mate the degree of similarity by a normalized numeric value
in the range 0–1, in order to identify the best match candidates
(as in [PSU98, BCV99, DDL00, CDD01]).
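As a minimal illustration of ranking match candidates by a normalized similarity value, the following Python sketch scores element names with `difflib.SequenceMatcher` from the standard library. This string measure is merely a stand-in for a real matcher, and the function names and the threshold of 0.5 are assumptions.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def name_similarity(a: str, b: str) -> float:
    """Similarity in the range 0-1; a stand-in for a real name matcher."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_candidates(
    element: str, targets: List[str], threshold: float = 0.5
) -> List[Tuple[str, float]]:
    """Return the targets whose similarity reaches the threshold, best first."""
    scored = [(t, name_similarity(element, t)) for t in targets]
    kept = [(t, s) for t, s in scored if s >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


for target, score in match_candidates("Address", ["Cname", "CAddress", "CPhone"], 0.0):
    print(f"{target}: {score:.2f}")
```

A purely syntactic measure of this kind would miss pairs such as ZIP and PostalCode, which is why linguistic matchers typically also draw on auxiliary information such as dictionaries and thesauri.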
We first discuss the main alternatives for match granularity
and match cardinality. Then we cover linguistic and constraint-based
matchers. Finally, we outline approaches based on the
reuse of auxiliary data, such as previously defined schemas
and previous match results.
6.1. Granularity of match (element-level vs structure-level)
We distinguish two main alternatives for the granularity of
Match, element-level and structure-level matching. For each
element of the first schema, element-level matching deter-
mines the matching elements in the second input schema. In
the simplest case, only elements at the finest level of granular-
ity are considered, which we call the atomic level, such as at-
tributes in an XML schema or columns in a relational schema.
For the schema fragments shown in Table 2, a sample atomic-level
match is “Address.ZIP ≅ CustomerAddress.PostalCode”
(recall that ≅ means “matches”).
Structure-level matching, on the other hand, refers to
matching combinations of elements that appear together in a
structure. A range of cases is possible, depending on how complete
and precise a match of the structure is required. In the
ideal case, all components of the structures in the two schemas
fully match. Alternatively, only some of the components may
be required to match (i.e., a partial structural match). Exam-
ples of the two cases are shown in Table 2. The need for partial
matches sometimes arises because subschemas of different domains
are being compared. For example, in the second row of
Table 2, AccountOwner may come from a finance database
while Customer comes from a sales database.
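The distinction between a full and a partial structural match can be made concrete with a small Python sketch over the fragments of Table 2. It assumes that the element-level matches are already given (here they are simply read off the table) and classifies the structure-level match by how many components are covered; the function name and the coverage measure are hypothetical.

```python
from typing import List, Set, Tuple


def structural_match(
    comps1: List[str], comps2: List[str], element_matches: Set[Tuple[str, str]]
) -> Tuple[str, float]:
    """Classify a structure-level match from given element-level matches.

    Returns ("full" | "partial" | "none", fraction of components matched).
    """
    relevant = [(a, b) for a, b in element_matches if a in comps1 and b in comps2]
    matched1 = {a for a, _ in relevant}
    matched2 = {b for _, b in relevant}
    total = len(comps1) + len(comps2)
    coverage = (len(matched1) + len(matched2)) / total if total else 0.0
    if relevant and len(matched1) == len(comps1) and len(matched2) == len(comps2):
        kind = "full"
    elif relevant:
        kind = "partial"
    else:
        kind = "none"
    return kind, coverage


address = ["Street", "City", "State", "ZIP"]
customer_address = ["Street", "City", "USState", "PostalCode"]
address_matches = {("Street", "Street"), ("City", "City"),
                   ("State", "USState"), ("ZIP", "PostalCode")}
print(structural_match(address, customer_address, address_matches))

account_owner = ["Name", "Address", "Birthdate", "TaxExempt"]
customer = ["Cname", "CAddress", "CPhone"]
owner_matches = {("Name", "Cname"), ("Address", "CAddress")}
print(structural_match(account_owner, customer, owner_matches))
```

In the second call, Birthdate, TaxExempt, and CPhone remain unmatched, so the result is a partial match, in line with the second row of Table 2.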
For more complex cases, the effectiveness of structure
matching can be enhanced by considering known equivalence
patterns, which may be kept in a library. One simple pattern
is shown in Fig. 3 relating two structures in an is-a hierarchy
to a single structure. The subclass of the first schema is repre-
sented by a Boolean attribute in the second schema. Another
well-known pattern consists of two structures interconnected
by a referential relationship being equivalent to a single struc-
ture (essentially, the join of the two). We will see an example
of this in Sect. 6.4.
Element-level matching is not restricted to the atomic level,
but may also be applied to coarser grained, higher (non-atomic)


