
COMA: a system for flexible combination of schema matching approaches

Hong-Hai Do¹, Erhard Rahm¹

20 Aug 2002, pp. 610-621

TL;DR: This work develops the COMA schema matching system as a platform to combine multiple matchers in a flexible way and uses COMA as a framework to comprehensively evaluate the effectiveness of different matchers and their combinations for real-world schemas.


Abstract: Schema matching is the task of finding semantic correspondences between elements of two schemas. It is needed in many database applications, such as integration of web data sources, data warehouse loading and XML message mapping. To reduce the amount of user effort as much as possible, automatic approaches combining several match techniques are required. While such match approaches have found considerable interest recently, the problem of how to best combine different match algorithms still requires further work. We have thus developed the COMA schema matching system as a platform to combine multiple matchers in a flexible way. We provide a large spectrum of individual matchers, in particular a novel approach aiming at reusing results from previous match operations, and several mechanisms to combine the results of matcher executions. We use COMA as a framework to comprehensively evaluate the effectiveness of different matchers and their combinations for real-world schemas. The results obtained so far show the superiority of combined match approaches and indicate the high value of reuse-oriented strategies.






COMA - A system for flexible combination of schema matching approaches

Hong-Hai Do, University of Leipzig, hong@informatik.uni-leipzig.de

Erhard Rahm, University of Leipzig, rahm@informatik.uni-leipzig.de


1 Introduction

Schema matching is the task of finding semantic correspondences between elements of two schemas [11, 12, 15]. It is a critical operation in many schema and data translation and integration applications, such as integration of web data sources, data warehouse loading, XML message mapping and XML-relational data mapping. Currently, schema matching is largely performed manually by domain experts, and is therefore a time-consuming and tedious process. In web-based applications and services, such a manual approach is a major limitation due to the rapidly increasing number of data sources, XML message and document schemas, and web service interfaces to be dealt with. Hence, approaches for automating the schema matching tasks as much as possible are badly needed to simplify and speed up the development, maintenance and use of such applications.

Numerous researchers have addressed the schema matching problem either for specific applications [1, 4, 5, 7, 8, 9, 11, 15, 16] or in a more generic way for different applications and schema languages [12, 13, 14]. The proposed techniques for automating schema matching exploit various types of schema information, e.g. element names, data types and structural properties [2, 12, 15, 16, 9] as well as characteristics of data instances [7, 8, 14, 11, 9]. Some approaches utilize auxiliary sources, such as taxonomies, dictionaries and thesauri [2, 9]. To achieve high match accuracy for a large variety of schemas, a single technique (e.g., name matching) is unlikely to be successful. Hence, it is necessary to combine different approaches in an effective way. For this purpose, previous prototypes have followed either a so-called hybrid or composite combination of match techniques [18]. So far the hybrid approach is most common, where different match criteria or properties (e.g., name and data type) are used within a single algorithm. Typically these criteria are fixed and used in a specific way. By contrast, a composite match approach combines the results of several independently executed match algorithms, which can be simple (based on a single match criterion) or hybrid. This allows for a high flexibility, as there is the potential for selecting the match algorithms to be executed based on the match task at hand. Moreover, there are different possibilities for combining the individual match results. We know of only three recent systems following such a composite approach [7, 8, 9]. They are all limited to match techniques based on machine learning and do not fully utilize the flexibility offered by the composite approach (see Section 2).

To investigate the effectiveness of composite match approaches more comprehensively we have developed the COMA system for combining match algorithms in a flexible way. COMA represents a generic match system supporting different applications and multiple schema types such as XML and relational schemas. It provides an extensible library of match algorithms and supports different ways for combining match results. New match algorithms can be included in the library and used in combination with other matchers. COMA thus allows us to tailor match strategies by selecting the match algorithms and their combination for a given match problem. Moreover, we use COMA as an evaluation platform to systematically examine and compare the effectiveness of different matchers and combination strategies. In the design of COMA we observed that in general fully automatic solutions to the match problem are not possible due to the potentially high degrees of semantic heterogeneity between schemas. We thus allow an interactive and iterative match process during which the user can provide feedback, e.g. to manually provide match correspondences or to confirm or reject proposed matches.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, requires a fee and/or special permission from the Endowment.

Proceedings of the 28th VLDB Conference, Hong Kong, China, 2002

As another contribution we propose a new match approach that aims at reusing previously obtained match results, motivated by the observation that many schemas to be matched are very similar to previously matched schemas. Reusing the previous match results may thus result in significant savings of manual effort. A simple form of such an approach is the use of synonym tables indicating match correspondences at the level of single schema elements. Our new approach tries to reuse match results at the level of entire schemas or schema fragments. The flexibility of COMA is made possible by the use of a DBMS-based repository for storing schemas, intermediate similarity results of individual matchers, and complete (possibly user-confirmed) match results for later reuse.

The paper is organized as follows. In Section 2 we discuss some related work. Section 3 provides an overview of COMA. In Sections 4 and 5 we present the supported matchers including the reuse-oriented approach. Section 6 outlines the strategies for matcher combination. Section 7 presents the results of using COMA for evaluating different strategies for matching real-world schemas. Finally, we conclude and discuss some future work.

2 Related work

A recent survey on automatic schema matching proposed a solution taxonomy differentiating between schema- and instance-level, element- and structure-level, and language- and constraint-based matching approaches [18, 12]. Furthermore, the distinction between hybrid and composite combination of matchers is introduced and previous match prototypes such as Cupid [12], SemInt [11], LSD [7], Dike [16], SF [13], TranScm [15], and Momis [2] are reviewed.

Cupid [12] represents a sophisticated hybrid match approach combining a name matcher with a structural match algorithm, which derives the similarity of elements based on the similarity of their components, thereby emphasizing the name and data type similarities present at the finest level of granularity (leaf level). In a comparative evaluation Cupid was generally more effective than two earlier match prototypes (Dike and Momis).

LSD (Learning Source Description) [7] and its extension GLUE [8] represent powerful composite approaches to combining different matchers. Both use machine-learning techniques for individual matchers and an automatic combination of match results. Machine learning is a promising technique especially for evaluating data instances to predict element similarity. On the other hand, the accuracy of the predictions depends on a suitable training which can incur a substantial manual effort. The predictions of individual matchers are combined by a so-called meta-learner, which weights the predictions from a matcher according to its accuracy shown during the training phase. In various experiments LSD and GLUE showed promising results, albeit based on an accuracy metric that is not well defined and apparently does not take wrongly proposed match correspondences into account.

In [9], Embley et al. describe another composite approach based on machine learning. In addition to instance-level matchers a name matcher is supported requiring an external dictionary (WordNet). The predictions of the individual matchers are combined using an average function. Like LSD and GLUE, a training phase is needed.

The evaluation of the structural match algorithm SF (Similarity Flooding) in [13] used a more realistic metric for measuring the match accuracy than previous studies. It takes into account both the share of correctly proposed match candidates and wrongly suggested match candidates. In our evaluation we will also use this refined metric (Section 7).

To sum up, the composite approach has so far only been studied in the context of machine learning approaches focusing on instance-level matchers and using a specific combination of match results. By contrast we want to support and evaluate a spectrum of matchers not confined to machine learning as well as the customizable combination of their results. A systematic comparative evaluation of different match algorithms and their combinations based on well-defined accuracy metrics does not exist so far. To our knowledge, beyond the use of simple synonym tables the reuse of previous match results has not yet been studied.

3 Overview of COMA

A schema consists of a set of elements, such as relational tables and columns or XML elements and attributes. In COMA we represent schemas by rooted directed acyclic graphs. Schema elements are represented by graph nodes connected by directed links of different types, e.g. for containment and referential relationships. Schemas are imported from external sources, e.g. relational databases or XML files, into the internal format on which all match algorithms operate. Figure 1 shows our running examples, a relational and an XML schema for purchase orders (PO), and their internal graph representation.

The match operation takes as input two schemas and determines a mapping indicating which elements of the input schemas logically correspond to each other, i.e. match. The match result is a set of mapping elements specifying the matching schema elements together with a similarity value between 0 (strong dissimilarity) and 1 (strong similarity) indicating the plausibility of their correspondence. Similar to previous work, we focus on one-to-one (1:1) match relationships. However, match algorithms may determine multiple match candidates with different similarities for a schema element and finally select one of them or leave the final choice to the user.

Figure 2 illustrates match processing in COMA on two input schemas S1 and S2. Match processing takes place in either one or multiple iterations depending on whether an automatic or interactive determination of match candidates is to be performed. Each match iteration consists of three phases: an optional user feedback phase, the execution of different matchers and the combination of the individual match results. In interactive mode, the user can interact with COMA for each iteration to specify the match strategy (selection of matchers, of strategies to combine individual match results), define match or mismatch relationships, and accept or reject match candidates proposed in the previous iteration. The interactive approach is useful to test and compare different match strategies for specific schemas and to continuously refine and improve the match result. In automatic mode, the match process consists of a single match iteration for which either a default strategy or a strategy specified by input parameters is applied. This mode is especially useful for applications already knowing their most suitable match strategy or implementing their own user interaction interface.

We now describe the steps of the match process in more detail. After being converted to the internal graph format introduced above, the schemas are traversed to determine all schema elements for which the match algorithms calculate the similarity values. We represent schema elements by their paths, i.e. sequences of nodes following the containment links from the root to the corresponding nodes. Shared schema fragments or elements, such as Address in PO2, will result in multiple paths for which we can independently determine match candidates.
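The path-based element representation can be illustrated with a minimal sketch. It assumes a plain adjacency list for the containment links of PO2 from Figure 1; the names `PO2_CONTAINS` and `enumerate_paths` are hypothetical and not part of COMA.

```python
# Hypothetical adjacency list for the containment links of PO2 (Figure 1).
PO2_CONTAINS = {
    "PO2": ["DeliverTo", "BillTo"],
    "DeliverTo": ["Address"],
    "BillTo": ["Address"],          # Address is shared by DeliverTo and BillTo
    "Address": ["Street", "City", "Zip"],
}


def enumerate_paths(graph, root):
    """Return every containment path from the root, one per way of reaching a node."""
    paths = []
    stack = [(root,)]
    while stack:
        path = stack.pop()
        paths.append(".".join(path))
        # Push children in reverse so they are visited in declaration order.
        for child in reversed(graph.get(path[-1], [])):
            stack.append(path + (child,))
    return paths


paths = enumerate_paths(PO2_CONTAINS, "PO2")
for p in paths:
    print(p)
```

Because Address is reachable through both DeliverTo and BillTo, its elements appear twice, e.g. as PO2.DeliverTo.Address.City and PO2.BillTo.Address.City, and each path can receive its own match candidates.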

COMA supports user interaction by a so-called UserFeedback matcher to capture match and mismatch information provided by the user including corrected match results from the previous match iteration. This matcher ensures that approved matches (and mismatches) are assigned the maximal (and minimal) similarity and that these values remain unaffected by the other matchers during the matcher execution step. The user-provided similarity values influence the similarity computations for the neighbourhood of the respective elements and can thus improve the match accuracy of structural matchers.
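The effect of user feedback on similarity values can be sketched as follows. This is only an illustration under the assumption that feedback is applied as an override on a dictionary of pairwise similarities; the function `apply_user_feedback` is hypothetical and does not reproduce how COMA implements the UserFeedback matcher internally.

```python
def apply_user_feedback(similarities, approved=(), rejected=()):
    """Return a copy in which approved pairs are pinned to 1.0 and rejected pairs to 0.0."""
    pinned = dict(similarities)
    pinned.update({pair: 1.0 for pair in approved})   # maximal similarity
    pinned.update({pair: 0.0 for pair in rejected})   # minimal similarity
    return pinned


CITY = "PO2.DeliverTo.Address.City"
proposed = {
    ("PO1.ShipTo.shipToCity", CITY): 0.72,
    ("PO1.Customer.custCity", CITY): 0.67,
}
confirmed = apply_user_feedback(
    proposed,
    approved=[("PO1.ShipTo.shipToCity", CITY)],
    rejected=[("PO1.Customer.custCity", CITY)],
)
print(confirmed)
```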

A main step during a match iteration is the execution of multiple independent matchers chosen from the matcher library. The matchers currently supported fall into three classes: simple, hybrid and reuse-oriented matchers. They exploit different kinds of schema information, such as names, data types, and structural properties, or auxiliary information, such as synonym tables and previous match results. Each matcher determines an intermediate match result consisting of a similarity value between 0 and 1 for each combination of S1 and S2 schema elements. The result of the matcher execution phase with k matchers, m S1 elements and n S2 elements is a k × m × n cube of similarity values, which is stored in the repository for later combination and selection steps. Table 1 shows a sample extract from the similarity cube for the purchase order schemas of Figure 1.

Matcher    PO1 elements              PO2 elements                 Sim
TypeName   PO1.ShipTo.shipToCity     PO2.DeliverTo.Address.City   0.65
TypeName   PO1.ShipTo.shipToStreet   PO2.DeliverTo.Address.City   0.3
TypeName   PO1.Customer.custCity     PO2.DeliverTo.Address.City   0.80
NamePath   PO1.ShipTo.shipToCity     PO2.DeliverTo.Address.City   0.78
NamePath   PO1.ShipTo.shipToStreet   PO2.DeliverTo.Address.City   0.73
NamePath   PO1.Customer.custCity     PO2.DeliverTo.Address.City   0.53

Table 1. Similarity values computed for PO1 and PO2
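As an illustration of the k × m × n similarity cube, the extract of Table 1 can be held in a nested list indexed by matcher, S1 element and S2 element. This is a minimal sketch; the in-memory layout and the helper names `cube_shape` and `lookup` are assumptions, since the paper stores the cube in a DBMS-based repository.

```python
matchers = ["TypeName", "NamePath"]
s1_elements = [
    "PO1.ShipTo.shipToCity",
    "PO1.ShipTo.shipToStreet",
    "PO1.Customer.custCity",
]
s2_elements = ["PO2.DeliverTo.Address.City"]

# cube[k][i][j]: similarity computed by matchers[k] for (s1_elements[i], s2_elements[j]).
cube = [
    [[0.65], [0.30], [0.80]],  # TypeName
    [[0.78], [0.73], [0.53]],  # NamePath
]


def cube_shape(c):
    """Return (k, m, n) for a rectangular similarity cube."""
    return (len(c), len(c[0]), len(c[0][0]))


def lookup(matcher, s1, s2):
    """Similarity that one matcher assigned to one pair of schema elements."""
    return cube[matchers.index(matcher)][s1_elements.index(s1)][s2_elements.index(s2)]


print(cube_shape(cube))
print(lookup("NamePath", "PO1.Customer.custCity", "PO2.DeliverTo.Address.City"))
```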

The final step in a match iteration is to derive the combined match result from the individual matcher results stored in the similarity cube. This is achieved in two sub-steps: aggregation of matcher-specific results and selection of match candidates. First, for each combination of schema elements the matcher-specific similarity values are aggregated into a combined similarity value, e.g. by taking the average or maximum value. Table 2 shows the result of this step for the example of Table 1 using the average strategy. Second, we apply a selection strategy to choose the match candidates for a schema element, e.g. by selecting the elements of the other schema with the best similarity value exceeding a certain threshold. For the example in Table 2 we could thus determine PO1.ShipTo.shipToCity as the match candidate of PO2.DeliverTo.Address.City.

PO1 elements              PO2 elements                 Combined sim
PO1.ShipTo.shipToCity     PO2.DeliverTo.Address.City   0.72
PO1.Customer.custCity     PO2.DeliverTo.Address.City   0.67
PO1.ShipTo.shipToStreet   PO2.DeliverTo.Address.City   0.52

Table 2. Similarity values combined from Table 1

a) A relational schema and an XML schema

    CREATE TABLE PO1.ShipTo (
      poNo INT,
      custNo INT REFERENCES PO1.Customer,
      shipToStreet VARCHAR(200),
      shipToCity VARCHAR(200),
      shipToZip VARCHAR(20),
      PRIMARY KEY (poNo) ) ;

    CREATE TABLE PO1.Customer (
      custNo INT,
      custName VARCHAR(200),
      custStreet VARCHAR(200),
      custCity VARCHAR(200),
      custZip VARCHAR(20),
      PRIMARY KEY (custNo) ) ;

    <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
      <xsd:complexType name="PO2" >
        <xsd:sequence>
          <xsd:element name="DeliverTo" type="Address"/>
          <xsd:element name="BillTo" type="Address"/>
        </xsd:sequence>
      </xsd:complexType>
      <xsd:complexType name="Address" >
        <xsd:sequence>
          <xsd:element name="Street" type="xsd:string"/>
          <xsd:element name="City" type="xsd:string"/>
          <xsd:element name="Zip" type="xsd:decimal"/>
        </xsd:sequence>
      </xsd:complexType>
    </xsd:schema>

b) Their corresponding graph representation

[Graph: PO1 contains ShipTo (poNo, custNo, shipToStreet, shipToCity, shipToZip) and Customer (custNo, custName, custStreet, custCity, custZip). PO2 contains DeliverTo and BillTo, which both contain the shared node Address (Street, City, Zip). Legend: node, containment link.]

Figure 1. External and internal schema representation
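The two sub-steps, aggregation and selection, can be sketched with the values of Table 1. The code below is a minimal illustration: the function names `aggregate` and `select_best` and the threshold of 0.5 are assumptions, as the text does not fix a threshold for this example.

```python
from statistics import mean

CITY = "PO2.DeliverTo.Address.City"

# Matcher-specific similarities from Table 1, keyed by (PO1 element, PO2 element).
table1 = {
    "TypeName": {
        ("PO1.ShipTo.shipToCity", CITY): 0.65,
        ("PO1.ShipTo.shipToStreet", CITY): 0.30,
        ("PO1.Customer.custCity", CITY): 0.80,
    },
    "NamePath": {
        ("PO1.ShipTo.shipToCity", CITY): 0.78,
        ("PO1.ShipTo.shipToStreet", CITY): 0.73,
        ("PO1.Customer.custCity", CITY): 0.53,
    },
}


def aggregate(matcher_results, combine=mean):
    """Combine the per-matcher values of each element pair (Average by default)."""
    pairs = set().union(*(r.keys() for r in matcher_results.values()))
    return {
        pair: combine([r[pair] for r in matcher_results.values()])
        for pair in pairs
    }


def select_best(combined, s2_element, threshold=0.5):
    """Pick the S1 element with the highest combined similarity above the threshold."""
    candidates = {
        s1: sim
        for (s1, s2), sim in combined.items()
        if s2 == s2_element and sim > threshold
    }
    return max(candidates, key=candidates.get) if candidates else None


combined = aggregate(table1)
best = select_best(combined, CITY)
print(best)
```

Passing `combine=max` instead of the default would switch the aggregation from the average to the maximum strategy mentioned in the text.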

COMA supports the determination of undirectional or directional match results. In the former case, match candidates are determined for both input schemas. Moreover, an S1 element s1 is only accepted as a match candidate for an S2 element s2 if s2 is also a match candidate of s1. For instance, in the above example we would accept PO1.ShipTo.shipToCity as the match candidate of PO2.DeliverTo.Address.City only if there are no better PO2 match candidates for PO1.ShipTo.shipToCity than PO2.DeliverTo.Address.City. In the case of a directional match, the goal is to find all match candidates only with respect to one of the schemas, say S2. Hence, we only try to find match candidates for S2 elements while accepting that S1 elements remain unmatched. This approach has been followed by most previous studies and is motivated by the fact that many applications require such a directional match (e.g., to integrate a new data source with schema S1 into a data warehouse or mediator with global schema S2). If the target schema S2 is small compared to S1 the match problem is substantially simplified.
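The difference between the two kinds of match results can be sketched as follows. The example is a simplification that keeps only the single best candidate per element; the values for City are taken from Table 2, while the Street column and the short element names are invented for illustration.

```python
def best_partner(combined, element, side):
    """Best-scoring partner of `element`; side 0 means an S1 element, side 1 an S2 element."""
    scores = {
        pair[1 - side]: sim for pair, sim in combined.items() if pair[side] == element
    }
    return max(scores, key=scores.get) if scores else None


def match_both(combined):
    """Undirectional match: keep (s1, s2) only if each is the other's best candidate."""
    result = {}
    for s1 in {pair[0] for pair in combined}:
        s2 = best_partner(combined, s1, 0)
        if best_partner(combined, s2, 1) == s1:
            result[s1] = s2
    return result


def match_directional(combined):
    """Directional match: find a candidate for every S2 element only."""
    return {s2: best_partner(combined, s2, 1) for s2 in {pair[1] for pair in combined}}


combined = {
    ("shipToCity", "City"): 0.72, ("shipToCity", "Street"): 0.40,
    ("custCity", "City"): 0.67, ("custCity", "Street"): 0.35,
    ("shipToStreet", "City"): 0.52, ("shipToStreet", "Street"): 0.75,
}
both = match_both(combined)
directional = match_directional(combined)
print(both)
print(directional)
```

In this sketch custCity stays unmatched in the undirectional result because City prefers shipToCity, which mirrors the acceptance condition described above.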

4 Matcher library

Table 3 gives an overview of the matchers we have implemented and tested so far. We characterize the kinds of schema and auxiliary information they exploit. In the following we first describe the simple matchers followed by the hybrid matchers. The more complex reuse-oriented matcher Schema is discussed in Section 5.

4.1 Simple matchers

Element names represent an important source for assessing similarity between schema elements. This can be done syntactically by comparing the name strings or semantically by comparing their meanings. Approximate string matching techniques [10] have already been employed in other fields, such as record linkage [20] and data cleaning [19], to detect duplicate database records concerning the same real-world entity, i.e. matching at the instance level. In COMA, we have implemented four simple approximate string matchers:

Affix: This matcher looks for common affixes, i.e. both prefixes and suffixes, between two name strings.

n-gram: Strings are compared according to their set of n-grams, i.e. sequences of n characters, leading to different variants of this matcher, e.g. Digram (2), Trigram (3).

EditDistance: String similarity is computed from the number of edit operations necessary to transform one string to another one (the Levenshtein metric [10]).

Soundex: This matcher computes the phonetic similarity between names from their corresponding soundex codes.
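Two of these string matchers can be sketched in a few lines. The text does not state how the n-gram overlap or the edit distance is normalized to a value between 0 and 1, so the Dice coefficient and the division by the longer string length used below are assumptions, as are the function names.

```python
def ngrams(s, n=3):
    """Set of all character n-grams of a lower-cased string."""
    s = s.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Dice coefficient over n-gram sets (normalization is an assumption)."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga or not gb:
        return 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))


def edit_distance(a, b):
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def edit_similarity(a, b):
    """1 minus the distance relative to the longer string (normalization is an assumption)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1 - edit_distance(a.lower(), b.lower()) / longest


print(ngram_similarity("shipToCity", "City"))
print(edit_similarity("custStreet", "Street"))
```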

Further simple matchers are UserFeedback (Section 3), a semantic matcher, Synonym, and a DataType matcher:

Synonym: This matcher estimates the similarity between element names by looking up the terminological relationships in a specified dictionary. Currently, it simply uses relationship-specific similarity values, e.g., 1.0 for a synonymy and 0.8 for a hypernymy relationship.

DataType: This matcher uses a synonym table specifying the degree of compatibility between a set of predefined generic data types, to which data types of schema elements are mapped in order to determine their similarity.
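Both matchers amount to table lookups, which the following sketch illustrates. Only the values 1.0 for synonymy and 0.8 for hypernymy come from the text; the dictionary entries, the generic type names and the compatibility degrees are invented for illustration.

```python
# Relationship-specific similarity values quoted in the text.
RELATION_SIMILARITY = {"synonym": 1.0, "hypernym": 0.8}

# Hypothetical dictionary of terminological relationships between (lower-cased) terms.
DICTIONARY = {
    frozenset({"ship", "deliver"}): "synonym",
    frozenset({"customer", "person"}): "hypernym",
}

# Hypothetical mapping of concrete data types to generic ones, and their compatibility.
GENERIC_TYPE = {
    "varchar": "string", "xsd:string": "string",
    "int": "integer", "xsd:decimal": "decimal",
}
TYPE_COMPATIBILITY = {
    frozenset({"integer", "decimal"}): 0.8,
    frozenset({"string", "decimal"}): 0.4,
}


def synonym_similarity(a, b):
    """Similarity of two names based on their relationship in the dictionary."""
    a, b = a.lower(), b.lower()
    if a == b:
        return 1.0
    relation = DICTIONARY.get(frozenset({a, b}))
    return RELATION_SIMILARITY.get(relation, 0.0)


def datatype_similarity(t1, t2):
    """Compatibility of two data types after mapping them to generic types."""
    g1 = GENERIC_TYPE.get(t1.lower().split("(")[0].strip())  # drops e.g. "(200)"
    g2 = GENERIC_TYPE.get(t2.lower().split("(")[0].strip())
    if g1 is None or g2 is None:
        return 0.0
    if g1 == g2:
        return 1.0
    return TYPE_COMPATIBILITY.get(frozenset({g1, g2}), 0.0)


print(synonym_similarity("Ship", "Deliver"))
print(datatype_similarity("VARCHAR(20)", "xsd:decimal"))
```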

Matcher Type     Matcher        Schema Info        Auxiliary Info
Simple           Affix          Element names      -
Simple           n-gram         Element names      -
Simple           Soundex        Element names      -
Simple           EditDistance   Element names      -
Simple           Synonym        Element names      Extern. dictionaries
Simple           DataType       Data types         Data type compatibility table
Simple           UserFeedback   -                  User-specified (mis-)matches
Hybrid           Name           Element names      -
Hybrid           NamePath       Names+Paths        -
Hybrid           TypeName       Data types+Names   -
Hybrid           Children       Child elements     -
Hybrid           Leaves         Leaf elements      -
Reuse-oriented   Schema         -                  Existing schema-level match results

Table 3. Implemented matchers in the matcher library

4.2 Hybrid matchers

The hybrid matchers use a fixed combination of simple matchers and other hybrid matchers to obtain more accurate similarity values. The approach applied for combining the results of the constituent matchers follows the same principles used for combining the matcher results in the final phase of the match process (or iteration). The details of how matchers are combined within a hybrid matcher are explained in Section 6.

We currently support two hybrid element-level matchers, Name and TypeName, and three hybrid structural matchers, NamePath, Children and Leaves. All approaches rely to different degrees on similarities derived from element names, for which combinations of the simple matchers discussed above (e.g. Synonym) can be utilized.

[Diagram: schemas S1 and S2 enter via Schema Import into a Match Iteration consisting of User Interaction (optional, UserFeedback), Matcher execution (Matcher 1, Matcher 2, Matcher 3, filling the similarity cube) and Combination of match results (S1→S2, S2→S1), which yields the Mapping. Matcher Library: simple matchers (n-gram, Synonym, ...), hybrid matchers (NamePath, TypeName, ...), reuse-oriented matchers (Schema, ...). Combination Strategies: aggregation of matcher-specific results (Max, Average, Weighted, Min); match direction (SmallLarge, LargeSmall, Both); match candidate selection (Threshold, MaxN, MaxDelta).]

Figure 2. Match processing in COMA

Name: This matcher only considers the element names but is a hybrid approach because it combines different simple name matchers. It performs some pre-processing steps, in particular a tokenization to derive a set of components (tokens) of a name, e.g. POShipTo → {PO, Ship, To}. Moreover it expands abbreviations and acronyms, e.g. PO → {Purchase, Order}. The Name matcher then applies multiple simple matchers, such as Affix, Trigram, and Synonym, on the token sets of the names and combines the obtained similarity values for tokens to derive similarity values between element names (see Section 6).
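The pre-processing steps of the Name matcher can be sketched with a regular expression. The tokenization rule (split at case changes, keep runs of capitals together) and the abbreviation table are assumptions chosen to reproduce the two examples in the text; `tokenize` and `expand` are hypothetical names.

```python
import re

# A run of capitals not followed by a lower-case letter (acronym), a capitalized
# word, a lower-case word, or a run of digits.
TOKEN_PATTERN = re.compile(r"[A-Z]+(?![a-z])|[A-Z][a-z]+|[a-z]+|\d+")

# Hypothetical abbreviation table; the PO entry is the example from the text.
ABBREVIATIONS = {"PO": ["Purchase", "Order"]}


def tokenize(name):
    """Split an element name into its component tokens."""
    return TOKEN_PATTERN.findall(name)


def expand(tokens):
    """Replace known abbreviations and acronyms by their expansions."""
    expanded = []
    for token in tokens:
        expanded.extend(ABBREVIATIONS.get(token, [token]))
    return expanded


tokens = tokenize("POShipTo")
print(tokens)          # ['PO', 'Ship', 'To']
print(expand(tokens))  # ['Purchase', 'Order', 'Ship', 'To']
```

The resulting token sets are what the simple matchers would then be applied to.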

NamePath: This matcher matches elements based on their hierarchical names, i.e. both structural aspects and element names are considered. It first builds a long name by concatenating all names of the elements in a path to a single string. It then applies Name to compute the similarity between these long names. Considering the complete name path of an element provides additional tokens for name matching which may improve match accuracy. For instance, this can be helpful to find match candidates at different schema levels, e.g. PurchaseOrder.ShipTo.Street and PurchaseOrder.shipToStreet. On the other hand, it is possible to distinguish between different contexts of the same element, e.g. ShipTo.Street and BillTo.Street.
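The two effects named for NamePath can be made concrete with a small sketch. It replaces the Name matcher by a plain token-set overlap (Jaccard coefficient), which is a deliberate simplification and not the combination COMA applies; the helper names are hypothetical.

```python
import re


def path_tokens(path):
    """Lower-cased token set over all element names along a dotted path."""
    tokens = re.findall(r"[A-Z]+(?![a-z])|[A-Z][a-z]+|[a-z]+", path)
    return {t.lower() for t in tokens}


def token_overlap(path1, path2):
    """Jaccard overlap of the two token sets (stand-in for the Name matcher)."""
    t1, t2 = path_tokens(path1), path_tokens(path2)
    if not t1 or not t2:
        return 0.0
    return len(t1 & t2) / len(t1 | t2)


# Same tokens although the elements sit at different schema levels.
print(token_overlap("PurchaseOrder.ShipTo.Street", "PurchaseOrder.shipToStreet"))  # 1.0
# The leaf name Street is identical, but the contexts differ.
print(token_overlap("ShipTo.Street", "BillTo.Street"))  # 0.5
```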

TypeName: This element matcher combines the DataType and Name matcher, i.e. it matches elements based on a combination of their name and data type similarity.

Children: This structural matcher is used in combination with a leaf-level matcher. It determines the similarity between two inner elements based on the combined similarity between their child elements, which in turn can be both inner and leaf elements. The similarity between the inner elements needs to be recursively computed from the similarity between their respective children. The similarity between the leaf elements is obtained from the leaf-level matcher, for which TypeName is used as the default.

Leaves: This structural matcher is also used in combination with a leaf-level matcher, for which TypeName is set as the default. In contrast to the Children strategy, this matcher only considers the leaf elements to estimate the similarity between two inner elements. This strategy aims at more stable similarity in cases of structural conflicts. In Figure 1, for example, elements shipToStreet, shipToCity, etc., are children of ShipTo in PO1, while in PO2, their matching elements are not children of DeliverTo, but of Address. Children will therefore only find a correspondence between ShipTo and Address, while Leaves can also identify a correspondence between ShipTo and DeliverTo.
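The contrast between Children and Leaves can be reproduced on the Figure 1 example with a small sketch. Several parts are assumptions made only for illustration: the leaf-level matcher is replaced by a crude name-suffix test instead of TypeName, set similarity is taken as the average of best matches in both directions, and a leaf compared with an inner element scores 0.

```python
PO1 = {"ShipTo": ["poNo", "custNo", "shipToStreet", "shipToCity", "shipToZip"]}
PO2 = {"DeliverTo": ["Address"], "Address": ["Street", "City", "Zip"]}


def leaf_similarity(a, b):
    """Stand-in leaf matcher: 1.0 if one name ends with the other, else 0.0."""
    a, b = a.lower(), b.lower()
    return 1.0 if a.endswith(b) or b.endswith(a) else 0.0


def set_similarity(xs, ys, sim):
    """Average of the best match per element, taken in both directions (assumed)."""
    if not xs or not ys:
        return 0.0
    best_x = [max(sim(x, y) for y in ys) for x in xs]
    best_y = [max(sim(x, y) for x in xs) for y in ys]
    return (sum(best_x) + sum(best_y)) / (len(xs) + len(ys))


def leaves(graph, node):
    """All leaf elements below a node (the node itself if it has no children)."""
    kids = graph.get(node, [])
    if not kids:
        return [node]
    return [leaf for kid in kids for leaf in leaves(graph, kid)]


def children_similarity(g1, n1, g2, n2):
    """Recursive similarity over direct children; leaves fall back to the leaf matcher."""
    k1, k2 = g1.get(n1, []), g2.get(n2, [])
    if not k1 and not k2:
        return leaf_similarity(n1, n2)
    if not k1 or not k2:
        return 0.0  # assumption: a leaf and an inner element do not match
    return set_similarity(k1, k2, lambda a, b: children_similarity(g1, a, g2, b))


def leaves_similarity(g1, n1, g2, n2):
    """Similarity of two inner elements based on their leaf elements only."""
    return set_similarity(leaves(g1, n1), leaves(g2, n2), leaf_similarity)


print(children_similarity(PO1, "ShipTo", PO2, "DeliverTo"))  # 0.0
print(children_similarity(PO1, "ShipTo", PO2, "Address"))    # 0.75
print(leaves_similarity(PO1, "ShipTo", PO2, "DeliverTo"))    # 0.75
```

Under these assumptions the Children variant relates ShipTo only to Address, while the Leaves variant also relates ShipTo to DeliverTo, in line with the description above.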

5 Reuse of previous match results

The consideration of reuse-oriented matchers is motivated by our expectation that many schemas to be matched are similar (or identical) to previously matched schemas. The use of auxiliary information such as synonym dictionaries and thesauri already represents such a reuse-oriented approach utilizing confirmed correspondences at the level of schema elements (names or data types). Our goal is to generalize this idea and reuse multiple match correspondences at the same time at the levels of schema fragments or entire schemas.

As a first step, we have implemented two simple reuse-oriented matchers that can be invoked and combined like other matchers. One of them, Schema, tries to reuse match results for entire schemas, the other, Fragment, operates on schema fragments. In both cases we use a special compose operation, MatchCompose, to derive a new match result from existing ones. We first introduce MatchCompose. Due to lack of space, we then only describe Schema.

5.1 The MatchCompose operation

Given two match results, match1: S1↔S2 and match2: S2↔S3 sharing schema S2, MatchCompose derives a new match result, match: S1↔S3, between S1 and S3. The operation assumes a transitive nature of the similarity relation between elements, i.e. if a is similar to b and b to c, then a is (very likely) also similar to c. Of course wrong match candidates may be determined in cases where the transitivity property does not hold.

In the context of information retrieval, transitive similarity estimations have been applied to derive the similarity of words based on terminological relationships, such as synonymy and hypernymy [4, 17]. A common approach to determine the transitive similarity is to multiply the individual similarity values [2]. This approach, however, may lead to rapidly degrading similarity values. For instance, for

$$\text{contactFirstName} \;\xleftrightarrow{0.5}\; \text{Name} \;\xleftrightarrow{0.7}\; \text{firstName}$$

the similarity between contactFirstName and firstName would become 0.5*0.7=0.35, which is unlikely to reflect the similarity we would expect for the two names. We thus prefer to calculate transitive similarities with the strategies used for combining the results of different matchers, such as Average (Section 6.1), which yields a similarity value of 0.6 in this example.
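The difference between the two ways of deriving a transitive similarity can be checked with a few lines of Python. This is only an illustrative sketch: the function names `transitive_product` and `transitive_average` are hypothetical and not part of COMA, and the chain values are the ones from the contactFirstName example.

```python
from math import prod
from statistics import mean


def transitive_product(sims):
    """Multiply the similarities along a chain (the common approach)."""
    return prod(sims)


def transitive_average(sims):
    """Average the similarities along a chain (the Average strategy)."""
    return mean(sims)


# Chain from the text: contactFirstName <-0.5-> Name <-0.7-> firstName
chain = [0.5, 0.7]
product_sim = transitive_product(chain)
average_sim = transitive_average(chain)

print(f"product: {product_sim:.2f}")
print(f"average: {average_sim:.2f}")
```

With multiplication, every additional hop can only lower the value, whereas the average stays within the range of the individual similarities.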

Figure 3a and b illustrate the approach for the match

PO1↔PO3 derived from composing the two match re-

Figure 3. MatchCompose example. a) match1: PO1↔PO2 and match2: PO2↔PO3. b) match = MatchCompose(match1, match2). c) Relational representation for MatchCompose, with similarities combined by Average. Legend: containment links, element correspondences; ovals denote mappings.

match1 (PO1, PO2, sim12): (Email, e-mail, 1.0), (Name, name, 1.0)

match2 (PO2, PO3, sim23): (e-mail, email, 1.0), (name, firstName, 0.6), (name, lastName, 0.6)

match (PO1, PO3, sim13): (Email, email, 1.0), (Name, firstName, 0.8), (Name, lastName, 0.8)
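The relational view in Figure 3c suggests reading MatchCompose as a join of the two match results on the shared schema PO2, followed by a combination of the two similarity values. The following Python sketch illustrates this reading under stated assumptions: the tuple representation, the function name `match_compose`, and the rule of keeping the best value when several paths lead to the same element pair are all hypothetical choices made for illustration, not the COMA implementation.

```python
from collections import defaultdict
from statistics import mean

# A match result is modelled as a list of (source, target, similarity) rows.
match1 = [  # PO1 <-> PO2
    ("Email", "e-mail", 1.0),
    ("Name", "name", 1.0),
]
match2 = [  # PO2 <-> PO3
    ("e-mail", "email", 1.0),
    ("name", "firstName", 0.6),
    ("name", "lastName", 0.6),
]


def match_compose(m1, m2, combine=mean):
    """Join m1 and m2 on the shared middle element and combine similarities."""
    # Index the second match result by its source (the shared schema element).
    by_middle = defaultdict(list)
    for middle, target, sim in m2:
        by_middle[middle].append((target, sim))

    composed = {}
    for source, middle, sim_1 in m1:
        for target, sim_2 in by_middle.get(middle, []):
            sim = combine([sim_1, sim_2])
            # Assumption: if several paths connect the same pair, keep the best.
            key = (source, target)
            composed[key] = max(sim, composed.get(key, 0.0))
    return [(s, t, sim) for (s, t), sim in composed.items()]


match = match_compose(match1, match2)
for source, target, sim in match:
    print(f"{source} <-> {target}: {sim:.1f}")
```

Passing `combine=math.prod` instead of the default would give the multiplicative variant discussed in the text.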


Frequently Asked Questions (10)

Q1. What have the authors contributed in "Coma - a system for flexible combination of schema matching approaches" ?

The authors provide a large spectrum of individual matchers, in particular a novel approach aiming at reusing results from previous match operations, and several mechanisms to combine the results of matcher executions.

Q2. What are the future works in "Coma - a system for flexible combination of schema matching approaches" ?

In future work, the authors plan to add other match and combination algorithms in order to improve match quality. Furthermore, the authors will apply COMA to additional schema types and applications, such as in the bioinformatics domain.

Q3. What is the way to analyze the stability of the default combination strategy?

The stable behavior of the default combination strategy indicates that it can be used for many match tasks thereby limiting the tuning effort.

Q4. What is the way to analyze the stability of the matchers?

In contrast to single matchers, matcher combinations simultaneously analyze schema elements under different aspects, resulting in more stable and accurate similarity for heterogeneous schemas.

Q5. How do the authors determine the match candidates for a schema element?

The authors apply a selection strategy to choose the match candidates for a schema element, e.g. by selecting the elements of the other schema with the best similarity value exceeding a certain threshold.

Q6. What is the default weight of the name and data type similarity?

The default weights of the name and data type similarity, 0.7 and 0.3, respectively, make it possible to match attributes with similar names but different data types.

Q7. What is the probability of finding the match results for MatchCompose?

Despite the high level of reuse in Schema (schema level), the authors believe that there is a high probability to find the necessary match result pairs for MatchCompose in an environment where many schemas are managed and matched to each other.

Q8. How can Recall be maximized at the expense of a poor Precision?

Recall can easily be maximized at the expense of a poor Precision by returning all possible correspondences, i.e. the cross product of two input schemas.

Q9. What is the default matcher for the datatype and name matcher?

This element matcher combines the DataType and Name matcher, i.e. it matches elements based on a combination of their name and data type similarity.

Q10. How can the authors achieve the accurate match predictions?

Most accurate match predictions can be achieved by selecting match candidates showing the (approximately) highest similarity exceeding a minimal threshold.
