
COMA: a system for flexible combination of schema matching approaches

20 Aug 2002-pp 610-621
TL;DR: This work develops the COMA schema matching system as a platform to combine multiple matchers in a flexible way and uses COMA as a framework to comprehensively evaluate the effectiveness of different matchers and their combinations for real-world schemas.
Abstract: Schema matching is the task of finding semantic correspondences between elements of two schemas. It is needed in many database applications, such as integration of web data sources, data warehouse loading and XML message mapping. To reduce the amount of user effort as much as possible, automatic approaches combining several match techniques are required. While such match approaches have found considerable interest recently, the problem of how to best combine different match algorithms still requires further work. We have thus developed the COMA schema matching system as a platform to combine multiple matchers in a flexible way. We provide a large spectrum of individual matchers, in particular a novel approach aiming at reusing results from previous match operations, and several mechanisms to combine the results of matcher executions. We use COMA as a framework to comprehensively evaluate the effectiveness of different matchers and their combinations for real-world schemas. The results obtained so far show the superiority of combined match approaches and indicate the high value of reuse-oriented strategies.


COMA - A system for flexible combination of
schema matching approaches
Hong-Hai Do, University of Leipzig, hong@informatik.uni-leipzig.de
Erhard Rahm, University of Leipzig, rahm@informatik.uni-leipzig.de
1 Introduction
Schema matching is the task of finding semantic
correspondences between elements of two schemas [ 11,
12, 15]. It is a critical operation in many schema and data
translation and integration applications, such as integra-
tion of web data sources, data warehouse loading, XML
message mapping and XML-relational data mapping. Currently,
schema matching is largely performed manually by domain
experts and is therefore a time-consuming and tedious
process. In web-based applications and services, such
a manual approach is a major limitation due to the rapidly
increasing number of data sources, XML message and
document schemas, and web service interfaces to be dealt
with. Hence, approaches for automating the schema
matching tasks as much as possible are badly needed to
simplify and speed up the development, maintenance and
use of such applications.
Numerous researchers have addressed the schema
matching problem either for specific applications [ 1, 4, 5,
7, 8, 9, 11, 15, 16] or in a more generic way for different
applications and schema languages [ 12, 13, 14]. The pro-
posed techniques for automating schema matching exploit
various types of schema information, e.g. element names,
data types and structural properties [ 2, 12, 15, 16, 9] as
well as characteristics of data instances [ 7, 8, 14, 11, 9].
Some approaches utilize auxiliary sources, such as tax-
onomies, dictionaries and thesauri [ 2, 9]. To achieve high
match accuracy for a large variety of schemas, a single
technique (e.g., name matching) is unlikely to be success-
ful. Hence, it is necessary to combine different ap-
proaches in an effective way. For this purpose, previous
prototypes have followed either a so-called hybrid or
composite combination of match techniques [ 18]. So far
the hybrid approach, in which different match criteria or
properties (e.g., name and data type) are used within a
single algorithm, is most common. Typically these criteria are
fixed and used in a specific way. By contrast, a composite
match approach combines the results of several independ-
ently executed match algorithms, which can be simple
(based on a single match criterion) or hybrid. This allows
for a high flexibility, as there is the potential for selecting
the match algorithms to be executed based on the match
task at hand. Moreover, there are different possibilities for
combining the individual match results. We know of only
three recent systems following such a composite approach
[ 7, 8, 9]. They are all limited to match techniques based
on machine learning and do not fully utilize the flexibility
offered by the composite approach (see Section 2).
To investigate the effectiveness of composite match
approaches more comprehensively we have developed the
COMA system for combining match algorithms in a
flexible way. COMA represents a generic match system
supporting different applications and multiple schema
types such as XML and relational schemas. It provides an
extensible library of match algorithms and supports dif-
ferent ways for combining match results. New match al-
gorithms can be included in the library and used in com-
bination with other matchers. COMA thus allows us to
tailor match strategies by selecting the match algorithms
and their combination for a given match problem. Moreover,
we use COMA as an evaluation platform to systematically
examine and compare the effectiveness of different
matchers and combination strategies. In the design of
COMA we observed that in general fully automatic solutions
to the match problem are not possible due to the
potentially high degrees of semantic heterogeneity between
schemas. We thus allow an interactive and iterative
match process during which the user can provide feedback,
e.g. to manually provide match correspondences or
to confirm or reject proposed matches.

Permission to copy without fee all or part of this material is granted
provided that the copies are not made or distributed for direct commercial
advantage, the VLDB copyright notice and the title of the publication
and its date appear, and notice is given that copying is by permission
of the Very Large Data Base Endowment. To copy otherwise, or to
republish, requires a fee and/or special permission from the Endowment.
Proceedings of the 28th VLDB Conference, Hong Kong, China, 2002

As another contribution we propose a new match ap-
proach that aims at reusing previously obtained match
results, motivated by the observation that many schemas
to be matched are very similar to previously matched
schemas. Reusing the previous match results may thus
result in significant savings of manual effort. A simple
form of such an approach is the use of synonym tables
indicating match correspondences at the level of single
schema elements. Our new approach tries to reuse match
results at the level of entire schemas or schema fragments.
The flexibility of COMA is made possible by the use of a
DBMS-based repository for storing schemas, intermediate
similarity results of individual matchers, and complete
(possibly user-confirmed) match results for later reuse.
The paper is organized as follows. In Section 2 we
discuss some related work. Section 3 provides an over-
view of COMA. In Sections 4 and 5 we present the sup-
ported matchers including the reuse-oriented approach.
Section 6 outlines the strategies for matcher combination.
Section 7 presents the results of using COMA for evaluat-
ing different strategies for matching real-world schemas.
Finally, we conclude and discuss some future work.
2 Related work
A recent survey on automatic schema matching proposed
a solution taxonomy differentiating between schema- and
instance-level, element- and structure-level, and language-
and constraint-based matching approaches [ 18, 12]. Fur-
thermore, the distinction between hybrid and composite
combination of matchers is introduced and previous
match prototypes such as Cupid [ 12], SemInt [ 11], LSD
[ 7], Dike [ 16], SF [ 13], TranScm [ 15], and Momis [ 2] are
reviewed.
Cupid [ 12] represents a sophisticated hybrid match
approach combining a name matcher with a structural
match algorithm, which derives the similarity of elements
based on the similarity of their components, thereby
emphasizing the name and data type similarities present at
the finest level of granularity (leaf level). In a compara-
tive evaluation Cupid was generally more effective than
two earlier match prototypes (Dike and Momis).
LSD (Learning Source Description) [ 7] and its exten-
sion GLUE [ 8] represent powerful composite approaches
to combining different matchers. Both use machine-
learning techniques for individual matchers and an auto-
matic combination of match results. Machine learning is a
promising technique especially for evaluating data in-
stances to predict element similarity. On the other hand,
the accuracy of the predictions depends on a suitable
training which can incur a substantial manual effort. The
predictions of individual matchers are combined by a so-
called meta-learner, which weights the predictions from a
matcher according to its accuracy shown during the train-
ing phase. In various experiments LSD and GLUE
showed promising results, albeit based on a not well-
defined accuracy metric apparently not taking into ac-
count wrongly proposed match correspondences.
In [ 9], Embley et al. describe another composite ap-
proach based on machine learning. In addition to instance-
level matchers a name matcher is supported requiring an
external dictionary (WordNet). The predictions of the
individual matchers are combined using an average func-
tion. Like LSD and GLUE, a training phase is needed.
The evaluation of the structural match algorithm SF
(Similarity Flooding) in [ 13] used a more realistic metric
for measuring the match accuracy than previous studies. It
takes into account both the share of correctly proposed
match candidates and wrongly suggested match candi-
dates. In our evaluation we will also use this refined met-
ric (Section 7).
To sum up, the composite approach has so far only
been studied in the context of machine learning ap-
proaches focusing on instance-level matchers and using a
specific combination of match results. By contrast we
want to support and evaluate a spectrum of matchers not
confined to machine learning as well as the customizable
combination of their results. A systematic comparative
evaluation of different match algorithms and their combi-
nations based on well-defined accuracy metrics does not
exist so far. To our knowledge, beyond the use of simple
synonym tables the reuse of previous match results has
not yet been studied.
3 Overview of COMA
A schema consists of a set of elements, such as relational
tables and columns or XML elements and attributes. In
COMA we represent schemas by rooted directed acyclic
graphs. Schema elements are represented by graph nodes
connected by directed links of different types, e.g. for
containment and referential relationships. Schemas are
imported from external sources, e.g. relational databases
or XML files, into the internal format on which all match
algorithms operate. Figure 1 shows our running examples,
a relational and an XML schema for purchase orders
(PO), and their internal graph representation.
The match operation takes as input two schemas and
determines a mapping indicating which elements of the
input schemas logically correspond to each other, i.e.
match. The match result is a set of mapping elements
specifying the matching schema elements together with a
similarity value between 0 (strong dissimilarity) and 1
(strong similarity) indicating the plausibility of their cor-
respondence. Similar to previous work, we focus on one-
to-one (1:1) match relationships. However, match algorithms
may determine multiple match candidates with
different similarities for a schema element and finally
select one of them or leave the final choice to the user.
Figure 2 illustrates match processing in COMA on
two input schemas S1 and S2. Match processing either
takes place in one or multiple iterations depending on
whether an automatic or interactive determination of
match candidates is to be performed. Each match iteration
consists of three phases: an optional user feedback phase,
the execution of different matchers and the combination
of the individual match results. In interactive mode, the
user can interact with COMA for each iteration to specify
the match strategy (selection of matchers, of strategies to
combine individual match results), define match or mis-
match relationships, and accept or reject match candidates
proposed in the previous iteration. The interactive ap-
proach is useful to test and compare different match
strategies for specific schemas and to continuously refine
and improve the match result. In automatic mode, the
match process consists of a single match iteration, to
which either a default strategy or a strategy specified by
input parameters is applied. This mode is especially useful
for applications that already know their most suitable match
strategy or implement their own user interaction interface.
We now describe the steps of the match process in
more detail. After being converted to the internal graph
format introduced above, the schemas are traversed to
determine all schema elements for which the match algo-
rithms calculate the similarity values. We represent
schema elements by their paths, i.e. sequences of nodes
following the containment links from the root to the cor-
responding nodes. Shared schema fragments or elements,
such as Address in PO2, will result in multiple paths for
which we can independently determine match candidates.
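The path-based view of schema elements can be illustrated with a small sketch. The adjacency-list encoding of the PO2 graph and the helper name `enumerate_paths` are assumptions made for this illustration; they are not COMA's internal data structures.

```python
from typing import Dict, List

# Containment links of the PO2 schema from Figure 1, encoded as an
# adjacency list (an illustrative encoding, not COMA's internal format).
PO2_GRAPH: Dict[str, List[str]] = {
    "PO2": ["DeliverTo", "BillTo"],
    "DeliverTo": ["Address"],
    "BillTo": ["Address"],  # Address is a shared fragment with two parents
    "Address": ["Street", "City", "Zip"],
    "Street": [],
    "City": [],
    "Zip": [],
}


def enumerate_paths(graph: Dict[str, List[str]], root: str) -> List[str]:
    """Return every root-to-node path as a dot-separated name."""
    paths: List[str] = []

    def visit(node: str, prefix: List[str]) -> None:
        current = prefix + [node]
        paths.append(".".join(current))
        for child in graph[node]:
            visit(child, current)

    visit(root, [])
    return paths


paths = enumerate_paths(PO2_GRAPH, "PO2")
```

Because Address is reachable through both DeliverTo and BillTo, its leaf City appears twice in `paths`, once as `PO2.DeliverTo.Address.City` and once as `PO2.BillTo.Address.City`, so each context can receive its own match candidates.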
COMA supports user interaction by a so-called User-
Feedback matcher to capture match and mismatch infor-
mation provided by the user including corrected match
results from the previous match iteration. This matcher
ensures that approved matches (and mismatches) are as-
signed the maximal (and minimal) similarity and that
these values remain unaffected by the other matchers dur-
ing the matcher execution step. The user-provided simi-
larity values influence the similarity computations for the
neighbourhood of the respective elements and can thus
improve the match accuracy of structural matchers.
A main step during a match iteration is the execution
of multiple independent matchers chosen from the
matcher library. The matchers currently supported fall
into three classes: simple, hybrid and reuse-oriented
matchers. They exploit different kinds of schema infor-
mation, such as names, data types, and structural proper-
ties, or auxiliary information, such as synonym tables and
previous match results. Each matcher determines an in-
termediate match result consisting of a similarity value
between 0 and 1 for each combination of S1 and S2
schema elements. The result of the matcher execution
phase with k matchers, m S1 elements and n S2 elements
is a k x m x n cube of similarity values, which is stored in
the repository for later combination and selection steps.
Table 1 shows a sample extract from the similarity cube
for the purchase order schemas of Figure 1.
Matcher    PO1 Elements              PO2 Elements                  Sim
TypeName   PO1.ShipTo.shipToCity     PO2.DeliverTo.Address.City    0.65
           PO1.ShipTo.shipToStreet   PO2.DeliverTo.Address.City    0.3
           PO1.Customer.custCity     PO2.DeliverTo.Address.City    0.80
NamePath   PO1.ShipTo.shipToCity     PO2.DeliverTo.Address.City    0.78
           PO1.ShipTo.shipToStreet   PO2.DeliverTo.Address.City    0.73
           PO1.Customer.custCity     PO2.DeliverTo.Address.City    0.53
Table 1. Similarity values computed for PO1 and PO2
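As a rough illustration of the similarity cube, the extract of Table 1 can be held in a mapping keyed by (matcher, S1 element, S2 element). The dictionary layout and the helper `matcher_slice` are assumptions of this sketch; the paper only states that the cube is stored in the repository.

```python
from typing import Dict, Tuple

# (matcher, PO1 element, PO2 element) -> similarity; values from Table 1.
Cube = Dict[Tuple[str, str, str], float]

TARGET = "PO2.DeliverTo.Address.City"
cube: Cube = {
    ("TypeName", "PO1.ShipTo.shipToCity", TARGET): 0.65,
    ("TypeName", "PO1.ShipTo.shipToStreet", TARGET): 0.30,
    ("TypeName", "PO1.Customer.custCity", TARGET): 0.80,
    ("NamePath", "PO1.ShipTo.shipToCity", TARGET): 0.78,
    ("NamePath", "PO1.ShipTo.shipToStreet", TARGET): 0.73,
    ("NamePath", "PO1.Customer.custCity", TARGET): 0.53,
}


def matcher_slice(cube: Cube, matcher: str) -> Dict[Tuple[str, str], float]:
    """Extract the m x n similarity matrix produced by one matcher."""
    return {(s1, s2): sim for (m, s1, s2), sim in cube.items() if m == matcher}


type_name = matcher_slice(cube, "TypeName")
```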
The final step in a match iteration is to derive the
combined match result from the individual matcher results
stored in the similarity cube. This is achieved in two sub-
steps: aggregation of matcher-specific results and selec-
tion of match candidates. First, for each combination of
schema elements the matcher-specific similarity values
are aggregated into a combined similarity value, e.g. by
taking the average or maximum value. Table 2 shows the
result of this step for the example of Table 1 using the
average strategy. Second, we apply a selection strategy to
choose the match candidates for a schema element, e.g. by
selecting the elements of the other schema with the best
similarity value exceeding a certain threshold. For the
example in Table 2 we could thus determine
PO1.ShipTo.shipToCity as the match candidate of
PO2.DeliverTo.Address.City.

a) A relational schema and an XML schema

CREATE TABLE PO1.ShipTo (
  poNo INT,
  custNo INT REFERENCES PO1.Customer,
  shipToStreet VARCHAR(200),
  shipToCity VARCHAR(200),
  shipToZip VARCHAR(20),
  PRIMARY KEY (poNo) ) ;
CREATE TABLE PO1.Customer (
  custNo INT,
  custName VARCHAR(200),
  custStreet VARCHAR(200),
  custCity VARCHAR(200),
  custZip VARCHAR(20),
  PRIMARY KEY (custNo) ) ;

<xsd:schema xmlns:xsd="http://www.w3.org/2001/">
  <xsd:complexType name="PO2" >
    <xsd:sequence>
      <xsd:element name="DeliverTo" type="Address"/>
      <xsd:element name="BillTo" type="Address"/>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name="Address" >
    <xsd:sequence>
      <xsd:element name="Street" type="xsd:string"/>
      <xsd:element name="City" type="xsd:string"/>
      <xsd:element name="Zip" type="xsd:decimal"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:schema>

b) Their corresponding graph representation
[Graph of nodes connected by containment links: PO1 contains ShipTo (poNo, custNo, shipToStreet, shipToCity, shipToZip) and Customer (custNo, custName, custStreet, custCity, custZip); PO2 contains DeliverTo and BillTo, which both contain the shared node Address (Street, City, Zip).]
Figure 1. External and internal schema representation

PO1 elements              PO2 elements                  Combined sim
PO1.ShipTo.shipToCity     PO2.DeliverTo.Address.City    0.72
PO1.Customer.custCity     PO2.DeliverTo.Address.City    0.67
PO1.ShipTo.shipToStreet   PO2.DeliverTo.Address.City    0.52
Table 2. Similarity values combined from Table 1
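The two sub-steps, aggregation and selection, can be sketched as follows using the values of Table 1. The function names `aggregate` and `select_best` and the threshold of 0.5 are assumptions chosen for this illustration; the paper names the strategies but this excerpt does not fix their parameters.

```python
from statistics import mean
from typing import Dict, List, Optional, Tuple

# Matcher-specific similarities towards PO2.DeliverTo.Address.City,
# taken from Table 1 in the order [TypeName, NamePath].
per_matcher: Dict[str, List[float]] = {
    "PO1.ShipTo.shipToCity": [0.65, 0.78],
    "PO1.ShipTo.shipToStreet": [0.30, 0.73],
    "PO1.Customer.custCity": [0.80, 0.53],
}


def aggregate(values: List[float], strategy: str = "average") -> float:
    """Combine the matcher-specific values for one element pair."""
    if strategy == "average":
        return mean(values)
    if strategy == "max":
        return max(values)
    if strategy == "min":
        return min(values)
    raise ValueError(f"unknown aggregation strategy: {strategy}")


def select_best(
    combined: Dict[str, float], threshold: float
) -> Optional[Tuple[str, float]]:
    """Pick the candidate with the highest similarity above the threshold."""
    if not combined:
        return None
    element, sim = max(combined.items(), key=lambda item: item[1])
    return (element, sim) if sim > threshold else None


combined = {elem: aggregate(vals) for elem, vals in per_matcher.items()}
best = select_best(combined, threshold=0.5)  # threshold is an assumed value
```

With the average strategy the combined values are about 0.715, 0.515 and 0.665, which round to the entries of Table 2, and `best` names PO1.ShipTo.shipToCity as the selected candidate.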
COMA supports the determination of undirectional or
directional match results. In the former case, match can-
didates are determined for both input schemas. Moreover,
an S1 element s1 is only accepted as a match candidate for
an S2 element s2 if s2 is also a match candidate of s1. For
instance, in the above example we would accept
PO1.ShipTo.shipToCity as the match candidate of
PO2.DeliverTo.Address.City only if there are no better
PO2 match candidates for PO1.ShipTo.shipToCity than
PO2.DeliverTo.Address.City. In the case of a directional
match, the goal is to find all match candidates only with
respect to one of the schemas, say S2. Hence, match
candidates are sought only for S2 elements, accepting
that S1 elements may remain unmatched. This approach
has been followed by most previous studies and is
motivated by the fact that many applications require such
a directional match (e.g., to integrate a new data source
with schema S1 into a data warehouse or mediator with
global schema S2). If the target schema S2 is small com-
pared to S1 the match problem is substantially simplified.
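The difference between the two modes can be sketched as below, assuming that a single best candidate is selected per element. The short element names and the two BillTo.City similarity values are hypothetical and serve only to show a case in which the modes disagree; 0.72 and 0.67 are taken from Table 2.

```python
from typing import Dict, Set, Tuple

Pair = Tuple[str, str]  # (S1 element, S2 element)
Sim = Dict[Pair, float]  # combined similarity per element pair


def _best(sim: Sim, key_index: int) -> Set[Pair]:
    """Keep, for each element on one side, its highest-scoring pair."""
    best: Dict[str, Tuple[Pair, float]] = {}
    for pair, value in sim.items():
        key = pair[key_index]
        if key not in best or value > best[key][1]:
            best[key] = (pair, value)
    return {pair for pair, _ in best.values()}


def directional(sim: Sim) -> Set[Pair]:
    """Find the best S1 candidate for every S2 element."""
    return _best(sim, key_index=1)


def undirectional(sim: Sim) -> Set[Pair]:
    """Accept a pair only if each element is the other's best candidate."""
    return _best(sim, key_index=0) & _best(sim, key_index=1)


sim: Sim = {
    ("shipToCity", "DeliverTo.City"): 0.72,  # Table 2
    ("custCity", "DeliverTo.City"): 0.67,    # Table 2
    ("shipToCity", "BillTo.City"): 0.75,     # hypothetical value
    ("custCity", "BillTo.City"): 0.70,       # hypothetical value
}

one_way = directional(sim)
two_way = undirectional(sim)
```

Here the directional result proposes shipToCity for DeliverTo.City, whereas the undirectional result rejects that pair because shipToCity has a better candidate (BillTo.City) in the other schema.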
4 Matcher library
Table 3 gives an overview of the matchers we have
implemented and tested so far. We characterize the kinds
of schema and auxiliary information they exploit. In the
following we first describe the simple matchers followed
by the hybrid matchers. The more complex reuse-oriented
matcher Schema is discussed in Section 5.
4.1 Simple matchers
Element names represent an important source for assess-
ing similarity between schema elements. This can be done
syntactically by comparing the name strings or semanti-
cally by comparing their meanings. Approximate string
matching techniques [ 10] have already been employed in
other fields, such as record linkage [ 20] and data cleaning
[ 19], to detect duplicate database records concerning the
same real-world entity, i.e. matching at the instance level.
In COMA, we have implemented four simple approxi-
mate string matchers:
Affix: This matcher looks for common affixes, i.e. both
prefixes and suffixes, between two name strings.
n-gram: Strings are compared according to their set of n-
grams, i.e. sequences of n characters, leading to different
variants of this matcher, e.g. Digram (2), Trigram (3).
EditDistance: String similarity is computed from the
number of edit operations necessary to transform one
string to another one (the Levenshtein metric [ 10]).
Soundex: This matcher computes the phonetic similarity
between names from their corresponding soundex codes.
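Two of these string matchers can be sketched as follows. The text does not state how raw n-gram overlap or edit distance is turned into a value between 0 and 1, so the Dice coefficient and the normalization by the longer string length used here are assumptions of this sketch.

```python
from typing import Set


def ngrams(s: str, n: int = 3) -> Set[str]:
    """Return the set of character n-grams of a lower-cased string."""
    s = s.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def ngram_similarity(a: str, b: str, n: int = 3) -> float:
    """Dice coefficient over n-gram sets (an assumed normalization)."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga or not gb:
        return 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))


def levenshtein(a: str, b: str) -> int:
    """Number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    """Map edit distance to [0, 1] by dividing by the longer length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a.lower(), b.lower()) / longest
```

Under these assumptions, `ngram_similarity("shipToCity", "City")` shares two of ten trigrams in total and `edit_similarity("custCity", "City")` needs four deletions out of eight characters.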
Further simple matchers are UserFeedback (Section 3),
a semantic matcher, Synonym, and a DataType matcher:
Synonym: This matcher estimates the similarity between
element names by looking up the terminological relation-
ships in a specified dictionary. Currently, it simply uses
relationship-specific similarity values, e.g., 1.0 for a syn-
onymy and 0.8 for a hypernymy relationship.
DataType: This matcher uses a synonym table specifying
the degree of compatibility between a set of predefined
generic data types, to which data types of schema ele-
ments are mapped in order to determine their similarity.
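A dictionary lookup of the kind the Synonym matcher performs might look like the sketch below. The values 1.0 and 0.8 come from the text; the dictionary entries, the symmetric lookup, and the score of 1.0 for identical names are assumptions made for illustration.

```python
from typing import Dict, Optional, Tuple

# Relationship-specific similarity values quoted in the text.
RELATION_SIM: Dict[str, float] = {"synonym": 1.0, "hypernym": 0.8}

# A tiny hypothetical dictionary: (term, term) -> relationship type.
DICTIONARY: Dict[Tuple[str, str], str] = {
    ("ship", "deliver"): "synonym",
    ("customer", "person"): "hypernym",
}


def synonym_similarity(a: str, b: str) -> float:
    """Look up the relationship between two names in either order."""
    a, b = a.lower(), b.lower()
    if a == b:
        return 1.0  # assumption: identical names count as synonyms
    relation: Optional[str] = DICTIONARY.get((a, b)) or DICTIONARY.get((b, a))
    if relation is None:
        return 0.0
    return RELATION_SIM[relation]
```

The DataType matcher could follow the same lookup pattern with a table of generic data type pairs in place of the term dictionary.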
Matcher Type     Matcher        Schema Info         Auxiliary Info
Simple           Affix          Element names       -
                 n-gram         Element names       -
                 Soundex        Element names       -
                 EditDistance   Element names       -
                 Synonym        Element names       Extern. dictionaries
                 DataType       Data types          Data type compatibility table
                 UserFeedback   -                   User-specified (mis-)matches
Hybrid           Name           Element names       -
                 NamePath       Names+Paths         -
                 TypeName       Data types+Names    -
                 Children       Child elements      -
                 Leaves         Leaf elements       -
Reuse-oriented   Schema         -                   Existing schema-level match results
Table 3. Implemented matchers in the matcher library
4.2 Hybrid matchers
The hybrid matchers use a fixed combination of simple
matchers and other hybrid matchers to obtain more accu-
rate similarity values. The approach applied for combin-
ing the results of the constituent matchers follows the
same principles used for combining the matcher results in
the final phase of the match process (or iteration). The
details of how matchers are combined within a hybrid
matcher are explained in Section 6.
We currently support two hybrid element-level match-
ers, Name and TypeName, and three hybrid structural
matchers, NamePath, Children and Leaves. All approaches
rely to different degrees on similarities derived from ele-
ment names for which combinations of the simple match-
ers discussed above can be utilized (e.g. Synonym, etc.).
[Figure: Schema S1 and Schema S2 pass through Schema Import into the Match Iteration, which consists of optional User Interaction (UserFeedback), Matcher execution (Matcher 1, Matcher 2, Matcher 3, taken from the Matcher Library) filling the Similarity cube, and Combination of match results producing the Mapping.
Matcher Library: Simple matchers (n-gram, Synonym, ...); Hybrid matchers (NamePath, TypeName, ...); Reuse-oriented matchers (Schema, ...).
Combination Strategies: Aggregation of matcher-specific results (Max, Average, Weighted, Min); Match direction (SmallLarge, LargeSmall, Both); Match candidate selection (Threshold, MaxN, MaxDelta).]
Figure 2. Match processing in COMA

Name: This matcher only considers the element names
but is a hybrid approach because it combines different
simple name matchers. It performs some pre-processing
steps, in particular a tokenization to derive a set of
components (tokens) of a name, e.g. POShipTo → {PO, Ship,
To}. Moreover it expands abbreviations and acronyms,
e.g. PO → {Purchase, Order}. The Name matcher then
applies multiple simple matchers, such as Affix, Trigram,
and Synonym, on the token sets of the names and combines
the obtained similarity values for tokens to derive
similarity values between element names (see Section 6).
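The pre-processing and token comparison of the Name matcher can be sketched as follows. Only the two examples (POShipTo and PO) come from the text; the regular expression, the abbreviation table, the Affix-like stand-in matcher, and the way token similarities are combined (average of each token's best counterpart) are assumptions, since the actual combination is deferred to Section 6.

```python
import re
from typing import Callable, Dict, List

# Hypothetical abbreviation table; the PO entry follows the example in the text.
ABBREVIATIONS: Dict[str, List[str]] = {"PO": ["Purchase", "Order"]}


def tokenize(name: str) -> List[str]:
    """Split a name at case changes, e.g. POShipTo -> [PO, Ship, To]."""
    return re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)


def expand(tokens: List[str]) -> List[str]:
    """Replace known abbreviations by their expansion and lower-case all tokens."""
    out: List[str] = []
    for token in tokens:
        out.extend(ABBREVIATIONS.get(token, [token]))
    return [t.lower() for t in out]


def affix_similarity(a: str, b: str) -> float:
    """Stand-in simple matcher: longest common prefix or suffix over the longer length."""
    def common_prefix(x: str, y: str) -> int:
        n = 0
        for cx, cy in zip(x, y):
            if cx != cy:
                break
            n += 1
        return n

    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    return max(common_prefix(a, b), common_prefix(a[::-1], b[::-1])) / longest


def name_similarity(
    n1: str, n2: str, matcher: Callable[[str, str], float] = affix_similarity
) -> float:
    """Average, over all tokens of both names, of the best match in the other name."""
    t1, t2 = expand(tokenize(n1)), expand(tokenize(n2))
    if not t1 or not t2:
        return 0.0
    best1 = [max(matcher(a, b) for b in t2) for a in t1]
    best2 = [max(matcher(a, b) for a in t1) for b in t2]
    return (sum(best1) + sum(best2)) / (len(t1) + len(t2))
```

A fuller version would run several simple matchers per token pair and aggregate their values before this step; a single stand-in matcher keeps the sketch short.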
NamePath: This matcher matches elements based on
their hierarchical names, i.e. both structural aspects and
element names are considered. It first builds a long name
by concatenating all names of the elements in a path to a
single string. It then applies Name to compute the similar-
ity between these long names. Considering the complete
name path of an element provides additional tokens for
name matching which may improve match accuracy. For
instance, this can be helpful to find match candidates at
different schema levels, e.g. PurchaseOrder.ShipTo.Street
and PurchaseOrder.shipToStreet. On the other hand, it is
possible to distinguish between different contexts of the
same element, e.g. ShipTo.Street and BillTo.Street.
TypeName: This element matcher combines the DataType
and Name matcher, i.e. it matches elements based on a
combination of their name and data type similarity.
Children: This structural matcher is used in combination
with a leaf-level matcher. It determines the similarity be-
tween two inner elements based on the combined similar-
ity between their child elements, which in turn can be
both inner and leaf elements. The similarity between the
inner elements needs to be recursively computed from the
similarity between their respective children. The similar-
ity between the leaf elements is obtained from the leaf-
level matcher, for which TypeName is used as the default.
Leaves: This structural matcher is also used in combina-
tion with a leaf-level matcher, for which TypeName is set
as the default. In contrast to the Children strategy, this
matcher only considers the leaf elements to estimate the
similarity between two inner elements. This strategy aims
at more stable similarity in cases of structural conflicts. In
Figure 1, for example, elements shipToStreet, shipToCity,
etc., are children of ShipTo in PO1, while in PO2, their
matching elements are not children of DeliverTo, but of
Address. Children will therefore only find a correspon-
dence between ShipTo and Address, while Leaves can also
identify a correspondence between ShipTo and DeliverTo.
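The contrast between Children and Leaves on the ShipTo example can be reproduced with a small sketch. The suffix-based `leaf_sim` is a crude stand-in for the default leaf-level matcher TypeName, and the set combination (average of best matches in both directions) as well as the score of 0.0 for comparing a leaf with an inner element are assumptions of this sketch.

```python
from typing import Callable, Dict, List

Tree = Dict[str, List[str]]  # node -> child nodes (containment links)

PO1: Tree = {
    "ShipTo": ["shipToStreet", "shipToCity", "shipToZip"],
    "shipToStreet": [], "shipToCity": [], "shipToZip": [],
}
PO2: Tree = {
    "DeliverTo": ["Address"],
    "Address": ["Street", "City", "Zip"],
    "Street": [], "City": [], "Zip": [],
}


def leaf_sim(a: str, b: str) -> float:
    """Stand-in leaf-level matcher: 1.0 if one name ends with the other."""
    a, b = a.lower(), b.lower()
    return 1.0 if a.endswith(b) or b.endswith(a) else 0.0


def set_sim(xs: List[str], ys: List[str], sim: Callable[[str, str], float]) -> float:
    """Average of each element's best match on the other side (assumed)."""
    if not xs or not ys:
        return 0.0
    best_x = [max(sim(x, y) for y in ys) for x in xs]
    best_y = [max(sim(x, y) for x in xs) for y in ys]
    return (sum(best_x) + sum(best_y)) / (len(xs) + len(ys))


def leaves(tree: Tree, node: str) -> List[str]:
    """Collect all leaf descendants of a node."""
    if not tree[node]:
        return [node]
    result: List[str] = []
    for child in tree[node]:
        result.extend(leaves(tree, child))
    return result


def leaves_sim(t1: Tree, n1: str, t2: Tree, n2: str) -> float:
    """Leaves: compare inner elements through their leaf sets only."""
    return set_sim(leaves(t1, n1), leaves(t2, n2), leaf_sim)


def children_sim(t1: Tree, n1: str, t2: Tree, n2: str) -> float:
    """Children: recursively compare the direct children of two elements."""
    c1, c2 = t1[n1], t2[n2]
    if not c1 and not c2:
        return leaf_sim(n1, n2)
    if not c1 or not c2:
        return 0.0  # assumption: a leaf is not compared with an inner element
    return set_sim(c1, c2, lambda a, b: children_sim(t1, a, t2, b))
```

With these definitions `children_sim(PO1, "ShipTo", PO2, "Address")` is high while `children_sim(PO1, "ShipTo", PO2, "DeliverTo")` is zero, whereas `leaves_sim` rates both pairs highly, which mirrors the behaviour described for Figure 1.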
5 Reuse of previous match results
The consideration of reuse-oriented matchers is motivated
by our expectation that many schemas to be matched are
similar (or identical) to previously matched schemas. The
use of auxiliary information such as synonym dictionaries
and thesauri already represents such a reuse-oriented
approach utilizing confirmed correspondences at the level of
schema elements (names or data types). Our goal is to
generalize this idea and reuse multiple match correspon-
dences at the same time at the levels of schema fragments
or entire schemas.
As a first step, we have implemented two simple re-
use-oriented matchers that can be invoked and combined
like other matchers. One of them, Schema, tries to reuse
match results for entire schemas, the other, Fragment, op-
erates on schema fragments. In both cases we use a spe-
cial compose operation, MatchCompose, to derive a new
match result from existing ones. We first introduce
MatchCompose. Due to lack of space, we then only de-
scribe Schema.
5.1 The MatchCompose operation
Given two match results, match1: S1↔S2 and match2:
S2↔S3 sharing schema S2, MatchCompose derives a new
match result, match: S1↔S3, between S1 and S3. The
operation assumes a transitive nature of the similarity
relation between elements, i.e. if a is similar to b and b to
c, then a is (very likely) also similar to c. Of course wrong
match candidates may be determined in cases where the
transitivity property does not hold.
In the context of information retrieval, transitive simi-
larity estimations have been applied to derive the similar-
ity of words based on terminological relationships, such
as synonymy and hypernymy [ 4, 17]. A common ap-
proach to determine the transitive similarity is to multiply
the individual similarity values [ 2]. This approach, how-
ever, may lead to rapidly degrading similarity values. For
instance, for
    contactFirstName --(0.5)--> Name --(0.7)--> firstName
the similarity between contactFirstName and firstName
would become 0.5*0.7=0.35, which is unlikely to reflect
the similarity, which we would expect for the two names.
We thus prefer the alternatives for combining the results
of different matchers, such as Average (Section 6.1), for
calculating transitive similarities, resulting in similarity
value 0.6 in the last example.
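A minimal sketch of MatchCompose, assuming match results are stored as mappings from element pairs to similarity values. The function signature, the pluggable `combine` argument, and the choice to keep the highest value when several intermediate elements connect the same pair are assumptions of this sketch; the example values are those of Figure 3.

```python
from typing import Callable, Dict, Tuple

Match = Dict[Tuple[str, str], float]  # (source element, target element) -> similarity


def average(a: float, b: float) -> float:
    return (a + b) / 2


def product(a: float, b: float) -> float:
    return a * b


def match_compose(
    match1: Match, match2: Match, combine: Callable[[float, float], float] = average
) -> Match:
    """Join two match results on their shared middle schema."""
    result: Match = {}
    for (e1, e2), sim12 in match1.items():
        for (e2_other, e3), sim23 in match2.items():
            if e2 != e2_other:
                continue
            value = combine(sim12, sim23)
            # Assumption: keep the best value if several paths link e1 and e3.
            result[(e1, e3)] = max(value, result.get((e1, e3), 0.0))
    return result


# match1: PO1 <-> PO2 and match2: PO2 <-> PO3, values from Figure 3.
match1: Match = {("Name", "name"): 1.0, ("Email", "e-mail"): 1.0}
match2: Match = {
    ("name", "firstName"): 0.6,
    ("name", "lastName"): 0.6,
    ("e-mail", "email"): 1.0,
}
match = match_compose(match1, match2)

# The contactFirstName example from the text under both strategies.
chain1: Match = {("contactFirstName", "Name"): 0.5}
chain2: Match = {("Name", "firstName"): 0.7}
by_average = match_compose(chain1, chain2, average)
by_product = match_compose(chain1, chain2, product)
```

Passing the combination function as a parameter makes the difference between the strategies explicit: multiplication yields about 0.35 for the chain, averaging about 0.6.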
Figure 3a and b illustrate the approach for the match
PO1↔PO3 derived from composing the two match re-

a) match1: PO1↔PO2 and match2: PO2↔PO3
b) match = MatchCompose(match1, match2)
[Figure: the schemas PO1.Contact (Name, Email, company), PO2.Contact (name, e-mail, company) and PO3.Contact (firstName, lastName, email, company), connected by containment links and element correspondences.]
c) relational representation for MatchCompose (Average)

match1:   PO1     PO2        sim12
          Name    name       1.0
          Email   e-mail     1.0

match2:   PO2     PO3        sim23
          name    firstName  0.6
          name    lastName   0.6
          e-mail  email      1.0

match:    PO1     PO3        sim13
          Name    firstName  0.8
          Name    lastName   0.8
          Email   email      1.0
Figure 3. MatchCompose example


1,285 citations


Cites background from "COMA: a system for flexible combina..."

  • ...Some other innovations with respect to COMA, are in the set of elementary matchers based on rules, exploiting explicitly codified knowledge in ontologies, such as information about super- and sub-concepts, super- and sub-properties, etc....

    [...]

  • ..., Cupid [29], COMA[25]), others rely only on instance data (e....

    [...]

  • ...Based on the comparative evaluations conducted in [20], COMA dominates Autoplex [6] and Automatch [7]; LSD [22] and GLUE [23]; SF [50], and SemInt [44] matching tools....

    [...]

  • ...COMA (COmbination of MAtching algorithms) [21] is a composite schema matching tool....

    [...]

  • ...Some of matching systems exploiting the given test are [25, 23]....

    [...]

Journal ArticleDOI
TL;DR: The authors conjecture that significant improvements can be obtained only by addressing important challenges for ontology matching, and they present such challenges with insights on how to approach them, thereby aiming to direct research into the most promising tracks and to facilitate the progress of the field.
Abstract: After years of research on ontology matching, it is reasonable to consider several questions: is the field of ontology matching still making progress? Is this progress significant enough to pursue further research? If so, what are the particularly promising directions? To answer these questions, we review the state of the art of ontology matching and analyze the results of recent ontology matching evaluations. These results show a measurable improvement in the field, the speed of which is albeit slowing down. We conjecture that significant improvements can be obtained only by addressing important challenges for ontology matching. We present such challenges with insights on how to approach them, thereby aiming to direct research into the most promising tracks and to facilitate the progress of the field.

1,215 citations

Book ChapterDOI
Xin Dong1, Alon Halevy1, Jayant Madhavan1, Ema Nemes1, Jun Zhang1 
31 Aug 2004
TL;DR: Woogle supports similarity search for web services, such as finding similar web-service operations and finding operations that compose with a given one, and novel techniques to support these types of searches are described.
Abstract: Web services are loosely coupled software components, published, located, and invoked across the web. The growing number of web services available within an organization and on the Web raises a new and challenging search problem: locating desired web services. Traditional keyword search is insufficient in this context: the specific types of queries users require are not captured, the very small text fragments in web services are unsuitable for keyword search, and the underlying structure and semantics of the web services are not exploited. We describe the algorithms underlying the Woogle search engine for web services. Woogle supports similarity search for web services, such as finding similar web-service operations and finding operations that compose with a given one. We describe novel techniques to support these types of searches, and an experimental study on a collection of over 1500 web-service operations that shows the high recall and precision of our algorithms.

828 citations


Cites background from "COMA: a system for flexible combina..."

  • ...Schema matching: The database community has considered the problem of automatically matching schemas [24, 12, 13, 22]....

    [...]

Proceedings ArticleDOI
14 Jun 2005
TL;DR: Different match strategies can be applied including various forms of reusing previously determined match results and a so-called fragment-based match approach which decomposes a large match problem into smaller problems.
Abstract: We demonstrate the schema and ontology matching tool COMA++. It extends our previous prototype COMA utilizing a composite approach to combine different match algorithms [3]. COMA++ implements significant improvements and offers a comprehensive infrastructure to solve large real-world match problems. It comes with a graphical interface enabling a variety of user interactions. Using a generic data representation, COMA++ uniformly supports schemas and ontologies, e.g. the powerful standard languages W3C XML Schema and OWL. COMA++ includes new approaches for ontology matching, in particular the utilization of shared taxonomies. Furthermore, different match strategies can be applied including various forms of reusing previously determined match results and a so-called fragment-based match approach which decomposes a large match problem into smaller problems. Finally, COMA++ cannot only be used to solve match problems but also to comparatively evaluate the effectiveness of different match algorithms and strategies.

683 citations

References
Journal ArticleDOI
01 Dec 2001
TL;DR: A taxonomy is presented that distinguishes between schema-level and instance-level, element- level and structure- level, and language-based and constraint-based matchers and is intended to be useful when comparing different approaches to schema matching, when developing a new match algorithm, and when implementing a schema matching component.
Abstract: Schema matching is a basic problem in many database application domains, such as data integration, E-business, data warehousing, and semantic query processing. In current implementations, schema matching is typically performed manually, which has significant limitations. On the other hand, previous research papers have proposed many techniques to achieve a partial automation of the match operation for specific application domains. We present a taxonomy that covers many of these existing approaches, and we describe the approaches in some detail. In particular, we distinguish between schema-level and instance-level, element-level and structure-level, and language-based and constraint-based matchers. Based on our classification we review some previous match implementations thereby indicating which part of the solution space they cover. We intend our taxonomy and review of past work to be useful when comparing different approaches to schema matching, when developing a new match algorithm, and when implementing a schema matching component.

3,693 citations

Journal ArticleDOI
01 Jan 1989
TL;DR: Experiments in which distance is applied to pairs of concepts and to sets of concepts in a hierarchical knowledge base show the power of hierarchical relations in representing information about the conceptual distance between concepts.
Abstract: Motivated by the properties of spreading activation and conceptual distance, the authors propose a metric, called distance, on the power set of nodes in a semantic net. Distance is the average minimum path length over all pairwise combinations of nodes between two subsets of nodes. Distance can be successfully used to assess the conceptual distance between sets of concepts when used on a semantic net of hierarchical relations. When other kinds of relationships, like 'cause', are used, distance must be amended but then can again be effective. The judgements of distance significantly correlate with the distance judgements that people make and help to determine whether one semantic net is better or worse than another. The authors focus on the mathematical characteristics of distance that presents novel cases and interpretations. Experiments in which distance is applied to pairs of concepts and to sets of concepts in a hierarchical knowledge base show the power of hierarchical relations in representing information about the conceptual distance between concepts.

1,962 citations


"COMA: a system for flexible combina..." refers methods in this paper

  • ...In the context of information retrieval, transitive similarity estimations have been applied to derive the similarity of words based on terminological relationships, such as synonymy and hypernymy [4, 17]....

    [...]
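The distance metric summarized in the abstract above (the average minimum path length over all pairwise combinations of nodes between two subsets) can be illustrated with a minimal sketch. The graph representation, the function names `shortest_path_length` and `set_distance`, and the toy hierarchy are assumptions made for illustration; they are not taken from the cited article.

```python
from collections import deque
from itertools import product


def shortest_path_length(graph, start, goal):
    """Breadth-first search for the minimum number of edges between two nodes."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor == goal:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    raise ValueError(f"no path between {start!r} and {goal!r}")


def set_distance(graph, nodes_a, nodes_b):
    """Average minimum path length over all pairs (a, b) with a in A and b in B."""
    pairs = list(product(nodes_a, nodes_b))
    if not pairs:
        raise ValueError("both node sets must be non-empty")
    return sum(shortest_path_length(graph, a, b) for a, b in pairs) / len(pairs)


# Hypothetical is-a hierarchy, stored as an undirected adjacency list.
hierarchy = {
    "animal": ["dog", "cat"],
    "dog": ["animal", "poodle"],
    "cat": ["animal"],
    "poodle": ["dog"],
}

print(set_distance(hierarchy, {"poodle"}, {"cat"}))         # 3.0
print(set_distance(hierarchy, {"poodle", "dog"}, {"cat"}))  # 2.5
```

Treating the hierarchy as undirected is a simplification chosen here so that sibling concepts are reachable through their common parent.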

Journal Article
TL;DR: This work classifies data quality problems that are addressed by data cleaning and provides an overview of the main solution approaches and discusses current tool support for data cleaning.
Abstract: We classify data quality problems that are addressed by data cleaning and provide an overview of the main solution approaches. Data cleaning is especially required when integrating heterogeneous data sources and should be addressed together with schema-related data transformations. In data warehouses, data cleaning is a major part of the so-called ETL process. We also discuss current tool support for data cleaning.

1,675 citations


"COMA: a system for flexible combina..." refers methods in this paper

  • ...Approximate string matching techniques [10] have already been employed in other fields, such as record linkage [20] and data cleaning [19], to detect duplicate database records concerning the same real-world entity, i....

    [...]

Proceedings ArticleDOI
26 Feb 2002
TL;DR: This paper presents a matching algorithm based on a fixpoint computation that is usable across different scenarios and conducts a user study, in which the accuracy metric was used to estimate the labor savings that the users could obtain by utilizing the algorithm to obtain an initial matching.
Abstract: Matching elements of two data schemas or two data instances plays a key role in data warehousing, e-business, or even biochemical applications. In this paper we present a matching algorithm based on a fixpoint computation that is usable across different scenarios. The algorithm takes two graphs (schemas, catalogs, or other data structures) as input, and produces as output a mapping between corresponding nodes of the graphs. Depending on the matching goal, a subset of the mapping is chosen using filters. After our algorithm runs, we expect a human to check and if necessary adjust the results. As a matter of fact, we evaluate the 'accuracy' of the algorithm by counting the number of needed adjustments. We conducted a user study, in which our accuracy metric was used to estimate the labor savings that the users could obtain by utilizing our algorithm to obtain an initial matching. Finally, we illustrate how our matching algorithm is deployed as one of several high-level operators in an implemented testbed for managing information models and mappings.

1,613 citations

Proceedings Article
11 Sep 2001
TL;DR: This paper proposes a new algorithm, Cupid, that discovers mappings between schema elements based on their names, data types, constraints, and schema structure, using a broader set of techniques than past approaches.
Abstract: Schema matching is a critical step in many applications, such as XML message mapping, data warehouse loading, and schema integration. In this paper, we investigate algorithms for generic schema matching, outside of any particular data model or application. We first present a taxonomy for past solutions, showing that a rich range of techniques is available. We then propose a new algorithm, Cupid, that discovers mappings between schema elements based on their names, data types, constraints, and schema structure, using a broader set of techniques than past approaches. Some of our innovations are the integrated use of linguistic and structural matching, context-dependent matching of shared types, and a bias toward leaf structure where much of the schema content resides. After describing our algorithm, we present experimental results that compare Cupid to two other schema matching systems.

1,533 citations


"COMA: a system for flexible combina..." refers background in this paper

  • ...Numerous researchers have addressed the schema matching problem either for specific applications [1, 4, 5, 7, 8, 9, 11, 15, 16] or in a more generic way for different applications and schema languages [12, 13, 14]....

    [...]

  • ...element names, data types and structural properties [2, 12, 15, 16, 9] as well as characteristics of data instances [7, 8, 14, 11, 9]....

    [...]

  • ...Cupid [12] represents a sophisticated hybrid match...

    [...]

  • ...Furthermore, the distinction between hybrid and composite combination of matchers is introduced and previous match prototypes such as Cupid [12], SemInt [11], LSD [7], Dike [16], SF [13], TranScm [15], and Momis [2] are...

    [...]

  • ...and constraint-based matching approaches [18, 12]....

    [...]

Frequently Asked Questions (10)
Q1. What have the authors contributed in "COMA: a system for flexible combination of schema matching approaches"?

The authors provide a large spectrum of individual matchers, in particular a novel approach aiming at reusing results from previous match operations, and several mechanisms to combine the results of matcher executions. 

In future work, the authors plan to add other match and combination algorithms in order to improve match quality. Furthermore, the authors will apply COMA to additional schema types and applications, such as in the bioinformatics domain. 

The stable behavior of the default combination strategy indicates that it can be used for many match tasks thereby limiting the tuning effort. 

In contrast to single matchers, matcher combinations simultaneously analyze schema elements under different aspects, resulting in more stable and accurate similarity for heterogeneous schemas. 

The authors apply a selection strategy to choose the match candidates for a schema element, e.g. by selecting the elements of the other schema with the best similarity value exceeding a certain threshold. 

The default weights of the name and data type similarity, 0.7 and 0.3, respectively, make it possible to match attributes with similar names but different data types. 
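As a rough illustration of why these weights tolerate differing data types, the sketch below assumes the two similarities are combined as a weighted sum. The function name `combine_name_and_type` and the sample similarity values are hypothetical; only the weights 0.7 and 0.3 come from the text above.

```python
def combine_name_and_type(name_sim: float, type_sim: float,
                          w_name: float = 0.7, w_type: float = 0.3) -> float:
    """Weighted sum of name and data type similarity (both expected in [0, 1])."""
    return w_name * name_sim + w_type * type_sim


# Similar names, completely different data types: the name still dominates.
score = combine_name_and_type(name_sim=0.9, type_sim=0.0)
print(round(score, 2))  # 0.63

# Identical data types, unrelated names: the combined score stays low.
score = combine_name_and_type(name_sim=0.1, type_sim=1.0)
print(round(score, 2))  # 0.37
```

With a name weight of 0.7, a strong name match can clear a moderate threshold on its own, whereas a data type match alone contributes at most 0.3.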

Despite the high level of reuse in Schema (schema level), the authors believe that there is a high probability of finding the necessary match result pairs for MatchCompose in an environment where many schemas are managed and matched to each other. 

Recall can easily be maximized at the expense of poor Precision by returning all possible correspondences, i.e. the cross product of two input schemas. 
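The trade-off can be made concrete with a small sketch that uses the standard information-retrieval definitions of Precision and Recall. The schema element names and the reference mapping below are invented for illustration.

```python
from itertools import product


def precision_recall(predicted, reference):
    """Return (precision, recall) of predicted correspondences against a reference."""
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical schemas and manually determined correct correspondences.
schema_1 = ["name", "addr", "phone"]
schema_2 = ["fullName", "address", "tel", "fax"]
reference = {("name", "fullName"), ("addr", "address"), ("phone", "tel")}

# Returning the full cross product finds every correct pair (Recall = 1.0),
# but only 3 of the 12 returned pairs are correct (Precision = 0.25).
all_pairs = set(product(schema_1, schema_2))
print(precision_recall(all_pairs, reference))  # (0.25, 1.0)
```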

This element matcher combines the DataType and Name matcher, i.e. it matches elements based on a combination of their name and data type similarity. 

The most accurate match predictions can be achieved by selecting match candidates showing the (approximately) highest similarity exceeding a minimal threshold.
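A minimal sketch of such a selection step follows, assuming the similarities of one schema element to all elements of the other schema are available as a dictionary. The function name `select_candidates`, the parameter names `threshold` and `delta`, and all numeric values are assumptions for illustration, not COMA's actual interface or defaults.

```python
def select_candidates(similarities: dict[str, float],
                      threshold: float = 0.5,
                      delta: float = 0.0) -> list[str]:
    """Keep candidates above the threshold whose similarity is within
    `delta` of the best value, ordered by descending similarity."""
    if not similarities:
        return []
    best = max(similarities.values())
    chosen = [
        candidate for candidate, sim in similarities.items()
        if sim > threshold and sim >= best - delta
    ]
    return sorted(chosen, key=lambda candidate: -similarities[candidate])


# Hypothetical similarities of one element to the elements of the other schema.
row = {"address": 0.82, "addr2": 0.80, "fax": 0.30}

print(select_candidates(row))                  # ['address']
print(select_candidates(row, delta=0.05))      # ['address', 'addr2']
print(select_candidates(row, threshold=0.9))   # []
```

Setting `delta` above zero models the "approximately highest" case: near-ties with the best candidate are kept instead of being discarded.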