
Generic schema matching, ten years later

Published in Proceedings of the VLDB Endowment, Vol. 4, Iss. 11 (01 Aug 2011), pp. 695-701.



Generic Schema Matching, Ten Years Later
Philip A. Bernstein
Microsoft Corporation
philbe@microsoft.com
Jayant Madhavan
Google Inc.
jayant@google.com
Erhard Rahm
University of Leipzig
rahm@informatik.uni-leipzig.de
ABSTRACT
In a paper published in the 2001 VLDB Conference, we proposed
treating generic schema matching as an independent problem. We
developed a taxonomy of existing techniques, a new schema
matching algorithm, and an approach to comparative evaluation.
Since then, the field has grown into a major research topic. We
briefly summarize the new techniques that have been developed
and applications of the techniques in the commercial world. We
conclude by discussing future trends and recommendations for
further work.
1. INTRODUCTION
Schema matching is the problem of generating correspondences
between elements of two schemas. A schema is a formal structure
that represents an engineered artifact, such as a SQL schema,
XML schema, entity-relationship diagram, ontology description,
interface definition, or form definition. A correspondence is a
relationship between one or more elements of one schema and one
or more elements of the other. For example, the correspondences
in Figure 1 identify columns that represent the same concepts in
the two relational schemas. Often, the relationship is one-to-one,
but sometimes it is not, such as Author corresponding to LastName and
FirstName in Figure 1. We say that a correspondence has semantics if it
constrains the instances of the related schema elements. The common
default semantics for one-to-one correspondences is that the instances
of two related elements are equal.
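A toy sketch may help fix these definitions. The following illustrative Python (our representation, not the paper's) encodes the Figure 1 correspondences, including the one-to-many Author case:

```python
from dataclasses import dataclass

# A correspondence relates one or more elements of one schema to one or
# more elements of the other. The common default semantics for a
# one-to-one correspondence is instance equality ("=").
@dataclass(frozen=True)
class Correspondence:
    source: frozenset    # element names in schema 1
    target: frozenset    # element names in schema 2
    relation: str = "="  # default semantics: related instances are equal

# The Figure 1 examples: two one-to-one matches, plus the case where
# Author corresponds to BOTH LastName and FirstName (not one-to-one).
matches = [
    Correspondence(frozenset({"ISBN"}), frozenset({"ID"})),
    Correspondence(frozenset({"Title"}), frozenset({"BookTitle"})),
    Correspondence(frozenset({"Author"}),
                   frozenset({"LastName", "FirstName"})),
]

one_to_one = [c for c in matches
              if len(c.source) == 1 and len(c.target) == 1]
```

The structure deliberately allows sets on both sides, since a match result is not restricted to one-to-one pairs.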
There are many applications that require schema matching. In the
database field, it is usually the first step in generating a program
or view definition that maps instances of one schema into
instances of another. For example, it arises in object-to-relational
mappings, data warehouse loading, data exchange, and mediated
schemas for data integration. In knowledge-based applications,
such as life sciences applications and the semantic web, it arises in
the alignment of ontologies. For example, it may be used to align
gene ontologies or anatomical structures. In health care, it may
arise in the alignment of patient records and other medical reports.
In web applications, it may be used to align product catalogs. In e-
commerce, it may be used to align message formats representing
business documents, such as orders and invoices.
This paper recaps the contributions of our VLDB 2001 paper
about schema matching [45], summarizes developments since
then, and suggests problems that would benefit from further work.
Figure 1: Schema matching is the problem of generating correspondences
that identify related elements in two schemas.
2. CONTRIBUTIONS IN VLDB 2001 [45]
Twelve years ago, when we embarked on work in this area, we
noticed that schema matching techniques were developed as part
of a variety of applications. The techniques were often similar,
even when the applications were not. We concluded that the field
might move faster and the results might be more reusable if
schema matching were studied as a separate topic, independently
of the applications that use it. This recommendation was the first
contribution of [45].
We then surveyed the literature to identify these common
techniques. This resulted in a taxonomy of schema matching
techniques, which was the second contribution of [45]. We
extended this taxonomy into a survey paper, published later that
year in [63]. The taxonomy has often been used as a standard for
categorizing subsequent schema matching techniques.
Our third contribution was a new schema matching algorithm,
called Cupid, which combined a number of techniques: linguistic
matching, structure-based matching, constraint-based matching,
and context-based matching. Most of the later approaches to
schema matching have used this hybrid matcher approach, which
leverages different criteria to arrive at suggested correspondences.
We concluded with an experimental comparison of Cupid with
two other algorithms that were reported in the literature, namely
MOMIS [6] and DIKE [58]. This was the first such comparison
we know of. Such experimental comparisons have become a
feature of most of the later work on schema matching.
In summary, our 2001 paper posed schema matching as a problem
that could be studied in isolation. It gave a baseline of known
techniques. And given the inherently heuristic nature of solutions,
it suggested an approach to evaluate those solutions based on
experiments. As the references in [45] attest, we were by no
means the first to work on schema matching. However, we defined a
framework for research on this topic that enabled many others to follow.

Figure 1 (described by its caption above) shows the two relational
schemas Books(ISBN, Title, Author, MarkedPrice) and BookInfo(ID,
AuthorID references AuthorInfo, BookTitle, ListPrice, DiscountPrice),
with AuthorInfo(AuthorID, LastName, FirstName); dotted lines indicate
the correspondences.

Permission to make digital or hard copies of all or part of this work
for personal or classroom use is granted without fee provided that
copies are not made or distributed for profit or commercial advantage
and that copies bear this notice and the full citation on the first
page. To copy otherwise, to republish, to post on servers or to
redistribute to lists, requires prior specific permission and/or a fee.
Articles from this volume were invited to present their results at The
37th International Conference on Very Large Data Bases, August 29th -
September 3rd 2011, Seattle, Washington.
Proceedings of the VLDB Endowment, Vol. 4, No. 11
Copyright 2011 VLDB Endowment 2150-8097/11/08... $10.00.
3. SCHEMA MATCHING TECHNIQUES
There are now two books on schema matching [5][26] and two
surveys [63][68], so there is little point in our repeating such a
survey. However, to give the reader a feel for the scope of the
schema matching field, we list many of the known techniques
here. We start with techniques that were known in 2001 and that
we discussed in [45]:
Linguistic matching – based on an element’s name or
description, using stemming, tokenization, string and
substring matching, and information retrieval techniques.
Using auxiliary information – based on thesauri, acronyms,
dictionaries, and mismatch lists.
Instance-based matching – schema elements are regarded as
similar if their instances are similar, based on statistics,
metadata, or trained classifiers.
Structure-based matching – schema elements are similar if they appear
in similarly-structured groups, have similar relationships, or have
(paths of) relationships to similar elements.
Constraint-based matching – based on data types, value
ranges, uniqueness, nullability, and foreign keys.
Rule-based matching – based on matching rules that are
expressed in first-order logic.
Hybrid-matching – as explained in the previous section.
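As a rough illustration of the hybrid-matcher idea, the sketch below combines a linguistic matcher (a normalized string similarity) with a constraint-based matcher (crude data-type compatibility) via a weighted sum. The schemas, weights, and threshold are invented for the example; this is not Cupid's actual algorithm:

```python
import difflib

def name_sim(a, b):
    # Linguistic matcher: normalized string similarity as a stand-in for
    # stemming/tokenization/IR techniques.
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()

def type_sim(t1, t2):
    # Constraint-based matcher: very crude data-type compatibility.
    numeric = {"int", "float"}
    if t1 == t2:
        return 1.0
    if t1 in numeric and t2 in numeric:
        return 0.5
    return 0.0

def hybrid_match(schema1, schema2, w_name=0.7, w_type=0.3, threshold=0.5):
    # Hybrid matcher: weighted combination of the individual matchers;
    # each element keeps its best-scoring candidate above the threshold.
    result = {}
    for e1, t1 in schema1.items():
        score, best = max((w_name * name_sim(e1, e2) + w_type * type_sim(t1, t2), e2)
                          for e2, t2 in schema2.items())
        if score >= threshold:
            result[e1] = best
    return result

books = {"ISBN": "str", "Title": "str", "MarkedPrice": "float"}
bookinfo = {"ID": "str", "BookTitle": "str", "ListPrice": "float"}
result = hybrid_match(books, bookinfo)
```

Real hybrid matchers use far richer evidence (descriptions, structure, instances), but the combination-of-scores skeleton is the same.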
Since 2001, many other techniques have been developed. These
include algorithms that use new types of information. For
example:
Graph matching – based on comparing the relationships
between elements in the schema graphs by, for example,
either fixed-point computations on a similarity propagation
graph [53], or probabilistic constraint satisfaction algorithms
[22].
Usage-based matching – based on analyzing database query
logs for hints about how users relate schemas, e.g., by
equating elements in join clauses [25]. Taxonomy paths can
be matched by finding web pages that represent the paths and
then analyzing keyword-query logs to determine if the pages
are accessed via similar query distributions [55].
Document content similarity – where instances of a schema
element are grouped into a document that is then matched
with other such documents based on the information retrieval
measure TF-IDF (term frequency times inverse document
frequency) [44][49].
Document link similarity – where concepts in two ontologies
are regarded as similar if the entities referring to those
concepts are similar [42].
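The fixed-point flavor of graph matching can be sketched as follows. This toy propagation loop is only in the spirit of the similarity-propagation idea of [53], not the actual Similarity Flooding algorithm; graphs and seeds are invented:

```python
import difflib

def initial_sim(a, b):
    # Seed similarities from a simple string matcher.
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()

def propagate(nodes1, edges1, nodes2, edges2, rounds=5, alpha=0.5):
    # Fixed-point computation: the pair (a, b) absorbs similarity from
    # neighbor pairs (a2, b2) reachable via edges in both graphs, and
    # scores are normalized by each round's maximum.
    sim = {(a, b): initial_sim(a, b) for a in nodes1 for b in nodes2}
    for _ in range(rounds):
        nxt = {}
        for (a, b), s in sim.items():
            spill = sum(sim[(a2, b2)]
                        for a2 in edges1.get(a, ())
                        for b2 in edges2.get(b, ()))
            nxt[(a, b)] = s + alpha * spill
        top = max(nxt.values())
        sim = {pair: v / top for pair, v in nxt.items()}
    return sim

# Two tiny schema graphs: Book references Author in one schema,
# BookInfo references AuthorInfo in the other.
scores = propagate(["Book", "Author"], {"Book": ["Author"]},
                   ["BookInfo", "AuthorInfo"], {"BookInfo": ["AuthorInfo"]})
```

The point of the propagation is that structurally consistent pairs (Book–BookInfo, Author–AuthorInfo) reinforce each other, pulling ahead of pairs that only look alike lexically.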
Strategies have been proposed to flexibly combine multiple
matching algorithms and to scale gracefully to compare large
schemas. For example:
Workflow-like strategies to independently or sequentially
execute matchers and to combine their results [12][19][67].
Self-tuning match workflows – where for a given match task
or domain of match tasks, a tuner selects the match
components to be combined and/or assigns values to the
various parameters that affect how component match results
are combined [24][43][44].
Early search space pruning – where a fast matcher is used to
eliminate unlikely matches from consideration so that a
manageably-small number of elements can be matched using
more expensive and accurate techniques [23][57].
Partition-based matching – where to reduce the space of
possible matches, the input schemas are partitioned followed
by partition-wise matching [20][39][73].
Parallel matching – where different steps of the matching
algorithm are run in parallel or different partitions of the
schemas are matched in parallel [34].
Optimizations for large schemas such as using string
matching optimizations [40], pre-collecting predecessors and
children of each element to avoid repeated traversal [2], and
using space-efficient similarity matrices [12].
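A minimal sketch of partition-based matching, under the invented assumption that elements are path names partitioned by their top-level fragment:

```python
from collections import defaultdict

def partition_by_root(paths):
    # Hypothetical partitioning heuristic: group element paths by their
    # top-level fragment, e.g. "Order/Items/Qty" falls into "Order".
    parts = defaultdict(list)
    for p in paths:
        parts[p.split("/")[0]].append(p)
    return parts

def partition_wise_pairs(paths1, paths2, matched_partitions):
    # Compare elements only within matched partitions, shrinking the
    # search space from |S1| * |S2| to the sum of partition products.
    p1, p2 = partition_by_root(paths1), partition_by_root(paths2)
    pairs = []
    for a, b in matched_partitions:
        for e1 in p1.get(a, []):
            for e2 in p2.get(b, []):
                pairs.append((e1, e2))
    return pairs

s1 = ["Order/ID", "Order/Date", "Customer/Name", "Customer/City"]
s2 = ["PO/Id", "PO/Created", "Client/FullName"]
pairs = partition_wise_pairs(s1, s2, [("Order", "PO"), ("Customer", "Client")])
# 2*2 + 2*1 = 6 candidate pairs instead of the full 4*3 = 12
```

Real partitioning schemes are more sophisticated, but the payoff is the same: only elements in corresponding partitions are ever compared.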
Approaches have been proposed where multiple schemas in a
domain are collectively matched. For example:
Reuse-based matching – where matches between schema
fragments are harvested from validated mappings and used as
auxiliary information to help future match tasks in the same
domain [20][46][65].
Holistic matching – where a single mediated schema is
constructed for a domain by aligning elements of a large
corpus of schemas, such as web forms covering a particular
domain. Similar element names appearing in the same
schema are regarded as mismatches [37][38][66][69].
Strategies have been proposed to incorporate user interaction and
feedback in the matching process. For example:
GUI support to interactively inspect and correct computed
correspondences [3][11][16][31].
Incremental matching – where given a user-selected element
of one schema, the matcher calculates the best match or
matches (top-k) in the other schema [11].
Top-k matching – where instead of computing a complete
mapping between two schemas, the matcher computes the
top-k matches of each element of one schema to elements of
the other schema [11][32].
Collaborative, wiki-like user involvement to provide,
improve, and reuse mappings [50][72].
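Top-k matching is easy to illustrate. Here a simple string matcher (a stand-in for a real matcher) ranks candidates for one user-selected element; candidates and element names are invented:

```python
import difflib
import heapq

def top_k_matches(element, candidates, k=2):
    # Top-k matching: rather than fixing one complete mapping, return
    # the k best candidates so a user can confirm or correct the choice.
    return heapq.nlargest(
        k, candidates,
        key=lambda c: difflib.SequenceMatcher(
            None, element.lower(), c.lower()).ratio())

suggestions = top_k_matches("Author", ["AuthorID", "BookTitle",
                                       "LastName", "ListPrice"])
```

This is also the shape of incremental matching: the same ranking is computed on demand for whichever element the user selects.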
Finally, algorithms have been proposed that extend the semantics
of matches beyond that of simple correspondences. For example:
Semantic tagging – where correspondences are tagged with
semantic relationships, such as equality, containment,
disjointness, and unknown [33][35][48].
Conditional tagging – where correspondences are refined to
be valid only for certain values of another element. For
example, if productType = “book” then Invoice.Code =
ISBN [14][33].
4. SCHEMA MATCHING TOOLS
Most of the listed techniques have been implemented in a large
number of tools for schema and ontology matching [26][62].
Figure 2 shows a comparative overview of selected tools: Cupid
[45], COMA++ [3][19][20], ASMOV [40], Falcon-AO [39],
RiMOM [44], AgreementMaker [16], OII Harmony [67]. Most
recent prototypes support match workflows and the combined use
of different linguistic, structural and instance-based matchers.
External dictionaries such as synonym lists or thesauri are
commonly used to improve linguistic matching. GUI support is
often provided, albeit still with limitations [31]. A few systems are
able to match both schemas and ontologies [3][16][67]. As
indicated in Figure 2, advanced techniques such as schema
partitioning, parallel matching, mapping reuse and self-tuning
capabilities (e.g., a dynamic selection of matchers for a given
match task) are still only supported to a limited extent in current
match prototypes.
Match tools have been intensively evaluated but typically under
different conditions and for smaller match problems [4][18]. For
ontology matching, the Ontology Alignment Evaluation Initiative
(OAEI) organizes yearly contests that include some larger
problems, e.g., to match web directories or medical ontologies
(http://oaei.ontologymatching.org). The systems participating in
the OAEI contest have significantly improved over the years but
still struggle with larger problems [27]. For schema matching and
mapping, a comparable benchmark effort is still missing.
Semi-automatic schema matching is also increasingly supported
in commercial middleware tools, in particular for XML schemas
or relational database schemas. Systems such as Altova
MapForce, IBM InfoSphere, Microsoft BizTalk Server and SAP
NetWeaver provide a GUI-based editor for manual mapping
specification with some support for automatic determination of
match candidates, e.g., based on approximate name matching.
However, most of the more recently proposed match techniques
have not yet been incorporated in commercial mapping solutions.
5. USING MATCH RESULTS AS-IS
Even the best schema matching algorithms make many mistakes,
especially fully-automatic algorithms where there is no human
designer in the loop. Despite these errors, some applications can
use schema matching results as-is. This is especially the case
when a best-effort matching is satisfactory or when the matches
contribute only implicitly to the results of some end-user task. For
example, consider the following two scenarios for automatically
filling out HTML forms.
First, most of today’s browsers offer automatic form-filling, e.g.,
personal data such as name and address prior to a purchase. This
can be modeled as a task where the schema of the underlying
web-form is being matched to a model of user data that is stored
in the browser. The user expects the browser to make a best-effort
attempt at filling in personal details, which the user confirms
before submitting the form for processing.
Figure 3: Mappings between domain models and form inputs
can be used to automatically fill out HTML forms.
Second, schema mappings have been proposed as a means of
accessing the content that lies behind HTML forms [47][61]. A
deep-web crawler can work as follows: When the crawler
encounters an HTML form, it can identify the domain that the
form belongs to, and then match the inputs of the form to
elements in the previously-computed mediated schema for that
domain (see Figure 3). It can then generate form submissions by
constructing URLs using sample values for the inputs (based on
known values for the elements in the mediated schema). The
resulting pages are added to the index of the search engine. The
matching results in this case are intermediate results of a multi-
step process. End-users are unlikely to know or care about the
quality of the match result, except insofar as it affects how the
crawler exploits the underlying website.
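The crawler's URL-generation step might look like the following sketch; the domain, element names, and sample values are all invented for illustration:

```python
from itertools import product
from urllib.parse import urlencode

# Hypothetical mediated schema for one domain: each element carries
# sample values known to the crawler.
mediated = {"make": ["ford", "honda"], "zip": ["98052"]}

# Suppose the form's inputs were matched to mediated-schema elements:
form_match = {"mk": "make", "postal": "zip"}   # form input -> element

def form_submissions(action_url, form_match, mediated):
    # Enumerate candidate GET submissions, one URL per combination of
    # sample values; the crawler fetches and indexes the result pages.
    inputs = list(form_match)
    value_lists = [mediated[form_match[i]] for i in inputs]
    return [action_url + "?" + urlencode(dict(zip(inputs, combo)))
            for combo in product(*value_lists)]

urls = form_submissions("http://example.com/search", form_match, mediated)
```

Note that a mistaken correspondence here merely yields some useless form submissions, which is why best-effort match quality suffices for this application.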
6. APPLYING MATCH TO
MODEL MANAGEMENT
For most of the applications summarized in Section 1, schema
matching is just one step in a multi-step process. That multi-step
process involves other operators that manipulate schemas and
mappings, such as schema merging and mapping composition.
This recognition was actually the starting point for our research
into schema matching. In [8] and [9], we proposed a set of such
operators under the name “model management”. We then
embarked on a systematic study of these operators. Since nothing much
can be done until the first mapping is created, it was logical that we
started our investigation of operators with schema matching.

Figure 2: Comparison of selected match tools (based on [62]).

In fact, our first algorithmic result about one of the
operators was our paper “Generic Schema Matching” [45]. Since
then, there has been a lot of progress on other operators in
addition to match, which is summarized in [10].
Most data integration and data transformation applications, such
as those in Section 1, need to construct executable mappings—
ones that represent transformations of instances. Since match
algorithms produce correspondences, not semantic relationships,
the natural next step is to enrich those correspondences with
semantics [54]. Often this is a two-step process (Section 3.1 of
[10]). The first step is to generate semantics in the form of
constraints that relate parts of the instances of one schema to parts
of the instances of another schema. Such constraints may not be
functions, in which case they are not executable. In this case, a
second step is needed to translate the semantic relationships into
functions [51] via the operator TransGen.
Depending on the application, the resulting mapping may need to
undergo further manipulation. Suppose we match schemas S and
T and then generate a semantic mapping between them. We might
want to merge S and T into a single schema that covers both of
them, for example, to represent a mediated schema. This can be
done by the merge operator, which takes as input two schemas
and a mapping between them and returns a merged schema with
mappings between the merged schema and the two input schemas
[15][59][60][64].
Suppose we are using the mapping between S and T as a data
transformation that translates data from S's format into T's format. If
one of the schemas in a mapping, say T, is modified, generating T', then
we need to update the mapping between S and T to one between S and T'.
We can do this by composing the mappings S-T and T-T' [30][36],
yielding a mapping S-T' between S and T' [7][28][71].
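The composition step can be sketched with mappings represented as simple 1:1 dictionaries; this is a toy model, since real mappings carry richer semantics than element-name pairs:

```python
def compose(s_t, t_t2):
    # Composition of element-level mappings: if S.a -> T.b and
    # T.b -> T'.c, then S.a -> T'.c; pairs that do not chain are dropped.
    return {s: t_t2[t] for s, t in s_t.items() if t in t_t2}

def invert(m):
    # Invert reverses a mapping's direction (lossless only for 1:1 maps).
    return {v: k for k, v in m.items()}

s_t = {"Author": "AuthorName", "Title": "BookTitle"}
t_t2 = {"AuthorName": "FullName"}    # T evolved: AuthorName was renamed
s_t2 = compose(s_t, t_t2)
```

The dropped, non-chaining pairs hint at why composition of real semantic mappings is hard: information can be lost at each step.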
Other model management operators are Diff (which finds the
difference between mappings) and Extract (the complement of
Diff) [52], and Invert, which reverses the direction of a
unidirectional mapping [28][29].
For most practical applications, all of the model management
operators manipulate mappings that have semantics—except for
the match operator which has a special role. First the match
operator computes correspondences and then, building on these
correspondences, the other operators develop and manipulate
mappings that have semantics.
7. FUTURE TRENDS
Since 2001, there has been a growing realization that matching is
not a one-off task. For example, in data integration, as new data
sources become available, they are mapped to a single mediated
schema. In e-commerce, message formats of new business
partners have to be mapped to message formats that interface to
existing business processes. It is natural to expect that with each
subsequent task to match within a given domain or to a given
schema, the effort required to construct the mapping should
decrease, while the quality of the mapping should increase.
For a given vertical domain, such as product catalogs or patient
records, there are many possible schemas. These schemas exhibit
common patterns, which can be used to improve the results of a
schema matching algorithm. Most of the early approaches to
schema matching encoded this domain knowledge as constraints
or heuristics that were baked into the algorithm. The encoded
constraints were developed by a designer with intimate knowledge
of the domain.
A more flexible approach was introduced in [21]. It showed that
new mappings to a mediated schema can be learned from known
mappings to that schema. Machine-learning algorithms were used
to train models for elements in the mediated schema using known
mappings. The models were then applied to the elements in new
schemas to map them to the same mediated schema. The approach
was extended in [17] to learn complex expressions in addition to
just correspondences. It was further extended in [46] to show that
models can be trained from known mappings in a domain and
applied to match two completely new schemas in the same
domain.
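The learning idea can be caricatured with a token-overlap "model" per mediated-schema element. Real systems such as [21] train classifiers over names, instances, and other features, so this is only a sketch with invented training data:

```python
from collections import defaultdict
import re

def tokens(name):
    # Split camelCase / punctuation into lowercase tokens.
    return set(t.lower() for t in re.findall(r"[A-Za-z][a-z]*", name))

def train(known_mappings):
    # Learn, per mediated-schema element, the name tokens seen in source
    # elements that were previously mapped to it.
    model = defaultdict(set)
    for mapping in known_mappings:
        for src, mediated_el in mapping.items():
            model[mediated_el] |= tokens(src)
    return model

def predict(src, model):
    # Map a new source element to the mediated element whose learned
    # token set overlaps its name the most.
    def overlap(el):
        return len(tokens(src) & model[el]) / (len(tokens(src)) or 1)
    return max(model, key=overlap)

# Two previously validated mappings into a mediated schema {price, title}:
known = [{"ListPrice": "price", "BookTitle": "title"},
         {"Cost": "price", "Name": "title"}]
m = train(known)
prediction = predict("SalePrice", m)
```

Each validated mapping makes the per-element models richer, which is exactly the effect described above: matching effort decreases as more mappings in the domain are known.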
Much of the value of mappings is in the semantic expressions that
are developed from the initial correspondences. It is therefore
important to reuse those expressions, not simply generate
correspondences based on learned models. An early approach in
[19] proposed reusing a validated mapping fragment F by
matching the source and target of the schemas to be matched with
the source and target of F. This introduces several related
problems. First, there is the question of how to partition a schema
into fragments, whose validated mappings can be reused. Second,
a repository is needed to store and provide access to validated
mappings [1]. Third, there is the combinatorial problem of finding
possible matches of each mapping in the library to the many
positions where it might fit in the source and target of the schemas
to be matched. One attempt is discussed in [20]. More work along
these lines is needed.
Despite this progress in mapping reuse, little of the technology
has made it into commercial offerings.
The availability of large numbers of schemas on the web makes
the holistic matching approach quite appealing. Collective schema
matching was proposed in [37] and applied in [38] to match the
inputs in HTML forms. Many schemas (i.e., forms) that are
known to be in a given domain are collectively analyzed to infer a
single mediated schema for that domain. Then a generative model
is learned for the domain based on the assumption that each
distinct schema is simply a different representation of a subset of
a single underlying domain schema. Subsequent work has
extended this clustering approach to accommodate more complex
mappings between HTML forms [70]. These approaches have
thus far been restricted to form matching where the schemas are
small, with just a few, well-understood underlying concepts in the
domain.
In most schema matching scenarios, there is a human in the loop.
Therefore, it is important to have excellent graphical support for
viewing mappings [31]. For example, since large schemas cannot
be viewed on a single screen, it is beneficial to partition them into
fragments that can be matched independently, to the extent
possible. Matching tools also need to offer better support for the
mapping process. For example, users need help in remembering
which schema elements they have examined during the match
process and what was learned by that examination, such as
promising and specious candidates.
We see an increasing convergence of schema matching and entity
resolution approaches, i.e., matching at the metadata level and
matching at the instance level. Most recent schema and ontology
matching prototypes include instance-based matchers [61] that
derive the similarity of schema elements from the similarity or
overlap of element instances. Entity resolution, i.e., the
identification of semantically corresponding entities or instances,
can benefit from the semantic categorization of entities within
ontologies and the provision of ontology mappings. For example,
the organization of products or product offers within product
catalogs helps to restrict product matching between different
sources to corresponding or closely related product categories,
based on a pre-determined ontology mapping between the product
catalogs. Link discovery to interconnect sources in the so-called
web of linked data [13][56] is an area where such semantic entity
resolution approaches are needed and applicable due to the broad
availability of ontologies.
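A sketch of category-restricted product matching, with an invented catalog and a pre-determined category mapping serving as a blocking key:

```python
import difflib

def blocked_pairs(offers1, offers2, category_map):
    # Use a pre-determined category (ontology) mapping as a blocking
    # key: only offers in corresponding categories are ever compared.
    for o1 in offers1:
        for o2 in offers2:
            if category_map.get(o1["cat"]) == o2["cat"]:
                yield o1, o2

def match_offers(offers1, offers2, category_map, threshold=0.6):
    # Entity resolution on the reduced candidate set, here with a plain
    # string similarity standing in for a real product matcher.
    return [(a["name"], b["name"])
            for a, b in blocked_pairs(offers1, offers2, category_map)
            if difflib.SequenceMatcher(
                None, a["name"].lower(), b["name"].lower()).ratio()
               >= threshold]

catalog1 = [{"name": "Canon EOS 450D", "cat": "Cameras"},
            {"name": "HP Pavilion 15", "cat": "Laptops"}]
catalog2 = [{"name": "Canon EOS-450D Kit", "cat": "Digital Cameras"},
            {"name": "Lenovo ThinkPad", "cat": "Notebooks"}]
category_map = {"Cameras": "Digital Cameras", "Laptops": "Notebooks"}

pairs = match_offers(catalog1, catalog2, category_map)
```

The schema-level mapping (between category ontologies) and the instance-level matching (between offers) reinforce each other, which is the convergence noted above.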
8. CONCLUSION
In this paper, we briefly summarized generic schema matching
developments since we published our 2001 paper that introduced
the subject [45]. We listed published techniques, how published
techniques are used, and future trends.
There seem always to be new sources of information available to
new schema matching techniques and clever ways of combining
existing techniques. In this sense, the problem of schema
matching is inherently open-ended. Thus, the schema matching
field is still a vibrant one, with many opportunities for researchers
and tool developers to move it forward.
9. ACKNOWLEDGMENTS
We thank the many researchers who have collaborated with us
over the years, helping us learn many of the lessons summarized
in this paper. They include Eddie Churchill, Hong-Hai Do, AnHai
Doan, Alon Halevy, Sabine Massmann, Sergey Melnik, Michalis
Petropoulos, and Christoph Quix.
10. REFERENCES
[1] Alexe, B., M. Gubanov, M. A. Hernandez, H. Ho, J.-W.
Huang, Y. Katsis, L. Popa, B. Saha, and I. Stanoi.
Simplifying Information Integration: Object-Based Flow-
of-Mappings Framework for Integration. Proc. BIRTE,
108–121. Springer, 2009.
[2] Algergawy, A., E. Schallehn, and G. Saake: Improving
XML schema matching performance using Prüfer
sequences. Data Knowl. Eng. 68(8), 728-747, 2009.
[3] Aumueller, D., H.H. Do, S. Massmann, and E. Rahm:
Schema and ontology matching with COMA++. Proc.
SIGMOD, demo paper, 906-908, 2005.
[4] Bellahsene, Z., A. Bonifati, F. Duchateau, and Y.
Velegrakis: On Evaluating Schema Matching and
Mapping. In: Z. Bellahsene, A. Bonifati, E.Rahm (eds),
Schema Matching and Mapping, Springer, 2011.
[5] Bellahsene, Z., A. Bonifati, and E. Rahm (editors), Schema
Matching and Mapping, Springer, 2011.
[6] Bergamaschi, S., S. Castano, and M. Vincini: Semantic
Integration of Semistructured and Structured Data Sources.
SIGMOD Record 28(1), 54-59, 1999.
[7] Bernstein, P.A., T. J. Green, S. Melnik, and A. Nash:
Implementing mapping composition. VLDB J. 17(2), 333-
353, 2008.
[8] Bernstein, P.A., L.M. Haas, M. Jarke, E. Rahm, and G.
Wiederhold: Panel: Is Generic Metadata Management
Feasible? Proc. VLDB, 660-662, 2000.
[9] Bernstein, P.A., A.Y. Halevy, and R. Pottinger: A Vision
of Management of Complex Models. SIGMOD Record
29(4), 55-63, 2000.
[10] Bernstein, P.A. and S. Melnik: Model Management 2.0:
Manipulating Richer Mappings. Proc. SIGMOD, 1-12, 2007.
[11] Bernstein, P.A., S. Melnik, and J.E. Churchill: Incremental
schema matching. Proc. VLDB, demo paper, 1167-1170,
2006.
[12] Bernstein, P.A., S. Melnik, M. Petropoulos, and C. Quix:
Industrial-Strength Schema Matching. ACM SIGMOD
Record 33(4), 38-43, 2004.
[13] Bizer, C., T. Heath, and T. Berners-Lee: Linked Data - The
Story So Far. Int. J. Semantic Web Inf. Syst. 5(3), 1-22,
2009.
[14] Bohannon, P., E. Elnahrawy, W. Fan, and M. Flaster:
Putting context into schema matching. Proc. VLDB, 307-
318, 2006.
[15] Chiticariu, L., P. G. Kolaitis, and L. Popa: Interactive
generation of integrated schemas. Proc. SIGMOD, 833-
846, 2008.
[16] Cruz, I.F., F.P. Antonelli, and C. Stroe: AgreementMaker:
Efficient Matching for Large Real-World Schemas and
Ontologies. PVLDB 2(2), demo paper, 1586-1589, 2009.
[17] Dhamankar, R., Y. Lee, A-H. Doan, A.Y. Halevy, and P.
Domingos: iMAP: Discovering Complex Mappings
between Database Schemas. Proc. SIGMOD, 383-394,
2004.
[18] Do, H.H., S. Melnik, and E. Rahm: Comparison of Schema
Matching Evaluations. In: Web, Web-Services, and
Database Systems, Springer LNCS 2593, 221-237, 2003.
[19] Do, H.H. and E. Rahm: COMA – A System for Flexible
Combination of Schema Matching Approaches. Proc.
VLDB, 610-621, 2002.
[20] Do, H.H. and E. Rahm: Matching large schemas:
Approaches and evaluation. Inf. Syst. 32(6), 857-885,
2007.
[21] Doan, A-H., P. Domingos, and A.Y. Halevy: Reconciling
the Schemas of Disparate Data Sources: A Machine-
Learning Approach. Proc. SIGMOD, 509-520, 2001.
[22] Doan, A-H., J. Madhavan, P. Domingos, and A. Y. Halevy:
Learning to Map between Ontologies on the Semantic
Web. Proc. WWW, 662-673, 2002.
[23] Ehrig, M. and S. Staab: Quick ontology matching. Proc.
Int. Conf. Semantic Web (ISWC), Springer LNCS 3298,
683-697, 2004.
[24] Ehrig, M., S. Staab, and Y. Sure: Bootstrapping Ontology
Alignment Methods with APFEL. Proc. Int. Conf.
Semantic Web (ISWC), Springer LNCS 3729, 186-200,
2005.
[25] Elmeleegy, H., M. Ouzzani, and A.K. Elmagarmid: Usage-
Based Schema Matching. Proc. ICDE, 20-29, 2008.
[26] Euzenat, J. and P. Shvaiko, Ontology Matching, Springer,
2007.

Frequently Asked Questions (13)
Q1. What contributions have the authors mentioned in the paper "Generic schema matching, ten years later" ?

In a paper published in the 2001 VLDB Conference, the authors proposed treating generic schema matching as an independent problem. The authors conclude by discussing future trends and recommendations for further work. 

Optimizations for large schemas such as using string matching optimizations [40], pre-collecting predecessors and children of each element to avoid repeated traversal [2], and using space-efficient similarity matrices [12]. 
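The string-matching optimization mentioned here can be illustrated by precomputing a trigram set once per name, so that each pairwise comparison only intersects sets. A minimal Python sketch (the Dice measure, names, and threshold are illustrative, not taken from the cited systems):

```python
from itertools import product

def trigrams(s: str) -> frozenset:
    """Character trigrams of a lowercased, padded name."""
    s = f"##{s.lower()}##"
    return frozenset(s[i:i + 3] for i in range(len(s) - 2))

def match_names(names_a, names_b, threshold=0.5):
    # Precompute trigram sets once per name instead of once per pair --
    # the kind of string-matching optimization referred to above.
    tg_a = {n: trigrams(n) for n in names_a}
    tg_b = {n: trigrams(n) for n in names_b}
    matches = []
    for a, b in product(names_a, names_b):
        ta, tb = tg_a[a], tg_b[b]
        sim = 2 * len(ta & tb) / (len(ta) + len(tb))  # Dice coefficient
        if sim >= threshold:
            matches.append((a, b, round(sim, 2)))
    return matches
```

For example, `match_names(["CustomerName"], ["CustName"])` reports the pair as a candidate match despite the abbreviation.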

Partition-based matching – where, to reduce the space of possible matches, the input schemas are first partitioned and matching is then performed partition-wise [20][39][73].
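The partition-wise idea can be sketched as follows; the dotted element paths, root-based partitioning, and caller-supplied similarity function are simplifying assumptions, not details of the cited systems:

```python
def partition(elements):
    """Group dotted element paths by their top-level subtree root."""
    parts = {}
    for e in elements:
        parts.setdefault(e.split(".")[0], []).append(e)
    return parts

def partition_match(schema_a, schema_b, sim, threshold=0.8):
    """Match only within partition pairs whose roots are similar,
    shrinking the candidate space from all |A| x |B| element pairs."""
    pa, pb = partition(schema_a), partition(schema_b)
    matches = []
    for root_a, elems_a in pa.items():
        for root_b, elems_b in pb.items():
            if sim(root_a, root_b) < threshold:
                continue  # prune the whole partition pair at once
            for a in elems_a:
                for b in elems_b:
                    if sim(a.split(".")[-1], b.split(".")[-1]) >= threshold:
                        matches.append((a, b))
    return matches
```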
The first step is to generate semantics in the form of constraints that relate parts of the instances of one schema to parts of the instances of another schema. 

Usage-based matching – based on analyzing database query logs for hints about how users relate schemas, e.g., by equating elements in join clauses [25]. 
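The join-clause heuristic can be sketched as a simple scan over a query log; the regular expression and log format are simplifying assumptions:

```python
import re
from collections import Counter

# Matches equality predicates of the form table.column = table.column.
JOIN = re.compile(r"(\w+\.\w+)\s*=\s*(\w+\.\w+)")

def join_hints(query_log):
    """Count column pairs equated in join predicates; frequently
    joined pairs become candidate correspondences."""
    hints = Counter()
    for query in query_log:
        for a, b in JOIN.findall(query):
            hints[tuple(sorted((a.lower(), b.lower())))] += 1
    return hints
```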

Systems such as Altova MapForce, IBM Infosphere, Microsoft BizTalk Server and SAP Netweaver provide a GUI-based editor for manual mapping specification with some support for automatic determination of match candidates, e.g., based on approximate name matching. 

Most recent schema and ontology matching prototypes include instance-based matchers [61] that derive the similarity of schema elements from the similarity or overlap of element instances.
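A common building block of such matchers is the overlap of the distinct instance values observed under two elements; a minimal sketch using the Jaccard coefficient:

```python
def instance_similarity(values_a, values_b):
    """Jaccard overlap of the distinct instance values observed
    under two schema elements."""
    sa, sb = set(values_a), set(values_b)
    if not sa or not sb:
        return 0.0  # no evidence for either element
    return len(sa & sb) / len(sa | sb)
```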

That multi-step process involves other operators that manipulate schemas and mappings, such as schema merging and mapping composition. 

Most recent prototypes support match workflows and the combined use of different linguistic, structural and instance-based matchers.

As indicated in Figure 2, advanced techniques such as schema partitioning, parallel matching, mapping reuse and self-tuning capabilities (e.g., a dynamic selection of matchers for a given match task) are still only supported to a limited extent in current match prototypes. 

Other model management operators are Diff (which finds the difference between mappings) and Extract (the complement of Diff) [52], and Invert, which reverses the direction of a unidirectional mapping [28][29]. 

This can be done by the merge operator, which takes as input two schemas and a mapping between them and returns a merged schema with mappings between the merged schema and the two input schemas [15][59][60][64]. 
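A toy illustration of the merge operator on flat schemas (real merge operators handle nested structure and conflicts; the representation here is an illustrative simplification):

```python
def merge(schema_a, schema_b, mapping):
    """Collapse each mapped element pair into one merged element and
    copy everything else through; returns the merged schema plus a
    mapping from each input element to its merged counterpart."""
    b_to_a = {b: a for a, b in mapping}
    merged = list(schema_a)
    map_a = {a: a for a in schema_a}
    map_b = {}
    for b in schema_b:
        if b in b_to_a:
            map_b[b] = b_to_a[b]   # unified with its match in A
        else:
            merged.append(b)       # unmatched: copied through
            map_b[b] = b
    return merged, map_a, map_b
```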

It can then generate form submissions by constructing URLs using sample values for the inputs (based on known values for the elements in the mediated schema).
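Such URL construction can be sketched as follows, assuming a GET form and a hypothetical mapping from form input names to mediated-schema elements (the URL and field names are illustrative):

```python
from urllib.parse import urlencode

def form_submissions(action_url, input_to_element, sample_values):
    """Build GET-style form submissions: for each form input, plug in
    the sample values known for the mediated-schema element it maps to."""
    urls = []
    for input_name, element in input_to_element.items():
        for value in sample_values.get(element, []):
            urls.append(f"{action_url}?{urlencode({input_name: value})}")
    return urls
```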