Generic schema matching, ten years later
Summary
1. INTRODUCTION
- Schema matching is the problem of generating correspondences between elements of two schemas.
- A correspondence is a relationship between one or more elements of one schema and one or more elements of the other.
- There are many applications that require schema matching.
- It may be used to align gene ontologies or anatomical structures.
- In web applications, it may be used to align product catalogs.
2. CONTRIBUTIONS IN VLDB 2001 [45]
- Twelve years ago, when the authors embarked on work in this area, they noticed that schema matching techniques were developed as part of a variety of applications.
- The authors then surveyed the literature to identify these common techniques.
- This resulted in a taxonomy of schema matching techniques, which was the second contribution of [45].
- The authors concluded with an experimental comparison of Cupid with two other algorithms that were reported in the literature, namely MOMIS [6] and DIKE [58].
3. SCHEMA MATCHING TECHNIQUES
- To give the reader a feel for the scope of the schema matching field, the authors list many of the known techniques here.
- The authors start with techniques that were known in 2001 and that they discussed in [45].
- Linguistic matching – based on an element’s name or description, using stemming, tokenization, string and substring matching, and information retrieval techniques.
- Rule-based matching – based on matching rules that are expressed in first-order logic.
- Techniques developed since 2001 include algorithms that use new types of information.
- Partition-based matching – the input schemas are partitioned and then matched partition by partition, which reduces the space of possible matches [20][39][73].
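As a rough illustration of the linguistic matching described in this section, the following Python sketch tokenizes element names and compares the tokens with a generic string similarity. The function names, the camelCase and underscore tokenizer, the 0.7 threshold, and the choice of difflib.SequenceMatcher are all assumptions made for this example. It is not the method of Cupid or of any surveyed tool, and it leaves out stemming and information retrieval techniques.

```python
import re
from difflib import SequenceMatcher


def tokenize(name):
    """Split a name on whitespace, underscores, hyphens, and camelCase boundaries."""
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", name)
    return [t.lower() for t in re.split(r"[\s_\-]+", spaced) if t]


def name_similarity(a, b):
    """Average best-token similarity, computed in both directions and averaged."""
    ta, tb = tokenize(a), tokenize(b)
    if not ta or not tb:
        return 0.0

    def directed(xs, ys):
        # For each token in xs, keep its best score against any token in ys.
        return sum(
            max(SequenceMatcher(None, x, y).ratio() for y in ys) for x in xs
        ) / len(xs)

    return (directed(ta, tb) + directed(tb, ta)) / 2


def linguistic_match(source, target, threshold=0.7):
    """Return (source, target, score) candidates at or above the threshold."""
    pairs = []
    for s in source:
        for t in target:
            score = name_similarity(s, t)
            if score >= threshold:
                pairs.append((s, t, round(score, 2)))
    return sorted(pairs, key=lambda p: -p[2])


candidates = linguistic_match(["CustomerName", "Zip"], ["customer_name", "price"])
```

Under these assumptions, "CustomerName" and "customer_name" tokenize to the same tokens and are proposed as a match candidate, while the other pairs fall below the threshold. A real matcher would combine such name scores with other evidence.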
4. SCHEMA MATCHING TOOLS
- Most of the listed techniques have been implemented in a large number of tools for schema and ontology matching [26][62].
- GUI support is often provided, albeit still with limitations [31].
- As indicated in Figure 2, advanced techniques such as schema partitioning, parallel matching, mapping reuse and self-tuning capabilities (e.g., a dynamic selection of matchers for a given match task) are still only supported to a limited extent in current match prototypes.
- For ontology matching, the Ontology Alignment Evaluation Initiative (OAEI) organizes yearly contests that include some larger problems, e.g., to match web directories or medical ontologies (http://oaei.ontologymatching.org).
- Semi-automatic schema matching is also increasingly supported in commercial middleware tools, in particular for XML schemas or relational database schemas.
5. USING MATCH RESULTS AS-IS
- Even the best schema matching algorithms make many mistakes, especially fully-automatic algorithms where there is no human designer in the loop.
- This is especially the case when a best-effort matching is satisfactory or when the matches contribute only implicitly to the results of some end-user task.
- First, most of today’s browsers offer automatic form-filling, e.g., personal data such as name and address prior to a purchase.
- When the crawler encounters an HTML form, it can identify the domain that the form belongs to, and then match the inputs of the form to elements in the previously-computed mediated schema for that domain (see Figure 3).
- The resulting pages are added to the index of the search engine.
6. APPLYING MATCH TO MODEL MANAGEMENT
- For most of the applications summarized in Section 1, schema matching is just one step in a multi-step process.
- Since match algorithms produce correspondences, not semantic relationships, the natural next step is to enrich those correspondences with semantics [54].
- Depending on the application, the resulting mapping may need to undergo further manipulation.
- Suppose the authors match schemas S and T and then generate a semantic mapping between them.
- For most practical applications, all of the model management operators manipulate mappings that have semantics—except for the match operator which has a special role.
7. FUTURE TRENDS
- Since 2001, there has been a growing realization that matching is not a one-off task.
- These schemas exhibit common patterns, which can be used to improve the results of a schema matching algorithm.
- It is therefore important to reuse those expressions, not simply generate correspondences based on learned models.
- An early approach in [19] proposed reusing a validated mapping fragment F by matching the source and target of the schemas to be matched with the source and target of F.
- Many schemas (i.e., forms) that are known to be in a given domain are collectively analyzed to infer a single mediated schema for that domain.
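The fragment-reuse idea attributed to [19] in this section can be sketched as follows. This is only one possible reading: the fragment F is assumed to be a list of validated (source element, target element) pairs, and "matching" a new schema against F is reduced to comparing normalized names. The helper names normalize and reuse_fragment are hypothetical and do not come from [19].

```python
def normalize(name):
    """Crude name normalization (an assumption): drop separators, ignore case."""
    return name.replace("_", "").replace("-", "").lower()


def reuse_fragment(source, target, fragment):
    """Derive correspondences between source and target by going through F.

    fragment is a list of validated (f_source, f_target) pairs. A pair is
    reused only if its source side matches an element of `source` and its
    target side matches an element of `target`.
    """
    src_index = {normalize(s): s for s in source}
    tgt_index = {normalize(t): t for t in target}
    reused = []
    for f_src, f_tgt in fragment:
        s = src_index.get(normalize(f_src))
        t = tgt_index.get(normalize(f_tgt))
        if s is not None and t is not None:
            reused.append((s, t))
    return reused


fragment = [("ship_to", "deliveryAddress"), ("bill_to", "invoiceAddress")]
reused = reuse_fragment(
    ["ShipTo", "OrderNo"], ["delivery_address", "order_number"], fragment
)
```

Here only the first fragment pair applies, because "bill_to" has no counterpart in the new source schema. Swapping the exact-name lookup for a similarity-based matcher would be the natural next refinement.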
8. CONCLUSION
- The authors briefly summarized generic schema matching developments since they published their 2001 paper that introduced the subject [45].
- There seem always to be new sources of information available to new schema matching techniques and clever ways of combining existing techniques.
- The problem of schema matching is inherently open-ended.
- Thus, the schema matching field is still a vibrant one, with many opportunities for researchers and tool developers to move it forward.
References
[45] Madhavan, J., P. A. Bernstein, and E. Rahm: Generic Schema Matching with Cupid...
Frequently Asked Questions (13)
Q2. What is the common technique for comparing large schemas?
Optimizations for large schemas such as using string matching optimizations [40], pre-collecting predecessors and children of each element to avoid repeated traversal [2], and using space-efficient similarity matrices [12].
Q3. What is the common technique used to match schemas?
Partition-based matching – the input schemas are partitioned and then matched partition by partition, which reduces the space of possible matches [20][39][73].
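A minimal sketch of the partition-based idea, under strong simplifying assumptions: elements are "table.column" paths, each table is one partition, and both partitions and columns are compared with difflib.SequenceMatcher. The function names and both thresholds are hypothetical. The partitioning schemes in [20][39][73] are more elaborate.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def partition(elements):
    """Group 'table.column' paths by table name (an assumed partitioning rule)."""
    parts = defaultdict(list)
    for path in elements:
        table, _, column = path.partition(".")
        parts[table.lower()].append(column)
    return dict(parts)


def sim(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def partitioned_match(source, target, part_threshold=0.6, elem_threshold=0.7):
    """Match columns only inside partition pairs whose names look similar."""
    sp, tp = partition(source), partition(target)
    matches, comparisons = [], 0
    for s_name, s_cols in sp.items():
        for t_name, t_cols in tp.items():
            if sim(s_name, t_name) < part_threshold:
                continue  # dissimilar partitions are skipped entirely
            for sc in s_cols:
                for tc in t_cols:
                    comparisons += 1
                    if sim(sc, tc) >= elem_threshold:
                        matches.append((f"{s_name}.{sc}", f"{t_name}.{tc}"))
    return matches, comparisons


source = ["Customer.name", "Customer.city", "Order.total"]
target = ["Customers.name", "Customers.town", "Orders.total"]
matches, comparisons = partitioned_match(source, target)
```

In this toy input, only 5 column pairs are compared instead of the 9 that an all-pairs comparison would need, which is the kind of saving partitioning aims for on large schemas.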
Q4. What is the first step in the process of generating semantics?
The first step is to generate semantics in the form of constraints that relate parts of the instances of one schema to parts of the instances of another schema.
Q5. What is the common technique for comparing schemas?
Usage-based matching – based on analyzing database query logs for hints about how users relate schemas, e.g., by equating elements in join clauses [25].
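To make the join-clause hint concrete, here is a small sketch that scans a query log for equalities between two qualified column names and counts how often each pair is equated. The regular expression, the function name join_hints, and the sample queries are assumptions for illustration. A real system would use a SQL parser, since this pattern would also be fooled by comparisons against decimal literals such as 10.5.

```python
import re
from collections import Counter

# Matches "a.x = b.y" where both sides are qualified names (a simplification).
JOIN_EQ = re.compile(r"\b(\w+\.\w+)\s*=\s*(\w+\.\w+)\b")


def join_hints(query_log):
    """Count how often each pair of qualified columns is equated in the log."""
    counts = Counter()
    for query in query_log:
        for left, right in JOIN_EQ.findall(query):
            # Sort so that a.x = b.y and b.y = a.x count as the same pair.
            pair = tuple(sorted((left.lower(), right.lower())))
            counts[pair] += 1
    return counts


log = [
    "SELECT * FROM customers JOIN orders ON customers.id = orders.cust_id",
    "SELECT * FROM orders JOIN customers ON orders.cust_id=customers.id",
    "SELECT * FROM orders WHERE orders.status = 'open'",
]
hints = join_hints(log)
```

Pairs that are equated frequently can then be fed to a matcher as evidence that the two elements are related.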
Q6. What is the way to match a schema?
Systems such as Altova MapForce, IBM InfoSphere, Microsoft BizTalk Server and SAP NetWeaver provide a GUI-based editor for manual mapping specification with some support for automatic determination of match candidates, e.g., based on approximate name matching.
Q7. What is the recent work on schema matching?
Most recent schema and ontology matching prototypes include instance-based matchers [61] that derive the similarity of schema elements from the similarity or overlap of element instances.
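One simple way to turn instance overlap into a similarity score is the Jaccard coefficient of the two value sets. The sketch below assumes that each schema element comes with a sample of its instance values. The function names and the 0.5 threshold are hypothetical, and the matchers surveyed in [61] may use quite different measures.

```python
def jaccard(a, b):
    """Overlap of two value collections: |A intersect B| / |A union B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def instance_match(source_cols, target_cols, threshold=0.5):
    """For each source element, propose the target element with the best overlap."""
    if not target_cols:
        return []
    results = []
    for s_name, s_vals in source_cols.items():
        best = max(target_cols, key=lambda t: jaccard(s_vals, target_cols[t]))
        score = jaccard(s_vals, target_cols[best])
        if score >= threshold:
            results.append((s_name, best, score))
    return results


source_cols = {"state": ["CA", "NY", "TX"], "qty": [1, 2, 3]}
target_cols = {"region": ["CA", "NY", "WA"], "amount": [10, 20]}
proposals = instance_match(source_cols, target_cols)
```

Here "state" and "region" share two of four distinct values, so they are proposed as a match even though their names have nothing in common.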
Q8. What is the common way to use a multi-step process?
That multi-step process involves other operators that manipulate schemas and mappings, such as schema merging and mapping composition.
Q9. What are the recent prototypes of match tools?
Most recent prototypes support match workflows and the combined use of different linguistic, structural and instance-based matchers.
Q10. What are the recent tools that support schema matching?
As indicated in Figure 2, advanced techniques such as schema partitioning, parallel matching, mapping reuse and self-tuning capabilities (e.g., a dynamic selection of matchers for a given match task) are still only supported to a limited extent in current match prototypes.
Q11. What other operators can be used to manipulate mappings?
Other model management operators are Diff (which finds the difference between mappings) and Extract (the complement of Diff) [52], and Invert, which reverses the direction of a unidirectional mapping [28][29].
Q12. What can be done to merge two schemas into a single schema?
This can be done by the merge operator, which takes as input two schemas and a mapping between them and returns a merged schema with mappings between the merged schema and the two input schemas [15][59][60][64].
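The input and output shape of such an operator can be illustrated with a deliberately naive sketch. It assumes flat schemas (lists of element names) and a mapping made only of one-to-one equality correspondences, which is far simpler than the merge algorithms in [15][59][60][64]. The function name and the "_b" renaming rule are hypothetical.

```python
def merge(schema_a, schema_b, mapping):
    """Merge two flat schemas given one-to-one (a, b) correspondences.

    Returns the merged schema plus two mappings, from merged elements to
    the elements of schema_a and schema_b that they represent.
    """
    a_to_b = dict(mapping)
    b_matched = set(a_to_b.values())
    merged, to_a, to_b = [], {}, {}

    for a in schema_a:
        merged.append(a)
        to_a[a] = a
        if a in a_to_b:
            to_b[a] = a_to_b[a]  # matched elements collapse into one

    for b in schema_b:
        if b in b_matched:
            continue
        # Unmatched element: keep it, renaming on a name clash (naive rule).
        name = b if b not in to_a else f"{b}_b"
        merged.append(name)
        to_b[name] = b

    return merged, to_a, to_b


merged, to_a, to_b = merge(
    ["cust_id", "name"],
    ["customer_id", "phone", "name"],
    [("cust_id", "customer_id")],
)
```

In this example "cust_id" and "customer_id" collapse into a single merged element, while the two unmatched "name" elements are kept apart because the mapping does not relate them.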
Q13. What is the best-effort method for generating form submissions?
It can then generate form submissions by constructing URLs using sample values for the inputs (based on known values for the elements in the mediated schema).
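The URL construction step might look roughly like the sketch below. It assumes a GET form, a dictionary that maps each form input to the mediated-schema element it was matched to, and a table of known sample values per element. The function name, the parameter names, the limit, and the example.com URL are all illustrative and are not taken from any actual crawler.

```python
from itertools import product
from urllib.parse import urlencode


def form_submission_urls(action_url, input_to_element, sample_values, limit=10):
    """Build GET URLs by filling matched inputs with known sample values."""
    # Keep only inputs whose matched element has at least one sample value.
    names = [n for n in input_to_element if sample_values.get(input_to_element[n])]
    if not names:
        return []
    value_lists = [sample_values[input_to_element[n]] for n in names]

    urls = []
    for combo in product(*value_lists):
        query = urlencode(dict(zip(names, combo)))
        urls.append(f"{action_url}?{query}")
        if len(urls) >= limit:
            break  # cap the number of generated submissions
    return urls


urls = form_submission_urls(
    "https://example.com/search",
    {"mk": "make", "zip": "zipcode"},
    {"make": ["honda", "ford"], "zipcode": ["98052"]},
)
```

The cap matters because the number of value combinations grows multiplicatively with the number of inputs.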