A survey of approaches to automatic schema matching
Summary (5 min read)
1. Introduction
- A fundamental operation in the manipulation of schema information is Match, which takes two schemas as input and produces a mapping between elements of the two schemas that correspond semantically to each other [LC94, MIR94, MZ98, PSU98, MWJ99, DDL00].
- Match plays a central role in numerous applications, such as web-oriented data integration, electronic commerce, schema integration, schema evolution and migration, application evolution, data warehousing, database design, web site creation and management, and component-based development.
- Currently, schema matching is typically performed manually, perhaps supported by a graphical user interface.
- This section illustrates both the complexity of the problem and (at least part of) the solution space.
- Section 9 is a literature review, which describes some integrated solutions and how they fit in their classification.
2.1. Schema integration
- Most work on schema match has been motivated by schema integration, a problem that has been investigated since the early 1980s: Given a set of independently developed schemas, construct a global view [BLN86, EP90, SL90, PS98].
- In an artificial intelligence setting, this is the problem of integrating independently developed ontologies into a single ontology.
- It also occurs even if they model the same real world domain, just because they were developed by different people in different real-world contexts.
- Thus, a first step in integrating the schemas is to identify and characterize these interschema relationships.
- Again, this requires reconciling the structure and terminology of the two schemas, which involves schema matching.
2.2. Data warehouses
- A variation of the schema integration problem that became popular in the 1990s is that of integrating data sources into a data warehouse.
- The extraction process requires transforming data from the source format into the warehouse format.
- As shown in [BR00], the match operation is useful for designing transformations.
- After an initial mapping is created, the data warehouse designer needs to examine the detailed semantics of each source element and create transformations that reconcile those semantics with those of the target.
- First, the common elements of S′ and S are found (a match operation) and then S⇒W is reused for those common elements.
2.3. E-commerce
- In the current decade, E-commerce has led to a new motivation for schema matching: message translation.
- Trading partners frequently exchange messages that describe business transactions.
- Fields are grouped into structures that also may differ between the two formats.
- Translating between different message schemas is, in part, a schema matching problem.
- Today, application designers need to specify manually how message formats are related.
2.4. Semantic query processing
- Schema integration, data warehousing, and E-commerce are all similar in that they involve the design-time analysis of schemas to produce mappings and, possibly, an integrated schema.
- A somewhat different scenario is semantic query processing – a run-time scenario where a user specifies the output of a query (e.g., the SELECT clause in SQL), and the system figures out how to produce that output (e.g., by determining the FROM and WHERE clauses in SQL).
- The user’s specification is stated in terms of concepts familiar to her, which may not be the same as the names of elements specified in the database schema.
- Therefore, in the first phase of processing the query, the system must map the user-specified concepts in the query output to schema elements.
- Techniques for deriving this qualification have been developed over the past 20 years [MRSS82, KKFG84, WS90, RYAC00].
3. The match operator
- To define the match operator, Match, the authors need to choose a representation for its input schemas and output mapping.
- Each mapping element of the match result specifies that certain elements of schema S1 logically correspond to, i.e., match, certain elements of S2, where the semantics of this correspondence is expressed by the mapping element’s mapping expression.
- A complete specification of the result of the invocation of Match would also include the mapping expression of each element, that is “Cust.C# = Customer.CustID”, “Cust.
- The similarity of Match and Join extends to OuterMatch operations, which are useful counterparts to Match in much the same way that OuterJoin is a counterpart to Join.
- A right (or left) OuterMatch ensures that every element of S2 (or S1) is referenced by the mapping.
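The mapping representation and the OuterMatch idea above can be sketched as follows. This is a minimal illustration, not the authors' definition: the `MappingElement` class, the `right_outer_match` helper, the choice to pad unmatched elements with an empty S1 side, and the sample S2 element names other than `Customer.CustID` are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MappingElement:
    """One element of a match result (hypothetical representation)."""
    s1_elements: tuple
    s2_elements: tuple
    expression: str = ""  # left empty when an OuterMatch pads an unmatched element


def right_outer_match(mapping, s2_elements):
    """Pad a match result so that every S2 element is referenced by the mapping."""
    referenced = {e for m in mapping for e in m.s2_elements}
    padding = [
        MappingElement(s1_elements=(), s2_elements=(e,))
        for e in s2_elements
        if e not in referenced
    ]
    return list(mapping) + padding


# Assumed sample data: only the first mapping expression appears in the text.
s2 = ["Customer.CustID", "Customer.Company", "Customer.Phone"]
match_result = [
    MappingElement(("Cust.C#",), ("Customer.CustID",), "Cust.C# = Customer.CustID"),
]
outer = right_outer_match(match_result, s2)
```

A left OuterMatch would be the mirror image, padding the mapping with the S1 elements that no mapping element references.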
4. Architecture for generic match
- XML schema editors, portal development kits, database modeling tools and the like may access libraries to select existing schemas, shown in the lower left of Fig. 1.
- This uniform representation significantly reduces the complexity of Match by not having to deal with the large number of different representations of schemas.
- Tools that are tightly integrated with the framework can work directly on the internal representation.
- The implementation of Match should therefore only determine match candidates, which the user can accept, reject or change.
5. Classification of schema matching approaches
- In this section the authors classify the major approaches to schema matching.
- Fig. 2 shows part of their classification scheme together with some sample approaches.
- An implementation of Match may use multiple match algorithms or matchers.
- For individual matchers, the authors consider the following largely-orthogonal classification criteria: Instance vs schema: matching approaches can consider instance data (i.e., data contents) or only schema-level information.
- In addition, each mapping element may interrelate one or more elements of the two schemas.
6. Schema-level matchers
- Schema-level matchers only consider schema information, not instance data.
- The available information includes the usual properties of schema elements, such as name, description, data type, relationship types (part-of, is-a, etc.), constraints, and schema structure.
- In general, a matcher will find multiple match candidates.
- For each candidate, it is customary to estimate the degree of similarity by a normalized numeric value in the range 0–1, in order to identify the best match candidates (as in [PSU98, BCV99, DDL00, CDD01]).
- Then the authors cover linguistic and constraint-based matchers.
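As a small illustration of ranking match candidates by a normalized similarity value, the sketch below keeps the best candidates for one schema element. The function name `best_candidates`, the threshold, the `top_k` cut-off, and the sample scores are assumptions for this example and do not come from the survey.

```python
def best_candidates(scored, threshold=0.5, top_k=3):
    """Return up to top_k (element, similarity) pairs at or above threshold, best first."""
    for _, sim in scored:
        if not 0.0 <= sim <= 1.0:
            raise ValueError("similarity must be normalized to the range 0-1")
    kept = [(elem, sim) for elem, sim in scored if sim >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)[:top_k]


# Hypothetical similarity scores for one S1 element against three S2 elements.
candidates = [
    ("Customer.CustID", 0.9),
    ("Customer.Company", 0.2),
    ("Customer.Contact", 0.55),
]
ranked = best_candidates(candidates)
```

Returning a short ranked list rather than a single winner fits the idea that Match proposes candidates for a user to accept, reject, or change.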
6.1. Granularity of match (element-level vs structure-level)
- The authors distinguish two main alternatives for the granularity of Match, element-level and structure-level matching.
- For each element of the first schema, element-level matching determines the matching elements in the second input schema.
- Structure-level matching, on the other hand, refers to matching combinations of elements that appear together in a structure.
- The fact that the elements “Address” and “CustomerAddress” in Table 2 are likely to match can be derived by a name-based element-level matching without considering their underlying components.
- Element-level matching can be implemented by algorithms similar to relational join processing.
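The analogy with relational join processing can be sketched as a nested-loop "similarity join" over element names. This is an illustrative sketch only: it uses Python's standard `difflib.SequenceMatcher` as a stand-in string similarity (not a measure the survey prescribes), and the threshold and the element lists other than "Address" and "CustomerAddress" are assumptions.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Case-insensitive string similarity in the range 0-1 (stand-in measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def element_level_match(s1, s2, threshold=0.6):
    """Nested-loop join: pair every S1 element with every S2 element
    and keep the pairs whose name similarity reaches the threshold."""
    pairs = []
    for e1 in s1:
        for e2 in s2:
            sim = name_similarity(e1, e2)
            if sim >= threshold:
                pairs.append((e1, e2, round(sim, 2)))
    return pairs


# Hypothetical element names; only Address/CustomerAddress come from the text.
s1_names = ["Address", "Phone"]
s2_names = ["CustomerAddress", "Telephone", "Birthdate"]
pairs = element_level_match(s1_names, s2_names)
```

As with joins, the nested loop could be replaced by an index or hash-based strategy when the join predicate is plain name equality.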
6.2. Match cardinality
- An S1 (or S2) element can participate in zero, one or many mapping elements of the match result between the two input schemas S1 and S2.
- Thus, the authors have the usual relationship cardinalities, namely 1:1 and the set-oriented cases 1:n, n:1, and n:m, between matching elements both with respect to different mapping elements (global cardinality) and with respect to an individual mapping element (local cardinality).
- Obtaining n:m mapping elements usually requires considering the structural embedding of the schema elements and thus requires structure-level matching.
- Row 3 explains how FirstName and LastName are extracted from Name.
- For the first three examples in Table 3, one S1 instance is matched with one S2 instance (1:1 instance-level match).
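The local cardinality of a single mapping element, and a 1:n mapping expression such as the Name to FirstName/LastName case, can be sketched as below. The helper names and the whitespace-based splitting rule are assumptions for illustration; the survey does not fix how the extraction is performed.

```python
def local_cardinality(s1_elements, s2_elements):
    """Classify one mapping element by how many elements it relates on each side."""
    many_left = len(s1_elements) > 1
    many_right = len(s2_elements) > 1
    if many_left and many_right:
        return "n:m"
    if many_left:
        return "n:1"
    if many_right:
        return "1:n"
    return "1:1"


def split_name(name):
    """Assumed mapping expression: split a full name at the first space."""
    first, _, last = name.strip().partition(" ")
    return {"FirstName": first, "LastName": last}


cardinality = local_cardinality(["Name"], ["FirstName", "LastName"])
extracted = split_name("Ada Lovelace")
```

Note that the schema-level cardinality here is 1:n, while each Name instance still maps to exactly one FirstName/LastName instance pair.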
6.3. Linguistic approaches
- Language-based or linguistic matchers use names and text (i.e., words or sentences) to find semantically similar schema elements.
- Name matching: Name-based matching matches schema elements with equal or similar names.
- Similarity of names can be defined and measured in various ways, including: Equality of names.
- Homonyms are equal or similar names that refer to different elements.
- Name-based matching is not limited to finding 1:1 matches.
D (name1, name2,
- This assumes that D contains all relevant pairs of the transitive closure over similar names.
- Intuitively, the authors would expect the similarity value σ to be 0.9 × 0.8 = 0.72, but this depends on the type of similarity, the use of homonyms, and perhaps other factors.
- These comments can also be evaluated linguistically to determine the similarity between schema elements.
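The thesaurus-based lookup and the 0.9 × 0.8 = 0.72 intuition can be sketched as follows. This is one possible reading, stated as an assumption: similarities along a two-step path are multiplied, and the best of the direct and derived values is kept. The entry names in `D` are hypothetical, and as the text notes, multiplication is not appropriate for every type of similarity.

```python
def lookup(d, a, b):
    """Direct similarity from the dictionary, treated as symmetric."""
    if a == b:
        return 1.0
    return d.get((a, b), d.get((b, a), 0.0))


def transitive_similarity(d, a, c):
    """Best of the direct entry and any two-step path a -> b -> c,
    where a path's similarity is the product of its two steps (an assumption)."""
    names = {n for pair in d for n in pair}
    via = max((lookup(d, a, b) * lookup(d, b, c) for b in names), default=0.0)
    return max(lookup(d, a, c), via)


# Hypothetical dictionary entries (name1, name2) -> similarity.
D = {("ship", "deliver"): 0.9, ("deliver", "send"): 0.8}
sigma = transitive_similarity(D, "ship", "send")  # derived from 0.9 and 0.8
```

If D already held the transitive closure, the `lookup` call alone would suffice and the derived value would never be needed.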
6.4. Constraint-based approaches
- Schemas often contain constraints to define data types and value ranges, uniqueness, optionality, relationship types and cardinalities, etc. Certain structural information can be interpreted as constraints, such as intra-schema references (e.g., foreign keys) and adjacency-related information (e.g., part-of relationships).
- When performing a match based on hierarchical structures, an algorithm can traverse the structure either top-down or bottom-up.
- This allows us to determine the correct n:m SQL-like match mapping S2.
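A constraint-based comparison of two elements can be sketched as an average over a few constraint checks. Everything specific here is an assumption for illustration: the compatibility table, its values, the three constraints chosen (data type, uniqueness, optionality), and the equal weighting.

```python
# Assumed compatibility scores between distinct data types.
TYPE_COMPATIBILITY = {
    frozenset(("int", "decimal")): 0.8,
    frozenset(("char", "varchar")): 0.9,
}


def type_similarity(t1, t2):
    """1.0 for identical types, a table value for compatible ones, else 0.0."""
    if t1 == t2:
        return 1.0
    return TYPE_COMPATIBILITY.get(frozenset((t1, t2)), 0.0)


def constraint_similarity(e1, e2):
    """Average agreement on data type, uniqueness, and optionality."""
    scores = [
        type_similarity(e1["type"], e2["type"]),
        1.0 if e1["unique"] == e2["unique"] else 0.0,
        1.0 if e1["nullable"] == e2["nullable"] else 0.0,
    ]
    return sum(scores) / len(scores)


# Hypothetical element descriptions.
cust_no = {"type": "int", "unique": True, "nullable": False}
cust_id = {"type": "decimal", "unique": True, "nullable": False}
comment = {"type": "varchar", "unique": False, "nullable": True}
sim_key = constraint_similarity(cust_no, cust_id)
sim_other = constraint_similarity(cust_no, comment)
```

Constraints alone rarely single out one match, so a score like this is typically useful for narrowing the candidates produced by another matcher.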
6.5. Reusing schema and mapping information
- The authors have already discussed the use of auxiliary information in addition to the input schemas, such as dictionaries, thesauri, and user-provided match or mismatch information.
- Another way to use auxiliary information to improve the effectiveness of Match is to support and exploit the reuse of common schema components and previously determined mappings.
- The authors also want to reuse entire structures, which is useful when matching different but similar schemas to the same destination schema, as may occur when integrating new sources into a data warehouse or digital library.
- The authors already have the match result between S and S2, illustrated by the arrows.
- Salary and Income may be considered identical in a payroll application but not in a tax reporting application.
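Reusing an existing match result can be sketched as composing two mappings through the shared schema S. The dictionary representation, the product rule for combined similarity, and the element names are assumptions for this example. Because matches are not always transitive (Salary and Income above), the output should be read as candidates to verify, not as confirmed matches.

```python
def compose(match_s1_s, match_s_s2):
    """Derive S1-S2 match candidates from Match(S1, S) and Match(S, S2)."""
    candidates = {}
    for (e1, e_mid), sim1 in match_s1_s.items():
        for (e, e2), sim2 in match_s_s2.items():
            if e == e_mid:
                key = (e1, e2)
                # Keep the strongest path when several intermediate elements agree.
                candidates[key] = max(candidates.get(key, 0.0), sim1 * sim2)
    return candidates


# Hypothetical mappings: (element, element) -> similarity.
match_s1_s = {("S1.Salary", "S.Salary"): 1.0, ("S1.Dept", "S.Department"): 1.0}
match_s_s2 = {("S.Salary", "S2.Income"): 0.9}
reused = compose(match_s1_s, match_s_s2)
```

Elements of S1 with no counterpart in S, such as `S1.Dept` here, still need to be matched against S2 by other means.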
7. Instance-level approaches
- Instance-level data can give important insight into the contents and meaning of schema elements.
- It can help disambiguate between equally plausible schema-level matches by choosing to match the elements whose instances are more similar.
- The main benefit of evaluating instances is a precise characterization of the actual contents of schema elements.
- Then, the S2 instances are matched one-by-one against the characterizations of S1 elements.
- Instance-level matching can also be performed by utilizing auxiliary information, e.g., previous mappings obtained from matching different schemas.
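The two-step procedure (characterize S1 elements from their instances, then match S2 instances one by one and merge the results) can be sketched as below. The character-shape signature used as the characterization is an assumption chosen for brevity; keyword frequencies or learned classifiers would fill the same role. The element names and sample values are hypothetical.

```python
def shape(value):
    """Reduce a value to a pattern: digits become '9', letters become 'a'."""
    return "".join("9" if c.isdigit() else "a" if c.isalpha() else c for c in value)


def characterize(instances):
    """Characterization of an element: the set of shapes its instances take."""
    return {shape(v) for v in instances}


def instance_level_match(s1_data, s2_instances):
    """Rank S1 elements by the fraction of S2 instances fitting their characterization."""
    profiles = {elem: characterize(vals) for elem, vals in s1_data.items()}
    votes = {elem: 0 for elem in profiles}
    for value in s2_instances:  # match S2 instances one by one
        for elem, profile in profiles.items():
            if shape(value) in profile:
                votes[elem] += 1
    total = len(s2_instances) or 1
    scored = [(elem, n / total) for elem, n in votes.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


s1_data = {
    "Cust.Phone": ["555-1234", "555-9876"],
    "Cust.Zip": ["98052", "10001"],
}
ranking = instance_level_match(s1_data, ["555-0000", "555-4242"])
```

The final sort is the merge step: per-instance results are abstracted into one ranked list of S1 candidates for the S2 element.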
8. Combining different matchers
- The authors have reviewed several types of matchers and many different variations.
- Structure-level matching also benefits from being combined with other approaches such as name matching.
- On the other hand, one can use a composite matcher that combines the results of several independently executed matchers, including hybrid matchers.
- Selection of matchers, and determining their execution order and the combination of independently determined match results can be done either automatically by the implementation of Match itself or its clients (e.g., tools), or manually by a human user.
- An automatic approach can reduce the number of user interactions, but it is difficult to achieve a generic solution that is adaptable to different application domains (although the approach could be controlled by tuning parameters).
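A composite matcher can be sketched as a function that runs several independent matchers and merges their scores. The weighted average used here is only one possible combination rule, and the two toy matchers, the weights (standing in for tuning parameters), and the element dictionaries are assumptions for illustration.

```python
def name_matcher(e1, e2):
    """Toy matcher: 1.0 when names are equal ignoring case, else 0.0."""
    return 1.0 if e1["name"].lower() == e2["name"].lower() else 0.0


def type_matcher(e1, e2):
    """Toy matcher: 1.0 when data types are identical, else 0.0."""
    return 1.0 if e1["type"] == e2["type"] else 0.0


def composite_match(matchers, weights, e1, e2):
    """Combine independently computed similarities by a weighted average."""
    if len(matchers) != len(weights) or sum(weights) <= 0:
        raise ValueError("need one positive-sum weight per matcher")
    total = sum(weights)
    return sum(w * m(e1, e2) for m, w in zip(matchers, weights)) / total


a = {"name": "CustID", "type": "int"}
b = {"name": "custid", "type": "int"}
c = {"name": "Contact", "type": "int"}
matchers = [name_matcher, type_matcher]
weights = [3, 1]  # assumed tuning parameters
same = composite_match(matchers, weights, a, b)
different = composite_match(matchers, weights, a, c)
```

Because each matcher runs independently, matchers can be added, removed, or reweighted per application domain without changing the others, which is the flexibility a hybrid matcher lacks.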
9.1. Prototype schema matchers
- In Table 5 the authors show how seven published prototype implementations fit the classification criteria introduced in Sect. 5.
- The table thus indicates which part of the solution space is covered by which implementations, thereby supporting a comparison of the approaches.
- The table shows that all systems support multiple matching criteria, six in the form of a hybrid matcher and only one, LSD, by a composite match approach.
- A global matcher that uses the same machine-learning technology is used to merge the lists into a combined list of match candidates for each schema element.
- It computes matches by a weighted sum of name and data type affinity and structural affinity.
10. Conclusion
- Schema matching is a basic problem in many database application domains, such as heterogeneous database integration, E-commerce, data warehousing, and semantic query processing.
- The authors hope that the taxonomy will be useful to programmers who need to implement a match algorithm and to researchers looking to develop more effective and comprehensive schema matching algorithms.
- More attention should be given to the utilization of instance-level information and reuse opportunities to perform Match.
- The authors are grateful for many helpful suggestions from Sonia Bergamaschi, Silvana Castano, Chris Clifton, Hai Hong Do, An Hai Doan, Alon Halevy, Jayant Madhavan, Sergey Melnik, Renée Miller, Rachel Pottinger, Arnie Rosenthal, Dennis Shasha, and the anonymous referees.
Frequently Asked Questions (11)
Q2. What have the authors stated for future works in "A survey of approaches to automatic schema matching" ?
The authors hope that the taxonomy will be useful to programmers who need to implement a match algorithm and to researchers looking to develop more effective and comprehensive schema matching algorithms. In the future, the authors would like to see quantitative work on the relative performance and accuracy of different approaches. Such results could tell us which of the existing approaches dominate the others and could help identify weaknesses in the existing approaches that suggest opportunities for future research. Since the problem is so fundamental, the authors believe the field would benefit from treating it as an independent problem, as they have begun doing here.
Q3. What is the role of match in various applications?
Match plays a central role in numerous applications, such as web-oriented data integration, electronic commerce, schema integration, schema evolution and migration, application evolution, data warehousing, database design, web site creation and management, and component-based development.
Q4. What is the way to use auxiliary information to improve the effectiveness of Match?
Another way to use auxiliary information to improve the effectiveness of Match is to support and exploit the reuse of common schema components and previously determined mappings.
Q5. What is the level of effort required to perform a match?
The level of effort is at least linear in the number of matches to be performed, maybe worse than linear if one needs to evaluate each match in the context of other possible matches of the same elements.
Q6. Why do the authors think the field would benefit from treating it as an independent problem?
Since the problem is so fundamental, the authors believe the field would benefit from treating it as an independent problem, as the authors have begun doing here.
Q7. What are the main classification criteria for a matcher?
For individual matchers, the authors consider the following largely-orthogonal classification criteria: • Instance vs schema: matching approaches can consider instance data (i.e., data contents) or only schema-level information.
Q8. What is the way to combine structure- with element-level matching?
One way to combine structure- with element-level matching is to use one algorithm to generate a partial mapping and the other to complete the mapping.
Q9. What is the way to simplify the automatic generation of match candidates?
If S1 is more similar to S than to S2, this can simplify the automatic generation of match candidates by reusing matches from the existing result of Match(S, S2), although some care is needed since matches are sometimes not transitive.
Q10. What is the process of generating a list of match candidates in S2?
The per-instance match results need to be merged and abstracted to the schema level, to generate a ranked list of match candidates in S1 for each (schema-level) element in S2.
Q11. What is the way to deal with input schemas?
General natural language dictionaries may be useful, perhaps even multi-language dictionaries (e.g., English-German) to deal with input schemas of different languages.