Data integration

About: Data integration is a research topic. Over its lifetime, 9,977 publications have been published within this topic, receiving 161,761 citations. The topic is also known as information integration.


Open access | Journal Article | DOI: 10.1007/S007780100057
Erhard Rahm, Philip A. Bernstein | Institutions (2)
01 Dec 2001
Abstract: Schema matching is a basic problem in many database application domains, such as data integration, E-business, data warehousing, and semantic query processing. In current implementations, schema matching is typically performed manually, which has significant limitations. On the other hand, previous research papers have proposed many techniques to achieve a partial automation of the match operation for specific application domains. We present a taxonomy that covers many of these existing approaches, and we describe the approaches in some detail. In particular, we distinguish between schema-level and instance-level, element-level and structure-level, and language-based and constraint-based matchers. Based on our classification, we review some previous match implementations, thereby indicating which part of the solution space they cover. We intend our taxonomy and review of past work to be useful when comparing different approaches to schema matching, when developing a new match algorithm, and when implementing a schema matching component.

  • Fig. 3. Equivalence pattern
  • Table 3. Match cardinalities (Examples)
  • Table 5. Characteristics of proposed schema match approaches
  • Table 1. Sample input schemas
  • Table 4. Constraint-based matching (example)

Topics: Schema matching (72%), Star schema (66%), Conceptual schema (62%)

3,611 Citations

Proceedings Article | DOI: 10.1145/543613.543644
Maurizio Lenzerini | Institutions (1)
03 Jun 2002
Abstract: Data integration is the problem of combining data residing at different sources, and providing the user with a unified view of these data. The problem of designing data integration systems is important in current real-world applications, and is characterized by a number of issues that are interesting from a theoretical point of view. This document presents an overview of the material to be presented in a tutorial on data integration. The tutorial is focused on some of the theoretical issues that are relevant for data integration. Special attention will be devoted to the following aspects: modeling a data integration application, processing queries in data integration, dealing with inconsistent data sources, and reasoning on queries.

Topics: Ontology-based data integration (69%), Dataspaces (67%), Data integration (64%)

2,608 Citations

Journal Article | DOI: 10.1145/1456650.1456651
Jens Bleiholder, Felix Naumann | Institutions (1)
Abstract: The development of the Internet in recent years has made it possible and useful to access many different information systems anywhere in the world to obtain information. While there is much research on the integration of heterogeneous information systems, most commercial systems stop short of the actual integration of available data. Data fusion is the process of fusing multiple records representing the same real-world object into a single, consistent, and clean representation. This article places data fusion into the greater context of data integration, precisely defines the goals of data fusion, namely, complete, concise, and consistent data, and highlights the challenges of data fusion, namely, uncertain and conflicting data values. We give an overview and classification of different ways of fusing data and present several techniques based on standard and advanced operators of the relational algebra and SQL. Finally, the article features a comprehensive survey of data integration systems from academia and industry, showing if and how data fusion is performed in each.
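
As a rough illustration of the data fusion idea described in this abstract, the following minimal Python sketch merges several records for one real-world object into a single record. The conflict-resolution rule used here (take the most frequent non-missing value per attribute, with ties going to the value seen first) is an assumption chosen for brevity, not a technique attributed to the article, and the function name `fuse_records` is hypothetical.

```python
from collections import Counter


def fuse_records(records):
    """Fuse dicts that describe the same real-world object into one dict.

    Assumed strategy (illustrative only): for each attribute, keep the most
    frequent non-missing value; ties go to the value encountered first.
    """
    # Collect attribute names in order of first appearance.
    attributes = []
    for record in records:
        for name in record:
            if name not in attributes:
                attributes.append(name)

    fused = {}
    for name in attributes:
        # Missing keys and explicit None both count as "no value".
        values = [r[name] for r in records if r.get(name) is not None]
        if not values:
            fused[name] = None  # no source supplied a value
            continue
        # most_common lists equal counts in first-encountered order.
        fused[name] = Counter(values).most_common(1)[0][0]
    return fused


records = [
    {"name": "J. Smith", "city": "Boston", "phone": None},
    {"name": "John Smith", "city": "Boston"},
    {"name": "John Smith", "city": None, "phone": "555-0100"},
]
fused = fuse_records(records)
```

In this toy input the fused record is complete (the phone number comes from the only source that has one) and consistent (the conflicting names are reduced to a single value), which mirrors two of the goals named in the abstract.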

1,775 Citations

Open access | Journal Article | DOI: 10.1109/TKDE.2007.250581
Abstract: Often, in the real world, entities have two or more representations in databases. Duplicate records do not share a common key and/or they contain errors that make duplicate matching a difficult task. Errors are introduced as the result of transcription errors, incomplete information, lack of standard formats, or any combination of these factors. In this paper, we present a thorough analysis of the literature on duplicate record detection. We cover similarity metrics that are commonly used to detect similar field entries, and we present an extensive set of duplicate detection algorithms that can detect approximately duplicate records in a database. We also cover multiple techniques for improving the efficiency and scalability of approximate duplicate detection algorithms. We conclude with coverage of existing tools and with a brief discussion of the big open problems in the area.
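
To make the notion of a field-level similarity metric concrete, here is a minimal Python sketch that uses Levenshtein edit distance, normalized by the longer string's length, to flag likely duplicate field entries. The choice of metric, the 0.8 threshold, the lowercasing step, and the function names are all assumptions made for illustration; the abstract does not state which metrics or thresholds the paper recommends. The all-pairs loop is quadratic and ignores the efficiency techniques the abstract mentions.

```python
def edit_distance(a, b):
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Map edit distance to a score in [0, 1]; 1.0 means identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def likely_duplicates(values, threshold=0.8):
    """Return pairs whose case-insensitive similarity meets the threshold."""
    pairs = []
    for i in range(len(values)):
        for j in range(i + 1, len(values)):
            score = similarity(values[i].lower(), values[j].lower())
            if score >= threshold:
                pairs.append((values[i], values[j]))
    return pairs


pairs = likely_duplicates(["Jon Smith", "John Smith", "Mary Jones"])
```

Here "Jon Smith" and "John Smith" differ by one inserted character, so they clear the assumed threshold, while "Mary Jones" does not pair with either.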

  • Fig. 1. The Pinheiro and Sun similarity metric alignment for the strings σ1 = ABcDeFgH and σ2 = AxByDzF.
Topics: Data deduplication (59%), Data integrity (56%), Record linkage (54%)

1,643 Citations

Journal Article | DOI: 10.1007/S007780100054
Alon Halevy | Institutions (1)
01 Dec 2001
Abstract: The problem of answering queries using views is to find efficient methods of answering a query using a set of previously defined materialized views over the database, rather than accessing the database relations. The problem has recently received significant attention because of its relevance to a wide variety of data management problems. In query optimization, finding a rewriting of a query using a set of materialized views can yield a more efficient query execution plan. To support the separation of the logical and physical views of data, a storage schema can be described using views over the logical schema. As a result, finding a query execution plan that accesses the storage amounts to solving the problem of answering queries using views. Finally, the problem arises in data integration systems, where data sources can be described as precomputed views over a mediated schema. This article surveys the state of the art on the problem of answering queries using views, and synthesizes the disparate works into a coherent framework. We describe the different applications of the problem, the algorithms proposed to solve it and the relevant theoretical results.

Topics: View (66%), Information schema (63%), Query optimization (62%)

1,593 Citations

Topic's top 5 most impactful authors

Sonia Bergamaschi

56 papers, 1.2K citations

Maurizio Lenzerini

49 papers, 5.3K citations

Diego Calvanese

32 papers, 1.6K citations

Norman W. Paton

32 papers, 482 citations

Erhard Rahm

29 papers, 4.6K citations

Related Topics (5)
Data management

31.5K papers, 424.3K citations

90% related
Relational database

21.7K papers, 479K citations

88% related

26.6K papers, 393.3K citations

87% related

31.9K papers, 498.3K citations

87% related
Data modeling

29.6K papers, 470.1K citations

86% related