Showing papers by "Jayant Madhavan" published in 2005


Proceedings Article
14 Jun 2005
TL;DR: This work considers complex information spaces, in which references belong to multiple related classes and each reference may have very few attribute values, and presents a reconciliation algorithm that gradually enriches references by merging attribute values.
Abstract: Reference reconciliation is the problem of identifying when different references (i.e., sets of attribute values) in a dataset correspond to the same real-world entity. Most previous literature assumed references to a single class that had a fair number of attributes (e.g., research publications). We consider complex information spaces: our references belong to multiple related classes and each reference may have very few attribute values. A prime example of such a space is Personal Information Management, where the goal is to provide a coherent view of all the information on one's desktop. Our reconciliation algorithm has three principal features. First, we exploit the associations between references to design new methods for reference comparison. Second, we propagate information between reconciliation decisions to accumulate positive and negative evidence. Third, we gradually enrich references by merging attribute values. Our experiments show that (1) we considerably improve precision and recall over standard methods on a diverse set of personal information datasets, and (2) there are advantages to using our algorithm even on a standard citation dataset benchmark.
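
The third feature in the abstract above, enriching references by merging attribute values, can be illustrated with a minimal Python sketch. This is not the authors' algorithm: it replaces their comparison methods with a simple assumed rule (two references match when at least `min_shared` attributes have a value in common), and it leaves out association-based comparison and evidence propagation entirely. The function names, the `min_shared` parameter, and the sample data are all hypothetical.

```python
from itertools import combinations


def shared_attributes(a, b):
    """Count attributes for which two references have at least one common value."""
    return sum(1 for key in a.keys() & b.keys() if a[key] & b[key])


def merge(a, b):
    """Enrich: the merged reference carries the union of both value sets."""
    return {key: a.get(key, set()) | b.get(key, set()) for key in a.keys() | b.keys()}


def reconcile(references, min_shared=2):
    """Merge matching references repeatedly until no pair matches any more.

    Assumed matching rule (not from the paper): two references match when
    at least `min_shared` attributes share a value.
    """
    clusters = [{k: set(v) for k, v in ref.items()} for ref in references]
    members = [[i] for i in range(len(references))]
    changed = True
    while changed:
        changed = False
        for i, j in combinations(range(len(clusters)), 2):
            if shared_attributes(clusters[i], clusters[j]) >= min_shared:
                clusters[i] = merge(clusters[i], clusters[j])
                members[i] += members[j]
                del clusters[j], members[j]
                changed = True
                break  # indices shifted, so restart the scan
    return sorted(sorted(m) for m in members)


# Hypothetical person references with few attribute values each.
refs = [
    {"name": {"Jane Doe"}, "email": {"jd@a.edu"}, "phone": {"555-0100"}},
    {"name": {"Jane Doe"}, "email": {"jd@a.edu"}, "office": {"Room 12"}},
    {"phone": {"555-0100"}, "office": {"Room 12"}},
    {"name": {"John Roe"}, "email": {"jr@b.org"}},
]
print(reconcile(refs))
```

In this sample, the third reference shares only one attribute with each of the first two, so it matches neither on its own. Once the first two are merged, the enriched reference holds both the phone and the office value, and the third reference can be reconciled with it.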

595 citations


Proceedings Article
05 Apr 2005
TL;DR: This paper shows how a corpus of schemas and mappings can be used to augment the evidence about the schemas being matched so that they can be matched better, and presents experimental results demonstrating that corpus-based matching outperforms direct matching in multiple domains.
Abstract: Schema matching is the problem of identifying corresponding elements in different schemas. Discovering these correspondences or matches is inherently difficult to automate. Past solutions have proposed a principled combination of multiple algorithms. However, these solutions sometimes perform rather poorly due to the lack of sufficient evidence in the schemas being matched. In this paper we show how a corpus of schemas and mappings can be used to augment the evidence about the schemas being matched, so they can be matched better. Such a corpus typically contains multiple schemas that model similar concepts and hence enables us to learn variations in the elements and their properties. We exploit such a corpus in two ways. First, we increase the evidence about each element being matched by including evidence from similar elements in the corpus. Second, we learn statistics about elements and their relationships and use them to infer constraints that we use to prune candidate mappings. We also describe how to use known mappings to learn the importance of domain and generic constraints. We present experimental results that demonstrate corpus-based matching outperforms direct matching (without the benefit of a corpus) in multiple domains.
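
The first use of the corpus described above, adding evidence from similar corpus elements to each element being matched, can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the paper's method: evidence is reduced to name tokens, similarity is plain Jaccard overlap, and the corpus is a hand-written list of token sets. All function names, the `threshold` value, and the sample corpus are hypothetical.

```python
def tokens(name):
    """Split an element name such as 'contact_phone' into lowercase tokens."""
    return set(name.lower().split("_"))


def jaccard(a, b):
    """Jaccard overlap of two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def augment(evidence, corpus, threshold=0.15):
    """Add the tokens of every corpus entry that resembles the element.

    `corpus` is assumed to be a list of token sets, each pooling the names
    seen for one concept across previously known schemas and mappings.
    """
    enriched = set(evidence)
    for entry in corpus:
        if jaccard(evidence, entry) >= threshold:
            enriched |= entry
    return enriched


def direct_score(name_a, name_b):
    """Direct matching: compare the two element names only."""
    return jaccard(tokens(name_a), tokens(name_b))


def corpus_score(name_a, name_b, corpus):
    """Corpus-based matching: compare the corpus-augmented evidence."""
    return jaccard(augment(tokens(name_a), corpus), augment(tokens(name_b), corpus))


# Hypothetical corpus: each set pools names used for one concept.
corpus = [
    {"phone", "tel", "telephone", "number"},
    {"street", "addr", "address"},
]
print(direct_score("contact_phone", "tel_no"))
print(corpus_score("contact_phone", "tel_no", corpus))
```

With this sample corpus, `contact_phone` and `tel_no` have no tokens in common, so direct matching scores them at zero, while the augmented token sets overlap substantially. The constraint learning and pruning described in the abstract are outside the scope of this sketch.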

400 citations


Proceedings Article
14 Jun 2005
TL;DR: The explosion of information available in digital form has made search a hot research topic for the Information Management Community, but individual computer users have developed their own vast collections of data on their desktops, and these collections are in critical need of good search and query tools.
Abstract: The explosion of information available in digital form has made search a hot research topic for the Information Management Community. While most of the research on search is focused on the WWW, individual computer users have developed their own vast collections of data on their desktops, and these collections are in critical need of good search and query tools. The problem is exacerbated by the proliferation of varied electronic devices (laptops, PDAs, cellphones) that are at our disposal, which often hold subsets or variations of our data. In fact, several recent venues have noted Personal Information Management (PIM) as an area of growing interest to the data management community [1, 8, 6].

99 citations


01 Jan 2005
TL;DR: This dissertation studies the problem of constructing semantic mappings, i.e., expressions that relate different schemas, and describes different ways in which known schemas and mappings within the same domain can be used to enhance the matching performance of a conventional schema matcher.
Abstract: This dissertation studies the problem of constructing semantic mappings, i.e., expressions that relate different schemas. Semantic mappings play an important role in modern information systems. They let applications relate data in different sources and thus enable them to fruitfully leverage information residing in various forms across multiple data sources. With the proliferation of information systems that adopt distributed and often heterogeneous architectures, there is a need for automated support and tools to better facilitate the construction of semantic mappings. Our thesis is that, given a mapping construction task, knowledge that exists in other schemas and in previously known mappings related to the task can be exploited to construct the required mapping. Our hypothesis is based on two intuitions: mapping construction tasks are often repetitive in nature, and different schemas within a domain are alternate representations of the same underlying entities. We demonstrate this in the context of two problems that are crucial to mapping construction. First, we address the problem of schema matching, i.e., the problem of identifying semantically similar elements in different schemas. We first describe techniques to match schema elements based on their names, data instances, relationships with other elements, and known desirable properties of mappings. Then, given a new mapping task, we describe different ways in which known schemas and mappings within the same domain can be used to enhance the matching performance of a conventional schema matcher. Our approach is called corpus-based matching, and our experimental results demonstrate its improved performance over direct matching techniques. Second, we address the problem of mapping composition, i.e., the problem of constructing a direct mapping between two schemas from existing mappings to a common schema. We show that the composition problem can have very surprising consequences in the case of commonly used GLAV mappings. Specifically, the composition of mappings with finite formulas can in fact have infinite formulas. We further describe how and when compositions, even when infinite, can be encoded in a finite representation. Finally, we show how known query answering algorithms can be extended to handle compositions with infinite formulas.

2 citations