Entity matching (EM) finds data instances that refer to the same real-world entity. In this paper we examine applying deep learning (DL) to EM, to understand DL's benefits and limitations. We review many DL solutions that have been developed for related matching tasks in text processing (e.g., entity linking, textual entailment, etc.). We categorize these solutions and define a space of DL solutions for EM, as embodied by four solutions with varying representational power: SIF, RNN, Attention, and Hybrid. Next, we investigate the types of EM problems for which DL can be helpful. We consider three such problem types, which match structured data instances, textual instances, and dirty instances, respectively. We empirically compare the above four DL solutions with Magellan, a state-of-the-art learning-based EM solution. The results show that DL does not outperform current solutions on structured EM, but it can significantly outperform them on textual and dirty EM. For practitioners, this suggests that they should seriously consider using DL for textual and dirty EM problems. Finally, we analyze DL's performance and discuss future research directions.

Deep Learning for Entity Matching: A Design Space Exploration

Active learning (AL) attempts to maximize the performance gain of the model by marking the fewest samples. Deep learning (DL) is greedy for data and requires a large amount of data supply to optimize massive parameters, so that the model learns how to extract high-quality features. In recent years, due to the rapid development of internet technology, we are in an era of information torrents and we have massive amounts of data. In this way, DL has aroused strong interest of researchers and has been rapidly developed. Compared with DL, researchers have relatively low interest in AL. This is mainly because before the rise of DL, traditional machine learning requires relatively few labeled samples. Therefore, early AL is difficult to reflect the value it deserves. Although DL has made breakthroughs in various fields, most of this success is due to the publicity of the large number of existing annotation datasets. However, the acquisition of a large number of high-quality annotated datasets consumes a lot of manpower, which is not allowed in some fields that require high expertise, especially in the fields of speech recognition, information extraction, medical images, etc. Therefore, AL has gradually received due attention. A natural idea is whether AL can be used to reduce the cost of sample annotations, while retaining the powerful learning capabilities of DL. Therefore, deep active learning (DAL) has emerged. Although the related research has been quite abundant, it lacks a comprehensive survey of DAL. This article is to fill this gap, we provide a formal classification method for the existing work, and a comprehensive and systematic overview. In addition, we also analyzed and summarized the development of DAL from the perspective of application. Finally, we discussed the confusion and problems in DAL, and gave some possible development directions for DAL.

A Survey of Deep Active Learning

Overcoming catastrophic forgetting in neural networks

Entity matching (EM) has been a long-standing challenge in data management Most current EM works focus only on developing matching algorithms We argue that far more efforts should be devoted to building EM systems We discuss the limitations of current EM systems, then present as a solution Magellan, a new kind of EM systems Magellan is novel in four important aspects (1) It provides how-to guides that tell users what to do in each EM scenario, step by step (2) It provides tools to help users do these steps; the tools seek to cover the entire EM pipeline, not just matching and blocking as current EM systems do (3) Tools are built on top of the data analysis and Big Data stacks in Python, allowing Magellan to borrow a rich set of capabilities in data cleaning, IE, visualization, learning, etc (4) Magellan provides a powerful scripting environment to facilitate interactive experimentation and quick "patching" of the system We describe research challenges raised by Magellan, then present extensive experiments with 44 students and users at several organizations that show the promise of the Magellan approach

/pdf/magellan-toward-building-entity-matching-management-systems-304l3jt02a.pdf

Magellan: toward building entity matching management systems

Despite the efforts in 70+ years in all aspects of entity resolution (ER), there is still a high demand for democratizing ER - by reducing the heavy human involvement in labeling data, performing feature engineering, tuning parameters, and defining blocking functions. With the recent advances in deep learning, in particular distributed representations of words (a.k.a. word embeddings), we present a novel ER system, called DeepER, that achieves good accuracy, high efficiency, as well as ease-of-use (i.e., much less human efforts). We use sophisticated composition methods, namely uni- and bi-directional recurrent neural networks (RNNs) with long short term memory (LSTM) hidden units, to convert each tuple to a distributed representation (i.e., a vector), which can in turn be used to effectively capture similarities between tuples. We consider both the case where pre-trained word embeddings are available as well the case where they are not; we present ways to learn and tune the distributed representations that are customized for a specific ER task under different scenarios. We propose a locality sensitive hashing (LSH) based blocking approach that takes all attributes of a tuple into consideration and produces much smaller blocks, compared with traditional methods that consider only a few attributes. We evaluate our algorithms on multiple datasets (including benchmarks, biomedical data, as well as multi-lingual data) and the extensive experimental results show that DeepER outperforms existing solutions.

/pdf/distributed-representations-of-tuples-for-entity-resolution-36u194h8rt.pdf

Distributed representations of tuples for entity resolution

Entity matching (EM) has been a long-standing challenge in data management. Most current EM works, however, focus only on developing matching algorithms. We argue that far more efforts should be devoted to building EM systems. We discuss the limitations of current EM systems, then present Magellan, a new kind of EM systems that addresses these limitations. Magellan is novel in four important aspects. (1) It provides a how-to guide that tells users what to do in each EM scenario, step by step. (2) It provides tools to help users do these steps; the tools seek to cover the entire EM pipeline, not just matching and blocking as current EM systems do. (3) Tools are built on top of the data science stacks in Python, allowing Magellan to borrow a rich set of capabilities in data cleaning, IE, visualization, learning, etc. (4) Magellan provide a powerful scripting environment to facilitate interactive experimentation and allow users to quickly write code to "patch" the system. We have extensively evaluated Magellan with 44 students and users at various organizations. In this paper we propose demonstration scenarios that show the promise of the Magellan approach.

/pdf/magellan-toward-building-entity-matching-management-systems-59gyu6iooz.pdf

Magellan: toward building entity matching management systems over data science stacks

The first step to deal with the significant issue of air pollution in China and elsewhere in the world is to monitor it. While more physical monitoring stations are built, current coverage is limited to large cities with most other places under-monitored. In this paper we propose a complementary approach to monitor Air Quality Index (AQI): using machine learning models to estimate AQI from social media posts. We propose a series of progressively more sophisticated machine learning models, culminating in a Markov Random Field model that utilizes the text content in social media as well as the spatiotemporal correlation among cities and days. Our extensive experiments on Sina Weibo data from 108 cities during a one-month period demonstrate the accurate AQI prediction performance of our approach.

Inferring air pollution by sniffing social media

Entity matching (EM) has been a long-standing challenge in data management. In the past few years we have started two major projects on EM (Magellan and Corleone/Falcon). These projects have raised many human-in-the-loop (HIL) challenges. In this paper we discuss these challenges. In particular, we show how these challenges forced us to revise our solution architecture, from a typical RDBMS-style architecture to a very human-centric one, in which human users are first-class objects driving the EM process, using tools at pain-point places. We discuss how such solution architectures can be viewed as combining "tools in the loop" with "human in the loop". Finally, we discuss lessons learned which can potentially be applied to other problem settings. We also hope that more researchers will investigate EM, as it can be a rich "playground" for HIL research.

Han Li

Papers

Deep Learning for Entity Matching: A Design Space Exploration

Magellan: toward building entity matching management systems

Magellan: toward building entity matching management systems over data science stacks

Inferring air pollution by sniffing social media

Human-in-the-Loop Challenges for Entity Matching: A Midterm Report