
Showing papers by Jiawei Han published in 2015


Journal ArticleDOI
TL;DR: A thorough overview and analysis of the main approaches to entity linking is presented, and various applications, the evaluation of entity linking systems, and future directions are discussed.
Abstract: The large number of potential applications that arise from bridging web data with knowledge bases has led to increased research on entity linking, the task of linking entity mentions in text with their corresponding entities in a knowledge base. Potential applications include information extraction, information retrieval, and knowledge base population. However, this task is challenging due to name variations and entity ambiguity. In this survey, we present a thorough overview and analysis of the main approaches to entity linking, and discuss various applications, the evaluation of entity linking systems, and future directions.

702 citations


Posted Content
TL;DR: For the same object, there usually exist conflicts among information collected from multiple sources; this survey reviews the truth discovery methods proposed to resolve such conflicts across various scenarios and application domains.
Abstract: Thanks to the information explosion, data for the objects of interest can be collected from increasingly many sources. However, for the same object, there usually exist conflicts among the collected multi-source information. To tackle this challenge, truth discovery, which integrates multi-source noisy information by estimating the reliability of each source, has emerged as a hot topic. Several truth discovery methods have been proposed for various scenarios, and they have been successfully applied in diverse application domains. In this survey, we focus on providing a comprehensive overview of truth discovery methods, and summarizing them from different aspects. We also discuss some future directions of truth discovery research. We hope that this survey will promote a better understanding of the current progress on truth discovery, and offer some guidelines on how to apply these approaches in application domains.

245 citations
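
The core loop shared by many of the methods this survey covers alternates between inferring truths from reliability-weighted votes and re-estimating source reliability from agreement with those truths. Below is a minimal Python sketch of that generic idea, not any specific method from the survey; the smoothing rule and data layout are illustrative assumptions.

```python
from collections import defaultdict

def truth_discovery(claims, iterations=20):
    """Resolve conflicts by alternating between (1) weighted voting for
    truths and (2) re-estimating source weights from agreement.

    claims: dict mapping (source, object) -> claimed value
    returns: (truths, source_weights)
    """
    sources = {s for s, _ in claims}
    weights = {s: 1.0 for s in sources}  # start with uniform trust

    for _ in range(iterations):
        # Truth step: per object, pick the value with the largest
        # total weight of supporting sources.
        votes = defaultdict(lambda: defaultdict(float))
        for (s, o), v in claims.items():
            votes[o][v] += weights[s]
        truths = {o: max(vals, key=vals.get) for o, vals in votes.items()}

        # Weight step: a source's weight is its smoothed agreement rate
        # with the current truth estimates.
        for s in sources:
            mine = [(o, v) for (src, o), v in claims.items() if src == s]
            correct = sum(v == truths[o] for o, v in mine)
            weights[s] = (correct + 1) / (len(mine) + 2)

    return truths, weights

claims = {("s1", "capital_IL"): "Springfield",
          ("s2", "capital_IL"): "Chicago",
          ("s3", "capital_IL"): "Springfield"}
print(truth_discovery(claims))
```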


Proceedings ArticleDOI
27 May 2015
TL;DR: A new framework that extracts quality phrases from text corpora integrated with phrasal segmentation is proposed, which requires only limited training but the quality of phrases so generated is close to human judgment.
Abstract: Text data are ubiquitous and play an essential role in big data applications. However, text data are mostly unstructured. Transforming unstructured text into structured units (e.g., semantically meaningful phrases) will substantially reduce semantic ambiguity and enhance the power and efficiency of manipulating such data using database technology. Thus, mining quality phrases is a critical research problem in the field of databases. In this paper, we propose a new framework that extracts quality phrases from text corpora integrated with phrasal segmentation. The framework requires only limited training, yet the quality of the phrases generated is close to human judgment. Moreover, the method is scalable: both computation time and required space grow linearly as corpus size increases. Our experiments on large text corpora demonstrate the quality and efficiency of the new method.

206 citations
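
The paper's framework combines classifier-based phrase quality estimation with phrasal segmentation; as a rough illustration of the frequency signal such methods start from, here is a toy pointwise-mutual-information score over bigram candidates. The corpus, threshold, and simplified PMI normalization are all illustrative assumptions, far simpler than the actual method.

```python
import math
from collections import Counter

def phrase_scores(docs):
    """Toy PMI-style score for bigram phrase candidates: candidates whose
    parts co-occur far more than chance score higher. (Uses the unigram
    total as an approximate bigram total, fine for a toy example.)"""
    unigrams, bigrams, total = Counter(), Counter(), 0
    for doc in docs:
        toks = doc.lower().split()
        unigrams.update(toks)
        bigrams.update(zip(toks, toks[1:]))
        total += len(toks)
    scores = {}
    for (a, b), n in bigrams.items():
        if n < 2:  # require minimal support
            continue
        pmi = math.log(n * total / (unigrams[a] * unigrams[b]))
        scores[" ".join((a, b))] = pmi
    return scores

docs = ["data mining on big data", "big data applications need data mining"]
print(sorted(phrase_scores(docs).items(), key=lambda kv: -kv[1]))
```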


Proceedings ArticleDOI
10 Aug 2015
TL;DR: Experimental results on two real-world datasets show that FaitCrowd can significantly reduce the error rate of aggregation compared with state-of-the-art multi-source aggregation approaches, owing to its ability to learn topical expertise from question content and collected answers.
Abstract: In crowdsourced data aggregation tasks, there exist conflicts in the answers provided by large numbers of sources on the same set of questions. The most important challenge for this task is to estimate source reliability and select answers that are provided by high-quality sources. Existing work solves this problem by simultaneously estimating sources' reliability and inferring questions' true answers (i.e., the truths). However, these methods assume that a source has the same reliability degree on all questions, ignoring the fact that a source's reliability may vary significantly among different topics. To capture varying expertise levels on different topics, we propose FaitCrowd, a fine-grained truth discovery model for the task of aggregating conflicting data collected from multiple users/sources. FaitCrowd jointly models the process of generating question content and sources' provided answers in a probabilistic model to estimate both topical expertise and true answers simultaneously. This leads to a more precise estimation of source reliability. Therefore, FaitCrowd demonstrates a better ability to obtain true answers for the questions compared with existing approaches. Experimental results on two real-world datasets show that FaitCrowd can significantly reduce the error rate of aggregation compared with state-of-the-art multi-source aggregation approaches, owing to its ability to learn topical expertise from question content and collected answers.

181 citations


Proceedings ArticleDOI
13 Apr 2015
TL;DR: This paper focuses on hierarchical spatio-temporal hashtag clustering techniques and proposes a data structure called STREAMCUBE, which is an extension of the data cube structure from the database community with spatial and temporal hierarchy.
Abstract: What is happening around the world? When and where? Mining the geo-tagged Twitter stream makes it possible to answer these questions in real time. Although a single tweet can be short and noisy, proper aggregations of tweets can provide meaningful results. In this paper, we focus on hierarchical spatio-temporal hashtag clustering techniques. Our system has the following features: (1) Exploring events (hashtag clusters) at different spatial granularities: users can zoom in and out on maps to find out what is happening in a particular area. (2) Exploring events at different temporal granularities: users can choose to see what is happening today or in the past week. (3) An efficient single-pass algorithm for event identification, which provides human-readable hashtag clusters. (4) Efficient event ranking, which aims to find burst events and localized events given a particular region and time frame. To support aggregation at different space and time granularities, we propose a data structure called STREAMCUBE, which extends the data cube structure from the database community with spatial and temporal hierarchies. To achieve high scalability, we propose a divide-and-conquer method to construct the STREAMCUBE. To support flexible event ranking with different weights, we propose a top-k based index. Various efficient methods are used to speed up event similarity computations. Finally, we have conducted extensive experiments on real Twitter data. Experimental results show that our framework can provide meaningful results with high scalability.

146 citations
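
To make the cube idea concrete, here is a minimal sketch of hashtag aggregation over joint spatial and temporal hierarchies, with geohash prefixes standing in for the spatial hierarchy and truncated timestamps for the temporal one. It illustrates only multi-granularity aggregation, not the paper's single-pass clustering, event ranking, or divide-and-conquer construction; all names and level choices are assumptions.

```python
from collections import defaultdict

SPACE_LEVELS = (2, 4, 6)  # geohash prefix lengths: coarse -> fine

def time_keys(ts):
    """Truncate an ISO timestamp 'YYYY-MM-DDTHH' to day, month, hour."""
    return (ts[:10], ts[:7], ts[:13])

cube = defaultdict(lambda: defaultdict(int))  # cell -> hashtag -> count

def insert(geohash, ts, hashtags):
    # Count each tweet once per (space level, time level) cell, so that
    # zoom in/out queries become plain dictionary lookups.
    for p in SPACE_LEVELS:
        for tkey in time_keys(ts):
            cell = (geohash[:p], tkey)
            for tag in hashtags:
                cube[cell][tag] += 1

def top_hashtags(geo_prefix, tkey, k=3):
    counts = cube.get((geo_prefix, tkey), {})
    return sorted(counts, key=counts.get, reverse=True)[:k]

insert("dp3wnb", "2015-04-13T09", ["#marathon", "#chicago"])
insert("dp3wm2", "2015-04-13T10", ["#marathon"])
print(top_hashtags("dp3w", "2015-04-13"))  # city-level, day-level view
```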


Proceedings ArticleDOI
10 Aug 2015
TL;DR: This work investigates the temporal relations among both object truths and source reliability, and proposes an incremental truth discovery framework that can dynamically update object truths and source weights upon the arrival of new data.
Abstract: In the era of big data, information regarding the same objects can be collected from increasingly more sources. Unfortunately, there usually exist conflicts among the information coming from different sources. To tackle this challenge, truth discovery, i.e., to integrate multi-source noisy information by estimating the reliability of each source, has emerged as a hot topic. In many real world applications, however, the information may come sequentially, and as a consequence, the truth of objects as well as the reliability of sources may be dynamically evolving. Existing truth discovery methods, unfortunately, cannot handle such scenarios. To address this problem, we investigate the temporal relations among both object truths and source reliability, and propose an incremental truth discovery framework that can dynamically update object truths and source weights upon the arrival of new data. Theoretical analysis is provided to show that the proposed method is guaranteed to converge at a fast rate. The experiments on three real world applications and a set of synthetic data demonstrate the advantages of the proposed method over state-of-the-art truth discovery methods.

143 citations
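
A minimal sketch of the incremental flavor described above: running per-source agreement statistics are folded in batch by batch, so weights and truths are updated as new data arrives rather than recomputed from scratch. This is illustrative only; the paper's framework additionally models the temporal evolution of truths and comes with convergence guarantees, and the class and update rule below are assumptions.

```python
class IncrementalTruthDiscovery:
    """Keep running agreement counts per source; update on each new batch."""

    def __init__(self):
        self.correct = {}  # source -> #claims agreeing with estimated truth
        self.total = {}    # source -> #claims seen so far

    def weight(self, s):
        # Smoothed historical accuracy serves as the source weight.
        return (self.correct.get(s, 0) + 1) / (self.total.get(s, 0) + 2)

    def update(self, batch):
        """batch: dict mapping source -> claimed value for one object."""
        votes = {}
        for s, v in batch.items():
            votes[v] = votes.get(v, 0.0) + self.weight(s)
        truth = max(votes, key=votes.get)
        for s, v in batch.items():  # fold batch into running statistics
            self.total[s] = self.total.get(s, 0) + 1
            self.correct[s] = self.correct.get(s, 0) + (v == truth)
        return truth

itd = IncrementalTruthDiscovery()
print(itd.update({"s1": 20.5, "s2": 20.5, "s3": 19.0}))  # -> 20.5
```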


Proceedings ArticleDOI
18 May 2015
TL;DR: This paper examines the existence of network effects in a recent online experiment conducted at LinkedIn, proposes an efficient and effective estimator for the Average Treatment Effect (ATE) that accounts for interference between users, and applies the method in both simulations and a real-world online experiment.
Abstract: A/B testing, also known as bucket testing, split testing, or controlled experimentation, is a standard way to evaluate user engagement or satisfaction with a new service, feature, or product. It is widely used on online websites, including social network sites such as Facebook, LinkedIn, and Twitter, to make data-driven decisions. The goal of A/B testing is to estimate the treatment effect of a new change, which becomes intricate when users are interacting, i.e., the treatment effect on a user may spill over to other users via underlying social connections. When conducting these online controlled experiments, it is common practice to make the Stable Unit Treatment Value Assumption (SUTVA) that each individual's response is affected by their own treatment only. Though this assumption simplifies the estimation of the treatment effect, it does not hold when network interference is present, and may even lead to wrong conclusions. In this paper, we study the problem of network A/B testing in real networks, which have substantially different characteristics from the simulated random networks studied in previous works. We first examine the existence of network effects in a recent online experiment conducted at LinkedIn; second, we propose an efficient and effective estimator for the Average Treatment Effect (ATE) that accounts for interference between users in real online experiments; finally, we apply our method in both simulations and a real-world online experiment. The simulation results show that our estimator achieves better performance with respect to both bias and variance reduction. The real-world online experiment not only demonstrates that large-scale network A/B testing is feasible but also further validates many of our observations in the simulation studies.

118 citations
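
One simple way to see how neighborhood exposure can enter an ATE estimate is an exposure-adjusted regression: include the fraction of treated neighbors as a covariate, then read off the total effect of treating everyone. This is a hedged sketch of the general idea, not LinkedIn's estimator; the linear outcome model and simulated graph are assumptions.

```python
import numpy as np

def network_ate(y, z, adj):
    """Regress outcome on own treatment and treated-neighbor fraction.

    y: outcomes (n,), z: 0/1 treatment (n,), adj: adjacency matrix (n, n)
    """
    deg = adj.sum(axis=1)
    frac_treated = adj @ z / np.maximum(deg, 1)  # neighborhood exposure
    X = np.column_stack([np.ones_like(y), z, frac_treated])
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    # Under this linear model, going from "nobody treated" to "everybody
    # treated" moves both z and the exposure from 0 to 1.
    return beta[1] + beta[2]

rng = np.random.default_rng(0)
n = 200
adj = (rng.random((n, n)) < 0.05).astype(float)
adj = np.triu(adj, 1)
adj = adj + adj.T  # symmetric, no self-loops
z = rng.integers(0, 2, n).astype(float)
exposure = adj @ z / np.maximum(adj.sum(axis=1), 1)
y = 1.0 + 2.0 * z + 1.5 * exposure + rng.normal(0, 0.1, n)
print(network_ate(y, z, adj))  # ~3.5 (= direct 2.0 + spillover 1.5)
```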


Proceedings ArticleDOI
10 Aug 2015
TL;DR: This paper investigates entity recognition (ER) with distant-supervision and proposes a novel relation phrase-based ER framework, called ClusType, that runs data-driven phrase mining to generate entity mention candidates and relation phrases, and enforces the principle that relation phrases should be softly clustered when propagating type information between their argument entities.
Abstract: Entity recognition is an important but challenging research problem. In reality, many text collections come from specific, dynamic, or emerging domains, which poses significant new challenges for entity recognition, with increased name ambiguity and context sparsity requiring entity detection without domain restriction. In this paper, we investigate entity recognition (ER) with distant supervision and propose a novel relation phrase-based ER framework, called ClusType, that runs data-driven phrase mining to generate entity mention candidates and relation phrases, and enforces the principle that relation phrases should be softly clustered when propagating type information between their argument entities. We then predict the type of each entity mention based on the type signatures of its co-occurring relation phrases and the type indicators of its surface name, as computed over the corpus. Specifically, we formulate a joint optimization problem for two tasks: type propagation with relation phrases and multi-view relation phrase clustering. Our experiments on multiple genres---news, Yelp reviews, and tweets---demonstrate the effectiveness and robustness of ClusType, with an average 37% improvement in F1 score over the best compared method.

107 citations


Proceedings ArticleDOI
10 Aug 2015
TL;DR: This work proposes a probabilistic graphical model, which simultaneously infers truth as well as source quality without any a priori training involving ground truth answers, and proposes an initialization scheme based upon a quantity named truth existence score, which synthesizes two indicators, namely, participation rate and consistency rate.
Abstract: When integrating information from multiple sources, it is common to encounter conflicting answers to the same question. Truth discovery aims to infer the most accurate and complete integrated answers from conflicting sources. In some cases, there exist questions for which the true answers are excluded from the candidate answers provided by all sources. Without any prior knowledge, these questions, named no-truth questions, are difficult to distinguish from the questions that have true answers, named has-truth questions. In particular, these no-truth questions degrade the precision of the answer integration system. We address this challenge by introducing source quality, which is made up of three fine-grained measures: silent rate, false spoken rate, and true spoken rate. By incorporating these three measures, we propose a probabilistic graphical model which simultaneously infers truth as well as source quality without any a priori training involving ground truth answers. Moreover, since inference in this graphical model requires tuning the prior of truth, we propose an initialization scheme based upon a quantity named the truth existence score, which synthesizes two indicators, namely participation rate and consistency rate. Compared with existing methods, our method can effectively filter out no-truth questions, resulting in more accurate source quality estimation. Consequently, our method provides more accurate and complete answers to both has-truth and no-truth questions. Experiments on three real-world datasets illustrate the notable advantage of our method over existing state-of-the-art truth discovery methods.

59 citations
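
Under one plausible reading of the three fine-grained measures named above, they can be computed directly from an answer table once truth estimates are available; in the actual model they are latent quantities inferred jointly with the truths. A sketch, with a hypothetical data layout:

```python
def source_quality(answers, truths, questions):
    """Silent / true spoken / false spoken rates per source.

    answers: dict (source, question) -> answer (absent = stayed silent)
    truths:  dict question -> true answer (absent = no-truth question)
    """
    quality = {}
    sources = {s for s, _ in answers}
    for s in sources:
        answered = {q: answers[(s, q)] for q in questions if (s, q) in answers}
        silent_rate = 1 - len(answered) / len(questions)
        # Judge spoken answers only on questions that have a truth.
        spoken_on_truth = {q: a for q, a in answered.items() if q in truths}
        true_spoken = sum(a == truths[q] for q, a in spoken_on_truth.items())
        n = max(len(spoken_on_truth), 1)
        quality[s] = {"silent": silent_rate,
                      "true_spoken": true_spoken / n,
                      "false_spoken": 1 - true_spoken / n}
    return quality

questions = ["q1", "q2", "q3"]
answers = {("s1", "q1"): "a", ("s1", "q2"): "b", ("s2", "q1"): "a"}
truths = {"q1": "a", "q2": "c"}  # q3 is a no-truth question
print(source_quality(answers, truths, questions))
```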


Proceedings ArticleDOI
14 Nov 2015
TL;DR: In this paper, the authors propose to represent a document as a typed heterogeneous information network (HIN), where the entities and relations are annotated with types and multiple documents can be linked by the words and entities in the HIN.
Abstract: As a fundamental task, document similarity measurement has a broad impact on document-based classification, clustering, and ranking. Traditional approaches represent documents as bags of words and compute document similarities using measures like cosine, Jaccard, and Dice. However, entity phrases, rather than single words, in documents can be critical for evaluating document relatedness. Moreover, the types of entities and the links between entities/words are also informative. We propose a method to represent a document as a typed heterogeneous information network (HIN), where the entities and relations are annotated with types. Multiple documents can be linked by the words and entities in the HIN. Consequently, we convert the document similarity problem into a graph distance problem. Intuitively, there can be multiple paths between a pair of documents. We propose to use meta-paths defined in the HIN to compute distances between documents. Instead of burdening users with defining meaningful meta-paths, an automatic method is proposed to rank the meta-paths. Given the meta-paths and their ranking scores, an HIN-based similarity measure, KnowSim, is proposed to compute document similarities. Using Freebase, a well-known world knowledge base, to conduct semantic parsing and construct HINs for documents, our experiments on the 20Newsgroups and RCV1 datasets show that KnowSim produces high-quality document clusterings.

56 citations
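
The per-meta-path building block behind a measure like KnowSim can be illustrated with a PathSim-style score: count path instances between two documents along one meta-path (via the commuting matrix) and normalize by their self-path counts. KnowSim itself combines many automatically ranked meta-paths; this sketch shows only a single one, over a toy HIN.

```python
import numpy as np

def pathsim(commuting, i, j):
    """PathSim-style similarity under one meta-path: path instances
    between i and j, normalized by their self-path counts."""
    return 2 * commuting[i, j] / (commuting[i, i] + commuting[j, j])

# Toy HIN: documents linked to entities; meta-path Doc-Entity-Doc.
doc_entity = np.array([[1, 1, 0],   # doc0 mentions e0, e1
                       [1, 0, 1],   # doc1 mentions e0, e2
                       [0, 1, 0]])  # doc2 mentions e1
M = doc_entity @ doc_entity.T       # commuting matrix for Doc-Entity-Doc
print(pathsim(M, 0, 1), pathsim(M, 0, 2))
```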


Proceedings ArticleDOI
10 Aug 2015
TL;DR: Experimental results on two text benchmark datasets show that incorporating world knowledge as indirect supervision can significantly outperform state-of-the-art clustering algorithms as well as clustering algorithms enhanced with world knowledge features.
Abstract: One of the key obstacles in making learning protocols realistic in applications is the need to supervise them, a costly process that often requires hiring domain experts. We consider a framework that uses world knowledge as indirect supervision. World knowledge is general-purpose knowledge that is not designed for any specific domain. The key challenges are then how to adapt the world knowledge to domains and how to represent it for learning. In this paper, we provide an example of using world knowledge for domain-dependent document clustering. We provide three ways to specify the world knowledge to domains by resolving the ambiguity of the entities and their types, and represent the data together with world knowledge as a heterogeneous information network. We then propose a clustering algorithm that can cluster multiple types and incorporate the sub-type information as constraints. In the experiments, we use two existing knowledge bases as our sources of world knowledge. One is Freebase, which is collaboratively collected knowledge about entities and their organizations. The other is YAGO2, a knowledge base automatically extracted from Wikipedia that maps knowledge to the linguistic knowledge base WordNet. Experimental results on two text benchmark datasets (20newsgroups and RCV1) show that incorporating world knowledge as indirect supervision can significantly outperform state-of-the-art clustering algorithms as well as clustering algorithms enhanced with world knowledge features.

Journal ArticleDOI
Jing Gao, Qi Li, Bo Zhao, Wei Fan, Jiawei Han
01 Aug 2015
TL;DR: This tutorial presents an organized picture on truth discovery and crowdsourcing aggregation, which are compared on both theory and application levels, and their related areas as well as open questions are discussed.
Abstract: In the era of Big Data, data entries, even those describing the same objects or events, can come from a variety of sources, where a data source can be a web page, a database, or a person. Consequently, conflicts among sources become inevitable. To resolve these conflicts and achieve high-quality data, truth discovery and crowdsourcing aggregation have been studied intensively. However, although these two topics have a lot in common, they have been studied separately and applied to different domains. To answer the need for a systematic introduction and comparison of the two topics, we present an organized picture of truth discovery and crowdsourcing aggregation in this tutorial. They are compared at both the theory and application levels, and their related areas as well as open questions are discussed.

Journal ArticleDOI
TL;DR: A UGC‐based automatic road map inference method is proposed that applies data mining techniques and natural language processing tools to extract not only spatial information – the location of the road network – but also attribute information – road class and road name – in an effort to create a complete road map.
Abstract: As mapping is costly and labor-intensive work, government mapping agencies are less and less willing to absorb these costs. In order to reduce the updating cycle and cost, researchers have started to use user generated content (UGC) for updating road maps; however, the existing methods either rely heavily on manual labor or cannot extract enough information for road maps. In view of the above problems, this article proposes a UGC-based automatic road map inference method. In this method, data mining techniques and natural language processing tools are applied to trajectory data and geotagged data in social media to extract not only spatial information – the location of the road network – but also attribute information – road class and road name – in an effort to create a complete road map. A case study using floating car data, collected by the National Commercial Vehicle Monitoring Platform of China, and geotagged text data from Flickr and Google Maps/Earth, validates the effectiveness of this method in inferring road maps.

Proceedings ArticleDOI
09 Aug 2015
TL;DR: A novel adaptive model, adaQAC, is proposed that adapts query auto-completion to users' implicit negative feedback towards unselected query suggestions; empirical results show that implicit negative feedback significantly and consistently boosts the accuracy of the investigated static QAC models that rely only on relevance scores.
Abstract: Query auto-completion (QAC) facilitates user query composition by suggesting queries given query prefix inputs. In 2014, global users of Yahoo! Search saved more than 50% of keystrokes when submitting English queries by selecting QAC suggestions. Users' preferences over queries can be inferred during user-QAC interactions, such as dwelling on a suggestion list for a long time without selecting query suggestions ranked at the top. However, the wealth of such implicit negative feedback has not been exploited for designing QAC models. Most existing QAC models rank suggested queries for given prefixes based on certain relevance scores. We take the initiative towards studying implicit negative feedback during user-QAC interactions. This motivates re-designing QAC in a more general framework that combines (static) relevance with (adaptive) implicit negative feedback. We propose a novel adaptive model, adaQAC, that adapts query auto-completion to users' implicit negative feedback towards unselected query suggestions. We collect user-QAC interaction data and perform large-scale experiments. Empirical results show that implicit negative feedback significantly and consistently boosts the accuracy of the investigated static QAC models that rely only on relevance scores. Our work compellingly makes a key point: QAC should be designed in a more general framework that adapts to implicit negative feedback.
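
A minimal sketch of the adaptive idea: discount a suggestion's static relevance for a user each time it was shown prominently but skipped. The multiplicative penalty below is an illustrative stand-in for adaQAC's probabilistic treatment of implicit negative feedback; all names and the penalty value are assumptions.

```python
from collections import defaultdict

skips = defaultdict(int)  # (user, query) -> times shown but not selected

def record_impression(user, shown, selected):
    for q in shown:
        if q != selected:
            skips[(user, q)] += 1

def rank(user, candidates, relevance, penalty=0.6):
    """Order candidates by static relevance discounted per observed skip."""
    score = {q: relevance[q] * penalty ** skips[(user, q)] for q in candidates}
    return sorted(candidates, key=score.get, reverse=True)

relevance = {"facebook": 0.9, "facebook login": 0.6}
record_impression("u1", ["facebook", "facebook login"], "facebook login")
# After one skip of "facebook", the previously lower-ranked query wins:
print(rank("u1", ["facebook", "facebook login"], relevance))
```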

Proceedings ArticleDOI
10 Aug 2015
TL;DR: This paper proposes a novel probabilistic model to capture the semantic regions where people post messages with a coherent topic as well as the patterns of movement between the semantic areas, and develops an efficient inference algorithm to calculate model parameters.
Abstract: With the increasing use of GPS-enabled mobile phones, geo-tagging, which refers to adding GPS information to media such as micro-blogging messages or photos, has recently seen a surge in popularity. This enables us not only to browse information based on locations, but also to discover patterns in the location-based behaviors of users. Many techniques have been developed to find patterns in people's movements using GPS data, but latent topics in the text messages posted with local contexts have not been utilized effectively. In this paper, we present a latent topic-based clustering algorithm to discover patterns in the trajectories of geo-tagged text messages. We propose a novel probabilistic model to capture the semantic regions where people post messages with a coherent topic, as well as the patterns of movement between these semantic regions. Based on the model, we develop an efficient inference algorithm to calculate the model parameters. By exploiting the estimated model, we then devise a clustering algorithm to find the significant movement patterns that appear frequently in the data. Our experiments on real-life data sets show that the proposed algorithm finds diverse and interesting trajectory patterns and identifies the semantic regions at a finer granularity than traditional geographical clustering methods.

Journal ArticleDOI
TL;DR: This paper proposes a novel probabilistic measure for periodicity, designs a practical algorithm, ePeriodicity, to detect periods, and demonstrates the effectiveness of the method on both synthetic and real datasets.
Abstract: Advanced technology in GPS and sensors enables us to track physical events, such as human movements and facility usage. Periodicity analysis from the recorded data is an important data mining task which provides useful insights into the physical events and enables us to report outliers and predict future behaviors. To mine periodicity in an event, we have to face real-world challenges of inherently complicated periodic behaviors and the imperfect data collection problem. Specifically, the hidden temporal periodic behaviors could be oscillating and noisy, and the observations of the event could be incomplete. In this paper, we propose a novel probabilistic measure for periodicity and design a practical algorithm, ePeriodicity, to detect periods. Our method has thoroughly considered the uncertainties and noises in periodic behaviors and is provably robust to incomplete observations. Comprehensive experiments on both synthetic and real datasets demonstrate the effectiveness of our method.
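
A rough stand-in for the probabilistic periodicity measure: fold timestamps by a candidate period and score how far the per-phase event rate deviates from the overall rate, normalizing by which time slots were actually observed so that missing data does not masquerade as absence of events. This is illustrative only, not the paper's formula; the binning and the example timeline are assumptions.

```python
def period_score(event_times, observed_times, T, bins=10):
    """Score candidate period T on events with incomplete observation."""
    obs = [0] * bins
    ev = [0] * bins
    for t in observed_times:
        obs[int(bins * (t % T) / T)] += 1
    for t in event_times:
        ev[int(bins * (t % T) / T)] += 1
    # Compare per-phase event rate against the overall rate; a true
    # period concentrates events in few phases, inflating the deviation.
    total_obs = max(sum(obs), 1)
    total_rate = sum(ev) / total_obs
    score = 0.0
    for e, o in zip(ev, obs):
        if o:
            score += o / total_obs * abs(e / o - total_rate)
    return score

# Event every 24 "hours" (with an observation gap), T=24 vs. a wrong T=23.
observed = [t for t in range(240) if not 100 <= t < 130]
events = [t for t in observed if t % 24 == 9]
print(period_score(events, observed, 24), period_score(events, observed, 23))
```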

Proceedings ArticleDOI
14 Nov 2015
TL;DR: This paper proposes a novel and efficient dynamic hierarchical entity-aware event discovery model to learn news events and their multiple aspects and demonstrates that it naturally generates an informative presentation of each event with entity graphs, time spans, news summaries and tweet highlights to facilitate user digestion.
Abstract: A major event often has repercussions on both news media and microblogging sites such as Twitter. Reports from mainstream news agencies and discussions from Twitter complement each other to form a complete picture. An event can have multiple aspects (sub-events) describing it from multiple angles, each of which attracts opinions/comments posted on Twitter. Mining such reflections is interesting to both policy makers and ordinary people seeking information. In this paper, we propose a unified framework to mine multi-aspect reflections of news events in Twitter. We propose a novel and efficient dynamic hierarchical entity-aware event discovery model to learn news events and their multiple aspects. The aspects of an event are linked to their reflections in Twitter by a bootstrapped dataless classification scheme, which elegantly handles the challenges of selecting informative tweets under overwhelming noise and bridging the vocabularies of news and tweets. In addition, we demonstrate that our framework naturally generates an informative presentation of each event with entity graphs, time spans, news summaries and tweet highlights to facilitate user digestion.

Journal ArticleDOI
TL;DR: This paper proposes a unifying framework of mining trajectory patterns of various temporal tightness, which it calls unifying trajectory patterns (UT-patterns), which consists of two phases: initial pattern discovery and granularity adjustment.
Abstract: Discovering trajectory patterns has been shown to be very useful in learning interactions between moving objects. Many types of trajectory patterns have been proposed in the literature, but previous methods were developed for only a specific type of trajectory pattern. This limitation can make pattern discovery tedious and inefficient, since users typically do not know which types of trajectory patterns are hidden in their data sets. Our main observation is that many trajectory patterns can be arranged according to the strength of their temporal constraints. In this paper, we propose a unifying framework for mining trajectory patterns of various temporal tightness, which we call unifying trajectory patterns (UT-patterns). This framework consists of two phases: initial pattern discovery and granularity adjustment. A set of initial patterns is discovered in the first phase, and their granularities (i.e., levels of detail) are adjusted by split and merge operations to detect other types in the second phase. As a result, a structure called a pattern forest is constructed to show the various patterns. Both phases are guided by an information-theoretic formula without user intervention. Experimental results demonstrate that our framework facilitates easy discovery of various patterns from real-world trajectory data.

Journal ArticleDOI
TL;DR: This work presents an algorithm for recursively constructing multi-typed topical hierarchies by a newly designed clustering and ranking algorithm for heterogeneous network data, as well as mining and ranking topical patterns of different types.
Abstract: Many digital documentary data collections (e.g., scientific publications, enterprise reports, news articles, and social media) can be modeled as a heterogeneous information network, linking text with multiple types of entities. Constructing high-quality hierarchies that can represent topics at multiple granularities benefits tasks such as search, information browsing, and pattern mining. In this work, we present an algorithm for recursively constructing multi-typed topical hierarchies. Contrary to traditional text-based topic modeling, our approach handles both textual phrases and multiple types of entities by a newly designed clustering and ranking algorithm for heterogeneous network data, as well as mining and ranking topical patterns of different types. Our experiments on datasets from two different domains demonstrate that our algorithm yields high-quality, multi-typed topical hierarchies.

Proceedings ArticleDOI
10 Aug 2015
TL;DR: This paper proposes a two-stage method called Assembler, which conceptually organizes all the SCPs into a novel structure called the SCP search tree, facilitating effective pruning of the search space to generate SCPs efficiently.
Abstract: Recent years have witnessed the wide proliferation of geo-sensory applications wherein a bundle of sensors are deployed at different locations to cooperatively monitor the target condition. Given massive geo-sensory data, we study the problem of mining spatial co-evolving patterns (SCPs), i.e., groups of sensors that are spatially correlated and co-evolve frequently in their readings. SCP mining is of great importance to various real-world applications, yet it is challenging because (1) the truly interesting evolutions are often flooded by numerous trivial fluctuations in the geo-sensory time series; and (2) the pattern search space is extremely large due to the spatiotemporal combinatorial nature of SCPs. In this paper, we propose a two-stage method called Assembler. In the first stage, Assembler filters trivial fluctuations using wavelet transform and detects frequent evolutions for individual sensors via a segment-and-group approach. In the second stage, Assembler generates SCPs by assembling the frequent evolutions of individual sensors. Leveraging the spatial constraint, it conceptually organizes all the SCPs into a novel structure called the SCP search tree, which facilitates effective pruning of the search space to generate SCPs efficiently. Our experiments on both real and synthetic data sets show that Assembler is effective, efficient, and scalable.

Proceedings ArticleDOI
10 Aug 2015
TL;DR: A novel worker model is proposed to characterize annotating behavior on data batches, together with a way to train the worker model on annotation data sets and a debiasing technique to keep such annotation bias from adversely affecting the accuracy of the labels obtained.
Abstract: Crowdsourcing is the de-facto standard for gathering annotated data. While, in theory, data annotation tasks are assumed to be attempted by workers independently, in practice, data annotation tasks are often grouped into batches to be presented and annotated by workers together, in order to save on the time or cost overhead of providing instructions or necessary background. Thus, even though independence is usually assumed between annotations on data items within the same batch, in most cases, a worker's judgment on a data item can still be affected by other data items within the batch, leading to additional errors in collected labels. In this paper, we study the data annotation bias when data items are presented as batches to be judged by workers simultaneously. We propose a novel worker model to characterize the annotating behavior on data batches, and present how to train the worker model on annotation data sets. We also present a debiasing technique to remove the effect of such annotation bias from adversely affecting the accuracy of labels obtained. Our experimental results on synthetic and real-world data sets demonstrate that our proposed method can achieve up to +57% improvement in F1-score compared to the standard majority voting baseline.
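
For context on the majority-voting baseline the paper compares against, and on what aggregation with per-worker quality estimates looks like, here is a small sketch. Note that it models worker accuracy only, not the within-batch dependence that is the paper's actual contribution; the data layout and smoothing are assumptions.

```python
from collections import Counter, defaultdict

def majority_vote(labels):
    """labels: dict item -> list of (worker, label). Baseline aggregation."""
    return {i: Counter(l for _, l in v).most_common(1)[0][0]
            for i, v in labels.items()}

def accuracy_weighted_vote(labels, rounds=5):
    """Weight each worker by estimated accuracy against the consensus."""
    truth = majority_vote(labels)
    for _ in range(rounds):
        acc = defaultdict(lambda: [0, 0])  # worker -> [correct, total]
        for i, v in labels.items():
            for w, l in v:
                acc[w][0] += (l == truth[i])
                acc[w][1] += 1
        weight = {w: (c + 1) / (n + 2) for w, (c, n) in acc.items()}
        truth = {}
        for i, v in labels.items():
            score = defaultdict(float)
            for w, l in v:
                score[l] += weight[w]
            truth[i] = max(score, key=score.get)
    return truth

labels = {"x1": [("w1", "pos"), ("w2", "neg"), ("w3", "pos")],
          "x2": [("w1", "neg"), ("w2", "neg"), ("w3", "pos")]}
print(majority_vote(labels), accuracy_weighted_vote(labels))
```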

Proceedings ArticleDOI
01 Apr 2015
TL;DR: This paper considers a graph regularized meta-path based transductive regression model (Grempt), which combines the principal philosophies of typical graph-based transductive classification methods and transductive regression models designed for homogeneous networks.
Abstract: A number of real-world networks are heterogeneous information networks, which are composed of different types of nodes and links. Numerical prediction in heterogeneous information networks is a challenging but significant area because network-based information for unlabeled objects is usually too limited to make precise estimations. In this paper, we consider a graph regularized meta-path based transductive regression model (Grempt), which combines the principal philosophies of typical graph-based transductive classification methods and transductive regression models designed for homogeneous networks. The computation of our method is time- and space-efficient, and the precision of our model can be verified by numerical experiments.
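
The general shape of a graph-regularized transductive objective of the kind Grempt builds on is a loss on labeled nodes plus a smoothness penalty over network edges; the paper's exact formulation, with meta-path-derived weights and confidence terms, differs in its details:

```latex
\min_{\mathbf{f}} \; \sum_{i \in \mathcal{L}} (f_i - y_i)^2
  \;+\; \lambda \sum_{i,j} w_{ij} \,(f_i - f_j)^2
```

Here \mathcal{L} is the set of labeled objects, w_{ij} is a (meta-path-derived) similarity between objects i and j, and \lambda balances fitting the labels y_i against smoothness of the predicted values \mathbf{f} over the network.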

Proceedings ArticleDOI
01 Mar 2015
TL;DR: The concept of query-based outliers in heterogeneous information networks is introduced, a query language is designed to let users specify outlier queries flexibly, a good outlier measure in heterogeneous networks is defined, and experiments show that, following such a methodology, interesting outliers can be defined and uncovered flexibly and effectively in large heterogeneous networks.
Abstract: Outlier or anomaly detection in large data sets is a fundamental task in data science, with broad applications. However, in real data sets with high-dimensional spaces, most outliers are hidden in certain dimensional combinations and are relative to a user's search space and interest. It is often more effective to give power to users and allow them to specify outlier queries flexibly; the system will then process such mining queries efficiently. In this study, we introduce the concept of query-based outliers in heterogeneous information networks, design a query language to let users specify such queries flexibly, define a good outlier measure in heterogeneous networks, and study how to process outlier queries efficiently in large data sets. Our experiments on real data sets show that, following such a methodology, interesting outliers can be defined and uncovered flexibly and effectively in large heterogeneous networks.

Proceedings ArticleDOI
10 Aug 2015
TL;DR: A novel method is proposed, called STROD, that allows efficient and consistent modification of topic hierarchies, based on a recursive generative model and a scalable tensor decomposition inference algorithm with theoretical performance guarantee.
Abstract: Automatic construction of user-desired topical hierarchies over large volumes of text data is a highly desirable but challenging task. This study proposes to give users freedom to construct topical hierarchies via interactive operations such as expanding a branch and merging several branches. Existing hierarchical topic modeling techniques are inadequate for this purpose because (1) they cannot consistently preserve the topics when the hierarchy structure is modified; and (2) the slow inference prevents swift response to user requests. In this study, we propose a novel method, called STROD, that allows efficient and consistent modification of topic hierarchies, based on a recursive generative model and a scalable tensor decomposition inference algorithm with theoretical performance guarantee. Empirical evaluation shows that STROD reduces the runtime of construction by several orders of magnitude, while generating consistent and quality hierarchies.

Proceedings ArticleDOI
17 Oct 2015
TL;DR: This work proposes class-level meta-paths and studies how they can be used to build more accurate classifiers and to improve active learning in identifying objects for which training labels should be obtained, showing that class-level meta-paths and object classification exhibit interesting synergy.
Abstract: A heterogeneous information network (HIN) is used to model objects of different types and their relationships. Meta-paths are sequences of object types. They are used to represent complex relationships between objects beyond what links in a homogeneous network capture. We study the problem of classifying objects in an HIN. We propose class-level meta-paths and study how they can be used to (1) build more accurate classifiers and (2) improve active learning in identifying objects for which training labels should be obtained. We show that class-level meta-paths and object classification exhibit interesting synergy. Our experimental results show that the use of class-level meta-paths results in very effective active learning and good classification performance in HINs.

Proceedings ArticleDOI
26 Jul 2015
TL;DR: This paper presents the first end-to-end context-aware entity morph decoding system that can automatically identify, disambiguate, verify morph mentions based on specific contexts, and resolve them to target entities.
Abstract: People create morphs, a special type of fake alternative names, to achieve certain communication goals such as expressing strong sentiment or evading censors. For example, "Black Mamba", the name for a highly venomous snake, is a morph that Kobe Bryant created for himself due to his agility and aggressiveness in playing basketball games. This paper presents the first end-to-end context-aware entity morph decoding system that can automatically identify, disambiguate, and verify morph mentions based on specific contexts, and resolve them to target entities. Our approach is based on an absolute "cold start": it does not require any candidate morph or target entity lists as input, nor any manually constructed morph-target pairs for training. We design a semi-supervised collective inference framework for morph mention extraction, and compare various deep learning based approaches for morph resolution. Our approach achieved significant improvement over the state-of-the-art method (Huang et al., 2013), which used a large amount of training data.

Journal Article
TL;DR: This paper formulates relation clustering as a constrained tripartite graph clustering problem and introduces several ways to provide side information via must-link and cannot-link constraints to improve the clustering results.
Abstract: In knowledge bases or information extraction results, differently expressed relations can be semantically similar (e.g., (X, wrote, Y) and (X, 's written work, Y)). Therefore, grouping semantically similar relations into clusters would facilitate and improve many applications, including knowledge base completion, information extraction, information retrieval, and more. This paper formulates relation clustering as a constrained tripartite graph clustering problem, presents an efficient clustering algorithm and exhibits the advantage of the constrained framework. We introduce several ways that provide side information via must-link and cannot-link constraints to improve the clustering results. Different from traditional semi-supervised learning approaches, we propose to use the similarity of relation expressions and the knowledge of entity types to automatically construct the constraints for the algorithm. We show improved relation clustering results on two datasets extracted from human annotated knowledge base (i.e., Freebase) and open information extraction results (i.e., ReVerb data).

Book ChapterDOI
07 Sep 2015
TL;DR: This paper proposes two algorithms, namely Squeeze and Ripple, both of which can accurately answer the Ink query in a fast and incremental manner and are orders of magnitude faster than the state-of-the-art method.
Abstract: Random walk with restart (RWR) is widely recognized as one of the most important node proximity measures for graphs, as it captures the holistic graph structure and is robust to noise in the graph. In this paper, we study a novel query based on the RWR measure, called the inbound top-k (Ink) query. Given a query node q and a number k, the Ink query aims at retrieving k nodes in the graph that have the largest weighted RWR scores to q. Ink queries can be highly useful for various applications such as traffic scheduling, disease treatment, and targeted advertising. Nevertheless, none of the existing RWR computation techniques can accurately and efficiently process the Ink query in large graphs. We propose two algorithms, namely Squeeze and Ripple, both of which can accurately answer the Ink query in a fast and incremental manner. To identify the top-k nodes, Squeeze iteratively performs matrix-vector multiplication and estimates the lower and upper bounds for all the nodes in the graph. Ripple employs a more aggressive strategy by only estimating the RWR scores for the nodes falling in the vicinity of q; the nodes outside the vicinity do not need to be evaluated because their RWR scores are propagated from the boundary of the vicinity and thus upper bounded. Ripple incrementally expands the vicinity until the top-k result set can be obtained. Our extensive experiments on real-life graph data sets show that Ink queries can retrieve interesting results, and the proposed algorithms are orders of magnitude faster than the state-of-the-art method.
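
For reference, the quantity being ranked: the RWR score of a node u toward q is the stationary probability that a walk restarting at u sits at q. A naive baseline scores every node by power iteration and then sorts, which is exactly the per-node cost that Squeeze and Ripple are designed to avoid; the graph and parameters below are illustrative.

```python
import numpy as np

def rwr_scores_to(W, q, c=0.15, iters=100):
    """RWR score of every node u toward q, by power iteration from each u.

    W: row-stochastic transition matrix (n, n).
    """
    n = W.shape[0]
    scores = np.zeros(n)
    for u in range(n):
        e_u = np.zeros(n)
        e_u[u] = 1.0
        r = e_u.copy()
        for _ in range(iters):
            r = (1 - c) * (W.T @ r) + c * e_u  # restart at u with prob. c
        scores[u] = r[q]
    return scores

def inbound_topk(W, q, k):
    """Naive Ink query: score all nodes, then sort (what Squeeze and
    Ripple avoid by bounding scores and expanding only a vicinity of q)."""
    s = rwr_scores_to(W, q)
    return np.argsort(-s)[:k]

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
W = A / A.sum(axis=1, keepdims=True)  # row-stochastic transitions
print(inbound_topk(W, q=3, k=2))      # nodes with largest RWR score to 3
```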

Journal ArticleDOI
TL;DR: The vision, motivations, and research plans of the National Institutes of Health Center for Excellence in Big Data Computing at the University of Illinois, Urbana-Champaign are described, organized around the construction of "Knowledge Engine for Genomics" (KnowEnG), an E-science framework for genomics.

Proceedings ArticleDOI
18 May 2015
TL;DR: This work introduces the novel concept of Semantic Pattern Graph to leverage public signals to understand the underlying semantics of lexical patterns, reinforce pattern evaluation using mined semantics, and yield more accurate and complete entities.
Abstract: Entity extraction is the process of identifying meaningful entities in text documents. In enterprises, extracting entities improves enterprise efficiency by facilitating numerous applications, including search, recommendation, etc. However, the problem is particularly challenging in enterprise domains for several reasons. First, the lack of redundancy of enterprise entities makes previous web-based systems like NELL and OpenIE ineffective, since using only high-precision/low-recall patterns as those systems do would miss the majority of sparse enterprise entities, while using more low-precision patterns in a sparse setting introduces noise drastically. Second, semantic drift is common in enterprises ("Blue" refers to "Windows Blue"), such that public signals from the web cannot be directly applied to entities. Moreover, many internal entities never appear on the web; sparse internal signals are the only source for discovering them. To address these challenges, we propose an end-to-end framework for extracting entities in enterprises, taking an enterprise corpus and limited seeds as input and generating a high-quality entity collection as output. We introduce the novel concept of a Semantic Pattern Graph to leverage public signals to understand the underlying semantics of lexical patterns, reinforce pattern evaluation using mined semantics, and yield more accurate and complete entities. Experiments on Microsoft enterprise data show the effectiveness of our approach.