Proceedings ArticleDOI

Incorporating Social Context and Domain Knowledge for Entity Recognition

TL;DR: The SOCINST model is proposed; it automatically constructs a context of subtopics for each instance, with each subtopic representing one possible meaning of the instance, and domain knowledge is incorporated into the model using a Dirichlet tree distribution.
Abstract: Recognizing entity instances in documents according to a knowledge base is a fundamental problem in many data mining applications. The problem is extremely challenging for short documents in complex domains such as social media and biomedical domains. Large concept spaces and instance ambiguity are key issues that need to be addressed. Most of the documents are created in a social context by common authors via social interactions, such as replies and citations. Such social contexts are largely ignored in the instance-recognition literature. How can users' interactions help entity instance recognition? How can the social context be modeled so as to resolve the ambiguity of different instances? In this paper, we propose the SOCINST model to formalize the problem as a probabilistic model. Given a set of short documents (e.g., tweets or paper abstracts) posted by users who may connect with each other, SOCINST can automatically construct a context of subtopics for each instance, with each subtopic representing one possible meaning of the instance. The model is also able to incorporate social relationships between users to help build social context. We further incorporate domain knowledge into the model using a Dirichlet tree distribution. We evaluate the proposed model on three different genres of datasets: ICDM'12 Contest, Weibo, and I2B2. In ICDM'12 Contest, the proposed model clearly outperforms (+21.4%; p ≪ 1e-5 with t-test) all the top contestants. In Weibo and I2B2, our results also show that the recognition accuracy of SOCINST is 5.3-26.6% better than that of several alternative methods.
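The abstract names a Dirichlet tree distribution as the vehicle for domain knowledge but gives no construction details. As a rough illustration only, the sketch below samples a word distribution from a small Dirichlet tree: each internal node draws Dirichlet branch probabilities over its children, and a leaf's probability is the product of the branch probabilities on its path. The tree shape, edge weights, word list, and function names are all hypothetical and are not taken from the SOCINST paper.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_dirichlet_tree(node, rng, mass=1.0, out=None):
    """Sample leaf probabilities from a Dirichlet tree.

    A node is either a leaf (a string) or a list of (edge_weight, child)
    pairs. Each internal node splits its incoming mass among its children
    using a Dirichlet draw parameterized by the edge weights.
    """
    if out is None:
        out = {}
    if isinstance(node, str):
        out[node] = out.get(node, 0.0) + mass
        return out
    weights = [w for w, _ in node]
    branch_probs = sample_dirichlet(weights, rng)
    for p, (_, child) in zip(branch_probs, node):
        sample_dirichlet_tree(child, rng, mass * p, out)
    return out


# Hypothetical two-group tree: large weights inside a group mean its words
# tend to receive similar shares of whatever mass the group gets.
tree = [
    (1.0, [(50.0, "insulin"), (50.0, "glucose")]),
    (1.0, [(50.0, "tweet"), (50.0, "retweet")]),
]
probs = sample_dirichlet_tree(tree, random.Random(0))
```

Here `probs` maps each leaf word to its sampled probability, and the values sum to one. How SOCINST actually derives its tree from a knowledge base is not stated in this abstract.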


Citations
Journal ArticleDOI
TL;DR: A hybrid recommendation algorithm based on social relations and time-sequenced topics is proposed and verified on real Sina Weibo datasets; it achieves better mean average precision (MAP) than existing counterparts.
Abstract: With the popularization of social networks, increasing numbers of users choose Weibo to get information. However, as the number of users grows, the information on Weibo also multiplies, making it increasingly difficult for users to find the information they are interested in. Therefore, how to recommend high-quality friends to follow on Weibo is one of the focuses of studies in Weibo-based personalized services. Building on existing Weibo social networking topologies and content-based hybrid recommendation algorithms, the study proposes a hybrid recommendation algorithm based on social relations and time-sequenced topics, which is verified using real Sina Weibo datasets. The results show that the improved hybrid recommendation algorithm works well and achieves better mean average precision (MAP) than existing counterparts.
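The abstract reports results in mean average precision (MAP) without defining it. A minimal sketch of the commonly used definition follows, assuming binary relevance and no rank cutoff; the cited study may use a different variant, and the function names are illustrative.

```python
def average_precision(ranked, relevant):
    """Average of precision@k taken at each rank k where a relevant item appears."""
    hits = 0
    total = 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    # Normalize by the number of relevant items; empty relevance sets score 0.
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(rankings, relevants):
    """Mean of per-user average precision over paired rankings and relevance sets."""
    scores = [average_precision(r, rel) for r, rel in zip(rankings, relevants)]
    return sum(scores) / len(scores) if scores else 0.0
```

For one user with ranking `["a", "b", "c"]` and relevant set `{"a", "c"}`, the relevant items appear at ranks 1 and 3, so the average precision is (1/1 + 2/3) / 2 = 5/6.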

43 citations

Book
15 Oct 2020
TL;DR: An overview of the literature on knowledge graphs (KGs) in the context of information retrieval (IR) is provided and how KGs can be employed to support IR tasks, including document and entity retrieval is discussed.
Abstract: In this survey, we provide an overview of the literature on knowledge graphs (KGs) in the context of information retrieval (IR). Modern IR systems can benefit from information available in KGs in multiple ways, independent of whether the KGs are publicly available or proprietary ones. We provide an overview of the components required when building IR systems that leverage KGs and use a task-oriented organization of the material that we discuss. As an understanding of the intersection of IR and KGs is beneficial to many researchers and practitioners, we consider prior work from two complementary angles: leveraging KGs for information retrieval and enriching KGs using IR techniques. We start by discussing how KGs can be employed to support IR tasks, including document and entity retrieval. We then proceed by describing how IR—and language technology in general—can be utilized for the construction and completion of KGs. This includes tasks such as entity recognition, typing, and relation extraction. We discuss common issues that appear across the tasks that we consider and identify future directions for addressing them. We also provide pointers to datasets and other resources that should be useful for both newcomers and experienced researchers in the area.

34 citations

Proceedings Article
09 Jul 2016
TL;DR: A multi-modal Bayesian embedding model, GenVector, is proposed to learn latent topics that generate word and network embeddings in a shared latent topic space, and significantly decreases the error rate in an online A/B test with live users.
Abstract: We study the extent to which online social networks can be connected to knowledge bases. The problem is referred to as learning social knowledge graphs. We propose a multi-modal Bayesian embedding model, GenVector, to learn latent topics that generate word embeddings and network embeddings simultaneously. GenVector leverages large-scale unlabeled data with embeddings and represents data of two modalities--i.e., social network users and knowledge concepts--in a shared latent topic space. Experiments on three datasets show that the proposed method clearly outperforms state-of-the-art methods. We then deploy the method on AMiner, an online academic search system, to connect a network of 38,049,189 researchers with a knowledge base of 35,415,011 concepts. Our method significantly decreases the error rate of learning social knowledge graphs in an online A/B test with live users.

27 citations

Proceedings ArticleDOI
Jie Tang
11 Apr 2016
TL;DR: This talk will focus on answering two fundamental questions for author-centric network analysis: who is who? and who are similar to each other?
Abstract: AMiner is the second generation of the ArnetMiner system. We focus on developing author-centric analytic and mining tools for gaining a deep understanding of the large and heterogeneous networks formed by authors, papers, venues, and knowledge concepts. One fundamental goal is how to extract and integrate semantics from different sources. We have developed algorithms to automatically extract researchers' profiles from the Web and resolve the name ambiguity problem, and connect different professional networks. We also developed methodologies to incorporate knowledge from the Wikipedia and other sources into the system to bridge the gap between network science and the web mining research. In this talk, I will focus on answering two fundamental questions for author-centric network analysis: who is who? and who are similar to each other? The system has been in operation since 2006 and has collected more than 100,000,000 author profiles, 100,000,000 publication papers, and 7,800,000 knowledge concepts. It has been widely used for collaboration recommendation, similarity analysis, and community evolution.

26 citations


Cites methods from "Incorporating Social Context and Do..."

  • ...We also developed methodologies to incorporate knowledge from the Wikipedia and other sources into the system [7, 2] to bridge the gap between network science and the web mining research....


Journal ArticleDOI
TL;DR: A novel approach which combines a user‐based collaborative filtering (CF) algorithm with semantic and social recommendations for the recommendation of users in social networks is proposed and a social recommender system based on this approach is developed.
Abstract: The development of social media technologies has greatly enhanced social interactions. The proliferation of social platforms has generated massive amounts of data and a considerable number of persons join these platforms every day. Therefore, one of the current issues is to facilitate the search for the most appropriate friends for a given user. We focus in this article on the recommendation of users in social networks. We propose a novel approach which combines a user‐based collaborative filtering (CF) algorithm with semantic and social recommendations. The semantic dimension suggests the close friends based on the calculation of the similarity between the active user and his friends. The social dimension is based on some social‐behavior metrics such as friendship and credibility degree. The novelty of our approach concerns the modeling of the credibility of the user, through his/her trust and commitment in the social network. A social recommender system based on this approach is developed and experiments have been conducted using the Yelp social network. The evaluation results demonstrated that the proposed hybrid approach improves the accuracy of the recommendation compared with the user‐based CF algorithm and solves the sparsity and cold start problems.

14 citations

References
Journal ArticleDOI
TL;DR: This work proposes a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams, and Hofmann's aspect model.
Abstract: We describe latent Dirichlet allocation (LDA), a generative probabilistic model for collections of discrete data such as text corpora. LDA is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. In the context of text modeling, the topic probabilities provide an explicit representation of a document. We present efficient approximate inference techniques based on variational methods and an EM algorithm for empirical Bayes parameter estimation. We report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic LSI model.

30,570 citations


"Incorporating Social Context and Do..." refers background or methods in this paper

  • ...When training SOCINST, as for the hyperparameters α, β, and η, following [1, 5], we empirically take fixed values (i....


  • ...Probabilistic topic models have been successfully applied to multiple text mining tasks to extract topics from text [5, 15, 30]....


  • ...The AT model can be considered as an extension of Latent Dirichlet Allocation (LDA) [5], but one that considers the collaborative relationships between users....


  • ...From the modeling aspect, substantial research has been conducted for topic models, such as [5, 15, 30]....


Proceedings Article
03 Jan 2001
TL;DR: This paper proposed a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams, and Hofmann's aspect model, also known as probabilistic latent semantic indexing (pLSI).
Abstract: We propose a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams [6], and Hofmann's aspect model, also known as probabilistic latent semantic indexing (pLSI) [3]. In the context of text modeling, our model posits that each document is generated as a mixture of topics, where the continuous-valued mixture proportions are distributed as a latent Dirichlet random variable. Inference and learning are carried out efficiently via variational algorithms. We present empirical results on applications of this model to problems in text modeling, collaborative filtering, and text classification.
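The generative story in this abstract (each document draws Dirichlet-distributed topic proportions, then each word draws a topic and a word from that topic) can be sketched in a few lines. This is an illustrative sampler only, not the paper's inference procedure; the Dirichlet prior placed on each topic's word distribution, the hyperparameter values, and all names are assumptions made for the example.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_categorical(probs, rng):
    """Draw an index with probability proportional to probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_corpus(num_docs, doc_len, num_topics, vocab,
                    alpha=0.5, beta=0.5, seed=0):
    """Generate documents by the LDA generative process (assumed priors)."""
    rng = random.Random(seed)
    # One word distribution per topic (assumption: symmetric Dirichlet prior).
    topics = [sample_dirichlet([beta] * len(vocab), rng)
              for _ in range(num_topics)]
    docs = []
    for _ in range(num_docs):
        # Per-document topic proportions: the latent Dirichlet variable.
        theta = sample_dirichlet([alpha] * num_topics, rng)
        words = []
        for _ in range(doc_len):
            z = sample_categorical(theta, rng)      # topic for this word slot
            w = sample_categorical(topics[z], rng)  # word drawn from that topic
            words.append(vocab[w])
        docs.append(words)
    return docs
```

Calling `generate_corpus(3, 5, 2, ["gene", "protein", "tweet", "reply"])` returns three five-word documents; fitting the model to real text runs this process in reverse, which the abstract says is done with variational algorithms.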

25,546 citations

Proceedings Article
28 Jun 2001
TL;DR: This work presents iterative parameter estimation algorithms for conditional random fields and compares the performance of the resulting models to HMMs and MEMMs on synthetic and natural-language data.
Abstract: We present conditional random fields, a framework for building probabilistic models to segment and label sequence data. Conditional random fields offer several advantages over hidden Markov models and stochastic grammars for such tasks, including the ability to relax strong independence assumptions made in those models. Conditional random fields also avoid a fundamental limitation of maximum entropy Markov models (MEMMs) and other discriminative Markov models based on directed graphical models, which can be biased towards states with few successor states. We present iterative parameter estimation algorithms for conditional random fields and compare the performance of the resulting models to HMMs and MEMMs on synthetic and natural-language data.

13,190 citations


"Incorporating Social Context and Do..." refers methods in this paper

  • ...To recognize instances from a free document, we can consider a sequential labeling model, for example, Conditional Random Fields (CRFs) [19]....


Journal ArticleDOI
01 Aug 1999
TL;DR: Probabilistic Latent Semantic Indexing is a novel approach to automated document indexing which is based on a statistical latent class model for factor analysis of count data.
Abstract: Probabilistic Latent Semantic Indexing is a novel approach to automated document indexing which is based on a statistical latent class model for factor analysis of count data. Fitted from a training corpus of text documents by a generalization of the Expectation Maximization algorithm, the utilized model is able to deal with domain-specific synonymy as well as with polysemous words. In contrast to standard Latent Semantic Indexing (LSI) by Singular Value Decomposition, the probabilistic variant has a solid statistical foundation and defines a proper generative data model. Retrieval experiments on a number of test collections indicate substantial performance gains over direct term matching methods as well as over LSI. In particular, the combination of models with different dimensionalities has proven to be advantageous.

4,577 citations