Blogger-Link-Topic model for blog mining
24 May 2011, pp. 28-39
TL;DR: This paper proposes the blogger-link-topic model for blog mining based on the multiple attributes of blog content, bloggers, and links and presents a unique blog classification framework that computes the normalized document-topic matrix, which is applied to retrieve the classification results.
Abstract: Blog mining is an important area of behavior informatics because it produces effective techniques for analyzing and understanding human behaviors from social media. In this paper, we propose the blogger-link-topic model for blog mining based on the multiple attributes of blog content, bloggers, and links. In addition, we present a unique blog classification framework that computes the normalized document-topic matrix, to which our model is applied to retrieve the classification results. After comparing the results for blog classification on real-world blog data, we find that our blogger-link-topic model outperforms the other techniques in terms of overall precision and recall. This demonstrates that the additional information contained in blog-specific attributes can help improve blog classification and retrieval results.
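The classification step the abstract describes can be sketched without the full model: given a document-topic matrix (however it was produced), normalize each row into a topic distribution and label each document by its dominant topic. This is a minimal illustration, not the paper's implementation; the matrix values below are made up.

```python
# Sketch of the normalized document-topic classification step: L1-normalize
# each row of a document-topic matrix, then assign each document to its
# highest-weight topic. Values are illustrative only.

def normalize_rows(doc_topic):
    """L1-normalize each row so it becomes a probability distribution."""
    normalized = []
    for row in doc_topic:
        total = sum(row)
        normalized.append([w / total for w in row] if total else list(row))
    return normalized

def classify(doc_topic):
    """Assign each document to its dominant topic index."""
    return [max(range(len(row)), key=row.__getitem__)
            for row in normalize_rows(doc_topic)]

doc_topic = [
    [3.0, 1.0, 0.5],   # document 0: weighted toward topic 0
    [0.2, 0.2, 4.0],   # document 1: weighted toward topic 2
]
print(classify(doc_topic))  # [0, 2]
```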
Citations
16 Jan 2020
TL;DR: The state-of-the-art text mining approaches and techniques used for analyzing transcripts and speeches, meeting transcripts, and academic journal articles, as well as websites, emails, blogs, and social media platforms, are investigated.
Abstract: Text mining in big data analytics is emerging as a powerful tool for harnessing the power of unstructured textual data by analyzing it to extract new knowledge and to identify significant patterns and correlations hidden in the data. This study seeks to determine the state of text mining research by examining the developments within published literature over past years and provide valuable insights for practitioners and researchers on the predominant trends, methods, and applications of text mining research. In accordance with this, more than 200 academic journal articles on the subject are included and discussed in this review; the state-of-the-art text mining approaches and techniques used for analyzing transcripts and speeches, meeting transcripts, and academic journal articles, as well as websites, emails, blogs, and social media platforms, across a broad range of application areas are also investigated. Additionally, the benefits and challenges related to text mining are also briefly outlined.
103 citations
TL;DR: Two methods, K-Nearest Neighbor (KNN) and Artificial Neural Networks (ANNs), are applied to the Kohkiloye and Boyer Ahmad province bloggers dataset, classifying each blogger from its input features more accurately than previously proposed algorithms.
Abstract: Blogs are one of the effective tools of Web 2.0 and a major module of the social and interactive capabilities that make the IT world wonderful for cyber and virtual living. Two methods were used in this paper: K-Nearest Neighbor (KNN) and Artificial Neural Networks (ANNs). These methods classify the Kohkiloye and Boyer Ahmad province bloggers dataset using the input features of each blogger, and compare favorably with previously proposed algorithms. Our simulations and experiments provide not only promising results but also higher prediction and classification rates.
50 citations
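The k-nearest-neighbor approach used in the study above can be sketched in a few lines: classify a query point by majority vote among its k closest training points. The feature vectors and labels below are hypothetical stand-ins, not the bloggers dataset.

```python
# Minimal KNN classifier of the kind applied to blogger features.
# Training data and labels here are made up for illustration.
import math
from collections import Counter

def knn_predict(train, labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    nearest = sorted(range(len(train)),
                     key=lambda i: math.dist(train[i], query))[:k]
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

train = [[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]]
labels = ["inactive", "inactive", "pro", "pro"]
print(knn_predict(train, labels, [4.8, 5.1]))  # prints "pro"
```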
TL;DR: The factors that influence a blogger to behave professionally are identified based on the classifier with the best results, and the causes behind the varying performance of algorithms are elaborated.
27 citations
TL;DR: This study analyzes domestic articles related to AI using a topic modeling method based on the LDA algorithm to determine the new value that can be created through the convergence between artificial intelligence technology (AIT) and all industries.
Abstract: The present study identifies new value that can be created through the convergence between artificial intelligence technology (AIT) and all industries by deriving and thoroughly analyzing major issues related to artificial intelligence (AI). This study analyzes domestic articles related to AI using a topic modeling method based on the LDA algorithm.
6 citations
Additional excerpts
...Park Jahyun and Song Min (2013) analyzed domestic library and information science research trends through topic modeling [2], and Flora (2011) analyzed content trends of web blogs using topic modeling [3]....
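The LDA-based topic modeling the study above applies can be illustrated with a toy collapsed Gibbs sampler: repeatedly resample each token's topic from counts, then read off a normalized document-topic matrix. This is a teaching sketch, not the study's pipeline; the corpus and hyperparameters are invented.

```python
# Toy collapsed Gibbs sampler for LDA. Corpus, alpha, and beta are
# illustrative; real applications use a library implementation.
import random
from collections import defaultdict

def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    rng = random.Random(seed)
    V = len({w for d in docs for w in d})              # vocabulary size
    ndk = [[0] * n_topics for _ in docs]               # doc-topic counts
    nkw = [defaultdict(int) for _ in range(n_topics)]  # topic-word counts
    nk = [0] * n_topics                                # topic totals
    z = []                                             # token topic assignments
    for d, doc in enumerate(docs):                     # random initialization
        zs = []
        for w in doc:
            k = rng.randrange(n_topics)
            zs.append(k)
            ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
        z.append(zs)
    for _ in range(n_iter):                            # Gibbs sweeps
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]                            # remove token's count
                ndk[d][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
                weights = [(ndk[d][t] + alpha) * (nkw[t][w] + beta)
                           / (nk[t] + V * beta) for t in range(n_topics)]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = k                            # re-add under new topic
                ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    # smoothed, row-normalized document-topic matrix
    return [[(c + alpha) / (sum(row) + n_topics * alpha) for c in row]
            for row in ndk]

docs = [["ai", "model", "ai"], ["market", "trade", "market"], ["ai", "model"]]
theta = lda_gibbs(docs, n_topics=2)
print(theta)
```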
References
TL;DR: This work proposes a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams, and Hofmann's aspect model.
Abstract: We describe latent Dirichlet allocation (LDA), a generative probabilistic model for collections of discrete data such as text corpora. LDA is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. In the context of text modeling, the topic probabilities provide an explicit representation of a document. We present efficient approximate inference techniques based on variational methods and an EM algorithm for empirical Bayes parameter estimation. We report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic LSI model.
30,570 citations
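The three-level generative process summarized in the abstract above can be written out explicitly; the following is the standard LDA formulation for a single document of N words, with Dirichlet parameter α and per-topic word distributions β:

```latex
% LDA generative process for one document
\theta \sim \mathrm{Dirichlet}(\alpha)
\quad\text{(per-document topic proportions)}

\text{for } n = 1, \dots, N:\quad
z_n \sim \mathrm{Multinomial}(\theta), \qquad
w_n \mid z_n \sim \mathrm{Multinomial}(\beta_{z_n})

% joint distribution of topic mixture, topic assignments, and words
p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)
  = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)
```

Inference then amounts to approximating the posterior over θ and z, which the paper does with variational methods.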
03 Jan 2001
TL;DR: This paper proposes a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams, and Hofmann's aspect model, also known as probabilistic latent semantic indexing (pLSI).
Abstract: We propose a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams [6], and Hofmann's aspect model, also known as probabilistic latent semantic indexing (pLSI) [3]. In the context of text modeling, our model posits that each document is generated as a mixture of topics, where the continuous-valued mixture proportions are distributed as a latent Dirichlet random variable. Inference and learning are carried out efficiently via variational algorithms. We present empirical results on applications of this model to problems in text modeling, collaborative filtering, and text classification.
25,546 citations
07 Jul 2004
TL;DR: The author-topic model is introduced, a generative model for documents that extends Latent Dirichlet Allocation to include authorship information, and applications to computing similarity between authors and entropy of author output are demonstrated.
Abstract: We introduce the author-topic model, a generative model for documents that extends Latent Dirichlet Allocation (LDA; Blei, Ng, & Jordan, 2003) to include authorship information. Each author is associated with a multinomial distribution over topics and each topic is associated with a multinomial distribution over words. A document with multiple authors is modeled as a distribution over topics that is a mixture of the distributions associated with the authors. We apply the model to a collection of 1,700 NIPS conference papers and 160,000 CiteSeer abstracts. Exact inference is intractable for these datasets and we use Gibbs sampling to estimate the topic and author distributions. We compare the performance with two other generative models for documents, which are special cases of the author-topic model: LDA (a topic model) and a simple author model in which each author is associated with a distribution over words rather than a distribution over topics. We show topics recovered by the author-topic model, and demonstrate applications to computing similarity between authors and entropy of author output.
1,554 citations
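The generative story in the abstract above is easy to simulate: each word of a multi-author document picks one of the authors, then a topic from that author's topic distribution, then a word from that topic's word distribution. The distributions below are invented for illustration, not taken from the paper.

```python
# Sketch of the author-topic model's document-generation process.
# Author-topic and topic-word distributions here are made up.
import random

def generate_doc(authors, author_topic, topic_word, n_words, rng):
    """Generate one document under the author-topic generative story."""
    doc = []
    topics = list(range(len(topic_word)))
    for _ in range(n_words):
        a = rng.choice(authors)                              # pick an author uniformly
        k = rng.choices(topics, weights=author_topic[a])[0]  # topic from that author
        words = list(topic_word[k])
        w = rng.choices(words,                               # word from that topic
                        weights=[topic_word[k][v] for v in words])[0]
        doc.append(w)
    return doc

author_topic = {"ana": [0.9, 0.1], "ben": [0.2, 0.8]}
topic_word = [{"graph": 0.7, "node": 0.3}, {"gene": 0.6, "cell": 0.4}]
print(generate_doc(["ana", "ben"], author_topic, topic_word, 6, random.Random(1)))
```

Inference in the paper runs this story in reverse, using Gibbs sampling to recover the author-topic and topic-word distributions from observed documents.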
22 Aug 2004
TL;DR: The methodology is applied to a large corpus of 160,000 abstracts and 85,000 authors from the well-known CiteSeer digital library, and a model with 300 topics is learned using a Markov chain Monte Carlo algorithm.
Abstract: We propose a new unsupervised learning technique for extracting information from large text collections. We model documents as if they were generated by a two-stage stochastic process. Each author is represented by a probability distribution over topics, and each topic is represented as a probability distribution over words for that topic. The words in a multi-author paper are assumed to be the result of a mixture of each authors' topic mixture. The topic-word and author-topic distributions are learned from data in an unsupervised manner using a Markov chain Monte Carlo algorithm. We apply the methodology to a large corpus of 160,000 abstracts and 85,000 authors from the well-known CiteSeer digital library, and learn a model with 300 topics. We discuss in detail the interpretation of the results discovered by the system including specific topic and author models, ranking of authors by topic and topics by author, significant trends in the computer science literature between 1990 and 2002, parsing of abstracts by topics and authors and detection of unusual papers by specific authors. An online query interface to the model is also discussed that allows interactive exploration of author-topic models for corpora such as CiteSeer.
618 citations
01 Jan 2000
TL;DR: A joint probabilistic model for modeling the contents and inter-connectivity of document collections such as sets of web pages or research paper archives is described, based on a probabilistic factor decomposition.
Abstract: We describe a joint probabilistic model for modeling the contents and inter-connectivity of document collections such as sets of web pages or research paper archives. The model is based on a probabilistic factor decomposition and allows identifying principal topics of the collection as well as authoritative documents within those topics. Furthermore, the relationships between topics is mapped out in order to build a predictive model of link content. Among the many applications of this approach are information retrieval and search, topic identification, query disambiguation, focused web crawling, web authoring, and bibliometric analysis.
519 citations