
Showing papers on "Latent Dirichlet allocation published in 2020"


Journal ArticleDOI
TL;DR: This study evaluates several techniques for document clustering and topic modelling on three datasets from Twitter and Reddit, and shows that clustering techniques applied to neural embedding feature representations delivered the best performance over all data sets using appropriate extrinsic evaluation measures.
Abstract: Methods for document clustering and topic modelling in online social networks (OSNs) offer a means of categorising, annotating and making sense of large volumes of user generated content. Many techniques have been developed over the years, ranging from text mining and clustering methods to latent topic models and neural embedding approaches. However, many of these methods deliver poor results when applied to OSN data as such text is notoriously short and noisy, and often results are not comparable across studies. In this study we evaluate several techniques for document clustering and topic modelling on three datasets from Twitter and Reddit. We benchmark four different feature representations derived from term-frequency inverse-document-frequency (tf-idf) matrices and word embedding models combined with four clustering methods, and we include a Latent Dirichlet Allocation topic model for comparison. Several different evaluation measures are used in the literature, so we provide a discussion and recommendation for the most appropriate extrinsic measures for this task. We also demonstrate the performance of the methods over data sets with different document lengths. Our results show that clustering techniques applied to neural embedding feature representations delivered the best performance over all data sets using appropriate extrinsic evaluation measures. We also demonstrate a method for interpreting the clusters with a top-words based approach using tf-idf weights combined with embedding distance measures.
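As a concrete illustration of this kind of benchmark (not the authors' exact pipeline), the sketch below clusters documents under two feature representations with k-means, adds an LDA baseline, and scores each against ground-truth labels with normalized mutual information, one common extrinsic measure; the toy documents, labels, and the random stand-in embedding vectors are all assumptions.

```python
# Minimal sketch: k-means over tf-idf vs. embedding features, plus an LDA baseline.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

docs = ["cheap flights to dublin", "train strike delays commuters",
        "new phone camera review", "budget airline adds routes"]
labels = [0, 1, 2, 0]  # hypothetical ground-truth categories

tfidf = TfidfVectorizer().fit_transform(docs).toarray()

# Mean word-embedding document vectors; random stand-ins here -- in practice
# these would come from a trained word2vec/GloVe/doc2vec model.
emb = np.random.default_rng(0).normal(size=(len(docs), 100))

for name, X in [("tfidf", tfidf), ("embeddings", emb)]:
    pred = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
    print(name, "NMI:", normalized_mutual_info_score(labels, pred))

# LDA baseline: assign each document to its most probable topic.
counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=3, random_state=0).fit(counts)
pred = lda.transform(counts).argmax(axis=1)
print("lda NMI:", normalized_mutual_info_score(labels, pred))
```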

149 citations


Journal ArticleDOI
14 Jul 2020
TL;DR: Investigating the topic modeling subject and its common application areas, methods, and tools sheds light on some common topic modeling methods in a short-text context and provides direction for researchers who seek to apply these methods.
Abstract: With the growth of online social network platforms and applications, large amounts of textual user-generated content are created daily in the form of comments, reviews, and short-text messages. As a result, users often find it challenging to discover useful information, or to learn more about the topic being discussed, from such content. Machine learning and natural language processing algorithms are used to analyze the massive amount of textual social media data available online, including topic modeling techniques that have gained popularity in recent years. This paper investigates the topic modeling subject and its common application areas, methods, and tools. Also, we examine and compare five frequently used topic modeling methods, as applied to short textual social data, to show their benefits in practice for detecting important topics. These methods are latent semantic analysis, latent Dirichlet allocation, non-negative matrix factorization, random projection, and principal component analysis. Two textual datasets were selected to evaluate the performance of the included topic modeling methods based on topic quality and standard statistical evaluation metrics, such as recall, precision, F-score, and topic coherence. As a result, the latent Dirichlet allocation and non-negative matrix factorization methods delivered more meaningful extracted topics and obtained good results. The paper sheds light on some common topic modeling methods in a short-text context and provides direction for researchers who seek to apply these methods.
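A minimal scikit-learn sketch of the five compared methods follows; the toy corpus and parameter choices are assumptions, and the paper's preprocessing and coherence scoring are omitted.

```python
# Sketch: the five decompositions compared in the paper, on a toy tf-idf matrix.
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import TruncatedSVD, NMF, PCA, LatentDirichletAllocation
from sklearn.random_projection import GaussianRandomProjection

docs = ["vaccine trial results out", "cup final goes to penalties",
        "new vaccine dose schedule", "club confirms new signing"]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)
vocab = vec.get_feature_names_out()

def top_words(components, n=3):
    return [[vocab[i] for i in c.argsort()[::-1][:n]] for c in components]

lsa = TruncatedSVD(n_components=2, random_state=0).fit(X)
nmf = NMF(n_components=2, init="nndsvda", random_state=0).fit(X)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(
    CountVectorizer(vocabulary=vocab).fit_transform(docs))  # LDA expects counts
pca = PCA(n_components=2).fit(X.toarray())                  # PCA needs dense input
rp = GaussianRandomProjection(n_components=2, random_state=0).fit(X)

for name, model in [("LSA", lsa), ("NMF", nmf), ("LDA", lda)]:
    print(name, top_words(model.components_))
```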

134 citations


Posted Content
TL;DR: This model does not require stop-word lists, stemming or lemmatization, and it automatically finds the number of topics, and the resulting topic vectors are jointly embedded with the document and word vectors with distance between them representing semantic similarity.
Abstract: Topic modeling is used for discovering latent semantic structure, usually referred to as topics, in a large collection of documents. The most widely used methods are Latent Dirichlet Allocation and Probabilistic Latent Semantic Analysis. Despite their popularity, they have several weaknesses. In order to achieve optimal results, they often require the number of topics to be known, custom stop-word lists, stemming, and lemmatization. Additionally, these methods rely on bag-of-words representations of documents, which ignore the ordering and semantics of words. Distributed representations of documents and words have gained popularity due to their ability to capture the semantics of words and documents. We present top2vec, which leverages joint document and word semantic embedding to find topic vectors. This model does not require stop-word lists, stemming or lemmatization, and it automatically finds the number of topics. The resulting topic vectors are jointly embedded with the document and word vectors, with the distance between them representing semantic similarity. Our experiments demonstrate that top2vec finds topics which are significantly more informative and representative of the corpus it was trained on than those of probabilistic generative models.
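Assuming the open-source top2vec package released with the paper, a minimal usage sketch might look like the following; load_corpus is a hypothetical helper, and the model needs a reasonably large corpus since topics are found by density-based clustering of the embedded documents.

```python
# Usage sketch, assuming the `top2vec` package (pip install top2vec).
from top2vec import Top2Vec

docs = load_corpus()  # hypothetical helper returning a list of raw strings

model = Top2Vec(documents=docs, speed="learn", workers=4)

print(model.get_num_topics())            # number of topics is discovered, not set
topic_words, word_scores, topic_nums = model.get_topics(5)
print(topic_words[0][:10])               # words nearest the first topic vector
```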

130 citations


Journal ArticleDOI
TL;DR: This paper critically reviews and analyses the major generative models, namely Gaussian Mixture Models (GMM), Hidden Markov Models (HMM), Latent Dirichlet Allocation (LDA), Restricted Boltzmann Machines (RBM), Deep Belief Networks (DBN), Deep Boltzmann Machines (DBM), and GANs, to give the reader some insight into which generative model to pick when dealing with a problem.

117 citations


Journal ArticleDOI
TL;DR: This survey conducts a comprehensive review of various short text topic modeling techniques proposed in the literature, presenting three categories of methods based on Dirichlet multinomial mixture, global word co-occurrences, and self-aggregation, with examples of representative approaches in each category and analysis of their performance on various tasks.
Abstract: Inferring discriminative and coherent latent topics from short texts is a critical and fundamental task, since many real-world applications require semantic understanding of short texts. Traditional long-text topic modeling algorithms (e.g., PLSA and LDA) based on word co-occurrences cannot solve this problem very well, since only very limited word co-occurrence information is available in short texts. Therefore, short text topic modeling has attracted much attention from the machine learning research community in recent years, aiming to overcome the problem of sparseness in short texts. In this survey, we conduct a comprehensive review of various short text topic modeling techniques proposed in the literature. We present three categories of methods based on Dirichlet multinomial mixture, global word co-occurrences, and self-aggregation, with examples of representative approaches in each category and analysis of their performance on various tasks. We develop the first comprehensive open-source Java library, called STTM, which integrates all surveyed algorithms within a unified interface and provides benchmark datasets, to facilitate the development of new methods in this research field. Finally, we evaluate these state-of-the-art methods on many real-world datasets and compare their performance against one another and against long-text topic modeling algorithms.
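To make the first category concrete, here is a toy Gibbs sampler in the style of the Dirichlet multinomial mixture (GSDMM); it is a simplified, unoptimized sketch whose sampling formula assumes each word occurs at most once per document, not the STTM library's implementation.

```python
# Toy GSDMM-style sampler: one topic per *document*, resampled by Gibbs steps.
import numpy as np

def gsdmm(docs, V, K=8, alpha=0.1, beta=0.1, iters=30, seed=0):
    rng = np.random.default_rng(seed)
    z = rng.integers(K, size=len(docs))             # topic of each document
    m = np.bincount(z, minlength=K).astype(float)   # documents per topic
    nzw = np.zeros((K, V))                          # per-topic word counts
    nz = np.zeros(K)                                # per-topic token totals
    for d, doc in enumerate(docs):
        for w in doc:
            nzw[z[d], w] += 1
        nz[z[d]] += len(doc)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            k = z[d]                                # remove doc d from its topic
            m[k] -= 1
            for w in doc:
                nzw[k, w] -= 1
            nz[k] -= len(doc)
            # log p(z_d = k | rest): cluster popularity x word fit
            logp = np.log(m + alpha)
            for w in doc:
                logp += np.log(nzw[:, w] + beta)
            for i in range(len(doc)):
                logp -= np.log(nz + V * beta + i)
            p = np.exp(logp - logp.max())
            k = rng.choice(K, p=p / p.sum())        # resample and restore counts
            z[d] = k
            m[k] += 1
            for w in doc:
                nzw[k, w] += 1
            nz[k] += len(doc)
    return z

docs = [[0, 1, 2], [0, 2], [3, 4, 5], [4, 5]]       # toy word-id documents
print(gsdmm(docs, V=6, K=2))
```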

101 citations


Journal ArticleDOI
TL;DR: The authors investigated the motivation and satisfaction of restaurant customers, tourists from China and the U.S., by examining their online ratings and reviews, and found that Chinese tourists are less inclined to assign lower ratings to restaurants and are more strongly fascinated by the food offered.

99 citations


Journal ArticleDOI
TL;DR: The proposed automated classification model and LDA-based network analysis method provide a useful approach for machine-assisted interpretation of text-based accident narratives, and can provide managers with much-needed information and knowledge to improve safety on-site.

80 citations


Posted ContentDOI
TL;DR: This paper illustrates five different techniques to assess the distinctiveness of topics, key terms and features, speed of information dissemination, and network behaviors for Covid19 tweets, and seeks to understand retweet cascades.
Abstract: This paper illustrates five different techniques to assess the distinctiveness of topics, key terms and features, speed of information dissemination, and network behaviors for Covid19 tweets. First, we use pattern matching; second, we use topic modeling through Latent Dirichlet Allocation (LDA) to generate twenty different topics that discuss case spread, healthcare workers, and personal protective equipment (PPE). One topic specific to U.S. cases would uptick immediately after live White House Coronavirus Task Force briefings, implying that many Twitter users are paying attention to government announcements. We contribute machine learning methods not previously reported in the Covid19 Twitter literature. This includes our third method, Uniform Manifold Approximation and Projection (UMAP), which identifies the unique clustering behavior of distinct topics to improve our understanding of important themes in the corpus and help assess the quality of the generated topics. Fourth, we calculated retweeting times to understand how fast information about Covid19 propagates on Twitter. Our analysis indicates that the median retweeting time of Covid19 for a sample corpus in March 2020 was 2.87 hours, approximately 50 minutes faster than repostings from Chinese social media about H7N9 in March 2013. Lastly, we sought to understand retweet cascades by visualizing the connections of users over time, from fast to slow retweeting. As the time to retweet increases, the density of connections also increases; in our sample, we found distinct users dominating the attention of Covid19 retweeters. One of the simplest highlights of this analysis is that early-stage descriptive methods like regular expressions can successfully identify high-level themes which were consistently verified as important through every subsequent analysis.
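The UMAP step can be sketched as follows with the umap-learn package, projecting LDA document-topic vectors to 2-D and coloring points by dominant topic; load_tweets is a hypothetical helper and all parameter values are assumptions.

```python
# Sketch: visualize LDA topic separation with UMAP (umap-learn package).
import umap
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

tweets = load_tweets()  # hypothetical: list of cleaned tweet strings

X = CountVectorizer(stop_words="english", min_df=5).fit_transform(tweets)
doc_topics = LatentDirichletAllocation(n_components=20,
                                       random_state=0).fit_transform(X)

xy = umap.UMAP(n_components=2, metric="cosine",
               random_state=0).fit_transform(doc_topics)
plt.scatter(xy[:, 0], xy[:, 1], c=doc_topics.argmax(axis=1), s=3, cmap="tab20")
plt.show()  # well-separated islands suggest distinct, coherent topics
```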

79 citations


Journal ArticleDOI
03 Apr 2020
TL;DR: FedLDA, a local differential privacy (LDP) based framework for federated learning of LDA models, contains a novel LDP mechanism called Random Response with Priori (RRP), which provides theoretical guarantees on both data privacy and model accuracy.
Abstract: Latent Dirichlet Allocation (LDA) is a widely adopted topic model for industrial-grade text mining applications. However, its performance heavily relies on the collection of large amounts of text data from users' everyday lives for model training. Such data collection risks severe privacy leakage if the data collector is untrustworthy. To protect text data privacy while allowing accurate model training, we investigate federated learning of LDA models. That is, the model is collaboratively trained between an untrustworthy data collector and multiple users, where the raw text data of each user are stored locally and not uploaded to the data collector. To this end, we propose FedLDA, a local differential privacy (LDP) based framework for federated learning of LDA models. Central to FedLDA is a novel LDP mechanism called Random Response with Priori (RRP), which provides theoretical guarantees on both data privacy and model accuracy. We also design techniques to reduce the communication cost between the data collector and the users during model training. Extensive experiments on three open datasets verify the effectiveness of our solution.
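RRP itself is the paper's contribution and is not reproduced here; the sketch below shows only the classic k-ary randomized response mechanism that this family of LDP methods builds on, with an assumed vocabulary size and privacy budget.

```python
# Classic k-ary randomized response: report the true token with probability
# p = e^eps / (e^eps + k - 1), otherwise a uniformly random *other* token.
# This satisfies eps-local differential privacy for a single token report.
import numpy as np

def k_randomized_response(true_token, vocab_size, eps, rng):
    p = np.exp(eps) / (np.exp(eps) + vocab_size - 1)  # prob of truthful report
    if rng.random() < p:
        return true_token
    other = rng.integers(vocab_size - 1)              # any token but the true one
    return other if other < true_token else other + 1

rng = np.random.default_rng(0)
reports = [k_randomized_response(42, vocab_size=10_000, eps=4.0, rng=rng)
           for _ in range(5)]
print(reports)
```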

74 citations


Journal ArticleDOI
TL;DR: This work applies Latent Dirichlet Allocation (LDA) topic modeling techniques to derive the discussion topics related to three popular deep learning frameworks, namely, Tensorflow, PyTorch and Theano, and makes a comparison of topics between the two platforms.
Abstract: Deep learning has gained tremendous traction from the developer and researcher communities. It plays an increasingly significant role in a number of application domains. Deep learning frameworks are proposed to help developers and researchers easily leverage deep learning technologies, and they attract a great number of discussions on popular platforms, namely Stack Overflow and GitHub. To understand and compare the insights from these two platforms, we mine the topics of interest from both. Specifically, we apply Latent Dirichlet Allocation (LDA) topic modeling techniques to derive the discussion topics related to three popular deep learning frameworks, namely Tensorflow, PyTorch and Theano. Within each platform, we compare the topics across the three deep learning frameworks. Moreover, we compare topics between the two platforms. Our observations include: 1) a wide range of topics is discussed about the three deep learning frameworks on both platforms, and the most popular workflow stages are Model Training and Preliminary Preparation; 2) the topic distributions at the workflow level and topic category level are similar for Tensorflow and PyTorch, while the topic distribution pattern on Theano is quite different, and the topic trends at the workflow level and topic category level also differ across the three frameworks; 3) the topics at the workflow level show different trends across the two platforms, e.g., the Preliminary Preparation stage topic becomes relatively stable on Stack Overflow after 2016, while on GitHub it shows a stronger upward trend after 2016. Across both platforms, the Model Training stage topic achieves the highest impact scores. Based on these findings, we also discuss implications for practitioners and researchers.

66 citations


Journal ArticleDOI
TL;DR: A novel and robust framework is presented that combines deep learning and text mining technologies to analyse hazard records automatically, enabling managers to understand their patterns of manifestation and put in place strategies to prevent them from reoccurring.

Proceedings ArticleDOI
01 Jul 2020
TL;DR: The proposed Bidirectional Adversarial Topic (BAT) model is the first attempt at applying bidirectional adversarial training to neural topic modeling; experiments show that BAT and Gaussian-BAT obtain more coherent topics, outperforming several competitive baselines.
Abstract: Recent years have witnessed a surge of interest in using neural topic models for automatic topic extraction from text, since they avoid the complicated mathematical derivations for model inference required by traditional topic models such as Latent Dirichlet Allocation (LDA). However, these models typically either assume an improper prior (e.g., Gaussian or logistic normal) over the latent topic space or cannot infer the topic distribution for a given document. To address these limitations, we propose a neural topic modeling approach, called the Bidirectional Adversarial Topic (BAT) model, which represents the first attempt to apply bidirectional adversarial training to neural topic modeling. The proposed BAT builds a two-way projection between the document-topic distribution and the document-word distribution. It uses a generator to capture the semantic patterns from texts and an encoder for topic inference. Furthermore, to incorporate word relatedness information, the Bidirectional Adversarial Topic model with Gaussian (Gaussian-BAT) is extended from BAT. To verify the effectiveness of BAT and Gaussian-BAT, three benchmark corpora are used in our experiments. The experimental results show that BAT and Gaussian-BAT obtain more coherent topics, outperforming several competitive baselines. Moreover, when performing text clustering based on the extracted topics, our models outperform all the baselines, with more significant improvements achieved by Gaussian-BAT, where an increase of nearly 6% in accuracy is observed.
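A heavily simplified PyTorch sketch of the bidirectional adversarial idea follows (not the authors' implementation): a generator maps Dirichlet-drawn topic mixtures to word distributions, an encoder maps real documents to topic mixtures, and a discriminator tries to tell the two pairings apart; all sizes and hyperparameters are assumptions.

```python
# Simplified bidirectional-adversarial sketch (ALI/BiGAN style) for topics.
import torch
import torch.nn as nn

K, V = 20, 2000  # assumed number of topics and vocabulary size

gen  = nn.Sequential(nn.Linear(K, 256), nn.ReLU(), nn.Linear(256, V), nn.Softmax(dim=-1))
enc  = nn.Sequential(nn.Linear(V, 256), nn.ReLU(), nn.Linear(256, K), nn.Softmax(dim=-1))
disc = nn.Sequential(nn.Linear(K + V, 256), nn.ReLU(), nn.Linear(256, 1))

opt_d  = torch.optim.Adam(disc.parameters(), lr=1e-4)
opt_ge = torch.optim.Adam(list(gen.parameters()) + list(enc.parameters()), lr=1e-4)
bce = nn.BCEWithLogitsLoss()
prior = torch.distributions.Dirichlet(torch.full((K,), 0.1))  # topic prior

def step(real_docs):  # real_docs: (B, V) normalized term-frequency vectors
    B = real_docs.size(0)
    theta_fake = prior.sample((B,))
    fake_pair = torch.cat([theta_fake, gen(theta_fake)], dim=-1)  # (theta, generated doc)
    real_pair = torch.cat([enc(real_docs), real_docs], dim=-1)    # (inferred theta, real doc)
    # Discriminator: real pairs -> 1, generated pairs -> 0.
    d_loss = bce(disc(real_pair.detach()), torch.ones(B, 1)) + \
             bce(disc(fake_pair.detach()), torch.zeros(B, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # Generator + encoder: fool the discriminator in both directions.
    g_loss = bce(disc(fake_pair), torch.ones(B, 1)) + \
             bce(disc(real_pair), torch.zeros(B, 1))
    opt_ge.zero_grad(); g_loss.backward(); opt_ge.step()
```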

Journal ArticleDOI
TL;DR: A generic methodology based on topic modeling and text network modeling, that allows researchers to gather valuable information from surveys that use open-ended questions, is evaluated through the use of a case study in which the responses to a teacher self-assessment survey in an Ecuadorian university have been studied.
Abstract: The large amount of text generated daily on the web through comments on social networks, blog posts and open-ended question surveys, among others, shows how frequently text data are produced, and their processing therefore becomes a challenge for researchers. Topic modeling is one of the emerging techniques in text mining; it is based on the discovery of latent data and the search for relationships among text documents. In this paper, the objective of the research is to evaluate a generic methodology, based on topic modeling and text network modeling, that allows researchers to gather valuable information from surveys that use open-ended questions. To achieve this, the methodology has been evaluated through a case study in which the responses to a teacher self-assessment survey at an Ecuadorian university were analyzed. The main contribution of the article is the inclusion of clustering algorithms to complement the results obtained when executing topic modeling. The proposed methodology is based on four phases: (a) construction of a text database, (b) text mining and topic modeling, (c) topic network modeling and (d) assessing the relevance of the identified topics. In previous works, it has been observed that human interpretation plays an important role in the process, especially in phases (a) and (d). For this reason, visualization interfaces, such as graphs and dendrograms, are of critical importance for allowing researchers to efficiently analyze the results of the topic modeling. As a result of this case study, a compendium of the main strategies that teachers carry out in their classes with the aim of improving student retention is presented. In addition, the proposed methodology can be extended to the analysis of the unstructured textual information found in blogs, social networks, forums, etc.
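Phase (c) can be sketched as follows with networkx, assuming a fitted scikit-learn LDA model named lda; topics become nodes, and edges link topic pairs whose word distributions are sufficiently similar (the threshold is arbitrary).

```python
# Sketch: build a topic network from pairwise topic-word similarity.
import itertools
import networkx as nx
from scipy.spatial.distance import cosine

# Assumed: `lda` is a fitted sklearn LatentDirichletAllocation model.
topic_word = lda.components_ / lda.components_.sum(axis=1, keepdims=True)

G = nx.Graph()
G.add_nodes_from(range(topic_word.shape[0]))
for i, j in itertools.combinations(range(topic_word.shape[0]), 2):
    sim = 1 - cosine(topic_word[i], topic_word[j])
    if sim > 0.2:                  # arbitrary similarity threshold for illustration
        G.add_edge(i, j, weight=sim)

print(nx.number_connected_components(G), "groups of related topics")
```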

Journal ArticleDOI
TL;DR: This research presents a novel aggregation method for constructing an aggregated topic model composed of topics with greater coherence than those of the individual models; the aggregated model outperforms those topic models at a statistically significant level in terms of topic coherence over an external corpus.
Abstract: This research presents a novel aggregation method for constructing an aggregated topic model that is composed of topics with greater coherence than those of individual models. When generating a topic model, a number of parameters have to be specified, and the resulting topics can be very general or very specific depending on the chosen parameters. In this study we investigate the process of aggregating multiple topic models generated using different parameters, with a focus on whether combining the general and specific topics is able to increase topic coherence. We employ cosine similarity and Jensen-Shannon divergence to compute the similarity among topics, and combine them into an aggregated model when their similarity scores exceed a predefined threshold. The model is evaluated against standard topic models generated by latent Dirichlet allocation and Non-negative Matrix Factorisation. Specifically, we use topic coherence to compare the individual models used to create the aggregated model against the aggregated model itself, and against models generated by Non-negative Matrix Factorisation. The results demonstrate that the aggregated model outperforms those topic models at a statistically significant level in terms of topic coherence over an external corpus. We also apply the aggregated topic model to social media data to validate the method in a realistic scenario, and find that it again outperforms the individual topic models.
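A small sketch of the merging rule described above, using scipy's Jensen-Shannon distance (squared to obtain the divergence); the threshold and the naive averaging of merged topics are assumptions, not the paper's exact procedure.

```python
# Sketch: merge topics across models when their JS divergence is small enough.
import numpy as np
from scipy.spatial.distance import jensenshannon

def aggregate(topics_a, topics_b, max_jsd=0.35):
    """topics_*: arrays of shape (K, V); rows are topic-word distributions."""
    merged = list(topics_a)
    for t in topics_b:
        dists = [jensenshannon(t, m) ** 2 for m in merged]  # squared distance = divergence
        if min(dists) <= max_jsd:
            k = int(np.argmin(dists))
            merged[k] = (merged[k] + t) / 2       # naive average of the two topics
            merged[k] /= merged[k].sum()          # renormalize to a distribution
        else:
            merged.append(t)                      # sufficiently novel: keep as new topic
    return np.array(merged)
```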

Journal ArticleDOI
TL;DR: Experimental results show that the proposed Poisson Dirichlet Model (PDM) could effectively identify distinguished disease clusters based on the latent patterns hidden in the EHR data by alleviating the impact of age and sex, and that LDA could stratify patients into more differentiable subgroups than PDM in terms of p-values.

Journal ArticleDOI
TL;DR: This work applies an inductive approach by utilizing large unstructured text data of 104,161 online reviews of Korean accommodation customers to frame which topics of interest guests find important, and finds a higher importance for points of competition and points of uniqueness among the accommodation characteristics.
Abstract: There is a lot of attention given to the determinants of guest satisfaction and consumer behavior in the tourism literature. While much extant literature uses a deductive approach for identifying guest satisfaction dimensions, we apply an inductive approach by utilizing large unstructured text data of 104,161 online reviews of Korean accommodation customers to frame which topics of interest guests find important. Using latent Dirichlet allocation, a generative, Bayesian, hierarchical statistical model, we extract and validate topics of interest in the dataset. The results corroborate extant literature in that dimensions, such as location and service quality, are important. However, we extend existing dimensions of importance by more precisely distinguishing aspects of location and service quality. Furthermore, by comparing the characteristics of the accommodations in terms of metropolitan versus rural and the type of accommodation, we reveal differences in topics of importance between different characteristics of the accommodations. Specifically, we find a higher importance for points of competition and points of uniqueness among the accommodation characteristics. This has implications for how managers can improve customer satisfaction and how researchers can more precisely measure customer satisfaction in the hospitality industry.

Journal ArticleDOI
TL;DR: This study inductively analyzes the topics of interest that drive customer experience and satisfaction within the sharing economy of the accommodation sector using a dataset of 1,086,800 Airbnb reviews across New York City.
Abstract: This study inductively analyzes the topics of interest that drive customer experience and satisfaction within the sharing economy of the accommodation sector. Using a dataset of 1,086,800 Airbnb reviews across New York City, the text is preprocessed and latent Dirichlet allocation is utilized in order to extract 43 topics of interest from the user-generated content. The topics fall into one of several categories, including the general evaluation of guests, centralized or decentralized location attributes of the accommodation, tangible and intangible characteristics of the listed units, management of the listing or unit, and service quality of the host. The deeper complex relationships between topics are explored in detail using hierarchical Ward Clustering.
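The final clustering step can be sketched with scipy, assuming a fitted scikit-learn LDA model named lda whose 43 topic-word rows are grouped by Ward linkage; the cut into six groups is an arbitrary choice for illustration.

```python
# Sketch: hierarchical Ward clustering over LDA topic-word vectors.
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

topic_word = lda.components_                       # assumed fitted LDA, shape (43, V)
Z = linkage(topic_word, method="ward")             # agglomerative merge tree
groups = fcluster(Z, t=6, criterion="maxclust")    # cut the tree into 6 topic groups
dendrogram(Z)                                      # visualize how topics merge
plt.show()
```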

Journal ArticleDOI
TL;DR: The main aim of this article is to present the results of different experiments focused on the model fitting process in topic modeling and its accuracy when applied to long texts, and to present a clear-cut power-law relation between the optimal number of topics and the analyzed sample size.
Abstract: The main aim of this article is to present the results of different experiments focused on the model fitting process in topic modeling and its accuracy when applied to long texts. The digital era has made available both enormous quantities of textual data and technological advances that have facilitated the development of techniques to automate the data coding and analysis processes. In the ambit of topic modeling, different procedures were born in order to analyze larger and larger collections of texts, namely corpora, but this has posed, and continues to pose, a series of methodological questions that urgently need to be resolved. Therefore, through a series of different experiments, this article is based on the following consideration: taking into account Latent Dirichlet Allocation (LDA), a generative probabilistic model (Blei et al. in J Mach Learn Res 3:993–1022, 2003; Blei and Lafferty in: Srivastava, Sahami (eds) Text mining: classification, clustering, and applications, Chapman & Hall/CRC Press, Cambridge, 2009; Griffiths and Steyvers in Proc Natl Acad Sci USA (PNAS), 101(Supplement 1):5228–5235, 2004), the model fitting problem is crucial because the LDA algorithm demands that the number of topics be specified a priori. Needless to say, the number of topics to detect in a corpus is a parameter which affects the analysis results. Since there is a lack of experiments applied to long texts, our article tries to shed new light on the complex relationship between texts' length and the optimal number of topics. In the conclusions, we present a clear-cut power-law relation between the optimal number of topics and the analyzed sample size, and we formulate it in the form of a mathematical model.
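A power-law relation of this kind can be fit by least squares in log-log space; the sketch below uses hypothetical sample sizes and optimal topic counts, not the article's data.

```python
# Sketch: fit K* = a * N^b between corpus size N and optimal topic count K*.
import numpy as np

N = np.array([1_000, 5_000, 20_000, 100_000])   # hypothetical sample sizes
K = np.array([12, 21, 35, 62])                  # hypothetical optimal topic counts

b, log_a = np.polyfit(np.log(N), np.log(K), deg=1)  # linear fit in log-log space
a = np.exp(log_a)
print(f"K* ~= {a:.2f} * N^{b:.2f}")
```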

Journal ArticleDOI
TL;DR: This study proposes a topic model based on support vector machine (SVM) prediction for automatic patent classification, which can classify patents without the need for any expert judgment during the process.
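The summary leaves the details open; one plausible reading is LDA topic proportions feeding an SVM classifier, sketched here as a scikit-learn pipeline with assumed parameters and hypothetical training data.

```python
# Sketch: topic-feature SVM for patent classification (assumed reading of the TL;DR).
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.svm import SVC

clf = Pipeline([
    ("counts", CountVectorizer(stop_words="english")),
    ("lda", LatentDirichletAllocation(n_components=50, random_state=0)),
    ("svm", SVC(kernel="rbf", C=1.0)),
])
# clf.fit(patent_texts, class_labels)   # hypothetical training data
# pred = clf.predict(new_patents)
```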

Journal ArticleDOI
TL;DR: Results indicate that library science has become less prevalent over time, as there are no top topic clusters relevant to library issues since the period 2000–2005 and bibliometrics, especially citation analysis, is highly stable across periods.
Abstract: This study investigated the evolution of library and information science (LIS) by analyzing research topics in LIS journal articles. The analysis is divided into five periods covering the years 1996–2019. Latent Dirichlet allocation modeling was used to identify underlying topics based on 14,035 documents. An improved data-selection method was devised in order to generate a dynamic journal list that included influential journals for each period. Results indicate that (a) library science has become less prevalent over time, as there are no top topic clusters relevant to library issues since the period 2000–2005; (b) bibliometrics, especially citation analysis, is highly stable across periods, as reflected by the stable subclusters and consistent keywords; and (c) information retrieval has consistently been the dominant domain with interests gradually shifting to model-based text processing. Information seeking and behavior is also a stable field that tends to be dispersed among various topics rather than presented as its own subject. Information systems and organizational activities have been continuously discussed and have developed a closer relationship with e-commerce. Topics that occurred only once have undergone a change of technological context from the networks and Internet to social media and mobile applications.

Journal ArticleDOI
TL;DR: This study performs a latent Dirichlet allocation technique to extract topics and keywords from articles and shows that, with KP features, the prediction models are more effective than those with journal and/or author features, especially in the management information system discipline.

Journal ArticleDOI
TL;DR: Local users’ sentiments extracted from Geo-tweets data from January to December 2016, analyzed in the spatial and temporal perspective are explored, finding patterns which demonstrate the associations between the nature of Twitter content and the characteristics of places and users.
Abstract: Sentiment affects every aspect of people's lives and has strong impact on their mental health. This paper explores local users' sentiments extracted from Geo-tweets data from January to December 2016, analyzed in the spatial and temporal perspective. Because of large amount of noisy data and complicated procedure of extracting local user, a workflow is created, facilitating more researchers to reproduce, replicate or extend the procedures using similar Geo-tweet dataset. The workflow is sharing at Harvard Dataverse (https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/6N9VUF). Using the processed data, each tweet's sentiment is classified according to the content. Then, the overall temporal variations of total number of positive, neural, and negative sentiments are analyzed on a monthly, daily and hourly level. From a spatial perspective, the Local Indicators of Spatial Association (LISA) statistical method is employed to discover the spatial clusters. In order to explore the content of positive sentiments, this paper applies the Latent Dirichlet Allocation (LDA) model to classify the Geo-tweets with positive sentiments into different topics. Combining the geospatial information with the topics, some patterns are found which demonstrate the associations between the nature of Twitter content and the characteristics of places and users. For example, weekend events and friend and family gatherings are the time that users prefer to post positive tweets. In the western part of US, users tend to post more photos to share the great moment on Twitter than other parts of the US.

Journal ArticleDOI
Yajun Du, YongTao Yi, Xianyong Li, Xiaoliang Chen, Yongquan Fan, Fanghong Su
TL;DR: The experimental results show that MF-LDA has lower perplexity and a higher coverage rate than LDA under the same conditions, and demonstrate the effectiveness and practical significance of the HTLCM model and HTT algorithm in extracting and tracking hot topics.
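MF-LDA itself is not shown here; the sketch below only illustrates the perplexity-comparison protocol with a standard gensim LDA on toy tokenized documents.

```python
# Sketch: held-out perplexity of a gensim LDA model (lower is better).
from gensim.corpora import Dictionary
from gensim.models import LdaModel

texts = [["hot", "topic", "microblog"], ["topic", "tracking", "stream"]]  # toy tokens
dct = Dictionary(texts)
corpus = [dct.doc2bow(t) for t in texts]

lda = LdaModel(corpus, id2word=dct, num_topics=2, random_state=0)
bound = lda.log_perplexity(corpus)   # per-word likelihood bound
print("perplexity:", 2 ** (-bound))  # gensim's own convention: 2^(-bound)
```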

Journal ArticleDOI
02 Mar 2020
TL;DR: In this paper, the authors present a framework to map multi-modal data collected in the wild to meaningful feature representations of health-related behaviors, uncover latent patterns comprising combinations of behaviors that best predict health and well-being, and use these learned patterns to make evidence-based recommendations that may improve health.
Abstract: Multiple behaviors typically work together to influence health, making it hard to understand how one behavior might compensate for another. Rich multi-modal datasets from mobile sensors and advances in machine learning are today enabling new kinds of associations to be made between combinations of behaviors objectively assessed from daily life and self-reported levels of stress, mood, and health. In this article, we present a framework to (1) map multi-modal messy data collected in the “wild” to meaningful feature representations of health-related behaviors, (2) uncover latent patterns comprising combinations of behaviors that best predict health and well-being, and (3) use these learned patterns to make evidence-based recommendations that may improve health and well-being. We show how to use supervised latent Dirichlet allocation to model the observed behaviors, and we apply variational inference to uncover the latent patterns. Implementing and evaluating the model on 5,397 days of data from a group of 244 college students, we find that these latent patterns are indeed predictive of daily self-reported levels of stressed-calm, sad-happy, and sick-healthy states. We investigate the patterns of modifiable behaviors present on different days and uncover several ways in which they relate to stress, mood, and health. This work contributes a new method using objective data analysis to help advance understanding of how combinations of modifiable human behaviors may promote human health and well-being.
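The paper's model is supervised LDA fit by variational inference; as a rough, clearly labeled stand-in, the sketch below learns unsupervised topic mixtures over discretized daily behavior "tokens" and then regresses hypothetical self-reports on them (a two-stage proxy, not true supervised LDA).

```python
# Two-stage proxy for the paper's supervised LDA: LDA features + regression.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import LinearRegression

# Hypothetical: each day is a "document" of discretized behavior tokens.
days = ["sleep_late exercise_none screen_high",
        "sleep_early exercise_run screen_low",
        "sleep_late exercise_walk screen_high"]
stress = [7.0, 3.0, 6.0]   # hypothetical self-reported stress scores

X = CountVectorizer().fit_transform(days)
theta = LatentDirichletAllocation(n_components=2, random_state=0).fit_transform(X)
reg = LinearRegression().fit(theta, stress)  # which behavior patterns track stress
print(reg.coef_)
```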

Journal ArticleDOI
TL;DR: The results show that the proposed approach is well suited to analyzing scientific evolution through monolingual and multilingual topic similarity relations.

Journal ArticleDOI
TL;DR: This work applies latent Dirichlet allocation topic modeling to a vast number of passenger-authored online reviews for airline services to compare service quality between full service carriers (FSCs) and low cost carriers (LCCs).
Abstract: We apply latent Dirichlet allocation topic modeling to a vast number of passenger-authored online reviews for airline services to compare service quality between full service carriers (FSCs) and low cost carriers (LCCs).

Journal ArticleDOI
TL;DR: In this paper, an unsupervised machine learning technique based on the probabilistic generative model of Latent Dirichlet Allocation is proposed to learn the underlying structure of collider events directly from the data.
Abstract: We describe a technique to learn the underlying structure of collider events directly from the data, without having a particular theoretical model in mind. It allows one to infer aspects of the theoretical model that may have given rise to this structure, and can be used to cluster or classify the events for analysis purposes. The unsupervised machine-learning technique is based on the probabilistic (Bayesian) generative model of Latent Dirichlet Allocation. We pair the model with an approximate inference algorithm called Variational Inference, which we then use to extract the latent probability distributions describing the learned underlying structure of collider events. We provide a detailed systematic study of the technique using two example scenarios to learn the latent structure of di-jet event samples made up of QCD background events and either tt̄ or hypothetical W′ → (ϕ → WW)W signal events.

Journal ArticleDOI
TL;DR: A probabilistic topic model is proposed, adapted from Latent Dirichlet Allocation (LDA), to discover representative and interpretable activity categorization from individual-level spatiotemporal data in an unsupervised manner and can successfully distinguish the three most basic types of activities.
Abstract: Although automatically collected human travel records can accurately capture the time and location of human movements, they do not directly explain the hidden semantic structures behind the data, e.g., activity types. This work proposes a probabilistic topic model, adapted from Latent Dirichlet Allocation (LDA), to discover representative and interpretable activity categorization from individual-level spatiotemporal data in an unsupervised manner. Specifically, the activity-travel episodes of an individual user are treated as words in a document, and each topic is a distribution over space and time that corresponds to a certain type of activity. The model accounts for a mixture of discrete and continuous attributes: the location, start time of day, start day of week, and duration of each activity episode. The proposed methodology is demonstrated using pseudonymized transit smart card data from London, U.K. The results show that the model can successfully distinguish the three most basic types of activities: home, work, and other. As the specified number of activity categories increases, more specific subpatterns for home and work emerge, and both the goodness of fit and the predictive performance for travel behavior improve. This work makes it possible to enrich human mobility data with representative and interpretable activity patterns without relying on predefined activity categories or heuristic rules.
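The key encoding move, treating activity-travel episodes as words, can be sketched as below; the binning scheme and token format are assumptions (the paper's model additionally handles the continuous attributes within the model itself rather than by binning).

```python
# Sketch: turn each activity episode into a discrete "word" for LDA-style models.
def episode_token(location_id, start_hour, weekday, duration_h):
    tod = "am" if start_hour < 12 else "pm"                    # coarse time-of-day bin
    day = "wkday" if weekday < 5 else "wkend"                  # day-of-week bin
    dur = "short" if duration_h < 2 else ("mid" if duration_h < 6 else "long")
    return f"loc{location_id}_{tod}_{day}_{dur}"

# One user's day as a "document" of episode words.
doc = [episode_token(12, 9, 1, 8.5),    # long weekday daytime stay -> likely work
       episode_token(3, 19, 1, 11.0)]   # long evening stay -> likely home
print(doc)
```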

Posted ContentDOI
04 Aug 2020-medRxiv
TL;DR: This work proposes an intelligent clustering-based classification and topic-extraction model (named TClustVID) that analyzes COVID-19-related public tweets to extract significant sentiments with high accuracy, and shows that it outperforms traditional classifiers as determined by clustering criteria.
Abstract: COVID-19, caused by SARS-CoV-2, varies greatly in its severity but can present serious respiratory symptoms with vascular and other complications, particularly in older adults. The disease can be spread by both symptomatic and asymptomatic infected individuals; uncertainty remains over key aspects of its infectivity; no effective remedy yet exists; and the disease causes severe economic effects globally. For these reasons, COVID-19 is the subject of intense and widespread discussion on social media platforms including Facebook and Twitter. These public forums substantially impact public opinion and in some cases exacerbate the widespread panic and misinformation spread during the crisis. Thus, this work aimed to design an intelligent clustering-based classification and topic-extraction model (named TClustVID) that analyzes COVID-19-related public tweets to extract significant sentiments with high accuracy. We gathered COVID-19 Twitter datasets from the IEEE Dataport repository and employed a range of data preprocessing methods to clean the raw data, then applied tokenization and produced a word-to-index dictionary. Thereafter, different classifiers were applied to the Twitter datasets, enabling exploration of the performance of traditional classification methods and TClustVID. TClustVID showed higher performance compared to the traditional classifiers as determined by clustering criteria. Finally, we extracted significant topic clusters from TClustVID, split them into positive, neutral and negative clusters, and implemented latent Dirichlet allocation to extract popular COVID-19 topics. This approach identified common prevailing public opinions and concerns related to COVID-19, as well as attitudes to infection prevention strategies held by people from different countries concerning the current pandemic situation.
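The preprocessing the abstract describes, cleaning, tokenization, and a word-to-index dictionary, can be sketched as follows; the cleaning rules are assumptions.

```python
# Sketch: clean tweets, tokenize, and build a word-to-index dictionary.
import re
from collections import Counter

def clean(tweet):
    # Strip URLs, @mentions, and '#' characters, then keep alphabetic tokens.
    tweet = re.sub(r"https?://\S+|@\w+|#", "", tweet.lower())
    return re.findall(r"[a-z']+", tweet)

tweets = ["Stay home! #covid https://t.co/x", "@who masks help against covid"]
tokens = [clean(t) for t in tweets]

counts = Counter(w for doc in tokens for w in doc)
word2idx = {w: i for i, (w, _) in enumerate(counts.most_common())}
encoded = [[word2idx[w] for w in doc] for doc in tokens]
print(encoded)
```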

Journal ArticleDOI
TL;DR: This approach outperforms state-of-the-art phishing detection research on an accredited data set, in applications based only on the body of the e-mails, without using other e-mail features such as the header, IP information or the number of links in the text.
Abstract: Phishing is a type of fraud attempt in which the attacker, usually by e-mail, pretends to be a trusted person or entity in order to obtain sensitive information from a target. Most recent phishing detection research has focused on obtaining highly distinctive features from the metadata and text of these e-mails. The obtained attributes are then used to feed classification algorithms in order to determine whether they are phishing or legitimate messages. In this paper, an approach based on machine learning is proposed to detect phishing e-mail attacks. The methods that compose this approach rest on a feature engineering process based on natural language processing, lemmatization, topic modeling, improved learning techniques for resampling and cross-validation, and hyperparameter configuration. The first proposed method uses all the features obtained from the Document-Term Matrix (DTM) in the classification algorithms. The second one uses Latent Dirichlet Allocation (LDA) as an operation to deal with the problems of the “curse of dimensionality”, sparsity, and the portion of text context included in the obtained representation. The proposed approach achieved an F1-measure of 99.95% using the XGBoost algorithm. It outperforms state-of-the-art phishing detection research on an accredited data set, in applications based only on the body of the e-mails, without using other e-mail features such as the header, IP information or the number of links in the text.
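The second method can be sketched as a CountVectorizer document-term matrix reduced by LDA and fed to XGBoost; load_email_corpus is a hypothetical helper, and all parameters are assumptions (the paper's lemmatization, resampling, and tuning steps are omitted).

```python
# Sketch: DTM -> LDA topic features -> XGBoost phishing classifier.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from xgboost import XGBClassifier

emails, labels = load_email_corpus()  # hypothetical: bodies + 0/1 phishing labels

dtm = CountVectorizer(stop_words="english", min_df=2).fit_transform(emails)
topics = LatentDirichletAllocation(n_components=40,
                                   random_state=0).fit_transform(dtm)

clf = XGBClassifier(n_estimators=300, max_depth=6, eval_metric="logloss")
clf.fit(topics, labels)  # evaluation (e.g., F1 under cross-validation) omitted
```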