
Showing papers on "Latent Dirichlet allocation published in 2020"


Journal ArticleDOI
TL;DR: This study evaluates several techniques for document clustering and topic modelling on three datasets from Twitter and Reddit, and shows that clustering techniques applied to neural embedding feature representations delivered the best performance over all data sets using appropriate extrinsic evaluation measures.
Abstract: Methods for document clustering and topic modelling in online social networks (OSNs) offer a means of categorising, annotating and making sense of large volumes of user generated content. Many techniques have been developed over the years, ranging from text mining and clustering methods to latent topic models and neural embedding approaches. However, many of these methods deliver poor results when applied to OSN data as such text is notoriously short and noisy, and often results are not comparable across studies. In this study we evaluate several techniques for document clustering and topic modelling on three datasets from Twitter and Reddit. We benchmark four different feature representations derived from term-frequency inverse-document-frequency (tf-idf) matrices and word embedding models combined with four clustering methods, and we include a Latent Dirichlet Allocation topic model for comparison. Several different evaluation measures are used in the literature, so we provide a discussion and recommendation for the most appropriate extrinsic measures for this task. We also demonstrate the performance of the methods over data sets with different document lengths. Our results show that clustering techniques applied to neural embedding feature representations delivered the best performance over all data sets using appropriate extrinsic evaluation measures. We also demonstrate a method for interpreting the clusters with a top-words based approach using tf-idf weights combined with embedding distance measures.
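As a concrete illustration of this kind of benchmark (not the authors' exact pipeline), the sketch below clusters documents under two feature representations with k-means, adds an LDA baseline, and scores each against ground-truth labels with normalized mutual information, one common extrinsic measure; the toy documents, labels, and the random stand-in embedding vectors are all assumptions.

```python
# Minimal sketch: k-means over tf-idf vs. embedding features, plus an LDA baseline.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

docs = ["cheap flights to dublin", "train strike delays commuters",
        "new phone camera review", "budget airline adds routes"]
labels = [0, 1, 2, 0]  # hypothetical ground-truth categories

tfidf = TfidfVectorizer().fit_transform(docs).toarray()

# Mean word-embedding document vectors; random stand-ins here -- in practice
# these would come from a trained word2vec/GloVe/doc2vec model.
emb = np.random.default_rng(0).normal(size=(len(docs), 100))

for name, X in [("tfidf", tfidf), ("embeddings", emb)]:
    pred = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
    print(name, "NMI:", normalized_mutual_info_score(labels, pred))

# LDA baseline: assign each document to its most probable topic.
counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=3, random_state=0).fit(counts)
pred = lda.transform(counts).argmax(axis=1)
print("lda NMI:", normalized_mutual_info_score(labels, pred))
```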

149 citations


Journal ArticleDOI
14 Jul 2020
TL;DR: Investigating the topic modeling subject and its common application areas, methods, and tools sheds light on some common topic modeling methods in a short-text context and provides direction for researchers who seek to apply these methods.
Abstract: With the growth of online social network platforms and applications, large amounts of textual user-generated content are created daily in the form of comments, reviews, and short-text messages. As a result, users often find it challenging to discover useful information, or to learn more about the topic being discussed, from such content. Machine learning and natural language processing algorithms are used to analyze the massive amount of textual social media data available online, including topic modeling techniques that have gained popularity in recent years. This paper investigates the topic modeling subject and its common application areas, methods, and tools. Also, we examine and compare five frequently used topic modeling methods, as applied to short textual social data, to show their benefits in practice for detecting important topics. These methods are latent semantic analysis, latent Dirichlet allocation, non-negative matrix factorization, random projection, and principal component analysis. Two textual datasets were selected to evaluate the performance of the included topic modeling methods based on topic quality and standard statistical evaluation metrics, such as recall, precision, F-score, and topic coherence. As a result, the latent Dirichlet allocation and non-negative matrix factorization methods delivered more meaningful extracted topics and obtained good results. The paper sheds light on some common topic modeling methods in a short-text context and provides direction for researchers who seek to apply these methods.
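A minimal scikit-learn sketch of the five compared methods follows; the toy corpus and parameter choices are assumptions, and the paper's preprocessing and coherence scoring are omitted.

```python
# Sketch: the five decompositions compared in the paper, on a toy tf-idf matrix.
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import TruncatedSVD, NMF, PCA, LatentDirichletAllocation
from sklearn.random_projection import GaussianRandomProjection

docs = ["vaccine trial results out", "cup final goes to penalties",
        "new vaccine dose schedule", "club confirms new signing"]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)
vocab = vec.get_feature_names_out()

def top_words(components, n=3):
    return [[vocab[i] for i in c.argsort()[::-1][:n]] for c in components]

lsa = TruncatedSVD(n_components=2, random_state=0).fit(X)
nmf = NMF(n_components=2, init="nndsvda", random_state=0).fit(X)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(
    CountVectorizer(vocabulary=vocab).fit_transform(docs))  # LDA expects counts
pca = PCA(n_components=2).fit(X.toarray())                  # PCA needs dense input
rp = GaussianRandomProjection(n_components=2, random_state=0).fit(X)

for name, model in [("LSA", lsa), ("NMF", nmf), ("LDA", lda)]:
    print(name, top_words(model.components_))
```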

134 citations


Posted Content
TL;DR: This model does not require stop-word lists, stemming or lemmatization, and it automatically finds the number of topics, and the resulting topic vectors are jointly embedded with the document and word vectors with distance between them representing semantic similarity.
Abstract: Topic modeling is used for discovering latent semantic structure, usually referred to as topics, in a large collection of documents. The most widely used methods are Latent Dirichlet Allocation and Probabilistic Latent Semantic Analysis. Despite their popularity, they have several weaknesses. In order to achieve optimal results, they often require the number of topics to be known, custom stop-word lists, stemming, and lemmatization. Additionally, these methods rely on bag-of-words representations of documents, which ignore the ordering and semantics of words. Distributed representations of documents and words have gained popularity due to their ability to capture the semantics of words and documents. We present top2vec, which leverages joint document and word semantic embedding to find topic vectors. This model does not require stop-word lists, stemming or lemmatization, and it automatically finds the number of topics. The resulting topic vectors are jointly embedded with the document and word vectors, with the distance between them representing semantic similarity. Our experiments demonstrate that top2vec finds topics which are significantly more informative and representative of the corpus it was trained on than those of probabilistic generative models.
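Assuming the open-source top2vec package released with the paper, a minimal usage sketch might look like the following; load_corpus is a hypothetical helper, and the model needs a reasonably large corpus since topics are found by density-based clustering of the embedded documents.

```python
# Usage sketch, assuming the `top2vec` package (pip install top2vec).
from top2vec import Top2Vec

docs = load_corpus()  # hypothetical helper returning a list of raw strings

model = Top2Vec(documents=docs, speed="learn", workers=4)

print(model.get_num_topics())            # number of topics is discovered, not set
topic_words, word_scores, topic_nums = model.get_topics(5)
print(topic_words[0][:10])               # words nearest the first topic vector
```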

130 citations


Journal ArticleDOI
TL;DR: This paper critically reviews and analyses the major generative models, namely Gaussian Mixture Models (GMM), Hidden Markov Models (HMM), Latent Dirichlet Allocation (LDA), Restricted Boltzmann Machines (RBM), Deep Belief Networks (DBN), Deep Boltzmann Machines (DBM), and GANs, to give the reader some insight into which generative model to pick when dealing with a problem.

117 citations


Journal ArticleDOI
TL;DR: This survey conducts a comprehensive review of various short text topic modeling techniques proposed in the literature, presenting three categories of methods based on Dirichlet multinomial mixture, global word co-occurrences, and self-aggregation, with examples of representative approaches in each category and analysis of their performance on various tasks.
Abstract: Inferring discriminative and coherent latent topics from short texts is a critical and fundamental task, since many real-world applications require semantic understanding of short texts. Traditional long-text topic modeling algorithms (e.g., PLSA and LDA) based on word co-occurrences cannot solve this problem very well, since only very limited word co-occurrence information is available in short texts. Therefore, short text topic modeling has attracted much attention from the machine learning research community in recent years, aiming to overcome the problem of sparseness in short texts. In this survey, we conduct a comprehensive review of various short text topic modeling techniques proposed in the literature. We present three categories of methods based on Dirichlet multinomial mixture, global word co-occurrences, and self-aggregation, with examples of representative approaches in each category and analysis of their performance on various tasks. We develop the first comprehensive open-source Java library, called STTM, which integrates all surveyed algorithms within a unified interface and provides benchmark datasets, to facilitate the development of new methods in this research field. Finally, we evaluate these state-of-the-art methods on many real-world datasets and compare their performance against one another and against long-text topic modeling algorithms.
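To make the first category concrete, here is a toy Gibbs sampler in the style of the Dirichlet multinomial mixture (GSDMM); it is a simplified, unoptimized sketch whose sampling formula assumes each word occurs at most once per document, not the STTM library's implementation.

```python
# Toy GSDMM-style sampler: one topic per *document*, resampled by Gibbs steps.
import numpy as np

def gsdmm(docs, V, K=8, alpha=0.1, beta=0.1, iters=30, seed=0):
    rng = np.random.default_rng(seed)
    z = rng.integers(K, size=len(docs))             # topic of each document
    m = np.bincount(z, minlength=K).astype(float)   # documents per topic
    nzw = np.zeros((K, V))                          # per-topic word counts
    nz = np.zeros(K)                                # per-topic token totals
    for d, doc in enumerate(docs):
        for w in doc:
            nzw[z[d], w] += 1
        nz[z[d]] += len(doc)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            k = z[d]                                # remove doc d from its topic
            m[k] -= 1
            for w in doc:
                nzw[k, w] -= 1
            nz[k] -= len(doc)
            # log p(z_d = k | rest): cluster popularity x word fit
            logp = np.log(m + alpha)
            for w in doc:
                logp += np.log(nzw[:, w] + beta)
            for i in range(len(doc)):
                logp -= np.log(nz + V * beta + i)
            p = np.exp(logp - logp.max())
            k = rng.choice(K, p=p / p.sum())        # resample and restore counts
            z[d] = k
            m[k] += 1
            for w in doc:
                nzw[k, w] += 1
            nz[k] += len(doc)
    return z

docs = [[0, 1, 2], [0, 2], [3, 4, 5], [4, 5]]       # toy word-id documents
print(gsdmm(docs, V=6, K=2))
```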

101 citations


Journal ArticleDOI
TL;DR: The authors investigated the motivation and satisfaction of restaurant customers, tourists from China and the U.S., by examining their online ratings and reviews, and found that Chinese tourists are less inclined to assign lower ratings to restaurants and are more strongly fascinated by the food offered.

99 citations


Journal ArticleDOI
TL;DR: The proposed automated classification model and LDA-based network analysis method provide a useful approach for machine-assisted interpretation of text-based accident narratives, and can provide managers with much-needed information and knowledge to improve safety on-site.

80 citations


Posted ContentDOI
TL;DR: This paper illustrates five different techniques to assess the distinctiveness of topics, key terms and features, speed of information dissemination, and network behaviors for Covid19 tweets, and seeks to understand retweet cascades.
Abstract: This paper illustrates five different techniques to assess the distinctiveness of topics, key terms and features, speed of information dissemination, and network behaviors for Covid19 tweets. First, we use pattern matching; second, we use topic modeling through Latent Dirichlet Allocation (LDA) to generate twenty different topics that discuss case spread, healthcare workers, and personal protective equipment (PPE). One topic specific to U.S. cases would uptick immediately after live White House Coronavirus Task Force briefings, implying that many Twitter users are paying attention to government announcements. We contribute machine learning methods not previously reported in the Covid19 Twitter literature. This includes our third method, Uniform Manifold Approximation and Projection (UMAP), which identifies the unique clustering behavior of distinct topics to improve our understanding of important themes in the corpus and help assess the quality of the generated topics. Fourth, we calculated retweeting times to understand how fast information about Covid19 propagates on Twitter. Our analysis indicates that the median retweeting time of Covid19 for a sample corpus in March 2020 was 2.87 hours, approximately 50 minutes faster than repostings from Chinese social media about H7N9 in March 2013. Lastly, we sought to understand retweet cascades by visualizing the connections of users over time, from fast to slow retweeting. As the time to retweet increases, the density of connections also increases; in our sample, we found distinct users dominating the attention of Covid19 retweeters. One of the simplest highlights of this analysis is that early-stage descriptive methods like regular expressions can successfully identify high-level themes which were consistently verified as important through every subsequent analysis.
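The UMAP step can be sketched as follows with the umap-learn package, projecting LDA document-topic vectors to 2-D and coloring points by dominant topic; load_tweets is a hypothetical helper and all parameter values are assumptions.

```python
# Sketch: visualize LDA topic separation with UMAP (umap-learn package).
import umap
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

tweets = load_tweets()  # hypothetical: list of cleaned tweet strings

X = CountVectorizer(stop_words="english", min_df=5).fit_transform(tweets)
doc_topics = LatentDirichletAllocation(n_components=20,
                                       random_state=0).fit_transform(X)

xy = umap.UMAP(n_components=2, metric="cosine",
               random_state=0).fit_transform(doc_topics)
plt.scatter(xy[:, 0], xy[:, 1], c=doc_topics.argmax(axis=1), s=3, cmap="tab20")
plt.show()  # well-separated islands suggest distinct, coherent topics
```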

79 citations


Journal ArticleDOI
03 Apr 2020
TL;DR: FedLDA, a local differential privacy (LDP) based framework for federated learning of LDA models, contains a novel LDP mechanism called Random Response with Priori (RRP), which provides theoretical guarantees on both data privacy and model accuracy.
Abstract: Latent Dirichlet Allocation (LDA) is a widely adopted topic model for industrial-grade text mining applications. However, its performance heavily relies on the collection of large amounts of text data from users' everyday lives for model training. Such data collection risks severe privacy leakage if the data collector is untrustworthy. To protect text data privacy while allowing accurate model training, we investigate federated learning of LDA models. That is, the model is collaboratively trained between an untrustworthy data collector and multiple users, where the raw text data of each user are stored locally and not uploaded to the data collector. To this end, we propose FedLDA, a local differential privacy (LDP) based framework for federated learning of LDA models. Central to FedLDA is a novel LDP mechanism called Random Response with Priori (RRP), which provides theoretical guarantees on both data privacy and model accuracy. We also design techniques to reduce the communication cost between the data collector and the users during model training. Extensive experiments on three open datasets verify the effectiveness of our solution.
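RRP itself is the paper's contribution and is not reproduced here; the sketch below shows only the classic k-ary randomized response mechanism that this family of LDP methods builds on, with an assumed vocabulary size and privacy budget.

```python
# Classic k-ary randomized response: report the true token with probability
# p = e^eps / (e^eps + k - 1), otherwise a uniformly random *other* token.
# This satisfies eps-local differential privacy for a single token report.
import numpy as np

def k_randomized_response(true_token, vocab_size, eps, rng):
    p = np.exp(eps) / (np.exp(eps) + vocab_size - 1)  # prob of truthful report
    if rng.random() < p:
        return true_token
    other = rng.integers(vocab_size - 1)              # any token but the true one
    return other if other < true_token else other + 1

rng = np.random.default_rng(0)
reports = [k_randomized_response(42, vocab_size=10_000, eps=4.0, rng=rng)
           for _ in range(5)]
print(reports)
```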

74 citations


Journal ArticleDOI
TL;DR: This work applies Latent Dirichlet Allocation (LDA) topic modeling techniques to derive the discussion topics related to three popular deep learning frameworks, namely, Tensorflow, PyTorch and Theano, and makes a comparison of topics between the two platforms.
Abstract: Deep learning has gained tremendous traction from the developer and researcher communities. It plays an increasingly significant role in a number of application domains. Deep learning frameworks are proposed to help developers and researchers easily leverage deep learning technologies, and they attract a great number of discussions on popular platforms, namely Stack Overflow and GitHub. To understand and compare the insights from these two platforms, we mine the topics of interest from both. Specifically, we apply Latent Dirichlet Allocation (LDA) topic modeling techniques to derive the discussion topics related to three popular deep learning frameworks, namely Tensorflow, PyTorch and Theano. Within each platform, we compare the topics across the three deep learning frameworks. Moreover, we compare topics between the two platforms. Our observations include: 1) a wide range of topics is discussed about the three deep learning frameworks on both platforms, and the most popular workflow stages are Model Training and Preliminary Preparation; 2) the topic distributions at the workflow level and topic category level are similar for Tensorflow and PyTorch, while the topic distribution pattern on Theano is quite different, and the topic trends at the workflow level and topic category level also differ across the three frameworks; 3) the topics at the workflow level show different trends across the two platforms, e.g., the Preliminary Preparation stage topic becomes relatively stable on Stack Overflow after 2016, while on GitHub it shows a stronger upward trend after 2016. Across both platforms, the Model Training stage topic achieves the highest impact scores. Based on these findings, we also discuss implications for practitioners and researchers.

66 citations


Journal ArticleDOI
TL;DR: A novel and robust framework is presented that combines deep learning and text mining technologies to analyse hazard records automatically, enabling managers to understand their patterns of manifestation and put in place strategies to prevent them from reoccurring.

Proceedings ArticleDOI
01 Jul 2020
TL;DR: The proposed Bidirectional Adversarial Topic (BAT) model is the first attempt at applying bidirectional adversarial training to neural topic modeling; experiments show that BAT and Gaussian-BAT obtain more coherent topics, outperforming several competitive baselines.
Abstract: Recent years have witnessed a surge of interest in using neural topic models for automatic topic extraction from text, since they avoid the complicated mathematical derivations for model inference required by traditional topic models such as Latent Dirichlet Allocation (LDA). However, these models typically either assume an improper prior (e.g., Gaussian or logistic normal) over the latent topic space or cannot infer the topic distribution for a given document. To address these limitations, we propose a neural topic modeling approach, called the Bidirectional Adversarial Topic (BAT) model, which represents the first attempt to apply bidirectional adversarial training to neural topic modeling. The proposed BAT builds a two-way projection between the document-topic distribution and the document-word distribution. It uses a generator to capture the semantic patterns from texts and an encoder for topic inference. Furthermore, to incorporate word relatedness information, the Bidirectional Adversarial Topic model with Gaussian (Gaussian-BAT) is extended from BAT. To verify the effectiveness of BAT and Gaussian-BAT, three benchmark corpora are used in our experiments. The experimental results show that BAT and Gaussian-BAT obtain more coherent topics, outperforming several competitive baselines. Moreover, when performing text clustering based on the extracted topics, our models outperform all the baselines, with more significant improvements achieved by Gaussian-BAT, where an increase of nearly 6% in accuracy is observed.
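A heavily simplified PyTorch sketch of the bidirectional adversarial idea follows (not the authors' implementation): a generator maps Dirichlet-drawn topic mixtures to word distributions, an encoder maps real documents to topic mixtures, and a discriminator tries to tell the two pairings apart; all sizes and hyperparameters are assumptions.

```python
# Simplified bidirectional-adversarial sketch (ALI/BiGAN style) for topics.
import torch
import torch.nn as nn

K, V = 20, 2000  # assumed number of topics and vocabulary size

gen  = nn.Sequential(nn.Linear(K, 256), nn.ReLU(), nn.Linear(256, V), nn.Softmax(dim=-1))
enc  = nn.Sequential(nn.Linear(V, 256), nn.ReLU(), nn.Linear(256, K), nn.Softmax(dim=-1))
disc = nn.Sequential(nn.Linear(K + V, 256), nn.ReLU(), nn.Linear(256, 1))

opt_d  = torch.optim.Adam(disc.parameters(), lr=1e-4)
opt_ge = torch.optim.Adam(list(gen.parameters()) + list(enc.parameters()), lr=1e-4)
bce = nn.BCEWithLogitsLoss()
prior = torch.distributions.Dirichlet(torch.full((K,), 0.1))  # topic prior

def step(real_docs):  # real_docs: (B, V) normalized term-frequency vectors
    B = real_docs.size(0)
    theta_fake = prior.sample((B,))
    fake_pair = torch.cat([theta_fake, gen(theta_fake)], dim=-1)  # (theta, generated doc)
    real_pair = torch.cat([enc(real_docs), real_docs], dim=-1)    # (inferred theta, real doc)
    # Discriminator: real pairs -> 1, generated pairs -> 0.
    d_loss = bce(disc(real_pair.detach()), torch.ones(B, 1)) + \
             bce(disc(fake_pair.detach()), torch.zeros(B, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # Generator + encoder: fool the discriminator in both directions.
    g_loss = bce(disc(fake_pair), torch.ones(B, 1)) + \
             bce(disc(real_pair), torch.zeros(B, 1))
    opt_ge.zero_grad(); g_loss.backward(); opt_ge.step()
```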

Journal ArticleDOI
TL;DR: A generic methodology based on topic modeling and text network modeling, that allows researchers to gather valuable information from surveys that use open-ended questions, is evaluated through the use of a case study in which the responses to a teacher self-assessment survey in an Ecuadorian university have been studied.
Abstract: The large amount of text generated daily on the web through comments on social networks, blog posts and open-ended question surveys, among others, shows how frequently text data are produced, and their processing therefore becomes a challenge for researchers. Topic modeling is one of the emerging techniques in text mining; it is based on the discovery of latent data and the search for relationships among text documents. In this paper, the objective of the research is to evaluate a generic methodology, based on topic modeling and text network modeling, that allows researchers to gather valuable information from surveys that use open-ended questions. To achieve this, the methodology has been evaluated through a case study in which the responses to a teacher self-assessment survey at an Ecuadorian university were analyzed. The main contribution of the article is the inclusion of clustering algorithms to complement the results obtained when executing topic modeling. The proposed methodology is based on four phases: (a) construction of a text database, (b) text mining and topic modeling, (c) topic network modeling and (d) assessing the relevance of the identified topics. In previous works, it has been observed that human interpretation plays an important role in the process, especially in phases (a) and (d). For this reason, visualization interfaces, such as graphs and dendrograms, are of critical importance for allowing researchers to efficiently analyze the results of the topic modeling. As a result of this case study, a compendium of the main strategies that teachers carry out in their classes with the aim of improving student retention is presented. In addition, the proposed methodology can be extended to the analysis of the unstructured textual information found in blogs, social networks, forums, etc.
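Phase (c) can be sketched as follows with networkx, assuming a fitted scikit-learn LDA model named lda; topics become nodes, and edges link topic pairs whose word distributions are sufficiently similar (the threshold is arbitrary).

```python
# Sketch: build a topic network from pairwise topic-word similarity.
import itertools
import networkx as nx
from scipy.spatial.distance import cosine

# Assumed: `lda` is a fitted sklearn LatentDirichletAllocation model.
topic_word = lda.components_ / lda.components_.sum(axis=1, keepdims=True)

G = nx.Graph()
G.add_nodes_from(range(topic_word.shape[0]))
for i, j in itertools.combinations(range(topic_word.shape[0]), 2):
    sim = 1 - cosine(topic_word[i], topic_word[j])
    if sim > 0.2:                  # arbitrary similarity threshold for illustration
        G.add_edge(i, j, weight=sim)

print(nx.number_connected_components(G), "groups of related topics")
```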

Journal ArticleDOI
TL;DR: This research presents a novel aggregation method for constructing an aggregated topic model composed of topics with greater coherence than those of the individual models; the aggregated model outperforms those topic models at a statistically significant level in terms of topic coherence over an external corpus.
Abstract: This research presents a novel aggregation method for constructing an aggregated topic model that is composed of topics with greater coherence than those of individual models. When generating a topic model, a number of parameters have to be specified, and the resulting topics can be very general or very specific depending on the chosen parameters. In this study we investigate the process of aggregating multiple topic models generated using different parameters, with a focus on whether combining the general and specific topics is able to increase topic coherence. We employ cosine similarity and Jensen-Shannon divergence to compute the similarity among topics, and combine them into an aggregated model when their similarity scores exceed a predefined threshold. The model is evaluated against standard topic models generated by latent Dirichlet allocation and Non-negative Matrix Factorisation. Specifically, we use topic coherence to compare the individual models used to create the aggregated model against the aggregated model itself, and against models generated by Non-negative Matrix Factorisation. The results demonstrate that the aggregated model outperforms those topic models at a statistically significant level in terms of topic coherence over an external corpus. We also apply the aggregated topic model to social media data to validate the method in a realistic scenario, and find that it again outperforms the individual topic models.
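A small sketch of the merging rule described above, using scipy's Jensen-Shannon distance (squared to obtain the divergence); the threshold and the naive averaging of merged topics are assumptions, not the paper's exact procedure.

```python
# Sketch: merge topics across models when their JS divergence is small enough.
import numpy as np
from scipy.spatial.distance import jensenshannon

def aggregate(topics_a, topics_b, max_jsd=0.35):
    """topics_*: arrays of shape (K, V); rows are topic-word distributions."""
    merged = list(topics_a)
    for t in topics_b:
        dists = [jensenshannon(t, m) ** 2 for m in merged]  # squared distance = divergence
        if min(dists) <= max_jsd:
            k = int(np.argmin(dists))
            merged[k] = (merged[k] + t) / 2       # naive average of the two topics
            merged[k] /= merged[k].sum()          # renormalize to a distribution
        else:
            merged.append(t)                      # sufficiently novel: keep as new topic
    return np.array(merged)
```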

Journal ArticleDOI
TL;DR: Experimental results show that the proposed Poisson Dirichlet Model (PDM) could effectively identify distinguished disease clusters based on the latent patterns hidden in the EHR data by alleviating the impact of age and sex, and that LDA could stratify patients into more differentiable subgroups than PDM in terms of p-values.

Journal ArticleDOI
TL;DR: This work applies an inductive approach by utilizing large unstructured text data of 104,161 online reviews of Korean accommodation customers to frame which topics of interest guests find important, and finds a higher importance for points of competition and points of uniqueness among the accommodation characteristics.
Abstract: There is a lot of attention given to the determinants of guest satisfaction and consumer behavior in the tourism literature. While much extant literature uses a deductive approach for identifying guest satisfaction dimensions, we apply an inductive approach by utilizing large unstructured text data of 104,161 online reviews of Korean accommodation customers to frame which topics of interest guests find important. Using latent Dirichlet allocation, a generative, Bayesian, hierarchical statistical model, we extract and validate topics of interest in the dataset. The results corroborate extant literature in that dimensions, such as location and service quality, are important. However, we extend existing dimensions of importance by more precisely distinguishing aspects of location and service quality. Furthermore, by comparing the characteristics of the accommodations in terms of metropolitan versus rural and the type of accommodation, we reveal differences in topics of importance between different characteristics of the accommodations. Specifically, we find a higher importance for points of competition and points of uniqueness among the accommodation characteristics. This has implications for how managers can improve customer satisfaction and how researchers can more precisely measure customer satisfaction in the hospitality industry.

Journal ArticleDOI
TL;DR: This study inductively analyzes the topics of interest that drive customer experience and satisfaction within the sharing economy of the accommodation sector using a dataset of 1,086,800 Airbnb reviews across New York City.
Abstract: This study inductively analyzes the topics of interest that drive customer experience and satisfaction within the sharing economy of the accommodation sector. Using a dataset of 1,086,800 Airbnb reviews across New York City, the text is preprocessed and latent Dirichlet allocation is utilized in order to extract 43 topics of interest from the user-generated content. The topics fall into one of several categories, including the general evaluation of guests, centralized or decentralized location attributes of the accommodation, tangible and intangible characteristics of the listed units, management of the listing or unit, and service quality of the host. The deeper complex relationships between topics are explored in detail using hierarchical Ward Clustering.
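The final clustering step can be sketched with scipy, assuming a fitted scikit-learn LDA model named lda whose 43 topic-word rows are grouped by Ward linkage; the cut into six groups is an arbitrary choice for illustration.

```python
# Sketch: hierarchical Ward clustering over LDA topic-word vectors.
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

topic_word = lda.components_                       # assumed fitted LDA, shape (43, V)
Z = linkage(topic_word, method="ward")             # agglomerative merge tree
groups = fcluster(Z, t=6, criterion="maxclust")    # cut the tree into 6 topic groups
dendrogram(Z)                                      # visualize how topics merge
plt.show()
```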

Journal ArticleDOI
TL;DR: The main aim of this article is to present the results of different experiments focused on the model fitting process in topic modeling and its accuracy when applied to long texts, and to present a clear-cut power-law relation between the optimal number of topics and the analyzed sample size.
Abstract: The main aim of this article is to present the results of different experiments focused on the model fitting process in topic modeling and its accuracy when applied to long texts. The digital era has made available both enormous quantities of textual data and technological advances that have facilitated the development of techniques to automate the data coding and analysis processes. In the ambit of topic modeling, different procedures were born in order to analyze larger and larger collections of texts, namely corpora, but this has posed, and continues to pose, a series of methodological questions that urgently need to be resolved. Therefore, through a series of different experiments, this article is based on the following consideration: taking into account Latent Dirichlet Allocation (LDA), a generative probabilistic model (Blei et al. in J Mach Learn Res 3:993–1022, 2003; Blei and Lafferty in: Srivastava, Sahami (eds) Text mining: classification, clustering, and applications, Chapman & Hall/CRC Press, Cambridge, 2009; Griffiths and Steyvers in Proc Natl Acad Sci USA (PNAS), 101(Supplement 1):5228–5235, 2004), the model fitting problem is crucial because the LDA algorithm demands that the number of topics be specified a priori. Needless to say, the number of topics to detect in a corpus is a parameter which affects the analysis results. Since there is a lack of experiments applied to long texts, our article tries to shed new light on the complex relationship between texts' length and the optimal number of topics. In the conclusions, we present a clear-cut power-law relation between the optimal number of topics and the analyzed sample size, and we formulate it in the form of a mathematical model.
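A power-law relation of this kind can be fit by least squares in log-log space; the sketch below uses hypothetical sample sizes and optimal topic counts, not the article's data.

```python
# Sketch: fit K* = a * N^b between corpus size N and optimal topic count K*.
import numpy as np

N = np.array([1_000, 5_000, 20_000, 100_000])   # hypothetical sample sizes
K = np.array([12, 21, 35, 62])                  # hypothetical optimal topic counts

b, log_a = np.polyfit(np.log(N), np.log(K), deg=1)  # linear fit in log-log space
a = np.exp(log_a)
print(f"K* ~= {a:.2f} * N^{b:.2f}")
```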

Journal ArticleDOI
TL;DR: This study proposes a topic model based on support vector machine (SVM) prediction for automatic patent classification, which can classify patents without the need for any expert judgment during the process.
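The summary leaves the details open; one plausible reading is LDA topic proportions feeding an SVM classifier, sketched here as a scikit-learn pipeline with assumed parameters and hypothetical training data.

```python
# Sketch: topic-feature SVM for patent classification (assumed reading of the TL;DR).
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.svm import SVC

clf = Pipeline([
    ("counts", CountVectorizer(stop_words="english")),
    ("lda", LatentDirichletAllocation(n_components=50, random_state=0)),
    ("svm", SVC(kernel="rbf", C=1.0)),
])
# clf.fit(patent_texts, class_labels)   # hypothetical training data
# pred = clf.predict(new_patents)
```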

Journal ArticleDOI
TL;DR: Results indicate that library science has become less prevalent over time, as there are no top topic clusters relevant to library issues since the period 2000–2005 and bibliometrics, especially citation analysis, is highly stable across periods.
Abstract: This study investigated the evolution of library and information science (LIS) by analyzing research topics in LIS journal articles. The analysis is divided into five periods covering the years 1996–2019. Latent Dirichlet allocation modeling was used to identify underlying topics based on 14,035 documents. An improved data-selection method was devised in order to generate a dynamic journal list that included influential journals for each period. Results indicate that (a) library science has become less prevalent over time, as there are no top topic clusters relevant to library issues since the period 2000–2005; (b) bibliometrics, especially citation analysis, is highly stable across periods, as reflected by the stable subclusters and consistent keywords; and (c) information retrieval has consistently been the dominant domain with interests gradually shifting to model-based text processing. Information seeking and behavior is also a stable field that tends to be dispersed among various topics rather than presented as its own subject. Information systems and organizational activities have been continuously discussed and have developed a closer relationship with e-commerce. Topics that occurred only once have undergone a change of technological context from the networks and Internet to social media and mobile applications.

Journal ArticleDOI
TL;DR: This study performs a latent Dirichlet allocation technique to extract topics and keywords from articles and shows that, with KP features, the prediction models are more effective than those with journal and/or author features, especially in the management information system discipline.

Journal ArticleDOI
TL;DR: Local users’ sentiments extracted from Geo-tweets data from January to December 2016, analyzed in the spatial and temporal perspective are explored, finding patterns which demonstrate the associations between the nature of Twitter content and the characteristics of places and users.
Abstract: Sentiment affects every aspect of people's lives and has strong impact on their mental health. This paper explores local users' sentiments extracted from Geo-tweets data from January to December 2016, analyzed in the spatial and temporal perspective. Because of large amount of noisy data and complicated procedure of extracting local user, a workflow is created, facilitating more researchers to reproduce, replicate or extend the procedures using similar Geo-tweet dataset. The workflow is sharing at Harvard Dataverse (https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/6N9VUF). Using the processed data, each tweet's sentiment is classified according to the content. Then, the overall temporal variations of total number of positive, neural, and negative sentiments are analyzed on a monthly, daily and hourly level. From a spatial perspective, the Local Indicators of Spatial Association (LISA) statistical method is employed to discover the spatial clusters. In order to explore the content of positive sentiments, this paper applies the Latent Dirichlet Allocation (LDA) model to classify the Geo-tweets with positive sentiments into different topics. Combining the geospatial information with the topics, some patterns are found which demonstrate the associations between the nature of Twitter content and the characteristics of places and users. For example, weekend events and friend and family gatherings are the time that users prefer to post positive tweets. In the western part of US, users tend to post more photos to share the great moment on Twitter than other parts of the US.

Journal ArticleDOI
Yajun Du, YongTao Yi, Xianyong Li, Xiaoliang Chen, Yongquan Fan, Fanghong Su
TL;DR: The experimental results show that MF-LDA has lower perplexity and a higher coverage rate than LDA under the same conditions, and demonstrate the effectiveness and practical significance of the HTLCM model and HTT algorithm in extracting and tracking hot topics.
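MF-LDA itself is not shown here; the sketch below only illustrates the perplexity-comparison protocol with a standard gensim LDA on toy tokenized documents.

```python
# Sketch: held-out perplexity of a gensim LDA model (lower is better).
from gensim.corpora import Dictionary
from gensim.models import LdaModel

texts = [["hot", "topic", "microblog"], ["topic", "tracking", "stream"]]  # toy tokens
dct = Dictionary(texts)
corpus = [dct.doc2bow(t) for t in texts]

lda = LdaModel(corpus, id2word=dct, num_topics=2, random_state=0)
bound = lda.log_perplexity(corpus)   # per-word likelihood bound
print("perplexity:", 2 ** (-bound))  # gensim's own convention: 2^(-bound)
```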

Journal ArticleDOI
02 Mar 2020
TL;DR: In this paper, the authors present a framework to map multi-modal data collected in the wild to meaningful feature representations of health-related behaviors, uncover latent patterns comprising combinations of behaviors that best predict health and well-being, and use these learned patterns to make evidence-based recommendations that may improve health.
Abstract: Multiple behaviors typically work together to influence health, making it hard to understand how one behavior might compensate for another. Rich multi-modal datasets from mobile sensors and advances in machine learning are today enabling new kinds of associations to be made between combinations of behaviors objectively assessed from daily life and self-reported levels of stress, mood, and health. In this article, we present a framework to (1) map multi-modal messy data collected in the “wild” to meaningful feature representations of health-related behaviors, (2) uncover latent patterns comprising combinations of behaviors that best predict health and well-being, and (3) use these learned patterns to make evidence-based recommendations that may improve health and well-being. We show how to use supervised latent Dirichlet allocation to model the observed behaviors, and we apply variational inference to uncover the latent patterns. Implementing and evaluating the model on 5,397 days of data from a group of 244 college students, we find that these latent patterns are indeed predictive of daily self-reported levels of stressed-calm, sad-happy, and sick-healthy states. We investigate the patterns of modifiable behaviors present on different days and uncover several ways in which they relate to stress, mood, and health. This work contributes a new method using objective data analysis to help advance understanding of how combinations of modifiable human behaviors may promote human health and well-being.
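The paper's model is supervised LDA fit by variational inference; as a rough, clearly labeled stand-in, the sketch below learns unsupervised topic mixtures over discretized daily behavior "tokens" and then regresses hypothetical self-reports on them (a two-stage proxy, not true supervised LDA).

```python
# Two-stage proxy for the paper's supervised LDA: LDA features + regression.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import LinearRegression

# Hypothetical: each day is a "document" of discretized behavior tokens.
days = ["sleep_late exercise_none screen_high",
        "sleep_early exercise_run screen_low",
        "sleep_late exercise_walk screen_high"]
stress = [7.0, 3.0, 6.0]   # hypothetical self-reported stress scores

X = CountVectorizer().fit_transform(days)
theta = LatentDirichletAllocation(n_components=2, random_state=0).fit_transform(X)
reg = LinearRegression().fit(theta, stress)  # which behavior patterns track stress
print(reg.coef_)
```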

Journal ArticleDOI
TL;DR: The results show that the proposed approach is well suited to analyzing scientific evolution through monolingual and multilingual topic similarity relations.

Journal ArticleDOI
TL;DR: This work applies latent Dirichlet allocation topic modeling to a vast number of passenger-authored online reviews for airline services to compare service quality between full service carriers (FSCs) and low cost carriers (LCCs).
Abstract: We apply latent Dirichlet allocation topic modeling to a vast number of passenger-authored online reviews for airline services to compare service quality between full service carriers (FSCs) and low cost carriers (LCCs).

Journal ArticleDOI
TL;DR: In this paper, an unsupervised machine learning technique based on the probabilistic generative model of Latent Dirichlet Allocation is proposed to learn the underlying structure of collider events directly from the data.
Abstract: We describe a technique to learn the underlying structure of collider events directly from the data, without having a particular theoretical model in mind. It allows one to infer aspects of the theoretical model that may have given rise to this structure, and can be used to cluster or classify the events for analysis purposes. The unsupervised machine-learning technique is based on the probabilistic (Bayesian) generative model of Latent Dirichlet Allocation. We pair the model with an approximate inference algorithm called Variational Inference, which we then use to extract the latent probability distributions describing the learned underlying structure of collider events. We provide a detailed systematic study of the technique using two example scenarios to learn the latent structure of di-jet event samples made up of QCD background events and either tt̄ or hypothetical W′ → (ϕ → WW)W signal events.

Journal ArticleDOI
TL;DR: A probabilistic topic model is proposed, adapted from Latent Dirichlet Allocation (LDA), to discover representative and interpretable activity categorization from individual-level spatiotemporal data in an unsupervised manner and can successfully distinguish the three most basic types of activities.
Abstract: Although automatically collected human travel records can accurately capture the time and location of human movements, they do not directly explain the hidden semantic structures behind the data, e.g., activity types. This work proposes a probabilistic topic model, adapted from Latent Dirichlet Allocation (LDA), to discover representative and interpretable activity categorization from individual-level spatiotemporal data in an unsupervised manner. Specifically, the activity-travel episodes of an individual user are treated as words in a document, and each topic is a distribution over space and time that corresponds to a certain type of activity. The model accounts for a mixture of discrete and continuous attributes: the location, start time of day, start day of week, and duration of each activity episode. The proposed methodology is demonstrated using pseudonymized transit smart card data from London, U.K. The results show that the model can successfully distinguish the three most basic types of activities: home, work, and other. As the specified number of activity categories increases, more specific subpatterns for home and work emerge, and both the goodness of fit and the predictive performance for travel behavior improve. This work makes it possible to enrich human mobility data with representative and interpretable activity patterns without relying on predefined activity categories or heuristic rules.
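The key encoding move, treating activity-travel episodes as words, can be sketched as below; the binning scheme and token format are assumptions (the paper's model additionally handles the continuous attributes within the model itself rather than by binning).

```python
# Sketch: turn each activity episode into a discrete "word" for LDA-style models.
def episode_token(location_id, start_hour, weekday, duration_h):
    tod = "am" if start_hour < 12 else "pm"                    # coarse time-of-day bin
    day = "wkday" if weekday < 5 else "wkend"                  # day-of-week bin
    dur = "short" if duration_h < 2 else ("mid" if duration_h < 6 else "long")
    return f"loc{location_id}_{tod}_{day}_{dur}"

# One user's day as a "document" of episode words.
doc = [episode_token(12, 9, 1, 8.5),    # long weekday daytime stay -> likely work
       episode_token(3, 19, 1, 11.0)]   # long evening stay -> likely home
print(doc)
```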

Posted ContentDOI
04 Aug 2020-medRxiv
TL;DR: This work proposes an intelligent clustering-based classification and topic-extraction model (named TClustVID) that analyzes COVID-19-related public tweets to extract significant sentiments with high accuracy, and shows that it outperforms traditional classifiers as determined by clustering criteria.
Abstract: COVID-19, caused by SARS-CoV-2, varies greatly in its severity but can present serious respiratory symptoms with vascular and other complications, particularly in older adults. The disease can be spread by both symptomatic and asymptomatic infected individuals; uncertainty remains over key aspects of its infectivity; no effective remedy yet exists; and the disease causes severe economic effects globally. For these reasons, COVID-19 is the subject of intense and widespread discussion on social media platforms including Facebook and Twitter. These public forums substantially impact public opinion and in some cases exacerbate the widespread panic and misinformation spread during the crisis. Thus, this work aimed to design an intelligent clustering-based classification and topic-extraction model (named TClustVID) that analyzes COVID-19-related public tweets to extract significant sentiments with high accuracy. We gathered COVID-19 Twitter datasets from the IEEE Dataport repository and employed a range of data preprocessing methods to clean the raw data, then applied tokenization and produced a word-to-index dictionary. Thereafter, different classifiers were applied to the Twitter datasets, enabling exploration of the performance of traditional classification methods and TClustVID. TClustVID showed higher performance compared to the traditional classifiers as determined by clustering criteria. Finally, we extracted significant topic clusters from TClustVID, split them into positive, neutral and negative clusters, and implemented latent Dirichlet allocation to extract popular COVID-19 topics. This approach identified common prevailing public opinions and concerns related to COVID-19, as well as attitudes to infection prevention strategies held by people from different countries concerning the current pandemic situation.
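The preprocessing the abstract describes, cleaning, tokenization, and a word-to-index dictionary, can be sketched as follows; the cleaning rules are assumptions.

```python
# Sketch: clean tweets, tokenize, and build a word-to-index dictionary.
import re
from collections import Counter

def clean(tweet):
    # Strip URLs, @mentions, and '#' characters, then keep alphabetic tokens.
    tweet = re.sub(r"https?://\S+|@\w+|#", "", tweet.lower())
    return re.findall(r"[a-z']+", tweet)

tweets = ["Stay home! #covid https://t.co/x", "@who masks help against covid"]
tokens = [clean(t) for t in tweets]

counts = Counter(w for doc in tokens for w in doc)
word2idx = {w: i for i, (w, _) in enumerate(counts.most_common())}
encoded = [[word2idx[w] for w in doc] for doc in tokens]
print(encoded)
```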

Journal ArticleDOI
TL;DR: This approach outperforms state-of-the-art phishing detection research on an accredited data set, in applications based only on the body of the e-mails, without using other e-mail features such as the header, IP information or the number of links in the text.
Abstract: Phishing is a type of fraud attempt in which the attacker, usually by e-mail, pretends to be a trusted person or entity in order to obtain sensitive information from a target. Most recent phishing detection research has focused on obtaining highly distinctive features from the metadata and text of these e-mails. The obtained attributes are then used to feed classification algorithms in order to determine whether they are phishing or legitimate messages. In this paper, an approach based on machine learning is proposed to detect phishing e-mail attacks. The methods that compose this approach rest on a feature engineering process based on natural language processing, lemmatization, topic modeling, improved learning techniques for resampling and cross-validation, and hyperparameter configuration. The first proposed method uses all the features obtained from the Document-Term Matrix (DTM) in the classification algorithms. The second one uses Latent Dirichlet Allocation (LDA) as an operation to deal with the problems of the “curse of dimensionality”, sparsity, and the portion of text context included in the obtained representation. The proposed approach achieved an F1-measure of 99.95% using the XGBoost algorithm. It outperforms state-of-the-art phishing detection research on an accredited data set, in applications based only on the body of the e-mails, without using other e-mail features such as the header, IP information or the number of links in the text.
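The second method can be sketched as a CountVectorizer document-term matrix reduced by LDA and fed to XGBoost; load_email_corpus is a hypothetical helper, and all parameters are assumptions (the paper's lemmatization, resampling, and tuning steps are omitted).

```python
# Sketch: DTM -> LDA topic features -> XGBoost phishing classifier.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from xgboost import XGBClassifier

emails, labels = load_email_corpus()  # hypothetical: bodies + 0/1 phishing labels

dtm = CountVectorizer(stop_words="english", min_df=2).fit_transform(emails)
topics = LatentDirichletAllocation(n_components=40,
                                   random_state=0).fit_transform(dtm)

clf = XGBClassifier(n_estimators=300, max_depth=6, eval_metric="logloss")
clf.fit(topics, labels)  # evaluation (e.g., F1 under cross-validation) omitted
```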