
Showing papers on "Hierarchical Dirichlet process published in 2020"


Journal ArticleDOI
TL;DR: A method for mapping the technology evolution path using a novel non-parametric topic model, the citation involved Hierarchical Dirichlet Process (CIHDP), to achieve better topic detection and tracking of scientific literature and can be helpful for understanding and analyzing the development of technical topics.
Abstract: Identifying the evolution path of a research field is essential to scientific and technological innovation. There have been many attempts to identify the technology evolution path based on topic models or social network analysis, but many of them had deficiencies in methodology. First, many studies have considered only a single type of information (text or citation information) in scientific literature, which may lead to incomplete technology path mapping. Second, the number of topics in each period cannot be determined automatically, making dynamic topic tracking difficult. Third, data mining methods fail to be effectively combined with visual analysis, which affects the efficiency and flexibility of mapping. In this study, we developed a method for mapping the technology evolution path using a novel non-parametric topic model, the citation involved Hierarchical Dirichlet Process (CIHDP), to achieve better topic detection and tracking of scientific literature. To better present and analyze the path, D3.js is used to visualize the splitting and fusion of the evolutionary path. We used this novel model to map the artificial intelligence research domain; through a successful mapping of the evolution path, the proposed method's validity and merits are shown. After incorporating the citation information, we found that CIHDP can map a complete path evolution process and performs better than the Hierarchical Dirichlet Process and LDA. This method can be helpful for understanding and analyzing the development of technical topics. Moreover, it can be used to map the science or technology of an innovation ecosystem, and it may interest technology evolution path researchers and policymakers.

23 citations


Journal ArticleDOI
01 Mar 2020
TL;DR: Experimental results demonstrate that the proposed probabilistic framework based on driving primitives can provide a quantitative measurement of the similarity levels of driving behavior associated with its dynamic and stochastic characteristics.
Abstract: Evaluating the similarity levels of driving behavior plays a pivotal role in driving style classification and analysis, thus benefiting the design of human-centric driver assistance systems. This article presents a novel framework capable of quantitatively measuring the similarity of driving behaviors for humans based on driving primitives, i.e., the building blocks of driving behavior. To this end, we develop a Bayesian nonparametric method by integrating hierarchical Dirichlet process (HDP) with a hidden Markov model (HMM) in order to automatically extract the driving primitives from sequential observations without using any prior knowledge. Then, we propose a grid-based relative entropy approach, which allows quantifying the probabilistic similarity levels among these extracted primitives. Finally, the naturalistic driving data from 10 drivers are collected to evaluate the proposed framework, with comparison to traditional work. Experimental results demonstrate that the proposed probabilistic framework based on driving primitives can provide a quantitative measurement of the similarity levels of driving behavior associated with its dynamic and stochastic characteristics.
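The grid-based relative entropy step can be sketched in a few lines: histogram two primitives' samples on a shared grid and compare the resulting discrete distributions. This is a hedged illustration only, not the paper's implementation; the function name, the symmetrized KL choice, and the bin count are assumptions.

```python
import numpy as np

def grid_relative_entropy(x, y, bins=10, eps=1e-12):
    """Symmetrized KL divergence between two 2-D sample sets on a shared grid.

    A simplified stand-in for a grid-based similarity measure: each
    primitive's samples (e.g. speed vs. acceleration) are histogrammed on a
    common grid and compared as discrete distributions.
    """
    lo = np.minimum(x.min(axis=0), y.min(axis=0))
    hi = np.maximum(x.max(axis=0), y.max(axis=0))
    edges = [np.linspace(lo[d], hi[d], bins + 1) for d in range(x.shape[1])]
    p, _ = np.histogramdd(x, bins=edges)
    q, _ = np.histogramdd(y, bins=edges)
    p = p.ravel() / p.sum() + eps   # smooth empty cells to keep KL finite
    q = q.ravel() / q.sum() + eps
    return 0.5 * (np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

rng = np.random.default_rng(0)
a = rng.normal([0, 0], 1.0, size=(500, 2))   # primitive A samples
b = rng.normal([0, 0], 1.0, size=(500, 2))   # a similar primitive
c = rng.normal([3, 3], 1.0, size=(500, 2))   # a dissimilar primitive
print(grid_relative_entropy(a, b) < grid_relative_entropy(a, c))  # True
```

As expected under this sketch, the divergence between two draws from the same behavior is far smaller than between clearly different behaviors.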

22 citations


Posted Content
TL;DR: The semi-hierarchical Dirichlet process (SDP) as mentioned in this paper is a hierarchical prior that extends the hierarchical Dirichlet process and avoids the degeneracy issues of nested processes.
Abstract: Assessing homogeneity of distributions is an old problem that has received considerable attention, especially in the nonparametric Bayesian literature. To this effect, we propose the semi-hierarchical Dirichlet process, a novel hierarchical prior that extends the hierarchical Dirichlet process of Teh et al. (2006) and that avoids the degeneracy issues of nested processes recently described by Camerlenghi et al. (2019a). We go beyond the simple yes/no answer to the homogeneity question and embed the proposed prior in a random partition model; this procedure allows us to give a more comprehensive response to the above question and in fact find groups of populations that are internally homogeneous when I ≥ 2 such populations are considered. We study theoretical properties of the semi-hierarchical Dirichlet process and of the Bayes factor for the homogeneity test when I = 2. Extensive simulation studies and applications to educational data are also discussed.

16 citations


Journal ArticleDOI
TL;DR: A novel Bayesian nonparametric approach for expressing the bivariate density of individual arrival and departure times at different sites across a number of years as a mixture model, motivated by capture-recapture data on blackcaps collected by the British Trust for Ornithology.
Abstract: Environmental changes in recent years have been linked to phenological shifts which in turn are linked to the survival of species. The work in this paper is motivated by capture-recapture data on blackcaps collected by the British Trust for Ornithology as part of the Constant Effort Sites monitoring scheme. Blackcaps overwinter abroad and migrate to the UK annually for breeding purposes. We propose a novel Bayesian nonparametric approach for expressing the bivariate density of individual arrival and departure times at different sites across a number of years as a mixture model. The new model combines the ideas of the hierarchical and the dependent Dirichlet process, allowing the estimation of site-specific weights and year-specific mixture locations, which are modelled as functions of environmental covariates using a multivariate extension of the Gaussian process. The proposed modelling framework is extremely general and can be used in any context where multivariate density estimation is performed jointly across different groups and in the presence of a continuous covariate.

12 citations


Journal ArticleDOI
30 Mar 2020-Entropy
TL;DR: This paper proposes a novel approach for analyzing the influence of different regularization types on results of topic modeling and concludes that regularization may introduce unpredictable distortions into topic models that need further research.
Abstract: Topic modeling is a popular technique for clustering large collections of text documents. A variety of different types of regularization is implemented in topic modeling. In this paper, we propose a novel approach for analyzing the influence of different regularization types on the results of topic modeling. Based on Renyi entropy, this approach is inspired by concepts from statistical physics, where an inferred topical structure of a collection can be considered an information statistical system residing in a non-equilibrium state. By testing our approach on four models—Probabilistic Latent Semantic Analysis (pLSA), Additive Regularization of Topic Models (BigARTM), Latent Dirichlet Allocation (LDA) with Gibbs sampling, LDA with variational inference (VLDA)—we, first of all, show that the minimum of Renyi entropy coincides with the "true" number of topics, as determined in two labelled collections. Simultaneously, we find that the Hierarchical Dirichlet Process (HDP) model, a well-known approach for topic number optimization, fails to detect this optimum. Next, we demonstrate that large values of the regularization coefficient in BigARTM significantly shift the minimum of entropy from the topic number optimum, an effect that is not observed for the hyper-parameters in LDA with Gibbs sampling. We conclude that regularization may introduce unpredictable distortions into topic models that need further research.
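The quantity the paper tracks, Renyi entropy of a discrete distribution, is straightforward to compute. The sketch below illustrates only the definition at order q = 2, not the full topic-model pipeline; the function name is an assumption.

```python
import numpy as np

def renyi_entropy(p, q=2.0):
    """Renyi entropy of order q for a discrete distribution p (q != 1):
    H_q(p) = log(sum_i p_i^q) / (1 - q)."""
    p = np.asarray(p, dtype=float)
    p = p / p.sum()          # normalize defensively
    return np.log(np.sum(p ** q)) / (1.0 - q)

uniform = np.ones(100) / 100                   # maximally uncertain topics
peaked = np.array([0.97] + [0.03 / 99] * 99)   # nearly deterministic topics
print(renyi_entropy(uniform) > renyi_entropy(peaked))  # True
```

A flat topic distribution has high entropy and a concentrated one has low entropy; the paper scans this quantity over the number of topics and reads off the minimum.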

10 citations


Posted Content
TL;DR: A novel Bayesian approach is employed designed to overcome problems and tease apart the different determinants of irregularity in patterns of West Iranian sound change and provisionally resolve a number of outstanding questions in the literature on West Iranian dialectology concerning the dialectal affiliation of certain sound changes.
Abstract: This paper addresses a series of complex and unresolved issues in the historical phonology of West Iranian languages. The West Iranian languages (Persian, Kurdish, Balochi, and other languages) display a high degree of non-Lautgesetzlich behavior. Most of this irregularity is undoubtedly due to language contact; we argue, however, that an oversimplified view of the processes at work has prevailed in the literature on West Iranian dialectology, with specialists assuming that deviations from an expected outcome in a given non-Persian language are due to lexical borrowing from some chronological stage of Persian. It is demonstrated that this qualitative approach yields at times problematic conclusions stemming from the lack of explicit probabilistic inferences regarding the distribution of the data: Persian may not be the sole donor language; additionally, borrowing at the lexical level is not always the mechanism that introduces irregularity. In many cases, the possibility that West Iranian languages show different reflexes in different conditioning environments remains under-explored. We employ a novel Bayesian approach designed to overcome these problems and tease apart the different determinants of irregularity in patterns of West Iranian sound change. Our methodology allows us to provisionally resolve a number of outstanding questions in the literature on West Iranian dialectology concerning the dialectal affiliation of certain sound changes. We outline future directions for work of this sort.

10 citations


Journal ArticleDOI
TL;DR: This paper proposes a multi-task learning based algorithm to predict users' mobility by learning the mobility behaviors of different users at the same time and exploiting the similarities among them to overcome the sparsity issue and improve prediction performance.
Abstract: Human mobility prediction techniques are instrumental for many important applications including service management and city planning. Previous work looks at the inherent patterns of a user's historical trajectories to predict his/her future location. Such methods suffer when only a small number of historical locations are available, which is a main challenge for mobility prediction on sparse trajectory datasets. In this paper, we propose a multi-task learning based algorithm to predict users' mobility by learning the mobility behaviors of different users at the same time and exploiting the similarities among them to overcome the sparsity issue and improve prediction performance. Specifically, we use a Bayesian mixture model to describe users' spatio-temporal mobility patterns, where we introduce a novel von Mises distribution to model the temporal distribution of users' mobility to better preserve its continuity across time. Then, we use the hierarchical Dirichlet process to perform joint prediction of all users' mobility. This model allows us to leverage similar mobility patterns among users to improve the prediction performance for users with sparse trajectory data. We conduct extensive evaluations using data collected by a cellular network operator and mobile applications. Our evaluations show that our algorithms achieve prediction accuracy of 53.9% on average and 73.2% for users with high-quality mobility data, with a performance gap of over 8.7% on average compared with state-of-the-art algorithms, demonstrating the effectiveness of our proposed algorithm.
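The von Mises component for temporal patterns can be illustrated with a small self-contained estimator: map hours of the day onto the circle and use the classical moment-based approximation for the concentration parameter. The helper below is a hypothetical sketch, not the paper's mixture model.

```python
import numpy as np

def fit_von_mises(hours):
    """Crude von Mises fit for circular hour-of-day data (hypothetical helper).

    Maps hours to angles, takes the mean direction from the resultant
    vector, and approximates the concentration kappa with the standard
    Best & Fisher (1981)-style formula.
    """
    theta = 2 * np.pi * np.asarray(hours) / 24.0
    C, S = np.cos(theta).mean(), np.sin(theta).mean()
    mu = np.arctan2(S, C)          # mean direction (radians)
    R = np.hypot(C, S)             # mean resultant length in [0, 1]
    if R < 0.53:
        kappa = 2 * R + R ** 3 + 5 * R ** 5 / 6
    elif R < 0.85:
        kappa = -0.4 + 1.39 * R + 0.43 / (1 - R)
    else:
        kappa = 1 / (R ** 3 - 4 * R ** 2 + 3 * R)
    return (mu % (2 * np.pi)) * 24 / (2 * np.pi), kappa

rng = np.random.default_rng(1)
# visits concentrated around 9am; the modulo wraps naturally around midnight
hours = (9 + rng.normal(0, 1.0, size=1000)) % 24
mu_hours, kappa = fit_von_mises(hours)
print(round(mu_hours, 1), kappa > 1)  # mean close to 9.0, high concentration
```

The circular treatment is the point: an ordinary Gaussian on raw hours would break for users active around midnight, where 23:00 and 01:00 are two hours apart, not twenty-two.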

10 citations


Journal ArticleDOI
TL;DR: An ensembling technique for SKDB, called ensemble of SKDB (ESKDB), is introduced and it is shown that ESKDB significantly outperforms RF on categorical and numerical data, as well as rivalling XGBoost.
Abstract: Bayesian network classifiers are, functionally, an interesting class of models, because they can be learnt out-of-core, i.e. without needing to hold the whole training data in main memory. The selective K-dependence Bayesian network classifier (SKDB) is state of the art in this class of models and has been shown to rival random forest (RF) on problems with categorical data. In this paper, we introduce an ensembling technique for SKDB, called ensemble of SKDB (ESKDB). We show that ESKDB significantly outperforms RF on categorical and numerical data, as well as rivalling XGBoost. ESKDB combines three main components: (1) an effective strategy to vary the networks built by the single classifiers (to make it an ensemble), (2) a stochastic discretization method which both allows numerical data to be tackled and further increases the variance between different components of our ensemble, and (3) a superior smoothing technique to ensure proper calibration of ESKDB's probabilities. We conduct a large set of experiments with 72 datasets to study the properties of ESKDB (through a sensitivity analysis) and show its competitiveness with the state of the art.

10 citations


Journal ArticleDOI
TL;DR: In this paper, a bag of biterms modeling (BBM) framework is proposed for modeling massive, dynamic, and short text collections, which can be easily deployed for a large class of probabilistic models; its usefulness is demonstrated with two well-known models.
Abstract: Analyzing texts from social media encounters many challenges due to their unique characteristics of shortness, massiveness, and dynamism. Short texts do not provide enough context information, causing the failure of traditional statistical models. Furthermore, many applications face massive and dynamic short texts, posing various computational challenges to current batch learning algorithms. This paper presents a novel framework, namely bag of biterms modeling (BBM), for modeling massive, dynamic, and short text collections. BBM comprises two main ingredients: (1) the concept of bag of biterms (BoB) for representing documents, and (2) a simple way to help statistical models include BoB. Our framework can be easily deployed for a large class of probabilistic models, and we demonstrate its usefulness with two well-known models: latent Dirichlet allocation (LDA) and hierarchical Dirichlet process (HDP). By exploiting both terms (words) and biterms (pairs of words), the major advantages of BBM are: (1) it enhances the length of the documents and makes the context more coherent by emphasizing word connotation and co-occurrence via the bag of biterms, and (2) it inherits inference and learning algorithms from the primitive model, making it straightforward to design online and streaming algorithms for short texts. Extensive experiments suggest that BBM outperforms several state-of-the-art models. We also point out that the BoB representation performs better than traditional representations (e.g., bag of words, tf-idf) even for normal texts.
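The BoB representation itself is simple to sketch: augment a document's bag of words with unordered within-window word pairs. The function name and window size below are illustrative assumptions, not the paper's exact construction.

```python
from collections import Counter

def bag_of_biterms(tokens, window=3):
    """Represent a short document as its words plus within-window word pairs.

    A sketch of the BoB idea: unordered co-occurring pairs (biterms) are
    added to the ordinary bag of words, lengthening short documents and
    making their context more coherent.
    """
    bag = Counter(tokens)                      # start from the bag of words
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + window, len(tokens))):
            pair = tuple(sorted((tokens[i], tokens[j])))
            if pair[0] != pair[1]:             # skip self-pairs
                bag[pair] += 1
    return bag

doc = "apple banana apple cherry".split()
bob = bag_of_biterms(doc)
print(bob[("apple", "banana")])  # 2
```

A downstream model such as LDA or HDP then treats both the words and the biterm pairs as vocabulary items, which is why the framework inherits the base model's inference algorithms unchanged.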

9 citations


Journal ArticleDOI
TL;DR: A three-layer latent variable model is proposed that relaxes such an assumption and therefore provides a more flexible solution to identify the frequency components and characterize their evolution over time for non-stationary time series with multi-component signals.
Abstract: Time-frequency analysis (TFA) plays an important role in various engineering and biomedical fields. For a non-stationary time series, a common practice is to divide data into segments under the piecewise stationarity assumption and perform TFA for each segment. In this article, we propose a three-layer latent variable model that relaxes such an assumption and therefore provides a more flexible solution to identify the frequency components and characterize their evolution over time for non-stationary time series with multi-component signals. Our proposed model is built upon hierarchical Dirichlet process (HDP), hidden Markov model (HMM) and extended time-varying autoregressive (ETVAR) model. The proposed approach does not impose any restrictions on the number and locations of segments, or the number and values of the frequency components within a segment. Both the simulation studies and real data applications demonstrate the superiority of the proposed method over existing methods.
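In (E)TVAR-style models, each regime's frequency content is carried by its autoregressive coefficients. As a hedged, minimal illustration (standard AR(2) theory, not the paper's three-layer model), the oscillation frequency implied by AR(2) coefficients with complex roots can be recovered in closed form:

```python
import numpy as np

def ar2_frequency(a1, a2, fs=1.0):
    """Oscillation frequency implied by AR(2) coefficients (complex-root case).

    For x_t = a1*x_{t-1} + a2*x_{t-2} + e_t with roots rho*exp(±i*2*pi*f),
    we have a1 = 2*rho*cos(2*pi*f) and a2 = -rho**2, so f can be read off.
    """
    if a1 ** 2 + 4 * a2 >= 0:
        raise ValueError("real roots: no oscillatory component")
    return fs * np.arccos(a1 / (2 * np.sqrt(-a2))) / (2 * np.pi)

# AR(2) tuned to oscillate at 0.1 cycles/sample with root modulus 0.95
f, rho = 0.1, 0.95
a1 = 2 * rho * np.cos(2 * np.pi * f)
a2 = -rho ** 2
print(round(ar2_frequency(a1, a2), 3))  # 0.1
```

Identifying segments and fitting time-varying coefficients (the HDP-HMM part) is the hard problem the paper solves; this snippet only shows how frequency components are read off once coefficients are known.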

7 citations


Posted Content
TL;DR: The proposed approach, the hierarchical Dirichlet process latent position clustering model (HDP-LPCM), incorporates transitivity, models both individual and group level aspects of the data, and avoids the computationally expensive selection of the number of groups required by most popular methods.
Abstract: The evolution of communities in dynamic (time-varying) network data is a prominent topic of interest. A popular approach to understanding these dynamic networks is to embed the dyadic relations into a latent metric space. While methods for clustering with this approach exist for dynamic networks, they all assume a static community structure. This paper presents a Bayesian nonparametric model for dynamic networks that can model networks with evolving community structures. Our model extends existing latent space approaches by explicitly modeling the additions, deletions, splits, and mergers of groups with a hierarchical Dirichlet process hidden Markov model. Our proposed approach, the hierarchical Dirichlet process latent position clustering model (HDP-LPCM), incorporates transitivity, models both individual and group level aspects of the data, and avoids the computationally expensive selection of the number of groups required by most popular methods. We provide a Markov chain Monte Carlo estimation algorithm and apply our method to synthetic and real-world networks to demonstrate its performance.

Journal ArticleDOI
TL;DR: A hierarchical Dirichlet process nonnegative matrix factorization model in which the Gaussian mixture model is used to approximate the complex noise distribution and a mean-field variational inference algorithm is derived for the proposed nonparametric Bayesian model.
Abstract: Nonnegative Matrix Factorization (NMF) is valuable in many applications of blind source separation, signal processing and machine learning. A number of algorithms that can infer nonnegative latent factors have been developed, but most of these assume a specific noise kernel. This is insufficient to deal with complex noise in real scenarios. In this paper, we present a hierarchical Dirichlet process nonnegative matrix factorization (DPNMF) model in which the Gaussian mixture model is used to approximate the complex noise distribution. Moreover, the model is cast in the nonparametric Bayesian framework by using a Dirichlet process mixture to infer the necessary number of Gaussian components. We derive a mean-field variational inference algorithm for the proposed nonparametric Bayesian model. We first test the model on synthetic data sets contaminated by Gaussian, sparse and mixed noise. We then apply it to extract muscle synergies from the electromyographic (EMG) signal and to select discriminative features for motor imagery single-trial electroencephalogram (EEG) classification. Experimental results demonstrate that DPNMF performs better in extracting the latent nonnegative factors in comparison with state-of-the-art methods.
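For context, the deterministic baseline that DPNMF generalizes can be sketched with classical multiplicative updates. This is Lee-Seung NMF under an implicit Gaussian noise model, not the paper's Bayesian algorithm; the paper's contribution is replacing that fixed noise assumption with a Dirichlet process mixture of Gaussians.

```python
import numpy as np

def nmf(V, rank, iters=500, seed=0, eps=1e-9):
    """Classical NMF with multiplicative updates minimizing ||V - WH||_F."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, rank)) + eps
    H = rng.random((rank, m)) + eps
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # update activations
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # update basis vectors
    return W, H

rng = np.random.default_rng(42)
V = rng.random((20, 3)) @ rng.random((3, 30))  # exactly rank-3 nonnegative data
W, H = nmf(V, rank=3)
err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
print("relative error:", round(err, 4))
```

On exactly low-rank data the reconstruction error drops to near zero; on real noisy data the choice of noise model becomes the deciding factor, which motivates the Gaussian-mixture noise treatment above.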

Posted Content
TL;DR: A novel framework, namely bag of biterms modeling (BBM), for modeling massive, dynamic, and short text collections and inherits inference and learning algorithms from the primitive to make it straightforward to design online and streaming algorithms for short texts.
Abstract: Analyzing texts from social media encounters many challenges due to their unique characteristics of shortness, massiveness, and dynamism. Short texts do not provide enough context information, causing the failure of traditional statistical models. Furthermore, many applications face massive and dynamic short texts, posing various computational challenges to current batch learning algorithms. This paper presents a novel framework, namely Bag of Biterms Modeling (BBM), for modeling massive, dynamic, and short text collections. BBM comprises two main ingredients: (1) the concept of Bag of Biterms (BoB) for representing documents, and (2) a simple way to help statistical models include BoB. Our framework can be easily deployed for a large class of probabilistic models, and we demonstrate its usefulness with two well-known models: Latent Dirichlet Allocation (LDA) and Hierarchical Dirichlet Process (HDP). By exploiting both terms (words) and biterms (pairs of words), the major advantages of BBM are: (1) it enhances the length of the documents and makes the context more coherent by emphasizing word connotation and co-occurrence via Bag of Biterms, and (2) it inherits inference and learning algorithms from the primitive model, making it straightforward to design online and streaming algorithms for short texts. Extensive experiments suggest that BBM outperforms several state-of-the-art models. We also point out that the BoB representation performs better than traditional representations (e.g., Bag of Words, tf-idf) even for normal texts.

Journal ArticleDOI
TL;DR: The dynamic temporal detection method is proposed to detect multiple attack patterns of ADS-B data sequence using hidden Markov model with sticky hierarchical Dirichlet process, and the feasibility and accuracy of the proposed method are validated.

Journal ArticleDOI
TL;DR: A copula guided parallel Gibbs sampling algorithm is developed which can adjust the number of topics dynamically and capture the latent semantic dependencies between words that compose a coherent segment in HDP.
Abstract: Hierarchical Dirichlet Process (HDP) has attracted much attention in the research community of natural language processing. Given a corpus, HDP is able to determine the number of topics automatically, possessing an important feature dubbed nonparametric that overcomes the challenging issue of manually specifying a suitable topic number in parametric topic models, such as Latent Dirichlet Allocation (LDA). Nevertheless, HDP requires a much higher computational cost than LDA for parameter estimation. By taking advantage of multi-threading, a parallel Gibbs sampling algorithm has been proposed to estimate parameters for HDP based on the equivalence between HDP and the Gamma-Gamma Poisson Process (G2PP) in terms of the generative process. Unfortunately, that parallel Gibbs sampling algorithm requires manually applying a finite approximation to the number of topics (i.e., predefining the topic number), and thus cannot retain the nonparametric feature of HDP. Another drawback of the above models is that they fail to capture the semantic dependencies between words, because the topic assignments of words are independent of each other. Although some work has been done on phrase-based topic modelling, these existing methods are limited by either forcing an entire phrase to share a common topic or requiring complex and time-consuming phrase mining methods. In this paper, we aim to develop a copula guided parallel Gibbs sampling algorithm for HDP which can adjust the number of topics dynamically and capture the latent semantic dependencies between words that compose a coherent segment. Extensive experiments on real-world datasets indicate that our method achieves low perplexities and high topic coherence scores with a small time cost. In addition, we validate the effectiveness of our method on the modelling of word semantic dependencies by comparing the extracted topical phrases with those learned by state-of-the-art phrase-based baselines.

Book ChapterDOI
30 Jul 2020
TL;DR: This experiment uses the generative statistical model Latent Dirichlet Allocation (LDA), the most widely explored model in topic modeling, and a nonparametric Bayesian approach, the Hierarchical Dirichlet Process (HDP), to extract topics from 150 lyrics samples of Manipuri songs written using Roman script.
Abstract: With the increase in an enormous amount of data, text analysis has become a challenging task. Techniques like classification, categorization, summarization and topic modeling have become part of every natural language processing activity. In this experiment, we aim to perform sentiment class extraction from lyrics using topic modeling techniques. We used the generative statistical model Latent Dirichlet Allocation (LDA), which is the most widely explored model in topic modeling, and another nonparametric Bayesian approach, the Hierarchical Dirichlet Process (HDP), to extract topics from 150 lyrics samples of Manipuri songs written using Roman script. We observe that these unsupervised techniques are able to obtain the underlying sentiment classes of lyrics in the form of topics.

Journal ArticleDOI
TL;DR: This work proposes an approximation for a general class of hierarchical processes, which leads to an efficient conditional Gibbs sampling algorithm and provides both empirical and theoretical support for such a procedure.
Abstract: Hierarchical normalized discrete random measures identify a general class of priors that is suited to flexibly learn how the distribution of a response variable changes across groups of observations. A special case widely used in practice is the hierarchical Dirichlet process. Although current theory on hierarchies of nonparametric priors yields all relevant tools for drawing posterior inference, their implementation comes at a high computational cost. We fill this gap by proposing an approximation for a general class of hierarchical processes, which leads to an efficient conditional Gibbs sampling algorithm. The key idea consists of a deterministic truncation of the underlying random probability measures leading to a finite dimensional approximation of the original prior law. We provide both empirical and theoretical support for such a procedure.
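The paper's key idea, deterministic truncation of the underlying random measures, can be illustrated for a single Dirichlet process via truncated stick-breaking. The helper name and the choice to assign all leftover mass to the last atom are assumptions of this sketch, not the paper's general construction for hierarchical processes.

```python
import numpy as np

def truncated_stick_breaking(alpha, truncation, rng):
    """Finite stick-breaking weights approximating the weights of DP(alpha).

    Keep the first `truncation` sticks and assign the remaining mass to the
    last atom, yielding a finite-dimensional, exactly normalized
    approximation of the random probability measure.
    """
    betas = rng.beta(1.0, alpha, size=truncation - 1)
    remaining = np.concatenate([[1.0], np.cumprod(1 - betas)])
    weights = np.empty(truncation)
    weights[:-1] = betas * remaining[:-1]   # broken-off stick pieces
    weights[-1] = remaining[-1]             # leftover mass on the last atom
    return weights

rng = np.random.default_rng(0)
w = truncated_stick_breaking(alpha=1.0, truncation=50, rng=rng)
print(np.isclose(w.sum(), 1.0))  # True: truncation keeps the weights normalized
```

Because the leftover mass shrinks geometrically in expectation, a modest truncation level already approximates the infinite measure closely, which is what makes the conditional Gibbs sampler efficient.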

Journal ArticleDOI
TL;DR: This paper proposes a hierarchical Dirichlet process of generalized linear models in which the latent heterogeneity can depend on context-level features and demonstrates the importance of accounting for latent heterogeneity with a Monte Carlo exercise and with two applications that replicate recent scholarly work.
Abstract: Classical generalized linear models assume that marginal effects are homogeneous in the population given the observed covariates. Researchers can never be sure a priori if that assumption is adequate. Recent literature in statistics and political science has proposed models that use Dirichlet process priors to deal with the possibility of latent heterogeneity in the covariate effects. In this paper, we extend and generalize those approaches and propose a hierarchical Dirichlet process of generalized linear models in which the latent heterogeneity can depend on context-level features. Such a model is important in comparative analyses when the data come from different countries and the latent heterogeneity can be a function of country-level features. We provide a Gibbs sampler for the general model, a special Gibbs sampler for Gaussian outcome variables, and a Hamiltonian Monte Carlo within Gibbs to handle discrete outcome variables. We demonstrate the importance of accounting for latent heterogeneity with a Monte Carlo exercise and with two applications that replicate recent scholarly work. We show how Simpson's paradox can emerge in the empirical analysis if latent heterogeneity is ignored and how the proposed model can be used to estimate heterogeneity in the effect of covariates.
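The Simpson's-paradox scenario the authors warn about is easy to reproduce numerically: each context has a positive effect of x on y, yet pooling the contexts flips the estimated sign. A minimal synthetic illustration (not the paper's data or model):

```python
import numpy as np

rng = np.random.default_rng(0)

def slope(x, y):
    """Ordinary least-squares slope of y on x."""
    return np.polyfit(x, y, 1)[0]

# Two groups ("contexts") with a positive within-group effect of x on y,
# but group-level intercepts arranged so that the pooled slope is negative.
x1 = rng.normal(0, 1, 300); y1 = 10 + 1.0 * x1 + rng.normal(0, 0.3, 300)
x2 = rng.normal(6, 1, 300); y2 = 0 + 1.0 * x2 + rng.normal(0, 0.3, 300)
x = np.concatenate([x1, x2]); y = np.concatenate([y1, y2])

print(slope(x1, y1) > 0, slope(x2, y2) > 0, slope(x, y) < 0)  # True True True
```

A homogeneous-effects GLM fit to the pooled data would report a negative covariate effect; a model that learns the latent grouping, as proposed here, recovers the positive within-context effects instead.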

Posted Content
TL;DR: The paper adopts the dependent Dirichlet process (DDP) to learn the multiple object state prior by exploiting inherent dynamic dependencies in the state transition using the dynamic clustering property of the DDP.
Abstract: Some challenging problems in tracking multiple objects include the time-dependent cardinality, unordered measurements and object parameter labeling. In this paper, we employ Bayesian nonparametric methods to address these challenges. In particular, we propose modeling the multiple object parameter state prior using the dependent Dirichlet and Pitman-Yor processes. These nonparametric models have been shown to be more flexible and robust, when compared to existing methods, for estimating the time-varying number of objects, providing object labeling and identifying measurement to object associations. Monte Carlo sampling methods are then proposed to efficiently learn the trajectory of objects from noisy measurements. Using simulations, we demonstrate the estimation performance advantage of the new methods when compared to existing algorithms such as the generalized labeled multi-Bernoulli filter.

Posted Content
TL;DR: The Local Hierarchical Dirichlet Process (Local-HDP) as discussed by the authors is a nonparametric hierarchical Bayesian approach for open-ended 3D object categorization, which allows an agent to learn independent topics for each category incrementally and to adapt to the environment in time.
Abstract: We introduce a non-parametric hierarchical Bayesian approach for open-ended 3D object categorization, named the Local Hierarchical Dirichlet Process (Local-HDP). This method allows an agent to learn independent topics for each category incrementally and to adapt to the environment over time. Hierarchical Bayesian approaches like Latent Dirichlet Allocation (LDA) can transform low-level features into high-level conceptual topics for 3D object categorization. However, the efficiency and accuracy of LDA-based approaches depend on the number of topics, which is chosen manually. Moreover, fixing the number of topics for all categories can lead to overfitting or underfitting of the model. In contrast, the proposed Local-HDP can autonomously determine the number of topics for each category. Furthermore, an inference method is proposed that results in a fast posterior approximation. Experiments show that Local-HDP outperforms other state-of-the-art approaches in terms of accuracy, scalability, and memory efficiency by a large margin.

Posted Content
TL;DR: Not all the topics were being used for clustering on the first run of the LDA model, which results in less effective clustering; a reasoning on how Zeno's paradox is avoided is also established.
Abstract: There has been an increasingly popular trend in universities toward curriculum transformation to make teaching more interactive and suitable for online courses. An increase in the popularity of online courses would result in an increase in the number of course-related queries for academics. This is compounded by the fact that, if lectures were delivered in a video-on-demand format, there would be no fixed time at which the majority of students could ask questions. When questions are asked in a lecture there is a negligible chance of similar questions being asked repeatedly, but asynchronously this is more likely. In order to reduce the time spent on answering each individual question, clustering them is an ideal choice. There are different unsupervised models fit for text clustering, of which the Latent Dirichlet Allocation (LDA) model is the most commonly used. We use the Hierarchical Dirichlet Process to determine an optimal topic-number input for our LDA model runs. Due to the probabilistic nature of these topic models, their outputs vary across runs. The general trend we found is that not all the topics were used for clustering on the first run of the LDA model, which results in less effective clustering. To tackle this probabilistic output, we recursively apply the LDA model to the effective topics being used until we obtain an efficiency ratio of 1. Through our experimental results we also establish a reasoning on how Zeno's paradox is avoided.
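A hypothetical sketch of the "efficiency ratio" bookkeeping: count the topics that actually dominate some document and compare that with the number requested. The function name and the dominance threshold are assumptions for illustration, not the paper's code; in the paper's loop, LDA would be re-run with the reduced topic count until the ratio reaches 1.

```python
import numpy as np

def effective_topics(doc_topic, threshold=0.05):
    """Number of topics that appear as some document's dominant topic.

    Topics that never dominate any document above `threshold` are treated
    as unused when computing the used/requested efficiency ratio.
    """
    dominant = doc_topic.argmax(axis=1)
    strong = doc_topic.max(axis=1) >= threshold
    return len(set(dominant[strong]))

# toy document-topic matrix: 4 documents, 3 requested topics, topic 2 unused
doc_topic = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.1, 0.9, 0.0],
    [0.2, 0.8, 0.0],
])
used = effective_topics(doc_topic)
print(used, used / doc_topic.shape[1])  # 2 topics used out of 3 requested
```

Here the efficiency ratio is 2/3, so the recursive procedure would rerun LDA with 2 topics.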

Posted ContentDOI
10 Nov 2020-bioRxiv
TL;DR: A hierarchical Dirichlet process mixture model incorporates the correlation structure induced by a structured sampling arrangement; this model is shown to improve the quality of inference and to outperform standard numerical and statistical methods for decomposing admixed count data.
Abstract: There are distinguishing features or "hallmarks" of cancer that are found across tumors, individuals, and types of cancer, and these hallmarks can be driven by specific genetic mutations. Yet, within a single tumor there is often extensive genetic heterogeneity as evidenced by single-cell and bulk DNA sequencing data. The goal of this work is to jointly infer the underlying genotypes of tumor subpopulations and the distribution of those subpopulations in individual tumors by integrating single-cell and bulk sequencing data. Understanding the genetic composition of the tumor at the time of treatment is important in the personalized design of targeted therapeutic combinations and monitoring for possible recurrence after treatment. We propose a hierarchical Dirichlet process mixture model that incorporates the correlation structure induced by a structured sampling arrangement and we show that this model improves the quality of inference. We develop a representation of the hierarchical Dirichlet process prior as a Gamma-Poisson hierarchy and we use this representation to derive a fast Gibbs sampling inference algorithm using the augment-and-marginalize method. Experiments with simulation data show that our model outperforms standard numerical and statistical methods for decomposing admixed count data. Analysis of a real acute lymphoblastic leukemia sequencing dataset shows that our model improves upon state-of-the-art bioinformatic methods. An interpretation of the results of our model on this real dataset reveals co-mutated loci across samples.
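The Gamma-Poisson hierarchy mentioned above can be illustrated numerically. This is a generic sketch of the hierarchy, not the paper's actual model, and the shape/scale values are arbitrary: mixing a Poisson over a Gamma-distributed rate yields negative-binomial-style overdispersed counts, a conjugate structure that augment-and-marginalize Gibbs schemes can exploit.

```python
import numpy as np

rng = np.random.default_rng(1)

# Gamma-Poisson hierarchy: rate ~ Gamma(shape, scale),
# count | rate ~ Poisson(rate).
shape, scale = 2.0, 3.0
rates = rng.gamma(shape, scale, size=200_000)
counts = rng.poisson(rates)

# Marginally the counts are negative binomial with
# mean = shape * scale and variance = shape * scale * (1 + scale),
# i.e. overdispersed relative to a plain Poisson (variance > mean).
mean_theory = shape * scale               # 6.0
var_theory = shape * scale * (1 + scale)  # 24.0
```

The empirical mean and variance of `counts` should match the theoretical values closely at this sample size, which is what makes the representation a faithful reparameterization rather than an approximation.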

Book ChapterDOI
11 May 2020
TL;DR: This work applies a recent advanced smoothing method called Hierarchical Dirichlet Process (HDP) to PETs, and proposes a novel hierarchical smoothing approach called HGS as an alternative that makes a single tree almost as good as a Random Forest with 10 trees.
Abstract: Decision trees are still seeing use in online, non-stationary and embedded contexts, as well as for interpretability. For applications like ranking and cost-sensitive classification, probability estimation trees (PETs) are used. These are built using smoothing or calibration techniques. Older smoothing techniques used counts local to a leaf node, but a few more recent techniques consider the broader context of a node when doing estimation. We apply a recent advanced smoothing method called the Hierarchical Dirichlet Process (HDP) to PETs, and then propose a novel hierarchical smoothing approach called Hierarchical Gradient Smoothing (HGS) as an alternative. HGS smooths leaf nodes toward all of their ancestors, instead of the recursive smoothing toward the parent used by HDP. HGS is made faster by efficiently optimizing the Leave-One-Out Cross-Validation (LOOCV) loss measure using gradient descent, instead of the sampling used in HDP. An extensive set of experiments is conducted on 143 datasets, showing that our HGS estimates are not only more accurate but are also obtained in a fraction of the time HDP requires. Moreover, HGS makes a single tree almost as good as a Random Forest with 10 trees. For applications that require more interpretability and efficiency, a single decision tree plus HGS is preferable.
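HGS itself optimizes a LOOCV loss by gradient descent; as a simplified stand-in, the recursion below shows the general shape of hierarchical smoothing, where a leaf's class-probability estimate is shrunk toward its ancestors' estimates. This is an illustrative m-estimate-style smoother, not the authors' algorithm, and the counts and the strength `m` are invented.

```python
import numpy as np

def smoothed_estimate(path_counts, prior, m=2.0):
    """Recursively smooth class-probability estimates from root to leaf:
    p_node = (counts + m * p_parent) / (n + m).
    path_counts: per-class count vectors along the root-to-leaf path."""
    p = np.asarray(prior, dtype=float)
    for counts in path_counts:
        counts = np.asarray(counts, dtype=float)
        p = (counts + m * p) / (counts.sum() + m)
    return p

# The root sees balanced classes; the leaf sees few, skewed samples.
root_counts = [50, 50]
leaf_counts = [3, 0]
est = smoothed_estimate([root_counts, leaf_counts], prior=[0.5, 0.5], m=2.0)
# est is [0.8, 0.2]: pulled toward the balanced ancestor distribution,
# away from the raw leaf MLE of [1.0, 0.0].
```

The point of the hierarchy is visible in the example: a leaf with only three one-sided samples no longer produces a degenerate probability of 1.0, which is exactly what ranking and cost-sensitive applications need from a PET.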

Journal ArticleDOI
02 Jan 2020
TL;DR: A nonparametric Bayesian graph topic model (GTM) based on hierarchical Dirichlet process (HDP) and the combination of HDP and GTM is proposed, which is named as HDP–GTM.
Abstract: In this paper, a nonparametric Bayesian graph topic model (GTM) based on hierarchical Dirichlet process (HDP) is proposed. The HDP makes the number of topics selected flexibly, which breaks the lim...

Posted Content
TL;DR: In this paper, a Bayesian nonparametric hidden Markov model assumes a sticky hierarchical Dirichlet process for the switching dynamics between different states, while the periodicities characterizing each state are explored by means of a trans-dimensional Markov chain Monte Carlo sampling step.
Abstract: We propose to model time-varying periodic and oscillatory processes by means of a hidden Markov model where the states are defined through the spectral properties of a periodic regime. The number of states is unknown along with the relevant periodicities, the role and number of which may vary across states. We address this inference problem by a Bayesian nonparametric hidden Markov model assuming a sticky hierarchical Dirichlet process for the switching dynamics between different states while the periodicities characterizing each state are explored by means of a trans-dimensional Markov chain Monte Carlo sampling step. We develop the full Bayesian inference algorithm and illustrate the use of our proposed methodology for different simulation studies as well as an application related to respiratory research which focuses on the detection of apnea instances in human breathing traces.

Posted Content
TL;DR: The disentangled sticky HDP-HMM is proposed, which outperforms the sticky HDP-HMM and the HDP-HMM on both synthetic and real data, and is applied to analyze neural data and segment behavioral video data.
Abstract: The Hierarchical Dirichlet Process Hidden Markov Model (HDP-HMM) has been used widely as a natural Bayesian nonparametric extension of the classical Hidden Markov Model for learning from sequential and time-series data. A sticky extension of the HDP-HMM has been proposed to strengthen the self-persistence probability in the HDP-HMM. However, the sticky HDP-HMM entangles the strength of the self-persistence prior and transition prior together, limiting its expressiveness. Here, we propose a more general model: the disentangled sticky HDP-HMM (DS-HDP-HMM). We develop novel Gibbs sampling algorithms for efficient inference in this model. We show that the disentangled sticky HDP-HMM outperforms the sticky HDP-HMM and HDP-HMM on both synthetic and real data, and apply the new approach to analyze neural data and segment behavioral video data.
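The entanglement the abstract refers to lives in the sticky prior: each transition row is drawn as pi_j ~ Dirichlet(alpha * beta + kappa * delta_j), so the self-persistence strength kappa and the transition concentration alpha interact within one Dirichlet. The sketch below only illustrates the self-transition boost kappa provides; beta is held fixed rather than sampled, and all hyperparameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_transition_matrix(beta, alpha, kappa, rng):
    """Sticky-prior rows: pi_j ~ Dirichlet(alpha * beta + kappa * delta_j).
    kappa > 0 inflates the expected self-transition probability."""
    K = len(beta)
    rows = []
    for j in range(K):
        concentration = alpha * beta + kappa * np.eye(K)[j]
        rows.append(rng.dirichlet(concentration))
    return np.vstack(rows)

beta = np.full(4, 0.25)  # shared global state weights (fixed here)
sticky = sample_transition_matrix(beta, alpha=5.0, kappa=20.0, rng=rng)
plain = sample_transition_matrix(beta, alpha=5.0, kappa=0.0, rng=rng)
# Expected self-transition mass: (alpha*0.25 + kappa) / (alpha + kappa)
# = 0.85 for the sticky rows, versus 0.25 without kappa.
```

In the full model beta itself is drawn from a stick-breaking prior over an unbounded state space; disentangling, per the abstract, amounts to giving the self-persistence and transition components separately controllable priors.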

Book ChapterDOI
22 Jul 2020
TL;DR: This chapter takes a Bayesian nonparametric approach in defining a prior on the hidden Markov model that allows for flexibility in addressing the problem of modeling the complex dynamics during robot manipulation tasks.
Abstract: In this chapter, we take a Bayesian nonparametric approach, defining a prior on the hidden Markov model that allows for flexibility in addressing the problem of modeling the complex dynamics during robot manipulation tasks. We first consider underlying dynamics that can be well modeled as a hidden discrete Markov process, but in which there is uncertainty about the cardinality of the state space. Through the use of the hierarchical Dirichlet process (HDP), one can examine an HMM with an unbounded number of possible states. Subsequently, the sticky HDP-HMM is investigated, which allows more robust learning of the complex dynamics through a learned bias that increases the probability of self-transitions. Additionally, although the HDP-HMM and its sticky extension are very flexible time series models, they make a strong Markovian assumption that observations are conditionally independent given the discrete HMM state. This assumption is often insufficient for capturing the temporal dependencies of the observations in real data. To address this issue, we consider extensions of the sticky HDP-HMM for learning switching dynamical processes with a switching linear dynamical system. In the later chapters of this book, we will verify the performance in modeling multimodal time series and present the results of robot movement identification, anomaly monitoring, and anomaly diagnosis.
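As a toy illustration of the switching linear dynamical system extension, the sketch below lets a discrete Markov chain select which linear dynamics drive a continuous state at each step, so that observations depend on their own past rather than only on the discrete state. The two modes, the transition matrix, and the noise scale are hand-picked for illustration and unrelated to the robot data in the chapter.

```python
import numpy as np

rng = np.random.default_rng(4)

# Two-mode SLDS sketch: a discrete state z_t picks the dynamics matrix.
A = [np.array([[0.99, 0.0], [0.0, 0.99]]),   # mode 0: slow decay
     np.array([[0.0, -1.0], [1.0, 0.0]])]    # mode 1: 90-degree rotation
P = np.array([[0.95, 0.05],                   # sticky discrete transitions
              [0.05, 0.95]])

z = 0
x = np.array([1.0, 0.0])
states, trajectory = [], []
for t in range(200):
    z = rng.choice(2, p=P[z])                       # discrete switch
    x = A[z] @ x + 0.01 * rng.standard_normal(2)    # continuous dynamics
    states.append(z)
    trajectory.append(x.copy())
```

Inference in the chapter's setting goes the other way: given only `trajectory`, recover the mode sequence and the per-mode dynamics, with the sticky HDP prior leaving the number of modes unbounded.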

Posted Content
TL;DR: The proposed variational inference method substantially outperforms its competitor in terms of lower perplexity and much clearer topic-word clustering, and reduces the computational complexity by approximating the posterior functionally instead of updating the stick-breaking parameters individually.
Abstract: The scalable inference for Bayesian nonparametric models with big data is still challenging. Current variational inference methods fail to characterise the correlation structure among latent variables due to the mean-field setting and cannot infer the true posterior dimension because of the universal truncation. To overcome these limitations, we build a general framework to infer Bayesian nonparametric models by maximising the proposed nonparametric evidence lower bound, and then develop a novel approach by combining Monte Carlo sampling and the stochastic variational inference framework. Our method has several advantages over the traditional online variational inference method. First, it achieves a smaller divergence between variational distributions and the true posterior by factorising variational distributions under the conditional setting instead of the mean-field setting to capture the correlation pattern. Second, it reduces the risk of underfitting or overfitting by truncating the dimension adaptively rather than using a prespecified truncated dimension for all latent variables. Third, it reduces the computational complexity by approximating the posterior functionally instead of updating the stick-breaking parameters individually. We apply the proposed method on hierarchical Dirichlet process and gamma-Dirichlet process models, two essential Bayesian nonparametric models in topic analysis. The empirical study on three large datasets including arXiv, New York Times and Wikipedia reveals that our proposed method substantially outperforms its competitor in terms of lower perplexity and much clearer topic-word clustering.
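The adaptive-truncation idea can be sketched as a stick-breaking loop that grows the representation until the unassigned stick mass drops below a tolerance, instead of fixing a truncation level K in advance. This is a toy prior-sampling loop, not the paper's functional posterior approximation; gamma, eps, and the safety cap are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)

def adaptive_truncation(gamma, eps, rng, max_k=1000):
    """Grow the stick-breaking truncation until the leftover
    (unassigned) stick mass falls below eps."""
    weights = []
    remaining = 1.0
    while remaining > eps and len(weights) < max_k:
        b = rng.beta(1.0, gamma)
        weights.append(remaining * b)   # mass assigned to the new atom
        remaining *= (1.0 - b)          # mass still unassigned
    return np.array(weights), remaining

w, leftover = adaptive_truncation(gamma=3.0, eps=1e-4, rng=rng)
# len(w) is now data- (here: draw-) dependent rather than prespecified,
# and w.sum() + leftover == 1 by construction.
```

The same stopping rule, applied per latent variable during inference, is what lets the truncated dimension differ across variables instead of being one global K.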

Proceedings ArticleDOI
11 Dec 2020
TL;DR: The Topic-oriented Gravity Model (TopicGM) investigates a directed and weighted network incorporating users' topical aspects: an individual is first represented as the textual content he created or read, and a topical network is then constructed where nodes represent individuals and an edge connects two individuals in the direction from the poster to the reader, with a topical confidence weight.
Abstract: The task of identifying influencers provides many benefits for practical applications such as recommendation systems, viral marketing, and information monitoring. This issue can traditionally be solved via a network structure with several proposed graph algorithms. However, most of them employ a time-consuming global computation; some consider only undirected and unweighted networks, which may be inconsistent with the nature of the data. Inspired by the law of gravity in physics, we present the Topic-oriented Gravity Model (TopicGM), which investigates a directed and weighted network incorporating users' topical aspects. The key concept is that an individual is first represented as the textual content he created or read. Afterwards, TopicGM adopts a topic model, i.e., the Hierarchical Dirichlet Process (HDP), to classify topics over those contents. A topical network is then constructed where nodes represent individuals and an edge connects two individuals in the direction from the poster to the reader, with a topical confidence weight. Finally, we apply the gravity formula to calculate influence scores and rank individuals. The experimental results, conducted on real-world data gathered from Pantip.com (a famous Thai web forum), show that our approach outperforms many state-of-the-art baselines by accurately identifying influencers within the top of the rankings.
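Gravity-style influence models score node i by summing m_i * m_j / d_ij^2 over other nodes j, by analogy with Newton's law. The sketch below is a generic version of that formula: the node "masses" and the hand-made distance matrix are invented for illustration, whereas in TopicGM both would be derived from the HDP topics and the directed, weighted topical network.

```python
import numpy as np

def gravity_scores(mass, dist):
    """Gravity-style influence: score_i = sum over j != i of
    mass_i * mass_j / dist_ij^2, skipping unreachable pairs."""
    n = len(mass)
    scores = np.zeros(n)
    for i in range(n):
        for j in range(n):
            if i != j and np.isfinite(dist[i, j]):
                scores[i] += mass[i] * mass[j] / dist[i, j] ** 2
    return scores

# Node 0 is "heavy" (high topical activity) and centrally located.
mass = np.array([5.0, 1.0, 1.0, 1.0])
dist = np.array([[0, 1, 1, 2],
                 [1, 0, 2, 3],
                 [1, 2, 0, 1],
                 [2, 3, 1, 0]], dtype=float)
scores = gravity_scores(mass, dist)
ranking = np.argsort(-scores)  # node 0 ranks first
```

Because both mass and distance enter the score, a well-connected but low-activity node (node 2 here) can still outrank a peripheral one, which is the behavior a purely degree-based ranking would miss.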