
Showing papers on "Expectation–maximization algorithm published in 2012"


Journal ArticleDOI
TL;DR: In this article, the authors show that maximum likelihood estimates of the common factors are consistent as the size of the cross-section (n) and the sample size (T) go to infinity along any path of n and T, and that maximum likelihood is therefore viable for large n.
Abstract: Is maximum likelihood suitable for factor models in large cross-sections of time series? We answer this question from both an asymptotic and an empirical perspective. We show that estimates of the common factors based on maximum likelihood are consistent for the size of the cross-section (n) and the sample size (T) going to infinity along any path of n and T and that therefore maximum likelihood is viable for n large. The estimator is robust to misspecification of the cross-sectional and time series correlation of the idiosyncratic components. In practice, the estimator can be easily implemented using the Kalman smoother and the EM algorithm as in traditional factor analysis.
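For readers who want the mechanics behind such estimators, here is a minimal numpy sketch of EM for a static factor model (the classical regression-style updates); the dynamic factor model in the paper would replace this E-step with a Kalman smoother. The initialization and names below are illustrative assumptions, not the authors' code.

```python
import numpy as np

def factor_em(X, k, n_iters=100):
    """Illustrative EM sketch for the static factor model x_t = Lambda f_t + e_t,
    with e_t ~ N(0, Psi) diagonal and f_t ~ N(0, I).
    X is (T, n) and assumed mean-centered; returns loadings and idiosyncratic variances.
    Not the paper's dynamic-factor / Kalman-smoother implementation."""
    T, n = X.shape
    Lam = np.linalg.svd(X, full_matrices=False)[2][:k].T   # (n, k) PCA-style start
    Psi = X.var(axis=0).copy()                             # diagonal noise variances
    for _ in range(n_iters):
        # E-step: posterior moments of the factors given current (Lambda, Psi)
        beta = Lam.T @ np.linalg.inv(Lam @ Lam.T + np.diag(Psi))   # (k, n)
        Ef = X @ beta.T                                            # (T, k) E[f_t | x_t]
        Eff = T * (np.eye(k) - beta @ Lam) + Ef.T @ Ef             # sum_t E[f_t f_t' | x_t]
        # M-step: regression-style updates of loadings and noise variances
        Lam = (X.T @ Ef) @ np.linalg.inv(Eff)
        Psi = np.diag(X.T @ X - Lam @ Ef.T @ X) / T
    return Lam, Psi
```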

497 citations


Journal ArticleDOI
TL;DR: A novel Bayesian probabilistic method that jointly estimates the scene albedo and depth from a single foggy image by modeling them as statistically independent latent layers and exploiting natural image and depth statistics as priors on these hidden layers.
Abstract: Atmospheric conditions induced by suspended particles, such as fog and haze, severely alter the scene appearance. Restoring the true scene appearance from a single observation made in such bad weather conditions remains a challenging task due to the inherent ambiguity that arises in the image formation process. In this paper, we introduce a novel Bayesian probabilistic method that jointly estimates the scene albedo and depth from a single foggy image by fully leveraging their latent statistical structures. Our key idea is to model the image with a factorial Markov random field in which the scene albedo and depth are two statistically independent latent layers and to jointly estimate them. We show that we may exploit natural image and depth statistics as priors on these hidden layers and estimate the scene albedo and depth with a canonical expectation maximization algorithm with alternating minimization. We experimentally evaluate the effectiveness of our method on a number of synthetic and real foggy images. The results demonstrate that the method achieves accurate factorization even on challenging scenes for past methods that only constrain and estimate one of the latent variables.

397 citations


Journal ArticleDOI
TL;DR: In this paper, the authors study the problem of high-dimensional sparse linear regression with noisy, missing and/or dependent data and show that a simple algorithm based on projected gradient descent converges in polynomial time to a small neighborhood of the set of all global minimizers.
Abstract: Although the standard formulations of prediction problems involve fully-observed and noiseless data drawn in an i.i.d. manner, many applications involve noisy and/or missing data, possibly involving dependence, as well. We study these issues in the context of high-dimensional sparse linear regression, and propose novel estimators for the cases of noisy, missing and/or dependent data. Many standard approaches to noisy or missing data, such as those using the EM algorithm, lead to optimization problems that are inherently nonconvex, and it is difficult to establish theoretical guarantees on practical algorithms. While our approach also involves optimizing nonconvex programs, we are able to both analyze the statistical error associated with any global optimum, and more surprisingly, to prove that a simple algorithm based on projected gradient descent will converge in polynomial time to a small neighborhood of the set of all global minimizers. On the statistical side, we provide nonasymptotic bounds that hold with high probability for the cases of noisy, missing and/or dependent data. On the computational side, we prove that under the same types of conditions required for statistical consistency, the projected gradient descent algorithm is guaranteed to converge at a geometric rate to a near-global minimizer. We illustrate these theoretical predictions with simulations, showing close agreement with the predicted scalings.
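A toy numpy sketch of the projected gradient idea for one of the settings above (additive noise in the covariates): the least-squares objective is replaced by a bias-corrected, possibly nonconvex surrogate, and each gradient step is projected back onto an l1 ball. The step size, noise model, and names are illustrative assumptions, not the paper's exact estimator.

```python
import numpy as np

def project_l1_ball(v, radius):
    """Euclidean projection of v onto the l1 ball of the given radius."""
    if np.abs(v).sum() <= radius:
        return v
    u = np.sort(np.abs(v))[::-1]
    css = np.cumsum(u)
    idx = np.arange(1, len(u) + 1)
    rho = idx[u - (css - radius) / idx > 0][-1]
    theta = (css[rho - 1] - radius) / rho
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def corrected_projected_gradient(Z, y, sigma_w, radius, step=0.01, n_iters=500):
    """Toy sketch: sparse regression with additively noisy covariates Z = X + W,
    W having i.i.d. N(0, sigma_w^2) entries. Minimize 0.5 b'Gamma b - gamma'b over
    the l1 ball, where Gamma and gamma are bias-corrected surrogates for X'X/n and
    X'y/n (Gamma may be indefinite, hence the projection keeps iterates bounded)."""
    n, p = Z.shape
    Gamma = Z.T @ Z / n - sigma_w ** 2 * np.eye(p)
    gamma = Z.T @ y / n
    beta = np.zeros(p)
    for _ in range(n_iters):
        grad = Gamma @ beta - gamma
        beta = project_l1_ball(beta - step * grad, radius)
    return beta
```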

376 citations


Proceedings ArticleDOI
16 Apr 2012
TL;DR: The approach offers the first optimal solution to this truth discovery problem and is shown to outperform state-of-the-art fact-finding heuristics, as well as simple baselines such as majority voting.
Abstract: This paper addresses the challenge of truth discovery from noisy social sensing data. The work is motivated by the emergence of social sensing as a data collection paradigm of growing interest, where humans perform sensory data collection tasks. A challenge in social sensing applications lies in the noisy nature of data. Unlike the case with well-calibrated and well-tested infrastructure sensors, humans are less reliable, and the likelihood that participants' measurements are correct is often unknown a priori. Given a set of human participants of unknown reliability together with their sensory measurements, this paper poses the question of whether one can use this information alone to determine, in an analytically founded manner, the probability that a given measurement is true. The paper focuses on binary measurements. While some previous work approached the answer in a heuristic manner, we offer the first optimal solution to the above truth discovery problem. Optimality, in the sense of maximum likelihood estimation, is attained by solving an expectation maximization problem that returns the best guess regarding the correctness of each measurement. The approach is shown to outperform the state of the art fact-finding heuristics, as well as simple baselines such as majority voting.
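A compact Dawid-Skene-style EM sketch in the spirit of the maximum likelihood formulation described above: latent binary claim truths and per-participant reliabilities are estimated jointly. The symmetric-error model, initialization, and names are assumptions for illustration rather than the paper's exact formulation.

```python
import numpy as np

def truth_discovery_em(reports, n_iters=50, prior_true=0.5):
    """Toy sketch. reports: (n_participants, n_claims) array of 0/1 votes (no missing
    votes in this simplified model). Returns posterior P(claim is true) and
    per-participant reliabilities, estimated jointly by EM."""
    n_p, n_c = reports.shape
    rel = np.full(n_p, 0.7)                      # initial reliability guesses
    for _ in range(n_iters):
        # E-step: posterior probability that each claim is true
        log_true = np.log(prior_true) + (
            reports * np.log(rel[:, None]) + (1 - reports) * np.log(1 - rel[:, None])
        ).sum(axis=0)
        log_false = np.log(1 - prior_true) + (
            reports * np.log(1 - rel[:, None]) + (1 - reports) * np.log(rel[:, None])
        ).sum(axis=0)
        m = np.maximum(log_true, log_false)      # stabilize the exponentials
        p_true = np.exp(log_true - m) / (np.exp(log_true - m) + np.exp(log_false - m))
        # M-step: reliability = expected fraction of a participant's votes agreeing with the truth
        agree = reports * p_true[None, :] + (1 - reports) * (1 - p_true[None, :])
        rel = np.clip(agree.mean(axis=1), 1e-3, 1 - 1e-3)
    return p_true, rel
```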

372 citations


Proceedings Article
16 Jun 2012
TL;DR: In this article, a method of moments approach is proposed for parameter estimation for a broad class of high-dimensional mixture models with many components, including multi-view mixtures of Gaussians and hidden Markov models.
Abstract: Mixture models are a fundamental tool in applied statistics and machine learning for treating data taken from multiple subpopulations. The current practice for estimating the parameters of such models relies on local search heuristics (e.g., the EM algorithm) which are prone to failure, and existing consistent methods are unfavorable due to their high computational and sample complexity which typically scale exponentially with the number of mixture components. This work develops an efficient method of moments approach to parameter estimation for a broad class of high-dimensional mixture models with many components, including multi-view mixtures of Gaussians (such as mixtures of axis-aligned Gaussians) and hidden Markov models. The new method leads to rigorous unsupervised learning results for mixture models that were not achieved by previous works; and, because of its simplicity, it offers a viable alternative to EM for practical deployment.

363 citations


Journal ArticleDOI
TL;DR: In this paper, the authors present the probabilistic approach to reconstruction and discuss its optimality and robustness, and detail the derivation of the message passing algorithm for reconstruction and expectation maximization learning of signal model parameters.
Abstract: Compressed sensing is a signal processing method that acquires data directly in a compressed form. This allows one to make fewer measurements than were considered necessary to record a signal, enabling faster or more precise measurement protocols in a wide range of applications. Using an interdisciplinary approach, we have recently proposed in Krzakala et al (2012 Phys. Rev. X 2 021005) a strategy that allows compressed sensing to be performed at acquisition rates approaching the theoretical optimal limits. In this paper, we give a more thorough presentation of our approach, and introduce many new results. We present the probabilistic approach to reconstruction and discuss its optimality and robustness. We detail the derivation of the message passing algorithm for reconstruction and expectation maximization learning of signal-model parameters. We further develop the asymptotic analysis of the corresponding phase diagrams with and without measurement noise, for different distributions of signals, and discuss the best possible reconstruction performances regardless of the algorithm. We also present new efficient seeding matrices, test them on synthetic data and analyze their performance asymptotically.

285 citations


Journal ArticleDOI
TL;DR: In this paper, the authors considered the maximum likelihood estimation of factor models of high dimension, where the number of variables (N) is comparable with or even greater than the total number of observations (T) and developed an inferential theory to establish not only consistency but also the rate of convergence and limiting distributions.
Abstract: This paper considers the maximum likelihood estimation of factor models of high dimension, where the number of variables (N) is comparable with or even greater than the number of observations (T). An inferential theory is developed. We establish not only consistency but also the rate of convergence and the limiting distributions. Five different sets of identification conditions are considered. We show that the distributions of the MLE estimators depend on the identification restrictions. Unlike the principal components approach, the maximum likelihood estimator explicitly allows heteroskedasticities, which are jointly estimated with other parameters. Efficiency of MLE relative to the principal components method is also considered.

249 citations


Proceedings ArticleDOI
21 Mar 2012
TL;DR: This work proposes an empirical-Bayesian technique that simultaneously learns the signal distribution while MMSE-recovering the signal, according to the learned distribution, using AMP; the non-zero distribution is modeled as a Gaussian mixture whose parameters are learned through expectation maximization, with AMP implementing the expectation step.
Abstract: When recovering a sparse signal from noisy compressive linear measurements, the distribution of the signal's non-zero coefficients can have a profound effect on recovery mean-squared error (MSE). If this distribution were known a priori, one could use efficient approximate message passing (AMP) techniques for nearly minimum MSE (MMSE) recovery. In practice, though, the distribution is unknown, motivating the use of robust algorithms like Lasso—which is nearly minimax optimal—at the cost of significantly larger MSE for non-least-favorable distributions. As an alternative, we propose an empirical-Bayesian technique that simultaneously learns the signal distribution while MMSE-recovering the signal—according to the learned distribution—using AMP. In particular, we model the non-zero distribution as a Gaussian mixture, and learn its parameters through expectation maximization, using AMP to implement the expectation step. Numerical experiments confirm the state-of-the-art performance of our approach on a range of signal classes.

205 citations


Book ChapterDOI
TL;DR: The second edition of the book chapter attempts to capture advanced developments in EM methodology in recent years, especially in its applications to the related fields of biomedical and health sciences.
Abstract: The Expectation-Maximization (EM) algorithm is a broadly applicable approach to the iterative computation of maximum likelihood estimates in a wide variety of incomplete-data problems. The EM algorithm has a number of desirable properties, such as its numerical stability, reliable global convergence, and simplicity of implementation. There are, however, two main drawbacks of the basic EM algorithm – lack of an in-built procedure to compute the covariance matrix of the parameter estimates and slow convergence. In addition, some complex problems lead to intractable Expectation-steps and Maximization-steps. The first edition of the book chapter published in 2004 covered the basic theoretical framework of the EM algorithm and discussed further extensions of the EM algorithm to handle complex problems. The second edition attempts to capture advanced developments in EM methodology in recent years, especially in its applications to the related fields of biomedical and health sciences.
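As a concrete reminder of the basic E- and M-steps the chapter builds on, the following is a textbook EM for a univariate two-component Gaussian mixture (a standard illustration, not material from the chapter itself).

```python
import numpy as np

def em_two_gaussians(x, n_iters=100):
    """Textbook EM for a univariate two-component Gaussian mixture
    (standard illustration of the E- and M-steps)."""
    # crude initialization from the data quantiles
    mu = np.percentile(x, [25, 75]).astype(float)
    var = np.array([x.var(), x.var()])
    pi = np.array([0.5, 0.5])
    for _ in range(n_iters):
        # E-step: responsibilities of each component for each point
        dens = (pi / np.sqrt(2 * np.pi * var)) * \
               np.exp(-0.5 * (x[:, None] - mu) ** 2 / var)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: weighted maximum likelihood updates of the mixture parameters
        nk = resp.sum(axis=0)
        pi = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return pi, mu, var
```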

178 citations


Book ChapterDOI
01 Jan 2012
TL;DR: This chapter presents methods that make the MAR assumption, including the EM algorithm for covariance matrices, normal-model multiple imputation (MI), and what the author refers to as FIML (full information maximum likelihood) methods.
Abstract: In this chapter, I present older methods for handling missing data. I then turn to the major new approaches for handling missing data. In this chapter, I present methods that make the MAR assumption. Included in this introduction are the EM algorithm for covariance matrices, normal-model multiple imputation (MI), and what I will refer to as FIML (full information maximum likelihood) methods. Before getting to these methods, however, I talk about the goals of analysis.
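For context, a compact numpy sketch of the "EM algorithm for covariance matrices" under MAR: the E-step fills in conditional means of the missing entries (plus their conditional covariance), and the M-step recomputes the mean and covariance from the expected sufficient statistics. This is an illustrative toy implementation, not code from the chapter.

```python
import numpy as np

def em_mvn_missing(X, n_iters=50):
    """Toy sketch: EM estimates of the mean vector and covariance matrix of a
    multivariate normal when entries of X (n x p) are missing at random (np.nan)."""
    n, p = X.shape
    mu = np.nanmean(X, axis=0)
    sigma = np.diag(np.nanvar(X, axis=0))
    for _ in range(n_iters):
        S = np.zeros((p, p))
        mean_acc = np.zeros(p)
        for i in range(n):
            obs = ~np.isnan(X[i])
            mis = ~obs
            xi = X[i].copy()
            cov_add = np.zeros((p, p))
            if mis.any():
                # E-step: conditional mean and covariance of missing given observed
                reg = sigma[np.ix_(mis, obs)] @ np.linalg.inv(sigma[np.ix_(obs, obs)])
                xi[mis] = mu[mis] + reg @ (X[i, obs] - mu[obs])
                cov_add[np.ix_(mis, mis)] = sigma[np.ix_(mis, mis)] - reg @ sigma[np.ix_(obs, mis)]
            mean_acc += xi
            S += np.outer(xi, xi) + cov_add
        # M-step: update mean and covariance from expected sufficient statistics
        mu = mean_acc / n
        sigma = S / n - np.outer(mu, mu)
    return mu, sigma
```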

168 citations


Journal ArticleDOI
TL;DR: A cyclic minimization algorithm is developed in which the number of Dirichlet modes is inferred based on the minimum description length principle, while the Dirichlet prior on the abundance fractions automatically enforces the constraints imposed by the acquisition process.
Abstract: This paper introduces a new unsupervised hyperspectral unmixing method conceived for linear but highly mixed hyperspectral data sets, in which the simplex of minimum volume, usually estimated by the purely geometrically based algorithms, is far away from the true simplex associated with the endmembers. The proposed method, an extension of our previous studies, resorts to the statistical framework. The abundance fraction prior is a mixture of Dirichlet densities, thus automatically enforcing the constraints on the abundance fractions imposed by the acquisition process, namely, nonnegativity and sum-to-one. A cyclic minimization algorithm is developed where the following are observed: 1) The number of Dirichlet modes is inferred based on the minimum description length principle; 2) a generalized expectation maximization algorithm is derived to infer the model parameters; and 3) a sequence of augmented Lagrangian-based optimizations is used to compute the signatures of the endmembers. Experiments on simulated and real data are presented to show the effectiveness of the proposed algorithm in unmixing problems beyond the reach of the geometrically based state-of-the-art competitors.

Journal ArticleDOI
TL;DR: A general EM-type algorithm is employed for iteratively computing parameter estimates with emphasis on finite mixtures of skew-normal, skew-t, skew-slash and skew-contaminated normal distributions, and a general information-based method for approximating the asymptotic covariance matrix of the estimates is presented.

Journal ArticleDOI
TL;DR: In this article, the authors propose a new criterion that is based on a non-asymptotic approximation of the marginal likelihood, which can be computed through a variational Bayes EM algorithm.
Abstract: It is now widely accepted that knowledge can be acquired from networks by clustering their vertices according to connection profiles. Many methods have been proposed and in this paper we concentrate on the Stochastic Block Model (SBM). The clustering of vertices and the estimation of SBM model parameters have been subject to previous work and numerous inference strategies such as variational Expectation Maximization (EM) and classification EM have been proposed. However, SBM still suffers from a lack of criteria to estimate the number of components in the mixture. To our knowledge, only one model-based criterion, ICL, has been derived for SBM in the literature. It relies on an asymptotic approximation of the Integrated Complete-data Likelihood and recent studies have shown that it tends to be too conservative in the case of small networks. To tackle this issue, we propose a new criterion that we call ILvb, based on a non-asymptotic approximation of the marginal likelihood. We describe how the criterion can be computed through a variational Bayes EM algorithm.

Journal ArticleDOI
TL;DR: An approach is proposed for initializing the expectation-maximization (EM) algorithm in multivariate Gaussian mixture models with an unknown number of components by choosing points with higher concentrations of neighbors as initial centers and using a truncated normal distribution for the preliminary estimation of dispersion matrices.
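A simplified sketch of the initialization idea: pick starting centers at points surrounded by many neighbors, then hand them to a standard Gaussian-mixture EM. The greedy neighbor-counting rule and the fixed radius are assumptions for illustration, and the truncated-normal estimation of dispersion matrices mentioned above is omitted.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def dense_point_init(X, k, radius):
    """Pick up to k initial centers as points with the most neighbors within `radius`,
    greedily removing a chosen point's neighborhood before selecting the next center.
    Simplified illustration of the density-based idea, not the authors' exact procedure."""
    remaining = np.arange(len(X))
    centers = []
    for _ in range(k):
        d = np.linalg.norm(X[remaining, None, :] - X[None, remaining, :], axis=-1)
        counts = (d < radius).sum(axis=1)
        best = np.argmax(counts)
        centers.append(X[remaining[best]])
        keep = d[best] >= radius           # drop the chosen point's neighborhood
        remaining = remaining[keep]
        if len(remaining) == 0:
            break
    return np.array(centers)

# usage sketch: seed EM for a Gaussian mixture with the density-based centers
# X = ...  # (n_samples, n_features)
# init = dense_point_init(X, k=3, radius=1.0)
# gmm = GaussianMixture(n_components=len(init), means_init=init).fit(X)
```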

Journal ArticleDOI
TL;DR: In this article, the authors show that these MM algorithms can be reinterpreted as special instances of expectation-maximization algorithms associated with suitable sets of latent variables and propose some original extensions.
Abstract: The Bradley–Terry model is a popular approach to describe probabilities of the possible outcomes when elements of a set are repeatedly compared with one another in pairs. It has found many applications including animal behavior, chess ranking, and multiclass classification. Numerous extensions of the basic model have also been proposed in the literature including models with ties, multiple comparisons, group comparisons, and random graphs. From a computational point of view, Hunter has proposed efficient iterative minorization-maximization (MM) algorithms to perform maximum likelihood estimation for these generalized Bradley–Terry models whereas Bayesian inference is typically performed using Markov chain Monte Carlo algorithms based on tailored Metropolis–Hastings proposals. We show here that these MM algorithms can be reinterpreted as special instances of expectation-maximization algorithms associated with suitable sets of latent variables and propose some original extensions. These latent variables all...
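For reference, Hunter's MM update for the basic Bradley-Terry model, which the paper reinterprets as an EM step with suitable latent variables. This sketch assumes complete pairwise win counts and that every player has at least one win and at least one loss.

```python
import numpy as np

def bradley_terry_mm(wins, n_iters=200):
    """Hunter-style MM updates for Bradley-Terry skills (illustrative sketch).
    wins[i, j] = number of times player i beat player j. The same update can be
    derived as an EM step with suitable latent variables, which is the paper's point."""
    n = wins.shape[0]
    comparisons = wins + wins.T                  # N_ij: total games between i and j
    w = wins.sum(axis=1)                         # total wins of each player
    lam = np.ones(n)
    for _ in range(n_iters):
        denom = comparisons / (lam[:, None] + lam[None, :])
        np.fill_diagonal(denom, 0.0)
        lam = w / denom.sum(axis=1)
        lam /= lam.sum()                         # skills are identifiable only up to scale
    return lam
```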

Journal ArticleDOI
TL;DR: It is shown that the proposed EM has a lower computational complexity than the optimum maximum a posteriori estimator and yet incurs only an insignificant loss in performance.
Abstract: In this paper, the problem of joint carrier frequency offset (CFO) and channel estimation for OFDM systems over the fast time-varying frequency-selective channel is explored within the framework of the expectation-maximization (EM) algorithm and parametric channel model. Assuming that the path delays are known, a novel iterative pilot-aided algorithm for joint estimation of the multipath Rayleigh channel complex gains (CG) and the carrier frequency offset (CFO) is introduced. Each CG time-variation, within one OFDM symbol, is approximated by a basis expansion model (BEM) representation. An autoregressive (AR) model is built to statistically characterize the variations of the BEM coefficients across the OFDM blocks. In addition to the algorithm, the derivation of the hybrid Cramer-Rao bound (HCRB) for CFO and CGs estimation in our context of very high mobility is provided. We show that the proposed EM has a lower computational complexity than the optimum maximum a posteriori estimator and yet incurs only an insignificant loss in performance.

Journal ArticleDOI
TL;DR: A new finite Student's-t mixture model (SMM) is proposed that exploits the Dirichlet distribution and Dirichlet law to incorporate the local spatial constraints in an image and is successfully compared to the state-of-the-art finite mixture models.
Abstract: Finite mixture model based on the Student's-t distribution, which is heavily tailed and more robust than Gaussian, has recently received great attention for image segmentation. A new finite Student's-t mixture model (SMM) is proposed in this paper. Existing models do not explicitly incorporate the spatial relationships between pixels. First, our model exploits the Dirichlet distribution and Dirichlet law to incorporate the local spatial constraints in an image. Secondly, we directly deal with the Student's-t distribution in order to estimate the model parameters, whereas the Student's-t distributions in previous models are represented as an infinite mixture of scaled Gaussians that lead to an increase in complexity. Finally, instead of using the expectation maximization (EM) algorithm, the proposed method adopts the gradient method to minimize the higher bound on the data negative log-likelihood and to optimize the parameters. The proposed model is successfully compared to the state-of-the-art finite mixture models. Numerical experiments are presented where the proposed model is tested on various simulated and real medical images.

Journal ArticleDOI
TL;DR: In this article, the intractable expectations needed in the E-step can be written out analytically, bypassing the need for numerical estimation procedures, such as Monte Carlo methods, leading to accurate calculation of maximum likelihood estimates.

Journal ArticleDOI
TL;DR: Estimates of the measurement error are used to weight the input data such that the resulting eigenvectors are more sensitive to the true underlying signal variations rather than being pulled by heteroskedastic measurement noise.
Abstract: We present a method for performing principal component analysis (PCA) on noisy datasets with missing values. Estimates of the measurement error are used to weight the input data such that the resulting eigenvectors, when compared to classic PCA, are more sensitive to the true underlying signal variations rather than being pulled by heteroskedastic measurement noise. Missing data are simply limiting cases of weight = 0. The underlying algorithm is a noise weighted expectation maximization (EM) PCA, which has additional benefits of implementation speed and flexibility for smoothing eigenvectors to reduce the noise contribution. We present applications of this method on simulated data and QSO spectra from the Sloan Digital Sky Survey (SDSS).
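A rough numpy sketch of the noise-weighted EM PCA idea: alternate weighted least-squares solves for the per-sample coefficients and for the eigenvector coordinates, with weight 0 marking missing data. It assumes each row and column has enough nonzero weights, omits the eigenvector smoothing mentioned above, and is not the authors' reference implementation.

```python
import numpy as np

def weighted_empca(X, W, n_vec=2, n_iters=30):
    """Sketch of noise-weighted EM PCA: W holds per-datum inverse-variance weights
    (0 for missing values), X is (n, p) and assumed already mean-subtracted.
    Illustrative only, not the authors' reference implementation."""
    n, p = X.shape
    rng = np.random.default_rng(0)
    P = np.linalg.qr(rng.normal(size=(p, n_vec)))[0].T    # (n_vec, p) orthonormal rows
    C = np.zeros((n, n_vec))
    for _ in range(n_iters):
        # E-step: weighted least-squares coefficients for each sample given P
        for i in range(n):
            A = (P * W[i]) @ P.T
            b = (P * W[i]) @ X[i]
            C[i] = np.linalg.solve(A, b)
        # M-step: weighted least-squares update of each eigenvector coordinate
        for j in range(p):
            A = C.T @ (W[:, j:j + 1] * C)
            b = C.T @ (W[:, j] * X[:, j])
            P[:, j] = np.linalg.solve(A, b)
        P = np.linalg.qr(P.T)[0].T                        # keep the basis orthonormal
    return P, C
```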

Journal ArticleDOI
TL;DR: An efficient EM algorithm for optimization with provable numerical convergence properties is proposed, and the methodology is extended to handle missing values in a sparse regression context.
Abstract: We propose an ℓ1-regularized likelihood method for estimating the inverse covariance matrix in the high-dimensional multivariate normal model in presence of missing data. Our method is based on the assumption that the data are missing at random (MAR), which also covers the missing completely at random case. The implementation of the method is non-trivial as the observed negative log-likelihood generally is a complicated and non-convex function. We propose an efficient EM algorithm for optimization with provable numerical convergence properties. Furthermore, we extend the methodology to handle missing values in a sparse regression context. We demonstrate both methods on simulated and real data.
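A sketch of what such an EM can look like, assuming the standard Gaussian E-step for missing entries and using scikit-learn's graphical lasso as the penalized M-step. This is an illustrative reconstruction under those assumptions, not the authors' algorithm or its convergence analysis.

```python
import numpy as np
from sklearn.covariance import graphical_lasso

def em_glasso_missing(X, alpha, n_iters=20):
    """Illustrative sketch of EM for l1-penalized inverse covariance estimation with
    data missing at random (np.nan in X): the E-step computes expected sufficient
    statistics under the current Gaussian fit, the M-step is a graphical-lasso fit
    of the expected covariance. Not the paper's code."""
    n, p = X.shape
    mu = np.nanmean(X, axis=0)
    cov = np.diag(np.nanvar(X, axis=0)) + 1e-3 * np.eye(p)
    for _ in range(n_iters):
        S = np.zeros((p, p))
        mean_acc = np.zeros(p)
        for i in range(n):
            obs = ~np.isnan(X[i])
            mis = ~obs
            xi = X[i].copy()
            add = np.zeros((p, p))
            if mis.any():
                # E-step: conditional moments of the missing block given the observed block
                reg = cov[np.ix_(mis, obs)] @ np.linalg.inv(cov[np.ix_(obs, obs)])
                xi[mis] = mu[mis] + reg @ (X[i, obs] - mu[obs])
                add[np.ix_(mis, mis)] = cov[np.ix_(mis, mis)] - reg @ cov[np.ix_(obs, mis)]
            mean_acc += xi
            S += np.outer(xi, xi) + add
        mu = mean_acc / n
        emp_cov = S / n - np.outer(mu, mu)
        cov, _ = graphical_lasso(emp_cov, alpha=alpha)    # penalized M-step
    return mu, cov
```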

Journal ArticleDOI
TL;DR: In this paper, the authors carried out robust modeling and influence diagnostics in Birnbaum-Saunders regression models and developed BS-t regression models, including maximum likelihood estimation based on the EM algorithm and diagnostic tools.
Abstract: In this paper, we carry out robust modeling and influence diagnostics in Birnbaum-Saunders (BS) regression models. Specifically, we present some aspects related to BS and log-BS distributions and their generalizations from the Student-t distribution, and develop BS-t regression models, including maximum likelihood estimation based on the EM algorithm and diagnostic tools. In addition, we apply the obtained results to real data from insurance, which shows the uses of the proposed model. Copyright © 2011 John Wiley & Sons, Ltd.

Journal ArticleDOI
TL;DR: This paper introduces the generalized exponential-power series (GEPS) class of distributions, obtained by compounding generalized exponential and power series distributions, and derives several properties of the GEPS distributions, such as moments, a maximum likelihood estimation procedure via an EM algorithm, and large-sample inference.

Journal ArticleDOI
TL;DR: This paper exploits a Boltzmann machine, which makes it possible to take a large variety of structures into account, and resorts to a mean-field approximation and the "variational Bayes expectation-maximization" algorithm to solve a marginalized maximum a posteriori problem.
Abstract: Taking advantage of the structures inherent in many sparse decompositions constitutes a promising research axis. In this paper, we address this problem from a Bayesian point of view. We exploit a Boltzmann machine, which makes it possible to take a large variety of structures into account, and focus on the resolution of a marginalized maximum a posteriori problem. To solve this problem, we resort to a mean-field approximation and the "variational Bayes expectation-maximization" algorithm. This approach results in a soft procedure making no hard decision on the support or the values of the sparse representation. We show that this characteristic leads to an improvement of the performance over state-of-the-art algorithms.


Journal ArticleDOI
TL;DR: A robust estimation procedure and an EM-type algorithm are proposed to estimate mixture regression models, and it is demonstrated that the proposed estimation method is robust and works much better than the MLE when there are outliers or the error distribution has heavy tails.

Journal ArticleDOI
TL;DR: A novel foreground object detection scheme that integrates top-down information based on the expectation maximization (EM) framework and uses the detection result of the moving object to incorporate domain knowledge of object shapes into the construction of the top-down information.
Abstract: In this paper, we present a novel foreground object detection scheme that integrates the top-down information based on the expectation maximization (EM) framework. In this generalized EM framework, the top-down information is incorporated in an object model. Based on the object model and the state of each target, a foreground model is constructed. This foreground model can augment the foreground detection for the camouflage problem. Thus, an object's state-specific Markov random field (MRF) model is constructed for detection based on the foreground model and the background model. This MRF model depends on the latent variables that describe each object's state. The maximization of the MRF model is the M-step in the EM framework. Besides fusing spatial information, this MRF model can also adjust the contribution of the top-down information for detection. To obtain detection result using this MRF model, sampling importance resampling is used to sample the latent variable and the EM framework refines the detection iteratively. Besides the proposed generalized EM framework, our method does not need any prior information of the moving object, because we use the detection result of moving object to incorporate the domain knowledge of the object shapes into the construction of top-down information. Moreover, in our method, a kernel density estimation (KDE)—Gaussian mixture model (GMM) hybrid model is proposed to construct the probability density function of background and moving object model. For the background model, it has some advantages over GMM- and KDE-based methods. Experimental results demonstrate the capability of our method, particularly in handling the camouflage problem.

Journal ArticleDOI
TL;DR: In this article, an identification of nonlinear parameter varying systems using a particle filter under the framework of the expectation-maximization (EM) algorithm is described, where particle filters are adopted to deal with the computation of expectation functions.
Abstract: An identification method for nonlinear parameter varying systems using a particle filter under the framework of the expectation-maximization (EM) algorithm is described. In chemical industries, processes are often designed to perform tasks under various operating conditions. To circumvent the modeling difficulties rendered by multiple operating conditions and the transitions between different working points, the EM algorithm, which iteratively increases the likelihood function, is applied. Meanwhile the missing output data problem which is common in real industry is also considered in this work. Particle filters are adopted to deal with the computation of expectation functions. The efficiency of the proposed method is illustrated through simulated examples and a pilot-scale experiment. © 2012 American Institute of Chemical Engineers AIChE J, 2012

Journal ArticleDOI
TL;DR: It is shown that the sensor self-localization problem can be cast as a static parameter estimation problem for Hidden Markov Models, and fully decentralized versions of the Recursive Maximum Likelihood and on-line Expectation-Maximization algorithms are implemented to localize the sensor network simultaneously with target tracking.
Abstract: We show that the sensor self-localization problem can be cast as a static parameter estimation problem for Hidden Markov Models and we implement fully decentralized versions of the Recursive Maximum Likelihood and on-line Expectation-Maximization algorithms to localize the sensor network simultaneously with target tracking. For linear Gaussian models, our algorithms can be implemented exactly using a distributed version of the Kalman filter and a novel message passing algorithm. The latter allows each node to compute the local derivatives of the likelihood or the sufficient statistics needed for Expectation-Maximization. In the non-linear case, a solution based on local linearization in the spirit of the Extended Kalman Filter is proposed. In numerical examples we demonstrate that the developed algorithms are able to learn the localization parameters.

Posted Content
TL;DR: A general method for obtaining more flexible new distributions by compounding the extended Weibull and power series distributions, defining at least 68 new sub-models and including some well-known mixing distributions.
Abstract: In this paper, we introduce a new class of distributions which is obtained by compounding the extended Weibull and power series distributions. The compounding procedure follows the same set-up carried out by Adamidis and Loukas (1998) and defines at least 68 new sub-models. This class includes some well-known mixing distributions, such as the Weibull power series (Morais and Barreto-Souza, 2010) and exponential power series (Chahkandi and Ganjali, 2009) distributions. Some mathematical properties of the new class are studied including moments and generating function. We provide the density function of the order statistics and obtain their moments. The method of maximum likelihood is used for estimating the model parameters and an EM algorithm is proposed for computing the estimates. Special distributions are investigated in some detail. An application to a real data set is given to show the flexibility and potentiality of the new class of distributions.

Posted Content
TL;DR: The EM algorithm is used to extend the representation of CTBNs to allow a much richer class of transition duration distributions, known as phase distributions, which are a highly expressive semi-parametric representation that can approximate any duration distribution arbitrarily closely.
Abstract: Continuous time Bayesian networks (CTBNs) describe structured stochastic processes with finitely many states that evolve over continuous time. A CTBN is a directed (possibly cyclic) dependency graph over a set of variables, each of which represents a finite state continuous time Markov process whose transition model is a function of its parents. We address the problem of learning the parameters and structure of a CTBN from partially observed data. We show how to apply expectation maximization (EM) and structural expectation maximization (SEM) to CTBNs. The availability of the EM algorithm allows us to extend the representation of CTBNs to allow a much richer class of transition duration distributions, known as phase distributions. This class is a highly expressive semi-parametric representation, which can approximate any duration distribution arbitrarily closely. This extension to the CTBN framework addresses one of the main limitations of both CTBNs and DBNs - the restriction to exponentially / geometrically distributed durations. We present experimental results on a real data set of people's life spans, showing that our algorithm learns reasonable models - structure and parameters - from partially observed data, and, with the use of phase distributions, achieves better performance than DBNs.