Open Access Journal Article

Serial and parallel implementations of model-based clustering via parsimonious Gaussian mixture models

TL;DR
Model-based clustering using a family of Gaussian mixture models, with parsimonious factor analysis-like covariance structure, is described and an efficient algorithm for its implementation is presented, showing its effectiveness when compared to existing software.
About
This article was published in Computational Statistics & Data Analysis on 2010-03-01 and is open access. It has received 128 citations to date. The article focuses on the topics: Parallel algorithm & Mixture model.


Citations
Journal Article

Model-based clustering of microarray expression data via latent Gaussian mixture models

TL;DR: This modelling approach builds on previous work by introducing a modified factor analysis covariance structure, leading to a family of 12 mixture models, including parsimonious models; the approach gives very good performance relative to existing popular clustering techniques when applied to real gene expression microarray data.
Journal Article

Model-based clustering, classification, and discriminant analysis via mixtures of multivariate t-distributions

TL;DR: A novel family of mixture models wherein each component is modeled using a multivariate t-distribution with an eigen-decomposed covariance structure is put forth, known as the tEIGEN family.
Journal Article

Mixtures of Shifted Asymmetric Laplace Distributions

TL;DR: This work marks an important step in the direction of non-Gaussian model-based clustering and classification; a variant of the EM algorithm is developed for parameter estimation by exploiting the relationship with the generalized inverse Gaussian distribution.
Journal Article

Extending mixtures of multivariate t-factor analyzers

TL;DR: The extension of the mixtures of multivariate t-factor analyzers model is described to include constraints on the degrees of freedom, the factor loadings, and the error variance matrices to create a family of six mixture models, including parsimonious models.
Journal Article

A mixture of generalized hyperbolic distributions

TL;DR: The authors introduce a mixture of generalized hyperbolic distributions as an alternative to the ubiquitous mixture of Gaussian distributions and its near relatives, among which mixtures of multivariate t-distributions and mixtures of skew-t distributions predominate.
References
Journal Article

Estimating the Dimension of a Model

TL;DR: In this paper, the problem of selecting one of a number of models of different dimensions is treated by finding its Bayes solution, and evaluating the leading terms of its asymptotic expansion.
Book

Finite Mixture Models

TL;DR: The important role of finite mixture models in the statistical analysis of data is underscored by the ever-increasing rate at which articles on mixture applications appear in the mathematical and statistical literature.
Journal Article

Objective Criteria for the Evaluation of Clustering Methods

TL;DR: This article proposes several criteria which isolate specific aspects of the performance of a method, such as its retrieval of inherent structure, its sensitivity to resampling and the stability of its results in the light of new data.
Book

The EM algorithm and extensions

TL;DR: The EM Algorithm and Extensions describes the formulation of the EM algorithm, details its methodology, discusses its implementation, and illustrates applications in many statistical contexts, opening the door to the tremendous potential of this remarkably versatile statistical tool.
Frequently Asked Questions (11)
Q1. What are examples of problems that are trivially parallelizable?

Ray tracing in computer graphics, signal processing, brute force attacks in cryptography and gene sequence alignment are all examples of problems that are trivially parallelizable. 

Within a master-slave paradigm, the ideal situation occurs when the speed-up is directly proportional to the number of processors — this is known as linear speed-up. 
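
In symbols (the standard definition, stated here for concreteness): with $T(1)$ the run time on one processor and $T(p)$ the run time on $p$ processors, the speed-up is

$$
S(p) = \frac{T(1)}{T(p)},
$$

and linear speed-up corresponds to $S(p) = p$.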

Due to the strategy adopted for parallelization, it was necessary to write two functions: one for the master and one for the slaves. 
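
A minimal sketch of this master-slave structure is given below in Python with the mpi4py bindings. It is an illustrative reconstruction under stated assumptions, not the authors' implementation: fit_pgmm is a hypothetical stand-in for one serial AECM fit, and the job encoding is invented for the example.

```python
# Minimal master-slave sketch in Python with mpi4py; illustrative only,
# not the authors' implementation.  Run with, e.g.:
#   mpiexec -n 4 python pgmm_farm.py
from itertools import product

from mpi4py import MPI


def fit_pgmm(model, G, q):
    # Hypothetical stand-in for one serial AECM fit of PGMM covariance
    # structure `model` with G components and q latent factors; a real
    # implementation would run AECM to convergence and return the BIC.
    return -(model + G + q)


def master(comm, jobs):
    # Seed every slave with one job, then hand out the remaining jobs
    # as results come back, so faster processors receive more work.
    status, results = MPI.Status(), []
    it = iter(jobs)
    for rank in range(1, comm.Get_size()):
        comm.send(next(it, None), dest=rank)  # None means "nothing to do"
    for _ in range(len(jobs)):
        results.append(comm.recv(source=MPI.ANY_SOURCE, status=status))
        comm.send(next(it, None), dest=status.Get_source())
    return results


def slave(comm):
    # Fit whatever triple the master sends; a None job is the stop signal.
    while (job := comm.recv(source=0)) is not None:
        comm.send((job, fit_pgmm(*job)), dest=0)


if __name__ == "__main__":
    comm = MPI.COMM_WORLD
    if comm.Get_rank() == 0:
        # One job per (model, G, q) triple: 8 structures, G and q in 1..5.
        triples = list(product(range(1, 9), range(1, 6), range(1, 6)))
        best = max(master(comm, triples), key=lambda r: r[1])
        print("best (model, G, q) by dummy BIC:", best)
    else:
        slave(comm)
```

Each slave blocks on comm.recv until the master hands it a triple, so a processor is re-used as soon as it finishes a job; this load balancing is the advantage of the master-slave design over a static split of the triples.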

Factor analysis (Spearman 1904) is a data reduction technique in which a p-dimensional real-valued data vector x is modelled using a q-dimensional vector of latent variables u, where q ≪ p. 
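
In symbols, the factor analysis model just described is

$$
\mathbf{x} = \boldsymbol{\mu} + \boldsymbol{\Lambda}\mathbf{u} + \boldsymbol{\varepsilon}, \qquad
\mathbf{u} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}_q), \qquad
\boldsymbol{\varepsilon} \sim \mathcal{N}(\mathbf{0}, \boldsymbol{\Psi}),
$$

where $\boldsymbol{\Lambda}$ is a $p \times q$ matrix of factor loadings and $\boldsymbol{\Psi}$ is a diagonal $p \times p$ matrix, so that marginally $\mathbf{x} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Lambda}\boldsymbol{\Lambda}' + \boldsymbol{\Psi})$.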

The PGMM family was fitted to the data for G ∈ {1, 2, ..., 5} and q ∈ {1, 2, ..., 5}, running the software from three random starting values, so that a total of 600 models were fitted (eight covariance structures × five values of G × five values of q × three starts). 

The AECM algorithm used for parameter estimation was parallelized within the master-slave paradigm using MPI and the resulting speed-up has been shown to be linear up to a certain point. 

In the E-step, the expected value of the complete-data log-likelihood is computed based on the current estimates of the model parameters and the complete-data vector, which is the vector of observed data plus missing data. 
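
In the usual EM notation this is the Q-function: writing $\mathbf{x}$ for the observed data, $\mathbf{z}$ for the missing data, and $\boldsymbol{\theta}^{(t)}$ for the current parameter estimate,

$$
Q\bigl(\boldsymbol{\theta} \mid \boldsymbol{\theta}^{(t)}\bigr)
= \mathbb{E}\Bigl[\ell_c\bigl(\boldsymbol{\theta} \mid \mathbf{x}, \mathbf{z}\bigr) \,\Big|\, \mathbf{x}, \boldsymbol{\theta}^{(t)}\Bigr],
$$

where $\ell_c$ denotes the complete-data log-likelihood; the subsequent M-step maximizes $Q$ over $\boldsymbol{\theta}$.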

The nature of the problem makes it trivially parallelizable: that is, each triple (M, G, q) can be sent to a different processor and processors can work independently of one another. 

Parallelization within a triple is not implemented here: the saving achieved by farming whole triples out to processors is already so great that any further within-triple parallelization could well cost more in overhead than it gains. 

The eight PGMMs were fitted to the data for G ∈ {1, 2, ..., 6} and q ∈ {1, 2, ..., 6}, with three random starts for each model (8 × 6 × 6 × 3 = 864 fits in all). 

These include parallel implementations of algorithms for kernel estimation (Racine 2002), linear models (Kontoghiorghes 2000, Yanev & Kontoghiorghes 2006), partial least squares (Milidiú & Rentería 2005), and regression submodels (Gatu et al. 2007).