A Bayesian missing value estimation method for gene expression profile data
Citations
MissForest—non-parametric missing value imputation for mixed-type data
Singular Value Decomposition for Genome-Wide Expression Data Processing and Modeling
pcaMethods—a bioconductor package providing PCA methods for incomplete data
dbNSFP: a lightweight database of human nonsynonymous SNPs and their functional predictions.
Exploration of essential gene functions via titratable promoter alleles
References
Cluster analysis and display of genome-wide expression patterns
Molecular portraits of human breast tumours
Molecular classification of cancer: class discovery and class prediction by gene expression monitoring.
The Elements of Statistical Learning
Related Papers (5)
Missing value estimation for DNA microarray gene expression data: local least squares imputation
Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling
Frequently Asked Questions (14)
Q2. How did the authors measure the missing value estimation ability?
In order to measure how the estimation ability depends on the value of K, the authors applied BPCA, SVDimpute and KNNimpute to the test data sets, (data A + E) and (data I), and calculated NRMSE with various K-values.
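As a rough illustration of the error measure, the sketch below computes a normalized root mean squared error over the entries that were hidden. The exact normalization is an assumption here (root of mean squared error divided by the variance of the true values), and the function name `nrmse` is hypothetical.

```python
import math


def nrmse(estimates, truths):
    """Normalized RMSE over the hidden entries.

    Assumed definition: sqrt(mean squared error / variance of true values).
    """
    if len(estimates) != len(truths) or not truths:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(truths)
    mse = sum((e - t) ** 2 for e, t in zip(estimates, truths)) / n
    mean_t = sum(truths) / n
    var_t = sum((t - mean_t) ** 2 for t in truths) / n
    if var_t == 0:
        raise ValueError("true values have zero variance")
    return math.sqrt(mse / var_t)
```

Under this definition a perfect estimate scores 0, and always guessing the mean of the true values scores 1, so values well below 1 indicate a useful estimator.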
Q3. What are the three elementary processes that are used to estimate missing variables?
They are (1) principal component (PC) regression, (2) Bayesian estimation, and (3) an expectation–maximization (EM)-like repetitive algorithm.
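To make the interplay of PC regression and the EM-like loop concrete, here is a deliberately simplified sketch that uses a single principal axis and omits the Bayesian estimation step entirely. It is not the authors' BPCA algorithm; the function name `impute_rank1`, the power-iteration axis estimate, and the column-mean initialization are all assumptions chosen for brevity. It also assumes every column has at least one observed value.

```python
import math
import random


def impute_rank1(matrix, n_iter=50, seed=0):
    """EM-like imputation with one principal axis (no Bayesian step)."""
    n, d = len(matrix), len(matrix[0])
    missing = [[math.isnan(x) for x in row] for row in matrix]
    # Initial guess: column mean of the observed entries.
    filled = [list(row) for row in matrix]
    for j in range(d):
        obs = [matrix[i][j] for i in range(n) if not missing[i][j]]
        mean_j = sum(obs) / len(obs)
        for i in range(n):
            if missing[i][j]:
                filled[i][j] = mean_j
    rng = random.Random(seed)
    for _outer in range(n_iter):
        mu = [sum(filled[i][j] for i in range(n)) / n for j in range(d)]
        centered = [[filled[i][j] - mu[j] for j in range(d)] for i in range(n)]
        # Leading principal axis via power iteration on X^T X.
        w = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for _inner in range(100):
            scores = [sum(r[j] * w[j] for j in range(d)) for r in centered]
            w_new = [sum(scores[i] * centered[i][j] for i in range(n))
                     for j in range(d)]
            norm = math.sqrt(sum(x * x for x in w_new))
            if norm == 0:
                break
            w = [x / norm for x in w_new]
        # PC regression: fit each row's score on its observed entries only,
        # then use that score to predict the row's missing entries.
        for i in range(n):
            obs = [j for j in range(d) if not missing[i][j]]
            denom = sum(w[j] ** 2 for j in obs)
            if denom == 0:
                continue
            score = sum(w[j] * (matrix[i][j] - mu[j]) for j in obs) / denom
            for j in range(d):
                if missing[i][j]:
                    filled[i][j] = mu[j] + score * w[j]
    return filled
```

Each outer pass re-estimates the axis from the currently filled matrix and then re-estimates the missing entries from that axis, which is the repetitive structure the answer describes. Observed entries are never modified.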
Q4. Does a large number of missing entries degrade the estimation performance of BPCA?
For (data I), a large number of missing entries does not degrade the estimation performance, probably because the data set contains many samples.
Q5. What is the expression vector of the i-th sample?
The i-th row vector and the j-th column vector of the matrix are called the expression vector of the i-th sample and the expression vector of the j-th gene, respectively.
Q6. How is the distance between vectors with missing values defined?
Existing hierarchical clustering software, such as ‘Cluster’ (Eisen et al., 1998), defines the distance between vectors with missing values by simply ignoring the missing dimensions.
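One way to read "ignoring the missing dimensions" is sketched below: only coordinates observed in both vectors contribute. The choice of a root-mean-square Euclidean distance is an assumption for illustration, not a description of what the Cluster software computes, and `distance_ignoring_missing` is a hypothetical name.

```python
import math


def distance_ignoring_missing(u, v):
    """Root-mean-square distance over coordinates observed in both vectors.

    Missing values are represented as NaN. Returns NaN when the two
    vectors share no observed coordinate.
    """
    shared = [(a, b) for a, b in zip(u, v)
              if not (math.isnan(a) or math.isnan(b))]
    if not shared:
        return math.nan
    return math.sqrt(sum((a - b) ** 2 for a, b in shared) / len(shared))
```

Dividing by the number of shared coordinates keeps distances comparable between pairs that overlap on different numbers of dimensions.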
Q7. What is the effect of the ARD prior on the NRMSE curve?
If the effective dimension of the data set is smaller than the K-value, the ARD prior automatically reduces the redundant principal axes.
Q8. What is the reason for the improvement by SVDimpute and BPCA?
As the number of samples increased, the information useful for the imputation increased, which is the reason for the improvement by SVDimpute and BPCA.
Q9. What are the methodologies used in clinical studies?
Their methodologies on class discovery and class prediction have been applied in a number of studies examining expression changes underlying various clinical phenomena.
Q10. How did the authors introduce artificial missing entries to a complete expression matrix?
In order to evaluate the performance of missing value estimation methods, the authors introduced artificial missing entries to a complete (i.e. without missing values) expression matrix.
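The evaluation setup can be sketched as follows, assuming entries are hidden uniformly at random at a given rate. The function name `mask_entries`, the `rate` and `seed` parameters, and the use of NaN as the missing marker are illustrative choices, not details from the paper.

```python
import math
import random


def mask_entries(matrix, rate, seed=0):
    """Hide a fraction `rate` of the entries of a complete matrix.

    Returns (masked, truth): `masked` is a copy with NaN at the hidden
    cells, and `truth` maps each hidden (row, col) to its original value.
    """
    rng = random.Random(seed)
    cells = [(i, j) for i, row in enumerate(matrix) for j in range(len(row))]
    hidden = rng.sample(cells, round(rate * len(cells)))
    masked = [list(row) for row in matrix]
    truth = {}
    for i, j in hidden:
        truth[(i, j)] = masked[i][j]
        masked[i][j] = math.nan
    return masked, truth
```

An imputation method is then run on `masked`, and its estimates at the hidden cells are compared with the values stored in `truth`.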
Q11. What are the main ways to deal with missing values?
There are several simple ways to deal with missing values such as deleting an expression vector with missing values from further analysis, imputing missing values to zero, or imputing missing values of a certain gene (sample) to the sample (gene) average (Alizadeh et al., 2000).
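Two of these simple baselines, zero imputation and average imputation, can be sketched as below. Missing values are assumed to be stored as NaN, and the helper names are hypothetical. Averaging along columns instead of rows gives the other variant mentioned in the answer.

```python
import math


def impute_zero(matrix):
    """Replace every missing entry (NaN) with 0.0."""
    return [[0.0 if math.isnan(x) else x for x in row] for row in matrix]


def impute_row_mean(matrix):
    """Replace each missing entry with the mean of its row's observed values.

    A row with no observed value is left unchanged.
    """
    result = []
    for row in matrix:
        observed = [x for x in row if not math.isnan(x)]
        if not observed:
            result.append(list(row))
            continue
        mean = sum(observed) / len(observed)
        result.append([mean if math.isnan(x) else x for x in row])
    return result
```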
Q12. What is the reason why the performance of KNNimpute did not improve?
The performance of KNNimpute, however, did not improve much, possibly because the similarity measure used in the method was not very suitable for cases with a large number of missing values.
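A stripped-down nearest-neighbour imputer shows where the similarity measure enters. This sketch takes an unweighted average over the k nearest rows and measures similarity only on coordinates observed in both rows. Both choices are simplifying assumptions and may differ from the published KNNimpute; `knn_impute` is a hypothetical name.

```python
import math


def knn_impute(matrix, k=2):
    """Fill each NaN with the mean of that column over the k nearest rows."""

    def dist(u, v):
        # Similarity is computed only on coordinates observed in both rows.
        shared = [(a, b) for a, b in zip(u, v)
                  if not (math.isnan(a) or math.isnan(b))]
        if not shared:
            return math.inf
        return math.sqrt(sum((a - b) ** 2 for a, b in shared) / len(shared))

    result = [list(row) for row in matrix]
    for i, row in enumerate(matrix):
        for j, x in enumerate(row):
            if not math.isnan(x):
                continue
            # Candidate neighbours must have the target column observed.
            candidates = sorted(
                (dist(row, other), other[j])
                for m, other in enumerate(matrix)
                if m != i and not math.isnan(other[j])
            )
            nearest = [val for d, val in candidates[:k] if d < math.inf]
            if nearest:
                result[i][j] = sum(nearest) / len(nearest)
    return result
```

When many entries are missing, `shared` shrinks and the distances rest on very few coordinates, which is one plausible reading of why the similarity measure becomes less reliable.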
Q13. What is the role of the priors in BPCA?
The hierarchical prior p(W | α, τ), which is called an automatic relevance determination (ARD) prior, has an important role in BPCA.
Q14. What is the reason why the estimation with BPCA may not be accurate?
Because BPCA assumes only a global covariance structure, the estimation with BPCA may not be accurate if genes have dominant local similarity structures.