How did the authors measure the missing value estimation ability?

In order to measure how the estimation ability depends on the value of K, the authors applied BPCA, SVDimpute and KNNimpute to the test data sets, (data A + E) and (data I), and calculated NRMSE with various K-values.

Does a large number of missing entries degrade the estimation performance of BPCA?

For (data I), a large number of missing entries does not degrade the estimation performance, probably because there are a lot of samples in the data set.

How is the distance between vectors defined?

Existing hierarchical clustering software, such as ‘Cluster’ (Eisen et al., 1998), defines the distance between vectors with missing values by simply ignoring the missing dimensions.

What is the reason for the improvement by SVDimpute and BPCA?

As the number of samples increased, the information useful for the imputation increased, which is the reason for the improvement by SVDimpute and BPCA.

How did the authors introduce artificial missing entries to a complete expression matrix?

In order to evaluate the performance of missing value estimation methods, the authors introduced artificial missing entries to a complete (i.e. without missing values) expression matrix.

What is the reason why the performance of KNNimpute did not improve?

The performance by KNNimpute, however, did not improve much, possibly because the similarity measure used in the method was not very suitable for cases with a large number of missing values.

What is the reason why the estimation with BPCA may not be accurate?

only a global covariance structure, the estimation with BPCA may not be accurate if genes have dominant local similarity structures.

(Open Access) A Bayesian missing value estimation method for gene expression profile data (2003) | Shigeyuki Oba

Q: What contributions have the authors mentioned in the paper "A bayesian missing value estimation method for gene expression profile data" ?

In this article, the authors propose an estimation method for missing values, which is based on Bayesian principal component analysis (BPCA). Accordingly, the BPCA method provides accurate and convenient estimation for missing values.

Q: What are the three elementary processes that are used to estimate missing variables?

They are (1) principal component (PC) regression, (2) Bayesian estimation, and (3) an expectation–maximization (EM)-like repetitive algorithm.

Q: What is the expression vector of the i-th sample?

The i-th row vector and the j-th column vector of the matrix are called the expression vector of the i-th sample and the expression vector of the j-th gene, respectively.

Q: What is the effect of the ARD prior on the NRMSE curve?

If the effective dimension of the data set is smaller than the K-value, the ARD prior automatically reduces the redundant principal axes.

BIOINFORMATICS

Vol. 19 no. 16 2003, pages 2088–2096

DOI: 10.1093/bioinformatics/btg287

A Bayesian missing value estimation method for gene expression profile data

Shigeyuki Oba, Masa-aki Sato (2,5), Ichiro Takemasa, Morito Monden, Ken-ichi Matsubara and Shin Ishii (1,5,∗)

(1) Graduate School of Information Science, Nara Institute of Science and Technology, 8916-5 Takayama, Ikoma 630-0192, Japan, (2) ATR Human Information Science Laboratories, 2-2-2 Hikaridai, Seika-cho, Soraku-gun, Kyoto, Japan, (3) Graduate School of Medicine, Osaka University, 2-2 Yamadaoka, Suita, Osaka, Japan, (4) DNA Chip Research Institute, 134 Kobecho, Hodogayaku, Yokohama, Japan and (5) CREST, Japan Science and Technology Corporation

Received on March 10, 2003; revised on May 6, 2003; accepted on May 9, 2003

ABSTRACT

Motivation: Gene expression profile analyses have been used in numerous studies covering a broad range of areas in biology. When unreliable measurements are excluded, missing values are introduced in gene expression profiles. Although existing multivariate analysis methods have difficulty with the treatment of missing values, this problem has received little attention. There are many options for dealing with missing values, each of which reaches drastically different results. Ignoring missing values is the simplest method and is frequently applied. This approach, however, has its flaws. In this article, we propose an estimation method for missing values, which is based on Bayesian principal component analysis (BPCA). Although the methodology that a probabilistic model and latent variables are estimated simultaneously within the framework of Bayes inference is not new in principle, actual BPCA implementation that makes it possible to estimate arbitrary missing variables is new in terms of statistical methodology.

Results: When applied to DNA microarray data from various experimental conditions, the BPCA method exhibited markedly better estimation ability than other recently proposed methods, such as singular value decomposition and K-nearest neighbors. While the estimation performance of existing methods depends on model parameters whose determination is difficult, our BPCA method is free from this difficulty. Accordingly, the BPCA method provides accurate and convenient estimation for missing values.

Availability: The software is available at http://hawaii.aist-nara.ac.jp/~shige-o/tools/

Contact: ishii@is.aist-nara.ac.jp

1 INTRODUCTION

Gene expression profiling, using DNA microarrays, provides high throughput investigation of gene expressions by simultaneously measuring the expression of thousands of genes under a certain experimental condition. Gene expression profiling has been used in numerous studies over a broad range of biological disciplines. In clinical studies on cancer classification, for example, Golub et al. (1999) distinguished between two leukemia subtypes, acute myeloid leukemia and acute lymphoblastic leukemia, by comparing the expression of ‘predictor genes’. Their methodologies on class discovery and class prediction have been applied in a number of studies examining expression changes underlying various clinical phenomena. Unknown effects of a specific therapy were estimated by comparing gene expression profiles before and after the therapy (Perou et al., 2000). Gene expression profile analyses were also effective in cancer prognosis prediction, even when morphological or immunohistological study was difficult (Alizadeh et al., 2000; Kihara et al., 2001; Pomeroy et al., 2002; Shipp et al., 2002; van’t Veer et al., 2002). In addition, expression profile analyses successfully identified genes relevant to a certain diagnosis or therapy (Takemasa et al., 2001; Muro et al., 2003). In these studies, various multivariate analysis methods have played crucial roles. Clustering, e.g. hierarchical clustering, is a popular unsupervised classification analysis, which has mainly been applied to class discovery problems (Eisen et al., 1998).

∗ To whom correspondence should be addressed.

In order to extract underlying biological reality based on gene expression profile analyses, it is necessary to discard various artifacts, such as noise and fluctuations that occur through the acquisition and normalization of data. Suspicious values are usually regarded as missing values, because they may be detrimental to further analyses. Existing multivariate analyses for expression profile data, however, often have difficulty with the treatment of missing values. Different methods of treating missing values may lead to different results. Although the handling of missing values is thus very important, researchers have often been unaware of this issue.

As an example, hierarchical clustering (Eisen et al., 1998) constructs gene clusters or sample clusters based on the distance between two gene expression profiles (vectors) or two sample expression vectors, respectively. The distance measurement on missing values, however, is problematic. Existing hierarchical clustering software, such as ‘Cluster’ (Eisen et al., 1998), defines the distance between vectors with missing values by just ignoring the missing dimensions. Since ignoring dimensions is identical to assuming that expression levels are the same in two vectors, the distance between vectors with missing values tends to be smaller than that between vectors without missing values. Therefore, a cluster of genes with a lot of missing values is often obtained as a result. Support vector machine (SVM) classifier (Brown et al., 2000), a popular multivariate supervised classification method, also encounters a similar problem in defining distance. Moreover, many multivariate statistical analyses, like principal component analysis (PCA) (Raychaudhuri et al., 2000) and singular value decomposition (SVD) (Alter et al., 2000), cannot be applied to data with missing values. Thus, in order to avoid improper analyses, missing value estimation is an important preprocess.

Downloaded from http://bioinformatics.oxfordjournals.org on July 4, 2010

There are several simple ways to deal with missing values such as deleting an expression vector with missing values from further analysis, imputing missing values to zero, or imputing missing values of a certain gene (sample) to the sample (gene) average (Alizadeh et al., 2000). On the other hand, Troyanskaya et al. (2001) proposed two advanced estimation methods for missing values in expression profiles. One method is based on K-nearest neighbor (KNNimpute), and the other is based on SVD (SVDimpute). Troyanskaya et al. evaluated their performance using various microarray data sets and reported that the two advanced methods performed better than the above-mentioned simple methods. The estimation ability of these advanced methods depends on important model parameters, such as the K-value in KNNimpute and the number of eigenvectors in SVDimpute. There is no theoretical way, however, to determine these parameters appropriately.

In this paper, we propose a new missing value estimation method based on Bayesian PCA (BPCA) (Bishop, 1999). Although the methodology that a probabilistic model and latent variables are estimated simultaneously within the framework of Bayes inference is not new in principle, actual BPCA implementation that makes it possible to estimate arbitrary missing variables is new in terms of statistical methodology. We evaluated the method by comparing it to KNNimpute and SVDimpute (Troyanskaya et al., 2001), using various microarray data sets, and showed marked improvement in estimation performance. In addition, the model parameter is automatically determined in this BPCA method. Therefore, our BPCA method can be easily used by medical and biological scientists to analyze gene expression data.

2 SYSTEM AND METHODS

A whole data set of gene expression profiles is represented by a numerical (D × N) matrix Y, where N is the number of genes and D is the number of samples. Y is called an expression matrix. The (i, j) component of the matrix, $y_{ij}$, denotes the expression level of the j-th gene in the i-th sample, which is typically a logarithm of the expression ratio between the control and the objective samples, in the case of cDNA microarray data. The i-th row vector and the j-th column vector of the matrix are called the expression vector of the i-th sample and the expression vector of the j-th gene, respectively.

2.1 BPCA

The missing value estimation method based on BPCA consists of three elementary processes. They are (1) principal component (PC) regression, (2) Bayesian estimation, and (3) an expectation–maximization (EM)-like repetitive algorithm. Below, we describe each of these processes.

2.2 PC regression

For the time being, we consider a situation where there is no missing value. PCA represents the variation of D-dimensional gene expression vectors y as a linear combination of principal axis vectors $w_l$ ($1 \le l \le K$) whose number is relatively small ($K < D$):

$$ y = \sum_{l=1}^{K} x_l w_l + \epsilon. \qquad (1) $$

The linear coefficients $x_l$ ($1 \le l \le K$) are called factor scores. $\epsilon$ denotes the residual error. Using a specifically determined number K, PCA obtains $x_l$ and $w_l$ such that the sum of squared error $\|\epsilon\|^2$ over the whole data set Y is minimized.

When there is no missing value, $x_l$ and $w_l$ are calculated as follows. A covariance matrix S for the expression vectors $y_i$ ($1 \le i \le N$) is given by

$$ S = \frac{1}{N} \sum_{i=1}^{N} (y_i - \mu)(y_i - \mu)^T, $$

where $\mu$ is the mean vector of y: $\mu \stackrel{\mathrm{def}}{=} (1/N) \sum_{i=1}^{N} y_i$. T denotes the transpose of a vector or a matrix. For convenience of description, Y is assumed to be row-wise normalized by a preprocess, so that $\mu = 0$ holds. With this normalization, the result by PCA is identical to that by SVD.

Let $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_D$ and $u_1, \ldots, u_D$ denote the eigenvalues and the corresponding eigenvectors, respectively, of S. We also define the l-th principal axis vector by $w_l = \sqrt{\lambda_l}\, u_l$. With these notations, the l-th factor score for an expression vector y is given by $x_l = (w_l / \lambda_l)^T y$.
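The no-missing-value case above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' software: the function names `pca_axes` and `factor_scores` are hypothetical, and the array `Y` is assumed to hold one gene expression vector per row (shape N × D, the transpose of the paper's D × N matrix) and to be already centred so that µ = 0.

```python
import numpy as np

def pca_axes(Y, K):
    """Return the K leading principal axes W (D x K) and their eigenvalues.

    Assumption: Y has shape (N, D), one expression vector per row,
    and is already centred (mu = 0).
    """
    N = Y.shape[0]
    S = Y.T @ Y / N                        # covariance matrix with mu = 0
    lam, U = np.linalg.eigh(S)             # eigh returns ascending eigenvalues
    order = np.argsort(lam)[::-1][:K]      # indices of the K largest eigenvalues
    lam, U = lam[order], U[:, order]
    W = U * np.sqrt(lam)                   # column l is w_l = sqrt(lambda_l) * u_l
    return W, lam

def factor_scores(W, lam, y):
    """Factor scores x_l = (w_l / lambda_l)^T y for one expression vector y."""
    return (W / lam).T @ y
```

With K = D and strictly positive eigenvalues, `W @ factor_scores(W, lam, y)` reproduces `y`, since the product reduces to the sum of projections onto the eigenvectors.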

Now we assume the existence of missing values. In PC regression, the missing part $y^{\mathrm{miss}}$ in the expression vector y is estimated from the observed part $y^{\mathrm{obs}}$ by using the PCA result. Let $w_l^{\mathrm{obs}}$ and $w_l^{\mathrm{miss}}$ be parts of each principal axis $w_l$ corresponding to the observed and missing parts, respectively, in y. Similarly, let $W = (W^{\mathrm{obs}}, W^{\mathrm{miss}})$ where $W^{\mathrm{obs}}$ or $W^{\mathrm{miss}}$ denotes a matrix whose column vectors are $w_1^{\mathrm{obs}}, \ldots, w_K^{\mathrm{obs}}$ or $w_1^{\mathrm{miss}}, \ldots, w_K^{\mathrm{miss}}$, respectively.


Factor scores $x = (x_1, \ldots, x_K)$ for the expression vector y are obtained by minimization of the residual error:

$$ \mathrm{err} = \| y^{\mathrm{obs}} - W^{\mathrm{obs}} x \|^2. $$

This is a well-known regression problem, and the least square solution is given by

$$ x = (W^{\mathrm{obs}\,T} W^{\mathrm{obs}})^{-1} W^{\mathrm{obs}\,T} y^{\mathrm{obs}}. $$

Using x, the missing part is estimated as

$$ y^{\mathrm{miss}} = W^{\mathrm{miss}} x. \qquad (2) $$

In the PC regression above, W should be known beforehand. Later, we will discuss the way to determine the parameter.
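The PC regression step can be illustrated as follows. This is a minimal sketch under stated assumptions: the function name `pc_regression_impute` is hypothetical, missing entries are marked with `np.nan`, and `W` is taken as given. `np.linalg.lstsq` is used in place of the explicit normal equations; the two agree whenever the observed block of W has full column rank.

```python
import numpy as np

def pc_regression_impute(W, y):
    """Fill the missing entries of y using principal axes W (D x K).

    Assumption: missing entries of y are encoded as np.nan.
    """
    miss = np.isnan(y)
    W_obs, W_miss = W[~miss], W[miss]      # rows of W for observed / missing parts
    # Least-squares factor scores: minimise ||y_obs - W_obs x||^2
    x, *_ = np.linalg.lstsq(W_obs, y[~miss], rcond=None)
    y_hat = y.copy()
    y_hat[miss] = W_miss @ x               # Equation (2): y_miss = W_miss x
    return y_hat
```

If `y` lies exactly in the span of the columns of `W` and enough entries are observed, the missing entries are recovered exactly; otherwise the result is the least-squares projection.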

2.3 Bayesian estimation

A parametric probabilistic model, which is called probabilistic PCA (PPCA), has been proposed recently (Tipping and Bishop, 1999). The probabilistic model is based on the assumption that the residual error $\epsilon$ and the factor scores $x_l$ ($1 \le l \le K$) in Equation (1) obey normal distributions:

$$ p(x) = \mathcal{N}_K(x \mid 0, I_K), \qquad p(\epsilon) = \mathcal{N}_D(\epsilon \mid 0, (1/\tau) I_D), $$

where $\mathcal{N}_K(x \mid \mu, \Sigma)$ denotes a K-dimensional normal distribution for x, whose mean and covariance are $\mu$ and $\Sigma$, respectively. $I_K$ is a (K × K) identity matrix and $\tau$ is a scalar inverse variance of $\epsilon$. In this PPCA model, a complete log-likelihood function is written as:

$$ \ln p(y, x \mid \theta) \equiv \ln p(y, x \mid W, \mu, \tau) = -\frac{\tau}{2} \| y - Wx - \mu \|^2 - \frac{1}{2} \| x \|^2 + \frac{D}{2} \ln \tau - \frac{K + D}{2} \ln 2\pi, $$

where $\theta \equiv \{W, \mu, \tau\}$ is the parameter set. Since the maximum likelihood (ML) estimation of the PPCA is identical to PCA, PPCA is a natural extension of PCA to a probabilistic model.
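As a quick numerical check of the complete log-likelihood, the expression can be evaluated directly. The function name `complete_log_likelihood` is hypothetical and the sketch simply transcribes the four terms of the formula, assuming `W` has shape (D, K).

```python
import numpy as np

def complete_log_likelihood(y, x, W, mu, tau):
    """ln p(y, x | W, mu, tau) for the PPCA model.

    Assumption: W has shape (D, K), y and mu have length D, x has length K.
    """
    D, K = W.shape
    r = y - W @ x - mu                               # residual epsilon
    return (-0.5 * tau * (r @ r)                     # noise term
            - 0.5 * (x @ x)                          # factor-score prior term
            + 0.5 * D * np.log(tau)
            - 0.5 * (K + D) * np.log(2 * np.pi))
```

When the residual and the factor scores are both zero and τ = 1, only the normalisation constant −((K + D)/2) ln 2π remains.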

We introduce here a Bayesian estimation method for PPCA, which was originally proposed by Bishop (1999). Bayesian estimation obtains the posterior distribution of $\theta$ and X, according to the Bayes theorem:

$$ p(\theta, X \mid Y) \propto p(Y, X \mid \theta)\, p(\theta). \qquad (3) $$

$p(\theta)$ is called a prior distribution, which denotes a priori preference for parameter $\theta$. The prior distribution is a part of the model and must be defined before estimation.

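We assume conjugate priors for $\tau$ and $\mu$, and a hierarchical prior for W, namely, the prior for W, $p(W \mid \tau, \alpha)$, is parameterized by a hyperparameter $\alpha \in \mathbb{R}^K$:

$$ p(\theta \mid \alpha) \equiv p(\mu, W, \tau \mid \alpha) = p(\mu \mid \tau)\, p(\tau) \prod_{j=1}^{K} p(w_j \mid \tau, \alpha_j), $$

$$ p(\mu \mid \tau) = \mathcal{N}(\mu \mid \bar{\mu}_0, (\gamma_{\mu 0} \tau)^{-1} I_D), $$

$$ p(w_j \mid \tau, \alpha_j) = \mathcal{N}(w_j \mid 0, (\alpha_j \tau)^{-1} I_D), $$

$$ p(\tau) = \mathcal{G}(\tau \mid \bar{\tau}_0, \gamma_{\tau 0}). $$

$\mathcal{G}(\tau \mid \bar{\tau}, \gamma_\tau)$ denotes a Gamma distribution with hyperparameters $\bar{\tau}$ and $\gamma_\tau$:

$$ \mathcal{G}(\tau \mid \bar{\tau}, \gamma_\tau) \equiv \frac{(\gamma_\tau \bar{\tau}^{-1})^{\gamma_\tau}}{\Gamma(\gamma_\tau)} \exp\left[ -\gamma_\tau \bar{\tau}^{-1} \tau + (\gamma_\tau - 1) \ln \tau \right], $$

where $\Gamma(\cdot)$ is a Gamma function.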

The variables used in the above priors, $\gamma_{\mu 0}$, $\bar{\mu}_0$, $\gamma_{\tau 0}$ and $\bar{\tau}_0$, are deterministic hyperparameters that define the prior. Their actual values should be given before the estimation. We set $\gamma_{\mu 0} = \gamma_{\tau 0} = 10^{-10}$, $\bar{\mu}_0 = 0$ and $\bar{\tau}_0 = 1$, which corresponds to an almost non-informative prior.

Assuming the priors and given a whole data set Y = {y}, the type-II ML hyperparameter $\alpha_{\mathrm{ML\text{-}II}}$ and the posterior distribution of the parameter, $q(\theta) = p(\theta \mid Y, \alpha_{\mathrm{ML\text{-}II}})$, are obtained by Bayesian estimation.

The hierarchical prior $p(W \mid \alpha, \tau)$, which is called an automatic relevance determination (ARD) prior, has an important role in BPCA. The j-th principal axis $w_j$ has a Gaussian prior, and its variance $1/(\alpha_j \tau)$ is controlled by a hyperparameter $\alpha_j$ which is determined by type-II ML estimation from the data. When the Euclidean norm of the principal axis, $\|w_j\|$, is small relative to the noise variance $1/\tau$, the hyperparameter $\alpha_j$ gets large and the principal axis $w_j$ shrinks to nearly 0. Thus, redundant principal axes are automatically suppressed.

2.4 EM-like repetitive algorithm

If we know the true parameter $\theta_{\mathrm{true}}$, the posterior of the missing values is given by

$$ q(Y^{\mathrm{miss}}) = p(Y^{\mathrm{miss}} \mid Y^{\mathrm{obs}}, \theta_{\mathrm{true}}), $$

which produces equivalent estimation to the PC regression. Here, $p(Y^{\mathrm{miss}} \mid Y^{\mathrm{obs}}, \theta_{\mathrm{true}})$ is obtained by marginalizing the likelihood (3) with respect to the observed variables $Y^{\mathrm{obs}}$. If we have the parameter posterior $q(\theta)$ instead of the true parameter, the posterior of the missing values is given by

$$ q(Y^{\mathrm{miss}}) = \int d\theta \, q(\theta)\, p(Y^{\mathrm{miss}} \mid Y^{\mathrm{obs}}, \theta), $$

which corresponds to the Bayesian PC regression. Since we naturally do not know the true parameter, we conduct the BPCA. Although the parameter posterior $q(\theta)$ can be easily obtained by the Bayesian estimation when a complete data set Y is available, we assume that only a part of Y, $Y^{\mathrm{obs}}$, is observed and the rest $Y^{\mathrm{miss}}$ is missing. In that situation, it is required to obtain $q(\theta)$ and $q(Y^{\mathrm{miss}})$ simultaneously.

We use a variational Bayes (VB) algorithm (Attias, 1999), in order to execute Bayesian estimation for both model parameter $\theta$ and missing values $Y^{\mathrm{miss}}$. Although the VB algorithm resembles the EM algorithm that obtains ML estimators for $\theta$ and $Y^{\mathrm{miss}}$, it obtains the posterior distributions for $\theta$ and $Y^{\mathrm{miss}}$, $q(\theta)$ and $q(Y^{\mathrm{miss}})$, by a repetitive algorithm.

The VB algorithm is implemented as follows: (a) the posterior distribution of missing values, $q(Y^{\mathrm{miss}})$, is initialized by imputing each of the missing values to gene-wise average; (b) the posterior distribution of the parameter $\theta$, $q(\theta)$, is estimated using the observed data $Y^{\mathrm{obs}}$ and the current posterior distribution of missing values, $q(Y^{\mathrm{miss}})$; (c) the posterior distribution of the missing values, $q(Y^{\mathrm{miss}})$, is estimated using the current $q(\theta)$; (d) the hyperparameter $\alpha$ is updated using both of the current $q(\theta)$ and the current $q(Y^{\mathrm{miss}})$; (e) repeat (b)–(d) until convergence.

The VB algorithm has been proved to converge to a locally optimal solution (Sato, 2001). Although the convergence to the global optimum is not guaranteed, in practice the VB algorithm for BPCA almost always converges to a single solution. This is probably because the objective function of BPCA has a simple landscape. As a consequence of the VB algorithm, therefore, $q(\theta)$ and $q(Y^{\mathrm{miss}})$ are expected to approach the global optimal posteriors.

Then, the missing values in the expression matrix are imputed to the expectation with respect to the estimated posterior distribution:

$$ \hat{Y}^{\mathrm{miss}} = \int Y^{\mathrm{miss}}\, q(Y^{\mathrm{miss}})\, dY^{\mathrm{miss}}. \qquad (4) $$

2.5 SVDimpute

With respect to the above three elementary processes, SVDimpute (Troyanskaya et al., 2001) is a method incorporating the first process and the ML estimation based on the EM algorithm, because SVD is identical to standard PCA when applied to a matrix normalized so that the row-wise mean is zero. Therefore, the most important advance of BPCA, in comparison to SVDimpute, is the existence of the second process, i.e. the Bayesian estimation using the ARD prior. An SVD-based imputation method was also proposed and described in detail (Hastie et al., 1999).
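To make the combination of PC regression with an EM-like loop concrete, here is a simplified sketch in the spirit of SVD-based imputation. It is not the published SVDimpute code and it is not the VB algorithm of BPCA: it uses point estimates only, with no Bayesian estimation and no ARD prior. The function name `em_pca_impute`, the `np.nan` encoding, the stopping rule and the genes-by-samples layout (shape N × D) are all assumptions made for illustration, as is the requirement that every gene has at least one observed entry.

```python
import numpy as np

def em_pca_impute(Y, K, n_iter=100, tol=1e-6):
    """Iterative PCA imputation on Y (N genes x D samples, np.nan = missing)."""
    miss = np.isnan(Y)
    # Initialise each missing entry with the gene-wise average of observed values
    gene_mean = np.nanmean(Y, axis=1, keepdims=True)
    Z = np.where(miss, gene_mean, Y)
    for _ in range(n_iter):
        mu = Z.mean(axis=0)                       # mean expression vector
        Zc = Z - mu
        # Re-fit the K leading principal directions on the completed matrix
        _, _, Vt = np.linalg.svd(Zc, full_matrices=False)
        V = Vt[:K].T                              # (D, K), orthonormal columns
        Z_new = Z.copy()
        # Re-estimate the missing entries of each gene by PC regression
        for i in np.flatnonzero(miss.any(axis=1)):
            m = miss[i]
            x, *_ = np.linalg.lstsq(V[~m], Zc[i, ~m], rcond=None)
            Z_new[i, m] = mu[m] + V[m] @ x
        change = np.max(np.abs(Z_new - Z))
        Z = Z_new
        if change < tol:
            break
    return Z
```

The orthonormal directions from the SVD are used instead of the scaled axes; the scaling changes the factor scores but not the fitted values, so the imputed entries are the same. Observed entries are never modified.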

2.6 KNNimpute

In order to estimate a missing value $y_{ih}$ in the i-th gene expression vector $y_i$ by KNNimpute (Troyanskaya et al., 2001), we first select K genes whose expression vectors are similar to $y_i$. Next, the missing value is estimated as the average of the corresponding entries in the selected K expression vectors. The similarity measure $s(y_i, y_j)$ between two expression vectors $y_i$ and $y_j$ is defined by the reciprocal of the Euclidean distance calculated over observed components in $y_i$. When there are other missing values in $y_i$ and/or $y_j$, their treatment requires some heuristics. Following Troyanskaya et al. (2001), we define the measure as follows:

$$ 1 / s(y_i, y_j) = \sum_{h \in O_i \cap O_j} (y_{ih} - y_{jh})^2, \qquad (5) $$

where $O_i = \{ h \mid \text{the } h\text{-th component of } y_i \text{ is observed} \}$. The missing entry $y_{ih}$ is estimated as average weighted by the similarity measure:

$$ \hat{y}_{ih} = \frac{\sum_{j \in I_{Kih}} s(y_i, y_j)\, y_{jh}}{\sum_{j \in I_{Kih}} s(y_i, y_j)}, \qquad (6) $$

where $I_{Kih}$ is the index set of K-nearest neighbor genes of the i-th gene, and if $y_{jh}$ is missing the j-th gene is excluded from $I_{Kih}$. Note that KNNimpute has no theoretical criteria for selecting the best K-value and the K-value has to be determined empirically.
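Equations (5) and (6) translate fairly directly into code. The sketch below is illustrative only: `knn_impute_entry` is a hypothetical name, `Y` is assumed to hold one gene per row with `np.nan` for missing entries, and the small floor on the distance (to avoid division by zero for identical profiles) is an added assumption that does not come from the paper.

```python
import numpy as np

def knn_impute_entry(Y, i, h, K):
    """Estimate the missing entry Y[i, h] from the K most similar genes.

    Assumption: Y has shape (N, D), one gene per row, np.nan = missing.
    """
    obs = ~np.isnan(Y)
    sims = {}
    for j in range(Y.shape[0]):
        if j == i or not obs[j, h]:
            continue                        # gene j cannot supply a value for column h
        common = obs[i] & obs[j]            # components observed in both genes
        if not common.any():
            continue
        d = np.sum((Y[i, common] - Y[j, common]) ** 2)   # Equation (5): 1/s
        sims[j] = 1.0 / max(d, 1e-12)       # floor is an assumption, avoids 1/0
    nearest = sorted(sims, key=sims.get, reverse=True)[:K]
    w = np.array([sims[j] for j in nearest])
    vals = np.array([Y[j, h] for j in nearest])
    return np.sum(w * vals) / np.sum(w)     # Equation (6): similarity-weighted average
```

For example, with gene 0 equal to (1, 2, missing), gene 1 equal to (1, 3, 3) and gene 2 equal to (1, 4, 6), the similarities are 1 and 0.25, so K = 2 gives (1·3 + 0.25·6) / 1.25 = 3.6. If no gene qualifies as a neighbor, the function returns `nan`.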

3 RESULTS AND DISCUSSION

3.1 Data sets

Spellman et al. (1998) placed a cDNA microarray data set relevant to the yeast cell-cycle at the URL http://genome-www.stanford.edu/cellcycle/data/rawdata/ as a complement. This data set consists of three parts, which are relevant to alpha factor (A-part), elutriation (E-part), and cdc15 and cdc28 (C-part). We first used samples in the A-part (18 samples) and the E-part (14 samples) to prepare test data sets. Each sample represents relative expression levels of 6178 genes, and 4304 genes have no missing value in the A- and E-parts. Therefore, the complete expression matrix is composed of 4304 genes. We prepared three test data sets: (data A), (data E) and (data A + E), from the complete expression matrix. The C-part samples were used for examining the effects of additional samples (see Section 3.4).

We also prepared a test data set (data A + E + C) by adding the C-part samples to (data A + E).

Takemasa et al. (2001) obtained original cDNA microarray data relevant to human colorectal cancer (CRC). Clinical materials of the data consist of 205 primary CRCs (127 non-metastatic primary CRCs, 54 metastatic primary CRCs to the liver and 24 metastatic primary CRCs to distant organs exclusive of the liver) and 12 normal colonic epithelia that were histopathologically confirmed to be free of cancer. Each sample expression vector represents logarithm-transformed ratios between the expression levels in the objective sample and those in the control reference using cDNA microarrays specialized for CRC, by selecting genes that were preferentially expressed in colorectal carcinoma tissue. As members of the complete expression matrix, we selected 758 of the 4608 genes, and a test data set (data I) was prepared.


Using these four data sets, (data A), (data E), (data A + E) and (data I), we examined the estimation ability for missing values.

In order to evaluate the performance of missing value estimation methods, we introduced artificial missing entries to a complete (i.e. without missing values) expression matrix. The artificial missing entries were introduced in two different ways:

Rate-based way: Randomly select a specific percentage of the entries in the complete expression matrix, and remove them.

Histogram-based way: Obtain a histogram of column-wise numbers of missing entries in the original expression matrix. Then, remove entries from the complete expression matrix so that the histogram of the artificial missing entries is similar to the histogram of the original missing entries.

When 5% artificial missing entries are introduced to (data I) in the rate-based way, the test data set is denoted by (data I, 5%).
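The rate-based way is simple enough to sketch directly. The function name `remove_at_rate`, the use of `np.nan` as the missing marker and the fixed random seed are illustrative assumptions; the histogram-based way is not covered here because the text does not specify the sampling procedure in enough detail.

```python
import numpy as np

def remove_at_rate(Y, rate, seed=0):
    """Return a copy of the complete matrix Y with a fraction `rate` of entries removed."""
    rng = np.random.default_rng(seed)
    n_remove = int(round(rate * Y.size))
    # Choose distinct flat indices so that exactly n_remove entries are removed
    idx = rng.choice(Y.size, size=n_remove, replace=False)
    Y_miss = Y.astype(float).copy()
    Y_miss.flat[idx] = np.nan               # np.nan marks an artificial missing entry
    return Y_miss
```

Calling `remove_at_rate(Y, 0.05)` on a complete matrix would give the analogue of a "5%" test data set, with the original `Y` kept as the answer for evaluation.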

The performance of the missing value estimation is evaluated by normalized root mean squared error (NRMSE):

$$ \mathrm{NRMSE} = \sqrt{ \frac{ \mathrm{mean}\left[ (y_{\mathrm{guess}} - y_{\mathrm{answer}})^2 \right] }{ \mathrm{variance}\left[ y_{\mathrm{answer}} \right] } }, \qquad (7) $$

where the mean and the variance are calculated over missing entries in the whole matrix. We know $y_{\mathrm{answer}}$ because the missing entries are artificial. When the estimation is accurate, NRMSE approaches its minimum value 0.0. When the estimation is equivalent to a random guess, which occurs either when the estimation is too poor or when the noise involved is too large, NRMSE approaches a value of 1.0.
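Equation (7) can be computed as follows. The function name `nrmse` is hypothetical, and the use of the population variance (NumPy's default, `ddof=0`) is an assumption, since the text does not say which variance estimator was used.

```python
import numpy as np

def nrmse(y_guess, y_answer):
    """Normalized root mean squared error over the artificially removed entries.

    Assumption: both inputs are 1-D sequences holding only the missing
    entries; variance is the population variance (ddof=0).
    """
    y_guess = np.asarray(y_guess, dtype=float)
    y_answer = np.asarray(y_answer, dtype=float)
    mse = np.mean((y_guess - y_answer) ** 2)
    return np.sqrt(mse / np.var(y_answer))
```

Under this convention a perfect estimate scores 0, and guessing the mean of the answers for every entry scores exactly 1, which matches the reference points described above.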

3.2 K-value selection

Both BPCA and SVDimpute depend on the number of principal axes (eigenvectors), K, and KNNimpute depends on the number of neighbors, K (see Section 2.6). Since these K-values describe similar parameters, we use the same symbol. In order to measure how the estimation ability depends on the value of K, we applied BPCA, SVDimpute and KNNimpute to the test data sets, (data A + E) and (data I), and calculated NRMSE with various K-values.

Figure 1 shows the results for (data A + E, 5%) and (data I, 5%). BPCA produces better results than KNNimpute or SVDimpute at the optimal K-value for each method. BPCA exhibits its best results with K = D − 1, where D is the number of samples. SVDimpute and BPCA show similar results when K is small, because they employ the same PC regression process. When K is larger, however, BPCA exhibits much better results than SVDimpute. This is due to the ARD prior, because the main difference of BPCA from SVDimpute is its existence. When K = 0, the imputation by BPCA or SVDimpute is identical to that based on gene-wise average, and therefore the results are poor.

Fig. 1. Estimation ability (NRMSE) by BPCA, SVD and KNN with various K-values. (top panel): Application to (data A + E, 5%). (bottom panel): Application to (data I, 5%).

In Figure 1, we see that BPCA exhibits different NRMSE curves in the (data A + E) case and the (data I) case, with respect to the optimal K-value. For (data I), the NRMSE curve becomes almost flat between K = 100 and K = 204, because the principal axes beyond the 100th degenerated so that their lengths became almost zero. For (data A + E), on the other hand, almost none of the axes degenerated, except for the 31st one. From K = 1 to K = 30, therefore, each additional axis improved the estimation ability. Although the improvement by adding the 30th axis was apparently large, we consider this is a special phenomenon for this data set. This phenomenon implies the importance of setting K = D − 1 if we do not have a priori knowledge on the data set.

Accordingly, we can safely use K = D − 1 for every data set in BPCA. If the effective dimension of the data set is smaller than the K-value, the ARD prior automatically reduces the redundant principal axes. Therefore, in our BPCA method, there is no need to tune the K-value in advance.



References

The Elements of Statistical Learning

Cluster analysis and display of genome-wide expression patterns

Molecular portraits of human breast tumours

Molecular classification of cancer: class discovery and class prediction by gene expression monitoring.
