BIOINFORMATICS
Vol. 19 no. 16 2003, pages 2088–2096
DOI: 10.1093/bioinformatics/btg287

A Bayesian missing value estimation method for gene expression profile data

Shigeyuki Oba (1), Masa-aki Sato (2,5), Ichiro Takemasa (3), Morito Monden (3), Ken-ichi Matsubara (4) and Shin Ishii (1,5,*)

(1) Graduate School of Information Science, Nara Institute of Science and Technology, 8916-5 Takayama, Ikoma 630-0192, Japan
(2) ATR Human Information Science Laboratories, 2-2-2 Hikaridai, Seika-cho, Soraku-gun, Kyoto, Japan
(3) Graduate School of Medicine, Osaka University, 2-2 Yamadaoka, Suita, Osaka, Japan
(4) DNA Chip Research Institute, 134 Kobecho, Hodogayaku, Yokohama, Japan
(5) CREST, Japan Science and Technology Corporation

Received on March 10, 2003; revised on May 6, 2003; accepted on May 9, 2003
ABSTRACT
Motivation: Gene expression profile analyses have been used in numerous studies covering a broad range of areas in biology. When unreliable measurements are excluded, missing values are introduced in gene expression profiles. Although existing multivariate analysis methods have difficulty with the treatment of missing values, this problem has received little attention. There are many options for dealing with missing values, each of which reaches drastically different results. Ignoring missing values is the simplest method and is frequently applied. This approach, however, has its flaws. In this article, we propose an estimation method for missing values, which is based on Bayesian principal component analysis (BPCA). Although the methodology that a probabilistic model and latent variables are estimated simultaneously within the framework of Bayes inference is not new in principle, actual BPCA implementation that makes it possible to estimate arbitrary missing variables is new in terms of statistical methodology.
Results: When applied to DNA microarray data from various experimental conditions, the BPCA method exhibited markedly better estimation ability than other recently proposed methods, such as singular value decomposition and K-nearest neighbors. While the estimation performance of existing methods depends on model parameters whose determination is difficult, our BPCA method is free from this difficulty. Accordingly, the BPCA method provides accurate and convenient estimation for missing values.
Availability: The software is available at http://hawaii.aist-nara.ac.jp/~shige-o/tools/
Contact: ishii@is.aist-nara.ac.jp
1 INTRODUCTION
Gene expression profiling, using DNA microarrays, provides high throughput investigation of gene expressions by simultaneously measuring the expression of thousands of genes under a certain experimental condition. Gene expression profiling has been used in numerous studies over a broad range of biological disciplines. In clinical studies on cancer classification, e.g. Golub et al. (1999) distinguished between two leukemia subtypes, acute myeloid leukemia and acute lymphoblastic leukemia, by comparing the expression of ‘predictor genes’. Their methodologies on class discovery and class prediction have been applied in a number of studies examining expression changes underlying various clinical phenomena. Unknown effects of a specific therapy were estimated by comparing gene expression profiles before and after the therapy (Perou et al., 2000). Gene expression profile analyses were also effective in cancer prognosis prediction, even when morphological or immunohistological study was difficult (Alizadeh et al., 2000; Kihara et al., 2001; Pomeroy et al., 2002; Shipp et al., 2002; van’t Veer et al., 2002). In addition, expression profile analyses successfully identified genes relevant to a certain diagnosis or therapy (Takemasa et al., 2001; Muro et al., 2003). In these studies, various multivariate analysis methods have played crucial roles. Clustering, e.g. hierarchical clustering, is a popular unsupervised classification analysis, which has mainly been applied to class discovery problems (Eisen et al., 1998).

* To whom correspondence should be addressed.
In order to extract underlying biological reality based on gene expression profile analyses, it is necessary to discard various artifacts, such as noise and fluctuations that occur through the acquisition and normalization of data. Suspicious values are usually regarded as missing values, because they may be detrimental to further analyses. Existing multivariate analyses for expression profile data, however, often have difficulty with the treatment of missing values. Different methods of treating missing values may lead to different results. Although the handling of missing values is thus very important, researchers have often been unaware of this issue.
As an example, hierarchical clustering (Eisen et al., 1998) constructs gene clusters or sample clusters based on the distance between two gene expression profiles (vectors) or two sample expression vectors, respectively. The distance measurement on missing values, however, is problematic. Existing hierarchical clustering software, such as ‘Cluster’ (Eisen et al., 1998), defines the distance between vectors with missing values, by just ignoring the missing dimensions. Since ignoring dimensions is identical to assuming that expression levels are the same in two vectors, the distance between vectors with missing values tends to be smaller than that between vectors without missing values. Therefore, a cluster of genes with a lot of missing values is often obtained as a result. Support vector machine (SVM) classifier (Brown et al., 2000), a popular multivariate supervised classification method, also encounters a similar problem in defining distance. Moreover, many multivariate statistical analyses, like principal component analysis (PCA) (Raychaudhuri et al., 2000) and singular value decomposition (SVD) (Alter et al., 2000), cannot be applied to data with missing values. Thus, in order to avoid improper analyses, missing value estimation is an important preprocess.

Bioinformatics 19(16) © Oxford University Press 2003; all rights reserved.
There are several simple ways to deal with missing values such as deleting an expression vector with missing values from further analysis, imputing missing values to zero, or imputing missing values of a certain gene (sample) to the sample (gene) average (Alizadeh et al., 2000). On the other hand, Troyanskaya et al. (2001) proposed two advanced estimation methods for missing values in expression profiles. One method is based on K-nearest neighbor (KNNimpute), and the other is based on SVD (SVDimpute). Troyanskaya et al. evaluated their performance using various microarray data sets and reported that the two advanced methods performed better than the above-mentioned simple methods. The estimation ability of these advanced methods depends on important model parameters, such as the K-value in KNNimpute and the number of eigenvectors in SVDimpute. There is no theoretical way, however, to determine these parameters appropriately.
In this paper, we propose a new missing value estimation method based on Bayesian PCA (BPCA) (Bishop, 1999). Although the methodology that a probabilistic model and latent variables are estimated simultaneously within the framework of Bayes inference is not new in principle, actual BPCA implementation that makes it possible to estimate arbitrary missing variables is new in terms of statistical methodology. We evaluated the method by comparing it to KNNimpute and SVDimpute (Troyanskaya et al., 2001), using various microarray data sets, and showed marked improvement in estimation performance. In addition, the model parameter is automatically determined in this BPCA method. Therefore, our BPCA method can be easily used by medical and biological scientists to analyze gene expression data.
2 SYSTEM AND METHODS
A whole data set of gene expression profiles is represented by a numerical (D × N) matrix Y, where N is the number of genes and D is the number of samples. Y is called an expression matrix. The (i, j) component of the matrix, $y_{ij}$, denotes the expression level of the j-th gene in the i-th sample, which is typically a logarithm of the expression ratio between the control and the objective samples, in the case of cDNA microarray data. The i-th row vector and the j-th column vector of the matrix are called the expression vector of the i-th sample and the expression vector of the j-th gene, respectively.
2.1 BPCA
The missing value estimation method based on BPCA consists of three elementary processes. They are (1) principal component (PC) regression, (2) Bayesian estimation, and (3) an expectation–maximization (EM)-like repetitive algorithm. Below, we describe each of these processes.
2.2 PC regression
For the time being, we consider a situation where there is no missing value. PCA represents the variation of D-dimensional gene expression vectors $y$ as a linear combination of principal axis vectors $w_l$ ($1 \le l \le K$) whose number is relatively small ($K < D$):

$$y = \sum_{l=1}^{K} x_l w_l + \epsilon. \tag{1}$$

The linear coefficients $x_l$ ($1 \le l \le K$) are called factor scores. $\epsilon$ denotes the residual error. Using a specifically determined number $K$, PCA obtains $x_l$ and $w_l$ such that the sum of squared error $\|\epsilon\|^2$ over the whole data set $Y$ is minimized.
When there is no missing value, $x_l$ and $w_l$ are calculated as follows. A covariance matrix $S$ for the expression vectors $y_i$ ($1 \le i \le N$) is given by

$$S = \frac{1}{N} \sum_{i=1}^{N} (y_i - \mu)(y_i - \mu)^T,$$

where $\mu$ is the mean vector of $y$: $\mu \stackrel{\mathrm{def}}{=} (1/N) \sum_{i=1}^{N} y_i$. $T$ denotes the transpose of a vector or a matrix. For convenience of description, $Y$ is assumed to be row-wise normalized by a preprocess, so that $\mu = 0$ holds. With this normalization, the result by PCA is identical to that by SVD.
Let $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_D$ and $u_1, u_2, \ldots, u_D$ denote the eigenvalues and the corresponding eigenvectors, respectively, of $S$. We also define the l-th principal axis vector by $w_l = \sqrt{\lambda_l}\, u_l$. With these notations, the l-th factor score for an expression vector $y$ is given by $x_l = (w_l / \lambda_l)^T y$.
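The complete-data computation above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' software: the function names `principal_axes` and `factor_scores` and the random toy data are hypothetical, and the demo deliberately sets K = D (rather than K < D) so that the reconstruction in Equation (1) is exact and easy to check.

```python
import numpy as np

def principal_axes(Y, K):
    """Return the K leading principal axes of a (D, N) matrix Y.

    Columns of Y are the D-dimensional vectors y_i, assumed centered (mu = 0).
    """
    N = Y.shape[1]
    S = (Y @ Y.T) / N                    # covariance matrix S with mu = 0
    lam, U = np.linalg.eigh(S)           # eigenvalues in ascending order
    order = np.argsort(lam)[::-1][:K]    # indices of the K largest eigenvalues
    lam, U = lam[order], U[:, order]
    W = U * np.sqrt(lam)                 # w_l = sqrt(lambda_l) * u_l
    return W, lam

def factor_scores(W, lam, y):
    """Factor scores x_l = (w_l / lambda_l)^T y for one vector y."""
    return (W / lam).T @ y

# Hypothetical toy data: D = 5 samples, N = 200 genes.
rng = np.random.default_rng(0)
Y = rng.normal(size=(5, 200))
Y -= Y.mean(axis=1, keepdims=True)       # row-wise normalization so that mu = 0
W, lam = principal_axes(Y, K=5)
x = factor_scores(W, lam, Y[:, 0])
y_rebuilt = W @ x                        # with K = D the residual is zero
```

With K smaller than D, `W @ x` would instead give the least-squares projection of the vector onto the span of the retained axes.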
Now we assume the existence of missing values. In PC regression, the missing part $y^{\mathrm{miss}}$ in the expression vector $y$ is estimated from the observed part $y^{\mathrm{obs}}$ by using the PCA result. Let $w_l^{\mathrm{obs}}$ and $w_l^{\mathrm{miss}}$ be parts of each principal axis $w_l$, corresponding to the observed and missing parts, respectively, in $y$. Similarly, let $W = (W^{\mathrm{obs}}, W^{\mathrm{miss}})$, where $W^{\mathrm{obs}}$ or $W^{\mathrm{miss}}$ denotes a matrix whose column vectors are $w_1^{\mathrm{obs}}, \ldots, w_K^{\mathrm{obs}}$ or $w_1^{\mathrm{miss}}, \ldots, w_K^{\mathrm{miss}}$, respectively.
Factor scores $x = (x_1, \ldots, x_K)$ for the expression vector $y$ are obtained by minimization of the residual error:

$$\mathrm{err} = \|y^{\mathrm{obs}} - W^{\mathrm{obs}} x\|^2.$$

This is a well-known regression problem, and the least square solution is given by

$$x = (W^{\mathrm{obs}\,T} W^{\mathrm{obs}})^{-1} W^{\mathrm{obs}\,T} y^{\mathrm{obs}}.$$

Using $x$, the missing part is estimated as

$$y^{\mathrm{miss}} = W^{\mathrm{miss}} x. \tag{2}$$

In the PC regression above, $W$ should be known beforehand. Later, we will discuss the way to determine the parameter.
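As a rough illustration of PC regression, the sketch below fills the missing entries of one expression vector given a known matrix of principal axes. The function name `pc_regression_impute`, the use of NaN to mark missing entries, and the toy data are assumptions made for this example. `np.linalg.lstsq` is used in place of the explicit normal-equation formula; the two agree whenever the observed rows of W have full column rank.

```python
import numpy as np

def pc_regression_impute(y, W):
    """Fill the NaN entries of y using known principal axes W of shape (D, K)."""
    miss = np.isnan(y)
    W_obs, W_miss = W[~miss], W[miss]    # rows of W for observed / missing parts
    # Least-squares factor scores: minimize ||y_obs - W_obs x||^2.
    x, *_ = np.linalg.lstsq(W_obs, y[~miss], rcond=None)
    y_hat = y.copy()
    y_hat[miss] = W_miss @ x             # Equation (2): y_miss = W_miss x
    return y_hat

# Hypothetical noise-free example: y lies exactly in the span of W.
rng = np.random.default_rng(1)
W = rng.normal(size=(6, 2))
y_true = W @ np.array([0.5, -1.0])
y = y_true.copy()
y[[1, 4]] = np.nan                       # mark two entries as missing
y_hat = pc_regression_impute(y, W)
```

Because the toy vector is noise-free, the missing entries are recovered exactly; with noisy data the estimate is only the projection implied by the observed part.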
2.3 Bayesian estimation
A parametric probabilistic model, which is called probabilistic PCA (PPCA), has been proposed recently (Tipping and Bishop, 1999). The probabilistic model is based on the assumption that the residual error $\epsilon$ and the factor scores $x_l$ ($1 \le l \le K$) in Equation (1) obey normal distributions:

$$p(x) = \mathcal{N}_K(x \mid 0, I_K),$$
$$p(\epsilon) = \mathcal{N}_D(\epsilon \mid 0, (1/\tau) I_D),$$

where $\mathcal{N}_K(x \mid \mu, \Sigma)$ denotes a K-dimensional normal distribution for $x$, whose mean and covariance are $\mu$ and $\Sigma$, respectively. $I_K$ is a (K × K) identity matrix and $\tau$ is a scalar inverse variance of $\epsilon$. In this PPCA model, a complete log-likelihood function is written as:

$$\ln p(y, x \mid \theta) \equiv \ln p(y, x \mid W, \mu, \tau) = -\frac{\tau}{2}\|y - Wx - \mu\|^2 - \frac{1}{2}\|x\|^2 + \frac{D}{2}\ln\tau - \frac{K + D}{2}\ln 2\pi,$$

where $\theta \equiv \{W, \mu, \tau\}$ is the parameter set. Since the maximum likelihood (ML) estimation of the PPCA is identical to PCA, PPCA is a natural extension of PCA to a probabilistic model.
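To make the complete log-likelihood concrete, the following sketch evaluates it for a single pair (y, x). The function name and the random inputs are hypothetical; the expression is simply the sum of the log-densities of the two normal distributions stated above.

```python
import math
import numpy as np

def complete_log_likelihood(y, x, W, mu, tau):
    """ln p(y, x | W, mu, tau) for the PPCA model with noise precision tau."""
    D, K = W.shape
    resid = y - W @ x - mu
    return (-0.5 * tau * float(resid @ resid)    # -(tau/2) ||y - Wx - mu||^2
            - 0.5 * float(x @ x)                 # -(1/2) ||x||^2
            + 0.5 * D * math.log(tau)            # +(D/2) ln tau
            - 0.5 * (K + D) * math.log(2.0 * math.pi))

# Hypothetical inputs with D = 3 and K = 2.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 2))
x = rng.normal(size=2)
mu = rng.normal(size=3)
y = rng.normal(size=3)
tau = 2.5
value = complete_log_likelihood(y, x, W, mu, tau)
```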
We introduce here a Bayesian estimation method for PPCA, which was originally proposed by Bishop (1999). Bayesian estimation obtains the posterior distribution of $\theta$ and $X$, according to the Bayes theorem:

$$p(\theta, X \mid Y) \propto p(Y, X \mid \theta)\, p(\theta). \tag{3}$$

$p(\theta)$ is called a prior distribution, which denotes a priori preference for parameter $\theta$. The prior distribution is a part of the model and must be defined before estimation.
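We assume conjugate priors for $\tau$ and $\mu$, and a hierarchical prior for $W$, namely, the prior for $W$, $p(W \mid \tau, \alpha)$, is parameterized by a hyperparameter $\alpha \in \mathbb{R}^K$.

$$p(\theta \mid \alpha) \equiv p(\mu, W, \tau \mid \alpha) = p(\mu \mid \tau)\, p(\tau) \prod_{j=1}^{K} p(w_j \mid \tau, \alpha_j),$$
$$p(\mu \mid \tau) = \mathcal{N}(\mu \mid \bar{\mu}_0, (\gamma_{\mu_0} \tau)^{-1} I_m),$$
$$p(w_j \mid \tau, \alpha_j) = \mathcal{N}(w_j \mid 0, (\alpha_j \tau)^{-1} I_m),$$
$$p(\tau) = \mathcal{G}(\tau \mid \bar{\tau}_0, \gamma_{\tau_0}).$$

$\mathcal{G}(\tau \mid \bar{\tau}, \gamma_\tau)$ denotes a Gamma distribution with hyperparameters $\bar{\tau}$ and $\gamma_\tau$:

$$\mathcal{G}(\tau \mid \bar{\tau}, \gamma_\tau) \equiv \frac{(\gamma_\tau \bar{\tau}^{-1})^{\gamma_\tau}}{\Gamma(\gamma_\tau)} \exp\left[-\gamma_\tau \bar{\tau}^{-1} \tau + (\gamma_\tau - 1) \ln \tau\right],$$

where $\Gamma(\cdot)$ is a Gamma function.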
The variables used in the above priors, $\gamma_{\mu_0}$, $\bar{\mu}_0$, $\gamma_{\tau_0}$ and $\bar{\tau}_0$, are deterministic hyperparameters that define the prior. Their actual values should be given before the estimation. We set $\gamma_{\mu_0} = \gamma_{\tau_0} = 10^{-10}$, $\bar{\mu}_0 = 0$ and $\bar{\tau}_0 = 1$, which corresponds to an almost non-informative prior.
Assuming the priors and given a whole data set $Y = \{y\}$, the type-II ML hyperparameter $\alpha_{\text{ML-II}}$ and the posterior distribution of the parameter, $q(\theta) = p(\theta \mid Y, \alpha_{\text{ML-II}})$, are obtained by Bayesian estimation.
The hierarchical prior $p(W \mid \alpha, \tau)$, which is called an automatic relevance determination (ARD) prior, has an important role in BPCA. The j-th principal axis $w_j$ has a Gaussian prior, and its variance $1/(\alpha_j \tau)$ is controlled by a hyperparameter $\alpha_j$, which is determined by type-II ML estimation from the data. When the Euclidean norm of the principal axis, $\|w_j\|$, is small relative to the noise variance $1/\tau$, the hyperparameter $\alpha_j$ becomes large and the principal axis $w_j$ shrinks nearly to 0. Thus, redundant principal axes are automatically suppressed.
2.4 EM-like repetitive algorithm
If we know the true parameter $\theta_{\mathrm{true}}$, the posterior of the missing values is given by

$$q(Y^{\mathrm{miss}}) = p(Y^{\mathrm{miss}} \mid Y^{\mathrm{obs}}, \theta_{\mathrm{true}}),$$

which produces equivalent estimation to the PC regression. Here, $p(Y^{\mathrm{miss}} \mid Y^{\mathrm{obs}}, \theta_{\mathrm{true}})$ is obtained by marginalizing the likelihood (3) with respect to the observed variables $Y^{\mathrm{obs}}$.
If we have the parameter posterior $q(\theta)$ instead of the true parameter, the posterior of the missing values is given by

$$q(Y^{\mathrm{miss}}) = \int d\theta\, q(\theta)\, p(Y^{\mathrm{miss}} \mid Y^{\mathrm{obs}}, \theta),$$

which corresponds to the Bayesian PC regression. Since the true parameter is naturally unknown, we conduct BPCA. Although the parameter posterior $q(\theta)$ can be easily obtained by the Bayesian estimation when a complete data set $Y$ is available, we assume that only a part of $Y$, $Y^{\mathrm{obs}}$, is observed and the rest, $Y^{\mathrm{miss}}$, is missing. In that situation, it is required to obtain $q(\theta)$ and $q(Y^{\mathrm{miss}})$ simultaneously.
We use a variational Bayes (VB) algorithm (Attias, 1999), in order to execute Bayesian estimation for both model parameter $\theta$ and missing values $Y^{\mathrm{miss}}$. Although the VB algorithm resembles the EM algorithm that obtains ML estimators for $\theta$ and $Y^{\mathrm{miss}}$, it obtains the posterior distributions for $\theta$ and $Y^{\mathrm{miss}}$, $q(\theta)$ and $q(Y^{\mathrm{miss}})$, by a repetitive algorithm.
The VB algorithm is implemented as follows:
(a) the posterior distribution of missing values, $q(Y^{\mathrm{miss}})$, is initialized by imputing each of the missing values to gene-wise average;
(b) the posterior distribution of the parameter $\theta$, $q(\theta)$, is estimated using the observed data $Y^{\mathrm{obs}}$ and the current posterior distribution of missing values, $q(Y^{\mathrm{miss}})$;
(c) the posterior distribution of the missing values, $q(Y^{\mathrm{miss}})$, is estimated using the current $q(\theta)$;
(d) the hyperparameter $\alpha$ is updated using both the current $q(\theta)$ and the current $q(Y^{\mathrm{miss}})$;
(e) repeat (b)–(d) until convergence.
The VB algorithm has been proved to converge to a locally optimal solution (Sato, 2001). Although the convergence to the global optimum is not guaranteed, the VB algorithm for BPCA almost always converges to a single solution in practice. This is probably because the objective function of BPCA has a simple landscape. As a consequence of the VB algorithm, therefore, $q(\theta)$ and $q(Y^{\mathrm{miss}})$ are expected to approach the global optimal posteriors.

Then, the missing values in the expression matrix are imputed to the expectation with respect to the estimated posterior distribution:

$$\hat{Y}^{\mathrm{miss}} = \int Y^{\mathrm{miss}}\, q(Y^{\mathrm{miss}})\, dY^{\mathrm{miss}}. \tag{4}$$
2.5 SVDimpute
With respect to the above three elementary processes, SVDimpute (Troyanskaya et al., 2001) is a method incorporating the first process and the ML estimation based on the EM algorithm, because SVD is identical to standard PCA when applied to a matrix normalized so that the row-wise mean is zero. Therefore, the most important advance of BPCA, in comparison to SVDimpute, is the existence of the second process, i.e. the Bayesian estimation using the ARD prior. An SVD-based imputation method was also proposed and described in detail (Hastie et al., 1999).
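The initialize, re-estimate and re-impute loop shared by these methods can be sketched with a plain low-rank SVD in place of the Bayesian posterior updates. The code below is a simplified, SVDimpute-style iteration written for illustration only: it is neither the implementation of Troyanskaya et al. nor BPCA (there is no ARD prior, so the rank K must be chosen by the user), and the function name, tolerances and toy data are all assumptions.

```python
import numpy as np

def iterative_pca_impute(Y, K, n_iter=200, tol=1e-8):
    """Fill NaNs in a (D, N) samples-by-genes matrix by repeated rank-K fits."""
    miss = np.isnan(Y)
    filled = Y.copy()
    # Initialization: gene-wise average, i.e. the mean of each column.
    col_mean = np.nanmean(Y, axis=0)
    filled[miss] = np.broadcast_to(col_mean, Y.shape)[miss]
    for _ in range(n_iter):
        mu = filled.mean(axis=1, keepdims=True)        # row-wise mean
        U, s, Vt = np.linalg.svd(filled - mu, full_matrices=False)
        approx = (U[:, :K] * s[:K]) @ Vt[:K] + mu      # rank-K reconstruction
        change = np.max(np.abs(approx[miss] - filled[miss]))
        filled[miss] = approx[miss]                    # update missing entries only
        if change < tol:
            break
    return filled

# Hypothetical rank-2 data with 24 entries removed at fixed positions.
rng = np.random.default_rng(0)
truth = rng.normal(size=(8, 2)) @ rng.normal(size=(2, 60))
mask = np.zeros(truth.shape, dtype=bool)
mask[np.arange(24) % 8, (np.arange(24) * 7) % 60] = True
Y = truth.copy()
Y[mask] = np.nan
filled = iterative_pca_impute(Y, K=2)
baseline = np.broadcast_to(np.nanmean(Y, axis=0), Y.shape)
err_pca = np.sqrt(np.mean((filled[mask] - truth[mask]) ** 2))
err_mean = np.sqrt(np.mean((baseline[mask] - truth[mask]) ** 2))
```

In this sketch the observed entries are never modified; only the imputed entries move between iterations.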
2.6 KNNimpute
In order to estimate a missing value $y_{ih}$ in the i-th gene expression vector $y_i$ by KNNimpute (Troyanskaya et al., 2001), we first select K genes whose expression vectors are similar to $y_i$. Next, the missing value is estimated as the average of the corresponding entries in the selected K expression vectors. The similarity measure $s_i(y_j)$ between two expression vectors $y_i$ and $y_j$ is defined by the reciprocal of the Euclidean distance calculated over observed components in $y_i$. When there are other missing values in $y_i$ and/or $y_j$, their treatment requires some heuristics. Following Troyanskaya et al. (2001), we define the measure as follows:

$$1/s_i(y_j) = \sqrt{\sum_{h \in O_i \cap O_j} (y_{ih} - y_{jh})^2}, \tag{5}$$
$$O_i = \{h \mid \text{the } h\text{-th component of } y_i \text{ is observed}\}.$$

The missing entry $y_{ih}$ is estimated as an average weighted by the similarity measure:

$$\hat{y}_{ih} = \frac{\sum_{j \in I_{Kih}} s_i(y_j)\, y_{jh}}{\sum_{j \in I_{Kih}} s_i(y_j)}, \tag{6}$$

where $I_{Kih}$ is the index set of K-nearest neighbor genes of the i-th gene, and if $y_{jh}$ is missing the j-th gene is excluded from $I_{Kih}$. Note that KNNimpute has no theoretical criteria for selecting the best K-value and the K-value has to be determined empirically.
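A minimal sketch of the weighted estimate in Equations (5) and (6) follows, taking the distance to be Euclidean as stated in the text. It is not the KNNimpute software: the function name, the genes-by-samples array layout (rows are gene vectors $y_i$), the NaN convention and the small guard against a zero distance are assumptions of this example.

```python
import numpy as np

def knn_impute_entry(G, i, h, K):
    """Estimate the missing entry G[i, h] of a (genes, samples) array G."""
    obs = ~np.isnan(G)
    sims = {}
    for j in range(G.shape[0]):
        if j == i or not obs[j, h]:
            continue                              # y_jh must be observed
        common = obs[i] & obs[j]                  # components in both O_i and O_j
        if not common.any():
            continue
        dist = np.sqrt(np.sum((G[i, common] - G[j, common]) ** 2))
        sims[j] = 1.0 / max(dist, 1e-12)          # similarity = 1 / distance
    nearest = sorted(sims, key=sims.get, reverse=True)[:K]
    weights = np.array([sims[j] for j in nearest])
    values = np.array([G[j, h] for j in nearest])
    return float(weights @ values / weights.sum())   # Equation (6)

# Hypothetical example: gene 0 lacks its third component.
G = np.array([[1.0, 2.0, np.nan],
              [1.0, 3.0, 3.0],
              [1.0, 4.0, 5.0],
              [10.0, 10.0, 100.0]])
estimate = knn_impute_entry(G, i=0, h=2, K=2)
```

Here genes 1 and 2 are the two nearest neighbors, with distances 1 and 2, so the estimate is the weighted average (1 × 3 + 0.5 × 5) / 1.5.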
3 RESULTS AND DISCUSSION
3.1 Data sets
Spellman et al. (1998) placed a cDNA microarray data set relevant to the yeast cell-cycle at the URL http://genome-www.stanford.edu/cellcycle/data/rawdata/ as a supplement. This data set consists of three parts, which are relevant to alpha factor (A-part), elutriation (E-part), and cdc15 and cdc28 (C-part). We first used samples in the A-part (18 samples) and the E-part (14 samples) to prepare test data sets. Each sample represents relative expression levels of 6178 genes, and 4304 genes have no missing value in the A- and E-parts. Therefore, the complete expression matrix is composed of 4304 genes. We prepared three test data sets: (data A), (data E) and (data A + E), from the complete expression matrix. The C-part samples were used for examining the effects of additional samples (see Section 3.4). We also prepared a test data set (data A + E + C) by adding the C-part samples to (data A + E).
Takemasa et al. (2001) obtained original cDNA microarray data relevant to human colorectal cancer (CRC). Clinical materials of the data consist of 205 primary CRCs that include 127 non-metastatic primary CRCs, 54 metastatic primary CRCs to the liver and 24 metastatic primary CRCs to distant organs exclusive of the liver, and 12 normal colonic epithelia that were histopathologically confirmed to be free of cancer. Each sample expression vector represents logarithm-transformed ratios between the expression levels in the objective sample and those in the control reference, using cDNA microarrays specialized for CRC, by selecting genes that were preferentially expressed in colorectal carcinoma tissue. As members of the complete expression matrix, we selected 758 genes out of 4608 genes, and a test data set (data I) was prepared.
Using these four data sets, (data A), (data E), (data A + E) and (data I), we examined the estimation ability for missing values.

In order to evaluate the performance of missing value estimation methods, we introduced artificial missing entries to a complete (i.e. without missing values) expression matrix. The artificial missing entries were introduced in two different ways:

Rate-based way: Randomly select a specific percentage of the entries in the complete expression matrix, and remove them.

Histogram-based way: Obtain a histogram of column-wise numbers of missing entries in the original expression matrix. Then, remove entries from the complete expression matrix so that the histogram of the artificial missing entries is similar to the histogram of the original missing entries.

When 5% artificial missing entries are introduced to (data I) in the rate-based way, the test data set is denoted by (data I, 5%).
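The rate-based way can be sketched as follows. The function name `remove_entries_at_rate`, the NaN convention for missing entries and the random toy matrix are assumptions of this example; the text does not specify how the random selection was implemented.

```python
import numpy as np

def remove_entries_at_rate(Y_complete, rate, rng):
    """Return a copy of Y_complete with a fraction `rate` of entries set to NaN."""
    n_total = Y_complete.size
    n_remove = int(round(rate * n_total))
    idx = rng.choice(n_total, size=n_remove, replace=False)   # distinct positions
    mask = np.zeros(n_total, dtype=bool)
    mask[idx] = True
    mask = mask.reshape(Y_complete.shape)
    Y_missing = Y_complete.astype(float)      # astype returns a new array
    Y_missing[mask] = np.nan
    return Y_missing, mask

# Hypothetical complete matrix with 20 x 100 = 2000 entries, 5% removed.
rng = np.random.default_rng(0)
Y_complete = rng.normal(size=(20, 100))
Y_missing, mask = remove_entries_at_rate(Y_complete, 0.05, rng)
```

Keeping the boolean mask alongside the masked matrix makes it straightforward to compare estimates against the known answers afterwards.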
The performance of the missing value estimation is evaluated by normalized root mean squared error (NRMSE):

$$\mathrm{NRMSE} = \sqrt{\frac{\mathrm{mean}[(y_{\mathrm{guess}} - y_{\mathrm{answer}})^2]}{\mathrm{variance}[y_{\mathrm{answer}}]}}, \tag{7}$$

where the mean and the variance are calculated over missing entries in the whole matrix. We know $y_{\mathrm{answer}}$ because the missing entries are artificial. When the estimation is accurate, NRMSE approaches its minimum value 0.0. When the estimation is equivalent to a random guess, which occurs either when the estimation is too poor or when the noise involved is too large, NRMSE approaches a value of 1.0.
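Equation (7) translates directly into code. In the sketch below the function name `nrmse` and the sample numbers are hypothetical, and the population variance (NumPy's default `ddof=0`) is assumed, since the text does not say which variance estimator was used.

```python
import numpy as np

def nrmse(y_guess, y_answer):
    """Normalized root mean squared error over the artificially removed entries."""
    y_guess = np.asarray(y_guess, dtype=float)
    y_answer = np.asarray(y_answer, dtype=float)
    mse = np.mean((y_guess - y_answer) ** 2)
    return float(np.sqrt(mse / np.var(y_answer)))

# Hypothetical answers for four removed entries.
answer = np.array([0.2, -1.0, 0.7, 1.5])
perfect = nrmse(answer, answer)                            # exact estimates
constant = nrmse(np.full(4, answer.mean()), answer)        # guess the mean everywhere
```

Under this variance convention, guessing the mean of the answers for every entry gives an NRMSE of exactly 1, which matches the reference level described for an uninformative estimate.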
3.2 K-value selection
Both BPCA and SVDimpute depend on the number of principal axes (eigenvectors), K, and KNNimpute depends on the number of neighbors, K (see Section 2.6). Since these K-values describe similar parameters, we use the same symbol. In order to measure how the estimation ability depends on the value of K, we applied BPCA, SVDimpute and KNNimpute to the test data sets, (data A + E) and (data I), and calculated NRMSE with various K-values.
Figure 1 shows the results for (data A + E, 5%) and (data I, 5%). BPCA produces better results than KNNimpute or SVDimpute at the optimal K-value for each method. BPCA exhibits its best results with K = D − 1, where D is the number of samples. SVDimpute and BPCA show similar results when K is small, because they employ the same PC regression process. When K is larger, however, BPCA exhibits much better results than SVDimpute. This is due to the ARD prior, whose presence is the main difference between BPCA and SVDimpute. When K = 0, the imputation by BPCA or SVDimpute is identical to that based on gene-wise average, and therefore the results are poor.

Fig. 1. Estimation ability (NRMSE) by BPCA, SVD and KNN with various K-values. (top panel): Application to (data A + E, 5%). (bottom panel): Application to (data I, 5%). [Plot not reproduced; horizontal axis: K, vertical axis: NRMSE from 0 to 1.]
In Figure 1, we see that BPCA exhibits different NRMSE curves in the (data A + E) case and the (data I) case, with respect to the optimal K-value. For (data I), the NRMSE curve becomes almost flat between K = 100 and K = 204, because the principal axes whose eigenvalue indices exceed 100 degenerated so that their lengths became almost zero. For (data A + E), on the other hand, almost none of the axes degenerated, except for the 31st one. From K = 1 to K = 30, therefore, each additional axis improved the estimation ability. Although the improvement by adding the 30th axis was apparently large, we consider this to be a special phenomenon for this data set. This phenomenon implies the importance of setting K = D − 1 if we do not have a priori knowledge on the data set.
Accordingly, we can safely use K = D − 1 for every data set in BPCA. If the effective dimension of the data set is smaller than the K-value, the ARD prior automatically reduces the redundant principal axes. Therefore, in our BPCA method, there is no need to tune the K-value in advance.