
Deep Autoencoder Neural Networks
for Gene Ontology Annotation Predictions
Davide Chicco
Politecnico di Milano
Dipartimento di Elettronica
Informazione Bioingegneria
Milan, Italy
davide.chicco@gmail.com
Peter Sadowski
University of California, Irvine
Dept. of Computer Science
Institute for Genomics and
Bioinformatics
Irvine, CA, USA
peter.j.sadowski@uci.edu
Pierre Baldi
University of California, Irvine
Dept. of Computer Science
Institute for Genomics and
Bioinformatics
Irvine, CA, USA
pfbaldi@ics.uci.edu
ABSTRACT
The annotation of genomic information is a major challenge in biology and bioinformatics. Existing databases of known gene functions are incomplete and prone to errors, and the biomolecular experiments needed to improve these databases are slow and costly. While computational methods are not a substitute for experimental verification, they can help in two ways: algorithms can aid in the curation of gene annotations by automatically suggesting inaccuracies, and they can predict previously-unidentified gene functions, accelerating the rate of gene function discovery. In this work, we develop an algorithm that achieves both goals using deep autoencoder neural networks. With experiments on gene annotation data from the Gene Ontology project, we show that deep autoencoder networks achieve better performance than other standard machine learning methods, including the popular truncated singular value decomposition.
Categories and Subject Descriptors
I.2.6 [Artificial Intelligence]: Learning; J.3 [Life and Medical Sciences]: Biology and Genetics; H.2.8 [Database Applications]: Data mining

Keywords
biomolecular annotations, matrix-completion, autoencoders, neural networks, Gene Ontology, truncated singular value decomposition, principal component analysis

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
BCB'14, September 20-23, 2014, Newport Beach, CA, USA.
Copyright is held by the owner/author(s). Publication rights licensed to ACM.
ACM 978-1-4503-2894-4/14/09 ...$15.00.
http://dx.doi.org/10.1145/2649387.2649442.

1. INTRODUCTION
In bioinformatics, a controlled gene function annotation is a binary matrix associating genes or gene products with functional features from a controlled vocabulary. These annotations are important for effective communication in biomedical research, and lay the groundwork for bioinformatics software tools and data mining investigations. The in vitro biomolecular experiments used to validate gene functions are expensive, so the development of computational methods to identify errors and prioritize new biomolecular experiments is a worthwhile area of research [1].
The Gene Ontology project (GO) is a bioinformatics initiative to characterize all the important features of genes and gene products within a cell [2] [3]. GO is composed of three controlled vocabularies structured as mostly-separate sub-ontologies: biological processes, cellular components, and molecular functions. Each GO sub-ontology is structured as a directed acyclic graph of features (nodes) and ontological relationships (edges). In January 2014, GO contained about 39,000 terms, with more than 25,450 biological process, 3,350 cellular component, and 9,650 molecular function terms. However, GO annotations are constantly being revised and added as new experimental evidence is produced.
One approach to improving gene function annotation databases like GO is to use patterns in the known annotations to predict new annotations. This can be viewed as a matrix-completion problem, in which one attempts to recover a matrix with some underlying structure from noisy observations. Machine learning algorithms have proved very successful in similar applications, such as the famous million-dollar Netflix Prize awarded in 2009. Many machine learning algorithms have already been applied to gene function annotation ([4] [5] [6] [7] [8] [9]), but to the best of our knowledge deep autoencoder neural networks have not. Deep networks with multiple hidden layers have an advantage over shallow machine learning methods in that they can model complex data more efficiently. They have proven their usefulness in fields such as vision and speech recognition, and promise to yield similar performance gains in other machine learning applications with complex underlying structure in the data.
A popular algorithm for matrix completion is the truncated singular value decomposition method (tSVD). Khatri et al. first used this method for GO annotation prediction [10], and one of the authors of this work has extended their method with gene clustering and term-term similarity weights [11] [12]. However, the tSVD method can be viewed as a special linear case of a more general approach using autoencoders [13] [14] [15]. Deep, non-linear autoencoder neural networks have more expressive power, and may be better suited for discovering the underlying patterns in gene function annotation data.

In this paper, we summarize the tSVD and autoencoder methods, show how they can be used to predict annotations, and compare their performance on six separate GO datasets.
2. SYSTEM AND METHODS
In this section we describe the two annotation-prediction algorithms used in this paper: Truncated Singular Value Decomposition and the Autoencoder Neural Network.
2.1 Truncated Singular Value Decomposition
Truncated Singular Value Decomposition (tSVD) [16] is a matrix factorization method that produces a low-rank approximation to a matrix. Define $A_d \in \{0,1\}^{m \times n}$ to be a matrix of annotations. The $m$ rows of $A_d$ correspond to genes, while the $n$ columns correspond to GO features, such that

$$A_d(i, j) = \begin{cases} 1 & \text{if gene } i \text{ is annotated with feature } j, \\ 0 & \text{otherwise.} \end{cases} \quad (1)$$
When features are organized into ontologies, sometimes only the most specific feature is specified, and the more general features (ancestors) are implicit. Thus, in this work we consider a modified matrix $A$ defined as

$$A(i, j) = \begin{cases} 1 & \text{if gene } i \text{ is annotated with feature } j \text{ or with any descendant of } j, \\ 0 & \text{otherwise.} \end{cases} \quad (2)$$
The $i$-th row of the $A$ matrix ($a_i^T$) contains all the direct and indirect annotations of gene $i$. The $j$-th column encodes the list of genes that have been annotated (directly or indirectly) to feature $j$. This process is sometimes defined as annotation unfolding [17].
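To make the unfolding step concrete, here is a minimal sketch in Python/NumPy. It assumes the ontology DAG is supplied as a mapping from each term index to its parent term indices; the `parents` dict and the function name are illustrative only, not part of the paper's implementation:

```python
import numpy as np

def unfold(A_d, parents):
    """Propagate direct annotations (Eq. 1) upward to every ancestor,
    producing the unfolded matrix A of Eq. 2.

    A_d     : (m, n) 0/1 NumPy array of direct gene-to-term annotations
    parents : dict mapping term index j to a list of parent term indices
              (the ontology DAG); assumed acyclic, as in GO
    """
    anc = {}  # memoized ancestor sets

    def ancestors(j):
        if j not in anc:
            s = set()
            for p in parents.get(j, []):
                s.add(p)
                s |= ancestors(p)
            anc[j] = s
        return anc[j]

    A = A_d.copy()
    # for every direct annotation (i, j), also annotate gene i
    # with every ancestor of term j
    for i, j in zip(*np.nonzero(A_d)):
        for a in ancestors(j):
            A[i, a] = 1
    return A
```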
Predictions are produced by computing the SVD of the matrix $A$ and truncating the less-significant singular values. The SVD of the matrix $A$ is given by

$$A = U \Sigma V^T \quad (3)$$

where $U$ is an $m \times m$ unitary matrix (i.e., $U^T U = I$), $\Sigma$ is a non-negative diagonal matrix of size $m \times n$, and $V^T$ is an $n \times n$ unitary matrix (i.e., $V^T V = I$). Conventionally, the entries along the diagonal of $\Sigma$ (the singular values) are sorted in non-increasing order. The number $r \le p$ of non-zero singular values is equal to the rank of the matrix $A$, where $p = \min(m, n)$. For a positive integer $k < r$, the tSVD matrix $\tilde{A}$ is given by

$$\tilde{A} = U_k \Sigma_k V_k^T \quad (4)$$

where $U_k$ ($V_k$) is an $m \times k$ ($n \times k$) matrix achieved by retaining the first $k$ columns of $U$ ($V$), and $\Sigma_k$ is a $k \times k$ diagonal matrix with the $k$ largest singular values along the diagonal. The decomposition of the matrices and the difference between SVD and tSVD are represented in Fig. 1. The matrix $\tilde{A}$ is the optimal rank-$k$ approximation of $A$, i.e., the one that minimizes the norm (either the spectral norm or the Frobenius norm) $\|A - \tilde{A}\|$ subject to the rank constraint.
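As an illustration, Equation (4) can be computed in a few lines with NumPy's SVD routine. This is a generic sketch of the tSVD reconstruction, not the exact code used in the paper; the toy matrix is made up for demonstration:

```python
import numpy as np

def tsvd_predict(A, k):
    """Rank-k tSVD reconstruction of the unfolded annotation matrix A (Eq. 4)."""
    # full_matrices=False gives U: (m, p), s: (p,), Vt: (p, n), p = min(m, n)
    U, s, Vt = np.linalg.svd(A.astype(float), full_matrices=False)
    # keep the k largest singular values (already sorted in non-increasing order)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# toy example: 4 genes x 3 terms
A = np.array([[1, 1, 0],
              [1, 1, 0],
              [0, 1, 1],
              [0, 0, 1]])
A_tilde = tsvd_predict(A, k=2)   # real-valued scores, one per gene-term pair
```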
Figure 1: An illustration of the Singular Value Decomposition (upper green image) and the Truncated SVD reconstruction (lower blue image) of the $A$ matrix. In the classical SVD decomposition, $A \in \{0,1\}^{m \times n}$, $U \in \mathbb{R}^{m \times m}$, $\Sigma \in \mathbb{R}^{m \times n}$, $V^T \in \mathbb{R}^{n \times n}$. In the truncated decomposition, where $k \in \mathbb{N}$ is the truncation level, $U_k \in \mathbb{R}^{m \times k}$, $\Sigma_k \in \mathbb{R}^{k \times k}$, $V_k^T \in \mathbb{R}^{k \times n}$, and the output matrix $\tilde{A} \in \mathbb{R}^{m \times n}$.
The matrix $\tilde{A}$ is real-valued and can be interpreted as a model of the noisy, incomplete observations. It can be used to predict both inaccuracies and missing gene functions: a large value of $\tilde{a}_{ij}$ suggests that gene $i$ should be annotated with term $j$, whereas a value close to zero suggests the opposite. The choice of the truncation parameter $k$ controls the complexity of the model and affects the predictions. Khatri et al. use a fixed value of $k = 500$ in [10] [18] [19], while one of the authors of this paper has developed a new discrete optimization algorithm to select the best truncation level on the basis of the ROC AUCs, described in [20].
To better understand why $\tilde{A}$ can be used to predict gene-to-term annotations, we note that an alternative expression of Equation (4) can be obtained using basic linear algebra manipulations:

$$\tilde{A} = A V_k V_k^T \quad (5)$$
Additionally, the SVD of the matrix $A$ is related to the eigen-decomposition of the symmetric matrices $T = A^T A$ and $G = A A^T$. The columns of $V_k$ ($U_k$) are a set of $k$ eigenvectors corresponding to the $k$ largest eigenvalues of the matrix $T$ ($G$). The matrix $T$ has a simple interpretation in our context. In fact,

$$T(j_1, j_2) = \sum_{i=1}^{m} A(i, j_1) \cdot A(i, j_2) \quad (6)$$
i.e., $T(j_1, j_2)$ is the number of genes annotated with both terms, $j_1$ and $j_2$. Consequently, $T(j_1, j_2)$ indicates the (un-normalized) correlation between term pairs, and it can be interpreted as a similarity score of the terms $j_1$ and $j_2$, the computation of which is based exclusively on the use of these terms in available annotations. The eigenvectors of $T$ (i.e., the columns of $V_k$) are a reduced set of eigen-terms. Intuitively, if two terms co-occur frequently, they are likely to be mapped to the same eigen-term. Based on Equation (5), the $i$-th row of $\tilde{A}$ can be written as

$$\tilde{a}_i^T = a_i^T V_k V_k^T \quad (7)$$

Figure 2: An autoencoder neural network with d
hidden layers. The number of input units is equal to
the number of output units, while there are usually
fewer units in each hidden layer.
Thus, the original annotation profile is first transformed into the eigen-term domain, retaining only the first $k$ eigen-terms through the multiplication with $V_k$, and then mapped back to the original domain by means of $V_k^T$. This corresponds to projecting the original vector $a_i^T$ onto the $k$-dimensional subspace spanned by the columns of $V_k$.
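The identities above are easy to verify numerically. The following sketch, reusing the toy matrix from the earlier NumPy example, checks that Equations (4) and (5) give the same reconstruction and builds the term-term co-occurrence matrix $T$ of Equation (6):

```python
import numpy as np

# same toy annotation matrix as above: 4 genes x 3 terms
A = np.array([[1, 1, 0],
              [1, 1, 0],
              [0, 1, 1],
              [0, 0, 1]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
Vk = Vt[:k, :].T                       # (n, k); its columns are the eigen-terms

# Equation (4) and Equation (5) give the same rank-k reconstruction
lhs = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
rhs = A @ Vk @ Vk.T
assert np.allclose(lhs, rhs)

# Equation (6): T[j1, j2] = number of genes annotated with both terms
T = A.T @ A
```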
2.2 Autoencoder Neural Network
An autoencoder is a feed-forward artificial neural network with the same input and target output. A small hidden layer in an autoencoder network creates an information bottleneck, forcing the network to compress the data into a low-dimensional representation. As with the tSVD method, this model of the data can be used to make predictions.

For a simple autoencoder with a single hidden layer, the vector of hidden unit activities, $h$, is given by

$$h = f(W_e \cdot a + bias_e) \quad (8)$$

where $f$ is the activation function (we use the logistic sigmoid function in this work), $W_e$ is a parameter matrix, and $bias_e$ is a vector of bias parameters. The hidden representation of the data is then mapped back into the space of $a$ using the decoding function

$$\hat{a} = f(W_d \cdot h + bias_d) \quad (9)$$

where $W_d$ is the decoding matrix and $bias_d$ is a vector of bias parameters. We learn the parameters of the autoencoder by performing stochastic gradient descent to minimize the reconstruction error, the MSE between $a$ and $\hat{a}$:

$$MSE(a, \hat{a}) = \|a - \hat{a}\|_2^2 = \|a - (W_d \cdot h + bias_d)\|_2^2 \quad (10)$$
When the hidden layer has fewer dimensions than $a$, the autoencoder learns a compressed representation of the training data. In fact, an autoencoder with $k$ linear hidden units will learn to project the data onto its first $k$ principal components, and the decoded data matrix is exactly the tSVD matrix with the top $k$ singular values [14]. Non-linear hidden units allow an autoencoder to learn more complex encoding functions, as do additional hidden layers.

As in the tSVD approach, the matrix $A$ is an array of $m$ gene profiles with $n$ possible features defined in Equation 2, such that gene profile $a_i$ is the $i$-th row of $A$. An autoencoder is trained to learn these gene profiles and produces a prediction matrix $\tilde{A}$ as described in Fig. 3.
Given the input matrix $A \in \{0,1\}^{m \times n}$, where rows and columns correspond to genes and features, respectively:

1. Fix a number $h$ of hidden units ($h \in \mathbb{N}$, $h < m$) and a number $d$ of hidden layers ($d \in \{1, \ldots, maxhl\}$).
2. Training: for each gene profile $a_i$ of $A$, where $i \in [1, m]$:
   (a) for each training iteration:
      i. for each of the $d$ hidden layers: compute the hidden activation $h_i$ from the input $a_i$ (Equation 8)
      ii. compute the reconstructed output $\hat{a}_i$ from the hidden activation $h_i$ (Equation 9)
      iii. compute the error gradient (Equation 10)
      iv. back-propagate the error gradient to update the weight parameters
3. Testing: for each gene profile $a_i$ of $A$, where $i \in [1, m]$:
   (a) autoencode $a_i$ and produce $\hat{a}_i$
   (b) set $\hat{a}_i$ as the $i$-th row of the output matrix $\tilde{A}$

Figure 3: Overview of the autoencoder neural network algorithm.
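The procedure in Fig. 3 can be sketched in plain NumPy. The original experiments used Torch7 (Section 2.4); the single-hidden-layer reimplementation below only illustrates Equations (8)-(10) with the stated initialization, learning rate, and L2 regularization, and the default hyper-parameter values (`h`, `l2`) are placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class Autoencoder:
    """Single-hidden-layer autoencoder implementing Equations (8)-(10)."""

    def __init__(self, n, h, lr=0.01, l2=1e-4):      # l2 strength is a placeholder
        # Section 2.4: weights initialized uniformly from [0, 1]
        self.We = rng.uniform(0.0, 1.0, (h, n))      # encoder weights
        self.be = np.zeros(h)
        self.Wd = rng.uniform(0.0, 1.0, (n, h))      # decoder weights
        self.bd = np.zeros(n)
        self.lr, self.l2 = lr, l2

    def forward(self, a):
        h = sigmoid(self.We @ a + self.be)           # Eq. (8)
        a_hat = sigmoid(self.Wd @ h + self.bd)       # Eq. (9)
        return h, a_hat

    def sgd_step(self, a):
        h, a_hat = self.forward(a)
        # gradient of the squared error (Eq. 10) through the output sigmoid
        d_out = 2.0 * (a_hat - a) * a_hat * (1.0 - a_hat)
        d_hid = (self.Wd.T @ d_out) * h * (1.0 - h)
        self.Wd -= self.lr * (np.outer(d_out, h) + self.l2 * self.Wd)
        self.bd -= self.lr * d_out
        self.We -= self.lr * (np.outer(d_hid, a) + self.l2 * self.We)
        self.be -= self.lr * d_hid

def autoencoder_predict(A, h=50, iters=25):
    """Train on the rows of A (Fig. 3, Training), then reconstruct every
    row to build the prediction matrix (Fig. 3, Testing)."""
    m, n = A.shape
    ae = Autoencoder(n, h)
    for _ in range(iters):
        for a in A.astype(float):
            ae.sgd_step(a)
    return np.vstack([ae.forward(a)[1] for a in A.astype(float)])
```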
2.3 Predictions
The tSVD and autoencoder both provide a prediction matrix $\tilde{A}$ of real values, with larger values indicating a higher predicted likelihood. For an ROC curve analysis, only the relative ordering of these predictions is relevant. To make binary predictions, we set a threshold $\tau$ such that $\tilde{A}(i, j) > \tau$ is interpreted as a prediction that gene $i$ should be annotated with feature $j$.
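In code, both the thresholding rule and the ranking of candidate annotations might look as follows. This is a sketch under the assumption that `A` is the unfolded 0/1 matrix and `A_tilde` the prediction matrix from either method:

```python
import numpy as np

def binary_predictions(A_tilde, tau):
    """Binarize the prediction matrix with threshold tau (Section 2.3)."""
    return A_tilde > tau

def top_candidates(A, A_tilde, n_top=100):
    """Rank gene-term pairs that are 0 in A by predicted score, highest first;
    these are the candidate new annotations."""
    scores = np.where(A == 0, A_tilde, -np.inf)    # ignore known annotations
    order = np.argsort(scores, axis=None)[::-1][:n_top]
    return [tuple(np.unravel_index(ix, A.shape)) for ix in order]
```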
2.4 Autoencoder Training Details
Autoencoder neural networks were trained with the free GPU-accelerated software package Torch7 [21], using stochastic gradient descent with a learning rate of 0.01 for 25 iterations. L2 regularization was applied to all weights, which were initialized randomly from the uniform distribution over [0, 1]. The hidden unit activation function is the logistic sigmoid.
2.5 Datasets
The GO database contains annotation datasets for a variety of species, and for each of the three GO sub-ontologies: Biological Processes (BP), Molecular Functions (MF), and Cellular Components (CC). We focused on the Bos taurus (cattle) and Gallus gallus (red junglefowl) gene sets, which are available from the Genomic and Proteomic Data Warehouse (GPDW) [22] [23]. We use the July 2009 version of the datasets for analyzing and selecting hyper-parameters, and the March 2013 version for comparing prediction algorithms. Table 1 describes the size and number of annotations in each version. We exclude all annotations that are flagged as IEA (inferred from electronic annotation) or ND (no biological data available), and all feature terms and genes that do not appear in both dataset versions.
Each sub-ontology is rooted at a term bearing the sub-ontology name (BP, CC, or MF). In January 2014, GO contained about 39,000 terms describing gene and gene product features, with more than 25,450 BP, 9,650 MF, and 3,350 CC terms. However, these annotations are far from complete, and new annotations are added regularly; over a third of the biological process annotations have been added within the last four years.
3. RESULTS AND DISCUSSION
We perform two separate experiments. First, we analyze the effects of hyper-parameters for both the tSVD and autoencoder algorithms on a validation set created by holding out (removing) 10% of the annotations from the July 2009 database; then we test the prediction algorithms on new annotations that were added in the 2013 version. In both cases, the goal is to identify missing annotations within the large set of negative training examples. Fig. 4 visually describes the analysis procedure.
Figure 4: A flowchart of our analysis, with the hyper-parameter selection and validation procedure on the left, and the test procedure on the right. A rounded rectangle represents an operation, repeated in a cycle if attached to a sharp rectangle. A parallelogram represents an output production step, and a cylinder represents an interaction with the database.
3.1 Hyper-Parameter Analysis
For tSVD, the number of singular values is a hyper-parameter that determines the rank of the final prediction matrix, and it is usually chosen through cross-validation. In an autoencoder network, the analogous hyper-parameter is the number of hidden units. These hyper-parameters control the complexity of the model; keeping a large number of singular values or using a large number of hidden units yields a very accurate reconstruction of the input data matrix, but overfits to noise, such as missing annotations and inaccuracies. Figure 5 and Fig. 6 show that there is often an optimal hyper-parameter of this type. The best hyper-parameters for each dataset are shown in Table 2.

The curves for each type of sub-ontology behave similarly. For the Cellular Component annotation datasets, the autoencoder algorithm always outperforms tSVD, regardless of the number of singular values. For the Molecular Function datasets, the autoencoder and tSVD have similar AUCs when the number of singular values is in the range [20, 50], while autoencoder networks outperform tSVD in the other intervals. For the Biological Process datasets, the autoencoders outperform tSVD only when using the maximum possible number of hidden units.
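A hyper-parameter sweep of this kind can be sketched as follows, using scikit-learn's `roc_auc_score` (an assumed dependency). The hold-out protocol (hide 10% of the 1-entries, score how well they are recovered against the remaining zeros) follows the description above, though the exact masking details of the paper's experiments may differ:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def holdout_auc(A, reconstruct, ks, frac=0.1, seed=0):
    """Hide `frac` of the 1-entries of A, reconstruct with each setting in `ks`,
    and score how well the hidden annotations are recovered (ROC AUC).
    `reconstruct(A_train, k)` can be any method above (tSVD, autoencoder)."""
    rng = np.random.default_rng(seed)
    ones = np.argwhere(A == 1)
    held = ones[rng.choice(len(ones), size=int(frac * len(ones)), replace=False)]
    A_train = A.copy()
    A_train[held[:, 0], held[:, 1]] = 0

    # labels: held-out 1s (positives) vs. the true 0s of A; known 1s excluded
    mask = (A_train == 0)
    y_true = A[mask]
    return {k: roc_auc_score(y_true, reconstruct(A_train, k)[mask]) for k in ks}

# e.g. holdout_auc(A, tsvd_predict, ks=[10, 50, 90])
#      holdout_auc(A, autoencoder_predict, ks=[100, 300, 465])
```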
3.2 Predictive Accuracy
We test the tSVD and autoencoder algorithms on a set of annotations added to the database between July 2009 and March 2013. Training and testing were performed on the unfolded matrices described in Equation 2, to eliminate the possibility of trivial predictions. The performance metric is the percentage of the top 100 predictions from each method that were added to the database during this period. The results are displayed in Table 3, along with results from four other state-of-the-art algorithms from the computational gene annotation literature:
1. tSVD with gene clustering (SIM1) [24] [25]
2. tSVD with gene clustering and term-term similarity
weights (SIM2) [24] [25]
3. Probabilistic Latent Semantic Analysis (pLSA) [26]
4. Latent Dirichlet Allocation (LDA) [27]
Overall, the tSVD-based techniques (tSVD, SIM1, SIM2) achieve similar performance, and LDA appears comparable to these methods. The pLSA algorithm performs slightly better than these methods on most of the datasets, and the autoencoder networks are consistently the best. The autoencoder networks improve performance by +6% to +36% with respect to the second-best method.
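The top-100 metric used in Table 3 can be expressed directly in terms of the two database versions. A sketch, reusing `top_candidates` from the Section 2.3 snippet, and assuming the gene and term indices of the two matrices are already aligned:

```python
def top100_hit_rate(A_2009, A_2013, A_tilde, n_top=100):
    """Fraction of the n_top highest-scoring candidate annotations
    (gene-term pairs that are 0 in the 2009 matrix) that appear as 1
    in the 2013 matrix."""
    candidates = top_candidates(A_2009, A_tilde, n_top)
    hits = sum(A_2013[i, j] == 1 for (i, j) in candidates)
    return hits / n_top
```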
3.3 Novel Predictions
We examine the predicted annotations with the highest likelihood scores that are not already present in the GO database. Many of the predicted annotations are rather obvious high-level descriptive features such as cellular process, so we list the three interesting predictions with the highest likelihood in Table 4, where we define an interesting prediction as an annotation whose distance from the root node of the ontology tree is greater than two.
4. CONCLUSIONS
Gene function annotation databases are an essential tool in biomedical research, yet existing databases are incomplete and contain inaccuracies. In this work, we have shown that deep autoencoder neural networks achieve better performance on GO annotation prediction than other standard machine learning methods, including the popular truncated singular value decomposition. The approach has numerous advantages: (1) autoencoders can be trained online with very large datasets, (2) they can be trained quickly using graphics processors, and (3) the number and size of the hidden layers provide an easy way of controlling the complexity of the model. Future work will address advantages and issues related to the application of the same methods to the prediction of multi-terminologies, not only annotations.

Table 1: Quantitative characteristics of the considered annotation datasets in the July 2009 database version versus the March 2013 database version used for testing. Numbers do not include annotations inferred from electronic annotations (IEA), those for which no biological data is available (ND), obsolete terms, or obsolete genes. #gs is the number of genes; #fs is the number of biological function features; #as is the number of annotations; ∆ is the difference in the number of annotations of the #gs genes and the #fs features between the two database versions, and ∆% is the percentage difference.

                          July 2009          March 2013      #as comparison
Dataset             #gs    #fs     #as         #as             ∆        ∆%
Bos taurus CC       497     493    8,003        9,683         1,680    20.99%
Bos taurus MF       543     856    4,295        6,394         2,099    48.87%
Bos taurus BP       512   2,719   17,145       27,075         9,930    57.92%
Gallus gallus CC    260     344    3,717        3,798            81     2.18%
Gallus gallus MF    309     501    2,358        2,654           256    10.86%
Gallus gallus BP    275   1,824    8,350       11,984         3,634    43.52%
Figure 5: AUC values for the tSVD and autoencoder predictions with different hyper-parameter choices (number of singular values and number of hidden units, respectively) for Bos taurus Cellular Components (5a), Molecular Functions (5b), and Biological Processes (5c). For comparison purposes, we use an autoencoder with a single hidden layer.

Figure 6: AUC values for the tSVD and autoencoder predictions with different hyper-parameter choices (number of singular values and number of hidden units, respectively) for Gallus gallus Cellular Components (6a), Molecular Functions (6b), and Biological Processes (6c). For comparison purposes, we use an autoencoder with a single hidden layer.
Table 2: Hyper-parameters were optimized separately for each algorithm and dataset. We select the number k of singular values for tSVD; the number of clusters c for the SIM1 and SIM2 methods as described in [24]; the number of topics t in pLSA as described in [26]; the number of topics t in LDA as described in [27]; and the number of hidden units h in each of d hidden layers for the autoencoder (AE) algorithm.

                   tSVD   SIM   pLSA/LDA       AE
Dataset              k     c        t       h      d
Bos taurus CC       90     3       12      465     2
Bos taurus MF       71     3       13      302     3
Bos taurus BP      241     5      112      500     2
Gallus gallus CC    51     3       25      258     3
Gallus gallus MF    41     2       74      271     3
Gallus gallus BP   111     3      126      253     2

References
- M. Ashburner et al. Gene Ontology: tool for the unification of biology. Nature Genetics, 2000.
- G. H. Golub and C. Reinsch. Singular value decomposition and least squares solutions. Numerische Mathematik, 1970.
- R. Collobert, K. Kavukcuoglu, and C. Farabet. Torch7: A Matlab-like Environment for Machine Learning. NIPS BigLearn Workshop, 2011.
- P. Baldi and K. Hornik. Neural networks and principal component analysis: learning from examples without local minima. Neural Networks, 1989.
- H. Bourlard and Y. Kamp. Auto-association by multilayer perceptrons and singular value decomposition. Biological Cybernetics, 1988.