Sun et al. J Cheminform (2017) 9:17
DOI 10.1186/s13321-017-0203-5
DATABASE
ExCAPE-DB: an integrated large scale dataset facilitating Big Data analysis in chemogenomics
Jiangming Sun1*, Nina Jeliazkova2, Vladimir Chupakhin3, Jose-Felipe Golib-Dzib4, Ola Engkvist1, Lars Carlsson1, Jörg Wegner3, Hugo Ceulemans3, Ivan Georgiev2, Vedrin Jeliazkov2, Nikolay Kochev2,5, Thomas J. Ashby6 and Hongming Chen1*
Abstract
Chemogenomics data generally refers to the activity data of chemical compounds on an array of protein targets and represents an important source of information for building in silico target prediction models. The increasing volume of chemogenomics data offers exciting opportunities to build models based on Big Data. Preparing a high quality data set is a vital step in realizing this goal, and this work aims to compile such a comprehensive chemogenomics dataset. This dataset comprises over 70 million SAR data points from publicly available databases (PubChem and ChEMBL), including structure, target information and activity annotations. Our aspiration is to create a useful chemogenomics resource reflecting industry-scale data, not only for building predictive models of in silico polypharmacology and off-target effects but also for the validation of cheminformatics approaches in general.
Keywords: Big Data, Bioactivity, Chemogenomics, Chemical structure, Molecular fingerprints, Search engine, QSAR
© The Author(s) 2017. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

*Correspondence: Jiangming.Sun@astrazeneca.com; hongming.chen@astrazeneca.com
1 Discovery Sciences, Innovative Medicines and Early Development Biotech Unit, AstraZeneca R&D Gothenburg, 43183 Mölndal, Sweden
Full list of author information is available at the end of the article
Background
In pharmacology, "Big Data" on protein activity and gene expression perturbations has grown rapidly over the past decade thanks to the tremendous development of proteomics and genome sequencing technology [1, 2]. Similarly, there has also been a remarkable increase in the amount of available compound structure–activity relationship (SAR) data, contributed mainly by the development of high throughput screening (HTS) technologies and combinatorial chemistry for compound synthesis [3]. These SAR data points represent an important resource for chemogenomics modelling, a computational strategy in drug discovery that investigates the interaction of a large set of compounds (one or more libraries) against families of functionally related proteins [4].

Frequently, "Big Data" in chemogenomics refers to large databases recording the bioactivity annotation of chemical compounds against different protein targets. Databases such as PubChem [5], BindingDB [6] and ChEMBL [7] are examples of large public domain repositories of this kind of information. PubChem is a well-known public repository for storing small molecules and their biological activity data [5, 8]. It was originally started as a central repository of HTS experiments for the National Institutes of Health (USA) Molecular Libraries Program, but nowadays also incorporates data from other sources. ChEMBL contains data that was manually extracted from numerous peer-reviewed journal articles, as do WOMBAT [9], BindingDB [6] and CARLSBAD [10]. Similarly, commercial databases such as SciFinder [11], GOSTAR [12] and Reaxys [13] have accumulated a large amount of data from publications as well as patents. Besides these sources, large pharmaceutical companies maintain their own data collections originating from in-house HTS screening campaigns and drug discovery projects.
This data serves as a valuable source for building in silico models for predicting polypharmacology and off-target effects, and for benchmarking the prediction performance and computation speed of machine-learning algorithms. The aforementioned publicly available databases have been widely used in numerous cheminformatics studies [14–16]. However, the curated data are quite heterogeneous [17] and lack a standard way of annotating biological endpoints, mode of action and target identifiers. There is an urgent need to create an integrated data source with a standardized form for chemical structure, activity annotation and target identifier, covering as large a chemical and target space as possible. There are also irregularities within databases: the public screening data in PubChem, especially the inactive data points, are spread over different assay entries uploaded by data providers from around the world and cannot be directly compared without processing. This makes curating SAR data for quantitative structure–activity relationship (QSAR) modelling very tedious. An example of work to synthesize the curated and uncurated data is Mervin et al. [15], where a dataset with ChEMBL active compounds and PubChem inactive compounds was constructed, including inactive compounds for homologous proteins. However, that dataset can only be accessed as a plain text file, not as a searchable database.
In this work, by combining active and inactive compounds from both PubChem and ChEMBL, we created an integrated dataset for cheminformatics modelling purposes to be used in the ExCAPE [18] (Exascale Compound Activity Prediction Engine) Horizon 2020 project. ExCAPE-DB, a searchable open access database, was established for sharing the dataset. It will serve as a data hub, giving researchers around the world easy access to a publicly available standardized chemogenomics dataset, with the data and accompanying software available under open licenses.
Dataset curation
The standardized ChEMBL20 data was extracted from the in-house database ChemistryConnect [3], and PubChem data was downloaded in January 2016 from the PubChem website (https://pubchem.ncbi.nlm.nih.gov/) using the REST API. Both data sources are heterogeneous, so data cleaning and standardisation procedures were applied in preparing both the chemical structures and the bioactivity data.
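The text above only states that the PubChem REST API was used; the exact endpoints are not given. As a minimal illustrative sketch, PubChem's PUG REST interface can be used to pull a single bioassay record; the AID below is an arbitrary example, not one from the paper's pipeline.

```python
# Illustrative sketch only: the paper does not specify which PUG REST
# endpoints were used. This fetches the description record of one assay.
import requests

BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"

def fetch_assay_description(aid: int) -> dict:
    """Fetch the JSON description record of a PubChem bioassay."""
    url = f"{BASE}/assay/aid/{aid}/description/JSON"
    resp = requests.get(url, timeout=60)
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    record = fetch_assay_description(1000)  # hypothetical example AID
    print(list(record.keys()))
```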
Chemical structure standardisation
Standardisation of the PubChem and ChEMBL chemical structures was performed with ambitcli version 3.0.2. The ambitcli tool is part of the AMBIT cheminformatics platform [19–21] and relies on The Chemistry Development Kit library 1.5 [22, 23]. It includes a number of chemical structure processing options (fragment splitting, isotope removal, handling of implicit hydrogens and stereochemistry, InChI [24] generation, SMILES [25] generation, structure transformation via SMIRKS [26], tautomer generation and neutralisation, etc.). The details of the structure processing procedure can be found in Additional file 1. All standardisation rules were aligned between Janssen Pharmaceutica, AstraZeneca and IDEAConsult to reflect industry standards and implemented in open source software (https://doi.org/10.5281/zenodo.173560).
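The pipeline itself uses ambitcli (AMBIT/CDK) with the aligned rule set from Additional file 1. Purely as an analogous sketch, not the authors' tool or rules, the same kinds of steps (largest-fragment selection, neutralisation, and InChI/InChIKey/SMILES generation) look roughly like this in RDKit:

```python
# Analogous illustration only: the paper uses ambitcli (AMBIT/CDK), not RDKit,
# and applies a considerably larger, industry-aligned rule set.
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize

largest = rdMolStandardize.LargestFragmentChooser()
uncharger = rdMolStandardize.Uncharger()

def standardize(smiles):
    """Return canonical SMILES, InChI and InChIKey for the neutralised parent."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    mol = largest.choose(mol)       # fragment splitting: keep the largest fragment
    mol = uncharger.uncharge(mol)   # simple charge neutralisation
    return {
        "smiles": Chem.MolToSmiles(mol),
        "inchi": Chem.MolToInchi(mol),
        "inchikey": Chem.MolToInchiKey(mol),
    }

print(standardize("CC(=O)Oc1ccccc1C(=O)[O-].[Na+]"))  # aspirin sodium salt
```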
Bioactivity data standardisation
The processing protocol for extracting and standardizing bioactivity data is shown in Fig. 1. First, bioassays were restricted to those comprising a single target; black-box (target unknown) and multi-target assays were excluded. 58,235 and 92,147 single-target, concentration–response (CR) type assays (confirmatory type in PubChem) remained in PubChem and ChEMBL, respectively. The assay targets were further limited to human, rat and mouse species, and data points missing a compound identifier (CID) were removed. For the filtered assays, compounds whose dose–response value was equal to or lower than 10 μM were kept as active entries and others were removed. Inactive compounds in CR assays were kept as inactive entries. Compounds labelled as inactive in PubChem screening assays (assays run at a single concentration) were also kept as inactive records.

Fig. 1 Workflow for data preparation
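Since 10 µM = 1e-5 M, the activity cutoff corresponds to pXC50 ≥ 5 (pXC50 being the negative log10 of the molar XC50), matching the threshold quoted for Fig. 3. A schematic pandas rendering of this filtering step follows; all column names and label conventions are hypothetical, since the paper does not publish its internal schema.

```python
# Schematic only: column names ("cid", "n_targets", "tax_id", "assay_type",
# "xc50_uM", "outcome") are assumptions. 10 uM == 1e-5 M, i.e. pXC50 = 5.
import pandas as pd

ALLOWED_TAXA = {9606, 10116, 10090}  # NCBI taxonomy IDs: human, rat, mouse

def filter_bioactivity(points: pd.DataFrame) -> pd.DataFrame:
    df = points.dropna(subset=["cid"])            # require a compound identifier
    df = df[df["n_targets"] == 1]                 # single-target assays only
    df = df[df["tax_id"].isin(ALLOWED_TAXA)]      # human, rat, mouse
    cr = df[df["assay_type"] == "confirmatory"]   # concentration-response assays
    actives = cr[cr["xc50_uM"] <= 10].assign(label="A")
    cr_inactives = cr[cr["outcome"] == "inactive"].assign(label="N")
    screen_inactives = df[(df["assay_type"] == "screening")
                          & (df["outcome"] == "inactive")].assign(label="N")
    return pd.concat([actives, cr_inactives, screen_inactives], ignore_index=True)
```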
The chemical structure identifiers (InChI, InChIKey and SMILES) generated from the standardized compound structures (as explained above) were joined with the compounds obtained after the filtering procedure. The compound set was further filtered by the following physicochemical properties: an organic filter (compounds without metal atoms), molecular weight (MW) < 1000 Da, and number of heavy atoms (HEV) > 12. This was done to remove small or inorganic compounds that are not representative of the chemical space relevant for a typical drug discovery project. This is a much more generous rule than the Lipinski rule-of-five [27], but the aim was to keep as much useful chemical information as possible while still removing some non-drug-like compounds.
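As a rough RDKit illustration of this physicochemical filter (organic compounds only, MW < 1000 Da, more than 12 heavy atoms); the paper applies these rules inside its AMBIT/CDK pipeline, and the organic element whitelist below is a simplifying assumption:

```python
# Rough sketch: the element whitelist approximates "compounds without metal
# atoms"; the paper's exact organic filter definition is not reproduced here.
from rdkit import Chem
from rdkit.Chem import Descriptors

ORGANIC_ELEMENTS = {"H", "B", "C", "N", "O", "F", "P", "S", "Cl", "Br", "I", "Se", "Si"}

def passes_property_filter(mol):
    if any(atom.GetSymbol() not in ORGANIC_ELEMENTS for atom in mol.GetAtoms()):
        return False                          # organic filter: no metal atoms
    if Descriptors.MolWt(mol) >= 1000:        # keep MW < 1000 Da
        return False
    return mol.GetNumHeavyAtoms() > 12        # keep HEV > 12

print(passes_property_filter(Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")))  # aspirin: True
```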
Finally, fingerprint descriptors were generated for all remaining compounds: JCompoundMapper (JCM) [28], CDK circular fingerprint and signature descriptors [29] were generated, respectively. For the circular fingerprint and signature calculation, the maximum topological radius for fragment generation was set to 3.
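The paper computes these descriptors with CDK and JCompoundMapper. As an analogous, not identical, illustration, here is an RDKit Morgan (circular) fingerprint with the same maximum radius of 3:

```python
# Analogous illustration: RDKit's Morgan fingerprint is a circular fingerprint
# comparable in spirit (not bit-for-bit) to the CDK circular fingerprint used.
from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin
fp = AllChem.GetMorganFingerprintAsBitVect(mol, 3, nBits=2048)
print(fp.GetNumOnBits(), "of", fp.GetNumBits(), "bits set")
```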
From each data source, various attributes were read and converted into controlled vocabularies; the most important of these are target (Entrez ID), activity value, mode of action, assay type and assay technology. The underlying data sources contain activity data with various result types; the results were unified as best as possible to make them comparable across tests (and data sources) irrespective of the original result type. The selected compatible dose–response result types are listed in Additional file 2: Table S1. Generally, the endpoint name of a concentration-related assay (e.g. IC50, units in µM) should match one of the keywords in this list. When a compound has multiple activity records for the same target, the records are aggregated so that each compound has only one record per target, and the best (maximal) potency is chosen as the final aggregated value for the compound–target pair. The AMBIT-generated InChIKey from the standardisation procedure was used as the molecular identifier to identify duplicate structures during data aggregation. Finally, targets with fewer than 20 active compounds were removed from the final dataset.
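A schematic pandas version of this aggregation rule follows: deduplicate on the standardised InChIKey per target, keep the best (maximal) pXC50 per compound–target pair, then drop targets with fewer than 20 actives. The column names are hypothetical.

```python
# Schematic only: "inchikey", "entrez_id", "pxc50" and "label" are assumed
# column names, not the paper's published schema.
import pandas as pd

def aggregate_sar(points: pd.DataFrame) -> pd.DataFrame:
    # Keep the highest pXC50 (best potency) per compound-target pair.
    best = (points.sort_values("pxc50", ascending=False)
                  .drop_duplicates(subset=["inchikey", "entrez_id"]))
    # Remove targets with fewer than 20 active compounds.
    actives_per_target = (best[best["label"] == "A"]
                          .groupby("entrez_id")["inchikey"].nunique())
    kept_targets = actives_per_target[actives_per_target >= 20].index
    return best[best["entrez_id"].isin(kept_targets)]
```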
Entrez ID [30], gene symbol [31–33] and gene orthologue were collected as target information. The gene symbol was converted from the Entrez ID with the gene2accession table [34] provided by the National Center for Biotechnology Information (NCBI). Gene orthologues were included from the NCBI orthologue table [34].
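As a small sketch of an Entrez GeneID to official symbol mapping: the paper cites NCBI's gene2accession table, whereas the snippet below reads the companion gene_info file (whose Symbol column carries the mapping directly). Using gene_info here is a convenience assumption, not the authors' exact procedure.

```python
# Sketch: build a GeneID -> Symbol dict from NCBI's gene_info dump
# (ftp.ncbi.nlm.nih.gov/gene/DATA/gene_info.gz). The paper itself cites the
# gene2accession table; gene_info is used here as a convenient equivalent.
import pandas as pd

def load_symbol_map(path: str = "gene_info.gz") -> dict:
    cols = ["#tax_id", "GeneID", "Symbol"]
    df = pd.read_csv(path, sep="\t", usecols=cols, compression="gzip")
    df = df[df["#tax_id"].isin([9606, 10116, 10090])]  # human, rat, mouse
    return dict(zip(df["GeneID"], df["Symbol"]))
```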
Database andweb interface
e ExCAPE-DB is built based on the AMBIT database
and web application [
19], enhanced with a free text search
engine (Apache Solr [
35]). An instance of the AMBIT web
application (ambit2.war) was installed and the chemi
-
cal structures were imported. is enables chemistry-
aware search (similarity, substructure) and depiction, all
exposed via a REST API and the web interface provided
by the web application itself. e bioactivity data, consist
-
ing of compound related information (e.g. target activity
label and InChIKey) and target related information (e.g.
Entrez IDs and official gene symbols), is imported into an
Apache Solr collection (
http://lucene.apache.org/solr/)
and exposed through the Solr REST API. e open source
JavaScript client library jToxKit (
https://github.com/
ideaconsult/jToxKit
) is used to interact with the AMBIT
REST API and the Solr REST API. A dedicated JavaScript
web interface was developed for ExCAPE-DB, integrating
the chemical search, as well as the free text and faceted
search functionality for biological activities.
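Since the data is exposed through the Solr REST API, it can in principle be queried directly rather than through jToxKit. The sketch below uses Solr's standard select handler; both the exact endpoint layout and the field name ("Gene_Symbol") are guesses for illustration, as the schema is not spelled out in the text.

```python
# Hedged sketch of a direct Solr query. The collection path and field name
# below are assumptions; only the host comes from the paper.
import requests

SOLR_SELECT = "https://solr.ideaconsult.net/solr/excape/select"  # assumed layout

params = {"q": "Gene_Symbol:EGFR", "wt": "json", "rows": 10}
resp = requests.get(SOLR_SELECT, params=params, timeout=60)
resp.raise_for_status()
print(resp.json()["response"]["numFound"], "matching entries")
```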
ExCAPE-DB is available online (https://solr.ideaconsult.net/search/excape/) and a screenshot of the web browser interface is shown in Fig. 2a. The dataset can be searched by both target name and CID. For target-based searches, the Entrez ID, gene symbol, gene orthologue group and target species can be used for subsetting datasets. For compound searches, a user can either input the InChIKey or specify a CID (SMILES, InChI or IUPAC chemical name) for a free-text search, or use the embedded structure editor for a substructure or similarity search (Fig. 2b). It is also possible to follow a link from the search result to the original ChEMBL or PubChem page of the specific compound. The download tab on the web page provides several download options. The "Filtered entries" option allows downloading the complete current search result. For downloading specific entries, it is possible to use the "Add to selection" links to compile a subset of selected entries, which then becomes available for download as "Selected entries". A static link for downloading the entire ExCAPE-DB dataset is available on the download tab. The dataset is also uploaded to the Zenodo.org repository and is available for download from there as doi:10.5281/zenodo.173258.

Fig. 2 Browsing the ExCAPE-DB web interface. a Searching the database via gene symbol or free-text; the original compound information is linked from the result page. b Searching the database via substructure search
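The Zenodo deposit can also be fetched programmatically through Zenodo's public records API; the record ID below is derived from the DOI above, and the file layout inside the record is not guaranteed by the paper, so the sketch simply lists whatever files the API reports.

```python
# Hedged sketch: list the files of the Zenodo deposit (record 173258, from
# doi:10.5281/zenodo.173258) via Zenodo's public records API.
import requests

resp = requests.get("https://zenodo.org/api/records/173258", timeout=60)
resp.raise_for_status()
for f in resp.json().get("files", []):
    print(f.get("key"), f.get("links", {}).get("self"))
```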
Discussion
The dataset composition is described in Table 1. In total there are 998,131 unique compounds and 70,850,163 SAR data points. These SAR data points cover 1667 targets.
Table 1 Public chemogenomics dataset

                       ChEMBL       PubChem       ExCAPE-DB
Actives
  # SAR data points    1,259,338    439,288       1,332,426
  # Compounds          566,143      263,119       593,156
Inactives
  # SAR data points    1,530,908    68,948,609    69,517,737
  # Compounds          416,655      654,562       719,192
Total
  # SAR data points    2,790,246    69,387,897    70,850,163
  # Compounds          710,324      828,317       998,131
  # Targets            1644         1588          1667
Fig. 3 Composition of active compounds in the dataset. The distribution of active compounds among the targets in a ExCAPE-DB, b the ChEMBL part of ExCAPE-DB and c the fraction span of actives in both datasets. We note that the ChEMBL dataset is shown here before the filtering and aggregation process and contains only single-target assays. Active compounds should have a pXC50 of no less than 5, and only targets with at least 20 active compounds were considered.
