Sun et al. J Cheminform (2017) 9:17
DOI 10.1186/s13321-017-0203-5
DATABASE
ExCAPE-DB: an integrated large scale dataset facilitating Big Data analysis in chemogenomics
Jiangming Sun1*, Nina Jeliazkova2, Vladimir Chupakhin3, Jose-Felipe Golib-Dzib4, Ola Engkvist1, Lars Carlsson1, Jörg Wegner3, Hugo Ceulemans3, Ivan Georgiev2, Vedrin Jeliazkov2, Nikolay Kochev2,5, Thomas J. Ashby6 and Hongming Chen1*
Abstract
Chemogenomics data generally refers to the activity data of chemical compounds on an array of protein targets and represents an important source of information for building in silico target prediction models. The increasing volume of chemogenomics data offers exciting opportunities to build models based on Big Data. Preparing a high quality data set is a vital step in realizing this goal, and this work aims to compile such a comprehensive chemogenomics dataset. This dataset comprises over 70 million SAR data points from publicly available databases (PubChem and ChEMBL), including structure, target information and activity annotations. Our aspiration is to create a useful chemogenomics resource reflecting industry-scale data, not only for building predictive models of in silico polypharmacology and off-target effects but also for the validation of cheminformatics approaches in general.
Keywords: Big Data, Bioactivity, Chemogenomics, Chemical structure, Molecular fingerprints, Search engine, QSAR
© The Author(s) 2017. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

*Correspondence: Jiangming.Sun@astrazeneca.com; hongming.chen@astrazeneca.com
1 Discovery Sciences, Innovative Medicines and Early Development Biotech Unit, AstraZeneca R&D Gothenburg, 43183 Mölndal, Sweden
Full list of author information is available at the end of the article
Background
In pharmacology, "Big Data" on protein activity and gene expression perturbations has grown rapidly over the past decade thanks to the tremendous development of proteomics and genome sequencing technology [1, 2]. Similarly, there has also been a remarkable increase in the amount of available compound structure–activity relationship (SAR) data, contributed mainly by the development of high throughput screening (HTS) technologies and combinatorial chemistry for compound synthesis [3]. These SAR data points represent an important resource for chemogenomics modelling, a computational strategy in drug discovery that investigates the interaction of a large set of compounds (one or more libraries) against families of functionally related proteins [4].

Frequently, "Big Data" in chemogenomics refers to large databases recording the bioactivity annotation of chemical compounds against different protein targets. Databases such as PubChem [5], BindingDB [6] and ChEMBL [7] are examples of large public domain repositories of this kind of information. PubChem is a well-known public repository for storing small molecules and their biological activity data [5, 8]. It was originally started as a central repository of HTS experiments for the National Institutes of Health (USA) Molecular Libraries Program, but nowadays also incorporates data from other sources. ChEMBL contains data that was manually extracted from numerous peer-reviewed journal articles, as do WOMBAT [9], BindingDB [6] and CARLSBAD [10]. Similarly, commercial databases such as SciFinder [11], GOSTAR [12] and Reaxys [13] have accumulated a large amount of data from publications as well as patents. Besides these sources, large pharmaceutical companies maintain their own data collections originating from in-house HTS screening campaigns and drug discovery projects.
This data serves as a valuable source for building in silico models for predicting polypharmacology and off-target effects, and for benchmarking the prediction performance and computation speed of machine-learning algorithms. The aforementioned publicly available databases have been widely used in numerous cheminformatics studies [14–16]. However, the curated data are quite heterogeneous [17] and lack a standard way of annotating biological endpoints, mode of action and target identifiers. There is an urgent need to create an integrated data source with a standardized form for chemical structure, activity annotation and target identifier, covering as large a chemical and target space as possible. There are also irregularities within databases: the public screening data in PubChem, especially the inactive data points, are spread over different assay entries uploaded by data providers from around the world and cannot be directly compared without processing. This makes curating SAR data for quantitative structure–activity relationship (QSAR) modelling very tedious. An example of work to synthesize the curated and uncurated data is Mervin et al. [15], where a dataset with ChEMBL active compounds and PubChem inactive compounds was constructed, including inactive compounds for homologous proteins. However, that dataset can only be accessed as a plain text file, not as a searchable database.
In this work, by combining active and inactive compounds from both PubChem and ChEMBL, we created an integrated dataset for cheminformatics modelling purposes to be used in the ExCAPE [18] (Exascale Compound Activity Prediction Engine) Horizon 2020 project. ExCAPE-DB, a searchable open access database, was established for sharing the dataset. It will serve as a data hub, giving researchers around the world easy access to a publicly available standardized chemogenomics dataset, with the data and accompanying software available under open licenses.
Dataset curation
The standardized ChEMBL20 data was extracted from the in-house database ChemistryConnect [3], and PubChem data was downloaded in January 2016 from the PubChem website (https://pubchem.ncbi.nlm.nih.gov/) using the REST API. Both data sources are heterogeneous, so data cleaning and standardisation procedures were applied in preparing both the chemical structures and the bioactivity data.
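The text above only states that the PubChem REST API was used; the exact endpoints are not given. As a minimal illustrative sketch, PubChem's PUG REST interface can be used to pull a single bioassay record; the AID below is an arbitrary example, not one from the paper's pipeline.

```python
# Illustrative sketch only: the paper does not specify which PUG REST
# endpoints were used. This fetches the description record of one assay.
import requests

BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"

def fetch_assay_description(aid: int) -> dict:
    """Fetch the JSON description record of a PubChem bioassay."""
    url = f"{BASE}/assay/aid/{aid}/description/JSON"
    resp = requests.get(url, timeout=60)
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    record = fetch_assay_description(1000)  # hypothetical example AID
    print(list(record.keys()))
```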
Chemical structure standardisation
Standardisation of the PubChem and ChEMBL chemical structures was performed with ambitcli version 3.0.2. The ambitcli tool is part of the AMBIT cheminformatics platform [19–21] and relies on The Chemistry Development Kit library 1.5 [22, 23]. It includes a number of chemical structure processing options (fragment splitting, isotope removal, handling of implicit hydrogens and stereochemistry, InChI [24] generation, SMILES [25] generation, structure transformation via SMIRKS [26], tautomer generation and neutralisation, etc.). The details of the structure processing procedure can be found in Additional file 1. All standardisation rules were aligned between Janssen Pharmaceutica, AstraZeneca and IDEAConsult to reflect industry standards and implemented in open source software (https://doi.org/10.5281/zenodo.173560).
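The pipeline itself uses ambitcli (AMBIT/CDK) with the aligned rule set from Additional file 1. Purely as an analogous sketch, not the authors' tool or rules, the same kinds of steps (largest-fragment selection, neutralisation, and InChI/InChIKey/SMILES generation) look roughly like this in RDKit:

```python
# Analogous illustration only: the paper uses ambitcli (AMBIT/CDK), not RDKit,
# and applies a considerably larger, industry-aligned rule set.
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize

largest = rdMolStandardize.LargestFragmentChooser()
uncharger = rdMolStandardize.Uncharger()

def standardize(smiles):
    """Return canonical SMILES, InChI and InChIKey for the neutralised parent."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    mol = largest.choose(mol)       # fragment splitting: keep the largest fragment
    mol = uncharger.uncharge(mol)   # simple charge neutralisation
    return {
        "smiles": Chem.MolToSmiles(mol),
        "inchi": Chem.MolToInchi(mol),
        "inchikey": Chem.MolToInchiKey(mol),
    }

print(standardize("CC(=O)Oc1ccccc1C(=O)[O-].[Na+]"))  # aspirin sodium salt
```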
Bioactivity data standardisation
The processing protocol for extracting and standardizing bioactivity data is shown in Fig. 1. First, bioassays were restricted to those comprising a single target; black-box (target unknown) and multi-target assays were excluded. 58,235 and 92,147 single-target, concentration–response (CR) type assays (confirmatory type in PubChem) remained in PubChem and ChEMBL, respectively. The assay targets were further limited to human, rat and mouse species, and data points missing a compound identifier (CID) were removed. For the filtered assays, compounds whose dose–response value was equal to or lower than 10 μM were kept as active entries and others were removed. Inactive compounds in CR assays were kept as inactive entries. Compounds labelled as inactive in PubChem screening assays (assays run at a single concentration) were also kept as inactive records.

Fig. 1 Workflow for data preparation
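Since 10 µM = 1e-5 M, the activity cutoff corresponds to pXC50 ≥ 5 (pXC50 being the negative log10 of the molar XC50), matching the threshold quoted for Fig. 3. A schematic pandas rendering of this filtering step follows; all column names and label conventions are hypothetical, since the paper does not publish its internal schema.

```python
# Schematic only: column names ("cid", "n_targets", "tax_id", "assay_type",
# "xc50_uM", "outcome") are assumptions. 10 uM == 1e-5 M, i.e. pXC50 = 5.
import pandas as pd

ALLOWED_TAXA = {9606, 10116, 10090}  # NCBI taxonomy IDs: human, rat, mouse

def filter_bioactivity(points: pd.DataFrame) -> pd.DataFrame:
    df = points.dropna(subset=["cid"])            # require a compound identifier
    df = df[df["n_targets"] == 1]                 # single-target assays only
    df = df[df["tax_id"].isin(ALLOWED_TAXA)]      # human, rat, mouse
    cr = df[df["assay_type"] == "confirmatory"]   # concentration-response assays
    actives = cr[cr["xc50_uM"] <= 10].assign(label="A")
    cr_inactives = cr[cr["outcome"] == "inactive"].assign(label="N")
    screen_inactives = df[(df["assay_type"] == "screening")
                          & (df["outcome"] == "inactive")].assign(label="N")
    return pd.concat([actives, cr_inactives, screen_inactives], ignore_index=True)
```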
The chemical structure identifiers (InChI, InChIKey and SMILES) generated from the standardized compound structures (as explained above) were joined with the compounds obtained after the filtering procedure. The compound set was further filtered by the following physicochemical properties: an organic filter (compounds without metal atoms), molecular weight (MW) < 1000 Da, and number of heavy atoms (HEV) > 12. This was done to remove small or inorganic compounds that are not representative of the chemical space relevant for a typical drug discovery project. This is a much more generous rule than the Lipinski rule-of-five [27], but the aim was to keep as much useful chemical information as possible while still removing some non-drug-like compounds.
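As a rough RDKit illustration of this physicochemical filter (organic compounds only, MW < 1000 Da, more than 12 heavy atoms); the paper applies these rules inside its AMBIT/CDK pipeline, and the organic element whitelist below is a simplifying assumption:

```python
# Rough sketch: the element whitelist approximates "compounds without metal
# atoms"; the paper's exact organic filter definition is not reproduced here.
from rdkit import Chem
from rdkit.Chem import Descriptors

ORGANIC_ELEMENTS = {"H", "B", "C", "N", "O", "F", "P", "S", "Cl", "Br", "I", "Se", "Si"}

def passes_property_filter(mol):
    if any(atom.GetSymbol() not in ORGANIC_ELEMENTS for atom in mol.GetAtoms()):
        return False                          # organic filter: no metal atoms
    if Descriptors.MolWt(mol) >= 1000:        # keep MW < 1000 Da
        return False
    return mol.GetNumHeavyAtoms() > 12        # keep HEV > 12

print(passes_property_filter(Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")))  # aspirin: True
```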
Finally, fingerprint descriptors were generated for all remaining compounds: JCompoundMapper (JCM) [28], CDK circular fingerprint and signature descriptors [29] were generated, respectively. For the circular fingerprint and signature calculation, the maximum topological radius for fragment generation was set to 3.
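The paper computes these descriptors with CDK and JCompoundMapper. As an analogous, not identical, illustration, here is an RDKit Morgan (circular) fingerprint with the same maximum radius of 3:

```python
# Analogous illustration: RDKit's Morgan fingerprint is a circular fingerprint
# comparable in spirit (not bit-for-bit) to the CDK circular fingerprint used.
from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin
fp = AllChem.GetMorganFingerprintAsBitVect(mol, 3, nBits=2048)
print(fp.GetNumOnBits(), "of", fp.GetNumBits(), "bits set")
```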
From each data source, various attributes were read and converted into controlled vocabularies; the most important of these are target (Entrez ID), activity value, mode of action, assay type and assay technology. The underlying data sources contain activity data with various result types; the results were unified as best as possible to make them comparable across tests (and data sources) irrespective of the original result type. The selected compatible dose–response result types are listed in Additional file 2: Table S1. Generally, the endpoint name of a concentration-related assay (e.g. IC50, units in µM) should match one of the keywords in this list. When a compound has multiple activity records for the same target, the records are aggregated so that each compound has only one record per target, and the best (maximal) potency is chosen as the final aggregated value for the compound–target pair. The AMBIT-generated InChIKey from the standardisation procedure was used as the molecular identifier to identify duplicate structures during data aggregation. Finally, targets with fewer than 20 active compounds were removed from the final dataset.
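A schematic pandas version of this aggregation rule follows: deduplicate on the standardised InChIKey per target, keep the best (maximal) pXC50 per compound–target pair, then drop targets with fewer than 20 actives. The column names are hypothetical.

```python
# Schematic only: "inchikey", "entrez_id", "pxc50" and "label" are assumed
# column names, not the paper's published schema.
import pandas as pd

def aggregate_sar(points: pd.DataFrame) -> pd.DataFrame:
    # Keep the highest pXC50 (best potency) per compound-target pair.
    best = (points.sort_values("pxc50", ascending=False)
                  .drop_duplicates(subset=["inchikey", "entrez_id"]))
    # Remove targets with fewer than 20 active compounds.
    actives_per_target = (best[best["label"] == "A"]
                          .groupby("entrez_id")["inchikey"].nunique())
    kept_targets = actives_per_target[actives_per_target >= 20].index
    return best[best["entrez_id"].isin(kept_targets)]
```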
Entrez ID [30], gene symbol [31–33] and gene orthologue were collected as target information. The gene symbol was converted from the Entrez ID with the gene2accession table [34] provided by the National Center for Biotechnology Information (NCBI). Gene orthologues were included from the NCBI orthologue table [34].
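As a small sketch of an Entrez GeneID to official symbol mapping: the paper cites NCBI's gene2accession table, whereas the snippet below reads the companion gene_info file (whose Symbol column carries the mapping directly). Using gene_info here is a convenience assumption, not the authors' exact procedure.

```python
# Sketch: build a GeneID -> Symbol dict from NCBI's gene_info dump
# (ftp.ncbi.nlm.nih.gov/gene/DATA/gene_info.gz). The paper itself cites the
# gene2accession table; gene_info is used here as a convenient equivalent.
import pandas as pd

def load_symbol_map(path: str = "gene_info.gz") -> dict:
    cols = ["#tax_id", "GeneID", "Symbol"]
    df = pd.read_csv(path, sep="\t", usecols=cols, compression="gzip")
    df = df[df["#tax_id"].isin([9606, 10116, 10090])]  # human, rat, mouse
    return dict(zip(df["GeneID"], df["Symbol"]))
```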
Database andweb interface
e ExCAPE-DB is built based on the AMBIT database
and web application [
19], enhanced with a free text search
engine (Apache Solr [
35]). An instance of the AMBIT web
application (ambit2.war) was installed and the chemi
-
cal structures were imported. is enables chemistry-
aware search (similarity, substructure) and depiction, all
exposed via a REST API and the web interface provided
by the web application itself. e bioactivity data, consist
-
ing of compound related information (e.g. target activity
label and InChIKey) and target related information (e.g.
Entrez IDs and official gene symbols), is imported into an
Apache Solr collection (
http://lucene.apache.org/solr/)
and exposed through the Solr REST API. e open source
JavaScript client library jToxKit (
https://github.com/
ideaconsult/jToxKit
) is used to interact with the AMBIT
REST API and the Solr REST API. A dedicated JavaScript
web interface was developed for ExCAPE-DB, integrating
the chemical search, as well as the free text and faceted
search functionality for biological activities.
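Since the data is exposed through the Solr REST API, it can in principle be queried directly rather than through jToxKit. The sketch below uses Solr's standard select handler; both the exact endpoint layout and the field name ("Gene_Symbol") are guesses for illustration, as the schema is not spelled out in the text.

```python
# Hedged sketch of a direct Solr query. The collection path and field name
# below are assumptions; only the host comes from the paper.
import requests

SOLR_SELECT = "https://solr.ideaconsult.net/solr/excape/select"  # assumed layout

params = {"q": "Gene_Symbol:EGFR", "wt": "json", "rows": 10}
resp = requests.get(SOLR_SELECT, params=params, timeout=60)
resp.raise_for_status()
print(resp.json()["response"]["numFound"], "matching entries")
```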
ExCAPE-DB is available online (https://solr.ideaconsult.net/search/excape/) and a screenshot of the web browser interface is shown in Fig. 2a. The dataset can be searched by both target name and CID. For target-based searches, the Entrez ID, gene symbol, gene orthologue group and target species can be used for subsetting datasets. For compound searches, a user can either input the InChIKey or specify a CID (SMILES, InChI or IUPAC chemical name) for a free-text search, or use the embedded structure editor for a substructure or similarity search (Fig. 2b). It is also possible to follow a link from the search result to the original ChEMBL or PubChem page of the specific compound. The download tab on the web page provides several download options. The "Filtered entries" option allows downloading the complete current search result. For downloading specific entries, it is possible to use the "Add to selection" links to compile a subset of selected entries, which then becomes available for download as "Selected entries". A static link for downloading the entire ExCAPE-DB dataset is available on the download tab. The dataset is also uploaded to the Zenodo.org repository and is available for download from there as doi:10.5281/zenodo.173258.

Fig. 2 Browsing the ExCAPE-DB web interface. a Searching the database via gene symbol or free-text; the original compound information is linked from the result page. b Searching the database via substructure search
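The Zenodo deposit can also be fetched programmatically through Zenodo's public records API; the record ID below is derived from the DOI above, and the file layout inside the record is not guaranteed by the paper, so the sketch simply lists whatever files the API reports.

```python
# Hedged sketch: list the files of the Zenodo deposit (record 173258, from
# doi:10.5281/zenodo.173258) via Zenodo's public records API.
import requests

resp = requests.get("https://zenodo.org/api/records/173258", timeout=60)
resp.raise_for_status()
for f in resp.json().get("files", []):
    print(f.get("key"), f.get("links", {}).get("self"))
```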
Discussion
The dataset composition is described in Table 1. In total there are 998,131 unique compounds and 70,850,163 SAR data points. These SAR data points cover 1667 targets.
Table 1 Public chemogenomics dataset

                       ChEMBL       PubChem       ExCAPE-DB
Actives
  # SAR data points    1,259,338    439,288       1,332,426
  # Compounds          566,143      263,119       593,156
Inactives
  # SAR data points    1,530,908    68,948,609    69,517,737
  # Compounds          416,655      654,562       719,192
Total
  # SAR data points    2,790,246    69,387,897    70,850,163
  # Compounds          710,324      828,317       998,131
  # Targets            1644         1588          1667
Fig. 3 Composition of active compounds in the dataset. The distribution of active compounds among the targets in a ExCAPE-DB, b the ChEMBL part of ExCAPE-DB and c the fraction span of actives in both datasets. We note that the ChEMBL dataset is shown here before the filtering and aggregation process and contains only single-target assays. Active compounds should have a pXC50 of no less than 5, and only targets with at least 20 active compounds were considered.
