ChEMBL Resources


Monday, 28 April 2014

6th Open PHACTS Community Workshop - 26 June 2014, London

The Open PHACTS Discovery Platform is a freely accessible infrastructure that semantically integrates publicly available data for applied life science R&D. The Platform provides a powerful Application Programming Interface (API) that allows application builders and researchers to query the integrated data using existing applications, to build new applications, and to access the API using workflow tools (e.g. KNIME and Pipeline Pilot). Examples of such applications, which illustrate what can be achieved, include the Open PHACTS Explorer, ChemBioNavigator, and PharmaTrek.

The Open PHACTS Community Workshop in London on Thursday 26th June aims to introduce members of the academic community to the Open PHACTS Discovery Platform.

The workshop will be of interest to:
  • Researchers who would benefit from directly querying the Open PHACTS API using scripting languages or by developing applications to consume the data.
  • Lecturers & Principal Investigators who can use the Open PHACTS application ecosystem to access the data within the Open PHACTS Discovery Platform.

The Community Workshop will introduce attendees to the Open PHACTS API and showcase how it can be used to create new applications or enhance existing ones. We will demonstrate, using real-life use cases, how universities can use the Open PHACTS API and associated tools for teaching and research in drug discovery.

Venue: Burlington House, Piccadilly, London W1J 0BA

The Workshop is free to attend for those at academic institutions. For more information or to register, please email

Monday, 21 April 2014

Meeting: 20th European Symposium on Quantitative Structure-Activity Relationships (EuroQSAR-2014), St. Petersburg

EuroQSAR-2014 will be held in St. Petersburg, Russia, from August 31st to September 4th, 2014. The deadline for submitting abstracts for oral talks is April 23rd, 2014. The meeting, entitled Understanding Chemical-Biological Interactions, will include 9 plenary lectures and 28 oral communications, which will be selected from the submitted abstracts and will focus on:
  • Chemical-Biological Space: Representation, Visualisation and Navigation.
  • Chemo- and Bioinformatics Approaches to Multi-Target (Q)SAR.
  • Modeling of Protein-Ligand Interactions: Structure, Function and Dynamics.
  • Assessing Ligand Binding Kinetics.
  • Computational Toxicology in Drug and Chemical Safety Assessment.
  • Translational Bioinformatics: From Genomes to Drugs.
  • Emerging QSAR and Modeling Methods.
Two seminars/roundtables are also planned on the last day of the Symposium:
  • (Q)SAR-Related European Initiatives.
  • Employing Proper Statistical Approaches for QSAR Modeling and Best Publishing Practices.
Confirmed speakers include:
  • Opening Lecture - SAR, the Lifelong Learning for my Career Prof. Toshio FUJITA (KYOTO UNIVERSITY, Kyoto, Japan)
  • From QSAR to MQSPR and Beyond: Predictive Materials Informatics Using a Blend of Heuristic and Physics-Based Methods
  • Integrating Pharmacometrics into Drug Development Dr Roberta BURSI (GRÜNENTHAL, Aachen, Germany)
  • Lead Discovery and Optimisation by Use of Interaction Kinetic Analysis Prof. Helena DANIELSON (UPPSALA UNIVERSITY, Uppsala, Sweden)
  • Navigation in Chemical Space Towards Biological Activity Dr Peter ERTL (NOVARTIS INSTITUTE FOR BIOMEDICAL RESEARCH, Basel, Switzerland)
  • Computational Toxicology – An Essential Part of Drug Safety Dr Catrin HASSELGREN (ASTRAZENECA, Mölndal, Sweden)
  • Ensemble-Based Drug Design, Combining Protein Structures and Simulations Dr Will PITT (UCB PHARMA, Slough, United Kingdom)
  • The Metabolic Code Prof. Brian SHOICHET (UNIVERSITY OF TORONTO, Toronto, Canada)
  • Closing Lecture - Opportunities and Challenges in Therapeutics Discovery and Development Dr John C. REED (F. HOFFMANN-LA ROCHE, Basel, Switzerland)
Hansch Session

  • On the Nature of Non-Classical Hydrogen Bonds and Aromatic Interactions Prof. Anna LINUSSON (UMEÅ UNIVERSITY, Umeå, Sweden)
  • Lessons Learned from the Invention of QSAR Can Inspire Other Breakthrough Discoveries Dr Yvonne C. MARTIN (MARTIN CONSULTING, Waukegan, United States)
  • The Road Ahead: New Challenges for Computational Forecasts Prof. Tudor I. OPREA (UNIVERSITY OF NEW MEXICO, Albuquerque, United States)
  • Molecular Design of Bivalent and Dual Action Drugs Prof. Nikolay S. ZEFIROV (MOSCOW STATE UNIVERSITY, Moscow, Russia)
Proceedings of the Symposium will be published in a special issue of the journal Molecular Informatics.

More information can be found on the Symposium's website:

Friday, 11 April 2014

Target Prediction IPython Notebook Tutorial

As promised in the previous post, the ChEMBL target prediction models are now available to download from here. Furthermore, here is an IPython Notebook that showcases how the models can be used in Python. As usual, your feedback is very welcome. 


Friday, 4 April 2014

Paper: Chemical, Target, and Bioactive Properties of Allosteric Modulation

We have just had a paper accepted in PLoS Computational Biology on the work we've done on allosteric modulators (first mentioned on the blog here). The work is based on the mining of allosteric bioactivity points from ChEMBL_14. The data set of allosteric and non-allosteric interactions is available on our FTP site (here). This blog post will just highlight some sections of the paper, and we refer the interested reader to the full paper (here).

The dataset contains annotated and cleaned ChEMBL data divided into an 'allosteric' set and a 'non-allosteric' (or background) set. We pulled abstracts and titles mentioning allosteric keywords and, from the resulting papers, extracted the primary target and all bioactivities on that target. From the remaining papers we retrieved the primary target and all bioactivities on that target in the same manner.
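The split described above can be sketched roughly as follows. This is only an illustration: the keyword stem, the record fields (`doc_id`, `title`, `abstract`) and the function names are assumptions, not the search terms or code used for the paper.

```python
# Hypothetical keyword stem; the paper's actual search terms are not listed here.
ALLOSTERIC_KEYWORDS = ("alloster",)


def is_allosteric_paper(title, abstract, keywords=ALLOSTERIC_KEYWORDS):
    """Return True if the title or abstract mentions any allosteric keyword."""
    text = f"{title} {abstract}".lower()
    return any(keyword in text for keyword in keywords)


def split_papers(papers):
    """Split paper records into 'allosteric' and 'background' document ids."""
    allosteric, background = [], []
    for paper in papers:
        if is_allosteric_paper(paper["title"], paper["abstract"]):
            allosteric.append(paper["doc_id"])
        else:
            background.append(paper["doc_id"])
    return allosteric, background


# Made-up records, for illustration only.
papers = [
    {"doc_id": "doc1", "title": "Positive allosteric modulators of mGlu5",
     "abstract": "We describe a new series of modulators."},
    {"doc_id": "doc2", "title": "Novel kinase inhibitors",
     "abstract": "ATP-competitive compounds with improved selectivity."},
]
allosteric_ids, background_ids = split_papers(papers)
print(allosteric_ids, background_ids)
```

The primary target and its bioactivities would then be collected per document id in each of the two sets.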

When we compared the target distributions of the two sets, we saw differences (see below; also touched upon in the previous post). Targets that are known to be amenable to allosteric modulation are indeed well represented in our allosteric set (e.g. Class C GPCRs). However, there are also some interesting observations that we did not expect (please see the paper for further details).

Obviously, as we are the ChEMBL group, we are interested in potential chemical differences between the allosteric and background sets. Interestingly, the allosteric modulators appear to form a subset of the background set rather than being distinct from it. We have calculated a large number of descriptors and compared the sets (median values, but also histograms; all available on the FTP site). We observe that allosteric modulator molecules tend to be smaller, more lipophilic and more rigid, although there is understandably a large variance across the diverse targets included in the set. Shown here is the rigidity index calculated over the full sets (L0); when the target selection becomes more concise, the differences become more distinct.

Likewise, we observe differences between our allosteric subset and the background set with regard to bioactivity. While 'allosteric modulation' is a very diverse concept, in which the specific manner in which the small molecule influences the protein differs per protein-ligand pair, we do observe some general differences. From our data it appears that allosteric modulators bind with a lower affinity (on average) but similar ligand efficiency (on average) when compared to our background set. The paper provides a more extensive discussion of this observation, and given the limited space here we again refer the reader to it.
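To see how lower affinity and similar ligand efficiency can go together, here is a minimal sketch using one common definition of ligand efficiency (binding free energy per heavy atom, approximated from a potency value at about 298 K). The exact definition used in the paper may differ, and the function name and example numbers are purely illustrative.

```python
import math

# RT * ln(10) in kcal/mol at about 298 K (roughly 1.36 kcal/mol per log unit).
KCAL_PER_LOG_UNIT = 1.987e-3 * 298.15 * math.log(10)


def ligand_efficiency(affinity_nm, heavy_atoms):
    """Approximate ligand efficiency in kcal/mol per heavy atom.

    affinity_nm is a potency value (e.g. Ki or IC50) in nM.
    """
    p_affinity = 9.0 - math.log10(affinity_nm)  # e.g. 1000 nM -> 6.0
    return KCAL_PER_LOG_UNIT * p_affinity / heavy_atoms


# A smaller, weaker binder and a larger, more potent binder (made-up values).
small_weak = ligand_efficiency(affinity_nm=1000.0, heavy_atoms=20)
large_potent = ligand_efficiency(affinity_nm=1.0, heavy_atoms=30)
print(round(small_weak, 3), round(large_potent, 3))
```

In this toy case the smaller molecule binds a thousand times more weakly, yet both molecules have the same efficiency per heavy atom, which is consistent with smaller molecules showing lower affinity but similar ligand efficiency.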

Classification models
Using the dataset, we have built allosteric classifier models that can predict whether an interaction is likely to be allosteric. We have tried this on the full dataset, but also at lower levels (restricting the data to e.g. Class A GPCRs). We find that the models become more predictive when trained on a more concise dataset (which eliminates some of the inter-target variation). In the paper we provide case studies on HIV Reverse Transcriptase, the adenosine receptors (family), and protein Kinase B. Here the model performance for class A GPCRs (full L2 target class) is shown. Note that rigidity, number of sp3 carbons, Polar Solvent Accessible Surface (normalized), and rotatable bonds fraction are most important for model fit.

All data is from ChEMBL and hence can be freely downloaded and used. Please let us know if you find any errors or misclassifications and we will correct them (crowd curation).

Anna, jpo, and Gerard

%T Chemical, Target, and Bioactive Properties of Allosteric Modulation
%A G.J.P. van Westen
%A A. Gaulton
%A J.P. Overington
%J PLoS. Comput. Biol.
%D 2014
%V 10
%O doi:10.1371/journal.pcbi.1003559

Thursday, 3 April 2014

Ligand-based target predictions in ChEMBL

In case you haven't noticed, ChEMBL_18 has arrived. As usual, it brings new additions, improvements and enhancements to both the data/annotation and the interface. One of the new features is target prediction for small molecule drugs. If you go to the compound report card for such a drug, say imatinib or cabozantinib, and scroll down towards the bottom of the page, you'll see two tables with predicted single-protein targets, corresponding to the two models that we used for the predictions.

 - So what are these models and how were they generated? 

They belong to the family of so-called ligand-based target prediction methods. That means that the models are trained using ligand information only. Specifically, the model learns which substructural features (encoded as fingerprints) of ligands correlate with activity against a certain target and assigns a score to each of these features. Given a new molecule with a new set of features, the model sums the individual feature scores for each target and returns a list of likely targets sorted by score, highest first. Ligand-based target prediction methods have been quite popular in recent years, as they have proved useful for target deconvolution and mode-of-action prediction of phenotypic hits / orphan actives. See here for an example of such an approach and here for a comprehensive review.
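The "sum the feature scores per target, then sort" idea can be sketched in a few lines of plain Python. This is not the ChEMBL implementation: the feature names, target names and score values below are invented, and a real model would learn the scores from training data.

```python
from collections import defaultdict


def rank_targets(features, feature_scores, top_n=10):
    """Sum per-feature scores for every target and return the top_n targets.

    features:       iterable of feature identifiers present in the molecule
    feature_scores: dict mapping feature -> {target: learned score}
    """
    totals = defaultdict(float)
    for feature in features:
        for target, score in feature_scores.get(feature, {}).items():
            totals[target] += score
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]


# Hypothetical learned scores (feature -> target -> score).
feature_scores = {
    "fp_bit_12": {"TARGET_A": 1.5, "TARGET_B": -0.25},
    "fp_bit_87": {"TARGET_A": 0.5, "TARGET_C": 1.0},
    "fp_bit_301": {"TARGET_B": 0.75},
}

# A new molecule that contains two of the known features.
ranked = rank_targets({"fp_bit_12", "fp_bit_87"}, feature_scores)
print(ranked)
```

Features the model has never seen simply contribute nothing, which is one reason such models degrade gracefully on novel chemistry.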

 - OK, and how were they generated?

As usual, it all started with a carefully selected subset of ChEMBL_18 data containing pairs of compounds and single-protein targets. We used two activity cut-offs, namely 1uM and a more relaxed 10uM, which correspond to two models trained on bioactivity data against 1028 and 1244 targets respectively. KNIME and pandas were used for the data pre-processing. Morgan fingerprints (radius=2) were calculated using RDKit and then used to train a multinomial Naive Bayesian multi-category scikit-learn model. These models were then used to predict targets for the small molecule drugs, as mentioned above.
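As a rough illustration of why the relaxed cut-off yields a model covering more targets, the sketch below builds the two sets of active compound/target pairs from a toy activity table. The field names and values are made up, and the real pre-processing (done in KNIME and pandas) involved considerably more curation than a single threshold.

```python
# Activity cut-offs in nM for the two models described above.
CUTOFFS_NM = {"1uM": 1000.0, "10uM": 10000.0}


def build_training_pairs(activities, cutoff_nm):
    """Return the set of (compound, target) pairs active at or below the cut-off."""
    return {
        (row["compound"], row["target"])
        for row in activities
        if row["value_nm"] <= cutoff_nm
    }


# Toy activity records (made-up identifiers and values).
activities = [
    {"compound": "CPD1", "target": "T1", "value_nm": 50.0},
    {"compound": "CPD1", "target": "T2", "value_nm": 4000.0},
    {"compound": "CPD2", "target": "T2", "value_nm": 25000.0},
]

training_sets = {
    name: build_training_pairs(activities, cutoff)
    for name, cutoff in CUTOFFS_NM.items()
}
targets_per_model = {
    name: {target for _, target in pairs}
    for name, pairs in training_sets.items()
}
print(targets_per_model)
```

Here target T2 only has actives under the 10uM cut-off, so it appears in the relaxed model but not in the stricter one.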

 - Any validation? 

Besides more trivial property predictions such as logP/logD, this is the first time ChEMBL has hosted non-experimental (i.e. not measured) data, so this is a big deal and we wanted to try and do this right. First of all, we did a 5-fold stratified cross-validation. But how do you assess a model with a many-to-many relationship between items (compounds) and categories (targets)? For each compound in each of the five 20% test sets, we got the top 10 ranked predictions. We then checked whether these predictions agree with the known targets for that compound. Ideally, the known target should be correctly predicted at the 1st position of the ranked list, otherwise at the 2nd position, the 3rd and so on. By aggregating over all compounds of all test sets, you get this pie chart:

This means that a known target is correctly predicted by the model at the first attempt (Position 1 in the list of predicted targets) in ~69% of the cases. Actually, only 9% of compounds in the test sets had completely mis-predicted known targets within the top 10 predictions list (Found above 10). 
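The position-based assessment can be expressed as a short sketch: for each test compound, find the first rank at which any known target appears in the top 10, then tabulate the fractions. The data below are invented and the function names are illustrative; this is not the code used to produce the chart.

```python
from collections import Counter


def first_hit_position(ranked_predictions, known_targets, top_n=10):
    """Return the 1-based rank of the first known target in the top_n, or None."""
    for position, target in enumerate(ranked_predictions[:top_n], start=1):
        if target in known_targets:
            return position
    return None


def position_distribution(test_compounds, top_n=10):
    """Fraction of compounds whose first correct prediction is at each rank."""
    counts = Counter()
    for ranked_predictions, known_targets in test_compounds:
        position = first_hit_position(ranked_predictions, known_targets, top_n)
        counts[position if position is not None else f"above {top_n}"] += 1
    total = sum(counts.values())
    return {key: count / total for key, count in counts.items()}


# Each entry: (ranked predicted targets, set of known targets). Made-up data.
test_compounds = [
    (["T1", "T2", "T3"], {"T1"}),
    (["T4", "T1", "T2"], {"T1", "T9"}),
    (["T5", "T6", "T7"], {"T8"}),
    (["T2", "T3"], {"T2"}),
]
distribution = position_distribution(test_compounds)
print(distribution)
```

Each key of the resulting dictionary corresponds to one slice of a pie chart like the one described above.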

This is related to precision, but what about recall of known targets? Here's another chart:

This means that, on average, by considering the top 10 most likely target predictions (<1% of the target pool), the model can correctly predict around 89% of a compound's known single protein targets.
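Recall within the top 10 can be sketched in the same spirit: for each compound, take the fraction of its known targets that appear among the top-ranked predictions, then average over compounds. Again, the data and function names are illustrative assumptions.

```python
def recall_at_n(ranked_predictions, known_targets, top_n=10):
    """Fraction of a compound's known targets found in the top_n predictions."""
    if not known_targets:
        raise ValueError("known_targets must not be empty")
    top = set(ranked_predictions[:top_n])
    return len(top & set(known_targets)) / len(known_targets)


def mean_recall_at_n(test_compounds, top_n=10):
    """Average recall over all (ranked predictions, known targets) pairs."""
    recalls = [
        recall_at_n(ranked, known, top_n) for ranked, known in test_compounds
    ]
    return sum(recalls) / len(recalls)


# Made-up test compounds: (ranked predicted targets, known targets).
test_compounds = [
    (["T1", "T2", "T3"], {"T1", "T3"}),  # both known targets recovered
    (["T4", "T1"], {"T1", "T9"}),        # one of two recovered
]
mean_recall = mean_recall_at_n(test_compounds)
print(mean_recall)
```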

Finally, we compared the new open source approach (right) to an established one generated with a commercial workflow environment software (left) using the same data and very similar descriptors:

If you manage to ignore for a moment the slightly different colour coding, you'll see that their predictive performance is pretty much equivalent.

 - It all sounds good, but can I get predictions for my own compounds?

We will provide the models, together with an IPython Notebook showing how to use them, in another blog post that will follow soon. There are also plans for a publicly available target prediction web service, something like SMILES to predicted targets. Actually, if you would be interested in this, or if you have any feedback or suggestions for the target prediction functionality, let us know.


Wednesday, 2 April 2014

ChEMBL_18 Released

We are pleased to announce the release of ChEMBL_18. This version of the database was prepared on 12th March 2014 and contains:
  • 1,566,466 compound records
  • 1,359,508 compounds (of which 1,352,681 have mol files)
  • 12,419,715 activities
  • 1,042,374 assays
  • 9,414 targets
  • 53,298 documents

The web front end is now connected to the ChEMBL_18 data, but you can also download the data from the ChEMBL FTP site. Please see the ChEMBL_18 release notes for full details of all changes in this release.

Changes since the last release


New data sets


The ChEMBL_18 release includes the following new datasets:
  • University of Vienna P-glycoprotein (Pgp) screening data
  • UCSF MMV Malaria Box screening data
  • DNDi Trypanosoma cruzi screening data
  • DrugMatrix in vivo toxicology data
In addition, 43,335 new compound records from 2,015 publications in the primary literature have been added to this release. Approved drug and USAN data have also been updated, with 103 new structures added.


Updates to the protein family classification


A review and update of the ChEMBL protein family classification has been carried out. The main changes are listed below:

  • New ion channel/transporter classification, based on the BPS classification
  • New epigenetic protein classification, based on SGC/ChromoHub classification
  • Modification of kinase classification, to follow Human Kinome classification


Assay classification and ontology mapping


The following annotations and classifications have been added to the ChEMBL assay data:
  • Classification of assay format (e.g., biochemical, cell-based, organism-based) using BioAssay Ontology
  • Classification of endpoints (e.g., IC50, AUC, Ki) using BioAssay Ontology
  • Addition of Physicochemical and Toxicity assay type classification
  • Mapping of assay cell-lines to CLO, EFO and Cellosaurus
  • Mapping of standard units to Unit Ontology and QUDT



Capture of assay parameters


A new table in the database (assay_parameters) is used to capture additional properties of assays, such as dose, administration route and time points. These additional parameters are displayed on the Assay Report Card.


Target predictions


Bioactivity data for single protein targets in ChEMBL have been used to train and validate two Naive Bayesian multi-label classifier models (at <= 1uM and <= 10uM bioactivity cutoffs respectively). These models have subsequently been employed to predict biological targets for a set of approved drugs, which are displayed in the new Target Predictions section of the Compound Report Card, where applicable. Since some of the predictions correspond to compound/target pairs that were included in the training set for the models, these are shown in white, to distinguish them from genuine predictions (coloured light yellow). Only predictions scoring >= 0.2 are included in the result tables. The models were built with open source tools such as RDKit and scikit-learn and are available upon request.
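The display rules in the paragraph above (keep scores >= 0.2, mark pairs already in the training set) could be sketched like this. The data structures, identifiers and function name are assumptions made for illustration and do not reflect how the report card is actually implemented.

```python
SCORE_THRESHOLD = 0.2  # only predictions scoring >= 0.2 are shown


def annotate_predictions(compound, scored_targets, training_pairs,
                         threshold=SCORE_THRESHOLD):
    """Filter predictions by score and flag those seen during training.

    Rows for compound/target pairs present in the training set are marked
    'white'; genuine predictions are marked 'light yellow'.
    """
    rows = []
    ranked = sorted(scored_targets.items(), key=lambda item: item[1], reverse=True)
    for target, score in ranked:
        if score < threshold:
            continue
        in_training = (compound, target) in training_pairs
        rows.append({
            "target": target,
            "score": score,
            "colour": "white" if in_training else "light yellow",
        })
    return rows


# Made-up scores and training pairs.
scored_targets = {"T1": 0.9, "T2": 0.25, "T3": 0.1}
training_pairs = {("DRUG_X", "T1")}
rows = annotate_predictions("DRUG_X", scored_targets, training_pairs)
for row in rows:
    print(row)
```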

We would appreciate any feedback on this feature, and any further ideas you may have on including predicted data on top of ChEMBL experimental data.


UniChem connectivity mapping


In addition to the standard UniChem cross-references shown on the report card (based on exact InChI Key matching), a new link is included to an expanded view of UniChem cross-references, generated based on InChI connectivity layer matching (e.g., 

This expanded view shows any compounds in UniChem that share the same connectivity as the query structure, even if they have stereochemical, isotopic or protonation state differences. The differences between the query and retrieved structures are shown by their position in the table: the first column shows compounds that match in all InChI layers, while the subsequent columns show those structures that differ in stereochemistry (s column), isotope (i column), protonation state (p column), or various combinations of these layers (final four columns). A button at the top of the table gives the additional option to retrieve compounds that match individual components of a mixture or salt. Where the query structure consists of multiple components, matches to each of these components will be coloured different colours (e.g., black, blue, red). 
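To make the layer comparison concrete, here is a simplified sketch that splits standard InChI strings into layers and reports which of the stereochemistry (s), isotope (i) and protonation (p) groups differ. It is an approximation for illustration, not UniChem's matching code: the grouping of layers is an assumption, all other layers (formula, /c, /h, /q) are treated together as "connectivity", and sub-layers that can follow an isotopic layer are not handled separately.

```python
STEREO_PREFIXES = {"b", "t", "m", "s"}  # double-bond and tetrahedral stereo layers


def split_layers(inchi):
    """Group the layers of an InChI string into connectivity, s, i and p."""
    parts = inchi.strip().split("/")
    if len(parts) < 2 or not parts[0].startswith("InChI="):
        raise ValueError(f"not an InChI string: {inchi!r}")
    groups = {"connectivity": [parts[1]], "s": [], "i": [], "p": []}
    for layer in parts[2:]:
        if not layer:
            continue
        prefix = layer[0]
        if prefix in STEREO_PREFIXES:
            groups["s"].append(layer)
        elif prefix in ("i", "p"):
            groups[prefix].append(layer)
        else:
            # Simplification: everything else counts as connectivity.
            groups["connectivity"].append(layer)
    return {name: tuple(layers) for name, layers in groups.items()}


def compare_inchis(query, candidate):
    """Return None if connectivity differs, else the differing layer codes.

    An empty string means the two structures match in all layers.
    """
    q, c = split_layers(query), split_layers(candidate)
    if q["connectivity"] != c["connectivity"]:
        return None
    return "".join(code for code in ("s", "i", "p") if q[code] != c[code])


# InChI strings for the two alanine enantiomers and for glycine.
L_ALANINE = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
D_ALANINE = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m1/s1"
GLYCINE = "InChI=1S/C2H5NO2/c3-1-2(4)5/h1,3H2,(H,4,5)"

print(repr(compare_inchis(L_ALANINE, L_ALANINE)))
print(repr(compare_inchis(L_ALANINE, D_ALANINE)))
print(repr(compare_inchis(L_ALANINE, GLYCINE)))
```

The returned code plays the role of the column a retrieved structure would fall into: an empty string for a match in all layers, "s" for a stereochemical difference, and combinations such as "si" for multiple differences.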




The ChEMBL RDF data model has been enhanced and now includes the following information:
  • Drug mechanism of action and binding site information
  • Molecule hierarchy
  • Target relationships
  • Assay format
  • Cell-line information
More information (documentation, SPARQL endpoint and example queries) about the RDF version of the ChEMBL database can be found on the EBI-RDF Platform, and you can download the RDF files from the ChEMBL FTP site.


Web Service Update


Three new Web Service calls focused on approved drugs, mechanism of action and compound forms are now available. Example calls to these methods can be seen below; please also visit the ChEMBL Web Service page for more details.

As always, we greatly appreciate the reporting of any omissions or errors.

The ChEMBL Team