ChEMBL Resources


Wednesday, 17 April 2019

Job opportunities in the ChEMBL Group

We have two exciting opportunities for scientists to come and work with the ChEMBL team at the Wellcome Genome Campus in Hinxton near Cambridge.

If you've used ChEMBL in the past, perhaps now is the chance to come and shape its future. Even if you haven't, this is a great place to work, and in both positions you will collaborate not only with the people developing the ChEMBL resources but also with our collaborators here at Hinxton and around Europe. These include the Open Targets project and EU-funded toxicology projects such as EU-ToxRisk and eTRANSAFE.

We are looking for:

(1) A talented chemoinformatician to work on methods for the annotation, searching and visualization of toxicologically relevant data. You will develop pipelines and tools to enable the better prediction and assessment of the toxicity of pharmaceutical and environmental chemicals.

Closing Date 19th May 2019
More details here

(2) A protein computational scientist to develop, assess and validate methods for quantifying target tractability, with the goal of incorporating such methodologies into the Open Targets informatics platform. This exciting work will focus on developing new methods based on protein structure and sequence.

Closing Date 26th May 2019
More details here

Don't delay, apply today.

Thursday, 28 March 2019

ChEMBL 25 and new web interface released

We are pleased to announce the release of ChEMBL 25 and our new web interface. This version of the database, prepared on 10/12/2018, contains:

  • 2,335,417 compound records
  • 1,879,206 compounds (of which 1,870,461 have mol files)
  • 15,504,603 activities
  • 1,125,387 assays
  • 12,482 targets
  • 72,271 documents

Data can be downloaded from the ChEMBL ftp site:

Please see ChEMBL_25 release notes for full details of all changes in this release:


# Deposited Data Sets:

Kuster Lab Chemical Proteomics Drug Profiling (src_id = 48, Document ChEMBL_ID = CHEMBL3991601):
Data have been included from the publication: The target landscape of clinical kinase drugs. Klaeger S, Heinzlmeir S and Wilhelm M et al (2017), Science, 358-6367 (

# In Vivo Assay Classification:

A classification scheme has been created for in vivo assays. This is stored in the ASSAY_CLASSIFICATION table in the database schema and consists of a three-level classification. Level 1 corresponds to the top level of the ATC classification, i.e., anatomical system/therapeutic area (e.g., CARDIOVASCULAR SYSTEM, MUSCULO-SKELETAL SYSTEM, NERVOUS SYSTEM). Level 2 provides a more fine-grained classification of the phenotype or biological process being studied (e.g., Learning and Memory, Anti-Obesity Activity, Gastric Function). Level 3 represents the specific in vivo assay being performed (e.g., Laser Induced Thrombosis, Hypoxia Tolerance Test in Rats, Paw Edema Test) and is assigned a specific ASSAY_CLASS_ID.

Individual in vivo assays in the ChEMBL ASSAYS table are mapped to reference in vivo assays in the ASSAY_CLASSIFICATION table via the ASSAY_CLASS_MAP table. More information about the classification scheme is available in the following publication:

The assay classification is available via web services and will be included in the ChEMBL web interface in the near future.

# Updated Data Sets:

  • Scientific Literature
  • Patent Bioactivity Data
  • BindingDB Database (corrections to compound structures)


# Web Interface:

The new ChEMBL web interface is now live at (this replaces the previous beta version). The old ChEMBL web interface will be retired before the ChEMBL_26 release, but is available on the following URL until then:

The new interface provides richer search and filtering capabilities. Documentation regarding this new functionality and frequently asked questions are available on our help pages:

# Changes to Web Services:

The Assay web service has been updated to include both assay_parameters and the in vivo assay classification for an assay (where applicable):

A separate endpoint has also been created for the in vivo assay classification:

The Activity web service has been updated to include activity_properties. The 'published_type', 'published_relation', 'published_value' and 'published_units' fields have also been renamed to 'type', 'relation', 'value' and 'units':
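
For anyone with scripts that still expect the old field names, the renaming above can be sketched as a small mapping. This is only an illustration: the helper name `upgrade_activity_record` and the example record are our own invention and not part of the web services, and a real response contains many more fields.

```python
# Old Activity field names and their replacements, as listed above.
RENAMED_FIELDS = {
    "published_type": "type",
    "published_relation": "relation",
    "published_value": "value",
    "published_units": "units",
}


def upgrade_activity_record(record):
    """Return a copy of an activity record that uses the new field names.

    Hypothetical helper for illustration; it is not part of any ChEMBL client.
    """
    # Copy every field that was not renamed.
    upgraded = {k: v for k, v in record.items() if k not in RENAMED_FIELDS}
    # Move the old fields to their new names without overwriting a value
    # that is already stored under the new name.
    for old_name, new_name in RENAMED_FIELDS.items():
        if old_name in record:
            upgraded.setdefault(new_name, record[old_name])
    return upgraded


# Made-up record in the old style (values are illustrative only).
old_record = {
    "activity_id": 1,
    "published_type": "IC50",
    "published_relation": "=",
    "published_value": "10",
    "published_units": "nM",
}
new_record = upgrade_activity_record(old_record)
print(new_record)
```

Records that already use the new names pass through unchanged, so the helper can be applied to both old and new responses.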

A new endpoint has been created to retrieve supplementary data associated with an activity measurement (or list of measurements):


# Tables Added:

ASSAY_CLASSIFICATION:

Classification scheme for phenotypic assays e.g., by therapeutic area, phenotype/process and assay type. Can be used to find standard assays for a particular disease area or phenotype e.g., anti-obesity assays. Currently data are available only for in vivo efficacy assays.

COLUMN_NAME      DATA_TYPE        COMMENT
ASSAY_CLASS_ID   NUMBER(9,0)      Primary key
L1               VARCHAR2(100)    High level classification e.g., by anatomical/therapeutic area
L2               VARCHAR2(100)    Mid-level classification e.g., by phenotype/biological process
L3               VARCHAR2(1000)   Fine-grained classification e.g., by assay type
CLASS_TYPE       VARCHAR2(50)     The type of assay being classified e.g., in vivo efficacy
SOURCE           VARCHAR2(50)     Source from which the assay class was obtained

ASSAY_CLASS_MAP:

Mapping table linking assays to classes in the ASSAY_CLASSIFICATION table.

COLUMN_NAME      DATA_TYPE        COMMENT
ASSAY_ID         NUMBER           assay_id is the foreign key that maps to the 'assays' table
ASSAY_CLASS_ID   NUMBER           assay_class_id is the foreign key that maps to the 'assay_classification' table
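
To show how the two new tables fit together, here is a minimal sketch that joins them to find the assays recorded for a given phenotype. It runs against a tiny in-memory SQLite database rather than a real ChEMBL download. The tables are heavily simplified, the `assays` table is reduced to two columns, and all rows (including the Level 1 to Level 3 pairings) are made up for illustration. The function name `assays_for_phenotype` is our own.

```python
import sqlite3

# Simplified, in-memory stand-in for the three tables involved.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE assays (
        assay_id INTEGER PRIMARY KEY,
        description TEXT
    );
    CREATE TABLE assay_classification (
        assay_class_id INTEGER PRIMARY KEY,
        l1 TEXT, l2 TEXT, l3 TEXT, class_type TEXT, source TEXT
    );
    CREATE TABLE assay_class_map (
        assay_id INTEGER REFERENCES assays (assay_id),
        assay_class_id INTEGER REFERENCES assay_classification (assay_class_id)
    );
    """
)

# Made-up rows: the classes and assays below are illustrative only.
conn.executemany(
    "INSERT INTO assay_classification VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, "NERVOUS SYSTEM", "Learning and Memory",
         "Hypothetical maze test", "In vivo efficacy", "example"),
        (2, "ALIMENTARY TRACT AND METABOLISM", "Anti-Obesity Activity",
         "Hypothetical diet-induced obesity test", "In vivo efficacy", "example"),
    ],
)
conn.executemany(
    "INSERT INTO assays VALUES (?, ?)",
    [(10, "Toy memory assay in mice"), (20, "Toy body-weight assay in rats")],
)
conn.executemany("INSERT INTO assay_class_map VALUES (?, ?)", [(10, 1), (20, 2)])


def assays_for_phenotype(connection, l2):
    """Return (assay_id, description, l1, l3) for assays in a Level 2 class."""
    query = """
        SELECT a.assay_id, a.description, c.l1, c.l3
        FROM assays a
        JOIN assay_class_map m ON m.assay_id = a.assay_id
        JOIN assay_classification c ON c.assay_class_id = m.assay_class_id
        WHERE c.l2 = ?
        ORDER BY a.assay_id
    """
    return connection.execute(query, (l2,)).fetchall()


for row in assays_for_phenotype(conn, "Anti-Obesity Activity"):
    print(row)
```

The same two joins should carry over to a real ChEMBL database, although the column lists of the real tables are longer than in this toy version.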

# Columns Removed:

PUBLISHED_TYPE (DEPRECATED in ChEMBL_24, now removed, replaced by TYPE)
PUBLISHED_VALUE (DEPRECATED in ChEMBL_24, now removed, replaced by VALUE)
PUBLISHED_UNITS (DEPRECATED in ChEMBL_24, now removed, replaced by UNITS)

# Funding Acknowledgements:

Work contributing to ChEMBL_25 was funded by the Wellcome Trust, EMBL Member States, Open Targets, National Institutes of Health (NIH) Common Fund, EU Innovative Medicines Initiative (IMI) and EU Framework 7 programmes. Please see for more details.

The ChEMBL Team

Tuesday, 19 February 2019

ChEMBL is 10 years old in 2019!

In 2019 we celebrate the 10th anniversary of the first public release of the ChEMBL database. To recognise this important landmark we are organising a one-day symposium to celebrate the work achieved by ChEMBL during its first ten years, and look forward to its future.
Save the date - Tuesday 8th October 2019

The symposium will be held on Tuesday 8th October in the Francis Crick Auditorium on the Wellcome Genome Campus, Hinxton, Cambridge, UK. A series of talks from invited speakers will be followed by a celebratory birthday cake and drinks reception. During the breaks, the poster session will be a great opportunity to catch up with other users of the ChEMBL database and chat to colleagues, co-workers and others to find out more about how the database is being used.

For the programme of invited talks, and more information on how to register, see

Thursday, 7 February 2019

Target prediction, QSAR and conformal prediction 

You know that in the ChEMBL group we love to play with the data we collect! Back in April 2014, we started to work on a target prediction tool. Wow, that was almost 5 years ago! Since then, we have continued to update the tool for each new ChEMBL release, providing you with the actual models and, on the ChEMBL website, the prediction results for the drug molecules. The good news is that these target predictions are not dead and a successor is on its way!

First, we would like to introduce you to some closely related work. You may have heard about conformal prediction (CP). If not, it is a machine learning framework developed to associate confidence with predictions. I personally consider this a requirement for decision making. Basically, you train a model as you would in QSAR, but then you first predict a so-called calibration set, for which you know the actual values. For each of these observations you obtain two probabilities: one for the active and one for the inactive class (in a typical classification scheme). Now that you have this information, each time you predict a new compound you compare its probabilities to those of the calibration set (the non-conformity scores, as they are called) and you derive a p-value for each class. Based on your predefined significance level, the compound can be assigned to different categories: only active or only inactive, but also both active and inactive, or neither of them. I am sure you can start to see the added value of CP here!
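
To make the procedure a little more concrete, here is a minimal sketch of the classification case in plain Python. It rests on several assumptions of ours: the non-conformity score is taken as one minus the predicted probability of the class, the p-values are computed per class (the class-conditional, or "Mondrian", flavour), and the calibration probabilities are invented numbers standing in for the output of a trained model. It is not the code used for the models described in this post.

```python
def nonconformity(prob_of_class):
    """Assumed score: the lower the class probability, the stranger the example."""
    return 1.0 - prob_of_class


def p_value(calibration_scores, test_score):
    """Share of calibration examples at least as non-conforming as the test one."""
    at_least_as_strange = sum(1 for s in calibration_scores if s >= test_score)
    return (at_least_as_strange + 1) / (len(calibration_scores) + 1)


def conformal_predict(calibration, test_probs, significance):
    """Return (p_values, prediction_set) for one new compound.

    calibration: list of (predicted probabilities, true label) pairs.
    test_probs:  predicted probabilities for the new compound.
    """
    p_values = {}
    for label in test_probs:
        # Only calibration compounds whose true class is `label` are used.
        scores = [
            nonconformity(probs[label])
            for probs, true_label in calibration
            if true_label == label
        ]
        p_values[label] = p_value(scores, nonconformity(test_probs[label]))
    # Keep every class that cannot be rejected at the chosen significance level.
    prediction_set = {label for label, p in p_values.items() if p > significance}
    return p_values, prediction_set


# Invented calibration set: model probabilities plus the known class.
calibration = [
    ({"active": 0.90, "inactive": 0.10}, "active"),
    ({"active": 0.80, "inactive": 0.20}, "active"),
    ({"active": 0.70, "inactive": 0.30}, "active"),
    ({"active": 0.60, "inactive": 0.40}, "active"),
    ({"active": 0.35, "inactive": 0.65}, "active"),
    ({"active": 0.10, "inactive": 0.90}, "inactive"),
    ({"active": 0.20, "inactive": 0.80}, "inactive"),
    ({"active": 0.30, "inactive": 0.70}, "inactive"),
    ({"active": 0.40, "inactive": 0.60}, "inactive"),
    ({"active": 0.65, "inactive": 0.35}, "inactive"),
]

confident = {"active": 0.85, "inactive": 0.15}
borderline = {"active": 0.50, "inactive": 0.50}

print(conformal_predict(calibration, confident, significance=0.2))
print(conformal_predict(calibration, borderline, significance=0.2))
print(conformal_predict(calibration, borderline, significance=0.4))
```

With these toy numbers the confident compound is assigned to the active class only, while the borderline compound falls into both classes at a significance level of 0.2 and into neither at 0.4, which illustrates the categories described above.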

Here I have briefly described how it works for classification models, but CP can also be applied to regression models. If you want to know more about conformal prediction, I recommend reading this book and also this very nice example of the application in drug discovery. Having learnt how to build conformal predictors, we were intrigued to know how well they perform against traditional QSAR models with our ChEMBL data!

With this in mind, we decided to build a panel of models using a substantial data set from ChEMBL. With our new protocol, we were able to build models for 788 targets (550 of them human targets). For the descriptors we used RDKit Morgan fingerprints (2048 bits and radius 2) and 6 physicochemical descriptors. For the machine learning part we used the good old Random Forests as implemented in Scikit-learn version 0.19. For the QSAR models, this is all that is needed, but for CP you need a framework, and this came from the very nice library developed by Henrik Linusson.

The next part consisted of training the models and checking their internal performance, but we went a bit further: with our models trained on ChEMBL_23 data, we decided it would be interesting to see how they perform on new data in ChEMBL_24, in a so-called temporal validation. All the details, results and conclusions are presented in the recently accepted article!
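
The idea of a temporal validation can be sketched in a few lines: everything known up to one release is used for training, and only data that first appears in a later release is used for testing. The field name `first_release`, the helper `temporal_split` and the records below are hypothetical and only serve to illustrate the split. The actual protocol is the one described in the article.

```python
def temporal_split(records, cutoff_release):
    """Split records into a training set (up to and including the cutoff
    release) and a temporal test set (first seen after the cutoff)."""
    train = [r for r in records if r["first_release"] <= cutoff_release]
    test = [r for r in records if r["first_release"] > cutoff_release]
    return train, test


# Made-up activity records; `first_release` is an assumed field holding the
# ChEMBL release in which the data point first appeared.
records = [
    {"compound": "cpd_a", "label": "active", "first_release": 22},
    {"compound": "cpd_b", "label": "inactive", "first_release": 23},
    {"compound": "cpd_c", "label": "active", "first_release": 24},
    {"compound": "cpd_d", "label": "inactive", "first_release": 24},
]

# Train on everything available in ChEMBL_23, test on what is new in ChEMBL_24.
train, test = temporal_split(records, cutoff_release=23)
print(len(train), "training records;", len(test), "temporal test records")
```

Unlike a random split, this keeps the test compounds strictly newer than the training data, which is closer to how a model is used in practice.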

The dataset for each target is already available here and you can find the models ready to use there.

Feel free to take a look and to share your opinion in the comments.

Now, you remember that I started this post by mentioning our good old target predictors. So does this mean a new generation of ChEMBL models using conformal prediction is ready to be launched for our users? Well, unfortunately not yet, but stay tuned!