ChEMBL Resources


Tuesday, 11 February 2020

cbl_migrator is now open source!

cbl_migrator is the Python tool we developed to migrate the ChEMBL database from our primary Oracle instance to PostgreSQL, MySQL and SQLite. We first developed it to generate our dumps for these RDBMSs, but we also recently started using it to populate the new PostgreSQL instances serving our API and web interface.

It is built on top of the great SQLAlchemy library and its source code is now available in our GitHub.
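As a rough illustration of the kind of table-by-table copy that SQLAlchemy makes possible, the sketch below reflects the schema of a source database and replays it, together with its rows, into a destination. It is not cbl_migrator's actual code or API: the `copy_tables` helper and the toy `molecule` table are hypothetical, and two in-memory SQLite databases stand in for Oracle and PostgreSQL.

```python
from sqlalchemy import MetaData, create_engine, text


def copy_tables(src_engine, dst_engine):
    """Copy every table (schema and rows) from src_engine to dst_engine.

    Hypothetical helper for illustration only; a real migration also has to
    deal with type mapping, constraints, indexes and batching.
    """
    metadata = MetaData()
    metadata.reflect(bind=src_engine)   # read the table definitions from the source
    metadata.create_all(dst_engine)     # recreate them in the destination

    copied = {}
    with src_engine.connect() as src, dst_engine.begin() as dst:
        # sorted_tables orders tables so that referenced tables come first
        for table in metadata.sorted_tables:
            rows = [dict(row) for row in src.execute(table.select()).mappings()]
            if rows:
                dst.execute(table.insert(), rows)
            copied[table.name] = len(rows)
    return copied


# Toy source database (stand-in for the primary Oracle instance)
source = create_engine("sqlite://")
with source.begin() as conn:
    conn.execute(text("CREATE TABLE molecule (id INTEGER PRIMARY KEY, name TEXT)"))
    conn.execute(text("INSERT INTO molecule (id, name) VALUES (1, 'aspirin')"))
    conn.execute(text("INSERT INTO molecule (id, name) VALUES (2, 'imatinib')"))

# Toy destination database (stand-in for PostgreSQL, MySQL or SQLite)
destination = create_engine("sqlite://")
copied = copy_tables(source, destination)
```

Because the schema is reflected rather than hard-coded, the same function works for any set of tables the source engine exposes.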

Thursday, 30 January 2020

New ChEMBL ligand-based target predictions docker image available

One year ago we published a new version of our target prediction models and since then we've been working on its implementation for the upcoming ChEMBL 26 release.

What did we do?

First of all, we re-trained the models with the LightGBM library instead of scikit-learn. By doing this and tuning the parameters a bit, our prediction timing improved by two orders of magnitude while keeping comparable predictive power. Having quicker models allowed us to easily implement a simple web service providing real-time predictions.

Since we are currently migrating to a more sustainable Kubernetes infrastructure, it made sense to write the small target prediction web service directly as a cloud-native app. We then decided to give OpenFaaS a try as a platform to deploy machine learning models.

OpenFaaS is a framework for building serverless functions with Docker and Kubernetes. It provides templates for deploying functions as REST endpoints in many different programming languages (Python, Node, Java, Ruby, Go...).

The source code of our target predictions OpenFaaS function is now available in our GitHub repository. A Docker image with ready-to-use models trained on ChEMBL 25 is also available here.

Does this mean that you won't be able to use the models without a Kubernetes/OpenFaaS installation? No way! It is also easy to start an instance on your local machine:

docker run -p 8080:8080 chembl/mcp:25
# in a different shell
curl -X POST -H 'Accept: */*' -H 'Content-Type: application/json' -d '{"smiles": "CC(=O)Oc1ccccc1C(=O)O"}'
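The same request can be sent from Python with only the standard library. This is a minimal sketch under two assumptions that the post does not state: that the container is reachable at `http://127.0.0.1:8080` (matching the `-p 8080:8080` port mapping above) and that the service answers with JSON. The helper names `build_request` and `predict` are made up for this example.

```python
import json
import urllib.request

# Assumption: local container published on port 8080, as in the docker command above
SERVICE_URL = "http://127.0.0.1:8080"


def build_request(smiles, url=SERVICE_URL):
    """Build the POST request carrying the query molecule as JSON."""
    payload = json.dumps({"smiles": smiles}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Accept": "*/*", "Content-Type": "application/json"},
        method="POST",
    )


def predict(smiles, url=SERVICE_URL, timeout=600):
    """Send the request and decode the reply (assumed to be JSON)."""
    request = build_request(smiles, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Aspirin, the same molecule used in the curl example
request = build_request("CC(=O)Oc1ccccc1C(=O)O")
```

Calling `predict("CC(=O)Oc1ccccc1C(=O)O")` then performs the actual call once the container is up; the generous timeout leaves room for the models to load.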

Bear in mind that the service needs to load the models into memory, so it may take a few minutes until it returns predictions. The service only returns predictions from models with a CCR ((sensitivity + specificity) / 2) >= 0.85.
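To make the CCR threshold concrete, here is a small sketch that computes the CCR from confusion-matrix counts and keeps only the models that reach 0.85. The model names and counts are invented toy values, not ChEMBL results, and `ccr` and `select_models` are hypothetical helper names.

```python
def ccr(tp, fn, tn, fp):
    """Correct classification rate: mean of sensitivity and specificity."""
    sensitivity = tp / (tp + fn)   # fraction of actives predicted active
    specificity = tn / (tn + fp)   # fraction of inactives predicted inactive
    return (sensitivity + specificity) / 2


def select_models(confusion_counts, threshold=0.85):
    """Return the names of the models whose CCR reaches the threshold."""
    return [
        name
        for name, counts in confusion_counts.items()
        if ccr(*counts) >= threshold
    ]


# Toy counts per model: (tp, fn, tn, fp)
toy_models = {
    "target_A": (90, 10, 90, 10),   # sensitivity 0.90, specificity 0.90
    "target_B": (50, 50, 75, 25),   # sensitivity 0.50, specificity 0.75
}
selected = select_models(toy_models)
```

With these toy numbers only `target_A` passes, since its CCR is 0.9 while `target_B` reaches 0.625.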

Thursday, 19 December 2019

Merry Christmas and ChEMBL_26 coming soon!

The ChEMBL team will be heading off for Christmas soon, but just before we do, we wanted to share some updates...

First, thank you to all of our many users and collaborators. We wish you all a happy holiday season and a productive 2020!

Thanks also to everyone who helped us celebrate 10 years of ChEMBL at our symposium in October. For those who were unable to make it on the day, many of the talks and posters are available here.

Over the last few months we've been busy working on ChEMBL_26, which we plan to release early next year. There will be some important changes in this release:

We are now using RDKit for almost all of our compound-related processing. For the first time in ChEMBL_26, this will include compound standardisation (look out for more info on this in the new year), salt-stripping, generation of canonical SMILES, structural alerts, substructure searches and similarity searches (via FPSim2). All molecules have therefore been reprocessed and you may notice some differences compared with previous releases.

We have also switched our pKa calculations to use ChemAxon software. The compound properties ACD_MOST_APKA, ACD_MOST_BPKA, ACD_LOGP and ACD_LOGD will now therefore become CX_MOST_APKA, CX_MOST_BPKA, CX_LOGP and CX_LOGD.
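If you have scripts that read these properties by name, a simple mapping is enough to adapt them. The sketch below renames the keys of a record held as a Python dictionary; the record itself is toy data, and `rename_properties` is a hypothetical helper.

```python
# Old ACD-based property names and their ChemAxon-based replacements
RENAMED_PROPERTIES = {
    "ACD_MOST_APKA": "CX_MOST_APKA",
    "ACD_MOST_BPKA": "CX_MOST_BPKA",
    "ACD_LOGP": "CX_LOGP",
    "ACD_LOGD": "CX_LOGD",
}


def rename_properties(record):
    """Return a copy of the record with old property names replaced."""
    return {RENAMED_PROPERTIES.get(key, key): value for key, value in record.items()}


# Toy record with invented values, for illustration only
old_record = {"MOLREGNO": 1, "ACD_LOGP": 1.2, "ACD_MOST_APKA": 3.5}
new_record = rename_properties(old_record)
```

Keys that are not in the mapping are passed through unchanged. Keep in mind that the values are now calculated with different software, so they may differ from earlier releases even though the properties are equivalent.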

Target predictions are also now being generated by a new method, using conformal prediction models.

Finally, our old ChEMBL interface will be switched off at the end of the year, so if you haven't made the jump yet, now is the time! We are still making improvements to the new interface and adding new features, so if you have any suggestions or feedback please do let us know.

The ChEMBL Team

Friday, 13 December 2019

Mechanism of Action and Drug Indication data on the interface.

Two new 'Browse' pages have been added to the interface: Browse Drug Mechanisms and Browse Drug Indications. Users can access these two pages directly to explore all the data, or land on them from drugs, compounds and targets in ChEMBL.

Accessing all the data from the main page

The 'circles' visualisation on the main page shows a summary of the entities in ChEMBL. Circles for Drug Mechanisms of Action and Drug Indications have been added. By clicking on the circles, you will be taken to a page that allows you to explore the corresponding entity.
Visualisation that summarises the entities in ChEMBL; Drug Mechanisms of Action and Drug Indications are now included.

The Browse Drug Mechanisms and Browse Drug Indications pages allow you to use filters, link to other entities, and download the data in the same way as the other 'Browse' pages.

All Drug Mechanism data.
All Drug Indication data.

Accessing Drug Indication and Drug Mechanism data related to other entities

You can now explore the Drug Indication and Drug Mechanism data in relation to the following entities:

From Browse Drug Mechanisms you can:
  • Browse related Drugs
  • Browse related Compounds
  • Browse related Targets

From Browse Drug Indications you can:
  • Browse related Drugs
  • Browse related Compounds

From Browse Drugs you can:
  • Browse related Drug Mechanisms
  • Browse related Drug Indications
  • Browse related Activities

Example A:

1. Go to the Browse Drug Mechanisms page. Find all drugs whose mechanism is neurokinin receptor antagonist.
Note that the data describes the mechanisms of action of 17 compounds for 3 targets.

2. Select all items and click on 'Browse Drugs'. A new tab will open showing the drugs for the targets selected in step 1.

3. Click on 'Browse Drug Indications' to view all annotated indications for the drugs in step 2.

Example B:

1. Go to the Browse Drug Indications page. Find all drugs whose indication is asthma. There are 175 entries with asthma as an indication.

2. Select all items and click on 'Browse Drugs'. A new tab will open showing the drugs for the indications selected in step 1.

3. Click on 'Browse Drug Mechanisms' to view all annotated mechanisms for the drugs in step 2.

Accessing Drug Indication and Drug Mechanism data from report cards

You can now go to a dedicated page from the Drug Mechanism and Drug Indication data in the report cards. For example, go to the report card for IMATINIB (CHEMBL941).

In the Drug Mechanism section you can see the data for that compound. If you click on 'Browse All', you will be directed to the Browse Drug Mechanisms page showing the data.

Drug Mechanisms section for the report card of IMATINIB (CHEMBL941)

Similarly, in the Drug Indications section you can click on 'Browse All' to be directed to a 'Browse Drug Indications' page showing all the data.
Drug Indications section for the report card of IMATINIB (CHEMBL941)

If you have any questions, please contact the ChEMBL Team support (chembl-help [at]

Tuesday, 20 August 2019

New text filter on the ChEMBL interface

A new text filter has been added to the search results and the 'Browse' pages of the interface. This filter is shown as a small search bar at the top-right of tables and card pages. It can be used as a simple and fast way to filter a set of items.

The filter appends a new query to the current query, matching the term entered against all the available non-numeric fields. It is based on the query string query of Elasticsearch, so wildcards can be used in the search box.
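As a hedged sketch of what "appending a query" can look like in the Elasticsearch query DSL, the snippet below wraps an existing query and a `query_string` clause in a `bool`/`must` query, so that both have to match. The exact request built by the interface is not shown in this post; the helper name `add_text_filter` and the `max_phase` field used in the example are assumptions.

```python
def add_text_filter(current_query, term):
    """Combine the current query with a query_string clause for the term.

    Both clauses sit under bool/must, so a document has to satisfy the
    existing filters and also match the free-text term.
    """
    text_clause = {"query_string": {"query": term}}
    return {"bool": {"must": [current_query, text_clause]}}


# Hypothetical existing filter: Phase 4 drugs only
current_query = {"term": {"max_phase": 4}}

# Wildcards such as * are passed straight through in the term
filtered_query = add_text_filter(current_query, "*antibacterial*")
search_body = {"query": filtered_query}
```

Removing the text filter then simply means going back to `current_query`, which is left untouched.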

To see an example of how it works, you can follow these steps:

  • Go to the Browse Drugs page:
  • Use the filters on the left to select only Phase 4 drugs with no Rule of Five violations:
  • Enter the term '*antibacterial*' in the search box and click on the search button:

  • It will match the term on the following fields:
Parent Molecule ChEMBL ID, Synonyms, Research Codes, Applicants, USAN Stem, ATC Codes, USAN Stem Definition, USAN Stem Substem, Level 4 ATC Codes, Level 3 ATC Codes, Level 2 ATC Codes, Level 1 ATC Codes, Indication Class, Patent, Withdrawn Reason, Withdrawn Country, Withdrawn Class, Smiles.

You will see the resulting drugs on your screen:

If you show all the available columns (by clicking on 'Show/Hide Columns') you will see the matches for the term that was entered in the search box (you can use the text finder function of your browser to locate them):

By clicking on the 'clear' button, the term will be reset and the filter will be removed from the query.

The filter is available for the search results and the following pages:

If you have any questions, please contact the ChEMBL Team support (chembl-help [at]

Thursday, 18 July 2019

CuPy example for CUDA based similarity search in Python

CuPy is a really nice library developed by a Japanese startup and supported by NVIDIA that makes it easy to run CUDA code in Python using NumPy arrays as input. It also provides interoperability with Numba (a just-in-time Python compiler) and DLPack (a tensor specification used in PyTorch, the deep learning library).

CUDA is a parallel computing platform and application programming interface that allows GPUs to be used for general-purpose computing, not only for graphics. To give an idea of the level of parallelization that can be achieved with it, a not very expensive consumer GPU like the NVIDIA GTX 1080 comes with 2560 CUDA cores.

Because at ChEMBL we love anything that makes Python fast and is well integrated with NumPy, we couldn't resist giving it a try!

Let's go through an example to see how it works...

Google Colab notebook. Colab provides the option to run notebooks on a GPU, and CuPy is already installed in the default Python environment :)
You will need to upgrade PyTables to 3.5.1 in the default Python environment in order to make it work.
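For readers without a GPU at hand, the per-pair computation behind a similarity search can be written in a few lines of plain Python. This is only a CPU reference sketch and does not reproduce the notebook's CUDA kernel: it assumes Tanimoto similarity over binary fingerprints (a common choice in cheminformatics), stores each fingerprint as a Python integer, and uses tiny made-up bit patterns.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprints stored as integers."""
    common = bin(fp_a & fp_b).count("1")                      # bits set in both
    total = bin(fp_a).count("1") + bin(fp_b).count("1") - common
    return common / total if total else 0.0


def similarity_search(query, database, threshold):
    """Return (index, similarity) pairs above the threshold, best first."""
    hits = [(index, tanimoto(query, fp)) for index, fp in enumerate(database)]
    hits = [hit for hit in hits if hit[1] >= threshold]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Toy fingerprints: real ones typically have hundreds or thousands of bits
query_fp = 0b1111
database_fps = [0b1111, 0b0011, 0b110000]
results = similarity_search(query_fp, database_fps, threshold=0.5)
```

Each database fingerprint is compared with the query independently of all the others, which is what makes this kind of search a natural fit for the thousands of cores of a GPU.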

Rendered notebook

Tuesday, 14 May 2019

Multi-task neural network on ChEMBL with PyTorch 1.0 and RDKit

The use and application of multi-task neural networks is growing rapidly in cheminformatics and drug discovery. Examples can be found in the following publications:

- Deep Learning as an Opportunity in Virtual Screening
- Massively Multitask Networks for Drug Discovery
- Beyond the hype: deep neural networks outperform established methods using a ChEMBL bioactivity benchmark set

But what is a multi-task neural network? In short, it's a kind of neural network architecture that can optimise multiple classification/regression problems at the same time while taking advantage of their shared description. This blogpost gives a great overview of their architecture. All the networks in the references above implement the hard parameter sharing approach.

So, given a set of activities relating targets and molecules, we can train a single neural network as a binary multi-label classifier that will output the probability of activity/inactivity for each of the targets (tasks) for a given query molecule.
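To make hard parameter sharing more tangible, here is a framework-free sketch of the forward pass only: one hidden layer is shared by all tasks, and each task has its own output unit with a sigmoid, giving one probability per target. Weights are random and untrained, bias terms are left out for brevity, and the layer sizes and function names are arbitrary choices for this example rather than those of the notebooks.

```python
import math
import random


def init_layer(n_in, n_out, rng):
    """Return an n_out x n_in weight matrix with small random values."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]


def linear(weights, inputs):
    """Multiply a weight matrix by an input vector."""
    return [sum(w * x for w, x in zip(row, inputs)) for row in weights]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(fingerprint, shared_weights, head_weights):
    """Shared ReLU layer followed by one sigmoid output per task."""
    hidden = [max(0.0, h) for h in linear(shared_weights, fingerprint)]
    return [sigmoid(z) for z in linear(head_weights, hidden)]


rng = random.Random(0)
N_BITS, N_HIDDEN, N_TASKS = 16, 8, 3

shared = init_layer(N_BITS, N_HIDDEN, rng)     # parameters shared by all tasks
heads = init_layer(N_HIDDEN, N_TASKS, rng)     # one row of weights per task
fingerprint = [rng.randint(0, 1) for _ in range(N_BITS)]

probabilities = forward(fingerprint, shared, heads)
```

Training such a network means updating `shared` with the errors of every task at once, while each row of `heads` only learns from its own task.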

PyTorch is one of the most popular open source AI libraries at present. It's getting a lot of traction in research environments, it's deeply integrated with the NumPy ecosystem and it also implements a dynamic graph approach making it easier to debug.

We have some interesting references, we have data in ChEMBL, we have PyTorch and RDKit... what are we waiting for?

First of all, we'll need to extract the data from ChEMBL and format it for our purpose. The following notebook explains step by step how to do it. The output will be an H5 file that you can also download from here in case you want to go directly to the network training phase.

Notebook to extract the data

Nice! We have the data, let's go then through the main notebook and train a model!

Notebook to train the model

This was a simple example. We hope you enjoyed it and will be inspired to experiment with deeper architectures, skip connections, different learning rate strategies, more epochs, early stopping... and so on!

Notebooks also available in GitHub


Wednesday, 17 April 2019

Job opportunities in the ChEMBL Group

We have two exciting opportunities for scientists to come and work with the ChEMBL team at the Wellcome Genome Campus in Hinxton near Cambridge.

If you've used ChEMBL in the past, perhaps now is the chance to come and shape its future. Even if you haven't, this is a great place to work, and in both positions you will collaborate not only with the people developing the ChEMBL resources but also with our collaborators here at Hinxton and around Europe. These include the Open Targets project and EU-funded toxicology projects such as EU-ToxRisk and eTRANSAFE.

We are looking for:

(1) A talented chemoinformatician to work on methods for the annotation, searching and visualization of toxicologically relevant data. You will develop pipelines and tools to enable the better prediction and assessment of the toxicity of pharmaceutical and environmental chemicals.

Closing Date 19th May 2019
More details here

(2) A protein computational scientist to develop, assess and validate methods for quantifying target tractability, with the goal of incorporating such methodologies into the Open Targets informatics platform. This exciting work will focus on developing new methods based on protein structure and sequence.

Closing Date 26th May 2019
More details here

Don't delay, apply today.

Thursday, 28 March 2019

ChEMBL 25 and new web interface released

We are pleased to announce the release of ChEMBL 25 and our new web interface. This version of the database, prepared on 10/12/2018, contains:

  • 2,335,417 compound records
  • 1,879,206 compounds (of which 1,870,461 have mol files)
  • 15,504,603 activities
  • 1,125,387 assays
  • 12,482 targets
  • 72,271 documents

Data can be downloaded from the ChEMBL ftp site:

Please see ChEMBL_25 release notes for full details of all changes in this release:


# Deposited Data Sets:

Kuster Lab Chemical Proteomics Drug Profiling (src_id = 48, Document ChEMBL_ID = CHEMBL3991601):
Data have been included from the publication: The target landscape of clinical kinase drugs. Klaeger S, Heinzlmeir S and Wilhelm M et al (2017), Science, 358-6367 (

# In Vivo Assay Classification:

A classification scheme has been created for in vivo assays. This is stored in the ASSAY_CLASSIFICATION table in the database schema and consists of a three-level classification. Level 1 corresponds to the top levels of the ATC classification, i.e., anatomical system/therapeutic area (e.g., CARDIOVASCULAR SYSTEM, MUSCULO-SKELETAL SYSTEM, NERVOUS SYSTEM). Level 2 provides a more fine-grained classification of the phenotype or biological process being studied (e.g., Learning and Memory, Anti-Obesity Activity, Gastric Function). Level 3 represents the specific in vivo assay being performed (e.g., Laser Induced Thrombosis, Hypoxia Tolerance Test in Rats, Paw Edema Test) and is assigned a specific ASSAY_CLASS_ID. Individual in vivo assays in the ChEMBL ASSAYS table are mapped to reference in vivo assays in the ASSAY_CLASSIFICATION table via the ASSAY_CLASS_MAP table. More information about the classification scheme is available in the following publication: The assay classification is available via web services and will be included in the ChEMBL web interface in the near future.

# Updated Data Sets:
Scientific Literature
Patent Bioactivity Data
BindingDB Database (corrections to compound structures)


# Web Interface:

The new ChEMBL web interface is now live at (this replaces the previous beta version). The old ChEMBL web interface will be retired before the ChEMBL_26 release, but is available on the following URL until then: The new interface provides richer search and filtering capabilities. Documentation regarding this new functionality and frequently asked questions are available on our help pages:

# Changes to Web Services:

The Assay web service has been updated to include both assay_parameters and the in vivo assay classification for an assay (where applicable):

A separate endpoint has also been created for the in vivo assay classification:

The Activity web service has been updated to include activity_properties. The 'published_type', 'published_relation', 'published_value' and 'published_units' fields have also been renamed to 'type', 'relation', 'value' and 'units':

A new endpoint has been created to retrieve supplementary data associated with an activity measurement (or list of measurements):


# Tables Added:

ASSAY_CLASSIFICATION:
Classification scheme for phenotypic assays e.g., by therapeutic area, phenotype/process and assay type. Can be used to find standard assays for a particular disease area or phenotype e.g., anti-obesity assays. Currently data are available only for in vivo efficacy assays.

COLUMN_NAME      DATA_TYPE        COMMENT
ASSAY_CLASS_ID   NUMBER(9,0)      Primary key
L1               VARCHAR2(100)    High level classification e.g., by anatomical/therapeutic area
L2               VARCHAR2(100)    Mid-level classification e.g., by phenotype/biological process
L3               VARCHAR2(1000)   Fine-grained classification e.g., by assay type
CLASS_TYPE       VARCHAR2(50)     The type of assay being classified e.g., in vivo efficacy
SOURCE           VARCHAR2(50)     Source from which the assay class was obtained

ASSAY_CLASS_MAP:
Mapping table linking assays to classes in the ASSAY_CLASSIFICATION table.

COLUMN_NAME      DATA_TYPE        COMMENT
ASSAY_ID         NUMBER           assay_id is the foreign key that maps to the 'assays' table
ASSAY_CLASS_ID   NUMBER           assay_class_id is the foreign key that maps to the 'assay_classification' table
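
The sketch below shows how the mapping table ties assays to their classification, using Python's built-in sqlite3 module. It is an illustration under stated assumptions: the Oracle column types are replaced by SQLite ones, the ASSAYS table is reduced to a minimal stand-in with two columns, and every row is invented toy data (the class labels are borrowed from the examples in these notes, but their combination is not taken from ChEMBL).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Minimal stand-in for the real ASSAYS table
    CREATE TABLE assays (
        assay_id    INTEGER PRIMARY KEY,
        description TEXT
    );
    CREATE TABLE assay_classification (
        assay_class_id INTEGER PRIMARY KEY,
        l1             TEXT,
        l2             TEXT,
        l3             TEXT,
        class_type     TEXT,
        source         TEXT
    );
    CREATE TABLE assay_class_map (
        assay_id       INTEGER REFERENCES assays (assay_id),
        assay_class_id INTEGER REFERENCES assay_classification (assay_class_id)
    );
""")

# Toy rows, invented for illustration only
conn.executemany("INSERT INTO assays VALUES (?, ?)", [
    (1, "toy assay one"),
    (2, "toy assay two"),
    (3, "toy assay three"),
])
conn.executemany("INSERT INTO assay_classification VALUES (?, ?, ?, ?, ?, ?)", [
    (10, "NERVOUS SYSTEM", "Learning and Memory", "Toy Memory Test", "In vivo efficacy", "toy"),
    (20, "CARDIOVASCULAR SYSTEM", "Toy Phenotype", "Laser Induced Thrombosis", "In vivo efficacy", "toy"),
])
conn.executemany("INSERT INTO assay_class_map VALUES (?, ?)", [(1, 10), (2, 20), (3, 10)])


def assays_for_phenotype(connection, l2):
    """Return (assay_id, description, l3) for assays mapped to a level 2 class."""
    return connection.execute(
        """
        SELECT a.assay_id, a.description, c.l3
        FROM assays a
        JOIN assay_class_map m ON m.assay_id = a.assay_id
        JOIN assay_classification c ON c.assay_class_id = m.assay_class_id
        WHERE c.l2 = ?
        ORDER BY a.assay_id
        """,
        (l2,),
    ).fetchall()


memory_assays = assays_for_phenotype(conn, "Learning and Memory")
```

The same join, filtered on `l1` or `l3` instead of `l2`, gives the assays for a whole therapeutic area or for one specific reference assay.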

# Columns Removed:

PUBLISHED_TYPE (DEPRECATED in ChEMBL_24, now removed, replaced by TYPE)
PUBLISHED_VALUE (DEPRECATED in ChEMBL_24, now removed, replaced by VALUE)
PUBLISHED_UNITS (DEPRECATED in ChEMBL_24, now removed, replaced by UNITS)

Funding Acknowledgements:
Work contributing to ChEMBL_25 was funded by the Wellcome Trust, EMBL Member States, Open Targets, National Institutes of Health (NIH) Common Fund, EU Innovative Medicines Initiative (IMI) and EU Framework 7 programmes. Please see for more details.

The ChEMBL Team