ChEMBL 25 and new web interface released

We are pleased to announce the release of ChEMBL 25 and our new web interface. This version of the database, prepared on 10/12/2018, contains:

  • 2,335,417 compound records
  • 1,879,206 compounds (of which 1,870,461 have mol files)
  • 15,504,603 activities
  • 1,125,387 assays
  • 12,482 targets
  • 72,271 documents

Data can be downloaded from the ChEMBL ftp site:

Please see ChEMBL_25 release notes for full details of all changes in this release:


# Deposited Data Sets:

Kuster Lab Chemical Proteomics Drug Profiling (src_id = 48, Document ChEMBL_ID = CHEMBL3991601):
Data have been included from the publication: The target landscape of clinical kinase drugs. Klaeger S, Heinzlmeir S and Wilhelm M et al. (2017), Science, 358-6367.

# In Vivo Assay Classification:

A classification scheme has been created for in vivo assays. It is stored in the ASSAY_CLASSIFICATION table in the database schema and has three levels. Level 1 corresponds to the top level of the ATC classification, i.e., anatomical system/therapeutic area (e.g., CARDIOVASCULAR SYSTEM, MUSCULO-SKELETAL SYSTEM, NERVOUS SYSTEM). Level 2 provides a more fine-grained classification of the phenotype or biological process being studied (e.g., Learning and Memory, Anti-Obesity Activity, Gastric Function). Level 3 represents the specific in vivo assay being performed (e.g., Laser Induced Thrombosis, Hypoxia Tolerance Test in Rats, Paw Edema Test) and is assigned a specific ASSAY_CLASS_ID. Individual in vivo assays in the ChEMBL ASSAYS table are mapped to reference in vivo assays in the ASSAY_CLASSIFICATION table via the ASSAY_CLASS_MAP table. More information about the classification scheme is available in the following publication:

The assay classification is available via web services and will be included in the ChEMBL web interface in the near future.

# Updated Data Sets:
Scientific Literature
Patent Bioactivity Data
BindingDB Database (corrections to compound structures)


# Web Interface:

The new ChEMBL web interface is now live at (this replaces the previous beta version). The old ChEMBL web interface will be retired before the ChEMBL_26 release, but is available at the following URL until then:

The new interface provides richer search and filtering capabilities. Documentation on this new functionality and frequently asked questions are available on our help pages:

# Changes to Web Services:

The Assay web service has been updated to include both assay_parameters and the in vivo assay classification for an assay (where applicable):

A separate endpoint has also been created for the in vivo assay classification:

The Activity web service has been updated to include activity_properties. The 'published_type', 'published_relation', 'published_value' and 'published_units' fields have also been renamed to 'type', 'relation', 'value' and 'units':
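Client code written against the old field names will need a small adjustment. The following is a minimal sketch of one way to handle it, assuming activity records arrive as plain dictionaries; the helper name `normalise_activity` and the sample record are invented for illustration and are not part of the ChEMBL web services.

```python
# Old-to-new field names, as listed in the paragraph above.
RENAMED_FIELDS = {
    "published_type": "type",
    "published_relation": "relation",
    "published_value": "value",
    "published_units": "units",
}


def normalise_activity(record):
    """Return a copy of an activity record that uses the new field names.

    Old 'published_*' keys are renamed. If a record carries both the old
    and the new key, the value stored under the new key is kept.
    """
    out = {}
    for key, value in record.items():
        new_key = RENAMED_FIELDS.get(key, key)
        if key in RENAMED_FIELDS and new_key in out:
            continue  # the new-style key was already copied; keep it
        out[new_key] = value
    return out


# Invented sample record in the old style (hypothetical values).
old_style = {
    "activity_id": 1,
    "published_type": "IC50",
    "published_relation": "=",
    "published_value": "10",
    "published_units": "nM",
}
print(normalise_activity(old_style))
```

A shim like this lets code that cached responses from the old service keep working while new responses are read without any renaming.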

A new endpoint has been created to retrieve supplementary data associated with an activity measurement (or list of measurements):
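For readers scripting against these endpoints, here is a rough sketch of building request URLs in Python. The base URL, the `.json` suffix, and the resource and filter names (`assay_class`, `l2`, `limit`) are assumptions, not details taken from this post, and should be checked against the web service documentation. The code only builds a URL string and makes no network request.

```python
from urllib.parse import urlencode

# Assumption: base URL of the ChEMBL web services (not stated in this post).
BASE_URL = "https://www.ebi.ac.uk/chembl/api/data"


def build_url(resource, **filters):
    """Build a JSON request URL for a web-service resource.

    `resource` and the filter names are passed through unchecked, so they
    must match whatever the live service actually exposes.
    """
    url = f"{BASE_URL}/{resource}.json"
    if filters:
        # Sort the filters so the same arguments always give the same URL.
        url += "?" + urlencode(sorted(filters.items()))
    return url


# Hypothetical resource and filter names, used only to show the call shape.
print(build_url("assay_class", l2="Learning and Memory", limit=5))
```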


# Tables Added:

ASSAY_CLASSIFICATION:
Classification scheme for phenotypic assays, e.g., by therapeutic area, phenotype/process and assay type. Can be used to find standard assays for a particular disease area or phenotype, e.g., anti-obesity assays. Currently, data are available only for in vivo efficacy assays.

| Column | Data type | Comment |
|---|---|---|
| ASSAY_CLASS_ID | NUMBER(9,0) | Primary key |
| L1 | VARCHAR2(100) | High-level classification, e.g., by anatomical/therapeutic area |
| L2 | VARCHAR2(100) | Mid-level classification, e.g., by phenotype/biological process |
| L3 | VARCHAR2(1000) | Fine-grained classification, e.g., by assay type |
| CLASS_TYPE | VARCHAR2(50) | The type of assay being classified, e.g., in vivo efficacy |
| SOURCE | VARCHAR2(50) | Source from which the assay class was obtained |

ASSAY_CLASS_MAP:
Mapping table linking assays to classes in the ASSAY_CLASSIFICATION table.

| Column | Data type | Comment |
|---|---|---|
| ASSAY_ID | NUMBER | Foreign key that maps to the ASSAYS table |
| ASSAY_CLASS_ID | NUMBER | Foreign key that maps to the ASSAY_CLASSIFICATION table |
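
To show how the two new tables fit together, here is a small sketch that joins ASSAYS to ASSAY_CLASSIFICATION through ASSAY_CLASS_MAP. It uses an in-memory SQLite database rather than a real ChEMBL installation. The ASSAYS table is reduced to two columns, and every row, including the pairing of L1, L2 and L3 values, is invented for illustration. The helper name `assays_for_l2` is likewise hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE assays (
    assay_id INTEGER PRIMARY KEY,
    description TEXT
);
CREATE TABLE assay_classification (
    assay_class_id INTEGER PRIMARY KEY,
    l1 TEXT, l2 TEXT, l3 TEXT,
    class_type TEXT, source TEXT
);
CREATE TABLE assay_class_map (
    assay_id INTEGER REFERENCES assays(assay_id),
    assay_class_id INTEGER REFERENCES assay_classification(assay_class_id)
);
-- Invented rows, for illustration only
INSERT INTO assays VALUES (1, 'Toy assay A'), (2, 'Toy assay B');
INSERT INTO assay_classification VALUES
    (10, 'NERVOUS SYSTEM', 'Learning and Memory', 'Toy memory test',
     'In vivo efficacy', 'toy'),
    (11, 'MUSCULO-SKELETAL SYSTEM', 'Toy phenotype', 'Paw Edema Test',
     'In vivo efficacy', 'toy');
INSERT INTO assay_class_map VALUES (1, 10), (2, 11);
""")


def assays_for_l2(conn, l2):
    """Return assays whose mapped class has the given level 2 label."""
    query = """
        SELECT a.assay_id, a.description, c.l1, c.l3
        FROM assays a
        JOIN assay_class_map m ON m.assay_id = a.assay_id
        JOIN assay_classification c ON c.assay_class_id = m.assay_class_id
        WHERE c.l2 = ?
        ORDER BY a.assay_id
    """
    return conn.execute(query, (l2,)).fetchall()


rows = assays_for_l2(conn, "Learning and Memory")
print(rows)
```

The same two joins should carry over to a full ChEMBL database, where filtering on L1 or L3 instead of L2 narrows or widens the set of assays returned.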

# Columns Removed:

ACTIVITIES:
PUBLISHED_TYPE (DEPRECATED in ChEMBL_24, now removed, replaced by TYPE)
PUBLISHED_VALUE (DEPRECATED in ChEMBL_24, now removed, replaced by VALUE)
PUBLISHED_UNITS (DEPRECATED in ChEMBL_24, now removed, replaced by UNITS)

# Funding Acknowledgements:
Work contributing to ChEMBL_25 was funded by the Wellcome Trust, EMBL Member States, Open Targets, National Institutes of Health (NIH) Common Fund, EU Innovative Medicines Initiative (IMI) and EU Framework 7 programmes. Please see for more details.

The ChEMBL Team

