Sunday, 19 March 2017

Finding Compounds in Databases using UniChem

Have you ever identified an interesting compound and wondered what else is known about it? For example, is there any bioactivity data on it in ChEMBL or PubChem? Is there any toxicity data on it (CompTox)? Then, having found interesting data on a compound, have you wondered whether it can be purchased or whether it has been patented? All this can be done using UniChem. Interested?

Come along to our webinar on 29th March at 2pm BST (3pm CEST, 9am EDT).
You will, however, need to register by emailing chembl-help. Places are limited, so please let us know as soon as possible if you register but are then unable to attend.

If you want to know more about UniChem please read on.

UniChem is a simple system we have developed to cross-reference compounds across databases, both internal and external to EMBL-EBI. Currently we have cross-references to 140 million compounds in 30 different databases. Information about the sources indexed in UniChem can be found here. UniChem is updated weekly with new compounds from these source databases.

So, for example, you can input a database identifier or an InChIKey into UniChem and see links to all the other indexed databases that have information about that compound.

If we take the drug paroxetine and search for it in UniChem, it is found in 22 databases and the UniChem webpage gives links to the paroxetine entries in those databases.

You don’t have to do this compound by compound using the web interface, though. UniChem has a comprehensive set of web services that you can use to retrieve data; alternatively, all the database files and source-to-source mapping files are available for download.
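As a rough illustration of the programmatic route, the sketch below builds lookup URLs for an InChIKey or for a source identifier. The base URL, the path layout and the helper names are assumptions made for illustration rather than a confirmed description of the UniChem API, so please check the web services documentation for the actual endpoints.

```python
from urllib.parse import quote

# Assumed base URL for the UniChem REST services (illustrative only).
UNICHEM_BASE = "https://www.ebi.ac.uk/unichem/rest"


def inchikey_lookup_url(inchikey, base=UNICHEM_BASE):
    """Build a URL asking which sources hold a structure with this InChIKey."""
    return f"{base}/inchikey/{quote(inchikey.strip(), safe='')}"


def src_compound_lookup_url(compound_id, src_id, base=UNICHEM_BASE):
    """Build a URL for looking up a compound by its identifier in one source."""
    return f"{base}/src_compound_id/{quote(compound_id.strip(), safe='')}/{int(src_id)}"


# Standard InChIKey of paroxetine.
paroxetine_url = inchikey_lookup_url("AHOUBRCZNHFOSL-YOEHRIQHSA-N")
# The compound ID and the source number (1 for ChEMBL) are assumptions here.
chembl_url = src_compound_lookup_url("CHEMBL490", 1)

print(paroxetine_url)
print(chembl_url)
```

Keeping URL construction separate from the actual HTTP request makes it easy to plug the URLs into whichever HTTP library you prefer.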

UniChem relies on the InChIKey to do the mapping between databases and this works fine if two databases have exactly the same structure for a compound.  We all know however that this isn’t always the case.  Sometimes a different salt or isotope was tested or a mistake was made in the stereocentre assignment meaning the InChIKeys no longer match.

However, don’t despair. UniChem connectivity searching can help. Because the InChI is built up in layers, it can be deconstructed and the mapping done layer by layer, so that relationships between compounds that differ by stereochemistry, isotopes, protonation state, etc. can all be identified and mapped. You can do this on single components or mixtures.
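To give a feel for the idea, here is a minimal sketch that groups InChIKeys by their first block, which hashes the molecular skeleton. This is a simplification and not UniChem's implementation: UniChem works on the InChI layers themselves, whereas this sketch only compares hashed key blocks. The function names are hypothetical.

```python
import re
from collections import defaultdict

# A standard InChIKey has three hyphen-separated blocks of 14, 10 and 1 letters.
INCHIKEY_RE = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")


def connectivity_block(inchikey):
    """Return the first block of an InChIKey, which hashes the molecular skeleton."""
    key = inchikey.strip().upper()
    if not INCHIKEY_RE.match(key):
        raise ValueError(f"not a valid InChIKey: {inchikey!r}")
    return key.split("-")[0]


def group_by_connectivity(inchikeys):
    """Group InChIKeys that share the same first (connectivity) block."""
    groups = defaultdict(list)
    for key in inchikeys:
        groups[connectivity_block(key)].append(key)
    return dict(groups)


keys = [
    "AHOUBRCZNHFOSL-YOEHRIQHSA-N",  # paroxetine with defined stereochemistry
    "AHOUBRCZNHFOSL-UHFFFAOYSA-N",  # same skeleton, no stereochemistry specified
    "BSYNRYMUTXBXSQ-UHFFFAOYSA-N",  # aspirin, an unrelated skeleton
]
groups = group_by_connectivity(keys)
print(groups)
```

The two paroxetine-like keys fall into one group even though their full InChIKeys differ, which is the kind of relationship an exact-match lookup would miss.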

Taking our paroxetine example:

We have paroxetine and a number of related compounds in ChEMBL. For example:
Maybe someone genuinely wanted to test these related compounds, or maybe they are errors (or a mixture of both). Whatever the reason, by using the UniChem connectivity searching feature we can identify any compounds that match paroxetine on the InChI connectivity layer.
The matches identified from a connectivity search starting with paroxetine can be found here:

At the webinar on 29th March we will describe how this is done in more detail and discuss some use cases.  If you are interested don’t forget to register.

If you want to read more here are links to two papers about UniChem:
Chambers, J., Davies, M., Gaulton, A., Hersey, A., Velankar, S., Petryszak, R., Hastings, J., Bellis, L., McGlinchey, S. and Overington, J.P. 
UniChem: A Unified Chemical Structure Cross-Referencing and Identifier Tracking System.
Journal of Cheminformatics 2013, 5:3 (January 2013).

Chambers, J., Davies, M., Gaulton, A., Papadatos, G., Hersey, A. and Overington, J.P.
UniChem: extension of InChI-based compound mapping to salt, connectivity and stereochemistry layers.
Journal of Cheminformatics 2014, 6:43 (September 2014).

Tuesday, 14 March 2017

Chemogenomics Analyst Wanted

We are looking to recruit a scientist to support our work for the Horizon 2020 project “Coordinated Research Infrastructures Building Enduring Life-science services” (CORBEL). The role is to facilitate scientists in their use of chemogenomics resources by enabling database searching and evaluation of data.
  • To be responsible for liaising with scientists engaged in CORBEL and advising on the use of chemogenomics resources to progress their projects;
  • To help in the identification and analysis of bioactivity data from multiple database resources;
  • To construct and utilize appropriate workflows to facilitate the pharmacological profiling of molecules and chemotypes, the identification of potential off-target effects and the development of target prediction models;
  • To identify interoperability gaps between resources and help with developing solutions;
  • To organize and run appropriate training courses for scientists engaged in the CORBEL project;

 For full details of the position, or to apply see:

The closing date is 9th April 2017

Monday, 27 February 2017

Position to work on tractability in Open Targets

There is currently an opening for a Protein Computational Scientist to work on methods to assess and quantify the tractability (druggability) of potential new targets for drug discovery. This is a two year position funded by the Open Targets initiative.

The appointee will work with scientists from the Open Targets partners to assess, validate and develop methods for quantifying target tractability with the ultimate goal of incorporating such methodologies into the target validation platform. The initial focus will be on “small molecule” tractability but we are also interested in other modalities in due course (e.g. antibody therapies). Many of the current methods to assess small molecule tractability are based on the use of 3D protein structures, but such information is only available for a subset of potential targets; a key component of the project is to determine robust methods and pipelines that can be applied to novel targets where there is much more limited information.

For more details or to apply, click here

Closing date is 9th March

(The image above is taken from the Fpocket publication.)

Thursday, 9 February 2017

ChEMBL Webinars

We will be running a new series of webinars over the next few months. These will cover a range of topics including basic introductions to the Chemogenomics resources (ChEMBL, SureChEMBL, UniChem) as well as more detailed topics, a schema walkthrough and ChEMBL web services.

The first webinar will be a basic introduction to ChEMBL and will be on 22nd February at 2pm GMT (3pm CET, 9am EST).

If you would like to attend the webinar, please email to register.
Please note, spaces are limited so please let us know as soon as possible if you register but are then unable to attend.

We will post further details of upcoming webinars here, so watch this space!

The ChEMBL Team

Friday, 16 December 2016

Merry Christmas from ChEMBL

Wishing all of our many users and collaborators a very Merry Christmas and a Happy New Year!
The ChEMBL Team

Monday, 5 December 2016

A comprehensive map of molecular drug targets

Within the ChEMBL database we spend a lot of time manually curating links between FDA approved drugs and their efficacy targets. With collaborators from the University of New Mexico and the Institute of Cancer Research, we have now published an analysis of these drug efficacy targets:

Santos R, Ursu O, Gaulton A, Bento AP, Donadi RS, Bologa CG, Karlsson A, Al-Lazikani B, Hersey A, Oprea TI & Overington JP.
A comprehensive map of molecular drug targets
Nature Reviews Drug Discovery (2016) doi:10.1038/nrd.2016.230

In the article we address the complexities of assigning drug targets, describe the 667 human proteins and 189 pathogen proteins through which 1,578 FDA-approved drugs act and map each drug to its therapeutic indication via the WHO ATC classification system.

We show that 70% of small molecule drugs still act through privileged families (GPCRs, ion channels, kinases and nuclear receptors), highlight the differences in innovation between different therapeutic areas, look at conservation of targets across different model organisms and demonstrate that only 5% of identified cancer driver genes are targeted by current cancer therapies.

As an aside, the drug-target data within ChEMBL is used in a number of other platforms such as Pharos (the portal for the NIH Illuminating the Druggable Genome project), Open Targets (a resource for pre-competitive target validation) and DrugCentral (a drug compendium from the University of New Mexico), all of which have papers in the 2017 Database Issue of Nucleic Acids Research, alongside ChEMBL:

Pharos: Collating protein information to shed light on the druggable genome

Open Targets: a platform for therapeutic target identification and validation

DrugCentral: online drug compendium

Tuesday, 29 November 2016

New ChEMBL database paper out

The latest ChEMBL database paper is now available online:

This paper describes some of the additions to ChEMBL over the last few releases (ChEMBL_18 to ChEMBL_22) such as drug indications and clinical candidates, patent bioactivity data from BindingDB, drug metabolism information and richer assay annotation. A number of papers from our collaborators will also feature in the 2017 NAR database issue, so watch this space...

Thursday, 17 November 2016

ChEMBL_22 Data and Web Services Update

ChEMBL_22_1 data update:

We would like to inform users that an update to ChEMBL_22 has been released. 

The new version, ChEMBL_22_1, corrects an issue with the targets assigned to some BindingDB assays in ChEMBL (src_id = 37). If you are using the BindingDB data from ChEMBL, we recommend you download this update. This update also incorporates the mol file/canonical smiles correction announced previously.

Updates have been made to BindingDB data in the ASSAYS, ACTIVITIES, CHEMBL_ID_LOOKUP, LIGAND_EFF and PREDICTED_BINDING_DOMAINS tables. Corrections have also been made to molfiles and canonical_smiles in the COMPOUND_STRUCTURES table. No changes have been made to other data sets or to other drug/compound/target tables in ChEMBL_22.

The new release files can be downloaded from:

A new version of the ChEMBL RDF is also available from:

Improvements to Web Services:

1. Support for SDF format.

The "molecule" endpoint now supports the SDF format. For example, if you access this URL: you will get information about the first 20 compounds in JSON format. This URL will return an SDF file of the same molecule page. Please note that there will be only 18 compounds in the SDF output because two compounds on that page (CHEMBL6961 and CHEMBL6963) have no structure defined. You can easily join the information about a compound provided in JSON, XML or YAML format with its structure by inspecting the "> <chembl_id>" SDF property.

Naturally, the same format works for a single compound, so this URL: will provide information about Aspirin, while this URL will return its structure.

The same can be applied to filters. For example, this URL returns information about compounds with molecular weight <= 300 AND pref_name ending with nib. The SDF version of the same URL in turn returns the corresponding structures.
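The filter described above can be sketched as a small URL builder. The base URL and the filter parameter names (molecule_properties__mw_freebase__lte, pref_name__iendswith) are assumptions based on the Django-style filter syntax of the web services and should be checked against the API documentation.

```python
from urllib.parse import urlencode

# Assumed base URL of the ChEMBL web services (illustrative).
CHEMBL_API = "https://www.ebi.ac.uk/chembl/api/data"


def molecule_url(fmt="json", **filters):
    """Build a molecule-endpoint URL for the given output format and filters."""
    url = f"{CHEMBL_API}/molecule.{fmt}"
    if filters:
        # Sort the filters so the generated URL is deterministic.
        url += "?" + urlencode(sorted(filters.items()))
    return url


# Molecular weight <= 300 AND preferred name ending with "nib".
filters = {
    "molecule_properties__mw_freebase__lte": 300,
    "pref_name__iendswith": "nib",
}
json_url = molecule_url("json", **filters)  # compound information
sdf_url = molecule_url("sdf", **filters)    # matching structures

print(json_url)
print(sdf_url)
```

Only the format suffix differs between the two URLs, which is what makes it straightforward to join compound information with structures.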

We have also released a new version of the Python client (version 0.8.50, available from PyPI and GitHub) that is aware of molfile support. Example code:

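from chembl_webresource_client.new_client import new_client
molecules = new_client.molecule
# Request SDF output rather than JSON (assumes the client's set_format method)
molecules.set_format('sdf')
# The first record is then returned as a molfile string
molstring = molecules.all()[0]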

By iterating through all molecules you can build an SDF file with all the structures from ChEMBL; pagination is handled by the client.

2. Structural alerts.

This new API endpoint provides information about a compound's structural alerts. For example, in order to get structural alerts for CHEMBL266429, you can use this URL:

Then you can render each of the alerts to image, for example

As you can see, the corresponding fragment is highlighted. You can add all the parameters that are present in the standard "image" endpoint: format (png or svg), engine (rdkit or indigo), ignoreCoords to recompute coordinates from scratch, and dimensions to change the image size.

3. Document terms (keywords)

We used the pytextrank package to extract the most relevant terms from all document abstracts stored in ChEMBL, along with their significance score against each document (the code we used to perform the extraction is available).

For example, in order to get all the relevant terms for CHEMBL1124199 document, ordered by the significance score descending, you can use this URL:

By parsing the results you can extract (term, score) pairs and multiply the score to get this list:

590 Inverse agonist activity
548 Thien-2-yl analogues
493 Pentylenetetrazole-induced convulsions
490 5'-alkyl group
477 Agonist activity
472 Inverse agonist
449 5-methylthien-3-yl derivative
427 Potent compounds
417 Vivo activity
403 Magnitude higher affinity

You can now feed this list into the HTML5-based word cloud tool, providing the following configuration:

{
  gridSize: Math.round(16 * $('#canvas').width() / 1024),
  drawOutOfBound: true,
  weightFactor: function (size) {
    return Math.pow(size/100.0, 2.3) * $('#canvas').width() / 1024;
  },
  fontFamily: 'Times, serif',
  hover: function(){},
  color: function (word, weight) {
    return (weight > 500) ? '#f02222' : '#c09292';
  },
  rotateRatio: 0.0,
  backgroundColor: '#ffe0e0'
}

and you will get this wordcloud:

We are planning to add this component to the new document report card.
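The parsing step mentioned above, extracting (term, score) pairs and scaling the scores, might look like the sketch below. The record field names (term_text, score), the sample score values and the scale factor of 1000 are all assumptions for illustration; check the actual response of the endpoint before relying on them.

```python
def top_terms(records, scale=1000, limit=10):
    """Return (scaled_score, term) pairs sorted by score, highest first."""
    pairs = [(round(float(r["score"]) * scale), r["term_text"]) for r in records]
    pairs.sort(key=lambda pair: pair[0], reverse=True)
    return pairs[:limit]


def to_wordcloud_list(pairs):
    """Convert (score, term) pairs into [term, weight] entries for a word cloud."""
    return [[term, score] for score, term in pairs]


# Hypothetical records, shaped like parsed JSON from the endpoint.
sample = [
    {"term_text": "Agonist activity", "score": "0.477"},
    {"term_text": "Inverse agonist activity", "score": "0.590"},
    {"term_text": "Thien-2-yl analogues", "score": "0.548"},
]
pairs = top_terms(sample)
cloud = to_wordcloud_list(pairs)

for score, term in pairs:
    print(score, term)
```

Scores are parsed with float() so the sketch works whether the service returns them as strings or as numbers.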

It may also be interesting to ask about all the documents for a given keyword. For example, in order to get all the documents for the "inverse agonist activity" term ordered by score descending, the following URL can be used:

4. Document similarity

As the last endpoint we added "document_similarity". For example to get all documents similar to CHEMBL1122254 document this URL can be used:

The endpoint uses the same protocol we use to generate the "Related Documents" section in the Document Report Card.

The current protocol is fairly simple (measuring overlap in compounds and targets between the two documents) and not very granular (it can be difficult to choose N most relevant documents from the 50 documents that the protocol returns). However, we are currently investigating alternative methods such as topic modelling.
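To make "overlap in compounds and targets" concrete, here is one possible way to score a pair of documents. It is only a sketch: the Jaccard coefficient and the simple averaging are choices made for this illustration, not the formula used by the actual protocol, and the identifiers are hypothetical.

```python
def jaccard(a, b):
    """Jaccard overlap of two collections: shared items / all items (0.0 if both empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def document_similarity(doc_a, doc_b):
    """Average the compound overlap and the target overlap of two documents."""
    compound_overlap = jaccard(doc_a["compounds"], doc_b["compounds"])
    target_overlap = jaccard(doc_a["targets"], doc_b["targets"])
    return (compound_overlap + target_overlap) / 2


# Hypothetical documents described by the compounds and targets they report.
doc_1 = {"compounds": {"C1", "C2", "C3"}, "targets": {"T1"}}
doc_2 = {"compounds": {"C2", "C3", "C4"}, "targets": {"T1"}}
doc_3 = {"compounds": {"C9"}, "targets": {"T7"}}

print(document_similarity(doc_1, doc_2))
print(document_similarity(doc_1, doc_3))
```

A score of this kind also shows why the ranking is not very granular: many document pairs share the same target and a handful of compounds, so their scores end up close together.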

5. Other improvements

There are some minor improvements as well:
 - Molecule endpoint includes three more properties as described in GitHub issue #106.
 - Target endpoint can be filtered by synonym name, in other words you can get a list of targets for a given gene name, for example:
or using a shortcut:
 - Target relation endpoint can be accessed by primary ID as described in GitHub issue #114.
 - parent_chembl_id filter working correctly for the molecule_form endpoint, as described in GitHub issue #113.

The ChEMBL Team

Thursday, 6 October 2016

ChEMBL 22 release - technical notes

The ChEMBL 22 release brings lots of new data. But we also released some new software so if you are interested in technical details please read on.

1. First of all, please note that ChEMBL 22 is the last release where we provide Oracle 9i dumps.
Oracle 9i has been out of support for nearly a decade and shouldn't be in use anymore, but please let us know if this is a problem. On the other hand, we will do our best to provide Oracle 12c dumps for the next release.

2. If you are using the python API client please upgrade it by running:

[sudo] pip install -U chembl_webresource_client

This will upgrade the client to the latest version, which fixes some minor bugs and adds the ability to search in document abstracts. It will also create a new cache so you will see new ChEMBL data immediately. Otherwise, you will need to clear your cache manually.

3. New version (2.4.9) of the ChEMBL API has been released as well. This version includes:
 - new endpoints: tissue and target_relation
 - mechanism endpoint contains references now
 - solr index has been added to documents so their abstracts can be searched, for example searching for 'cytokine': api/data/document/search.json?q=cytokine
 - the outdated chemical cartridge used by API (Biovia Direct) has been updated from 6.3 to 2016 Direct. The result is better handling of SMILES string, for example this API call:[O--].[Fe++].OCC1OC(OC2C(CO)OC(OC3C(O)C(CO)OC(OCC4OC(OCC5OC(O)C(O)C(OC6OC(CO)C(O)C(OC7OC(COC8OC(COC9OC(CO)C(O)C(O)C9O)C(O)C(O)C8O)C(O)C(OC8OC(CO)C(O)C(OC9OC(CO)C(O)C(OC%2510OC(COC%2511OC(COC%2512OC(COC%2513OC(COC%2514OC(COC%2515OC(CO)C(O)C(O)C%2515O)C(O)C(OC%2515OC(CO)C(O)C%2515O)C%2514O)C(O)C(O)C%2513O)C(O)C(O)C%2512O)C(O)C(O)C%2511O)C(O)C(OC%2511OC(CO)C(O)C(O)C%2511O)C%2510O)C9O)C8O)C7O)C6O)C5O)C(O)C(O)C4O)C3O)C2O)C(O)C1O/70
works fine now.
 - status endpoint provides API software version as well as ChEMBL release version.
 - there are many smaller bug fixes and improvements.

4. Since our API is maturing, we have started preparing a collection of embeddable widgets written in JS/CSS/HTML that you can use on your website, blog or web application. This will be a base for our new ChEMBL website. An example widget providing some basic information about a ChEMBL compound can be found below; the code used to embed it is:

<object data="" width="800px" height="350px"></object>

Another example is an assay co-occurrence matrix for compounds extracted from a single document. Again, the code to embed it is:

<object data="" width="800px" height="800px"></object>

Thursday, 29 September 2016

ChEMBL 22 Released

We are pleased to announce the release of ChEMBL 22. This version of the database, prepared on 8th August 2016, contains:

  • 2,043,051 compound records
  • 1,686,695 compounds (of which 1,678,393 have mol files)
  • 14,371,219 activities
  • 1,246,132 assays
  • 11,224 targets
  • 65,213 documents

Data can be downloaded from the ChEMBL FTP site or viewed via the ChEMBL interface. Please see ChEMBL_22 release notes for full details of all changes in this release.


In addition to the regular updates to the Scientific Literature, PubChem, FDA Orange Book and USP Dictionary of USAN and INN Investigational Drug Names this release of ChEMBL also includes the following new data:

Deposited Data Sets:

Two new deposited data sets have been included in ChEMBL_22: the MMV Pathogen Box compound set and GSK Tres Cantos Follow-up TB Screening Data.

Patent Data from BindingDB:

We have worked with the BindingDB team to integrate the bioactivity data that they have extracted from more than 1000 granted US patents published from 2013 onwards into ChEMBL. This data is incorporated into ChEMBL in the same way as literature-extracted bioactivity information, but with a new source (SRC_ID = 37, BindingDB Database) and a document type of 'PATENT'. In total this data set provides 99K bioactivity measurements for 68K compounds.

Withdrawn Drugs:

We have compiled, from multiple sources, a list of drugs that have been withdrawn in one or more countries due to safety or efficacy issues. Where available, the year of withdrawal, the applicable countries/areas and the reasons for the withdrawal are captured. Withdrawal information is shown on the Compound Report Card and a new icon has been added to the availability type section of the Molecule Features image to denote drugs that have been withdrawn.

Tissue Annotation:

We have identified tissues used in assays (e.g., tissues in which measurements were made after in-vivo dosing, isolated tissues on which assays were performed, or tissues from which sub-cellular fractions were prepared) using the Uberon ontology. A TISSUE_DICTIONARY table has been created, which stores a list of the identified tissues, their corresponding ChEMBL_IDs, names and Uberon IDs. Mappings are also provided to the Experimental Factor Ontology, Brenda Tissue Ontology and CALOHA Ontology. Tissue Report Cards have been created, providing a mechanism to view all of the assay data associated with a particular tissue. The keyword search now also allows searching by tissue name, Uberon ID, EFO ID, Brenda Tissue ID or CALOHA tissue ID.

Indications for Clinical Candidates:

Indication information has now been extended to cover clinical candidates. This information has been extracted from and is included in the 'Browse Drug Indications' view and on Compound Report Cards.

Drug Metabolism Viewer:

An additional section has been added to Compound Report Cards to display drug metabolism schemes. These schemes can be opened in an expanded view by clicking the link above the image. Where known, enzyme information is shown on edges and clicking on an edge of interest will provide additional information about the reaction, including references. Clicking on the nodes allows linking to Compound Report Cards for the metabolites.

Variant Sequences:

For cases where assay data has been measured against a variant protein (e.g., site-directed mutagenesis or drug-resistance studies) we have created a VARIANT_SEQUENCES table to store the variant protein sequence used in the assay (the target for the assay will still be the wild-type protein). Since the exact protein sequence used in an assay is rarely reported in the medicinal chemistry literature, these sequences have been re-created by introducing the specified point mutation into the current UniProt sequence for the target. The resulting sequence is not therefore guaranteed to be the exact sequence used in the assay but provides a more robust way to document the relevant mutation(s) than the current use of residue name and position in most publications and ChEMBL assay descriptions (which quickly becomes obsolete when sequences change). In cases where the reported residue positions could not be reconciled with any UniProt sequence, variant sequence information has not been included in ChEMBL. Further sequences (requiring more curation) will be added in future releases. Assays with variant sequence information available are linked to the VARIANT_SEQUENCES table via the VARIANT_ID column. Please note, this information is not yet displayed on the ChEMBL interface.
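The re-creation step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the curation pipeline itself: the mutation notation (for example "T790M"), the function name and the toy sequence are all hypothetical, and only single point substitutions are handled.

```python
import re

# Mutation written as wild-type residue, 1-based position, variant residue, e.g. "T790M".
MUTATION_RE = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def apply_point_mutation(sequence, mutation):
    """Return the sequence with the mutation applied, or None if it cannot be reconciled."""
    match = MUTATION_RE.match(mutation.strip().upper())
    if not match:
        raise ValueError(f"unrecognised mutation: {mutation!r}")
    wild_type, variant = match.group(1), match.group(3)
    index = int(match.group(2)) - 1  # convert 1-based position to 0-based index
    # The reported position must exist and hold the reported wild-type residue.
    if index < 0 or index >= len(sequence) or sequence[index] != wild_type:
        return None
    return sequence[:index] + variant + sequence[index + 1:]


# Toy sequence for illustration only (not a real UniProt entry).
wild_type_sequence = "MKTAYIAK"
variant_sequence = apply_point_mutation(wild_type_sequence, "T3M")
unreconciled = apply_point_mutation(wild_type_sequence, "A3M")

print(variant_sequence)
print(unreconciled)
```

Returning None when the reported residue does not match the sequence mirrors the case where positions cannot be reconciled with a UniProt sequence, so that no variant sequence is recorded.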

We recommend you review the ChEMBL_22 release notes for a comprehensive overview of all updates and changes in ChEMBL 22, including schema changes, and as always, we greatly appreciate the reporting of any omissions or errors.

Keep an eye on the ChEMBL twitter and blog accounts for news and updates.

The ChEMBL Team