
Thursday, 31 May 2018

ChEMBL 24 Released!


We are pleased to announce the release of ChEMBL 24. This version of the database, prepared on 23/04/2018, contains:

    2,275,906 compound records
    1,828,820 compounds (of which 1,820,035 have mol files)
    15,207,914 activities
    1,060,283 assays
    12,091 targets
    69,861 documents

Data can be downloaded from the ChEMBL ftp site: ftp://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/releases/chembl_24_1

Please see ChEMBL_24 release notes for full details of all changes in this release: ftp://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/releases/chembl_24_1/chembl_24_1_release_notes.txt

Change in data model and addition of activity properties and supplementary data:

A new data submission format and database loader have been implemented. The new deposition system allows more advanced functionality, including the ability to update previously deposited data sets and to deposit activity data against existing ChEMBL compound or assay collections. This means that, in future releases, the SRC_ID for data in the ACTIVITIES table may differ from the SRC_ID in the COMPOUND_RECORDS and/or ASSAYS tables to which the measurements relate.
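As a rough illustration of what this means for queries, the sketch below builds a toy in-memory SQLite database and lists activities whose SRC_ID differs from that of the compound record or assay they refer to. The table layout is a heavily simplified assumption (only the key and src_id columns are included) and every row is invented for the example; please check the release notes for the real schema.

```python
import sqlite3

# Toy stand-in for three ChEMBL tables. Only the columns needed for this
# illustration are present, and all rows are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE compound_records (record_id INTEGER PRIMARY KEY, src_id INTEGER);
    CREATE TABLE assays (assay_id INTEGER PRIMARY KEY, src_id INTEGER);
    CREATE TABLE activities (
        activity_id INTEGER PRIMARY KEY,
        record_id INTEGER REFERENCES compound_records(record_id),
        assay_id INTEGER REFERENCES assays(assay_id),
        src_id INTEGER
    );
    INSERT INTO compound_records VALUES (1, 1), (2, 30);
    INSERT INTO assays VALUES (10, 1), (20, 30);
    -- Activity 101 is deposited by source 30 against a source-1 compound and assay.
    INSERT INTO activities VALUES (100, 1, 10, 1), (101, 1, 10, 30), (102, 2, 20, 30);
""")

# Find activities whose source differs from the compound record or assay source.
mismatched = conn.execute("""
    SELECT act.activity_id, act.src_id, cr.src_id, a.src_id
    FROM activities act
    JOIN compound_records cr ON cr.record_id = act.record_id
    JOIN assays a ON a.assay_id = act.assay_id
    WHERE act.src_id <> cr.src_id OR act.src_id <> a.src_id
    ORDER BY act.activity_id
""").fetchall()

print(mismatched)
```

In practice this means that code which filters on the SRC_ID of one table and assumes the joined rows share it may need revisiting.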

We have now added an ACTIVITY_PROPERTIES table to the database, to allow parameters such as compound dose or time points to be captured for individual activity measurements. The table can also be used to record key experimental measurements that are important in interpreting the values reported in the ACTIVITIES table (e.g., HILL_SLOPE for a dose-response curve).
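To give a feel for how such properties might sit alongside an activity, here is a minimal sketch using an in-memory SQLite database. The column names (type, value, units and so on) are simplified assumptions rather than the full ChEMBL schema, and the rows are invented.

```python
import sqlite3

# Toy schema: one row per activity, many property rows per activity.
# Column names are simplified assumptions and the data are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE activities (
        activity_id INTEGER PRIMARY KEY,
        standard_type TEXT, standard_value REAL, standard_units TEXT
    );
    CREATE TABLE activity_properties (
        ap_id INTEGER PRIMARY KEY,
        activity_id INTEGER REFERENCES activities(activity_id),
        type TEXT, value REAL, units TEXT
    );
    INSERT INTO activities VALUES (1, 'IC50', 12.0, 'nM'), (2, 'Inhibition', 45.0, '%');
    INSERT INTO activity_properties VALUES
        (1, 1, 'HILL_SLOPE', 1.1, NULL),
        (2, 2, 'DOSE', 10.0, 'mg/kg'),
        (3, 2, 'TIME', 24.0, 'hr');
""")

rows = conn.execute("""
    SELECT act.activity_id, ap.type, ap.value, ap.units
    FROM activities act
    JOIN activity_properties ap ON ap.activity_id = act.activity_id
    ORDER BY act.activity_id, ap.ap_id
""").fetchall()

# Group the properties under their parent activity.
properties_by_activity = {}
for activity_id, prop_type, value, units in rows:
    properties_by_activity.setdefault(activity_id, {})[prop_type] = (value, units)

print(properties_by_activity)
```

The one-to-many layout is the point of the sketch: a single measurement can carry several parameters (dose, time point, Hill slope) without widening the ACTIVITIES table.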

The ACTIVITY_SUPP table has also been introduced to allow supplementary data for an activity measurement to be captured. For example, for in vivo toxicology data, the ACTIVITIES table may capture summary level data across a group of animals, while the ACTIVITY_SUPP table contains individual animal-level data.

As a result of these improvements, this release contains some schema changes (including changes to the existing ASSAY_PARAMETERS table). A number of existing data sets have also been reformatted to take advantage of these new tables. Please see the release notes (ftp://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/releases/chembl_24/chembl_24_release_notes.txt) and recent blog post (http://chembl.blogspot.co.uk/2018/04/schema-changes-coming-in-chembl24.html) for more details.


Other features in the new release include:

Several new deposited data sets:
#  K4DD Project - K4DD drug target binding kinetics data (src_id = 30, DOI = 10.6019/CHEMBL3885741)
#  MMV Pathogen Box - The Australian National University Dept Of Immunology (src_id = 34, DOI = 10.6019/CHEMBL3987221)
#  Published Kinase Inhibitor Set 2 - Northwick Park Institute for Medical Research (src_id = 43, DOI = 10.6019/CHEMBL3988181)
#  University of Dundee, Gates Library - Leishmania donovani Methionine tRNA synthetase screening (src_id = 33, DOI = 10.6019/CHEMBL3988442)

Withdrawn Class information:
Withdrawn drugs in ChEMBL (src_id = 36) have been annotated with a controlled vocabulary to describe the reasons for their withdrawal.

Change of InChI version:
The version of Standard InChI used in ChEMBL has now been updated from 1.02 to 1.05.

Compound properties are now calculated with RDKit:
We are now using RDKit to calculate the following compound properties:
MW_FREEBASE, ALOGP, HBA, HBD, PSA, RTB, QED_WEIGHTED, FULL_MWT, AROMATIC_RINGS, HEAVY_ATOMS, MW_MONOISOTOPIC, FULL_MOLFORMULA, HBA_LIPINSKI, HBD_LIPINSKI. ACDLabs properties are unaffected.

Updated data sets:
A number of existing data sets have been updated including:
#  Scientific Literature (src_id = 1)
#  Clinical Candidates (src_id = 8)
#  FDA Orange Book (src_id = 9)
#  Open TG-GATEs (src_id = 11)
#  Manually Added Drugs (src_id = 12)
#  USP Dictionary of USAN and International Drug Names (src_id = 13)
#  DrugMatrix (src_id = 15)
#  BindingDB (src_id = 37)
#  Patent Bioactivity Data (src_id = 38)
#  Curated Drug Pharmacokinetic Data (src_id = 39)
#  WHO Anatomical Therapeutic Chemical Classification (src_id = 41)

Oracle exports:
Oracle 12c exports are now available for download (ftp://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/chembl_24/chembl_24_oracle12c.tar.gz)


Retirement of old ChEMBL web services:
Please note, the legacy ChEMBL web services, hosted at: https://www.ebi.ac.uk/chembl//ws/home_old will not be updated to ChEMBL_23 and will be retired at the end of June. If you have not already done so, please switch to our current web services: https://www.ebi.ac.uk/chembl/ws


Funding acknowledgements:
Work contributing to ChEMBL_24 was funded by the Wellcome Trust, EMBL Member States, Open Targets, National Institutes of Health (NIH) Common Fund, EU Innovative Medicines Initiative (IMI) and EU Framework 7 programmes. Please see https://www.ebi.ac.uk/chembl/funding for more details.


# If you require further information about ChEMBL, please contact us: chembl-help@ebi.ac.uk

# To receive updates when new versions of ChEMBL are available, please sign up to our mailing list: http://listserver.ebi.ac.uk/mailman/listinfo/chembl-announce

# For current users of the ChEMBL web interface, chembl-announce and chembl-help mailing lists, please see our Privacy Policy/Terms of Use at:
https://www.ebi.ac.uk/data-protection/privacy-notice/chembl-announce-mailing-list
https://www.ebi.ac.uk/data-protection/privacy-notice/chembl-helpdesk
https://www.ebi.ac.uk/about/terms-of-use
https://www.ebi.ac.uk/data-protection/privacy-notice/embl-ebi-public-website


Monday, 14 May 2018

Striving for Perfect Representation of Chemical Structures – is this possible?


It probably goes without saying that at ChEMBL, we have a desire to make all our data as accurate and useful as possible. With this in mind, we have spent many hours over the last few years trying to curate, in particular, the structures of marketed drugs and clinical candidates. We aren’t alone in this: more than 5 years ago, people were coming across the same problems, as highlighted in this blog post by ChemConnector on Fluvastatin.

Our drug curation is an ongoing and probably never-ending task, but to be honest it has proved a lot more difficult than we expected. This is for several reasons:

Firstly, where to go to find the definitive structure of a molecule? One would have thought this would be easy, but even sources such as INN and USAN don’t always agree. For example, for Telavancin the USAN data sheet shows a difference in the nitrogen and carbon counts in the structure images compared with the images in the INN document (although the molecular formulae are the same in both documents).

Secondly, while molfiles are our definitive structures and we use standard InChIs to determine uniqueness, we see many examples where, as we convert between formats (molfiles, SMILES, InChIs), we introduce inconsistencies. This is of course a well-known problem. There are ongoing discussions and initiatives to develop open structure formats and extend InChIs to deal with some of these cases, but my sense is this is a long way off.

Lastly, and what seems like an insurmountable problem, we are constrained by the method we use to represent chemical structures. We, at ChEMBL, like many people, use version 2000 molfiles (ref 1). There is no doubt that using v3000 molfiles would solve a number of these problems, but the conversion would be very time-consuming and costly, and therefore probably only feasible for a limited number of ChEMBL structures such as the drug molecules. We are considering this as a long-term goal but it would need wider community buy-in to make it worthwhile. We also suspect that many of our users only use the older molfile version, so providing the v3000 format wouldn’t help them. We would be interested in your feedback on which format you use though. Most of the resources we exchange data with (e.g. PubChem, BindingDB) also use v2000 molfiles. There is no doubt that different resources find their own way to cope with the limitations of the file formats and we do too. For example, it would be possible to use non-standard extensions of the data fields in the SD file to flag such cases, but this would lack real chemical awareness. Also, how one group chooses to use these fields won’t necessarily be consistent with another group, so we would be no further forward.

As a consequence of our curation efforts, we have come across an increasing number of challenging molecules, and it would be useful to get the views of our users on the best way to deal with them. It should also be said here that we are only talking about apparently “simple” rule-of-5-compliant organic molecules; several years ago we stopped trying to curate organometallic compounds, and we don’t show the structures of these in ChEMBL. The drug cisplatin is a case in point: the v2000 molfile has no way of coding coordination bonds, and the standard InChI (ref 2) that we use to define a unique chemical structure can’t distinguish between cis- and trans-platin.

Back to the organic molecules though and a few of our dilemmas:

Milnacipran is my favourite, and an apparently relatively simple, example. It is a mixture of the 1S,2R and 1R,2S enantiomers (USAN). However, v2000 molfiles don’t deal with relative stereochemistry, so we have 3 options:

(1) Show one enantiomer
(2) Show it as a racemic mixture, i.e. no stereochemistry
(3) Show it as a molfile comprised of two molecules

Arguably option 3 is the only correct way to do this. However, other data providers such as FDA and DrugBank use option 1. In the ChEMBL database we use option 2 so that we can distinguish milnacipran from levomilnacipran (USAN; specifically the 1S,2R isomer) or dextromilnacipran (1R,2S). Option 1 wouldn’t enable us to distinguish these either in the molfile or the standard InChI.

My logic for not using option 3 comes from thinking about how people use ChEMBL. ChEMBL is not a registration system, where option 3 might indeed be needed; it is used as a source of bioactivity data for identifying tool compounds, building QSAR models for specific targets etc. Calculating physicochemical properties on mixtures makes little sense, so wouldn’t users taking our 1.8 million compounds simply discard any mixtures, such as those option 3 would produce, before starting their analysis?
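To make the “discard mixtures” step concrete, here is one way a user might flag multi-component structures: parse the counts line and bond block of a v2000 molfile and count the disconnected fragments. This is only a sketch under stated assumptions. The function name count_components and the tiny hand-written mol block are hypothetical, and the parser reads nothing beyond the atom and bond counts and the bond atom indices.

```python
def count_components(mol_block: str) -> int:
    """Count disconnected fragments in a V2000 mol block using union-find."""
    lines = mol_block.splitlines()
    counts = lines[3]                      # counts line follows the 3 header lines
    n_atoms = int(counts[0:3])             # fixed-width fields, 3 characters each
    n_bonds = int(counts[3:6])

    parent = list(range(n_atoms + 1))      # 1-based atom indices; slot 0 unused

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    bond_start = 4 + n_atoms               # bond block follows the atom block
    for line in lines[bond_start:bond_start + n_bonds]:
        a, b = int(line[0:3]), int(line[3:6])
        parent[find(a)] = find(b)          # merge the two atoms' fragments

    return len({find(i) for i in range(1, n_atoms + 1)})


# Hand-written toy mol block: two separate two-atom fragments (C-C and C-O).
ATOM = "    0.0000    0.0000    0.0000 {:<3} 0  0  0  0  0  0  0  0  0  0  0  0"
two_fragments = "\n".join([
    "toy mixture", "  hand-written", "",
    "  4  2  0  0  0  0  0  0  0  0999 V2000",
    ATOM.format("C"), ATOM.format("C"), ATOM.format("C"), ATOM.format("O"),
    "  1  2  1  0",
    "  3  4  1  0",
    "M  END",
])

print(count_components(two_fragments))  # 2 fragments, so this would be flagged
```

Note that a filter like this cannot tell an option 3 style enantiomer mixture from a salt or any other multi-component record, which is part of why mixtures tend to be dropped wholesale.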

OK so suppose you disagree and think option 3 is the right thing to do, what would you want us to do for itraconazole? This is described in DailyMed (ref 3) as a “1:1:1:1 racemic mixture of four diastereoisomers (two enantiomeric pairs)”.

Option 3 would give us a mixture of 4 molecules in our v2000 molfile.

Again, we have chosen option 2 as the least bad option, i.e. just showing it as a racemic mixture.

It seemed as if we had identified a workable and at least internally consistent way of dealing with these structures, until we took a look at the following two examples, alphaprodine and betaprodine.

Alphaprodine is a mixture of the (RS,SR) enantiomers and betaprodine a mixture of the (SS,RR) enantiomers. Hence our use of option 2 fails to distinguish between them! This matters as the two enantiomeric pairs have different biological properties, e.g. different analgesic activity (ref 4).

The other example is met(h)iomeprazine and levomet(h)iomeprazine, where the former is a mixture of two enantiomers and the latter is one enantiomer or the other (according to INN, it apparently isn’t known which).

For this example, we have chosen option 2 for metiomeprazine but for levometiomeprazine we show just one of the possible enantiomers.

In summary, no existing solutions are ideal and not everyone agrees on how to do this. In ChEMBL itself we are trying to be consistent within the constraints of the v2000 molfile format, but the work is not finished yet. There is however a glimmer of light in this confusion, in that our UniChem connectivity match (ref 5) enables matching of these cases across databases. For example, using the non-stereospecific representation of milnacipran enables matching to milnacipran itself as well as to the specific levo- and dextro-milnacipran enantiomers (and their salts).
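The idea behind a connectivity-style match can be sketched by removing the stereo layers from a standard InChI and comparing what is left. This is a simplified illustration and not UniChem’s actual implementation; the helper name strip_stereo is hypothetical, and alanine is used as a small stand-in for the milnacipran family.

```python
STEREO_LAYER_PREFIXES = ("b", "t", "m", "s")


def strip_stereo(inchi: str) -> str:
    """Drop the stereo layers (/b, /t, /m, /s) from a standard InChI string."""
    version, formula, *layers = inchi.split("/")
    kept = [layer for layer in layers if not layer.startswith(STEREO_LAYER_PREFIXES)]
    return "/".join([version, formula, *kept])


# Alanine as a stand-in: two enantiomers plus the form with no stereo defined.
L_ALA = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
D_ALA = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m1/s1"
FLAT_ALA = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)"

# The full InChIs are all different...
assert len({L_ALA, D_ALA, FLAT_ALA}) == 3

# ...but they collapse to a single key once the stereo layers are removed.
keys = {strip_stereo(inchi) for inchi in (L_ALA, D_ALA, FLAT_ALA)}
print(keys)
```

The same collapse is what makes option 2 unable to separate alphaprodine from betaprodine: once stereochemistry is left out, the two enantiomeric pairs share one representation.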

So, ChEMBL users out there, we’d be interested in what you think. Do you prefer option 1, 2 or 3 for your use cases, or does it make no difference? We can’t promise an instant change, but we do want to hear your views. Before you ask, we know we have some inconsistencies in ChEMBL for these molecules, but we are undecided on what to do and of course time spent on this is less time on other things. If you want to vote on your preferred option you can do so here.

As always if you think we have something wrong in ChEMBL please email chembl-help@ebi.ac.uk and we will endeavour to correct it.

References
(1) A. Dalby, J. G. Nourse, W. D. Hounshell, A. K. I. Gushurst, D. L. Grier, B. A. Leland and J. Laufer, Description of Several Chemical Structure File Formats Used by Computer Programs Developed at Molecular Design Limited, J. Chem. Inf. Comput. Sci. 1992, 32, 244-255

(2) InChI - the worldwide chemical structure identifier standard, S Heller, A McNaught, D. Tchekhovskoi and S. Stein, J. Cheminf. 2013, 5

(3) Dailymed entry for Itraconazole https://dailymed.nlm.nih.gov/dailymed/drugInfo.cfm?setid=1e243ffb-31be-39a7-4946-83ce7b839e0a

(4) A. H. Beckett, A. F. Casy and G. Kirk, Alpha and Beta Prodine Type Compounds, J. Med. and Pharmaceut. Chem., 1959, 1, 1-58

(5) J. Chambers, M. Davies, A. Gaulton, G. Papadatos, A. Hersey and J. P. Overington, UniChem: extension of InChI-based compound mapping to salt, connectivity and stereochemistry layers, J. Cheminformatics 2014, 6:43