
ChEMBL 26 Released



We are pleased to announce the release of ChEMBL_26

This version of the database, prepared on 10/01/2020, contains:

  • 2,425,876 compound records
  • 1,950,765 compounds (of which 1,940,733 have mol files)
  • 15,996,368 activities
  • 1,221,311 assays
  • 13,377 targets
  • 76,076 documents
You can query the ChEMBL 26 data online via the ChEMBL Interface and you can also download the data from the ChEMBL FTP site. Please see ChEMBL_26 release notes for full details of all changes in this release.

Changes since the last release:

* Deposited Data Sets:

CO-ADD antimicrobial screening data:
Two new data sets have been included from the Community for Open Access Drug Discovery (CO-ADD). They cover screening of the NIH NCI Natural Product Set III in the CO-ADD assays (src_id = 40, Document ChEMBL_ID = CHEMBL4296183, DOI = 10.6019/CHEMBL4296183) and screening of the NIH NCI Diversity Set V in the CO-ADD assays (src_id = 40, Document ChEMBL_ID = CHEMBL4296182, DOI = 10.6019/CHEMBL4296182).

HESI - Evaluation of the utility of stem-cell derived cardiomyocytes for drug proarrhythmic potential (src_id = 49, Document ChEMBL_ID = CHEMBL4295262, DOI = 10.6019/CHEMBL4295262). Summary assay results for this data set have been included in ChEMBL_26 and further supplementary data will be added in ChEMBL_27.
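
As a rough illustration of how the deposited data sets above can be located by their source, the sketch below looks up documents by src_id. It assumes a DOCS table exposing chembl_id, doi and src_id columns; to stay self-contained it builds a tiny in-memory SQLite stand-in holding only the three documents named in this post, rather than connecting to a real ChEMBL download. The function name documents_for_source is hypothetical.

```python
import sqlite3

# In-memory stand-in for the DOCS table (assumed columns: chembl_id, doi, src_id).
# Only the three documents mentioned in this post are loaded.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (chembl_id TEXT, doi TEXT, src_id INTEGER)")
conn.executemany(
    "INSERT INTO docs VALUES (?, ?, ?)",
    [
        ("CHEMBL4296183", "10.6019/CHEMBL4296183", 40),  # CO-ADD, Natural Product Set III
        ("CHEMBL4296182", "10.6019/CHEMBL4296182", 40),  # CO-ADD, Diversity Set V
        ("CHEMBL4295262", "10.6019/CHEMBL4295262", 49),  # HESI cardiomyocyte data set
    ],
)


def documents_for_source(connection, src_id):
    """Return (chembl_id, doi) pairs for one data source, sorted by ChEMBL ID."""
    cursor = connection.execute(
        "SELECT chembl_id, doi FROM docs WHERE src_id = ? ORDER BY chembl_id",
        (src_id,),
    )
    return cursor.fetchall()


coadd_docs = documents_for_source(conn, 40)
hesi_docs = documents_for_source(conn, 49)
```

Against a downloaded SQLite copy of the database, the same parameterised query should apply after swapping the in-memory connection for one opened on the database file, provided the table and column names match the assumption above.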

* Changes to structure-processing and compound properties:
We are now using RDKit for almost all of our compound-related processing. As of ChEMBL_26, this includes compound standardization, salt-stripping, generation of canonical SMILES, structural alerts, image depiction, substructure searches and similarity searches (via FPSim2: https://github.com/chembl/FPSim2). All molecules have therefore been reprocessed, and you may notice some differences in molfiles, SMILES and structure search results compared with previous releases. The ChEMBL structure curation pipeline has been released as an open source package: https://github.com/chembl/ChEMBL_Structure_Pipeline, and incorporated into our Beaker web services (see below). More information can be found here: http://chembl.blogspot.com/2020/02/chembl-compound-curation-pipeline.html.

We are also now using ChemAxon tools to calculate most acidic and basic pKa, logP and logD (pH 7.4) predictions, rather than ACDLabs software. These properties have therefore been recalculated and renamed in the database.

* Target Predictions:
Target predictions in ChEMBL are now generated by a new method, using conformal prediction (https://jcheminf.biomedcentral.com/articles/10.1186/s13321-018-0325-4). A docker image is available for those wishing to use the models locally: https://hub.docker.com/repository/docker/chembl/mcp (see https://chembl.blogspot.com/2020/01/new-chembl-ligand-based-target.html for more information). We also plan to provide a new target prediction web service in the future. The current target prediction web service (https://www.ebi.ac.uk/chembl/api/data/target_prediction/) has now been deprecated.

* Updated Data Sets:
Scientific Literature
Patent Bioactivity Data
Orange Book
USP Dictionary of USAN and International Drug Names
Clinical Candidates
WHO Anatomical Therapeutic Chemical Classification
British National Formulary
Manually Added Drugs

Database changes:

# Columns Added:

CELL_DICTIONARY
CELL_ONTOLOGY_ID VARCHAR2(10) ID for the corresponding cell type in the Cell Ontology

VARIANT_SEQUENCES
TAX_ID NUMBER(11,0) NCBI Tax ID for the organism from which the sequence was obtained

COMPOUND_PROPERTIES
CX_MOST_APKA NUMBER(9,2) The most acidic pKa calculated using ChemAxon v17.29.0
CX_MOST_BPKA NUMBER(9,2) The most basic pKa calculated using ChemAxon v17.29.0
CX_LOGP NUMBER(9,2) The calculated octanol/water partition coefficient using ChemAxon v17.29.0
CX_LOGD NUMBER(9,2) The calculated octanol/water distribution coefficient at pH 7.4 using ChemAxon v17.29.0

# Columns Removed:

COMPOUND_PROPERTIES
ACD_MOST_APKA Replaced by CX_MOST_APKA
ACD_MOST_BPKA Replaced by CX_MOST_BPKA
ACD_LOGP Replaced by CX_LOGP
ACD_LOGD Replaced by CX_LOGD
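
Existing queries that reference the removed ACD columns will need updating. The snippet below is a minimal sketch of one way to do that, by rewriting column names in a SQL string. The mapping follows the list above, assuming the removed basic pKa column was named ACD_MOST_BPKA (by analogy with ACD_MOST_APKA); the helper name migrate_query is hypothetical. Note that the values themselves were recalculated with different software, so results may differ even after renaming.

```python
import re

# Removed ACD column -> ChemAxon replacement, per the list above.
# ACD_MOST_BPKA is an assumed name, by analogy with ACD_MOST_APKA.
RENAMED_COLUMNS = {
    "ACD_MOST_APKA": "CX_MOST_APKA",
    "ACD_MOST_BPKA": "CX_MOST_BPKA",
    "ACD_LOGP": "CX_LOGP",
    "ACD_LOGD": "CX_LOGD",
}

# Match whole identifiers only, in any letter case.
_COLUMN_PATTERN = re.compile(
    r"\b(" + "|".join(RENAMED_COLUMNS) + r")\b", flags=re.IGNORECASE
)


def migrate_query(sql):
    """Replace removed ACD_* column names with their CX_* equivalents."""

    def _swap(match):
        old_name = match.group(1)
        new_name = RENAMED_COLUMNS[old_name.upper()]
        # Keep all-lowercase identifiers lowercase; otherwise use upper case.
        return new_name.lower() if old_name.islower() else new_name

    return _COLUMN_PATTERN.sub(_swap, sql)


old_query = "SELECT molregno, acd_logp, acd_logd FROM compound_properties"
new_query = migrate_query(old_query)
```

A plain text substitution like this does not understand SQL, so it would also rewrite matching names inside string literals or comments; it is meant as a starting point for reviewing queries, not a complete migration tool.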


Funding acknowledgements:

Work contributing to ChEMBL_26 was funded by the Wellcome Trust, EMBL Member States, Open Targets, National Institutes of Health (NIH), EU Innovative Medicines Initiative 2 (IMI2) and EU Horizon 2020 programmes. Please see https://chembl.gitbook.io/chembl-interface-documentation/acknowledgments for more details.


If you require further information about ChEMBL, please contact us: chembl-help@ebi.ac.uk

# To receive updates when new versions of ChEMBL are available, please sign up to our mailing list: http://listserver.ebi.ac.uk/mailman/listinfo/chembl-announce
# For general queries/feedback please email: chembl-help@ebi.ac.uk
# For details of upcoming webinars, please see: http://chembl.blogspot.com/search/label/Webinar
