Release of ChEMBL 33

We are pleased to announce the release of ChEMBL 33!

This fresh release comes with a few new data sources and some new features: we added bioactivity data for understudied SLC targets from the RESOLUTE project and included flags for Natural Products and Chemical Probes. An ACTION_TYPE annotation was added to approx. 270 K bioactivities. We also time-stamped every document in ChEMBL with its CREATION_DATE!
Have fun playing around with ChEMBL 33 over the summer and please send feedback via chembl-help@ebi.ac.uk.
 

ChEMBL database version ChEMBL 33 release notes

___________________________________________


# This version of the database, prepared on 31/05/2023 contains:


     2,399,743 compounds (of which 2,372,674 have mol files)

     3,051,613 compound records (non-unique compounds)

    20,334,684 activities

     1,610,596 assays

        15,398 targets

        88,630 documents


BioAssay Data Sources:    Number Assays:    Number Compound Records:    Number Activities:

 

Scientific Literature    1,556,406    1,707,714    8,422,975

Patent Bioactivity Data    16,573    59,839    179,516

Donated Chemical Probes - SGC Frankfurt    10,247    207    70,833

EUbOPEN Chemogenomic Library    9,786    2,488    397,587

BindingDB Database    4,117    137,338    204,256

TP-search Transporter Database    3,592    4,383    6,765

PubChem BioAssays    2,999    531,694    7,434,992

Literature data from EUbOPEN Chemogenomic Library    2,842    709    2,842

FDA Approval Packages    1,386    80    1,387

Sanger Institute Genomics of Drug Sensitivity in Cancer    713    139    73,039

GSK Published Kinase Inhibitor Set    456    1,101    169,451

Kuster lab chemical proteomics drug profiling    325    243    70,505

Drugs for Neglected Diseases Initiative (DNDi)    233    7,070    14,452

MMV Malaria Box    138    8,438    45,158

Curated Drug Pharmacokinetic Data    136    98    1,163

DrugMatrix    134    1,529    494,046

MMV Pathogen Box    88    1,574    6,256

SARS-CoV-2 Screening Data 2020-21    57    26,367    37,209

K4DD Project    48    273    2,064

Gates Library compound collection    37    224,440    1,482,491

CO-ADD antimicrobial screening data    35    24,315    99,793

RESOLUTE - Research Empowerment on Solute Carriers    34    93    96

Salvensis and LSHTM Schistosomiasis screening data    31    262    1,222

Open Source Malaria Screening    22    211    344

St Jude Malaria Screening    16    1,524    5,456

WHO-TDR Malaria Screening    16    740    5,853

AstraZeneca Deposited Data    15    5,799    11,687

GSK Tuberculosis Screening    15    826    1,814

Deposited Supplementary Bioactivity Data    13    1,786    4,817

GSK Kinetoplastid Screening    13    592    7,235

Curated Drug Metabolism Pathways    11    867    11

MMV Malaria HGL    10    141,662    295,295

HESi    9    31    986

Winzeler Lab Plasmodium Screening Data    7    78,603    399,067

St Jude Leishmania Screening    6    13,643    42,105

GSK Malaria Screening    6    13,533    81,198

Novartis Malaria Screening    6    10,119    27,888

Fraunhofer HDAC6    4    5,632    11,680

Cardiff Schistosomiasis Dataset 2023    4    80    194

Harvard Malaria Screening    4    37    111

IMI-CARE SARS-CoV-2 Data    3    4,404    9,646

Open TG-GATEs    2    160    210,708

Published Kinase Inhibitor Set 2    1    486    491

 

Compound-Only Data Sources:    Number Compound Records:


USP Dictionary of USAN and International Drug Names    12,394

Clinical Candidates    8,619

WHO Anatomical Therapeutic Chemical Classification    3,424

Orange Book    2,272

British National Formulary    1,958

Gene Expression Atlas Compounds    793

Prodrug active ingredients    238

Manually Added Drugs    228

International Nonproprietary Names    227

Withdrawn Drugs    225

HeCaToS Compounds    96

External Project Compounds    10


############################################

# Data changes since the last release:

############################################



# New Sources


"RESOLUTE - Research Empowerment on Solute Carriers" (src_id = 58): this dataset comprises 96 bioactivities measured in 34 assays on 32 SLC targets from the IMI-RESOLUTE project. RESOLUTE (https://re-solute.eu) is an EU-funded consortium working on the solute carrier (SLC) gene family in a public-private partnership. The consortium also develops new transport assays for selected SLCs.


Cardiff Schistosomiasis Dataset 2023 (src_id = 64): A library of 80 compounds was tested in vitro on different life cycle stages of the parasite Schistosoma mansoni. The dataset is also available from the ChEMBL - Neglected Tropical Disease archive (https://chembl.gitbook.io/chembl-ntd/#deposited-set-26-3rd-march-2023-dataset-using-chembl-to-complement-schistosome-drug-discovery).


Literature data from EUbOPEN Chemogenomic Library (src_id = 65): 2,842 bioactivity measurements have been extracted from primary literature by the SGC consortium to complement their Chemogenomic library (src_id = 55). References to primary literature are indicated in the ACTIVITY_PROPERTIES table (TEXT_VALUE and STANDARD_TEXT_VALUE fields).


# Updated Sources


Scientific Literature

EUbOPEN Chemogenomic Library


# New Deposited Datasets


CHEMBL5096127 - Using ChEMBL to complement schistosome drug discovery

CHEMBL5209563 - FFN206 based assay for SLC18A1 using HEK-293 SLC18A1 OE cells

CHEMBL5209564 - Superclomeleon biosensor based assay for SLC12A3 using HEK-293 SLC12A3 OE cells

CHEMBL5209565 - pH biosensor based assay for SLC16A3 using HEK-293 SLC16A3 OE cells

CHEMBL5209566 - Superclomeleon biosensor-based assay for SLC26A9 using HEK293 SLC26A9 JumpIn OE cells

CHEMBL5209567 - Membrane potential based assay for SLC2A9 using HEK-293 SLC2A9 OE cells

CHEMBL5209568 - Membrane potential based assay for SLC5A11 using HEK-293 SLC5A11 OE cells

CHEMBL5209569 - Membrane potential based assay for SLC6A8 using HEK-293 JumpIN SLC6A8 OE cells

CHEMBL5209570 - Membrane potential based assay for SLC6A12 using HEK-293 SLC6A12 OE cells

CHEMBL5209571 - Membrane potential based assay for SLC13A3 using HEK-293 SLC13A3 OE cells

CHEMBL5209572 - Membrane potential based assay for SLC22A4 using HEK-293 SLC22A4 OE cells

CHEMBL5209573 - Fluo-8 based assay for SLC24A2 using HEK293 SLC24A2 JumpIn OE cells

CHEMBL5209574 - Fluo-8 based assay for SLC24A4 using HEK293 JumpIn SLC24A4 OE cells

CHEMBL5209575 - Membrane potential based assay for SLC1A1 using HEK-293 SLC1A1 OE cells

CHEMBL5209576 - Membrane potential based assay for SLC5A7 using HEK-293 SLC5A7 OE cells

CHEMBL5209577 - Flow cytometry transport assay for SLC2A1 using HEK293 JumpIN TRex SLC2A1 WT-OE cells

CHEMBL5209578 - Flow cytometry transport assay for SLC2A2 using HEK293 JumpIN TRex SLC2A2 WT-OE cells

CHEMBL5209579 - Flow cytometry transport assay for SLC2A4 using HEK293 JumpIN TRex SLC2A4 WT-OE cells

CHEMBL5209580 - Flow cytometry transport assay for SLC2A3 using HEK293 JumpIN TRex SLC2A3 WT-OE cells

CHEMBL5209581 - Membrane potential based assay for SLC6A5 using HEK-293 SLC6A5 OE cells

CHEMBL5209582 - Membrane potential based assay for SLC6A6 using HEK-293 SLC6A6 OE cells

CHEMBL5209583 - pH biosensor based assay for SLC9B2 using HEK-293 SLC9B2 OE cells

CHEMBL5209584 - Membrane potential based assay for SLC15A2 using HEK-293 SLC15A2 OE cells

CHEMBL5209585 - FFN206 based assay for SLC18A2 using HEK-293 SLC18A2 OE cells

CHEMBL5209586 - Membrane potential-based assay for SLC34A1 using HEK293 JumpIn SLC34A1 OE cells

CHEMBL5209587 - Membrane potential based assay for SLC23A1 using HEK-293 SLC23A1 OE cells

CHEMBL5209588 - Membrane potential based assay for SLC6A9 using HEK-293 SLC6A9 OE cells

CHEMBL5209589 - Membrane potential based transport assay for SLC1A3 using HEK293 JumpIn SLC1A3 OE cells

CHEMBL5209590 - Membrane potential based transport assay for SLC7A3 using HEK293 JumpIn SLC7A3 OE cells

CHEMBL5209667 - EUbOPEN Chemical Probe Library 2

CHEMBL5209669 - NanoBRET assay results for EUbOPEN Chemogenomics Library 3

CHEMBL5209684 - Tm Shift (DSF) assay results for EUbOPEN Chemogenomics Library 3

CHEMBL5209801 - GPCR results for EUbOPEN Chemogenomics Library 3

CHEMBL5209897 - Affinity Phenotypic Cellular Literature for EUbOPEN Chemogenomics Library wave 3

CHEMBL5210121 - Affinity On-target Cellular Literature for EUbOPEN Chemogenomics Library wave 3

CHEMBL5210307 - Affinity Biochemical Literature for EUbOPEN Chemogenomics Library wave 3

CHEMBL5212743 - Selectivity Literature for EUbOPEN Chemogenomics Library wave 3


############################################

# Database changes since the last release:

############################################


# New Database Tables:


CHEMBL_RELEASE table: this table links each ChEMBL release (aka version) to its CREATION_DATE.


# New Database Fields:


CHEMBL_RELEASE.CHEMBL_RELEASE_ID (Primary Key; links to DOCS.CHEMBL_RELEASE_ID)

CHEMBL_RELEASE.CHEMBL_RELEASE: ChEMBL release name

CHEMBL_RELEASE.CREATION_DATE: ChEMBL release creation date


DOCS.CHEMBL_RELEASE_ID (Foreign Key; links to CHEMBL_RELEASE.CHEMBL_RELEASE_ID): every document can now be linked via the CHEMBL_RELEASE_ID to the new CHEMBL_RELEASE table, which allows retrieving the CREATION_DATE for each document
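The link between DOCS and CHEMBL_RELEASE can be sketched as a simple join. The example below is only an illustration on a toy in-memory SQLite database: the table and column names follow the fields listed above, DOC_ID is assumed to be the DOCS primary key, and all row values (release names, dates, ids) are invented placeholders rather than real ChEMBL content.

```python
import sqlite3

# Toy in-memory schema. Only the columns needed for the join are created;
# every value inserted below is an invented placeholder.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE chembl_release (
    chembl_release_id INTEGER PRIMARY KEY,
    chembl_release    TEXT,
    creation_date     TEXT
);
CREATE TABLE docs (
    doc_id            INTEGER PRIMARY KEY,  -- assumed primary key of DOCS
    chembl_release_id INTEGER REFERENCES chembl_release (chembl_release_id)
);
INSERT INTO chembl_release VALUES (1, 'RELEASE_A', '2000-01-01');
INSERT INTO chembl_release VALUES (2, 'RELEASE_B', '2000-06-01');
INSERT INTO docs VALUES (10, 1);
INSERT INTO docs VALUES (11, 2);
INSERT INTO docs VALUES (12, 2);
""")


def document_creation_dates(connection):
    """Return (doc_id, release name, creation date) for every document."""
    query = """
        SELECT d.doc_id, r.chembl_release, r.creation_date
        FROM docs d
        JOIN chembl_release r
          ON r.chembl_release_id = d.chembl_release_id
        ORDER BY d.doc_id
    """
    return connection.execute(query).fetchall()


rows = document_creation_dates(conn)
for doc_id, release, created in rows:
    print(doc_id, release, created)
```

The same SELECT statement should carry over to a full ChEMBL download, provided the downloaded schema uses the names given in these notes.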


ACTIVITIES.ACTION_TYPE (Foreign Key to ACTION_TYPE.ACTION_TYPE): 

The ACTION_TYPE field has been added to the ACTIVITIES table and provides additional detail on the mode of action of tested compounds in the specific assay setup. The recorded ACTION_TYPE must match one of the names in the ACTION_TYPE table. This field was populated with mode of action information that had previously been recorded as metadata in the ASSAY_PARAMETERS and ACTIVITY_PROPERTIES tables. In addition, approx. 250 K activities have been manually annotated with an ACTION_TYPE by the ChEMBL data extractors. The initial subset of curated activities is being released as a test set and we encourage feedback. As the rules become more clearly defined and atypical cases are identified, a small number of annotations may change over the coming releases.
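To illustrate what the foreign key implies, here is a minimal sketch on a toy in-memory SQLite database. It assumes ACTIVITY_ID as the ACTIVITIES primary key, and the two action type names and all rows are illustrative placeholders. It shows that a value absent from the ACTION_TYPE table is rejected, and how annotated activities could be counted per action type.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE action_type (action_type TEXT PRIMARY KEY)")
conn.execute("""
    CREATE TABLE activities (
        activity_id INTEGER PRIMARY KEY,  -- assumed primary key
        action_type TEXT REFERENCES action_type (action_type)
    )
""")

# Illustrative placeholder rows, not real ChEMBL data.
conn.executemany(
    "INSERT INTO action_type VALUES (?)",
    [("INHIBITOR",), ("AGONIST",)],
)
conn.executemany(
    "INSERT INTO activities VALUES (?, ?)",
    [(1, "INHIBITOR"), (2, "INHIBITOR"), (3, "AGONIST"), (4, None)],
)


def accepts(connection, activity_id, action_type):
    """Try to insert an activity; return False if the foreign key rejects it."""
    try:
        connection.execute(
            "INSERT INTO activities VALUES (?, ?)", (activity_id, action_type)
        )
        return True
    except sqlite3.IntegrityError:
        return False


# A name that is not listed in the action_type table is refused.
rejected = not accepts(conn, 5, "NOT_A_LISTED_TYPE")

# Count annotated activities per action type; unannotated rows are skipped.
counts = conn.execute("""
    SELECT action_type, COUNT(*)
    FROM activities
    WHERE action_type IS NOT NULL
    GROUP BY action_type
    ORDER BY action_type
""").fetchall()

print(rejected, counts)
```

Activities without an annotation keep a NULL ACTION_TYPE in this sketch, which the foreign key permits.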


MOLECULE_DICTIONARY.NATURAL_PRODUCT: Indicates whether the compound is a natural product as defined by COCONUT (https://coconut.naturalproducts.net/), the COlleCtion of Open Natural ProdUcTs (1 = yes, 0 = default value). Data set retrieved from the COCONUT team on 05/05/2023. For the structure mapping, stereochemical information was stripped from ChEMBL compounds, since compound structures in COCONUT did not include stereochemical information when the mapping was performed.


MOLECULE_DICTIONARY.CHEMICAL_PROBE: Indicates whether the compound is a chemical probe as defined by chemicalprobes.org (1 = yes, 0 = default value). The data set of chemical probes was retrieved from the chemicalprobes.org website and filtered for probes that were assigned an In Vivo Rating or In Cell Rating of 3 stars or more. Data set retrieved on 30/05/2023.
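The two new flags can be used as plain filters. The following sketch runs on a toy in-memory SQLite table: CHEMBL_ID is assumed as the identifier column, the helper function name is hypothetical, and the four rows are invented placeholders. It selects the compounds carrying each flag and those carrying both.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE molecule_dictionary (
        chembl_id       TEXT PRIMARY KEY,   -- assumed identifier column
        natural_product INTEGER DEFAULT 0,  -- 1 = yes, 0 = default value
        chemical_probe  INTEGER DEFAULT 0   -- 1 = yes, 0 = default value
    )
""")
# Invented placeholder compounds, not real ChEMBL identifiers.
conn.executemany(
    "INSERT INTO molecule_dictionary VALUES (?, ?, ?)",
    [
        ("EXAMPLE_1", 1, 0),
        ("EXAMPLE_2", 0, 1),
        ("EXAMPLE_3", 1, 1),
        ("EXAMPLE_4", 0, 0),
    ],
)

ALLOWED_FLAGS = {"natural_product", "chemical_probe"}


def flagged_ids(connection, flag):
    """Return the identifiers of all compounds whose given flag is set to 1."""
    if flag not in ALLOWED_FLAGS:
        raise ValueError(f"unknown flag: {flag}")
    # The column name is checked against a fixed set above, so it is safe
    # to place it into the query text.
    query = (
        f"SELECT chembl_id FROM molecule_dictionary "
        f"WHERE {flag} = 1 ORDER BY chembl_id"
    )
    return [row[0] for row in connection.execute(query)]


natural_products = flagged_ids(conn, "natural_product")
chemical_probes = flagged_ids(conn, "chemical_probe")
both = sorted(set(natural_products) & set(chemical_probes))

print(natural_products, chemical_probes, both)
```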


# Data changes and amendments


Missing information in the field DOCS.YEAR was included for 385 patents (src_id = 38).


AUC records (in STANDARD_TYPE field) were all converted to ng.hr.mL-1 units from uM.hr using the parent compound MW. The STANDARD_VALUE and STANDARD_UNITS were updated accordingly. 10,219 records were affected.

Formula: STANDARD_VALUE * MW_FREEBASE

where STANDARD_VALUE is in uM.hr and MW_FREEBASE in g/mol
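As a worked check of the AUC formula: 1 uM.hr is 1 umol.hr/L, and multiplying by a molecular weight in g/mol gives ug.hr/L, which is the same as ng.hr/mL. The function name and the example numbers below are purely illustrative.

```python
def auc_um_hr_to_ng_hr_per_ml(value_um_hr: float, mw_freebase: float) -> float:
    """Convert an AUC from uM.hr to ng.hr.mL-1.

    umol.hr/L * g/mol = ug.hr/L = ng.hr/mL, so no extra scaling factor
    is needed beyond the molecular weight.
    """
    if mw_freebase <= 0:
        raise ValueError("molecular weight must be positive")
    return value_um_hr * mw_freebase


# Illustrative numbers: 2 uM.hr for a compound of 300 g/mol.
auc_example = auc_um_hr_to_ng_hr_per_ml(2.0, 300.0)
print(auc_example)  # 600.0
```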


Cmax records (in STANDARD_TYPE field) were all converted to nM units from ug.mL-1 using the parent compound MW. The STANDARD_VALUE and STANDARD_UNITS were updated accordingly. 14,485 records were affected.

Formula: STANDARD_VALUE / MW_FREEBASE * 10^6

where STANDARD_VALUE is in ug.mL-1 and MW_FREEBASE in g/mol
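The factor of 10^6 in the Cmax formula can be verified the same way: dividing ug/mL by g/mol gives umol/mL, which equals mmol/L (mM), and 1 mM is 10^6 nM. The function name and the example values below are illustrative only.

```python
def cmax_ug_per_ml_to_nm(value_ug_per_ml: float, mw_freebase: float) -> float:
    """Convert a Cmax from ug.mL-1 to nM.

    ug/mL divided by g/mol = umol/mL = mmol/L (mM); 1 mM = 10**6 nM.
    """
    if mw_freebase <= 0:
        raise ValueError("molecular weight must be positive")
    return value_ug_per_ml / mw_freebase * 10**6


# Illustrative numbers: 1.5 ug/mL for a compound of 384 g/mol.
cmax_example = cmax_ug_per_ml_to_nm(1.5, 384.0)
print(cmax_example)  # 3906.25
```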


The tissue annotation was removed from approx. 50,000 legacy assays. In these assays, tissues were not present in the cell-based experiments and the cell source tissue had been incorrectly used to populate the tissue field.


Please note that Oracle 19c dumps will be stopped after ChEMBL 34.


# Funding acknowledgements:


Work contributing to ChEMBL 33 was funded by the Wellcome Trust, EMBL Member States, Open Targets, National Institutes of Health (NIH), EU Innovative Medicines Initiative (IMI) and EU Framework 7 programmes. Please see https://www.ebi.ac.uk/chembl/funding for more details.



# To receive updates when new versions of ChEMBL are available, please sign up to our mailing list: http://listserver.ebi.ac.uk/mailman/listinfo/chembl-announce

# To receive updates about submitting your data to ChEMBL, please sign up to our deposition mailing list: https://listserver.ebi.ac.uk/mailman/listinfo/chembl-depositors 

# For general queries/feedback please email: chembl-help@ebi.ac.uk

# For details of upcoming webinars, please see: http://chembl.blogspot.com/search/label/Webinar

