ChEMBL 34 is out!

We are delighted to announce the release of ChEMBL 34, which includes a full update to drug and clinical candidate drug data. This version of the database, prepared on 28/03/2024, contains:

2,431,025 compounds (of which 2,409,270 have mol files)
3,106,257 compound records (non-unique compounds)
20,772,701 activities
1,644,390 assays
15,598 targets
89,892 documents

Data can be downloaded from the ChEMBL FTP site: https://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/releases/chembl_34/

Please see ChEMBL_34 release notes for full details of all changes in this release: https://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/releases/chembl_34/chembl_34_release_notes.txt

New Data Sources

European Medicines Agency (src_id = 66): The European Medicines Agency data correspond to EMA drugs prior to 20 January 2023 (excluding vaccines). 71 of the 882 newly added EMA drugs are authorised only by the EMA and not by other regulatory bodies, e.g. the FDA. A significant effort has been made to correctly map the drug form of the EMA data by manually inspecting different EMA sources of information, such as the Product Information (Annex I: Summary of Product Characteristics and Annex III: Labelling and Package Leaflet) and/or the Assessment Report, where available.

University of Dundee: T. cruzi data (src_id = 67): 3328 compounds that harbour common protease inhibitor motifs were screened at 30 µM against LAPTc using the RapidFire-MS method for inhibitory activity against the TcLAP protein.

EU-OPENSCREEN dataset (src_id = 68): 4 assays have been deposited by the EU-OPENSCREEN project: 1 cell-based assay on human HepG2 cells, 1 assay measuring inhibition of SARS-CoV2-induced cytopathy, and 2 assays measuring inhibition of SARS-CoV2 3CL-Pro proteolytic cleavage. 1813 bioactivities in total have been added.

Zimmermann Lab Biotransformation data Dec 2023 (src_id = 69): 271 compounds have been tested for biotransformation in 68 bacterial species and 28 bacterial communities. The metabolism of these compounds was positive in 8844 of 22306 activity results; biotransformation is recorded in the STANDARD_TEXT_VALUE field of the ACTIVITIES table.

New Deposited Datasets

CHEMBL5291702 - European Medicines Agency

CHEMBL5303304 - EUbOPEN Chemogenomics Library - IncuCyte (assays link to document CHEMBL4689842)

CHEMBL5303761 - Data for DCP probe BI-3231

CHEMBL5303762 - Data for DCP probe BI-8668

CHEMBL5303763 - Data for DCP probe BI-3802

CHEMBL5303764 - Data for DCP probe BI-3812

CHEMBL5303765 - Data for DCP probe BAY-7081

CHEMBL5303766 - Data for DCP probe FHT-2344

CHEMBL5303767 - Data for DCP probe JNJ-4355

CHEMBL5303768 - Data for DCP probe TP-060

CHEMBL5303769 - Data for DCP probe JNJ-42226314

CHEMBL5303708 - ECBD screening data for assay EOS300033

CHEMBL5303709 - ECBD screening data for assay EOS300041

CHEMBL5303710 - ECBD screening data for assay EOS300044

CHEMBL5303711 - ECBD screening data for assay EOS300108

CHEMBL5303300 - EUbOPEN Chemogenomics Library - Multiplex (assays link to document CHEMBL4689842)

CHEMBL5305021 - RapidFire TcLAP Compounds Screening

CHEMBL5308504 - Tm Shift (DSF) assay results for EUbOPEN Chemogenomics Library (assays link to documents CHEMBL5060014 and CHEMBL4649998)

Data Highlights

A new source of drug data from the European Medicines Agency (EMA) has been included in ChEMBL for this release.

The MOLECULE_DICTIONARY.ORPHAN field has been added to indicate whether a drug has orphan designation, i.e. intended for use against a rare condition (1 = yes, 0 = no, -1 = preclinical compound i.e. not a drug). This data is currently available for European Medicines Agency drugs only.
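As an illustration of the encoding described above, the three ORPHAN values could be decoded as follows. This is a minimal sketch: the dictionary, the function name and the label wording are assumptions made for illustration and are not part of ChEMBL.

```python
# Labels paraphrase the release notes; ORPHAN_LABELS and describe_orphan
# are hypothetical names, not ChEMBL identifiers.
ORPHAN_LABELS = {
    1: "orphan designation",
    0: "no orphan designation",
    -1: "preclinical compound (not a drug)",
}


def describe_orphan(flag):
    """Translate a MOLECULE_DICTIONARY.ORPHAN value into a readable label."""
    return ORPHAN_LABELS.get(flag, "unknown value")


for flag in (1, 0, -1):
    print(flag, "->", describe_orphan(flag))
```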

The field MOLECULE_DICTIONARY.MOLECULE_TYPE has been updated to include ‘Antibody drug conjugate’ in addition to the existing categories.

The prodrug data has been fully revised and updated. This includes the MOLECULE_DICTIONARY.PRODRUG field, which indicates whether a drug is a prodrug (=1) or not (=0), as well as its pharmacologically active molecule, which is given in the MOLECULE_HIERARCHY.ACTIVE_MOLREGNO field (and as src_id = 53).
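The link between a prodrug and its active molecule can be followed with a simple join. The sketch below uses an in-memory SQLite database with two cut-down tables; the column subset and the toy rows (names such as 'EXAMPLE PRODRUG') are invented for illustration, so treat it as a pattern rather than a query guaranteed to match the full schema.

```python
import sqlite3

# Toy, cut-down versions of the two tables; rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE molecule_dictionary (
        molregno INTEGER PRIMARY KEY,
        pref_name TEXT,
        prodrug INTEGER
    );
    CREATE TABLE molecule_hierarchy (
        molregno INTEGER PRIMARY KEY,
        active_molregno INTEGER
    );
    INSERT INTO molecule_dictionary VALUES (1, 'EXAMPLE PRODRUG', 1);
    INSERT INTO molecule_dictionary VALUES (2, 'EXAMPLE ACTIVE', 0);
    INSERT INTO molecule_hierarchy VALUES (1, 2);
    """
)

# For every prodrug (prodrug = 1), look up the name of its active molecule.
query = """
    SELECT p.pref_name AS prodrug_name, a.pref_name AS active_name
    FROM molecule_dictionary AS p
    JOIN molecule_hierarchy AS h ON h.molregno = p.molregno
    JOIN molecule_dictionary AS a ON a.molregno = h.active_molregno
    WHERE p.prodrug = 1
    ORDER BY p.pref_name
"""
prodrug_pairs = conn.execute(query).fetchall()
print(prodrug_pairs)
```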

The coverage of manually curated molecule sequence data has been revised and extended. The data now includes protein and nucleic acid sequences for INNs and USANs. See BIOTHERAPEUTIC_COMPONENTS and BIO_COMPONENT_SEQUENCES.

Drug indication coverage has been extended to include new EMA approved drug indications, as well as USAN and INN clinical candidate indications, and their mapping to MeSH and EFO ontologies. See DRUG_INDICATION and INDICATION_REFS.

The data in MOLECULE_DICTIONARY.MAX_PHASE now includes consideration of:

EMA approved drugs (max_phase=4 for human drugs),

USAN clinical candidate drugs (assigned as max_phase = 1 based on USAN guidance that states “Firms usually apply for a USAN when the investigational therapy is in Phase I or Phase II trials”. See https://www.ama-assn.org/about/united-states-adopted-names/apply-united-states-adopted-name), and

INN clinical candidate drugs (assigned as max_phase = 2 based on INN guidance that states “As a general guide, the development of a drug should progress up to the point of clinical trials (phase II) before an application is submitted to the INN Secretariat for name selection.” See https://www.who.int/publications/m/item/guidance-on-the-use-of-inns).
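The three rules above can be sketched as a small lookup. This is an illustration only: the evidence labels and the function name are hypothetical, and combining the values by taking the highest phase (including any phase already recorded) is an assumption rather than a description of the ChEMBL pipeline.

```python
# Phase values come from the list above; the keys are hypothetical labels.
PHASE_BY_EVIDENCE = {
    "EMA_APPROVED": 4,  # EMA approved human drugs
    "INN": 2,           # INN clinical candidate drugs
    "USAN": 1,          # USAN clinical candidate drugs
}


def max_phase(evidence, existing=None):
    """Return the highest phase implied by the evidence labels.

    Assumption: an already recorded phase (existing) is never lowered.
    """
    phases = [PHASE_BY_EVIDENCE[label] for label in evidence]
    if existing is not None:
        phases.append(existing)
    return max(phases) if phases else None


print(max_phase(["USAN", "INN"]))
print(max_phase(["USAN"], existing=3))
```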

Pref_name curation. Progress has been made towards standardising drug and clinical candidate pref_names (in MOLECULE_DICTIONARY.PREF_NAME), whereby an approved drug name (FDA/EMA) is assigned in the first instance, if available. If not available, the USAN is assigned or, failing that, the INN. If the USAN/INN name assignment is ambiguous, the FDA GSRS preferred name is used. A company research code, or Clinical Trial intervention name, is assigned if no standardised name is available. For virtual parent compounds, progress has been made towards assigning a distinct pref_name (typically based on the FDA GSRS preferred name) that differs from the child compound name.
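The naming priority described above can be written as a fallback chain. The sketch below is one reading of that order: the dictionary keys, the function name and the behaviour when an ambiguous USAN/INN has no GSRS name are assumptions, not the curation procedure itself.

```python
def choose_pref_name(names, usan_inn_ambiguous=False):
    """Pick a pref_name from candidate names, following the stated priority.

    names is a dict with hypothetical keys: 'approved', 'usan', 'inn',
    'gsrs', 'research_code', 'trial_intervention'.
    """
    # 1. An approved drug name (FDA/EMA) always wins.
    if names.get("approved"):
        return names["approved"]
    if usan_inn_ambiguous:
        # 2a. Ambiguous USAN/INN: use the FDA GSRS preferred name.
        if names.get("gsrs"):
            return names["gsrs"]
    else:
        # 2b. Otherwise the USAN, then the INN.
        for key in ("usan", "inn"):
            if names.get(key):
                return names[key]
    # 3. No standardised name: research code or trial intervention name.
    for key in ("research_code", "trial_intervention"):
        if names.get(key):
            return names[key]
    return None


print(choose_pref_name({"usan": "NAME-A", "inn": "NAME-B"}))
```

Here the call prints NAME-A, because a USAN takes precedence over an INN when no approved name is present.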

Synonym curation. The data in MOLECULE_SYNONYMS now includes manually curated Spanish and French INN synonyms, in addition to the existing English INN synonyms. Manual curation has reduced the instances of the same synonym being assigned to two (or more) different drugs or clinical candidate drugs.

The descriptions of the drug and clinical candidate sources have been reviewed and updated to improve clarity (i.e. SOURCE.SRC_DESCRIPTION and SOURCE.SRC_SHORT_NAME for SRC_ID = {8, 9, 12, 13, 36, 41, 42, 53, 63, 66}):

FDA_ORANGE_BOOK (src_id = 9) is now described as “FDA Approved Drug Products with Therapeutic Equivalence Evaluations (Orange Book)”,

FDA_NEW_DRUGS (src_id = 12) is now described as “FDA New Molecular Entity and New Therapeutic Biological Product Approvals (New FDA Drugs)”, and

PRODRUG_ACTIVE (src_id = 53) is now described as “Active Ingredient of a Prodrug”.

The black_box_warning pipeline has been updated to capture any new FDA labels up to 31st December 2023 with black box warnings for severe or life-threatening side effects.

The clinical trials pipeline has been updated up to 29th June 2023 to capture data for clinical trial interventions, conditions and phase in ClinicalTrials.gov that can be mapped to ChEMBL data.

New data have been included in the VERSION table to show the version applied for MeSH and EFO ontologies, the ChEMBL_Structure_Pipeline, RDKit packages, InChI, UniProtKB, Bioassay Ontology and Gene Ontology as well as the version of the ChEMBL database.

The definition of a chemical probe has been amended to update the field MOLECULE_DICTIONARY.CHEMICAL_PROBE. The data set of chemical probes was retrieved from 1) the chemicalprobes.org website and filtered for probes that were assigned an In Vivo Rating or In Cell Rating of 3 stars or more, and from 2) probes-drugs.org by filtering for the subsets “SGC Probes” and “Open Science Probes”. Data set retrieved on 09/02/2024.
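The two selection criteria above amount to a simple filter. The following sketch assumes each probe is available as a dictionary; the keys ('source', 'in_vivo_rating', 'in_cell_rating', 'subsets') and the function name are hypothetical and do not reflect the export format of either website.

```python
# Subset names are taken from the text above.
PROBES_DRUGS_SUBSETS = {"SGC Probes", "Open Science Probes"}


def is_chemical_probe(record):
    """Return True if a record meets the criteria described above."""
    if record["source"] == "chemicalprobes.org":
        # In Vivo Rating or In Cell Rating of 3 stars or more.
        ratings = (record.get("in_vivo_rating"), record.get("in_cell_rating"))
        return any(r is not None and r >= 3 for r in ratings)
    if record["source"] == "probes-drugs.org":
        # Member of at least one of the two named subsets.
        return bool(PROBES_DRUGS_SUBSETS & set(record.get("subsets", ())))
    return False


candidates = [
    {"source": "chemicalprobes.org", "in_vivo_rating": 2, "in_cell_rating": 4},
    {"source": "chemicalprobes.org", "in_vivo_rating": 2, "in_cell_rating": None},
    {"source": "probes-drugs.org", "subsets": ["SGC Probes"]},
]
flags = [is_chemical_probe(c) for c in candidates]
print(flags)
```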

When possible, AUC record units were converted to ng.hr.mL-1 and Cmax record units were converted to nM by creating new conversion rules.
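To give a feel for what such conversions involve, here is a minimal sketch of two of them. It does not reproduce the actual ChEMBL conversion rules: the unit strings, the factor table and the function names are assumptions. The mass-to-molar step relies on the identity 1 ng/mL = 1000 / MW nM, where MW is the molecular weight in g/mol.

```python
# Hypothetical unit strings mapped to a factor that yields ng.hr.mL-1.
AUC_FACTORS_TO_NG_HR_PER_ML = {
    "ng.hr.mL-1": 1.0,
    "ug.hr.mL-1": 1_000.0,
    "mg.hr.mL-1": 1_000_000.0,
}


def auc_to_ng_hr_per_ml(value, unit):
    """Convert a mass-based AUC value to ng.hr.mL-1."""
    return value * AUC_FACTORS_TO_NG_HR_PER_ML[unit]


def cmax_ng_per_ml_to_nm(value_ng_per_ml, mol_weight):
    """Convert a Cmax in ng/mL to nM using the molecular weight (g/mol)."""
    return value_ng_per_ml * 1000.0 / mol_weight


print(auc_to_ng_hr_per_ml(2.5, "ug.hr.mL-1"))  # 2500.0
print(cmax_ng_per_ml_to_nm(500.0, 250.0))      # 2000.0
```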

For legacy AUC and Cmax records (< ChEMBL 34), pharmacokinetic parameters have been extracted from the assay descriptions using regular expression matching (RegEx). This affects only records that did not already have their PK parameters manually extracted. Dose, dose unit, route of administration and time (for AUC only) have been loaded in ACTIVITY_PROPERTIES.
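The kind of extraction described above might look like the sketch below. The patterns, the example descriptions and the property names (DOSE, DOSE_UNIT, ROUTE, TIME) are all assumptions made for illustration; the expressions actually used for ChEMBL are not given in this post and are likely to be considerably more thorough.

```python
import re

# Hypothetical patterns covering only a few common spellings.
DOSE_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(mg/kg|ug/kg|mg)\b", re.IGNORECASE)
ROUTE_RE = re.compile(r"\b(po|iv|ip|sc)\b", re.IGNORECASE)
TIME_RE = re.compile(
    r"(\d+(?:\.\d+)?)\s*(?:to|-)\s*(\d+(?:\.\d+)?)\s*(hr|h)\b", re.IGNORECASE
)


def extract_pk_properties(description):
    """Pull dose, dose unit, route and (for AUC only) time from free text."""
    props = {}
    dose = DOSE_RE.search(description)
    if dose:
        props["DOSE"] = float(dose.group(1))
        props["DOSE_UNIT"] = dose.group(2)
    route = ROUTE_RE.search(description)
    if route:
        props["ROUTE"] = route.group(1).lower()
    # The time window is only extracted for AUC records.
    if "auc" in description.lower():
        window = TIME_RE.search(description)
        if window:
            props["TIME"] = f"{window.group(1)}-{window.group(2)} {window.group(3)}"
    return props


# Invented description in the style of a PK assay.
print(extract_pk_properties("AUC (0 to 24 hr) in rat at 10 mg/kg, po"))
```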

Organism name/taxonomy updates: a review of organism-related data was undertaken to align organism names with the NCBI taxonomy and to update obsolete taxonomy IDs. In addition, where multiple strain-level targets existed in ChEMBL, these were merged at the species level and the strain details migrated to the assay_strain field in the ASSAYS table. The initial round of organism updates for ChEMBL 34 impacted the ASSAYS, TARGET_DICTIONARY and ORGANISM_CLASS tables. In the ASSAYS table, the assay_organism/assay_tax_ID fields were updated for ~37,144 legacy rows. In the TARGET_DICTIONARY table, these updates affected 428 legacy rows. A total of 61 updates to legacy data in the ORGANISM_CLASS table were made, with 14 entries downgraded.

Funding acknowledgements:

Work contributing to ChEMBL 34 was funded by the Wellcome Trust, EMBL Member States, Open Targets, National Institutes of Health (NIH), EU Innovative Medicines Initiative (IMI) and EU Framework 7 programmes. Please see https://chembl.gitbook.io/chembl-interface-documentation/acknowledgments for more details.

If you require further information about ChEMBL, please contact us: chembl-help@ebi.ac.uk

To receive updates when new versions of ChEMBL are available, please sign up to our mailing list: http://listserver.ebi.ac.uk/mailman/listinfo/chembl-announce

For general queries/feedback or to report any problems with data, please email: chembl-help@ebi.ac.uk
