ChEMBL 34 is out!

We are delighted to announce the release of ChEMBL 34, which includes a full update to drug and clinical candidate drug data. This version of the database, prepared on 28/03/2024, contains:

  • 2,431,025 compounds (of which 2,409,270 have mol files)
  • 3,106,257 compound records (non-unique compounds)
  • 20,772,701 activities
  • 1,644,390 assays
  • 15,598 targets
  • 89,892 documents
Data can be downloaded from the ChEMBL FTP site: https://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/releases/chembl_34/

Please see ChEMBL_34 release notes for full details of all changes in this release: https://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/releases/chembl_34/chembl_34_release_notes.txt

New Data Sources



European Medicines Agency (src_id = 66): The European Medicines Agency data correspond to EMA drugs prior to 20 January 2023 (excluding vaccines). 71 of the 882 newly added EMA drugs are authorised only by the EMA and not by other regulatory bodies, e.g. the FDA. A significant effort has been made to correctly map the drug form of the EMA data by manually inspecting different EMA sources of information, such as the Product Information (Annex I: Summary of Product Characteristics and Annex III: Labelling and Package Leaflet) and/or the Assessment Report, where available.

University of Dundee: T. cruzi data (src_id = 67): 3328 compounds that harbour common protease inhibitor motifs were screened at 30 µM against LAPTc using a RapidFire-MS method for inhibitory activity against the TcLAP protein.

EU-OPENSCREEN dataset (src_id = 68): 4 assays have been deposited by the EU-OPENSCREEN project: 1 cell-based assay on human HepG2 cells, 1 assay measuring inhibition of SARS-CoV2-induced cytopathy, and 2 assays measuring inhibition of SARS-CoV2 3CL-Pro proteolytic cleavage. 1813 bioactivities in total have been added.

Zimmermann Lab Biotransformation data Dec 2023 (src_id = 69): 271 compounds have been tested for biotransformation in 68 bacterial species and 28 bacterial communities.  The metabolism of these compounds was positive in 8844 of 22306 activity results; biotransformation is recorded in the STANDARD_TEXT_VALUE field of the ACTIVITIES table.

New Deposited Datasets

CHEMBL5291702 - European Medicines Agency
CHEMBL5303304 - EUbOPEN Chemogenomics Library - IncuCyte (assays link to document CHEMBL4689842)
CHEMBL5303761 - Data for DCP probe BI-3231
CHEMBL5303762 - Data for DCP probe BI-8668
CHEMBL5303763 - Data for DCP probe BI-3802
CHEMBL5303764 - Data for DCP probe BI-3812
CHEMBL5303765 - Data for DCP probe BAY-7081
CHEMBL5303766 - Data for DCP probe FHT-2344
CHEMBL5303767 - Data for DCP probe JNJ-4355
CHEMBL5303768 - Data for DCP probe TP-060
CHEMBL5303769 - Data for DCP probe JNJ-42226314
CHEMBL5303708 - ECBD screening data for assay EOS300033
CHEMBL5303709 - ECBD screening data for assay EOS300041
CHEMBL5303710 - ECBD screening data for assay EOS300044
CHEMBL5303711 - ECBD screening data for assay EOS300108
CHEMBL5303300 - EUbOPEN Chemogenomics Library - Multiplex (assays link to document CHEMBL4689842)
CHEMBL5305021 - RapidFire TcLAP Compounds Screening
CHEMBL5308504 - Tm Shift (DSF) assay results for EUbOPEN Chemogenomics Library (assays link to documents CHEMBL5060014 and CHEMBL4649998)

Data Highlights

A new source of drug data from the European Medicines Agency (EMA) has been included in ChEMBL for this release. 

The MOLECULE_DICTIONARY.ORPHAN field has been added to indicate whether a drug has orphan designation, i.e. intended for use against a rare condition (1 = yes, 0 = no, -1 = preclinical compound i.e. not a drug). This data is currently available for European Medicines Agency drugs only.

The field MOLECULE_DICTIONARY.MOLECULE_TYPE has been updated to include ‘Antibody drug conjugate’ in addition to the existing categories.

The prodrug data has been fully revised and updated. This includes the MOLECULE_DICTIONARY.PRODRUG field, which indicates whether a drug is a prodrug (=1) or not (=0), as well as its pharmacologically active molecule, which is given in the MOLECULE_HIERARCHY.ACTIVE_MOLREGNO field (and as src_id = 53).
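As a rough illustration of how the ORPHAN and PRODRUG flags might be queried, here is a minimal sketch using Python's sqlite3 module. It builds a tiny in-memory database rather than opening a real ChEMBL download. The table and column names follow the fields mentioned above (plus MOLREGNO as the join key), but the three rows and their names are invented for the example, and the real tables contain many more columns.

```python
import sqlite3

# Toy in-memory database. Table and column names follow the release notes;
# the rows are invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE molecule_dictionary (
    molregno  INTEGER PRIMARY KEY,
    pref_name TEXT,
    orphan    INTEGER,  -- 1 = yes, 0 = no, -1 = preclinical compound
    prodrug   INTEGER   -- 1 = prodrug, 0 = not a prodrug
);
CREATE TABLE molecule_hierarchy (
    molregno        INTEGER PRIMARY KEY,
    active_molregno INTEGER
);
INSERT INTO molecule_dictionary VALUES
    (1, 'EXAMPLE PRODRUG', 1, 1),
    (2, 'EXAMPLE ACTIVE', 0, 0),
    (3, 'EXAMPLE PRECLINICAL', -1, 0);
INSERT INTO molecule_hierarchy VALUES (1, 2), (2, 2), (3, 3);
""")

# Drugs with an orphan designation.
orphan_drugs = [
    row[0]
    for row in conn.execute(
        "SELECT pref_name FROM molecule_dictionary "
        "WHERE orphan = 1 ORDER BY pref_name"
    )
]

# Each prodrug paired with its pharmacologically active molecule.
prodrug_pairs = conn.execute("""
    SELECT p.pref_name, a.pref_name
    FROM molecule_dictionary AS p
    JOIN molecule_hierarchy  AS h ON h.molregno = p.molregno
    JOIN molecule_dictionary AS a ON a.molregno = h.active_molregno
    WHERE p.prodrug = 1
""").fetchall()

print(orphan_drugs)
print(prodrug_pairs)
```

Against a downloaded SQLite copy of ChEMBL, the same two SELECT statements should apply with the path to the database file passed to `sqlite3.connect` instead of `":memory:"`.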

The coverage of manually curated molecule sequence data has been revised and extended. The data now includes protein and nucleic acid sequences for INNs and USANs. See BIOTHERAPEUTIC_COMPONENTS and BIO_COMPONENT_SEQUENCES. 

Drug indication coverage has been extended to include new EMA approved drug indications, as well as USAN and INN clinical candidate indications, and their mapping to MeSH and EFO ontologies. See DRUG_INDICATION and INDICATION_REFS.

The data in MOLECULE_DICTIONARY.MAX_PHASE now includes consideration of: 
  • EMA approved drugs (max_phase = 4 for human drugs),
  • USAN clinical candidate drugs (assigned as max_phase = 1 based on USAN guidance that states “Firms usually apply for a USAN when the investigational therapy is in Phase I or Phase II trials”. See https://www.ama-assn.org/about/united-states-adopted-names/apply-united-states-adopted-name), and
  • INN clinical candidate drugs (assigned as max_phase = 2 based on INN guidance that states “As a general guide, the development of a drug should progress up to the point of clinical trials (phase II) before an application is submitted to the INN Secretariat for name selection.” See https://www.who.int/publications/m/item/guidance-on-the-use-of-inns).
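The phase assignments listed above can be pictured as a small lookup. The sketch below is only an illustration: the evidence labels (`EMA_APPROVED`, `USAN`, `INN`) are hypothetical names, and it assumes that when a molecule has several kinds of evidence the highest phase is kept, which is what the field name MAX_PHASE suggests but which is not spelled out in this post.

```python
from typing import Iterable, Optional

# Phase values as stated in the release notes; the keys are hypothetical labels.
SOURCE_MAX_PHASE = {
    "EMA_APPROVED": 4,  # EMA approved human drugs
    "INN": 2,           # INN clinical candidates
    "USAN": 1,          # USAN clinical candidates
}


def max_phase(evidence: Iterable[str]) -> Optional[int]:
    """Return the highest phase implied by the evidence labels, or None."""
    phases = [SOURCE_MAX_PHASE[label] for label in evidence if label in SOURCE_MAX_PHASE]
    return max(phases) if phases else None


print(max_phase(["USAN", "INN"]))
print(max_phase(["EMA_APPROVED", "USAN"]))
```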

Pref_name curation. Progress has been made towards standardising drug and clinical candidate pref_names (in MOLECULE_DICTIONARY.PREF_NAME), whereby an approved drug name (FDA/EMA) is assigned in the first instance, if available. If not available, the USAN is assigned, or failing that the INN. If the USAN/INN name assignment is ambiguous, the FDA GSRS preferred name is used. A company research code, or Clinical Trial intervention name, is assigned if no standardised name is available. For virtual parent compounds, progress has been made towards assigning a distinct pref_name (typically based on the FDA GSRS preferred name) that differs from the child compound name.
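The naming precedence described in the pref_name paragraph can be sketched as a short function. This is not the curation pipeline itself, which involves manual review. The dictionary keys are hypothetical, and two details are assumptions made for the sketch: a research code is tried before a clinical trial intervention name, and an ambiguous USAN/INN with no GSRS name falls through to those last-resort names.

```python
from typing import Mapping, Optional


def choose_pref_name(
    names: Mapping[str, str], usan_inn_ambiguous: bool = False
) -> Optional[str]:
    """Pick a preferred name using the precedence described in the post."""
    # 1. An approved drug name (FDA/EMA) wins whenever one exists.
    if names.get("approved"):
        return names["approved"]
    if not usan_inn_ambiguous:
        # 2. Otherwise the USAN, then the INN.
        for key in ("usan", "inn"):
            if names.get(key):
                return names[key]
    elif names.get("gsrs"):
        # 3. Ambiguous USAN/INN assignment: use the FDA GSRS preferred name.
        return names["gsrs"]
    # 4. No standardised name: research code or trial intervention name
    #    (the order between these two is an assumption).
    for key in ("research_code", "trial_intervention"):
        if names.get(key):
            return names[key]
    return None


print(choose_pref_name({"usan": "NAME A", "inn": "NAME B"}))
print(choose_pref_name({"usan": "NAME A", "gsrs": "NAME G"}, usan_inn_ambiguous=True))
```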

Synonym curation. The data in MOLECULE_SYNONYMS now includes manually curated Spanish and French INN synonyms, in addition to the existing English INN synonyms. Manual curation has reduced the instances of the same synonym being assigned to two (or more) different drugs or clinical candidate drugs.

The descriptions of the drug and clinical candidate sources have been reviewed and updated to improve clarity (i.e. SOURCE.SRC_DESCRIPTION and SOURCE.SRC_SHORT_NAME for SRC_ID = {8, 9, 12, 13, 36, 41, 42, 53, 63, 66}). For example:
FDA_ORANGE_BOOK (src_id = 9) is now described as “FDA Approved Drug Products with Therapeutic Equivalence Evaluations (Orange Book)”.
FDA_NEW_DRUGS (src_id = 12) is now described as “FDA New Molecular Entity and New Therapeutic Biological Product Approvals (New FDA Drugs)”.
PRODRUG_ACTIVE (src_id = 53) is now described as “Active Ingredient of a Prodrug”.

The black_box_warning pipeline has been updated to capture any new FDA labels up to 31st December 2023 with black box warnings for severe or life-threatening side effects.

The clinical trials pipeline has been updated up to 29th June 2023 to capture data for clinical trial interventions, conditions and phase in ClinicalTrials.gov that can be mapped to ChEMBL data.

New data have been included in the VERSION table to show the version applied for MeSH and EFO ontologies, the ChEMBL_Structure_Pipeline, RDKit packages, InChI, UniProtKB, Bioassay Ontology and Gene Ontology as well as the version of the ChEMBL database.  

The definition of a chemical probe has been amended to update the field MOLECULE_DICTIONARY.CHEMICAL_PROBE. The data set of chemical probes was retrieved from 1) the chemicalprobes.org website, filtered for probes that were assigned an In Vivo Rating or In Cell Rating of 3 stars or more, and 2) probes-drugs.org, filtered for the subsets “SGC Probes” and “Open Science Probes”. The data set was retrieved on 09/02/2024.
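To make the two filters concrete, here is a minimal sketch of the selection logic. The record layout (`source`, `in_vivo_rating`, `in_cell_rating`, `subsets`) is hypothetical and does not reflect the export format of either website. Only the thresholds and subset names come from the description above.

```python
ACCEPTED_SUBSETS = {"SGC Probes", "Open Science Probes"}


def is_chemical_probe(record: dict) -> bool:
    """Apply the chemical probe filters described in the post to one record."""
    if record.get("source") == "chemicalprobes.org":
        # Keep probes rated 3 stars or more either in vivo or in cells.
        in_vivo = record.get("in_vivo_rating") or 0
        in_cell = record.get("in_cell_rating") or 0
        return max(in_vivo, in_cell) >= 3
    if record.get("source") == "probes-drugs.org":
        # Keep probes that belong to at least one accepted subset.
        return bool(ACCEPTED_SUBSETS & set(record.get("subsets", ())))
    return False


examples = [
    {"source": "chemicalprobes.org", "in_vivo_rating": 2, "in_cell_rating": 4},
    {"source": "chemicalprobes.org", "in_vivo_rating": 2, "in_cell_rating": None},
    {"source": "probes-drugs.org", "subsets": ["SGC Probes"]},
    {"source": "probes-drugs.org", "subsets": ["Some other subset"]},
]
flags = [is_chemical_probe(r) for r in examples]
print(flags)
```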

When possible, AUC record units were converted to ng.hr.mL-1 and Cmax record units were converted to nM by creating new conversion rules.
For legacy AUC and Cmax records (those loaded before ChEMBL 34), pharmacokinetic parameters have been extracted from the assay descriptions using regular expression (RegEx) matching. This affects only records that did not already have their PK parameters manually extracted. Dose, dose unit, route of administration and time (for AUC only) have been loaded into ACTIVITY_PROPERTIES.
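The actual conversion rules and regular expressions are not given in this post, so the following is only a sketch of the general idea. The unit strings in the AUC table, the regular expression, and the example assay description are all assumptions made for illustration. The Cmax conversion relies on the standard relationship that 1 ng/mL equals 1000 / MW nM for a compound of molecular weight MW in g/mol.

```python
import re
from typing import Optional

# Hypothetical AUC unit strings mapped to a multiplier that yields ng.hr.mL-1.
AUC_FACTORS = {"ng.hr.mL-1": 1.0, "ug.hr.mL-1": 1000.0}


def auc_to_ng_hr_per_ml(value: float, unit: str) -> Optional[float]:
    """Convert an AUC value when a rule exists for its unit, else return None."""
    factor = AUC_FACTORS.get(unit)
    return None if factor is None else value * factor


def cmax_ng_per_ml_to_nm(value: float, mol_weight: float) -> float:
    """Convert Cmax from ng/mL to nM given the molecular weight in g/mol."""
    return value / mol_weight * 1000.0


# Illustrative pattern: a dose, a dose unit and, later in the text, a route.
DOSE_RE = re.compile(
    r"(?P<dose>\d+(?:\.\d+)?)\s*(?P<unit>mg/kg|ug/kg|mg)\b"
    r".*?\b(?P<route>po|iv|ip|sc)\b",
    re.IGNORECASE,
)


def extract_pk_properties(description: str) -> Optional[dict]:
    """Pull dose, dose unit and route out of an assay description, if present."""
    match = DOSE_RE.search(description)
    if match is None:
        return None
    return {
        "dose": float(match.group("dose")),
        "dose_unit": match.group("unit"),
        "route": match.group("route").lower(),
    }


print(auc_to_ng_hr_per_ml(2.5, "ug.hr.mL-1"))
print(cmax_ng_per_ml_to_nm(500.0, 500.0))
print(extract_pk_properties("Cmax in Sprague-Dawley rat at 10 mg/kg, po"))
```

Returning None for units or descriptions that do not match mirrors the "when possible" wording above: records without an applicable rule are simply left unconverted in this sketch.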

Organism name/taxonomy updates: a review of organism-related data was undertaken to align organism names with the NCBI taxonomy and to update obsolete taxonomy IDs. In addition, where multiple strain-level targets existed in ChEMBL, these were merged at the species level and the strain details migrated to the assay_strain field in the ASSAYS table. The initial round of organism updates for ChEMBL 34 impacted the ASSAYS, TARGET_DICTIONARY and ORGANISM_CLASS tables. In the ASSAYS table, the assay_organism/assay_tax_id fields were updated for ~37,144 legacy rows. In the TARGET_DICTIONARY table, these updates affected 428 legacy rows. A total of 61 updates to legacy data in the ORGANISM_CLASS table were made, with 14 entries downgraded.

Funding acknowledgements:

Work contributing to ChEMBL 34 was funded by the Wellcome Trust, EMBL Member States, Open Targets, National Institutes of Health (NIH), EU Innovative Medicines Initiative (IMI) and EU Framework 7 programmes. Please see https://chembl.gitbook.io/chembl-interface-documentation/acknowledgments for more details.

If you require further information about ChEMBL, please contact us: chembl-help@ebi.ac.uk

To receive updates when new versions of ChEMBL are available, please sign up to our mailing list: http://listserver.ebi.ac.uk/mailman/listinfo/chembl-announce
For general queries/feedback or to report any problems with data, please email: chembl-help@ebi.ac.uk

