ChEMBL 34 is out!

We are delighted to announce the release of ChEMBL 34, which includes a full update to drug and clinical candidate drug data. This version of the database, prepared on 28/03/2024, contains:

  • 2,431,025 compounds (of which 2,409,270 have mol files)
  • 3,106,257 compound records (non-unique compounds)
  • 20,772,701 activities
  • 1,644,390 assays
  • 15,598 targets
  • 89,892 documents
Data can be downloaded from the ChEMBL FTP site: https://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/releases/chembl_34/

Please see ChEMBL_34 release notes for full details of all changes in this release: https://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/releases/chembl_34/chembl_34_release_notes.txt

New Data Sources



European Medicines Agency (src_id = 66): The European Medicines Agency data correspond to EMA drugs prior to 20 January 2023 (excluding vaccines). Of the 882 newly added EMA drugs, 71 are authorised only by the EMA and not by other regulatory bodies, e.g. the FDA. A significant effort has been made to correctly map the drug form of the EMA data by manually inspecting different EMA sources of information, such as the Product Information (Annex I: Summary of Product Characteristics and Annex III: Labelling and Package Leaflet) and/or the Assessment Report, where available.

University of Dundee: T. cruzi data (src_id = 67): 3328 compounds that harbour common protease inhibitor motifs were screened at 30 µM against LAPTc, using a RapidFire-MS method, for inhibitory activity against the TcLAP protein.

EU-OPENSCREEN dataset (src_id = 68): 4 assays have been deposited by the EU-OPENSCREEN project: 1 cell-based assay on human HepG2 cells, 1 assay measuring inhibition of SARS-CoV-2-induced cytopathy, and 2 assays measuring inhibition of SARS-CoV-2 3CL-Pro proteolytic cleavage. 1813 bioactivities in total have been added.

Zimmermann Lab Biotransformation data Dec 2023 (src_id = 69): 271 compounds have been tested for biotransformation in 68 bacterial species and 28 bacterial communities. Metabolism of the tested compound was positive in 8844 of the 22306 activity results; the biotransformation outcome is recorded in the STANDARD_TEXT_VALUE field of the ACTIVITIES table.

New Deposited Datasets

CHEMBL5291702 - European Medicines Agency
CHEMBL5303304 - EUbOPEN Chemogenomics Library - IncuCyte (assays link to document CHEMBL4689842)
CHEMBL5303761 - Data for DCP probe BI-3231
CHEMBL5303762 - Data for DCP probe BI-8668
CHEMBL5303763 - Data for DCP probe BI-3802
CHEMBL5303764 - Data for DCP probe BI-3812
CHEMBL5303765 - Data for DCP probe BAY-7081
CHEMBL5303766 - Data for DCP probe FHT-2344
CHEMBL5303767 - Data for DCP probe JNJ-4355
CHEMBL5303768 - Data for DCP probe TP-060
CHEMBL5303769 - Data for DCP probe JNJ-42226314
CHEMBL5303708 - ECBD screening data for assay EOS300033
CHEMBL5303709 - ECBD screening data for assay EOS300041
CHEMBL5303710 - ECBD screening data for assay EOS300044
CHEMBL5303711 - ECBD screening data for assay EOS300108
CHEMBL5303300 - EUbOPEN Chemogenomics Library - Multiplex (assays link to document CHEMBL4689842)
CHEMBL5305021 - RapidFire TcLAP Compounds Screening
CHEMBL5308504 - Tm Shift (DSF) assay results for EUbOPEN Chemogenomics Library (assays link to documents CHEMBL5060014 and CHEMBL4649998)

Data Highlights

A new source of drug data from the European Medicines Agency (EMA) has been included in ChEMBL for this release. 

The MOLECULE_DICTIONARY.ORPHAN field has been added to indicate whether a drug has orphan designation, i.e. intended for use against a rare condition (1 = yes, 0 = no, -1 = preclinical compound i.e. not a drug). This data is currently available for European Medicines Agency drugs only.
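As a minimal sketch of how the new flag might be queried, the Python snippet below builds a tiny in-memory SQLite table and selects the orphan-designated drugs. Only the MOLECULE_DICTIONARY table, the ORPHAN column and its three values come from the release notes; the rows, the placeholder IDs and the helper name `orphan_drugs` are invented for illustration. The same SELECT should work against a ChEMBL 34 SQLite download, assuming the column names match.

```python
import sqlite3

def orphan_drugs(conn):
    """Return (chembl_id, pref_name) for molecules flagged ORPHAN = 1."""
    sql = """
        SELECT chembl_id, pref_name
        FROM molecule_dictionary
        WHERE orphan = 1
        ORDER BY chembl_id
    """
    return conn.execute(sql).fetchall()

# Toy data for illustration only; these rows are invented, not real ChEMBL entries.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE molecule_dictionary (chembl_id TEXT, pref_name TEXT, orphan INTEGER)"
)
conn.executemany(
    "INSERT INTO molecule_dictionary VALUES (?, ?, ?)",
    [
        ("CHEMBL_A", "DRUG A", 1),   # drug with orphan designation
        ("CHEMBL_B", "DRUG B", 0),   # drug without orphan designation
        ("CHEMBL_C", None, -1),      # preclinical compound, i.e. not a drug
    ],
)

rows = orphan_drugs(conn)
print(rows)
```

Filtering on `orphan = 1` rather than `orphan <> 0` matters here, because -1 marks preclinical compounds and would otherwise be counted as orphan drugs.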

The field MOLECULE_DICTIONARY.MOLECULE_TYPE has been updated to include ‘Antibody drug conjugate’ in addition to the existing categories.

The prodrug data has been fully revised and updated. This includes the MOLECULE_DICTIONARY.PRODRUG field, which indicates whether a drug is a prodrug (= 1) or not (= 0), as well as its pharmacologically active molecule, which is given in the MOLECULE_HIERARCHY.ACTIVE_MOLREGNO field (and as src_id = 53).
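One way to pair each prodrug with its active molecule is a self-join through MOLECULE_HIERARCHY, sketched below in Python on an in-memory SQLite database. The PRODRUG and ACTIVE_MOLREGNO fields are those named above; the join on MOLREGNO, the toy rows and the helper name `prodrug_pairs` are assumptions made for the example.

```python
import sqlite3

def prodrug_pairs(conn):
    """Return (prodrug name, active molecule name) for every PRODRUG = 1 entry."""
    sql = """
        SELECT pro.pref_name, act.pref_name
        FROM molecule_dictionary AS pro
        JOIN molecule_hierarchy  AS mh  ON mh.molregno  = pro.molregno
        JOIN molecule_dictionary AS act ON act.molregno = mh.active_molregno
        WHERE pro.prodrug = 1
        ORDER BY pro.pref_name
    """
    return conn.execute(sql).fetchall()

# Toy data for illustration only; names and molregnos are invented.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE molecule_dictionary (molregno INTEGER, pref_name TEXT, prodrug INTEGER)"
)
conn.execute(
    "CREATE TABLE molecule_hierarchy (molregno INTEGER, active_molregno INTEGER)"
)
conn.executemany(
    "INSERT INTO molecule_dictionary VALUES (?, ?, ?)",
    [(1, "PRODRUG X", 1), (2, "ACTIVE X", 0), (3, "DRUG Y", 0)],
)
# Only the prodrug carries a hierarchy row in this toy example.
conn.execute("INSERT INTO molecule_hierarchy VALUES (1, 2)")

pairs = prodrug_pairs(conn)
print(pairs)
```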

The coverage of manually curated molecule sequence data has been revised and extended. The data now includes protein and nucleic acid sequences for INNs and USANs. See BIOTHERAPEUTIC_COMPONENTS and BIO_COMPONENT_SEQUENCES. 

Drug indication coverage has been extended to include new EMA approved drug indications, as well as USAN and INN clinical candidate indications, and their mapping to MeSH and EFO ontologies. See DRUG_INDICATION and INDICATION_REFS.

The data in MOLECULE_DICTIONARY.MAX_PHASE now includes consideration of:
  • EMA approved drugs (max_phase = 4 for human drugs),
  • USAN clinical candidate drugs (assigned as max_phase = 1 based on USAN guidance that states “Firms usually apply for a USAN when the investigational therapy is in Phase I or Phase II trials”. See https://www.ama-assn.org/about/united-states-adopted-names/apply-united-states-adopted-name), and
  • INN clinical candidate drugs (assigned as max_phase = 2 based on INN guidance that states “As a general guide, the development of a drug should progress up to the point of clinical trials (phase II) before an application is submitted to the INN Secretariat for name selection.” See https://www.who.int/publications/m/item/guidance-on-the-use-of-inns).
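The three assignments above can be sketched as a small lookup in Python. The phase values are those stated in this post. Taking the highest value when a molecule has several kinds of evidence is an assumption suggested by the field name, and the evidence labels and the function name are hypothetical. Other inputs to the real pipeline (for example FDA approvals or clinical trial phases) are not modelled.

```python
# Phase implied by each kind of evidence, as described in the release notes.
PHASE_BY_EVIDENCE = {
    "EMA_APPROVED": 4,  # EMA approved human drug
    "INN": 2,           # INN clinical candidate
    "USAN": 1,          # USAN clinical candidate
}

def max_phase(evidence):
    """Return the highest phase implied by the evidence labels, or None if none apply.

    Assumption: when several labels apply, the maximum wins.
    """
    phases = [PHASE_BY_EVIDENCE[e] for e in evidence if e in PHASE_BY_EVIDENCE]
    return max(phases) if phases else None

print(max_phase({"USAN"}))
print(max_phase({"USAN", "INN"}))
print(max_phase({"INN", "EMA_APPROVED"}))
```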

Pref_name curation. Progress has been made towards standardising drug and clinical candidate pref_names (in MOLECULE_DICTIONARY.PREF_NAME): an approved drug name (FDA/EMA) is assigned in the first instance, if available. If not, the USAN is assigned, and failing that the INN. If the USAN/INN name assignment is ambiguous, the FDA GSRS preferred name is used. A company research code, or Clinical Trial intervention name, is assigned if no standardised name is available. For virtual parent compounds, progress has been made towards assigning a distinct pref_name (typically based on the FDA GSRS preferred name) that differs from the child compound name.
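The naming precedence can be read as a simple fall-through, sketched here in Python. This is one interpretation of the rules above, not the curation code itself: the dictionary keys, the `usan_inn_ambiguous` flag and the function name are hypothetical, and placing the research code ahead of the clinical trial intervention name is an assumption, since the post does not rank the two.

```python
def choose_pref_name(names, usan_inn_ambiguous=False):
    """Pick a pref_name from a dict of candidate names, highest priority first."""
    # 1. An approved drug name (FDA/EMA) always wins.
    if names.get("approved"):
        return names["approved"]
    if usan_inn_ambiguous:
        # 2a. Ambiguous USAN/INN assignment: use the FDA GSRS preferred name.
        if names.get("gsrs"):
            return names["gsrs"]
    else:
        # 2b. Otherwise the USAN, then the INN.
        for key in ("usan", "inn"):
            if names.get(key):
                return names[key]
    # 3. No standardised name: research code or trial intervention name
    #    (the order between these two is assumed).
    for key in ("research_code", "trial_intervention"):
        if names.get(key):
            return names[key]
    return None

print(choose_pref_name({"usan": "U-NAME", "inn": "I-NAME"}))
print(choose_pref_name({"usan": "U-NAME", "gsrs": "G-NAME"}, usan_inn_ambiguous=True))
```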

Synonym curation. The data in MOLECULE_SYNONYMS now includes manually curated Spanish and French INN synonyms, in addition to the existing English INN synonyms. Manual curation has reduced the number of cases in which the same synonym is assigned to two (or more) different drugs or clinical candidates.

The descriptions of the drug and clinical candidate sources have been reviewed and updated to improve clarity (i.e. SOURCE.SRC_DESCRIPTION and SOURCE.SRC_SHORT_NAME for SRC_ID = {8, 9, 12, 13, 36, 41, 42, 53, 63, 66}). For example:
  • FDA_ORANGE_BOOK (src_id = 9) is now described as “FDA Approved Drug Products with Therapeutic Equivalence Evaluations (Orange Book)”,
  • FDA_NEW_DRUGS (src_id = 12) is now described as “FDA New Molecular Entity and New Therapeutic Biological Product Approvals (New FDA Drugs)”, and
  • PRODRUG_ACTIVE (src_id = 53) is now described as “Active Ingredient of a Prodrug”.

The black_box_warning pipeline has been updated to capture any new FDA labels up to 31st December 2023 with black box warnings for severe or life-threatening side effects.

The clinical trials pipeline has been updated up to 29th June 2023 to capture data for clinical trial interventions, conditions and phase in ClinicalTrials.gov that can be mapped to ChEMBL data.

New data have been included in the VERSION table to show the version applied for MeSH and EFO ontologies, the ChEMBL_Structure_Pipeline, RDKit packages, InChI, UniProtKB, Bioassay Ontology and Gene Ontology as well as the version of the ChEMBL database.  

The definition of a chemical probe has been amended to update the field MOLECULE_DICTIONARY.CHEMICAL_PROBE. The data set of chemical probes was retrieved from 1) the chemicalprobes.org website and filtered for probes that were assigned an In Vivo Rating or In Cell Rating of 3 stars or more, and from 2) probes-drugs.org by filtering for the subsets “SGC Probes” and “Open Science Probes”.  Data set retrieved on 09/02/2024.
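The two inclusion criteria can be expressed as a single predicate, sketched below in Python. The rating threshold and the subset names are taken from the text above; the record layout (`source`, `in_vivo_rating`, `in_cell_rating`, `subsets`) and the function name are hypothetical and do not reflect the export format of either website.

```python
# Subsets of probes-drugs.org that qualify, as named in the release notes.
QUALIFYING_SUBSETS = {"SGC Probes", "Open Science Probes"}

def is_chemical_probe(record):
    """Return True if a record meets either inclusion criterion."""
    source = record.get("source")
    if source == "chemicalprobes.org":
        # In Vivo Rating or In Cell Rating of 3 stars or more.
        ratings = (record.get("in_vivo_rating"), record.get("in_cell_rating"))
        return any(r is not None and r >= 3 for r in ratings)
    if source == "probes-drugs.org":
        # Member of at least one qualifying subset.
        return bool(QUALIFYING_SUBSETS & set(record.get("subsets", ())))
    return False

# Invented example records.
print(is_chemical_probe({"source": "chemicalprobes.org", "in_cell_rating": 4}))
print(is_chemical_probe({"source": "probes-drugs.org", "subsets": ["Other"]}))
```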

When possible, AUC record units were converted to ng.hr.mL-1 and Cmax record units were converted to nM by creating new conversion rules.
For legacy AUC and Cmax records (from releases before ChEMBL 34), pharmacokinetic parameters have been extracted from the assay descriptions using regular expression (RegEx) matching. This affects only records whose PK parameters had not already been manually extracted. Dose, dose unit, route of administration and time (for AUC only) have been loaded in ACTIVITY_PROPERTIES.
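To give a flavour of both steps, the Python sketch below shows a unit conversion and a regular-expression extraction. Neither reproduces the rules used in ChEMBL, which are not given in this post: the regular expression, the unit spellings, the example description and all function names are hypothetical. The conversion to nM assumes the molecular weight is known and uses the fact that 1 ng/mL equals 1000 / MW nM.

```python
import re

# Hypothetical factors to express AUC values in ng.hr.mL-1.
AUC_FACTORS = {"ng.hr.mL-1": 1.0, "ug.hr.mL-1": 1e3, "mg.hr.mL-1": 1e6}

def auc_to_ng_hr_per_ml(value, unit):
    """Convert an AUC value to ng.hr.mL-1 (raises KeyError for unknown units)."""
    return value * AUC_FACTORS[unit]

def cmax_ng_per_ml_to_nm(value, mol_weight):
    """Convert a Cmax in ng/mL to nM using the molecular weight in g/mol."""
    return value * 1000.0 / mol_weight

# Hypothetical pattern: dose value, dose unit, then an optional route abbreviation.
DOSE_RE = re.compile(
    r"(?P<dose>\d+(?:\.\d+)?)\s*(?P<unit>mg/kg|ug/kg|mg)\b"
    r"(?:\s*,?\s*(?P<route>po|iv|ip|sc)\b)?",
    re.IGNORECASE,
)

def extract_pk_properties(description):
    """Pull dose, dose unit and route out of a free-text assay description."""
    match = DOSE_RE.search(description)
    if match is None:
        return None
    route = match.group("route")
    return {
        "dose": float(match.group("dose")),
        "dose_unit": match.group("unit"),
        "route": route.lower() if route else None,
    }

# Invented assay description, for illustration only.
props = extract_pk_properties("Cmax in rat at 10 mg/kg, po")
print(props)
print(auc_to_ng_hr_per_ml(2.5, "ug.hr.mL-1"))
print(cmax_ng_per_ml_to_nm(500, 250))
```

A pattern this simple will miss many phrasings, which is one reason to apply such extraction only to records without manually curated PK parameters.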

Organism name/taxonomy updates: a review of organism-related data was undertaken to align organism names with the NCBI taxonomy and to update obsolete taxonomy IDs. In addition, where multiple strain-level targets existed in ChEMBL, these were merged at the species level and the strain details were migrated to the assay_strain field in the ASSAYS table. The initial round of organism updates for ChEMBL 34 affected the ASSAYS, TARGET_DICTIONARY and ORGANISM_CLASS tables. In the ASSAYS table, the assay_organism/assay_tax_id fields were updated for ~37,144 legacy rows. In the TARGET_DICTIONARY table, these updates affected 428 legacy rows. A total of 61 updates to legacy data in the ORGANISM_CLASS table were made, with 14 entries downgraded.

Funding acknowledgements:

Work contributing to ChEMBL 34 was funded by the Wellcome Trust, EMBL Member States, Open Targets, National Institutes of Health (NIH), EU Innovative Medicines Initiative (IMI) and EU Framework 7 programmes. Please see https://chembl.gitbook.io/chembl-interface-documentation/acknowledgments for more details.

If you require further information about ChEMBL, please contact us: chembl-help@ebi.ac.uk

To receive updates when new versions of ChEMBL are available, please sign up to our mailing list: http://listserver.ebi.ac.uk/mailman/listinfo/chembl-announce
For general queries/feedback or to report any problems with data, please email: chembl-help@ebi.ac.uk

