
ChEMBL 34 is out!

We are delighted to announce the release of ChEMBL 34, which includes a full update to drug and clinical candidate drug data. This version of the database, prepared on 28/03/2024, contains:

  • 2,431,025 compounds (of which 2,409,270 have mol files)
  • 3,106,257 compound records (non-unique compounds)
  • 20,772,701 activities
  • 1,644,390 assays
  • 15,598 targets
  • 89,892 documents
Data can be downloaded from the ChEMBL FTP site: https://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/releases/chembl_34/

Please see ChEMBL_34 release notes for full details of all changes in this release: https://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/releases/chembl_34/chembl_34_release_notes.txt

New Data Sources



European Medicines Agency (src_id = 66): The European Medicines Agency data correspond to EMA drugs prior to 20 January 2023 (excluding vaccines). Of the 882 newly added EMA drugs, 71 are authorised only by the EMA and not by other regulatory bodies, e.g. the FDA. A significant effort has been made to correctly map the drug form of the EMA data by manually inspecting different EMA sources of information, such as the Product Information (Annex I: Summary of Product Characteristics and Annex III: Labelling and Package Leaflet) and/or the Assessment Report, where available.

University of Dundee: T. cruzi data (src_id = 67): 3328 compounds that harbour common protease inhibitor motifs were screened at 30 µM against LAPTc, using a RapidFire-MS method, for inhibitory activity against the TcLAP protein.

EU-OPENSCREEN dataset (src_id = 68): 4 assays have been deposited by the EU-OPENSCREEN project: 1 cell-based assay on human HepG2 cells, 1 assay measuring inhibition of SARS-CoV2-induced cytopathy, and 2 assays measuring inhibition of SARS-CoV2 3CL-Pro proteolytic cleavage. 1813 bioactivities in total have been added.

Zimmermann Lab Biotransformation data Dec 2023 (src_id = 69): 271 compounds have been tested for biotransformation in 68 bacterial species and 28 bacterial communities.  The metabolism of these compounds was positive in 8844 of 22306 activity results; biotransformation is recorded in the STANDARD_TEXT_VALUE field of the ACTIVITIES table.

New Deposited Datasets

CHEMBL5291702 - European Medicines Agency
CHEMBL5303304 - EUbOPEN Chemogenomics Library - IncuCyte (assays link to document CHEMBL4689842)
CHEMBL5303761 - Data for DCP probe BI-3231
CHEMBL5303762 - Data for DCP probe BI-8668
CHEMBL5303763 - Data for DCP probe BI-3802
CHEMBL5303764 - Data for DCP probe BI-3812
CHEMBL5303765 - Data for DCP probe BAY-7081
CHEMBL5303766 - Data for DCP probe FHT-2344
CHEMBL5303767 - Data for DCP probe JNJ-4355
CHEMBL5303768 - Data for DCP probe TP-060
CHEMBL5303769 - Data for DCP probe JNJ-42226314
CHEMBL5303708 - ECBD screening data for assay EOS300033
CHEMBL5303709 - ECBD screening data for assay EOS300041
CHEMBL5303710 - ECBD screening data for assay EOS300044
CHEMBL5303711 - ECBD screening data for assay EOS300108
CHEMBL5303300 - EUbOPEN Chemogenomics Library - Multiplex (assays link to document CHEMBL4689842)
CHEMBL5305021 - RapidFire TcLAP Compounds Screening
CHEMBL5308504 - Tm Shift (DSF) assay results for EUbOPEN Chemogenomics Library (assays link to documents CHEMBL5060014 and CHEMBL4649998)

Data Highlights

A new source of drug data from the European Medicines Agency (EMA) has been included in ChEMBL for this release. 

The MOLECULE_DICTIONARY.ORPHAN field has been added to indicate whether a drug has orphan designation, i.e. intended for use against a rare condition (1 = yes, 0 = no, -1 = preclinical compound i.e. not a drug). This data is currently available for European Medicines Agency drugs only.
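As a rough illustration of how the ORPHAN flag could be queried, the sketch below builds a tiny in-memory SQLite stand-in for MOLECULE_DICTIONARY. The rows and names (DRUG_A, DRUG_B) are invented for the example, and only the columns needed here are modelled; the 1 / 0 / -1 coding is taken from the description above.

```python
import sqlite3

# Toy stand-in for MOLECULE_DICTIONARY; rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE molecule_dictionary ("
    "molregno INTEGER PRIMARY KEY, pref_name TEXT, orphan INTEGER)"
)
conn.executemany(
    "INSERT INTO molecule_dictionary VALUES (?, ?, ?)",
    [
        (1, "DRUG_A", 1),   # drug with orphan designation
        (2, "DRUG_B", 0),   # drug without orphan designation
        (3, None, -1),      # preclinical compound, i.e. not a drug
    ],
)


def orphan_drugs(connection):
    """Return the names of drugs flagged with orphan designation (orphan = 1)."""
    rows = connection.execute(
        "SELECT pref_name FROM molecule_dictionary "
        "WHERE orphan = 1 ORDER BY pref_name"
    )
    return [name for (name,) in rows]


orphan_names = orphan_drugs(conn)
print(orphan_names)
```

Because -1 marks preclinical compounds, filtering on `orphan = 1` (rather than `orphan <> 0`) avoids counting non-drugs.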

The field MOLECULE_DICTIONARY.MOLECULE_TYPE has been updated to include ‘Antibody drug conjugate’ in addition to the existing categories.

The prodrug data has been fully revised and updated. This includes the MOLECULE_DICTIONARY.PRODRUG field, which indicates whether a drug is a prodrug (= 1) or not (= 0), and the pharmacologically active molecule of each prodrug, which is given in the MOLECULE_HIERARCHY.ACTIVE_MOLREGNO field (and as src_id = 53).
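One way to follow a prodrug to its active molecule is a join from MOLECULE_DICTIONARY through MOLECULE_HIERARCHY.ACTIVE_MOLREGNO. The sketch below uses an in-memory SQLite database with invented rows (PRODRUG_X, ACTIVE_X, DRUG_Y) and only the columns needed for the join; it is a toy model, not a dump of the real tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE molecule_dictionary (
        molregno INTEGER PRIMARY KEY, pref_name TEXT, prodrug INTEGER);
    CREATE TABLE molecule_hierarchy (
        molregno INTEGER PRIMARY KEY,
        parent_molregno INTEGER,
        active_molregno INTEGER);

    -- Invented rows: molecule 10 is a prodrug whose active form is molecule 11.
    INSERT INTO molecule_dictionary VALUES
        (10, 'PRODRUG_X', 1), (11, 'ACTIVE_X', 0), (12, 'DRUG_Y', 0);
    INSERT INTO molecule_hierarchy VALUES
        (10, 10, 11), (11, 11, 11), (12, 12, 12);
    """
)

# Pair each prodrug with the name of its pharmacologically active molecule.
pairs = conn.execute(
    """
    SELECT p.pref_name, a.pref_name
    FROM molecule_dictionary AS p
    JOIN molecule_hierarchy AS h ON h.molregno = p.molregno
    JOIN molecule_dictionary AS a ON a.molregno = h.active_molregno
    WHERE p.prodrug = 1
    ORDER BY p.pref_name
    """
).fetchall()
print(pairs)
```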

The coverage of manually curated molecule sequence data has been revised and extended. The data now includes protein and nucleic acid sequences for INNs and USANs. See BIOTHERAPEUTIC_COMPONENTS and BIO_COMPONENT_SEQUENCES. 

Drug indication coverage has been extended to include new EMA approved drug indications, as well as USAN and INN clinical candidate indications, and their mapping to MeSH and EFO ontologies. See DRUG_INDICATION and INDICATION_REFS.

The data in MOLECULE_DICTIONARY.MAX_PHASE now includes consideration of:
  • EMA approved drugs (max_phase = 4 for human drugs),
  • USAN clinical candidate drugs (assigned as max_phase = 1 based on USAN guidance that states “Firms usually apply for a USAN when the investigational therapy is in Phase I or Phase II trials”. See https://www.ama-assn.org/about/united-states-adopted-names/apply-united-states-adopted-name), and
  • INN clinical candidate drugs (assigned as max_phase = 2 based on INN guidance that states “As a general guide, the development of a drug should progress up to the point of clinical trials (phase II) before an application is submitted to the INN Secretariat for name selection.” See https://www.who.int/publications/m/item/guidance-on-the-use-of-inns).
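The three phase values above can be read as a simple lookup. The sketch below is an assumption-laden illustration, not the ChEMBL pipeline: it covers only the three evidence types listed here (other inputs, such as FDA approvals or clinical trial data, are left out), the evidence keys are hypothetical labels, and it assumes the highest applicable phase wins when a compound has several kinds of evidence.

```python
# Phase values quoted above for each evidence type (keys are hypothetical labels).
PHASE_BY_EVIDENCE = {
    "ema_approved": 4,  # EMA approved human drug
    "inn": 2,           # INN clinical candidate
    "usan": 1,          # USAN clinical candidate
}


def max_phase(evidence):
    """Return the highest phase implied by the given evidence labels.

    Assumption: the largest value wins. Returns None when no listed
    evidence type applies.
    """
    phases = [PHASE_BY_EVIDENCE[e] for e in evidence if e in PHASE_BY_EVIDENCE]
    return max(phases) if phases else None


print(max_phase(["usan", "inn"]))          # candidate with both a USAN and an INN
print(max_phase(["inn", "ema_approved"]))  # EMA approved drug that also has an INN
```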

Pref_name curation. Progress has been made towards standardising drug and clinical candidate pref_names (in MOLECULE_DICTIONARY.PREF_NAME), whereby an approved drug name (FDA/EMA) is assigned in the first instance, if available. If not available, the USAN is assigned, followed by the INN. If the USAN/INN name assignment is ambiguous, the FDA GSRS preferred name is used. A company research code, or Clinical Trial intervention name, is assigned if no standardised name is available. For virtual parent compounds, progress has been made towards assigning a distinct pref_name (typically based on the FDA GSRS preferred name) that differs from the child compound name.
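The precedence described above can be sketched as a small function. This is only an illustration of the ordering; the function name, its parameters and the example names are hypothetical, and the real curation is a manual process that handles cases this sketch ignores (such as virtual parent compounds).

```python
def choose_pref_name(approved=None, usan=None, inn=None, gsrs=None,
                     fallback=None, usan_inn_ambiguous=False):
    """Pick a preferred name following the precedence described above.

    approved: approved drug name (FDA/EMA)
    usan, inn: USAN and INN names
    gsrs: FDA GSRS preferred name, used when the USAN/INN assignment is ambiguous
    fallback: company research code or clinical trial intervention name
    """
    if approved:
        return approved
    if usan_inn_ambiguous:
        if gsrs:
            return gsrs
    else:
        if usan:
            return usan
        if inn:
            return inn
    return fallback


print(choose_pref_name(approved="APPROVED NAME", usan="USAN NAME"))
print(choose_pref_name(usan="USAN NAME", inn="INN NAME"))
print(choose_pref_name(fallback="ABC-123"))
```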

Synonym curation. The data in MOLECULE_SYNONYMS now includes manually curated Spanish and French INN synonyms, as well as the existing English INN synonyms. Manual curation has reduced the number of cases where the same synonym is assigned to two (or more) different drugs or clinical candidates.

The descriptions of the drug and clinical candidate sources have been reviewed and updated to improve clarity (i.e. SOURCE.SRC_DESCRIPTION and SOURCE.SRC_SHORT_NAME for SRC_ID = {8, 9, 12, 13, 36, 41, 42, 53, 63, 66}):
  • FDA_ORANGE_BOOK (src_id = 9) is now described as “FDA Approved Drug Products with Therapeutic Equivalence Evaluations (Orange Book)”.
  • FDA_NEW_DRUGS (src_id = 12) is now described as “FDA New Molecular Entity and New Therapeutic Biological Product Approvals (New FDA Drugs)”.
  • PRODRUG_ACTIVE (src_id = 53) is now described as “Active Ingredient of a Prodrug”.

The black_box_warning pipeline has been updated to capture any new FDA labels up to 31st December 2023 with black box warnings for severe or life-threatening side effects.

The clinical trials pipeline has been updated up to 29th June 2023 to capture data for clinical trial interventions, conditions and phase in ClinicalTrials.gov that can be mapped to ChEMBL data.

New data have been included in the VERSION table to show the version applied for MeSH and EFO ontologies, the ChEMBL_Structure_Pipeline, RDKit packages, InChI, UniProtKB, Bioassay Ontology and Gene Ontology as well as the version of the ChEMBL database.  

The definition of a chemical probe has been amended to update the field MOLECULE_DICTIONARY.CHEMICAL_PROBE. The data set of chemical probes was retrieved from 1) the chemicalprobes.org website and filtered for probes that were assigned an In Vivo Rating or In Cell Rating of 3 stars or more, and from 2) probes-drugs.org by filtering for the subsets “SGC Probes” and “Open Science Probes”.  Data set retrieved on 09/02/2024.
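The two inclusion criteria can be expressed as simple predicates. The sketch below assumes hypothetical inputs (star ratings as numbers, subset names as strings); how the source websites expose this information is not modelled here.

```python
# Subsets named in the probes-drugs.org criterion above.
ALLOWED_SUBSETS = {"SGC Probes", "Open Science Probes"}


def passes_chemicalprobes_filter(in_vivo_stars=None, in_cell_stars=None, threshold=3):
    """True if either the In Vivo or the In Cell rating is at least `threshold` stars."""
    ratings = [r for r in (in_vivo_stars, in_cell_stars) if r is not None]
    return any(r >= threshold for r in ratings)


def passes_probes_drugs_filter(subsets):
    """True if the compound belongs to at least one of the allowed subsets."""
    return bool(ALLOWED_SUBSETS & set(subsets))


print(passes_chemicalprobes_filter(in_vivo_stars=2, in_cell_stars=4))
print(passes_probes_drugs_filter(["SGC Probes"]))
```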

When possible, AUC record units were converted to ng.hr.mL-1 and Cmax record units were converted to nM by creating new conversion rules.
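To make the idea of a conversion rule concrete, here is a minimal sketch of the arithmetic involved. The unit strings in the lookup table and the function names are assumptions for illustration, not the actual ChEMBL conversion rules. Converting a mass concentration to nM needs the molecular weight: ng/mL equals µg/L, dividing by the molecular weight in g/mol gives µmol/L, and multiplying by 1000 gives nmol/L.

```python
# Hypothetical multipliers for rescaling AUC values to ng.hr.mL-1.
AUC_FACTORS_TO_NG_HR_PER_ML = {
    "ng.hr.mL-1": 1.0,
    "ug.hr.mL-1": 1e3,
    "mg.hr.mL-1": 1e6,
}


def auc_to_ng_hr_per_ml(value, unit):
    """Rescale an AUC value to ng.hr.mL-1, or return None if the unit is unknown."""
    factor = AUC_FACTORS_TO_NG_HR_PER_ML.get(unit)
    return None if factor is None else value * factor


def cmax_ng_per_ml_to_nm(value_ng_per_ml, mol_weight):
    """Convert a Cmax in ng/mL to nM, given the molecular weight in g/mol."""
    return value_ng_per_ml / mol_weight * 1000.0


print(auc_to_ng_hr_per_ml(2.5, "ug.hr.mL-1"))
print(cmax_ng_per_ml_to_nm(500.0, 250.0))
```

Returning None for an unrecognised unit mirrors the "when possible" wording: values without a known rule are left unconverted rather than guessed.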
For legacy AUC and Cmax records (from releases before ChEMBL 34), pharmacokinetic parameters have been extracted from the assay descriptions using regular expression (RegEx) matching. This affects only records whose PK parameters had not already been manually extracted. Dose, dose unit, route of administration and time (for AUC only) have been loaded into ACTIVITY_PROPERTIES.
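The sketch below shows what regular-expression extraction of this kind might look like. The patterns, the property keys and the example assay descriptions are all invented for illustration; the expressions actually used for ChEMBL 34 are not given here and are likely to be considerably more thorough.

```python
import re

# Hypothetical patterns; real assay descriptions vary far more than this.
DOSE_RE = re.compile(r"(?P<dose>\d+(?:\.\d+)?)\s*(?P<unit>mg/kg|ug/kg|mg)\b", re.IGNORECASE)
ROUTE_RE = re.compile(r"\b(?P<route>po|iv|ip|sc)\b", re.IGNORECASE)
TIME_RE = re.compile(r"\(0 to (?P<time>\d+(?:\.\d+)?)\s*(?P<unit>hrs?|h|min)\)", re.IGNORECASE)


def extract_pk_properties(description):
    """Pull dose, dose unit, route and (for AUC only) time from a description."""
    props = {}
    dose = DOSE_RE.search(description)
    if dose:
        props["DOSE"] = float(dose.group("dose"))
        props["DOSE_UNIT"] = dose.group("unit")
    route = ROUTE_RE.search(description)
    if route:
        props["ROUTE"] = route.group("route").lower()
    # Time is only extracted for AUC records, as described above.
    if description.upper().startswith("AUC"):
        time = TIME_RE.search(description)
        if time:
            props["TIME"] = float(time.group("time"))
            props["TIME_UNIT"] = time.group("unit")
    return props


print(extract_pk_properties("AUC (0 to 24 hrs) in rat at 5 mg/kg, iv"))
print(extract_pk_properties("Cmax in rat at 10 mg/kg, po"))
```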

Organism name/taxonomy updates: a review of organism-related data was undertaken to align organism names with the NCBI taxonomy and to update obsolete taxonomy IDs. In addition, where multiple strain-level targets existed in ChEMBL, these were merged at the species level and the strain details migrated to the assay_strain field in the ASSAYS table. The initial round of organism updates for ChEMBL 34 impacted the ASSAYS, TARGET_DICTIONARY and ORGANISM_CLASS tables. In the ASSAYS table, the assay_organism/assay_tax_ID fields were updated for ~37,144 legacy rows. In the TARGET_DICTIONARY table, these updates affected 428 legacy rows. A total of 61 updates to legacy data in the ORGANISM_CLASS table were made, with 14 entries downgraded.

Funding acknowledgements:

Work contributing to ChEMBL 34 was funded by the Wellcome Trust, EMBL Member States, Open Targets, National Institutes of Health (NIH), EU Innovative Medicines Initiative (IMI) and EU Framework 7 programmes. Please see https://chembl.gitbook.io/chembl-interface-documentation/acknowledgments for more details.

If you require further information about ChEMBL, please contact us: chembl-help@ebi.ac.uk

To receive updates when new versions of ChEMBL are available, please sign up to our mailing list: http://listserver.ebi.ac.uk/mailman/listinfo/chembl-announce
For general queries/feedback or to report any problems with data, please email: chembl-help@ebi.ac.uk

