
Pathogen data in ChEMBL



Infectious disease is a leading cause of death globally, and bioactivity data against pathogens (fungi, bacteria, viruses and parasites) is an important category in ChEMBL, especially in light of the ongoing pandemic. ChEMBL version 29 contains over 2 M bioactivity data points against fungal, bacterial or viral targets (covering 460 K compounds) that are available for pathogen-related research.


How can I find pathogen data?


On the ChEMBL interface, the organism taxonomy is available as a filter that can be applied to bioactivity data. A sunburst visualisation of the organism taxonomy is also provided as an easy starting point to explore targets according to their taxonomy.



In the full database, the organism_classification table holds the underlying data and can be used in bespoke SQL queries. For example, queries can extract high-level pathogen data, such as all bioactivity data for small molecules screened against bacterial targets, or more specific subsets focused on gram-positive pathogens or on a single bacterial species. Targets include whole organisms as well as molecular targets (proteins, nucleic acids etc.), and additional filters can be applied to restrict the target type as necessary.
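As a minimal sketch of the kind of bespoke query described above, the Python snippet below builds a tiny in-memory SQLite database and selects activities measured against bacterial targets, optionally narrowed by Gram stain or target type. The table and column names (`organism_class` with `tax_id`, `l1`, `l2`, `l3`; `target_dictionary`; `assays`; `activities`) are assumptions modelled on the ChEMBL schema documentation, and every row is invented toy data, so please check the names against the release you download before running the query on the real database.

```python
import sqlite3

# Toy stand-in for a ChEMBL SQLite download. Table and column names are
# ASSUMPTIONS modelled on the ChEMBL schema; all rows are made-up examples.
SCHEMA = """
CREATE TABLE organism_class (tax_id INTEGER, l1 TEXT, l2 TEXT, l3 TEXT);
CREATE TABLE target_dictionary (tid INTEGER, target_type TEXT,
                                pref_name TEXT, tax_id INTEGER);
CREATE TABLE assays (assay_id INTEGER, tid INTEGER);
CREATE TABLE activities (activity_id INTEGER, assay_id INTEGER,
                         molregno INTEGER, standard_type TEXT,
                         standard_value REAL, standard_units TEXT);
"""


def build_toy_db():
    """Create an in-memory database holding a handful of illustrative rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO organism_class VALUES (?, ?, ?, ?)", [
        (1280, "Bacteria", "Gram-Positive", "Staphylococcus"),
        (562, "Bacteria", "Gram-Negative", "Escherichia"),
        (5476, "Fungi", "Ascomycota", "Saccharomycetales"),
    ])
    conn.executemany("INSERT INTO target_dictionary VALUES (?, ?, ?, ?)", [
        (1, "ORGANISM", "Staphylococcus aureus", 1280),
        (2, "SINGLE PROTEIN", "DNA gyrase subunit B", 562),
        (3, "ORGANISM", "Candida albicans", 5476),
    ])
    conn.executemany("INSERT INTO assays VALUES (?, ?)",
                     [(10, 1), (11, 2), (12, 3)])
    conn.executemany("INSERT INTO activities VALUES (?, ?, ?, ?, ?, ?)", [
        (100, 10, 1, "MIC", 2.0, "ug.mL-1"),
        (101, 11, 2, "IC50", 50.0, "nM"),
        (102, 12, 1, "MIC", 8.0, "ug.mL-1"),
    ])
    return conn


def bacterial_activities(conn, gram=None, target_type=None):
    """Return activities against bacterial targets, with optional filters."""
    sql = """
        SELECT act.activity_id, td.pref_name, td.target_type, oc.l2,
               act.standard_type, act.standard_value, act.standard_units
        FROM activities act
        JOIN assays a ON act.assay_id = a.assay_id
        JOIN target_dictionary td ON a.tid = td.tid
        JOIN organism_class oc ON td.tax_id = oc.tax_id
        WHERE oc.l1 = 'Bacteria'
    """
    params = []
    if gram is not None:            # e.g. 'Gram-Positive'
        sql += " AND oc.l2 = ?"
        params.append(gram)
    if target_type is not None:     # e.g. 'ORGANISM' or 'SINGLE PROTEIN'
        sql += " AND td.target_type = ?"
        params.append(target_type)
    sql += " ORDER BY act.activity_id"
    return conn.execute(sql, params).fetchall()


if __name__ == "__main__":
    db = build_toy_db()
    for row in bacterial_activities(db, gram="Gram-Positive"):
        print(row)
```

To run the same query against a real ChEMBL SQLite file, replace `build_toy_db()` with `sqlite3.connect(...)` pointing at the downloaded database. A single species could be selected by filtering on `td.tax_id` instead of `oc.l2`, again assuming those column names hold in your release.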


What are the sources of pathogen data in ChEMBL?

We routinely extract bioactivity data from core medicinal chemistry journals and also accept deposited data (a full list can be found in the source table). In recent releases, data deposited by the Community for Open Antimicrobial Drug Discovery (CO-ADD, University of Queensland & Wellcome Trust) has enhanced our pathogen coverage. CO-ADD is an open-access, not-for-profit initiative in which compounds provided by academic researchers and industry scientists are screened against a clinically relevant panel of bacteria and fungi. So far, 100 K activities (against ~24 K compounds) have been provided through CO-ADD. Because CO-ADD may re-screen hits against resistant bacterial strains or in cytotoxicity assays, more comprehensive data is available for some compounds. There are now 31 CO-ADD datasets in ChEMBL 29 (data source: src_id 40), with more expected in upcoming releases.
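To illustrate how a deposited dataset can be picked out by its source, here is a small sketch that counts activities and distinct compounds for a given `src_id`. It assumes a `source` table (`src_id`, `src_short_name`) and a `src_id` column on `assays`; both should be checked against the schema documentation for your release. The rows, including the short name used for source 40, are toy data and not real CO-ADD counts.

```python
import sqlite3


def build_toy_db():
    """In-memory database with ASSUMED ChEMBL-like tables and made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE source (src_id INTEGER PRIMARY KEY, src_short_name TEXT);
        CREATE TABLE assays (assay_id INTEGER PRIMARY KEY, src_id INTEGER);
        CREATE TABLE activities (activity_id INTEGER PRIMARY KEY,
                                 assay_id INTEGER, molregno INTEGER);
    """)
    conn.executemany("INSERT INTO source VALUES (?, ?)",
                     [(1, "LITERATURE"), (40, "CO-ADD")])
    conn.executemany("INSERT INTO assays VALUES (?, ?)",
                     [(10, 1), (20, 40), (21, 40)])
    conn.executemany("INSERT INTO activities VALUES (?, ?, ?)",
                     [(100, 10, 1), (101, 20, 2), (102, 20, 3), (103, 21, 2)])
    return conn


def summarise_source(conn, src_id):
    """Return (short name, activity count, distinct compound count) or None."""
    return conn.execute(
        """
        SELECT s.src_short_name,
               COUNT(act.activity_id),
               COUNT(DISTINCT act.molregno)
        FROM source s
        JOIN assays a ON a.src_id = s.src_id
        JOIN activities act ON act.assay_id = a.assay_id
        WHERE s.src_id = ?
        GROUP BY s.src_short_name
        """,
        (src_id,),
    ).fetchone()


if __name__ == "__main__":
    db = build_toy_db()
    print(summarise_source(db, 40))
```

Counting distinct `molregno` values separately from activities reflects the point made above: a compound that is re-screened contributes several activities but is still one compound.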


ChEMBL also has a dedicated Neglected Tropical Disease repository (ChEMBL-NTD) for open-access primary screening and medicinal chemistry data directed at key parasites causing endemic tropical diseases. In addition, 22 datasets from screens of the ‘Malaria Box’ (MMV) compound set are provided through ChEMBL, ensuring good coverage of key parasites. Currently, there are ~950 K activities for Plasmodium species alone.


Finally, ChEMBL version 27 was a special SARS-CoV-2 release focused on large-scale drug screening studies for anti-viral activity, in particular cell-based assays with well-characterised compounds. Rapid integration of SARS-CoV-2 activity data into ChEMBL provided a contribution towards the COVID-19 effort and several follow-up datasets have since been captured in subsequent releases.


Questions? Please get in touch on the Helpdesk or have a look through our training materials and FAQs.
