Adding Biomedical Annotation to SureChEMBL: Beyond the Chemical Space

Dear users,

Since its introduction in 2015, SureChEMBL has been a database focused on chemical annotations. We extract compound structures from patent texts, images, and Molfiles when available, and register them in our database. This chemistry-first approach is even reflected in our name.

However, we know that intellectual property documents capture far more than chemistry. This was illustrated by Stefan Senger in 2017 (DOI: 10.1186/s13321-017-0214-2), who showed that compound–target interactions can appear in patents years before they are mentioned in the scientific literature.

Our first step into biomedical annotation

A few years ago, we took a first step beyond chemistry by adding annotations for genes/proteins, diseases, and mechanisms of action in the SureChEMBL UI. These were generated by an in-house Natural Language Processing (NLP) model that performed reasonably well for an initial version.

Example of biomedical annotation in a patent text using the NLP model

Annotation is only the first step, and normalization (or standardization) follows. This process maps each entity to a controlled vocabulary or ontology to remove duplicates and enable cross-referencing with other resources.

Managing uniqueness is challenging, even for chemical compounds. Fortunately, chemistry benefits from identifiers such as InChI, or registration hashes, that simplify this process. In contrast, biomedical entities such as proteins or diseases often have many synonyms. For example, “Type 2 diabetes mellitus” in the Ontology Lookup Service (OLS) has 25 exact synonyms (e.g. adult onset diabetes; diabetes mellitus, type 2; diabetes mellitus, type II; diabetes, type 2; non-insulin dependent diabetes mellitus; T2D; T2DM). All of these need to be recognized and mapped to a single preferred term.
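The mapping of synonyms to a preferred term can be sketched in a few lines of Python. This is only an illustration: the synonym table is a hand-written subset of the examples above, and the names `LOOKUP` and `normalize` are hypothetical, not part of SureChEMBL or OLS.

```python
# Illustrative synonym table (a small subset of the synonyms listed above).
PREFERRED = "type 2 diabetes mellitus"
SYNONYMS = {
    "adult onset diabetes",
    "diabetes mellitus, type 2",
    "diabetes mellitus, type ii",
    "diabetes, type 2",
    "non-insulin dependent diabetes mellitus",
    "t2d",
    "t2dm",
}

# Lookup from every known surface form to the single preferred term.
LOOKUP = {synonym: PREFERRED for synonym in SYNONYMS}
LOOKUP[PREFERRED] = PREFERRED


def normalize(mention):
    """Return the preferred term for a mention, or None if it is unmatched."""
    key = " ".join(mention.lower().split())  # case-fold and collapse whitespace
    return LOOKUP.get(key)


mentions = ["T2DM", "Diabetes Mellitus,  Type II", "adult onset diabetes", "gestational diabetes"]
normalized = [normalize(m) for m in mentions]

# Mentions without a dictionary entry are the ones that would need review.
unmatched = [m for m, n in zip(mentions, normalized) if n is None]
print(normalized)
print(unmatched)
```

Anything that ends up in `unmatched` is exactly the kind of term that has to be reviewed by hand or added to the vocabulary.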

Although text annotation models do not require pre-defined dictionaries, normalization inevitably does, and these vocabularies are frequently updated. Unmatched terms must often be reviewed and added manually, which is time-consuming. For SureChEMBL, where we aim for full automation, that’s a significant limitation.

Our in-house model was lightweight enough to generate annotations on the fly, but it still slowed page loading because it had to process the full text on every request (model prompt ~1,000 tokens).

Our new approach

To overcome these limitations, we adopted a new approach for non-chemical annotations: moving from our in-house NLP model to LeadMine, a commercial grammar- and dictionary-based system developed by NextMove Software.

LeadMine uses curated and public dictionaries with a custom grammar for fast and accurate text annotation. It includes options to automatically correct the spelling mistakes that OCR (optical character recognition) frequently introduces into patent text. Using the provided dictionaries, it can also resolve an annotation to a unique identifier.
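To give a feel for dictionary lookup with tolerance for OCR errors, here is a minimal sketch. It is not LeadMine's algorithm, which is proprietary; it uses Python's standard `difflib` as a stand-in. The dictionary, the placeholder identifiers (`EX:…`), the `resolve` function, and the cutoff value are all assumptions made for illustration.

```python
from difflib import get_close_matches

# Toy dictionary: surface form -> placeholder identifier (not real ontology IDs).
DICTIONARY = {
    "diabetes": "EX:0001",
    "asthma": "EX:0002",
    "obesity": "EX:0003",
}


def resolve(token, cutoff=0.85):
    """Resolve a token to (dictionary term, identifier), tolerating small errors."""
    key = token.lower()
    if key in DICTIONARY:  # exact dictionary hit
        return key, DICTIONARY[key]
    # Fall back to the closest dictionary term above the similarity cutoff.
    close = get_close_matches(key, list(DICTIONARY), n=1, cutoff=cutoff)
    if close:  # near miss, e.g. a single misread character
        return close[0], DICTIONARY[close[0]]
    return None  # nothing similar enough: leave the token unannotated


print(resolve("diabetcs"))   # OCR-style error: "e" misread as "c"
print(resolve("psoriasis"))  # not in the toy dictionary
```

A strict cutoff keeps false matches down at the cost of missing heavier OCR damage; the right trade-off depends on the dictionary and the text.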

Using these capabilities, we annotated all patents in SureChEMBL for three key biomedical entity types and matched them to the relevant data source where possible:

Gene/Protein: HGNC, UniProt
Disease: MeSH, Human Disease Ontology
Mechanism (terms such as inhibitor, antagonist, modulator, etc.)

LeadMine is fast, robust, and easily scalable. Exactly what we need for SureChEMBL production-level annotation!

Patent annotation with LeadMine. Colour code: orange: generic chemical name, pink: generic molecule, grey: anatomy, violet: molecule dictionary, turquoise: mechanism, green: PubChem dictionary, dark red: gene, yellow: polymer, light red: journal, khaki: organism, dark orange: disease

For now, the new biomedical annotations are limited to these three types while we assess database load, integration in the UI, and delivery through bulk data. More entity types, or custom dictionaries, may follow in later phases.

Accessing the biomedical annotations

Biomedical annotations are included in the bulk data downloads. On our FTP site, you’ll find three new files:

Biomedical_entities
Biomedical_locations
Biomedical_types

https://ftp.ebi.ac.uk/pub/databases/chembl/SureChEMBL/bulk_data/latest/

The documentation has been updated accordingly, and the example notebook now includes queries that combine chemical and biomedical entities.

The UI does not yet display these new annotations. To avoid confusion with the older NLP-based system, we have temporarily disabled biomedical annotations in the interface. They’ll return soon in a new, optimized form.

For the first time: cross-domain queries

With this bulk data release, we reach another major milestone!
For the first time, it is possible to query compounds and biomedical entities together using the Parquet files. This capability marks the beginning of a new era for SureChEMBL, enabling richer, cross-domain exploration of patent data.
Think of it as a proof of concept for what will soon be possible directly in the UI!

Query example using the bulk data to find all patents containing the protein name ‘abl%’ and the compound imatinib (KTUFNOKKBVMGRW-UHFFFAOYSA-N).
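A query of this kind could look roughly like the sketch below. To keep it runnable without downloading anything, it uses Python's built-in `sqlite3` with tiny in-memory tables standing in for the Parquet files. The table names, column names, patent identifiers, and rows are all assumptions for illustration; please check the documentation for the actual schema.

```python
import sqlite3

IMATINIB = "KTUFNOKKBVMGRW-UHFFFAOYSA-N"  # InChIKey from the example above

# In-memory stand-ins for the bulk data files (hypothetical schema and toy rows).
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE compounds (compound_id INTEGER, inchi_key TEXT);
CREATE TABLE patent_compound_map (patent_id TEXT, compound_id INTEGER);
CREATE TABLE biomedical_entities (entity_id INTEGER, name TEXT);
CREATE TABLE biomedical_locations (patent_id TEXT, entity_id INTEGER);

INSERT INTO compounds VALUES (1, 'KTUFNOKKBVMGRW-UHFFFAOYSA-N'), (2, 'PLACEHOLDER-KEY');
INSERT INTO patent_compound_map VALUES ('PATENT-A', 1), ('PATENT-B', 1), ('PATENT-C', 2);
INSERT INTO biomedical_entities VALUES (10, 'ABL1'), (11, 'EGFR');
INSERT INTO biomedical_locations VALUES ('PATENT-A', 10), ('PATENT-B', 11), ('PATENT-C', 10);
""")

# Patents that mention the compound AND an entity whose name starts with 'abl'.
QUERY = """
SELECT DISTINCT m.patent_id
FROM compounds c
JOIN patent_compound_map m ON m.compound_id = c.compound_id
JOIN biomedical_locations l ON l.patent_id = m.patent_id
JOIN biomedical_entities e ON e.entity_id = l.entity_id
WHERE c.inchi_key = ?
  AND LOWER(e.name) LIKE 'abl%'
ORDER BY m.patent_id
"""

patents = [row[0] for row in con.execute(QUERY, (IMATINIB,))]
print(patents)
```

In this toy data only PATENT-A contains both imatinib and an ABL entity, so it is the only patent returned. The same join logic should carry over to an engine that reads Parquet directly, once the real table and column names are substituted.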

More examples in our public notebook.

Limitations

Biomedical annotations currently cover patents up to 31 December 2024, and only for those already containing chemical annotations. Patents from 2025 onward will be added shortly.

What’s next

Delivering biomedical annotations is an important milestone. The next step is to display them in the patent UI to provide context for users. While we can’t share full patent texts, showing the annotations themselves will already be a major improvement.

In parallel, we’re integrating this new system into our production pipeline to keep the annotations up to date and reduce backlog.

The SureChEMBL team
