
SureChEMBL 2.0 announcement


 

Dear SureChEMBL users,

 
Last year, we asked for your feedback on what you wanted to see next from SureChEMBL. Your input didn’t disappear into the void—quite the opposite! While it may have seemed quiet on the surface, we've been working hard behind the scenes: fixing bugs, implementing new features, and laying the groundwork for major improvements.
 
Today, we’re excited to share that we’ve reached a significant milestone—and we’re marking the occasion with a new name. Without further delay...


🎉 SureChEMBL 2.0 is here! 🎉

 

This is the first in a series of blog posts describing SureChEMBL 2.0:
  1. Introduction to SureChEMBL 2.0 and new data source (see below)
  2. Major changes to the download data
  3. Integration of RDKit into the processing pipeline

Further posts will be added over the next few weeks and linked to from here.

 

What’s New in SureChEMBL 2.0?

This release marks a major evolution of the system we inherited over a decade ago. We've replaced several core components of the pipeline—from annotation, to compound registration, to data provision. When we introduced the new SureChEMBL architecture in December 2023, we promised a system that would be “easier to support, and it should make it much easier to develop and deliver new functionalities”. Now, we're delivering on that promise.
One of the biggest challenges with legacy infrastructure (SureChEMBL originated from SureChem, remember?) was understanding every layer of its complex, ageing architecture. With SureChEMBL 2.0, we've rebuilt enough of the pipeline to give us full ownership and clarity—particularly over annotation, the heart of what we do.

New Data Source

Let’s start with one of the most exciting updates: we’ve added the China National Intellectual Property Administration (CNIPA) to our list of patent authorities! We now integrate data from five authorities:
  • USPTO
  • WIPO
  • EPO
  • JPO
  • CNIPA

However, what we get from each varies. While we receive rich content—including attachments and full text—for USPTO, EPO, and WIPO, we only get titles and abstracts from JPO, and translated full text (no attachments) from CNIPA.

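| Authority | Kind | Language | From | Full text | Attachments |
|---|---|---|---|---|---|
| CNIPA | Applications, Granted | EN | 1985 | Yes (English translation) | No |
| EPO | Applications | DE, EN, FR | 1978 | Yes | Yes |
| EPO | Granted | DE, EN, FR | 1980 | Yes | Yes |
| JPO | Applications | EN | 1976 | Yes (abstract) | No |
| USPTO | Applications | EN | 2001 | Yes | Yes |
| USPTO | Granted | EN | 1920-1949 | Yes (abstract) | Yes |
| USPTO | Granted | EN | 1950-1975 | Yes (abstract & claims) | Yes |
| USPTO | Granted | EN | 1976 | Yes | Yes |
| WIPO | Applications | EN, FR | 1978 | Yes | Yes |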
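
For readers who want to query the coverage table from a script, here is a minimal Python sketch. It is only an illustration, not part of SureChEMBL: the names `Coverage`, `ROWS`, `lookup` and `authorities_with_attachments` are made up for this example. It assumes that a single start year means "from that year onwards", and that values shown once for an authority apply to all of its rows. The language column is left out for brevity.

```python
from typing import NamedTuple, Optional


class Coverage(NamedTuple):
    authority: str
    kind: str                 # "Applications" or "Granted"
    first_year: int
    last_year: Optional[int]  # None = open-ended (assumption: "from this year onwards")
    text: str                 # which text is available
    attachments: bool


# Rows transcribed from the coverage table above.
# Assumption: merged cells apply to every row of the same authority.
ROWS = [
    Coverage("CNIPA", "Applications", 1985, None, "full (English translation)", False),
    Coverage("CNIPA", "Granted", 1985, None, "full (English translation)", False),
    Coverage("EPO", "Applications", 1978, None, "full", True),
    Coverage("EPO", "Granted", 1980, None, "full", True),
    Coverage("JPO", "Applications", 1976, None, "abstract", False),
    Coverage("USPTO", "Applications", 2001, None, "full", True),
    Coverage("USPTO", "Granted", 1920, 1949, "abstract", True),
    Coverage("USPTO", "Granted", 1950, 1975, "abstract & claims", True),
    Coverage("USPTO", "Granted", 1976, None, "full", True),
    Coverage("WIPO", "Applications", 1978, None, "full", True),
]


def lookup(authority: str, kind: str, year: int) -> Optional[Coverage]:
    """Return the table row covering this authority, kind and year, or None."""
    for row in ROWS:
        if (row.authority, row.kind) != (authority, kind):
            continue
        if row.first_year <= year and (row.last_year is None or year <= row.last_year):
            return row
    return None


def authorities_with_attachments() -> list:
    """Return the sorted names of authorities that have at least one row with attachments."""
    return sorted({row.authority for row in ROWS if row.attachments})
```

For example, `lookup("USPTO", "Granted", 1960)` returns the 1950-1975 row, whose text level is "abstract & claims", and `authorities_with_attachments()` lists EPO, USPTO and WIPO, matching the description above.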

 

Let’s Talk Numbers

Here’s a snapshot of how many documents and chemical structures we’ve identified from each authority. While many patents are unrelated to chemistry (and won’t contain chemical structures), all documents remain accessible via the SureChEMBL web interface—keeping it a universal patent search portal.
Notably:
  • USPTO (21M) has fewer documents than CNIPA or JPO, but contributes the most unique compounds (20M), thanks to the availability of the MOL and CDX attachments included in the Complex Work Unit (CWU).
  • CNIPA (51M) and JPO (30M) lead in patent volume, but limited data access results in fewer compounds (4M and 193K respectively).
  • EPO (9M) and WIPO (6M) have fewer patents but more chemical structures (13M and 11M, respectively), thanks to our image-to-structure extraction pipeline.


| Authority | #Patents | #Annotated Patents | #Unique Compounds |
|---|---|---|---|
| CNIPA | 51,134,665 | 19,843,662 | 3,882,257 |
| EPO | 8,685,902 | 5,893,457 | 12,637,484 |
| JPO | 29,661,823 | 3,430,164 | 193,340 |
| USPTO | 21,243,929 | 11,261,705 | 19,903,544 |
| WIPO | 5,873,483 | 3,870,512 | 11,144,795 |
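
To make the comparison between authorities more concrete, the figures above can be turned into two simple ratios: the share of patents that are annotated, and the number of unique compounds per annotated patent. The Python sketch below is only an illustration. The names `COUNTS` and `summarise` are made up for this example, and the second ratio is a rough density indicator rather than a count of structures in any single patent, since the same compound can appear in many patents.

```python
# Figures copied from the table above:
# (patents, annotated patents, unique compounds) per authority.
COUNTS = {
    "CNIPA": (51_134_665, 19_843_662, 3_882_257),
    "EPO": (8_685_902, 5_893_457, 12_637_484),
    "JPO": (29_661_823, 3_430_164, 193_340),
    "USPTO": (21_243_929, 11_261_705, 19_903_544),
    "WIPO": (5_873_483, 3_870_512, 11_144_795),
}


def summarise(counts):
    """Return {authority: (annotated share, unique compounds per annotated patent)}."""
    return {
        name: (annotated / patents, compounds / annotated)
        for name, (patents, annotated, compounds) in counts.items()
    }


if __name__ == "__main__":
    for name, (share, density) in summarise(COUNTS).items():
        print(f"{name:<6} annotated: {share:6.1%}  compounds per annotated patent: {density:.2f}")
```

On these numbers, WIPO and EPO come out with the highest compound-to-patent ratios and JPO with the lowest, which is consistent with the bullet points above.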


What’s Next?

This is just the beginning. As we roll out more updates to our pipeline, you’ll see improvements in annotation quality and chemical extraction across all authorities. Stay tuned: we’ll keep you informed through our blog posts.

The SureChEMBL team
