Dear SureChEMBL users,
🎉 SureChEMBL 2.0 is here! 🎉
- Introduction to SureChEMBL 2.0 and new data source (see below)
- Major changes to the download data
- Integration of RDKit into the processing pipeline
Further posts will be added over the next few weeks and linked to from here.
What’s New in SureChEMBL 2.0?
This release marks a major evolution of the system we inherited over a decade ago. We've replaced several core components of the pipeline—from annotation, to compound registration, to data provision. When we introduced the new SureChEMBL architecture in December 2023, we promised a system that would be “easier to support, and it should make it much easier to develop and deliver new functionalities”. Now, we're delivering on that promise.
One of the biggest challenges with legacy infrastructure (SureChEMBL originated from SureChem, remember?) was understanding every layer of its complex, ageing architecture. With SureChEMBL 2.0, we've rebuilt enough of the pipeline to give us full ownership and clarity—particularly over annotation, the heart of what we do.
New Data Source
• USPTO
• WIPO
• EPO
• JPO
• CNIPA
However, what we get from each varies. While we receive rich content—including attachments and full text—for USPTO, EPO, and WIPO, we only get titles and abstracts from JPO, and translated full text (no attachments) from CNIPA.
Let’s Talk Numbers
Notably:
• USPTO (21M) has fewer documents than CNIPA or JPO, but contributes the most unique compounds (20M), thanks to the availability of the MOL and CDX attachments included in the Complex Work Unit (CWU).
• CNIPA (51M) and JPO (30M) lead in patent volume, but limited data access results in fewer compounds (4M and 193K respectively).
• EPO (9M) and WIPO (6M) have fewer patents but more chemical structures (13M and 11M, respectively), thanks to our image-to-structure extraction pipeline.
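To make the comparison above concrete, here is a minimal sketch that divides the compound counts by the document counts for each authority. The figures are the rounded values quoted in the list above; the dictionary layout and the function name `compounds_per_document` are purely illustrative and not part of any SureChEMBL tooling. Treat the ratio as a rough indicator only, since the compound counts are unique compounds per authority rather than per-document averages.

```python
# Rounded figures quoted in the post: (documents, unique compounds) per authority.
counts = {
    "USPTO": (21_000_000, 20_000_000),
    "CNIPA": (51_000_000, 4_000_000),
    "JPO": (30_000_000, 193_000),
    "EPO": (9_000_000, 13_000_000),
    "WIPO": (6_000_000, 11_000_000),
}


def compounds_per_document(counts):
    """Return a rough compounds-per-document ratio for each authority."""
    return {
        authority: compounds / documents
        for authority, (documents, compounds) in counts.items()
    }


ratios = compounds_per_document(counts)

# Rank authorities from the highest to the lowest ratio.
ranked = sorted(ratios, key=ratios.get, reverse=True)

for authority in ranked:
    print(f"{authority}: {ratios[authority]:.3f} compounds per document")
```

On these rounded numbers, WIPO and EPO come out on top, USPTO sits just below one compound per document, and CNIPA and JPO trail far behind, which matches the pattern described in the list.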
What’s Next?
This is just the beginning. As we roll out more updates to our pipeline, you’ll see improvements in annotation quality and chemical extraction across all authorities. Stay tuned: we’ll keep you informed through our blog posts.
The SureChEMBL team