Dear SureChEMBL users,
Welcome to the third blog post in our SureChEMBL2.0 series! If you missed the previous ones [1, 2], be sure to check them out when you have a moment.
Today, we're diving into what has changed under the hood of SureChEMBL!
At the end of last year, we announced the integration of RDKit for compound depiction and various types of chemistry searches. But of course, we didn’t stop there. In line with our goals of simplifying the service architecture and improving data quality, we’ve decided to follow the lead of our sister resource ChEMBL in how we manage compound structures.
What’s Changing?
SureChEMBL handles compound structures at several stages in the pipeline. Regardless of how a compound is identified in a patent, the first step is to standardise its structure, and we’re now using the ChEMBL standardiser for this. Next, we register compounds using an RDKit-based hash, which helps us avoid introducing duplicates. (This hash-based system was introduced by Schrödinger and is now available in RDKit.) Finally, we calculate physicochemical properties, also with RDKit.
All legacy and future data are affected by this new pipeline.
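For readers who want a feel for these three stages, here is a minimal sketch in Python. It is not our production code: the `register` function and the `registry` dictionary are hypothetical names invented for illustration, the choice of properties (molecular weight and logP) is an assumption, and the hash uses RDKit's default layer scheme, which may differ from what SureChEMBL uses. The sketch assumes RDKit is installed and uses the `chembl_structure_pipeline` package when it is available.

```python
from rdkit import Chem
from rdkit.Chem import Descriptors, RegistrationHash

try:
    # ChEMBL structure pipeline (package name: chembl_structure_pipeline).
    from chembl_structure_pipeline import standardizer
except ImportError:
    # Assumption for this sketch only: without the package, the
    # standardisation step is skipped and the parsed molecule is used as-is.
    standardizer = None


def register(smiles, registry):
    """Standardise, hash and register one structure.

    Returns (hash, is_new), or (None, False) if the SMILES cannot be parsed.
    `registry` is a hypothetical dict standing in for a compound table.
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None, False

    # Stage 1: standardise the structure.
    if standardizer is not None:
        mol = standardizer.standardize_mol(mol)

    # Stage 2: compute the registration hash and skip known compounds.
    layers = RegistrationHash.GetMolLayers(mol)
    mol_hash = RegistrationHash.GetMolHash(layers)
    if mol_hash in registry:
        return mol_hash, False

    # Stage 3: calculate properties for newly registered compounds.
    registry[mol_hash] = {
        "smiles": Chem.MolToSmiles(mol),
        "mol_weight": Descriptors.MolWt(mol),
        "logp": Descriptors.MolLogP(mol),
    }
    return mol_hash, True


registry = {}
first = register("CCO", registry)   # ethanol
second = register("OCC", registry)  # the same molecule, written differently
print(first[1], second[1], len(registry))
```

In this sketch the second call is recognised as a duplicate because both SMILES strings describe the same molecule and therefore produce the same hash, so only one entry ends up in the registry.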
What Does This Mean for You?
Changing our core chemistry framework means that some compound representations may differ, even though the underlying molecules (MOL file or SMILES) remain the same.
Some key points:
- The ChEMBL standardiser follows different rules from our previous system (which used ChemAxon). As a result, some existing compounds may now be rejected.
- Physicochemical properties might differ slightly — often at the second or third significant figure.
- We’ve deprecated 172,504 compounds in total. Of those:
  - 106,210 were rejected by the structure standardiser.
  - 66,294 were removed as duplicates based on the registration hash.
And don’t worry — after this update, the SureChEMBL database still contains 28,546,173 compounds.
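To put these numbers in perspective, the short Python snippet below checks that the two categories add up to the total and works out their relative shares. It uses only the figures quoted above; the variable names are ours and purely illustrative.

```python
# Figures quoted in this post.
rejected_by_standardiser = 106_210
removed_as_duplicates = 66_294
deprecated_total = 172_504
remaining_compounds = 28_546_173

# The two categories should account for every deprecated compound.
categories_sum = rejected_by_standardiser + removed_as_duplicates

# Share of each category among the deprecated compounds, in percent.
rejected_share = 100 * rejected_by_standardiser / deprecated_total
duplicate_share = 100 * removed_as_duplicates / deprecated_total

# Deprecated compounds relative to those still in the database, in percent.
deprecated_vs_remaining = 100 * deprecated_total / remaining_compounds

print(categories_sum == deprecated_total)
print(f"{rejected_share:.1f}% rejected, {duplicate_share:.1f}% duplicates")
print(f"{deprecated_vs_remaining:.1f}% of the remaining compound count")
```

In other words, roughly three in five deprecations come from the standardiser and two in five from duplicate removal, and the deprecated set is well under 1% of the size of the remaining database.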
What’s Next?
We hope this upgrade improves your experience with the SureChEMBL data. And if you’re interested in patent compound structures, stay tuned: there’s more to come in our next blog post!
The SureChEMBL Team