
ChEMBL Compound Curation Pipeline




At the end of last year we mentioned that we are now using RDKit for our compound structure processing (see here). Most excitingly, as part of this we have spent the last year working with Greg Landrum, the developer of RDKit, to reimplement our curation pipeline using RDKit.

The pipeline includes three functions:

1. Check
Identifies and validates problem structures before they are added to the database

2. Standardize
Standardises chemical structures according to a set of predefined ChEMBL business rules 

3. GetParent
Generates parent structures of multi-component compounds based on a set of rules and a defined list of salts and solvents
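
As a rough illustration of how the three functions fit together, here is a minimal sketch in Python. It assumes the open-source `chembl_structure_pipeline` package and RDKit are installed, and that the package exposes `checker.check_molblock`, `standardizer.standardize_molblock` and `standardizer.get_parent_molblock` as described in its README. The helper names `curate` and `sodium_acetate_molblock` are ours, chosen only for this example.

```python
def sodium_acetate_molblock():
    """Build a small salt (sodium acetate) as a molblock for the demo."""
    # Imported lazily so this file can be loaded without RDKit present.
    from rdkit import Chem
    from rdkit.Chem import AllChem

    mol = Chem.MolFromSmiles("[Na+].CC(=O)[O-]")
    AllChem.Compute2DCoords(mol)  # give the atoms 2D coordinates
    return Chem.MolToMolBlock(mol)


def curate(molblock):
    """Run Check, Standardize and GetParent on a single molblock.

    Assumes the chembl_structure_pipeline API as documented in its README.
    """
    from chembl_structure_pipeline import checker, standardizer

    # 1. Check: collect (penalty score, message) pairs for problem structures.
    issues = checker.check_molblock(molblock)

    # 2. Standardize: apply the ChEMBL standardisation rules.
    standardized = standardizer.standardize_molblock(molblock)

    # 3. GetParent: strip salts and solvents; the second return value
    #    flags structures that the pipeline excludes from processing.
    parent, excluded = standardizer.get_parent_molblock(standardized)

    return {
        "issues": issues,
        "standardized": standardized,
        "parent": parent,
        "excluded": excluded,
    }


# Usage (requires both packages to be installed):
#   result = curate(sodium_acetate_molblock())
#   for penalty, message in result["issues"]:
#       print(penalty, message)
#   print(result["parent"])
```

Running the check first and keeping its output alongside the standardized and parent structures makes it easy to decide later which compounds need manual review.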

We are now pleased to announce that we are making all the code from this project freely available on GitHub. The functions can also now be used through our ChEMBL Beaker API.
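
For readers who prefer the web service, the sketch below shows one way a Beaker call might be prepared with the Python standard library only. The base URL and the endpoint names (`check`, `standardize`, `getParent`) are assumptions on our part and should be confirmed against the Beaker documentation. `build_beaker_request` and `post_molfile` are hypothetical helpers, not part of any ChEMBL client.

```python
import urllib.request

# Assumed base URL for the ChEMBL Beaker utilities; verify before use.
BEAKER_BASE = "https://www.ebi.ac.uk/chembl/api/utils"

# Assumed endpoint names matching the three pipeline functions.
ENDPOINTS = ("check", "standardize", "getParent")


def build_beaker_request(endpoint, molfile):
    """Prepare (but do not send) a POST request carrying a molfile."""
    if endpoint not in ENDPOINTS:
        raise ValueError(f"unknown endpoint: {endpoint!r}")
    return urllib.request.Request(
        url=f"{BEAKER_BASE}/{endpoint}",
        data=molfile.encode("utf-8"),  # molfile text is sent as the request body
        method="POST",
    )


def post_molfile(endpoint, molfile, timeout=30):
    """Send the request and return the response body as text."""
    request = build_beaker_request(endpoint, molfile)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Usage (performs a network call):
#   with open("compound.mol") as handle:
#       print(post_molfile("standardize", handle.read()))
```

Keeping request construction separate from sending makes the URL and payload easy to inspect before anything goes over the network.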

A live notebook with examples is available here.

For ChEMBL26 (shortly to be released) we have created new molfiles for all the ChEMBL compounds using this pipeline, and we will continue to refine and develop it over the coming months. We are sure it's not yet perfect so, as always, we welcome your feedback to ChEMBL-help.
