At the end of last year we mentioned that we are now using RDKit for our compound structure processing (see here). Most excitingly, as part of this we have spent the last year working with Greg Landrum, the developer of RDKit, to reimplement our curation pipeline using RDKit.
The pipeline includes three functions:
1. Check
Identifies and validates problem structures before they are added to the database
2. Standardize
Standardises chemical structures according to a set of predefined ChEMBL business rules
3. GetParent
Generates parent structures of multi-component compounds based on a set of rules and a defined list of salts and solvents
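To give a feel for what the GetParent step is doing, here is a deliberately simplified sketch in plain Python. It is not the code from our pipeline: the real implementation works on RDKit molecules and molfiles and applies our curated rules, whereas this toy version just splits a SMILES string on "." and drops components that exactly match a small list. The function name `get_parent_smiles` and the contents of `SALTS_AND_SOLVENTS` are made up for illustration.

```python
# Toy illustration only. The real pipeline operates on RDKit molecules and
# molfiles and uses ChEMBL's own curated salt and solvent definitions.

# Hypothetical, tiny list of salt and solvent components written as SMILES.
SALTS_AND_SOLVENTS = {"Cl", "[Na+]", "[Cl-]", "O", "CCO"}


def get_parent_smiles(smiles: str) -> str:
    """Drop components that exactly match the salt/solvent list.

    If every component is on the list, the input is returned unchanged so
    that a compound such as sodium chloride is not reduced to nothing.
    (That fallback is a choice made for this sketch.)
    """
    components = smiles.split(".")
    kept = [c for c in components if c not in SALTS_AND_SOLVENTS]
    if not kept:
        return smiles
    return ".".join(kept)


if __name__ == "__main__":
    # Phenethylamine hydrochloride: the HCl component is stripped.
    print(get_parent_smiles("NCCc1ccccc1.Cl"))
    # Sodium chloride: every component is on the list, so nothing is removed.
    print(get_parent_smiles("[Na+].[Cl-]"))
```

Exact string matching is far too naive for real data, since the same salt can be written as many different SMILES strings. That is one of the reasons the actual pipeline compares structures with RDKit rather than comparing text.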
We are now pleased to announce that we are making all the code from this project freely available on GitHub. The functions can also now be used through our ChEMBL Beaker API.
Live notebook with examples available here.
For ChEMBL26 (shortly to be released) we have created new molfiles for all the ChEMBL compounds using this pipeline, and we will continue to refine and develop it over the coming months. We are sure it’s not yet perfect, so as always we welcome your feedback via ChEMBL-help.