Wednesday, 27 February 2013

ChEMBL Compound Clean Up

For the last three months, I've been busy working my way through a (sometimes headache-inducing) list of 9,000 ChEMBL compound ids. These had been highlighted for curation because, for each ChEMBL_id in the list, there were two or more compound keys from the same paper. This implied either that the paper described two compounds that the InChI representation could not distinguish, or that two different compounds had somehow been merged together in the database.
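The selection rule described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the query that was actually run against ChEMBL: the function name `flag_for_curation`, the tuple layout and the ids and keys in the sample data are all made up for the example.

```python
from collections import defaultdict


def flag_for_curation(records):
    """Group records by (ChEMBL id, paper) and keep the groups that
    carry two or more distinct compound keys.

    `records` is an iterable of (chembl_id, doc_id, compound_key)
    tuples; this layout is an assumption made for the sketch.
    """
    keys_by_pair = defaultdict(set)
    for chembl_id, doc_id, compound_key in records:
        keys_by_pair[(chembl_id, doc_id)].add(compound_key)
    # One structure with several keys in the same paper is a candidate
    # for manual checking against the original publication.
    return {
        pair: sorted(keys)
        for pair, keys in keys_by_pair.items()
        if len(keys) >= 2
    }


# Hypothetical records: one structure appears as both "3a" and "3b"
# in the same paper, so only that pair is flagged.
records = [
    ("CHEMBL_A", "DOC_1", "3a"),
    ("CHEMBL_A", "DOC_1", "3b"),
    ("CHEMBL_B", "DOC_1", "4"),
    ("CHEMBL_A", "DOC_2", "12"),
]
flagged = flag_for_curation(records)
print(flagged)
```

The same structure appearing in two different papers is not flagged, since the rule only looks at multiple keys within a single paper.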

Each ChEMBL_id was individually checked against the data in the original paper to see if there were indeed two compound keys for the same structure.

The outcome of this check gave rise to one of four cases:
  • The structure(s) was found to be incorrect and was redrawn.
  • The structure was correct for some records but not others, so a new compound was created for those selected records.
  • The structure required the definition of stereochemistry or a salt.
  • The structure was left alone: either the stereochemistry could not be shown, or it was indeed a currently indistinguishable compound with separate compound keys. An example of this case is where chemists have separated enantiomers and know that a pair of compounds differ only by stereochemistry, but they don't know the absolute configuration, just that they are 'opposite'.

It was a laborious but satisfying job to complete, allowing me to make use of my pedantic and geek-like tendencies. It has shown that there are a fairly significant number of papers where the authors have given identical structures two different compound keys. In some cases these are duplicates that probably should have been merged in the original publication; in others, they highlight the problems of representing relative stereoisomers and, sometimes, atropisomers. These are difficult things to capture in a structure representation.

It has definitely been an interesting project to get through, with over 3,800 compounds redrawn, altered, or having their records moved or merged. These changes will be available with the release of ChEMBL_16, further enhancing the data you have and need!

Any questions or queries, please feel free to contact ChEMBL Help at the usual address.



Gabriel Irwin said...

Three cheers for Louisa! Really, great stuff. Can't wait to get it.

Christopher Southan said...

It's not much, I know, but I'll buy you a beer at the next opportunity. Cheers