Sunday, 24 April 2011
Molecular databases and molecule complexity - part 1
At one level a database of small molecules seems a really simple thing - a set of identifiers and then a 2D structure. You can then do a bunch of really cool things with this, as the large literature in the area shows. For example, one thing which is pretty common is to take a library of molecules, then 'dock' them into a protein structure, hopefully to find a novel lead; or maybe even a new use for a drug (or prediction of a side effect of a known drug). The wide availability of pipeline tools, web services connecting directly to remote databases, and so forth, makes this sort of thing really simple, and arguably too simple.
However, there are many challenges with handling normalised 2D chemical data. One thing we have started to think about recently is just how ambiguous a 2D representation of a structure is for typical users interested in the analysis of compound properties, docking, etc.
The problem arises from the fact that molecules are 'complex', in that a single valid 2D representation can have multiple, readily interconvertible, distinct physical manifestations. The sources of this ambiguity include ionization, tautomerisation, hydration (for example, the formation of geminal hydroxy forms from aldehydes), stereoisomerism, and of course conformational flexibility. When a real physical experiment is done, the lowest free energy result emerges from this ensemble of possibilities.
During the registration of a molecule into a database there is typically a series of normalisation steps, in order to reduce these multiple real-world physical structures to a simpler 'canonical' form. When one wants to use the data, a user may then need to 'enumerate' a set of possible structures in order to do anything useful with them. (Of course, stereoisomers are not usually physically interconvertible. However, molecules often have undefined stereochemistry when registered in a database, and for some tasks (e.g. docking) the results depend enormously on the actual stereo form, since the two enantiomers will bind to the (usually chiral) receptor with different energies, whereas other properties (e.g. logP) are identical for the two enantiomers.)
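To make the 'enumerate' step a little more concrete, here is a minimal sketch in plain Python of expanding undefined stereocentres into every possible R/S assignment. The atom labels and the dictionary layout are hypothetical, made up purely for illustration; a real cheminformatics toolkit would work on the actual structure and would also spot symmetry (e.g. meso forms), so treat the count here as an upper bound.

```python
from itertools import product


def enumerate_stereo_assignments(centres):
    """Return every R/S assignment consistent with the given stereocentres.

    `centres` maps a (hypothetical) atom label to "R", "S", or None,
    where None means the stereochemistry was left undefined at registration.
    Defined centres are kept as they are; undefined ones are expanded.
    """
    undefined = [atom for atom, label in centres.items() if label is None]
    assignments = []
    for labels in product("RS", repeat=len(undefined)):
        assignment = dict(centres)               # copy the defined centres
        assignment.update(zip(undefined, labels))  # fill in the undefined ones
        assignments.append(assignment)
    return assignments


# Hypothetical molecule: one defined centre and two undefined ones.
forms = enumerate_stereo_assignments({"C2": "R", "C5": None, "C7": None})
for form in forms:
    print(form)
```

Each of these assignments would then need to be docked separately, which is exactly where the extra work (and the scope for confusion) comes from.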
So in summary, there is a processing step when one registers a molecule to reduce complexity in the representation, and then a processing step when one takes a molecule from a database to do something useful with it (caveat - alternatives to this general model exist).
Some molecules will have a limited number of physical forms (maybe even just one); others will be incredibly complex and have a very large number of physical forms.
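A rough back-of-the-envelope sketch of how quickly this grows is below. It assumes, unrealistically, that the sources of ambiguity are independent of each other, so the numbers simply multiply, and it ignores conformers altogether. The function name and its parameters are made up for illustration.

```python
def max_forms(undefined_stereocentres, tautomers=1, ionisation_states=1,
              hydration_states=1):
    """Crude upper bound on the number of distinct forms behind one 2D record.

    Assumes each source of ambiguity is independent of the others, which is
    not true in general (e.g. ionisation can change which tautomers matter).
    """
    return (2 ** undefined_stereocentres) * tautomers * ionisation_states * hydration_states


# A fully defined, rigid, neutral molecule: a single form.
simple = max_forms(0)

# Hypothetical molecule: 4 undefined stereocentres, 3 tautomers and
# 4 ionisation states.
complex_case = max_forms(4, tautomers=3, ionisation_states=4)
print(simple, complex_case)
```

Even with these modest made-up numbers the second case comes out at a couple of hundred candidate structures for a single database entry.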
How widely appreciated is this fact? Well, based on some of the questions and requests we get for ChEMBL support, I'd say not very widely, and we're thinking of ways to incorporate this into the database somehow...
It's Easter Egg Hunt time in Cheam now.... so more tomorrow.