Monday, 14 May 2018

Striving for Perfect Representation of Chemical Structures – is this possible?

It probably goes without saying that at ChEMBL we want all our data to be as accurate and useful as possible. With this in mind, we have spent many hours over the last few years curating, in particular, the structures of marketed drugs and clinical candidates. We aren’t alone in this: more than 5 years ago, people were coming across the same problems, as highlighted in this blog post by ChemConnector on fluvastatin.

Our drug curation is an ongoing and probably never-ending task but, to be honest, it has proved a lot more difficult than we expected. There are several reasons for this:

Firstly, where do we go to find the definitive structure of a molecule? One would have thought this would be easy, but even authoritative sources such as INN and USAN don’t always agree. For example, for telavancin the USAN data sheet shows different nitrogen and carbon counts in the structure image compared with the image in the INN document (although the molecular formulae are the same in both documents).

Secondly, while molfiles are our definitive structures and we use standard InChIs to determine uniqueness, we see many examples where converting between formats (molfiles, SMILES, InChIs) introduces inconsistencies. This is of course a well-known problem. There are ongoing discussions and initiatives to develop open structure formats and extend InChIs to deal with some of these cases, but my sense is that this is a long way off.

Lastly, and what seems like an insurmountable problem, we are constrained by the method we use to represent chemical structures. We, at ChEMBL, like many people, use v2000 molfiles (ref 1). There is no doubt that using v3000 molfiles would solve a number of these problems, but the conversion would be very time-consuming and costly, and is therefore probably only feasible for a limited number of ChEMBL structures such as the drug molecules. We are considering this as a long-term goal, but it would need wider community buy-in to make it worthwhile. We also suspect that many of our users only use the older molfile version, so providing the v3000 format wouldn’t help them. We would be interested in your feedback on which format you use. Most of the resources we exchange data with (e.g. PubChem, BindingDB) also use v2000 molfiles. There is no doubt that different resources find their own ways to cope with the limitations of the file formats, and we do too. For example, it would be possible to use non-standard extensions of the data fields in the SD file to record what the molfile itself cannot encode, but this would lack real chemical awareness. Also, how one group chooses to use such fields won’t necessarily be consistent with another group, so we would be no further forward.
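If you are not sure which molfile version your own files use, the version stamp sits at the end of the counts line, which is the fourth line of a molfile. The following is a minimal Python sketch of that check, not a full parser. The function name `molfile_version` and the hand-written single-atom molfile are placeholders of our own for illustration, and very old files without a version stamp are simply reported as "unknown".

```python
def molfile_version(molblock: str) -> str:
    """Return 'V2000', 'V3000' or 'unknown' for a single molfile block.

    The counts line is the fourth line of a molfile and, in both
    versions, ends with the version stamp.
    """
    lines = molblock.splitlines()
    if len(lines) < 4:
        return "unknown"
    counts_line = lines[3].rstrip()
    for version in ("V2000", "V3000"):
        if counts_line.endswith(version):
            return version
    return "unknown"


# Hand-written placeholder input: a single carbon atom (illustration only).
EXAMPLE_V2000 = "\n".join([
    "methane",
    "  hand-written example",
    "",
    "  1  0  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "M  END",
])

print(molfile_version(EXAMPLE_V2000))
```

For an SD file, the same check could be applied to each record after splitting on the `$$$$` delimiter lines.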

As a consequence of our curation efforts, we have come across an increasing number of challenging molecules, and it would be useful to get our users’ views on the best way to deal with them. It should also be said here that we are only talking about apparently “simple” rule-of-5-compliant organic molecules; several years ago we stopped trying to curate organometallic compounds. We don’t show the structures of these in ChEMBL, the drug cisplatin being a case in point. The v2000 molfile has no way of coding coordination bonds, and the standard InChI (ref 2) that we use to define a unique chemical structure can’t distinguish between cisplatin and transplatin.

Back to the organic molecules though and a few of our dilemmas:

Milnacipran is my favourite and, apparently, a relatively simple example. It is a mixture of the 1S,2R and 1R,2S enantiomers (USAN). However, v2000 molfiles don’t deal with relative stereochemistry, so we have 3 options:

(1) Show one enantiomer:
(2) Show it as a racemic mixture, i.e. no stereochemistry:
(3) Show it as a molfile comprised of two molecules:

Arguably option 3 is the only correct way to do this. However, other data providers such as the FDA and DrugBank use option 1. In the ChEMBL database we use option 2 so that we can distinguish milnacipran from levomilnacipran (USAN; specifically the 1S,2R isomer) and dextromilnacipran (1R,2S). Option 1 wouldn’t enable us to distinguish these, either in the molfile or in the standard InChI.

My logic for not using option 3 comes from thinking about how people use ChEMBL. ChEMBL is not a registration system, where option 3 might indeed be needed; it is used as a source of bioactivity data for identifying tool compounds, building QSAR models for specific targets, etc. Wouldn’t users taking our 1.8 million compounds simply discard any mixtures, such as those option 3 would give, before starting their analysis, given that calculating physicochemical properties on mixtures makes little sense?

OK, so suppose you disagree and think option 3 is the right thing to do: what would you want us to do for itraconazole? This is described in DailyMed (ref 3) as a “1:1:1:1 racemic mixture of four diastereoisomers (two enantiomeric pairs)”.

Option 3 would give us a mixture of 4 molecules in our v2000 molfile. For example:

Again, we have chosen option 2 as the least bad option, i.e. just showing it as a racemic mixture with no stereochemistry.

It seemed as if we had identified a workable and at least internally consistent way of dealing with these structures, until we took a look at the following two examples, alphaprodine and betaprodine:

Here we have alphaprodine being a mixture of the (RS,SR) enantiomers:
and betaprodine the (SS,RR) enantiomers:
Hence our use of option 2 fails to distinguish between them! This matters because the two enantiomeric pairs have different biological properties, e.g. different analgesic activity (ref 4).
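The trade-offs between the three options can be made concrete with a toy model. The Python sketch below is purely illustrative and is not how ChEMBL stores structures: a substance is modelled as a set of stereoisomers, each stereoisomer as a tuple of R/S labels for two stereocentres, and the mirror image is assumed to swap every label (which ignores complications such as pseudoasymmetric centres). All names (`racemate`, `option_1`, and so on) are our own inventions for this example.

```python
def mirror(labels):
    """Mirror image of a stereoisomer: swap every R and S label."""
    swap = {"R": "S", "S": "R"}
    return tuple(swap[label] for label in labels)


def racemate(labels):
    """A substance made of one stereoisomer plus its enantiomer."""
    return frozenset({tuple(labels), mirror(labels)})


def option_1(substance):
    """Draw a single stereoisomer to stand for the whole substance."""
    return min(substance)


def option_2(substance):
    """Draw mixtures flat; keep stereo labels only for single isomers."""
    if len(substance) == 1:
        return next(iter(substance))
    n_centres = len(next(iter(substance)))
    return ("?",) * n_centres


def option_3(substance):
    """Draw every component of the mixture."""
    return tuple(sorted(substance))


single = frozenset({("R", "S")})   # one pure enantiomer
pair_a = racemate(("R", "S"))      # an (RS,SR) pair
pair_b = racemate(("R", "R"))      # an (RR,SS) pair

# For each option, report whether the two substances get different drawings.
for option in (option_1, option_2, option_3):
    print(option.__name__,
          "| single vs pair_a distinct:", option(single) != option(pair_a),
          "| pair_a vs pair_b distinct:", option(pair_a) != option(pair_b))
```

In this toy model, option 1 cannot tell the pure enantiomer from its racemate (the milnacipran problem), option 2 cannot tell the two enantiomeric pairs apart (the prodine problem), and only option 3 keeps all three substances distinct.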

The other example is met(h)iomeprazine and levomet(h)iomeprazine, where the former is a mixture of two enantiomers and the latter is one enantiomer or the other (but, according to INN, it apparently isn’t known which).

For this example, we have chosen option 2 for metiomeprazine but for levometiomeprazine we show just one of the possible enantiomers.

In summary, no existing solution is ideal and not everyone agrees on how to do this. In ChEMBL itself we are trying to be consistent within the constraints of the v2000 molfile format, but it’s not all done yet. There is, however, a glimmer of light in this confusion: our UniChem connectivity match (ref 5) enables matching of these cases across databases. For example, using the non-stereospecific representation of milnacipran enables matching to this as well as to the specific levo- and dextro-milnacipran enantiomers (and their salts). Details here.
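To give a feel for why matching on connectivity brings these cases together, here is a rough Python sketch that drops the stereochemistry layers (`/b`, `/t`, `/m`, `/s`) from a standard InChI string and compares what is left. This is a simplification of our own for illustration and not UniChem’s actual implementation, which also handles salts and other layers. Alanine is used as a small stand-in example rather than milnacipran, and the function name `strip_stereo_layers` is hypothetical.

```python
STEREO_PREFIXES = ("b", "t", "m", "s")


def strip_stereo_layers(inchi: str) -> str:
    """Drop the /b, /t, /m and /s layers from a standard InChI string.

    The first layer after the version prefix is the molecular formula
    and is always kept.
    """
    head, *layers = inchi.split("/")
    if not layers:
        return inchi
    formula, rest = layers[0], layers[1:]
    kept = [layer for layer in rest if not layer.startswith(STEREO_PREFIXES)]
    return "/".join([head, formula, *kept])


# Stand-in example: the two enantiomers of alanine and the flat form.
L_ALANINE = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
D_ALANINE = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m1/s1"
FLAT_ALANINE = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)"

# The full strings differ, but all three share the same stereo-free core.
print(L_ALANINE != D_ALANINE)
print(strip_stereo_layers(L_ALANINE) == strip_stereo_layers(D_ALANINE) == FLAT_ALANINE)
```

The same idea is what lets a flat representation act as a bridge between a racemate in one database and a single enantiomer in another.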

So, ChEMBL users out there, we’d be interested in what you think. Do you prefer option 1, 2 or 3 for your use cases, or does it make no difference? We can’t promise an instant change, but we do want to hear your views. Before you ask: we know we have some inconsistencies in ChEMBL for these molecules, but we are undecided on what to do and, of course, time spent on this is less time for other things. If you want to vote on your preferred option you can do so here.

As always, if you think we have something wrong in ChEMBL, please email us and we will endeavour to correct it.

(1) A. Dalby, J. G. Nourse, W. D. Hounshell, A. K. I. Gushurst, D. L. Grier, B. A. Leland and J. Laufer, Description of Several Chemical Structure File Formats Used by Computer Programs Developed at Molecular Design Limited, J. Chem. Inf. Comput. Sci. 1992, 32, 244-255

(2) S. Heller, A. McNaught, D. Tchekhovskoi and S. Stein, InChI - the worldwide chemical structure identifier standard, J. Cheminf. 2013, 5

(3) DailyMed entry for itraconazole

(4) A. H. Beckett, A. F. Casy and G. Kirk, Alpha and Beta Prodine Type Compounds, J. Med. and Pharmaceut. Chem. 1959, 1, 1-58

(5) J. Chambers, M. Davies, A. Gaulton, G. Papadatos, A. Hersey and J. P. Overington, UniChem: extension of InChI-based compound mapping to salt, connectivity and stereochemistry layers, J. Cheminf. 2014, 6:43
