
Striving for Perfect Representation of Chemical Structures – is this possible?


It probably goes without saying that at ChEMBL we want to make all our data as accurate and useful as possible. With this in mind, we have spent many hours over the last few years curating, in particular, the structures of marketed drugs and clinical candidates. We aren’t alone in this: more than 5 years ago people were coming across the same problems, as highlighted in this blog post by ChemConnector on Fluvastatin.

Our drug curation is an ongoing and probably a never-ending task but to be honest it has proved a lot more difficult than we expected. This is for several reasons:

Firstly, where do we go to find the definitive structure of a molecule? One would have thought this would be easy, but even sources such as INN and USAN don’t always agree. For example, for Telavancin the USAN data sheet shows different nitrogen and carbon counts in the structure image compared with the image in the INN document (although the molecular formulae are the same in both documents).

Secondly, while molfiles are our definitive structures and we use standard InChIs to determine uniqueness, we see many examples where converting between formats (molfiles, SMILES, InChIs) introduces inconsistencies. This is of course a well-known problem. There are ongoing discussions and initiatives to develop open structure formats and extend InChIs to deal with some of these cases, but my sense is that this is a long way off.

Lastly, and what seems like an insurmountable problem, we are constrained by the method we use to represent chemical structures. We, at ChEMBL, like many people, use version 2000 (v2000) molfiles (ref 1). There is no doubt that using v3000 molfiles would solve a number of these problems, but the conversion would be very time-consuming and costly and is therefore probably only feasible for a limited number of ChEMBL structures, such as the drug molecules. We are considering this as a long-term goal, but it would need wider community buy-in to make it worthwhile. We also suspect that many of our users only use the older molfile version, so providing the v3000 format wouldn’t help them. We would be interested in your feedback on which format you use. Most of the resources we exchange data with (e.g. PubChem, BindingDB) also use v2000 molfiles. Different resources find their own ways to cope with the limitations of the file formats, and we do too. For example, it would be possible to use non-standard extensions of the data fields in the SD file to record what the molfile cannot express, but this would lack real chemical awareness. Also, one group’s conventions won’t necessarily be consistent with another’s, so we would be no further forward.
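If you are not sure which molfile version your own pipeline handles, the version stamp at the end of the counts line (the fourth line of a molfile) tells you. The following is a minimal sketch, not ChEMBL code: the function name `molfile_version` and the hand-written single-atom record are illustrative assumptions.

```python
def molfile_version(molblock: str) -> str:
    """Return 'V2000', 'V3000' or 'unknown' for a molfile given as text."""
    lines = molblock.splitlines()
    if len(lines) < 4:
        return "unknown"
    counts_line = lines[3].rstrip()  # three header lines precede the counts line
    for tag in ("V2000", "V3000"):
        if counts_line.endswith(tag):
            return tag
    return "unknown"


# Hypothetical single-atom record, written by hand purely for illustration.
EXAMPLE_V2000 = "\n".join([
    "methane",
    "  hand-written example",
    "",
    "  1  0  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "M  END",
])

print(molfile_version(EXAMPLE_V2000))
```

Running this over the first record of an SD file is a quick way to see which format a data source actually delivers.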

As a consequence of our curation efforts, we have come across an increasing number of challenging molecules, and it would be useful to get the views of our users on the best way to deal with them. It should also be said that we are only talking about apparently “simple” rule-of-5-compliant organic molecules. Several years ago we stopped trying to curate organometallic compounds, and we don’t show the structures of these in ChEMBL. The drug cisplatin is a case in point: the v2000 molfile has no way of coding coordination bonds, and the standard InChI (ref 2) that we use to define a unique chemical structure can’t distinguish between cis- and trans-platin.

Back to the organic molecules though and a few of our dilemmas:

Milnacipran is my favourite and an apparently relatively simple example. It is a mixture of the 1S,2R and 1R,2S enantiomers (USAN). However, v2000 molfiles don’t deal with relative stereochemistry, so we have 3 options:

(1) Show one enantiomer
(2) Show it as a racemic mixture, i.e. with no stereochemistry
(3) Show it as a molfile comprised of two molecules

Arguably option 3 is the only correct way to do this. However, other data providers such as FDA and DrugBank use option 1. In the ChEMBL database we use option 2 so that we can distinguish milnacipran from levomilnacipran (USAN; specifically the 1S,2R isomer) or dextromilnacipran (1R,2S). Option 1 wouldn’t enable us to distinguish these either in the molfile or in the standard InChI.

My logic for not using option 3 comes from thinking about the use people make of ChEMBL. ChEMBL is not a registration system, where option 3 might indeed be needed; it is used as a source of bioactivity data for identifying tool compounds, building QSAR models for specific targets, etc. Calculating physicochemical properties on mixtures makes little sense, so wouldn’t users taking our 1.8 million compounds simply discard any mixtures, such as those option 3 would give, before starting their analysis?
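To make the "discard any mixtures" step concrete, here is a minimal sketch of the kind of pre-filter a user might apply. It assumes the structures are held as SMILES strings (where a "." separates disconnected components) rather than molfiles; the function name and the three toy records are hypothetical and not taken from ChEMBL.

```python
def is_multicomponent(smiles: str) -> bool:
    """True when a SMILES string has more than one disconnected component."""
    return "." in smiles


# Toy records for illustration only.
records = {
    "ethanol": "CCO",
    "sodium chloride": "[Na+].[Cl-]",
    "two-component example": "CC(=O)O.CCN",
}

# Keep only single-component structures before computing properties.
single = {name: smi for name, smi in records.items() if not is_multicomponent(smi)}
print(sorted(single))
```

Note that a filter this blunt also throws away salts, which is one reason an option 3 record would probably never reach the modelling stage.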

OK so suppose you disagree and think option 3 is the right thing to do, what would you want us to do for itraconazole? This is described in DailyMed (ref 3) as a “1:1:1:1 racemic mixture of four diastereoisomers (two enantiomeric pairs)”.

Option 3 would give us a mixture of 4 molecules in our v2000 molfile. For example:

Again, we have chosen option 2 as the least bad option, i.e. just showing it as a racemic mixture.

It seemed as if we had identified a workable and at least internally consistent way of dealing with these structures, until we took a look at the following two examples, alphaprodine and betaprodine:

Here alphaprodine is a mixture of the (RS,SR) enantiomers and betaprodine is a mixture of the (SS,RR) enantiomers.
Hence our use of option 2 fails to distinguish between them! This matters because the two enantiomeric pairs have different biological properties, e.g. different analgesic activity (ref 4).
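The collision can be shown with a few lines of bookkeeping. This is a toy sketch rather than real cheminformatics: it treats the two stereocentres as independent labelled positions, ignores meso forms, and all function names are illustrative assumptions.

```python
from itertools import product


def mirror(labels):
    """Invert every R/S label to get the enantiomer."""
    return tuple("S" if label == "R" else "R" for label in labels)


def enantiomeric_pairs(n_centres):
    """Group every R/S assignment over n centres with its mirror image."""
    return {
        frozenset({labels, mirror(labels)})
        for labels in product("RS", repeat=n_centres)
    }


def strip_stereo(labels):
    """Option 2: keep the centres but forget their configuration."""
    return tuple("?" for _ in labels)


pairs = enantiomeric_pairs(2)
flat_forms = {strip_stereo(labels) for pair in pairs for labels in pair}

print(len(pairs), "racemic pairs")
print(len(flat_forms), "distinct flat representation(s)")
```

With two centres there are two racemic pairs, (RS,SR) and (SS,RR), but only one representation once the labels are dropped, which is exactly the alphaprodine and betaprodine situation.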

The other example is met(h)iomeprazine and levomet(h)iomeprazine, where the former is a mixture of two enantiomers and the latter is one enantiomer or the other (although, according to INN, it apparently isn’t known which).

For this example, we have chosen option 2 for metiomeprazine but for levometiomeprazine we show just one of the possible enantiomers.

In summary, no existing solutions are ideal and not everyone agrees on how to do this. In ChEMBL itself we are trying to be consistent within the constraints of the v2000 molfile format, but the work is not finished yet. There is, however, a glimmer of light in this confusion: our UniChem connectivity match (ref 5) enables matching of these cases across databases. For example, searching with the non-stereospecific representation of milnacipran matches that representation as well as the specific levo- and dextro-milnacipran enantiomers (and their salts). Details here.
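The idea behind a connectivity match can be sketched by comparing only the formula and connectivity (/c) layers of standard InChIs and ignoring the stereo layers. This is a simplified illustration and not UniChem’s implementation; it does not handle salts or other multi-component cases. Alanine is used as a stand-in because its InChIs are short, and `connectivity_key` is a hypothetical helper name.

```python
def connectivity_key(inchi: str) -> str:
    """Return the formula and /c layers of a standard InChI, dropping the rest."""
    prefix = "InChI=1S/"
    if not inchi.startswith(prefix):
        raise ValueError("expected a standard InChI")
    layers = inchi[len(prefix):].split("/")
    kept = [layers[0]]  # the formula layer carries no prefix letter
    kept += [layer for layer in layers[1:] if layer.startswith("c")]
    return "/".join(kept)


# Stand-in molecule (alanine), not one of the drugs discussed above.
l_alanine = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,1,4H2,(H,5,6)/t2-/m0/s1"
d_alanine = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,1,4H2,(H,5,6)/t2-/m1/s1"
flat_alanine = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,1,4H2,(H,5,6)"

keys = {connectivity_key(i) for i in (l_alanine, d_alanine, flat_alanine)}
print(keys)
```

All three records collapse to a single key, so a search with the non-stereospecific form finds both enantiomers, in the same way as the milnacipran example.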

So, ChEMBL users out there, we’d be interested in what you think. Do you prefer option 1, 2 or 3 for your use cases, or does it make no difference? We can’t promise an instant change, but we do want to hear your views. Before you ask, we know we have some inconsistencies in ChEMBL for these molecules, but we are undecided on what to do, and of course time spent on this is time not spent on other things. If you want to vote on your preferred option you can do so here.

As always if you think we have something wrong in ChEMBL please email chembl-help@ebi.ac.uk and we will endeavour to correct it.

References
(1) A. Dalby, J.G. Nourse, W.D. Hounshell, A.K.I. Gushurst, D.L. Grier, B.A. Leland and J. Laufer, Description of Several Chemical Structure File Formats Used by Computer Programs Developed at Molecular Design Limited, J. Chem. Inf. Comput. Sci. 1992, 32, 244-255

(2) S. Heller, A. McNaught, D. Tchekhovskoi and S. Stein, InChI - the worldwide chemical structure identifier standard, J. Cheminformatics 2013, 5

(3) DailyMed entry for Itraconazole https://dailymed.nlm.nih.gov/dailymed/drugInfo.cfm?setid=1e243ffb-31be-39a7-4946-83ce7b839e0a

(4) A.H. Beckett, A.F. Casy and G. Kirk, Alpha and Beta Prodine Type Compounds, J. Med. and Pharmaceut. Chem. 1959, 1, 1-58

(5) J. Chambers, M. Davies, A. Gaulton, G. Papadatos, A. Hersey and J.P. Overington, UniChem: extension of InChI-based compound mapping to salt, connectivity and stereochemistry layers, J. Cheminformatics 2014, 6:43


