ChEMBL Resources


Friday, 24 April 2009

New Staff In The Group

We are coming to the end of our current round of recruitment, and new starters are now arriving regularly, laying claim to their part of the caravan, etc. Here is a link to the group members.

Wednesday, 15 April 2009

Bioactive Peptides

Peptides (short polymers built from amide-linked alpha-amino acids) are one of the largest classes of bioactive compounds. Many drugs are peptides, or peptide derivatives; furthermore, the ready accessibility of amino-acid monomers and of chemistry for reliable assembly has led to very extensive characterisation of peptides as bioactives. An additional very useful property of peptide derivatives is that, due to their modular nature and the conformational constraints enforced by stereochemistry and the peptide backbone geometry, it is often possible to derive some pretty good QSARs from amino-acid sidechain properties. It is also possible to automatically classify peptides into various subclasses (natural, non-natural, N-capped, C-capped, cyclic, etc.) in some sort of ontology-based classification.

The following is a brief overview of the peptide content of StARlite (release 31). In total, 41,128 compounds contain the simplest possible dipeptide substructure (di-glycine); this corresponds to about 9% of StARlite, so as a first approximation it is possible to say that 9% of StARlite is peptidic in nature (this also happens to be the largest single non-trivial structural class in StARlite). A table was then built of all distinct peptide units of a given length (up to 10 amino-acids in this case). The data is as follows...

peptide length | # length or longer | # exact length

Considering all possible natural amino-acid dipeptides gives 400 distinct dipeptides (20^2); this compares to the 16,512 dipeptides found in StARlite, implying a very diverse and expanded set of amino-acids. It would be pretty interesting to find out what fraction of the 400 possible natural dipeptides are actually sampled. Of course, much of the variation of the dipeptides will come from groups attached N- and C-terminal to the dipeptide, but even so, the sampled variation of sidechains is pretty good. There are 8,000 distinct tri-peptides (20^3) constructed from the 20 natural amino-acids; it is clear that (even assuming the tripeptides are all simple and unelaborated) there is poor coverage of tripeptides (6,079 vs 8,000) - chemical diversity scales very poorly! It is also pretty clear that there is a pseudo-power law distribution to the observed peptide length distribution (see below).
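The arithmetic above is easy to check. Here is a minimal Python sketch that enumerates the natural dipeptides and tripeptides from the 20 standard one-letter amino-acid codes and sets them against the StARlite counts quoted in the text. The function name `enumerate_peptides` is purely illustrative, and the observed counts are copied from the paragraph above rather than recomputed from the database.

```python
from itertools import product

# One-letter codes for the 20 standard (natural) amino acids.
NATURAL_AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def enumerate_peptides(length):
    """Return every linear sequence of natural amino acids of the given length."""
    return ["".join(p) for p in product(NATURAL_AMINO_ACIDS, repeat=length)]


dipeptides = enumerate_peptides(2)
tripeptides = enumerate_peptides(3)

# Distinct peptide units observed in StARlite, as quoted in the post.
observed = {2: 16512, 3: 6079}

for length, sequences in ((2, dipeptides), (3, tripeptides)):
    ratio = observed[length] / len(sequences)
    print(f"length {length}: {len(sequences)} natural possibilities, "
          f"{observed[length]} observed, ratio {ratio:.2f}")
```

Note that the tripeptide ratio is only an upper bound on the coverage of natural tripeptides, since the observed count also includes non-natural amino-acids.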

Here is a graph. I know it is bad practice not to label the axes, but the x-axis is the peptide length, and the y-axis is the frequency of that class. Green is the class of that length or more, and yellow is the class of exactly that length.
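For what it is worth, the two series in the graph are related by a running sum, assuming each item is counted once in the class of its exact length. The Python sketch below shows that relationship; the function name `length_or_longer` is hypothetical and the numbers are placeholders chosen for illustration, not StARlite data.

```python
from itertools import accumulate


def length_or_longer(exact_counts):
    """Turn per-length counts into 'that length or longer' counts.

    exact_counts is ordered from the shortest peptide length to the longest.
    """
    # Sum from the longest length backwards, then restore the original order.
    return list(accumulate(reversed(exact_counts)))[::-1]


# Placeholder counts for four consecutive lengths (NOT real StARlite values).
exact = [50, 30, 15, 5]
cumulative = length_or_longer(exact)
print(cumulative)
```

The first entry of the cumulative series is simply the total, which is why the green curve always sits on or above the yellow one.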

I have also pulled back some ligand efficiency data for this peptide set; at first glance, it looks very interesting... More later.

Monday, 13 April 2009

Books and Papers - 10 - A Travel Guide To Scientific Sites Of The British Isles

Oh my goodness! My head is splitting - I acquired a head cold on Friday, and today am still laid low. Other than interviews tomorrow, I am sorely tempted to spend the following day with Mr. Duvet and Mrs. Pillow...

Anyway, a book. If the preface is not a call to arms, I don't know what is; this really is an essential book for every scientist (who lives in or visits the British Isles). It is also a book that, when I reach for it from the shelves, makes the kids run to tidy up, or state they have homework. Regardless, I will quote the first couple of sentences from the preface...

'The Population of the British Isles is less than 0.2% that of the entire earth (sic); yet this tiny fraction of human society is responsible for an enormous number of cultural advances in both the arts and sciences. Public appreciation for the men and women of Britain and Ireland who wrote, painted, composed music, etc. is evident wherever one looks, but the recognition of explorers of nature are harder to find.'

For example, did you know that the Occam of Occam's Razor is William of Ockham, from Ockham in Surrey? Cool!

%T A Travel Guide To Scientific Sites Of The British Isles
%A Charles Tanford
%A Jacqueline Reynolds
%D 1995
%I John Wiley & Sons
%O ISBN 0-471-95070-2

Monday, 6 April 2009

StARlite Schema Walkthrough

So, here is another StARlite schema walkthrough (barring any unforeseen circumstances): Wednesday 8th April 2009, at 2pm UK local time, which is now BST. It will take an hour. Please mail me if you are interested in getting the weblink. You will need to call a UK telephone number, so please bear this in mind.

The image is of the Starlite rooms cocktail bar in Tujunga Village, Los Angeles. I have not visited there (yet), but given Tujunga's Utopian Socialist roots, it seems a mighty fine bar to have a drink in.

Thursday, 2 April 2009

Paper of the year?

Two simple words: Robot Scientist.

%T The Automation of Science
%A Ross D. King
%A Jem Rowland
%A Stephen G. Oliver
%A Michael Young
%A Wayne Aubrey
%A Emma Byrne
%A Maria Liakata
%A Magdalena Markham
%A Pinar Pir
%A Larisa N. Soldatova
%A Andrew Sparkes
%A Kenneth E. Whelan
%A Amanda Clare
%J Science
%D 2009
%V 324
%P 85-89

Wednesday, 1 April 2009

Bioisostere Discovery

Here is an old, old use case we developed for StARlite. This one looks at using data contained within StARlite to discover bioisosteres - functional group replacements that preserve activity while improving other properties, such as metabolism, patentability, solubility, etc. The algorithm exploits the useful 'data structure' of StARlite. Firstly, compounds are typically entered in the literature/database as clusters of synthetically related compounds (i.e. they typically share late-stage intermediates in their production), and therefore there are often reasonably straightforward ways to synthetically access these related compounds. Secondly, again because of the structure of the data, there are often equivalent assays to compare (same assays, done under the same conditions, by the same people), and so this removes one important variable from any further analysis (this is performed using the simple heuristic of only comparing quantitative data from the same StARlite doc_id).

Here is some (truly appalling, almost prose, it has been noted) pseudocode, in which one wants to find possible replacements for a particular Functional Group (for example, a nitro, a vinyl halide, a sulphonamide, etc.):

1. Search StARlite for all examples of the Functional Group
2. Identify all fragments that these Functional Groups are attached to (call these 'Contexts')
3. Search StARlite for all Contexts, then identify the corresponding Replacement Functional Groups
4. Build a table of Replacement Functional Groups and count the frequency of each type of interchange (this frequency list is pretty useful in its own right)
5. Retrieve quantitative values of binding energy difference (using endpoints such as IC50, Ki, Kd, etc.), constraining the comparison to the same assay_ids from the same doc_ids
6. Use these binding energy differences to compute an expectation value for the binding energy difference between the Functional Group and the Replacement Functional Group
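The six steps above can be sketched in Python over a small in-memory table. This is only an illustration under stated assumptions, not the code that was actually run against StARlite: the `Measurement` record, the `find_replacements` function, the context labels and every number are invented for the example, and the binding energy difference is approximated as RT ln(IC50 ratio), which treats IC50 as a stand-in for a binding constant.

```python
import math
from collections import Counter, defaultdict
from typing import NamedTuple


class Measurement(NamedTuple):
    doc_id: int
    assay_id: int
    context: str           # fragment the functional group is attached to
    functional_group: str
    ic50_nm: float


RT_KCAL_PER_MOL = 0.593  # R*T at roughly 298 K (assumed temperature)


def find_replacements(records, query_group):
    # Steps 1-2: contexts in which the query functional group occurs.
    contexts = {r.context for r in records if r.functional_group == query_group}
    # Steps 3-4: other groups seen on those contexts, counted per interchange.
    interchanges = {(r.context, r.functional_group) for r in records
                    if r.context in contexts and r.functional_group != query_group}
    counts = Counter(group for _, group in interchanges)
    # Step 5: compare only within the same doc_id, assay_id and context.
    query_values = {(r.doc_id, r.assay_id, r.context): r.ic50_nm
                    for r in records if r.functional_group == query_group}
    deltas = defaultdict(list)
    for r in records:
        key = (r.doc_id, r.assay_id, r.context)
        if r.functional_group != query_group and key in query_values:
            ratio = r.ic50_nm / query_values[key]
            deltas[r.functional_group].append(RT_KCAL_PER_MOL * math.log(ratio))
    # Step 6: mean difference per replacement (negative = tighter binding).
    expectation = {g: sum(v) / len(v) for g, v in deltas.items()}
    return counts, expectation


# Invented toy data; replacement_C comes from a different doc_id, so it is
# counted as an interchange but excluded from the energy comparison.
records = [
    Measurement(1, 10, "context_1", "carboxylic acid", 100.0),
    Measurement(1, 10, "context_1", "replacement_A", 100.0),
    Measurement(1, 10, "context_1", "replacement_B", 10000.0),
    Measurement(2, 20, "context_2", "carboxylic acid", 50.0),
    Measurement(2, 20, "context_2", "replacement_A", 5.0),
    Measurement(3, 30, "context_2", "replacement_C", 50.0),
]

counts, expectation = find_replacements(records, "carboxylic acid")
print(counts.most_common())
print(expectation)
```

In a real run, the contexts would come from substructure fragmentation of the compounds rather than from a ready-made label, and one would probably want the spread of the differences as well as the mean.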

So a good bioisostere would preserve (or improve) binding energy; these are then pretty easy to identify from the tables generated above. Of course, with the multiple end points stored in StARlite, and the generality of the approach, the same basic workflow can be used to identify functional group replacements that can improve half-life, solubility, logD, etc.

Here is an old slide of a real case: the replacement of a carboxylic acid with other functional groups. Hopefully, with the background above, the figure is self-explanatory...

The picture used in the header of the post is from the excellent and very amusing B'eau Bo D'Or blog, and I think perfectly illustrates bioisosterism - albeit in a context that is completely opaque to anyone not steeped in the 70's and 80's popular culture of the United Kingdom.