The following is a brief overview of the peptide content of StARlite (release 31). In total, 41,128 compounds contain the simplest possible dipeptide substructure (di-glycine); this corresponds to about 9% of StARlite. So, as a first approximation, it is possible to say that 9% of StARlite is peptidic in nature (this also happens to be the largest single non-trivial structural class in StARlite). A table was then built of all distinct peptide units of a given length (up to 10 amino acids in this case). The data are as follows:
|peptide length||# length or longer||# exact length|
Considering all possible natural amino-acid dipeptides gives 400 distinct dipeptides (20^2). This compares to the 16,512 dipeptides found in StARlite, implying a very diverse and expanded set of amino acids. It would be pretty interesting to find out what fraction of the 400 possible natural dipeptides are actually sampled. Of course, much of the variation of the dipeptides will come from groups attached N- and C-terminal to the dipeptide, but even so, the sampled variation of sidechains is pretty good. There are 8,000 distinct tripeptides (20^3) constructed from the 20 natural amino acids. It is clear that, even assuming the tripeptides are all simple and unelaborated, there is poor coverage of tripeptides (6,079 vs 8,000): chemical diversity scales very poorly! It is also pretty clear that the observed peptide lengths follow a pseudo-power law distribution (see below).
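For anyone wanting to check the 20^2 and 20^3 counts, or to answer the "what fraction is actually sampled" question, here is a minimal Python sketch. It assumes the observed peptides have already been reduced to one-letter sequences, which is not how they come out of StARlite; the function names and the toy input are purely illustrative. Note that if the 6,079 observed tripeptides include non-natural residues, 6,079/8,000 (about 76%) is only an upper bound on the natural coverage.

```python
from itertools import product

# One-letter codes for the 20 natural amino acids.
NATURAL_AA = "ACDEFGHIKLMNPQRSTVWY"

def all_natural_peptides(length):
    """Return every sequence of the given length over the 20 natural amino acids."""
    return {"".join(p) for p in product(NATURAL_AA, repeat=length)}

def natural_coverage(observed, length):
    """Fraction of the possible natural peptides of this length found in `observed`.

    `observed` is a hypothetical iterable of one-letter sequences; entries
    that are not natural peptides of the right length are simply ignored.
    """
    possible = all_natural_peptides(length)
    return len(possible & set(observed)) / len(possible)

print(len(all_natural_peptides(2)))  # 400
print(len(all_natural_peptides(3)))  # 8000

# Toy input, not StARlite data: two natural dipeptides plus one non-natural entry.
print(natural_coverage(["GG", "AG", "XG"], 2))  # 0.005
```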
Here is a graph. I know it is bad practice not to label the axes, but the x-axis is the peptide length and the y-axis is the frequency of that class. Green is the class of that length or more, and yellow is the class of exactly that length.
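The two series in the graph, and the power-law-like shape, can be explored with a few lines of Python. This is only a sketch under stated assumptions: it assumes both columns count the same kind of item, so that an "exact length" count is the difference between consecutive "length or longer" counts, and it estimates a power-law exponent as the least-squares slope on log-log axes, which is a quick check rather than a rigorous fit. The function names and the numbers below are made up for illustration and are not the StARlite counts.

```python
import math

def exact_from_cumulative(at_least):
    """Convert 'length or longer' counts into 'exact length' counts.

    at_least[i] is the count for the i-th length class or longer. The final
    class cannot be resolved without the next cumulative count, so the result
    has one fewer entry than the input.
    """
    return [a - b for a, b in zip(at_least, at_least[1:])]

def loglog_slope(lengths, counts):
    """Least-squares slope of log(count) against log(length)."""
    xs = [math.log(x) for x in lengths]
    ys = [math.log(y) for y in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Synthetic cumulative counts, not StARlite data.
print(exact_from_cumulative([10, 6, 3, 1]))  # [4, 3, 2]

# Synthetic counts that follow count = 1000 * length^-2 exactly.
lengths = [2, 3, 4, 5]
counts = [1000.0 * length ** -2 for length in lengths]
print(loglog_slope(lengths, counts))  # close to -2 by construction
```

A roughly straight line on log-log axes (a stable slope across the length range) is what "pseudo-power law" would look like in practice.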
I have also pulled back some ligand efficiency data for this peptide set; at first glance, it looks very interesting... More later.