Have you ever wondered which compound is the most popular in ChEMBL? And by popular I don't mean the one that cracks the best jokes at dinner parties; I mean the compound with the largest number of structural analogues or nearest neighbours (NNs). This number also gives an indication of the sparsity or density of the chemical space around a compound and is a useful concept during hit expansion and lead optimisation. Of course, it depends on the fingerprint, the hashing and folding parameters, the similarity coefficient and the threshold. So let's say 2048-bit RDKit Morgan fingerprints with a radius of 2 or 3 (equivalent to ECFP_4 or ECFP_6) and a Tanimoto threshold of 0.5. Why such a low threshold? For an explanation, see here and here. To calculate this compound 'popularity', one would need to calculate the full similarity matrix of the 1.4M compounds in ChEMBL. This was prohibitively computationally expensive just a few years ago; nowadays,...
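To make the 'popularity' idea concrete, here is a minimal sketch of counting nearest neighbours at a Tanimoto threshold of 0.5. It is not the method used for ChEMBL: the fingerprints are toy sets of on-bit indices, and the compound names and the helper functions `tanimoto` and `neighbour_counts` are hypothetical, chosen only for illustration. With real data one would presumably generate the bits with RDKit (for example `AllChem.GetMorganFingerprintAsBitVect`) and compare them with `DataStructs.BulkTanimotoSimilarity` instead.

```python
from itertools import combinations


def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 0.0  # convention assumed here for two empty fingerprints
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def neighbour_counts(fps: dict, threshold: float = 0.5) -> dict:
    """Count, for each compound, how many others reach the similarity threshold."""
    counts = {name: 0 for name in fps}
    # Brute-force loop over every unordered pair, i.e. half the similarity matrix.
    for (name1, fp1), (name2, fp2) in combinations(fps.items(), 2):
        if tanimoto(fp1, fp2) >= threshold:
            counts[name1] += 1
            counts[name2] += 1
    return counts


# Toy fingerprints (hypothetical compounds, not real ChEMBL data).
toy_fps = {
    "cpd_A": frozenset({1, 2, 3, 4}),
    "cpd_B": frozenset({1, 2, 3, 5}),
    "cpd_C": frozenset({1, 2, 3, 4, 6, 10}),
    "cpd_D": frozenset({7, 8, 9}),
}

counts = neighbour_counts(toy_fps)
most_popular = max(counts, key=counts.get)
print(counts, most_popular)
```

The pairwise loop grows quadratically with the number of compounds, which is why the full matrix for 1.4M compounds is the expensive part; the sketch only illustrates the counting logic.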
The Organization of Drug Discovery Data