TL;DR:
In his excellent blog post, Matt Swain described the implementation of compound similarity searches in MongoDB. Unfortunately, Matt's approach had suboptimal (polynomial) time complexity with respect to decreasing similarity thresholds, which renders unsuitable for production environments. In this article, we improve on the method by enhancing it with Locality Sensitive Hashing algorithm, which significantly reduces query time and outperforms RDKit PostgreSQL cartridge.
myChEMBL 21 - NoSQL edition
Given that NoSQL technologies applied to computational chemistry and cheminformatics are gaining traction and popularity, we decided to include a taster in future myChEMBL releases. Two especially appealing technologies are Neo4j and MongoDB. The former is a graph database and the latter is a BSON document storage. We would like to provide IPython notebook-based tutorials explaining how to use this software to deal with common cheminformatics problems.
Matt Swain's article "Chemical similarity search in MongoDB" seemed to be an ideal candidate for converting it to IPython notebook and shipping with the next myChEMBL release, so we've started working on it. Matt mentions other researchers, who have worked on this topic already: Davy Suvee, whose series (1, 2, 3) of blog posts defined the basic algorithm, without providing any statistics about performance and accuracy and Rajarshi Guha, who attempted to verify time complexity using one million compound sample from ChEMBL 17 and was disappointed with the performance of a median query time of 6332 ms with the similarity threshold set to 0.9.
General algorithm:
The general algorithm developed by Suvee, verified by Guha and improved by Swain looks as follows:
1. For each compound, compute a fingerprint(s) and store the compound identifier along with the fingerprint in MongoDB collection (precomputing step).
2. For each fingerprint generated in step 1, store it along with the number of occurrences and its length in separate MongoDB collection (precomputing step)
3. For the query compound, compute the fingerprint and its length Lq.
4. Apply pruning rules, originally described by Swamidass and Baldi in their paper:
a) Consider only fingerprints with length L in range:
T * Lq <= L <= Lq / T
where T is a threshold
b) Consider:
Lq − T * Lq + 1
most differentiating fingerprint bits to check if at least one is present in searched compounds.
5. For all compounds which successfully passed pruning in step 4, compute the Tanimoto coefficient and return only those with a similarity score higher than the given threshold.
Time complexity:
After benchmarking his solution with 1000 random compounds from ChEMBL 18, Matt got following results:
While not identical with the results we obtained (we got higher absolute values because we had SMILES on input, as opposed to RDKit objects in Matt's code, we measured the mean query time by recording the time needed to complete 1k random queries against ChEMBL 20 using a docker container running on tiny Gigabyte BRIX machine with a Intel Core i7-3537U 2GHz CPU, 16GB RAM and 256GB SSD with Ubuntu 14.04 as host OS), the shape of the curve is quite similar and indicates polynomial time complexity with respect to decreasing threshold.
This is easy to understand as the function f(T) = [(Lq/T) - TLq](Lq-TLq + 1) can be reduced to a polynomial of the fourth degree. The lower is threshold the more compounds pass pruning and it takes more time to compute Tanimoto coefficient. Assuming optimistically that Matt's results can be easily achieved on any machine/computing environment, then searching 1k compounds with the threshold of 0.7 would take more than 16 minutes (and usually the query is provided as SMILES string, not a binary RDKit object, which requires expensive time consuming conversion), which can be too long for production environments. Moreover, there is a faster alternative, namely RDKit cartridge, which is (according to Matt's results) faster for thresholds lower than 0.85 and a bit slower for higher thresholds.
Introducing LSH
After we realised, that the time complexity is not optimal (mostly by sitting and staring at monitor for ages waiting for results), we decided to look for ways to improve the process. Some more time spent on researching this topic led us to the conclusion that using a technique called 'Locality Sensitive Hashing' (LSH) can help us solving the problem.
The idea behind LSH is pretty simple - while most hashes we use everyday (cryptographic hashes like SHA) generate completely different values for documents, even when they differ only slightly, LSH tries to generate similar hash values for similar documents in order to maximize the probability of hash collision.
We've managed to find only two research papers, directly describing the application of LSH to chemical similarity. The first one, by Dutta, Guha et al. is quite old (2006) and was published 3 whole years before the initial release of MongoDB. It looks at a bit different problems, such as outlier detection and clustering and uses LSH implementation written in C++ by Andoni and Indyk - authors of the authoritative paper in this field: Near-Optimal Hashing Algorithms for Approximate Nearest Neighbor in High Dimensions. The second paper by Nasr, Baldi et al. exactly describes our problem but is highly theoretical and doesn't provide any code or hints for practical implementation.
Comparison with audio fingerprinting
Knowing that our problem is quite analogous to finding songs similar to the one played currently in the radio, we searched for applications of LSH to audio fingerprinting and stumbled upon this article by Ciumac Sergiu. It describes an algorithm for detecting duplicate songs using MinHash (min-wise independent permutations LSH scheme). Based on this article, we've created an improved compound similarity search algorithm, which as the following steps:
1. For each compound, compute a fingerprint(s) and store the compound identifier along with the fingerprint in MongoDB collection (precomputing step).
2. Generate 100 random permutations of length equal to the length of a single fingerprint and save them as a separate MongoDB collection.
3. For each fingerprint:
For each permutation generated in step 2, reorder the fingerprint bits according to the permutation and store the position of the first '1' bit.
After this is completed each fingerprint will have a corresponding list of 100 positions of the first '1' according to 100 permutations, store this information along with the fingerprint.
4. For each list generated in step 3, divide this list into 25 chunks, 4 numbers in each chunk. Using left shift (<<) binary operation convert four numbers in each chunk into single 64-bit number. After this is done each fingerprint will be reduced to the vector of 25 integers, store this vector along with the fingerprint.
5. Create 25 different MongoDB collections. For each compound, get the corresponding 64-bit integer vector. For each number in the vector check the respective collection for the document identified by this number. If the document exists, extend it with compound identifier. If it doesn't exist create a new document identified by the number and containing single compound identifier.
6. For the query compound use the method described in steps 3 and 4 to generate a vector of 25 64-bit integer numbers. For each number check the respective collection for the document identified by this number. It it exists take compounds assigned to this document. Get the distinct set of all compounds found by this process for all 25 numbers. On this set, apply pruning procedure from the original algorithm and compute Tanimoto coefficient to all compound that successfully passed pruning. Return only those with a similarity score higher than the given threshold.
Comparing results
After implementing this method in Python and calling it from IPython notebook we've benchmarked it in the same way as we did with original algorithm (mean time of 5 repetitions of searching 1k random sample against 1455713 compounds from ChEMBL 20 for each discrete threshold value). In the same way we measured RDKit postgreSQL cartridge performance and got these results:
As you can see, LSH-based method is always faster than PostgreSQL cartridge and apart from 0.95 point, faster than Swain's implementation. Again, you may wonder why all the numbers are higher than Matt's results. Well, the most important difference between our and Matt's implementation is that our software expects SMILES as an input whereas Matt's stuff expects RDKit mol. This means that our methods have to perform one more expensive step during each iteration. We've decided to implement our functions like this because this is what they should do in practice. Another (but less significant) factor is that we were running our calculations in completely different environment (hardware, OS, presence/absence of virtualisation, ChEMBL version, RDKit version).
What about accuracy?
EDIT: we've updated error plots due to the bug in code (explained in comments), the actual error values are lower then we said previously.
Shorter timings is one thing but are we getting exactly the same results? Since our hashing function is only an approximation, we should expect some discrepancies. To measure the number of discrepancies, we selected random sample of 1k compounds and compared results from Swain's and our implementation. For each pair of results, we recorded the size of symmetric set difference between results int the pair and divided them by the size of sum (union) of both result sets. We repeated this for different threshold values, results (in percent) are presented below:
As an alternative error measure, we counted number of compounds, for which we observed any discrepancy in results (regardless of its size) and divided by the number of compounds in sample (1k). Again, results in percent are shown below:
As can be noticed, for thresholds greater or equal to 0.8 the results are almost the same. For threshold 0.7, less than 6% of query compounds has different results. As we checked, the result of LSH similarity is always a subset of original method. This number alone is probably not very high but we decided to further investigate this. At first we thought that since without any pruning and filtering by Tanimoto coefficient, pure LSH procedure returns on average 3.5k 'similar' compounds for given query compound, then for low thresholds standard algorithm will yield more results and this is why we observe differences. This wouldn't be very bad because usually when asking for similar compounds we don't expect to deal with thousands of them. But it turned out that this situation happens very rarely, especially for Morgan fingerprints. To get better idea of what really happens let's take some example query compound that generates different results.
As can be noticed, for thresholds greater or equal to 0.8 the results are almost the same. For threshold 0.7, less than 6% of query compounds has different results. As we checked, the result of LSH similarity is always a subset of original method. This number alone is probably not very high but we decided to further investigate this. At first we thought that since without any pruning and filtering by Tanimoto coefficient, pure LSH procedure returns on average 3.5k 'similar' compounds for given query compound, then for low thresholds standard algorithm will yield more results and this is why we observe differences. This wouldn't be very bad because usually when asking for similar compounds we don't expect to deal with thousands of them. But it turned out that this situation happens very rarely, especially for Morgan fingerprints. To get better idea of what really happens let's take some example query compound that generates different results.
For input SMILES:
Clc1ccc(Cn2c(nc3ccccc23)C(=O)N4CCC(CC4)C(=O)NCCc5ccncc5)cc1
We've got two two results using standard algorithm and one using LSH. Those where
CHEMBL3085935
and CHEMBL3085944
, their SMILES are:O=C(NCCc1ccncc1)C1CCN(C(=O)c2nc3ccccc3n2Cc2ccc(Cl)cc2)CC1
O=C(NCCc1ccncc1)C1CCN(C(=O)c2cc3ccccc3n2Cc2ccc(Cl)cc2)CC1
The difference is marked in red. Let's look at their vector of 25 hashes:
[644282988592, 77607305247, 610086746128, 412446916610, 266355085320, 137447358470,...]
[25807697968, 77607305263, 610086778896, 412514025474, 403794038792, 137447358509,...]
Notice, that some of them have very similar values (thanks to LSH). This means that the algorithm can be further improved by increasing number of chunks (instead of dividing 100 integers into 25 groups of 4, we could split them into 50 groups of 2 or even skip diving them at all) or use a different data structure, that instead of exactly matching hash string would first give a neighbors within some radius and return compounds from this neighborhood. Of course there would be some performance penalty but we believe it's possible to make the discrepancy chart completely flat in the range of thresholds greater than 0.7 and still have better performance than original method.
Further improvements/research
There are number of things mentioned in this article that could be better explored. First of all it should be noted, that we are not really comparing MongoDB and PostgreSQL as competing database engines. This is because both engines are using different techniques to implement compound similarity search. RDKit Postgres cartridge is not using LSH but it should be relatively easy to add code implementing it to RDKit library or as a separate piece of software called from the cartridge. On the other hand MongoDB support any extensions, executed on the MongoDB server, so all computations have to be done on the client side. That said, there is a very interesting project called ToroDB - MongoDB drop-in replacement, that runs on top of PostgreSQL. So a possible immediate next step would be to run the same compound searching benchmarks using ToroDB, which should give us some idea of how PostgreSQL and MongoDB perform executing the same NoSQL tasks.
Generating permutations for LSH is another interesting topic. In our code, we used the standard numpy random function generating uniformly distributed numbers. It would be interesting to check if using different probability distributions can give better results. The audio fingerprinting article, on which we based our code mentions a paper by Baluja et al. titled "Permutation grouping: intelligent hash function design for audio and image retrieval".
Measuring accuracy and time over LSH vector/groups size would be important as well. That would allow to find the optimal number of hashes that are required to get desired accuracy/time ratio.
Last but not least, in our benchmarks we've only used Morgan 2048-bit fingerprints. Fixed length fingerprints are already hashed and folded to achieve constant length and the price for this is worse fidelity. Using LSH on unfolded fingerprints may lead to better results; in addition it would also be interesting to compare different types of fingerprints as well as different lengths.
Code
EDIT: you can now download the code from here.
To better illustrate our method, we provide this IPython notebook. As already explained, it's based on Matt Swain blog post but enhanced with methods described by Ciumac Sergiu in his article about audio fingerprinting. This notebook will be included (along with all required dependencies) on myChEMBL 21 but if you have ChEMBL 20, you can load it as well but then, it's important to remember about installing MongoDB first. This can be done easily by ssh-ing to the machine and running:
sudo [apt-get|yum] install mongodb
and then start the mongod daemon:
sudo mongod --fork --logpath /var/log/mongod.log
Using the 'fork' option is very important and not including it may lead to worse performance. When running ChEMBL 20 as a virtual machine, it may happen that the size of collections created by MongoDB will exceed remaining disk space. This is why we suggest using ChEMBL 20 docker container on Linux box when possible because then, the only limiting factor is the size of physical disk space of the host machine.
Michał
Comments
I'm quite surprised by the results with the PostgreSQL cartridge. I would expect the search time to decrease with increasing similarity threshold. I was able to reproduce the behavior though, so I will try to figure out what's going on (or why my expectations are wrong).
One concern I have with this: The results look great for high cutoffs (i.e. finding close neighbors), but one often wants to find compounds that are "somewhat similar" to the query. This requires searching with a fairly loose similarity cutoff. It looks like the LSH approach leads to an accuracy problem here.
- selecting optimal permutations within a group
- using other fingerprint types/lengths
- using different data structures (range trees?) or increasing the number of buckets
and how would that affect time?
One other thing that I tried since I wrote my original blog post is "sharding" the MongoDB collection. With the default setup, MongoDB only uses one CPU core to perform the query, but if you shard the collection, it will use a core for each shard. This can be a huge benefit for CPU intensive queries like this. You can run multiple shards on the same server, or split shards across multiple servers, or both! Sharding across multiple servers also means that you can easily scale up your RAM beyond what would be possible to have in a single machine. This is useful for huge databases (i.e. ~100m+ molecules) where you start to have trouble fitting the indexes in RAM on a single machine. Instead, each server only has to hold its own shards in memory, and you can scale by just adding more servers as your database grows.
Also, it's worth noting: for my original postgres benchmarks, I just compiled the RDKit Postgres cartridge with the default settings. However, since then I've noticed that some of the options in the makefile should probably be changed from their default to improve performance - in particular turning on 'USE_POPCOUNT' and possibly also 'USE_THREADS'. That would probably give a more fair comparison.
How about searching based on HAMMING SPACE . I see that one disadvantage of the search is requirement for a huge
amount of memory in order to perform random permutations.
Check this article .
http://onlinelibrary.wiley.com/doi/10.1002/ecj.11561/abstract
Another thing is that I just found that the actual error rate is much LOWER than published. This is because for each compound, I measured a set difference of results and stored cumulative size of differences over all compounds. But then I divided it by the number of compounds, which doesn't make much sense. I should divide it by cumulative set sum over all compounds, which can only be larger, which would make error rate lower.
As alternative measure, I could count number of compounds, which had any difference in results, regardless of how big this difference is and then divide it by the number of all compounds. And this obviously would also make error rate lower than published, assuming that I will get similar results.
I will recompute error rates and update article soon.