Skip to main content

LSH-based similarity search in MongoDB is faster than postgres cartridge.


In his excellent blog post, Matt Swain described the implementation of compound similarity searches in MongoDB. Unfortunately, Matt's approach had suboptimal (polynomial) time complexity with respect to decreasing similarity thresholds, which renders unsuitable for production environments. In this article, we improve on the method by enhancing it with Locality Sensitive Hashing algorithm, which significantly reduces query time and outperforms RDKit PostgreSQL cartridge.

myChEMBL 21 - NoSQL edition 


Given that NoSQL technologies applied to computational chemistry and cheminformatics are gaining traction and popularity, we decided to include a taster in future myChEMBL releases. Two especially appealing technologies are Neo4j and MongoDB. The former is a graph database and the latter is a BSON document storage. We would like to provide IPython notebook-based tutorials explaining how to use this software to deal with common cheminformatics problems.

Matt Swain's article "Chemical similarity search in MongoDB" seemed to be an ideal candidate for converting it to IPython notebook and shipping with the next myChEMBL release, so we've started working on it. Matt mentions other researchers, who have worked on this topic already: Davy Suvee, whose series (1, 2, 3) of blog posts defined the basic algorithm, without providing any statistics about performance and accuracy and Rajarshi Guha, who attempted to verify time complexity using one million compound sample from ChEMBL 17 and was disappointed with the performance of a median query time of 6332 ms with the similarity threshold set to 0.9.

General algorithm:

The general algorithm developed by Suvee, verified by Guha and improved by Swain looks as follows:

1. For each compound, compute a fingerprint(s) and store the compound identifier along with the fingerprint in MongoDB collection (precomputing step).

2. For each fingerprint generated in step 1, store it along with the number of occurrences and its length in separate MongoDB collection (precomputing step)

3. For the query compound, compute the fingerprint and its length Lq.

4. Apply pruning rules, originally described by Swamidass and Baldi in their paper:

    a) Consider only fingerprints with length L in range:
    T * Lq <= L <= Lq / T 

       where T is a threshold

    b) Consider:

        Lq − T * Lq + 1

        most differentiating fingerprint bits to check if at least one is present in searched compounds.

5. For all compounds which successfully passed pruning in step 4, compute the Tanimoto coefficient and return only those with a similarity score higher than the given threshold. 

Time complexity:

After benchmarking his solution with 1000 random compounds from ChEMBL 18,  Matt got following results:

While not identical with the results we obtained (we got higher absolute values because we had SMILES on input, as opposed to RDKit objects in Matt's code, we measured the mean query time by recording the time needed to complete 1k random queries against ChEMBL 20 using a docker container running on tiny Gigabyte BRIX machine with a Intel Core i7-3537U 2GHz CPU, 16GB RAM and 256GB SSD with Ubuntu 14.04 as host OS), the shape of the curve is quite similar and indicates polynomial time complexity with respect to decreasing threshold.

This is easy to understand as the function f(T) = [(Lq/T) - TLq](Lq-TLq + 1) can be reduced to a polynomial of the fourth degree. The lower is threshold the more compounds pass pruning and it takes more time to compute Tanimoto coefficient. Assuming optimistically that Matt's results can be easily achieved on any machine/computing environment, then searching 1k compounds with the threshold of 0.7 would take more than 16 minutes (and usually the query is provided as SMILES string, not a binary RDKit object, which requires expensive time consuming conversion), which can be too long for production environments. Moreover, there is a faster alternative, namely  RDKit cartridge, which is (according to Matt's results) faster for thresholds lower than 0.85 and a bit slower for higher thresholds.

Introducing LSH

After we realised, that the time complexity is not optimal (mostly by sitting and staring at monitor for ages waiting for results), we decided to look for ways to improve the process. Some more time spent on researching this topic led us to the conclusion that using a technique called 'Locality Sensitive Hashing' (LSH) can help us solving the problem.

The idea behind LSH is pretty simple - while most hashes we use everyday (cryptographic hashes like SHA) generate completely different values for documents, even when they differ only slightly, LSH tries to generate similar hash values for similar documents in order to maximize the probability of hash collision.

We've managed to find only two research papers, directly describing the application of LSH to chemical similarity. The first one, by Dutta, Guha et al. is quite old (2006) and was published 3 whole years before the initial release of MongoDB. It looks at a bit different problems, such as outlier detection and clustering and uses LSH implementation written in C++ by Andoni and Indyk - authors of the authoritative paper in this field: Near-Optimal Hashing Algorithms for Approximate Nearest Neighbor in High Dimensions. The second paper by Nasr, Baldi et al. exactly describes our problem but is highly theoretical and doesn't provide any code or hints for practical implementation.

Comparison with audio fingerprinting

Knowing that our problem is quite analogous to finding songs similar to the one played currently in the radio, we searched for applications of LSH to audio fingerprinting and stumbled upon this article by Ciumac Sergiu. It describes an algorithm for detecting duplicate songs using MinHash (min-wise independent permutations LSH scheme). Based on this article, we've created an improved compound similarity search algorithm, which as the following steps:

1. For each compound, compute a fingerprint(s) and store the compound identifier along with the fingerprint in MongoDB collection (precomputing step).

2. Generate 100 random permutations of length equal to the length of a single fingerprint and save them as a separate MongoDB collection.

3. For each fingerprint:
     For each permutation generated in step 2, reorder the fingerprint bits according to the permutation and store the position of the first '1' bit.
After this is completed each fingerprint will have a corresponding list of 100 positions of the first '1' according to 100 permutations, store this information along with the fingerprint.

4. For each list generated in step 3, divide this list into 25 chunks, 4 numbers in each chunk. Using left shift (<<) binary operation convert four numbers in each chunk into single 64-bit number. After this is done each fingerprint will be reduced to the vector of 25 integers, store this vector along with the fingerprint.

5. Create 25 different MongoDB collections. For each compound, get the corresponding 64-bit integer vector. For each number in the vector check the respective collection for the document identified by this number. If the document exists, extend it with compound identifier. If it doesn't exist create a new document identified by the number and containing single compound identifier.

6. For the query compound use the method described in steps 3 and 4 to generate a vector of 25 64-bit integer numbers. For each number check the respective collection for the document identified by this number. It it exists take compounds assigned to this document. Get the distinct set of all compounds found by this process for all 25 numbers. On this set, apply pruning procedure from the original algorithm and compute Tanimoto coefficient to all compound that successfully passed pruning. Return only those with a similarity score higher than the given threshold. 

Comparing results

After implementing this method in Python and calling it from IPython notebook we've benchmarked it in the same way as we did with original algorithm (mean time of 5 repetitions of searching 1k random sample against 1455713 compounds from ChEMBL 20 for each discrete threshold value). In the same way we measured RDKit postgreSQL cartridge performance and got these results:

As you can see, LSH-based method is always faster than PostgreSQL cartridge and apart from 0.95 point, faster than Swain's implementation. Again, you may wonder why all the numbers are higher than Matt's results. Well, the most important difference between our and Matt's implementation is that our software expects SMILES as an input whereas Matt's stuff expects RDKit mol. This means that our methods have to perform one more expensive step during each iteration. We've decided to implement our functions like this because this is what they should do in practice. Another (but less significant) factor is that we were running our calculations in completely different environment (hardware, OS, presence/absence of virtualisation, ChEMBL version, RDKit version).

What about accuracy?

EDIT: we've updated error plots due to the bug in code (explained in comments), the actual error values are lower then we said previously.

Shorter timings is one thing but are we getting exactly the same results? Since our hashing function is only an approximation, we should expect some discrepancies. To measure the number of discrepancies, we selected random sample of 1k compounds and compared results from Swain's and our implementation. For each pair of results, we recorded the size of symmetric set difference between results int the pair and divided them by the size of sum (union) of both result sets. We repeated this for different threshold values, results (in percent) are presented below:

As an alternative error measure, we counted number of compounds, for which we observed any discrepancy in results (regardless of its size) and divided by the number of compounds in sample (1k). Again, results in percent are shown below:

As can be noticed, for thresholds greater or equal to 0.8 the results are almost the same. For threshold 0.7, less than 6% of query compounds has different results. As we checked, the result of LSH similarity is always a subset of original method. This number alone is probably not very high but we decided to further investigate this. At first we thought that since without any pruning and filtering by Tanimoto coefficient, pure LSH procedure returns on average 3.5k 'similar' compounds for given query compound, then for low thresholds standard algorithm will yield more results and this is why we observe differences. This wouldn't be very bad because usually when asking for similar compounds we don't expect to deal with thousands of them. But it turned out that this situation happens very rarely, especially for Morgan fingerprints. To get better idea of what really happens let's take some example query compound that generates different results.

For input SMILES:


We've got two two results using standard algorithm and one using LSH. Those where CHEMBL3085935 and CHEMBL3085944, their SMILES are:


The difference is marked in red. Let's look at their vector of 25 hashes:

[644282988592, 77607305247, 610086746128, 412446916610, 266355085320, 137447358470,...][25807697968, 77607305263, 610086778896, 412514025474, 403794038792, 137447358509,...]

Notice, that some of them have very similar values (thanks to LSH). This means that the algorithm can be further improved by increasing number of chunks (instead of dividing 100 integers into 25 groups of 4, we could split them into 50 groups of 2 or even skip diving them at all) or use a different data structure, that instead of exactly matching hash string would first give a neighbors within some radius and return compounds from this neighborhood. Of course there would be some performance penalty but we believe it's possible to make the discrepancy chart completely flat in the range of thresholds greater than 0.7 and still have better performance than original method.

Further improvements/research

There are number of things mentioned in this article that could be better explored. First of all it should be noted, that we are not really comparing MongoDB and PostgreSQL as competing database engines. This is because both engines are using different techniques to implement compound similarity search. RDKit Postgres cartridge is not using LSH but it should be relatively easy to add code implementing it to RDKit library or as a separate piece of software called from the cartridge. On the other hand MongoDB support any extensions, executed on the MongoDB server, so all computations have to be done on the client side. That said, there is a very interesting project called ToroDB - MongoDB drop-in replacement, that runs on top of PostgreSQL. So a possible immediate next step would be to run the same compound searching benchmarks using ToroDB, which should give us some idea of how PostgreSQL and MongoDB perform executing the same NoSQL tasks.

Generating permutations for LSH is another interesting topic. In our code, we used the standard numpy random function generating uniformly distributed numbers. It would be interesting to check if using different probability distributions can give better results. The audio fingerprinting article, on which we based our code mentions a paper by Baluja et al. titled "Permutation grouping: intelligent hash function design for audio and image retrieval".

Measuring accuracy and time over LSH vector/groups size would be important as well. That would allow to find the optimal number of hashes that are required to get desired accuracy/time ratio.

Last but not least, in our benchmarks we've only used Morgan 2048-bit fingerprints. Fixed length fingerprints are already hashed and folded to achieve constant length and the price for this is worse fidelity. Using LSH on unfolded fingerprints may lead to better results; in addition it would also be interesting to compare different types of fingerprints as well as different lengths.



EDIT: you can now download the code from here.

To better illustrate our method, we provide this IPython notebook. As already explained, it's based on Matt Swain blog post but enhanced with methods described by Ciumac Sergiu in his article about audio fingerprinting.  This notebook will be included (along with all required dependencies) on myChEMBL 21 but if you have ChEMBL 20, you can load it as well but then, it's important to remember about installing MongoDB first. This can be done easily by ssh-ing to the machine and running:

sudo [apt-get|yum] install mongodb

and then start the mongod daemon:

sudo mongod --fork --logpath /var/log/mongod.log

Using the 'fork' option is very important and not including it may lead to worse performance. When running ChEMBL 20 as a virtual machine, it may happen that the size of collections created by MongoDB will exceed remaining disk space. This is why we suggest using ChEMBL 20 docker container on Linux box when possible because then, the only limiting factor is the size of physical disk space of the host machine.



greg landrum said…
Interesting stuff, and something I'd love to help explore further.

I'm quite surprised by the results with the PostgreSQL cartridge. I would expect the search time to decrease with increasing similarity threshold. I was able to reproduce the behavior though, so I will try to figure out what's going on (or why my expectations are wrong).

One concern I have with this: The results look great for high cutoffs (i.e. finding close neighbors), but one often wants to find compounds that are "somewhat similar" to the query. This requires searching with a fairly loose similarity cutoff. It looks like the LSH approach leads to an accuracy problem here.
kott said…
I agree this is kind of a trade between time and accuracy. The question is how much the accuracy can be improved by:

- selecting optimal permutations within a group
- using other fingerprint types/lengths
- using different data structures (range trees?) or increasing the number of buckets

and how would that affect time?
Matt said…
This is great - some impressive results.

One other thing that I tried since I wrote my original blog post is "sharding" the MongoDB collection. With the default setup, MongoDB only uses one CPU core to perform the query, but if you shard the collection, it will use a core for each shard. This can be a huge benefit for CPU intensive queries like this. You can run multiple shards on the same server, or split shards across multiple servers, or both! Sharding across multiple servers also means that you can easily scale up your RAM beyond what would be possible to have in a single machine. This is useful for huge databases (i.e. ~100m+ molecules) where you start to have trouble fitting the indexes in RAM on a single machine. Instead, each server only has to hold its own shards in memory, and you can scale by just adding more servers as your database grows.

Also, it's worth noting: for my original postgres benchmarks, I just compiled the RDKit Postgres cartridge with the default settings. However, since then I've noticed that some of the options in the makefile should probably be changed from their default to improve performance - in particular turning on 'USE_POPCOUNT' and possibly also 'USE_THREADS'. That would probably give a more fair comparison.
Abhik Seal said…
Hi ,

How about searching based on HAMMING SPACE . I see that one disadvantage of the search is requirement for a huge
amount of memory in order to perform random permutations.

Check this article .
kott said…
@Abik - I don't really understand why you say, that the method requires "huge amount of memory in order to perform random permutations". This is simply not true. Please run attached code and see memory usage.
kott said…
@Matt - thanks, I will try that. Another thing is you can define some map-reduce tasks using MongoDB. If we define a map step as "compound_docuemnt -> (tanimoto, SMILES)" and reduce as "discard tuple if tanimoto < T, otherwise add to result", then this scheme can be easily performed on multiple servers. This is currently out of scope of myChEMBL and this article but should speed up similarity search. Of course using pruning is still important and using LSH can make it even faster.

Another thing is that I just found that the actual error rate is much LOWER than published. This is because for each compound, I measured a set difference of results and stored cumulative size of differences over all compounds. But then I divided it by the number of compounds, which doesn't make much sense. I should divide it by cumulative set sum over all compounds, which can only be larger, which would make error rate lower.

As alternative measure, I could count number of compounds, which had any difference in results, regardless of how big this difference is and then divide it by the number of all compounds. And this obviously would also make error rate lower than published, assuming that I will get similar results.

I will recompute error rates and update article soon.
kott said…
Error charts are now updated.

Popular posts from this blog

PKIS data in ChEMBL

The Protein Kinase Inhibitor Set (PKIS) made available by GSK was recently mentioned on  In the Pipeline .  In collaboration with GSK, we are making the data being generated on these compounds available via  the ChEMBL database.  We are also creating a portal for the compound set, where the structures can be browsed and downloaded, direct links to the data are provided and useful information can be posted. A preliminary version is available  here : feedback would be appreciated. The data generated on the PKIS set and deposited in ChEMBL may be downloaded in CSV format here  (note that the Luciferase dataset described in the recent PLoS paper will be in the next release of ChEMBL). Alternatively, to view the data in the ChEMBL web interface, follow these steps: On the home page, enter 'GSK_PKIS' in the search box and click on the 'Assays' button... On the 'Please select...' menu on the right, choose 'Display Bioactivities'...

A python client for accessing ChEMBL web services

Motivation The CheMBL Web Services provide simple reliable programmatic access to the data stored in ChEMBL database. RESTful API approaches are quite easy to master in most languages but still require writing a few lines of code. Additionally, it can be a challenging task to write a nontrivial application using REST without any examples. These factors were the motivation for us to write a small client library for accessing web services from Python. Why Python? We choose this language because Python has become extremely popular (and still growing in use) in scientific applications; there are several Open Source chemical toolkits available in this language, and so the wealth of ChEMBL resources and functionality of those toolkits can be easily combined. Moreover, Python is a very web-friendly language and we wanted to show how easy complex resource acquisition can be expressed in Python. Reinventing the wheel? There are already some libraries providing access to ChEMBL d

Release of ChEMBL 33

We are pleased to announce the release of ChEMBL 33! This fresh release comes with a few new data soures and also some new features: we added bioactivity data for understudied SLC targets from the RESOLUTE project and included a flag for Natural Products and for Chemical Probes. An annotation for the ACTION_TYPE of a measurement was included for approx. 270 K bioactivities. We also time-stamped every document in ChEMBL with their CREATION_DATE! Have fun playing around with ChEMBL 33 over the summer and please send feedback via .   ChEMBL database version ChEMBL 33 release notes ___________________________________________ # This version of the database, prepared on 31/05/2023 contains:      2,399,743 compounds (of which 2,372,674 have mol files)      3,051,613 compound records (non-unique compounds)        20,334,684 activities         1,610,596 assays      15,398 targets      88,630 documents BioAssay Data Sources:    Number Assays:    Number

Chemistry and Nature

  As the Great Big Green Week (UK) draws to a close, so does EMBL-EBI’s own Sustainability week. The Wellcome Genome Campus held events in the areas of recycling, energy use, and biodiversity. The ChEMBL team was keen to get involved and we developed our own Nature Trail event highlighting some of the bioactive compounds from the flora and fauna found on-site, and elsewhere. Our favourite examples include the sensation of mint and chilli and the glorious smell of rain! The full Nature Trail can be made available for external Public Engagement events upon request . Databases, such as ChEMBL , are large stores of structured data, including genetic, biological, and chemistry data for life sciences research. Data on the natural world is often held by wildlife organisations; this can be used to research biodiversity and species decline. Various Citizen Science initiatives mean that everyone can get involved in submitting nature records. So why not join in with the Butterfly Conservation’s B

Accessing SureChEMBL data in bulk

It is the peak of the summer (at least in this hemisphere) and many of our readers/users will be on holiday, perhaps on an island enjoying the sea. Luckily, for the rest of us there is still the 'sea' of SureChEMBL data that awaits to be enjoyed and explored for hidden 'treasures' (let me know if I pushed this analogy too far). See here and  here for a reminder of SureChEMBL is and what it does.  This wealth of (big) data can be accessed via the SureChEMBL interface , where users can submit quite sophisticated and granular queries by combining: i) Lucene fields against full-text and bibliographic metadata and ii) advanced structure query features against the annotated compound corpus. Examples of such queries will be the topic of a future post. Once the search results are back, users can browse through and export the chemistry from the patent(s) of interest. In addition to this functionality, we've been receiving user requests for  local (behind the