
LSH-based similarity search in MongoDB is faster than the Postgres cartridge.


TL;DR:

In his excellent blog post, Matt Swain described an implementation of compound similarity searches in MongoDB. Unfortunately, Matt's approach has suboptimal (polynomial) time complexity with respect to decreasing similarity thresholds, which renders it unsuitable for production environments. In this article, we improve on the method by enhancing it with a Locality Sensitive Hashing (LSH) algorithm, which significantly reduces query time and outperforms the RDKit PostgreSQL cartridge.


myChEMBL 21 - NoSQL edition 

 

Given that NoSQL technologies applied to computational chemistry and cheminformatics are gaining traction and popularity, we decided to include a taster in future myChEMBL releases. Two especially appealing technologies are Neo4j and MongoDB. The former is a graph database and the latter is a BSON document store. We would like to provide IPython notebook-based tutorials explaining how to use this software to deal with common cheminformatics problems.

Matt Swain's article "Chemical similarity search in MongoDB" seemed to be an ideal candidate for converting to an IPython notebook and shipping with the next myChEMBL release, so we started working on it. Matt mentions other researchers who have already worked on this topic: Davy Suvee, whose series of blog posts (1, 2, 3) defined the basic algorithm without providing any statistics about performance or accuracy, and Rajarshi Guha, who attempted to verify the time complexity using a one-million-compound sample from ChEMBL 17 and was disappointed with the performance: a median query time of 6332 ms at a similarity threshold of 0.9.

General algorithm:


The general algorithm developed by Suvee, verified by Guha and improved by Swain looks as follows:

1. For each compound, compute a fingerprint(s) and store the compound identifier along with the fingerprint in a MongoDB collection (precomputing step).

2. For each fingerprint generated in step 1, store it along with the number of occurrences and its length in a separate MongoDB collection (precomputing step).

3. For the query compound, compute the fingerprint and its length Lq.

4. Apply pruning rules, originally described by Swamidass and Baldi in their paper:

    a) Consider only fingerprints with length L in the range:

       T * Lq <= L <= Lq / T

       where T is the similarity threshold.

    b) Consider only the

       Lq - T * Lq + 1

       rarest (most differentiating) query fingerprint bits and require that at least one of them is present in the searched compounds.

5. For all compounds which successfully passed pruning in step 4, compute the Tanimoto coefficient and return only those with a similarity score higher than the given threshold. 
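
To make steps 3-5 concrete, here is a minimal sketch in Python using RDKit and pymongo. The collection and field names ('molecules', 'counts', 'chembl_id', 'fp_bits', 'fp_count', 'count') are assumptions for illustration and may differ from the layout used in Matt's post and in our notebook:

# Sketch of steps 3-5; collection/field names are assumed, not the notebook's exact schema.
from pymongo import MongoClient
from rdkit import Chem
from rdkit.Chem import AllChem

db = MongoClient().chembl  # local MongoDB holding the precomputed collections

def similarity_search(smiles, threshold=0.8):
    mol = Chem.MolFromSmiles(smiles)
    qfp = set(AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048).GetOnBits())
    qn = len(qfp)                       # Lq, the query fingerprint length (step 3)
    fp_min = int(threshold * qn)        # step 4a: T * Lq <= L <= Lq / T
    fp_max = int(qn / threshold)
    ncommon = qn - fp_min + 1           # step 4b: number of rarest query bits to require
    rarest = [c['_id'] for c in
              db.counts.find({'_id': {'$in': list(qfp)}}).sort('count', 1).limit(ncommon)]
    results = []
    for doc in db.molecules.find({'fp_count': {'$gte': fp_min, '$lte': fp_max},
                                  'fp_bits': {'$in': rarest}}):
        common = len(qfp & set(doc['fp_bits']))            # step 5: exact Tanimoto
        tanimoto = float(common) / (qn + doc['fp_count'] - common)
        if tanimoto >= threshold:
            results.append((doc['chembl_id'], tanimoto))
    return sorted(results, key=lambda x: -x[1])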

Time complexity:


After benchmarking his solution with 1000 random compounds from ChEMBL 18, Matt got the following results:





While our results are not identical (we got higher absolute values because we took SMILES as input, as opposed to the RDKit objects used in Matt's code; we measured the mean query time by recording the time needed to complete 1k random queries against ChEMBL 20, using a Docker container running on a tiny Gigabyte BRIX machine with an Intel Core i7-3537U 2 GHz CPU, 16 GB RAM and a 256 GB SSD, with Ubuntu 14.04 as the host OS), the shape of the curve is quite similar and indicates polynomial time complexity with respect to decreasing threshold.

This is easy to understand, as the function f(T) = [(Lq / T) - T * Lq] * (Lq - T * Lq + 1) can be reduced to a polynomial of the fourth degree. The lower the threshold, the more compounds pass pruning and the more time it takes to compute Tanimoto coefficients. Assuming optimistically that Matt's results can be easily reproduced on any machine or computing environment, searching 1k compounds with a threshold of 0.7 would take more than 16 minutes (and usually the query is provided as a SMILES string, not a binary RDKit object, which requires an expensive, time-consuming conversion), which can be too long for production environments. Moreover, there is a faster alternative, namely the RDKit cartridge, which (according to Matt's results) is faster for thresholds lower than 0.85 and a bit slower for higher thresholds.


Introducing LSH


After we realised that the time complexity was not optimal (mostly by sitting and staring at the monitor for ages, waiting for results), we decided to look for ways to improve the process. Some more time spent researching this topic led us to the conclusion that a technique called 'Locality Sensitive Hashing' (LSH) could help us solve the problem.

The idea behind LSH is pretty simple - while most hashes we use every day (cryptographic hashes like SHA) generate completely different values even for documents that differ only slightly, LSH tries to generate similar hash values for similar documents in order to maximize the probability of a hash collision.
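
As a toy illustration (not the implementation used later in this article, which permutes actual fingerprint bits), the snippet below contrasts a cryptographic hash with a MinHash estimate built from random linear hash functions over character 3-grams; all names here are made up for the example:

# Toy contrast between a cryptographic hash and MinHash for two similar inputs.
import hashlib
import numpy as np

a = "Clc1ccc(Cn2c(nc3ccccc23)C(=O)N4CCC(CC4)C(=O)NCCc5ccncc5)cc1"
b = a.replace("Cl", "Br")   # a tiny change to the input

# SHA-256 digests of the two strings look completely unrelated.
print(hashlib.sha256(a.encode()).hexdigest()[:16])
print(hashlib.sha256(b.encode()).hexdigest()[:16])

def shingles(s, n=3):
    # The set of character n-grams plays the role of a 'document'.
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def minhash(items, n_hashes=100, seed=42):
    # Approximate random permutations with random linear hash functions.
    rng = np.random.RandomState(seed)
    coeffs = rng.randint(1, 2**31 - 1, size=(n_hashes, 2))
    values = np.array([hash(x) & 0x7FFFFFFF for x in items])
    return np.min((coeffs[:, :1] * values + coeffs[:, 1:]) % (2**31 - 1), axis=1)

sig_a, sig_b = minhash(shingles(a)), minhash(shingles(b))
# The fraction of matching signature positions estimates the Jaccard similarity.
print((sig_a == sig_b).mean())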

We've managed to find only two research papers directly describing the application of LSH to chemical similarity. The first, by Dutta, Guha et al., is quite old (2006) and was published three years before the initial release of MongoDB. It looks at somewhat different problems, such as outlier detection and clustering, and uses an LSH implementation written in C++ by Andoni and Indyk, authors of the authoritative paper in this field: "Near-Optimal Hashing Algorithms for Approximate Nearest Neighbor in High Dimensions". The second paper, by Nasr, Baldi et al., describes exactly our problem but is highly theoretical and doesn't provide any code or hints for a practical implementation.

Comparison with audio fingerprinting


Knowing that our problem is quite analogous to finding songs similar to the one currently playing on the radio, we searched for applications of LSH to audio fingerprinting and stumbled upon this article by Ciumac Sergiu. It describes an algorithm for detecting duplicate songs using MinHash (a min-wise independent permutations LSH scheme). Based on this article, we've created an improved compound similarity search algorithm, which has the following steps:

1. For each compound, compute a fingerprint(s) and store the compound identifier along with the fingerprint in a MongoDB collection (precomputing step).

2. Generate 100 random permutations of length equal to the length of a single fingerprint and save them in a separate MongoDB collection.

3. For each fingerprint, and for each permutation generated in step 2, reorder the fingerprint bits according to the permutation and record the position of the first '1' bit. Once this is completed, each fingerprint has a corresponding list of 100 first-'1' positions (one per permutation); store this list along with the fingerprint.

4. For each list generated in step 3, divide the list into 25 chunks of 4 numbers each. Using the binary left-shift operation (<<), combine the four numbers in each chunk into a single 64-bit number. After this is done, each fingerprint is reduced to a vector of 25 integers; store this vector along with the fingerprint.

5. Create 25 different MongoDB collections. For each compound, take the corresponding vector of 25 64-bit integers. For each number in the vector, check the respective collection for a document identified by this number. If the document exists, extend it with the compound identifier; if it doesn't exist, create a new document identified by the number and containing a single compound identifier.

6. For the query compound, use the method described in steps 3 and 4 to generate a vector of 25 64-bit integers. For each number, check the respective collection for a document identified by this number. If it exists, take the compounds assigned to this document. Take the distinct set of all compounds found by this process across all 25 numbers, apply the pruning procedure from the original algorithm to this set, and compute the Tanimoto coefficient for all compounds that successfully pass pruning. Return only those with a similarity score higher than the given threshold.
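
A minimal sketch of steps 2-4 (random permutations, first-'1' positions and folding into 25 64-bit bucket keys) is shown below. The 16-bit shift per position and the variable names are our own choices for illustration; the notebook's exact packing may differ:

# Sketch of steps 2-4: MinHash signature folded into 25 64-bit bucket keys.
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem

FP_LEN, N_PERMUTATIONS, CHUNK = 2048, 100, 4

# Step 2: 100 random permutations of the bit positions (generated once and stored).
rng = np.random.RandomState(42)
permutations = [rng.permutation(FP_LEN) for _ in range(N_PERMUTATIONS)]

def first_one_positions(on_bit_ids):
    # Step 3: for each permutation, the position of the first '1' bit after reordering.
    on_bits = np.zeros(FP_LEN, dtype=bool)
    on_bits[list(on_bit_ids)] = True
    return [int(np.argmax(on_bits[p])) for p in permutations]

def fold_to_hashes(positions):
    # Step 4: pack each chunk of 4 positions into one 64-bit integer using left shifts.
    hashes = []
    for i in range(0, N_PERMUTATIONS, CHUNK):
        value = 0
        for pos in positions[i:i + CHUNK]:
            value = (value << 16) | pos   # 16 bits per position is enough for a 2048-bit fingerprint
        hashes.append(value)
    return hashes                         # 25 integers, one per LSH bucket collection (step 5)

smiles = "Clc1ccc(Cn2c(nc3ccccc23)C(=O)N4CCC(CC4)C(=O)NCCc5ccncc5)cc1"
fp = AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(smiles), 2, nBits=FP_LEN)
keys = fold_to_hashes(first_one_positions(fp.GetOnBits()))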
 

Comparing results


After implementing this method in Python and calling it from an IPython notebook, we benchmarked it in the same way as the original algorithm (mean time of 5 repetitions of searching a 1k random sample against 1,455,713 compounds from ChEMBL 20, for each discrete threshold value). In the same way we measured the RDKit PostgreSQL cartridge performance and got these results:



As you can see, the LSH-based method is always faster than the PostgreSQL cartridge and, apart from the 0.95 point, faster than Swain's implementation. Again, you may wonder why all the numbers are higher than Matt's results. The most important difference between our implementation and Matt's is that our software expects SMILES as input whereas Matt's expects an RDKit mol. This means that our methods have to perform one more expensive step during each iteration. We decided to implement our functions this way because this is what they have to do in practice. Another (but less significant) factor is that we ran our calculations in a completely different environment (hardware, OS, presence/absence of virtualisation, ChEMBL version, RDKit version).
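
For reference, the extra per-query step is simply the SMILES parsing and fingerprint generation, roughly as sketched below (an illustration, not the notebook's exact code):

# The additional per-query work when the input is a SMILES string instead of an RDKit mol.
from rdkit import Chem
from rdkit.Chem import AllChem

def query_fingerprint(smiles, radius=2, n_bits=2048):
    mol = Chem.MolFromSmiles(smiles)   # the parsing step Matt's benchmark does not need
    if mol is None:
        raise ValueError("Could not parse SMILES: %s" % smiles)
    return AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)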

What about accuracy?


EDIT: we've updated the error plots due to a bug in the code (explained in the comments); the actual error values are lower than we previously stated.

Shorter timings are one thing, but are we getting exactly the same results? Since our hashing function is only an approximation, we should expect some discrepancies. To measure them, we selected a random sample of 1k compounds and compared the results from Swain's implementation and ours. For each pair of result sets, we recorded the size of the symmetric set difference and divided it by the size of the union of both result sets. We repeated this for different threshold values; the results (in percent) are presented below:




As an alternative error measure, we counted the number of compounds for which we observed any discrepancy in results (regardless of its size) and divided it by the number of compounds in the sample (1k). Again, the results in percent are shown below (a short sketch of both measures in code follows the chart):



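Both error measures are straightforward to compute; a minimal sketch, assuming reference_results and lsh_results are lists of sets of ChEMBL IDs (one pair of sets per query compound):

# Sketch of the two error measures over paired result sets.
def relative_difference(reference_results, lsh_results):
    # Cumulative symmetric difference divided by cumulative union, in percent.
    diff = sum(len(a ^ b) for a, b in zip(reference_results, lsh_results))
    union = sum(len(a | b) for a, b in zip(reference_results, lsh_results))
    return 100.0 * diff / union if union else 0.0

def affected_queries(reference_results, lsh_results):
    # Percentage of query compounds with any discrepancy, regardless of its size.
    affected = sum(1 for a, b in zip(reference_results, lsh_results) if a != b)
    return 100.0 * affected / len(reference_results)
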
As can be noticed, for thresholds greater than or equal to 0.8 the results are almost the same. For a threshold of 0.7, less than 6% of query compounds have different results. As we checked, the result of the LSH method is always a subset of the original method's result. This number alone is probably not very high, but we decided to investigate further. At first we thought that, since the pure LSH procedure (without any pruning or filtering by Tanimoto coefficient) returns on average 3.5k 'similar' compounds for a given query compound, the standard algorithm would simply yield more results at low thresholds, and this is why we observe differences. This wouldn't be very bad because, when asking for similar compounds, we usually don't expect to deal with thousands of them. But it turned out that this situation happens very rarely, especially for Morgan fingerprints. To get a better idea of what really happens, let's take an example query compound that generates different results.

For input SMILES:

Clc1ccc(Cn2c(nc3ccccc23)C(=O)N4CCC(CC4)C(=O)NCCc5ccncc5)cc1

We got two results using the standard algorithm and one using LSH. These were CHEMBL3085935 and CHEMBL3085944; their SMILES are:

O=C(NCCc1ccncc1)C1CCN(C(=O)c2nc3ccccc3n2Cc2ccc(Cl)cc2)CC1
O=C(NCCc1ccncc1)C1CCN(C(=O)c2cc3ccccc3n2Cc2ccc(Cl)cc2)CC1

The only difference between them is a single aromatic ring atom (n vs. c). Let's look at their vectors of 25 hashes:

[644282988592, 77607305247, 610086746128, 412446916610, 266355085320, 137447358470, ...]
[25807697968, 77607305263, 610086778896, 412514025474, 403794038792, 137447358509, ...]

Notice that some of them have very similar values (thanks to LSH). This means that the algorithm could be further improved by increasing the number of chunks (instead of dividing 100 integers into 25 groups of 4, we could split them into 50 groups of 2, or skip dividing them altogether) or by using a different data structure which, instead of requiring an exact hash match, would first find neighbours within some radius and return compounds from that neighbourhood. Of course there would be some performance penalty, but we believe it's possible to make the discrepancy chart completely flat in the range of thresholds above 0.7 and still outperform the original method.


Further improvements/research


There are a number of things mentioned in this article that could be explored further. First of all, it should be noted that we are not really comparing MongoDB and PostgreSQL as competing database engines, because the two engines use different techniques to implement compound similarity search. The RDKit Postgres cartridge does not use LSH, but it should be relatively easy to add code implementing it to the RDKit library or as a separate piece of software called from the cartridge. On the other hand, MongoDB doesn't support extensions executed on the MongoDB server, so all computations have to be done on the client side. That said, there is a very interesting project called ToroDB, a MongoDB drop-in replacement that runs on top of PostgreSQL. So a possible immediate next step would be to run the same compound searching benchmarks using ToroDB, which should give us some idea of how PostgreSQL and MongoDB perform when executing the same NoSQL tasks.

Generating permutations for LSH is another interesting topic. In our code, we used the standard numpy random function, which generates uniformly distributed numbers. It would be interesting to check whether different probability distributions give better results. The audio fingerprinting article on which we based our code mentions a paper by Baluja et al. titled "Permutation grouping: intelligent hash function design for audio and image retrieval".

Measuring accuracy and time as a function of LSH vector/group size would be important as well. That would allow finding the optimal number of hashes required to achieve a desired accuracy/time ratio.

Last but not least, in our benchmarks we've only used 2048-bit Morgan fingerprints. Fixed-length fingerprints are already hashed and folded to achieve a constant length, and the price for this is reduced fidelity. Using LSH on unfolded fingerprints may lead to better results; it would also be interesting to compare different fingerprint types as well as different lengths.

Code

 

EDIT: you can now download the code from here.

To better illustrate our method, we provide this IPython notebook. As already explained, it's based on Matt Swain's blog post but enhanced with the methods described by Ciumac Sergiu in his article about audio fingerprinting. This notebook will be included (along with all required dependencies) in myChEMBL 21, but if you have myChEMBL 20 you can load it as well; just remember to install MongoDB first. This can be done easily by ssh-ing into the machine and running:

sudo [apt-get|yum] install mongodb

and then start the mongod daemon:

sudo mongod --fork --logpath /var/log/mongod.log

Using the '--fork' option is very important; not including it may lead to worse performance. When running myChEMBL 20 as a virtual machine, it may happen that the size of the collections created by MongoDB exceeds the remaining disk space. This is why we suggest using the myChEMBL 20 Docker container on a Linux box when possible, because then the only limiting factor is the physical disk space of the host machine.

Michał

Comments

greg landrum said…
Interesting stuff, and something I'd love to help explore further.

I'm quite surprised by the results with the PostgreSQL cartridge. I would expect the search time to decrease with increasing similarity threshold. I was able to reproduce the behavior though, so I will try to figure out what's going on (or why my expectations are wrong).

One concern I have with this: The results look great for high cutoffs (i.e. finding close neighbors), but one often wants to find compounds that are "somewhat similar" to the query. This requires searching with a fairly loose similarity cutoff. It looks like the LSH approach leads to an accuracy problem here.
kott said…
I agree this is kind of a trade between time and accuracy. The question is how much the accuracy can be improved by:

- selecting optimal permutations within a group
- using other fingerprint types/lengths
- using different data structures (range trees?) or increasing the number of buckets

and how would that affect time?
Matt said…
This is great - some impressive results.

One other thing that I tried since I wrote my original blog post is "sharding" the MongoDB collection. With the default setup, MongoDB only uses one CPU core to perform the query, but if you shard the collection, it will use a core for each shard. This can be a huge benefit for CPU intensive queries like this. You can run multiple shards on the same server, or split shards across multiple servers, or both! Sharding across multiple servers also means that you can easily scale up your RAM beyond what would be possible to have in a single machine. This is useful for huge databases (i.e. ~100m+ molecules) where you start to have trouble fitting the indexes in RAM on a single machine. Instead, each server only has to hold its own shards in memory, and you can scale by just adding more servers as your database grows.

Also, it's worth noting: for my original postgres benchmarks, I just compiled the RDKit Postgres cartridge with the default settings. However, since then I've noticed that some of the options in the makefile should probably be changed from their default to improve performance - in particular turning on 'USE_POPCOUNT' and possibly also 'USE_THREADS'. That would probably give a more fair comparison.
Abhik Seal said…
Hi ,

How about searching based on HAMMING SPACE? I see that one disadvantage of the search is the requirement for a huge amount of memory in order to perform random permutations.

Check this article .
http://onlinelibrary.wiley.com/doi/10.1002/ecj.11561/abstract
kott said…
@Abhik - I don't really understand why you say that the method requires a "huge amount of memory in order to perform random permutations". This is simply not true. Please run the attached code and see the memory usage.
kott said…
@Matt - thanks, I will try that. Another thing is that you can define some map-reduce tasks using MongoDB. If we define a map step as "compound_document -> (tanimoto, SMILES)" and a reduce step as "discard tuple if tanimoto < T, otherwise add to result", then this scheme can easily be performed on multiple servers. This is currently out of scope of myChEMBL and this article but should speed up similarity search. Of course using pruning is still important and using LSH can make it even faster.

Another thing is that I just found that the actual error rate is much LOWER than published. This is because for each compound, I measured a set difference of results and stored cumulative size of differences over all compounds. But then I divided it by the number of compounds, which doesn't make much sense. I should divide it by cumulative set sum over all compounds, which can only be larger, which would make error rate lower.

As alternative measure, I could count number of compounds, which had any difference in results, regardless of how big this difference is and then divide it by the number of all compounds. And this obviously would also make error rate lower than published, assuming that I will get similar results.

I will recompute error rates and update article soon.
kott said…
Error charts are now updated.
