ChEMBL Resources


Monday, 24 August 2015

Online Resources

We would like to announce the recent release of two online training resources for ChEMBL and UniChem.

For ChEMBL, we  have developed ‘ChEMBL: Exploring bioactive drug-like molecules’, which will walk you through how to use the interface, step-by-step. It tackles topics such as target searching, compound searching, web services and data downloads. The course also gives you a chance to test your knowledge throughout.

Additionally, Jon Chambers has created the 'UniChem: Quick Tour' course. This course will give users a basic understanding of UniChem and the benefits it can bring to navigating small molecule resources. It will also walk you through how to conduct simple searches using UniChem and the UniChem Connectivity Search feature.

I'd also like to remind you that we store recordings of our past webinars, in case you missed them. You can access these anytime and they can be found here:

If you have any questions or queries about anything mentioned here, please do not hesitate to contact 

Monday, 10 August 2015

Accessing SureChEMBL data in bulk

It is the peak of the summer (at least in this hemisphere) and many of our readers/users will be on holiday, perhaps on an island enjoying the sea. Luckily, for the rest of us there is still the 'sea' of SureChEMBL data that awaits to be enjoyed and explored for hidden 'treasures' (let me know if I pushed this analogy too far). See here and here for a reminder of SureChEMBL is and what it does. 

This wealth of (big) data can be accessed via the SureChEMBL interface, where users can submit quite sophisticated and granular queries by combining: i) Lucene fields against full-text and bibliographic metadata and ii) advanced structure query features against the annotated compound corpus. Examples of such queries will be the topic of a future post. Once the search results are back, users can browse through and export the chemistry from the patent(s) of interest. In addition to this functionality, we've been receiving user requests for local (behind the firewall), batch access of the SureChEMBL data almost since day one of releasing the interface last year. By such data, we mean not only the compound repository (which is available anyway) but also the comprehensive map of associations between compound and patents, i.e. which compound was extracted from which section of which patent, for all available compounds and patents. To address these requests, we currently provide two additional ways to access the SureChEMBL data:

1) The map file. This is a flat file including compound, patent and association information. The back file contains all associations for eligible patent documents published between 1960 and 2014. Since then, two more incremental update files have been available covering the first 2 quarters of 2015. More info on the structure of the file is found here and all files are available here. We will continue to add incremental updates quarterly. 

2) The data client. This is a script that firstly creates a relational database in the user's environment and then populates the database with SureChEMBL data it retrieves from our private ftp server on a nightly basis. In addition to the front file (the patent and compound info that arrives daily), the script can also fetch the back file for each past year and back-fill the database accordingly. In the end, the data client script effectively creates a local and live snapshot of the SureChEMBL database, which allows for further integration, querying, filtering, deployment, etc., on the users' side. More info on the details of the client can be found here and documentation on how to set the client up is available here. Note that the script will need credentials to access our private ftp server, which we are happy to provide if you ask. It is also worth mentioning that the data client provides more information in a more timely fashion than the map file (daily vs. quarterly) but it requires considerably more effort and resources to set it up, compared to reading a flat file. 

As usual, this is not all - there is soon going to be an additional way to batch-access the SureChEMBL data. As announced recently, an enhanced version of SureChEMBL data (chemical AND biological annotations from patents), will be semantically integrated in the OpenPHACTS system and will be part of the freely available OpenPHACTS API. This will be available later this year.


Tuesday, 4 August 2015

LSH-based similarity search in MongoDB is faster than postgres cartridge.


In his excellent blog post, Matt Swain described the implementation of compound similarity searches in MongoDB. Unfortunately, Matt's approach had suboptimal (polynomial) time complexity with respect to decreasing similarity thresholds, which renders unsuitable for production environments. In this article, we improve on the method by enhancing it with Locality Sensitive Hashing algorithm, which significantly reduces query time and outperforms RDKit PostgreSQL cartridge.

myChEMBL 21 - NoSQL edition 


Given that NoSQL technologies applied to computational chemistry and cheminformatics are gaining traction and popularity, we decided to include a taster in future myChEMBL releases. Two especially appealing technologies are Neo4j and MongoDB. The former is a graph database and the latter is a BSON document storage. We would like to provide IPython notebook-based tutorials explaining how to use this software to deal with common cheminformatics problems.

Matt Swain's article "Chemical similarity search in MongoDB" seemed to be an ideal candidate for converting it to IPython notebook and shipping with the next myChEMBL release, so we've started working on it. Matt mentions other researchers, who have worked on this topic already: Davy Suvee, whose series (1, 2, 3) of blog posts defined the basic algorithm, without providing any statistics about performance and accuracy and Rajarshi Guha, who attempted to verify time complexity using one million compound sample from ChEMBL 17 and was disappointed with the performance of a median query time of 6332 ms with the similarity threshold set to 0.9.

General algorithm:

The general algorithm developed by Suvee, verified by Guha and improved by Swain looks as follows:

1. For each compound, compute a fingerprint(s) and store the compound identifier along with the fingerprint in MongoDB collection (precomputing step).

2. For each fingerprint generated in step 1, store it along with the number of occurrences and its length in separate MongoDB collection (precomputing step)

3. For the query compound, compute the fingerprint and its length Lq.

4. Apply pruning rules, originally described by Swamidass and Baldi in their paper:

    a) Consider only fingerprints with length L in range:
    T * Lq <= L <= Lq / T 

       where T is a threshold

    b) Consider:

        Lq − T * Lq + 1

        most differentiating fingerprint bits to check if at least one is present in searched compounds.

5. For all compounds which successfully passed pruning in step 4, compute the Tanimoto coefficient and return only those with a similarity score higher than the given threshold. 

Time complexity:

After benchmarking his solution with 1000 random compounds from ChEMBL 18,  Matt got following results:

While not identical with the results we obtained (we got higher absolute values because we had SMILES on input, as opposed to RDKit objects in Matt's code, we measured the mean query time by recording the time needed to complete 1k random queries against ChEMBL 20 using a docker container running on tiny Gigabyte BRIX machine with a Intel Core i7-3537U 2GHz CPU, 16GB RAM and 256GB SSD with Ubuntu 14.04 as host OS), the shape of the curve is quite similar and indicates polynomial time complexity with respect to decreasing threshold.

This is easy to understand as the function f(T) = [(Lq/T) - TLq](Lq-TLq + 1) can be reduced to a polynomial of the fourth degree. The lower is threshold the more compounds pass pruning and it takes more time to compute Tanimoto coefficient. Assuming optimistically that Matt's results can be easily achieved on any machine/computing environment, then searching 1k compounds with the threshold of 0.7 would take more than 16 minutes (and usually the query is provided as SMILES string, not a binary RDKit object, which requires expensive time consuming conversion), which can be too long for production environments. Moreover, there is a faster alternative, namely  RDKit cartridge, which is (according to Matt's results) faster for thresholds lower than 0.85 and a bit slower for higher thresholds.

Introducing LSH

After we realised, that the time complexity is not optimal (mostly by sitting and staring at monitor for ages waiting for results), we decided to look for ways to improve the process. Some more time spent on researching this topic led us to the conclusion that using a technique called 'Locality Sensitive Hashing' (LSH) can help us solving the problem.

The idea behind LSH is pretty simple - while most hashes we use everyday (cryptographic hashes like SHA) generate completely different values for documents, even when they differ only slightly, LSH tries to generate similar hash values for similar documents in order to maximize the probability of hash collision.

We've managed to find only two research papers, directly describing the application of LSH to chemical similarity. The first one, by Dutta, Guha et al. is quite old (2006) and was published 3 whole years before the initial release of MongoDB. It looks at a bit different problems, such as outlier detection and clustering and uses LSH implementation written in C++ by Andoni and Indyk - authors of the authoritative paper in this field: Near-Optimal Hashing Algorithms for Approximate Nearest Neighbor in High Dimensions. The second paper by Nasr, Baldi et al. exactly describes our problem but is highly theoretical and doesn't provide any code or hints for practical implementation.

Comparison with audio fingerprinting

Knowing that our problem is quite analogous to finding songs similar to the one played currently in the radio, we searched for applications of LSH to audio fingerprinting and stumbled upon this article by Ciumac Sergiu. It describes an algorithm for detecting duplicate songs using MinHash (min-wise independent permutations LSH scheme). Based on this article, we've created an improved compound similarity search algorithm, which as the following steps:

1. For each compound, compute a fingerprint(s) and store the compound identifier along with the fingerprint in MongoDB collection (precomputing step).

2. Generate 100 random permutations of length equal to the length of a single fingerprint and save them as a separate MongoDB collection.

3. For each fingerprint:
     For each permutation generated in step 2, reorder the fingerprint bits according to the permutation and store the position of the first '1' bit.
After this is completed each fingerprint will have a corresponding list of 100 positions of the first '1' according to 100 permutations, store this information along with the fingerprint.

4. For each list generated in step 3, divide this list into 25 chunks, 4 numbers in each chunk. Using left shift (<<) binary operation convert four numbers in each chunk into single 64-bit number. After this is done each fingerprint will be reduced to the vector of 25 integers, store this vector along with the fingerprint.

5. Create 25 different MongoDB collections. For each compound, get the corresponding 64-bit integer vector. For each number in the vector check the respective collection for the document identified by this number. If the document exists, extend it with compound identifier. If it doesn't exist create a new document identified by the number and containing single compound identifier.

6. For the query compound use the method described in steps 3 and 4 to generate a vector of 25 64-bit integer numbers. For each number check the respective collection for the document identified by this number. It it exists take compounds assigned to this document. Get the distinct set of all compounds found by this process for all 25 numbers. On this set, apply pruning procedure from the original algorithm and compute Tanimoto coefficient to all compound that successfully passed pruning. Return only those with a similarity score higher than the given threshold. 

Comparing results

After implementing this method in Python and calling it from IPython notebook we've benchmarked it in the same way as we did with original algorithm (mean time of 5 repetitions of searching 1k random sample against 1455713 compounds from ChEMBL 20 for each discrete threshold value). In the same way we measured RDKit postgreSQL cartridge performance and got these results:

As you can see, LSH-based method is always faster than PostgreSQL cartridge and apart from 0.95 point, faster than Swain's implementation. Again, you may wonder why all the numbers are higher than Matt's results. Well, the most important difference between our and Matt's implementation is that our software expects SMILES as an input whereas Matt's stuff expects RDKit mol. This means that our methods have to perform one more expensive step during each iteration. We've decided to implement our functions like this because this is what they should do in practice. Another (but less significant) factor is that we were running our calculations in completely different environment (hardware, OS, presence/absence of virtualisation, ChEMBL version, RDKit version).

What about accuracy?

EDIT: we've updated error plots due to the bug in code (explained in comments), the actual error values are lower then we said previously.

Shorter timings is one thing but are we getting exactly the same results? Since our hashing function is only an approximation, we should expect some discrepancies. To measure the number of discrepancies, we selected random sample of 1k compounds and compared results from Swain's and our implementation. For each pair of results, we recorded the size of symmetric set difference between results int the pair and divided them by the size of sum (union) of both result sets. We repeated this for different threshold values, results (in percent) are presented below:

As an alternative error measure, we counted number of compounds, for which we observed any discrepancy in results (regardless of its size) and divided by the number of compounds in sample (1k). Again, results in percent are shown below:

As can be noticed, for thresholds greater or equal to 0.8 the results are almost the same. For threshold 0.7, less than 6% of query compounds has different results. As we checked, the result of LSH similarity is always a subset of original method. This number alone is probably not very high but we decided to further investigate this. At first we thought that since without any pruning and filtering by Tanimoto coefficient, pure LSH procedure returns on average 3.5k 'similar' compounds for given query compound, then for low thresholds standard algorithm will yield more results and this is why we observe differences. This wouldn't be very bad because usually when asking for similar compounds we don't expect to deal with thousands of them. But it turned out that this situation happens very rarely, especially for Morgan fingerprints. To get better idea of what really happens let's take some example query compound that generates different results.

For input SMILES:


We've got two two results using standard algorithm and one using LSH. Those where CHEMBL3085935 and CHEMBL3085944, their SMILES are:


The difference is marked in red. Let's look at their vector of 25 hashes:

[644282988592, 77607305247, 610086746128, 412446916610, 266355085320, 137447358470,...][25807697968, 77607305263, 610086778896, 412514025474, 403794038792, 137447358509,...]

Notice, that some of them have very similar values (thanks to LSH). This means that the algorithm can be further improved by increasing number of chunks (instead of dividing 100 integers into 25 groups of 4, we could split them into 50 groups of 2 or even skip diving them at all) or use a different data structure, that instead of exactly matching hash string would first give a neighbors within some radius and return compounds from this neighborhood. Of course there would be some performance penalty but we believe it's possible to make the discrepancy chart completely flat in the range of thresholds greater than 0.7 and still have better performance than original method.

Further improvements/research

There are number of things mentioned in this article that could be better explored. First of all it should be noted, that we are not really comparing MongoDB and PostgreSQL as competing database engines. This is because both engines are using different techniques to implement compound similarity search. RDKit Postgres cartridge is not using LSH but it should be relatively easy to add code implementing it to RDKit library or as a separate piece of software called from the cartridge. On the other hand MongoDB support any extensions, executed on the MongoDB server, so all computations have to be done on the client side. That said, there is a very interesting project called ToroDB - MongoDB drop-in replacement, that runs on top of PostgreSQL. So a possible immediate next step would be to run the same compound searching benchmarks using ToroDB, which should give us some idea of how PostgreSQL and MongoDB perform executing the same NoSQL tasks.

Generating permutations for LSH is another interesting topic. In our code, we used the standard numpy random function generating uniformly distributed numbers. It would be interesting to check if using different probability distributions can give better results. The audio fingerprinting article, on which we based our code mentions a paper by Baluja et al. titled "Permutation grouping: intelligent hash function design for audio and image retrieval".

Measuring accuracy and time over LSH vector/groups size would be important as well. That would allow to find the optimal number of hashes that are required to get desired accuracy/time ratio.

Last but not least, in our benchmarks we've only used Morgan 2048-bit fingerprints. Fixed length fingerprints are already hashed and folded to achieve constant length and the price for this is worse fidelity. Using LSH on unfolded fingerprints may lead to better results; in addition it would also be interesting to compare different types of fingerprints as well as different lengths.



EDIT: you can now download the code from here.

To better illustrate our method, we provide this IPython notebook. As already explained, it's based on Matt Swain blog post but enhanced with methods described by Ciumac Sergiu in his article about audio fingerprinting.  This notebook will be included (along with all required dependencies) on myChEMBL 21 but if you have ChEMBL 20, you can load it as well but then, it's important to remember about installing MongoDB first. This can be done easily by ssh-ing to the machine and running:

sudo [apt-get|yum] install mongodb

and then start the mongod daemon:

sudo mongod --fork --logpath /var/log/mongod.log

Using the 'fork' option is very important and not including it may lead to worse performance. When running ChEMBL 20 as a virtual machine, it may happen that the size of collections created by MongoDB will exceed remaining disk space. This is why we suggest using ChEMBL 20 docker container on Linux box when possible because then, the only limiting factor is the size of physical disk space of the host machine.