Thursday, 24 January 2019

FPSim2, a simple Python3 molecular similarity tool

FPSim2 is a new tool for fast similarity searches on big compound datasets (>100 million compounds), currently being developed at ChEMBL. We started developing it because we needed a Python3 library able to run fast similarity searches, either in memory or out-of-core, on datasets of that size.
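To make "similarity search" concrete, here is a minimal pure-Python sketch of a threshold search using the Tanimoto coefficient on binary fingerprints. It is an illustration only, not FPSim2's code or API: the function names, the toy 8-bit fingerprints, the use of Tanimoto as the metric and the choice of storing fingerprints as Python ints are all assumptions made for this example.

```python
def popcount(x: int) -> int:
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")


def tanimoto(a: int, b: int) -> float:
    """Tanimoto coefficient between two fingerprints stored as Python ints."""
    union = popcount(a | b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return popcount(a & b) / union


def similarity_search(query, fingerprints, threshold):
    """Return (id, score) pairs with score >= threshold, best first."""
    hits = []
    for mol_id, fp in fingerprints:
        score = tanimoto(query, fp)
        if score >= threshold:
            hits.append((mol_id, score))
    hits.sort(key=lambda hit: hit[1], reverse=True)
    return hits


# Toy 8-bit fingerprints (hypothetical data); real fingerprints are much longer.
toy_db = [
    ("mol_a", 0b10110010),
    ("mol_b", 0b10110011),
    ("mol_c", 0b01001100),
]
toy_hits = similarity_search(0b10110010, toy_db, threshold=0.7)
print(toy_hits)
```

A linear scan like this is the baseline that a dedicated tool improves on with compiled code and a compact storage format.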

It's fully written in Python/Cython and features:

Source code is available on GitHub, and Conda packages are available for both macOS and Linux. To install it, type:

conda install rdkit -c rdkit 
conda install fpsim2 -c efelix

Try it with Docker (much better performance than Binder):

  • docker pull eloyfelix/fpsim2
  • docker run -p 9999:9999 eloyfelix/fpsim2
  • open http://localhost:9999/notebooks/demo.ipynb in a browser

Or, if you prefer to try it without installing anything (yet), click on the Binder image!

Data files used in the demos are also available to download.

I would also like to thank Andrew Dalke and Greg Landrum for their blogs; they have been very useful resources!



George Papadatos said...

very cool! how does it compare to chemfp?

Eloy said...


I can only compare it to chemfp version 1.5, which is the open-source one.

FPSim2 is Python3 compatible, can use multiple threads in a single query and has a fast-loading compressed file format.
SMILES, InChI and molfiles can be used as input for a search, although this comes at a cost.
FPSim2 can also run searches without loading all FPs in memory at once. This enables a Raspberry Pi to run UniChem (>150 million compounds) similarity searches :)
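The idea of searching without loading all FPs in memory at once can be sketched roughly as below. This is not FPSim2's file format or implementation: the fixed-size binary record layout, the file name, the chunk size and all function names are hypothetical, chosen only to show a search that streams fingerprints from disk in small chunks.

```python
import os
import struct
import tempfile

# Hypothetical record layout: uint32 molecule id + one 64-bit fingerprint word.
RECORD = struct.Struct("<IQ")


def write_fps(path, records):
    """Write (id, fingerprint) pairs as fixed-size binary records."""
    with open(path, "wb") as handle:
        for mol_id, fp in records:
            handle.write(RECORD.pack(mol_id, fp))


def iter_chunks(path, chunk_size):
    """Yield lists of (id, fp), holding at most chunk_size records in memory."""
    with open(path, "rb") as handle:
        while True:
            data = handle.read(RECORD.size * chunk_size)
            if not data:
                break
            yield list(RECORD.iter_unpack(data))


def tanimoto(a, b):
    """Tanimoto coefficient between two fingerprints stored as ints."""
    union = bin(a | b).count("1")
    return bin(a & b).count("1") / union if union else 0.0


def out_of_core_search(path, query, threshold, chunk_size=2):
    """Scan the file chunk by chunk and keep only the hits."""
    hits = []
    for chunk in iter_chunks(path, chunk_size):
        for mol_id, fp in chunk:
            score = tanimoto(query, fp)
            if score >= threshold:
                hits.append((mol_id, score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


with tempfile.TemporaryDirectory() as tmp_dir:
    fp_path = os.path.join(tmp_dir, "toy_fps.bin")
    toy_records = [(1, 0b1111), (2, 0b0111), (3, 0b1000), (4, 0b1110), (5, 0b0001)]
    write_fps(fp_path, toy_records)
    chunked_hits = out_of_core_search(fp_path, 0b1111, threshold=0.7)
print(chunked_hits)
```

Memory use is bounded by the chunk size rather than by the dataset size, which is what makes a search over a very large collection feasible on a small machine, at the price of reading from disk on every query.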

chemfp, as more mature software, has many extra features, such as calculating full similarity matrices.
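For readers unfamiliar with the term, a full similarity matrix holds the score of every fingerprint against every other one. The snippet below is a naive pure-Python sketch of that idea, assuming Tanimoto scores and toy fingerprints stored as ints; it does not use or reproduce chemfp's API, and the function names are made up for this example.

```python
def tanimoto(a, b):
    """Tanimoto coefficient between two fingerprints stored as ints."""
    union = bin(a | b).count("1")
    return bin(a & b).count("1") / union if union else 0.0


def similarity_matrix(fps):
    """Return the symmetric all-against-all matrix as a list of lists."""
    n = len(fps)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Only the upper triangle is computed; the matrix is symmetric.
        for j in range(i, n):
            score = tanimoto(fps[i], fps[j])
            matrix[i][j] = score
            matrix[j][i] = score
    return matrix


toy_fps = [0b1100, 0b0110, 0b0011]  # hypothetical 4-bit fingerprints
toy_matrix = similarity_matrix(toy_fps)
for row in toy_matrix:
    print(row)
```

The number of comparisons grows quadratically with the number of fingerprints, so this is a different workload from a single-query search.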

FPSim2 still needs some optimisations and features, and a benchmark once some of this work is done.