FPSim2 is a new tool for fast similarity searches on large compound datasets (>100 million compounds), under development at ChEMBL. We started building it because we needed a Python 3 library that could run fast similarity searches on datasets of that size, either in memory or out-of-core.
It's written in Python/Cython and features:
- A fast population count algorithm (builtin-popcnt-unrolled) from https://github.com/WojciechMula/sse-popcount using SIMD instructions.
- Bounds for sub-linear speed-ups, from DOI 10.1021/ci600358f
- A compressed file format with optimised read speed, based on PyTables and Blosc
- Use of multiple cores in a single search
- In memory and on disk search modes
- Simple and easy to use
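To give a feel for the first two features, here is a minimal pure-Python sketch of a popcount-based Tanimoto search with popcount bounds. It is not FPSim2's code: the real thing is Cython with SIMD popcounts, while this sketch stores each fingerprint as a plain Python integer, and every function name in it is made up for illustration. The bound it uses is the standard one for Tanimoto: for a query with `a` bits set and a threshold `t` greater than 0, only fingerprints whose popcount lies between `a * t` and `a / t` can reach the threshold, so a database sorted by popcount lets you skip everything outside that window.

```python
from bisect import bisect_left, bisect_right
import math


def popcount(fp: int) -> int:
    """Number of bits set in a fingerprint stored as a Python int."""
    return bin(fp).count("1")


def tanimoto(a: int, b: int) -> float:
    """Tanimoto similarity: |a AND b| / |a OR b|."""
    union = popcount(a | b)
    return popcount(a & b) / union if union else 0.0


def popcount_bounds(query_count: int, threshold: float) -> tuple:
    """Popcount window that can still reach `threshold` (threshold > 0).

    The small epsilon guards against floating point rounding pushing
    an exact bound across an integer.
    """
    lower = math.ceil(query_count * threshold - 1e-9)
    upper = math.floor(query_count / threshold + 1e-9)
    return lower, upper


def build_index(fps: dict) -> tuple:
    """Sort fingerprints by popcount so a window can be found by bisection."""
    entries = sorted((popcount(fp), mol_id, fp) for mol_id, fp in fps.items())
    counts = [entry[0] for entry in entries]
    return counts, entries


def search(query: int, index: tuple, threshold: float) -> list:
    """Return (mol_id, score) pairs with score >= threshold, best first."""
    counts, entries = index
    lower, upper = popcount_bounds(popcount(query), threshold)
    start = bisect_left(counts, lower)
    stop = bisect_right(counts, upper)
    hits = []
    # Only fingerprints inside the popcount window are compared at all.
    for _, mol_id, fp in entries[start:stop]:
        score = tanimoto(query, fp)
        if score >= threshold:
            hits.append((mol_id, score))
    return sorted(hits, key=lambda hit: -hit[1])


# Toy 8-bit fingerprints, purely illustrative.
fps = {
    "a": 0b11110000,
    "b": 0b11100000,
    "c": 0b00001111,
    "d": 0b10000000,  # popcount 1: skipped by the bounds at threshold 0.5
    "e": 0b11111111,
}
index = build_index(fps)
hits = search(0b11110000, index, threshold=0.5)
print(hits)  # [('a', 1.0), ('b', 0.75), ('e', 0.5)]
```

The higher the threshold, the narrower the popcount window, which is where the sub-linear behaviour comes from: at a threshold of 0.5 the toy search above never touches fingerprint "d".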
Source code is available on GitHub, and Conda packages are available for both macOS and Linux. To install it, type:
conda install rdkit -c rdkit
conda install fpsim2 -c efelix
Try it with Docker (much better performance than Binder):
- docker pull eloyfelix/fpsim2
- docker run -p 9999:9999 eloyfelix/fpsim2
- open http://localhost:9999/notebooks/demo.ipynb in a browser
Or, if you prefer to try it without installing anything (yet)... click on the Binder image!
I would also like to thank Andrew Dalke and Greg Landrum for their blogs; they have been very useful resources!