FPSim2 is a new tool, developed at ChEMBL, for fast similarity search on big compound datasets (>100 million compounds). We started developing it because we needed a Python 3 library able to run fast similarity searches, either in memory or out of core, on datasets of that size.
It's written in Python/Cython and features:
- A fast population count algorithm (builtin-popcnt-unrolled) from https://github.com/WojciechMula/sse-popcount using SIMD instructions.
- Popcount bounds for sub-linear speed-ups, from Swamidass and Baldi (10.1021/ci600358f); see the sketch after this list
- A compressed file format with optimised read speed, based on PyTables and BLOSC
- Use of multiple cores in a single search
- In-memory and on-disk search modes
- Simple and easy to use
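To illustrate the popcount bounds item above: for a Tanimoto threshold t and a query with n_q bits set, only fingerprints whose popcount lies between t·n_q and n_q/t can possibly reach the threshold, so whole regions of a popcount-sorted fingerprint arena can be skipped. A minimal sketch of the bound calculation (illustrative only, not FPSim2's actual code):

```python
import math

def popcount_bounds(query_popcount, threshold):
    """Swamidass & Baldi bounds (10.1021/ci600358f): only fingerprints whose
    popcount n satisfies threshold * n_q <= n <= n_q / threshold can reach
    the Tanimoto threshold, so anything outside this range can be skipped."""
    lower = math.ceil(threshold * query_popcount)
    upper = math.floor(query_popcount / threshold)
    return lower, upper

# e.g. a query with 48 bits set and a 0.7 threshold
print(popcount_bounds(48, 0.7))  # (34, 68)
```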
Source code is available on GitHub, and Conda packages are available for both macOS and Linux. To install it, type:
conda install rdkit -c rdkit
conda install fpsim2 -c efelix
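Once installed, the typical workflow is to build the compressed fingerprint file once and then query it. The snippet below is a minimal sketch using an FPSim2Engine-style API; the exact function and parameter names (create_db_file, the fingerprint parameter dict, n_workers) may differ between FPSim2 versions, and the file names are placeholders:

```python
from FPSim2 import FPSim2Engine
from FPSim2.io import create_db_file

# Build the compressed fingerprint file once (placeholder input/output names,
# illustrative Morgan fingerprint parameters).
create_db_file('chembl.smi', 'chembl.h5', 'Morgan', {'radius': 2, 'nBits': 2048})

# Load the fingerprints into memory and run a Tanimoto search on 2 cores;
# the query can be given as a SMILES, InChI or molfile string.
fpe = FPSim2Engine('chembl.h5')
results = fpe.similarity('CC(=O)Oc1ccccc1C(=O)O', 0.7, n_workers=2)
```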
Try it with Docker (much better performance than Binder):
- docker pull eloyfelix/fpsim2
- docker run -p 9999:9999 eloyfelix/fpsim2
- open http://localhost:9999/notebooks/demo.ipynb in a browser
Or if you prefer to try it without installing anything (yet)... click on the Binder image!
Data files used in the demos are also available to download.
I would also like to thank Andrew Dalke and Greg Landrum for their blogs, they have been very useful resources!
Comments
I can only compare it to chemfp 1.5, which is the open-source version.
FPSim2 is Python 3 compatible, can use multiple threads in a single query, and has a fast-loading compressed file format.
SMILES, InChI and molfiles can be used as query input for a search, but converting them on the fly also comes at a cost.
FPSim2 can also run searches without loading all fingerprints into memory at once. This enables a Raspberry Pi to run similarity searches against UniChem (>150 million compounds) :)
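For reference, a minimal sketch of what the on-disk mode looks like with an FPSim2Engine-style API (again, method and parameter names may vary between versions, and the file name is a placeholder):

```python
from FPSim2 import FPSim2Engine

# Open the fingerprint file without loading it into memory and search it
# directly from disk; useful on low-RAM machines such as a Raspberry Pi.
fpe = FPSim2Engine('unichem.h5', in_memory_fps=False)
results = fpe.on_disk_similarity('CC(=O)Oc1ccccc1C(=O)O', 0.7, n_workers=2)
```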
chemfp, being a more mature piece of software, has many extra features, such as calculating full similarity matrices.
FPSim2 still needs some optimisations and extra features; a benchmark will follow once some of that work is done.