tag:blogger.com,1999:blog-2546008714740235720.comments2024-03-25T07:19:52.909+00:00The ChEMBL-ogUnknownnoreply@blogger.comBlogger269125tag:blogger.com,1999:blog-2546008714740235720.post-66139276864575504712020-02-06T12:42:51.739+00:002020-02-06T12:42:51.739+00:00Hi Chris,
Are you using Docker on Windows or Mac?...Hi Chris,<br /><br />Are you using Docker on Windows or Mac? <br />Its default config (Docker on Windows and Mac actually runs inside a tiny VM) only allows it to use 2GB of RAM, and it looks like it's killing the container process because Docker runs out of memory when loading the models.<br />You'll need to change the Docker config to allow it to use 8GB of system memory.<br /><br />Kind regards,<br />EloyEloyhttps://www.blogger.com/profile/18271661604434560248noreply@blogger.comtag:blogger.com,1999:blog-2546008714740235720.post-30758521808821783512020-02-06T10:58:30.580+00:002020-02-06T10:58:30.580+00:00Hi, Tried to use Docker
docker run -p 8080:8080 c...Hi, Tried to use Docker<br /><br />docker run -p 8080:8080 chembl/mcp<br />Forking - python [index.py]<br />2020/02/06 10:54:02 Started logging stderr from function.<br />2020/02/06 10:54:02 Started logging stdout from function.<br />2020/02/06 10:54:02 OperationalMode: http<br />2020/02/06 10:54:02 Timeouts: read: 10s, write: 10s hard: 10s.<br />2020/02/06 10:54:02 Listening on port: 8080<br />2020/02/06 10:54:02 Writing lock-file to: /tmp/.lock<br />2020/02/06 10:54:02 Metrics listening on port: 8081<br />2020/02/06 10:54:31 Upstream HTTP request error: Post http://127.0.0.1:5000/: dial tcp 127.0.0.1:5000: connect: connection refused<br />2020/02/06 10:54:46 Forked function has terminated: signal: killed<br /><br />when I try this in another Terminal window<br /><br />curl -X POST -H 'Accept: */*' -H 'Content-Type: application/json' -d '{"smiles": "CC(=O)Oc1ccccc1C(=O)O"}' http://127.0.0.1:8080Chrishttps://www.blogger.com/profile/11553312663186781938noreply@blogger.comtag:blogger.com,1999:blog-2546008714740235720.post-70201328915244768122019-08-27T18:39:00.289+01:002019-08-27T18:39:00.289+01:00Glad you find it useful! This is a feature that we...Glad you find it useful! This is a feature that we considered implementing, but it probably won't happen before the next ChEMBL (26) release.Eloyhttps://www.blogger.com/profile/18271661604434560248noreply@blogger.comtag:blogger.com,1999:blog-2546008714740235720.post-86040120233071414372019-08-27T08:31:17.530+01:002019-08-27T08:31:17.530+01:00Very nice tool. I am trying to move from chemfp to...Very nice tool. I am trying to move from chemfp to using FPSim2... Are you planning to include the calculation of similarity matrices? Thanks!Miquel Duran-Frigolahttps://www.blogger.com/profile/13906513614497560313noreply@blogger.comtag:blogger.com,1999:blog-2546008714740235720.post-70781577271457395682019-01-25T15:43:26.905+00:002019-01-25T15:43:26.905+00:00thanks!
I can only compare it to chemfp 1.5 vers...thanks! <br /><br />I can only compare it to the chemfp 1.5 version, which is the open-source one.<br /><br />FPSim2 is Python 3 compatible, can use multiple threads in a single query, and has a fast-loading compressed file format.<br />SMILES, InChI and molfiles can be used as input for a search, but this also comes with a cost.<br />FPSim2 can also run searches without loading all FPs in memory at once. This enables a Raspberry Pi to run UniChem (>150 million) similarity searches :)<br /><br />chemfp, as more mature software, has many more extra features, like calculating full similarity matrices for example.<br /><br />FPSim2 still needs some optimisations, features and a benchmark after some of this work is done.Eloyhttps://www.blogger.com/profile/18271661604434560248noreply@blogger.comtag:blogger.com,1999:blog-2546008714740235720.post-72211135381447331012019-01-24T20:30:40.916+00:002019-01-24T20:30:40.916+00:00very cool! how does it compare to chemfp?very cool! how does it compare to chemfp?Anonymoushttps://www.blogger.com/profile/18414967060522745603noreply@blogger.comtag:blogger.com,1999:blog-2546008714740235720.post-9889113561220025672016-12-19T16:43:33.275+00:002016-12-19T16:43:33.275+00:00Merry Christmas to all at ChEMBL - as a user of ChE...Merry Christmas to all at ChEMBL - as a user of ChEMBL now, just want to say, great job!!!! :)jpohttps://www.blogger.com/profile/11392161215016795205noreply@blogger.comtag:blogger.com,1999:blog-2546008714740235720.post-30259045406955837932016-03-04T16:54:58.038+00:002016-03-04T16:54:58.038+00:00This comment has been removed by the author.Anonymoushttps://www.blogger.com/profile/18414967060522745603noreply@blogger.comtag:blogger.com,1999:blog-2546008714740235720.post-62381538483691710972015-08-08T02:36:09.922+01:002015-08-08T02:36:09.922+01:00I noted in your "future work" you're...I noted in your "future work" you're considering using supervisord as a process manager. 
We used it for a long time, and you might want to take a look at a process manager designed specifically for Docker: <a href="http://garywiz.github.io/chaperone/index.html" rel="nofollow">Chaperone Documentation</a>.<br /><br />Disclaimer: We built this ourselves and open-sourced it, but it really has solved boatloads of problems for us and a few of our clients. Always very interested in feedback.<br />Anonymoushttps://www.blogger.com/profile/16838186326855431724noreply@blogger.comtag:blogger.com,1999:blog-2546008714740235720.post-36287017810973695622015-08-05T13:43:02.712+01:002015-08-05T13:43:02.712+01:00Error charts are now updated.Error charts are now updated.kotthttps://www.blogger.com/profile/09879345075798756102noreply@blogger.comtag:blogger.com,1999:blog-2546008714740235720.post-91610026928533865582015-08-05T08:20:19.273+01:002015-08-05T08:20:19.273+01:00@Matt - thanks, I will try that. Another thing is ...@Matt - thanks, I will try that. Another thing is you can define some map-reduce tasks using MongoDB. If we define a map step as "compound_document -> (tanimoto, SMILES)" and reduce as "discard tuple if tanimoto < T, otherwise add to result", then this scheme can easily be performed on multiple servers. This is currently out of scope of myChEMBL and this article, but it should speed up similarity search. Of course, pruning is still important, and using LSH can make it even faster.<br /><br />Another thing is that I just found that the actual error rate is much LOWER than published. This is because for each compound, I measured a set difference of results and stored the cumulative size of the differences over all compounds. But then I divided it by the number of compounds, which doesn't make much sense. I should divide it by the cumulative result-set size over all compounds, which can only be larger, which would make the error rate lower. 
<br /><br />As an alternative measure, I could count the number of compounds that had any difference in results, regardless of how big the difference is, and then divide it by the number of all compounds. This would obviously also make the error rate lower than published, assuming that I get similar results.<br /><br />I will recompute the error rates and update the article soon.kotthttps://www.blogger.com/profile/09879345075798756102noreply@blogger.comtag:blogger.com,1999:blog-2546008714740235720.post-24010052266673026992015-08-05T08:00:13.041+01:002015-08-05T08:00:13.041+01:00@Abik - I don't really understand why you say,...@Abik - I don't really understand why you say that the method requires a "huge amount of memory in order to perform random permutations". This is simply not true. Please run the attached code and see the memory usage.kotthttps://www.blogger.com/profile/09879345075798756102noreply@blogger.comtag:blogger.com,1999:blog-2546008714740235720.post-77246809305779867632015-08-05T06:52:19.482+01:002015-08-05T06:52:19.482+01:00Hi ,
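The two error measures discussed above (cumulative difference size divided by cumulative result-set size, versus the fraction of compounds with any difference at all) can be sketched in plain Python. All data below is hypothetical, purely to illustrate the arithmetic:

```python
# reference[i] and approximate[i] are the exact and approximate (LSH)
# result sets for compound i (hypothetical data).
reference = [{"A", "B", "C"}, {"D"}, {"E", "F"}]
approximate = [{"A", "B"}, {"D"}, {"E", "F", "G"}]

# Measure 1: cumulative size of the symmetric differences, divided by
# the cumulative size of the exact result sets (not by compound count).
diff_total = sum(len(r ^ a) for r, a in zip(reference, approximate))
ref_total = sum(len(r) for r in reference)
rate_by_results = diff_total / ref_total  # 2 / 6

# Measure 2: fraction of compounds whose result set differs at all,
# regardless of how big the difference is.
n_differing = sum(1 for r, a in zip(reference, approximate) if r != a)
rate_by_compounds = n_differing / len(reference)  # 2 / 3

print(rate_by_results, rate_by_compounds)
```

Both denominators can only grow with result-set size, which is why either measure comes out lower than dividing raw difference counts by the number of compounds.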
How about searching based on HAMMING SPACE ....Hi ,<br /><br />How about searching based on Hamming space? I see that one disadvantage of the search is the requirement for a huge amount of memory in order to perform random permutations. <br /><br />Check this article:<br />http://onlinelibrary.wiley.com/doi/10.1002/ecj.11561/abstractAbhik Sealhttps://www.blogger.com/profile/04250648478227819727noreply@blogger.comtag:blogger.com,1999:blog-2546008714740235720.post-67865811527163753492015-08-04T20:13:14.023+01:002015-08-04T20:13:14.023+01:00This is great - some impressive results.
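The map/reduce Tanimoto filtering scheme suggested earlier in this thread (map each compound document to a `(tanimoto, SMILES)` tuple, then discard tuples below the threshold T) can be sketched in plain Python. The fingerprints and SMILES below are hypothetical; real fingerprints would be much longer bit vectors:

```python
# Fingerprints are bit vectors stored as Python ints (hypothetical data).
T = 0.5  # similarity threshold

def tanimoto(fp1, fp2):
    # Tanimoto coefficient: bits set in both / bits set in either.
    return bin(fp1 & fp2).count("1") / bin(fp1 | fp2).count("1")

query_fp = 0b101100
compounds = [("CCO", 0b101100), ("CCN", 0b001100), ("c1ccccc1", 0b010011)]

# map: compound document -> (tanimoto, SMILES)
mapped = [(tanimoto(query_fp, fp), smiles) for smiles, fp in compounds]
# reduce: discard tuples with tanimoto < T, keep the rest as the result
hits = [(t, s) for t, s in mapped if t >= T]
print(hits)  # keeps ("CCO", 1.0) and ("CCN", 2/3), drops "c1ccccc1"
```

Because the map step is independent per compound, it can be distributed across servers and only the small filtered result sets need to be merged.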
One othe...This is great - some impressive results.<br /><br />One other thing that I tried since I wrote my original blog post is "sharding" the MongoDB collection. With the default setup, MongoDB only uses one CPU core to perform the query, but if you shard the collection, it will use a core for each shard. This can be a huge benefit for CPU-intensive queries like this. You can run multiple shards on the same server, or split shards across multiple servers, or both! Sharding across multiple servers also means that you can easily scale up your RAM beyond what would be possible to have in a single machine. This is useful for huge databases (e.g. ~100m+ molecules) where you start to have trouble fitting the indexes in RAM on a single machine. Instead, each server only has to hold its own shards in memory, and you can scale by just adding more servers as your database grows.<br /><br />Also, it's worth noting: for my original postgres benchmarks, I just compiled the RDKit Postgres cartridge with the default settings. However, since then I've noticed that some of the options in the makefile should probably be changed from their defaults to improve performance - in particular turning on 'USE_POPCOUNT' and possibly also 'USE_THREADS'. That would probably give a fairer comparison.Matthttps://www.blogger.com/profile/08016051323479363787noreply@blogger.comtag:blogger.com,1999:blog-2546008714740235720.post-34678847553811057832015-08-04T14:02:00.856+01:002015-08-04T14:02:00.856+01:00I agree this is kind of a trade between time and a...I agree this is kind of a trade-off between time and accuracy. The question is how much the accuracy can be improved by:<br /><br /> - selecting optimal permutations within a group<br /> - using other fingerprint types/lengths<br /> - using different data structures (range trees?) 
or increasing the number of buckets<br /><br />and how would that affect time?kotthttps://www.blogger.com/profile/09879345075798756102noreply@blogger.comtag:blogger.com,1999:blog-2546008714740235720.post-44296041383420440482015-08-04T13:07:15.991+01:002015-08-04T13:07:15.991+01:00Interesting stuff, and something I'd love to h...Interesting stuff, and something I'd love to help explore further.<br /><br />I'm quite surprised by the results with the PostgreSQL cartridge. I would expect the search time to decrease with increasing similarity threshold. I was able to reproduce the behavior though, so I will try to figure out what's going on (or why my expectations are wrong).<br /><br />One concern I have with this: The results look great for high cutoffs (i.e. finding close neighbors), but one often wants to find compounds that are "somewhat similar" to the query. This requires searching with a fairly loose similarity cutoff. It looks like the LSH approach leads to an accuracy problem here.<br />greg landrumhttps://www.blogger.com/profile/10263150365422242369noreply@blogger.comtag:blogger.com,1999:blog-2546008714740235720.post-18717232565866515912015-07-28T14:20:36.430+01:002015-07-28T14:20:36.430+01:00Thanks. The example on the blog post is working (w...Thanks. The example on the blog post is working (with some minor modifications)Anonymoushttps://www.blogger.com/profile/05184380116368702473noreply@blogger.comtag:blogger.com,1999:blog-2546008714740235720.post-66531466554820548352015-07-27T14:19:38.489+01:002015-07-27T14:19:38.489+01:00This is now completed and chembl_webresource_clien...This is now completed and chembl_webresource_client ver. 
0.8.31+ supports Python 3.kotthttps://www.blogger.com/profile/09879345075798756102noreply@blogger.comtag:blogger.com,1999:blog-2546008714740235720.post-18412750069442734052015-07-23T16:41:59.863+01:002015-07-23T16:41:59.863+01:00Thank you Maciek, we will use qcow2 compression fo...Thank you Maciek, we will use qcow2 compression for myChEMBL 21.kotthttps://www.blogger.com/profile/09879345075798756102noreply@blogger.comtag:blogger.com,1999:blog-2546008714740235720.post-67998966224931190352015-07-23T16:40:42.492+01:002015-07-23T16:40:42.492+01:00Yes, this is one of the most important issues and ...Yes, this is one of the most important issues and will be resolved soon; you can check the progress by subscribing to notifications on this ticket: https://github.com/chembl/chembl_webresource_client/issues/9kotthttps://www.blogger.com/profile/09879345075798756102noreply@blogger.comtag:blogger.com,1999:blog-2546008714740235720.post-52687880805189263912015-07-23T14:09:15.854+01:002015-07-23T14:09:15.854+01:00Any plan to make compatible with python 3.X ?Any plan to make it compatible with Python 3.X?Anonymoushttps://www.blogger.com/profile/05184380116368702473noreply@blogger.comtag:blogger.com,1999:blog-2546008714740235720.post-57795033253048957022015-07-22T09:21:50.125+01:002015-07-22T09:21:50.125+01:00Hey, why won't you use qcow2's compression...Hey, why won't you use qcow2's compression instead of tar.gz? You wouldn't need to decompress it, and you'd save some more disk space. I wrote a blog post about using myChEMBL in KVM about a year or so ago, and the compression worked like a charm. The initial image was roughly the same size as the tarball, but functional. It will grow over time. I don't know how big the performance hit is, although for light usage there was no difference at all. 
For reference see http://maciek.wojcikowski.pl/2014/06/mychembl-running-on-kvm/Anonymoushttps://www.blogger.com/profile/05580912754415682110noreply@blogger.comtag:blogger.com,1999:blog-2546008714740235720.post-88337324392807462452015-07-14T14:42:56.526+01:002015-07-14T14:42:56.526+01:00CentOS based myChembl is brilliant.. ! To bad i ha...CentOS-based myChEMBL is brilliant! Too bad I have just finished installing remus... :(<br /><br />Anonymoushttps://www.blogger.com/profile/01604589692467317405noreply@blogger.comtag:blogger.com,1999:blog-2546008714740235720.post-80074260449781184362015-06-30T17:10:14.700+01:002015-06-30T17:10:14.700+01:00These models require scikit-learn==0.14.1
I now h...I now have them running in a virtual environment, any plans to update them?Chrishttps://www.blogger.com/profile/11553312663186781938noreply@blogger.comtag:blogger.com,1999:blog-2546008714740235720.post-47483760484525531742015-05-17T18:37:38.397+01:002015-05-17T18:37:38.397+01:00you already the legend, how far you can get )you're already a legend, how far you can get :)Vladimir Chupakhinhttps://www.blogger.com/profile/14838130425318070954noreply@blogger.com