Skip to main content

Privacy and the ChEMBL Database

Privacy is pretty important - for example, in the picture above I have protected to privacy of two colleagues, as I think I should ;) In fact I've even made sure that the black box securing their identities is not a layer on the image that can be trivially removed.....

Chemistry is a little different to some other areas of life-science research, and there is a little more caution applied typically in the use of 'public' database systems by people working on chemical structures - primarily because of patenting and novelty. There are probably similar privacy/security concerns over sequence data too - and in ChEMBL we've covered that too. I'm not going to drift on to what constitutes a 'publication', and all that sort of stuff since 1) I'm not qualified, 2) I don't have the time (and 1) anyway), and 3) it attracts trolls (and 1) and 2) anyway).

I have been asked for a talk through on the usage and query privacy of ChEMBL as part of the great OpenPHACTS project for some time; so here it is - to make it clear - I'm not an expert, but I do worry about these things, and I read a lot. Any feedback or suggestions would be great in the comment section.

ChEMBL is hosted on production machines at a pair of physically separated load-balanced Class 3 data centers in London. These are pretty close to one of the main Internet backbones in the UK, so reliability, latency and throughput is pretty good. The ChEMBL database and application is automatically loaded from a staging system at Hinxton. Once it leaves our staging area, we can't access the production data/server at all; in fact only a small number of named staff, using all sorts of access control and logging mechanisms can get into the machine rooms.

You may have noted that we use https: on the ChEMBL url above - even if you try and force use of http: to access the server, it will switch you over to https: (go on try it, I told you so). This ensures 1) that the server you access really is the genuine ChEMBL server (you should see a little lock in the corner of your browser), and 2) that the traffic between your client and our server is encrypted, and so no one can simply sit on the same network as you, listening to all your queries. So this is pretty secure, the tls standard used by https: is relied on by essentially everyone who implement secure and private web sites. It takes a little care to actually get https: to work properly - with a common reason for non-validation (so the little padlock doesn't appear) being the use of http: links on the nominally https: source page, or http: links to third party sites such as for advertisers, etc.

We don't (currently) have a green bar in the browser for this https: service - the green bar (or something similar depending on your browser) comes from the use of a Extended Validity Certificate (EVC). For these, you and your Certification Authority need to do a little more paperwork, and then spend a little more money. There is no difference in the technical security - the little padlock is the mark of security, not the green bar, just that the certificate authority has done some more work to validate that you really physically are who you say you are and so on. At the moment, sites like PayPal and so forth have EVCs, but they will no doubt spread, as the public starts to associate only sites with a green url bar with 'enhanced' security, and assume that the green thing is The Mark of website safety.

We do not use accounts to access the ChEMBL website - there is no need for the things we do - any personalisation is done via cookies saved on your machine in the cookies folder (we have an Institutional cookie policy too, that describes what cookies we will write on your computer). It is not straightforward to implement good password systems, as many large professional internet companies have amply shown (LinkedIn - I'm thinking of you!), and for us we don't need them for ChEMBL, so we haven't bothered.

There is also an Institutional Privacy Policy which covers a broad range of personal type data across all our activities (including recruitment, etc).

There is an Institutional Terms of Use for all institute resources. There is usage logging performed on the servers for internally reviewing the use of our services, or for spotting of problems (like DOS attacks, innocent scripting that can look like a DOS (Ben ;) )) and to collect statistics (like total usage, distribution of users, etc), to track enhanced usage following interface/data addition (this makes us feel good sometimes, it's nice to know our things are used). This data is all private, and is forbidden from being shared other than at aggregate level with third parties/collaborators.

The ChEMBL web application is written to not store any user queries (chemical structures or sequences, or text queries), other than storage required for application and database performance - so for example some automatically flushed, short-lifetime caches that are part of Oracle, and as I've said above, we don't have access to these anyway on the production servers.

We do not run google analytics on our ChEMBL application (but some of the Institutes services do, and we do on the ChEMBL-og) - it is tempting to do use GA for the fancy plots and maps, but what it means is that a third party (Google for GA) will be seeing all the query IP source addresses and url strings. Google already know enough about me, they don't also need to know I have a late night penchant for 4-amino-anilines as well.

So, if I was to extract some general principles from the above:
  • Use https: for everything - there's no real cost over http:, and make sure it validates!
  • Have a clear and easy to find Terms of Use.
  • Have a clear and easy to find license for any data.
  • Have a cookie policy and explain to users what the cookies you use are.
  • Have a privacy policy.
  • Keep your security certificates up to date.
  • Do not store any user queries for later analysis.
  • Think carefully before placing a user account system on your software - Does it really need it. If you do need implement one; for example your application has user uploaded data, has complex long running queries, or stores intermediate results, etc.? Read widely and plan defensively before you do. 
  • If you use third party analytics tools, make sure that your users know this, and if privacy is a concern to you, make sure you're also familiar with their ToU.
  • If you deploy things 'on the cloud' - read the agreement and T&Cs that you have with the company for your use of their services. Usually they do a very good job of dodging any responsibility, and sometimes grant themselves rights you would not expect. (We don't use third party cloud provision for any of our services - but we do use the cloud for some data entry portals. For these we're not doing anything that really requires great privacy, since once we've entered the data, we give it away anyway). And once you've read the T&Cs, read them again.
  • ChEMBL is typically "tighter" than the our Institute policies, but I think it's too confusing to make this specifically clear.....
Update - two things, 1) we do have a privacy policy specific to ChEMBL on our page and 2) The readers of the ChEMBL-og are very smart people, really you are. My attempts at protecting the privacy of one of the fellas above was woeful - I left his name badge in plain view! Doh! Sorry.


Popular posts from this blog

UniChem 2.0

UniChem new beta interface and web services We are excited to announce that our UniChem beta site will become the default one on the 11th of May. The new system will allow us to better maintain UniChem and to bring new functionality in a more sustainable way. The current interface and web services will still be reachable for a period of time at . In addition to it, the most popular legacy REST endpoints will also remain implemented in the new web services: Some downtime is expected during the swap.  What's new? UniChem’s current API and web application is implemented with a framework version that’s not maintained and the cost of updating it surpasses the cost of rebuilding it. In order to improve stability, security, and support the implementation and fast delivery of new features, we have decided to revamp our user-facing systems using the latest version of widely used and maintained frameworks, i

A python client for accessing ChEMBL web services

Motivation The CheMBL Web Services provide simple reliable programmatic access to the data stored in ChEMBL database. RESTful API approaches are quite easy to master in most languages but still require writing a few lines of code. Additionally, it can be a challenging task to write a nontrivial application using REST without any examples. These factors were the motivation for us to write a small client library for accessing web services from Python. Why Python? We choose this language because Python has become extremely popular (and still growing in use) in scientific applications; there are several Open Source chemical toolkits available in this language, and so the wealth of ChEMBL resources and functionality of those toolkits can be easily combined. Moreover, Python is a very web-friendly language and we wanted to show how easy complex resource acquisition can be expressed in Python. Reinventing the wheel? There are already some libraries providing access to ChEMBL d

LSH-based similarity search in MongoDB is faster than postgres cartridge.

TL;DR: In his excellent blog post , Matt Swain described the implementation of compound similarity searches in MongoDB . Unfortunately, Matt's approach had suboptimal ( polynomial ) time complexity with respect to decreasing similarity thresholds, which renders unsuitable for production environments. In this article, we improve on the method by enhancing it with Locality Sensitive Hashing algorithm, which significantly reduces query time and outperforms RDKit PostgreSQL cartridge . myChEMBL 21 - NoSQL edition    Given that NoSQL technologies applied to computational chemistry and cheminformatics are gaining traction and popularity, we decided to include a taster in future myChEMBL releases. Two especially appealing technologies are Neo4j and MongoDB . The former is a graph database and the latter is a BSON document storage. We would like to provide IPython notebook -based tutorials explaining how to use this software to deal with common cheminformatics p

ChEMBL 30 released

  We are pleased to announce the release of ChEMBL 30. This version of the database, prepared on 22/02/2022 contains: 2,786,911 compound records 2,157,379 compounds (of which 2,136,187 have mol files) 19,286,751 activities 1,458,215 assays 14,855 targets 84,092 documents Data can be downloaded from the ChEMBL FTP site: Please see ChEMBL_30 release notes for full details of all changes in this release: New Deposited Datasets EUbOPEN Chemogenomic Library (src_id = 55, ChEMBL Document ID CHEMBL4689842):   The EUbOPEN consortium is an Innovative Medicines Initiative (IMI) funded project to enable and unlock biology in the open. The aims of the project are to assemble an open access chemogenomic library comprising about 5,000 well annotated compounds covering roughly 1,000 different proteins, to synthesize at least

Multi-task neural network on ChEMBL with PyTorch 1.0 and RDKit

  Update: KNIME protocol with the model available thanks to Greg Landrum. Update: New code to train the model and ONNX exported trained models available in github . The use and application of multi-task neural networks is growing rapidly in cheminformatics and drug discovery. Examples can be found in the following publications: - Deep Learning as an Opportunity in VirtualScreening - Massively Multitask Networks for Drug Discovery - Beyond the hype: deep neural networks outperform established methods using a ChEMBL bioactivity benchmark set But what is a multi-task neural network? In short, it's a kind of neural network architecture that can optimise multiple classification/regression problems at the same time while taking advantage of their shared description. This blogpost gives a great overview of their architecture. All networks in references above implement the hard parameter sharing approach. So, having a set of activities relating targets and molecules we can tra