
Privacy and the ChEMBL Database

Privacy is pretty important - for example, in the picture above I have protected the privacy of two colleagues, as I think I should ;) In fact I've even made sure that the black box securing their identities is not a layer on the image that can be trivially removed.....

Chemistry is a little different to some other areas of life-science research, and there is a little more caution applied typically in the use of 'public' database systems by people working on chemical structures - primarily because of patenting and novelty. There are probably similar privacy/security concerns over sequence data too - and in ChEMBL we've covered that too. I'm not going to drift on to what constitutes a 'publication', and all that sort of stuff since 1) I'm not qualified, 2) I don't have the time (and 1) anyway), and 3) it attracts trolls (and 1) and 2) anyway).

I have been asked for a talk through on the usage and query privacy of ChEMBL as part of the great OpenPHACTS project for some time; so here it is - to make it clear - I'm not an expert, but I do worry about these things, and I read a lot. Any feedback or suggestions would be great in the comment section.

ChEMBL is hosted on production machines at a pair of physically separated load-balanced Class 3 data centers in London. These are pretty close to one of the main Internet backbones in the UK, so reliability, latency and throughput are pretty good. The ChEMBL database and application are automatically loaded from a staging system at Hinxton. Once they leave our staging area, we can't access the production data/server at all; in fact only a small number of named staff, using all sorts of access control and logging mechanisms, can get into the machine rooms.

You may have noted that we use https: on the ChEMBL url above - even if you try to force use of http: to access the server, it will switch you over to https: (go on try it, I told you so). This ensures 1) that the server you access really is the genuine ChEMBL server (you should see a little lock in the corner of your browser), and 2) that the traffic between your client and our server is encrypted, so no one can simply sit on the same network as you, listening to all your queries. So this is pretty secure; the TLS standard used by https: is relied on by essentially everyone who implements secure and private web sites. It takes a little care to actually get https: to work properly - a common reason for non-validation (so the little padlock doesn't appear) is the use of http: links on the nominally https: source page, or http: links to third party sites such as advertisers.
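To make the padlock problem a bit more concrete, here is a minimal sketch (not anything we run on ChEMBL) that scans a page's HTML for resources fetched over plain http:. The function name `find_mixed_content` and the choice of tags to check are my own assumptions for illustration; ordinary `<a href>` hyperlinks are deliberately ignored, since they are only followed on a click rather than loaded with the page.

```python
from html.parser import HTMLParser

# Tag -> attribute that makes the browser fetch a resource as part of the page.
# This short table is an illustrative assumption, not a complete list.
SUBRESOURCE_ATTRS = {
    "img": "src",
    "script": "src",
    "iframe": "src",
    "link": "href",
}


class MixedContentFinder(HTMLParser):
    """Collects http: URLs that an https: page would load as sub-resources."""

    def __init__(self):
        super().__init__()
        self.insecure = []  # list of (tag, url) pairs

    def handle_starttag(self, tag, attrs):
        wanted = SUBRESOURCE_ATTRS.get(tag)
        for name, value in attrs:
            if name == wanted and value and value.lower().startswith("http://"):
                self.insecure.append((tag, value))


def find_mixed_content(html):
    """Return (tag, url) pairs for sub-resources referenced over plain http:."""
    finder = MixedContentFinder()
    finder.feed(html)
    finder.close()
    return finder.insecure
```

Running this over the rendered source of each page before release is one cheap way to catch a stray http: script or image before your users' browsers do.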

We don't (currently) have a green bar in the browser for this https: service - the green bar (or something similar depending on your browser) comes from the use of an Extended Validation Certificate (EVC). For these, you and your Certification Authority need to do a little more paperwork, and then spend a little more money. There is no difference in the technical security - the little padlock is the mark of security, not the green bar; the green bar just means that the certificate authority has done some more work to validate that you really physically are who you say you are and so on. At the moment, sites like PayPal and so forth have EVCs, but they will no doubt spread, as the public starts to associate only sites with a green url bar with 'enhanced' security, and assume that the green thing is The Mark of website safety.

We do not use accounts to access the ChEMBL website - there is no need for the things we do - any personalisation is done via cookies saved on your machine in the cookies folder (we have an Institutional cookie policy too, that describes what cookies we will write on your computer). It is not straightforward to implement good password systems, as many large professional internet companies have amply shown (LinkedIn - I'm thinking of you!), and for us we don't need them for ChEMBL, so we haven't bothered.

There is also an Institutional Privacy Policy which covers a broad range of personal type data across all our activities (including recruitment, etc).

There is an Institutional Terms of Use for all institute resources. Usage is logged on the servers so that we can internally review the use of our services, spot problems (like DoS attacks, or innocent scripting that can look like a DoS (Ben ;) )), collect statistics (like total usage, distribution of users, etc), and track enhanced usage following interface/data additions (this makes us feel good sometimes, it's nice to know our things are used). This data is all private, and must not be shared with third parties/collaborators other than at aggregate level.

The ChEMBL web application is written to not store any user queries (chemical structures or sequences, or text queries), other than storage required for application and database performance - so for example some automatically flushed, short-lifetime caches that are part of Oracle, and as I've said above, we don't have access to these anyway on the production servers.
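For readers wondering what an "automatically flushed, short-lifetime cache" looks like, here is a toy sketch of the general idea. It is emphatically not how Oracle's internal caches work; the class name `ShortLivedCache` and its parameters are invented for illustration. Entries are dropped once they are older than a fixed lifetime, so nothing lingers for later analysis.

```python
import time


class ShortLivedCache:
    """A toy cache whose entries vanish after ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so the behaviour can be tested
        self._entries = {}   # key -> (expiry_time, value)

    def put(self, key, value):
        self._entries[key] = (self._clock() + self._ttl, value)

    def get(self, key, default=None):
        self.flush_expired()
        entry = self._entries.get(key)
        return entry[1] if entry else default

    def flush_expired(self):
        """Delete every entry whose lifetime has run out."""
        now = self._clock()
        expired = [k for k, (expiry, _) in self._entries.items() if expiry <= now]
        for k in expired:
            del self._entries[k]

    def __len__(self):
        self.flush_expired()
        return len(self._entries)
```

A usage example would be `cache = ShortLivedCache(ttl_seconds=60)` followed by `cache.put(query, result)`; a minute later `cache.get(query)` returns `None` again.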

We do not run Google Analytics on our ChEMBL application (but some of the Institute's services do, and we do on the ChEMBL-og) - it is tempting to use GA for the fancy plots and maps, but what it means is that a third party (Google for GA) will be seeing all the query IP source addresses and url strings. Google already know enough about me, they don't also need to know I have a late night penchant for 4-amino-anilines as well.
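The reason url strings matter is that a search url can carry the query itself. As a sketch (with a made-up example url and a hypothetical helper name, `strip_query`), this is the sort of redaction you would want before a url ever reaches a third-party tool or a long-lived log. It is not a description of how any particular analytics product behaves.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_query(url):
    """Return url with its query string and fragment removed."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


# Hypothetical search url; the SMILES string is the sensitive part.
example = "https://www.example.org/search?smiles=Nc1ccc(N)cc1#results"
redacted = strip_query(example)
```

Here `redacted` keeps the host and path, which is usually enough for usage statistics, while the structure that was searched for is gone. Note that this only helps if the query lives in the query string; anything encoded in the path would need separate handling.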

So, if I was to extract some general principles from the above:
  • Use https: for everything - there's no real cost over http:, and make sure it validates!
  • Have a clear and easy to find Terms of Use.
  • Have a clear and easy to find license for any data.
  • Have a cookie policy and explain to users what the cookies you use are.
  • Have a privacy policy.
  • Keep your security certificates up to date.
  • Do not store any user queries for later analysis.
  • Think carefully before placing a user account system on your software - does it really need one? If you do need to implement one (for example because your application has user uploaded data, has complex long running queries, or stores intermediate results, etc.), read widely and plan defensively before you do.
  • If you use third party analytics tools, make sure that your users know this, and if privacy is a concern to you, make sure you're also familiar with their ToU.
  • If you deploy things 'on the cloud' - read the agreement and T&Cs that you have with the company for your use of their services. Usually they do a very good job of dodging any responsibility, and sometimes grant themselves rights you would not expect. (We don't use third party cloud provision for any of our services - but we do use the cloud for some data entry portals. For these we're not doing anything that really requires great privacy, since once we've entered the data, we give it away anyway). And once you've read the T&Cs, read them again.
  • ChEMBL is typically "tighter" than our Institute policies, but I think it's too confusing to make this specifically clear.....
Update - two things, 1) we do have a privacy policy specific to ChEMBL on our page and 2) the readers of the ChEMBL-og are very smart people, really you are. My attempt at protecting the privacy of one of the fellas above was woeful - I left his name badge in plain view! Doh! Sorry.

