ChEMBL Resources


Friday, 20 March 2015

The SureChEMBL map file is out

As many of you know, SureChEMBL taps into the wealth of knowledge hidden in the patent documents. More specifically, SureChEMBL extracts and indexes chemistry from the full-text patent corpus (EPO, WIPO and USPTO; JPO titles and abstracts only) by means of automated text- and image-mining, on a daily basis. We have recently hosted a webinar about it which turned out to be very popular - for those who missed it, the video and slides are here.

Besides the interface, SureChEMBL compound data can be accessed in various ways, such as UniChem and PubChem. The full compound dump is also available as a flat file download from our ftp server.

Since the release of the SureChEMBL interface last September, we have received numerous requests for a way to access compound and patent data in batch. Typical use-cases would include retrieving all compounds for a list of patent IDs or, vice versa, retrieving all patents from which one or more compounds have been extracted. As a result, we have now produced this so-called map file which connects SureChEMBL compounds and patents.

It is available here.
More information can be found in the README file.

What is this file?

There is a total of 216,892,266 rows in the map, each indicating a compound extracted from a specific section of a specific patent document. The format of the file is quite simple: it contains compound information (SCHEMBL ID, SMILES, InChI Key, corpus frequency), patent information (patent ID and publication date), and finally location information, such as the field ID and frequency. The field ID indicates the specific section in the patent from which the compound was extracted (1:Description, 2:Claims, 3:Abstract, 4:Title, 5:Image, 6:MOL attachment). The frequency is the number of times the compound was found in a given section of a given patent. More information on the format of the file is in the README file.

How many compounds and patents are there?

There are 187,958,584 unique patent-compound pairs, involving 14,076,090 unique compound IDs extracted from 3,585,233 EP, JP, WO and US patent documents - an average of ~52 compounds per patent. The patent coverage is from 1960 to 31-12-2014 inclusive.

Here's a breakdown of the patents in the map per year and patent authority:

Are these all the compounds and patents in SureChEMBL?

Technically, no - in practice, yes. We excluded chemically annotated patents that are not immediately relevant to life sciences, such as this one. For the filtering, we used a list of relevant IPCR and related patent classification codes. At the same time, we excluded too small, too large, too trivial compounds, along with non-organic and radical/fragment compounds.

Are these compounds genuinely claimed as novel in their respective patents?

Automatically assessing which compounds are important and relevant in a pharmaceutical patent is a field of research and one of our future plans. For now, the map file includes all extracted chemistry mentioned in all sections of a patent, subject to the filters listed in the previous section. A quick and effective trick to filter out trivial and/or uninformative compounds is to use the corpus frequency column and exclude everything with a value of more than, say, 1000. Note that, in this way, you will also exclude drug compounds such as sildenafil, which are casually mentioned in a lot of patents. You could also look for compounds mentioned only in the claims, description or image sections by filtering on the corresponding field ID.
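
The corpus-frequency and field-ID filters described above could be sketched in Python roughly as follows. This is only an illustration: it assumes a tab-separated file with no header row and the columns in the order listed earlier (SCHEMBL ID, SMILES, InChI Key, corpus frequency, patent ID, publication date, field ID, frequency). The README is the authority on the real layout, and the sample rows are made up.

```python
import csv
import io

# Assumed column positions; check the README before relying on them.
CORPUS_FREQ_COL = 3
FIELD_ID_COL = 6


def filter_rows(rows, max_corpus_freq=1000, field_ids=None):
    """Yield rows at or below the corpus-frequency cutoff, optionally
    restricted to a set of field IDs (for example {2} for Claims)."""
    for row in rows:
        if int(row[CORPUS_FREQ_COL]) > max_corpus_freq:
            continue  # too common across the corpus, probably uninformative
        if field_ids is not None and int(row[FIELD_ID_COL]) not in field_ids:
            continue  # found in a section we are not interested in
        yield row


# Made-up sample standing in for the real map file.
sample = (
    "SCHEMBL1\tCCO\tKEY-A\t250000\tUS-1-A1\t2014-01-01\t1\t3\n"
    "SCHEMBL2\tNC(=O)c1ccccc1\tKEY-B\t12\tUS-1-A1\t2014-01-01\t2\t1\n"
    "SCHEMBL3\tCC(=O)Nc1ccccc1\tKEY-C\t40\tEP-2-A1\t2013-05-05\t1\t2\n"
)
rows = list(csv.reader(io.StringIO(sample), delimiter="\t"))
rare = list(filter_rows(rows))
claims_only = list(filter_rows(rows, field_ids={2}))
```

Because `filter_rows` is a generator, the same function can be fed a `csv.reader` over the full file without loading all rows into memory.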

What can I do with this?

Well, you can start by 'grepping' for one or more patent IDs or SCHEMBL IDs or InChI keys, followed by further filtering. Many of you will choose to normalise the flat file into 3 database tables (say compounds, documents and doc_to_compound) for centralised access and easy querying.
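
One possible shape for that three-table layout, sketched with Python's built-in sqlite3 module. The table and column names are our own choice for illustration, the column order of the input rows is assumed to follow the description above, and the rows are made up.

```python
import sqlite3

SCHEMA = """
CREATE TABLE compounds (
    schembl_id  TEXT PRIMARY KEY,
    smiles      TEXT,
    inchi_key   TEXT,
    corpus_freq INTEGER
);
CREATE TABLE documents (
    patent_id TEXT PRIMARY KEY,
    pub_date  TEXT
);
CREATE TABLE doc_to_compound (
    patent_id  TEXT REFERENCES documents(patent_id),
    schembl_id TEXT REFERENCES compounds(schembl_id),
    field_id   INTEGER,
    frequency  INTEGER
);
"""


def build_db(rows):
    """Load map-file rows into an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    for sid, smiles, key, cfreq, pid, pdate, field, freq in rows:
        # Compounds and documents repeat across rows, so ignore duplicates.
        conn.execute("INSERT OR IGNORE INTO compounds VALUES (?, ?, ?, ?)",
                     (sid, smiles, key, int(cfreq)))
        conn.execute("INSERT OR IGNORE INTO documents VALUES (?, ?)",
                     (pid, pdate))
        conn.execute("INSERT INTO doc_to_compound VALUES (?, ?, ?, ?)",
                     (pid, sid, int(field), int(freq)))
    conn.commit()
    return conn


# Made-up rows in the assumed column order.
rows = [
    ("SCHEMBL1", "CCO", "KEY-A", "5", "US-1-A1", "2014-01-01", "1", "3"),
    ("SCHEMBL1", "CCO", "KEY-A", "5", "US-1-A1", "2014-01-01", "2", "1"),
    ("SCHEMBL2", "CCN", "KEY-B", "7", "EP-2-A1", "2013-05-05", "1", "2"),
]
conn = build_db(rows)
patents = [r[0] for r in conn.execute(
    "SELECT DISTINCT patent_id FROM doc_to_compound "
    "WHERE schembl_id = ? ORDER BY patent_id", ("SCHEMBL1",))]
```

For the full file you would want a disk-backed database, batched inserts and indexes on the join columns; the sketch only shows how the flat rows split across the three tables.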

For example, to find the patents the drug palbociclib has been extracted from:
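
A minimal Python sketch of such a lookup, as an alternative to grep. It assumes tab-separated lines with the InChI key in the third column and the patent ID in the fifth (check the README), and it takes the InChI key as an argument, so look up the key of your compound first. The file name in the comment and the sample lines are hypothetical.

```python
def patents_for_inchi_key(lines, inchi_key):
    """Return the sorted, de-duplicated patent IDs whose rows carry
    the given InChI key."""
    patents = set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Assumed layout: fields[2] is the InChI key, fields[4] the patent ID.
        if len(fields) > 4 and fields[2] == inchi_key:
            patents.add(fields[4])
    return sorted(patents)


# With the real file (hypothetical name) this would be something like:
#   with open("map_file.txt") as handle:
#       hits = patents_for_inchi_key(handle, "<InChI key of your compound>")
sample_lines = [
    "SCHEMBL1\tCCO\tKEY-A\t5\tUS-1-A1\t2014-01-01\t1\t3\n",
    "SCHEMBL1\tCCO\tKEY-A\t5\tUS-1-A1\t2014-01-01\t2\t1\n",
    "SCHEMBL1\tCCO\tKEY-A\t5\tEP-2-A1\t2013-05-05\t1\t1\n",
    "SCHEMBL2\tCCN\tKEY-B\t7\tWO-3-A1\t2012-02-02\t1\t2\n",
]
hits = patents_for_inchi_key(sample_lines, "KEY-A")
```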

Any plans to update this map file?  

New patents and chemistry arrive and are stored to SureChEMBL every day. We are planning to release new versions and incremental updates of the map file every quarter, in sync with the update of the compound dump files.

I couldn’t find my compound / patent - this compound should not be there

Don’t forget this is an automated, live, high-throughput text-mining effort against an inherently noisy corpus such as patents. We are constantly working on improving data quality. If you find anything strange, let us know.

Can I join more metadata, such as patent assignee and title?

Obviously your first port of call would be the SureChEMBL website for patent metadata, but other services you may wish to use include the EPO web services for programmatic access.

Is there anything else?

Errr, yes. Watch this space for another post on storing and accessing live SureChEMBL data, behind your firewall. 

The SureChEMBL Team

Thursday, 12 March 2015

Beaker now officially part of ChEMBL web services

We have mentioned Beaker (a.k.a the ChEMBL cheminformatics utility web service), several times on the blog (here, here and here), but have not devoted an entire post to Beaker. Well, here it is.

Beaker - what's this?

It's a small utility that makes chemistry software available securely over https. You no longer need to install a chemical toolkit in order to convert your molfile to SMILES or calculate descriptors. If you have an internet connection (if you can read this, chances are you do), you can use Beaker. We recommend you head over to the interactive online documentation to see the full list of functionality it offers and try it with your own data.

Which toolkits are used by Beaker?

Under-the-hood Beaker is exposing the functionality of the RDKit cheminformatics library. Beaker's optical structure recognition methods use the OSRA library.


Do I need an API Key?

As long as you are making no more than 1 request per second, you do not need an API key. Beaker provides a standard set of response headers to inform you about rate limiting:

There is also one custom header:

This lets you know how you have been authenticated. The default authentication is IP-based, which means that if any other person uses Beaker from the same IP, it will affect your rate limit. This is why having your own API key can be useful - no one can 'steal' your rate limit and it will be slightly higher than default as well. If you need a key, just write to us.
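
To stay under the limit of one request per second without a key, a client can simply space out its calls. The `Throttle` class below is a hypothetical helper of our own, not part of Beaker or the client library; the clock and sleep functions are injectable only so that the behaviour can be checked without actually waiting.

```python
import time


class Throttle:
    """Block so that consecutive calls are at least min_interval seconds apart."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Demonstration with a fake clock, so nothing really sleeps.
fake_now = [0.0]
slept = []


def fake_sleep(seconds):
    slept.append(seconds)
    fake_now[0] += seconds


throttle = Throttle(1.0, clock=lambda: fake_now[0], sleep=fake_sleep)
throttle.wait()       # first call goes straight through
fake_now[0] += 0.25   # pretend the request took a quarter of a second
throttle.wait()       # waits for the rest of the second
fake_now[0] += 2.0    # a slow request
throttle.wait()       # enough time has passed, no waiting
```

In real use you would create `Throttle()` with the defaults and call `wait()` before each request.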

I tried it and it doesn't work...

Before contacting us and submitting a bug report, please try a few things first:

1. Submit data via the interactive online documentation. If it works there, see point 2.
2. Check data encoding. Unlike the ChEMBL data web services (where you should use percent encoding as described in the previous blog post), if you are accessing Beaker via GET, then all data provided should be base64 encoded. This is why, if you want to use GET to convert the SMILES 'CCC' to a molfile, a link containing the raw string won't work.
'CCC' has to be base64 encoded first and
base64('CCC') == 'Q0ND',
so the valid link must carry 'Q0ND' instead. Our online documentation will do the encoding for you and show which URL was really executed:

3. Use POST where possible. GET requests are nice, because everything gets included in the URL, so you can embed such a URL in a blog post, like we just did. One issue with GET is that there is often a maximum number of characters you can send, although this does depend on the server setup. If you would like to use Beaker from the ChEMBL servers, for example, your link can't exceed 4000 characters. Base64 encoding will make any parameter about 1/3 longer. So, for example, if you would like to send an image in order to perform optical structure recognition (OSRA), it's very hard to find a valid, good quality image that is less than 1.2 Kb in size, so in that case using GET is not a good idea. Also, do not forget you can use curl to submit your POST requests. Below we provide some examples of how to access Beaker via POST with curl:

4. If using GET, check which type of base64 you are using. Standard implementations of base64 use the following characters: A-Z, a-z, 0-9, '+' and '/'.
 The last two signs ('+' and '/') are not url-safe as they have special meaning in URLs. This is why Beaker uses the url-safe version, which substitutes '-' for '+' and '_' for '/' in the standard base64 alphabet. For reference, please click here.
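
For reference, Python's standard library already provides the url-safe variant, so preparing a GET parameter can be sketched as below. The helper names are ours, and how the encoded value is placed in the Beaker URL is not shown here; see the online documentation for that.

```python
import base64


def urlsafe_b64(text):
    """Encode text with the url-safe base64 alphabet ('-' and '_')."""
    return base64.urlsafe_b64encode(text.encode("utf-8")).decode("ascii")


def encoded_length(n_bytes):
    """Length of the padded base64 encoding of n_bytes of input."""
    return 4 * ((n_bytes + 2) // 3)


ccc = urlsafe_b64("CCC")  # 'Q0ND'

# The two alphabets only differ where '+' or '/' would occur.
standard = base64.b64encode(b"\xfb\xff").decode("ascii")          # '+/8='
url_safe = base64.urlsafe_b64encode(b"\xfb\xff").decode("ascii")  # '-_8='

# Base64 turns every 3 input bytes into 4 characters, hence "about 1/3 longer".
limit_payload = encoded_length(3000)  # 4000
```

`encoded_length(3000)` gives 4000, which is why binary payloads such as images quickly run into the URL length limit when sent via GET.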

Does ChEMBL python client library work with Beaker?

Yes, and what's more, it adds enough syntactic sugar to make it feel like you are using a locally installed chemical toolkit. For example, look how easy it is to compute the maximum common substructure of three compounds, given as SMILES strings:

You can install the python client library by using 'pip install chembl_webresource_client' or download it here. Expect more examples in a future blog post.

Does it work without the client?

Of course it does. You have already seen an example of how to use Beaker from JavaScript (using the online documentation) and python (using the client library). But because curl is a very common tool, available on many platforms, you can execute calls to Beaker from your command line in bash. Bash has a very cool feature called pipes, which lets you chain the output of one command to the input of another. This way you can mix calls to our data web services with Beaker calls. As an example, let's assume that you have a photo of a compound. This could be a scan of a paper document, such as a patent, or a photo of a conference poster taken using your mobile phone, but it has to have decent resolution and quality:

Original image available here

If we would like to find the compound in ChEMBL that is most similar to the one recognized from the image above, we could use this line of bash script:

The script may look a bit hackish, but this is because we wanted to use only standard command line tools that can be found on OSX and Linux systems. In production, we would never use grep, sed and awk to parse JSON because this is bad (instead we encourage you to try jq), but we wanted to show a nice example of using pipes to combine different tools. Anyway, the end result of running this command will be to open the following page in your browser:

Is Beaker open source software, can I see the code?

Yes, it's hosted on GitHub under an Apache 2.0 license and the latest stable version is always registered in PyPI.

This also means that you can deploy your own local Beaker version. Reasons why you might like to do this include:

1. You don't want to rely on availability of ChEMBL web services or care about rate limiting.
2. You don't want to send proprietary compounds to a public service.
3. You would prefer to install your own chemical toolkit (on only 1 machine), and access its services over http(s).

How do I cite Beaker?

Please use the following publication:

If you have any questions about Beaker or any other ChEMBL services, please let us know.

Friday, 6 March 2015

GET and SMILES interference

When reviewing our old web services documentation, you will notice that some methods can be accessed by both GET and POST. One thing these methods have in common is that they all accept SMILES as a search parameter. Why? Well, it turns out there are some SMILES that can not be handled via GET when using the old web services. Take this SMILES as an example: [Na+].CO[C@@H](CCC#C\C=C/CCCC(C)CCCCC=C)C(=O)[O-] and you will see that you can't construct a valid URL from it using the old web services URL as a base. To 'GET' around this issue, you will need to use POST. This is a bit sad, as it means you can not put a link to such a compound on your blog, ask about it on a chemistry forum or send it to your friend via Skype :( What's more, if you would like to use POST for a non-SMILES method (e.g. get all assays by ChEMBL ID), you would also be out of luck.

When reviewing the documentation of the new web services (as we asked you in the previous blog post), you will probably notice that none of the methods mention POST support. What does it mean? First of all, it means that you can achieve everything using GET. You can easily retrieve data for the SMILES string above using GET, you just have to be careful. Placing SMILES strings in URLs can be tricky. This is because they often contain characters that are not allowed in URLs and should be encoded, according to the URI standard. Encoded how? The standard mechanism to encode URLs is called percent-encoding and it is widely accepted by modern web browsers and other web tools, such as curl or wget.

In fact, some browsers hide the percent-encoding from users, as we will see soon. So the percent-encoded URL of our molecule looks like this:

Not very readable... But try to open this link in Firefox and in the address bar you will see this: [Na+].CO[C@@H](CCC%23C\C=C/CCCC(C)CCCCC=C)C(=O)[O-]

Much better, only one sign, #, is encoded! What if you try to open the second link in a browser? In Firefox this should work. All other popular browsers will return 404 - not found, and curl will complain:

So what to do?

1. These characters always have to be encoded:
  • % change to %25, because percent is the special 'escape' character in percent-encoding.
  • # change to %23, because an unencoded hash sign marks the start of the fragment identifier part of a URL, which is only used in the browser and is not sent to the server.
  • \ change to %5C, because all browsers apart from Firefox change an unencoded backslash to a forward slash, as explained here.
2. This character must not be encoded:
  • Forward slash / can not be encoded as %2F because our Apache server configuration will return a 404 for URLs containing %2F. The security reasons are described in the Apache documentation. This is particularly annoying, as it means that you can not use online percent-encoders if your SMILES contains forward slashes.
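
The rules above can be applied programmatically. Here is a minimal Python sketch using only the standard library; the function names are ours, and building the final URL (base plus encoded SMILES) is left out.

```python
from urllib.parse import quote, unquote


def encode_minimal(smiles):
    """Encode only the three characters that always need it.

    '%' must go first, otherwise the '%' of the freshly added
    %23 and %5C escapes would be encoded a second time.
    """
    return (smiles.replace("%", "%25")
                  .replace("#", "%23")
                  .replace("\\", "%5C"))


def encode_full(smiles):
    """Percent-encode every reserved character except '/',
    which has to stay as it is."""
    return quote(smiles, safe="/")


smiles = r"[Na+].CO[C@@H](CCC#C\C=C/CCCC(C)CCCCC=C)C(=O)[O-]"
minimal = encode_minimal(smiles)
full = encode_full(smiles)

# Two-digit ring closures such as %10 are why '%' itself needs encoding.
ring = encode_minimal("C%10CCCCC%10")
```

`encode_minimal` matches what you would type by hand for Firefox or curl, while `encode_full` gives a longer URL that does not rely on the browser encoding the remaining characters for you.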


How to make your life with SMILES simpler?


1. Use Firefox - this browser does a great job when it comes to encoding. FF is unable to guess in which context you are using % and #, so you will still need to encode those two characters before pasting the URL into the address bar, but the rest of the special characters will be encoded correctly. You can then copy the URL from Firefox to other browsers and it will just work.

2. Use cURL - just as with Firefox, you have to encode % and #, but the rest will be handled properly. You just have to use the '-g' flag, which switches off the URL globbing parser:


How long?

You might ask yourself, will I get a valid response from the server, when I search with a really long SMILES? We can find out by first using our new web service filters to obtain the SMILES of the biggest compound in ChEMBL, the URL will be:

And SMILES of the first molecule is:


This is huge! But, more importantly, the SMILES looks like it's already url-encoded, as it contains %23 and %29, which will be interpreted as reserved characters and translated to # and ), when  pasted into a browser address bar. To prevent this from happening, we must first percent-encode the percent sign, to ensure they are interpreted literally. Percent is encoded as %25 and after conversion our URL will look like this (this URL contain many backslashes so it will only work in Firefox):[NH2+]OC(CO)C(O)C(OC1OC(CO)C(O)C(O)C1O)C(O)CO.CCCCCCCCCCCCCCCC[NH2+]OC(CO)C(O)C(OC2OC(CO)C(O)C(O)C2O)C(O)CO.CCCCCCCCCCCCCCCC[NH2+]OC(CO)C(O)C(OC3OC(CO)C(O)C(O)C3O)C(O)CO.CCCCCCCCCCCCCCCC[NH2+]OC(CO)C(O)C(OC4OC(CO)C(O)C(O)C4O)C(O)CO.CCCCCCCCCCCCCCCC[NH2+]OC(CO)C(O)C(OC5OC(CO)C(O)C(O)C5O)C(O)CO.CCCCCCCCCCCCCCCC[NH2+]OC(CO)C(O)C(OC6OC(CO)C(O)C(O)C6O)C(O)CO.CCCCCCCCCCCCCCCC[NH2+]OC(CO)C(O)C(OC7OC(CO)C(O)C(O)C7O)C(O)CO.CCCCCCCCCCCCCCCC[NH2+]OC(CO)C(O)C(OC8OC(CO)C(O)C(O)C8O)C(O)CO.CCCCCCCCCCCCCCCC[NH2+]OC(CO)C(O)C(OC9OC(CO)C(O)C(O)C9O)C(O)CO.CCCCCCCCCCCCCCCC[NH2+]OC(CO)C(O)C(OC%2510OC(CO)C(O)C(O)C%2510O)C(O)CO.CCCCCCCCCCCCCCCC[NH2+]OC(CO)C(O)C(OC%2511OC(CO)C(O)C(O)C%2511O)C(O)CO.CCCCCCCCCCCCCCCC[NH2+]OC(CO)C(O)C(OC%2512OC(CO)C(O)C(O)C%2512O)C(O)CO.CCCCCCCCCC(C(=O)NCCc%2513ccc(OP(=S)(Oc%2514ccc(CCNC(=O)C(CCCCCCCCC)P(=O)(O)[O-])cc%2514)N(C)\N=C\c%2515ccc(Op%2516(Oc%2517ccc(\C=N\N(C)P(=S)(Oc%2518ccc(CCNC(=O)C(CCCCCCCCC)P(=O)(O)[O-])cc%2518)Oc%2519ccc(CCNC(=O)C(CCCCCCCCC)P(=O)(O)[O-])cc%2519)cc%2517)np(Oc%2520ccc(\C=N\N(C)P(=S)(Oc%2521ccc(CCNC(=O)C(CCCCCCCCC)P(=O)(O)[O-])cc%2521)Oc%2522ccc(CCNC(=O)C(CCCCCCCCC)P(=O)(O)[O-])cc%2522)cc%2520)(Oc%2523ccc(\C=N\N(C)P(=S)(Oc%2524ccc(CCNC(=O)C(CCCCCCCCC)P(=O)(O)[O-])cc%2524)Oc%2525ccc(CCNC(=O)C(CCCCCCCCC)P(=O)(O)[O-])cc%2525)cc%2523)np(Oc%2526ccc(\C=N\N(C)P(=S)(Oc%2527ccc(CCNC(=O)C(CCCCCCCCC)P(=O)(O)[O-])cc%2527)Oc%2528ccc(CCNC(=O)C(CCCCCCCCC)P(=O)(O)[O-])cc%2528)cc%2526)(Oc%2529ccc(\C=N\N(C)P(=S)(Oc%2530ccc(CCNC(=O)C(CCCCCCCCC)P(=O)(O)[O-])cc%2530)O
c%2531ccc(CCNC(=O)C(CCCCCCCCC)P(=O)(O)[O-])cc%2531)cc%2529)n%2516)cc%2515)cc%2513)P(=O)(O)[O-]

It should be noted that the URL is 1666 characters long, which is below the maximum allowed URL length in our Apache setup (4000 if you were wondering). Even if we encode all the special characters, we will end up with the following URL (this time valid in all modern browsers):

which is 2478 characters long and still below the 4000 limit. This means we can use GET to retrieve every ChEMBL compound based on its SMILES string.


What about POST?

OK, so SMILES can be passed via GET, but can POST still be used? Yes! For every single method you can use POST as well. This is important, as it is possible to create very long URLs by chaining together multiple filters and other parameters, such as pagination and ordering. This is why all new web services methods (even those which don't expect any SMILES) can be executed using POST. In fact, our updated Python client (blog post coming soon!) is using POST for almost everything. Just keep in mind that in order to use POST to retrieve data, you have to add the X-HTTP-Method-Override:GET header.
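
As an illustration of the override header, here is a minimal Python sketch that builds (but does not send) such a request with the standard library. The URL and the parameter name are placeholders, not real ChEMBL endpoints; check the web services documentation for the actual ones.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_post_query(url, params):
    """Build a POST request that the server should treat as a GET."""
    body = urlencode(params).encode("ascii")  # form-encoded request body
    return Request(
        url,
        data=body,  # the presence of a body makes urllib use POST
        headers={"X-HTTP-Method-Override": "GET"},
    )


# Placeholder URL and parameter, for illustration only.
request = build_post_query("https://example.org/api/molecule", {"limit": 5})

# To actually send it:
#   from urllib.request import urlopen
#   with urlopen(request) as response:
#       payload = response.read()
```

Since the parameters travel in the request body, the URL length limit no longer applies to them.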

If the earlier discussion around SMILES encoding all seems a bit too complicated (hence the xkcd comic strip) or just too much hassle, you can always use POST. But remember you won't be able to share SMILES links in your blog posts or Skype conversations :(


Is using POST to retrieve data 'breaking the web'?

There was a recent discussion about this topic after the Dropbox team announced (just like us) that they see some limitations in GET compared to POST and will start to allow POST methods to retrieve data (original article). There were some critical opinions about allowing POST access to data, stating that this is poor API design. This in turn triggered a long discussion on Hacker News, from which one comment is particularly important:
"I really like how the Google Translate API handles this issue.  The actual HTTP method can be POST, but the intended HTTP method must always be GET (using the "X-HTTP-Method-Override" header)." 

And this is exactly what we are doing, and we also believe this is the right way to use POST in order to allow retrieval of data from a RESTful web interface.

The ChEMBL Team

Tuesday, 3 March 2015

SureChEMBL Webinar

As many of you know, SureChEMBL is one of our most recent resources, which taps into the wealth of knowledge hidden in the patent documents. More specifically, SureChEMBL extracts and indexes chemistry from the full-text patent corpus (EPO, WIPO and USPTO) by means of automated text- and image-mining, tirelessly, on a daily basis.

If you would like to learn more about SureChEMBL, its applications, exciting recent developments and future plans, we'll be giving a free webinar on Wednesday 11 March at 4pm GMT.

Remember this is one of the times of year where daylight savings times may not be in sync, so check what time 4pm GMT is for your local timezone, for example, the time difference to Boston, MA at the moment is only 4 hours compared to the regular 5 hours.

Please send us an email here to register your interest.