ChEMBL Resources


Friday, 20 November 2015

SureChEMBL: A New Hope


SureChEMBL has disrupted the field of patent chemistry by liberating chemical structures and knowledge locked in text and images, and by making the compound-patent associations freely and fully searchable and accessible on a daily basis to everyone: academics, IP professionals, content providers, software vendors, biotechs, small and big pharma, and related chemical industries. The speed, scale and scope of the data are unprecedented for a public resource.

SureChEMBL has been around for less than two years; during this time, it has evolved into a full-blown chemistry resource provided by the EMBL-EBI: the SureChEMBL interface was revamped and released last year, including combined keyword and structure-based queries against the annotated patent corpus. All chemistry is integrated with UniChem and there are several ways to access the data in bulk, including flat files and a data client. Very soon, the data will be fully integrated and available via the Open PHACTS web service API, including, for the first time, gene and disease annotations from patents, in addition to the chemistry ones.

So we're very happy now that another milestone has been reached: the official NAR publication for SureChEMBL is available in the usual Open Access format.

Here's the abstract:

SureChEMBL is a publicly available large-scale resource containing compounds extracted from the full text, images and attachments of patent documents. The data are extracted from the patent literature according to an automated text and image-mining pipeline on a daily basis. SureChEMBL provides access to a previously unavailable, open and timely set of annotated compound-patent associations, complemented with sophisticated combined structure and keyword-based search capabilities against the compound repository and patent document corpus; given the wealth of knowledge hidden in patent documents, analysis of SureChEMBL data has immediate applications in drug discovery, medicinal chemistry and other commercial areas of chemical science. Currently, the database contains 17 million compounds extracted from 14 million patent documents.

%A Papadatos, George
%A Davies, Mark
%A Dedman, Nathan
%A Chambers, Jon
%A Gaulton, Anna
%A Siddle, James
%A Koks, Richard
%A Irvine, Sean A.
%A Pettersson, Joe
%A Goncharoff, Nicko
%A Hersey, Anne
%A Overington, John P.
%T SureChEMBL: a large-scale, chemically annotated patent document database
%0 Journal Article
%D 2015 
%J Nucleic Acids Research 
%R 10.1093/nar/gkv1253 


Tuesday, 27 October 2015

Advanced keyword and structure searches with SureChEMBL

Previously in the SureChEMBL series, we described how to access SureChEMBL data in bulk, offline and locally. So, you may ask, what is the point in using the SureChEMBL web interface? Well, how about the unprecedented functionality that allows you to submit very granular queries by combining: i) Lucene fields against full-text and bibliographic metadata and ii) advanced structure query features against the annotated compound corpus - at the same time?

Let’s see each one separately first:

Lucene-powered keyword searching

You may use the main text box for simple keyword-based patent searches, such as ‘Apple’, ‘diabetes’ or even 'chocolate cake' (the patent corpus as a recipe book is a new use-case here). You will get a lot of results and probably a lot of noise. With Lucene fields, you can slice and dice a query by indicating specific patent sections and bibliographic metadata, such as date/year of filing or publication, assignee, patent classification code, patent authority, etc. For example, to search for the term ‘diabetes’ only in the abstract of patents, you can search with:

ab:diabetes

where ab is the Lucene query field for abstract. For a full list of Lucene queries, see here. Furthermore, you can combine these fields with boolean operators (AND, OR, NOT - always in UPPER case) and brackets. For example, to find US patents published in 2014 which also mention the word ‘diabetes’ in the title or abstract, you could search with:

(ttl:diabetes OR ab:diabetes) AND pdyear:2014 AND pnctry:US

or even limit it to more med-chem relevant patent hits by using the appropriate IPC hierarchical classification codes:

(ttl:diabetes OR ab:diabetes) AND ic:(C07D AND (A61K OR A61P)) AND pdyear:2014 AND pnctry:US

Is that all? No, you could also use wildcards, such as * and ?, as well as proximity searches:

(ttl:diabet* OR ab:diabet*) AND pdyear:2014 AND pnctry:US
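
If you build such queries often, it may help to assemble them programmatically rather than by hand. Below is a minimal Python sketch that only produces the query string to paste into the search box; the helper names (field, any_of, all_of) are hypothetical and are not part of SureChEMBL. The field names (ttl, ab, pdyear, pnctry) are the ones used in the examples above.

```python
# Minimal sketch: compose Lucene-style fielded queries as plain strings.
# The helper names below are made up for illustration only.

def field(name, term):
    """Return one fielded term, e.g. ab:diabetes.

    Terms containing a space are wrapped in double quotes so that
    they are treated as a phrase.
    """
    if " " in term:
        term = '"' + term + '"'
    return f"{name}:{term}"


def any_of(*clauses):
    """Join clauses with OR (upper case) and wrap them in brackets."""
    return "(" + " OR ".join(clauses) + ")"


def all_of(*clauses):
    """Join clauses with AND (upper case)."""
    return " AND ".join(clauses)


# US patents published in 2014 with a diabet* term in the title or abstract.
query = all_of(
    any_of(field("ttl", "diabet*"), field("ab", "diabet*")),
    field("pdyear", "2014"),
    field("pnctry", "US"),
)
print(query)
```

Keeping the boolean operators inside the helpers means they are always emitted in upper case, which is what the query syntax expects.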

A couple of things worth pointing out here:
1) In the way described above, you may search not only the chemically-annotated (EP, US, WO, JP patents) or chemically-relevant corpus but any patent within SureChEMBL’s broad coverage, such as French, German, British, Chinese, Australian, Canadian, etc., patents about any topic:

pa:"Apple Inc" AND ab:vehicle AND pnctry:CN

For such cases, just remember to check the 'All authorities' box on the right-hand side panel.
2) If the Lucene query syntax seems too complicated, almost the same functionality is available via a more user-friendly field-based widget called Fielded Search:

ChemAxon-powered structure searching

To begin with, SureChEMBL provides basic substructure and similarity searches against the currently 17 million chemical structures, powered by ChemAxon’s JChem technology. Some of you may have noticed that we have recently done some refurbishment around the sketchers and we now provide the latest MarvinJS sketcher as the sole source of structure input. We also removed the manual entry box, as it is superseded by functionality described below. Behind the scenes, we use the native ChemAxon inter-conversion functionality to ensure maximum compatibility and minimum information loss during structure conversions. The good news is that you can input a structure in several ways (besides sketching it from scratch), e.g. SMILES, SMARTS, CML, InChI, Molfile and IUPAC/trivial name. Just click and paste your string on the MarvinJS sketcher or open the import dialogue to paste it right there - or even upload a file. More importantly, you may now take advantage of more advanced query features, such as (NOT) atom and bond lists, explicit hydrogens, as well as the Markush-friendly position variation and repetition ranges.

For example, this is a query that combines atom, not atom, and bond lists, as well as explicit hydrogens to control substitution:

Or this one, which combines position variation and linker repetition range:

Again, don't forget that you have additional control over the MW range of the search hits, as well as their exact location in the patent document (title, abstract, claims, description, images/molfiles).

Combined keyword and structure searching

Finally, as mentioned in the beginning, you can easily submit combined keyword and structure queries, such as this one:

To our knowledge, there's no other freely available patent searching resource or interface out there providing this type of functionality, but we're happy to stand corrected...

As usual, for any questions or feedback, drop us a line.

George and Nathan

Monday, 19 October 2015

Is ChEMBL down or is it just me?

Have you ever wondered whether your favorite resource of bioactive molecule data is down, or whether there is some temporary network issue that makes it unavailable from your end? There are many online tools that can help in such cases (for example or similar websites). We, however, now provide a much better solution: the ChEMBL status page:

As you may notice, the status page is hosted on GitHub, so it is outside of the EBI infrastructure. This means that even when ChEMBL core websites are down, you should still be able to see the status page (assuming that GitHub is online, which is a quite reasonable assumption, despite occasional incidents). We've placed a link to the status page at the bottom of the left-side navigation menu on the main ChEMBL web page, as it provides some useful information even when everything is fine.

The status page presents information about the health of ChEMBL's most critical resources (main web interface, REST API, ADME SARfari, SureChEMBL, UniChem and more) along with cumulative availability data grouped by time (from last day, week, month, year and all time). As you can see from the data presented on the status page, we have a pretty impressive availability rate: more than 99% for every monitored resource!

For those of you interested in the technical details - we use a service called Uptime Robot to collect availability data. Uptime Robot allows you to define up to 50 monitors (each monitoring a single URL) - for free. It also provides an API to retrieve the collected data and present/share it online without having to visit the Uptime Robot webpage.
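
To make the "availability grouped by time" idea concrete, here is a rough Python sketch of how such a percentage could be derived from a list of outage intervals. It does not call the Uptime Robot API; the availability function is a hypothetical helper and the outage timestamps are invented purely for illustration.

```python
from datetime import datetime, timedelta


def availability(outages, window_start, window_end):
    """Return the percentage of the window not covered by outages.

    outages is a list of (start, end) datetime pairs, assumed not to
    overlap each other. Outages are clipped to the window first.
    """
    total = (window_end - window_start).total_seconds()
    down = 0.0
    for start, end in outages:
        clipped_start = max(start, window_start)
        clipped_end = min(end, window_end)
        if clipped_end > clipped_start:
            down += (clipped_end - clipped_start).total_seconds()
    return 100.0 * (total - down) / total


# Invented example data: a 30-minute and a 2-hour outage.
now = datetime(2015, 10, 19, 12, 0)
outages = [
    (datetime(2015, 10, 18, 3, 0), datetime(2015, 10, 18, 3, 30)),
    (datetime(2015, 9, 25, 10, 0), datetime(2015, 9, 25, 12, 0)),
]

# Report availability over a few trailing windows, as a dashboard might.
windows = {
    "day": timedelta(days=1),
    "week": timedelta(days=7),
    "month": timedelta(days=30),
}
for label, span in windows.items():
    print(label, round(availability(outages, now - span, now), 2))
```

A monitoring service works from periodic checks rather than exact outage boundaries, so real figures are approximations of this calculation.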

There is a nice open source JavaScript widget called Upscuits, which provides a clear overview of the data collected by Uptime Robot. Since the widget is written in JavaScript, it can be hosted in any environment that serves static pages. The ChEMBL team uses GitHub for hosting our open source repositories anyway, so GitHub Pages was an obvious choice.

We have been using the ChEMBL GitHub Organisation page for quite some time for mirror posts from this blog (we use Jekyll to do this), so creating another simple website with a status dashboard provided by Upscuits/Uptime Robot was a breeze. We hope the new page will help diagnose any availability issues that may occur.

Tuesday, 13 October 2015

Paper: Managing expectations: assessment of chemistry databases generated by automated extraction of chemical structures from patents

Our collaborators in GSK have just published an Open Access paper in the Journal of Cheminformatics. It is a comparative study of the quality of chemistry extraction from patent documents and includes patent chemistry sources derived by automated text-mining, such as SureChEMBL and the IBM/NIH data set. Among other things, the paper provides a useful detailed overview of SureChEMBL's chemistry annotation specifications.

While conducting this study, we realised that this task is far from trivial for several reasons: 
  • The patent corpus is inherently noisy, ambiguous and error-rich.
  • There are diverse use cases and accuracy expectations when it comes to chemistry extracted from patents.
  • Not all the chemistry found in a patent document is of equal importance.
  • Compound standardisation variants such as stereoisomers, tautomers, salts and mixtures are always an issue.
  • There is a distinct lack of an open Gold Standard when it comes to standardised chemistry extracted from relevant full text patent documents. Recently, there have been several attempts towards text-mining standards provided by BioCreative and publications such as this one, which offer position and type of chemical named entities but not converted structures.  
  • The commercial patent chemistry vendors do not disclose their extraction specifications, which makes any comparisons even harder.

Here is the background:

First public disclosure of new chemical entities often takes place in patents, which makes them an important source of information. However, with an ever increasing number of patent applications, manual processing and curation on such a large scale becomes even more challenging. An alternative approach better suited for this large corpus of documents is the automated extraction of chemical structures. A number of patent chemistry databases generated by using the latter approach are now available but little is known that can help to manage expectations when using them. This study aims to address this by comparing two such freely available sources, SureChEMBL and IBM SIIP (IBM Strategic Intellectual Property Insight Platform), with manually curated commercial databases.

%T Managing expectations: assessment of chemistry databases generated by automated extraction of chemical structures from patents
%A S. Senger
%A L. Bartek
%A G. Papadatos
%A A. Gaulton

%J Journal of Cheminformatics
%D 2015
%O doi:10.1186/s13321-015-0097-z

George and Anna

Wednesday, 30 September 2015

Wanted - Web Developer!

Just a reminder that we are currently looking for a Web Application Developer to join the ChEMBL team at EMBL-EBI. The closing date for this vacancy is 4th Oct, so hurry and apply!

The role is primarily to develop a series of web-based applications and interfaces for the ChEMBL chemogenomic resources. The role also involves the development, maintenance and documentation of these tools, and supporting their usage within the EBI and externally. It will also involve some requirement gathering and use-case development.

Experience of Python and JavaScript is required, as is experience of working with web frameworks such as Django. A sound knowledge of relational databases (primarily Oracle), SQL, PL/SQL, REST and the HTTP protocol is also a requirement. Experience of contributing to open software projects and documenting them, on GitHub for example, is desirable. Applicants should have a good understanding of best practice in software engineering and rapid development cycle work, have developed user-friendly web interfaces, and have experience in good code documentation.

Full job description is available here: 

Monday, 28 September 2015

Blast from the past - 1000th blog post!

To celebrate the 1000th post, we've decided to take a journey back in time. So, what you see above is a timeline* showing the most important blog posts published on the ChEMBL blog. The posts delineate major events and milestones in the group’s 7-year history and highlight the contributions and impact to the community. Posts on ChEMBL updates, publications, innovative software applications and popular resources are all included there.

We hope you will enjoy skimming through it as much as we did. If you have any favourite blog post published here, let us know in the comments.

Just remember that this journey continues; here’s to the next chiliad of exciting blog posts!

The ChEMBL team.

* The timeline was prepared using the excellent timelineJS library by

Tuesday, 22 September 2015

KNIME chemoinformatics meetup at the EBI

We’re co-organising a KNIME chemoinformatics workshop at the EBI on Monday 5th October. This is a regular meeting that takes place the day before the biannual UK-QSAR meeting.

There will be informal discussions on the current and future state of the KNIME chemoinformatics nodes, along with updates by the community and the KNIME guys. There will also be talks on the integration of KNIME with the ChEMBL resources and the Open PHACTS platform.  

More details and agenda here; to register, fill in your details here.



Monday, 24 August 2015

Online Resources

We would like to announce the recent release of two online training resources for ChEMBL and UniChem.

For ChEMBL, we have developed ‘ChEMBL: Exploring bioactive drug-like molecules’, which will walk you through how to use the interface, step-by-step. It tackles topics such as target searching, compound searching, web services and data downloads. The course also gives you a chance to test your knowledge throughout.

Additionally, Jon Chambers has created the 'UniChem: Quick Tour' course. This course will give users a basic understanding of UniChem and the benefits it can bring to navigating small molecule resources. It will also walk you through how to conduct simple searches using UniChem and the UniChem Connectivity Search feature.

I'd also like to remind you that we store recordings of our past webinars, in case you missed them. You can access these anytime and they can be found here:

If you have any questions or queries about anything mentioned here, please do not hesitate to contact