Showing posts from October, 2015

Advanced keyword and structure searches with SureChEMBL

Previously in the SureChEMBL series, we described how to access SureChEMBL data in bulk, offline and locally. So, you may ask, what is the point in using the SureChEMBL web interface? Well, how about the unprecedented functionality that allows you to submit very granular queries by combining: i) Lucene fields against full-text and bibliographic metadata and ii) advanced structure query features against the annotated compound corpus - at the same time? Let’s see each one separately first. Lucene-powered keyword searching: You may use the main text box for simple keyword-based patent searches, such as ‘Apple’, ‘diabetes’ or even ‘chocolate cake’ (the patent corpus as a recipe book is a new use-case here). You will get a lot of results and probably a lot of noise. With Lucene fields, you can slice and dice a query by indicating specific patent sections and bibliographic metadata, such as date/year of filing or publication, assignee, patent classification code,

Is ChEMBL down or is it just me?

Have you ever wondered whether your favorite resource of bioactive molecule data is down, or whether some temporary network issue is making it unavailable from your end? There are many online tools that can help in such cases (for example or similar websites). We, however, now provide a much better solution: the ChEMBL status page. As you may notice, the status page is hosted on GitHub, so it is outside of the EBI infrastructure. This means that even when the core ChEMBL websites are down, you should still be able to see the status page (assuming that GitHub is online, which is quite a reasonable assumption, despite occasional incidents). We've placed a link to the status page at the bottom of the left-side navigation menu on the main ChEMBL web page, as it provides some useful information even when everything is fine. The status page presents information about the health of ChEMBL's most critical

Paper: Managing expectations: assessment of chemistry databases generated by automated extraction of chemical structures from patents

Our collaborators at GSK have just published an Open Access paper in the Journal of Cheminformatics. It is a comparative study of the quality of chemistry extraction from patent documents and includes patent chemistry sources derived by automated text-mining, such as SureChEMBL and the IBM/NIH data set. Among other things, the paper provides a useful, detailed overview of SureChEMBL's chemistry annotation specifications. While conducting this study, we realised that this task is far from trivial for several reasons: The patent corpus is inherently noisy, ambiguous and error-rich. There are diverse use cases and accuracy expectations when it comes to chemistry extracted from patents. Not all the chemistry found in a patent document is of equal importance. Compound standardisation variants such as stereoisomers, tautomers, salts and mixtures are always an issue. There is a distinct lack of an open Gold Standard when it comes to standardised chemistry extracted fro