ChEMBL Resources


Friday, 20 November 2015

SureChEMBL: A New Hope


SureChEMBL has disrupted the field of patent chemistry by liberating chemical structures and knowledge locked in text and images, and by making the compound-patent associations freely and fully searchable and accessible on a daily basis to everyone: academics, IP professionals, content providers, software vendors, biotechs, small and big pharma, and related chemical industries. The speed, scale and scope of the data are unprecedented for a public resource.

SureChEMBL has been around for less than two years. During this time, it has evolved into a full-blown chemistry resource provided by EMBL-EBI. The SureChEMBL interface was revamped and released last year, with support for combined keyword and structure-based queries against the annotated patent corpus. All chemistry is integrated with UniChem, and there are several ways to access the data in bulk, including flat files and a data client. Very soon, the data will be fully integrated and available via the Open PHACTS web service API, including, for the first time, gene and disease annotations from patents in addition to the chemical annotations.

So we're very happy that another milestone has now been reached: the official NAR (Nucleic Acids Research) publication for SureChEMBL is available in the usual Open Access format.

Here's the abstract:

SureChEMBL is a publicly available large-scale resource containing compounds extracted from the full text, images and attachments of patent documents. The data are extracted from the patent literature according to an automated text and image-mining pipeline on a daily basis. SureChEMBL provides access to a previously unavailable, open and timely set of annotated compound-patent associations, complemented with sophisticated combined structure and keyword-based search capabilities against the compound repository and patent document corpus; given the wealth of knowledge hidden in patent documents, analysis of SureChEMBL data has immediate applications in drug discovery, medicinal chemistry and other commercial areas of chemical science. Currently, the database contains 17 million compounds extracted from 14 million patent documents.

%A Papadatos, George
%A Davies, Mark
%A Dedman, Nathan
%A Chambers, Jon
%A Gaulton, Anna
%A Siddle, James
%A Koks, Richard
%A Irvine, Sean A.
%A Pettersson, Joe
%A Goncharoff, Nicko
%A Hersey, Anne
%A Overington, John P.
%T SureChEMBL: a large-scale, chemically annotated patent document database
%0 Journal Article
%D 2015
%J Nucleic Acids Research
%R 10.1093/nar/gkv1253