ChEMBL Resources


Monday, 29 September 2014

The great US patent spike on SureChEMBL


Apparently, there was a huge spike in newly granted US patents released by the USPTO a few days ago. The reason?

In March 2013, US patent law changed: for patent protection purposes, ‘first to invent’ became ‘first inventor to file’ (see more on this here). As a result, many applicants rushed to file just before the change. Fast forward 18 months to last week, and a huge spike in USPTO granted patents can be seen.

Did SureChEMBL pick that up? See below the cumulative count plot of new patent documents:

And the corresponding compound count extracted from these patents:

For more information on SureChEMBL, see our previous posts.

George

Friday, 19 September 2014

SureChEMBL Available Now





Followers of the ChEMBL group's activities and this blog will be aware of our involvement in migrating the previously commercial SureChem chemistry patent system to a new, free-to-use system known as SureChEMBL. Today we are very pleased to announce that the migration process is complete and the SureChEMBL website is now online.

SureChEMBL provides the research community with the ability to search the patent literature using Lucene-based keyword queries and, much more importantly, chemistry-based queries. If you are not familiar with SureChEMBL, we recommend you review the content of these earlier blogposts here and here. SureChEMBL is a live system, which continuously extracts chemical entities from the patent literature. It takes 1-2 days for a new chemical in the patent literature to become searchable in SureChEMBL (WO patents can sometimes take a bit longer due to an additional reprocessing step). At the time of writing, SureChEMBL contains 15,760,514 unique compounds, extracted from 12,949,021 patents.

To get started using SureChEMBL, head over to the homepage, where you will be presented with a range of search methods and filters. The image below provides a brief overview of the search functionality offered by the system:




To provide an example of how to use the SureChEMBL website, let's assume you are interested in patents that contain structures similar (or identical) to Sildenafil in the claims section of the document and that also mention the term PDE5 anywhere in the document. To run this search, go to the SureChEMBL homepage and carry out the following actions:
  1. Enter the term 'PDE5' in the search text box
  2. Sketch in the structure of Sildenafil (or use the name look-up function)
  3. Change the search type to similarity (>85%)
  4. Click the 'Claims' checkbox in the document filter section
  5. Hit the 'Search' button
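
To give a feel for what a similarity threshold such as the >85% in step 3 means, here is a minimal Python sketch. This post does not say how SureChEMBL scores similarity, so the sketch assumes a Tanimoto coefficient over fingerprint bits, a common choice in cheminformatics. The fingerprints and the function names are hypothetical and purely illustrative.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient: shared 'on' bits divided by all 'on' bits."""
    a, b = set(bits_a), set(bits_b)
    if not a and not b:
        return 1.0  # two empty fingerprints are treated as identical
    return len(a & b) / len(a | b)


def passes_threshold(bits_a, bits_b, threshold=0.85):
    """True if the similarity is strictly above the threshold (assumed >85%)."""
    return tanimoto(bits_a, bits_b) > threshold


# Hypothetical fingerprints: each number is the index of a bit that is set.
query = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}
close_analogue = {1, 2, 3, 4, 5, 6, 7, 8, 9}       # shares 9 of 10 bits
distant_compound = {1, 2, 3, 4, 5, 6, 21, 22}      # shares 6 of 12 bits

print(passes_threshold(query, close_analogue))
print(passes_threshold(query, distant_compound))
```

Under these assumptions, only the close analogue would be returned by a >85% search; an identical structure scores 1.0 and is always included.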


After clicking 'Search', you will be presented with a page which contains all compounds that match your search criteria:





From the compound results page above you then have the choice of either exporting the chemistry (all the compounds returned by the search) or viewing the patents associated with one or more of the selected compounds. For the selected compounds in this search, the associated patents (sorted by descending publication date) are:


 

From the patent document results page, you are able to export chemistry from all documents on display, view patent family information and view the chemistry-annotated, full text document. The claims section of the first patent (US-20140255433-A1) includes references to both sildenafil and PDE5:


 

The aim of this blogpost is to introduce the SureChEMBL system and not to provide a comprehensive review of all the functionality the system offers. This will be covered in future training sessions and webinars, which will be announced on this blog in the near future.

We would like to thank the people over at Digital Science, who were responsible for building the original SureChem system and supported its migration over to EMBL-EBI. In particular, we would like to thank Nicko Goncharoff, James Siddle and Richard Koks.

The system runs in the cloud, specifically on Amazon Web Services, a stable, secure and highly scalable way to deploy web applications. We need to keep a close eye on performance and patterns of usage over the coming weeks to get an idea of how many servers, etc., we need for full deployment. In particular, we will throttle scripted access, so please get in touch if you want to try anything like this; that way you will not be frustrated by slow performance, and we will try to accommodate your use case. There is also a download link on the homepage, so please explore this if you are interested.

We have an exciting roadmap for the future development of SureChEMBL, but if you have any priority requests, mail them to surechembl-help (at) ebi.ac.uk.

If you experience any issues with the system, or have any questions, please get in touch.

Tuesday, 9 September 2014

Papers: Literature text mining and extensions to UniChem


Two new papers from the group have just been published, both in Journal of Cheminformatics - and of course both Open Access.

The first deals with some extensions to UniChem to allow far more flexible searches. The abstract is:

UniChem is a low-maintenance, fast and freely available compound identifier mapping service, recently made available on the Internet. Until now, the criterion of molecular equivalence within UniChem has been on the basis of complete identity between Standard InChIs. However, a limitation of this approach is that stereoisomers, isotopes and salts of otherwise identical molecules are not considered as related. Here, we describe how we have exploited the layered structural representation of the Standard InChI to create new functionality within UniChem that integrates these related molecular forms. The service, called ‘Connectivity Search’ allows molecules to be first matched on the basis of complete identity between the connectivity layer of their corresponding Standard InChIs, and the remaining layers then compared to highlight stereochemical and isotopic differences. Parsing of Standard InChI sub-layers permits mixtures and salts to also be included in this integration process. Implementation of these enhancements required simple modifications to the schema, loader and web application, but none of which have changed the original UniChem functionality or services. The scope of queries may be varied using a variety of easily configurable options, and the output is annotated to assist the user to filter, sort and understand the difference between query and retrieved structures. A RESTful web service output may be easily processed programmatically to allow developers to present the data in whatever form they believe their users will require, or to define their own level of molecular equivalence for their resource, albeit within the constraint of identical connectivity.
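
The layer-based matching described in the abstract can be sketched in a few lines of Python. This is only an illustration of the idea, not UniChem's implementation: the function names are hypothetical, the parser keeps just the first occurrence of each layer prefix (so repeated stereo sub-layers inside an isotopic layer are ignored), and mixtures and salts are not handled. The two example strings are the Standard InChIs of L- and D-alanine.

```python
def split_layers(inchi):
    """Split a Standard InChI into a dict of layers keyed by prefix letter."""
    if not inchi.startswith("InChI=1S/"):
        raise ValueError("expected a Standard InChI")
    parts = inchi.split("/")[1:]
    layers = {"formula": parts[0]}  # the formula layer has no prefix letter
    for part in parts[1:]:
        if part:
            # Simplification: keep only the first layer seen for each prefix.
            layers.setdefault(part[0], part[1:])
    return layers


def same_connectivity(inchi_a, inchi_b):
    """True if both InChIs have an identical connectivity ('c') layer."""
    c_a = split_layers(inchi_a).get("c")
    c_b = split_layers(inchi_b).get("c")
    return c_a is not None and c_a == c_b


def layer_differences(inchi_a, inchi_b):
    """Return the non-connectivity layers that differ, as {prefix: (a, b)}."""
    la, lb = split_layers(inchi_a), split_layers(inchi_b)
    keys = (set(la) | set(lb)) - {"c"}
    return {k: (la.get(k), lb.get(k)) for k in sorted(keys) if la.get(k) != lb.get(k)}


L_ALANINE = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
D_ALANINE = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m1/s1"

print(same_connectivity(L_ALANINE, D_ALANINE))
print(layer_differences(L_ALANINE, D_ALANINE))
```

In this sketch the two enantiomers match on connectivity and differ only in the 'm' stereo layer. A match on complete Standard InChI identity would treat them as unrelated, which is the limitation the abstract describes.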

The second deals with using text mining approaches to find papers that look like they could be abstracted into ChEMBL - that is, papers containing keywords enriched in medicinal chemistry and compound structure concepts. The abstract for this paper is:


The large increase in the number of scientific publications has fuelled a need for semi- and fully automated text mining approaches in order to assist in the triage process, both for individual scientists and also for larger-scale data extraction and curation into public databases. Here, we introduce a document classifier, which is able to successfully distinguish between publications that are ‘ChEMBL-like’ (i.e. related to small molecule drug discovery and likely to contain quantitative bioactivity data) and those that are not. The unprecedented size of the medicinal chemistry literature collection, coupled with the advantage of manual curation and mapping to chemistry and biology make the ChEMBL corpus a unique resource for text mining.
The method has been implemented as a data protocol/workflow for both Pipeline Pilot (version 8.5) and KNIME (version 2.9) respectively. Both workflows and models are freely available at: ftp://ftp.ebi.ac.uk/pub/databases/chembl/text-mining. These can be readily modified to include additional keyword constraints to further focus searches.
Large-scale machine learning document classification was shown to be very robust and flexible for this particular application, as illustrated in four distinct text-mining-based use cases. The models are readily available on two data workflow platforms, which we believe will allow the majority of the scientific community to apply them to their own data.
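
As a rough illustration of what a keyword-based document classifier does, here is a toy sketch in Python. The abstract above does not name the learning algorithm, so a multinomial Naive Bayes model with add-one smoothing is swapped in purely for illustration. The training texts, labels and function names are all made up, and this is not the published Pipeline Pilot or KNIME workflow.

```python
import math
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it on whitespace."""
    return text.lower().split()


def train(labelled_docs):
    """Count documents per class and words per class from (text, label) pairs."""
    class_counts, word_counts, vocab = Counter(), {}, set()
    for text, label in labelled_docs:
        class_counts[label] += 1
        counts = word_counts.setdefault(label, Counter())
        for token in tokenize(text):
            counts[token] += 1
            vocab.add(token)
    return class_counts, word_counts, vocab


def predict(model, text):
    """Return the class with the highest log-probability for the text."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        counts = word_counts[label]
        total_words = sum(counts.values())
        score = math.log(n_docs / total_docs)  # class prior
        for token in tokenize(text):
            if token in vocab:  # words never seen in training are skipped
                score += math.log((counts[token] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up training snippets standing in for real abstracts.
TRAINING = [
    ("synthesis and sar of potent kinase inhibitors with ic50 values", "chembl-like"),
    ("novel pde5 inhibitors binding affinity ki and selectivity", "chembl-like"),
    ("clinical survey of patient outcomes in hospital care", "other"),
    ("genome sequencing of soil bacteria populations", "other"),
]

model = train(TRAINING)
print(predict(model, "potent inhibitors with ic50 and ki data"))
print(predict(model, "hospital survey of patient care"))
```

A real classifier of this kind would be trained on a far larger corpus and evaluated on held-out documents; the toy data here only shows the mechanics of scoring a text against per-class keyword frequencies.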

%T UniChem: extension of InChI-based compound mapping to salt, connectivity and stereochemistry layers
%A J Chambers
%A M Davies
%A A Gaulton
%A G Papadatos
%A A Hersey
%A JP Overington
%J Journal of Cheminformatics 
%D 2014
%V 6:43  
%O doi:10.1186/s13321-014-0043-5
%O http://www.jcheminf.com/content/6/1/43

%T A document classifier for medicinal chemistry publications trained on the ChEMBL corpus
%A G Papadatos
%A GJP van Westen
%A S Croset
%A R Santos
%A S Trubian
%A JP Overington
%J Journal of Cheminformatics 
%D 2014
%V 6:40  
%O doi:10.1186/s13321-014-0040-8
%O http://www.jcheminf.com/content/6/1/40

Tuesday, 2 September 2014

We're hiring! Web developer for NIH Illuminating the Druggable Genome (IDG) project


We got a prize today, so we are happy. What better way to celebrate than to recruit someone new for the group? We have a position available for a developer to support web service development and integration for the Knowledge Management Centre part of the recently announced NIH Illuminating the Druggable Genome project; see this link for details of the job.

Closing deadline for applications is 12th October 2014.