
SureChEMBL - Chemical Structure Information in Patents


Today we have announced that we are taking over the running of the SureChem system from Digital Science. We have renamed it SureChEMBL to reflect the history and provenance of the technology and engineering, and also to align it with its new home and future. We like the name, and hope you do too. We are delighted that this has happened - Nicko and the team at Digital Science have been great, and the more we have dug into how it works, the more we have appreciated the design and vision that they had.

If there is one consistent piece of feedback we get about ChEMBL, it is encouragement to add patent data to what we do. So now we have. Because the data from patents is different in detail from that reported in the published literature, we will keep the databases separate but closely integrated.

Those of you who are already SureChem users will be familiar with the functionality and how it works. For those who aren't: SureChEMBL takes feeds of full-text patents, identifies chemical objects from either the in-line text or from images, and adds 2-D chemical structures. These are then loaded into a database that is searchable by chemical structure, so you can do substructure searching, similarity searching and so forth - all the good things you'd expect from a chemical database. This chemical search functionality is not available from the public, published patent documents themselves, and is really essential for anyone seriously using the patent literature. Oh, and the system does this live: as patents are published, they are processed and added to the system. The delay between publication and structures being available in SureChEMBL is about a day when converted from text, and a few days when converted from image sources.
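To give a flavour of what similarity searching means in practice, here is a minimal sketch in plain Python. It is not how SureChEMBL itself is implemented: it assumes each compound has already been reduced to a fingerprint (here just a set of integer feature IDs), and the compound names, fingerprints and the 0.5 threshold are all made up for illustration. The score used is the Tanimoto coefficient, a common choice for comparing chemical fingerprints.

```python
from typing import Dict, FrozenSet, List, Tuple

# A fingerprint is modelled as the set of "on" feature IDs (an assumption
# for this sketch; real systems typically use fixed-length bit vectors).
Fingerprint = FrozenSet[int]


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto coefficient: shared features / features present in either."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def similarity_search(
    query: Fingerprint,
    library: Dict[str, Fingerprint],
    threshold: float = 0.7,
) -> List[Tuple[str, float]]:
    """Return (name, score) pairs scoring at least `threshold`, best first."""
    scored = ((name, tanimoto(query, fp)) for name, fp in library.items())
    hits = [(name, score) for name, score in scored if score >= threshold]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Hypothetical toy library: names and feature IDs are invented.
library = {
    "compound_A": frozenset({1, 2, 3, 4}),
    "compound_B": frozenset({1, 2, 3, 9}),
    "compound_C": frozenset({7, 8}),
}
query = frozenset({1, 2, 3, 4})

hits = similarity_search(query, library, threshold=0.5)
for name, score in hits:
    print(f"{name}: {score:.2f}")
```

With this toy data, compound_A matches the query exactly, compound_B shares three of five distinct features, and compound_C shares none and is filtered out by the threshold.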



SureChEMBL is hosted in the cloud - it's quite a complicated AWS solution - and it will take a few months for us to assume complete control of all the various parts and, importantly, to keep things running smoothly behind the scenes so that continuous access to fresh patent data is maintained.

SureChEMBL uses a number of third-party software products in its operation, and arranging the licenses and permissions has been complex and is still ongoing. The third-party software and data feeds used in SureChEMBL include:

Name to structure: ChemAxon, ACD/Labs, Perkin Elmer, OpenEye, OPSIN, NextMove
Chemical cartridge: ChemAxon
Image to structure: Key Module
Patent data: FairView (IFI Claims) – processed patents, TwinDolphin – patent PDFs

These guys have all been a pleasure to work with so far, and SureChEMBL is a great showcase of their respective technologies and data.

We will host the system at the primary URLs http://www.ebi.ac.uk/surechembl and http://www.surechembl.org. At the moment these redirect to www.surechem.org, but as we switch things over they will point to servers provisioned by our team, so please start using the new URLs for future access. The original URLs will continue to work into the future.

One of the more complicated things to transfer is the user accounts system. We can't simply transfer the accounts over, so once a new sign-on system is in place we plan to mail users in batches to invite them to sign up to it. If you are not currently a registered user, please sign up with the current system, and we'll invite you to transfer over to our sign-on system once things are ready.

The EMBL-EBI has a broad range of life-science chemistry resources, and we integrate across chemistry-related content using a chemical structure integration system called UniChem. In overview the EMBL-EBI chemistry resources include the following.



The future? Well, the future is exciting, and we have lots of ideas for actively developing the SureChEMBL system. To be clear though, doing this will rely on us getting funding, and we're working hard on this. Some of the ideas we have for SureChEMBL include:
  • Put SureChEMBL chemical content into UniChem
  • Add sequence searching
  • Add disease term, animal model, etc. indexing
  • Develop community KNIME nodes
  • Add links to/from Europe PMC
  • Ligand Ensemble-based mapping of ChEMBL literature to patents
  • Refactor interface for EMBL look and feel
  • Extend image extraction retrospectively from 2006 using spot-priced compute from AWS
  • Provide weekly/monthly feed of patent structures to PubChem
  • Add chemical structure tagging & search to full text content of Europe PMC
But one of the first things we plan to do is index genes and targets (in collaboration with local SME SciBite) and provide an RDF form of the data and REST web services as part of the IMI OpenPHACTS project.

In the new year we will run a webinar on SureChEMBL (which we will announce here), but in the meantime we're very happy to take questions at the SureChEMBL support email address, surechembl-help (at) ebi.ac.uk.

jpo
