
The SureChEMBL map file is out


As many of you know, SureChEMBL taps into the wealth of knowledge hidden in patent documents. More specifically, SureChEMBL extracts and indexes chemistry from the full-text patent corpus (EPO, WIPO and USPTO; JPO titles and abstracts only) by means of automated text- and image-mining, on a daily basis. We recently hosted a webinar about it, which turned out to be very popular - for those who missed it, the video and slides are here.

Besides the interface, SureChEMBL compound data can be accessed in various ways, such as through UniChem and PubChem. The full compound dump is also available as a flat file download from our FTP server.

Since the release of the SureChEMBL interface last September, we have received numerous requests for batch access to compound and patent data. Typical use-cases include retrieving all compounds for a list of patent IDs or, vice versa, retrieving all patents from which one or more compounds have been extracted. As a result, we have now produced this so-called map file, which connects SureChEMBL compounds and patents.

It is available here.
More information can be found in the README file.

What is this file?

There is a total of 216,892,266 rows in the map, each indicating a compound extracted from a specific section of a specific patent document. The format of the file is quite simple: it contains compound information (SCHEMBL ID, SMILES, InChI Key, corpus frequency), patent information (patent ID and publication date), and finally location information (field ID and frequency). The field ID indicates the specific section of the patent from which the compound was extracted (1:Description, 2:Claims, 3:Abstract, 4:Title, 5:Image, 6:MOL attachment). The frequency is the number of times the compound was found in a given section of a given patent. More information on the format of the file is in the README file.
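The row layout described above could be read along these lines. This is only a sketch: it assumes the file is tab-separated with eight columns in the order listed in this post, which should be checked against the README before use. The names `MapRow`, `parse_row` and `FIELD_NAMES` are made up for illustration, and the example row is not real SureChEMBL data.

```python
from typing import NamedTuple

# Field IDs as listed in the post.
FIELD_NAMES = {
    1: "Description",
    2: "Claims",
    3: "Abstract",
    4: "Title",
    5: "Image",
    6: "MOL attachment",
}


class MapRow(NamedTuple):
    schembl_id: str
    smiles: str
    inchi_key: str
    corpus_frequency: int
    patent_id: str
    publication_date: str  # kept as an opaque string; format not assumed
    field_id: int
    frequency: int


def parse_row(line: str) -> MapRow:
    """Parse one line, ASSUMING eight tab-separated columns in this order."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) != 8:
        raise ValueError(f"expected 8 columns, got {len(parts)}")
    cid, smiles, key, corpus, patent, date, field, freq = parts
    return MapRow(cid, smiles, key, int(corpus), patent, date, int(field), int(freq))


# Made-up row for illustration only; IDs and key are placeholders.
example = parse_row(
    "SCHEMBL0000\tCCO\tPLACEHOLDER-KEY\t12\tUS-0000000-A1\t2014-01-01\t2\t3\n"
)
print(example.patent_id, FIELD_NAMES[example.field_id], example.frequency)
```

Converting the two frequency columns and the field ID to integers up front makes later filtering by section or by corpus frequency a simple comparison.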

How many compounds and patents are there?

There are 187,958,584 unique patent-compound pairs, involving 14,076,090 unique compound IDs extracted from 3,585,233 EP, JP, WO and US patent documents - an average of ~52 compounds per patent. The patent coverage is from 1960 to 31-12-2014 inclusive.

Here's a breakdown of the patents in the map per year and patent authority:




Are these all the compounds and patents in SureChEMBL?

Technically, no - in practice, yes. We excluded chemically annotated patents that are not immediately relevant to life sciences, such as this one. For the filtering, we used a list of relevant IPCR and related patent classification codes. We also excluded compounds that are too small, too large or too trivial, along with non-organic and radical/fragment compounds.

Are these compounds genuinely claimed as novel in their respective patents?

Automatically assessing which compounds are the important and relevant ones in a pharmaceutical patent is a field of research and one of our future plans. For now, the map file includes all extracted chemistry mentioned in all sections of a patent, subject to the filters listed in the previous section. A quick and effective trick to filter out trivial and/or uninformative compounds is to use the corpus frequency column and exclude everything with a value greater than, say, 1000. Note that, in this way, you will also exclude drug compounds such as sildenafil, which are casually mentioned in a lot of patents. You could also look for compounds mentioned only in the claims, description or image sections by filtering on the corresponding field ID.
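The filtering trick above might look like this in Python. It is a sketch under assumptions: rows are taken to be tab-separated, with corpus frequency in the fourth column and field ID in the seventh (check the README for the actual order). The function name `informative_rows` and the toy rows are invented for illustration.

```python
from typing import FrozenSet, Iterable, Iterator

MAX_CORPUS_FREQUENCY = 1000            # threshold suggested in the post ("say, 1000")
DEFAULT_FIELDS = frozenset({1, 2, 5})  # 1: Description, 2: Claims, 5: Image


def informative_rows(
    lines: Iterable[str],
    max_corpus_frequency: int = MAX_CORPUS_FREQUENCY,
    wanted_fields: FrozenSet[int] = DEFAULT_FIELDS,
) -> Iterator[str]:
    """Yield rows whose compound is not too common and whose section is wanted.

    ASSUMES tab-separated rows with corpus frequency at index 3 and
    field ID at index 6.
    """
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        corpus_frequency = int(parts[3])
        field_id = int(parts[6])
        if corpus_frequency <= max_corpus_frequency and field_id in wanted_fields:
            yield line


# Toy rows for illustration only; IDs and keys are placeholders.
toy_rows = [
    "SCHEMBL0001\tC1CCCCC1\tKEY-ONE\t50\tUS-0000001-A1\t2014-01-01\t2\t1\n",
    "SCHEMBL0002\tCCO\tKEY-TWO\t250000\tUS-0000001-A1\t2014-01-01\t2\t4\n",  # too frequent
    "SCHEMBL0001\tC1CCCCC1\tKEY-ONE\t50\tUS-0000001-A1\t2014-01-01\t4\t1\n",  # title only
]
kept = list(informative_rows(toy_rows))
print(len(kept))
```

Because the function is a generator, it can be fed an open file handle and will stream through the full map without loading it into memory.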

What can I do with this?

Well, you can start by 'grepping' for one or more patent IDs or SCHEMBL IDs or InChI keys, followed by further filtering. Many of you will choose to normalise the flat file into 3 database tables (say compounds, documents and doc_to_compound) for centralised access and easy querying.
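As a rough illustration of the three-table idea, here is a sketch using Python's built-in `sqlite3` module. The table names follow the ones suggested above; the column names, the assumed tab-separated column order and the toy rows are illustrative assumptions, not the official schema.

```python
import sqlite3
from typing import Iterable

SCHEMA = """
CREATE TABLE compounds (
    schembl_id TEXT PRIMARY KEY,
    smiles TEXT,
    inchi_key TEXT,
    corpus_frequency INTEGER
);
CREATE TABLE documents (
    patent_id TEXT PRIMARY KEY,
    publication_date TEXT
);
CREATE TABLE doc_to_compound (
    patent_id TEXT REFERENCES documents(patent_id),
    schembl_id TEXT REFERENCES compounds(schembl_id),
    field_id INTEGER,
    frequency INTEGER
);
"""


def load_map(conn: sqlite3.Connection, lines: Iterable[str]) -> None:
    """Split each flat row into the three tables (ASSUMED column order)."""
    conn.executescript(SCHEMA)
    for line in lines:
        cid, smiles, key, corpus, patent, date, field, freq = (
            line.rstrip("\n").split("\t")
        )
        # Compounds and documents repeat across rows; keep the first copy only.
        conn.execute(
            "INSERT OR IGNORE INTO compounds VALUES (?, ?, ?, ?)",
            (cid, smiles, key, int(corpus)),
        )
        conn.execute("INSERT OR IGNORE INTO documents VALUES (?, ?)", (patent, date))
        conn.execute(
            "INSERT INTO doc_to_compound VALUES (?, ?, ?, ?)",
            (patent, cid, int(field), int(freq)),
        )
    conn.commit()


# Toy rows for illustration only; IDs and keys are placeholders.
toy_rows = [
    "SCHEMBL0001\tC1CCCCC1\tKEY-ONE\t50\tUS-0000001-A1\t2014-01-01\t1\t2\n",
    "SCHEMBL0001\tC1CCCCC1\tKEY-ONE\t50\tUS-0000001-A1\t2014-01-01\t2\t1\n",
    "SCHEMBL0002\tCCO\tKEY-TWO\t900\tUS-0000001-A1\t2014-01-01\t1\t5\n",
    "SCHEMBL0002\tCCO\tKEY-TWO\t900\tWO-0000002-A1\t2013-06-01\t2\t1\n",
]
conn = sqlite3.connect(":memory:")
load_map(conn, toy_rows)

# All distinct compounds extracted from one patent.
compound_ids = [
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT schembl_id FROM doc_to_compound "
        "WHERE patent_id = ? ORDER BY schembl_id",
        ("US-0000001-A1",),
    )
]
print(compound_ids)
```

For the full map, a file-backed database, batched inserts and indexes on `doc_to_compound(patent_id)` and `doc_to_compound(schembl_id)` would be sensible additions.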

For example, to find the patents the drug palbociclib has been extracted from:
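One way to sketch such a lookup in Python is to stream through the file and collect the patent IDs on rows that match the compound's InChI key. The column positions (InChI key third, patent ID fifth, tab-separated) are an assumption to verify against the README, `patents_for_compound` is an illustrative name, and the key in the toy data is a placeholder rather than palbociclib's real InChI key.

```python
from typing import Iterable, Set


def patents_for_compound(lines: Iterable[str], inchi_key: str) -> Set[str]:
    """Collect the patent IDs of all rows whose InChI key matches.

    ASSUMES tab-separated rows with the InChI key at index 2 and the
    patent ID at index 4.
    """
    patents: Set[str] = set()
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if parts[2] == inchi_key:
            patents.add(parts[4])
    return patents


# Toy rows for illustration only; IDs and keys are placeholders.
toy_rows = [
    "SCHEMBL0001\tC1CCCCC1\tKEY-ONE\t50\tUS-0000001-A1\t2014-01-01\t1\t2\n",
    "SCHEMBL0001\tC1CCCCC1\tKEY-ONE\t50\tUS-0000001-A1\t2014-01-01\t2\t1\n",
    "SCHEMBL0001\tC1CCCCC1\tKEY-ONE\t50\tEP-0000003-A1\t2012-03-01\t2\t1\n",
    "SCHEMBL0002\tCCO\tKEY-TWO\t900\tWO-0000002-A1\t2013-06-01\t2\t1\n",
]
found = sorted(patents_for_compound(toy_rows, "KEY-ONE"))
print(found)

# With the real file, pass an open handle instead of the toy list, e.g.
#   with open(path_to_map_file, encoding="utf-8") as handle:
#       patents_for_compound(handle, real_inchi_key)
# where path_to_map_file and real_inchi_key are supplied by you.
```

A set is used because the same compound usually appears in several sections of the same patent, so the patent ID would otherwise be repeated.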

Any plans to update this map file?  

New patents and chemistry arrive and are stored in SureChEMBL every day. We are planning to release new versions and incremental updates of the map file every quarter, in sync with the update of the compound dump files.

I couldn’t find my compound / patent - this compound should not be there

Don’t forget this is an automated, live, high-throughput text-mining effort against an inherently noisy corpus such as patents. We are constantly working on improving data quality. If you find anything strange, let us know.

Can I join more metadata, such as patent assignee and title?

Obviously your first port of call would be the SureChEMBL website for patent metadata, but other services you may wish to use include the EPO web services for programmatic access.

Is there anything else?

Errr, yes. Watch this space for another post on storing and accessing live SureChEMBL data, behind your firewall. 


The SureChEMBL Team
