
Unleashing 4 million IUPAC names into the wild


In support of Egon Willighagen's 'One Million IUPAC Names' project, we have just released more than 4 million IUPAC names text-mined from patents. Here are the details as listed on Zenodo: 

 

What: This file contains IUPAC names text-mined from patents (US, WIPO, EPO, Chinese, Japanese). 

Who: This file is provided by the SureChEMBL project under a CC0 license. We are part of the Chemical Biology Services team at EMBL-EBI. Please cite us appropriately if you use this dataset (thanks!).

Format: 

This is a gzipped TSV file with two columns, IUPAC Names and SMILES. The IUPAC Names column may itself contain multiple IUPAC names separated by an exclamation mark (!). Each of these names resolves to the same SMILES and they differ only in casing. They are sorted such that the name with fewer uppercase characters comes first.
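
As a minimal sketch of how this format might be read in Python: the function names below are my own, UTF-8 encoding is an assumption, and whether the file begins with a header row should be checked against the actual download (hence the optional skip_header flag).

```python
import gzip
from typing import Iterator, List, Tuple


def parse_line(line: str) -> Tuple[List[str], str]:
    """Split one TSV line into (list of name variants, SMILES)."""
    # Exactly two tab-separated columns are expected; anything else raises ValueError.
    names_field, smiles = line.rstrip("\r\n").split("\t")
    # Case variants of the same name are separated by '!'.
    return names_field.split("!"), smiles


def read_rows(path: str, skip_header: bool = False) -> Iterator[Tuple[List[str], str]]:
    """Yield (names, smiles) for each non-empty line of the gzipped TSV."""
    # Assumption: the file is UTF-8 encoded text.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        if skip_header:
            next(handle, None)  # assumption: a header row, if present, is the first line
        for line in handle:
            if line.strip():
                yield parse_line(line)
```

Calling read_rows on the downloaded file (whatever it is named locally) then yields one (names, smiles) pair per row, with the name that has the fewest uppercase characters first in the list.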

Details:

As part of the SureChEMBL text-mining pipeline, we recognise and extract IUPAC names in patents. These are stored in a database at SureChEMBL HQ, and converted to chemical structures which are made available in our downloads. Here we are making available those text-mined IUPAC names, or to be exact, the names after minor corrections (such as removing spaces or fixing parentheses) that enable them to be interpreted.

How:

We use LeadMine from NextMove Software to text-mine systematic IUPAC names. This incorporates OPSIN by Daniel Lowe to resolve IUPAC names to SMILES.

Something I realised when checking the data was that the case of the characters can change the meaning. Consider O-methylphenol vs o-methylphenol, for example: the capital O places the methyl group on the oxygen, whereas the lowercase o means ortho. Needless to say, the same SMILES string can be associated with more than one IUPAC name. While in theory the rules describe how to generate a single Preferred IUPAC Name (PIN) for any structure, in practice a wide variety of names are used, and these are often semi-systematic in nature (consider o/m/p versus 2/3/4 in phenyl ring numbering).
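
One practical consequence is that lowercasing names before comparing them is risky. The sketch below, with a hypothetical helper name and hand-written SMILES that are not taken from the dataset, flags names that collapse to the same lowercase string while pointing at different structures.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def case_collisions(pairs: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Return lowercased names that are associated with more than one SMILES."""
    seen: Dict[str, Set[str]] = defaultdict(set)
    for name, smiles in pairs:
        seen[name.lower()].add(smiles)
    # Keep only the keys where ignoring case merged distinct structures.
    return {key: structures for key, structures in seen.items() if len(structures) > 1}


# Illustrative input only: SMILES written by hand for the two names discussed above.
example_pairs = [
    ("O-methylphenol", "COc1ccccc1"),  # methyl on the oxygen
    ("o-methylphenol", "Cc1ccccc1O"),  # methyl ortho to the hydroxyl
]
collisions = case_collisions(example_pairs)
```

Here collisions has a single key, the lowercased name, mapped to both SMILES strings, which is exactly the case where a case-insensitive comparison would silently merge two different compounds.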

What can this be used for? Well, you could use it as a lookup to provide an IUPAC name for a structure. Or to train or test an IUPAC name generator. Whatever you want really - the data is released under a CC0 license.
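
The lookup idea could be sketched roughly as follows. The build_lookup helper is hypothetical, and it matches on the exact SMILES string, so a query SMILES written differently from the one in the file will not be found unless both are canonicalised in the same way first.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def build_lookup(rows: Iterable[Tuple[List[str], str]]) -> Dict[str, List[str]]:
    """Map each SMILES string to every name seen for it, in file order."""
    lookup: Dict[str, List[str]] = defaultdict(list)
    for names, smiles in rows:
        # The same SMILES can appear on several rows with different names.
        lookup[smiles].extend(names)
    return dict(lookup)


# Hand-written rows in the (names, smiles) shape described under Format.
example_rows = [
    (["2-methylphenol"], "Cc1ccccc1O"),
    (["o-methylphenol"], "Cc1ccccc1O"),
    (["ethanol", "Ethanol"], "CCO"),
]
lookup = build_lookup(example_rows)
```

With this structure, lookup.get(smiles, []) returns all names recorded for a structure, and taking the first element is one simple (but arbitrary) way to pick a single name.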

We will not be updating this file on a regular basis, but rather will wait until there is a substantial increase in size. Echoing Egon's comment on his blog, let us know if you find these data useful.

Image credit: ChatGPT-4o with "a representation of IUPAC names being released into the wind". You may notice a minor correction (originally 'nthyl'). Perhaps they need to train on this set for a bit more.
