I recently prepared a few slides on OPSIN for an internal presentation, and was looking for a simple use case. The first thing I tried turned out to be more interesting than I expected.
If you visit the OPSIN website, there are three examples provided to illustrate its functionality. Daniel's original website, at the Uni of Cambridge, had 2,4,6-trinitrotoluene (TNT) as the example. With the move to EMBL-EBI and the associated rewrite of the frontend, I thought about keeping this but decided that something more biologically relevant would be appropriate. In the end, I compromised by keeping the 2,4,6- as a nod to the original, but used a saccharide instead: 2,4,6-tri-O-methyl-D-glucopyranose.
Now click on "Search Google" to do a search using the InChIKey. My attention was drawn to the AI summary results, which I captured at the time (maybe you can tell when?) in the screenshot below:
"The string UTLUVTKMAWSZKV-NEIVSKJXSA-N is an InChIKey (International Chemical Identifier Key) that corresponds to the chemical compound methyl 4-O-methyl-beta-D-xylopyranoside." Oh yeah? Well, it has put the name in bold so it must be pretty sure. But maybe just before I "Dive deeper in AI Mode", since I am the custodian of an actual rule-based deterministic IUPAC name-to-structure software, I should probably double-check that. One copy and paste later gives the following result:
Well, at least it's another saccharide.
Does this tell us something profound about AI tooling and difficulties reading IUPAC names? Not really - InChIKey to IUPAC name is a dictionary lookup problem. However, it is definitely true that reading an IUPAC name accurately is difficult; even just the interpretation of stereochemistry on its own is quite a complex problem and I can't see how an LLM will ever be able to handle this. Will it lead to errors in databases? Perhaps, where people trust it too much. But as soon as LLMs start delegating specialised tasks to tools such as OPSIN (e.g. by calling the web API, or via an MCP server), I expect that the problem will work itself out.
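For what it's worth, the copy-and-paste check I did by hand is easy to script, and it is roughly what a delegating tool would do: send the claimed name to OPSIN and compare the InChIKey that comes back with the one you searched for. The following is only a minimal sketch. The base URL and the JSON field names (`status`, `stdinchikey`) are my assumptions about the current web service, so check them against the OPSIN website before relying on this; the function names are just made up for illustration.

```python
import json
import urllib.parse
import urllib.request

# Assumption: base URL of the OPSIN web service at EMBL-EBI.
# Verify against the OPSIN website; it may differ or change.
OPSIN_BASE = "https://www.ebi.ac.uk/opsin/ws/"


def opsin_url(name, base=OPSIN_BASE):
    """Build the request URL for a chemical name, percent-encoding it."""
    return base + urllib.parse.quote(name, safe="") + ".json"


def name_matches_inchikey(response, expected_inchikey):
    """Return True if a parsed OPSIN response carries the expected InChIKey.

    Assumption: the response is a dict with a "status" field set to
    "SUCCESS" on a successful parse and a "stdinchikey" field.
    """
    if response.get("status") != "SUCCESS":
        return False
    return response.get("stdinchikey") == expected_inchikey


def check_claim(name, expected_inchikey):
    """Ask OPSIN to interpret `name` and compare InChIKeys (needs network)."""
    with urllib.request.urlopen(opsin_url(name), timeout=10) as resp:
        return name_matches_inchikey(json.load(resp), expected_inchikey)
```

Calling `check_claim("methyl 4-O-methyl-beta-D-xylopyranoside", "UTLUVTKMAWSZKV-NEIVSKJXSA-N")` would then repeat the check above without the copy and paste.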
And as a postscript to this, currently I cannot reproduce this error - right now Google's AI provides a correct IUPAC name. Who knows? Maybe I hallucinated it.


