
UniChem - An EBI compound structure cross-referencing resource


For some time we have had issues with compound integration in ChEMBL, specifically with loading compound sets into ChEMBL for cross-referencing against, for example, ChEBI and PDBe compounds. The ChEMBL update cycle is relatively slow compared with some other resources, so there is inevitable thrash from compounds not being present, especially for exciting new data. Without a different approach to compound integration, we were heading towards a compound table holding many millions of compounds with no bioactivity data, followed by the inevitable slowdown in searching, etc.

We also faced issues around curating other people's primary data, such as changing compound structures or their rendering.

So, we decided to set up an external system to resolve cross-references between various databases. This is a very simple Standard InChI lookup, containing compounds from resources such as ChEMBL, ChEBI, PDBe, DrugBank, KEGG, BindingDB, PubChem, and so forth. UniChem can also handle versioning of the contained resources. We will be migrating various components of the current ChEMBL interface across to use web services on UniChem. This way, the cross-links will always be fresh and correct, and we can focus on curation and optimisation of ChEMBL content. There are other resources, such as ZINC, STITCH, and ChemSpider, that would be great to integrate if we can get hold of the required data.

The easiest way for us to handle deposition into UniChem is to take an FTP feed of a simple table with three columns: resource_id, standard_InChI, and standard_InChI_key.
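
As a rough illustration of how a feed like this could drive a Standard InChI lookup, here is a minimal Python sketch. It assumes the table is tab-separated with no header row, which is not stated above. The function names (`load_feed`, `build_index`), the source labels, and the resource identifiers are made up for the example; this is not UniChem's actual implementation.

```python
import csv
import io
from collections import defaultdict


def load_feed(handle, source):
    """Yield (source, resource_id, standard_inchi, standard_inchi_key) rows.

    Assumes a tab-separated table with no header row (an assumption).
    """
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row:
            continue  # skip blank lines
        resource_id, inchi, inchi_key = row
        yield source, resource_id, inchi, inchi_key


def build_index(rows):
    """Map each Standard InChI to the set of (source, resource_id) pairs."""
    index = defaultdict(set)
    for source, resource_id, inchi, _inchi_key in rows:
        index[inchi].add((source, resource_id))
    return index


# Two hypothetical feeds; the identifiers are placeholders, not real accessions.
feed_a = io.StringIO(
    "A1\tInChI=1S/CH4/h1H4\tVNWKTOKETHGBQD-UHFFFAOYSA-N\n"
    "A2\tInChI=1S/H2O/h1H2\tXLYOFNOQVPJJNP-UHFFFAOYSA-N\n"
)
feed_b = io.StringIO("B7\tInChI=1S/CH4/h1H4\tVNWKTOKETHGBQD-UHFFFAOYSA-N\n")

index = build_index(
    list(load_feed(feed_a, "source_a")) + list(load_feed(feed_b, "source_b"))
)

# Every resource that holds the same Standard InChI is a cross-reference.
methane_xrefs = sorted(index["InChI=1S/CH4/h1H4"])
print(methane_xrefs)
```

The sketch ignores versioning of the contained resources, which a real resolver would have to track for each source.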

At the moment, UniChem sits behind our firewall, but if people want to have a play, let us know.

We will write something more specific and detailed, but would welcome thoughts on whether this resolver should be externally facing, and on what other resources would be good to integrate.

The image above may or may not be the UniChem logo.

Comments

Michael Kuhn said…
So the idea is that you give UniChem an InChIKey, and get back the identifiers of the source databases? To some extent, this already works in PubChem:
http://www.ncbi.nlm.nih.gov/pcsubstance?term=KEGG%5Bsource%5D%20%20BSYNRYMUTXBXSQ-UHFFFAOYSA-N
Although you don't control the versions, of course.

Regarding STITCH, we're downstream of PubChem and also don't create new interactions that you could integrate into ChEMBL. Therefore, it's probably easiest to leave STITCH out of UniChem and rather link to STITCH via InChIKey:
http://stitch.embl.de/interactions/BSYNRYMUTXBXSQ-UHFFFAOYSA-N

~Michael
jpo said…
Thanks for the comment. I guess linking out to STITCH via InChIKey is one-way only (or at least is in practice).

We should have simple cross reference to STITCH in a future version of the ChEMBL interface.

Longer term we have a plan with UniChem to provide some additional services off this, such as a trivial name/synonym service, etc. Keeping on top of names is a real pain for us, and others do it really well.
fredrik said…
I think it would be really great to add some vendor databases like emolecules.com. The ability to filter ChEMBL based on commercial availability would be of great benefit.

/Fredrik
Unknown said…
You can already get a purchasable subset of ChEMBL via ZINC, both for ChEMBL15 http://zinc.docking.org/pbcs/chembl15 and for ChEMBL DrugStore http://zinc.docking.org/pbcs/drugstore. The molecules are in ready-to-dock formats and free to download. We also offer purchasable subsets of many other collections, like IUPHAR, FDA, DrugBank, and BindingDB, as well as supersets, like all purchasable natural products, metabolites, drugs, in-man compounds, etc.