UniChem - An EBI compound structure cross-referencing resource


For some time we have faced issues with compound integration in ChEMBL, specifically with loading compound sets into ChEMBL for cross-referencing between, for example, ChEBI and PDBe compounds. The ChEMBL update cycle is relatively slow compared with some other resources, so there is inevitable thrash with compounds not being present, especially for exciting new data. Without doing something different for compound integration, we were heading towards a compound table with many millions of compounds without any bioactivity data, followed by the inevitable slowdown in searching, etc.

We also faced some issues around curating other people's primary data, changing compound structures or their rendering, etc.

So, we decided to set up an external system to resolve cross-references between various databases. This is a very simple Standard InChI lookup, containing compounds from resources such as ChEMBL, ChEBI, PDBe, DrugBank, KEGG, BindingDB, PubChem, and so forth. UniChem can also handle versioning of the contained resources. We will be migrating various components of the current ChEMBL interface across to use web services on UniChem. This way, the cross-links will always be fresh and correct, and we can focus on curation and optimisation of ChEMBL content. There are some other resources, such as ZINC, STITCH, and ChemSpider, that would be great to integrate if we can get hold of the required data.

The easiest way for us to handle deposition into UniChem is to take an FTP feed of a simple table of resource_id, standard_InChI, and standard_InChI_key.
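
As a rough illustration of how such a feed could drive a Standard InChIKey lookup, here is a minimal Python sketch. It assumes a tab-separated feed with three columns and no header row; the function names, source labels, and resource identifiers are made up for illustration and are not UniChem's actual implementation. The InChIKey is the one used as an example in the comments below.

```python
import io
from collections import defaultdict


def load_feed(source, handle, index):
    """Add one resource's feed to a shared index keyed by Standard InChIKey.

    Each line is assumed to be: resource_id <TAB> standard_InChI <TAB> standard_InChI_key
    """
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue  # skip malformed rows rather than guessing at their meaning
        resource_id, _inchi, inchi_key = fields
        index[inchi_key].append((source, resource_id))


def cross_references(index, inchi_key):
    """Return (source, resource_id) pairs that share the given InChIKey."""
    return sorted(index.get(inchi_key, []))


# Illustrative data only: the identifiers "A-0001" and "B-42" are hypothetical.
EXAMPLE_INCHI = "InChI=1S/C9H8O4/c1-6(12)13-8-5-3-2-4-7(8)9(10)11/h2-5H,1H3,(H,10,11)"
EXAMPLE_KEY = "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"

feed_a = io.StringIO(f"A-0001\t{EXAMPLE_INCHI}\t{EXAMPLE_KEY}\n")
feed_b = io.StringIO(f"B-42\t{EXAMPLE_INCHI}\t{EXAMPLE_KEY}\nmalformed line\n")

index = defaultdict(list)
load_feed("source_a", feed_a, index)
load_feed("source_b", feed_b, index)

print(cross_references(index, EXAMPLE_KEY))
```

Keying on the InChIKey rather than the full InChI keeps the index compact, and each resource can be reloaded independently when its feed changes.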

At the moment, UniChem sits behind our firewall, but if people want to have a play, let us know.

We will write something more specific and detailed, but would welcome thoughts on whether this resolver should be externally facing, and on what other resources would be good to integrate.

The image above may or may not be the UniChem logo.

Comments

Michael Kuhn said…
So the idea is that you give UniChem an InChIKey, and get back the identifiers of the source databases? To some extent, this already works in PubChem:
http://www.ncbi.nlm.nih.gov/pcsubstance?term=KEGG%5Bsource%5D%20%20BSYNRYMUTXBXSQ-UHFFFAOYSA-N
Although you don't control the versions, of course.

Regarding STITCH, we're downstream of PubChem and also don't create new interactions that you could integrate into ChEMBL. Therefore, it's probably easiest to leave STITCH out of UniChem and rather link to STITCH via InChIKey:
http://stitch.embl.de/interactions/BSYNRYMUTXBXSQ-UHFFFAOYSA-N

~Michael
jpo said…
Thanks for the comment. I guess the linking out to STITCH via InChI key is one-way only (or at least practically is one-way).

We should have simple cross reference to STITCH in a future version of the ChEMBL interface.

Longer term we have a plan with UniChem to provide some additional services off this, such as a trivial name/synonym service, etc. Keeping on top of names is a real pain for us, and others do it really well.
fredrik said…
I think it would be really great with the addition of some vendor databases like emolecules.com. To have the ability to filter ChEMBL based on commercial availability would give great benefits.

/Fredrik
Unknown said…
You can already get a purchasable subset of ChEMBL via ZINC. Both ChEMBL15 http://zinc.docking.org/pbcs/chembl15 and ChEMBL DrugStore http://zinc.docking.org/pbcs/drugstore. The molecules are in ready-to-dock formats, free to download. We also offer purchasable subsets of many other subsets like IUPHAR, FDA, DrugBank, bindingdb, and also supersets, like all purchasable natural products, metabolites, drugs, in man compounds, etc, etc.
