
ChEBI 2.0 Data Products


Introduction

Dear ChEBI users,
If you have visited the ChEBI website recently, you may have guessed that something big is about to arrive. Our team has spent the last three years redeveloping ChEBI, which includes a new website, modern infrastructure, a new submission and curator tool, and improvements to the ChEBI data products (ontology, TSV flat files, SDF files, database dump). That is a lot to cover in a single post, so this is one in a series of blog posts describing ChEBI 2.0. Today, we are going to look at the enhancements to the ChEBI data products.

What is new?

We are going to describe the changes according to the type of product, starting with one of our most widely used data products: the ChEBI Ontology.

ChEBI Ontology

These are the changes we have implemented:

Homogenised prefixes

We decided to homogenise the ChEBI prefixes. Previously, we used http://purl.obolibrary.org/obo/chebi# for annotation properties and http://purl.obolibrary.org/obo/chebi/ for object properties. This happened because old ChEBI generated its OWL files from OBO files. Now there is a single prefix for both annotation properties and object properties: http://purl.obolibrary.org/obo/chebi/

Using CHEMROF for annotation properties

Old ChEBI used a custom hashed prefix for annotation properties, for example http://purl.obolibrary.org/obo/chebi#mass. Now we are using CHEMROF (https://w3id.org/chemrof/). The following table shows the new annotation properties and their equivalents in old ChEBI.

Old ChEBI                 | ChEBI 2.0
chebi:charge              | chemrof:charge
chebi:formula             | chemrof:generalized_empirical_formula
chebi:inchikey            | chemrof:inchi_key_string
chebi:inchi               | chemrof:inchi_string
chebi:monoisotopic_mass   | chemrof:monoisotopic_mass
chebi:mass                | chemrof:mass
chebi:smiles              | chemrof:smiles_string
(WURCS not provided)      | chemrof:wurcs_representation


As the last row of the previous table shows, we now also offer the WURCS representation of carbohydrate structures as a new annotation.
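
If you have code that refers to the old annotation property IRIs, the mapping in the table above can be applied mechanically. The following is only a minimal sketch: it assumes that a `chemrof:` name expands to `https://w3id.org/chemrof/` plus the local name, and that every old property used the hashed prefix in the same way as the `chebi#mass` example. The function name `migrate_annotation_iri` is made up for this illustration.

```python
# Old ChEBI annotation property names mapped to their ChEBI 2.0 CHEMROF
# equivalents, copied from the table above.
OLD_PREFIX = "http://purl.obolibrary.org/obo/chebi#"
# Assumption: chemrof: names expand by appending the local name to this prefix.
CHEMROF_PREFIX = "https://w3id.org/chemrof/"

OLD_TO_CHEMROF = {
    "charge": "charge",
    "formula": "generalized_empirical_formula",
    "inchikey": "inchi_key_string",
    "inchi": "inchi_string",
    "monoisotopic_mass": "monoisotopic_mass",
    "mass": "mass",
    "smiles": "smiles_string",
}


def migrate_annotation_iri(iri: str) -> str:
    """Return the CHEMROF IRI for an old ChEBI annotation property IRI.

    IRIs that are not old ChEBI annotation properties, or that are not in
    the mapping, are returned unchanged.
    """
    if not iri.startswith(OLD_PREFIX):
        return iri
    local_name = iri[len(OLD_PREFIX):]
    new_name = OLD_TO_CHEMROF.get(local_name)
    if new_name is None:
        return iri  # unknown property: leave it for manual review
    return CHEMROF_PREFIX + new_name


print(migrate_annotation_iri("http://purl.obolibrary.org/obo/chebi#mass"))
```

WURCS has no entry in the mapping because old ChEBI did not provide it.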

Using RO for object properties

Old ChEBI used a custom slashed prefix for object properties, for example http://purl.obolibrary.org/obo/chebi/has_functional_parent. Now we are using terms from the OBO Relation Ontology (RO) to model the relationships among compounds (object properties). The following table shows the new object properties and their equivalents in old ChEBI.

Old ChEBI                         | ChEBI 2.0
chebi:has_functional_parent       | RO:0018038
chebi:has_parent_hydride          | RO:0018040
chebi:is_conjugate_acid_of        | RO:0018034
chebi:is_conjugate_base_of        | RO:0018033
chebi:is_enantiomer_of            | RO:0018039
chebi:is_substituent_group_from   | RO:0018037
chebi:is_tautomer_of              | RO:0018036
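
The object property mapping can be applied in the same mechanical way. The sketch below is an illustration under a few assumptions: it expands RO identifiers with the standard OBO PURL pattern (`RO:0018038` becomes `http://purl.obolibrary.org/obo/RO_0018038`), and the helper names `curie_to_obo_iri` and `migrate_object_property` are invented for this example.

```python
OBO_PREFIX = "http://purl.obolibrary.org/obo/"
OLD_OBJECT_PREFIX = "http://purl.obolibrary.org/obo/chebi/"

# Old ChEBI object property names mapped to RO identifiers,
# copied from the table above.
OLD_TO_RO = {
    "has_functional_parent": "RO:0018038",
    "has_parent_hydride": "RO:0018040",
    "is_conjugate_acid_of": "RO:0018034",
    "is_conjugate_base_of": "RO:0018033",
    "is_enantiomer_of": "RO:0018039",
    "is_substituent_group_from": "RO:0018037",
    "is_tautomer_of": "RO:0018036",
}


def curie_to_obo_iri(curie: str) -> str:
    """Expand an OBO identifier such as 'RO:0018038' into its PURL."""
    prefix, local_id = curie.split(":", 1)
    return f"{OBO_PREFIX}{prefix}_{local_id}"


def migrate_object_property(iri: str) -> str:
    """Return the RO IRI for an old ChEBI object property IRI.

    IRIs that are not in the mapping are returned unchanged.
    """
    if not iri.startswith(OLD_OBJECT_PREFIX):
        return iri
    curie = OLD_TO_RO.get(iri[len(OLD_OBJECT_PREFIX):])
    return curie_to_obo_iri(curie) if curie else iri


print(migrate_object_property(
    "http://purl.obolibrary.org/obo/chebi/has_functional_parent"
))
```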

Other improvements

  • Old ChEBI was missing some important metadata, such as the license, homepage, title, description and version number. We have fixed that in the redevelopment.
  • Old ChEBI didn’t have proper XML data types for annotation properties, so all of them were processed as strings. That is incorrect for mass, monoisotopic mass and charge, which are numeric values. We have set the correct XML data types for them.
  • We are using Bioregistry prefixes for cross-references. This is an improvement because old ChEBI didn’t have these prefixes homogenised, so you could find two or more prefixes for the same database across the ontology.
  • Taking advantage of the new way the ChEBI Ontology is generated, we now provide the OBO Graphs JSON format for all the ontology subsets: LITE, CORE and FULL.
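
To give an idea of what working with the JSON format can look like, here is a small sketch that counts relationships by predicate using only the Python standard library. It assumes the usual OBO Graphs layout (a top-level `graphs` list whose items carry `nodes` and `edges`, with each edge holding `sub`, `pred` and `obj`). The inline sample document uses placeholder identifiers invented for this example, not real ChEBI content.

```python
import json
from collections import Counter

# Tiny inline document in the OBO Graphs JSON layout. The node IDs are
# placeholders made up for this sketch; they are not real ChEBI entries.
SAMPLE = """
{
  "graphs": [
    {
      "id": "http://example.org/sample.json",
      "nodes": [
        {"id": "http://example.org/A", "lbl": "compound A", "type": "CLASS"},
        {"id": "http://example.org/B", "lbl": "compound B", "type": "CLASS"}
      ],
      "edges": [
        {"sub": "http://example.org/A", "pred": "is_a",
         "obj": "http://example.org/B"},
        {"sub": "http://example.org/A",
         "pred": "http://purl.obolibrary.org/obo/RO_0018038",
         "obj": "http://example.org/B"}
      ]
    }
  ]
}
"""


def count_edges_by_predicate(document: dict) -> Counter:
    """Count the edges of every graph in the document, grouped by predicate."""
    counts = Counter()
    for graph in document.get("graphs", []):
        for edge in graph.get("edges", []):
            counts[edge["pred"]] += 1
    return counts


document = json.loads(SAMPLE)
counts = count_edges_by_predicate(document)
for predicate, n in sorted(counts.items()):
    print(predicate, n)
```

For a real file, you would replace `json.loads(SAMPLE)` with `json.load()` on the opened file.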

Now, let's take a look at the ChEBI relational database.

ChEBI Database

As you probably know, we also offer ChEBI as a relational database: we export it as an Oracle dump, together with one DDL script for each table. For ChEBI 2.0, we have taken a big decision: to use PostgreSQL as our new relational database system. There are many reasons behind this decision: it is open source, it is robust, it has a big community, and it offers great performance and scalability. In addition, the Python data ecosystem libraries (Pandas, Polars, DuckDB, SQLAlchemy, etc.) integrate easily with PostgreSQL.

Yes, we are aware that Oracle has improved over time, but from our point of view it is one step behind PostgreSQL; for example, DuckDB does not have native support for Oracle yet. Our new PostgreSQL database can be easily installed using the pg_restore command. However, we have decided to keep providing a separate DDL script for each table in case you need them. If you want to read the documentation about the tables and columns in the database, you can visit our new ChEBI Database documentation website.
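
As a rough illustration of the restore step, the sketch below builds a `pg_restore` command line from Python. The dump file name `chebi.dump` and the database name `chebi` are placeholders chosen for this example, not the actual names of the published files, and the target database is assumed to exist already.

```python
import shlex
import subprocess


def build_restore_command(dump_path: str, dbname: str) -> list[str]:
    """Build a pg_restore invocation that loads a dump into an existing database."""
    return [
        "pg_restore",
        "--no-owner",        # do not try to restore the original object ownership
        "--dbname", dbname,  # target database; it must already exist
        dump_path,
    ]


def restore(dump_path: str, dbname: str) -> None:
    """Run pg_restore (requires the PostgreSQL client tools on PATH)."""
    subprocess.run(build_restore_command(dump_path, dbname), check=True)


# Placeholder names: adjust them to the file you downloaded and your database.
command = build_restore_command("chebi.dump", "chebi")
print(shlex.join(command))
```

Only the command is printed here; calling `restore()` would actually run it against your PostgreSQL server.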

ChEBI Flat Files

Before describing this data product, it is important to remember that an entry in ChEBI can have multiple secondary ids and only one primary id, also known as the ChEBI ID (our stable identifiers offered over the years 👵). For example, the compound caffeine has the ChEBI ID (primary id) CHEBI:27732 and three secondary ids: CHEBI:22982, CHEBI:3295 and CHEBI:41472. All of them reference the same compound, but the official ChEBI ID for caffeine is CHEBI:27732. This is crucial because a compound's information (synonyms, cross-references, chemical data, structure, etc.) can be attached to either the primary id (ChEBI ID) or any of the secondary ones.
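
To make the consequence concrete, here is a minimal sketch of the bookkeeping you need when secondary ids are present: every id is first resolved to its primary id, and only then is the information grouped. The mapping contains just the caffeine example from this post; the helper names and the `synonym A/B/C` values are placeholders invented for the illustration.

```python
# Secondary-to-primary mapping for the caffeine example in the text.
SECONDARY_TO_PRIMARY = {
    "CHEBI:22982": "CHEBI:27732",
    "CHEBI:3295": "CHEBI:27732",
    "CHEBI:41472": "CHEBI:27732",
}


def to_primary_id(chebi_id: str) -> str:
    """Resolve a secondary id to its primary id; other ids pass through unchanged."""
    return SECONDARY_TO_PRIMARY.get(chebi_id, chebi_id)


def merge_by_primary(rows):
    """Group (chebi_id, value) pairs under the primary id of each entry."""
    merged = {}
    for chebi_id, value in rows:
        merged.setdefault(to_primary_id(chebi_id), []).append(value)
    return merged


# Placeholder values attached to a mix of primary and secondary ids.
rows = [
    ("CHEBI:27732", "synonym A"),
    ("CHEBI:3295", "synonym B"),
    ("CHEBI:41472", "synonym C"),
]
print(merge_by_primary(rows))
```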

If you don't want to worry about secondary ChEBI IDs, then we highly recommend using the ChEBI flat files data product. The flat files include all the tables provided in the PostgreSQL dump, exported as TSV (tab-separated values) files, with one big difference: they do NOT include secondary ids, so you can be sure every compound identifier in all the tables is ALWAYS the primary one. Depending on your goals, this data product could be more suitable for your analysis.
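
Reading a TSV file needs nothing beyond the Python standard library. The sketch below parses an in-memory stand-in for one flat file. The column names `compound_id` and `name`, and the way the identifier is written, are assumptions made for this example; please check the ChEBI Database documentation for the real table layouts.

```python
import csv
import io

# Stand-in for one flat file. The column names and the single row are
# assumptions for illustration, not the real layout of a ChEBI table.
SAMPLE_TSV = "compound_id\tname\nCHEBI:27732\tcaffeine\n"


def read_tsv(handle):
    """Parse a tab-separated file into a list of dictionaries, one per row."""
    return list(csv.DictReader(handle, delimiter="\t"))


rows = read_tsv(io.StringIO(SAMPLE_TSV))
# Every identifier in the flat files is a primary id, so no resolution
# step is needed before grouping or joining on it.
primary_ids = {row["compound_id"] for row in rows}
print(primary_ids)
```

For a downloaded file, you would pass `open(path, newline="", encoding="utf-8")` to `read_tsv` instead of the `io.StringIO` object.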

Where can I find the new ChEBI data products?

We are using https://ftp.ebi.ac.uk/pub/databases/chebi-2 as the root folder for the data products. Once old ChEBI has been completely turned off, you will be able to find the new data products in the usual FTP location you have been using all these years: https://ftp.ebi.ac.uk/pub/databases/chebi. Our plan is to deprecate the old ChEBI data products at the end of this summer (August 2025), so we really encourage you to start integrating the new products into your current or future work. If you have any comments or concerns, please contact us at: chebi-developers@ebi.ac.uk

Special acknowledgments

A big thank you to the OBO Community, especially Chris Mungall and Charles Tapley Hoyt, as well as all the users who have been using ChEBI during these years.

What is next?

The ChEBI data products are just one of the new features we are going to release soon. Please stay tuned to learn more about ChEBI 2.0.

