Dear SureChEMBL users,
This is the second blog post in our series about SureChEMBL2.0. If you missed the first one, it is worth taking 5 minutes to read it.
We know that for many of you, not being able to download new patent data from SureChEMBL for several months has had a significant impact on your work. Please rest assured that halting the data release was not a decision taken lightly. Continuing to develop on legacy software had reached a breaking point and could no longer be sustained within our current resource constraints.
For SureChEMBL downloads, we previously offered three types, all focused on patent compounds but differing in update frequency and content:
- MAP files: TSV files released quarterly, listing all compound–patent relationships identified in that period, along with the locations of the compounds within the patents.
- Compound data dump: SDF or TXT files, also released quarterly, containing only the compounds found in that period.
- Data client: A collection of scripts for setting up a local database and downloading daily compound and document data, accessible only on a private FTP server (it was open to everyone, but you had to ask us first).
The data client was especially appreciated by some of you, as it offered near-real-time access—typically one to two weeks after patent publication. However, due to its complexity (for both users and maintainers), outdated code, and fragile logic, we had to make the tough call to stop delivering new data through this method.
On the other hand, the MAP files and compound dumps were widely used thanks to their simplicity. Still, they didn’t suit users who needed quicker access, and the data they provided was too limited for others. The incremental compound data dumps were aimed only at users interested in the patent chemical space.
Since all three approaches were incremental, we reached a point where we couldn't modify our data model or content without breaking compatibility—either with the data client or user workflows. For example, adding patent metadata to MAP files would have dramatically increased file sizes due to their denormalised structure.
Given these limitations, we saw no viable option other than to develop a new unified download system that incorporates the best features of the previous three. Once we had committed to it, we could no longer maintain the legacy download systems while focusing on the new solution.
Introducing: SureChEMBL Bulk Data Download
The bulk data is a new collection of a few core files representing the entire annotated SureChEMBL database. It is conceptually similar to the MAP files, but with much richer content. Here's the schema we're using:
For more technical details, please see the documentation. If you're familiar with relational databases, the structure should be straightforward. One file contains the compounds, another the patent documents, and a third the compound–patent relationships. A fourth smaller file holds metadata about patent sections (e.g., title, abstract, description, claims, images, MOL attachments).
The big advantage: this format mirrors our internal relational database schema, meaning no duplication of compound or patent information. For instance, in MAP files, including the patent title meant repeating it for every compound–patent relationship, inflating file sizes. With the new schema, such repetition is avoided.
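To make the difference concrete, here is a minimal sketch using plain Python structures. The table contents, the column names (`id`, `patent_number`, `title`, `smiles`, `patent_id`, `compound_id`) and the sample values are illustrative assumptions, not the actual SureChEMBL schema; please refer to the documentation for the real columns.

```python
# Toy illustration of a normalised layout versus a MAP-style denormalised one.
# All names and values below are made up for the example.

patents = [
    {"id": 1, "patent_number": "XX-0000001-A1", "title": "Example patent title"},
]
compounds = [
    {"id": 10, "smiles": "CCO"},
    {"id": 11, "smiles": "c1ccccc1"},
    {"id": 12, "smiles": "CC(=O)O"},
]
# One row per compound-patent relationship: only the two identifiers are stored.
patent_compound_map = [
    {"patent_id": 1, "compound_id": 10},
    {"patent_id": 1, "compound_id": 11},
    {"patent_id": 1, "compound_id": 12},
]


def denormalise(patents, compounds, mapping):
    """Join the three tables into MAP-style rows that repeat patent fields."""
    patents_by_id = {p["id"]: p for p in patents}
    compounds_by_id = {c["id"]: c for c in compounds}
    rows = []
    for link in mapping:
        patent = patents_by_id[link["patent_id"]]
        compound = compounds_by_id[link["compound_id"]]
        rows.append({
            "patent_number": patent["patent_number"],
            "title": patent["title"],
            "smiles": compound["smiles"],
        })
    return rows


flat_rows = denormalise(patents, compounds, patent_compound_map)

# Count how many times the patent title is stored in each layout.
title_copies_normalised = sum(1 for p in patents if "title" in p)
title_copies_flat = sum(1 for r in flat_rows if "title" in r)
print(title_copies_normalised, title_copies_flat)  # prints: 1 3
```

In the normalised layout the title is stored once, however many compounds the patent contains; in the flattened layout it is stored once per relationship.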
This structure also allows us to easily extend the data model in the future—new tables or metadata types can be added without breaking anything. And because each release is a snapshot of SureChEMBL at a given time, it’s self-contained: changes in future releases won’t affect previous ones.
Release Frequency & Format
Creating a full snapshot of SureChEMBL takes time, so for now, we plan to deliver updates every two weeks. This is more frequent than the quarterly MAP releases, though not as fast as the former data client.
We are also transitioning to a new file format: Parquet.
Parquet is a columnar storage format widely used in big data platforms like Apache Spark and Hadoop. It’s designed for efficient querying, particularly for analytics that access only a subset of columns. Benefits include faster read performance, better compression, and support for complex nested data.
While interacting with Parquet files requires using a library (e.g., Pandas, PyArrow, Polars, DuckDB in Python), it’s very similar to querying a SQL database.
Here is how to get all the compounds for a given patent:
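As a rough sketch of such a query, the snippet below uses the DuckDB Python package (`pip install duckdb`) and the three file names that appear in the conversion commands further down. The column names (`id`, `patent_number`, `patent_id`, `compound_id`, `smiles`, `inchi_key`) and the helper function names are assumptions made for illustration; please check them against the documentation before use.

```python
from pathlib import Path


def build_compounds_query(data_dir="."):
    """Return SQL listing every compound linked to one patent number.

    The patent number is left as a `?` placeholder and bound at execution time.
    Column names are assumptions, not the confirmed SureChEMBL schema.
    """
    base = Path(data_dir)
    patents = (base / "patents.parquet").as_posix()
    mapping = (base / "patent_compound_map.parquet").as_posix()
    compounds = (base / "compounds.parquet").as_posix()
    return f"""
        SELECT DISTINCT c.id, c.smiles, c.inchi_key
        FROM read_parquet('{patents}') AS p
        JOIN read_parquet('{mapping}') AS m ON m.patent_id = p.id
        JOIN read_parquet('{compounds}') AS c ON c.id = m.compound_id
        WHERE p.patent_number = ?
        ORDER BY c.id
    """


def compounds_for_patent(patent_number, data_dir="."):
    """Run the query and return a list of (id, smiles, inchi_key) tuples."""
    import duckdb  # third-party package, imported where it is needed

    con = duckdb.connect()  # in-memory database; the Parquet files are read in place
    try:
        return con.execute(build_compounds_query(data_dir), [patent_number]).fetchall()
    finally:
        con.close()
```

Calling `compounds_for_patent(...)` with a patent number and the directory holding the downloaded files returns one tuple per distinct compound. `DISTINCT` is used because a compound may be linked to the same patent more than once, for example from different sections.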
Here is another example, where we get the first patent in which a compound (identified by its InChI key) is found:
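A possible version of this lookup, again with the DuckDB Python package, is sketched below. It interprets "first" as the earliest publication date, which is an assumption, as are the column names (`inchi_key`, `publication_date`, `patent_number`, `patent_id`, `compound_id`, `id`) and the helper function names; the documentation remains the reference for the actual schema.

```python
from pathlib import Path


def build_first_patent_query(data_dir="."):
    """Return SQL finding the earliest patent that mentions a compound.

    The InChI key is left as a `?` placeholder and bound at execution time.
    Column names are assumptions, not the confirmed SureChEMBL schema.
    """
    base = Path(data_dir)
    patents = (base / "patents.parquet").as_posix()
    mapping = (base / "patent_compound_map.parquet").as_posix()
    compounds = (base / "compounds.parquet").as_posix()
    return f"""
        SELECT p.patent_number, p.publication_date
        FROM read_parquet('{compounds}') AS c
        JOIN read_parquet('{mapping}') AS m ON m.compound_id = c.id
        JOIN read_parquet('{patents}') AS p ON p.id = m.patent_id
        WHERE c.inchi_key = ?
        ORDER BY p.publication_date ASC, p.patent_number ASC
        LIMIT 1
    """


def first_patent_for_compound(inchi_key, data_dir="."):
    """Return (patent_number, publication_date), or None if the key is unknown."""
    import duckdb  # third-party package, imported where it is needed

    con = duckdb.connect()  # in-memory database; the Parquet files are read in place
    try:
        return con.execute(build_first_patent_query(data_dir), [inchi_key]).fetchone()
    finally:
        con.close()
```

Sorting by patent number as a secondary key only makes the result deterministic when several patents share the same publication date.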
Converting Parquet to CSV
If you’d prefer to use traditional formats like CSV, you can easily convert the files using the DuckDB CLI. First install DuckDB (available for Linux, macOS and Windows). Then, run the following commands:
duckdb -c "COPY (SELECT * FROM 'compounds.parquet') TO 'compounds.csv' (HEADER, DELIMITER ',');"
duckdb -c "COPY (SELECT * FROM 'patents.parquet') TO 'patents.csv' (HEADER, DELIMITER ',');"
duckdb -c "COPY (SELECT * FROM 'patent_compound_map.parquet') TO 'patent_compound_map.csv' (HEADER, DELIMITER ',');"
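If you would rather stay in Python, the same conversion can be sketched with the DuckDB Python package. The function names below are hypothetical helpers, and the script assumes the three Parquet files sit in the same directory; the second helper simply reports file sizes in bytes so the two formats can be compared.

```python
from pathlib import Path

# The three files converted by the CLI commands above.
TABLES = ("compounds", "patents", "patent_compound_map")


def convert_to_csv(data_dir="."):
    """Write a CSV copy of each Parquet file next to the original."""
    import duckdb  # third-party package, imported where it is needed

    base = Path(data_dir)
    con = duckdb.connect()
    try:
        for name in TABLES:
            src = (base / f"{name}.parquet").as_posix()
            dst = (base / f"{name}.csv").as_posix()
            con.execute(
                f"COPY (SELECT * FROM read_parquet('{src}')) "
                f"TO '{dst}' (HEADER, DELIMITER ',')"
            )
    finally:
        con.close()


def size_report(data_dir="."):
    """Return {table: (parquet_bytes, csv_bytes)} for tables present in both formats."""
    base = Path(data_dir)
    report = {}
    for name in TABLES:
        parquet_path = base / f"{name}.parquet"
        csv_path = base / f"{name}.csv"
        if parquet_path.exists() and csv_path.exists():
            report[name] = (parquet_path.stat().st_size, csv_path.stat().st_size)
    return report
```

Running `convert_to_csv()` followed by `size_report()` in the download directory gives the byte counts side by side.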
After that, take a look at the file sizes: you’ll see just how much more efficient Parquet really is.
Deprecation of Old Downloads
The legacy downloads are now officially deprecated:
- MAP files and compound data dumps: latest release from 2023
- Data client: last update on 20 June 2024
Summary
With the new SureChEMBL bulk data download, our goals are:
• Easier maintenance on our side = more reliable releases for you
• Flexibility to modify past data and add new fields
• Clear documentation for all changes so you can adapt your workflows
We’re very excited to finally share this with you, and we hope it delivers the data you need in a modern, flexible, and efficient way. Please don’t hesitate to reach out and tell us what you think!
One Last Thing…
You may notice a difference in the total compound count compared to previous downloads. This is intentional: we’ve been upgrading our chemistry pipeline and have migrated to RDKit to ensure better consistency with ChEMBL. You can read more about what’s changed in the next article.
The SureChEMBL team