We are delighted to announce the release of ChEMBL 34, which includes a full update to drug and clinical candidate drug data. This version of the database, prepared on 28/03/2024, contains:
2,431,025 compounds (of which 2,409,270 have mol files)
3,106,257 compound records (non-unique compounds)
20,772,701 activities
1,644,390 assays
15,598 targets
89,892 documents
Data can be downloaded from the ChEMBL FTP site: https://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/releases/chembl_34/
Please see ChEMBL_34 release notes for full details of all changes in this release: https://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/releases/chembl_34/chembl_34_release_notes.txt
New Data Sources
European Medicines Agency (src_id = 66): European Medicines Agency's data correspond to EMA drugs prior to 20 January 2023 (excluding vaccines). 71 out of the 882 newly added EMA drugs are only authorised by EMA, rather than from other regulatory bodies e.g.
Comments
As to the source of the patent structures: there are a number of initiatives underway at the moment to text-mine chemical structures from patents. We're currently not free to say what some of these sources are, but one source could be the feed from the EPO team.
These structures would be loaded into UniChem (qv) and all the lookups done there.
A big problem with other ways of accessing chemical patent data is shown by your other comments: indirect access through semi-open resources, with a significant onus on the user to ensure they don't violate any explicit or ambiguous usage constraints or licenses.
One of the ideas of patent filings is explicitly to make things easy to find, so researchers don't waste time recreating other people's IP and can also build on top of it. Current systems do not really allow this.