Monday, 22 August 2011

ChEMBL Identifiers

A few notes about the use and format of identifiers in ChEMBL:

Each entity of the major types within ChEMBL (documents, assays, compounds and targets) is assigned a unique ChEMBL identifier, which takes the form of a ‘CHEMBL’ prefix followed immediately by an integer (e.g., CHEMBL25 is the compound aspirin, CHEMBL210 is the human beta-2 adrenergic receptor "target"). The identifier format does not differ between entity types, but a given ChEMBL identifier will only ever be assigned to a single entity (i.e., CHEMBL25 will only ever be used for the compound aspirin and never for an assay, document or target). A lookup table is provided in the database to resolve which identifiers correspond to which entity types.
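As a rough illustration of the format described above, the following Python sketch validates an identifier and resolves its entity type. The regular expression, the case-sensitive matching, the function names and the in-memory dictionary are all assumptions made for this example; the dictionary merely stands in for the lookup table in the database, whose actual schema is not described here.

```python
import re

# Assumption: the literal prefix 'CHEMBL' followed immediately by one or
# more digits, matched case-sensitively.
CHEMBL_ID_PATTERN = re.compile(r"CHEMBL([0-9]+)")

# Hypothetical stand-in for the database lookup table. Only the two
# examples given in the text are included; the type labels are invented.
ENTITY_TYPE_LOOKUP = {
    "CHEMBL25": "COMPOUND",   # aspirin
    "CHEMBL210": "TARGET",    # human beta-2 adrenergic receptor
}


def parse_chembl_id(identifier):
    """Return the integer part of a ChEMBL identifier, or raise ValueError."""
    match = CHEMBL_ID_PATTERN.fullmatch(identifier.strip())
    if match is None:
        raise ValueError(f"not a ChEMBL identifier: {identifier!r}")
    return int(match.group(1))


def entity_type(identifier):
    """Return the entity type for an identifier, or None if it is unknown."""
    parse_chembl_id(identifier)  # reject malformed input first
    return ENTITY_TYPE_LOOKUP.get(identifier.strip())
```

Because the identifier itself carries no type information, the type can only come from the lookup step; the pattern check just rejects strings that cannot be ChEMBL identifiers at all.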

ChEMBL identifiers are stable with respect to the entities they represent. For compounds (with known/defined structures), ChEMBL identifiers represent distinct compound structures, as defined by the standard InChI, e.g., CHEMBL25 represents: InChI=1S/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12). Therefore, two compounds reported in different papers but having the same standard InChI will be assigned the same ChEMBL ID. 
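The rule that identical standard InChIs share one ChEMBL ID can be sketched as a registry keyed by the InChI string. This is a minimal illustration, not the actual ChEMBL registration code: the class name, the method name and the simple counter used to mint new identifiers are hypothetical.

```python
class CompoundRegistry:
    """Toy registry that assigns one identifier per distinct standard InChI."""

    def __init__(self, first_number=1):
        self._id_by_inchi = {}            # standard InChI -> ChEMBL-style ID
        self._next_number = first_number  # hypothetical ID counter

    def register(self, standard_inchi):
        """Return the existing ID for this InChI, or mint a new one."""
        existing = self._id_by_inchi.get(standard_inchi)
        if existing is not None:
            return existing
        new_id = f"CHEMBL{self._next_number}"
        self._next_number += 1
        self._id_by_inchi[standard_inchi] = new_id
        return new_id


# The aspirin InChI quoted above, registered as if from two different papers.
ASPIRIN = "InChI=1S/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)"
registry = CompoundRegistry(first_number=25)
id_from_paper_a = registry.register(ASPIRIN)
id_from_paper_b = registry.register(ASPIRIN)
```

Here both registrations return the same identifier, and the mapping from an InChI to its identifier is never overwritten once created.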

These ChEMBL IDs will never be reassigned to a structure with a different standard InChI. However, since compounds may be reported or drawn incorrectly in the literature, it is sometimes necessary to alter the compound ChEMBL ID (structure) to which a particular bioactivity measurement links. In this case, the old (incorrect) ChEMBL identifier may be 'downgraded' in the database if no other data link to it. Downgraded compounds are not currently displayed on the live interface, but are retained in the database and the ChEMBL ID lookup table, and could be re-instated in future (with the same ChEMBL ID) if new data become available for them. 
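The re-linking and downgrading behaviour described above might look something like the following sketch. The dictionaries, the status labels 'ACTIVE' and 'DOWNGRADED' and the function name are assumptions for illustration only, and the identifiers in the usage lines are made-up placeholders rather than real ChEMBL records.

```python
def relink_activity(activity_to_compound, status, activity_id, corrected_id):
    """Point one activity at a corrected compound ID.

    activity_to_compound maps activity keys to compound IDs; status maps
    compound IDs to 'ACTIVE' or 'DOWNGRADED' (hypothetical labels).
    Returns the compound ID the activity was previously linked to.
    """
    old_id = activity_to_compound[activity_id]
    activity_to_compound[activity_id] = corrected_id

    # A compound that receives data is (re-)instated under its original ID.
    status[corrected_id] = "ACTIVE"

    # The old ID is downgraded only if no other data still link to it.
    # Its status entry is kept, so the ID is never lost or reused.
    if old_id != corrected_id and old_id not in activity_to_compound.values():
        status[old_id] = "DOWNGRADED"
    return old_id


# Placeholder data: two measurements linked to a wrongly drawn structure.
links = {"activity_1": "CHEMBL900001", "activity_2": "CHEMBL900001"}
status = {"CHEMBL900001": "ACTIVE"}
relink_activity(links, status, "activity_1", "CHEMBL900002")
relink_activity(links, status, "activity_2", "CHEMBL900002")
```

After the second call nothing links to the first placeholder ID, so it is marked as downgraded while remaining in the status table, mirroring the way downgraded compounds are retained in the database.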

External identifiers for ChEMBL entities are also recorded in the database, where possible. For example, in addition to ChEMBL IDs and InChI/InChIKeys, all small molecule compounds with defined structures are assigned ChEBI identifiers. Where data are taken from other resources, the original identifiers are also retained (e.g., SIDs and AIDs for PubChem substances and assays, HET codes for PDBe ligands). PubMed identifiers or Digital Object Identifiers (DOIs) are stored for documents, and protein targets are represented by primary accessions from the UniProt database.

