ChEMBL Identifiers




A few notes about the use and format of identifiers in ChEMBL:

Each of the major entity types within ChEMBL (documents, assays, compounds and targets) is assigned a unique ChEMBL identifier, which takes the form of a 'CHEMBL' prefix followed immediately by an integer (e.g., CHEMBL25 is the compound aspirin, CHEMBL210 is the human beta-2 adrenergic receptor "target"). The identifier format does not distinguish between entity types, but a given ChEMBL identifier will only ever be assigned to a single entity (i.e., CHEMBL25 will only ever be used for the compound aspirin and never for an assay, document or target). A lookup table is provided in the database to resolve which identifiers correspond to which entity types.
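As a rough illustration of the format and the lookup step, here is a minimal Python sketch. The names CHEMBL_ID_PATTERN, ID_LOOKUP, is_chembl_id and entity_type are made up for this example, and the small dictionary is only a stand-in for the real lookup table in the database; its two entries are the examples mentioned above.

```python
import re

# Format described above: the literal prefix "CHEMBL" followed
# immediately by an integer.
CHEMBL_ID_PATTERN = re.compile(r"CHEMBL[0-9]+")

# Hypothetical in-memory stand-in for the database lookup table.
# The two entries are the examples given in the text.
ID_LOOKUP = {
    "CHEMBL25": "compound",
    "CHEMBL210": "target",
}


def is_chembl_id(text):
    """Return True if text has the shape of a ChEMBL identifier."""
    return CHEMBL_ID_PATTERN.fullmatch(text) is not None


def entity_type(chembl_id):
    """Resolve an identifier to its entity type, or None if it is unknown here."""
    if not is_chembl_id(chembl_id):
        raise ValueError(f"not a ChEMBL identifier: {chembl_id!r}")
    return ID_LOOKUP.get(chembl_id)


if __name__ == "__main__":
    for candidate in ["CHEMBL25", "CHEMBL210", "CHEMBL", "chembl25"]:
        print(candidate, is_chembl_id(candidate))
    print(entity_type("CHEMBL210"))
```

Because the identifier itself carries no type information, the type can only come from the lookup, which is why the sketch keeps the format check and the type resolution as separate steps.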

ChEMBL identifiers are stable with respect to the entities they represent. For compounds (with known/defined structures), ChEMBL identifiers represent distinct compound structures, as defined by the standard InChI, e.g., CHEMBL25 represents: InChI=1S/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12). Therefore, two compounds reported in different papers but having the same standard InChI will be assigned the same ChEMBL ID. 
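The idea that the standard InChI is the key for a compound identifier can be sketched in a few lines of Python. This is only an illustration: the record layout, the "paper A"/"paper B" sources and the resolve helper are hypothetical, and the mapping holds just the aspirin example from the text.

```python
# Standard InChI for aspirin, as quoted in the text.
ASPIRIN_INCHI = (
    "InChI=1S/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/"
    "h2-5H,1H3,(H,11,12)"
)

# Hypothetical mapping from standard InChI to ChEMBL ID.
INCHI_TO_CHEMBL_ID = {ASPIRIN_INCHI: "CHEMBL25"}

# Two made-up compound records, reported under different names
# in different papers but with the same standard InChI.
records = [
    {"source": "paper A", "name": "aspirin", "standard_inchi": ASPIRIN_INCHI},
    {"source": "paper B", "name": "acetylsalicylic acid", "standard_inchi": ASPIRIN_INCHI},
]


def resolve(record):
    """Look up the ChEMBL ID by standard InChI, ignoring name and source."""
    return INCHI_TO_CHEMBL_ID.get(record["standard_inchi"])


resolved_ids = {resolve(record) for record in records}

if __name__ == "__main__":
    print(resolved_ids)
```

Both records resolve to the same identifier because the lookup uses only the standard InChI and never the reported name or the source paper.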

These ChEMBL IDs will never be reassigned to a structure with a different standard InChI. However, since compounds may be reported or drawn incorrectly in the literature, it is sometimes necessary to alter the compound ChEMBL ID (structure) to which a particular bioactivity measurement links. In this case, the old (incorrect) ChEMBL identifier may be 'downgraded' in the database if no other data link to it. Downgraded compounds are not currently displayed on the live interface, but are retained in the database and the ChEMBL ID lookup table, and could be re-instated in future (with the same ChEMBL ID) if new data become available for them. 
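The re-linking and downgrading behaviour described above can be modelled roughly as follows. This is a simplified sketch, not how the ChEMBL database implements it: the function name, the "active"/"downgraded" labels and the identifiers CHEMBL_OLD and CHEMBL_NEW are placeholders invented for the example, and only bioactivity links are considered when deciding whether an identifier is still in use.

```python
def relink_activity(activity_links, status, activity_id, new_compound_id):
    """Point one activity at a corrected compound ID.

    activity_links maps activity ID -> compound ID.
    status maps compound ID -> "active" or "downgraded" (labels are
    assumptions made for this sketch).
    The old ID is never deleted or reused; it is only marked as
    downgraded when no remaining activity links to it.
    """
    old_compound_id = activity_links[activity_id]
    activity_links[activity_id] = new_compound_id
    # New data for an ID makes it active (also covers re-instatement).
    status[new_compound_id] = "active"
    still_linked = old_compound_id in activity_links.values()
    if old_compound_id != new_compound_id and not still_linked:
        status[old_compound_id] = "downgraded"
    return old_compound_id


if __name__ == "__main__":
    # Placeholder identifiers, not real ChEMBL IDs.
    links = {"activity_1": "CHEMBL_OLD", "activity_2": "CHEMBL_OLD"}
    status = {"CHEMBL_OLD": "active", "CHEMBL_NEW": "active"}

    relink_activity(links, status, "activity_1", "CHEMBL_NEW")
    print(status["CHEMBL_OLD"])  # another activity still links to it

    relink_activity(links, status, "activity_2", "CHEMBL_NEW")
    print(status["CHEMBL_OLD"])  # nothing links to it any more
```

Note that the old identifier stays in the status table after it is downgraded, mirroring the point that downgraded compounds are retained and can be re-instated under the same ChEMBL ID.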

External identifiers for ChEMBL entities are also recorded in the database, where possible. For example, in addition to ChEMBL IDs and InChI/InChIKeys, all small molecule compounds with defined structures are assigned ChEBI identifiers. Where data are taken from other resources, the original identifiers are also retained (e.g., SIDs and AIDs for PubChem substances and assays, HET codes for PDBe ligands). PubMed identifiers or Digital Object Identifiers (DOIs) are stored for documents, and protein targets are represented by primary accessions from the UniProt database.
