Posts

ChEMBL 24 Released!

We are pleased to announce the release of ChEMBL 24. This version of the database, prepared on 23/04/2018, contains:

2,275,906 compound records
1,828,820 compounds (of which 1,820,035 have mol files)
15,207,914 activities
1,060,283 assays
12,091 targets
69,861 documents

Data can be downloaded from the ChEMBL ftp site: ftp://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/releases/chembl_24_1

Please see ChEMBL_24 release notes for full details of all changes in this release: ftp://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/releases/chembl_24_1/chembl_24_1_release_notes.txt

Change in data model and addition of activity properties and supplementary data: A new data submission format and database loader have been implemented. The new deposition system allows more advanced functionality, including the ability to update previously deposited data sets, and the ability to deposit activity data again...

Striving for Perfect Representation of Chemical Structures – is this possible?

It probably goes without saying that at ChEMBL, we have a desire to make all our data as accurate and useful as possible. With this in mind, we have spent many hours over the last few years trying to curate, in particular, the structures of marketed drugs and clinical candidates. We aren’t alone in this, and more than 5 years ago people were coming across the same problems, as highlighted in this blog post by ChemConnector on Fluvastatin. Our drug curation is an ongoing and probably never-ending task but, to be honest, it has proved a lot more difficult than we expected. This is for several reasons: Firstly, where to go to find the definitive structure of a molecule? One would have thought this would be easy, but even sources such as INN and USAN don’t always agree. For example, for Telavancin the USAN_data_sheet shows a difference in the nitrogen and carbon counts in the structure images compared with the images in the INN document (although the molecular formula are t...

Schema changes coming in ChEMBL_24

Since ChEMBL was first released in 2009, the diversity of data sources and data types in the database has increased significantly. Increasingly, we are dealing with more complex assays such as measurement of drug pharmacokinetic parameters or toxicology data sets such as clinical biochemistry and tissue histopathology data. There are a number of problems handling these kinds of assays with the current data model/database schema. For example, since parameters such as compound doses or time points could not be recorded against individual activity measurements (only against the whole assay), such experiments were typically split so that a separate assay was created for each compound or time point measured. This is obviously far from ideal. Another issue is that such experiments frequently measure or derive multiple endpoints from a particular assay (e.g., AUC, Cmax, tmax, t1/2 for a pharmacokinetic study) or produce large amounts of raw data that may need to be associated with summary-level ...
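To make the problem above concrete, here is a minimal sketch of what recording parameters against individual activity measurements could look like. It is not the actual ChEMBL_24 schema: the class names, field names and the `activities_at` helper are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ActivityProperty:
    """A parameter attached to ONE measurement (hypothetical structure)."""
    name: str      # e.g. "DOSE" or "TIME"
    value: float
    units: str


@dataclass
class Activity:
    """A single measured or derived endpoint for one compound."""
    compound_id: str
    endpoint: str  # e.g. "Cmax", "AUC"
    value: float
    units: str
    properties: List[ActivityProperty] = field(default_factory=list)


@dataclass
class Assay:
    """One experiment; it no longer has to be split per dose or time point."""
    description: str
    activities: List[Activity] = field(default_factory=list)


def activities_at(assay: Assay, prop_name: str, prop_value: float) -> List[Activity]:
    """Return the activities carrying a property with the given name and value."""
    return [
        a for a in assay.activities
        if any(p.name == prop_name and p.value == prop_value for p in a.properties)
    ]


# Made-up example data: one pharmacokinetic assay, two doses, one endpoint each.
pk_assay = Assay(
    description="Illustrative pharmacokinetic study",
    activities=[
        Activity("CPD_A", "Cmax", 1.2, "ug.mL-1",
                 [ActivityProperty("DOSE", 10.0, "mg.kg-1")]),
        Activity("CPD_A", "Cmax", 3.9, "ug.mL-1",
                 [ActivityProperty("DOSE", 30.0, "mg.kg-1")]),
    ],
)

low_dose = activities_at(pk_assay, "DOSE", 10.0)
```

With the dose stored on each measurement, both dose levels stay in a single assay and can still be separated when needed, which is the situation the old one-assay-per-dose workaround was trying to approximate.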

Join the ChEMBL Team!

We are looking for talented individuals to help us maintain and develop the ChEMBL and SureChEMBL resources and currently have a number of open positions within the team. If you are looking for an exciting new role and would like to work with us on the beautiful Wellcome Genome Campus, here are details of the positions:

Data Integration Scientist

We are looking for a Scientist with a passion for data integration to manage the incorporation of drug discovery data into the ChEMBL database. Responsibilities will include:

Responsibility for the handling, processing and integration of data into the ChEMBL database.
Facilitating the deposition of datasets directly into ChEMBL through working with external collaborators.
Applying text- & data-mining techniques for the development of effective large-scale curation strategies.
Developing methods for the application and maintenance of ontologies in ChEMBL.
Working with other teams to facilitate the integration of...

Have you heard of CORBEL?

Briefly, CORBEL is an initiative of thirteen biological and medical research infrastructures, which together create a platform for harmonised user access to biological and medical technologies, biological samples and data services required by cutting-edge biomedical research. Did you know that ChEMBL, through ELIXIR, participates in the project and provides its expertise in, among other things, identification of existing bioactivities for compounds of interest, profiling of chemotypes, target identification, data storage and distribution? But of course, CORBEL gives you access to different services working in many different biomedical areas. If you want to screen the compounds you have identified and then use Electron Microscopy to observe their effect on a cell type of interest, there are services for you! This is just one example of how CORBEL can help boost your research project(s). Don’t forget, we are 37 partners! As part of the WP...

ChEMBL tissues: Increasing depth, breadth and accuracy of annotations

Our current tissue annotation work has focused on increasing the breadth and depth of the effort first started in ChEMBL 22. The figure above represents the increased depth and coverage from that initial point until now. We continue to use a suite of tissue ontologies, namely Uberon, Experimental Factor Ontology (http://www.ebi.ac.uk/ols/ontologies/efo), CALOHA (ftp://ftp.nextprot.org/pub/current_release/controlled_vocabularies/caloha.obo) and Brenda Tissue Ontology (http://www.ebi.ac.uk/ols/ontologies/bto), to identify assays where the tissue is the assay system. We have increased the detail of information we capture to reflect the more granular tissues mentioned in the assays, such as 'Popliteal lymph node' and 'Substantia nigra pars compacta', where previously the higher-level terms 'lymph node' and 'Substantia nigra' might have been captured.

Plasma-based assays

We have recently focused annotation efforts on plasma bas...

Targets in ChEMBL through the years

Evolution of targets over time

While ChEMBL was first released in 2009, the data on which it is built originate from publications extending back to 1975. Despite relatively sparse coverage from the early years in comparison to now, it is interesting to see how the publicly available data for targets has grown over time. This interactive plot aims to present key data for each of ChEMBL’s targets over the years, in a style inspired by the late Hans Rosling’s TED talk on global development (if you haven't already seen it, I recommend that you watch it now!).

As shown above, dragging the slider at the bottom of the plot updates the year to reflect the data available up until that point. The following values are shown:

Y-axis: The cumulative sum of compounds with a pChEMBL value for the target
X-axis: The maximum pChEMBL value or LLE (depending on radio button selection) achieved to date for target
Point Size: The maximu...
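The two axis values described above can be sketched for a single target as follows. This is only an illustration and not the code behind the plot: the `target_timeline` function, the record layout and the example values are hypothetical, and the sketch assumes that the cumulative sum counts distinct compounds.

```python
from collections import defaultdict


def target_timeline(records):
    """Summarise one target year by year.

    `records` is an iterable of (year, compound_id, pchembl) tuples for a
    single target. Returns a list of
    (year, cumulative_distinct_compounds, max_pchembl_to_date) tuples,
    sorted by year.
    """
    by_year = defaultdict(list)
    for year, compound_id, pchembl in records:
        by_year[year].append((compound_id, pchembl))

    seen_compounds = set()        # compounds reported up to the current year
    best_pchembl = float("-inf")  # highest pChEMBL value seen so far
    timeline = []
    for year in sorted(by_year):
        for compound_id, pchembl in by_year[year]:
            seen_compounds.add(compound_id)
            best_pchembl = max(best_pchembl, pchembl)
        timeline.append((year, len(seen_compounds), best_pchembl))
    return timeline


# Made-up records for one target: (publication year, compound, pChEMBL value).
example_records = [
    (1995, "CPD_A", 5.2),
    (1995, "CPD_B", 6.1),
    (2001, "CPD_A", 7.0),
    (2010, "CPD_C", 6.5),
]

timeline = target_timeline(example_records)
for year, n_compounds, max_pchembl in timeline:
    print(year, n_compounds, max_pchembl)
```

Each tuple corresponds to one slider position: the compound count only ever grows, and the maximum pChEMBL value only moves when a more potent measurement is published.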