Tuesday, 14 May 2019
The use and application of multi-task neural networks is growing rapidly in cheminformatics and drug discovery. Examples can be found in the following publications:
- Deep Learning as an Opportunity in Virtual Screening
- Massively Multitask Networks for Drug Discovery
- Beyond the hype: deep neural networks outperform established methods using a ChEMBL bioactivity benchmark set
But what is a multi-task neural network? In short, it's a kind of neural network architecture that can optimise multiple classification/regression problems at the same time while taking advantage of their shared description. This blogpost gives a great overview of their architecture. All networks in references above implement the hard parameter sharing approach.
So, given a set of activities relating targets and molecules, we can train a single neural network as a binary multi-label classifier that outputs the probability of activity/inactivity for each of the targets (tasks) for a given query molecule.
PyTorch is currently one of the most popular open source AI libraries. It is gaining a lot of traction in research environments, it is deeply integrated with the NumPy ecosystem, and it implements a dynamic graph approach that makes models easier to debug.
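To make the idea concrete, here is a minimal sketch of a hard parameter sharing network in PyTorch. It is not the code from the notebooks: the class name `MultiTaskNet`, the layer sizes, the `masked_bce` helper and the random toy data are all assumptions made for illustration. The mask is there because, in a molecule/target matrix, most pairs have no measurement and should not contribute to the loss.

```python
import torch
import torch.nn as nn


class MultiTaskNet(nn.Module):
    """Hard parameter sharing: one shared trunk, one output logit per task."""

    def __init__(self, n_features, n_tasks, hidden=256):
        super().__init__()
        # Layers shared by every task (hypothetical sizes).
        self.shared = nn.Sequential(
            nn.Linear(n_features, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
        )
        # One logit per target; equivalent to n_tasks single-unit heads.
        self.heads = nn.Linear(hidden, n_tasks)

    def forward(self, x):
        return self.heads(self.shared(x))


def masked_bce(logits, labels, mask):
    """Binary cross-entropy averaged over measured (mask == 1) entries only."""
    loss = nn.functional.binary_cross_entropy_with_logits(
        logits, labels, reduction="none"
    )
    return (loss * mask).sum() / mask.sum().clamp(min=1)


# Toy data standing in for fingerprints and activity labels (random, not ChEMBL).
torch.manual_seed(0)
n_mols, n_bits, n_targets = 64, 1024, 5
x = torch.randint(0, 2, (n_mols, n_bits)).float()
y = torch.randint(0, 2, (n_mols, n_targets)).float()
mask = (torch.rand(n_mols, n_targets) > 0.5).float()  # 1 = label is known

model = MultiTaskNet(n_bits, n_targets)
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)

for epoch in range(20):
    optimizer.zero_grad()
    loss = masked_bce(model(x), y, mask)
    loss.backward()
    optimizer.step()

# One probability of activity per target for every query molecule.
with torch.no_grad():
    probs = torch.sigmoid(model(x))
```

Each row of `probs` holds one probability per target, which is the multi-label output described above.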
We have some interesting references, we have data in ChEMBL, we have PyTorch and RDKit... what are we waiting for?
First of all, we'll need to extract the data from ChEMBL and format it for our purpose. The following notebook explains step by step how to do it. The output will be an H5 file that you can also download from here in case you want to go directly to the network training phase.
Notebook to extract the data
Nice! We have the data, let's go then through the main notebook and train a model!
Notebook to train the model
This was a simple example. We hope you enjoyed it and will be inspired to experiment with deeper architectures, skip connections, different learning rate strategies, more epochs, early stopping... and so on!
Notebooks also available on GitHub
Posted by Eloy at 5/14/2019 10:18:00 am
Wednesday, 17 April 2019
We have two exciting opportunities for scientists to come and work with the ChEMBL team at the Wellcome Genome Campus in Hinxton near Cambridge.
If you've used ChEMBL in the past, perhaps now is the chance to come and shape its future. Even if you haven't, this is a great place to work, and in both positions you will collaborate not only with the people developing the ChEMBL resources but also with our collaborators here at Hinxton and around Europe. These include the Open Targets project and EU-funded toxicology projects such as EU-ToxRisk and eTRANSAFE.
We are looking for:
(1) A talented chemoinformatician to work on methods for the annotation, searching and visualization of toxicologically relevant data. You will develop pipelines and tools to enable the better prediction and assessment of the toxicity of pharmaceutical and environmental chemicals.
Closing Date 19th May 2019
More details here
(2) A protein computational scientist to develop, assess and validate methods for quantifying target tractability with the goal of incorporating such methodologies into the Open Targets informatics platform www.targetvalidation.org. This exciting work will focus on developing new methods based on protein structure and sequence.
Closing Date 26th May 2019
More details here
Don't delay, apply today.
Thursday, 28 March 2019
- 2,335,417 compound records
- 1,879,206 compounds (of which 1,870,461 have mol files)
- 15,504,603 activities
- 1,125,387 assays
- 12,482 targets
- 72,271 documents
Data can be downloaded from the ChEMBL ftp site: ftp://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/releases/chembl_25
Please see ChEMBL_25 release notes for full details of all changes in this release: ftp://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/releases/chembl_25/chembl_25_release_notes.txt
DATA CHANGES SINCE THE LAST RELEASE
# Deposited Data Sets:
Kuster Lab Chemical Proteomics Drug Profiling (src_id = 48, Document ChEMBL_ID = CHEMBL3991601):
Data have been included from the publication: The target landscape of clinical kinase drugs. Klaeger S, Heinzlmeir S and Wilhelm M et al (2017), Science, 358-6367 (https://doi.org/10.1126/science.aan4368)
# In Vivo Assay Classification:
A classification scheme has been created for in vivo assays. This is stored in the ASSAY_CLASSIFICATION table in the database schema and consists of a three-level classification. Level 1 corresponds to the top levels of the ATC classification, i.e., anatomical system/therapeutic area (e.g., CARDIOVASCULAR SYSTEM, MUSCULO-SKELETAL SYSTEM, NERVOUS SYSTEM). Level 2 provides a more fine-grained classification of the phenotype or biological process being studied (e.g., Learning and Memory, Anti-Obesity Activity, Gastric Function). Level 3 represents the specific in vivo assay being performed (e.g., Laser Induced Thrombosis, Hypoxia Tolerance Test in Rats, Paw Edema Test) and is assigned a specific ASSAY_CLASS_ID. Individual in vivo assays in the ChEMBL ASSAYS table are mapped to reference in vivo assays in the ASSAY_CLASSIFICATION table via the ASSAY_CLASS_MAP table. More information about the classification scheme is available in the following publication: https://doi.org/10.1038/sdata.2018.230. The assay classification is available via web services and will be included in the ChEMBL web interface in the near future.
# Updated Data Sets:
Patent Bioactivity Data
BindingDB Database (corrections to compound structures)
WEB INTERFACE/WEB SERVICE CHANGES SINCE THE LAST RELEASE
# Web Interface:
The new ChEMBL web interface is now live at https://www.ebi.ac.uk/chembl (this replaces the previous beta version). The old ChEMBL web interface will be retired before the ChEMBL_26 release, but is available on the following URL until then: https://www.ebi.ac.uk/chembl/old. The new interface provides richer search and filtering capabilities. Documentation regarding this new functionality and frequently asked questions are available on our help pages: https://chembl.gitbook.io/chembl-interface-documentation/
# Changes to Web Services:
The Assay web service has been updated to include both assay_parameters and the in vivo assay classification for an assay (where applicable):
A separate endpoint has also been created for the in vivo assay classification:
The Activity web service has been updated to include activity_properties. The 'published_type', 'published_relation', 'published_value' and 'published_units' fields have also been renamed to 'type', 'relation', 'value' and 'units':
A new endpoint has been created to retrieve supplementary data associated with an activity measurement (or list of measurements):
SCHEMA CHANGES SINCE THE LAST RELEASE
# Tables Added:
ASSAY_CLASSIFICATION:
Classification scheme for phenotypic assays e.g., by therapeutic area, phenotype/process and assay type. Can be used to find standard assays for a particular disease area or phenotype e.g., anti-obesity assays. Currently data are available only for in vivo efficacy assays.
COLUMN_NAME      DATA_TYPE        COMMENT
ASSAY_CLASS_ID   NUMBER(9,0)      Primary key
L1               VARCHAR2(100)    High level classification e.g., by anatomical/therapeutic area
L2               VARCHAR2(100)    Mid-level classification e.g., by phenotype/biological process
L3               VARCHAR2(1000)   Fine-grained classification e.g., by assay type
CLASS_TYPE       VARCHAR2(50)     The type of assay being classified e.g., in vivo efficacy
SOURCE           VARCHAR2(50)     Source from which the assay class was obtained
ASSAY_CLASS_MAP:
Mapping table linking assays to classes in the ASSAY_CLASSIFICATION table
COLUMN_NAME      DATA_TYPE        COMMENT
ASS_CLS_MAP_ID   NUMBER           Primary key
ASSAY_ID         NUMBER           Foreign key that maps to the 'assays' table
ASSAY_CLASS_ID   NUMBER           Foreign key that maps to the 'assay_classification' table
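As an illustration of how the two new tables fit together, the sketch below builds a tiny in-memory SQLite database with the columns described above and looks up all assays for a given Level 2 phenotype. The rows are made-up toy data (the `TOY_...` identifiers and the L1/L3 strings are placeholders, not real ChEMBL records), the `assays` table is reduced to two columns, and the helper name `assays_for_phenotype` is hypothetical. The same join should carry over to a real ChEMBL_25 database dump, but that has not been checked here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE assays (assay_id INTEGER PRIMARY KEY, chembl_id TEXT);
    CREATE TABLE assay_classification (
        assay_class_id INTEGER PRIMARY KEY,
        l1 TEXT, l2 TEXT, l3 TEXT, class_type TEXT, source TEXT
    );
    CREATE TABLE assay_class_map (
        ass_cls_map_id INTEGER PRIMARY KEY,
        assay_id INTEGER REFERENCES assays (assay_id),
        assay_class_id INTEGER REFERENCES assay_classification (assay_class_id)
    );

    -- Toy rows for illustration only.
    INSERT INTO assays VALUES
        (1, 'TOY_ASSAY_1'), (2, 'TOY_ASSAY_2'), (3, 'TOY_ASSAY_3');
    INSERT INTO assay_classification VALUES
        (10, 'TOY AREA A', 'Learning and Memory', 'Toy assay type A',
         'In vivo efficacy', 'toy'),
        (11, 'TOY AREA B', 'Anti-Obesity Activity', 'Toy assay type B',
         'In vivo efficacy', 'toy');
    INSERT INTO assay_class_map VALUES
        (100, 1, 10), (101, 2, 11), (102, 3, 11);
    """
)


def assays_for_phenotype(connection, l2):
    """Return (chembl_id, l3) pairs for assays classified under the given L2."""
    query = """
        SELECT a.chembl_id, ac.l3
        FROM assays a
        JOIN assay_class_map m ON m.assay_id = a.assay_id
        JOIN assay_classification ac ON ac.assay_class_id = m.assay_class_id
        WHERE ac.l2 = ?
        ORDER BY a.chembl_id
    """
    return connection.execute(query, (l2,)).fetchall()


anti_obesity = assays_for_phenotype(conn, "Anti-Obesity Activity")
```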
# Columns Removed:
PUBLISHED_TYPE (DEPRECATED in ChEMBL_24, now removed, replaced by TYPE)
PUBLISHED_RELATION (DEPRECATED in ChEMBL_24, now removed, replaced by RELATION)
PUBLISHED_VALUE (DEPRECATED in ChEMBL_24, now removed, replaced by VALUE)
PUBLISHED_UNITS (DEPRECATED in ChEMBL_24, now removed, replaced by UNITS)
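If you have client code or cached records that still use the old names, one possible way to bridge the rename is a small mapping like the one below. This is only a sketch: the helper name `upgrade_activity_record` and the example record (including its values) are made up for illustration, and only the four renames listed above are handled.

```python
# Old name -> new name, as listed in the release notes above.
RENAMED_FIELDS = {
    "published_type": "type",
    "published_relation": "relation",
    "published_value": "value",
    "published_units": "units",
}


def upgrade_activity_record(record):
    """Return a copy of an activity record using the ChEMBL_25 field names.

    Keys that were not renamed are passed through unchanged.
    """
    return {RENAMED_FIELDS.get(key, key): value for key, value in record.items()}


# Hypothetical pre-ChEMBL_25 record (values are invented).
old_record = {
    "activity_id": 1,
    "published_type": "IC50",
    "published_relation": "=",
    "published_value": "10",
    "published_units": "nM",
}
new_record = upgrade_activity_record(old_record)
```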
Work contributing to ChEMBL_25 was funded by the Wellcome Trust, EMBL Member States, Open Targets, National Institutes of Health (NIH) Common Fund, EU Innovative Medicines Initiative (IMI) and EU Framework 7 programmes. Please see https://www.ebi.ac.uk/chembl/funding for more details.
The ChEMBL Team
Tuesday, 19 February 2019
In 2019 we celebrate the 10th anniversary of the first public release of the ChEMBL database. To recognise this important landmark we are organising a one-day symposium to celebrate the work achieved by ChEMBL during its first ten years, and look forward to its future.
The symposium will be held on Tuesday 8th October in the Francis Crick Auditorium on the Wellcome Genome Campus, Hinxton, Cambridge, UK. A series of talks from invited speakers will be followed by a celebratory birthday cake and drinks reception. During the breaks, the poster session will be a great opportunity to catch up with other users of the ChEMBL database and chat to colleagues, co-workers and others to find out more about how the database is being used.
For the programme of invited talks, and more information on how to register, see https://www.ebi.ac.uk/about/events/10-years-of-chembl
Thursday, 7 February 2019
You know that in the ChEMBL group, we love to play with the data we collect!! Back in April 2014, we started to work on a target prediction tool. Wow! This was almost 5 years ago! Since then, we have continued to update the tool for each new ChEMBL release, providing you with the actual models and the result of the prediction on the ChEMBL website for the drug molecules. The good news is that these target predictions are not dead and a successor is on its way!
First, we would like to introduce you to some closely related work. You may have heard about conformal prediction (CP). If not, it is a machine learning framework developed to associate confidence with predictions. I personally consider this a requirement for decision making. Basically, you train a model as you would in QSAR, but then you first predict a so-called calibration set, for which you know the actual values. For each of these observations you obtain two probabilities: one for the active and one for the inactive class (in a typical classification scheme). Now that you have this information, each time you predict a new compound you compare its probabilities to those of the calibration set (the non-conformity scores, as they are called) and derive a p-value for each class. Based on your predefined significance level, the compound can be assigned to different categories: only active or only inactive, but also both active and inactive, or neither. I am sure you can start seeing here the added value of CP!
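For the curious, the calibration step can be sketched in a few lines of plain Python. This is an illustration under assumptions, not the protocol from our article: it uses one common non-conformity score (1 minus the predicted probability of the class), calibrates each class separately, and runs on invented calibration probabilities. The function names `p_value` and `prediction_set` are hypothetical.

```python
CLASSES = {0: "inactive", 1: "active"}


def p_value(calibration_scores, new_score):
    """Fraction of calibration scores at least as non-conforming as the new one."""
    n_at_least = sum(1 for score in calibration_scores if score >= new_score)
    return (n_at_least + 1) / (len(calibration_scores) + 1)


def prediction_set(cal_probs, cal_labels, new_probs, significance):
    """Return the set of class names whose p-value exceeds the significance level.

    cal_probs:  list of (p_inactive, p_active) for the calibration compounds
    cal_labels: list of true classes (0 = inactive, 1 = active)
    new_probs:  (p_inactive, p_active) for the query compound
    """
    result = set()
    for cls, name in CLASSES.items():
        # Non-conformity score (assumed): 1 - probability of the class,
        # computed on calibration compounds that truly belong to that class.
        cal_scores = [
            1 - probs[cls]
            for probs, label in zip(cal_probs, cal_labels)
            if label == cls
        ]
        if p_value(cal_scores, 1 - new_probs[cls]) > significance:
            result.add(name)
    return result


# Invented calibration set: four actives followed by four inactives.
cal_probs = [
    (0.1, 0.9), (0.3, 0.7), (0.5, 0.5), (0.7, 0.3),  # true actives
    (0.9, 0.1), (0.7, 0.3), (0.5, 0.5), (0.3, 0.7),  # true inactives
]
cal_labels = [1, 1, 1, 1, 0, 0, 0, 0]

confident = prediction_set(cal_probs, cal_labels, (0.1, 0.9), significance=0.25)
ambiguous = prediction_set(cal_probs, cal_labels, (0.5, 0.5), significance=0.25)
```

With this toy calibration set, a compound predicted active with high probability ends up in the "only active" category, while a 50/50 compound is assigned to both classes. An empty set (neither class) becomes possible with a larger calibration set, since the smallest achievable p-value is 1/(n + 1) for n calibration compounds in a class.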
Here I have briefly detailed how it works for classification models, but CP can also be applied to regression models. If you want to know more about conformal prediction, I recommend reading this book and also this very nice example of its application in drug discovery. Having learnt how to build conformal predictors, we were intrigued to know how well they perform against traditional QSAR models with our ChEMBL data!
With this in mind, we decided to build a panel of models using a substantial data set from ChEMBL. With our new protocol, we were able to build models for 788 targets (550 of them human targets). For the descriptors we used RDKit Morgan fingerprints (2048 bits and radius 2) and 6 physicochemical descriptors. For the machine learning part we used the good old Random Forests as implemented in Scikit-learn version 0.19. For the QSAR models, this is all that is needed, but for CP you need a framework, and this was supplied by the very nice library written by Henrik Linusson.
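As a rough illustration of the fingerprint part of the descriptors, the sketch below computes a 2048-bit, radius 2 Morgan fingerprint as a NumPy array. It assumes a recent RDKit release that provides `rdFingerprintGenerator`; the helper name `morgan_bits` and the example SMILES are ours, and the 6 physicochemical descriptors are left out because they are not named in this post.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator


def morgan_bits(smiles, radius=2, n_bits=2048):
    """Return the Morgan bit fingerprint of a SMILES string as a 0/1 NumPy array."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles!r}")
    generator = rdFingerprintGenerator.GetMorganGenerator(
        radius=radius, fpSize=n_bits
    )
    return generator.GetFingerprintAsNumPy(mol)


# Two simple molecules (ethanol and benzene) stacked into a feature matrix.
features = np.vstack([morgan_bits(smi) for smi in ("CCO", "c1ccccc1")])
```

A matrix like `features` can then be passed to a Scikit-learn estimator such as `RandomForestClassifier`.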
The next part consisted of training the models and checking their internal performance, but we went a bit further and decided that with our models trained on ChEMBL_23 data, it would be interesting to see how they perform with new data in ChEMBL_24 in a so-called temporal validation. All the details, results and conclusion are presented in the recently accepted article!
The dataset for each target is already available here and you can find the models ready to use there.
Feel free to take a look and to share your opinion in the comments.
Now, you remember that I started this post mentioning our good old target predictors. So does it mean a new generation of ChEMBL models using conformal prediction is ready to be launched for our users? Well, unfortunately not yet, but stay tuned!