
Posts

ChEMBL tissues: Increasing depth, breadth and accuracy of annotations

Our current tissue annotation efforts have focused on increasing the breadth and depth of the tissue annotation work first started in ChEMBL 22. The figure above represents the increase in depth and coverage from that initial point until now. We continue to use a suite of tissue ontologies, namely Uberon, Experimental Factor Ontology (http://www.ebi.ac.uk/ols/ontologies/efo), CALOHA (ftp://ftp.nextprot.org/pub/current_release/controlled_vocabularies/caloha.obo) and Brenda Tissue Ontology (http://www.ebi.ac.uk/ols/ontologies/bto), to identify assays where the tissue is the assay system. We have increased the detail of the information we capture to reflect the more granular tissues mentioned in the assays, such as 'Popliteal lymph node' and 'Substantia nigra pars compacta', where previously the higher-level terms 'lymph node' and 'Substantia nigra' might have been captured. Plasma based assays: We have recently focused annotation efforts on plasma bas...

Targets in ChEMBL through the years

Evolution of targets over time: While ChEMBL was first released in 2009, the data on which it is built originate from publications extending back to 1975. Despite relatively sparse coverage of the early years compared with now, it is interesting to see how the publicly available data for targets has grown over time. This interactive plot aims to present key data for each of ChEMBL’s targets over the years, in a style inspired by the late Hans Rosling’s TED talk on global development (if you haven't already seen it, I recommend that you watch it now!). As shown above, dragging the slider at the bottom of the plot updates the year to reflect the data available up until that point. The following values are shown:
Y-axis: The cumulative sum of compounds with a pChEMBL value for the target
X-axis: The maximum pChEMBL value or LLE (depending on radio button selection) achieved to date for the target
Point Size: The maximu...
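As a rough illustration of the two axis values described above, here is a minimal Python sketch of a per-year snapshot. The record layout, the sample values and the function name `target_snapshot` are all hypothetical and are not taken from ChEMBL or from the plot's own code. It also assumes that "cumulative sum of compounds" means distinct compounds per target, which the post does not state.

```python
from collections import defaultdict

# Hypothetical activity records (not real ChEMBL data):
# (publication_year, target, compound, pchembl_value)
RECORDS = [
    (1980, "T1", "C1", 5.2),
    (1985, "T1", "C2", 6.8),
    (1985, "T2", "C3", 7.1),
    (1990, "T1", "C1", 6.0),
    (1990, "T2", "C4", 6.5),
]


def target_snapshot(records, year):
    """Summarise each target using only records published up to `year`."""
    compounds = defaultdict(set)  # target -> distinct compounds seen so far
    best = {}                     # target -> highest pChEMBL value seen so far
    for pub_year, target, compound, pchembl in records:
        if pub_year > year:
            continue  # the slider has not reached this record yet
        compounds[target].add(compound)
        best[target] = max(best.get(target, pchembl), pchembl)
    return {
        target: {"n_compounds": len(seen), "max_pchembl": best[target]}
        for target, seen in compounds.items()
    }


if __name__ == "__main__":
    for slider_year in (1980, 1985, 1990):
        print(slider_year, target_snapshot(RECORDS, slider_year))
```

Moving the slider then corresponds to calling `target_snapshot` with a later year: the compound count can only grow, and the maximum pChEMBL value can only stay the same or rise.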

Software Engineer Wanted!

We are currently seeking a talented Software Engineer to work on developing our exciting SureChEMBL resource. SureChEMBL is a publicly available large-scale database containing compounds extracted from the full text, images and attachments of patent documents. The data are extracted from the patent literature daily by an automated text- and image-mining pipeline (see https://academic.oup.com/nar/article/44/D1/D1220/2503067 for more information), producing a database of more than 19 million chemical structures. The successful candidate will have a minimum of 3 years of professional development experience with strong core Java Enterprise Edition development skills (please see the job description below for full requirements). For more details of the position, or to apply, please visit: https://www.embl.de/jobs/searchjobs/index.php?ref=EBI_01104 The closing date for applications is 21st January 2018.

Using autoencoders for molecule generation

Some time ago we found the following paper (https://arxiv.org/abs/1610.02415), so we decided to take a look at it and train the described model using ChEMBL. Luckily, we also found two open source implementations of the model: the original authors' one (https://github.com/HIPS/molecule-autoencoder) and https://github.com/maxhodak/keras-molecules . We decided to rely on the latter, as the original author states that it might be easier to have greater success using it. What is the paper about? It describes how molecules can be generated and specifically designed using autoencoders. First of all, we are going to give a simple and not very technical introduction for those who are not familiar with autoencoders, and then go through an IPython notebook showing a few examples of how to use it. Autoencoder introduction: Autoencoders are one of the many popular unsupervised deep learning algorithms used nowadays in many different fields and for many different purposes. These work wi...