Showing posts from 2014

Accessing web services with cURL

ChEMBL web services are really friendly. We provide live online documentation, plus support for CORS and JSONP to help web developers create their own web widgets. For Python developers, we provide a dedicated client library, as well as examples using the client and the well-known requests library in the form of an IPython notebook. There are also examples for Java and Perl; you can find them here. But this is nothing for real UNIX/Linux hackers. Real hackers use cURL. And there is a good reason to do so. cURL comes preinstalled on many Linux distributions as well as OS X. It follows the Unix philosophy and can be combined with other tools using pipes. Finally, it can be used inside bash scripts, which is very useful for automating tasks. Unfortunately, first experiences with cURL can be frustrating. For example, after studying the cURL manual pages, one may think that the following will return a set of compounds in JSON format: But the result is quite disappointing... The reason is

Finding key compounds in med. chemistry patents: The open way

A couple of us attended the 3rd RDKit UGM, hosted by Merck in Darmstadt this year. It was an excellent opportunity to catch up with RDKit developments and applications and meet up with other loyal "RDKitters". I presented a talk-torial there and went through an IPython Notebook, which some of you may find useful. It uses patent chemistry data extracted from SureChEMBL and, after a series of filtering steps, applies a few "traditional" chemoinformatics approaches to a set of claimed compounds. My ultimate aim was to identify "key compounds" in patents using compound information alone, inspired by papers such as this and this. The crucial difference is that these authors used commercial data and software, whereas in this implementation everything is free and open. At the same time, I wanted to show off what the combination of pandas, scikit-learn, mpld3, Beaker, RDKit, IPython Notebook and SureChEMBL can do nowadays (hint: a lot). So,

Using ChEMBL web services via proxy.

It is common practice for organizations and companies to use proxy servers to connect to services outside their network. This can cause problems for users of the ChEMBL web services who sit behind a proxy server. So, to help those users who have asked, we provide the following quick guide, which demonstrates how to access ChEMBL web services via a proxy. Most software libraries respect proxy settings from environment variables. You can set the proxy variable once, normally HTTP_PROXY, and then use that variable to set other related proxy environment variables: Or if you have different proxies responsible for different protocols: On Windows, this would be: If you are accessing the ChEMBL web services programmatically and you prefer not to clutter your environment, you can consider adding the proxy settings to your scripts. Here are some Python-based recipes: 1. Official ChEMBL client library If you are working in a Python-based environment, we recommend
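The idea of keeping proxy settings inside a script, rather than in the environment, might look roughly like the sketch below. It uses only the Python standard library and is not the official client recipe that the post goes on to describe; the helper name `make_proxy_opener` and the proxy address `http://proxy.example.org:3128` are placeholders invented for illustration.

```python
import urllib.request


def make_proxy_opener(http_proxy, https_proxy=None):
    """Return (opener, proxies) that route HTTP and HTTPS traffic via a proxy.

    If no separate HTTPS proxy is given, the HTTP proxy is reused for both,
    mirroring the "set one variable, reuse it for the others" approach.
    """
    proxies = {
        "http": http_proxy,
        "https": https_proxy or http_proxy,
    }
    handler = urllib.request.ProxyHandler(proxies)
    opener = urllib.request.build_opener(handler)
    return opener, proxies


# Placeholder proxy address (an assumption): replace with your own proxy.
opener, proxies = make_proxy_opener("http://proxy.example.org:3128")

# opener.open(url) would now send the request for `url` through the proxy,
# without setting HTTP_PROXY or HTTPS_PROXY in the environment.
```

Because the settings live in the `opener` object, other programs started from the same shell are unaffected.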

An overview and invitation to contribute to ChEMBL curation with PPDMs

PPDMs has been in the making for more than a year and is a follow-up on a conference paper we published in 2012. As in 2012, our objective is to map small molecule binding sites to protein domains, the structural units that form recurring building blocks in the evolution of proteins. An application note describing PPDMs is just out in Bioinformatics. Mapping small molecule binding to protein domains The mapping facilitates the functional interpretation of small molecule-protein interactions - if you understand which domain in a protein is targeted, you are in a better position to anticipate the downstream effect. Mapping small molecule binding to protein domains also provides a technical advantage to machine-learning approaches that incorporate protein sequence information as a descriptor to predict small molecule bioactivity. Reducing the sequence descriptor to the part that mediates small molecule binding increases the informative content of the descriptor. This is best exemp

Paper: PPDMs – A resource for mapping small molecule bioactivities from ChEMBL to Pfam-A protein domains

We've just published an Open Access paper in Bioinformatics on an approach to annotate the region of ligand binding within a target protein. This has a lot of applications in the use of ChEMBL, in particular providing greater accuracy in mapping functional effects, improving ligand-based target prediction approaches, and reducing false positives in sequence/target searching of ChEMBL. Where next for this work? Well, annotating to a site-specific level would be a good thing to implement (think about HIV-1 RT with the distinct nucleoside and non-nucleoside sites). Here's the abstract... Summary: PPDMs is a resource that maps small molecule bioactivities to protein domains from the Pfam-A collection of protein families. Small molecule bioactivities mapped to protein domains add important precision to approaches that use protein sequence searches and alignments to assist applications in computational drug discovery and systems and chemical biology. We have previously propos

Django model describing ChEMBL database.

TL;DR: We have just open sourced our Django ORM Model, which describes the ChEMBL relational database schema. This means you no longer need to write another line of SQL code to interact with the ChEMBL database. We think it is pretty cool and we are using it in the ChEMBL group to make our lives easier. Read on to find out more.... It is never a good idea to use SQL code directly in Python. Let's see some basic examples explaining why: Can you see what is wrong with the code above? The SQL keyword `JOIN` was misspelled as 'JION'. But it's hard to spot quickly because most code highlighters apply Python syntax rules and ignore the contents of strings. In our case the string is very important, as it contains the SQL statement. The problem above can be easily solved using a simple Python SQL wrapper, such as edendb. This wrapper provides a set of functions to perform database operations, for example 'select', 'insert' and 'delete': No
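The misspelt-keyword problem can be reproduced with a small, self-contained sketch. This is not the code from the original post: the in-memory SQLite database, the `molecule` and `activity` tables and the helper `run` are all invented for illustration and do not reflect the ChEMBL schema.

```python
import sqlite3

# Toy in-memory database (hypothetical tables, not the ChEMBL schema).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE molecule (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE activity (id INTEGER PRIMARY KEY, molecule_id INTEGER, value REAL);
    INSERT INTO molecule VALUES (1, 'aspirin');
    INSERT INTO activity VALUES (1, 1, 5.2);
    """
)

# To Python both queries are just strings, so neither the interpreter nor a
# Python syntax highlighter notices the typo in the first one.
BROKEN = "SELECT m.name, a.value FROM molecule m JION activity a ON a.molecule_id = m.id"
FIXED = "SELECT m.name, a.value FROM molecule m JOIN activity a ON a.molecule_id = m.id"


def run(query):
    """Execute a query, returning its rows or a message describing the SQL error."""
    try:
        return conn.execute(query).fetchall()
    except sqlite3.OperationalError as exc:
        # The typo only surfaces here, when the database parses the string.
        return f"SQL error: {exc}"


broken_result = run(BROKEN)
fixed_result = run(FIXED)
```

Here `fixed_result` holds the joined row, while `broken_result` holds an error message that only appears at run time. This is the kind of late failure that a wrapper or an ORM model is meant to avoid.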

myChEMBL 19 Released

We are very pleased to announce that the latest myChEMBL release, based on the ChEMBL 19 database, is now available to download. In addition to the extra data, you will also find a number of great new features. So what's new then? More core chemoinformatics tools We have included OSRA (Optical Structure Recognition), which is useful for extracting compound structures from images. OSRA can be accessed from the command line or via a very convenient web interface, provided by Beaker (described below). We've also added OpenBabel - another great open source cheminformatics toolkit. This means you can now experiment with both RDKit and OpenBabel and use whichever you prefer. ChEMBL Beaker myChEMBL now ships with a local instance of the ChEMBL Beaker service. For those not familiar with Beaker, the service provides users with an array of chemoinformatics utilities via a RESTful API. Under the hood, Beaker is using RDKit and OSRA to carry out it

New Drug Approvals 2014 - Pt. XII - Naloxegol (Movantik™)

ATC Code: A06AH03 Wikipedia: Naloxegol ChEMBL: CHEMBL2219418 On September 16th the FDA approved Movantik (naloxegol, AZ-13337019), as an oral treatment for patients with opioid-induced constipation and chronic non-cancer pain. Naloxegol Naloxegol is an opioid receptor antagonist. Due to its similarity to noroxymorphone, a main metabolite of oxycodone, naloxegol is classed as a controlled substance. However, the FDA analysed its abuse potential and concluded that there was no risk of dependency. Mode of Action Opioids are a class of drugs which are used to manage pain, but have a common side effect of reducing the motility of the gastrointestinal tract, making bowel movements difficult. Opioids work by binding to the mu-receptors (CHEMBL233, UniProt:P35372) in the central nervous system, thereby reducing pain. However, they are also able to bind to the mu-receptors in the gastrointestinal tract, hence causing opioid-induced constipation.

New Drug Approvals 2014 - Pt. XI - Idelalisib (Zydelig™)

ATC Code: L01XX47 Wikipedia: Idelalisib ChEMBL: CHEMBL2216870 On July 23rd the FDA approved Zydelig (idelalisib, GS-1101), as an orally-delivered drug to treat patients with three types of blood cancers. • Relapsed chronic lymphocytic leukemia (CLL) • Relapsed follicular B-cell, non-Hodgkin lymphoma (FL) • Relapsed small lymphocytic lymphoma (SLL) Blood cancer The three main categories of blood cancer are leukemia, lymphoma and myeloma. Lymphoma is also split into two types: Hodgkin lymphoma and non-Hodgkin lymphoma. Both leukemia and myeloma occur in the bone marrow, whilst lymphoma is a cancer that is isolated to the lymphatic system. Acute leukemia is where there is an abundance of underdeveloped white blood cells that can’t function properly and chronic leukemia is where there are just far too many white blood cells, which is just as bad as having too few. Myeloma is where the plasma cells form tumours in the bone marrow. Idelal

The great US patent spike on SureChEMBL

Apparently, there was a huge spike of newly granted US patents released by the USPTO a few days ago. The reason? In March 2013, US patent law changed. The ‘first to invent’ became ‘first inventor to file’ for patent protection purposes (see more on this here). As a result, a lot of people rushed to submit applications just before the change. Fast forward 18 months (to last week), and a huge spike in USPTO granted patents is observed. Did SureChEMBL pick that up? See below the cumulative count plot of new patent documents: And the corresponding compound count extracted from these patents: For more information on SureChEMBL, see our previous posts. George

SureChEMBL Available Now

Followers of the ChEMBL group's activities and this blog will be aware of our involvement in the migration of the previously commercial SureChem chemistry patent system to a new, free-to-all system known as SureChEMBL. Today we are very pleased to announce that the migration process is complete and the SureChEMBL website is now online. SureChEMBL provides the research community with the ability to search the patent literature using Lucene-based keyword queries and, much more importantly, chemistry-based queries. If you are not familiar with SureChEMBL, we recommend you review the content of these earlier blogposts here and here. SureChEMBL is a live system, which is continuously extracting chemical entities from the patent literature. The time it takes for a new chemical in the patent literature to become searchable in the SureChEMBL system is 1-2 days (WO patents can sometimes take a bit longer due to an additional reprocessing step). At time of writi

Papers: Literature text mining and extensions to UniChem

Two new papers from the group have just been published, both in Journal of Cheminformatics - and of course both Open Access. The first deals with some extensions to UniChem to allow far more flexible searches. The abstract is: UniChem is a low-maintenance, fast and freely available compound identifier mapping service, recently made available on the Internet. Until now, the criterion of molecular equivalence within UniChem has been on the basis of complete identity between Standard InChIs. However, a limitation of this approach is that stereoisomers, isotopes and salts of otherwise identical molecules are not considered as related. Here, we describe how we have exploited the layered structural representation of the Standard InChI to create new functionality within UniChem that integrates these related molecular forms. The service, called ‘Connectivity Search’, allows molecules to be first matched on the basis of complete identity between the connectivity layer of their co

We're hiring! Web developer for NIH Illuminating the Druggable Genome (IDG) project

We got a prize today, so we are happy. What better way to celebrate than to recruit someone new for the group? We have a position available for a developer to support web service development and integration for the Knowledge Management Centre part of the recently announced NIH Illuminating the Druggable Genome project; see this link for details of the job. The closing deadline for applications is 12th October 2014.

SureChEMBL Update 1

As announced in the previous SureChEMBL blogpost, the temporary holding page is now in place, and users who visit the original SureChem addresses are redirected to it. For updates on the release of the new SureChEMBL site, please keep an eye on the ChEMBL-og.

SureChEMBL Coming Very Soon

In the coming weeks we will be very pleased to announce the release of the new SureChEMBL website. Since the beginning of the year, we have been working hard with the folks over at Digital Science, along with all the content and software providers, to get the system set up and running in our own Amazon Web Services environment. As we approach the final stages of the transition, we will need to temporarily halt access to the original SureChem site. The reason for this minor disruption is to allow us to complete the testing of the additional functionality we have added to the SureChEMBL user interface. We will use the ChEMBL-og as the primary route for communicating with users, so if you want to be kept up to date, bookmark the site. We will also make ad hoc tweets about SureChEMBL on @johnpoverington, @georgeisyourman, @surechembl and @chembl. SureChEMBL User Interface Users familiar with the previous SureChem UI will find a lot in common with the new SureChEM

Citing ChEMBL, and Data DOIs

There are now multiple formats and ways to access the ChEMBL data, and we have recently assigned DOIs to all available versions of ChEMBL (and will archive these permanently on the FTP server). So when you publish work that uses ChEMBL, please reference the following papers: ChEMBL Database A. Gaulton, L. Bellis, J. Chambers, M. Davies, A. Hersey, Y. Light, S. McGlinchey, R. Akhtar, A.P. Bento, B. Al-Lazikani, D. Michalovich, & J.P. Overington (2012) ‘ChEMBL: A Large-scale Bioactivity Database For Chemical Biology and Drug Discovery’ Nucleic Acids Res. Database Issue, 40 D1100-D1107. DOI:10.1093/nar/gkr777 PMID:21948594 A.P. Bento, A. Gaulton, A. Hersey, L.J. Bellis, J. Chambers, M. Davies, F.A. Krüger, Y. Light, L. Mak, S. McGlinchey, M. Nowotka, G. Papadatos, R. Santos & J.P. Overington (2014) ‘The ChEMBL bioactivity database: an update’ Nucleic Acids Res. Database Issue, 42 D1083-D1090. DOI:10.1093/nar/gkt1031 PMID:24214965 myChEMBL R. Ochoa, M. Davies