
Posts

Accessing SureChEMBL data in bulk

It is the peak of the summer (at least in this hemisphere) and many of our readers/users will be on holiday, perhaps on an island enjoying the sea. Luckily, for the rest of us there is still the 'sea' of SureChEMBL data waiting to be enjoyed and explored for hidden 'treasures' (let me know if I pushed this analogy too far). See here and here for a reminder of what SureChEMBL is and what it does. This wealth of (big) data can be accessed via the SureChEMBL interface, where users can submit quite sophisticated and granular queries by combining: i) Lucene fields against full-text and bibliographic metadata and ii) advanced structure query features against the annotated compound corpus. Examples of such queries will be the topic of a future post. Once the search results are back, users can browse through and export the chemistry from the patent(s) of interest. In addition to this functionality, we've been receiving user requests for local (behind the ...

LSH-based similarity search in MongoDB is faster than postgres cartridge.

TL;DR: In his excellent blog post, Matt Swain described the implementation of compound similarity searches in MongoDB. Unfortunately, Matt's approach had suboptimal (polynomial) time complexity with respect to decreasing similarity thresholds, which renders it unsuitable for production environments. In this article, we improve on the method by enhancing it with a Locality Sensitive Hashing algorithm, which significantly reduces query time and outperforms the RDKit PostgreSQL cartridge.

myChEMBL 21 - NoSQL edition

Given that NoSQL technologies applied to computational chemistry and cheminformatics are gaining traction and popularity, we decided to include a taster in future myChEMBL releases. Two especially appealing technologies are Neo4j and MongoDB. The former is a graph database and the latter is a BSON document store. We would like to provide IPython notebook-based tutorials explaining how to use this software to deal with common cheminformat...
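To make the Locality Sensitive Hashing idea from the TL;DR more concrete, here is a minimal, illustrative sketch in pure Python. It is not the implementation from the post: it assumes MinHash signatures with banding, keeps everything in memory rather than in MongoDB, and treats a fingerprint as a set of 'on' bit positions. All names (`LSHIndex`, `minhash`, `NUM_HASHES`, `BANDS`) are hypothetical.

```python
import random
from collections import defaultdict

# Assumed parameters for illustration: 64 hash functions split into 16 bands of 4 rows.
NUM_HASHES = 64
BANDS = 16
ROWS = NUM_HASHES // BANDS
PRIME = 2_147_483_647  # 2**31 - 1, used as the modulus of the hash family

_rng = random.Random(42)  # fixed seed so signatures are reproducible
HASH_PARAMS = [(_rng.randrange(1, PRIME), _rng.randrange(0, PRIME))
               for _ in range(NUM_HASHES)]


def minhash(bits):
    """MinHash signature of a non-empty set of 'on' fingerprint bit positions."""
    return tuple(min((a * b + c) % PRIME for b in bits) for a, c in HASH_PARAMS)


def tanimoto(x, y):
    """Exact Tanimoto (Jaccard) similarity of two bit-position sets."""
    union = len(x | y)
    return len(x & y) / union if union else 0.0


class LSHIndex:
    """In-memory banded MinHash index; a stand-in for a database collection."""

    def __init__(self):
        self.buckets = defaultdict(set)   # (band number, band values) -> names
        self.fingerprints = {}            # name -> frozenset of bit positions

    def _band_keys(self, signature):
        for band in range(BANDS):
            yield band, signature[band * ROWS:(band + 1) * ROWS]

    def add(self, name, bits):
        self.fingerprints[name] = frozenset(bits)
        for key in self._band_keys(minhash(bits)):
            self.buckets[key].add(name)

    def query(self, bits, threshold):
        """Collect candidates sharing at least one band, then verify exactly."""
        bits = frozenset(bits)
        candidates = set()
        for key in self._band_keys(minhash(bits)):
            candidates |= self.buckets.get(key, set())
        return sorted(name for name in candidates
                      if tanimoto(bits, self.fingerprints[name]) >= threshold)


index = LSHIndex()
index.add("a", {1, 5, 9, 12, 40})
index.add("b", {100, 200, 300})
print(index.query({1, 5, 9, 12, 40}, 0.8))  # an identical fingerprint is always a candidate
```

The point of the banding step is that only compounds sharing at least one band with the query are compared exactly, so most of the collection is never touched. The trade-off is that near neighbours can occasionally be missed; the number of bands and rows controls how likely that is.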

Paper: Activity, assay and target data curation and quality in the ChEMBL database

We've just published an Open Access paper in the Journal of Computer-Aided Molecular Design on the curation of bioactivity, assay and target data in ChEMBL, including current practices and future plans.

Here is the abstract: The emergence of a number of publicly available bioactivity databases, such as ChEMBL, PubChem BioAssay and BindingDB, has raised awareness about the topics of data curation, quality and integrity. Here we provide an overview and discussion of the current and future approaches to activity, assay and target data curation of the ChEMBL database. This curation process involves several manual and automated steps and aims to: (1) maximise data accessibility and comparability; (2) improve data integrity and flag outliers, ambiguities and potential errors; and (3) add further curated annotations and mappings thus increasing the usefulness and accuracy of the ChEMBL data for all users and modellers in particular. Issues related to activity, assay ...

ChEMBL python client update

Along with updating ChEMBL web services to the new 2.x version, we've also updated the python client library (chembl_webresource_client). The change was backwards compatible so it's possible that existing users haven't even noticed the change. As we've already provided examples of using new web services via cURL or using live docs, now is a good time to explain the changes made to the python client. First of all, if you haven't installed (or updated) it yet, you can do it using Python Package Index: Now you can access new functionality using the following import statement: Just as a mild warning, in 0.8.x versions of the client the new part will be called new_client. In 0.9.x it will be renamed to client, and the old part will be renamed to old_client and deprecated. In 1.0.x the old functionality will be removed completely. OK, so since we know how to import our new_client object, we can try to do something useful. Let's retrieve...
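The install and import steps mentioned in this excerpt can be sketched roughly as follows. This is a hedged illustration, not the code from the original post: it assumes the package has been installed from the Python Package Index (for example with `pip install chembl_webresource_client`), that the 0.8.x-style `new_client` name described above applies, and that network access to the ChEMBL web services is available. The helper name `fetch_molecule` is hypothetical.

```python
def fetch_molecule(chembl_id):
    """Return the molecule record for a single ChEMBL ID.

    The import is done lazily inside the function because the client
    package must be installed and it contacts the ChEMBL web services.
    """
    # Import path assumed for the 0.8.x-style client described in the post.
    from chembl_webresource_client.new_client import new_client

    return new_client.molecule.get(chembl_id)
```

Calling `fetch_molecule("CHEMBL25")` should then return a dictionary-like record for that compound, provided the service is reachable.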

Biological annotations in SureChEMBL

Termite annotation in action. (Termite not to scale)

SureChEMBL is perhaps the only freely available, large-scale, comprehensive and live resource of chemistry extracted from the patent literature. SureChEMBL automatically annotates, normalises and indexes chemistry found in the full text, images and attachments (i.e. mol files) of patent documents. The next logical step for us was to complement the chemical annotations with biological ones, such as mentions of gene names and classifications, protein classes and disease indications. As a first step in this direction, we used Termite provided by SciBite (via funding from OpenPHACTS) to integrate these annotations dynamically into the full text patent view of the SureChEMBL user interface; in other words, you can now view biological annotations on-the-fly.

How do I add the annotations and navigate through them?

There is now an additional checkbox underneath the 'Highlight additional recognised...

myChEMBL + docker

In addition to the myChEMBL 20 VM images released earlier, today we are very happy to release myChEMBL Docker images.

What's docker?

Docker is a new open-source project that automates the deployment of distributed applications. It takes advantage of some cool new features of the modern Linux kernel in order to run virtual containers, avoiding the overhead of starting and maintaining virtual machines [from Wikipedia]. In contrast to virtual machines, which emulate virtual hardware, docker containers employ the kernel of the host machine so they don't require or include the whole operating system. While still separated from the host, they only add a very thin level of abstraction [ZDNet article].

Why docker?

Docker is an emerging technology; it has become extremely popular over the last year and has been adopted and used by the largest IT companies, such as RedHat, Canonical and Microsoft. Basically, using this platform you can do three things: Buil...

We're recruiting!

Want to join the ChEMBL team? We are seeking to recruit an experienced Web Application Developer to join the Chemogenomics Team at the European Bioinformatics Institute (EMBL-EBI). You will develop a series of web-based applications and interfaces for the ChEMBL chemogenomic resources. In collaboration with senior team members you will also have a role in determining and advising on the web development strategy for the chemogenomic resources. In addition you will be involved with the development, maintenance and documentation of these tools and supporting their usage within EMBL-EBI and externally. The position will also involve some requirement gathering and use-case development. For more information or to apply for this position follow this link: http://ig14.i-grasp.com//fe/tpl_embl01.asp?newms=jj&id=53807&aid=15470

The ChEMBL team