The ChEMBL-og

Posts

Forthcoming Conferences

A number of conferences and meetings coming up in the next few weeks might be of interest. Firstly, it's not too late to register for the KNIME Spring Summit in Berlin, 24th to 26th February. More details here. The next SME Forum will be held on the Wellcome Genome Campus at Hinxton, near Cambridge, on 7th and 8th March. Come and find out more about EMBL-EBI's freely available data resources, including ChEMBL and SureChEMBL. More details on the meeting and registration here. The UKQSAR and Physchem Forum Joint Symposium is a two-day meeting being held on 15th and 16th March at Stevenage in the UK. There are a limited number of places still available, and you must register (by 29th Feb) if you want to attend. More details can be found here. Last but not least, consider going to the Spring ACS meeting in San Diego, 13th to 17th...

ChEMBL 21 is coming soon...

We are pleased to announce that, after a long wait, the next ChEMBL release is finally on its way. We will be making the data available in the next couple of weeks. In the meantime, here is a sneak preview of what has been added (though we've been quiet, we have been busy working on some nice new features). Clinical candidates - we have added data on >900 drug candidates in clinical trials, together with their mechanisms of action. This initial set focusses on candidates modulating kinase, GPCR and nuclear hormone receptor targets, but we will be adding broader coverage in future releases. Drug indications - we have collated indications for FDA-approved drugs from a number of sources and provided these using controlled vocabularies/ontologies (MeSH and EFO). Drug metabolism and PK data - we have extracted information on pharmacokinetics and drug metabolic pathways from the journal Drug Metabolism and Disposition, FDA approval packages and a variety of other sour...

Wanted: Experienced Java Developer

We are looking for an experienced Java developer/contractor to work with us on a really innovative text-mining and Java backend development project, related to SureChEMBL and the Illuminating the Druggable Genome (IDG) grant. The role will be on a 6-month contract and will involve the development of new Java components for the extraction, storage and provision of biological named entities in patent documents. The suitable candidate (like Alice above) will have more than three years' hands-on experience working as an Enterprise Java Developer in production environments with large codebases. Experience in the field of chemo/bioinformatics would be ideal but is not a deal-breaker. More information on the job description and requirements here. If you think you have what it takes, drop us a line with your CV.

SureChEMBL: A New Hope

US-D254080-S SureChEMBL has disrupted the field of patent chemistry by liberating chemical structures and knowledge locked in text and images, and by making the compound-patent associations freely and fully searchable and accessible on a daily basis to everyone: academics, IP professionals, content providers, software vendors, biotechs, small and big pharma, and related chemical industries. The speed, scale and scope of the data are unprecedented for a public resource. SureChEMBL has been around for less than two years; during this time, it has evolved into a full-blown chemistry resource provided by the EMBL-EBI. The SureChEMBL interface was revamped and released last year, including combined keyword and structure-based queries against the annotated patent corpus. All chemistry is integrated with UniChem, and there are several ways to access the data in bulk, including flat files and a data client. Very soon, the data will be fully integrated and avai...

Advanced keyword and structure searches with SureChEMBL

Previously in the SureChEMBL series, we described how to access SureChEMBL data in bulk, offline and locally. So, you may ask, what is the point in using the SureChEMBL web interface? Well, how about the unprecedented functionality that allows you to submit very granular queries by combining (i) Lucene fields against full-text and bibliographic metadata and (ii) advanced structure query features against the annotated compound corpus, both at the same time? Let’s look at each one separately first. Lucene-powered keyword searching: you may use the main text box for simple keyword-based patent searches, such as ‘Apple’, ‘diabetes’ or even ‘chocolate cake’ (the patent corpus as a recipe book is a new use-case here). You will get a lot of results and probably a lot of noise. With Lucene fields, you can slice and dice a query by indicating specific patent sections and bibliographic metadata, such as date/year of filing or publication, assignee, patent classification code,...
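To make the idea of fielded keyword queries concrete, here is a minimal Python sketch that assembles a query string in the classic Lucene syntax (`field:"phrase"`, `field:[low TO high]`, clauses joined with `AND`). The field names `title` and `year` are hypothetical placeholders, not SureChEMBL's actual field codes, and the helper functions are made up for illustration; check the SureChEMBL documentation for the real field names before pasting a query into the search box.

```python
def field_term(field: str, value: str) -> str:
    """Build a quoted phrase clause such as title:"chocolate cake"."""
    # Escape backslashes first, then double quotes, so the phrase stays intact.
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'{field}:"{escaped}"'


def field_range(field: str, low: int, high: int) -> str:
    """Build an inclusive range clause such as year:[2010 TO 2015]."""
    return f"{field}:[{low} TO {high}]"


def all_of(*clauses: str) -> str:
    """Require every clause to match by joining them with AND."""
    return "(" + " AND ".join(clauses) + ")"


if __name__ == "__main__":
    # Hypothetical field names; replace with the real SureChEMBL ones.
    query = all_of(
        field_term("title", "chocolate cake"),
        field_range("year", 2010, 2015),
    )
    print(query)  # (title:"chocolate cake" AND year:[2010 TO 2015])
```

Narrowing a keyword to one section and one date range in this way is what cuts down the noise of a plain full-text search.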

Is ChEMBL down or is it just me?

Have you ever wondered whether your favorite resource for bioactive molecule data is down, or whether some temporary network issue is making it unavailable from your end? There are many online tools that can help in such cases (for example, downforeveryoneorjustme.com or similar websites). However, we now provide a much better solution, the ChEMBL status page (http://chembl.github.io/status/). As you may notice, the status page is hosted on GitHub, so it is outside of the EBI infrastructure. This means that even when the ChEMBL core websites are down, you should still be able to see the status page (assuming that GitHub is online, which is quite a reasonable assumption, despite occasional incidents). We've placed a link to the status page at the bottom of the left-side navigation menu on the main ChEMBL web page, as it provides some useful information even when everything is fine. The status page presents information about the health of ChEMBL's most critical...
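The "is it down or is it just me?" reasoning can also be roughly approximated from your own machine: probe the service, and if that fails, probe an independent reference host to see whether your network is working at all. The Python sketch below is only an illustration of that logic, not how the status page itself works. The function names `probe` and `diagnose` are hypothetical, and the ChEMBL web address in the usage portion is an assumption; only the status page URL comes from the post.

```python
import urllib.error
import urllib.request
from typing import Callable


def probe(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers without an HTTP or network error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status < 400
    except (urllib.error.URLError, OSError, ValueError):
        # HTTP errors, DNS failures, timeouts and malformed URLs all count as "down".
        return False


def diagnose(service_url: str, reference_url: str,
             probe_fn: Callable[[str], bool] = probe) -> str:
    """Classify an outage by comparing the service with an independent host."""
    if probe_fn(service_url):
        return "service reachable"
    if probe_fn(reference_url):
        return "service unreachable, but your network works"
    return "nothing reachable, probably a local network issue"


if __name__ == "__main__":
    # Assumed service URL; the reference is the GitHub-hosted status page.
    print(diagnose("https://www.ebi.ac.uk/chembl/",
                   "http://chembl.github.io/status/"))
```

Using the GitHub-hosted page as the reference mirrors the point made above: it lives outside the EBI infrastructure, so it should stay reachable even when the ChEMBL sites are not.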

Paper: Managing expectations: assessment of chemistry databases generated by automated extraction of chemical structures from patents

Our collaborators at GSK have just published an Open Access paper in the Journal of Cheminformatics. It is a comparative study of the quality of chemistry extraction from patent documents and includes patent chemistry sources derived by automated text-mining, such as SureChEMBL and the IBM/NIH data set. Among other things, the paper provides a useful, detailed overview of SureChEMBL's chemistry annotation specifications. While conducting this study, we realised that this task is far from trivial for several reasons. The patent corpus is inherently noisy, ambiguous and error-rich. There are diverse use cases and accuracy expectations when it comes to chemistry extracted from patents. Not all the chemistry found in a patent document is of equal importance. Compound standardisation variants such as stereoisomers, tautomers, salts and mixtures are always an issue. There is a distinct lack of an open Gold Standard when it comes to standardised chemistry extracted fro...