Monday, 20 July 2015

Biological annotations in SureChEMBL

Termite annotation in action. (Termite not to scale)

SureChEMBL is perhaps the only freely available, large-scale, comprehensive and live resource of chemistry extracted from the patent literature. SureChEMBL automatically annotates, normalises and indexes chemistry found in the full text, images and attachments (i.e. mol files) of patent documents. The next logical step for us was to complement the chemical annotations with biological ones, such as mentions of gene names and classifications, protein classes and disease indications.

As a first step in this direction, we used Termite, provided by SciBite (via funding from OpenPHACTS), to integrate these annotations dynamically into the full-text patent view of the SureChEMBL user interface; in other words, you can now view biological annotations on the fly.

How do I add the annotations and navigate through them?

There is now an additional checkbox underneath the 'Highlight additional recognised chemical terms' checkbox:

Simply check the box and let the magic happen! Annotation can take quite some time on large patents, so please be patient. Once loading is complete, you should be greeted with an annotation-type tree that allows you to navigate through the highlighted annotations.

Expanding a node shows its terms, sorted by frequency of occurrence in the patent document. Hovering over a term entry reveals navigation buttons for stepping forwards or backwards through all instances of that annotation.
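For readers curious how such a tree could be put together, here is a minimal Python sketch of the idea: annotations are grouped by type, terms are ranked by how often they occur, and a small navigator steps through the instances of one term. This is only an illustration and not the SureChEMBL or Termite implementation. The names `build_annotation_tree` and `InstanceNavigator`, the `(type, term, offset)` input format, the alphabetical tie-break and the wrap-around at the ends are all assumptions made for the example.

```python
from collections import defaultdict


def build_annotation_tree(annotations):
    """Group (type, term, offset) tuples into {type: [(term, offsets), ...]}.

    Terms under each type are ordered by descending frequency (ties broken
    alphabetically); offsets are kept in document order.
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for ann_type, term, offset in annotations:
        grouped[ann_type][term].append(offset)

    tree = {}
    for ann_type, terms in grouped.items():
        ranked = sorted(terms.items(), key=lambda kv: (-len(kv[1]), kv[0]))
        tree[ann_type] = [(term, sorted(offsets)) for term, offsets in ranked]
    return tree


class InstanceNavigator:
    """Step forwards or backwards through the instances of one term."""

    def __init__(self, offsets):
        if not offsets:
            raise ValueError("a term needs at least one instance")
        self.offsets = list(offsets)
        self.index = 0

    def current(self):
        return self.offsets[self.index]

    def next(self):
        # Wrapping around at the last instance is an assumption of this sketch.
        self.index = (self.index + 1) % len(self.offsets)
        return self.current()

    def previous(self):
        self.index = (self.index - 1) % len(self.offsets)
        return self.current()


# Hypothetical sample data: annotation type, term, character offset.
sample = [
    ("gene", "EGFR", 120),
    ("disease", "asthma", 45),
    ("gene", "EGFR", 310),
    ("gene", "TP53", 200),
]
tree = build_annotation_tree(sample)
navigator = InstanceNavigator(tree["gene"][0][1])  # instances of the top gene term
```

Keeping the offsets in document order means "next" and "previous" move through the text in reading order, which is what the navigation buttons appear to do.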

The text is also highlighted in the corresponding colour, with the currently selected term instance bouncing for your attention.

Clicking on the annotations also allows you to view cross-references and links to external resources, in a similar fashion to the chemical annotations.

As part of our contribution to the OpenPHACTS project, all chemical and biological annotations found in our huge patent backfile (6 million documents and counting) will be semantically integrated in the OpenPHACTS system and will be part of the freely available OpenPHACTS API. This will be available later this year.

As always, comments and feedback are very welcome.

The SureChEMBL Team

