Our collaborators at GSK have just published an Open Access paper in the Journal of Cheminformatics. It is a comparative study of the quality of chemistry extraction from patent documents, and it includes patent chemistry sources derived by automated text mining, such as SureChEMBL and the IBM/NIH data set. Among other things, the paper provides a useful, detailed overview of SureChEMBL's chemistry annotation specifications.
While conducting this study, we realised that this task is far from trivial for several reasons:
- The patent corpus is inherently noisy, ambiguous and error-rich.
- There are diverse use cases and accuracy expectations when it comes to chemistry extracted from patents.
- Not all the chemistry found in a patent document is of equal importance.
- Compound standardisation variants such as stereoisomers, tautomers, salts and mixtures are always an issue.
- There is a distinct lack of an open Gold Standard when it comes to standardised chemistry extracted from relevant full-text patent documents. Recently, there have been several attempts towards text-mining standards, provided by BioCreative and by publications such as this one, but these offer the position and type of chemical named entities rather than the converted structures.
- The commercial patent chemistry vendors do not disclose their extraction specifications, which makes any comparisons even harder.
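To illustrate why standardisation variants complicate such comparisons, here is a minimal Python sketch, not the method used in the paper. It assumes each source has already been reduced to a set of standard InChIKeys, and it relies on the fact that the first 14-character block of an InChIKey encodes connectivity only, so matching on that block ignores stereochemistry. The function names and the example keys are hypothetical placeholders, not real compounds or real data.

```python
def connectivity_block(inchikey: str) -> str:
    """Return the first (connectivity) block of a standard InChIKey."""
    return inchikey.split("-")[0]


def overlap(source_a, source_b, ignore_stereo=False):
    """Return the identifiers shared by two sources.

    With ignore_stereo=True, keys are compared on their connectivity
    block only, so stereoisomers of the same skeleton count as a match.
    """
    key = connectivity_block if ignore_stereo else (lambda k: k)
    return {key(k) for k in source_a} & {key(k) for k in source_b}


# Placeholder keys with the InChIKey shape (14-10-1); not real compounds.
manual = {
    "AAAAAAAAAAAAAA-QQQQQQQQSA-N",  # stereo-defined form
    "BBBBBBBBBBBBBB-UHFFFAOYSA-N",
}
mined = {
    "AAAAAAAAAAAAAA-UHFFFAOYSA-N",  # same skeleton, no stereo
    "BBBBBBBBBBBBBB-UHFFFAOYSA-N",
    "CCCCCCCCCCCCCC-UHFFFAOYSA-N",
}

exact = overlap(manual, mined)
loose = overlap(manual, mined, ignore_stereo=True)
print(len(exact), len(loose))
```

The gap between the exact and the connectivity-only overlap gives a rough sense of how much of a disagreement between two sources is down to stereochemistry alone. Tautomers, salts and mixtures would need a proper standardisation step with a cheminformatics toolkit, which this sketch does not attempt.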
Here is the background:
First public disclosure of new chemical entities often takes place in patents, which makes them an important source of information. However, with an ever increasing number of patent applications, manual processing and curation on such a large scale becomes even more challenging. An alternative approach better suited for this large corpus of documents is the automated extraction of chemical structures. A number of patent chemistry databases generated by using the latter approach are now available but little is known that can help to manage expectations when using them. This study aims to address this by comparing two such freely available sources, SureChEMBL and IBM SIIP (IBM Strategic Intellectual Property Insight Platform), with manually curated commercial databases.
S. Senger, L. Bartek, G. Papadatos, A. Gaulton. Journal of Cheminformatics.
George and Anna