As you may know, patents can be inherently noisy documents, which can make it challenging to extract drug discovery information from them, such as the key targets or compounds being claimed. There are many reasons for this, ranging from deliberate obfuscation to the long and detailed nature of the documents. For example, a typical small molecule patent may contain extensive background information on the target biology and disease area, chemical synthesis information, biological assay protocols and pharmacological measurements (which may refer to endogenous substances, existing therapies, reaction intermediates, reagents and reference compounds), in addition to descriptions of the claimed compounds themselves. The SureChEMBL system extracts this chemical information from patent documents through recognition of chemical names, conversion of images and extraction of attached files, and allows patents to be searched for chemical structures of interest. However, t...
The Organization of Drug Discovery Data