Slides and recordings from the recent ChEMBL UGM are starting to appear on the meeting website. Here I want to draw attention to the presentation by Nicolas Bosc and myself on "SureChEMBL: Your next source of research data".
Ever since my NextMove Software days, I've been aware of the large amount of scientific data available in patents. This includes everything from large sets of chemical analogs to bioactivity values, reactions, and NMR spectra. US patents in particular are a rich source of data as they are (a) born digital and (b) freely available, so automated tools that extract the relevant data can generate substantial, high-quality datasets. For example, here's a graph I made back in 2017 to illustrate a NextMove Software blog post entitled "Are more bioactivities available from patents than from the academic literature?". It compared the data deposited into ChEMBL from papers with the data extractable from patents by LeadMine (see the talk linked from that blog post for more details).
Patent data on chemicals is very much overlooked and underused in the research space. With the UGM presentation I wanted to encourage the audience to rethink this by showing that some research questions are better answered with SureChEMBL data than with data from ChEMBL. For example, given that patents contain much larger sets of chemical analogs than papers do, there is much more R group replacement information available from patents; it's not just a question of more of the same data, but the data is also more diverse (i.e. it covers more R groups). We also spoke a little about the project our ARISE2 fellow has just begun on automatic extraction of bioactivities from patents. This project, co-funded by GSK, is very exciting, and we hope it will finally realise the promise of the graph above by making patent bioactivity data available at scale.
On a final note, this is sadly the last presentation that Nicolas Bosc will give as part of the team, as the 9-year rule has finally caught up with him. He will be greatly missed. If you are looking for a well-rounded cheminformatician with broad experience with patent data and ML models and intimate knowledge of ChEMBL data, then reach out to him.
