A week ago, I had the pleasure of presenting SureChEMBL 2.0 at the Cambridge Cheminformatics Network Meeting, organised by Andreas Bender and kindly hosted by the Cambridge Crystallographic Data Centre. It was a great opportunity to introduce one of the latest freely available databases of scientifically annotated patents to a broad scientific audience. The recording of the talk is now available online, along with the slides.

What did I cover during this 30-minute talk?

- Why scientists should pay attention to patent data
- Why patents are challenging to work with
- What SureChEMBL is and what it does
- How we identify chemical compounds in patent documents
- What SureChEMBL 2.0 has recently introduced
- How we annotate patents for genes/proteins and diseases
- How we are improving the quality of structures extracted from images
- What you can download from the SureChEMBL core datasets, and what they contain
- Examples of queries that SureChEMBL h...
AI-driven bioassay annotation strategy

The continued expansion of ChEMBL bioactivity data makes high-quality, structured assay metadata essential for reproducible analysis and for machine-learning applications aligned with FAIR principles. Recent work by our team, published in J. Cheminf., describes coordinated manual and AI-driven strategies to enhance the annotation, classification, and interoperability of ChEMBL bioassays. In this work, we developed a spaCy-based named entity recognition (NER) model, trained on manually curated assay descriptions, to identify the Experimental Method within ChEMBL assay descriptions. The model achieved cross-validated precision, recall, and F1-scores of approximately 0.93, 0.95, and 0.94, respectively, and detected experimental methods in ~57% of binding and functional assays in ChEMBL 35. Extracted method terms were subsequently mapped to the Bioassay Ontology (BAO), demonstrating good precision at higher confidence thresholds bu...
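
For readers less familiar with how NER metrics such as those quoted above are computed, here is a minimal sketch in plain Python. It assumes exact-match scoring of entity spans, which is a common convention but not necessarily the one used in the paper. The `Span` type, the `EXPERIMENTAL_METHOD` label, and the example spans are all hypothetical and serve only as illustration.

```python
from typing import Set, Tuple

# Hypothetical span representation: (start_char, end_char, label).
Span = Tuple[int, int, str]


def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def span_scores(gold: Set[Span], predicted: Set[Span]) -> Tuple[float, float, float]:
    """Entity-level precision, recall and F1 under exact span matching.

    Assumption: a predicted entity counts as correct only if its start,
    end and label all match a gold entity exactly.
    """
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall, f1_from(precision, recall)


# Made-up spans for a single assay description (illustration only).
gold_spans = {
    (0, 4, "EXPERIMENTAL_METHOD"),
    (10, 20, "EXPERIMENTAL_METHOD"),
    (30, 35, "EXPERIMENTAL_METHOD"),
}
predicted_spans = {
    (0, 4, "EXPERIMENTAL_METHOD"),
    (10, 20, "EXPERIMENTAL_METHOD"),
    (40, 45, "EXPERIMENTAL_METHOD"),  # spurious prediction
}
precision, recall, f1 = span_scores(gold_spans, predicted_spans)

# Consistency check on the figures quoted in the post:
# precision 0.93 and recall 0.95 give an F1 that rounds to 0.94.
reported_f1 = round(f1_from(0.93, 0.95), 2)
```

In the toy example two of the three predicted spans match the gold set, so precision, recall and F1 all come out at 2/3. The final line only checks that the reported F1 agrees with the reported precision and recall; it does not reproduce the cross-validation itself.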