Figure: AI-driven bioassay annotation strategy
The continued expansion of ChEMBL bioactivity data makes high-quality, structured assay metadata essential for reproducible analysis and machine-learning applications aligned with FAIR principles. Recent work by our team, published in J. Cheminf., describes coordinated manual and AI-driven strategies to enhance the annotation, classification, and interoperability of ChEMBL bioassays.
In this work, we developed a spaCy-based named entity recognition (NER) model, trained on manually curated assay descriptions, to identify the Experimental Method within ChEMBL assay descriptions. The model achieved cross-validated precision, recall, and F1-scores of approximately 0.93, 0.95, and 0.94, respectively, and detected experimental methods in ~57 % of binding and functional assays in ChEMBL 35. Extracted method terms were subsequently mapped to the BioAssay Ontology (BAO). This mapping showed good precision at higher confidence thresholds, but it also highlighted the need for hybrid ontology-linking strategies to improve coverage.
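As a quick consistency check on the reported metrics, the F1-score is the harmonic mean of precision and recall. The short sketch below recomputes it from the approximate values quoted above. The function name `f1_score` is our own illustrative helper, not part of the published pipeline.

```python
def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall."""
    if precision + recall == 0:
        # Avoid division by zero when both metrics are zero.
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Approximate cross-validated values quoted in the text above.
precision, recall = 0.93, 0.95
f1 = f1_score(precision, recall)
print(round(f1, 2))  # prints 0.94
```

The recomputed value rounds to 0.94, in line with the F1-score reported for the NER model.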
In parallel, a multi-class classification model was trained to assign more granular assay-aim categories beyond the traditional ASSAY_TYPE schema. The resulting models showed strong cross-validation performance (F1 typically ~0.85–0.95) and provided confident predictions for ~88 % of literature-derived binding and functional assays, supporting improved dataset stratification for downstream modelling.
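To illustrate what a "confident prediction" can mean in practice, here is a minimal sketch of confidence thresholding on multi-class output: a label is assigned only when the top class probability reaches a cut-off, and coverage is the fraction of assays that receive a label. The category names, the probabilities, and the 0.8 threshold are all hypothetical placeholders. They are not the categories or the cut-off used in the published work.

```python
from typing import Optional


def confident_label(probs: dict[str, float], threshold: float) -> Optional[str]:
    """Return the top-scoring class, or None if its probability is below threshold."""
    label, score = max(probs.items(), key=lambda item: item[1])
    return label if score >= threshold else None


def coverage(labels: list[Optional[str]]) -> float:
    """Fraction of items that received a confident (non-None) label."""
    if not labels:
        return 0.0
    return sum(label is not None for label in labels) / len(labels)


# Hypothetical class probabilities for four assays (illustrative only).
toy_predictions = [
    {"binding affinity": 0.92, "enzyme inhibition": 0.05, "cytotoxicity": 0.03},
    {"binding affinity": 0.40, "enzyme inhibition": 0.35, "cytotoxicity": 0.25},
    {"binding affinity": 0.10, "enzyme inhibition": 0.85, "cytotoxicity": 0.05},
    {"binding affinity": 0.02, "enzyme inhibition": 0.08, "cytotoxicity": 0.90},
]

THRESHOLD = 0.8  # assumed cut-off, chosen only for this example

labels = [confident_label(p, THRESHOLD) for p in toy_predictions]
print(labels)            # the second assay is left unlabelled (None)
print(coverage(labels))  # prints 0.75
```

Raising the threshold generally trades coverage for precision, which is why a coverage figure is only meaningful alongside the confidence level at which it was obtained.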
Complementary curation efforts further increase the FAIR compliance of ChEMBL bioactivity data. These include alignment of assay organism annotations with the NCBI taxonomy, expanded protein-variant mapping across assays and activities, enhanced ADME metadata capture, and ontology integration for cell-level annotations.
Collectively, these advances demonstrate how combining expert curation with AI-driven strategies can systematically improve data annotation in large-scale bioassay resources, enabling more reliable compound–target activity modelling and more informed selection of training data for predictive machine learning.
