A couple of us attended the 3rd RDKit UGM, hosted by Merck in Darmstadt this year. It was an excellent opportunity to catch up with RDKit developments and applications and to meet up with other loyal "RDKitters". I presented a talk-torial there and went through an IPython Notebook, which some of you may find useful. It uses patent chemistry data extracted from SureChEMBL and, after a series of filtering steps, applies a few "traditional" chemoinformatics approaches to a set of claimed compounds. My ultimate aim was to identify "key compounds" in patents using compound information alone, inspired by papers such as this and this. The crucial difference is that these authors used commercial data and software, whereas in this implementation everything is free and open. At the same time, I wanted to show off what the combination of pandas, scikit-learn, mpld3, Beaker, RDKit, IPython Notebook and SureChEMBL can do nowadays (hint: a lot). So,