Tuesday, 27 January 2009

StARlite Schema

The StARlite database has a very simple schema, and should be easy to integrate other resources against. I have had a couple of requests for the schema, so here it is, in its current form. It may change a little, but not much, as we align it against other EMBL-EBI resources.

We will also arrange a web meeting, probably at 8pm GMT on Thursday 5th Feb, to talk through the data model and schema, describe some sample queries, etc. If you are interested, mail me to get further details.

Tuesday, 13 January 2009

Conference - 7th Annual Pharmaceutical Technology IT Summit

We will be attending, and presenting at, the 7th Annual Pharmaceutical Technology IT Summit, held on 15-16 June 2009 at Le Montreux Palace, Switzerland. A link to the conference is here.

Monday, 12 January 2009

Books and Papers - 4 - The Cathedral And The Bazaar

All this travelling is making me read. This was a book I picked up over the summer, but have only recently read properly. It is a classic in the open source field, and is a very lucid discussion of the culture of open source projects. It is excellent in quite a few places, and my copy is now dog-eared where I thought at the time that I must remember some point or other. There were many things I simply had not thought of before; the multifaceted discussions of 'free' vs commercial were particularly interesting. One definite downside was that I felt like I was 'l33t' after I had finished it, and tried to bore my kids with my new insights - I guess this may be the modern equivalent of your dad dancing at a disco.

%T The Cathedral And The Bazaar
%A Eric Raymond
%I O'Reilly Media, Inc.
%D 2001
%O ISBN 978-0596001087

Calling SMEs and consultants

One of the anticipated user groups for the ChEMBL data is SMEs and 'sole trader' type consultants (i.e. very small consultancy businesses, one-person-bands, etc.). So, as a general question, what sort of access and query tools would be most useful for this type of user? We have contacts with quite a lot of large pharma, non-profits, universities and larger biotech, especially in Europe and the USA, but could do with more diverse contacts to make sure we align our services with a broader community.

I guess there are a number of obvious options for delivery.

  • Locally installable databases - what sort of technical environment would be wanted for this (Oracle, MySQL, ....)?
  • Hosted web server access - what sort of queries would people want to do, how would they like the results?

    Where should our priorities lie?

    So, please, mail me (jpo (at) if you have some thoughts, or would consider yourself to be one of these types of users. Contact and input from developing economies is especially welcome.

    Sunday, 11 January 2009

    Books and Papers - 3 - The Long Tail

    The power law is everywhere - in choice of food, types of music, species distributions, etc. Over the holiday season I read The Long Tail by Chris Anderson. The subtitle of the book is How Endless Choice is Creating Unlimited Demand, and there is a provocative mix of economics alongside the reality of the power law frequency distribution of assets. The discussion of physical vs virtual assets is nice and balanced, as is the discussion of the strategies for economic exploitation of various portions of the demand curve. I guess on-line resources, specifically scientific data sources (both free and commercial), are a lot like this as well - the vast majority of the data is never accessed, whilst a small fraction makes up the majority of the interest. It does make you think about the 'long tail of scientific data', and the best approaches to archive, preserve and distribute it.

    As an aside, when the book first came out, it seemed to be highly lauded as a refreshing boost to business models built around the internet, diversity and revenue generation. However, it seems that even on the internet the 80:20 rule rules. Don't get me wrong, I think this is a really excellent, thought-provoking book. Buy it!

    %D 2007
    %A Chris Anderson
    %T The Long Tail: How Endless Choice is Creating Unlimited Demand
    %I Random House Business Books
    %O ISBN 978-1844138517

    Friday, 9 January 2009

    Conference - NCRI-NCIN Joint meeting presentation

    I (jpo) am speaking on the ChEMBL databases at an NCRI (National Cancer Research Institute) Informatics - NCIN (National Cancer Intelligence Network) meeting in London on the 12th February.

    Thursday, 8 January 2009

    Books and Papers - 2 - The Tufte For A New Generation

    This year, Santa Claus delivered a book (I always thought he lived at the North Pole, but clearly he is now based in The Amazon) that I had seen advertised in a few places - 'Information Dashboard Design: The Effective Visual Communication Of Data', by Stephen Few. It is one of those compelling books whose fundamental message is simple and arguably obvious, but which is nonetheless a delight to read, and I learnt a lot from it. The basic theme of the book is the necessary features and design of intuitive interfaces, in particular those that need to display quantitative and comparative numerical data. My first contact with books of this type was with the classics of Edward Tufte, which remain timeless, but are complemented by this book addressing HCI issues.

    %D 2008
    %A Stephen Few
    %T Information Dashboard Design: The Effective Visual Communication Of Data
    %I O'Reilly
    %O ISBN 978-0596100162

    Friday, 2 January 2009

    ChEMBL Target Dictionary

    Here is a link to the ChEMBL databases target dictionary. This contains the sequences of the targets contained within the entire set of ChEMBL databases, with a few exceptions (primarily around CandiStore entries). The vast majority of these are from the StARlite medicinal chemistry database, but not all of them currently are, so caveat emptor.

    The file is around 2.4MB in size and is in FASTA format. The identifiers are simply the internal database identifiers (tids), but organism and trivial protein names are included as well. Linking these through to UniProt, RefSeq, etc. is left, as they often say, as an exercise for the reader (for now). However, the file should give some idea of the diversity and distribution of sequences within the databases.
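
For anyone wanting a quick look at the distribution of sequences in a FASTA file like this one, here is a minimal Python sketch. It is an illustration only, not part of the ChEMBL release: the function names read_fasta and length_histogram are made up for this example, and nothing is assumed about the header layout beyond the standard '>' prefix, so each header is kept as a single unparsed string.

```python
from collections import Counter


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA text lines.

    The header is returned whole (minus the leading '>'); splitting it into
    tid, organism and protein name is left to the caller, because the exact
    header layout is not assumed here.
    """
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequence may be wrapped over several lines
    if header is not None:
        yield header, "".join(chunks)


def length_histogram(records, bin_size=100):
    """Count sequences per length bin; keys are the lower edge of each bin."""
    counts = Counter()
    for _header, seq in records:
        counts[(len(seq) // bin_size) * bin_size] += 1
    return dict(counts)
```

To use it, open the downloaded file (under whatever name you saved it) and pass the file object straight to read_fasta, since a file object iterates over its lines; feeding the resulting records to length_histogram gives a rough picture of the sequence length distribution.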