I’m in the middle of Joubert et al. (2010), and I’m shaking my head a bit.
The paper is a pretty standard specimen of the kind of next-generation sequencing study that’s been proliferating in recent years. It looked for genes that make the shell of an oyster by sequencing all the RNA expressed in the tissues that build the shell.
This sort of study generates an awful lot of data, which makes it infeasible to analyse by old-fashioned human brainpower and maybe a little statistics. Anyone happy to manually trawl through nearly 77 thousand sequences to find the interesting ones is a lot crazier than anyone I know… and anyone willing to study each of them to figure out what they do is probably a god with all of eternity on their hands.
One of the tools researchers use to quickly and automatically find interesting patterns in such large datasets is the Gene Ontology (GO) database. GO uses a standard vocabulary to tag protein sequences with various attributes such as where they are found within the cell, what molecules they might bind to, and what biological processes they might participate in. Such tags may be derived from experiments, or often from features of the sequence itself. For example, proteins often contain specific sequence motifs that tell the cell where to send them. Assuming that related proteins have similar functions in different organisms, GO annotations derived from one creature can be used to make sense of data produced in another.
So, Joubert et al. took their 77 thousand sequences, translated them into protein, and ran them through a program that finds similar sequences with existing annotations from GO or similar databases. They got some numbers. X per cent of the proteins are predicted to have some metabolic function, Y per cent of them bind to something, and so on. Then they went on to pull hypotheses out of these numbers. And that’s where they really should have remembered Scientific Method 101.
The part where I started looking askance at the paper is where they get excited about the percentage of predicted ion-binding proteins in the dataset. Of course, oyster shells are largely made of ions (calcium and carbonate ions, to be precise). But, importantly, “ions” are an awfully broad category, and they are essential for many of the everyday workings of any cell. Even calcium ions, which make up one half of the mineral component of the shell, play many other roles. Muscle contraction, cell adhesion, conduction of nerve impulses – all involve calcium ions in one way or another, and lots and lots of proteins either regulate or are regulated by calcium signals via binding the ions. So the question arose in my head: is it really unusual for 17% of “binding” sequences in a sample to bind ions?
This is not the first time I've seen that people who publish high-throughput sequencing data don't ask that sort of question. They just report the pattern they saw and try to interpret it. But patterns can only be interpreted in context. To tell whether a pattern is unusual, you must first know what the usual pattern looks like!
In fact, finding this many ion binding proteins doesn’t seem extraordinary at all. SwissProt, the awesome hand-curated protein sequence database, has a really handy browsing tool where you can get a breakdown of a selected set of sequences by GO categories. Curious, I had a quick look at all the 20 244 human proteins in SwissProt. (I picked humans because I’d probably be hard-pressed to find organisms with more GO-annotated proteins.) Out of the 11 674 that GO classifies under “binding”, 3934 are thought to bind ions. That’s almost 34%. This almost certainly has nothing to do with us having bones made of ions – SwissProt includes proteins from all tissues.
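For what it's worth, the kind of sanity check I'm talking about takes only a few lines of code. Here's a minimal sketch of a two-proportion z-test comparing an observed ion-binder fraction against the SwissProt human background above. The human counts (3934 out of 11 674) are from SwissProt as described; the oyster counts are assumed for illustration only, since the paper reports a percentage (17%) rather than raw numbers I can quote here.

```python
from math import sqrt, erf

# Background: SwissProt human "binding" sequences (numbers quoted above).
bg_ion, bg_total = 3934, 11674   # ~34% ion-binders

# Oyster: ~17% ion-binders among "binding" sequences per the paper;
# these absolute counts are ASSUMED, purely for illustration.
oy_ion, oy_total = 1700, 10000

def two_proportion_z(x1, n1, x2, n2):
    """Two-sided two-proportion z-test: is p1 = x1/n1 different from p2 = x2/n2?"""
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)          # pooled proportion under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Normal-approximation two-sided p-value via the error function
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

z, p = two_proportion_z(oy_ion, oy_total, bg_ion, bg_total)
print(f"oyster: {oy_ion/oy_total:.1%}, human background: {bg_ion/bg_total:.1%}")
print(f"z = {z:.1f}, p = {p:.3g}")
```

With anything like these numbers, the oyster fraction comes out significantly *below* the background, which is the opposite of "enriched in ion-binders" – exactly the sort of result that should give one pause before building a shell-formation story on the raw percentage.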
Even when you compare ion-binding sequences with ones that are predicted to bind something else within the same dataset, the “background” wins: while the human collection I looked at has about 1.6 times as many protein-binding sequences as ion-binders, in the oyster dataset the protein-binders outnumber the ion-binders two to one. Caveats apply as usual – datasets are incomplete, GO annotations are probably both incomplete AND a bit suspect, etc.; but I can’t see how that justifies the implication that the pattern this study found in the oyster is somehow special and interesting from a shell-building perspective. The whole thing smells like looking for faces in the clouds to me. Science is not simply about pattern-finding – it’s about finding meaningful patterns. I hope that this crucial distinction doesn’t get completely washed away by the current flood of data, data, data.
Joubert C et al. (2010) Transcriptome and proteome analysis of Pinctada margaritifera calcifying mantle and shell: focus on biomineralization. BMC Genomics 11:613