Have a goddamned null hypothesis!

I’m in the middle of Joubert et al. (2010), and I’m shaking my head a bit.

The paper is a pretty standard specimen of the kind of next-generation sequencing study that’s been proliferating in recent years. It looked for genes that make the shell of an oyster by sequencing all the RNA expressed in the tissues that build the shell.

This sort of study generates an awful lot of data, which makes it unfeasible to analyse by old-fashioned human brainpower and maybe a little statistics. Anyone happy to manually trawl through nearly 77 thousand sequences to find the interesting ones is a lot crazier than anyone I know… and anyone willing to study each of them to figure out what they do is probably a god with all of eternity on their hands.

One of the tools researchers use to quickly and automatically find interesting patterns in such large datasets is the Gene Ontology (GO) database. GO uses a standard vocabulary to tag protein sequences with various attributes such as where they are found within the cell, what molecules they might bind to, and what biological processes they might participate in. Such tags may be derived from experiments, or often from features of the sequence itself. For example, proteins often contain specific sequence motifs that tell the cell where to send them. Assuming that related proteins have similar functions in different organisms, GO annotations derived from one creature can be used to make sense of data produced in another.

So, Joubert et al. took their 77 thousand sequences, translated them into protein, and ran them through a program that finds similar sequences with existing annotations from GO or similar databases. They got some numbers. X per cent of the proteins are predicted to have some metabolic function, Y per cent of them bind to something, and so on. Then they went on to pull hypotheses out of these numbers. And that’s where they really should have remembered Scientific Method 101.

The part where I started looking askance at the paper is where they get excited about the percentage of predicted ion-binding proteins in the dataset. Of course, oyster shells are largely made of ions (calcium and carbonate ions, to be precise). But, importantly, “ions” are an awfully broad category, and they are essential for many of the everyday workings of any cell. Even calcium ions, which make up one half of the mineral component of the shell, play many other roles. Muscle contraction, cell adhesion, conduction of nerve impulses – all involve calcium ions in one way or another, and lots and lots of proteins either regulate or are regulated by calcium signals via binding the ions. So the question arose in my head: is it really unusual for 17% of “binding” sequences in a sample to bind ions?

This is not the first time I’ve seen people who publish high-throughput sequencing data fail to ask that sort of question. They just report the pattern they saw, and try to interpret it. But patterns can only be interpreted in context. To tell whether a pattern is unusual, you must first know what the usual pattern looks like!

In fact, finding this many ion binding proteins doesn’t seem extraordinary at all. SwissProt, the awesome hand-curated protein sequence database, has a really handy browsing tool where you can get a breakdown of a selected set of sequences by GO categories. Curious, I had a quick look at all the 20 244 human proteins in SwissProt. (I picked humans because I’d probably be hard-pressed to find organisms with more GO-annotated proteins.) Out of the 11 674 that GO classifies under “binding”, 3934 are thought to bind ions. That’s almost 34%. This almost certainly has nothing to do with us having bones made of ions – SwissProt includes proteins from all tissues.
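For the record, the back-of-the-envelope comparison is nothing more than a division. Here is a tiny sketch using only the numbers quoted in this post; the variable names are mine, and the 17% figure is taken as reported rather than recomputed from the oyster data.

```python
# Counts quoted above for the human proteins in SwissProt.
human_binding = 11_674      # proteins GO classifies under "binding"
human_ion_binding = 3_934   # of those, the ones thought to bind ions

# Background expectation: what share of "binding" proteins bind ions?
human_ion_fraction = human_ion_binding / human_binding

# Share reported for the oyster dataset (taken as given, not recomputed).
oyster_ion_fraction = 0.17

print(f"human background: {human_ion_fraction:.1%}")
print(f"oyster sample:    {oyster_ion_fraction:.1%}")
print("oyster above background?", oyster_ion_fraction > human_ion_fraction)
```

This is not a statistical test, just the baseline a reader would want to see before being asked to find the oyster number exciting.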

Even when you compare ion-binding sequences with ones that are predicted to bind something else within the same dataset, the “background” wins: while the human collection I looked at has about 1.6 times as many protein-binding sequences as ion-binders, in the oyster dataset the protein-binders outnumber the ion-binders two to one. Caveats apply as usual – datasets are incomplete, GO annotations are probably both incomplete AND a bit suspect, etc.; but I can’t see how that justifies the implication that the pattern this study found in the oyster is somehow special and interesting from a shell-building perspective. The whole thing smells like looking for faces in the clouds to me. Science is not simply about pattern-finding – it’s about finding meaningful patterns. I hope that this crucial distinction doesn’t get completely washed away by the current flood of data, data, data.

***

References:

Joubert C et al. (2010) Transcriptome and proteome analysis of Pinctada margaritifera calcifying mantle and shell: focus on biomineralization. BMC Genomics 11:613


Finger counting: an update

So, a while ago, I went on at length about the confused identities of those stunted little bones in a bird’s wing that were once fingers. (Diagram of wing skeleton by L. Shyamal, Wikimedia Commons.)

For a brief recap: bird wings have (the remnants of) three digits. Based on palaeontological evidence (read: dinosaur hands), these would be digits I, II and III. But in a modern bird embryo, they develop in the places of digits II, III and IV. The II-III-IV hypothesis also seemed more consistent with the way other tetrapods lose digits: if five is too many, DI is typically the first to go.

To solve the apparent conundrum, Wagner and Gauthier (1999) proposed the developmental frame shift hypothesis, meaning that the developing fingers switch identities as they form. The paper I tried to discuss (and probably ended up garbling :-P) in my previous post concluded that the frame shift (if it can be called that) happens very early, before there is anything resembling actual digits in the wing bud.

Well, as always, there is a new twist on the story. (Chicken wings seem to be all the rage this year!) Wang et al. (2011) took a brute force approach towards pinning down those elusive digit identities. They compared all the genes that are active in each developing digit in both the wing and the foot of chicken embryos at two different time points. They used next-generation sequencing to catch mRNAs, the RNA copies of genes that cells make for use as templates for protein synthesis. This means thousands of genes identified from each sample, plus information on how much of each mRNA there is, plenty of material for comparison.

After a considerable amount of statisticking on that giant pile of data, it turns out that avian digit identities are… confused. When they grouped the different digits by how similar their gene expression patterns were, first digits proved starkly and uncontroversially well-behaved. DI from the foot (whose identity no one seriously disputes) and the first digit of the wing obviously belong together. Every other digit, though…

For example, the second finger is most like the third toe, but also kind of like the second toe. The third finger goes from being closest to the second toe to being most similar to the fourth toe as development progresses. And it’s apparently chock full of mRNA belonging to a gene that had never been reported to participate in limb development (never mind determining any digit identity) before.

With all that said, I wouldn’t be too quick to declare that wings are just weird. Well, they certainly are, but these data don’t necessarily tell us that they are weird in this particular respect. I’m sorely missing a comparison to an ordinary, five-fingered forelimb, for example. What if the weirdness is not an issue with birds, but with forelimbs?

I’d say it’s a fair bet that chicken wings will keep evo-devo nerds entertained for years to come…

(I think I’ll be discussing limb development and genetic evidence in evo-devo again in the near future. Well, for certain values of “near”. I’ve had that post in the pipeline for weeks, but the damned thing just doesn’t want to get finished.)

***

References:

Wagner GP and Gauthier JA (1999) 1,2,3 = 2,3,4: A solution to the problem of the homology of the digits in the avian hand. PNAS 96:5111-5116

Wang Z et al. (2011) Transcriptomic analysis of avian digits reveals conserved and derived digit identities in birds. Nature 477:583-586

Zooming in on mutations

Evolution depends on variation, and variation depends on mutations. The evolution of new features, in particular, wouldn’t be possible without new mutations. Thus, mutation is of great interest to evolutionary biologists. More specifically, how mutations affect an organism’s fitness has been discussed and debated ever since the concept of mutations entered evolutionary theory. Relatively speaking, how many mutations are harmful, beneficial, or neither? What kinds of mutations are likely to be each in which parts of the genome? It’s hard to get a confident picture of such questions, partly because there are so many possible mutations in any given gene, let alone genome, and partly because fitness isn’t always easy to measure (see Eyre-Walker and Keightley [2007] for a review).

How do mutations affect fitness? Top: three theoretical possibilities; bottom: the real thing. (Hietpas et al., 2011)

Hietpas et al. (2011) did something really cool that hasn’t been done before: they took a small piece of an important gene, and examined the fitness consequences of every possible mutation in that sequence. This approach is limited in its own way, of course. Due to the sheer number of possibilities, it’s only feasible for short sequences, which might make it hard to generalise any results. But the unique window it opens on the relationship of a gene’s sequence and its owner’s success is invaluable.

What did they do?

Let’s examine the method in a bit more detail, mainly to understand what “every possible mutation” means in this context; because it’s a little more complicated than it sounds.

The bit of DNA they chose codes for a 9-amino acid region of heat shock protein 90 (Hsp90) in brewer’s yeast. So it really is small, only 27 base pairs altogether (recall that in the genetic code, 3 base pairs [1 codon] translate to 1 amino acid). Hsp90 is a very important protein found all over the tree of life. It’s a so-called chaperone, a protein that helps other proteins fold correctly, and in eukaryotes it’s absolutely required for survival.

The team generated mutant versions of the Hsp90 gene, each of which differed from the “wild type” version in one codon out of these nine. So each “mutation” examined could actually be anywhere between one and three mutations. They generated all possible mutants like that, amounting to over 500 different sequences.

[NOTE: If you check back at the genetic code, you’ll note that most amino acids are encoded by more than one codon, so not all of the resulting proteins differed from one another. Mutations that don’t change the amino acid are called synonymous. This will become important later.]
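To make “every possible mutation” concrete, here is a minimal sketch that enumerates all single-codon variants of a 27-base sequence and sorts them into synonymous, nonsense and everything else. The codon table is the standard genetic code; `WILD_TYPE` is a made-up placeholder and not the real Hsp90 sequence, so only the total count carries over to the actual experiment.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons in the order TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = ["".join(c) for c in product(BASES, repeat=3)]
CODE = dict(zip(CODONS, AMINO_ACIDS))

# Hypothetical 9-codon stand-in for the test region (NOT the real sequence).
WILD_TYPE = "GCTTCTATGGAAAGAATTATGAAGGCT"


def single_codon_variants(seq):
    """Yield (position, codon, base changes, kind) for every variant
    that differs from seq in exactly one codon."""
    for start in range(0, len(seq), 3):
        wt_codon = seq[start:start + 3]
        for codon in CODONS:
            if codon == wt_codon:
                continue
            n_changes = sum(a != b for a, b in zip(codon, wt_codon))
            if CODE[codon] == CODE[wt_codon]:
                kind = "synonymous"
            elif CODE[codon] == "*":
                kind = "nonsense"
            else:
                kind = "missense"
            yield start // 3 + 1, codon, n_changes, kind


variants = list(single_codon_variants(WILD_TYPE))
kinds = Counter(kind for _, _, _, kind in variants)

# 9 positions x 63 alternative codons = 567 variants, whatever the sequence.
print(len(variants))
print(dict(kinds))
```

The split between synonymous and non-synonymous variants depends on which codons the wild type happens to use, which is why the placeholder can only illustrate the bookkeeping.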

Then came the measurement of fitness. The researchers took a strain of yeast whose own Hsp90 gene was engineered not to work at high temperatures, and introduced into the cells small pieces of DNA called plasmids, each carrying either a wild type (temperature-insensitive) Hsp90 gene or one of the 500+ mutants. They then grew all cells together in a common culture. After a while, they raised the growing temperature so that the plasmid-borne genes alone determined the cells’ survival.

They took samples every few hours – wild type yeast populations doubled every 4 hours – and did something that would not have been possible even a few years ago: sequenced the region of interest from this mixed culture, and compared the abundance of different sequence variants. By counting how many times each mutant was sequenced at each time point, they got a very good estimate of their relative abundances. The way each mutant prospered or declined relative to others over time gave a measurement of their fitness.
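One common way to turn such counts into a fitness number is to track the log ratio of mutant to wild type reads over time and take its slope. The sketch below does that with made-up read counts; the function name and the data are mine, and I’m not claiming this is exactly the calculation used in the paper.

```python
import math


def log2_ratio_slope(times, mutant_counts, wt_counts):
    """Least-squares slope of log2(mutant / wild type) against time.

    Counts must be positive; a variant that drops to zero reads
    would need special handling that this sketch leaves out.
    """
    ys = [math.log2(m / w) for m, w in zip(mutant_counts, wt_counts)]
    n = len(times)
    mean_t = sum(times) / n
    mean_y = sum(ys) / n
    cov = sum((t - mean_t) * (y - mean_y) for t, y in zip(times, ys))
    var = sum((t - mean_t) ** 2 for t in times)
    return cov / var


# Hypothetical data. Time is expressed in wild type generations,
# using the 4-hour doubling time mentioned above.
hours = [0, 4, 8, 12, 16]
generations = [h / 4 for h in hours]
wild_type = [1000, 1000, 1000, 1000, 1000]
neutral_mutant = [100, 100, 100, 100, 100]   # keeps pace with wild type
stalled_mutant = [800, 400, 200, 100, 50]    # halves relative to wild type

neutral_slope = log2_ratio_slope(generations, neutral_mutant, wild_type)
stalled_slope = log2_ratio_slope(generations, stalled_mutant, wild_type)
print(neutral_slope, stalled_slope)
```

A slope near zero means the mutant grows as well as the wild type, while a slope of about -1 per wild type generation is what you’d expect from cells that have stopped dividing altogether.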

What did they find?

There are so many interesting things in this study that I’m not sure where to begin. Let’s start with the result that concerns the first question posed in my introductory paragraph. How are the mutations distributed along the deleterious – beneficial axis?

Perhaps not surprisingly, most non-synonymous mutations were harmful to fitness. I say not surprisingly because this protein has been honed by selection for many, many millions of years. It is probably close to the best it can be, although the researchers tried to pick a region that contained variable as well as highly conserved amino acids.

[ASIDE: They didn’t really succeed in that – among the 400+ species they say they used for comparison, 4 of 9 positions don’t vary at all, 2 are identical in almost all species, another 2 can have two amino acids with roughly equal chance, and only one can hold three different amino acids. I’ve seen more variation in supposedly highly conserved sequences over smaller phylogenetic distances. Perhaps Hsp90 is just that conserved everywhere.]

There were a few mildly beneficial mutations, but no highly beneficial ones. Deleterious mutations could be divided into two large groups, with very few in between: mostly they were either very harmful or close to neutral. This constitutes support for the nearly neutral theory of molecular evolution, but as I said, the sequence they examined is hardly representative of all sequences under all circumstances. It would be interesting to see how (if) the distribution changes in sequences under directional selection, or sequences that don’t experience much selection at all. I’m kind of hoping that that’s their next project 😛

The second interesting observation – interesting to me, anyway – is that nonsense mutations, those that introduce an early stop codon in the sequence, were not as unfit as complete deletions of the gene. A stop codon means the end of the protein – an early stop codon eliminates everything that comes after it. Cells making a truncated protein were lousy at survival, but not quite as lousy as cells with no Hsp90 at all. This is a bit strange, given that earlier the paper states that a region of Hsp90 that comes after their 9 amino acids is necessary for its function. A nonsense mutation in the test region removes that supposedly necessary part, so why did those cells do any better than mutants lacking the gene entirely?

Looking at synonymous mutations, the team determined that these don’t affect fitness much. This has practical importance, because synonymous mutations have long been used as a “baseline” to detect signs of selection in other mutations. If they weren’t neutral, the central assumption of that approach would fall down.

Another question the study asked was whether certain positions in the protein require amino acids of a certain type. The twenty amino acids found in proteins can be loosely grouped according to their physical and chemical properties. For example, some of them are positively charged, while others carry no charge at all; some are (relatively speaking) huge and some are tiny. These properties determine how a protein folds and what its different regions can do, so one would expect that in important positions, only amino acids similar in size and chemistry could work.

To find all the amino acids that worked equally well in a given position, Hietpas et al. looked at a subset of amino acid changes: those whose fitness was very close to the wild type. Surprisingly, they found that several positions tolerated radically different amino acids without losing much fitness. Quoting from the paper,

“[t]his type of physical plasticity illustrates the degenerate relationship between physics and biology: Biology is governed by physical interactions, but biological requirements can have multiple physical solutions.”

This is kind of stating the obvious in this context, but it does echo a more general observation about life. In evolution, there is often more than one way to skin a cat.

[ASIDE: Analogous enzymes provide a striking demonstration of that. These are pairs – or even groups – of enzymes that catalyse the same reaction, without bearing any physical resemblance to one another. Their sequences are different, their 3D structures are different, and their catalytic mechanisms are different, yet they do essentially the same thing. But there are also more familiar, if less extreme, examples. For instance, within vertebrates only, we see three different solutions for powered flight and even more variations on gliding.]

The researchers built a “fit amino acid profile” of their test sequence using these “wild type-like” mutations, then compared it to the actual pattern of amino acid substitutions observed in “real” Hsp90 proteins. It turns out the two are quite different: eight out of the nine positions are conspicuously less variable in real life than the fitness profile would predict. The paper lists a few possible explanations. Lab environments are not natural environments, and amino acids that work fine in their very controlled environment may not be so great under harsher or less stable real-world conditions. Wild type-like fitness does not mean the substitution is completely neutral – many of them are slightly deleterious, which may come out more strongly under natural circumstances, especially over the long term. And one of the substitutions would require more than one mutation at the DNA level – with strongly deleterious intermediate steps.

That last point leads me to the part of the study I personally found most interesting. Thus far, we’ve taken the genetic code as a given, and hardly paid any attention to it at all. But, in fact, the genetic code itself is a product of evolution. Most likely, it didn’t spring into existence fully formed when organisms invented protein synthesis. There is a mind-blowingly large number of possible genetic codes – why is it that organisms use this particular one, with only minor variations? We won’t go into all of the hypotheses about that, mostly because I’m not very familiar with them. It’s enough to note that in principle, the genetic code could be accidental (it just happened to be the one some distant ancestor of all living things stumbled on), a chemical inevitability of some sort, or it could have risen to prominence by natural selection.

[ASIDE: The options are not mutually exclusive. For example, it is possible that the only important thing about the genetic code is how easy it is to mutate from particular amino acids to certain others – in other words, that it’s the structure of the code that’s under selection, while its finer details, such as which four codons stand for glycine, may be largely coincidental or determined by chemical necessity.]

For this tiny region of the Hsp90 gene/protein, it looks very much like selection had a hand in it. Hietpas et al. used their theoretical fit amino acid profile and a sample of 1000 randomly generated genetic codes – and asked how many substitutions it would take to switch between equally fit amino acids under each genetic code. Intriguingly, very few genetic codes made it as easy as the real one. In other words, the genetic code seems geared to minimise the number of deleterious mutations.
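To give a feel for what such a comparison involves, here is a toy version. It builds random codes by keeping the codon blocks and stop codons of the standard code and shuffling which amino acid each block stands for, then counts the minimum number of base changes needed to get between amino acids that are supposed to be interchangeable. The shuffling scheme is my assumption about how one might generate random codes, and `PAIRS` is a hand-picked list of textbook-similar amino acids rather than the paper’s fit amino acid profile, so don’t read anything into the number that comes out.

```python
import random
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons in the order TTT, TTC, TTA, TTG, TCT, ...
STANDARD = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = ["".join(c) for c in product(BASES, repeat=3)]


def standard_code():
    return dict(zip(CODONS, STANDARD))


def shuffled_code(rng):
    """Random code: same codon blocks and stop codons as the standard
    code, but with the 20 amino acids reassigned among the blocks."""
    amino_acids = sorted(set(STANDARD) - {"*"})
    permuted = amino_acids[:]
    rng.shuffle(permuted)
    swap = dict(zip(amino_acids, permuted))
    swap["*"] = "*"
    return {codon: swap[aa] for codon, aa in standard_code().items()}


def min_changes(code, aa1, aa2):
    """Fewest base changes that turn some codon for aa1 into one for aa2."""
    codons1 = [c for c, aa in code.items() if aa == aa1]
    codons2 = [c for c, aa in code.items() if aa == aa2]
    return min(sum(x != y for x, y in zip(a, b))
               for a in codons1 for b in codons2)


def cost(code, pairs):
    """Total base changes needed across all interchangeable pairs."""
    return sum(min_changes(code, a, b) for a, b in pairs)


# Illustrative placeholder pairs (NOT the profile measured in the paper).
PAIRS = [("I", "V"), ("L", "M"), ("D", "E"),
         ("K", "R"), ("S", "T"), ("F", "Y")]

rng = random.Random(0)  # fixed seed so the run is repeatable
real_cost = cost(standard_code(), PAIRS)
random_costs = [cost(shuffled_code(rng), PAIRS) for _ in range(1000)]
as_good = sum(c <= real_cost for c in random_costs) / len(random_costs)

print("real code cost:", real_cost)
print("share of random codes at least as good:", as_good)
```

The logic is the point here: the real code is scored by the same yardstick as the random ones, and the question is simply how often chance does as well.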

What’s really fascinating about that result is that it came from an analysis of such a tiny sequence. Earlier, I mentioned that it might be hard to generalise anything from a short sequence. But it’s hard to believe that this particular finding doesn’t have general applicability. The genetic code sets the rules for all proteins – if it weren’t optimised in general, what’s the chance that such strong optimisation would be detected in such a tiny sample? This also suggests that roughly the same amino acids are interchangeable across the board, regardless of which protein we’re talking about. (Which is not necessarily surprising if you’ve ever spent time comparing protein sequences between species, but still, it’s valuable as a new way of looking at a familiar phenomenon.)

All in all, this is the kind of paper that makes me all giddy with excitement. It digs deep into fundamental questions in evolutionary theory, and it finds some intriguing answers. It’s also a great reminder of how amazingly far technology has come – merely sequencing 27 base pairs would have been a formidable task at the dawn of molecular biology, and now we can mix 500 different versions together, sequence all of them in a single experiment, and reliably count how many of each variant there are. And that’s nowhere near the limits of current sequencing technology. This is the future, folks, and it’s better than sci-fi.

References

Eyre-Walker A & Keightley PD (2007) The distribution of fitness effects of new mutations. Nature Reviews Genetics 8:610-618

Hietpas RT et al. (2011) Experimental illumination of a fitness landscape. PNAS 108:7896-7901