If you want to use molecular sequences to uncover the relationships between organisms, you have two choices of molecule.* You can use either DNA or the proteins it encodes. I always thought, why the hell would anyone use DNA when they can also use protein?
The DNA alphabet has only four letters, versus the 20 amino acids proteins are made of. With so few letters, there is much more danger of a chance similarity, and much more chance of multiple mutations at the same spot returning to the ancestral state and completely erasing important phylogenetic information. You could, I suppose, use codon-based models instead of single-nucleotide models, but what’s the point when you can just translate the sequence and analyse the protein instead?
Well, it seems there is a point. The crucial thing is that while DNA translates unambiguously to protein, this is not true the other way. Take a look at the genetic code table below (modified from here):
The letters in black represent RNA bases (the DNA would have T instead of U), and the coloured ones are the three-letter abbreviations of amino acids, except for Stop, which, as you might have guessed, means “end of protein, stop translating”.
The first thing to note about the table is that most amino acids are encoded by more than one DNA/RNA codon. There’s already more information here than if you simply took the protein sequence. The second point is that some amino acids have two sets of codons that aren’t easily interchangeable.
With something like glycine (bottom right box), all codons are almost the same, only differing in the third letter, which might even be irrelevant anyway due to third base wobble. Mutating between glycine’s codons is easy and unlikely to screw the organism.
In contrast, serine (red and yellow boxes) has two sets of codons that differ in both their first and second positions. It’s much easier to move within either of those sets by mutation than to jump from one set to the other. Changing either of the first two letters in any of these six codons results in a different amino acid or a stop codon, which has a lot more potential to wreak havoc than a mutation that leaves the protein alone.
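To make the serine gap concrete, here is a minimal Python sketch that groups an amino acid's codons into sets reachable from one another by single-base changes that keep the amino acid the same. It uses the standard genetic code written with DNA letters (T instead of U). The function names `codon_clusters` and `min_gap` are made up for this illustration, not taken from any library or from the papers discussed here.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons in TCAG order.
# DNA letters are used, so T stands in for the U of the RNA table.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def hamming(a, b):
    """Number of positions at which two codons differ."""
    return sum(x != y for x, y in zip(a, b))


def codon_clusters(aa):
    """Group the codons of one amino acid into sets that are connected
    by single-base changes which leave the amino acid unchanged."""
    remaining = {c for c, a in CODON_TABLE.items() if a == aa}
    clusters = []
    while remaining:
        cluster = {remaining.pop()}
        grew = True
        while grew:
            # Codons one substitution away from anything already in the cluster.
            linked = {c for c in remaining
                      if any(hamming(c, d) == 1 for d in cluster)}
            grew = bool(linked)
            cluster |= linked
            remaining -= linked
        clusters.append(cluster)
    return clusters


def min_gap(cluster_a, cluster_b):
    """Fewest substitutions needed to get from one cluster to the other."""
    return min(hamming(a, b) for a in cluster_a for b in cluster_b)


if __name__ == "__main__":
    for aa in "GS":  # glycine, then serine
        print(aa, [sorted(c) for c in codon_clusters(aa)])
    ser = codon_clusters("S")
    print("serine gap:", min_gap(ser[0], ser[1]))
```

Glycine comes out as a single cluster, while serine splits into a four-codon set (TCN) and a two-codon set (AGT, AGC) that sit at least two substitutions apart.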
And apparently, that can, from a phylogenetic point of view, practically turn serine into two different amino acids. In a fairly recent Nature paper, Regier et al. (2010) investigated arthropod relationships and found that while protein-based methods gave very similar results to DNA-based methods, they often couldn’t offer as much support for these results as DNA did. That paper hints at the serine problem and mentions that a couple of the authors are working on it, and now the “working on” bit is out in PLoS ONE. (That’s how I came across this issue, in fact.)
The new analysis (Zwick et al., 2012) finds that tweaking protein-based evolutionary models so that the two kinds of serine count as different letters increases confidence in the resulting tree dramatically. The serines aren’t changing any major conclusions – if you take them out, you still get the same tree, just with lousy statistical support. But clearly, protein sequences alone were missing important evidence. In another situation, they might make the difference between a wrong answer and a right one.
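As a rough illustration of what "two kinds of serine as different letters" means for the data, the sketch below translates a coding sequence into a 21-letter alphabet, keeping `S` for TCN serines and using a separate symbol for AGT/AGC serines. The function name `translate21` and the lowercase `s` symbol are arbitrary choices made for this example; how Zwick et al. actually encode the extra state in their software is not shown here.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate21(seq, agy_serine="s"):
    """Translate a coding DNA/RNA sequence, giving AGT/AGC serines
    their own symbol so that they count as a 21st 'amino acid'.

    The symbol `agy_serine` is a hypothetical placeholder for this sketch.
    """
    dna = seq.upper().replace("U", "T")
    protein = []
    # Walk the sequence codon by codon, ignoring any trailing partial codon.
    for i in range(0, len(dna) - len(dna) % 3, 3):
        codon = dna[i:i + 3]
        aa = CODON_TABLE.get(codon, "X")  # X for codons with ambiguous bases
        if aa == "S" and codon.startswith("AG"):
            aa = agy_serine
        protein.append(aa)
    return "".join(protein)


if __name__ == "__main__":
    # ATG TCT AGC GGA: an ordinary translation would give MSSG.
    print(translate21("ATGTCTAGCGGA"))
```

With an ordinary 20-letter translation the two serines in the example are indistinguishable; here they come out as different characters, which is the extra information the protein sequence alone throws away.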
(Now I wonder if anyone’s done codon-based Hox gene phylogenies. Hox genes/proteins can be really difficult to classify because only a short region can be compared among all of them, and this short region evolves pretty slowly, yielding very few informative differences. But what if there’s more information hiding in the codons? There’s not an awful lot of serine in homeodomains, though, and while arginine, another sixfold degenerate amino acid, is pretty common in them, it doesn’t have nearly the mutational chasm that separates the two codon clusters for serine: a single first-position change is enough to get from one arginine cluster to the other. Meh. Maybe codons wouldn’t help at all with Hoxes.)
*OK, technically, you sort of have three choices, but RNA and DNA sequences contain the exact same information, so they don’t really count as different.
Regier JC et al. (2010) Arthropod relationships revealed by phylogenomic analysis of nuclear protein-coding sequences. Nature 463:1079-1083
Zwick A et al. (2012) Resolving discrepancy between nucleotides and amino acids in deep-level arthropod phylogenomics: differentiating serine codons in 21-amino acid models. PLoS ONE 7:e47450