The Mammal is confused

I’m talking mostly to myself in this one, but do listen if you want to know the random questions crossing a trainee evolutionary scientist’s mind 🙂

So, when you work with DNA or protein sequences, you learn that you should be careful with repetitive sequences. For example, here is a protein that consists mostly of the same amino acid (D = aspartic acid, in case you wondered) over and over again:

>gi|37991666|dbj|BAD00044.1| shell matrix protein [Pinctada fucata]

It’s a protein found in oyster shells, but it could be anything; the problem remains the same. You learn that when you compare/align sequences, you shouldn’t base an alignment, or conclude that two sequences are homologous, on bits like the above. (Which didn’t prevent some people from saying that this very protein “shows homology to” something else.)

I kind of never questioned that, but really, why?

We humans are intuitively bad at probability, but when you think about it, a series of sixty Ds is just as likely as a pitch-perfect homeodomain. It’s like coin tossing. When every throw has a 1/2 chance of landing heads *or* tails, then any particular sequence of heads and tails is equally likely. HHHH and HTHT are exactly equally probable.
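To convince myself of the coin-toss argument, here is a minimal Python sketch. The function name `sequence_probability` and the made-up mixed 60-mer are my own illustration, and the amino acid part assumes all 20 letters are equally likely, which real proteins don't quite obey.

```python
from fractions import Fraction

def sequence_probability(seq, letter_probs):
    """Probability of drawing exactly this sequence, one independent letter at a time."""
    p = Fraction(1)
    for letter in seq:
        p *= letter_probs[letter]
    return p

# Fair coin: every particular run of four tosses has the same probability.
coin = {"H": Fraction(1, 2), "T": Fraction(1, 2)}
print(sequence_probability("HHHH", coin))  # 1/16
print(sequence_probability("HTHT", coin))  # 1/16

# Same idea with amino acids, ASSUMING all 20 are equally likely.
amino_acids = "ACDEFGHIKLMNPQRSTVWY"
uniform = {aa: Fraction(1, 20) for aa in amino_acids}

poly_d = "D" * 60          # sixty aspartates in a row
mixed = amino_acids * 3    # an arbitrary, made-up 60-residue sequence

# Under the uniform assumption the two 60-mers are exactly equally probable.
print(sequence_probability(poly_d, uniform) == sequence_probability(mixed, uniform))  # True
```

Exact fractions are used instead of floats so the equality check is not muddied by rounding.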

Nucleotides or amino acids are slightly different, because not all letters are equally likely, but the rarer ones can also form repeats, and I don’t think I’ve ever heard anyone make an exception for them.

My best guess is that repeats are easy to make by some sort of DNA replication slippage, so they come and go in evolution as they please. Does that make sense?

I think I have to go and ask someone…

ETA: Hmm, it appears to be a computational issue, based on the BLAST guides over at NCBI. Sayeth the Manual:

There is one frequent case where the random models and therefore the statistics discussed here break down. As many as one fourth of all residues in protein sequences occur within regions with highly biased amino acid composition. Alignments of two regions with similarly biased composition may achieve very high scores that owe virtually nothing to residue order but are due instead to segment composition. Alignments of such “low complexity” regions have little meaning in any case: since these regions most likely arise by gene slippage, the one-to-one residue correspondence imposed by alignment is not valid. While it is worth noting that two proteins contain similar low complexity regions, they are best excluded when constructing alignments [42-44]. The BLAST programs employ the SEG algorithm [43] to filter low complexity regions from proteins before executing a database search.

So… yes, part of it is replication slippage, but also, the… low complexity of low-complexity regions skews the probability calculations. If you base your match probabilities on the assumption that only every 20th amino acid is an aspartate, and in your sequence it’s every 2nd or 3rd, the match will seem far less likely than it would if you used the real composition of your sequence.
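A toy calculation makes the skew concrete. This is only a sketch of the composition argument, not how BLAST actually computes its statistics: the function `chance_match_probability` is a hypothetical helper, and the aspartate-rich composition (D at every 2nd position, the other 19 amino acids sharing the rest evenly) is an assumption I picked for illustration.

```python
from fractions import Fraction

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def chance_match_probability(composition_a, composition_b):
    """Probability that one residue drawn independently from each composition is identical."""
    return sum(composition_a[aa] * composition_b[aa] for aa in AMINO_ACIDS)

# The background model from the text: only every 20th amino acid is an aspartate
# (simplified here to all 20 amino acids being equally common).
uniform = {aa: Fraction(1, 20) for aa in AMINO_ACIDS}

# Hypothetical low-complexity composition: D is every 2nd residue,
# and the remaining half is split evenly among the other 19 amino acids.
d_rich = {aa: Fraction(1, 38) for aa in AMINO_ACIDS}
d_rich["D"] = Fraction(1, 2)

assumed = chance_match_probability(uniform, uniform)  # what the background model expects
actual = chance_match_probability(d_rich, d_rich)     # what two D-rich regions really give

print(assumed)  # 1/20
print(actual)   # 5/19
```

So two aspartate-rich stretches match by pure chance at about one position in four, roughly five times more often than the background model expects, and a run of such matches looks far more surprising than it really is.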

That makes sense. The Mammal is unconfused and feeling kind of smug now 🙂

