Lamprey Hox clusters and genome duplications, oh my!

What the hell is up with lamprey Hox clusters?

Lampreys are among the few living jawless vertebrates, creatures that parted evolutionary ways with our ancestors somewhere on the order of 500 million years ago. If you want to know where things like jaws, paired fins or our badass adaptive immune systems came from, a vertebrate that doesn’t possess some of these things and may have diverged from the rest of the vertebrates soon after others originated is just what you need for comparison.

The vertebrate fossil record is pretty rich thanks to us having hard tissues, so a lot can be inferred about these things from the wealth of extinct fishes we have at our disposal. However, there are times when comparisons of living creatures are just as useful as examinations of fossils, if not more so. (Fossils, for example, tend not to have immune systems. ;))

One of the things you absolutely need a living animal to study is, of course, genome evolution. Vertebrates – well, at least jawed vertebrates – are now generally accepted to have the remnants of four genomes. Our long-gone ancestors underwent two rounds of whole genome duplication. Afterwards, most of the extra genes were lost, but evidence for the duplications can still be found in the structure of our genomes, where entire recognisable gene neighbourhoods of our close invertebrate relatives often still exist in up to four copies (Putnam et al., 2008).

Among these neighbourhoods are the four clusters of Hox genes most groups of jawed vertebrates possess. A “normal” animal like a snail or a centipede only has one of these. Since Hox genes are involved in the making of body plans, you have to wonder how suddenly having four sets of them and other developmental “master genes” might have influenced the evolution of vertebrate bodies.

Of course, to guess that, you need to know precisely when these duplications happened. That’s where lampreys come in: their lineage branched off from our definitely quadruple-genomed one after the next closest, definitely single-genomed group. But was it before, between, or after, the two rounds of duplication?

A few years ago, a phylogenetic analysis of 55 gene families by Kuraku et al. (2009) suggested that the lamprey-jawed vertebrate split happened after the 2R. Just this year, the genome of the sea lamprey Petromyzon marinus was finally published (Smith et al., 2013), and its authors agreed that yes, lampreys probably split off from us post-2R. (I don’t entirely get all the things they did to arrive at this conclusion. Groups of linked genes show up again, among other approaches.)

However, that isn’t the whole story, the latest lamprey genomics paper argues (Mehta et al., 2013). The P. marinus genome assembly couldn’t stitch all the Hox clusters properly together. There were two that sat on nice big scaffolds with the whole row of Hox genes and a few of their neighbours, and then there were a bunch of “loose” Hox genes that they couldn’t link to anything (diagram comparing humans and P. marinus below from Smith et al., 2013; the really pale blue boxes under the numbers represent Hox genes):


Given that Hox9 genes exist in four copies in this species, it seems like there may be four clusters. However, in hagfish, the other kind of living jawless vertebrate, a study found Hox genes that seemed to have as many as seven copies (Stadler et al., 2004). Another round of duplication? It wouldn’t be unheard of. Most teleosts, which include most of the things we call “fish” in everyday parlance, have seven Hox clusters courtesy of an extra genome duplication and loss of one cluster*. Salmon and kin have thirteen, after yet another duplication. Maybe hagfish also had another one – but did lampreys? How many more clusters do those lonely Hox genes belong to?

Mehta et al. hunted down the Hox clusters of Japanese lampreys (Lethenteron japonicum), hoping to pin down exactly how many there were. They used large chunks of DNA derived partly from the testicles, where sperm cells and their precursors keep the full genome throughout the animal’s life (lampreys throw away large chunks of the genome in most non-reproductive cells [Smith et al., 2009]). They probed these for Hox genes and sequenced the ones that tested positive. Plus they also got about two-thirds of the full genome together in fairly big pieces. Together, these data allowed them to get a better idea of the mess that is lamprey Hox cluster genomics.

They assembled four whole clusters, including their neighbouring genes, and a partial fifth cluster. A bunch of other genes sat on smaller sequence fragments containing only a couple of Hoxes, or a Hox and a non-Hox, but they were tentatively assigned to a total of eight clusters, eight being the number of different Hox4 genes in the data (no known vertebrate Hox cluster contains more than one Hox4 gene). The L. japonicum equivalents of the 31 publicly available Hox sequences from P. marinus spread out over six of these, which indicates that both species have at least six clusters. Seems like lampreys had another round of genome duplication after 2R? (Summary of L. japonicum Hox clusters from Mehta et al. below.)
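
The counting argument here (no cluster holds more than one gene of a given paralogue group, so the most-duplicated group sets a floor on the cluster count) can be sketched in a few lines of Python. This is only my illustration of the logic, not the authors' pipeline; the function name `min_clusters` and the toy inventory are made up.

```python
from collections import Counter

def min_clusters(paralogue_groups):
    """Lower bound on the number of Hox clusters.

    paralogue_groups: one paralogue-group number per distinct Hox gene
    found (e.g. 4 for every distinct Hox4 gene).
    Assumption: no cluster carries two genes of the same paralogue group.
    """
    counts = Counter(paralogue_groups)
    # The most-duplicated paralogue group needs at least that many clusters.
    return max(counts.values()) if counts else 0

# Hypothetical toy inventory, NOT the real L. japonicum gene list:
# eight distinct Hox4 genes, five Hox11 genes, three Hox13 genes.
toy_inventory = [4] * 8 + [11] * 5 + [13] * 3
print(min_clusters(toy_inventory))  # 8
```

Note that this only gives a minimum: genes that were lost from a cluster leave no trace in the count, so the real number could be higher.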

But wait, that’s not the end of it.

First of all, although there are undoubtedly four complete Hox clusters in L. japonicum, the relationships of these clusters to our four are terribly confused. Whether you look at the phylogenetic trees of individual genes, or the arrangement of non-Hox genes on either side of the cluster, only a big pile of what the fuck emerges. Phylogenies are problematic because the unusual composition of lamprey genes and proteins (Smith et al., 2013) could easily throw them off. All the complete lamprey clusters have a patchwork of neighbours that look like a mashup of more than one of our Hox clusters. Might it mean that lampreys’ proliferation of Hox clusters occurred independently of ours? Did we split before 2R after all?

Hox genes are not the only interesting things in a Hox cluster. In the long gaps between them, there are all sorts of little DNA switches that regulate their behaviour. Some of these are conserved across the jawed vertebrates. Mehta et al. aligned complete Hox clusters of humans, elephant sharks and lampreys to look for such sequences – called conserved non-coding elements or CNEs – in the lamprey.

They only found a few, but that’s enough for a bit more head-scratching. Most CNEs in, say, the human HoxA cluster are only found in one elephant shark cluster, and vice versa. Humans have a HoxA cluster, elephant sharks have a HoxA cluster, they’re clearly the same thing, pretty straightforward. Not so for lampreys. Homologues of individual CNEs in the complete lamprey clusters are spread out over all four human/elephant shark clusters. More evidence for independent duplications?

Mehta et al. are cautious – they point out that the silly mix of Hox cluster neighbours in lampreys could just be due to independent post-2R losses, which is plausible if the split between lamprey and jawed vertebrate lineages happened not too long after 2R. There’s also the fact that the weird lamprey sequences are phylogenetic minefields – however, that’s a double-edged sword, since the same caveat applies to analyses that support a post-2R divergence. Then, perhaps the same argument that goes for Hox cluster neighbours could also apply to CNEs. And, of course, this is just Hox clusters. Smith et al.’s (2013) findings about overall genome structure don’t go away just because lamprey Hox clusters are weird.

So, in summary, thanks, lampreys. Fat lot of help you are! 😛


*Actually, two losses of two separate clusters in two different teleost lineages. Because Hox evolution wasn’t already complicated enough.



Kuraku S et al. (2009) Timing of genome duplications relative to the origin of the vertebrates: did cyclostomes diverge before or after? Molecular Biology and Evolution 26:47-59

Mehta TK et al. (2013) Evidence for at least six Hox clusters in the Japanese lamprey (Lethenteron japonicum). PNAS 110:16044-16049

Putnam NH et al. (2008) The amphioxus genome and the evolution of the chordate karyotype. Nature 453:1064-1071

Smith JJ et al. (2009) Programmed loss of millions of base pairs from a vertebrate genome. PNAS 106:11212-11217

Smith JJ et al. (2013) Sequencing of the sea lamprey (Petromyzon marinus) genome provides insights into vertebrate evolution. Nature Genetics 45:415-421

Stadler PF et al. (2004) Evidence for independent Hox gene duplications in the hagfish lineage: a PCR-based gene inventory of Eptatretus stoutii. Molecular Phylogenetics and Evolution 32:686-694

Sponges never yield

Ah, those pesky sponges again. Although their lives are rather low on action, these strange animals have found lots of other ways to fascinate me. Borrowed skeletons, mysterious lost Hox genes, wonderful alien shapes and perhaps the oldest fossils that might be animals – sponges have no trouble supplying us with stories.

Sponges are also something of a headscratcher for evolutionary biologists. These days it’s generally agreed that they can be divided into four major groups, the glass sponges (hexactinellids), the demosponges, the calcareous sponges and the homoscleromorphs. My impression is that the close relationship of the first two is also well-established. However, to this day biologists haven’t quite managed to agree whether the resulting three groups (glass sponges plus demosponges, calcareous sponges, and homoscleromorphs) form one lineage to the exclusion of other animals, or two or three separate lineages with some of them being closer to non-sponge animals. The trees below illustrate two possibilities:


This is kind of important if you want to know what kind of organism gave rise to the diversity of animals. If sponges are paraphyletic (some sponges closer to non-sponges, as in the right-hand tree), then the mother of all animals was likely sponge-like, sitting on the seafloor and driving water through its porous body to capture food. In such scenarios, two or three of the deepest branchings of the animal tree run into sponges on one side. The simplest explanation for that is that the ancestors at these branching points were themselves sponge-like.

If, however, sponges are monophyletic (their last common ancestor only gave rise to sponges, as in the left-hand tree), then the last common ancestor of animals immediately branched into sponges and non-sponges, whose living descendants are very different from sponges. Suddenly, guessing what Mummy Metazoa might have been like becomes much harder.

The deepest phylogeny of animals is difficult. We are talking about lineages that diverged over 600 million years ago even by conservative estimates (Peterson et al., 2004), and it’s also likely that their early divergences followed each other in rapid succession. That combination is depressingly good at eroding the useful information in gene and protein sequences (Rokas and Carroll, 2006). So what can a phylogeneticist do?

Picking your genes carefully is one solution – use as many as you can to maximise the information you can glean from them, but don’t use genes that evolve so fast their “information” is basically all noise. Another option is to use so-called rare genomic changes. These are things like gaining and losing bits of genes or insertions of parasitic DNA.

Their advantage is that they are unlikely to occur twice in the same way. If two animals have the same viral sequence between genes A and B, it’s far more likely to indicate relatedness as opposed to chance similarity than having the same letter at position 138 in the sequence of gene A. The principle of parsimony (choose the simplest explanation) is a shitty way of interpreting sequence similarity because there’s a high chance of any given change occurring more than once. It works much better for such unlikely events as gaining a virus in the same spot.

microRNAs look like a pretty good source of rare genomic changes. They are small RNAs encoded in the genome, and they play crucial roles in gene regulation in most animals. Their sequences evolve extraordinarily slowly, so it’s relatively easy to identify them across species despite their tiny size. There’s loads and loads of them – miRBase, the microRNA database, lists 1600 different miRNA genes for humans, yielding over 2000 mature miRNAs after processing. The miRNAs in our genome include everything from ancient types with origins in the mists of the Precambrian to young sequences confined to our close relatives. On top of all that, they are thought to be very difficult to lose. All in all, perfect phylogenetic markers.

Perfect for some cases, that is. Their presences and absences may paint a coherent evolutionary picture for most animals, but don’t ask them about sponges. Robinson et al. (2013) tried…

In their introduction stands this depressing summary of current animal miRNA lore, based on many non-sponge genomes plus that of the demosponge Amphimedon queenslandica:

“None of the thousands of miRNAs thus far discovered in eumetazoans are present in the genome [of] A. queenslandica and none of the eight silicisponge‐specific miRNAs have been described in any eumetazoan (or any other eukaryotic group for that matter).”

(Eumetazoans = all animals except sponges and the Blob; silicisponges = glass sponges + demosponges)

However, that’s only two of the four sponge groups. What about the other two? Are they any more helpful? Might they have silicisponge-like repertoires, supporting sponge monophyly? Or might they be hiding some “eumetazoan” miRNAs, arguing for one history or another involving sponge paraphyly? This is what the authors wanted to find out.

They looked for miRNAs by collecting and sequencing small RNAs from calcareous and homoscleromorph sponges. Two species of the former and one of the latter also have genome projects going, which allowed the researchers to verify the RNAs they found as bona fide miRNAs (miRNA genes have a particular structure that doesn’t all show in the mature RNA product) as well as look for the protein components of the editing machinery miRNAs need to reach maturity.

(Below: the three sponges with newly sequenced genomes. Sycon ciliatum from the Adamska group, Leucosolenia complicata from, Oscarella carmela & Oscarella sp. from the Nichols lab.)

Well, the calcarean genomes certainly contained genes for miRNA-processing enzymes, which is a good sign that they also have miRNAs somewhere. So what do those look like?

Overall, the results of Robinson et al.’s search are a bit disappointing. They used strict criteria to identify miRNAs, since there are plenty of other kinds of small RNA molecules floating around doing stuff in animal cells. According to these criteria, only one miRNA was confidently identified in the calcareous sponges. This was present in both Sycon and Leucosolenia, but the niggardly bastards didn’t share it with either silicisponges or other animals. Leucosolenia may have a second one. A bunch of eumetazoan-like sequences also showed up, but these were probably contamination from actual eumetazoans, since the Sycon genome, which was obtained from squeaky clean lab-grown sponges, had none of them.

Oscarella only yielded two possible miRNAs, neither of which was known from anything else (including the other homoscleromorphs in this study!) Worse, they couldn’t even find the two processing enzymes in the Oscarella genome – they only recovered a small fragment of one. Maybe the genome sequence is just incomplete, which wouldn’t be very surprising. Then again, maybe Oscarella genuinely doesn’t have a functioning miRNA system, and that could be quite interesting.

Either way, now we know something about microRNAs in all the great sponge lineages. It doesn’t look like they’re going to help us sort out deep animal phylogeny, but maybe the very absence of similarities is telling us something. The reason many microRNAs are so conserved in other animals is that they play important roles in fundamental developmental processes, such as specifying cell types. If sponges aren’t really fussed about keeping them, then maybe their development just doesn’t depend heavily on miRNAs. So what do they do with theirs? Why the difference? Questions, questions…

(Incidentally, here’s yet another reason not to just look at one species and make sweeping claims about “sponges”. Here is also a reason to thank the gods for next generation sequencing. Without the ability to quickly and cheaply [for certain values of “cheap”] sequence tons of DNA and RNA from any creature you fancy, half of the story of animal evolution would be hidden in undeciphered strings of DNA in animals too few people care about for a sequencing project. Yay for technology, yay for diversity!)

[P.S.: there’s so much I’ve wanted to write about and didn’t recently. Since I came back from my Christmas break, I’ve spent most of my time buried in work-related literature. Reading more literature was the last thing I wanted to do with my free time. I’m almost done with that, though. No promises, but I’m almost done ;)]



Peterson KJ et al. (2004) Estimating metazoan divergence times with a molecular clock. PNAS 101:6536-6541

Robinson JM et al. (2013) The identification of microRNAs in calcisponges: independent evolution of microRNAs in basal metazoans. Journal of Experimental Zoology B, advance online publication available 24/01/2013, doi: 10.1002/jez.b.22485

Rokas A, Carroll SB (2006) Bushes in the tree of life. PLoS Biology 4:e352

More genes from scratch

Following on from the yeast “proto-gene” study, I’ve started paying more attention to news about gene birth from non-coding DNA. (Or “junk” DNA, if you will, though “junk” is… something of a misnomer :-P) The yeast paper explored protein-coding genes in the process of birth. This new one I found in PLoS Genetics looks at genes that have already been born, and argues that the sequences they came from were functional long before they began to serve as templates for proteins.

Paternity testing for genes

So how do you know that a protein-coding gene came from non-coding DNA? Xie et al. (2012) looked specifically for genes born along the ape lineage, that is, the group that includes gibbons, orangs, chimps, gorillas and ourselves. They searched for human genes that had no protein-coding homologue in non-apes including rhesus macaques, mice, dogs and a handful of other mammals, but did match some of the other animals’ DNA sequence. It was also important that there were no other similar sequences in the human genome itself, which might have indicated that the “new” gene actually originated by duplication.

To ascertain that the selected genes represented gene birth in the ape lineage as opposed to a dying gene in other mammals, they also looked at the sequence changes that garbled the would-be protein product of the non-ape sequences. If these are the same in all non-ape versions of a gene, that probably means that they were inherited from the common ancestor of apes and all these species, that is, the non-coding version of the gene came first. Only genes that really seemed to have been born in our lineage were kept.
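
The "same garbling everywhere" test boils down to a set intersection. Here is a minimal sketch of that logic, assuming each non-ape sequence has already been reduced to a set of labelled disabling changes; the labels, species list and function name are hypothetical, and the real analysis was certainly more involved.

```python
def shared_disablers(disablers_by_species):
    """Return the disabling changes found in every listed species.

    disablers_by_species: dict mapping species name -> set of labels for
    changes that break the would-be reading frame (stops, frameshifts).
    A non-empty result is consistent with the non-coding state being
    inherited from a common ancestor rather than arising by separate
    gene deaths.
    """
    sets = [set(changes) for changes in disablers_by_species.values()]
    return set.intersection(*sets) if sets else set()

# Hypothetical labels, for illustration only.
non_ape = {
    "macaque": {"stop@45", "frameshift@102"},
    "mouse": {"stop@45", "frameshift@102", "stop@210"},
    "dog": {"stop@45", "frameshift@102"},
}
print(sorted(shared_disablers(non_ape)))  # ['frameshift@102', 'stop@45']
```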

In the end, they came up with 24 genes that passed muster. Some of these coded for proteins in both chimps and humans, others only in humans. Based on RNA-sequencing data from rhesus macaques, 20 non-coding versions of these genes were active in monkeys – and this is where things get interesting.

Function in the junkyard

The big question the team asked was this: are these non-coding RNAs just random noise in transcription, or are they already functional even without a protein product? They decided to answer this question by looking at the structure and expression patterns of the RNAs in macaques, chimps and humans.

By “structure”, they mean how the RNA is edited after it’s transcribed from the gene sitting in the DNA. The RNA from most genes in animals and other eukaryotes isn’t taken straight through protein synthesis. First, pieces called introns are chopped out and the rest (called exons) are spliced together to yield the final template for the protein.* When the researchers looked at RNA sequencing results from the three primates, they found that the non-coding sequences in macaques were cut and joined at the same points that the protein-coding human sequences were: a bunch of RNA sequences from macaques spanned both sides of a human splice site, and contained none of the intron in between. Such conservation, the authors argue, is indicative of functionality.
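
To make "spanned both sides of a human splice site, and contained none of the intron" concrete, here is a rough sketch of how one might check a single aligned read. It assumes the read's alignment has already been reduced to a sorted list of half-open genomic intervals; that representation and the function name are my own inventions, not anything from the paper.

```python
def supports_junction(blocks, intron_start, intron_end):
    """Check whether an aligned read skips exactly one given intron.

    blocks: sorted list of (start, end) half-open genomic intervals that
            the read aligns to (hypothetical representation).
    intron_start, intron_end: half-open interval of the intron.
    Returns True if one aligned block ends exactly where the intron
    begins and the next block starts exactly where the intron ends.
    """
    for (_, end_left), (start_right, _) in zip(blocks, blocks[1:]):
        if end_left == intron_start and start_right == intron_end:
            return True
    return False

# A read covering the end of one exon and the start of the next:
print(supports_junction([(100, 150), (300, 350)], 150, 300))  # True
# A read that runs straight through without any gap:
print(supports_junction([(100, 200)], 150, 300))  # False
```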

Expression patterns also suggested that the macaque sequences weren’t just noise: when they compared the abundance of the different RNAs between different tissues in the three primates, Xie et al. found that the non-coding RNAs weren’t just expressed all over the place – they were significantly more abundant in some parts of the body than others. This pattern was consistent across species: a sequence that was most abundant in the macaque’s brain was also likely to be brain-specific in humans, even though the macaque version didn’t code for a protein and the human gene did. (By the way, a lot of the 24 were most active in the brain, which is apparently something of a trend among new human genes regardless of their mode of origin. Guess our brains evolved rather a lot in the last few million years ;))

This is very cool, but I’m kind of worried about the arguments used for functionality. Maybe this is just me being a newbie in this area, but I’m not sure that useless non-coding RNAs should be expressed all over the place. One of the salient features of the 24 genes in this study is that they sit near other genes, sometimes even overlapping them. In any case, they are close enough to use the switches that normally regulate the activity of the other gene(s). That would mean that they’d be most highly expressed wherever their neighbours are, which doesn’t depend on them having a function. If some of them happened to acquire protein-coding potential by mutation, presumably it’d only be kept by natural selection if the resulting proteins did something useful in those places, hence the conservation of expression patterns.

Likewise, splice sites may well arise by accident (they aren’t all that complex), and they don’t have to disappear just because a mutation somewhere else makes the sequence suitable for protein synthesis. Though in fact, splice site conservation sounds more convincing than expression conservation to me as far as arguments for function go. Because splice sites can come and go quite easily, there’s no reason they should be particularly conserved between any two sequences unless they’re important. And the splice sites can only be important if the sequence they’re in does something. Who cares where you cut a strand of random RNA that’ll only end up eaten by housekeeping enzymes anyway?

So, while I get all excited about the whole new genes side of things, and I love this sort of genomic detective work, I think I still have to sleep on that point about function coming before protein. It’s a pity they didn’t check if any of the “non-coding” RNAs in macaques (and chimps) were occasionally translated into a protein, albeit a smaller one than the human counterpart. The yeast people did that and it was awesome, and it would have been such an informative thing to do in this case.

(Also, it would have been darn cool if they’d tried knocking out some of them and seeing if they got screwed up monkeys, but let’s be realistic. Macaques don’t breed like mice, they take a hell of a long time to grow up, and we’re kinda reluctant to mistreat our close relatives like that. You’d have to be comic book supervillain insane to embark on that experiment.)


*This may sound like a weird way to run a genome, but it’s actually quite good for making more than one product from the same gene. It’s pretty important in real life – nearly all human genes with multiple exons are spliced in at least two different ways, and many genetic diseases originate from messed up splicing.



Xie C et al. (2012) Hominoid-specific de novo protein-coding genes originating from long non-coding RNAs. PLoS Genetics 8:e1002942

Animals, amoebae and assumptions

Animals aren’t the only multicellular creatures in their phylogenetic neighbourhood. Social amoebae, many fungi and quite a few of the poorly known choanoflagellates spend at least part of their lives as collections of cooperating cells. Conventional wisdom has been that these groups invented multicellularity independently, but maybe conventional wisdom needs a bit of challenging.

To tell you the truth, I never really thought about the other possibility, that being multicellular is the original state of affairs for these organisms. I never really considered the evidence on which the conventional wisdom was based. You could say I didn’t really care either way. A while back I saw a paper that said something about a social amoeba having an epithelium, but I just kind of shrugged and went on with my life. Now an article in BioEssays has brought this up again, and I’m not sure I was right to ignore it back then. I think Dickinson et al. (2012) have a point, and I think some assumptions may need to be reexamined.

In case you wondered, an epithelium is a type of tissue made of a layer or layers of polarised cells. “Polarised” means that various cellular components – proteins, attachments to neighbouring cells, organelles – are distributed unevenly in the cell, clustered towards one or the other side of the cell layer. Epithelia line pretty much everything in a typical animal’s body, from, well, the entire body, to things like guts and glands. They secrete important stuff like hormones, and their closely packed cells form a barrier to keep molecules and pathogens where they belong. An epithelium was thought to be a uniquely animal thing to have, but looking more closely at that weird little amoeba suggested it may not be.

The paper that I ignored was Dickinson et al. (2011) – yes, by the exact same people who wrote the BioEssays piece. OK, I didn’t completely ignore it. I read enough of it to scribble a quick note in my citation manager saying “screams convergent evolution to me”. The paper examined the multicellular stage in the life of Dictyostelium discoideum, an ordinarily single-celled amoeba that reacts to food shortages by crowding together with friends and family to form a fruiting body that helps disperse some of its cells in search of new habitats. The fruiting body is pretty complex for a “unicellular” creature, and it turns out that this complexity includes a region of tissue that looks quite a lot like a simple epithelium. It doesn’t just look like one; it sorts out its insides and outsides with the help of proteins called catenins, which are also involved in cell polarity in the epithelia of animals. (Below: D. discoideum being multicellular, from Wikipedia)

That isn’t much evidence to base an inference of homology on, especially since other key players in animal cell polarity are entirely absent from D. discoideum. But equally, the fact that tons of unikonts (the group including amoebae, slime moulds, fungi, choanoflagellates and animals) are single-celled doesn’t mean that the multicellular groups all came up with the idea independently. Evolution doesn’t always increase complexity – sometimes complexity becomes superfluous.

I remember when we discussed the choanoflagellate genome paper (King et al., 2008) in class. The genome in question belongs to a purportedly single-celled creature, but it contains tons of genes you’d think only multicellular organisms would need, such as genes for cell-to-cell adhesion proteins. So one explanation is that these proteins originally did something else, like anchoring a single cell to its favourite spot. Another explanation is that they did have something to do with multicellularity – it just wasn’t the multicellularity of animals at first.

This suggestion isn’t terribly controversial when you’re talking about choanoflagellates, since some of them do obviously form colonies (one such colony of Salpingoeca/Proterospongia rosetta is shown below, from Mark Dayel of the King lab via ChoanoWiki). It’s not hard to imagine that either the “single-celled” species whose genome was sequenced also has a colonial stage the scientists just never saw, or that its recent ancestors did.

Whether or not the same applies to the whole of unikonts is a more difficult question. I’m not at all familiar with the details of unikont relationships, but based on the tree shown in the BioEssays article, multicellularity is all over the group. In most cases, it’s facultative multicellularity; animals are rather the exception in being doomed to it for their entire lives. However, if you just looked at that tree, you’d wonder why the hell anyone thought the common ancestor of these things wasn’t some kind of multicellular.

Yet the details of animal-like multicellularity aren’t so widespread. True cadherins (the cell adhesion proteins I mentioned) have only been found in animals proper. Choanoflagellates and some even more obscure relatives of animals have bits and pieces of them, and other unikonts have none at all as far as anyone knows. Epithelium-like tissues have only been described in that one species of amoeba – but, as Dickinson and colleagues note, no one really looked in the others.

Personally, I wouldn’t be at all surprised if the conventional wisdom ended up shifting. I still don’t think that the evidence from Dictyostelium is enough to draw a conclusion. We obviously need to know a lot more about unikont genomes, tissues and life cycles to piece together the history of multicellularity in the group, but I’m not sure that right now a unicellular ancestor has a lot more going in its favour than a multicellular one. Guess we’ll have to wait and look with an open mind 🙂



Dickinson DJ et al. (2011) A polarized epithelium organized by β- and α-catenin predates cadherin and metazoan origins. Science 331:1336-1339

Dickinson DJ et al. (2012) An epithelial tissue in Dictyostelium challenges the traditional origin of metazoan multicellularity. BioEssays advance online publication, 29/08/2012, doi:10.1002/bies.201100187

King N et al. (2008) The genome of the choanoflagellate Monosiga brevicollis and the origin of metazoans. Nature 451:783-788


Genes from scratch

When we talk about evolutionary novelty, especially if the talking is to non-specialists, gene duplication is all the rage. From the sophistication of vertebrate blood clotting to the seemingly pointless complexity of a yeast proton pump (Finnigan et al., 2012), accidentally copied genes are undoubtedly an important source of new stuff in evolution. But copying and tweaking is not the only way new genes can arise. Sometimes, new genes really are new.

I admit, I wasn’t nearly excited enough about this possibility until this paper landed in my RSS reader a while back. Toll-Riera et al. (2012) find that the boring repetitive DNA that my gut feeling would’ve dismissed as true “junk” may actually be a great source of new proteins. First, it’s a good theoretical source. Long stretches of repetitive sequence are less likely than random sequence to suddenly and unceremoniously end in a stop codon* and translate to a short and useless amino acid sequence. Second, it appears that younger proteins do contain more repetitive sequence than old ones. What’s more, the repeats are often found within the regions that confer function on proteins. They aren’t just useless filler.
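
A back-of-the-envelope way to see why random sequence rarely stays open for long: in the standard genetic code, 3 of the 64 codons are stops. If you assume codons are drawn uniformly and independently (a crude assumption of mine, real genomes are not that tidy), the chance of surviving n codons without a stop is (61/64)^n. A repeat made of a non-stop codon, by contrast, never hits a stop in that reading frame. A quick sketch:

```python
def p_no_stop(n_codons, n_stop=3, n_total=64):
    """Probability that n_codons random codons contain no stop codon.

    Assumes every one of the n_total codons is equally likely and that
    successive codons are independent. Both are simplifications.
    """
    return ((n_total - n_stop) / n_total) ** n_codons

# How quickly the odds of staying open drop with length:
for n in (10, 100, 300):
    print(n, p_no_stop(n))
```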

So, okay, a lot of proteins seem to come from pieces of “junk” DNA. How?

Maybe they arise from random gene expression noise and turn into proper genes gradually, say Carvunis et al. (2012). It has been known for a while that DNA that doesn’t belong to traditionally recognised genes quite often gets transcribed into RNA in cells. Sometimes, these random bits of RNA may even be translated into an amino acid chain. If some of these accidents are actually useful, the researchers reasoned, they could create a selection pressure to turn the DNA that produced them into a proper gene.

They took this idea and applied it in a study of open reading frames (ORFs) in the yeast genome. An “ORF” is jargon for a stretch of DNA that isn’t interrupted by stop codons. In theory, any ORF could make a “meaningful” piece of protein. Most ORFs that aren’t genes are short, often just a handful of codons; and most ORFs known to be genes are long, with hundreds of codons. The team argued that if random ORFs can give rise to genes, there should be plenty of transitional forms.

To test this, they first classified all the hundreds of thousands of ORFs in the yeast genome according to their evolutionary age. The ones that were conserved in all of the yeast species they used for comparison were given a score of 10, and ORFs that only brewer’s yeast had were called zeroes. (Most known genes belong to classes 5-10, meaning they evolved quite far back on the yeast family tree.) The next step was to pick the Class Zero ORFs that were actually transcribed and translated, and so might be in the pool of potential “proto-genes”. They found this set of “0+” ORFs by analysing RNA sequencing data in both happy yeast cells and yeast deprived of food, just to make sure they caught any sequences that only acted like genes under some circumstances. In addition, they also checked which of those RNAs were associated with ribosomes, the sites of translation. These filtering steps left more than a thousand little ORFs that don’t belong to known genes, are completely unique to Saccharomyces cerevisiae, are expressed, and are probably translated.
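The filtering logic is simple enough to sketch in a few lines of Python. Everything here is hypothetical: the Orf record, its field names and the three toy entries are invented for illustration, and the real study of course worked from sequencing data rather than tidy true/false flags.

```python
from dataclasses import dataclass


@dataclass
class Orf:
    name: str             # made-up identifier
    conservation: int     # 0 = only in S. cerevisiae, 10 = in every species compared
    rna_fed: bool         # transcript detected in well-fed cells
    rna_starved: bool     # transcript detected in starved cells
    on_ribosomes: bool    # transcript found associated with ribosomes


def zero_plus(orfs):
    """Keep class 0 ORFs that are transcribed in either condition and ribosome-associated."""
    return [
        orf for orf in orfs
        if orf.conservation == 0
        and (orf.rna_fed or orf.rna_starved)
        and orf.on_ribosomes
    ]


if __name__ == "__main__":
    toy = [
        Orf("orf_a", 0, False, True, True),    # expressed only when starved: kept
        Orf("orf_b", 0, True, True, False),    # transcribed, not on ribosomes: dropped
        Orf("orf_c", 7, True, True, True),     # old, conserved gene: not class 0
    ]
    print([orf.name for orf in zero_plus(toy)])
```

Checking both conditions with "or" is the point of the starvation experiment: an ORF that only wakes up in hungry cells still makes the cut.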

Going up the conservation scale, ORFs become increasingly gene-like. The older ones are longer, their RNA copies are more abundant, and more of them appear constrained by natural selection. (Interestingly, when you translate them, the more gene-like ORFs produce less ordered protein structures. Not sure what to make of that.) Proper genes are also better suited to get ribosomes to translate them. Conservation classes 1-4, those ORFs that are shared only by closely related Saccharomyces species, are intermediate in all of these properties (and some more) between the zeroes and the older ORFs.

There is one more thing about this study that definitely bears mentioning. When you count how many new gene duplicates this yeast species has versus how many new, potentially functional, random ORFs, the latter come out on top by far. Between them, S. cerevisiae and its closest sister species apparently have somewhere between one and five newly duplicated genes. The same duo also came up with nineteen new ORFs that are under selection and therefore probably functional. Potentially, these random little sequences people might have dismissed as background noise not long ago are more potent sources of new genes than the celebrated gene duplication.

I don’t know about you, but that absolutely fascinates me.


P.S.: Incidentally, this is all about protein-coding genes. However, thousands of genes in your own genome do NOT encode proteins. They include genes for the good old RNA components of the translation machinery, ribosomal and transfer RNA, but there are also other RNA genes with transcripts involved in everything from keeping parasitic DNA in check to editing the messenger RNAs of other genes. I kind of want to find out how these RNAs form and acquire functions. Also, when we are quite happy to call a piece of DNA that doesn’t have a protein product a “gene”, and cells are swarming with RNA that doesn’t come from things traditionally called “genes”, and some of this RNA actually does encode proteins, what does that do to the definition of a “gene”??


*Gotta love the mnemonics on that page. I didn’t think three three-letter combinations would be that hard to remember, but I have to admit I chuckled at “U Are Gone”.



Carvunis A-R et al. (2012) Proto-genes and de novo gene birth. Nature advance online publication, doi: 10.1038/nature11184

Finnigan GC et al. (2012) Evolution of increased complexity in a molecular machine. Nature 481:360-364

Toll-Riera M et al. (2012) Role of low-complexity sequences in the formation of novel protein-coding sequences. Molecular Biology and Evolution 29:883-886