When we talk about evolutionary novelty, especially if the talking is to non-specialists, gene duplication is all the rage. From the sophistication of vertebrate blood clotting to the seemingly pointless complexity of a yeast proton pump (Finnigan et al., 2012), accidentally copied genes are undoubtedly an important source of new stuff in evolution. But copying and tweaking is not the only way new genes can arise. Sometimes, new genes really are new.
I admit, I wasn’t nearly excited enough about this possibility until this paper landed in my RSS reader a while back. Toll-Riera et al. (2012) find that the boring repetitive DNA that my gut feeling would’ve dismissed as true “junk” may actually be a great source of new proteins. First, it’s a good theoretical source. Long stretches of repetitive sequence are less likely than random sequence to suddenly and unceremoniously end in a stop codon* and translate to a short and useless amino acid sequence. Second, it appears that younger proteins do contain more repetitive sequence than old ones. What’s more, the repeats are often found within the regions that confer function on proteins. They aren’t just useless filler.
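To put a rough number on the stop codon argument, here is a back-of-the-envelope sketch of my own (it is not from Toll-Riera et al.). It assumes that in "random" sequence all 64 codons are equally likely and independent, which real genomes don't quite obey, so 3 in every 64 codons would be a stop. The function names `prob_no_stop` and `frames_with_stop` are made up for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # DNA spelling of UAA, UAG, UGA


def prob_no_stop(n_codons: int) -> float:
    """Chance that n random codons contain no stop codon.

    Assumption: all 64 codons are equally likely and independent,
    so each codon escapes being a stop with probability 61/64.
    """
    return (61 / 64) ** n_codons


def frames_with_stop(dna: str) -> list:
    """Return the forward reading frames (0, 1, 2) that hit a stop codon."""
    dna = dna.upper()
    hits = []
    for frame in range(3):
        codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
        if any(codon in STOP_CODONS for codon in codons):
            hits.append(frame)
    return hits


# Random sequence: the odds of staying open shrink quickly with length.
for n in (10, 50, 100, 300):
    print(n, prob_no_stop(n))

# A simple repeat: if one copy of the unit has no stop in a given frame,
# no number of extra copies will introduce one in that frame.
print(frames_with_stop("CAG" * 50))
```

Under that assumption, a random stretch of 100 codons stays stop-free less than 1% of the time, whereas a repeat like CAG reads as CAG, AGC or GCA depending on the frame, and none of those is a stop no matter how long the repeat gets.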
So, okay, a lot of proteins seem to come from pieces of “junk” DNA. How?
Maybe they arise from random gene expression noise and turn into proper genes gradually, say Carvunis et al. (2012). It has been known for a while that DNA that doesn’t belong to traditionally recognised genes quite often gets transcribed into RNA in cells. Sometimes, these random bits of RNA may even be translated into an amino acid chain. If some of these accidents are actually useful, the researchers reasoned, they could create a selection pressure to turn the DNA that produced them into a proper gene.
They took this idea and applied it in a study of open reading frames (ORFs) in the yeast genome. An “ORF” is jargon for a stretch of DNA that isn’t interrupted by stop codons. In theory, any ORF could make a “meaningful” piece of protein. Most ORFs that aren’t genes are short, often just a handful of codons; and most ORFs known to be genes are long, with hundreds of codons. The team argued that if random ORFs can give rise to genes, there should be plenty of transitional forms.
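If the definition feels abstract, here is a minimal sketch of what "a stretch of DNA that isn't interrupted by stop codons" looks like in practice. It uses the loose definition given above, reads only one strand in one frame at a time, and does not require a start codon, so it should not be taken as the procedure Carvunis et al. actually used. The function name `stop_free_runs` and the example sequence are invented.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def stop_free_runs(dna: str, frame: int = 0) -> list:
    """Split one reading frame into stretches uninterrupted by stop codons."""
    dna = dna.upper()
    runs, current = [], []
    for i in range(frame, len(dna) - 2, 3):
        codon = dna[i:i + 3]
        if codon in STOP_CODONS:
            # A stop closes the current stretch (if there is one).
            if current:
                runs.append("".join(current))
            current = []
        else:
            current.append(codon)
    if current:
        runs.append("".join(current))
    return runs


# Toy sequence, read in frame 0: ATG AAA TAG CCC GGG TGA
toy = "ATGAAATAGCCCGGGTGA"
for run in stop_free_runs(toy):
    print(run, len(run) // 3, "codons")
```

Run over a whole genome in all frames, something like this would mostly return the short handful-of-codons ORFs, with the occasional long one.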
To test this, they first classified all the hundreds of thousands of ORFs in the yeast genome according to their evolutionary age. The ones that were conserved in all of the yeast species they used for comparison were given a score of 10, and ORFs that only brewer’s yeast had were called zeroes. (Most known genes belong to classes 5-10, meaning they evolved quite far back on the yeast family tree.) The next step was to pick the Class Zero ORFs that were actually transcribed and translated, and so might be in the pool of potential “proto-genes”. They found this set of “0+” ORFs by analysing RNA sequencing data from both happy yeast cells and yeast deprived of food, just to make sure they caught any sequences that only acted like genes under some circumstances. In addition, they checked which of those RNAs were associated with ribosomes, the sites of translation. These filtering steps left over a thousand little ORFs that don’t belong to known genes, are completely unique to Saccharomyces cerevisiae, are expressed, and are probably translated.
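Stripped of the biology, the filtering described above is a chain of yes/no questions asked of every ORF. The toy sketch below shows only that logic. The `Orf` record, its field names, the `candidate_proto_gene` function and the three example entries are all invented for illustration; the real study worked from sequencing data, not tidy booleans.

```python
from dataclasses import dataclass


@dataclass
class Orf:
    name: str
    conservation: int          # 0 = only in S. cerevisiae, 10 = in all compared species
    annotated_gene: bool       # already part of a known gene?
    transcribed: bool          # RNA seen in fed or starved cells
    ribosome_associated: bool  # RNA found on ribosomes


def candidate_proto_gene(orf: Orf) -> bool:
    """Toy version of the filter: Class Zero, not a known gene, expressed, on ribosomes."""
    return (
        orf.conservation == 0
        and not orf.annotated_gene
        and orf.transcribed
        and orf.ribosome_associated
    )


# Made-up records, one per outcome worth seeing.
orfs = [
    Orf("orf_a", conservation=0, annotated_gene=False, transcribed=False, ribosome_associated=False),
    Orf("orf_b", conservation=0, annotated_gene=False, transcribed=True, ribosome_associated=True),
    Orf("orf_c", conservation=8, annotated_gene=True, transcribed=True, ribosome_associated=True),
]

print([orf.name for orf in orfs if candidate_proto_gene(orf)])
```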
Going up the conservation scale, ORFs become increasingly gene-like. The older ones are longer, their RNA copies are more abundant, and more of them appear constrained by natural selection. (Interestingly, when you translate them, the more gene-like ORFs produce less ordered protein structures. Not sure what to make of that.) Proper genes are also better suited to get ribosomes to translate them. Conservation classes 1-4, those ORFs that are shared only by closely related Saccharomyces species, are intermediate in all of these properties (and some more) between the zeroes and the older ORFs.
There is one more thing about this study that definitely bears mentioning. When you count how many new gene duplicates this yeast species has versus how many new, potentially functional, random ORFs, the latter come out on top by far. Between them, S. cerevisiae and its closest sister species apparently have somewhere between one and five newly duplicated genes. The same duo also came up with nineteen new ORFs that are under selection and therefore probably functional. Potentially, these random little sequences people might have dismissed as background noise not long ago are more potent sources of new genes than the celebrated gene duplication.
I don’t know about you, but that absolutely fascinates me.
P.S.: Incidentally, this is all about protein-coding genes. However, thousands of genes in your own genome do NOT encode proteins. They include genes for the good old RNA components of the translation machinery, ribosomal and transfer RNA, but there are also other RNA genes with transcripts involved in everything from keeping parasitic DNA in check to editing the messenger RNAs of other genes. I kind of want to find out how these RNAs form and acquire functions. Also, when we are quite happy to call a piece of DNA that doesn’t have a protein product a “gene”, and cells are swarming with RNA that doesn’t come from things traditionally called “genes”, and some of this RNA actually does encode proteins, what does that do to the definition of a “gene”??
*Gotta love the mnemonics on that page. I didn’t think three three-letter combinations would be that hard to remember, but I have to admit I chuckled at “U Are Gone”.
Carvunis A-R et al. (2012) Proto-genes and de novo gene birth. Nature advance online publication, doi: 10.1038/nature11184
Finnigan GC et al. (2012) Evolution of increased complexity in a molecular machine. Nature 481:360-364
Toll-Riera M et al. (2012) Role of low-complexity sequences in the formation of novel protein-coding sequences. Molecular Biology and Evolution 29:883-886