Following on from the yeast “proto-gene” study, I’ve started paying more attention to news about gene birth from non-coding DNA. (Or “junk” DNA, if you will, though “junk” is… something of a misnomer :-P) The yeast paper explored protein-coding genes in the process of birth. This new one I found in PLoS Genetics looks at genes that have already been born, and argues that the sequences they came from were functional long before they began to serve as templates for proteins.
Paternity testing for genes
So how do you know that a protein-coding gene came from non-coding DNA? Xie et al. (2012) looked specifically for genes born along the ape lineage, that is, the group that includes gibbons, orangs, chimps, gorillas and ourselves. They searched for human genes that had no protein-coding homologue in non-apes (rhesus macaques, mice, dogs and a handful of other mammals), but whose sequence still matched stretches of those animals’ DNA. It was also important that there were no other similar sequences in the human genome itself, since those might have indicated that the “new” gene actually originated by duplication.
To ascertain that the selected genes represented gene birth in the ape lineage as opposed to a dying gene in other mammals, they also looked at the sequence changes that garbled the would-be protein product of the non-ape sequences. If these are the same in all non-ape versions of a gene, that probably means that they were inherited from the common ancestor of apes and all these species, that is, the non-coding version of the gene came first. Only genes that really seemed to have been born in our lineage were kept.
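If it helps to see the filtering logic laid out, here is a minimal sketch of the checks described above. To be clear, this is my own toy restatement, not the authors' pipeline: the `Candidate` fields, the mutation labels like "stop@45" and the rule "at least one disabling change shared by every non-ape" are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Candidate:
    """Toy record for one human gene (all field names are hypothetical)."""
    name: str
    coding_in_outgroups: bool   # some non-ape has an intact coding homologue
    outgroup_dna_match: bool    # similar sequence exists in non-ape DNA
    human_paralogue: bool       # similar sequence elsewhere in the human genome
    # Disabling changes seen in each non-ape species, e.g. {"mouse": ["stop@45"]}
    disablers: dict = field(default_factory=dict)


def shared_disablers(candidate):
    """Disabling changes present in every non-ape species we have data for."""
    per_species = [set(changes) for changes in candidate.disablers.values()]
    return set.intersection(*per_species) if per_species else set()


def looks_de_novo(candidate):
    """Apply the four checks from the text, in simplified form."""
    return (
        not candidate.coding_in_outgroups        # no coding homologue in non-apes
        and candidate.outgroup_dna_match         # but the DNA itself is there
        and not candidate.human_paralogue        # and it is not a duplicate
        and len(shared_disablers(candidate)) > 0 # same "garbling" in all non-apes
    )


if __name__ == "__main__":
    example = Candidate(
        "geneA", False, True, False,
        {"macaque": ["stop@45", "fs@80"], "mouse": ["stop@45"]},
    )
    print(example.name, looks_de_novo(example))
```

The last check is the interesting one: a disabling change shared by all the non-apes is most simply explained by inheritance from a common ancestor, whereas species-specific damage would look more like a gene dying independently in each lineage.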
In the end, they came up with 24 genes that passed muster. Some of these coded for proteins in both chimps and humans, others only in humans. Based on RNA-sequencing data from rhesus macaques, 20 non-coding versions of these genes were active in monkeys – and this is where things get interesting.
Function in the junkyard
The big question the team asked was this: are these non-coding RNAs just random noise in transcription, or are they already functional even without a protein product? They decided to answer this question by looking at the structure and expression patterns of the RNAs in macaques, chimps and humans.
By “structure”, they mean how the RNA is edited after it’s transcribed from the gene sitting in the DNA. The RNA from most genes in animals and other eukaryotes isn’t taken straight through protein synthesis. First, pieces called introns are chopped out and the rest (called exons) are spliced together to yield the final template for the protein.* When the researchers looked at RNA sequencing results from the three primates, they found that the non-coding sequences in macaques were cut and joined at the same points that the protein-coding human sequences were: a bunch of RNA sequences from macaques spanned both sides of a human splice site, and contained none of the intron in between. Such conservation, the authors argue, is indicative of functionality.
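For the computationally inclined, the "reads spanning a splice site" idea can be sketched in a few lines. This is only a toy version under my own assumptions: each read is a list of aligned blocks given as (start, end) pairs in half-open reference coordinates, and the function names and the coordinates in the example are made up. Real analyses work from aligner output rather than hand-written tuples.

```python
def supports_junction(blocks, intron_start, intron_end):
    """True if two consecutive aligned blocks flank exactly this intron.

    blocks: list of (start, end) tuples, sorted, half-open coordinates.
    The intron is assumed to occupy [intron_start, intron_end).
    """
    for (_, left_end), (right_start, _) in zip(blocks, blocks[1:]):
        if left_end == intron_start and right_start == intron_end:
            return True
    return False


def count_support(reads, intron_start, intron_end):
    """Number of reads that skip the intron at exactly the given boundaries."""
    return sum(
        supports_junction(blocks, intron_start, intron_end) for blocks in reads
    )


if __name__ == "__main__":
    # Hypothetical macaque reads mapped around a human splice site at 150/300.
    macaque_reads = [
        [(100, 150), (300, 340)],  # spliced at the human junction
        [(90, 150), (300, 320)],   # spliced at the human junction
        [(120, 170)],              # unspliced, runs into the intron
        [(100, 140), (300, 340)],  # spliced, but at a different donor site
    ]
    print(count_support(macaque_reads, 150, 300))
```

Only reads whose two halves meet at exactly the human boundaries count as support, which is the sense in which the macaque RNA is "cut and joined at the same points".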
Expression patterns also suggested that the macaque sequences weren’t just noise: when they compared the abundance of the different RNAs between different tissues in the three primates, Xie et al. found that the non-coding RNAs weren’t just expressed all over the place – they were significantly more abundant in some parts of the body than others. This pattern was consistent across species: a sequence that was most abundant in the macaque’s brain was also likely to be brain-specific in humans, even though the macaque version didn’t code for a protein and the human gene did. (By the way, a lot of the 24 were most active in the brain, which is apparently something of a trend among new human genes regardless of their mode of origin. Guess our brains evolved rather a lot in the last few million years ;))
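To make "more abundant in some parts of the body than others" a bit more concrete, here is a small sketch using the tau index, a commonly used tissue-specificity score that runs from 0 (evenly expressed everywhere) to 1 (expressed in a single tissue). I'm not claiming this is the statistic Xie et al. used; it is just a familiar stand-in, and the expression values in the example are invented.

```python
def tau(expression):
    """Tau tissue-specificity index for one transcript.

    expression: dict mapping tissue name -> non-negative abundance.
    Returns 0.0 for uniform expression and 1.0 for a single-tissue transcript.
    """
    values = list(expression.values())
    top = max(values)
    if top == 0 or len(values) < 2:
        return 0.0  # undefined cases treated as "not specific" in this sketch
    return sum(1 - v / top for v in values) / (len(values) - 1)


def top_tissue(expression):
    """Tissue with the highest abundance."""
    return max(expression, key=expression.get)


if __name__ == "__main__":
    # Invented numbers, purely for illustration.
    macaque_noncoding = {"brain": 12.0, "liver": 1.0, "testis": 2.0}
    human_coding = {"brain": 30.0, "liver": 2.0, "testis": 4.0}
    print(tau(macaque_noncoding), top_tissue(macaque_noncoding))
    print(tau(human_coding), top_tissue(human_coding))
    # "Conserved pattern" in the loosest sense: same top tissue in both species.
    print(top_tissue(macaque_noncoding) == top_tissue(human_coding))
```

Comparing only the top tissue is about the crudest possible notion of a conserved expression pattern, but it matches the brain example in the paragraph above.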
This is very cool, but I’m kind of worried about the arguments used for functionality. Maybe this is just me being a newbie in this area, but I’m not sure that useless non-coding RNAs should be expressed all over the place. One of the salient features of the 24 genes in this study is that they are near other genes, sometimes even overlapping them. In any case, they are close enough to use the switches that normally regulate the activity of the other gene(s). That would mean that they’d be most highly expressed wherever their neighbours are, which doesn’t depend on them having a function. If some of them happened to acquire protein-coding potential by mutation, presumably it’d only be kept by natural selection if the resulting proteins did something useful in those places, hence the conservation of expression patterns.
Likewise, splice sites may well arise by accident (they aren’t all that complex), and they don’t have to disappear just because a mutation somewhere else makes the sequence suitable for protein synthesis. Though in fact, splice site conservation sounds more convincing than expression conservation to me as far as arguments for function go. Because splice sites can come and go quite easily, there’s no reason they should be particularly conserved between any two sequences unless they’re important. And the splice sites can only be important if the sequence they’re in does something. Who cares where you cut a strand of random RNA that’ll only end up eaten by housekeeping enzymes anyway?
So, while I get all excited about the whole new genes side of things, and I love this sort of genomic detective work, I think I still have to sleep on that point about function coming before protein. It’s a pity they didn’t check if any of the “non-coding” RNAs in macaques (and chimps) were occasionally translated into a protein, albeit a smaller one than the human counterpart. The yeast people did that and it was awesome, and it would have been such an informative thing to do in this case.
(Also, it would have been darn cool if they’d tried knocking out some of them and seeing if they got screwed-up monkeys, but let’s be realistic. Macaques don’t breed like mice, they take a hell of a long time to grow up, and we’re kinda reluctant to mistreat our close relatives like that. You’d have to be comic book supervillain insane to embark on that experiment.)
*This may sound like a weird way to run a genome, but it’s actually quite good for making more than one product from the same gene. It’s pretty important in real life – nearly all human genes with multiple exons are spliced in at least two different ways, and many genetic diseases originate from messed-up splicing.
Xie C et al. (2012) Hominoid-specific de novo protein-coding genes originating from long non-coding RNAs. PLoS Genetics 8:e1002942