“Same” function, but the devil is in the details.

Aaaaaand todaaaaay, ladies and, um, other kinds of people…. Hox genes!

Considering that I did my Honours project on them and I think they are made of awesome, I’m kind of shocked by the general lack of them here*. Hmmmmmm. Well, having just found Sambrani et al. (2013), I think today is a good time to do something about that.

Hox genes in general are “what goes where” type regulators of development. In bilaterian animals, they tend to work along the head to tail axis of the embryo. (Cnidarians like sea anemones also have them, but the situation re: main body axis and Hox genes in cnidarians is a leeeetle less clear. And heaven knows what sort of weird things happened with the rest of the animals.)

Hox genes are responsible for one of the peculiarities of the insect body plan. Unlike many other arthropods, insects have leg-free abdomens. On the left below is a poor little lobster with legs or related appendages all the way down (plus a bonus clutch of eggs). (Arnstein Rønning, Wikimedia Commons). To her right is a bland, boring insect abdomen (Hans Hillewaert, Wikimedia Commons).

As I said, Hox genes are responsible for the difference. Three of them are expressed in various segments of the abdomen of a developing insect: Ultrabithorax (Ubx), Abdominal-A and Abdominal-B. I’m going to whip out that amazing fluorescent image of Hox gene expression in a fruit fly embryo from Lemons and McGinnis (2006) because aside from being cool as hell, it also happens to be a good illustration:

(The embryo is folded back on itself, so the Abd-B-expressing tail end is right next to the Hox gene-free head)

In insects, all three can turn off the expression of the leg “master” gene distal-less (dll). However, they turn out to do so through two different mechanisms. Ubx and Abd-A proteins have long been known to team up with the distantly related Extradenticle (Exd) and Homothorax (Hth). With their partners, the Hoxes can sit on a regulatory region belonging to the dll gene and prevent its activation.

Sambrani et al. were curious whether Abd-B works in the same way. Sure enough, Abd-B also represses dll wherever it shows up. However, when it comes to interacting with Exd and Hth, differences start to emerge. For starters, those two aren’t even present in the rear end of the abdomen, where Abd-B does its business. When the researchers took the regulatory region of dll and threw various combinations of proteins at it, they found that (1) Abd-B is perfectly capable of binding the DNA on its own, (2) Exd, Hth or engrailed (another Hox cofactor) didn’t improve this ability at all, and (3) Hth alone or in combination with the others actually inhibited the binding of Abd-B to the dll regulatory sequence.

Interestingly, dll repression in the anterior and posterior abdominal segments requires the exact same bits of regulatory DNA even though different proteins are involved. It looks like in the posterior segments, Abd-B actually takes over an “Exd” binding site – maybe that’s how it can do the job without getting Exd itself involved.

Furthermore, while the DNA-binding ability of Abd-B is crucial to its ability to kill dll expression, the same is not the case for Ubx. The authors speculate that cooperation with Exd and Hth kind of exempts Ubx from having to bind the regulatory sequences itself, while Abd-B, being on its own, can’t afford to slack off like that. The paper illustrates the idea with such a deliciously ugly pair of drawings that I feel compelled to post it:

(I know they’re going for colour-matching with the fluorescent images, but unfortunately glowy greens and reds that look good on a black background kind of just hurt my eyes on white.)

I don’t really have a point to make here. (There doesn’t always have to be a point, right?) There’s absolutely nothing surprising about the fact that different Hox genes evolved the same overall function in different ways –  after all, they existed as separate entities long before insects lost their buttward legs. I just think Hox genes are cool, and this was an interesting look into the nuts and bolts of how they work. And that’s that.



*Well, aside from this one I’ve written three posts about them and a couple more where they are mentioned. That’s maybe not that bad considering how many different things I’m interested in.



Lemons D and McGinnis W (2006) Genomic evolution of Hox gene clusters. Science 313:1918-1922

Sambrani N et al. (2013) Distinct molecular strategies for Hox-mediated limb suppression in Drosophila: From cooperativity to dispensability/antagonism in TALE partnership. PLoS Genetics 9:e1003307.

Animals, amoebae and assumptions

Animals aren’t the only multicellular creatures in their phylogenetic neighbourhood. Social amoebae, many fungi and quite a few of the poorly known choanoflagellates spend at least part of their lives as collections of cooperating cells. Conventional wisdom has been that these groups invented multicellularity independently, but maybe conventional wisdom needs a bit of challenging.

To tell you the truth, I never really thought about the other possibility, that being multicellular is the original state of affairs for these organisms. I never really considered the evidence on which the conventional wisdom was based. You could say I didn’t really care either way. A while back I saw a paper that said something about a social amoeba having an epithelium, but I just kind of shrugged and went on with my life. Now an article in BioEssays has brought this up again, and I’m not sure I was right to ignore it back then. I think Dickinson et al. (2012) have a point, and I think some assumptions may need to be reexamined.

In case you wondered, an epithelium is a type of tissue made of a layer or layers of polarised cells. “Polarised” means that various cellular components – proteins, attachments to neighbouring cells, organelles – are distributed unevenly in the cell, clustered towards one or the other side of the cell layer. Epithelia line pretty much everything in a typical animal’s body, from, well, the entire body, to things like guts and glands. They secrete important stuff like hormones, and their closely packed cells form a barrier to keep molecules and pathogens where they belong. An epithelium was thought to be a uniquely animal thing to have, but looking more closely at that weird little amoeba suggested it may not be.

The paper that I ignored was Dickinson et al. (2011) – yes, by the exact same people who wrote the BioEssays piece. OK, I didn’t completely ignore it. I read enough of it to scribble a quick note in my citation manager saying “screams convergent evolution to me”. The paper examined the multicellular stage in the life of Dictyostelium discoideum, an ordinarily single-celled amoeba that reacts to food shortages by crowding together with friends and family to form a fruiting body that helps disperse some of its cells in search of new habitats. The fruiting body is pretty complex for a “unicellular” creature, and it turns out that this complexity includes a region of tissue that looks quite a lot like a simple epithelium. It doesn’t just look like one; it sorts out its insides and outsides with the help of proteins called catenins, which are also involved in cell polarity in the epithelia of animals. (Below: D. discoideum being multicellular, from Wikipedia)

That isn’t much evidence to base an inference of homology on, especially since other key players in animal cell polarity are entirely absent from D. discoideum. But equally, the fact that tons of unikonts (the group including amoebae, slime moulds, fungi, choanoflagellates and animals) are single-celled doesn’t mean that the multicellular groups all came up with the idea independently. Evolution doesn’t always increase complexity – sometimes complexity becomes superfluous.

I remember when we discussed the choanoflagellate genome paper (King et al., 2008) in class. The genome in question belongs to a purportedly single-celled creature, but it contains tons of genes you’d think only multicellular organisms would need, such as genes for cell-to-cell adhesion proteins. So one explanation is that these proteins originally did something else, like anchoring a single cell to its favourite spot. Another explanation is that they did have something to do with multicellularity – it just wasn’t the multicellularity of animals at first.

This suggestion isn’t terribly controversial when you’re talking about choanoflagellates, since some of them do obviously form colonies (one such colony of Salpingoeca/Proterospongia rosetta is shown below, from Mark Dayel of the King lab via ChoanoWiki). It’s not hard to imagine that either the “single-celled” species whose genome was sequenced also has a colonial stage the scientists just never saw, or that its recent ancestors did.

Whether or not the same applies to the whole of unikonts is a more difficult question. I’m not at all familiar with the details of unikont relationships, but based on the tree shown in the BioEssays article, multicellularity is all over the group. In most cases, it’s facultative multicellularity; animals are rather the exception in being doomed to it for their entire lives. However, if you just looked at that tree, you’d wonder why the hell anyone thought the common ancestor of these things wasn’t some kind of multicellular.

Yet the details of animal-like multicellularity aren’t so widespread. True cadherins (the cell adhesion proteins I mentioned) have only been found in animals proper. Choanoflagellates and some even more obscure relatives of animals have bits and pieces of them, and other unikonts have none at all as far as anyone knows. Epithelium-like tissues have only been described in that one species of amoeba – but, as Dickinson and colleagues note, no one really looked in the others.

Personally, I wouldn’t be at all surprised if the conventional wisdom ended up shifting. I still don’t think that the evidence from Dictyostelium is enough to draw a conclusion. We obviously need to know a lot more about unikont genomes, tissues and life cycles to piece together the history of multicellularity in the group, but I’m not sure that right now a unicellular ancestor has a lot more going in its favour than a multicellular one. Guess we’ll have to wait and look with an open mind 🙂



Dickinson DJ et al. (2011) A polarized epithelium organized by β- and α-catenin predates cadherin and metazoan origins. Science 331:1336-1339

Dickinson DJ et al. (2012) An epithelial tissue in Dictyostelium challenges the traditional origin of metazoan multicellularity. BioEssays advance online publication, 29/08/2012, doi:10.1002/bies.201100187

King N et al. (2008) The genome of the choanoflagellate Monosiga brevicollis and the origin of metazoans. Nature 451:783-788


Hexapeptide or bust!

I have a big soft spot for Hox genes, or rather, Hox proteins. Thanks to some of my earlier work, I also have a soft spot for all their secret little sequence motifs that help them interact with other proteins and help us classify them (e. g. Balavoine et al., 2002). Probably the best-known such motif is the hexapeptide. (“Hexa-” would kind of imply that it’s made of six amino acids, but people only ever seem to talk about four. Don’t ask me why they call it a hexapeptide…)

This motif is very widespread, occurring not just in Hox proteins but also in many others in the larger class of homeodomain proteins that Hoxes belong to. For many years, the hexapeptide has been regarded as the key to the interaction of Hox proteins with another homeodomain-bearing protein, called Extradenticle in flies and Pbx plus a number (we have 3 of them) in vertebrates*. (Above is a cartoon version of DNA with the homeodomains – the purple curls – of Exd and the fly Hox protein Ubx bound to it, from the Protein Data Bank.) Hox proteins bind DNA to regulate various genes, and are absolutely vital for an embryo to develop the right organs in the right places. Their interaction with Exd/Pbx changes their DNA binding behaviour, making the hexapeptide possibly the most important four amino acids in animal development.

And now we’re supposed to scrap that?

I’ve just skimmed through Hudry et al. (2012), and died a little inside.

The study claims – on what seems to be good evidence – that the hexapeptide is not all it’s cracked up to be. Out of six fruit fly Hox proteins examined, only two stop interacting with Exd when the hexapeptide is mutated beyond recognition, and even then one of them is kind of half-hearted about it. The team also tested a few mouse Hox genes in cultured cells and – for whatever reason – chick embryos, and got largely the same results.

I rather like their approach, though. I think the method for detecting interaction is incredibly clever, though it’s clearly not something they invented. The idea is based on fluorescent proteins. These are very commonly used to track the levels and whereabouts of other proteins. Since they are pretty small and innocuous, the gene encoding them can be tacked onto the gene of interest, and the resulting protein chimaera will do whatever the target protein would do without its fluorescent companion. The only difference is now it glows wherever it goes. The more protein, the brighter the glow.

The nice thing about fluorescent proteins is that you can cut them in half, and if the two parts get close enough, they’ll still glow. Therefore, if you glue one half of the DNA for a fluorescent protein to gene 1, the other half to gene 2, and let them loose in the same cell, you can tell whether the protein products of gene 1 and gene 2 interact just by looking for the telltale fluorescence. And the tale it’s telling is that all these Hox proteins are getting snug with Exd despite the loss of the motif supposedly necessary for the interaction.

I’ll just go away and quietly get over that now.


*Vertebrate geneticists have no imagination. Okay, they did come up with lunatic fringe and Sonic hedgehog. After the fly people named fringe and hedgehog.



Balavoine G et al. (2002) Hox clusters and bilaterian phylogeny. Molecular Phylogenetics and Evolution 24:366-373

Hudry B et al. (2012) Hox proteins display a common and ancestral ability to diversify their interaction mode with the PBC class cofactors. PLoS Biology 10:e1001351

Have a goddamned null hypothesis!

I’m in the middle of Joubert et al. (2010), and I’m shaking my head a bit.

The paper is a pretty standard specimen of the kind of next-generation sequencing study that’s been proliferating in recent years. It looked for genes that make the shell of an oyster by sequencing all the RNA expressed in the tissues that build the shell.

This sort of study generates an awful lot of data, which makes it unfeasible to analyse it just by old-fashioned human brainpower and maybe a little statistics. Anyone happy to manually trawl through nearly 77 thousand sequences to find the interesting ones is a lot crazier than anyone I know… and anyone willing to study each of them to figure out what they do is probably a god with all of eternity on their hands.

One of the tools researchers use to quickly and automatically find interesting patterns in such large datasets is the Gene Ontology (GO) database. GO uses a standard vocabulary to tag protein sequences with various attributes such as where they are found within the cell, what molecules they might bind to, and what biological processes they might participate in. Such tags may be derived from experiments, or often from features of the sequence itself. For example, proteins often contain specific sequence motifs that tell the cell where to send them. Assuming that related proteins have similar functions in different organisms, GO annotations derived from one creature can be used to make sense of data produced in another.

So, Joubert et al. took their 77 thousand sequences, translated them into protein, and ran them through a program that finds similar sequences with existing annotations from GO or similar databases. They got some numbers. X per cent of the proteins are predicted to have some metabolic function, Y per cent of them bind to something, and so on. Then they went on to pull hypotheses out of these numbers. And that’s where they really should have remembered Scientific Method 101.

The part where I started looking askance at the paper is where they get excited about the percentage of predicted ion-binding proteins in the dataset. Of course, oyster shells are largely made of ions (calcium and carbonate ions, to be precise). But, importantly, “ions” are an awfully broad category, and they are essential for many of the everyday workings of any cell. Even calcium ions, which make up one half of the mineral component of the shell, play many other roles. Muscle contraction, cell adhesion, conduction of nerve impulses – all involve calcium ions in one way or another, and lots and lots of proteins either regulate or are regulated by calcium signals via binding the ions. So the question arose in my head: is it really unusual for 17% of “binding” sequences in a sample to bind ions?

This is not the first time I’ve seen people who publish high-throughput sequencing data fail to ask that sort of question. They just report the pattern they saw, and try to interpret it. But patterns can only be interpreted in context. To tell whether a pattern is unusual, you must first know what the usual pattern looks like!

In fact, finding this many ion binding proteins doesn’t seem extraordinary at all. SwissProt, the awesome hand-curated protein sequence database, has a really handy browsing tool where you can get a breakdown of a selected set of sequences by GO categories. Curious, I had a quick look at all the 20 244 human proteins in SwissProt. (I picked humans because I’d probably be hard-pressed to find organisms with more GO-annotated proteins.) Out of the 11 674 that GO classifies under “binding”, 3934 are thought to bind ions. That’s almost 34%. This almost certainly has nothing to do with us having bones made of ions – SwissProt includes proteins from all tissues.
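For anyone who wants to redo that back-of-the-envelope check, here is a minimal sketch in Python. It only uses the counts quoted in this post; the variable names are mine, and it is a plain comparison of two fractions, not a proper statistical test.

```python
# Counts quoted in the post for human proteins in SwissProt.
human_binding = 11_674      # proteins GO files under "binding"
human_ion_binding = 3_934   # the subset annotated as ion-binding

# Share of "binding" sequences predicted to bind ions in the oyster dataset,
# as quoted in the post.
oyster_ion_fraction = 0.17

# The "usual pattern" we compare against: the human background.
human_ion_fraction = human_ion_binding / human_binding

print(f"human background: {human_ion_fraction:.1%} of binders bind ions")
print(f"oyster dataset:   {oyster_ion_fraction:.1%} of binders bind ions")
print("oyster above background?", oyster_ion_fraction > human_ion_fraction)
```

A real analysis would pick a more appropriate background than "all human proteins" and run an actual enrichment test, but even this crude version is enough to show that 17% is nothing to write home about.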

Even when you compare ion-binding sequences with ones that are predicted to bind something else within the same dataset, the “background” wins: while the human collection I looked at has about 1.6 times as many protein-binding sequences as ion-binders, in the oyster dataset the protein-binders outnumber the ion-binders two to one. Caveats apply as usual – datasets are incomplete, GO annotations are probably both incomplete AND a bit suspect, etc.; but I can’t see how that justifies the implication that the pattern this study found in the oyster is somehow special and interesting from a shell-building perspective. The whole thing smells like looking for faces in the clouds to me. Science is not simply about pattern-finding – it’s about finding meaningful patterns. I hope that this crucial distinction doesn’t get completely washed away by the current flood of data, data, data.



Joubert C et al. (2010) Transcriptome and proteome analysis of Pinctada margaritifera calcifying mantle and shell: focus on biomineralization. BMC Genomics 11:613

This is when *I* need a good science blogger

Today you get to meet yet another of my random interests: the origin of life. (Is there a person with an interest in living things who isn’t fascinated by the origin of life?) And, since we sciencey types are very anxious about personal biases, I might as well start with a confession.

I love the RNA world hypothesis.

It was just one of those things that you learn at school/uni (I think it was 1st year molecular biology for me), and it’s so neat and elegant and compelling that you immediately fall in love. Sure, later, when you’re out of the inevitable simplicity of class, you learn about the nuances. The difficulties. But the evidence for it still seems so convincing that you have no doubt that we’ll eventually solve the problems.

In case you aren’t familiar with it, the RNA world hypothesis is the leading solution to the chicken and egg problem that is the “central dogma” of molecular biology (diagram from Wikipedia):

DNA is great genetic hardware, but it’s nothing without proteins. Proteins are encoded in DNA, but the code is useless without proteins to read it. Making DNA requires proteins. But the proteins come from the DNA code. You see where this is going…

RNA takes the stage

The RNA world is an ingenious idea that elevates RNA from being merely the messenger between DNA and protein to centre stage. While its big brother DNA is a fairly stable and inert molecule, RNA is much more chemically active. It doesn’t like languishing in long, stable double helices – rather, it folds up into all kinds of odd shapes that can, surprisingly, catalyse a variety of chemical reactions. Just like proteins. Yet the “letters” of RNA can form complementary pairs, allowing for faithful copying. Just like DNA.

And, so the theory goes, there was a time when RNA was both the genome and the enzymes (enzymes made of RNA are called ribozymes). The right sort of RNA molecule could have copied itself without proteins [1], and performed whatever chemistry a primitive life form needed – also without proteins. Crucially, the right sort of RNA molecule could have invented proteins [2].

One of the key revelations to lend support to the RNA world hypothesis is that proteins in cells today are still made by RNA. Proteins are manufactured in ribosomes. A modern ribosome is a very complicated structure made of several folded-up RNA molecules and dozens of proteins. However, investigations of its structure (see Cech [2000] for a quick review) revealed that the place where amino acids are joined into a protein chain is all RNA – the proteins may support the RNA, but it seems to be the RNA that actually does the job.

Beautiful hypothesis vs. ugly facts?

So, everything is shiny and awesome and exciting. Ribozymes capable of all sorts of interesting chemistry [3] abound, and we have some very neat ideas regarding how RNA paved the way towards the modern protein-and-DNA world [2].

And then Harish and Caetano-Anollés (2012) come along, and I don’t know what to think.

A large part of the problem is that their methods go way over my head. I get the gist of their message. They figured out the relative ages of the RNA and protein components of the ribosome. The protein-synthesis parts – RNA and protein alike – turned out relatively new. They also found that the oldest protein parts interact with the oldest RNA parts – and seem to have coevolved. That, they say, would suggest that RNA and fairly large pieces of protein had a common history together before the future ribosome became capable of making proteins.

Yes, that means either that RNA didn’t invent proteins, or at the very least, that the “inventor” was not a precursor of the ribosome.

I really really don’t want to believe the former, and the latter possibility is a butchery of Occam’s razor without further evidence. But what else is left, if the study is correct?

One part of their results that I found intriguing is the structural similarity of the most ancient parts of ribosomal RNA to – you’d never guess – lab-evolved RNA-copying ribozymes. That is… oh, I don’t really know what it is, aside from “fascinating”. Did the ribosome start out as replication machinery, and turn into a protein factory only later? Or are the structures similar because reading the primitive genetic code required the same sort of molecular machine as copying RNA? Or is it even just coincidence?

And this is why I need a good science blogger. I need someone who deeply understands the paper and can translate it into something I can digest. Because at the moment, I can’t make heads or tails of this. I’m rather attached to the RNA world; it makes sense to me, and as far as scientific hypotheses go, it’s simply beautiful. Yet I can’t point to any obviously bullshit reasoning in the new study, other than where they seem to imply that because modern ribosomes need proteins to work, proteins must have been present in the ribosome from the start. (Which is a bit like every damn irreducible complexity argument advanced by creationists.) I just don’t have a good enough grasp on the methodology to tell whether it’s all solid or whether any of it is dodgy. Words fail to express how much that bugs me.



[1] Lincoln and Joyce (2009) and Wochner et al. (2011) came tantalisingly close to making/evolving the right sort of RNA molecule in the lab. The former’s pair of ribozymes can only copy each other by stitching together two half-ribozymes, but they can keep going at it forever and ever. Wochner et al.’s molecule can copy RNA using single letters as ingredients, but it runs out of steam after 90 or so of them. That’s several times better than the previous record, but still not long enough for the ribozyme to replicate its twice-as-long self.

[2] This excellent video describes one way it could have happened. When it comes to science education, cdk007 never fails to deliver!

[3] Including attaching amino acids to other RNA molecules (Turk et al., 2010) – look up tRNA if you don’t see why this is exciting 😉



Cech TR (2000) The ribosome is a ribozyme. Science 289:878-879

Harish A & Caetano-Anollés G (2012) Ribosomal history reveals origins of modern protein synthesis. PLoS ONE 7:e32776

Lincoln TA & Joyce GF (2009) Self-sustained replication of an RNA enzyme. Science 323:1229-1232

Turk RM et al. (2010) Multiple translational products from a five-nucleotide ribozyme. PNAS 107:4585-4589

Wochner A et al. (2011) Ribozyme-catalyzed transcription of an active ribozyme. Science 332:209-212

The Mammal is confused

I’m talking mostly to myself in this one, but do listen if you want to know the random questions crossing a trainee evolutionary scientist’s mind 🙂

So, when you work with DNA or protein sequences, you learn that you should be careful with repetitive sequences. For example, here is a protein that consists mostly of the same amino acid (D = aspartic acid, in case you wondered) over and over again:

>gi|37991666|dbj|BAD00044.1| shell matrix protein [Pinctada fucata]

It’s a protein found in oyster shells, but it could be anything; the problem remains the same. You learn that when you compare/align sequences, you shouldn’t base an alignment, or conclude that two sequences are homologous, on bits like the above. (Which didn’t prevent some people from saying that this very protein “shows homology to” something else.)

I kind of never questioned that, but really, why?

We humans are intuitively bad at probability, but when you think about it, a series of sixty Ds is just as likely as a pitch-perfect homeodomain. It’s like coin tossing. When every throw has a 1/2 chance of landing heads *or* tails, then any particular sequence of heads and tails is equally likely. HHHH and HTHT are exactly equally probable.
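To convince myself, here is the coin-toss argument as a tiny Python sketch. The function name and the "every letter is equally likely" models are my own simplifying assumptions, and the second peptide is just an arbitrary string of amino acid letters, not a real motif.

```python
from fractions import Fraction

def sequence_probability(seq, letter_probs):
    """Probability of one specific sequence if every position is drawn
    independently from the given letter probabilities."""
    p = Fraction(1)
    for letter in seq:
        p *= letter_probs[letter]
    return p

# A fair coin: heads and tails are equally likely.
fair_coin = {"H": Fraction(1, 2), "T": Fraction(1, 2)}
p_hhhh = sequence_probability("HHHH", fair_coin)
p_htht = sequence_probability("HTHT", fair_coin)
print(p_hhhh, p_htht, p_hhhh == p_htht)

# Same idea with 20 equally likely amino acids (an assumption, see below).
amino_acids = "ACDEFGHIKLMNPQRSTVWY"
uniform = {aa: Fraction(1, 20) for aa in amino_acids}
p_repeat = sequence_probability("DDDDDD", uniform)
p_mixed = sequence_probability("RKGQTY", uniform)   # arbitrary example string
print(p_repeat == p_mixed)
```

Under this model a repeat is exactly as improbable as any other specific string of the same length, which is the intuition I started from.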

Nucleotides or amino acids are slightly different, because not all letters are equally likely, but the rarer ones can also form repeats, and I don’t think I’ve ever heard anyone make an exception for them.

My best guess is that repeats are easy to make by some sort of DNA replication slippage, so they come and go in evolution as they please. Does that make sense?

I think I have to go and ask someone…

ETA: Hmm, it appears to be a computational issue, based on the BLAST guides over at NCBI. Sayeth the Manual:

There is one frequent case where the random models and therefore the statistics discussed here break down. As many as one fourth of all residues in protein sequences occur within regions with highly biased amino acid composition. Alignments of two regions with similarly biased composition may achieve very high scores that owe virtually nothing to residue order but are due instead to segment composition. Alignments of such “low complexity” regions have little meaning in any case: since these regions most likely arise by gene slippage, the one-to-one residue correspondence imposed by alignment is not valid. While it is worth noting that two proteins contain similar low complexity regions, they are best excluded when constructing alignments [42-44]. The BLAST programs employ the SEG algorithm [43] to filter low complexity regions from proteins before executing a database search.

So… yes, part of it is replication slippage, but also, the… low complexity of low-complexity regions skews the probability calculations. If you base your match probabilities on the assumption that only every 20th amino acid is an aspartate, and in your sequence it’s every 2nd or 3rd, the match will seem far less likely than it would if you used the real composition of your sequence.
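Here is a toy version of that calculation in Python. This is not how BLAST actually computes its statistics; it is a deliberately crude sketch that assumes positions are independent, and the function name and the run length of 10 are made up for illustration.

```python
def chance_of_matched_run(freq, length):
    """Probability that `length` aligned positions in two unrelated
    sequences all show the same residue in both, if that residue
    turns up independently with frequency `freq` in each sequence."""
    return (freq * freq) ** length

assumed_freq = 1 / 20   # background model: every 20th residue is an aspartate
actual_freq = 1 / 2     # the shell protein: roughly every 2nd residue is one
run_length = 10         # hypothetical stretch of aligned D-D matches

p_assumed = chance_of_matched_run(assumed_freq, run_length)
p_actual = chance_of_matched_run(actual_freq, run_length)

print(f"chance under the background model:  {p_assumed:.2e}")
print(f"chance given the real composition:  {p_actual:.2e}")
print(f"the background model is off by a factor of {p_actual / p_assumed:.1e}")
```

With the wrong background frequencies, a run of matching Ds looks astronomically unlikely to arise by chance, so it gets scored as a meaningful hit; with the sequence's real composition, it is nothing special.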

That makes sense. The Mammal is unconfused and feeling kind of smug now 🙂