Field of Science

The Attack of Mega-matrix

Lienau, E. K., R. DeSalle, M. Allard, E. W. Brown, D. Swofford, J. A. Rosenfeld, I. N. Sarkar & P. J. Planet. 2010. The mega-matrix tree of life: using genome-scale horizontal gene transfer and sequence evolution data as information about the vertical history of life. Cladistics 26: 1-11.

The last two decades have seen a great deal of discussion about the role of horizontal gene transfer (HGT) in phylogenetic reconstruction of prokaryotes. That HGT occurs among prokaryotes, occasionally between members of far-distant lineages, is undeniable; the question is whether HGT is a common event in bacterial evolution or whether it is mere occasional noise. Some researchers have gone so far as to argue that HGT is so rampant among prokaryotes that the reconstruction of a reliable tree of life for bacteria is an impossibility. As I've noted elsewhere, I have to admit to a certain degree of hostility towards this idea, but I immediately have to confess that this hostility is entirely due to personal prejudice (I really want there to be a tree of life for prokaryotes) and not supported by anything rational.

Attempts to reconstruct the prokaryote tree of life have usually tried to circumvent the issue of HGT by focusing on a small subset of genes that are believed to be resistant to this problem, such as ribosomal RNA genes. However, this method carries two major issues: (1) the assumption that horizontal transfer of these genes is not possible may not be as robust as believed (some have suggested that there may be no such thing as a truly HGT-free gene), and (2) the smaller the data set used, the greater the chance that other complicating factors may interfere with results. For instance, it is now generally accepted that high-level phylogenetic reconstructions of eukaryotes using rRNA are very vulnerable to the effects of unequal evolutionary rates, with many supposedly 'basal' branches being shown to in fact be highly derived. There is no a priori reason to assume that the same problem would not apply to rRNA phylogenies of prokaryotes.

A paper just published in Cladistics takes the opposite approach to the problem: it uses an absolutely enormous amount of data to see whether a coherent tree can still be recovered. Two main data sets were used, both drawn from 166 genomes of taxa throughout the tree of life (mostly prokaryotes). One concatenated direct amino acid sequences from 12,381 genes to provide 846,999 phylogeny-informative characters (out of a potential 4,540,579 characters). The other scored the presence vs absence of genes in each of the 166 genomes. Analysis was done using parsimony, which is potentially problematic for sequence data but probably necessary simply to cope with this amount of data. One analysis was run on the sequence data alone; another was run using the combined sequence and gene presence/absence data (the gene presence/absence data alone had been analysed by an earlier study).
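To make the idea of a combined matrix a little more concrete, here is a minimal Python sketch of joining a sequence matrix to a presence/absence matrix and picking out the informative columns. It assumes that 'phylogeny-informative' means parsimony-informative in the usual sense (at least two character states, each found in at least two taxa). The taxon names, toy data and function names are all invented for illustration; none of this is the paper's actual pipeline.

```python
from collections import Counter

MISSING = {"?", "-"}  # assumed symbols for missing data and alignment gaps


def is_parsimony_informative(column):
    """True when at least two states each occur in at least two taxa."""
    counts = Counter(state for state in column if state not in MISSING)
    return sum(1 for n in counts.values() if n >= 2) >= 2


def informative_columns(matrix):
    """Return the indices of informative columns in a taxon -> string mapping."""
    rows = list(matrix.values())
    length = len(rows[0])
    if any(len(row) != length for row in rows):
        raise ValueError("all rows must have the same number of characters")
    return [
        i for i in range(length)
        if is_parsimony_informative([row[i] for row in rows])
    ]


def concatenate(*matrices):
    """Join matrices that share the same taxa into one combined matrix."""
    taxa = matrices[0].keys()
    return {taxon: "".join(m[taxon] for m in matrices) for taxon in taxa}


# Toy data, invented for illustration only.
sequences = {          # aligned amino acids, one column per site
    "taxonA": "MKVLA",
    "taxonB": "MKVIA",
    "taxonC": "MRILA",
    "taxonD": "MRIL?",
}
presence = {           # 1 = gene present in the genome, 0 = absent
    "taxonA": "110",
    "taxonB": "110",
    "taxonC": "011",
    "taxonD": "010",
}

combined = concatenate(sequences, presence)
print(informative_columns(combined))  # columns 1, 2 (sequence) and 5 (gene presence)
```

Note that a gene present in only one genome, or a site where only one taxon differs, contributes nothing under this criterion, which is one reason the informative count can be so much smaller than the total number of characters.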

The heartening result of this analysis is that a coherent phylogeny was recovered, particularly using the combined data set (shown above from the paper; a few anomalies were present using the sequence data alone). Most previously recognised major bacterial groups analysed were recovered by the combined data as monophyletic* (the only exception being the spirochaetes, with Leptospira failing to associate with the two Spirochaetaceae). Many of the higher-level relationships were also congruent with earlier proposals: α-proteobacteria as sister to the clade of β- and γ-proteobacteria, with δ-proteobacteria the next group out; a clade of the sphingolipid-producing bacteria (Chlorobium + Bacteroidales); and a clade uniting ε-proteobacteria with Aquifex + Thermotoga, which would then include all known hydrogen-oxidising Eubacteria. It appears unlikely that HGT fatally compromises large-scale analyses.

*Or perhaps I should say 'congruent'. As far as I can see, the study glosses over the question of the rooting of the tree of life; the tree shown is rooted between Neomura (Archaea + eukaryotes) and Eubacteria but no discussion is given on that position.

Of course, the tree is not without warning signs. The aforementioned polyphyletic spirochaetes are a bit worrying in light of the distinctive spirochaete ultrastructure. Some of the relationships within the major clades are a bit off: Gloeobacter is nested well within other cyanobacteria rather than being the most divergent (Gloeobacter is the only known cyanobacterium to lack thylakoids), and the arrangement of eukaryotes is all wrong. However, it must be stressed that, as large as this study was, the taxa analysed still represent only a small proportion of the world's total diversity. What is more, the choice of organisms to have their whole genome sequenced (a necessary prerequisite for this study) has not been evenly distributed through prokaryote diversity. Many little-studied but potentially phylogenetically significant taxa (such as many of the low-diversity bacterial 'divisions') are conspicuous by their absence, as are many significant subgroups of those divisions that are represented. This story is not yet over.


  1. ...parsimony??? In 2010?! Did they not learn from Woese et al and their contemporaries? Like, it's the exact fucking same:
    Microsporidia (now within Fungi) branching at the base of eukaryotes = awesome positive control for grotesque LBA issues.
    Followed by Plasmodium, the sole representative of the fucking ginormous SAR clade = horrible sampling.
    (fungi,(animals,land plants)) = LOL

    tl;dr the eukaryotes are completely fucked up and show warning signs of serious long branch problems. At least they're monophyletic - phew.

    And I see you mentioned it's all wrong, but I needed to rant regardless =P Currently sitting in a lab where proper eukaryotic phylogeny is done, etc.

    As for bacteria, I don't know much, but I don't trust it at all, and I totally agree with your point about sampling issues. Dude, you can't just grab random organisms and hope it spits out the right tree at you! They did that in the 90's, and it failed miserably. These people have apparently not moved on yet. If they only did better sampling (sacrificing a few genes in the process; also, quality is very often better than quantity) and MCMCMC'd the damn thing, the results might have come out quite interesting!

    Oh, published in Cladistics. Explains A LOT, actually...

    [/unsolicited rant]

  2. They didn't just grab random organisms. Problem is, the determining factor for which organisms were grabbed wasn't always a question of proper phylogenetic coverage (otherwise, we wouldn't have ended up with four OTUs for Escherichia coli and two for Shigella, which is just E. coli that's learnt a trick or two). As I alluded to in the post, it's evident that the choice of OTUs was determined by which taxa had entire genome sequences available. Unfortunately, no-one seems to have successfully made the case yet for the Acidobacteria Genome Project ;-P

    I'm not too happy with the use of maximum parsimony either, but how long would a Likelihood or Bayesian analysis of that much data have taken? I think, primarily, this study should really be taken as just a Proof of Concept (admittedly, it's definitely not the first, but it is one of the biggest) that it is possible to do an analysis of bacterial phylogeny using bucketloads of data and not get a result that looks like names were just added to the nodes in the order that they were pulled out of a hat (like some of the older single-gene trees looked). Hopefully, as more prokaryote genomes become available, we'll see more large-scale studies of a wider range of taxa. I mentioned a couple of suspect results that leaped out at me in the post. I'm also a bit iffy about the supposed Mycoplasmatales + Epsilonproteobacteria + hyperthermophiles clade; that many oddball taxa in one place may be another LBA red flag. On the other hand, I do find it promising that the hyperthermophiles have not gone rushing to the base of the tree like they usually do; as I mentioned in the post, an Epsilonproteobacteria + Aquificae clade would make a lot of sense physiologically (not so sure about fitting Togobacteria in there, though - let's have some more S-membrane taxa, please!)

  3. Right, there's also the following slight issue: We tend to have full genomes primarily for pathogens. Pathogens tend to be parasites. As such, they tend to also be diverged as fuck. When most of your sample is long branchy and highly diverged, it's generally rather pointless to even attempt building a tree in the first place, unless you know quite well how they evolved (which we don't). So yeah, we *desperately* need non-pathogenic bacterial genomes, ie. the 'normal' stuff if you will ;-)

    Have you seen Jonathan Eisen's Wu et al. 2009 Nature tree? (tree in paper is low res; could pass on a higher res image if you'd like)

  4. I haven't looked at cladistics or bacterial grouping for a while, but from a genetics point of view I can tell you that you can often sort of spot the genes that have been horizontally jumping. They have vestigial jumping gene sequences on either side of them, and sometimes they just quite obviously don't fit in (different GC levels etc). This won't work for very ancient jumped genes, but it does give you an idea of how often it happens within a bacterial group.

    I can't embiggen that diagram enough to read it :( I'm guessing it's big on the paper though *goes to take a look*.
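The "different GC levels" cue from the comment above can be sketched in a few lines of Python: work out each gene's GC content, then flag genes that sit far from the genome-wide average. This is only a rough illustration. The gene names, toy sequences and the z-score cutoff of 2.0 are assumptions made up for the example, and real HGT screens would also look at things like codon usage and flanking mobile-element remnants.

```python
from statistics import mean, stdev


def gc_content(seq):
    """Fraction of G and C among the unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    total = gc + seq.count("A") + seq.count("T")
    return gc / total if total else 0.0


def flag_gc_outliers(genes, z_cutoff=2.0):
    """Return the names of genes whose GC content is more than z_cutoff
    sample standard deviations from the mean (cutoff is an arbitrary assumption)."""
    values = {name: gc_content(seq) for name, seq in genes.items()}
    mu = mean(values.values())
    sigma = stdev(values.values())
    if sigma == 0:
        return []  # every gene has identical GC content; nothing stands out
    return [name for name, v in values.items() if abs(v - mu) / sigma > z_cutoff]


# Toy 'genome', invented for illustration only.
genes = {
    "geneA": "ATGC" * 5,
    "geneB": "GCAT" * 5,
    "geneC": "ATGCGCGCAT" * 2,   # slightly GC-rich
    "geneD": "ATATGCATGC" * 2,   # slightly AT-rich
    "geneE": "TACG" * 5,
    "geneF": "CGTA" * 5,
    "geneG": "AGCT" * 5,
    "geneH": "AATTATATAG" * 2,   # strongly AT-rich: the candidate 'jumper'
}

print(flag_gc_outliers(genes))
```

As the comment says, this sort of signal fades with time: a gene that jumped long ago will have drifted towards its host's base composition, so a screen like this only catches relatively recent transfers.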

