Lienau, E. K., R. DeSalle, M. Allard, E. W. Brown, D. Swofford, J. A. Rosenfeld, I. N. Sarkar & P. J. Planet. 2010. The mega-matrix tree of life: using genome-scale horizontal gene transfer and sequence evolution data as information about the vertical history of life. Cladistics 26: 1-11.
The last two decades have seen a great deal of discussion about the role of horizontal gene transfer (HGT) in phylogenetic reconstruction of prokaryotes. That HGT occurs among prokaryotes, occasionally between members of far distant lineages, is undeniable; the question is whether HGT is a common event in bacterial evolution or merely occasional noise. Some researchers have gone so far as to argue that HGT is so rampant among prokaryotes that the reconstruction of a reliable tree of life for bacteria is an impossibility. As I've noted elsewhere, I have to admit to a certain degree of hostility towards this idea, but I immediately have to confess that this hostility is entirely due to personal prejudice (I really want there to be a tree of life for prokaryotes) and not supported by anything rational.
Attempts to reconstruct the prokaryote tree of life have usually tried to circumvent the issue of HGT by focusing on a small subset of genes that are believed to be resistant to this problem, such as ribosomal RNA genes. However, this method carries two major issues: (1) the assumption that horizontal transfer of these genes is not possible may not be as robust as believed (some have suggested that there may be no such thing as a truly HGT-free gene), and (2) the smaller the data set used, the greater the chance that other complicating factors may interfere with results. For instance, it is now generally accepted that high-level phylogenetic reconstructions of eukaryotes using rRNA are very vulnerable to the effects of unequal evolutionary rates, with many supposedly 'basal' branches being shown to in fact be highly derived. There is no a priori reason to assume that the same problem would not apply to rRNA phylogenies of prokaryotes.
A paper just published in Cladistics takes the opposite approach to the problem: it uses an absolutely enormous amount of data to see whether a coherent tree can still be recovered. Two main data sets were used, both drawn from 166 genomes from taxa throughout the tree of life (mostly prokaryotes). One concatenated direct amino acid sequences from 12,381 genes to provide 846,999 phylogeny-informative characters (out of a potential 4,540,579 characters). The other compared presence vs absence of genes across the 166 genomes. Analysis was done using parsimony, which is potentially problematic for sequence data but probably necessary simply to make this amount of data workable. One analysis was run on the sequence data alone; another was run using the combined sequence and gene presence/absence data (the gene presence/absence data alone had been analysed by an earlier study).
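To make the idea of 'phylogeny-informative characters' a little more concrete, here is a minimal sketch in Python. It uses the standard definition of a parsimony-informative character (at least two states, each present in at least two taxa) and shows how a sequence matrix and a gene presence/absence matrix might be joined into one combined matrix. The taxon names, the toy data and the function names are all made up for illustration; this is not the pipeline used in the paper, which I have not seen described at this level of detail.

```python
from collections import Counter

def is_parsimony_informative(column, missing="?-"):
    """Return True if at least two states each occur in at least two taxa.

    Characters listed in `missing` (gaps, unknowns) are ignored.
    """
    counts = Counter(state for state in column if state not in missing)
    return sum(1 for n in counts.values() if n >= 2) >= 2

def informative_columns(matrix):
    """Return the indices of parsimony-informative columns.

    `matrix` maps each taxon name to a string of character states;
    all strings must have the same length.
    """
    rows = list(matrix.values())
    length = len(rows[0])
    if any(len(row) != length for row in rows):
        raise ValueError("all rows must have the same length")
    return [i for i in range(length)
            if is_parsimony_informative([row[i] for row in rows])]

# Hypothetical toy data: four taxa, five aligned amino acid positions.
sequences = {
    "taxonA": "MKVLA",
    "taxonB": "MKVIA",
    "taxonC": "MRVIA",
    "taxonD": "MRVLG",
}

# Hypothetical presence (1) / absence (0) of three genes in the same taxa.
presence = {
    "taxonA": "110",
    "taxonB": "110",
    "taxonC": "011",
    "taxonD": "010",
}

# Combined matrix: sequence characters followed by presence/absence characters.
combined = {taxon: sequences[taxon] + presence[taxon] for taxon in sequences}

print(informative_columns(sequences))  # indices within the sequence block
print(informative_columns(combined))   # indices within the combined matrix
```

In the toy data, invariant positions and positions where the odd state turns up in only a single taxon are dropped, which is the same sort of filtering that reduces the paper's 4,540,579 potential characters to 846,999 informative ones.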
The heartening result of this analysis is that a coherent phylogeny was recovered, particularly using the combined data set (shown above from the paper; a few anomalies were present using the sequence data alone). Most previously recognised major bacterial groups analysed were recovered by the combined data as monophyletic* (the only exception being the spirochaetes, with Leptospira failing to associate with the two Spirochaetaceae). Many of the higher-level relationships were also congruent with earlier proposals: α-proteobacteria as sister to the clade of β- and γ-proteobacteria, with δ-proteobacteria the next group out; a clade of the sphingolipid-producing bacteria (Chlorobium + Bacteroidales); and a clade uniting ε-proteobacteria with Aquifex + Thermotoga, which would then include all known hydrogen-oxidising Eubacteria. It appears unlikely that HGT fatally compromises large-scale analyses.
*Or perhaps I should say 'congruent'. As far as I can see, the study glosses over the question of the rooting of the tree of life; the tree shown is rooted between Neomura (Archaea + eukaryotes) and Eubacteria but no discussion is given on that position.
Of course, the tree is not without warning signs. The aforementioned polyphyletic spirochaetes are a bit worrying in light of the distinctive spirochaete ultrastructure. Some of the relationships within the major clades are a bit off: Gloeobacter is nested well within other cyanobacteria rather than being the most divergent (Gloeobacter is the only known cyanobacterium to lack thylakoids), and the arrangement of eukaryotes is all wrong. However, it must be stressed that, as large as this study was, the taxa analysed still represent only a small proportion of the world's total diversity. What is more, the choice of organisms to have their whole genome sequenced (a necessary prerequisite for this study) has not been evenly distributed through prokaryote diversity. Many little-studied but potentially phylogenetically significant taxa (such as many of the low-diversity bacterial 'divisions') are conspicuous by their absence, as are many important subgroups of those divisions that are represented. This story is not yet over.