Field of Science

Bacterial pseudogenes and within-species diversity

Last night Jon Eisen posted about a new paper by Chih-Horng Kuo and Howard Ochman, about the evolutionary fates of bacterial pseudogenes (PLoS Genetics: The Extinction Dynamics of Bacterial Pseudogenes).  I don't (yet) understand their conclusion very clearly, but it ties in well to the issues around the diversity of bacterial competence that I need to sort out for my CIfAR talk next week.

What do we know about within-species genetic diversity in bacteria?  The big issue is core genome and accessory genome.

In most (all?) species, different strains have a core set of genes in common; usually these make up about 80% of each strain's gene set (typical range ~70%-90%).  These core genes are usually syntenic.  They are very similar across the different strains, usually no more than a few percent different in DNA sequence, and almost identical in protein sequence, consistent with recent descent from a common ancestor.  These shared-by-descent genes are what justifies grouping the strains as representatives of a single 'species'.

The rest of each genome's gene set comprises genes that are absent from some or most other strains.  It's not just that the alleles of these genes are very divergent, but that the genes have different ancestries.  Many of these accessory genes are in large blocks ('islands') with evidence of a mechanism by which they have been transferred from another distantly related species (e.g. phage, integron or transposon sequences, flanking tRNA genes).  This within-species genetic diversity is not seen in typical eukaryote genomes, perhaps because of the homogenizing effect of meiotic sexual reproduction.
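To make the core/accessory split concrete, here is a minimal sketch in Python.  The strain names and gene labels are made-up toy data, and real analyses decide "same gene" by sequence similarity and synteny rather than by matching names, so treat this only as an illustration of the set logic.

```python
from typing import Dict, Set, Tuple

def core_and_accessory(strains: Dict[str, Set[str]]) -> Tuple[Set[str], Dict[str, Set[str]]]:
    """Return the core gene set (present in every strain) and each strain's accessory genes."""
    gene_sets = list(strains.values())
    core = set.intersection(*gene_sets) if gene_sets else set()
    accessory = {name: genes - core for name, genes in strains.items()}
    return core, accessory

# Hypothetical toy data: three strains sharing four genes, two of them carrying an extra block.
strains = {
    "strainA": {"g1", "g2", "g3", "g4", "islandX"},
    "strainB": {"g1", "g2", "g3", "g4", "phageY"},
    "strainC": {"g1", "g2", "g3", "g4"},
}

core, accessory = core_and_accessory(strains)

# Fraction of each strain's gene set that belongs to the core.
core_fraction = {name: len(core) / len(genes) for name, genes in strains.items()}
```

With these toy numbers the core makes up 80% of strainA's genes, in line with the typical figure quoted above.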

Also unlike most eukaryote genomes, bacterial genomes usually contain only a small amount of non-gene sequences, usually about 10% of the genome.  This is almost entirely intergenic; introns are very rare and usually contain other genes (excisionases and mobilization genes).

What about pseudogenes?  Pseudogenes are DNA sequences that are closely related to functional genes but have mutations that destroy the function.  They are usually identified by comparison with the functional sequence in a close relative ('allele' if in the same species, 'homolog' if in another species).  Although function could be destroyed by mutations that change one or more critical amino acids, these can't be recognized without biochemical characterization of the gene product, and in practice pseudogenes are identified by the presence of a stop codon or indel that would prevent translation into a full-length protein.
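The practical criterion (a premature stop codon or a frameshifting indel relative to a functional relative) can be sketched in a few lines of Python.  This is a simplified illustration, not the method used in any of the papers discussed here: the function names and the toy sequences are my own, it assumes both sequences start at the same start codon, and it infers a frameshift only from a length difference that is not a multiple of three.  Real pipelines work from alignments.

```python
from typing import Optional

STOP_CODONS = {"TAA", "TAG", "TGA"}

def first_stop_index(cds: str) -> Optional[int]:
    """Return the codon index of the first in-frame stop codon, or None if there is none."""
    cds = cds.upper()
    for i in range(0, len(cds) - 2, 3):
        if cds[i:i + 3] in STOP_CODONS:
            return i // 3
    return None

def looks_like_pseudogene(candidate: str, reference: str) -> bool:
    """Flag a candidate that has a frameshifting length change or an earlier stop than the reference."""
    ref_stop = first_stop_index(reference)
    cand_stop = first_stop_index(candidate)
    # An insertion or deletion whose length is not a multiple of 3 shifts the reading frame.
    frameshift = (len(candidate) - len(reference)) % 3 != 0
    # A stop codon that appears before the reference's stop truncates the protein.
    premature_stop = cand_stop is not None and (ref_stop is None or cand_stop < ref_stop)
    return frameshift or premature_stop

# Toy sequences (hypothetical): ATG GCT GAA AAA TGG TAA
reference = "ATGGCTGAAAAATGGTAA"
synonymous = "ATGGCCGAAAAATGGTAA"   # GCT -> GCC, still full length
nonsense = "ATGGCTTAAAAATGGTAA"     # GAA -> TAA, premature stop
one_base_del = "ATGGCGAAAAATGGTAA"  # one base deleted, frame shifted

results = {
    "synonymous": looks_like_pseudogene(synonymous, reference),
    "nonsense": looks_like_pseudogene(nonsense, reference),
    "one_base_del": looks_like_pseudogene(one_base_del, reference),
}
```

Note that a check like this misses exactly the cases mentioned above: a missense change in a critical amino acid leaves the reading frame intact and is not flagged.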

Bacteria do have pseudogenes; in 2005 Lerat and Ochman examined 11 genomes from 4 genera and found that 1%-8% of the open reading frames were pseudogenes.  Most pseudogenes were unique, defective in one genome and apparently functional in the genomes of close relatives, but some pseudogenes were shared between several Staphylococcus pyogenes strains and between two Vibrio vulnificus strains, and two were shared between the closely related V. vulnificus and V. parahaemolyticus.  Because shared pseudogenes were uncommon the authors concluded that old pseudogenes are rare.

The new paper examines the evolutionary histories of pseudogenes in five strains of Salmonella.  The strains all did have pseudogenes, from 0.3% to 3.7% of their functional genes.  All but 32 of the 378 pseudogenes identified had only a single defect, suggesting that they had arisen recently.  Consistent with recent origin, very few pseudogenes were shared (maybe 3?).  Most were created by small deletions or by point mutations that created stop codons.  The authors don't explicitly consider the core gene/accessory gene distinction, but because the pseudogenes were identified by alignment of not just the gene itself but of the genes flanking it on both sides, I think these are pseudogenes of the core gene set common to all strains, not of accessory genes present in only one or two strains.  (I just emailed the authors to check this.)  Many of the genes have no assigned or suggested function.

Kuo and Ochman ask why there are not more old pseudogenes.  But first I want to consider the basics: what we might expect to happen after the first mutation happens.  If the functional gene makes an important contribution to fitness, we expect cells with the mutation to die or be quickly outcompeted by other cells, so the mutation will be gone from the population.  These pseudogenes are so short-lived that they are unlikely to be present in sequenced genomes.  If the functional gene makes little or no contribution to fitness in the present environment, the mutant cells may persist and even found a lineage (or, more likely, still go extinct).  The pseudogenes that are detected in sequenced genomes must be of this type.  Because the pseudogene's sequence is no longer under selection for the coding function, additional mutations that change its sequence may be selectively neutral, or they may be beneficial if they eliminate a harmful effect of the pseudogene.  What could such harmful fitness effects be?  The non-functional gene could produce a toxic product, being translated into a defective protein that interferes with the regulation or function of other proteins.  It might be transcribed but not translated, using up transcriptional resources.  Even if it is never transcribed, the cell still has to replicate and maintain this DNA, and it's often thought that bacterial cells have compact genomes because selection favours deletions of nonfunctional DNA that reduce this burden.

Kuo and Ochman conclude that (from the Abstract):
We found that after their initial formation, the youngest pseudogenes in Salmonella genomes have a very high likelihood of being removed by deletional processes and are eliminated too rapidly to be governed by a strictly neutral model of stochastic loss. Those few highly degraded pseudogenes that have persisted in Salmonella genomes correspond to genes with low expression levels and low connectivity in gene networks, such that their inactivation and any initial deleterious effects associated with their inactivation are buffered.
There are two points here, one I agree with and one I don't.  I agree that most pseudogenes are of recent origin, and their results do suggest that genes that are highly expressed and/or well connected are less likely to persist once they become pseudogenes.  The Discussion emphasizes the toxic-product hypothesis, which makes sense.

But I don't agree that deletion must be the reason we see few old pseudogenes in genome sequences.  It's true that deleting a pseudogene will eliminate both any toxic-protein cost and the cost of maintaining the unneeded DNA.  But it doesn't eliminate the cost of the original mutation that created the pseudogene.  Unless we have independent evidence that the DNA of pseudogenes is removed from genomes by deletion, we should probably suspect that instead cells carrying pseudogenes are removed from populations by selection.

Bottom line:

Is the DNA of new pseudogenes quickly lost from genomes by deletion, creating strains that are more fit than those with the pseudogene (but probably not more fit than the ancestor with the functional gene)?  This predicts that sequenced genomes should contain many sites where 'core' genes have been deleted.

Alternatively, are cells containing new pseudogenes quickly lost from populations because the cells compete poorly with cells that retain the functional gene?  This predicts that sequenced genomes will typically all contain the same core genes.
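One way to compare these two predictions against real data would be to tally, for each gene found in most genomes, whether it is intact everywhere, a pseudogene in one genome, or missing from one genome.  The Python sketch below shows only the bookkeeping.  The status labels, gene names and toy table are hypothetical, and the tally by itself cannot say which hypothesis is right; it just makes explicit what would be counted.

```python
from collections import Counter
from typing import Dict, List

def tally_gene_fates(status: Dict[str, List[str]]) -> Counter:
    """Classify each gene by its state ('intact', 'pseudo' or 'absent') across the genomes."""
    tally: Counter = Counter()
    for gene, calls in status.items():
        counts = Counter(calls)
        n = len(calls)
        if counts["intact"] == n:
            tally["intact_in_all"] += 1
        elif counts["intact"] == n - 1 and counts["pseudo"] == 1:
            tally["pseudogene_in_one"] += 1
        elif counts["intact"] == n - 1 and counts["absent"] == 1:
            tally["deleted_in_one"] += 1
        else:
            tally["other"] += 1
    return tally

# Hypothetical calls for four genes across five closely related genomes.
status = {
    "geneA": ["intact", "intact", "intact", "intact", "intact"],
    "geneB": ["intact", "pseudo", "intact", "intact", "intact"],
    "geneC": ["intact", "intact", "intact", "absent", "intact"],
    "geneD": ["intact", "intact", "intact", "intact", "intact"],
}

tally = tally_gene_fates(status)
```

Under the deletion hypothesis we would expect the "deleted_in_one" class to be well populated; under the selection hypothesis it should be close to empty, with nearly all genes in "intact_in_all".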

The figure below shows what we might expect to see when comparing 5 closely related genomes under each hypothesis.  The orange parts of each bar represent genes that are intact in most genomes but are a pseudogene in one genome.


  1. Great synopsis. I tend to agree that most pseudogenes are of recent origin, but then I think there is the occasional odd player.

    I was part of a group that recently sequenced and annotated a gram + bacterial pathogen of salmon, Renibacterium salmoninarum (J. Bact. 2008 vol. 190 (21) pp. 6970-82). This genome is littered with pseudogenes, and out of 3507 ORFs, 730 were putative pseudogenes (21%). 360 of these were frameshifts, 208 were disrupted by point mutations, and 162 were disrupted by insertion sequences and deletions. (By the way these are all real and not sequencing errors--I looked at the contig alignments of every single one of them!) Whether these are unique to the strain sequenced (and possibly new) or old and maintained awaits sequencing of other strains (one group I know of is taking this on). However, limited PCR and sequencing showed several of them to be maintained among different strains, and other evidence shows strains of this species are highly homogeneous at least with respect to broad geographical niches.

    I'd be interested in any thoughts you might have!


  2. Hi Mark,

    I'm still floundering around trying to get a grip on pseudogene evolution. I'm particularly interested because many H. influenzae strains have defects in competence that we think are due to recent mutations in DNA uptake genes.

  3. Hi Rosie,

    I just re-read the paper and I'd like to get your take on a contrarian viewpoint to Kuo and Ochman.

    They seem to assume that rates of pseudogene formation (or fixation) are consistent across the Salmonella genomes. However, fig. 1 shows that some of these strains are likely undergoing slightly different selection pressures (leading to slightly higher Ka/Ks ratios). Could it be that there is just differential fixation of slightly deleterious pseudogenes within these lineages instead of faster clearing, leading to the inflated values of new pseudogenes? Am I missing something?

