With my Ottawa bioinformatics colleagues I'm analyzing the effects of uptake sequence accumulation on the H. influenzae and N. meningitidis proteomes. One issue I'm struggling with right now is how to tell the difference between two kinds of genes that give poor "E-value" scores in BLAST searches against test genomes from the same group (e.g. gamma proteobacteria): genes that score poorly because they have diverged strongly from their true homologs in the test genomes being searched, and genes that score poorly because they have no homologs in these genomes. The latter genes would be ones that were acquired by transfer from more distantly related bacteria. I know (at least in principle) how to use phylogenetic analysis to show that a gene has been acquired by lateral gene transfer (LGT) rather than having simply diverged, but here I'm looking for a less rigorous and more automatable method, one that can be used to quickly screen many genes (say 100).
So far I think we could use a combination of low BLAST score against the test genomes and aberrant base composition and codon adaptation index to identify genes that are good candidates for having been acquired by LGT. But I'd like to go one step further - to do one more test where genes acquired by LGT are predicted to differ from genes that have simply diverged in situ.
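A minimal sketch of how such a combined screen might look in Python. Everything here is an assumption for illustration: the function names, the use of a bit score (higher is better) in place of an E-value, the precomputed CAI values, and all of the thresholds are hypothetical placeholders, not values we have settled on.

```python
def gc_fraction(seq):
    """Return the G+C fraction of a nucleotide sequence, ignoring non-ACGT symbols."""
    seq = seq.upper()
    gc = sum(seq.count(base) for base in "GC")
    acgt = sum(seq.count(base) for base in "ACGT")
    return gc / acgt if acgt else 0.0


def sort_low_scoring_genes(genes, genome_gc=0.38, gc_tolerance=0.05,
                           max_bit_score=50.0, min_cai=0.3):
    """Split poorly scoring genes into LGT candidates and controls.

    `genes` maps a gene name to a dict with keys "seq" (nucleotide string),
    "bit_score" (best BLAST bit score against the test genomes) and "cai"
    (a codon adaptation index computed elsewhere). All thresholds are
    arbitrary placeholders.
    """
    candidates, controls = [], []
    for name, info in genes.items():
        if info["bit_score"] > max_bit_score:
            continue  # good hit in the test genomes, so not of interest here
        odd_gc = abs(gc_fraction(info["seq"]) - genome_gc) > gc_tolerance
        low_cai = info["cai"] < min_cai
        if odd_gc and low_cai:
            candidates.append(name)   # low score, aberrant G+C, low CAI
        elif not odd_gc and not low_cai:
            controls.append(name)     # low score but otherwise normal
        # genes that fail only one of the two tests are left unassigned
    return candidates, controls
```

Genes that are aberrant by only one measure are deliberately left out of both sets; whether they belong with the candidates is a judgment call this sketch does not make.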
One test I'm considering would be to take the genes flagged by the base composition and CAI tests, and BLAST them against the genome of a closer relative (something in the same family). If the genes are just divergent, they should give higher scores with the relative. But if they were acquired by LGT, they should give either very high scores (if they were acquired before the H. influenzae and relative lineages diverged) or scores just as low as before (if they were acquired by the H. influenzae lineage after the relative diverged). The analysis should also be done on a control set of genes, ones that had equally low BLAST scores in the original tests but normal base composition (38% G+C) and codon adaptation indices (???). The BLAST scores of these genes should be modestly higher against the close relative than against the test genomes.
So, the prediction: Control gene set, BLASTed against the close relative: most (all?) BLAST scores modestly higher than with the test genomes. LGT gene set, BLASTed against the close relative: some BLAST scores much higher than with the test genomes, others not improved at all.
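The prediction could be checked with something like the following Python sketch, which bins each gene by how much its score improves against the close relative. The function names and the ratio cutoffs (1.2 and 3.0) are hypothetical, and I'm assuming scores where higher means a better match (bit scores rather than E-values).

```python
from collections import Counter


def classify_improvement(test_score, relative_score, modest=1.2, large=3.0):
    """Bin a gene by the ratio of its close-relative score to its test-genome score.

    The cutoffs are arbitrary placeholders, not calibrated values.
    """
    if test_score <= 0:
        raise ValueError("test_score must be positive")
    ratio = relative_score / test_score
    if ratio < modest:
        return "not improved"
    if ratio >= large:
        return "much higher"
    return "modestly higher"


def summarize(gene_scores):
    """Count genes per bin; gene_scores maps name -> (test_score, relative_score)."""
    return Counter(classify_improvement(t, r) for t, r in gene_scores.values())
```

If the prediction holds, the control set should fall mostly in the "modestly higher" bin, while the LGT candidate set should split between "much higher" and "not improved".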