Field of Science

Showing posts with label BLAST. Show all posts
Showing posts with label BLAST. Show all posts

New BLAST book!

Thanks to a suggestion from a reader, I ordered the O'Reilly Press book on BLAST. It just arrived and looks to be exactly what I and the rest of my lab need.

We all use BLAST all the time, but we've never really had any understanding of how our search query sequence became the search results. We sort-of knew that this was asking for trouble, but haven't taken the time to learn more. Probably this was partly because doing a BLAST search is so fast and easy that you want to use the results right away, not 'waste time' reading the manual.

The new book has a Glossary! (No more using Google to find hints of what the terms mean.) It has a detailed index! Chapter 2 has a section on Evolution, which opens with the wonderful statement that "BLAST works because evolution is happening."

Yesterday I used my newly gained ability to do local BLAST searches to set up a search for one of the post-docs. We blundered around a bit because I couldn't remember what the different letters controlling the parameter settings did. Now I have the book, all the information I need is at hand.

The only problem is that the book was published in 2002, and some details have changed. Right now I only notice that the BLAST web interface has changed a lot. The available version of BLAST has also changed, from 2.2.6 (new when the book was written) to 2.2.16. I suspect I'll need to read the book before I'll have the background to let me understand the changes.

BLAST problem solved



The reason BLAST was finding no variation at the central positions of the 39 nt sequences was that I had set its 'word' length to 20. This told it to begin searching by looking for perfect matches to an initial sequence that was 20 nt long. This meant that it could never find a mismatch at the central position because such a mismatch would have been flanked by matched segments that were, at best, 19 nt long.

So I tested word lengths of 10 and of 8. Both eliminated the central mismatch problem, at the trivial expense of increasing search time to at worst 10 seconds for the whole genome.

I'm gradually getting a better understanding of how BLAST searches work (it takes me a while to absorb the complexities), so I've also improved the searches in other ways - allowing higher "E-values" so I get sequences with more than one mismatch, and setting the maximum number or results to a value appropriate for my database size. I've also improvved my Word/Excel shuffle methods, so I get a cleaner dataset. And I now carefully note the numbers of sequences at the various steps.

The graph above is the control analysis for only one orientation of only one of the geneome sequences. So now I'm ready to search all three genomes against the control dataset and, if this looks good, against the USS dataset.

BLAST success, analysis success

Everything is working.

I still can't get BLAST to attend to matches close to the ends of the 39 nt fragments, but I'm treating these as mismatches at the innermost position and 'no information' at positions closer to the end. For example, if a sequence matches at positions 4-39, I assume there's a mismatch at position 3 and that I have no information about positions 1 and 2.

I'm searching for the two USS orientations separately (searching the forward and reverse strands of the query genome separately). I'm analyzing the data separately. So far I've analyzed only the forward searches, but I'll need to flip the results I'll get from analyzing the reverse searches.

I'm collecting the output as pairwise comparisons between query and USSs, because this makes it easiest to pull out the positions of the mismatches without the information about what kind of a mismatch each is.

I'm doing the analysis by bouncing the file back and forth a couple of times between Word and Excel, using Word to first insert tabs between the output lines, and then Excel to delete the columns (formerly lines) I don't want (including the actual sequences) and to sort the results by both the match score and the locations of the ends of the matches. Then I use Word to insert tabs between each position of each alignment, and Excel to count the numbers of mismatches at each position and to graph the results.

Next post I'll describe the results.

I can run BLAST locally!

A few suggestions from the post-doc ("Try keeping all the BLAST files in the same folder.") got my Unix problems solved. Commenters on my last post have provided lots of suggestions, some of which are useful and some of which address problems I've already solved. Neil has even set up a NodalPoint page about this problem, but I haven't yet figured out how to edit in my comments.

I've now used standalone (local) BLAST to search the database of 2136 different 39 nt segments of the H. influenzae Rd genome with the source genome (Rd) sequence. This is the 'positive control' for the searches I want to do with the non-Rd genome sequences. This worked, and let me do some trials to work out which options I should specify. Then I did the same searches with the three other genomes I have. This showed me other problems, some of which I haven't worked out yet.

I have two big remaining problems.

First, I need to understand BLAST searches well enough to optimize the alignments. I understand that mismatches close to the ends of the sequences will be under-represented, because of how BLAST works. I think this appeared in the search results - instead of alignments with single mismatches near the ends I think I got alignments that had been shortened by 4 nt. I may be able to minimize this by setting the options appropriately, but I probably can't eliminate it. Luckily the first and last 5 nt are the least important for this analysis.

Second, the information I need to get from the non-Rd searches is the locations of mismatches within the 39 nt segments. For this I found that representing the output as pairwise alignments made easiest to extract the information that specifies the positions of the mismatches within the alignments. This is relatively straightforward (yes Neil, with Word and Excel) provided the mismatches are internal to 39 nt alignments, but will need some more sophisticated tricks for alignments that are truncated at one or the other end. Another problem for extracting the position info is that half of the 39 nt sequences are in the opposite orientation to the others, and so they are reversed in the output.

The guy in the next office strongly recommended BBEdit, so I bought that. He said it does a lot of the editing chores that I'd otherwise need to write Perl scripts to do. Sounded great. But BBEdit has wandered far from its "Bare Bones Software" (="BB") roots, and learning how to use it will take some time...

Preparing to run BLAST locally

I asked the guy in the next office for advice about using BLAST to characterize the pattern of variation in and around USS sites in the sequenced H. influenzae strain genomes. His advice was to set up BLAST on one of our own computers (not a big deal, he said), to create a BLAST database containing all of the ~2000 USS sequences (39 nt each) from the H. influenzae Rd genome, and then to 'query' this database with the genome sequences of each of the other strain genomes in turn. This has the big benefit of creating only one output file for each genome, rather than the one file for each SS site I would have if I used the 39 nt USS sequences as queries.

Finding the downloadable files at NCBI was quite easy, and I now have a BLAST folder in the applications folder of our fast Mac (Stingray). I also had no trouble converting my file of ~2000 USS sequences into FASTA format, with each entry given an identifying number (I don't need that but BLAST does). And I downloaded the genome sequence files - only 3 other H. influenzae genome sequences turned out to be directly available from GenBank; the others apparently still must be obtained from the various sequencing centers. One of the post-docs may already have these, but if not I think I don't really need data from more than three anyway.

I also found the instructions for formatting commands for BLAST searches, and (must do first) how to use the 'formatdb' program to convert my FASTA file into a recognized BLAST database file. BUT, as usual, I'm stuck at the very basics of using Unix commands in the Mac Terminal interface. I'm hoping one or other of the post-docs will be able to set me straight today.