Field of Science

Preparing to run BLAST locally

I asked the guy in the next office for advice about using BLAST to characterize the pattern of variation in and around USS sites in the sequenced H. influenzae strain genomes. His advice was to set up BLAST on one of our own computers (not a big deal, he said), to create a BLAST database containing all of the ~2000 USS sequences (39 nt each) from the H. influenzae Rd genome, and then to 'query' this database with the genome sequences of each of the other strain genomes in turn. This has the big benefit of creating only one output file for each genome, rather than the one file for each SS site I would have if I used the 39 nt USS sequences as queries.

Finding the downloadable files at NCBI was quite easy, and I now have a BLAST folder in the applications folder of our fast Mac (Stingray). I also had no trouble converting my file of ~2000 USS sequences into FASTA format, with each entry given an identifying number (I don't need that but BLAST does). And I downloaded the genome sequence files - only 3 other H. influenzae genome sequences turned out to be directly available from GenBank; the others apparently still must be obtained from the various sequencing centers. One of the post-docs may already have these, but if not I think I don't really need data from more than three anyway.

I also found the instructions for formatting commands for BLAST searches, and (must do first) how to use the 'formatdb' program to convert my FASTA file into a recognized BLAST database file. BUT, as usual, I'm stuck at the very basics of using Unix commands in the Mac Terminal interface. I'm hoping one or other of the post-docs will be able to set me straight today.


  1. Just open a terminal and if you have formatdb on your path just type

    formatdb -p F -i __filename__

    To check other options type

    formatdb --help



  2. If you're running a nucleotide search, you should be able to format the database and run the search in about 15minutes (assuming your database is small b/c it's just viral sequences).

    I'd recommend using the -m 8 option which outputs your results in a tab delimited format.

    good luck!

  3. Looking through the command history on my Terminal, I see that for the thing I ran the other day I used this command first:

    xdformat -n -o /Users/iayork/Lab/WU-Blast/db/influenzae /Users/iayork/Desktop/Downloads/PitEE.fasta

    That formatted the FastA file containing the PitEE sequence into the database file /Users/iayork/Lab/WU-Blast/db/influenzae . Then I put the USS sequences into a text file and used this command:

    blastn /Users/iayork/Lab/WU-Blast/db/influenzae /users/iayork/Desktop/USS e=1e-3 mformat=2 o=/Users/iayork/Desktop/blastout.txt

    That searched with blastn the database I had just set up, with the query sequences in /users/iayork/Desktop/USS, expecting 1 e -3, output in the second format option which is tabular, output saved into text file /Users/iayork/Desktop/blastout.txt .

    That's using WU-BLAST, not NCBI BLAST, but I think most of the options are similar.

    Hope that gives a start, anyway.

  4. Oh, and the "bit of scripting" really should be quite straightforward. The Blast output will look quite different from what you're looking for, but (if I understand what you'd like to do) it's probably two lines of regular expressions to get it into a format that's pretty close, and maybe a handful of lines of scripts to get the number of matches/mismatches and so on. If you have someone handy who knows regexps that's the way to go because you can refine it on the fly according to your needs.

  5. Dear Rosie,
    I've been following your BLAST trials these past few days and would like to offer some advice. First, let me point you to this wiki page that I created at the Nodalpoint wiki to discuss your problem. If you're registered at Nodalpoint, you can login and edit wiki pages too.

    I don't think BLAST is the best solution here. One problem you're likely to face is that you won't get the entire 39bp query sequence aligned in every case. You'll also have to do a lot of tweaking to deal with short query sequences.

    If you do want to run BLAST, don't create your database from the 2000 USS sequences. Create it from the 12 genome sequences. If your fasta files are "genome1.fna", "genome2.fna" etc., you can concatenate them into one file using the command "cat *.fna > all.fna". Then run formatdb on the "all.fna" file. Your BLAST database is then called "all" and with your queries in file "USS.fa", the BLAST command is "blastall -d all -p blastn -i USS.fa -o blast.out".

    An alternative to BLAST is BLAT, which is much faster, generates a nice tab-delimited output and can be run on your fasta files. You'd run it like "blat all.fna USS.fa outfile". However, you have to compile BLAT from source code which might be a bit tricky.

    Since you want to align specific sets of sequences, I think a program that does Smith-Waterman local alignments is more appropriate than a general sequence search method. If you can, get EMBOSS installed locally. It's a fantastic suite of tools, including programs for local and global alignment.

    Whatever you end up doing, it will generate a large output file that will have to be parsed and reformatted. Scripting (e.g. in Perl) is really the only way to deal with this. So I hope you have someone on hand to help out with that.

    And finally - make sure your sequence files are saved as plain ASCII text, if you insist on using Word! It really is well worth investing some effort in learning to use a simple UNIX text editor like nano or emacs.

  6. Hi everyone,

    Most of the problems are solved.

    The Unix problem turned out to be that I needed to keep all the blast files in the same directory.

    Tab-delimited format is no good because the info I need is not the number of mismatches but their positions. Pairwise (-m 0) appears to be best.

    I checked out BLAT, but decided that it doesn't do what I want.

    If I had a minion who knows regexps I'd be set, but ....

    Wow, a Nodalpoint page all about my problem! I registered but can't see how to edit. Shall I add responses to the suggestions?

    Where can I find a simple explanation of the difference between a Smith-Waterman local alignment and a general sequence search method?

    Emacs? I used line editors when I was a grad student, but abandoned them as soon as alternatives became available. Now my spoiled brain refuses to grapple with them.


Markup Key:
- <b>bold</b> = bold
- <i>italic</i> = italic
- <a href="">FoS</a> = FoS