A few suggestions from the post-doc ("Try keeping all the BLAST files in the same folder.") got my Unix problems solved. Commenters on my last post have provided lots of suggestions, some of which are useful and some of which address problems I've already solved. Neil has even set up a NodalPoint page about this problem, but I haven't yet figured out how to edit in my comments.
I've now used standalone (local) BLAST to search the database of 2136 different 39 nt segments of the H. influenzae Rd genome with the source genome (Rd) sequence. This is the 'positive control' for the searches I want to do with the non-Rd genome sequences. This worked, and let me do some trials to work out which options I should specify. Then I did the same searches with the three other genomes I have. This showed me other problems, some of which I haven't worked out yet.
I have two big remaining problems.
First, I need to understand BLAST searches well enough to optimize the alignments. I understand that mismatches close to the ends of the sequences will be under-represented, because of how BLAST works. I think this appeared in the search results - instead of alignments with single mismatches near the ends I think I got alignments that had been shortened by 4 nt. I may be able to minimize this by setting the options appropriately, but I probably can't eliminate it. Luckily the first and last 5 nt are the least important for this analysis.
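To see why a mismatch near an end might shorten an alignment rather than show up inside it, here is a minimal sketch of ungapped local scoring. It is not how BLAST is implemented (BLAST extends outward from seed matches); it only finds the highest-scoring stretch of a fixed, ungapped comparison. The function name and the +1/-3 match/mismatch scores are assumptions for illustration, and the scores your BLAST version actually uses may differ.

```python
def best_local_segment(query, subject, match=1, mismatch=-3):
    """Return (start, end, score) of the highest-scoring ungapped stretch.

    start is 0-based and end is exclusive. The first stretch to reach
    the top score is kept, so ties go to the shorter alignment.
    Scores of +1/-3 are assumed values, not read from any BLAST run.
    """
    if len(query) != len(subject):
        raise ValueError("sequences must be the same length")
    best = (0, 0, 0)
    run_start, run_score = 0, 0
    for i, (q, s) in enumerate(zip(query, subject)):
        run_score += match if q == s else mismatch
        if run_score <= 0:
            # A stretch that has dropped to zero is not worth extending.
            run_start, run_score = i + 1, 0
        elif run_score > best[2]:
            best = (run_start, i + 1, run_score)
    return best


# A made-up 39 nt segment, not real H. influenzae sequence.
segment = ("ACGT" * 10)[:39]
near_end = segment[:35] + "A" + segment[36:]   # mismatch 4 nt from the end
internal = segment[:20] + "C" + segment[21:]   # mismatch in the middle

print(best_local_segment(segment, near_end))
print(best_local_segment(segment, internal))
```

Under these assumed scores, extending through a mismatch that sits 4 nt from the end costs 3 and gains only 3 from the matches beyond it, so the shorter alignment scores just as well. An internal mismatch is followed by enough matches to be worth keeping.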
Second, the information I need to get from the non-Rd searches is the locations of mismatches within the 39 nt segments. For this I found that representing the output as pairwise alignments made it easiest to extract the information that specifies the positions of the mismatches within the alignments. This is relatively straightforward (yes Neil, with Word and Excel) provided the mismatches are internal to 39 nt alignments, but will need some more sophisticated tricks for alignments that are truncated at one or the other end. Another problem for extracting the position info is that half of the 39 nt sequences are in the opposite orientation to the others, and so they are reversed in the output.
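For anyone who would rather script this step, here is a rough sketch of the bookkeeping. It assumes you have already pulled out of each pairwise alignment the two aligned rows plus the start and end coordinates printed beside the row for the 39 nt segment. The function names and argument layout are hypothetical, not part of BLAST.

```python
def mismatch_positions(segment_row, other_row, seg_start, seg_end):
    """Return 1-based positions within the segment where the rows differ.

    seg_start and seg_end are the coordinates printed beside the
    segment's row. If seg_start > seg_end the segment is shown in
    reverse orientation, so positions count down along the row.
    """
    if len(segment_row) != len(other_row):
        raise ValueError("aligned rows must be the same length")
    step = 1 if seg_end >= seg_start else -1
    pos = seg_start
    found = []
    for seg_base, other_base in zip(segment_row, other_row):
        if seg_base == "-":
            # A gap in the segment row uses up no segment position.
            continue
        if seg_base.upper() != other_base.upper():
            found.append(pos)
        pos += step
    return found


def uncovered_positions(seg_start, seg_end, segment_len=39):
    """Return segment positions that a truncated alignment never reached."""
    low, high = sorted((seg_start, seg_end))
    return list(range(1, low)) + list(range(high + 1, segment_len + 1))


# Toy 8-column rows, not real data.
print(mismatch_positions("ACGTACGT", "ACGAACGT", 1, 8))  # forward orientation
print(mismatch_positions("ACGTACGT", "ACGAACGT", 8, 1))  # reverse orientation
print(uncovered_positions(1, 35))  # alignment stopped 4 nt short of the end
```

Reading positions from the printed coordinates means forward and reversed segments end up in the same numbering, and the uncovered positions mark where a truncated alignment may be hiding an end mismatch.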
The guy in the next office strongly recommended BBEdit, so I bought that. He said it does a lot of the editing chores that I'd otherwise need to write Perl scripts to do. Sounded great. But BBEdit has wandered far from its "Bare Bones Software" (="BB") roots, and learning how to use it will take some time...
Hi
Why don't you use Geneious for some of the alignment editing? The free version would be enough for your needs.
Paulo
I'd never heard of it! It might not do what I want for this problem but I bet it will be very useful for other things we do. I'll download the free version and tell the lab about it.
Thanks.
Geneious is very nice software; it has some problems here and there, but it is solid.
I mentioned it because I guess you use Macs, so BioEdit is not compatible. Another option for Macs is CLC Workbench, but it is not my favourite.
cheers
Good to see progress. I was going to mention Excel as an option. If you have delimited text output, it's easy to open in Excel and sort by various columns. Ultimately it's about getting the job done fast in a way that works for you so if you can avoid scripts/regexes, go for it. I stand by my deep hatred of Word though, for any purpose :)
If you feel like editing the wiki page, great, but no worries if not. You need to login with the Nodalpoint user/pass, then the edit buttons will appear. Of course, anyone else is welcome to register, create and edit content there.
I'd definitely look at BLAT at some stage, if not for this project; it's far faster than BLAST and very useful for things like whole genome alignment. I included an example of some output in the BLAT section at the wiki page.
Here's Wikipedia's Smith-Waterman page - not bad.
About the "missing ends" problem - there's an alignment method named "glocal", which tries to find best local alignments that include ends. I don't know what's available for Mac in this regard, but try glocal as keyword in your web searches.
Neil, I'd like to edit the wiki page, as I see this as an experiment in open science. But I don't see any 'edit' buttons even after logging in. Where should they be? What should they look like?
Wiki edit buttons - there should be one at the top left just under the page title, one at bottom left and a small one for each page section at the right.
If you don't see them there may be a Nodalpoint permission issue. I changed you to a "registered" user type - if that doesn't help let me know and I'll check with Greg, the site admin.
Thanks Neil, now I can edit it.