Field of Science

Reanalysis of old uptake data

I've started reanalyzing the old DNA-uptake data (see New bottles for old wine). Yesterday I succeeded in using the Gibbs motif-search software (thank you RSA Tools!) to analyze the sequences from the 1990 paper, and was encouraged when it did find a USS motif in 15 of the 28 sequences. These 15 were fragments that cells had strongly preferred to take up, and the USS motif looks very much like the one derived previously from the whole-genome consensus. This result is very preliminary (I haven't yet kept any notes or done it meticulously), but it suggests that the bias of the uptake machinery does correspond well to the consensus of the genome repeats.

Today I did the preliminary analysis (this time keeping notes) of the phage-derived sequences from one of the earlier papers (1984). These sequences had not been put into GenBank as a neat set, so I had to download the phage sequence and use a nice shareware DNA-analysis program (Sequence Analysis; thank you Will Gilbert!) to identify the sequences of the five short fragments I will analyze.

I still need to deal with an annoying format problem. The motif-search programs accept DNA sequences only in particular formats, of which the simplest is "FASTA". FASTA marks each header (description) line by starting it with a ">", but for some reason these programs treat the text after my ">"s as sequence. Of course they choke, because the text contains non-sequence characters (i.e. anything other than A, G, C, T and N). If I paste FASTA-format sequence in directly from GenBank there's no problem, so I think Word is doing something weird with the ">" character. I need to find a better text editor for Macs (maybe Mi). Unfortunately TextEdit has been 'improved' to the point where it can no longer handle plain text - it insists on saving all files as RTF or HTML.
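
A quick way to check a file before handing it to the motif-search programs is sketched below in Python. It is only a sketch under assumptions: the function name `clean_fasta` is made up for illustration, and the guess that stray line endings (for instance the carriage-return-only endings some older Mac programs write) are what makes the header text run into the sequence has not been confirmed here. The function rewrites all line endings as plain newlines, leaves ">" header lines alone, upper-cases the sequence lines, and reports any characters other than A, C, G, T and N.

```python
VALID_BASES = set("ACGTN")  # the only characters the sequence lines may contain


def clean_fasta(text):
    """Return (cleaned_text, problems) for a FASTA-format string.

    problems is a list of (line_number, offending_characters) tuples,
    one per sequence line that contains non-sequence characters.
    """
    # Assumption: the trouble may come from non-Unix line endings, so
    # convert Windows (\r\n) and classic Mac (\r) endings to \n first.
    text = text.replace("\r\n", "\n").replace("\r", "\n")

    cleaned = []
    problems = []
    for number, line in enumerate(text.split("\n"), start=1):
        line = line.strip()
        if not line:
            continue  # drop blank lines
        if line.startswith(">"):
            cleaned.append(line)  # header line: keep the text as it is
            continue
        seq = line.upper()
        bad = sorted(set(seq) - VALID_BASES)
        if bad:
            problems.append((number, "".join(bad)))
        cleaned.append(seq)
    return "\n".join(cleaned) + "\n", problems
```

Calling `clean_fasta(open("fragments.txt").read())` (the file name is hypothetical) gives back the tidied text plus a list of line numbers that still need fixing by hand; an empty list means every sequence line contains only A, C, G, T and N.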

3 comments:

  1. If you are using OS X it might be time to look under the hood and try the command line.

    The absolute simplest and best text editor is nano (a copy of an older program called pico).

    Simply open a terminal (Applications->Utilities) and type nano at the prompt.

    TextEdit does do plain text. The default is rich text, so when you copy and paste you have to remember to use the Format menu to convert to plain text. Alternatively you can set plain text as the default for all new documents.

    I would also try to avoid Word for any kind of sequence editing, especially if you plan to paste any sequence from Word.

    Feel free to ask on nodalpoint.org for text editing tips. There are a lot of bioinformatics people there who would gladly offer advice.

    Excellent blog by the way, although I find it hard to keep up with all the new information. Your grad students must be very busy!

  2. I second Greg's comments. It's really important to use the right software tool for the right job. Word is a word processor, so use it for word processing. Editing plain ASCII text requires a plain ASCII text editor. What you see in Word is not plain text, even if it looks like it is - as a friend of mine says, it's a graphical representation of plain text. You'll have all kinds of trouble if you try to use Word as an editor.

  3. Thanks guys. I did suspect that I shouldn't trust Word as a text editor, but I only now discovered that changing TextEdit's Preferences would allow it to create text files. And thanks for pointing me to nodalpoint.

    Greg, I'm amazed that you even try to keep up with the science I'm describing. I only really expect members of my lab to do that - I put it on the blog just to let the general public see what "doing science" looks like.
