RRResearch: More about analysis of the DNA-uptake sequencing data

The graph below shows the efficiency of DNA uptake (relative to the 'input' DNA sample) across a 13 kb segment of the H. influenzae Rd genome. The red dots are for a 'short' sample, with an average fragment size of about 0.25 kb, and the blue dots are for a 'long' sample, with an average fragment size of about 6 kb. (The average lengths come from crude examination of agarose gels, which might underestimate the abundance of short fragments, so the actual length distributions will be measured with a DNA Analyzer.)

The previous post considered why the red data are so spiky - each spike corresponds to the location in the DNA of a short sequence matching the uptake-signal-sequence (USS) motif. Fragments containing a USS are taken up much better (maybe 25-50 times better?) than fragments lacking a USS.

But the blue data are also spiky, and I don't know why. Ignoring the two big spikes for a minute, the blue spikes and dips have much smaller amplitude than the big red spikes (they don't go up as high or down as low), but they're also much more closely spaced along the DNA.

The gradual rise and fall of the blue dots over distances of several kb is expected from the length distribution of the fragments, but this jaggedness is entirely unexpected, especially given the apparent smoothness of the red points between the USS spikes. Is this just noise in the data? Is it an artefact of how the uptake data were normalized to the input data?
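The expectation of a smooth rise and fall can be made concrete with a toy model. What follows is only a sketch under simplifying assumptions that are not from the actual analysis: every fragment in a sample has the same length, fragment start positions are uniform, each USS is treated as a single position, and a fragment containing at least one USS is taken up `uss_boost` times better than one without (25 is used here as the low end of the rough 25-50 guess above). The function name and parameters are hypothetical.

```python
def expected_uptake(genome_len, uss_positions, frag_len, uss_boost=25.0):
    """Toy model: mean relative uptake of the fragments covering each position.

    Assumptions (illustrative only): one fixed fragment length, uniform
    fragment starts, point-like USSs, and a constant boost for any
    fragment that contains at least one USS.
    """
    profile = []
    for pos in range(genome_len):
        # Fragments covering pos start between pos - frag_len + 1 and pos,
        # clipped so that every fragment lies entirely within the segment.
        first = max(0, pos - frag_len + 1)
        last = min(pos, genome_len - frag_len)
        total = 0.0
        count = 0
        for start in range(first, last + 1):
            end = start + frag_len  # exclusive end coordinate
            has_uss = any(start <= u < end for u in uss_positions)
            total += uss_boost if has_uss else 1.0
            count += 1
        profile.append(total / count if count else float("nan"))
    return profile


# One USS in the middle of a 300 bp toy segment.
short_profile = expected_uptake(300, [150], frag_len=5)
long_profile = expected_uptake(300, [150], frag_len=60)
```

In this model the short fragments give a narrow peak around the USS and the long fragments give a broad ramp, but both profiles are smooth. Nothing in it produces fine-scale jaggedness, which is consistent with the jaggedness being unexpected.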

The two high spikes might be a different puzzle, or they might be extreme cases of whatever is causing the low-amplitude spikiness. How could variation in uptake of DNA fragments that are mostly at least several kb long give a spike that's only about 11 bp wide? Could this be an alignment artefact that somehow affects 'uptake' DNA very differently than 'input' DNA?

Here's a different graph of the uptake ratios (over about 100 kb), made by the former post-doc; again we see much more spikiness in the long-fragment DNA than in the short-fragment DNA.

To investigate the cause(s), I think the first thing to do is to go back one step from the uptake ratio data and look separately at the coverage for the input DNA and the recovered 'uptake' DNA. Luckily, the first thing the post-doc did when he got the sequencing results was to send us a screenshot of a 20 kb Integrative Genomics Viewer (IGV) view of the 4 sample types (long input and uptake, short input and uptake).
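One reason to look at the two coverage tracks separately is that a ratio can turn a dip in the denominator into a spike. Here is a minimal sketch of one way an uptake ratio might be computed from per-position read coverage. It assumes each sample is simply scaled by its own mean depth before dividing. This is not necessarily how the real uptake ratios were normalized, and the function name is made up for illustration.

```python
def uptake_ratio(uptake_cov, input_cov):
    """Per-position uptake ratio after scaling each track by its mean depth.

    Assumption (illustrative only): depth normalization is a simple
    division by the mean coverage of each sample.
    """
    if len(uptake_cov) != len(input_cov):
        raise ValueError("coverage tracks must have the same length")
    uptake_mean = sum(uptake_cov) / len(uptake_cov)
    input_mean = sum(input_cov) / len(input_cov)
    ratios = []
    for u, i in zip(uptake_cov, input_cov):
        scaled_input = i / input_mean
        if scaled_input > 0:
            ratios.append((u / uptake_mean) / scaled_input)
        else:
            # No input reads at this position, so the ratio is undefined.
            ratios.append(float("nan"))
    return ratios


# Flat uptake coverage with a one-position dip in the input coverage.
flat_uptake = [10, 10, 10, 10]
dipped_input = [10, 5, 10, 10]
ratios = uptake_ratio(flat_uptake, dipped_input)
```

In this made-up example the uptake coverage is perfectly flat, yet the ratio at the second position is twice that of its neighbours, purely because the input coverage dips there. If the input and uptake coverage don't rise and fall together at fine scale, the ratio will be jagged even when uptake itself is smooth.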

I'm surprised by how variable the input coverage is. The very fine scale variation is perhaps noise, but the larger peaks and valleys (500-2000 bp) are quite consistent between the long and short input DNA samples.
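The impression that the larger peaks and valleys agree between the two input samples could be checked numerically rather than by eye. The sketch below smooths each coverage track with a moving average, to suppress the very fine-scale variation, and then computes a Pearson correlation between the smoothed tracks. The window size and helper names are assumptions chosen for illustration, and the coverage values would have to come from the real per-position data.

```python
import math


def moving_average(values, window):
    """Mean of each run of `window` consecutive values (no padding)."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[k:k + window]) / window
        for k in range(len(values) - window + 1)
    ]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")  # a constant track has no defined correlation
    return cov / math.sqrt(var_x * var_y)


def smoothed_correlation(track_a, track_b, window):
    """Correlation between two coverage tracks after smoothing both."""
    return pearson(moving_average(track_a, window),
                   moving_average(track_b, window))
```

A high correlation at a window of a few hundred bp, together with a low correlation for the unsmoothed tracks, would support reading the 500-2000 bp features as real and the finest-scale variation as noise.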

Unfortunately I don't have the uptake ratio graph for the same region as this IGV view, and I don't have the R skills to generate it. But I can ask the grad student to do it for me, and to send me his code so I can figure out how it's done.
