I realized that I don't know what happens when Illumina sequencing chemistry reaches the end of a fragment. Does the reaction stop for that fragment, or are bases still added in some way? I ask because my fragments range from 120 to 160 bp in length, yet with 150 cycles I always get 150 bp reads. Some of my reads also end in terminal repeats like this:
CGTCTTCTGCTTGAAAAAAAAAAGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
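To see how common tails like this are in a dataset, one option is to measure the homopolymer runs at the 3' end of each read. This is only a minimal sketch: the function name `trailing_runs` and the `min_len` cutoff of 5 are my own choices for illustration, not part of any existing tool.

```python
from itertools import groupby

def trailing_runs(read, min_len=5):
    """Return the homopolymer runs at the 3' end as (base, length) pairs.

    Walks backwards from the end of the read and stops at the first run
    shorter than min_len (an arbitrary cutoff chosen for illustration).
    """
    # Collapse the read into consecutive runs, e.g. "AAG" -> [("A", 2), ("G", 1)]
    runs = [(base, len(list(group))) for base, group in groupby(read)]
    tail = []
    for base, length in reversed(runs):
        if length < min_len:
            break
        tail.append((base, length))
    # Report the runs in 5' -> 3' order
    return list(reversed(tail))

read = "CGTCTTCTGCTTGAAAAAAAAAAGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG"
print(trailing_runs(read))
```

A read that ends in a poly-A run followed by a poly-G run comes back as two entries, which makes it easy to tabulate tail patterns across a FASTQ file.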
Does your fragment size include the sequencing adapters?
No the adapters are already trimmed off.
Wait, er, what? Then you should be sequencing into the adapters, no? This doesn't come after the adapters, does it?
Yeah, that was unclear. I was trying to say that the adapters had already been trimmed from my data, but I guess maybe that is not the case.
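For what it's worth, the arithmetic behind the read-through question can be sketched as below. The helper name `cycles_past_insert` is made up for illustration; it only assumes that the sequencer runs a fixed number of cycles regardless of insert length, which matches the observation that every read comes out at 150 bp.

```python
def cycles_past_insert(insert_len, cycles=150):
    """Number of cycles that run beyond the end of the insert.

    Any positive value means that many bases of the read come from the
    adapter and whatever follows it, not from the fragment itself.
    """
    return max(0, cycles - insert_len)

for insert_len in (120, 150, 160):
    print(insert_len, cycles_past_insert(insert_len))
```

With 150 cycles, a 120 bp fragment leaves 30 cycles past the insert, while 150 bp and 160 bp fragments leave none. So the shorter fragments in a 120-160 bp library would be expected to carry adapter sequence at the 3' end unless it has been trimmed.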
Is this RNA-seq on NextSeq/MiniSeq?
This is DNA sequencing on the HiSeq 4000.
Hm, my guess was based on the polyA and polyG :p
Yes, normally on a HiSeq 2500 it seems like there's mainly poly-A, while on NextSeq there's poly-A for a little while and then poly-G. I also thought it was probably NextSeq. This might be because NextSeq and HiSeq 3000+ both use the same base-calling software; our 2500s are using an older version than we use for our NextSeq.
From my understanding the polyG on NextSeq/MiniSeq is due to the two-colour chemistry of those sequencers compared to four-colour on the other machines. On NextSeq/MiniSeq, absence of signal indicates a G. (see also this post on qcfail)
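A toy model of the two-colour logic shows why a cluster with no signal gets called as G. Everything here is a simplification: the function `call_base`, the normalised intensities, and the 0.5 threshold are invented for illustration, and the red-only = C / green-only = T assignment is my recollection of the published two-channel scheme, not something taken from the actual base-caller.

```python
def call_base(red, green, threshold=0.5):
    """Toy two-channel base call from normalised red/green intensities."""
    red_on = red >= threshold
    green_on = green >= threshold
    if red_on and green_on:
        return "A"  # signal in both channels
    if red_on:
        return "C"  # red only (assumed assignment)
    if green_on:
        return "T"  # green only (assumed assignment)
    return "G"      # no signal in either channel

# A cluster whose two-channel signal fades over successive cycles
fading = [(0.9, 0.9), (0.7, 0.7), (0.5, 0.5), (0.3, 0.3), (0.1, 0.1), (0.0, 0.0)]
print("".join(call_base(r, g) for r, g in fading))  # prints AAAGGG
```

In this toy model, once both channels drop below the threshold every remaining cycle is called G, whatever the template was.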
That's true, but it does not explain the poly-A prior to the poly-G. In fact, the poly-A on NextSeq tends to be the same length for every read, so it actually appears in the consensus of BBMerge's "outa" results:
adapter sequence - poly-A - poly-G
I don't know whether the poly-A is actual signal, or a base-caller artifact.
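Given that layout, a rough way to clean up such reads is to cut at the adapter when it is found and otherwise strip a trailing poly-A/poly-G tail. This is a hedged sketch, not how BBMerge or any other trimmer actually works: `trim_read_through` is a hypothetical helper, it only does exact matching (real trimmers tolerate mismatches and partial adapters), and the adapter prefix below is just an example value to replace with the sequence from your own kit.

```python
import re

def trim_read_through(read, adapter_prefix, min_run=5):
    """Remove an 'adapter - poly-A - poly-G' tail from a read.

    1. If the adapter prefix occurs exactly, cut the read there.
    2. Otherwise strip a trailing poly-A (optionally followed by G's)
       or a trailing poly-G of at least min_run bases.
    """
    pos = read.find(adapter_prefix)
    if pos != -1:
        return read[:pos]
    tail = re.compile(r"A{%d,}G*$|G{%d,}$" % (min_run, min_run))
    return tail.sub("", read)

adapter = "AGATCGGAAGAGC"  # example value only; use your library's adapter
print(trim_read_through("ACGTACGT" + adapter + "AAAAAAGGGGGG", adapter))
print(trim_read_through("ACGTACGTC" + "AAAAAA" + "GGGGGGGG", adapter))
```

Note that the fallback step cannot tell a genuine run of A's or G's at the end of an insert from an artifact, so it will over-trim such reads.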
Since A is the result of imaging both dyes, it might just be decaying noise... who knows!
Once we go off the end of an adapter-fragment-adapter construct, the sequencer may start sequencing into the adapter lawn present on the flowcell. This can lead to very odd results (polyA's can be one manifestation).
It's highly unlikely that the signal represents sequencing into the adapter lawn. You'd have to invoke some bizarre mechanism of strand dissociation/mismatch annealing/synthesis that hasn't been reported previously for DNA polymerases (whose biochemical properties have been studied intensively for about five decades).
More likely is addition of an untemplated 3' A (a known activity of Taq and similar DNA polymerases) to a subset of molecules for a few cycles until that peters out, then G calls afterward due to background/non-signal. Or it could be a function/artifact of the base-caller (much like the '2' PHRED score conventionally signifies a run of low Qs at the end of the read).
Disclaimer: rampant speculation on my part!
I recall that being offered as a possible explanation by someone in Illumina tech support a long time ago. But I don't have any hard evidence of that conversation/a document I can point to. AFAIK the sequence of the adapters on the flowcell is a trade secret.
The nucleotide sequence on the flow cell has to be complementary to the adapter for the library to anneal. But the moiety used to tether it and any chemical modifications are indeed proprietary.