Question

fastQC - case of the anomalous last base

8.1 years ago

wasphunter • 0

Hello.

I am not getting the best mapping rate (~60%) on my latest batch of sequence from a HiSeq run following de novo assembly. I don't see a lot of evidence for DNA contamination in the reads so I've been looking elsewhere for a reason for the low mapping efficiency.

My sequences appear to have high quality scores throughout except for the final base (sub-30 Phred). I was able to remove these bases in a subset of the data using Trimmomatic. One thing persists, however. The FastQC "per base sequence content" report indicates that the base percentages at the last position diverge substantially from those elsewhere in the reads. For example, my average G% and C% hold steady at ~22% each throughout the reads but, at the final base, the G% increases to ~25% and the C% to almost 30%.

This observation differs from what I've seen in "normal" RNA-seq reads. Have you seen this, and can you explain the significance of this divergence? Thanks.

rna-seq fastqc • 2.3k views

updated 8.1 years ago by Fluorine ▴ 100 • written 8.1 years ago by wasphunter • 0

score 0 · Answer 1 · 2017-03-20

Hi, my first thought is: was your sample treated in any way, or do you have any overrepresented sequences? If the bias is only at the final base, then it's most probably due to overly aggressive adapter trimming, or to your own trimming of the final bases. How long are your reads? Are they paired-end?
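If you want to check outside FastQC whether the bias really is confined to the final position, you could tabulate the per-position base fractions yourself. This is only a rough sketch, not how FastQC is implemented: the function name `per_base_content` is made up for illustration, and it assumes plain 4-line FASTQ records (no wrapped sequence lines).

```python
from collections import Counter, defaultdict
import io

def per_base_content(fastq_lines):
    """Return {position: {base: fraction}} from an iterable of FASTQ lines.

    Assumes each record is exactly 4 lines, with the sequence on line 2.
    """
    counts = defaultdict(Counter)
    for i, line in enumerate(fastq_lines):
        if i % 4 == 1:  # sequence line of each 4-line record
            for pos, base in enumerate(line.strip().upper()):
                counts[pos][base] += 1
    return {
        pos: {b: c[b] / sum(c.values()) for b in "ACGTN"}
        for pos, c in sorted(counts.items())
    }

# Tiny in-memory example with two 4 bp reads
demo = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACGC\n+\nIIII\n")
content = per_base_content(demo)
print(content[3])  # composition at the last position
```

For real data you would pass an open file handle instead of the `StringIO` object. Note that after trimming, reads have different lengths, so the last position only reflects the reads that were left at full length.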

There are a number of common scenarios which would elicit a warning or error from this module:

Overrepresented sequences: If there is any evidence of overrepresented sequences such as adapter dimers or rRNA in a sample then these sequences may bias the overall composition and their sequence will emerge from this plot.
Biased fragmentation: Any library which is generated based on the ligation of random hexamers or through tagmentation should theoretically have good diversity through the sequence, but experience has shown that these libraries always have a selection bias in around the first 12bp of each run. This is due to a biased selection of random primers, but doesn't represent any individually biased sequences. Nearly all RNA-Seq libraries will fail this module because of this bias, but this is not a problem which can be fixed by processing, and it doesn't seem to adversely affect the ability to measure expression.
Biased composition libraries: Some libraries are inherently biased in their sequence composition. The most obvious example would be a library which has been treated with sodium bisulphite which will then have converted most of the cytosines to thymines, meaning that the base composition will be almost devoid of cytosines and will thus trigger an error, despite this being entirely normal for that type of library.
If you are analyzing a library which has been aggressively adapter trimmed then you will naturally introduce a composition bias at the end of the reads as sequences which happen to match short stretches of adapter are removed, leaving only sequences which do not match. Sudden deviations in composition at the end of libraries which have undergone aggressive trimming are therefore likely to be spurious. [1]