Hello.
I am getting a lower-than-expected mapping rate (~60%) for my latest batch of sequences from a HiSeq run, following de novo assembly. I don't see much evidence of DNA contamination in the reads, so I've been looking elsewhere for the cause of the low mapping efficiency.
My sequences appear to have high quality scores throughout except for the final base (sub-30 Phred). I was able to remove these low-quality bases from a subset of the data using Trimmomatic. One thing persists, however: the FastQC report for "per base sequence content" indicates that the base percentages at the last position diverge substantially from those elsewhere in the reads. For example, my average G% and C% hold steady at ~22% each throughout the reads but, at the final base, G% increases to ~25% and C% to almost 30%.
This observation differs from what I've seen in "normal" RNAseq reads. Have you seen this, and can you explain the significance of this divergence? Thanks.
Thank you for your input.
My RNAseq data derives from 125 bp paired-end reads.
I should have stated more clearly that the "errant" bases are at the 3' end of the reads, not the 5'. Apologies for not including an image earlier:
[Image: FastQC plot, "content across all bases"]
I believe I did a good job of removing adapter sequences as nothing comes up on the "overrepresented sequences" report. Since the problem I'm concerned with is at the 3' end of paired-end reads, it isn't clear that biased fragmentation could account for the observation (although that could explain the observations at the 5' end of the reads in the image above).
So, does this new information give you any more ideas or have I missed something? Thanks again.
That profile looks typical of data that have been trimmed for adapter sequences. Did you use cutadapt or trim_galore for the adapter trimming? Normally when I use trim_galore, the minimum overlap required to detect adapter contamination is a single base pair. This means any read that merely happens to end in the first base of the adapter gets clipped, which leaves the last base of the reads with a skewed base composition. You can remove this effect by increasing the amount of overlap required to detect the adapter. Either way, this is not the source of your low mapping rate.
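To illustrate why a one-base overlap distorts the final position, here is a minimal simulation sketch. It is not how cutadapt or trim_galore are actually implemented: the `trim_3prime` and `last_position_composition` helpers are hypothetical, matching is exact with no error tolerance, and the reads are uniform random sequence rather than real RNAseq data. The adapter is the Illumina universal adapter prefix `AGATCGGAAGAGC`.

```python
import random
from collections import Counter

# Illumina universal adapter prefix; everything else here is illustrative.
ADAPTER = "AGATCGGAAGAGC"
READ_LEN = 125


def trim_3prime(read, adapter, min_overlap):
    """Clip the longest read suffix that exactly equals an adapter prefix
    of at least `min_overlap` bases (simplified: no mismatches allowed)."""
    max_k = min(len(read), len(adapter))
    for k in range(max_k, min_overlap - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:-k]
    return read


def last_position_composition(reads, read_len):
    """Base fractions at position `read_len`, counted only over reads that
    still reach that position (as a per-position plot would)."""
    last_bases = [r[read_len - 1] for r in reads if len(r) == read_len]
    counts = Counter(last_bases)
    return {base: counts[base] / len(last_bases) for base in "ACGT"}


random.seed(0)
# Uniform random reads containing no real adapter: any trimming is spurious.
reads = ["".join(random.choices("ACGT", k=READ_LEN)) for _ in range(20000)]

results = {}
for overlap in (1, 3):
    trimmed = [trim_3prime(r, ADAPTER, overlap) for r in reads]
    results[overlap] = last_position_composition(trimmed, READ_LEN)
    print(f"min overlap {overlap}: {results[overlap]}")
```

With a minimum overlap of 1, every read ending in A (the first adapter base) is shortened, so A vanishes from the final position and the other three bases are inflated among the full-length reads that remain. With a minimum overlap of 3, far fewer reads are clipped by chance and the final position stays close to the background composition. Real data will not look this clean, but the direction of the effect is the same.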
Thanks for your comments, James.
I used Trimmomatic for the adapter removal in both single and palindromic modes.
I'm not sure which software was used to remove the index adapters (still checking with my sequencing company). The data came to me with almost all reads 125 bases in length. The FastQC image that I link to in the comment above is similar to the original at the 5' and 3' ends but differs in that I managed to remove a peak around the 40-50 base range that was due to Illumina sequences in some of the reads (~0.1%). I conducted my trimming using Trimmomatic with a special primer file to allow for both single-end and palindromic trimming modes.