Hello,
I've been trying to assembly Illumina paired-end reads (2x100bp) from RNA-Seq, but after checking FastQC results I noticed a certain pattern in the first 12 bases:
Looks like this pattern is caused by a not so random hexamer priming and that's normal and expected. Thus, the first bases are biased towards sequences that prime more efficiently. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2896536/ and http://support.illumina.com/sequencing/faqs.html (search for: Why is GC high in the first few bases?)
I've tried assembling this reads before and after trimming the first 12 bases using Trinity with default settings:
contigs N50 assembled bases
without trimming 269491 1083 165837915
trimming first 12 171249 1270 133008734
Trimming seems better at first sight. I also tried mapping the untrimmed reads in the assembly to check the match rate in the first bases using BBmap, as suggested by Brian Bushnell here
Match rate by read position:
Base pos. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
all_left.fastq 0.93793 0.97391 0.97935 0.97992 0.98321 0.98266 0.98412 0.98521 0.98521 0.98643 0.98751 0.98761 0.98713 0.98808 0.98770
all_right.fastq 0.26800 0.93917 0.97231 0.97560 0.97726 0.97941 0.97856 0.98346 0.98445 0.98447 0.98320 0.98509 0.98278 0.98500 0.98770
Only the first base from all_right.fastq
seems to have a high error rate.
Now I'm trying to make sense of these results, so my question is how does this kind of bias supposedly affects de novo assembly?
Thank you
Hello,
I just finished a new assembly, trimming the first 2bp. Now I have this:
Trimming the first 2bp had little effect. It's quite interesting. It really seems that the first 12 bases have 'equal' value lowering the continuity of the assembly, even with high match rates (~98%). I wonder if I trim, let's say, the first 20, N50 will continue to rise or it only happens when I trim the biased bases.
Thanks for sharing that; it's clearly not what I expected. I will suggest to my team that we look into the effects on our assemblies of trimming these bases.