I could use some help in understanding and interpreting FastQC results, when comparing pre Vs. post trim_galore input data: All data is from Illumina HiSeq4000 paired end reads, following which I performed adapter trimming and base quality-dependent trimming using "trim_galore" - which is a wrapper around 'FastQC' and 'cutadapt'. The syntax I used was:
trim_galore --fastqc --illumina --paired --retain_unpaired EthFoc-2.S282_L007.1.txt EthFoc-2.S282_L007.2.txt
I also have some general questions, listed below:
In FastQC results, when a tile is not blue, but orange or red etc, I am curious if tile-specific exclusion of Illumina reads from such tiles ever becomes necessary, and if yes, then what tools can perform such filtering / exclusion, if at all available.
My most pressing question - I am really surprised by the change in Per Base Sequence Content after trimming at ~ 150nt, where A content =0!!! Isn't this abnormal? Also, at positions 1-10nt, are these sequences worth trimming away, or can I / should I keep them? Why does this initial stretch not look similar to the rest with more or less equal A/T/G/C content?
The Per Sequence GC content is not discernibly different across the graphs in the composite image. For the fungal species being sequenced, overall GC content is commonly ~48-51%. I wonder if I should download Illumina files from NCBI SRA, for related fungal species, generated by other research groups, to check whether this deviation from the theoretical distribution is not uncommon. BTW, on basis of what genome reads is this theoretical curve plotted?
For Per Base N content, there is a minor bump at position 1. Does this mean that my trimming was not performed as well as it should have been?
For the Sequence Duplication Level graphs, I am not sure I understand the difference between the red and blue lines in the sub-panels. Interestingly the only bump is for repeats ~ > 10X, not sequences with fewer or more numbers of repeats. Could this be species specific? And I wonder if I should compare this to SRA reads for identical or similar species, sequenced by other research groups. Thoughts?
In terms of adapter content- this is what started it all, I saw FastQC return Illumina Universal Adapter content at multiple positions in the original reads, increasing all the way up to the read end. So I decided to run this trim_galore / cutadapt step. It seems totally normal that the adapter content would go away after this step. Correct?
THANK YOU!
Thank you genomax! Very useful replies indeed. For your answer #2 to my question #2, do you mean initial base pair bias ~ 1-10 alone? How about how nucleotide compositional changes ~ 140-150 before Vs. after the trim_galore step? I am trying to understand what scenario(s) can cause "A" content to go down to zero upon trimming! Could you please throw some light on it? THANKS.
You may like to change the stringency of trim-galore. By default, trim galore removes even a single bp match to adapter seq ('AGATCGGAAGAGC'), which mean a terminal A will be always trimmed. See Fig 2 https://www.epigenesys.eu/images/stories/protocols/pdf/20120720103700_p57.pdf
Sequencing bias in the first few bases are explained very clearly here https://sequencing.qcfail.com/articles/positional-sequence-bias-in-random-primed-libraries/