I have FastQC and MultiQC reports for a batch of paired-end bulk RNA-seq data (80 FastQC reports, 160 read files), run on both the raw and the Trimmomatic-trimmed reads. An overrepresented sequence (2%-6% of reads) with TruSeq adapters as the possible source shows up in 6 of the 160 read files (both reads of the pair in 2 samples, the forward read only in 2 others), in raw and trimmed reads alike. Sequence duplication is about 45-50% in every read file, again in both raw and trimmed; clipping did not help. The per-sequence GC content (mean 51%) has a sharper peak than the theoretical distribution in all 160 raw and 154 trimmed read files (FastQC puts 154 in the warning category and 6 in the fail category for GC content). Adapter content is present in the raw reads, but trimming removed it.
The FastQC/MultiQC reports show that trimming changed the GC content in some files, which is evident as double spikes in 6 of the 160 trimmed read files. I am sharing one of those 6 trimmed files (and the corresponding raw file) that fails the GC content check.
[Figure: adapter content in raw reads]
[Figure: GC content in raw reads]
[Figure: GC content in trimmed reads]
The overrepresented sequence in this file accounts for 6.56% of reads and matches TruSeq Adapter, Index 6 (97% over 37 bp), in both the trimmed and the raw reads.
However, the quality scores are very good: the median lies between 30 and 40 for all 160 read files, raw and trimmed.
How should I interpret the spikes in GC content? I have gone through several FastQC interpretation guides by its author and related community discussions, but could not decide whether to go ahead with mapping. Should I ignore the spikes and the overrepresented sequences since the sequence quality is good, or do something else? I appreciate your time and suggestions.
Check to see if those samples contain rRNA. It has a different GC content compared to other genes.
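One way to follow up on the spikes, as a rough sketch rather than a standard tool: tabulate per-read GC content yourself and pull out the most frequent sequences inside the spike, which you can then search (for example with BLAST) against rRNA or adapter sequences. The function names, the GC window, and the file name below are all hypothetical; only the Python standard library is used.

```python
import gzip
from collections import Counter


def open_fastq(path):
    """Open a plain or gzip-compressed FASTQ file as text."""
    return gzip.open(path, "rt") if str(path).endswith(".gz") else open(path, "r")


def read_sequences(path):
    """Yield the sequence line of each 4-line FASTQ record."""
    with open_fastq(path) as handle:
        for i, line in enumerate(handle):
            if i % 4 == 1:
                yield line.strip().upper()


def gc_percent(seq):
    """Integer percent GC of one read (rounded down); None for empty reads."""
    if not seq:
        return None
    return (seq.count("G") + seq.count("C")) * 100 // len(seq)


def gc_histogram(seqs):
    """Count reads per integer GC percentage."""
    hist = Counter()
    for seq in seqs:
        gc = gc_percent(seq)
        if gc is not None:
            hist[gc] += 1
    return hist


def top_sequences_in_gc_bin(seqs, gc_low, gc_high, n=5):
    """Most common read sequences whose GC percent lies in [gc_low, gc_high]."""
    counts = Counter(s for s in seqs if s and gc_low <= gc_percent(s) <= gc_high)
    return counts.most_common(n)


# Example usage (hypothetical file name and GC window; adjust to your spike):
# hist = gc_histogram(read_sequences("sample_R1.trimmed.fastq.gz"))
# for gc in sorted(hist):
#     print(gc, hist[gc])
# print(top_sequences_in_gc_bin(read_sequences("sample_R1.trimmed.fastq.gz"), 58, 62))
```

If the top sequences in the spike turn out to be rRNA or adapter-derived, that would explain both the GC spike and the overrepresented-sequence warning. Note that exact-sequence counting only catches identical reads, so treat it as a first look.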
Having some samples "fail" FastQC criteria does not make them automatically bad. There is also no rule that says you can't move forward with the analysis. If you notice any strangeness with PCA etc after you do the counting and start basic analysis then consider whether to backtrack and investigate or drop the outlier samples (if it is justified).
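As a quick stand-in for the outlier check after counting, here is a minimal sketch. It is not a PCA: it uses a simpler pairwise-correlation check, where each sample's log-scaled counts are correlated against every other sample and samples with a low median correlation are candidates for a closer look. The input layout (a dict of sample name to gene counts in the same gene order) and the function names are assumptions for illustration.

```python
import math
from statistics import median


def log_cpm(counts):
    """log2(counts-per-million + 1) for one sample's gene counts."""
    total = sum(counts)
    if total == 0:
        raise ValueError("sample has zero total counts")
    return [math.log2(c * 1e6 / total + 1) for c in counts]


def pearson(x, y):
    """Pearson correlation of two equal-length vectors (nan if one is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


def median_correlation(samples):
    """Median correlation of each sample with all other samples.

    samples: dict mapping sample name -> list of gene counts (same gene order).
    """
    logged = {name: log_cpm(counts) for name, counts in samples.items()}
    scores = {}
    for name, vec in logged.items():
        others = [pearson(vec, o) for other, o in logged.items() if other != name]
        scores[name] = median(others)
    return scores


# Example usage with made-up counts:
# scores = median_correlation({"s1": [10, 200, 3000], "s2": [12, 210, 2900], "s3": [9, 190, 3100]})
# for name, score in sorted(scores.items(), key=lambda kv: kv[1]):
#     print(name, round(score, 3))
```

A sample that scores low here is only a flag, not a verdict. A proper PCA on normalized counts in your differential expression package is still the better view, and whether to drop a sample should depend on whether there is a justification for it.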
Thanks, I will check it.