Hi all,
Just a quick question on theory, I suppose. After running Trimmomatic on my FASTQ files with the following commonly used parameters:
ILLUMINACLIP:adapters/TruSeq2-PE.fa:2:30:10
SLIDINGWINDOW:4:20
HEADCROP:5
MINLEN:36
and then running FastQC on the trimmed, paired dataset, most quality checks look fine, but "Per sequence GC content" goes from passing (untrimmed dataset) to failing. I think this is because trimming poor-quality bases leaves me with reads of varying length, from 36 to 146 bases, and I suspect the shorter reads in particular are driving an apparent increase in GC content.
My question: is this correct? Is this why the GC content check fails, and can I safely ignore the warning? It makes sense to me that shorter sequences would have a higher %GC and that this would cause the check to fail, but I don't know whether this is actually expected.
Thank you!
EDIT:
Image showing the FastQC results for two samples. It looks like my trimming does not affect GC content at all.
https://i.ibb.co/FKDsWQT/fastqc-results.jpg
EDIT2:
I should mention this is whole-genome sequencing data, not RNA-seq data.
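For anyone who wants to test the read-length hypothesis directly rather than judging it from the FastQC plots, here is a minimal Python sketch that bins trimmed reads by length and reports the mean GC fraction per bin. It assumes plain 4-line FASTQ records (no wrapped sequence lines); the function names, the 10-base bin width, and the file name in the usage note are illustrative choices, not anything taken from Trimmomatic or FastQC.

```python
from collections import defaultdict
import gzip


def gc_fraction(seq):
    """Return the G+C fraction among unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return gc / total if total else 0.0


def read_sequences(handle):
    """Yield the sequence line of each record, assuming 4 lines per record."""
    for i, line in enumerate(handle):
        if i % 4 == 1:
            yield line.strip()


def mean_gc_by_length(handle, bin_width=10):
    """Map each length-bin start to (mean GC fraction, number of reads)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for seq in read_sequences(handle):
        if not seq:
            continue
        start = (len(seq) // bin_width) * bin_width
        sums[start] += gc_fraction(seq)
        counts[start] += 1
    return {s: (sums[s] / counts[s], counts[s]) for s in sorted(counts)}


def open_fastq(path):
    """Open a FASTQ file as text, transparently handling .gz files."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path)
```

Calling `mean_gc_by_length(open_fastq("trimmed_R1.fastq.gz"))` (a hypothetical file name) returns one entry per length bin. If the short-read bins show a clearly higher mean GC than the long-read bins, the hypothesis holds; if the values are flat across bins, read length is not what drives the FastQC result.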
Is your organism expected to have a GC-rich genome? There is essentially no change after trimming.
I am not sure, actually. This is my first time working with a Hymenoptera species. Some quick research suggests there are GC-rich domains, but I think most eukaryotes have some regions of the genome that are more GC-rich than others, for different purposes.
If you don't have a concrete reason to think this observation is problematic, then you can move ahead with the rest of your analysis.
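If a reference or draft assembly exists for the species or a close relative, its overall GC percentage gives a rough baseline for where the FastQC peak should sit. The following is only a sketch under that assumption: it reads a plain-text FASTA, skips header lines, and ignores N and other ambiguity codes. The function name is made up for illustration.

```python
def fasta_gc_percent(handle):
    """Return overall GC percentage of a FASTA file, ignoring non-ACGT bases."""
    gc = 0
    total = 0
    for line in handle:
        if line.startswith(">"):
            continue  # header line, not sequence
        seq = line.strip().upper()
        g_c = seq.count("G") + seq.count("C")
        a_t = seq.count("A") + seq.count("T")
        gc += g_c
        total += g_c + a_t
    if total == 0:
        raise ValueError("no A/C/G/T bases found")
    return 100.0 * gc / total
```

A genome-wide average will not capture local GC-rich domains, so it only indicates roughly where the centre of the read GC distribution is expected, not its shape.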
Okay thanks for your feedback!