Hi all,
I'm checking the quality of my RNA-Seq data with FastQC, and all my FASTQ files fail "Per base sequence content" and "Sequence Duplication Levels". There is also a warning on "Overrepresented sequences", but only for the read 1 files (the data is paired-end, and the overrepresented sequences are the same across samples). Below is an example, but the results are very similar across all FASTQ files.
Can you give me any clue about the possible causes or how to investigate them?
Important note: the data comes from DNBseq (a BGI sequencer).
Thank you,
For "Sequence Duplication Levels", I will try to plot duplication vs read density to check for technical duplication, thank you!
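A minimal sketch of that duplication vs read density check, assuming the reads are already aligned and assigned to genes. The function name `duplication_by_gene`, the input layout (one `(gene_id, fragment_key)` pair per read pair) and the toy data are all hypothetical, not output from any particular tool. In RNA-Seq, high duplication in highly expressed genes is expected; high duplication in genes with low read density is the pattern that points to technical duplication.

```python
from collections import Counter


def duplication_by_gene(assignments, gene_lengths):
    """Return {gene: (reads_per_kb, duplication_rate)}.

    assignments  -- iterable of (gene_id, fragment_key); fragment_key is any
                    hashable describing the alignment, e.g. (chrom, start, end, strand).
                    Two reads with the same key are treated as duplicates (assumption).
    gene_lengths -- dict mapping gene_id to gene length in bases.
    """
    totals = Counter()
    distinct = {}
    for gene, key in assignments:
        totals[gene] += 1
        distinct.setdefault(gene, set()).add(key)

    result = {}
    for gene, n_reads in totals.items():
        reads_per_kb = n_reads / (gene_lengths[gene] / 1000)
        duplication_rate = 1 - len(distinct[gene]) / n_reads
        result[gene] = (reads_per_kb, duplication_rate)
    return result


# Toy data, purely illustrative.
toy_assignments = [
    ("geneA", ("chr1", 100, 300, "+")),
    ("geneA", ("chr1", 100, 300, "+")),
    ("geneA", ("chr1", 100, 300, "+")),
    ("geneA", ("chr1", 150, 350, "+")),
    ("geneB", ("chr2", 500, 700, "-")),
    ("geneB", ("chr2", 520, 720, "-")),
]
toy_lengths = {"geneA": 2000, "geneB": 1000}
toy_result = duplication_by_gene(toy_assignments, toy_lengths)
```

Plotting `duplication_rate` against `reads_per_kb` (usually on a log scale) for all genes then shows whether duplication rises with expression or is high everywhere.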
But for "Per base sequence content", what's bothering me is not the biased sequence at the beginning, but the separation between G/C and A/T proportions. Could it reflect duplicated sequences too?
Separation between the G/C and A/T proportions is possible because your organism may have GC-rich exons. It could also reflect rRNA, if it was not completely removed.
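To look at this outside FastQC, the per-position base fractions can be recomputed from a sample of reads, for example before and after removing reads suspected to be rRNA. This is a rough sketch, not FastQC's own implementation; `fastq_sequences` and `per_base_content` are hypothetical helper names, and the reader assumes plain 4-line FASTQ records.

```python
import gzip
from collections import Counter
from itertools import islice


def fastq_sequences(path, limit=100000):
    """Yield up to `limit` read sequences from a FASTQ file (optionally gzipped).

    Assumes each record is exactly 4 lines, with the sequence on the second line.
    """
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for i, line in enumerate(islice(handle, 4 * limit)):
            if i % 4 == 1:
                yield line.strip()


def per_base_content(reads):
    """Return one dict per read position with the fraction of A, C, G and T.

    Bases other than A/C/G/T (e.g. N) are ignored when computing fractions.
    """
    reads = list(reads)
    length = max((len(r) for r in reads), default=0)
    counts = [Counter() for _ in range(length)]
    for read in reads:
        for pos, base in enumerate(read.upper()):
            counts[pos][base] += 1

    table = []
    for c in counts:
        total = sum(c[b] for b in "ACGT")
        table.append({b: (c[b] / total if total else 0.0) for b in "ACGT"})
    return table


# Toy reads, purely illustrative.
toy_table = per_base_content(["GCGC", "GCAT", "ACGT"])
```

On real data this would be called as `per_base_content(fastq_sequences("sample_1.fq.gz"))`, where the file name is a placeholder. A G/C vs A/T gap that shrinks once the suspected rRNA reads are excluded would support the rRNA explanation.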
But in the case of GC-rich exons, "Per sequence GC content" should fail too, right? This is not the case for any of my FASTQ files...
To check for rRNA, could I BLAST the overrepresented sequences? I have read about adapter dimers too; do you know how I could check this?
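One way to get the overrepresented sequences into a form that can be pasted into BLAST is to pull them out of the `fastqc_data.txt` file inside each FastQC report zip. The sketch below assumes the usual layout of that file (a `>>Overrepresented sequences` module with tab-separated Sequence, Count, Percentage and Possible Source columns, closed by `>>END_MODULE`); `overrepresented_to_fasta` is a hypothetical helper name and the sample text is made up.

```python
def overrepresented_to_fasta(fastqc_data_text):
    """Convert the 'Overrepresented sequences' module of fastqc_data.txt to FASTA text.

    Assumes tab-separated columns: Sequence, Count, Percentage, Possible Source.
    Returns an empty string if the module is absent or has no rows.
    """
    records = []
    in_module = False
    for line in fastqc_data_text.splitlines():
        if line.startswith(">>Overrepresented sequences"):
            in_module = True
            continue
        if in_module and line.startswith(">>END_MODULE"):
            break
        # Skip everything outside the module, the '#' header row, and blank lines.
        if not in_module or line.startswith("#") or not line.strip():
            continue
        fields = line.split("\t")
        if len(fields) < 3:
            continue
        sequence, count, percentage = fields[0], fields[1], fields[2]
        source = fields[3] if len(fields) > 3 else ""
        header = f">overrep_{len(records) + 1} count={count} pct={percentage} source={source}"
        records.append(f"{header}\n{sequence}")
    return "\n".join(records) + ("\n" if records else "")


# Made-up example of the relevant part of fastqc_data.txt.
sample_text = (
    ">>Overrepresented sequences\twarn\n"
    "#Sequence\tCount\tPercentage\tPossible Source\n"
    "ACGTACGTACGT\t1200\t0.25\tNo Hit\n"
    "GGGGCCCCAAAA\t800\t0.17\tNo Hit\n"
    ">>END_MODULE\n"
)
toy_fasta = overrepresented_to_fasta(sample_text)
```

For adapter dimers, the "Possible Source" column and the "Adapter Content" module are the first places to look. FastQC's built-in contaminant list may not cover DNBseq adapters, so a "No Hit" there does not rule adapters out; comparing the sequences against the adapter sequences from the library prep documentation is a more direct check.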
None of the failures on FastQC prevent you from proceeding with the data analysis. In fact, you should do so. If there are issues downstream (e.g. the alignment % looks bad, or you are not able to assign counts to genes), then backtrack and investigate why that may be happening.