Hello,
I checked the quality of my FASTQ files (MGI Tech; paired-end bulk RNA-seq) with FastQC, and one sample failed the 'Per sequence GC content' module (huge secondary peak). Most of its overrepresented sequences are rRNA from Rattus norvegicus, my model organism. I ran SortMeRNA on fileR1.fq and fileR2.fq separately, took the non-rRNA read outputs, and aligned them with STAR. The uniquely mapped read rate is 1.88%.
If I skip the rRNA filtering and align the original FASTQ files with STAR, the rate is 76% (all my other samples are ~88-97%). On the PCA plot, this 'problematic' sample clusters tightly with the rest of its group.
I am not sure how to proceed. Is there another way to check for rRNA, maybe further downstream using the count matrix?
Is it okay to continue with downstream analysis (featureCounts, then DESeq2 or limma) without filtering out the rRNA, i.e. with the raw FASTQ files? Or should I remove this sample from my analysis?
Thanks in advance for the help!
If you have enough replicates, I would simply trash the sample. Even if you remove the rRNA genes from the count matrix, you will end up with an expression matrix of mostly low-count genes for that sample.
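If you want to see this in the count matrix before deciding, a rough sketch along these lines might help. It computes, per sample, the fraction of counted reads assigned to rRNA genes and how many non-rRNA genes still reach a minimum count. Everything here is an assumption for illustration: the gene IDs, sample names, counts, and the `min_count` threshold are made up, and you would have to build the rRNA gene set yourself (for example from the gene biotype in your annotation). Also keep in mind that featureCounts does not count multi-mapping reads by default, so rRNA can be underrepresented in the matrix compared with the raw reads.

```python
def rrna_summary(counts, rrna_genes, samples, min_count=10):
    """Summarise rRNA content per sample from a raw count matrix.

    counts: dict mapping gene ID -> list of raw counts, one per sample,
            in the same order as `samples`.
    rrna_genes: set of gene IDs treated as rRNA (hypothetical here).
    min_count: arbitrary threshold for calling a gene "detected".
    """
    summary = {}
    for i, sample in enumerate(samples):
        # Total counted reads for this sample across all genes.
        total = sum(row[i] for row in counts.values())
        # Reads assigned to genes in the rRNA set.
        rrna = sum(row[i] for gene, row in counts.items() if gene in rrna_genes)
        # Non-rRNA genes that still have at least `min_count` reads.
        detected = sum(
            1
            for gene, row in counts.items()
            if gene not in rrna_genes and row[i] >= min_count
        )
        summary[sample] = {
            "rrna_fraction": rrna / total if total else 0.0,
            "non_rrna_reads": total - rrna,
            "genes_detected": detected,
        }
    return summary


# Toy data only: IDs and numbers are invented for the example.
samples = ["ctrl_1", "ctrl_2", "problem"]
counts = {
    "rRNA_1": [50, 100, 9000],
    "geneA": [500, 450, 600],
    "geneB": [450, 450, 395],
    "geneC": [0, 0, 5],
}
rrna_genes = {"rRNA_1"}

summary = rrna_summary(counts, rrna_genes, samples)
for sample in samples:
    print(sample, summary[sample])
```

If the problematic sample ends up with far fewer non-rRNA reads or detected genes than its replicates, that supports dropping it; if the numbers are comparable, keeping it may be defensible.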