Hi everyone,
I'm new to RNA-seq and I'm struggling with the QC of my data. I performed Illumina paired-end stranded RNA-seq on human cells (run on a HiSeq). I ran FastQC on some samples; they show high GC content and fail the "Per sequence GC content" module, with several peaks around 60% and 80% GC.
It looks like most of the over-represented sequences are rRNA, and I've read that these usually have high GC content, which could explain why this module fails. I'm wondering how this will affect my downstream analysis (since I don't know what fraction of my reads are rRNA), and whether there is a way to assess the amount of rRNA and remove these sequences before aligning to the genome, to see if it improves the quality of my data?
Many thanks for your help. Christophe
I'd suggest you check out MultiQC and run it over all your FastQC reports to see whether this is a sample-level event or an experiment-level event. Per-sequence GC content distributions that deviate from a roughly normal shape can be indicative of sample contamination (see here). I think attempting to remove the rRNA sequences, as @Antonio suggested, would be a useful stepping stone.
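To get a rough idea of how many reads are rRNA before committing to a filtering step, something like the sketch below might help. It is only an illustration, not a replacement for a dedicated rRNA filtering tool: it flags a read as "rRNA-like" if it shares a few exact k-mers with a set of rRNA reference sequences that you supply yourself. The function names, the `k=31` and `min_hits=2` defaults, and the file names in the usage note are all assumptions made for this example.

```python
import gzip

# Complement table for DNA bases; other characters are left unchanged.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def normalize(seq):
    """Uppercase a sequence and convert RNA 'U' to DNA 'T'."""
    return seq.upper().replace("U", "T")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def read_fasta_sequences(path):
    """Yield one sequence string per FASTA record (plain or .gz)."""
    opener = gzip.open if path.endswith(".gz") else open
    chunks = []
    with opener(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if chunks:
                    yield "".join(chunks)
                    chunks = []
            elif line:
                chunks.append(line)
    if chunks:
        yield "".join(chunks)


def read_fastq_sequences(path):
    """Yield the sequence line of each 4-line FASTQ record (plain or .gz)."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for i, line in enumerate(handle):
            if i % 4 == 1:
                yield line.strip()


def build_kmer_index(sequences, k=31):
    """Collect all k-mers from both strands of the reference sequences."""
    kmers = set()
    for seq in sequences:
        seq = normalize(seq)
        for strand in (seq, reverse_complement(seq)):
            for i in range(len(strand) - k + 1):
                kmers.add(strand[i:i + k])
    return kmers


def gc_fraction(read):
    """Fraction of G and C bases in a read (0.0 for an empty read)."""
    read = normalize(read)
    if not read:
        return 0.0
    return (read.count("G") + read.count("C")) / len(read)


def is_rrna_like(read, kmers, k=31, min_hits=2):
    """True if the read shares at least min_hits exact k-mers with the index."""
    read = normalize(read)
    hits = 0
    for i in range(len(read) - k + 1):
        if read[i:i + k] in kmers:
            hits += 1
            if hits >= min_hits:
                return True
    return False


def summarize_reads(reads, kmers, k=31, min_hits=2):
    """Count reads flagged as rRNA-like and report their fraction."""
    total = 0
    rrna_like = 0
    for read in reads:
        total += 1
        if is_rrna_like(read, kmers, k, min_hits):
            rrna_like += 1
    fraction = rrna_like / total if total else 0.0
    return {"total": total, "rrna_like": rrna_like, "fraction": fraction}
```

Usage would look something like `kmers = build_kmer_index(read_fasta_sequences("human_rRNA.fa"))` followed by `summarize_reads(read_fastq_sequences("sample_R1.fastq.gz"), kmers)`, where both file names are placeholders. Running it on the first few hundred thousand reads of each sample should be enough to see whether the rRNA fraction differs between samples, and `gc_fraction` can be applied to the flagged reads to check whether they account for the 60% and 80% GC peaks. Exact k-mer matching will miss reads with sequencing errors in every k-mer, so treat the number as a lower-bound estimate.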