I'm new to RNA-Seq and have just run FastQC on my dataset. On the plots of GC content, all of the samples have a peak at around 60%, as shown here:
I've blasted a few of the most overrepresented sequences and each one hits multiple genes of multiple mammalian species with 100% identity. Each one hits the human signal recognition particle RNA (SRP 7SL), but also hits predicted targets in other mammals. Here's an example sequence:
GTTCTGGGCTGTAGTGCGCTATGCCGATCGGGTGTCCGCACTAAGTTCGG
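As a quick sanity check, the GC content of that example sequence can be computed directly. This is only a minimal sketch: `gc_fraction` is a hypothetical helper name, not part of FastQC, and it says nothing about how FastQC builds its per-sequence GC plot.

```python
def gc_fraction(seq: str) -> float:
    """Return the fraction of G and C bases in seq (case-insensitive)."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return (seq.count("G") + seq.count("C")) / len(seq)


# The overrepresented sequence quoted above.
example = "GTTCTGGGCTGTAGTGCGCTATGCCGATCGGGTGTCCGCACTAAGTTCGG"

print(f"{len(example)} bp, GC = {gc_fraction(example):.0%}")
# prints: 50 bp, GC = 60%
```

So this one sequence sits right at the position of the peak, which would be consistent with a highly abundant transcript contributing to it, though a single sequence does not prove that.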
Can anyone suggest what could be causing this GC peak? As I said, I'm new to RNA-Seq, so it could be a beginner's misunderstanding. I haven't touched the data in any way (no trimming or other quality cut-offs); the reads were run directly through FastQC. As far as I can tell, the main quality measures (Per base sequence quality, Per sequence quality scores) are good, though several of the others (Per base sequence content, Adapter Content, and Kmer Content) show red flags.
In case it's useful, these are paired-end reads generated with the Illumina TruSeq Total RNA kit.
Thank you for any help.
Update: so I've tried trimming adapters but the GC peak is still there...
The same happened to me with this overrepresented sequence: GATCGGAAGAGCACACGTCTGAACTCCAGTCACACAGTGATCTCGTATGC. In my case it was ChIP-seq data from laboratory mouse models. I trimmed the reads and the QC report just got worse (and the GC content plot barely changed). I have now BLASTed the sequence and it shows a 93% match to Staphylococcus phage Andhra, but it also appears in the adapter catalog. Going by the BLAST hit, I could believe there is contamination with DNA from that virus (it's a double-stranded DNA virus), but since the sequence is also a known adapter, adapter contamination seems the more plausible explanation. But if it is an adapter, why doesn't it appear in the "adapter content plot"? I would like to see a well-founded explanation, because so far I have only read suggestions such as "proceed with the mapping anyway, it probably won't affect much", with no real explanation.
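One thing that might be worth checking is whether the sequence actually contains the exact probe that the adapter plot searches for. The sketch below assumes the probe is the 13-mer `AGATCGGAAGAGC`; that is an assumption about the default adapter list and should be checked against the `adapter_list.txt` shipped with your FastQC installation. `find_motif` is a hypothetical helper doing a plain exact-match search.

```python
def find_motif(read: str, motif: str) -> int:
    """Return the 0-based index of the first exact match, or -1 if absent."""
    return read.upper().find(motif.upper())


# The overrepresented sequence reported above.
overrepresented = "GATCGGAAGAGCACACGTCTGAACTCCAGTCACACAGTGATCTCGTATGC"

# ASSUMPTION: probe used for the adapter plot; verify in your own adapter list.
probe_with_a = "AGATCGGAAGAGC"
# The same probe without its leading A.
probe_without_a = probe_with_a[1:]

print(find_motif(overrepresented, probe_with_a))     # prints: -1
print(find_motif(overrepresented, probe_without_a))  # prints: 0
```

Under that assumption, the sequence starts with the adapter minus its first base, so an exact search for the full probe would miss it. This is only a possible explanation, not a confirmed one.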