Hi,
I'm analysing several RNA-seq datasets from Zea mays I've read several posts about issues with GC-content in RNA-seq.
I know it could be related to
- contamination with material from another species
- two different populations of transcripts (grasses have two populations of transcrips with different GC-content eg article).
So in some downloaded data there are two peaks eg.
E-MTAB-11331 leaf, osmotic stress
GSE220124 leaf, drought stress
In some there is a bump at the right E-MTAB-4258 leaf, various stresses
And some are almost perfectly gaussian GSE167881 leaf, cold stress
I've also done sequencing by myself and discovered that samples from leaf and SAM have different GC-content
leaf samples - right (red) peak, shoot apical meristem (SAM) plus surrounding tissue (etiolated, no chloroplasts)- left (gray) peak These samples have poly-A transcripts selected during library prep.
I could imagine the differences could be due chloroplast RNA in leaf samples.
But how about other projects from databases? I can't believe that in all cases there is contamination with different species... The rest of statistics were okay in those projects, no adapters, no other issues.
This could be because or rRNA that generally has different GC distribution. You can easily check that.
Thanks, actually for my own samples - labelled "Two tissues" in the post, there are overrepresented sequences which indeed blasted to rRNA. Even thought there was poly-T selection step during library preparation.
What method would you recommend? Prwviously I used simply blasted overrepresented sequences to rRNA NCBI databases. However in the case of three projects I've checked so far there are not overrepresented sequences.
For what? You could ignore the rRNA counts in subsequent analysis or just not count if the annotations are not in your GTF/GFF.
I mean, how can I check if the sequences map to rRNA? I should transform
fastq
tofasta
and do blast? Or there is easier method? In my annotation there are no rRNA genes.If you have maize rRNA sequences then you can use
bbduk.sh
from BBMap suite in filter mode to find if there are any reads that match the RNA. You can also remove/trim those reads at the same time if you wish (by using trim mode).Okay, so removal of rRNA didn't help - tested on two full projects. I've downloaded all plant rRNA sequences I can find on NCBI and used command like that in loop
Resulting
FastQC
reports are very similar to those for files beforebbduk
- the second peak in GC content is still here.So either this is not rRNA or I've just downloaded too small set.
I think I should mention my goal - I'm interested in genes which do not change expression. So I assume any contamination with different species could cause detection of more false-positive non-differentially expressed genes.
For now proceed with your data analysis and see what happens. If there is contamination then it should become apparent after you align the data (normally there should be no contamination unless samples were mishandled). This will likely show up as reduced % alignments (if the contamination is real). If there is no apparent difference in alignment %, then the transcribed sequences may be real (and have GC bias) (maize genome is huge and these may be from previously unknown regions) and will still align.
You may want to test aligning to genome and to a known transcriptome (to see if any differences appear).
Looks like there is a T2T assembly of maize genome that may have the sequence of complete rDNA repeat: https://www.nature.com/articles/s41588-023-01419-6
Thank you GenoMax for help. I've extracted 5S rDNA and 45S rDNA from the genome reported in the article and used them in
bbduk
as above.However the additional peaks remain.
For GSE11331 (first plot in my orginal post) ca 130000 sequences (0.3% of all) were removed
similarly for GSE124340 (second plot) ca 250000 sequences (0.6% of all) were removed
So rRNA content is negligible. I'll follow your advice and proceed to mapping and I'll see what happens.
Okay GenoMax , so for three projects with bimodal GC-content the mapping stats (after
STAR
) are comparable to other projects. I've judged that based onBamQC
'sPercent sequences unmapped
which is0
in all cases. Mapping quality histograms are also comparable to other projects. The majority of mappings are very good (only few percent have values<10
others have>248
... and yes I know implementation of MAPQ inSTAR
qc-fail explanation). Alsopercent primary alignments
is ca.90%
in all cases.Just as a interesting complement please look at plot form GSE122866 where samples from control (red) and heat (gray) have different GC-histograms. So it seems the shape of histogram can be influenced by sample type or condition.
If you are trying to do meta-analysis on public datasets then this is not surprising. If you generated these data then there may be some interesting biology to decipher. In any case, it sounds like the sequences are aligning well so they must come from different parts of the genome with differing GC content.
I'm trying to do meta-analysis. Only the last plot from my orginal post is from my own data.