FastQC GC-content, bimodal for maize?
0
1
Entering edit mode
4 months ago
boczniak767 ▴ 870

Hi,

I'm analysing several RNA-seq datasets from Zea mays I've read several posts about issues with GC-content in RNA-seq.

I know it could be related to

  1. contamination with material from another species
  2. two different populations of transcripts (grasses have two populations of transcrips with different GC-content eg article).

So in some downloaded data there are two peaks eg.

E-MTAB-11331 leaf, osmotic stress E-MTAB-11331

GSE220124 leaf, drought stress GSE220124

In some there is a bump at the right E-MTAB-4258 leaf, various stresses E-MTAB-4258

And some are almost perfectly gaussian GSE167881 leaf, cold stress GSE167881

I've also done sequencing by myself and discovered that samples from leaf and SAM have different GC-content

leaf samples - right (red) peak, shoot apical meristem (SAM) plus surrounding tissue (etiolated, no chloroplasts)- left (gray) peak Two tissues These samples have poly-A transcripts selected during library prep.

I could imagine the differences could be due chloroplast RNA in leaf samples.

But how about other projects from databases? I can't believe that in all cases there is contamination with different species... The rest of statistics were okay in those projects, no adapters, no other issues.

RNA-seq • 890 views
ADD COMMENT
0
Entering edit mode

This could be because or rRNA that generally has different GC distribution. You can easily check that.

ADD REPLY
0
Entering edit mode

Thanks, actually for my own samples - labelled "Two tissues" in the post, there are overrepresented sequences which indeed blasted to rRNA. Even thought there was poly-T selection step during library preparation.

ADD REPLY
0
Entering edit mode

What method would you recommend? Prwviously I used simply blasted overrepresented sequences to rRNA NCBI databases. However in the case of three projects I've checked so far there are not overrepresented sequences.

ADD REPLY
0
Entering edit mode

What method would you recommend?

For what? You could ignore the rRNA counts in subsequent analysis or just not count if the annotations are not in your GTF/GFF.

ADD REPLY
0
Entering edit mode

I mean, how can I check if the sequences map to rRNA? I should transform fastq to fasta and do blast? Or there is easier method? In my annotation there are no rRNA genes.

ADD REPLY
0
Entering edit mode

If you have maize rRNA sequences then you can use bbduk.sh from BBMap suite in filter mode to find if there are any reads that match the RNA. You can also remove/trim those reads at the same time if you wish (by using trim mode).

ADD REPLY
0
Entering edit mode

Okay, so removal of rRNA didn't help - tested on two full projects. I've downloaded all plant rRNA sequences I can find on NCBI and used command like that in loop

bbduk.sh in=read1.fastq.gz in2=read2.fastq.gz \
ref=rrnas-plant.fa \
out=read1-clean.fastq.gz out2=read2-clean.fastq.gz \
stats=read-stats overwrite=t

Resulting FastQC reports are very similar to those for files before bbduk - the second peak in GC content is still here.

So either this is not rRNA or I've just downloaded too small set.

I think I should mention my goal - I'm interested in genes which do not change expression. So I assume any contamination with different species could cause detection of more false-positive non-differentially expressed genes.

ADD REPLY
1
Entering edit mode

For now proceed with your data analysis and see what happens. If there is contamination then it should become apparent after you align the data (normally there should be no contamination unless samples were mishandled). This will likely show up as reduced % alignments (if the contamination is real). If there is no apparent difference in alignment %, then the transcribed sequences may be real (and have GC bias) (maize genome is huge and these may be from previously unknown regions) and will still align.

You may want to test aligning to genome and to a known transcriptome (to see if any differences appear).

Looks like there is a T2T assembly of maize genome that may have the sequence of complete rDNA repeat: https://www.nature.com/articles/s41588-023-01419-6

ADD REPLY
0
Entering edit mode

Thank you GenoMax for help. I've extracted 5S rDNA and 45S rDNA from the genome reported in the article and used them in bbduk as above.

However the additional peaks remain.

For GSE11331 (first plot in my orginal post) ca 130000 sequences (0.3% of all) were removed

similarly for GSE124340 (second plot) ca 250000 sequences (0.6% of all) were removed

So rRNA content is negligible. I'll follow your advice and proceed to mapping and I'll see what happens.

ADD REPLY
0
Entering edit mode

Okay GenoMax , so for three projects with bimodal GC-content the mapping stats (after STAR) are comparable to other projects. I've judged that based on BamQC's Percent sequences unmapped which is 0 in all cases. Mapping quality histograms are also comparable to other projects. The majority of mappings are very good (only few percent have values <10 others have >248... and yes I know implementation of MAPQ in STAR qc-fail explanation). Also percent primary alignments is ca. 90% in all cases.

Just as a interesting complement please look at plot form GSE122866 where samples from control (red) and heat (gray) have different GC-histograms. So it seems the shape of histogram can be influenced by sample type or condition. plot

ADD REPLY
0
Entering edit mode

If you are trying to do meta-analysis on public datasets then this is not surprising. If you generated these data then there may be some interesting biology to decipher. In any case, it sounds like the sequences are aligning well so they must come from different parts of the genome with differing GC content.

ADD REPLY
0
Entering edit mode

I'm trying to do meta-analysis. Only the last plot from my orginal post is from my own data.

ADD REPLY

Login before adding your answer.

Traffic: 1976 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6