Bimodal GC content
8.8 years ago
Folder40g ▴ 190

Hi

I'm analysing some human CLL data (cancer, whole exome). When I run FastQC to check the data, all samples show a bimodal GC content distribution. Generally the only warning FastQC raises is for the GC content module; the other modules are normally fine.

I ran fastq_screen against the human genome only: about 80% of reads have a single hit, 18% have multiple hits, and about 0.6% do not map to human. This makes me think there is no contamination in the samples.

After some thought, I still do not know why the samples show this kind of distribution.

Anyone?

Thanks for your time.

GC FASTQC screenshot

fastqc GC-content • 12k views

Used to routinely see this with Agilent human exomes, never had a particularly good explanation for it other than there might be some inherent bias in the baits?


Just to back up Dan's answer, I've seen the same in Agilent exomes, and haven't come up with a reasonable explanation. Traditionally, with things like RNA-seq, this bimodal distribution would make me go straight to the possibility of sample contamination, but with exomes it seems systematic more than anything else.


These data were also obtained from a whole-exome library prepared with Agilent SureSelect.


Did you ever find a solution for your issue runnerbio?

I wrote a small tool to drill down into BAM statistics like GC% to see if your secondary peak is over-represented in certain reads (certain chromosomes, certain mapping conformations, certain read flags, certain fragment lengths, certain read tags, etc.).

I haven't published it to GitHub yet, but if you would be interested in 'test driving' it to see if it can help you figure out your issue, I'd be more than willing to give some support as you go along :) Here's a video (skip to about minute 9:00): https://vimeo.com/123508180
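For readers who want to try a similar drill-down without that tool, here is a minimal sketch in plain Python. It is not the tool from the video; it only assumes SAM-formatted text lines (for example the output of `samtools view`) and groups per-read GC% by reference name or by FLAG. The function names `gc_by_group` and `summarize`, and the 60% threshold, are hypothetical choices for illustration.

```python
from collections import defaultdict


def gc_percent(seq):
    """GC percentage over A/C/G/T bases only; None if there are none."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return 100.0 * gc / total if total else None


def gc_by_group(sam_lines, key="rname"):
    """Collect per-read GC% grouped by reference name ("rname") or FLAG ("flag")."""
    groups = defaultdict(list)
    for line in sam_lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:  # not a complete alignment record
            continue
        flag, rname, seq = int(fields[1]), fields[2], fields[9]
        if seq == "*":  # sequence not stored in this record
            continue
        gc = gc_percent(seq)
        if gc is None:
            continue
        groups[rname if key == "rname" else flag].append(gc)
    return dict(groups)


def summarize(groups, threshold=60.0):
    """Per group: (read count, mean GC%, fraction of reads with GC% >= threshold)."""
    summary = {}
    for group, values in groups.items():
        n = len(values)
        high = sum(1 for v in values if v >= threshold)
        summary[group] = (n, sum(values) / n, high / n)
    return summary
```

One way to use it would be to feed it the lines of `samtools view sample.bam` (the file name is a placeholder) and compare the high-GC fraction between groups; a group in which that fraction is much larger than elsewhere is a candidate source for the secondary peak.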


No, I haven't found a reason for this behavior. I don't think there is contamination from bacteria or fungi in these samples, nor do I think that sample heterogeneity can cause this (this is exome data; with RNA data I could imagine heterogeneity producing a bimodal GC content). And finally, as two people have said here, it seems to be a general "pattern" for Agilent exomes.

I'll watch the video. I think it may be worth taking a look at your tool to see if it gives an answer to the bimodal GC content in exomes.

8.4 years ago

Good news everyone! To be honest, we get these strange plots with a bimodal GC distribution in every run. I just finished inspecting one human sample: I intersected the reads in my BAM file with exonic and intronic regions downloaded from UCSC, and it fits perfectly.

That's how it looks in FastQC:

fastqc gc

And this is the same GC plot colored according to genomic location. You can see there are two main peaks, one for introns and one for exons:

gc_genoic_region

So, here is one more possible explanation for bimodal GC content, but it is library-specific. In our lab we use Agilent Focused Exome. Hope this helps!
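The post does not say how the intersection and plot were produced, so the following is only a rough sketch of one way to get a per-region GC breakdown in plain Python. It assumes reads are available as `(chrom, start, sequence)` tuples with 0-based starts and that exon and intron regions are `(chrom, start, end)` half-open intervals (as in BED files). Read extent is approximated as `start + len(sequence)`, which ignores indels and clipping, and giving exons precedence over introns is an arbitrary choice. All function names are hypothetical.

```python
import bisect
from collections import Counter, defaultdict


def build_index(intervals):
    """Merge (chrom, start, end) intervals into sorted per-chromosome start/end lists."""
    by_chrom = defaultdict(list)
    for chrom, start, end in intervals:
        by_chrom[chrom].append((start, end))
    index = {}
    for chrom, ivs in by_chrom.items():
        ivs.sort()
        merged = [list(ivs[0])]
        for s, e in ivs[1:]:
            if s <= merged[-1][1]:  # overlapping or touching: extend
                merged[-1][1] = max(merged[-1][1], e)
            else:
                merged.append([s, e])
        index[chrom] = ([m[0] for m in merged], [m[1] for m in merged])
    return index


def overlaps(index, chrom, start, end):
    """True if the half-open range [start, end) overlaps any interval on chrom."""
    if chrom not in index:
        return False
    starts, ends = index[chrom]
    i = bisect.bisect_left(starts, end) - 1  # last interval starting before `end`
    return i >= 0 and ends[i] > start


def gc_hist_by_region(reads, exon_index, intron_index):
    """Return {label: Counter{integer GC%: read count}} for exon/intron/intergenic."""
    hists = defaultdict(Counter)
    for chrom, start, seq in reads:
        if not seq:
            continue
        end = start + len(seq)  # approximation: ignores CIGAR operations
        if overlaps(exon_index, chrom, start, end):
            label = "exon"
        elif overlaps(intron_index, chrom, start, end):
            label = "intron"
        else:
            label = "intergenic"
        s = seq.upper()
        gc = 100.0 * (s.count("G") + s.count("C")) / len(s)
        hists[label][int(round(gc))] += 1
    return hists
```

Each resulting counter can be plotted as its own series over 0 to 100% GC, which also covers the intergenic off-target reads as a third series.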


While I think that's some great detective work Liu, this may not be the answer for some people - for example, if I do the same analysis as you on some data which does not have a bimodal peak, I also get the same breakdown as you got for exonic/intronic GC%

In other words, yes GC% for intronic and exonic DNA is different, but you should still expect to see a normally distributed GC% plot for unbiased/untargeted sequencing when looking at all the reads together.

But it's still very interesting :)


Absolutely agree, John. In my case the library targets exons, but some reads still map to introns.


Ahh I see, OK, awesome :) Well, it's very interesting then that you only see a few more reads in exons than introns with that assay. Also, your ggplot GC% graph is so much more detailed (for the intron/exon series) than the FastQC one. I really wish FastQC would stop smoothing their graphs.
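For anyone who wants an unsmoothed view of the same statistic, a per-read GC histogram with one bin per integer percent can be computed directly from a FASTQ file. This is a minimal sketch that assumes plain 4-line FASTQ records (no wrapped sequences); it does not reproduce FastQC's own calculation, and the function names are hypothetical.

```python
from collections import Counter


def read_fastq_sequences(lines):
    """Yield the sequence line of each 4-line FASTQ record."""
    for i, line in enumerate(lines):
        if i % 4 == 1:
            yield line.strip()


def gc_histogram(sequences):
    """Return a list of 101 read counts, indexed by rounded GC% (0..100)."""
    hist = Counter()
    for seq in sequences:
        if not seq:
            continue
        s = seq.upper()
        gc = s.count("G") + s.count("C")
        hist[round(100 * gc / len(s))] += 1
    return [hist.get(p, 0) for p in range(101)]
```

Passing an open file handle to `read_fastq_sequences` works because file objects iterate line by line; the resulting list can be plotted as bars without any smoothing.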


I wonder what the plot looks like for the off-target reads that are intergenic


Would you mind sharing more details of how you plotted this? I would like to try it out on my samples. Thanks!


Would the curve have multiple peaks if the sample is rRNA depleted? rRNA depleted samples would have several types of ncRNA besides mRNA that might alter the GC content.

8.8 years ago
thackl ★ 3.0k

I don't have particular experience with either human or exome sequencing, but I have come across similar distributions in genome sequencing projects. Among others, I observed it for a highly repetitive plant. In that case, the second peak corresponded to a specific repeat class that was highly abundant in the data set.

Given your mapping results, I concur: contamination is unlikely. So I would try to figure out which locations in the genome these high-GC reads derive from and whether you can associate them with some useful annotations. Based on your mappings, you could extract regions of the genome with proper read coverage, e.g. with bedtools, and then look for entire sequences or large windows of high GC.
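The last step, scanning extracted regions for windows of high GC, could be sketched as follows. This assumes the covered regions have already been pulled out as plain sequence strings (for instance with `bedtools getfasta`); the window size, step, and 60% threshold are arbitrary illustrative values, and the function names are hypothetical.

```python
def windowed_gc(seq, window=100, step=50):
    """Yield (start, end, GC%) for each full window along the sequence."""
    seq = seq.upper()
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        gc = chunk.count("G") + chunk.count("C")
        yield start, start + window, 100.0 * gc / window


def high_gc_windows(seq, threshold=60.0, window=100, step=50):
    """Return the windows whose GC% is at or above the threshold."""
    return [
        (start, end, gc)
        for start, end, gc in windowed_gc(seq, window, step)
        if gc >= threshold
    ]
```

Windows that pass the threshold can then be compared against repeat or gene annotations to see whether they fall into a particular class.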


Hello, dear thackl. I am running a de novo RNA-seq experiment on a plant. Similarly, my FastQC GC content result is bimodal. Could you explain in more detail what you mean by "the second peak corresponded to specific repeat class"? I think it may be due to the presence of the chloroplast genome. What do you think? Best regards

6.7 years ago
raulAlc ▴ 10

Hi! I recently stumbled upon this nice little example of a bimodal distribution of GC content in a WG-Seq run of orange. We suspected possible contamination. Upon BLASTing some of the reads with high %GC, I got hits that looked like "C.limon DNA for clsat_9 satellite" (satellite DNA). Looking at the citation ( https://link.springer.com/article/10.1007/s001220100719 ), I confirmed that Citrus species are rich in satellite DNA, which has a GC content between 60% and 68%. So that explained our secondary peak. Cool!

GC content in orange
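Picking out high-GC reads for a BLAST search like the one described above can be done with a few lines of Python. This is only a sketch: it assumes plain 4-line FASTQ records, uses 60% as the cutoff because that is the lower bound quoted for the satellite DNA, and caps the output at an arbitrary 20 reads. The function names are hypothetical.

```python
def parse_fastq(lines):
    """Yield (name, sequence) from 4-line FASTQ records."""
    it = iter(lines)
    for header in it:
        seq = next(it, "").strip()
        next(it, None)  # '+' separator line
        next(it, None)  # quality line
        if header.startswith("@"):
            yield header[1:].strip().split(" ")[0], seq


def high_gc_fasta(lines, min_gc=60.0, limit=20):
    """Return FASTA text for up to `limit` reads with GC% >= min_gc."""
    records = []
    for name, seq in parse_fastq(lines):
        if not seq:
            continue
        s = seq.upper()
        gc = 100.0 * (s.count("G") + s.count("C")) / len(s)
        if gc >= min_gc:
            records.append(f">{name} gc={gc:.1f}\n{seq}")
            if len(records) >= limit:
                break
    return "\n".join(records)
```

The returned FASTA text can be pasted into a BLAST web form or written to a file for a local search.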

7.2 years ago
thackl ★ 3.0k

I don't think you can necessarily extend the observations made above directly to RNA-seq experiments. Also, I don't really know whether a bimodal GC distribution is something to be concerned about in the first place when looking at RNA-seq. You might need to talk to people more involved with RNA-seq. Sorry.
