I'm analysing some human CLL data (cancer, whole exome). When I run FastQC to check the data, all samples show a bimodal GC content distribution. Generally the only warning FastQC raises is for the GC content module; the other modules normally pass.
I ran fastq_screen against the human genome only: about 80% of reads have a single hit, 18% have multiple hits, and about 0.6% do not map to human. This makes me think there is no contamination in the samples.
After some thought, I still do not know why the samples show this kind of distribution.
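For anyone wanting to look at the raw numbers behind the plot, here is a minimal sketch of a per-read GC% histogram. It is not FastQC's implementation; the function names `gc_percent` and `gc_histogram` are made up for illustration, and it assumes plain 4-line FASTQ records.

```python
from collections import Counter
from io import StringIO


def gc_percent(seq):
    """Return the GC percentage of a sequence, ignoring non-ACGT bases."""
    seq = seq.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / called


def gc_histogram(fastq_handle):
    """Count reads per rounded GC% (assumes 4 lines per FASTQ record)."""
    hist = Counter()
    for i, line in enumerate(fastq_handle):
        if i % 4 == 1:  # the sequence is the second line of each record
            hist[round(gc_percent(line.strip()))] += 1
    return hist


# Toy input standing in for a real FASTQ file.
demo = StringIO("@r1\nGGCC\n+\nIIII\n@r2\nATAT\n+\nIIII\n@r3\nACGT\n+\nIIII\n")
hist = gc_histogram(demo)
for gc in sorted(hist):
    print(gc, hist[gc])
```

With a real file you would pass an open file handle instead of the `StringIO` object. Because the counts are not smoothed, a second peak shows up as a second run of high counts.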
Used to routinely see this with Agilent human exomes, never had a particularly good explanation for it other than there might be some inherent bias in the baits?
Just to back up Dan's answer, I've seen the same in Agilent exomes, and haven't come up with a reasonable explanation. Traditionally, with things like RNA-seq, this bimodal distribution would make me go straight to the possibility of sample contamination, but with exomes it seems systematic more than anything else.
Did you ever find a solution for your issue runnerbio?
I wrote a small tool to drill down into BAM statistics like GC% to see if your secondary peak is over-represented in certain reads (certain chromosomes, certain mapping conformations, certain read flags, certain fragment lengths, certain read tags, etc etc).
I haven't published it to GitHub yet, but if you would be interested in 'test driving' it to see if it can help you figure out your issue, I'd be more than willing to give some support as you go along :) Here's a video - skip to about min. 9:00 :) https://vimeo.com/123508180
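As a rough illustration of this kind of drill-down (this is not the tool from the video, just a sketch), the code below groups mean read GC% by any SAM field. It assumes SAM text lines, for example from `samtools view`; `mean_gc_by` is a hypothetical helper name.

```python
from collections import defaultdict


def gc_percent(seq):
    """Return the GC percentage of a sequence, ignoring non-ACGT bases."""
    seq = seq.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / called


def mean_gc_by(sam_lines, key):
    """Mean read GC% grouped by key(fields), where fields are SAM columns."""
    totals = defaultdict(lambda: [0.0, 0])  # group -> [sum of GC%, read count]
    for line in sam_lines:
        if line.startswith("@"):  # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11 or fields[9] == "*":  # no stored sequence
            continue
        acc = totals[key(fields)]
        acc[0] += gc_percent(fields[9])
        acc[1] += 1
    return {group: total / n for group, (total, n) in totals.items()}


# Toy SAM records; columns are QNAME FLAG RNAME POS MAPQ CIGAR RNEXT PNEXT TLEN SEQ QUAL.
sam = [
    "@HD\tVN:1.6",
    "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tGGCC\tIIII",
    "r2\t0\tchr1\t200\t60\t4M\t*\t0\t0\tGCAT\tIIII",
    "r3\t16\tchr2\t300\t60\t4M\t*\t0\t0\tATAT\tIIII",
]
by_chrom = mean_gc_by(sam, key=lambda f: f[2])      # group by reference name
by_flag = mean_gc_by(sam, key=lambda f: int(f[1]))  # group by SAM flag
print(by_chrom)
print(by_flag)
```

Swapping the `key` function lets you group by chromosome, flag, or template length (column 9) without changing the rest.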
(written 9.0 years ago by John; updated 5.0 years ago by Ram)
No, I haven't found a reason for this behavior. I don't think there is contamination from bacteria or fungi in these samples, nor do I think that sample heterogeneity can cause this (this is exome data; with RNA data I could imagine heterogeneity producing a bimodal GC content). And finally, as two people have said here, it seems to be a general "pattern" for Agilent exomes.
I'll watch the video. I think it may be worth taking a look at your tool to see if it gives an answer to the bimodal GC content in exomes.
Good news everyone!
To be honest, we get these strange bimodal GC distributions in every run. I just finished inspecting one human sample: I decided to intersect the reads in my BAM file with exonic and intronic regions downloaded from UCSC, and it fits perfectly.
That's how it looks in FastQC:
And this is the same GC plot colored by genomic location. You can see there are two main peaks, for introns and exons respectively:
So, here is one more possible explanation of bimodal GC content, but it is library-specific. In our lab we use Agilent Focused Exome.
Hope this helps!
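The intersection idea can be sketched in a few lines. This is only a toy version under stated assumptions: the regions are made-up 0-based, half-open intervals (BED style), the names `label_read` and `gc_hist_by_label` are hypothetical, and on real data something like bedtools would do the overlap step far more efficiently.

```python
from collections import Counter, defaultdict


def gc_percent(seq):
    """Return the GC percentage of a sequence, ignoring non-ACGT bases."""
    seq = seq.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / called


def label_read(chrom, start, end, regions):
    """Return the first label whose interval overlaps [start, end), else 'other'."""
    for label, intervals in regions.items():
        for r_chrom, r_start, r_end in intervals:
            if chrom == r_chrom and start < r_end and r_start < end:
                return label
    return "other"


def gc_hist_by_label(reads, regions):
    """Build one GC% histogram per annotation label."""
    hists = defaultdict(Counter)
    for chrom, start, seq in reads:
        label = label_read(chrom, start, start + len(seq), regions)
        hists[label][round(gc_percent(seq))] += 1
    return hists


# Made-up annotation and reads, purely for illustration.
regions = {"exon": [("chr1", 100, 200)], "intron": [("chr1", 200, 1000)]}
reads = [("chr1", 120, "GGCCGGAT"), ("chr1", 500, "ATATATGC"), ("chr2", 10, "ACGT")]
hists = gc_hist_by_label(reads, regions)
for label, hist in hists.items():
    print(label, dict(hist))
```

Labels are checked in dictionary order, so a read spanning an exon/intron boundary is counted as exonic here; that is a design choice of this sketch, not something stated in the post.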
While I think that's some great detective work Liu, this may not be the answer for some people. For example, if I do the same analysis as you on some data which does not have a bimodal peak, I also get the same breakdown as you got for exonic/intronic GC%.
In other words, yes GC% for intronic and exonic DNA is different, but you should still expect to see a normally distributed GC% plot for unbiased/untargeted sequencing when looking at all the reads together.
Ahh I see - ok awesome :) Well, it's very interesting then that you only see a few more reads in exons than introns with that assay. Also, your ggplot GC% graph is so much more detailed (for the intron/exon series) than the FastQC one. I really wish FastQC would stop smoothing its graphs.
Would the curve have multiple peaks if the sample is rRNA depleted? rRNA depleted samples would have several types of ncRNA besides mRNA that might alter the GC content.
I don't have particular experience with either human or exome sequencing, but I came across similar distributions in genome sequencing projects. Among others, I have observed it for a highly repetitive plant. In that case, the second peak corresponded to a specific repeat class that was highly abundant in the data set.
Given your mapping result, I concur, contamination is unlikely. So I would try to figure out which locations of the genome these high-GC reads derive from and whether you can associate them with some useful annotations. Based on your mappings, you could extract regions from the genome with proper read coverage, e.g. with bedtools, and then look for entire sequences or large windows of high GC.
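The "large windows of high GC" step could look something like the sketch below. The window size, step, and threshold are arbitrary assumptions, and `gc_windows` / `high_gc_windows` are hypothetical names. Note that N bases count toward the window length here.

```python
def gc_windows(seq, window, step):
    """Yield (start, end, gc_percent) for fixed-size windows along seq."""
    seq = seq.upper()
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        gc = 100.0 * (chunk.count("G") + chunk.count("C")) / window
        yield start, start + window, gc


def high_gc_windows(seq, window=100, step=50, threshold=60.0):
    """Return the windows whose GC% is at or above the threshold."""
    return [(s, e, gc) for s, e, gc in gc_windows(seq, window, step) if gc >= threshold]


# Toy 60 bp sequence with a GC-rich block in the middle.
toy = "AT" * 10 + "GC" * 10 + "AT" * 10
hits = high_gc_windows(toy, window=20, step=10, threshold=60.0)
print(hits)
```

The coordinates are 0-based and half-open, so the hits can be written out as BED lines and intersected with annotation tracks.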
Hello, dear thackl
I was running a de novo RNA-seq experiment on a plant. Similarly, my FastQC GC content result is bimodal. Could you explain in more detail what you mean by "the second peak corresponded to specific repeat class"? I think it depends on the presence of the chloroplast genome; what do you think?
best regards
Hi! I recently stumbled upon this nice little example of a bimodal distribution of GC content for a WG-Seq run of orange. We were suspecting possible contamination. Upon blasting some of the reads with high %GC, I came upon hits that looked like: "C.limon DNA for clsat_9 satellite" (satellite DNA). Looking at the citation ( https://link.springer.com/article/10.1007/s001220100719 ), I confirmed that Citrus species are rich in satellite DNA, which has a GC content between 60% and 68%. So that explained our secondary peak. Cool!
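Pulling out a handful of high-GC reads for a BLAST search can be done along these lines. This is a sketch, assuming plain 4-line FASTQ records; the 60% cutoff, the read limit, and the function name `high_gc_reads_to_fasta` are illustrative choices, not what was actually used above.

```python
from io import StringIO


def gc_percent(seq):
    """Return the GC percentage of a sequence, ignoring non-ACGT bases."""
    seq = seq.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / called


def high_gc_reads_to_fasta(fastq_handle, min_gc=60.0, limit=100):
    """Return up to `limit` reads with GC% >= min_gc as FASTA text."""
    records = []
    name = None
    for i, line in enumerate(fastq_handle):
        line = line.strip()
        if i % 4 == 0:  # header line: drop '@' and any description
            name = line[1:].split()[0]
        elif i % 4 == 1 and gc_percent(line) >= min_gc:
            records.append(f">{name}\n{line}")
            if len(records) >= limit:
                break
    return "\n".join(records) + ("\n" if records else "")


# Toy input standing in for a real FASTQ file.
demo = StringIO("@r1 lane1\nGGGCCCAT\n+\nIIIIIIII\n@r2 lane1\nATATATAT\n+\nIIIIIIII\n")
fasta = high_gc_reads_to_fasta(demo, min_gc=60.0, limit=10)
print(fasta, end="")
```

The resulting FASTA text can be saved to a file and submitted to BLAST to see what the secondary peak is made of.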
I don't think that you can necessarily extend the observations made above directly to RNA-seq experiments. Also, I don't really know if a bimodal GC distribution is something to be concerned about in the first place when looking at RNA-seq. You might need to talk to people more involved with RNA-seq. Sorry.
These data were also obtained from a whole-exome library prepared with Agilent SureSelect.