I have 50% duplicates in WGS data from tumour samples. I expected the coverage to drop from 30x to 15x once duplicates were removed, but it falls below 8x, so I am trying to work out why the coverage is lower than expected.
I can see that I have a warning/failure on the per tile sequence quality.
I suspect that this can affect the number of reads, but can it affect the coverage as well?
Have you looked at this blog post? FastQC is great at pinpointing characteristics of your dataset that you should look at more closely. The "failures" are an essential part of this process. If your per tile sequence quality indeed has an issue then post an image here. Likely those sequences may be taken care of by trimming/filtering (which you will be doing next). Any reduction in number of reads, as a result, will affect gross coverage.
How are you determining the duplication rate? Is it just from FastQC or based on the alignment (Picard MarkDuplicates, for example)? Those can be very different.
It will take a while because the analysis is exhaustive, but the provided metrics will give a thorough perspective, including percentage of optical duplicates, marked duplicates, and overlapping bases, among other things.
Thanks for the feedback.
The duplicate rate is based on the FastQC run that I did. After trimming and filtering, I don't have overrepresented sequences or adapter content. The 50% duplication rate is after trimming and mapping. I lose a few million reads due to trimming, but that doesn't explain the big loss of coverage (I expected ~15x and it drops below 8x). I am running CollectWgsMetrics at the moment, so I will see if I get anything from there. The data are Illumina.
So I had calculated the duplicate rate from FastQC; however, this is not exact, as it is an estimate based on the first 100,000 sequences. According to CollectWgsMetrics and the metrics from MarkDuplicates, the duplicate rate is much higher, which fits the mean coverage calculated with GATK DepthOfCoverage.
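As a rough check on the arithmetic in this thread, here is a minimal Python sketch. It assumes duplicates are the only source of lost coverage (it ignores trimming and mapping/base quality filters), and the function names are hypothetical, made up for illustration.

```python
def coverage_after_dedup(raw_coverage: float, duplicate_fraction: float) -> float:
    """Expected mean coverage once duplicate reads are excluded.

    Assumes duplicates are the only loss, which is a simplification.
    """
    if not 0.0 <= duplicate_fraction <= 1.0:
        raise ValueError("duplicate_fraction must be between 0 and 1")
    return raw_coverage * (1.0 - duplicate_fraction)


def implied_duplicate_fraction(raw_coverage: float, observed_coverage: float) -> float:
    """Duplicate fraction that would, on its own, explain the observed coverage."""
    if raw_coverage <= 0:
        raise ValueError("raw_coverage must be positive")
    return 1.0 - observed_coverage / raw_coverage


# Numbers quoted in the question: 30x raw, 50% duplicates, under 8x observed.
expected = coverage_after_dedup(30.0, 0.50)
implied = implied_duplicate_fraction(30.0, 8.0)
print(f"Expected coverage at 50% duplicates: {expected:.1f}x")
print(f"Duplicate fraction implied by 8x:    {implied:.1%}")
```

With the numbers from the question, 8x out of 30x would need a duplicate fraction of roughly 73% if duplicates were the only loss, so a 50% estimate cannot account for the drop by itself.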
However, the mean coverage calculated by CollectWgsMetrics is smaller than the one I calculated from the average column of DepthOfCoverage. I suppose this happens because CollectWgsMetrics calculates the mean coverage over the bases of the genome territory (non-N bases in the genome), after all filters are applied.
Is that right?
Also, I am not sure which filters are applied in CollectWgsMetrics. Does anyone know?
Please use ADD COMMENT/ADD REPLY when responding to existing posts or providing additional information. This helps keep the threads logically organized.
The latest note adds useful information, since it appears that the problem is worse than you initially suspected. If you have such a high % of duplicates (I assume they are PCR duplicates if you marked them with Picard), then perhaps something went wrong with the experiment (low input, too many PCR cycles)? Do you see any visual evidence that this sample has uneven coverage across the genome?
Did you use the default parameters for CollectWgsMetrics? If so, the default minimum mapping quality and minimum base quality are both 20, so reads and bases below those thresholds are not counted towards coverage. Also, the COUNT_UNPAIRED parameter needs to be set if you have many reads from paired-end data where only one end maps.
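To see which filters removed the most bases, the PCT_EXC_* columns in the CollectWgsMetrics output can be read directly. Below is a minimal Python sketch that parses the metrics section of a Picard-style metrics file. The sample text is invented for illustration (made-up values, only a subset of the real columns, and a placeholder header comment), and `read_picard_metrics` is a hypothetical helper, not part of Picard.

```python
# Invented example in the tab-delimited layout Picard uses for metrics files.
# A real CollectWgsMetrics file has more columns and a histogram section.
COLUMNS = ["GENOME_TERRITORY", "MEAN_COVERAGE", "PCT_EXC_MAPQ", "PCT_EXC_DUPE",
           "PCT_EXC_UNPAIRED", "PCT_EXC_BASEQ", "PCT_EXC_OVERLAP",
           "PCT_EXC_CAPPED", "PCT_EXC_TOTAL"]
VALUES = ["2900000000", "7.5", "0.05", "0.70",
          "0.001", "0.02", "0.01", "0.0", "0.781"]
SAMPLE = "\n".join([
    "## htsjdk.samtools.metrics.StringHeader",
    "# (command line header omitted in this made-up sample)",
    "",
    "## METRICS CLASS\tpicard.analysis.WgsMetrics",
    "\t".join(COLUMNS),
    "\t".join(VALUES),
    "",
])


def read_picard_metrics(text: str) -> dict:
    """Return the first metrics row as a {column: value} dict of strings."""
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if line.startswith("## METRICS CLASS"):
            header = lines[i + 1].split("\t")
            values = lines[i + 2].split("\t")
            return dict(zip(header, values))
    raise ValueError("no METRICS CLASS section found")


metrics = read_picard_metrics(SAMPLE)

# Fraction of aligned bases excluded by each filter, largest first.
exclusions = {
    name: float(value)
    for name, value in metrics.items()
    if name.startswith("PCT_EXC_") and name != "PCT_EXC_TOTAL"
}
print(f"Mean coverage after filters: {metrics['MEAN_COVERAGE']}x")
for name, fraction in sorted(exclusions.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:18s} {fraction:.1%}")
```

For a real run, read the file contents with `open(path).read()` and pass them to the same function. If PCT_EXC_DUPE dominates, duplicates explain the drop; large PCT_EXC_MAPQ or PCT_EXC_BASEQ values point to the quality thresholds instead.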
What sequencing platform? What library prep method?
How are you determining the duplication rate? Is it just from FastQC or based on the alignment (Picard MarkDuplicates, for example)? Those can be very different.
Most likely on FastQC. Seeing those red "X" on FastQC output seems to stop people in their tracks :)
Sometimes I wish it stopped them. Just distracts and upsets them in my experience.