FastQC and coverage
8.1 years ago
aleka ▴ 110

I have 50% duplicates in WGS data from tumour samples. I expected removing them to reduce the coverage from 30x to about 15x, but it drops below 8x. I am trying to figure out why the coverage is so much lower than expected.

I can see a warning/failure on the per tile sequence quality. I suspect this can affect the number of reads, but can it also affect the coverage?
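The arithmetic behind the expectation can be checked with a minimal sketch. It assumes coverage scales linearly with the fraction of reads kept (uniform read length, no other filtering); the function names are illustrative and not from any library.

```python
def expected_coverage(raw_coverage: float, duplicate_fraction: float) -> float:
    """Coverage left after discarding duplicates, assuming it scales with reads kept."""
    if not 0.0 <= duplicate_fraction <= 1.0:
        raise ValueError("duplicate_fraction must be between 0 and 1")
    return raw_coverage * (1.0 - duplicate_fraction)


def implied_loss_fraction(raw_coverage: float, observed_coverage: float) -> float:
    """Fraction of the raw coverage that must have been lost to reach the observed value."""
    if raw_coverage <= 0:
        raise ValueError("raw_coverage must be positive")
    return 1.0 - observed_coverage / raw_coverage


if __name__ == "__main__":
    # 30x with 50% duplicates should leave about 15x.
    print(expected_coverage(30, 0.5))
    # Dropping from 30x to 8x corresponds to losing roughly 73% of the raw coverage.
    print(round(implied_loss_fraction(30, 8), 3))
```

Under these assumptions, an observed 8x means about 73% of the raw coverage is gone, so something beyond a 50% duplicate rate has to account for the rest.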

next-gen sequencing • 4.6k views

Have you looked at this blog post? FastQC is great at pinpointing characteristics of your dataset that you should look at more closely, and the "failures" are an essential part of that process. If your per tile sequence quality really does show a problem, post an image here. Those sequences will likely be taken care of by trimming/filtering (which you will be doing next). Any resulting reduction in the number of reads will affect gross coverage.

What sequencing platform? What library prep method?

How are you determining the duplication rate? Is it just from FastQC or based on the alignment (Picard MarkDuplicates, for example)? Those can be very different.

Most likely from FastQC. Seeing those red "X" marks in the FastQC output seems to stop people in their tracks :)

Sometimes I wish it stopped them. Just distracts and upsets them in my experience.

8.1 years ago
Dan D 7.4k

I recommend running Picard's CollectWgsMetrics Tool.

It will take a while because the analysis is exhaustive, but the provided metrics will give a thorough perspective, including percentage of optical duplicates, marked duplicates, and overlapping bases, among other things.
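Once the tool has finished, the numbers can be pulled out of the metrics file with a few lines of Python. This is only a sketch: it assumes the usual Picard metrics layout (comment lines starting with `#`, then one tab-separated header line followed by the metrics row), and the example content below is invented purely to show the parsing, not real output.

```python
import io


def read_picard_metrics(handle):
    """Return the first metrics row of a Picard-style metrics file as a dict of strings."""
    header = None
    for raw in handle:
        line = raw.rstrip("\r\n")
        # Skip blank lines and the '#'/'##' comment lines that precede the table.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if header is None:
            header = fields  # first non-comment line holds the column names
            continue
        return dict(zip(header, fields))  # second one holds the values
    raise ValueError("no metrics row found")


# Invented example content; the values are made up for illustration.
EXAMPLE = (
    "## METRICS CLASS\texample\n"
    "GENOME_TERRITORY\tMEAN_COVERAGE\tPCT_EXC_MAPQ\tPCT_EXC_DUPE\tPCT_EXC_BASEQ\tPCT_EXC_TOTAL\n"
    "1000\t7.5\t0.05\t0.60\t0.02\t0.70\n"
)

if __name__ == "__main__":
    row = read_picard_metrics(io.StringIO(EXAMPLE))
    for name in ("MEAN_COVERAGE", "PCT_EXC_DUPE", "PCT_EXC_MAPQ", "PCT_EXC_BASEQ", "PCT_EXC_TOTAL"):
        print(name, row[name])
```

For a real run, open the metrics file and pass the file handle instead of the `io.StringIO` wrapper. The `PCT_EXC_*` columns show what fraction of aligned bases each filter excluded, which is usually the quickest way to see where the coverage went.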

Hi all,

Thanks for the feedback. The duplicate rate is based on the FastQC run. After trimming and filtering, I don't have overrepresented sequences or adapter content. The 50% duplication rate is after trimming and mapping. I lose a few million reads due to trimming, but that doesn't explain the big loss of coverage (I expected ~15x and it drops below 8x). I am running CollectWgsMetrics at the moment, so I will see whether I get anything from there. The data are Illumina.

Aleka

8.1 years ago
aleka ▴ 110

So I had taken the duplicate rate from FastQC, but that figure is not exact because it is an estimate based on the first 100,000 sequences. According to CollectWgsMetrics and the MarkDuplicates metrics, the duplicate rate is much higher, which fits the mean coverage I calculated with GATK DepthOfCoverage.

However, the mean coverage reported by CollectWgsMetrics is smaller than the one I get from the average column of DepthOfCoverage. I suppose this happens because CollectWgsMetrics calculates the mean coverage over the genome territory (the non-N bases in the genome) after all filters are applied. Is that right?

Also, I am not sure which filters CollectWgsMetrics applies. Does anyone know?

Aleka

Please use ADD COMMENT/ADD REPLY when responding to existing posts or providing additional information. This helps keep the threads logically organized.

This was an answer to my initial question, in case you didn't realise. Aleka

Your latest note adds useful information, since it appears that the problem is worse than you initially suspected. If you have such a high % of duplicates (I assume they are PCR duplicates if you marked them with Picard), then perhaps something went wrong with the experiment (low input, too many PCR cycles)? Do you see any visual evidence that this sample has uneven coverage across the genome?

Good point about the uneven coverage; you reminded me to check. I didn't see uneven coverage, though. It is more or less the same across the genome.

If the coverage is not uneven then why do you have so many duplicates? Is there an experimental explanation? Is this a HiSeq 4000 dataset?

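Did you use the default parameters for CollectWgsMetrics? If so, the default minimum mapping quality and minimum base quality are both 20, so reads and bases below those thresholds are excluded from the coverage. Also, the COUNT_UNPAIRED parameter needs to be set to true if many reads from your paired-end data map with only one end, since those are not counted by default.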
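One way to see how much of the gap these filters explain is to rerun the tool with the thresholds relaxed and compare the two mean coverages. A small sketch that only builds the command line: the file names and the `picard.jar` path are placeholders, and the helper function is hypothetical, but `MINIMUM_MAPPING_QUALITY`, `MINIMUM_BASE_QUALITY` and `COUNT_UNPAIRED` are the CollectWgsMetrics arguments mentioned above.

```python
def collect_wgs_metrics_cmd(bam, reference, output,
                            min_mapq=20, min_baseq=20, count_unpaired=False,
                            picard_jar="picard.jar"):
    """Build a CollectWgsMetrics command line as a list of arguments (does not run it)."""
    return [
        "java", "-jar", picard_jar, "CollectWgsMetrics",
        f"I={bam}",
        f"O={output}",
        f"R={reference}",
        f"MINIMUM_MAPPING_QUALITY={min_mapq}",
        f"MINIMUM_BASE_QUALITY={min_baseq}",
        f"COUNT_UNPAIRED={'true' if count_unpaired else 'false'}",
    ]


if __name__ == "__main__":
    # Placeholder file names; substitute your own BAM and reference.
    default_run = collect_wgs_metrics_cmd("sample.bam", "ref.fasta", "wgs_default.txt")
    relaxed_run = collect_wgs_metrics_cmd("sample.bam", "ref.fasta", "wgs_relaxed.txt",
                                          min_mapq=0, min_baseq=0, count_unpaired=True)
    print(" ".join(default_run))
    print(" ".join(relaxed_run))
```

If the relaxed run gets close to the DepthOfCoverage average, the difference between the two tools comes from the filters rather than from the data. Note that the DepthOfCoverage settings also matter for that comparison.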

Yes, I used the default parameters. That would explain why CollectWgsMetrics gave a smaller coverage.
