Hi all,
I have a dataset of ~200 exomes from Agilent V4 71MB. I calculated coverage with bedtools for off-target and on-target regions. Then I normalised them to library size (divide by the total number of reads in on-target or off-target regions).
Then I divided my cohort into 2: males and females. For each pair Male-Female I calculated median ratio between coverages of the corresponding regions on X chromosome.
In the end I checked if the ratio is close to expected (0.5 or 2). I was surpised to see that for on-target reads the median of ratios was finally 1.94, and for off-target - 1.72! This is quite far from the expected 2.
I also tried to do subsampling, different coverage normalizations such as GC content, variance stabilization, however, I always have numbers close to that.
Do smb faced this problem? Could you suggest some reasons of that?
Example of code for getting ratios in R:
> for (i in 1:ncol(x_coverages_males)) {
+ for (j in 1:ncol(x_coverages_females)) {
+ ratio <- c(ratio, median(x_coverages_females[,j] / x_coverages_males[,i]))
+ }
+ }
> summary(ratio)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.099 1.910 1.941 1.930 1.982 2.170
UPD: I am not SO confused about on-target reads - yeap, it is 1.94 instead of 2, but still the bias is not so big. But for off-target it can not be ignored for sure :/
UPD2 and answer: the key was in mappability. The mappability filter should be not MAPQ=1 or 2, but MAPQ=10. It gives ratios close to 2, yeah.
Did you forget about the pseudoautosomal regions? You should count X-chromosome coverage outside of those.
Of course I forgot, thank you! But according to wiki they are ~1.5 percents of the sex chromosomes, and the median was used as an estimation of ratios, I will remove them and report the result here, but I doubt that ~1.5 percents may make shift of ratios from 2 to 1.72 ...
Excluded them, the ratios improved only marginally. But you gave an excellent idea, will try to filter reads by mappability more strictly.
I m not sure to understand what you mean about target and off target. For the problem if i understand correctly is that you don't have the ratio between male an female you expected right ?
To me ( may be i m wrong ) to get this ratio you should check heterozygous and homozygous SNP , because i don't know your protocol but you have selected region (target region ) so that's mean enrichment and after amplification , so at the end quit same number of reads for X chromosome.
Yeap, that's right, I do not get the coverage ratio between males and females as expected.
For ratio between SNVs is is well known that for exome seq it is more close to ~0.48 than to 0.5, so I would say it is kinda useless to check that - it is uneven by default. "at the end quit same number of reads for X chromosome." - yeap, that's what I am confused about
On-targets means "reads from the regions of enrichment", and off-target reads are "reads that are > several insert distances from any target", so basically "reads you do not asked for mostly from intronic/intergenic regions". Typically for WES 30-40% of reads are off-target.
Well don't you think enrichment / selected region saturation (during library preparation) is the main reason ?
Yeap, but technically, if you have 1 copy or 2 copies - the processes should be similar for both of them and in the end we will have ratio close to 2 anyway.
Based on the replies here I guess I am not the only one who doesn't fully understand the issue at hand. Can you clarify what your expectations are for both, on-target and off-target, and why?
I expect the number of reads on X chromosome be doubled for females comparing to males regardless are they on-target or off-target. This is a common assumption that arise from the fact that we have 2 copies of the region in females and 1 copy for males.
i think during the capture your DNA is in saturation compare of your probes, so at he end you have the same quantity of DNA amplified. That confirmed by the ration between homozygous and heterozygous SNP you should get.
So i have wrong theory but i don't understand why there is this ratio ....
I finally understood your concerns and it makes sense. I will try to include this "saturation" as the parameter in my estimations. But such drastic change in number of reads was partially explained by the "uncertainly" mapped reads. I excluded all reads with MAPQ < 2 and I had median ratio of 1.923 for off-target reads which is much closer to the expected 2. I am trying to play with MAPQ now and will update the question accordingly to the results.
My colleague who actually knows what happens in Agilent enrichment kits told me that there are much more probes than DNA so the effect should be minor. But we'll know tomorrow when the results will be finished.
Thank's to take the time to check that bias, and give me a feed back if you have one :)
I think this bias actually exists. I was thinking that strict mappability filtering will eliminate this bias, but finally no, the ratio is still 1.93, not 2. So I guess there is no other explanation except saturation of probes...
Thank you for the reply. I discuss with so colleagues in the lab , they explained to me that during amplification if the PCR cycle are under 35 cycles you can be quantitative for X. That's make sens too.
kuckunniwid : Add the info you updated above as an answer and then
accept
(green check mark) to provide closure for this thread.Thanks! I will close when the issue will be finally solved (hopefully tomorrow), I mistakenly reported it as "answered" while there is still some uncertainty here.