Hello guys, I am working with rates of variation (expressed in SNP/Mbp) and I'm observing something that doesn't make much sense biologically.
I have two variant call sets: one contains calls on the whole genome, the other contains a subset of it, restricted to the coding regions only (CDS lines from the GFF annotation, so to speak).
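In essence, the CDS-only set is just the calls restricted to CDS intervals taken from the GFF, something along these lines (a minimal Python sketch for illustration; the file names and the naive interval scan are not my actual pipeline):

```python
def load_cds_intervals(gff_path):
    """Collect (start, end) CDS intervals per chromosome from a GFF3 file
    (GFF coordinates are 1-based and inclusive)."""
    cds = {}
    with open(gff_path) as gff:
        for line in gff:
            if line.startswith("#"):
                continue
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 5 or fields[2] != "CDS":
                continue
            chrom, start, end = fields[0], int(fields[3]), int(fields[4])
            cds.setdefault(chrom, []).append((start, end))
    return cds

def in_cds(cds, chrom, pos):
    """Linear scan; fine for a sketch, an interval tree would be used for real data."""
    return any(start <= pos <= end for start, end in cds.get(chrom, []))

def subset_vcf_to_cds(vcf_path, cds, out_path):
    """Write only the VCF records whose POS falls inside a CDS interval."""
    with open(vcf_path) as vcf, open(out_path, "w") as out:
        for line in vcf:
            if line.startswith("#"):
                out.write(line)  # keep the header
                continue
            chrom, pos = line.split("\t", 2)[:2]
            if in_cds(cds, chrom, int(pos)):
                out.write(line)
```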
I calculate the density by taking the number of SNPs identified and dividing it by the number of positions covered by the reads used to call those SNPs: e.g. if I identify 100 SNPs within coding regions and I have coverage on 1 Mbp of coding-region positions, the density is 100 SNPs / 1 Mbp.
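In code, the calculation is simply this (a sketch with the hypothetical numbers from the example above):

```python
def snp_density_per_mbp(n_snps, covered_bp):
    """SNP/Mbp, given a SNP count and the number of reference positions
    with read coverage in the same region set."""
    if covered_bp == 0:
        raise ValueError("no covered positions, density undefined")
    return n_snps / (covered_bp / 1_000_000)

# Example: 100 SNPs over 1 Mbp of covered CDS positions
print(snp_density_per_mbp(100, 1_000_000))  # 100.0 SNP/Mbp
```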
Biological expectation: a higher density of variants across the whole genome than in the CDS only. Result: the opposite (although the difference is not large).
I thought this might be because I work with a non-model organism whose genome assembly is not of high quality. So I repeated the pipeline with A. thaliana, and there the whole genome showed a higher density of variants than the CDS (as expected).
Do you know what could have generated this weird result? Would genome assembly quality be your best guess for the culprit? I am running out of options to test ^^
Actually, my best guess would be a problem in the annotation (GFF). If you sample CDS and non-CDS regions and see no big difference, maybe the annotation doesn't separate the two correctly.
Actually, it is not exactly as you said:
I don't have a non-CDS group and a CDS group; I have a whole-genome set (which includes the CDS) and a CDS-only set.
Also, I am quite confident in the annotation because it was done by me, so I know what happened at each stage, and I'm quite sure this is not the case. Or so I hope!