I'm working on some SNP analyses using GATK HaplotypeCaller. In my initial test, with just 35 samples, I was getting over 350,000 SNPs. However, when I added more samples, that number dropped. With my entire set of over 200 samples, it's barely over 1,000, though if I cut about half of those out it's closer to 10,000. The samples are a conglomeration of 4 different data sets, but my original test set already included the two with the most different collection practices. I can't find anything in the documentation that seems to explain why this would happen, and I've also tried GVCF mode and got similar results. I'm imagining it's some outcome of the method of variant calling, but I want to make sure. Could anyone explain what could be causing this?
You should start a thread on the GATK Forums and link to it here. I recall exploring this topic a few years ago, when we discovered that the batch size affected the number of variants called, which made us doubt the n+1 logic that GVCF files enable.
Remove all the filters (if you have any) from your pipeline, and if the results still differ, start a thread on the GATK Forums.
Sounds like a good idea. This is before filtration, but I have a couple more things I'm trying this evening, and if they don't pan out I'll post on the GATK Forums tomorrow and report back. In the meantime, do we still have an archive of the other discussion? I'd love to see what logic people came up with.
I don't recall initiating a conversation on GATK Forums, unfortunately. We were collaborating with Broad and it was easier to speak to my colleague who worked for both Broad and my team. There were a few emails exchanged but I don't have a record of them now. I'm sorry!
As an update, I did post this on the GATK forum: https://gatkforums.broadinstitute.org/gatk/discussion/24163/larger-sample-sizes-are-reducing-snps-dramatically
However, while looking through some of the VCF files, I noticed that the caller is only registering chromosome 1, so my new challenge is tracking that down.
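For anyone wanting to check the same thing on their own output, here is a minimal sketch that tallies records per contig in a VCF. It relies only on the VCF convention that header lines start with `#` and that the first tab-separated column of each record is CHROM. The function name `count_records_per_contig` and the file name in the usage note are hypothetical, and this is not part of GATK.

```python
import gzip
from collections import Counter


def count_records_per_contig(vcf_path):
    """Count VCF data records per CHROM value (hypothetical helper, not a GATK tool).

    Accepts plain-text VCFs and gzip/bgzip-compressed ones (chosen by the .gz suffix).
    """
    opener = gzip.open if vcf_path.endswith(".gz") else open
    counts = Counter()
    with opener(vcf_path, "rt") as handle:
        for line in handle:
            # Skip meta-information lines (##...), the column header (#CHROM...),
            # and any blank lines.
            if line.startswith("#") or not line.strip():
                continue
            # CHROM is the first tab-separated field of every record.
            chrom = line.split("\t", 1)[0]
            counts[chrom] += 1
    return counts
```

Calling `count_records_per_contig("cohort.vcf.gz")` returns a `Counter` keyed by contig name. If only one key comes back (for example just chromosome 1), the missing contigs were lost somewhere upstream of the VCF, which would be worth ruling out before comparing SNP totals across batch sizes.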