I have been running GATK's UnifiedGenotyper with the argument Emit_All_Sites, e.g.
Unified Genotyper -out_mode Emit_All_Sites -R my_ref.fasta -mbq 10 -stand_call_conf 31 -stand_emit_conf 31 -ploidy 2 -I genome1.bam -o genome_sites.vcf
It is my understanding that the number of sites per chromosome arm in in the vcf should precisely match the number of sites for the same chromosome arm in the reference genome. Instead, I find that the number of sites in the vcf is consistently smaller, often by hundreds of kb.
I'm trying to understand the possible reasons for this - it shouldn't be due to ambiguous 'N' sites in the reference genome, as these should still be returned as placeholder lines in the vcf. Similarly, more or less restrictive settings for stand_emit_conf should also give blank placeholders rather than skipping lines when evidence at a site is poor.
Is there anything that I'm missing?
I can't speak to UnifiedGenotyper, as it has been many years since GATK stopped supporting it, and I have not used it in a long time. But it in the documentation for
HaplotypeCaller
, this is what is said about theEmit_All_Sites
flag:Not all sites are callable (e.g, sites where no reads mapped), so I would not expect the VCF to have as many entries as there are positions in the genome, even with this setting.
If that were the case, I would expect there not to be any uncalled sites in the vcf, yet there are a large number of sites with "." placeholders for uncalled sites, included at the chromosome ends where the reference is N. So I'm trying to figure out why some uncallable sites are returned as placeholder lines while others are excluded.
Is there some argument that forces GATK to return a blank line for all uncalled sites?
I believe UnifiedGenotyper is a locus walker that sets
emitEmptySites() = False
. As such you will not see entries for positions that have no reads. You might be able to get around this by passing -L or -V which covers all sites in the reference.Could you please elaborate on what you mean by passing -L or -V, i.e. what would the appropriate argument be for -L with UnifiedGenotyper for the desired result of returning all sites?