I have called variants from haploid individuals in a population using a reference genome from a related species. Specifically, I used the GATK
pipeline with CombineGVCFs
and GenotypeGVCFs
to produce the final vcf including all individuals in the population.
I would like to count the number of SNPs and Indels that vary (are polymorphic) among the individuals in the population. However, I want to exclude sites where all of the individuals have a single alternative allele (or a mix of either one alternative allele or no data); I have many of those sites in my vcf because the reference is a different species, but they are monomorphic with respect to the population.
I was hoping the "number of multiallelic sites" in the output of bcftools stats
might do it, but I think it refers to >1 alternate allele and that does not include all the sites I want to count. If a site had two individuals with the reference allele (GT = 0) and two individuals with an alternate allele (GT = 1) I would want that to count as a polymorphic site as well.
Could anyone suggest a tool for achieving this statistic (polymorphic sites among individuals in the vcf independent of the reference)?
Thank you very much.
I'm not sure if this will work, but you could use
bcftools view
with include expressions to includeN_PASS(GT!="ref" & GT!="mis") > 1
and combine it withAN > 1
to get both multi-allelic sites and sites with at least 2 non-ref non-missing samples. You might have to do some trial and error but I hope this is a good starting point.I will look into this. Thank you!
please give us a an example of valid and invalid input.
example vcf:
monomorphic sites (position on chr1): 100, 200, 300
polymorphic sites (position on chr1): 400, 500, 600
Any tool that could count the number of polymorphic sites among the individuals in the population? If I use
bcftools stats
I believe it would count positions 200 and 300 as polymorphic because they differ relative the reference, but in fact they all share the same allele.I'm wondering if this could work:
bcftools query -f '[%GT]\n' my.vcf.gz | awk '{ gsub(/\./, ""); print }' | grep -Ev '^(.)\1+$' | wc -l
The
bcftools query
part should show the genotype calls (possible characters are "." for no data, 0 for ref, 1 + for alternate alleles) with all the individual genotype calls represented as a string (one line per site); theawk
code should remove the individuals without a genotype (i.e., remove "." from each string), and thegrep
should remove lines composed of only a single repeated character (i.e., monomorphic sites) and then finish by counting the number of remaining lines (i.e., polymorphic sites).