count number of polymorphic sites among samples in vcf
1
0
Entering edit mode
25 days ago
vaellis • 0

I have called variants from haploid individuals in a population using a reference genome from a related species. Specifically, I used the GATK pipeline with CombineGVCFs and GenotypeGVCFs to produce the final vcf including all individuals in the population.

I would like to count the number of SNPs and Indels that vary (are polymorphic) among the individuals in the population. However, I want to exclude sites where all of the individuals have a single alternative allele (or a mix of either one alternative allele or no data); I have many of those sites in my vcf because the reference is a different species, but they are monomorphic with respect to the population.

I was hoping the "number of multiallelic sites" in the output of bcftools stats might do it, but I think it refers to >1 alternate allele and that does not include all the sites I want to count. If a site had two individuals with the reference allele (GT = 0) and two individuals with an alternate allele (GT = 1) I would want that to count as a polymorphic site as well.

Could anyone suggest a tool for achieving this statistic (polymorphic sites among individuals in the vcf independent of the reference)?

Thank you very much.

bcftools vcf • 770 views
ADD COMMENT
1
Entering edit mode

I'm not sure if this will work, but you could use bcftools view with include expressions to include N_PASS(GT!="ref" & GT!="mis") > 1 and combine it with AN > 1 to get both multi-allelic sites and sites with at least 2 non-ref non-missing samples. You might have to do some trial and error but I hope this is a good starting point.

ADD REPLY
0
Entering edit mode

I will look into this. Thank you!

ADD REPLY
0
Entering edit mode

However, I want to exclude sites where all of the individuals have a single alternative allele (or a mix of either one alternative allele or no data);

please give us a an example of valid and invalid input.

ADD REPLY
0
Entering edit mode

example vcf:

CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  ind1 ind2 ind3 ind4
chr1    100     .       C       A       .       .       .       GT      0    0    0    .
chr1    200     .       C       A       .       .       .       GT      1    1    1    .
chr1    300     .       C       A       .       .       .       GT      1    1    1    1
chr1    400     .       C       A,T     .       .       .       GT      2    1    1    .
chr1    500     .       C       A       .       .       .       GT      0    0    1    1
chr1    600     .       C       A       .       .       .       GT      0    0    1    .

monomorphic sites (position on chr1): 100, 200, 300

polymorphic sites (position on chr1): 400, 500, 600

Any tool that could count the number of polymorphic sites among the individuals in the population? If I use bcftools stats I believe it would count positions 200 and 300 as polymorphic because they differ relative the reference, but in fact they all share the same allele.

ADD REPLY
0
Entering edit mode

I'm wondering if this could work:

bcftools query -f '[%GT]\n' my.vcf.gz | awk '{ gsub(/\./, ""); print }' | grep -Ev '^(.)\1+$' | wc -l

The bcftools query part should show the genotype calls (possible characters are "." for no data, 0 for ref, 1 + for alternate alleles) with all the individual genotype calls represented as a string (one line per site); the awk code should remove the individuals without a genotype (i.e., remove "." from each string), and the grep should remove lines composed of only a single repeated character (i.e., monomorphic sites) and then finish by counting the number of remaining lines (i.e., polymorphic sites).

ADD REPLY
0
Entering edit mode
21 days ago
vaellis • 0

Here's a solution:

## print the CHROM, POS, and GT and a fourth field with number of unique alleles in GT
## filter to rows with the fourth field greater than 1
## print just the first two fields and save as a target file for filtering the vcf
bcftools query -f '%CHROM\t%POS\t[%GT]\n' my.vcf.gz | awk 'BEGIN {OFS = "\t"} {gsub(/\./, "", $3); print}' | awk '{split(x,C); n=split($3,F,""); for(i in F) n-=(C[F[i]]++>0); $4=n}1' | awk '{ if ($4 > 1) print $1 "\t" $2 }' > target_filter.tsv

## filter vcf
bcftools index my.vcf.gz # index first
bcftools view -T target_filter.tsv -Oz -o my.n.vcf.gz my.vcf.gz

This should filter the vcf to only sites that are polymorphic among the individuals in the population. There should be an easier way to do this, but this does seem to work. I used the code from here to count the unique alleles at a site (elements of an awk array after splitting).

ADD COMMENT

Login before adding your answer.

Traffic: 2047 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6