I had whole genome sequence data in .vcf format from several different individuals. I extracted a SNP set from each individual, removed any SNPs with more than 2 alleles, and then merged them all together in bcftools.
Everything seems OK, except that several sites in the merged dataset have more than one alternate allele. For example:
1 776546 . A G,T,C 287 . GG=257,297,0,297,730,285,730,221,285,730;DP=135 GT:PL 0/1:257,0,221 0/1:86,0,133 0/0:0,12,165 0/1:255,0,325 0/1:337,0,77 0/0:0,3,46 0/0:0,129,1000 0/0:0,42,291
However, if you look at the genotypes, they are all either 0/1 or 0/0, i.e. only 2 alleles are actually present in the callset. The extra alternate alleles are breaking a new merge that I want to do, because bcftools complains that there are 4 alleles but only 3 PL entries.
Does anyone know of a way to trim the values in the ALT column of the VCF, so that it has the 'correct' number of alleles, given the genotypes actually present?
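For reference, the mismatch can be checked by hand. PL carries one value per possible genotype, so a diploid site with n alleles (REF plus ALTs) should have n(n+1)/2 entries. A minimal Python sketch of that arithmetic (the function name `expected_pl_count` is made up for illustration):

```python
from math import comb


def expected_pl_count(n_alt: int, ploidy: int = 2) -> int:
    """Number of PL values expected at a site with n_alt ALT alleles.

    Counts the unordered genotypes that can be formed from
    (n_alt + 1) alleles at the given ploidy.
    """
    n_alleles = n_alt + 1  # REF plus the ALT alleles
    return comb(n_alleles + ploidy - 1, ploidy)


print(expected_pl_count(1))  # 3  (one ALT: 0/0, 0/1, 1/1)
print(expected_pl_count(3))  # 10 (three ALTs, as listed in the record above)
```

So the 3 PL values per sample match a single ALT allele, while the ALT column G,T,C implies 10.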
EDIT: I've just found the command bcftools view --trim-alt-alleles. I think it has done what I hoped, but the documentation isn't very descriptive. Could someone confirm what it does? Thanks.
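To make the intended result concrete, here is a rough Python sketch of what "trimming unused ALT alleles" would mean for a single record. It is only an illustration under simplifying assumptions, not bcftools' implementation: the function name `trim_unused_alts` is hypothetical, it only looks at GT, and it does not resize INFO or FORMAT fields whose length depends on the allele count (which a real tool would have to handle).

```python
import re


def trim_unused_alts(record: str) -> str:
    """Drop ALT alleles that no sample's GT refers to, and renumber GTs.

    Assumes one VCF data line with a FORMAT column containing GT.
    Other allele-count-dependent fields are left untouched (simplification).
    """
    cols = record.split()
    alts = cols[4].split(",")
    gt_idx = cols[8].split(":").index("GT")
    samples = [s.split(":") for s in cols[9:]]

    # Collect every allele index that appears in any genotype.
    used = set()
    for s in samples:
        for allele in re.split(r"[/|]", s[gt_idx]):
            if allele != ".":
                used.add(int(allele))

    # Keep only ALT alleles that are used; map old indices to new ones.
    kept = [i for i in range(1, len(alts) + 1) if i in used]
    remap = {0: 0}
    for new, old in enumerate(kept, start=1):
        remap[old] = new

    for s in samples:
        s[gt_idx] = re.sub(r"\d+", lambda m: str(remap[int(m.group())]), s[gt_idx])

    cols[4] = ",".join(alts[i - 1] for i in kept) or "."
    cols[9:] = [":".join(s) for s in samples]
    return "\t".join(cols)


# The record above, cut down to its first two samples for brevity.
rec = ("1 776546 . A G,T,C 287 . "
       "GG=257,297,0,297,730,285,730,221,285,730;DP=135 "
       "GT:PL 0/1:257,0,221 0/1:86,0,133")
print(trim_unused_alts(rec).split("\t")[4])  # G
```

With only 0/0 and 0/1 genotypes, T and C are never referenced, so ALT collapses to G and the 3 PL values per sample become consistent with the allele count.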
I will check later