24 months ago
I am trying to filter out the spanning or overlapping deletions in a GVCF file, which are denoted by an asterisk (*) in the VCF format. https://gatk.broadinstitute.org/hc/en-us/articles/360035531912-Spanning-or-overlapping-deletions-allele-
I have tried different bcftools commands for this:
bcftools view -f '%ALT != *' -O z -o GVCF_SNPs_output.vcf.gz GVCF_input.vcf.gz
bcftools filter -i 'FORMAT/ALT="*"' -O z -o GVCF_SNPs_output.vcf.gz GVCF_input.vcf.gz
But they do not seem to work for loci with several alternative alleles (for example "A, *, C" in the ALT field).
Has anyone successfully filtered the deletions (*) out of a GVCF file?
If you're using GATK, following the joint genotyping with GenotypeGVCFs, you could use SelectVariants. For example:
I only say after GenotypeGVCFs as I don't know the implications of removing variants from a GVCF file. Alternatively, you could also use the
parameter if you want more than just SNPs, though I can't see what type of variant *
is in the docs.
Thank you for the answer dthorbur. It is interesting, as I have already used this command to make the GVCF file, which means that apparently overlapping deletions are not removed by this command.
Damn. I remember I had this problem too a while ago, as MSMC wouldn't accept * annotations, but I believe I just removed all sites where they were present.
I also found this previous forum post, which may offer a solution:
Where multiallelic annotations appear to be given their own line. Whilst this would then permit removal of * entries, it may result in multiple lines for otherwise multiallelic sites you want to keep.
Why do you need to remove the * annotations anyway?
Thank you for the script dthorbur, I will give it a go.
I want to use the population genetic software angsd to analyse my dataset. http://popgen.dk/angsd/index.php/ANGSD
Unfortunately, it seems that the * sites are for the moment not recognised by angsd. https://github.com/ANGSD/angsd/issues/557#issuecomment-1435521926
If anyone has a simple solution for this, I would be interested.
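For what it's worth, here is a minimal Python sketch of the "remove all sites where * is present" approach mentioned earlier in the thread. It is not a bcftools or GATK feature. The function names are made up for illustration, and the file names in the usage note are simply the ones from the commands above. It assumes a tab-delimited VCF/GVCF with ALT in column 5, and it drops the whole record whenever any ALT allele is `*`, so the other alleles at that site (the A and C in "A, *, C") are lost as well.

```python
import gzip


def has_spanning_deletion(alt_field):
    """Return True if any comma-separated ALT allele is the '*' symbol."""
    return "*" in [allele.strip() for allele in alt_field.split(",")]


def drop_spanning_deletion_sites(lines):
    """Yield header lines unchanged, plus records with no '*' ALT allele."""
    for line in lines:
        if line.startswith("#"):
            # Meta-information and column header lines are always kept.
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        # Column 5 (index 4) is ALT in the VCF specification.
        if len(fields) > 4 and has_spanning_deletion(fields[4]):
            continue  # drop the entire site, including its other alleles
        yield line


def filter_vcf_gz(in_path, out_path):
    """Hypothetical helper: stream a gzipped VCF through the filter."""
    with gzip.open(in_path, "rt") as src, gzip.open(out_path, "wt") as dst:
        dst.writelines(drop_spanning_deletion_sites(src))
```

Usage would be something like `filter_vcf_gz("GVCF_input.vcf.gz", "GVCF_SNPs_output.vcf.gz")`. Note that Python's gzip module writes ordinary gzip rather than BGZF, so the output would need to be recompressed with bgzip before it can be indexed with tabix. Whether dropping these sites is appropriate for a downstream tool is a separate question that this sketch does not answer.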
I want to do the same, but I have these "SNPs" reported with a * being displayed as phased on the allele without an accompanying upstream indel, so I am hesitant to just remove them.
See here: Removing / Excluding / Collapsing Overlapping Indels