21 months ago
Begonia_pavonina
I am trying to filter out the spanning or overlapping deletions in a GVCF file, which are denoted by an asterisk (*) in the VCF format. https://gatk.broadinstitute.org/hc/en-us/articles/360035531912-Spanning-or-overlapping-deletions-allele-
I have tried different bcftools commands for this:
bcftools view -f '%ALT != *' -O z -o GVCF_SNPs_output.vcf.gz GVCF_input.vcf.gz
bcftools filter -i 'FORMAT/ALT="*"' -O z -o GVCF_SNPs_output.vcf.gz GVCF_input.vcf.gz
But neither seems to work for loci with several alternative alleles (for example "A,*,C" in the ALT field).
Has anyone successfully filtered the deletions (*) out of a GVCF file?
If you're using GATK, following the joint genotyping with GenotypeGVCFs, you could use SelectVariants. For example:
I only say after GenotypeGVCFs because I don't know the implications of removing variants from a GVCF file. Alternatively, you could also use the --select-type-to-exclude parameter if you want more than just SNPs, though I can't see in the docs what type of variant * is.
Thank you for the answer dthorbur. It is interesting, as I have already used this command to make the GVCF file, which means that overlapping deletions are apparently not removed by this command.
Damn. I remember I had this problem too a while ago, as MSMC wouldn't accept * annotations, but I believe I just removed all sites where they were present.
I also found this previous forum post, which may offer a solution:
There, multiallelic annotations appear to be given their own line. Whilst this would then permit removal of * entries, it may result in multiple lines for otherwise multiallelic sites you want to keep.
Why do you need to remove the * annotations anyway?
Thank you for the script dthorbur, I will give it a go.
I want to use the population genetics software ANGSD to analyse my dataset. http://popgen.dk/angsd/index.php/ANGSD
Unfortunately, it seems that the * sites are not recognised by ANGSD for the moment. https://github.com/ANGSD/angsd/issues/557#issuecomment-1435521926
If anyone has a simple solution for this, I would be interested.
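One possible simple route, along the lines of "just remove all sites where they are present" mentioned above, is a plain awk filter instead of bcftools or GATK. This is only a minimal sketch: the function name drop_star_sites and the output file name are made up for illustration, and it assumes uncompressed, tab-separated VCF text on stdin. It splits ALT on commas, so a site such as "A,*,C" is caught as well.

```shell
# Drop every record whose ALT field lists the spanning-deletion allele (*).
# Hypothetical helper: reads VCF text on stdin, writes filtered VCF to stdout.
drop_star_sites() {
    awk -F'\t' '
        /^#/ { print; next }              # keep header lines unchanged
        {
            n = split($5, alt, ",")       # ALT alleles are comma-separated
            for (i = 1; i <= n; i++)
                if (alt[i] == "*") next   # skip the whole record
            print
        }
    '
}

# Example usage (output file name is an assumption):
# gzip -dc GVCF_input.vcf.gz | drop_star_sites | bgzip > GVCF_no_star.vcf.gz
```

Note that this drops the whole record, including the other alleles at that site (the A and C in "A,*,C"), which may or may not be acceptable for your analysis.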
I want to do the same, but I have these "SNPs" reported with a * being displayed as phased on the allele without an accompanying upstream indel, so I am hesitant to just remove them.
See here: Removing / Excluding / Collapsing Overlapping Indels
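Before removing anything, it may help to list which * records really lack an upstream deletion. The following is a rough sketch under several assumptions: the function name find_orphan_star_sites is hypothetical, the input is uncompressed, position-sorted VCF text on stdin, and a "deletion" is taken to be any non-symbolic ALT allele shorter than REF in the same file. It prints CHROM, POS and ALT for each * record that no earlier deletion on the same chromosome reaches.

```shell
# List * records that are not covered by a deletion seen earlier on the
# same chromosome. Hypothetical helper; input must be sorted by position.
find_orphan_star_sites() {
    awk -F'\t' -v OFS='\t' '
        /^#/ { next }                                    # ignore header lines
        {
            if ($1 != chrom) { chrom = $1; del_end = 0 } # reset per chromosome
            new_end = del_end
            has_star = 0
            n = split($5, alt, ",")
            for (i = 1; i <= n; i++) {
                if (alt[i] == "*") { has_star = 1; continue }
                if (substr(alt[i], 1, 1) == "<") continue   # symbolic, e.g. <NON_REF>
                if (length(alt[i]) < length($4)) {          # deletion allele
                    span_end = $2 + length($4) - 1          # last deleted base
                    if (span_end > new_end) new_end = span_end
                }
            }
            # compare against deletions from earlier records only
            if (has_star && $2 > del_end) print $1, $2, $5
            del_end = new_end
        }
    '
}

# Example usage:
# gzip -dc GVCF_input.vcf.gz | find_orphan_star_sites
```

If this prints nothing, every * in the file sits inside a deletion recorded earlier in the same file; any sites it does print are the ones worth inspecting by hand before deciding to drop them.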