I am attempting to split multiallelic sites using bcftools norm
with the following command:
zcat "${inputVcf}" | \
sed 's/AD,Number=./AD,Number=R/g' | \
sed 's/ADR,Number=./ADR,Number=R/g' | \
sed 's/ADF,Number=./ADF,Number=R/g' | \
bcftools norm \
--fasta-ref "${genomeFa}" \
--check-ref s \
--multiallelics -any \
--output "${outputVcf}"
The sed commands were based on the recommendation from here. However, I'm still getting FORMAT entries such as the following:
GT:GQ:GQX:DPI:AD:ADF:ADR:FT:PL 1/0:44:44:56:1,10,5:1,4,2:0,6,3:PASS:511,99,48 ./.:.:.:.:.:.:.:.:. 0/1:53:53:63:0,12,6:0,4,1:0,8,5:PASS:483,210,164
which are clearly multiallelic. Does anybody know how to fix this?
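For illustration, the three sed substitutions above can be folded into one expression that only touches the FORMAT header lines for AD, ADF and ADR. This is a minimal sketch, not the original poster's command: the function name `fix_ad_headers` is hypothetical, and it assumes the header lines begin with `##FORMAT=<ID=` as in a standard VCF header.

```shell
# Rewrite Number=. to Number=R for the FORMAT tags AD, ADF and ADR only.
# Anchoring on "##FORMAT=<ID=" leaves other tags (for example XAD) untouched.
# fix_ad_headers is a hypothetical helper name used for this sketch.
fix_ad_headers() {
  sed -E 's/^(##FORMAT=<ID=(AD|ADF|ADR),Number=)\./\1R/'
}

# Example on a single header line (no VCF file needed):
printf '%s\n' '##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths">' | fix_ad_headers
```

In a pipeline it would take the place of the three sed calls, for example `zcat in.vcf.gz | fix_ad_headers | bcftools norm ...`, where `in.vcf.gz` is a placeholder file name.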
I think that clarifies things and pointed me in the right direction. What happened was that the VCF file had been normalized in a previous step, so the ALT column was split, but fields like AD remained as they were because those fields were ignored and their data types were still wrong. Fixing the upstream bcftools norm step worked for me, and now both my ALT and AD fields are split as I expect.
How can I achieve what you described above for a VCF file?
If you want to additionally left-align indels, then supply a FASTA reference:
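The command that this sentence introduces did not survive in the thread. A minimal sketch of what such a call might look like, using only documented `bcftools norm` options, is given here; the function name `split_and_left_align` and the file names in the usage line are placeholders, not taken from the original answer.

```shell
# Split multiallelic sites into biallelic records and left-align indels.
# Arguments: $1 = input VCF, $2 = reference FASTA, $3 = output VCF (bgzipped).
# split_and_left_align is a hypothetical wrapper name used for this sketch.
split_and_left_align() {
  bcftools norm \
    --multiallelics -any \
    --fasta-ref "$2" \
    --output-type z \
    --output "$3" \
    "$1"
}

# Usage (placeholder file names):
#   split_and_left_align in.vcf.gz ref.fa out.vcf.gz
```

Without `--fasta-ref`, the same call only splits the multiallelic records; the reference is what lets bcftools left-align and normalize the indels.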
Take a look at my Step 4, here: Produce PCA bi-plot for 1000 Genomes Phase III - Version 2
I have a VCF file that contains only SNP variants (bi- and multiallelic) after GATK VQSR. Now I want to split the multiallelic variants into biallelic ones. The command I used is:
bcftools norm -m -snps snp.2.vcf.gz -Ov -o output
It then throws an error:
Error: wrong number of fields in INFO/MLEAC at 2:10443, expected 2, found 1
How can I solve it?
First perform bcftools norm -m-any, then VQSR.
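To see which records trigger the error quoted in the question, it can help to compare the number of MLEAC values with the number of ALT alleles at each site. What follows is a rough diagnostic sketch, not part of the original answer: the function name `check_mleac` is hypothetical, it reads an uncompressed VCF on standard input, and it assumes MLEAC is declared with `Number=A` in the header (one value per ALT allele), which the error message suggests.

```shell
# Report records whose INFO/MLEAC value count differs from the ALT allele count.
# Reads an uncompressed VCF on stdin; assumes MLEAC is declared Number=A.
# check_mleac is a hypothetical helper name used for this sketch.
check_mleac() {
  awk -F'\t' '
    /^#/ { next }                        # skip header lines
    {
      n_alt = split($5, alt, ",")        # number of ALT alleles
      n_info = split($8, info, ";")      # individual INFO entries
      for (i = 1; i <= n_info; i++) {
        if (info[i] ~ /^MLEAC=/) {
          value = substr(info[i], 7)     # text after "MLEAC="
          n_val = split(value, vals, ",")
          if (n_val != n_alt)
            print $1 ":" $2, "expected " n_alt ", found " n_val
        }
      }
    }'
}

# Usage (file name taken from the question):
#   zcat snp.2.vcf.gz | check_mleac
```

Sites listed by this check are the ones where the annotation no longer matches the ALT column, which fits the advice to split multiallelic sites before running VQSR.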
An off-topic question: is there a mention of bcftools norm in any publication?
You will not be able to find bcftools norm in any publication, but you will be able to find bcftools in publications.