I successfully obtained the output VCF using the following command:
vg call ${gbz_file} \
-k ${pack_file} \
-r ${snarls_file} \
-s ${sample_name} \
-t $SLURM_CPUS_PER_TASK \
-z -a > ${output_vcf_file}
I need some interpretations for the VCF data.
It appears that the number of snarls is fewer than the number of SVs used to construct the graph pangenome. The number of snarls is approximately 1.28 million, which is inferred from the record count in ${output_vcf_file}. However, the number of SVs used to construct the graph is about 1.43 million. Can you explain why this situation occurred?
I would like to count the number of different types of structural variations (INS, DEL, INV in this case). However, it seems that the output vcf do not provide conclusions for each variant type. Is there a good way to address this?
In the future, I will need to merge VCF files from multiple samples into a single multi-sample VCF. Can you recommend good workflows or tools for this?
Thank you for your assistance.
Maxine
The content is too long, I have to cut it in two.
All the variants in my input VCF are biallelic. I did not perform any augmentation before calling, so I would expect that the output VCF should not contain any new POS or genotypes. However, in reality, the output VCF contains many multi-allelic variants. For instance, the variant mentioned in the previous example appears as a multi-allelic variant in sample C, whereas the input VCF indicates that it is biallelic.
Why is this happening? are these multiallelic reliable?
I would like to hear your opinion on whether to enable nested calling mode. Based on my understanding, using short-read data to call nested variants may increase the false positive rate (inaccurate calling). If my goal is to have the most accurate calling possible and I don't mind sacrificing the quantity of calls (some inside variants not being called), then I should not enable it.
Thanks for your attention.
Maxine
The reason for the multiallelic records is the same as my answer for 1) above: biallelic variants can overlap each other in the genome. When this happens
vg call
will treat it as one variant locus with with multiple alleles. I'm not sure why the position would shift though. Maybe glenn.hickey knows better.