Hello,
I have a question regarding the methodology for comparing the number of SNPs called using giraffe on a pangenome graph and BWA-MEM2 on a linear reference.
I have seen two different methods in publications.
One converts the alignments from .gam to .bam using vg surject, then proceeds with a regular variant calling pipeline against the linear reference that was used as the backbone to construct the pangenome graph. I saw this used in several papers, like here or here.
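For reference, the surject route can be written down as a short list of commands. This is only a minimal sketch under assumptions: the helper name surject_commands and every file name are hypothetical, it assumes the giraffe indexes (GBZ, minimizer, distance) already exist, and the graph passed to vg surject is a separate argument because which formats -x accepts may depend on the vg version. Nothing is executed; the commands are only printed.

```python
import shlex


def surject_commands(gbz, min_index, dist_index, surject_graph, fq1, fq2, prefix):
    """Return (argv, stdout_file) pairs: map with giraffe, then surject to BAM.

    All file names are placeholders supplied by the caller (hypothetical).
    """
    gam = f"{prefix}.gam"
    bam = f"{prefix}.bam"
    return [
        # Map paired-end short reads to the graph; GAM is written to stdout.
        (["vg", "giraffe", "-Z", gbz, "-m", min_index, "-d", dist_index,
          "-f", fq1, "-f", fq2], gam),
        # Project the graph alignments onto the reference paths; -b asks for BAM.
        (["vg", "surject", "-x", surject_graph, "-b", gam], bam),
    ]


if __name__ == "__main__":
    for argv, out in surject_commands("graph.gbz", "graph.min", "graph.dist",
                                      "graph.xg", "reads_1.fq.gz",
                                      "reads_2.fq.gz", "sample"):
        print(f"{shlex.join(argv)} > {shlex.quote(out)}")
```

The resulting BAM would then go through sorting and the usual linear-reference caller, which is left out here.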
I also saw a second method, done here, where the authors used vg augment on the alignments, followed by vg pack, vg snarls and finally vg call.
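The second route can be sketched the same way. Again, this is an illustration under assumptions, not a validated pipeline: the helper name augment_call_commands, the file names and the threshold defaults are hypothetical, the flags follow my reading of the vg README, and the input graph is assumed to be in a format that vg augment can modify (for example .vg). It is worth checking each subcommand's --help for the installed version.

```python
import shlex


def augment_call_commands(graph_vg, gam, prefix, min_cov=4, min_baseq=5, min_mapq=5):
    """Return (argv, stdout_file) pairs for augment -> index -> pack -> snarls -> call.

    stdout_file is None when the command writes its own output file.
    File names and thresholds are placeholders (hypothetical).
    """
    aug_vg, aug_gam = f"{prefix}.aug.vg", f"{prefix}.aug.gam"
    aug_xg, pack = f"{prefix}.aug.xg", f"{prefix}.aug.pack"
    snarls, vcf = f"{prefix}.aug.snarls", f"{prefix}.vcf"
    return [
        # Add variation seen in the reads to the graph; -A writes the GAM
        # rewritten against the augmented graph.
        (["vg", "augment", graph_vg, gam, "-m", str(min_cov),
          "-q", str(min_baseq), "-Q", str(min_mapq), "-A", aug_gam], aug_vg),
        # Index the augmented graph.
        (["vg", "index", aug_vg, "-x", aug_xg], None),
        # Compute read support on the augmented graph.
        (["vg", "pack", "-x", aug_xg, "-g", aug_gam,
          "-Q", str(min_mapq), "-o", pack], None),
        # Find the snarls (variant sites) once so vg call can reuse them.
        (["vg", "snarls", aug_xg], snarls),
        # Call variants from the read support.
        (["vg", "call", aug_xg, "-k", pack, "-r", snarls], vcf),
    ]


if __name__ == "__main__":
    for argv, out in augment_call_commands("graph.vg", "sample.gam", "sample"):
        line = shlex.join(argv)
        print(f"{line} > {shlex.quote(out)}" if out else line)
```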
Is there a particular method that you would recommend for doing that?
I wish you a nice day, Regards, Marion
Hi,
Thank you for your answer.
My issue has evolved a bit since then. I have tried the surject method, which led to a ~20% decrease in aligned reads in the resulting .bam file. Consequently, I get far fewer variants called than if I just use a regular linear reference with the same downstream variant calling method (GATK in my case). I am now trying to see how I could improve that, and whether other methods for variant calling on a pangenome graph could be applied to divergent species.
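To make the comparison of aligned reads reproducible, one option is to count primary mapped records in both BAMs in exactly the same way. Below is a minimal sketch that works on SAM text (for example the output of samtools view); the function names are hypothetical and the toy records are made up for illustration, not real data.

```python
def count_primary_mapped(sam_lines):
    """Count mapped, primary SAM records (skips headers and blank lines)."""
    UNMAPPED, SECONDARY, SUPPLEMENTARY = 0x4, 0x100, 0x800
    n = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue
        flag = int(line.split("\t")[1])  # FLAG is the second SAM column
        if flag & (UNMAPPED | SECONDARY | SUPPLEMENTARY):
            continue
        n += 1
    return n


def percent_decrease(before, after):
    """Relative drop from `before` to `after`, in percent."""
    if before == 0:
        raise ValueError("no reads in the baseline")
    return 100.0 * (before - after) / before


if __name__ == "__main__":
    # Toy SAM records (made up): two mapped primaries, one unmapped, one secondary.
    toy = [
        "@HD\tVN:1.6",
        "r1\t0\tchr1\t100\t60\t5M\t*\t0\t0\tACGTA\tIIIII",
        "r2\t16\tchr1\t200\t60\t5M\t*\t0\t0\tACGTA\tIIIII",
        "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGTA\tIIIII",
        "r1\t256\tchr2\t300\t0\t5M\t*\t0\t0\tACGTA\tIIIII",
    ]
    print(count_primary_mapped(toy))
```

Counting only primary records avoids inflating either side with secondary or supplementary alignments, which the two mappers may report differently.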
The variant calling method you want does not exist yet.
If you are using short reads, the approach using vg surject works best with graphs without too large structural variants. Otherwise many reads will map to locations that are nowhere near the reference sequence. Those alignments cannot be projected to the reference, and vg surject will drop them.
vg call works best for genotyping variants already present in the graph. You can try using it to call novel variants with the vg augment approach, but that introduces a lot of noise from sequencing errors and unnormalized edits, and vg call does not handle the noise very well.
What you want is closer to genome inference than variant calling. You would need a variant caller that works directly with the pangenome graph. After calling variants relative to the graph, you would infer the most likely haplotype paths in the graph and then use the graph to get the alignment between those paths and the paths corresponding to the reference genome.