Hello vg-team,
I have a graph that I created and indexed using:
vg construct -v vars -r ref -a >graph.vg
vg index -x graph.xg graph.vg
vg index -G graph.gbwt -v vars graph.vg
The VCF used for construction has phased genotypes for all 7 chromosomes, so I would expect 14 haplotype threads. However, vg paths reports many more than that: 945.
vg paths -g graph.gbwt -x graph.xg -E
_thread_ZI284_NC_004353.4_0_1 127100
_thread_ZI284_NC_004353.4_1_1 127104
_thread_ZI284_NC_004354.4_0_0 932781
_thread_ZI284_NC_004354.4_1_0 932778
_thread_ZI284_NC_004354.4_0_1 627525
_thread_ZI284_NC_004354.4_1_1 627553
_thread_ZI284_NC_004354.4_0_2 992875
_thread_ZI284_NC_004354.4_1_2 992884
_thread_ZI284_NC_004354.4_0_3 113038
_thread_ZI284_NC_004354.4_1_3 113036
_thread_ZI284_NC_004354.4_0_4 319932
_thread_ZI284_NC_004354.4_1_4 319953
_thread_ZI284_NC_004354.4_0_5 102680
_thread_ZI284_NC_004354.4_1_5 102686
_thread_ZI284_NC_004354.4_0_6 122150
_thread_ZI284_NC_004354.4_1_6 122160
_thread_ZI284_NC_004354.4_0_7 41509
_thread_ZI284_NC_004354.4_1_7 41514
_thread_ZI284_NC_004354.4_0_8 62633
_thread_ZI284_NC_004354.4_1_8 62637
_thread_ZI284_NC_004354.4_1_9 422021
_thread_ZI284_NC_004354.4_0_9 1177293
...
I see there are two 'main' threads:
_thread_sample_contig_0_x
_thread_sample_contig_1_x
What are the other threads? And what does the 'x' represent? Are they just parts of the collective thread?
Thanks, Cade
I remade my index with the -P option, but still ended up with 945 paths. Is there anything else I could try?
Sometimes haplotypes contain alternate alleles of overlapping variants that make no sense together (under the vg interpretation of the VCF). By default, this causes a phase break in GBWT construction. With option -o, the construction will use the reference allele for the variant that occurs later in the file in such cases. Together with -P, this option will guarantee haplotype paths spanning the entire contig. However, in some cases the paths will end up using edges that do not exist in the graph.
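To see how the threads are split up before and after rebuilding the index, a rough Python sketch like the one below could summarize the `vg paths` listing. It assumes, based only on the names shown in the question, that each name has the form `_thread_<sample>_<contig>_<phase>_<index>`, and it treats the second column simply as a number to sum. The function name `summarize_threads` is hypothetical and not part of vg.

```python
from collections import defaultdict


def summarize_threads(lines):
    """Count fragments and sum the second column per (sample_contig, phase).

    Assumes names look like _thread_<sample>_<contig>_<phase>_<index>,
    which is inferred from the listing in the question, not from vg docs.
    """
    summary = defaultdict(lambda: [0, 0])
    for line in lines:
        fields = line.split()
        # Skip anything that is not "<thread name> <number>", e.g. "...".
        if len(fields) != 2 or not fields[0].startswith("_thread_"):
            continue
        name, value = fields[0], int(fields[1])
        # Contig names contain underscores, so split from the right:
        # the last two fields are taken to be the phase and the index.
        prefix, phase, _index = name.rsplit("_", 2)
        key = (prefix[len("_thread_"):], int(phase))
        summary[key][0] += 1
        summary[key][1] += value
    return {key: tuple(counts) for key, counts in summary.items()}


# A few lines copied from the listing in the question.
listing = """\
_thread_ZI284_NC_004353.4_0_1 127100
_thread_ZI284_NC_004353.4_1_1 127104
_thread_ZI284_NC_004354.4_0_0 932781
_thread_ZI284_NC_004354.4_1_0 932778
_thread_ZI284_NC_004354.4_0_1 627525
_thread_ZI284_NC_004354.4_1_1 627553
...
""".splitlines()

for (sample_contig, phase), (fragments, total) in sorted(
    summarize_threads(listing).items()
):
    print(sample_contig, phase, fragments, total)
```

If the index built with -o and -P behaves as described above, each sample, contig, and phase combination should show a single fragment in this summary.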