More haplotype threads than expected
5.0 years ago
cmirchan ▴ 10

Hello vg-team,

I have a graph that I created and indexed using:

vg construct -v vars -r ref -a >graph.vg
vg index -x graph.xg graph.vg
vg index -G graph.gbwt -v vars graph.vg

The VCF used for construction has phased genotypes for all 7 chromosomes, so I would expect 14 haplotype threads. However, vg paths reports many more than that: 945.

 vg paths -g graph.gbwt -x graph.xg -E
_thread_ZI284_NC_004353.4_0_1   127100
_thread_ZI284_NC_004353.4_1_1   127104
_thread_ZI284_NC_004354.4_0_0   932781
_thread_ZI284_NC_004354.4_1_0   932778
_thread_ZI284_NC_004354.4_0_1   627525
_thread_ZI284_NC_004354.4_1_1   627553
_thread_ZI284_NC_004354.4_0_2   992875
_thread_ZI284_NC_004354.4_1_2   992884
_thread_ZI284_NC_004354.4_0_3   113038
_thread_ZI284_NC_004354.4_1_3   113036
_thread_ZI284_NC_004354.4_0_4   319932
_thread_ZI284_NC_004354.4_1_4   319953
_thread_ZI284_NC_004354.4_0_5   102680
_thread_ZI284_NC_004354.4_1_5   102686
_thread_ZI284_NC_004354.4_0_6   122150
_thread_ZI284_NC_004354.4_1_6   122160
_thread_ZI284_NC_004354.4_0_7   41509
_thread_ZI284_NC_004354.4_1_7   41514
_thread_ZI284_NC_004354.4_0_8   62633
_thread_ZI284_NC_004354.4_1_8   62637
_thread_ZI284_NC_004354.4_1_9   422021
_thread_ZI284_NC_004354.4_0_9   1177293
...

I see there are two 'main' threads:

_thread_sample_contig_0_x
_thread_sample_contig_1_x

What are the other threads? And what does the 'x' represent? Are they just parts of the collective thread?

Thanks, Cade

5.0 years ago
glenn.hickey ▴ 520

Ambiguities, conflicts or missing data in the phasing information in the VCF will cause the haplotype threads to be broken up. Adding the -P option to your index -G command to force phasing at unphased genotypes may resolve this.
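To see how fragmented each haplotype is, the thread listing can be grouped by its name prefix. The following is only a rough sketch: it assumes the names follow the `_thread_sample_contig_phase_x` pattern shown in the question, that the trailing number is a counter for the pieces a haplotype was broken into, and that the second column is the thread length. The function name `summarize_threads` is made up for this example.

```shell
# Hypothetical helper: group `vg paths -E` output by haplotype.
# Assumption: stripping the trailing "_<number>" from a thread name
# leaves the sample/contig/phase key shared by all of its pieces.
summarize_threads() {
    awk '{
        key = $1
        sub(/_[0-9]+$/, "", key)   # drop the trailing piece counter
        pieces[key]++              # number of pieces for this haplotype
        total[key] += $2           # summed length of those pieces
    }
    END {
        for (k in pieces)
            printf "%s\t%d\t%.0f\n", k, pieces[k], total[k]
    }' | LC_ALL=C sort
}

# Usage with the files from the question:
#   vg paths -g graph.gbwt -x graph.xg -E | summarize_threads
```

Each output line holds the haplotype key, the number of pieces, and their summed length. With unbroken phasing, every key would have exactly one piece, which for 7 contigs and one diploid sample gives the expected 14 threads.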


I rebuilt my index with the -P option, but it still resulted in 945 paths. Is there anything else I could try?


Sometimes haplotypes contain alternate alleles of overlapping variants that make no sense together (under the vg interpretation of the VCF). By default, this causes a phase break in GBWT construction. With option -o, the construction will use the reference allele for the variant that occurs later in the file in such cases. Together with -P, this option will guarantee haplotype paths spanning the entire contig. However, in some cases the paths will end up using edges that do not exist in the graph.
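As a sketch, the GBWT command from the question with both options added might look like the following. The file names are the ones used in the original post, and the guard around the actual run is just a convenience for this example, not something vg requires.

```shell
# File names taken from the question; adjust to your own data.
vcf=vars
graph=graph.vg
out=graph.gbwt

# -P: force phasing at unphased genotypes
# -o: on conflicting overlaps, use the reference allele for the later variant
# Caveat from the answer above: the resulting paths may use edges
# that do not exist in the graph.
cmd="vg index -G $out -v $vcf -P -o $graph"
echo "$cmd"

# Only run the command when vg and both input files are present.
if command -v vg >/dev/null 2>&1 && [ -f "$vcf" ] && [ -f "$graph" ]; then
    $cmd
fi
```

After rebuilding, rerunning `vg paths -g graph.gbwt -x graph.xg -E` should show whether the thread count has dropped to the expected number.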

2.8 years ago

Hello, I'm leaving this answer here for people who may run into the same problem in the future.

I solved it by adding the "--discard-overlaps --force-phasing" arguments to the GBWT construction, as I had an unphased VCF file (Documentation here).

vg paths then showed 20 haplotypes for my 10 samples.
