Question

How to identify which genome the variations in the pan-genome graph originate from

0

23 months ago

Wenke • 0

We want to use vg to construct a pan-genome. In order to detect variations in the subsequent pan-genome graph and determine which genomes they exist in, we intend to retain the genome source information for the paths within the pan-genome graph. We attempted to add genome information to the ID and info columns in the VCF input provided to vg. However, after completing the construction process, the ID and info columns in the resulting pan-genome graph were replaced with randomly generated identifiers, and the corresponding genome information couldn't be found. How should we proceed?

vg • 2.3k views

score 2 · Answer 1 · 2023-08-07

2

23 months ago

glenn.hickey ▴ 540

vg has no mechanism for preserving the VCF IDs, unfortunately. You will need to use position lookups (vg find) to compare your VCF and graph.

If your VCF is phased, vg will store the haplotypes from your VCF in the graph by way of the GBWT/GBZ, which will allow you to do some types of queries on your samples.

0

Thank you! What does "VCF is phased" mean? Does it mean adding the allele types at the mutation positions of the different genomes?

ADD REPLY • link 23 months ago by Wenke • 0

1

A phased VCF specifies which variants co-occur on the same haplotype, whereas an unphased VCF only lists genotypes at each site, with no assertion of which combination is on each haplotype. In the VCF format, phasing is usually indicated by using a | symbol between the alleles instead of a /.
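To make the distinction concrete, here is a minimal Python sketch that reads GT strings and, for phased sites only, collects which allele sits on which haplotype. The record layout `(pos, ref, alts, gt)` and the function names are assumptions made for illustration, not part of vg or any VCF library, and the sketch ignores phase sets (PS), so it treats all phased sites as belonging to one block.

```python
import re

def parse_genotype(gt):
    """Return (alleles, phased) for a VCF GT string such as '0|1' or '0/1'."""
    alleles = re.split(r"[|/]", gt)
    phased = "|" in gt
    return alleles, phased

def alleles_per_haplotype(records):
    """Collect, per haplotype index, the allele carried at each phased site.

    records: iterable of (pos, ref, alts, gt) tuples for ONE sample
    (a hypothetical layout chosen for this example).
    Unphased or missing genotypes are skipped, because they do not say
    which haplotype carries which allele.
    """
    haplotypes = {}
    for pos, ref, alts, gt in records:
        alleles, phased = parse_genotype(gt)
        if not phased or "." in alleles:
            continue
        choices = [ref] + list(alts)  # allele 0 is REF, 1.. are the ALTs
        for hap_index, allele in enumerate(alleles):
            haplotypes.setdefault(hap_index, []).append((pos, choices[int(allele)]))
    return haplotypes

records = [
    (100, "A", ["G"], "0|1"),  # phased: haplotype 0 has A, haplotype 1 has G
    (200, "C", ["T"], "1|0"),  # phased: haplotype 0 has T, haplotype 1 has C
    (300, "G", ["A"], "0/1"),  # unphased: skipped
]
print(alleles_per_haplotype(records))
```

With `|` the order of the alleles is meaningful from site to site, which is what lets haplotypes be reconstructed; with `/` only the genotype at each individual site is known.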

ADD REPLY • link 23 months ago by Jordan M Eizenga ▴ 740

0

Is there any way to know which specific path the resequencing reads have been aligned to?

ADD REPLY • link 23 months ago by Wenke • 0

2

If you are referring to the sequence of node IDs that the read is aligned to, then yes. The GAM and GAF formats both specify the path. GAF is a text-based format and is pretty easy to inspect manually to see the path. GAM is based on Protobuf, but if you want to inspect some reads, you can convert it to JSON with vg view -a -j. However, the JSON option is not very efficient at genome scale.
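As a rough illustration of inspecting GAF by script, the sketch below pulls the read name and the oriented node walk out of one GAF line. It assumes the standard tab-separated GAF layout, where column 6 holds the path as node IDs prefixed by `>` (forward) or `<` (reverse); the function name and the example line are made up for this example.

```python
import re

def parse_gaf_path(line):
    """Return (read_name, steps) from one GAF line.

    steps is a list of (node_id, is_reverse) tuples taken from column 6.
    If the path column holds a plain path name without '>' or '<',
    the list is empty.
    """
    fields = line.rstrip("\n").split("\t")
    read_name = fields[0]
    path = fields[5]
    steps = [
        (m.group(2), m.group(1) == "<")
        for m in re.finditer(r"([<>])([^<>]+)", path)
    ]
    return read_name, steps

# Made-up GAF record: read1 aligned along nodes 12 and 13 forward, then 15 reverse.
example = "read1\t100\t0\t100\t+\t>12>13<15\t150\t5\t105\t98\t100\t60"
print(parse_gaf_path(example))
```

Processing the file line by line like this avoids loading the whole alignment set into memory, which matters at genome scale.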

ADD REPLY • link 23 months ago by Jordan M Eizenga ▴ 740

0

Is there a way to use vg to construct a pan-genome graph and output it as a GFA graph with "W" (walk) lines?

ADD REPLY • link 23 months ago by Wenke • 0

score 0 · Answer 2 · 2023-08-08

0

23 months ago

colindaven 7.7k

If you use PGGB to build a pangenome from FASTAs (so not from FASTA + VCF), you get a VCF out with the SNPs in each pangenome. Of course, there's no guarantee that it will find the SNPs you're expecting, but there must be some overlap if they are present in the genomes.
