We want to use vg to construct a pan-genome. In order to detect variations in the subsequent pan-genome graph and determine which genomes they exist in, we intend to retain the genome source information for the paths within the pan-genome graph. We attempted to add genome information to the ID and INFO columns in the VCF input provided to vg. However, after completing the construction process, the ID and INFO values in the resulting pan-genome graph were replaced with randomly generated identifiers, and the corresponding genome information couldn't be found. How should we proceed?
Thank you! What does "VCF is phased" mean? Does it mean adding the allele types of the different genomes at each mutation position?
A phased VCF specifies which variants co-occur on the same haplotype, whereas an unphased VCF only lists genotypes at each site, with no assertion of which combination is on each haplotype. In the VCF format, phasing is usually indicated by using a `|` symbol between the alleles instead of a `/`.
Is there any way to know which specific path in the resequencing alignment has been aligned
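On the phased versus unphased distinction in the answer above, here is a minimal Python sketch of how phased genotypes let you reconstruct haplotypes. The sites and genotype values are made up for illustration, and the helper names `split_genotype` and `haplotypes` are hypothetical, not part of vg or any VCF library.

```python
# Hypothetical GT values for one diploid sample at two sites: (position, GT).
SITES = [(100, "0|1"), (200, "1|0")]

def split_genotype(gt):
    """Return (alleles, phased) for a GT string such as '0|1' or '0/1'."""
    phased = "|" in gt
    separator = "|" if phased else "/"
    return gt.split(separator), phased

def haplotypes(sites):
    """Collect (position, allele) pairs per haplotype.

    This only makes sense when every site is phased; an unphased site
    does not say which allele sits on which haplotype, so we refuse it.
    """
    haps = None
    for pos, gt in sites:
        alleles, phased = split_genotype(gt)
        if not phased:
            raise ValueError(f"site {pos} is unphased: {gt}")
        if haps is None:
            haps = [[] for _ in alleles]
        for hap, allele in zip(haps, alleles):
            hap.append((pos, allele))
    return haps or []

if __name__ == "__main__":
    for index, hap in enumerate(haplotypes(SITES), start=1):
        print(f"haplotype {index}: {hap}")
```

With the phased input, the first haplotype carries the reference allele at position 100 and the alternate allele at position 200. If either genotype were written as `0/1`, that pairing could not be inferred.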
If you are referring to the sequence of node IDs that the read is aligned to, then yes. The GAM and GAF formats both specify the path. GAF is a text-based format and is pretty easy to inspect manually to see the path. GAM is based on Protobuf, but if you want to inspect some reads, you can convert it to JSON with `vg view -a -j`. However, the JSON option is not very efficient at genome scale.
Is there a way to construct a pan-genome graph using vg to create a GFA graph with "W"
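On inspecting GAF alignments, as mentioned in the answer above: the sketch below pulls the node path out of one GAF record in Python. It assumes the standard tab-separated GAF layout, where column 6 holds the path as oriented node IDs such as `>12>13<15`. The example record and the function name `gaf_path` are invented for illustration, and a path written as a plain path name (without `>` or `<`) is not handled.

```python
import re

# Matches one oriented step, e.g. ">12" (forward) or "<15" (reverse).
STEP = re.compile(r"([<>])([^<>]+)")

def gaf_path(line):
    """Return (read_name, steps) for one GAF record.

    Each step is (node_id, is_reverse). Column 6 (index 5) is the path.
    """
    fields = line.rstrip("\n").split("\t")
    read_name, path = fields[0], fields[5]
    steps = [(m.group(2), m.group(1) == "<") for m in STEP.finditer(path)]
    return read_name, steps

if __name__ == "__main__":
    # Made-up record: name, length, start, end, strand, path, then path
    # length/start/end, matches, block length, and mapping quality.
    record = "read1\t100\t0\t100\t+\t>12>13<15\t150\t10\t110\t98\t100\t60"
    name, steps = gaf_path(record)
    for node_id, is_reverse in steps:
        print(name, node_id, "reverse" if is_reverse else "forward")
```

Reading a GAF file line by line with a function like this avoids the JSON conversion step entirely, which matters at genome scale.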