I have constructed an MC pangenome graph with --reference CHM13 GRCh38
and want to align short read to the graph with haplotype sampling. In the Personalized Pangenome arxiv paper, this pipeline (https://github.com/vgteam/vg_wdl/blob/master/workflows/giraffe.wdl) has been applied to do the analysis. For the HPRC graph, a path list start with GRCh38.
or CHM13.
can be used to surject the GAM to GRCh38/CHM13 reference genomes, respectively. According to the pipeline, the path list file can be generated by:
vg gbwt -CL -Z ${in_gbz_file} | sort > path_list.txt
grep -v _decoy path_list.txt | grep -v _random | grep -v chrUn_ | grep -v chrEBV | grep -v chrM | grep -v chain_ > path_list.sub.txt
However, for my pangenome graph (after haplotype sampling), the output from vg gbwt -CL -Z sampled_graph.gbz
does not have the reference sample prefix:
chr10
chrUn_KI270756v1
chr11
chr11_KI270721v1_random
chrUn_KI270417v1
chrUn_KI270425v1
chr12
chr13
chrUn_KI270743v1
...
In this case, how can I specify the path list to surject the GAM to a specific reference genome?
By the way, I also checked the contigs in hprc-v1.1-mc-chm13.gbz
and I didn't find reference prefix in the hprc-v1.1 graph. It seems the tutorial on vg_wdl is still based on the hprc-v1.0 graph, is there an updated pipeline to align and call variants on a personalized pangenome?
Thanks.
Thank you so much for the clarification. The output from
vg paths -L -S GRCh38
looks like:It seems that the GRCh38 assembly was split into multiple chunks (is it due to clip?). Because I want to let one chromosome has the same name in the bam file, I tried to use a path list without position suffix and got a bam file. Is it correct?
Besides, in the
vg_wdl
scripts, the bam header was also replaced by the.dict
file of the reference genome. Is it because that, in CHM13-based clip graph, GRCh38 assembly doesn't have full sequence. So, we need to manually change the bam header to make it match the length of the reference genome? Moreover, if I want to convert the surjected bam to cram, should I use the raw GRCh38 reference genome (input for MC pipeline), or the GRCh38 fasta file extracted from the graph?Thank you!
If you are using the CHM13-based default (clipped) graph (or a frequency-filtered version of it), any >10 kbp parts of GRCh38 that don't align to CHM13 are missing from the graph. If you surject to a list of path fragments (such as
GRCh38#0#chr10[19649]
), any offset (i
) will be translated to the corresponding offset (19649+i
) in the full sequence (GRCh38#0#chr10
) in the output. I'm not sure if the opposite works (if you specifyGRCh38#0#chr10
in the path list, vg realizes thatGRCh38#0#chr10[19649]
is a fragment of it).You need to replace the header, because sequence lengths will be incorrect. The WDL files should also have a mechanism for removing the prefix (
GRCh38#0#
) from the output files, so that the alignments will be to contigs (chr10
) instead of paths (GRCh38#0#chr10
).If you want to convert the output to CRAM, you should use the full reference rather than the fragments you can get from the graph.
Many thanks for the clarification. It is quite clear to me now. I will use the list of path fragments and full GRCh38 reference sequence to surject the reads.