What path list should be used to surject GAM to GRCh38/CHM13 reference genome
1
0
Entering edit mode
5 months ago
Hanc • 0

I have constructed an MC pangenome graph with --reference CHM13 GRCh38 and want to align short read to the graph with haplotype sampling. In the Personalized Pangenome arxiv paper, this pipeline (https://github.com/vgteam/vg_wdl/blob/master/workflows/giraffe.wdl) has been applied to do the analysis. For the HPRC graph, a path list start with GRCh38. or CHM13. can be used to surject the GAM to GRCh38/CHM13 reference genomes, respectively. According to the pipeline, the path list file can be generated by:

vg gbwt -CL -Z ${in_gbz_file} | sort > path_list.txt
grep -v _decoy path_list.txt | grep -v _random |  grep -v chrUn_ | grep -v chrEBV | grep -v chrM | grep -v chain_ > path_list.sub.txt

However, for my pangenome graph (after haplotype sampling), the output from vg gbwt -CL -Z sampled_graph.gbz does not have the reference sample prefix:

chr10
chrUn_KI270756v1
chr11
chr11_KI270721v1_random
chrUn_KI270417v1
chrUn_KI270425v1
chr12
chr13
chrUn_KI270743v1
...

In this case, how can I specify the path list to surject the GAM to a specific reference genome?

By the way, I also checked the contigs in hprc-v1.1-mc-chm13.gbz and I didn't find reference prefix in the hprc-v1.1 graph. It seems the tutorial on vg_wdl is still based on the hprc-v1.0 graph, is there an updated pipeline to align and call variants on a personalized pangenome?

Thanks.

vg • 690 views
ADD COMMENT
1
Entering edit mode
5 months ago
Jouni Sirén ▴ 470

Path naming conventions are still evolving, and the commands needed for generating the path name list depend on how the graph was built. The commands you mention are appropriate when the graph was built using vg construct or it's based on a GFA file with reference paths as P-lines and haplotype paths as W-lines. If you have all paths as W-lines (as in HPRC v1.1 graphs), you can get the path names for a specific sample (such as GRCh38) with the following command:

vg paths -L -S sample_name -x graph.gbz | sort > path_list.txt

(You may also want to filter the path name list with grep.)

ADD COMMENT
0
Entering edit mode

Thank you so much for the clarification. The output from vg paths -L -S GRCh38 looks like:

GRCh38#0#chr10[133748460]
GRCh38#0#chr10[19649]
GRCh38#0#chr10[38573348]
GRCh38#0#chr10[38920104]
GRCh38#0#chr10[42097906]
GRCh38#0#chr10[47918393]
GRCh38#0#chr11[135076191]
GRCh38#0#chr1[124940206]
...

It seems that the GRCh38 assembly was split into multiple chunks (is it due to clip?). Because I want to let one chromosome has the same name in the bam file, I tried to use a path list without position suffix and got a bam file. Is it correct?

GRCh38#0#chr10
GRCh38#0#chr11
GRCh38#0#chr12
...

Besides, in the vg_wdl scripts, the bam header was also replaced by the .dict file of the reference genome. Is it because that, in CHM13-based clip graph, GRCh38 assembly doesn't have full sequence. So, we need to manually change the bam header to make it match the length of the reference genome? Moreover, if I want to convert the surjected bam to cram, should I use the raw GRCh38 reference genome (input for MC pipeline), or the GRCh38 fasta file extracted from the graph?

Thank you!

ADD REPLY
1
Entering edit mode

If you are using the CHM13-based default (clipped) graph (or a frequency-filtered version of it), any >10 kbp parts of GRCh38 that don't align to CHM13 are missing from the graph. If you surject to a list of path fragments (such as GRCh38#0#chr10[19649]), any offset (i) will be translated to the corresponding offset (19649+i) in the full sequence (GRCh38#0#chr10) in the output. I'm not sure if the opposite works (if you specify GRCh38#0#chr10 in the path list, vg realizes that GRCh38#0#chr10[19649] is a fragment of it).

You need to replace the header, because sequence lengths will be incorrect. The WDL files should also have a mechanism for removing the prefix (GRCh38#0#) from the output files, so that the alignments will be to contigs (chr10) instead of paths (GRCh38#0#chr10).

If you want to convert the output to CRAM, you should use the full reference rather than the fragments you can get from the graph.

ADD REPLY
0
Entering edit mode

Many thanks for the clarification. It is quite clear to me now. I will use the list of path fragments and full GRCh38 reference sequence to surject the reads.

ADD REPLY

Login before adding your answer.

Traffic: 1832 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6