Question

Vg Call error multiple paths detected using HPRC-GRCH38 ref

0

Entering edit mode

7 months ago

Alexandria A. • 0

Hi vg team,

I am attempting to use vg call on a wgs sample aligned to hprc-grch38, using wdl Cromwell. I'm pretty new to cs so please bear with me.

I Used hprc-grch38 minimap cactus build. Vg autoindex workflow giraffe to produce .gbz .xg .min .dist files.

./vg autoindex --workflow giraffe -x grch38.xg
-g grch38.gcsa -g grch38/hprc-v1.0-minigraph-grch38.gfa -p GRCH38_pangenome --target-mem 80G -t 20 > grch38_pangenome.girrafe.gbz

Next Vg giraffe for alignment using phased haplotype wgs sample. alignment takes a long time and .gam file size is very large (149G), but has a MAPQ60.

command {
        set -e -o pipefail
        vg giraffe \
        -t ~{threads} \
        --read-group "ID:1 LB:lib1 SM:~{sampleId} PL:illumina PU:unit1" \
        --sample ~{sampleId} \
        --gbz-name ~{giraffe_gbz} \
        --minimizer-name ~{min}  \
        --dist-name ~{dist} \
        --fastq-in ~{r1Fastq} \
        --fastq-in ~{r2Fastq} \
        --ref-paths ~{minigraph_grch38} \
        -b fast \
        --prune-low-cplx \
        --output-format gam \
        > ~{outGamPath}

minigraph_grch38 == grch38/hprc-v1.0-minigraph-grch38.txt

I ran Vg Call on .pack (3.8G) generated from gam

command {
    set -e -o pipefail
    vg pack \
    --threads ~{threads} \
    --xg ~{referenceGraph_xg} \
    --gam ~{outGam} \
    --min-mapq 5 \
    --packs-out ~{outPackPath} \
    > ~{outPackPath}

and got an error msg: error [vg call]: Multiple reference samples detected: [CHM13, HG00438, HG00621, HG00673, HG00733, ...]. Please use -S to specify a single reference sample or use -p to specify reference paths

I ran Vg paths to get a better idea of the ref paths available. And realize that multiple ref paths are included in the .xg file. This might clarify why the alignment step takes so long.

vg paths —metadata -x ./GRCH38_pangenome.xg

enter image description here

From my understanding hprc-grch38 is built using multiple samples, do I need to select one of said samples to generate a vcf file? Selecting one path seems to undermine using a Pangenome as the reference in the first place. Should I drop haplotypes in .xg so that vg call doesn't have to choose a path (more manageable)? Any help is much appreciated.

I have 400 samples to align (rare disease group), with the plan of merging vcfs for analysis (bcftools).

hprc call vg grch38 • 438 views

ADD COMMENT • link updated 7 months ago by Jordan M Eizenga ▴ 660 • written 7 months ago by Alexandria A. • 0

0

Entering edit mode

I have a few comments on this. First, the arguments in vg autoindex look misplaced. The -x is supposed to be a GFF containing transcripts for RNA-seq applications, and the -g is supposed to be a GFA file expressing a graph, but you've provided a GCSA in addition to that. You've also piped stdout down to disk with >, but vg autoindex doesn't output anything to stdout. Its output is all based on the -p prefix argument. However, it seems that despite these issues, you did get usable outputs out of vg autoindex since you got vg giraffe to run.

The output of vg call is a VCF file, and VCFs require variants to be expressed relative to a reference sequence. Since there are multiple samples expressed in the graph, you have to tell it which one you want to use as the reference. You are correct that this sacrifices some of the benefits promised by a pangenome, but it is a limitation of the VCF format/formalism, and the genomics world still largely revolves around VCFs. Chances are you would want to choose GCRCh38 or CHM13 as the reference.

A final note: if it's not too late, you might want to consider updating to the v1.1 pangenome graph, which corrects some known issues in the 1.0 graph. We've also found that the clipped graphs (marked d9 or d2 on the downloads) work somewhat better as mapping targets than the full graph. In the most recent versions of vg giraffe, we also have more sophisticated methods of doing the clipping online at mapping time.

ADD REPLY • link 7 months ago by Jordan M Eizenga ▴ 660