Hi vg team,
I am attempting to use vg call on a wgs sample aligned to hprc-grch38, using wdl Cromwell. I'm pretty new to cs so please bear with me.
I Used hprc-grch38 minimap cactus build. Vg autoindex workflow giraffe to produce .gbz .xg .min .dist files.
./vg autoindex --workflow giraffe -x grch38.xg
-g grch38.gcsa -g grch38/hprc-v1.0-minigraph-grch38.gfa -p GRCH38_pangenome --target-mem 80G -t 20 > grch38_pangenome.girrafe.gbz
Next Vg giraffe for alignment using phased haplotype wgs sample. alignment takes a long time and .gam file size is very large (149G), but has a MAPQ60.
command {
set -e -o pipefail
vg giraffe \
-t ~{threads} \
--read-group "ID:1 LB:lib1 SM:~{sampleId} PL:illumina PU:unit1" \
--sample ~{sampleId} \
--gbz-name ~{giraffe_gbz} \
--minimizer-name ~{min} \
--dist-name ~{dist} \
--fastq-in ~{r1Fastq} \
--fastq-in ~{r2Fastq} \
--ref-paths ~{minigraph_grch38} \
-b fast \
--prune-low-cplx \
--output-format gam \
> ~{outGamPath}
- minigraph_grch38 == grch38/hprc-v1.0-minigraph-grch38.txt
I ran Vg Call on .pack (3.8G) generated from gam
command {
set -e -o pipefail
vg pack \
--threads ~{threads} \
--xg ~{referenceGraph_xg} \
--gam ~{outGam} \
--min-mapq 5 \
--packs-out ~{outPackPath} \
> ~{outPackPath}
and got an error msg:
error [vg call]: Multiple reference samples detected: [CHM13, HG00438, HG00621, HG00673, HG00733, ...]. Please use -S to specify a single reference sample or use -p to specify reference paths
I ran Vg paths to get a better idea of the ref paths available. And realize that multiple ref paths are included in the .xg file. This might clarify why the alignment step takes so long.
vg paths —metadata -x ./GRCH38_pangenome.xg
From my understanding hprc-grch38 is built using multiple samples, do I need to select one of said samples to generate a vcf file? Selecting one path seems to undermine using a Pangenome as the reference in the first place. Should I drop haplotypes in .xg so that vg call doesn't have to choose a path (more manageable)? Any help is much appreciated.
I have 400 samples to align (rare disease group), with the plan of merging vcfs for analysis (bcftools).
I have a few comments on this. First, the arguments in
vg autoindex
look misplaced. The-x
is supposed to be a GFF containing transcripts for RNA-seq applications, and the-g
is supposed to be a GFA file expressing a graph, but you've provided a GCSA in addition to that. You've also piped stdout down to disk with>
, butvg autoindex
doesn't output anything to stdout. Its output is all based on the-p
prefix argument. However, it seems that despite these issues, you did get usable outputs out ofvg autoindex
since you gotvg giraffe
to run.The output of
vg call
is a VCF file, and VCFs require variants to be expressed relative to a reference sequence. Since there are multiple samples expressed in the graph, you have to tell it which one you want to use as the reference. You are correct that this sacrifices some of the benefits promised by a pangenome, but it is a limitation of the VCF format/formalism, and the genomics world still largely revolves around VCFs. Chances are you would want to choose GCRCh38 or CHM13 as the reference.A final note: if it's not too late, you might want to consider updating to the v1.1 pangenome graph, which corrects some known issues in the 1.0 graph. We've also found that the clipped graphs (marked d9 or d2 on the downloads) work somewhat better as mapping targets than the full graph. In the most recent versions of
vg giraffe
, we also have more sophisticated methods of doing the clipping online at mapping time.