I have a graph of 10 different assemblies of the 5 Mbp salmonella genome created with pggb. As far as I can tell, the smoothed GFAs created from that tool look as expected, and aren't corrupted or anything.
I want to use vg map or GraphAligner to map some Illumina short reads (fastq files) from an SRA accession to the graph, and call variants in the SRA sample against the graph (with or without augmenting - I don't mind just genotyping the sample). However, I'm running into several roadblocks that I can't figure out.
My pipeline goes like this:
($gfa
is my GFA file from PGGB's output, $fq1
and $fq2
are my SRA fastq paired read files.)
vg view -Fv $gfa > salm.vg
vg convert -g -t 38 -v $gfa > salm.vg
vg index -t 38 -x salm.xg salm.vg
GraphAligner --verbose -x vg -t 38 -g salm.vg -f $fq1 -f $fq2 -a salm.gam
vg pack -t 38 -x salm.xg -g salm.gam -Q 5 -o salm.gam.pack
vg call -d 1 -t 38 -m 1,2 -k salm.gam.pack -s SRR13061905 -a salm.xg > salm.xg.gam.vcf
There are a few issues I'm having here, in order:
For the first two steps, I have tried both vg view and vg convert as you can see. However, these give wildly different outputs - the "vg convert" result is over 2x larger in filesize for example. Is this intended, and if so, which is preferred when converting GFA -> vg?
When I attempt to create a GCSA index in the indexing step in order to use vg map, it fails due to an incredibly huge disk usage - over 2 terabytes - too much for my cluster. I understand this is expected for GCSA, since these files can be 100x or more larger than the graph as stated in the docs. However, setting a temporary filesize limit of e.g. 500G with "vg index -Z [...]" causes failure due to the temp files still being too big. Is there any way around this? I've been forced to try to use GraphAligner until I can figure this out, leading to my next issue...
vg pack seems to work ok, but "vg call" always produces a VCF that is either empty (without -a) or has nothing but rejected no-calls for every single position, failing the "lowad" low depth filter. The GAM output from GraphAligner seems valid, and successfully maps almost all the reads to unique positions. The stats look good for the GAM, but VG seems to think there is literally nothing at all in the GAM and totally rejects every single read. What's going on?
Any help with any of this is appreciated, as I'd really like to use these tools. We really want to use a variation graph instead of a reference-based approach for our project.