I'm trying to do whole-genome calling with vg, using altered versions of the sample scripts provided on the GitHub wiki pages. I initially tried calling the whole genome without chunking and ran into memory issues even when providing 256 GB of memory (somewhat understandable, considering it's trying to process the whole thing in one go). However, when I chunked my graph with vg chunk and attempted to process the chunks separately, they took wildly different amounts of memory: chr2 and chr3 needed relatively little (10-12 GB), but chr1 ran out of memory even when given 160 GB. The same happened with about half of the chromosome chunks - repeated out-of-memory crashes even as I raised the memory allocations (confirmed for chr1, 10, 12, 13, 14, 15, 17, and 20 at 160 GB each before I cancelled the rest of the attempt). I'm testing with 256 GB just to see what happens, but... it doesn't seem like it should take this many resources to run, especially after it's been divided up.
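For reference, the chunking step is essentially this (a minimal sketch of what my script does; file names are placeholders, and I'm using the one-chunk-per-path form of vg chunk from the wiki):

# Split the whole-genome graph into one subgraph per chromosome path
# (the reads get chunked similarly via vg chunk's GAM options - omitted here)
for CHR in chr{1..22} chrX chrY chrM; do
    vg chunk -x graph.xg -p "${CHR}" > "chunks/${CHR}.vg"
done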
Here's what it looks like when I check the failed job with seff:
[asherrar@cedar1 scratch]$ seff 60828429_13
Job ID: 60850336
Array Job ID: 60828429_13
Cluster: cedar
User/Group: asherrar/asherrar
State: OUT_OF_MEMORY (exit code 0)
Nodes: 1
Cores per node: 32
CPU Utilized: 2-14:35:33
CPU Efficiency: 36.61% of 7-02:59:12 core-walltime
Job Wall-clock time: 05:20:36
Memory Utilized: 157.35 GB
Memory Efficiency: 98.34% of 160.00 GB
I'm not sure why this is happening - whether it's just a product of my input data's complexity or something else entirely - so any suggestions on what I could do differently would be appreciated. I'm using a graph genome I built from the T2T-CHM13 v2.0 reference and the Simons Genome Diversity Project variant calls against it (a 52 GB vcf.gz), and calling a set of PacBio (I think) long reads on that graph (a 27 GB fasta). The pipeline I used is available here
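Roughly, the construction side of the pipeline looks like this (a sketch of the wiki recipe I adapted; the exact flags in the linked script may differ, and paths are placeholders):

# Build the graph from the CHM13 reference plus the SGDP VCF, then index for mapping
vg construct -r chm13v2.0.fa -v sgdp.vcf.gz -a -t 32 > graph.vg   # -a keeps alt paths for later genotyping
vg index -x graph.xg graph.vg                                     # xg index used for mapping and chunking
vg prune graph.vg > graph.pruned.vg                               # prune complex regions before GCSA indexing
vg index -g graph.gcsa graph.pruned.vg                            # GCSA/LCP index used by vg map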
Correction: the bulk of the issue is in the vg augment step - I'm not sure how to optimize it, and every other part of the vg variant calling pipeline depends on it.
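In case it helps, the per-chunk chain is essentially the wiki's calling recipe (file names are mine); it's the first command below that keeps dying:

# Augment the chunk's graph with edits observed in the reads (the memory-hungry step)
vg augment chunks/chr1.vg chunks/chr1.gam -A chr1_aug.gam > chr1_aug.vg
# Compute read support over the augmented graph
vg pack -x chr1_aug.vg -g chr1_aug.gam -Q 5 -o chr1.pack
# Call variants from the support
vg call chr1_aug.vg -k chr1.pack > chr1.vcf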
I had problems with vg call and chunking speedup here: Speed up vg call?
I didn't analyze memory usage, but the chunking process is nowhere near as efficient as reference-based variant calling on a chunked BAM file.
It seems there is a great need for efficient (and multisample) read mapping and SNP and SV calling tools for graph genomes, and I think we will see a lot of progress in the next few years.
Why did you use vg map as opposed to GraphAligner, out of interest?
Yeah, I expect it's not a particularly efficient process, but this is a bit absurd.
Speaking of other tools, I did see a paper on vg SNP-Aware, but it seemed to be running on a much older version of vg - and if memory serves, one from before several formatting tweaks that might make my newer graphs and intermediate files incompatible with it. The main reason I'm trying to do everything inside vg is just for consistency, and because it seemed like a major tool worth learning for pangenomics. If there are better and faster alternatives, I'm all ears - this is far from my expertise and I'd like to learn. I'll definitely take a peek at GraphAligner and see if it works for my needs. Do you happen to know any alternative graph variant callers, by chance? I'm having trouble finding any that aren't using old versions of vg (like SNP-Aware).

No, I think there aren't many SNP callers available yet - I've seen people make custom callers using the vg pack file. Another future mapper for short reads might be dragmap (a hardware-based tool), but the software version does not map against pangenomes yet.
I've had poor results (slow calling, high RAM usage) even on chromosomes shorter than human chr1 (75 Mb max), but have put this on hold for now.
I think I'll concentrate on building a pangenome with SVs only (rather than full PGGB pangenomes), and on mapping and genotyping (not calling) against it.
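For an SV-only pangenome, construction can go straight through vg construct (a sketch; I believe -S is the flag for handling symbolic SV records, but check your version's help):

# Build an SV-only graph from a reference plus an SV VCF, keeping alt paths for genotyping
vg construct -r ref.fa -v svs.vcf.gz -S -a > sv_graph.vg
vg index -x sv_graph.xg sv_graph.vg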
I'm currently playing around with long reads, and consequently SV calling as well, if you have any recommendations there. The big goal of what I'm working on right now is looking at the benefits of T2T and long-read calling for solving "stubborn" genetic disorders, so I'm tinkering with whatever approaches I can get my hands on.
GraphAligner seems good for mapping to graphs, as I mentioned. For SV calling there's odgi - see odgi pav and their nice tutorial for finding presence/absence variations (though it generates BED/TSV, not VCF).
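A sketch of that combination (GraphAligner's long-read/vg preset plus odgi pav; the odgi flags are from memory, so double-check them against odgi pav --help):

# Map long reads to a GFA graph with GraphAligner's vg preset
GraphAligner -g graph.gfa -f long_reads.fa -a aln.gam -x vg
# Convert the graph to odgi format and compute presence/absence variation over BED windows
odgi build -g graph.gfa -o graph.og
odgi pav -i graph.og -b windows.bed > pav.tsv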
Also, note the distinction between (novel) SV calling and (known) SV genotyping (which wasn't clear to me for some time).
It seems you can use short reads mapped to a pangenome to genotype known SVs (not call new ones). Tools: vg call (maybe), graphtyper (maybe), paragraph (also maybe). But to my knowledge this won't work at present for long reads mapped to a pangenome (lack of callers?) - or maybe just with odgi.
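For the vg call route, genotyping known variants looks roughly like this (assuming the graph was constructed with alt paths, i.e. vg construct -a; the -v flag is how I understand vg call's genotyping mode, so treat it as unverified):

# Compute read support from short reads mapped to the graph
vg pack -x graph.xg -g short_reads.gam -Q 5 -o reads.pack
# Genotype the variants from the construction VCF instead of calling novel ones
vg call graph.xg -k reads.pack -v variants.vcf.gz > genotypes.vcf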
There's still a massive need for pangenome-capable mappers and variant callers.
Thank you so much! I do recall seeing odgi come up, but I never looked into it - seems like it might be more in line with what I need.
And yeah, I'm definitely looking for calling over genotyping. The big issue with the disorders my lab studies is that we haven't been able to find anything causative so far. We're looking for novel variants that could potentially be explanatory, and since short reads and hg38 have failed on that front so far, I figured that looking at a more complete reference and longer reads might help (though short reads seem to perform pretty well on T2T-CHM13 as far as I've seen, despite the low-complexity regions being present). And with the rise of graph genomics and pangenome references, it made sense to tinker with them as well - even if they're still in their infancy.
I added these tools to a new repo here: https://github.com/colindaven/awesome-pangenomes/blob/main/README.md
Yes, that sounds very challenging. If you have enough long-read data to assemble your patients' genomes, then the pangenome approach using PGGB and later odgi might be a good way to go. PGGB will give you SNPs out of the box, and while mapping to the resulting pangenome is challenging (maybe only practical with long reads), you can use odgi pav to find candidate SVs. Then potentially verify these against the linear long-read contig assemblies? Tough!
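A rough sketch of that route (parameters here are illustrative, not tuned - check the pggb README for sensible values for your data):

# All assemblies plus the reference in one FASTA, one sequence per haplotype/contig
pggb -i assemblies.fa.gz -o pggb_out -n 8 -t 32 -p 95 -s 5000
# pggb runs wfmash -> seqwish -> smoothxg and leaves a GFA (and ODGI graph) in pggb_out/
# adding -V with a reference name should also emit a VCF via vg deconstruct (syntax varies by version)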
Update: at the very least, chr1 is still failing with 256 GB RAM - this is just excessive. I thought it might be vg using RAM over storage because it can't find enough swap space to work with, but as far as I can tell there's no option to assign vg augment or vg map a temp directory, unlike previous steps in the chain.
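The closest thing I've found is the generic TMPDIR environment variable - I haven't confirmed that vg augment or vg map actually honor it, so this is just what I'd try, not something I know works:

# Point any TMPDIR-aware temp files at scratch instead of the default location
# (assumption: vg's temp-file handling respects $TMPDIR; unverified for augment/map)
export TMPDIR=/scratch/$USER/vg_tmp
mkdir -p "$TMPDIR"
vg augment chunks/chr1.vg chunks/chr1.gam -A chr1_aug.gam > chr1_aug.vg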