Speed up vg call?
23 months ago

Dear all,

Has anyone found a good way of speeding up vg call, by chunking or another method?

My current chunking gives only a very modest speedup of 0-20%, so it may not be the right approach.

Specifically, I have aligned with vg giraffe and used the following code to chunk the resulting GAM file into sets of 100000 reads.

Commands from a Nextflow script:

vg chunk -t $task.cpus --gam-split-size $params.gam_split_size -a $gam

vg pack -t $task.cpus -x $xg -g $chunked_gam -Q5 -o ${prefix}.aln.pack
vg call -t $task.cpus $xg -k ${prefix}.aln.pack --min-support $params.min_support -s $sample_name > ${prefix}.vcf

I've read lots of docs but am not sure what's up to date

  • There is a complicated vg_toil script here, but it dates from 2020 and I don't know whether it is still up to date, so I'm a bit wary.

vg_toil_script

Thanks

vg_team vg • 1.8k views

vg call ran for 3 days and ultimately failed when it hit the time limit (I submitted the job through Slurm, which enforces a maximum duration). Is this normal?

The command I used is:

vg call graph.gbz -k ${pack_file} -r ${snarls_file} -t 32 > ZYZ288A.giraffe.vcf

The log was silent and the output vcf was empty.

The sizes of the input files are:

-rw-r----- 1 maxine91 maxine91 4.6G Jul 21 01:05 div.12bufo.giraffe.gbz
-rw-r----- 1 maxine91 maxine91 9.5M Jul 25 02:01 div.12bufo.giraffe.gbz.snarls
-rw-r----- 1 maxine91 maxine91 3.8G Jul 25 00:56 ZYZ288A.gbz.pack
-rw-r----- 1 maxine91 maxine91  88G Jul 22 02:05 ZYZ288A.giraffe.mapped.gam

When running vg call on .gbz input, you can often get a major speedup by adding -z, which limits calling to haplotypes present in the .gbz. For example, this makes it hundreds of times faster on the HPRC graphs.


Thanks, I'll try it. I also have a vg call process that takes xg, xg.pack, and xg.snarls as input. It has also run for 3 days without writing anything to its VCF, and it seems that it will never finish no matter how long it runs. I have begun to wonder whether the process is stuck. Is this normal? Is there a way to tell whether a process is stuck?

Update on Aug 2nd:

Currently, I have two instances of the vg call command running, and at intervals of 12 hours, I have been monitoring the memory usage, which has remained virtually unchanged. This further intensifies my concerns that the processes may be stuck. I eagerly await your assistance. Thanks.
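One rough way to approach the "is it stuck?" question is to watch cumulative CPU time rather than memory: a process that is still computing keeps accumulating CPU time even when its memory footprint is flat. The sketch below assumes a POSIX ps that prints the time column as [[DD-]HH:]MM:SS; the helper names cpu_to_seconds and cpu_time_grew are made up for illustration and are not part of vg or Slurm.

```shell
# Convert a ps(1) cumulative CPU time string ([[DD-]HH:]MM:SS) to whole seconds.
cpu_to_seconds() {
  echo "$1" | awk -F'[-:]' '{
    m[0] = 1; m[1] = 60; m[2] = 3600; m[3] = 86400   # seconds per field, right to left
    s = 0
    for (i = NF; i >= 1; i--) s += $i * m[NF - i]
    print int(s)
  }'
}

# Sample the CPU time of PID $1 twice, $2 seconds apart (default 60),
# and succeed only if it grew in between.
cpu_time_grew() {
  before=$(cpu_to_seconds "$(ps -o time= -p "$1")")
  sleep "${2:-60}"
  after=$(cpu_to_seconds "$(ps -o time= -p "$1")")
  echo "PID $1: CPU seconds $before -> $after"
  [ "$after" -gt "$before" ]
}
```

For example, `cpu_time_grew 12345 60 && echo "still computing"` (with 12345 standing in for the real PID of the vg process). Growing CPU time only shows that the process is doing work, not that it will finish; CPU time that stays flat over several samples is a stronger hint that it is blocked.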


Interesting idea. My Nextflow code currently looks like the first block below. Can I just change $xg and $gbwt to $gbz, add -z, and get the speedup described here?

Edit: it looks like it worked. Runtime on a 25k Arabidopsis test example dropped from 4m14 to 2m38. Thanks!

#current code
VG_FULL_TRACEBACK=1
vg pack -t $task.cpus -x $xg -g $gam -Q5 -o ${prefix}.aln.pack
vg call -t $task.cpus $xg -C 100 -k ${prefix}.aln.pack --min-support $params.min_support -a -r $snarls -g $gbwt -s $sample_name > ${prefix}.vcf

#suggested code
VG_FULL_TRACEBACK=1
vg pack -t $task.cpus -x $gbz -g $gam -Q5 -o ${prefix}.aln.pack
vg call -t $task.cpus $gbz -C 100 -k ${prefix}.aln.pack --min-support $params.min_support -a -s $sample_name -r $snarls -z $gbz  > ${prefix}.vcf

-z does not take an argument.

vg call graph.xg -g graph.gbwt should be exactly equivalent to vg call graph.gbz -z (i.e. -g should give the same speedup as -z). If you are seeing different runtimes, I suggest double-checking your output.
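To make the two forms concrete, here is a small sketch that picks the haplotype-restriction flag from the graph file extension: -z with no argument for a .gbz graph, and -g with a separate GBWT otherwise. The helper haplotype_flags and the file names graph.gbz, graph.gbwt and sample.pack are placeholders for illustration, and the command is only printed, not run.

```shell
# Choose the haplotype-restriction flag for vg call from the graph file extension.
# $1 = graph (.gbz or something else, e.g. .xg)
# $2 = GBWT path (only used when the graph is not a GBZ)
haplotype_flags() {
  case "$1" in
    *.gbz) echo "-z" ;;        # -z takes no argument; the haplotypes come from the GBZ itself
    *)     echo "-g $2" ;;     # other graphs need the GBWT passed explicitly
  esac
}

# Print the resulting command line instead of executing it.
graph=graph.gbz
echo "vg call $graph $(haplotype_flags "$graph" graph.gbwt) -k sample.pack > sample.vcf"
```

With graph=graph.xg the same line prints the -g graph.gbwt form instead.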


To Glenn:

Could you tell me which species the vg team has run vg call on, and how much time and which resources those runs needed? That would help me plan my workflow.

I am also considering building graphs and calling variants per chromosome. Is this feasible in principle?

Thanks

20 months ago
glenn.hickey ▴ 520

For bigger datasets, make sure to pass in your snarls (computed with vg snarls) with -r. Sometimes, computing the snarls is most of vg call's runtime. You can also try using -C to avoid huge snarls. In general, unless your graph is extremely complex, you should not need to chunk it up before running vg call.
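As a sketch of that advice, the function below prints (without running) a three-step plan: compute the snarls once, build the pack, then call with -r and -C. The function name print_call_plan and all file names are hypothetical, the -Q5 and -C 100 values are simply the ones used earlier in this thread rather than recommendations, and the flags shown are limited to those already appearing in the posts above.

```shell
# Print the commands for a snarls-first vg call workflow.
# $1 = graph, $2 = GAM, $3 = sample name, $4 = threads (default 8)
# The body runs in a subshell ( ... ) so its variables do not leak into the caller.
print_call_plan() (
  graph=$1; gam=$2; sample=$3; threads=${4:-8}
  # Step 1: compute snarls once so vg call does not have to recompute them.
  echo "vg snarls $graph > $sample.snarls"
  # Step 2: summarise read support from the alignments.
  echo "vg pack -t $threads -x $graph -g $gam -Q5 -o $sample.pack"
  # Step 3: call, reusing the snarls (-r) and capping snarl size (-C).
  echo "vg call -t $threads $graph -k $sample.pack -r $sample.snarls -C 100 -s $sample > $sample.vcf"
)
```

Running `print_call_plan graph.gbz reads.gam S1 32` prints the three command lines, which can be reviewed and then pasted into a job script.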
