vg map time consuming
16 months ago
Maxine ▴ 50

Hi vg team,

The vg map command has been running for 4 days. Initially, the size of the .gam file accumulated to around 100G, but then suddenly dropped to 1G. Currently, the size is around 2.3G and it continues to increase.

The command used is as follows:

vg map -t 32 -x $xg_file -g $gcsa_file -f $read1_file -f $read2_file --log-time --debug > $sample_name.mapped.gam

I am wondering if this progress is normal, and if it's possible to estimate when the process will be completed. Can I estimate the end time based on the size of the input files?

Thank you for your assistance.


Just to add: because the GAM is created by redirecting vg map's output with >, vg map itself has no way to shrink your file. If it went from 100G to 1G, another process deleted or overwrote it (and the final output will possibly be corrupt as well).
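
One way to guard against this, sketched under the assumption that the overwrite came from a second shell job redirecting to the same path: the shell's noclobber option makes > fail when the target file already exists. The helper name run_to_new_file is hypothetical, and this only helps if every job writing to that path goes through it; it does not stop rm or a writer that bypasses the shell redirect.

```shell
#!/usr/bin/env bash
# Run a command with stdout redirected to a file, refusing to overwrite
# an existing file. run_to_new_file is a hypothetical helper name.
run_to_new_file() {
    local out=$1
    shift
    (
        # In this subshell, '>' fails if "$out" already exists.
        set -o noclobber
        "$@" > "$out"
    )
}

# Example (commented out because it needs vg and real index files):
# run_to_new_file "$sample_name.mapped.gam" \
#     vg map -t 32 -x "$xg_file" -g "$gcsa_file" -f "$read1_file" -f "$read2_file"
```

If the output file is already there, the wrapper returns a non-zero status and leaves the existing file untouched.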

16 months ago

For relatively simple graphs, you usually would expect vg map to be roughly an order of magnitude slower than a linear read mapper like bwa mem. On very complicated graphs, vg map has never really been practical for paired-end reads. However, you could still map the FASTQs independently as single-ended reads. It might also be a good idea to map the first ~500k reads to get some sense of the relative speed before committing to another full mapping run.
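
A minimal sketch of such a pilot run, assuming plain or gzip-compressed FASTQ with the standard four lines per record. The helper names (first_n_reads, count_reads, estimate_hours) and the pilot file names are hypothetical, and the linear extrapolation is only a rough guide, since mapping speed can vary from read to read.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for a pilot mapping run; names are illustrative only.

# Print the first N reads of a FASTQ file (4 lines per record).
# 'gzip -cdf' decompresses gzip input and passes plain text through unchanged.
first_n_reads() {
    local fastq=$1 n=$2
    gzip -cdf "$fastq" | head -n $(( n * 4 ))
}

# Count the reads in a FASTQ file (line count divided by 4).
count_reads() {
    local lines
    lines=$(gzip -cdf "$1" | wc -l)
    echo $(( lines / 4 ))
}

# Linearly extrapolate total runtime in hours from a pilot run:
# estimate_hours TOTAL_READS PILOT_READS PILOT_SECONDS
estimate_hours() {
    awk -v total="$1" -v pilot="$2" -v secs="$3" \
        'BEGIN { printf "%.1f\n", total / pilot * secs / 3600 }'
}

# Example (commented out because it needs vg and real index files):
# first_n_reads "$read1_file" 500000 > pilot_1.fq
# first_n_reads "$read2_file" 500000 > pilot_2.fq
# time vg map -t 32 -x "$xg_file" -g "$gcsa_file" -f pilot_1.fq -f pilot_2.fq > pilot.gam
# estimate_hours "$(count_reads "$read1_file")" 500000 <pilot seconds>
```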


Sure, mapping the first ~500k reads could be a way to estimate the speed. However, I believe it would still take more than 4 days to process a single sample. That is too long, especially considering that I have dozens of samples. As you mentioned, mapping paired-end reads against complex graphs is impractical. Are there any practical ways to make use of paired-end reads? For example:

  1. Is it possible to create an interleaved fastq file? Would using an interleaved fastq file be more efficient?
  2. Could I use one of the paired reads and disregard the other?
  3. Is it feasible to map one single-end read at a time and then map the other?
  4. What about using vg giraffe? (However, I'm concerned that my unphased VCF might result in numerous false negative calls.)

Option 3 is what I had in mind when I suggested mapping the two ends independently. Option 1 won't change anything, and option 2 would lose a significant amount of information relative to option 3. It could be worth experimenting with vg giraffe, but if the structure of the graph is too complicated, that can also make it hard to construct the distance index that vg giraffe uses.


I understand your point about option 4. I recall that when I tried to use vg autoindex for giraffe, it consistently ran out of memory and failed, even when I allocated 230G of memory to it.

Regarding option 3, I'm wondering how to map read files one by one. The output of vg map is a gam file. Can the gam file be reused as input for another vg map operation? Please clarify this for me.


GAM files can be concatenated, so one option is to run the command twice: vg map -f read1.fq > out.gam; vg map -f read2.fq >> out.gam. You could also concatenate the read files on input: vg map -f <(cat read1.fq read2.fq) > out.gam. (The index arguments are omitted here for brevity.)
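
A fuller sketch of the first approach, reusing the -t, -x and -g options from the original command. The function name map_ends_separately and the VG variable (a convenience for pointing at the vg binary) are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: map each end as single-ended reads and append both results to one GAM.
# VG is a hypothetical convenience variable pointing at the vg binary.
VG=${VG:-vg}

# map_ends_separately XG GCSA READ1 READ2 OUT [THREADS]
map_ends_separately() {
    local xg=$1 gcsa=$2 read1=$3 read2=$4 out=$5 threads=${6:-32}
    # First end: '>' creates (or truncates) the output file.
    "$VG" map -t "$threads" -x "$xg" -g "$gcsa" -f "$read1" > "$out" || return 1
    # Second end: '>>' appends, relying on GAM files being concatenable.
    "$VG" map -t "$threads" -x "$xg" -g "$gcsa" -f "$read2" >> "$out"
}

# Example (commented out because it needs vg and real index files):
# map_ends_separately "$xg_file" "$gcsa_file" "$read1_file" "$read2_file" \
#     "$sample_name.mapped.gam"
```

Stopping after a failed first run avoids appending the second end to a partial file.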


Thank you for helping. I'll try this way.


Does the same apply to vg giraffe, i.e. can the paired read files be concatenated and given to vg giraffe? Could you please confirm?


Yes, it is. FASTQ and GAM formats can both be concatenated, and all the VG tools handle single-ended reads more or less the same.
