Distance index computation during `vg autoindex` takes too long with one GFA file but not with the other.
10 months ago
Harsh • 0

I have two GFA files representing the same dataset. I am trying to map reads to these two files using Giraffe. One of them is much smaller than the other (17 MB vs 52 MB). However, the distance index computation (and hence autoindex) is much faster on the larger dataset. As a matter of fact, it only took a couple of seconds on that dataset while it has been running for several hours on the smaller dataset.

I want to know why this is the case and how it can be fixed. I have attached a Google Drive link containing the two datasets with this post.

The command I am using is `vg autoindex --workflow giraffe -g <file_name.gfa> -p <output_file_name>`

Here are the two GFA files: https://drive.google.com/drive/folders/1mCmgIuVTDthDS7h5iW0PY86-dkFYGCDG?usp=sharing The large one indexes quickly, while the small one is very slow.

vg giraffe • 492 views

The distance index's efficiency is determined to a large extent by the complexity of the graph. It tends to work best when the graph looks mostly like a series of "bubbles". If the graph has a much more complicated topology, the index can require a lot of computation to create, and it typically also ends up being quite large. I think the smaller graph is likely smaller because it has merged more distant paralogous sequences, which gives it a complicated topology and makes it less amenable to vg giraffe's indexing strategies.
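
One rough way to check whether the two graphs really differ in topology is to compare simple connectivity statistics computed from the GFA files. The sketch below is not part of vg and does not measure what the distance index itself computes; it only counts segments and links and looks at how many links touch each segment. The function name `gfa_stats` and the `high_degree_cutoff` threshold are made up for illustration, and it assumes GFA 1 files with tab-separated `S` and `L` lines.

```python
from collections import Counter


def gfa_stats(lines, high_degree_cutoff=4):
    """Summarize connectivity of a GFA 1 graph given an iterable of text lines.

    high_degree_cutoff is an arbitrary, illustrative threshold, not a vg setting.
    """
    segments = 0
    links = 0
    total_bp = 0
    degree = Counter()  # number of link endpoints touching each segment

    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "S" and len(fields) >= 3:
            segments += 1
            degree[fields[1]] += 0  # register segments that have no links
            if fields[2] != "*":  # "*" means the sequence is not stored inline
                total_bp += len(fields[2])
        elif fields[0] == "L" and len(fields) >= 5:
            links += 1
            degree[fields[1]] += 1  # "from" segment
            degree[fields[3]] += 1  # "to" segment

    return {
        "segments": segments,
        "links": links,
        "total_bp": total_bp,
        "mean_degree": 2 * links / segments if segments else 0.0,
        "max_degree": max(degree.values(), default=0),
        "high_degree_segments": sum(
            1 for d in degree.values() if d > high_degree_cutoff
        ),
    }


# Example usage (file names are placeholders):
# for path in ("small.gfa", "large.gfa"):
#     with open(path) as handle:
#         print(path, gfa_stats(handle))
```

A graph that is mostly a chain of simple bubbles should show a low mean degree and few high-degree segments. If the smaller file shows many segments with a large number of links, that would be consistent with the collapsed, tangled topology described above, although these numbers are only a crude proxy.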
