Question

Centrifuge Index creation issue

0

8 months ago

skbrimer ▴ 740

Greetings Hive mind!

I'm trying to build a newer index for Centrifuge (the downloadable one is out of date by 4 years) and I'm running into a memory issue. The input file for the complete archaeal, bacterial, and viral RefSeq index is 171GB, which means I would need something like 600-700 GB of RAM to index it. Currently I do not have access to a cloud solution (e.g. EC2). Since I do not have the needed RAM, I decided to split the input file into smaller files with at most 1000 FASTA records each.
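For reference, the chunking step described above could look roughly like the sketch below. It is only an illustration: the function name `split_fasta`, the `group_` prefix, and the input file name in the usage comment are assumptions, not the exact commands that were run.

```shell
# Sketch: split a multi-FASTA file into chunks of at most N records each.
# split_fasta is a hypothetical helper; all file names are illustrative.
split_fasta() {
    # $1 = input FASTA, $2 = max records per chunk, $3 = output prefix
    awk -v n="$2" -v prefix="$3" '
        /^>/ {
            # Open a new output file every n header lines.
            if (count % n == 0) {
                if (out != "") close(out)
                out = sprintf("%s%04d.fa", prefix, ++chunk)
            }
            count++
        }
        out != "" { print > out }   # header and sequence lines go to the current chunk
    ' "$1"
}

# Usage (hypothetical file name):
#   split_fasta refseq_abv.fa 1000 group_
```

Closing each finished chunk keeps the number of open file handles at one, which matters when the split produces thousands of files.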

My question is: if I run all the individual chunks and create all the FM index files, can I combine them into just the 4 expected files? If I understand correctly, the FM index would concatenate all the seqs as one seq and then index it, so I'm not sure that combining multiple indexes would still work.
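The concern can be checked on a toy example. The sketch below computes a naive Burrows-Wheeler transform (the core of an FM index) by sorting rotations, and compares the per-chunk results with the transform of the joined text. This is purely illustrative: the `bwt` helper and the toy strings are assumptions, and real index builders do not work this way.

```shell
# Toy illustration only: naive Burrows-Wheeler transform via sorted rotations.
# bwt is a hypothetical helper, far too slow for real genomes.
bwt() {
    printf '%s\n' "$1" | awk '{
        s = $0 "$"                      # append a terminator that sorts first
        n = length(s)
        for (i = 1; i <= n; i++)        # emit every rotation of the text
            print substr(s, i) substr(s, 1, i - 1)
    }' | LC_ALL=C sort | awk '
        { printf "%s", substr($0, length($0)) }   # last column of sorted rotations
        END { print "" }'
}

chunk1="ACA"
chunk2="CAC"
echo "per-chunk BWTs: $(bwt "$chunk1") $(bwt "$chunk2")"
echo "BWT of joined:  $(bwt "$chunk1$chunk2")"
```

The two lines differ because the sort runs over all suffixes of the whole text at once, so gluing the per-chunk outputs together does not give the index of the combined sequence.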

I tried to feed the small chunks to centrifuge using a quick bash loop, but it overwrites the previous index when I try that way.

Any thoughts on this process will be greatly appreciated.

Thank you, Sean

EDIT 19AUG24: Code I used in an attempt to feed Centrifuge chunks of the reference file.

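# Every iteration writes to the same output prefix (abv),
# so each build overwrites the previous one.
for i in group*.fa; do
    centrifuge-build -p4 --conversion-table seqid2taxid.map \
        --taxonomy-tree taxonomy/nodes.dmp \
        --name-table taxonomy/names.names.dmp \
        "$i" abv
done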
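A variant of the loop that avoids the overwrite is sketched below, deriving a separate output prefix from each chunk's file name. The helper `index_prefix` and the `abv_` naming scheme are assumptions made for illustration. This only keeps the per-chunk indexes apart; it does not produce a single combined index.

```shell
# Sketch: give every chunk its own output prefix so no build overwrites another.
# index_prefix is a hypothetical helper; the abv_ naming is an assumption.
index_prefix() {
    base=${1##*/}                     # strip any directory part
    printf 'abv_%s\n' "${base%.fa}"   # strip the .fa extension
}

for f in group*.fa; do
    [ -e "$f" ] || continue           # skip the literal pattern when nothing matches
    centrifuge-build -p4 --conversion-table seqid2taxid.map \
        --taxonomy-tree taxonomy/nodes.dmp \
        --name-table taxonomy/names.names.dmp \
        "$f" "$(index_prefix "$f")"
done
```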

metagenomics centrifuge • 986 views

ADD COMMENT • link 8 months ago by skbrimer ▴ 740

0

Please post the script code so that other users can help debug it.

ADD REPLY • link 8 months ago by 1769mkc ★ 1.3k

0

Hi, thank you for reading this and replying. I understand why my code was not working how I had hoped. My question is more about whether, if I make an index from each of the groups and then concatenate them, it would still work.

ADD REPLY • link 8 months ago by skbrimer ▴ 740

0

If you can show what you are trying or have tried, people can help. Sharing the version, code, or files makes it possible to reproduce the issue locally.

ADD REPLY • link 8 months ago by 1769mkc ★ 1.3k

score 1 · Answer 1 · 2024-08-17

1

8 months ago

mourisl ▴ 30

It is non-trivial to concatenate FM indexes, because the index relies on the global suffix order. Since you have already downloaded the reference files, you may try Centrifuger (https://github.com/mourisl/centrifuger). Its "centrifuger-build" has similar options to "centrifuge-build", plus the option --build-mem, which tries to build the index within the specified memory. The resulting index cannot be used with Centrifuge, though. The index building step might be slow if the memory limit is too small. How much memory do you have?

There is also a recent Centrifuger index at https://zenodo.org/records/10023239, if it works for you.
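A memory-capped build might be invoked roughly as in the sketch below. Only --build-mem is named above; the remaining flags are assumed to mirror centrifuge-build and should be confirmed against the usage text of centrifuger-build. The 200G cap, the input file name, and the output prefix are placeholders. The helper only prints the command so it can be reviewed before running.

```shell
# Sketch only: flags other than --build-mem are assumptions based on the
# similarity to centrifuge-build; confirm them before running.
build_cmd() {
    # $1 = reference FASTA, $2 = memory cap (e.g. 200G), $3 = output prefix
    printf '%s ' centrifuger-build -t 4 \
        -r "$1" \
        --conversion-table seqid2taxid.map \
        --taxonomy-tree taxonomy/nodes.dmp \
        --name-table taxonomy/names.names.dmp \
        --build-mem "$2" \
        -o "$3"
    printf '\n'
}

# Print the command for review (hypothetical file name and prefix).
build_cmd refseq_abv.fa 200G abv_cfr
```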

ADD COMMENT • link 8 months ago by mourisl ▴ 30

0

It is non-trivial to concatenate FM indexes, because the index relies on the global suffix order.

This is what I was wondering about, and why I suspected that my idea of breaking the input into chunks was not the correct solution. I will look into Centrifuger today. It looks like the way I will need to go. I have 256GB of RAM available (if I kick people off the server), so this looks like a win!

ADD REPLY • link 8 months ago by skbrimer ▴ 740