Greetings Hive mind!
I'm trying to build a newer index for Centrifuge (the downloadable one is about 4 years out of date) and I'm running into a memory issue. The input file for the complete archaeal, bacterial, and viral RefSeq index is 171 GB, which means I would need roughly 600-700 GB of RAM to index it. I currently do not have access to a cloud solution (e.g. EC2). Since I do not have the needed RAM, I decided to chunk the input file into smaller files with at most 1000 FASTA records each.
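In case it helps, here is a minimal sketch of the chunking step I mean, assuming the combined reference is a single multi-FASTA file (all-refseq.fna is a placeholder name, not my real file): it starts a new output file every 1000 header lines.

# split a multi-FASTA into chunks of at most 1000 sequences each
# (all-refseq.fna stands in for the combined 171 GB reference)
awk -v size=1000 '
    /^>/ { if (n % size == 0) { if (out) close(out); out = sprintf("group%05d.fa", ++chunk) }; n++ }
    { print > out }
' all-refseq.fna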
My question is: if I run all the individual chunks and create all of the FM-index files, can I combine them into just the 4 expected files? If I understand correctly, the FM index concatenates all the sequences into one sequence and then indexes it, so I'm not sure whether combining multiple indexes would still work.
I tried to feed the small chunks to Centrifuge using a quick bash loop, but each run overwrites the previous index when I do it that way.
Any thoughts on this process would be greatly appreciated.
Thank you, Sean
EDIT 19AUG24: Code I used in an attempt to feed Centrifuge chunks of the reference file.
for i in group*.fa; do
    centrifuge-build -p 4 --conversion-table seqid2taxid.map \
        --taxonomy-tree taxonomy/nodes.dmp \
        --name-table taxonomy/names.dmp \
        "$i" abv
done
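For context on the overwriting: every iteration in the loop above writes to the same output basename (abv), so each chunk's index replaces the previous one. A minimal sketch that instead keeps one index per chunk (same map and taxonomy files as above) might look like:

for i in group*.fa; do
    # use the chunk name as the index prefix so each chunk keeps its own index files
    centrifuge-build -p 4 --conversion-table seqid2taxid.map \
        --taxonomy-tree taxonomy/nodes.dmp \
        --name-table taxonomy/names.dmp \
        "$i" "${i%.fa}_index"
done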
Please post the script code so that other users can help and debug it.
Hi, thank you for reading this and replying. I understand now why my code was not working the way I had hoped. My question is more about whether, if I build an index from each of the groups and then concatenate them together, the result would still work.
Maybe show what you are trying or have tried; if you can share and cite the version, code, or files, people can reproduce the issue locally and help.