What is the best tool for aligning millions of 2x101 bp Illumina reads against the NCBI nt database (100+ GB) using global alignment, on a machine with 50 GB of RAM?
Bowtie2 keeps crashing when I try to build the index, even if I divide the database into smaller parts. It finds parameters and passes the memory test, but then gets "Killed" in the sorting stage. It did work when I divided nt into 5 GB pieces (but indexing took forever).
The idea is to find novel sequences. I also thought about using BBDuk to filter the reads against nt, but I'm not sure I'll have the RAM.
BBMap can be used to align to nt. Its indexing is very fast compared to Burrows-Wheeler-transform-indexed tools such as Bowtie2.
BBMap uses around 6 bytes of RAM per reference base, or around 3 bytes in low-memory mode (with the "usemodulo" flag), so you would need to subdivide nt accordingly. BBDuk, on the other hand, would be far faster... but it uses ~20 bytes per reference base, so you'd need to subdivide it even more.
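To make the subdivision concrete, here's a quick back-of-the-envelope sketch in plain Python using the per-base figures above. The 100 GB nt size and 50 GB RAM budget come from your question; the ~1 byte per base on disk and the 10% headroom for the JVM and OS are my assumptions, so adjust to taste:

```python
import math

# Back-of-the-envelope chunk sizing for searching nt.
# Assumptions: nt is ~100 GB of FASTA (~1 byte per base on disk),
# 50 GB of RAM total, and ~10% reserved for the JVM and OS.
NT_BASES = 100e9
RAM_BYTES = 50e9 * 0.9  # usable RAM after assumed headroom

tools = {
    "BBMap (default, ~6 B/base)":   6,
    "BBMap (usemodulo, ~3 B/base)": 3,
    "BBDuk (default, ~20 B/base)":  20,
}

for name, bytes_per_base in tools.items():
    max_chunk_bases = RAM_BYTES / bytes_per_base   # largest chunk that fits
    chunks = math.ceil(NT_BASES / max_chunk_bases) # passes over nt needed
    print(f"{name}: max chunk ~{max_chunk_bases/1e9:.1f} Gbp -> {chunks} chunks")
```

By this estimate you'd be looking at roughly 14 chunks for BBMap with defaults, 7 with usemodulo, and ~45 for BBDuk at its default memory footprint, which is why the "speed" flag below matters.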
Edit - you can, however, use BBDuk's "speed" flag to reduce memory usage and increase speed at the expense of sensitivity. Any method you use to align millions of reads against nt is going to be slow, so filtering out as many as possible beforehand is probably prudent. The "speed" flag ignores a fraction of the kmer space: "speed=0" uses all reference kmers; "speed=1" ignores 1/16 of the reference kmers (reducing memory consumption by 1/16); and the maximum, "speed=15", reduces memory consumption by 15/16, to a little over 1 byte per reference base. Sensitivity is not affected much for ~150bp reads and genome-size references up to around speed 12 (75% memory reduction).
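For reference, the scaling works out like this. This is just a sketch of the arithmetic above (each step of "speed" skipping 1/16 of the reference kmers off the ~20 B/base baseline), not BBDuk's actual internals:

```python
# Approximate BBDuk memory per reference base as a function of the
# "speed" flag, per the description above: speed=s ignores s/16 of
# the reference kmers, scaling the ~20 B/base baseline linearly.
BASELINE_BYTES_PER_BASE = 20  # BBDuk default, from above

for speed in (0, 1, 4, 8, 12, 15):
    bytes_per_base = BASELINE_BYTES_PER_BASE * (16 - speed) / 16
    kmers_used_pct = 100 * (16 - speed) / 16
    print(f"speed={speed:2d}: ~{bytes_per_base:5.2f} B/base "
          f"({kmers_used_pct:.0f}% of reference kmers kept)")
```

At speed=12 that's 5 B/base, so by the same chunk arithmetic as before, a 50 GB machine could hold roughly a 9 Gbp chunk per BBDuk pass instead of ~2 Gbp.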
Really detailed and helpful response. Thank you!