I feel like this is a dumb question... not the first one I've asked and it won't be the last. I'd like to do taxonomic analysis on some PacBio Sequel 2 samples using the Taxonomic-Profiling-Nucleotide workflow. Is there an easy way to create a lightweight database that's a subset of what's in NCBI nt? For this pipeline they want me to download and index that database, which will take forever, create an 850 GB mmi file, and is excessive for my purposes. I do have a metagenomic sample, but I don't need to check against everything in NCBI nt (do I?). Normally, with my short-read data I just use Kraken2, either their standard8 or standard16 (which I got here). Is there a way I can create this kind of lightweight alternative from NCBI nt?
You can use the makeblastdb program to create a local index for a subset of sequences (.fa) and use that.
Don't I need to use minimap2's indexing? I need to align using minimap2. The pipeline specifies that I should align to NCBI nt, but I want to align to a smaller DB. I don't know how to get subsets of NCBI nt that are like "all refseq bacteria and viruses," etc.
Edit the original post to say what you need. Partial information is not useful.
You can use any set of sequences to create the database, NCBI nt is simply the most comprehensive. Custom downloading based on taxonomy is not NCBI's strong suit (though it should be!), so doing this means digging around for the right tools (NCBI has limited docs on this here: https://www.ncbi.nlm.nih.gov/guide/howto/dwn-records/).
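To make the "any set of sequences" point concrete, here is a minimal sketch in Python of one way it could go: filter a FASTA file down to the records you want, then build a minimap2 index from the result. The helper names (`filter_fasta`, `minimap2_index_cmd`), the record IDs, and the file names are all hypothetical. The `map-hifi` preset is an assumption about the read type; the index should be built with whatever preset the pipeline itself uses for alignment, so check its configuration first.

```python
import io
import shlex
from typing import Iterable, Iterator, List, Set


def filter_fasta(lines: Iterable[str], keep_ids: Set[str]) -> Iterator[str]:
    """Yield the lines of FASTA records whose ID is in keep_ids.

    The ID is taken to be the first whitespace-delimited token after '>'.
    """
    keep = False
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            keep = bool(fields) and fields[0] in keep_ids
        if keep:
            yield line


def minimap2_index_cmd(fasta: str, index: str, preset: str = "map-hifi") -> List[str]:
    """Build (but do not run) a minimap2 command that writes an .mmi index.

    '-d' writes the index to a file; '-x' selects a preset. The default
    preset here is an assumption, not something the pipeline confirms.
    """
    return ["minimap2", "-x", preset, "-d", index, fasta]


# Toy input with made-up record IDs, standing in for a much larger FASTA.
demo = io.StringIO(
    ">keep_me.1 some wanted record\nACGT\nAC\n"
    ">drop_me.1 some unwanted record\nGGGG\n"
)
subset = "".join(filter_fasta(demo, {"keep_me.1"}))
print(subset, end="")

# Print the command; pass the list to subprocess.run(cmd, check=True) to execute it.
cmd = minimap2_index_cmd("subset.fa", "subset.mmi")
print(shlex.join(cmd))
```

For a real run you would stream the source FASTA from disk and write the kept records to `subset.fa` before indexing. The resulting `subset.mmi` would then go wherever the workflow expects the nt index, assuming its configuration lets you change that path.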
Alternatively, if you are set on nucleotide-alignment matches for taxonomic profiling and don't want to do the heavy lifting, I would recommend using BugSeq (https://bugseq.com). It's a cloud analysis platform that uses NCBI nt for alignment and a comparable LCA algorithm for taxon matching. It actually performed better than the PacBio nucleotide pipeline (and similarly to the PacBio protein pipeline). There are 10 free trials you can use with NCBI nt, so depending on how many samples you have, this could be the easiest way forward.