12 months ago
10mz1
Hi all,
I've been trying to install Kraken2 locally, and the main database is clearly far too large for my machine. I got the 16S Greengenes database working fine, so now I'd like to try a larger one: the entire bacterial database. I don't understand why it is so huge on my local PC. For example, the Kraken2 manual states that the entire standard database (which consists of the bacterial one and others, if I am not mistaken) should take around 100 GB. I tried downloading and building only the bacterial database locally; it had reached 254 GB when the build failed because the disk filled up completely. What gives?
From where, and using what criteria, did you get the bacterial genomes/sequences? The standard pre-built Kraken database seems to restrict itself to RefSeq and is only 70 GB.
I followed the command here: https://github.com/DerrickWood/kraken2/wiki/Manual#custom-databases
kraken2-build --download-library bacteria --db $DBNAME
I guess the bacterial content at NCBI has grown significantly since the manual was written.
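Since the download size can change over time, it may help to check free disk space before starting. The following is a minimal sketch, assuming a POSIX shell with `df` and `awk`. The helper `has_free_gb`, the default `DBNAME` path, and the 500 GB threshold are hypothetical choices for illustration, not Kraken2 recommendations. The `--clean` step uses the `kraken2-build` option that removes intermediate files once the build has finished.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the filesystem holding DIR has at least
# NEED_GB gigabytes available. `df -Pk` prints sizes in 1 KiB blocks in the
# POSIX single-line format; column 4 of the second line is "Available".
has_free_gb() {
    dir=$1
    need_gb=$2
    avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ "$avail_kb" -ge $((need_gb * 1024 * 1024)) ]
}

DBNAME=${DBNAME:-./kraken2_bacteria}   # assumed database location
NEED_GB=${NEED_GB:-500}                # assumed threshold, adjust as needed

if ! command -v kraken2-build >/dev/null 2>&1; then
    echo "kraken2-build not found on PATH; nothing to do"
elif has_free_gb "$(dirname "$DBNAME")" "$NEED_GB"; then
    kraken2-build --download-taxonomy --db "$DBNAME"
    kraken2-build --download-library bacteria --db "$DBNAME"
    kraken2-build --build --db "$DBNAME"
    # Remove intermediate files (downloaded sequences etc.) after the build.
    kraken2-build --clean --db "$DBNAME"
else
    echo "Less than ${NEED_GB} GB free; not starting the download"
fi
```

The check only looks at the space available when the script starts, so it is a guard against an obviously doomed download rather than a guarantee that the build will fit.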
It is a good idea to work with as small a database as possible to get the job done. Still, 6-10 TB hard drives are available for $150-300. Given that affordability, I don't think any research should suffer these days for lack of disk space.
If 250GB of data filled up OP's local disk then it is difficult to imagine that there is enough RAM available to go with the disk.
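By default Kraken2 loads the database into memory, so the RAM point can be checked directly once a database has been built. A rough sketch, assuming Linux (it reads `/proc/meminfo`) and assuming the built database files carry the `.k2d` extension inside the database directory. The helper names `db_size_bytes` and `mem_total_kb` and the default `DBNAME` path are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: total size in bytes of the *.k2d files in a directory.
# Prints 0 if the directory has no such files.
db_size_bytes() {
    total=0
    for f in "$1"/*.k2d; do
        [ -f "$f" ] || continue
        bytes=$(wc -c < "$f" | tr -d ' ')
        total=$((total + bytes))
    done
    echo "$total"
}

# Hypothetical helper: total physical memory in KiB (Linux only).
mem_total_kb() {
    awk '/^MemTotal:/ { print $2 }' /proc/meminfo 2>/dev/null
}

DBNAME=${DBNAME:-./kraken2_bacteria}   # assumed database location
db_bytes=$(db_size_bytes "$DBNAME")
mem_kb=$(mem_total_kb)

if [ -z "$mem_kb" ]; then
    echo "Could not read /proc/meminfo; skipping the comparison"
elif [ "$db_bytes" -gt $((mem_kb * 1024)) ]; then
    echo "Database (${db_bytes} bytes) is larger than total RAM"
else
    echo "Database (${db_bytes} bytes) fits within total RAM"
fi
```

If the database turns out to be larger than RAM, the `kraken2` option `--memory-mapping` avoids loading it into memory, at the cost of slower classification.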
"It is a good idea to work with as small a database as possible to get the job done." What is your argument for this? One could argue that a larger database performs better, as shown, for example, by Pochon et al. 2023 (figure 8): https://link.springer.com/content/pdf/10.1186/s13059-023-03083-9.pdf