kraken2 bacteria database 250GB+


19 months ago

10mz1 ▴ 10

Hi all,

I've been trying to set up Kraken2 locally, and the standard database is clearly far too large for my machine. The 16S Greengenes database works fine, so now I'd like to try something larger: the entire bacterial database. I don't understand why it takes up so much space on my local PC. The Kraken 2 manual states that the entire standard database (which, if I am not mistaken, consists of the bacterial library plus others) should take around 100 GB. I tried downloading and building only the bacterial database locally; it had reached 254 GB when the process failed because the disk filled up completely. What gives?

metagenomics 16s kraken2 kraken • 2.8k views

ADD COMMENT • link updated 6 months ago by rDNA ▴ 20 • written 19 months ago by 10mz1 ▴ 10


Where did you get the bacterial genomes/sequences, and using what criteria? The standard pre-built Kraken database seems to be restricted to RefSeq and is only 70 GB.

ADD REPLY • link 19 months ago by GenoMax 151k


I followed the command here: https://github.com/DerrickWood/kraken2/wiki/Manual#custom-databases

kraken2-build --download-library bacteria --db $DBNAME

ADD REPLY • link 19 months ago by 10mz1 ▴ 10
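For context, the download step above is only one part of a custom build. The Kraken 2 manual also describes `kraken2-build --clean`, which removes intermediate files from the database directory once the build has finished, so the space used while downloading and building can be much larger than the finished database. A minimal sketch of the full sequence with a free-space check up front follows; the helper names (`free_gb`, `enough_space`, `build_bacteria_db`) and the 300 GB default are illustrative assumptions, not values from the manual.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and the 300 GB default are illustrative assumptions.

# Print free space, in whole gigabytes, on the filesystem holding path $1.
# df -Pk gives POSIX-format output in 1 KB blocks; column 4 is "Available".
free_gb() {
    df -Pk "$1" | awk 'NR == 2 { printf "%d\n", $4 / 1024 / 1024 }'
}

# Succeed only if the filesystem holding path $1 has at least $2 GB free.
enough_space() {
    [ "$(free_gb "$1")" -ge "$2" ]
}

# Download taxonomy and the bacterial library, build, then remove
# intermediate files. Defined here but not called automatically.
build_bacteria_db() {
    local db="$1" need_gb="${2:-300}"
    mkdir -p "$db"
    if ! enough_space "$db" "$need_gb"; then
        echo "need at least ${need_gb} GB free for $db" >&2
        return 1
    fi
    kraken2-build --download-taxonomy --db "$db" &&
    kraken2-build --download-library bacteria --db "$db" &&
    kraken2-build --build --db "$db" &&
    kraken2-build --clean --db "$db"
}
```

Calling `build_bacteria_db "$DBNAME"` would then refuse to start when the target filesystem is short on space, rather than failing partway through the download.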


I guess the bacterial content at NCBI could have grown significantly since the manual was written.

ADD REPLY • link 19 months ago by GenoMax 151k


It is a good idea to work with as small a database as possible to get the job done. Still, 6-10 TB hard drives are available for $150-300. Given that affordability, I think no research should suffer because of disk space these days.

ADD REPLY • link 19 months ago by Mensur Dlakic ★ 29k


If 250 GB of data filled up OP's local disk, then it is difficult to imagine that there is enough RAM available to go with the disk.

ADD REPLY • link 19 months ago by GenoMax 151k
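The RAM concern comes from the Kraken 2 manual's note that, to run efficiently, Kraken 2 needs enough free memory to hold the database (primarily the hash table) in RAM. A rough check can be sketched as below, assuming a Linux host with `/proc/meminfo` and a built database whose hash table is `hash.k2d`; the function names are hypothetical, and total RAM is only an upper bound on what is actually free.

```shell
#!/usr/bin/env bash
# Sketch only: assumes Linux (/proc/meminfo); function names are illustrative.

# Print the size of file $1 in kilobytes, rounded up.
file_kb() {
    local bytes
    bytes=$(wc -c < "$1")
    echo $(( (bytes + 1023) / 1024 ))
}

# Print total physical RAM in kilobytes (Linux only).
total_ram_kb() {
    awk '/^MemTotal:/ { print $2 }' /proc/meminfo
}

# Succeed if file $1 is no larger than $2 kilobytes.
# When $2 is omitted, total RAM is used as the limit.
fits_in_ram() {
    local ram_kb="${2:-$(total_ram_kb)}"
    [ "$(file_kb "$1")" -le "$ram_kb" ]
}
```

For example, `fits_in_ram "$DBNAME/hash.k2d" || echo "hash table larger than RAM"`. If the check fails, the manual describes two options worth reading up on: running `kraken2` with `--memory-mapping`, which avoids loading the database into RAM at the cost of speed, or building with `kraken2-build --max-db-size` to cap the hash table size.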


"It is a good idea to work with as small a database as possible to get the job done." What is your reasoning for this? One could argue that a larger database performs better, as shown, for example, by Pochon et al. 2023 (figure 8): https://link.springer.com/content/pdf/10.1186/s13059-023-03083-9.pdf

ADD REPLY • link 6 months ago by rDNA ▴ 20
