I am working on a comparison of current metagenomics tools, and I am having trouble finding a good, complete and up-to-date reference database. My ideal would be a selection of bacterial genomes from NCBI RefSeq with representatives of each species, covering strains with high phylogenetic diversity, as proposed in GEBA. Another nice feature would be easy downloading, since I don't find NCBI very user-friendly (it is not easy to select the genomes of interest, and downloading file by file over FTP takes ages, or I am simply not doing it properly). The best option I have found so far is HMP, but I would prefer a complete bacterial database. Another option would be SILVA, but I would like to compare performance on whole genomes rather than on 16S only.
Do you know of any free databases with these characteristics? What do people use as reference databases when dealing with metagenomics? Thanks in advance for any suggestions.
I think it really depends on the metagenomics project, but in general a database of the full reference genomes of bacteria, viruses and archaea, plus environmental samples, would make a good starting point for genome-sequence-based comparison.
Thank you for your answer. For now the comparison will be limited to a simulated read set, so that would be more than a good start. The problem is which genomes to pick, where to find them and how to get them. Do you have any advice? Furthermore, tools like Kraken build huge databases, and constructing them takes more than 100 GB of RAM even for an old bacterial RefSeq release (~2500 seq). That's why I am interested in selecting the "best" candidates to build the database upon.
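For the "which and how" part, one possible approach is to filter NCBI's `assembly_summary.txt` table down to a single assembly per species before downloading anything. The sketch below is only an illustration, not a recommended pipeline: the function names `read_summary` and `pick_representatives` are made up for this example, the column names (`species_taxid`, `refseq_category`, `assembly_level`, `ftp_path`) and category values are assumed to match NCBI's README for that file, and the ranking rule (reference or representative genomes first, then the most complete assembly level) is an arbitrary choice. It does not address phylogenetic diversity within a species.

```python
from typing import Dict, Iterable, List

# Lower rank = preferred. The literal values are assumptions taken from
# NCBI's README_assembly_summary.txt; check them against the file you download.
CATEGORY_RANK = {"reference genome": 0, "representative genome": 1}
LEVEL_RANK = {"Complete Genome": 0, "Chromosome": 1, "Scaffold": 2, "Contig": 3}


def read_summary(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Parse assembly_summary.txt lines into dicts keyed by column name."""
    header = None
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            fields = line.lstrip("#").strip().split("\t")
            # The column header is the comment line naming assembly_accession.
            if fields[0].strip() == "assembly_accession":
                header = [f.strip() for f in fields]
            continue
        if header is None:
            raise ValueError("no header line found before data rows")
        rows.append(dict(zip(header, line.split("\t"))))
    return rows


def pick_representatives(rows: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep one assembly per species_taxid, preferring the best-ranked one."""
    best = {}
    for row in rows:
        score = (
            CATEGORY_RANK.get(row["refseq_category"], 2),
            LEVEL_RANK.get(row["assembly_level"], 4),
        )
        key = row["species_taxid"]
        # Ties keep the first assembly seen for that species.
        if key not in best or score < best[key][0]:
            best[key] = (score, row)
    return [row for _, row in best.values()]
```

To use it, download the bacterial `assembly_summary.txt` from the RefSeq genomes area of the NCBI FTP site, then call `pick_representatives(read_summary(open("assembly_summary.txt")))` and collect the `ftp_path` value of each selected row. That gives a list of directories to fetch in one batch instead of browsing file by file.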