Question

Reference database for metagenomics

0

7.9 years ago

pignottisimone ▴ 30

I am comparing current metagenomics tools, and I am having trouble finding a good, complete, and up-to-date reference database. Ideally it would be a selection of bacterial genomes from NCBI RefSeq with representatives of each species, covering strains with high phylogenetic diversity, as proposed in GEBA. It should also be easy to download: I don't find NCBI very user-friendly (it is hard to select the genomes of interest, and downloading file by file over FTP takes ages, or perhaps I am simply not doing it properly). The best option I have found so far is HMP, but I would prefer a complete bacterial database. Another option would be SILVA, but I would like to compare performance on whole genomes rather than on 16S only.

Do you know any free databases with these characteristics? What do people use as reference databases when dealing with metagenomics? Thanks in advance for any suggestion.

genome database NCBI taxonomy metagenomics • 4.3k views

0

I think it really depends on the metagenomics project, but in general a database of the full reference genomes of bacteria, viruses, archaea, and environmental samples would make a good starting point for genome-sequence-based comparison.

ADD REPLY • link 7.9 years ago by Sej Modha 5.3k

0

Thank you for your answer. For now the comparison will be limited to a simulated read set, so that would be more than a good start. The problem is which database to get, where, and how. Do you have any advice? Furthermore, tools like Kraken build huge databases, and construction takes more than 100 GB of RAM even for the old bacterial RefSeq (~2500 sequences). That's why I am interested in selecting the "best" candidates to build the database from.

ADD REPLY • link 7.9 years ago by pignottisimone ▴ 30

score 1 · Accepted Answer · 2017-01-18

1

7.9 years ago

5heikki 11k

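#!/bin/bash
mkdir ref_prok_rep_genomes
cd ref_prok_rep_genomes
# Quote the URL so the ?? wildcard is expanded by wget on the FTP server, not by the shell
wget 'ftp://ftp.ncbi.nlm.nih.gov/blast/db/ref_prok_rep_genomes.??.tar.gz'
# tar reads one archive per call, so extract the volumes one at a time
for f in ref_prok_rep_genomes.??.tar.gz; do tar zxvf "$f"; done
#Could play with -outfmt to get easier parsing for a tax map
# Dump every sequence in the database into a single FASTA file
blastdbcmd -db ref_prok_rep_genomes -entry all > ref_prok_rep_genomes.fna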

These are the reference/representative archaeal and bacterial genomes (a ~21 GB file). What counts as "representative" is of course relative: for example, this database includes just one Salmonella genome (Salmonella enterica subsp. enterica serovar Typhi str. CT18).
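
As the comment in the script hints, blastdbcmd's -outfmt option can emit an accession-to-taxid table directly, without parsing FASTA headers. A minimal sketch, assuming the BLAST+ tools are on PATH and the database has been extracted as above; the function name make_tax_map and the output file name are made up for illustration and are not part of the original answer:

```bash
#!/bin/bash
# Hypothetical helper: print one "accession<TAB>taxid" line per sequence
# in a BLAST database. In blastdbcmd's -outfmt syntax, %a is the accession
# and %T is the taxid. The space between the two fields is turned into a
# tab afterwards (neither field contains spaces).
make_tax_map() {
    local db="$1"
    blastdbcmd -db "$db" -entry all -outfmt '%a %T' | tr ' ' '\t'
}

# Usage (run inside the directory holding the extracted database):
#   make_tax_map ref_prok_rep_genomes > ref_prok_rep_genomes.acc2taxid.tsv
```

Whether the resulting table can be fed to a given classifier as-is depends on the mapping format that tool expects, so treat it as a starting point.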

0

That's great, thanks! I didn't think of checking the BLAST directory.

ADD REPLY • link 7.9 years ago by pignottisimone ▴ 30

score 0 · Accepted Answer · 2017-01-18

I also want to share what I came up with, even though 5heikki's answer works very well for my purposes. I found a somewhat more modular way:

# Columns used: $5 refseq_category, $11 version_status, $12 assembly_level, $20 ftp_path
awk -F "\t" -v OFS="\t" '$12=="Complete Genome" && $11=="latest" &&
$5~/^(reference genome|representative genome)$/ {print $20}' \
assembly_summary_refseq.txt | awk 'BEGIN{FS=OFS="/";filesuffix="genomic.fna.gz"}
{ftpdir=$0;asm=$10;file=asm"_"filesuffix;print ftpdir,file}' > ftpfilepaths

This writes the FTP paths of the latest versions of all complete reference/representative genomes in RefSeq's bacterial database to the file 'ftpfilepaths', which you can then download with wget -i ftpfilepaths. To obtain their taxa instead:

awk -F "\t" -v OFS="\t" '$12=="Complete Genome" && $11=="latest" &&
$5~/^(reference genome|representative genome)$/ {print $1, $7}' \
assembly_summary_refseq.txt > acc2taxid.map

In acc2taxid.map, the first column contains the assembly accessions (column 1 of the summary file) and the second column the species-level taxid of each assembly (use $6 instead of $7 for the strain-level taxid).
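
To see what the filter keeps and drops without downloading anything, it can be run on a tiny stand-in for assembly_summary_refseq.txt. This is only a sketch: the accessions, taxids, and example.org paths below are invented, and the names make_row, filter_complete_refs, and toy_assembly_summary.txt are hypothetical. Only the seven columns the filter touches are filled in; the rest hold a "." placeholder.

```bash
#!/bin/bash
# Build one 20-column, tab-separated row.
# args: accession refseq_category taxid species_taxid version_status assembly_level ftp_path
make_row() {
    printf '%s\t.\t.\t.\t%s\t%s\t%s\t.\t.\t.\t%s\t%s\t.\t.\t.\t.\t.\t.\t.\t%s\n' "$@"
}

# Same conditions as the filter above, printing the assembly accession,
# the species-level taxid ($7) and the strain-level taxid ($6).
filter_complete_refs() {
    awk -F '\t' -v OFS='\t' '
        $12 == "Complete Genome" && $11 == "latest" &&
        $5 ~ /^(reference genome|representative genome)$/ { print $1, $7, $6 }
    ' "$1"
}

# Three made-up rows: only the first satisfies all three conditions.
{
    make_row GCF_000000001.1 'reference genome' 1001 1000 latest 'Complete Genome' \
        ftp://example.org/genomes/all/GCF/000/000/001/GCF_000000001.1_ASM1v1
    make_row GCF_000000002.1 na 2001 2000 latest 'Complete Genome' \
        ftp://example.org/genomes/all/GCF/000/000/002/GCF_000000002.1_ASM2v1
    make_row GCF_000000003.1 'representative genome' 3001 3000 latest Scaffold \
        ftp://example.org/genomes/all/GCF/000/000/003/GCF_000000003.1_ASM3v1
} > toy_assembly_summary.txt

filter_complete_refs toy_assembly_summary.txt
```

The second row is dropped because its category is "na", and the third because its assembly level is "Scaffold" rather than "Complete Genome".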