Reference genome for microbiome WGS metagenomics?
4.8 years ago
richard • 0

Which reference database is really the best? I know MetaPhlAn's database has 17,000 species and 1M marker genes, but the Ensembl database has 40,000 bacterial genomes. Any ideas on which is best?

microbiome metagenome metaphlan2 ensembl • 1.5k views

Well, NCBI has 228,000 prokaryotic genomes as of today.

It will depend on what you need or want to get from your dataset.


The goal is to do taxonomic profiling of WGS shotgun sequencing data from human stool samples. I'm trying to understand the benefits of querying a larger database like NCBI or Ensembl versus using the smaller, curated database of MetaPhlAn2.


Larger databases are going to have redundancy, which will cause problems with alignments and multi-mapping. It sounds like the paper referenced by @Asaf below may be the way to go.
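To make the multi-mapping point concrete, here is a toy sketch. It uses exact substring matching as a stand-in for a real aligner, and all sequences and genome names are made up for illustration. The same read hits one genome in a small database but several near-identical genomes once redundant strains are added.

```python
def hits(read, genomes):
    """Return the names of all genomes that contain the read exactly.

    Exact substring search is only a stand-in for real alignment.
    """
    return sorted(name for name, seq in genomes.items() if read in seq)


# Hypothetical toy database: one representative genome per species.
small_db = {
    "species_A_rep": "ACGTTGCAAGGCTTAC",
    "species_B_rep": "TTGACCGGATACCGTA",
}

# Same database plus two near-identical strains of species A.
large_db = dict(
    small_db,
    species_A_strain2="ACGTTGCAAGGCTTAT",
    species_A_strain3="ACGTTGCAAGGATTAC",
)

read = "GTTGCAAGG"

unique_hit = hits(read, small_db)   # one genome: unambiguous placement
multi_hit = hits(read, large_db)    # three genomes: the read multi-maps
```

In the small database the read can be placed unambiguously; in the larger one the tool has to decide how to handle three equally good hits.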


What do you want to achieve? What kind of samples do you have?

You can also subset the NCBI/Ensembl databases if you have specific targets or can exclude specific bacteria. That way you get high resolution for your target group.

EDIT: Another strategy is to pre-screen your data with a subset of bacterial genomes, for example all RefSeq assemblies tagged "Reference/Representative". Then you download all genomes related to the genomes found in the pre-screening.

EDIT2: So you have sequencing data you want to screen? What type of data?
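The "Reference/Representative" pre-screening subset mentioned above could be selected roughly like this. This is a minimal sketch, assuming you have the text of an NCBI `assembly_summary.txt` file whose header line names the columns `assembly_accession`, `refseq_category`, `organism_name` and `ftp_path`. The function name and the toy rows are hypothetical, and the real file has many more columns.

```python
def select_representatives(summary_text):
    """Return (accession, organism, ftp_path) tuples for assemblies whose
    refseq_category is 'reference genome' or 'representative genome'."""
    header = None
    rows = []
    for line in summary_text.splitlines():
        if line.startswith("#"):
            # The header is a comment line that holds the column names.
            if "assembly_accession" in line:
                header = line.lstrip("#").strip().split("\t")
            continue
        if line.strip():
            rows.append(line.split("\t"))
    if header is None:
        raise ValueError("no header line with column names found")

    # Look columns up by name so extra columns in the real file do not matter.
    col = {name: i for i, name in enumerate(header)}
    wanted = {"reference genome", "representative genome"}
    return [
        (f[col["assembly_accession"]], f[col["organism_name"]], f[col["ftp_path"]])
        for f in rows
        if f[col["refseq_category"]] in wanted
    ]


# Made-up rows with a reduced set of columns, for illustration only.
EXAMPLE_SUMMARY = (
    "# assembly_accession\trefseq_category\torganism_name\tftp_path\n"
    "GCF_000000001.1\treference genome\tExample species A\thttps://example.org/A\n"
    "GCF_000000002.1\tna\tExample species A\thttps://example.org/A2\n"
    "GCF_000000003.1\trepresentative genome\tExample species B\thttps://example.org/B\n"
)

selected = select_representatives(EXAMPLE_SUMMARY)
```

The returned `ftp_path` values can then be passed to whatever download step you prefer.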

Many samples at high throughput, or just a few with a more in-depth analysis?


Ideally an in-depth analysis, but for many samples. At first I want to do taxonomy, but later I may look into the gene pathways present.


I would suggest using the corrected GTDB from this paper: https://www.biorxiv.org/content/10.1101/712166v1.full.pdf. See also this GitHub repository: https://github.com/rrwick/Metagenomics-Index-Correction


Thank you, I will check it out.

4.8 years ago
mschmid ▴ 180

If you have many samples, want to do an in-depth analysis and have enough time, I personally would do the following:

1) Get a basic set of RefSeq or Ensembl Bacteria genome assemblies covering all taxonomic groups. Either take all of them or subset them sensibly (for example, by taking representative genomes). You can do the same for fungi and protists, or other eukaryotes, if you think you might have them in the sample. I guess archaea are not necessary, but there are not that many, so you might as well add some of them. You can remove genomes from surveillance projects, as many of them can be virtually identical.

2) With those, do a first screening of all samples to roughly identify what you have. I would use something like Kraken2 or Centrifuge.

3) Now extend the groups you find with closely related genomes from RefSeq and maybe GenBank. Be careful with GenBank genomes, as they could have the wrong taxonomic annotation, so it may be worth checking this. You can do the same for bacteria, fungi, protists and whatever else you have. Kick out the genomes where you do not see any hits. Maybe add some or all viral RefSeq genomes.

4) Now do another screening and check whether you seem to have a good representation of what is there. You can check the reads that did not get any hits with BLAST or similar to get an indication of what you are missing.

5) Improve your DB a bit more if necessary.

6) Enjoy :)
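For steps 2 to 4, a small script can summarise each screening round: which species passed a read threshold (candidates to extend in step 3) and what fraction of reads stayed unclassified (a rough check for step 4). The sketch below assumes the standard six-column, tab-separated Kraken2 report (percentage, clade reads, directly assigned reads, rank code, taxid, indented name). The function names, the threshold and the toy report are hypothetical, and reports written with extra minimizer columns would need different indices.

```python
def parse_kraken2_report(text):
    """Parse a standard 6-column Kraken2 report into a list of dicts."""
    records = []
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        records.append({
            "clade_reads": int(fields[1]),
            "direct_reads": int(fields[2]),
            "rank": fields[3].strip(),
            "taxid": fields[4].strip(),
            "name": fields[5].strip(),  # names are indented by taxonomic depth
        })
    return records


def screening_summary(records, min_clade_reads=100):
    """Return (species above threshold, fraction of unclassified reads)."""
    # Every read is directly assigned to exactly one line of the report.
    total = sum(r["direct_reads"] for r in records)
    unclassified = sum(r["clade_reads"] for r in records if r["rank"] == "U")
    species = [
        (r["name"], r["clade_reads"])
        for r in records
        if r["rank"] == "S" and r["clade_reads"] >= min_clade_reads
    ]
    species.sort(key=lambda item: item[1], reverse=True)
    fraction = unclassified / total if total else 0.0
    return species, fraction


# Made-up report for illustration; species names and taxids are not real.
EXAMPLE_REPORT = "\n".join([
    " 20.00\t200\t200\tU\t0\tunclassified",
    " 80.00\t800\t10\tR\t1\troot",
    " 79.00\t790\t10\tD\t2\t  Bacteria",
    " 60.00\t600\t600\tS\t900001\t    Example species A",
    " 15.00\t150\t150\tS\t900002\t    Example species B",
    "  3.00\t30\t30\tS\t900003\t    Example species C",
])

species, unclassified_fraction = screening_summary(
    parse_kraken2_report(EXAMPLE_REPORT)
)
```

If the unclassified fraction stays high after extending the database, that is a hint to inspect the unassigned reads as described in step 4.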
