Hello all. My question is for people experienced in metagenomic analysis of complex microbial communities such as soil. Suppose I have several soil shotgun metagenomes and I need to classify the reads so as to recover as much of the diversity present in my data as possible. As far as I know, current full-genome databases can be divided into curated (e.g. RefSeq, progenomes) and non-curated (e.g. nr, nr-euc) ones. Using curated, moderated databases is safe, but cannot recover all the diversity in the data, because they are quite limited in size. In my case, approximately 45% of the data is classified as bacteria with the "progenomes" db and nearly 5% as fungi. Using the biggest possible db (such as nr) could theoretically classify all my sequences, but the price is (as far as I understand) an unacceptable percentage of misclassified data, because anyone can upload data of any quality to these databases. Also, I simply do not have enough RAM at the moment to run a full-metagenome taxonomic analysis with a db like nr.
My idea was to use Kraken2 to classify the full metagenome data against the RDP or SILVA database. This idea is based on the fact that 16S rRNA dbs are much bigger than full-genome ones, so in theory such an approach could uncover more taxa. No sooner said than done: the attempt classified less than 0.1% of the reads, which is ~50,000 reads in absolute numbers. This, to my knowledge, corresponds to the average read yield of a typical soil metabarcoding analysis sequenced on a MiSeq.
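For scale, the two figures above imply a rough size for the whole dataset. This is only a back-of-the-envelope sketch: it treats "less than 0.1%" as exactly 0.1%, so the result is a lower bound, and the function name is made up for illustration.

```python
def implied_total_reads(classified_reads, classified_fraction):
    """Estimate total reads from the classified count and classified fraction."""
    return round(classified_reads / classified_fraction)

# Figures from the post: ~50,000 reads classified, at most 0.1% of all reads.
total = implied_total_reads(50_000, 0.001)
print(f"at least ~{total:,} reads in total")
```

So the 16S-classified subset is on the order of one read in a thousand (or fewer) of the shotgun data.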
So here is my question: does my idea make sense, and if so, is such a sample size (~50,000 reads) enough to represent soil biodiversity with an acceptable level of accuracy? I suspect that the copies of the 16S rRNA gene in N grams of soil are not proportional to the total DNA from the same quantity of soil, in terms of diversity and share of community, but I have no further thoughts on that.
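One way I could imagine checking the sample-size part is a rarefaction curve: subsample the classified reads at increasing depths and see whether the number of distinct taxa levels off. Below is a minimal sketch of that idea, not a validated method. The input is assumed to be a plain list with one taxon label per classified read (extracting it from the classifier output is not shown), and the toy community is made up.

```python
import random

def rarefaction_curve(taxa, depths, seed=0):
    """Return (depth, distinct taxa) pairs for nested random subsamples.

    The reads are shuffled once and prefixes of increasing length are used,
    so the curve can never decrease.
    """
    rng = random.Random(seed)
    shuffled = list(taxa)
    rng.shuffle(shuffled)
    curve = []
    seen = set()
    pos = 0
    for depth in sorted(depths):
        depth = min(depth, len(shuffled))   # cannot sample more reads than exist
        seen.update(shuffled[pos:depth])    # add only the newly included reads
        pos = max(pos, depth)
        curve.append((depth, len(seen)))
    return curve

# Toy community (made-up numbers): 5 dominant taxa plus 200 singletons.
toy_reads = [f"taxon_{i}" for i in range(5) for _ in range(2000)]
toy_reads += [f"rare_{i}" for i in range(200)]

curve = rarefaction_curve(toy_reads, [100, 1000, 5000, len(toy_reads)])
for depth, n_taxa in curve:
    print(depth, n_taxa)
```

If the curve is still climbing steeply at the full depth, the rare part of the community is probably undersampled. A flat curve only says that this particular marker and database are saturated, not that the community is fully described.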
I would appreciate any critical comments on my idea. If you have accomplished this task (uncovering the maximum possible biodiversity with minimum bias in share of community) with another approach, I would love to know exactly which one.
Thank y'all for your attention to my question.
You can't use a single phylogenetic marker to capture the biodiversity of a sample.
MetaPhlAn does that, but with a database of unique marker genes specific to each SGB (species-level genome bin). (link to the preprint)