When I run the default command from the MetaPhlAn 3 manual, I get a very high unknown estimate, about 80%:
metaphlan SK_1-forward_paired.fq.gz,SK_1-reverse_paired.fq.gz,SK_1-forward_unpaired.fq.gz,SK_1-reverse_unpaired.fq.gz --bowtie2out sample1.bowtie2.bz2 --nproc 5 --bt2_ps very-sensitive-local --add_viruses --unknown_estimation --input_type fastq -o profiled_sample1.txt
Can anyone suggest how I can reduce the unknown fraction? And what is considered a normal unknown fraction for soil samples?
MetaPhlAn 3 uses the ChocoPhlAn database, which is UniRef-based (~17,000 reference genomes; that is a lot, but not enough). I think it is fine for gut microbiome research but not sufficient for soil samples.
A better approach is de novo assembly (fastq -> contigs -> bins -> MAGs), followed by annotation of the resulting genomes with GTDB-Tk, Prokka, or eggNOG.
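The fastq -> contigs -> bins -> MAGs route could be sketched roughly as below. This is only an outline under assumptions: MEGAHIT, Bowtie 2, samtools, MetaBAT 2 and CheckM are my choice of tools for the assembly, mapping, binning and quality-check steps (only GTDB-Tk is named above), the output names are made up, and exact options can differ between versions. The script prints the commands by default and only runs them when DRY_RUN=0 is set.

```shell
# Dry run by default: each command is printed, not executed. Set DRY_RUN=0 to run.
DRY_RUN="${DRY_RUN:-1}"
run() {
  echo "+ $*"
  if [ "$DRY_RUN" = "0" ]; then "$@" || exit 1; fi
}

S="SK_1"   # sample prefix taken from the question
T=5        # threads

# 1. Assemble reads into contigs (paired reads via -1/-2, unpaired reads via -r).
run megahit -1 "${S}-forward_paired.fq.gz" -2 "${S}-reverse_paired.fq.gz" \
    -r "${S}-forward_unpaired.fq.gz,${S}-reverse_unpaired.fq.gz" \
    -t "$T" -o "${S}_megahit"

# 2. Map the paired reads back to the contigs to get per-contig coverage.
run bowtie2-build "${S}_megahit/final.contigs.fa" "${S}_contigs"
run bowtie2 -p "$T" -x "${S}_contigs" \
    -1 "${S}-forward_paired.fq.gz" -2 "${S}-reverse_paired.fq.gz" -S "${S}.sam"
run samtools sort -@ "$T" -o "${S}.sorted.bam" "${S}.sam"
run jgi_summarize_bam_contig_depths --outputDepth "${S}_depth.txt" "${S}.sorted.bam"

# 3. Bin the contigs using composition and coverage.
run metabat2 -i "${S}_megahit/final.contigs.fa" -a "${S}_depth.txt" \
    -o "${S}_bins/bin" -t "$T"

# 4. Check completeness/contamination to decide which bins count as MAGs.
run checkm lineage_wf -x fa -t "$T" "${S}_bins" "${S}_checkm"

# 5. Assign GTDB taxonomy to the bins.
#    Assumption: recent GTDB-Tk releases may also require --skip_ani_screen or --mash_db.
run gtdbtk classify_wf --genome_dir "${S}_bins" --extension fa \
    --out_dir "${S}_gtdbtk" --cpus "$T"
```

For soil, expect the assembly and binning steps to need a lot of memory and time, and expect many reads to remain unassembled.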
There are also Snakemake pipelines such as Sunbeam, Metagenome-Atlas, and metaGEM that run all of these steps together.
Another option is to run Kraken2 with a much larger reference database.
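A minimal sketch of the Kraken2 option, assuming the same file names as in the question. KRAKEN_DB is a placeholder path, not a real location; point it at whichever large database you have built or downloaded. Only the paired reads are classified here; the unpaired files would need a separate run without --paired. The command is printed first and only executed if kraken2 and the database directory are actually present.

```shell
# Placeholder database path (assumption): replace with your own Kraken2 database.
KRAKEN_DB="${KRAKEN_DB:-/path/to/kraken2_db}"
SAMPLE="SK_1"

# Classify the paired reads; --report writes a per-taxon summary table.
KRAKEN_CMD="kraken2 --db ${KRAKEN_DB} --threads 5 --paired --gzip-compressed --report ${SAMPLE}.kreport --output ${SAMPLE}.kraken ${SAMPLE}-forward_paired.fq.gz ${SAMPLE}-reverse_paired.fq.gz"

echo "$KRAKEN_CMD"
if command -v kraken2 >/dev/null 2>&1 && [ -d "$KRAKEN_DB" ]; then
  $KRAKEN_CMD
fi

# The report row with rank code "U" holds the unclassified reads;
# column 1 is their percentage of all reads.
if [ -f "${SAMPLE}.kreport" ]; then
  awk -F'\t' '$4 == "U" { print "unclassified %:", $1 }' "${SAMPLE}.kreport"
fi
```

Comparing that unclassified percentage with the MetaPhlAn unknown estimate gives a rough idea of how much the larger database helps, although the two numbers are computed differently and are not directly equivalent.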