Hello Biostars,
I've been given occasion to plunge into the metagenomics field. I'm working with shotgun sequencing data, i.e. data not targeted at 16S or another marker gene. From reading several articles, including this review by Siegwald et al., and the websites of QIIME and mothur, it seems that targeted (non-shotgun) analyses are the standard, and that I cannot use the same software to analyse shotgun sequencing data. Q1: Is this correct?
To analyze shotgun sequencing data, I've so far identified two tools: Kraken2 and MetaPhlAn2. I know that there are other tools available, such as CLARK, but they do not seem to be significantly different from Kraken2.
Having tried Kraken2, I find that it only seems to support export to the Krona visualization tool for downstream analysis. So I now intend to try MetaPhlAn2. Output from this tool should be compatible with the R packages phyloseq and microbiome, which seem to be standard tools for downstream statistical analyses of metagenomes. Q2: Are the observations and inferences in this paragraph correct? Q3: Should I indeed be aiming for compatibility with phyloseq and microbiome, or are there alternatives for downstream statistical analysis? Q4: Is it indeed not possible to do further statistical analysis on Kraken2's output?
Finally, the datasets from which I'm starting my analyses are not very large, so the speed advantage of Kraken2 compared to other tools does not matter much to me. I'm mostly interested in increasing the sensitivity of my analysis. Human reads have already been removed from the datasets (BAMs) by alignment, although Kraken2 still identified a significant share of the remaining reads as human.
Thank you for your time!
P.S. I've read this recent review by Ye et al., but it mostly discusses the performance of taxonomic classifiers, not the possibilities for downstream statistical analysis.
Your post is too long and has too many questions, so I will answer them superficially / partially; maybe someone will chime in with a more detailed answer. A better approach is to post more specific questions: I believe this will increase your odds of getting better answers.
What are the standard tools you are referring to? Regular 16S pipelines output a table of taxonomic identifications and their counts. One can certainly coerce Kraken output to the same format, so you could use all the same tools. However, there are several analyses where it doesn't make sense to use these approaches, e.g., using PICRUSt (or similar methods) to make functional predictions from the taxonomic distribution, because with shotgun data one can assemble, predict genes and make functional predictions from the annotated metagenome directly.
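To illustrate what I mean by coercing Kraken output into a count table, here is a minimal Python sketch. It assumes the standard six-column, tab-separated Kraken2 report (percent, clade reads, direct reads, rank code, NCBI taxid, indented name). The function names and the two tiny reports are made up for the example, and I have not tested it against real data.

```python
import io


def read_kraken2_report(handle, rank="S"):
    """Return {taxon name: clade read count} for lines with the given rank code.

    Assumes the standard six-column, tab-separated Kraken2 report:
    percent, clade reads, direct reads, rank code, NCBI taxid, indented name.
    """
    counts = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 6:
            continue  # skip blank lines or reports in another layout
        if fields[3] == rank:
            # Names are indented with spaces to show the hierarchy; strip them.
            counts[fields[5].strip()] = int(fields[1])
    return counts


def build_count_table(per_sample):
    """Merge per-sample dicts into (taxa, samples, matrix), zero-filling absences.

    matrix[i][j] is the count of taxa[i] in samples[j].
    """
    samples = sorted(per_sample)
    taxa = sorted({taxon for counts in per_sample.values() for taxon in counts})
    matrix = [[per_sample[s].get(t, 0) for s in samples] for t in taxa]
    return taxa, samples, matrix


# Made-up reports standing in for real files (hypothetical numbers).
report_a = (
    " 60.00\t60\t0\tG\t561\t    Escherichia\n"
    " 60.00\t60\t60\tS\t562\t      Escherichia coli\n"
    " 40.00\t40\t40\tS\t1280\t      Staphylococcus aureus\n"
)
report_b = "100.00\t10\t10\tS\t562\t      Escherichia coli\n"

per_sample = {
    "A": read_kraken2_report(io.StringIO(report_a)),
    "B": read_kraken2_report(io.StringIO(report_b)),
}
taxa, samples, matrix = build_count_table(per_sample)

# Print a tab-separated taxon-by-sample table.
print("\t".join(["taxon"] + samples))
for taxon, row in zip(taxa, matrix):
    print("\t".join([taxon] + [str(c) for c in row]))
```

With real data you would open each report file instead of using io.StringIO. The resulting table has the same shape as the count tables that 16S pipelines produce, which is why the same downstream tools can be applied.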
I don't know.
Not necessarily, but phyloseq (which I know better) has outstanding documentation, so it is convenient to use it.
Further than what? Just alpha- and beta-diversity?
Thank you h.mon! As far as I'm concerned, this could definitely be a stand-alone answer.
I opted for a long forum post because this means I will not have to explain the context multiple times. Also, I think all these questions belong together when trying to do a full statistical analysis on a shotgun sequencing metagenomic dataset. Based on your comment, however, I have made some edits to improve readability.
With these, I'm referring mostly to QIIME and mothur. If I'm not mistaken, these give BIOM files as output directly and are easy to couple to phyloseq and microbiome. When starting my literature search, I thought I might be able to use QIIME and mothur, but these are aimed at the analysis of target genes (López-Garcia et al., SEQanswers).
No, this is not what I'm looking into, as this post revolves around taxonomic sequence classification.
Yes indeed, but I have not found standard methods to calculate these scores from Kraken2 output. The kraken-biom package might be suitable for linking to downstream statistical analysis. I find it surprising, however, that neither Kraken nor phyloseq provides this conversion itself.
EDIT: Deleted unnecessary information, struck out unclear questions. Reintroduced whitespace to increase readability.
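For reference, the scores themselves are simple to compute once per-taxon counts are available. Below is a minimal Python sketch, assuming the counts have already been extracted per sample in the same taxon order. The function names and the example numbers are made up, and this is not how phyloseq or any other package implements these measures.

```python
import math


def shannon(counts):
    """Shannon index H = -sum(p_i * ln(p_i)) over taxa with non-zero counts."""
    total = sum(counts)
    if total == 0:
        return 0.0  # an empty sample has no diversity to measure
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two samples.

    a and b must list counts (or relative abundances) for the same taxa
    in the same order. 0 means identical, 1 means no shared taxa.
    """
    if len(a) != len(b):
        raise ValueError("samples must cover the same taxa")
    denominator = sum(a) + sum(b)
    if denominator == 0:
        return 0.0
    return sum(abs(x - y) for x, y in zip(a, b)) / denominator


# Hypothetical counts for two taxa in two samples.
sample_a = [60, 40]
sample_b = [10, 0]

print("Shannon A:", shannon(sample_a))
print("Shannon B:", shannon(sample_b))
print("Bray-Curtis A vs B:", bray_curtis(sample_a, sample_b))
```

Note that Bray-Curtis on raw counts is sensitive to differences in sequencing depth between samples, so in practice the counts are usually converted to relative abundances or rarefied first; which of those is appropriate is a separate question.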