Hi there,
I've recently started working on metagenomic analysis and have been using Metalign as a taxonomic profiling tool. So far, I've been running it repeatedly on the same FASTQ file to see how the results change with different parameters.
My lab is particularly interested in M. smegmatis and its abundance in a particular soil sample. When I ran Metalign with the default parameters, M. smeg was found at 10% relative abundance, and around 30 other species were identified in the sample. This is a native soil sample with a limited amount of M. smeg added to it. Given the biodiversity of soil, I would expect far more than 30 species to be identified, and I would not expect M. smeg to account for 10% of all reads in the sample.
When I switched to sensitive mode, Metalign found more than 17,000 different species, and the M. smegmatis abundance dropped to 0.015%. If this is relative abundance that accounts for the percentage of unmapped reads in both runs, shouldn't it report the same amount of M. smegmatis?
I've been reading about the CMash algorithm and how it pre-filters the database based on the containment index: the ratio of the number of k-mers shared between the reads and a reference genome to the number of k-mers in that reference genome. With the defaults, Metalign uses a containment index cutoff of 0.01. In sensitive mode, the cutoff is 0.0, which effectively eliminates this pre-filtering step.
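To check my understanding of the pre-filter, here is a toy sketch of the containment index as I described it above. It uses exact k-mer sets on made-up sequences; it is not CMash's actual implementation, and the function names, the toy data, and the `>=` comparison against the cutoff are all my own assumptions.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def containment_index(reads, reference, k):
    """Fraction of the reference's distinct k-mers that also occur in the reads."""
    ref_kmers = kmers(reference, k)
    if not ref_kmers:
        return 0.0
    read_kmers = set()
    for read in reads:
        read_kmers |= kmers(read, k)
    return len(ref_kmers & read_kmers) / len(ref_kmers)


def prefilter(reads, references, k, cutoff):
    """Keep references whose containment index is at least `cutoff`.

    Assumption: the comparison is >=, so a cutoff of 0.0 keeps every reference.
    """
    return [name for name, seq in references.items()
            if containment_index(reads, seq, k) >= cutoff]


# Hypothetical toy data, not taken from my real sample.
reads = ["ACGTAC", "GTACGG"]
references = {"genome_a": "ACGTACGG", "genome_b": "TTTTTTTT"}

kept_default = prefilter(reads, references, k=4, cutoff=0.01)
kept_sensitive = prefilter(reads, references, k=4, cutoff=0.0)
```

If this picture is right, the 0.01 cutoff drops `genome_b` because it shares no k-mers with the reads, while the 0.0 cutoff keeps both genomes, which is how I understand the difference between the default and sensitive runs.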
Additionally, 61% of reads were unmapped with the defaults, as opposed to 78% with --sensitive mode. So although --sensitive mode identified far more organisms, the percentage of unmapped reads increased. Is Metalign therefore falsely aligning reads to M. smeg and the 30 other organisms because they are all that the filtered database contains? Why did the abundance of M. smegmatis go from 10% with a 0.01 CMash cutoff to 0.015% with no CMash cutoff?
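To make the size of the gap concrete, here is a rough sketch of the arithmetic. I am not sure how Metalign normalizes its reported abundance, so this assumes the reported value is a percentage of mapped reads only and rescales it to a percentage of all reads; the function name is hypothetical and only the input numbers come from my two runs.

```python
def share_of_all_reads(abundance_among_mapped_pct, unmapped_pct):
    """Rescale an abundance given as % of mapped reads to % of all reads.

    Assumption: the reported abundance is normalized over mapped reads only.
    """
    mapped_fraction = 1.0 - unmapped_pct / 100.0
    return abundance_among_mapped_pct * mapped_fraction


# Default run: 10% abundance, 61% of reads unmapped.
default_share = share_of_all_reads(10.0, 61.0)      # about 3.9% of all reads

# Sensitive run: 0.015% abundance, 78% of reads unmapped.
sensitive_share = share_of_all_reads(0.015, 78.0)   # about 0.0033% of all reads

# Fold difference between the runs, with and without the rescaling.
fold_rescaled = default_share / sensitive_share
fold_reported = 10.0 / 0.015
```

Whether or not the unmapped reads are factored in, the two runs differ by several hundred-fold or more, so the change in the unmapped percentage alone does not seem to account for the discrepancy.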
If there is anything wrong with my understanding of Metalign's algorithm or how it works, please let me know. I'm very new to this and am just trying to understand this large discrepancy in the results. Thank you!