I'm looking for a short-read mapper to map metagenomic samples against a wide range of microbial genomes.
I only need the best mapping for each read, so an aligner optimized for variant calling is unnecessary.
I know that Bowtie2 and BWA are commonly used, but I wonder whether you have further recommendations.
So:
The main need is a fast aligner that finds the best hit for each pair of reads.
Any suggestions?
I did not find any good benchmark for that purpose.
I will try to describe our issue better; maybe that will help.
We have a collection of assembly-based genomes that are under-represented in current databases, and we want to map our reads back to them along with some of the reference genomes.
The general idea is to determine the proportion of reads mapped to each "dataset" of genomes, and later to compute the abundance of each.
If you have any thoughts on that, we would love to hear them!
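To make the two quantities above concrete, here is a minimal sketch. It assumes per-genome read counts have already been obtained somehow; the function names, the genome names, and the length normalization (reads divided by genome length) are illustrative assumptions, not something taken from a particular tool.

```python
def dataset_fractions(read_counts, genome_to_dataset):
    """Fraction of mapped reads that fall into each "dataset" of genomes.

    read_counts: dict genome -> number of mapped reads (hypothetical input).
    genome_to_dataset: dict genome -> dataset label (hypothetical input).
    """
    total = sum(read_counts.values())
    if total == 0:
        raise ValueError("no mapped reads")
    fractions = {}
    for genome, n in read_counts.items():
        dataset = genome_to_dataset[genome]
        fractions[dataset] = fractions.get(dataset, 0.0) + n / total
    return fractions


def relative_abundance(read_counts, genome_lengths):
    """Length-normalized relative abundance (assumed definition:
    reads per base, rescaled so that the values sum to 1)."""
    per_base = {g: n / genome_lengths[g] for g, n in read_counts.items()}
    scale = sum(per_base.values())
    if scale == 0:
        raise ValueError("no mapped reads")
    return {g: v / scale for g, v in per_base.items()}


# Toy numbers, made up for illustration only.
counts = {"mag1": 300, "ref1": 100}
datasets = {"mag1": "assemblies", "ref1": "references"}
lengths = {"mag1": 3_000_000, "ref1": 1_000_000}

print(dataset_fractions(counts, datasets))
print(relative_abundance(counts, lengths))
```

In the toy input the assembled genome attracts three times as many reads but is also three times as long, so the read fraction and the length-normalized abundance tell different stories.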
Short-read aligners are not designed to distinguish between closely related species. Results that you get with those could be highly misleading as equally well-aligned reads would be assigned in a random fashion to one of the contigs.
Thus, if two species share a long region, then even if only one of the species is present, a short-read mapper might indicate that both are there at 50% abundance.
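A toy simulation of that effect, as a sketch only: it assumes a mapper that breaks ties between equally good hits uniformly at random, and the species labels and read counts are made up.

```python
import random


def assign_reads(n_unique_a, n_shared, seed=0):
    """Simulate random tie-breaking between two references.

    Only species A is truly present. Reads from the region shared with
    species B align equally well to both and are assigned at random.
    """
    rng = random.Random(seed)
    counts = {"A": n_unique_a, "B": 0}
    for _ in range(n_shared):
        counts[rng.choice(["A", "B"])] += 1
    return counts


# Every read comes from the shared region: B appears to hold about half.
only_shared = assign_reads(n_unique_a=0, n_shared=10_000)
frac_b = only_shared["B"] / sum(only_shared.values())
print(only_shared, round(frac_b, 2))

# With as many unique reads as shared ones, B still gets about a quarter.
mixed = assign_reads(n_unique_a=10_000, n_shared=10_000)
print(mixed)
```

Species B is absent from the sample in both runs, yet it receives a substantial share of the reads purely through tie-breaking.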
If the goal is to evaluate the abundance of various species then use a dedicated tool that has a built-in mechanism to account for similarities between species.
This is an important point, and it is why simple mapping approaches will fail. You need a mapping quality filter to avoid counting alignments to these shared regions; we use a MAPQ cutoff of 30 (MQ30). Having said that, the more genomes related to the genome of interest are present in the reference set, the more "genomic masking" will occur, because a larger proportion of regions is shared and fewer are unique. This is becoming more and more of a problem as more and more species are sequenced.
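A minimal sketch of such a filter, written against plain SAM text so that it needs no extra libraries. The function name and the example records are made up; the field positions and flag bits follow the SAM format. Note that MAPQ scales differ between aligners, so a cutoff of 30 does not mean the same thing for every tool.

```python
from collections import Counter


def count_confident_reads(sam_lines, min_mapq=30):
    """Count primary alignments per reference with MAPQ >= min_mapq."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):  # header line
            continue
        fields = line.rstrip("\n").split("\t")
        flag, rname, mapq = int(fields[1]), fields[2], int(fields[4])
        # Skip unmapped (0x4), secondary (0x100) and supplementary (0x800).
        if flag & 0x4 or flag & 0x100 or flag & 0x800:
            continue
        if mapq >= min_mapq:
            counts[rname] += 1
    return counts


# Made-up records: QNAME FLAG RNAME POS MAPQ CIGAR RNEXT PNEXT TLEN SEQ QUAL
example = [
    "@SQ\tSN:genomeA\tLN:1000",
    "r1\t0\tgenomeA\t10\t42\t50M\t*\t0\t0\t*\t*",
    "r2\t0\tgenomeB\t20\t1\t50M\t*\t0\t0\t*\t*",   # low MAPQ, dropped
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",            # unmapped, dropped
    "r4\t16\tgenomeA\t30\t30\t50M\t*\t0\t0\t*\t*",
]
print(count_confident_reads(example))
```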
I should note that all tools suffer from alignment problems. No tool is perfectly able to reconstitute a mock community in my experience.
Also, in my experience a conservative SNV filter (e.g. about 1 SNV allowed per 50 bp) is absolutely necessary to rule out false alignments and to keep poor reads from being mapped falsely to other bacteria.
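One way such a filter could look, as a sketch under assumptions: it uses the optional `NM:i:` tag (edit distance, which counts indels as well as mismatches, so it is only a proxy for SNVs) and the CIGAR string of each SAM record. The function names and the handling of records without an `NM` tag are choices made for this example.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def aligned_length(cigar):
    """Number of read bases that take part in the alignment (M, =, X, I)."""
    return sum(int(n) for n, op in CIGAR_RE.findall(cigar) if op in "M=XI")


def edit_distance(fields):
    """Return the NM tag value, or None if the aligner did not write one."""
    for tag in fields[11:]:
        if tag.startswith("NM:i:"):
            return int(tag[5:])
    return None


def passes_mismatch_filter(sam_line, bp_per_mismatch=50):
    """Keep an alignment with at most one edit per bp_per_mismatch bases."""
    fields = sam_line.rstrip("\n").split("\t")
    length = aligned_length(fields[5])
    nm = edit_distance(fields)
    if nm is None or length == 0:
        return False  # assumption: be conservative when information is missing
    return nm * bp_per_mismatch <= length


good = "r1\t0\tgenomeA\t10\t42\t100M\t*\t0\t0\t*\t*\tNM:i:2"
bad = "r2\t0\tgenomeA\t10\t42\t100M\t*\t0\t0\t*\t*\tNM:i:3"
print(passes_mismatch_filter(good), passes_mismatch_filter(bad))
```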
Thanks Istvan,
Very good point. I guess we will need to handle multimapping cases or take a different approach.
The goal is indeed to calculate abundance; the problem is that we are highly interested in genomes that are not well annotated and lack a taxonomy. If you have any idea other than mapping, it would be great to hear!
Your task is a classification problem, and as such you need a tool that has some ability to deal with redistribution of reads. Many people mentioned Kraken2; that is a good start.
I have also read, but never evaluated myself, that pseudo-alignment-based RNA-Seq quantification methods such as salmon or kallisto are also applicable;
these tools implement various redistribution methods that may be superior to the somewhat simplistic approach that, say, Kraken2 takes.
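To give a feel for what "redistribution" can mean, here is a simplified expectation-maximization sketch. It is not the implementation of any of the tools named above: it ignores genome length, alignment scores and sequencing bias, and the function name and toy input are made up. Each ambiguous read is split among its candidate genomes in proportion to the current abundance estimates, and the estimates are then updated from those fractional counts.

```python
def em_abundance(read_hits, genomes, n_iter=100):
    """Estimate relative abundances from reads with ambiguous hits.

    read_hits: one list per read, naming the genomes it matches equally well.
    genomes: all candidate genome names.
    """
    abundance = {g: 1.0 / len(genomes) for g in genomes}
    for _ in range(n_iter):
        totals = {g: 0.0 for g in genomes}
        # E-step: split each read according to the current abundances.
        for hits in read_hits:
            weight_sum = sum(abundance[g] for g in hits)
            if weight_sum == 0:
                continue
            for g in hits:
                totals[g] += abundance[g] / weight_sum
        # M-step: abundances are the normalized fractional read counts.
        n = sum(totals.values())
        if n == 0:
            return abundance
        abundance = {g: totals[g] / n for g in genomes}
    return abundance


# Toy sample: 20 reads unique to A, 80 reads shared by A and B, none unique to B.
reads = [["A"]] * 20 + [["A", "B"]] * 80
estimate = em_abundance(reads, ["A", "B"])
print({g: round(v, 3) for g, v in estimate.items()})
```

Because no read supports B on its own, the shared reads drift towards A over the iterations, whereas random tie-breaking would have credited B with roughly 40% of the reads.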
I have not evaluated the claims. Maybe I should since I find these types of evaluations quite eye-opening when done as a third party that cares only about accuracy.
Kraken2 is k-mer based, not a read aligner.
True, Kraken2 is k-mer based. Is there a reason to prefer KrakenUniq over Kraken2? Kraken2 is a follow-up to KrakenUniq: it is faster, has a lower memory footprint, and has comparable accuracy, sensitivity, and positive predictive value (see Fig. 2).
Results on mock communities: fewer false positives. I'm not a great believer in the accuracy statistics of tool authors; they are invariably a little better than the rest. Trust the independents, the systematic comparisons, and your own results on "known" mocks.