Question

Recommended mapper


3.1 years ago

biobiu ▴ 150

I'm looking for a short-read mapper to map metagenomics samples against a wide range of microbial genomes. I just need the best mapping, so an aligner optimized for finding variants is unnecessary. I know that Bowtie2 and BWA are commonly used, but I'm wondering whether you have further recommendations. So: the main need is a fast aligner that finds the best hit for each pair of reads. Any suggestions? I did not find any good benchmarking for that purpose.

alignment metagenomics Mapping • 3.0k views

ADD COMMENT • link updated 3.1 years ago by Istvan Albert 102k • written 3.1 years ago by biobiu ▴ 150

score 1 · Answer 1 · 2021-10-19


3.1 years ago

Chris Dean ▴ 420

Kraken2 is a popular tool for aligning short reads to a large database of microbial reference genomes. It's also *really* fast.



Kraken2 is kmer based, not a read aligner.

ADD REPLY • link 3.1 years ago by colindaven 7.0k


True, Kraken2 is kmer-based. Is there a reason to prefer KrakenUniq over Kraken2? Kraken2 is a follow-up to KrakenUniq; it is faster, has a lower memory footprint, and has comparable accuracy, sensitivity, and positive predictive value (see Fig. 2).

ADD REPLY • link 3.1 years ago by Chris Dean ▴ 420


Results on mock communities: fewer false positives. I'm not a great believer in the accuracy statistics of tool authors; their own tool is invariably a little better than the rest. Trust the independents, the systematic comparisons, and your own results on "known" mocks.

ADD REPLY • link 3.1 years ago by colindaven 7.0k

score 0 · Answer 2 · 2021-10-19


3.1 years ago

colindaven 7.0k

We offer bwa, and minimap2 configured for short or long reads in our pipeline: https://github.com/MHH-RCUG/Wochenende

Otherwise, I would recommend kraken-uniq (not kraken2) and Metaphlan.



Thanks, the pipeline looks great!

Let me try to describe our issue better; maybe that will help. We have a collection of assembly-based genomes that are under-represented in current databases, and we want to map reads back to them along with some of the references. The general idea is to understand the proportion of reads mapped to each "dataset" of genomes, and later on to compute the abundance of each.

If you have any thoughts on that, I would love to hear them!
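
To make the goal concrete, the per-dataset proportion could be computed roughly as follows once each read has a single best hit. This is only a sketch: the function name `dataset_proportions`, the `read_to_contig` mapping, and the `contig_to_dataset` lookup are hypothetical and would have to be built from the actual alignment output.

```python
from collections import Counter

def dataset_proportions(read_to_contig, contig_to_dataset):
    """Fraction of mapped reads that fall into each dataset of genomes.

    read_to_contig:    dict, read name -> contig of its best hit (hypothetical input)
    contig_to_dataset: dict, contig name -> dataset label, e.g. "assemblies" or "references"
    """
    counts = Counter()
    for contig in read_to_contig.values():
        # Contigs missing from the lookup are reported separately, not dropped.
        counts[contig_to_dataset.get(contig, "unassigned")] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {dataset: n / total for dataset, n in counts.items()}

if __name__ == "__main__":
    hits = {"r1": "contigA", "r2": "contigA", "r3": "refX", "r4": "refY"}
    lookup = {"contigA": "assemblies", "refX": "references", "refY": "references"}
    print(dataset_proportions(hits, lookup))
```

Note that this counts reads only; it does not correct for genome length or for reads that map equally well to several genomes.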

ADD REPLY • link 3.1 years ago by biobiu ▴ 150

score 0 · Answer 3 · 2021-10-19


3.1 years ago

Istvan Albert 102k

The use case is very important here.

Short-read aligners are not designed to distinguish between closely related species. Results that you get with those could be highly misleading as equally well-aligned reads would be assigned in a random fashion to one of the contigs.

Thus, if two species share a long region, then even if only one of the species is present, a short-read mapper might indicate that both are there at 50% abundance.
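
The effect described above can be illustrated with a toy simulation. This is a sketch under simplifying assumptions, not a model of any particular aligner: only species A is present, a chosen fraction of its reads come from a region shared with species B, and ties are broken uniformly at random. The function name and parameters are made up for this illustration.

```python
import random
from collections import Counter

def simulate_random_assignment(n_reads=10_000, shared_fraction=1.0, seed=0):
    """Apparent abundance of A and B when only A is actually present.

    Reads from the shared region align equally well to both species,
    so the (simulated) mapper picks one of them at random.
    """
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_reads):
        from_shared_region = rng.random() < shared_fraction
        if from_shared_region:
            counts[rng.choice(["A", "B"])] += 1  # tie broken at random
        else:
            counts["A"] += 1  # read is unique to A
    return {species: counts[species] / n_reads for species in ("A", "B")}

if __name__ == "__main__":
    # Fully shared region: B appears at roughly half the reads despite being absent.
    print(simulate_random_assignment(shared_fraction=1.0))
    # No shared region: B gets no reads.
    print(simulate_random_assignment(shared_fraction=0.0))
```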

If the goal is to evaluate the abundance of various species then use a dedicated tool that has a built-in mechanism to account for similarities between species.



This is an important point, which is why simple mapping approaches will fail. You need to use a mapping quality filter to rule out counting alignments to these shared regions - we use MQ30. Having said that, the more genomes are present around the genome of interest, the more "genomic masking" due to larger proportions of shared regions (and fewer unique regions) will occur. This is becoming more and more of a problem as more and more species are sequenced.

I should note that all tools suffer from alignment problems. No tool is perfectly able to reconstitute a mock community in my experience.

Also, a conservative SNV filter (e.g., about 1 SNV per 50 bp allowed) is, from experience, absolutely necessary to rule out false alignments and poor reads being mapped falsely to other bacteria.
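
A rough sketch of how the two filters above might look when applied to plain-text SAM records. This is an illustration, not the Wochenende implementation: the function name `passes_filters` and its parameters are hypothetical, and the `NM` tag (edit distance, which also counts indels) is used as a stand-in for the SNV count.

```python
def passes_filters(sam_line, min_mapq=30, bp_per_mismatch=50):
    """Return True if a SAM alignment line passes a MAPQ and a mismatch filter.

    Assumptions: tab-separated SAM record with an NM:i: tag; the read length
    is taken from the SEQ field, so soft-clipped bases are included.
    """
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])
    if flag & 0x4:  # 0x4 = read is unmapped
        return False
    if int(fields[4]) < min_mapq:  # column 5 is MAPQ
        return False
    seq = fields[9]
    if seq == "*":  # sequence not stored, length unknown
        return False
    edit_distance = None
    for tag in fields[11:]:
        if tag.startswith("NM:i:"):
            edit_distance = int(tag[5:])
            break
    if edit_distance is None:
        return False  # be conservative when the tag is missing
    allowed = len(seq) // bp_per_mismatch  # e.g. 2 for a 100 bp read
    return edit_distance <= allowed

if __name__ == "__main__":
    read = "A" * 100
    line = "\t".join(["r1", "0", "genomeA", "100", "60", "100M",
                      "*", "0", "0", read, "I" * 100, "NM:i:1"])
    print(passes_filters(line))
```

In practice the MAPQ part alone is usually done with `samtools view -q 30`; the sketch only spells out the logic.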

ADD REPLY • link 3.1 years ago by colindaven 7.0k


Thanks Istvan, very good point. I guess we will need to handle multimapping cases or take a different approach. The goal is indeed to calculate abundance; the problem is that we are highly interested in genomes that are not well annotated and are missing their taxonomy. If you have any idea other than mapping, it would be great to hear!

ADD REPLY • link 3.1 years ago by biobiu ▴ 150


Your task is a classification problem, and as such you need a tool that has some ability to deal with redistribution of reads. Many people mentioned kraken2; that is a good start.

I have also read, but never evaluated myself, that pseudo-alignment-based RNA-Seq quantification methods such as salmon or kallisto are also applicable:

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5945057/

These tools implement various redistribution methods that may be superior to the somewhat simplistic approach that, say, Kraken2 takes.

I have not evaluated the claims. Maybe I should since I find these types of evaluations quite eye-opening when done as a third party that cares only about accuracy.
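
To give a feel for what "redistribution" means, here is a toy expectation-maximization sketch. It is not the algorithm of salmon, kallisto, or any other tool: the function name and inputs are made up for illustration, all hits of a read are treated as equally good, and genome length is ignored.

```python
def em_abundance(read_hits, genomes, n_iter=100):
    """Estimate relative abundances by iteratively redistributing ambiguous reads.

    read_hits: list of sets; each set holds the genomes a read aligns to equally well
    genomes:   list of all genome names
    """
    abundance = {g: 1.0 / len(genomes) for g in genomes}  # start uniform
    for _ in range(n_iter):
        expected = {g: 0.0 for g in genomes}
        for hits in read_hits:
            total = sum(abundance[g] for g in hits)
            if total == 0:
                continue
            # E-step: split the read among its hits in proportion to current abundance.
            for g in hits:
                expected[g] += abundance[g] / total
        assigned = sum(expected.values())
        if assigned == 0:
            break
        # M-step: new abundance is the share of (fractional) reads per genome.
        abundance = {g: expected[g] / assigned for g in genomes}
    return abundance

if __name__ == "__main__":
    # 80 reads from a region shared by A and B, 20 reads unique to A, none unique to B.
    reads = [{"A", "B"}] * 80 + [{"A"}] * 20
    print(em_abundance(reads, ["A", "B"]))
```

In this example the reads unique to A pull the ambiguous reads toward A on every iteration, so the estimate for B shrinks toward zero instead of staying at 40% as it would with random assignment.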

ADD REPLY • link 3.1 years ago by Istvan Albert 102k

score 0 · Answer 4 · 2021-10-19


3.1 years ago

GokalpC ▴ 100

How about a de novo assembler and a database searcher?

I would use SPAdes and Silva.
