Question

Kraken2 vs Minimap2 and Blast results seem to be incongruent

3

Entering edit mode

6.0 years ago

cecilio11 ▴ 120

Hello,

I obtained incongruent "classification" results when running Kraken2 vs. Minimap2 and Blast on the same data sets.

I am planning to assemble 5 bacterial genomes from pacbio reads. For 2 of the bacteria I generated two raw assemblies by using Canu. I suspected that there may be some contamination in the reads, so for each bacterium, I evaluated the raw pacbio reads using minimap2. Additionally, for the 2 draft assemblies I also used NCBI Blast to compare them to whole bacterial genomes as well as to the bacterial 16s rRNA databases. I was advised to use Kraken for a more comprehensive evaluation of possible contaminants.

Comparing the results from Minimap2 and Blast vs. Kraken2 I noticed some incongruence between the matches they report.

For example, Minimap2 indicate that 18.72% of the raw sequences of my Sphingomonas sp. map to Ignatzschineria larvae; 55% map to Sphingomonas paucimobilis. But, Kraken2 does not match Sphingomonas sp. to Ignatzschineria larvae at all. However, Kraken2 does match Sphingomonas sp. to a Sphingomonas (Sphingomonas paucimobilis is not included in the library of ref seqs that kraken builds).

Minimap2 indicates that 78.49% of the raw sequences of my Kurthia sp. map to Moellerella wisconsensis. But, kraken2 does not match my Kurthia sp. to Moellerella wisconsensis at all.

For the raw sequences of my Ignatzschineria sp., Minimap2 indicates that 81% of the raw sequences map to Moellerella wisconsensis. About 20% of the raw seqs map to Providencia. Blast of the draft assembly of my Ignatzschineria sp. to whole genomes and to 16s indicate that my Ignatzschineria may, in fact, be Moellerella wisconsensis. However, kraken2 does not match my Ignatzschineria sp. to Moellerella wisconsensis at all.

I run kraken2 in an IBM cluster.

Can someone give me a hint on why I observe the incongruence between the results reported by these methods?

Best regards,

assembly genome • 6.9k views

ADD COMMENT • link 5.7 years ago by cecilio11 ▴ 120

0

Entering edit mode

h.mom and colindaven,

Thank you for your answer. See below.

ADD REPLY • link 6.0 years ago by cecilio11 ▴ 120

score 7 · Answer 1 · 2019-01-09

7

Entering edit mode

6.0 years ago

h.mon 35k

You are comparing three different programs using three different databases, it is very difficult to pinpoint the cause of the differences. In addition, you do not explain how did you build the Blast and Kraken databases - what are the sources?

In a previous answer, I should have explained why mapping with minimap2 to a limited number of references is a bad approach. You didn't state clearly, but I am assuming you mapped to each reference separately, when using minimap2. This is not a good approach because some reads may map to a given genome when it is the only reference, but these reads wouldn't map to this genome if another, more similar genome (more similar to the reads), were included in the reference. So, when you say:

Minimap2 indicates that 81% of the raw sequences map to Moellerella wisconsensis. About 20% of the raw seqs map to Providencia.

This may mean about 80% of the reads mapped to Moellerela, and the remaining 20% mapped to Providencia. But it could also mean 60% of the reads mapped to Moellerela, 20% mapped to Moellerela and Providencia, and 20% didn't map at all.

ADD COMMENT • link 5.8 years ago by h.mon 35k

1

Entering edit mode

Just as an addon to the great answer above, the methods are dramatically different.

Minimap2 and BLAST are alignment based, whereas Kraken uses kmer matching (and kmers are highly variant with up to 10-15% errors in raw pacbio reads).

" Kraken aims to achieve high sensitivity and high speed by utilizing exact alignments of k-mers and a novel classification algorithm. "

But beyond this difference, the critical point in metagenomics and or contaminant assessment is the database behind this. As I understand it, you used different databases in each case, so results will naturally be dependent on this.

ADD REPLY • link 6.0 years ago by colindaven 7.0k

0

Entering edit mode

h.mom and colindaven,

Thank you for your answer.

For the raw reads, I did run Minimap2 and kraken2 in an IBM cluster. And yes, I run Minimap2 with a single reference sequence per run per bacteria. The percentages refer to each single run. For the draft assemblies, I run Blast (bacterial genomes and 16s bacterial and archaea databases) at the NCBI website (online).

I understand that Kraken2 uses databases downloaded from NCBI, which I do not believe are much different from the databases used by BLAST. The "reference genomes" I used to run Minimap2 were downloaded from the NCBI database.

The difference might be due in part to methodology (as you mention above) and in part to the goal of the tool. Kraken2 is a (ultrafast) classifier. I think Kraken will assign a piece of DNA to a given taxon (the best match), even though, let's say only 30-40% of the k-mers match that taxon and 20% match to another taxon (and perhaps kraken cannot classify the rest of the strand). BLAST and Minimap will show you all the matches.

Thank you for your insights. They were very helpful.

Happy new year to you and all the Biostars.

Cecilio11

ADD REPLY • link 6.0 years ago by cecilio11 ▴ 120

score 7 · Answer 2 · 2019-04-12

I posted this issue a while ago. I was starting my work on genomics those days. I am nostalgic of those times, three months ago.

I found out that the reason for the no congruent results between mapping by minimap2 and classifying by Kraken2 was that the database of the latter did not contain the genomes I was working on. When those genomes were custom-added to the Kraken2 database, the results from minimap and kraken2 were in agreement.

As I continue my work on genomics I realize that GREAT tools that use databases to identify/classify queries, such as MetaPhlan, MG-RAST and Kraken2 have to be used carefully. One has to check if their databases contain the genomes one is working on.

This is really important if one is working with species that may not have a complete reference sequence (or have not been sequenced before). If there are draft genome sequences for your species of interest, try to custom-add those to the database of the tool you are using.

Thank you biostars,

cecilio