Hello everyone,
I'm reaching out to those familiar with nucleic sequence classification software such as Blastn and Kraken2.
After performing shotgun sequencing on DNA extracted from a soil sample, I attempted to classify the obtained reads using Kraken2 software. To do this, I created a database containing sequences of bacteria, viruses, archaea, protozoa, plants, fungi, and the human genome.
The classification went well, and I obtained my report with reads distributed with varying percentages among different organisms.
However, in order to ensure there are no false positives, especially since I'm working with short reads of 30-50 bp, I increased the confidence threshold of Kraken to 0.9.
The puzzling issue is that when I extract the sequences classified as plants with a confidence of 0.9 from the original filtered FASTQ file and run them through Blast, I get completely different results!
Some reads are classified as animals or more generally metazoa, which makes sense because my Kraken database does not contain animal sequences. It's expected that based on similarity, these reads might fall under eukaryotes and occasionally extend to plants.
However, many other reads are classified as bacteria by Blast, and I don't understand why Kraken doesn't produce the same result, especially considering that I significantly increased the confidence threshold.
My question is: Can I consider my plant classification reliable or not?
Thank you for your help!
I'm not convinced that this is the cause but plant genomes often contain some bacterial genomes contamination in my experience.
What does this BLAST database contain and how are the results different? Are you seeing full length alignments (not short local HSP's) in your blast results?
I'm using the blastn suit setting the nucleotide collection (nt) as database and 'Somewhat similar sequences (blastn)' as algorithm. If I change the algorithm to something more specific like 'Highly similar sequences (megablast)' I can't get any match. The most of the match reached with blastn are local and not full length. This seems even more concerning to me because I wonder then how Kraken's classification is possible if indeed my reads do not seem to map closely anywhere, to the point that Blast either doesn't find matches or only finds partial ones, and not even on the same genomes found by Kraken.
Kraken2 (k-mers) and BLAST are using different methods to produce their results. If the blast results are not full length (and the blast database does not contain the same source references) then you can't conclude much of anything. See this past thread for reference: Kraken2 vs Minimap2 and Blast results seem to be incongruent
Note: There is a premade NT database for
kraken2
available (480GB) https://genome-idx.s3.amazonaws.com/kraken/k2_nt_20230502.tar.gz if you truly want to directly compare.