Question

Kraken classified based on 1 mer

12 weeks ago

Jyoti • 0

I have classified some reads using Kraken2. Below is one line from the output file it generated. According to this line, only one k-mer was used to classify the read as Bacillus cereus.

C       A00804:223:H337FDSX7:1:1108:27706:11115_ACGGATATTGACCTTAGACAAGTAGAA_CGATTG      Bacillus cereus 95/8201 (**taxid 526979**)  215     0:55 1:2 0:11 1:6 0:3 1:5 0:3 1:2 0:11 1:6 0:16 1:6 0:3 1:5 0:2 1:2 0:7 **526979:1** 0:35
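To see how the k-mers in that last column are distributed, the LCA mapping field can be tallied per taxid. The following is a minimal sketch in Python, not part of Kraken2 itself; the function name `tally_lca_hits` is made up for illustration, and the field is copied from the line above with the bold markers removed.

```python
from collections import Counter

def tally_lca_hits(lca_field: str) -> Counter:
    """Sum k-mer counts per taxid from a Kraken2 LCA mapping string."""
    counts = Counter()
    for token in lca_field.split():
        if token == "|:|":  # separator between mates in paired-end output
            continue
        taxid, _, n = token.rpartition(":")
        counts[taxid] += int(n)
    return counts

# LCA field copied from the output line above (bold markers removed).
LCA_FIELD = (
    "0:55 1:2 0:11 1:6 0:3 1:5 0:3 1:2 0:11 1:6 0:16 1:6 "
    "0:3 1:5 0:2 1:2 0:7 526979:1 0:35"
)

hits = tally_lca_hits(LCA_FIELD)
total = sum(hits.values())
print(dict(hits), total)
# prints {'0': 146, '1': 34, '526979': 1} 181
```

The total of 181 matches a 215 bp read with 35-mers (215 - 35 + 1). Of those, 146 k-mers were unclassified, 34 mapped to the root (taxid 1), and one mapped to taxid 526979.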

This is the FASTQ read corresponding to the above classification.

@A00804:223:H337FDSX7:1:1108:27706:11115_ACGGATATTGACCTTAGACAAGTAGAA_CGATTG kraken:taxid|20467
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACAAAAAAAAAAAAAAAAAAAAAAAAAAAAACAAAAAAAAAAAAAAAAAAAAACAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAATATTATTTATACATTCTATAACATATAAAAAAAAAAAAAAAATTATA

My question is: why is Kraken2 classifying this read based on just one k-mer? How can I filter out such spurious classifications?

Thanks

Kraken2 • 791 views

updated 12 weeks ago by colindaven 7.0k • written 12 weeks ago by Jyoti • 0

I would filter out such terrible reads before giving them to a metagenomic classifier, perhaps with a read trimmer such as Trimmomatic or a similar tool.
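As a rough illustration of this kind of pre-filtering (a sketch only, not a substitute for a real trimmer), a read dominated by a single base can be flagged before classification. The function names and the 0.8 cutoff below are hypothetical choices, not taken from any tool.

```python
from collections import Counter

def dominant_base_fraction(seq: str) -> float:
    """Fraction of the read taken up by its most common base."""
    if not seq:
        return 0.0
    return Counter(seq.upper()).most_common(1)[0][1] / len(seq)

def is_low_complexity(seq: str, max_fraction: float = 0.8) -> bool:
    """Flag reads in which one base reaches max_fraction (assumed cutoff)."""
    return dominant_base_fraction(seq) >= max_fraction

# Synthetic examples: a poly-A-like read and a balanced read.
print(is_low_complexity("A" * 90 + "C" * 10))  # True  (0.9 of the read is A)
print(is_low_complexity("ACGT" * 25))          # False (each base is 0.25)
```

A single-base fraction is a crude measure; it would miss repeats such as ATATAT, which an entropy-based or dust-style filter handles better.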

Reply • 12 weeks ago by colindaven 7.0k

Thank you for the suggestion. The dataset isn't specifically metagenomic; I'm simply looking at where the unmapped reads align. The read quality, however, is above 30.

Reply • 12 weeks ago by Jyoti • 0

> The read quality, however, is above 30.

That may be, but do you think a read that is mostly poly-A is going to provide any diagnostic information about which organism's genome it came from?

Reply • 12 weeks ago by GenoMax 148k

True. I don't think it will. Thanks for your suggestion.

Reply • 12 weeks ago by Jyoti • 0

Don't always believe sequencer quality scores. Here is a nice writeup: https://sequencing.qcfail.com/articles/illumina-2-colour-chemistry-can-overcall-high-confidence-g-bases/

Reply • 12 weeks ago by colindaven 7.0k

score 1 · Answer 1 · 2024-09-28

I recommend using Kraken2's --confidence flag; otherwise, it can generate wildly inconsistent results depending on the input.

The classification you noticed is one such result, and it is ludicrously incorrect.

It ought to be unacceptable for a classifier to call a read as coming from Bacillus cereus just because a single k-mer matched that taxon. Of the read's 181 k-mers, only 35 hit the database at all: 34 at the root (taxid 1) and one at taxid 526979. This shows that the tool was designed for high recall at the cost of specificity.
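To make the effect of a confidence threshold concrete, here is a small Python sketch of the score described in the Kraken 2 manual: the number of k-mers mapped within the clade rooted at a label, divided by the number of k-mers queried. This is an illustration, not Kraken2's code. It assumes the read has no ambiguous k-mers, that taxid 526979 is the only hit taxon below the root, and it uses a hypothetical helper named `clade_confidence`.

```python
from __future__ import annotations

def clade_confidence(kmer_hits: dict[int, int], clade_taxids: set[int]) -> float:
    """Fraction of queried k-mers whose LCA falls inside the given clade."""
    queried = sum(kmer_hits.values())
    in_clade = sum(n for taxid, n in kmer_hits.items() if taxid in clade_taxids)
    return in_clade / queried if queried else 0.0

# Per-taxid k-mer counts from the read in the question (0 = no database hit).
KMER_HITS = {0: 146, 1: 34, 526979: 1}

# Assumption: the clade rooted at 526979 contains only that taxid here,
# while the clade rooted at the root (taxid 1) contains every classified k-mer.
strain_score = clade_confidence(KMER_HITS, {526979})
root_score = clade_confidence(KMER_HITS, {1, 526979})

print(f"{strain_score:.4f} {root_score:.4f}")
# prints 0.0055 0.1934
```

Under these assumptions, a threshold such as --confidence 0.05 would no longer accept the Bacillus cereus label (0.0055 is below 0.05). As I understand the manual, the label would then be moved up the tree until the score meets the threshold, which here means the root (about 0.19), and a threshold of 0.2 or more would leave the read unclassified.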