Kraken2 classified a read based on one k-mer
12 weeks ago
Jyoti • 0

I have classified some reads using Kraken2. Below is one line from the output file. According to this line, only one k-mer was used to classify the read as Bacillus cereus:

C       A00804:223:H337FDSX7:1:1108:27706:11115_ACGGATATTGACCTTAGACAAGTAGAA_CGATTG      Bacillus cereus 95/8201 (**taxid 526979**)  215     0:55 1:2 0:11 1:6 0:3 1:5 0:3 1:2 0:11 1:6 0:16 1:6 0:3 1:5 0:2 1:2 0:7 **526979:1** 0:35

This is the FASTQ read corresponding to the above classification:

@A00804:223:H337FDSX7:1:1108:27706:11115_ACGGATATTGACCTTAGACAAGTAGAA_CGATTG kraken:taxid|20467
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACAAAAAAAAAAAAAAAAAAAAAAAAAAAAACAAAAAAAAAAAAAAAAAAAAACAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAATATTATTTATACATTCTATAACATATAAAAAAAAAAAAAAAATTATA

My question is: why is Kraken2 classifying this read based on just one k-mer? How can I filter out such spurious classifications?

Thanks

Kraken2 • 790 views

I would filter out such terrible reads before giving them to a metagenomic classifier, perhaps with a read trimmer such as Trimmomatic.
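As a rough illustration of this kind of pre-filtering, here is a minimal Python sketch that drops reads dominated by a single base, such as the mostly poly-A read in the question. It is not what Trimmomatic or any other trimmer does internally; the function names and the 0.8 cutoff are assumptions made up for this example, and it expects plain four-line (unwrapped) FASTQ records.

```python
from collections import Counter


def dominant_base_fraction(seq):
    """Return the fraction of the read taken up by its most common base."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return Counter(seq).most_common(1)[0][1] / len(seq)


def is_low_complexity(seq, max_fraction=0.8):
    """Flag a read when one base makes up at least max_fraction of it.

    The 0.8 default is an arbitrary illustrative cutoff, not a recommended value.
    """
    return dominant_base_fraction(seq) >= max_fraction


def filter_fastq(lines, max_fraction=0.8):
    """Yield 4-line FASTQ records whose sequence is not flagged as low-complexity.

    Assumes each record is exactly four lines: header, sequence, '+', qualities.
    """
    record = []
    for line in lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            if not is_low_complexity(record[1], max_fraction):
                yield tuple(record)
            record = []


if __name__ == "__main__":
    # Synthetic reads for demonstration only.
    poly_a = "A" * 90 + "C" * 10
    mixed = "ACGT" * 25
    print(is_low_complexity(poly_a), is_low_complexity(mixed))
```

A single-base fraction is a crude proxy for sequence complexity; dedicated tools use more robust measures, so treat this only as a way to see which reads would be worth excluding.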


Thank you for the suggestion. The dataset isn't specifically metagenomic; I'm simply looking at where the unmapped reads align. The read quality, however, is above 30.


"The read quality, however, is above 30."

That may be, but do you think a read that is mostly poly-A is going to provide any diagnostic information as to which organism's genome it came from?


True. I don't think it will. Thanks for your suggestion.


Don't always believe sequencer quality scores - here is a nice writeup - https://sequencing.qcfail.com/articles/illumina-2-colour-chemistry-can-overcall-high-confidence-g-bases/

12 weeks ago

I recommend using Kraken2's confidence flag (--confidence); otherwise, it can generate wildly inconsistent results depending on the input.

The classification you noticed is a case in point, and it is ludicrously incorrect.

It ought to be unacceptable for a classifier to call a read as coming from Bacillus cereus just because a single k-mer out of the 181 in the read matched that taxon. This shows that the tool was designed with high recall and low specificity.
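To make the numbers concrete, here is a small Python sketch that tallies the k-mer hits in the last column of the output line from the question and computes the fraction that hit the assigned taxon directly. The function names are hypothetical. The denominator follows my understanding of how Kraken2 documents its confidence score (k-mers in the clade rooted at the label, divided by k-mers without ambiguous nucleotides); since no k-mer here maps below taxid 526979, its clade count is taken to equal its direct hits, which is an assumption of this sketch.

```python
from collections import Counter


def tally_kmer_hits(lca_field):
    """Sum k-mer counts per taxid from the space-separated taxid:count column."""
    counts = Counter()
    for token in lca_field.split():
        if token == "|:|":  # separator between mates in paired-end output
            continue
        taxid, _, n = token.rpartition(":")
        counts[taxid] += int(n)
    return counts


def direct_hit_fraction(counts, taxid):
    """Fraction of non-ambiguous k-mers that mapped directly to taxid."""
    # "A" entries mark k-mers containing ambiguous nucleotides; exclude them.
    total = sum(n for t, n in counts.items() if t != "A")
    return counts[taxid] / total if total else 0.0


# K-mer column from the question, with the bold markers removed.
lca_field = ("0:55 1:2 0:11 1:6 0:3 1:5 0:3 1:2 0:11 1:6 0:16 1:6 "
             "0:3 1:5 0:2 1:2 0:7 526979:1 0:35")

counts = tally_kmer_hits(lca_field)

if __name__ == "__main__":
    for taxid, n in counts.items():
        print(taxid, n)
    print(direct_hit_fraction(counts, "526979"))
```

For this read the tally is 146 unclassified k-mers (taxid 0), 34 at the root (taxid 1), and 1 at taxid 526979, so the direct-hit fraction is 1/181, or about 0.006. Because the root is an ancestor of every taxon, the root hits do not contradict the single specific hit, which is presumably why the read ends up labeled with that taxon when no confidence threshold is set. Any threshold above that fraction should stop the read from being labeled as 526979.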
