I have classified some reads using kraken2. Below is one line from the output file. According to it, only one k-mer was used to classify the read as Bacillus cereus.
C A00804:223:H337FDSX7:1:1108:27706:11115_ACGGATATTGACCTTAGACAAGTAGAA_CGATTG Bacillus cereus 95/8201 (**taxid 526979**) 215 0:55 1:2 0:11 1:6 0:3 1:5 0:3 1:2 0:11 1:6 0:16 1:6 0:3 1:5 0:2 1:2 0:7 **526979:1** 0:35
This is the FASTQ read corresponding to the above classification.
@A00804:223:H337FDSX7:1:1108:27706:11115_ACGGATATTGACCTTAGACAAGTAGAA_CGATTG kraken:taxid|20467
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACAAAAAAAAAAAAAAAAAAAAAAAAAAAAACAAAAAAAAAAAAAAAAAAAAACAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAATATTATTTATACATTCTATAACATATAAAAAAAAAAAAAAAATTATA
My question is: why is kraken2 classifying this read based on just one k-mer? How can I filter out such spurious classifications?
Thanks
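One way to flag calls like this after the fact is to look at how many k-mers actually support the assigned taxon. The sketch below is only an illustration, not part of kraken2: `kmer_support` is a hypothetical helper that parses the space-separated `taxid:count` field from the output line above and returns the fraction of k-mers whose LCA is exactly the assigned taxid. It assumes `A:n` tokens mark k-mers with ambiguous bases and `|:|` separates mates in paired-end output, and it ignores the taxonomy tree, so it is a simplification of what kraken2's own `--confidence` threshold does.

```python
from collections import Counter

# k-mer LCA field copied from the kraken2 output line in the question
EXAMPLE_LCA = (
    "0:55 1:2 0:11 1:6 0:3 1:5 0:3 1:2 0:11 1:6 "
    "0:16 1:6 0:3 1:5 0:2 1:2 0:7 526979:1 0:35"
)

def kmer_support(lca_field: str, assigned_taxid: str) -> float:
    """Fraction of non-ambiguous k-mers whose LCA is exactly assigned_taxid.

    Hypothetical helper: unlike kraken2's confidence score, it does not
    count k-mers assigned to descendants of the taxon.
    """
    counts = Counter()
    for token in lca_field.split():
        if token == "|:|":  # mate separator in paired-end output (assumed)
            continue
        taxid, n = token.rsplit(":", 1)
        counts[taxid] += int(n)
    # "A" marks k-mers containing ambiguous nucleotides; leave them out
    total = sum(n for taxid, n in counts.items() if taxid != "A")
    return counts[assigned_taxid] / total if total else 0.0

if __name__ == "__main__":
    print(f"{kmer_support(EXAMPLE_LCA, '526979'):.4f}")
```

For this read the support is 1 k-mer out of 181, so any reasonable cutoff on this fraction (the exact value is a judgment call) would drop the call.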
I would filter out such terrible reads before giving them to a metagenomic classifier. Maybe with a read trimmer like trimmomatic or others?
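As a rough illustration of that kind of pre-filter, here is a minimal sketch that flags reads dominated by a single base, like the one in the question. The function names and the 0.8 cutoff are assumptions made for the example, not settings from any particular trimmer; a dedicated tool would be the better choice on real data.

```python
from collections import Counter

def dominant_base_fraction(seq: str) -> float:
    """Fraction of the read taken up by its most common base."""
    if not seq:
        return 0.0
    counts = Counter(seq.upper())
    return max(counts.values()) / len(seq)

def is_low_complexity(seq: str, threshold: float = 0.8) -> bool:
    """True if one base makes up at least `threshold` of the read.

    The 0.8 default is an arbitrary illustrative cutoff.
    """
    return dominant_base_fraction(seq) >= threshold

if __name__ == "__main__":
    poly_a = "A" * 90 + "ACGTACGTAC"   # mostly A, similar in spirit to the read above
    mixed = "ACGT" * 25                # evenly mixed bases
    print(is_low_complexity(poly_a), is_low_complexity(mixed))
```

Reads for which `is_low_complexity` returns True would be dropped before running the classifier.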
Thank you for the suggestion. The dataset isn't specifically metagenomic; I'm simply looking at where the unmapped reads align. The read quality, however, is above 30.
That may be, but do you think a read that is mostly poly-A is going to provide any diagnostic information as to which organismal genome it may be from?
True. I don't think it will. Thanks for your suggestion.
Don't always believe sequencer quality scores - here is a nice writeup - https://sequencing.qcfail.com/articles/illumina-2-colour-chemistry-can-overcall-high-confidence-g-bases/