Dear Biostars community:
I have been using Kraken extensively for the characterization of microbiomes. My colleague and I have a bit of a disagreement on how to interpret the results from the Kraken reports.
For example, let’s imagine we have a subset of a Kraken report that looks like this:
1.93 104417 104105 P 1224 Proteobacteria
0.18 96419 1968 P 201174 Actinobacteria
0.17 80738 10469 P 1239 Firmicutes
The columns of the report, according to the Kraken manual, are:
1. Percentage of reads covered by the clade rooted at this taxon
2. Number of reads covered by the clade rooted at this taxon
3. Number of reads assigned directly to this taxon
4. A taxonomy rank code
5. NCBI taxonomy ID
6. Indented scientific name
In this example subset, columns #2 and #3 give different answers as to which taxon has more reads (i.e. which one is more abundant). If I use column #2, I can say more Actinobacteria reads were detected than Firmicutes reads. However, if I look at column #3, then the opposite is true.
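To make the discrepancy concrete, here is a minimal Python sketch that parses the three lines above and ranks the taxa by each column. The function names `parse_report` and `rank_by` are my own and not part of Kraken, and I am assuming whitespace-separated fields as pasted here (an actual report file is tab-delimited with space-indented names, which this split also tolerates).

```python
# The three report lines from the example above, copied as-is.
REPORT = """\
1.93 104417 104105 P 1224 Proteobacteria
0.18 96419 1968 P 201174 Actinobacteria
0.17 80738 10469 P 1239 Firmicutes
"""


def parse_report(text):
    """Parse Kraken-style report lines into dicts (hypothetical helper)."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # Six fields; the name is last and may itself contain spaces.
        pct, clade, direct, rank, taxid, name = line.split(None, 5)
        rows.append({
            "percent": float(pct),
            "clade_reads": int(clade),    # column 2
            "direct_reads": int(direct),  # column 3
            "rank": rank,
            "taxid": int(taxid),
            "name": name.strip(),
        })
    return rows


def rank_by(rows, key):
    """Return taxon names sorted from most to fewest reads for one column."""
    ordered = sorted(rows, key=lambda r: r[key], reverse=True)
    return [r["name"] for r in ordered]


rows = parse_report(REPORT)
by_clade = rank_by(rows, "clade_reads")
by_direct = rank_by(rows, "direct_reads")

print("Ranked by column 2:", by_clade)
print("Ranked by column 3:", by_direct)
```

With these three lines, Actinobacteria and Firmicutes swap places depending on which column is used for sorting.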
If one were to use read count as a proxy for abundance, which column of the Kraken report is more appropriate: column #2 or column #3? In my opinion column #2 is more appropriate, but my collaborator thinks it is #3. My reasoning is that column #2 is the sum of the reads assigned directly to the taxon plus all the reads assigned to any taxon below it in the same clade. I would be very interested in seeing what you think.
Column #2 should be used, in my opinion. Column #3 counts only the reads assigned directly to that taxon, that is, reads that could not be placed at any lower taxonomic level.
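A quick way to see this is to subtract column #3 from column #2, which gives the reads that were placed somewhere below each phylum. This is only a sketch using the three example lines from the question; the variable names are mine.

```python
# (name, column 2 = clade reads, column 3 = direct reads) from the example.
rows = [
    ("Proteobacteria", 104417, 104105),
    ("Actinobacteria", 96419, 1968),
    ("Firmicutes", 80738, 10469),
]

# Reads assigned to taxa below each phylum = clade reads - direct reads.
below = {name: clade - direct for name, clade, direct in rows}

for name, clade, direct in rows:
    print(f"{name}: {direct} direct + {below[name]} below = {clade} in clade")
```

In this example most Actinobacteria reads were classified below the phylum level, so ranking by column #3 alone would leave them out of the comparison.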
I agree. This makes sense. Thanks, @Asaf.