Question

Interpretation of Kraken results

3

6.9 years ago

ropolocan ▴ 830

Dear Biostars community:

I have been using Kraken extensively for the characterization of microbiomes. My colleague and I have a bit of a disagreement on how to interpret the results from the Kraken reports.

For example, let’s imagine we have a subset of a Kraken report that looks like this:

1.93    104417  104105  P   1224    Proteobacteria
0.18    96419   1968    P   201174  Actinobacteria
0.17    80738   10469   P   1239    Firmicutes

The columns of the report, according to the Kraken manual are:

1. Percentage of reads covered by the clade rooted at this taxon
2. Number of reads covered by the clade rooted at this taxon
3. Number of reads assigned directly to this taxon
4. A taxonomy rank code
5. NCBI taxonomy ID
6. indented scientific name

In this example subset, columns #2 and #3 give different answers as to which taxon has more reads (i.e. which one is more abundant). Using column #2, more Actinobacteria reads were detected than Firmicutes reads (96419 vs. 80738). Using column #3, the opposite is true (1968 vs. 10469).
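To make the disagreement concrete, here is a minimal Python sketch that ranks the three example rows by each column. It assumes whitespace-separated fields as pasted above (real Kraken reports are tab-delimited and the name column is indented); the helper name `parse_report` and the dictionary keys are just for illustration.

```python
# The three example rows from the report excerpt above.
REPORT = """\
1.93    104417  104105  P   1224    Proteobacteria
0.18    96419   1968    P   201174  Actinobacteria
0.17    80738   10469   P   1239    Firmicutes
"""


def parse_report(text):
    """Parse report lines into dicts (hypothetical helper, not part of Kraken)."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split into at most 6 fields so names containing spaces stay intact.
        pct, clade, direct, rank, taxid, name = line.split(None, 5)
        rows.append({
            "percent": float(pct),
            "clade_reads": int(clade),    # column 2
            "direct_reads": int(direct),  # column 3
            "rank": rank,
            "taxid": int(taxid),
            "name": name.strip(),
        })
    return rows


rows = parse_report(REPORT)

# Rank taxa from most to least reads, once per column.
by_clade = [r["name"] for r in sorted(rows, key=lambda r: r["clade_reads"], reverse=True)]
by_direct = [r["name"] for r in sorted(rows, key=lambda r: r["direct_reads"], reverse=True)]

print("Ranked by column 2:", by_clade)
print("Ranked by column 3:", by_direct)
```

With these rows, Actinobacteria ranks above Firmicutes by column 2 and below it by column 3.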

If one were to use read count as a proxy for abundance, which column of the Kraken report is more appropriate: column #2 or column #3? In my opinion column #2 is more appropriate, but my collaborator thinks it is #3. I prefer column #2 because it is the sum of the reads assigned directly to the taxon plus all the reads assigned to its descendants, i.e. every read in the clade rooted at that taxon. I would be very interested to hear what you think.

kraken classification taxonomy metagenomics • 9.4k views

updated 6.9 years ago by Joseph Hughes ★ 3.0k • written 6.9 years ago by ropolocan ▴ 830

1

Column #2 should be used, in my opinion. Column #3 only counts the reads that were assigned to the taxon itself because they could not be assigned to any lower taxonomic level.

6.9 years ago by Asaf 10k

0

I agree. This makes sense. Thanks, @Asaf.

6.9 years ago by ropolocan ▴ 830

score 4 · Answer 1 · 2018-01-16

4

6.9 years ago

Joseph Hughes ★ 3.0k

It makes much more sense to use the results from columns 1 and 2, as these represent all reads that are assigned to (for example) Proteobacteria and to any descendant taxa of Proteobacteria, e.g. Acidithiobacillia, Alphaproteobacteria, etc. If using column 3, you would be looking only at reads that are assigned to Proteobacteria itself and not to any descendant node. This is equivalent to only counting reads with this type of assignment in NCBI (see ORGANISM): https://www.ncbi.nlm.nih.gov/nuccore/X97116.1
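The relationship between the two columns can be sketched in a few lines of Python: the clade count (column 2) of a taxon is its own direct count (column 3) plus the clade counts of its children. The tiny tree and the read counts below are invented purely for illustration and do not come from any real report.

```python
# Toy taxonomy: parent -> list of children (illustrative only).
children = {
    "Proteobacteria": ["Alphaproteobacteria", "Acidithiobacillia"],
    "Alphaproteobacteria": [],
    "Acidithiobacillia": [],
}

# Hypothetical column-3 values: reads assigned directly to each taxon.
direct_reads = {
    "Proteobacteria": 100,
    "Alphaproteobacteria": 40,
    "Acidithiobacillia": 10,
}


def clade_reads(taxon):
    """Column-2 style count: direct reads plus the reads of all descendants."""
    return direct_reads[taxon] + sum(clade_reads(child) for child in children[taxon])


print(clade_reads("Proteobacteria"))       # 150
print(clade_reads("Alphaproteobacteria"))  # 40
```

In this toy case, counting only the 100 direct reads would miss the 50 reads that were classified more precisely within Proteobacteria.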