Interpretation of Kraken results
6.9 years ago
ropolocan ▴ 830

Dear Biostars community:

I have been using Kraken extensively for the characterization of microbiomes. My colleague and I have a bit of a disagreement on how to interpret the results from the Kraken reports.

For example, let’s imagine we have a subset of a Kraken report that looks like this:

1.93    104417  104105  P   1224    Proteobacteria
0.18    96419   1968    P   201174  Actinobacteria
0.17    80738   10469   P   1239    Firmicutes

The columns of the report, according to the Kraken manual, are:

1. Percentage of reads covered by the clade rooted at this taxon
2. Number of reads covered by the clade rooted at this taxon
3. Number of reads assigned directly to this taxon
4. A taxonomy rank code
5. NCBI taxonomy ID
6. Indented scientific name

In this example subset, columns #2 and #3 give different answers as to which taxon has more reads (i.e. which one is more abundant). If I use column #2, I can say that more Actinobacteria reads were detected than Firmicutes reads. However, if I use column #3, the opposite is true.
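
To make the comparison concrete, here is a minimal Python sketch that ranks the three example rows by column 2 and then by column 3. The function name `parse_report` and the dictionary keys are my own choices for illustration, not part of Kraken, and the parser assumes whitespace-separated columns as in the subset shown here.

```python
# Example rows copied from the report subset in the question.
REPORT = """\
1.93    104417  104105  P   1224    Proteobacteria
0.18    96419   1968    P   201174  Actinobacteria
0.17    80738   10469   P   1239    Firmicutes
"""


def parse_report(text):
    """Parse Kraken-style report lines into a list of dicts (hypothetical helper)."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Split into at most 6 fields so multi-word taxon names stay intact.
        pct, clade, direct, rank, taxid, name = line.split(None, 5)
        rows.append({
            "percent": float(pct),
            "clade_reads": int(clade),    # column 2
            "direct_reads": int(direct),  # column 3
            "rank": rank,
            "taxid": int(taxid),
            "name": name.strip(),
        })
    return rows


rows = parse_report(REPORT)

# Rank taxa by column 2 (reads in the whole clade) and by column 3 (direct reads).
by_clade = [r["name"] for r in sorted(rows, key=lambda r: r["clade_reads"], reverse=True)]
by_direct = [r["name"] for r in sorted(rows, key=lambda r: r["direct_reads"], reverse=True)]

print(by_clade)   # ['Proteobacteria', 'Actinobacteria', 'Firmicutes']
print(by_direct)  # ['Proteobacteria', 'Firmicutes', 'Actinobacteria']
```

Actinobacteria and Firmicutes swap places depending on which column is used as the sort key, which is exactly the disagreement described above.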

If one were to use read count as a proxy for abundance, which column of the Kraken report is more appropriate: column 2 or column 3? In my opinion column #2 is more appropriate, but my collaborator thinks it is #3. I prefer column #2 because it is the sum of the reads assigned directly to the taxon plus all the reads assigned to any taxon within the clade rooted at it. I would be very interested in seeing what you think.


Column #2 should be used, in my opinion. Column #3 only counts the reads that could not be assigned to any lower taxonomic level.


I agree. This makes sense. Thanks, @Asaf.

6.9 years ago
Joseph Hughes ★ 3.0k

It makes much more sense to use columns 1 and 2, as these represent all reads that are assigned to (for example) Proteobacteria or to any descendant taxon of Proteobacteria, e.g. Acidithiobacillia, Alphaproteobacteria, etc. If you used column 3, you would be looking only at reads that are assigned to Proteobacteria itself and not to any descendant node. This is equivalent to only counting reads with this type of assignment in NCBI (see ORGANISM): https://www.ncbi.nlm.nih.gov/nuccore/X97116.1
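
As a rough illustration of how the two columns relate, the sketch below computes a clade count as the direct count of a taxon plus the clade counts of its children. The taxon names come from this answer, but the tree is cut down to two children and the read counts are invented purely for illustration; `CHILDREN`, `DIRECT_READS` and `clade_reads` are hypothetical names, not Kraken output or API.

```python
# Toy taxonomy (parent -> children). Heavily simplified: the real
# Proteobacteria clade has many more descendants.
CHILDREN = {
    "Proteobacteria": ["Alphaproteobacteria", "Acidithiobacillia"],
    "Alphaproteobacteria": [],
    "Acidithiobacillia": [],
}

# Invented "column 3" values: reads assigned directly to each taxon.
DIRECT_READS = {
    "Proteobacteria": 10,
    "Alphaproteobacteria": 70,
    "Acidithiobacillia": 20,
}


def clade_reads(taxon):
    """Column-2-style count: direct reads plus the clade reads of all children."""
    return DIRECT_READS[taxon] + sum(clade_reads(child) for child in CHILDREN[taxon])


print(clade_reads("Proteobacteria"))  # 100
print(DIRECT_READS["Proteobacteria"])  # 10
```

In this toy case, looking only at the direct count would suggest 10 Proteobacteria reads, while the clade count shows that 100 reads fall somewhere within Proteobacteria.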


Agreed. Thanks for your answer, @Joseph Hughes.
