I am using kraken2 version 2.1.1.
With the --report flag, I expected an output file whose first column gives the percentage of reads for each taxon hit.
Something like the example below (taken from https://genomics.sschmeier.com/ngs-taxonomic-investigation/index.html). Column 1 is the percentage for the taxon named in the last column, followed by the number of reads in the clade rooted at that taxon, the number of reads assigned directly to that taxon, the classification rank (e.g. U = unclassified; I have no idea what R is, as it is not explained in the documentation), the NCBI taxonomy ID, and finally the scientific name.
83.56 514312 514312 U 0 unclassified
16.44 101180 0 R 1 root
16.44 101180 0 R1 131567 cellular organisms
16.44 101180 2775 D 2 Bacteria
13.99 86114 1 D1 1783270 FCB group
13.99 86112 0 D2 68336 Bacteroidetes/Chlorobi group
13.99 86103 8 P 976 Bacteroidetes
13.94 85798 2 C 200643 Bacteroidia
13.94 85789 19 O 171549 Bacteroidales
13.87 85392 0 F 815 Bacteroidaceae
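For anyone who wants to read this standard report programmatically, here is a minimal Python sketch. It assumes the six-column layout described above (tab-separated in a real report file, space-separated as pasted here); `ReportRow` and `parse_report_line` are names made up for this illustration and are not part of Kraken2.

```python
from typing import NamedTuple


class ReportRow(NamedTuple):
    percent: float      # percentage of reads in the clade rooted at this taxon
    clade_reads: int    # reads in the clade rooted at this taxon
    taxon_reads: int    # reads assigned directly to this taxon
    rank: str           # rank code, e.g. U, R, D, P, C, O, F
    taxid: int          # NCBI taxonomy ID
    name: str           # scientific name


def parse_report_line(line: str) -> ReportRow:
    """Parse one line of a six-column report (assumed layout)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 6:
        # Fall back to whitespace splitting, keeping spaces inside the name.
        fields = line.split(None, 5)
    pct, clade, direct, rank, taxid, name = fields[:6]
    return ReportRow(float(pct), int(clade), int(direct), rank, int(taxid), name.strip())


row = parse_report_line("13.99 86112 0 D2 68336 Bacteroidetes/Chlorobi group")
print(row.rank, row.taxid, row.name)
```

Splitting on tabs first matters because the name column in a real report is indented with spaces and may itself contain spaces.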
Instead my --report output gives me something like the following:
d__Bacteria 7879
d__Bacteria|p__Proteobacteria 7783
d__Bacteria|p__Proteobacteria|c__Alphaproteobacteria 4240
d__Bacteria|p__Proteobacteria|c__Alphaproteobacteria|o__Rickettsiales 4182
d__Bacteria|p__Proteobacteria|c__Alphaproteobacteria|o__Rickettsiales|f__Anaplasmataceae 4181
d__Bacteria|p__Proteobacteria|c__Alphaproteobacteria|o__Rickettsiales|f__Anaplasmataceae|g__Wolbachia 4181
I also tried --output which gives the following:
U A00977:183:HLLKYDSXY:3:2426:20166:2315 unclassified (taxid 0) 146|146 0:112 |:| 0:112
C A00977:183:HLLKYDSXY:3:2426:29613:2550 Wolbachia (taxid 953) 146|145 0:112 |:| 2591635:2 0:41 1845000:2 0:17 169402:1 0:48
C A00977:183:HLLKYDSXY:3:2426:5538:2566 Klebsiella pneumoniae (taxid 573) 146|146 0:15 28216:2 0:18 1236:8 767434:2 2:5 0:33>
U A00977:183:HLLKYDSXY:3:2426:4318:2832 unclassified (taxid 0) 146|146 0:112 |:| 0:112
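As a rough cross-check, the per-read --output file can at least give the overall fraction of classified reads. The sketch below assumes only that each line starts with the C/U status code, as in the lines above; `classification_summary` is a hypothetical helper, not a Kraken2 function.

```python
from collections import Counter


def classification_summary(lines):
    """Count classified (C) and unclassified (U) reads in per-read output."""
    status = Counter()
    for line in lines:
        if line.strip():
            # The first whitespace-delimited field is the C/U status code.
            status[line.split(None, 1)[0]] += 1
    total = status["C"] + status["U"]
    percent = 100.0 * status["C"] / total if total else 0.0
    return {
        "classified": status["C"],
        "unclassified": status["U"],
        "percent_classified": percent,
    }


example = [
    "U A00977:183:HLLKYDSXY:3:2426:20166:2315 unclassified (taxid 0) 146|146 0:112 |:| 0:112",
    "C A00977:183:HLLKYDSXY:3:2426:29613:2550 Wolbachia (taxid 953) 146|145 0:112 |:| 2591635:2 0:41",
]
print(classification_summary(example))
```

This gives only the classified/unclassified split; per-taxon percentages still need the report file.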
Nothing on the Kraken2 GitHub page explains how to get the taxon percentages, and I can't find any note on why my outputs differ from the ones shown in the tutorials. Can anyone help me understand this? I just want to get taxon identifications and abundances for my samples using Kraken2.
Thank you!
What is the exact command that you're using with kraken2?
So I tried different flags, and so far it appears to work as expected if I remove --use-names and --use-mpa-style. I am now testing whether it still works when I include --report-zero-counts.
I don't have enough experience with the software to help you further. My advice is to take these questions to the developers by opening a new issue in their GitHub repository.
Although I think you, or someone with the same question, has already done so: "kraken2 output does not produce a column of taxon percentages".
As I said, I don't have enough experience with the software to help you further, though I think the issue may lie in the option --use-mpa-style, which, as far as I understand and according to the documentation, produces output similar to MetaPhlAn's (citing):
In addition, we also provide the option --use-mpa-style that can be used in conjunction with --report. This option provides output in a format similar to MetaPhlAn's output. The output with this option provides one taxon per line, with a lowercase version of the rank codes in Kraken 2's standard sample report format (except for 'U' and 'R'), two underscores, and the scientific name of the taxon (e.g., "d__Viruses"). The full taxonomy of each taxon (at the eight ranks considered) is given, with each rank's name separated by a pipe character (e.g., "d__Viruses|o_Caudovirales"). Following this version of the taxon's scientific name is a tab and the number of fragments assigned to the clade rooted at that taxon.
Note the last sentence: "Following this version of the taxon's scientific name is a tab and the number of fragments assigned to the clade rooted at that taxon." So the output that you're getting is the expected one.
Perhaps, although this option mimics the MetaPhlAn format, which includes percentages, it reports the number of fragments instead, and these can be used to calculate percentages anyway (you need to confirm this, though, because I don't know; it is just my interpretation after reading the documentation).
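If you do want percentages from the --use-mpa-style report, a minimal Python sketch of the arithmetic could look like this. The function name `mpa_percentages` and its `total_reads` parameter are made up for illustration. It also rests on an assumption: when no total is supplied, the denominator is the sum of the top-level lines (those without a pipe), so the percentages are relative to reads placed under those taxa and not to all reads in the sample. To get figures comparable to the standard report, pass the total number of reads yourself.

```python
def mpa_percentages(lines, total_reads=None):
    """Convert 'lineage<whitespace>count' lines into percentages."""
    counts = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # The count is the last field; the lineage may contain spaces.
        lineage, count = line.rsplit(None, 1)
        counts[lineage] = int(count)
    if total_reads is None:
        # Assumption: use reads under the top-level taxa as the denominator.
        total_reads = sum(c for lin, c in counts.items() if "|" not in lin)
    if total_reads == 0:
        return {lin: 0.0 for lin in counts}
    return {lin: 100.0 * c / total_reads for lin, c in counts.items()}


report = [
    "d__Bacteria 7879",
    "d__Bacteria|p__Proteobacteria 7783",
]
for lineage, pct in mpa_percentages(report).items():
    print(f"{pct:6.2f}  {lineage}")
```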
I hope this helps,
António
Yep! I finally got it to work by removing --use-mpa-style. I think that change in format was interfering with the report output, because the resulting file is very different.