Hi everyone,
If you, like me, work with metagenomic data, you have probably used kaiju2table in the past. It's a tool provided with the Kaiju source code. It produces TSV tables that can easily be handled later on, for example for plotting.
The input is the standard per-read output format of Kaiju, which is also used by Kraken:
C A00700:50:HF7LGDRXX:1:1101:1000:10144#CCGGCATCATCTACGA 1578 100 1578:66
C A00700:50:HF7LGDRXX:1:1101:1000:19225#CCGGCATCATCTACGA 186802 100 0:11 186802:55
C A00700:50:HF7LGDRXX:1:1101:1000:23234#CCGGCATCATCTACGA 1578 100 1578:66
The input can come from as many files as you need; they are combined into a single file containing percentages, like this:
file percent reads taxon_id taxon_name
F14_A_R1.s.out 59.89815343509682 7161673 0 Unclassified
F14_A_R1.s.out 1.080231644647389 129157 1301 Streptococcus
F14_A_R1.s.out 2.129960840275143 254667 1350 Enterococcus
F14_A_R1.s.out 1.3252716093792982 158455 1485 Clostridium
F14_A_R1.s.out 6.908532882384414 826013 1578 Lactobacillus
F14_A_R1.s.out 1.880613565083921 224854 204475 Gemmiger
F14_A_R1.s.out 3.163456075511585 378236 572511 Blautia
F14_A_R1.s.out 1.4769725746433902 176593 946234 Flavonifractor
F14_A_R1.s.out 1.0168598167829042 121580 1017280 Pseudoflavonifractor
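To make the conversion concrete, here is a minimal sketch of how per-read lines like the ones above could be turned into rows of the percentage table. It is not the code of kaiju2table or kraken2table: the function names `count_taxa` and `to_percent_rows` are hypothetical, fields are assumed to be whitespace-separated with the taxon ID in the third column, and no rank collapsing or taxon name lookup is attempted.

```python
from collections import Counter


def count_taxa(lines):
    """Count reads per taxon ID from per-read classification lines.

    Assumes whitespace-separated fields: status, read ID, taxon ID, ...
    Reads whose status is not "C" are pooled under taxon ID 0.
    """
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        status, _read_id, taxon_id = fields[:3]
        counts[int(taxon_id) if status == "C" else 0] += 1
    return counts


def to_percent_rows(file_name, counts):
    """Build (file, percent, reads, taxon_id) rows, sorted by taxon ID."""
    total = sum(counts.values())
    if total == 0:
        return []
    return [
        (file_name, 100.0 * n / total, n, taxon_id)
        for taxon_id, n in sorted(counts.items())
    ]


example = [
    "C A00700:50:HF7LGDRXX:1:1101:1000:10144#CCGGCATCATCTACGA 1578 100 1578:66",
    "C A00700:50:HF7LGDRXX:1:1101:1000:19225#CCGGCATCATCTACGA 186802 100 0:11 186802:55",
    "C A00700:50:HF7LGDRXX:1:1101:1000:23234#CCGGCATCATCTACGA 1578 100 1578:66",
]

rows = to_percent_rows("example.out", count_taxa(example))
for row in rows:
    print(*row, sep="\t")
```

The file name "example.out" is a placeholder. A real table would also need the taxon_name column, which requires a taxonomy lookup that this sketch leaves out.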
As far as I could find, there is no such tool for Kraken2, which is perhaps more widely used than Kaiju. You could, of course, try to use kaiju2table with the Kraken results, but you would have to install Kaiju to get it.
Hence, for my own convenience, I have made a tool called kraken2table that converts the *.out files produced by Kraken2 (mpa format) to *.tsv tables that resemble those produced by kaiju2table.
You can find it here: https://github.com/MatteoSchiavinato/Utilities/blob/master/kraken2table
It depends on:
- ete3
- dask[complete]
The options are quite simple:
usage: kraken2table [-h] -i [INPUT_FILES [INPUT_FILES ...]] -o OUTPUT_FILE
                    [-p THREADS] [-r RANK] [-m MIN_FRAC] [-c MIN_COUNT] [-u]

optional arguments:
  -h, --help            show this help message and exit
  -i [INPUT_FILES [INPUT_FILES ...]], --input-files [INPUT_FILES [INPUT_FILES ...]]
                        Name of input files (SPACE-separated).
  -o OUTPUT_FILE, --output-file OUTPUT_FILE
                        Name of output file.
  -p THREADS, --threads THREADS
                        Number of parallel threads
  -r RANK, --rank RANK  Taxonomic rank to be output, all lowercase (Default:
                        species)
  -m MIN_FRAC, --min-frac MIN_FRAC
                        Number in [0, 100], denoting the minimum required
                        percentage for the taxon (except viruses) to be
                        reported (default: 0.0)
  -c MIN_COUNT, --min-count MIN_COUNT
                        Integer number > 0, denoting the minimum required
                        number of reads for the taxon (except viruses) to be
                        reported (default: 0)
  -u, --exclude-unclassified
                        Unclassified reads are not counted for the total reads
                        when calculating percentages for classified reads.
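As a rough illustration of how the -m, -c and -u options could interact, here is a small sketch based only on the help text above. It is not taken from the kraken2table source: the function name `apply_thresholds` is hypothetical, taxon ID 0 is assumed to stand for unclassified reads, taxa below a threshold are simply dropped (they might instead be pooled in the real tool), and the exemption for viruses is omitted because it needs a taxonomy lookup.

```python
def apply_thresholds(counts, min_frac=0.0, min_count=0, exclude_unclassified=False):
    """Filter a {taxon_id: reads} mapping by percentage and read count.

    Returns {taxon_id: (percent, reads)} for the taxa that pass both
    thresholds. Taxon ID 0 is assumed to mean "unclassified".
    """
    counts = dict(counts)  # work on a copy, leave the caller's data intact
    if exclude_unclassified:
        # Assumed reading of -u: unclassified reads leave the denominator
        counts.pop(0, None)
    total = sum(counts.values())
    kept = {}
    for taxon_id, reads in counts.items():
        percent = 100.0 * reads / total if total else 0.0
        if percent >= min_frac and reads >= min_count:
            kept[taxon_id] = (percent, reads)
    return kept


# Made-up counts, for illustration only
toy_counts = {0: 60, 1578: 30, 1301: 10}

print(apply_thresholds(toy_counts))
print(apply_thresholds(toy_counts, min_frac=50, exclude_unclassified=True))
```

With -u, percentages are computed over classified reads only, so a taxon's percentage rises and a fixed -m threshold lets more taxa through than it would otherwise.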
To be honest, it is not clear what this tool does, and being clear about that is perhaps the most important requirement of any software.
Kraken does produce various outputs, and it is not clear how this tool differs from them or what it does.
PS: you also say it depends on multiprocessing. Why is that? Your software does not seem to use multiprocessing.
I edited the post to answer your questions. As for multiprocessing, yes, that was my mistake. My first version depended on multiprocessing, but using Dask allowed me to get rid of the multiprocessing module.
Thanks for editing the question.
Now that I understand it better, I will mention that Kraken2 does have an output called the report format, which produces output in the following form:
In addition, the recommended workflow is to process the Kraken2 report with Bracken:
https://ccb.jhu.edu/software/bracken/index.shtml?t=manual
That will create column-oriented output like so
In addition, the Bracken tool will concatenate several files, thus creating a tabular report across all samples.
Sure. But in my workflow I'm combining many tools, so I needed a consistent format, and I noticed that a tool similar to kaiju2table wasn't available.