Question

Kraken2 to Phyloseq

3

5.5 years ago

c.e.chong ▴ 60

Hi all,

I am quite new to metagenomics and to analysing my data statistically.

I have run Kraken2 to taxonomically profile my assembled metagenomic samples. I have three disease-state groups and would like to test whether there are any statistical differences between them. I decided to try metagenomeSeq and/or phyloseq, but I am unsure how to get from my Kraken reports to something I can load into R.

I thought of creating BIOM tables with kraken-biom, but I am unsure whether I should create one table per group or one table per sample.

Any information on metagenome statistics and on using metagenomeSeq/phyloseq would be much appreciated!

Thanks!

R shotgun metagenomics statistics kraken • 6.7k views

ADD COMMENT • link updated 5.5 years ago by Asaf 10k • written 5.5 years ago by c.e.chong ▴ 60

score 2 · Answer 1 · 2019-06-13

2

5.5 years ago

Asaf 10k

You will lose data if you work only with the contigs that Kraken classified, or only with the Kraken assignments. I think the comparison should be made on the number of reads mapped to each contig (or contig bin): find the differential contigs first, then work out what they are using Kraken (or protein composition, etc.).

ADD COMMENT • link 5.5 years ago by Asaf 10k

0

I have previously mapped my reads to databases made from the contigs of each sample when using anvi'o to bin my samples. Would these BAM files contain sufficient information for the comparisons? And to use Kraken to find out what the differential contigs are, should I give it the CONCOCT contig bins I have created?

Thank you very much for your help!

ADD REPLY • link 5.5 years ago by c.e.chong ▴ 60

1

You can generate a table of the number of reads mapped to each contig from your BAM file, then sum the counts for each bin and use those totals for the comparison (with DESeq, for instance). Since you have bins, I would take a different approach to determining taxonomy, for instance GTDB-Tk, which is fast but more accurate (a larger DB and a more refined method).
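
The per-bin summing step could be sketched in Python as below. This is only a minimal illustration, not part of any tool mentioned in the thread: the function name `sum_counts_per_bin`, the `"unbinned"` label and the toy contig/bin names are all made up for the example, and it assumes you already have a contig-to-bin mapping (for instance exported from your binning results).

```python
from collections import defaultdict


def sum_counts_per_bin(contig_counts, contig_to_bin):
    """Sum per-contig read counts into per-bin totals.

    contig_counts: dict mapping contig name -> read count
    contig_to_bin: dict mapping contig name -> bin name (hypothetical input)
    Contigs missing from contig_to_bin are pooled under "unbinned".
    """
    bin_counts = defaultdict(int)
    for contig, count in contig_counts.items():
        bin_counts[contig_to_bin.get(contig, "unbinned")] += count
    return dict(bin_counts)


# Toy data: two contigs belong to bin_A, one contig was never binned.
contig_counts = {"contig_1": 120, "contig_2": 30, "contig_3": 7}
contig_to_bin = {"contig_1": "bin_A", "contig_2": "bin_A"}

bin_counts = sum_counts_per_bin(contig_counts, contig_to_bin)
print(bin_counts)
```

Running this once per sample gives one column of bin counts per sample; those columns together form the count table that a tool such as DESeq expects.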

ADD REPLY • link 5.5 years ago by Asaf 10k

0

Thank you for your reply. Is there a preferred method for generating a table of the number of reads mapped to each contig from the BAM file? Is this the same as calculating coverage?

Also, is DESeq recommended for finding taxonomic differences between samples/groups of samples, or is it better suited to comparing genes within samples?

ADD REPLY • link 5.4 years ago by c.e.chong ▴ 60

1

I'm using samtools idxstats <file.bam> | awk '{print $1"\t"int(($3+$4)/2)}' to get the table for each BAM file. Using DESeq to compare samples at the gene level will leave you with a long list of highly dependent features; I would compare contigs (or bins) instead and then figure out what's in the differential ones.
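
For anyone who prefers to do this step outside awk, here is a rough Python sketch of the same calculation plus merging several samples into one table. samtools idxstats prints four tab-separated columns (reference name, sequence length, mapped read-segments, unmapped read-segments), and the code halves columns 3 + 4 exactly as the awk command does. The function names, sample names and toy numbers are invented for the example. Two assumptions to be aware of: the sketch skips the trailing `*` line for reads with no reference (the awk command would print it), and it assumes every sample was mapped to the same set of contigs so that contig names match across samples.

```python
def parse_idxstats(text):
    """Parse `samtools idxstats` output into {contig: count}.

    Mirrors awk '{print $1"\t"int(($3+$4)/2)}': columns 3 and 4 are the
    mapped and unmapped read-segment counts for each reference sequence.
    """
    counts = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        name, _length, mapped, unmapped = line.split("\t")[:4]
        if name == "*":  # summary line for reads without a reference; skipped here
            continue
        counts[name] = (int(mapped) + int(unmapped)) // 2
    return counts


def build_count_matrix(per_sample):
    """Combine {sample: {contig: count}} into a header and rows.

    Contigs absent from a sample get a count of 0.
    """
    samples = sorted(per_sample)
    contigs = sorted({c for counts in per_sample.values() for c in counts})
    rows = [[c] + [per_sample[s].get(c, 0) for s in samples] for c in contigs]
    return ["contig"] + samples, rows


# Toy idxstats output for two hypothetical samples.
sample_1 = "contig_1\t5000\t200\t4\ncontig_2\t3000\t51\t0\n*\t0\t0\t10\n"
sample_2 = "contig_1\t5000\t80\t0\ncontig_2\t3000\t0\t0\n"

header, rows = build_count_matrix({
    "sample_1": parse_idxstats(sample_1),
    "sample_2": parse_idxstats(sample_2),
})
print("\t".join(header))
for row in rows:
    print("\t".join(str(value) for value in row))
```

The printed tab-separated table (contigs as rows, samples as columns) can be saved to a file and read into R as the count matrix.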

ADD REPLY • link 5.4 years ago by Asaf 10k