Question

kraken2 different bacteria read counts on custom database

1

6.2 years ago

bangbangphil2 ▴ 10

Hi,

Using kraken2, I ran two classifications on the same sample: one with the kraken2 standard database, which includes Homo sapiens, and one with a custom database built with kraken2 that doesn't contain Homo sapiens. Of the 29 million reads, 16k are assigned to bacteria with the standard database (with HS). With the custom database without HS, 1.06 million reads are assigned to bacteria.

My question is: which should I believe? There is clearly human contamination in the sample, but when I leave human out of the classification I get many more bacterial reads, and much more diversity too. Still, I am tempted to put my money on the classification using both bacteria and human: to me, the read count difference must come from sequence homology between human and bacteria, where some reads go to human when both targets are available.

What do you think? Does my impression fit with kraken's internal alignment algorithm?

thanks!

Phil

metagenomics kraken2 dna-seq • 4.7k views

ADD COMMENT • link updated 5.8 years ago by ilyzdd ▴ 10 • written 6.2 years ago by bangbangphil2 ▴ 10

0

I'm curious what happens if you remove the mitochondrial DNA from the reference and re-run. I had a similar problem, which I solved; see here: Kraken2 database curation. It might not be a problem with human, though (except for the mitochondria).

ADD REPLY • link 6.2 years ago by Asaf 10k

0

Thanks for the info. I did try a new database without human mitochondrial DNA, but the count doesn't change much...

ADD REPLY • link 6.2 years ago by bangbangphil2 ▴ 10

score 1 · Answer 1 · 2019-10-16

1

5.8 years ago

ilyzdd ▴ 10

Hi,

Have you decontaminated the raw reads before running Kraken2, for example by using Bowtie2 or BWA to map all the reads to the human reference genome and excluding every read that maps? If the sample is human stool, this leaves fewer human reads in the input.
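As a rough illustration of that pre-filtering step, here is a minimal Python sketch that builds a Bowtie2 command keeping only read pairs that do not align concordantly to a human index. The index prefix and file names are hypothetical placeholders, and `--un-conc-gz` is only one way to do this: a pair with a single human-mapping mate can still pass through, so a stricter filter (for example on the SAM flags) may be preferable.

```python
import subprocess

def build_host_filter_cmd(index_prefix, reads_1, reads_2, out_pattern, threads=4):
    """Build a bowtie2 call that writes non-concordantly-aligned pairs to out_pattern.

    In out_pattern, bowtie2 replaces '%' with 1 or 2 for the two mates.
    """
    return [
        "bowtie2",
        "-p", str(threads),
        "-x", index_prefix,          # prefix of a pre-built human Bowtie2 index
        "-1", reads_1,
        "-2", reads_2,
        "--un-conc-gz", out_pattern, # pairs that did NOT align concordantly
        "-S", "/dev/null",           # discard the alignments themselves
    ]

def run_host_filter(*args, **kwargs):
    """Run the filter; raises CalledProcessError if bowtie2 fails."""
    subprocess.run(build_host_filter_cmd(*args, **kwargs), check=True)

# Hypothetical paths, for illustration only.
cmd = build_host_filter_cmd(
    "GRCh38_index",
    "sample_R1.fastq.gz",
    "sample_R2.fastq.gz",
    "sample_nonhuman_R%.fastq.gz",
)
print(" ".join(cmd))
```

The filtered files can then be classified with Kraken2 against either database, which separates the question of host removal from the question of database content.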

ADD COMMENT • link 5.8 years ago by ilyzdd ▴ 10

score 0 · Answer 2 · 2019-05-23

0

6.2 years ago

ctseto ▴ 310

If you like reading kraken --output files, for each contig you might have:

Bacteria:1 9606:12 0:1000 (where 0 is unclassified)

Eliminate the host 9606 and it turns into Bacteria:1 0:1012; the vote switches to Bacteria.

Eliminate the host 9606 and it turns into Bacteria:N 0:1000+(12-N); the vote switches to Bacteria.

I suspect one needs a human "sink" so that k-mers have a place to go, rather than traversing the LCA and ending up somewhere they shouldn't be. However, I find it hard to believe that the difference is 16k vs 1,060k bacterial reads with and without human.

In the end, check the first few lines of your kraken.out from both databases and see how the kmer assignments look.
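The vote switch described above can be played through with a toy tally. This is only a sketch: it uses taxid 2 for Bacteria, and `naive_call` simply picks the taxid with the most k-mer hits, which is a deliberate simplification and not Kraken2's actual rule (Kraken2 scores root-to-leaf paths in the taxonomy). The function names are made up for illustration.

```python
from collections import Counter

def tally(field):
    """Sum k-mer counts per taxid from a 'taxid:count taxid:count ...' string."""
    counts = Counter()
    for token in field.split():
        taxid, n = token.split(":")
        counts[taxid] += int(n)
    return counts

def naive_call(counts):
    """Most-hit taxid, ignoring unclassified '0' (simplified, not Kraken2's rule)."""
    hits = {t: n for t, n in counts.items() if t != "0"}
    return max(hits, key=hits.get) if hits else "0"

def drop_taxon(counts, taxid):
    """Pretend taxid is absent from the database: its k-mers become unclassified."""
    out = Counter(counts)
    out["0"] += out.pop(taxid, 0)
    return out

with_host = tally("2:1 9606:12 0:1000")
without_host = drop_taxon(with_host, "9606")

print(naive_call(with_host))     # 9606
print(naive_call(without_host))  # 2
print(dict(without_host))
```

With the host present, the single bacterial k-mer is outvoted; once the host k-mers become unclassified, that one k-mer is enough to carry the call.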

ADD COMMENT • link 6.2 years ago by ctseto ▴ 310

0

Looking at the output files I see things like this :

from database with human:

C   NB502083:48:HKTMTAFXY:1:11101:19388:1052    Homo sapiens (taxid 9606)   76|76   9606:3 131567:5 9606:1 131567:1 9606:5 131567:3 9606:24 |:| 9606:21 2759:5 9606:5 2759:6 9606:5

from database without human

C   NB502083:48:HKTMTAFXY:1:11101:19388:1052    1280    76|76   0:3 1280:5 0:1 1280:1 0:5 1280:3 0:24 |:| 0:3 1280:5 0:1 1280:1 0:5 1280:3 0:24

Taxon 1280 is Staphylococcus aureus, but many k-mers are not in the database ('0:'). Taking a closer look at the output, I see that for a read pair to be unclassified, both reads must be completely absent from the k-mer database. From this observation I guess that one is better off with the most complete k-mer database.
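To make the comparison less eyeball-driven, the two k-mer strings can be lined up position by position. The sketch below does this for the first mate only (the part before `|:|`), using the strings copied from the two output lines above; `expand_runs` is a hypothetical helper name.

```python
from collections import Counter

def expand_runs(field):
    """Expand 'taxid:count' runs into one taxid per k-mer position."""
    taxids = []
    for token in field.split():
        taxid, count = token.split(":")
        taxids.extend([taxid] * int(count))
    return taxids

# First-mate k-mer fields from the two kraken2 output lines above.
with_human = "9606:3 131567:5 9606:1 131567:1 9606:5 131567:3 9606:24"
without_human = "0:3 1280:5 0:1 1280:1 0:5 1280:3 0:24"

a = expand_runs(with_human)
b = expand_runs(without_human)
assert len(a) == len(b), "both lines should describe the same k-mer positions"

# Count how each k-mer's assignment changes between the two databases.
changes = Counter(zip(a, b))
for (old, new), n in changes.most_common():
    print(f"{old} -> {new}: {n} k-mers")
```

For this mate, the 33 k-mers assigned to 9606 become unclassified, and the 9 k-mers assigned to 131567 (cellular organisms in the NCBI taxonomy) become 1280. That pattern would be consistent with those 9 k-mers being shared between human and S. aureus, so that their LCA drops to S. aureus once human is removed from the database.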

ADD REPLY • link 6.2 years ago by bangbangphil2 ▴ 10

0

My interpretation here is that database two, without human, classifies human as "0" (unclassified). For read 1, it is 131567 or 1280, depending on the database. For read 2, it is either 9606 or 2759 with human, and 0 or 1280 in your database sans human. In read 2 the first 21 k-mers are human; without human in the db it is a mix of 0 and 1280 and ends with a bunch of unknowns.

In this case I would probably lean towards your first database.

ADD REPLY • link 5.8 years ago by ctseto ▴ 310