Hi,
Using kraken2, I ran two classifications on the same sample: one with the kraken2 standard database, which includes Homo sapiens (HS), and the other with a custom database built with kraken2 that doesn't contain Homo sapiens. Of the 29 million reads, 16k are assigned to bacteria when using the standard database (with HS). With the custom database without HS, 1.06 million reads are assigned to bacteria.
My question is: which result should I believe? There is clearly human contamination in the sample, but when I leave human out of the classification I get many more bacterial reads, and much more diversity too. Still, I am tempted to put my money on the classification using both bacteria and human, because to me the read count difference must come from some sequence homology between human and bacteria, where some reads go to human when both targets are available.
What do you think? Does my impression fit with kraken's internal alignment algorithm?
Thanks!
Phil
I'm curious what happens if you remove the mitochondrial DNA from the reference and re-run. I had a similar problem, which I solved; see here: Kraken2 database curation. It might not be a problem with human though (except for the mitochondria).
Thanks for the info. I did try with a new database not containing human mitochondrial DNA, but the counts don't change much...
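One way to test the hypothesis from the original question (that reads called bacterial without HS are the same reads that go to human when HS is available) is to compare the per-read output of the two runs directly. Below is a minimal sketch, not a definitive method. It assumes both runs were written with kraken2's `--output` option and without `--use-names`, so that the first three tab-separated columns are the C/U status, the read ID, and the assigned taxid. The function names and the category labels are made up for illustration; 9606 is the NCBI taxid for Homo sapiens.

```python
from collections import Counter

HUMAN_TAXID = "9606"  # NCBI taxonomy ID for Homo sapiens


def read_assignments(path):
    """Map read ID -> assigned taxid from a kraken2 per-read output file.

    Assumes the default column layout (no --use-names):
    status (C/U), read ID, taxid, length, k-mer LCA mapping.
    Unclassified reads are stored with taxid "0".
    """
    assignments = {}
    with open(path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 3:
                continue  # skip blank or malformed lines
            status, read_id, taxid = fields[0], fields[1], fields[2]
            assignments[read_id] = taxid if status == "C" else "0"
    return assignments


def cross_tabulate(with_human, without_human):
    """Count reads by (call with HS database, call without HS database).

    First label: "human", "other", or "unclassified".
    Second label: "classified" or "unclassified".
    Reads missing from the second run are ignored.
    """
    table = Counter()
    for read_id, taxid_a in with_human.items():
        taxid_b = without_human.get(read_id)
        if taxid_b is None:
            continue
        if taxid_a == HUMAN_TAXID:
            label_a = "human"
        elif taxid_a == "0":
            label_a = "unclassified"
        else:
            label_a = "other"
        label_b = "unclassified" if taxid_b == "0" else "classified"
        table[(label_a, label_b)] += 1
    return table
```

Calling `cross_tabulate(read_assignments("with_hs.kraken"), read_assignments("without_hs.kraken"))` (both file names are placeholders) gives a table where the `("human", "classified")` cell counts reads that were called human with the standard database but were assigned to some other taxon once human was removed. If that cell accounts for most of the extra ~1 million bacterial reads, it would support the idea that they are human reads landing on the nearest available target. Note that this sketch only matches reads assigned exactly to taxid 9606; reads placed at a higher rank by the LCA step fall under "other", and restricting the second run to bacteria would need the taxonomy, for example from the kraken2 report file.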