Low Kraken mapping rate for unknown samples
2
0
Entering edit mode
5.2 years ago
LRStar ▴ 200

This is a somewhat of a follow-up from a previous post ( Assemble bacterial .fastq files and find differences (SPAdes followed by MUMmer) ). I was recently given 4 paired-end .fastq files (where each read has about 150 bases), each from a different strain of bacteria (they believe). The researchers seem to expect these 4 samples to likely be from the same genre (Helicobacter), although species within this genre are known for having huge genetic diversity.

My job is to determine the identity (or close identity?) of these samples and then to do analyses on their genes. I am still stuck on the first part of this job.

Most recently, I used Kraken (on Galaxy as I was unable to install it) to examine these four samples. I used default settings (k-mer=31 etc) to check for unclassified and classified plasmids, viruses, and bacteria. Main results showed that Sample 1 was 87% bacteria classified, whereas Samples 2-4 were only ~12-14% bacteria classified (see image https://ibb.co/H2sNXbx).

Looking deeper at report results, percentage of reads covered by the clade are as follows:

Sample 1: 86.83% mapping to Helicobacter cetorum (https://ibb.co/Bc0bbxF)
Sample 2: 5.66% mapping to Helicobacter pylori (https://ibb.co/k5c7fKJ)
Sample 3: 5.28% mapping to Helicobacter pylori (https://ibb.co/NC5zBfS)
Sample 4: 5.44% mapping to Helicobacter pylori (https://ibb.co/ryQwLbG)

I am uncertain how to interpret the very low read percentage coverage of the last three samples. The wet lab researchers seem to predict these samples may be H. pylori but would like me to confirm (or state otherwise) computationally. My question is: What are computational approaches would you recommend for someone like me to consider performing next to better understand what species these samples may be? I just feel unconvinced these are H. pylori with such low read percentage coverage. I worry these may be something other than plasmids, viruses, and bacteria (which was the extent of what Galaxy Kraken allowed me). Thank you for sharing any advice/wisdom.

bacteria alignment Assembly gene kraken • 2.4k views
ADD COMMENT
2
Entering edit mode
5.2 years ago
Asaf 10k

If it's isolated bacteria you won't have a problem assembling each genome and then comparing to a reference set of genome. Take a look at tgdbtk for the comparison.

ADD COMMENT
0
Entering edit mode

Thanks @Asaf. I am trying to see more about "tgdbtk" but upon a Google search, nothing comes up. Is it this? https://gtdb.ecogenomic.org/about.

ADD REPLY
0
Entering edit mode

That's indeed the database and the tool for comparing against the database is here: https://github.com/Ecogenomics/GtdbTk

ADD REPLY
0
Entering edit mode

Thanks @Asaf. I am curious why this software may provide better results than my attempts with Kraken? Also, I am unsure how I should determine a reference genome for these samples (given their low mapping rates so far using other software). I may need to focus on contamination issues first?

ADD REPLY
1
Entering edit mode
5.2 years ago
h.mon 35k

It seems you have some contamination in your samples. If what you are running through kraken are the assembled contigs, then the issue could be minor: even though at first sight most of your contigs are not from the expected Helicobacter genus, these contigs could represent just a tiny proportion of the sequencing reads. Filter the assembly to remove small contigs and particularly contigs with low coverage. I like blobtools to explore assembly contamination, but bbtools sendsketch will return an estimative almost instantaneously.

If you are running the raw sequencing reads through kraken, then it seems you have major contamination. What kraken database are you using?

ADD COMMENT
0
Entering edit mode

@h.mon Thank you for your comments. I was unable to install Kraken (might try again later), so I was using the Galaxy platform. I input the raw sequencing .fastq paired-end read files. Then, I ran the "Kraken" application on them. There were only three databases available named "plasmids", "viruses", and "bacteria". So, I ran each one once for each sample. I am currently running the contigs just to see what happens, but based on your input, I am pretty worried now about contamination. I am looking into blobtools and sendsketch right now, as you suggested.

ADD REPLY
0
Entering edit mode

@h.mon I am also checking out sendsketch and have some of the commands working. However, in the link you provided, it uses a file called "refseq.fa.gz". Do you know where I can download this file?

ADD REPLY
0
Entering edit mode

Thanks for your help. I recently ran a sendsketch command sendsketch.sh in=contigs.fasta) for these 4 samples. (I made the contigs.fasta file for each sample using SPAdes.) The first sample had WKID = 51.3%; KID = 49.5%; ANI = 97.6%; Contam = 12.6% for H. cetorum MIT 00-7128 ( https://ibb.co/KG9BXBr ). The last three samples (Samples 2-4) were similar, so for simplicity, I will just state about Sample 2: It had WKID = 38.9%; KID = 0.05%; ANI = 96.6%; Contam = 2.5% for H. macacae (and similar numbers for H. mastomyrinus) and WKID = 1.1%; KID = 0.85%; ANI = 84.8%; Contam = 1.7% for H. cetorum MIT 00-7128 ( https://ibb.co/HpT4nJP ). I did not see H. macacae or H. mastomyrinus come up using Kraken (which I also recently ran on the contigs, instead of raw reads, and got similar results, ie. very poor for Samples 2-4).

ADD REPLY
0
Entering edit mode

@h.mon I am pretty excited to see the results in sendsketch, at least the WKID values from sendksetch are higher than the Kraken mapping values. However, I am a bit unclear on how to intepret the results. Is it safe to say that Sample 2 may be closest related to H. macacae? It has a very low KID value, but is its WKID value reasonable (38.9%)? In the link you provided about sendsketch, it seems that WKID is "the column that tells you the actual sequence similarity (disregarding size)." So I am focusing on that for now. Strangely, it seems that "contamination" is low in Sample 2 (2.5%) but higher in Sample 1 (12.6%). So, I am curious on your input for two items: 1) If I were to try to determine what strain this sample is (or is closest to), do you think this is sufficient? I do not feel too confident about it personally (but am inexperienced). 2) It seems maybe there still needs to be contamination removal? I read the paper for the software you kindly suggested (blobtools) and it seems that, to use that software, I have to make .bam files. I had a hard time finding reproducible examples for this software. Nonetheless, I suppose this means I need to align my raw reads with a reference genome to use blobtools? Is this correct? And if so, would it be reasonable to align it to H. macacae?

ADD REPLY

Login before adding your answer.

Traffic: 2261 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6