This is somewhat of a follow-up to a previous post (Assemble bacterial .fastq files and find differences (SPAdes followed by MUMmer)). I was recently given 4 paired-end .fastq files (where each read has about 150 bases), each from a different strain of bacteria (they believe). The researchers seem to expect these 4 samples to be from the same genus (Helicobacter), although species within this genus are known for having huge genetic diversity.
My job is to determine the identity (or close identity?) of these samples and then to do analyses on their genes. I am still stuck on the first part of this job.
Most recently, I used Kraken (on Galaxy, as I was unable to install it) to examine these four samples. I used default settings (k-mer = 31, etc.) to check for unclassified and classified plasmids, viruses, and bacteria. The main results showed that Sample 1 was 87% classified as bacteria, whereas Samples 2-4 were only ~12-14% classified as bacteria (see image https://ibb.co/H2sNXbx).
Looking deeper at the report results, the percentages of reads covered by the top clade in each sample are as follows:
Sample 1: 86.83% mapping to Helicobacter cetorum (https://ibb.co/Bc0bbxF)
Sample 2: 5.66% mapping to Helicobacter pylori (https://ibb.co/k5c7fKJ)
Sample 3: 5.28% mapping to Helicobacter pylori (https://ibb.co/NC5zBfS)
Sample 4: 5.44% mapping to Helicobacter pylori (https://ibb.co/ryQwLbG)
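For reference, here is a minimal sketch of how per-clade percentages like these could be pulled out of a Kraken report programmatically. It assumes the standard six-column, tab-separated report layout (percentage of reads in the clade, reads in the clade, reads assigned directly, rank code, NCBI taxonomy ID, indented name). The function name `top_clades` and the toy report are made up for illustration. The numbers in the toy report are invented and are not my actual samples.

```python
def top_clades(report_text, rank="S", n=5):
    """Return up to n (percent, clade_reads, name) tuples for one rank code.

    Assumes the usual six tab-separated columns of a Kraken report.
    """
    rows = []
    for line in report_text.splitlines():
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        pct, clade_reads, _direct, rank_code, _taxid, name = fields[:6]
        if rank_code.strip() == rank:
            rows.append((float(pct), int(clade_reads), name.strip()))
    # Highest percentage first
    rows.sort(reverse=True)
    return rows[:n]


# Toy report with invented numbers (illustration only)
TOY_REPORT = "\n".join([
    " 90.00\t900\t900\tU\t0\tunclassified",
    " 10.00\t100\t0\tR\t1\troot",
    "  8.00\t80\t5\tG\t209\t      Helicobacter",
    "  6.00\t60\t60\tS\t210\t        Helicobacter pylori",
    "  2.00\t15\t15\tS\t138563\t        Helicobacter cetorum",
])

for pct, reads, name in top_clades(TOY_REPORT, rank="S", n=3):
    print(f"{pct:6.2f}%  {reads:>6} reads  {name}")
```

Running the same function with `rank="U"` returns the unclassified row, which is the number that matters most for Samples 2-4.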
I am uncertain how to interpret the very low percentage of reads classified in the last three samples. The wet lab researchers seem to predict these samples may be H. pylori but would like me to confirm (or state otherwise) computationally. My question is: what computational approaches would you recommend I consider next to better understand what species these samples may be? I just feel unconvinced these are H. pylori with such a low percentage of reads classified. I worry these may be something other than plasmids, viruses, and bacteria (which was the extent of what Galaxy Kraken allowed me to check). Thank you for sharing any advice/wisdom.
Thanks @Asaf. I am trying to see more about "tgdbtk" but upon a Google search, nothing comes up. Is it this? https://gtdb.ecogenomic.org/about.
That's indeed the database, and the tool for comparing against it is here: https://github.com/Ecogenomics/GtdbTk
Thanks @Asaf. I am curious why this software may provide better results than my attempts with Kraken? Also, I am unsure how I should determine a reference genome for these samples (given their low mapping rates so far using other software). I may need to focus on contamination issues first?