Hello Biostars,
I am assembling my first Illumina genome. It is a bacterial genome of about 2.5 Mb, from the genus Ignatzschineria.
FastQC shows the following stats for the R1 and R2 reads:
File type: Conventional base calls
Encoding: Sanger / Illumina 1.9
Total Sequences: 16770385
Sequences flagged as poor quality: 0
Sequence length: 124
%GC: 41
I used several other tools to evaluate the raw reads before the assembly.
Bowtie2 maps 11.86% (~2 million) of the total reads to the reference genome Ignatzschineria larvae.
Minimap2 maps 19.12% (3.2 million) of the total reads to the reference genome Ignatzschineria larvae.
Kraken2 (default options) is able to classify only 42.9% (7187522) of those 16.7 million reads as follows: 42.6% bacterial and 0.0239 viral. Only 5.78 million reads classify to Ignatzschineria larvae.
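As a quick sanity check on these figures, here is a minimal Python sketch that recomputes the Kraken2 fractions from the counts given above. The variable names are my own, and 5.78 million is the rounded value quoted above rather than an exact count.

```python
# Counts as reported by FastQC and Kraken2 in this post.
total_reads = 16770385
kraken_classified = 7187522
kraken_ignatzschineria = 5_780_000  # rounded figure, not an exact count


def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


kraken_unclassified = total_reads - kraken_classified

print(f"classified:       {percent(kraken_classified, total_reads):.1f}%")
print(f"unclassified:     {percent(kraken_unclassified, total_reads):.1f}%")
print(f"I. larvae (approx): {percent(kraken_ignatzschineria, total_reads):.1f}%")
```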
This is a very low classification/mapping percentage to a sister species (assuming that we indeed sequenced an Ignatzschineria sp.). Kraken2 reports that 57.1% of the 16770385 reads are unclassified. What organisms do those 57% unclassified reads belong to?
I used Kraken2 databases that include archaea, bacteria, viruses, plasmids, humans, and parasites of invertebrates. I did not try plants.
The sample was collected from a field experiment on blowflies and the bacteria were cultured in the lab. The target colonies were isolated from those cultures.
I used Metaxa2.22 to check the 16S rRNA sequences contained in the reads, and the results of this survey are as follows:
Out of 54584 hits:
14059 ----------unclassified Gammaproteobacteria
3613 ----------Ignatzschineria
33173 ----------unclassified Xanthomonadaceae
Of course, there are other minor hits.
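To put those hit counts in proportion, here is a small Python sketch that turns them into shares of the 54584 total hits. The dictionary and variable names are my own; the counts are the ones listed above, and "other" is simply whatever remains after the three main groups.

```python
# Metaxa2 hit counts as listed in this post.
total_hits = 54584
hits = {
    "unclassified Gammaproteobacteria": 14059,
    "Ignatzschineria": 3613,
    "unclassified Xanthomonadaceae": 33173,
}

# Everything not in the three main groups (the "minor hits").
other_hits = total_hits - sum(hits.values())

# Share of each group as a percentage of all hits.
shares = {name: 100.0 * count / total_hits for name, count in hits.items()}

for name, share in shares.items():
    print(f"{name}: {share:.1f}%")
print(f"other: {other_hits} hits")
```

So the "unclassified Xanthomonadaceae" group accounts for well over half of the 16S hits, while Ignatzschineria itself is a small minority.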
So, I checked the kraken2 database and I was able to verify that all published genomes of Xanthomonadaceae are present.
Do any of you have a suggestion for a tool/database that would be appropriate for identifying the reads that Kraken2 does not classify and that Metaxa2 reports as "unclassified Xanthomonadaceae"?
Any help will be appreciated.
Regards,
This post is kind of hard to read and perhaps that is one reason it has had no responses yet.
I assume this experiment refers to a pure colony of a bacterium that was used to create a library for genome sequencing. I am not sure why you have so much other contamination in your data if this is supposed to be one single organism. If you did not use a single colony to make the libraries, then perhaps that would explain some of the other "stuff" you appear to have picked up in this experiment.
I assume that is the actual name of the bacterium, and the word larvae just happens to be the species name?
Yes sir, a funny and confusing specific epithet for a bacterial species, right? "larvae".
Sir, we have our hypothesis on how we got the "other stuff" in our genome. But that is not the point of my post.
I would appreciate it if someone could direct me to some databases/tools (besides the ones I used) that would allow me to find out what the "other stuff" is. I used the latest version of Kraken2, which uses the latest NCBI taxonomy database. I also used Metaxa2, which, according to its website, uses the latest release of the SILVA database for 16S rRNA.
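In case it helps anyone suggest a next step, this is roughly how I could pull out the IDs of the unclassified reads for a closer look with another tool. It is a minimal Python sketch, assuming Kraken2's standard tab-separated per-read output, where the first column is "C" or "U" and the second is the read ID. The function name and the three example lines are made up for illustration and are not from my data.

```python
import io


def unclassified_read_ids(kraken_lines):
    """Collect read IDs from Kraken2 per-read output lines flagged 'U'."""
    ids = []
    for line in kraken_lines:
        fields = line.rstrip("\n").split("\t")
        # Column 1: C (classified) or U (unclassified); column 2: read ID.
        if len(fields) >= 2 and fields[0] == "U":
            ids.append(fields[1])
    return ids


# Hypothetical example lines, standing in for a real Kraken2 output file.
example = io.StringIO(
    "C\tread_1\t562\t124\t562:90\n"
    "U\tread_2\t0\t124\t0:90\n"
    "U\tread_3\t0\t124\t0:90\n"
)
ids = unclassified_read_ids(example)
print(ids)
```

With a real run, the same function could be given an open file handle instead of the example lines. If I read the manual correctly, Kraken2 also has an --unclassified-out option that writes those reads out directly.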
Perhaps someone having more experience than I do on environmental sampling of bacteria for genome sequencing could help?
Thank you for giving me the opportunity to clarify this post of mine, which is so hard to read.
Regards
cecilio11