I am aiming to determine the species/subspecies identity of a (possibly bacteria) sample. I applied the following sendsketch command:
sendsketch.sh in=contigs.fasta,
where the input file (contigs.fasta) was previously generated from the raw paired-end read file using SPAdes. The results of the above command can be viewed here: here. The sendsketch website states that WKID is "the column that tells you the actual sequence similarity (disregarding size)." As a result, I believe I should focus on H. macacae since it had the highest WKID value (along with KID = 0.05%; ANI = 96.6%; Contam = 2.5%).
1) Are the output values here sound or a cause for alarm? Is the WKID (38.9%) or KID (0.05%) value too low? Or are other values questionable?
2) If the values are sound, I hope to check the mapping rate (using BWA) of the sample to the H. macacae genome (so that I can use it as a reference genome for further steps). However, I am a bit stuck on figuring out the appropriate sequence to download. The taxID number output from the sendsketch was 398626. It seems there are 46 sequences available on NCBI (here). How should I go about determining the optimal H. macacae sequence to map my samples to?
Side note: I have a similar problem with H. mastomyrinus, the second-highest WKID value from sendsketch. There are 17 sequences available on NCBI here. In this case, some are 16S and some are 23S. How do I know which sequence to map against in this scenario?
@5heikki. Thank you for your suggestion. I am probably only asking this because I am lacking very basic knowledge, but why is the one you posted necessarily the best compared to those listed on NCBI? I noticed on NCBI that 25 of the 46 taxID= 398626 results are of the strain you listed (MIT 99-5501). Thank you for the link as well. On your link, I am guessing I should download the one with extension genomic.fna.gz to use as reference genome for DNA sequence? Finally, how did you obtain a "local RefSeq"? Thank you again for your support.
NCBI's RefSeq project aims to provide
So the genome linked by @5heikki can be considered a good representative of H. macacae for the analysis you are looking to do.
You can get the genomic.fna.gz file which will have the sequence in fasta format.
local RefSeq
is likely referring to a local copy of the RefSeq database. Some institutions may download a local copy for ease of access. You can find a copy of RefSeq sequences here. Refseq blast preformatted BLAST databases are also available here.Do you have a pure sample of the bacteria that was sequenced? i.e. you are sure there should be only species present.
Thanks for your help, @genomax and sorry for my earlier thread-organization blunder. Thanks for your question. I am not sure if there is only one species present. I only just began analyzing this type of data and received this dataset from my advisor (who received it from a wet lab researcher). I believe the wet lab researchers think the sample is one strain (or at least one species), but I am not sure about it. I have been attacking this dataset with numerous software suggested to me kindly by biostars users these past weeks, including sourmash (which resulted at k-mer=11 to about 43.5% similarity to various H. pylori strains and some other Helicobacter strains), Kraken on Galaxy (which had 86% bacteria unmapping and only resulted in 5.66% mapping to H.pylori and H. cetorum and other Helicobacter strains), and most recently sendsketch with WKID=38.9% mapping to H. macacae. If you are familiar with this type of analysis, what do you think of these results? I am having a very hard time interpreting it. The wet lab researchers would like me to replicate something like Table 2 from this paper. I believe to do so, I have to first determine what "genome" this sample belongs to. I wonder, if you have experience with this type of analysis, what would you do next given my results so far? I would be grateful to hear any ideas if you have any!
@genomax. Also, a quick update that I downloaded the H. macacae fasta file and used BWA to map the paired-end raw read sample to it. Then, I investigated the results with Qualimap (this pipeline was kindly suggested to me by a biostars user a few weeks ago). I may be doing something wrong, but am disappointed to see that only 1.71% of reads mapped. To me (with little experience), the mapping rate seems surprisingly low given the WKID=38.9% mapping rate to H. macacae in sendsketch (but perhaps that is also considered low anyway)? Sorry for another message here. I am just having difficulty figuring out what to do :oD
LRStar : Can you take a small selection of reads (~20-25, convert them to fasta format using following code for use with BLAST)
and blast them using NCBI's web blast page. Check what comes back on those reads.
If this is a pure culture you should get some clear indication of that (and what organism it could be) from this search. If results are inconclusive then let us know.
@genomax. Thank you for your support. I followed your suggestion running BLAST (using a Database of "Nucleotide collection (nr/nt)". I ran it using "Optimize for" values of megablast and blastN as listed here. I did this for both the first 25 bases and the last 25 bases in the file. BlastN on the first 25 bases had Helicobacter cetorum MIT 00-7128 with lowest e-value (3e-39) (Screenshot here), megablast on the first 25 bases had "no significant similarity found message" (Screenshot here), BlastN on the last 25 bases had Helicobacter felis ATCC 49179 with lowest e-value (1e-25) (Screenshot here), and megablast on the last 25 bases had Helicobacter felis ATCC 49179 with lowest e-value (3e-19) (Screenshot here). I am not sure if there are certain parameters I should use (such as blastN versus megablast) and whether theses results seem consistent - especially because the highest WKID value from sendsketch was a different Helicobacter species.
@genomax. I also checked with my advisor and he believes these samples are "isolated". He also said they could not be bacteria (could be archaea or even other organisms); his contacts just have a suspicion this is bacteria. So, with that information, I am not sure if I should be doing yet even more exhaustive searching.
This will tell you how many percent of the reference genome your reads cover (or vice versa if you map ref reads on your assembly). Here reads are unpaired, so adapt to your needs where necessary..
Please use
ADD COMMENT/ADD REPLY
when responding to existing posts to keep threads logically organized.SUBMIT ANSWER
is for new answers to original question.What is "listed on NCBI" (your link) includes any DNA sequence from txid398626, not just genome assemblies. You can check "the best available" bacterial genome assemblies from ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/assembly_summary.txt (I recommend that you don't open the link in browser, it's a large file). Supposedly RefSeq is curated, but from experience I can say that it still includes numerous falsely annotated (taxonomy) genomes, like 1-5% from total (not really relevant to your current question)..