6.5 years ago
DanielC
Dear Friends,
I ran SPAdes on an Ion Torrent BAM file of single-end reads from a bacteriophage sample, using this command line:
spades.py --iontorrent -k 21,33,55,77,99,127 --only-assembler -s IonXpress.bam -o spades-out --careful --mismatch-correction
After the run was finished, there were error correction and assembling warnings:
=== Error correction and assembling warnings:
* 0:02:06.148 464M / 9G WARN General (kmer_coverage_model.cpp : 327) Valley value was estimated improperly, reset to 1
* 0:02:06.150 464M / 9G WARN General (kmer_coverage_model.cpp : 366) Failed to determine erroneous kmer threshold. Threshold set to: 1
* 0:01:09.757 692M / 9G WARN General (kmer_coverage_model.cpp : 327) Valley value was estimated improperly, reset to 10
* 0:01:09.758 692M / 9G WARN General (kmer_coverage_model.cpp : 366) Failed to determine erroneous kmer threshold. Threshold set to: 10
* 0:00:51.891 688M / 9G WARN General (kmer_coverage_model.cpp : 218) Too many erroneous kmers, the estimates might be unreliable
Could you please let me know what these warnings mean and whether I should be concerned about them? Is there a way to eliminate them and improve the quality of the assembly?
Thanks so much!
Have you looked at the resulting assembly?
Warnings are usually just that, warnings. They can generally be ignored, if you know what you're doing. They are not the same as errors.
If the assembly you got back looks decent enough for your purposes, you can probably proceed.
To sanity check your assembly, check the usual assembly statistics (N50, number of contigs, etc.), and map the reads back to the assembly. You may want to check for any variants that the reads support that aren't in the assembly or vice versa, just in case these weird kmer errors affected the assembly in any way.
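For anyone unsure what N50 measures, here is a minimal Python sketch of the statistic mentioned above. It uses the common definition (the length of the contig at which the running total, counted from the longest contig down, first reaches half of the total assembly length). The function names are made up for illustration, and the FASTA reader is a bare-bones assumption, not part of SPAdes or any other tool.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths (0 if it is empty)."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        # The first contig that pushes the running total to at least half
        # of the assembly length defines the N50.
        if running >= half_total:
            return length
    return 0


def read_fasta_lengths(path):
    """Yield the sequence length of each record in a plain-text FASTA file."""
    length = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # A new header closes the previous record, if there was one.
                if length is not None:
                    yield length
                length = 0
            elif length is not None:
                length += len(line)
    if length is not None:
        yield length
```

Something like `n50(read_fasta_lengths("contigs.fasta"))` would then give the N50 of an assembly, where the file name is only a placeholder for wherever your contigs are. Keep in mind that N50 on its own says nothing about correctness, which is why mapping the reads back is still worth doing.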
Thank you very much for your response! I looked at the result, and I understand that a good assembly should have fewer contigs and a high N50 value. Comparing the results for the different k-mer sizes, I found that K77 has the fewest contigs and a high N50, like this:
Please let me know what you think of this result.
Thanks much!
A largest contig of 2749 bp seems quite poor to me. Even for something like a bacteriophage, I would have expected better. It seems you will have a large number of contigs of about 1000 bp or less, which generally speaking are not to be trusted.
Do you know what total genome size you’re expecting? Did you do much/any QC on your reads before assembly?
The total size of the bacteriophage genome is generally about 40,000 bp. And no, I did not do any QC; I wanted to see how the reads perform without it. Do you think QC will improve the results? Please let me know your suggestions.
You should do the QC first. It might explain your SPAdes warnings. There is obviously something up with the Kmer distribution in your data.
That could be because it's a small genome, but it could also be dodgy reads.
I doubt QC will improve your results specifically; it's more likely to tell you what your data died of.
There may be some things you can do like removing adapters, if they weren’t already removed. What is your depth of sequencing? It’s possible that for a small genome, you may have very deep coverage, which can cause assemblers to choke, so you may also see improvement from downsampling your reads.
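To make the depth and downsampling point concrete, here is a rough Python sketch. It assumes depth is estimated as total sequenced bases divided by genome size, and that downsampling keeps each read independently with a fixed probability. The function names, the total base count and the 100x target are all hypothetical values picked for illustration; only the ~40,000 bp genome size comes from this thread.

```python
import random


def estimate_depth(total_bases, genome_size):
    """Approximate sequencing depth as total sequenced bases / genome size."""
    return total_bases / genome_size


def downsample_fraction(current_depth, target_depth):
    """Fraction of reads to keep to reach target_depth (never more than 1.0)."""
    if current_depth <= target_depth:
        return 1.0
    return target_depth / current_depth


def subsample(reads, fraction, seed=0):
    """Keep each read with probability `fraction`; seeded for reproducibility."""
    rng = random.Random(seed)
    return [read for read in reads if rng.random() < fraction]


# Hypothetical numbers: 40 Mbp of reads over a ~40,000 bp phage genome.
depth = estimate_depth(40_000_000, 40_000)   # 1000x
fraction = downsample_fraction(depth, 100)   # keep about 10% of the reads
```

In practice you would pass the resulting fraction to whatever read-subsampling tool you already use instead of loading reads into Python; the sketch is only meant to show the arithmetic.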
Other people might have some alternative ideas, but otherwise I think it’s probably a case of resequencing the sample, it could just be a bad library.
Thank you very much! Could you please suggest what QC package would be best for such data?
The best tool I know of is MultiQC, but it only aggregates results from other QC tools.
FastQC is the obvious starting point, but its kmer warnings have been known to be spurious for a long time.
My suggestion would be to start on the MultiQC website and run as many of the tools it aggregates as possible. It supports 68 different tools and provides summary stats of its own, all of which can be seen on their site: https://multiqc.info/
Thank you for the suggestions! I ran ClinQC (https://sourceforge.net/p/clinqc/wiki/ClinQC_Manual/) with FastQC and the results after quality control are:
the best K-mer result (low number of contigs and high N50) is for K99 using SPAdes:
Compared to the previous result without QC (posted above), I don't see much difference. If anything more can be done to improve the contig assembly, your suggestions would be highly appreciated. Do you think I should run MultiQC too for QC?
Thanks much!
Honestly, my suspicion is that your input DNA library was probably not the best. Either that or that genome is incredibly repetitive (not beyond the realms of possibility for a bacteriophage).
The assembly you have might be sufficient for your needs, though I think what you're looking at there is just poor input data, which means poor assemblies out the other end.
It probably wants re-sequencing. It may be that the DNA you had input at the start was overly fragmented.
Have you examined the insert size distribution? (A tool like Quast will do this).
Thanks much for your reply Healey! It is very informative. I ran Quast and got the above result. Could you please tell me how to figure out the insert size using Quast?
Sorry, my mistake. I meant Qualimap, not Quast.
Thanks much! Just to check: by "insert size" here, do you mean the DNA fragments that were put in for sequencing?
Yep, I’m wondering if you had many fragments that were shorter than they should have been.
Hi,
Thanks for your previous helpful comments! I have got the contigs from SPAdes and am now looking to get the coverage. I got the result below by running "bbmap.sh" on my data. Could you please tell me how to interpret the coverage here? The average coverage shown is 209.654. What does this mean? I would really appreciate your input.
command line:
covstats.txt
Thanks!
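As a rough illustration of what an "average coverage" figure usually expresses: it is the mean number of reads overlapping each base of the reference, so a value around 210 would mean each assembled base is covered by about 210 reads on average. The Python sketch below assumes the figure is a length-weighted mean over contigs; the function name and the example numbers are hypothetical, and it does not parse the actual covstats.txt output.

```python
def mean_coverage(contigs):
    """Length-weighted mean depth over (length_bp, average_depth) pairs."""
    total_length = 0
    total_bases = 0.0
    for length, depth in contigs:
        total_length += length
        # length * depth approximates the number of read bases on this contig.
        total_bases += length * depth
    if total_length == 0:
        return 0.0
    return total_bases / total_length


# Hypothetical example: a short, deeply covered contig and a longer,
# shallower one. The longer contig pulls the overall mean towards its depth.
example = mean_coverage([(1000, 300.0), (3000, 100.0)])
```

The weighting matters for an assembly with many short contigs: a plain average of per-contig depths would let lots of small, possibly unreliable contigs dominate the figure.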