Hi All,
I am working on de novo genome assembly of a fungal sample. The raw FASTQ data is PE150 at 100X coverage.
I performed de novo genome assembly using SPAdes v4 with four different combinations:
- SPAdes de novo with the --careful option
- SPAdes de novo with the --isolate option
- SPAdes with --trusted-contigs and the --careful option
- SPAdes with --trusted-contigs and the --isolate option
I used the --isolate option because in the previous run with --careful, SPAdes warned that the "data has HIGH uniform coverage" and recommended the --isolate option (see my previous post).
The commands I used for SPAdes, along with the other parameters, are:
$spades -o $spades_out -1 ILL_1.fq.gz -2 ILL_2.fq.gz --careful --threads 14 --memory 240 -k 21,33,55,77,99,111,127
$spades -o $spades_out -1 ILL_1.fq.gz -2 ILL_2.fq.gz --isolate --threads 14 --memory 240 -k 21,33,55,77,99,111,127
$spades -o $spades_out2 -1 ILL_1.fq.gz -2 ILL_2.fq.gz --trusted-contigs $Reference --careful --threads 14 --memory 240 -k 21,33,55,77,99,111,127
$spades -o $spades_out2 -1 ILL_1.fq.gz -2 ILL_2.fq.gz --trusted-contigs $Reference --isolate --threads 14 --memory 240 -k 21,33,55,77,99,111,127
$Reference contains a reference FASTA file, a chromosome-level assembly of Fusarium oxysporum; I removed all unplaced contigs and kept only the chromosomes.
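For anyone wondering, filtering the FASTA down to the chromosomes can be done with seqkit (just a sketch; chrom_ids.txt is a hypothetical file listing the chromosome sequence IDs, one per line):

# chrom_ids.txt: hypothetical list of chromosome IDs to keep
seqkit grep -f chrom_ids.txt full_reference.fa > chromosomes_only.fa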
After running QUAST on the scaffolds.fasta generated by each of the four combinations, I get the results below.
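The comparison was run roughly like this (a sketch; the output directory and per-run paths are illustrative, not verbatim):

quast.py -o quast_all -r $Reference careful/scaffolds.fasta isolate/scaffolds.fasta trusted_careful/scaffolds.fasta trusted_isolate/scaffolds.fasta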
My initial assessment is that the de novo assembly with the --careful option is generating a better assembly, as it has fewer contigs and a larger N50.
For the reference-guided assembly, I am a bit surprised that it assembled more contigs, which I did not expect. (If you can explain the reason, that would be helpful.)
I want to know your opinion on this and which method I should use for analyzing the remaining samples.
AFAIK there is no reason not to use --careful pretty much all the time, but if there is a low risk of contamination, I guess that's where --isolate comes in. Looking at the results, --isolate is clearly helping, but isn't making much difference to your contig lengths. Generally, whatever gives you the highest N50 is likely to be the best (or at least most immediately _useful_) genome.
I'm not sure what's going on with the ref-guided other than the added information is probably allowing data that could not otherwise be resolved satisfactorily to be incorporated into new contigs. You'll have to do a bit more investigation, but it may well be the case that a number of those new/additional contigs are largely duplications.
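If you want a quick sanity check on the duplication idea (just a sketch, and the directory names are mine), align the ref-guided scaffolds against your plain de novo ones with minimap2 and look for contigs whose alignments pile up on the same regions:

# asm5 preset = assembly-to-assembly alignment, <5% divergence
minimap2 -x asm5 careful/scaffolds.fasta trusted_careful/scaffolds.fasta > trusted_vs_careful.paf
# contigs with several near-identical alignments to one region are duplication candidates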
A few other angles/questions:
100X coverage isn't crazy-high, but sometimes it can be worth experimenting with downsampling the coverage too.
Hi, thank you for your helpful insight.
Can you shed some light on downsampling the coverage? Like, how do I do that?
Since it's haploid, you might also want to try shovill (https://github.com/tseemann/shovill). It's intended for bacteria, but it does say it will work for other small microbes as long as they're haploid. Your genome might be too big, but it's worth a try.
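A minimal run would look something like this (a sketch using your file names from above; shovill estimates the genome size itself, so no --gsize is needed):

shovill --outdir shovill_out --R1 ILL_1.fq.gz --R2 ILL_2.fq.gz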
Downsampling is basically just picking a subset of reads at random to reduce the overall coverage. For bacterial genomes, 30X is the pretty widely accepted 'norm'. There are a few tools that can do it or you can write something yourself. Have a look on this forum for other posts about downsampling.
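As a minimal sketch with seqtk (the 0.3 fraction assumes you want ~30X out of your ~100X data; the output names are mine):

# same seed (-s100) on both mates keeps the read pairs in sync
seqtk sample -s100 ILL_1.fq.gz 0.3 | gzip > ILL_30x_1.fq.gz
seqtk sample -s100 ILL_2.fq.gz 0.3 | gzip > ILL_30x_2.fq.gz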
I'd have a quick look at BUSCO scores before judging these genomes.
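Something along these lines, assuming BUSCO v5 and a fungal lineage (hypocreales_odb10 is probably the closest for Fusarium, but that's my guess):

busco -i scaffolds.fasta -m genome -l hypocreales_odb10 -o busco_careful

The "Complete and duplicated" percentage there would also speak to the duplication theory above.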
Hi,
How long did it take to run the whole process with this command?
Regards,
I didn't keep the log files, but as far as I remember it was around 5-6 hours.
Recently I generated a hybrid assembly using SPAdes with 10 threads, trying both the --isolate and --careful options. The Illumina data was the same as in the commands above, plus Nanopore data.
For --isolate it took 485 minutes (~8 hr) at 10 threads.
For --careful it took 660 minutes (~11 hr) at 10 threads.