Hi all,
I am running SPAdes assemblies of Illumina reads for several bacterial genomes, using this command:
spades.py -1 ...R1.fastq -2 ...R2.fastq --careful -t 3 -m 30 -o
The same genomes were previously assembled with Velvet (from exactly the same raw data).
When I looked at the results, I got many more contigs with SPAdes! Any idea why? Is there anything I can do after running SPAdes to improve the assembly quality of my genomes? Here is an example of the differences I get with the two assemblers (from running "seqkit stats"):
Genome 1:
file             format  type  num_seqs  sum_len    min_len  avg_len  max_len  Q1   Q2   Q3     sum_gap  N50
scaffolds.fasta  FASTA   DNA   877       2,223,301  56       2,535.1  116,526  111  216  297    17       24,875  (SPAdes)
Velvet.fa        FASTA   DNA   313       2,108,423  197      6,736.2  116,024  240  414  7,858  0        24,461  (Velvet)

Genome 2:
file             format  type  num_seqs  sum_len    min_len  avg_len  max_len  Q1   Q2   Q3     sum_gap  N50
scaffolds.fasta  FASTA   DNA   1,234     2,319,934  56       1,880    132,700  168  223  295    18       25,849  (SPAdes)
Velvet.fa        FASTA   DNA   332       2,122,470  193      6,393    132,734  234  344  6,301  0        26,473  (Velvet)
thanks for any possible help! Anna
Your N50s are basically the same; it's just that SPAdes has kept many of the smaller contigs. Presumably Velvet is stricter in its default filtering unless told otherwise.
Your SPAdes assembly is not necessarily worse than the Velvet one, since the N50s are comparable. SPAdes has actually given you more data, but now you have to decide what to do with the potentially lower-quality short contigs. A first step would be to simply discard anything smaller than a kilobase (for example) and then run seqkit again; I suspect the numbers will start to look a lot more like Velvet's.
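Since you already have seqkit, the direct way to apply that length cutoff is `seqkit seq -m 1000 scaffolds.fasta > filtered.fasta`. If you'd rather not depend on any tool, here is a minimal plain-Python sketch of the same filter; the 1000 bp cutoff and file names are just examples, and it assumes a standard multi-line FASTA input:

```python
# Minimal sketch: keep only FASTA records of at least min_len bases.
# The 1 kb cutoff is an example, not a fixed recommendation.
def filter_fasta(lines, min_len=1000):
    """Yield (header, sequence) pairs whose sequence is >= min_len bp."""
    header, seq = None, []
    for line in list(lines) + [">"]:  # trailing ">" sentinel flushes the last record
        line = line.strip()
        if line.startswith(">"):
            if header is not None and len("".join(seq)) >= min_len:
                yield header, "".join(seq)
            header, seq = line, []
        elif line:
            seq.append(line)

if __name__ == "__main__":
    import sys
    # usage: python filter_fasta.py scaffolds.fasta > filtered.fasta
    with open(sys.argv[1]) as fh:
        for header, sequence in filter_fasta(fh):
            print(header)
            print(sequence)
```

After filtering, rerun `seqkit stats` on the result to compare num_seqs and N50 against the Velvet assembly.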
You could try the --careful option when running SPAdes; it may reduce the number of mismatches and short indels.