SPAdes did not assemble the genome completely
3.4 years ago
Flexogore ▴ 10

Hi everyone.

I am trying to assemble a SARS-CoV-2 genome from paired-end (forward and reverse) FASTQ reads. I used SPAdes, and the best result I have obtained so far is a FASTA file containing 38 scaffolds. What should I do to get a single, complete sequence?

SPAdes FASTQ genome assembly FASTA • 3.2k views

It is possible that you simply have far too much data for such a small genome (SARS-CoV-2 is only about 30 kb). You can normalize or downsample your data and try again, for example with bbnorm.sh from the BBMap suite. Since so many SARS-CoV-2 genomes are available, you may simply want to align your reads to a reference instead of doing a de novo assembly.
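To make the downsampling idea concrete, here is a minimal sketch in plain Python. Note that it does plain random subsampling of read pairs, which is not the same thing as the k-mer depth normalization that bbnorm.sh performs. It assumes uncompressed FASTQ files with exactly four lines per record and mates in the same order in both files; the function names and the `fraction` and `seed` parameters are made up for illustration.

```python
import random
from itertools import islice


def fastq_records(handle):
    """Yield FASTQ records as lists of 4 lines (header, sequence, '+', qualities)."""
    while True:
        record = list(islice(handle, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def downsample_pairs(fwd_in, rev_in, fwd_out, rev_out, fraction, seed=0):
    """Keep each read pair with probability `fraction`; mates stay in sync.

    Returns (pairs_seen, pairs_kept).
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)  # fixed seed makes the subsample reproducible
    seen = kept = 0
    with open(fwd_in) as f1, open(rev_in) as f2, \
         open(fwd_out, "w") as o1, open(rev_out, "w") as o2:
        # One random draw per pair, so both mates are kept or dropped together.
        for rec1, rec2 in zip(fastq_records(f1), fastq_records(f2)):
            seen += 1
            if rng.random() < fraction:
                o1.writelines(rec1)
                o2.writelines(rec2)
                kept += 1
    return seen, kept
```

The important design point is that a single random decision is made per pair rather than per file, because SPAdes expects the forward and reverse files to stay in matching order.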


Add some long-read sequencing data and perform a hybrid assembly, or align your reads to a reference sequence and do a reference-guided assembly.

You are unlikely to ever achieve a complete genome with short reads no matter what assembler you use.

3.4 years ago
Mensur Dlakic ★ 28k

If your depth of coverage is 1000x or more, it is almost guaranteed that non-random sequencing errors are causing the fragmentation in your assembly. As GenoMax suggested, the way around that is to error-correct the data and downsample to something like 50-100x. I know that throwing away data sounds like a no-no, but it works.

If the total number of reads is below 30-40 million, you may want to try a true overlap assembler such as MIRA. In that case you would not need to error-correct the reads because MIRA will do it for you, but it may still be helpful to downsample them. I could possibly give you better advice if you tell us the average sequence coverage in your assembly.
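As a rough illustration of the arithmetic, mean coverage for paired-end data is about 2 x (number of pairs) x (read length) / (genome size), and the fraction of reads to keep is the target coverage divided by the current coverage. The sketch below assumes a genome size of 29,903 bp (the length of the Wuhan-Hu-1 reference, NC_045512.2); the read count and read length in the example are hypothetical, so plug in your own numbers.

```python
def mean_coverage(n_pairs, read_length, genome_size):
    """Estimated mean depth for paired-end reads of uniform length."""
    if n_pairs <= 0 or read_length <= 0 or genome_size <= 0:
        raise ValueError("all inputs must be positive")
    return 2 * n_pairs * read_length / genome_size


def keep_fraction(current_coverage, target_coverage):
    """Fraction of read pairs to keep to reach the target depth (capped at 1.0)."""
    if current_coverage <= 0 or target_coverage <= 0:
        raise ValueError("coverages must be positive")
    return min(1.0, target_coverage / current_coverage)


# Hypothetical run: 1 million pairs of 150 bp reads on a 29,903 bp genome.
depth = mean_coverage(1_000_000, 150, 29_903)  # roughly 10,000x
fraction = keep_fraction(depth, 100)           # about 1% of the pairs
print(f"depth ~{depth:.0f}x, keep {fraction:.2%} of pairs for 100x")
```

This is only an estimate: it ignores adapter trimming, host reads, and uneven coverage (amplicon protocols in particular), so the depth reported after mapping to a reference may differ considerably.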

http://mira-assembler.sourceforge.net/docs/DefinitiveGuideToMIRA.html


Interesting to know that MIRA is still a valid option! In addition to error correction and trying other assemblers, OP could scaffold their contigs with a tool such as RagTag, which would use a SARS-CoV-2 reference sequence as the basis for scaffolding. I'm not too sure MIRA would be able to surpass SPAdes in terms of assembly quality though, especially when only Illumina reads are being used for the assembly!


"I'm not too sure MIRA would be able to surpass SPAdes in terms of assembly quality though, especially when only Illumina reads are being used for the assembly!"

I obtained a considerably better metagenome assembly with MIRA from Illumina data downsampled to 60x than with SPAdes from the full dataset. Keep in mind that here we have a single-genome assembly, and I think MIRA would work at least as well, if not better. The only catch is that MIRA is very slow and memory-hungry, so it isn't an option for large datasets.

3.1 years ago
anton ▴ 70

The answer is simple: coronaSPAdes, which is part of the SPAdes 3.15 release series.


Thank you for making us aware of coronaSPAdes. Does one need to downsample the data or will the program handle an excess of coverage internally?
