High coverage bacterial genome reads causing spurious assemblies
8.7 years ago
b2060780 ▴ 10

Hi,

I'm using SPAdes to assemble bacterial genomes sequenced on Illumina platforms.

I have a few files that I'm struggling to assemble due to the very high coverage of my raw reads. Anything over 250x seems to cause spurious assemblies, with the final FASTA being three times the expected size.

At first I thought it could be contamination, but annotation shows only target species genes present. I've heard SPAdes can struggle with particularly high coverage files - so what can I do to get them assembled? One file is ~500x coverage...

Thanks,

assembly illumina spades
8.6 years ago
Rohit ★ 1.5k

De Bruijn graph assemblers usually work best at around 60-80x coverage (perhaps up to 100x); beyond that, the problem of spurious contigs appears. As suggested by others, do a normalisation step. BBNorm from the BBMap package is a very good normalisation tool that can get rid of low-coverage regions and normalise highly covered regions down to the expected coverage. It also has a nice built-in pre-filtering step for sensitivity, and you can choose the k-mer size if required. It can be used as follows -

bbnorm.sh in=input.fastq out=output.fastq target=80 mindepth=10 -Xmx200g threads=28 prefilter=t
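
For paired-end libraries, the same normalisation can be run on both files together so that pairs stay in sync; a minimal sketch, assuming the standard BBTools in2=/out2= paired-file options (file names, memory and thread settings are placeholders to adjust for your machine):

bbnorm.sh in=reads_R1.fastq in2=reads_R2.fastq out=norm_R1.fastq out2=norm_R2.fastq target=80 mindepth=10 prefilter=t threads=16 -Xmx60g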

8.7 years ago

It's also possible that you have contamination, which will bloat the assembly. SPAdes may generate a somewhat inferior assembly due to high coverage, but 3x the expected size due to 500x coverage would be extremely unusual in my experience - 5% too big would be more what I'd expect. It is designed to deal with super-high coverage, after all (though I still find normalization often improves its output). So, please BLAST your assembled contigs against a large database to make sure you are hitting what you expect. You can also analyze the k-mer frequency distribution, or make a contig-length versus coverage plot, a coverage versus GC% plot, or just a GC% plot, to spot probable contamination.
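
A minimal sketch of the suggested BLAST check, assuming BLAST+ is installed and a local nt database is available (file and database paths are placeholders):

blastn -query contigs.fasta -db nt -outfmt 6 -evalue 1e-20 -max_target_seqs 5 -num_threads 8 -out contigs_vs_nt.tsv

Hits to unexpected taxa spread across many contigs would point to contamination rather than an assembler artefact.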


Brian, would you know whether anyone has evaluated whether bbnorm/khmer or other normalization techniques cause mis-assemblies? Titus Brown mentioned a few years ago that khmer may not be the best for highly repetitive genomes (e.g. plant, but some bacteria fall into this category too), and I can see that reducing repetitive sequence could remove formerly ambiguous points in the assembly graph and potentially lead to mis-joins.

8.7 years ago

You can either subsample your reads (e.g. with the seqtk sample command) or normalize them (digital normalization), and pick whichever yields the better assembly for you. If you search Biostars, many similar questions have already been answered (I just tried "high coverage assembly" as keywords and found some very nice posts about the topic).
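
A minimal subsampling sketch with seqtk, assuming paired-end FASTQ files (file names and the sampling fraction are placeholders; the same seed must be used for both files so that pairs stay matched). A fraction of 0.2 would bring ~500x down to ~100x:

seqtk sample -s100 reads_R1.fastq 0.2 > sub_R1.fastq
seqtk sample -s100 reads_R2.fastq 0.2 > sub_R2.fastq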

There is no need to use extremely high coverage data _just_ because that's what you have.

8.6 years ago
MathGon ▴ 10

If your genome contains repeated sequences (CRISPRs, ISs...), you will obtain contigs with an excess of coverage. You can estimate the copy number of a repeat by comparing its contig's coverage with the median sequencing coverage of your larger contigs.
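
A rough sketch of that estimate, assuming SPAdes-style contig headers of the form >NODE_1_length_12345_cov_67.8 (the awk field positions below rely on that naming):

grep '^>' contigs.fasta | awk -F_ '{print $2"\t"$4"\t"$6}' | sort -k3,3gr | head

This lists contig ID, length and k-mer coverage, highest coverage first; dividing a contig's coverage by the median coverage of your long contigs gives an approximate copy number.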

SPAdes also produces a graph file (*.fastg). You can open it with Bandage to visualise the links between your contigs.
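
Bandage also has a command-line mode that can render the graph straight to an image; a minimal sketch (file names are placeholders):

Bandage image assembly_graph.fastg assembly_graph.png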

8.6 years ago
Shyam ▴ 150

You can try digital normalization to remove the excess coverage, using a program like khmer (see http://khmer.readthedocs.org/en/v1.1/guide.html). You could also try the Platanus assembler; its manual says it works best with >80x coverage, but it also says mate-pair data are needed for a better assembly. You can try it and see if it helps with your data.
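
A minimal digital-normalization sketch with khmer's normalize-by-median.py, assuming an interleaved paired-end FASTQ; exact options differ between khmer versions, so check the docs for yours (file names and the k/coverage values are placeholders):

normalize-by-median.py -k 20 -C 20 -p reads.interleaved.fastq

By default the retained reads should be written next to the input as a .keep file, which can then be fed to SPAdes.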
