Question

understanding contig statistics

0

Entering edit mode

8.1 years ago

mary ▴ 10

I am trying to assemble a bacterial isolate genome from illumina miseq reads. the propose is to go forward for closing the complete genome and use it for comparative genomics (for now, I just have draft genome in 100 contigs). I have gone trough each step of read quality trimming, de novo assembly and scaffolding and then mapped the reads to these scaffolds. then I checked the completeness and contamination of the genome.

I have many problems understanding the basics. anybody can help me through these questions?

1- I get around 115 contigs larger than 500bp, each one with a different coverage, ranging from 4 to 180! basically I know that low coverage means it is not reliable or high coverage ones might be of plasmid origin. but how much coverage is the minimum acceptable or how much coverage is considered as high in my case?

p.s: I have checked the contigs against plasmid databases but no plasmid gene or plasmid ori seqs was detected. same for bacteriophages

2-when I assemble using Spades, a coverage is reported for each contig. when I map reads back to these contigs using qualimap, another coverage is reported by qualimap which is very different from Spades report. e. g. for contig number 99 (the second one in linked image), spades reports a 27x coverage where qualimap reports 75! what's the reason they are so different and which one should I rely on?

3- when I check the mapped contigs, in most cases, both ends of each contig has much lower coverage than the mid part and sometimes it is just mapped by one or few reads. are these contig ends reliable or should I omit (trim) these ends?

coverage, standard deviation and length of contigs as reported by Spades and qualimap

thank you in advance

Assembly coverage spades qualimap mapping reads • 7.9k views

ADD COMMENT • link updated 8.1 years ago by Brian Bushnell 20k • written 8.1 years ago by mary ▴ 10

score 5 · Answer 1 · 2017-02-19

Mapping and assembly give different values for coverage. Generally, mapping is a more accurate way of evaluating coverage. The coverage reported by Spades is probably the kmer coverage, which is both lower than the read coverage, and also calculated only from perfect matches.

I do not recommend trimming the ends of contigs; there's generally no reason to do so and it will just make your assembly more incomplete. The reason ends have low coverage is probably because the coverage became too low there to continue assembly - e.g., once coverage drops to zero, you can't assemble any more.

There's no specific level of coverage that is "acceptable" in general, but 4 is too low for a good assembly. And for a microbe, max coverage of 180 is also pretty low; I try to target at least 100x average (that's 100x read coverage, not kmer coverage; I'm not sure which one you were describing). Low coverage gives you more assembly errors and lower contiguity.

You are almost certainly not going to create a closed bacterial assembly with a single relatively low-coverage Illumina library; if you want to do so, it's best to run a single PacBio smart cell. You may be able to improve your current assembly - I think we typically end up with Spades/Illumina bacterial assemblies under 50 contigs - but especially with such high coverage variability, it will be difficult, and it also depends on the repeat and GC content of the organism.

Often, you can improve assemblies by error-correcting and merging paired reads. For example, with the BBMap package, I might preprocess and assemble reads like this:

#Trim adapters
bbduk.sh in=r1.fq in2=r2.fq out=trimmed.fq ktrim=r k=23 mink=11 hdist=1 ref=/bbmap/resources/adapters.fa tbo tpe

#Remove contaminants
bbduk.sh in=trimmed.fq out=filtered.fq k=31 ref=/bbmap/resources/sequencing_artifacts.fa.gz,/bbmap/resources/phix174_ill.ref.fa.gz

#Error-correct 1
bbmerge.sh in=filtered.fq out=ecco.fq ecco mix vstrict adapters=default

#Error-correct 2
clumpify.sh in=ecco.fq out=eccc.fq ecc passes=6

#Error-correct 3
tadpole.sh in=eccc.fq out=ecct.fq ecc k=62

#Merge
bbmerge.sh in=eccc.fq out=merged.fq outu=unmerged.fq rem k=62 extend2=50 adapters=default

#Assemble
spades.py -k 21,41,71,101,127 -o out -12 unmerged.fq -s merged.fq

#Evaluate
stats.sh in=/out/scaffolds.fasta

#Quantify coverage
bbmap.sh ref=/out/scaffolds.fasta in=trimmed.fq out=mapped.sam covstats=covstats.txt