Average coverage of de novo SPAdes assembly deviates a lot from theretical coverage
2
0
Entering edit mode
9.2 years ago
ALchEmiXt ★ 1.9k

Usually when we assemble PE (100, 150, 250, 300) fastq Illumina data we see that roughly the expected theoretical coverage is close to the actual coverage over de novo assembled contigs using for instance SPAdes.

However I currently have an older dataset PE 76->51 bases with a theoretical coverage of over 200 fold. However when I inspect the assembly the average coverage is only ~5 fold....

I checked;

  1. reads are of top quality (1.9 encoding) all far above Q36 (according to fastQC)
  2. used custom kmer settings for SPAdes in the range 21 -> 49 in PE mode
  3. SUM length of assembled contigs is approximately the expected genome size ~900k
  4. There are some repeats with high (but not extreme high) coverage.

For a contaminant I would expect to find many more small low coverage contigs.... as we usually for other species do...

Is the high AT% of this species disregulating SPAdes?

Any pointers where to look at are welcome!

Assembly SPAdes de-novo bacterial-species • 7.8k views
ADD COMMENT
0
Entering edit mode

What does the k-mer spectra look like? Do you see an extremely high peak at low coverage?

ADD REPLY
0
Entering edit mode

Is your data set heterozygous/from pooled samples...? And how high is AT rich?

ADD REPLY
2
Entering edit mode
9.2 years ago
ALchEmiXt ★ 1.9k

To conclude (for those searching for similar questions/answers):

1) Indeed the k-mer coverage is reported. Please give points to the respective comments.

2) Indeed the higher k-mers did not significantly improve the assembly. If contig length is a parameter for quality the latest SPAdes 4.6.1 did a better job then 4.6.0.

3) The AT richness (70%) did not influence this.

Thanks for solving this basic question.

ADD COMMENT
1
Entering edit mode

Are you sure about SPAdes 4.6.2 ? According to their website as it opens up in my browser, their latest version is 3.6.1? Thanks for reporting back, will be very helpful to others!

ADD REPLY
0
Entering edit mode

oepsie. Corrected to 3.6.1

ADD REPLY
0
Entering edit mode

Please note: This answer is just to have the question answered. Not to score points. Please give points to the respective comments pointing out the summarised answer.

ADD REPLY
2
Entering edit mode
9.2 years ago
thackl ★ 3.0k

My guess, it is related to your very short reads and the fact that SPAdes runs with increasing kmer size. Coverage is probably estimated in the final iteration using k55 or larger and your reads each only contain a few good 55-mers.

EDIT: Just saw that you used 49mers - then this answer probably does not make sense..

ADD COMMENT
5
Entering edit mode

This answer does make sense. I think the OP is confusing Nucleotide Coverage, which computes average coverage for each base pair with k-mer coverage. The latter is what is reported by SPAdes based on the last k chosen, in this case 49. k-mer coverage is always smaller than nucleotide coverage. This is from Velvet, but it also applies to SPAdes:

All coverage values in Velvet are provided in k-mer coverage, i.e. how many times has a k-mer been seen among the reads. The relation between k-mer coverage Ck and standard (nucleotide-wise) coverage C is Ck = C * (L - k + 1) / L where k is your hash length, and L you read length.

ADD REPLY
1
Entering edit mode

Yeah, I had the same formula in mind. However, with 76 bp reads and 49-mers, you still would expect a lot more then 5X, even if you account for a large portion of reads having bad tails... It probably is contributing factor though. A good way to check probably would be to compute a kmer-profile with 49mers..

read length 76: 200 * (76-49+1) / 76 = 74
read length 55: 200 * (55-49+1) / 55 = 25
ADD REPLY
0
Entering edit mode

Good point. I should have thought about that. It would fit for these cases since it would give me roughly 5 fold coverage x 49 kmer ~ 250x... the expected coverage. Or would that be the wrong way to calculate it? My trimmed corrected reads in SPAdes are mostly 50 bases and my highest k-mer 49.

Strange I didn't notice this with other datasets before. Maybe because coverage was generally very high so this was not a critical point in absolute measures.

Is there anywhere a description of this value in the header? I searched the website and looked through the original paper: http://online.liebertpub.com/doi/pdf/10.1089/cmb.2012.0021 but no luck

ADD REPLY
0
Entering edit mode

If you average is 50 bp after trimming, that means that a significant portion of your reads would be discarded when k=49, as they would be smaller than 49bp. I recommend a smaller final k-mer, perhaps 35 or 41. You will surely notice higher k-mer coverage values.

ADD REPLY
0
Entering edit mode

I can bench mark this. But why would that happen since 49 is still lower than 50-51. Most likely a matter of probability?

Anyway my reads are all 51 after adapter trimming. After read correction in SPAdes effectively I see many truncated at the first base. The majority (read almost all) reads are 50 bases or longer.

ADD REPLY
2
Entering edit mode

The header are formatted analogous to Velvet.

And if in fact the majority of your reads is at ~50bp, than the kmer-vs-read coverage would account for the 5X.

Have a look at spades/K49/final_contigs.fasta and spades/<second to last K>/final_contigs.fasta. I'm assuming that the final iteration didn't really improve the assembly, given the low coverage connections in the underlying graph.

ADD REPLY

Login before adding your answer.

Traffic: 1552 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6