Since you have tried multiple assemblers and the results are bad (what criteria were used to reach that conclusion?), it is possible that the library you are using does not have the entire genome represented (I can't see the image you included in the OP).
As Brian Bushnell suggested, I would try SPAdes. The advantage of SPAdes over most other assemblers in this situation is its ability to handle "wide" k-mer distributions such as the one you have. Although SPAdes assumes that a wide k-mer distribution is the result of amplification bias from the Multiple Displacement Amplification (MDA) technology used in single-cell sequencing, my guess is that the algorithmic approach to dealing with this coverage bias should not differ much between single-cell data and your case (not 100% sure, though).
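For example (a sketch only; file names and k-mer sizes are placeholders, not from this thread), single-cell mode is enabled with the --sc flag:
spades.py --sc -1 Samp_R1.fq.gz -2 Samp_R2.fq.gz -k 21,33,55,77 -o spades_sc_out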
I fear that this is not a very good/representative library to begin with, and it may be the end of the road as far as bioinformatics tools are concerned. This library may have to be remade and sequenced afresh.
Oh, I missed that, my bad! It's likely that you are correct about bad input data. But just for the sake of further possible alternatives, in case the OP decides to explore the current data more:
There are many genome projects, e.g., of plants or fish, that struggle with assembly contiguity (N50 < 10 kbp). Usually this is due to very heterozygous genomes and/or a complex repeat structure. To get a further sense of this, it would be interesting to know something about the nature of the organism being sequenced (unless that is confidential), and to see statistics such as N50 and total assembly length from the different assemblers. For example, if the total assembly length is much longer than expected, it could be because of high heterozygosity.
Again, as the k-mer plots do not show a tendency toward two or more clear peaks (as is common for diploids with a high heterozygosity rate), this is not likely to be the cause. Still, it could be worth keeping these possibilities in mind.
We don't know much about it, but we assume it is a diploid organism.
We suspect the problems arise either from whole-genome amplification (we didn't have a large amount of starting DNA) or maybe from an unusual genome structure (repeats, inversions, etc.).
As I said in one of my posts, the expected genome size is 300 Mb, but the assembly results vary too much (from 250 Mb to 400 Mb) and the N50 is always < 5 kb (after the scaffolding step).
Have you tried a smaller or larger k-mer size?
The k-mer profile was made with KmerGenie. I've only uploaded the k=51 plot, but the pattern is the same with lower or higher k-mer sizes (from 21 to > 100).
Assemblies have been done with ALLPATHS-LG, SOAPdenovo, ABySS and SPAdes with different sets of k-mers (same bad result).
What is the expected genome size and how much gross coverage do you have in your data?
I agree with genomax2: maybe your coverage is too low.
The expected genome size is 300 Mb.
The coverage should be more than enough:
From raw data: 150X
After all the filtering I mentioned: 120X
I don't think the coverage is the problem.
Having high coverage can also cause issues with assemblies; see these threads:
Very deep coverage data
de novo sequence assembly with extremely high coverage
Yes, I thought about that too. I made a random subsample to get 50X in the end.
I haven't run an assembly with it yet, but I've plotted the k-mer spectrum and I get the same distribution, which makes me think the assembly will still not be good.
If your library has not sampled the genome reasonably uniformly your hunch may be correct.
Yes, looking at the data, it appears you have a biased coverage distribution. You might try normalizing and error-correcting the data, which often improves the assembly when you have highly biased coverage. Using the BBMap package, and assuming the reads are paired and interleaved in a single file:
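(The original commands aren't shown here; a minimal sketch with standard BBMap tools and placeholder file names would be something like:)
# error-correct the interleaved reads with Tadpole (ecc = mode=correct)
tadpole.sh in=Samp.fq.gz out=Samp_ecc.fq.gz ecc
# then normalize to ~100x depth with BBNorm
bbnorm.sh in=Samp_ecc.fq.gz out=Samp_norm.fq.gz target=100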
Incidentally, overly-strict quality filtering/trimming can damage the assembly and bias the coverage. A loss of 20% of your data is pretty large. What settings did you use?
Interleaving the R1/R2 reads can easily be done with BBMap:
reformat.sh in1=Samp_R1.fq.gz in2=Samp_R2.fq.gz out=Samp.fq.gz
Thanks for the advice.
1) My reads are paired-end (not interleaved); can it be done with that kind of data, or do I have to interleave my PE reads first?
2) I am not sure I understand why normalizing reads can improve an assembly?
3) I didn't lose too many reads during quality filtering (using Trimmomatic with min length = 90 and mean quality = 15 over a 4-base sliding window; see the command sketch below).
However, I got a lot of duplicates (around 30%) and some contaminants too; this is where I lost data, but I still have plenty of coverage.
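(For reference, those settings correspond roughly to a Trimmomatic call like the one below; file names and the jar path are placeholders:)
java -jar trimmomatic.jar PE -phred33 Samp_R1.fq.gz Samp_R2.fq.gz Samp_R1_paired.fq.gz Samp_R1_unpaired.fq.gz Samp_R2_paired.fq.gz Samp_R2_unpaired.fq.gz SLIDINGWINDOW:4:15 MINLEN:90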
1) You can use Genomax's command to interleave your paired reads first.
2) Many assemblers have problems with uneven coverage due to assumptions in the code. The only way to know for certain is to try assembling the reads with and without normalization, to see which produces a better assembly.
3) That's probably OK, then. Duplicate removal is not a normal part of an assembly workflow, though; it can make error correction less accurate, so I don't usually recommend it. Given that you had a lot of duplicates and uneven coverage, it looks like your data may have been over-amplified.
1) So if I understand correctly, tadpole.sh is used for error correction? Because based on the manual and your command, I don't see
mode=correct
anywhere. The manual says:
2) Actually, I have one overlapping paired-end library. What about the other, non-overlapping libraries?
3) Also, do you recommend any particular assembler after normalization?
@Brian is the author of BBMap :)
Tadpole is doing correction (ecc).
About the command Brian gave: it looks like it performs an assembly instead?
Ok, the ecc option performs error correction.
Yes, "ecc" is short for "mode=correct".
Actually, my commands were under the assumption that all of your reads were in a single library. Normalization is more complicated when you have reads in multiple libraries. How many files do you have, of what type, and how big are they?
As for assemblers, I think AllPaths and Spades are both worth trying. Spades may fail, though; it's not really designed for such big organisms.
The majority of my coverage (70X) comes from the overlapped library, so it's worth trying.
However, ALLPATHS needs a mate-pair library (which I have) for scaffolding. Should I mix my overlapped (and normalized) reads with the mate-pair reads?
No, I recommend just normalizing the overlapped library. Don't bother with running BBMerge or BBNorm on the mate pair library.
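(As a rough sketch with placeholder file names: only the high-coverage fragment library goes through normalization:)
bbnorm.sh in=fragment_lib.fq.gz out=fragment_lib_norm.fq.gz target=100
# the mate-pair library is left untouched and given to the assembler for scaffolding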
I mean, assembling normalized reads together with "normal" (un-normalized) reads will not cause any problems?
No, that won't cause any problems.
Thanks for your support, Brian; it looks a bit better, but still weird.
http://hpics.li/8db6ecf
Do you have any recommendations on parameters I can play with to maximize the peak?
Remember, the point is to get a better assembly, not a better-looking kmer distribution, so I suggest trying assembly after each step to see if it improved or got worse. It would be helpful if you posted the stderr (screen output) from both Tadpole and BBNorm. It looks like a lot of data was removed...
A good k-mer distribution is linked to a good assembly, no?
Tadpole log:
BBNorm log:
Hi Brian,
Can you explain why normalization is more complicated when the reads are in multiple libraries (and thus have different insert sizes)?
This kind of depends on your goal of normalization. If you want a 100x depth target for all libraries combined, you'd need to normalize all of them together (basically concatenate them all, normalize, then demultiplex them by name).
Normalization works best when it takes place in the context of all data simultaneously. If you try to hit 100x by normalizing two libraries independently to 50x each, you will not hit the target. Say one library is 20x in a region and the other is 500x. The result of combined normalization would be 100x, but independent normalizations would yield 20x + 50x = 70x, below the target. Furthermore, the combined normalization would give a 20:500 ratio of the individual libraries, so you'd only get a handful of reads from the lower-coverage library, which is probably not ideal.
Generally, if you have high coverage in a short fragment library and low coverage in a long-mate-pair library (or something similar), and you need to do normalization, I would suggest normalizing the high-coverage library alone and not doing anything to the lower-coverage long-mate-pair library.
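(A rough sketch of the combined approach, with placeholder file names; splitting back by read name could be done with something like filterbyname.sh from the same package:)
cat lib1.fq.gz lib2.fq.gz > combined.fq.gz
bbnorm.sh in=combined.fq.gz out=combined_norm.fq.gz target=100
# then demultiplex combined_norm.fq.gz back into per-library files by read name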
Hi Brian, can you explain how you see a biased coverage distribution when you look at my data?
The coverage peaks at 1 and then kind of monotonically declines. That kind of pattern indicates a highly biased coverage distribution that you see in situations like metagenomes (which have an exponential coverage distribution) and highly amplified single-cell libraries (which also have an exponential coverage distribution). Isolates with uniform coverage will have clear peaks centered at the average coverage and multiples of the average coverage.
Some insects host bacteria in places other than their gut. You may want to BLAST your assembled contigs to see if they hit bacteria; they would have a very different coverage distribution than the main genome and might be interfering with the kmer frequency histogram.
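(A sketch of such a check, assuming BLAST+ with a local nt database and placeholder file names:)
blastn -query contigs.fasta -db nt -outfmt 6 -evalue 1e-10 -max_target_seqs 5 -out contigs_vs_nt.tsv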
We didn't have much starting DNA, and indeed we did whole-genome amplification by PCR.
1) Is it possible that the PCR introduced too many errors (which were then sequenced), and that this is why we have difficulty getting a high N50?
2) Unfortunately, normalization didn't improve the N50. I am looking for other ideas; if you have some, that would be helpful.
I am frustrated because the completeness (evaluated by CEGMA and BUSCO) is not that bad (85%).
Thanks Brian.
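(For reference, with a recent BUSCO release the completeness check can be run with something like the following; the insect lineage dataset and file names are guesses, not from this thread:)
busco -i assembly.fasta -l insecta_odb10 -m genome -o busco_insect -c 8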
Gosh, I'd say 85% is terrible... but then, I've never sequenced an insect, so I don't really know. It seems to me that your amplification caused problems and you should probably try again with more DNA and (ideally) zero amplification. I would always recommend zero amplification when possible for any project, but especially assembly.
What do you mean by: "Since you have tried multiple assemblers and the results are bad (what criteria were used to reach that conclusion?), it is possible that the library you are using does not have the entire genome represented"?
The conclusion that the results are bad was based on the N50 (which doesn't even reach 10 kb) and on the CEGMA evaluation.