Any advice for a de novo genome assembly
8.0 years ago
Picasa ▴ 650

Hi,

I am stuck with a de novo genome assembly. I have done a lot of QC filtering:

1) quality and adapter trimming

2) removing contamination

3) removing duplication

I always end up with that kind of k-mer spectrum (and a bad N50):

http://hpics.li/d6e0919

Based on this paper:

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2784323/

I know that I should end up with a mono- or multimodal distribution, which is not the case.

If you guys have any ideas, that would be helpful.

Thanks

k-mer de novo genome assembly

Have you tried a smaller or larger k-mer size?


The k-mer profile was made with KmerGenie. I've only uploaded the k=51 plot, but the pattern is the same with lower or higher k-mer sizes (from 21 to > 100).
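For reference, a raw k-mer histogram can also be generated as a cross-check of the KmerGenie profile, for example with BBMap's kmercountexact.sh; a minimal sketch (the file names are placeholders, not from this thread):

kmercountexact.sh in=reads.fq.gz khist=khist_k51.txt k=51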

Assemblies have been done with ALLPATHS-LG, SOAPdenovo, ABySS and SPAdes with different sets of k-mers (same bad result).


What is the expected genome size and how much gross coverage do you have in your data?


I agree with genomax2: maybe your coverage is too low.


The expected genome size is 300 Mb.

The coverage should be more than enough:

From raw data: 150X

After all the filtering I mentioned: 120X

I don't think the coverage is the problem.


Having high coverage can also cause issues with assemblies: Very deep coverage data
de novo sequence assembly with extremely high coverage


Yes, I thought about that too. I made a random subsample down to 50X.

I haven't run an assembly with it yet, but I've plotted the k-mer spectrum and I get the same distribution, which makes me think the assembly will still not be good.
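For reference, that kind of random subsampling can be done with BBMap's reformat.sh; a minimal sketch (file names are placeholders, and the rate assumes roughly 120X input and a 50X target):

reformat.sh in=reads.fq.gz out=subsampled.fq.gz samplerate=0.42 sampleseed=7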


If your library has not sampled the genome reasonably uniformly your hunch may be correct.


Yes, looking at the data, it appears you have a biased coverage distribution. You might try normalizing and error-correcting the data, which often improve the assembly when you have highly biased coverage. Using the BBMap package, and assuming the reads are paired and interleaved in a single file:

bbmerge.sh in=reads.fq.gz out=ecct.fq.gz ecct strict mix
tadpole.sh in=ecct.fq.gz out=corrected.fq.gz ecc prealloc prefilter tossjunk
bbnorm.sh in=corrected.fq.gz out=normalized.fq.gz target=100 min=2

Incidentally, overly-strict quality filtering/trimming can damage the assembly and bias the coverage. A loss of 20% of your data is pretty large. What settings did you use?


Interleaving R1/R2 reads can be easily done with BBMap:

reformat.sh in1=Samp_R1.fq.gz in2=Samp_R2.fq.gz out=Samp.fq.gz


Thanks for the advice.

1) My reads are paired-end (not interleaved); can it be done with that kind of data, or do I have to "interleave" my PE reads?

2) I am not sure I understand why normalizing reads can improve an assembly?

3) I didn't lose too many reads during quality filtering (using Trimmomatic, minimum length = 90 and mean quality = 15 over a 4-base sliding window; a command sketch follows below).

However, I got a lot of duplicates (around 30%) and some contaminants too; this is where I lost data, but I still have plenty of coverage.
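A hypothetical reconstruction of those Trimmomatic settings (not the exact command used; file and adapter names are placeholders):

java -jar trimmomatic-0.36.jar PE -phred33 Samp_R1.fq.gz Samp_R2.fq.gz \
  Samp_R1.paired.fq.gz Samp_R1.unpaired.fq.gz Samp_R2.paired.fq.gz Samp_R2.unpaired.fq.gz \
  ILLUMINACLIP:adapters.fa:2:30:10 SLIDINGWINDOW:4:15 MINLEN:90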


1) You can use Genomax's command to interleave your paired reads first.

2) Many assemblers have problems with uneven coverage due to assumptions in the code. The only way to know for certain is to try assembling the reads with and without normalization, to see which produces a better assembly (see the comparison sketch after this list).

3) That's probably OK, then. Duplicate removal is not a normal part of assembly, though; it can make error correction less accurate, so I don't usually recommend it. Given that you had a lot of duplicates and uneven coverage, it looks like your data may have been over-amplified.
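One way to make that with/without comparison concrete is to run both assemblies through QUAST and compare the summary metrics; a minimal sketch (assembly file names are placeholders):

quast.py asm_without_norm.fasta asm_with_norm.fasta -o quast_comparison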


1) So if I understand correctly, tadpole.sh is used for error correction? I ask because, based on the manual and your command, I don't see mode=correct.

The manual says:

Usage:
Assembly:     tadpole.sh in=<reads> out=<contigs>
Extension:    tadpole.sh in=<reads> out=<extended> mode=extend
Correction:   tadpole.sh in=<reads> out=<corrected> mode=correct

2) Actually, I have one overlapping paired-end library. What about the other, non-overlapping libraries?

3) Also, do you recommend any particular assembler after normalization?


@Brian is the author of BBMap :)

Tadpole is doing correction (ecc).


About the command Brian gave:

tadpole.sh in=ecct.fq.gz out=corrected.fq.gz ecc prealloc prefilter tossjunk

It looks like it performs assembly instead?

OK, the ecc option performs error correction.


Yes, "ecc" is short for "mode=correct".

Actually, my commands were under the assumption that all of your reads were in a single library. Normalization is more complicated when you have reads in multiple libraries. How many files do you have, of what type, and how big are they?

As for assemblers, I think ALLPATHS-LG and SPAdes are both worth trying. SPAdes may fail, though; it's not really designed for such large genomes.


The majority of my coverage (70X) comes from the overlapped library, so it's worth trying.

However, ALLPATHS needs a mate-pair library (which I have) for scaffolding. Should I mix my overlapped (and normalized) reads with the mate pairs?


No, I recommend just normalizing the overlapped library. Don't bother with running BBMerge or BBNorm on the mate pair library.


I mean, assembling normalized reads together with non-normalized reads will not cause any problems?


No, that won't cause any problems.


Thanks, Brian, for your support. It looks a bit better, but still weird.

http://hpics.li/8db6ecf

Do you have any recommendations on parameters I can play with to maximise the peak?


Remember, the point is to get a better assembly, not a better-looking kmer distribution, so I suggest trying assembly after each step to see if it improved or got worse. It would be helpful if you posted the stderr (screen output) from both Tadpole and BBNorm. It looks like a lot of data was removed...


A good k-mer distribution is linked to a good assembly, no?

Tadpole log:

 java -ea -Xmx200g -Xms200g -cp BBMap/36.59/bin/current/ assemble.Tadpole in=merged.180.pair12.fastq out=merged.corrected.180.pair12.fastq ecc prealloc prefilter tossjunk -Xmx200g
Executing assemble.Tadpole1 [in=merged.180.pair12.fastq, out=merged.corrected.180.pair12.fastq, ecc, prealloc, prefilter, tossjunk, -Xmx200g]

Switching to correct mode because ecc=t.
Using 64 threads.
Executing kmer.KmerTableSet [in=merged.180.pair12.fastq, out=merged.corrected.180.pair12.fastq, ecc, prealloc, prefilter, tossjunk, -Xmx200g]

Initial size set to 64677143
Initial:
Ways=163, initialSize=64677143, prefilter=t, prealloc=1.0
Memory: max=205800m, free=201505m, used=4295m

Initialization Time:        0.086 seconds.

Loading kmers.

Made prefilter:     hashes = 2       mem = 31.39 GB     cells = 134.81B     used = 0.828%
Estimated valid kmers:      265629490
Prefilter time: 542.296 seconds.
After prefilter:
Memory: max=206296m, free=162049m, used=44247m

Estimated kmer capacity:    9488136934
After table allocation:
Memory: max=206288m, free=35143m, used=171145m

After loading:
Memory: max=206288m, free=29407m, used=176881m

Input:                          118950460 reads         21333211872 bases.
Unique Kmers:                   265562409
Load Time:                      1676.167 seconds.

Reads Processed:        118m    70.97k reads/sec
Bases Processed:      21333m    12.73m bases/sec
Removed 6179 kmers.
Remove time: 17.889 seconds.

Extending/error-correcting/discarding.


After extending reads:
Memory: max=206288m, free=37161m, used=169127m

Input:                          118950460 reads         21333211872 bases.
Output:                         118950460 reads         21333211872 bases.
Errors detected:                66103204
Errors corrected:               15310556    (15310556 reassemble)
Reads with errors detected:     21779443    (18.31%)
Reads fully corrected:          11926251    (54.76% of detected)
Reads partly corrected:         1194999     (5.49% of detected)
Rollbacks:                      611653  (2.81% of detected)
Reads discarded:                3736    (0.00%)
Bases discarded:                633476  (0.00%)
Extend/error-correct time:      791.699 seconds.

Total Time:                 2467.996 seconds.

Reads Processed:        118m    48.20k reads/sec
Bases Processed:      21333m    8.64m bases/sec

BBNorm log:

java -ea -Xmx150g -Xms150g -cp BBMap/36.59/bin/current/ jgi.KmerNormalize bits=32 in=merged.corrected.180.pair12.fastq out=merged.corrected.norm.180.pair12.fastq target=100 min=2 -Xmx150g threads=15
Executing jgi.KmerNormalize [bits=32, in=merged.corrected.180.pair12.fastq, out=merged.corrected.norm.180.pair12.fastq, target=100, min=2, -Xmx150g, threads=15]

Set threads to 15

   ***********   Pass 1   **********   


Settings:
threads:            15
k:                  31
deterministic:      true
toss error reads:   false
passes:             1
bits per cell:      16
cells:              56.30B
hashes:             3
base min quality:   5
kmer min prob:      0.5

target depth:       400
min depth:          2
max depth:          500
min good kmers:     15
depth percentile:   64.8
ignore dupe kmers:  true
fix spikes:         false

Made hash table:    hashes = 3       mem = 104.83 GB    cells = 56.28B      used = 2.166%

Estimated unique kmers:         410836047

Table creation time:        978.232 seconds.
Started output threads.
Table read time:        921.927 seconds.    23139.11 kb/sec
Total reads in:         118946724   66.637% Kept
Total bases in:         21332578396 69.093% Kept
Error reads in:         2656084     2.233%
Error type 1:           13371       0.011%
Error type 2:           634088      0.533%
Error type 3:           2351229     1.977%
Total kmers counted:            17734375788
Total unique kmer count:        411065990
Includes forward kmers only.
The unique kmer estimate can be more accurate than the unique count, if the tables are very full.
The most accurate value is the greater of the two.

Percent unique:                  2.32%
Depth average:                  43.14   (unique kmers)
Depth median:                   13  (unique kmers)
Depth standard deviation:       598.18  (unique kmers)

Depth average:                  1827.17 (all kmers)
Depth median:                   116 (all kmers)
Depth standard deviation:       20091.46    (all kmers)

Approx. read depth median:      139.30

   ***********   Pass 2   **********   


Settings:
threads:            15
k:                  31
deterministic:      true
toss error reads:   false
passes:             1
bits per cell:      16
cells:              56.30B
hashes:             3
base min quality:   5
kmer min prob:      0.5

target depth:       100
min depth:          2
max depth:          100
min good kmers:     15
depth percentile:   54.0
ignore dupe kmers:  true
fix spikes:         false

Made hash table:    hashes = 3       mem = 104.83 GB    cells = 56.28B      used = 1.901%

Estimated unique kmers:         360054512

Table creation time:        683.713 seconds.
Started output threads.
Table read time:        664.810 seconds.    22170.81 kb/sec
Total reads in:         79262607    84.448% Kept
Total bases in:         14739375964     84.540% Kept
Error reads in:         564152      0.712%
Error type 1:           12055       0.015%
Error type 2:           450975      0.569%
Error type 3:           247816      0.313%
Total kmers counted:            12360843169
Total unique kmer count:        360229691
Includes forward kmers only.
The unique kmer estimate can be more accurate than the unique count, if the tables are very full.
The most accurate value is the greater of the two.

Percent unique:                  2.91%
Depth average:                  34.31   (unique kmers)
Depth median:                   17  (unique kmers)
Depth standard deviation:       58.61   (unique kmers)

Depth average:                  134.41  (all kmers)
Depth median:                   74  (all kmers)
Depth standard deviation:       461.23  (all kmers)

Approx. read depth median:      88.23

Removing temp files.

Total time:             3249.139 seconds.       11102.00 kb/sec

Hi Brian,

Can you explain why normalization is more complicated when reads are in multiple libraries (with different insert sizes)?


This kind of depends on your goal of normalization. If you want a 100x depth target for all libraries combined, you'd need to normalize all of them together (basically concatenate them all, normalize, then demultiplex them by name).
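A minimal sketch of that concatenate/normalize/demultiplex idea with BBMap tools (library file names are placeholders; read names are extracted with awk and the reads are split back out with filterbyname.sh):

cat lib1.fq.gz lib2.fq.gz > combined.fq.gz
bbnorm.sh in=combined.fq.gz out=combined_norm.fq.gz target=100 min=2
zcat lib1.fq.gz | awk 'NR%4==1 {sub(/^@/,""); print $1}' > lib1_names.txt
filterbyname.sh in=combined_norm.fq.gz out=lib1_norm.fq.gz names=lib1_names.txt include=t
(repeat the last two steps for lib2)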

Normalization works best when it takes place in the context of all data simultaneously. If you try to hit 100x by normalizing two libraries independently to 50x each, you will not hit the target. Say one library is 20x in a region and the other is 500x. The result of combined normalization would be 100x, but independent normalizations would yield 20x + 50x = 70x, below the target. Furthermore, the combined normalization would give a 20:500 ratio of the individual libraries, so you'd only get a handful of reads from the lower-coverage library, which is probably not ideal.

Generally, if you have high coverage in a short fragment library and low coverage in a long-mate-pair library (or something similar), and you need to do normalization, I would suggest normalizing the high-coverage library alone and not doing anything to the lower-coverage long-mate-pair library.


Hi Brian, can you explain how you see a biased coverage distribution when you look at my data?


The coverage peaks at 1 and then kind of monotonically declines. That kind of pattern indicates a highly biased coverage distribution that you see in situations like metagenomes (which have an exponential coverage distribution) and highly amplified single-cell libraries (which also have an exponential coverage distribution). Isolates with uniform coverage will have clear peaks centered at the average coverage and multiples of the average coverage.

Some insects host bacteria in places other than their gut. You may want to BLAST your assembled contigs to see if they hit bacteria; they would have a very different coverage distribution than the main genome and might be interfering with the kmer frequency histogram.
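A minimal sketch of that contamination screen with BLAST+ (contig file name is a placeholder; assumes a local copy of the nt database with taxonomy files so sscinames can be reported):

blastn -query contigs.fasta -db nt -evalue 1e-20 -max_target_seqs 5 -num_threads 16 \
  -outfmt '6 qseqid sseqid pident length evalue staxids sscinames' -out contigs_vs_nt.tsv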


We didn't have much starting DNA, and indeed we did a whole-genome amplification by PCR.

1) Is it possible that PCR caused too many errors (which have then been sequenced), and that this is why we have difficulty getting a high N50?

2) Unfortunately, normalization didn't improve the N50. I am looking for other ideas; if you have some, that would be helpful.

I am frustrated because the completeness (evaluated by CEGMA and BUSCO) is not that bad (85%).

Thanks Brian.


Gosh, I'd say 85% is terrible... but then, I've never sequenced an insect, so I don't really know. It seems to me that your amplification caused problems and you should probably try again with more DNA and (ideally) zero amplification. I would always recommend zero amplification when possible for any project, but especially assembly.


What do you mean by:

your library has not sampled the genome reasonably uniformly


Since you have tried multiple assemblers and the results are bad (what criteria were used for reaching that conclusion?), it is possible that the library you are using does not have the entire genome represented (I can't see the image you included in the OP).


"Results are bad" is based on the N50 (it doesn't even reach 10 kb) and the CEGMA evaluation.
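For reference, contiguity metrics such as N50 can be computed with BBMap's stats.sh; a minimal sketch (file name is a placeholder):

stats.sh in=scaffolds.fasta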

8.0 years ago
krsahlin ▴ 60

As Brian Bushnell suggested, I would try SPAdes. The advantage of SPAdes over most other assemblers in this situation is its ability to handle "wide" k-mer distributions, such as the one you have. Although SPAdes assumes that the wide k-mer distribution is a result of amplification bias caused by the Multiple Displacement Amplification (MDA) technology used for single-cell sequencing, my guess is that the algorithmic solution for dealing with this coverage bias should not differ much between single-cell data and your case (not 100% sure, though).
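For what it's worth, a minimal sketch of invoking SPAdes in its single-cell (MDA) mode, which is the behaviour alluded to above (read file names are placeholders; whether --sc is appropriate for non-MDA data is exactly the open question):

spades.py --sc -1 reads_R1.fq.gz -2 reads_R2.fq.gz -o spades_sc_out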


Already done: C: Any advice for a de novo genome assembly
Perhaps it can be tried again after normalization/error correction.

I fear that this is a not-so-great/unrepresentative library to begin with, and it may be the end of the road as far as bioinformatics tools are concerned. This library may have to be redone and sequenced afresh.


Oh, I missed that, my bad! It's likely that you are correct about bad input data. But just for the sake of further possible alternatives, if the OP decides to explore the current data more:

There are many genome projects, e.g., of plants or fish, that are struggling with the contiguity of assemblies (N50 < 10 kbp). Usually, this is due to very heterozygous genomes and/or a complex repeat structure. To get a further sense of this, it would be interesting to know something about the nature of the organism being sequenced (unless the organism is confidential) and to see statistics such as the N50 and total assembly length from the different assemblers. For example, if the total assembly length is much longer than expected, it could be because of high heterozygosity. Then again, as the k-mer plots do not show a tendency toward two or more clear peaks (as is common for diploids with a high heterozygosity rate), it's not likely to be the cause. Still, it could be worth keeping these possibilities in mind.


Good points. Let us see what @Picasa says.


The organism sequenced is an insect.

We don't know much about it, but we assumed it is a diploid organism.

We suspect the problems arise from either the whole-genome amplification (we didn't have a large starting amount of DNA) or maybe a weird genome structure (repeats, inversions, etc.).

As I said in one of my posts, the expected genome size is 300 Mb, but the assembly results vary too much (from 250 Mb to 400 Mb), with an N50 always < 5 kb (after the scaffolding step).


I am now running the assembly on the corrected and normalized data that Brian suggested.

I also have another normalization at a target of 15. The first SPAdes assembly gave me an N50 of 2 kb (no scaffolding yet). I am desperate.
