I am trying to do a de novo sequence assembly of a 7500 bp virus with Illumina data. The coverage is very high (in the millions), so every piece of software I've tried to use is unable to complete this task. Are there any programs out there that are able to handle this? Any advice would be appreciated.
I would recommend to select a random subset of your reads to perform a "classic" assembly from your data. In fact, having a "too" high coverage can lead to a wrong assembly. I have seen artefactual repetitions of wide genomic regions in those kind of assemblies.
Don't try to assembly all your data at once. Determine the approximate coverage, for example using a kmer-frequency plot, and down-sample by randomly drawing reads for about 100X.
Thanks. I'll need to get approval from a supervisor to cut down on the coverage -- are there any risks involved with random selection and reducing the coverage?
ADD REPLY
• link
updated 5.1 years ago by
Ram
44k
•
written 9.1 years ago by
ldorsey6
▴
20
0
Entering edit mode
You can do this multiple times and see if the results are consistent by aligning final assemblies. If they are, there are no worries. Another thing to try would be digital normalization.
I'm not entirely sure about Trinity's specifically, but in general normalization results on downsampling redundant stuff, while random drawing - well it's random. For example, if you normalize a set with let's say a 500X contig (A) and a 50X contig (B) to 50X, than you ideally will have 50X for both in the normalized read set. If you are interested in both 50X and 500X at the same time, use normalization. But be aware, since most errors are low coverage, their coverage will not be reduced, same goes for contaminations. Meaning if you have 10X erroneous reads on your 500X contig, this initially is a low ratio, but after normalization you may still have 10X erroneous reads, but only 50X good reads. Essentially, you have to be aware that normalization can increase the noise to signal ratio
If you only need 500X stuff, as ldorsey6 does, than random sampling is better. It will bring A to 5X and B to 50X and the 5X fraction should get filtered during assembly and not interfere with what you are actually interested in. This strategy can be improved by filtering the subsampled read set for coverage/error correction.
I recommend Tadpole, from the BBMap package - it does a good job on small genomes with super-high coverage, such as viruses. You can also specify a coverage range to assemble. For example:
I'm really a fan of bbnorm (just saved with a 3GB plant genome assembly), but don't you think that in this case normalization would not be ideal, because you enrich for all the noisy low coverage stuff while only downsampling the reads you are interested in?
BBNorm by default uses 2-pass mode. In the first pass, it attempts to selectively remove reads containing errors (meaning they have both high and low coverage regions, with a sharp discontinuity) while in the second pass it removes all reads with equal weight regardless of whether they seem to contain errors. This makes it fairly robust against concentrating reads with errors, which would happen using 1-pass normalization. It does not really help against low-coverage contaminants and so forth. But, you can also explicitly direct it to discard that junk:
In that case, it would discard anything with depth under 400x, and normalize everything else to 100x. There's an interplay between 2-pass mode and setting a high value for min, though, so it may be better to do that in 2 commands like this:
Tadpole is working great for me. Thanks for the recommendation. I've been aligning the contigs to a reference to see if the de novo assembly is working, and I'm gettting 100% identity. However, tadpole is producing separate contigs that definitely overlap. Why is it producing 5 overlapping contigs instead of just 1?
ADD REPLY
• link
updated 5.1 years ago by
Ram
44k
•
written 9.1 years ago by
ldorsey6
▴
20
0
Entering edit mode
How much do they overlap? If they have 100% identity to a reference, they will not overlap by more than a kmer length, because a given kmer will only occur at most once in the output.
The reason for there being 5 contigs is likely polymorphism in the data, or repeats in the genome, though it's hard to say. You may be able to reduce the number of contigs by adding the flag bm1=8, or by first error-correcting the data, or by using a longer kmer like 90 (or if you have 150bp reads, even k=120).
"bm1" or "branchmult1" has a default value of 20. It is the minimum allowed ratio between the depths of two possible next kmers when extending a contig (a branch in the graph). So, for example, say you have a contig ACGTTA and you can extend it to the right by adding a C to make ACGTTAC or a G to make ACGTTAG. If there are 10 reads supporting the C and 50 reads supporting the G, then the ratio is 50/10=5, which is too low, and extension would stop. But if there are 10 reads supporting C and 500 supporting G, then the ratio is 50 which would pass the ratio so extension would continue, adding a G.
A higher value is more conservative and leads to a more fragmented assembly, but a lower rate of misassemblies and errors. A lower value will usually give a less fragmented assembly, but with a potentially higher rate of misassemblies and errors.
you could downsample a certain coverage (~200-400x) of reads and assemble to get an initial assembly.
You can use the rest of the reads to then fill in the gaps or superscaffold your starting assembly. That way, you will utilize all your data and not overwhelm the assembler.
Digital Normalization. For example, KHMER can be used to decrease sampling variation, discard redundant data, and removing the majority of errors to retain only useful information.
Hi,
I would recommend to select a random subset of your reads to perform a "classic" assembly from your data. In fact, having a "too" high coverage can lead to a wrong assembly. I have seen artefactual repetitions of wide genomic regions in those kind of assemblies.