Hi all,
Recently I have been working on a genome assembly of a fish. The genome size estimated by GenomeScope is 2509 Mb, which is a fairly large genome. After the first-stage assembly with WTDBG2, the resulting assembly has 47000+ contigs, which seems far too many. I am writing to ask for any methods to reduce this number.
Thanks,
Perhaps you could add some more details about your data and the assembly like:
Assembly size with all your contigs? The assembly N50? etc.
What is the read length distribution? Median? N50?
Has it been previously assembled with short reads?
I had better results with Flye, compared to wtdbg2, but for PacBio. How much coverage do you have? Did you filter your reads somehow?
To get 100X coverage, we employed a company to produce nearly 250GB of data from three batches (89GB, 31GB and 129GB). As we always get the CLEANED data from the company, we have not filtered any long reads.
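As a quick sanity check on the coverage figure, the arithmetic can be sketched as below. This assumes "GB" here means gigabases (not file size in gigabytes) and uses the GenomeScope estimate of 2509 Mb; the function name `coverage` is only illustrative.

```python
# Batch yields as quoted in the post, assumed to be gigabases (Gb).
batch_yields_gb = [89, 31, 129]

# GenomeScope genome size estimate from the first post, in megabases (Mb).
genome_size_mb = 2509


def coverage(total_bases_gb, genome_size_mb):
    """Return fold coverage: total sequenced bases divided by genome size."""
    return total_bases_gb * 1000 / genome_size_mb


total_gb = sum(batch_yields_gb)
print(total_gb)                                       # 249
print(round(coverage(total_gb, genome_size_mb), 1))   # 99.2
```

If the true genome size is closer to 3-4 Gb, the same yield corresponds to proportionally lower coverage.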
What does 'cleaned' mean in this sense? 100X is a lot of coverage. You could downsample to the longest 50X and retry. But again (as I commented above), the stats of your assembly and reads will give hints as to why you have so many contigs.
'Cleaned' means that the reads we got from the sequencing company have at least had adapters removed and low-quality reads filtered out.
I tried downsampling to the longest reads as a trial, but it made no improvement.
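Selecting "the longest 50X" amounts to sorting reads by length and keeping them until the kept bases reach 50 times the genome size. A minimal sketch of that logic, working on a plain list of read lengths rather than a FASTQ file (the function name `select_longest` and the toy numbers are hypothetical, and dedicated tools also weigh read quality):

```python
def select_longest(read_lengths, genome_size, target_coverage):
    """Keep the longest reads until their total length reaches
    target_coverage * genome_size. Returns the kept lengths."""
    target_bases = genome_size * target_coverage
    kept = []
    total = 0
    # Walk the reads from longest to shortest.
    for length in sorted(read_lengths, reverse=True):
        if total >= target_bases:
            break
        kept.append(length)
        total += length
    return kept


# Toy example: a "genome" of 10 bases, downsampled to 2X.
print(select_longest([5, 10, 3, 8, 2], genome_size=10, target_coverage=2))
# [10, 8, 5]
```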
Below are the stats of the combined reads from the three sequencing batches, generated by NanoPlot:
Below are the code and stats for the Illumina short reads, generated by Jellyfish and GenomeScope.
What about stats for your highly fragmented assembly? Does downsampling and re-assembling at least improve the assembly somewhat?
The assembly generated from three sequencing batches (PAG38564, PAG10123, PAG19859) has 53178 contigs, while the re-assembly with only the PAG38564 batch has 43837 contigs. It seems that downsampling did improve the assembly. However, the contig number is still too large. I am totally confused by the stats.
That is a big reduction in contig numbers. However, you should be using all the data and selecting only the longest reads.
The stats I think are more important for now are the total genome size and the L90, N90, etc. Basically, is your genome size what you expect, and is most of it in large contigs?
You are right. It is better to make use of all the sequencing data from the different batches.
The total genome size is expected to be 3-4 Gb, with 76 chromosomes. The genome assessments show that this species has very repetitive sequences and a high heterozygosity rate (>1%, as the above figure shows). Maybe I should try changing the assembly method, for example from WTDBG2 to CANU.
Downsampling with all the data will allow you to get a better set of long reads. You can try the tool Filtlong.
OK, that is the expected size, but what is your assembly size? L90? N90? Again, this will allow you to determine whether the majority of your assembly is in large contigs.
With a genome that size, Canu will take a very long time. You can try other fast assemblers like Raven, but you will probably not see any significant reduction in the contig number. It also depends on whether you want to try phasing, but your first issue is your contig count.
What is the k-mer size used in the genome assembly? In general, de novo assemblies are done at multiple k-mer lengths (AFAIK).
I used Jellyfish and GenomeScope with k-mer size 21 to estimate the genome size.
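For reference, N50/N90 and L50/L90 can be computed from the list of contig lengths alone. A minimal sketch (the function name `nx_lx` and the toy lengths are made up for illustration): sort the contigs from longest to shortest and walk down until the running total reaches the chosen fraction of the assembly size.

```python
def nx_lx(contig_lengths, fraction):
    """Return (Nx, Lx) for the given fraction (0.5 for N50/L50, 0.9 for N90/L90).

    Nx is the length of the contig at which the running total, taken from the
    longest contig downwards, first reaches fraction * total assembly size.
    Lx is the number of contigs needed to get there.
    """
    lengths = sorted(contig_lengths, reverse=True)
    threshold = sum(lengths) * fraction
    running = 0
    for count, length in enumerate(lengths, start=1):
        running += length
        if running >= threshold:
            return length, count
    raise ValueError("contig_lengths must not be empty")


# Toy assembly of five contigs, total size 300.
toy = [100, 80, 60, 40, 20]
print(nx_lx(toy, 0.5))  # (80, 2)
print(nx_lx(toy, 0.9))  # (40, 4)
```

A small L90 relative to the contig count means most of the assembly sits in a few large contigs, and the long tail of small contigs matters less.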
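To make the k-mer based estimate concrete, here is a simplified sketch of the idea: divide the total number of k-mers by the depth of the main coverage peak in the k-mer histogram. This is not what GenomeScope does internally (it fits a model); the function name `estimate_genome_size`, the `min_depth` cutoff of 5 for discarding error k-mers, and the toy histogram are all assumptions for illustration. With high heterozygosity the tallest peak may be the heterozygous one, so a naive estimate like this can be misleading.

```python
def estimate_genome_size(histogram, min_depth=5):
    """Naive genome size estimate from a k-mer histogram.

    histogram: list of (depth, count) pairs, i.e. how many distinct k-mers
    were seen exactly `depth` times (the two-column layout of a k-mer
    count histogram).
    min_depth: k-mers below this depth are treated as sequencing errors
    (an arbitrary cutoff chosen for this sketch).
    Returns (peak_depth, estimated_size).
    """
    usable = [(d, c) for d, c in histogram if d >= min_depth]
    if not usable:
        raise ValueError("no histogram rows at or above min_depth")
    # Depth with the most distinct k-mers is taken as the coverage peak.
    peak_depth = max(usable, key=lambda row: row[1])[0]
    # Total k-mer occurrences, excluding the low-depth error k-mers.
    total_kmers = sum(d * c for d, c in usable)
    return peak_depth, total_kmers // peak_depth


# Toy histogram: an error spike at low depth and a peak at depth 10.
toy_histogram = [(1, 1000), (2, 200), (9, 10), (10, 50), (11, 10)]
print(estimate_genome_size(toy_histogram))  # (10, 70)
```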