Hello,
I have assembled the genomes of various yeast strains to pseudochromosome level. All of them were treated exactly the same way, and I don't understand why one of them is more fragmented than the others. The data were generated with Illumina short reads from Nextera libraries. FastQC returned failures for "per base sequence content" and "sequence duplication levels" for all strains except this fragmented one, which only received warnings. Do you have any suggestion why this assembly could be more fragmented than the others?
Any suggestions or help would be much appreciated.
It is difficult to answer this without additional details. For example, what program did you use for the assembly? How about some assembly stats: number of contigs/scaffolds, average contig size, and contig N50? Did you read carefully through the assembler's output for possible errors or warnings? What is the sequencing depth of your sample?
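In case it helps, here is one quick way to pull those numbers; this is just a sketch assuming seqkit, bwa, and samtools are installed, and assembly.fasta / reads_1.fastq / reads_2.fastq are placeholder names to adapt:

    # number of sequences, total size, and N50 in one go
    seqkit stats -a assembly.fasta

    # approximate depth: map the reads back and look at per-contig mean depth
    bwa index assembly.fasta
    bwa mem -t 8 assembly.fasta reads_1.fastq reads_2.fastq | samtools sort -o mapped.bam -
    samtools index mapped.bam
    samtools coverage mapped.bam    # see the meandepth column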
Oddly enough, fragmented assemblies can be caused both by too-shallow and too-deep sequencing runs. In the first case it is simply because there are not enough reads available. With too-deep sequencing, which I suspect may be the issue here, non-random sequencing errors can accumulate to the point that they disrupt the contiguity of the assembly.
Thank you for the suggestions. I used IDBA-UD to assemble the reads into contigs and Ragout to create the pseudochromosomes. All of the final assemblies have an N50 of around 900 kb; this one has an N50 of 28 kb and almost five times more scaffolds. Coverage is comparable across all strains, over 100x. Also, this assembly is almost two times larger than the others.
Like I said, it sounds like too deep a coverage. As to it being two times larger, that could happen because non-random sequencing errors make it appear as if you have several strains that differ at discrete spots. These artificial SNPs make the assembly look larger and more fragmented, and you can easily test whether this is the case: run your assembly through cd-hit-est
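To make both steps concrete, here is roughly what I mean; a sketch only, assuming cd-hit and BBTools are on your PATH, with placeholder file names:

    # collapse near-duplicate contigs at 90% identity
    # (-n 8 is the word size recommended for -c 0.90)
    cd-hit-est -i assembly.fasta -o assembly_cdhit.fasta -c 0.90 -n 8 -M 16000 -T 8

    # normalize read depth to ~70x before re-assembling
    bbnorm.sh in=reads_1.fastq in2=reads_2.fastq \
        out=norm_1.fastq out2=norm_2.fastq target=70 min=5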
(available here) and cluster at 90-95% identity, which should drop its size to what is expected. If so, drop the coverage to 60-80x (BBNorm can do that) and see if that assembles better.

Thank you for the help. As you suggested, I ran the assembly through cd-hit-est with -c 0.9. Unfortunately, the size is still twice as large compared to the other strains.
Haplotype assembly is the only other thing that comes to mind. How sure are you that your strain is pure and haploid?
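One way to check that from the reads themselves is to look at the k-mer spectrum: a pure haploid sample gives a single coverage peak, while a heterozygous diploid or a mixed culture typically shows two. A minimal sketch with Jellyfish (assuming it is installed; file names are placeholders):

    # count canonical 21-mers and export the histogram
    jellyfish count -C -m 21 -s 1G -t 8 -o reads.jf reads_1.fastq reads_2.fastq
    jellyfish histo -t 8 reads.jf > reads.histo
    # reads.histo can then be plotted, or fed to GenomeScope for a heterozygosity estimate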
I'd still reduce the coverage and see what comes out of that assembly. It may be counter-intuitive that throwing away data could lead to a better assembly, but I have seen it happen many times.
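If you would rather subsample randomly instead of normalizing by depth, seqtk works too; note that the same -s seed must be used for both files so the read pairs stay in sync (a sketch with placeholder names, keeping ~60% of the reads):

    seqtk sample -s100 reads_1.fastq 0.6 > sub_1.fastq
    seqtk sample -s100 reads_2.fastq 0.6 > sub_2.fastq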
According to the results of flow cytometry, this strain and all the others are diploid, and it should be pure. I have now calculated the average coverage differently, by mapping the reads (after trimming and k-mer-based error correction) to the draft assembly. This strain has significantly lower coverage (80x) compared to the other strains (above 140x). Should I still try to lower the coverage? What do you think about trying to assemble the reads before trimming and correction?
I don't know exactly what the reason is, but did you assess all your assemblies with the BUSCO tool? Maybe the assembly you mentioned contains contamination, or something is missing in your raw data?
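For reference, a typical BUSCO run on a yeast assembly might look like this (a sketch using BUSCO v5 syntax; the lineage dataset and file name are assumptions to adapt to your strains):

    # completeness check against single-copy yeast orthologs
    busco -i assembly.fasta -l saccharomycetes_odb10 -m genome -o busco_assembly -c 8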
I didn't use the BUSCO tool, but I will check it out, thank you for the suggestion. I don't know where the problem could be. Maybe I will ask the data providers.