I recently assembled my mammalian genome, but the number of bp assembled exceeds the expected number of bp in the genome. Therefore, I am concerned that my assembly may contain redundancy (smaller contigs overlapping with larger ones).
I plotted my contigs against a closely related genome, and the distribution of contigs along the chromosomes seems quite uniform.
Are there any tools to check for redundancy in assemblies, e.g. with BLAST? I am also thinking about running another round of assembly, but I don't expect this to help, as obvious overlaps would (hopefully) have been eliminated already in the first round.
Heh, welcome to the world of assembling diploid or polyploid genomes.
What you see is very common and is produced by any type of assembler. Smaller contigs are generated to show the differences between the two homologous copies of each chromosome in a diploid organism.
What you want is a haploid assembly. Currently, the only tool for this, which came out just a few months ago, is dipSpades. It tries to collapse the smaller contigs and leave only consensus contigs that represent both haplomes (the two homologous chromosome copies). It takes haplocontigs (contigs from any given assembler) as input and outputs consensus contigs (the haploid assembly). The tool greatly reduces the number of contigs and the assembly size; however, I have reason to believe that it also removes some non-redundant information. When I BLAST a gene against my haplocontigs, I find it, but when I BLAST it against the consensus contigs it is no longer there, so I believe some information is lost.
Let me know if you find something else, or try dipSpades and tell me whether you see a similar loss of non-redundant information.
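If you want to check the redundancy yourself with BLAST, as the question suggests, one rough option is to align the contigs against themselves and flag contigs that are almost entirely covered by a longer one. The sketch below assumes the standard 12-column tabular BLAST output (`-outfmt 6`) and a dictionary of contig lengths that you build yourself. The function name, the 95% identity cutoff and the 90% coverage cutoff are arbitrary choices for illustration, not values from any tool.

```python
from collections import defaultdict


def merged_length(intervals):
    """Total number of bases covered by a list of 1-based inclusive intervals."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end + 1:
            # Disjoint from the current run: close it and start a new one.
            if cur_end is not None:
                total += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent: extend the current run.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start + 1
    return total


def contained_contigs(blast_lines, lengths, min_identity=95.0, min_coverage=0.9):
    """Return {query: (subject, coverage)} for contigs mostly covered by a longer contig.

    blast_lines: lines of tabular BLAST output (qseqid, sseqid, pident, length,
                 mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore).
    lengths:     dict mapping contig name to contig length in bp.
    The thresholds are illustrative assumptions; tune them for your data.
    """
    hits = defaultdict(list)
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        query, subject, identity = fields[0], fields[1], float(fields[2])
        qstart, qend = sorted((int(fields[6]), int(fields[7])))
        if query == subject or identity < min_identity:
            continue
        # Only ask whether the query sits inside a longer contig
        # (ties are broken by name so that exact duplicates are flagged once).
        if (lengths[subject], subject) < (lengths[query], query):
            continue
        hits[(query, subject)].append((qstart, qend))

    contained = {}
    for (query, subject), intervals in hits.items():
        coverage = merged_length(intervals) / lengths[query]
        if coverage >= min_coverage and coverage > contained.get(query, ("", 0.0))[1]:
            contained[query] = (subject, coverage)
    return contained
```

Keep in mind that a contig flagged this way may be the second haplotype or a real repeat or paralog rather than an assembly artifact, so treat the output as a list of candidates to inspect, not as a list of contigs to delete.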
dipSpades doesn't work well at all... I am becoming more and more convinced that this approach to tackling redundancy is simply wrong, since it essentially amounts to assembling assemblies, and we all know that's a bad idea. Aside from the fact that dipSpades removes some genes completely from the assembly, which it shouldn't do, I now see that it also creates longer contigs that are not supported by paired-end information. I also saw it create tandem duplications of genomic sequence that are not supported by anything either, so yeah, a bit disappointing.
A k-mer graph still remains the best way to estimate genome size, I suppose.
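For anyone unfamiliar with the k-mer approach, the usual back-of-the-envelope estimate is: genome size ≈ total number of k-mers in the reads divided by the depth at the main peak of the k-mer histogram. Here is a minimal sketch, assuming you already have the histogram as a depth-to-count mapping (for example, exported from a k-mer counter). The function name and the `min_depth` cutoff used to skip error k-mers are my own illustrative choices.

```python
def estimate_genome_size(histogram, min_depth=5):
    """Estimate genome size from a k-mer depth histogram.

    histogram: dict mapping k-mer depth -> number of distinct k-mers seen at that depth.
    min_depth: bins below this depth are ignored, since they are usually dominated
               by sequencing errors (the default is an assumption, not a rule).
    Returns (estimated_size_in_kmers, peak_depth).
    """
    usable = {depth: count for depth, count in histogram.items() if depth >= min_depth}
    if not usable:
        raise ValueError("no histogram bins at or above min_depth")

    # Depth with the most distinct k-mers, taken as the main coverage peak.
    peak_depth = max(usable, key=usable.get)

    # Total k-mer occurrences in the reads (depth times number of distinct k-mers).
    total_kmers = sum(depth * count for depth, count in usable.items())

    return total_kmers / peak_depth, peak_depth


# Toy histogram, made up purely for illustration.
toy = {1: 1000, 2: 500, 10: 5, 19: 20, 20: 60, 21: 20}
size, peak = estimate_genome_size(toy)
```

In a heterozygous diploid genome the histogram can show two peaks, with the heterozygous one at roughly half the depth of the homozygous one. This simple version just takes the tallest bin, so check the plot by eye and make sure the peak it picks is the homozygous one before trusting the number.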
Nice answer to a tough question! Although I'm not sure whether dipSpades works at all for mammalian genomes.