Hi Biostars, I was hoping someone may be able to give some guidance about troubleshooting a mammalian genome assembly. My species has a similar genome size to humans and I have ~150X coverage of 2x150bp Illumina reads for building contigs.
I've previously had modest luck assembling contigs for a closely related species with much lower coverage (~35-40X of 2x150bp Illumina), using SOAPdenovo2, so I gave that another shot. This time, though, my contigs came out pathetically small despite the considerably greater coverage available for this species. I've verified that the data really is from the right species and that the library isn't strongly biased toward one part of the genome or another: BLAST searches of randomly selected reads turn up hits from close relatives, and I get a very high mapping rate to my previous related species' assembly with a fairly level coverage histogram across the genome. I've also tried deduplicating the read data, which didn't change the results significantly.
I'm basically at a loss as to why contig assembly comes out so much worse for this new species when the underlying genome is very similar to my previous species, the library isn't strongly biased, and the coverage is significantly better.
As a secondary problem, I've also tried SPAdes, but my dataset seems to crash the program despite giving it ~900 GB of memory. From what I've read, SPAdes loads the whole dataset into memory, and mine (about 950 GB) is larger than the available RAM. Is there a good strategy for dividing up a dataset, assembling the parts, then combining the resulting assemblies?
You can subsample the dataset for SPAdes assembly.
Thanks for your reply. I'm just not certain how subsampling is a strategy for dividing the dataset and combining resulting assemblies. Writing a script to take random reads out of a file is easy, making high-quality assemblies from multiple smaller assemblies is a somewhat different problem though.
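For what it's worth, in practice most people use an existing tool for this step (seqtk's sample subcommand with a fixed seed is a common choice, since the same seed applied to both files keeps mates paired). A minimal sketch of the same idea in Python, with hypothetical names, might look like:

```python
import random

def subsample_pairs(r1_records, r2_records, fraction, seed=100):
    """Keep each read *pair* with probability `fraction`.

    Because one seeded RNG drives the keep/drop decision for both
    files in lockstep, mates stay synchronized between R1 and R2.
    Records can be any per-read object (e.g. a 4-line FASTQ chunk).
    """
    rng = random.Random(seed)
    kept1, kept2 = [], []
    for rec1, rec2 in zip(r1_records, r2_records):
        if rng.random() < fraction:
            kept1.append(rec1)
            kept2.append(rec2)
    return kept1, kept2
```

So to go from ~150X down to ~50X you'd use fraction=1/3, and the combining step is just assembling the kept pairs as one ordinary (smaller) library, not merging multiple assemblies.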
I tend to use subsampling, and in my experience it almost always gives a better assembly when the data has good depth. I'd generate the assembly from subsampled reads, then align all of the reads back to the contigs and call a consensus, to make sure the assembly still represents the full dataset.
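In a real pipeline the align-back-and-polish step above would be done with a mapper and variant caller (e.g. bwa mem plus samtools/bcftools, or a polisher like Pilon), not by hand. Just to illustrate what "call consensus" means at its core, here is a toy majority-vote over already-aligned, equal-length sequences:

```python
from collections import Counter

def consensus(aligned_reads):
    """Majority-vote consensus over equal-length aligned sequences.

    '-' marks a gap / no coverage and is ignored in the vote;
    a column with no real bases at all becomes 'N'.
    """
    out = []
    for col in zip(*aligned_reads):
        counts = Counter(base for base in col if base != '-')
        out.append(counts.most_common(1)[0][0] if counts else 'N')
    return ''.join(out)
```

If a contig position disagrees with the majority of the full read set, the consensus corrects it, which is exactly the safety net Sej is describing for assembling from a subsample.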
To second Sej: when coverage is very high, the same sequencing errors recur often enough that the assembler treats them as real sequence, and contigs get broken at those points. Maybe you can change the parameters and raise the minimum k-mer coverage cutoff.
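A bit of arithmetic makes Asaf's point concrete. The cutoff SOAPdenovo2 exposes applies to k-mer coverage, which is lower than base coverage: a read of length L contributes L - k + 1 k-mers, so the expected k-mer coverage is C_k = C * (L - k + 1) / L. A quick sketch:

```python
def kmer_coverage(base_cov, read_len, k):
    """Expected k-mer coverage: C_k = C * (L - k + 1) / L."""
    return base_cov * (read_len - k + 1) / read_len

# At 150X base coverage with 150 bp reads and k=63,
# true k-mers appear ~88 times on average; at 35X only ~20 times.
# The deeper the data, the more often each *error* k-mer also
# recurs, so a default low-frequency cutoff stops filtering them.
```

So at 150X a cutoff that worked at 35-40X may be far too low; the exact parameter name and default depend on the assembler version, so check your SOAPdenovo2 documentation for the k-mer frequency cutoff option before changing it.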
Hi Asaf, I think you and Sej are on the right track. Looking back at my previous assemblies, I can see the total assembly size is quite a bit larger than it ought to be (suggesting many extra contigs, possibly due to errors). I also ran one of my smaller sets of reads (~50X) and it improved the assembly somewhat over the total set. I might try a random sampling of the total set next. Do you think I should try error correction before or after subsampling?
Thank you Sej, I'll give that a try.
If anyone around has other suggestions I'd really appreciate it. I've tried downsampling and error correction, but neither made more than a trivial difference in contig N50. Everything I can tell about this dataset says it's excellent quality, but I can't get even half-decent contigs.
So I've tried downsampling as people suggested in the comments, without any success. The difference in contig size was basically trivial: I've tried 200X, 150X, 70X, 50X, 40X and 30X with very little difference in the result (a few hundred bp or so), and the largest contig has hardly changed in size (~28kb). I've also tried error correction, which again made a trivial difference. If anyone has other suggestions, I'm getting a little desperate.