Hello,
I'd be very interested to know what recommendations there are for subsampling for coassembly, when computational resources are not available for the full dataset. In my use-case, it would be for de novo assembly with megahit (single node) or metahipmer (multinode).
I have read about normalisation-based approaches, but since these distort coverage I know many would discourage them, and the metahipmer developers definitely do.
Random subsampling seems reasonable, but I worry that since my depth varies wildly between samples (due to varying proportion of microbial reads in our human samples), and complexity/coverage will vary, it may not be best to subsample all samples to the same extent.
A possible improvement would be subsampling down to an absolute maximum depth per sample, such that low-depth samples are not subsampled, and high-depth samples are subsampled more aggressively. However, this would still not take into account that at the same depth, one sample may be well covered (due to low complexity) and another may be poorly covered (due to high complexity).
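To make the depth-cap idea concrete, here is a minimal sketch of how the per-sample keep fraction could be computed. The function name, the sample names and the 1 Gbp cap are all made up for illustration; nothing here comes from a particular tool.

```python
def cap_fractions(depth_bp, max_bp):
    """Fraction of reads to keep per sample so that no sample exceeds max_bp.

    depth_bp: dict of sample name -> sequenced base pairs (hypothetical input)
    max_bp:   absolute depth cap in base pairs (hypothetical value)
    """
    fractions = {}
    for sample, depth in depth_bp.items():
        if depth <= max_bp:
            # Low-depth samples are left untouched.
            fractions[sample] = 1.0
        else:
            # High-depth samples are cut down to the cap.
            fractions[sample] = max_bp / depth
    return fractions


if __name__ == "__main__":
    # Illustrative numbers only.
    depths = {"sample_a": 2e9, "sample_b": 5e8}
    print(cap_fractions(depths, max_bp=1e9))
```

Each fraction could then be handed to whatever read subsampler is in use, taking care to use the same random seed for both files of a read pair so that mates stay in sync.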
This leads me to the idea of using Nonpareil curves to guide subsampling. I am considering an approach whereby, for each sample, I estimate the total base pairs required to achieve (say) 0.95 coverage, and take the proportion of that requirement to the base pairs actually sequenced. Samples with proportion >= 1 are not subsampled, and those with proportion < 1 are subsampled to that proportion of reads. Thus I reduce the total number of reads, but more aggressively in better-covered samples.
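As a rough sketch of that rule, assuming the base pairs required for the target coverage have already been estimated per sample (for example from the Nonpareil curves) and are simply passed in as numbers: the proportion is required bp divided by sequenced bp, capped at 1. The function and sample names are hypothetical, and the values are invented.

```python
def coverage_guided_fractions(sequenced_bp, required_bp):
    """Per-sample keep fraction from a coverage target.

    sequenced_bp: dict of sample -> base pairs actually sequenced
    required_bp:  dict of sample -> estimated base pairs needed to reach
                  the target coverage (assumed to be precomputed elsewhere)
    """
    fractions = {}
    for sample, have in sequenced_bp.items():
        if have <= 0:
            # Nothing sequenced, so nothing to subsample.
            fractions[sample] = 1.0
            continue
        proportion = required_bp[sample] / have
        # proportion >= 1: sample is under-covered, keep every read.
        # proportion <  1: sample is over-covered, keep only that share.
        fractions[sample] = min(1.0, proportion)
    return fractions


def retained_bp(sequenced_bp, fractions):
    """Expected total base pairs left after subsampling."""
    return sum(sequenced_bp[s] * fractions[s] for s in sequenced_bp)


if __name__ == "__main__":
    # Illustrative numbers only.
    sequenced = {"s1": 4e9, "s2": 1e9}
    required = {"s1": 1e9, "s2": 3e9}
    fracs = coverage_guided_fractions(sequenced, required)
    print(fracs, retained_bp(sequenced, fracs))
```

Comparing `retained_bp` against the total for a flat per-sample fraction gives a like-for-like read budget when benchmarking the two strategies.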
In my head, this feels like it might be an efficient way of subsampling for assembly. I appreciate that the time and memory usage of de novo assemblers depends primarily not on the number of sequences, but on the number of unique k-mers and the graph structure. Thus, subsampling every sample to 50%, and subsampling to 50% overall with the Nonpareil strategy, would perform differently. Even so, targeting by coverage still seems more appropriate to me.
I will be really grateful for any thoughts! I will be experimenting in tandem!
Best wishes,
Andrew
Add digital normalization to the comparison; you've got me curious.
Also, do not forget that the metrics you are reporting are useful, but they aren't the only ones that matter when evaluating which method / final assembly is better.