So I've been working on bacterial genome assemblies of evolved Bacillus subtilis strains using PacBio long reads, and depending on which assembler I use, I run into some kind of hardware limitation. For context, I'm working on a SLURM-based compute node that my university provides, and my lab rents ~5 CPUs and ~35 GB of RAM. All of my work so far has been with short reads, and I've had no problem with those assemblies. With long reads, though, I either don't have enough memory or not enough threads to run a project. For example, Flye runs out of memory even when I give it 30 GB of RAM, and with NextDenovo I don't have enough CPUs to dedicate to each of the separate tasks required to run the whole pipeline.
When I looked into Flye's memory requirements, I saw that they are indeed very high, but I also suspect my input samples are much larger than they need to be for a good assembly. According to this benchmarking study, the vast majority of prokaryotic genomes could be assembled with less than 30 GB of RAM (see panel G in the figure).
Each of the .fastq files for the strains I need to assemble is 42-46 GB gzipped. From what I can gather, that looks like far more data than I need for a good assembly, so I'm wondering whether I could randomly subsample a small fraction of each .fastq file and assemble from that instead (roughly as sketched below). Does anyone know if this would work, and if not, is there another approach I could take to finish these assemblies with the limited resources I have?
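In case it helps anyone answering, this is roughly what I had in mind for the subsetting step. It's a minimal sketch using only the Python standard library; the file names, the 10% fraction, and the seed are placeholders, and I assume dedicated tools like seqkit or rasusa would do the same thing faster.

```python
# Minimal sketch: keep a random ~10% of reads from a gzipped FASTQ file.
# File names, FRACTION, and the seed are placeholders, not my real setup.
import gzip
import random

FRACTION = 0.10   # fraction of reads to keep
random.seed(42)   # fixed seed so the subset is reproducible

with gzip.open("reads.fastq.gz", "rt") as fin, \
     gzip.open("reads.subset.fastq.gz", "wt") as fout:
    while True:
        record = [fin.readline() for _ in range(4)]  # one FASTQ record = 4 lines
        if not record[0]:                            # end of file
            break
        if random.random() < FRACTION:
            fout.writelines(record)
```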
I think I might be able to help with this. I have some compute sitting around that could be used for your assemblies, so you wouldn't have to subsample your reads. If you're interested, let me know how we can get in touch.
Thanks for the offer, but the subsampling approach seems to be working pretty well. I was able to get 160x coverage on my assemblies with 1% of my reads, so I think I'll be okay.
Wow! If 1% of your reads = 160x theoretical coverage, then you really did have way too much data. Excess coverage can actually lead to worse assemblies (sounds counterintuitive, but it's true).
It was actually 10%, but still definitely too much data.
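For anyone who finds this later, the arithmetic is just total sequenced bases divided by genome size. The snippet below is how one could check it; the ~4.2 Mb genome size is approximate for B. subtilis, and the file name is the placeholder from the sketch in my original post.

```python
# Quick sanity check: estimated coverage = total bases / genome size.
# Genome size is ~4.2 Mb for B. subtilis; the input file name is a placeholder.
import gzip

GENOME_SIZE = 4_200_000  # bp, approximate B. subtilis genome

total_bases = 0
with gzip.open("reads.subset.fastq.gz", "rt") as fh:
    for i, line in enumerate(fh):
        if i % 4 == 1:                      # sequence line of each 4-line record
            total_bases += len(line.strip())

print(f"~{total_bases / GENOME_SIZE:.0f}x estimated coverage")
```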
I'm glad to hear that!! I've only worked with long reads in passing, so I'm surprised to hear you have 160x coverage with such a small fraction!!