Hi all,
I've received some data for a bacterial genome project, where I'll be assembling two bacterial genomes de novo (expected 4-5 Mbp each) from PacBio HiFi reads (most reads ~15 kb). These two bugs got an entire SMRT cell to themselves, so the expected coverage is something like 2,000-3,000x. This is my first go at prokaryote genome assembly, though I've done a few fungi before.
My expected flow is some sanity checks with Jellyfish + GenomeScope, then hifiasm, then annotation, maybe with a prokaryote-tweaked Funannotate pipeline (we have a little RNA-seq data to help there too). (Happy to hear about your favorite bacterial genome tools as well.) I'm taking a look at k-mer counts with GenomeScope right now.
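For reference, this is roughly the k-mer counting step I'm running; file names, threads, and hash size are placeholders, and I'm using the GenomeScope (v1) positional arguments that match the parameters mentioned below:

```
# Count canonical 21-mers from the HiFi reads
jellyfish count -C -m 21 -s 2G -t 16 -o reads.jf <(zcat hifi_reads.fastq.gz)

# Export the k-mer coverage histogram; raise the upper bound above the default 10,000
jellyfish histo -t 16 --high 100000 reads.jf > reads.histo

# Fit the GenomeScope model: histogram, k, read length, output dir, max k-mer coverage
Rscript genomescope.R reads.histo 21 15000 genomescope_out 10000
```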
With such massive coverage, the GenomeScope models fail to converge [kmer=21, read.len=15,000, max.cov=10,000]. But eyeballing what look like reasonable "true" peaks in that 2,000-3,000x coverage range, I simply dropped everything below 1,000x and got something more familiar. (The "1kb+" title on that plot should actually read "1,000x coverage and above"; it has nothing to do with read length.)
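Concretely, all I did was drop the low-coverage bins from the histogram before refitting; a minimal sketch (file names are placeholders, and the 1,000x threshold is just what looked right by eye):

```
# reads.histo has two columns: k-mer coverage, count. Keep only bins at >= 1,000x.
awk '$1 >= 1000' reads.histo > reads.min1000x.histo

# Refit with the same parameters as before
Rscript genomescope.R reads.min1000x.histo 21 15000 genomescope_min1000x 10000
```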
I've also seen recommendations (from years ago) for BBNorm as perhaps a smarter way to reduce my input data, though most of what I've found on over-sequenced genomes was based on short reads and SPAdes or the like. Since I expect to use hifiasm, that also lets me introduce a k-mer coverage cutoff during assembly to mimic, more or less, what I did with these GenomeScope plots.
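If I do go the BBNorm route, I'm picturing something like the sketch below; BBNorm was built around short reads, so I'm only assuming it behaves sensibly on HiFi, and the target depth and file names are placeholders:

```
# Normalize to ~100x k-mer depth; min= discards reads from very low-depth regions
bbnorm.sh in=hifi_reads.fastq.gz out=hifi_norm.fastq.gz target=100 min=5
```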
So my question is: for these data (2,000x+ coverage PacBio HiFi on bacteria), are there any specific recommendations for managing (reducing) my input data, and maybe even for using the excess to my advantage somehow? Having so many reads seems like an opportunity to make a very nicely polished and complete assembly.
Thanks!
While someone will have a more informed answer, downsampling the data (to 100x or less) may be a good place to start. Too much data can be a bad thing for assemblies.
colindaven is knowledgeable about long reads.
For sure, I think downsampling is the right call. I've been debating whether to simply do this randomly (seqtk or BBMap should make that easy) or to try to squeeze the most out of these reads and apply some minimum-length or Q-score filters to be choosy (rough sketch of both options below).
My only hesitation is that if read quality isn't random across the genome, I could probably introduce some weird bias into my assembly. I think Canu includes a subsampling step that takes a pile of "nice" reads plus random/shorter reads to help with this, but I've switched over to hifiasm these days for assembly.
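For what it's worth, here's roughly what I have in mind for the two options; seeds, read counts, and thresholds are placeholders (~100x of a ~5 Mbp genome is about 500 Mb, i.e. roughly 33k reads at ~15 kb), and filtlong is just one example of a length/quality filter:

```
# Option 1: random subsampling
seqtk sample -s 42 hifi_reads.fastq.gz 33000 | gzip > hifi_sub.fastq.gz                  # by read count
reformat.sh in=hifi_reads.fastq.gz out=hifi_100x.fastq.gz samplebasestarget=500000000    # BBMap, by base count

# Option 2: be choosy instead and keep the longest, highest-quality reads
filtlong --min_length 10000 --keep_percent 10 hifi_reads.fastq.gz | gzip > hifi_filtered.fastq.gz
```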