Question

randomly sampling long read data for genome assembly

1

Entering edit mode

4.2 years ago

jleehan ▴ 120

So I've been working on bacterial genome assemblies with long reads (pacbio) of evolved strains for Bacillus subtilis and I have been running into an issue where depending upon the program I'm using, I run into some kind of error with regards to the amount of hardware I have access to. For context, I'm using a SLURM-based computation node that my university provides and my lab rents ~5 CPUs and ~35GB of RAM. So far all my work has been with using short reads and I've had no problem with the assemblies using these tools. However, with long reads I either do not have enough memory or enough computing threads to run a project. For example, if I use flye, I don't have anywhere near enough memory even using 30 GB of RAM or if I use NextDenovo, I don't have enough CPUs to dedicate to each of the separate tasks required to execute the whole pipeline.

When I was researching the memory requirements for flye, I saw that it has very high memory requirements, but I also think my input samples are much larger than they need to be to get a good assembly. According to this benchmarking study, the vast majority of prokaryotic genomes were able to be assembled with fewer than 30 GB of RAM (see 'G' in figure).

Each of my .fastq files for my different strains that need assembling are 42-46 GB zipped. From what I can gather, that looks like way more data than I need to get a good assembly, so I'm wondering if it would be possible for me to randomly sample and subset a small fraction of my .fastq files to perform assemblies with. Does anyone know if this would work and if not do you know any other approach I could take for finishing these assemblies with the limited resources I have?

pacbio assembly genome • 2.6k views

ADD COMMENT • link 4.2 years ago by jleehan ▴ 120

0

Entering edit mode

I think I might be able to help with this. I have some compute sitting around that could be used for your assemblies. You wouldn't have to sub-sample the assemblies that way. If you're interested, let me know how we can contact one another.

ADD REPLY • link 4.2 years ago by Dunois ★ 2.9k

1

Entering edit mode

Thanks for the offer, but I think the sub sampling method is working pretty well. I was able to get 160X coverage on my assemblies with 1% of my reads, so I think I'll be okay.

ADD REPLY • link 4.2 years ago by jleehan ▴ 120

0

Entering edit mode

Wow! If 1% of your reads = 160x theoretical coverage then you did have way too much data. That can actually lead to bad assemblies (sounds counterintuitive but true).

ADD REPLY • link 4.2 years ago by GenoMax 153k

0

Entering edit mode

It was actually 10%, but still definitely too much data.

ADD REPLY • link 4.2 years ago by jleehan ▴ 120

0

Entering edit mode

I'm glad to hear that!! I've only worked with long reads in passing, so I'm surprised to hear you have 160x coverage with such a small fraction!!

ADD REPLY • link 4.2 years ago by Dunois ★ 2.9k

score 1 · Answer 1 · 2021-06-14

1

Entering edit mode

4.2 years ago

GenoMax 153k

If you are simply looking to subsample then seqtk sample or reformat.sh (from BBMap suite) should work. Following samples 400 reads.

$ reformat.sh -Xmx5g in=your.fastq out=sample.fastq samplereads=400 qin=33

Parameters of interest for reformat.sh:

Sampling parameters:

reads=-1                Set to a positive number to only process this many INPUT reads (or pairs), then quit.
skipreads=-1            Skip (discard) this many INPUT reads before processing the rest.
samplerate=1            Randomly output only this fraction of reads; 1 means sampling is disabled.
sampleseed=-1           Set to a positive number to use that prng seed for sampling (allowing deterministic sampling).
samplereadstarget=0     (srt) Exact number of OUTPUT reads (or pairs) desired.
samplebasestarget=0     (sbt) Exact number of OUTPUT bases desired.
                        Important: srt/sbt flags should not be used with stdin, samplerate, qtrim, minlength, or minavgquality.
upsample=f              Allow srt/sbt to upsample (duplicate reads) when the target is greater than input.
prioritizelength=f      If true, calculate a length threshold to reach the target, and retain all reads of at least that length (must set srt or sbt).

ADD COMMENT • link 4.2 years ago by GenoMax 153k

0

Entering edit mode

Hi GenoMax, I really appreciate the reply. I'm giving it a try right now and I'll get back to you when I figure out whether or not this works.

ADD REPLY • link 4.2 years ago by jleehan ▴ 120

1

Entering edit mode

It will work question is will it work in a way that you find useful for assemblies. You can keep the data compressed (no need to decompress) by using in=your.fq.gz out=sample.fq.gz.

ADD REPLY • link 4.2 years ago by GenoMax 153k

0

Entering edit mode

So it did work obviously, but when I said I wanted to figure out whether or not that would work, I more meant, whether or not I could figure out how to use those programs, because I am not particularly proficient at command line stuff in general and it can be difficult for me to adapt something someone tells me to do if they don't work in the same exact computing environment as me. But I was able to figure it out. Thank you!

ADD REPLY • link 4.2 years ago by jleehan ▴ 120