I already know that reformat.sh can subsample, but only by percentage. Is there a way to subsample a fixed number of reads? I know the reads=# flag exists, but it seems to be applied before the subsampling. If worst comes to worst, I can randomly subsample 99% of the file and then run a second round of reformat.sh using reads= to get my fixed number, but a single step would be much easier.
As long as you set sampleseed, the sampling should be random but deterministic (the same seed gives the same subset).
Have you tried using a combination of that or samplerate along with reads?
Sampling parameters:
reads=-1                Set to a positive number to only process this many INPUT reads (or pairs), then quit.
skipreads=-1            Skip (discard) this many INPUT reads before processing the rest.
samplerate=1            Randomly output only this fraction of reads; 1 means sampling is disabled.
sampleseed=-1           Set to a positive number to use that prng seed for sampling (allowing deterministic sampling).
samplereadstarget=0     (srt) Exact number of OUTPUT reads (or pairs) desired.
samplebasestarget=0     (sbt) Exact number of OUTPUT bases desired.
                        Important: srt/sbt flags should not be used with stdin, samplerate, qtrim, minlength, or minavgquality.
upsample=f              Allow srt/sbt to upsample (duplicate reads) when the target is greater than input.
prioritizelength=f      If true, calculate a length threshold to reach the target, and retain all reads of at least that length (must set srt or sbt).
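Going by the help text above, samplereadstarget (srt) looks like the single-step option for a fixed number of output reads. A minimal sketch follows, using only flags listed in that help text. The helper name subsample_fixed, the REFORMAT override variable, and the seed value 42 are made up for illustration, and it is an assumption here that sampleseed also applies to srt sampling.

```shell
# Hypothetical helper; the function name, REFORMAT variable and seed are placeholders.
subsample_fixed() {
    # $1 = input FASTQ, $2 = output FASTQ, $3 = exact number of reads (or pairs) to keep
    # REFORMAT may point at a specific BBTools install; defaults to reformat.sh on PATH.
    # Per the help text, do not combine srt with stdin, samplerate, qtrim,
    # minlength, or minavgquality.
    "${REFORMAT:-reformat.sh}" in="$1" out="$2" samplereadstarget="$3" sampleseed=42
}

# Usage (file names are placeholders):
#   subsample_fixed reads.fq subsampled.fq 1000000
```

Leaving reads= out of the command should let the sampling draw from the whole input file rather than only its first records.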
So the deterministic part doesn't matter. My understanding is that reads refers to how many input reads are processed. So if you ran reads=10 samplerate=0.9, it wouldn't give you 9 random reads from the whole file, but rather about 9 randomly chosen reads out of the first 10, which is still the same data. Maybe I'm wrong?
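To make that reading concrete, here is a rough awk model of "truncate first, then sample". It only imitates the behaviour the help text suggests (reads= limits INPUT reads before anything else happens) and is not how BBTools is implemented. The function name and its arguments are hypothetical, and it assumes plain 4-line FASTQ records.

```shell
# Model only: imitates reads=N followed by samplerate=P with awk, not BBTools itself.
truncate_then_sample() {
    # $1 = FASTQ file, $2 = input reads to consider (like reads=),
    # $3 = sampling rate (like samplerate=), $4 = seed (like sampleseed=)
    awk -v n="$2" -v p="$3" -v seed="$4" '
        BEGIN { srand(seed) }
        NR > 4 * n { exit }                  # stop after the first n records
        NR % 4 == 1 { keep = (rand() < p) }  # decide once per 4-line record
        keep { print }
    ' "$1"
}

# Usage (file name is a placeholder):
#   truncate_then_sample reads.fq 10 0.9 1
```

Under this model every read that comes out is one of the first 10 in the file, which is the concern raised above.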
Additionally, to check that the reads were random and not just the top 1M reads from the file, I checked the first few read names and they seem different:
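One way to run that kind of check, as a sketch: the two helpers below are hypothetical, the file names in the usage comments are placeholders, and both assume plain 4-line FASTQ records (no wrapped sequences).

```shell
# Hypothetical helper: print the names of the first N reads in a FASTQ file.
first_read_names() {
    # $1 = FASTQ file, $2 = how many read names to print (default 5)
    awk -v n="${2:-5}" 'NR % 4 == 1 { print; if (++c >= n) exit }' "$1"
}

# Hypothetical helper: count reads, assuming 4 lines per record.
count_reads() {
    awk 'END { print NR / 4 }' "$1"
}

# Usage (file names are placeholders):
#   count_reads subsampled.fq          # should match the requested target
#   first_read_names reads.fq          # first names in the original file
#   first_read_names subsampled.fq     # first names in the subsample
```

If the two name lists differ, the subsample is probably not just the top of the input file, although a random sample can still include some early reads by chance.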
Please don't delete posts after they have received a comment/answer.
Did my suggestion work? Or were you able to find parameters that work?