Question

Subsampling with BBtools by read number

1

Entering edit mode

5.2 years ago

dylan.lawrence ▴ 100

I already know that reformat.sh can be used to subsample but this is done by percentage. Is there a method to specify sub-sampling a fixed number of reads? I know the reads=# flag exists but this seems to occur before the subsampling. I know if worst comes to worst I can randomly subsample 99% of the file and then run a second round of reformat using reads= to get my fixed number, but a single step would be much easier.

bbtools subsampling • 4.5k views

ADD COMMENT • link updated 4.4 years ago by johnston.mike.j ▴ 40 • written 5.2 years ago by dylan.lawrence ▴ 100

0

Entering edit mode

As long as you set sampleseed the sampling should be deterministic but random.

Have you tried using a combination of that or samplerate along with reads?

Sampling parameters:

reads=-1                Set to a positive number to only process this many INPUT reads (or pairs), then quit.
skipreads=-1            Skip (discard) this many INPUT reads before processing the rest.
samplerate=1            Randomly output only this fraction of reads; 1 means sampling is disabled.
sampleseed=-1           Set to a positive number to use that prng seed for sampling (allowing deterministic sampling).
samplereadstarget=0     (srt) Exact number of OUTPUT reads (or pairs) desired.
samplebasestarget=0     (sbt) Exact number of OUTPUT bases desired.
                        Important: srt/sbt flags should not be used with stdin, samplerate, qtrim, minlength, or minavgquality.
upsample=f              Allow srt/sbt to upsample (duplicate reads) when the target is greater than input.
prioritizelength=f      If true, calculate a length threshold to reach the target, and retain all reads of at least that length (must set srt or sbt).

ADD REPLY • link 5.2 years ago by GenoMax 147k

0

Entering edit mode

So the deterministic part doesn't matter. My undertsanding is reads refers to how many are processed. So if you did reads=10 samplerate=0.9 it wouldn't give you 9 random reads from the file, but rather the 9 of the first 10 reads randomized which is still the same data. Maybe I'm wrong?

ADD REPLY • link 5.2 years ago by dylan.lawrence ▴ 100

2

Entering edit mode

Please don't delete posts after they have received a comment/answer.

Did my suggestion work? Or were you able to find parameters that work.

ADD REPLY • link 5.2 years ago by GenoMax 147k

score 2 · Answer 1 · 2020-07-15

The reformat.sh option samplereadstarget seems to be exactly what you want.

To subsample 1M reads from a file, I ran:

reformat.sh \
    in=input.fq \
    out=output.fq \
    samplereadstarget=1000000 \
    sampleseed=13

To check:

wc -l output.fq

4000000 -- as expected for a fastq file.

Additionally, to check that the reads were random and not just the top 1M reads from the file, I checked the first few read names and they seem different:

grep -m 5 "^@" input.fq

@NB501138:242:H7WV2BGXB:1:11101:20526:1063 1:N:0:AAGTTGAT @NB501138:242:H7WV2BGXB:1:11101:4613:1064 1:N:0:AAGTTGAT @NB501138:242:H7WV2BGXB:1:11101:10306:1066 1:N:0:AAGTTGAT @NB501138:242:H7WV2BGXB:1:11101:1896:1067 1:N:0:AAGTTGAT @NB501138:242:H7WV2BGXB:1:11101:16712:1068 1:N:0:AAGTTGAT

grep -m 5 "^@" output.fq

@NB501138:242:H7WV2BGXB:1:11101:25676:1087 1:N:0:AAGTTGAT @NB501138:242:H7WV2BGXB:1:11101:9954:1149 1:N:0:AAGTTGAT @NB501138:242:H7WV2BGXB:1:11101:22627:1297 1:N:0:AAGTTGAT @NB501138:242:H7WV2BGXB:1:11101:13619:2023 1:N:0:AAGTTGAT @NB501138:242:H7WV2BGXB:1:11101:1974:2172 1:N:0:AAGTTGAT