reformat.sh in1=file_R1.fq.gz in2=file_R2.fq.gz out1=Sampled_R1.fq.gz out2=Sampled_R2.fq.gz parameters_below
Sampling parameters:
reads=-1 Set to a positive number to only process this many INPUT reads (or pairs), then quit.
skipreads=-1 Skip (discard) this many INPUT reads before processing the rest.
samplerate=1 Randomly output only this fraction of reads; 1 means sampling is disabled.
sampleseed=-1 Set to a positive number to use that prng seed for sampling (allowing deterministic sampling).
samplereadstarget=0 (srt) Exact number of OUTPUT reads (or pairs) desired.
samplebasestarget=0 (sbt) Exact number of OUTPUT bases desired.
Important: srt/sbt flags should not be used with stdin, samplerate, qtrim, minlength, or minavgquality.
"reads=100" means it will sample from 100 reads. So if you set "reads=100 samplerate=0.5" you'll get approximately 50 reads, sampled from the first 100 reads in the file. If you set "samplereadstarget=100" instead, you will get exactly 100 reads, sampled from the full file.
Note that if your reads are paired, these will be the number of PAIRS you get, so the number of reads would be twice that.
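To make the difference between the two modes concrete, here is a rough awk sketch that mimics them on a made-up file. It is not how reformat.sh is implemented; the file, the seed 1234, and the variable names are all assumptions for illustration. The first command only looks at the first 100 records and keeps each with probability 0.5, so the count is approximate. The second counts the records first, then keeps each one with probability (still needed)/(still remaining), which gives exactly 100 records drawn from the whole file.

```shell
# Hypothetical input: 1000 single-end reads in a temporary FASTQ file.
fq=$(mktemp)
for i in $(seq 1 1000); do
  printf '@read%d\nACGT\n+\nIIII\n' "$i"
done > "$fq"

# Analogue of "reads=100 samplerate=0.5": only the first 100 records are
# considered, and each is kept with probability 0.5 (count is approximate).
awk -v limit=100 -v rate=0.5 -v seed=1234 '
  BEGIN { srand(seed) }
  NR > 4 * limit { exit }
  NR % 4 == 1 { keep = (rand() < rate) }
  keep' "$fq" > "$fq.rate"

# Analogue of "samplereadstarget=100": count the records first, then keep
# each one with probability (still needed) / (still remaining), which
# yields exactly 100 records drawn from the whole file.
total=$(( $(wc -l < "$fq") / 4 ))
awk -v need=100 -v left="$total" -v seed=1234 '
  BEGIN { srand(seed) }
  NR % 4 == 1 { keep = (rand() * left < need); left--; if (keep) need-- }
  keep' "$fq" > "$fq.exact"

echo "rate-based:   $(( $(wc -l < "$fq.rate") / 4 )) reads"
echo "target-based: $(( $(wc -l < "$fq.exact") / 4 )) reads"
```

The exact-count version needs the total number of records before it starts, so in this sketch it cannot work on a stream that can only be read once.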
With 2 files you can use in1= and in2= instead. There's also "samplereadstarget=X" if you want a specific number of reads (for paired data, it will give you that number of pairs).
Ummm, won't "sort -R" completely screw everything up (and be slow and use a LOT of memory)? Define "a long time"; you're mostly limited by IO and gzip.
Long time: more than 2 hours.
I found these commands. However, what is -s (seed)? And is it sampling randomly?
Thank you in advance.
Apply a seed to extract the same reads from two, paired end files:
seqtk sample -s 10 /ebs/ecoli/SRR001666_1.fastq.gz 1000 > SRR001666_1_1000.fastq
seqtk sample -s 10 /ebs/ecoli/SRR001666_2.fastq.gz 1000 > SRR001666_2_1000.fastq
In this particular case the same seed is needed to keep the pairing of reads (i.e. extract the same read pair from two files).
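If you want to convince yourself that the same seed keeps the pairs together, here is a small self-contained sketch. It does not call seqtk and is not seqtk's algorithm; it uses awk to keep each record with a fixed probability. The point is only that two files with the same number of records, sampled with the same seed, get the same keep/drop decision at every position. The file names, the sample_fastq helper, and the 20 toy read pairs are made up for the example.

```shell
# Build two tiny paired FASTQ files (hypothetical data: 20 pairs).
tmp=$(mktemp -d)
for i in $(seq 1 20); do
  printf '@read%d/1\nACGT\n+\nIIII\n' "$i" >> "$tmp/R1.fq"
  printf '@read%d/2\nTGCA\n+\nIIII\n' "$i" >> "$tmp/R2.fq"
done

# Keep each 4-line record with probability RATE, using a fixed seed.
sample_fastq() {  # usage: sample_fastq SEED RATE FILE
  awk -v seed="$1" -v rate="$2" '
    BEGIN { srand(seed) }
    NR % 4 == 1 { keep = (rand() < rate) }  # one random draw per record
    keep' "$3"
}

sample_fastq 10 0.5 "$tmp/R1.fq" > "$tmp/S1.fq"
sample_fastq 10 0.5 "$tmp/R2.fq" > "$tmp/S2.fq"

# Read names (without the /1 or /2 suffix) should match in both outputs.
names1=$(awk 'NR % 4 == 1 { sub(/\/[12]$/, ""); print }' "$tmp/S1.fq")
names2=$(awk 'NR % 4 == 1 { sub(/\/[12]$/, ""); print }' "$tmp/S2.fq")
[ "$names1" = "$names2" ] && echo "pairs in sync"
```

With different seeds for the two files, the two name lists would generally differ and the pairing would be lost.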
Thank you. But what does the number after -s represent?
It's a random seed. It doesn't matter what number you use, you just have to use the same one (I usually use 1234 for things like this).
Got it. thank you very much.
Worked perfectly for me. Thank you, everyone, for responses.