The general technique for sampling exactly k elements without replacement from a data stream is reservoir sampling. It is actually very simple. Here is a one-liner to randomly choose k lines from a text file with awk:
cat file.txt | awk -v k=10 '{y=x++<k?x-1:int(rand()*x);if(y<k)a[y]=$0}END{for(z in a)print a[z]}'
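For readability, here is the same logic spread over several lines with comments (functionally identical to the one-liner above):

awk -v k=10 '
{
    # x counts records seen so far; pick a slot for the current record:
    # the first k records fill slots 0..k-1, later records get a random slot in [0, x)
    y = x++ < k ? x - 1 : int(rand() * x)
    # keep the record only if its slot falls inside the reservoir of size k
    if (y < k) a[y] = $0
}
END {
    for (z in a) print a[z]
}' file.txt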
It only requires keeping at most k lines in memory, not the entire file. For fasta, you can use bioawk as the parser:
bioawk -c fastx -v k=10 'BEGIN{srand(100)}{y=x++<k?x-1:int(rand()*x);if(y<k)a[y]=">"$name"\n"$seq}END{for(z in a)print a[z]}' seq1.fa.gz > seq1-sub.fa
bioawk -c fastx -v k=10 'BEGIN{srand(100)}{y=x++<k?x-1:int(rand()*x);if(y<k)a[y]=">"$name"\n"$seq}END{for(z in a)print a[z]}' seq2.fa.gz > seq2-sub.fa
Remember to use an identical random seed so that the reads stay properly paired. It is also possible to parse fasta/q with bare awk instead of relying on bioawk.
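For instance, a sketch with plain awk on plain-text 4-line fastq records (no wrapped sequences; k and the seed are just examples, file names as in the seqtk commands below):

awk -v k=10 'BEGIN{srand(100)}
NR % 4 == 1 { rec = $0; next }      # header line starts a new record
            { rec = rec "\n" $0 }   # append the sequence, "+" and quality lines
NR % 4 == 0 {                       # record complete: do the reservoir step
    y = n++ < k ? n - 1 : int(rand() * n)
    if (y < k) a[y] = rec
}
END { for (i in a) print a[i] }' read1.fq > sub1.fq

Run the same command with the same seed on read2.fq to keep the mates in sync.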
A simpler and more efficient solution is to use seqtk, the answer I provided in the old post:
seqtk sample -s100 read1.fq 10000 > sub1.fq
seqtk sample -s100 read2.fq 10000 > sub2.fq
To sample 7.5 million sequences, you need about 7.5M*100bp*2=1.5GB of memory, plus the read names, which should be fine in your case.
When you need to sample even more sequences, you can keep only the record indices in memory and then extract the sequences with a second pass, or use the sorting-based approach from "Selecting Random Pairs From Fastq?", which is slower and needs working disk space but does not require much RAM.
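A sketch of the two-pass idea with plain awk (again assuming 4-line fastq records; k and the intermediate file name picked.idx are just examples). The first pass keeps only record indices in the reservoir; the second pass prints the selected records:

awk -v k=10000 'BEGIN{srand(100)}
NR % 4 == 1 {
    y = n++ < k ? n - 1 : int(rand() * n)
    if (y < k) a[y] = n             # store the record index, not the sequence
}
END { for (i in a) print a[i] }' read1.fq > picked.idx

awk 'NR == FNR { keep[$1]; next }   # pass 2: load the chosen record indices
     FNR % 4 == 1 { r++ }           # count records in the fastq
     r in keep' picked.idx read1.fq > sub1.fq

Running the second pass on read2.fq with the same picked.idx keeps the mates paired, and memory use is only k integers instead of k full records.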
EDIT: actually, in your case I would simply sample 25% (=7.5/30) of the reads. You will not get exactly 7.5 million, but this solution is much simpler and good enough in most cases. It is trivial to do that in a variety of ways, or you can simply use seqtk:
seqtk sample -s100 read1.fq 0.25 > sub1.fq
seqtk sample -s100 read2.fq 0.25 > sub2.fq
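For example, one of the many possible ways with plain awk (again assuming 4-line fastq records; the same seed and one rand() call per record keep the two files selecting the same pairs):

awk 'BEGIN{srand(100)} NR % 4 == 1 { keep = (rand() < 0.25) } keep' read1.fq > sub1.fq
awk 'BEGIN{srand(100)} NR % 4 == 1 { keep = (rand() < 0.25) } keep' read2.fq > sub2.fq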
Thank you so much. I really appreciate that :)
Thanks Heng, I learnt about bioawk and seqtk, which I had never heard of before!
OK, I just figured you're the seqtk developer...
... and I added the "bio" part in bioawk.