Question

Resampling Fastq Sequences Without Replacement

1

Entering edit mode

11.7 years ago

prtimilsina ▴ 80

Hello, I want to extract a random sample (without replacement) of 7.5 million fastq sequences from illumina sequencing data that contains about 30 million sequences each in of the reads. I want to extract the same sequence from each of the two files ( e.g., if I extracted sequence no. 31 from read 1, I would want to extract the same sequence from read 2 also). How can I do this? Is there a script or module I can use? Any help would be appreciated. Thanks

fastq paired-end • 5.2k views

ADD COMMENT • link updated 11.7 years ago by lh3 33k • written 11.7 years ago by prtimilsina ▴ 80

score 7 · Answer 1 · 2013-03-15

The general technique for sampling exact k elements without replacement on a data stream is reservoir sampling. It is actually very simple. Here is a one-liner to randomly choose k lines from a text file with awk:

cat file.txt | awk -v k=10 '{y=x++<k?x-1:int(rand()*x);if(y<k)a[y]=$0}END{for(z in a)print a[z]}'

It requires to keep at most k lines in memory, not the entire file. For fasta, you can use bioawk as the parser:

awk -c fastx -v k=10 'BEGIN{srand(100)}{y=x++<k?x-1:int(rand()*x);if(y<k)a[y]=">"$name"\n"$seq}END{for(z in a)print a[z]}' seq1.fa.gz > seq1-sub.fa
awk -c fastx -v k=10 'BEGIN{srand(100)}{y=x++<k?x-1:int(rand()*x);if(y<k)a[y]=">"$name"\n"$seq}END{for(z in a)print a[z]}' seq2.fa.gz > seq2-sub.fa

Remember to use identical random seed such as reads are always paired. It is also possible to parse fasta/q with bare awk instead of relying on bioawk.

A simpler and more efficient solution is to use seqtk, the answer I provided in the old post:

seqtk sample -s100 read1.fq 10000 > sub1.fq
seqtk sample -s100 read2.fq 10000 > sub2.fq

To sample 7.5 million sequences, you need about 7.5M*100bp*2=1.5GB memory plus read names, which should be fine in your case.

When you need to sample even more sequences, you can keep the record indices in memory and then extract sequences with a second pass, or use Selecting Random Pairs From Fastq? with sorting, which is slower and needs working disk space but does not require much RAM.

EDIT: actually in your case, I would simply sample 25% (=7.5/30) of reads. You do not get exactly 7.5 million, but the solution is much simpler and sufficiently good in most cases. It is trivial to do that in a variety of ways, or simply use seqtk:

seqtk sample -s100 read1.fq 0.25 > sub1.fq
seqtk sample -s100 read2.fq 0.25 > sub2.fq

score 2 · Answer 2 · 2013-03-15

2

Entering edit mode

11.7 years ago

Josh Herr 5.8k

In addition to the 'Selecting Random Pairs From Fastq?' question, this question, 'Choosing Random Set Of Seqs From Larger Set', helped me in the past.

ADD COMMENT • link 11.7 years ago by Josh Herr 5.8k

0

Entering edit mode

Thank you, Josh.

ADD REPLY • link 11.7 years ago by prtimilsina ▴ 80

0

Entering edit mode

My pleasure! Best of luck to you!

ADD REPLY • link 11.7 years ago by Josh Herr 5.8k

score 1 · Answer 3 · 2013-03-15

1

Entering edit mode

11.7 years ago

ananta.acharya ▴ 20

Selecting Random Pairs From Fastq?

ADD COMMENT • link 11.7 years ago by ananta.acharya ▴ 20

0

Entering edit mode

Thank you Ananta.

ADD REPLY • link 11.7 years ago by prtimilsina ▴ 80