You could use sample:
$ sample -k 1000 -l 4 huge_file.fastq > 1k_sample.fq
This uses reservoir sampling on newline offsets to reduce memory overhead. If your system is thrashing for lack of memory, this will help a great deal compared with other tools, including GNU sort, which will often run out of memory on genome-scale inputs because, while it offers reservoir sampling, it stores the entire line in memory for every line of input. For genome-scale work, that is often simply too much data.
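To make the memory argument concrete, here is a rough two-pass sketch of the offset idea in plain awk. This is only an illustration, not how sample works internally (sample records byte offsets and seeks back into the file, while this toy version keeps record-start line numbers, and the names starts.txt and 1k_sample_sketch.fq are just placeholders): the first pass reservoir-samples k record-start positions, so only k small numbers sit in memory; the second pass re-reads the file and prints the chosen records.
$ awk -v k=1000 -v l=4 '
    BEGIN { srand(1234) }                              # fixed seed so the sketch is repeatable
    (FNR - 1) % l == 0 {                               # first line of each record
        n++
        if (n <= k) keep[n] = FNR                      # fill the reservoir
        else { i = int(rand() * n) + 1; if (i <= k) keep[i] = FNR }
    }
    END { m = (n < k ? n : k); for (i = 1; i <= m; i++) print keep[i] }
' huge_file.fastq > starts.txt
$ awk -v l=4 'NR == FNR { start[$1]; next }            # load sampled start lines
    FNR in start { c = l }                             # begin printing a record
    c > 0 { print; c-- }' starts.txt huge_file.fastq > 1k_sample_sketch.fq
The point is that the reservoir holds positions rather than sequence data, so memory use depends on k, not on read length or file size.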
Specify -k N for the number of samples N, and -l L for the number of lines L per record to sample. In the case of fastq, that would be four lines per record, so -l 4.
Add -d S to specify a seed value S, if desired; otherwise one is drawn from a Mersenne Twister PRNG.
Generating your own random seed and applying that same seed when sampling two fastq files will let you grab the same samples from two files of paired reads:
$ sample -k 1000 -l 4 -d 1234 first_pair.fastq > 1k_first_pair_sample.fq
$ sample -k 1000 -l 4 -d 1234 second_pair.fastq > 1k_second_pair_sample.fq
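One way to generate such a shared seed in bash is the shell's built-in $RANDOM (the SEED variable name is just a placeholder; any integer source will do):
$ SEED=$RANDOM
$ sample -k 1000 -l 4 -d "$SEED" first_pair.fastq > 1k_first_pair_sample.fq
$ sample -k 1000 -l 4 -d "$SEED" second_pair.fastq > 1k_second_pair_sample.fq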
If you have one interleaved file, then you could specify -l 8 to sample a pair of records from every eight lines:
$ sample -k 1000 -l 8 interleaved.fastq > 1k_paired_sample.fq
You can also add -r to sample with replacement; the default is to sample without replacement. Or add -p to print the sample in the order provided by the original input.
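For example, reusing the same input file from above (the output filenames are just illustrative):
$ sample -k 1000 -l 4 -r huge_file.fastq > 1k_sample_with_replacement.fq
$ sample -k 1000 -l 4 -p huge_file.fastq > 1k_sample_in_input_order.fq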
Do you expect your coverage to be even? Otherwise you might get into hot water by introducing biases when randomly sampling over the whole genome/region: Downsampling Bam Files

Not sure I follow - are you saying that random sampling might introduce bias that wasn't there already? This seems to go against the definition of 'random'.