Is there any method to randomize the read order in a multi-Gbp fastq file?
Assuming you are talking about a single-end file, you can use awk to put each 4-line fastq entry on a single line. You then shuffle the lines with GNU shuf, sort -R (available in later versions of sort; if not, install GNU coreutils), or my shuffle tool in the filo package. The output is a shuffled stream with one line per fastq entry, so you need awk once more to restore the 4-line-per-entry format. The following should work.
awk '{OFS="\t"; getline seq; \
getline sep; \
getline qual; \
print $0,seq,sep,qual}' reads.fq | \
sort -R | \
awk '{OFS="\n"; print $1,$2,$3,$4}' \
> reads.shuffled.fq
You could extend this example for paired-end fastq by reading in two files at once with awk.
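One way the paired-end extension might look is sketched below. It assumes two uncompressed files with mates in the same order and no tab characters inside any line; the function name shuffle_pairs and its argument order are made up for illustration and are not part of any package. awk reads one record from the first file and the matching record from the second, so each pair travels through shuf as a single line.

```bash
# Sketch only: shuffle_pairs is a hypothetical helper, not an existing tool.
# Usage: shuffle_pairs in_R1.fq in_R2.fq out_R1.fq out_R2.fq
shuffle_pairs() {
    local r1=$1 r2=$2 out1=$3 out2=$4
    # Join each read pair (4 lines from each file) into one tab-separated line.
    awk -v mate="$r2" 'BEGIN { OFS = "\t" }
    {
        h1 = $0
        getline s1; getline p1; getline q1
        getline h2 < mate; getline s2 < mate
        getline p2 < mate; getline q2 < mate
        print h1, s1, p1, q1, h2, s2, p2, q2
    }' "$r1" |
    shuf |
    # Split on tabs only, so headers containing spaces stay intact.
    awk -F'\t' -v o1="$out1" -v o2="$out2" 'BEGIN { OFS = "\n" }
    {
        print $1, $2, $3, $4 > o1
        print $5, $6, $7, $8 > o2
    }'
}
```

Because both mates sit on the same line while shuf runs, the two output files come out in the same (random) order. Note that shuf holds its whole input in memory.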
I did a quick benchmark and found that shuf was MUCH faster than sort -R in my environment (Linux). I canceled the sort -R after 5 minutes or so...
langhorst@seq02-i:~$ time shuf /galaxy/galaxy-dist/database/files/020/dataset_20337.dat > /dev/null
real    0m4.552s
user    0m4.220s
sys     0m0.330s
langhorst@seq02-i:~$ time sort -R /galaxy/galaxy-dist/database/files/020/dataset_20337.dat > /dev/null
^C
real    5m23.051s
user    5m22.530s
sys     0m0.500s
I actually created an account just to comment and thank Brad Langhorst. shuf is MUCH MUCH faster than sort -R (Ubuntu 16.04). I had to shuffle 10 million reads, and after 15-20 minutes I stopped the script containing sort -R, changed that into shuf, and it was done in about 40 seconds (probably even a bit less).
I've been playing with Python, trying to solve the paired-end FastQ order randomisation.
After a very unsuccessful afternoon, I decided to try Bash. The Bash-based solution is simple and efficient:
paste <(zcat test.1.fq.gz) <(zcat test.2.fq.gz) | paste - - - - | shuf | awk -F'\t' '{OFS="\n"; print $1,$3,$5,$7 > "random.1.fq"; print $2,$4,$6,$8 > "random.2.fq"}'
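Since the command above writes two separate files, it can be worth confirming afterwards that the mates still line up. Here is a minimal sketch of such a check, assuming uncompressed output, no tabs in any line, and read IDs that are the first whitespace-delimited token of the header (optionally ending in /1 or /2). The name pairs_in_sync is hypothetical.

```bash
# Sketch only: pairs_in_sync is a hypothetical helper, not an existing tool.
# Returns 0 if both files list the same read IDs in the same order.
pairs_in_sync() {
    # Different line counts already mean the files cannot be paired.
    [ "$(wc -l < "$1")" -eq "$(wc -l < "$2")" ] || return 1
    # Walk both files side by side and compare IDs on every header line.
    paste "$1" "$2" | awk -F'\t' '
        NR % 4 == 1 {
            split($1, a, " "); split($2, b, " ")
            sub(/\/[12]$/, "", a[1]); sub(/\/[12]$/, "", b[1])
            if (a[1] != b[1]) { bad = 1; exit }
        }
        END { exit bad }'
}
```

For example, `pairs_in_sync random.1.fq random.2.fq && echo OK` prints OK only when every header pair matches.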
The most voted answer fails when headers contain spaces (which they might), so here is my solution (paste, shuf, tr, awk):
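A minimal sketch of a space-safe pipeline along those lines, for the single-end case: this is an assumption about how paste, shuf, and tr can be combined, not necessarily the poster's exact command, and the function name shuffle_fastq is made up. paste joins every four lines with tabs, so spaces in headers are never treated as separators; it does assume that no line contains a tab.

```bash
# Sketch only: shuffle_fastq is a hypothetical helper, not an existing tool.
# Usage: shuffle_fastq reads.fq > reads.shuffled.fq
shuffle_fastq() {
    # 4 lines -> 1 tab-separated line, shuffle the lines, then tabs -> newlines.
    paste - - - - < "$1" | shuf | tr '\t' '\n'
}
```

The difference from splitting with awk's default field separator is that a header such as "@read1 extra info" stays in one piece, because only tabs are converted back to newlines.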
BBTools' shuffle2.sh can randomize fastq files of arbitrary size, including paired twin files, keeping pairs together. shuffle.sh is there too, but it requires the input to fit in memory; shuffle2.sh writes temp files when the data won't fit into memory. It handles multiline fasta files too.
Hi Brian, I've tried your solution, but there is an issue I couldn't figure out. I gave it paired-end fastq.gz files as input and it produced two outputs; however, one is 70 GB and the other is empty. Here is the command I used:
shuffle2.sh -eoom -da -Xmx100G seed=123 ziplevel=2 in=test_R1_001.fastq.gz in2=test_R2_001.fastq.gz out=bbtest_r1.fq.gz out2=bbtest_r2.fq.gz
I know this post is a decade old, but I couldn’t get either solution to run. I have paired-end reads, and when I tried feeding both read files into Aaronquinlan’s solution, I just got empty files. I’m certain I haven’t altered the code correctly to accept paired-end read files.
When I tried Leszek’s solution (who also struggled to adapt Aaronquinlan’s for PE reads), it “thinks” for a while and ultimately just returns the command prompt without generating any files (or errors).
Googling suggests that many people split the files multiple times and recombine them (I think?) but this is not practical for me. I have concatenated about 2,000 reads onto 39M reads. I just want to shuffle them. Any suggestions for how to do this? I feel like this should be possible with a bash (especially awk + shuf) solution and I want to get that working. Thanks!