Randomize Read Order in a Multi-Gbp FASTQ File?
4
3
13.5 years ago

Is there any method to randomize the read order in a multi-Gbp fastq file?

fastq • 8.3k views
0

I know this post is a decade old, but I couldn’t get either solution to run. I have paired-end reads, and when I tried feeding both reads in Aaronquinlan’s solution, I just get empty files. I’m certain I haven’t altered the code correctly to accept paired-end read files.

When I tried Leszek’s solution (who also struggled to adapt Aaronquinlan’s for PE reads), it “thinks” for a while and ultimately just returns to the command prompt without generating any files (or errors).

Googling suggests that many people split the files multiple times and recombine them (I think?) but this is not practical for me. I have concatenated about 2,000 reads onto 39M reads. I just want to shuffle them. Any suggestions for how to do this? I feel like this should be possible with a bash (especially awk + shuf) solution and I want to get that working. Thanks!

11
13.5 years ago

Assuming you are talking about a single-end file, you can use awk to put each 4-line FASTQ entry on a single line. You can then shuffle those lines with GNU shuf, sort -R (available in later versions of GNU sort; if yours lacks it, install a recent GNU coreutils), or my shuffle tool in the filo package. The output is a shuffled stream with one line per FASTQ entry, so you need awk once more to restore the 4-line-per-entry layout. The following should work.

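# Join each 4-line record into one tab-separated line, shuffle the lines,
# then split every line back into 4 lines.
# Caveat: the final awk splits on any whitespace, so headers containing
# spaces get broken up; adding -F'\t' to that awk splits on tabs only.
awk '{OFS="\t"; getline seq; getline sep; getline qual;
      print $0,seq,sep,qual}' reads.fq | \
sort -R | \
awk '{OFS="\n"; print $1,$2,$3,$4}' \
> reads.shuffled.fq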

You could extend this example for paired-end fastq by reading in two files at once with awk.
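
One way the paired-end extension could look is sketched below. The function name `shuffle_pairs` and its argument order are made up for illustration, and it assumes uncompressed input, records of exactly 4 lines with no tabs inside them, and GNU shuf on the PATH. Each pair travels through shuf as a single 8-field line, so mates cannot drift apart.

```shell
# Hypothetical helper (not from the thread): shuffle two mate files in sync.
# Usage: shuffle_pairs R1.fq R2.fq out_R1.fq out_R2.fq
shuffle_pairs() {
    r1=$1; r2=$2; out1=$3; out2=$4
    # Create empty outputs up front so they exist even for empty input.
    : > "$out1"; : > "$out2"
    # Read 4 lines from R1 (main input) and 4 lines from R2 (via getline),
    # then emit the pair as one tab-separated line with 8 fields.
    awk -v mate="$r2" 'BEGIN { OFS = "\t" }
    {
        h1 = $0; getline s1; getline p1; getline q1
        getline h2 < mate; getline s2 < mate
        getline p2 < mate; getline q2 < mate
        print h1, s1, p1, q1, h2, s2, p2, q2
    }' "$r1" |
    shuf |
    # Split on tabs only, so headers containing spaces stay intact.
    awk -F'\t' -v o1="$out1" -v o2="$out2" 'BEGIN { OFS = "\n" }
    {
        print $1, $2, $3, $4 > o1
        print $5, $6, $7, $8 > o2
    }'
}
```

For example: `shuffle_pairs reads_1.fq reads_2.fq shuffled_1.fq shuffled_2.fq`. Note that shuf holds its whole input in memory, so this only suits data that fits in RAM.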

1

shuf is listed in the answer. i find that systems that lack sort -R also lack shuf.

1

I did a quick benchmark and found that shuf was MUCH faster than sort -R in my environment (linux). I canceled the sort -R after 5 minutes or so...

langhorst@seq02-i:~$ time shuf /galaxy/galaxy-dist/database/files/020/dataset_20337.dat > /dev/null

real    0m4.552s
user    0m4.220s
sys     0m0.330s

langhorst@seq02-i:~$ time sort -R /galaxy/galaxy-dist/database/files/020/dataset_20337.dat > /dev/null
^C

real    5m23.051s
user    5m22.530s
sys     0m0.500s

0

I actually created an account just to comment and thank Brad Langhorst.

shuf is MUCH MUCH faster than sort -R (Ubuntu 16.04). I had to shuffle 10 million reads, and after 15-20 minutes I stopped the script containing sort -R, changed that into shuf and it was done in about 40 seconds (probably even a bit less).

0

thanks, that's a neat solution. the farm I am using doesn't have sort -R or shuf, so I'll try and see if a more modern version can be locally installed.
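
For a machine with neither sort -R nor shuf, a possible fallback is to prefix every record with a random key, sort on that key with plain sort, and strip the key again. The sketch below uses only awk, sort, cut and tr. The function name `shuffle_fastq_portable` is hypothetical, and it assumes 4-line records with no tab characters inside them.

```shell
# Hypothetical helper (not from the thread): shuffle FASTQ from stdin to
# stdout without shuf or sort -R (decorate, sort, undecorate).
shuffle_fastq_portable() {
    awk 'BEGIN { srand() }   # seeded from the clock: same second, same order
    {
        h = $0; getline s; getline p; getline q
        # Fixed-width random key so a plain byte-wise sort orders it correctly.
        printf "%.17f\t%s\t%s\t%s\t%s\n", rand(), h, s, p, q
    }' |
    LC_ALL=C sort -t "$(printf '\t')" -k1,1 |
    cut -f2- |               # drop the random key
    tr '\t' '\n'             # back to 4 lines per record
}
```

Usage would be `shuffle_fastq_portable < reads.fq > reads.shuffled.fq`. Because sort spills to temporary files when the data exceeds its buffer, this approach is not limited by RAM the way shuf is, at the cost of being slower.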

0

I see. I also have a "shuffle" program in the filo package. The downside of that tool is that it reads all of the records into memory.

0

if sort -R isn't available, you can try the 'shuf' command

0

I locally installed a modern coreutils following instructions here

5
10.2 years ago
Leszek 4.2k

I've been playing with Python, trying to solve the paired-end FASTQ order randomisation.

After a very unsuccessful afternoon, I decided to try Bash. The Bash-based solution is simple and efficient:

paste <(zcat test.1.fq.gz) <(zcat test.2.fq.gz) | paste - - - - | shuf | awk -F'\t' '{OFS="\n"; print $1,$3,$5,$7 > "random.1.fq"; print $2,$4,$6,$8 > "random.2.fq"}'
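
After any paired shuffle it is worth confirming that the two output files still list the same reads in the same order. The check below is a sketch under assumptions: the function name `pairs_in_sync` is hypothetical, the files are uncompressed with 4 lines per record, and mates share a read ID up to the first space or an old-style /1 and /2 suffix.

```shell
# Hypothetical helper (not from the thread): exit 0 only if both files
# carry matching read IDs record by record.
# Usage: pairs_in_sync random.1.fq random.2.fq
pairs_in_sync() {
    paste "$1" "$2" | awk -F'\t' '
    NR % 4 == 1 {                  # header lines only
        id1 = $1; id2 = $2
        sub(/ .*/, "", id1); sub(/ .*/, "", id2)          # drop comment after first space
        sub(/\/[12]$/, "", id1); sub(/\/[12]$/, "", id2)  # drop /1 and /2 suffixes
        if (id1 != id2) { bad = 1; exit }
    }
    END { exit bad + 0 }'
}
```

A call such as `pairs_in_sync random.1.fq random.2.fq && echo "pairs OK"` would then report success. A mismatch in file length also fails the check, because paste pads the shorter file with empty fields.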

2
13 months ago
Artyom ▴ 20

The most voted answer fails when headers contain spaces (which they might), so here is my solution (paste, shuf, tr, awk):
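
A minimal sketch of a space-safe single-end variant, built from some of the tools named above, is given here. It is an illustration only and not necessarily the code this answer originally contained. The function name `shuffle_fastq` is hypothetical, and it assumes 4-line records, no tab characters inside any line, and GNU shuf. Records are joined and split on tabs only, so spaces in headers pass through untouched.

```shell
# Hypothetical helper (not from the thread): shuffle single-end FASTQ
# from stdin to stdout, keeping headers with spaces intact.
shuffle_fastq() {
    paste - - - - |   # join each 4-line record into one tab-separated line
    shuf |            # randomize the order of the records
    tr '\t' '\n'      # restore 4 lines per record
}
```

Usage would be `shuffle_fastq < reads.fq > reads.shuffled.fq`.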

2
13 months ago

BBTools' shuffle2.sh can randomize fastq files of arbitrary size, including paired twin files, keeping pairs together. shuffle.sh is there too, but it requires the input to fit in memory; shuffle2.sh will write temp files when the data won't fit into memory. It handles multiline fasta files too.

0

Hi Brian, I've tried your solution, but there is an issue I couldn't figure out. I gave it paired-end fastq.gz files as input, and it returned two outputs. However, one is 70GB and the other is empty. Here is the command I used:

shuffle2.sh -eoom -da -Xmx100G seed=123 ziplevel=2 in=test_R1_001.fastq.gz in2=test_R2_001.fastq.gz out=bbtest_r1.fq.gz out2=bbtest_r2.fq.gz

0

Looks like a bug, I'll investigate. What is your input file size?

0

One is 37G, another is 40G.
