Question

Split a FastQ file into smaller files

1

Entering edit mode

4.3 years ago

Matthias Zepper 5.0k

Hello folks,

I happen to have a small problem which seemed to be trivial at first, but keeps me busy for a while now already. Maybe you can help anew, because most Biostars posts on this issue are quite outdated:

Problem:

I need to split 615M paired reads currently in two FastQ files (forward and reversed reads separate) into two file pairs with 308M reads.

Solution attempt A:

I unsuccessfully tried to use line count based tools like split or awk, but the wc -l does not even count them correctly (line counts differ), although they are processed nicely by several tools. By googling, I found that a possible reason for this behavior might be that newline characters may occur in the quality scores, which thus hinder the core utilities.

Solution attempt B:

I tried to use my BBTools, my favorite Swiss army knife of everything sequence related, but...

bbmap/reformat.sh  in=...  in2=... out=... out2=... reads=308000000
bbmap/reformat.sh  in=...  in2=... out=... out2=... skipreads=308000000

resulted in

Input is being processed as paired
Input:                          615307122 reads                 91326404105 bases
Output:                         615307122 reads (100.00%)       91326404105 bases (100.00%)

respectively

Input is being processed as paired
Input:                          615307122 reads                 91326404105 bases
Output:                         0 reads (0.00%)         0 bases (0.00%)

for the second command, such that I got a copy of my data in the first case and an empty file in the second.

Solution attempt C:

Here on Biostars, somebody had suggested famas for a similar problem, so I installed and tried it:

famas --in=...  --in2=... --out=...XXXXXX.fq.gz  --out2=...XXXXXX.fq.gz  -x 308000000

However, the program flooded to output directory with thousands of subfiles files (instead of the actually needed two files each) until the file system couldn't cope with the number of open files anymore and ran out of file descriptors.

ERROR(famas.c|open_output_one:1056): Couldn't open =...compressed.065534.fq.gz
ERROR(famas.c|main:1163): Couldn't open output files. Exiting...

In case somebody is wondering why I mixed the long and short notation here: the parameter --split-every=308000000 was not recognized (unknown parameter).

Since it took me a while to clean that mess up again, I am somewhat reluctant to try out more. Does sombeody have any ideas where I screwed up the commands or has suggestions which tools work better?

Thanks a lot for reading, pondering and help!

Matthias

FastQ bbmap famas awk • 5.1k views

ADD COMMENT • link updated 4.3 years ago by GenoMax 147k • written 4.3 years ago by Matthias Zepper 5.0k

2

Entering edit mode

For reformat.sh the deinterleave (split) syntax is:

reformat.sh in=reads.fq out1=read1.fq out2=read2.fq

While the interleave (combine) syntax is:

reformat.sh in1=read1.fq in2=read2.fq out=reads.fq

Note the difference between (in vs in1/in2).

I think you confused and therefore combined the syntax of interleave/deinterleave.

Also there's the option of seqkit (which I absolutely love). It has seqkit grep and seqkit split2 which I believe both support fastq files. It allows for seemless pipelining between standard bash functions and seqkit specific functions.

edit: I think I miss read the question! oops

ADD REPLY • link 4.3 years ago by Mark ★ 1.6k

0

Entering edit mode

Thank you very much for your response! Indeed, seqkit seems to be the solution I was looking for. Currently it is still running, but the output looks quite promising. The bbmap thing seems to be a glitch since also counting reads while deinterleaving a temporarily interleaved file didn't work.

ADD REPLY • link 4.3 years ago by Matthias Zepper 5.0k

0

Entering edit mode

What's wrong with a really stupid zcat | head -n 1232000000

ADD REPLY • link 4.3 years ago by swbarnes2 14k

0

Entering edit mode

The request was to combine and split by F/R reads.

ADD REPLY • link 4.3 years ago by Mark ★ 1.6k

2

Entering edit mode

4.3 years ago

Mensur Dlakic ★ 28k

First combine your reads:

bbmap/reformat.sh in=reads1.fastq in2=reads2.fastq out=paired.fastq

Then split them:

bbmap/reformat.sh in=paired.fastq out=paired_part1.fastq reads=308000000
bbmap/reformat.sh in=paired.fastq out=paired_part2.fastq skipreads=308000000

ADD COMMENT • link 4.3 years ago by Mensur Dlakic ★ 28k

0

Entering edit mode

I tried your solution to generate an interleaved file temporarily and deinterleave it afterwards again with the read numbers. However, while the interleaving worked nicely, the deinterleaving again produced a file containing all reads and an empty file as before. So I could convert the formats but not split it into smaller files. Nonetheless, thank you very much for your help!

ADD REPLY • link 4.3 years ago by Matthias Zepper 5.0k

0

Entering edit mode

4.3 years ago

GenoMax 147k

I am not sure why this did not work for you. I started with a 1000 PE reads. I am using BBMap v. 38.82.

$ reformat.sh in1=test1000.R1.fq.gz in2=test1000.R2.fq.gz out1=rest1000.R1.fq.gz out2=rest1000.R2.fq.gz reads=151

Set INTERLEAVED to false
Input is being processed as paired
Input:                          302 reads               75500 bases
Output:                         302 reads (100.00%)     75500 bases (100.00%)

$ reformat.sh in1=test1000.R1.fq.gz in2=test1000.R2.fq.gz out1=rest_2.R1.fq.gz out2=rest_2.R2.fq.gz skipreads=151 

Set INTERLEAVED to false
Input is being processed as paired
Input:                          2000 reads              500000 bases
Output:                         1698 reads (84.90%)     424500 bases (84.90%)

ADD COMMENT • link 4.3 years ago by GenoMax 147k

score 4 · Accepted Answer · 2020-07-26

4

Entering edit mode

4.3 years ago

cpad0112 21k

with seqkit from the manual:

$ seqkit split2 -1 reads_1.fq.gz -2 reads_2.fq.gz -p 2 -O out

out is output directory, reads_1 is R1 and reads_2 is R2. This function is to break the file into two parts. try -s option for sequence number based split.