Hello folks,
I happen to have a small problem which seemed to be trivial at first, but keeps me busy for a while now already. Maybe you can help anew, because most Biostars posts on this issue are quite outdated:
Problem:
I need to split 615M paired reads currently in two FastQ files (forward and reversed reads separate) into two file pairs with 308M reads.
Solution attempt A:
I unsuccessfully tried to use line count based tools like split
or awk
, but the wc -l
does not even count them correctly (line counts differ), although they are processed nicely by several tools. By googling, I found that a possible reason for this behavior might be that newline characters may occur in the quality scores, which thus hinder the core utilities.
Solution attempt B:
I tried to use my BBTools
, my favorite Swiss army knife of everything sequence related, but...
bbmap/reformat.sh in=... in2=... out=... out2=... reads=308000000
bbmap/reformat.sh in=... in2=... out=... out2=... skipreads=308000000
resulted in
Input is being processed as paired
Input: 615307122 reads 91326404105 bases
Output: 615307122 reads (100.00%) 91326404105 bases (100.00%)
respectively
Input is being processed as paired
Input: 615307122 reads 91326404105 bases
Output: 0 reads (0.00%) 0 bases (0.00%)
for the second command, such that I got a copy of my data in the first case and an empty file in the second.
Solution attempt C:
Here on Biostars, somebody had suggested famas
for a similar problem, so I installed and tried it:
famas --in=... --in2=... --out=...XXXXXX.fq.gz --out2=...XXXXXX.fq.gz -x 308000000
However, the program flooded to output directory with thousands of subfiles files (instead of the actually needed two files each) until the file system couldn't cope with the number of open files anymore and ran out of file descriptors.
ERROR(famas.c|open_output_one:1056): Couldn't open =...compressed.065534.fq.gz
ERROR(famas.c|main:1163): Couldn't open output files. Exiting...
In case somebody is wondering why I mixed the long and short notation here: the parameter --split-every=308000000 was not recognized (unknown parameter).
Since it took me a while to clean that mess up again, I am somewhat reluctant to try out more. Does sombeody have any ideas where I screwed up the commands or has suggestions which tools work better?
Thanks a lot for reading, pondering and help!
Matthias
For
reformat.sh
the deinterleave (split) syntax is:While the interleave (combine) syntax is:
Note the difference between (
in
vsin1/in2
).I think you confused and therefore combined the syntax of interleave/deinterleave.
Also there's the option of
seqkit
(which I absolutely love). It hasseqkit grep
andseqkit split2
which I believe both support fastq files. It allows for seemless pipelining between standard bash functions and seqkit specific functions.edit: I think I miss read the question! oops
Thank you very much for your response! Indeed,
seqkit
seems to be the solution I was looking for. Currently it is still running, but the output looks quite promising. Thebbmap
thing seems to be a glitch since also counting reads while deinterleaving a temporarily interleaved file didn't work.What's wrong with a really stupid
zcat | head -n 1232000000
The request was to combine and split by F/R reads.