Each of my FASTQ files contains about 20M reads, and I need to split these big files into chunks of 1M reads.
Is there an existing tool that can do this?
My thought is to just count the lines and write out a chunk after every 1M reads. But how can I do that with Python?
Thanks.
Edit: my input FASTQ file is actually gzip-compressed (.gz).
I tried
split -l 4000000 XXX.recal.fastq.gz prefix
However, I just got a single prefix-aa file that is exactly the same size as the input. Is that because the file is gzipped, so its lines cannot be counted?
When I tried
split -b 46m XXX.recal.fastq.gz prefix
it works well! The fastq.gz is successfully split into several smaller fastq.gz files.
So why can't we use the -l 4000000 option?
Thanks.
Another question: the split command only takes a "prefix" argument. Is there a suffix option? (I can only find a suffix-length option.)
With a prefix the output is named XXX.fastq.gz-ab, which breaks the .gz extension.
I want something like XXX_1.fastq.gz (changing the suffix). How can I do that?
Thanks.
split will only work line by line on text (not on gzipped files), and with -b 46m you will likely get truncated records at the end of each file. You can use:
zless recal.fastq.gz | split -l 4000000 prefix
to get a bunch of uncompressed files.
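Since the question also asks about Python and about output names like XXX_1.fastq.gz, here is a minimal sketch of the same idea that keeps the chunks gzipped. It assumes every record is exactly four lines (sequence and quality not wrapped). The function name split_fastq_gz and the out_prefix naming scheme are made up for illustration and do not come from any existing tool.

```python
import gzip


def split_fastq_gz(in_path, out_prefix, reads_per_chunk=1_000_000):
    """Split a gzipped FASTQ into gzipped chunks and return the chunk paths.

    Assumption: each record is exactly 4 lines (no wrapped sequence/quality).
    """
    lines_per_chunk = 4 * reads_per_chunk
    out_paths = []
    dst = None
    try:
        # "rt" decompresses on the fly and yields text lines.
        with gzip.open(in_path, "rt") as src:
            for i, line in enumerate(src):
                if i % lines_per_chunk == 0:
                    # Start a new chunk: close the previous one, open the next.
                    if dst is not None:
                        dst.close()
                    out_path = f"{out_prefix}_{len(out_paths) + 1}.fastq.gz"
                    out_paths.append(out_path)
                    dst = gzip.open(out_path, "wt")
                dst.write(line)
    finally:
        if dst is not None:
            dst.close()
    return out_paths
```

Calling split_fastq_gz("XXX.recal.fastq.gz", "XXX") would write XXX_1.fastq.gz, XXX_2.fastq.gz, and so on. Lines are streamed one at a time, so memory use stays small even for 20M reads.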
Hi brentp, I tried zless...|split -l 4000000 prefix. The error is: cannot open `prefix' for reading: No such file or directory. It seems prefix is treated as the input here.
My bad, you need '-' to tell it to read from stdin: zless...|split -l 4000000 - prefix
Yeah, it works well, until it doesn't ;) As I wrote, you will always get your file split, but the results are not necessarily valid FASTQ files. Good that it works for you; just check that every generated file starts with a '@'. It all depends on the sequence lines not being wrapped (not containing newlines), which the format definition allows.
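Checking only the first character can be fooled, because '@' is also a legal quality character. A slightly stricter check can be sketched in Python as below. It assumes the unwrapped four-line layout, so a perfectly legal but wrapped FASTQ would be reported as invalid. The function name is_four_line_fastq is made up for illustration.

```python
import gzip
from itertools import zip_longest


def is_four_line_fastq(path):
    """Return True if every record in the file looks like a 4-line FASTQ record."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as fh:
        # Read four lines at a time; a short final group is padded with None.
        for header, seq, plus, qual in zip_longest(*[fh] * 4):
            if qual is None:
                return False  # file ends in the middle of a record
            if not header.startswith("@") or not plus.startswith("+"):
                return False
            # Sequence and quality strings must have the same length.
            if len(seq.rstrip("\r\n")) != len(qual.rstrip("\r\n")):
                return False
    return True
```

Running this over each chunk produced by split should flag files that start or end in the middle of a record.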
Sorry, it doesn't work... The error is: "split: invalid option -- E"
My advice is to paste the entire command you're using into your question; then let @Jeremy update his answer, since that's the clearest solution.