There is no really good way to do this directly in R, but there are two workable options.
Option 1: the good old Unix tool split will help. You could also use readLines() and writeLines() in R to read a fixed number of lines and write them to a temporary file, but this is not very efficient.
This relies on each FASTQ record having exactly 4 lines, with no empty lines in between (if you have empty lines, remove them first):
split -l4000000 input.fastq split.fastq.
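This splits the input file into FASTQ files with at most 1,000,000 records each (4,000,000 lines at 4 lines per record) and names the output files split.fastq.aa, split.fastq.ab, ..., split.fastq.zz.
Then read the files in sequentially using readFastq() from the ShortRead package.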
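If split is not available (e.g. on Windows), the same idea can be sketched in a few lines of Python. This is only a minimal sketch under the same assumption as above, namely exactly 4 lines per record and no wrapped or empty lines; the function name split_fastq and its parameters out_prefix and records_per_chunk are made up for this illustration.

```python
import itertools


def split_fastq(path, out_prefix, records_per_chunk=1_000_000):
    """Split a strictly 4-line-per-record FASTQ file into numbered chunks.

    Hypothetical helper for illustration; returns the chunk file names.
    """
    lines_per_chunk = 4 * records_per_chunk
    written = []
    with open(path) as handle:
        for index in itertools.count():
            # Lazily take the next block of lines from the input file.
            chunk = itertools.islice(handle, lines_per_chunk)
            first = next(chunk, None)
            if first is None:
                break  # input exhausted, do not create an empty chunk
            chunk_path = f"{out_prefix}{index:04d}.fastq"
            with open(chunk_path, "w") as out:
                out.write(first)
                out.writelines(chunk)  # streams the rest of the block
            written.append(chunk_path)
    return written
```

Calling split_fastq("input.fastq", "split.") would write split.0000.fastq, split.0001.fastq and so on, which can then be read one after another just like the output of split.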
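Option 2: use an R function from the Biostrings package. Check out this:
read.DNAStringSet(filepath, format="fastq",
nrec=-1L, skip=0L, use.names=TRUE)
With the parameters nrec (number of records to read) and skip (number of records to skip first) you can read the file in batches. In newer Biostrings releases the function is called readDNAStringSet(). However, as used here this function ignores the quality values.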
So, if you need the qualities, use option 1. If you only need the sequences, option 2 might be OK.
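To make clear what nrec and skip mean when reading in batches, here is a rough Python analogue. It is not how Biostrings implements it, only a sketch of the semantics, and like option 1 it assumes 4 lines per record. The name read_sequences is hypothetical; like option 2 it returns the sequences only and drops the qualities.

```python
import itertools


def read_sequences(path, nrec=-1, skip=0):
    """Return sequence strings from a 4-line-per-record FASTQ file.

    Skips the first `skip` records, then reads at most `nrec` records
    (a negative nrec means "read everything that is left").
    """
    start = 4 * skip + 1  # 0-based line index of the first wanted sequence
    stop = None if nrec < 0 else 4 * (skip + nrec)
    with open(path) as handle:
        # Every 4th line from `start` on is a sequence line.
        wanted = itertools.islice(handle, start, stop, 4)
        return [line.rstrip("\n") for line in wanted]
```

Reading a big file in batches of one million would then be read_sequences(path, nrec=1000000, skip=0), followed by skip=1000000, skip=2000000 and so on until an empty list comes back. Note that each call rescans the file from the start, so this is for illustration rather than speed.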
Edit:
There is a 'format' definition, for those doubting option 1 is valid:
<fastq> := <block>+
<block> := @<seqname>\n<seq>\n+[<seqname>]\n<qual>\n
<seqname> := [A-Za-z0-9_.:-]+
<seq> := [A-Za-z\n\.~]+
<qual> := [!-~\n]+
http://maq.sourceforge.net/fastq.shtml
While wrapping sequences with \n should in fact be punished hard, according to this definition it is unfortunately possible, though I have never seen such a file. So option 1 is situational: it depends on the file not containing wrapped sequences.
Another argument against calling FASTQ a real 'format': with wrapped lines it is no longer easy to parse, because the quality string can itself contain @ or +. So even if wrapping is allowed, it has to be considered bad practice in FASTQ (unlike in FASTA).
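If you are unsure whether your file is wrapped, you can check before splitting. The following is a heuristic sketch, not a full FASTQ validator: it walks the file in blocks of 4 lines and tests that line 1 starts with @, line 3 starts with +, and sequence and quality have the same length. The function name is_four_line_fastq is hypothetical.

```python
def is_four_line_fastq(path):
    """Heuristically check that a FASTQ file uses exactly 4 lines per record."""
    with open(path) as handle:
        while True:
            header = handle.readline()
            if not header:
                return True  # reached end of file on a record boundary
            seq = handle.readline().rstrip("\n")
            plus = handle.readline()
            qual = handle.readline().rstrip("\n")
            # In a wrapped file the third line of a block is usually more
            # sequence, not the "+" separator, so this check fails.
            if not header.startswith("@") or not plus.startswith("+"):
                return False
            # Sequence and quality must be non-empty and equally long.
            if not seq or len(seq) != len(qual):
                return False
```

If this returns False, the line-count based split is not safe for that file and a real FASTQ parser should be used instead.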
Edit:
I now assume again that the split method is safe, because FASTQ files have 4 lines per entry in practice: files containing wrapped sequence or quality basically don't exist, and I have never seen one. So use option 1, it's safe. This is the correct answer. Period. lol
For what it's worth, I ended up using BioPython for this task.