Processing Fastq Files In R/Bioconductor Without Reading Entire Files Into Memory?
5
6
Entering edit mode
14.1 years ago
Ryan Thompson ★ 3.6k

I'm trying to do some analysis on some fastq files using R, and what I'm doing only requires me read one sequence at a time and then discard it. I tried using the ShortRead package from Bioconductor, but readFastq reads the entire file at once. Or tries to, and runs out of RAM.

So is there an easy way to read a fastq file just a few sequences at a time, without writing my own half-broken parser?

Edit: This question was asked years ago, and since then the ShortRead package has added a streaming interface. Use that.

r bioconductor fastq memory • 25k views
ADD COMMENT
0
Entering edit mode

For what it's worth, I ended up using BioPython for this task.

ADD REPLY
3
Entering edit mode
13.7 years ago
Ryan Thompson ★ 3.6k

For what it's worth, the solution I ultimately settled on was to read the fastq in python using BioPython, and then call my R analysis code from python using rpy2.

ADD COMMENT
2
Entering edit mode
14.1 years ago
Neilfws 49k

EDIT: I just noticed in your question that you have one fastq file with many sequences. Obviously, this answer requires that you split that file into separate sequences - which may not be what you want.

One solution might be to read the file names into a vector, then use each name as the argument to readFastq(pattern=...), so as to read one file at a time. For example, assuming you opened an R console in the same directory as the fastq files:

files <- list.files()
for(file in files) {
  fq <- readFastq(".", pattern = file)
  # do stuff with fq...
  # then remove it
  rm(fq)
}

If you're not in the directory of files, you'll need to supply the path to list.files() and to readFastq(). I have not used readFastq(), so check that you can supply . to mean "current directory".

ADD COMMENT
0
Entering edit mode
ADD REPLY
2
Entering edit mode
14.0 years ago
Oscar ▴ 20

Bit of an old question, but just thought that you could find FastqSampler in the ShortRead package interesting since it seems to be aimed at exactly what you asked.

ADD COMMENT
0
Entering edit mode

This function shows it's possible in R, put it generates a random sample, you cannot control which records it reads.

ADD REPLY
1
Entering edit mode
14.1 years ago

ShortRead doesn't include a stream interface (to my knowledge - I haven't checked the devel version).

It's a bit of a hack, but you could convert your Fastq to BAM format e.g. using Picard Fastq2Sam and then use Rsamtools which has the ability to create views on large BAM files. It still doesn't provide a stream interface that I can see, but it may mitigate your problem. It may make things faster too because (at least where I work) Fastq operations are mostly I/O bound anyway.

Failing that, use a Fastq chunker to break up your file into blocks that will fit in RAM and then operate on those sequentially as neilfws suggested, or in parallel if you can avoid saturating your I/O.

ADD COMMENT
1
Entering edit mode
14.0 years ago
Michael 55k

There is definitely not a real good way to do this directly using R.

But there are two better options:

  1. a good old unix tool split will help. You could also use readLines() and writeLines() in R to read a number of lines and write them to a temporary file but this is not very efficient.

    Using the fact that each fastq record has exactly 4 lines, no empty lines allowed (if you have empty lines, remove them first):

    split -l4000000 input.fastq split.fastq.
    

    splits the input file into fastq files with max 1000000 entries and makes output files like split.fastq.aa, ..., .zz Then read in the files sequentially using readFastq()

  2. Using an R function, check out this:

    read.DNAStringSet(filepath, format="fastq",
              nrec=-1L, skip=0L, use.names=TRUE)
    

    using the parameters nrec and skip. However, this function ignores quality values. So, if you need the qualities, use option 1. if you only need the sequence, option 2 might be ok.

Edit:

There is a 'format' definition, for those doubting option 1 is valid:

<fastq> :=  <block>+
<block> :=  @<seqname>\n<seq>\n+[<seqname>]\n<qual>\n
<seqname>   :=  [A-Za-z0-9_.:-]+
<seq>   :=  [A-Za-z\n\.~]+
<qual>  :=  [!-~\n]+

http://maq.sourceforge.net/fastq.shtml

While wrapping sequence by n in fact should be punished hard, according to this definition it is unfortunately possible, though I have never seen such file. So it's situational and depends on the file not containing wrapped sequence.

Another argument for FASTQ not being a 'format' because it's not easy to parse then, cause the quality string could itself contain a @ or +. So even if it's allowed, it must be bad practice to have wrapped sequences in FASTQ (unlike in FASTA)

Edit: I now assume again, that the split method is safe, because fastq files have 4 lines per entry and files containing wrapped sequence or quality don't exist, and nobody has ever seen one, so use method 1, it's safe. This is the correct answer. Period. lol

ADD COMMENT
2
Entering edit mode

Fastq is quite straightforward, really. Just count the sequence characters for each record and then you know how many quality characters to expect. A problem does arise if your sequence alphabet contains the '+' character because then you can't distinguish the quality header. I agree that Fastq is not a viable format in the long run, but rather because it doesn't compress well. We use BAM files instead, even for unmapped reads.

ADD REPLY
1
Entering edit mode

Fastq records may have wrapped lines, much like Fasta, so the 4 lines per record assumption is not safe.

ADD REPLY
0
Entering edit mode

Really? I've never seen such... Will look it up

ADD REPLY
0
Entering edit mode

Keith, I disagree, then that file would be 'non-standard' (there is no standard unfortunately so FASTQ is not a viable 'format' in the long run anyway):

ADD REPLY
0
Entering edit mode

A FASTQ file normally uses four lines per sequence. Line 1 begins with a '@' character and is followed by a sequence identifier and an optional description (like a FASTA title line). Line 2 is the raw sequence letters. Line 3 begins with a '+' character and is optionally followed by the same sequence identifier (and any description) again. Line 4 encodes the quality values for the sequence in Line 2, and must contain the same number of symbols as letters in the sequence.

ADD REPLY
0
Entering edit mode

Given, of course, that FASTQ is not even a format, anything is possible, but I have never seen such file, and if you got such I would just punish the provider hard.

ADD REPLY
0
Entering edit mode

[?] := [?]+ [?] := @[?]n[?]n+[[?]]n[?]n [?] := [A-Za-z0-9_.:-]+ [?] := [A-Za-zn.~]+ [?] := [!-~n]+

ADD REPLY
0
Entering edit mode

You are right, it just shouldn't be like this (and luckily most of the time it isn't)

ADD REPLY
0
Entering edit mode

I know, I guess we could make another parser for R, I can easily think of 10 lines in perl, but Ryan wanted something that works in R without building his own parser. So the first option was not that good anyway.

ADD REPLY

Login before adding your answer.

Traffic: 1629 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6