So, fastq-dump can be run directly on an SRA accession number, converting the SRA to FASTQ on the fly so that the SRA never has to be written to disk.
I'm curious whether it would be possible to have fastq-dump write to a named pipe (created with mkfifo) and feed that into another program, for example Trinity, to run an assembly on the FASTQ file(s) without ever writing all that data to disk. For large datasets, this could save quite a bit of time in aggregate.
Has anyone done something similar? I'm going to experiment with the technique soon, but (a) I don't know much about mkfifo to begin with, and (b) I'm unsure how this procedure would work for paired-end data, where fastq-dump is splitting the SRA file as it goes. How would one specify which output goes to which pipe?
I would welcome any thoughts from more experienced users!
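For reference, my understanding is that the basic single-end version of the pattern would look something like this (the accession and downstream program here are just placeholders):

mkfifo reads.fq                         # create the named pipe
fastq-dump -Z SRR000001 > reads.fq &    # stream FASTQ into the pipe in the background
some_downstream_program reads.fq        # reads the pipe as if it were a regular file
rm reads.fq                             # remove the pipe when done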
UPDATE: Okay, for others who might stumble upon this, here is a brief description of one implementation of this technique with paired-end RNA-seq data:
fastq-dump SRA_file \
--split-files \
-I \
-Z | \
tee >(grep '@.*\.1\s' -A3 --no-group-separator > namedPipe_1) \
>(grep '@.*\.2\s' -A3 --no-group-separator > namedPipe_2) \
>/dev/null
This first requires the creation of two named pipes using mkfifo. For paired-end data, the -Z flag becomes problematic because it forces the data into a single stream. There are many ways to regain the two pairs, but the way I've elected to do it is to use --split-files to break up the stream beforehand, -I to append either ".1" or ".2" to the end of each header, and then tee to duplicate the stream plus grep with a regex to parse the reads from each pair back out into separate pipes for downstream use.
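For completeness, creating the two pipes beforehand is just:

mkfifo namedPipe_1 namedPipe_2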
I have tested this with Trinity, running on each named pipe just as I would with a FASTQ file, and it seems to be working fine. While I am not 100% sure that Trinity won't try to go back to the original FASTQ files, the first thing Trinity does is parse those FASTQ files into FASTA format, which it later concatenates into "both.fa", so I'm fairly confident that this will work.
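Concretely, once the two pipes exist and the pipeline above is running in the background (append an & to it), the Trinity call is the usual one, just pointed at the pipes. The flags below are illustrative and version-dependent (e.g., older Trinity versions use --JM instead of --max_memory):

Trinity --seqType fq --left namedPipe_1 --right namedPipe_2 \
        --max_memory 20G --CPU 8 --output trinity_out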
Thanks to everyone who responded! Hope this can be useful for someone else in the future.
fastq-dump --split-spot -Z produces 8-line FASTQ format, which can be piped to awk or perl to create separate streams. Also, new Java and Python APIs are available on GitHub at ncbi/ngs.
The examples available for Java and Python could be extended to suit your needs: https://github.com/ncbi/ngs/blob/master/ngs-java/examples/examples/FragTest.java
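For example, the interleaved 8-line records from --split-spot -Z could be split with awk along these lines (the accession and pipe names are illustrative):

# lines 1-4 of each 8-line record are mate 1, lines 5-8 are mate 2
fastq-dump --split-spot -Z SRR000001 \
| awk '{ if ((NR - 1) % 8 < 4) print > "namedPipe_1"; else print > "namedPipe_2" }'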
Thanks for the response! I am unsure of the advantage of --split-spot over --split-files, but you've outlined the general strategy I've decided on. I've messed around a bit with some simple awk regex parsing of the combined --split-files -I -Z output, where the -I flag should allow me to separate the different reads back out, as their headers are appended with a "1" or "2" depending on their source. Thanks for this. It got me on (what I think will be) the right track.