fastq-dump stream to named pipe (fifo) to Trinity
10.1 years ago
glarue ▴ 70

So, fastq-dump has the ability to be run on just an SRA file accession number, such that the SRA is converted to FASTQ on-the-fly, and the SRA doesn't have to be written to disk.

I'm curious whether it would be possible to use fastq-dump to write to a named pipe (using mkfifo) and feed that into another program, for example Trinity, to run an assembly on the FASTQ file(s) without ever having to write all that data to disk. For large datasets, this could actually save quite a bit of time in aggregate.
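
For anyone unfamiliar with named pipes, here is a minimal sketch of the mechanism on its own, with no SRA tools involved. The function name demo_fifo and the file name reads.fq are made up for illustration; a background printf stands in for the program producing FASTQ, and wc -l stands in for the program consuming it.

```shell
# demo_fifo: push one FASTQ record through a named pipe and count the
# lines that come out the other end. Nothing is stored on disk except
# the pipe's directory entry.
demo_fifo() {
  local dir fifo
  dir=$(mktemp -d)
  fifo="$dir/reads.fq"            # hypothetical name, looks like a file to the reader
  mkfifo "$fifo"

  # Writer: runs in the background and blocks until a reader opens the pipe.
  printf '@r1\nACGT\n+\nIIII\n' > "$fifo" &

  # Reader: treats the pipe like an ordinary file.
  wc -l < "$fifo"

  wait                            # let the writer finish
  rm -r "$dir"
}
```

The writer is started in the background because opening a FIFO for writing blocks until something opens it for reading.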

Has anyone done something similar? I am going to try and experiment with the technique soon, but I a) don't know much about the mkfifo process to begin with and b) am unsure of how this procedure would work for paired-end data where fastq-dump is splitting the SRA file as it goes. How would one specify which output would go to which pipe?

I would welcome any thoughts from more experienced users!

UPDATE: Okay, so just for others who might stumble upon this, here is a brief description of one implementation of this technique to run with paired-end RNA-seq data:

fastq-dump SRA_file \
  --split-files \
  -I \
  -Z | \
  tee >(grep '@.*\.1\s' -A3 --no-group-separator > namedPipe_1) \
  >(grep '@.*\.2\s' -A3 --no-group-separator > namedPipe_2) \
  >/dev/null

This first requires creating two named pipes with mkfifo (here namedPipe_1 and namedPipe_2). For paired-end data, the -Z flag is problematic because it forces all output into a single stream. There are many ways to recover the two reads of each pair; the way I've elected to do it is to use --split-files to split each spot into its reads, -I to append ".1" or ".2" to each read name, and then tee to duplicate the stream, with one grep per read number pulling the matching records back out into separate pipes for downstream use.
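
As an alternative sketch of the same idea, the tee plus grep step could be replaced by a single awk pass that routes whole records by line position, so a quality line that happens to begin with "@" is never mistaken for a header. The function name split_by_suffix is made up here, and the code assumes that with -I the first whitespace-delimited field of each header ends in ".1" or ".2"; check a few lines of real output before relying on it.

```shell
# split_by_suffix OUT1 OUT2
# Reads 4-line FASTQ records on stdin. Records whose read name (the first
# whitespace-delimited field of the header) ends in ".2" go to OUT2;
# all other records go to OUT1.
split_by_suffix() {
  awk -v out1="$1" -v out2="$2" '
    NR % 4 == 1 { dest = ($1 ~ /\.2$/) ? out2 : out1 }  # header line picks the target
    { print > dest }                                     # rest of the record follows it
  '
}

# Intended use (not run here; assumes fastq-dump is installed and that
# -I yields read names ending in ".1" / ".2"):
#   mkfifo namedPipe_1 namedPipe_2
#   fastq-dump SRA_file --split-files -I -Z | split_by_suffix namedPipe_1 namedPipe_2 &
#   ...then point the assembler at namedPipe_1 and namedPipe_2
```

Both pipes need a reader running at the same time. If the downstream program reads one pipe to the end before opening the other, the writer will stall, because opening a FIFO for writing blocks until a reader appears.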

I have tested this with Trinity, running on each named pipe just as I would with a FASTQ file, and it seems to be working fine. While I am not 100% sure that Trinity won't try and go back to the original FASTQ files, the first thing Trinity does is take those FASTQ files and parse them into FASTA format, which it later concatenates into "both.fa", and so I'm pretty confident that this will work.

Thanks to everyone who responded! I hope this is useful for someone else in the future.

next-gen-sequencing RNA-Seq fastq-dump • 5.7k views
10.1 years ago

From the manual:

Workflow and piping:
-O     |     --outdir <path>     Output directory, default is current working directory ('.').
-Z     |     --stdout     Output to stdout, all split data become joined into single stream.
            --gzip     Compress output using gzip.
            --bzip2     Compress output using bzip2.

I don't know what the output looks like (interleaved FASTQ?), but it should be possible to do something like:

fastq-dump (options) -Z | awk? | bwa mem -p REF.fa -

fastq-dump --split-spot -Z produces 8-line FASTQ (the two reads of each spot as consecutive 4-line records), which can be piped to awk or perl to create separate streams.
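
A rough sketch of that awk step, assuming every spot really does yield exactly two consecutive 4-line records (a spot with a single read would throw the alternation off). The function name split_8line is made up for illustration, and the two output arguments could be ordinary files or named pipes.

```shell
# split_8line OUT1 OUT2
# Assumes stdin is FASTQ in which every spot is exactly two consecutive
# 4-line records: lines 1-4 of each 8-line group go to OUT1, lines 5-8 to OUT2.
split_8line() {
  awk -v out1="$1" -v out2="$2" '
    { if ((NR - 1) % 8 < 4) { print > out1 } else { print > out2 } }
  '
}

# Intended use (not run here; assumes fastq-dump is installed):
#   fastq-dump SRA_file --split-spot -Z | split_8line namedPipe_1 namedPipe_2 &
```

Splitting purely by position avoids any dependence on the header format, at the cost of silently mis-pairing reads if the two-reads-per-spot assumption does not hold.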

Also, new Java and Python APIs are available on GitHub at ncbi/ngs.

Examples available for java and python could be extended to suit your needs: https://github.com/ncbi/ngs/blob/master/ngs-java/examples/examples/FragTest.java


Thanks for the response! I am unsure of the advantage of --split-spot over --split-files, but you've outlined the general strategy I've decided on. I've experimented a bit with some simple awk regex parsing of the combined --split-files -I -Z output, where the -I flag should let me separate the reads back out, since it appends ".1" or ".2" to each read name depending on which read it is.


Thanks for this. It got me on (what I think will be) the right track.
