I want to run a batch script to align reads from a bunch of different samples in a GEO accession. Some are single-end and some are paired-end, but the metadata in the series matrix file does not indicate which is which. I can manually convert to FASTQ and inspect the files to determine it, but I'd like to find an automated way to do this. I know that the SRA file must have metadata stored in it to explain where the split should occur, but I can't figure out how to get at it. The only thing that looks like it might be what I want is the sra-stat program in the SRA Toolkit; however, I can't find any documentation on its output, and the default text output is just a cryptic series of numbers separated by colons and pipes.
I could always run sra-stat with the -s option, output as XML, and find the answer there, but this requires the routine to go through the entire file, which takes a while. I could also just run fastq-dump with the --split-files option and look to see if I get one or two files as a result, but this also seems like a bit of a hack. Is there a better way?
It feels like there should be some header information in the file that I could quickly access.
Using the --split-spot option of fastq-dump is a great idea. Although your approach above is definitely good, I think davedeto has a slightly simpler solution, which I incorporate here:
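A minimal sketch of the --split-spot idea, assuming fastq-dump from the SRA Toolkit is on the PATH. It dumps only the first spot to stdout and counts the FASTQ lines: four lines means one read per spot (single-end), eight means two reads (paired-end). The function names sra_layout and layout_from_line_count are hypothetical helpers made up for this illustration, and any other line count is reported as unknown, which can happen when a spot carries extra technical reads.

```shell
#!/usr/bin/env bash
# Sketch: classify an SRA run as single-end or paired-end from its first spot.

# Map the number of FASTQ lines produced for one spot to a layout label.
# One read = 4 lines, two reads = 8 lines.
layout_from_line_count() {
    case "$1" in
        4) echo "single" ;;
        8) echo "paired" ;;
        *) echo "unknown" ;;
    esac
}

# Hypothetical helper: $1 is a run accession or a path to an .sra file.
sra_layout() {
    local n
    # -X 1          stop after the first spot
    # -Z            write FASTQ to stdout instead of a file
    # --split-spot  split the spot into its individual reads
    n=$(fastq-dump -X 1 -Z --split-spot "$1" 2>/dev/null | wc -l | tr -d ' ')
    layout_from_line_count "$n"
}
```

In a batch script this could be used as `layout=$(sra_layout sample.sra)`, branching on the result to pick the single-end or paired-end aligner invocation. Because only one spot is read, it returns almost immediately even for large files.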
Cool! Much simpler, thanks! :)