Hi folks, I'm trying to work with a dataset from SRA where, instead of the paired-end FASTQ format that I'm used to, the submitting researcher seems to have concatenated the reads, so that every record looks like the following:
@SRR########.1 1 length=300 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+SRR########.1 1 length=300 FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFF::FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF FFFFFFFFFFFFFFF:FFFFFFFFFFF,FFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFF,FFFFFFFFFFFFFFFFFFFFF:,FFFFFFFFFFF,FFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@SRR########.2 2 length=300 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+SRR########.2 2 length=300 FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFF::FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF FFFFFFFFFFFFFFF:FFFFFFFFFFF,FFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFF,FFFFFFFFFFFFFFFFFFFFF:,FFFFFFFFFFF,FFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@SRR########.3 3 length=300 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+SRR########.3 3 length=300 FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFF::FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF FFFFFFFFFFFFFFF:FFFFFFFFFFF,FFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFF,FFFFFFFFFFFFFFFFFFFFF:,FFFFFFFFFFF,FFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
Each record contains two reads that aren't separated by anything; the second read simply begins after 150 characters. So the normal way of splitting the files won't work in this situation, since the reads are concatenated instead of being stored separately in the conventional way. Is anyone aware of a tool to fix this? I can imagine a fairly straightforward way to do this with a simple script, but would rather use a tool if there is one.
Is this read concatenation common practice on the SRA? It's my first time using this resource, but I can't imagine what the advantage is here, since the data is unusable in this configuration. I should say that I haven't worked with RNA-seq data in a few years, so I could conceivably be out of touch. Thanks for any insights.
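For reference, the "simple script" idea mentioned in the question could look roughly like the sketch below. It is only an illustration under some assumptions that are not confirmed by the thread: every record is exactly four lines (no wrapped sequences), both mates have the same length unless `mate_len` is given, and the function names (`read_fastq`, `split_record`, `split_fastq`) and the `/1` and `/2` suffixes are made up for this example.

```python
import io
from typing import Iterator, Optional, TextIO, Tuple

Record = Tuple[str, str, str]  # (header, sequence, quality)


def read_fastq(handle: TextIO) -> Iterator[Record]:
    """Yield records from a FASTQ stream with exactly four lines per record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # stop at end of file (or at a blank line)
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, not needed
        qual = handle.readline().rstrip("\n")
        yield header, seq, qual


def split_record(record: Record, mate_len: Optional[int] = None) -> Tuple[Record, Record]:
    """Cut one concatenated record into two mates.

    Assumption: without mate_len, the two mates are equally long.
    """
    header, seq, qual = record
    if len(seq) != len(qual):
        raise ValueError(f"sequence/quality length mismatch in {header}")
    if mate_len is None:
        if len(seq) % 2:
            raise ValueError(f"odd read length in {header}; pass mate_len explicitly")
        mate_len = len(seq) // 2
    name = header.split()[0]  # keep only '@SRR...N', drop the rest of the header
    mate1 = (f"{name}/1", seq[:mate_len], qual[:mate_len])
    mate2 = (f"{name}/2", seq[mate_len:], qual[mate_len:])
    return mate1, mate2


def split_fastq(src: TextIO, out1: TextIO, out2: TextIO,
                mate_len: Optional[int] = None) -> int:
    """Write mate 1 to out1 and mate 2 to out2; return the number of records."""
    count = 0
    for record in read_fastq(src):
        for (header, seq, qual), out in zip(split_record(record, mate_len), (out1, out2)):
            out.write(f"{header}\n{seq}\n+\n{qual}\n")
        count += 1
    return count


# Tiny in-memory demo with a made-up 8-base record (two 4-base mates).
demo = io.StringIO("@SRR0.1 1 length=8\nACGTTTGA\n+SRR0.1 1 length=8\nFFFF:::,\n")
r1, r2 = io.StringIO(), io.StringIO()
split_fastq(demo, r1, r2)
print(r1.getvalue(), r2.getvalue(), sep="")
```

For real files you would pass handles from `open(...)` instead of `io.StringIO`. Note that this only treats the symptom; it cannot tell you whether the mates really are the same length in every record.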
If you have the SRA accession numbers, just download the data again. If someone has messed with primary data like this, you can't be sure that nothing else is wrong with it.
No, this should not be the case.
How did you download the data? Did you use the --split-3 option when running fastq-dump? If memory serves, this terrible behaviour of fastq-dump (concatenating paired reads into a single read) happens when fastq-dump is called without a splitting option on paired data. It should at least throw a warning, but as with many things in the sra-toolkit... well, it doesn't, and it creates this mess. Check sra-explorer.info for direct download links.
I didn't use the --split-3 option, no. Sounds like I'll be giving that a shot now. Thanks for your advice.
EDIT: I just tried this and it worked like a charm. Two files output, length=150. Thanks a bunch!
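For anyone landing here later, the fix discussed above can be sketched as a small Python wrapper around the command line. This is only a sketch: the accession `SRR0000000` is a placeholder, the helper names `build_fastq_dump_cmd` and `download_paired` are invented for this example, and it assumes the SRA Toolkit's `fastq-dump` is installed and on your PATH.

```python
import shutil
import subprocess
from typing import List


def build_fastq_dump_cmd(accession: str, outdir: str = ".") -> List[str]:
    """Build a fastq-dump call that writes paired mates to separate files."""
    # --split-3 splits paired reads into two files; --outdir sets the target directory.
    return ["fastq-dump", "--split-3", "--outdir", outdir, accession]


def download_paired(accession: str, outdir: str = ".") -> None:
    """Run fastq-dump for one accession (hypothetical helper, needs the SRA Toolkit)."""
    if shutil.which("fastq-dump") is None:
        raise RuntimeError("fastq-dump not found on PATH; install the SRA Toolkit first")
    subprocess.run(build_fastq_dump_cmd(accession, outdir), check=True)


# Show the command that would be run for a placeholder accession.
print(" ".join(build_fastq_dump_cmd("SRR0000000", "fastq")))
```

Calling `download_paired("SRR0000000", "fastq")` with a real accession should leave the two mate files in the `fastq` directory, typically named with `_1.fastq` and `_2.fastq` suffixes; reads without a mate, if any, end up in a third file.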