How to split a PE fastq file (hear me out) where, instead of being listed as a pair of 150bp reads, each pair was concatenated into one 300bp string?
4.2 years ago
turo.1 ▴ 20

Hi folks, I'm trying to work with a dataset from SRA where instead of the paired-end fastq format that I'm used to, the submitting researcher seems to have concatenated the reads so that all the reads look like the following:

@SRR########.1 1 length=300 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+SRR########.1 1 length=300 FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFF::FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF FFFFFFFFFFFFFFF:FFFFFFFFFFF,FFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFF,FFFFFFFFFFFFFFFFFFFFF:,FFFFFFFFFFF,FFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

@SRR########.2 2 length=300 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+SRR########.2 2 length=300 FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFF::FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF FFFFFFFFFFFFFFF:FFFFFFFFFFF,FFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFF,FFFFFFFFFFFFFFFFFFFFF:,FFFFFFFFFFF,FFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

@SRR########.3 3 length=300 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+SRR########.3 3 length=300 FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFF::FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF FFFFFFFFFFFFFFF:FFFFFFFFFFF,FFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFF,FFFFFFFFFFFFFFFFFFFFF:,FFFFFFFFFFF,FFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

Each record holds two reads with nothing separating them; the second read simply begins after 150 characters. So the normal way of splitting the files won't work here, since the mates are concatenated into one record instead of being stored as separate records. Is anyone aware of a tool to fix this? I can imagine a fairly straightforward way to do this with a simple script, but would rather use a tool if there is one.
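For reference, the "simple script" route could look roughly like the sketch below. It assumes a standard 4-line FASTQ with no blank lines between records, and that every record is exactly two mates of equal length (150 bp here) stuck together. The function names and the `/1` and `/2` read-name suffixes are illustrative choices, not anything SRA prescribes. Records that are not exactly twice `read_len` long raise an error instead of being split silently.

```python
import io


def split_record(header, seq, qual, read_len=150):
    """Split one concatenated record into two 4-line FASTQ records."""
    if len(seq) != 2 * read_len or len(qual) != 2 * read_len:
        raise ValueError(f"unexpected record length for {header!r}")
    name = header[1:].split()[0]  # drop '@' and the trailing description
    mate1 = f"@{name}/1\n{seq[:read_len]}\n+\n{qual[:read_len]}\n"
    mate2 = f"@{name}/2\n{seq[read_len:]}\n+\n{qual[read_len:]}\n"
    return mate1, mate2


def split_concatenated_fastq(src, out1, out2, read_len=150):
    """Read 4-line records from src; write mates to out1 and out2.

    Returns the number of records processed.
    """
    count = 0
    while True:
        header = src.readline().rstrip("\n")
        if not header:  # end of input
            break
        seq = src.readline().rstrip("\n")
        src.readline()  # '+' separator line, not needed
        qual = src.readline().rstrip("\n")
        mate1, mate2 = split_record(header, seq, qual, read_len)
        out1.write(mate1)
        out2.write(mate2)
        count += 1
    return count


# Tiny in-memory demo using 4 bp mates instead of 150 bp.
demo = io.StringIO("@SRR0.1 1 length=8\nAAAACCCC\n+SRR0.1 1 length=8\nFFFF::::\n")
r1, r2 = io.StringIO(), io.StringIO()
split_concatenated_fastq(demo, r1, r2, read_len=4)
print(r1.getvalue(), r2.getvalue(), sep="")
```

With real files, open the input and the two outputs with `open(...)` and pass the handles in place of the `StringIO` objects.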

Is this read-concatenating common practice on the SRA? It's my first time using this resource, but I can't imagine what the advantage is here, since the data is unusable in this configuration. I should say that I haven't worked with RNA-seq data in a few years, and I could conceivably be out of touch. Thanks for any insights.

rna-seq paired-end fastq RNA-Seq sequence • 1.2k views

If you have the SRA accession numbers, just download the data again. If someone has messed with primary data like this, you can't be sure that nothing else is wrong with it.

"Is this read-concatenating common practice on the SRA?"

No, it should not be the case.


How did you download the data? Did you pass the --split-3 option to fastq-dump? If memory serves, this terrible behaviour of fastq-dump (concatenating paired reads into a single read) happens when it is called on paired data without a splitting option. It should at least throw a warning but, like many things in the sra-toolkit, it doesn't, and it creates this mess. Check sra-explorer.info for direct download links.
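As a rough sketch of what that re-download could look like when scripted, the snippet below builds a fastq-dump call with `--split-3` and `--outdir`, both of which are existing fastq-dump options. The helper names `build_fastq_dump_command` and `download_split` are made up for illustration, and the accession is whatever run ID you actually need. The call is only attempted if fastq-dump is found on the PATH.

```python
import shutil
import subprocess


def build_fastq_dump_command(accession, outdir="."):
    """Return the argument list for a paired-aware fastq-dump call."""
    # --split-3 writes mates to <accession>_1.fastq and <accession>_2.fastq;
    # reads without a mate go to <accession>.fastq.
    return ["fastq-dump", "--split-3", "--outdir", outdir, accession]


def download_split(accession, outdir="."):
    """Run fastq-dump with --split-3, assuming the SRA Toolkit is installed."""
    cmd = build_fastq_dump_command(accession, outdir)
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError("fastq-dump not found on PATH")
    subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
```

Calling `download_split(run_id, "fastq")` with your own accession as `run_id` should leave two mate files in `fastq/`, each holding 150 bp reads for a dataset like the one in the question.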


I didn't use the --split-3 option, no. Sounds like I'll be giving that a shot now. Thanks for your advice.

EDIT: I just tried this and it worked like a charm. Two files output, length=150. Thanks a bunch!
