I ran into trouble when using fastq-dump from the SRA toolkit to create fastqs from sra files. I originally used the following command:
fastq-dump -I --split-files <SRA.FILE>
and this worked for all but 3 of the sra files in project phs000452.v2.p1.c1. For those 3, the resulting fastqs were of unequal size:
-rw------- 1 t.kuilman domain users 68G Apr 5 21:41 SRR482001_1.fastq
-rw------- 1 t.kuilman domain users 46G Apr 5 21:41 SRR482001_2.fastq
-rw------- 1 t.kuilman domain users 19G Apr 17 2012 SRR482001.sra
-rw------- 1 t.kuilman domain users 47G Apr 5 20:35 SRR482005_1.fastq
-rw------- 1 t.kuilman domain users 24G Apr 5 20:35 SRR482005_2.fastq
-rw------- 1 t.kuilman domain users 12G Apr 16 2012 SRR482005.sra
-rw------- 1 t.kuilman domain users 46G Apr 5 20:32 SRR482008_1.fastq
-rw------- 1 t.kuilman domain users 22G Apr 5 20:32 SRR482008_2.fastq
-rw------- 1 t.kuilman domain users 11G Apr 16 2012 SRR482008.sra
The files SRR482001_1.fastq and SRR482001_2.fastq contain 812876688 and 393188416 lines, respectively, which is obviously incorrect for paired data. I found out that the reason is that the 3 sra files contain both paired and unpaired reads. Using --split-files assumes that ALL reads are paired, so all paired forward reads AND the unpaired reads end up in *_1.fastq. You can, however, use --split-3 instead, which will create three files: a forward fastq, a reverse fastq, and an unpaired fastq. The catch is that the help text of fastq-dump marks --split-3 as a legacy option, so it was not the first thing I tried (the description of --split-3 could also be improved). Also, it is not obvious (if at all possible) to see from the SRA database website that some sra files contain both paired and unpaired reads.
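A minimal sketch of the --split-3 variant of the original command, assuming fastq-dump is on the PATH. The wrapper name dump_split3 is hypothetical and only there to guard against a missing toolkit; the -I flag simply mirrors the original command.

```shell
# Hypothetical helper: convert one .sra file with --split-3 instead of --split-files.
dump_split3() {
    sra_file="$1"
    # Bail out early if the SRA toolkit is not installed or not on the PATH.
    if ! command -v fastq-dump >/dev/null 2>&1; then
        echo "fastq-dump not found on PATH" >&2
        return 127
    fi
    # Paired reads go to *_1.fastq and *_2.fastq, unpaired reads to a third file.
    fastq-dump -I --split-3 "$sra_file"
}

# Usage (not run here): dump_split3 SRR482001.sra
```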
I double-checked with SRA that using --split-3 is indeed the solution to this problem. The bottom line is that I would either carefully check the size of the fastqs created by fastq-dump, or (maybe better) use --split-3 in the first place to obtain paired fastqs.
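The size check can be made a bit more precise by comparing read counts rather than file sizes. The sketch below assumes four lines per FASTQ record; the function names count_reads and check_pair are hypothetical.

```shell
# Hypothetical helper: count reads in a FASTQ file, assuming 4 lines per record.
count_reads() {
    lines=$(wc -l < "$1")
    echo $((lines / 4))
}

# Hypothetical helper: succeed only if both mate files hold the same number of reads.
check_pair() {
    n1=$(count_reads "$1")
    n2=$(count_reads "$2")
    if [ "$n1" -ne "$n2" ]; then
        echo "MISMATCH: $1 has $n1 reads, $2 has $n2 reads" >&2
        return 1
    fi
    echo "OK: $n1 read pairs"
}

# Usage (not run here): check_pair SRR482001_1.fastq SRR482001_2.fastq
```

For the line counts quoted above, this would report 203219172 reads in SRR482001_1.fastq against 98297104 in SRR482001_2.fastq, i.e. a mismatch.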
I hope this is helpful to you!
Thanks for sharing this. One question: I checked your SRAs, e.g. SRR482001, and under "spot descriptor" it says FORWARD. Doesn't this mean the SRA is single-end rather than paired-end, even though you split the reads into forward and reverse?
Also, I wanted to download the SRA but couldn't get it. If you have a link for it, I'd like to test it myself too.
Thanks
You are welcome. I agree that the spot descriptor is suggestive of single-end data, but the --split-3 option should result in reads going into the right fastq. In other words, if there weren't any paired-end reads, *_1.fastq and *_2.fastq would be empty (or potentially even absent). With regard to the data, part of the SRA database is under restricted access. You can browse datasets most easily here, and you will be notified if your dataset is restricted. In that case, you can register a request here.
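The "empty or absent" criterion can be checked after a --split-3 run roughly as follows. This is only a sketch: it assumes the output files sit in the current directory and follow the ACCESSION_1.fastq / ACCESSION_2.fastq naming seen in the listing above, and the function name has_paired_reads is hypothetical.

```shell
# Hypothetical helper: report whether the --split-3 output for an accession
# includes non-empty forward and reverse files in the current directory.
has_paired_reads() {
    acc="$1"
    # -s is true only if the file exists and has a size greater than zero.
    if [ -s "${acc}_1.fastq" ] && [ -s "${acc}_2.fastq" ]; then
        echo "paired"
    else
        echo "no paired reads found"
        return 1
    fi
}

# Usage (not run here): has_paired_reads SRR482001
```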