I'm having some trouble converting an Illumina paired-end accession from NCBI's SRA to the paired _1 and _2 fastq files using fastq-dump from the SRA toolkit. I'm running fastq-dump version 2.1.0 (June 22, 2011) and following instructions from the NCBI website here.
When I download this accession (or others from the same project) and convert it to fastq, one of the _1 and _2 fastq files ends up with twice as many sequences as the other, and every sequence in the smaller file also appears in the larger one, e.g.
$ wget -rq ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/litesra/SRX/SRX058/SRX058150/
$ fastq-dump -A SRR189044 ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/litesra/SRX/SRX058/SRX058150/SRR189044/SRR189044.lite.sra
$ head SRR189044*fastq
==> SRR189044_1.fastq <==
@SRR189044.1 HWI-EAS66_0007_FC705J6:5:1:1044:2452 length=76
NGTTTCACGTCTGGCGATTTTGACTCATTTTTGAACGAATGCAATGTAACNNNNNNNNAAAATGCAACAGGACCGN
+SRR189044.1 HWI-EAS66_0007_FC705J6:5:1:1044:2452 length=76
############################################################################
@SRR189044.1 HWI-EAS66_0007_FC705J6:5:1:1044:2452 length=76
ATTTTTTCTTTTAACTCATCCATAATTTGTCCTTTTTGTTGTCACCCACAGAACATAATGCTAGGATACTGTTTAA
+SRR189044.1 HWI-EAS66_0007_FC705J6:5:1:1044:2452 length=76
-62077328'//5A5:6?:BA5?--:=67C6?>>=5=;8,'--B?:A69?-5?>-<--;64-?:-CD:BA.@;>4>
@SRR189044.2 HWI-EAS66_0007_FC705J6:5:1:1045:19202 length=76
NTCGATCGCACTGGCGAAGATGAGGAAGCTGTTCTTTCTGGTGATGCTGACNNNNGTCGCCTCGGCCACCGCCTGG
==> SRR189044_2.fastq <==
@SRR189044.1 HWI-EAS66_0007_FC705J6:5:1:1044:2452 length=76
ATTTTTTCTTTTAACTCATCCATAATTTGTCCTTTTTGTTGTCACCCACAGAACATAATGCTAGGATACTGTTTAA
+SRR189044.1 HWI-EAS66_0007_FC705J6:5:1:1044:2452 length=76
-62077328'//5A5:6?:BA5?--:=67C6?>>=5=;8,'--B?:A69?-5?>-<--;64-?:-CD:BA.@;>4>
@SRR189044.2 HWI-EAS66_0007_FC705J6:5:1:1045:19202 length=76
AGCTGCTCGCAGGCGAATATCAGCCAAGAGCAGAACATCACGTCGCATAGATTGGAGCGGTTCATCGAGACGAGCA
+SRR189044.2 HWI-EAS66_0007_FC705J6:5:1:1045:19202 length=76
DBDBC:5CC-A-A:CC-C,C55-::CCB,-DD--5?@=D?=:5D=:=?############################
@SRR189044.3 HWI-EAS66_0007_FC705J6:5:1:1047:3502 length=76
TATGATGTTTAATGCGTTCCCCATTTACTCTTATAAAGTCTTCTAGATTTGTTATTGCAATGCAGGAATTAATAGG
Notice how the _1 file has two reads named SRR189044.1, the second of which is identical to the SRR189044.1 read in the _2 file.
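To see how widespread the duplicated names are, rather than eyeballing the first few records, something like the following sketch could help. It is not part of the SRA toolkit; the function name `duplicated_read_names` is made up for illustration, and it assumes plain four-line FASTQ records (no wrapped sequence lines) with the read name as the first whitespace-separated token of the header.

```python
from collections import Counter


def duplicated_read_names(fastq_path):
    """Return {read_name: count} for read names seen more than once.

    Assumption: every record is exactly four lines, so the header is
    always line 0, 4, 8, ... of the file.
    """
    counts = Counter()
    with open(fastq_path) as handle:
        for line_number, line in enumerate(handle):
            if line_number % 4 == 0:  # header line of each record
                # '@SRR189044.1 HWI-... length=76' -> 'SRR189044.1'
                fields = line[1:].split()
                if fields:
                    counts[fields[0]] += 1
    # Keep only the names that occur at least twice
    return {name: n for name, n in counts.items() if n > 1}
```

Calling `duplicated_read_names("SRR189044_1.fastq")` on the output above should report SRR189044.1 with a count of 2; an empty result for the other file would confirm that only one of the pair is affected.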
I've checked with the data submitters and NCBI, and it looks like there is no duplication of data in the original submission. There is a related post on SEQanswers that unfortunately does not address or help solve this issue. Any ideas on what might be going on here would be appreciated.
Many thanks, Casey
Adding this self-Q&A to help others with the same problem.