How to convert an SRA-lite paired-end submission to FASTQ?
13.3 years ago

I'm having some trouble converting an Illumina paired-end accession from NCBI's SRA to the paired _1 and _2 fastq files using fastq-dump from the SRA Toolkit. I'm running fastq-dump version 2.1.0 (June 22, 2011) and following the instructions on the NCBI website here.

When I download this accession (or others from the same project) and convert it to fastq, one of the _1 and _2 fastq files ends up with twice as many sequences as the other, and every sequence in the smaller file is also present in the larger one, e.g.

$ wget -rq ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/litesra/SRX/SRX058/SRX058150/
$ fastq-dump -A SRR189044 ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/litesra/SRX/SRX058/SRX058150/SRR189044/SRR189044.lite.sra
$ head SRR189044*fastq
==> SRR189044_1.fastq <==
@SRR189044.1 HWI-EAS66_0007_FC705J6:5:1:1044:2452 length=76
NGTTTCACGTCTGGCGATTTTGACTCATTTTTGAACGAATGCAATGTAACNNNNNNNNAAAATGCAACAGGACCGN
+SRR189044.1 HWI-EAS66_0007_FC705J6:5:1:1044:2452 length=76
############################################################################
@SRR189044.1 HWI-EAS66_0007_FC705J6:5:1:1044:2452 length=76
ATTTTTTCTTTTAACTCATCCATAATTTGTCCTTTTTGTTGTCACCCACAGAACATAATGCTAGGATACTGTTTAA
+SRR189044.1 HWI-EAS66_0007_FC705J6:5:1:1044:2452 length=76
-62077328'//5A5:6?:BA5?--:=67C6?>>=5=;8,'--B?:A69?-5?>-<--;64-?:-CD:BA.@;>4>
@SRR189044.2 HWI-EAS66_0007_FC705J6:5:1:1045:19202 length=76
NTCGATCGCACTGGCGAAGATGAGGAAGCTGTTCTTTCTGGTGATGCTGACNNNNGTCGCCTCGGCCACCGCCTGG

==> SRR189044_2.fastq <==
@SRR189044.1 HWI-EAS66_0007_FC705J6:5:1:1044:2452 length=76
ATTTTTTCTTTTAACTCATCCATAATTTGTCCTTTTTGTTGTCACCCACAGAACATAATGCTAGGATACTGTTTAA
+SRR189044.1 HWI-EAS66_0007_FC705J6:5:1:1044:2452 length=76
-62077328'//5A5:6?:BA5?--:=67C6?>>=5=;8,'--B?:A69?-5?>-<--;64-?:-CD:BA.@;>4>
@SRR189044.2 HWI-EAS66_0007_FC705J6:5:1:1045:19202 length=76
AGCTGCTCGCAGGCGAATATCAGCCAAGAGCAGAACATCACGTCGCATAGATTGGAGCGGTTCATCGAGACGAGCA
+SRR189044.2 HWI-EAS66_0007_FC705J6:5:1:1045:19202 length=76
DBDBC:5CC-A-A:CC-C,C55-::CCB,-DD--5?@=D?=:5D=:=?############################
@SRR189044.3 HWI-EAS66_0007_FC705J6:5:1:1047:3502 length=76
TATGATGTTTAATGCGTTCCCCATTTACTCTTATAAAGTCTTCTAGATTTGTTATTGCAATGCAGGAATTAATAGG

Notice how the _1 file has two reads named SRR189044.1, one of which is the corresponding read in the _2 file.
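
For anyone who wants to check their own files for the same symptom, here is a minimal sketch. It assumes plain four-line FASTQ records (no wrapped sequence lines); the function names count_records and duplicate_ids are made up for illustration and are not part of the SRA Toolkit.

```shell
# Count FASTQ records, assuming exactly four lines per record.
count_records() {
    awk 'END { print NR / 4 }' "$1"
}

# Print each read ID (first field of the header line, without the "@")
# that appears more than once in the file. Each ID is printed only once.
duplicate_ids() {
    awk 'NR % 4 == 1 { id = substr($1, 2); if (seen[id]++ == 1) print id }' "$1"
}

# Example usage (file names taken from the post above):
#   count_records SRR189044_1.fastq
#   count_records SRR189044_2.fastq
#   duplicate_ids SRR189044_1.fastq | head
```

If the two counts differ by a factor of two and duplicate_ids prints read names for the larger file, the files show the problem described here.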

I've checked with the data submitters and NCBI, and it looks like there is no duplication of data in the original submission. There is a related post on SEQanswers that unfortunately does not address or help solve this issue. Any ideas on what might be going on here would be appreciated.

Many thanks, Casey


Adding this self-Q&A to help others with the same problem.

13.3 years ago

The problem is that your version of the SRA Toolkit is out of date: newer versions of fastq-dump have an un(der)documented option for dumping paired-end data from an SRA-lite submission. The guidance notes on the NCBI website you refer to were written for version 2.0.1 and themselves warn that instructions can differ between versions:

This guide is current to SRA Toolkit version 2.0.1 release candidate 1. Instructions for previous versions of the SRA Toolkit may be different from those provided in this guide. We recommend that users stay current with SRA Toolkit updates to benefit from feature additions and bug fixes.

In the latest version of the SRA Toolkit, 2.1.2 (July 26, 2011), there are now options to split paired-end reads into separate files:

 --split-files                    Dump each read into a separate file.Files will received suffix corresponding to read number
 --split-3                        Legacy 3-file splitting for mate-pairs:
                                  First 2 biological reads satisfying dumping conditions
                                  are placed in files *_1.fastq and *_2.fastq
                                  If only 1 biological read is dumpable - it is placed in *.fastq

The explanation of the "--split-files" option says that each read will be dumped into a separate file, which is ambiguous and could be taken to mean that every single read gets its own file. It actually means that the two reads of each mate pair are written to the _1 and _2 fastq files respectively, which is the desired outcome.

For your example, you should upgrade to SRA Toolkit v2.1.2 and run the following commands:

$ wget -rq ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/litesra/SRX/SRX058/SRX058150/
$ fastq-dump --split-files ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/litesra/SRX/SRX058/SRX058150/SRR189044/SRR189044.lite.sra
$ head SRR189044*fastq
==> SRR189044_1.fastq <==
@SRR189044.1 HWI-EAS66_0007_FC705J6:5:1:1044:2452 length=76
NGTTTCACGTCTGGCGATTTTGACTCATTTTTGAACGAATGCAATGTAACNNNNNNNNAAAATGCAACAGGACCGN
+SRR189044.1 HWI-EAS66_0007_FC705J6:5:1:1044:2452 length=76
############################################################################
@SRR189044.2 HWI-EAS66_0007_FC705J6:5:1:1045:19202 length=76
NTCGATCGCACTGGCGAAGATGAGGAAGCTGTTCTTTCTGGTGATGCTGACNNNNGTCGCCTCGGCCACCGCCTGG
+SRR189044.2 HWI-EAS66_0007_FC705J6:5:1:1045:19202 length=76
#4767;;7<:>?@@##############################################################
@SRR189044.3 HWI-EAS66_0007_FC705J6:5:1:1047:3502 length=76
AACAGATTGTATATGTGTTTTTTTTACATGGCTCATTGGCAAATGTTTTTGNNNNATCGAAATCTTTCTCGTATAC

==> SRR189044_2.fastq <==
@SRR189044.1 HWI-EAS66_0007_FC705J6:5:1:1044:2452 length=76
ATTTTTTCTTTTAACTCATCCATAATTTGTCCTTTTTGTTGTCACCCACAGAACATAATGCTAGGATACTGTTTAA
+SRR189044.1 HWI-EAS66_0007_FC705J6:5:1:1044:2452 length=76
-62077328'//5A5:6?:BA5?--:=67C6?>>=5=;8,'--B?:A69?-5?>-<--;64-?:-CD:BA.@;>4>
@SRR189044.2 HWI-EAS66_0007_FC705J6:5:1:1045:19202 length=76
AGCTGCTCGCAGGCGAATATCAGCCAAGAGCAGAACATCACGTCGCATAGATTGGAGCGGTTCATCGAGACGAGCA
+SRR189044.2 HWI-EAS66_0007_FC705J6:5:1:1045:19202 length=76
DBDBC:5CC-A-A:CC-C,C55-::CCB,-DD--5?@=D?=:5D=:=?############################
@SRR189044.3 HWI-EAS66_0007_FC705J6:5:1:1047:3502 length=76
TATGATGTTTAATGCGTTCCCCATTTACTCTTATAAAGTCTTCTAGATTTGTTATTGCAATGCAGGAATTAATAGG
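
If you want to confirm that the two files are properly paired beyond the first few records, the following is a rough sketch that compares the read IDs in both files position by position. It assumes four-line FASTQ records and that mates share the same ID in the first field of the header line, as in the output above; pairs_in_sync is a hypothetical helper name, not an SRA Toolkit command.

```shell
# Succeed (exit status 0) only if both FASTQ files list the same read IDs
# in the same order. Assumes four-line records.
pairs_in_sync() {
    sync_ids1=$(mktemp)
    sync_ids2=$(mktemp)
    # Keep only the first field of every header line (every 4th line, starting at 1).
    awk 'NR % 4 == 1 { print $1 }' "$1" > "$sync_ids1"
    awk 'NR % 4 == 1 { print $1 }' "$2" > "$sync_ids2"
    cmp -s "$sync_ids1" "$sync_ids2"
    sync_status=$?
    rm -f "$sync_ids1" "$sync_ids2"
    return "$sync_status"
}

# Example usage:
#   pairs_in_sync SRR189044_1.fastq SRR189044_2.fastq && echo "pairs OK"
```

With the files from the original question (where _1 contains SRR189044.1 twice) this check should fail, and with the --split-files output it should pass.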

Hope this helps!


This explains EVERYTHING. But seriously, this option should be stressed. I can't believe I missed it, and without it my assemblies were making almost no contigs.


Follow-up question: is there a way to output an interleaved (shuffled) file for input to Velvet?
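
One possible approach, offered as a sketch rather than a built-in fastq-dump feature: take the _1 and _2 files produced by --split-files and interleave them afterwards. This assumes both files contain the same reads in the same order and use four-line records; interleave_fastq is a hypothetical helper name.

```shell
# Write an interleaved FASTQ to stdout: record 1 of file 1, record 1 of file 2,
# record 2 of file 1, record 2 of file 2, and so on.
# Assumes four-line records and equally long, identically ordered input files.
interleave_fastq() {
    awk -v mate="$2" '
        { print }                         # copy the current line of file 1
        NR % 4 == 0 {                     # after each complete record of file 1 ...
            for (i = 0; i < 4; i++) {     # ... copy the matching record of file 2
                if ((getline line < mate) > 0) print line
            }
        }' "$1"
}

# Example usage:
#   interleave_fastq SRR189044_1.fastq SRR189044_2.fastq > SRR189044_interleaved.fastq
```

The resulting file keeps each pair on adjacent records, which, as far as I know, is the layout Velvet expects for paired short-read input; check the Velvet manual for your version before relying on it.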


I would also recommend the --helicos option, which makes the generated fastq files smaller :)

9.3 years ago

I had the same problem. This help message is prone to misunderstanding: "Dump each read into a separate file".


I emailed SRA and pointed it out.
