prefetch and fastq-dump problems?
3.7 years ago
debitboro ▴ 270

Dear all,

I want to download some SRA runs (SRR accessions) and convert them to FASTQ files. To do that, I used the following SRA Toolkit commands:

prefetch SRR3159525
fastq-dump SRR3159525.sra

The download completed successfully, and the size of the resulting FASTQ file seems correct (~8G).

But when I checked the content of the FASTQ file, I found that it was strangely formatted, as follows:

 @SRR3159522.1 2_33_78 length=50
 T..................................................G
 +SRR3159522.1 2_33_78 length=50
 !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
 @SRR3159522.2 2_36_51 length=50
 T..................................................G
 +SRR3159522.2 2_36_51 length=50
 !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
 @SRR3159522.3 2_39_77 length=50
 T30.0..2.0.....2.2..2.0..0......0....1...220.2.3322G
 +SRR3159522.3 2_39_77 length=50
 !(*!%!!(!%!!!!!%!%!!%!&!!%!!!!!!*!!!!&!!!%%*!%!&%'%!
 @SRR3159522.4 2_39_134 length=50
 T01.0..0.1.....2.0..2.2..2......1....1...231.0.3312G
 +SRR3159522.4 2_39_134 length=50
 !1&!(!!&!.!!!!!%!(!!%!%!!)!!!!!!)!!!!%!!!%%(!%!/)%%!
 ...
 ...

As you can see, the read sequences consist of integers (and dots) delimited by a leading T and a trailing G. Why is that?

Thank you in advance for your help.

prefetch SRA fastq-dump • 1.3k views

There is something weird with this submission. The initial reads look odd, as you posted, while some of the later ones look like this:

>gnl|SRA|SRR3159525.999995.1 60_1827_793 F3 (Biological)
ACGCATGCCTGCTGTAGTCAATTAAGTACACAAACTGACATCCANNNNNN
>gnl|SRA|SRR3159525.999995.2 60_1827_793 (Biological)
Empty read

It looks like Read 1 (50 bp) is somewhat OK, but Read 2 (35 bp) is empty :-(

Contact SRA support to see if they have anything to say.

3.7 years ago
debitboro ▴ 270

After some searching, I found Biostars posts such as "Transforming And Manipulating Color Space Reads" that discuss the color-space representation of reads produced by sequencing instruments such as ABI SOLiD. On such systems, the reads are written as digits representing colors, and an encoding table can be used to convert the digits to DNA bases.

Please refer to this excellent post, which explains the system in more detail: Transforming And Manipulating Color Space Reads
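
To make the encoding table concrete, here is a minimal Python sketch of the standard SOLiD two-base decoding, in which each color digit describes the transition from the previous base to the next one (0 keeps the base; 1, 2 and 3 swap it according to a fixed table). The function name `decode_colorspace` is my own, not part of any toolkit, and the handling of `.` (everything after a missing color becomes `N`) is an assumption about how one might want to treat missing calls. The trailing `G` in the reads posted above is not handled specially; it is simply treated as a non-color character.

```python
# Sketch only: decode one SOLiD color-space read into bases.
# decode_colorspace is a hypothetical helper, not an SRA Toolkit function.
BASES = "ACGT"  # indices 0..3; next_base_index = prev_base_index XOR color


def decode_colorspace(read: str) -> str:
    """Translate e.g. 'T0123' into bases, dropping the leading primer base."""
    if not read or read[0] not in BASES:
        raise ValueError("read must start with a primer base (A, C, G or T)")
    prev = BASES.index(read[0])
    decoded = []
    for i, color in enumerate(read[1:], start=1):
        if color not in "0123":
            # A missing call ('.') breaks the chain, so every later base
            # is unknown; pad the remainder with N (an assumption).
            decoded.append("N" * (len(read) - i))
            break
        prev ^= int(color)
        decoded.append(BASES[prev])
    return "".join(decoded)


if __name__ == "__main__":
    print(decode_colorspace("T0123"))
```

Because each base depends on all the colors before it, a single wrong or missing color corrupts the rest of the decoded read, which is why color-space data is usually aligned in color space rather than converted up front.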


Indeed. SOLiD datasets are so rare that this was easy to miss. SOLiD data is likely not worth the hassle, since probably only one or two aligners (in older versions) support it.
