Question

prefetch and fastq-dump problems?


4.4 years ago

debitboro ▴ 270

Dear all,

I want to download some SRA runs and then convert them to FASTQ files. For that, I used the following SRA Toolkit commands:

prefetch SRR3159525
fastq-dump SRR3159525.sra

The download completed successfully, and the size of the resulting fastq file seems correct (~8G).

But when I checked the content of the fastq file, I found that it was strangely formatted, as follows:

 @SRR3159522.1 2_33_78 length=50
 T..................................................G
 +SRR3159522.1 2_33_78 length=50
 !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
 @SRR3159522.2 2_36_51 length=50
 T..................................................G
 +SRR3159522.2 2_36_51 length=50
 !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
 @SRR3159522.3 2_39_77 length=50
 T30.0..2.0.....2.2..2.0..0......0....1...220.2.3322G
 +SRR3159522.3 2_39_77 length=50
 !(*!%!!(!%!!!!!%!%!!%!&!!%!!!!!!*!!!!&!!!%%*!%!&%'%!
 @SRR3159522.4 2_39_134 length=50
 T01.0..0.1.....2.0..2.2..2......1....1...231.0.3312G
 +SRR3159522.4 2_39_134 length=50
 !1&!(!!&!.!!!!!%!(!!%!%!!)!!!!!!)!!!!%!!!%%(!%!/)%%!
 ...
 ...

As you can see, the read sequences consist of integers (and dots) delimited by T and G. Why is that?

Thank you in advance for your help.

prefetch SRA fastq-dump • 1.6k views

ADD COMMENT • link updated 2.3 years ago by Ram 45k • written 4.4 years ago by debitboro ▴ 270


There is something weird with this submission. The initial reads look odd, as you posted, while some of the later ones look like this:

>gnl|SRA|SRR3159525.999995.1 60_1827_793 F3 (Biological)
ACGCATGCCTGCTGTAGTCAATTAAGTACACAAACTGACATCCANNNNNN
>gnl|SRA|SRR3159525.999995.2 60_1827_793 (Biological)
Empty read

Looks like Read 1 (50 bp) is somewhat OK, while Read 2 (35 bp) is empty :-(

Contact SRA support to see if they have anything to say.

ADD REPLY • link 4.4 years ago by GenoMax 152k

score 0 · Answer 1 · 2021-03-16


4.3 years ago

debitboro ▴ 270

After some googling, I found Biostars posts such as "Transforming And Manipulating Color Space Reads" that discuss the color-space representation of reads generated by sequencing instruments that operate in color space, such as ABI SOLiD. On such systems, the reads are stored as integers representing colors, and an encoding table can then be used to convert the integers to DNA bases.

Please refer to that excellent post, which explains the system in more detail: Transforming And Manipulating Color Space Reads
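
To make the encoding table concrete, here is a minimal Python sketch of the standard SOLiD two-base decoding. It is only an illustration: the function name `decode_colorspace` and the example read are made up, and it assumes the first character of the read is the known primer base and that "." marks a missing color call. Each color says how to get from the previous base to the next one (0 = same base, 1 = A<->C or G<->T, 2 = A<->G or C<->T, 3 = A<->T or C<->G), which works out to an XOR on the base indices A=0, C=1, G=2, T=3.

```python
BASES = "ACGT"  # index of each base: A=0, C=1, G=2, T=3


def decode_colorspace(read: str) -> str:
    """Naively translate a color-space read (primer base + colors) to bases.

    Hypothetical helper for illustration only. Assumes read[0] is the known
    primer base and the remaining characters are colors 0-3 or '.'.
    """
    if not read or read[0] not in BASES:
        raise ValueError("read must start with a primer base (A, C, G or T)")

    current = BASES.index(read[0])
    decoded = []
    for i, color in enumerate(read[1:]):
        if color not in "0123":
            # A missing call ('.') breaks the chain: every later base
            # depends on it, so the rest of the read is reported as N.
            decoded.append("N" * (len(read) - 1 - i))
            break
        # Two-base encoding: next base index = previous index XOR color.
        current ^= int(color)
        decoded.append(BASES[current])
    return "".join(decoded)


print(decode_colorspace("T3201"))   # made-up read with no missing calls
print(decode_colorspace("T32.01"))  # made-up read with one missing call
```

Because every base depends on all the colors before it, a single wrong color call corrupts the rest of the read in this naive translation. That is why color-space data is normally aligned in color space rather than converted to bases first.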

ADD COMMENT • link 4.3 years ago by debitboro ▴ 270


Indeed. SOLiD datasets are so rare that this was easy to miss. SOLiD data is likely not worth the hassle, since only one or two aligners (and only older versions) are likely to support it.

ADD REPLY • link 4.3 years ago by GenoMax 152k