SRA file looks weird after conversion to fastq
9.9 years ago
Saad Khan ▴ 440

Hi, I am trying to use small RNA data from CD34 bone marrow cells to compare with other private data that I have.

I just downloaded it from SRA (SRA ID: SRR772115) and converted it using fastq-dump, but the result doesn't look like a typical FASTQ file. Each read in a FASTQ file should be represented by 4 lines, which does not seem to be the case here. Here is what the converted FASTQ file looks like:

@SRR772115.1 FCC0B8BACXX:8:1101:1416:2034 length=49
TGGAATTCTCGGGTGCCAAGGAACTCCAGTCACGAGTGGATCTCGTATG
+SRR772115.1 FCC0B8BACXX:8:1101:1416:2034 length=49
bbbeeeeeggggghhiiiiihiiiiiiiiggfhicfghihhiiihhhii
@SRR772115.2 FCC0B8BACXX:8:1101:1317:2047 length=49
TGGAATTCTCGGGTGCCAAGGAACTCCAGTCACGAGTGGATCTCGTATG
+SRR772115.2 FCC0B8BACXX:8:1101:1317:2047 length=49
bbbeeeeegggggghiiiiiiiiiiiiiiihiiicgghhghhiiiihih
@SRR772115.3 FCC0B8BACXX:8:1101:1437:2047 length=49
TATGGTCGCAAGGCTGAAACTTAAAGAAATTGATGGAATTCTCGGGTGC

Can anybody tell me if I am missing something here, and how to get the FASTQ in the proper format?

regards

fastq SRA fastq-dump • 2.3k views
9.9 years ago
Ram 44k

It does look like well-formatted FASTQ, albeit with a bit of an odd quality encoding. Each read takes four lines: ID, sequence, ID again, and quality. You can use this script to find the encoding: https://github.com/brentp/bio-playground/blob/master/reads-utils/guess-encoding.py

@SRR772115.1 FCC0B8BACXX:8:1101:1416:2034 length=49 #ID
TGGAATTCTCGGGTGCCAAGGAACTCCAGTCACGAGTGGATCTCGTATG #Seq
+SRR772115.1 FCC0B8BACXX:8:1101:1416:2034 length=49 #ID
bbbeeeeeggggghhiiiiihiiiiiiiiggfhicfghihhiiihhhii #Qual

@SRR772115.2 FCC0B8BACXX:8:1101:1317:2047 length=49
TGGAATTCTCGGGTGCCAAGGAACTCCAGTCACGAGTGGATCTCGTATG
+SRR772115.2 FCC0B8BACXX:8:1101:1317:2047 length=49
bbbeeeeegggggghiiiiiiiiiiiiiiihiiicgghhghhiiiihih

@SRR772115.3 FCC0B8BACXX:8:1101:1437:2047 length=49
TATGGTCGCAAGGCTGAAACTTAAAGAAATTGATGGAATTCTCGGGTGC
...
...
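For a quick check without the linked script, the encoding can often be guessed from the range of ASCII codes in the quality lines. This is only a rough sketch and not the linked script's logic: the function name and the cutoffs (59 and 74, taken from the commonly quoted ranges for Phred+33 and Phred+64 data) are my assumptions.

```python
def guess_phred_offset(quality_strings):
    """Guess the ASCII offset (33 or 64) of FASTQ quality strings.

    Hypothetical helper for illustration. Returns None when the observed
    characters fit both offsets and no call can be made.
    """
    codes = [ord(char) for quality in quality_strings for char in quality]
    if not codes:
        raise ValueError("no quality characters supplied")
    lowest, highest = min(codes), max(codes)
    # Assumption: Phred+64 style encodings never use characters below ';' (59).
    if lowest < 59:
        return 33
    # Assumption: Phred+33 data rarely goes above 'J' (74).
    if highest > 74:
        return 64
    return None


# Quality lines of the first two reads shown in the question.
qualities = [
    "bbbeeeeeggggghhiiiiihiiiiiiiiggfhicfghihhiiihhhii",
    "bbbeeeeegggggghiiiiiiiiiiiiiiihiiicgghhghhiiiihih",
]
offset = guess_phred_offset(qualities)
print("guessed offset:", offset)
```

With only a handful of reads the guess can be ambiguous, so it is safer to feed it a few thousand quality lines from the real file.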
9.9 years ago
matted 7.8k

What do you think is wrong with it? These reads look fine, and FastQC agrees:

##FastQC        0.10.1
>>Basic Statistics      pass
#Measure        Value
Filename        temp.fq
File type       Conventional base calls
Encoding        Illumina 1.5
Total Sequences 2
Filtered Sequences      0
Sequence length 49
%GC     53

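It repeats the read name on the "+" line that precedes the quality string, as well as on the "@" header line above the nucleotides, but that's fine. The FASTQ format allows the name to be repeated after the "+".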
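If you want to convince yourself that the file really is four lines per read, a small check like the one below should do. It is a minimal sketch that assumes unwrapped records (one sequence line and one quality line per read); `read_fastq` and `format_record` are names I made up for illustration, not part of any toolkit.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, quality) from a 4-line-per-record FASTQ handle.

    Stops at end of file or at the first empty line.
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@"):
            raise ValueError("header line does not start with '@': " + header)
        if not separator.startswith("+"):
            raise ValueError("separator line does not start with '+': " + separator)
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ for " + header)
        yield header[1:], sequence, quality


def format_record(name, sequence, quality, repeat_name=False):
    """Write a record back out, with or without the name after the '+'."""
    separator = "+" + name if repeat_name else "+"
    return "\n".join(["@" + name, sequence, separator, quality]) + "\n"


# First read from the question, exactly as fastq-dump wrote it.
example = (
    "@SRR772115.1 FCC0B8BACXX:8:1101:1416:2034 length=49\n"
    "TGGAATTCTCGGGTGCCAAGGAACTCCAGTCACGAGTGGATCTCGTATG\n"
    "+SRR772115.1 FCC0B8BACXX:8:1101:1416:2034 length=49\n"
    "bbbeeeeeggggghhiiiiihiiiiiiiiggfhicfghihhiiihhhii\n"
)
records = list(read_fastq(io.StringIO(example)))
for name, sequence, quality in records:
    print(format_record(name, sequence, quality), end="")
```

The output is the same read with a bare "+" on the third line, which is the form most people are used to seeing. Both forms are equivalent for downstream tools that follow the format.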
