FASTQ files with integer instead of ASCII quality scores
6.7 years ago
bgbrink ▴ 60

I was going to align a batch of old FASTQ files with bwa and got no results. When I looked into the files, I saw that the base qualities are reported as integers rather than as ASCII characters:

@1_21_9:1:2:1565:591
GTGTTGTTTAGAAGCTGAACTACCTTTTTCGCTGAG
+1_21_9:1:2:1565:591
 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 31 5 40 40 1 40 15 40 40 40 40 40 4 2 40 40 15 1 39
@1_21_9:1:2:1307:745
GATCGGAAGAGCTCGTCTGCCGTCTTCTGCTTTGCT
+1_21_9:1:2:1307:745
 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 4 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 -2 1 1 1

Has anyone seen this encoding before, and is there a tool that can convert it into proper FASTQ?

Note that there are negative values as well. Could this be old Solexa quality scores?

sequencing • 4.2k views

That file does not meet the FASTQ format definition. Where did you get this data, by the way? Do you know which technology it is from?


I have seen GAIIx data delivered as separate sequence and score (integer) files. Maybe somebody just mashed them together without knowing that the scores need to be encoded...


That could be it. I don't have any hard proof of which technology this data is from, though. Does it still make sense to try to convert the scores manually?


If you have a clue which encoding/Phred scale is used, you could convert it to a sane FASTQ with some scripting. Alternatively, you could just convert it to a FASTA file and forget about the quality scores...
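If you go the FASTA route, something like the following minimal Python sketch might do. It is not from this thread: the function name `fastq_to_fasta` is made up for illustration, and it assumes every record is exactly four lines (header, sequence, `+` line, scores) with no blank lines in between.

```python
def fastq_to_fasta(lines):
    """Turn 4-line FASTQ-like records into FASTA lines, dropping the scores.

    Assumption: strictly four lines per record, no blank lines.
    """
    lines = [ln.rstrip("\n") for ln in lines]
    fasta = []
    for i in range(0, len(lines) - 3, 4):
        header, seq = lines[i], lines[i + 1]
        if not header.startswith("@"):
            raise ValueError(f"record at line {i + 1} does not start with '@'")
        fasta.append(">" + header[1:])  # '@name' becomes '>name'
        fasta.append(seq)               # the '+' line and the scores are skipped
    return fasta


# One record in the same layout as the question (scores shortened here).
record = [
    "@1_21_9:1:2:1565:591",
    "GTGTTGTTTAGAAGCTGAACTACCTTTTTCGCTGAG",
    "+1_21_9:1:2:1565:591",
    " 40 40 40 31 5",
]

for line in fastq_to_fasta(record):
    print(line)
```

Since the score line is never parsed, this works regardless of how the qualities were written.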

6.7 years ago
sacha ★ 2.4k

It seems you are using Solexa+64 encoding (scores from -5 to 40, each stored as the character with ASCII code score + 64). You can easily convert to ASCII with the help of the following picture:

[image]
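To make the mapping concrete, here is a small Python sketch of the "+64" conversion for a single score line. It is only an illustration: the function name `solexa_ints_to_ascii` is hypothetical, and the upper bound of 62 is my own assumption, chosen so that score + 64 stays within printable ASCII (the scores in the question never exceed 40).

```python
def solexa_ints_to_ascii(quality_line, offset=64):
    """Encode whitespace-separated integer scores as one ASCII string.

    Assumption: scores are Solexa-style integers, at least -5 and at most 62,
    so that score + 64 is a printable ASCII character.
    """
    scores = [int(tok) for tok in quality_line.split()]
    for s in scores:
        if not -5 <= s <= 62:
            raise ValueError(f"score {s} is outside the assumed range -5..62")
    return "".join(chr(s + offset) for s in scores)


# A few of the values that occur in the question, including a negative one.
print(solexa_ints_to_ascii(" 40 31 5 1 -2"))  # prints h_EA>
```

Keep in mind that the result is still on the +64 scale, so tools that expect Phred+33 need to be told about the offset, or the file has to be converted further.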

6.7 years ago
sacha ★ 2.4k

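I did it for you with awk:

awk -f convert.awk myfile.fastq

# convert.awk
# Convert an integer score to its Solexa+64 ASCII character.
function toascii(score)
{
    return sprintf("%c", score + 64)
}

# Header and sequence lines: print unchanged.
(NR - 1) % 4 == 0 || (NR - 1) % 4 == 1 {
    print $0
}

# Separator line: print a bare "+".
(NR - 1) % 4 == 2 {
    print "+"
}

# Quality line: encode every whitespace-separated score.
(NR - 1) % 4 == 3 {
    for (i = 1; i <= NF; i += 1) {
        printf("%s", toascii($i))
    }
    printf("\n")
}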
6.4 years ago
liartom2 ▴ 10

for (i=1; i < NF ; i+=1)

That should be i <= NF, my dude. With i < NF the last quality score of every read is dropped.
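A quick way to see the effect of the loop bound is to compare the length of the encoded quality string with the length of the read. The Python sketch below mimics awk's 1-based loop over fields; the helper `encode` and its `inclusive` switch are made up for this illustration.

```python
def encode(scores, inclusive=True):
    """Mimic the awk loop over fields $1..$NF with a +64 offset.

    inclusive=True  corresponds to  i <= NF
    inclusive=False corresponds to  i <  NF  (skips the last field)
    """
    nf = len(scores)
    last = nf if inclusive else nf - 1
    return "".join(chr(scores[i - 1] + 64) for i in range(1, last + 1))


seq = "GTGTT"                 # toy read of 5 bases
scores = [40, 40, 31, 5, 1]   # one score per base

print(len(seq), len(encode(scores)))                   # 5 5
print(len(seq), len(encode(scores, inclusive=False)))  # 5 4
```

A quality string that is shorter than its sequence makes the record invalid FASTQ, so a length check like this is a cheap sanity test after any conversion.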


Hi liartom2,

This reply is better suited as a comment on sacha's answer. Could you make the appropriate change please? That would involve the following steps:

Thank you!
