I tried to align a bunch of old FASTQ files with bwa and got no results. When I looked into the files, I saw that the base qualities are reported as space-separated integers rather than ASCII characters:
@1_21_9:1:2:1565:591
GTGTTGTTTAGAAGCTGAACTACCTTTTTCGCTGAG
+1_21_9:1:2:1565:591
40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 31 5 40 40 1 40 15 40 40 40 40 40 4 2 40 40 15 1 39
@1_21_9:1:2:1307:745
GATCGGAAGAGCTCGTCTGCCGTCTTCTGCTTTGCT
+1_21_9:1:2:1307:745
40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 4 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 -2 1 1 1
Has anyone seen this encoding before, and does anyone know of a tool that can convert it into proper FASTQ?
Note that there are negative values as well. Could this be old Solexa quality scores?
That file does not meet the FASTQ format definition. Where did you get this data, BTW? Do you know what technology it is from?
I have seen GAIIx data that was in separate sequence and score (as integers) files. Maybe somebody just mashed them together without knowing that they need to be encoded...
That could be it. I don't have any hard proof of what technology this data is from, though. Does it still make sense to try and convert the scores manually?
If you have a clue which encoding/Phred scale is used, you could convert it to a sane FASTQ with some scripting. Alternatively, you could just convert it to a FASTA file and forget about the quality scores...
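For what it's worth, here is a minimal Python sketch of the scripting route. It rests on an assumption nobody in this thread has confirmed: that the integers are Solexa-scaled scores, which would explain the negative values. Under that assumption each score is mapped to Phred with Q_phred = 10 * log10(10^(Q_solexa / 10) + 1), rounded, and written as Phred+33 ASCII. The function names (`solexa_to_phred`, `encode_qualities`, `convert_records`) are made up for this example. If the integers turn out to be Phred scores already, pass `assume_solexa=False`, although negative values would then be rejected as invalid.

```python
import math


def solexa_to_phred(q_solexa: int) -> int:
    """Map a Solexa-scaled score to the nearest integer Phred score."""
    return int(round(10 * math.log10(10 ** (q_solexa / 10) + 1)))


def encode_qualities(int_line: str, assume_solexa: bool = True, offset: int = 33) -> str:
    """Turn '40 40 -2 1' into an ASCII quality string (Phred+33 by default)."""
    scores = [int(tok) for tok in int_line.split()]
    if assume_solexa:
        # ASSUMPTION: the input integers are on the Solexa scale.
        scores = [solexa_to_phred(q) for q in scores]
    if any(q < 0 for q in scores):
        raise ValueError("negative score cannot be encoded as Phred")
    return "".join(chr(q + offset) for q in scores)


def convert_records(lines, assume_solexa: bool = True):
    """Yield the lines of a standard FASTQ from integer-quality records.

    Expects four lines per record (header, sequence, '+' line, integers);
    blank lines are ignored.
    """
    non_empty = [ln.strip() for ln in lines if ln.strip()]
    if len(non_empty) % 4 != 0:
        raise ValueError("input is not a multiple of four lines")
    for i in range(0, len(non_empty), 4):
        header, seq, plus, quals = non_empty[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        encoded = encode_qualities(quals, assume_solexa)
        if len(encoded) != len(seq):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        # The read name on the '+' line is optional, so it is dropped here.
        yield from (header, seq, "+", encoded)
```

To use it, open the old file, pass the file handle to `convert_records`, and write each yielded line followed by a newline to a new file. The length check should catch records where the sequence and score files were mashed together out of sync.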