Hello,
I'd like to use a public dataset from SRA; this is one of the runs.
Here is some sample data, the first two reads in R1:
@ERR2204072.1 HWI-ST1450:172:C6H19ANXX:7:2315:16228:9537/1
ATTACCATCAGAATTGTACTGTTCTGTATCCCACCAGCAATGTCTAGGAATGCCTGTTTCTCCACAAAGTGTTTAC
+
%%$%%())))&)'))))))))))())()())))))))))()&&&)#)))))))))')))))))))))()&&%&)))
@ERR2204072.2 HWI-ST1450:172:C6H19ANXX:7:1104:8419:82653/1
GTTTAAACGAGATTGCCAGCACCGGGTATCATTCACCATTTTTCTTTTTGTTAACTTGCCGTCAGCCTTTTCTTTG
+
%%&&&))))))))))))))))())))))))))))))()))))))))&)&))%))))))))(%)))))))))))))!
A quick look rules out phred64, since every quality character in these reads falls well below ASCII 64. But if these are genuine phred33-encoded scores, then this would be dismal sequence (which is also what the graph on the SRA page shows, and how FastQC interprets it). Yet I can see from the alignment data that the sequences are actually very good: STAR aligns them well to the genome, with minimal mismatches.
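For reference, here is a minimal Python sketch of the quick check I mean. It only uses the two quality strings pasted above; the helper names `ascii_range` and `decode` are my own, not from any library, and two reads are of course not a representative sample of the whole run.

```python
# Quality strings copied from the two sample reads above.
QUALS = [
    "%%$%%())))&)'))))))))))())()())))))))))()&&&)#)))))))))')))))))))))()&&%&)))",
    "%%&&&))))))))))))))))())))))))))))))()))))))))&)&))%))))))))(%)))))))))))))!",
]


def ascii_range(quals):
    """Return (min, max) ASCII code over all quality characters."""
    codes = [ord(c) for q in quals for c in q]
    return min(codes), max(codes)


def decode(qual, offset=33):
    """Decode a quality string to integer scores, assuming the given offset."""
    return [ord(c) - offset for c in qual]


lo, hi = ascii_range(QUALS)
print("ASCII range:", lo, hi)

# Phred64 quality characters would need codes of 64 or more.
print("compatible with phred64:", lo >= 64)

# Read as phred33, the scores top out in the single digits.
print("max phred33 score:", max(max(decode(q)) for q in QUALS))
```

On these two reads the characters span ASCII 33 to 41, which decodes to Q0 through Q8 under phred33.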
So, can I reasonably conclude that these fastq files are the result of a workflow that misread good phred64 scores from an original fastq file as dismal phred33 scores, and then mistakenly re-encoded them as such in phred33? (If so, would it be reasonable to act on that and recalculate them?)
Or is there an unusual phred encoding I'm not familiar with that would explain these values?