Question

phred encoding issue in public dataset



2.1 years ago

bioruffo ▴ 40

Hello,
I'd like to use a public dataset from SRA; this is one of the runs.

I'll put here some sample data, the first two reads in R1:

@ERR2204072.1 HWI-ST1450:172:C6H19ANXX:7:2315:16228:9537/1
ATTACCATCAGAATTGTACTGTTCTGTATCCCACCAGCAATGTCTAGGAATGCCTGTTTCTCCACAAAGTGTTTAC
+
%%$%%())))&)'))))))))))())()())))))))))()&&&)#)))))))))')))))))))))()&&%&)))
@ERR2204072.2 HWI-ST1450:172:C6H19ANXX:7:1104:8419:82653/1
GTTTAAACGAGATTGCCAGCACCGGGTATCATTCACCATTTTTCTTTTTGTTAACTTGCCGTCAGCCTTTTCTTTG
+
%%&&&))))))))))))))))())))))))))))))()))))))))&)&))%))))))))(%)))))))))))))!

A quick look would rule out phred64, since every quality character shown is below '@' (ASCII 64). But if these were actual phred33-encoded scores, this would be a dismal run (which is what the graph on the SRA page shows, and also how FastQC interprets it). Yet I can see from the alignment data that the sequences are actually very good: STAR aligns them well to the genome, with minimal mismatches.
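To make the "quick look" concrete, here is a minimal Python sketch that checks the ASCII range of the two quality strings above and decodes them under each offset. The helper names (`ascii_range`, `decode`) are made up for illustration; the only facts assumed are the standard offsets of 33 and 64.

```python
# Quality strings copied from the two sample reads above.
QUALS = [
    "%%$%%())))&)'))))))))))())()())))))))))()&&&)#)))))))))')))))))))))()&&%&)))",
    "%%&&&))))))))))))))))())))))))))))))()))))))))&)&))%))))))))(%)))))))))))))!",
]


def ascii_range(quals):
    """Return the lowest and highest ASCII code seen in the quality strings."""
    codes = [ord(c) for q in quals for c in q]
    return min(codes), max(codes)


def decode(qual, offset=33):
    """Convert a quality string to integer scores for the given offset."""
    return [ord(c) - offset for c in qual]


lo, hi = ascii_range(QUALS)
print("ASCII range:", lo, hi)               # 33 41
print("as phred33:", lo - 33, "to", hi - 33)  # 0 to 8
print("as phred64:", lo - 64, "to", hi - 64)  # -31 to -23 (not valid scores)
```

Under an offset of 64 every score would be strongly negative, which is why phred64 can be excluded; under an offset of 33 the scores run from Q0 to Q8, which matches the poor quality that FastQC and the SRA page report.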

So, could I reasonably conclude that these fastq files are the likely result of a workflow that interpreted good phred64 scores from an original fastq file as dismal phred33 scores, and then mistakenly re-encoded them as such in phred33? (That would mean it'd be reasonable to act on it and recalculate them.)
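One way to probe this hypothesis is to ask what the scores would look like if they had been shifted down by the 31-character difference between the two offsets (64 - 33). The sketch below does only that arithmetic on the second sample read; `shift_up` is a hypothetical helper, and the result is not evidence that such a conversion actually happened in the submitter's workflow.

```python
OFFSET_33 = 33
OFFSET_64 = 64
SHIFT = OFFSET_64 - OFFSET_33  # 31, the gap between the two encodings


def shift_up(qual, shift=SHIFT):
    """Move every quality character up by `shift` ASCII positions."""
    return "".join(chr(ord(c) + shift) for c in qual)


# Quality string of the second sample read above.
qual = "%%&&&))))))))))))))))())))))))))))))()))))))))&)&))%))))))))(%)))))))))))))!"

shifted = shift_up(qual)
scores = [ord(c) - OFFSET_33 for c in shifted]

print(shifted[:10])              # DDEEEHHHHH
print(min(scores), max(scores))  # 31 39
```

After the shift, the values fall between Q31 and Q39 when read as phred33. That range would be consistent with the good alignments, but it is still only a plausibility check; comparing against the submitter's description of the processing, or against other runs from the same study, would be needed before recalculating anything.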

Or, is there any unusual phred encoding I'm not familiar with, that would explain these values?

phred fastq • 872 views

updated 2.1 years ago by GenoMax 151k • written 2.1 years ago by bioruffo ▴ 40

score 1 · Answer 1 · 2023-04-20

It is possible to have perfectly good sequence (that aligns well) with crappy quality scores, which could be due to various reasons, e.g. overloading. It is also possible for the first few hundred reads from a flowcell to have bad quality.

BTW, testformat.sh from the BBMap suite reports the data to be in Sanger format (i.e. phred33).