phred encoding issue in public dataset
19 months ago
bioruffo ▴ 40

Hello,
I'd like to use a public dataset from SRA; this is one of the runs.

I'll put here some sample data, the first two reads in R1:

@ERR2204072.1 HWI-ST1450:172:C6H19ANXX:7:2315:16228:9537/1
ATTACCATCAGAATTGTACTGTTCTGTATCCCACCAGCAATGTCTAGGAATGCCTGTTTCTCCACAAAGTGTTTAC
+
%%$%%())))&)'))))))))))())()())))))))))()&&&)#)))))))))')))))))))))()&&%&)))
@ERR2204072.2 HWI-ST1450:172:C6H19ANXX:7:1104:8419:82653/1
GTTTAAACGAGATTGCCAGCACCGGGTATCATTCACCATTTTTCTTTTTGTTAACTTGCCGTCAGCCTTTTCTTTG
+
%%&&&))))))))))))))))())))))))))))))()))))))))&)&))%))))))))(%)))))))))))))!

A quick look rules out phred64, since every quality character here is below ASCII 64. But if these were genuine phred33-encoded scores, this would be dismal sequence (which is what the graph on the SRA page shows, and how FastQC interprets it). Yet the alignment data tell me the sequences are actually very good: STAR aligns them well to the genome, with minimal mismatches.
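To make the arithmetic concrete, here is a minimal sketch that decodes the quality string of the first read above under both offsets. The helper name `decode_quals` is only illustrative and does not come from any particular library.

```python
def decode_quals(qual, offset):
    """Convert a FASTQ quality string to integer Phred scores for a given ASCII offset."""
    return [ord(ch) - offset for ch in qual]

# Quality line of the first read (ERR2204072.1), copied from the sample above.
read1 = "%%$%%())))&)'))))))))))())()())))))))))()&&&)#)))))))))')))))))))))()&&%&)))"

for name, offset in (("phred33", 33), ("phred64", 64)):
    scores = decode_quals(read1, offset)
    print(name, "min =", min(scores), "max =", max(scores))
# phred33 min = 2 max = 8
# phred64 min = -29 max = -23
```

Under phred64 every score comes out negative, which is not possible for a real Phred value, while under phred33 the scores top out at Q8.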

So, can I reasonably conclude that these fastq files are the result of a workflow that misread good phred64 scores from an original fastq file as dismal phred33 scores, and then mistakenly wrote them back out as phred33? (If so, it would be reasonable to act on that and recalculate them.)

Or, is there any unusual phred encoding I'm not familiar with, that would explain these values?

phred fastq • 642 views
19 months ago
GenoMax 147k

It is possible to have perfectly good sequence (that aligns well) with poor quality scores, which can happen for various reasons, e.g. overloading the flowcell. It is also possible for the first few hundred reads from a flowcell to have bad quality.
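One way to check whether the low scores are confined to the first reads is to look at per-read mean quality across the whole file rather than just the top. The following is a rough sketch, assuming strict 4-line FASTQ records and phred33 encoding; the function names and the Q20 threshold are illustrative choices, not part of any tool mentioned here.

```python
from itertools import islice

def mean_quals(lines, offset=33):
    """Yield the mean Phred score of each read, in file order.

    Assumes strict 4-line FASTQ records (no wrapped sequence lines).
    """
    it = iter(lines)
    while True:
        record = list(islice(it, 4))
        if len(record) < 4:
            break  # end of input, or a truncated final record
        qual = record[3].rstrip("\n")
        if qual:
            yield sum(ord(c) - offset for c in qual) / len(qual)

def fraction_below(lines, threshold=20, offset=33):
    """Return the fraction of reads whose mean quality is below the threshold."""
    means = list(mean_quals(lines, offset))
    return sum(m < threshold for m in means) / len(means) if means else 0.0
```

Both functions accept any iterable of lines, so a handle from `gzip.open(path, "rt")` on a downloaded FASTQ file can be passed directly. If nearly all reads fall below the threshold, the low scores are a property of the whole file rather than of the first few hundred reads.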

BTW testformat.sh from BBMap suite reports the data to be in sanger format.
