Hi All,
Here is a fun riddle I have been working on for the last day or two. Can Q10 data be better that Q30 data? I have been trying to figure out what is the most fair way to filter out "low" quality data. We all agree that the Phred scale is helpful with this question with a Q10 representing a 1:10 error rate or a 90% correctness, with Q20 and Q30 being .99 and .999 correct.
So using the fastx toolkit to filter out lower quality reads you are allowed to tweak the Q score minimum and the the percent of each read having that minimum Q score. for example if I want a minimum of Q30 and each read has to have 100% of it at Q30 I know for a fact that my reads are 0.999% correct. Great! The highest level of quality I can have also leaves me with almost no data to work with (I'm working with Ion Torrent data my average now is around Q33, but earlier data was around Q28), so we make trade offs. How much error am I able to limit while still having "high" quality data? So I say that I can live with Q20 100% of the time and only have reads that are at least .99% correct.
However here is where it gets fun, so lets say I just want reads that have 80% Q30 this means I would have a 20% error rate across all the the bases. If this is how you would figure it out 1 - (0.8 * 0.999) where 0.8 is the 80% of the read and the 0.999 is the minimum quality. So that would imply that a 90% Q20 setting would have less error than 80% Q30; which is still small than 100% Q10 data.
So if this correct should all data just be filtered to be at least Q10 across 100% of all the reads and then move forward in the pipeline?
Most of your statements are incorrect. To highlight just a few:
No, the likelihood that any BASE is correct is 99.9% (or 0.999, but not 0.999%). The likelihood that the read is correct is 0.999^n, where n = number of bases in the read. For 100-bp reads, that likelihood is ~90%.
Only if Q=0 for the bases below Q30. And even that is incorrect, since a random (Q0) basecall will be right 25% of the time. You would need to know the Q value of the sub-Q30 bases to calculate the likely error rate.
Again, the value is unknowable without the sub-threshold Q values. And you're also ignoring the fact that the 90% Q20 is more likely to contain an error than the 80% Q30 (60% vs 8% for a 100-bp read).
Sounds like someone needs a refresher in probability and statistics...
Well, I cannot say you are wrong about the stats course since I have never taken one, however since you are clearly looking to take me to school. I have a few more questions.
I understand how you say the likelihood that the read is correct is 0.999^n, but that would imply the the longer the read the less likely it would be correct (i.e. 400bp read at Q30 ~ 67%) so that would mean that anything over 700bps would have less then a 50/50 chance of being correct. So then how can we trust single pass sanger for gap closure? Or how can PacBio have only a 20% error rate? Since each read length for them can be 10kbp the likelihood for that length of read being correct would be less then just randomly placing nucleotides in a line.
Also how do you figure out the expected error rate for the 90% Q20 and 80% Q30 to be 60% vs 8% per 100bp read?
The longer the read, the more likely it contains an error. All sequencing technologies address the problem of errors by redundancy. Sanger gap-filling relies on multiple overlapping clones/sequences. PacBio resequences the same molecules multiple times. Illumina takes advantage of its massive sequence depth to generate low-error consensus.
For 100-bp, 0.99^90 = 40% perfect sequence (or 60% with at least one error); 0.999^80 = 92% perfect.
Consider yourself schooled :-).
Some of the errors are non-random, and vary by technology. And the impact of even random errors depends upon your application, ranging from a negligible effect on differential gene expression to a huge headache for variant analysis in heterogeneous/cancer samples.
Sorry, I don't have any recommendations.