Question

Can Q10 be better than Q30

0

Entering edit mode

9.5 years ago

skbrimer ▴ 740

Hi All,

Here is a fun riddle I have been working on for the last day or two. Can Q10 data be better that Q30 data? I have been trying to figure out what is the most fair way to filter out "low" quality data. We all agree that the Phred scale is helpful with this question with a Q10 representing a 1:10 error rate or a 90% correctness, with Q20 and Q30 being .99 and .999 correct.

So using the fastx toolkit to filter out lower quality reads you are allowed to tweak the Q score minimum and the the percent of each read having that minimum Q score. for example if I want a minimum of Q30 and each read has to have 100% of it at Q30 I know for a fact that my reads are 0.999% correct. Great! The highest level of quality I can have also leaves me with almost no data to work with (I'm working with Ion Torrent data my average now is around Q33, but earlier data was around Q28), so we make trade offs. How much error am I able to limit while still having "high" quality data? So I say that I can live with Q20 100% of the time and only have reads that are at least .99% correct.

However here is where it gets fun, so lets say I just want reads that have 80% Q30 this means I would have a 20% error rate across all the the bases. If this is how you would figure it out 1 - (0.8 * 0.999) where 0.8 is the 80% of the read and the 0.999 is the minimum quality. So that would imply that a 90% Q20 setting would have less error than 80% Q30; which is still small than 100% Q10 data.

So if this correct should all data just be filtered to be at least Q10 across 100% of all the reads and then move forward in the pipeline?

sequencing Assembly software-error • 8.2k views

ADD COMMENT • link updated 2.8 years ago by Ram 45k • written 9.5 years ago by skbrimer ▴ 740

3

Entering edit mode

Most of your statements are incorrect. To highlight just a few:

for example if I want a minimum of Q30 and each read has to have 100% of it at Q30 I know for a fact that my reads are 0.999% correct.

No, the likelihood that any BASE is correct is 99.9% (or 0.999, but not 0.999%). The likelihood that the read is correct is 0.999^n, where n = number of bases in the read. For 100-bp reads, that likelihood is ~90%.

so lets say I just want reads that have 80% Q30 this means I would have a 20% error rate across all the the bases.

Only if Q=0 for the bases below Q30. And even that is incorrect, since a random (Q0) basecall will be right 25% of the time. You would need to know the Q value of the sub-Q30 bases to calculate the likely error rate.

So that would imply that a 90% Q20 setting would have less error than 80% Q30

Again, the value is unknowable without the sub-threshold Q values. And you're also ignoring the fact that the 90% Q20 is more likely to contain an error than the 80% Q30 (60% vs 8% for a 100-bp read).

Sounds like someone needs a refresher in probability and statistics...

ADD REPLY • link updated 5.5 years ago by Ram 45k • written 9.5 years ago by harold.smith.tarheel ★ 5.0k

0

Entering edit mode

Well, I cannot say you are wrong about the stats course since I have never taken one, however since you are clearly looking to take me to school. I have a few more questions.

I understand how you say the likelihood that the read is correct is 0.999^n, but that would imply the the longer the read the less likely it would be correct (i.e. 400bp read at Q30 ~ 67%) so that would mean that anything over 700bps would have less then a 50/50 chance of being correct. So then how can we trust single pass sanger for gap closure? Or how can PacBio have only a 20% error rate? Since each read length for them can be 10kbp the likelihood for that length of read being correct would be less then just randomly placing nucleotides in a line.

Also how do you figure out the expected error rate for the 90% Q20 and 80% Q30 to be 60% vs 8% per 100bp read?

ADD REPLY • link updated 5.5 years ago by Ram 45k • written 9.5 years ago by skbrimer ▴ 740

1

Entering edit mode

The longer the read, the more likely it contains an error. All sequencing technologies address the problem of errors by redundancy. Sanger gap-filling relies on multiple overlapping clones/sequences. PacBio resequences the same molecules multiple times. Illumina takes advantage of its massive sequence depth to generate low-error consensus.

For 100-bp, 0.99^90 = 40% perfect sequence (or 60% with at least one error); 0.999^80 = 92% perfect.

Consider yourself schooled :-).

ADD REPLY • link 9.5 years ago by harold.smith.tarheel ★ 5.0k

0

Entering edit mode

HaHa! Thanks Harold. I think I'm learning. So since the error is mostly random, machine idiosyncraties aside for this discussion, and sequence addresses this problem with depht of coverage at point is the Q score important? In theory if you had enough depht of coverage the error rate could be ignored right?

ADD REPLY • link 9.5 years ago by skbrimer ▴ 740

0

Entering edit mode

Some of the errors are non-random, and vary by technology. And the impact of even random errors depends upon your application, ranging from a negligible effect on differential gene expression to a huge headache for variant analysis in heterogeneous/cancer samples.

ADD REPLY • link 9.5 years ago by harold.smith.tarheel ★ 5.0k

0

Entering edit mode

Thanks Harold, this seems to be a recurring them of "it depends" which makes since but do you have a good text on this that you could direct me to? I'm trying to have confidence in my work and this is clearly a weak spot.

ADD REPLY • link 9.5 years ago by skbrimer ▴ 740

0

Entering edit mode

Sorry, I don't have any recommendations.

ADD REPLY • link 9.5 years ago by harold.smith.tarheel ★ 5.0k

Ram · Answer 1 · 2015-11-05

1

Entering edit mode

9.5 years ago

Antonio R. Franco ★ 5.2k

If you are using Illumina data, try using fastq_quality_trimmer and you will get the answer yourself

EDITED: I believe that this approach can also be used for Torrent data

ADD COMMENT • link updated 5.5 years ago by Ram 45k • written 9.5 years ago by Antonio R. Franco ★ 5.2k

0

Entering edit mode

I'm sorry I'm a little confused on why the trimmer option would answer my question. I will try it but If I want to trim my data from the 3' until I reach the threshold I set I do not see how this will improve the filtering. Do you trim and then filter the data? Also this would not help you if you have a over/undercall surrounded by confident calls. Am I incorrect?

ADD REPLY • link updated 5.5 years ago by Ram 45k • written 9.5 years ago by skbrimer ▴ 740

1

Entering edit mode

You just compare the results with both methods and will see that with the trimmer program you will effectively filter your reads without losing so much data

ADD REPLY • link updated 5.5 years ago by Ram 45k • written 9.5 years ago by Antonio R. Franco ★ 5.2k

Ram · Answer 2 · 2015-11-05

1

Entering edit mode

9.5 years ago

Brian Bushnell 20k

For quality-filtering, it's best to use the average expected error rate. For example, with the BBMap package:

bbduk.sh in=reads.fq out=filtered.fq maq=20

"maq" stands for min average quality. This will filter reads that have an an expected error rate greater than 1% overall. It is not calculated by averaging the quality values, which does not make sense since they are log-scaled, but rather by transforming them to probabilities and averaging those.

As Franco said, though, trimming is also useful for platforms in which the read quality deteriorates with increasing length.

But! Filtering or trimming to a very high level like Q20 is not a universally good idea. It's generally a bad idea, as it increases the bias in your data. Furthermore, Q-scores are not necessarily very accurate. What are you doing with the data, that you feel it needs aggressive quality-filtering?

ADD COMMENT • link updated 5.5 years ago by Ram 45k • written 9.5 years ago by Brian Bushnell 20k

0

Entering edit mode

I'm trying to figure out if I have a sub-population in my sample or not. I'm working with a ssRNA virus and have several snps that are linked in reads and in the same position I have reads that contain only one of the snps. I'm trying to figure out if I have more than one population. I thought if I could figure out the balance between the error rate and the amount of data I need to show it is or is not sub-population that maybe the viral popularisation reconstruction algorithms would do better. Basically reducing the noise.

ADD REPLY • link 9.5 years ago by skbrimer ▴ 740

1

Entering edit mode

Oh, virus data can be hard to work with; at least the ones I've seen never seem to match the reference very well. Yep, in that case quality filtering or trimming could be useful. But I don't generally recommend a threshold above Q15.

ADD REPLY • link updated 5.5 years ago by Ram 45k • written 9.5 years ago by Brian Bushnell 20k

0

Entering edit mode

Thank you Brian, this has been a very helpful discussion.

ADD REPLY • link updated 5.5 years ago by Ram 45k • written 9.5 years ago by skbrimer ▴ 740

0

Entering edit mode

Brian, do you have a good text on bioinformatics that you could direct me to?

ADD REPLY • link 9.5 years ago by skbrimer ▴ 740

0

Entering edit mode

Sorry, I've never read any bioinformatics books...

ADD REPLY • link 9.5 years ago by Brian Bushnell 20k

score 1 · Answer 3 · 2015-11-05

Quality trimming is generally not as straightforward as one would think. Some tasks (e.g. SNP calling) demand data of a higher quality than others (e.g. differential gene expression). And as you noted, there is a trade-off between stringent trimming and remaining data. As such, sometimes trimming will help - if you start from very deep sequencing, but may hurt a lot - if you have somewhat shallow sequencing from the beginning.

Regarding software, some do better than others, even under the "same" stringency settings, because there are a number of different trimming algorithms. I think (I may remember it wrongly) FastX toolkit is considered as performing poorly compared to other programs, and seqtk is considered particularly good.

There are some papers (e.g here and here) comparing trimming and downstream performance, reaching somewhat discordant conclusions