In a SAM file suppose I'm looking at a contiguous segment of bases on a certain read.
Each base has a QUAL score. What are the reasonable / standard mathematical strategies for combining the QUAL scores for these adjacent bases, to get a QUAL score for the segment?
cheers,
josh
I know it may not seem relevant, but what are you trying to do with the result? In other words, what is the scientific rationale for doing so?
It depends what are you trying to achieve. Refrase your question in probabilistic terms. Each qual score reflect a probability of error such that:
If you treat each base as independent, you can make calculations as to the probability that this segment of N bases contains 0 errors, 1, 2 ... N
OK I'm putting the MAPQs to one side for the time being and just focusing on the QUALs. The application is in the genomics of RNA viruses. In this area, even if the host organism was infected with a single virion for example, by the time the sample is taken the viral infection can easily have evolved into multiple quasi-species within that one infected host.
So, we're looking at the NGS data (e.g. SAM file) to scan for important variations within the sample, for example 28% of the reads come from a quasi-species which has evolved an important immuno-escaping mutation at a certain location. The variations we are scanning for take the form of short amino-acid motifs, sometimes just a single AA residue. The alignment proposed by the SAM file allows us to locate the reading frame for a given read, and thereby translate it (as I mentioned we're ignoring the MAPQs for now).
So I guess the probabilistic rephrasing would be "what is the probability that this read actually contains this variation".
Now I think about it, to do this to the highest level of correctness possible we'd also need to take into account translation code redundancy.
Im not sure you can. Ultimately, the read either maps correctly, or it doesn't. There is no "83.5% match" at the read level. The bases have quality scores as this is useful information for mapping SNPs, etc - but any statistic that averages these quality scores is going to unappreciated the stochastic nature of read mapping. I would rephrase the question in terms of your end goals :)
Good point, but it depends if OP wants to measure the probability of error to the template sequence or an aligned reference where you have to take into account mapping and normal divergence.
Yup yup - for sure. But say the first 5 bases all have terrible quality (perhaps because it really was just terrible quality, or perhaps because you are following the GATK pipeline and converting adapter sequences to poor quality bases), however the other 45 bases all map perfectly with high quality and the map itself is unique.
There would be no reason in such a situation to degrade this read to anything other than 100% mapped - because thats what it is :) OK, so you don't know what the first five bases are - but who cares? - the other 45 put it exactly at position X with 99.9999% likelyhood. It is, essentially, just as valid (and potentially more so) than a 50bp read with all high-quality sequencing, that maps to a region of low-complexity. This is because the question "How confident am I that my read has mapped correctly?" - is best answered using the DNA sequence itself, (and how often similar sequences appear in the genome), not base sequencing probabilities or any average thereof :)
In short, when people want an average QUAL, they really either want to plot the distribution of the MAPQs, or they want a FastQC quality-per-base-position plot. Perhaps in this case OP wants a distribution of QUALs but without respect to position. Either way, I can't understand the logic for grouping on reads. However, I'm frequently wrong so lets see what OP has planned :)
Illumina bins Q-scores for the raw sequence data to save on storage space (at least that is the justification given) for newer sequencers. You can find their Q-score binning scheme in this file: https://www.illumina.com/content/dam/illumina-marketing/documents/products/technotes/technote_understanding_quality_scores.pdf.
Note: Original post is referring to QUAL for bases (and for alignments) in the same question. Comment above is only for bases.