Question

FASTQ / SAM file processing: combining base QUAL scores

0

Entering edit mode

8.8 years ago

joshua.b.singer • 0

In a SAM file suppose I'm looking at a contiguous segment of bases on a certain read.

Each base has a QUAL score. What are the reasonable / standard mathematical strategies for combining the QUAL scores for these adjacent bases, to get a QUAL score for the segment?

cheers,

josh

Assembly next-gen • 2.8k views

ADD COMMENT • link updated 2.4 years ago by Ram 44k • written 8.8 years ago by joshua.b.singer • 0

0

Entering edit mode

I know it may not seem relevant, but what are you trying to do with the result? In other words, what is the scientific rationale for doing so?

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 8.8 years ago by Sean Davis 27k

0

Entering edit mode

It depends what are you trying to achieve. Refrase your question in probabilistic terms. Each qual score reflect a probability of error such that:

P[error] = 10^{-QUAL/10)

If you treat each base as independent, you can make calculations as to the probability that this segment of N bases contains 0 errors, 1, 2 ... N

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 8.8 years ago by Gabriel R. ★ 2.9k

0

Entering edit mode

OK I'm putting the MAPQs to one side for the time being and just focusing on the QUALs. The application is in the genomics of RNA viruses. In this area, even if the host organism was infected with a single virion for example, by the time the sample is taken the viral infection can easily have evolved into multiple quasi-species within that one infected host.

So, we're looking at the NGS data (e.g. SAM file) to scan for important variations within the sample, for example 28% of the reads come from a quasi-species which has evolved an important immuno-escaping mutation at a certain location. The variations we are scanning for take the form of short amino-acid motifs, sometimes just a single AA residue. The alignment proposed by the SAM file allows us to locate the reading frame for a given read, and thereby translate it (as I mentioned we're ignoring the MAPQs for now).

So I guess the probabilistic rephrasing would be "what is the probability that this read actually contains this variation".

Now I think about it, to do this to the highest level of correctness possible we'd also need to take into account translation code redundancy.

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 8.8 years ago by joshua.b.singer • 0

0

Entering edit mode

Im not sure you can. Ultimately, the read either maps correctly, or it doesn't. There is no "83.5% match" at the read level. The bases have quality scores as this is useful information for mapping SNPs, etc - but any statistic that averages these quality scores is going to unappreciated the stochastic nature of read mapping. I would rephrase the question in terms of your end goals :)

ADD REPLY • link 8.8 years ago by John 13k

0

Entering edit mode

Good point, but it depends if OP wants to measure the probability of error to the template sequence or an aligned reference where you have to take into account mapping and normal divergence.

ADD REPLY • link 8.8 years ago by Gabriel R. ★ 2.9k

0

Entering edit mode

Yup yup - for sure. But say the first 5 bases all have terrible quality (perhaps because it really was just terrible quality, or perhaps because you are following the GATK pipeline and converting adapter sequences to poor quality bases), however the other 45 bases all map perfectly with high quality and the map itself is unique.

There would be no reason in such a situation to degrade this read to anything other than 100% mapped - because thats what it is :) OK, so you don't know what the first five bases are - but who cares? - the other 45 put it exactly at position X with 99.9999% likelyhood. It is, essentially, just as valid (and potentially more so) than a 50bp read with all high-quality sequencing, that maps to a region of low-complexity. This is because the question "How confident am I that my read has mapped correctly?" - is best answered using the DNA sequence itself, (and how often similar sequences appear in the genome), not base sequencing probabilities or any average thereof :)

In short, when people want an average QUAL, they really either want to plot the distribution of the MAPQs, or they want a FastQC quality-per-base-position plot. Perhaps in this case OP wants a distribution of QUALs but without respect to position. Either way, I can't understand the logic for grouping on reads. However, I'm frequently wrong so lets see what OP has planned :)

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 8.8 years ago by John 13k

0

Entering edit mode

Illumina bins Q-scores for the raw sequence data to save on storage space (at least that is the justification given) for newer sequencers. You can find their Q-score binning scheme in this file: https://www.illumina.com/content/dam/illumina-marketing/documents/products/technotes/technote_understanding_quality_scores.pdf.

Note: Original post is referring to QUAL for bases (and for alignments) in the same question. Comment above is only for bases.

ADD REPLY • link 8.8 years ago by GenoMax 147k

Ram · Answer 1 · 2016-02-17

1

Entering edit mode

8.8 years ago

Sean Davis 27k

I'm going to venture an answer here to the question that I think you meant to ask but did not. Please correct my interpretation if I missed something.

http://www.ncbi.nlm.nih.gov/pubmed/25178459

http://www.ncbi.nlm.nih.gov/pubmed/26554718

ADD COMMENT • link updated 2.4 years ago by Ram 44k • written 8.8 years ago by Sean Davis 27k

0

Entering edit mode

Thanks, this is 100% relevant.

It seems the two key points are (a) minimum quality of a base within a codon can be used as a proxy for quality of the codon as a whole, when scanning for single AA low-frequency variants (b) for best results, the quality threshold needs to adapt to the overall quality context of the run.

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 8.8 years ago by joshua.b.singer • 0