I'm quite confused (I'm new to the sequence alignment field and come from a non-biological background). I have some understanding of read qualities, but I'm not sure how to handle them effectively. For example, some reads with their qualities in compact form are shown below.
@r0
GAACGATACCCACCCAACTATCGCCATTCCAGCAT
+
EDCCCBAAAA@@@@?>===<;;9:99987776554
@r1
CCGAACTGGATGTCTCATGGGATAAAAATCATCCG
+
EDCCCBAAAA@@@@?>===<;;9:99987776554
@r2
TCAAAATTGTTATAGTATAACACTGTTGCTTTATG
+
EDCCCBAAAA@@@@?>===<;;9:99987776554
I also know that @ has the lowest value and ~ has the highest value. If the error probability of a base is e, the Phred quality Q is:
Q = -10 * log(e) / log(10)
and the Solexa quality sQ is:
sQ = -10 * log(e / (1 - e)) / log(10)
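To make the two formulas concrete, here is a minimal Python sketch of them. The function names are my own choice for illustration and do not come from any particular library; `error_from_phred` is simply the Phred formula inverted.

```python
import math

def phred_quality(e):
    """Phred quality Q for an error probability e (0 < e <= 1)."""
    return -10 * math.log10(e)

def solexa_quality(e):
    """Solexa quality sQ for an error probability e (0 < e < 1)."""
    return -10 * math.log10(e / (1 - e))

def error_from_phred(q):
    """Invert the Phred formula: error probability for quality q."""
    return 10 ** (-q / 10)

if __name__ == "__main__":
    for e in (0.1, 0.01, 0.001):
        print(e, phred_quality(e), solexa_quality(e))
```

For small e the two scales are nearly identical, because e / (1 - e) is then close to e; they differ mainly for low-quality bases.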
What I would like to know is: when aligning a read against the reference genome, is the total quality of all the bases considered, or is each base considered individually? For example, the read
GAACGATACCCACCCAACTATCGCCATTCCAGCAT
has a quality
EDCCCBAAAA@@@@?>===<;;9:99987776554
Should I consider the quality base by base (for example, if the quality of any base is less than 40, don't try to align the read against the reference genome), or is the cumulative quality score of all the bases taken, so that the read is not considered for alignment if the total is below some threshold?
Is it also sensible to report the read quality along with valid alignments?
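To illustrate the two options I am asking about, here is a rough Python sketch. It assumes the quality characters are encoded as ASCII code minus 33 (an assumption on my part; several characters in the string above sort below @, so an offset of 64 would give negative scores). The function names and thresholds are hypothetical and only there to show the difference between the two filters; this is not how any particular aligner works.

```python
def decode_qualities(qual, offset=33):
    """Convert a quality string to a list of integer scores.

    offset=33 is an assumption about the encoding of these reads.
    """
    return [ord(c) - offset for c in qual]

def passes_per_base(scores, min_q):
    """Per-base filter: every single base must reach min_q."""
    return all(q >= min_q for q in scores)

def passes_mean(scores, min_mean):
    """Whole-read filter: only the average score must reach min_mean."""
    return sum(scores) / len(scores) >= min_mean

if __name__ == "__main__":
    qual = "EDCCCBAAAA@@@@?>===<;;9:99987776554"
    scores = decode_qualities(qual)
    print(scores)
    print("per-base >= 20:", passes_per_base(scores, 20))
    print("mean >= 20:", passes_mean(scores, 20))
```

With these illustrative thresholds the same read is rejected by the per-base filter (its last base falls just below 20) but accepted by the mean filter, which is exactly the difference I am unsure about.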
Thanks. Do you mean to say that the quality does not play any role if the read matches exactly?
The answer to this depends entirely on what you're doing. In general, if you don't know the answer to this, you probably shouldn't be writing a tool that might need to know it.