Hi, I recently used bowtie to map reads from an individual back to the reference genome for that species. Unsatisfied with the results, I bumped up the parameters to improve the mapping percentage. However, upon filtering the newly mapped files for a quality score of 20, I found that the great majority of the newly mapped reads had a quality score between 10 and 20. Looking at some of these mappings in IGV suggested they were not really that bad, with little heterozygosity and decent depth.
Could these be good mappings, worth keeping? Or is the quality filter of 20 really necessary before moving ahead?
Thanks,
There are a couple of things being conflated here. Are you talking about MAPQ (mapping quality) or the per-base quality scores of the original reads? You can filter on either. Bowtie does not actually produce MAPQ scores, though bowtie2 does. Did you perhaps actually use bowtie2?
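To make the distinction concrete, here is a minimal sketch using pysam (assumed installed) on a hypothetical "aln.bam": MAPQ is a single number per alignment, while base qualities are one Phred score per base of the read.

```python
import pysam

# Hypothetical file; MAPQ lives on the alignment, base qualities on the read.
with pysam.AlignmentFile("aln.bam", "rb") as bam:
    for read in bam:
        mapq = read.mapping_quality        # alignment-level confidence (one int)
        base_quals = read.query_qualities  # per-base Phred scores (array)
        if mapq >= 20:                     # the MAPQ filter discussed here
            pass                           # keep the alignment
```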
If you filter on MAPQ, it depends heavily on the aligner, the genome, and the quality of the data; there's no simple answer as to what makes a good MAPQ cutoff, and you'd probably have to calibrate it. Values under 5 are pretty bad, though: the SAM specification defines MAPQ as -10*log10 of the probability that the mapping position is wrong, so a MAPQ of 3 or less means the aligner believes there is at most about a 50% chance the alignment is correct. Those are probably not useful.
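For intuition, here's a tiny sketch of what the spec's formula implies for a few thresholds (the numbers are just the ones discussed in this thread); in practice the cutoff is usually applied with something like `samtools view -q 20`.

```python
# Per the SAM spec, MAPQ = -10 * log10 P(mapping position is wrong),
# so the implied error probability is 10 ** (-MAPQ / 10).

def mapq_to_error_prob(mapq: int) -> float:
    """Probability that the reported mapping position is wrong."""
    return 10 ** (-mapq / 10)

for mapq in (3, 5, 10, 20):
    print(f"MAPQ {mapq:>2}: P(wrong position) = {mapq_to_error_prob(mapq):.3f}")
# MAPQ  3: P(wrong position) = 0.501
# MAPQ  5: P(wrong position) = 0.316
# MAPQ 10: P(wrong position) = 0.100
# MAPQ 20: P(wrong position) = 0.010
```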
Hey, sorry for not being very clear. I did mean mapping quality scores, and yes, I was using bowtie2. Do you have any further pointers as to how these calibrations are done, and what people usually look out for? I couldn't find much literature on it; maybe I'm not looking in the right places.
Biofinysics - How does bowtie2 assign MAPQ scores?
How to calibrate depends on your experiment. But, for example, in the past I calibrated human exome sequencing by looking at the true and false positive rates of called variants. These can be approximated via a couple of methods:
1) Assume that every variant listed in dbSNP (or a similar database) is correct, and every novel variant is wrong. Then optimize for the maximal discovery rate of dbSNP variants and the minimal discovery rate of non-dbSNP variants (a rough sketch of this is below, after the list). You could optionally ignore extremely rare variants for this calculation. You might also ignore novel variants that show up even with the most conservative settings, on the assumption that those are real.
or
2) Use trios (parents + child) to calculate concordant and discordant variants. A concordant variant is one found in the child and in at least one of the parents; a discordant variant is one found only in the child. You can increase the power using ploidy information: a homozygous variant in a parent that is absent in the child is also discordant, for example. This is also sketched below.
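For concreteness, here's a minimal sketch of option 1. Everything in it is hypothetical: it assumes you've already called variants at several MAPQ cutoffs and loaded a set of known dbSNP sites; the names and data structures are mine, not any real API.

```python
# Hypothetical sketch of option 1: score each MAPQ cutoff by how many
# known (dbSNP) variants it recovers versus how many novel calls it makes.
# Sites are (chromosome, position) tuples; everything below is toy data.

dbsnp_sites = {("chr1", 100), ("chr1", 250), ("chr2", 40)}
calls_by_cutoff = {
    10: {("chr1", 100), ("chr1", 250), ("chr2", 40), ("chr2", 99)},
    20: {("chr1", 100), ("chr2", 40)},
}

for cutoff in sorted(calls_by_cutoff):
    calls = calls_by_cutoff[cutoff]
    known = len(calls & dbsnp_sites)  # proxy for true positives
    novel = len(calls - dbsnp_sites)  # proxy for false positives
    print(f"MAPQ >= {cutoff}: {known} known, {novel} novel")

# Pick the cutoff that keeps known-variant recovery high while the
# novel (presumed false) call count stays acceptably low.
```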
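And a similarly hypothetical sketch of option 2, using the simple genotype encoding 0 = hom-ref, 1 = het, 2 = hom-alt. Only the rules described above are implemented; a real tool would check full Mendelian consistency.

```python
# Hypothetical sketch of option 2: classify a site in a trio as
# concordant or discordant. Genotypes: 0 = hom-ref, 1 = het, 2 = hom-alt.

def classify_trio_site(child: int, mother: int, father: int) -> str:
    if child > 0 and (mother > 0 or father > 0):
        return "concordant"    # child's variant is seen in a parent
    if child > 0:
        return "discordant"    # variant found only in the child
    if mother == 2 or father == 2:
        return "discordant"    # a hom-alt parent must pass on the allele
    return "uninformative"     # no variant anywhere at this site

print(classify_trio_site(1, 0, 1))  # concordant
print(classify_trio_site(1, 0, 0))  # discordant (only in the child)
print(classify_trio_site(0, 2, 0))  # discordant (hom-alt parent, absent in child)
```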