I am trying to put together a set of heuristics to predict which exome sequencing variants (Illumina sequencing with Agilent or TruSeq capture) will or will not validate with Sanger sequencing. This is a recurring problem in our lab: many variants called by samtools plus another tool do not validate with Sanger, even though the alignment and base qualities look fine when I inspect the BAM files with samtools tview. These are mostly heterozygous calls, and I noticed that in most cases the variant allele makes up 20-30% of the reads, with a minimum depth of 10 and coverage in the range of 30x.
What should I look for when deciding which calls are less reliable? Essentially, can I do a better job of assigning a SNP score to variants by manually scrutinizing the BAM file, and if so, what is the protocol for doing that?
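For what it's worth, one rough way to put a number on the 20-30% observation is to ask how often a true heterozygote would show that few variant reads by chance. The sketch below is only an illustration, not an established protocol: it assumes each read at a heterozygous site carries the variant allele independently with probability 0.5 (an idealization; capture and alignment can shift this), and the function name `het_lower_tail` is made up for the example.

```python
from math import comb


def het_lower_tail(alt_reads: int, depth: int, p: float = 0.5) -> float:
    """Return P(X <= alt_reads) for X ~ Binomial(depth, p).

    p = 0.5 is an assumed variant-read probability for a true heterozygote.
    """
    if not 0 <= alt_reads <= depth:
        raise ValueError("alt_reads must be between 0 and depth")
    return sum(
        comb(depth, k) * p**k * (1 - p) ** (depth - k)
        for k in range(alt_reads + 1)
    )


# 20%, 30% and 50% variant reads at a depth of 30
for alt in (6, 9, 15):
    print(alt, het_lower_tail(alt, 30))
```

Under that assumption, nine or fewer variant reads out of 30 has a probability of roughly 2%, and six or fewer is rarer still, so a call sitting at 20-30% at this depth is already an unusual heterozygote. At a depth of 10 the same test has much less power, so it is more useful for ranking calls than as a hard cutoff.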
Sure, I was just thinking that 20-30% could come from amplification of early-stage PCR errors, if PCR is used.
Are the errors arising from alignment errors or from sequencing errors, or something else?
That's what I am trying to find out. I use the base and mapping quality coloring in samtools tview, and both seem to be OK for the most part. I also checked whether the variant was supported mostly by bases at the ends of the short reads, and that was not the case either.
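The three checks described here (base quality, mapping quality, position within the read) could be tabulated per site instead of judged by eye. The following is a minimal sketch under stated assumptions: the `AltObservation` record, the `summarize_alt_reads` function and the 5-base `end_margin` are all hypothetical names and values chosen for illustration, and the per-read values would have to be extracted from the BAM file by some other means.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class AltObservation:
    """One read supporting the variant allele (hypothetical record)."""
    base_quality: int     # Phred quality of the variant base
    mapping_quality: int  # MAPQ of the read
    read_position: int    # 0-based offset of the variant base in the read
    read_length: int


def summarize_alt_reads(observations, end_margin=5):
    """Summarize the evidence carried by the variant-supporting reads."""
    if not observations:
        raise ValueError("no variant-supporting reads given")
    # Distance to the nearer read end, compared with the assumed margin
    near_end = sum(
        1
        for o in observations
        if min(o.read_position, o.read_length - 1 - o.read_position) < end_margin
    )
    return {
        "n_alt_reads": len(observations),
        "mean_base_quality": mean(o.base_quality for o in observations),
        "mean_mapping_quality": mean(o.mapping_quality for o in observations),
        "fraction_near_read_end": near_end / len(observations),
    }


# Made-up values, for illustration only
example = [
    AltObservation(35, 60, 2, 100),
    AltObservation(30, 60, 50, 100),
    AltObservation(25, 40, 98, 100),
    AltObservation(30, 60, 10, 100),
]
summary = summarize_alt_reads(example)
print(summary)
```

Comparing these summaries between calls that validated and calls that did not might show whether any of the three measures actually separates them.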
Do Agilent and TruSeq involve amplification of your template material?
Yes, exome capture and sequencing involve amplification. My question is not so much about the error profiles and how to get better sequence, but rather about how to identify problematic SNP calls from aligned BAM files.
Do you limit the number of positions that a given read can map to on the reference?
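To illustrate why this matters: as far as I know, BWA and several other aligners give a mapping quality of 0 to reads that fit equally well in more than one place, so variant support coming mainly from such reads is suspect. The sketch below recomputes the variant allele fraction after dropping low-MAPQ reads. The function name `alt_fraction`, the threshold of 20 and the read counts are all assumptions made up for the example.

```python
def alt_fraction(reads, min_mapq=0):
    """Variant allele fraction among reads with MAPQ >= min_mapq.

    reads: iterable of (supports_variant, mapq) pairs.
    Returns None when no read passes the filter.
    """
    kept = [supports for supports, mapq in reads if mapq >= min_mapq]
    if not kept:
        return None
    return sum(kept) / len(kept)


# Made-up pileup: 30 reads, 8 supporting the variant, 6 of those with MAPQ 0
reads = [(True, 0)] * 6 + [(True, 60)] * 2 + [(False, 60)] * 22

print(alt_fraction(reads))               # all reads
print(alt_fraction(reads, min_mapq=20))  # confidently placed reads only
```

In this invented case the site looks like a 20-30% call when every read is counted, but most of the variant support disappears once ambiguously placed reads are excluded.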