Question

How To Predict Which Next Gen Variants Would Not Validate By Sanger

7

13.7 years ago

Biomed 5.0k

I am trying to put together a set of heuristics to predict which exome sequencing variants (Illumina + Agilent or TruSeq capture) will or will not validate with Sanger sequencing. We have a problem with this in our lab: a lot of variants called by samtools + another tool don't validate with Sanger, yet when I look at the BAM files with samtools tview, the alignment and base qualities seem to be OK. These are mostly heterozygous calls, and I noticed that in most cases the variant allele makes up 20-30% of the reads, with a minimum depth of 10 and coverage of around 30x.

What should I look for when deciding which calls are less reliable? Essentially, can I do a better job of assigning a SNP score to variants by manually scrutinizing the BAM file, and if so, what is the protocol for doing that?

bam samtools exome • 4.9k views

updated 13.7 years ago by Bioinfosm ▴ 620 • written 13.7 years ago by Biomed 5.0k

1

Sure, I was just thinking that 20-30% could come from amplification of early-stage PCR errors, if PCR is used.

13.7 years ago by Casbon ★ 3.3k

0

Are the errors arising from alignment errors or from sequencing errors, or something else?

13.7 years ago by Casbon ★ 3.3k

0

That's what I am trying to find out. I use the base and mapping quality coloring in samtools tview, and they seem to be OK for the most part. I also checked whether the variant was supported mostly by bases at the ends of the short reads, and that was not the case either.

13.7 years ago by Biomed 5.0k

0

Do Agilent and TruSeq involve amplification of your template material?

13.7 years ago by Casbon ★ 3.3k

0

Yes, exome capture and sequencing involve amplification. My question is not so much about the error profiles and how to get better sequence, but rather how to identify problematic SNP calls from aligned BAM files.

13.7 years ago by Biomed 5.0k

0

Do you limit the number of positions that a given read can map to on the reference?

13.7 years ago by Russh ★ 1.2k

score 1 · Answer 1 · 2011-07-27

There are a couple of things that I would consider. SNP quality is an obvious one. A minimum cutoff I've seen used is 20; if you raise that to 50 or greater, you should get better validation. The second thing to look at is exactly what you mentioned about heterozygosity: calls with a low percentage of variant reads will probably not validate well. Try filtering for [1] high SNP quality, [2] good read depth (greater than or equal to 30 for substitutions) and [3] positions where variant reads make up 50% or more of the total.
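As a rough illustration of the three filters above, here is a minimal Python sketch. The `Call` record and its field names are hypothetical (they are not the output format of samtools or any other caller), and the default thresholds simply mirror the numbers suggested in this answer; they would need tuning against your own validation data.

```python
from dataclasses import dataclass


@dataclass
class Call:
    """Hypothetical summary of one variant call (not a real tool's format)."""
    chrom: str
    pos: int
    snp_quality: float  # caller-assigned SNP quality
    depth: int          # total reads covering the position
    alt_reads: int      # reads supporting the variant allele


def passes_filters(call, min_quality=50, min_depth=30, min_alt_fraction=0.5):
    """Return True if the call meets all three suggested thresholds."""
    if call.depth == 0:
        return False
    alt_fraction = call.alt_reads / call.depth
    return (
        call.snp_quality >= min_quality
        and call.depth >= min_depth
        and alt_fraction >= min_alt_fraction
    )


# Made-up example calls, for illustration only.
calls = [
    Call("chr1", 1000, 60.0, 40, 22),  # passes all three filters
    Call("chr1", 2000, 25.0, 35, 18),  # SNP quality too low
    Call("chr2", 3000, 80.0, 12, 6),   # depth too low
    Call("chr2", 4000, 70.0, 30, 8),   # variant fraction too low
]
kept = [c for c in calls if passes_filters(c)]
print([(c.chrom, c.pos) for c in kept])
```

Keeping the thresholds as keyword arguments makes it easy to loosen one filter at a time and see which criterion is responsible for dropping a given call.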

PCR errors are an excellent possibility. Also make sure to mark PCR / optical duplicates before calling variants. That won't fix early PCR errors, but it will help with early PCR bias.

Obviously, to come up with a good heuristic you need to Sanger-validate as many variants as possible across a range of SNP qualities, read depths and heterozygosity levels. Then let the data dictate the empirical heuristic parameters.
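One way to let the data dictate the parameters is to tabulate the Sanger validation rate per bin of each metric. The sketch below assumes each validated variant is stored as a dict with a numeric metric and a boolean `validated` flag; the record layout, the helper name and the example values are all made up for illustration.

```python
from collections import defaultdict


def validation_rate_by_bin(records, key, bin_width):
    """Group records into bins of `key` and return {bin_start: fraction validated}."""
    totals = defaultdict(int)
    confirmed = defaultdict(int)
    for rec in records:
        bin_start = int(rec[key] // bin_width) * bin_width
        totals[bin_start] += 1
        if rec["validated"]:
            confirmed[bin_start] += 1
    return {b: confirmed[b] / totals[b] for b in sorted(totals)}


# Made-up Sanger results, for illustration only.
sanger_results = [
    {"snp_quality": 25, "validated": False},
    {"snp_quality": 30, "validated": False},
    {"snp_quality": 35, "validated": True},
    {"snp_quality": 45, "validated": False},
    {"snp_quality": 55, "validated": True},
    {"snp_quality": 58, "validated": True},
    {"snp_quality": 70, "validated": True},
]
rates = validation_rate_by_bin(sanger_results, key="snp_quality", bin_width=20)
for bin_start, rate in rates.items():
    print(f"quality {bin_start}-{bin_start + 19}: {rate:.2f} validated")
```

The same function can be pointed at read depth or variant read percentage by changing `key` and `bin_width`; a cutoff can then be placed where the validation rate rises to an acceptable level.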

score 1 · Answer 2 · 2011-08-09

That's a critical question. It seems you are already looking at base quality and read depth, but that is not helping to sift out the calls that would validate. I would recommend doing realignment and recalibration after the first-pass alignment. That really cleans up the data.

Another thing to check along with base quality is the mapping quality. A MAPQ above 20 is good.
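For illustration, a MAPQ cutoff can be applied to SAM-formatted text with a few lines of Python, since MAPQ is the fifth tab-separated column of a SAM alignment line. The function name and the example reads are made up, and treating the cutoff as "at least 20" rather than strictly above 20 is an assumption.

```python
def filter_by_mapq(sam_lines, min_mapq=20):
    """Yield header lines unchanged and alignment lines with MAPQ >= min_mapq."""
    for line in sam_lines:
        if line.startswith("@"):
            yield line  # header line, keep as-is
            continue
        fields = line.rstrip("\n").split("\t")
        if int(fields[4]) >= min_mapq:  # column 5 of SAM is MAPQ
            yield line


# Made-up SAM records, for illustration only.
sam = [
    "@SQ\tSN:chr1\tLN:1000",
    "read1\t0\tchr1\t100\t60\t5M\t*\t0\t0\tACGTA\tIIIII",
    "read2\t0\tchr1\t100\t3\t5M\t*\t0\t0\tACGTT\tIIIII",
    "read3\t16\tchr1\t101\t20\t5M\t*\t0\t0\tCGTAC\tIIIII",
]
kept_lines = list(filter_by_mapq(sam))
print(len(kept_lines))
```

On real BAM files, `samtools view -q 20` applies the same kind of minimum-MAPQ cutoff without any custom code.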

One could try running other variant callers such as GATK or SNVMix and take the intersection of the variant calls; those would have a higher chance of validating.
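Taking the intersection can be sketched in plain Python, assuming each caller's output has been written as VCF text. The helper name `variant_keys` is made up, and the sketch compares calls by exact (chromosome, position, ref, alt), so it does no normalization of equivalent indel representations.

```python
def variant_keys(vcf_lines):
    """Collect (chrom, pos, ref, alt) keys from the lines of a VCF file."""
    keys = set()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and VCF header lines
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        for allele in alt.split(","):  # multi-allelic sites list ALTs with commas
            keys.add((chrom, int(pos), ref, allele))
    return keys


# Made-up records from two hypothetical callers, for illustration only.
caller_a = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t60\t.\t.",
    "chr1\t200\t.\tC\tT\t22\t.\t.",
]
caller_b = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG,T\t75\t.\t.",
    "chr2\t300\t.\tG\tA\t40\t.\t.",
]
shared = variant_keys(caller_a) & variant_keys(caller_b)
print(sorted(shared))
```

In practice the lines would come from open file handles, for example `variant_keys(open(path))` for an uncompressed VCF.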

The trade-off is between fewer false positives and fewer false negatives. Depending on which matters more, you can change the stringency.

If you have a sample with a known list of variants (or some public data), you could analyze the distribution of false calls against the % of variant reads, to see whether there is more noise at some % of variant reads.
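That analysis could look something like the following sketch, which bins calls by the percentage of variant reads and reports the fraction in each bin that is absent from the known variant list. The tuple layout, the function name and the example data are assumptions made for illustration.

```python
from collections import Counter


def false_call_rate_by_fraction(calls, known_variants, bin_pct=10):
    """Return {bin_start_percent: fraction of calls not in known_variants}.

    Each call is a tuple (chrom, pos, alt, alt_reads, depth).
    """
    total = Counter()
    false_calls = Counter()
    for chrom, pos, alt, alt_reads, depth in calls:
        if depth == 0:
            continue
        pct = 100 * alt_reads // depth
        # Put 100% into the top bin instead of a bin of its own.
        bin_start = min(pct // bin_pct * bin_pct, 100 - bin_pct)
        total[bin_start] += 1
        if (chrom, pos, alt) not in known_variants:
            false_calls[bin_start] += 1
    return {b: false_calls[b] / total[b] for b in sorted(total)}


# Made-up truth set and calls, for illustration only.
known = {("chr1", 100, "A"), ("chr1", 200, "T"), ("chr2", 300, "G")}
observed = [
    ("chr1", 100, "A", 15, 30),
    ("chr1", 200, "T", 14, 30),
    ("chr1", 250, "C", 7, 30),
    ("chr2", 300, "G", 8, 32),
    ("chr2", 350, "A", 6, 28),
]
noise = false_call_rate_by_fraction(observed, known)
for bin_start, rate in noise.items():
    print(f"{bin_start}-{bin_start + 9}% variant reads: {rate:.2f} false")
```

A bin where the false-call rate jumps would be a natural place to put a cutoff, subject to how many true calls fall in the same bin.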

Good luck, and do update us on your findings.

score 0 · Answer 3 · 2011-08-02

In my experience, the biggest source of miscalls was PCR amplification of polymerase errors, but that was with limiting amounts of starting material in amplicon sequencing. My fix was to create "A method for counting PCR template molecules with application to next-generation sequencing". Since you are using randomly fragmented material, you can remove PCR duplicates as suggested by DocRoberson.

I found that misalignments were a source of spurious SNP calls. These were often around indels, and sometimes all aligners do stupid things. They had a characteristic pattern of lots of low-quality calls across multiple samples, so it was possible to build a heuristic to exclude them.
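The answer does not spell out the heuristic, so the following is only one possible reading of it: flag positions where several samples each have a low-quality call. The function name, the data layout and both thresholds are assumptions, not the author's actual method.

```python
from collections import defaultdict


def recurrent_low_quality_sites(calls_by_sample, max_quality=20, min_samples=3):
    """Return sites with a call of quality < max_quality in >= min_samples samples.

    calls_by_sample maps a sample name to a list of (chrom, pos, quality) tuples.
    """
    samples_at_site = defaultdict(set)
    for sample, calls in calls_by_sample.items():
        for chrom, pos, quality in calls:
            if quality < max_quality:
                samples_at_site[(chrom, pos)].add(sample)
    return {
        site
        for site, samples in samples_at_site.items()
        if len(samples) >= min_samples
    }


# Made-up calls from four samples, for illustration only.
cohort = {
    "s1": [("chr1", 500, 12), ("chr1", 900, 70)],
    "s2": [("chr1", 500, 8), ("chr2", 150, 15)],
    "s3": [("chr1", 500, 17), ("chr1", 900, 65)],
    "s4": [("chr1", 500, 55), ("chr2", 150, 10)],
}
suspect = recurrent_low_quality_sites(cohort)
print(sorted(suspect))
```

Calls at the flagged sites could then be excluded or set aside for manual inspection in samtools tview.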

If you cannot understand why they are being called wrongly on a case-by-case basis, I don't think anything automated will help. You could exclude het calls at 20-30% MAF, but presumably this would exclude valid calls as well.