Question

Allelic imbalance in multiplex PCR-based amplicon data

7.2 years ago

J.F.Jiang ▴ 930

Hi all,

We used multiplex PCR to enrich the target regions and then sequenced them on the HiSeq platform.

For germline variants, the fraction of reads supporting each allele (ref vs. alt) should ideally be around 0.5 at heterozygous sites.

However, in our data this fraction is sometimes less than 0.1 according to the GATK calling results.
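For reference, here is a minimal sketch of how such a fraction can be computed from one sample column of a VCF record. It assumes the "ratio" is the ALT read fraction derived from the AD (allelic depth) field that GATK writes; the function name and the example record are hypothetical and are not taken from our data.

```python
def alt_allele_fraction(format_field, sample_field):
    """Return the fraction of reads supporting non-reference alleles.

    format_field: the VCF FORMAT column, e.g. "GT:AD:DP:GQ:PL"
    sample_field: the matching sample column
    Returns None when AD is missing or the informative depth is zero.
    """
    keys = format_field.split(":")
    values = sample_field.split(":")
    ad = dict(zip(keys, values)).get("AD")
    if ad is None or "." in ad:
        return None
    depths = [int(x) for x in ad.split(",")]  # REF count first, then ALT counts
    total = sum(depths)
    if total == 0:
        return None
    return sum(depths[1:]) / total


# Hypothetical record: 188 REF reads and 12 ALT reads
print(alt_allele_fraction("GT:AD:DP:GQ:PL", "0/1:188,12:200:99:120,0,5400"))
```

Note that AD counts only the reads the caller considered informative, so the value can differ slightly from a fraction computed from raw pileup depth.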

I am wondering why this could happen for germline variants.

The most confusing part is that the genotype calls differ even when this fraction is almost the same, e.g., 0.06 is called homozygous but 0.07 is called heterozygous.

It would be great if you could give me some suggestions.

allelic multiplex PCR amplicon • 1.5k views

ADD COMMENT • link updated 7.2 years ago by Kevin Blighe 88k • written 7.2 years ago by J.F.Jiang ▴ 930

Solving this kind of problem during genotype assignment and variant calling is not easy, because it depends on several factors, such as sequencing and mapping errors, that any variant caller takes into account. Some of them can be handled using the methods mentioned by Kevin, but further filtering can also be done by applying a stricter genotype quality threshold (although HaplotypeCaller itself applies one by default). Most of these quality values are Phred-scaled likelihoods, so they are the tool's estimate of the genotype, again taking sequencing and mapping errors into consideration, and that is the best it can estimate from the sequencing data.
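To make the Phred scaling concrete, here is a small sketch that turns a PL vector into relative genotype probabilities and a GQ-like value. It assumes the usual VCF convention that PL is -10*log10 of the genotype likelihood, normalized so the best genotype has PL 0, and that GQ is the gap between the two lowest PLs capped at 99, which is my understanding of how GATK reports it. The flat prior, the function names, and the example PL values are illustrative assumptions.

```python
def genotype_probabilities(pl):
    """Convert Phred-scaled likelihoods (PL) to probabilities under a flat prior."""
    likelihoods = [10 ** (-x / 10) for x in pl]
    total = sum(likelihoods)
    return [lk / total for lk in likelihoods]


def genotype_quality(pl, cap=99):
    """Difference between the two smallest PL values, capped (assumed GQ rule)."""
    ordered = sorted(pl)
    return min(ordered[1] - ordered[0], cap)


# Hypothetical PL for genotypes 0/0, 0/1, 1/1 at a low-allele-fraction site
pl = [30, 0, 700]
print(genotype_probabilities(pl))
print(genotype_quality(pl))
```

With a PL vector like this one, the heterozygous call wins by only 30 Phred units, so a small change in the read counts can move the best genotype to the neighbouring one.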

You can also try GATK's genotype refinement tool to refine your assigned genotypes if you have a truth set for the kind of data you are exploring.

ADD REPLY • link 7.2 years ago by prasundutta87 ▴ 670

Thanks,

Yes, I do find it tough to determine the "true" call when allelic imbalance occurs.

The GATK refinement workflow requires a truth set, such as trio/pedigree data, as prior knowledge to adjust the variant calls. However, our data come from a sporadic population, and adjusting without such a dataset sometimes makes things even worse.

We also used the 1KG dataset as the truth set, and the results were similar.

So we believe refinement will not work well if no trio/pedigree data are provided.

ADD REPLY • link 7.2 years ago by J.F.Jiang ▴ 930

score 0 · Answer 1 · 2017-09-09

7.2 years ago

Kevin Blighe 88k

This is a frequent problem in NGS data analysis, i.e., expecting a germline heterozygous variant at ~50% frequency but observing it at <20% frequency.
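As a rough illustration of how unlikely such a skew is by chance alone, the sketch below computes the binomial probability of seeing at most a given number of ALT reads at a true heterozygous site. It assumes reads are independent draws with p = 0.5, which is optimistic for amplicon data where PCR copies are correlated; the function name and the example depths are hypothetical.

```python
from math import comb


def het_lower_tail(alt_reads, depth, p=0.5):
    """P(X <= alt_reads) for X ~ Binomial(depth, p)."""
    return sum(
        comb(depth, k) * p ** k * (1 - p) ** (depth - k)
        for k in range(alt_reads + 1)
    )


# Hypothetical site: 40 ALT reads out of 200 (20% allele fraction)
print(het_lower_tail(40, 200))
```

A vanishingly small probability here suggests that sampling noise alone does not explain the skew, which points to something systematic such as amplification bias, mapping artefacts, or duplicates.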

Just to be sure:

1. Remove PCR duplicates from your aligned BAM file using Picard (http://broadinstitute.github.io/picard/).
2. Ensure that only reads with high mapping quality (MAPQ), e.g. 50, are retained, using samtools view -bq 50 Input.bam > out.bam
3. When using the GATK, use HaplotypeCaller, not UnifiedGenotyper.
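To show what the MAPQ filter in the second step does, here is a minimal sketch that applies the same rule to SAM-format text lines: header lines pass through, and alignments are kept only when column 5 (MAPQ) is at least the threshold. The function name and the two example records are hypothetical, and for real BAM files samtools remains the practical choice.

```python
def filter_sam_by_mapq(lines, min_mapq=50):
    """Yield SAM header lines and alignments whose MAPQ is >= min_mapq."""
    for line in lines:
        if line.startswith("@"):
            yield line  # header lines are always kept
            continue
        fields = line.rstrip("\n").split("\t")
        if int(fields[4]) >= min_mapq:  # column 5 of a SAM record is MAPQ
            yield line


# Hypothetical SAM content: one header line and two alignments
sam = [
    "@SQ\tSN:chr1\tLN:1000",
    "read1\t0\tchr1\t100\t60\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
    "read2\t0\tchr1\t200\t12\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
]
for kept in filter_sam_by_mapq(sam):
    print(kept.split("\t")[0])
```

Keep in mind that the maximum MAPQ depends on the aligner, so a threshold of 50 only makes sense if your aligner can assign values that high.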

If you still have problems after that, I would be somewhat surprised.

ADD COMMENT • link 7.2 years ago by Kevin Blighe 88k

Thanks Kevin,

1. Since this is PCR-enriched amplicon data, duplicates cannot be removed.
2. I did not remove low-mapping-quality reads with samtools, but I did filter them during variant calling in GATK with -mmq 30 to get confident calls.
3. HaplotypeCaller takes much longer than UnifiedGenotyper, and GGA mode is not recommended with HC according to GATK forum threads.

Junfeng

ADD REPLY • link 7.2 years ago by J.F.Jiang ▴ 930