Question

NGS/GATK pipeline returns too many variants


3.8 years ago

dodausp ▴ 190

Hi everyone, I have been running the GATK pipeline on a set of WES samples, but I am afraid it is returning far too many variants in the end. I am using the AmpliSeq panel from IonTorrent and running the analysis on the output BAM file. The pipeline goes as follows:

Running Mutect2

gatk Mutect2 -R hg19.fasta -I sample1.bam -O sample1_unfiltered.vcf --f1r2-tar-gz sample1_f1r2.tar.gz -pon Mutect2_hg19_exome.vcf --germline-resource af-only-gnomad_IT.raw.sites.b37.vcf.gz -L AmpliSeqExome.bed

Here, I opted to use the -pon and --germline-resource options, with a panel of normals (PoN) devised by the Broad and a gnomAD resource, respectively. The BED file is provided by the sequencer.
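One sanity check that may be worth running at this point: the reference is named hg19 while the germline resource file name mentions b37, and these two builds conventionally differ in contig naming ("chr1" versus "1"). The sketch below only reports which style a file uses in its first data record; the helper name contig_style is hypothetical, and whether a naming mismatch actually exists in these particular files is an assumption that has to be checked, not something established here.

```shell
# Print "chr" if the first data record of a VCF or BED file uses
# chr-prefixed contig names, otherwise "no-chr".
# Header lines (#..., track, browser) are skipped.
contig_style() {
    case "$1" in
        *.gz) gzip -cd "$1" ;;   # bgzipped/gzipped VCF
        *)    cat "$1" ;;        # plain-text VCF or BED
    esac | awk '!/^(#|track|browser)/ { print (($1 ~ /^chr/) ? "chr" : "no-chr"); exit }'
}

# Example calls (file names taken from the Mutect2 command above):
# contig_style af-only-gnomad_IT.raw.sites.b37.vcf.gz
# contig_style Mutect2_hg19_exome.vcf
# contig_style AmpliSeqExome.bed
```

If the three files do not all report the same style, the resources may not be matching the calls the way the command intends.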

Learn orientation bias

Since I am using FFPE samples, this step seems necessary.

gatk LearnReadOrientationModel -I sample1_f1r2.tar.gz -O sample1_ROM_model.tar.gz

Tabulating pileup metrics for contamination inference

Next, I run GetPileupSummaries.

gatk GetPileupSummaries -I sample1.bam -V ExAC_hg19_IonTorrent_BiallelicOnly.r1.sites.vep.vcf.gz -L ExAC_hg19_IonTorrent_BiallelicOnly.r1.sites.vep.vcf.gz -O sample1_pileups.table

Estimating sample contamination

Here, the fraction of reads coming from cross-sample contamination is calculated from the output of GetPileupSummaries. The result is used by FilterMutectCalls later on.

gatk CalculateContamination -I sample1_pileups.table -tumor-segmentation sample1_segments.table -O sample1_calcuContamination.table

Filtering somatic SNVs and indels

gatk FilterMutectCalls -R hg19.fasta -V sample1_unfiltered.vcf --tumor-segmentation sample1_segments.table --contamination-table sample1_calcuContamination.table --ob-priors sample1_ROM_model.tar.gz -O sample1_Filtered.vcf

After that, from the resulting output sample1_Filtered.vcf, I use vcftools to retrieve only the PASS variants:

vcftools --vcf sample1_Filtered.vcf --out sample1_FiltPASS --remove-filtered-all --recode
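To see what FilterMutectCalls actually did before the non-PASS records are dropped, the FILTER column of the filtered VCF can be tallied. This is a minimal sketch using standard Unix tools, assuming an uncompressed VCF; the helper name filter_tally is hypothetical. Records carrying several filters (separated by ";") are counted once per filter.

```shell
# Tally the values in the FILTER column (column 7) of a plain-text VCF.
# Multi-filter entries such as "a;b" contribute one count to each filter.
filter_tally() {
    grep -v '^#' "$1" | cut -f7 | tr ';' '\n' | sort | uniq -c | sort -rn
}

# Example call (file name taken from the FilterMutectCalls step above):
# filter_tally sample1_Filtered.vcf
```

The PASS count printed here should agree with the number of records vcftools keeps, and the remaining rows show which filters are firing and how often.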

The resulting VCF file contains 72,358 variants.
When I try the recommended VQSR/VQSLOD filtering for SNPs and indels instead, the result is even larger: 412,000 variants.
And of course, the output of the annotation tool, in this case Funcotator, is equally large.

The main issue is that I suspect something is wrong with this pipeline (maybe I have added redundant steps, or I am filtering on the wrong criteria). The other problem is that, even with hard filtering, it is hard to know where to begin with an output of this size.
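To put a raw count such as the 72,358 above in context, it can be expressed as variants per megabase of targeted sequence. The sketch below assumes an uncompressed VCF and a BED file with non-overlapping intervals (overlaps are not merged, so the target size would be overestimated if they exist); the helper names are hypothetical.

```shell
# Total number of bases covered by a BED file (end - start per interval).
target_bases() {
    awk '!/^(#|track|browser)/ { s += $3 - $2 } END { print s + 0 }' "$1"
}

# Number of variant records (non-header lines) in a plain-text VCF.
count_records() {
    grep -vc '^#' "$1"
}

# Variants per megabase of target: variants_per_mb VCF BED
variants_per_mb() {
    awk -v n="$(count_records "$1")" -v b="$(target_bases "$2")" \
        'BEGIN { if (b > 0) printf "%.1f\n", n * 1e6 / b }'
}

# Example call (the PASS-only file name depends on how vcftools named its output):
# variants_per_mb sample1_FiltPASS.recode.vcf AmpliSeqExome.bed
```

A density makes it easier to compare samples, or the same sample before and after each filtering step, than the absolute counts alone.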

If you have experienced something like this before, or see something badly wrong with this pipeline, I would be very glad to have your feedback either way. (:

Thanks a lot!

NGS WES GATK Mutect • 1.4k views

ADD COMMENT • link updated 8 months ago by dhruti • 0 • written 3.8 years ago by dodausp ▴ 190


Use snpEff for variant calling/annotation!

ADD REPLY • link 3.8 years ago by shubhamkumbhar420 ▴ 40


snpEff is not a variant caller; it "only" makes predictions on the variants you give it. Please read about the tools you intend to suggest before adding them as an answer.

ADD REPLY • link 3.8 years ago by ATpoint 85k


Then can we use bcftools?

ADD REPLY • link 3.8 years ago by shubhamkumbhar420 ▴ 40


@dodausp Did you ever find a reason for this? I am having the same problem.

ADD REPLY • link 8 months ago by dhruti • 0