Hi everyone. I have been running the GATK pipeline on a set of WES samples, but I am afraid it is returning far too many variants in the end. The samples were sequenced with the AmpliSeq panel from Ion Torrent, and I use the BAM file it outputs as the input for the analysis. The pipeline goes as follows:
Running Mutect2
gatk Mutect2 -R hg19.fasta -I sample1.bam -O sample1_unfiltered.vcf --f1r2-tar-gz sample1_f1r2.tar.gz -pon Mutect2_hg19_exome.vcf --germline-resource af-only-gnomad_IT.raw.sites.b37.vcf.gz -L AmpliSeqExome.bed
Here, I opted to use the -pon and --germline-resource options, with files retrieved from a panel of normals (PoN) devised by Broad and from gnomAD, respectively. The BED file is provided by the sequencer.
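Since the command above mixes a reference named hg19 with a germline resource whose name contains b37, one cheap sanity check is whether all inputs use the same contig naming ("chr1" vs "1"). The file names alone do not prove there is a mismatch, so this is only a minimal sketch; contig_style is a hypothetical helper name, not part of GATK:

```shell
# contig_style: read a BED or uncompressed VCF on stdin and report how many
# records have a "chr"-prefixed contig name in column 1 versus an unprefixed one.
# (Hypothetical helper for illustration only.)
contig_style() {
    awk '
        /^#/ || /^track/ || /^browser/ || NF == 0 { next }  # skip headers and blank lines
        $1 ~ /^chr/ { chr++; next }
        { plain++ }
        END { printf "chr-prefixed=%d\tunprefixed=%d\n", chr, plain }
    '
}

# Example usage (file names taken from the post):
#   contig_style < AmpliSeqExome.bed
#   gzip -dc af-only-gnomad_IT.raw.sites.b37.vcf.gz | head -n 5000 | contig_style
```

If the BED file and the resource VCF report different styles, that would be worth sorting out before looking at the filtering steps.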
Learn orientation bias
Since I am using FFPE samples, this step seems necessary.
gatk LearnReadOrientationModel -I sample1_f1r2.tar.gz -O sample1_ROM_model.tar.gz
Tabulating pileup metrics for contamination inference
Next, I run GetPileupSummaries.
gatk GetPileupSummaries -I sample1.bam -V ExAC_hg19_IonTorrent_BiallelicOnly.r1.sites.vep.vcf.gz -L ExAC_hg19_IonTorrent_BiallelicOnly.r1.sites.vep.vcf.gz -O sample1_pileups.table
Estimating sample contamination
Here, the fraction of reads coming from cross-sample contamination is calculated from the output of GetPileupSummaries. This estimate will be used by FilterMutectCalls later on.
gatk CalculateContamination -I sample1_pileups.table --tumor-segmentation sample1_segments.table -O sample1_calcuContamination.table
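Before passing the table on, it can help to look at the estimate itself. A minimal sketch, assuming the table is tab-separated with a header row that contains a column named "contamination" (my understanding of what CalculateContamination writes, so please check against your own file); contamination_of is a hypothetical helper name:

```shell
# contamination_of: print the value in the "contamination" column of the
# first data row of a tab-separated table with a header line.
# (Hypothetical helper; the column name is an assumption about the table layout.)
contamination_of() {
    awk -F'\t' '
        NR == 1 {                                   # locate the column by name
            for (i = 1; i <= NF; i++) if ($i == "contamination") col = i
            next
        }
        col { print $col; exit }                    # first data row only
    ' "$1"
}

# Example usage:
#   contamination_of sample1_calcuContamination.table
```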
Filtering somatic SNVs and indels
gatk FilterMutectCalls -R hg19.fasta -V sample1_unfiltered.vcf --tumor-segmentation sample1_segments.table --contamination-table sample1_calcuContamination.table --ob-priors sample1_ROM_model.tar.gz -O sample1_Filtered.vcf
After that, I use vcftools to retrieve only the PASS variants from the resulting output, sample1_Filtered.vcf:
vcftools --vcf sample1_Filtered.vcf --out sample1_FiltPASS --remove-filtered-all --recode
The resulting VCF file contains 72,358 variants.
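To put that PASS count in context, the FILTER column of the FilterMutectCalls output can be tallied to see how many calls each filter caught. A minimal sketch using only awk and sort, assuming an uncompressed VCF; tally_filters is a hypothetical helper name:

```shell
# tally_filters: count how often each filter name appears in the FILTER
# column (field 7) of an uncompressed VCF. A record that fails several
# filters is counted once under each of its filter names.
tally_filters() {
    tab=$(printf '\t')
    awk -F'\t' '
        /^#/ { next }                               # skip header lines
        {
            n = split($7, names, ";")               # FILTER may list several names
            for (i = 1; i <= n; i++) count[names[i]]++
        }
        END { for (name in count) printf "%s\t%d\n", name, count[name] }
    ' "$1" | LC_ALL=C sort -t "$tab" -k2,2nr -k1,1  # most frequent first
}

# Example usage:
#   tally_filters sample1_Filtered.vcf
```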
When I try the recommended VQSR/VQSLOD functions for filtering SNPs and indels instead, the result is even larger: 412,000 variants.
And of course, the output of the annotation tool (Funcotator, in this case) is equally large.
The main issue is that I suspect something is wrong with this pipeline (maybe I have added some redundant steps, or I am filtering on the wrong criteria). The other problem is that, even if I resort to hard filtering, with this much output it is hard to know where to begin.
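One possible place to begin is the allele-fraction distribution of the PASS calls: in a tumor-only run, germline variants that slip through tend to pile up near 0.5 and 1.0. A minimal sketch, assuming an uncompressed single-sample VCF in which column 10 is the tumor sample and the FORMAT field carries an AF key, as Mutect2 output normally does; af_histogram is a hypothetical helper name:

```shell
# af_histogram: bin the allele fraction (FORMAT/AF, first ALT allele) of
# PASS records into ten bins of width 0.1 and print the count per bin.
# Assumes a single-sample VCF with the sample in column 10.
af_histogram() {
    awk -F'\t' '
        /^#/ || $7 != "PASS" { next }               # headers and filtered calls
        {
            n = split($9, keys, ":")                # FORMAT keys
            split($10, vals, ":")                   # matching sample values
            for (i = 1; i <= n; i++) if (keys[i] == "AF") {
                split(vals[i], afs, ",")            # multi-allelic: take first ALT
                bin = int(afs[1] * 10)
                if (bin > 9) bin = 9                # AF of exactly 1.0 goes in the top bin
                hist[bin]++
            }
        }
        END {
            for (b = 0; b <= 9; b++)
                printf "%.1f-%.1f\t%d\n", b / 10, (b + 1) / 10, hist[b]
        }
    ' "$1"
}

# Example usage:
#   af_histogram sample1_Filtered.vcf
```

A large spike in the 0.4-0.6 and 0.9-1.0 bins would point toward germline variants rather than true somatic calls.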
If you have experienced something like this before, or see something awfully wrong with this pipeline, I would be very glad to have your feedback either way. (:
Thanks a lot!
Use snpEff for variant calling/annotation!
snpEff is not a variant caller; it "only" makes predictions on the variants you give it. Please read about the tools you intend to suggest before posting them as an answer.
Then can we use bcftools?
@dodausp Did you ever find a reason for this? I am having the same problem