For exome sequencing datasets you should always use an interval file. As explained on GATK website (https://software.broadinstitute.org/gatk/documentation/article?id=11009-
For exomes and similarly targeted data types, the interval list should
correspond to the capture targets used for the library prep, and is
typically provided by the prep kit manufacturer (with versions for
each ref genome build of course).
Thus you should at each step where an -L
option is available use the corresponding interval file. The exome kit vendor should provide you with the corresponding interval file. If not you can use gencode :
wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_31/gencode.v31.annotation.gtf.gz
zcat gencode.v31.annotation.gtf.gz | awk '$3 == "exon" { print $1, $4, $5, $7, $18}' | tr ' ' '\t' | tr -d '";' >> interval_list.bed
For information if you use GATK4 best practices these are the tools where the interval file needs to be specified
BaseRecalibrator
ApplyBQSR
HaplotypeCaller
GenomicsDBimport
(as you have only 15 samples you can use directly GenotypeGVCFs)
GenotypeGVCFs
VariantRecalibrator
ApplyVQSR
Also as you have a limited number of samples you should add 1000G samples (from here : https://software.broadinstitute.org/gatk/documentation/article.php?id=1259)
In our testing we've
found that in order to achieve the best exome results one needs to use
an exome SNP and/or indel callset with at least 30 samples. For users
with experiments containing fewer exome samples there are several
options to explore:
Add additional samples for variant calling, either by sequencing
additional samples or using publicly available exome bams from the
1000 Genomes Project (this option is used by the Broad exome
production pipeline). Be aware that you cannot simply add VCFs from
the 1000 Genomes Project. You must either call variants from the
original BAMs jointly with your own samples, or (better) use the
reference model workflow to generate GVCFs from the original BAMs, and
perform joint genotyping on those GVCFs along with your own samples'
GVCFs with GenotypeGVCFs.
You can also try using the VQSR with the smaller variant callset, but
experiment with argument settings (try adding --maxGaussians 4 to your
command line, for example). You should only do this if you are working
with a non-model organism for which there are no available genomes or
exomes that you can use to supplement your own cohort.
Could you add GATK version you used. And also the previous steps you did. Did you used an interval file ? Tranche plot is not very good IMO.
I have used GATK 4.0.8. I run
HaplotypeCaller
,CombineGVCFs
, andGenotypeGVCFs
. I did not use interval file. So what should I do for this?