I am trying to generate a PON for WES samples following the GATK recommendations. They also have another explanation in this Mutect2 article, but it is essentially the same 3-step procedure:
step 1. Run Mutect2 in tumor-only mode for each normal sample:
gatk Mutect2 -R reference.fasta -I normal1.bam -max-mnp-distance 0 -O normal1.vcf.gz
step 2. Create a GenomicsDB from the normal Mutect2 calls:
gatk GenomicsDBImport -R reference.fasta -L intervals.interval_list --genomicsdb-workspace-path pon_db -V normal1.vcf.gz -V normal2.vcf.gz -V normal3.vcf.gz -V ...
step 3. Combine the normal calls using CreateSomaticPanelOfNormals:
gatk CreateSomaticPanelOfNormals -R reference.fasta --germline-resource af-only-gnomad.vcf.gz -V gendb://pon_db -O pon.vcf.gz
I am using gatk 4.1.7 (the latest at the moment), but the output I got from step 2 (GenomicsDBImport) is a folder with some files in it, such as vcfheader.vcf, vidmap.json and what looks like one entry for every chromosome, named with a $ separator and the contig boundaries specified in the BED file (e.g. X$200786$155255277).
If I try to pass this directory to the -V option of CreateSomaticPanelOfNormals (step 3), I get an error that the specified input is not a regular file, and the GATK documentation confirms that -V is supposed to be a VCF file.
Does anybody who has generated PONs before, or worked with this, know exactly which output from step 2 I am supposed to pass to -V in step 3?
Thank you very much in advance for any help!
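One thing worth checking: the step 3 command above gives the workspace as gendb://pon_db, not as a bare directory path, so the gendb:// prefix is presumably what tells GATK to read the folder as a GenomicsDB workspace rather than as a VCF file. The sketch below, assuming a POSIX shell, checks that a directory contains the vidmap.json and vcfheader.vcf files mentioned above and prints the corresponding URI. The function name gendb_uri is hypothetical and not part of GATK.

```shell
# Hypothetical helper (not a GATK tool): print the gendb:// URI for a
# GenomicsDB workspace directory.
# Assumption: a valid workspace contains vidmap.json and vcfheader.vcf,
# the two files named in the question.
gendb_uri() {
    ws=$1
    if [ ! -d "$ws" ]; then
        echo "not a directory: $ws" >&2
        return 1
    fi
    for f in vidmap.json vcfheader.vcf; do
        if [ ! -f "$ws/$f" ]; then
            echo "missing $f in $ws" >&2
            return 1
        fi
    done
    printf 'gendb://%s\n' "$ws"
}

# Intended use (not executed here), mirroring step 3 of the question:
#   gatk CreateSomaticPanelOfNormals -R reference.fasta \
#       --germline-resource af-only-gnomad.vcf.gz \
#       -V "$(gendb_uri pon_db)" -O pon.vcf.gz
```

If the function prints an error instead of a URI, the directory is probably not the workspace path that was given to --genomicsdb-workspace-path in step 2.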
Can you not use the PON files they make available here?
No, my samples are hg19. Moreover, I'd like to be sure that the PON comes from samples processed with the same kit.
I am absolutely new to bioinformatics and I'm seeking a solution for whole-exome somatic variant calling.
About this command specifically: is it necessary to provide an interval or interval list here, since I'm looking at whole exome? I forgot to mention that I'm using gatk version 4.2.0.0.
Yes, because GATK has no way to know which targets were captured in your exome assay. Every kit out there is slightly different and may be based on a specific genome build. Your kit manufacturer should have this file already available.
Thanks a lot for such a prompt response. I don't think I have that list. All I know is that it's a whole exome, and hg38 is what I mapped it to. I'm not sure if I can just provide a list of all the possible chromosomes, something like: chr1 chr2 chr3 chr4 . . . chrX chrY
If you targeted just the exome then you can't provide the entire genome. Can you check what kit was used for preparing your samples? If you don't have that information, you should check with whoever prepared the samples.
If that does not work, the Broad Institute makes a generic interval list available for GRCh38 here. You could use it with the caveat that the list may not match your data 100%.
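Whichever target file you end up with, one quick sanity check is that its contig names exist in your reference, since a list built for a different build or naming scheme (for example 1 versus chr1) will not line up. A minimal sketch, assuming a BED-style target file and a reference index in .fai format (as written by samtools faidx); the function name missing_contigs is hypothetical:

```shell
# Hypothetical helper: list contigs used in a BED-style target file that are
# absent from a reference index (.fai). Prints each missing contig once.
# Assumption: the first column of both files is the contig name.
missing_contigs() {
    bed=$1
    fai=$2
    awk '
        NR == FNR { known[$1] = 1; next }           # first file: .fai contig names
        /^(#|track|browser)/ { next }                # skip BED header lines
        !($1 in known) && !seen[$1]++ { print $1 }   # report each unknown contig once
    ' "$fai" "$bed"
}

# Intended use (file names are placeholders):
#   missing_contigs targets.bed reference.fasta.fai
```

Empty output means every contig in the target file is present in the reference index. It does not tell you whether the coordinates match the regions your kit actually captured.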
Thanks!... Sure will check that...