I am running a variant calling pipeline for cancer samples. It includes Mutect2.
Working with human data, I started with the reference and dbSNP files from the GATK bundle for hg38 (ftp://ftp.broadinstitute.org/bundle/hg38). I picked the following files:
Homo_sapiens_assembly38.dict
Homo_sapiens_assembly38.fasta.fai
Homo_sapiens_assembly38.fasta.gz
dbsnp_146.hg38.vcf.gz
dbsnp_146.hg38.vcf.gz.tbi
With Mutect2, you can supply a database of known somatic variants using "--cosmic". Since I started the pipeline with the hg38 reference file, I picked the GRCh38 COSMIC file (https://cancer.sanger.ac.uk/cosmic/files?data=/files/grch38/cosmic/v79/VCF/CosmicCodingMuts.vcf.gz). From my understanding, hg38 <=> UCSC and GRCh38 <=> NCBI, but I thought they would be close enough.
Then, when I run Mutect2, I get the following error: "Input files cosmic and reference have incompatible contigs. Error details: The contig order in cosmic and reference is not the same"
I corrected the chromosome names (1->chr1, MT->chrM, etc.) in the CosmicCodingMuts.vcf file, then sorted it using Picard SortVcf. But I am still stuck with the same kind of error in Mutect2.
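For reference, the rename-and-reorder step described above can be sketched in plain Python. This is only a minimal sketch under stated assumptions, not a tested fix for the Mutect2 error: the function names are made up for illustration, the contig order and lengths are taken from the @SQ lines of the reference .dict file, and it assumes any existing ##contig header lines in the COSMIC VCF should be replaced by ones that follow the reference order.

```python
def parse_dict(dict_lines):
    """Return [(contig_name, length), ...] in the order of the @SQ lines
    of a sequence dictionary (.dict) file."""
    contigs = []
    for line in dict_lines:
        if not line.startswith("@SQ"):
            continue
        tags = dict(f.split(":", 1) for f in line.rstrip("\n").split("\t")[1:])
        contigs.append((tags["SN"], int(tags["LN"])))
    return contigs


def rename_contig(name):
    """Map NCBI-style names to UCSC-style ones (1 -> chr1, MT -> chrM)."""
    if name.startswith("chr"):
        return name
    if name in ("MT", "M"):
        return "chrM"
    return "chr" + name


def rename_and_sort_vcf(vcf_lines, contigs):
    """Rename contigs in VCF records and sort them by reference contig
    order, then by position. Returns the new VCF as a list of lines."""
    rank = {name: i for i, (name, _) in enumerate(contigs)}
    meta, column_header, records = [], None, []
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("##"):
            # Assumption: old ##contig lines are dropped and rebuilt below.
            if not line.startswith("##contig="):
                meta.append(line)
        elif line.startswith("#"):
            column_header = line
        else:
            fields = line.split("\t")
            fields[0] = rename_contig(fields[0])
            if fields[0] not in rank:
                raise ValueError("contig not in reference: " + fields[0])
            records.append(fields)
    # Sort by the reference's contig order, not alphabetically.
    records.sort(key=lambda f: (rank[f[0]], int(f[1])))
    contig_meta = ["##contig=<ID=%s,length=%d>" % c for c in contigs]
    return meta + contig_meta + [column_header] + ["\t".join(f) for f in records]
```

To apply it to real files, the lines could be read with `open()` (or `gzip.open(path, "rt")` for the compressed VCF) and the result written back out, then compressed and indexed as usual. The whole record list is held in memory, which is an assumption about file size rather than a requirement of the approach.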
My questions are: 1) How can I modify the COSMIC VCF to match the hg38 reference? 2) If 1) is not possible, where can I retrieve a compatible set of reference genome, germline SNP and somatic SNP files?
If anyone wants to do this in R, it is a lot easier:
Thanks for posting your troubleshooting / solution for this problem. I know I will be looking for this when we move to hg38.
Thank you so much for this information. I was having trouble locating the COSMIC data. By the way, is there any reason you skipped the non-coding variants? COSMIC should serve as the whitelist, and I believe the more confident variants we provide, the better MuTect2 can work. Please correct me if I am wrong.
Thanks a lot for posting. I have a question: why did you sort your 'chr added' VCF if it was already sorted? I mean, isn't the original COSMIC VCF sorted?