COSMIC vcf file compatibility for Mutect2
2
4
Entering edit mode
7.8 years ago
erwan.scaon ▴ 950

I am running a variant calling pipeline for cancer samples. It includes Mutect2. Working on human, i started with the reference & dbsnp files contained in the GATK bundle for hg38 (ftp://ftp.broadinstitute.org/bundle/hg38). Picked the following files : Homo_sapiens_assembly38.dict
Homo_sapiens_assembly38.fasta.fai
Homo_sapiens_assembly38.fasta.gz dbsnp_146.hg38.vcf.gz dbsnp_146.hg38.vcf.gz.tbi

With Mutect2, you can feed a DB of known somatic variants using "--cosmic". Given that i started the pipeline with hg38 reference file, i picked the grch38 cosmic file (https://cancer.sanger.ac.uk/cosmic/files?data=/files/grch38/cosmic/v79/VCF/CosmicCodingMuts.vcf.gz). From my understanding, hg38 <=> UCSC and GRCh38 <=> NCBI, but i thought it would be close/good enough.

Then, when i run Mutect2, i get the following error : "Input files cosmic and reference have incompatible contigs. Error details: The contig order in cosmic and reference is not the same"

I corrected chromosomes names (1->chr1, MT->chrM, etc...) in the CosmicCodingMuts.vcf file, then sorted it using Picard SortVcf. But i am still stuck with the same kind of error in Mutect2.

Question is : 1) How to modify the COSMIC.vcf to match hg38 reference ? 2) If 1) is not possible, where can retrieve compatible genome_ref + germline_snp + somatic_snp ?

GATK Mutect2 COSMIC VCF • 11k views
ADD COMMENT
8
Entering edit mode
7.8 years ago
erwan.scaon ▴ 950

Ok, things are working for me, i did the following to make my "grch38" cosmic.vcf compatible with human genome reference hg38 (downloaded from GATK bundle) :

--- convert cosmic contigs names from "1" -> "chr1"

awk '{if($0 !~ /^#/) print "chr"$0; else print $0}' CosmicCodingMuts.vcf > CosmicCodingMuts_chr.vcf

--- change chrMT -> chrM

sed 's/chrMT/chrM/g' CosmicCodingMuts_chr.vcf > CosmicCodingMuts_chr_M.vcf

--- sort this modified cosmic.vcf (it will create and index too)

java -jar $PICARD SortVcf I=CosmicCodingMuts_chr_M.vcf O=CosmicCodingMuts_chr_M_sorted.vcf SEQUENCE_DICTIONARY=GATK_bundle_h38/Homo_sapiens_assembly38.dict

--- Weird step, need to remove the index created by Picard (http://gatkforums.broadinstitute.org/gatk/discussion/1328/errors-about-contigs-in-bam-or-vcf-files-not-being-properly-ordered-or-sorted)

rm CosmicCodingMuts_chr_M_sorted.vcf.idx

--- Then, when you launch Mutect2 with "--cosmic CosmicCodingMuts_chr_M_sorted.vcf", it will "regenerate" an .idx for this vcf file, and this .idx will be compatible with the reference genome

Hope this helps

ADD COMMENT
1
Entering edit mode

If anyone wants to do this in R, it a lot easier:

library(VariantAnnotation)

## import cosmic vcf file
cos.ens <- readVcf('./data/CosmicCodingMuts.vcf')
## check current chr names
seqinfo(cos.ens)

## set UCSC compatible chromosome names
seqlevelsStyle(cos.ens) <- 'UCSC'
## check new chr names
seqinfo(cos.ens)

## export new vcf
writeVcf(cos.mut.ens, 'data/cosmic_mut_ucsc.vcf')
ADD REPLY
0
Entering edit mode

Thanks for posting your troubleshooting / solution for this problem. I know I will be looking for this when we move to hg38.

ADD REPLY
0
Entering edit mode

Thank you so much for this information. I was having trouble locating the cosmic data. By the way, is there any reason for you to skip the non-coding variants? Cosmic should serve as the white list, and I believe the more confident variants we provide, the better MuTect2 can work. Please correct me if I am wrong.

ADD REPLY
0
Entering edit mode

Thanks a lot for posting. I have a question. Why did you sort your 'chr added' vcf which was already sorted vcf? I mean, is not original ccosmic vcf sorted?

ADD REPLY
5
Entering edit mode
7.6 years ago
erwan.scaon ▴ 950

Indeed, you should use both coding and non-coding variants.

In our MuTect2 project, we ended up using the hg19 COSMIC files for both coding and non-coding :

Retrive files

sftp "erwan.scaon@unilim.fr"@sftp-cancer.sanger.ac.uk;
get files/grch37/cosmic/v79/VCF/CosmicCodingMuts.vcf.gz
get files/grch37/cosmic/v79/VCF/CosmicNonCodingVariants.vcf.gz

Make them "hg19-compatible"

grep '^[^#]' CosmicNonCodingVariants.vcf >> CosmicCodingMuts.vcf;
mv CosmicCodingMuts.vcf cosmic_coding_and_noncoding.vcf;
awk '{if($0 !~ /^#/) print "chr"$0; else print $0}' cosmic_coding_and_noncoding.vcf > cosmic_coding_and_noncoding_chr.vcf;
sed 's/chrMT/chrM/g' cosmic_coding_and_noncoding_chr.vcf > cosmic_coding_and_noncoding_chr_M.vcf;
java -jar $PICARD SortVcf I=cosmic_coding_and_noncoding_chr_M.vcf O=cosmic_coding_and_noncoding_chr_M_sorted.vcf SEQUENCE_DICTIONARY=../GATK_bundle_hg38/Homo_sapiens_assembly38.dict;
rm cosmic_coding_and_noncoding_chr_M_sorted.vcf.idx;
ADD COMMENT

Login before adding your answer.

Traffic: 2047 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6