Question

Reference and dbSNP incompatibility issue (MuTect2)

0

8.8 years ago

umn_bist ▴ 390

When I try using MuTect2 (from GATK), I get a contig mismatch error (shown below).

Is there a link to an (old) dbSNP that is compatible with UCSC's hg19 assembly?

EDIT: I cannot post the full error message because Biostar is saying that it isn't in English... I used the dbSNP file 00-All.vcf.gz from NCBI (ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606/VCF/) and I am using the ucsc.hg19.fasta reference assembly.

##### ERROR   dbsnp contigs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT, GL000207.1, GL000226.1, GL000229.1, GL000231.1, GL000210.1, GL000239.1, GL000235.1, GL000201.1, GL000247.1, GL000245.1, GL000197.1, GL000203.1, GL000246.1, GL000249.1, GL000196.1, GL000248.1, GL000244.1, GL000238.1, GL000202.1, GL000234.1, GL000232.1, GL000206.1, GL000240.1, GL000236.1, GL000241.1, GL000243.1, GL000242.1, GL000230.1, GL000237.1, GL000233.1, GL000204.1, GL000198.1, GL000208.1, GL000191.1, GL000227.1, GL000228.1, GL000214.1, GL000221.1, GL000209.1, GL000218.1, GL000220.1, GL000213.1, GL000211.1, GL000199.1, GL000217.1, GL000216.1, GL000215.1, GL000205.1, GL000219.1, GL000224.1, GL000223.1, GL000195.1, GL000212.1, GL000222.1, GL000200.1, GL000193.1, GL000194.1, GL000225.1, GL000192.1, NC_007605]
##### ERROR   reference contigs = [chrM, chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY, chr1_gl000191_random, chr1_gl000192_random, chr4_ctg9_hap1, chr4_gl000193_random, chr4_gl000194_random, chr6_apd_hap1, chr6_cox_hap2, chr6_dbb_hap3, chr6_mann_hap4, chr6_mcf_hap5, chr6_qbl_hap6, chr6_ssto_hap7, chr7_gl000195_random, chr8_gl000196_random, chr8_gl000197_random, chr9_gl000198_random, chr9_gl000199_random, chr9_gl000200_random, chr9_gl000201_random, chr11_gl000202_random, chr17_ctg5_hap1, chr17_gl000203_random, chr17_gl000204_random, chr17_gl000205_random, chr17_gl000206_random, chr18_gl000207_random, chr19_gl000208_random, chr19_gl000209_random, chr21_gl000210_random, chrUn_gl000211, chrUn_gl000212, chrUn_gl000213, chrUn_gl000214, chrUn_gl000215, chrUn_gl000216, chrUn_gl000217, chrUn_gl000218, chrUn_gl000219, chrUn_gl000220, chrUn_gl000221, chrUn_gl000222, chrUn_gl000223, chrUn_gl000224, chrUn_gl000225, chrUn_gl000226, chrUn_gl000227, chrUn_gl000228, chrUn_gl000229, chrUn_gl000230, chrUn_gl000231, chrUn_gl000232, chrUn_gl000233, chrUn_gl000234, chrUn_gl000235, chrUn_gl000236, chrUn_gl000237, chrUn_gl000238, chrUn_gl000239, chrUn_gl000240, chrUn_gl000241, chrUn_gl000242, chrUn_gl000243, chrUn_gl000244, chrUn_gl000245, chrUn_gl000246, chrUn_gl000247, chrUn_gl000248, chrUn_gl000249]

ucsc.hg19.fa GATK RNA-Seq dbSNP Mutect2 • 4.2k views

ADD COMMENT • link updated 2.3 years ago by Ram 44k • written 8.8 years ago by umn_bist ▴ 390

1

Hi,

Just one addition to what Chris has already said. The mitochondrial sequence in the UCSC version differs from the one in the b37/1000G/Ensembl version. So if you stick to 1-22, X and Y only, then replacing/prefixing 'chr' is OK.

Otherwise, take care of the mitochondrial data, and also of the alternate/unplaced contigs. Those are also different in the UCSC version.

When I analyze WES data, since the capture kit (Agilent) is not designed to capture mitochondrial DNA anyway, I just choose 1-22, X and Y. Then the UCSC data/sequence is smoothly interchangeable with b37/1000G.
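For illustration, restricting a b37-style VCF to 1-22, X and Y and prefixing 'chr' could look roughly like the sketch below. The function name `keep_primary` and its `add_chr_prefix` parameter are made up for this example; it only assumes that the input is an iterable of uncompressed, tab-separated VCF lines.

```python
# Primary chromosomes as named in b37/1000G-style files.
PRIMARY = {str(i) for i in range(1, 23)} | {"X", "Y"}


def keep_primary(vcf_lines, add_chr_prefix=True):
    """Yield header lines unchanged and only records on 1-22, X, Y.

    Hypothetical helper: MT, GL* and other contigs are dropped because
    their sequences/names do not carry over to UCSC hg19 by a simple rename.
    """
    for line in vcf_lines:
        if line.startswith("#"):
            # Header lines are passed through untouched in this sketch.
            yield line
            continue
        chrom, sep, rest = line.partition("\t")
        if chrom in PRIMARY:
            new_chrom = "chr" + chrom if add_chr_prefix else chrom
            yield new_chrom + sep + rest
```

Since the function works on any iterable of lines, it can be fed from a file opened with `gzip.open(path, "rt")` for a compressed VCF. The output would still need to be re-indexed before use.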

ADD REPLY • link 8.8 years ago by Amitm ★ 2.3k

Ram · Answer 1 · 2016-01-28

3

8.8 years ago

Chris Miller 22k

This is the same as your previous problems. "You'll either need to change the dbSNP file or change your data and reference fasta. The former is probably easier - you'll just need to add "chr" when appropriate, change "MT" to "chrM", and convert between the gl contig names."
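A minimal sketch of that renaming, not a tested pipeline: the helper names (`build_gl_lookup`, `to_ucsc`, `rename_vcf_line`) are invented for illustration. The GL lookup is derived from the reference contig names themselves (e.g. `chr1_gl000191_random` gives `GL000191.1`), assuming every GL accession in the dbSNP file carries a `.1` version suffix, as in the error message above.

```python
import re

AUTOSOMES_AND_SEX = {str(i) for i in range(1, 23)} | {"X", "Y"}


def build_gl_lookup(ucsc_contigs):
    """Map b37 names like 'GL000191.1' to UCSC names like 'chr1_gl000191_random'."""
    lookup = {}
    for name in ucsc_contigs:
        match = re.search(r"gl(\d{6})", name)
        if match:
            lookup["GL" + match.group(1) + ".1"] = name
    return lookup


def to_ucsc(contig, gl_lookup):
    """Return the UCSC-style name, or None if there is no counterpart."""
    if contig == "MT":
        # Name-only change; the mitochondrial sequence itself differs
        # between the two assemblies, so these records may not be usable.
        return "chrM"
    if contig in gl_lookup:
        return gl_lookup[contig]
    if contig in AUTOSOMES_AND_SEX:
        return "chr" + contig
    return None  # e.g. NC_007605 is not in the reference contig list


def rename_vcf_line(line, gl_lookup):
    """Rename the CHROM column of one VCF record; None means drop the record."""
    if line.startswith("#"):
        return line  # header lines are left as-is in this sketch
    chrom, sep, rest = line.partition("\t")
    new_chrom = to_ucsc(chrom, gl_lookup)
    return None if new_chrom is None else new_chrom + sep + rest
```

The reference names for `build_gl_lookup` can be read from the first column of the `.fai` index of the fasta. Renaming alone is probably not enough: the reference lists chrM first while the dbSNP file lists MT last, so the renamed VCF would likely also need to be re-sorted to the reference's contig order and re-indexed.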

ADD COMMENT • link updated 4.9 years ago by Ram 44k • written 8.8 years ago by Chris Miller 22k

2

There is now a separate dbSNP download section with "corrected" contig names: ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606/VCF/GATK/

ADD REPLY • link 8.4 years ago by igor 13k

0

This is pretty useful. THX

ADD REPLY • link 8.1 years ago by Mdeng ▴ 530

0

Thanks for your help, Chris. Yes, this has all been little validation errors due to the main issue of not having the original reference.

I did, however, get hold of a working reference genome (ucsc.hg19) and its corresponding dbSNP and COSMIC VCFs. But having gone through the formatting process (sorting, indexing, adding read groups) and finally getting a VCF file with no mutations detected, I think I will resort to the second-best option. Do you have any recommendations other than Mutect2 if I am trying to stick to a single tool? FreeBayes/VarScan2/SomaticSniper? GATK has been a very difficult, time-consuming (and eye-opening) experience thus far. Thanks again for your help.

EDIT: I find the samtools mpileup function much more comfortable to use (but it seems that it is horrible for somatic variant calling).

ADD REPLY • link 8.8 years ago by umn_bist ▴ 390

score 1 · Answer 2 · 2016-01-28

1

8.8 years ago

Chris Miller 22k

If you're only going to run one variant caller, Mutect is probably the way to go.

ADD COMMENT • link 8.8 years ago by Chris Miller 22k

0

Does this stand even if I have (impure) tumor samples with no matching normals? I read that MuTect2 is great for pure tumor samples because it picks up low VAF % but for impure ones, it can be too sensitive (high false positives). Does the fact that I have dbSNP and COSMIC vcf ensure that MuTect is good for my use case? Thank you for your help.

ADD REPLY • link 8.8 years ago by umn_bist ▴ 390

0

No variant caller that I've seen yet is great at low-VAF calling. Impure tumors are more difficult, because the signal is depressed and closer to the noise level from the error rate of the sequencer/prep. If you push too far down, you begin picking up that noise and get a huge number of false positives. My preference is always for some sort of ensemble calling, followed by filtering, but if you're going to use one caller, I still think that Mutect is a reasonable way to go here.

ADD REPLY • link 8.8 years ago by Chris Miller 22k