Question

NCBI dbSNP compatibility w/ Ensembl whole genome

1


8.8 years ago

umn_bist ▴ 390

I downloaded dbSNP build 141 from NCBI and the GRCh38.p5 genome from Gencode. I am using both with GATK BaseRecalibrator, but I get an error caused by the 'chr' prefix on the contig names.

Specifically, the genome sequence uses the 'chr' prefix and also includes unplaced contigs, but the SNP VCF file has neither. I am wondering if I can simply append 'chr' into the SNP file (assuming that unplaced contigs are included), or if there is a SNP file (with indels included) for the Ensembl genome (ideally for both GRCh38 and GRCh37).

EDIT: Upon further inspection, the SNP VCF file with 'papu' in its name does include unplaced contigs, but it still lacks the 'chr' prefix. I also found that Ensembl has its own dbSNP (version 144) that corresponds to Ensembl 83 (GRCh38), but I do not see a download link. I also see that UCSC adopted the Gencode/Ensembl format, but their SNP file does not include variants on unplaced contigs. First, does this matter for the purpose of running GATK? Second, is it possible to merge the common, clinically associated, and multimapped variants into one VCF, and is this advisable?

ERROR   /00-All.vcf contigs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT]

ERROR   reference contigs = [chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY, chrM, GL000008.2, GL000009.2, GL000194.1, GL000195.1, GL000205.2, GL000208.1, GL000213.1, GL000214.1, GL000216.2, GL000218.1, GL000219.1, GL000220.1, GL000221.1, GL000224.1, GL000225.1, GL000226.1, KN538364.1, KQ031383.1, KN538369.1, JH159136.1, JH159137.1, KQ031387.1, KN538360.1, KN196484.1, KN196476.1, KN196479.1, KN196473.1, KN196487.1, KN196475.1, KQ090016.1, KN538361.1, KN196474.1, KQ090022.1, KN196478.1, KN196480.1, KQ090028.1, KN196483.1, KN196481.1, KN538363.1, KN538362.1, KQ031385.1, KQ031386.1, KQ031388.1, KN538365.1, KN538366.1, KN538367.1, KN538370.1, KN538373.1, KN538371.1, KQ031384.1, KN538372.1, KQ090021.1, KN196482.1, KQ458386.1, KN196472.1, GL383545.1, GL383546.1, KI270824.1, KI270825.1, KQ090020.1, GL383547.1, KN538368.1, KI270826.1, KI270827.1, KI270829.1, KI270830.1, KI270831.1, KI270832.1, KI270902.1, KI270903.1, KI270927.1, GL877875.1, GL383549.1, GL383550.2, KQ090023.1, GL877876.1, GL383552.1, KI270904.1, GL383553.2, KI270835.1, GL383551.1, KI270837.1, KI270833.1, KI270834.1, KI270836.1, KI270838.1, KI270839.1, KI270840.1, KI270841.1, KI270842.1, KI270843.1, KQ090024.1, KQ090025.1, KI270844.1, KI270845.1, KI270846.1, KI270847.1, KI270852.1, KI270848.1, GL383554.1, KI270906.1, GL383555.2, KI270851.1, KI270849.1, KI270905.1, KI270850.1, KQ031389.1, KI270853.1, GL383556.1, GL383557.1, KI270855.1, KQ031390.1, KI270856.1, KQ090027.1, KQ090026.1, KI270854.1, KI270909.1, GL383563.3, KI270861.1, GL383564.2, GL000258.2, KI270860.1, KI270907.1, KI270862.1, ... ...

(contracted to meet character limit)

GATK Ensembl NCBI • 3.8k views

ADD COMMENT • link updated 6.2 years ago by Ram 44k • written 8.8 years ago by umn_bist ▴ 390

0


When I printed the first and last 3000 lines of NCBI's 00-All_papu.vcf (which has SNPs for unplaced contigs), the chromosome column contained NT_113889.1 and NW_009646209.1, respectively.

The genome reference that this SNP file corresponds to (GRCh38) from Ensembl does not have any unplaced contigs starting with NT or NW (they only start with GL, KN, KQ, JH, KI). Do I need to edit the unplaced contig names in my dbSNP file to match those in my genome reference?
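For context, NT_ and NW_ names are RefSeq accessions, while GL, KI and similar names are GenBank accessions, and NCBI's assembly report lists both for each sequence. A minimal sketch of renaming on that basis follows. It assumes a tab-delimited report with the GenBank accession in column 5 and the RefSeq accession in column 7 (check the header line of your own report), and the function name is hypothetical.

```shell
# rename_refseq_contigs REPORT < in.vcf > out.vcf
# REPORT is assumed to be an NCBI assembly report: tab-delimited,
# column 5 = GenBank-Accn, column 7 = RefSeq-Accn.
# Records whose contig has no mapping pass through unchanged,
# so inspect the output before feeding it to GATK.
rename_refseq_contigs() {
  awk -F'\t' -v OFS='\t' '
    NR == FNR {                 # first file: the assembly report
      if ($0 !~ /^#/ && $5 != "na" && $7 != "na") map[$7] = $5
      next
    }
    /^#/ { print; next }        # VCF header lines are left alone
    { if ($1 in map) $1 = map[$1]; print }
  ' "$1" -
}
```

Use the report that matches the exact assembly patch of your reference, since patch releases add and rename scaffolds.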

ADD REPLY • link updated 6.2 years ago by Ram 44k • written 8.8 years ago by umn_bist ▴ 390

Ram · Accepted Answer · 2016-02-17

3


8.8 years ago

Pierre Lindenbaum 164k

I am wondering if I can simply append 'chr' into the SNP file

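No: the mitochondrial contig is the exception. The reference calls it chrM while the dbSNP file calls it MT, so a plain 'chr' prefix would give chrMT. This handles both: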

sed -e '/^[^#]/s/^/chr/' -e 's/^chrMT/chrM/'

The last time I looked at the NCBI VCF, it had no VCF sequence dictionary (##contig lines). You could insert one with Picard's UpdateVcfSequenceDictionary.
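As a small illustration of what the sed command above does, here it is wrapped in a function (the name add_chr_prefix is made up for this sketch) and run on a tiny invented VCF with one header line and two records.

```shell
# Hypothetical wrapper around the sed command from the answer.
# Lines not starting with '#' get a 'chr' prefix; chrMT is then
# rewritten to chrM. Header lines (starting with '#') are untouched.
add_chr_prefix() {
  sed -e '/^[^#]/s/^/chr/' -e 's/^chrMT/chrM/'
}

# Toy input, not real dbSNP data.
printf '#CHROM\tPOS\tID\n1\t100\trs1\nMT\t200\trs2\n' | add_chr_prefix
```

The header line comes out unchanged, the record on 1 becomes chr1, and the record on MT becomes chrM. Because header lines are skipped, any existing ##contig lines would keep their old names, which is one more reason to rebuild the sequence dictionary afterwards.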

ADD COMMENT • link updated 6.2 years ago by Ram 44k • written 8.8 years ago by Pierre Lindenbaum 164k

0


Thank you for your reply. This is exactly what I needed. Would I use the GRCh38.dict file as my sequence dictionary to update the dbSNP vcf file? Thanks again.

ADD REPLY • link 8.8 years ago by umn_bist ▴ 390

1


Yes, that should work.
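Before updating the dictionary, it may help to confirm that every contig name in the renamed VCF actually appears in the .dict file. The sketch below assumes the .dict is a standard SAM-style header with @SQ lines carrying SN: tags. The function name is hypothetical.

```shell
# contigs_missing_from_dict DICT < in.vcf
# Prints each contig name used by a VCF record that has no @SQ SN:
# entry in the sequence dictionary. No output means all names matched.
contigs_missing_from_dict() {
  awk -F'\t' '
    NR == FNR {                         # first file: the .dict
      if ($1 == "@SQ")
        for (i = 2; i <= NF; i++)
          if ($i ~ /^SN:/) known[substr($i, 4)] = 1
      next
    }
    /^#/ { next }                       # skip VCF header lines
    !($1 in known) && !seen[$1]++ { print $1 }
  ' "$1" -
}
```

Running it as `contigs_missing_from_dict GRCh38.dict < renamed.vcf` (renamed.vcf being a placeholder name) lists any leftover names, such as MT or NT_/NW_ accessions, that still need to be mapped.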

ADD REPLY • link 8.8 years ago by Pierre Lindenbaum 164k

0


So I found that Ensembl has a publicly available file corresponding to GRCh38 release 83. If I'm looking at tumor samples, wouldn't I want both germline and somatic variations, and is it advisable to merge the two files? Is the somatic variation file equivalent to Sanger's COSMIC file?

ADD REPLY • link updated 6.2 years ago by Ram 44k • written 8.8 years ago by umn_bist ▴ 390