I downloaded dbSNP build 141 from NCBI and GRCh38.p5 from Gencode. I am using both for GATK BaseRecalibrator, but I get an error caused by the 'chr' prefix in the contig names.
Specifically, the genome sequence uses the 'chr' prefix and also contains unplaced contigs, but the SNP VCF file has neither. I am wondering whether I can simply prepend 'chr' to the contig names in the SNP file (assuming that unplaced contigs are included), or whether there is a SNP file (with indels included) for the Ensembl genome (ideally for both GRCh38 and GRCh37).
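For concreteness, here is a minimal sketch of the renaming I have in mind. It assumes the only differences on the primary chromosomes are the missing 'chr' prefix and MT versus chrM, as the error message below suggests; the function names `add_chr_prefix` and `rename_vcf_line` are my own and not part of GATK or any library. I have not verified that a VCF rewritten this way is accepted by BaseRecalibrator.

```python
import re

# Matches the ID field of a "##contig=<ID=...>" VCF header line.
_CONTIG_HEADER = re.compile(r"^(##contig=<ID=)([^,>]+)")


def add_chr_prefix(contig: str) -> str:
    """Map an NCBI-style name (1, X, MT) to a chr-prefixed name (chr1, chrX, chrM)."""
    if contig == "MT":
        return "chrM"  # assumption: the reference names the mitochondrion chrM
    if contig.isdigit() or contig in {"X", "Y"}:
        return "chr" + contig
    return contig  # anything else (e.g. unplaced contigs) is left untouched


def rename_vcf_line(line: str) -> str:
    """Rename the contig on one VCF line, leaving other header lines unchanged."""
    if line.startswith("##contig="):
        return _CONTIG_HEADER.sub(
            lambda m: m.group(1) + add_chr_prefix(m.group(2)), line
        )
    if line.startswith("#"):
        return line
    # Data line: CHROM is the first tab-separated column.
    chrom, sep, rest = line.partition("\t")
    return add_chr_prefix(chrom) + sep + rest
```

The idea would be to stream the VCF through `rename_vcf_line` one line at a time (for example with `gzip.open(path, "rt")`) and write the result to a new file, then re-index it before passing it to GATK.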
EDIT: Upon further inspection, the SNP VCF file with 'papu' in its name does include unplaced contigs, but it still does not use the 'chr' notation. I also found that Ensembl has its own dbSNP release (version 144) that corresponds to Ensembl 83 (GRCh38), but I do not see a download link. I also see that UCSC adopted the Gencode/Ensembl format, but their SNP files do not include variants on unplaced contigs. First, does this matter for the purpose of running GATK? Second, is it possible to merge the common, clinically associated, and multimapped variants into one VCF, and is this advisable?
ERROR /00-All.vcf contigs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT]
ERROR reference contigs = [chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY, chrM, GL000008.2, GL000009.2, GL000194.1, GL000195.1, GL000205.2, GL000208.1, GL000213.1, GL000214.1, GL000216.2, GL000218.1, GL000219.1, GL000220.1, GL000221.1, GL000224.1, GL000225.1, GL000226.1, KN538364.1, KQ031383.1, KN538369.1, JH159136.1, JH159137.1, KQ031387.1, KN538360.1, KN196484.1, KN196476.1, KN196479.1, KN196473.1, KN196487.1, KN196475.1, KQ090016.1, KN538361.1, KN196474.1, KQ090022.1, KN196478.1, KN196480.1, KQ090028.1, KN196483.1, KN196481.1, KN538363.1, KN538362.1, KQ031385.1, KQ031386.1, KQ031388.1, KN538365.1, KN538366.1, KN538367.1, KN538370.1, KN538373.1, KN538371.1, KQ031384.1, KN538372.1, KQ090021.1, KN196482.1, KQ458386.1, KN196472.1, GL383545.1, GL383546.1, KI270824.1, KI270825.1, KQ090020.1, GL383547.1, KN538368.1, KI270826.1, KI270827.1, KI270829.1, KI270830.1, KI270831.1, KI270832.1, KI270902.1, KI270903.1, KI270927.1, GL877875.1, GL383549.1, GL383550.2, KQ090023.1, GL877876.1, GL383552.1, KI270904.1, GL383553.2, KI270835.1, GL383551.1, KI270837.1, KI270833.1, KI270834.1, KI270836.1, KI270838.1, KI270839.1, KI270840.1, KI270841.1, KI270842.1, KI270843.1, KQ090024.1, KQ090025.1, KI270844.1, KI270845.1, KI270846.1, KI270847.1, KI270852.1, KI270848.1, GL383554.1, KI270906.1, GL383555.2, KI270851.1, KI270849.1, KI270905.1, KI270850.1, KQ031389.1, KI270853.1, GL383556.1, GL383557.1, KI270855.1, KQ031390.1, KI270856.1, KQ090027.1, KQ090026.1, KI270854.1, KI270909.1, GL383563.3, KI270861.1, GL383564.2, GL000258.2, KI270860.1, KI270907.1, KI270862.1, ... ...
(contracted to meet character limit)
When I printed the first and last 3000 lines of NCBI's 'papu' VCF (which has SNPs for unplaced contigs), the contig names included accessions such as NT_113889.1. The genome reference that this SNP file corresponds to (GRCh38) from Ensembl does not have any unplaced contigs starting with NT or NW (they only start with GL, KN, KW, JH, KI). Does this mean I need to edit the unplaced contig names in my dbSNP file to match those of my genome reference?
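If renaming is required, I imagine it would look something like the sketch below. It assumes a tab-separated NCBI assembly report for GRCh38 in which column 5 is the GenBank accession (the GL/KI/KN style names in my reference) and column 7 is the RefSeq accession (the NT_/NW_ style names in the dbSNP file), with `na` marking a missing value. That column layout is my assumption and should be checked against the report's own header line; `refseq_to_genbank` is a hypothetical helper name.

```python
def refseq_to_genbank(report_lines):
    """Build a {RefSeq accession: GenBank accession} dict from assembly report lines.

    Assumed layout (tab-separated, '#' lines are comments):
    column 5 = GenBank-Accn, column 7 = RefSeq-Accn, 'na' = not available.
    """
    mapping = {}
    for line in report_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comment/header and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 7:
            continue  # skip malformed rows rather than guessing
        genbank, refseq = fields[4], fields[6]
        if genbank != "na" and refseq != "na":
            mapping[refseq] = genbank
    return mapping
```

With such a dictionary, the first column of each dbSNP record could be looked up and replaced, leaving any name that is not in the mapping unchanged.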