Hello Guys,
I hope somebody can help me. I am following a pipeline to call SNPs from FASTQ files. I have successfully aligned the reads to hg38 and marked duplicates in my BAM file with Picard. I am now at the step where I use GATK to call variants. I know GATK requires a dbSNP file to use as a reference. I downloaded the latest dbSNP release (dbSNP154 v2) from this website: https://ftp.ncbi.nih.gov/snp/latest_release/VCF/. The chromosomes are named differently in that release, so I looked at the assembly report, extracted the relevant columns, and renamed the chromosomes using bcftools annotate --rename-chrs. I then had to reorder the rows in the new file using bcftools sort, because indexing the file with tabix was giving me an error. However, after all these steps, when I run GATK on my sample using this dbSNP154 v2 release I get the following error: htsjdk.samtools.SAMException: Sequence name '' doesn't match regex: '[0-9A-Za-z!#$%&+./:;?@^_|~-][0-9A-Za-z!#$%&+./:;=?@^_|~-]'
If I run GATK with an older version of dbSNP (like 151), it works perfectly fine. Any ideas on how I can run GATK using dbSNP154 v2 for known sites? Thanks!
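Since the error reports an empty sequence name (''), one way to narrow it down might be to scan the renamed VCF for contig names that are empty or otherwise malformed. Below is a minimal Python sketch of that check. The function name find_bad_contig_names is made up for illustration, and the pattern is the reference-name rule from the SAM specification, which I am assuming is what htsjdk enforces here.

```python
import re

# Reference-name pattern from the SAM specification.
# Assumption: this is the rule htsjdk applies to sequence names.
VALID_NAME = re.compile(
    r"[0-9A-Za-z!#$%&+./:;?@^_|~-][0-9A-Za-z!#$%&*+./:;=?@^_|~-]*"
)
# Captures the ID field of a "##contig=<ID=...>" header line.
CONTIG_ID = re.compile(r"^##contig=<ID=([^,>]*)")


def find_bad_contig_names(vcf_lines):
    """Return a sorted list of contig names that fail the pattern.

    Checks both the ##contig header lines and the CHROM column
    of every record.
    """
    bad = set()
    for line in vcf_lines:
        if not line.strip():
            continue  # ignore blank lines
        if line.startswith("##contig="):
            match = CONTIG_ID.match(line)
            name = match.group(1) if match else ""
        elif line.startswith("#"):
            continue  # other header lines carry no contig name
        else:
            name = line.split("\t", 1)[0].rstrip("\n")
        if not VALID_NAME.fullmatch(name):
            bad.add(name)
    return sorted(bad)
```

For a bgzipped file, the lines can come from gzip.open(path, "rt"), where path is whatever the renamed VCF is called. Note that a name like "na" passes this pattern, so this check only finds empty or malformed names, not placeholder ones.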
Alex
Show us the chromosome names in the VCF.
Hello, I replied below, thanks!
Hello, see below. The 1st column was replaced with the 2nd column in the VCF file; this is only part of the rows that were replaced. I just noticed that the 2nd column contains a bunch of "na" entries, and those are probably what is messing up GATK. Is there a way to eliminate those "na" entries from the VCF file? Thanks!
Alex
NC_000001.11 chr1
NC_000002.12 chr2
NC_000003.12 chr3
NC_000004.12 chr4
NC_000005.10 chr5
NC_000006.12 chr6
NC_000007.14 chr7
NC_000008.11 chr8
NC_000009.12 chr9
NC_000010.11 chr10
NC_000011.10 chr11
NC_000012.12 chr12
NC_000013.11 chr13
NC_000014.9 chr14
NC_000015.10 chr15
NC_000016.10 chr16
NC_000017.11 chr17
NC_000018.10 chr18
NC_000019.10 chr19
NC_000020.11 chr20
NC_000021.9 chr21
NC_000022.11 chr22
NC_000023.11 chrX
NC_000024.10 chrY
NT_187361.1 chr1_KI270706v1_random
NT_187362.1 chr1_KI270707v1_random
NT_187363.1 chr1_KI270708v1_random
NT_187364.1 chr1_KI270709v1_random
NT_187365.1 chr1_KI270710v1_random
NT_187366.1 chr1_KI270711v1_random
NT_187367.1 chr1_KI270712v1_random
NT_187368.1 chr1_KI270713v1_random
NT_187369.1 chr1_KI270714v1_random
NT_187370.1 chr2_KI270715v1_random
NT_187371.1 chr2_KI270716v1_random
NT_167215.1 chr3_GL000221v1_random
NT_113793.3 chr4_GL000008v2_random
NT_113948.1 chr5_GL000208v1_random
NT_187372.1 chr9_KI270717v1_random
NT_187373.1 chr9_KI270718v1_random
NT_187374.1 chr9_KI270719v1_random
NT_187375.1 chr9_KI270720v1_random
NT_187376.1 chr11_KI270721v1_random
NT_113796.3 chr14_GL000009v2_random
NT_113888.1 chr14_GL000194v1_random
NT_167219.1 chr14_GL000225v1_random
NT_187377.1 chr14_KI270722v1_random
NT_187378.1 chr14_KI270723v1_random
NT_187379.1 chr14_KI270724v1_random
NT_187380.1 chr14_KI270725v1_random
NT_187381.1 chr14_KI270726v1_random
NT_187382.1 chr15_KI270727v1_random
NW_021159989.1 na
NW_015495300.1 chr4_KQ983257v1_fix
NW_021159990.1 na
NW_021159991.1 na
NW_021159992.1 na
NW_021159993.1 na
NW_021159994.1 na
NW_021159995.1 na
NT_187685.1 chr19_KI270931v1_alt
NT_187686.1 chr19_KI270932v1_alt
NT_187687.1 chr19_KI270933v1_alt
NT_113949.2 chr19_GL000209v2_alt
NC_012920.1 chrM
na chrUn_KI270752v1
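One possible way to deal with the "na" rows is to drop them from the rename map before passing it to bcftools annotate --rename-chrs, so that no contig is ever renamed to "na". Here is a minimal Python sketch, assuming the map is whitespace-separated with the old name first and the new name second, as in the listing above. The function name clean_rename_map is made up for illustration.

```python
def clean_rename_map(lines, missing="na"):
    """Return (old, new) pairs, skipping incomplete or placeholder rows.

    A row is skipped when it does not have exactly two fields or when
    either field equals the placeholder value (assumed to be "na").
    """
    pairs = []
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # blank or malformed row
        old, new = fields
        if old == missing or new == missing:
            continue  # no usable mapping for this contig
        pairs.append((old, new))
    return pairs
```

Writing each pair back out as "old<TAB>new" gives a map without the "na" rows. Contigs that were dropped from the map keep their original accession names in the VCF, so they may still need to be excluded separately if the reference used for alignment does not contain them.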
I am wondering how to tell whether it is dbSNP153 or dbSNP154: https://ftp.ncbi.nih.gov/snp/latest_release/VCF/ Thanks.
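One way to check the build might be to look at the VCF header itself. The Python sketch below assumes the file carries a header line of the form ##dbSNP_BUILD_ID=<number>; that key is an assumption based on how dbSNP VCF headers usually look, so it is worth confirming by eye on the downloaded file. The function name dbsnp_build_id is made up for illustration.

```python
import re

# Assumed header key: "##dbSNP_BUILD_ID=<number>".
BUILD_LINE = re.compile(r"^##dbSNP_BUILD_ID=(\S+)")


def dbsnp_build_id(vcf_lines):
    """Return the build ID from the VCF meta-information lines, or None."""
    for line in vcf_lines:
        if not line.startswith("##"):
            break  # meta-information lines are over
        match = BUILD_LINE.match(line)
        if match:
            return match.group(1)
    return None
```

For a bgzipped download, the lines can be read with gzip.open(path, "rt"), where path is the local file name. Only the header is scanned, so this returns quickly even on a very large file.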