Hello everyone! I've been trying to use GATK with an updated version of the human genome, as the GATK files are about ten years out of date.
I've downloaded the NCBI reference GCF_000001405.40.fna, which is GRCh38.p14.
For dbSNP, I've downloaded GCF_000001405.40.gz, which is also GRCh38.p14.
When extracting the contig names from my reference file, I found:
NC_000001.11 Homo sapiens chromosome 1, GRCh38.p14 Primary Assembly
0 252068378 NT_187361.1 Homo sapiens chromosome 1 unlocalized genomic scaffold, GRCh38.p14 Primary As etc...
Extracting the contig names:
reference contigs = [NC_000001.11, NT_187361.1, NT_187362.1, NT_187363.1, NT_187364.1, NT_187365.1, NT_187366.1, NT_187367.1, NT_187368.1, NT_187369.1, NC_000002.12, NT_187370.1, NT_187371.1, NC_000003.12, NT_167215.1, NC_000004.12, NT_113793.3...
For dbSNP file, I found:
features contigs = [chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY, chrM, chr1_KI270706v1_random, chr1_KI270707v1_random...
This mismatch causes a bunch of errors with GATK and other annotation tools.
I'm lost as to which option would be best: converting the contig names in all BAMs and the reference file, or converting the contig names in the dbSNP VCF. I have no idea how to do either!
Duplicate of: dbSNP with RefSeq chromosome notation
See: bcftools annotate --rename-chrs
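To make the suggestion above a bit more concrete, here is a minimal sketch of how the rename file could be built. It assumes you have the NCBI assembly report for GCF_000001405.40 (a tab-separated text file whose column header line includes `RefSeq-Accn` and `UCSC-style-name`); the function name `build_rename_map` and the file names in the comments are hypothetical. `--rename-chrs` expects one "old_name new_name" pair per line, separated by whitespace.

```python
def build_rename_map(report_lines, src="UCSC-style-name", dst="RefSeq-Accn"):
    """Return 'old<TAB>new' lines usable with bcftools annotate --rename-chrs.

    report_lines: iterable of lines from an NCBI assembly report.
    src / dst: report columns holding the current and the desired contig names.
    """
    header = None
    pairs = []
    for line in report_lines:
        line = line.rstrip("\r\n")
        if line.startswith("#"):
            # Assumption: the last commented line before the data rows
            # holds the column names.
            header = line.lstrip("# ").split("\t")
            continue
        if not line or header is None:
            continue
        row = dict(zip(header, line.split("\t")))
        old, new = row.get(src, "na"), row.get(dst, "na")
        # The report uses "na" where a contig has no name in that scheme.
        if old == "na" or new == "na":
            continue
        pairs.append(f"{old}\t{new}")
    return pairs


if __name__ == "__main__":
    # Two-line excerpt in the assembly report layout, for illustration only.
    sample = [
        "# Sequence-Name\tSequence-Role\tAssigned-Molecule\t"
        "Assigned-Molecule-Location/Type\tGenBank-Accn\tRelationship\t"
        "RefSeq-Accn\tAssembly-Unit\tSequence-Length\tUCSC-style-name\n",
        "1\tassembled-molecule\t1\tChromosome\tCM000663.2\t=\t"
        "NC_000001.11\tPrimary Assembly\t248956422\tchr1\n",
    ]
    print("\n".join(build_rename_map(sample)))
    # Write the result to e.g. rename.txt (hypothetical name), then:
    #   bcftools annotate --rename-chrs rename.txt dbsnp.vcf.gz -Oz -o dbsnp.renamed.vcf.gz
```

The default direction (chr names to RefSeq accessions) renames the dbSNP VCF to match the NCBI reference, so the BAMs and the reference stay untouched. Swapping `src` and `dst` gives the opposite mapping. Re-index the renamed VCF before passing it to GATK.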
When I downloaded the file from https://ftp.ncbi.nlm.nih.gov/snp/latest_release/VCF/, I got the same contigs that are present in the NCBI reference. However, this created another problem for me: the dbSNP rsIDs do not seem to be mapped to the main chromosomes. For example, rs28371738 was coming up on NW_015148968.1 instead of on chr22/NC_0000022....
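One thing that might be worth checking (I have not verified it against this particular file) is whether the same rsID is listed on more than one contig, for example once on a patch or alternate scaffold and once on the primary chromosome. Below is a small sketch that collects every contig on which a given rsID appears. The function name `contigs_by_rsid` is hypothetical, and the positions in the demo records are made-up placeholders, not real dbSNP coordinates.

```python
from collections import defaultdict


def contigs_by_rsid(vcf_lines, wanted):
    """Map each requested rsID to the (contig, position) pairs where it occurs.

    vcf_lines: iterable of uncompressed VCF text lines.
    wanted: iterable of rsIDs to look for in the ID column.
    """
    wanted = set(wanted)
    hits = defaultdict(list)
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip meta-information and the column header
        fields = line.rstrip("\r\n").split("\t", 3)
        if len(fields) < 3:
            continue  # not a complete VCF record
        chrom, pos, ids = fields[0], fields[1], fields[2]
        # The ID column may hold several identifiers separated by ';'.
        for rsid in ids.split(";"):
            if rsid in wanted:
                hits[rsid].append((chrom, int(pos)))
    return dict(hits)


if __name__ == "__main__":
    # Placeholder records: contig names come from this thread,
    # positions and alleles are invented for the demo.
    demo = [
        "##fileformat=VCFv4.2\n",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
        "NC_000022.11\t1000\trs28371738\tG\tA\t.\t.\t.\n",
        "NW_015148968.1\t2000\trs28371738\tG\tA\t.\t.\t.\n",
    ]
    print(contigs_by_rsid(demo, ["rs28371738"]))
```

For the real file you could feed it lines from `gzip.open(path, "rt")`. If the rsID does turn up on both a main chromosome and a scaffold, restricting the annotation to the NC_ contigs may be enough. If it only appears on the scaffold, the problem lies elsewhere.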