VCF: Replacing RefSeq ID to chr in #CHROM
2
5
Entering edit mode
5.1 years ago
Hadi M ▴ 70

Hi everyone,

I recently downloaded the latest dbSNP VCF and when opening the file, I noticed the #CHROM column is filled with RefSeq ID instead of chr1, chr2 or so on. Here is how the VCF looks like:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO
NC_000001.11    10019   rs775809821     TA      T       .       .       RS=775809821;dbSNPBuildID=144;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=INDEL
NC_000001.11    10039   rs978760828     A       C       .       .       RS=978760828;dbSNPBuildID=150;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV
NC_000001.11    10043   rs1008829651    T       A       .       .       RS=1008829651;dbSNPBuildID=150;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV

I would like to convert the RefSeq ID to its corresponding chromosome number:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO
chr1    10019   rs775809821     TA      T       .       .       RS=775809821;dbSNPBuildID=144;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=INDEL
chr1   10039   rs978760828     A       C       .       .       RS=978760828;dbSNPBuildID=150;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV
chr1    10043   rs1008829651    T       A       .       .       RS=1008829651;dbSNPBuildID=150;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV

Is a tool or script available that can convert the RefSeq ID? Thank you in advance.

VCF • 8.4k views
ADD COMMENT
14
Entering edit mode
4.8 years ago
rrbutleriii ▴ 260

Yeah, that was a real kick in the pants, as they didn't seem to post anything in the release notes, which still list them by chr1, chr2, etc.

Best. documentation. EVAR.

The easy solution is to have a renaming file for all the chromosomes of the format old_name new_name\n per bcftools and use the --rename-chrs option:

bcftools annotate --rename-chrs rename_file.txt --threads num_threads -o new.vcf.gz -Oz original.vcf.gz

Update: NCBI Assembly to the rescue. Look up your assembly (I did both GRCh37.p13 and GRCh38.p13) and click "Download the full sequence report". Or if you copy the links:

# Assembly reports for renaming chromosomes
report_dir='ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405'
wget -N "${report_dir}/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_assembly_report.txt"
wget -N "${report_dir}/GCF_000001405.25_GRCh37.p13/GCF_000001405.25_GRCh37.p13_assembly_report.txt"

# Grab the useful columns
for k in *assembly_report.txt
  do
    out=$(echo $k | sed 's/.txt/.chrnames/')
    grep -e '^[^#]' $k | awk '{ print $7, $1 }' > $out
done

# Annotate
bcftools annotate \
  --rename-chrs GCF_000001405.25_GRCh37.p13_assembly_report.chrnames \
  --threads 10 -Oz \
  -o GRCh37.dbSNP153.vcf.gz \
  GCF_000001405.25.gz
bcftools annotate \
  --rename-chrs GCF_000001405.39_GRCh38.p13_assembly_report.chrnames \
  --threads 10 -Oz \
  -o GRCh38.dbSNP153.vcf.gz \
  GCF_000001405.38.gz
ADD COMMENT
1
Entering edit mode

Huge thanks to you! This is so essential for anyone who wants to use dbSNP VCF, yet nothing was mentioned in the NCBI documentation!

ADD REPLY
0
Entering edit mode

Hi @rrbutleriii, How do you manage to use multiple threads (in bedtools) to process that single task?

ADD REPLY
0
Entering edit mode

Hi rrbutleriii, brilliant answer- 4.4 years later and it is still an issue (also the documentation is still unhelpful). Adding here an update for the most recent version, to get chromosome names as chrN rather than N:

# Assembly reports for renaming chromosomes
report_dir='ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405'
wget -N "${report_dir}/GCF_000001405.40_GRCh38.p14/GCF_000001405.40_GRCh38.p14_assembly_report.txt"

# Grab the useful columns
for k in *assembly_report.txt
  do
    out=$(echo $k | sed 's/.txt/.chrnames/')
    grep -e '^[^#]' $k | awk  -F $'\t' '{print $7, $10}' > $out
done

# Annotate
bcftools annotate \
  --rename-chrs GCF_000001405.40_GRCh38.p14_assembly_report.chrnames \
  --threads 10 -Oz \
  -o "GRCh38.dbSNP156.vcf.gz" \
  "GCF_000001405.40.gz"

# Create index
bcftools index \
  --tbi \
  --threads 10 \
  "GRCh38.dbSNP156.vcf.gz"
ADD REPLY
1
Entering edit mode
4.8 years ago
tothepoint ▴ 940

Look for the specific position of ref sequence here : Reference sequence with chromosome position

Change each reference position with chr1, chr2 and so on :

awk '{gsub(/NC_000001.11/, "chr1"); gsub(/NC_000002.12/, "chr2") gsub(/NC_000003.13/, "chr3") .......); print;}' your.vcf > your_final.vcf

Here you go.

ADD COMMENT

Login before adding your answer.

Traffic: 1688 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6