Question

How to get SNP identifiers from VCF file?

2

Entering edit mode

7.2 years ago

O.rka ▴ 740

I grabbed the first couple of columns from my whole genome VCF file. It looks like this. Are there any tools either web-based, Python, or R that I could use to get SNP identifiers (e.g. rs429358 or rs7412) for all of my SNPs that are in a particular database? I'm very new to working with VCF files and I want to figure out my blood-type. I'm very comfortable coding in both Python and R if there are any packages available for these languages. I would like to avoid, if possible, depositing my sequences in a 3rd party that would potentially use my information for their own gains but I am not opposed to having them as references in case I can't figure out any other options.

#CHROM  POS ID  REF ALT QUAL
chrM    64  .   C   T   3070.00
chrM    73  .   A   G   3070.00
chrM    146 .   T   C   3070.00
chrM    153 .   A   G   3070.00
chrM    263 .   A   G   3070.00
chrM    310 .   T   C   3070.00
chrM    513 .   GCA G   3070.00
chrM    663 .   A   G   3070.00
chrM    750 .   A   G   3070.00
chrM    1438    .   A   G   3070.00
chrM    1598    .   G   A   3070.00
chrM    1736    .   A   G   3070.00
chrM    1888    .   G   A   3070.00
chrM    2706    .   A   G   3070.00
chrM    3106    .   CN  C   3070.00
chrM    4248    .   T   C   3070.00
chrM    4769    .   A   G   3070.00
chrM    4824    .   A   G   3070.00
chrM    7028    .   C   T   3070.00
chrM    8027    .   G   A   3070.00
chrM    8794    .   C   T   3070.00
chrM    8860    .   A   G   3070.00
chrM    11719   .   G   A   3070.00

SNP vcf genome database clinical • 11k views

ADD COMMENT • link updated 7.2 years ago by Denise CS ★ 5.2k • written 7.2 years ago by O.rka ▴ 740

3

Entering edit mode

7.2 years ago

finswimmer 16k

Hello,

what you are looking for is called "Annotating". There a couple of programs that can use other vcf/bed,/tab-delimited files which contains information that you want to see in your vcf.

bcftools and snpEff/SnpSift are such programs .

So first download the dbnSnp database (hg19 / hg38)

Than run e.g. SnpSift:

java -jar SnpSift.jar annotate -id dbSnp.vcf your.vcf > annotated.vcf

fin swimmer

ADD COMMENT • link 7.2 years ago by finswimmer 16k

0

Entering edit mode

I'm trying it out with SnpSift right now. Do you have any examples of its use with bcftools?

ADD REPLY • link 7.2 years ago by O.rka ▴ 740

0

Entering edit mode

I ran it on the grid and got the following error:

qsub -cwd -V -N sift "SnpSift annotate -id 00-All.vcf genome.vcf > genome.dbSNP.hg38.annotated.vcf"
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:3284)
    at org.snpsift.annotate.VcfIndexDataChromo.grow(VcfIndexDataChromo.java:101)
    at org.snpsift.annotate.VcfIndexDataChromo.add(VcfIndexDataChromo.java:46)
    at org.snpsift.annotate.VcfIndex.add(VcfIndex.java:67)
    at org.snpsift.annotate.VcfIndex.loadIntervals(VcfIndex.java:245)
    at org.snpsift.annotate.VcfIndex.index(VcfIndex.java:183)
    at org.snpsift.annotate.DbVcfSorted.open(DbVcfSorted.java:55)
    at org.snpsift.annotate.AnnotateVcfDb.open(AnnotateVcfDb.java:395)
    at org.snpsift.SnpSiftCmdAnnotate.annotateInit(SnpSiftCmdAnnotate.java:190)
    at org.snpsift.SnpSiftCmdAnnotate.annotate(SnpSiftCmdAnnotate.java:70)
    at org.snpsift.SnpSiftCmdAnnotate.run(SnpSiftCmdAnnotate.java:410)
    at org.snpsift.SnpSiftCmdAnnotate.run(SnpSiftCmdAnnotate.java:397)
    at org.snpsift.SnpSift.run(SnpSift.java:588)
    at org.snpsift.SnpSift.main(SnpSift.java:76)

ADD REPLY • link 7.2 years ago by O.rka ▴ 740

0

Entering edit mode

It looks like you are running out of memory. Try to limit the memory java can use by setting the -Xmx paramater. E.g.

java -Xmx12G -jar SnpSift.jar annotate -id dbSnp.vcf your.vcf > annotated.vcf

if you want to allow java to take 12GB.

BTW: You should index the gzip'ed dbSnp file with tabix and use the .gz file in the command above. Or you can download the index file directly: hg19/hg38

fin swimmer

ADD REPLY • link 7.2 years ago by finswimmer 16k

1

Entering edit mode

7.2 years ago

Denise CS ★ 5.2k

Use the Ensembl Variant Effect Predictor. If your variant is known in the public database e.g. dbSNP you will get its rsID as VEP has an output option for co-located variants. VEP accepts VCF as an input file and you can run it through the Ensembl REST API.

ADD COMMENT • link 7.2 years ago by Denise CS ★ 5.2k

0

Entering edit mode

Do you know if this takes the truncated VCF that I've previewed above? or full VCFs only?

ADD REPLY • link 7.2 years ago by O.rka ▴ 740

0

Entering edit mode

It errored when I tried both the full and trimmed VCFs

ADD REPLY • link 7.2 years ago by O.rka ▴ 740

0

Entering edit mode

7.2 years ago

steve ★ 3.5k

I typically use ANNOVAR for annotating data like this, you can specify dbSNP output if you've installed the corresponding reference database

ADD COMMENT • link 7.2 years ago by steve ★ 3.5k

0

Entering edit mode

Do you have any example commands you would use?

ADD REPLY • link 7.2 years ago by O.rka ▴ 740

0

Entering edit mode

yes I recently made a script for this here. The installation commands are here

ADD REPLY • link 7.2 years ago by steve ★ 3.5k

score 5 · Accepted Answer · 2018-02-21

5

Entering edit mode

7.2 years ago

Alex Reynolds 36k

Make a BED file of SNPs for your genome of interest, e.g. hg38 and dbSNP v147:

$ LC_ALL=C
$ wget -qO- http://hgdownload.cse.ucsc.edu/goldenpath/hg38/database/snp147.txt.gz \
   | gunzip -c \
   | awk -v OFS="\t" '{ print $2,$3,($3+1),$5 }' \
   | sort-bed - \
   > hg38.snp147.bed

Convert your VCF file to BED and map the ID field from the SNP file:

$ bedmap --echo --echo-map-id --delim '\t' <(vcf2bed --sort-tmpdir=$PWD < variants.vcf) hg38.snp147.bed > answer.bed

The rs* ID for the variant will be in the last column of answer.bed.

ADD COMMENT • link 7.2 years ago by Alex Reynolds 36k

0

Entering edit mode

I first ran this: LC_ALL=C wget -qO- http://hgdownload.cse.ucsc.edu/goldenpath/hg38/database/snp147.txt.gz \ | gunzip -c \ | awk -v OFS="\t" '{ print $2,$3,($3+1),$5 }' \ | sort-bed - \

hg38.snp147.bed

and then ran my non-trimmed (all the columns) vcf file genome.vcf:

bedmap --echo --echo-map-id --delim '\t' <(vcf2bed --sort-tmpdir=./jespinoz/ < genome.vcf) hg38.snp147.bed > genome.vcf.hg38.snp147.bed

but none of the columns contain the rs* IDs

-bash-4.1$ grep "rs" genome.vcf.hg38.snp147.bed
-bash-4.1$

Do you know what could be happening?

ADD REPLY • link 7.2 years ago by O.rka ▴ 740

0

Entering edit mode

Can you show the top ten or twenty lines of your genome.vcf? I just want to verify that conversion will work. If your file isn't in VCF format, we could look at using other tools to convert it to BED so that you can map IDs to positions.

ADD REPLY • link 7.2 years ago by Alex Reynolds 36k

0

Entering edit mode

I ended up running vcf2bed and then: bedmap --echo --echo-map-id --delim '\t' hg38.snp147.bed genome.bed > genome.vcf.hg38.snp147.bed and it worked. Do you know of any way I can map the rs* IDs to their descriptive names?