Extract number of SNPs for any given gene
5
3
Entering edit mode
6.6 years ago
Matt En ▴ 30

What is the easiest way to count the number of SNPs in humans (and distinguish them by the standard NCBI filters of "missense mutations" / "nonsense mutations" / "synonymous mutations" / "intron variants")? I wish to get the data from NCBI's dbSNP database. I can see the numbers through the NCBI variation viewer. Tried doing it through NCBI Entrez Direct but haven't gotten it to work thus far. Any ideas?

SNP • 1.8k views
ADD COMMENT
1
Entering edit mode
6.6 years ago

Hello,

one way is to download the dbSnp vcf file an annotate it with snpEff (with -canon option). snpEff produces a summary file containing the infos you like.

As the dbSnp file is huge it might take some time to annotate ...

If you are just interested in a given gene, you can first build a subset of variants using tabix and annotate than.

fin swimmer

ADD COMMENT
1
Entering edit mode
6.6 years ago
GenoMax 147k

Try the following.

$ esearch -db snp -query "brca1" | esummary | xtract -pattern DocumentSummary -element SNP_ID -element FXN_CLASS -element VALIDATED 
1491587915      intron-variant  no-info
1491573730      intron-variant  no-info
1491558507      intron-variant  no-info
1491544899      frameshift-variant,nc-transcript-variant,reference,upstream-variant-2KB,utr-variant-5-prime     no-info
1491538523      intron-variant  no-info
1491356043      intron-variant  by-cluster
1491320693      intron-variant  no-info
1491293961      frameshift-variant,nc-transcript-variant,reference,upstream-variant-2KB,utr-variant-5-prime     no-info
1491277616      intron-variant  no-info

Truncated due to space constraint.

ADD COMMENT
1
Entering edit mode
6.6 years ago
GenoMax 147k

One more variation. If you only need numbers.

$ esearch -db snp -query "brca1" | esummary | xtract -pattern DocumentSummary -element SNP_ID -element FXN_CLASS -element VALIDATED | awk -F '\t' '{print $2}' | sort | uniq -c
      2 cds-indel,frameshift-variant,intron-variant,nc-transcript-variant,reference
     49 cds-indel,intron-variant,nc-transcript-variant
     22 cds-indel,nc-transcript-variant
      1 cds-indel,nc-transcript-variant,reference
      1 cds-indel,nc-transcript-variant,upstream-variant-2KB
      1 cds-indel,nc-transcript-variant,upstream-variant-2KB,utr-variant-5-prime
      1 cds-indel,nc-transcript-variant,utr-variant-3-prime
      1 cds-indel,splice-acceptor-variant
     93 downstream-variant-500B
    125 downstream-variant-500B,nc-transcript-variant,utr-variant-3-prime
      1 frameshift-variant,intron-variant,nc-transcript-variant
   1144 frameshift-variant,intron-variant,nc-transcript-variant,reference
     10 frameshift-variant,intron-variant,reference
      2 frameshift-variant,nc-transcript-variant
    459 frameshift-variant,nc-transcript-variant,reference
      1 frameshift-variant,nc-transcript-variant,reference,splice-donor-variant
     39 frameshift-variant,nc-transcript-variant,reference,upstream-variant-2KB,utr-variant-5-prime
     36 frameshift-variant,nc-transcript-variant,reference,utr-variant-3-prime
      2 frameshift-variant,nc-transcript-variant,reference,utr-variant-5-prime
  16573 intron-variant
     86 nc-transcript-variant,reference,stop-gained
      5 nc-transcript-variant,reference,stop-gained,synonymous-codon
      2 nc-transcript-variant,reference,stop-gained,upstream-variant-2KB,utr-variant-5-prime
      5 nc-transcript-variant,reference,stop-gained,utr-variant-3-prime

Truncated due to space constraint.

ADD COMMENT
1
Entering edit mode
6.6 years ago
JJ ▴ 710

Have a look at the VariantAnnotation package in R - it can do this.

ADD COMMENT
0
Entering edit mode
5.9 years ago
Andrew Su 4.9k

You can get this info using the myvariant.info API. For example, suppose you are interested in SYNE1 (ENSG00000131018), you can get all known variants associated with this gene using this API call:

http://myvariant.info/v1/query?q=cadd.gene.gene_id:ENSG00000131018

which currently says that there are 88731 such variants.

To count the number of missense you can execute this call and see that there are currently 56500:

http://myvariant.info/v1/query?q=cadd.gene.gene_id:ENSG00000131018%20AND%20cadd.consdetail:missense

similarly, there are 3893 "stop_gained" or nonsense variants:

http://myvariant.info/v1/query?q=cadd.gene.gene_id:ENSG00000131018%20AND%20cadd.consdetail:stop_gained

You can also consult this page http://docs.myvariant.info/en/latest/doc/packages.html to learn more about the myvariant libraries for python, R, and node.

ADD COMMENT

Login before adding your answer.

Traffic: 1948 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6