How To Find The List Of All Coding Snps / Exomic Variants In A Given Gene ?

Entering edit mode

14.2 years ago

Khader Shameer 18k

I have a list of genes (for example: SPTA1, THRB, PDGFRA, KIT, LRRC16A, SCGN). I am looking for a resource/way to find list of all coding SNPs (synonymous, non-synonymous, nonsense, missense, frameshift or any other relevant class of snps in coding region) in the exome region of these genes.

EDIT: I am not looking for a way to filter dbSNP/Ensembl variants in a gene using func/consequencetype. The idea is to get list SNPs which can have coding related func/consequencetype and then map the location to available exome data (CDS) to verify the location is in the exome region or not.

variant snp annotation • 8.3k views

ADD COMMENT • link updated 14.2 years ago by Ryan D ★ 3.4k • written 14.2 years ago by Khader Shameer 18k

Entering edit mode

The answers below give solutions using UCSC. However, UCSC uses dbSNP131 only, while the 1000g has put many more in dbSNP132. If it were me, I would try to use dbSNP132.

ADD REPLY • link 14.2 years ago by lh3 33k

Entering edit mode

Thanks a lot @lh3.

ADD REPLY • link 13.9 years ago by Khader Shameer 18k

Entering edit mode

14.2 years ago

Pierre Lindenbaum 166k

Try to use the UCSC mysql server:

select distinct
S.name,
S.chromStart,
S.chromEnd,
S.func,
X.geneSymbol
from 
snp130 as S,
knownGene as K,
kgXref as X

where

X.kgId=K.name and
K.chrom=S.chrom and
K.txStart<=S.chromStart and
S.chromEnd<=K.txEnd and
X.geneSymbol in ('SPTA1', 'THRB', 'PDGFRA', 'KIT', 'LRRC16A', 'SCGN') and
S.func in ('coding-synon','nonsense','missense','frameshift')

Result:

mysql  -h  genome-mysql.cse.ucsc.edu -A -u genome -D hg18 < query.sql

	name chromStart chromEnd func geneSymbol
	rs55755457 55256467 55256468 missense KIT
	rs72549300 55256489 55256490 coding-synon KIT
	rs72549301 55256513 55256514 coding-synon KIT
	rs56411694 55256618 55256619 coding-synon KIT
	rs35525445 55259236 55259236 frameshift KIT
	rs66470020 55264726 55264726 frameshift KIT
	rs66494177 55264744 55264744 frameshift KIT
	rs66538661 55268064 55268065 frameshift KIT
	rs72549293 55270349 55270350 coding-synon KIT
	rs73137716 55270350 55270351 missense KIT
	rs72549294 55270429 55270430 missense KIT
	rs55966164 55284564 55284565 coding-synon KIT
	rs66547161 55286846 55286846 frameshift KIT
	rs56225530 55286895 55286896 missense KIT
	rs67271072 55286924 55286924 frameshift KIT
	rs72550822 55288187 55288188 missense KIT
	rs55792975 55288193 55288194 missense KIT
	rs3822214 55288220 55288221 missense KIT
	rs55986963 55288237 55288238 coding-synon KIT
	rs28933968 55288339 55288340 missense KIT
	rs67104871 55288369 55288370 frameshift KIT
	rs67897412 55288440 55288441 frameshift KIT
	rs28933371 55288441 55288442 missense KIT
	rs72549292 55288764 55288765 coding-synon KIT
	rs35200131 55290338 55290339 missense KIT
	rs35826988 55290404 55290404 frameshift KIT
	rs56094246 55292252 55292253 missense KIT
	rs66819416 55292304 55292304 frameshift KIT
	rs55717477 55292839 55292840 coding-synon KIT
	rs55789615 55294024 55294025 coding-synon KIT
	rs28933969 55294076 55294077 missense KIT
	rs3733542 55297521 55297522 coding-synon KIT
	rs55817813 55297668 55297669 coding-synon KIT
	rs72549296 55299353 55299354 coding-synon KIT
	rs56288823 55299395 55299396 coding-synon KIT
	rs72549297 55299468 55299469 missense KIT
	rs913455 25528322 25528323 coding-synon LRRC16A
	rs41271807 25528328 25528329 coding-synon LRRC16A
	rs9358856 25534746 25534747 missense LRRC16A
	rs9467515 25543736 25543737 coding-synon LRRC16A
	rs73734005 25543783 25543784 missense LRRC16A
	rs41271815 25580715 25580716 coding-synon LRRC16A
	rs12207840 25623882 25623883 missense LRRC16A
	rs35230144 25623991 25623992 coding-synon LRRC16A
	rs35502441 25625618 25625619 frameshift LRRC16A
	rs7454756 25628491 25628492 missense LRRC16A
	rs7452135 25628492 25628493 coding-synon LRRC16A
	rs41271817 25637083 25637084 missense LRRC16A
	rs41271819 25646118 25646119 missense LRRC16A
	rs12527029 25646166 25646167 missense LRRC16A
	rs9885888 25708580 25708581 coding-synon LRRC16A
	rs9885914 25708750 25708751 missense LRRC16A
	rs34691813 25708805 25708806 coding-synon LRRC16A
	rs3747525 25727761 25727762 missense LRRC16A
	rs3747524 25727762 25727763 coding-synon LRRC16A
	rs35805947 54822169 54822170 coding-synon PDGFRA
	rs67744156 54822179 54822180 frameshift PDGFRA
	rs66500524 54822195 54822196 frameshift PDGFRA
	rs36035373 54822204 54822205 missense PDGFRA
	rs67750107 54822220 54822220 frameshift PDGFRA
	rs67545113 54824613 54824613 frameshift PDGFRA
	rs2229307 54824834 54824835 coding-synon PDGFRA
	rs67343603 54824843 54824844 frameshift PDGFRA
	rs55966236 54828253 54828254 coding-synon PDGFRA
	rs67558150 54828262 54828263 frameshift PDGFRA
	rs66482249 54828335 54828336 frameshift PDGFRA
	rs4358459 54828482 54828483 coding-synon PDGFRA
	rs61735622 54831556 54831557 missense PDGFRA
	rs41279521 54831561 54831562 missense PDGFRA
	rs55865821 54833356 54833357 missense PDGFRA
	rs61735621 54833359 54833360 missense PDGFRA
	rs67079911 54833361 54833362 frameshift PDGFRA
	rs56026726 54833399 54833400 coding-synon PDGFRA
	rs35597368 54834527 54834528 missense PDGFRA
	rs1873778 54835811 54835812 coding-synon PDGFRA
	rs55830582 54835841 54835842 coding-synon PDGFRA
	rs67896674 54838326 54838326 frameshift PDGFRA
	rs10028020 54838333 54838334 coding-synon PDGFRA
	rs66793306 54839355 54839355 frameshift PDGFRA
	rs61735626 54841312 54841313 missense PDGFRA
	rs34392012 54841372 54841373 missense PDGFRA
	rs61735625 54841385 54841386 missense PDGFRA
	rs2228230 54846796 54846797 coding-synon PDGFRA
	rs66606083 54848396 54848396 frameshift PDGFRA
	rs67676009 54849782 54849782 frameshift PDGFRA
	rs56384252 54849789 54849790 coding-synon PDGFRA
	rs56028238 54851368 54851369 missense PDGFRA
	rs55996208 54856081 54856082 coding-synon PDGFRA
	rs7685117 54856147 54856148 coding-synon PDGFRA
	rs34147078 54003577 54003578 frameshift PDGFRA
	rs11543418 54019681 54019681 frameshift PDGFRA
	rs15887 54020272 54020273 missense PDGFRA
	rs71797427 54021970 54022362 frameshift PDGFRA
	rs34296513 54022397 54022398 missense PDGFRA
	rs72223577 54056485 54062205 frameshift PDGFRA
	rs72336632 54056485 54061182 frameshift PDGFRA
	rs56148907 54056947 54056948 missense PDGFRA
	rs35082655 54059746 54059747 coding-synon PDGFRA
	rs28392904 54136598 54136599 missense PDGFRA
	rs61739992 54136937 54136938 missense PDGFRA
	rs4864801 54136938 54136939 coding-synon PDGFRA
	rs10002285 54137088 54137089 coding-synon PDGFRA
	rs34128887 54625172 54625173 coding-synon PDGFRA
	rs3194383 54661423 54661424 coding-synon PDGFRA
	rs13144341 54661586 54661587 missense PDGFRA
	rs1132998 54661675 54661676 coding-synon PDGFRA
	rs61737615 54662554 54662555 coding-synon PDGFRA
	rs34355491 25760660 25760661 coding-synon SCGN
	rs36013414 25773234 25773235 missense SCGN
	rs34286843 25777772 25777773 missense SCGN
	rs6942245 25799275 25799276 missense SCGN
	rs2251969 156849269 156849270 coding-synon SPTA1
	rs952094 156850714 156850715 missense SPTA1
	rs41273519 156855744 156855745 missense SPTA1
	rs34647363 156856629 156856629 frameshift SPTA1
	rs3753068 156859558 156859559 coding-synon SPTA1
	rs3737515 156864130 156864131 missense SPTA1
	rs16830483 156871014 156871015 missense SPTA1
	rs61140046 156872370 156872371 coding-synon SPTA1
	rs3738791 156873072 156873073 coding-synon SPTA1
	rs857725 156874558 156874559 missense SPTA1
	rs863931 156878859 156878860 missense SPTA1
	rs41273523 156879342 156879343 missense SPTA1
	rs35237700 156879355 156879356 missense SPTA1
	rs34973695 156879379 156879380 missense SPTA1
	rs34214405 156881915 156881916 missense SPTA1
	rs34773716 156884943 156884944 coding-synon SPTA1
	rs2482965 156886351 156886352 missense SPTA1
	rs12754252 156887857 156887858 missense SPTA1
	rs35733059 156887859 156887860 missense SPTA1
	rs35078963 156889704 156889705 coding-synon SPTA1
	rs34886778 156889767 156889768 coding-synon SPTA1
	rs35948326 156891151 156891152 missense SPTA1
	rs857691 156893001 156893002 coding-synon SPTA1
	rs34706737 156893005 156893006 missense SPTA1
	rs36109350 156893063 156893064 missense SPTA1
	rs16840450 156893938 156893939 coding-synon SPTA1
	rs36057043 156894023 156894024 missense SPTA1
	rs35121052 156897729 156897730 missense SPTA1
	rs34577746 156897767 156897768 missense SPTA1
	rs2518493 156897794 156897795 coding-synon SPTA1
	rs35856400 156897815 156897816 coding-synon SPTA1
	rs7547313 156899154 156899155 missense SPTA1
	rs35318623 156899161 156899162 coding-synon SPTA1
	rs7418956 156899206 156899207 missense SPTA1
	rs2022057 156899226 156899227 coding-synon SPTA1
	rs11265047 156899283 156899284 missense SPTA1
	rs12090314 156902847 156902848 missense SPTA1
	rs34211240 156905923 156905924 missense SPTA1
	rs73020251 156905974 156905975 coding-synon SPTA1
	rs35932551 156906158 156906159 coding-synon SPTA1
	rs41273525 156907783 156907784 coding-synon SPTA1
	rs34446973 156907822 156907823 coding-synon SPTA1
	rs34133563 156912588 156912589 missense SPTA1
	rs325996 156914118 156914119 coding-synon SPTA1
	rs703121 156914145 156914146 coding-synon SPTA1
	rs12083637 156914193 156914194 coding-synon SPTA1
	rs36058424 156914828 156914829 coding-synon SPTA1
	rs16840544 156918017 156918018 missense SPTA1
	rs73020287 156918030 156918031 coding-synon SPTA1
	rs3737521 156919848 156919849 missense SPTA1
	rs41273531 156921579 156921580 missense SPTA1
	rs435080 156921659 156921660 coding-synon SPTA1
	rs28934004 156921702 156921703 missense SPTA1
	rs28934005 156921703 156921704 missense SPTA1
	rs41273533 156921752 156921753 missense SPTA1
	rs28933408 24139407 24139408 missense THRB
	rs13081063 24139513 24139514 coding-synon THRB
	rs28934868 24139616 24139617 missense THRB
	rs28999971 24144097 24144098 missense THRB
	rs28999970 24144103 24144104 missense THRB
	rs1054624 24144127 24144128 missense THRB
	rs28999969 24144143 24144144 missense THRB
	rs3752874 24159998 24159999 coding-synon THRB
	rs9865746 24160086 24160087 missense THRB
	rs61756233 24206597 24206598 missense THRB

view raw biostars-5581.txt hosted with ❤ by GitHub

ADD COMMENT • link updated 5.6 years ago by Ram 45k • written 14.2 years ago by Pierre Lindenbaum 166k

Entering edit mode

AFAIK, there is no table only storing the exons data. You'll have to extract them from knownGene.exonStarts and knownGene.exonEnds

ADD REPLY • link updated 5.6 years ago by Ram 45k • written 14.2 years ago by Pierre Lindenbaum 166k

Entering edit mode

Thanks Pierre, I tried something like this by reverse engineering your earlier answers. My concern is that how can I verify these are actually coding regions. Do you know about any table that store information about coding (for example NCBI CDS)/exomic regions of human genome to map this positions ?

ADD REPLY • link 14.2 years ago by Khader Shameer 18k

Entering edit mode

Thanks Pierre, I will try that option.

ADD REPLY • link 14.2 years ago by Khader Shameer 18k

Entering edit mode

-1. This query is very inefficient. If everyone queries the database like this, it will be a disaster.

ADD REPLY • link 14.2 years ago by lh3 33k

Entering edit mode

-1. This query is very inefficient. We should stop populating wrong queries to benefit other UCSC MySQL users.

ADD REPLY • link 14.2 years ago by lh3 33k

Entering edit mode

-1. Another inefficient UCSC query. Note that Start and End are not indexed. I do not know how UCSC performs such query, but I would use several SQLs instead of using table joining.

ADD REPLY • link 14.2 years ago by lh3 33k

Entering edit mode

14.2 years ago

William Spooner ▴ 310

You can do this using Ensembl BioMart (using the Ensembl API is another option), http://www.ensembl.org/biomart/martview here is the XML representation;

<Dataset name = "hsapiens_snp" interface = "default" >
    <Filter name = "consequence_type" value = "3PRIME_UTR,3PRIME_UTR&amp;NMD_TRANSCRIPT,5PRIME_UTR,5PRIME_UTR&amp;NMD_TRANSCRIPT,COMPLEX_INDEL,COMPLEX_INDEL&amp;NMD_TRANSCRIPT,COMPLEX_INDEL&amp;SPLICE_SITE,ESSENTIAL_SPLICE_SITE&amp;INTRONIC,ESSENTIAL_SPLICE_SITE&amp;INTRONIC&amp;NMD_TRANSCRIPT,FRAMESHIFT_CODING,FRAMESHIFT_CODING&amp;NMD_TRANSCRIPT,FRAMESHIFT_CODING&amp;SPLICE_SITE,FRAMESHIFT_CODING&amp;SPLICE_SITE&amp;NMD_TRANSCRIPT,HGMD_MUTATION,INTRONIC&amp;NMD_TRANSCRIPT,NON_SYNONYMOUS_CODING,NON_SYNONYMOUS_CODING&amp;NMD_TRANSCRIPT,NON_SYNONYMOUS_CODING&amp;SPLICE_SITE,NON_SYNONYMOUS_CODING&amp;SPLICE_SITE&amp;NMD_TRANSCRIPT,PARTIAL_CODON,SPLICE_SITE&amp;3PRIME_UTR,SPLICE_SITE&amp;3PRIME_UTR&amp;NMD_TRANSCRIPT,SPLICE_SITE&amp;5PRIME_UTR,SPLICE_SITE&amp;5PRIME_UTR&amp;NMD_TRANSCRIPT,SPLICE_SITE&amp;INTRONIC,SPLICE_SITE&amp;INTRONIC&amp;NMD_TRANSCRIPT,SPLICE_SITE&amp;SYNONYMOUS_CODING,SPLICE_SITE&amp;SYNONYMOUS_CODING&amp;NMD_TRANSCRIPT,STOP_GAINED,STOP_GAINED&amp;FRAMESHIFT_CODING,STOP_GAINED&amp;FRAMESHIFT_CODING&amp;NMD_TRANSCRIPT,STOP_GAINED&amp;NMD_TRANSCRIPT,STOP_GAINED&amp;SPLICE_SITE,STOP_GAINED&amp;SPLICE_SITE&amp;NMD_TRANSCRIPT,STOP_LOST,STOP_LOST&amp;NMD_TRANSCRIPT,STOP_LOST&amp;SPLICE_SITE,STOP_LOST&amp;SPLICE_SITE&amp;NMD_TRANSCRIPT,SYNONYMOUS_CODING,SYNONYMOUS_CODING&amp;NMD_TRANSCRIPT,WITHIN_MATURE_miRNA,WITHIN_NON_CODING_GENE"/>
    <Attribute name = "refsnp_id" />
    <Attribute name = "chr_name" />
    <Attribute name = "chrom_start" />
    <Attribute name = "consequence_type_tv" />
    <Attribute name = "ensembl_transcript_stable_id" />
</Dataset>

<Dataset name = "hsapiens_gene_ensembl" interface = "default" >
    <Filter name = "hgnc_symbol" value = "SPTA1,THRB,PDGFRA,KIT,LRRC16A,SCGN"/>
    <Attribute name = "hgnc_symbol" />
</Dataset>

ADD COMMENT • link updated 5.6 years ago by Ram 45k • written 14.2 years ago by William Spooner ▴ 310

Entering edit mode

you may then want to build a query to retrieve chromosome positions of all exons on your genes, and then query for SNPs on that regions. modifying the code from above to do so should not be complicated.

ADD REPLY • link 14.2 years ago by Jorge Amigo 14k

Entering edit mode

Thanks William. I am not looking for a way to filter SNPs based on consequence type. If we filter based on pre-defined consequence type, there are chances that some of these SNPs may fall in regulatory or intronic region. I am trying to find a way to filter SNPs then check whether they are in exome region or not. Sorry that it was not clear in my question.

ADD REPLY • link 14.2 years ago by Khader Shameer 18k

Entering edit mode

Yes Jorge, but my gut feeling was that the coding snp must have already in another table/db. And it is available as a separate table in snp131CodingDbSnp. Please see Ryan's answer.

ADD REPLY • link 14.2 years ago by Khader Shameer 18k

Entering edit mode

14.2 years ago

Ryan D ★ 3.4k

I think Pierre is on the right track above. I would do it his way, but just pull from the table: snp131CodingDbSnp.

Database: hg19    Primary Table: snp131CodingDbSnp    Row Count: 443,544
Format description: Annotations of the effects of SNPs on translated protein sequence.

Give it a try. And thanks again, Pierre, for another great answer.

Explicitly:

mysql -h genome-mysql.cse.ucsc.edu -A -u genome -D hg19 > ~/scripts/query2.sql

query2.sql below

select
distinct
S.name,
S.chromStart,
S.chromEnd,
S.funcCodes,
X.geneSymbol
from
snp131CodingDbSnp as S,
knownGene as K,
kgXref as X

where

X.kgId=K.name and
K.chrom=S.chrom and
K.txStart<=S.chromStart and
S.chromEnd<=K.txEnd and
X.geneSymbol in ('SPTA1', 'THRB', 'PDGFRA', 'KIT', 'LRRC16A', 'SCGN')`

ADD COMMENT • link updated 5.6 years ago by Ram 45k • written 14.2 years ago by Ryan D ★ 3.4k

Entering edit mode

Incidentally, this table gives about the same results as Pierre's method. That concordance is probably a good thing.

ADD REPLY • link 14.2 years ago by Ryan D ★ 3.4k

Entering edit mode

Thanks a lot Ryan. snp131CodingDbSnp is the one I was looking for.

ADD REPLY • link 14.2 years ago by Khader Shameer 18k