7.9 years ago
mms140130
Hi,
My question is how to find the genes related to a given set of SNPs. Part of the SNP data I have looks like this:
> rs987435 C G 1 1 1 0 2
> rs345783 C G 0 0 1 0 0
> rs955894 G T 1 1 2 2 1
> rs6088791 A G 1 2 0 0 1
> rs11180435 C T 1 0 1 1 1
> rs17571465 A T 1 2 2 2 2
> rs17011450 C T 2 2 2 2 2
> rs6919430 A C 2 1 2 2 2
> rs2342723 C T 0 2 0 0 0
> rs11992567 C T 2 2 2 2 2
The gene annotation data was downloaded from the UCSC website:
> DDX11L1 NR_046018 chr1 + 11873 14409 14409 14409
> WASH7P NR_024540 chr1 - 14361 29370 29370 29370
> LINC01204 NR_104644 chrX + 45364632 45386484 45386484 45386484
> LOC392232 NR_033867 chr8 - 73114986 73163869 73163869 73163869
> FBXL22 NM_203373 chr15 + 63889551 63894620 63889591 63893885
> LOC729737 NR_039983 chr1 - 134772 140566 140566 140566
> LOC100132287 NR_028322 chr1 + 323891 328581 328581 328581
> LOC100132062 NR_028325 chr1 + 323891 328581 328581 328581
where the column names are as follows:
> table refFlat "A gene prediction with additional geneName field."
> (
> string geneName; "Name of gene as it appears in Genome Browser."
> string name; "Name of gene"
> string chrom; "Chromosome name"
> char[1] strand; "+ or - for strand"
> uint txStart; "Transcription start position"
> uint txEnd; "Transcription end position"
> uint cdsStart; "Coding region start"
> uint cdsEnd; "Coding region end"
> uint exonCount; "Number of exons"
> uint[exonCount] exonStarts; "Exon start positions"
> uint[exonCount] exonEnds; "Exon end positions"
> )
How can I use this data to find which genes these SNPs are related to?
Thank you
I don't understand what you mean.
If you want to look up the annotation information for a given dbSNP ID, then you need to interface with the dbSNP database directly and do a lookup. Or you can use one of the web-interface tools available for this purpose, e.g., SNP-nexus takes dbSNP IDs as input.
If you need to look up annotation information based on genomic location, for example like the data you'll see in a standard VCF file, then you can use any of the tools above that work directly with genomic-coordinate variant calls. Alternatively, you can download the annotations (GTF, GFF) corresponding to your sequence assembly, then iterate over your data and compare each position against the annotation features.
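The location-based lookup described above can be sketched in Python. This is only a minimal illustration, assuming the SNP positions are already known (the rsID table in the question carries no coordinates, so they would have to come from a dbSNP table first) and that the annotation uses the refFlat columns quoted in the question. The function names `load_refflat` and `genes_at` are made up for this example, and the query position is arbitrary.

```python
from collections import defaultdict

def load_refflat(lines):
    """Parse refFlat rows into {chrom: [(txStart, txEnd, geneName), ...]}."""
    genes = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or truncated rows
        gene_name, _tx_name, chrom, _strand, tx_start, tx_end = fields[:6]
        genes[chrom].append((int(tx_start), int(tx_end), gene_name))
    return genes

def genes_at(genes, chrom, pos):
    """Return sorted gene names whose transcript span contains pos.

    Assumes UCSC-style coordinates: 0-based start, end not included.
    """
    return sorted({name for start, end, name in genes.get(chrom, [])
                   if start <= pos < end})

# Rows copied from the refFlat excerpt in the question.
refflat_rows = [
    "DDX11L1 NR_046018 chr1 + 11873 14409 14409 14409",
    "WASH7P NR_024540 chr1 - 14361 29370 29370 29370",
    "LOC729737 NR_039983 chr1 - 134772 140566 140566 140566",
]
genes = load_refflat(refflat_rows)

# Hypothetical SNP position, chosen only to show an overlap.
hits = genes_at(genes, "chr1", 14400)
print(hits)
```

The linear scan per SNP is fine for a handful of positions, but for millions of SNPs a sorted or indexed interval structure, or a dedicated tool, would probably be needed.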
Not exactly what OP was asking for (annotation using Python), but still the best/right answer for his issue :-)
HOWTO: Annotations using assembly :).
But honestly, you'd be surprised how often I see bioinformaticians using Python when all they're doing is making a bunch of system calls. It's both hilarious and disturbing. I get why someone would want to do this, as I occasionally do it in R scripts, but still.
The thing is, my data is really large, almost 9 million SNPs, which is why I asked about Python. I'm new to Python, so if anyone can help with that I would appreciate it. I have a file with the rs IDs for the SNPs.
That's quite a lot indeed. Then I think the easiest path is to:
grep -f
Thank you. Can you explain how to get started with SnpEff? I was searching the whole day and got tired, so would you please give me the steps? I appreciate your help.
I think the manual is very clear, what did you try and what doesn't work? Try to be specific when asking questions.
I tried SCAN (http://www.scandb.org/newinterface/index_v1.html), but my data is really large, which is why it doesn't work. Same for SNP-nexus, it didn't work.
I listed the steps here: C: Finding the Genes corresponding to given SNPs using python
What's wrong? Where are you stuck?
I used the following:
curl -s "http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/snp147.txt.gz" | gunzip > datasnp
Can you please tell me where I can find the dbSNP database in VCF format, or whether I can use the txt format I mentioned above?
Now I have the dbSNP database in txt format, and I also have my dbSNP IDs in a txt file called pre_snpinfo_tumor (4th column).
What is the code to filter the SNP database?
You would need a file containing just the dbSNP identifiers; you can isolate these using awk or cut. This file would be DesiredIdentifiers.txt. Then you can use a simple grep, such as
I'm really confused. Could you please tell me where to find the VCF file for the SNP database? What I have downloaded from the UCSC website doesn't contain the gene names.
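Since the original question asked about Python, the filtering step discussed above (pull the rsIDs out of the 4th column, then keep only the matching rows of the dbSNP table) could be sketched as follows. This is an illustration under assumptions, not a tested pipeline: it assumes pre_snpinfo_tumor is whitespace-delimited, and that the rsID sits in the 5th tab-separated column of the UCSC snp147 table with chromosome, start and end in columns 2 to 4. Check both against your actual files. The helper names and the demo rows (including their coordinates) are made up.

```python
import io

def extract_ids(handle, col=3):
    """Collect rsIDs from one column (0-based; 3 = the 4th column) into a set."""
    ids = set()
    for line in handle:
        fields = line.split()
        if len(fields) > col and fields[col].startswith("rs"):
            ids.add(fields[col])
    return ids

def filter_snp_table(table_handle, wanted, name_col=4):
    """Stream a tab-separated SNP table, yielding (chrom, start, end, rsID)
    for rows whose rsID is in `wanted`. Column layout is an assumption."""
    for line in table_handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) > name_col and fields[name_col] in wanted:
            yield fields[1], int(fields[2]), int(fields[3]), fields[name_col]

# Tiny in-memory stand-ins for the real files; all values are invented.
my_snps = io.StringIO("s1 x y rs987435\ns2 x y rs345783\n")
snp_table = io.StringIO(
    "585\tchr1\t100\t101\trs987435\t0\t+\n"
    "585\tchr1\t200\t201\trs111\t0\t+\n"
)

wanted = extract_ids(my_snps)
rows = list(filter_snp_table(snp_table, wanted))
print(rows)
```

Holding the wanted IDs in a set keeps each lookup cheap, and the table is read one line at a time, so the full dbSNP dump never has to fit in memory. For the real files, replace the `io.StringIO` objects with `open(path)` handles, or `gzip.open(path, "rt")` if the table is still compressed. The resulting positions are what a coordinate-based gene lookup needs.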