Extract Flanking Region Of A Snp
2
0
Entering edit mode
13.3 years ago
Patrick ▴ 40

Hello,

I have a file containing a lot of SNP in a tabular format with the chromosome and the position on the chromosome and the allele change

here is an exemple

chr1  2898298 rsxxxxxx A T

The file contain thousands of data like this

I want to extract flanking region of each snp and export both sequences in fasta format for each SNP with both allele, for my example one with A and one sequence with T flanked by 25 bp from each side

I want to do it with UCSC not Ensembl, I browsed biostar and I found how to connect to UCSC via MySQL. Can someone help me to design the query using mysql, I will wrap it in a perl script and try to export both sequences

Regards

Patrick

snp sequence ucsc • 6.2k views
ADD COMMENT
5
Entering edit mode

what is wrong with these solutions? Get Flanking Sequence Given A List Of Positions

ADD REPLY
0
Entering edit mode

I'd download the latest human genome release and use bio::db::sam modules' access reference functions to grab the flanking regions. It's very very fast.

ADD REPLY
3
Entering edit mode
13.3 years ago
lh3 33k

If you do not mind downloading the human reference genome (~900MB), you can:

wget --no-check-certificate https://raw.github.com/lh3/misc/master/seq/seqtk/seqtk.c
wget --no-check-certificate https://raw.github.com/lh3/misc/master/seq/seqtk/khash.h
wget --no-check-certificate https://raw.github.com/lh3/misc/master/seq/seqtk/kseq.h
wget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz
gcc -g -Wall -O2 seqtk.c -lz -o seqtk
sed s,^chr,, your.list | awk -v OFS="\t" '{print $1,$2-26,$2+25}' > snps.bed
./seqtk subseq -t hs37d5.fa.gz snps.bed > snps.seq
paste your.list snps.seq > snps.final.txt

As to UCSC, I do know where the reference sequence is kept in MySQL, but for such a task, it is frequently simpler to do locally. The human reference genome is pretty small in comparison to other data sets. BTW, if you want to get the a sequence in a couple of regions, a more convenient way is:

samtools faidx ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz 1:1,000,000-1,010,000 2:2,000,000-2,010,000

But this method becomes inefficient when you want to get thousands of regions.

ADD COMMENT
2
Entering edit mode
13.3 years ago

for this purpose I used to suggest to my lab colleages this script I found years ago, which queries EMBL and GenBank, from the very interesting bioperl repository. but as Jeremy has pointed out, I see great solutions on this post which I don't see why they wouldn't suit your needs.

ADD COMMENT

Login before adding your answer.

Traffic: 1776 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6