I have a list of RefSeq IDs and I want to get the intron positions relative to the protein sequence. Does anyone have a Python script to grab the introns from the genomic reference of a RefSeq gene and get their positions in the protein?
UCSC has already computed this table: see refGene.txt.gz and refGene.sql, here.
The table contains the exon start and end positions as comma-separated lists (0-based starts, in genomic order), together with the CDS start and end. You then "just have to" reconstruct the protein sequence from the reference sequences (here).
curl -s "http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/refGene.txt.gz" | gunzip -c | head
971 NR_024227 chr19 - 50595745 50595866 50595866 50595866 1 50595745, 50595866, 0 SNAR-A6 unk unk -1,
971 NR_024227 chr19 - 50601082 50601203 50601203 50601203 1 50601082, 50601203, 0 SNAR-A6 unk unk -1,
629 NM_001014809 chr4 - 5822491 5894785 5823486 5894696 14 5822491,5827220,5830215,5837641,5838491,5841248,5843034,5844819,5851118,5853134,5857869,5862752,5868394,5894315, 5823578,5827386,5830395,5837812,5838633,5841405,5843155,5844888,5851199,5853196,5858034,5862937,5868483,5894785, 0 CRMP1 cmpl cmpl 1,0,0,0,2,1,0,0,0,1,1,2,0,0,
808 NM_001029883 chr2 - 29284557 29297127 29287734 29297127 2 29284557,29293459, 29287933,29297127, 0 C2orf71 cmpl cmpl 2,0,
705 NM_024329 chr1 + 15736390 15756839 15736467 15755220 4 15736390,15752366,15753645,15755088, 15736775,15752514,15753780,15756839, 0 EFHD2 cmpl cmpl 0,2,0,0,
768 NM_024328 chr14 + 24025197 24028786 24025966 24028049 2 24025197,24027903, 24026513,24028786, 0 THTPA cmpl cmpl 0,1,
1379 NM_024326 chr10 + 104179570 104182893 104180886 104182750 4 104179570,104181110,104181543,104182560, 104180939,104181264,104182049,104182893, 0 FBXL15 cmpl cmpl 0,2,0,2,
826 NM_138275 chr6 + 31691160 31692850 31691160 31692850 4 31691160,31691415,31692541,31692746, 31691221,31691763,31692621,31692850, 0 C6orf25 cmpl incmpl 0,1,1,0,
609 NM_138275 chr6_cox_hap2 + 3200777 3202467 3200777 3202467 4 3200777,3201032,3202158,3202363, 3200838,3201380,3202238,3202467, 0 C6orf25 cmpl incmpl 0,1,1,0,
607 NM_138275 chr6_dbb_hap3 + 2976730 2978420 2976730 2978420 4 2976730,2976985,2978111,2978316, 2976791,2977333,2978191,2978420, 0 C6orf25 cmpl incmpl 0,1,1,0,
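If you only need where each intron falls in the protein (not the sequence itself), the exon and CDS coordinates in one refGene row are enough. Here is a minimal sketch, not tested beyond the rows above: the function name `protein_intron_positions` is my own, and it assumes the standard refGene column order (strand in column 4, cdsStart/cdsEnd in columns 7 and 8, exonStarts/exonEnds in columns 10 and 11) and a complete CDS with no frameshifts. It ignores the exonFrames column.

```python
def protein_intron_positions(refgene_line):
    """Return (cds_offset, codon_index, phase) for each intron inside the CDS.

    cds_offset  : number of coding bases upstream of the intron
    codon_index : number of complete codons upstream of the intron
    phase       : cds_offset % 3 (0 = intron sits between two codons)
    """
    fields = refgene_line.split()
    strand = fields[3]
    cds_start, cds_end = int(fields[6]), int(fields[7])
    starts = [int(x) for x in fields[9].split(",") if x]
    ends = [int(x) for x in fields[10].split(",") if x]

    # Coding length of each exon; exons lying entirely in a UTR contribute 0.
    coding = [max(0, min(e, cds_end) - max(s, cds_start))
              for s, e in zip(starts, ends)]
    if strand == "-":
        coding.reverse()  # exons are listed in genomic order, not transcript order

    total = sum(coding)
    positions = []
    offset = 0
    for length in coding[:-1]:  # one intron follows every exon except the last
        offset += length
        if 0 < offset < total:  # skip introns located in the UTRs
            positions.append((offset, offset // 3, offset % 3))
    return positions


# The EFHD2 row from the output above.
EFHD2 = ("705 NM_024329 chr1 + 15736390 15756839 15736467 15755220 4 "
         "15736390,15752366,15753645,15755088, "
         "15736775,15752514,15753780,15756839, 0 EFHD2 cmpl cmpl 0,2,0,0,")

for cds_offset, codon_index, phase in protein_intron_positions(EFHD2):
    print(cds_offset, codon_index, phase)
```

For EFHD2 this gives introns after 308, 456 and 591 coding bases, i.e. after 102, 152 and 197 complete codons, with phases 2, 0 and 0. Non-coding rows (the NR_ entries, where cdsStart equals cdsEnd) return an empty list. To restrict it to your own list, filter the rows on the second column (the RefSeq ID) before calling the function.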
I might have a solution for this. Can you provide some of your RefSeq IDs to test it on?