I'm trying to retrieve the 3'UTR sequence of a given RefSeq id from the UCSC genome browser.
From its main web interface, this task can be easily accomplished selecting the "refGene" table and "RefSeq" track. However I'm trying to automate the retrieval process, so that the final user will not interact with the UCSC's web interface directly, but rather launch a script to obtain the corresponding 3'UTR sequence of some provided RefSeq identifier.
The UCSC genome browser web interface is therefore off the table.
Looking around I found the possibility to interact with the UCSC via MySQL queries, but since I'm only interested in the 3'UTR sequence, and since I cannot retrieve a genomic sequence of DNA only with a SQL query, I'm relying on the DAS server, as pointed out in this answer.
I therefore obtained an example 3'UTR sequence from the UCSC web interface, and tested if I could retrieve the same sequence through the DAS server:
when retrieved from the UCSC web interface, the RefSeq id "NM_005504" is associated with a 7985 nt long 3'UTR FASTA sequence carrying the header
>hg19_refGene_NM_005504 range=chr12:24962958-24970941 5'pad=0 3'pad=0 strand=- repeatMasking=none
the range "chr12:24962958-24970941" from the previous header can be retrieved from UCSC's hg19 table through the MySQL query
select g.chrom, g.txStart, g.cdsStart from refGene g, knownToRefSeq r where g.name = 'NM_005504' AND r.value = g.name;
I now reuse this information to query the DAS server, and test if I can retrieve the same 3'UTR sequence:
http://genome.ucsc.edu/cgi-bin/das/hg19/dna?segment=chr12:24962957,24970941
and I obtain a 7985 nt long sequence.
The two retrieval methods return indeed a sequence of the same length, but the two do not coincide.
What am I missing?
You should probably consider the fact that DAS server is using +1 index for the first base, as mentioned in the link you posted.
Thanks. You're right, I forgot to mention I tried to shift the 7985 nt long window, but the issue remains: considering the DAS server indexing from 1 instead of 0, I modified the range that I retrieved through the MySQL query from 24962958-24970941 to 24962959-24970942, but the sequences still do not coincide
If I search hg19 build with
NM_005104
at UCSC I am getting a completely different region onchr 6
. But if I useNM_005504
then that refers tochr12
. Looks like there are two separate accession #.Yes, sorry, it was a typo in my explanation: the accession number whose sequence I was trying to retrieve was the NM_005504, not NM_005104. I edited my post to remove the typo. Thank you!