retrieving the 3' UTR sequence from the UCSC genome browser via DAS server
1
0
Entering edit mode
7.3 years ago
bagnacan • 0

I'm trying to retrieve the 3'UTR sequence of a given RefSeq id from the UCSC genome browser.
From its main web interface, this task can be easily accomplished selecting the "refGene" table and "RefSeq" track. However I'm trying to automate the retrieval process, so that the final user will not interact with the UCSC's web interface directly, but rather launch a script to obtain the corresponding 3'UTR sequence of some provided RefSeq identifier.
The UCSC genome browser web interface is therefore off the table.

Looking around I found the possibility to interact with the UCSC via MySQL queries, but since I'm only interested in the 3'UTR sequence, and since I cannot retrieve a genomic sequence of DNA only with a SQL query, I'm relying on the DAS server, as pointed out in this answer.

I therefore obtained an example 3'UTR sequence from the UCSC web interface, and tested if I could retrieve the same sequence through the DAS server:

  • when retrieved from the UCSC web interface, the RefSeq id "NM_005504" is associated with a 7985 nt long 3'UTR FASTA sequence carrying the header

    >hg19_refGene_NM_005504 range=chr12:24962958-24970941 5'pad=0 3'pad=0 strand=- repeatMasking=none

  • the range "chr12:24962958-24970941" from the previous header can be retrieved from UCSC's hg19 table through the MySQL query

    select g.chrom, g.txStart, g.cdsStart from refGene g, knownToRefSeq r where g.name = 'NM_005504' AND r.value = g.name;

  • I now reuse this information to query the DAS server, and test if I can retrieve the same 3'UTR sequence:

    http://genome.ucsc.edu/cgi-bin/das/hg19/dna?segment=chr12:24962957,24970941

    and I obtain a 7985 nt long sequence.

The two retrieval methods return indeed a sequence of the same length, but the two do not coincide.
What am I missing?

ucsc mySQL • 4.3k views
ADD COMMENT
0
Entering edit mode

You should probably consider the fact that DAS server is using +1 index for the first base, as mentioned in the link you posted.

ADD REPLY
0
Entering edit mode

Thanks. You're right, I forgot to mention I tried to shift the 7985 nt long window, but the issue remains: considering the DAS server indexing from 1 instead of 0, I modified the range that I retrieved through the MySQL query from 24962958-24970941 to 24962959-24970942, but the sequences still do not coincide

ADD REPLY
0
Entering edit mode

when retrieved from the UCSC web interface, the RefSeq id "NM_005104" is associated with a 7985 nt long 3'UTR FASTA sequence carrying the header

hg19_refGene_NM_005504 range=chr12:24962958-24970941 5'pad=0 3'pad=0 strand=- repeatMasking=none

If I search hg19 build with NM_005104 at UCSC I am getting a completely different region on chr 6. But if I use NM_005504 then that refers to chr12. Looks like there are two separate accession #.

ADD REPLY
0
Entering edit mode

Yes, sorry, it was a typo in my explanation: the accession number whose sequence I was trying to retrieve was the NM_005504, not NM_005104. I edited my post to remove the typo. Thank you!

ADD REPLY
1
Entering edit mode
7.3 years ago
genecats.ucsc ▴ 580

The issue here is that the transcript you are interested in lies on the negative strand, and so the Table Browser returns a reverse complemented sequence compared to what you are seeing from DAS, which is a simple genomic query. If you reverse complement one of the sequences you will find that they are the same.

Please note that you can use the utility faRc from our directory of utilities to reverse complement the sequence: http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/

ChrisL from the UCSC Genome Browser

ADD COMMENT
0
Entering edit mode

Thank you, this solved the issue

ADD REPLY

Login before adding your answer.

Traffic: 1800 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6