Getting Sequence Information From UCSC Genome Browser
11.5 years ago
anuragm ▴ 130

I downloaded the phastCons scores for the Xenopus tropicalis alignment with other vertebrate species to look for conserved regions, and I now have the positions I am interested in. How do I get the exact nucleotides at these positions?

genome-browser nucleotide • 3.7k views
11.5 years ago

A DAS query can help with automation.

For example, to write the human (hg19) sequence for a region on chromosome chrX at positions 1000000-1000010 to a file called foo.xml:

$ wget -O - "http://genome.ucsc.edu/cgi-bin/das/hg19/dna?segment=chrX:1000000,1000010" > foo.xml

The XML looks like this:


<?xml version="1.0" standalone="no"?>
<!DOCTYPE DASDNA SYSTEM "http://www.biodas.org/dtd/dasdna.dtd">
<DASDNA>
<SEQUENCE id="chrX" start="1000000" stop="1000010" version="1.00">
<DNA length="11">
gaaacagctac
</DNA>
</SEQUENCE>
</DASDNA>

You can parse this on the command line, using an XSLT stylesheet and xsltproc.

First, create a stylesheet that extracts the text of the DASDNA/SEQUENCE/DNA element; for example, foo.xsl:


<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:output method="text" encoding="UTF-8"/>
  <xsl:template match="/">
    <xsl:value-of select="DASDNA/SEQUENCE/DNA"/>
  </xsl:template>
</xsl:stylesheet>

Then run the foo.xml result against this stylesheet:

$ xsltproc foo.xsl foo.xml | awk '($0 ~ /^[acgtnACGTN]/)'
gaaacagctac

You can glue some of this into a pipeline or shell script:

#!/bin/bash -efx

DASURL="http://genome.ucsc.edu/cgi-bin/das"
BUILD="hg19"
CHR="chrX"
START="1000000"
STOP="1000010"

wget -O - "${DASURL}/${BUILD}/dna?segment=${CHR}:${START},${STOP}" \
    | xsltproc foo.xsl - \
    | awk '($0 ~ /^[acgtnACGTN]/)' \
    > foo.txt
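
If you have many positions rather than one, the same request can be generated per region. The following is only a sketch, assuming your regions are in a BED-like file (chromosome, 0-based start, end) and that DAS coordinates are 1-based and inclusive, as the 11-base response for 1000000-1000010 suggests. The function names das_url and bed_to_das_urls and the sample regions are made up for illustration; the block only prints URLs and does not contact the server.

```shell
#!/bin/bash

# Build a DAS "dna" request URL for one region.
# Assumption: coordinates are 1-based and inclusive.
das_url() {
    local build="$1" chr="$2" start="$3" stop="$4"
    printf '%s/%s/dna?segment=%s:%s,%s\n' \
        "http://genome.ucsc.edu/cgi-bin/das" "$build" "$chr" "$start" "$stop"
}

# Read BED-like lines (chrom, 0-based start, end) from stdin and print
# one DAS URL per region. BED starts are 0-based, so add 1 to the start.
bed_to_das_urls() {
    local build="$1" chr start stop rest
    while read -r chr start stop rest; do
        [ -z "$chr" ] && continue
        das_url "$build" "$chr" "$((start + 1))" "$stop"
    done
}

# Example with two hypothetical regions on hg19.
printf 'chrX\t999999\t1000010\nchr1\t100\t200\n' | bed_to_das_urls hg19
```

Each printed URL could then be passed through the same wget, xsltproc and awk steps as in the script above.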
11.5 years ago

The high-level view may be enough to get you going.

  1. Convert your regions of interest into BED format
  2. Upload your BED file to the UCSC genome browser as a custom track
  3. Use the UCSC Table Browser, choose your custom track as the track of interest, then choose output "sequence"
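
For step 1, a rough sketch of the conversion, assuming your positions are in a tab-separated file of chromosome, start and end with 1-based inclusive coordinates. BED uses a 0-based start and an exclusive end, so only the start shifts by one. The function name to_bed and the scaffold name in the example are hypothetical.

```shell
#!/bin/bash

# Convert "chrom<TAB>start<TAB>end" lines (1-based, inclusive) read from
# stdin into BED lines (0-based start, exclusive end).
to_bed() {
    awk 'BEGIN { FS = OFS = "\t" } NF >= 3 { print $1, $2 - 1, $3 }'
}

# Example with one hypothetical region.
printf 'scaffold_1\t1000\t1010\n' | to_bed
```

The resulting lines can be saved to a file and uploaded as the custom track in step 2.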