Question

Getting Sequence Information From Ucsc Genome Browser

1

Entering edit mode

12.2 years ago

anuragm ▴ 130

I had downloaded the Phast cons score for Xenopus tropicalis alignment with other vertebrate species to look for conserved regions. So, now I have the positions that I am interested in. How do I get the exact nucleotides corresponding to these positions now ?

genome-browser nucleotide • 3.9k views

ADD COMMENT • link updated 12.2 years ago by Alex Reynolds 36k • written 12.2 years ago by anuragm ▴ 130

score 3 · Answer 1 · 2013-05-19

A DAS query can help with automation.

For example, to write the human (hg19) sequence for a region on chromosome chrX at positions 1000000-1000010 to a file called foo.xml:

$ wget -O - http://genome.ucsc.edu/cgi-bin/das/hg19/dna?segment=chrX:1000000,1000010 > foo.xml

The XML looks like this:


http://www.biodas.org/dtd/dasdna.dtd">
<DASDNA>
<SEQUENCE id="chrX" start="1000000" stop="1000010" version="1.00">
<DNA length="11">
gaaacagctac
</DNA>
</SEQUENCE>
</DASDNA>

You can parse this on the command line, using an XSLT stylesheet and xsltproc.

First, create the stylesheet that retrieves the value of data in the sequence path; for example, foo.xsl:


<xsl:stylesheet xmlns:xsl="&lt;a href=" <a="" href="http://www.w3.org/1999/XSL/Transform" rel="nofollow">http://www.w3.org/1999/XSL/Transform" "="" rel="nofollow">http://www.w3.org/1999/XSL/Transform' version='1.0'>
  <xsl:output method="text" encoding="UTF-8"/>
  <xsl:template match="/">
    <xsl:value-of select="DASDNA/SEQUENCE/DNA"/>
  </xsl:template>
</xsl:stylesheet>

Then run the foo.xml result against this stylesheet:

$ xsltproc foo.xsl foo.xml | awk '($0 ~ /^[acgtnACGTN]/)'
gaaacagctac

You can glue some of this into a pipeline or shell script:

#!/bin/bash -efx

DASURL="http://genome.ucsc.edu/cgi-bin/das"
BUILD="hg19"
CHR="chrX"
START="1000000"
STOP="1000010"

wget -O - ${DASURL}/${BUILD}/dna?segment=${CHR}:${START},${STOP} \
    | xsltproc foo.xsl - \
    | awk '($0 ~ /^[acgtnACGTN]/)' \
    > foo.txt

score 1 · Answer 2 · 2013-05-19

1

Entering edit mode

12.2 years ago

Sean Davis 27k

The high-level view may be enough to get you going.

Convert your regions of interest into BED format
Upload your BED file to the UCSC genome browser as a custom track
Use the UCSC Table Browser, choose your custom track as the track of interest, then choose output "sequence"

ADD COMMENT • link 12.2 years ago by Sean Davis 27k