A DAS query can help with automation.
For example, to write the human (hg19) sequence for a region on chromosome chrX
at positions 1000000-1000010 to a file called foo.xml
:
$ wget -O - http://genome.ucsc.edu/cgi-bin/das/hg19/dna?segment=chrX:1000000,1000010 > foo.xml
The XML looks like this:
http://www.biodas.org/dtd/dasdna.dtd">
<DASDNA>
<SEQUENCE id="chrX" start="1000000" stop="1000010" version="1.00">
<DNA length="11">
gaaacagctac
</DNA>
</SEQUENCE>
</DASDNA>
You can parse this on the command line, using an XSLT stylesheet and xsltproc
.
First, create the stylesheet that retrieves the value of data in the sequence path; for example, foo.xsl
:
<xsl:stylesheet xmlns:xsl="<a href=" <a="" href="http://www.w3.org/1999/XSL/Transform" rel="nofollow">http://www.w3.org/1999/XSL/Transform" "="" rel="nofollow">http://www.w3.org/1999/XSL/Transform' version='1.0'>
<xsl:output method="text" encoding="UTF-8"/>
<xsl:template match="/">
<xsl:value-of select="DASDNA/SEQUENCE/DNA"/>
</xsl:template>
</xsl:stylesheet>
Then run the foo.xml
result against this stylesheet:
$ xsltproc foo.xsl foo.xml | awk '($0 ~ /^[acgtnACGTN]/)'
gaaacagctac
You can glue some of this into a pipeline or shell script:
#!/bin/bash -efx
DASURL="http://genome.ucsc.edu/cgi-bin/das"
BUILD="hg19"
CHR="chrX"
START="1000000"
STOP="1000010"
wget -O - ${DASURL}/${BUILD}/dna?segment=${CHR}:${START},${STOP} \
| xsltproc foo.xsl - \
| awk '($0 ~ /^[acgtnACGTN]/)' \
> foo.txt