Perl To Retrieve Sequences From Ucsc Genome Browser
5
7
Entering edit mode
13.7 years ago

Dear all, I am going to get DNA sequence by its given chromosome position from the website of UCSC (link text), i.e, when I click the button 'get DNA', there would be a new html page produced, in which is my target DNA sequence; actually, for only one DNA sequence query, I can simply copy the result sequence, but I want to use perl to automate the procedure for many sequences (not more than 20). Firstly, I tried to mine the html codes and use perl HTML PARSER or LWP::useragent to accomplish the job, but I am not good at this, finally, I have to ask for your guys' help. Would you please show me how to do that? Thanks very much!

I finally get the solution for this problem with perl, and the following is my perl code to use DAS of UCSC to fetch DNA sequence by its given chromosome position:

#!/usr/bin/perl

use LWP::Simple;

  #Use DAS of UCSC to fetch specific sequence by its given chromosome position

  $URL_gene ="http://genome.ucsc.edu/cgi-bin/das/hg19/dna?segment=chr2:100000,200000";

  $genefile = get($URL_gene);

  @DNA=grep {
             /^[acgt]*$/i;
                    } split("\n",$genefile);

  print @DNA,"\n";
perl sequence retrieval ucsc • 13k views
ADD COMMENT
0
Entering edit mode

Tip: indent your code with 4 spaces to format it properly.

ADD REPLY
0
Entering edit mode

yould should not (never) parse a XML document like this.

ADD REPLY
0
Entering edit mode

Exactly, you got the query right, now you have to apply the same care for the response handling. btw. check your code for a short sequence of length 10, or sequence starting with a whitespace or containing N it breaks. Instead of relying on random whitespace, use a perl XM parser module e.g. a subclass of XML::Parser.

ADD REPLY
0
Entering edit mode

Yeah, you are right in that case. I will revise it according to all your guys's suggestions. Thanks very much for all!

ADD REPLY
0
Entering edit mode

Can biomart do this job? I tried but found biomart only provides human gene sets. IF biomart has human whole genome, this work can be done using biomaRt package in R.

ADD REPLY
12
Entering edit mode
13.7 years ago
Michael 55k

The simple answer is: don't! Parsing HTML pages to get the result is highly error-prone tedious, and easy to break. It is considered a last resort only in cases where there is no other way to access the content of the resource in a programmatic way.

For UCSC programmatic interfaces would either be using DAS: http://genome.ucsc.edu/FAQ/FAQdownloads.html#download23 or their public mysql server: http://genome.ucsc.edu/FAQ/FAQdownloads#download29


Edit: Added a code example for parsing the XML using XML::XPath with the DAS link you made. This will fetch the sequence more reliably:

#!/usr/bin/env perl

use strict;
use warnings;
use LWP::Simple;
use XML::XPath;
use XML::XPath::XMLParser;

#Use DAS of UCSC to fetch specific sequence by its given chromosome position

my $URL_gene ="http://genome.ucsc.edu/cgi-bin/das/hg19/dna?segment=chr2:100000,100100";

my $xml = get($URL_gene);

my $xp = XML::XPath->new(xml=>$xml);

my $nodeset = $xp->find('/DASDNA/SEQUENCE/DNA/text()'); # find all sequences
# there should be only one node, anyway:    
foreach my $node ($nodeset->get_nodelist) {
   my $seq = $node->getValue;
   $seq =~ s/\s//g; # remove white spaces
   print $seq, "\n";
}
ADD COMMENT
0
Entering edit mode

Thanks, actually, my first thought was the same as you suggested, but I want to make my perl codes simple and integrate it into another part of code without too many software involved in. Anyhow, I will think my plan again.

ADD REPLY
12
Entering edit mode
13.7 years ago

I think you took the wrong turn at some point. If what you need to do is to extract DNA sequences from the human genome based on coordinates, querying the UCSC genome browser and scraping the resulting web pages is not the way to go. You would be much better off downloading their latest assembly of the complete genome, and extract the wanted sequences from that.

ADD COMMENT
0
Entering edit mode

You are right, and if I download the whole genome, which would be a good start of long time mantaining of programming codes.Thanks!

ADD REPLY
0
Entering edit mode

Here is the answer from GenomeBrowser FAQ: http://genome.ucsc.edu/FAQ/FAQdownloads.html#download32

ADD REPLY
6
Entering edit mode
13.7 years ago
Ian 6.1k

I was given a Perl script by the UCSC helpdesk that i adapted to run on multiple coordinates.

Requires CPAN Bio::Das, which I got via perl -MCPAN -e shell'

The input is a file with one or more lines of: chr start end label (label is anything you like)

use strict;
use Bio::Das;

#### USAGE
unless(@ARGV == 2) {
   die("USAGE: $0 | Input BED file | Output FASTA file\n\n");
}

#### Access files
open(INPUT, "<$ARGV[0]") or die("Could not open input!\n");
open(OUTPUT, ">$ARGV[1]") or die("Could not open output!\n");

#### DAS db
my $das = Bio::Das->new(-source => 'http://genome.cse.ucsc.edu/cgi-bin/das/', -dsn=>'hg19');

#### Line counter
my $count = 0;

#### Work thru each line of BED file
while(defined(my $line = <INPUT>)) {
   #### Make sure line begins with chr
   unless($line =~ /^chr/) { next }

   #### Count lines
   $count++;
   print "#$count $line\n";

   #### Split line
   my ($chr, $start, $end, $label) = '';
   my @line_bits = ();

   @line_bits = split(/\t/, $line);

   $chr = $line_bits[0];
   $start = $line_bits[1];
   $end = $line_bits[2];
   $label = $line_bits[3];
   chomp($label);

   #### Define segment of genome
   my $segment = $das->segment("$chr\:$start\,$end");

   #### Get DNA
   my $dna = $segment->dna;

   #### Print sequence to output
   print OUTPUT "\>$chr:$start-$end\_$label\n$dna\n";
}

#### Close files
close(INPUT);
close(OUTPUT);

exit;
ADD COMMENT
4
Entering edit mode
13.7 years ago
Neilfws 49k

As the other answers said, it's usually not good to do things this way and there are often alternatives (downloads, APIs). However, if you must...

...I'd suggest looking at WWW Mechanize. It is a library which automates interaction with websites and can parse whatever is returned.

There is some documentation at the website (cookbook; examples) and because it is used widely, many tutorials such as this one. Just Google search "perl mechanize tutorial".

There are also Mechanize libraries for other languages. I find the Ruby version very intuitive, concise and easy to use.

ADD COMMENT
1
Entering edit mode

Yes, there is certainly no need to scrape a browser in order to retrieve sequences.

ADD REPLY
0
Entering edit mode

If one wanted to actually scrape a web site, I would agree with you. But I think scraping the UCSC genome browser is the entirely wrong angle of attack in this case ;-)

ADD REPLY

Login before adding your answer.

Traffic: 2472 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6