Question

Perl To Retrieve Sequences From Ucsc Genome Browser

7

Entering edit mode

14.4 years ago

Cheng Zhongshan ▴ 400

Dear all, I am going to get DNA sequence by its given chromosome position from the website of UCSC (link text), i.e, when I click the button 'get DNA', there would be a new html page produced, in which is my target DNA sequence; actually, for only one DNA sequence query, I can simply copy the result sequence, but I want to use perl to automate the procedure for many sequences (not more than 20). Firstly, I tried to mine the html codes and use perl HTML PARSER or LWP::useragent to accomplish the job, but I am not good at this, finally, I have to ask for your guys' help. Would you please show me how to do that? Thanks very much!

I finally get the solution for this problem with perl, and the following is my perl code to use DAS of UCSC to fetch DNA sequence by its given chromosome position:

#!/usr/bin/perl

use LWP::Simple;

  #Use DAS of UCSC to fetch specific sequence by its given chromosome position

  $URL_gene ="http://genome.ucsc.edu/cgi-bin/das/hg19/dna?segment=chr2:100000,200000";

  $genefile = get($URL_gene);

  @DNA=grep {
             /^[acgt]*$/i;
                    } split("\n",$genefile);

  print @DNA,"\n";

perl sequence retrieval ucsc • 14k views

ADD COMMENT • link updated 10.9 years ago by Biostar 20 • written 14.4 years ago by Cheng Zhongshan ▴ 400

0

Entering edit mode

Tip: indent your code with 4 spaces to format it properly.

ADD REPLY • link 14.4 years ago by Neilfws 49k

0

Entering edit mode

yould should not (never) parse a XML document like this.

ADD REPLY • link 14.4 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

Exactly, you got the query right, now you have to apply the same care for the response handling. btw. check your code for a short sequence of length 10, or sequence starting with a whitespace or containing N it breaks. Instead of relying on random whitespace, use a perl XM parser module e.g. a subclass of XML::Parser.

ADD REPLY • link 14.4 years ago by Michael 56k

0

Entering edit mode

Yeah, you are right in that case. I will revise it according to all your guys's suggestions. Thanks very much for all!

ADD REPLY • link 14.4 years ago by Cheng Zhongshan ▴ 400

0

Entering edit mode

Can biomart do this job? I tried but found biomart only provides human gene sets. IF biomart has human whole genome, this work can be done using biomaRt package in R.

ADD REPLY • link 14.3 years ago by Dejian ★ 1.3k

Ram · Answer 1 · 2011-03-06

The simple answer is: don't! Parsing HTML pages to get the result is highly error-prone tedious, and easy to break. It is considered a last resort only in cases where there is no other way to access the content of the resource in a programmatic way.

For UCSC programmatic interfaces would either be using DAS: http://genome.ucsc.edu/FAQ/FAQdownloads.html#download23 or their public mysql server: http://genome.ucsc.edu/FAQ/FAQdownloads#download29

Edit: Added a code example for parsing the XML using XML::XPath with the DAS link you made. This will fetch the sequence more reliably:

#!/usr/bin/env perl

use strict;
use warnings;
use LWP::Simple;
use XML::XPath;
use XML::XPath::XMLParser;

#Use DAS of UCSC to fetch specific sequence by its given chromosome position

my $URL_gene ="http://genome.ucsc.edu/cgi-bin/das/hg19/dna?segment=chr2:100000,100100";

my $xml = get($URL_gene);

my $xp = XML::XPath->new(xml=>$xml);

my $nodeset = $xp->find('/DASDNA/SEQUENCE/DNA/text()'); # find all sequences
# there should be only one node, anyway:    
foreach my $node ($nodeset->get_nodelist) {
   my $seq = $node->getValue;
   $seq =~ s/\s//g; # remove white spaces
   print $seq, "\n";
}

score 12 · Answer 2 · 2011-03-06

12

Entering edit mode

14.4 years ago

Lars Juhl Jensen 11k

I think you took the wrong turn at some point. If what you need to do is to extract DNA sequences from the human genome based on coordinates, querying the UCSC genome browser and scraping the resulting web pages is not the way to go. You would be much better off downloading their latest assembly of the complete genome, and extract the wanted sequences from that.

ADD COMMENT • link 14.4 years ago by Lars Juhl Jensen 11k

0

Entering edit mode

You are right, and if I download the whole genome, which would be a good start of long time mantaining of programming codes.Thanks!

ADD REPLY • link 14.4 years ago by Cheng Zhongshan ▴ 400

0

Entering edit mode

Here is the answer from GenomeBrowser FAQ: http://genome.ucsc.edu/FAQ/FAQdownloads.html#download32

ADD REPLY • link 11.9 years ago by ivan.antonov ▴ 100

Ram · Answer 3 · 2011-03-09

I was given a Perl script by the UCSC helpdesk that i adapted to run on multiple coordinates.

Requires CPAN Bio::Das, which I got via perl -MCPAN -e shell'

The input is a file with one or more lines of: chr start end label (label is anything you like)

use strict;
use Bio::Das;

#### USAGE
unless(@ARGV == 2) {
   die("USAGE: $0 | Input BED file | Output FASTA file\n\n");
}

#### Access files
open(INPUT, "<$ARGV[0]") or die("Could not open input!\n");
open(OUTPUT, ">$ARGV[1]") or die("Could not open output!\n");

#### DAS db
my $das = Bio::Das->new(-source => 'http://genome.cse.ucsc.edu/cgi-bin/das/', -dsn=>'hg19');

#### Line counter
my $count = 0;

#### Work thru each line of BED file
while(defined(my $line = <INPUT>)) {
   #### Make sure line begins with chr
   unless($line =~ /^chr/) { next }

   #### Count lines
   $count++;
   print "#$count $line\n";

   #### Split line
   my ($chr, $start, $end, $label) = '';
   my @line_bits = ();

   @line_bits = split(/\t/, $line);

   $chr = $line_bits[0];
   $start = $line_bits[1];
   $end = $line_bits[2];
   $label = $line_bits[3];
   chomp($label);

   #### Define segment of genome
   my $segment = $das->segment("$chr\:$start\,$end");

   #### Get DNA
   my $dna = $segment->dna;

   #### Print sequence to output
   print OUTPUT "\>$chr:$start-$end\_$label\n$dna\n";
}

#### Close files
close(INPUT);
close(OUTPUT);

exit;

score 4 · Answer 4 · 2011-03-06

4

Entering edit mode

14.4 years ago

Neilfws 49k

As the other answers said, it's usually not good to do things this way and there are often alternatives (downloads, APIs). However, if you must...

...I'd suggest looking at WWW Mechanize. It is a library which automates interaction with websites and can parse whatever is returned.

There is some documentation at the website (cookbook; examples) and because it is used widely, many tutorials such as this one. Just Google search "perl mechanize tutorial".

There are also Mechanize libraries for other languages. I find the Ruby version very intuitive, concise and easy to use.

ADD COMMENT • link 14.4 years ago by Neilfws 49k

1

Entering edit mode

Yes, there is certainly no need to scrape a browser in order to retrieve sequences.

ADD REPLY • link 14.4 years ago by Neilfws 49k

0

Entering edit mode

If one wanted to actually scrape a web site, I would agree with you. But I think scraping the UCSC genome browser is the entirely wrong angle of attack in this case ;-)

ADD REPLY • link 14.4 years ago by Lars Juhl Jensen 11k