Dear all, I am trying to get a DNA sequence by its chromosome position from the UCSC website (link text), i.e., when I click the 'get DNA' button, a new HTML page is produced that contains my target DNA sequence. For a single query I can simply copy the resulting sequence, but I want to use Perl to automate the procedure for many sequences (no more than 20). First, I tried to mine the HTML and use the Perl HTML::Parser or LWP::UserAgent modules to accomplish the job, but I am not good at this, so I have to ask for your help. Would you please show me how to do that? Thanks very much!
I finally got a solution to this problem with Perl; the following code uses the DAS server of UCSC to fetch a DNA sequence by its chromosome position:
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;

# Use the DAS server of UCSC to fetch a specific sequence by its chromosome position
my $URL_gene = "http://genome.ucsc.edu/cgi-bin/das/hg19/dna?segment=chr2:100000,200000";
my $genefile = get($URL_gene) or die "Could not fetch $URL_gene\n";

# Keep only the lines of the response that consist purely of bases
my @DNA = grep { /^[acgt]+$/i } split("\n", $genefile);
print @DNA, "\n";
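For reference, the DAS server does not return plain text but a small XML document, roughly like the following (the element names match the XPath expression used in the answer below; the attribute values here are just illustrative):

<DASDNA>
  <SEQUENCE id="chr2" start="100000" stop="200000" version="1.00">
    <DNA length="100001">
      cacggctgtccaaggtgctgc...
    </DNA>
  </SEQUENCE>
</DASDNA>

This is why filtering raw lines with a regex is fragile: the sequence is just character data inside the DNA element, with no guarantee about line layout.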
Exactly, you got the query right; now you have to apply the same care to the response handling. BTW, check your code with a short sequence (length 10), a sequence starting with whitespace, or one containing N: it breaks. Instead of relying on arbitrary whitespace, use a Perl XML parser module, e.g. a subclass of XML::Parser.
Can BioMart do this job? I tried, but found that BioMart only provides human gene sets. If BioMart had the whole human genome, this work could be done using the biomaRt package in R.
The simple answer is: don't! Parsing HTML pages to get the result is highly error-prone and tedious, and the code is easy to break. It should be considered a last resort, only for cases where there is no other way to access the content of the resource programmatically.
Edit: I added a code example that parses the XML with XML::XPath, using the DAS link you posted. This will fetch the sequence more reliably:
#!/usr/bin/env perl
use strict;
use warnings;
use LWP::Simple;
use XML::XPath;
use XML::XPath::XMLParser;
#Use DAS of UCSC to fetch specific sequence by its given chromosome position
my $URL_gene = "http://genome.ucsc.edu/cgi-bin/das/hg19/dna?segment=chr2:100000,100100";
my $xml = get($URL_gene) or die "Could not fetch $URL_gene\n";
my $xp = XML::XPath->new(xml=>$xml);
my $nodeset = $xp->find('/DASDNA/SEQUENCE/DNA/text()'); # find all sequences
# there should be only one node, anyway:
foreach my $node ($nodeset->get_nodelist) {
    my $seq = $node->getValue;
    $seq =~ s/\s//g;    # remove whitespace
    print $seq, "\n";
}
Thanks. Actually, my first thought was the same as what you suggested, but I wanted to keep my Perl code simple and integrate it into another part of my code without involving too many external tools. Anyhow, I will rethink my plan.
I think you took a wrong turn at some point. If what you need to do is extract DNA sequences from the human genome based on coordinates, querying the UCSC genome browser and scraping the resulting web pages is not the way to go. You would be much better off downloading their latest assembly of the complete genome and extracting the desired sequences from that.
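For example, after downloading the assembly as FASTA, BioPerl's Bio::DB::Fasta module can index the file and extract arbitrary subsequences by coordinates. A minimal sketch, assuming the genome has been saved as hg19.fa in the working directory (the file name is just an example):

#!/usr/bin/env perl
use strict;
use warnings;
use Bio::DB::Fasta;    # part of BioPerl

# Index the local genome FASTA; the index is built on first use and reused afterwards
my $db = Bio::DB::Fasta->new('hg19.fa');

# Extract chr2:100000-200000 directly from the local file
my $seq = $db->seq('chr2', 100_000 => 200_000);
print "$seq\n";

Once the index is built, lookups are fast and work offline, which scales far better than one HTTP request per region.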
I was given a Perl script by the UCSC helpdesk that I adapted to run on multiple coordinates.
It requires Bio::Das from CPAN, which I installed via: perl -MCPAN -e shell
The input is a file with one or more lines of: chr start end label (where label is anything you like).
use strict;
use Bio::Das;
#### USAGE
unless(@ARGV == 2) {
die("USAGE: $0 | Input BED file | Output FASTA file\n\n");
}
#### Access files
open(INPUT, "<", $ARGV[0]) or die("Could not open input: $!\n");
open(OUTPUT, ">", $ARGV[1]) or die("Could not open output: $!\n");
#### DAS db
my $das = Bio::Das->new(-source => 'http://genome.cse.ucsc.edu/cgi-bin/das/', -dsn=>'hg19');
#### Line counter
my $count = 0;
#### Work thru each line of BED file
while(defined(my $line = <INPUT>)) {
    #### Make sure line begins with chr
    unless($line =~ /^chr/) { next }
    #### Count lines
    $count++;
    chomp($line);
    print "#$count $line\n";
    #### Split line into its BED fields
    my ($chr, $start, $end, $label) = split(/\t/, $line);
    #### Define segment of genome
    my $segment = $das->segment("$chr:$start,$end");
    #### Get DNA
    my $dna = $segment->dna;
    #### Print sequence to output in FASTA format
    print OUTPUT ">$chr:$start-${end}_$label\n$dna\n";
}
#### Close files
close(INPUT);
close(OUTPUT);
exit;
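For reference, the script would be run like this (the file names are just examples):

perl get_das_sequences.pl regions.bed sequences.fa

where regions.bed contains tab-separated lines in the format described above, e.g.:

chr2	100000	100100	my_region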
As the other answers said, it's usually not good to do things this way and there are often alternatives (downloads, APIs). However, if you must...
...I'd suggest looking at WWW::Mechanize. It is a library that automates interaction with websites and can parse whatever is returned.
There is documentation on the website (a cookbook and examples), and because it is widely used there are many tutorials as well; just search for "perl mechanize tutorial".
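For example, a minimal sketch that fetches the DAS URL from this thread with WWW::Mechanize; to drive the 'get DNA' form on the UCSC page itself you would use methods like submit_form, but the form and button names would need to be checked against the live HTML:

#!/usr/bin/env perl
use strict;
use warnings;
use WWW::Mechanize;

# Create the agent; autocheck => 1 turns HTTP errors into fatal errors
my $mech = WWW::Mechanize->new(autocheck => 1);

# Fetch the page (here the DAS URL used elsewhere in this thread)
$mech->get('http://genome.ucsc.edu/cgi-bin/das/hg19/dna?segment=chr2:100000,100100');
print $mech->content;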
There are also Mechanize libraries for other languages. I find the Ruby version very intuitive, concise and easy to use.
If one wanted to actually scrape a web site, I would agree with you. But I think scraping the UCSC genome browser is the entirely wrong angle of attack in this case ;-)
Tip: indent your code with 4 spaces to format it properly.
You should not (ever) parse an XML document like this.
Yeah, you are right in that case. I will revise it according to all of your suggestions. Thanks very much, everyone!