Here is a somewhat low-tech and convoluted way of doing what you want using Phastcons scores, as i have no idea what the paper you mentioned did.
1) Get PhastCons scores from UCSC:
There are various phastCons score files, depending on the genomes in the pileup. E.g.
http://hgdownload.cse.ucsc.edu/goldenPath/mm9/phastCons30way/
You can get them by ftp:
ftp hgdownload.cse.ucsc.edu
user name: anonymous
password: <your email address>
cd goldenPath/mm9/phastCons30way/vertebrate
mget -a
2) Convert wig to bigWig:
Each chromosome file of scores (wig files) can be converted into bigWig files. 64-bit binaries come from:
http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/
The fetchChromSizes
script is required to get the chromosome length for the 'wigToBigWig' script:
fetchChromSizes mm9 > mm9.chrom.sizes
Run wigToBigWig on each chromosome file, e.g:
wigToBigWig -clip chr1.data mm9.chrom.sizes chr1.bw
3) Calculate mean phastCons score for the coordinates of interest:
Another script called bigWigSummary
can be obtained from UCSC, using the link above. It takes in a set of genome coordinate and will spit out the max or mean of values contained in the bigWig file, in our case phastCons scores.
Below is a Perl script i wrote (below) that retrieves max and mean scores. Paths to files and scripts will need editing.
This is at least a backup if you don't get anything else!
use strict;
if(@ARGV!=2) {
die ("USAGE: $0 | Regions BED file | Output\n");
}
open(BED, "<$ARGV[0]") or die("Could not open input BED file!\n");
open(OUTPUT, ">$ARGV[1]") or die("Could not open output file!\n");
while(defined(my $line = <BED>)) {
chomp($line);
my ($chr, $start, $end) = '';
unless($line =~ /^chr/) { next }
chomp($line);
my @line_bits = split(/\t/, $line);
$chr = $line_bits[0];
$start = $line_bits[1];
$end = $line_bits[2];
system("/home/idonaldson/bin/UCSC/bigWigSummary -type=max /home/idonaldson/genomes/mm9_vertebrate_phastCons/bw/$chr.bw $chr $start $end 1 > $$.max_tmp 2>&1");
system("/home/idonaldson/bin/UCSC/bigWigSummary -type=mean /home/idonaldson/genomes/mm9_vertebrate_phastCons/bw/$chr.bw $chr $start $end 1 > $$.mean_tmp 2>&1");
open(MAX_TEMP, "<$$.max_tmp") or die("Cannot open max temp file!\n");
open(MEAN_TEMP, "<$$.mean_tmp") or die("Cannot open mean temp file!\n");
my $max_temp_output = '';
while(<MAX_TEMP>) {
$max_temp_output .= $_;
chomp($max_temp_output);
}
close(MAX_TEMP);
if($max_temp_output =~ /^no/) {
$max_temp_output = '';
}
my $mean_temp_output = '';
while(<MEAN_TEMP>) {
$mean_temp_output .= $_;
chomp($mean_temp_output);
}
close(MEAN_TEMP);
if($mean_temp_output =~ /^no/) {
$mean_temp_output = '';
}
print OUTPUT "$line_bits[0]\t$line_bits[1]\t$line_bits[2]\t$max_temp_output\t$mean_temp_output\n";
}
system("rm $$.max_tmp");
system("rm $$.mean_tmp");
close(BED);
close(OUTPUT);
exit;
I think this code is very useful. Thanks.
great effort lan ! i used it and found interesting. but do you think mean and max phast score would work if we need to find conservastion of bit longer regions , say 500-3000 bps? chr7:21114483-21117423 (hg19) has phastcons max =1 and mean= 0.437672 when i check it in UCSC browser, its conserved down to fish ! where as chr7:20838708-20841649 (hg19) max=1 and mean =0.459534 while on browser its conserved to mammals only! why is this contradiction here. may be there is some misunderstanding in my concept about phastcons score. can you please clarify me ? i actually want to find conservation depth of many such regions, and based on 46 vertebrate multiple alignments phastcons score , i am trying to identify distant most specie to which an element/region is conserved.