Question

Measuring Conservation For Short Motif

2

Entering edit mode

13.2 years ago

Stephen 2.8k

How can I measure sequence conservation among different species for a very short (5-8 nt) motif in introns?

I know I can visualize conservation in UCSC, but I want something more quantitative. I'm looking at RNA-binding protein recognition motifs that occur in introns flanking the alternative exon. I want to show that this motif is more conserved than a "normal" intron. This paper did it, but didn't really tell how:

We analyzed sequence conservation of pentamers in the mouse intronic regions to identify potential splicing regulatory elements. The mouse introns were aligned to 7 other mammalian genomes that have at least 5x coverage in the UCSC 28-way multigenome alignment. For each pentamer in each region [upstream and downstream of exon], a conservation rate (CR) was calculated as the fraction of aligned and conserved occurrences among total occurrences. The significance of CR of each pentamer is evaluated by comparing with 10 other pentamers with similar expected CR calculated using the first-order Markov model. This procedure essentially controls for possible sequence bias in the dataset. A P value was calculated by using the binomial distribution.

Thanks in advance!

conservation transcription • 7.2k views

ADD COMMENT • link updated 11.5 years ago by jasper1918 ▴ 10 • written 13.2 years ago by Stephen 2.8k

score 12 · Answer 1 · 2012-01-27

Here is a somewhat low-tech and convoluted way of doing what you want using Phastcons scores, as i have no idea what the paper you mentioned did.

1) Get PhastCons scores from UCSC:

There are various phastCons score files, depending on the genomes in the pileup. E.g.

http://hgdownload.cse.ucsc.edu/goldenPath/mm9/phastCons30way/

You can get them by ftp:

ftp hgdownload.cse.ucsc.edu 
user name: anonymous
password: <your email address>
cd goldenPath/mm9/phastCons30way/vertebrate
mget -a

2) Convert wig to bigWig:

Each chromosome file of scores (wig files) can be converted into bigWig files. 64-bit binaries come from:

http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/

The fetchChromSizes script is required to get the chromosome length for the 'wigToBigWig' script:

fetchChromSizes mm9 > mm9.chrom.sizes

Run wigToBigWig on each chromosome file, e.g:

wigToBigWig -clip chr1.data mm9.chrom.sizes chr1.bw

3) Calculate mean phastCons score for the coordinates of interest:

Another script called bigWigSummary can be obtained from UCSC, using the link above. It takes in a set of genome coordinate and will spit out the max or mean of values contained in the bigWig file, in our case phastCons scores.

Below is a Perl script i wrote (below) that retrieves max and mean scores. Paths to files and scripts will need editing.

This is at least a backup if you don't get anything else!

#!/usr/bin/perl -w

use strict;

#####
# Get highest and mean phastCons score for each binding region in BED file
# 10 Sept 2010.  Ian Donaldson.
# 20 Oct 2010 modified to read BED format
#####

# Command line usage

if(@ARGV!=2) {
   die ("USAGE: $0 | Regions BED file | Output\n");
} 

open(BED, "<$ARGV[0]") or die("Could not open input BED file!\n"); 
open(OUTPUT, ">$ARGV[1]") or die("Could not open output file!\n");

# Work thru each line of BED file
while(defined(my $line = <BED>)) {
   chomp($line);

   my ($chr, $start, $end) = '';

   # Skip line if line does not start with 'chr'
   unless($line =~ /^chr/) { next }

   # Remove EOL from each line
   chomp($line);

   # Split each line by TAB
   my @line_bits = split(/\t/, $line);

   $chr = $line_bits[0];
   $start = $line_bits[1];
   $end = $line_bits[2];

   # Run bigWigSummary 
   system("/home/idonaldson/bin/UCSC/bigWigSummary -type=max /home/idonaldson/genomes/mm9_vertebrate_phastCons/bw/$chr.bw $chr $start $end 1 > $$.max_tmp 2>&1");
   system("/home/idonaldson/bin/UCSC/bigWigSummary -type=mean /home/idonaldson/genomes/mm9_vertebrate_phastCons/bw/$chr.bw $chr $start $end 1 > $$.mean_tmp 2>&1");

   # open temp file
   open(MAX_TEMP, "<$$.max_tmp") or die("Cannot open max temp file!\n");
   open(MEAN_TEMP, "<$$.mean_tmp") or die("Cannot open mean temp file!\n");


   # MAX phastCons
   my $max_temp_output = '';

   while(<MAX_TEMP>) {
      $max_temp_output .= $_;
      chomp($max_temp_output);
   }

   # close max temp file
   close(MAX_TEMP);

   # remove blank lines s/^$//g
   if($max_temp_output =~ /^no/) { 
      $max_temp_output = '';
   }


   # MEAN phastCons
   my $mean_temp_output = '';

   while(<MEAN_TEMP>) {
      $mean_temp_output .= $_;
      chomp($mean_temp_output);
   }

   # close max temp file
   close(MEAN_TEMP);

   # remove blank lines s/^$//g
   if($mean_temp_output =~ /^no/) { 
      $mean_temp_output = '';
   }


   # save temp file variable to $ARGV[1]
   print OUTPUT "$line_bits[0]\t$line_bits[1]\t$line_bits[2]\t$max_temp_output\t$mean_temp_output\n";
}

# remove temporary file to final output
system("rm $$.max_tmp");
system("rm $$.mean_tmp");

# close files
close(BED);
close(OUTPUT);

exit;