First, you could obtain PhastCons or phyloP conservation data via UCSC.
I'll show how you can do this with human phyloP data, which extrapolates to other per-base signal types. The general process outlined below will work with most any signal type, where signal falls directly on the base. In the case of signal that falls between bases (such as DNaseI or "cut-count" data) there are per-strand corrections required; these issues should not arise in conservation data scoring.
Intermediate data files are in UCSC BED format: I therefore show how to use high-performance BEDOPS tools to manipulate and map BED files.
I'm also going to use Perl instead of Python to process standard input and output, and R to do calculations. For one, Perl is a more natural choice for I/O-heavy work; Python is just slow. Second, R can do vectorized calculations that make it faster to process large matrices, without requiring installing SciPy or other heavy and fragile dependencies.
Note that PhastCons data can contain gaps, but you can't assume that these gaps are areas of zero or "neutral" conservation. Rather, they are regions simply where no score could be generated. This is, I think, an important distinction in that I do not think it is possible to interpolate gap scores between points where there are data. From prior experience, therefore, I would suggest either using per-base phyloP data or throwing away PhastCons-scored vectors, where such gaps are found.
In any case, as a for instance, here are phyloP 46-way vertebrate conservation data in WIG format: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/phyloP46way/vertebrate/
Wget is a common tool for automating the acquisition of files off the web in bulk:
wget -r -l1 -nd -nc -A.wigFix.gz http://hgdownload.cse.ucsc.edu/goldenPath/hg19/phyloP46way/vertebrate/
Once you have the WIG files, you can convert them into a unioned BED file via wig2bed:
for fn in `ls *.wigFix.gz`; do gunzip -c ${fn} | wig2bed - > ${fn}.bed; done
bedops --everything chr*.bed > vertebrate.phyloP46.bed
Once you have all the per-chromosome signal into one sorted BED file called vertebrate.phyloP46.bed
, you can use BEDOPS bedmap to map conservation score signal to gene annotations.
You'll need gene annotations as well as the conservation signal, though. So you might get gene annotations and convert them to BED via gff2bed:
wget -qO- ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_21/gencode.v21.annotation.gff3.gz \
| gunzip --stdout - \
| gff2bed - \
| grep -F "gene" - \
| cut -f1-6 - \
> genes.bed
Wherever you get your genes from, you should now have genes and a signal file. Both files should be sorted BED files, ordered correctly per sort-bed ordering.
Split the all-genes BED file into per-gene BED files, using cut
, sort
and uniq
to acquire a listing of the names of each gene:
mkdir perGene
for name in `cut -f4 genes.bed | sort | uniq`; do grep -F ${name} genes.bed > perGene/${name}.bed; done
For each gene instance, you need to split each instance apart by individual bases:
#!/usr/bin/env perl
use strict;
use warnings;
my $cnt = 0;
while (<STDIN>) {
chomp;
my ($chr, $start, $stop, $id, $score, $strand) = split("\t", $_);
for (my $meltIdx = $start; $meltIdx < $stop; $meltIdx += 1) {
print STDOUT join("\t", ($chr, $meltIdx, $meltIdx + 1, $id."!$cnt", $score, $strand))."\n";
}
$cnt++;
}
This splits each per-gene BED element into single bases, where each instance of a gene has a unique ID key:
for fn in `ls perGene/*.bed`; do melt.pl < ${fn} > ${fn}.melted.bed; done
The unique ID key is critical, because it is possible that some genes may partially overlap (palindromes, particularly) and we need to distinguish and score two overlapping elements separately.
Once you have per-base-split genes, you can score them quickly via bedmap --faster:
for fn in `ls perGene/*.melted.bed`; do bedmap --faster --echo --indicator --echo-map-score --delim '\t' ${fn} vertebrate.phyloP46.bed > ${fn}.vertebrate.phyloP46.bed; done
The result is a BED file for each gene (with the suffix .vertebrate.phyloP46.bed
) that contains the phyloP score at each position of each "melted" gene instance, where a score exists. In case of no score, there is a blank field; we account for this later by assigning a score of zero, which should be reasonable for phyloP data.
We now "unmelt" or "freeze" the scores back into a BED7+ file containing a vector of scores for the whole gene element:
#!/usr/bin/perl
use strict;
use warnings;
my $signalRef;
my $keyRef;
while (<STDIN>) {
chomp;
my ($chr, $start, $stop, $id, $score, $strand, $indicator, $signal) = split("\t", $_);
if (! defined $signalRef->{$id}) {
push (@{$keyRef}, $id);
my @id_components = split("!", $id);
my $fixed_id = $id_components[0];
$signalRef->{$id}->{chr} = $chr;
$signalRef->{$id}->{start} = $start;
$signalRef->{$id}->{stop} = $stop;
$signalRef->{$id}->{id} = $fixed_id;
$signalRef->{$id}->{score} = $score;
$signalRef->{$id}->{strand} = $strand;
$signalRef->{$id}->{signals} = ();
}
if ($indicator == 0) {
push (@{$signalRef->{$id}->{signals}}, 0);
}
else {
push (@{$signalRef->{$id}->{signals}}, $signal);
}
}
foreach my $key (@{$keyRef}) {
my $chr = $signalRef->{$key}->{chr};
my $start = $signalRef->{$key}->{start};
my $stop = $signalRef->{$key}->{stop};
my $id = $signalRef->{$key}->{id};
my $score = $signalRef->{$key}->{score};
my $strand = $signalRef->{$key}->{strand};
my @signals = @{$signalRef->{$key}->{signals}};
print STDOUT join("\t", ($chr, $start, $stop, $id, $score, $strand))."\t".join("\t", @signals)."\n";
}
To freeze all the melted files:
for fn in `ls perGene/*.melted.bed`; do freeze.pl < ${fn} > ${fn}.frozen.bed; done
Once turned into "frozen" BED files, you can turn them into more convenient "matrix" text files:
#!/usr/bin/perl
use strict;
use warnings;
# Input:
# chrX 155109956 155111957 ENSG00000124333.10 1 + 0 0 0 0 ...
# Output:
# window 0 1 2 3 4 ...
# chrX:155109956:155111957:ENSG00000124333.10:1:+ 0 0 0 0 ...
my $idx = 0;
while (<STDIN>) {
chomp;
my @elems = split("\t", $_);
my ($chr, $start, $stop, $id, $score, $strand) = @elems[0..5];
my @mtx_scores = @elems[6..(scalar @elems - 1)];
if ($idx == 0) {
my $len = scalar @mtx_scores;
print STDOUT "window\t".join("\t", 0..($len-1))."\n";
}
my $key = join(":", @elems[0..5]);
if ($strand eq "+")
{
print STDOUT "$key\t".join("\t", @mtx_scores)."\n";
}
elsif ($strand eq "-")
{
print STDOUT "$key\t".join("\t", reverse @mtx_scores)."\n";
}
$idx++;
}
Reversal of score vectors for reverse-stranded elements helps align the positions of conservation scores, so that calculating columnar means is done on aligned bases.
If you want to see the effect, you can split rows by strand and process them separately. This should demonstrate more clearly why reverse
is used.
A "frozen" per-gene file can be turned into a per-gene matrix ("mtx") easily:
for fn in `ls perGene/*.frozen.bed`; do mtx.pl < ${fn} > ${fn}.mtx; done
Note that the matrix result is no longer a BED-formatted file, but that doesn't matter: The key in the matrix file's first column can be easily reprocessed downstream, if a BED file is once again needed. Mainly, a matrix file is simply easier to work with in R.
A matrix file is easily imported into R and rendered into a vector of per-base averages:
gene_df <- read.table("gene_xyz.mtx", sep="\t", stringsAsFactors=T, skip=1, row.names=1)
gene_df_means <- as.vector(colMeans(gene_df))
print(gene_df_means)
Replace gene_xyz.mtx
with the name of your matrix file. You can do this programmatically with Rscript or similar. If you need to write out per-base averages from within R:
sink("gene_xyz.averages.txt")
cat(paste(gene_df_means, "\n", sep=""))
sink()
Modify filenames, as needed.