Not with R, but you could use BEDOPS tools:
Convert the per-chromosome phyloP WIG files on UCSC's servers to
sorted BED using wig2bed
.
Prepare your genomic regions into a sorted BED file containing
per-base elements.
Run bedmap
on the phyloP data and your regions to map per-base
phyloP scores to your per-base regions.
Retrieving phyloP data and converting it to sorted BED
We can make per-chromosome phyloP BED files and ultimately use these with bedmap or other BEDOPS tools. As an example, we can use the vertebrate hg19 phyloP data located at http://hgdownload-test.cse.ucsc.edu/goldenPath/hg19/phyloP46way/vertebrate/
$ rsync -avz --progress rsync://hgdownload.cse.ucsc.edu/goldenPath/hg19/phyloP46way/vertebrate .
...
$ for fn in `ls vertebrate/*.gz`; \
do gunzip -c $fn \
| wig2bed - \
> ${fn%%.*}.bed; \
done
Preparing your regions-of-interest
Let's assume your regions-of-interest are already in a four-column (or more) BED file called roi.contiguous.unsorted.bed
. To make sure it works with BEDOPS correctly, we sort it and create a file called roi.contiguous.bed
:
$ sort-bed roi.contiguous.unsorted.bed > roi.contiguous.bed
Next, we want to split the ranges into per-base elements. We assume there is an ID field in the fourth column. We also assume that the ID for a given region-of-interest is uniquely named. When splitting our regions, we can use this ID as a unique prefix, in order to specifically and uniquely identify with which range the per-base element associates:
$ awk ' \
{ \
regionChromosome = $1; \
regionStart = $2; \
regionStop = $3; \
regionID = $4; \
baseIdx = 0; \
for (baseStart = regionStart; baseStart < regionStop; baseStart++) { \
baseStop = baseStart + 1; \
print regionChromosome"\t"baseStart"\t"baseStop"\t"regionID"-"baseIdx; \
baseIdx++; \
} \
}' roi.contiguous.bed > roi.perBase.bed
Run head
or less
on the roi.perBase.bed
file so that you see what this script does. When we map phyloP scores, we will be able to map them base-by-base.
Note: If your BED file does not have an ID field, or region IDs are not unique, it is trivial to use awk
to add or replace the ID field with a unique, per-row identifier.
Performing bedmap
operations
Finally, we're ready to do the map operation and see which elements associate:
$ for chrLabel in `seq 1 22` X Y; \
do \
mapFn=/path/to/phyloP/chr${chrLabel}.bed; \
bedmap --chrom chr${chrLabel} --echo --echo-map-score roi.perBase.bed $mapFn \
> perBase.chr${chrLabel}.answer.bed; \
done
The file perBase.chr${chrLabel}.answer.bed
(where chr${chrLabel}
is replaced with the chromosome name from the per-chromosome phyloP files) contains per-base-split regions-of-interest along with the phyloP scores that associate with each base.
To explain the bedmap
statement, the --echo
operator prints out the region-of-interest (a per-base element from your original regions-of-interest) and --echo-map-score
prints out the phyloP score that associates with that base, for each chromosome.
In the case where no phyloP score maps to a base, a NAN
is printed.
You may or may not want to use sed
or other tools to replace NAN
with a zero, particularly if the presence of a zero is a biologically meaningful score, while the absence of a score is also biologically meaningful, but in a different way.
Some more work to do
A couple more things you could do that might make this easier, if you have to automate this process:
Concatenate the per-chromosome phyloP BED files into one master BED
file.
Merge the split elements in the perBase.chr${chrLabel}.answer.bed
result back into contiguous ranges, using the ID prefix to group
them. In a column adjacent to the newly re-merged contiguous range,
print all the scores into a row vector that is delimited with some
useful character.
Doing step 1 would remove the need to use a for
loop on the bedmap
statement. You then just have one bedmap
operation against the master phyloP file, instead of looping through each per-chromosome phyloP file.
Step 2 would let you more easily process contiguous regions. For example, you might be working with a window around a set of transcription start sites and you want to know what the mean conservation is across each window.
Hope this gets you started!
Thanks for sharing details and also for phastcons in a separate post. One correction: I think there should be
--chrom $chr
option inbedmap --echo --echo-map-score roi.perBase.bed $fn...
else it is echoing all chromosomes from roi.perBase.bed. That is strange given$fn
only has one chromosome and mapping seems to be based on start-stop columns and ignoring chromosome column.You're right that
--chrom $chr
is needed. The fileroi.perBase.bed
is the "reference file" and mapping will be done against that. So even if no overlaps are found in the map file$fn
(or nowchr${chrLabel}
, to track chromosome names), elements will be reported from all chromosomes of the reference file (unless--skip-unmapped
is used, but that is not useful here). Good catch.Maybe you can give some suggestions on a good strategy of assigning the unique per row identifier. Not sure what is most appropriate to this method just given a simple bed file. My main interest is to get the scores from the merged elements which would represent a single score per interval within the bed. My bed file looks as below. Thanks!
In Python:
Then:
The file
out.bed
will be sorted and contain per-base positions within each element.The ID field is effectively unique by using the form
A-B
, whereA
is the element or row index, andB
is the position or base index within the element. I use zero-padding to make the result easier to troubleshoot/read.If all the elements in
in.bed
are of the same length, this makes mapping and later aggregation very fast, as you can split the ID field into element index and base position index. The base index can directly reference the contents within an array at a specific position.This just may be my lack on Python understanding but in the initial python script where is the bed file input, that file requiring the unique identifiers? Thank you!
Maybe I misunderstand. I'm assuming you have a BED file that you want to map and aggregate on a per-base level with phyloP conservation scores (or whatever per-base signal, but phyloP for the purposes of this question). This BED file is made up of regions of interest, like TFBS or promoters, etc. What are you trying to do?
That is exactly right, at the end Id like to have a conservation score based on phyloP for each interval in my bed file to sort for downstream analysis. I was just asking about the python script you listed and how I can input this initial bed file into it so as to add unique identifiers to it before proceeding with your initial example in this post. Thanks!
If you need to strip the header, you could do:
Then you'd use
out.bed
for scoring.Thats perfect thank you!
Quick question, tried doing mapping as you suggested with the following code:
And got this error, not sure why its adding answer.bed to the phyloP files in this instance? Thanks!
Can you print a snippet of the results from this?
I suspect you may need to use the second variable (the "basename") in some form, instead of the raw
$fn
. I'd troubleshoot there.ok its really strange but I redid the import of mm10 phyloP as follows:
This gave a directory
mm10.60way.phyloP60way
which contained all of these the necessary .gz files, following this I ran:Which I thought would give me individual .bed files for use with bedmap however it concatenated all of them it seems into one file named mm10.bed where the only chromosome listed is chrY for some odd reason, can you see why this would be? I guess I can wig2bed all of them individually but its just odd.
Here's what I see:
I think you may want a slight tweak:
In other words, use
${fn##*/}
and not${fn%%.*}
. Maybe prefix${fn##*/}
with another directory.Bash variables like these are a huge pain in the ass. So I'd suggest getting into the habit of using
echo
to debug and sanity-check what your source and destination paths look like, before running thefor
loop with real data.This worked perfectly, thanks for tip in terms of checking the bash variables and adjusting them. I sometimes neglect to think of these path issues. In terms of concatenating the per-chromosome phyloP BED files would you suggest just using bedtools merge or something different? Also I was hoping you could give a suggestion on a good way to merge the
perBase.$fn.answer.bed
back into continuous ranges, since at the end I would like just a regular interval bed file where every interval has some conservation score that I can thereafter sort by. Thanks again!Let's say you have a per-base BED file that looks like this, where score or signal data are in the fifth column:
You can get contiguous ranges from this via:
You could then use
bedmap
to get the mean score or signal over the merged range:The
bedops
tool does set operations. Thebedmap
tool does map operations. Here, we map signal to genomic intervals, and we do some specified operation on that signal.If you want something other than a mean, there are other signal operations: https://bedops.readthedocs.io/en/latest/content/reference/statistics/bedmap.html#score-operations
You could get the median conservation score over the merged range, for instance.
You can also specify multiple signal operations:
This gives you the mean and standard deviation of signal over merged regions.
Using the
--delim '\t'
operator puts these results into separate tab-delimited columns, for example, to make it easy to pipe tosort
to get a sorted list:Note that I change the result's extension from
bed
totxt
, to serve as a reminder that the output is no longersort-bed
-sorted, and so I would need to do sorting before use with set operation tools.That works great thanks!
Still having a little trouble establishing the right variable for mapping my data with unique IDs to the phyloP per chromosome files, my data looks as follows:
I run the following commands, with similarly changed variables before using bedmap:
I get this output which appears to be missing the scores entirely, not sure why the
--echo-map-score
flag does not add scores and only the pipe:Any idea as to why? Thanks again!
It is likely because those positions have no scores associated with them.
If you do something like this:
and if you get no answer back, that means there is no element in
perBase.chr1.answer.bed
that has a score associated with it.If it helps with processing results, you can add the
--skip-unmapped
operator to thebedmap
command and it will leave out elements in the output where there are no mappings between reference and map elements, i.e.:So two questions, for phyloP to have a score at a particular base on lets say chromosome 1, what criteria needs to be met in terms of level of conservation? Looking this up as well, just surprised that so many of these have nothing associated though when I just scan these in UCSC for many I see at least some conservation - though maybe thats just a bias on my end.
Second question, if I invoke the
--skip-unmapped
flag during mapping wouldn't this preclude me from thereafter collapsing into contiguous intervals?Thanks for your help!
I'd take a very careful look at the distribution of your observed phyloP scores before imputing zeros in gaps. If there's a gap where conservation could not be determined over the species alignment, then putting in a zero could say this position is neutral when it simply isn't known. That could be problematic.
That caveat given, here is an ugly way you could use
bedmap
to deal with that case:If you're doing column aggregation for elements of equal length, this should give you a score of some kind at every per-base position over your set of elements.
An unreleased version of
bedmap
includes the option to report a zero score where there are unmapped reference elements, but it may be a little longer before the next version of BEDOPS comes out.If you are more comfortable leaving out zeros, you might simply aggregate over the non-NaNs. Say you have two vectors
(2, 10, NAN)
and(3, 2, 1)
. You would take the average of the first two elements, and leave the third element as it is. Not great, but if you have a bunch of such vectors, you could think about calculating a per-base or per-position standard deviation or variance specific to then
at that position, and return this along with the mean or other per-base statistic.In the case of missing data, you might impute a value of zero for neutral conservation or selection. But this really depends on the data, and I don't remember exactly how this works for phyloP. To do imputation you'd need an extra step. I'm on my phone so I'd have to wait until I'm in front of a proper keyboard to describe the specifics.