Question

How to get conservation information for bed file using 100wayPhastCons

1

Entering edit mode

7.7 years ago

chrys ▴ 80

Hi,

I have several bed files for which I would like to get conservation information. I have downloaded the 100wayPhastCons.bw file from UCSC and my plan is to now get the conservation scores for my bed intervals from this file.

Is there a way to achieve this ?

Thanks,

genome bed conservation sequence • 3.3k views

ADD COMMENT • link updated 7.7 years ago by Alex Reynolds 36k • written 7.7 years ago by chrys ▴ 80

1

Entering edit mode

Not sure if one of these help (you may have to modify solutions to fit your needs): Working With Phastcons Files
Obtaining phastCons conservation score for every gene in the Human genome
Batch conservation analysis

ADD REPLY • link 7.7 years ago by GenoMax 147k

0

Entering edit mode

Thanks, those are the ones I am going by right now. I was just hoping that somebody knew a tutorial or a site where a pipeline is described to get this kind of conservation information.

Quick Edit: I also was hoping to get conservation score per base. Most ways described are only returning a summary.

ADD REPLY • link 7.7 years ago by chrys ▴ 80

2

Entering edit mode

7.7 years ago

genecats.ucsc ▴ 580

Hi chrys,

You will probably be interested in the following page, which details how to extract wiggle data: http://genomewiki.ucsc.edu/index.php/Using_hgWiggle_without_a_database

If you have any follow-up questions, it would be helpful if you could post them to our Google Groups forum: https://groups.google.com/a/soe.ucsc.edu/forum/#!forum/genome, that way our whole team can see the question and help with an answer.

Thanks,

ChrisL from the UCSC Genome Browser

ADD COMMENT • link 7.7 years ago by genecats.ucsc ▴ 580

score 4 · Accepted Answer · 2017-03-13

Here's a pipeline you can use:

Convert the conservation bigWig (.bw) to Wiggle (.wig) with bigWigToWig, using /dev/fd/1 as a filename for standard output. This should allow you to pipe the result to BEDOPS wig2bed and get a sorted five-column BED file:
```
$ bigWigToWig conservation.bw /dev/fd/1 | wig2bed - > conservation.bed5
```
Split your intervals up into per-base elements with bedops --chop 1.
```
$ bedops --chop 1 intervals.bed > perbase.bed3
```
Run bedmap to map your per-base elements to the BED-formatted conservation signal. Normally, unmapped elements report no value. You can use the following command to replace unmapped bases with a score of 0:
```
$ bedmap --count --echo --echo-map-score --delim '\t' perbase.bed3 conservation.bed5 | awk '{ if ($1=="0") { print $0"0.0"; } else { print $0; } }' | cut -f2- > answer.bed
```
This will only make sense if a missing conservation score should be zero. You could alternatively replace 0 with NaN or a similar IEEE-ready placeholder for a non-number.

Finally, if you want to put everything into a one-liner, use file substitutions:

$ bedmap --count --echo --echo-map-score --delim '\t' <(bedops --chop 1 intervals.bed) <(bigWigToWig conservation.bw /dev/fd/1 | wig2bed - ) | awk '{ if ($1=="0") { print $0"0.0"; } else { print $0; } }' | cut -f2- > answer.bed

If you're going to do repeated queries with different sets of intervals, it probably makes sense to do the bigWig-to-BED5 conversion step only once, and drop that in as conservation.bed5 where you need it in mapping.