I have a bed file and wanted to calculated average PhastCon score over the regions, is there a tool which could do so? I tried using cmotifs but it submits jobs and at times does not produce timely results.
I have a bed file and wanted to calculated average PhastCon score over the regions, is there a tool which could do so? I tried using cmotifs but it submits jobs and at times does not produce timely results.
You could use bedmap --mean here, to map PhastCon signal onto your sorted regions.
First, grab PhastCon data from UCSC and convert WIG to a compressed form of BED called Starch, using a tool called wig2starch:
$ rootPath="http://hgdownload.cse.ucsc.edu/goldenpath/hg19/phastCons100way/hg19.100way.phastCons"
$ for i in `seq 1 22` X Y; \
do \
echo "getting PhastCon data for chromosome chr${i}..."; \
wigFn="chr${i}.phastCons100way.wigFix"; \
url="${rootPath}/${wigFn}.gz"; \
wget -qO- ${url} | gunzip -c - | wig2starch - > ${wigFn}.starch; \
done
We could have used wig2bed
instead of wig2starch
, but note that the final set of PhastCons files could take between 10-20 GB compressed, and uncompressed BED files could require about 10x more disk space. It takes a little longer to make Starch files, but it will take less disk space. I'd recommend some kind of compression, if you plan to do repeated map calculations.
Second, map PhastCon signal files to your sorted regions file by chromosome:
$ for i in `seq 1 22` X Y; \
do \
echo "mapping signal for chromosome chr${i}..."; \
phastConFn="chr${i}.phastCons100way.wigFix.starch"; \
bedmap --echo --mean --chrom chr${i} regions.bed ${phastConFn} > regions_with_avg_phastcons.${i}.bed; \
done
Third (optionally), union the per-chromosome results to a single file, using bedops --everything:
$ bedops --everything regions_with_avg_phastcons.*.bed > regions_with_avg_phastcons.bed
Doing the bedmap step with --chrom
allows parallelization with grid job schedulers or GNU Parallel, splitting the tasks by chromosome. This could reduce calculation time down to that required for the largest chromosome of regions.
One thing I would note is that the average value calculated is over scores for mapped elements. If you have gaps of scores along your mapped region, those gaps would not contribute to the calculation of the mean signal.
I do not think you can interpolate PhastCon conservation signal between a gap, but it has been a while since I have worked with this particular dataset - its authors could probably confirm or negate this.
It might generally be useful to use bedmap with the --skip-unmapped option to filter out results for unmapped regions.
If the files is small and this is a one time thing, you can extract it from ucsc table browser easily.
Go to ucsc, upload the track, go to table browser, specify the phastcons track you want, click on intersection and select your file, there it is.
If you will be doing this frequently, I would suggest downloading the wig file and writing a script for this.
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
@AlexI
I am getting the following error:
For all the chromosomes.
Errors of the type "stat() failed" usually mean that the file does not exist or is otherwise not accessible. I'd double-check your input files are available and accessible.
Hi Alex the file is present there and I also checked the input format.
I'm not really sure: there's not much I can troubleshoot as described. Double-check to be sure that the specific inputs exist and are accessible, when the loop fails (chr18?). Make sure you have sufficient disk space and file permissions. Perhaps break down the loop to individual chromosomes, until you find the problem file related to chr18.
Hi Alex to troubleshoot it I just took a single chromosome file, and executed the following command:
May use bedmap --help for more help.
And so to see if bedmap is working at all on the file I did :
Let me try some other combinations
Thanks, a very useful code though.
Is
regions.bed
sorted per sort-bed? I can't immediately think of any other issues, but that's not to say this isn't a bug. If it helps, if you can postregions.bed
somewhere I can download it (Dropbox, Google Drive, etc.), I can offer to try to replicate the error on my end, as the Phastcon data are available from UCSC.Sure that would be great, where to can I send a link to download the file?
See my profile page for email address.
Or email: reynolda - at - uw.edu
I did some spot checks against the regions you sent by email, and I could not reproduce your issue.
For example:
I suggest again double-checking your files, permissions, and so forth. From the error message you reported earlier, it looks like something is wrong specifically at chromosome
chr18
, so I suggest that you maybe start your investigations there. I'm sorry you're having problems, and I hope you find a solution!Thanks for you help Alex.