It is a bit outdated; however, you could use the following BEDOPS usage script as a starting point. That example uses awk
to make windows, but that isn't necessary any longer.
Some recent chop and stagger features in the bedops tool make it easy to generate windows for signal mapping. You could start with hg19 chromosome extents; we'll call this file hg19.bed. It is sorted per sort-bed and ready to use.
Chop these extents into 20-base windows, staggering window starts by 1 base (or whatever step size you want to use to slide the window):
$ bedops --chop 20 --stagger 1 -x hg19.bed > hg19.20nt_bin_1nt_slide.bed
The -x (exclude) option used here leaves off any trailing windows shorter than 20 bases at the end of each chromosome.
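If you want to see what these windows look like without running bedops, here is a rough awk emulation of the chop/stagger/exclude behavior on a toy extents file. This is my sketch of the documented semantics, not bedops output itself, so treat coordinates from the real tool as authoritative:

```shell
# Toy "extents" file: one small chromosome (3-column BED).
printf 'chrT\t0\t25\n' > toy.bed

# Emulate `bedops --chop 20 --stagger 1 -x toy.bed`:
# slide a 20-base window forward 1 base at a time, and drop
# trailing windows shorter than 20 bases (the -x behavior).
awk -v OFS='\t' -v size=20 -v step=1 '{
  for (s = $2; s + size <= $3; s += step)
    print $1, s, s + size
}' toy.bed
```

On this toy input, the loop emits six windows, from chrT 0 20 through chrT 5 25; a window starting at 6 would run past position 25 and is excluded.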
Feed the sliding bins or windows to bedmap, to measure (methylation) signal over each window. You can pick any of the statistical operations available in bedmap. For example, you could use --mean to get the mean methylation signal over the sliding window, or --sum to get the density of methylation signal over the window, and so on. Run bedmap --help or read the documentation for more information about available score/signal operations.
As an example, the following gives the density of methylation signal over each sliding window:
$ bedmap --echo --sum --delim '\t' hg19.20nt_bin_1nt_slide.bed methylation.bed > answer.bed
The file methylation.bed should be at least a five-column sorted BED file, with the score or signal in the fifth column. The file answer.bed will contain each sliding window, plus a final column containing the sum of signal values of any methylation regions that overlap that window. You can read this file into any number of tools (R, the UCSC Genome Browser, etc.) to plot smoothed signal.
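To make the mapping semantics concrete, here is a naive awk sketch of what the --sum mapping computes: for each window, sum the column-5 scores of overlapping methylation records. This is an O(n·m) toy illustration, not a bedmap replacement (bedmap is far faster and, as I recall, prints NAN by default when nothing overlaps, which this sketch mimics):

```shell
# Toy windows and a toy 5-column methylation file.
printf 'chrT\t0\t20\nchrT\t10\t30\n' > win.bed
printf 'chrT\t5\t6\tcpg1\t0.8\nchrT\t25\t26\tcpg2\t0.4\n' > meth.bed

# Half-open BED intervals overlap when a.start < b.end && b.start < a.end.
# First pass loads the signal file; second pass scans each window.
awk -v OFS='\t' '
  NR == FNR { chr[NR] = $1; st[NR] = $2; en[NR] = $3; sc[NR] = $5; n = NR; next }
  {
    sum = 0; hit = 0
    for (i = 1; i <= n; i++)
      if ($1 == chr[i] && $2 < en[i] && st[i] < $3) { sum += sc[i]; hit = 1 }
    print $0, (hit ? sum : "NAN")
  }
' meth.bed win.bed
```

Here the first window picks up only cpg1 (sum 0.8) and the second only cpg2 (sum 0.4), which is the kind of per-window summary the bedmap step writes into the final column of answer.bed.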
Because 20nt is a very fine window size, this will generate very large intermediate and final files. A couple of suggestions for dealing with this are to merge steps into a UNIX pipeline and to write output to a compressed form of BED called Starch.
The following suggestion uses a pipe character (|) to pass the sliding windows to the bedmap mapping step. The hyphen (-) is a placeholder for standard output from the previous bedops process:
$ bedops --chop 20 --stagger 1 -x hg19.bed | bedmap --echo --sum --delim '\t' - methylation.bed > answer.bed
This approach does the same thing as the windowing and mapping steps above, and runs faster than running them separately. It also uses much less disk space, because it does not create the intermediate file hg19.20nt_bin_1nt_slide.bed, a file you probably don't need to keep around anyway, so why waste time making it?
A second suggestion is to compress the output with a BEDOPS tool called starch. This application removes redundancy in the coordinates and applies a second compression step, so BED files get much smaller:
$ bedops --chop 20 --stagger 1 -x hg19.bed | bedmap --echo --sum --delim '\t' - methylation.bed | starch - > answer.starch
You can use unstarch answer.starch to extract data, and you can specify a chromosome of interest as well: unstarch chrY answer.starch. The unstarch tool passes uncompressed BED data to standard output, so you can pipe it out to downstream processes. Also, other BEDOPS tools natively work with Starch archives as if they were BED files, so you can do set and statistical operations on them directly, without piping:
$ bedmap --echo --echo-map-id-uniq foo.starch bar.bed > baz.bed
Making a compressed archive will add to the overall processing time but reduce disk usage even further. You might start with the initial mapping to an uncompressed BED file, and then explore Starch when disk usage becomes a concern.
Other options include limiting your signal operations to regions of interest. You might leave RepeatMasked regions out of the windows, for instance. You could use bedops --difference to put "holes" in the hg19.bed extents file, and then do your windowing and mapping from that product. Your calculations will go faster because you don't waste time on regions that may not be biologically interesting or relevant to your experiment.
Hope this helps!
Perhaps I am mistaken here, or it only applies to signal processing, but for me digitization is taking an analog signal (which has variable intensity) and encoding it as a digital signal, where the intensity is always either 0 or 1 - min or max.
In signal processing, even if your frequency is not kept constant you can still have a digital signal. Bringing that back to bioinformatics, that would imply it doesn't matter if you used fixed-step binning, variable-step binning, or no binning at all - digitization would only refer to the signal intensity. Really, binarization is probably a better word, since digitization implies encoding (which could be lossless) which is obviously not what you want.
It's a good question Ace, but I think we need to know a little more about what your wider goal is before suggesting anything that may do the opposite of what you really want. Is it Chromatin State related by any chance?
Hi,
it is true that the terms are misleading.
What I meant is that I need to convert the existing 1 bp resolution methylation data I have downloaded into 20 bp resolution data. The downloaded methylation file (1 bp resolution) stores, for each cytosine, a methylation value between 0.0 and 1.0 (if it is methylated at all). I wanted to use BEDOPS to calculate binned methylation values across the genome, but I still can't figure out how, so I hope someone can help me.
Ah ok, no worries - so you just want to bin the data using a 20bp window. The only question left, really, is how you want your methylation values to be averaged. Mean methylation? Max methylation? Average methylation, only looking at bases that could actually be methylated? Average over all Cs, not just CpGs, etc.?
Fortunately 20bp is still pretty tiny, but if you went up to, say, 200bp, and you have just 1 CpG that's methylated at 100%, the other 199bp get included in that and set to 100% methylated. Alternatively, you average over all 200bp and now it doesn't matter what that 1 CpG was, it could never result in a methylation level above 0.5%.
I don't know what the correct/standard way to bin methylation data is, but I'm sure someone more knowledgeable will help soon :) All I know is that it's kind of problematic, since "no data" is not the same as "no methylation", and CpG sites are very non-randomly distributed.
I'm not too sure about that myself, to be honest. Some sources say it should be the mean of all values within a bin, and some say it should be the number of methylated CpGs divided by the number of all CpGs within the bin. Apparently the "one and only correct way" does not exist.
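For what it's worth, the two conventions can give noticeably different numbers even on tiny examples. Here is a toy awk comparison, assuming the values are methylation fractions and (arbitrarily, my choice for illustration) that a CpG counts as "methylated" above 0.5:

```shell
# Toy bin: three CpGs with methylation fractions 1.0, 0.1, 0.0.
printf '1.0\n0.1\n0.0\n' | awk '
  { sum += $1; if ($1 > 0.5) meth++; n++ }
  END {
    printf "mean of values: %.3f\n", sum / n          # convention 1
    printf "methylated/total CpG: %.3f\n", meth / n   # convention 2
  }'
```

Here convention 1 gives 0.367 and convention 2 gives 0.333, and the gap grows with how far the individual fractions sit from 0 and 1 and with the threshold you pick.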