It is a bit outdated; however, you could use the following BEDOPS usage script as a starting point. That example uses awk
to make windows, but that isn't necessary any longer.
Some recent chop and stagger features in the bedops tool make it easy to generate windows for signal mapping. You could start with hg19 chromosome extents; we'll call this file hg19.bed. It is sorted per sort-bed and ready to use.
Chop these extents into 20-base windows, staggering window starts by 1 base (or whatever step size you want to use to slide the window):
$ bedops --chop 20 --stagger 1 -x hg19.bed > hg19.20nt_bin_1nt_slide.bed
The -x (exclude) option used here leaves off any trailing windows shorter than 20 bases at the end of each chromosome.
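If you want to see what these windows look like without running bedops, here is a rough awk emulation of the chop/stagger/exclude behavior on a toy extents file. This is my sketch of the documented semantics, not bedops output itself, so treat coordinates from the real tool as authoritative:

```shell
# Toy "extents" file: one small chromosome (3-column BED).
printf 'chrT\t0\t25\n' > toy.bed

# Emulate `bedops --chop 20 --stagger 1 -x toy.bed`:
# slide a 20-base window forward 1 base at a time, and drop
# trailing windows shorter than 20 bases (the -x behavior).
awk -v OFS='\t' -v size=20 -v step=1 '{
  for (s = $2; s + size <= $3; s += step)
    print $1, s, s + size
}' toy.bed
```

On this toy input, the loop emits six windows, from chrT 0 20 through chrT 5 25; a window starting at 6 would run past position 25 and is excluded.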
Feed the sliding bins or windows to bedmap, to measure (methylation) signal over each window. You can pick any of the statistical operations available in bedmap. For example, you could use --mean to get the mean methylation signal over the sliding window, or --sum to get the density of methylation signal over the window, and so on. Run bedmap --help or read the documentation for more information about available score/signal operations.
As an example, the following gives the density of methylation signal over each sliding window:
$ bedmap --echo --sum --delim '\t' hg19.20nt_bin_1nt_slide.bed methylation.bed > answer.bed
The file methylation.bed should be at least a five-column sorted BED file, with the score or signal in the fifth column. The file answer.bed will contain each sliding window, plus a final column containing the sum of signal values of any methylation regions that overlap that window. You can read this file into any number of tools (R, the UCSC Genome Browser, etc.) to plot smoothed signal.
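To make the mapping semantics concrete, here is a naive awk sketch of what the --sum mapping computes: for each window, sum the column-5 scores of overlapping methylation records. This is an O(n·m) toy illustration, not a bedmap replacement (bedmap is far faster and, as I recall, prints NAN by default when nothing overlaps, which this sketch mimics):

```shell
# Toy windows and a toy 5-column methylation file.
printf 'chrT\t0\t20\nchrT\t10\t30\n' > win.bed
printf 'chrT\t5\t6\tcpg1\t0.8\nchrT\t25\t26\tcpg2\t0.4\n' > meth.bed

# Half-open BED intervals overlap when a.start < b.end && b.start < a.end.
# First pass loads the signal file; second pass scans each window.
awk -v OFS='\t' '
  NR == FNR { chr[NR] = $1; st[NR] = $2; en[NR] = $3; sc[NR] = $5; n = NR; next }
  {
    sum = 0; hit = 0
    for (i = 1; i <= n; i++)
      if ($1 == chr[i] && $2 < en[i] && st[i] < $3) { sum += sc[i]; hit = 1 }
    print $0, (hit ? sum : "NAN")
  }
' meth.bed win.bed
```

Here the first window picks up only cpg1 (sum 0.8) and the second only cpg2 (sum 0.4), which is the kind of per-window summary the bedmap step writes into the final column of answer.bed.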
Because 20nt is a very fine window size, this will generate very large intermediate and final files. A couple of suggestions for dealing with this are to merge steps into a UNIX pipeline and to write output to a compressed form of BED called Starch.
The following suggestion uses a pipe character (|) to pass the sliding windows to the bedmap mapping step. The hyphen (-) is a placeholder for standard output from the previous bedops process:
$ bedops --chop 20 --stagger 1 -x hg19.bed | bedmap --echo --sum --delim '\t' - methylation.bed > answer.bed
This approach does the same thing as the windowing and mapping steps above, and runs faster than running them separately. It also uses much less disk space, because it does not create the intermediate file hg19.20nt_bin_1nt_slide.bed, a file you probably don't need to keep around anyway, so why waste time making it?
A second suggestion is to compress the output with a BEDOPS tool called starch. This application removes redundancy in the coordinates and applies a second compression step, so BED files get much smaller:
$ bedops --chop 20 --stagger 1 -x hg19.bed | bedmap --echo --sum --delim '\t' - methylation.bed | starch - > answer.starch
You can use unstarch answer.starch to extract data, and you can specify a chromosome of interest as well: unstarch chrY answer.starch. The unstarch tool passes uncompressed BED data to standard output, so you can pipe it out to downstream processes. Also, other BEDOPS tools natively work with Starch archives as if they were BED files, so you can do set and statistical operations on them directly, without piping:
$ bedmap --echo --echo-map-id-uniq foo.starch bar.bed > baz.bed
Making a compressed archive will add to the overall processing time but reduce disk usage even further. You might start with the initial mapping to an uncompressed BED file, and then explore Starch when disk usage becomes a concern.
Other options include limiting your signal operations to regions of interest. You might leave RepeatMasked regions out of the windows, for instance. You could use bedops --difference to put "holes" in the hg19.bed extents file, and then do your windowing and mapping from that product. Your calculations will go faster because you don't waste time on regions that may not be biologically interesting or relevant to your experiment.
Hope this helps!
Perhaps I am mistaken here, or it only applies to signal processing, but for me digitization is taking an analog signal (which has variable intensity) and encoding it as a digital signal, where the intensity is always either 0 or 1 - min or max.
In signal processing, even if your frequency is not kept constant you can still have a digital signal. Bringing that back to bioinformatics, that would imply it doesn't matter if you used fixed-step binning, variable-step binning, or no binning at all - digitization would only refer to the signal intensity. Really, binarization is probably a better word, since digitization implies encoding (which could be lossless) which is obviously not what you want.
It's a good question Ace, but I think we need to know a little more about what your wider goal is before suggesting anything that may do the opposite of what you really want. Is it Chromatin State related by any chance?
Hi,
it is true that the terms are misleading.
What I meant is that I need to convert the existing 1 bp resolution methylation data I have downloaded into 20 bp resolution data. The downloaded methylation file (1 bp resolution) stores, for each cytosine, a methylation value between 0.0 and 1.0 (if it is methylated at all). I wanted to use BEDOPS to calculate binned methylation values across the genome, but I still can't figure out how, so I hope someone can help me.
Ah ok, no worries - so you just want to bin the data using a 20bp window. The only question left, really, is how you want your methylation values to be averaged. Mean methylation? Max methylation? Average methylation, only looking at bases that could actually be methylated? Average over all Cs, not just CpGs, etc.?
Fortunately 20bp is still pretty tiny, but if you went up to, say, 200bp, and you have just 1 CpG that's methylated at 100%, the other 199bp get included in that and set to 100% methylated. Alternatively, you average over all 200bp and now it doesn't matter what that 1 CpG was, it could never result in a methylation level above 0.5%.
I don't know what the correct/standard way to bin methylation data is, but I'm sure someone more knowledgeable will help soon :) All I know is that it's kind of problematic, since "no data" is not the same as "no methylation", and CpG sites are very non-randomly distributed.
I'm not too sure about that myself, to be honest. Some sources say it should be the mean of all values within a bin, and some say it should be the number of methylated CpGs divided by the number of all CpGs within the bin. Apparently the "one and only correct way" does not exist.
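For what it's worth, the two conventions can give noticeably different numbers even on tiny examples. Here is a toy awk comparison, assuming the values are methylation fractions and (arbitrarily, my choice for illustration) that a CpG counts as "methylated" above 0.5:

```shell
# Toy bin: three CpGs with methylation fractions 1.0, 0.1, 0.0.
printf '1.0\n0.1\n0.0\n' | awk '
  { sum += $1; if ($1 > 0.5) meth++; n++ }
  END {
    printf "mean of values: %.3f\n", sum / n          # convention 1
    printf "methylated/total CpG: %.3f\n", meth / n   # convention 2
  }'
```

Here convention 1 gives 0.367 and convention 2 gives 0.333, and the gap grows with how far the individual fractions sit from 0 and 1 and with the threshold you pick.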