Question

False output when intersecting methylation data

0

Entering edit mode

7.9 years ago

theodore ▴ 110

Dear all, I came across the following situation: I have three WGBS methylation data from the epigenomics roadmap:

E016_WGBS_FractionalMethylation_allCHROM.bedgraph E013_WGBS_FractionalMethylation_allCHROM.bedgraph E004_WGBS_FractionalMethylation_allCHROM.bedgraph

The values of methylation states on specific sites are:

grep "chrY 5926917" E016_WGBS_FractionalMethylation_allCHROM.bedgraph
    chrY    59269171    59269175    0.98 grep "chrY 5926917"
E004_WGBS_FractionalMethylation_allCHROM.bedgraph
    chrY    59269171    59269173    0.93 chrY   59269173    59269175    1 grep "chrY
5926917"  E013_WGBS_FractionalMethylation_allCHROM.bedgraph
    chrY    59269171    59269173    0.91 chrY   59269173    59269175    0.9

Notice that the site E016 state has a uniformed methylation state so the nucleotides at chrY 59269171 59269173 and chrY 59269173 59269175 are merged into one site which is not true for the rest of the sites. So if I perform multiIntersect:

grep 5926917 E016_E004_E013_WGBS_FractionalMethylation_allCHROM2_multiIntersect.bedgraph
    chrY    59269171    59269175    3   1,2,3   1   1   1
    chrY    59269175    59269209    0   none    0   0   0

All sites are merged into one If I do subsequent intersects

grep 59269173  E016_E004_E013_WGBS_FractionalMethylation_allCHROM2.bedgraph
    chrY    59269171    59269173    0.98    chrY    59269171    59269175    0.93    chrY    59269171    59269173    0.91    2   2

chrY 59269171 59269173 0.98 chrY 59269171 59269175 1 chrY 59269173 59269175 0.9 2 2 chrY 59269173 59269175 0.98 chrY 59269171 59269175 0.93 chrY 59269171 59269173 0.91 2 2 chrY 59269173 59269175 0.98 chrY 59269171 59269175 1 chrY 59269173 59269175 0.9 2 2

The multiintersect does not include the appropriate methylation states (not its function) The subsequent IntersectBed keep the full file but somehow assign the values wrongly, what I would have expect would be:

chrY    59269171    59269173    0.98    chrY    59269171    59269175    0.93    chrY    59269171    59269173    0.91    2   2
chrY    59269173    59269175    0.98    chrY    59269171    59269175    1   chrY    59269173    59269175    0.9 2   2

Is there a way to get the correct intersect output? Thank you.

intersect multiIntersect WGBS • 3.1k views

ADD COMMENT • link updated 7.9 years ago by Alex Reynolds 36k • written 7.9 years ago by theodore ▴ 110

score 1 · Answer 1 · 2017-09-25

1

Entering edit mode

7.9 years ago

Alex Reynolds 36k

There is a patch in BEDOPS 2.4.29 that fixes this issue, so that we get items in correct order.

Given three BED5 files, A.bed, B.bed, and C.bed (your epigenome samples, say, preformatted with awk with unique, lexicographically-ordered IDs in the ID field column):

$ more A.bed
chr1    10468   10470   setA    0.97
chr1    10470   10472   setA    1
chr1    10483   10485   setA    0.95

$ more B.bed
chr1    10468   10470   setB    0.79
chr1    10470   10472   setB    0.93
chr1    10483   10485   setB    0.88

$ more C.bed
chr1    10468   10470   setC    0.91
chr1    10470   10472   setC    0.97
chr1    10483   10485   setC    0.87

We can make a partition and a union of these three sets:

$ bedops --partition A.bed B.bed C.bed > partition.bed
$ bedops --everything A.bed B.bed C.bed > union.bed

We can then map the scores from union.bed to the elements in partition.bed:

$ bedmap --echo --echo-map-score --delim '\t' --multidelim '\t' --prec 2 partition.bed union.bed
chr1    10468   10470   0.97    0.79    0.91
chr1    10470   10472   1.00    0.93    0.97    
chr1    10483   10485   0.95    0.88    0.87

However, I don't know when we will put out this revision; it may be a week or two, perhaps longer, as there some other issues I need to resolve with this release that will take time to fix.

You can, if you like, check out the current development version of BEDOPS via:

$ git clone https://github.com/bedops/bedops.git
$ cd bedops
$ git checkout v2p4p29
$ make
$ make install

Then:

$ ./bin/bedops --partition ...
$ ./bin/bedmap --echo --echo-map-score ...

etc.

I'm sorry I don't have a production version ready for you to use, but if this is critical work for you, then the above should hopefully get you going.

ADD COMMENT • link 7.9 years ago by Alex Reynolds 36k

0

Entering edit mode

I am currently installing the git version. Will let you know if everything works as promised ;) Thank you for your time.

ADD REPLY • link 7.9 years ago by theodore ▴ 110

0

Entering edit mode

Just a quick note that you have bedgraph files, and you'd need to convert them to BED files with a unique ID that puts the files in desired order (setA == E016, setB == E004, etc.).

ADD REPLY • link 7.9 years ago by Alex Reynolds 36k

0

Entering edit mode

Hey Alex, Now I found that I get some small problems like: The script does not calculate for empty genomic locations as 0 or NA data and when it merges the fifht column it swifts the values so that they will fill the space.

head -n 20 /media/trotos/4TB/hESC/Encode_chip_files/methylation/whatTOdo
chr1    10468   10470   0.97    0.79    0.91
chr1    10470   10472   1.00    0.93    0.97
chr1    10483   10485   0.95    0.88    0.87
chr1    10488   10490   0.95    0.91    0.94
chr1    10492   10494   0.94    0.91    0.97
chr1    10496   10498   0.91    0.90    0.96
chr1    10524   10526   1.00    0.95    0.99
chr1    10541   10543   0.92    0.98    0.95
chr1    10562   10564   0.96    0.96    0.93
chr1    10570   10572   0.98    0.94    0.91
chr1    10576   10578   0.90    0.91    0.89
chr1    10578   10580   0.92    1.00    0.85
chr1    10588   10590   0.85    0.90    0.88
chr1    10608   10610   0.77    1.00    0.76
chr1    10616   10618   0.67    0.70
chr1    10619   10621   0.55    0.50
chr1    10630   10632   0.25    0.75
chr1    10632   10634   0.10    0.09
chr1    10635   10637   0.38    0.21
chr1    10637   10639   0.21    0.43

And to make things worse:

grep "chr1   2956412" /media/trotos/4TB/hESC/Encode_chip_files/methylation/final
chr1    29564121    29564123    0.00    0.00    0.02
chr1    29564124    29564126    0.00    0.00
chr1    29564127    29564129    0.00    0.00

And the original files

grep "chr1   2956412" /media/trotos/4TB/hESC/Encode_chip_files/methylation/E016_WGBS_FractionalMethylation_allCHROM.bedgraph
chr1    29564121    29564123    0
chr1    29564124    29564126    0
chr1    29564127    29564129    0

grep "chr1   2956412" /media/trotos/4TB/hESC/Encode_chip_files/methylation/E004_WGBS_FractionalMethylation_allCHROM.bedgraph

grep "chr1   2956412" /media/trotos/4TB/hESC/Encode_chip_files/methylation/E013_WGBS_FractionalMethylation_allCHROM.bedgraph
    chr1    29564121    29564123    0.02
    chr1    29564124    29564126    0
    chr1    29564127    29564129    0

ADD REPLY • link 7.9 years ago by theodore ▴ 110

0

Entering edit mode

How can I compensate for the empty areas?

ADD REPLY • link 7.9 years ago by theodore ▴ 110

0

Entering edit mode

Please see answer below, for how to fill in gaps.

ADD REPLY • link 7.9 years ago by Alex Reynolds 36k

score 1 · Answer 2 · 2017-09-25

If you have "holes" in your sets, you need to patch them.

Say we have modified our initial three sets:

$ more A.bed
chr1    10468   10470   setA    0.97
chr1    10470   10472   setA    1
chr1    10483   10485   setA    0.95

$ more B.bed
chr1    10468   10470   setB    0.79
chr1    10483   10485   setB    0.88

$ more C.bed
chr1    10468   10470   setC    0.91
chr1    10470   10472   setC    0.97
chr1    10483   10485   setC    0.87

Note here that B.bed is missing one of the three elements.

As you note, if we bedmap the partition against the union, we'll get gaps.

Make the partition set as before:

$ bedops --partition A.bed B.bed C.bed > partition.bed

For each input set, bedmap that set against the partition, using the --indicator function to mark where there are and are not overlaps. Use this value with awk to post-process output:

$ bedmap --indicator --echo --echo-map --delim '\t' partition.bed A.bed | awk -v label='setA' -v OFS='\t' '{ if ($1 == 1) { print $5, $6, $7, $8, $9 } else { print $2, $3, $4, label, "NAN" } }' > A.patched.bed

$ bedmap --indicator --echo --echo-map --delim '\t' partition.bed B.bed | awk -v label='setB' -v OFS='\t' '{ if ($1 == 1) { print $5, $6, $7, $8, $9 } else { print $2, $3, $4, label, "NAN" } }' > B.patched.bed

$ bedmap --indicator --echo --echo-map --delim '\t' partition.bed C.bed | awk -v label='setC' -v OFS='\t' '{ if ($1 == 1) { print $5, $6, $7, $8, $9 } else { print $2, $3, $4, label, "NAN" } }' > C.patched.bed

(This is easy to script for N arbitrary sets. I'm writing everything out here explicitly so you see how this works.)

You could use 0.0 instead of NAN, depending on your experiment and what these values mean in that context. The NAN value is a reserved IEEE 754 floating point value, so it can be used here when it is meaningful to use that value as a marker.

Incidentally, this is why bedmap returns nothing when there are no overlaps, as 0.0 and NAN are values that have very specific and different meanings, and it is not really possible for bedmap to guess correctly which value is appropriate to use when the mapping result is a proper empty set.

In any case, here is what our patched sets look like:

$ more A.patched.bed
chr1    10468   10470   setA    0.97
chr1    10470   10472   setA    1
chr1    10483   10485   setA    0.95

$ more B.patched.bed
chr1    10468   10470   setB    0.79
chr1    10470   10472   setB    NAN
chr1    10483   10485   setB    0.88

$ more C.patched.bed
chr1    10468   10470   setC    0.91
chr1    10470   10472   setC    0.97
chr1    10483   10485   setC    0.87

From here out, it's the same as before. Take the union of the patched sets:

$ bedops --everything *.patched.bed > union.bed

Then map as before:

$ bedmap --echo --echo-map-score --delim '\t' --multidelim '\t' --prec 2 partition.bed union.bed
chr1    10468   10470   0.97    0.79    0.91
chr1    10470   10472   1.00    nan     0.97
chr1    10483   10485   0.95    0.88    0.87

(Note: This requires bedmap v2.4.29 or greater.)

score 0 · Answer 3 · 2017-09-23

0

Entering edit mode

7.9 years ago

Alex Reynolds 36k

You could use BEDOPS bedops --partition to partition the genomic space and bedmap to map the unioned set against the partitioned space.

Abstractly:

$ bedops --partition A.bed B.bed C.bed > partitionedABC.bed
$ bedops --everything A.bed B.bed C.bed > unionABC.bed
$ bedmap --echo --echo-map --delim '\t' partitionedABC.bed unionABC.bed > answer.bed

You could use awk to re-columnize answer.bed to whatever special format you're aiming at.

Or, as a one-liner, without intermediate files to slow things down:

$ bedmap --echo --echo-map --delim '\t' <(bedops --partition A.bed B.bed C.bed) <(bedops --everything A.bed B.bed C.bed) > answer.bed

ADD COMMENT • link 7.9 years ago by Alex Reynolds 36k

0

Entering edit mode

Thank you Alex for the answer. Since I am not familiar with the bedops suit could you please elaborate on the need for performing 3 steps and latter an additional awk? I checked briefly the functions and it is not intuitive hat this would be the best way to get the results.

ADD REPLY • link 7.9 years ago by theodore ▴ 110

0

Entering edit mode

Your question seems to suggest you want to partition the input space and then map all your elements to the partitioned space. It might help to read the documentation for bedops and particularly the --partition operator. Then read the portion of the bedmap documentation that relates to --echo-map-* operators.

ADD REPLY • link 7.9 years ago by Alex Reynolds 36k

0

Entering edit mode

Ok, I followed your advice and although the binning is correct I found that the data is somehow scrambled. so, the first lines of the original files are: (NOTE THAT I FOLLOW THE SAME PATTERN FOR THE FILE'S ORDER E016-E004-E013)

trotos@StupidMint:~$ head /media/trotos/4TB/hESC/Encode_chip_files/methylation/E016_WGBS_FractionalMethylation_allCHROM.bedgraph
chr1    10468   10470   0.97
chr1    10470   10472   1
chr1    10483   10485   0.95
chr1    10488   10490   0.95
chr1    10492   10494   0.94
chr1    10496   10498   0.91
chr1    10524   10526   1
chr1    10541   10543   0.92
chr1    10562   10564   0.96
chr1    10570   10572   0.98
---  19:13:12
trotos@StupidMint:~$ head /media/trotos/4TB/hESC/Encode_chip_files/methylation/E004_WGBS_FractionalMethylation_allCHROM.bedgraph
chr1    10468   10470   0.79
chr1    10470   10472   0.93
chr1    10483   10485   0.88
chr1    10488   10490   0.91
chr1    10492   10494   0.91
chr1    10496   10498   0.9
chr1    10524   10526   0.95
chr1    10541   10543   0.98
chr1    10562   10564   0.96
chr1    10570   10572   0.94
---  19:13:26
trotos@StupidMint:~$ head /media/trotos/4TB/hESC/Encode_chip_files/methylation/E013_WGBS_FractionalMethylation_allCHROM.bedgraph
chr1    10468   10470   0.91
chr1    10470   10472   0.97
chr1    10483   10485   0.87
chr1    10488   10490   0.94
chr1    10492   10494   0.97
chr1    10496   10498   0.96
chr1    10524   10526   0.99
chr1    10541   10543   0.95
chr1    10562   10564   0.93
chr1    10570   10572   0.91

and after your pipeline:

./bedops --partition E016_WGBS_FractionalMethylation_allCHROM.bedgraph E004_WGBS_FractionalMethylation_allCHROM.bedgraph E013_WGBS_FractionalMethylation_allCHROM.bedgraph > /partitionedBEDOPS
./bedops --everything E016_WGBS_FractionalMethylation_allCHROM.bedgraph E004_WGBS_FractionalMethylation_allCHROM.bedgraph E013_WGBS_FractionalMethylation_allCHROM.bedgraph > everything
./bedmap --echo --echo-map --delim '\t' partitionedBEDOPS everything > whatTOdo

with some sed-awk to clear the output:

sed 's/\;/\t/g' /media/trotos/4TB/hESC/Encode_chip_files/methylation/whatTOdo | awk '{print$1"\t"$2"\t"$3"\t"$7"\t"$11"\t"$15}' | head
chr1    10468   10470   0.79    0.91    0.97
chr1    10470   10472   1   0.93    0.97
chr1    10483   10485   0.87    0.88    0.95
chr1    10488   10490   0.95    0.91    0.94
chr1    10492   10494   0.91    0.94    0.97
chr1    10496   10498   0.96    0.9 0.91
chr1    10524   10526   0.95    0.99    1
chr1    10541   10543   0.98    0.92    0.95
chr1    10562   10564   0.93    0.96    0.96
chr1    10570   10572   0.98    0.91    0.94

As you can see the values are completely messed up. The first line should have been: chr1 10468 10470 0.97 0.79 0.91 the second line does not even follow the order of the first line.

ADD REPLY • link 7.9 years ago by theodore ▴ 110

0

Entering edit mode

I think there's a way to do this that is not pretty, but prepends the ID field in a way that provides results to resort lexicographically, so that map results can be printed in a consistent order. I'll check to see if there is a smarter way to do this and, either way, I'll probably follow up with an answer tomorrow or Tuesday.

ADD REPLY • link 7.9 years ago by Alex Reynolds 36k

score 0 · Answer 4 · 2017-09-24

0

Entering edit mode

7.9 years ago

theodore ▴ 110

I will wait patiently. Thank you for your efforts.

ADD COMMENT • link 7.9 years ago by theodore ▴ 110