Question

Merge several bigWig files (with different intervals) into single GRanges object

2

Entering edit mode

4.4 years ago

Roman Hillje ▴ 100

Hi everybody,

I've been banging my head against the wall for a few hours and finally decided to ask for some help here. What I need to do is relatively straight-forward:

I have a number of bigWig files, which were generated by deepTools and belong to two experiments groups, that I would like to plot using Gviz (Bioconductor).

In order to properly use the grouping feature of Gviz, it seems that I need a single GRanges object for each group with a column for each replicate. Since the intervals in the different samples/replicates are different, they can't simply be merged, or at least I haven't figured out how to do it.

I tried to make a very simple example with two GRanges objects:

df_1 <- data.frame(
  chr = 'chr1',
  start = c(1,200),
  end = c(100,300),
  strand = '*',
  score = c(10,5)
)

gr_1 <- makeGRangesFromDataFrame(df_1, keep.extra.columns = TRUE)
# GRanges object with 2 ranges and 1 metadata column:
#      seqnames    ranges strand |     score
#         <Rle> <IRanges>  <Rle> | <numeric>
#  [1]     chr1     1-100      * |        10
#  [2]     chr1   200-300      * |         5
#  -------
#  seqinfo: 1 sequence from an unspecified genome; no seqlengths

df_2 <- data.frame(
  chr = 'chr1',
  start = c(50,150),
  end = c(150,250),
  strand = '*',
  score = c(2,4)
)

gr_2 <- makeGRangesFromDataFrame(df_2, keep.extra.columns = TRUE)
# GRanges object with 2 ranges and 1 metadata column:
#       seqnames    ranges strand |     score
#          <Rle> <IRanges>  <Rle> | <numeric>
#   [1]     chr1    50-150      * |         2
#   [2]     chr1   150-250      * |         4
#   -------
#   seqinfo: 1 sequence from an unspecified genome; no seqlengths

The output I would like to have is this:

chr    start end score_1 score_2
chr1       1  50      10       0
chr1      50 100      10       2
chr1     100 150       0       2
chr1     150 200       0       4
chr1     200 250       5       4
chr1     250 300       5       0

In words, it is the minimum number of rows that still correctly reflects the "score" for both samples.

Is there a way to do this?

I would appreciate any kind of help or suggestions.

Best, Roman

ChIP-Seq R Gviz GenomicFeatures rtracklayer • 2.5k views

ADD COMMENT • link updated 3.9 years ago by Biostar 20 • written 4.4 years ago by Roman Hillje ▴ 100

score 1 · Answer 1 · 2020-07-09

1

Entering edit mode

4.4 years ago

Alex Reynolds 36k

You could do something efficiently via bedmap on the command-line:

$ bedmap --echo --echo-map-score --delim '\t' <(bedops --partition A.bed B.bed) <(bedops --everything A.bed B.bed) > answer.bed

Here, the reference set (from which intervals are derived) is the disjoint partition of sets A and B. The map set (from which scores are derived) is the multiset union of sets A and B.

To get sorted BED files (to make sets A, B, etc.), you could convert bigWig to Wiggle, and then convert Wiggle to BED:

$ bigWigToWig in.bw out.wig
$ wig2bed < out.wig > out.bed

ADD COMMENT • link 4.4 years ago by Alex Reynolds 36k

0

Entering edit mode

Thanks for the suggestion, I'll look into that in a bit more detail tomorrow. Perhaps I should mention, however, that I'll be working on at least 16 profiles (in groups of 4), and ideally 108 profiles in total. It's no problem at all to plot them as separate tracks (actually I'm impressed by how fast it's done), but I can't figure out how to group them together. I'll should probably also reach out to the Gviz developer(s) directly to see if they have any suggestion.

ADD REPLY • link 4.4 years ago by Roman Hillje ▴ 100

score 1 · Answer 2 · 2020-07-09

You can make a Granges object with the tile width that you want, then by using GenomicRanges::findOverlaps() find the score value for each one.

First you need to define your sequence info for genomic ranges object:

seqi = Seqinfo("chr1",1000)
# Seqinfo object with 1 sequence from an unspecified genome:
#   seqnames seqlengths isCircular genome
#   chr1           1000         NA   <NA>

More practically when using genome data, you can get your sequence info this way:

## for hg19
seqi_19 = SeqinfoForUCSCGenome("hg19")
## for hg38
seqi_38 = SeqinfoForUCSCGenome("hg38")
# Seqinfo object with 595 sequences from hg38 genome:
#   seqnames         seqlengths isCircular genome
#   chr1              248956422       <NA>   hg38
#   chr2              242193529       <NA>   hg38
#   chr3              198295559       <NA>   hg38
#   chr4              190214555       <NA>   hg38
#   chr5              181538259       <NA>   hg38
#   ...                     ...        ...    ...
#   chrUn_KI270753v1      62944       <NA>   hg38
#   chrUn_KI270754v1      40191       <NA>   hg38
#   chrUn_KI270755v1      36723       <NA>   hg38
#   chrUn_KI270756v1      79590       <NA>   hg38
#   chrUn_KI270757v1      71251       <NA>   hg38

After that, you just need to make a new Granges object:

gr_comb = tileGenome(seqi,tilewidth = 50000,cut.last.tile.in.chrom = TRUE)
# GRanges object with 20 ranges and 0 metadata columns:
#        seqnames    ranges strand
#           <Rle> <IRanges>  <Rle>
#    [1]     chr1      1-50      *
#    [2]     chr1    51-100      *
#    [3]     chr1   101-150      *
#    [4]     chr1   151-200      *
#    [5]     chr1   201-250      *
#    ...      ...       ...    ...
#   [16]     chr1   751-800      *
#   [17]     chr1   801-850      *
#   [18]     chr1   851-900      *
#   [19]     chr1   901-950      *
#   [20]     chr1  951-1000      *
#   -------
#   seqinfo: 1 sequence from an unspecified genome

And then just find the overlap (Which tells you which regions from first Granges overlap with which regions from the second Granges):

overlap.1 = findOverlaps(gr_comb,gr_1)
# Hits object with 5 hits and 0 metadata columns:
#       queryHits subjectHits
#       <integer>   <integer>
#   [1]         1           1
#   [2]         2           1
#   [3]         4           2
#   [4]         5           2
#   [5]         6           2
#   -------
#   queryLength: 20 / subjectLength: 2
overlap.2 = findOverlaps(gr_comb,gr_2)
# Hits object with 6 hits and 0 metadata columns:
#       queryHits subjectHits
#       <integer>   <integer>
#   [1]         1           1
#   [2]         2           1
#   [3]         3           1
#   [4]         3           2
#   [5]         4           2
#   [6]         5           2
#   -------
#   queryLength: 20 / subjectLength: 2

Finally, you just need to fill out the scores in the new Granges object:

gr_comb$score_1 = NA_real_ ## or you can use 0 
gr_comb$score_2 = NA_real_ ## or you can use 0 
gr_comb[queryHits(overlap.1)]$score_1 = gr_1[subjectHits(overlap.1)]$score
gr_comb[queryHits(overlap.2)]$score_2 = gr_2[subjectHits(overlap.2)]$score
gr_comb
# GRanges object with 20 ranges and 2 metadata columns:
#        seqnames    ranges strand |   score_1   score_2
#           <Rle> <IRanges>  <Rle> | <numeric> <numeric>
#    [1]     chr1      1-50      * |        10         2
#    [2]     chr1    51-100      * |        10         2
#    [3]     chr1   101-150      * |      <NA>         4
#    [4]     chr1   151-200      * |         5         4
#    [5]     chr1   201-250      * |         5         4
#    ...      ...       ...    ... .       ...       ...
#   [16]     chr1   751-800      * |      <NA>      <NA>
#   [17]     chr1   801-850      * |      <NA>      <NA>
#   [18]     chr1   851-900      * |      <NA>      <NA>
#   [19]     chr1   901-950      * |      <NA>      <NA>
#   [20]     chr1  951-1000      * |      <NA>      <NA>
#   -------
#   seqinfo: 1 sequence from an unspecified genome

NOTE: In your example for the desired results, you have defined two values for regions like chr1:50, chr1:100, ... This method gives you results for every 50bp, which I guess is what you meant.