Question

How to merge (or perform left joins) on GRange objects?

0

Entering edit mode

8.2 years ago

kyle.s.shank • 0

Hello all,

I'm having a bit of trouble with bioconductor. I have several different GRange objects (peak files from MACS, specifically) that I'd like to merge by performing a left join. Examples files look like this:

    > head(c8.grange)
  GRanges object with 6 ranges and 0 metadata columns:
                                 seqnames               ranges strand
                                    <Rle>            <IRanges>  <Rle>
  Grouped_C8_NoExt_WithD_cond1_1        1 [ 2215700,  2216098]      *
  Grouped_C8_NoExt_WithD_cond1_2        1 [ 2309602,  2309838]      *
  Grouped_C8_NoExt_WithD_cond1_3        1 [ 9958346,  9958709]      *
  Grouped_C8_NoExt_WithD_cond1_4        1 [18401454, 18401797]      *
  Grouped_C8_NoExt_WithD_cond1_5        1 [22603688, 22603920]      *
  Grouped_C8_NoExt_WithD_cond1_6        1 [23866394, 23866612]      *
  -------
  seqinfo: 29 sequences from an unspecified genome; no seqlengths

> head(v0.grange)
GRanges object with 6 ranges and 2 metadata columns:
                        seqnames         ranges strand |     score      fold
                           <Rle>      <IRanges>  <Rle> | <integer> <numeric>
  V0_Peaks_NoExt_peak_1        1 [11160, 11408]      * |        91   4.31469
  V0_Peaks_NoExt_peak_2        1 [13468, 13668]      * |        64   3.71956
  V0_Peaks_NoExt_peak_3        1 [16855, 17058]      * |        42   3.12443
  V0_Peaks_NoExt_peak_4        1 [27463, 27918]      * |       983  16.81243
  V0_Peaks_NoExt_peak_5        1 [28820, 29068]      * |        58   3.57078
  V0_Peaks_NoExt_peak_6        1 [36493, 37032]      * |       546  11.45626
  -------
  seqinfo: 457 sequences from an unspecified genome; no seqlengths

I'd like to:

Find where the peaks overlap, and subsequently:
Merge based on that overlap.

So far:

> c8.v0<-findOverlapsOfPeaks(c8.grange,v0.grange,minoverlap = 200)
> overlaps<-c8.v0$peaklist[["c8.grange///v0.grange"]]
> overlaps
GRanges object with 475 ranges and 1 metadata column:
          seqnames               ranges strand |                                                                        peakNames
             <Rle>            <IRanges>  <Rle> |                                                                  <CharacterList>
    [1]          1 [ 2215669,  2216098]      * |      v0.grange__V0_Peaks_NoExt_peak_98,c8.grange__Grouped_C8_NoExt_WithD_cond1_1
    [2]          1 [18401238, 18402379]      * |    v0.grange__V0_Peaks_NoExt_peak_698a,c8.grange__Grouped_C8_NoExt_WithD_cond1_4
    [3]          1 [22603511, 22603993]      * |     v0.grange__V0_Peaks_NoExt_peak_784,c8.grange__Grouped_C8_NoExt_WithD_cond1_5
    [4]          1 [23866184, 23866789]      * |     v0.grange__V0_Peaks_NoExt_peak_816,c8.grange__Grouped_C8_NoExt_WithD_cond1_6
    [5]          1 [25970340, 25971358]      * |    v0.grange__V0_Peaks_NoExt_peak_887a,c8.grange__Grouped_C8_NoExt_WithD_cond1_7
    ...        ...                  ...    ... .                                                                              ...
  [471] KN150112.1       [ 1834,  2755]      * | v0.grange__V0_Peaks_NoExt_peak_42395,c8.grange__Grouped_C8_NoExt_WithD_cond1_607
  [472] KN150127.1       [26653, 27288]      * | v0.grange__V0_Peaks_NoExt_peak_42417,c8.grange__Grouped_C8_NoExt_WithD_cond1_608
  [473] KN150342.1       [24017, 24364]      * | c8.grange__Grouped_C8_NoExt_WithD_cond1_609,v0.grange__V0_Peaks_NoExt_peak_42582
  [474] KN150342.1       [24739, 25115]      * | c8.grange__Grouped_C8_NoExt_WithD_cond1_610,v0.grange__V0_Peaks_NoExt_peak_42583
  [475] KN150342.1       [25567, 25951]      * | c8.grange__Grouped_C8_NoExt_WithD_cond1_611,v0.grange__V0_Peaks_NoExt_peak_42584
  -------
  seqinfo: 457 sequences from an unspecified genome; no seqlengths

What I'm attempting to do is join several different files (v0,v1,v2,v3, etc.) to this singular base file (c8) to have a way to compare the peak scores for overlapping peaks from different samples. Outside of cutting up the peakNames string to try and do a merge that way, does anyone know of a base function within ChIPpeakAnno that can do this for me? Thanks!

> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.6 (El Capitan)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
 [1] grid      stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] dplyr_0.5.0          purrr_0.2.2          readr_1.0.0          tidyr_0.6.0          tibble_1.2           ggplot2_2.1.0        tidyverse_1.0.0      ChIPpeakAnno_3.6.5  
 [9] VennDiagram_1.6.17   futile.logger_1.4.3  GenomicRanges_1.24.3 GenomeInfoDb_1.8.7   Biostrings_2.40.2    XVector_0.12.1       IRanges_2.6.1        S4Vectors_0.10.3    
[17] BiocGenerics_0.18.0  BiocInstaller_1.22.3 gplots_3.0.1         edgeR_3.14.0         limma_3.28.21       

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.7                   lattice_0.20-34               GO.db_3.3.0                   Rsamtools_1.24.0              gtools_3.5.0                  assertthat_0.1               
 [7] digest_0.6.10                 mime_0.5                      R6_2.1.3                      plyr_1.8.4                    futile.options_1.0.0          RSQLite_1.0.0                
[13] httr_1.2.1                    zlibbioc_1.18.0               GenomicFeatures_1.24.5        lazyeval_0.2.0                gdata_2.17.0                  Matrix_1.2-7.1               
[19] splines_3.3.1                 BiocParallel_1.6.6            AnnotationHub_2.4.2           RCurl_1.95-4.8                biomaRt_2.28.0                munsell_0.4.3                
[25] shiny_0.14                    httpuv_1.3.3                  rtracklayer_1.32.2            multtest_2.28.0               htmltools_0.3.5               SummarizedExperiment_1.2.3   
[31] interactiveDisplayBase_1.10.3 idr_1.2                       matrixStats_0.50.2            XML_3.98-1.4                  GenomicAlignments_1.8.4       MASS_7.3-45                  
[37] bitops_1.0-6                  RBGL_1.48.1                   xtable_1.8-2                  gtable_0.2.0                  DBI_0.5-1                     magrittr_1.5                 
[43] scales_0.4.0                  graph_1.50.0                  KernSmooth_2.23-15            seqinr_3.3-1                  lambda.r_1.1.9                ensembldb_1.4.7              
[49] tools_3.3.1                   ade4_1.7-4                    BSgenome_1.40.1               Biobase_2.32.0                survival_2.39-5               AnnotationDbi_1.34.4         
[55] colorspace_1.2-6              regioneR_1.4.2                caTools_1.17.1                memoise_1.0.0

Bioconductor • 9.2k views

ADD COMMENT • link updated 8.2 years ago by benformatics 4.0k • written 8.2 years ago by kyle.s.shank • 0

1

Entering edit mode

You know how to find overlaps, so write a function to do the merging (since it's not entirely clear how you want that done).

ADD REPLY • link 8.2 years ago by Devon Ryan 104k

1

Entering edit mode

mergeByOverlaps exists but it won't allow you to select for any particular type of overlap.

As mentionned by Devon Ryan if you want more specific merging a function build around findOverlaps will be the best option.

ADD REPLY • link 8.2 years ago by VHahaut ★ 1.2k

score 1 · Answer 1 · 2016-09-27

I think the simplest way would be to do as you suggested and use the peakNames string.

Your main problem overall however is the following:

example case

You need to have a plan to address the above. And that is only a simple case you could have multiple chains like the above. Are you going to compare sample2's score to the left sample one or the right? Or both?

-

But UNDER THE ASSUMPTION that all of your peaks in c8 ONLY overlap with a single peak in each of the other experiments. Then you may be able to finagle something using the findOverlaps() function from the GenomicRanges library. Then use queryHits() or subjectHits() on the resulting object to get back the indices of the ranges from the query (probably c8) and the subject (the other samples) ranges respectively.

Then use those indices to pull the scores out of the original individual GRanges objects.