calculate percentage of bases covered by the transcriptome
2
0
Entering edit mode
3.0 years ago
alexmondaini ▴ 20

Hello everyone,

I have a GrangeList with 222 Granges objects that looks like that:

GRangesList object of length 222:
$ENCFF005ZCI
GRanges object with 25414 ranges and 2 metadata columns:
          seqnames              ranges strand |             name       score
             <Rle>           <IRanges>  <Rle> |      <character> <character>
      [1]     chr1 109100352-109100376      + | DDX21_K562_rep02         200
      [2]     chr1 109100322-109100351      + | DDX21_K562_rep02         200
      [3]     chr1   44776511-44776537      + | DDX21_K562_rep02         200
      [4]     chr1 109100273-109100313      + | DDX21_K562_rep02         200
      [5]     chr1   75789482-75789548      + | DDX21_K562_rep02         200
      ...      ...                 ...    ... .              ...         ...
  [25410]     chrX   73827664-73827696      - | DDX21_K562_rep02         200
  [25411]     chrX 154784030-154784047      - | DDX21_K562_rep02         200
  [25412]     chrX   73851381-73851404      - | DDX21_K562_rep02         200
  [25413]     chrX   53380052-53380072      - | DDX21_K562_rep02         200
  [25414]     chrY     7324489-7324510      - | DDX21_K562_rep02         200

and I also have a single Granges object with transcriptome ranges:

> transcriptome
GRanges object with 78807 ranges and 2 metadata columns:
          seqnames            ranges strand |     tx_id     tx_name
             <Rle>         <IRanges>  <Rle> | <integer> <character>
      [1]     chr1       11874-14409      + |         1  uc001aaa.3
      [2]     chr1       11874-14409      + |         2  uc010nxq.1
      [3]     chr1       11874-14409      + |         3  uc010nxr.1
      [4]     chr1       69091-70008      + |         4  uc001aal.1
      [5]     chr1     321084-321115      + |         5  uc001aaq.2
      ...      ...               ...    ... .       ...         ...
  [78803]     chrY 27605645-27605678      - |     78803  uc004fwx.1
  [78804]     chrY 27606394-27606421      - |     78804  uc022cpc.1
  [78805]     chrY 27607404-27607432      - |     78805  uc004fwz.3
  [78806]     chrY 27635919-27635954      - |     78806  uc022cpd.1
  [78807]     chrY 59358329-59360854      - |     78807  uc011ncc.1
  -------
  seqinfo: 93 sequences (1 circular) from hg19 genome

I would like to calculate for every item in the GrangeList the percentage/fraction of bases covered by the transcriptome Granges object. I'm aware of the coverage function in the Granges package but after much reading I struggle to understand even with a simple example what this function is really doing. Any help is much appreciated, thanks.

coverage granges • 1.5k views
ADD COMMENT
1
Entering edit mode
3.0 years ago
Papyrus ★ 3.0k

If I understood correctly, you want, for each element in your first list, to get the % of bases that overlap with the transcriptome?

If you are using GRanges, maybe you could so something like this example below. Take note that 1) the overlap can be strand specific or not (ignore.strand argument controls this), and 2) I'm using reduce() to remove redundant bases from overlapping elements. This may or may not be desired. This will have an effect if there is overlap between ranges in each element of your list of ranges.

# Generate example data
gr1 <- GRanges(seqnames="chr2", ranges=IRanges(c(3,10), c(6,16)),
               strand="+")
gr2 <- GRanges(seqnames=c("chr1", "chr1"),
               ranges=IRanges(c(7,13), width=3),
               strand=c("+", "-"))
gr3 <- GRanges(seqnames=c("chr1", "chr2","chr1"),
               ranges=IRanges(c(1, 4, 55), c(3, 9, 65)),
               strand=c("-", "-", "+"))
grl <- GRangesList("gr1"=gr1, "gr2"=gr2, "gr3"=gr3)

transcriptome <- GRanges(seqnames = c("chr2","chr2","chr1"),
                         ranges = IRanges(c(2,5,50),c(3,10,60)),
                         strand = c("+","+","+")) 

# Get the percentage of bases in each element of the list of ranges which intersect with transcriptome
lapply(grl,function(x){
  sum(width(intersect(reduce(x), reduce(transcriptome), ignore.strand = F))) / sum(width(reduce(x)))
})
ADD COMMENT
0
Entering edit mode

yeah I think that makes sense, perhaps I should divide by sum(width(reduce(transcriptome))) instead to get the percentage of the grangelist covered by the transcriptome object.

ADD REPLY
0
Entering edit mode
3.0 years ago
jtull • 0

Hi, if you are not restricted to solving your problem with Granges, I recommend using bedtools. You could use bedtools -wao option to report the number of base pairs of overlap including when overlap = 0. Plus, you could also restrict reporting to a minimum overlap as a fraction of each GrangeList item. I recommend taking a look at https://bedtools.readthedocs.io/en/latest/content/tools/intersect.html

An example usage for -wao is below:

$ cat A.bed

chr1 10 20
chr1 30 40

$ cat B.bed

chr1 15 20
chr1 18 25

$ bedtools intersect -a A.bed -b B.bed -wao

chr1 10 20 chr1 15 20 5

chr1 10 20 chr1 18 25 2

chr1 30 40 . -1 -1 0
ADD COMMENT
0
Entering edit mode

I know bedtools has even a coverage function https://bedtools.readthedocs.io/en/latest/content/tools/coverage.html, but I was looking for a solution using Granges

ADD REPLY

Login before adding your answer.

Traffic: 2399 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6