Question

Transform genomic intervals to genomic positions in an R dataframe

5.2 years ago

jeni ▴ 90

Hi everyone!

I have a data frame with some genomic intervals and their corresponding coverage in several samples:

            sample1  sample2   sample3
     1:1-3    30        NA      NA
     1:1-4    NA        40      35
     1:4-5    35        NA      NA
     1:5-7    NA        50      50
     1:6-7    60        NA      NA

I would like to obtain the same dataframe but for genomic positions:

            sample1    sample2     sample3
     1:1      30         40          35
     1:2      30         40          35
     1:3      30         40          35
     1:4      35         40          35
     1:5      35         50          50 
     1:6      60         50          50
     1:7      60         50          50

How could I get this?

R • 1.3k views

ADD COMMENT • link 5.2 years ago by jeni ▴ 90

First get the intervals from the row names. Then use strsplit to extract the chromosome (first element) and the range start and end (second and third elements). You can either put these into a data frame and then use makeGRangesFromDataFrame, or call GRanges directly to construct a GRanges object. The coverages can be stored as elementMetadata in the resulting GRanges object. I suggest you try that out; it is good practice for improving your skills.
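As a rough illustration of the parsing step, here is a minimal base R sketch. The toy data frame `df` is rebuilt from the example in the question, and the object names `parts` and `coords` and the column names `chrm`, `start`, `end` are illustrative choices, not anything prescribed. It assumes chromosome names contain neither `:` nor `-`.

```r
# Toy data frame mirroring the example in the question;
# the interval labels are stored as row names.
df <- data.frame(
  sample1 = c(30, NA, 35, NA, 60),
  sample2 = c(NA, 40, NA, 50, NA),
  sample3 = c(NA, 35, NA, 50, NA),
  row.names = c("1:1-3", "1:1-4", "1:4-5", "1:5-7", "1:6-7")
)

# Split "chr:start-end" on ':' or '-' to get three fields per interval.
# Assumption: chromosome names contain neither ':' nor '-'.
parts <- do.call(rbind, strsplit(rownames(df), "[:-]"))

# Coordinates first, coverage columns after them.
coords <- cbind(
  data.frame(
    chrm  = parts[, 1],
    start = as.integer(parts[, 2]),
    end   = as.integer(parts[, 3])
  ),
  df
)
rownames(coords) <- NULL

print(coords)
```

A data frame shaped like `coords` could then be passed to makeGRangesFromDataFrame, with `keep.extra.columns = TRUE` so the coverage columns are carried along as metadata.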

ADD REPLY • link 5.2 years ago by ATpoint 88k

Okay, thanks! I have already done that.

But now how can I get genomic positions from each interval, indicating the coverage value of each sample for each position?

ADD REPLY • link 5.2 years ago by jeni ▴ 90

Can you show what you have done?

ADD REPLY • link 5.2 years ago by ATpoint 88k

Sure! I've converted my data frame into a GRanges object (after first splitting the genomic coordinates into separate columns: chr, start, end):

gr<-makeGRangesFromDataFrame(df, seqnames.field = 'chrm', start.field = 'start', end.field = 'end', keep.extra.columns = TRUE)

GRanges object with 5 ranges and 3 metadata columns:
      seqnames    ranges strand |     sample1     sample2     sample3
         <Rle> <IRanges>  <Rle> | <character> <character> <character>
  [1]        1       1-3      * |          30          NA          NA
  [2]        1       1-4      * |          NA          40          35
  [3]        1       4-5      * |          35          NA          NA
  [4]        1       5-7      * |          NA          50          50
  [5]        1       6-7      * |          60          NA          NA
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

Now, I have tried:

grd<-disjoin(gr)

and I get this:

GRanges object with 4 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]        1       1-3      *
  [2]        1         4      *
  [3]        1         5      *
  [4]        1       6-7      *
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

In this toy example disjoin() does not split the ranges all the way down to single positions, but in my real data frame it does, because there are many overlapping intervals. The problem now is that I don't know how to keep the metadata columns and adapt them to the new ranges. What I would like to obtain is this:

GRanges object with 4 ranges and 3 metadata columns:
      seqnames    ranges strand |     sample1     sample2     sample3
         <Rle> <IRanges>  <Rle> | <character> <character> <character>
  [1]        1       1-3      * |          30          40          35
  [2]        1         4      * |          35          40          35
  [3]        1         5      * |          35          50          50
  [4]        1       6-7      * |          60          50          50
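
One possible way to get there, sketched in base R rather than with disjoin(): expand every interval to its individual positions, then for each position and sample keep the first non-NA coverage value among the intervals that cover it. This is only a sketch under assumptions. The names `coords`, `long`, `per_pos` and `first_non_na` are made up for illustration, the coverage columns are assumed to be numeric (the metadata columns shown above are character, so they would need as.numeric() first), and each position is assumed to have at most one non-NA value per sample. If several intervals disagree, the first one encountered wins, which is an arbitrary choice.

```r
# Intervals and coverage as in the example (hypothetical object names).
coords <- data.frame(
  chrm    = "1",
  start   = c(1L, 1L, 4L, 5L, 6L),
  end     = c(3L, 4L, 5L, 7L, 7L),
  sample1 = c(30, NA, 35, NA, 60),
  sample2 = c(NA, 40, NA, 50, NA),
  sample3 = c(NA, 35, NA, 50, NA)
)
samples <- c("sample1", "sample2", "sample3")

# One row per (interval, position): repeat each interval once per covered base.
widths <- coords$end - coords$start + 1L
idx    <- rep(seq_len(nrow(coords)), widths)
pos    <- coords$start[idx] + sequence(widths) - 1L
long   <- data.frame(
  key = paste0(coords$chrm[idx], ":", pos),
  coords[idx, samples]
)

# Keep the first non-NA value; NA if no interval has a value for that sample.
first_non_na <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) > 0) x[1] else NA_real_
}

# Collapse to one row per position, in order of first appearance.
keys    <- unique(long$key)
grp     <- factor(long$key, levels = keys)
per_pos <- matrix(NA_real_, nrow = length(keys), ncol = length(samples),
                  dimnames = list(keys, samples))
for (s in samples) {
  per_pos[, s] <- tapply(long[[s]], grp, first_non_na)
}

print(per_pos)
```

On this toy input the result has one row per position from 1:1 to 1:7. Positions appear in order of first occurrence, so real data may need sorting afterwards. Staying within GenomicRanges, findOverlaps() between the disjoint ranges and the original ranges should give the mapping needed to do the same fill-in per disjoint range instead of per position.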

ADD REPLY • link updated 5.2 years ago by Kevin Blighe 89k • written 5.2 years ago by jeni ▴ 90
