Adding gene symbol to GRanges for calculation purposes
1
0
Entering edit mode
3.8 years ago
mm2568 • 0

I am removing the exon coordinates from a file, and I am using GRanges to do this. However, I still want to maintain the information about what gene each coordinate is in (especially because the exons might be in the middle of coordinates splitting the line into two).

This is the file that contains the coordinates that I want exons removed from:

GRanges object with 145 ranges and 0 metadata columns:
        seqnames            ranges strand
           <Rle>         <IRanges>  <Rle>
    [1]        X   2270008-2271068      *
    [2]       3L 20341634-20342694      *
    [3]       3L 20959398-20960458      *
    [4]       3L 20954755-20955815      *
    [5]       3L 21541359-21542419      *
    ...      ...               ...    ...
  [141]       3L   7438554-7439614      *
  [142]       2L   4641972-4643032      *
  [143]       2L   1990381-1991441      *
  [144]       2L   1358837-1359897      *
  [145]       2L   1357925-1358985      *
  -------

I then take the difference between the file that contains all exon coordinates and this GRanges object and get the coordinates without exons (i.e. final):

without_exons <- setdiff(gr_file, gr_exons,ignore.strand = TRUE) #remove overlapping coordinates
final <- reduce(without_exons) #gets rid of any overlaps in the coordinates
write.table(final, "without_exons.bed", sep="\t", quote=FALSE, row.names=FALSE, col.names=FALSE)

My question is if I can add a column that contains the gene symbols that will follow the setdiff calculation I am doing. I want to make sure I know which gene the coordinate is in without having to find the genes after this step (which is also an option). The input file has the gene names in the last column (see below):

X 2270008 2271068 CG2918
3L 20341634 20342694 Spn77Bc
3L 20959398 20960458 CG11037
3L 20954755 20955815 Sems
3L 21541359 21542419 CG33290
3L 22081720 22082780 msopa

Thank you so much for your help!

R • 2.0k views
ADD COMMENT
0
Entering edit mode
3.8 years ago

Example ranges and genes. I'll be using plyranges for this example for the convenience.

library("plyranges")

ranges <- structure(list(seqnames = structure(c(1L, 1L, 1L, 1L), class = "factor", .Label = "I"), 
    start = c(1L, 6L, 11L, 16L), end = c(3L, 8L, 13L, 18L), width = c(3L, 
    3L, 3L, 3L), strand = structure(c(3L, 3L, 3L, 3L), class = "factor", .Label = c("+", 
    "-", "*"))), row.names = c(NA, -4L), class = "data.frame")

ranges <- as_granges(ranges)
> ranges
GRanges object with 4 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]        I       1-3      *
  [2]        I       6-8      *
  [3]        I     11-13      *
  [4]        I     16-18      *
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

genes <- structure(list(seqnames = structure(c(1L, 1L), class = "factor", .Label = "I"), 
    start = c(2L, 10L), end = c(5L, 14L), width = 4:5, strand = structure(c(3L, 
    3L), class = "factor", .Label = c("+", "-", "*")), names = c("ENS001912", 
    "ENS003901")), row.names = c(NA, -2L), class = "data.frame")

genes <- as_granges(genes)
> genes
GRanges object with 2 ranges and 1 metadata column:
      seqnames    ranges strand |       names
         <Rle> <IRanges>  <Rle> | <character>
  [1]        I       2-5      * |   ENS001912
  [2]        I     10-14      * |   ENS003901
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

Using left join from plyranges.

overlap <- join_overlap_left(ranges, genes)

> overlap
GRanges object with 4 ranges and 1 metadata column:
      seqnames    ranges strand |       names
         <Rle> <IRanges>  <Rle> | <character>
  [1]        I       1-3      * |   ENS001912
  [2]        I       6-8      * |        <NA>
  [3]        I     11-13      * |   ENS003901
  [4]        I     16-18      * |        <NA>
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
ADD COMMENT
0
Entering edit mode

hi rpolicastro,

I have a Grange look like this:

> vcf_gr
GRanges object with 4598553 ranges and 2 metadata columns:
          seqnames    ranges strand |         ref         alt
             <Rle> <IRanges>  <Rle> | <character> <character>
        1     chr1     51479      * |           T           A
        2     chr1     51803      * |           T           C
        3     chr1     52238      * |           T           G
        4     chr1     54490      * |           G           A
        5     chr1     54712      * |   TTTTCTTTC           T

Do you know if I add gene to single bases, instead of range?

ADD REPLY

Login before adding your answer.

Traffic: 1972 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6