getting previous exon from gtf file
0
0
Entering edit mode
6.6 years ago
HK ▴ 40

Hey everyone,

I have a dataframe with the my selected exon cordinates(dataframe=my exons) and a gtf dataframe (dataframe=gtf). I am trying to first find my exons from the gft and then once found trying to extract the previous exon from the gft as output. i have given an example below.

dataframe my exons

start   end exon_num    width   gene    geneid_exon_num_start_end
61178981    61179171    2   191 AHSA3   AHSA3 -2-61178981-61179171
243613671   243613739   7   69  AKT3    AKT3 -7-243613671-243613739
36952886    36953296    3   411 AATF    AATF -3-36952886-36953296

dataframe gtf

start   end exon_num    width   gene    geneid_exon_num_start_end
61177472    61177618    1   147 AHSA3   AHSA3 -1-61177472-61177618
61178668    61179171    1   504 AHSA3   AHSA3 -1-61178668-61179171
61185526    61185618    1   93  AHSA3   AHSA3 -1-61185526-61185618
61178981    61179171    2   191 AHSA3   AHSA3 -2-61178981-61179171
36950214    36950405    2   192 AATF    AATF -2-36950214-36950405
37056601    37056858    2   258 AATF    AATF -2-37056601-37056858
36952886    36953296    3   411 AATF    AATF -3-36952886-36953296
19958014    19958571    4   558 AATF    AATF -4-19958014-19958571

dataframe required output

start   end exon_num    width   gene    geneid_exon_num_start_end
61177472    61177618    1   147 AHSA3   AHSA3 -1-61177472-61177618
36950214    36950405    2   192 AATF    AATF -2-36950214-36950405

I am using this in order to find my exons from the gtf, once found take the row above to get the previous exon.

previous_exon=gft[which(gtf$geneid_exon_num_start_end %in% myexons$geneid_exon_num_start_end)+c(-1),]

if you see there are multiple previous exons with different coordinates, so i want the row in which the end coordinate of the previous exon is smaller than the start of my exon (withing the same gene)(see the required output). I have a huge dataframe so cant do one by one , so need help to keep this track and get the output.

R exons gtf • 1.4k views
ADD COMMENT
0
Entering edit mode

OP input and expected output are confusing. Please post appropriate example input data and matching output. In addition to exon number, gene (gene symbol) also must be considered.

ADD REPLY
0
Entering edit mode

It is indeed a bit confusing. Is this what you want?

exons
      start       end exon_num width  gene  geneid_exon_num_start_end
1  61178981  61179171        2   191 AHSA3  AHSA3-2-61178981-61179171
2 243613671 243613739        7    69  AKT3 AKT3-7-243613671-243613739
3  36952886  36953296        3   411  AATF   AATF-3-36952886-36953296


gtf
     start      end exon_num width  gene geneid_exon_num_start_end
1 61177472 61177618        1   147 AHSA2 AHSA2-1-61177472-61177618
2 61178668 61179171        1   504 AHSA3 AHSA3-1-61178668-61179171
3 61185526 61185618        1    93 AHSA4 AHSA4-1-61185526-61185618
4 61178981 61179171        2   191 AHSA5 AHSA5-2-61178981-61179171
5 36950214 36950405        2   192  AATF  AATF-2-36950214-36950405
6 37056601 37056858        2   258  AATF  AATF-2-37056601-37056858
7 36952886 36953296        3   411  AATF  AATF-3-36952886-36953296
8 19958014 19958571        4   558  AATF  AATF-4-19958014-19958571

Create a unique key for each region:

exons$key <- paste(exons$start, exons$end, sep=":")

gtf$key <- paste(gtf$start, gtf$end, sep=":")

Find the row indices of your exons in the GTF, and also the row inices less 1 index:

match(exons$key, gtf$key)
[1]  4 NA  7

match(exons$key, gtf$key) - 1
[1]  3 NA  6

indices <- match(exons$key, gtf$key) - 1

Then subset GTF to select the previous exon:

exons
      start       end exon_num width  gene  geneid_exon_num_start_end
1  61178981  61179171        2   191 AHSA3  AHSA3-2-61178981-61179171
2 243613671 243613739        7    69  AKT3 AKT3-7-243613671-243613739
3  36952886  36953296        3   411  AATF   AATF-3-36952886-36953296
                  key
1   61178981:61179171
2 243613671:243613739
3   36952886:36953296

gtf[indices[!is.na(indices)],]
     start      end exon_num width  gene geneid_exon_num_start_end
3 61185526 61185618        1    93 AHSA4 AHSA4-1-61185526-61185618
6 37056601 37056858        2   258  AATF  AATF-2-37056601-37056858
                key
3 61185526:61185618
6 37056601:37056858
ADD REPLY
0
Entering edit mode

@Kevin Blighe i am corrected the input dataframes again, realized there was a mistake. I did made a unique eye i.e geneid_exon_num_start_end and then found the match between exon and gtf by using gft[which(gtf$geneid_exon_num_start_end %in% myexons$geneid_exon_num_start_end)

Now when i have the matched row in the gtf, want to get the row within the gene where the end cordinate is smaller than the matched start cordinate.

ADD REPLY
0
Entering edit mode

@Kevin Blighe can you kindly check the corrected input and output, an help me with the solution

ADD REPLY
0
Entering edit mode

Thanks for the update. How about this? - note that I use GenomicRanges, in this case, in which case I had to create a dummy chromosome ID.

gtf$chr <- rep("NA", nrow(gtf))
exons$chr <- rep("NA", nrow(exons))

require(GenomicRanges)

exonsGR <- makeGRangesFromDataFrame(exons, keep.extra.columns = TRUE)
gtfGR <- makeGRangesFromDataFrame(gtf, keep.extra.columns = TRUE)

matches <- gtf[queryHits(findOverlaps(gtfGR, exonsGR, type="any")),]
     start      end exon_num width  gene geneid_exon_num_start_end chr
2 61178668 61179171        1   504 AHSA3 AHSA3-1-61178668-61179171  NA
4 61178981 61179171        2   191 AHSA3 AHSA3-2-61178981-61179171  NA
7 36952886 36953296        3   411  AATF  AATF-3-36952886-36953296  NA

Now get what you want:

result <- data.frame()
for (i in 1:nrow(matches))
{
  df <- gtf[which(gtf$gene %in% matches[i, "gene"]),]
  result <- rbind(result, df[which(df$end < matches[i, "start"] && df$exon_num < matches[i, "exon_num"]),])
}

unique(result)

     start      end exon_num width  gene geneid_exon_num_start_end chr
1 61177472 61177618        1   147 AHSA3 AHSA3-1-61177472-61177618  NA
5 36950214 36950405        2   192  AATF  AATF-2-36950214-36950405  NA
ADD REPLY
0
Entering edit mode

@cpad0112 i have corrected the input and output now.

ADD REPLY
0
Entering edit mode

np and thanks

ADD REPLY

Login before adding your answer.

Traffic: 2665 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6