how to extract unique and the first shared exon in a given gene?
1
0
Entering edit mode
5.3 years ago
star ▴ 350

I have a table including Exon coordinates related to each transcript of a given gene (extracted from Ensembl), I would like to keep only those rows with the unique and the first shared exonic coordinate for each gene (e.g. there are two 'ENST00000006777' and 'ENST00000318622' transcripts for RHBDD2 gene that '7 75879034 75879260' exon coordinate is unique in the first transcript and '7 75878999 75879260' and '7 75881365 75881486' exon coordinates are unique in the second transcript, and '7 75881829 75882236' coordinate is the first shared coordinate in both transcripts ).

Input:

Gene.ID        Gene.name Transcript.ID Chr  Exon.start  Exon.end

ENSG00000005486 RHBDD2  ENST00000006777 7   75879034    75879260
ENSG00000005486 RHBDD2  ENST00000006777 7   75881829    75882236
ENSG00000005486 RHBDD2  ENST00000006777 7   75883698    75883848
ENSG00000005486 RHBDD2  ENST00000318622 7   75878999    75879260
ENSG00000005486 RHBDD2  ENST00000318622 7   75881365    75881486
ENSG00000005486 RHBDD2  ENST00000318622 7   75881829    75882236
ENSG00000005486 RHBDD2  ENST00000318622 7   75883698    75883848
ENSG00000183580 FBXL7   ENST00000504595 5   15500180    15500713
ENSG00000183580 FBXL7   ENST00000504595 5   15615983    15616072
ENSG00000183580 FBXL7   ENST00000504595 5   15927890    15928501
ENSG00000183580 FBXL7   ENST00000504595 5   15936450    15939793
ENSG00000183580 FBXL7   ENST00000510662 5   15501438    15501723
ENSG00000183580 FBXL7   ENST00000510662 5   15615983    15616072
ENSG00000183580 FBXL7   ENST00000510662 5   15927890    15928501
ENSG00000183580 FBXL7   ENST00000510662 5   15936450    15937219
ENSG00000169783 LINGO1  ENST00000355300 15  77613027    77615900
ENSG00000169783 LINGO1  ENST00000355300 15  77632310    77632490
ENSG00000169783 LINGO1  ENST00000561030 15  77613670    77615900
ENSG00000169783 LINGO1  ENST00000561030 15  77677089    77677174
ENSG00000169783 LINGO1  ENST00000561030 15  77690720    77690901
ENSG00000169783 LINGO1  ENST00000561030 15  77695971    77696148

desire Output:

Gene.ID        Gene.name Transcript.ID Chr  Exon.start  Exon.end

ENSG00000005486 RHBDD2  ENST00000006777 7   75879034    75879260
ENSG00000005486 RHBDD2  ENST00000006777 7   75881829    75882236
ENSG00000005486 RHBDD2  ENST00000318622 7   75878999    75879260
ENSG00000005486 RHBDD2  ENST00000318622 7   75881365    75881486
ENSG00000005486 RHBDD2  ENST00000318622 7   75881829    75882236
ENSG00000183580 FBXL7   ENST00000504595 5   15500180    15500713
ENSG00000183580 FBXL7   ENST00000504595 5   15615983    15616072
ENSG00000183580 FBXL7   ENST00000504595 5   15936450    15939793
ENSG00000183580 FBXL7   ENST00000510662 5   15501438    15501723
ENSG00000183580 FBXL7   ENST00000510662 5   15615983    15616072
ENSG00000183580 FBXL7   ENST00000510662 5   15936450    15937219
ENSG00000169783 LINGO1  ENST00000355300 15  75879034    75879260
ENSG00000169783 LINGO1  ENST00000355300 15  77632310    77632490
ENSG00000169783 LINGO1  ENST00000561030 15  77613670    77615900
ENSG00000169783 LINGO1  ENST00000561030 15  77677089    77677174
ENSG00000169783 LINGO1  ENST00000561030 15  77690720    77690901
ENSG00000169783 LINGO1  ENST00000561030 15  77695971    77696148
groupby linux dyplr R bedtools • 1.8k views
ADD COMMENT
0
Entering edit mode

anything you've tried already to resolve this?

have you looked into using some basic linux tools? sort? uniq? ... or even google?

ADD REPLY
0
Entering edit mode

Yes, I have tried groupBy of bedtools to group genes, then unique them using 'uniq' command, but it gave me only unique regions, not both unique and the first shared regions.

ADD REPLY
0
Entering edit mode

what about something like this (from the top of my head):

sort -u -k5,5 <file>

that should give you non-redundant output based on data in column 5

ADD REPLY
0
Entering edit mode

Thanks for the solution, but it doesn`t work, it gives me more than one shared region. e.g.

ENSG00000005102 MEOX1   ENST00000318579 17  43640389    43642032
ENSG00000005102 MEOX1   ENST00000318579 17  43643488    43643660
ENSG00000005102 MEOX1   ENST00000318579 17  43661066    43661922
ENSG00000005102 MEOX1   ENST00000393661 17  43641380    43642032
ENSG00000005102 MEOX1   ENST00000393661 17  43643488    43643660
ENSG00000005102 MEOX1   ENST00000393661 17  43661066    43661393
ENSG00000005102 MEOX1   ENST00000393661 17  43661871    43661922

sort -u -k5,5 data.bed | sort -k1,1 | head -n 150

ENSG00000005102 MEOX1   ENST00000318579 17  43640389    43642032
ENSG00000005102 MEOX1   ENST00000318579 17  43643488    43643660
ENSG00000005102 MEOX1   ENST00000318579 17  43661066    43661922
ENSG00000005102 MEOX1   ENST00000393661 17  43641380    43642032
ENSG00000005102 MEOX1   ENST00000393661 17  43661871    43661922

The '17 43661066 43661393' is the second shared region that I got in my output.

My output should be like:

ENSG00000005102 MEOX1   ENST00000318579 17  43640389    43642032
ENSG00000005102 MEOX1   ENST00000318579 17  43643488    43643660
ENSG00000005102 MEOX1   ENST00000393661 17  43641380    43642032
ENSG00000005102 MEOX1   ENST00000393661 17  43643488    43643660
ENSG00000005102 MEOX1   ENST00000393661 17  43661871    43661922
ADD REPLY
0
Entering edit mode

As @lieven.sterck suggested, sort -u -k... should accomplish what you're asking.

I don't understand how the following lines are removed,

ENSG00000005486 RHBDD2  ENST00000006777 7   75883698    75883848
ENSG00000005486 RHBDD2  ENST00000318622 7   75883698    75883848

and yet

ENSG00000005486 RHBDD2  ENST00000318622 7   75881829    75882236

is included in your desired output.

Please be very specific with what you need.

ADD REPLY
0
Entering edit mode

No code, just general suggestions if I am coding in R

  • To find the first/last, etc, it's the easiest to sort/order the data.frame/ data.table by several columns and remove duplicates

  • To find if something is unique or not, it's the easiest to use dplyr group_by() %>% count/summarise

  • For you application, you original data lacks the "strand" column. If you really want to find something like "first exon", "strand" needs to be considered. A hack way to do so will be flip the coordinates into negative if a gene is in "-" strand, and then do the sorting/ordering

ADD REPLY
2
Entering edit mode
5.3 years ago

Using R

Assuming the following...

  • You have a coordinate sorted data.frame type object that is strand agnostic. (FYI, the first exon of a gene expressed from the negative strand will be located "downstream" in the genome)
  • Every single gene will only be made up of two transcripts... this code would need significant adaptation across multiple transcripts and/or shared exons.

Also your logic doesn't really follow.

ENSG00000183580 FBXL7 ENST00000504595 5 15936450 15939793

ENSG00000183580 FBXL7 ENST00000510662 5 15936450 15937219

These are not shared exons based on their end coordinates. You should be specific if you want to do this based on start, or stop coordinates, or both etc...

What is your actual end goal here?

df <- read.csv('~/your_data_table')
colnames(df) <- c('ens','name','tx','chr','start','end')

## maintain order
df$ens <- factor(df$ens,levels=unique(df$ens))

## split by gene
df.split <- split(df,df$ens)

df.split.processed <- lapply(df.split,function(df.split.sub){
  ## give unique id to each set of exons
  location.id <- paste0(df.split.sub$chr,':',df.split.sub$start,'-',df.split.sub$end)
  ## find the duplicated ones
  nonunique.names <- uniquelocation.id[whichduplicatedlocation.id))])
  ## locate all potential shared exons
  nonunique.idx <- whichlocation.id %in% nonunique.names)
  ## extract first exon that is shared - assuming file is coordinate sorted and that strand doesn't matter
  first.shared <- whichlocation.id %in% location.id[nonunique.idx[1]])
  ## extract unique exons
  unique.exons <- which(!location.id %in% nonunique.names)
  if(length(unique.exons) > 0){
    output <- unique.exons
  }
  if(length(first.shared) >= 1){
    output <- c(output,first.shared)
  }
  df.split.sub[sort(output),]
})
## merge results
df.final <- do.call(rbind,df.split.processed)
## remove row.names
row.names(df.final) <- NULL
## add back original colmnames
colnames(df.final) <- colnames(df)

## show results
print(df.final)

           Gene.ID Gene.name Transcript.ID Chr Exon.start Exon.end
1  ENSG00000005486  RHBDD2  ENST00000006777   7   75879034 75879260
2  ENSG00000005486  RHBDD2  ENST00000006777   7   75881829 75882236
3  ENSG00000005486  RHBDD2  ENST00000318622   7   75878999 75879260
4  ENSG00000005486  RHBDD2  ENST00000318622   7   75881365 75881486
5  ENSG00000005486  RHBDD2  ENST00000318622   7   75881829 75882236
6  ENSG00000183580   FBXL7  ENST00000504595   5   15500180 15500713
7  ENSG00000183580   FBXL7  ENST00000504595   5   15615983 15616072
8  ENSG00000183580   FBXL7  ENST00000504595   5   15936450 15939793
9  ENSG00000183580   FBXL7  ENST00000510662   5   15501438 15501723
10 ENSG00000183580   FBXL7  ENST00000510662   5   15615983 15616072
11 ENSG00000183580   FBXL7  ENST00000510662   5   15936450 15937219
12 ENSG00000169783  LINGO1  ENST00000355300  15   77613027 77615900
13 ENSG00000169783  LINGO1  ENST00000355300  15   77632310 77632490
14 ENSG00000169783  LINGO1  ENST00000561030  15   77613670 77615900
15 ENSG00000169783  LINGO1  ENST00000561030  15   77677089 77677174
16 ENSG00000169783  LINGO1  ENST00000561030  15   77690720 77690901
17 ENSG00000169783  LINGO1  ENST00000561030  15   77695971 77696148
ADD COMMENT
0
Entering edit mode

Many thanks for the solution! you are right, I like to extract regions (unique and shared) based on both start and end I have updated my post.

I have run your function but I got the error: Error in FUN(X[[i]], ...) : object 'uniquelocation.id' not found / Error in FUN(X[[i]], ...) : object 'whichduplicatedlocation.id' not found

ADD REPLY
1
Entering edit mode

There seems to be an error in the way biostars is rendering the code.

It should be as follows inside the main function: enter image description here

ADD REPLY
0
Entering edit mode

Thanks for re-sending! I have rut it:

df.split.processed <- lapply(df.split,function(df.split.sub){
 location.id <- paste0(df.split.sub$chr,':',df.split.sub$start,'-',df.split.sub$end)
 nonunique.names <- unique ( location.id [which (duplicated ( location.id))])
nonunique.idx <- which ( location.id %in% nonunique.names)
first.shared <- which ( location.id %in% location.id[nonunique.idx[1]])
unique.exons <- which(!location.id %in% nonunique.names)
if(length(unique.exons) > 0){
 output <- unique.exons
}
if(length(first.shared) >= 1){
output <- c(output,first.shared)
}
df.split.sub[sort(output),]
 })

but got the error: Error in FUN(X[[i]], ...) : object 'output' not found

ADD REPLY
0
Entering edit mode

I guess It works well on data that I indicated here (so I will mark it), but not work on my whole data. Would you please let me know, what is the solution when I have multiple transcripts for a gene?

ADD REPLY

Login before adding your answer.

Traffic: 1731 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6