I have a table including Exon coordinates related to each transcript of a given gene (extracted from Ensembl), I would like to keep only those rows with the unique and the first shared exonic coordinate for each gene (e.g. there are two 'ENST00000006777' and 'ENST00000318622' transcripts for RHBDD2 gene that '7 75879034 75879260' exon coordinate is unique in the first transcript and '7 75878999 75879260' and '7 75881365 75881486' exon coordinates are unique in the second transcript, and '7 75881829 75882236' coordinate is the first shared coordinate in both transcripts ).
Input:
Gene.ID Gene.name Transcript.ID Chr Exon.start Exon.end
ENSG00000005486 RHBDD2 ENST00000006777 7 75879034 75879260
ENSG00000005486 RHBDD2 ENST00000006777 7 75881829 75882236
ENSG00000005486 RHBDD2 ENST00000006777 7 75883698 75883848
ENSG00000005486 RHBDD2 ENST00000318622 7 75878999 75879260
ENSG00000005486 RHBDD2 ENST00000318622 7 75881365 75881486
ENSG00000005486 RHBDD2 ENST00000318622 7 75881829 75882236
ENSG00000005486 RHBDD2 ENST00000318622 7 75883698 75883848
ENSG00000183580 FBXL7 ENST00000504595 5 15500180 15500713
ENSG00000183580 FBXL7 ENST00000504595 5 15615983 15616072
ENSG00000183580 FBXL7 ENST00000504595 5 15927890 15928501
ENSG00000183580 FBXL7 ENST00000504595 5 15936450 15939793
ENSG00000183580 FBXL7 ENST00000510662 5 15501438 15501723
ENSG00000183580 FBXL7 ENST00000510662 5 15615983 15616072
ENSG00000183580 FBXL7 ENST00000510662 5 15927890 15928501
ENSG00000183580 FBXL7 ENST00000510662 5 15936450 15937219
ENSG00000169783 LINGO1 ENST00000355300 15 77613027 77615900
ENSG00000169783 LINGO1 ENST00000355300 15 77632310 77632490
ENSG00000169783 LINGO1 ENST00000561030 15 77613670 77615900
ENSG00000169783 LINGO1 ENST00000561030 15 77677089 77677174
ENSG00000169783 LINGO1 ENST00000561030 15 77690720 77690901
ENSG00000169783 LINGO1 ENST00000561030 15 77695971 77696148
desire Output:
Gene.ID Gene.name Transcript.ID Chr Exon.start Exon.end
ENSG00000005486 RHBDD2 ENST00000006777 7 75879034 75879260
ENSG00000005486 RHBDD2 ENST00000006777 7 75881829 75882236
ENSG00000005486 RHBDD2 ENST00000318622 7 75878999 75879260
ENSG00000005486 RHBDD2 ENST00000318622 7 75881365 75881486
ENSG00000005486 RHBDD2 ENST00000318622 7 75881829 75882236
ENSG00000183580 FBXL7 ENST00000504595 5 15500180 15500713
ENSG00000183580 FBXL7 ENST00000504595 5 15615983 15616072
ENSG00000183580 FBXL7 ENST00000504595 5 15936450 15939793
ENSG00000183580 FBXL7 ENST00000510662 5 15501438 15501723
ENSG00000183580 FBXL7 ENST00000510662 5 15615983 15616072
ENSG00000183580 FBXL7 ENST00000510662 5 15936450 15937219
ENSG00000169783 LINGO1 ENST00000355300 15 75879034 75879260
ENSG00000169783 LINGO1 ENST00000355300 15 77632310 77632490
ENSG00000169783 LINGO1 ENST00000561030 15 77613670 77615900
ENSG00000169783 LINGO1 ENST00000561030 15 77677089 77677174
ENSG00000169783 LINGO1 ENST00000561030 15 77690720 77690901
ENSG00000169783 LINGO1 ENST00000561030 15 77695971 77696148
anything you've tried already to resolve this?
have you looked into using some basic linux tools?
sort
?uniq
? ... or even google?Yes, I have tried groupBy of bedtools to group genes, then unique them using 'uniq' command, but it gave me only unique regions, not both unique and the first shared regions.
what about something like this (from the top of my head):
that should give you non-redundant output based on data in column 5
Thanks for the solution, but it doesn`t work, it gives me more than one shared region. e.g.
sort -u -k5,5 data.bed | sort -k1,1 | head -n 150
The '17 43661066 43661393' is the second shared region that I got in my output.
My output should be like:
As @lieven.sterck suggested,
sort -u -k...
should accomplish what you're asking.I don't understand how the following lines are removed,
and yet
is included in your desired output.
Please be very specific with what you need.
No code, just general suggestions if I am coding in R
To find the first/last, etc, it's the easiest to sort/order the data.frame/ data.table by several columns and remove duplicates
To find if something is unique or not, it's the easiest to use dplyr group_by() %>% count/summarise
For you application, you original data lacks the "strand" column. If you really want to find something like "first exon", "strand" needs to be considered. A hack way to do so will be flip the coordinates into negative if a gene is in "-" strand, and then do the sorting/ordering