Question

Filter transcripts based on CDS length

0

Entering edit mode

5.7 years ago

waqaskhokhar999 ▴ 160

I have a large file contains genome annotation information of Arabidopsis thaliana obtained via Biomart. I want to extract transcripts and corresponding information which have highest CDS length (means those transcripts which codes for longest proteins). The file contains 34 columns and n rows, I am pasting first 7 columns and information of just 2 genes for sake of simplicity:

Gene stable ID  Transcript stable ID    Protein stable ID   CDS Length  Chromosome  Gene start  Gene end
AT1G01030   AT1G01030.1 AT1G01030.1 1077    1   11649   13714
AT1G01030   AT1G01030.1 AT1G01030.1 1077    1   11649   13714
AT1G01030   AT1G01030.2 AT1G01030.2 1008    1   11649   13714
AT1G01030   AT1G01030.2 AT1G01030.2 1008    1   11649   13714
AT1G01030   AT1G01030.2 AT1G01030.2 1008    1   11649   13714
AT1G01110   AT1G01110.1 AT1G01110.1 1095    1   51953   54737
AT1G01110   AT1G01110.1 AT1G01110.1 1095    1   51953   54737
AT1G01110   AT1G01110.1 AT1G01110.1 1095    1   51953   54737
AT1G01110   AT1G01110.1 AT1G01110.1 1095    1   51953   54737
AT1G01110   AT1G01110.2 AT1G01110.2 1584    1   51953   54737
AT1G01110   AT1G01110.2 AT1G01110.2 1584    1   51953   54737
AT1G01110   AT1G01110.2 AT1G01110.2 1584    1   51953   54737
AT1G01110   AT1G01110.2 AT1G01110.2 1584    1   51953   54737
AT1G01110   AT1G01110.2 AT1G01110.2 1584    1   51953   54737

Now I am interested only those transcripts which have maximum CDS length, so the desired output should be:

Gene stable ID  Transcript stable ID    Protein stable ID   CDS Length  Chromosome  Gene start  Gene end
AT1G01030   AT1G01030.1 AT1G01030.1 1077    1   11649   13714
AT1G01030   AT1G01030.1 AT1G01030.1 1077    1   11649   13714
AT1G01110   AT1G01110.2 AT1G01110.2 1584    1   51953   54737
AT1G01110   AT1G01110.2 AT1G01110.2 1584    1   51953   54737
AT1G01110   AT1G01110.2 AT1G01110.2 1584    1   51953   54737
AT1G01110   AT1G01110.2 AT1G01110.2 1584    1   51953   54737
AT1G01110   AT1G01110.2 AT1G01110.2 1584    1   51953   54737

RNA-Seq annotation • 1.7k views

ADD COMMENT • link updated 5.7 years ago by Haci ▴ 730 • written 5.7 years ago by waqaskhokhar999 ▴ 160

score 3 · Answer 1 · 2019-09-11

Whereas the solution offered by @EVR would be much faster, here is the dplyr equivalent:

library(data.table) # for fread()
library(dplyr)

input %>%
  group_by(GenestableID)%>%
  filter(CDSLength == max(CDSLength)) %>%
  ungroup()

# A tibble: 7 x 7
  GenestableID TranscriptstableID ProteinstableID CDSLength Chromosome Genestart Geneend
  <chr>        <chr>              <chr>               <int>      <int>     <int>   <int>
1 AT1G01030    AT1G01030.1        AT1G01030.1          1077          1     11649   13714
2 AT1G01030    AT1G01030.1        AT1G01030.1          1077          1     11649   13714
3 AT1G01110    AT1G01110.2        AT1G01110.2          1584          1     51953   54737
4 AT1G01110    AT1G01110.2        AT1G01110.2          1584          1     51953   54737
5 AT1G01110    AT1G01110.2        AT1G01110.2          1584          1     51953   54737
6 AT1G01110    AT1G01110.2        AT1G01110.2          1584          1     51953   54737
7 AT1G01110    AT1G01110.2        AT1G01110.2          1584          1     51953   54737

score 2 · Answer 2 · 2019-09-11

2

Entering edit mode

5.7 years ago

EVR ▴ 610

HI,

You can group the data.frame based on genestabelId and later take max(CDS Length) which will provide what u want.

setDT(data.frame)[, .SD[which.max(CDS Length)], by=Gene stable ID]

ADD COMMENT • link 5.7 years ago by EVR ▴ 610