Question

Merging position of all different CDS of a single gene in one line

6.2 years ago

1234anjalianjali1234 ▴ 50

Hello,

I am looking for gene duplication events within a genome. For this, I need the positional information of the CDS of those genes.

The problem is that I need each unique ID only once, with its overall positional information (first start to last end).

My file:

st1 PGSC0003DMC400026563    152418  152576
st1 PGSC0003DMC400026561    160499  160663
st1 PGSC0003DMC400039465    225140  225225
st1 PGSC0003DMC400039465    225786  225990
st1 PGSC0003DMC400039465    226430  226630
st1 PGSC0003DMC400039465    227247  227461
st1 PGSC0003DMC400039465    228093  228346
st1 PGSC0003DMC400039465    228815  228867
st1 PGSC0003DMC400039465    228960  229439
st1 PGSC0003DMC400039540    249208  249402

What I want:

st1 PGSC0003DMC400026563    152418  152576
st1 PGSC0003DMC400026561    160499  160663
st1 PGSC0003DMC400039465    225140  229439
st1 PGSC0003DMC400039540    249208  249402

Thank you.

CDS GFF gene duplication • 2.0k views

ADD COMMENT • link updated 3.7 years ago by Biostar 20 • written 6.2 years ago by 1234anjalianjali1234 ▴ 50

Is your file always sorted like that? That is, are the IDs always grouped together and the coordinates sorted?

ADD REPLY • link 6.2 years ago by Devon Ryan 104k

No, I sorted my original GFF file using an awk command.

ADD REPLY • link 6.2 years ago by 1234anjalianjali1234 ▴ 50

What have you tried? You have a clear idea of what you want, so you must have made some headway into getting there, right?

ADD REPLY • link 6.2 years ago by Ram 44k

Can I also add that the tag 'gene duplication' is misplaced here? Those are not gene duplications but exons (CDS) of a single gene. So you just want the start and end coordinates of each gene, rather than those of the separate exons.

ADD REPLY • link 6.2 years ago by lieven.sterck 15k

Yes, I know that they are not duplicated genes. I am trying to find gene duplications, for which I need to make a GFF file, and for that I have to make a file of CDS with coordinates. You are right, I want the start and end coordinates of each CDS.

Thank you

ADD REPLY • link 6.2 years ago by 1234anjalianjali1234 ▴ 50

Please aim for professional communication -

Yes, i know that they are not duplicated genes... i am trying to find gene duplication for which i need to make gff file for that i have to make a file of cds with coordinates.... and u r right, i want start and end coordinate of a CDS.

thankyou

would be:

Yes, I know that they are not duplicated genes. I am trying to find gene duplication for which I need to make gff file, and for that I have to make a file of cds with coordinates. And you are right, I want start and end coordinate of a CDS.

Thank You

ADD REPLY • link 6.2 years ago by Ram 44k

score 3 · Accepted Answer · 2018-09-11

6.2 years ago

rjactonspsfcf ▴ 180

There are a number of tools you could use to do that, and I'm not sure which you would prefer, but here is a solution in R using tools from the 'tidyverse':

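library(tidyverse)

# read the four-column, tab-delimited CDS table (the file has no header row)
data <- read_delim("~/Documents/tmp/cds.txt", delim = "\t",
                   col_names = c("contig", "name", "start", "end"))

# get the min start and max end value for each ID;
# summarise() orders groups by ID, so sort back into positional order afterwards
uniqueData <- data %>%
    group_by(contig, name) %>%
    summarise(start = min(start), end = max(end), .groups = "drop") %>%
    arrange(contig, start)

write.table(uniqueData, file = "~/Documents/tmp/uniqueCDS.txt", sep = "\t",
            row.names = FALSE, col.names = FALSE, quote = FALSE)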
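
If you would rather not install the tidyverse, the same min-start / max-end idea can be sketched in base R. This is only a rough sketch under a few assumptions: the input rows are pasted inline from the question instead of being read from a file, and the variable names (cds, starts, ends, merged) are made up for the illustration. It does not rely on the rows being sorted or grouped by ID.

```r
# Toy input copied from the question (columns: contig, name, start, end).
# Assumption: in practice you would read your own file with read.table() instead.
cds <- read.table(text = "
st1 PGSC0003DMC400026563 152418 152576
st1 PGSC0003DMC400026561 160499 160663
st1 PGSC0003DMC400039465 225140 225225
st1 PGSC0003DMC400039465 225786 225990
st1 PGSC0003DMC400039465 226430 226630
st1 PGSC0003DMC400039465 227247 227461
st1 PGSC0003DMC400039465 228093 228346
st1 PGSC0003DMC400039465 228815 228867
st1 PGSC0003DMC400039465 228960 229439
st1 PGSC0003DMC400039540 249208 249402
", col.names = c("contig", "name", "start", "end"), stringsAsFactors = FALSE)

# Smallest start and largest end per (contig, name) pair.
starts <- aggregate(start ~ contig + name, data = cds, FUN = min)
ends   <- aggregate(end ~ contig + name, data = cds, FUN = max)

# Combine the two summaries into one row per ID.
merged <- merge(starts, ends, by = c("contig", "name"))

# Grouping sorts by ID, so put the rows back into positional order.
merged <- merged[order(merged$contig, merged$start), ]
rownames(merged) <- NULL

print(merged)
```

With the example rows, this gives one line per ID, and PGSC0003DMC400039465 ends up spanning 225140 to 229439, as in the desired output. Note that it takes the outermost coordinates per ID on a contig, so it assumes each ID really is a single locus.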

ADD COMMENT • link 6.2 years ago by rjactonspsfcf ▴ 180

Thank you, it worked.

ADD REPLY • link 6.2 years ago by 1234anjalianjali1234 ▴ 50

If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted. You can accept more than one answer if they all work.

ADD REPLY • link 6.2 years ago by Ram 44k