Split columns keep the first coordinate from start and end
6.2 years ago
1769mkc ★ 1.2k

I have a data file that I am trying to format for a Circos plot. So far I have built the data files, and the structure of my data frame is as follows:

Symbol  Chr Start   End
RBM11   hs21    14216130;14216145;14216153;14216154;14216178;14219553;14219553;14219563;14219563;14221097;14221097;14221097;14221097;14221097;14224375;14224438;14224438;14224438;14224438;14226859;14226880;14226880;14226880;14226880 14216282;14216282;14216282;14216282;14216282;14219725;14219725;14219725;14219725;14221453;14221169;14221169;14221169;14221169;14224537;14224537;14224537;14224537;14224537;14227054;14228372;14227384;14228173;14228372

What I need is the Symbol and Chr columns, plus only the first coordinate from each of the Start and End columns. I have tried various approaches but have not been able to do it.

Something like this:

Symbol  Chr Start   End
RBM11   hs21 14216130 14216282

I tried the splitstackshape library:

library(splitstackshape)

but I could not get it to work.

Is there a simple way to do this?

6.2 years ago
thomaskuilman ▴ 850

It is usually helpful to provide an example. This can be done by using the dput() function on the variable that contains your data. In this case, I have used a data.frame called test:

> dput(test)
structure(list(Symbol = "RBM11", Chr = "hs21", Start = "14216130;14216145;14216153;14216154;14216178;14219553;14219553;14219563;14219563;14221097;14221097;14221097;14221097;14221097;14224375;14224438;14224438;14224438;14224438;14226859;14226880;14226880;14226880;14226880", 
    End = "14216282;14216282;14216282;14216282;14216282;14219725;14219725;14219725;14219725;14221453;14221169;14221169;14221169;14221169;14224537;14224537;14224537;14224537;14224537;14227054;14228372;14227384;14228173;14228372"), row.names = 2L, class = "data.frame")

In this case you can get what you want using the following code:

test[, c("Start", "End")] <- lapply(test[, c("Start", "End")], function(x) {gsub(";.*", "", x)})

Resulting in

> test
  Symbol  Chr    Start      End
2  RBM11 hs21 14216130 14216282

lapply applies a function to each element of a list (here, each column of a data.frame) provided as the first argument (in this case, the columns named "Start" and "End"). The second argument is the function you would like to apply, in this case function(x) {gsub(";.*", "", x)}, which replaces the first semicolon and everything after it with nothing (effectively clipping each value after the first coordinate).
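The same idea carries over to a data.frame with several rows. The following is only a minimal sketch: the second row ("GENE2") and the shortened coordinate strings are made up for illustration, and first_coord is a hypothetical helper name. It uses sub() instead of gsub(), which is enough here because the pattern can match at most once per string, and then converts the clipped columns to numbers, which may be convenient for plotting.

```r
# Hypothetical two-row example; the "GENE2" row is invented for illustration.
test <- data.frame(
  Symbol = c("RBM11", "GENE2"),
  Chr    = c("hs21", "hs1"),
  Start  = c("14216130;14216145;14216153", "100;200"),
  End    = c("14216282;14216282;14216282", "150"),
  stringsAsFactors = FALSE
)

# Helper (hypothetical name): drop the first semicolon and everything after it.
# Strings without a semicolon are returned unchanged.
first_coord <- function(x) sub(";.*", "", x)

# Apply the helper to both coordinate columns.
test[, c("Start", "End")] <- lapply(test[, c("Start", "End")], first_coord)

# The clipped values are still character strings; convert them to numbers.
test$Start <- as.numeric(test$Start)
test$End   <- as.numeric(test$End)

print(test)
```

Note that a value with a single coordinate and no semicolon (such as the End of the made-up second row) passes through untouched.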


I was thinking of providing the dput() output, sorry about that; next time I will. Let me try it and let you know. Wonderful, it worked! I had been struggling with this for quite a while.

6.2 years ago

With sed, assuming the columns are tab-separated (the \t and \+ escapes are GNU sed extensions):

$ sed 's/\(^.*\t[0-9]\+\);.*\(\t[0-9]\+\);.*/\1\2/' test.txt

Symbol  Chr Start   End
RBM11   hs21    14216130    14216282

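With sed -r (--regexp-extended), the expression becomes a lot simpler:

sed -r 's/(^.*\t[0-9]+);.*(\t[0-9]+);.*/\1\2/' test.txt

Note that the second group should use [0-9] rather than [1-9]; otherwise any End coordinate containing a 0 would not match. OP is looking for a solution in R though, so maybe gsub() works better?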
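For comparison, here is a rough R sketch of the same file-based workflow. It assumes a tab-separated input like test.txt; to keep the example self-contained, the file contents are supplied as an inline string (txt, with shortened coordinate lists) rather than read from disk, and the variable names are only illustrative.

```r
# Inline stand-in for a tab-separated file such as test.txt
# (coordinate lists shortened for illustration).
txt <- paste(
  "Symbol\tChr\tStart\tEnd",
  "RBM11\ths21\t14216130;14216145\t14216282;14216282",
  sep = "\n"
)

# Read every column as character so the semicolon-separated values stay intact.
dat <- read.delim(text = txt, colClasses = "character")

# Keep only the text before the first semicolon in each coordinate column.
dat$Start <- sub(";.*", "", dat$Start)
dat$End   <- sub(";.*", "", dat$End)

# Print the result as tab-separated text (pass file = "..." to save it instead).
write.table(dat, sep = "\t", quote = FALSE, row.names = FALSE)
```

With a real file, read.delim("test.txt", colClasses = "character") would replace the text = txt call.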


Since I mostly use R now, I was looking for an R-based solution, but sed is absolutely fine as well. I need to learn sed to make my life a bit easier. Thanks for the clear-cut solution.
