How to split numbers and strings?
8.8 years ago
mbk0asis ▴ 700

Hi, all!

I've got a file containing CDS mutation information of cancer genes.

The data have mutation positions and sequences in normal/cancer.

c.863_864insTCTG
c.1799T>A
c.1849G>T
c.2504A>T
c.2509_2510AT>CC
c.2506_2508ATC>TTT

I want to extract positions from it, but no separator is there between numbers and text.

How can I separate them?

Thank you!

8.8 years ago
Prakki Rama ★ 2.7k

In a terminal:

grep -Po '\d+.*\d+' file.txt

Wow! It worked. Could you explain what '\d+' means in this code?

Thank you!


Never mind, I found an answer.

http://stackoverflow.com/questions/14017134/what-is-d-d-in-regex

Actually, the example above is just one column from a data set with multiple columns (~30 columns). I tried to apply your code but had no luck. I also tested it on 2-column data, and it didn't work either. I was going to 'paste' the results back onto the original data, but rows without numbers (e.g. "c.?") disappeared from the output.

How can I do it when data are composed of multiple columns?

EGFR    c.?
JAK2    c.1849G>T
BRAF    c.?
FLT3    c.?_?ins?
NPM1    c.?
KIT    c.?
BRAF    c.?
IDH1    c.395G>A
JAK2    c.1849G>T

It should work even if the data have multiple columns. The command above extracts, from each line, the text that starts with a number (\d+, one or more digits), continues with anything in between (.*), and ends with another number (\d+). Can you show what your lines look like? Then maybe we can try something else!
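To make the pattern concrete, here is a minimal sketch run on a few of the values from the question. It assumes GNU grep, since -P (Perl-style classes such as \d) is not available in every grep; the variable name positions is just for illustration.

```shell
# Sample values copied from the question, one per line.
positions=$(printf '%s\n' 'c.863_864insTCTG' 'c.1799T>A' 'c.1849G>T' 'c.2509_2510AT>CC' |
    grep -Po '\d+.*\d+')
# \d+  one or more digits
# .*   any characters, as many as possible (greedy)
# \d+  one or more digits again
# -o prints only the matched text instead of the whole line.

printf '%s\n' "$positions"
# prints:
# 863_864
# 1799
# 1849
# 2509_2510
```

Because each \d+ needs at least one digit, the pattern only matches when a line contains two or more digits, so a single-digit position would be skipped.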


Here are my test data and results. As you can see, the first column is replaced by some numbers.

I forgot to mention that I wanted to keep other columns in the output.

Thank you!

bio1@bio1:~/00-NGS/Cancer_genes/Top10_Cancers$ cat test.txt 
EGFR    c.?
JAK2    c.1849G>T
BRAF    c.?
FLT3    c.?_?ins?
NPM1    c.?
KIT    c.?
BRAF    c.?
IDH1    c.395G>A
JAK2    c.1849G>T
bio1@bio1:~/00-NGS/Cancer_genes/Top10_Cancers$ grep -Po "\d+.*\d+" test.txt
2    c.1849
1    c.395
2    c.1849

Oh, OK, I get it. In your first post you showed only one column, which is why it worked. With multiple columns, the match starts at the first number on the line, which is in the first column, and runs through to the number in the other column. The 2 is the digit in JAK2, followed by the whitespace and c.1849.

OK. Try this if all your positions are in a single column.

cut -f2 test.txt | grep -Po '\d+_?\d*'

-f2 selects column 2. Replace 2 with the column number in your file.
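To see the difference between matching on the whole line and matching on one column, here is a small sketch. It assumes the columns are separated by real tabs (which cut -f needs by default) and uses a simplified pattern, \d+_?\d*, chosen for illustration; the variable names are hypothetical.

```shell
# Two sample rows from test.txt, written with a tab between the columns (assumption).
rows=$(printf 'JAK2\tc.1849G>T\nIDH1\tc.395G>A\n')

# Whole line: the match starts at the first digit on the line, which is in the gene name.
whole_line=$(printf '%s\n' "$rows" | grep -Po '\d+.*\d+')

# Column 2 only: cut drops the gene name before grep sees the text.
col2_only=$(printf '%s\n' "$rows" | cut -f2 | grep -Po '\d+_?\d*')

printf '%s\n' "$whole_line"   # 2<TAB>c.1849 and 1<TAB>c.395
printf '%s\n' "$col2_only"    # 1849 and 395
```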


Ah, that's where the numbers in column 1 came from. I understand now.

Another question: if I cut the column, I will lose the other columns.

If I want to keep the other columns, I don't think grep will do it.

Do you have any thoughts on that?

Thank you!


Yeah. I have many tricks in my pocket!! :)

Try this (assuming all the patterns you want have c. in front):

$ grep -Po '.+c\.\d+_?\d*' test.txt | sed 's/c\.//g'
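Note that grep -o prints only lines that match, and only up to the end of the match, so rows such as "c.?" and any columns after the position are dropped. If you need every row and every column, one possible alternative is awk. This is only a sketch: it assumes tab-separated columns, the "NA" placeholder and the col variable are my own choices, and the position is appended as a new last column.

```shell
# Sample rows from the thread; the tab between columns is an assumption.
rows=$(printf 'EGFR\tc.?\nJAK2\tc.1849G>T\nFLT3\tc.?_?ins?\nIDH1\tc.395G>A\n')

# col is the 1-based number of the mutation column (hypothetical variable name).
with_pos=$(printf '%s\n' "$rows" | awk -F'\t' -v OFS='\t' -v col=2 '{
    pos = "NA"                              # placeholder when no position is given
    if (match($col, /[0-9]+(_[0-9]+)?/))    # first run of digits, optionally _digits
        pos = substr($col, RSTART, RLENGTH)
    print $0, pos                           # original row plus the new column
}')

printf '%s\n' "$with_pos"
# prints (columns separated by tabs):
# EGFR  c.?        NA
# JAK2  c.1849G>T  1849
# FLT3  c.?_?ins?  NA
# IDH1  c.395G>A   395
```

For a real file, replace the printf pipe with the file name after the awk program and set col to the mutation column.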

It worked!

I think I just overcame the biggest hurdle.

Thank you for your help! You rock!
