Question

How to split numbers and strings?

0

8.8 years ago

mbk0asis ▴ 700

Hi, all!

I've got a file containing CDS mutation information of cancer genes.

The data have mutation positions and sequences in normal/cancer.

c.863_864insTCTG
c.1799T>A
c.1849G>T
c.2504A>T
c.2509_2510AT>CC
c.2506_2508ATC>TTT

I want to extract the positions, but there is no separator between the numbers and the text.

How can I separate them?

Thank you!

split • 2.5k views

ADD COMMENT • link updated 2.3 years ago by Ram 44k • written 8.8 years ago by mbk0asis ▴ 700

Ram · Accepted Answer · 2016-01-21

3

8.8 years ago

Prakki Rama ★ 2.7k

In terminal

cat file.txt | grep -Po "\d+.*\d+"

ADD COMMENT • link 8.8 years ago by Prakki Rama ★ 2.7k

0

Wow! It worked. Would you explain what the '\d+' means in this code?

Thank you!

ADD REPLY • link 8.8 years ago by mbk0asis ▴ 700

0

Never mind, I found an answer.

http://stackoverflow.com/questions/14017134/what-is-d-d-in-regex

Actually, the example above is one column from a dataset with multiple columns (~30 columns). I tried to apply your code but had no luck. I tested it on two-column data, but it didn't work. I was going to 'paste' the results onto the original data, but rows without numbers (e.g. "c.?") disappeared from the output.

How can I do it when data are composed of multiple columns?

EGFR    c.?
JAK2    c.1849G>T
BRAF    c.?
FLT3    c.?_?ins?
NPM1    c.?
KIT    c.?
BRAF    c.?
IDH1    c.395G>A
JAK2    c.1849G>T

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 8.8 years ago by mbk0asis ▴ 700

0

It should work even if the data has multiple columns. The command above extracts one pattern from each line: a number (represented by \d+), anything in between (represented by .*), and then another number (\d+). Can you show what your data looks like? Then maybe we can try something else!
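To make the explanation above concrete, here is a small sketch that runs the same pattern on three values from the original question. It assumes GNU grep with -P support; the variable names are only for illustration. Note that the pattern needs at least two digits in total, so a one-digit position would not match.

```shell
# Three values copied from the question, one per line.
values=$(printf '%s\n' 'c.863_864insTCTG' 'c.1799T>A' 'c.2509_2510AT>CC')

# -o prints only the matched text; -P enables Perl-style classes such as \d.
# \d+ grabs digits, .* grabs anything, and the final \d+ forces the match
# to end on a digit, so trailing letters are dropped.
positions=$(printf '%s\n' "$values" | grep -Po '\d+.*\d+')

printf '%s\n' "$positions"
```

With these inputs the output should be 863_864, 1799 and 2509_2510, one per line.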

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 8.8 years ago by Prakki Rama ★ 2.7k

0

Here are my test data and results. As you can see, the first column is replaced by some numbers.

I forgot to mention that I wanted to keep other columns in the output.

Thank you!

bio1@bio1:~/00-NGS/Cancer_genes/Top10_Cancers$ cat test.txt 
EGFR    c.?
JAK2    c.1849G>T
BRAF    c.?
FLT3    c.?_?ins?
NPM1    c.?
KIT    c.?
BRAF    c.?
IDH1    c.395G>A
JAK2    c.1849G>T
bio1@bio1:~/00-NGS/Cancer_genes/Top10_Cancers$ grep -Po "\d+.*\d+" test.txt
2    c.1849
1    c.395
2    c.1849

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 8.8 years ago by mbk0asis ▴ 700

0

Oh, OK. I got it. In your first post, you gave me only one column; that's why it worked. With multiple columns, it prints the digit found in the first column together with the other column. That number 2 is the 2 in JAK2, followed by the column separator and c.1849.
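A minimal sketch of that effect, using three rows from the test data. The tab separator is an assumption (the thread only shows whitespace), and GNU grep with -P is assumed as well.

```shell
# Three rows from the test data; tab-separated by assumption.
rows=$(printf 'EGFR\tc.?\nJAK2\tc.1849G>T\nIDH1\tc.395G>A\n')

# Matching against whole lines: the first digit grep can find is the one
# at the end of the gene name (the 2 in JAK2, the 1 in IDH1), so the match
# starts there and runs on to the last digit of the position.
leaked=$(printf '%s\n' "$rows" | grep -Po '\d+.*\d+')

printf '%s\n' "$leaked"
```

The EGFR row has no digits at all, so it produces no output line.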

OK. Try this if your positions are all in one column.

cut -f2 test.txt | grep -Po '\d+_?\d*'

-f2 selects column 2. Replace 2 with the column number in your file.
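As a quick check of the column-first idea, the sketch below cuts out column 2 before matching. It assumes tab-separated input (the default delimiter for cut) and uses the simple pattern \d+_?\d* (digits, an optional underscore, then optional digits), which is my reading of the intent here rather than a confirmed command.

```shell
# Rows from the test data; tab-separated by assumption.
rows=$(printf 'JAK2\tc.1849G>T\nFLT3\tc.?_?ins?\nIDH1\tc.395G>A\n')

# cut -f2 keeps only the mutation column, so digits in gene names
# never reach grep. Rows without digits (FLT3 here) print nothing.
col2_positions=$(printf '%s\n' "$rows" | cut -f2 | grep -Po '\d+_?\d*')

printf '%s\n' "$col2_positions"
```

The trade-off is the one raised in the next reply: the other columns are gone from the output.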

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 8.8 years ago by Prakki Rama ★ 2.7k

0

Ah. That's where the numbers in column 1 came from. I understood.

Another question: if I cut the column, I will lose the other columns.

If I want to keep the other columns, I don't think grep will do it.

Do you have any thoughts on that?

Thank you!

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 8.8 years ago by mbk0asis ▴ 700

1

Yeah. I have many tricks in my pocket!! :)

Try this (assuming all the patterns you want have c. in front):

$ grep -Po '.+c\.\d+_?\d*' test.txt | sed 's/c\.//'
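If the rows without a position (e.g. "c.?") also need to survive, one alternative is awk instead of grep. This is only a sketch of a different technique, not the command above: it assumes tab-separated input with the mutation in column 2, and the "NA" placeholder and variable names are made up for illustration.

```shell
# Rows from the test data; tab-separated by assumption.
rows=$(printf 'EGFR\tc.?\nJAK2\tc.1849G>T\nFLT3\tc.?_?ins?\nIDH1\tc.395G>A\n')

# For every row, look in column 2 for the first run of digits, optionally
# followed by an underscore and more digits. Append it as a new last column.
# Rows with no digits are kept and get the placeholder NA.
with_pos=$(printf '%s\n' "$rows" | awk -F'\t' -v OFS='\t' '{
    pos = "NA"
    if (match($2, /[0-9]+(_[0-9]+)?/)) pos = substr($2, RSTART, RLENGTH)
    print $0, pos
}')

printf '%s\n' "$with_pos"
```

Every input row appears in the output with its original columns intact, so nothing has to be pasted back onto the original file afterwards. For a real file, change $2 to the column that holds the mutation.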

ADD REPLY • link 8.8 years ago by Prakki Rama ★ 2.7k

0

It worked!

I think I just overcame the biggest hurdle.

Thank you for your help! You rock!

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 8.8 years ago by mbk0asis ▴ 700