Question

Splitting string in a column using character

0

5.8 years ago

sofie_carolina ▴ 30

I'm trying to split the values in a column into two parts, using a specific string as the split point. Most examples I can find online split on a single-character delimiter, but I want a short string (two letters) to act as the separator. Is awk or sed recommended for this? Example:

Col1
BOT-rs10136766
BOT-rs104894363
BOT-rs10774624
BOT-rs111647200
GSA-rs117306900
GSA-rs117306950
GSA-rs117306954
GSA-rs117306975
GSA-rs117306989
BOT-seq-rs532891158.1
BOT-seq-rs794728599
DUP-rs121913344
DUP-rs12979860
DUP-seq-rs397518008
DUP-seq-rs397518039
rs6837175
rs6837180
rs6837215
rs6837250
seq-rs794727444.1
seq-rs794727773.1
seq-rs794728252.1
seq-rs794728252.2

Here, I want to extract only the rsID (rs followed by a numeric ID), separated from the prefixes.

SNP regex awk sed • 1.6k views

Updated 5.8 years ago by zx8754 12k • written 5.8 years ago by sofie_carolina ▴ 30

1

sed 's/.*\(rs\w\+\).*/\1/g' test.txt
Col1
rs10136766
rs104894363
rs10774624
rs111647200
rs117306900
rs117306950
rs117306954
rs117306975
rs117306989
rs532891158
rs794728599
rs121913344
rs12979860
rs397518008
rs397518039
rs6837175
rs6837180
rs6837215
rs6837250
rs794727444
rs794727773
rs794728252
rs794728252

5.8 years ago by cpad0112 21k
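
The sed command above relies on GNU extensions (\w and \+) and leaves lines without an rsID, such as the Col1 header, untouched. A minimal sketch of a more portable variant, assuming a sed that supports -E (GNU and BSD sed both do); the function name extract_rsid is hypothetical and used only for illustration:

```shell
# Hypothetical helper: print only the rsID found on each line.
# Reads the files given as arguments, or stdin when none are given.
extract_rsid() {
  # -n suppresses the default output; the trailing p prints a line only
  # when the substitution succeeded, so lines without an rsID are skipped.
  # [0-9]+ stops at the first non-digit, which drops suffixes like .1 or .2.
  sed -nE 's/.*(rs[0-9]+).*/\1/p' "$@"
}

printf '%s\n' Col1 BOT-rs10136766 BOT-seq-rs532891158.1 rs6837175 | extract_rsid
```

On this sample it should print the three rsIDs without prefixes or the .1 suffix, and skip the Col1 header line.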

0

Maybe move to answer?

5.8 years ago by zx8754 12k

0

Guessing from the .1, .2 suffixes, is this the output of an R script?

5.8 years ago by zx8754 12k

score 2 · Answer 1 · 2019-03-14

2

5.8 years ago

lakhujanivijay 5.9k

grep -oP 'rs\d+(\.\d+)?' test.txt

where test.txt is the file containing the IDs listed above.

Output:

rs10136766
rs104894363
rs10774624
rs111647200
rs117306900
rs117306950
rs117306954
rs117306975
rs117306989
rs532891158.1
rs794728599
rs121913344
rs12979860
rs397518008
rs397518039
rs6837175
rs6837180
rs6837215
rs6837250
rs794727444.1
rs794727773.1
rs794728252.1
rs794728252.2

5.8 years ago by lakhujanivijay 5.9k

0

How do I specify the column here, for example if I want column 1? Also, I don't need the digits after the decimal point: some rsIDs end in .1, .2 and so on, and I don't need those. Can both of these be handled in your script?

5.8 years ago by sofie_carolina ▴ 30

0

Can you paste an example of how your result should look?

5.8 years ago by lakhujanivijay 5.9k

0

I think they just want rsXXX: drop the prefix (everything up to and including the last dash) and the suffix (everything from the dot onward).

5.8 years ago by zx8754 12k
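
The rule described in the comment above (strip through the last dash, then strip from the dot onward) can be sketched without any regex, using POSIX shell parameter expansion. This is only an illustration: the function name strip_affixes is hypothetical, and the final check that the result starts with rs followed by a digit is an added assumption so that header lines are skipped.

```shell
# Hypothetical helper: apply the prefix/suffix rule to each line of stdin.
strip_affixes() {
  while IFS= read -r line; do
    id=${line##*-}   # drop everything up to and including the last dash
    id=${id%%.*}     # drop everything from the first dot onward
    case $id in
      rs[0-9]*) printf '%s\n' "$id" ;;   # keep only values that look like rsIDs
    esac
  done
}

printf '%s\n' Col1 DUP-seq-rs397518008 seq-rs794728252.2 rs6837250 | strip_affixes
```

A shell loop like this is slow on large files, so for anything beyond a quick check a sed or awk one-liner is probably the better choice.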

0

$ grep -Po '(?<=^|-)rs\w*' test.txt
rs10136766
rs104894363
rs10774624
rs111647200
rs117306900
rs117306950
rs117306954
rs117306975
rs117306989
rs532891158
rs794728599
rs121913344
rs12979860
rs397518008
rs397518039
rs6837175
rs6837180
rs6837215
rs6837250
rs794727444
rs794727773
rs794728252
rs794728252

5.8 years ago by cpad0112 21k

0

Try this:

grep -P 'rs\d+' test.txt -o

5.8 years ago by lakhujanivijay 5.9k
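
The -P (PCRE) option is specific to GNU grep and may be missing elsewhere. A minimal sketch of the same idea with extended regular expressions, assuming a grep that supports -o (GNU and BSD grep both do); the wrapper name rsid_grep is hypothetical:

```shell
# Hypothetical wrapper: print every rsID match, one per line.
# Reads the files given as arguments, or stdin when none are given.
rsid_grep() {
  # -o prints only the matched text; -E selects extended regex syntax,
  # where [0-9]+ plays the role of the PCRE-only \d+.
  # The match ends at the first non-digit, so .1 and .2 suffixes are dropped.
  grep -oE 'rs[0-9]+' "$@"
}

printf '%s\n' Col1 GSA-rs117306900 seq-rs794728252.2 | rsid_grep
```

Lines with no match, such as the Col1 header, produce no output.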

0

My rsIDs are in column 2. Where do I specify the column in this script?

5.7 years ago by sofie_carolina ▴ 30

0

I don't want to write the rsIDs to another file. I want the output printed in the same column. Rows where no rs is found should not be printed; they should be omitted.

5.7 years ago by sofie_carolina ▴ 30
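
The grep commands in this thread work on whole lines and cannot target one column. One possible approach for the last two comments is awk, sketched below. The assumptions are mine, not from the thread: the file is tab-separated, the column number is passed as an argument, and the function name rsid_in_column is hypothetical.

```shell
# Hypothetical helper: rewrite one column of tab-separated stdin so that it
# holds only the rsID, keeping the other columns as they are.
# Usage: rsid_in_column COLUMN_NUMBER < input.tsv
rsid_in_column() {
  awk -v col="$1" 'BEGIN { FS = OFS = "\t" }
    # match() sets RSTART and RLENGTH to the position and length of the first
    # rs<digits> run; rows where the column has no such run are not printed.
    match($col, /rs[0-9]+/) {
      $col = substr($col, RSTART, RLENGTH)
      print
    }'
}

printf 'chr1\tBOT-seq-rs532891158.1\tA\nchr2\tnoid\tC\nchr3\trs6837175\tG\n' | rsid_in_column 2
```

Here the second row is dropped because its column 2 has no rsID, and the other rows keep columns 1 and 3 unchanged. A header row would be dropped as well unless its column happens to match, so it would need separate handling if it has to be kept.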