How to split a string and swap the result in bash
4.6 years ago

I have a column that comes from the output of an annotation command (bcftools query -f '%POS\t%REF\t%ALT\t%BCSQ\n').

The column I'm interested in is the amino acid position and amino acid change for SNPs, like so:

402Y>402H

How can I use bash to swap the two sides around when the value contains a ">"? (Synonymous changes have no ">" and would just be 402Y, for example.)

So the result would be

402H>402Y

Thanks

text string bash split • 1.2k views
4.6 years ago
JC 13k

It would help to see some example input, but in general you can do this:

$ echo "402Y>402H" | perl -pe 's/(\d+)(\w)>(\d+)(\w)/$1$4>$3$2/'
402H>402Y
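
Since the question asks for bash, here is a minimal sketch of the same swap using only shell parameter expansion. The function name swap_aa is made up for illustration, and it assumes the value holds at most one ">". Values without a ">" (synonymous changes) are printed unchanged.

```shell
# swap_aa is a hypothetical helper name, not from any existing tool.
swap_aa() {
  case $1 in
    # ${1#*>} is everything after the first ">", ${1%%>*} everything before it.
    *">"*) printf '%s>%s\n' "${1#*>}" "${1%%>*}" ;;
    # No ">" present: leave the value as it is.
    *)     printf '%s\n' "$1" ;;
  esac
}

swap_aa '402Y>402H'   # prints 402H>402Y
swap_aa '75*>75Q'     # prints 75Q>75*
swap_aa '402Y'        # prints 402Y
```

Unlike the regex, this swaps the whole left and right sides, so it does not care which characters appear next to the ">".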

Great, thanks.

I have a couple of questions though -

1) It seems to fail where I have 'stop lost' annotations such as 75*>75Q. I presume this is because the * appears just before the >.

2) Do you know how I can do this in-place in the original file/table? For example the first two rows look like this:

123520  T   C   missense    Rv0104  Rv0104  protein_coding  +   402Y>402H   123520T>C  
199470  T   G   missense    mce1A   Rv0169  protein_coding  +   313S>313A   199470T>G

I just need to apply this change to the 9th column here. Thanks


$ echo "75*>75Q" | perl -pe 's/(\d+)(.)>(\d+)(.)/$1$4>$3$2/'

The "." matches any single character, so it also covers the "*". The command below writes the result to a new file; modifying the original file directly is not recommended.

$ perl  -pe 's/(\d+)(.)>(\d+)(.)/$1$4>$3$2/' < test.txt > out.txt
$ cat out.txt
123520  T   C   missense    Rv0104  Rv0104  protein_coding  +   402H>402Y   123520T>C
199470  T   G   missense    mce1A   Rv0169  protein_coding  +   313A>313S   199470T>G
4.6 years ago
wm ▴ 570

How about trying this command?

$ cat in.txt
123520  T       C       missense        Rv0104  Rv0104  protein_coding  +       402Y>402H       123520T>C
199470  T       G       missense        mce1A   Rv0169  protein_coding  +       313S>313A       199470T>G
199470  T       G       missense        mce1A   Rv0169  protein_coding  +       75*>75Q 199470T>G

$ awk '{OFS="\t"; split($9,a,">"); $9=a[2]">"a[1]; print}' in.txt
123520  T       C       missense        Rv0104  Rv0104  protein_coding  +       402H>402Y       123520T>C
199470  T       G       missense        mce1A   Rv0169  protein_coding  +       313A>313S       199470T>G
199470  T       G       missense        mce1A   Rv0169  protein_coding  +       75Q>75* 199470T>G
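
One caveat with the awk command above: on a synonymous row, where column 9 has no ">", a[2] is empty and the field would become ">402Y". A possible variant that only swaps when split() returns exactly two pieces is sketched below. The wrapper name swap_col9 and the synonymous demo row are made up for illustration, and the input is assumed to be tab-separated.

```shell
# swap_col9 is a hypothetical wrapper name; it reads files given as arguments, or stdin.
swap_col9() {
  awk 'BEGIN { FS = OFS = "\t" }                        # assumes tab-separated input
       split($9, a, ">") == 2 { $9 = a[2] ">" a[1] }   # swap only when column 9 has two sides
       { print }' "$@"
}

# Stop-lost row from the thread: column 9 becomes 75Q>75*.
printf '199470\tT\tG\tmissense\tmce1A\tRv0169\tprotein_coding\t+\t75*>75Q\t199470T>G\n' | swap_col9

# Made-up synonymous row: printed unchanged.
printf '1\tA\tG\tsynonymous\tgeneX\tRv0000\tprotein_coding\t+\t402Y\t1A>G\n' | swap_col9
```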