Question

How to subtract or add 12 depending on the previous and following line using awk?

2

2.6 years ago

Rafael Soler ★ 1.3k

I have this data:

##sequence-region Q75T13 1 641
Q75T13  UniProtKB   Chain   1   641 .   .   .   ID
Q75T13  UniProtKB   Topological domain  1   60  .   .   .   Note=Cytoplasmic    
Q75T13  UniProtKB   Transmembrane   61  85  .   .   .   Note=Helical
Q75T13  UniProtKB   Topological domain  86  641 .   .   .   Note=Lumenal


##sequence-region Q9BRR3 1 403
Q9BRR3  UniProtKB   Chain   1   403 .   .   .   ID
Q9BRR3  UniProtKB   Topological domain  1   22  .   .   .   Note=Lumenal
Q9BRR3  UniProtKB   Transmembrane   23  43  .   .   .   Note=Helical
Q9BRR3  UniProtKB   Topological domain  44  259 .   .   .   Note=Cytoplasmic

##sequence-region Q96FM1 1 250
Q96FM1  UniProtKB   Topological domain  120 135 .   .   .   Note=Cytoplasmic
Q96FM1  UniProtKB   Transmembrane   136 156 .   .   .   Note=Helical
Q96FM1  UniProtKB   Topological domain  157 169 .   .   .   Note=Lumenal
Q96FM1  UniProtKB   Transmembrane   170 190 .   .   .   Note=Helical
Q96FM1  UniProtKB   Topological domain  191 250 .   .   .   Note=Lumenal

I was wondering what the awk code would look like for the following:

For rows containing the word Lumenal: if the previous row contains Transmembrane, subtract 12 from column 4 and print the Lumenal row. If the next row contains Transmembrane, add 12 to column 5 and print the Lumenal row. A Lumenal row with a Transmembrane row on both sides is therefore printed twice, once per adjustment. The final file would be:

Q75T13  UniProtKB   Topological domain  74  641 .   .   .   Note=Lumenal
Q9BRR3  UniProtKB   Topological domain  1   34  .   .   .   Note=Lumenal
Q96FM1  UniProtKB   Topological domain  145 169 .   .   .   Note=Lumenal
Q96FM1  UniProtKB   Topological domain  157 181 .   .   .   Note=Lumenal
Q96FM1  UniProtKB   Topological domain  179 250 .   .   .   Note=Lumenal

Can someone help me? I am a little bit stuck. I have been trying with awk and grep.

grep Awk bash • 1.2k views

ADD COMMENT • link updated 2.6 years ago by Matthias Zepper 5.0k • written 2.6 years ago by Rafael Soler ★ 1.3k

2

Try this in R:

library(dplyr)
library(tidyr)
library(stringr)

df %>% 
     mutate(after=lead(V3, default = "None"), before=lag(V3, default = "None")) %>% 
     filter(str_detect(V9,"Lumenal")) %>% 
     pivot_longer(cols=c("before","after"),names_to = "k",values_to = "v") %>% 
     filter(v=="Transmembrane") %>% 
     mutate(V4=ifelse(k=="before" & v =="Transmembrane", V4-12,V4), V5=ifelse(k=="after" & v=="Transmembrane" ,V5+12,V5)) %>% 
     select(-c(k,v))

 #  A tibble: 5 × 9
  V1     V2        V3                    V4    V5 V6    V7    V8    V9          
  <chr>  <chr>     <chr>              <dbl> <dbl> <chr> <chr> <chr> <chr>       
1 Q75T13 UniProtKB Topological domain    74   641 .     .     .     Note=Lumenal
2 Q9BRR3 UniProtKB Topological domain     1    34 .     .     .     Note=Lumenal
3 Q96FM1 UniProtKB Topological domain   145   169 .     .     .     Note=Lumenal
4 Q96FM1 UniProtKB Topological domain   157   181 .     .     .     Note=Lumenal
5 Q96FM1 UniProtKB Topological domain   179   250 .     .     .     Note=Lumenal

ADD REPLY • link 2.6 years ago by cpad0112 21k

1

Use Python instead of an awk script and save the code somewhere. This is not a trivial reformatting issue, and you'll revisit this exact code some time in the future; do not waste your time writing a "throw-away" script.
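
A minimal sketch of what such a script could look like, assuming the input is tab-separated with the feature type in column 3, start and end in columns 4 and 5, and the note in column 9. The function name adjust_lumenal and the SAMPLE data are illustrative only and not taken from any library; "previous" and "next" are taken to mean the physically adjacent lines, as in the question.

```python
SHIFT = 12  # assumed offset, taken from the question


def adjust_lumenal(lines, shift=SHIFT):
    """Yield adjusted Lumenal rows from an iterable of tab-separated lines."""
    # Keep every line (blank lines and ## headers included) so that index
    # i - 1 and i + 1 really are the neighbouring lines of the file.
    rows = [line.rstrip("\n").split("\t") for line in lines]

    def is_tm(i):
        # True if line i exists and its feature type (column 3) is Transmembrane.
        return 0 <= i < len(rows) and len(rows[i]) > 2 and rows[i][2] == "Transmembrane"

    for i, row in enumerate(rows):
        if len(row) < 9 or "Lumenal" not in row[8]:
            continue
        if is_tm(i - 1):
            # Helix directly before: move the start (column 4) back by `shift`.
            yield "\t".join(row[:3] + [str(int(row[3]) - shift)] + row[4:])
        if is_tm(i + 1):
            # Helix directly after: move the end (column 5) forward by `shift`.
            yield "\t".join(row[:4] + [str(int(row[4]) + shift)] + row[5:])


# Illustrative input: the Q96FM1 block from the question, tab-separated.
SAMPLE = (
    "##sequence-region Q96FM1 1 250\n"
    "Q96FM1\tUniProtKB\tTopological domain\t120\t135\t.\t.\t.\tNote=Cytoplasmic\n"
    "Q96FM1\tUniProtKB\tTransmembrane\t136\t156\t.\t.\t.\tNote=Helical\n"
    "Q96FM1\tUniProtKB\tTopological domain\t157\t169\t.\t.\t.\tNote=Lumenal\n"
    "Q96FM1\tUniProtKB\tTransmembrane\t170\t190\t.\t.\t.\tNote=Helical\n"
    "Q96FM1\tUniProtKB\tTopological domain\t191\t250\t.\t.\t.\tNote=Lumenal\n"
)

for out_line in adjust_lumenal(SAMPLE.splitlines()):
    print(out_line)
```

For a real file, passing an open file handle instead of SAMPLE.splitlines() should work the same way, since the function only iterates over lines.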

ADD REPLY • link 2.6 years ago by Ram 44k

0

This is the version in csv:

##sequence-region Q75T13 1 641
Q75T13,UniProtKB,Chain,1,641,.,.,.,ID
Q75T13,UniProtKB,Topological domain,1,60,.,.,.,Note=Cytoplasmic
Q75T13,UniProtKB,Transmembrane,61,85,.,.,.,Note=Helical
Q75T13,UniProtKB,Topological domain,86,641,.,.,.,Note=Lumenal


##sequence-region Q9BRR3 1 403
Q9BRR3,UniProtKB,Chain,1,403,.,.,.,ID
Q9BRR3,UniProtKB,Topological domain,1,22,.,.,.,Note=Lumenal
Q9BRR3,UniProtKB,Transmembrane,23,43,.,.,.,Note=Helical
Q9BRR3,UniProtKB,Topological domain,44,259,.,.,.,Note=Cytoplasmic

##sequence-region Q96FM1 1 250
Q96FM1,UniProtKB,Topological domain,120,135,.,.,.,Note=Cytoplasmic
Q96FM1,UniProtKB,Transmembrane,136,156,.,.,.,Note=Helical
Q96FM1,UniProtKB,Topological domain,157,169,.,.,.,Note=Lumenal
Q96FM1,UniProtKB,Transmembrane,170,190,.,.,.,Note=Helical
Q96FM1,UniProtKB,Topological domain,191,250,.,.,.,Note=Lumenal

And the output:

Q75T13,UniProtKB,Topological domain,74,641,.,.,.,Note=Lumenal
Q9BRR3,UniProtKB,Topological domain,1,34,.,.,.,Note=Lumenal
Q96FM1,UniProtKB,Topological domain,145,169,.,.,.,Note=Lumenal
Q96FM1,UniProtKB,Topological domain,157,181,.,.,.,Note=Lumenal
Q96FM1,UniProtKB,Topological domain,179,250,.,.,.,Note=Lumenal

ADD REPLY • link 2.6 years ago by Rafael Soler ★ 1.3k

0

@ Rafael Soler

Why did you delete the post?

ADD REPLY • link 2.6 years ago by Ram 44k

score 3 · Accepted Answer · 2022-05-02

3

2.6 years ago

Matthias Zepper 5.0k

Since awk processes a file line by line, looking ahead to the following line is awkward in a single pass. The common NR==FNR two-pass idiom helps, though:

awk -F "\t" 'NR==FNR { array[FNR]=$3; next };{if (array[FNR-1]=="Transmembrane") $4=$4-12};{if (array[FNR+1]=="Transmembrane") $5=$5+12};/Lumenal/' uniprot.txt uniprot.txt

The file is given twice on the command line. During the first read (where NR==FNR holds), the value of column $3 is stored in the array under the line number FNR (NR would work too, since both are equal in the first pass). During the second read, the entries for the lines before and after are checked for "Transmembrane" and columns 4 and 5 are altered accordingly. The trailing /Lumenal/ pattern has no action block, so awk applies its default action, print $0, to the matching lines only.
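
For readers less familiar with the idiom, the two-pass lookup can be sketched in Python. This only illustrates the lookup step, not the adjustment; the function name neighbour_flags is hypothetical, and a list of lines iterated twice stands in for awk reading the same file twice.

```python
def neighbour_flags(lines):
    """Return (line_number, tm_before, tm_after) for every Lumenal line."""
    # Pass 1: remember column 3 of every line under its 1-based line number,
    # the counterpart of array[FNR]=$3 in the awk command.
    feature = {}
    for n, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split("\t")
        feature[n] = fields[2] if len(fields) > 2 else ""

    # Pass 2: for each Lumenal line, look up the stored neighbours.
    # dict.get returns None for a missing line number, much like an unset
    # awk array element compares unequal to "Transmembrane".
    flags = []
    for n, line in enumerate(lines, start=1):
        if "Lumenal" in line:
            flags.append((
                n,
                feature.get(n - 1) == "Transmembrane",
                feature.get(n + 1) == "Transmembrane",
            ))
    return flags
```

A line flagged (True, True) has a Transmembrane entry on both sides, which is the case that needs special care when the columns are modified.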

ADD COMMENT • link 2.6 years ago by Matthias Zepper 5.0k

0

Programmatically the output is correct, but it is not what the OP wants. The catch is that some Lumenal rows have "Transmembrane" both above and below (the third Lumenal row in this case). The current awk code applies both changes to the same line instead of printing one line per change, so it outputs only 4 lines. It should output 5 lines as per the OP.

$ awk -F "\t" 'NR==FNR { array[FNR]=$3; next };{if (array[FNR-1]=="Transmembrane") $4=$4-12};{if (array[FNR+1]=="Transmembrane") $5=$5+12};/Lumenal/' test.txt test.txt

Q75T13 UniProtKB Topological domain 74 641 . . . Note=Lumenal
Q9BRR3 UniProtKB Topological domain 1 34 . . . Note=Lumenal
Q96FM1 UniProtKB Topological domain 145 181 . . . Note=Lumenal
Q96FM1 UniProtKB Topological domain 179 250 . . . Note=Lumenal

line:

Q96FM1 UniProtKB Topological domain 145 181 . . . Note=Lumenal

must be broken down to:

3 Q96FM1 UniProtKB Topological domain   145   169 .     .     .     Note=Lumenal
4 Q96FM1 UniProtKB Topological domain   157   181 .     .     .     Note=Lumenal

as the line satisfies both the conditions.
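
The difference between merging and splitting can be made concrete with a small sketch. The helper name expand is hypothetical and the shift of 12 is taken from the question; it returns one (start, end) pair per satisfied condition instead of applying both changes to the same row.

```python
def expand(start, end, tm_before, tm_after, shift=12):
    """Return one (start, end) pair per neighbouring Transmembrane row."""
    rows = []
    if tm_before:
        # Only the start moves; the end keeps its original value.
        rows.append((start - shift, end))
    if tm_after:
        # Only the end moves; the start keeps its original value.
        rows.append((start, end + shift))
    return rows


# Row 157..169 has a helix on both sides, so it yields two rows
# rather than the single merged row (145, 181).
print(expand(157, 169, tm_before=True, tm_after=True))
```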

ADD REPLY • link 2.6 years ago by cpad0112 21k

1

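True, I overlooked this important detail. The code should then look like this (the tab field separator matters because "Topological domain" contains a space, and column 4 is restored after the first print so that the second row keeps its original start):

awk 'BEGIN { FS=OFS="\t" } NR==FNR { array[FNR]=$3; next } /Lumenal/ { start=$4; printed=0; if (array[FNR-1]=="Transmembrane") { $4=start-12; print; $4=start; printed=1 }; if (array[FNR+1]=="Transmembrane") { $5=$5+12; print; printed=1 }; if (!printed) print }' uniprot.txt uniprot.txt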

In this case, Lumenal entries that have no preceding or following Transmembrane entry are also printed, unchanged. If such entries should be skipped instead, it is even simpler:

awk 'BEGIN { FS=OFS="\t" } NR==FNR { array[FNR]=$3; next } /Lumenal/ { start=$4; if (array[FNR-1]=="Transmembrane") { $4=start-12; print; $4=start }; if (array[FNR+1]=="Transmembrane") { $5=$5+12; print } }' uniprot.txt uniprot.txt

ADD REPLY • link 2.6 years ago by Matthias Zepper 5.0k