How can I get only the rows when for the converge points : - +
5
0
Entering edit mode
6.2 years ago
Lila M ★ 1.3k

Hi everybody, Maybe this is more a programming question related, but any advice could be very handy. I'm trying to select the rows when the 4th has convergence points ( - following by +) in this data sets:

chr1    1275000 1284999 +
chr1    1285000 1294999 -
chr1    1295000 1304999 -
chr1    1385000 1394999 -
chr1    1415000 1424999 -
chr1    1425000 1434999 +
chr1    1435000 1444999 +
chr1    1715000 1724999 +
chr1    1725000 1734999 -
chr1    1735000 1744999 -
chr1    1745000 1754999 -
chr1    1795000 1804999 -
chr1    1805000 1814999 +
chr1    1815000 1824999 -
chr1    1865000 1874999 -

Expected:

 chr1   1415000 1424999 -
 chr1   1425000 1434999 +
 chr1   1795000 1804999 -
 chr1   1805000 1814999 +

I would like to use R, but I have no idea how to start with it. Any suggestion? Thanks!

R convergence points • 2.0k views
ADD COMMENT
3
Entering edit mode
6.2 years ago
biocyberman ▴ 870

Haven't got an R solution yet, but these kind of manipulation I usually do with gawk/awk. Suppose your data file is in test.tsv. This is how you do it:

gawk 'BEGIN{pr=""; ps=""}(ps == "-" && $4 == "+" ) {print pr; print; pr =""; ps =""; next}{ pr = $0; ps=$4}' test.tsv    
chr1    1415000 1424999 -
chr1    1425000 1434999 +
chr1    1795000 1804999 -
chr1    1805000 1814999 +
ADD COMMENT
1
Entering edit mode

It works as expected, thank you very much. awk is very very helpful to solve this kinds of problems, but for me takes a while to learn it...

ADD REPLY
1
Entering edit mode

@Lila M: Make sure to check out the updated answer. I forgot to reset pr and ps (previous row, and previous sign) after printing. The updated version works correctly now.

ADD REPLY
0
Entering edit mode

thank you for the update!

ADD REPLY
3
Entering edit mode
6.2 years ago

You could just use awk:

$ awk -vFS="\t" -vOFS="\t" '{ if ($4 == "-") { curr = $0; flag = 1; } else if (($4 == "+") && (flag == 1)) { print curr; print $0; flag = 0; }  }' foo.bed
chr1    1415000 1424999 -
chr1    1425000 1434999 +
chr1    1795000 1804999 -
chr1    1805000 1814999 +

If you want to use R, the same logic shown above would work by iterating/looping over rows in a data frame, testing $V4 at each row, and setting curr and flag variables accordingly.

ADD COMMENT
0
Entering edit mode

Do you know any tutorial or useful documentation to learn awk? I know that should be a separate question but I hope could be permitted

ADD REPLY
1
Entering edit mode

You can find lot's of examples and other tutorial's on Grymoire.

ADD REPLY
0
Entering edit mode

Thank you! I've never heard about this toturial before.

ADD REPLY
3
Entering edit mode
6.2 years ago

Here's one way to do this with R, which creates a temporary fifth column from shifting the fourth column by a row. The fourth and fifth columns can then be compared directly and used to build a new data frame:

> d <- read.table("foo.bed", header=F)
> r <- as.numeric(row.names(d))
> d$V5 <- d[r+1, ]$V4
> r2 <- as.numeric(row.names(d[d$V4 %in% "-" & d$V5 %in% "+", ]))
> a <- rbind(d[r2, ], d[r2+1, ])
> a2 <- a[order(as.numeric(row.names(a))), 1:4]
> print(a2)
     V1      V2      V3 V4
5  chr1 1415000 1424999  -
6  chr1 1425000 1434999  +
12 chr1 1795000 1804999  -
13 chr1 1805000 1814999  +
ADD COMMENT
2
Entering edit mode
6.2 years ago

The lag and lead functions of dplyr package allows you to fetch previous or next values from a column, respectively. Here is the complete code

library(dplyr)

## Generate a random df 
set.seed(33)
df = data.frame(x=LETTERS[1:10], y=rnorm(10), z=sample(c("+", "-"), 10, replace = T))

## concatenate previous value (lag), current value, and next value (lead) of z
det = with(df, paste0(lag(z), z, lead(z)))

## check how the df looks with det
cbind(df, det)

> cbind(df, det)
   x           y z  det
1  A -0.13592452 - NA-+
2  B -0.04079697 +  -++
3  C  1.01053901 +  +++
4  D -0.15826244 +  ++-
5  E -2.15663750 -  +--
6  F  0.49864683 -  ---
7  G -0.75524431 -  --+
8  H  0.77858519 +  -+-
9  I  0.75457795 -  +-+
10 J -1.09954561 + -+NA


## you need to get all rows with  "-+" in det
df[grepl("-+", det, fixed=T), ]

   x           y z
1  A -0.13592452 -
2  B -0.04079697 +
7  G -0.75524431 -
8  H  0.77858519 +
9  I  0.75457795 -
10 J -1.09954561 +
ADD COMMENT
2
Entering edit mode
6.2 years ago
$  grep  -A 1  "-" test.txt | grep --no-group-separator -B 1 "+"
chr1    1415000 1424999 -
chr1    1425000 1434999 +
chr1    1795000 1804999 -
chr1    1805000 1814999 +

$ cat test.txt 
chr1    1275000 1284999 +
chr1    1285000 1294999 -
chr1    1295000 1304999 -
chr1    1385000 1394999 -
chr1    1415000 1424999 -
chr1    1425000 1434999 +
chr1    1435000 1444999 +
chr1    1715000 1724999 +
chr1    1725000 1734999 -
chr1    1735000 1744999 -
chr1    1745000 1754999 -
chr1    1795000 1804999 -
chr1    1805000 1814999 +
chr1    1815000 1824999 -
chr1    1865000 1874999 -
ADD COMMENT
1
Entering edit mode

in R with dataframe:

> df1=read.csv("test.txt", sep="\t", stringsAsFactors = F, strip.white = T, header = F)
> df2=df1[grep("\\+", df1$V4,value = F)-1,]
> df3=df2[grep("\\-", df2$V4,value = F),]
> df1[sort(c(as.integer(row.names(df3)),as.integer(row.names(df3))+1)),]
     V1      V2      V3 V4
5  chr1 1415000 1424999  -
6  chr1 1425000 1434999  +
12 chr1 1795000 1804999  -
13 chr1 1805000 1814999  +
ADD REPLY
1
Entering edit mode

Much simpler with pcregrep (available in most of the linux repos):

$ pcregrep -M  '\-$\n.*\+$' test1.txt
chr1    1415000 1424999 -
chr1    1425000 1434999 +
chr1    1795000 1804999 -
chr1    1805000 1814999 +

on *buntu, sudo apt install pcregrep

ADD REPLY
1
Entering edit mode

with awk:

$ awk '/+/ {if(s == "-") {print $0}} {s = $4 }' test1.txt | grep --no-group-separator -f - -B 1 test1.txt 
chr1    1415000 1424999 -
chr1    1425000 1434999 +
chr1    1795000 1804999 -
chr1    1805000 1814999 +
ADD REPLY
1
Entering edit mode

+1 for a neat and elegant solution. Probably you could add the explanation of various flags for a novice user.

ADD REPLY

Login before adding your answer.

Traffic: 3053 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6