Question

Extract rows where 21 lies between the two column values

0

9.3 years ago

waqasnayab ▴ 250

Hi,

I have a space separated file like this:

start   end
23  36
15  34
7   15
6   15
6   25
21  29
34  41
23  39
22  28
21  29

I want only those lines where the value 21 lies between the two columns. The desired output would be:

I searched and found examples that filter on a column value being greater or less than some number, but not a scenario like this.

Any help appreciated.

Waqas.

bed sequence • 2.2k views

ADD COMMENT • link updated 2.4 years ago by Ram 45k • written 9.3 years ago by waqasnayab ▴ 250

0

What do you mean by "21 in between"? Since you list the interval "6 15" in your desired output I don't understand it.

ADD REPLY • link 9.3 years ago by stianlagstad ★ 1.1k

1

I changed the formatting. I guess the OP wants to return the intervals that contain a given position. However, [6,15] doesn't contain 21, so the example is wrong. Otherwise this looks like a simple case of interval arithmetic. What is the biological application? It determines which method is best.

ADD REPLY • link 9.3 years ago by Michael 56k

0

In case you need to search for different locations more than once in a large set of intervals: What Is The Quickest Algorithm For Range Overlap?

ADD REPLY • link 9.3 years ago by Michael 56k

score 3 · Answer 1 · 2016-04-20

If your (tab- or space-delimited) text file does not have a header row:

$ awk '($1 < 21) && ($2 > 21)' data.txt > answer.txt

If your text file has a header row ("start" and "end"):

$ tail -n +2 data.txt | awk '($1 < 21) && ($2 > 21)' > answer.txt
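To check the two commands above against the sample data in the question, here is a minimal sketch. The variable name pos and the temporary file are assumptions made for illustration, and NR > 1 is used in place of tail to skip the header. It also shows an inclusive variant, since the question does not say whether an interval that starts or ends exactly at 21 should count.

```shell
# Recreate the sample data from the question (whitespace-separated, with header).
data=$(mktemp)
printf '%s\n' 'start end' '23 36' '15 34' '7 15' '6 15' '6 25' \
    '21 29' '34 41' '23 39' '22 28' '21 29' > "$data"

# Strict test, as in the answer: start < pos and end > pos.
# pos is a hypothetical variable name; NR > 1 skips the header row.
strict=$(awk -v pos=21 'NR > 1 && $1 < pos && $2 > pos' "$data")

# Inclusive variant: also keeps intervals that start or end exactly at pos.
inclusive=$(awk -v pos=21 'NR > 1 && $1 <= pos && $2 >= pos' "$data")

printf 'strict:\n%s\n\ninclusive:\n%s\n' "$strict" "$inclusive"
rm -f "$data"
```

On this data the strict test keeps "15 34" and "6 25"; the inclusive test additionally keeps both "21 29" rows.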

Since you tagged your question with the BED tag, if you're working with a BED file, there is a faster way to do this.

The BEDOPS bedextract tool can do a binary or O(log n) search over a sorted BED file, for instance, whereas a simplistic use of awk (such as the ones I wrote above) will read through the entire file, which is a linear or O(n) search. For large BED files, if sorted, a linear scan is a waste of time.

For example, a search for position 21 along a hypothetical chromosome chrN is much faster this way:

$ echo -e "chrN\t21\t22" | bedextract query.bed - > answer.bed

The file answer.bed contains the elements from query.bed that overlap ("contain") position 21, that is, the half-open genomic region [21, 22), along chromosome chrN.
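For a small file, or to sanity-check what "overlap" means with half-open coordinates, the same test can be written as a plain linear awk scan. This is only a sketch of the overlap condition, not how bedextract works internally; the BED rows and the variable names chr, s and e are made up for illustration.

```shell
# Hypothetical three-column BED file, invented for this example.
bed=$(mktemp)
printf 'chrM\t6\t25\nchrN\t6\t25\nchrN\t15\t34\nchrN\t21\t29\nchrN\t22\t28\n' > "$bed"

# Two half-open intervals [a, b) and [s, e) overlap when a < e and b > s.
# Here [s, e) = [21, 22) on chrN, i.e. the single position 21.
hits=$(awk -v chr=chrN -v s=21 -v e=22 '$1 == chr && $2 < e && $3 > s' "$bed")

printf '%s\n' "$hits"
rm -f "$bed"
```

Note that with half-open coordinates the element starting at 21 is reported, whereas the strict `$1 < 21` test on the two-column file would skip it; the element starting at 22 and the chrM element are not reported.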

score 0 · Answer 2 · 2016-04-20

0

9.3 years ago

karl.stamm 4.1k

You're looking for the AND function, and a basic primer on programming concepts.

In R, the code would look like

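   x <- read.table("data.txt", header = TRUE)  # assumes the file from the question, with its header row
   y <- x[x[, 1] < 21 & x[, 2] > 21, ]         # keep rows where start < 21 and end > 21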
