Question

How to get only rows with values in at least 3 columns?

0

Entering edit mode

9.2 years ago

MatTesla • 0

Hi all,

I am sure it is a noob question but have some data from different replicates but in some cases I could measure values only in some cases.

How can I select only rows that have values different from zero in at least 3 columns (replicates)?

Unfortunately I am not sure how to do this yet.

An example of the data:

Cbx5                    0   19    0    11
Vcl                     0   19    0    0
U5-ENSMUSG00000064430   0   18    74   168
Tmbim6                  0   18    7    5
Prcp                    0   18    2    0
Ncl                     0   18    0    1
U4-ENSMUSG00000064606   0   18    0    0
AY172335.5              0   170   38   98
Mest                    0   17    5    10
Cdh1                    0   17    4    6
Spint2                  0   17    2    8
Uggt1                   0   17    2    0
Acly                    0   17    2    0

Thank you for your help.

gene ChIP-Seq • 10k views

ADD COMMENT • link updated 2.3 years ago by Ram 44k • written 9.2 years ago by MatTesla • 0

Ram · Answer 1 · 2015-10-28

5

Entering edit mode

9.2 years ago

matted 7.8k

I'll give a Python answer (using Pandas):

import pandas
counts = pandas.read_csv("input.tsv", sep="\t", index_col=0)
counts = counts[(counts > 0).sum(axis=1) >= 3]
counts.to_csv("output.tsv", sep="\t", header=False)

ADD COMMENT • link updated 2.3 years ago by Ram 44k • written 9.2 years ago by matted 7.8k

Ram · Answer 2 · 2015-10-28

4

Entering edit mode

9.2 years ago

5heikki 11k

Assuming tab-separated values:

awk -F '\t' '{count=0} {for(i=2; i<6; i++) if($i==0) count++} {if(count<2) print $0}' file

ADD COMMENT • link updated 5.1 years ago by Ram 44k • written 9.2 years ago by 5heikki 11k

1

Entering edit mode

Nice go at a one-liner solution, but I think this will give a different answer, it will print if fewer than three of the replicates have a '0' value.

To find only rows with at least three replicates that are non-zero, this awk command should work:

awk 'BEGIN{FS="\t";OFS="\t"};{count=0;for(i=2; i<6; i++) {if($i!=0) {count++}};if (count>=3){print $0}}' boogens.txt

ADD REPLY • link updated 5.1 years ago by Ram 44k • written 9.2 years ago by Joseph Pearson ▴ 480

0

Entering edit mode

OP asked "values different from zero in at least 3 columns"

My solution counts zeros, if there are less than 2 (e.g. at least 3 fields have non-zero value), the line is printed.

Your solution counts non-zeros. If there are 3 or more, the line is printed.

I think with OP's data we arrive at the same exact output with both strategies (now that I edited the answer).

ADD REPLY • link updated 5.1 years ago by Ram 44k • written 9.2 years ago by 5heikki 11k

1

Entering edit mode

There are four data columns, so less-than-three could be two, which means that only two have non-zero values. I'm assuming (and you did too in your for loop) that OP didn't mean to count the label column as a non-zero.

ADD REPLY • link 9.2 years ago by Joseph Pearson ▴ 480

0

Entering edit mode

Haha, I'm too tired. You're right. I fixed it :)

ADD REPLY • link 9.2 years ago by 5heikki 11k

Ram · Answer 3 · 2015-10-28

3

Entering edit mode

9.2 years ago

dariober 15k

In R? Try:

dat<- data.frame(x= c(1, 0, 0, 0),
                 y= c(1, 0, 0, 1),
                 z= c(1, 0, 0, 1),
                 w= c(0, 1, 1, 1))
dat
  x y z w
1 1 1 1 0
2 0 0 0 1
3 0 0 0 1
4 0 1 1 1
cntNonZero<- apply(dat, 1, function(x) sum(x != 0))
dat[which(cntNonZero >=3),]
  x y z w
1 1 1 1 0
4 0 1 1 1

ADD COMMENT • link updated 2.3 years ago by Ram 44k • written 9.2 years ago by dariober 15k

4

Entering edit mode

dat[rowSums( dat != 0 ) >= 3,] #;)

ADD REPLY • link updated 5.1 years ago by Ram 44k • written 9.2 years ago by Michael 55k