Question

Filter columns by group and condition

20 months ago

Sissi ▴ 60

Hi there,

I have a seemingly easy task that I still can't figure out. I have a binary matrix in a CSV file, with genes as rows and samples as columns, like this:

Gene	sampleA	sampleB	sampleC	sampleD	sampleE	sampleF	sampleG
gene1	1	0	0	1	0	0	0
gene2	0	0	0	0	1	1	0
gene3	0	0	0	0	0	0	1
gene4	0	1	0	0	0	0	0
gene5	1	1	1	1	0	0	0
gene6	1	1	1	1	0	0	0
gene8	0	0	0	0	0	0	1
gene9	0	0	0	0	0	0	0
gene10	1	0	0	1	0	0	0
gene11	0	0	0	0	1	1	1
gene12	0	0	0	0	1	1	1
gene13	0	0	0	0	0	0	0
gene14	0	0	0	0	0	1	0
gene16	1	0	0	0	0	0	0
gene17	1	0	0	1	0	0	0
gene18	1	0	0	1	0	0	0
gene19	1	0	0	1	0	0	0
gene20	1	0	0	1	0	0	0

The samples belong to specific clusters, like:

cluster1 = c("sampleA", "sampleB", "sampleC", "sampleD")
cluster2 = c("sampleE", "sampleF", "sampleG")

I would like to filter the rows (genes), keeping only the genes that are present in all samples of one cluster and in none of the other, like this:

Gene	sampleA	sampleB	sampleC	sampleD	sampleE	sampleF	sampleG
gene5	1	1	1	1	0	0	0
gene6	1	1	1	1	0	0	0
gene11	0	0	0	0	1	1	1
gene12	0	0	0	0	1	1	1

That way I can see which genes are present in only one of the two clusters.

Is there an easy way to do this with R or bash?

Thanks

filter cluster • 942 views


Add a column for cluster, then group by it and sum across all columns except the gene and cluster columns. That's the general approach I'd take; you should be able to figure out the specific dplyr functions with that outline in mind.

20 months ago by Ram 44k

Hi Ram, thanks. But then the cluster column would refer to the gene column and not to the samples.

20 months ago by Sissi ▴ 60

You need to convert your data to long format first using e.g. pivot_longer from tidyr.

20 months ago by rpolicastro 13k
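
The two comments above (reshape to long format, attach a cluster label, then group and sum) could be combined roughly as follows. This is only a sketch, assuming dplyr and tidyr are installed; the names `df`, `clusters`, `n_present`, `n_samples` and `keep` are made up for illustration, and `df` holds just four genes copied from the question.

```r
library(dplyr)
library(tidyr)

# A few rows of the matrix from the question (illustrative subset)
df <- data.frame(
  Gene    = c("gene1", "gene2", "gene5", "gene11"),
  sampleA = c(1, 0, 1, 0),
  sampleB = c(0, 0, 1, 0),
  sampleC = c(0, 0, 1, 0),
  sampleD = c(1, 0, 1, 0),
  sampleE = c(0, 1, 0, 1),
  sampleF = c(0, 1, 0, 1),
  sampleG = c(0, 0, 0, 1)
)

# Sample-to-cluster lookup, using the cluster membership given in the question
clusters <- data.frame(
  sample  = paste0("sample", LETTERS[1:7]),
  cluster = rep(c("cluster1", "cluster2"), times = c(4, 3))
)

keep <- df %>%
  # one row per gene/sample pair, so the cluster label refers to the sample
  pivot_longer(-Gene, names_to = "sample", values_to = "present") %>%
  left_join(clusters, by = "sample") %>%
  # count, per gene and cluster, how many samples carry the gene
  group_by(Gene, cluster) %>%
  summarise(n_present = sum(present), n_samples = n(), .groups = "drop") %>%
  group_by(Gene) %>%
  # keep genes found in every sample of exactly one cluster and in no other sample
  filter(sum(n_present == n_samples) == 1,
         sum(n_present == 0) == n() - 1) %>%
  ungroup() %>%
  distinct(Gene) %>%
  pull(Gene)

# back to the original wide layout
result <- df[df$Gene %in% keep, ]
```

Because the cluster label is joined onto the long table by sample name, it describes the samples rather than the genes, which addresses the concern raised in the reply above.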

score 2 · Accepted Answer · 2023-04-12

If you just have two clusters, you can also run a summarization function first:

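# simulate data (the seed is arbitrary and only makes the example reproducible)
set.seed(42)
binary_matrix <- matrix(rbinom(1e4, 1, 0.6), ncol = 10)
colnames(binary_matrix) <- paste0("sample", LETTERS[1:10])
rownames(binary_matrix) <- paste0("gene", 1:1000)

# TRUE if a gene is expressed in every cluster A sample (columns 1-5)
expressed_clusterA <- apply(binary_matrix[, 1:5], 1, all)
# TRUE if a gene is expressed in at least one cluster B sample (columns 6-10)
expressed_clusterB <- apply(binary_matrix[, 6:10], 1, any)

# get those genes that are expressed in all A samples, but in no B sample
# (drop = FALSE keeps the matrix shape when only one gene matches)
binary_matrix[expressed_clusterA & !expressed_clusterB, , drop = FALSE]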
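
The snippet above only returns genes specific to the first cluster. For the data in the question, where genes specific to either cluster are wanted, the same idea might look like the sketch below. The matrix `m` is a hand-typed subset of the question's table, and `rowSums()` is swapped in for `apply(..., 1, all)` and `apply(..., 1, any)`, which is equivalent for 0/1 data; all variable names are illustrative.

```r
# Hand-typed subset of the question's matrix, with genes as row names
m <- matrix(
  c(1, 0, 0, 1, 0, 0, 0,   # gene1
    0, 0, 0, 0, 1, 1, 0,   # gene2
    1, 1, 1, 1, 0, 0, 0,   # gene5
    0, 0, 0, 0, 0, 0, 0,   # gene9
    0, 0, 0, 0, 1, 1, 1),  # gene11
  ncol = 7, byrow = TRUE,
  dimnames = list(c("gene1", "gene2", "gene5", "gene9", "gene11"),
                  paste0("sample", LETTERS[1:7]))
)

cluster1 <- c("sampleA", "sampleB", "sampleC", "sampleD")
cluster2 <- c("sampleE", "sampleF", "sampleG")

# number of samples per cluster in which each gene is present
n1 <- rowSums(m[, cluster1, drop = FALSE])
n2 <- rowSums(m[, cluster2, drop = FALSE])

# present in every sample of one cluster and in none of the other
only_cluster1 <- n1 == length(cluster1) & n2 == 0
only_cluster2 <- n2 == length(cluster2) & n1 == 0

result <- m[only_cluster1 | only_cluster2, , drop = FALSE]
```

Selecting columns by sample name rather than by position keeps the filter valid if the column order in the file changes.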

You can apply arbitrary custom functions this way, not only any() and all(). For example,

apply(binary_matrix[,1:5],1,function(x){all(abs(diff(x))==1)})

finds genes with alternating expression within cluster A (that is, either 0,1,0,1,0 or 1,0,1,0,1).

In this way you can flexibly configure the search patterns you are looking for. Use sum() for conditions like "expressed in at least 4 out of 5", etc.
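
A threshold condition based on sum() could be sketched like this. The data are simulated again so the block runs on its own; the seed and the cut-offs (at least 4 of 5 in cluster A, at most 1 of 5 in cluster B) are arbitrary choices for illustration, not values from the question.

```r
# simulate a 1000 x 10 binary matrix (seed chosen arbitrarily)
set.seed(1)
binary_matrix <- matrix(rbinom(1e4, 1, 0.6), ncol = 10)
colnames(binary_matrix) <- paste0("sample", LETTERS[1:10])
rownames(binary_matrix) <- paste0("gene", 1:1000)

# expressed in at least 4 of the 5 cluster A samples
mostly_A <- apply(binary_matrix[, 1:5], 1, function(x) sum(x) >= 4)
# expressed in at most 1 of the 5 cluster B samples
rarely_B <- apply(binary_matrix[, 6:10], 1, function(x) sum(x) <= 1)

hits <- binary_matrix[mostly_A & rarely_B, , drop = FALSE]
```

Relaxing all() and any() to counts like this tolerates a single discordant sample per cluster, which may be useful when the presence/absence calls are noisy.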