Filter columns by group and condition
19 months ago
Sissi ▴ 60

Hi there,

I have a kind of easy task but still can't figure it out. I have a csv binary matrix, with genes as rows and samples as columns, like this:

Gene sampleA sampleB sampleC sampleD sampleE sampleF sampleG
gene1 1 0 0 1 0 0 0
gene2 0 0 0 0 1 1 0
gene3 0 0 0 0 0 0 1
gene4 0 1 0 0 0 0 0
gene5 1 1 1 1 0 0 0
gene6 1 1 1 1 0 0 0
gene8 0 0 0 0 0 0 1
gene9 0 0 0 0 0 0 0
gene10 1 0 0 1 0 0 0
gene11 0 0 0 0 1 1 1
gene12 0 0 0 0 1 1 1
gene13 0 0 0 0 0 0 0
gene14 0 0 0 0 0 1 0
gene16 1 0 0 0 0 0 0
gene17 1 0 0 1 0 0 0
gene18 1 0 0 1 0 0 0
gene19 1 0 0 1 0 0 0
gene20 1 0 0 1 0 0 0

The samples belong to specific clusters, like:

cluster1 <- c("sampleA", "sampleB", "sampleC", "sampleD")
cluster2 <- c("sampleE", "sampleF", "sampleG")

I would like to filter the matrix down to the genes that are present in only one cluster, like this:

Gene sampleA sampleB sampleC sampleD sampleE sampleF sampleG
gene5 1 1 1 1 0 0 0
gene6 1 1 1 1 0 0 0
gene11 0 0 0 0 1 1 1
gene12 0 0 0 0 1 1 1

That way I can see which genes are present in only one of the two clusters.

Is there any easy way with R or bash?

Thanks


Add a column for cluster, then use group by and sum across all but the gene and cluster columns. That'd be the general approach I'd take; you should be able to figure out the specific dplyr functions with that outline in mind.


Hi Ram, thanks. But then the cluster column would refer to the gene column and not to the samples.


You need to convert your data to long format first using e.g. pivot_longer from tidyr.
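A minimal sketch of the long-format approach described above, assuming dplyr and tidyr are installed. The data frame `df` holds a few rows copied from the question, and the lookup table `sample_cluster` and the column names `present`, `n_full` and `n_empty` are hypothetical names chosen for illustration:

```r
library(dplyr)
library(tidyr)

# A few rows from the question's matrix (genes as rows, samples as columns)
df <- data.frame(
  Gene    = c("gene1", "gene5", "gene11", "gene14"),
  sampleA = c(1, 1, 0, 0),
  sampleB = c(0, 1, 0, 0),
  sampleC = c(0, 1, 0, 0),
  sampleD = c(1, 1, 0, 0),
  sampleE = c(0, 0, 1, 0),
  sampleF = c(0, 0, 1, 1),
  sampleG = c(0, 0, 1, 0)
)

# Hypothetical lookup table: which sample belongs to which cluster
sample_cluster <- data.frame(
  sample  = paste0("sample", LETTERS[1:7]),
  cluster = rep(c("cluster1", "cluster2"), times = c(4, 3))
)

# Long format: one row per gene/sample pair, so the cluster label
# attaches to the sample rather than to the gene
long <- df %>%
  pivot_longer(-Gene, names_to = "sample", values_to = "present") %>%
  left_join(sample_cluster, by = "sample")

# Count presence per gene and cluster
per_cluster <- long %>%
  group_by(Gene, cluster) %>%
  summarise(n_present = sum(present), n_samples = n(), .groups = "drop")

# Keep genes present in every sample of one cluster and absent from the other
specific_genes <- per_cluster %>%
  group_by(Gene) %>%
  summarise(
    n_full  = sum(n_present == n_samples),
    n_empty = sum(n_present == 0)
  ) %>%
  filter(n_full == 1, n_empty == 1) %>%
  pull(Gene)

# Filter the original wide table, preserving its row order
result <- df[df$Gene %in% specific_genes, ]
print(result)
```

The `n_full == 1, n_empty == 1` filter is written for exactly two clusters; with more clusters the condition on `n_empty` would need to be adapted.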

19 months ago

If you just have two clusters, you can also run a summarization function first:

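# simulate data: 1000 genes (rows) x 10 samples (columns)
binary_matrix <- matrix(rbinom(1e4, 1, 0.6), ncol = 10)
colnames(binary_matrix) <- paste0("sample", LETTERS[1:10])
rownames(binary_matrix) <- paste0("gene", 1:1000)

# TRUE for genes expressed in every cluster A sample (columns 1-5)
expressed_clusterA <- apply(binary_matrix[, 1:5], 1, all)
# TRUE for genes expressed in at least one cluster B sample (columns 6-10)
expressed_clusterB <- apply(binary_matrix[, 6:10], 1, any)

# get those genes that are expressed in all A samples, but in no B sample
# (drop = FALSE keeps the result a matrix even if only one gene matches)
binary_matrix[expressed_clusterA & !expressed_clusterB, , drop = FALSE]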

You can apply arbitrary and custom functions like this, not only any(), all() etc.

apply(binary_matrix[,1:5],1,function(x){all(abs(diff(x))==1)})

For example, this finds genes with alternating expression within cluster A (that is, either 0,1,0,1,0 or 1,0,1,0,1).

Like so, you can flexibly configure the search patterns you are looking for. Use sum() for conditions like "expressed in at least 4 out of 5" etc.
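As a rough sketch of how this could look on the matrix from the question, the following base R snippet selects columns by sample name instead of by position, checks both directions (only cluster1 or only cluster2), and uses row sums for an "at least k samples" condition. The matrix `m` contains only a few rows copied from the question, and `min_present` is a hypothetical threshold picked for illustration:

```r
# A few rows from the question's matrix (genes as rows, samples as columns)
m <- matrix(
  c(1, 0, 0, 1, 0, 0, 0,   # gene1
    1, 1, 1, 1, 0, 0, 0,   # gene5
    0, 0, 0, 0, 1, 1, 1,   # gene11
    0, 0, 0, 0, 0, 1, 0,   # gene14
    0, 0, 0, 0, 0, 0, 0),  # gene9
  ncol = 7, byrow = TRUE,
  dimnames = list(
    c("gene1", "gene5", "gene11", "gene14", "gene9"),
    paste0("sample", LETTERS[1:7])
  )
)

cluster1 <- c("sampleA", "sampleB", "sampleC", "sampleD")
cluster2 <- c("sampleE", "sampleF", "sampleG")

# Number of samples per cluster in which each gene is present
n1 <- rowSums(m[, cluster1, drop = FALSE])
n2 <- rowSums(m[, cluster2, drop = FALSE])

# Present in every sample of one cluster and in no sample of the other
only_cluster1 <- n1 == length(cluster1) & n2 == 0
only_cluster2 <- n2 == length(cluster2) & n1 == 0
specific <- m[only_cluster1 | only_cluster2, , drop = FALSE]
print(specific)

# Relaxed condition: present in at least `min_present` cluster1 samples,
# absent from cluster2 (the threshold is an assumption, adjust as needed)
min_present <- 3
mostly_cluster1 <- m[n1 >= min_present & n2 == 0, , drop = FALSE]
print(mostly_cluster1)
```

Indexing by column name avoids relying on the samples of a cluster sitting next to each other in the file.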


That perfectly worked. Thank you!
