Hi there,
I have what should be an easy task, but I still can't figure it out. I have a binary matrix in CSV format, with genes as rows and samples as columns, like this:
Gene | sampleA | sampleB | sampleC | sampleD | sampleE | sampleF | sampleG |
---|---|---|---|---|---|---|---|
gene1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
gene2 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
gene3 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
gene4 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
gene5 | 1 | 1 | 1 | 1 | 0 | 0 | 0 |
gene6 | 1 | 1 | 1 | 1 | 0 | 0 | 0 |
gene8 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
gene9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
gene10 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
gene11 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |
gene12 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |
gene13 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
gene14 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
gene16 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
gene17 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
gene18 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
gene19 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
gene20 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
The samples belong to specific clusters, like:
cluster1 = c("sampleA", "sampleB", "sampleC", "sampleD")
cluster2 = c("sampleE", "sampleF", "sampleG")
I would like to subset/filter the rows so that only genes present in just one cluster remain, like this:
Gene | sampleA | sampleB | sampleC | sampleD | sampleE | sampleF | sampleG |
---|---|---|---|---|---|---|---|
gene5 | 1 | 1 | 1 | 1 | 0 | 0 | 0 |
gene6 | 1 | 1 | 1 | 1 | 0 | 0 | 0 |
gene11 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |
gene12 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |
That way I can see which genes are present in only one of the two clusters.
Is there an easy way to do this with R or bash?
Thanks
Add a column for cluster, then group by it and sum across all columns except the gene and cluster columns. That's the general approach I'd take; you should be able to figure out the specific dplyr functions with that outline in mind.
Hi Ram, thanks. But then the cluster column would refer to the genes (rows) and not to the samples.
You need to convert your data to long format first, e.g. using `pivot_longer` from `tidyr`.
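A minimal sketch of how the long-format approach could look, assuming dplyr and tidyr are installed. The object names (`mat`, `clusters`, `keep`, `result`) are made up for illustration, and the toy data is only a subset of the matrix in the question. Going by the desired output, the filter assumes a gene should be present in every sample of one cluster and absent from every sample of the other; that reading is an assumption and may need adjusting.

```r
library(dplyr)
library(tidyr)

# Toy subset of the matrix from the question (5 of the genes).
# In practice this would come from something like read.csv() on your file.
mat <- data.frame(
  Gene    = c("gene1", "gene5", "gene9", "gene11", "gene14"),
  sampleA = c(1, 1, 0, 0, 0),
  sampleB = c(0, 1, 0, 0, 0),
  sampleC = c(0, 1, 0, 0, 0),
  sampleD = c(1, 1, 0, 0, 0),
  sampleE = c(0, 0, 0, 1, 0),
  sampleF = c(0, 0, 0, 1, 1),
  sampleG = c(0, 0, 0, 1, 0)
)

# Sample-to-cluster lookup table, built from the two clusters in the question.
clusters <- data.frame(
  sample  = c("sampleA", "sampleB", "sampleC", "sampleD",
              "sampleE", "sampleF", "sampleG"),
  cluster = rep(c("cluster1", "cluster2"), times = c(4, 3))
)

keep <- mat %>%
  # Long format: one row per gene/sample pair.
  pivot_longer(-Gene, names_to = "sample", values_to = "present") %>%
  # Now the cluster label attaches to the sample, not to the gene.
  left_join(clusters, by = "sample") %>%
  # Count, per gene and cluster, how many samples carry the gene.
  group_by(Gene, cluster) %>%
  summarise(n_present = sum(present), n_samples = n(), .groups = "drop") %>%
  group_by(Gene) %>%
  # Assumed criterion: all samples of one cluster, no sample of the other.
  filter(sum(n_present == n_samples) == 1, sum(n_present == 0) == 1) %>%
  ungroup() %>%
  distinct(Gene) %>%
  pull(Gene)

# Subset the original wide matrix, keeping its row order.
result <- mat[mat$Gene %in% keep, ]
print(result)
```

If "present in only one cluster" should instead mean at least one sample in one cluster and none in the other, replacing the `filter()` condition with `sum(n_present > 0) == 1` would be one way to relax it.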