Hi everyone,
I have a question about R.
How can I convert this
              Protein1  Protein2  Protein3
chr1_1564     1         0         0
chr3_9087     0         1         1
chr4_877671   1         1         0
chr9_90988    0         1         1
chr11_87676   1         0         0
chrX_1546     0         1         1
to this
           Protein1  Protein2  Protein3
Protein1   3         1         0
Protein2   1         4         3
Protein3   0         3         3
using R-based functions that can do it quickly?
Further explanation, and relevance to biology and bioinformatics:
Rows are genomic sites and columns are protein names. This is ChIP-seq data for some of my proteins: I checked their occupancy in genomic bins and used the chromosome name and start position of each bin as the interval ID. A 1 or 0 indicates the presence or absence of the protein at that genomic site. The target matrix counts, for each pair of proteins, the number of sites where both are present, so the diagonal holds the total number of sites bound by each protein.
I have used a for loop, but it takes a long time on large data. Here is my current script, which is very slow for data with 900,000 rows and 500 columns (900,000 genomic sites and 500 proteins):
mydf <- data.frame(Protein1= c(1,0,1,0,1,0), Protein2=c(0,1,1,1,0,1), Protein3=c(0,1,0,1,0,1))
mydf <- as.matrix(mydf)
# Create empty matrix to store data
converted_mat <- matrix(0, nrow = ncol(mydf), ncol = ncol(mydf))
rownames(converted_mat) <- colnames(mydf)
colnames(converted_mat) <- colnames(mydf)
# For every pair of proteins, count the sites where both are present
for (i in seq_len(ncol(mydf))) {
  for (j in seq_len(ncol(mydf))) {
    converted_mat[i, j] <- sum(mydf[, i] == 1 & mydf[, j] == 1)
  }
}
Any suggestions?
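One possible direction, offered as a minimal sketch rather than a definitive answer: because the data are only 0s and 1s, counting the sites where two proteins are both present is the same as taking the dot product of their columns, so the whole pairwise table is the matrix product t(X) %*% X. Base R's crossprod() computes exactly that product in one call. The names mat and cooc below are illustrative and not part of the original script, and the sketch assumes the matrix contains only 0 and 1 with no missing values.

```r
# Toy data from the question: rows are genomic sites, columns are proteins
mydf <- data.frame(
  Protein1 = c(1, 0, 1, 0, 1, 0),
  Protein2 = c(0, 1, 1, 1, 0, 1),
  Protein3 = c(0, 1, 0, 1, 0, 1)
)

# crossprod() needs a numeric matrix, not a data frame
mat <- as.matrix(mydf)

# crossprod(mat) is t(mat) %*% mat: entry [i, j] is the sum over sites of
# mat[, i] * mat[, j], i.e. the number of sites where both proteins are 1.
# Assumption: mat holds only 0/1 values and no NAs.
cooc <- crossprod(mat)

print(cooc)
```

On the toy data this should reproduce the 3 x 3 table shown in the question, with row and column names taken from the protein columns. For 900,000 x 500 input, a dense numeric matrix takes several gigabytes of memory; if most entries are 0, a sparse matrix from the Matrix package (which also provides a crossprod method) may be worth trying, though that has not been checked on data of this size here.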
How is this related to bioinformatics? If it is related, please add necessary context. If not, please delete this question and consult StackOverflow.
It is a bioinformatics question about the binding patterns of proteins across genomic regions.
If you add more information to your question, we may be able to provide better contextual advice for your actual end goal.
Please edit your post and add as much biological context as it takes for your post to make sense. If not, the post will be removed as off-topic.
The detailed explanation has been added to the post, along with biological context. If I am still missing anything, let me know.