Hi everyone,
I am trying to read two .bed files in R, build a [genes x cells] matrix from each, and then aggregate mat1.counts and mat2.counts into one matrix. The column names (cells) are exactly the same in both files, but the row names (genes) are not, although some genes are common to both files. Each file also contains some duplicated genes. What I have tried is:
mat1 <- read.table(file = "file1.bed", sep = "\t", as.is = c(4,7), header = FALSE)
mat2 <- read.table(file = "file2.bed", sep = "\t", as.is = c(4,7), header = FALSE)
atac <- read.table('chromatin_counts.tsv', sep = '\t', header = TRUE, as.is = TRUE)
barcodes <- colnames(atac)
library(rliger)
mat1.counts <- makeFeatureMatrix(mat1, barcodes)
mat2.counts <- makeFeatureMatrix(mat2, barcodes)
mat1.counts <- mat1.counts[order(rownames(mat1.counts)),]
mat2.counts <- mat2.counts[order(rownames(mat2.counts)),]
# final_mat = mat1.counts + mat2.counts # commented out because of size mismatch error
- Size of mat1.counts: 1,102,170 genes x 1,047 cells
- Size of mat2.counts: 50,170 genes x 1,047 cells
I tried converting them to data frames so I could join them into one data frame and sum up the repeated genes, but my system crashed because of the size of the data frame. Is there another way to sum up and aggregate the two matrices mat1.counts and mat2.counts to get final_mat?
- I want to sum up the values of duplicated genes (rows) within mat1.counts and within mat2.counts, separately.
- I want to join mat1.counts and mat2.counts together and sum up the genes they have in common.
I appreciate any recommendations!
Would merging the bedfiles upfront be possible? I am thinking of something like
bedtools merge -c 4,5,6,7 -o sum,sum,sum,sum
or something similar...
Thank you for your comment, @Matthias Zepper. I will give it a try and update here if it works. Thanks again!
It looks like bedtools merge combines the regions defined by the first three columns of my data, which is not what I want in my case.
Always post some example data.
If the genes are in the first column, the following will print the duplicated genes (sort first, because uniq only detects adjacent duplicates):
cut -f1 <input> | sort | uniq -d
What do you mean by "sum up" here?
Again, a few example records from each file would have helped. Joining on column 1 (genes) would give the genes common to both datasets (the intersection). What do you mean by summing up the intersected genes?
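If "sum up" means adding together the counts of rows that share a gene name, the aggregation can probably stay in R without going through a data frame. The sketch below assumes that mat1.counts and mat2.counts are, or can be coerced to, sparse matrices from the Matrix package, and that their columns are in the same order. The helper names sum_duplicate_rows and combine_counts are made up for illustration, and m1 and m2 are toy data rather than records from the real files.

```r
library(Matrix)

# Hypothetical helper: add up all rows that share a row name.
# Multiplying by a sparse indicator matrix keeps the result sparse.
sum_duplicate_rows <- function(m) {
  m <- as(m, "CsparseMatrix")
  genes <- unique(rownames(m))
  # One row per unique gene, one column per original row of m
  grp <- sparseMatrix(
    i = match(rownames(m), genes),
    j = seq_len(nrow(m)),
    x = 1,
    dims = c(length(genes), nrow(m)),
    dimnames = list(genes, NULL)
  )
  grp %*% m
}

# Hypothetical helper: stack two matrices with identical columns and
# sum the genes that appear in both.
combine_counts <- function(a, b) {
  stopifnot(identical(colnames(a), colnames(b)))
  stacked <- rbind(as(a, "CsparseMatrix"), as(b, "CsparseMatrix"))
  sum_duplicate_rows(stacked)
}

# Toy data: geneA is duplicated within m1, geneB is shared by m1 and m2
cells <- c("cell1", "cell2")
m1 <- Matrix(c(1, 0, 2, 0, 3, 0), nrow = 3, sparse = TRUE,
             dimnames = list(c("geneA", "geneB", "geneA"), cells))
m2 <- Matrix(c(5, 0, 1, 4), nrow = 2, sparse = TRUE,
             dimnames = list(c("geneB", "geneC"), cells))

m1.dedup <- sum_duplicate_rows(m1)   # duplicates summed within file 1
m2.dedup <- sum_duplicate_rows(m2)   # duplicates summed within file 2
final_mat <- combine_counts(m1.dedup, m2.dedup)
print(final_mat)
```

Because addition is associative, calling combine_counts on the two original matrices directly should give the same final_mat; the separate per-file step only matters if the de-duplicated matrices are also needed on their own.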