Question

Join two .bed files into one RNA-seq matrix and sum up duplicate rows

0

2.5 years ago

el24 ▴ 40

Hi everyone,

I am trying to read two .bed files in R, build a [genes x cells] matrix from each (mat1.counts and mat2.counts), and then aggregate the two matrices into one. The column names (cells) are exactly the same in both files, but the row names (genes) are not, although some genes are common to both files. There are also duplicate genes within each file. This is what I have tried:

mat1 <- read.table(file = "file1.bed", sep = "\t", as.is = c(4,7), header = FALSE)
mat2 <- read.table(file = "file2.bed", sep = "\t", as.is = c(4,7), header = FALSE)
atac <- read.table('chromatin_counts.tsv', sep = '\t', header = TRUE, as.is = TRUE)
barcodes <- colnames(atac)
library(rliger)
mat1.counts <- makeFeatureMatrix(mat1, barcodes)
mat2.counts <- makeFeatureMatrix(mat2, barcodes)

mat1.counts <- mat1.counts[order(rownames(mat1.counts)),]
mat2.counts <- mat2.counts[order(rownames(mat2.counts)),]
# final_mat = mat1.counts + mat2.counts # commented out because of size mismatch error

Size of mat1.counts: 1,102,170 genes x 1,047 cells
Size of mat2.counts: 50,170 genes x 1,047 cells

I tried converting them to data frames so I could join them into one data frame and sum up the repeated genes, but my system crashed because the data frame was too large. Is there another way to sum up and aggregate mat1.counts and mat2.counts to get final_mat?

I want to sum up the duplicated gene values (rows) in each file mat1.counts and mat2.counts, separately.
I want to join mat1.counts and mat2.counts together and sum up the intersected genes.

I appreciate any recommendations!

big_data matrix bed RNA-seq R • 984 views

ADD COMMENT • link updated 2.5 years ago by cpad0112 21k • written 2.5 years ago by el24 ▴ 40

1

Would merging the bedfiles upfront be possible? I am thinking of something like bedtools merge -c 4,5,6,7 -o sum,sum,sum,sum or something similar...

ADD REPLY • link 2.5 years ago by Matthias Zepper 5.0k

0

Thank you for your comment @Matthias Zepper. I'll give it a try and update here if it works. Thanks again!

It looks like bedtools merge combines the regions defined by the first three columns of my data, which is not what I want in my case.

ADD REPLY • link 2.5 years ago by el24 ▴ 40

0

Always post some example data.

I want to sum up the duplicated gene values (rows) in each file mat1.counts and mat2.counts, separately.

If the genes are in the first column, cut -f1 <input> | sort | uniq -d will print the duplicated genes (uniq only compares adjacent lines, so the input has to be sorted first). What do you mean by "sum up" here?
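If "sum up" means adding together the counts of all rows that share a gene name, a minimal R sketch could look like the one below. It assumes the count matrix is (or can be converted to) a sparse matrix from the Matrix package and that its row names are the gene names. The helper name sum_duplicate_rows and the toy data are made up for illustration and are not part of rliger.

```r
library(Matrix)

# Hypothetical helper: collapse rows that share a row name by summing them.
# Everything stays sparse, so no dense data frame is ever created.
sum_duplicate_rows <- function(m) {
  genes <- unique(rownames(m))
  # Indicator matrix: one row per unique gene, one column per original row.
  ind <- sparseMatrix(
    i = match(rownames(m), genes),
    j = seq_len(nrow(m)),
    x = 1,
    dims = c(length(genes), nrow(m)),
    dimnames = list(genes, NULL)
  )
  # Multiplying adds up all original rows that map to the same gene.
  ind %*% m
}

# Toy data (invented): gene "A" appears twice.
toy <- Matrix(c(1, 0,
                0, 3,
                4, 1),
              nrow = 3, byrow = TRUE, sparse = TRUE,
              dimnames = list(c("A", "B", "A"), c("cell1", "cell2")))

collapsed <- sum_duplicate_rows(toy)
print(collapsed)  # row "A" holds 5 and 1, row "B" holds 0 and 3
```

The result keeps genes in order of first appearance, so it can be re-sorted with order(rownames(...)) afterwards if needed.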

I want to join mat1.counts and mat2.counts together and sum up the intersected genes.

Again, a few example records from each file would have helped. Join on column 1 (genes). This would give the genes common to both datasets (the intersection). What do you mean by summing up the intersected genes?
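If the goal is a single matrix over the union of genes, with the counts of shared genes added together, one possible sketch in R is to stack the two matrices and collapse rows by gene name. This assumes both matrices are sparse Matrix objects with identical column names in the same order. The function name combine_and_sum and the toy matrices are hypothetical, not taken from the question.

```r
library(Matrix)

# Hypothetical helper: stack two gene-by-cell matrices and sum rows that
# share a gene name, both within and across the inputs.
combine_and_sum <- function(a, b) {
  stopifnot(identical(colnames(a), colnames(b)))
  all_genes <- c(rownames(a), rownames(b))
  stacked <- rbind(a, b)
  genes <- unique(all_genes)
  # Indicator matrix mapping each stacked row to its unique gene.
  ind <- sparseMatrix(
    i = match(all_genes, genes),
    j = seq_along(all_genes),
    x = 1,
    dims = c(length(genes), length(all_genes)),
    dimnames = list(genes, NULL)
  )
  ind %*% stacked
}

# Toy data (invented): gene "B" is present in both matrices.
cells <- c("cell1", "cell2", "cell3")
a <- Matrix(c(1, 0, 0,
              2, 3, 0),
            nrow = 2, byrow = TRUE, sparse = TRUE,
            dimnames = list(c("A", "B"), cells))
b <- Matrix(c(10, 0, 1,
               0, 5, 0),
            nrow = 2, byrow = TRUE, sparse = TRUE,
            dimnames = list(c("B", "C"), cells))

combined <- combine_and_sum(a, b)
print(combined)  # rows A, B, C; row "B" is the sum of both "B" rows
```

Because duplicates inside each input are collapsed by the same step, this covers both requests at once. Genes present in only one file keep their original counts.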

ADD REPLY • link 2.5 years ago by cpad0112 21k