Join two .bed files into one RNA-seq matrix and sum up duplicate rows
2.6 years ago
el24 ▴ 40

Hi everyone,

I am trying to read two .bed files in R, convert each to a [genes x cells] matrix, and then aggregate mat1.counts and mat2.counts into one matrix. The column names (cells) are exactly the same in both files, but the row names (genes) are not: only some genes are common to both files. Each file also contains duplicated genes. What I have tried is:

library(rliger)

# Read the two BED files; columns 4 and 7 are kept as character
mat1 <- read.table(file = "file1.bed", sep = "\t", as.is = c(4,7), header = FALSE)
mat2 <- read.table(file = "file2.bed", sep = "\t", as.is = c(4,7), header = FALSE)

# Cell barcodes are taken from the column names of the chromatin count table
atac <- read.table('chromatin_counts.tsv', sep = '\t', header = TRUE, as.is = TRUE)
barcodes <- colnames(atac)

# Build the [genes x cells] matrices from the tables read above
# (the original post passed data1/data2, which are never defined)
mat1.counts <- makeFeatureMatrix(mat1, barcodes)
mat2.counts <- makeFeatureMatrix(mat2, barcodes)

# Sort rows by gene name
mat1.counts <- mat1.counts[order(rownames(mat1.counts)),]
mat2.counts <- mat2.counts[order(rownames(mat2.counts)),]
# final_mat = mat1.counts + mat2.counts # commented out because of size mismatch error
  • Size of mat1.counts: 1,102,170 genes x 1,047 cells
  • Size of mat2.counts: 50,170 genes x 1,047 cells

I tried converting them to data frames so that I could join them and sum up the repeated genes, but my system crashed because of the size of the resulting data frame. Is there another way to aggregate the two matrices mat1.counts and mat2.counts into final_mat?

  1. I want to sum up the duplicated gene values (rows) in each file mat1.counts and mat2.counts, separately.
  2. I want to join mat1.counts and mat2.counts together and sum up the intersected genes.

I appreciate any recommendations!

big_data matrix bed RNA-seq R • 986 views

Would merging the bedfiles upfront be possible? I am thinking of something like bedtools merge -c 4,5,6,7 -o sum,sum,sum,sum or something similar...


Thank you for your comment @Matthias Zepper. I will give it a try and post an update here if it works. Thanks again!

It looks like bedtools merge combines my regions in the first three columns of my data, but that is not desired in my case.


Always post some example data.

I want to sum up the duplicated gene values (rows) in each file mat1.counts and mat2.counts, separately. 

If the genes are in the first column, cut -f1 <input> | sort | uniq -d will print the duplicated genes (uniq only detects adjacent duplicates, so the input has to be sorted first). What do you mean by "sum up" here?
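If "sum up" means adding the counts of all rows that share a gene name, one possible sketch in R is below. It assumes the count matrix is (or can be coerced to) a sparse matrix from the Matrix package; the toy matrix m and the helper collapse_rows are hypothetical names used only for illustration, not something from the original post. Multiplying by a sparse indicator matrix avoids building a dense data frame.

```r
library(Matrix)

# Toy stand-in for mat1.counts: gene "A" appears twice (assumed data)
m <- Matrix(c(1, 0, 2,
              0, 3, 0,
              4, 0, 0,
              0, 0, 5),
            nrow = 4, byrow = TRUE, sparse = TRUE,
            dimnames = list(c("A", "B", "A", "C"), c("c1", "c2", "c3")))

# Gene names that occur more than once
dup_genes <- unique(rownames(m)[duplicated(rownames(m))])

# Hypothetical helper: sum all rows that share a row name.
# ind is a (unique genes x original rows) 0/1 matrix, so ind %*% mat
# adds up the rows belonging to each gene while staying sparse.
collapse_rows <- function(mat) {
  genes <- unique(rownames(mat))
  ind <- sparseMatrix(i = match(rownames(mat), genes),
                      j = seq_len(nrow(mat)),
                      x = 1,
                      dims = c(length(genes), nrow(mat)),
                      dimnames = list(genes, NULL))
  out <- ind %*% mat
  colnames(out) <- colnames(mat)
  out
}

m_sum <- collapse_rows(m)
```

Here m_sum has one row per unique gene, and the row for "A" holds the column-wise sum of the two original "A" rows.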

I want to join mat1.counts and mat2.counts together and sum up the intersected genes.

Again, a few example records from each file would have helped. Join on column 1 (genes). This would give the genes common to both datasets (the intersection). What do you mean by summing up the intersected genes?
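One reading of the request is: keep every gene from either file, and where a gene occurs in both files (or more than once in one file), add its counts cell by cell. A minimal sketch under that assumption follows. It again assumes sparse matrices from the Matrix package with identical column order; the toy matrices a and b and the helper sum_by_gene are hypothetical and only illustrate the idea.

```r
library(Matrix)

# Hypothetical helper: stack two [genes x cells] sparse matrices and
# sum every set of rows that share a gene name (within or across inputs).
sum_by_gene <- function(a, b) {
  stopifnot(identical(colnames(a), colnames(b)))  # same cells, same order
  stacked   <- rbind(a, b)
  row_genes <- c(rownames(a), rownames(b))
  genes     <- sort(unique(row_genes))
  # 0/1 indicator matrix mapping each stacked row to its gene
  ind <- sparseMatrix(i = match(row_genes, genes),
                      j = seq_along(row_genes),
                      x = 1,
                      dims = c(length(genes), length(row_genes)),
                      dimnames = list(genes, NULL))
  out <- ind %*% stacked
  colnames(out) <- colnames(a)
  out
}

# Toy stand-ins for mat1.counts and mat2.counts (assumed data)
cells <- c("c1", "c2", "c3")
a <- Matrix(c(1, 0, 0,
              0, 2, 0,
              3, 0, 1,
              0, 0, 7),
            nrow = 4, byrow = TRUE, sparse = TRUE,
            dimnames = list(c("A", "B", "A", "D"), cells))
b <- Matrix(c(5, 1, 0,
              0, 4, 0),
            nrow = 2, byrow = TRUE, sparse = TRUE,
            dimnames = list(c("B", "C"), cells))

final_mat <- sum_by_gene(a, b)
```

With the real data this would be called as sum_by_gene(mat1.counts, mat2.counts), provided both objects are sparse and share the same barcodes in the same order. Because everything stays sparse, memory use should scale with the number of non-zero counts rather than with genes x cells.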
