Question

filtering dge data with cpm

0

Entering edit mode

2.6 years ago

biology_inform ▴ 50

Hi I have little bit R problem when I am trying to filtering my DGE data taken from edgeR. My trial data is like below:

     SRR13826419_counts SRR13826420_counts SRR13826421_counts SRR13826422_counts SRR13826423_counts
ENSG00000000003.15   55.4289230485435   48.6042625938295   50.0892015765418   39.9978650657232    45.745056983312
ENSG00000000005.6                   0                  0                  0                  0                  0
ENSG00000000419.14   36.8048049042329   36.6032101015259    35.981799866693   32.9242876425245   39.1630343957851
ENSG00000000457.14     6.799281227288   4.50039468461384   5.54785460499672   6.17330393297335     6.033520705233
ENSG00000000460.17   19.6587913745501   25.8022628584527   20.6063171042735   18.7771327961273   21.0624722800861
ENSG00000000938.13                  0                  0                  0                  0                  0
                   SRR13826424_counts
ENSG00000000003.15   38.0779604481818
ENSG00000000005.6                   0
ENSG00000000419.14   42.1830193812572
ENSG00000000457.14    5.5205964962048
ENSG00000000460.17   14.7215906565461
ENSG00000000938.13                  0

I would like to filter out genes with expression greater than 1 (in CPM) in more than 50% of samples in at least 1 sample group.

My samples are in 2 conditions and they are triplicate.

CPM_matrix <- read.delim("~/Desktop/CPM_matrix.txt", row.names = "gene_id")
labels <- names(CPM_matrix)
CPM_matrix[labels] <- lapply(CPM_matrix[labels] , factor)

CPM_try <- head(CPM_matrix)
group <- factor(c(1,1,1,2,2,2))
countMatrix_groups <- data.frame(label=colnames(CPM_matrix),group)

aa <- noquote(countMatrix_groups[countMatrix_groups$group == '1',]$label[1])
bb <- noquote(countMatrix_groups[countMatrix_groups$group == '1',]$label[2])
cc <- noquote(countMatrix_groups[countMatrix_groups$group == '1',]$label[3])
dd <- noquote(countMatrix_groups[countMatrix_groups$group == '2',]$label[1])
ee <- noquote(countMatrix_groups[countMatrix_groups$group == '2',]$label[2])
ff <- noquote(countMatrix_groups[countMatrix_groups$group == '2',]$label[3])


for (i in 1:nrow(CPM_try)) {

aaa <-  filter(CPM_try, aa < 1 & bb < 1 | aa < 1 & cc < 1 | bb < 1 & cc < 1 | dd < 1 & ee < 1 | dd < 1 & ff < 1 | ee < 1 & ff < 1 )
}

When I apply this code it gives me nothing, there is sth wrong but I couldnt figure out thanks in advance.

expression r gene rna-seq differential • 729 views

ADD COMMENT • link updated 2.6 years ago by Dunois ★ 2.8k • written 2.6 years ago by biology_inform ▴ 50

score 1 · Answer 1 · 2022-04-11

In my experience it's not worth performing such a specific filtering step, as opposed to a more general filter to remove consistently low or non expressed gene before doing differential expression analysis - that's of course assuming that that is your next step here. For some context see: https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#pre-filtering

To do the filtering you've described here I suggest a tidyverse/dplyr approach in which you convert your counts matrix to long-format and use group_by to calculate sample group metrics using summarise for which the following resources will be helpful:

https://tidyr.tidyverse.org/reference/pivot_longer.html
https://dplyr.tidyverse.org/reference/summarise.html#ref-examples